Is your feature request related to a problem?

It would be nice to have more extras_require sections in the setup.py for optional dependencies.

Describe the solution you'd like

I'd like to see more sections added here:

https://github.com/pandas-dev/pandas/blob/master/setup.py#L743

for optional dependencies; e.g. pandas[s3] for Pandas + everything you need to write to S3 (just for the sake of example; this is probably not the right division). I found a similar request here: https://github.com/pandas-dev/pandas/issues/35206. (Sorry if a previous issue exists and I missed it.)

API breaking implications

None

Describe alternatives you've considered

Pin all desired optional dependencies.

Additional context

The idea is that if you're using a tool like pip-compile, you currently have to pin all of these requirements yourself, even if they are only used transitively by pandas. It would be nice to write, say, pandas[s3] in your requirements.in and have s3fs added to the compiled requirements.

Happy to make a PR if folks don't object to the idea.

Comment From: JMBurley

The idea is still relevant, but since the initial request pandas has refactored to more modern dependency management, and the necessary extras_require section now lives in setup.cfg: https://github.com/pandas-dev/pandas/blob/master/setup.cfg#L48.
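For reference, a minimal sketch of what such a setup.cfg entry could look like, assuming a hypothetical s3 extra (the name and version pins are illustrative, not pandas' actual configuration):

```ini
# Sketch only: a hypothetical extras group, not pandas' real setup.cfg.
[options.extras_require]
# "pip install pandas[s3]" would then pull in these packages.
s3 =
    s3fs>=0.4.0
    boto3>=1.22.7
```

With an entry like this, a downstream requirements.in that lists pandas[s3] lets tools such as pip-compile resolve s3fs into the compiled requirements automatically.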

I'll put up a fix to insert the currently recommended dependencies.

Comment From: JMBurley

@mroeschke PR #47336 resolves some of the issues discussed here.

It would also be trivial to extend my code to provide functionality-level extras like pandas[s3], pandas[gcp], etc., if package ranges are agreed upon. I can do that over the next few weeks.

(That said, pandas' S3 support uses s3fs, which brings a hideous set of dependency problems derived from its use of aiobotocore (in s3fs>=0.5.0), which is typically incompatible with any other package using boto3; but that's not for pandas to resolve. I'd want to check with pandas leadership whether s3fs>=0.5 is needed for any reason, and I'd strongly suggest pinning s3fs~=0.4 until that dependency mess is resolved. Summary here.)

Comment From: mroeschke

Thanks for initiating this! I think there may need to be a broader discussion with the core team before we make any PR organizing the optional dependencies. I think the broad steps will be:

  1. Decide on the correct logical grouping of optional dependencies
  2. Establish the groupings in 1 PR.

Part of 1) is also revisiting pandas' optional dependencies. pandas has 25+ optional dependencies, and there has been discussion about whether we could spin off some dependencies elsewhere (https://github.com/pandas-dev/pandas/issues/45433).

A good first step, as you mention, would be to propose dependency groupings and gather some feedback.

Comment From: JMBurley

I agree with the overall path to solving all the extras_require options, and that it would be a nice improvement to pandas.

However, I do want to highlight that this will be quite a large discussion, one that will inevitably reveal a series of packaging problems and lead to this issue dragging on and taking a while to resolve.


s3fs example

s3fs is required for S3 interaction, but (because s3fs>=0.5.0 requires aiobotocore) it is mutually incompatible with moto[boto3] (there are some unresolved issues explaining the aiobotocore/botocore problem).

It isn't actively documented in pandas, but right now the entire stability of pandas is pinned to two s3fs releases from March 2020, because pandas needs moto for tests, moto cannot coexist with s3fs>0.5.0, and pandas requires s3fs>0.4.0.

But pandas has non-professional users who don't care about writing tests (i.e. don't care about moto) and who therefore might prefer pandas to allow a more recent version of s3fs.

So what versions of s3fs would we pin pandas to?
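For concreteness, the two choices could be expressed roughly like this for a hypothetical s3 extra (a sketch only; the exact bounds are assumptions, not agreed pins):

```ini
# Sketch of the two pinning strategies under discussion, not pandas' actual pins.
[options.extras_require]
# Conservative: stay on the 0.4.x series so moto/boto3-based setups keep working.
s3 =
    s3fs>=0.4.0,<0.5.0
# Permissive alternative (commented out): allow newer s3fs and accept the
# aiobotocore/boto3 conflicts for users who never install moto.
# s3 =
#     s3fs>=0.4.0
```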


I'm diving deep into that one s3fs issue to flag that solving packaging in full is not going to be quick, and the fundamental work to understand packaging and versioning is not complete (e.g. environment.yml has "required indirectly may be?" as a comment, and many dependencies have no stated versions, although other requirements/setup files and documentation may suggest versions). There will be multiple issues like s3fs.

I propose incremental improvement. Allowing the recommended installs to be managed dependencies is an easy improvement to pandas packaging and dependencies that we can make right now, and it will save developers some serious headaches. How we solve that will not change in future, and it is independent of the optional dependencies that extend functionality. Perhaps we can bundle that under an optional extra such as performance or recommended?

We can then work towards the larger, laudable goal of allowing functionality groupings (eg. s3, gcp, excel, parquet, sql, html, plots, dask) as extras.

Thoughts?

Groupings

(Based on current functionality enhancements in setup & environment.yml)

| extras_require name | Enforces |
| --- | --- |
| recommended | numexpr>=2.7.1, bottleneck>=1.3.1 |
| s3 | TBC |
| gcp | TBC |
| excel | TBC |
| parquet | TBC |
| sql | TBC |
| html | TBC |
| plots | TBC |
| dask | TBC |

Comment From: bluss

I would wish that pandas not recommend bottleneck until there is a mitigation for #42878.

Comment From: JMBurley

@mroeschke pinging on the above. I would still advocate that the best path is to package what can be trivially packaged right now in order to improve pandas as-is, and then finalise the more difficult packaging decisions later.

Otherwise, I think the packaging below would be in line with current understanding.

Groupings

(Based on current functionality enhancements in setup & environment.yml, the install guide where available, or an open-ended >=current_stable_version when no information is available regarding pandas compatibility)

| extras_require name | Enforces |
| --- | --- |
| recommended | numexpr>=2.7.1; bottleneck>=1.3.1 |
| computation | SciPy>=1.4.1; numba>=0.50.1; xarray>=0.15.1 |
| s3 | s3fs>=0.4.0; boto3>=1.22.7 |
| gcp | gcsfs>=0.6.0; pandas-gbq>=0.14.0 |
| excel | xlrd>=2.0.1; xlwt>=1.3.0; xlsxwriter>=1.2.2; openpyxl>=3.0.3; pyxlsb>=1.0.6 |
| parquet | fastparquet>=0.4.0; pyarrow>=1.0.1 |
| feather | pyarrow>=1.0.1 |
| hdf5 | PyTables>=3.6.1; blosc>=1.20.1 |
| sql | SQLAlchemy>=1.4.0; psycopg2>=2.8.4; pymysql>=0.10.1 |
| html | BeautifulSoup4>=4.8.2; html5lib>=1.1; lxml>=4.5.0 |
| viz | matplotlib>=3.3.2; Jinja2>=2.11; tabulate>=0.8.7 |

Notes:

  • recommended: It seems that Numba should be in this list for performance boost on rolling operations?
  • s3: fsspec is an explicit dependency of s3fs, so pandas can ignore it.
  • hdf5: zlib is not a pip package, so it is ignored in setup. NB: the pandas install guide should recommend zlib>=1.1.4 due to security flaws in prior versions.
  • clipboard: I am ignoring this because OS-dependent packaging is outside the current scope, and such functionality is very rarely relevant to dockerising production code (thus, low priority IMO).

I can incorporate all of the above in my PR to manage optional dependencies via [options.extras_require] if there is consensus that we should follow this path. Thoughts?

Note that there would be an open question on how to ensure that https://pandas.pydata.org/docs/getting_started/install.html is kept current with the actual packaging.

Comment From: mroeschke

Sorry for the delay @JMBurley.

On the surface these groupings seem reasonable; however, I would be wary of making a final call here without more buy-in from the core team @pandas-dev/pandas-core.

This type of enhancement may warrant going through our enhancement proposal process, which is still being developed (https://github.com/pandas-dev/pandas/pull/47444), as this would be a larger change for the library, so let's wait for those developments before moving forward here.

Comment From: bashtage

Would it make sense to have a kitchen sink option, e.g., pip install pandas[all]?
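A minimal sketch of how such an all extra could aggregate the other groups in setup.cfg (group names and pins are purely illustrative):

```ini
# Illustrative only: "all" simply repeats the union of the other groups.
[options.extras_require]
s3 =
    s3fs>=0.4.0
    boto3>=1.22.7
html =
    beautifulsoup4>=4.8.2
    html5lib>=1.1
    lxml>=4.5.0
all =
    s3fs>=0.4.0
    boto3>=1.22.7
    beautifulsoup4>=4.8.2
    html5lib>=1.1
    lxml>=4.5.0
```

pip install "pandas[all]" would then install every optional dependency in one go.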

Comment From: bashtage

  • recommended: It seems that Numba should be in this list for performance boost on rolling operations?

I think numba can be a bit too fiddly on some OS/arch types to be "recommended".

Comment From: jreback

Well, numba is supported on the 3 main architectures, so IMHO that's enough (it's still optional in any event).

Comment From: TomAugspurger

A few notes:

  • I don't think parquet should include both fastparquet and pyarrow; users aren't likely to use both simultaneously. It should either be just pyarrow, since it's the default, or be split into two extras
  • Likewise with sql: people probably aren't using both MySQL and PostgreSQL. Split these into a bunch of sql-* extras?
  • If you're adding an s3 and gcp you should also add an azure that depends on adlfs (and why stop there? fsspec has lots of other implementations)
  • I'm a bit surprised to see Jinja2 and tabulate under "viz". Viz makes me think .plot. I wouldn't necessarily consider .to_markdown() or .table "visualizations". So perhaps split those yet again into plot or plotting and something else for .table?

Comment From: JMBurley

Thanks everyone for comments.

Actions / discussion

  • "I think numba can be a bit too fiddly on some OS/arch types to be 'recommended'."
  • "Numba can be fiddly but supported on the 3 main architectures."

I'll leave numba in computation only for now. That also saves rewriting the install.rst file with a detailed explanation of why it would be in recommended.

  • I don't think parquet should include both fastparquet and pyarrow; users aren't likely to use both simultaneously. It should either be just pyarrow, since it's the default, or be split into two extras

Good catch, thanks. Happy to take just pyarrow as the default. This is probably also more convenient for most users, as other major libraries are more likely to request pyarrow than fastparquet.

  • Likewise with sql: people probably aren't using both MySQL and PostgreSQL. Split these into a bunch of sql-* extras?

I'll keep a general sql extra that ensures users can interact with any (common) SQL implementation, and also split out specific sql-* extras to facilitate tighter packaging.

  • If you're adding an s3 and gcp you should also add an azure that depends on adlfs

Good catch; currently there are no install notes for Azure requirements in the install guide, so that might need an update as well. I'll add azure to [options.extras_require]. @TomAugspurger, are there any other dependencies needed to make adlfs work (that it won't handle itself)? The PyPI documentation looks very dask-focused.

Although adlfs is not imported anywhere in pandas so I'm not sure why/how the azure interaction works? Perhaps it just needs fsspec?

(and why stop there? fsspec has lots of other implementations)

The way I think about packaging is that pandas should be able to manage its dependencies for any meaningful function that an end-user might need, so s3 and gcp are options. Other capabilities of fsspec that are meaningful to the end-user can be added. That said, if fsspec underlies an overwhelming fraction of pandas file I/O, then perhaps it should be a core dependency?

  • I'm a bit surprised to see Jinja2 and tabulate under "viz". Viz makes me think .plot. I wouldn't necessarily consider .to_markdown() or .table "visualizations". So perhaps split those yet again into plot or plotting and something else for .table?

Agreed. I followed the current install guide structure and will split these into plot and table. I'm honestly not sure if Jinja2 is needed for any pandas functionality? (Jinja2 not imported anywhere in pandas).

Updated Groupings for [options.extras_require]

| extras_require name | Enforces |
| --- | --- |
| recommended | numexpr>=2.7.1; bottleneck>=1.3.1; numba>=0.50.1 |
| computation | SciPy>=1.4.1; numba>=0.50.1; xarray>=0.15.1 |
| s3 | s3fs>=0.4.0; boto3>=1.22.7 |
| gcp | gcsfs>=0.6.0; pandas-gbq>=0.14.0 |
| azure | adlfs>=0.6.0 (triage needed?) |
| excel | xlrd>=2.0.1; xlwt>=1.3.0; xlsxwriter>=1.2.2; openpyxl>=3.0.3; pyxlsb>=1.0.6 |
| parquet | pyarrow>=1.0.1 |
| feather | pyarrow>=1.0.1 |
| hdf5 | PyTables>=3.6.1; blosc>=1.20.1 |
| sql-postgresql | SQLAlchemy>=1.4.0; psycopg2>=2.8.4 |
| sql-mysql | SQLAlchemy>=1.4.0; pymysql>=0.10.1 |
| sql-other | SQLAlchemy>=1.4.0 |
| html | BeautifulSoup4>=4.8.2; html5lib>=1.1; lxml>=4.5.0 |
| plot | matplotlib>=3.3.2 |
| table | Jinja2>=2.11; tabulate>=0.8.7 |
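To make the target concrete, a sketch of how a few of these rows could translate into setup.cfg (illustrative only; the final names and pins are whatever the PR settles on):

```ini
# Illustrative translation of a few of the groupings above, not the final configuration.
[options.extras_require]
recommended =
    numexpr>=2.7.1
    bottleneck>=1.3.1
    numba>=0.50.1
sql-postgresql =
    SQLAlchemy>=1.4.0
    psycopg2>=2.8.4
table =
    Jinja2>=2.11
    tabulate>=0.8.7
```

Users would then opt in with, for example, pip install "pandas[recommended,sql-postgresql]".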

Comment From: TomAugspurger

are there any other dependencies needed to make adlfs work (that it won't handle itself)? PyPi documentation looks very dask-focused.

Just adlfs.

Although adlfs is not imported anywhere in pandas so I'm not sure why/how the azure interaction works? Perhaps it just needs fsspec?

It's the same as s3fs / gcsfs. I don't think pandas imports those anywhere outside of tests.

It wouldn't be appropriate to add fsspec as a required dependency.

Comment From: fangchenli

Sorry, I'm late to the discussion. Before doing this, we'd better update all the minimum supported versions (if needed).

Comment From: JMBurley

Before doing this, we'd better update all the minimum supported versions (if needed).

Found the updated list in the 1.5.0 rst and applied it to the PR.

Comment From: JMBurley

@mroeschke @bashtage @TomAugspurger @jreback @fangchenli :

Thanks, comments & feedback incorporated. I've updated the PR https://github.com/pandas-dev/pandas/pull/47336/.

Let me know if further changes are needed and/or who to tag for the PR review.

Comment From: JMBurley

@mroeschke @TomAugspurger @jreback

Sorry to ping again; I want to try to get this into the 1.5 milestone if possible.

The issue-closing PR #47336, which manages the optional dependencies via extras_require, is up and passing tests. Are you happy to review it as-is, or is further discussion needed?

Comment From: JMBurley

@mroeschke @TomAugspurger @jreback

Earlier discussion here suggested that this PR, which adds proper optional packaging to pandas, might be suited to the PDEP process once it was finalised (#47444).

As PDEP is now done, I could make this the first PDEP?

Although:

  1. It is somewhat of a fait accompli, as the solution is already built and has a complete PR & documentation waiting to go in #47336.
  2. I'm not sure if the process for the core team reviewing PDEPs is actually ready to go?

Let me know your thoughts.

Comment From: JMBurley

Bumping this issue. The problem is solved and we are just waiting for approval on the solution, #47336, hopefully in time for version 1.5.0.

Remaining issues & debates are best resolved by having the dependency options available for open-source users to change in minor version updates. I think it would be a significant loss for this to be delayed until pandas v1.6.0, although I'm not sure what else is competing for attention prior to that release.

Comment From: rendner

@JMBurley

I'm honestly not sure if Jinja2 is needed for any pandas functionality? (Jinja2 not imported anywhere in pandas).

jinja2 is required if you use pandas Styler. As you can see here: import_optional_dependency("jinja2", extra="DataFrame.style requires jinja2.")