Pandas netCDF IO support

I'd like to propose read_netcdf, a new Pandas I/O API function for netCDF (Network Common Data Form) files, with a top-level interface similar to the other reader functions. This format is widely used in the scientific community. Furthermore, netCDF4 is implemented on top of the HDF5 library, making this a natural extension to functionality already in the API.

Most likely this would sit on top of the existing Python/numpy interface to netCDF, and because each variable's metadata is stored in the file header, no complicated parsing would be necessary. Multidimensional variables could be handled in a similar manner to HDF.

This may have been brought up in the past, but my search here and on Google didn't turn anything up.

Comment From: jtratner

Interested in contributing to make this happen?

Comment From: jhamman

Certainly willing to contribute on this. My proposal was meant to see if this would be a feature that folks were interested in. I have some project-specific implementations that could be generalized, but the effort may be best served by a discussion on how to approach this. Presumably, following along the lines of the HDF5 implementation would be ideal.

Comment From: jtratner

Is it tabular/columnar data?

If so, can you map netCDF data types directly to numpy dtypes? If so, seems like it's pretty clear what should happen.

Comment From: jreback

@jhamman

netCDF is quite similar to the way HDF5 / PyTables work. I think you could make this very similar, as a separate module. I think you could maybe use HDFStore as a base class.

Comment From: jhamman

@jtratner - The data are N-dimensional homogeneous arrays; the netCDF4 package takes care of loading them into numpy arrays.

@jreback - I'll take a look at replicating the HDFStore features in terms of the netCDF4 package. I actually don't think this should be all that difficult.

I'll take a first stab at it over the next week or so and report back.
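
As a rough illustration of what mapping an N-dimensional homogeneous array onto pandas could look like, the sketch below flattens a 3-D numpy array into a Series with one MultiIndex level per dimension. The array stands in for a variable already loaded by the netCDF4 package. The dimension names, the coordinate values, and the helper `ndarray_to_series` are all hypothetical; none of them come from pandas or netCDF4.

```python
import numpy as np
import pandas as pd


def ndarray_to_series(values, dims, coords, name):
    """Flatten an N-d array into a Series with one index level per dimension.

    Hypothetical helper for illustration only; not a pandas or netCDF4 API.
    """
    values = np.asarray(values)
    if values.ndim != len(dims):
        raise ValueError("need exactly one dimension name per array axis")
    # from_product varies the last level fastest, which matches the
    # C-order layout produced by ravel().
    index = pd.MultiIndex.from_product([coords[d] for d in dims], names=dims)
    return pd.Series(values.ravel(), index=index, name=name)


# Stand-in for a (time, lat, lon) variable read from a netCDF file.
temperature = np.arange(24, dtype="float64").reshape(2, 3, 4)
coords = {
    "time": [0, 1],
    "lat": [10.0, 20.0, 30.0],
    "lon": [100.0, 110.0, 120.0, 130.0],
}
series = ndarray_to_series(
    temperature, ("time", "lat", "lon"), coords, "temperature"
)

# One row per (time, lat, lon) cell; unstack a level to get a 2-D table.
frame = series.unstack("lon")
```

Here `series` has one entry per grid cell, and `frame` is a table with (time, lat) rows and one column per longitude. This only covers a single variable; a file with several variables on different dimensions would need one such object per variable.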

Comment From: jtratner

looking forward to seeing what you come up with.

Comment From: jreback

@jhamman great. Be sure to add the dep (netCDF?) to ci/requirements. You don't have to add it to all of them, but make sure it's in at least 2.7 and 3.3 (you can add it where pytables is tested). You can also add it for different versions, if that matters.

provide plenty of test cases! start with a smaller feature set, you can build/add over time.

pls hook up to travis!

Comment From: ebrevdo

@jhamman You may also be interested in the xray package that @jreback references. It's partly built on pandas and has support for n-dimensional, gridded, datatypes (the initial goal was to naturally represent netcdf3/4, which it does). You may find some of the code in there useful for your read/write functionality. Note that we use the netCDF4 library, which supports netcdf3 + 4, not any HDF5 library.

Comment From: jhamman

Thanks @ebrevdo . This looks very promising.

In fact, it may make sense, as @jreback suggests, to use some of xray's back-ends in pandas.

I'll poke around xray and see how it all works.

Comment From: shoyer

One tricky aspect here is dealing with large variables. It's possible (indeed, somewhat common) to have netCDF files with variables too big to fit into memory. Libraries like netCDF4-python (or xray, for that matter) let you use slicing syntax to load only part of a variable. I may be mistaken, but I don't think pandas has any support (yet) for objects that don't fit entirely into memory.

Comment From: jreback

what is a 'large variable' ? you can easily load part of an object via query or slicing, see docs: http://pandas-docs.github.io/pandas-docs-travis/io.html#io-hdf5

Comment From: shoyer

@jreback You're right, I stand corrected.

Comment From: jhamman

@shoyer and @jreback,

Have there been any other discussions on this recently? I've been using xray to handle most of my netCDF I/O and the xray.Dataset.to_pandas() method has been meeting all my needs in terms of reading netCDF data into Pandas data types. I'm not sure I see much benefit in building a separate netCDF reader for Pandas given the functionality that xray provides.

Comment From: shoyer

To add to what @jhamman writes, NetCDF is a highly structured data format that isn't a great fit for existing pandas data structures, except in some special cases. This would be a little different if we had working nd panels, but you'd still end up reading a netcdf file into a dict of nd panels -- pretty ugly and not very useable. Pandas itself is never going to have the right data structures for netcdf.

This is, of course, the motivation for why I wrote xray. I'm not saying xray is the end-all solution here, but it adds no additional dependencies (beyond itself and a library to read netcdf files) and makes it quite straightforward to read netcdf into pandas, with explicit choices about how to handle edge cases along the way. That's all you could hope for from an implementation in pandas itself.
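
For readers who want to see the shape of that workflow, here is a minimal sketch using xarray, which is the current name of the xray package. It builds a small in-memory Dataset as a stand-in for a file and converts it with `Dataset.to_dataframe()`. The variable name, dimension names, and values are made up for illustration, and the commented-out `open_dataset` call assumes a hypothetical file path.

```python
import numpy as np
import xarray as xr

# With a real file this would be something like:
#     ds = xr.open_dataset("example.nc")   # hypothetical path
# Here a small in-memory Dataset stands in for the file contents.
ds = xr.Dataset(
    {
        "temperature": (
            ("time", "station"),
            np.array([[10.0, 11.0], [12.0, 13.0], [14.0, 15.0]]),
        )
    },
    coords={"time": [0, 1, 2], "station": ["a", "b"]},
)

# Convert to a pandas DataFrame: one row per (time, station) pair,
# one column per data variable.
df = ds.to_dataframe()

# Select a subset first to convert only part of the data.
df_station_a = ds.sel(station="a").to_dataframe()
```

The result `df` is indexed by a (time, station) MultiIndex. Selecting on the Dataset before converting keeps the pandas object small, which matters when the full variable would not fit comfortably in memory.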

Comment From: jorisvandenbossche

I was just thinking, maybe we could add somewhere in the IO docs a reference to this? (saying that if you want to import NetCDF -> use xray, and you can always convert that (or part of the data) to pandas dataframes)

Comment From: jhamman

Thanks @shoyer for elaborating on my point.

@jorisvandenbossche - seems reasonable to point to xray as a way to get netCDF data into pandas.

Comment From: shoyer

@jorisvandenbossche good idea, just made a PR for that (#10027)