open (not in any particular order)
- add support for other dtypes in table columns (datetime,date,unicode)
- Implement variable length strings in a parallel VLArray (and synchronize): https://github.com/PyTables/PyTables/issues/198
- revisit Term syntax - can we do better / more readability?
  - implement `or` in Terms (maybe use a pyparsing-like syntax)
- implement WORMTable
- one big area is to test whether data columns really are slower; it may then make sense to make `data_columns=True` the default (but not necessarily index them), see https://groups.google.com/forum/m/?fromgroups#!topic/pydata/cmw1F3OFJSc - see the end of that thread for some perf tests, so this is probably not a good idea after all (see the sketch after this list for how data columns are used)
- add an `export` function, to export to different PyTables formats (an easy-to-read table for R (partially done), and output a GenericTable)
- provide better access to columns that are data_columns (as we can directly select them) - see `read_column`, and expand this to the entire table (if possible); allows one to avoid selecting all columns in a table (and then reindexing), this works if a `columns` argument is provided to select or inferred from the where
- add out-of-core computation support (see my comment about 1/2 down in #622), this is partially supported now that we have an iterator (#3078)
- add a method to create a table structure (`create_table`?) w/o actually appending, so you don't have to add parms in each call to append
- Support a better mechanism for table splitting - a `Splitter`? so that a user can specify how to split (rather than a dict); then store this object, so the resulting table can be automatically recreated (enable for both Storer and Table objects)
- Optimize table appending, I think we can do better! (GH #3537) makes some improvements
- allow `itemsize='truncate'` to allow subsequent appends to proceed with string truncation (on specific columns)
- allow where in `select_column`, return a properly indexed Series, add an option to include the index (`use_index=True`?)
- Better deal with a very long list as input to a `Term`, by running multiple `or` sub-queries
- Add support for column-oriented tables, dep is `carray`, http://carray.pytables.org/docs/manual/
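A minimal sketch of what using data columns looks like, for reference on the perf question in the list above. The store file name, the columns A/B/C, and the numbers are just examples, and the exact `where` string syntax depends on the pandas version:

```python
import numpy as np
import pandas as pd

# a small frame with a datetime index and three example columns
df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'),
                  index=pd.date_range('2013-01-01', periods=100))

store = pd.HDFStore('store.h5')

# data_columns=True makes every column individually queryable;
# the open question above is whether this should become the default
store.append('df', df, data_columns=True)

# with data columns we can filter on column values, not just the index,
# and restrict the returned columns so nothing is read and thrown away
result = store.select('df', where='B > 0', columns=['A', 'B'])

store.close()
```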
done
- DONE (GH #2401): access store paths via path notation / dot notation (GH #2755)
- DONE (GH #2497): add to docs (GH #2397) the issues about reading/writing concurrently in threads/processes, http://sourceforge.net/mailarchive/message.php?msg_id=30190886
- DONE (GH #2497): support panelnd (GH #2242)
- DONE (GH #2561): Should DataFrames be automagically indexed on 'index' (prob yes), but then should have a flag in append/put, and enable passing of the indexing options
- DONE (GH #2497): Check if create_table_index changes the current index if different options are passed
- DONE (GH #2561): for writing add chunk keyword to select to provide generator like behavior - each call to return the next chunk of data
- DONE (GH #2561): support multi-indexes on tables
  - DONE: real dtype integration is coming in PR #2708 (e.g. even though 0.10.1 will actually read/write float32 columns, you can't really do much with them w/o having them upcasted) - in any event I think HDFStore will accommodate this already, but more testing is needed
- DONE: iterator support in `select`, http://stackoverflow.com/questions/14614512/merging-two-tables-with-millions-of-rows-in-python (GH #3078) - see the sketch after this list
- DONE (GH #3531): support timezones in datelike columns (index should be ok already) (scott?), (GH #2852)
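The iterator support referenced above (GH #3078) roughly allows chunked reads like this; a sketch, reusing the hypothetical 'df' table from the previous example:

```python
import pandas as pd

store = pd.HDFStore('store.h5')

# chunksize makes select return an iterator of DataFrames rather than one
# big frame, so a large table can be processed without loading it all at once
total = 0
for chunk in store.select('df', chunksize=25):
    total += chunk['B'].sum()

store.close()
```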
Comment From: gerigk
What about allowing creation/access of groups by using "/" in the key?
I.e., `store.put('some/path/to/df', df)` would create/access the groups some, path, to and finally df.
Right now I can only save the data on one level within an HDF5 file, although HDF5/PyTables supports access by filesystem-like paths. It would not break anything, since the occurrence of a '/' raises an exception right now.
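A sketch of the behavior being requested, roughly as path-style keys ended up working in later pandas versions (the file name 'nested.h5' and the frame are just examples):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 2), columns=['x', 'y'])

store = pd.HDFStore('nested.h5')

# each '/'-separated component becomes a PyTables group, created on demand
store.put('some/path/to/df', df)

# the same path-style key retrieves the frame
same = store.get('some/path/to/df')

# keys() reports the full paths of everything in the store
print(store.keys())   # ['/some/path/to/df']

store.close()
```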
On Thu, Nov 29, 2012 at 6:20 PM, jreback notifications@github.com wrote:
- add support for other dtypes in table columns (datetime64,datetime,date,unicode)
- support min_itemsize for table columns (currently supported only in indexers) also might be a better way of doing this (e.g. have the info attached to a dataframe, or support a global pandas option to provide a minimum)
- revisit Term syntax - can we do better / more readability?
- implement WORMTable
Comment From: jreback
good idea...shouldn't be too hard to implement
Comment From: scottkidder
Here are things that are most interesting/beneficial to my current workload:
- Full Float32 support & full pandas dtype support
- WORMTable (unsure of implementation or performance gains)
- data_columns is very useful, and I can do more testing to determine how fast/slow they are
- read_column would also be very useful in many instances
I like the way Terms work. Is there support for ORing Terms or other logical operations in the Selection?
I can pick up work on any of these issues, but I would absolutely like to discuss some of the details first.
Comment From: jreback
Scott, send me an email and we can correspond offline: jeff@reback.net
Comment From: alvorithm
Term language: perhaps it makes sense to piggyback on existing syntax. SQL comes to mind, but also XESAM (the whole http://xesam.org site is down at the time of writing, but one can get the gist of it here: http://banshee.fm/support/guide/searching/).
Comment From: alvorithm
It would be nice if attribute access (e.g. `store.df`) could be enabled for all the leaves that have suitable names. This might require a big API overhaul, though (`store.df.append`...).
Comment From: jreback
see #2485, this is actually somewhat easy in HDFStore; the problem is that pandas in general doesn't propagate these attributes. You can easily store/retrieve attributes if you want on the nodes themselves,
something like:
# grab the storer for the 'df' node and stash a custom attribute on it
s = store.get_storer('df')
s.attrs['my_attribute'] = 1
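Reading it back is then just the symmetric lookup, e.g. `store.get_storer('df').attrs['my_attribute']` (the attribute name here is only the example from above).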
Comment From: jreback
sorry...misunderstood your comment (I thought you meant saving attributes)
attribute access on the store is not a big deal, will add to the list
Comment From: alvorithm
Thank you for considering this; dotted access will save my pinky a lot of strain typing `['']` (dead keys b/c I need accents...).
Regarding attributes on DFs: actually this would preempt a number of cases for specialization of DataFrame (see the recent MetaDataFrame PR #2695), and in particular perhaps support the addition of metadata that would facilitate automated merges (foreign keys...).
EDIT: there was a discussion about this topic on the mailing list
Comment From: jreback
see #2755, it was pretty easy to add dotted access, so I did!
Comment From: jreback
@scottkidder did you get a chance to look at issue 13 (#2852)?
Comment From: jreback
dated