Pandas Specifying data_columns for non-selector tables with append_to_multiple

HDFStore.append_to_multiple only lets you specify data_columns for the "selector" table; all other tables created with it have no data columns. Is there a reason it needs to be that way?

I think the append_to_multiple/select_as_multiple workflow would be somewhat more flexible if we didn't view the set of tables being appended/selected as a monolithic grouping that must always be used together, with the "selector" (queryable) table specified up front. For write operations it makes sense to append together to keep them in sync, but for read operations, is there any particular reason why you always need to be querying on the same table?

I am working with some census data that has many columns representing different sorts of data. Some are geographic identifiers like FIPS codes, some are overall population counts, some are detailed population counts for subcategories like racial/age groups, some may be other kinds of data like housing units or income. It would be nice if you could use append_to_multiple to write a row of all the data to all the tables, but then later still be able to query or retrieve some of the data depending on what you need.

For instance, suppose you have a dataset with columns GEO1, GEO2, POP1, POP2, INCOME1, INCOME2. Sometimes I might need the population columns but not income, or maybe the income columns but not the population ones. It would be nice if I could use append_to_multiple to write my dataset to three separate tables, but then do something like select_as_multiple(['GeoTable', 'PopTable'], where=['POP1>1000'], selector='PopTable') or, on the same data, select_as_multiple(['GeoTable', 'IncomeTable'], where=['INCOME1>20000'], selector='IncomeTable'). In other words, I want to specify the selector table at query time, not write time.

I think it would be enough to modify append_to_multiple to allow setting data columns on any table, not just the "selector" table. In fact, there's no need to specify a "selector" table at write time at all (although it could still be done for convenience). This would mean that every one of the sub-tables created by append_to_multiple would be nicely queryable on its own. As far as I can tell, there's nothing in the existing implementations of append/select that would preclude this; they just don't expose any interface for it.
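To illustrate that the existing pieces already seem to support this, here is a minimal sketch of a workaround: write each sub-table with a plain append and its own data_columns, then pick the selector at query time with select_as_multiple. The toy data, the file name, and the `layout` dict are made up for illustration, and it assumes PyTables is installed. Unlike append_to_multiple, nothing here keeps the tables in sync for you; the caller has to append the same rows to every table.

```python
import os
import tempfile

import pandas as pd

# Toy data using the column names from the example above (values are made up).
df = pd.DataFrame(
    {
        "GEO1": ["01001", "01003", "01005", "01007"],
        "GEO2": ["AL", "AL", "AL", "AL"],
        "POP1": [500, 1500, 2500, 800],
        "POP2": [100, 300, 700, 200],
        "INCOME1": [15000, 25000, 18000, 30000],
        "INCOME2": [1000, 2000, 1500, 2500],
    }
)

# Hypothetical layout: table name -> columns stored in that table.
layout = {
    "GeoTable": ["GEO1", "GEO2"],
    "PopTable": ["POP1", "POP2"],
    "IncomeTable": ["INCOME1", "INCOME2"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "census.h5")
    with pd.HDFStore(path, mode="w") as store:
        # One append per table, each with all of its columns as data columns,
        # so every table can later act as the selector.
        for key, cols in layout.items():
            store.append(key, df[cols], data_columns=True)

        # Selector chosen at query time: population columns...
        pop = store.select_as_multiple(
            ["GeoTable", "PopTable"], where="POP1 > 1000", selector="PopTable"
        )
        # ...or income columns, against the same stored data.
        inc = store.select_as_multiple(
            ["GeoTable", "IncomeTable"],
            where="INCOME1 > 20000",
            selector="IncomeTable",
        )

print(pop)
print(inc)
```

If this works as expected, the requested change would mostly be a convenience: append_to_multiple could forward per-table data_columns to the appends it presumably already performs internally.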

For now this would probably stick to allowing just one selector table per query (e.g., in the above example you can query on population columns or income columns, but not both in one query). But in the future it could be possible to allow multi-selector queries by intersecting the resulting row sets of each sub-query.
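The intersection idea can be sketched without touching HDF5 at all. In the snippet below, three in-memory DataFrames sharing an index stand in for the stored sub-tables (the data is made up); each sub-query yields a set of row labels, and the combined result keeps only the rows present in every set. On a real store, HDFStore.select_as_coordinates could presumably supply the per-table row sets instead, but that part is an assumption and not shown here.

```python
import pandas as pd

# In-memory stand-ins for the sub-tables; they share the same row index.
geo = pd.DataFrame({"GEO1": ["01001", "01003", "01005", "01007"]})
pop = pd.DataFrame({"POP1": [500, 1500, 2500, 800]})
income = pd.DataFrame({"INCOME1": [15000, 25000, 18000, 30000]})

# One sub-query per "selector" table, each returning matching row labels.
pop_rows = pop.index[pop["POP1"] > 1000]
income_rows = income.index[income["INCOME1"] > 20000]

# Multi-selector query: keep only rows that satisfy every sub-query.
rows = pop_rows.intersection(income_rows)

# Pull the surviving rows from each table and stitch them side by side.
result = pd.concat([geo.loc[rows], pop.loc[rows], income.loc[rows]], axis=1)
print(result)
```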

Comment From: mroeschke

Thanks for the suggestion, but it appears there hasn't been much appetite for this issue over the years, so closing. Happy to reopen if there's renewed interest.