This only happens with Panels; it works fine for DataFrames. Reading the Panel back from the store gives a TypeError:
In [7]: df = pd.DataFrame({'A': [1, 2], 'B': pd.to_datetime(['2010-01-01', np.nan])})
In [8]: df
Out[8]:
A B
0 1 2010-01-01 00:00:00
1 2 NaT
In [10]: tst = pd.HDFStore('tst.h5')
In [12]: df.to_hdf('tst.h5', 'df')
In [13]: tst.select('df')
Out[13]:
A B
0 1 2010-01-01 00:00:00
1 2 NaT
In [14]: df2 = pd.DataFrame({'A': [1, 2], 'B': pd.to_datetime(['2010-01-01', np.nan])})
In [17]: wp = pd.Panel({'i1': df, 'i2': df2})
In [18]: wp
Out[18]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: i1 to i2
Major_axis axis: 0 to 1
Minor_axis axis: A to B
In [19]: wp.to_hdf(tst, key='wp')
/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_993_gda89834-py2.7-macosx-10.8-x86_64.egg/pandas/io/pytables.py:2310: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->block0_values] [items->['i1', 'i2']]
warnings.warn(ws, PerformanceWarning)
In [20]: tst.select('wp')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-20-da42baeaf1c6> in <module>()
----> 1 tst.select('wp')
/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_993_gda89834-py2.7-macosx-10.8-x86_64.egg/pandas/io/pytables.pyc in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
595
596 return TableIterator(self, func, nrows=s.nrows, start=start, stop=stop,
--> 597 auto_close=auto_close).get_values()
598
599 def select_as_coordinates(
/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_993_gda89834-py2.7-macosx-10.8-x86_64.egg/pandas/io/pytables.pyc in get_values(self)
1225
1226 def get_values(self):
-> 1227 results = self.func(self.start, self.stop)
1228 self.close()
1229 return results
/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_993_gda89834-py2.7-macosx-10.8-x86_64.egg/pandas/io/pytables.pyc in func(_start, _stop)
584 def func(_start, _stop):
585 return s.read(where=where, start=_start, stop=_stop,
--> 586 columns=columns, **kwargs)
587
588 if iterator or chunksize is not None:
/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_993_gda89834-py2.7-macosx-10.8-x86_64.egg/pandas/io/pytables.pyc in read(self, **kwargs)
2517 for i in range(self.nblocks):
2518 blk_items = self.read_index('block%d_items' % i)
-> 2519 values = self.read_array('block%d_values' % i)
2520 blk = make_block(values, blk_items, items)
2521 blocks.append(blk)
/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_993_gda89834-py2.7-macosx-10.8-x86_64.egg/pandas/io/pytables.pyc in read_array(self, key)
2075 import tables
2076 node = getattr(self.group, key)
-> 2077 data = node[:]
2078 attrs = node._v_attrs
2079
/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/tables-3.0.0-py2.7-macosx-10.8-x86_64.egg/tables/vlarray.pyc in __getitem__(self, key)
659 start, stop, step = self._process_range(
660 key.start, key.stop, key.step)
--> 661 return self.read(start, stop, step)
662 # Try with a boolean or point selection
663 elif type(key) in (list, tuple) or isinstance(key, numpy.ndarray):
/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/tables-3.0.0-py2.7-macosx-10.8-x86_64.egg/tables/vlarray.pyc in read(self, start, stop, step)
799 atom = self.atom
800 if not hasattr(atom, 'size'): # it is a pseudo-atom
--> 801 outlistarr = [atom.fromarray(arr) for arr in listarr]
802 else:
803 # Convert the list to the right flavor
/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/tables-3.0.0-py2.7-macosx-10.8-x86_64.egg/tables/atom.pyc in fromarray(self, array)
1149 if array.size == 0:
1150 return None
-> 1151 return cPickle.loads(array.tostring())
1152
1153
TypeError: ('__new__() takes exactly one argument (2 given)', <class 'pandas.tslib.NaTType'>, ('\x00\x01\xff\xff\x00\x00\x00\x00\x00\x00',))
Haven't had a chance to look at what's going on.
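For reference, the tuple in the final TypeError (the class plus a single bytes argument) has the shape of the state that a datetime object hands to pickle. The following is a minimal sketch of that failure mode in plain Python, assuming NaTType behaves like a datetime subclass whose `__new__` accepts no extra arguments. `FakeNaT` is a hypothetical stand-in, not the pandas class, and the real cause may differ.

```python
from datetime import datetime


class FakeNaT(datetime):
    """Hypothetical stand-in: a datetime subclass whose constructor takes no arguments."""

    def __new__(cls):
        # Ignore the usual year/month/day arguments, as a singleton-like type might.
        return datetime.__new__(cls, 1, 1, 1)


value = FakeNaT()

# datetime's pickle support asks to be rebuilt by calling cls(state_bytes).
cls, args = value.__reduce_ex__(2)[:2]

try:
    cls(*args)  # roughly what unpickling attempts
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__, error)
```

The object pickles without complaint, and the mismatch only surfaces on load, which would match the write succeeding and the `select` failing.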
Comment From: jreback
This is not a problem with HDF per se; it is much more an issue with how the Panel itself is created. The blocks are not being separated correctly into Datetime and so on. They are getting mashed together into an ObjectBlock, which holds non-creatable types (like NaT).
(Pdb) df
A B
0 1 2010-01-01 00:00:00
1 2 NaT
(Pdb) df._data
BlockManager
Items: Index([u'A', u'B'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
IntBlock: [A], 1 x 2, dtype: int64
DatetimeBlock: [B], 1 x 2, dtype: datetime64[ns]
(Pdb) Panel({ 'i1' : df, 'i2' : df})._data
BlockManager
Items: Index([u'i1', u'i2'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
Axis 2: Index([u'A', u'B'], dtype='object')
ObjectBlock: [i1, i2], 2 x 2 x 2, dtype: object
Comment From: jreback
Building single-dtype Panels and concatenating them constructs the blocks correctly; I think the dict constructor fails when a contained frame has mixed dtypes.
In [16]: df = pd.DataFrame({'A': [1, 2], 'B': pd.to_datetime(['2010-01-01', np.nan])})
In [17]: df
Out[17]:
A B
0 1 2010-01-01 00:00:00
1 2 NaT
In [18]: df.dtypes
Out[18]:
A int64
B datetime64[ns]
dtype: object
In [19]: y = Panel({'a' : df[['A']] })
In [20]: x = Panel({'b' : df[['B']] })
In [21]: y._data
Out[21]:
BlockManager
Items: Index([u'a'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
Axis 2: Index([u'A'], dtype='object')
IntBlock: [a], 1 x 2 x 1, dtype: int64
In [22]: x._data
Out[22]:
BlockManager
Items: Index([u'b'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
Axis 2: Index([u'B'], dtype='object')
DatetimeBlock: [b], 1 x 2 x 1, dtype: datetime64[ns]
In [23]: concat([x,y])
Out[23]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: b to a
Major_axis axis: 0 to 1
Minor_axis axis: A to B
In [24]: concat([x,y])._data
Out[24]:
BlockManager
Items: Index([u'b', u'a'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
Axis 2: Index([u'A', u'B'], dtype='object')
DatetimeBlock: [b], 1 x 2 x 2, dtype: datetime64[ns]
FloatBlock: [a], 1 x 2 x 2, dtype: float64
Comment From: TomAugspurger
@jreback I'm going to take a stab at this today. Do you know offhand what a good approach would be? The problem comes up in Panel._init_dict, where a list of arrays
ipdb> arrays
[array([[1, Timestamp('2010-01-01 00:00:00', tz=None)],
[2, NaT]], dtype=object), array([[1, Timestamp('2010-01-01 00:00:00', tz=None)],
[2, NaT]], dtype=object)]
is created and handed off to create_block_manager_from_arrays. I was thinking about creating the blocks individually instead of dumping all the data into an array. I'll see what happens.
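The idea of building blocks individually can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions: nested dicts stand in for the frames, `None` stands in for NaT, and `group_columns_by_type` is a hypothetical helper, not anything in pandas internals.

```python
from collections import defaultdict
from datetime import datetime


def group_columns_by_type(frames):
    """Group (item, column) pairs by the single value type found in each column.

    frames: {item: {column: [values]}}; None is treated as a missing value.
    Columns with no values or with several types fall back to "object".
    """
    blocks = defaultdict(list)
    for item, columns in frames.items():
        for column, values in columns.items():
            kinds = {type(v).__name__ for v in values if v is not None}
            kind = kinds.pop() if len(kinds) == 1 else "object"
            blocks[kind].append((item, column))
    return dict(blocks)


frame = {"A": [1, 2], "B": [datetime(2010, 1, 1), None]}
result = group_columns_by_type({"i1": frame, "i2": frame})
print(result)
```

Inspecting each column before anything is stacked keeps the type information that is lost once the values sit together in one object array.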
Comment From: jreback
sounds good
Comment From: jreback
@TomAugspurger bumping this... but of course if you'd like to work on it, feel free.
Comment From: TomAugspurger
I ran into some troubles with that approach, but I can't remember what the problem was. I'll look today and report back.
Comment From: TomAugspurger
I'm not having any luck with this. Like I posted above, you've got a list of arrays with mixed types. I'm not sure how we could infer that it actually contains datetime data along that slice.
It also doesn't matter whether df contains any NaTs.
In [20]: dfa
Out[20]:
A B
0 1 2010-01-01
1 2 2010-01-01
[2 rows x 2 columns]
In [21]: wp = pd.Panel({'i1': dfa, 'i2': dfa})
In [22]: wp._data
Out[22]:
BlockManager
Items: Index([u'i1', u'i2'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
Axis 2: Index([u'A', u'B'], dtype='object')
ObjectBlock: [i1, i2], 2 x 2 x 2, dtype: object
The form_blocks function in core/internals isn't able to take this array:
ipdb> p v
array([[1, Timestamp('2010-01-01 00:00:00', tz=None)],
[2, Timestamp('2010-01-01 00:00:00', tz=None)]], dtype=object)
and split it into two blocks (not that I'm saying it should be able to).
Maybe I'm trying to fix this in the wrong area though? Should we just construct it as it is now, and then check whether any objects can be converted before returning? That isn't quite working either, but that may be a bug or simply not implemented:
In [45]: wp._data
Out[45]:
BlockManager
Items: Index([u'i1', u'i2'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
Axis 2: Index([u'A', u'B'], dtype='object')
ObjectBlock: [i1, i2], 2 x 2 x 2, dtype: object # should be split into an Int and Datetime
In [46]: wp.convert_objects(convert_dates=True)._data
Out[46]:
BlockManager
Items: Index([u'i1', u'i2'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
Axis 2: Index([u'A', u'B'], dtype='object')
ObjectBlock: [i1, i2], 2 x 2 x 2, dtype: object # no change
In [47]: wp['i1']._data
Out[47]:
BlockManager
Items: Index([u'A', u'B'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
ObjectBlock: [A, B], 2 x 2, dtype: object # should be int and Datetime
In [48]: wp['i1'].convert_objects(convert_dates=True)._data
Out[48]:
BlockManager
Items: Index([u'A', u'B'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
IntBlock: [A], 1 x 2, dtype: int64
DatetimeBlock: [B], 1 x 2, dtype: datetime64[ns]
So convert_objects works on the underlying DataFrames, but the changes don't get propagated back up to the Panel. Sorry this comment got long.
Comment From: jreback
I am not sure that this is really an issue.
I think orientation really matters here; the example below works just fine. Blocks are oriented with items as the 0th axis, so if ALL of the items are datetime then it will be a DatetimeBlock.
In [28]: dfa = pd.DataFrame({'A': [1, 2], 'B': pd.to_datetime(['2010-01-01', np.nan])})
In [29]: dfb = pd.DataFrame({'A': [3, 4], 'B': pd.to_datetime(['2010-01-02', np.nan])})
In [30]: p = Panel({ 'dfa' : dfa, 'dfb' : dfb }).transpose(2,0,1)
In [31]: p
Out[31]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: A to B
Major_axis axis: dfa to dfb
Minor_axis axis: 0 to 1
In [32]: p['B']
Out[32]:
0 1
dfa 2010-01-01 NaT
dfb 2010-01-02 NaT
[2 rows x 2 columns]
In [33]: p['B'].dtypes
Out[33]:
0 datetime64[ns]
1 datetime64[ns]
dtype: object
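The effect of orientation can be sketched without pandas. In this toy model (nested dicts standing in for the Panel, `None` standing in for NaT, and `swap_items_and_columns` a hypothetical helper), swapping the nesting so that the columns become the items leaves each item holding a single value type.

```python
from datetime import datetime


def swap_items_and_columns(frames):
    """Turn {item: {column: values}} into {column: {item: values}}."""
    swapped = {}
    for item, columns in frames.items():
        for column, values in columns.items():
            swapped.setdefault(column, {})[item] = values
    return swapped


def value_types(rows):
    """Names of the value types found under one item, ignoring missing values."""
    return {type(v).__name__ for values in rows.values() for v in values if v is not None}


frames = {
    "dfa": {"A": [1, 2], "B": [datetime(2010, 1, 1), None]},
    "dfb": {"A": [3, 4], "B": [datetime(2010, 1, 2), None]},
}

before = {item: value_types(rows) for item, rows in frames.items()}
after = {item: value_types(rows) for item, rows in swap_items_and_columns(frames).items()}
print(before)
print(after)
```

Before the swap every item mixes integers and datetimes; afterwards item A holds only integers and item B only datetimes, which is the situation in which a single-dtype block is possible.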
I think the problem may be selection (via loc/iloc) on a non-0th axis that DOES NOT coerce; see the example below. There may need to be a convert_objects step in the indexing routines when ndim > 2 (e.g. this case).
In [40]: p.transpose(2,0,1).loc[:,'B']
Out[40]:
0 1
dfa 2010-01-01 NaT
dfb 2010-01-02 NaT
[2 rows x 2 columns]
In [41]: p.transpose(2,0,1).loc[:,'B'].dtypes
Out[41]:
0 object
1 object
dtype: object
Comment From: jreback
closing as Panels are deprecated