Pandas BUG: Panel.to_hdf silently fails with NaT

This only affects Panels; DataFrames round-trip fine. Writing the Panel succeeds (with a PerformanceWarning), but reading it back raises a TypeError:

In [7]: df = pd.DataFrame({'A': [1, 2], 'B': pd.to_datetime(['2010-01-01', np.nan])})

In [8]: df
Out[8]:
   A                   B
0  1 2010-01-01 00:00:00
1  2                 NaT

In [10]: tst = pd.HDFStore('tst.h5')

In [12]: df.to_hdf('tst.h5', 'df')

In [13]: tst.select('df')
Out[13]:
   A                   B
0  1 2010-01-01 00:00:00
1  2                 NaT

In [14]: df2 = pd.DataFrame({'A': [1, 2], 'B': pd.to_datetime(['2010-01-01', np.nan])})

In [17]: wp = pd.Panel({'i1': df, 'i2': df2})

In [18]: wp
Out[18]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: i1 to i2
Major_axis axis: 0 to 1
Minor_axis axis: A to B

In [19]: wp.to_hdf(tst, key='wp')
/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_993_gda89834-py2.7-macosx-10.8-x86_64.egg/pandas/io/pytables.py:2310: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->block0_values] [items->['i1', 'i2']]

  warnings.warn(ws, PerformanceWarning)

In [20]: tst.select('wp')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-20-da42baeaf1c6> in <module>()
----> 1 tst.select('wp')

/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_993_gda89834-py2.7-macosx-10.8-x86_64.egg/pandas/io/pytables.pyc in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
    595
    596         return TableIterator(self, func, nrows=s.nrows, start=start, stop=stop,
--> 597                              auto_close=auto_close).get_values()
    598
    599     def select_as_coordinates(

/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_993_gda89834-py2.7-macosx-10.8-x86_64.egg/pandas/io/pytables.pyc in get_values(self)
   1225
   1226     def get_values(self):
-> 1227         results = self.func(self.start, self.stop)
   1228         self.close()
   1229         return results

/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_993_gda89834-py2.7-macosx-10.8-x86_64.egg/pandas/io/pytables.pyc in func(_start, _stop)
    584         def func(_start, _stop):
    585             return s.read(where=where, start=_start, stop=_stop,
--> 586                           columns=columns, **kwargs)
    587
    588         if iterator or chunksize is not None:

/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_993_gda89834-py2.7-macosx-10.8-x86_64.egg/pandas/io/pytables.pyc in read(self, **kwargs)
   2517         for i in range(self.nblocks):
   2518             blk_items = self.read_index('block%d_items' % i)
-> 2519             values = self.read_array('block%d_values' % i)
   2520             blk = make_block(values, blk_items, items)
   2521             blocks.append(blk)

/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_993_gda89834-py2.7-macosx-10.8-x86_64.egg/pandas/io/pytables.pyc in read_array(self, key)
   2075         import tables
   2076         node = getattr(self.group, key)
-> 2077         data = node[:]
   2078         attrs = node._v_attrs
   2079

/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/tables-3.0.0-py2.7-macosx-10.8-x86_64.egg/tables/vlarray.pyc in __getitem__(self, key)
    659             start, stop, step = self._process_range(
    660                 key.start, key.stop, key.step)
--> 661             return self.read(start, stop, step)
    662         # Try with a boolean or point selection
    663         elif type(key) in (list, tuple) or isinstance(key, numpy.ndarray):

/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/tables-3.0.0-py2.7-macosx-10.8-x86_64.egg/tables/vlarray.pyc in read(self, start, stop, step)
    799         atom = self.atom
    800         if not hasattr(atom, 'size'):  # it is a pseudo-atom
--> 801             outlistarr = [atom.fromarray(arr) for arr in listarr]
    802         else:
    803             # Convert the list to the right flavor

/Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/tables-3.0.0-py2.7-macosx-10.8-x86_64.egg/tables/atom.pyc in fromarray(self, array)
   1149         if array.size == 0:
   1150             return None
-> 1151         return cPickle.loads(array.tostring())
   1152
   1153

TypeError: ('__new__() takes exactly one argument (2 given)', <class 'pandas.tslib.NaTType'>, ('\x00\x01\xff\xff\x00\x00\x00\x00\x00\x00',))

Haven't had a chance to look at what's going on.

Comment From: jreback

This is not a problem with HDF per se; it is much more an issue with the creation of the Panel itself. The blocks are not being separated correctly by dtype (Datetime and so on). Instead they are getting mashed together into an ObjectBlock, which ends up holding non-creatable types (like NaT):

(Pdb) df
   A                   B
0  1 2010-01-01 00:00:00
1  2                 NaT

(Pdb) df._data

BlockManager
Items: Index([u'A', u'B'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
IntBlock: [A], 1 x 2, dtype: int64
DatetimeBlock: [B], 1 x 2, dtype: datetime64[ns]


(Pdb) Panel({ 'i1' : df, 'i2' : df})._data

BlockManager
Items: Index([u'i1', u'i2'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
Axis 2: Index([u'A', u'B'], dtype='object')
ObjectBlock: [i1, i2], 2 x 2 x 2, dtype: object
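
For reference, the TypeError in the traceback looks consistent with how datetime subclasses are pickled: the pickle stores the class plus a packed state string, and unpickling calls the class with that string. Below is a minimal stdlib-only sketch of that failure mode, assuming NaTType behaves like a datetime subclass whose __new__ takes no arguments. FakeNaT is a hypothetical stand-in, not pandas code.

```python
from datetime import datetime


class FakeNaT(datetime):
    """Hypothetical stand-in for a NaT-like sentinel (not pandas code)."""

    def __new__(cls):
        # Like a sentinel value, it is built with no arguments.
        return datetime.__new__(cls, 1, 1, 1)


nat = FakeNaT()

# datetime's reduce protocol returns (class, (packed_state, ...)).
cls, args = nat.__reduce_ex__(2)[:2]

try:
    # Unpickling effectively calls cls(*args), which __new__ rejects.
    cls(*args)
    rebuilt = True
except TypeError as exc:
    rebuilt = False
    error = exc

print(rebuilt)
```

Once such a value ends up inside an object array that PyTables pickles, the write can succeed while the read fails, which matches the behaviour reported above.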

Comment From: jreback

The following constructs the blocks correctly; I think the dict constructor is failing when a contained frame has mixed dtypes.

In [16]: df = pd.DataFrame({'A': [1, 2], 'B': pd.to_datetime(['2010-01-01', np.nan])})

In [17]: df
Out[17]: 
   A                   B
0  1 2010-01-01 00:00:00
1  2                 NaT

In [18]: df.dtypes
Out[18]: 
A             int64
B    datetime64[ns]
dtype: object

In [19]: y = Panel({'a' : df[['A']] })

In [20]: x = Panel({'b' : df[['B']] })

In [21]: y._data
Out[21]: 
BlockManager
Items: Index([u'a'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
Axis 2: Index([u'A'], dtype='object')
IntBlock: [a], 1 x 2 x 1, dtype: int64

In [22]: x._data
Out[22]: 
BlockManager
Items: Index([u'b'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
Axis 2: Index([u'B'], dtype='object')
DatetimeBlock: [b], 1 x 2 x 1, dtype: datetime64[ns]

In [23]: concat([x,y])
Out[23]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: b to a
Major_axis axis: 0 to 1
Minor_axis axis: A to B

In [24]: concat([x,y])._data
Out[24]: 
BlockManager
Items: Index([u'b', u'a'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
Axis 2: Index([u'A', u'B'], dtype='object')
DatetimeBlock: [b], 1 x 2 x 2, dtype: datetime64[ns]
FloatBlock: [a], 1 x 2 x 2, dtype: float64

Comment From: TomAugspurger

@jreback I'm going to take a stab at this today. Do you know offhand what a good approach would be? The problem arises in Panel._init_dict, where a list of object arrays

ipdb> arrays
[array([[1, Timestamp('2010-01-01 00:00:00', tz=None)],
       [2, NaT]], dtype=object), array([[1, Timestamp('2010-01-01 00:00:00', tz=None)],
       [2, NaT]], dtype=object)]

is created and handed off to create_block_manager_from_arrays. I was thinking about creating the blocks individually instead of dumping all the data into an array. I'll see what happens.
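
The arrays shown above are what you get when a mixed-dtype frame is flattened into a single ndarray: the only common dtype is object. Here is a rough sketch of the "create the blocks individually" idea, written against the current public DataFrame API rather than the 0.12 internals. The helper columns_by_dtype is hypothetical and only illustrates grouping columns before any flattening happens.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2],
    'B': [pd.Timestamp('2010-01-01'), pd.NaT],
})

# Flattening int and datetime columns into one ndarray upcasts to object.
flat = df.to_numpy()
print(flat.dtype)


def columns_by_dtype(frame):
    """Group column labels by dtype kind, one group per would-be block."""
    groups = {}
    for name, dtype in frame.dtypes.items():
        groups.setdefault(dtype.kind, []).append(name)
    return groups


# 'i' is the integer kind, 'M' is the datetime64 kind.
groups = columns_by_dtype(df)
print(groups)
```

Grouping first keeps each dtype intact; the open question in the thread is how to do the equivalent once the data has already been collapsed to object.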

Comment From: jreback

sounds good

Comment From: jreback

@TomAugspurger bumping this... but of course if you'd like to work on it, feel free

Comment From: TomAugspurger

I ran into some troubles with that approach, but I can't remember what the problem was. I'll look today and report back.

Comment From: TomAugspurger

I'm not having any luck with this. As I posted above, you've got a list of arrays with mixed types. I'm not sure how we could infer that it actually contains datetime data along that slice.

It also doesn't matter whether df contains any NaTs.

In [20]: dfa
Out[20]: 
   A          B
0  1 2010-01-01
1  2 2010-01-01

[2 rows x 2 columns]

In [21]: wp = pd.Panel({'i1': dfa, 'i2': dfa})

In [22]: wp._data
Out[22]: 
BlockManager
Items: Index([u'i1', u'i2'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
Axis 2: Index([u'A', u'B'], dtype='object')
ObjectBlock: [i1, i2], 2 x 2 x 2, dtype: object

The form_blocks function in core/internals isn't able to take this array:

ipdb> p v
array([[1, Timestamp('2010-01-01 00:00:00', tz=None)],
       [2, Timestamp('2010-01-01 00:00:00', tz=None)]], dtype=object)

and split it into two blocks (not that I'm saying it should be able to).

Maybe I'm trying to fix this in the wrong place though? Should we just construct it as we do now, and then check whether any objects can be converted before returning? That isn't quite working either, but that may be a bug / not implemented:

In [45]: wp._data
Out[45]: 
BlockManager
Items: Index([u'i1', u'i2'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
Axis 2: Index([u'A', u'B'], dtype='object')
ObjectBlock: [i1, i2], 2 x 2 x 2, dtype: object  # should be split into an Int and Datetime

In [46]: wp.convert_objects(convert_dates=True)._data
Out[46]: 
BlockManager
Items: Index([u'i1', u'i2'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
Axis 2: Index([u'A', u'B'], dtype='object')
ObjectBlock: [i1, i2], 2 x 2 x 2, dtype: object  # no change

In [47]: wp['i1']._data
Out[47]: 
BlockManager
Items: Index([u'A', u'B'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
ObjectBlock: [A, B], 2 x 2, dtype: object  # should be int and Datetime

In [48]: wp['i1'].convert_objects(convert_dates=True)._data
Out[48]: 
BlockManager
Items: Index([u'A', u'B'], dtype='object')
Axis 1: Int64Index([0, 1], dtype='int64')
IntBlock: [A], 1 x 2, dtype: int64
DatetimeBlock: [B], 1 x 2, dtype: datetime64[ns]

So convert_objects works on the underlying DataFrames, but the changes don't propagate back up to the Panel. Sorry this comment got long.
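
For anyone reading this on a recent pandas: convert_objects has since been removed, and infer_objects is its rough successor. Assuming that substitution, the DataFrame-level behaviour described above (an all-object frame recovering its int and datetime columns) can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2],
    'B': [pd.Timestamp('2010-01-01'), pd.Timestamp('2010-01-01')],
})

# Mimic the frame pulled out of the Panel: every column stored as object.
as_object = df.astype(object)

# Soft conversion re-infers a dtype for each object column.
converted = as_object.infer_objects()

print(as_object.dtypes)
print(converted.dtypes)
```

This only demonstrates the 2-D case; it says nothing about how a 3-D container would need to re-split its blocks.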

Comment From: jreback

I am not sure that this is really an issue.

I think orientation really matters here; the example below works just fine. Blocks are oriented with items as the 0th axis, so if ALL of the values in an item are datetimes, it will be stored as a DatetimeBlock.

In [28]: dfa = pd.DataFrame({'A': [1, 2], 'B': pd.to_datetime(['2010-01-01', np.nan])})

In [29]: dfb = pd.DataFrame({'A': [3, 4], 'B': pd.to_datetime(['2010-01-02', np.nan])})

In [30]: p = Panel({ 'dfa' : dfa, 'dfb' : dfb }).transpose(2,0,1)

In [31]: p
Out[31]: 
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: A to B
Major_axis axis: dfa to dfb
Minor_axis axis: 0 to 1

In [32]: p['B']
Out[32]: 
             0   1
dfa 2010-01-01 NaT
dfb 2010-01-02 NaT

[2 rows x 2 columns]

In [33]: p['B'].dtypes
Out[33]: 
0    datetime64[ns]
1    datetime64[ns]
dtype: object
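
The orientation point can also be seen with a plain NumPy object array, independent of Panel. This is a small sketch only, assuming the data is laid out as (items, major, minor) and using ordinary datetime objects in place of Timestamp/NaT:

```python
from datetime import datetime

import numpy as np

d1, d2 = datetime(2010, 1, 1), datetime(2010, 1, 2)

# Shape (items, major, minor): each item is a frame with columns A and B.
cube = np.array(
    [[[1, d1], [2, d1]],
     [[3, d2], [4, d2]]],
    dtype=object,
)

# Each slice along axis 0 mixes ints and datetimes, so it must stay object.
# Moving the old minor axis to the front makes every slice homogeneous.
by_column = cube.transpose(2, 0, 1)

ints = by_column[0].astype('int64')
dates = by_column[1].astype('datetime64[ns]')

print(ints.dtype, dates.dtype)
```

With columns on the 0th axis each slice has a single natural dtype, which is why the transposed Panel above gets proper datetime data for 'B'.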

I think the real problem may be selection (via loc/iloc) on a non-0th axis, which DOES NOT coerce; see the example below. There may need to be a convert_objects step in the indexing routines when ndim > 2 (e.g. this case).

In [40]: p.transpose(2,0,1).loc[:,'B']
Out[40]: 
             0   1
dfa 2010-01-01 NaT
dfb 2010-01-02 NaT

[2 rows x 2 columns]

In [41]: p.transpose(2,0,1).loc[:,'B'].dtypes
Out[41]: 
0    object
1    object
dtype: object

Comment From: jreback

closing as Panels are deprecated