The following example from the categorical docs works with Python 2.7:
In [1]: s = pd.Series(["a","b","c","a"])
In [2]: s2 = s.astype('category')
In [3]: s2.astype('string')
Out[3]:
0 a
1 b
2 c
3 a
dtype: object
However, with Python 3, the same code fails:
In [74]: s = pd.Series(["a","b","c","a"])
In [75]: s2 = s.astype('category')
In [76]: s2.astype('string')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-76-70b55f934dfe> in <module>()
----> 1 s2.astype('string')
/home/joris/scipy/pandas/pandas/core/generic.py in astype(self, dtype, copy, raise_on_error, **kwargs)
3179 conversion, with unconvertible values becoming NaT.
3180 convert_numeric : boolean, default False
-> 3181 If True, attempt to coerce to numbers (including strings), with
3182 unconvertible values becoming NaN.
3183 convert_timedeltas : boolean, default True
/home/joris/scipy/pandas/pandas/core/internals.py in astype(self, dtype, **kwargs)
3188
3189 def astype(self, dtype, **kwargs):
-> 3190 return self.apply('astype', dtype=dtype, **kwargs)
3191
3192 def convert(self, **kwargs):
/home/joris/scipy/pandas/pandas/core/internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
3055
3056 kwargs['mgr'] = self
-> 3057 applied = getattr(b, f)(**kwargs)
3058 result_blocks = _extend_blocks(applied, result_blocks)
3059
/home/joris/scipy/pandas/pandas/core/internals.py in astype(self, dtype, copy, raise_on_error, values, **kwargs)
459 **kwargs):
460 return self._astype(dtype, copy=copy, raise_on_error=raise_on_error,
--> 461 values=values, **kwargs)
462
463 def _astype(self, dtype, copy=False, raise_on_error=True, values=None,
/home/joris/scipy/pandas/pandas/core/internals.py in _astype(self, dtype, copy, raise_on_error, values, klass, mgr)
2158 values = self.values
2159 else:
-> 2160 values = np.asarray(self.values).astype(dtype, copy=False)
2161
2162 if copy:
TypeError: data type "string" not understood
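Should this also work on Python 3, or should we just update the docs? In any case, consistency would be nice here.
The root cause is probably that numpy's astype shows the same difference: the traceback ends in np.asarray(self.values).astype(dtype, copy=False), which does not understand the "string" dtype name on Python 3.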
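For reference, one possible workaround is to avoid the 'string' alias altogether and pass the builtin str type, which does not depend on how numpy resolves dtype names. This is only a minimal sketch, not necessarily what the docs should end up using; the variable name converted is just for illustration, and the resulting dtype may differ between pandas versions:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"])
s2 = s.astype("category")

# Pass the builtin `str` type instead of the 'string' alias, so the
# conversion does not rely on numpy recognising the name "string".
converted = s2.astype(str)

# The values are plain strings again and the result is no longer categorical.
print(converted.tolist())
print(converted.dtype)
```

The values round-trip unchanged, so this should be usable in the docs example for both Python versions.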
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.0+270.gc72f297
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.0.0
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0rc2
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: None
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: 0.0.7
pandas_datareader: None
Comment From: chris-b1
xref https://github.com/numpy/numpy/issues/6023 (may be other open issues)
Comment From: jorisvandenbossche
OK, maybe we should just leave it as a numpy issue then. In any case, I am updating the docs to make them Python 3 compatible for now.
Comment From: jorisvandenbossche
Doc issue is addressed in https://github.com/pandas-dev/pandas/pull/15011
Comment From: TomAugspurger
Looks like we're good here.