Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
from pandas import SparseDtype
df = pd.DataFrame([["a", 0],["b", 1], ["b", 2]], columns=["A","B"])
df["A"].astype(SparseDtype("category"))
# or: df["A"].astype(SparseDtype("category", fill_value="not_in_series"))
Issue Description
I am unable to convert a dense categorical series to a sparse one when I leave the fill_value
at its default, or set it to a value which does not exist in the series.
Stacktrace:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/IPython/core/formatters.py:706, in PlainTextFormatter.__call__(self, obj)
699 stream = StringIO()
700 printer = pretty.RepresentationPrinter(stream, self.verbose,
701 self.max_width, self.newline,
702 max_seq_length=self.max_seq_length,
703 singleton_pprinters=self.singleton_printers,
704 type_pprinters=self.type_printers,
705 deferred_pprinters=self.deferred_printers)
--> 706 printer.pretty(obj)
707 printer.flush()
708 return stream.getvalue()
File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/IPython/lib/pretty.py:410, in RepresentationPrinter.pretty(self, obj)
407 return meth(obj, self, cycle)
408 if cls is not object \
409 and callable(cls.__dict__.get('__repr__')):
--> 410 return _repr_pprint(obj, self, cycle)
412 return _default_pprint(obj, self, cycle)
413 finally:
File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/IPython/lib/pretty.py:778, in _repr_pprint(obj, p, cycle)
776 """A pprint that just redirects to the normal repr function."""
777 # Find newlines and replace them with p.break_()
--> 778 output = repr(obj)
779 lines = output.splitlines()
780 with p.group():
File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/core/series.py:1550, in Series.__repr__(self)
1548 # pylint: disable=invalid-repr-returned
1549 repr_params = fmt.get_series_repr_params()
-> 1550 return self.to_string(**repr_params)
File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/core/series.py:1643, in Series.to_string(self, buf, na_rep, float_format, header, index, length, dtype, name, max_rows, min_rows)
1597 """
1598 Render a string representation of the Series.
1599
(...)
1629 String representation of Series if ``buf=None``, otherwise None.
1630 """
1631 formatter = fmt.SeriesFormatter(
1632 self,
1633 name=name,
(...)
1641 max_rows=max_rows,
1642 )
-> 1643 result = formatter.to_string()
1645 # catch contract violations
1646 if not isinstance(result, str):
File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/io/formats/format.py:393, in SeriesFormatter.to_string(self)
390 return f"{type(self.series).__name__}([], {footer})"
392 fmt_index, have_header = self._get_formatted_index()
--> 393 fmt_values = self._get_formatted_values()
395 if self.is_truncated_vertically:
396 n_header_rows = 0
File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/io/formats/format.py:377, in SeriesFormatter._get_formatted_values(self)
376 def _get_formatted_values(self) -> list[str]:
--> 377 return format_array(
378 self.tr_series._values,
379 None,
380 float_format=self.float_format,
381 na_rep=self.na_rep,
382 leading_space=self.index,
383 )
File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/io/formats/format.py:1326, in format_array(values, formatter, float_format, na_rep, digits, space, justify, decimal, leading_space, quoting)
1311 digits = get_option("display.precision")
1313 fmt_obj = fmt_klass(
1314 values,
1315 digits=digits,
(...)
1323 quoting=quoting,
1324 )
-> 1326 return fmt_obj.get_result()
File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/io/formats/format.py:1357, in GenericArrayFormatter.get_result(self)
1356 def get_result(self) -> list[str]:
-> 1357 fmt_values = self._format_strings()
1358 return _make_fixed_width(fmt_values, self.justify)
File ~/repositories/arff-to-parquet/venv/lib/python3.10/site-packages/pandas/io/formats/format.py:1658, in ExtensionArrayFormatter._format_strings(self)
1656 array = values._internal_get_values()
1657 else:
-> 1658 array = np.asarray(values)
1660 fmt_values = format_array(
1661 array,
1662 formatter,
(...)
1670 quoting=self.quoting,
1671 )
1672 return fmt_values
ValueError: object __array__ method not producing an array
Expected Behavior
I expect it to "just work", similar to providing a fill value which does exist in the series, or how it works with other dtypes:
import pandas as pd
from pandas import SparseDtype
df = pd.DataFrame([["a", 0],["b", 1], ["b", 2]], columns=["A","B"])
# works, since "a" is a value present in the series
df["A"].astype(SparseDtype("category", fill_value="a"))
# also works, despite -1 not being present in the series
df["B"].astype(SparseDtype(int, fill_value=-1))
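To make the integer comparison concrete, the result can be inspected through the sparse accessor. This is only a minimal sketch on the integer column from the example above; the names `absent` and `present` are illustrative, and it says nothing about the categorical case that fails.

```python
import pandas as pd
from pandas import SparseDtype

df = pd.DataFrame([["a", 0], ["b", 1], ["b", 2]], columns=["A", "B"])

# fill_value=-1 never occurs in B, so every value has to be stored explicitly
absent = df["B"].astype(SparseDtype(int, fill_value=-1))

# fill_value=0 occurs once in B, so only the other two values are stored
present = df["B"].astype(SparseDtype(int, fill_value=0))

# density is the fraction of values that are stored rather than implied
print(absent.sparse.fill_value, absent.sparse.density)
print(present.sparse.fill_value, present.sparse.density)
```

A fill value that is absent from the data is therefore legal for integers; it just gives no compression.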
Installed Versions
INSTALLED VERSIONS
------------------
commit : 8dab54d6573f7186ff0c3b6364d5e4dd635ff3e7
python : 3.10.5.final.0
python-bits : 64
OS : Darwin
OS-release : 21.5.0
Version : Darwin Kernel Version 21.5.0: Tue Apr 26 21:08:37 PDT 2022; root:xnu-8020.121.3~4/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 1.5.2
numpy : 1.23.5
pytz : 2022.6
dateutil : 2.8.2
setuptools : 58.1.0
pip : 22.3.1
Cython : None
pytest : 7.2.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.6.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.3
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None
Comment From: rhshadrach
Thanks for the report! Agreed the case where fill_value is specified but does not exist looks like a bug. For df["A"].astype(SparseDtype("category")) on the other hand, what's the expectation on unspecified values? While it's coming from a dense object so there are no unspecified values, I would think we still need a default.
Comment From: PGijsbers
The current default of fill_value is float("nan"), which seems a reasonable default to me given that's how None is represented in a dense categorical series. The problem is that if no such None/nan value exists in the data, an error gets raised (similarly to setting the fill_value explicitly to a value not present in the data). I would even assume that if the bug is fixed for an explicitly set fill_value, the same patch also fixes the error when it is left at the default.
For an alternative default, the mode of the data also seems reasonable to me as it compresses the data, but that is not in line with the default for other types (e.g., integers default to fill_value=0 regardless of data). For that reason, I would recommend sticking with float("nan") as the default.
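For reference, the data-independent defaults mentioned above can be checked directly on the dtype objects. This is a small sketch that only covers numpy subtypes; the variable names are illustrative, and the categorical subtype is deliberately left out since its behaviour is what this issue is about.

```python
from pandas import SparseDtype

# Defaults are picked from the subtype alone, before any data is seen
int_default = SparseDtype(int).fill_value
bool_default = SparseDtype(bool).fill_value
float_default = SparseDtype(float).fill_value

print(int_default, bool_default, float_default)
```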
Comment From: rhshadrach
Thanks - makes sense. Further investigations and PRs to fix are most welcome!
Comment From: PGijsbers
At this time I won't commit to that, unfortunately. So if anyone wants to work on this - go ahead! :) If there's no progress by the time I do have time, I'll leave a message.
Comment From: yanxiaole
It seems this issue no longer exists in the latest main branch. @PGijsbers, could you confirm?
Comment From: PGijsbers
Just re-ran the example myself on latest, and it looks like it works now 🥳 thanks!
Comment From: rhshadrach
Unless tests have already been added in the PR that (inadvertently?) fixed this, this needs tests before the issue is closed.
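A possible starting point for such a test, sketched under assumptions: the helper name `check_sparse_astype` is hypothetical and not part of the pandas test suite. It exercises the two steps visible in the traceback (rendering the Series and the np.asarray call inside the formatter) and is run here only on the integer case that already worked in the report.

```python
import numpy as np
import pandas as pd
from pandas import SparseDtype


def check_sparse_astype(series, dtype):
    """Convert to a sparse dtype and exercise the paths from the traceback."""
    result = series.astype(dtype)
    # The reported ValueError surfaced while rendering the Series
    text = repr(result)
    # ... and was raised by this conversion inside the formatter
    dense = np.asarray(result)
    assert list(dense) == list(series)
    return text


df = pd.DataFrame([["a", 0], ["b", 1], ["b", 2]], columns=["A", "B"])

# Known-good case from the report: integer column, fill value not in the data
rendered = check_sparse_astype(df["B"], SparseDtype(int, fill_value=-1))
```

A regression test for this issue would call the same helper with df["A"] and the two SparseDtype("category") variants from the reproducible example; whether those calls pass depends on the pandas version in use.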
Comment From: luke396
take