Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
pd.DataFrame.from_records([]) # Empty dataframe
pd.DataFrame.from_records([], index="foo") # Raises a KeyError("foo")
Issue Description
Building a dataframe from records using an empty iterable for data and a string for index raises a KeyError exception instead of returning an empty dataframe with a named index.
Here is the stack trace:
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'foo'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.10/site-packages/pandas/core/frame.py", line 2097, in from_records
    i = columns.get_loc(index)
  File "/usr/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 'foo'
Expected Behavior
I would expect .from_records() not to raise and instead return an empty dataframe with a named index.
Put into code:
import pandas as pd
def test_empty_dataframe_from_records():
    expected = pd.DataFrame()
    actual = pd.DataFrame.from_records([])
    pd.testing.assert_frame_equal(actual, expected)

def test_empty_dataframe_from_records_with_named_index():
    expected = pd.DataFrame(index=pd.Index([], name="foo"))
    actual = pd.DataFrame.from_records([], index="foo")
    pd.testing.assert_frame_equal(actual, expected)
At the moment only the first test passes.
Installed Versions
INSTALLED VERSIONS
------------------
commit           : 5f648bf1706dd75a9ca0d29f26eadfbb595fe52b
python           : 3.10.4.final.0
python-bits      : 64
OS               : Linux
OS-release       : 3.10.0-1160.66.1.el7.x86_64
Version          : #1 SMP Wed May 18 16:02:34 UTC 2022
machine          : x86_64
processor        :
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : en_US.UTF-8
pandas           : 1.3.2
numpy            : 1.22.3
pytz             : 2022.1
dateutil         : 2.8.2
pip              : None
setuptools       : None
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
Comment From: ypsah
Relates to https://github.com/pandas-dev/pandas/issues/2633#issuecomment-567512115.
Comment From: phofl
Hi, thanks for your report.
Not sure if this qualifies as a bug, this looks to me as it behaves as documented.
index : str, list of fields, array-like
Field of array to use as the index, alternately a specific set of input labels to use.
Since your field does not exist, it does not work.
Comment From: simonjayhawkins
Not sure if this qualifies as a bug, this looks to me as it behaves as documented.
In the docs, three compatible data structures are included in the examples.
Now, for the list of dicts case, I can see that for consistency the empty list should work.
data = [
    {"col_1": 3, "col_2": "a"},
    {"col_1": 2, "col_2": "b"},
    {"col_1": 1, "col_2": "c"},
    {"col_1": 0, "col_2": "d"},
]
result = pd.DataFrame.from_records(data, index="col_1")
print(result)
      col_2
col_1
3         a
2         b
1         c
0         d
data = [
    {"col_1": 3, "col_2": "a"},
    {"col_1": 2, "col_2": "b"},
    {"col_2": "c"},
    {"col_1": 0, "col_2": "d"},
]
result = pd.DataFrame.from_records(data, index="col_1")
print(result)
      col_2
col_1
3.0       a
2.0       b
NaN       c
0.0       d
Maybe it should raise KeyError for this last case. But if the list is empty, as in the OP example, I would expect to get an empty DataFrame with the same Index name and columns (no rows).
Now, if the data structure is an np.array, it works as intended.
import numpy as np
import pandas as pd

# empty structured array whose dtype includes the "foo" field
data = np.array([], dtype=[("foo", "i4"), ("col_2", "U1")])
print(repr(data))
print(pd.DataFrame.from_records(data, index="foo"))
array([], dtype=[('foo', '<i4'), ('col_2', '<U1')])
Empty DataFrame
Columns: [col_2]
Index: []
If the record data structure is a list of tuples, i.e. it does not have field names, it correctly raises KeyError: 'foo'
data = [(3, "a"), (2, "b"), (1, "c"), (0, "d")]
pd.DataFrame.from_records(data, index="foo")
Since for the empty list case we don't know whether it is a list of dicts or a list of tuples, we should probably not raise.
Comment From: simonjayhawkins
Since for the empty list case we don't know whether it is a list of dicts or a list of tuples, we should probably not raise.
I think that we should assume that the user is passing an empty list of dicts; otherwise they would not be passing a field label to index.
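To make that assumption concrete, here is a minimal sketch of a user-side wrapper. The name from_records_allow_empty is hypothetical and not part of pandas; it only treats an empty list or tuple combined with a string index as "empty list of dicts" and defers to DataFrame.from_records for everything else. It illustrates the proposed behavior, not how pandas implements or would implement it.

```python
import pandas as pd


def from_records_allow_empty(data, index=None, **kwargs):
    """Hypothetical helper (not a pandas API).

    Treats an empty list/tuple with a string ``index`` as an empty list of
    dicts and returns an empty frame whose index carries that name.
    """
    is_empty_sequence = isinstance(data, (list, tuple)) and len(data) == 0
    if is_empty_sequence and isinstance(index, str):
        # No schema is available, so there are no columns and no rows.
        return pd.DataFrame(index=pd.Index([], name=index))
    # Any other input goes through the regular constructor unchanged.
    return pd.DataFrame.from_records(data, index=index, **kwargs)


empty = from_records_allow_empty([], index="foo")
filled = from_records_allow_empty([{"foo": 1, "bar": "a"}], index="foo")
print(empty.index.name, empty.shape)
print(filled.index.name, list(filled.columns))
```

Non-empty inputs keep their existing behavior, so a list of tuples with index="foo" would still raise KeyError through the regular code path.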
Comment From: simonjayhawkins
Maybe it should raise KeyError for this last case. But if the list is empty, as in the OP example, I would expect to get an empty DataFrame with the same Index name and columns (no rows).
This wasn't clear. Without a schema, we do not know the expected columns, so the result would be an empty DataFrame (no columns and no rows).
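As a small illustration of what "no columns and no rows" looks like, the frame below is built the same way as the expected value in the OP's second test. It is only a sketch of the target shape, assuming the index name is the one piece of information carried over.

```python
import pandas as pd

# Assumed target result for from_records([], index="foo"):
# no rows, no columns, only the index name is preserved.
expected = pd.DataFrame(index=pd.Index([], name="foo"))

print(expected.shape)
print(expected.index.name)
print(list(expected.columns))
```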
Comment From: nir-ml
I guess you can catch the KeyError exception and return an empty dataframe with a named index explicitly:
import pandas as pd
try:
    df = pd.DataFrame.from_records([], index="foo")
except KeyError:
    df = pd.DataFrame(index=pd.Index([], name="foo"))
# df is now an empty dataframe with a named index "foo"