Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [ ] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
pd.DataFrame.from_records([]) # Empty dataframe
pd.DataFrame.from_records([], index="foo") # Raises a KeyError("foo")
Issue Description
Building a dataframe from records using an empty iterable for data and a string for index raises a KeyError exception instead of returning an empty dataframe with a named index.
Here is the stack trace:
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'foo'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.10/site-packages/pandas/core/frame.py", line 2097, in from_records
    i = columns.get_loc(index)
  File "/usr/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 'foo'
Expected Behavior
I would expect .from_records() not to raise and instead return an empty dataframe with a named index.
Put into code:
import pandas as pd


def test_empty_dataframe_from_records():
    expected = pd.DataFrame()
    actual = pd.DataFrame.from_records([])
    pd.testing.assert_frame_equal(actual, expected)


def test_empty_dataframe_from_records_with_named_index():
    expected = pd.DataFrame(index=pd.Index([], name="foo"))
    actual = pd.DataFrame.from_records([], index="foo")
    pd.testing.assert_frame_equal(actual, expected)
At the moment only the first test passes.
Installed Versions
INSTALLED VERSIONS
------------------
commit : 5f648bf1706dd75a9ca0d29f26eadfbb595fe52b
python : 3.10.4.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1160.66.1.el7.x86_64
Version : #1 SMP Wed May 18 16:02:34 UTC 2022
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 1.3.2
numpy : 1.22.3
pytz : 2022.1
dateutil : 2.8.2
pip : None
setuptools : None
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
Comment From: ypsah
Relates to https://github.com/pandas-dev/pandas/issues/2633#issuecomment-567512115.
Comment From: phofl
Hi, thanks for your report.
Not sure if this qualifies as a bug; this looks to me as if it behaves as documented.
index : str, list of fields, array-like
    Field of array to use as the index, alternately a specific set of input labels to use.
Since your field does not exist, it does not work.
Comment From: simonjayhawkins
> Not sure if this qualifies as a bug; this looks to me as if it behaves as documented.
In the docs, three compatible data structures are included in the examples.
For the list of dicts case, I can see that, for consistency, the empty list should work.
data = [
{"col_1": 3, "col_2": "a"},
{"col_1": 2, "col_2": "b"},
{"col_1": 1, "col_2": "c"},
{"col_1": 0, "col_2": "d"},
]
result = pd.DataFrame.from_records(data, index="col_1")
print(result)
      col_2
col_1
3         a
2         b
1         c
0         d
data = [
{"col_1": 3, "col_2": "a"},
{"col_1": 2, "col_2": "b"},
{"col_2": "c"},
{"col_1": 0, "col_2": "d"},
]
result = pd.DataFrame.from_records(data, index="col_1")
print(result)
      col_2
col_1
3.0       a
2.0       b
NaN       c
0.0       d
Maybe it should raise KeyError for this last case. But if the list is empty, as in the OP example, I would expect to get an empty DataFrame with the same Index name and columns (no rows).
Now, if the data structure is an np.array, it works as intended.
import numpy as np

# empty data, includes "foo" field
data = np.array([], dtype=[("foo", "i4"), ("col_2", "U1")])
print(repr(data))
print(pd.DataFrame.from_records(data, index="foo"))
array([], dtype=[('foo', '<i4'), ('col_2', '<U1')])
Empty DataFrame
Columns: [col_2]
Index: []
If the record data structure is a list of tuples, i.e. it does not have field names, it correctly raises KeyError: 'foo'.
data = [(3, "a"), (2, "b"), (1, "c"), (0, "d")]
pd.DataFrame.from_records(data, index="foo")
Since for the empty list case we don't know whether it is a list of dicts or a list of tuples, we should probably not raise.
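For contrast with the tuple example above, here is a minimal sketch of how a list of tuples can be given field names so that a string index label resolves. The column names "foo" and "col_2" are chosen here for illustration only; they are not part of the original report.

```python
import pandas as pd

# Tuples carry no field names, so we supply them through `columns`.
# The names "foo" and "col_2" are illustrative assumptions.
data = [(3, "a"), (2, "b"), (1, "c"), (0, "d")]
result = pd.DataFrame.from_records(data, columns=["foo", "col_2"], index="foo")

# "foo" is now the index, and only "col_2" remains as a column.
print(result.index.name)
print(list(result.columns))
```

With the names supplied, the lookup of "foo" has something to match against, which is exactly the information missing when the input is an empty list.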
Comment From: simonjayhawkins
> Since for the empty list case we don't know whether it is a list of dicts or a list of tuples, we should probably not raise.
I think that we should assume that the user is passing an empty list of dicts; otherwise they would not be passing a field label to index.
Comment From: simonjayhawkins
> Maybe it should raise KeyError for this last case. But if the list is empty, as in the OP example, I would expect to get an empty DataFrame with the same Index name and columns (no rows).
This wasn't clear. Without a schema, we do not know the expected columns, so the result would be an empty DataFrame (no columns and no rows).
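When the caller does know the schema, the empty frame can be built explicitly rather than inferred. The following is only a sketch of that idea using the plain DataFrame constructor and set_index; the `schema` list is a hypothetical stand-in for whatever columns the records would have had.

```python
import pandas as pd

# Hypothetical schema: the column names the records would carry if any existed.
schema = ["foo", "col_2"]

# Build an empty frame with those columns, then promote "foo" to the index.
empty = pd.DataFrame(columns=schema).set_index("foo")

print(len(empty))           # number of rows
print(empty.index.name)     # name of the index
print(list(empty.columns))  # remaining columns
```

This sidesteps from_records entirely, so it does not depend on how a given pandas version treats an empty list.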
Comment From: nir-ml
I guess you can catch the KeyError exception and return an empty dataframe with a named index explicitly:
import pandas as pd

try:
    df = pd.DataFrame.from_records([], index="foo")
except KeyError:
    df = pd.DataFrame(index=pd.Index([], name="foo"))
# df is now an empty dataframe with a named index "foo"
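The same workaround could also be wrapped in a small helper that checks for emptiness up front instead of catching the exception. This is a sketch only: `from_records_allow_empty` is a hypothetical name, not a pandas function, and it assumes the records are dicts that contain the index field whenever the input is non-empty.

```python
import pandas as pd


def from_records_allow_empty(records, index_name):
    """Hypothetical helper: like DataFrame.from_records with a string index,
    but returns an empty frame with a named index for empty input."""
    records = list(records)  # materialize so generators can be length-checked
    if not records:
        # No rows and no known schema: only the index name can be preserved.
        return pd.DataFrame(index=pd.Index([], name=index_name))
    return pd.DataFrame.from_records(records, index=index_name)


empty_df = from_records_allow_empty([], "foo")
filled_df = from_records_allow_empty([{"foo": 1, "col_2": "a"}], "foo")
```

Checking the length first avoids masking an unrelated KeyError, for example one raised because non-empty records genuinely lack the requested field.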