Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

pd.DataFrame.from_records([])               # Empty dataframe
pd.DataFrame.from_records([], index="foo")  # Raises a KeyError("foo")

Issue Description

Building a dataframe from records using an empty iterable for data and a string for index raises a KeyError exception instead of returning an empty dataframe with a named index.

Here is the stack trace:

Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'foo'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.10/site-packages/pandas/core/frame.py", line 2097, in from_records
    i = columns.get_loc(index)
  File "/usr/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 'foo'

Expected Behavior

I would expect .from_records() not to raise and instead return an empty dataframe with a named index.

Put into code:

import pandas as pd

def test_empty_dataframe_from_records():
    expected = pd.DataFrame()
    actual = pd.DataFrame.from_records([])
    pd.testing.assert_frame_equal(actual, expected)

def test_empty_dataframe_from_records_with_named_index():
    expected = pd.DataFrame(index=pd.Index([], name="foo"))
    actual = pd.DataFrame.from_records([], index="foo")
    pd.testing.assert_frame_equal(actual, expected)

At the moment only the first test passes.

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 5f648bf1706dd75a9ca0d29f26eadfbb595fe52b
python           : 3.10.4.final.0
python-bits      : 64
OS               : Linux
OS-release       : 3.10.0-1160.66.1.el7.x86_64
Version          : #1 SMP Wed May 18 16:02:34 UTC 2022
machine          : x86_64
processor        :
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : en_US.UTF-8
pandas           : 1.3.2
numpy            : 1.22.3
pytz             : 2022.1
dateutil         : 2.8.2
pip              : None
setuptools       : None
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
    

Comment From: ypsah

Relates to https://github.com/pandas-dev/pandas/issues/2633#issuecomment-567512115.

Comment From: phofl

Hi, thanks for your report.

Not sure if this qualifies as a bug; it looks to me like it behaves as documented.

index : str, list of fields, array-like

    Field of array to use as the index, alternately a specific set of input labels to use.

Since your field does not exist, it does not work.
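
A minimal sketch of the failing lookup, assuming (as the traceback in the report suggests) that from_records resolves a string index through columns.get_loc: an empty list of records yields no columns, so no label can be found. The variable names are illustrative only.

```python
import pandas as pd

# An empty list of records produces a frame with no columns at all.
columns = pd.DataFrame.from_records([]).columns

missing = None
try:
    # Looking up any label in an empty set of columns fails.
    columns.get_loc("foo")
except KeyError as exc:
    # The missing label is carried as the exception's first argument.
    missing = exc.args[0]

print(len(columns), missing)
```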

Comment From: simonjayhawkins

Not sure if this qualifies as a bug; it looks to me like it behaves as documented.

In the docs, three compatible data structures are included in the examples.

Now, for the list of dicts case, I can see that for consistency the empty list should work:

data = [
    {"col_1": 3, "col_2": "a"},
    {"col_1": 2, "col_2": "b"},
    {"col_1": 1, "col_2": "c"},
    {"col_1": 0, "col_2": "d"},
]
result = pd.DataFrame.from_records(data, index="col_1")
print(result)
      col_2
col_1      
3         a
2         b
1         c
0         d
data = [
    {"col_1": 3, "col_2": "a"},
    {"col_1": 2, "col_2": "b"},
    {"col_2": "c"},
    {"col_1": 0, "col_2": "d"},
]
result = pd.DataFrame.from_records(data, index="col_1")
print(result)
      col_2
col_1      
3.0       a
2.0       b
NaN       c
0.0       d

Maybe it should raise KeyError for this last case. But if the list is empty, as in the OP example, I would expect to get an empty DataFrame with the same Index name and columns (no rows).

Now, if the data structure is an np.array, it works as intended.

import numpy as np
import pandas as pd

# empty data, includes "foo" field
data = np.array([], dtype=[("foo", "i4"), ("col_2", "U1")])
print(repr(data))
print(pd.DataFrame.from_records(data, index="foo"))
array([], dtype=[('foo', '<i4'), ('col_2', '<U1')])
Empty DataFrame
Columns: [col_2]
Index: []

If the record data structure is a list of tuples, i.e. does not have field names, it correctly raises KeyError: 'foo'.

data = [(3, "a"), (2, "b"), (1, "c"), (0, "d")]
pd.DataFrame.from_records(data, index="foo")

Since, for the empty list case, we don't know whether it is a list of dicts or a list of tuples, we should probably not raise.

Comment From: simonjayhawkins

Since, for the empty list case, we don't know whether it is a list of dicts or a list of tuples, we should probably not raise.

I think that we should assume that the user is passing an empty list of dicts; otherwise they would not be passing a field label to index.

Comment From: simonjayhawkins

Maybe it should raise KeyError for this last case. But if the list is empty, as in the OP example, I would expect to get an empty DataFrame with the same Index name and columns (no rows).

This wasn't clear. Without a schema, we do not know the expected columns, so the result would be an empty DataFrame (no columns and no rows).

Comment From: nir-ml

I guess you can catch the KeyError exception and return an empty dataframe with a named index explicitly:

import pandas as pd

try:
    df = pd.DataFrame.from_records([], index="foo")
except KeyError:
    df = pd.DataFrame(index=pd.Index([], name="foo"))

# df is now an empty dataframe with a named index "foo"
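
One caveat with catching KeyError: it would also hide a genuinely missing field when the data is not empty. A sketch of an alternative that checks for emptiness up front and only then falls back to an empty frame with a named index. Note that from_records_or_empty is a hypothetical helper name used for illustration, not a pandas API.

```python
import pandas as pd


def from_records_or_empty(records, index):
    """Hypothetical helper: build a frame from records, tolerating empty input."""
    # Materialize the iterable so it can be checked for emptiness.
    records = list(records)
    if not records:
        # No rows and no schema: return an empty frame that only carries
        # the requested index name.
        return pd.DataFrame(index=pd.Index([], name=index))
    # Non-empty input goes through the regular code path, so a missing
    # field still raises KeyError as usual.
    return pd.DataFrame.from_records(records, index=index)


empty = from_records_or_empty([], index="foo")
filled = from_records_or_empty(
    [{"foo": 1, "bar": "a"}, {"foo": 2, "bar": "b"}], index="foo"
)
```

With this approach the fallback is only taken for empty input, so typos in the index label are still reported for real data.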