Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this issue exists on the latest version of pandas.
- [x] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Generate data. The effect becomes visible for large n, i.e. > 1,000,000.
import numpy as np
import pandas as pd
n = 10000000
df = pd.DataFrame(
    {
        "a": np.random.choice([0, 1, None], n),
        "b": np.random.choice([0, 10, 90129, None], n),
        "c": np.random.choice([True, False, None], n),
        "d": np.random.choice(["a", "b", None], n),
    },
    dtype=object,
)
df["a"] = df["a"].astype("Int8")
df = df.convert_dtypes()
# To check that types set correctly
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 4 columns):
 #   Column  Dtype
---  ------  -------
 0   a       Int8
 1   b       Int64
 2   c       boolean
 3   d       string
memory usage: 200.3 MB
Write it out with pyarrow:
import pyarrow as pa
from pyarrow import parquet as pq
table = pa.Table.from_pandas(df)
# Remove metadata to prevent automatic use of nullable types
table = table.replace_schema_metadata()
pq.write_table(table, "test.parquet")
Speed test
%timeit df = pd.read_parquet("test.parquet", use_nullable_dtypes=True)
997 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df = pd.read_parquet("test.parquet", use_nullable_dtypes=False)
632 ms ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Note that use_nullable_dtypes=False returns a dataframe with the wrong types, i.e. [float64, float64, object, object].
I would expect the version with use_nullable_dtypes=True to run faster - after all, it doesn't have to cast any types. The opposite is the case.
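For reference, a quick way to see the dtype difference described above (the False-path dtypes are the ones reported here; the True path should give back nullable dtypes matching the original frame):
pd.read_parquet("test.parquet", use_nullable_dtypes=False).dtypes
# -> float64, float64, object, object  (nulls force the integer columns to float)
pd.read_parquet("test.parquet", use_nullable_dtypes=True).dtypes
# -> nullable Int*/boolean/string dtypes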
Installed Versions
Prior Performance
No response
Comment From: jorisvandenbossche
@timlod thanks for the report!
Doing a quick profile of both read_parquet calls using your reproducer, it seems that it is mostly the string column that is causing the slowdown.
What is happening under the hood for that column:
In [10]: arr = pa.array(np.random.choice(["a", "b", None], 10000000))
In [12]: pd.StringDtype().__from_arrow__(arr)
Out[12]:
<StringArray>
[ 'a', <NA>, <NA>, 'a', 'a', 'a', 'a', 'a', 'b', 'a',
...
'b', 'a', 'b', 'a', 'a', <NA>, <NA>, 'a', 'b', <NA>]
Length: 10000000, dtype: string
In [13]: %timeit pd.StringDtype().__from_arrow__(arr)
556 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [14]: %timeit arr.to_pandas()
174 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
This StringDtype.__from_arrow__ has a lot of overhead compared to the conversion to an object array done by pyarrow itself (the arr.to_pandas()). It should certainly be possible to optimize __from_arrow__ in this case: it seems to be validating each value (as we need to do for a generic object-dtype array), while in the case of a pyarrow string array we already know that it only contains strings.
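For illustration, a rough way to isolate that validation cost, reusing arr from the snippet above (lib.ensure_string_array and lib.convert_nans_to_NA are pandas-internal helpers, so treat this as a sketch rather than a supported API):
from pandas._libs import lib

objs = arr.to_numpy(zero_copy_only=False)  # object ndarray of str / None, as produced by pyarrow

%timeit lib.ensure_string_array(objs, na_value=pd.NA)  # validates every element
%timeit lib.convert_nans_to_NA(objs)                   # only converts missing values to pd.NA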
Comment From: timlod
Interesting - I hadn't thought of checking which column caused it; I naively assumed it was the 'wrapping' of the NA itself that caused the overhead.
Do you think this would warrant not using the nullable string type here, or would there be room to let users choose which nullable types to use (perhaps something like the arguments to df.convert_dtypes)?
I hadn't used StringDtype before, because object with None works fine.
If it's relatively straightforward (i.e. Python code only), I'd be happy to take a look at optimizing this. Would it be possible to just call .to_pandas in this case, if arrow already handles these better?
Comment From: timlod
To bypass the StringDtype, I just use pyarrow directly now with a types_mapper:
types_mapper = {
    pa.int64(): pd.Int64Dtype(),
    pa.int8(): pd.Int8Dtype(),
    pa.bool_(): pd.BooleanDtype(),
    # add more types here, but leave out pa.string()
}.get
df = pq.read_table("test.parquet").to_pandas(types_mapper=types_mapper)
Asking pyarrow to convert to StringDtype incurs the same performance hit we see in pandas - I see that pandas essentially does what I'm doing here, just including the StringDtype conversion (hence my question about using to_pandas() can be answered with a no).
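For comparison, adding the string mapping back in reproduces the slow path (a sketch of what that looks like; per the above, this is roughly what pandas' use_nullable_dtypes=True path amounts to):
import pandas as pd
import pyarrow as pa
from pyarrow import parquet as pq

slow_mapper = {
    pa.int8(): pd.Int8Dtype(),
    pa.int64(): pd.Int64Dtype(),
    pa.bool_(): pd.BooleanDtype(),
    pa.string(): pd.StringDtype(),  # this entry triggers StringDtype.__from_arrow__
}.get

df_slow = pq.read_table("test.parquet").to_pandas(types_mapper=slow_mapper)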
I did one quick check and used
result = lib.convert_nans_to_NA(scalars)
in StringArray._from_sequence instead of
result = lib.ensure_string_array(
    scalars, na_value=StringDtype.na_value, copy=copy
)
to mitigate string validation and only convert None to NA.
There is some performance improvement, but it is still a lot slower than using object strings.
I'm not sure how I could optimize this further without knowing more about the internals here.
Comment From: jorisvandenbossche
Do you think this would warrant not using the nullable string type here, or would there be room to allow to choose which nullable types to use (perhaps something like the arguments in df.convert_dtypes)?
Not using the nullable string type could be a possible work-around, yes, but eventually we should simply fix this, so it shouldn't matter whether you use the nullable string type or not.
Personally, I am a bit wary of adding additional arguments to choose which nullable dtypes you want and which not, as that becomes quite a lot of arguments as the number of nullable dtypes keeps growing (and ones that we would need to add to several read_.. functions and other IO functions).
I did one quick check and used lib.convert_nans_to_NA in StringArray._from_sequence instead of lib.ensure_string_array to mitigate string validation and only convert None to NA. There is some performance improvement, but still a lot slower than using object strings.
Yeah, so there are still some other places with overhead. It starts inside __from_arrow__, where we call _from_sequence:
https://github.com/pandas-dev/pandas/blob/fc68a9a290fc314c090e037597138c74fa23ee6d/pandas/core/arrays/string_.py#L201
and _from_sequence then calls lib.ensure_string_array, which adds the overhead.
Ideally we should have a StringArray private constructor that avoids all of this overhead (or only does the None -> pd.NA conversion with convert_nans_to_NA).
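For illustration, a minimal sketch of such a validation-free construction, relying on pandas internals at the commit linked above (the helper name is hypothetical and the details are an assumption, not an agreed design):
import numpy as np

from pandas._libs import lib
from pandas._libs.arrays import NDArrayBacked
from pandas.core.arrays.string_ import StringArray, StringDtype


def _string_array_skip_validation(values: np.ndarray) -> StringArray:
    # `values` is an object ndarray known to hold only str and missing values
    # (e.g. converted from a pyarrow string chunk), so skip the per-element
    # validation and only normalise the missing values to pd.NA.
    values = lib.convert_nans_to_NA(values)
    new = StringArray.__new__(StringArray)
    NDArrayBacked.__init__(new, values, StringDtype())
    return new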
Now, as you have seen, only updating this one place still leaves a lot of the overhead, and that is because we do yet another (unnecessary) validation at the end of __from_arrow__ when concatenating the resulting string arrays:
https://github.com/pandas-dev/pandas/blob/fc68a9a290fc314c090e037597138c74fa23ee6d/pandas/core/arrays/string_.py#L204-L208
This is generally needed because the pyarrow StringArray could consist of multiple chunks. However, we should 1) avoid this concat step if len(results) == 1, and 2) in general there should be no need for validation when concatenating (as all of the original arrays being concatenated are already string arrays).
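A minimal sketch of what the end of __from_arrow__ could then do, written as a standalone helper for readability (results is the list of per-chunk StringArrays built earlier in the method; the helper name is hypothetical):
import numpy as np

from pandas.core.arrays.string_ import StringArray


def _combine_chunk_results(results: list) -> StringArray:
    if not results:
        return StringArray(np.array([], dtype=object))
    if len(results) == 1:
        # (1) a single chunk: skip the concat step (and its re-validation) entirely
        return results[0]
    # (2) multiple chunks: _concat_same_type currently re-validates via the
    # StringArray constructor; removing that re-validation is the second fix
    return StringArray._concat_same_type(results)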
If you would be interested in doing a PR, I can certainly give a helping hand.
Comment From: timlod
I'd be interested in doing a PR.
As mentioned above, I already tried replacing lib.ensure_string_array with convert_nans_to_NA (btw, your reply sounds like this may not even be necessary?).
I wasn't aware that ._concat_same_type may do another round of validation - in fact, I'm not sure of the exact source for StringArray's version of it. My best guess would be that it's this:
https://github.com/pandas-dev/pandas/blob/fc68a9a290fc314c090e037597138c74fa23ee6d/pandas/core/arrays/_mixins.py#L214-L227
But here I don't see any additional validation happening.
Avoiding the concatenation when len(results) == 1 is straightforward enough, like here:
https://github.com/pandas-dev/pandas/blob/fc68a9a290fc314c090e037597138c74fa23ee6d/pandas/core/arrays/numeric.py#L106-L114
Edit: OK, I figured out where it goes wrong. Indeed the code from _mixins.py is applied. The issue is that
https://github.com/pandas-dev/pandas/blob/fc68a9a290fc314c090e037597138c74fa23ee6d/pandas/core/arrays/numpy_.py#L112-L113
results in calling the StringArray constructor again, which will validate the data unnecessarily. In general, this seems very redundant - we create a StringArray for each chunk of data, concatenate their ndarray storage, and then create a StringArray again.
The crux here lies in bypassing the constructor, or at least the validation happening there, in _from_backing_data.
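A sketch of what bypassing it could look like, reusing the constructor-bypass idea from the previous comment (shown as a plain function taking self for illustration; an assumption about a possible fix, not an accepted patch):
import numpy as np

from pandas._libs.arrays import NDArrayBacked
from pandas.core.arrays.string_ import StringArray


def _from_backing_data(self: StringArray, arr: np.ndarray) -> StringArray:
    # `arr` is built from the ndarray storage of already-validated StringArrays,
    # so rebuild the result without calling __init__ (which would re-validate).
    new = StringArray.__new__(StringArray)
    NDArrayBacked.__init__(new, arr, self.dtype)
    return new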
I'll start thinking of/implementing a fix - if you already have a suggestion, let me know!