Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this issue exists on the latest version of pandas.
- [x] I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Generate data. The effect becomes visible for large n, i.e. > 1,000,000.
import numpy as np
import pandas as pd
n = 10000000
df = pd.DataFrame(
    {
        "a": np.random.choice([0, 1, None], n),
        "b": np.random.choice([0, 10, 90129, None], n),
        "c": np.random.choice([True, False, None], n),
        "d": np.random.choice(["a", "b", None], n),
    },
    dtype=object,
)
df["a"] = df["a"].astype("Int8")
df = df.convert_dtypes()
# To check that types set correctly
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 4 columns):
 #   Column  Dtype
---  ------  -------
 0   a       Int8
 1   b       Int64
 2   c       boolean
 3   d       string
memory usage: 200.3 MB
Write it out with pyarrow:
import pyarrow as pa
from pyarrow import parquet as pq
table = pa.Table.from_pandas(df)
# Remove metadata to prevent automatic use of nullable types
table = table.replace_schema_metadata()
pq.write_table(table, "test.parquet")
Speed test
%timeit df = pd.read_parquet("test.parquet", use_nullable_dtypes=True)
997 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df = pd.read_parquet("test.parquet", use_nullable_dtypes=False)
632 ms ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Note that use_nullable_dtypes=False returns a dataframe with the wrong types, i.e. [float64, float64, object, object].
I would expect the version with use_nullable_dtypes=True to run faster - after all, it doesn't have to cast any types. The opposite is the case.
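For reference, a quick way to see the dtype difference described above (the False-path dtypes are the ones reported here; the True path should give back nullable dtypes matching the original frame):
pd.read_parquet("test.parquet", use_nullable_dtypes=False).dtypes
# -> float64, float64, object, object  (nulls force the integer columns to float)
pd.read_parquet("test.parquet", use_nullable_dtypes=True).dtypes
# -> nullable Int*/boolean/string dtypes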
Installed Versions
Prior Performance
No response
Comment From: jorisvandenbossche
@timlod thanks for the report!
Doing a quick profile of both read_parquet calls using your reproducer, it seems that it is mostly the string column that is causing the slowdown.
What is happening under the hood for that column:
In [10]: arr = pa.array(np.random.choice(["a", "b", None], 10000000))
In [12]: pd.StringDtype().__from_arrow__(arr)
Out[12]:
<StringArray>
[ 'a', <NA>, <NA>, 'a', 'a', 'a', 'a', 'a', 'b', 'a',
...
'b', 'a', 'b', 'a', 'a', <NA>, <NA>, 'a', 'b', <NA>]
Length: 10000000, dtype: string
In [13]: %timeit pd.StringDtype().__from_arrow__(arr)
556 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [14]: %timeit arr.to_pandas()
174 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
This StringDtype.__from_arrow__ has a lot of overhead compared to the conversion to an object array done by pyarrow itself (the arr.to_pandas()). It should certainly be possible to optimize __from_arrow__ in this case: it seems to be validating each value (as we need to do for a generic object-dtype array), while in the case of a pyarrow string array we already know that it only contains strings.
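For illustration, a rough way to isolate that validation cost, reusing arr from the snippet above (lib.ensure_string_array and lib.convert_nans_to_NA are pandas-internal helpers, so treat this as a sketch rather than a supported API):
from pandas._libs import lib

objs = arr.to_numpy(zero_copy_only=False)  # object ndarray of str / None, as produced by pyarrow

%timeit lib.ensure_string_array(objs, na_value=pd.NA)  # validates every element
%timeit lib.convert_nans_to_NA(objs)                   # only converts missing values to pd.NA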
Comment From: timlod
Interesting - I hadn't thought of checking which column caused it; I naively assumed it was the 'wrapping' of the NA itself that caused the overhead.
Do you think this would warrant not using the nullable string type here, or would there be room to let users choose which nullable types to use (perhaps something like the arguments to df.convert_dtypes)?
I hadn't used StringDtype before, because object with None works fine.
If it's relatively straightforward (i.e. Python code only), I'd be happy to take a look at optimizing this. Would it be possible to just call .to_pandas in this case, if arrow already handles these better?
Comment From: timlod
To bypass the StringDtype, I just use pyarrow directly now with a types_mapper:
types_mapper = {
    pa.int64(): pd.Int64Dtype(),
    pa.int8(): pd.Int8Dtype(),
    pa.bool_(): pd.BooleanDtype(),
    # add more types here, but leave out pa.string()
}.get
df = pq.read_table("test.parquet").to_pandas(types_mapper=types_mapper)
Asking pyarrow to convert to StringDtype incurs the same performance hit we see in pandas - I see that pandas essentially does what I'm doing here, just including the StringDtype conversion (hence my question about using to_pandas() can be answered with a no).
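For comparison, adding the string mapping back in reproduces the slow path (a sketch of what that looks like; per the above, this is roughly what pandas' use_nullable_dtypes=True path amounts to):
import pandas as pd
import pyarrow as pa
from pyarrow import parquet as pq

slow_mapper = {
    pa.int8(): pd.Int8Dtype(),
    pa.int64(): pd.Int64Dtype(),
    pa.bool_(): pd.BooleanDtype(),
    pa.string(): pd.StringDtype(),  # this entry triggers StringDtype.__from_arrow__
}.get

df_slow = pq.read_table("test.parquet").to_pandas(types_mapper=slow_mapper)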
I did one quick check and used
result = lib.convert_nans_to_NA(scalars)
in StringArray._from_sequence instead of
result = lib.ensure_string_array(
    scalars, na_value=StringDtype.na_value, copy=copy
)
to mitigate string validation and only convert None to NA.
There is some performance improvement, but it is still a lot slower than using object strings.
I'm not sure how I could optimize this further without knowing more about the internals here.
Comment From: jorisvandenbossche
Do you think this would warrant not using the nullable string type here, or would there be room to allow to choose which nullable types to use (perhaps something like the arguments in df.convert_dtypes)?
Not using the nullable string type could be a possible work-around, yes, but eventually we should simply fix this, so it shouldn't matter whether you use the nullable string type or not.
Personally, I am a bit wary of adding additional arguments to choose which nullable dtypes you want and which not, as that becomes quite a lot of arguments as the number of nullable dtypes keeps growing (and ones that we would need to add to several read_.. functions and other IO functions).
I did one quick check and used lib.convert_nans_to_NA in StringArray._from_sequence instead of lib.ensure_string_array to mitigate string validation and only convert None to NA. There is some performance improvement, but still a lot slower than using object strings.
Yeah, so there are still some other places with overhead. It starts inside __from_arrow__, where we call _from_sequence:
https://github.com/pandas-dev/pandas/blob/fc68a9a290fc314c090e037597138c74fa23ee6d/pandas/core/arrays/string_.py#L201
and _from_sequence then calls lib.ensure_string_array, which adds the overhead.
Ideally we should have a StringArray private constructor that avoids all of this overhead (or only does the None -> pd.NA conversion with convert_nans_to_NA).
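For illustration, a minimal sketch of such a validation-free construction, relying on pandas internals at the commit linked above (the helper name is hypothetical and the details are an assumption, not an agreed design):
import numpy as np

from pandas._libs import lib
from pandas._libs.arrays import NDArrayBacked
from pandas.core.arrays.string_ import StringArray, StringDtype


def _string_array_skip_validation(values: np.ndarray) -> StringArray:
    # `values` is an object ndarray known to hold only str and missing values
    # (e.g. converted from a pyarrow string chunk), so skip the per-element
    # validation and only normalise the missing values to pd.NA.
    values = lib.convert_nans_to_NA(values)
    new = StringArray.__new__(StringArray)
    NDArrayBacked.__init__(new, values, StringDtype())
    return new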
Now, as you have seen, only updating this one place still leaves a lot of the overhead, and that is because we do yet another (unnecessary) validation at the end of __from_arrow__ when concatenating the resulting string arrays:
https://github.com/pandas-dev/pandas/blob/fc68a9a290fc314c090e037597138c74fa23ee6d/pandas/core/arrays/string_.py#L204-L208
This is generally needed because the pyarrow StringArray could consist of multiple chunks. However, we should 1) avoid this concat step if len(results) == 1, and 2) in general there should be no need for validation when concatenating (as all of the original arrays being concatenated are already string arrays).
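A minimal sketch of what the end of __from_arrow__ could then do, written as a standalone helper for readability (results is the list of per-chunk StringArrays built earlier in the method; the helper name is hypothetical):
import numpy as np

from pandas.core.arrays.string_ import StringArray


def _combine_chunk_results(results: list) -> StringArray:
    if not results:
        return StringArray(np.array([], dtype=object))
    if len(results) == 1:
        # (1) a single chunk: skip the concat step (and its re-validation) entirely
        return results[0]
    # (2) multiple chunks: _concat_same_type currently re-validates via the
    # StringArray constructor; removing that re-validation is the second fix
    return StringArray._concat_same_type(results)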
If you would be interested in doing a PR, I can certainly give a helping hand.
Comment From: timlod
I'd be interested in doing a PR.
As mentioned above, I already tried replacing lib.ensure_string_array with convert_nans_to_NA (btw, your reply sounds like this may not even be necessary?).
I wasn't aware that ._concat_same_type may do another round of validation - in fact, I'm not sure of the exact source for StringArray's version of it. My best guess would be that it's this:
https://github.com/pandas-dev/pandas/blob/fc68a9a290fc314c090e037597138c74fa23ee6d/pandas/core/arrays/_mixins.py#L214-L227
But here I don't see any additional validation happening.
Avoiding the concatenation when len(results) == 1 is straightforward enough, like here:
https://github.com/pandas-dev/pandas/blob/fc68a9a290fc314c090e037597138c74fa23ee6d/pandas/core/arrays/numeric.py#L106-L114
Edit: OK, I figured out where it goes wrong. Indeed the code from _mixins.py is applied. The issue is that
https://github.com/pandas-dev/pandas/blob/fc68a9a290fc314c090e037597138c74fa23ee6d/pandas/core/arrays/numpy_.py#L112-L113
results in calling the StringArray constructor again, which will validate the data unnecessarily. In general, this seems very redundant - we create a StringArray for each chunk of data, concatenate their ndarray storage, and then create a StringArray again.
The crux here lies in bypassing the constructor, or at least the validation happening there, in _from_backing_data.
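A sketch of what bypassing it could look like, reusing the constructor-bypass idea from the previous comment (shown as a plain function taking self for illustration; an assumption about a possible fix, not an accepted patch):
import numpy as np

from pandas._libs.arrays import NDArrayBacked
from pandas.core.arrays.string_ import StringArray


def _from_backing_data(self: StringArray, arr: np.ndarray) -> StringArray:
    # `arr` is built from the ndarray storage of already-validated StringArrays,
    # so rebuild the result without calling __init__ (which would re-validate).
    new = StringArray.__new__(StringArray)
    NDArrayBacked.__init__(new, arr, self.dtype)
    return new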
I'll start thinking of/implementing a fix - if you already have a suggestion, let me know!