Feature Type

  • [X] Adding new functionality to pandas

  • [ ] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

Regarding the pyarrow engine for read_csv():

Currently, the column_types parameter does not appear to be passed to the ConvertOptions of pyarrow.csv.read_csv(), even though it is closely analogous to the dtype option of pd.read_csv().

If provided, it would disable type inference for those columns and improve performance.

Currently, the dtype parameter passed to pd.read_csv() is only used to convert the data types of the DataFrame after it is produced by pa.read_csv().to_pandas(), so it does not improve the performance of pa.read_csv().

Feature Description

All that would be needed is a mapping of pandas dtypes to PyArrow dtypes (maybe this already exists?). This mapping could then be used to create column_types from dtype and pass it to ConvertOptions.

Alternative Solutions

Additional Context

No response

Comment From: phofl

Hey, thanks for your request. We should add engine_kwargs like for other io methods, which would accommodate this

Comment From: Finndersen

> Hey, thanks for your request. We should add engine_kwargs like for other io methods, which would accommodate this

I don't think it should have or require engine_kwargs. Other existing arguments of pd.read_csv() are already translated to pa.read_csv() options, and the existing dtype argument of pd.read_csv() is a direct analogue of pa.read_csv()'s column_types, so it can be used directly.

Comment From: lithomas1

This seems reasonable to me.

Implementing this will be slightly tricky, though, since from what I remember, the column_types parameter doesn't play too well with things such as names.

Also, I am pretty sure that column_types doesn't support passing an int for the column position whose dtype you want to specify, or just a dtype for the whole thing.

I can have a look at this later this week, but if you want to take a stab at it before me, that's also fine by me.

Comment From: Finndersen

@lithomas1 I'm not sure what you mean by column_types not playing well with names, since its documentation says: "Explicitly map column names to column types"

However you might be right about the column positions or global dtype.

At the very least, it seems dtype could be passed through if provided as a dict keyed by column names (or as a dict keyed by column indices if the "names" parameter is also provided, since we could do the mapping using that).
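The index-to-name mapping described above could be sketched roughly as follows. The function resolve_dtype_keys is hypothetical, and it assumes integer keys always mean column positions (pandas itself also allows integer column labels, which this sketch ignores).

```python
def resolve_dtype_keys(dtype, names=None):
    """Return dtype as a {column_name: dtype} dict, or None if that is not possible.

    Hypothetical helper: integer keys are treated as positions into `names`.
    A single (non-dict) dtype is expanded to every column in `names`.
    """
    if not isinstance(dtype, dict):
        # A single dtype for the whole file needs the column names.
        if names is None:
            return None
        return {name: dtype for name in names}

    resolved = {}
    for key, value in dtype.items():
        if isinstance(key, int):
            if names is None:
                # Positions cannot be resolved without knowing the names.
                return None
            resolved[names[key]] = value
        else:
            resolved[key] = value
    return resolved


by_name = resolve_dtype_keys({0: "int64", "price": "float32"}, names=["id", "price"])
unresolved = resolve_dtype_keys({0: "int64"})
```

A None result would signal that the dtype cannot be forwarded to pyarrow and has to be applied after conversion, as happens today.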

Otherwise, if there's a cheap way to inspect the CSV file and get its column names without reading the whole thing, we could do that first, before pa.read_csv(). That would add support for dtype provided as a mapping of column indices or as a single type.
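One cheap way to get the column names is to parse only the first row with the standard library csv module. This is a sketch under the assumption that the header is the first line and no rows are skipped; peek_column_names is a hypothetical name.

```python
import csv
import io


def peek_column_names(text_stream, delimiter=","):
    """Read only the first row of a CSV stream and return it as the header.

    Hypothetical helper: assumes the header is on the first line. Returns an
    empty list for an empty stream.
    """
    reader = csv.reader(text_stream, delimiter=delimiter)
    return next(reader, [])


# In-memory example; with a real file, open it in text mode and pass the handle.
sample = io.StringIO('id,"unit price",label\n1,2.5,a\n2,3.0,b\n')
header = peek_column_names(sample)
```

Only the first row is consumed, so the cost does not depend on the size of the file. The stream would need to be reopened or rewound before the actual read.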

Unfortunately, I'm travelling overseas at the moment and not in a good position to develop this myself right now.

Comment From: lithomas1

I meant the renamed column names from the names parameter in read_csv. (That's one parameter I didn't pass to pyarrow on purpose and I'll need to investigate again to find out why.)

Re: missing features in pyarrow, I think now is probably a good time to fix things upstream, since a decent chunk of people are using or interested in the arrow parser.

I think for now I'll just special case what pyarrow can handle and try to fix the missing cases upstream.

Comment From: phofl

I'd rather use engine_kwargs here instead of special-casing things, because the arrow engine cannot deal with all dtype configurations. Special-casing will cause unnecessary confusion later on. Also, the hierarchy between the application of names and dtype should stay the same for all engines.

Comment From: lithomas1

#34823 is the other issue related to engine_kwargs, but it's restricted to just the to_pandas part of reading.

I don't think we should allow users to pass options to the pyarrow read_csv call itself since it's just going to be challenging to avoid conflicts with what we're passing in.

The only thing from the pyarrow read_csv options that I think is probably useful for users, and that we aren't doing ourselves, is the auto_dict_encode feature, but I'm inclined to say that people should just make the pyarrow read_csv call themselves if they want it.

I think I agree on holding off on this for now. I think I kinda forgot about things like third-party ExtensionArrays that will be broken.