xref https://github.com/pandas-dev/pandas/pull/47962#issuecomment-1209776265

Currently, the date parsing is done by pandas after pyarrow has read the input in - the date parsing should be done by pyarrow, and pandas should let an error be thrown if pyarrow can't parse a date

This would cause some breakage (see the linked discussion for an example), and so we should go through a deprecation cycle first

This would involve adding a FutureWarning to the function _do_date_conversions called in https://github.com/pandas-dev/pandas/blob/7214f857e45965fe7a0cfbba1d2dc35abb0fd7f4/pandas/io/parsers/arrow_parser_wrapper.py#L111 saying that in the future, date parsing will be handed off to pyarrow itself and therefore some unusual formats currently handled by pandas might not continue to work

Comment From: shourya5

take

Comment From: jreback

i disagree here

sure we should dispatch to pyarrow for it's known date parsing

but (and this may not be simple) we can always post process to replicate what our other engines do

rather than confusing it fills in the feature gaps until/unless pyarrow can fill them (at a possible increase in pandas complexity)

but imho worth it - we do this is most other functions - eg smooth over numpy specific issues or gaps

don't see why this should be different

Comment From: MarcoGorelli

So, try to parse with pyarrow, and otherwise keep the pandas post-processing? cc @mroeschke @WillAyd

@shourya5 is the issue clear to you? It's a bit more complex than what was originally described

Comment From: shourya5

I think I do understand,can I additionally get an example in this repo where post-processing is done

Comment From: MarcoGorelli

It's the function _do_date_conversions

I think the idea would be something like: If pyarrow is able to do the conversion, then no need to call _do_date_conversions. If pyarrow throws an error, then retry getting pyarrow to read the file, but without parsing the date column as date, and then call _do_date_conversions

I'm writing from my phone as I'm travelling, but if you check the linked (closed) PR, there should be links to where this happens in the code base. Please do ask if something's not clear

Comment From: WillAyd

I think having two date parsing functions applied to the arrow engine will be problematic. As a somewhat contrived example, if arrow decided at some point to parse a date as day first and we did month first, then adding an extra compatability layer on top of theirs can cause data to change unexpectedly.

i.e. if we had some kind of file like

date
5/7/2022

And arrow decided this was July 5, 2022 then it would do so when called with engine="pyarrow". However, if we introduce some troublesome data on the following line:

date
5/7/2022
19700101 00:00:00

Then the engine="pyarrow" might now parse the first line as May 7, 2022 just because we kicked into our own custom date parser when arrow tripped up on the second line

Comment From: mroeschke

Yeah I think maintaining logic for a pyarrow behavior -> pandas behavior for csv parsing (not just date parsing) is going to be a very large undertaking. I think the compatibility logic would eventually need to compensate for

  • Different pyarrow versions
  • Value dependent behavior (i.e. knowing the input value, what pyarrow returned, and what pandas "should" return)
  • The combination of the above points

Generally, my philosophy (opinion) of incorporating pyarrow into pandas has generally been "the user should be getting pyarrow behavior but returned as pandas data structures". This is what I've been aiming towards with the ArrowExtensionArray, but open to being wrong about this thinking.

Comment From: jreback

yep suooort argument makes sense +1 on proceeding for marco summary

Comment From: MCRE-BE

Bug is still open. Any fix planned @MarcoGorelli ?

Comment From: MarcoGorelli

yup, it still needs doing

@shourya5 still interested in working on this?

Comment From: lithomas1

@MarcoGorelli Can you hold off on this for a bit? I think we might be able to do some date parsing directly with pyarrow with PDEP-4, since we can pass through a timestamp format to pyarrow.

I think the only case that probably isn't supported is parse_dates=False, since while there is a way to disable parsing timestamps, you can't stop pyarrow from parsing dates as date32.

Comment From: MarcoGorelli

sure, thanks for the heads up!