xref https://github.com/pandas-dev/pandas/pull/47962#issuecomment-1209776265
Currently, the date parsing is done by pandas after pyarrow
has read the input in - the date parsing should be done by pyarrow
, and pandas should let an error be thrown if pyarrow
can't parse a date
This would cause some breakage (see the linked discussion for an example), and so we should go through a deprecation cycle first
This would involve adding a FutureWarning
to the function _do_date_conversions
called in https://github.com/pandas-dev/pandas/blob/7214f857e45965fe7a0cfbba1d2dc35abb0fd7f4/pandas/io/parsers/arrow_parser_wrapper.py#L111 saying that in the future, date parsing will be handed off to pyarrow
itself and therefore some unusual formats currently handled by pandas
might not continue to work
Comment From: shourya5
take
Comment From: jreback
i disagree here
sure we should dispatch to pyarrow for it's known date parsing
but (and this may not be simple) we can always post process to replicate what our other engines do
rather than confusing it fills in the feature gaps until/unless pyarrow can fill them (at a possible increase in pandas complexity)
but imho worth it - we do this is most other functions - eg smooth over numpy specific issues or gaps
don't see why this should be different
Comment From: MarcoGorelli
So, try to parse with pyarrow, and otherwise keep the pandas post-processing? cc @mroeschke @WillAyd
@shourya5 is the issue clear to you? It's a bit more complex than what was originally described
Comment From: shourya5
I think I do understand,can I additionally get an example in this repo where post-processing is done
Comment From: MarcoGorelli
It's the function _do_date_conversions
I think the idea would be something like:
If pyarrow is able to do the conversion, then no need to call _do_date_conversions
. If pyarrow throws an error, then retry getting pyarrow to read the file, but without parsing the date column as date, and then call _do_date_conversions
I'm writing from my phone as I'm travelling, but if you check the linked (closed) PR, there should be links to where this happens in the code base. Please do ask if something's not clear
Comment From: WillAyd
I think having two date parsing functions applied to the arrow engine will be problematic. As a somewhat contrived example, if arrow decided at some point to parse a date as day first and we did month first, then adding an extra compatability layer on top of theirs can cause data to change unexpectedly.
i.e. if we had some kind of file like
date
5/7/2022
And arrow decided this was July 5, 2022 then it would do so when called with engine="pyarrow". However, if we introduce some troublesome data on the following line:
date
5/7/2022
19700101 00:00:00
Then the engine="pyarrow" might now parse the first line as May 7, 2022 just because we kicked into our own custom date parser when arrow tripped up on the second line
Comment From: mroeschke
Yeah I think maintaining logic for a pyarrow behavior -> pandas behavior
for csv parsing (not just date parsing) is going to be a very large undertaking. I think the compatibility logic would eventually need to compensate for
- Different pyarrow versions
- Value dependent behavior (i.e. knowing the input value, what pyarrow returned, and what pandas "should" return)
- The combination of the above points
Generally, my philosophy (opinion) of incorporating pyarrow into pandas has generally been "the user should be getting pyarrow behavior but returned as pandas data structures". This is what I've been aiming towards with the ArrowExtensionArray
, but open to being wrong about this thinking.
Comment From: jreback
yep suooort argument makes sense +1 on proceeding for marco summary
Comment From: MCRE-BE
Bug is still open. Any fix planned @MarcoGorelli ?
Comment From: MarcoGorelli
yup, it still needs doing
@shourya5 still interested in working on this?
Comment From: lithomas1
@MarcoGorelli Can you hold off on this for a bit? I think we might be able to do some date parsing directly with pyarrow with PDEP-4, since we can pass through a timestamp format to pyarrow.
I think the only case that probably isn't supported is parse_dates=False
, since while there is a way to disable parsing timestamps, you can't stop pyarrow from parsing dates as date32.
Comment From: MarcoGorelli
sure, thanks for the heads up!