As far as I can see, there is no easy way, given a PyArrow table, to get a DataFrame backed by PyArrow types.
I'd expect these idioms to work:
import pyarrow
import pandas
arrow_u8 = pyarrow.array([1, 2, 3], type=pyarrow.uint8())
arrow_f64 = pyarrow.array([1., 2., 3.], type=pyarrow.float64())
table = pyarrow.table([arrow_u8, arrow_f64], names=['u8', 'f64'])
# Using the PyArrow `to_pandas` method will use NumPy backed data
df = table.to_pandas()
# Using the constructor with a PyArrow table raises: ValueError: DataFrame constructor not properly called!
df = pandas.DataFrame(table)
# This is not implemented (the method doesn't exist)
df = pandas.DataFrame.from_arrow(table)
# Creating a dataframe column by column naively from the arrow array will use NumPy dtypes
df = pandas.DataFrame({'u8': arrow_u8,
'f64': arrow_f64})
I think the easier way to make the transition is with something like this:
df = pandas.DataFrame({name: pandas.Series(array,
dtype=pandas.ArrowDtype(array.type))
for array, name
in zip(table.columns, table.column_names)})
@pandas-dev/pandas-core Given that Arrow dtypes are one of the highlights of pandas 2.0, shouldn't we provide at least one easy way to convert before the release?
Comment From: phofl
Arrow isn't fully supported yet; just something to consider when communicating this.
The easiest way is to provide a types mapper to `to_pandas` on an Arrow table; that's what we are using internally as well.
Comment From: datapythonista
Arrow isn't fully supported yet; just something to consider when communicating this.
Agree, but this seems like an important and common use case, and it doesn't seem difficult to implement, no?
The easiest way is to provide a types mapper to `to_pandas` on an Arrow table; that's what we are using internally as well.
I didn't think about it, sounds good. But do we have a pandas function for the mapper, or is it something every user wanting this functionality should write? If that's the case, the code I wrote is probably simpler.
Comment From: phofl
No, we don't have a common function, and I remembered this incorrectly; we are only using it for our own nullable dtypes. Internally we are doing more or less the same as you did.
You can wrap in an `ArrowExtensionArray` instead of using the Series.
Comment From: jbrockmendel
# Using the PyArrow `to_pandas` method will use NumPy backed data
This seems like something we should ask pyarrow to change?
Supporting this in `DataFrame.__init__` would be hacky without making pyarrow a required dependency (xref #50285), but I could be convinced.
Comment From: phofl
Ok there is actually an easy way to do this:
table.to_pandas(types_mapper=lambda x: pd.ArrowDtype(x))
Edit: I think this should be sufficient for now? We should definitely document this. I'll open a PR.
Comment From: jbrockmendel
@phofl i think you can get rid of the lambda and just pass types_mapper=pd.ArrowDtype
Comment From: phofl
Oh good point, yes you are correct.
Comment From: datapythonista
Supporting this in `DataFrame.__init__` would be hacky without making pyarrow a required dependency (xref #50285), but I could be convinced.
I have a working version of it in #51769, and it seems simple enough to be worth adding, with pyarrow remaining optional.
Comment From: jreback
@phofl i think you can get rid of the lambda and just pass `types_mapper=pd.ArrowDtype`
this seems like the ideal solution; we should just document it for now