As far as I can see, there is no easy way, given a PyArrow table, to get a DataFrame backed by PyArrow types.
I'd expect these idioms to work:
import pyarrow
import pandas
arrow_u8 = pyarrow.array([1, 2, 3], type=pyarrow.uint8())
arrow_f64 = pyarrow.array([1., 2., 3.], type=pyarrow.float64())
table = pyarrow.table([arrow_u8, arrow_f64], names=['u8', 'f64'])
# Using the PyArrow `to_pandas` method will use NumPy backed data
df = table.to_pandas()
# Using the constructor with a PyArrow table raises: ValueError: DataFrame constructor not properly called!
df = pandas.DataFrame(table)
# This is not implemented (the method doesn't exist)
df = pandas.DataFrame.from_arrow(table)
# Creating a dataframe column by column naively from the arrow array will use NumPy dtypes
df = pandas.DataFrame({'u8': arrow_u8,
'f64': arrow_f64})
I think the easier way to make the transition is with something like this:
df = pandas.DataFrame({name: pandas.Series(array,
dtype=pandas.ArrowDtype(array.type))
for array, name
in zip(table.columns, table.column_names)})
@pandas-dev/pandas-core Given that Arrow dtypes are one of the highlights of pandas 2.0, shouldn't we provide at least one easy way to convert before the release?
Comment From: phofl
Arrow isn't fully supported yet; just something to consider when communicating this.
The easiest way is to provide a types mapper to `to_pandas` on an Arrow table; that's what we are using internally as well.
Comment From: datapythonista
Arrow isn't fully supported yet; just something to consider when communicating this.
Agree, but this seems like an important and common use case, and it doesn't seem difficult to implement, no?
The easiest way is to provide a types mapper to `to_pandas` on an Arrow table; that's what we are using internally as well.
I didn't think about it, sounds good. But do we have a pandas function for the mapper, or is it something every user wanting this functionality should write? If that's the case, the code I wrote is probably simpler.
Comment From: phofl
No, we don't have a common function, and I remembered this incorrectly; we are only using it for our own nullable dtypes. Internally we are doing more or less the same as you did.
You can wrap in an `ArrowExtensionArray` instead of using the Series.
Comment From: jbrockmendel
# Using the PyArrow `to_pandas` method will use NumPy backed data
This seems like something we should ask pyarrow to change?
Supporting this in `DataFrame.__init__` would be hacky without making pyarrow a required dependency (xref #50285), but I could be convinced.
Comment From: phofl
Ok there is actually an easy way to do this:
table.to_pandas(types_mapper=lambda x: pd.ArrowDtype(x))
Edit: I think this should be sufficient for now? We should definitely document this. I'll open a PR.
Comment From: jbrockmendel
@phofl i think you can get rid of the lambda and just pass types_mapper=pd.ArrowDtype
Comment From: phofl
Oh good point, yes you are correct.
Comment From: datapythonista
Supporting this in `DataFrame.__init__` would be hacky without making pyarrow a required dependency (xref #50285), but I could be convinced.
I have a working version of it in #51769, and it seems simple enough to be worth adding, with pyarrow remaining optional.
Comment From: jreback
@phofl i think you can get rid of the lambda and just pass `types_mapper=pd.ArrowDtype`
this seems like the ideal solution; we should just document it for now