Problem description

I was looking into how to convert dataframes to numpy arrays so that both column dtypes and names are retained, preferably efficiently so that memory is not duplicated in the process. Ideally, I would like a view onto the internal data already stored by the dataframe as a numpy array. I am fine with keeping the dtypes and column names exactly as the dataframe already has them.

The issue is that both as_matrix and values coerce the values of all columns to a single common dtype, and to_records does not create a plain numpy array.

I have found two potential StackOverflow answers:

- https://stackoverflow.com/questions/40554179/how-to-keep-column-names-when-converting-from-pandas-to-numpy
- https://stackoverflow.com/questions/13187778/convert-pandas-dataframe-to-numpy-array-preserving-index

But it seems to me that all those solutions copy the data through intermediate data structures and then store it into a new numpy array.

So I am asking for a way to get the data as it is, without any dtype conversions, as a numpy array.
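
To illustrate the coercion (a minimal sketch, not code from the report):

>>> import pandas as pd
>>> pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5]}).values.dtype  # ints upcast
dtype('float64')
>>> pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']}).values.dtype  # falls back to object
dtype('O')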

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.27-moby
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 20.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

Comment From: TomAugspurger

Can you show an actual example (construct a dataframe, and then what you'd like to be able to do)?

I would like to have a view on internal data already stored by dataframes as a numpy array.

Not sure how much you've looked into the internals, but there won't necessarily be a numpy array, as in a single numpy array, backing a DataFrame. The data model is a bit more complex than that.
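
For context, a rough way to see that data model (a sketch poking at the private, 0.20-era _data attribute; illustrative only, and df here is just an example name):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5], 'c': ['x', 'y']})
# _data is the private BlockManager; like-typed columns are consolidated
# into one 2D numpy array ("block") per dtype, not one array per frame
for block in df._data.blocks:
    print(block.dtype, block.shape)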

Could you also say a bit more about your use-case?

Comment From: mitar

I am using Pandas to read multiple CSV files and to make sure the indexes across them match and are in the same order (using .loc), especially to match samples and targets, which are split into two files. Some of the columns are categorical, so I can use Pandas to automatically encode them for me. Then I want to convert everything to a numpy array because the rest of the pipeline expects that.

An example would be something like:

>>> import pandas as pd
>>> import numpy as np
>>> from io import StringIO
>>> data = """col1,col2,col3
1,a,3.4
1,a,3.4
2,b,4.5"""
>>> frame = pd.read_csv(StringIO(data), dtype={0: 'int', 1: 'category', 2: 'float64'})
>>> frame.dtypes
col1       int64
col2    category
col3     float64
dtype: object

What I would like is to convert this to:

>>> np.array([(1, 0, 3.4), (1, 0, 3.4), (2, 1, 4.5)], dtype=[('col1', np.int64), ('col2', np.int8), ('col3', np.float64)])
array([(1, 0,  3.4), (1, 0,  3.4), (2, 1,  4.5)], dtype=[('col1', '<i8'), ('col2', 'i1'), ('col3', '<f8')])

I see that I can do the first step, converting categorical attributes to their integer codes, with:

# names of all categorical columns
categorical_columns = frame.select_dtypes(('category',)).columns
# replace each categorical column with its integer codes
frame[categorical_columns] = frame[categorical_columns].apply(lambda c: c.cat.codes)

But I do not see how to do the next step.
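
One possibility (a hedged sketch, not from this thread) is to assemble the structured array column by column, assuming the categorical columns have already been replaced by their codes as above; note this still copies each column once:

import numpy as np

# build a structured dtype from the frame's column names and dtypes
dtypes = [(name, frame[name].dtype) for name in frame.columns]
result = np.empty(len(frame), dtype=dtypes)
for name in frame.columns:
    # copy one column at a time into the matching field
    result[name] = frame[name].values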

Although, I am realizing now that numpy does not support a 2D matrix with different types for different columns, nor labels for columns. The above is just a 1D array of tuples. So as_matrix or values seem to be the best way to do this after all.
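
To make that concrete, reusing the structured array from above:

>>> arr = np.array([(1, 0, 3.4), (1, 0, 3.4), (2, 1, 4.5)],
...                dtype=[('col1', np.int64), ('col2', np.int8), ('col3', np.float64)])
>>> arr.shape    # a single dimension of records, not a 2D matrix
(3,)
>>> arr['col2']  # columns are addressed by field name
array([0, 0, 1], dtype=int8)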

Comment From: chris-b1

Isn't your desired output what to_records(index=False) produces?

In [7]: frame.to_records(index=False)
Out[7]:
rec.array([(1, 0, 3.4), (1, 0, 3.4), (2, 1, 4.5)],
          dtype=[('col1', '<i4'), ('col2', 'i1'), ('col3', '<f8')])

Comment From: mitar

But that is an np.recarray and not a plain np.ndarray? Is np.recarray a strict extension of np.ndarray?

Comment From: jreback

@mitar

using multi-dtype ndarrays is only supported via rec-arrays (as @chris-b1 shows how to convert).

You can certainly select out columns or do a .values conversion, but the target function then potentially needs to deal with an object-dtype array, so this is not efficient at all. You need to segregate dtypes; that is simply a lot of work to do with plain numpy arrays, while pandas does it with ease. So you can certainly use some of the pointed-to solutions, but I suspect you have other issues if the conversion to an ndarray is your bottleneck.
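
For illustration, segregating dtypes could look like this with the example frame above (a sketch, not code from this thread):

# one homogeneous block per dtype; each converts without an
# object-dtype fallback or silent upcasting
int_block = frame.select_dtypes(include=['int64']).values
float_block = frame.select_dtypes(include=['float64']).values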

Comment From: mitar

Thanks for the input. I will look more into this.

Comment From: chris-b1

From what I recall, recarray is a very thin subclass, so something like this probably works if you have a strict ndarray requirement downstream.

In [14]: ra = frame.to_records(index=False)

In [15]: np.asarray(ra)
Out[15]: 
array([(1, 0, 3.4), (1, 0, 3.4), (2, 1, 4.5)], 
      dtype=(numpy.record, [('col1', '<i4'), ('col2', 'i1'), ('col3', '<f8')]))
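
If the downstream requirement is an isinstance check, a view also works and avoids any copy (a sketch using the ra from above):

plain = ra.view(np.ndarray)  # zero-copy view that drops the recarray subclass
assert isinstance(plain, np.ndarray)
assert not isinstance(plain, np.recarray)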

Comment From: sai9010

How do I remove the dtype from the output?