iotop and a simple-minded C program indicate we're nowhere near IO-bound in df.to_csv; the gap is roughly 10-15x.
It might be possible to speed things up considerably with a fast path for special cases (numerical data only) that don't need the fancy quoting and other bells and whistles provided by the underlying python csv module.
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int i;
    FILE *f;
    char fmt[] = "%f,%f,%f,%f,%f\n";

    /* loop forever so iotop can observe sustained throughput */
    while (1) {
        f = fopen("out.csv", "wb");
        for (i = 0; i < 1000000; i++) {
            fprintf(f, fmt, 1.0, 2.0, 3.0, 4.0, 5.0);
        }
        fclose(f);
    }
    return 0;
}
This sustains about 30 MB/s on my machine (without even batching writes) vs. ~2-3 MB/s for the new (0.11.0) cython df.to_csv().
Need to check whether it's the stringifying, the quoting logic, the memory layout, or something else that accounts for the difference.
This should also yield insights for any future binary serialization format we implement.
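A quick way to separate the stringifying cost from the raw write cost in python (a rough sketch only; the shape and float format are arbitrary):

# rough split of stringify cost vs. raw write cost -- sketch only,
# shape and float format are arbitrary
import time
import numpy as np

data = np.random.randn(1000000, 5)

t0 = time.time()
rows = ["%f,%f,%f,%f,%f\n" % tuple(row) for row in data]  # stringify
t1 = time.time()
with open("out_py.csv", "w") as f:
    f.writelines(rows)                                    # raw write
t2 = time.time()
print("stringify: %.2fs  write: %.2fs" % (t1 - t0, t2 - t1))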
Comment From: cpcloud
what if you were to build up buffers of some specified chunk size using iovec structs and make a call into writev/readv, i.e. scatter/gather? is that what you mean by batched writes?
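In python terms, something like this (a sketch only; os.writev is POSIX-only and needs python 3.3+, and the row content and chunk size are placeholders):

# scatter/gather sketch: collect per-row buffers, flush a chunk with one writev call
import os

CHUNK_ROWS = 1024  # arbitrary chunk size for this sketch

fd = os.open("out.csv", os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
buffers = []
for i in range(1000000):
    buffers.append(b"1.0,2.0,3.0,4.0,5.0\n")  # placeholder row
    if len(buffers) == CHUNK_ROWS:
        os.writev(fd, buffers)  # one syscall per chunk; a robust version checks the return value
        buffers = []
if buffers:
    os.writev(fd, buffers)
os.close(fd)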
Comment From: ghost
By batched writes I meant that the C code does not buffer pending write data before issuing a write. Presumably the OS might do this for you rather than going out to disk every time, so it's just function call overhead.
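Roughly, the contrast I mean is this (a sketch only; row content and chunk size are placeholders):

# batched vs. unbatched writes -- sketch only
ROW = b"1.0,2.0,3.0,4.0,5.0\n"

def write_per_row(path, nrows=1000000):
    # unbuffered: every write() goes straight to the OS
    with open(path, "wb", buffering=0) as f:
        for _ in range(nrows):
            f.write(ROW)

def write_batched(path, nrows=1000000, chunk=10000):
    # same data, but one write() per chunk of rows
    with open(path, "wb", buffering=0) as f:
        for start in range(0, nrows, chunk):
            f.write(ROW * min(chunk, nrows - start))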
The iovec idea sounds interesting, but how do you know the perf difference has anything to do with the issue it addresses?
Comment From: cpcloud
what was the code you used to benchmark to_csv?
Comment From: ghost
I used iotop.
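That only gives throughput; for wall-clock numbers something along these lines would do (in IPython, with an arbitrary frame shape):

# wall-clock timing of to_csv on an arbitrary frame -- sketch only
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100000, 5))
%timeit df.to_csv('bench_out.csv')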
Comment From: wesm
probably makes sense to write a C to-csv routine for the simplest of to_csv outputs (maybe not support custom formatter functions to start) at some point. the IO cost will probably outweigh the irregular memory access patterns.
Comment From: michaelaye
I'm interested in making this happen. Here's what I found so far; most likely this is obvious to you, but I would need a hint to improve it:
import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(10000, 30))

# this is the slow part, guess those loops are dodgy...
def df_to_string(df):
    s = '\n'.join([','.join(df.irow(i).astype('string')) for i in xrange(len(df))])
    return s
This is the fast part, and I'm a cython noob, so be gentle:
cimport cython
from libc.stdio cimport fopen, FILE, fclose, fprintf

def c_write_to_file(filename, content):
    filename_byte_string = filename.encode("UTF-8")
    cdef char* fname = filename_byte_string
    cdef char* line = content
    cdef FILE* cfile
    cfile = fopen(fname, "w")
    if cfile == NULL:
        return
    # write the string verbatim; passing it as the format string would
    # misinterpret any stray '%' in the data
    fprintf(cfile, "%s", line)
    fclose(cfile)
    return []
Here are some benchmarks I took:
def df_to_csv_cython(df):
    content = df_to_string(df)
    c_write_to_file('test_out_c.txt', content)

%timeit df_to_csv_cython(df)
1 loops, best of 3: 1.67 s per loop

%timeit df.to_csv('test_csv_out_pandas.csv')
1 loops, best of 3: 416 ms per loop
So what needs to be improved is the DataFrame-to-string conversion, but I guess you knew that already; I just had to dig down to the actual bottleneck. Can you give any hints on how I should proceed?
Comment From: michaelaye
Forgot the benchmarking of the cython write; it's blazing fast once the content string is available:
%timeit c_write_to_file('test_out_cython.txt', content)
100 loops, best of 3: 12.2 ms per loop
Comment From: jreback
no
you just need to change lib.write_csv_rows to a new version (it's a bit trickier because you have to decide a bit higher up in the formatter to use the fast path so you don't create the csv writer at all - but for a proof of concept that doesn't matter)
all the conversions and such already happen by then
take the same data that is passed to write_csv_rows and just write a new version that takes that data and actually writes it to the file handle
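as a pure-python illustration of that shape (names and signature here are illustrative only, not the actual lib.pyx code):

# illustrative proof of concept only -- not the actual lib.pyx signature
# data: list of already-stringified columns (as passed to write_csv_rows)
# ix: stringified index values; handle: an open file object
def write_csv_rows_fast(data, ix, handle):
    ncols = len(data)
    lines = []
    for j in range(len(ix)):
        row = [ix[j]] + [data[i][j] for i in range(ncols)]
        lines.append(",".join(row))
    handle.write("\n".join(lines) + "\n")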
Comment From: jreback
yep
in fact you can reuse write_to_csv almost entirely
it's the write call that is slow
(because I think it does a lot of conversions and such that don't matter for a plain vanilla csv)
Comment From: michaelaye
Which 'write_to_csv' do you mean here? I think I understand now that I have to reimplement lib.pyx's write_csv_rows and make a write_csv_rows_fast there. But from the line profile of CSVFormatter._save_chunk() below, I guess I have to change _save_chunk() as well:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1279 def _save_chunk(self, start_i, end_i):
1280
1281 4 11 2.8 0.0 data_index = self.data_index
1282
1283 # create the data for a chunk
1284 4 8 2.0 0.0 slicer = slice(start_i, end_i)
1285 8 20 2.5 0.0 for i in range(len(self.blocks)):
1286 4 4 1.0 0.0 b = self.blocks[i]
1287 4 5 1.2 0.0 d = b.to_native_types(slicer=slicer, na_rep=self.na_rep,
1288 4 4 1.0 0.0 float_format=self.float_format,
1289 4 59994 14998.5 13.6 date_format=self.date_format)
1290
1291 124 296 2.4 0.1 for i, item in enumerate(b.items):
1292
1293 # self.data is a preallocated list
1294 120 3337 27.8 0.8 self.data[self.column_map[b][i]] = d[i]
1295
1296 4 9 2.2 0.0 ix = data_index.to_native_types(slicer=slicer, na_rep=self.na_rep,
1297 4 4 1.0 0.0 float_format=self.float_format,
1298 4 1010 252.5 0.2 date_format=self.date_format)
1299
1300 4 377245 94311.2 85.4 lib.write_csv_rows(self.data, ix, self.nlevels, self.cols, self.writer)
At least the b.to_native_types() call takes quite some time.
Comment From: jreback
no, most of the slowness is in write_csv_rows
since _save_chunk calls it, that time is included in _save_chunk's time as well (a caller's time is its own time plus the sum of its callees')
Comment From: michaelaye
Well, not even talking about absolute time, isn't it correct that _save_chunk spends 13% of its time in the b.to_native_types() call? It's almost a second-order effect (85/13 = 6.5) but not completely negligible. Okay, so I assume in the comment above you meant that I can reuse lib.write_csv_rows() and kick out what I don't need for a fast, simple write-out?
Funnily enough, I measured that a python write of a long string is actually faster than the cython write-out of a long string, I guess due to cython overhead:
%timeit c_write_to_file('test_out_cython.txt', content)
1 loops, best of 3: 144 ms per loop
%timeit python_write_to_file('test_out_python.txt', content)
10 loops, best of 3: 67 ms per loop
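(python_write_to_file is presumably just the obvious thing, roughly:)

# presumably the plain-python counterpart looks roughly like this
def python_write_to_file(filename, content):
    with open(filename, 'w') as f:
        f.write(content)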
PS: mentioning myself so I can find this issue more easily: @michaelaye
Comment From: jreback
I would simply copy lib.write_csv_rows and make a fast version
converting to native types is necessary for proper dtype handling
worry about that later
always optimize the biggest time sink first
Comment From: wesm
We will need to tackle this in the course of working on libpandas. I suggest we create a new set of issues around writing to CSV more optimally once we are ready to do that.
Comment From: swt2c
I know this is closed, but I would still like to work toward improving to_csv in Pandas 0.x.
I have been doing some profiling, and so far I've found that the biggest (by far) CPU bottleneck in write_csv_rows() is this list assignment:
row[1 + i] = data[i][j]
Per individual call it doesn't take long, but since it gets executed for every cell, it adds up. It's not immediately obvious how to improve this, though, as Pandas seems to store its data in columns, whereas we need to get the data out in rows.
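One direction to try (a sketch only, not measured): let zip() do the column-to-row transpose in a single pass instead of assigning cell by cell:

# sketch: transpose the column lists to rows in bulk instead of per-cell assignment
# data: list of stringified columns; ix: stringified index values; writer: a csv.writer
def write_rows_bulk(data, ix, writer):
    for index_val, row in zip(ix, zip(*data)):
        writer.writerow((index_val,) + row)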
Comment From: swt2c
Also, to_csv is much slower when writing a sparse DataFrame. In that case, the performance is dominated by _to_native_types().
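Easy to reproduce with something like this (shape and density are arbitrary; to_sparse() as in 0.x):

# rough reproduction of the sparse slowdown -- shape and density are arbitrary
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 100))
df[df < 1.0] = np.nan      # mostly missing, so the sparse frame is genuinely sparse
sdf = df.to_sparse()

%timeit df.to_csv('dense.csv')
%timeit sdf.to_csv('sparse.csv')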