iotop and a simple-minded C program indicate we're nowhere near IO-bound in df.to_csv; the gap is roughly 10-15x.
It might be possible to speed things up considerably with a fast path for special cases (numerical data only) that don't need the fancy quoting and other bells and whistles provided by the underlying python csv module.
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int i;
    FILE *f;
    char fmt[] = "%f,%f,%f,%f,%f\n";

    /* loop forever so iotop can observe sustained throughput */
    while (1) {
        f = fopen("out.csv", "wb");
        for (i = 0; i < 1000000; i++) {
            fprintf(f, fmt, 1.0, 2.0, 3.0, 4.0, 5.0);
        }
        fclose(f);
    }
    return 0;
}
This sustains about 30 MB/s on my machine (without even batching writes) vs. ~2-3 MB/s for the new (0.11.0) cython df.to_csv().
Need to check whether it's the stringifying, the quoting logic, the memory layout, or something else that accounts for the difference.
This should also yield insights for any future binary serialization format we implement.
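A quick way to separate the stringifying cost from the raw write cost in python (a rough sketch only; the shape and float format are arbitrary):

# rough split of stringify cost vs. raw write cost -- sketch only,
# shape and float format are arbitrary
import time
import numpy as np

data = np.random.randn(1000000, 5)

t0 = time.time()
rows = ["%f,%f,%f,%f,%f\n" % tuple(row) for row in data]  # stringify
t1 = time.time()
with open("out_py.csv", "w") as f:
    f.writelines(rows)                                    # raw write
t2 = time.time()
print("stringify: %.2fs  write: %.2fs" % (t1 - t0, t2 - t1))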
Comment From: cpcloud
what if you were to build up buffers of some specified chunk size using iovec structs and make a call into writev/readv, i.e. scatter/gather? is that what you mean by batched writes?
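In python terms, something like this (a sketch only; os.writev is POSIX-only and needs python 3.3+, and the row content and chunk size are placeholders):

# scatter/gather sketch: collect per-row buffers, flush a chunk with one writev call
import os

CHUNK_ROWS = 1024  # arbitrary chunk size for this sketch

fd = os.open("out.csv", os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
buffers = []
for i in range(1000000):
    buffers.append(b"1.0,2.0,3.0,4.0,5.0\n")  # placeholder row
    if len(buffers) == CHUNK_ROWS:
        os.writev(fd, buffers)  # one syscall per chunk; a robust version checks the return value
        buffers = []
if buffers:
    os.writev(fd, buffers)
os.close(fd)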
Comment From: ghost
By batched writes I meant that the C code does not buffer pending write data before issuing a write. Presumably the OS might do this for you rather than going out to disk every time, so it's just function call overhead.
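Roughly, the contrast I mean is this (a sketch only; row content and chunk size are placeholders):

# batched vs. unbatched writes -- sketch only
ROW = b"1.0,2.0,3.0,4.0,5.0\n"

def write_per_row(path, nrows=1000000):
    # unbuffered: every write() goes straight to the OS
    with open(path, "wb", buffering=0) as f:
        for _ in range(nrows):
            f.write(ROW)

def write_batched(path, nrows=1000000, chunk=10000):
    # same data, but one write() per chunk of rows
    with open(path, "wb", buffering=0) as f:
        for start in range(0, nrows, chunk):
            f.write(ROW * min(chunk, nrows - start))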
The iovec idea sounds interesting, but how do you know the perf difference has anything to do with the issue it addresses?
Comment From: cpcloud
what was the code you used to benchmark to_csv?
Comment From: ghost
I used iotop.
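That only gives throughput; for wall-clock numbers something along these lines would do (in IPython, with an arbitrary frame shape):

# wall-clock timing of to_csv on an arbitrary frame -- sketch only
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100000, 5))
%timeit df.to_csv('bench_out.csv')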
Comment From: wesm
probably makes sense to write a C to-csv routine for the simplest of to_csv outputs (maybe not support custom formatter functions to start) at some point. the IO cost will probably outweigh the irregular memory access patterns.
Comment From: michaelaye
I'm interested in making this happen. Here's what I found so far; most likely this is obvious to you, but I would need a hint to improve it:
import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(10000, 30))

# this is the slow part, guess those loops are dodgy...
def df_to_string(df):
    s = '\n'.join([','.join(df.irow(i).astype('string')) for i in xrange(len(df))])
    return s
This is the fast part, and I'm a cython noob, so be gentle:
cimport cython
from libc.stdio cimport fopen, FILE, fclose, fprintf

def c_write_to_file(filename, content):
    filename_byte_string = filename.encode("UTF-8")
    cdef char* fname = filename_byte_string
    cdef char* line = content
    cdef FILE* cfile
    cfile = fopen(fname, "w")
    if cfile == NULL:
        return
    # write the string verbatim; passing it as the format string would
    # misinterpret any stray '%' in the data
    fprintf(cfile, "%s", line)
    fclose(cfile)
    return []
Here are some benchmarks I took:
def df_to_csv_cython(df):
    content = df_to_string(df)
    c_write_to_file('test_out_c.txt', content)

%timeit df_to_csv_cython(df)
1 loops, best of 3: 1.67 s per loop

%timeit df.to_csv('test_csv_out_pandas.csv')
1 loops, best of 3: 416 ms per loop
So what needs to be improved is the DataFrame-to-string conversion, but I guess you knew that already; I just had to dig down to the actual bottleneck. Can you give any hints on how I should proceed?
Comment From: michaelaye
Forgot the benchmarking of the cython write; it's blazing fast once the content string is available:
%timeit c_write_to_file('test_out_cython.txt', content)
100 loops, best of 3: 12.2 ms per loop
Comment From: jreback
no
you just need to change lib.write_csv_rows to a new version (it's a bit trickier because you have to decide a bit higher up in the formatter to use the fast path so you don't create the csv writer at all - but for a proof of concept that doesn't matter)
all the conversions and such already happen by then
take the same data that is passed to write_csv_rows and just write a new version that takes that data and actually writes it to the file handle
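as a pure-python illustration of that shape (names and signature here are illustrative only, not the actual lib.pyx code):

# illustrative proof of concept only -- not the actual lib.pyx signature
# data: list of already-stringified columns (as passed to write_csv_rows)
# ix: stringified index values; handle: an open file object
def write_csv_rows_fast(data, ix, handle):
    ncols = len(data)
    lines = []
    for j in range(len(ix)):
        row = [ix[j]] + [data[i][j] for i in range(ncols)]
        lines.append(",".join(row))
    handle.write("\n".join(lines) + "\n")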
Comment From: jreback
yep
in fact you can reuse write_to_csv almost entirely
it's the write call that is slow
(because I think it does a lot of conversions and such that don't matter for a plain vanilla csv)
Comment From: michaelaye
Which 'write_to_csv' do you mean here? I think I understand now that I have to reimplement lib.pyx's write_csv_rows and make a write_csv_rows_fast there. But from the line profile of CSVFormatter._save_chunk() below, I guess I have to change _save_chunk() as well:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1279 def _save_chunk(self, start_i, end_i):
1280
1281 4 11 2.8 0.0 data_index = self.data_index
1282
1283 # create the data for a chunk
1284 4 8 2.0 0.0 slicer = slice(start_i, end_i)
1285 8 20 2.5 0.0 for i in range(len(self.blocks)):
1286 4 4 1.0 0.0 b = self.blocks[i]
1287 4 5 1.2 0.0 d = b.to_native_types(slicer=slicer, na_rep=self.na_rep,
1288 4 4 1.0 0.0 float_format=self.float_format,
1289 4 59994 14998.5 13.6 date_format=self.date_format)
1290
1291 124 296 2.4 0.1 for i, item in enumerate(b.items):
1292
1293 # self.data is a preallocated list
1294 120 3337 27.8 0.8 self.data[self.column_map[b][i]] = d[i]
1295
1296 4 9 2.2 0.0 ix = data_index.to_native_types(slicer=slicer, na_rep=self.na_rep,
1297 4 4 1.0 0.0 float_format=self.float_format,
1298 4 1010 252.5 0.2 date_format=self.date_format)
1299
1300 4 377245 94311.2 85.4 lib.write_csv_rows(self.data, ix, self.nlevels, self.cols, self.writer)
At least the b.to_native_types() call takes quite some time.
Comment From: jreback
no, most of the slowness is in write_csv_rows
since _save_chunk calls it, that time is included in _save_chunk's time as well (a caller's time is its own time plus the sum of its callees')
Comment From: michaelaye
Well, not even talking about absolute time, isn't it correct that _save_chunk spends 13% of its time in the b.to_native_types() call? It's almost a second-order effect (85/13 = 6.5) but not completely negligible. Okay, so I assume in the comment above you meant that I can reuse lib.write_csv_rows() and kick out what I don't need for a fast, simple write-out?
Funnily enough, I measured that a python write of a long string is actually faster than the cython write-out of a long string, I guess due to cython overhead:
%timeit c_write_to_file('test_out_cython.txt', content)
1 loops, best of 3: 144 ms per loop
%timeit python_write_to_file('test_out_python.txt', content)
10 loops, best of 3: 67 ms per loop
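(python_write_to_file is presumably just the obvious thing, roughly:)

# presumably the plain-python counterpart looks roughly like this
def python_write_to_file(filename, content):
    with open(filename, 'w') as f:
        f.write(content)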
PS: mentioning myself so I can find this issue more easily: @michaelaye
Comment From: jreback
I would simply copy lib.write_csv_rows and make a fast version
converting to native types is necessary for proper dtype handling
worry about that later
always optimize the biggest time sink first
Comment From: wesm
We will need to tackle this in the course of working on libpandas. I suggest we create a new set of issues around writing to CSV more optimally once we are ready to do that.
Comment From: swt2c
I know this is closed, but I would still like to work toward improving to_csv in Pandas 0.x.
I have been doing some profiling, and so far I've found that the biggest (by far) CPU bottleneck in write_csv_rows() is this list assignment:
row[1 + i] = data[i][j]
Per individual call it doesn't take long, but since it gets executed for every cell, it adds up. It's not immediately obvious how to improve this, though, as Pandas seems to store its data in columns, whereas we need to get the data out in rows.
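One direction to try (a sketch only, not measured): let zip() do the column-to-row transpose in a single pass instead of assigning cell by cell:

# sketch: transpose the column lists to rows in bulk instead of per-cell assignment
# data: list of stringified columns; ix: stringified index values; writer: a csv.writer
def write_rows_bulk(data, ix, writer):
    for index_val, row in zip(ix, zip(*data)):
        writer.writerow((index_val,) + row)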
Comment From: swt2c
Also, to_csv is much slower when writing a sparse DataFrame. In that case, the performance is dominated by _to_native_types().
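Easy to reproduce with something like this (shape and density are arbitrary; to_sparse() as in 0.x):

# rough reproduction of the sparse slowdown -- shape and density are arbitrary
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 100))
df[df < 1.0] = np.nan      # mostly missing, so the sparse frame is genuinely sparse
sdf = df.to_sparse()

%timeit df.to_csv('dense.csv')
%timeit sdf.to_csv('sparse.csv')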