Hello,
I think it would make sense to make `read_csv()`, `read_table()`, and `read_fwf()` able to read multiple files with the same structure. It might be tricky to read multiple files as one input, especially when all of them have header line(s) and when you want to use the following parameters: `skiprows`, `skipfooter`, and `nrows`.
The logic for `skiprows`, `skipfooter`, and `nrows` is already implemented, so IMO it shouldn't be very difficult. The `header` parameter (if a header exists) must be read/parsed only from the first file.
Of course one can always do something like:

```python
df = pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)
```

but it's not very efficient when working with big files.
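A minimal sketch of the more efficient path the request hints at: concatenate the raw text of the files, keeping the header line only from the first one, so the parser runs a single time instead of once per file. The helper name `read_csv_multi` is hypothetical, not a pandas API:

```python
import io

import pandas as pd


def read_csv_multi(paths, **kwargs):
    """Hypothetical helper: read several same-structure CSV files as one.

    Concatenates the raw text into a single buffer, dropping the
    repeated header line from every file after the first, then parses
    the buffer once with pd.read_csv.
    """
    buf = io.StringIO()
    for i, path in enumerate(paths):
        with open(path) as f:
            if i > 0:
                f.readline()  # skip the repeated header line
            buf.write(f.read())
    buf.seek(0)
    return pd.read_csv(buf, **kwargs)
```

This avoids building one intermediate DataFrame per file, though it still holds all the text in memory; a real implementation inside pandas could stream instead.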
In recent days there have been plenty of similar questions on stackoverflow.com asking how to merge CSV files with the same structure.
Thank you!
Comment From: jreback
None of the `read_*` functions work on a glob, but it is pretty trivial to do so. You could add this to the cookbook docs; I will accept a PR for that. I suppose in theory we could wrap around this, but it would be a bit of an effort.
```python
In [6]: pd.options.display.max_rows=10

In [7]: DataFrame(np.random.randn(100000,2)).to_csv('file1.csv')

In [8]: DataFrame(np.random.randn(100000,2)).to_csv('file2.csv')

In [9]: pd.concat([ pd.read_csv(f) for f in glob.glob('file*.csv') ], ignore_index=True)
Out[9]:
        Unnamed: 0         0         1
0                0  0.431748 -0.558788
1                1  0.849501 -0.174770
2                2 -1.493231  0.160548
3                3  2.021627 -0.365592
4                4  1.640202 -0.595996
...            ...       ...       ...
199995       99995  0.612665  0.599660
199996       99996 -0.644949 -1.029477
199997       99997  0.615948 -0.666595
199998       99998  0.450607 -0.275538
199999       99999  1.181064 -0.051599

[200000 rows x 3 columns]

In [10]: pd.concat([ pd.read_csv(f, index_col=0) for f in glob.glob('file*.csv') ], ignore_index=True)
Out[10]:
               0         1
0       0.431748 -0.558788
1       0.849501 -0.174770
2      -1.493231  0.160548
3       2.021627 -0.365592
4       1.640202 -0.595996
...          ...       ...
199995  0.612665  0.599660
199996 -0.644949 -1.029477
199997  0.615948 -0.666595
199998  0.450607 -0.275538
199999  1.181064 -0.051599

[200000 rows x 2 columns]
```
Comment From: TomAugspurger
This was a dupe of https://github.com/pandas-dev/pandas/issues/15904 and closed by https://github.com/pandas-dev/pandas/pull/16166
Comment From: anmyachev
@jreback, @TomAugspurger Hello, I could add support for reading from multiple files, whether as a file list or via wildcard support. Although this is syntactic sugar, maintaining it should not take a lot of effort, and it would save users from boilerplate code. What do you think?
Comment From: jreback
It would be a good feature; however:
- error handling would have to be very obvious
- we would likely want to extend this to the other `read_*` functions, so ideally it's a somewhat generic solution

Please open a new issue for discussion.
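The two requirements above could be sketched as a generic wrapper that accepts either a glob pattern or an explicit list of paths and reports exactly which file failed. The function `read_many` is a hypothetical illustration, not a proposed pandas API:

```python
import glob

import pandas as pd


def read_many(reader, source, **kwargs):
    """Hypothetical sketch: apply any pandas reader to many files.

    `reader` is any single-file reader (pd.read_csv, pd.read_table, ...);
    `source` is either a glob pattern string or an iterable of paths.
    """
    if isinstance(source, str):
        paths = sorted(glob.glob(source))
    else:
        paths = list(source)
    if not paths:
        # make the empty-match case an explicit, obvious error
        raise FileNotFoundError(f"no files matched {source!r}")

    frames = []
    for path in paths:
        try:
            frames.append(reader(path, **kwargs))
        except Exception as err:
            # name the offending file, since the underlying error alone
            # is ambiguous when many inputs are involved
            raise RuntimeError(f"failed to read {path}") from err
    return pd.concat(frames, ignore_index=True)
```

Passing the reader in as an argument is what keeps the solution generic across the `read_*` family, at the cost of still concatenating per-file DataFrames rather than parsing a single stream.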
Comment From: anmyachev
https://github.com/pandas-dev/pandas/issues/39435