Hello,
I think it would make sense to make `read_csv()`, `read_table()`, and `read_fwf()` able to read multiple files with the same structure. It might be tricky to read multiple files as one input, especially when all of them have header line(s) and when you want to use the following parameters: `skiprows`, `skipfooter`, and `nrows`.
The logic for `skiprows`, `skipfooter`, and `nrows` is already implemented, so IMO it shouldn't be very difficult. The `header` parameter (if a header exists) must be read/parsed only from the first file.
Of course one can always do something like:

```python
df = pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)
```

but it's not very efficient when working with big files.
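A minimal sketch of the more efficient path the request hints at: concatenate the raw text of the files, keeping the header line only from the first one, so the parser runs a single time instead of once per file. The helper name `read_csv_multi` is hypothetical, not a pandas API:

```python
import io

import pandas as pd


def read_csv_multi(paths, **kwargs):
    """Hypothetical helper: read several same-structure CSV files as one.

    Concatenates the raw text into a single buffer, dropping the
    repeated header line from every file after the first, then parses
    the buffer once with pd.read_csv.
    """
    buf = io.StringIO()
    for i, path in enumerate(paths):
        with open(path) as f:
            if i > 0:
                f.readline()  # skip the repeated header line
            buf.write(f.read())
    buf.seek(0)
    return pd.read_csv(buf, **kwargs)
```

This avoids building one intermediate DataFrame per file, though it still holds all the text in memory; a real implementation inside pandas could stream instead.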
In recent days there have been plenty of similar questions on stackoverflow.com asking how to merge CSV files with the same structure.
Thank you!
Comment From: jreback
None of the `read_*` functions work on a glob, but it is pretty trivial to do so. You could add this to the cookbook docs; I will accept a PR for that. I suppose in theory we could wrap around this, but it would be a bit of an effort.
```python
In [6]: pd.options.display.max_rows=10

In [7]: DataFrame(np.random.randn(100000,2)).to_csv('file1.csv')

In [8]: DataFrame(np.random.randn(100000,2)).to_csv('file2.csv')

In [9]: pd.concat([ pd.read_csv(f) for f in glob.glob('file*.csv') ], ignore_index=True)
Out[9]:
        Unnamed: 0         0         1
0                0  0.431748 -0.558788
1                1  0.849501 -0.174770
2                2 -1.493231  0.160548
3                3  2.021627 -0.365592
4                4  1.640202 -0.595996
...            ...       ...       ...
199995       99995  0.612665  0.599660
199996       99996 -0.644949 -1.029477
199997       99997  0.615948 -0.666595
199998       99998  0.450607 -0.275538
199999       99999  1.181064 -0.051599

[200000 rows x 3 columns]

In [10]: pd.concat([ pd.read_csv(f, index_col=0) for f in glob.glob('file*.csv') ], ignore_index=True)
Out[10]:
               0         1
0       0.431748 -0.558788
1       0.849501 -0.174770
2      -1.493231  0.160548
3       2.021627 -0.365592
4       1.640202 -0.595996
...          ...       ...
199995  0.612665  0.599660
199996 -0.644949 -1.029477
199997  0.615948 -0.666595
199998  0.450607 -0.275538
199999  1.181064 -0.051599

[200000 rows x 2 columns]
```
Comment From: TomAugspurger
This was a dupe of https://github.com/pandas-dev/pandas/issues/15904 and closed by https://github.com/pandas-dev/pandas/pull/16166
Comment From: anmyachev
@jreback, @TomAugspurger Hello, I could add support for reading from multiple files, whether as a file list or via wildcard support. Although this is syntactic sugar, maintaining it should not take a lot of effort, and it would save users from boilerplate code. What do you think?
Comment From: jreback
It would be a good feature; however:
- error handling would have to be very obvious
- we would likely want to extend this to the other `read_*` functions, so ideally it's a somewhat generic solution

Please open a new issue for discussion.
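The two requirements above could be sketched as a generic wrapper that accepts either a glob pattern or an explicit list of paths and reports exactly which file failed. The function `read_many` is a hypothetical illustration, not a proposed pandas API:

```python
import glob

import pandas as pd


def read_many(reader, source, **kwargs):
    """Hypothetical sketch: apply any pandas reader to many files.

    `reader` is any single-file reader (pd.read_csv, pd.read_table, ...);
    `source` is either a glob pattern string or an iterable of paths.
    """
    if isinstance(source, str):
        paths = sorted(glob.glob(source))
    else:
        paths = list(source)
    if not paths:
        # make the empty-match case an explicit, obvious error
        raise FileNotFoundError(f"no files matched {source!r}")

    frames = []
    for path in paths:
        try:
            frames.append(reader(path, **kwargs))
        except Exception as err:
            # name the offending file, since the underlying error alone
            # is ambiguous when many inputs are involved
            raise RuntimeError(f"failed to read {path}") from err
    return pd.concat(frames, ignore_index=True)
```

Passing the reader in as an argument is what keeps the solution generic across the `read_*` family, at the cost of still concatenating per-file DataFrames rather than parsing a single stream.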
Comment From: anmyachev
https://github.com/pandas-dev/pandas/issues/39435