Reading a gzipped CSV from a private S3 bucket fails with AttributeError: 'StreamingBody' object has no attribute 'tell'

import pandas as pd
import boto3

bucket_name = 'my-bucket'           # placeholder: name of the private bucket
objkey = 'path/to/data.csv.gz'      # placeholder: key of the gzipped CSV

session = boto3.Session()           # credentials resolved from the environment
s3client = session.client('s3')
obj = s3client.get_object(Bucket=bucket_name, Key=objkey)
# obj['Body'] is a botocore StreamingBody, not a seekable file object
df = pd.read_csv(obj['Body'], compression='gzip', nrows=5, engine='python')

Error

 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f
   return _read(filepath_or_buffer, kwds)
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 315, in _read
   parser = TextFileReader(filepath_or_buffer, **kwds)
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 645, in __init__
   self._make_engine(self.engine)
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 805, in _make_engine
   self._engine = klass(self.f, **self.options)
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1608, in __init__
   self.columns, self.num_original_columns = self._infer_columns()
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1823, in _infer_columns
   line = self._buffered_line()
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1975, in _buffered_line
   return self._next_line()
 File "/Users/username/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 2006, in _next_line
   orig_line = next(self.data)
 File "/Users/username/anaconda/lib/python2.7/gzip.py", line 464, in readline
   c = self.read(readsize)
 File "/Users/username/anaconda/lib/python2.7/gzip.py", line 268, in read
   self._read(readsize)
 File "/Users/username/anaconda/lib/python2.7/gzip.py", line 295, in _read
   pos = self.fileobj.tell()   # Save current position
AttributeError: 'StreamingBody' object has no attribute 'tell'
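
The failure comes from Python's gzip module, which calls .tell() (and then .seek()) on the underlying file object while decompressing, and boto3's StreamingBody implements neither. A minimal workaround sketch, assuming the object is small enough to buffer in memory (the bucket and key below are placeholders), is to read the body into an io.BytesIO, which is fully seekable:

import gzip
import io

import boto3
import pandas as pd

s3client = boto3.Session().client('s3')
obj = s3client.get_object(Bucket='my-bucket', Key='path/to/data.csv.gz')  # placeholder names

buf = io.BytesIO(obj['Body'].read())   # BytesIO supports tell()/seek()
df = pd.read_csv(gzip.GzipFile(fileobj=buf), nrows=5)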

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 15.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 20.3
Cython: 0.23.4
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.4.5
patsy: 0.4.0
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

Comment From: jreback

See #13137. I don't think this (boto3) is fully compatible with pandas. @TomAugspurger

Comment From: TomAugspurger

@mandarup, our current implementation isn't file-like enough for gzip decompression: the gzip module needs methods like .tell() on the file object, which boto3's StreamingBody doesn't provide. As a workaround for now you can use https://github.com/dask/s3fs, which implements more of the file interface (including .tell()) and is what pandas might use in the future. For now, it'd be:

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()   # picks up AWS credentials the same way boto does

with fs.open('s3://bucket_name/objkey') as f:   # opened in binary mode by default
    df = pd.read_csv(f, compression='gzip', nrows=5)
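
If you'd rather not rely on pandas' compression handling for an already-open file handle, the decompression can also be done explicitly; this sketch (same placeholder path as above) wraps the seekable s3fs file in gzip.GzipFile:

import gzip

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()

# s3fs file objects implement tell()/seek(), so gzip can read from them directly
with fs.open('s3://bucket_name/objkey') as f:
    df = pd.read_csv(gzip.GzipFile(fileobj=f), nrows=5)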

Going to close this for now, as it'll be taken care of with #13137.