Currently the read_sas method doesn't support reading SAS7BDAT files from AWS S3 the way read_csv does. Can this be added?

Comment From: jreback

Actually, this should already work; have you tried it?

It goes through the same file-path handling code that read_csv uses.

Comment From: ankitdhingra

I have tried it, but unfortunately it doesn't work. I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-009ad5a2cf9f> in <module>()
      2 pd.show_versions()
      3 
----> 4 pd.read_sas("s3:/bucket-url/temp.sas7bdat")

/home/anaconda2/lib/python2.7/site-packages/pandas/io/sas/sasreader.py in read_sas(filepath_or_buffer, format, index, encoding, chunksize, iterator)
     52         reader = SAS7BDATReader(filepath_or_buffer, index=index,
     53                                 encoding=encoding,
---> 54                                 chunksize=chunksize)
     55     else:
     56         raise ValueError('unknown SAS format')

/home/anaconda2/lib/python2.7/site-packages/pandas/io/sas/sas7bdat.py in __init__(self, path_or_buf, index, convert_dates, blank_missing, chunksize, encoding, convert_text, convert_header_text)
     94             self._path_or_buf = open(self._path_or_buf, 'rb')
     95 
---> 96         self._get_properties()
     97         self._parse_metadata()
     98 

/home/anaconda2/lib/python2.7/site-packages/pandas/io/sas/sas7bdat.py in _get_properties(self)
    100 
    101         # Check magic number
--> 102         self._path_or_buf.seek(0)
    103         self._cached_page = self._path_or_buf.read(288)
    104         if self._cached_page[0:len(const.magic)] != const.magic:

AttributeError: 'BotoFileLikeReader' object has no attribute 'seek'

The documentation for pandas.read_sas also doesn't mention the ability to read from S3.

Output from pd.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.1.13-18.26.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

Comment From: jreback

Hmm, this may require something more, then.

cc @kshedden cc @TomAugspurger; xref https://github.com/pydata/pandas/pull/13137

Comment From: TomAugspurger

Switching over to s3fs should handle this, as it does implement seek.

Although I think there's a second bug: read_sas only accepts file paths, not buffers.

In [31]: with open('test1.sas7bdat', 'rb') as f:
    ...:     pd.read_sas(f)
    ...:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-31-85bbca72eb65> in <module>()
      1 with open('test1.sas7bdat') as f:
----> 2     pd.read_sas(f)
      3

/Users/tom.augspurger/Envs/surveyer/lib/python3.5/site-packages/pandas/io/sas/sasreader.py in read_sas(filepath_or_buffer, format, index, encoding, chunksize, iterator)
     43             pass
     44
---> 45     if format.lower() == 'xport':
     46         from pandas.io.sas.sas_xport import XportReader
     47         reader = XportReader(filepath_or_buffer, index=index,

AttributeError: 'NoneType' object has no attribute 'lower'

Will make an issue later.

Working around that and reading from S3:

In [33]: from pandas.io.sas.sas7bdat import SAS7BDATReader
In [34]: fs = s3fs.S3FileSystem(anon=False)

In [35]: f = fs.open('s3://<bucket>/test.sas7bdat')

In [36]: SAS7BDATReader(f).read()
Out[36]:
   Column1       Column2  Column3    Column4  Column5       Column6  Column7  \
0    0.636       b'pear'     84.0 1965-12-10    0.103      b'apple'     20.0
1    0.283        b'dog'     49.0 1977-03-07    0.398       b'pear'     50.0
2    0.452       b'pear'     35.0 1983-08-15    0.117       b'pear'     70.0
3    0.557        b'dog'     29.0 1974-06-28    0.640       b'pear'     34.0
4    0.138           NaN     55.0 1965-03-18    0.583  b'crocodile'     34.0
5    0.948        b'dog'     33.0 1984-07-15    0.691       b'pear'      NaN
6    0.162  b'crocodile'     17.0 1982-06-03    0.002       b'pear'     30.0
7    0.148  b'crocodile'     37.0 1964-10-06    0.411      b'apple'     23.0
8      NaN       b'pear'     15.0 1970-01-27    0.102       b'pear'      1.0
9    0.663       b'pear'      NaN 1981-03-06    0.086      b'apple'     80.0

Comment From: ankitdhingra

@TomAugspurger Are you talking about this s3fs? It is based on boto3, whereas pandas uses boto2. Will that be fine?

Comment From: TomAugspurger

@ankitdhingra yeah, sorry. See https://github.com/pydata/pandas/issues/11915 for context. We were having issues with boto2 and are probably going to depend on s3fs for S3-related stuff in the future. That'll make supporting things like this (and writing S3 files) easier. I just haven't had time to finish that pull request.

In the meantime, getting a file-like object from s3fs.open('s3://bucket/key') and passing that to any of the read_* methods should work (once we fix that other bug I mentioned).

Comment From: mroeschke

I think this fully works with fsspec support now, so closing.