Currently the read_sas method doesn't support reading SAS7BDAT files from AWS S3 the way read_csv does.
Can it be added?
Comment From: jreback
Actually, this should already work. Have you tried it?
It goes through the same file-path handling as read_csv.
Comment From: ankitdhingra
I have tried it, but unfortunately it doesn't work. I get the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-3-009ad5a2cf9f> in <module>()
2 pd.show_versions()
3
----> 4 pd.read_sas("s3:/bucket-url/temp.sas7bdat")
/home/anaconda2/lib/python2.7/site-packages/pandas/io/sas/sasreader.py in read_sas(filepath_or_buffer, format, index, encoding, chunksize, iterator)
52 reader = SAS7BDATReader(filepath_or_buffer, index=index,
53 encoding=encoding,
---> 54 chunksize=chunksize)
55 else:
56 raise ValueError('unknown SAS format')
/home/anaconda2/lib/python2.7/site-packages/pandas/io/sas/sas7bdat.py in __init__(self, path_or_buf, index, convert_dates, blank_missing, chunksize, encoding, convert_text, convert_header_text)
94 self._path_or_buf = open(self._path_or_buf, 'rb')
95
---> 96 self._get_properties()
97 self._parse_metadata()
98
/home/anaconda2/lib/python2.7/site-packages/pandas/io/sas/sas7bdat.py in _get_properties(self)
100
101 # Check magic number
--> 102 self._path_or_buf.seek(0)
103 self._cached_page = self._path_or_buf.read(288)
104 if self._cached_page[0:len(const.magic)] != const.magic:
AttributeError: 'BotoFileLikeReader' object has no attribute 'seek'
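As a stopgap, a non-seekable stream like this can be buffered into memory to regain seek support. A minimal sketch, using a stand-in class in place of BotoFileLikeReader (which needs live AWS access):

```python
import io

class NonSeekableReader:
    """Stand-in for a streaming reader (like BotoFileLikeReader)
    that exposes read() but not seek()."""
    def __init__(self, data):
        self._data = data
        self._pos = 0

    def read(self, n=-1):
        if n < 0:
            n = len(self._data) - self._pos
        chunk = self._data[self._pos:self._pos + n]
        self._pos += n
        return chunk

stream = NonSeekableReader(b"fake sas7bdat bytes")
print(hasattr(stream, "seek"))   # False -- exactly what read_sas trips over

# Workaround: slurp the stream into an in-memory, seekable buffer
seekable = io.BytesIO(stream.read())
seekable.seek(0)
print(seekable.read(4))          # b'fake'
```

The cost is that the whole file is pulled into memory up front, which matters for large SAS datasets but unblocks the seek()-based parsing.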
The pandas.read_sas documentation doesn't mention the ability to read from S3 either.
Output from pd.show_versions():
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.1.13-18.26.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None
Comment From: jreback
Hmm, this may require something more, then.
cc @kshedden cc @TomAugspurger; xref https://github.com/pydata/pandas/pull/13137
Comment From: TomAugspurger
Switching over to s3fs should handle this, as it does implement seek.
Although I think there's a second bug: read_sas only accepts file paths, not buffers.
In [31]: with open('test1.sas7bdat', 'rb') as f:
...: pd.read_sas(f)
...:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-31-85bbca72eb65> in <module>()
1 with open('test1.sas7bdat') as f:
----> 2 pd.read_sas(f)
3
/Users/tom.augspurger/Envs/surveyer/lib/python3.5/site-packages/pandas/io/sas/sasreader.py in read_sas(filepath_or_buffer, format, index, encoding, chunksize, iterator)
43 pass
44
---> 45 if format.lower() == 'xport':
46 from pandas.io.sas.sas_xport import XportReader
47 reader = XportReader(filepath_or_buffer, index=index,
AttributeError: 'NoneType' object has no attribute 'lower'
will make an issue later.
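For illustration, that format-inference step can be sketched roughly like this (a hypothetical simplification, not the actual pandas source): only string paths are inspected for an extension, so a buffer leaves format as None, and format.lower() then raises the AttributeError above.

```python
import io
import os

def infer_sas_format(filepath_or_buffer):
    """Sketch of read_sas's extension-based format inference.
    Only string paths are inspected, so buffers fall through
    with fmt=None, which later crashes on fmt.lower()."""
    fmt = None
    if isinstance(filepath_or_buffer, str):
        ext = os.path.splitext(filepath_or_buffer)[1].lower()
        if ext == ".xpt":
            fmt = "xport"
        elif ext == ".sas7bdat":
            fmt = "sas7bdat"
    return fmt

print(infer_sas_format("test1.sas7bdat"))   # sas7bdat
print(infer_sas_format(io.BytesIO(b"")))    # None
```

Passing format= explicitly (or constructing the reader class directly, as below) sidesteps the inference entirely.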
Working around that, and reading from s3,
In [33]: from pandas.io.sas.sas7bdat import SAS7BDATReader
In [34]: fs = s3fs.S3FileSystem(anon=False)
In [35]: f = fs.open('s3://<bucket>/test.sas7bdat')
In [36]: SAS7BDATReader(f).read()
Out[36]:
Column1 Column2 Column3 Column4 Column5 Column6 Column7 \
0 0.636 b'pear' 84.0 1965-12-10 0.103 b'apple' 20.0
1 0.283 b'dog' 49.0 1977-03-07 0.398 b'pear' 50.0
2 0.452 b'pear' 35.0 1983-08-15 0.117 b'pear' 70.0
3 0.557 b'dog' 29.0 1974-06-28 0.640 b'pear' 34.0
4 0.138 NaN 55.0 1965-03-18 0.583 b'crocodile' 34.0
5 0.948 b'dog' 33.0 1984-07-15 0.691 b'pear' NaN
6 0.162 b'crocodile' 17.0 1982-06-03 0.002 b'pear' 30.0
7 0.148 b'crocodile' 37.0 1964-10-06 0.411 b'apple' 23.0
8 NaN b'pear' 15.0 1970-01-27 0.102 b'pear' 1.0
9 0.663 b'pear' NaN 1981-03-06 0.086 b'apple' 80.0
Comment From: ankitdhingra
@TomAugspurger Are you talking about this s3fs? It is based on boto3, whereas pandas uses boto2. Will that be fine?
Comment From: TomAugspurger
@ankitdhingra yeah, sorry. See https://github.com/pydata/pandas/issues/11915 for context. We were having issues with boto2 and are probably going to depend on s3fs for S3-related functionality in the future. That'll make supporting things like this (and writing S3 files) easier. I just haven't had time to finish that pull request.
In the meantime, getting a file buffer object from s3fs.open('s3://bucket/key') and passing it to any of the read_* methods should work (once we fix that other bug I mentioned).
Comment From: mroeschke
I think this fully works with fsspec support now, so closing.
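With fsspec support in current pandas, the original request reduces to passing the S3 URL directly, assuming s3fs is installed; the bucket and key below are hypothetical placeholders:

```python
bucket, key = "my-bucket", "temp.sas7bdat"   # hypothetical names
url = f"s3://{bucket}/{key}"
print(url)   # s3://my-bucket/temp.sas7bdat

# With s3fs installed, pandas routes this URL through fsspec;
# commented out here because it needs real AWS credentials:
# import pandas as pd
# df = pd.read_sas(url)
```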