Pandas pd.read_csv generating different pd.DataFrames with identical dataset

Code Sample, a copy-pastable example if possible



For loading the CSV file:

loadedFile = pd.read_csv(fileLocation)


For getting the "description" of the pd.DataFrame

loadedFile.describe()


For outputting the pd.DataFrames

    # open DEBUG file
    file_debug_setup = open('/'XXX'/'XXX'/DEBUG_selectionPreparation_2.txt', "w")

    # output the entire DataFrame ...
    with pd.option_context('display.max_rows', None, 'display.max_columns', None):
        file_debug_setup.write("" + str(original_dataset) + "\n\n")

    file_debug_setup.close()


*where 'original_dataset' contains the present pd.DataFrame from 'loadedFile'
This code is run twice, to compare the pd.DataFrames generated from each separate individual run

Problem description

The code above is run twice (or any number of times greater than one), and each run generates a pd.DataFrame using the same code. The problem is that a slightly different pd.DataFrame is generated each time, EVEN THOUGH THE DATASET LOADED IS IDENTICAL AND THE SAME CODE IS USED TO GENERATE THE PD.DATAFRAME.

When loading the identical dataset, the program returns a pd.DataFrame that contains slightly different values. When testing the two pd.DataFrames, the ".describe()" function returns slightly different values for the quartiles of each attribute. Further supporting the idea that the generated datasets differ, a comparison of the two text outputs (written using the code above) with the linux/unix 'diff' command reports many differences.

The difference is minimal, usually in the second decimal place or smaller. However small this may be, it is large WITHIN A STATISTICAL CONTEXT. For example, when selecting a model based on MSE, the rounding of these attributes significantly alters the selection process, leading to a different selection of attributes when analyzing an IDENTICAL dataset using IDENTICAL code.

Expected Output

The pd.DataFrame values should always be identical when loading IDENTICAL DATASETS with IDENTICAL CODE

Output of `pd.show_versions()`

In [17]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Comment From: TomAugspurger

Can you make a reproducible example? I can't run that since '/'XXX'/'XXX'/DEBUG_selectionPreparation_2.txt' doesn't exist on my computer.

For examples, it's usually best to make a string with the values and wrap that in a StringIO object, which can be passed to read_csv.
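
A minimal sketch of what such a self-contained example could look like. The column names and values are made up for illustration and are not taken from the dataset in question. The idea is only that the same text is parsed twice and the two results are compared in memory, without going through a text file:

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV content; any small sample that shows the problem would do.
csv_text = """a,b,c
1,0.1234567891,10.5
2,0.9876543219,20.25
3,0.5555555555,30.125
"""

# Parse the same text twice, each time from a fresh StringIO buffer.
first = pd.read_csv(StringIO(csv_text))
second = pd.read_csv(StringIO(csv_text))

# Compare the parsed values directly rather than their printed form.
same_values = first.equals(second)
same_summary = first.describe().equals(second.describe())
print(same_values, same_summary)
```

If either comparison came back False for some input, that input would be a reproducible example that anyone could run.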

Comment From: st12yker

The code: "'/'XXX'/'XXX'/DEBUG_selectionPreparation_2.txt'" is for demonstration purposes, to show how I outputted the pd.DataFrame to a text file. You can use any code to write the pd.DataFrames to a file.

As stated in the text, I outputted two pd.DataFrames (loaded them separately, one after the other).

I used the text file so that I could use the linux/unix command "diff" to identify differences between the two dataframes (which should have no differences, since I am loading the IDENTICAL dataset). As stated in the text, the "diff" linux/unix command identified a lot of differences. See the text for additional comments.

Thank you.

Sincerely, Julian Hershowitz

----- Original Message ----- From: "Tom Augspurger" notifications@github.com To: "pandas-dev/pandas" pandas@noreply.github.com Cc: "st12yker" julian@webintensive.com, "Author" author@noreply.github.com Sent: Tuesday, February 13, 2018 3:45:32 PM Subject: Re: [pandas-dev/pandas] pd.read_csv generating different pd.DataFrames with identical dataset (#19685)


-- You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub: https://github.com/pandas-dev/pandas/issues/19685#issuecomment-365398205

Comment From: WillAyd

You should be using to_csv instead of writing the string representation of the DataFrame
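
A rough sketch of the distinction, using made-up values: the string form of a DataFrame is rounded for display, so two different stored values can print identically, whereas to_csv writes the stored floats at full precision by default. Passing float_precision="round_trip" when reading back is a choice made for this sketch, so that the parsed values match the written text exactly.

```python
from io import StringIO

import pandas as pd

# Made-up values that differ only beyond the sixth decimal place.
df = pd.DataFrame({"x": [0.1234567891, 0.1234567892]})

# str() gives the display form, which rounds floats for readability.
shown = str(df).splitlines()[1:]  # drop the header line
shown_values = [line.split()[-1] for line in shown]

# to_csv() writes the stored values; reading them back recovers the frame.
csv_text = df.to_csv(index=False)
round_tripped = pd.read_csv(StringIO(csv_text), float_precision="round_trip")

print(shown_values)  # display form of the two values
print(csv_text)      # full-precision text
print(df.equals(round_tripped))
```

Comparing two to_csv outputs with diff, or comparing the frames with DataFrame.equals, would therefore say more about the loaded data than comparing two printed representations.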

Comment From: chris-b1

Agree with @WillAyd diagnosis, please re-open with a reproducible example if that's not the issue

Comment From: st12yker

Hello. You are all ( @TomAugspurger @WillAyd @chris-b1 ) focusing on the wrong part of this discussion. Exporting the dataframe to a text file is not where I am having issues. (I responded to @TomAugspurger regarding outputting the pandas dataframe since that was relevant to his response, although his response was off topic.)

The problem with pandas is reading in a csv file using pandas.read_csv() - http://tinyurl.com/y8odcc9y - and creating the appropriate pd.DataFrame. See original post for details.

(@WillAyd using "to_csv" is relevant to outputting a pd.DataFrame to a csv file, NOT inputting a csv file and creating a pd.DataFrame - see pandas documentation at http://tinyurl.com/y8twzlu5).

In summary, you have not read my post closely. You are focusing on @TomAugspurger's off-topic comment.

Comment From: TomAugspurger

@st12yker you might want to have a look at https://stackoverflow.com/help/mcve

We still don't have code to run so we can't verify whether it's a bug or not.

Comment From: st12yker

The code - as mentioned in the post - is simply

    import pandas as pd
    pd.read_csv(fileLocation)

The code to output the contents/statistics is not relevant to the issue.

Comment From: chris-b1

@st12yker - you're certainly welcome to investigate further, and if you find a problem or even a hint of what the problem is, a new issue or PR is more than welcome, but without example data there's nothing more we can do.

Comment From: st12yker

Here is the Kaggle page with a link to the dataset I am using: https://www.kaggle.com/tmdb/tmdb-movie-metadata
