Code Sample, a copy-pastable example if possible
For loading the CSV file:
loadedFile = pd.read_csv(fileLocation)
For getting the "description" of the pd.DataFrame
loadedFile.describe()
For outputting the pd.DataFrames
# open DEBUG File
file_debug_setup = open('/'XXX'/'XXX'/DEBUG_selectionPreparation_2.txt', "w")
## outputs entire DataFrame ...
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    file_debug_setup.write("" + str(original_dataset) + "\n\n")
file_debug_setup.close()
*where 'original_dataset' contains the present pd.DataFrame from 'loadedFile'
This code is run twice, to compare the pd.DataFrames generated by each individual run
Problem description
The code above is run twice (or any number of times greater than one), and each run generates a pd.DataFrame using the same programming code. The problem is that a slightly different pd.DataFrame is being generated, EVEN THOUGH THE DATASET LOADED IS IDENTICAL EACH TIME - THE SAME CODE IS BEING USED TO GENERATE THE PD.DATAFRAME
When loading the identical dataset, the program returns a pd.DataFrame that contains slightly different values. When testing the two pd.DataFrames, the ".describe()" function returns slightly different values for the quartiles of each attribute. Further affirming the idea that the datasets generated are different, a comparison of the datasets (outputted using the code above) with the linux/unix 'diff' command reports many differences.
The difference is minimal - usually in the hundredths place or smaller. However small this may be, it IS, WITHIN A STATISTICAL CONTEXT, large. For example, if selecting a model using selections via MSE, the rounding of these attributes significantly alters the selection process - leading to a different selection of attributes when analyzing an IDENTICAL dataset using IDENTICAL code.
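One way to pin down which values differ, without going through text output and 'diff', is to compare the two frames cell by cell in memory. The following is only a sketch: `run_a`, `run_b` and the helper `differing_cells` are hypothetical names, and the small frames are placeholder values standing in for the results of two separate loads, not data from the actual dataset.

```python
import pandas as pd

# Placeholder frames standing in for two separate loads (values are made up).
run_a = pd.DataFrame({"budget": [1000.0, 2500.0], "popularity": [1.25, 0.10]})
run_b = pd.DataFrame({"budget": [1000.0, 2500.0], "popularity": [1.25, 0.11]})


def differing_cells(left, right):
    """Return (row label, column, left value, right value) for each unequal cell."""
    # NaN != NaN, so positions that are missing on both sides count as equal.
    mask = (left != right) & ~(left.isna() & right.isna())
    cells = []
    for col in mask.columns:
        for idx in mask.index[mask[col]]:
            cells.append((idx, col, left.at[idx, col], right.at[idx, col]))
    return cells


diffs = differing_cells(run_a, run_b)
print(diffs)
```

An empty list would mean the two frames hold identical values, in which case any difference seen in the text files would have to come from the output step rather than from loading.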
Expected Output
The pd.DataFrame values should always be identical when loading IDENTICAL DATASETS with IDENTICAL CODE
Output of pd.show_versions()
Comment From: TomAugspurger
Can you make a reproducible example? I can't run that since '/'XXX'/'XXX'/DEBUG_selectionPreparation_2.txt' doesn't exist on my computer.
For examples, it's usually best to make a string with the values and wrap that in a StringIO object, which can be passed to read_csv.
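A minimal sketch of what such a StringIO-based example could look like is given here. The column names and values in `CSV_TEXT` are placeholders invented for illustration, and `load` is a hypothetical helper; only `pd.read_csv`, `StringIO`, `DataFrame.equals` and `DataFrame.describe` are real APIs.

```python
from io import StringIO

import pandas as pd

# Placeholder CSV content (made-up values, not taken from the reporter's file).
CSV_TEXT = (
    "budget,popularity,revenue\n"
    "1000,1.25,3000\n"
    "2500,0.1,0\n"
    "400,150.437577,98765\n"
)


def load(text):
    """Parse CSV text into a DataFrame, as read_csv would for a file path."""
    return pd.read_csv(StringIO(text))


# Load the identical text twice and compare the results in memory.
first = load(CSV_TEXT)
second = load(CSV_TEXT)
print(first.equals(second))
print(first.describe().equals(second.describe()))
```

Because the data lives in the script itself, anyone can run it unchanged and see whether two loads of the same input ever disagree.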
Comment From: st12yker
The code '/'XXX'/'XXX'/DEBUG_selectionPreparation_2.txt' is for demonstration purposes, to show how I outputted the pd.DataFrame to a text file. You can use any code to write the pd.DataFrames to a file.
As stated in the text, I outputted two pd.DataFrames (loaded them separately, one after the other).
I used the text file so that I could use the linux/unix command "diff" to identify differences between the two dataframes (which should have no differences, since I am loading the IDENTICAL dataset). As stated in the text, the "diff" linux/unix command identified a lot of differences. See the text for additional comments.
Thank you.
Sincerely, Julian Hershowitz
----- Original Message ----- From: "Tom Augspurger" notifications@github.com To: "pandas-dev/pandas" pandas@noreply.github.com Cc: "st12yker" julian@webintensive.com, "Author" author@noreply.github.com Sent: Tuesday, February 13, 2018 3:45:32 PM Subject: Re: [pandas-dev/pandas] pd.read_csv generating different pd.DataFrames with identical dataset (#19685)
-- You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub: https://github.com/pandas-dev/pandas/issues/19685#issuecomment-365398205
Comment From: WillAyd
You should be using to_csv instead of writing the string representation of the DataFrame
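To illustrate the difference, here is a rough sketch assuming a small placeholder frame (`df` and the helper `dump` are hypothetical, not the reporter's code). The string representation is a display format and shortens floats for readability, whereas `to_csv` writes the stored values in full, which makes it the better basis for a 'diff'.

```python
from io import StringIO

import pandas as pd

# Placeholder frame; 1/3 has more digits than the display format shows.
df = pd.DataFrame({"a": [0.1, 1 / 3, 2.0], "b": [1, 2, 3]})


def dump(frame):
    """Serialize a frame with to_csv into an in-memory string."""
    buf = StringIO()
    frame.to_csv(buf, index=False)
    return buf.getvalue()


text_a = dump(df)
text_b = dump(df.copy())

print(str(df))           # display-oriented text, floats shortened
print(text_a)            # CSV text with the full float digits
print(text_a == text_b)  # identical frames serialize to identical text
```

Passing a file path to `to_csv` instead of a buffer would give two files that can be compared with 'diff' directly.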
Comment From: chris-b1
Agree with @WillAyd's diagnosis, please re-open with a reproducible example if that's not the issue
Comment From: st12yker
Hello. You are all ( @TomAugspurger @WillAyd @chris-b1 ) focusing on the wrong part of this discussion. Exporting the dataframe to a text file is not where I am having issues. (I responded to @TomAugspurger regarding outputting the pandas dataframe since that was relevant to his response - albeit his response was off topic)
The problem with pandas is reading in a csv file using pandas.read_csv() - http://tinyurl.com/y8odcc9y - and creating the appropriate pd.DataFrame. See original post for details.
(@WillAyd using "to_csv" is relevant to outputting a pd.DataFrame to a csv file, NOT inputting a csv file and creating a pd.DataFrame - see pandas documentation at http://tinyurl.com/y8twzlu5).
In summary, you have not read my post closely. You are focusing on @TomAugspurger's off-topic comment.
Comment From: TomAugspurger
@st12yker you might want to have a look at https://stackoverflow.com/help/mcve
We still don't have code to run so we can't verify whether it's a bug or not.
Comment From: st12yker
The code - as mentioned in the post - is simply
import pandas as pd
pd.read_csv(fileLocation)
The code to output the contents/statistics is not relevant to the issue.
Comment From: chris-b1
@st12yker - you're certainly welcome to investigate further, and if you find a problem or even a hint of what the problem is, a new issue or PR is more than welcome, but without example data there's nothing more we can do.
Comment From: st12yker
Here is the Kaggle page with a link to the dataset I am using: https://www.kaggle.com/tmdb/tmdb-movie-metadata