I wish I could say more about this, but the data is proprietary. Basically, I have a DataFrame of 10 columns and around 7000 rows. When I call "duplicated" on 4 of the columns (3 hold strings, 1 holds a float), around 10 items (5 pairs) are flagged as duplicates when in fact they are different.
This is a sample of what is being flagged:
      Type      Line   LineString                                        Parameter        Filename
1885  else      832.0  (&temp32)->byte1 = (&temp32)->byte4               temp32           arinc.c
1895  do while  832.0  do while(0)                                                        arinc.c
4515  enum      122.0  enum {QQ_ON_UNASSERTED = 0, QQ_ON_ASSERTED = 1}   QQ_ON_ASSERTED   eeprom.c
4521  enum      167.0  enum {FIELD_RESET = 1, FIELD_TRIPPED = 0}         FIELD_TRIPPED    eeprom.c
I know that duplicates are checked through hashing, but is there some way to compare a checksum or some more robust measure to ensure that only true duplicates are flagged? What is the chance that two different items hash to the same value?
By adding more fields to the subset being checked, I am able to prevent the false duplicate flags.
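One way to eyeball whether the flagged rows really do match on the subset columns is to pull every member of each duplicate group with keep=False and sort them next to each other. A minimal sketch with made-up stand-in rows (the real data is proprietary), using the same column names as above:

```python
import pandas as pd

# Invented stand-in data: rows 0 and 1 agree on the subset columns but
# differ in "Filename", which is outside the subset; row 2 is unique.
df = pd.DataFrame({
    "Type": ["enum", "enum", "else"],
    "Line": [122.0, 122.0, 832.0],
    "LineString": ["enum {A = 0}", "enum {A = 0}", "x = y"],
    "Parameter": ["A", "A", "x"],
    "Filename": ["eeprom.c", "other.c", "arinc.c"],
})

dup_cols = ["Type", "Line", "LineString", "Parameter"]

# keep=False marks *every* member of a duplicate group (not just the
# later occurrences), so sorting the flagged rows groups each pair
# together for a column-by-column comparison.
flagged = df[df.duplicated(subset=dup_cols, keep=False)].sort_values(dup_cols)
print(flagged)
```

If the pairs printed this way look different, the difference must be in a column outside the subset, which drop_duplicates deliberately ignores.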
I am using Pandas 0.18 from WinPython-64bit-3.4.4.2Qt5.
Comment From: jreback
show what you are actually calling as well as df.info()
Comment From: johnml1135
Here are the lines of code that I run:
dup_cols = ["Type","Line","LineString","Parameter"]
pdata3 = pdata2.drop_duplicates(subset=dup_cols)
and here is the result of pdata3.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7120 entries, 0 to 20578
Data columns (total 17 columns):
Block 7120 non-null object
Blocks 7120 non-null object
Filename 7120 non-null object
Impacts 7120 non-null object
ImpactsUnique 7120 non-null object
Level 7120 non-null int64
Line 7120 non-null float64
LineString 7120 non-null object
OwningFunction 7120 non-null object
Parameter 7120 non-null object
ParameterType 7120 non-null object
ParameterUnique 7120 non-null object
Type 7120 non-null object
ParameterFull 7120 non-null object
ParameterField 7120 non-null object
ImpactsFull 7120 non-null object
ImpactsField 7120 non-null object
dtypes: float64(1), int64(1), object(15)
memory usage: 1001.2+ KB
Comment From: johnml1135
I have been looking further into this and stepping through the pandas code, and I found that (predictably) it was my own mistake. There were real duplicates; I was just not seeing them. You can close this ticket.
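For anyone landing here with the same suspicion, a quick check can separate genuine hash-collision worries from rows that merely agree on the subset columns: compare duplicates over the subset against duplicates over all columns. A sketch with invented rows, not the original data:

```python
import pandas as pd

# Two rows that are identical in every column -- a genuine duplicate pair.
df = pd.DataFrame({
    "Type": ["do while", "do while"],
    "Line": [832.0, 832.0],
    "LineString": ["do while(0)", "do while(0)"],
    "Parameter": ["", ""],
    "Filename": ["arinc.c", "arinc.c"],
})

dup_cols = ["Type", "Line", "LineString", "Parameter"]

# A row flagged on the subset but NOT on all columns differs somewhere
# outside the subset -- expected behaviour, not a false positive.
subset_dups = df.duplicated(subset=dup_cols)
full_dups = df.duplicated()
suspicious = df[subset_dups & ~full_dups]
print(len(suspicious))
```

An empty `suspicious` frame means every subset-level duplicate is also a full-row duplicate, which is exactly what turned out to be the case here.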