Encountered this via a column of all-missing values read from a CSV, which pd.read_csv fills with NaNs. I had logic to check for duplicate values that used set(df[col]), which I realize I can readily replace with df[col].unique(). However, I found an intriguing bug: a lengthy list of np.NaN appears to hang the kernel... one core maxes at 100% and just sits there with the fans blasting.
This only seems to occur with NaNs... a similar list of integers or floats will chug but ultimately return.
System: Gigabyte Aero 15 running Ubuntu 16.04, Python 3.6.1, Pandas 0.20.1, NumPy 1.13.1.
Minimal reproduction
import pandas as pd
import numpy as np

ser = pd.Series([np.NaN] * 1000000)  # one million NaNs
set(ser)  # hangs; one core pinned at 100%
I mean, this is really just an entertaining curiosity, as the behavior with .unique() is fine. Still intriguing, though. Any ideas?
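For scale, here is a minimal sketch of the contrast at a size where set() still finishes (the timings are machine-dependent and purely illustrative):

import time
import numpy as np
import pandas as pd

ser = pd.Series([np.nan] * 10000)  # small enough for set() to complete

t0 = time.perf_counter()
u = ser.unique()  # hash-table based, special-cases NaN
t1 = time.perf_counter()
s = set(ser)      # every boxed NaN occupies its own slot
t2 = time.perf_counter()

print(len(u), t1 - t0)  # 1, effectively instant
print(len(s), t2 - t1)  # 10000, and visibly slower as the size grows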
Comment From: gfyoung
@andybrnr : Interesting! Indeed, I've run this on other data types and can only replicate this issue with np.nan, as you have. Not sure if this is a pandas issue exactly, but I can't see why we can't look into this a little more (given that .unique() is very performant).
Comment From: chris-b1
This happens due to the interaction of NaN semantics and how Python's hash tables work. The basic rule of thumb is: never put NaN inside a set or dict. We get around this by special-casing it.
What's specifically happening is:
- iter(ser) is yielding the value np.float64('nan') over and over again
- All these NaN are the same value, so their hash is equal
- But NaN != NaN, so the python hash table treats them as different keys (see the sketch after this list)
- So you are not only creating a massive python set, but creating that set out of items that all share the same hash, essentially doing a hash collision attack on yourself
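A quick check of those semantics in plain Python (a minimal sketch, nothing pandas-specific):

a = float('nan')
b = float('nan')  # a distinct NaN object

print(hash(a) == hash(b))  # True:  all NaNs land in the same hash bucket
print(a == b)              # False: the table's equality probe never matches
print(a == a)              # False: NaN is not even equal to itself

# CPython's set lookup short-circuits on identity (x is y) before trying
# equality, so repeating the *same* object dedupes, while fresh objects
# (like the np.float64 boxes yielded by iter(ser)) pile up:
print(len(set([a] * 1000)))                         # 1
print(len(set(float('nan') for _ in range(1000))))  # 1000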
def yield_nan(num):
    # each yield produces a brand-new float('nan') object
    for _ in range(num):
        yield float('nan')
In [89]: %timeit set(yield_nan(100))
154 µs ± 4.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [90]: %timeit set(yield_nan(200))
469 µs ± 5.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [91]: %timeit set(yield_nan(500))
2.85 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 5x elements, 18x slower: the cost grows quadratically
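For completeness, here is a sketch of the special-casing idea mentioned above. This is not pandas' actual implementation (its hash table does this at the C level); unique_with_nan is a hypothetical helper that just illustrates collapsing all NaNs into a single entry so deduplication stays linear:

import math

def unique_with_nan(values):
    # Ordinary hash-table deduplication breaks on NaN because NaN != NaN,
    # so track "seen a NaN" with a flag instead of putting NaN in the set.
    seen = set()
    seen_nan = False
    out = []
    for v in values:
        if isinstance(v, float) and math.isnan(v):
            if not seen_nan:
                seen_nan = True
                out.append(v)
        elif v not in seen:
            seen.add(v)
            out.append(v)
    return out

# unique_with_nan(pd.Series([np.nan] * 1000000)) returns [nan] quickly,
# since every NaN hits the isnan branch instead of colliding in the set.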
Comment From: andybrnr
Thanks for the explanation, Chris, most useful.