Encountered this via a CSV column of all missing values, which pd.read_csv fills with NaNs. I had logic to check for duplicate values that used set(df[col]), which I realize I can readily replace with df[col].unique(). However, I found an intriguing bug, in that a lengthy list of np.NaN appears to hang the kernel... one core maxes at 100% and just sits there with the fans blasting.

This only seems to occur with NaNs... a similar list of integers or floats will chug but ultimately return.

System is a Gigabyte Aero 15 running Ubuntu 16.04, with Python 3.6.1, Pandas 0.20.1, NumPy 1.13.1.

Minimal reproduction

import pandas as pd
import numpy as np 
ser = pd.Series([np.NaN]*1000000)
set(ser)

I mean, this is really just an entertaining curiosity, as behavior with .unique() is fine. Still intriguing, though. Any ideas?

Comment From: gfyoung

@andybrnr : Interesting! Indeed, I've run this on other data types and can only replicate this issue with just np.nan as you have. Not sure if this is a pandas issue exactly, but I can't see why we can't look into this a little more (given that .unique() is very performant).

Comment From: chris-b1

This happens due to the interaction of NaN semantics and how Python's hash tables work. The basic rule of thumb is to never put NaN inside a set or dict. We get around this by special-casing it.
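A minimal sketch of these NaN semantics in plain Python (note: on Python 3.10+, hash(nan) became id-based, which mitigates the collision behavior for distinct NaN objects, so the assertions below stick to the version-independent parts):

```python
nan = float('nan')

# NaN never compares equal, not even to itself
assert nan != nan
assert not (nan == nan)

# Set lookups short-circuit on object identity, so the *same* NaN
# object deduplicates fine...
assert len({nan, nan}) == 1

# ...but two *distinct* NaN objects are both kept, because they
# compare unequal to each other
assert len({float('nan'), float('nan')}) == 2
```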

What's specifically happening is:

  1. iter(ser) is yielding the value np.float64('nan') over and over again
  2. All these NaNs hash to the same value, so their hashes are equal
  3. But, NaN != NaN, so the python hash table treats them as different keys
  4. So you are not only creating a massive python set, but creating that set out of items that have the same hash, essentially doing a hash collision attack on yourself
def yield_nan(num):
    for _ in range(num):
        yield float('nan')

In [89]: %timeit set(yield_nan(100))
154 µs ± 4.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [90]: %timeit set(yield_nan(200))
469 µs ± 5.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [91]: %timeit set(yield_nan(500))
2.85 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 5x elements, 18x slower
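For comparison, a linear-time dedup can be had by treating all NaNs as one value up front. The helper below (unique_with_nan is a hypothetical name, not pandas's actual implementation, which special-cases NaN inside its C hash table) sketches the same idea in pure Python:

```python
import math

def unique_with_nan(values):
    """Deduplicate, treating all NaNs as a single value,
    similar in spirit to pandas's .unique() special-casing."""
    seen = set()
    out = []
    seen_nan = False
    for v in values:
        if isinstance(v, float) and math.isnan(v):
            if not seen_nan:       # collapse every NaN into one slot
                seen_nan = True
                out.append(v)
        elif v not in seen:
            seen.add(v)
            out.append(v)
    return out

vals = [float('nan')] * 1_000_000 + [1.0, 2.0, 1.0]
result = unique_with_nan(vals)
assert len(result) == 3            # one NaN, 1.0, 2.0
```

Because NaN never reaches the hash table, this stays linear where set(...) degrades quadratically.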

Comment From: andybrnr

Thanks for the explanation, Chris, most useful.