Problem description
One thing I have to check a lot at my work is the percentage of null values in a DataFrame column. What I end up doing is:
```python
(df[column].isnull().sum() * 100 / len(df)).sort_values(ascending=False)
```
I think it would be very convenient to have a parameter, like:
```python
df[column].isnull(ratio=True, sort=True)
```
What do you guys think?
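A minimal sketch of how the proposed behavior could look as a standalone helper. `null_percentage` and its `sort` parameter are hypothetical names used for illustration only; they are not part of pandas, and the sample data is made up.

```python
import numpy as np
import pandas as pd


def null_percentage(obj, sort=True):
    """Hypothetical helper: percentage of null values.

    Returns a float for a Series and a per-column Series for a DataFrame.
    """
    # The mean of a boolean mask is the fraction of True values.
    pct = obj.isnull().mean() * 100
    # Sorting only applies to the per-column (DataFrame) result.
    if sort and isinstance(pct, pd.Series):
        pct = pct.sort_values(ascending=False)
    return pct


# Made-up sample data for illustration.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", None, "z", "w"],
    "c": [1, 2, 3, 4],
})

per_column = null_percentage(df)    # Series indexed by column name
single = null_percentage(df["a"])   # scalar for a single column
```

Taking the mean of the boolean mask avoids dividing by `len(df)` by hand, and the same call works for one column or for the whole frame.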
Comment From: max-sixty
Have you seen `df.info()`? That has some summary info (though not the %).
There should be a better way of describing the texture of your data than adding more kwargs there.
Comment From: lucianoviola
You are right about df.info(), but personally I find it a bit confusing. You have to keep checking the number of rows in your df and then compare it to the Null count. When there are many rows, it's hard to have a sense of proportion. Also, df.info() doesn't work with Series.
I find it useful to know which columns are above a certain threshold of missing values.
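The threshold check could be sketched roughly as follows, assuming a cutoff of 50% and made-up sample data; `threshold` and the column names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up sample data for illustration.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

threshold = 0.5  # assumed cutoff: flag columns with more than 50% nulls

# Fraction of null values per column.
null_fraction = df.isnull().mean()

# Keep only the columns above the cutoff.
mostly_null = null_fraction[null_fraction > threshold]
flagged_columns = mostly_null.index.tolist()

# The same mask-and-mean idea works on a single Series.
fraction_a = df["a"].isnull().mean()
```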
Comment From: gfyoung
@MaximilianR : Even better, just use `df.describe()`. The `count` row tells you how many non-null elements there are in each column:
```python
percentages = df.describe().loc["count"] / len(df)
percentages.sort_values()
```
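One caveat worth keeping in mind: on a DataFrame with mixed column types, `describe()` summarizes only the numeric columns by default, so non-numeric columns drop out of the `count` row. A rough sketch of the difference, using made-up data, with `df.count()` as one alternative that covers every column:

```python
import numpy as np
import pandas as pd

# Made-up sample data: one numeric and one string column.
df = pd.DataFrame({
    "price": [1.5, np.nan, 3.0, 4.0],
    "label": ["x", None, None, "w"],
})

# describe() on a mixed-type frame reports numeric columns only by default,
# so "label" is missing from this result.
described = df.describe().loc["count"] / len(df)

# count() returns the non-null count for every column.
non_null_fraction = (df.count() / len(df)).sort_values()
```

Both results are fractions of non-null values, so the null fraction is one minus each value.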
Comment From: gfyoung
@lucianoviola : I'm not very inclined at this point to add this functionality, just because `isnull()` is about telling you which values are null and not how many are null. In addition, I think `.describe()` takes care of that in large part.
I'll close the issue for the time being, but if you feel strongly about this, feel free to keep posting, and we can re-address.
Comment From: lucianoviola
@gfyoung no problem. Thank you!
Comment From: gfyoung
@lucianoviola : Certainly! I hope the code I provided above makes more sense for you to use!