Pandas Series.hist() fails for String Series

Not sure if that's a pandas or a matplotlib bug.

import pandas
df = pandas.DataFrame({"year":[1,2,3,4],"name": [u"abc", u"bcd", u"cde", u"abc"]})
df["year"].hist() # Works
df["name"].hist()
TypeError                                 Traceback (most recent call last)
<ipython-input-11-79a5ab7bbbd5> in <module>()
----> 1 df["name"].hist()

C:\portabel\Python27\lib\site-packages\pandas\tools\plotting.pyc in hist_series(self, by, ax, grid, xlabelsize, xrot, ylabelsize, yrot, figsize, **kwds)
   2113         values = self.dropna().values
   2114 
-> 2115         ax.hist(values, **kwds)
   2116         ax.grid(grid)
   2117         axes = np.array([ax])

C:\portabel\Python27\lib\site-packages\matplotlib\axes.pyc in hist(self, x, bins, range, normed, weights, cumulative, bottom, histtype, align, orientation, rwidth, log, color, label, stacked, **kwargs)
   8325             # this will automatically overwrite bins,
   8326             # so that each histogram uses the same bins
-> 8327             m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
   8328             m = m.astype(float) # causes problems later if it's an int
   8329             if normed and not stacked:

C:\portabel\Python27\lib\site-packages\numpy\lib\function_base.pyc in histogram(a, bins, range, normed, weights, density)
    174             else:
    175                 range = (a.min(), a.max())
--> 176         mn, mx = [mi+0.0 for mi in range]
    177         if mn == mx:
    178             mn -= 0.5

TypeError: coercing to Unicode: need string or buffer, float found

Comment From: jorisvandenbossche

This also does not work with matplotlib directly (e.g. plt.hist([u"abc", u"bcd", u"cde", u"abc"])), although it fails with a different error. So maybe open an issue there.

But should it work with non numeric data? A histogram is about binning? Maybe a better error message is possible.

Comment From: cpcloud

@JanSchulz What would you expect to happen here? Non-numeric data don't really make sense for a histogram, because nominal data (such as strings) don't have an ordering, can't be added together numerically and don't have any sort of multiplication operation that's germane to computing a histogram.

That said, I agree with @jorisvandenbossche that the error message could be a bit less cryptic.
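One way a less cryptic message might be produced is to check the dtype before handing the values to matplotlib. This is only a minimal sketch: `check_hist_input` is a hypothetical helper, not something that exists in pandas, and the wording of the message is an assumption.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_hist_input(series):
    """Hypothetical helper: reject non-numeric data with a readable message."""
    values = series.dropna()
    if not is_numeric_dtype(values):
        raise TypeError(
            "hist() requires numeric data, got dtype {0!r}; for categorical "
            "data consider series.value_counts().plot(kind='bar')".format(
                str(values.dtype)
            )
        )
    return values


df = pd.DataFrame({"year": [1, 2, 3, 4], "name": ["abc", "bcd", "cde", "abc"]})

numeric_values = check_hist_input(df["year"])  # numeric column passes through

try:
    check_hist_input(df["name"])  # string column is rejected up front
    message = None
except TypeError as exc:
    message = str(exc)
    print(message)
```

The check runs before any binning, so the user sees the problem stated in terms of their own data rather than a failure deep inside numpy.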

Comment From: dragoljub

I have noticed series.hist() also fails for datetime64[ns] series. It would be cool if we could handle this. I'm not sure how NumPy/Matplotlib computes the bucketing on these value types to get the histogram. If it's just 64-bit ints underneath, then it should not be too hard to bucket with datetime bin edges.
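The integer-based bucketing could look roughly like the sketch below. It assumes the values can be viewed as nanoseconds since the epoch, bins them with np.histogram, and converts the bin edges back to timestamps for labelling; it is an illustration of the idea, not how pandas or matplotlib actually handle datetimes.

```python
import numpy as np
import pandas as pd

dates = pd.Series(pd.to_datetime(
    ["2014-01-01", "2014-01-02", "2014-01-02", "2014-01-10"]))

# View the datetimes as integers (nanoseconds since the epoch) for binning.
as_int = dates.dropna().values.astype("datetime64[ns]").astype("int64")

# Three equal-width bins between the earliest and latest timestamp.
counts, edges = np.histogram(as_int, bins=3)

# The edges come back as floats; convert them to timestamps for tick labels.
edge_times = pd.to_datetime(edges.astype("int64"))

print(counts)
print(edge_times)
```

Because the edges pass through float64, they are only accurate to a few hundred nanoseconds at present-day dates, which should not matter for plotting.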

Comment From: jreback

value_counts handles all of the dtypes, so easy enough to prob just catch the error, do a value_counts and retry

In [4]: df['name'].value_counts()
Out[4]: 
abc    2
cde    1
bcd    1
dtype: int64

In [5]: df['name'].value_counts().hist()
Out[5]: <matplotlib.axes.AxesSubplot at 0x30fb250>

In [6]: df['name'].hist()

TypeError: coercing to Unicode: need string or buffer, float found

Comment From: jankatins

Actually I expected a bar plot (value -> count). So in the above example I wanted to have three bars, labels are "abc", "cde", "bcd" with the length of the bars 2, 1, 1.

Using plot with kind='bar' also fails:

s = df["name"]
s.plot(kind='bar')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-30-0d9dfbe49ce9> in <module>()
      1 s = df["name"]
----> 2 s.plot(kind='bar')

C:\portabel\Python27\lib\site-packages\pandas\tools\plotting.pyc in plot_series(series, label, kind, use_index, rot, xticks, yticks, xlim, ylim, ax, style, grid, legend, logx, logy, secondary_y, **kwds)
   1797                      secondary_y=secondary_y, **kwds)
   1798 
-> 1799     plot_obj.generate()
   1800     plot_obj.draw()
   1801 

C:\portabel\Python27\lib\site-packages\pandas\tools\plotting.pyc in generate(self)
    874     def generate(self):
    875         self._args_adjust()
--> 876         self._compute_plot_data()
    877         self._setup_subplots()
    878         self._make_plot()

C:\portabel\Python27\lib\site-packages\pandas\tools\plotting.pyc in _compute_plot_data(self)
    955         if is_empty:
    956             raise TypeError('Empty {0!r}: no numeric data to '
--> 957                             'plot'.format(numeric_data.__class__.__name__))
    958 
    959         self.data = numeric_data

TypeError: Empty 'Series': no numeric data to plot

df["name"].value_counts().hist() is not what I wanted: it draws a histogram over the count values, so in the above case it ends up with two bars, one for '1' (height two, because the count 1 occurs twice) and one for '2' (height one).
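The difference is easy to see without plotting anything. In this small sketch, the first value_counts gives the per-name counts (what a bar plot should show), and applying value_counts again gives the counts of those counts (which is what a histogram over them tallies):

```python
import pandas as pd

names = pd.Series(["abc", "bcd", "cde", "abc"])

# Per-name counts: the heights wanted for a bar plot.
counts = names.value_counts()

# Counts of the counts: what a histogram over `counts` tallies.
counts_of_counts = counts.value_counts()

print(counts)
print(counts_of_counts)
```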

Comment From: jankatins

That's what I wanted:

df["name"].value_counts().plot(kind="bar")

:-)

Comment From: themrmax

I really want this feature, since most of the data I deal with is categorical, so I can never use the convenience of hist. (This is also relevant to https://github.com/pydata/pandas/issues/8712 and maybe https://github.com/pydata/pandas/issues/10254.) Philosophically, I think this type of behaviour could be seen as more of an enhancement for bar than for hist. For example, in R, geom_bar plots the value counts by default; you only get the actual values if you pass stat="identity" (the opposite of the behaviour of pandas.plot.bar). So maybe this could also be offered inside BarPlot via an optional value_counts flag.
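To make the flag idea concrete, here is a rough sketch of the data-preparation step only. `bar_data` and its `value_counts` parameter are hypothetical names used for illustration; nothing like this exists in pandas, and the real BarPlot internals are not touched here.

```python
import pandas as pd


def bar_data(series, value_counts=False):
    """Hypothetical helper: pick the values a bar plot would draw.

    value_counts=False keeps the values as they are (like stat="identity"),
    value_counts=True draws one bar per distinct value, sized by its count.
    """
    if value_counts:
        return series.value_counts().sort_index()
    return series


s = pd.Series(["a", "a", "b", "b", "b", "c"])
heights = bar_data(s, value_counts=True)
print(heights)
# heights.plot.bar() would then draw one bar each for 'a', 'b' and 'c'.
```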

I've hacked together something which provides this functionality using the pandas DataFrame.plot API. Would this be suitable for merging into tools.plotting? My concern is that tools.plotting is mostly written using native calls to matplotlib; is it a problem to have calls to a pandas API inside this library?

Apologies if this is a dumb question, or if I'm raising this in the wrong place. This would be my first contribution to pandas and I'm having trouble wrapping my head around exactly where this fits in.

def categorical_hist(data, column=None, by=None, layout=None, legend=None, **params):
  """
      Draw a bar plot of value counts. If layout is None, the layout will be a
      square grid, overriding the DataFrame.plot default of a single column of
      stacked plots.

      Parameters
      ----------
      data : DataFrame or Series
      column : object, optional
          the column to plot (only valid if data is a DataFrame)
      by : object, optional
          if passed, then used to form bar plots for separate groups
      layout : (int, int), optional
          a tuple (rows, columns) of the subplots grid
      legend : False/True/'reverse'
          place legend on axis subplots
      params : other plotting keyword arguments

  """
  from math import sqrt, ceil
  from pandas import DataFrame

  if column is None: column = data.name  # if data is a Series, need to set this
  data = DataFrame(data)  # if data is a Series, make it a DataFrame

  if by is not None:
    if layout is None:
      # smallest square grid with one cell per distinct value
      s = int(ceil(sqrt(data[column].unique().size)))
      layout = (s, s)
    # one row per group, one column (and hence one subplot) per distinct value
    counts = data.groupby(by)[column].value_counts().sort_index().unstack()
    return counts.plot.bar(subplots=True, layout=layout,
                           legend=legend, **params)
  return data[column].value_counts().plot.bar(subplots=False, layout=layout,
                                              legend=legend, **params)

Comment From: datapythonista

This doesn't raise an error anymore; it generates a plot. Not sure if the plot is meaningful, and it's not equivalent to df["name"].value_counts().plot(kind="bar"). Maybe we can review the content of the plot in a separate issue, if that hasn't already been reported. Closing this issue.

Example output:

pandas.Series(['a', 'a', 'b', 'b', 'b', 'c'], dtype='object').hist()

[image: str_hist]