Hello there!
I am working with text data, and I read my data in using:

```python
full_list = []
for myfile in all_files:
    print("processing " + myfile)
    news = pd.read_csv(myfile, usecols=['FULL_TIMESTAMP', 'HEADLINE'], dtype={'HEADLINE': str})
    full_list.append(news)
data_full = pd.concat(full_list)
```

As you see, I make sure that my `HEADLINE` variable is a `str`. However, when I type
```python
collapsed = data_full.groupby('day').HEADLINE.agg(lambda x: '| '.join(x))
```

I get:

```
File "<ipython-input-1-8ce0197f52ac>", line 34, in <module>
    collapsed = data_full.groupby('day').HEADLINE.agg(lambda x: '| '.join(x))
File "C:\Users\me\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 2668, in aggregate
    result = self._aggregate_named(func_or_funcs, *args, **kwargs)
File "C:\Users\me\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 2786, in _aggregate_named
    output = func(group, *args, **kwargs)
File "<ipython-input-1-8ce0197f52ac>", line 34, in <lambda>
    collapsed = data_full.groupby('day').HEADLINE.agg(lambda x: '| '.join(x))
TypeError: sequence item 21: expected string, float found
```
To fix the problem, I first need to run:

```python
data_full['HEADLINE'] = data_full['HEADLINE'].astype(str)
```

Is that expected? I thought specifying the dtypes in `read_csv` was the most robust way to get consistent types in the data. Still using pandas 0.19.2.
Thanks!
Comment From: TomAugspurger
Can you provide a copy-pastable example (including frame generation and writing out CSVs)?
Comment From: randomgambit
haha that would be too easy! :)
The data contains news articles, so the headline can contain anything: Chinese characters, numbers, special characters, and the like.
Maybe a possible solution here would be to look at the rows that are incorrectly parsed (that is, float vs. str)? Do you know how to do that? Then I can show you what they look like, but of course I cannot post the raw data here (about 2 GB total).
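One way to locate those rows is to filter on the Python type of each cell. A sketch with a toy frame (the real `data_full` comes from `read_csv`; the sample values here are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_full; the real frame is built with read_csv.
data_full = pd.DataFrame({
    'day': ['2017-06-01', '2017-06-01', '2017-06-02'],
    'HEADLINE': ['Markets rally', np.nan, 'Rates hold steady'],
})

# Keep only the rows whose HEADLINE cell is not a str (NaN shows up as float)
bad_rows = data_full[~data_full['HEADLINE'].apply(lambda v: isinstance(v, str))]
print(bad_rows)
```

Printing `bad_rows` shows exactly which rows carry a non-string value, so you can inspect what they look like.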
Comment From: jschendel
Do you have any missing values in your `'HEADLINE'` column? You can check with `data_full['HEADLINE'].isnull().any()`.

Missing values would be read in as `np.nan` despite the `dtype` specification, and the type of `np.nan` is float. Using `astype(str)` would convert `np.nan` values to the string `'nan'`, which would explain why it works after doing so, though I don't know that it's working as you intended, as you'd get something like `'string1| nan| string3'` from the `join`.
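This behaviour is easy to see with a small in-memory CSV (a sketch; the file contents are made up):

```python
import io
import pandas as pd

# A tiny CSV with one empty HEADLINE field
csv = io.StringIO('day,HEADLINE\n1,first\n2,\n3,third\n')
s = pd.read_csv(csv, dtype={'HEADLINE': str})['HEADLINE']

print([type(v).__name__ for v in s])  # the empty field is a float (NaN), despite dtype=str
print('| '.join(s.astype(str)))       # astype(str) turns NaN into the string 'nan'
```

The join succeeds after `astype(str)`, but the missing value silently becomes the literal text `nan` in the output.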
Comment From: randomgambit
I think you found the correct solution!!
That raises the question: is that the expected output? Shouldn't the missing values be read in as `""` instead of `np.nan` when the user specifies `dtype=str`?
Comment From: TomAugspurger
I don't think so (at least not by default). There is the `keep_default_na=False` option to `read_csv`, if you'd like to disable that.
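A quick sketch of the difference, using a made-up in-memory CSV:

```python
import io
import pandas as pd

csv = 'day,HEADLINE\n1,first\n2,\n3,third\n'

# Default: the empty field comes back as NaN, a float
with_na = pd.read_csv(io.StringIO(csv), dtype={'HEADLINE': str})['HEADLINE']
print(with_na.isnull().any())  # True

# keep_default_na=False: the empty field stays an empty string
no_na = pd.read_csv(io.StringIO(csv), dtype={'HEADLINE': str},
                    keep_default_na=False)['HEADLINE']
print(no_na.tolist())
print('| '.join(no_na))        # joins cleanly, no TypeError and no 'nan' artifacts
```

Note that with `keep_default_na=False`, strings like `"NA"` or `"null"` in the data are also kept verbatim rather than treated as missing, which may or may not be what you want.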
Comment From: jreback
closing as a usage issue.
Comment From: jowagner
@randomgambit An empty string cell is not necessarily a missing value. Missing value means that the true value that should have been recorded is unavailable. In some data files, an empty string is part of the set of possible true observations.