Hello there!
I am working with text data, and I read my data in using:

```python
full_list = []
for myfile in all_files:
    print("processing " + myfile)
    news = pd.read_csv(myfile, usecols=['FULL_TIMESTAMP', 'HEADLINE'], dtype={'HEADLINE': str})
    full_list.append(news)
data_full = pd.concat(full_list)
```

As you see, I make sure that my `HEADLINE` variable is a `str`. However, when I type
```python
collapsed = data_full.groupby('day').HEADLINE.agg(lambda x: '| '.join(x))
```

I get:

```
File "<ipython-input-1-8ce0197f52ac>", line 34, in <module>
    collapsed = data_full.groupby('day').HEADLINE.agg(lambda x: '| '.join(x))
File "C:\Users\me\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 2668, in aggregate
    result = self._aggregate_named(func_or_funcs, *args, **kwargs)
File "C:\Users\me\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 2786, in _aggregate_named
    output = func(group, *args, **kwargs)
File "<ipython-input-1-8ce0197f52ac>", line 34, in <lambda>
    collapsed = data_full.groupby('day').HEADLINE.agg(lambda x: '| '.join(x))
TypeError: sequence item 21: expected string, float found
```
To fix the problem, I first need to run:

```python
data_full['HEADLINE'] = data_full['HEADLINE'].astype(str)
```

Is that expected? I thought specifying the dtypes in `read_csv` was the most robust way to get consistent types in the data. Still using pandas 0.19.2.
Thanks!
Comment From: TomAugspurger
Can you provide a copy-pastable example (including frame generation and writing out CSVs)?
Comment From: randomgambit
haha that would be too easy! :)
The data contains news articles, so the headline can contain anything: Chinese characters, numbers, special characters, and the like.
Maybe a possible solution here would be to look at the rows that are incorrectly parsed (that is, float vs. str)? Do you know how to do that? Then I can show you what they look like, but of course I cannot post the raw data here (about 2 GB total).
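One way to locate those rows is to filter on the Python type of each cell. A sketch with a toy frame (the real `data_full` comes from `read_csv`; the sample values here are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_full; the real frame is built with read_csv.
data_full = pd.DataFrame({
    'day': ['2017-06-01', '2017-06-01', '2017-06-02'],
    'HEADLINE': ['Markets rally', np.nan, 'Rates hold steady'],
})

# Keep only the rows whose HEADLINE cell is not a str (NaN shows up as float)
bad_rows = data_full[~data_full['HEADLINE'].apply(lambda v: isinstance(v, str))]
print(bad_rows)
```

Printing `bad_rows` shows exactly which rows carry a non-string value, so you can inspect what they look like.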
Comment From: jschendel
Do you have any missing values in your `'HEADLINE'` column? You can check with `data_full['HEADLINE'].isnull().any()`.

Missing values would be read in as `np.nan` despite the `dtype` specification, and the type of `np.nan` is float. Using `astype(str)` would convert `np.nan` values to the string `'nan'`, which would explain why it works after doing so, though I don't know that it's working as you intended, as you'd get something like `'string1| nan| string3'` from the `join`.
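This behaviour is easy to see with a small in-memory CSV (a sketch; the file contents are made up):

```python
import io
import pandas as pd

# A tiny CSV with one empty HEADLINE field
csv = io.StringIO('day,HEADLINE\n1,first\n2,\n3,third\n')
s = pd.read_csv(csv, dtype={'HEADLINE': str})['HEADLINE']

print([type(v).__name__ for v in s])  # the empty field is a float (NaN), despite dtype=str
print('| '.join(s.astype(str)))       # astype(str) turns NaN into the string 'nan'
```

The join succeeds after `astype(str)`, but the missing value silently becomes the literal text `nan` in the output.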
Comment From: randomgambit
I think you found the correct solution!!
That raises the question: is that the expected output? Shouldn't the missing values be read in as `""` instead of `np.nan` when the user specifies `dtype=str`?
Comment From: TomAugspurger
I don't think so (at least not by default). There is the `keep_default_na=False` option to `read_csv`, if you'd like to disable that.
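A quick sketch of the difference, using a made-up in-memory CSV:

```python
import io
import pandas as pd

csv = 'day,HEADLINE\n1,first\n2,\n3,third\n'

# Default: the empty field comes back as NaN, a float
with_na = pd.read_csv(io.StringIO(csv), dtype={'HEADLINE': str})['HEADLINE']
print(with_na.isnull().any())  # True

# keep_default_na=False: the empty field stays an empty string
no_na = pd.read_csv(io.StringIO(csv), dtype={'HEADLINE': str},
                    keep_default_na=False)['HEADLINE']
print(no_na.tolist())
print('| '.join(no_na))        # joins cleanly, no TypeError and no 'nan' artifacts
```

Note that with `keep_default_na=False`, strings like `"NA"` or `"null"` in the data are also kept verbatim rather than treated as missing, which may or may not be what you want.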
Comment From: jreback
closing as a usage issue.
Comment From: jowagner
@randomgambit An empty string cell is not necessarily a missing value. Missing value means that the true value that should have been recorded is unavailable. In some data files, an empty string is part of the set of possible true observations.