Code Sample, a copy-pastable example if possible
df['url'] = df['text'].str.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
df['namedAuthor'] = df['text'].str.findall(r'(?<=\s)@\w*') # get the named authors
df['hash'] = df['text'].str.findall(r'(?<=\s)#\w*') # get the hashtags
df['textGood'] = df['text'].str.replace('RT','')
df['textGood'] = df['textGood'].str.replace(r'@\S+','')
df['textGood'] = df.apply (lambda row: re.sub(r"http\S+",'',row["textGood"]),axis=1)
df['textGood'] = df.apply (lambda row: re.sub(r"#",'',row["textGood"]),axis=1)
df['tokenized'] = df.apply (lambda row: nltk.word_tokenize(row["textGood"]),axis=1)
df['posTag'] = df.apply (lambda row: nltk.pos_tag(row["tokenized"]),axis=1)
df['chuncked'] = df.apply (lambda row: nltk.ne_chunk(row["posTag"], binary=True),axis=1)
df['Names'] = df.apply (lambda row: getNames(row["posTag"]),axis=1)
df['Adj'] = df.apply (lambda row: getAdj(row["posTag"]),axis=1)
Problem description
I have a pandas DataFrame with a "text" column. Each text (a tweet) contains valuable information such as URLs, named authors (@Blabla) and hashtags (#onioq) that I'd like to save in separate columns (the three str.findall lines) before I clean the text for an NLTK-based analysis (the remaining lines). My problem: the code works whenever I use only one of the three str.findall lines, but fails with the traceback below when I use two or more. What am I missing?
Traceback (most recent call last):
  File "test.py", line 87, in <module>
    df['chuncked'] = df.apply (lambda row: nltk.ne_chunk(row["posTag"], binary=True),axis=1)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 4152, in apply
    return self._apply_standard(f, axis, reduce=reduce)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 4265, in _apply_standard
    result = self._constructor(data=results, index=index)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 266, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 402, in _init_dict
    return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/frame.py", line 5408, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 4267, in create_block_manager_from_arrays
    construction_error(len(arrays), arrays[0].shape, axes, e)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 4229, in construction_error
    raise e
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 4262, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 4359, in form_blocks
    object_blocks = _simple_blockify(object_items, np.object_)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 4389, in _simple_blockify
    values, placement = _stack_arrays(tuples, dtype)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/internals.py", line 4453, in _stack_arrays
    stacked[i] = _asarray_compat(arr)
ValueError: could not broadcast input array from shape (28) into shape (15)
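The thread never confirms a root cause, so the following is only a sketch of one way the pipeline could be restructured. The traceback shows the error being raised while DataFrame.apply(..., axis=1) assembles its per-row results, so the sketch applies each step to the single column it needs (Series.apply) instead of to whole rows. The sample tweets, the simplified URL pattern, and the clean helper are invented for illustration, and str.split stands in for nltk.word_tokenize so that the snippet runs without NLTK data.

```python
import re

import pandas as pd

# Invented sample data; the original report does not include any.
df = pd.DataFrame({"text": [
    "RT @alice check https://example.com #pandas rocks",
    "hello @bob and @carol #nltk #python",
]})

# Extract metadata into list-valued columns first, as in the report.
df["url"] = df["text"].str.findall(r"https?://\S+")   # simplified URL pattern
df["namedAuthor"] = df["text"].str.findall(r"(?<=\s)@\w+")
df["hash"] = df["text"].str.findall(r"(?<=\s)#\w+")


def clean(text):
    """Hypothetical helper: strip the RT marker, mentions, URLs and '#'."""
    text = re.sub(r"\bRT\b", "", text)
    text = re.sub(r"@\S+", "", text)
    text = re.sub(r"http\S+", "", text)
    return text.replace("#", "").strip()


# Series.apply works on one column, so no row-wise result frame is built.
df["textGood"] = df["text"].apply(clean)
# Stand-in for nltk.word_tokenize; swap in the real tokenizer if available.
df["tokenized"] = df["textGood"].apply(str.split)

print(df[["namedAuthor", "hash", "tokenized"]])
```

With this shape, the later NLTK steps would follow the same pattern, for example df["tokenized"].apply(nltk.pos_tag), though that has not been verified against the pandas version in the traceback.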
Comment From: jreback
Please post a reproducible example (IOW, one that people can copy-paste). What exactly is the bug report here?
Comment From: jreback
closing as not reproducible