Code Sample, a copy-pastable example if possible
import pandas
import io
csv_file = io.StringIO("""datetime,temperature_husconet,lat_husconet,lon_husconet,temperature_pws,dewpoint_pws,humidity_pws,lat_pws,lon_pws,temperature_eddh,dewpoint_eddh,windspeed_eddh,windgust_eddh,winddirection_eddh,pressure_eddh,humidity_eddh,precipitation_eddh,cloudcover_eddh_BKN,cloudcover_eddh_CAVOC,cloudcover_eddh_FEW,cloudcover_eddh_NSC,cloudcover_eddh_OVC,cloudcover_eddh_SCT,cloudcover_eddh_VV
2016-01-01 00:00:00,5.688,53.551105624479334,9.984235400036294,2.9,1.9,93.0,53.57898,9.79428,3.0,3.0,7.2,0.0,160.0,1023.0,94.0,,0,1,0,0,0,0,0
2016-01-01 00:00:00,4.508,53.540879976582616,9.995115621037971,2.9,1.9,93.0,53.57898,9.79428,3.0,3.0,7.2,0.0,160.0,1023.0,94.0,,0,1,0,0,0,0,0
2016-01-01 00:02:00,5.694,53.551105624479334,9.984235400036294,2.6,2.4,99.0,53.570890000000006,9.88888,3.0,3.0,7.2,0.0,160.0,1023.0,94.0,,0,1,0,0,0,0,0
2016-01-01 00:02:00,4.367,53.540879976582616,9.995115621037971,2.6,2.4,99.0,53.570890000000006,9.88888,3.0,3.0,7.2,0.0,160.0,1023.0,94.0,,0,1,0,0,0,0,0
""")
data_df = pandas.read_csv(
    csv_file,
    index_col="datetime",
    parse_dates=["datetime"]
)
df_hour = pandas.get_dummies(data_df.index.hour, prefix="hour")
df_hour_dict_version = {column: df_hour[column] for column in df_hour.columns}
data_df = data_df.assign(**df_hour_dict_version)
Problem description
I want to add the dummy variables back to the data frame because later in the process I will remove the index.
While the dummies are held in the dictionary df_hour_dict_version, everything is fine: the dummy columns are Series.
They have dtype uint8, even though boolean would be more appropriate for dummy variables, but that is how get_dummies is implemented.
But for some reason, after the assign call the Series turn into NaN columns. All of a sudden, all the 0's and 1's are replaced by NaN!
Expected Output
Keep the series as they are.
Output of pd.show_versions()
Comment From: TomAugspurger
That's because your indexes don't match.
In [21]: df_hour.index
Out[21]: RangeIndex(start=0, stop=4, step=1)
In [22]: data_df.index
...:
...:
Out[22]:
DatetimeIndex(['2016-01-01 00:00:00', '2016-01-01 00:00:00',
'2016-01-01 00:02:00', '2016-01-01 00:02:00'],
dtype='datetime64[ns]', name='datetime', freq=None)
In [23]: data_df['foo'] = df_hour
In [24]: data_df['foo']
Out[24]:
datetime
2016-01-01 00:00:00 NaN
2016-01-01 00:00:00 NaN
2016-01-01 00:02:00 NaN
2016-01-01 00:02:00 NaN
Name: foo, dtype: float64
.assign is doing the same thing (you can check the source, it's quite short), so all the usual alignment rules apply. df_hour is aligned to the original index, and NaNs are inserted. Your best bet is to set df_hour.index = data_df.index before assigning.
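A minimal sketch of that suggestion, assuming a cut-down version of the CSV from the report (only one data column is kept, and the names broken_df and fixed_df are made up for illustration). It first reproduces the NaN columns and then applies the index fix:

```python
import io

import pandas as pd

# Cut-down stand-in for the CSV in the report: same timestamps, one data column.
csv_file = io.StringIO(
    "datetime,temperature_husconet\n"
    "2016-01-01 00:00:00,5.688\n"
    "2016-01-01 00:00:00,4.508\n"
    "2016-01-01 00:02:00,5.694\n"
    "2016-01-01 00:02:00,4.367\n"
)
data_df = pd.read_csv(csv_file, index_col="datetime", parse_dates=["datetime"])

# get_dummies on the hour values returns a frame with a fresh RangeIndex (0..3).
df_hour = pd.get_dummies(data_df.index.hour, prefix="hour")

# Reproduce the problem: the RangeIndex labels are aligned against the
# DatetimeIndex of data_df, nothing matches, and every value becomes NaN.
broken_df = data_df.assign(**{col: df_hour[col] for col in df_hour.columns})

# Suggested fix: give the dummies the same index as data_df before assigning.
df_hour.index = data_df.index
fixed_df = data_df.assign(**{col: df_hour[col] for col in df_hour.columns})

print(broken_df["hour_0"].isna().all())  # every value was lost to alignment
print(fixed_df["hour_0"].notna().all())  # the dummy values survive
```

Because the index of df_hour now equals the index of data_df, the values are taken over position by position, which also works here even though the timestamps contain duplicates.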