I have a very long CSV file with some bad lines. Here is a sample of a bad line between some good lines:
2016.07.03 09:43:38\t0.01055556\t16.146\t26.9444444444444
2016.07.03 09:43:43\t0.01194444\t16.146\t26.9883333333333
2016.07.03 09:43:48\t0.01333333\t16.105\t27.0383333333333
2016.07.03 09:43:53\t0.01472222\t16.102\t42731,00
2016.07.03 09:43:58\t0.01611111\t16.138\t27.1822222222222
So basically the sensor made a mistake when writing the 4th line and wrote 42731,00 instead of an actual number. I want to just skip lines like that, so I read the file with the following statement:
a = pd.read_csv(StringIO(bdy), sep = '\t', skiprows = 2, header = None, error_bad_lines = False, warn_bad_lines = True, parse_dates = [0], decimal = '.')
This results in the 4th column of the resulting df being of type object (a string) instead of float64. If I try to force the reader to parse it by doing:
pd.read_csv(StringIO(bdy), sep = '\t', skiprows = 2, header = None, error_bad_lines = False, warn_bad_lines = True, parse_dates = [0], decimal = '.', dtype={1:np.float64,2:np.float64,3:np.float64})
I get: ValueError: could not convert string to float... The line is not being ignored even though I have set error_bad_lines = False. So how can I actually parse this correctly, preferably without having to supply dtype? Is this a bug?
Comment From: sinhrks
error_bad_lines is intended to handle the exception raised when a line has the wrong number of fields. In your data the line itself is valid (not bad), and the error is caused by the later dtype coercion.
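To illustrate the distinction, here is a minimal sketch using made-up two-column data (the strings and variable names are hypothetical, not taken from the original file). It assumes the default parser settings: a row with an extra field raises a parser error, while a row with a malformed number parses without complaint and only turns the column non-numeric.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: the second row has one field too many.
too_many_fields = "1\t2.0\n3\t4.0\t5.0\n6\t7.0\n"
# Hypothetical data: same shape, but the second row holds a malformed number.
malformed_value = "1\t2.0\n3\t4,00\n6\t7.0\n"

# A wrong field count is what the parser treats as a "bad line".
try:
    pd.read_csv(StringIO(too_many_fields), sep="\t", header=None)
    field_count_error = False
except pd.errors.ParserError:
    field_count_error = True

# A malformed value is not a "bad line": all three rows are kept,
# and the affected column is simply not parsed as numeric.
df = pd.read_csv(StringIO(malformed_value), sep="\t", header=None)
column_is_numeric = pd.api.types.is_numeric_dtype(df[1])
```

Only the first case is something a bad-line option can skip; the second has to be dealt with after parsing or through parser options such as thousands.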
Your problem can be solved if you specify thousands.
from io import StringIO

import pandas as pd

bdy = u"""2016.07.03 09:43:38\t0.01055556\t16.146\t26.9444444444444
2016.07.03 09:43:43\t0.01194444\t16.146\t26.9883333333333
2016.07.03 09:43:48\t0.01333333\t16.105\t27.0383333333333
2016.07.03 09:43:53\t0.01472222\t16.102\t42731,00
2016.07.03 09:43:58\t0.01611111\t16.138\t27.1822222222222"""

# Note: error_bad_lines / warn_bad_lines were deprecated in pandas 1.3 and
# removed in 2.0; on newer versions use on_bad_lines='warn' instead.
a = pd.read_csv(StringIO(bdy), sep = '\t',
                skiprows = 2,
                header = None, error_bad_lines = False, warn_bad_lines = True,
                parse_dates = [0], decimal = '.', thousands=',')
a.dtypes
# 0 datetime64[ns]
# 1 float64
# 2 float64
# 3 float64
# dtype: object
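One caveat: with thousands=',' the bad value is kept and read as a number (most likely 4273100.0) rather than skipped. If the goal is to drop such rows entirely, one possible alternative is to coerce the column after reading and discard whatever fails to convert. The following is only a sketch of that idea, not something suggested in the thread; the variable names raw and clean are made up, and date parsing is left out to keep it short.

```python
from io import StringIO

import pandas as pd

# Three rows from the sample above; the middle one holds the bad value.
bdy = (
    "2016.07.03 09:43:48\t0.01333333\t16.105\t27.0383333333333\n"
    "2016.07.03 09:43:53\t0.01472222\t16.102\t42731,00\n"
    "2016.07.03 09:43:58\t0.01611111\t16.138\t27.1822222222222\n"
)

# Read without forcing dtypes; column 3 comes back as text because of the bad row.
raw = pd.read_csv(StringIO(bdy), sep="\t", header=None)

# Values that cannot be parsed as numbers become missing instead of raising.
raw[3] = pd.to_numeric(raw[3], errors="coerce")

# Drop the rows whose value could not be converted.
clean = raw.dropna(subset=[3]).reset_index(drop=True)
```

This keeps only the two good rows and leaves column 3 numeric, at the cost of a second pass over the column.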