I am outputting a single column from a dataframe to a CSV file. The data, however, is too long for some older downstream applications, so I insert a line break every size items (size is set in the code below) so that individual rows in the output aren't too long.
Example
from tkinter import Tk
from tkinter import filedialog
import pandas as pd
import numpy as np
Tk().withdraw() # we don't want a full GUI, so keep the root window from appearing
infilename = filedialog.askopenfilename()# show an "Open" dialog box and return the path to the selected file
data = pd.read_csv(infilename, header=None) #usecols=[0], only get 1st column, specify no header
outfilename = filedialog.asksaveasfile() #get save location
size=50 #number of items per line
col=0
indexes = np.arange(0,len(data),size)#have to use numpy since range is now an immutable type in python 3
indexes = np.append(indexes,[len(data)]) #add the uneven final index
for i in range(len(indexes)-1):
    holder = pd.DataFrame(data.iloc[indexes[i]:indexes[i+1],col]).T
    holder.to_csv(outfilename, index=False, header=False)
Expected Output
for input: A,B,C,D,E,F,G,H,I,J,K #this is actually a column in the input, rendered horizontal for space
Becomes in the file (size=3):
A,B,C
D,E,F
G,H,I
J,K
Despite not throwing any errors, the final loop iteration (with the uneven final index) does not write to the file, even though the information is assigned to holder without issue. Also, despite my not setting the mode param, to_csv acts as mode='w' and overwrites the file if it exists on the first loop, then acts as mode='a' and appends to the file on subsequent loops, indicating some possible statefulness which may be relevant. Since no errors are thrown, I cannot figure out why the final information is not being written.
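A possible explanation, not confirmed anywhere in this thread: filedialog.asksaveasfile() returns an already-open file object rather than a path. Every to_csv call would then write to the same handle, which looks like appending, and rows still sitting in the buffer only reach the disk once the handle is flushed or closed. Below is a minimal sketch under that assumption. The helper write_chunked, the temporary output path, and the A..K sample data are hypothetical stand-ins, not part of the original report.

```python
import os
import tempfile

import pandas as pd


def write_chunked(data, path, size, col=0):
    """Write column `col` of `data` as CSV rows of at most `size` items each."""
    # Open the file once; leaving the with-block closes the handle,
    # which flushes any rows that are still buffered.
    with open(path, "w", newline="") as handle:
        for start in range(0, len(data), size):
            chunk = data.iloc[start:start + size, col]
            # to_frame().T turns the column slice into a single-row frame.
            chunk.to_frame().T.to_csv(handle, index=False, header=False)


# Hypothetical stand-in for the input file: one column holding A..K.
data = pd.DataFrame(list("ABCDEFGHIJK"))
out_path = os.path.join(tempfile.mkdtemp(), "chunked.csv")
write_chunked(data, out_path, size=3)

with open(out_path) as result:
    print(result.read())
```

With the tkinter version, the equivalent would presumably be to close the object returned by asksaveasfile() after the loop, or to ask for a path with filedialog.asksaveasfilename() instead.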
Output of pd.show_versions()
Comment From: TomAugspurger
Is tkinter necessary to reproduce the problem? Could you maybe simplify the example?
Comment From: Void2258
OK, I just tried hardcoding the paths and removing all tkinter. Now it writes the data perfectly, BUT it no longer shows the variable mode behavior: without an explicit mode it repeatedly overwrites the single line of data, although it does write the last set successfully. The example below includes a modified for loop to account for this.
import pandas as pd
import numpy as np
infilename = 'C:\\Users\\...infile.csv'
data = pd.read_csv(infilename, header=None)
outfilename = 'C:\\Users\\...test.txt'
size=50 #number of items per line
col=0
indexes = np.arange(0,len(data),size)#have to use numpy since range is now an immutable type in python 3
indexes = np.append(indexes,[len(data)]) #add the uneven final index
for i in range(len(indexes)-1):
    holder = pd.DataFrame(data.iloc[indexes[i]:indexes[i+1],col]).T
    if i == 0:
        holder.to_csv(outfilename, index=False, header=False)
    else:
        holder.to_csv(outfilename, index=False, header=False, mode='a')
Comment From: TomAugspurger
@Void2258 we still can't run that example since we don't have a 'C:\\Users\\...infile.csv' file on our computers. It doesn't look like read_csv is needed at all here. You could make the dataframe using pd.DataFrame().
Either way, I suspect the problem is with your looping logic, and not to_csv. I've tested that out and the appending works as expected. Maybe try Stack Overflow if you have a question about getting the loop logic worked out.
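For reference, a self-contained version of the path-based example might look like the sketch below. It builds the frame with pd.DataFrame() instead of reading a file. The A..K sample data, size=3, and the temporary output path are assumptions chosen to match the expected output in the original post.

```python
import os
import tempfile

import pandas as pd

# Assumed sample data: a single column holding A..K.
data = pd.DataFrame(list("ABCDEFGHIJK"))
size = 3  # number of items per line
col = 0
out_path = os.path.join(tempfile.mkdtemp(), "test.txt")

# range() can be iterated directly, so numpy is not needed for the indexes.
for i, start in enumerate(range(0, len(data), size)):
    holder = data.iloc[start:start + size, col].to_frame().T
    # Overwrite on the first chunk, append on every later one.
    mode = "w" if i == 0 else "a"
    holder.to_csv(out_path, index=False, header=False, mode=mode)

with open(out_path) as result:
    print(result.read())
```

Because a path is passed here, to_csv opens and closes the file on every call, so each chunk is on disk before the next iteration starts.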