When I run the code below, I get a SettingWithCopyWarning and I am unsure what I am doing wrong. Is there a way to write this without triggering the warning?
for chrom, df in grouped:
    bins = np.arange(0, 2e6, 1e5)
    df.ix[:,'Bins'] = pd.cut(df.Pos, bins)
A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
Comment From: jreback
You need to post more code.
Is there a reason you are not doing:
df['Bins'] = df.groupby(...).Pos.apply(lambda x: pd.cut(x, bins))
?
Comment From: alvinwt
@jreback I tried to clean it up a little by removing some of the variables. The problem is that I need Chr as a key to look up chromLen for the bins, and I am wondering whether that can be done with the groupby(...).apply(lambda x: pd.cut(x,bins[Chr])) pattern.
grouped = inputDf.groupby('Chr')
for chrom, df in grouped:
    try:
        chromLen = chromDict[chrom]
    except KeyError as e:
        if 'chrY' in str(e):
            pass
    # make 1 Mbp bins
    bins = np.arange(0, int(chromLen)+resolution, resolution)
    df.ix[:,'Bins'] = pd.cut(df.Pos,bins)
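One possible way to keep the per-chromosome lookup without writing into the group frames is sketched below: compute the bin index for each group, collect the pieces, and assign once to the original frame. The chromosome lengths, the resolution, and the names chrom_len and Bin are made up for illustration and are not from the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical chromosome lengths and bin width, for illustration only.
chrom_len = {"chr1": 1000, "chr2": 600}
resolution = 200

data = pd.DataFrame({"Chr": ["chr1", "chr1", "chr2", "chr2"],
                     "Pos": [10, 450, 200, 550]})

pieces = []
for chrom, group in data.groupby("Chr"):
    # The group key is available here, so each chromosome gets its own edges.
    edges = np.arange(0, chrom_len[chrom] + resolution, resolution)
    # labels=False returns the integer index of the bin each position falls in.
    pieces.append(pd.cut(group["Pos"], edges, labels=False))

# Assign to the original frame, not to a group; rows are aligned by index.
data["Bin"] = pd.concat(pieces)
```

Because the assignment targets data itself and not a slice handed out by groupby, there is no copy for pandas to warn about.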
Comment From: jreback
You need to show a complete example, e.g. generate a test frame and show each of the operations that you intend.
Comment From: alvinwt
The operation that I intend is mainly the assignment of bins using the output of pd.cut.
def count_per_bin(inputDf,resolution=1e4):
    grouped = inputDf.groupby('Chr')
    countsDict = {}
    binnedList = []
    chromDict = {'chr1':24925,'chr2':24319}
    for chrom, df in grouped:
        try:
            chromLen = chromDict[chrom]
        except KeyError as e:
            if 'chrY' in str(e):
                pass
        ## make 1 Mbp bins and use pd.cut to allocate bins
        bins = np.arange(0, int(chromLen)+resolution, resolution)
        df.ix[:,'Bins'] = pd.cut(df.Pos,bins)
        binCounts = pd.DataFrame(df.groupby('Bins').Pos.count())
        countsDict[chrom] = binCounts
    return countsDict

data = pd.DataFrame({'Chr':['chr1','chr1','chr2','chr2'],'Pos':[10,100,200,550]})
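If only the per-bin counts are needed, one option is to skip the column assignment and count the pd.cut result directly. The following is a minimal sketch under that assumption, not the author's method. Passing chrom_len as a parameter and skipping chromosomes that are missing from it are choices made here for illustration.

```python
import numpy as np
import pandas as pd

def count_per_bin(input_df, chrom_len, resolution=10_000):
    """Count positions per bin for each chromosome without mutating any group."""
    counts = {}
    for chrom, group in input_df.groupby("Chr"):
        if chrom not in chrom_len:
            # Assumption: chromosomes with no known length are skipped.
            continue
        edges = np.arange(0, chrom_len[chrom] + resolution, resolution)
        # Count the binned positions directly; nothing is written to `group`.
        counts[chrom] = pd.cut(group["Pos"], edges).value_counts(sort=False)
    return counts

data = pd.DataFrame({"Chr": ["chr1", "chr1", "chr2", "chr2"],
                     "Pos": [10, 100, 200, 550]})
counts = count_per_bin(data, {"chr1": 24925, "chr2": 24319})
```

Since nothing is assigned to the group frames, the warning has no assignment to attach to.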
Comment From: jreback
ok. not very clear to me what you are trying to accomplish. can you post what you actually want as output?
Comment From: dejmail
@jreback
Hi there, I will try to explain what @alvinwt was/is seeing as I am getting the same message, and am also working with similar data.
I have a dataframe that has the following test structure (4 columns):
chromosomes = ("chrI", "chrII", "chrIII", "chrIV", "chrV")
chromosome_list = np.sort(np.random.choice(list(chromosomes), 100))
bin_list = []
for i, x in enumerate(chromosome_list):
    if chromosome_list[i] == chromosome_list[i-1]:
        bin_list.append(bin_list[i-1]+np.random.randint(0,200))
    else:
        bin_list.append(0)
test_dict = {'chromosome': chromosome_list,
             'start': bin_list,
             'stop': bin_list,
             'value': [i for i in np.random.randint(0,25, 100)]}
test_df = pd.DataFrame(test_dict)
chrI 9 9 3
chrI 10 10 16
chrI 11 11 19
chrI 12 12 29
chrI 13 13 5
chrI 14 14 17
chrI 15 15 3
chrI 16 16 11
chrI 17 17 7
The first column is a list of values ranging from chrI to chrV (though it could be any length really), and what we want to do is place all the reads in the third column in a bin, in my case a bin of length 200 (0-200, 201-400, 401-600 etc) and then sum the values that fall into the bin. Fairly common use case.
My code that does the binning is as follows:
binned_df = pd.DataFrame()
for chromosome, chrsize in chromosomes.items():
    # each chromosome is a different length, so # bins will differ
    chr_subset = test_df[(test_df.chromosome == chromosome)]
    chr_bins = np.arange(0, chrsize, int(200))
    chr_subset.loc[:,'bins'] = pd.cut(x=chr_subset.loc[:,'start'],
                                      bins=chr_bins,
                                      labels=chr_bins[:-1],
                                      include_lowest=True)
    binned_df = pd.concat([binned_df,
                           chr_subset.groupby(['chromosome','bins'],
                                              axis=0,
                                              as_index=False).value.sum()])
binned_df = binned_df.fillna(0)
Which gives a dataframe output as below
chromosome | bin | value
chrI | 0 | 43.0
chrI | 200 | 6.0
chrI | 400 | 20.0
chrI | 600 | 24.0
chrI | 800 | 10.0
chrI | 1000 | 52.0
chrI | 1200 | 34.0
It also produces the following warning, which comes from the pd.cut line:
A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
pandas version: 0.20.3 Python 3.6.2
Any comments or suggestions would be appreciated.
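For fixed-width bins that start at 0, the bin-and-sum described above can also be sketched without pd.cut, using integer division to label each row with the start of its bin. This swaps in a different technique from the one in the comment: bins are left-closed ([0, 200), [200, 400), ...) and empty bins are not emitted. The small frame and the name bin_width are made up for illustration.

```python
import pandas as pd

# Small hypothetical frame with the same columns as the test data above.
test_df = pd.DataFrame({
    "chromosome": ["chrI", "chrI", "chrI", "chrII", "chrII"],
    "start": [9, 150, 230, 40, 610],
    "value": [3, 16, 19, 5, 7],
})
bin_width = 200

binned_df = (
    # assign() returns a new frame, so no slice of test_df is modified.
    test_df.assign(bin=(test_df["start"] // bin_width) * bin_width)
    .groupby(["chromosome", "bin"], as_index=False)["value"]
    .sum()
)
```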
Comment From: TomAugspurger
@dejmail, can you check your example? chromosomes is a tuple, so your example fails on .items(). You can also remove debug code like set_trace.
Anyway, I suspect that changing
chr_subset = test_df[(test_df.chromosome == chromosome)]
to
chr_subset = test_df[(test_df.chromosome == chromosome)].copy()
will fix the warning. It's unclear whether test_df can / should be updated when you update chr_subset in a couple lines.
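A minimal sketch of the .copy() suggestion, wrapped in a hypothetical helper (bin_chromosome, its width parameter, and the sample data are assumptions made for illustration): the filtered subset is copied before the new column is added, so the assignment clearly targets an independent frame and test_df is left untouched.

```python
import numpy as np
import pandas as pd

def bin_chromosome(df, chromosome, chrsize, width=200):
    """Return the rows for one chromosome with an added 'bins' column."""
    # .copy() makes the subset an independent frame, not a slice of df.
    subset = df[df.chromosome == chromosome].copy()
    edges = np.arange(0, chrsize + width, width)
    subset["bins"] = pd.cut(subset["start"], bins=edges,
                            labels=edges[:-1], include_lowest=True)
    return subset

test_df = pd.DataFrame({"chromosome": ["chrI", "chrI", "chrII", "chrII"],
                        "start": [10, 250, 30, 410],
                        "value": [3, 16, 19, 5]})
result = bin_chromosome(test_df, "chrI", chrsize=600)
```

Whether the original frame should also receive the new column is a separate decision; with the copy, it does not.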
Comment From: dejmail
@TomAugspurger Thanks for the reply. chromosomes is indeed a tuple, but it is unpacked into chromosome and chrsize and it gives no error on my side.
Indeed, including the .copy() as you suggested gets rid of the warning. Perhaps we can close this now.