Pandas SettingWithCopyWarning and pd.cut

When I run the code below, I get a SettingWithCopyWarning and I am unsure what I am doing wrong. Is there a way to write this without triggering the warning?

for chrom, df in grouped:
    bins = np.arange(0, 2e6, 1e5)
    df.ix[:,'Bins'] = pd.cut(df.Pos, bins)

A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

Comment From: jreback

you need to post more code.

is there a reason you are not doing:

df['Bins'] = df.groupby(...).Pos.apply(lambda x: pd.cut(x, bins))

Comment From: alvinwt

@jreback I tried to clean it up a little by removing some of the variables. The problem is that I need Chr as the key to look up chromLen when building the bins, and I am wondering whether that can be done with the groupby(...).apply(lambda x: pd.cut(x,bins[Chr])) pattern.

grouped = inputDf.groupby('Chr')

for chrom, df in grouped:

    try:
        chromLen = chromDict[chrom]
    except KeyError as e:
        if 'chrY' in str(e):
            pass

    # make 1 Mbp bins
    bins = np.arange(0, int(chromLen)+resolution, resolution)
    df.ix[:,'Bins'] = pd.cut(df.Pos,bins)
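
On the question of using the group key to pick per-chromosome bins: iterating over a groupby yields the key alongside each group, so the lookup can happen inside the loop. The following is only a minimal sketch under assumptions. The chrom_len values, the resolution, and the sample positions are hypothetical, and labels=False is used so that pd.cut returns integer bin indices that concatenate cleanly. Because the new column is assigned on the original frame rather than on a group, it should not trigger the warning.

```python
import numpy as np
import pandas as pd

# Hypothetical chromosome lengths and bin width, for illustration only.
chrom_len = {"chr1": 1000, "chr2": 600}
resolution = 200

input_df = pd.DataFrame({"Chr": ["chr1", "chr1", "chr2", "chr2"],
                         "Pos": [10, 450, 200, 550]})

pieces = []
for chrom, pos in input_df.groupby("Chr")["Pos"]:
    # The group key selects the bin edges for this chromosome.
    bins = np.arange(0, chrom_len[chrom] + resolution, resolution)
    # labels=False returns the integer index of the bin instead of an interval.
    pieces.append(pd.cut(pos, bins, labels=False))

# Each piece keeps its original row labels, so the assignment aligns on the index.
input_df["Bin"] = pd.concat(pieces)
```

Note that pd.cut uses right-closed intervals by default, so a position of exactly 200 lands in the first bin (0, 200].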

Comment From: jreback

you need to show a complete example, e.g. generate a test frame and show each of the operations that you intend.

Comment From: alvinwt

The operation that I intend is mainly the assignment of bins using the output of pd.cut.

def count_per_bin(inputDf,resolution=1e4):
    grouped = inputDf.groupby('Chr')
    countsDict = {}
    binnedList = []

    chromDict = {'chr1':24925,'chr2':24319}

    for chrom, df in grouped:

        try: 
            chromLen =  chromDict[chrom]
        except KeyError as e:
            if 'chrY' in str(e):
                pass

        ## make 1 Mbp bins and use pd.cut to allocate bins
        bins = np.arange(0, int(chromLen)+resolution, resolution)
        df.ix[:,'Bins'] =  pd.cut(df.Pos,bins)

        binCounts =  pd.DataFrame(df.groupby('Bins').Pos.count())
        countsDict[chrom] = binCounts
    return countsDict


data = pd.DataFrame({'Chr':['chr1','chr1','chr2','chr2'],'Pos':[10,100,200,550]})
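
If the per-bin counts are the only goal, one option is to skip the column assignment entirely and count the output of pd.cut directly. This is a sketch rather than a drop-in replacement for count_per_bin: it reuses the chromDict values and the data frame from above, but assumes an integer resolution of 10000 (instead of the float 1e4) and labels each bin by its left edge.

```python
import numpy as np
import pandas as pd

chrom_dict = {"chr1": 24925, "chr2": 24319}
resolution = 10000  # assumed integer bin width

data = pd.DataFrame({"Chr": ["chr1", "chr1", "chr2", "chr2"],
                     "Pos": [10, 100, 200, 550]})

counts_by_chrom = {}
for chrom, group in data.groupby("Chr"):
    bins = np.arange(0, chrom_dict[chrom] + resolution, resolution)
    # Label each bin by its left edge; nothing is written back to `group`.
    binned = pd.cut(group["Pos"], bins, labels=bins[:-1])
    # Count the rows per bin, in bin order rather than by frequency.
    counts_by_chrom[chrom] = binned.value_counts(sort=False)
```

Since nothing is assigned to the group, there is no write to a possible copy for pandas to warn about.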

Comment From: jreback

ok. not very clear to me what you are trying to accomplish. can you post what you actually want as output?

Comment From: dejmail

@jreback

Hi there, I will try to explain what @alvinwt was/is seeing as I am getting the same message, and am also working with similar data.

I have a dataframe that has the following test structure (4 columns):


chromosomes = ("chrI", "chrII", "chrIII", "chrIV", "chrV")
chromosome_list = np.sort(np.random.choice(list(chromosomes), 100))

bin_list = []
for i, x  in enumerate(chromosome_list):
    if chromosome_list[i] == chromosome_list[i-1]:
        bin_list.append(bin_list[i-1]+np.random.randint(0,200))
    else:
        bin_list.append(0)

test_dict = {'chromosome': chromosome_list,
             'start': bin_list,
             'stop': bin_list,
             'value': [i for i in np.random.randint(0,25, 100)]}

test_df = pd.DataFrame(test_dict)

chromosome  start  stop  value
chrI        9      9     3
chrI        10     10    16
chrI        11     11    19
chrI        12     12    29
chrI        13     13    5
chrI        14     14    17
chrI        15     15    3
chrI        16     16    11
chrI        17     17    7

The first column is a list of values ranging from chrI to chrV (though it could be any length really), and what we want to do is place all the reads in the third column in a bin, in my case a bin of length 200 (0-200, 201-400, 401-600 etc) and then sum the values that fall into the bin. Fairly common use case.

My code that does the binning is as follows:

binned_df = pd.DataFrame()
for chromosome, chrsize in chromosomes.items():

        # each chromosome is a different length, so # bins will differ
        chr_subset = test_df[(test_df.chromosome == chromosome)]

        chr_bins = np.arange(0, chrsize, int(200))
        chr_subset.loc[:,'bins'] = pd.cut(x=chr_subset.loc[:,'start'],
                                    bins=chr_bins,
                                    labels=chr_bins[:-1],
                                    include_lowest=True)

        binned_df = pd.concat([binned_df,
                               chr_subset.groupby(['chromosome','bins'],
                                                                  axis=0,
                                                                  as_index=False).value.sum()])
        binned_df = binned_df.fillna(0)

Which gives a dataframe output as below

chromosome | bin | value
chrI | 0 | 43.0
chrI | 200 | 6.0
chrI | 400 | 20.0
chrI | 600 | 24.0
chrI | 800 | 10.0
chrI | 1000 | 52.0
chrI | 1200 | 34.0

It also produces the following warning, which comes from the pd.cut line:

A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

pandas version: 0.20.3 Python 3.6.2

Any comments or suggestions would be appreciated.

Comment From: TomAugspurger

@dejmail, can you check your example? chromosomes is a tuple, so your example fails on .items(). You can also remove debug code like set_trace.

Anyway, I suspect that changing

chr_subset = test_df[(test_df.chromosome == chromosome)]

to

chr_subset = test_df[(test_df.chromosome == chromosome)].copy()

will fix the warning. It's unclear whether test_df can / should be updated when you update chr_subset in a couple lines.
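
A minimal sketch of the loop with that change applied, under a few assumptions: chromosomes is written as a dict of hypothetical sizes so that .items() works, the sample rows are made up, and the bin edges extend one step past chrsize so the last position is covered. It is an illustration of the .copy() suggestion, not the original poster's exact code.

```python
import numpy as np
import pandas as pd

# Hypothetical chromosome sizes; a dict so that .items() yields (name, size).
chromosomes = {"chrI": 600, "chrII": 400}

test_df = pd.DataFrame({
    "chromosome": ["chrI", "chrI", "chrI", "chrII", "chrII"],
    "start": [0, 150, 450, 10, 250],
    "value": [3, 16, 19, 5, 7],
})

pieces = []
for chromosome, chrsize in chromosomes.items():
    # .copy() makes chr_subset an independent frame, so adding a column
    # to it is unambiguous and leaves test_df untouched.
    chr_subset = test_df[test_df["chromosome"] == chromosome].copy()

    # Extend the edges one step past chrsize so the final bin covers the end.
    chr_bins = np.arange(0, chrsize + 200, 200)
    chr_subset["bins"] = pd.cut(chr_subset["start"], bins=chr_bins,
                                labels=chr_bins[:-1], include_lowest=True)

    # observed=False keeps empty bins in the result, with a sum of 0.
    summed = chr_subset.groupby("bins", observed=False)["value"].sum()
    piece = summed.rename_axis("bin").reset_index()
    piece["chromosome"] = chromosome
    pieces.append(piece)

binned_df = pd.concat(pieces, ignore_index=True)
```

Collecting the per-chromosome results in a list and concatenating once at the end also avoids growing binned_df inside the loop.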

Comment From: dejmail

@TomAugspurger Thanks for the reply. chromosomes is indeed a tuple, but it is unpacked into chromosome and chrsize and it gives no error on my side.

Indeed including the .copy() as you suggested gets rid of the warning. Perhaps we can close this now.