I have a method that uses pandas to do extensive read-only calculations on an 800 MB DataFrame loaded with read_pickle. The method takes approximately 500 ms of CPU time to execute. Let's call this method search_with_pandas.
I saved this method in a module I'll call "search_with_pandas_module"; it looks something like this:
import pandas as pd

# load the 800 MB DataFrame once, at import time
core = pd.read_pickle(r"C:\test\core.bz2")

def search_with_pandas(inputs):
    # keep rows whose input is in the requested set, then count matches per item_id
    scores = core[core.input.isin(inputs)].groupby(["item_id"]).item_id.count().sort_values(ascending=False)
    return scores
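To illustrate what that expression computes, here is the same filter/groupby/count pattern applied to a toy frame (my own example with made-up values, not the real data):

import pandas as pd

# tiny stand-in for `core`: each row says that a given input points at an item
toy = pd.DataFrame({
    "input":   ["a", "a", "b", "c", "c"],
    "item_id": [101, 102, 102, 102, 103],
})

inputs = ["a", "c"]
# keep rows whose input was requested, then count how many requested inputs hit each item
scores = toy[toy.input.isin(inputs)].groupby(["item_id"]).item_id.count().sort_values(ascending=False)
print(scores)  # item_id 102 -> 2 (matched by "a" and "c"), 101 -> 1, 103 -> 1

The result is a Series of item_ids ranked by how many of the requested inputs matched them.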
Multiple users need to access this method, so I went about parallelizing the code with multiple processes. I have a 12-core machine, and I'm starting 10 processes, like so:
import multiprocessing

if __name__ == '__main__':
    PORT_START = 12000
    cpu_count = 10
    # start one long-lived worker process per port
    for x in range(cpu_count):
        proc = multiprocessing.Process(target=worker, args=(PORT_START + x,))
        proc.start()
Each of the 10 worker processes exposes a simple WSGI interface on its own port. This means each process stays alive, waiting for requests.
def worker(port):
    from bottle import route, run, request
    from time import time
    import search_with_pandas_module  # loads this process's own copy of the DataFrame

    @route('/search', method='GET')
    def search():
        time_start = time()
        # the original snippet omitted the argument; here the inputs come from the query string
        inputs = request.query.getall('input')
        response = search_with_pandas_module.search_with_pandas(inputs)
        print("time took:", time() - time_start)
        return response

    run(host="127.0.0.1", port=port)
Each worker process imports search_with_pandas_module when it starts, which means each worker process has its own copy of the 800 MB DataFrame used by the search_with_pandas method. There is no IPC/synchronization overhead to deal with between processes because each worker has its own independent copy.
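To confirm that the copies really are independent, one can check each worker's resident memory from the parent process. This is my own sketch using psutil (a third-party package, not part of the original code), and it assumes the Process objects from the startup loop are collected into a list:

import psutil  # pip install psutil

def report_worker_memory(procs):
    # print the resident set size of each worker process
    for proc in procs:
        rss_mb = psutil.Process(proc.pid).memory_info().rss / 1024 ** 2
        print(f"worker pid={proc.pid} rss={rss_mb:.0f} MB")

If each worker truly holds its own copy, each RSS should come out at roughly 800 MB or more, and total memory use should scale with the number of workers.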
I now have 10 processes on 10 different ports, each addressable at:
http://127.0.0.1:12000/search
http://127.0.0.1:12001/search
http://127.0.0.1:12002/search
...
Now the strange part. If I use one process at a time, the search_with_pandas method takes 500 ms as expected. However, when I access these concurrently, the search_with_pandas method gets slower and slower, even though each call is handled by an independent process.
no concurrency:      500 ms
2 concurrent calls:  684 ms
3 concurrent calls:  959 ms
4 concurrent calls:  988 ms
5 concurrent calls:  1193 ms
6 concurrent calls:  1423 ms
7 concurrent calls:  1567 ms
8 concurrent calls:  1812 ms
9 concurrent calls:  2096 ms
10 concurrent calls: 2253 ms
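For reference, numbers like these can be reproduced with a small concurrent client along the following lines; this is my own sketch rather than the original test harness, and the 'input' query parameter is a placeholder:

import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def call_worker(port):
    # hit one worker's /search endpoint and return the elapsed wall time
    start = time.time()
    urllib.request.urlopen(f"http://127.0.0.1:{port}/search?input=foo").read()
    return time.time() - start

def benchmark(concurrency):
    # fire `concurrency` requests at distinct workers at the same time
    ports = range(12000, 12000 + concurrency)
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        times = list(pool.map(call_worker, ports))
    print(f"{concurrency} concurrent calls: max {max(times) * 1000:.0f} ms")

if __name__ == "__main__":
    for n in range(1, 11):
        benchmark(n)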
Here is a screenshot of my CPU graph while running these tests; note that with more concurrency you can see higher CPU utilization as the workers operate in parallel.
I can't figure out why the method takes progressively longer to execute when running concurrently.
It's almost as if pandas is locking something internally that is forcing the other processes to wait until released.
Building/tearing down each process is a non-issue because each process starts only once and stays alive.
Each process has its own independent copy of the 800 MB DataFrame, so why would a CPU-intensive read-only operation in one process affect the others?
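One way to narrow this down is to take Bottle and HTTP out of the picture entirely and time the same pandas call in several bare processes that all start the query at the same moment. This is my own diagnostic sketch, assuming the module is importable and using a placeholder for inputs:

import multiprocessing
import time

def timed_search(inputs, barrier):
    # load the data, wait for every process to be ready, then time only the query
    import search_with_pandas_module  # each process loads its own copy of the DataFrame
    barrier.wait()
    start = time.time()
    search_with_pandas_module.search_with_pandas(inputs)
    print(f"pid {multiprocessing.current_process().pid}: {(time.time() - start) * 1000:.0f} ms")

if __name__ == '__main__':
    inputs = ["example_input"]  # placeholder; use real values from the data
    n_procs = 10
    barrier = multiprocessing.Barrier(n_procs)
    procs = [multiprocessing.Process(target=timed_search, args=(inputs, barrier))
             for _ in range(n_procs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

If the per-call time still climbs with the process count here, the web stack is not the culprit and the slowdown comes from running the pandas workload itself in parallel.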
Comment From: jreback
Not really sure. You are trying to share a large read-only memory structure, and the interpreter GIL locking may get in the way here even though you are using processes (because they are accessing the original object). You would have better visibility if you post on SO; https://stackoverflow.com/questions/38666078/fast-queue-of-read-only-numpy-arrays/38775513 might be helpful.
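The linked question is about passing read-only numpy arrays between processes without copying. As a rough illustration of that general idea (not the code from the answer), an array can be placed in shared memory so every worker maps the same pages instead of holding its own copy; the sketch below uses multiprocessing.shared_memory, which requires Python 3.8+:

import numpy as np
from multiprocessing import Process, shared_memory

def make_shared_array(arr):
    # copy a numpy array into a shared memory block and return (block, view)
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
    view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
    view[:] = arr
    return shm, view

def reader(shm_name, shape, dtype):
    # attach to the existing block; no array data is copied into this process
    shm = shared_memory.SharedMemory(name=shm_name)
    arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    print("sum seen by reader:", arr.sum())
    shm.close()

if __name__ == '__main__':
    data = np.arange(1_000_000, dtype=np.int64)  # stand-in for the real data
    shm, view = make_shared_array(data)
    procs = [Process(target=reader, args=(shm.name, view.shape, view.dtype)) for _ in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    shm.close()
    shm.unlink()

Note that this shares a raw numpy array; wrapping shared memory in a pandas DataFrame without copying has its own caveats.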
Comment From: rf987
Each process is loading its own copy of the DataFrame from disk, so there should be no sharing, unless pandas does some sharing under the hood across processes.
Comment From: gazon1
I have a similar issue. I load several processes with a Flask app, and the pandas computations slow down the responses when I try to increase the number of Flask processes.