Python's undocumented ThreadPool

TL;DR

Use Python's completely undocumented ThreadPool to parallelize non-bytecode work on shared memory! Like so:

from multiprocessing.pool import ThreadPool

pool = ThreadPool()
pool.map(io_heavy_work, data_in_shared_memory)

In contrast to multiprocessing.Pool, where every worker process receives a pickled copy of the elements of map's second argument and the main process only sees whatever map returns, the io_heavy_work function here may modify the objects it receives in place: this is threading, so all workers operate on shared memory.
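
A minimal, self-contained sketch of that difference (io_heavy_work here is just a toy stand-in that tags each record; the names are mine, not from any API):

from multiprocessing.pool import ThreadPool

def io_heavy_work(record):
    # Stand-in for real I/O-bound work; mutates its argument in place.
    record["done"] = True

data_in_shared_memory = [{"url": url, "done": False} for url in ("a", "b", "c")]

pool = ThreadPool()
pool.map(io_heavy_work, data_in_shared_memory)

# All mutations are visible back in the main thread:
print(all(record["done"] for record in data_in_shared_memory))  # True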

Tutorial

There are two main reasons for writing multi-threaded code:

  1. high-performance shared-memory programs (including games),
  2. programs that spend a lot of time waiting for many different I/O tasks.

In Python, the GIL restricts bytecode execution to one thread at a time, so the first case is hopeless and only the second one remains.
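
To see why the second case still benefits from threads despite the GIL, here is a tiny sketch, with time.sleep standing in for a blocking I/O call (sleeping, like real I/O, releases the GIL): four half-second waits finish in about half a second, not two.

import threading
import time

def fake_io():
    time.sleep(0.5)  # a blocking call; like real I/O, it releases the GIL

threads = [threading.Thread(target=fake_io) for _ in range(4)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
print("elapsed: %.2fs" % (time.time() - start))  # ~0.5s, not ~2s: the waits overlap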

I recently had exactly such a workload: a web-forum crawler. I had prototyped all the crawling code in a single-threaded fashion, something like this:

class Subject:
    def __init__(self, url):
        self.url = url

    def update(self):
        # Look for new replies and store them somewhere
        ...

s = Subject(url)
while True:
    s.update()

Now imagine you have a lot of URLs, so many in fact that running each in its own thread doesn't make sense. In addition, update spends most of its time waiting for replies from the server. This is a perfect illustration of the second case mentioned above.
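
For concreteness, here is one hypothetical shape such an update could take, with urllib.request standing in for whatever HTTP client the real crawler uses; the thread spends nearly all of its time blocked inside urlopen:

import urllib.request

class Subject:
    def __init__(self, url):
        self.url = url
        self.pages = []

    def update(self):
        # Blocks here waiting for the server; blocking I/O releases the GIL.
        with urllib.request.urlopen(self.url, timeout=10) as response:
            self.pages.append(response.read())
        # A real crawler would parse new replies out of the page here.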

If Python had a parallel for, we could simply do something like this:

subjects = [Subject(url) for url in all_my_urls]
while True:
    parallel for s in subjects:
        s.update()

Unfortunately, no such construct exists. Instead, people are invariably pointed to multiprocessing's Pool for this kind of operation:

import multiprocessing

pool = multiprocessing.Pool()
subjects = [Subject(url) for url in all_my_urls]
while True:
    pool.map(lambda s: s.update(), subjects)

Well, beautiful! Except that this won't work. It fails outright, in fact, because multiprocessing has to pickle the lambda to send it to the workers, and lambdas can't be pickled. And even if you move the work into a picklable module-level function, every worker process operates on its own separate copy of the subjects and never updates the objects in the main process. The usual way of solving this is to restructure your code so that workers return results and the main process collects them, which works, but it's a nontrivial amount of effort and unnecessary code obfuscation. If only we had threads (and thus shared memory) instead of processes, this would just work.
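
For comparison, that restructuring would look roughly like this; fetch_pages is a hypothetical module-level function (so it pickles), and the main process does the collecting:

import multiprocessing

def fetch_pages(url):
    # Runs in a worker process: crawl, then *return* the data,
    # since mutations to local objects never reach the main process.
    s = Subject(url)  # Subject as sketched above
    s.update()
    return url, s.pages

if __name__ == "__main__":  # required for multiprocessing on spawn platforms
    pool = multiprocessing.Pool()
    results = dict(pool.map(fetch_pages, all_my_urls))  # main process collects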

Well, it turns out that Python ships with a thread pool! It's just completely undocumented, because nobody has bothered to document it so far. Check this out:

from multiprocessing.pool import ThreadPool
pool = ThreadPool()
subjects = [Subject(url) for url in all_my_urls]
while True:
    pool.map(lambda s: s.update(), subjects)

This now actually works, because all threads operate on the very same Subject instances.

Beautiful, isn't it?
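
A couple of optional refinements, as a sketch: ThreadPool takes the worker count as its first argument, supports the with statement (Python 3.3+), and since nothing is pickled when using threads, you can pass the plain method instead of a lambda.

from multiprocessing.pool import ThreadPool

subjects = [Subject(url) for url in all_my_urls]
with ThreadPool(16) as pool:  # 16 worker threads; tune to your workload
    while True:
        pool.map(Subject.update, subjects)  # no pickling, so a plain method works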