Monday, September 17, 2007

Parallel Processing in CPython

I am sick of hearing naive discussions about the GIL, and how it precludes Python programs from take advantage of multiple cpu/cores. Thats is absolutely a non-issue, given the abundant ways in which we can write parallel programs in Python today (MPI4Py, Ipython1, parallel-python, etc.). In this post I want to talk about parallel-python (PP), and pit it against threading solutions.

Before I go on, the usual disclaimer: I know that PP does multi-processing, not multi-threading, which is what the GIL won't let you do. But PP offers a very simple and intuitive API, that can be used for both multi-core CPUs and clusters. If, after seeing what PP can do for you, you still believe you need threads, use Jython!!

All examples here were run on a Xeon quad core, with 4GB of RAM, Running Ubuntu Feisty. Python interpreters used were: CPython 2.5.1, Jython 2.1 on java 1.6.0 and IronPython 1.0.2467.

Let's start with Parallel Python: I am using an example taken straight from PP's web site. Here is the code:


#!/usr/bin/python
# File: dynamic_ncpus.py
# Author: Vitalii Vanovschi
# Desc: This program demonstrates parallel computations with pp module
# and dynamic cpu allocation feature.
# Program calculates the partial sum 1-1/2+1/3-1/4+1/5-1/6+... (in the limit it is ln(2))
# Parallel Python Software: http://www.parallelpython.com

import math, sys, md5, time
import pp

def part_sum(start, end):
"""Calculates partial sum"""
sum = 0
for x in xrange(start, end):
if x % 2 == 0:
sum -= 1.0 / x
else:
sum += 1.0 / x
return sum

print """Using Parallel Python"""
print


start = 1
end = 20000000

# Divide the task into 64 subtasks
parts = 64
step = (end - start) / parts + 1

# Create jobserver
job_server = pp.Server()

# Execute the same task with different amount of active workers and measure the time
for ncpus in (1, 2, 4, 8, 16, 1):
job_server.set_ncpus(ncpus)
jobs = []
start_time = time.time()
print "Starting ", job_server.get_ncpus(), " workers"
for index in xrange(parts):
starti = start+index*step
endi = min(start+(index+1)*step, end)
# Submit a job which will calculate partial sum
# part_sum - the function
# (starti, endi) - tuple with arguments for part_sum
# () - tuple with functions on which function part_sum depends
# () - tuple with module names which must be imported before part_sum execution
jobs.append(job_server.submit(part_sum, (starti, endi)))

# Retrieve all the results and calculate their sum
part_sum1 = sum([job() for job in jobs])
# Print the partial sum
print "Partial sum is", part_sum1, "| diff =", math.log(2) - part_sum1

print "Time elapsed: ", time.time() - start_time, "s"
print
job_server.print_stats()


and here are the results:
Using Parallel Python

Starting 1 workers Partial sum is 0.69314720556 | diff = -2.50000421476e-08 Time elapsed: 7.85552501678 s

Starting 2 workers
Partial sum is 0.69314720556 | diff = -2.50000421476e-08 Time elapsed: 4.37666606903 s

Starting 4 workers
Partial sum is 0.69314720556 | diff = -2.50000421476e-08 Time elapsed: 2.11173796654 s Starting 8 workers Partial sum is 0.69314720556 | diff = -2.50000421476e-08 Time elapsed: 2.06818294525 s

Starting 16 workers
Partial sum is 0.69314720556 | diff = -2.50000421476e-08 Time elapsed: 2.06896090508 s

Starting 1 workers
Partial sum is 0.69314720556 | diff = -2.50000421476e-08 Time elapsed: 8.11736106873 s Job execution statistics: job count | % of all jobs | job time sum | time per job | job server 384 | 100.00 | 67.1039 | 0.174750 | local Time elapsed since server creation 27.0066168308

In order to compare it to threading code, I had to adapt the example to use threads. Before I fed the new code to Jython, I ran it though CPython to illustrate the fact that, under the GIL, threads are not executed in parallel but one at a time. This first run would also serve as a baseline to compare Jython results against.

The code is below. Since Jython 2.1 does not have the sum function, I implemented it with reduce (there was not perceptible perfomance difference when compared with the built-in sum).

#jython threads
import math, sys, time
import threading

global psums

def part_sum(start, end):
"""Calculates partial sum"""
sum = 0
for x in xrange(start, end):
if x % 2 == 0:
sum -= 1.0 / x
else:
sum += 1.0 / x
psums.append(sum)

def sum(seq):
# no sum in Jython 2.1, we will use reduce
return reduce(lambda x,y:x+y,seq)

print """Using: jython with threading module"""
print

start = 1
end = 20000000

# Divide the task into 64 subtasks
parts = 64
step = (end - start) / parts + 1
for ncpus in (1, 2, 4, 8, 16,1):
# Divide the task into n subtasks
psums = []
parts = ncpus
step = (end - start) / parts + 1
jobs = []
start_time = time.time()
print "Starting ",ncpus, " workers"
for index in xrange(parts):
starti = start+index*step
endi = min(start+(index+1)*step, end)
# Submit a job which will calculate partial sum
# part_sum - the function
# (starti, endi) - tuple with arguments for part_sum
t=threading.Thread(target=part_sum,name="", args=(starti, endi))
t.start()
jobs.append(t)
# wait for threads to finish
[job.join() for job in jobs]
# Retrieve all the results and calculate their sum
part_sum1 = sum(psums)
# Print the partial sum
print "Partial sum is", part_sum1, "| diff =", math.log(2) - part_sum1

print "Time elapsed: ", time.time() - start_time, "s"
print


and here are the results for CPython:
Using: CPython with threading module

Starting 1 workers
Partial sum is 0.69314720556 | diff = -2.50001152002e-08
Time elapsed: 8.17702198029 s

Starting 2 workers
Partial sum is 0.69314720556 | diff = -2.50001570556e-08
Time elapsed: 10.2990288734 s

Starting 4 workers
Partial sum is 0.69314720556 | diff = -2.50001127577e-08
Time elapsed: 11.1099839211 s

Starting 8 workers
Partial sum is 0.69314720556 | diff = -2.50001097601e-08
Time elapsed: 11.6850161552 s

Starting 16 workers
Partial sum is 0.69314720556 | diff = -2.50000701252e-08
Time elapsed: 11.8062999249 s

Starting 1 workers
Partial sum is 0.69314720556 | diff = -2.50001152002e-08
Time elapsed: 11.0002980232 s

Here are the results for Jython:

Using: jython with threading module

Starting 1 workers
Partial sum is 0.6931472055600734 | diff = -2.500012807882257E-8
Time elapsed: 4.14300012588501 s

Starting 2 workers
Partial sum is 0.6931472055601045 | diff = -2.500015916506726E-8
Time elapsed: 2.0239999294281006 s

Starting 4 workers
Partial sum is 0.6931472055600582 | diff = -2.5000112868767133E-8
Time elapsed: 2.1430001258850098 s

Starting 8 workers
Partial sum is 0.6931472055600544 | diff = -2.500010909400885E-8
Time elapsed: 1.6349999904632568 s

Starting 16 workers
Partial sum is 0.6931472055600159 | diff = -2.5000070569269894E-8
Time elapsed: 1.2360000610351562 s

Starting 1 workers
Partial sum is 0.6931472055600734 | diff = -2.500012807882257E-8
Time elapsed: 2.4539999961853027 s

And lastly, the results for IronPython:

Using: IronPython with threading module

Starting 1 workers
Partial sum is 0.6931472055601 | diff = -2.50001280788e-008
Time elapsed: 13.6127243042 s

Starting 2 workers
Partial sum is 0.6931472055601 | diff = -2.50001591651e-008
Time elapsed: 7.60165405273 s

Starting 4 workers
Partial sum is 0.6931472055601 | diff = -2.50001128688e-008
Time elapsed: 8.14302062988 s

Starting 8 workers
Partial sum is 0.6931472055601 | diff = -2.5000109205e-008
Time elapsed: 8.32349395752 s

Starting 16 workers
Partial sum is 0.6931472055600 | diff = -2.50000707913e-008
Time elapsed: 8.37589263916 s

Starting 1 workers
Partial sum is 0.6931472055601 | diff = -2.50001280788e-008
Time elapsed: 10.3567276001 s

Now on to some final considerations. The quality of a parallelization tool should be measured not in how fast it is, but how well it scales. The attentive reader may have noticed that Jython threads, were twice as fast than PP. But is that performance related to the threads? No, since it was already faster than CPython (with threading or with PP) for a single thread. PP scaled better up to the available number of Cores, consistently halving the time when doubling the number of cores used. Jython, halved the time when it went from one to two threads, but failed to halve the time again, when going to 4 threads. I'll give it a break here since it recovered at 8 and 16 threads.

Threads alone are not the answer, if they are not well implemented. Look at the results from IronPython, It seem not to be able to take advantage of more than two threads, on a four core system. Can anyone explain this? I'd be curious to know why.

6 comments:

Leonardo Santagada said...

IRC the mono frameworks use boehm garbage collector wich is not very good with threads, and not a very good one (no generational like the ones on jvm's). Probably if you run jython in an open source jvm (other than kaffe) and you will have similar results as most of them use boehm gc also.

LKRaider said...

So, it seems the conclusion is:

- PP is simple to implement and scales really well.

- Threads are hard to implement and thus may not scale well depending on their implementation.

Please correct me if I overgeneralised it :)

HIM said...

thanks for doing these comparison
with different VMs. I requested
something like this in a comment
to Guido's recent post about the GIL, but it seems to be more attractive to complain about the GIL and propose "new" "genuine" ideas how to remove it.

Patricio said...

Audio sequencing applications (for writing music in a studio) use a different thread to process the audio for each track. I wrote a scripting engine by embedding the python interpreter into it, but ran into CPU spikes at low latency because of the GIL. This is way audio threads running with realtime priority are never allowed to make locking calls.

Now, this is not a problem that can be fixed with nifty python code. Calls to CPython must be made by separate processes in order to avoid contention on the GIL.

In practice, the CPython implmentation is more than capable of running fast enough for our real-time requirements, but the contention on the GIL kills us. I have to write a ton of C code to manage and communicate with the extra processes used to fix the problem, and it's a lot of overhead to add to a host application that's loading our plugin. We also have a python extension module that links directly to pointers in our C++ audio engine, and I would have to add a whole middle layer to deal with that. It's SERIOUSLY ugly.

I don't even want to think about trying to debug crashes and multiprocessing bugs in that environment when the host applications support the three plugin standards so poorly that we can barely get our GUI code to work with a single-threaded debugger.

Example hosts are ableton live, Logic, CuBase, Pro Tools, Sonar, and any others supporting VST, AU and RTAS.

Comments?

Flavio Coelho said...

@Patricio,

I don't know much about programming audio applications, or the details of your application. But using multiprocessing you completely bypass the GIL, since your are spawning multiple processes. Multiprocessing also supports IPC (if you need to exchange data between processes). Especially with multicore cpu's you'll get to take advantage of all the processing power of the platform with no GIL to stand in your way.

I suggest coding a simple example application for testing multiprocessing.
Also check out my more recent post on the subject:

http://pyinsci.blogspot.com/2008/09/python-processing.html

http://pyinsci.blogspot.com/2009/02/usage-pattern-for-multiprocessing.html

http://pyinsci.blogspot.com/2009/04/parallel-version-of-gillespie-solver.html
good luck,

Patricio said...

Right, the multiprocessing module is nothing short of amazing. In fact, it's exactly the kind of magic that the python language is so perfectly suited for.

Unfortunately, my use case goes a bit deeper into the CPython implementation itself. The problem is that I would have to acquire the CPython GIL in order to import and use the code from the multiprocessing module itself, which would still cause thread competition on the GIL.

When approaching solutions like this with python, I tend to find that the impossible sometimes becomes possible through the wizardry of the language, and a lot of times I end up eating my own words, but in this case I'm still stumped. Actually, in a way I'm kind of psyched that I've found a use case that doesn't seem to have a solution, other than to remove the GIL or have some incredible IPC code :)

ccp

Amazon