
Finding stuck or hot pixels in Nikon D80 images, a multiprocessing approach

Earlier I described a method to find hot or stuck pixels by determining the variance of (sub)pixels over a large set of photos. In this article we take a look at a way to farm out this work to multiple processes.

The parallel algorithm for calculating variance

The parallel algorithm allows us to combine variances calculated over separate sets of data. With the help of Python's multiprocessing module it is straightforward to adapt our hot pixel finding algorithm into one that takes advantage of multiple cores:

from PIL import Image
import numpy as np
import sys
from glob import glob
from multiprocessing import Pool
from time import time

def process_pixels(shape,*args):
	# maintain a running variance (Welford's algorithm) over all images;
	# returns the M2 array (sum of squared differences from the mean)
	first = True
	for filename in args:
		pix = np.asarray(Image.open(filename),dtype=np.float64)
		if pix.shape != shape:
			print("shapes don't match")
			continue
		if first:
			first = False
			n = np.zeros(shape)
			mean = np.zeros(shape)
			M2 = np.zeros(shape)
		n += 1
		delta = pix - mean
		mean += delta/n
		M2 += delta*(pix - mean)
	return M2
	
if __name__ == '__main__':
	n = int(sys.argv[1])	# number of worker processes
	pool = Pool(n)
	filenames = []
	for a in sys.argv[2:]:
		filenames.extend(glob(a))

	shape = np.array(Image.open(filenames[0])).shape

	s = time()
	results = []
	# hand each worker every n-th filename, starting at offset i
	for i in range(n):
		results.append(pool.apply_async(process_pixels,tuple([shape]+filenames[i::n])))
	for i in range(n):
		results[i] = results[i].get()
	M2 = sum(results)
	print(time()-s)
	
	mini = np.unravel_index(M2.argmin(),M2.shape)
	maxi = np.unravel_index(M2.argmax(),M2.shape)
	print('min',M2[mini],mini)
	print('max',M2[maxi],maxi)
	print('mean',np.mean(M2))

	# list the ten (sub)pixels with the lowest variance
	sorti = M2.argsort(axis=None)
	print(*[(i,M2[i]) for i in [np.unravel_index(i,M2.shape) for i in sorti[:10]]],sep="\n")

	print(time()-s)
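
The script takes the number of worker processes as its first argument, followed by one or more file patterns, e.g. python hotpixels.py 4 "*.JPG" (the script name here is just an example). Quoting the pattern prevents the shell from expanding it, leaving that to the script's own glob() call.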

Note that because each worker returns a complete M2 array we gain almost nothing for small data sets: the overhead of shuttling these large arrays (> 30 MByte) between processes dominates. This is illustrated in the following graph, which shows the elapsed time as a function of the number of processes, both for a small number of pictures (50) and a large number (480).

Some other notable issues: in the code we do not actually implement the full parallel algorithm, we simply add together the per-process M2 values and omit the correction term based on the differences between the partial means. Because we are looking for a minimum variance, adding what is in practice a near-constant value changes nothing.
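
For reference, the exact combination step of the parallel algorithm would look something like the sketch below (the function name combine_M2 is mine; using it would require process_pixels to return n and mean along with M2):

def combine_M2(n_a,mean_a,M2_a,n_b,mean_b,M2_b):
	# pairwise combination of two partial variance computations
	# (Chan et al.); all arguments may be numpy arrays
	n = n_a + n_b
	delta = mean_b - mean_a
	mean = mean_a + delta*n_b/n
	M2 = M2_a + M2_b + delta**2*n_a*n_b/n
	return n,mean,M2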

Memory usage is another thing to be aware of (and the reason that there is no entry for six cores in the graph). The algorithm as implemented uses 5 arrays (the pixel data itself included). That makes for 10 megapixels x 3 colors x 5 arrays x 8 bytes (numpy uses 64-bit floats by default), a whopping 1.2 gigabyte of data per process, or more than 6 GB with 5 processes. With some other applications open, a sixth process wouldn't fit on my test machine. Because we're adding pixel values in the range 0-255 we could probably gain a lot by using 32-bit or even 16-bit floats here.
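
A minimal sketch of that change (the shape shown is an example for a 10-megapixel RGB image; float16 would halve memory again at some cost in precision):

import numpy as np

shape = (2592,3872,3)	# e.g. a 10-megapixel RGB image
# accumulating in 32-bit floats halves the memory footprint
n = np.zeros(shape,dtype=np.float32)
mean = np.zeros(shape,dtype=np.float32)
M2 = np.zeros(shape,dtype=np.float32)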

A SQLite multiprocessing proxy, part 3

In a previous article I presented a first implementation of a SQLite proxy that makes it possible to distribute a workload over multiple processes with the help of Python's multiprocessing module. In this third part of the series we try to analyze the performance of this setup.

High workload example

In our sample implementation we can vary the workload inside the processes that interact with the SQLite database by varying the size of the table that we query. A table with many rows takes more time to scan for a certain random value than a table with just a few rows.

The first graph we present here is about high workload: the table that we query is initialized with one million records. The graph shows the time to complete 100 queries. The test was done on a machine with 6 processor cores, and we show the results for 2 (deep purple, back) and 6 (light purple, front) worker processes and a varying number of threads.

The results are more or less what we expect: more worker processes means that the time to complete all tasks is reduced. However, the number of threads is also significant. If the number of threads is less than the number of available worker processes we do not reach the full potential: basically we need at least as many threads as there are worker processes to keep those processes busy. If we have more threads than worker processes there is no further gain; in fact we see a minute increase in the time needed to complete all tasks, which might be due to the overhead of creating and managing threads in Python.

Low workload example

If we initialize our table with just a single row the workload will be negligible. If we draw a graph similar to the one for the high workload we see a completely different picture.

Now we see hardly any difference between 2 worker processes and 6, and increasing the number of threads has no effect either. The data is also rather noisy, i.e. it varies quite a bit in a non-uniform manner, especially in the case with 2 worker processes. The reason for this behavior is not entirely clear to me, although it is obvious that with such a small workload the time needed to set up communication with the worker processes is a significant factor.

A SQLite multiprocessing proxy, part 2

In a previous article we decided to use Python's multiprocessing module to leverage the power of multi-core machines. Our use case is all about web applications served by CherryPy, so multiprocessing isn't the only interesting part: our application will be multithreaded as well. In this article we present a first implementation of a multithreaded application that hands off the heavy lifting to a pool of subprocesses.

The design

The design is centered on the following concepts:

  • The main process consists of multiple threads,
  • The work is done by a pool of subprocesses,
  • Transferring data to and from the subprocesses is left to the pool manager.

Schematically we can visualize it as follows (a rough sketch, using names from the code below):
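
 thread 1 \
 thread 2  > tasks (queue.Queue) --> execute() --> Pool.apply(task) --> worker subprocess --> SQLite
 thread N /                          (results travel back along the same path)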

Sample code

We start off by including the necessary components:

from multiprocessing import Pool,current_process
from threading import current_thread,Thread
from queue import Queue
import sqlite3 as dbapi
from time import time,sleep
from random import random

The most important ones are the Pool class from the multiprocessing module and the Thread class from the threading module. We also import queue.Queue to act as a task list for the threads. Note that the multiprocessing module has its own Queue implementation that is not only thread safe but can be used for inter-process communication as well; we won't be using it here, however, but rely on a simpler paradigm, as we will see.

The next step is to define a function that may be called by the threads.

def execute(sql,params=tuple()):
 global pool
 return pool.apply(task,(sql,params))
It takes a string argument with SQL code and an optional tuple of parameters, just like the Cursor.execute() method in the sqlite3 module. It merely passes these arguments on to the apply() method of the multiprocessing.Pool instance referred to by the global pool variable, together with a reference to the task() function, which is defined below:

def task(sql,params):
 global connection
 c=connection.cursor()
 c.execute(sql,params)
 l=c.fetchall()
 return l

This function just executes the SQL and returns the results. It assumes the global variable connection contains a valid sqlite3.Connection instance, something that is taken care of by the connect() function, which will be passed as an initializer to any new subprocess:
def connect(*args):
 global connection
 connection = dbapi.connect(*args)

Before we initialize our pool of subprocesses, let's have a look at the core function of any thread we start in our main process:

def threadwork(initializer=None,kwargs={}):
 global tasks
 if initializer is not None:
  initializer(**kwargs)
 while True:
  (sql,params) = tasks.get()
  if sql=='quit': break
  r=execute(sql,params)
It calls an optional thread initializer first and then enters a semi-infinite loop. This loop starts by fetching an item from the global tasks queue. Each item is a tuple consisting of a string and another tuple with parameters. If the string equals quit we terminate the loop; otherwise we simply pass the SQL statement and any parameters on to the execute() function we encountered earlier, which will take care of handing it to the pool of subprocesses. We store the result of the query in the r variable even though we do nothing with it in this example.

For this simple example we also need a database that holds a table with some data we can play with. We initialize this table with rows containing random numbers. When we benchmark the code we can make the table as large as we wish to get meaningful results; after all, our queries should take some time to complete, otherwise there would be no need to use more processes.

def initdb(db,rows=10000):
 c=dbapi.connect(db)
 cr=c.cursor()
 cr.execute('drop table if exists data')
 cr.execute('create table data (a,b)')
 for i in range(rows):
  cr.execute('insert into data values(?,?)',(i,random()))
 c.commit()
 c.close()

The final pieces of code tie everything together:

if __name__ == '__main__':
 tasks=Queue()
 db='/tmp/test.db'

 initdb(db,100000)

 nthreads=10

 # queue 100 queries, each counting the rows above a random threshold
 for i in range(100):
  tasks.put(('SELECT count(*) FROM data WHERE b>?',(random(),)))
 # add one sentinel task per thread to signal it to stop
 for i in range(nthreads):
  tasks.put(('quit',tuple()))

 pool=Pool(2,connect,(db,))

 threads=[]
 for t in range(nthreads):
  th=Thread(target=threadwork)
  threads.append(th)
  th.start()
 for th in threads:
  th.join()
After creating the task queue and initializing the database, the next step is to fill the queue with a fair number of tasks. The final tasks we add to the queue signal a thread to stop; we need as many of them as there will be threads.

Next we initialize our pool of processes, just two in this example, although in general the number should equal the number of CPUs in the system (if you omit the argument, it defaults to exactly that). We then create and start the number of threads we want. The target argument points to the function we defined earlier that does all the work, i.e. pops tasks from the queue and passes them on to the pool of processes. The final lines simply wait till all threads are finished.
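
For example, a pool sized to the machine would look like this sketch (reusing the connect() initializer and db path from the listing above; Pool() falls back to os.cpu_count() when no size is given):

import os
from multiprocessing import Pool

pool = Pool(initializer=connect, initargs=(db,))  # as many workers as os.cpu_count()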

What's next?

In a following article we will benchmark and analyze this code and see how we can improve on this design.

A SQLite multiprocessing proxy

This is the first article in a series on improving the performance of Python web applications by leveraging the possibilities of the multiprocessing module. We'll focus on CherryPy and SQLite, but the conclusions should be general enough for any Python based platform.

Use case

Due to well-known restrictions in the most common Python implementation (CPython's global interpreter lock), multithreading will probably not help to solve performance issues (with the possible exception of serving slow network connections). The multiprocessing module offers an API similar to that of the threading module and might be an alternative when we want to divide the workload on a multi-core machine.
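
The similarity is easy to see in a minimal sketch (not taken from the series itself):

from threading import Thread
from multiprocessing import Process

def work(name):
 print('hello from', name)

if __name__ == '__main__':
 t = Thread(target=work, args=('a thread',))
 p = Process(target=work, args=('a process',))  # same constructor signature
 t.start(); p.start()
 t.join(); p.join()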

The use case we're interested in is a CherryPy server that serves many requests, backed by a SQLite database. CherryPy is multithreaded by design and this approach is sensible as a web server may spend more time waiting for data to be transmitted over relatively slow network connections than actually doing work.

CherryPy however is also an excellent framework to host web applications and many web applications rely on some sort of database back-end. SQLite is a good choice for such a back-end as it comes bundled with Python (reducing the number of external dependencies), is easy to use and performs well enough. With some tricks it will even play nice in a multithreaded environment.
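
One such trick, shown here as a minimal sketch (a single shared connection opened with check_same_thread=False, with a lock serializing access):

import sqlite3, threading

connection = sqlite3.connect('/tmp/test.db', check_same_thread=False)
dblock = threading.Lock()

def query(sql, params=tuple()):
 # only one thread at a time may touch the shared connection
 with dblock:
  return connection.execute(sql, params).fetchall()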

A disadvantage of using SQLite is that we do not have a separate database server: the SQLite engine is part of the same process that runs the Python interpreter. This means that it has the same handicap as any multithreaded application on CPython (the most common implementation of Python) and will not benefit from any extra cores or processors available on the server.

Now we could switch to MySQL or another stand-alone database back-end, but that would add quite a bit to the maintenance burden of our web application. Wouldn't it be nice if we could devise a way to use SQLite together with the multiprocessing module and have the best of both worlds: the ease of use of SQLite and the performance benefits of a stand-alone database server?

In this series of articles I will explore the possibilities and hopefully will come up with a solution that will provide:

  • a dbapi proxy (we'll use the sqlite3 module, but it should be general enough for any dbapi-compliant database)
  • that will use the multiprocessing module to increase performance and
  • can be used from a multithreaded environment.
It would be nice if the API closely resembled the dbapi (but that is not an absolute requirement).

In the next article in this series I will explore the options to make threads and processes play nice, focusing on inter process communication.