Python regular expressions and the functools module

If we use annotations to sanitize incoming data, easy access to regular expressions is a must. In this article I'll show how to make the most of Python's bundled functools module to dynamically generate functions that match regular expressions, while reducing the overhead of compiling regular expressions as much as possible.

Matching against regular expressions

In our framework we annotate function parameters with functions that will pass through or convert incoming arguments if they are OK, or raise an exception otherwise. For example, if we wanted to make sure an argument is an integer, we'd annotate it with Python's built-in int function:

def myfunction(arg:int):
	...

The framework will use this function to check the argument before actually calling the function.
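To make this concrete, here is a minimal sketch of how such a framework might apply the annotations before a call; the actual framework code is not shown in this article, so the check_arguments() helper below and its details are assumptions:

def check_arguments(f, kwargs):
	# hypothetical helper: run every argument through its annotation,
	# which either returns a (possibly converted) value or raises
	for name, value in kwargs.items():
		if name in f.__annotations__:
			kwargs[name] = f.__annotations__[name](value)
	return kwargs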

In many situations it would be convenient if we could specify a regular expression that arguments should match. We could extend the framework to check whether the annotation is a string and use that string as a regular expression, but it is better to opt for an approach that allows for the easy definition of categories. Some examples: def myfunc(arg:digits) ... or def myfunc(arg:zipcode) .... In these examples digits and zipcode are functions that take a single argument and return that argument if it matches a certain pattern, or raise an exception if it doesn't.

To construct such functions easily we can use Python's re and functools modules:

from re import compile

def regex_compile(pattern):
	return compile("^"+pattern+"$")

def match(pattern,string):
	re = regex_compile(pattern)
	if re.match(string) is None:
		raise ValueError("no pattern match")
	return string

To check whether a string matches a regular expression we define a match() function that compiles the regular expression and matches the string against it. The regex_compile() function makes sure the match is anchored at the beginning and the end because we want a complete match; that is, strings that merely contain the pattern are considered invalid. We don't allow extraneous characters.
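A quick demonstration of that anchoring (the wording of the exception is of course our own):

print(match(r'\d+','12345'))	# prints 12345
match(r'\d+','123abc')	# raises ValueError: merely containing digits is not enough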

Constructing partial functions

So far nothing new under the sun, but we still need a simple way to construct functions that match a single argument against a regular expression. Enter the partial function. It takes a function and some arguments and uses those to construct a new function that passes any arguments, together with the original arguments, to the encapsulated function. This is exactly what we need to construct our categories:

from functools import partial

digits = partial(match,r'\d+')
zipcode = partial(match,r'\d{4}\s*[a-zA-Z]{2}')

print(digits('12345'))
print(zipcode('1234AB'))
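For illustration: digits behaves much like a hand-written wrapper, but the partial object also exposes the encapsulated function and arguments for introspection:

def digits_by_hand(string):
	# written out in full, this is what partial(match,r'\d+') gives us
	return match(r'\d+',string)

print(digits.func is match, digits.args)	# True ('\\d+',)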

Improving efficiency with memoization

Each time we call digits(a) we actually call match(r'\d+',a), and this compiles the regular expression again and again. Compiling regular expressions is a rather expensive operation, so we might want to avoid that by reusing compiled expressions. This can be accomplished simply by applying the lru_cache decorator from the functools module to implement a memoize pattern:

from functools import lru_cache
from re import compile

@lru_cache(maxsize=None)
def regex_compile(pattern):
	return compile("^"+pattern+"$")

This simple change makes sure we reuse compiled regular expressions most of the time. The maxsize parameter should be set as needed: it may be set to None, in which case all regular expressions will be cached without bound, possibly a good choice as it is unlikely that your web application sports thousands of regular expressions.
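Functions wrapped with lru_cache expose a cache_info() method, so we can check that compiled patterns are actually being reused:

digits('12345')
digits('67890')
print(regex_compile.cache_info())
# something like: CacheInfo(hits=1, misses=1, maxsize=None, currsize=1)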

Propagating Python function annotations through decorators

When we decorate a Python function we must take some care to ensure that the decorated function can present itself in the same way as before its decoration. Let's explore what is needed.

An undecorated Python function

Say we have an undecorated function f(). It features a docstring and an annotated argument:

def f(a:str):
	"an undecorated function"
	return a

print(f.__name__,f.__annotations__,f.__doc__,f('foo'))

The final line in the previous snippet prints the name of the function, its annotation dictionary and the docstring. The output looks like this:

f {'a': <class 'str'>} an undecorated function foo

Decorating a Python function

Now say we have a function do_something() and we would like a decorator that arranges for a function to call do_something() first. We might set this up like this:

def do_something():
	print('oink')
	
def decorator(f):
	def g(*args,**kwargs):
		do_something()
		return f(*args,**kwargs)
	return g
	
@decorator
def f(a:str):
	"a naively decorated function"
	return a

print(f.__name__,f.__annotations__,f.__doc__,f('foo'))

The final line would produce the following output:

oink
g {} None foo

It is clear that do_something() is executed, as it does print oink, but although our decorated function can be called as f(), its name, annotations and docstring are different.

Preserving attributes when decorating a Python function

To resolve these matters we can rewrite our decorator as follows:

def decorator(f):
	def g(*args,**kwargs):
		do_something()
		return f(*args,**kwargs)
	g.__name__ = f.__name__
	g.__doc__ = f.__doc__
	g.__annotations__ = f.__annotations__
	return g
	
@decorator
def f(a:str):
	"a less naively decorated function"
	return a

print(f.__name__,f.__annotations__,f.__doc__,f('foo'))

We now simply copy the relevant attributes to the newly created function, and if we examine the output produced by the final line we get what we expect:

oink
f {'a': <class 'str'>} a less naively decorated function foo
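Incidentally, the functools module we used earlier has us covered here as well: its wraps decorator copies these attributes (and a few more, such as __module__) to the wrapper for us:

from functools import wraps

def decorator(f):
	@wraps(f)	# copies __name__, __doc__, __annotations__ and more
	def g(*args,**kwargs):
		do_something()
		return f(*args,**kwargs)
	return g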

Implementing POST method handling in a web application server

In an ongoing project to implement a web application server that is as simple as possible, we now implement the handling of POST requests.

Handling POST requests

The HTTP POST method is often the preferred way to transfer information from a form to the server. With POST, the arguments don't end up in the web server log, for example, and it allows us to use a file upload tag.

Depending on the enctype attribute of a form element, POST data might be encoded as URL-encoded key/value pairs or as a multipart MIME message (useful if you want to upload files).

Wielding the power of Python's cgi module

Decoding a multipart MIME message isn't trivial, but luckily for us we can offload the hard work to an existing module: cgi. It is part of the standard Python distribution and although it was designed for implementing CGI scripts, we can co-opt its functionality for our purpose. All we have to do is add a do_POST() method to our ApplicationRequestHandler class as shown below:

from http.server import BaseHTTPRequestHandler
from cgi import FieldStorage
from os import environ

class ApplicationRequestHandler(BaseHTTPRequestHandler):

	def do_POST(self):
		ob=self._find_app()
		
		# the cgi module expects REQUEST_METHOD to be present in the
		# environment; it is not part of the headers we pass along
		environ['REQUEST_METHOD'] = 'POST'
		fs=FieldStorage(self.rfile,headers=self.headers)
		kwargs={}
		for name in fs:
			if fs[name].file:
				# pass on the temporary file object itself and tack
				# the original (client side) filename onto it
				kwargs[name] = fs[name].file
				kwargs[name].srcfilename = fs[name].filename
			else:
				kwargs[name] = fs[name].value
		
		return self._execute(ob,kwargs)

We factored out some common code that is also used in the do_GET() method (in the _find_app() and _execute() methods, not shown here), but basically we instantiate a cgi.FieldStorage instance and pass it the input file stream along with a dictionary of the HTTP headers we have encountered so far. Because the cgi module expects some parameters to be present in the environment, we set REQUEST_METHOD to POST, as that information is not part of the headers. The FieldStorage instance returned can be used as a dictionary with the names of the parameters as keys.

Any file input parameters are a bit special. The contents of the uploaded file are stored in a temporary file, and if we were to access this parameter directly its value would be the contents of this file as a bytes object. We'd rather pass on the temporary file object itself, to prevent passing around the contents of really big files unnecessarily. We can check whether a parameter is an uploaded file by checking its file attribute. If the argument is a file, we tack on the original file name (that is, the one on the user's PC) because we might want to use that later. The final line passes the arguments we have extracted to the _execute() method. This method is factored out from the do_GET() method. It locates the function to execute and checks its arguments against any annotations.
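To illustrate what this means for application code, here is a hypothetical handler that receives an uploaded file; the upload() function and its parameter are invented for this example, but the srcfilename attribute is the one we tacked on in do_POST():

def upload(datafile):
	# datafile is the temporary file object created by FieldStorage;
	# srcfilename is the original name of the file on the user's PC
	contents = datafile.read()
	return "received %s (%d bytes)" % (datafile.srcfilename, len(contents))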

Finding stuck or hot pixels in Nikon D80 images, a multiprocessing approach

Earlier I described a method to find hot or stuck pixels by determining the variance of (sub)pixels over a large set of photos. In this article we take a look at a way to farm out this work to multiple processes.

The parallel algorithm for calculating variance

The parallel algorithm allows us to combine variances calculated over separate sets. With the help of Python's multiprocessing module it is straightforward to convert our hot pixel finding algorithm into one that takes advantage of multiple cores:

from PIL import Image
import numpy as np
import sys
from glob import glob
from multiprocessing import Pool,current_process
from time import time

def process_pixels(shape,*args):
	first = True
	for filename in args:
		pix = np.array(Image.open(filename))
		if pix.shape != shape:
			print("shapes don't match")
			continue
		if first:
			first = False
			firstpix = pix
			n = np.zeros(firstpix.shape)
			mean = np.zeros(firstpix.shape)
			M2 = np.zeros(firstpix.shape)
			delta = np.zeros(firstpix.shape)
		n += 1
		delta = pix - mean
		mean += delta/n
		M2 += delta*(pix - mean)
	return M2
	
if __name__ == '__main__':

	n=int(sys.argv[1])
	pool = Pool(n)
	filenames = []
	for a in sys.argv[2:]:
		filenames.extend(glob(a))
	
	shape = np.array(Image.open(filenames[0])).shape

	s=time()
	results = []
	for i in range(n):
		results.append(pool.apply_async(process_pixels,tuple([shape]+filenames[i::n])))
	for i in range(n):
		results[i]=results[i].get()
	M2 = sum(results)
	print(time()-s)
	
	mini = np.unravel_index(M2.argmin(),M2.shape)
	maxi = np.unravel_index(M2.argmax(),M2.shape)
	print('min',M2[mini],mini)
	print('max',M2[maxi],maxi)
	print('mean',np.mean(M2))

	sorti = M2.argsort(axis=None)
	print(*[(i,M2[i]) for i in [np.unravel_index(i,M2.shape) for i in sorti[:10]]],sep="\n")

	print(time()-s)

Note that because we return complete arrays from the worker processes, we gain almost nothing for small data sets: the overhead of shuttling these large arrays (> 30 MByte) between processes dominates. This is illustrated in the following graph, which shows the elapsed time as a function of the number of processes, both for a small number of pictures (50) and a large number (480).

Some other notable issues: in the code we do not actually implement the full parallel combination step but simply add the per-process variances together. Because we're looking for the pixels with the minimum (or maximum) variance rather than for exact values, we gain nothing by the extra effort of the exact combination; a sketch of that step is shown below for reference.
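For reference, a sketch of that exact pairwise combination step, following the standard parallel variance formula; it would require every worker to return n and mean alongside M2:

def combine(n_a, mean_a, M2_a, n_b, mean_b, M2_b):
	# merge two partial results elementwise; delta corrects for the
	# difference between the two partial means
	n = n_a + n_b
	delta = mean_b - mean_a
	mean = mean_a + delta*n_b/n
	M2 = M2_a + M2_b + delta**2*n_a*n_b/n
	return n, mean, M2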

Memory usage is another thing to be aware of (and the reason that there is no entry for six cores in the graph). The algorithm we have implemented uses 5 arrays (the pixel data itself included). That makes for 10 megapixels x 3 colors x 5 arrays x 8 bytes (because numpy uses 64-bit floats by default), which comes to a whopping 1.2 gigabyte of data per process, or more than 6 gigabyte with 5 processes. With some other applications open, a sixth process wouldn't fit on my test machine. Because we're adding pixel values in the range 0 - 255, we could probably gain a lot by using 32-bit or even 16-bit floats here, as sketched below.
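A minimal sketch of that change; switching the pixel data and the accumulators to 32-bit floats halves the memory footprint (the shape used here roughly matches a 10 megapixel RGB image):

import numpy as np
from PIL import Image

def open_as_float32(filename):
	# load the image as 32-bit floats instead of numpy's 64-bit default
	return np.array(Image.open(filename), dtype=np.float32)

shape = (2592, 3872, 3)
n = np.zeros(shape, dtype=np.float32)
mean = np.zeros(shape, dtype=np.float32)
M2 = np.zeros(shape, dtype=np.float32)
delta = np.zeros(shape, dtype=np.float32)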