Sunday, 25 September 2011

Python regular expressions and the functools module

If we use annotations to sanitize incoming data easy access to regular expressions is a must. In this article I'll show how to make the most of Python's bundles functools module to dynamically generate functions that match regular expressions while reducing the overhead of compiling regular expressions as much as possible

Matching against regular expressions

In our framework we annotate function parameters with functions that will pass through or convert incoming arguments if they are ok or raise an exception other wise. For example, if we would want to make sure an argument is an integer, we'd annotate it with Python's built-in int function:

def myfunction(arg:int):
	...

The framework will use this function to check the argument before actually calling the function.

In many situations it would be convenient if we could specify a regular expression that arguments should match. We could extend the framework to check whether the annotation was a string and use that string as a regular expression but it is better to opt for a approach that will allow for the easy definition of categories. Some examples: def myfunc(arg:digits) ... or def myfunc(arg:zipcode) .... In these examples digits and zipcode are functions that take single argument and return that argument if matches a certain pattern or raise an exception if it doesn't.

To construct such functions easily we may utilize Python's re and functoolsmodules:

from re import compile

def regex_compile(pattern):
	return compile("^"+pattern+"$")

def match(pattern,string):
	re = regex_compile(pattern)
	if re.match(string) is None:
		raise ValueError("no pattern match")
	return string

To check whether a string matches a regular expression we define a match function (line 6) that will compile the regular expression and match the string against it. The regex_compile function will make sure the match is anchored at the beginning and the end because we want a complete match, that is string that merely contain the pattern are considered invalid. We don't allow extraneous characters.

Constructing partial functions

So far nothing new under the sun but we still need a simple way to construct functions that match a single argument against a regular expression. Enter the partial function. It will take a function and some arguments and use those to construct a new function that will pass any arguments together with the original arguments to the encapsulated function. This is exactly what we need to construct our categories:

from functools import partial

digits = partial(match,r'\d+')
zipcode = partial(match,r'\d{4}\s*[a-zA-Z]{2}')

print(digits('12345'))
print(zipcode('1234AB'))

Improving efficiency by Memoization

Each time we call digits(a) we actually call match(r'\d+',a) and this will result in compile the regular expression again and again. Compiling regular expressions is a rather expensive operation so we might want to avoid that by reusing compiled expressions. This can simply be accomplished by applying the lru_cache decorator from the functools module to implement a memoize pattern:

from functools import lru_cache
from re import compile

@lru_cache(maxsize=0)
def regex_compile(pattern):
	return compile("^"+pattern+"$")

This simple change will make sure we will reuse compiled regular expressions most of the time. The maxsize parameter should be set as needed: it may be set to zero in which case all regular expression will be cached, possibly a good choice as it is unlikely that your web application sports thousands of regular expressions.

No comments:

Post a Comment