Sunday 30 October 2011

A jQuery Mobile SelectorImage plugin

I implemented a jQuery Mobile plugin to enhance a <select> element with a clickable image map. A full article with examples is available on my site.

A simple, JavaScript-only implementation of a clickable image map.

In this age of HTML5 it is of course not done to use clickable image maps :-) Yet I did want to enhance <select> elements with clickable images to provide a visual alternative to a drop-down list without the hard-to-maintain image maps of old. I therefore implemented a jQuery Mobile plugin that takes two maps (specified with data- attributes): one to display to the user and one to act as a definition of hotspots. These hotspots are simply colored areas in the hotspot map (instead of the polygon definitions in an image map). When the user clicks the image, the corresponding location is checked in the hotspot map and the color found there is looked up in the list of <option> elements. If there is a match, that option is selected. A working example can be found here. A full explanation, including a small list of known problems, can be found here.

Sunday 23 October 2011

Implementing an HTTPS server in Python

Web applications often transfer sensitive data between client and server. Even a session id is not something that should be vulnerable to eavesdropping. It is therefore a very good idea to encrypt all communication and implement an HTTPS server.

Subclassing HTTPServer

Python's ssl module has been cleaned up quite a bit in Python 3.x, and with a little help from this recipe it was incredibly simple to adapt the HTTPServer class from the http.server module to accept only secure connections:

import ssl
import socket
from socketserver import BaseServer
from http.server import HTTPServer
  
class HTTPSServer(HTTPServer):
 def __init__(self,address,handler):
  BaseServer.__init__(self,address,handler)
  
  # wrap a plain socket in an SSL socket before binding;
  # test.pem holds both the certificate and the private key
  self.socket = ssl.SSLSocket(
   sock=socket.socket(self.address_family,self.socket_type),
   ssl_version=ssl.PROTOCOL_TLSv1,
   certfile='test.pem',
   server_side=True)
  self.server_bind()
  self.server_activate()

All we basically do is change the initialization code to create a secure socket instead of a regular one. The thing to watch out for is the ssl_version: older versions are considered unsafe so we use TLS 1.0 here. Also, the certificate file we use here contains both our certificate and our private key. If you want to use a self-signed certificate for testing purposes you can generate one with openssl (most UNIX-like operating systems offer binary packages; for a precompiled package for Windows check the FAQ.)

openssl req -new -x509 -keyout test.pem -out test.pem -days 365 -nodes

Note that your browser will still complain about this certificate because it is self signed.
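
To try the class out all we need is a request handler. Below is a minimal sketch that serves a single plain text response; the handler class, greeting and port number are just illustrative choices, and test.pem must be present in the current directory:

from http.server import BaseHTTPRequestHandler

class HelloHandler(BaseHTTPRequestHandler):
 def do_GET(self):
  # answer any GET request with a plain text greeting
  self.send_response(200)
  self.send_header('Content-Type','text/plain')
  self.end_headers()
  self.wfile.write(b'hello over https')

httpsd = HTTPSServer(('',4443),HelloHandler)
httpsd.serve_forever()

Pointing a browser at https://localhost:4443/ should then show the greeting, after the usual warning about the self-signed certificate.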

Sunday 16 October 2011

Managing a session id with cookies

When managing sessions in web applications the key to the castle is a session id. Most often this session id is passed to and from the client by means of a cookie. In this article we explore the tools available in Python to deal with cookies.

Reading and writing cookies

If we are extending the BaseHTTPRequestHandler class from Python's http.server module we have access to the request headers by means of the headers attribute. It provides a convenient method get_all() to retrieve headers by name, as used at the top of the _session_cookie() method in the code sample below:

from http.cookies import SimpleCookie as cookie
from http.server import BaseHTTPRequestHandler
from time import time
from uuid import uuid4 as uuid
...
class ApplicationRequestHandler(BaseHTTPRequestHandler):
 
 sessioncookies = {}

 def __init__(self,*args,**kwargs):
  self.sessionidmorsel = None
  super().__init__(*args,**kwargs)
  
 def _session_cookie(self,forcenew=False):
  cookiestring = "\n".join(self.headers.get_all('Cookie',failobj=[]))
  c = cookie()
  c.load(cookiestring)
  
  try:
   if forcenew or time() - self.sessioncookies[c['session_id'].value] > 3600:
    raise ValueError('new cookie needed')
  except (KeyError,ValueError):
   c['session_id']=uuid().hex
  
  for m in c:
   if m=='session_id':
    self.sessioncookies[c[m].value] = time()
    c[m]["httponly"] = True
    c[m]["max-age"] = 3600
    c[m]["expires"] = self.date_time_string(time()+3600)
    self.sessionidmorsel = c[m]
    break

The way we call get_all() provides us with an empty list if there are no cookies in the headers. Either way we end up with a (possibly empty) string that contains the cookies the client sent us. This cookie (or cookies) can be converted to a SimpleCookie object from Python's http.cookies module.

Cookies are basically key/value pairs with some extra attributes. The whole ensemble is called a morsel. The SimpleCookie object acts as a dictionary that indexes those morsels by key. The whole exercise is aimed at maintaining a session id, so we check if our cookie object holds a session_id morsel and use its value (a GUID) as an index into the sessioncookies class variable. This class variable maintains a dictionary indexed by the GUIDs of the session id cookies we produced. The corresponding values are their timestamps. The check in the try block will therefore raise an exception if

  • no session_id cookie was provided by the client,
  • the session_id is unknown to us,
  • the session_id is expired, i.e. older than one hour, or
  • when we explicitly indicated we want a new cookie, no matter what.
In those cases we generate a completely new, random GUID and store its hexadecimal representation.

At this point we are guaranteed to have a SimpleCookie object that contains a session_id. Our final tasks are to store the timestamp of this cookie and to update or set some additional attributes on this morsel. The client might have sent more than one cookie so we iterate over all morsels and stop at the first session_id morsel we find. We set its httponly attribute to signal to the browser that this cookie should not be manipulated by any client side JavaScript, and set both its expires and max-age attributes before we store this specific morsel in an instance variable. This way we can add this cookie to the response headers once we have processed the request. An outline is sketched in the snippet below:

...
 def do_GET(self):
  ...
  self._session_cookie()
  ...
  if not (self.sessionidmorsel is None):
    self.send_header('Set-Cookie',self.sessionidmorsel.OutputString())
  ...

Security considerations

At this point we have a tool to manage a session id. Such a session id can be used as a key to access other session information, a topic we cover in a later article. Before we even start thinking of using this we should consider the security issues.

Pythonsecurity.org has a handy checklist that we will walk through:

Are session IDs exposed in the URL?
No, we use cookies and those are part of the HTTP headers.
Do session IDs timeout and can users log out?
Our session IDs certainly time out, but providing a log out option should be part of the web application.
When a user logs out or times out, is the session invalidated?
At this point we are only dealing with session ids.
Are session IDs rotated after successful login?
That is why we provide the forcenew parameter. After a successful login the web application should call _session_cookie() again with forcenew=True.
Are session IDs only sent over TLS/SSL?
That should be implemented in the HTTP server; here we only look at the request handling part.
Are session IDs completely randomly generated?
We use a variant 4 uuid from Python's uuid module which should be completely random. The documentation makes no claims about the quality of the random generator used, but a quick look at the code of the uuid module (in Python 3.2) reveals that it uses either system provided functions or the built-in random() function. In other words, we don't know what we get and that is bad! It is probably better to use os.urandom() directly, which will raise a NotImplementedError if a source of randomness couldn't be found. Generating a string of 32 hex digits might be done as follows: "%02x"*16%tuple(os.urandom(16))
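
As a sketch, such a generator is easily wrapped in a small helper (the function name is mine, not part of the code above):

import os

def new_session_id():
 # 16 bytes from the OS random source, formatted as 32 hex digits
 return "%02x"*16 % tuple(os.urandom(16))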

Sunday 9 October 2011

A jQuery Mobile Address plugin

While playing around with jQuery Mobile and wondering about the support for the new HTML5 input types, I realized there wasn't an address extension that was just as easy to use as a date picker is for dates. So I wrote an address plug-in.

Adding Google Maps to an input element

The idea is very simple: addresses are text just like an e-mail address or a URL. HTML5 makes it possible to identify these as input element types, and jQuery Mobile acts on these types as well to display custom widgets for these elements.

I wanted an address type: an input element with that type would allow the user to enter an address, or a click on it would open an interactive map. Clicking on a location in that map would fill in the address under the cursor. Writing such a plug-in was a fine exercise in both jQuery Mobile and the Google Maps API, and the first results are documented on my website as part of the Erica Web Application framework.

Sunday 2 October 2011

Erica web application framework

I started developing and writing about a new web application framework in Python to aid people in understanding concepts in web application development.

Erica web application framework

The framework is named after one of our cutest pygmy goats, Erica, which is of course completely irrelevant :-). More important is that the first article, outlining the goals, is already on-line on my website. The articles will build and extend on blog entries made in this blog. The idea is that the website will develop into a small but comprehensive tutorial on implementing a framework, while this blog will be the place to watch for updates and place your comments.

Sunday 25 September 2011

Python regular expressions and the functools module

If we use annotations to sanitize incoming data, easy access to regular expressions is a must. In this article I'll show how to make the most of Python's bundled functools module to dynamically generate functions that match regular expressions, while reducing the overhead of compiling regular expressions as much as possible.

Matching against regular expressions

In our framework we annotate function parameters with functions that will pass through or convert incoming arguments if they are ok, or raise an exception otherwise. For example, if we would want to make sure an argument is an integer, we'd annotate it with Python's built-in int function:

def myfunction(arg:int):
	...

The framework will use this function to check the argument before actually calling the function.

In many situations it would be convenient if we could specify a regular expression that arguments should match. We could extend the framework to check whether the annotation was a string and use that string as a regular expression, but it is better to opt for an approach that will allow for the easy definition of categories. Some examples: def myfunc(arg:digits) ... or def myfunc(arg:zipcode) .... In these examples digits and zipcode are functions that take a single argument and return that argument if it matches a certain pattern, or raise an exception if it doesn't.

To construct such functions easily we may utilize Python's re and functools modules:

from re import compile

def regex_compile(pattern):
	return compile("^"+pattern+"$")

def match(pattern,string):
	re = regex_compile(pattern)
	if re.match(string) is None:
		raise ValueError("no pattern match")
	return string

To check whether a string matches a regular expression we define a match function that will compile the regular expression and match the string against it. The regex_compile function will make sure the match is anchored at the beginning and the end because we want a complete match, that is, strings that merely contain the pattern are considered invalid. We don't allow extraneous characters.

Constructing partial functions

So far nothing new under the sun but we still need a simple way to construct functions that match a single argument against a regular expression. Enter the partial function. It will take a function and some arguments and use those to construct a new function that will pass any arguments together with the original arguments to the encapsulated function. This is exactly what we need to construct our categories:

from functools import partial

digits = partial(match,r'\d+')
zipcode = partial(match,r'\d{4}\s*[a-zA-Z]{2}')

print(digits('12345'))
print(zipcode('1234AB'))
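
A string that does not fully match the anchored pattern raises the ValueError from match(); a quick check:

try:
	digits('123a')
except ValueError as e:
	print(e) # prints: no pattern match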

Improving efficiency by memoization

Each time we call digits(a) we actually call match(r'\d+',a), and this will result in compiling the regular expression again and again. Compiling regular expressions is a rather expensive operation so we might want to avoid that by reusing compiled expressions. This can simply be accomplished by applying the lru_cache decorator from the functools module to implement a memoize pattern:

from functools import lru_cache
from re import compile

@lru_cache(maxsize=None)
def regex_compile(pattern):
	return compile("^"+pattern+"$")

This simple change will make sure we reuse compiled regular expressions most of the time. The maxsize parameter should be set as needed: it may be set to None, in which case all compiled expressions are cached, possibly a good choice as it is unlikely that your web application sports thousands of regular expressions.
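
Because functions wrapped with lru_cache expose their statistics through cache_info(), we can verify that compiled patterns are actually being reused (assuming the match() and partial definitions from above):

digits = partial(match,r'\d+')

digits('123')
digits('456')
# the second call should hit the cache for the same pattern
print(regex_compile.cache_info())
# e.g. CacheInfo(hits=1, misses=1, maxsize=None, currsize=1)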

Sunday 18 September 2011

Propagating Python function annotations through decorators

When we decorate a Python function we must take some care to ensure that the decorated function can present itself in the same way as before its decoration. Let's explore what is needed.

Decorating a Python function

Say we have an undecorated function f(). It features a docstring and annotated arguments:

def f(a:str):
	"an undecorated function"
	return a

print(f.__name__,f.__annotations__,f.__doc__,f('foo'))

The final line in the previous snippet prints the name of the function, its annotation dictionary and the docstring. The output looks like this:

f {'a': <class 'str'>} an undecorated function foo

Decorating a Python function

Now say we have a function do_something() and we would like a decorator that arranges for a decorated function to call do_something() first. We might set this up like this:

def do_something():
	print('oink')
	
def decorator(f):
	def g(*args,**kwargs):
		do_something()
		return f(*args,**kwargs)
	return g
	
@decorator
def f(a:str):
	"a naively decorated function"
	return a

print(f.__name__,f.__annotations__,f.__doc__,f('foo'))

The final line would produce the following output:

oink
g {} None foo

It is clear that do_something() is executed as it does print oink, but although our decorated function can still be called as f(), its name, annotations and docstring are different.

Preserving attributes when decorating a Python function

To resolve these matters we can rewrite our decorator as follows:

def decorator(f):
	def g(*args,**kwargs):
		do_something()
		return f(*args,**kwargs)
	g.__name__ = f.__name__
	g.__doc__ = f.__doc__
	g.__annotations__ = f.__annotations__
	return g
	
@decorator
def f(a:str):
	"a less naively decorated function"
	return a

print(f.__name__,f.__annotations__,f.__doc__,f('foo'))

We simply copy the relevant attributes to the newly created function, and if we examine the output produced by the final line we get what we expect:

oink
f {'a': <class 'str'>} a less naively decorated function foo
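
As an aside, the standard library can do this copying for us: the functools.wraps decorator assigns the name and docstring, and from Python 3.2 onward the annotations as well. A sketch of the same decorator written with it:

from functools import wraps

def decorator(f):
	@wraps(f)
	def g(*args,**kwargs):
		do_something()
		return f(*args,**kwargs)
	return g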

Sunday 11 September 2011

Implementing POST method handling in a web application server

In an ongoing project to implement a web application server that is as simple as possible, we now implement handling of POST requests.

Handling POST requests

The HTTP POST method is often the preferred way to transfer information from a form to the server. If we use POST the arguments don't end up in the webserver log for example and it allows us to use a file upload tag.

Depending on the encoding attribute of a form element, POST data might be encoded as a single string of key/value pairs or as a multipart MIME message (useful if you want to upload files). More information can be found here.

Wielding the power of Python's cgi module

Decoding a multipart MIME message isn't trivial, but luckily for us we can offload the hard work to an existing module: cgi. It is part of the standard Python distribution and although designed to implement CGI scripts, we can co-opt its functionality for our purpose. All we have to do is add a do_POST() method to our ApplicationRequestHandler class as shown below:

from cgi import FieldStorage
from os import environ

class ApplicationRequestHandler(BaseHTTPRequestHandler):

	def do_POST(self):
		ob=self._find_app()
		
		# cgi.FieldStorage expects the request method in the environment
		environ['REQUEST_METHOD'] = 'POST'
		fs=FieldStorage(self.rfile,headers=self.headers)
		kwargs={}
		for name in fs:
			if fs[name].file:
				# pass on the temporary file object, not its contents
				kwargs[name] = fs[name].file
				kwargs[name].srcfilename = fs[name].filename
			else:
				kwargs[name] = fs[name].value
		
		return self._execute(ob,kwargs)

We factored out some common code that is also used in the do_GET() method (the _find_app() and _execute() methods, not shown here), but basically we instantiate a cgi.FieldStorage instance and pass it the input file stream along with a dictionary of the HTTP headers we have encountered so far. Because the cgi module expects some parameters to be present in the environment, we set REQUEST_METHOD to POST since that information is not part of the headers. The FieldStorage instance returned can be used as a dictionary with the names of the parameters as keys.

Any file input parameters are a bit special. The contents of the uploaded file are stored in a temporary file, and if we accessed this parameter directly its value would be the contents of this file as a bytes object. We'd rather pass on the temporary file object to prevent passing around the contents of really big files unnecessarily. We can check if a parameter is an uploaded file by checking its file attribute. If the argument is a file, we tack on the original file name (that is, the one on the user's PC) because we might want to use that later. The final line passes the arguments we have extracted to the _execute() method. This method is factored out from the do_GET() method. It locates the function to execute and checks its arguments against any annotations.
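
To exercise the new method we can post some urlencoded data from a small client script. A sketch, assuming the example application from the previous article (with its cat() method) is listening on port 8088:

from urllib.request import urlopen
from urllib.parse import urlencode

# urlopen issues a POST request when a body is supplied
data = urlencode({'a':'spam'}).encode()
print(urlopen('http://localhost:8088/cat',data).read())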

Sunday 4 September 2011

Finding stuck or hot pixels in Nikon D80 images, a multiprocessing approach

Earlier I described a method to find hot or stuck pixels by determining the variance of (sub)pixels over a large set of photos. In this article we take a look at a way to farm out this work to multiple processes.

The parallel algorithm for calculating variance

The parallel algorithm allows us to combine calculated variances from separate sets. With the help of Python's multiprocessing module it is straightforward to adapt our hot pixel finding algorithm to one that takes advantage of multiple cores:

import Image
import numpy as np
import sys
from glob import glob
from multiprocessing import Pool,current_process
from time import time

def process_pixels(shape,*args):
	first = True
	for filename in args:
		pix = np.array(Image.open(filename)) # convert to a numpy array right away
		if pix.shape != shape:
			print("shapes don't match")
			continue
		if first:
			first = False
			firstpix = pix
			n = np.zeros(firstpix.shape)
			mean = np.zeros(firstpix.shape)
			M2 = np.zeros(firstpix.shape)
			delta = np.zeros(firstpix.shape)
		n += 1
		delta = pix - mean
		mean += delta/n
		M2 += delta*(pix - mean)
	return M2
	
if __name__ == '__main__':
	global pool

	n=int(sys.argv[1])
	pool = Pool(n)
	filenames = []
	for a in sys.argv[2:]:
		filenames.extend(glob(a))
	
	shape = np.array(Image.open(filenames[0])).shape

	s=time()
	results = []
	for i in range(n):
		results.append(pool.apply_async(process_pixels,tuple([shape]+filenames[i::n])))
	for i in range(n):
		results[i]=results[i].get()
	M2 = sum(results)
	print(time()-s)
	
	mini = np.unravel_index(M2.argmin(),M2.shape)
	maxi = np.unravel_index(M2.argmax(),M2.shape)
	print('min',M2[mini],mini)
	print('max',M2[maxi],maxi)
	print('mean',np.mean(M2))

	sorti = M2.argsort(axis=None)
	print(sep="\n",*[(i,M2[i]) for i in [np.unravel_index(i,M2.shape) for i in sorti[:10]]])

	print(time()-s)

Note that because we return complete arrays from each worker we gain almost nothing for small data sets, because of the overhead of shuttling these large arrays (> 30 MByte) between processes. This is illustrated in the following graph, which shows the elapsed time as a function of the number of processes, both for a small number of pictures (50) and a large number (480).

Some other notable issues: in the code we do not actually implement the complete parallel algorithm but simply add the partial sums of squares together, omitting the correction term. Because we're only looking for the pixel with the minimum variance we lose nothing by leaving it out.

Memory usage is another thing to be aware of (and the reason that there is no entry for six cores in the graph). The algorithm we have implemented uses 5 arrays (the pixel data itself included). That makes for 10 megapixels x 3 colors x 5 arrays x 8 bytes (because we use 64 bit floats by default), which makes for a whopping 1.2 gigabyte of data per process, or more than 6 gigabyte with 5 processes. With some other applications open, a sixth process wouldn't fit on my test machine. Because we're adding pixel values in the range 0 - 255 we could probably gain a lot by using 32 bit floats or even 16 bit floats here.
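
A sketch of that last suggestion: allocating the accumulator arrays with an explicit dtype halves the memory per process (the shape shown is just an example for a 10 megapixel RGB image):

import numpy as np

shape = (2592,3872,3) # rows, columns, colors

# float32 instead of the float64 default halves memory usage
n     = np.zeros(shape,dtype=np.float32)
mean  = np.zeros(shape,dtype=np.float32)
M2    = np.zeros(shape,dtype=np.float32)
delta = np.zeros(shape,dtype=np.float32)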

Sunday 28 August 2011

Finding stuck or hot pixels in Nikon D80 images

In a previous article I presented a tiny Python program that used the Python Image Library (PIL) to correct a so called stuck or hot pixel in an image taken by a digital SLR. In this article we go one step further and see if we can use some simple statistics to identify those pixels automatically.

Finding pixels with the least variance

The idea is to identify those pixels in a bunch of pictures that don't change much from picture to picture, even if the subject and lighting conditions do change. There are of course many ways to define change, but the one we explore here is called the variance. Basically we compute for each pixel in a set of pictures its average, and then sum (again for each pixel) the squared differences from that average over all pictures. The pixels (or subpixels) that have the smallest sums are likely candidates for being hot or stuck.

Using PIL and Numpy for efficient number crunching

Obviously calculating the variance for each subpixel in a few hundred pictures will entail some serious number crunching if these pictures are from a megapixel camera. We therefore better use a serious number crunching library, like Numpy. Because we use Python 3, I recommend fetching the PIL and Numpy packages from Christoph Gohlke's page if you use Windows.

The code is shown below. The program will open all images given as arguments one by one using PIL's Image.open() function. PIL images can be converted directly with Numpy's array() function. Because we might run out of memory we do not keep all images in memory together, but process them one by one and calculate the variance using the so called on-line algorithm. The names of the variables used are the same as in the Wikipedia article, but instead of scalars we use arrays. (Note, we do not actually calculate the variance but just the sum of squares of differences from the mean, because we are interested in the pixel with the smallest variance, whatever that may be, and not in the value of the variance as such.) In the final lines we print out the results, retrieving the index of the lowest variance with Numpy's argmin() function.

import Image
import numpy as np
import sys
from glob import glob

first = True
for arg in sys.argv[1:]:
	for filename in glob(arg):
		pic = Image.open(filename)
		pix = np.array(pic)
		if first:
			first = False
			firstpix = pix
			n = np.zeros(firstpix.shape)
			mean = np.zeros(firstpix.shape)
			M2 = np.zeros(firstpix.shape)
			delta = np.zeros(firstpix.shape)
		else:
			if pix.shape != firstpix.shape:
				print("shapes don't match")
				continue
		n += 1
		delta = pix - mean
		mean += delta/n
		M2 += delta*(pix - mean)

mini = np.unravel_index(M2.argmin(),M2.shape)
print('min',M2[mini],mini)

Results and limitations

Using a few hundred photos I could easily identify a previously observed hot pixel. It did take a few minutes though, even on my pc, which is a fairly powerful hexacore machine. Besides speed, the biggest limitation is that I had to use JPEG images because I did not have RAW images available. JPEG image compression isn't lossless, so the hot pixel tends to be smeared out. If you print out not just the pixel with the lowest variance but the ten pixels with the lowest variance, most of them are clustered around the same spot (example code below).

sorti = M2.argsort(axis=None)
print(sep="\n",*[(i,M2[i]) for i in [np.unravel_index(i,M2.shape) for i in sorti[:10]]])

Sunday 21 August 2011

Windows 7 Gadgets: mini web applications

Windows 7 gadgets are small applications that are integrated in the Windows desktop. These gadgets are basically webpages displayed in a browser environment that hardly differs from a regular browser. It is therefore entirely possible to turn a gadget into a web application that retrieves data from a remote server with AJAX. In this article we explore some of the possibilities and see how we may use jQuery inside a gadget.

Anatomy of a Windows 7 gadget

There is actually some pretty decent documentation on gadgets available on the web (for example on msdn or here). The trick is to get an example gadget running with as little ballast as possible.

A gadget file is basically a zip file that contains a file gadget.html plus a number of additional files, the most important one being gadget.xml. This zip file has a .gadget extension instead of a .zip extension. So these simple steps are needed to create a bare bones gadget:

  1. Create a directory with a descriptive name, e.g. mygadget
  2. In this directory create a file gadget.html
  3. Create a subdirectory named css with a file mygadget.css
  4. Add a file gadget.xml
  5. And finally add a file icon.png
  6. Pack the contents of this directory into a file mygadget.zip (i.e. not the toplevel directory itself; see the sketch below)
  7. Rename this file to mygadget.gadget (although 7zip for example can pack to a file with any extension directly)
  8. Double click this file and follow through the install dialog
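
Steps 6 and 7 can be automated with a few lines of Python; a sketch, assuming the directory layout above (it writes the .gadget file directly):

import os
import zipfile

with zipfile.ZipFile('mygadget.gadget','w') as z:
	for root,dirs,files in os.walk('mygadget'):
		for f in files:
			full = os.path.join(root,f)
			# store paths relative to the gadget directory itself
			z.write(full,os.path.relpath(full,'mygadget'))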

With the following gadget.html the result will look like the screenshot.

<html xmlns="http://www.w3.org/1999/xhtml">
<head>	
	<title>My Gadget</title>
	<meta http-equiv="Content-Type" content="text/html; charset=unicode" />
	<link href="css/mygadget.css" rel="stylesheet" type="text/css" /> 
</head>
<body> 
<div id="main_image">
<p>42</p>
</div>
</body>
</html>

Not really impressive, I admit, but it shows how simple it is to create a gadget. The necessary gadget.xml mainly describes where to find the actual html code and what image to use in the gadget selector:
<?xml version="1.0" encoding="utf-8" ?>
<gadget>
  <name>Mygadget</name>
  <namespace><!--_locComment_text="{Locked}"-->StartSmall.Gadgets</namespace>
    <version><!--_locComment_text="{Locked}"-->1.0</version>
  <author name="Michel Anders">
    <info url="http://michelanders.blogspot.com" text="Start Small" />
    <logo src="icon.png" />
  </author>
  <copyright><!--_locComment_text="{Locked}"-->&#169; 2011</copyright>
  <description>Basic Gadget</description>
  <icons>
    <icon height="48" width="48" src="icon.png" />
  </icons>
  <hosts>
    <host name="sidebar">
      <base type="HTML" apiVersion="1.0.0" src="gadget.html" />
      <permissions>
        <!--_locComment_text="{Locked}"-->Full
      </permissions>
      <platform minPlatformVersion="1.0" />
    </host>
  </hosts>
</gadget>

Using jQuery in a Windows 7 gadget

Displaying static information is not much fun at all so let's see what options there are to create a more dynamic gadget:

  • Refer to external content like images that get refreshed
  • Refresh the content ourselves using JavaScript, possibly even interacting with the user
The first option is simple enough and we won't cover that here. The second option is more interesting because it opens up a whole array of possibilities. Let's have a look at how a gadget.html page should look to incorporate jQuery and how we can test if this works:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>	
	<title>My Gadget</title>
	<meta http-equiv="Content-Type" content="text/html; charset=unicode" />
	<link href="css/mygadget.css" rel="stylesheet" type="text/css" />
	<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js" type="text/javascript"></script>
</head>
<body> 
<div id="main_image">
<p>42</p>
</div>
<script type="text/javascript">$("#main_image p").append('<span> 43 44 45 </span>');</script>
</body>
</html>

As you can see this is surprisingly simple. The screenshot proves that the final lines of code are actually executed and change the contents of our basic gadget. Our next step is to check if we can retrieve data from a server.

JSONP in a Windows 7 gadget

Due to the same origin policy it is not entirely straightforward to retrieve data from a server different from the server we got our webpage from. In the gadget environment every server is considered a different server, because the gadget.html file originates from a file system. This means that even if we access a web application server on the same pc, we will be denied access.

Fortunately there is a workaround available in the form of JSONP, and jQuery makes it very simple for us to use this. Consider the following gadget.html:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>	
	<title>My Gadget</title>
	<meta http-equiv="Content-Type" content="text/html; charset=unicode" />
	<link href="css/mygadget.css" rel="stylesheet" type="text/css" />
	<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js" type="text/javascript"></script>
</head>
<body> 
<div id="main_image">
<p>42</p>
</div>
<script type="text/javascript">
function Refresh() {
    $.ajax('http://127.0.0.1:8088/value',
		{dataType:'jsonp',scriptCharset: "utf-8",
		success:function(data, textStatus, jqXHR){
			$("#main_image").empty().append('<p>'+String(data)+'</p>');
		}});
}
window.setInterval("Refresh()", 5000);
</script>
</body>
</html>

It will contact an application server every 5 seconds to retrieve a number and will insert this number into the content div. That's all there is to it. The application server that serves these requests is equally simple and can be built, for example, with the applicationserver module from a previous article:

from http.server import HTTPServer, BaseHTTPRequestHandler
from json import dumps
from applicationserver import IsExposed,ApplicationRequestHandler

class Application:

	def value(self,callback:str,_:int=0) -> IsExposed:
		r = "%s(%s);"%(callback,dumps(_ % 17)) 
		return r
		
class MyAppHandler(ApplicationRequestHandler):
	application=Application()
	
appserver = HTTPServer(('',8088),MyAppHandler)
appserver.serve_forever()

The only trick here is that JSONP requests pass an additional parameter (usually called callback) that will hold a string (the name of a JavaScript function in the client), and that we have to use this parameter to construct a return value that looks like a call to this function with the JSON encoded data as its argument. The _ parameter holds a random number added by jQuery to prevent caching by the browser. So if _ happened to be 123 and callback would be jquery456_123, the value we would return would be the string jquery456_123(4);

The data we return here is merely a random number but of course this could be anything and we could style it to be presented in a more readable form.

Sunday 14 August 2011

Correcting stuck or hot pixels in Nikon D80 images

A friend of mine suddenly became aware that her Nikon D80 camera had developed a so called stuck or hot pixel. Repairing a CCD chip in a camera is quite costly and, what is more, looking back through the old photos she discovered the defect had been present for over a year. In this article we develop a tiny Python program that masks out a pixel in pretty much the same way that the built-in firmware in the camera does this for pixels that were hot or dead when the camera was constructed.

Correcting stuck or hot pixels in images

Hot or stuck pixels in a camera are annoying and, other than replacing the CCD chip, nothing can be done about it with regard to the hardware. More on this in this illuminating article. Sometimes the camera can map out bad pixels, but that won't fix old images, so the plan is to create a small program that does this on existing images.

The idea is to replace the offending pixel with a weighted average of the neighboring pixels so it will blend in with the surroundings. Diagonally adjacent pixels have a lower weight than pixels horizontally or vertically adjacent (see picture on the left). Such a program is simple enough to implement using PIL, the Python Imaging Library. There is even a version for Python 3 made available by Christoph Gohlke, although you will have to replace all relative import statements with absolute ones if you want that to work. (Simply run the installer, open all *.py files in site-packages/PIL and replace all occurrences of from . import with a simple import. A decent editor like notepad++ or ultraedit can manage that in one go.)

The program will take the name of an image file and the x,y position of the pixel to fix as its arguments, e.g. python correctpixel.py myimage.jpg 1500,1271 and will produce a corrected image with the same name but with a c_ prefix added, in our example c_myimage.jpg. It will use a weighted average of the eight neighboring pixels as depicted in this image (the diagonally adjacent pixels have a weight of 1/sqrt(2)).

The simplest way to find the exact location of the pixel to fix is to open the image in Gimp, locate the pixel and hover over it with the mouse: the location is shown in the status bar at the bottom. Depending on your zoom level these may be fractional numbers, but we need just the integral parts.

The program itself is rather straightforward and is shown in its entirety below:

import sys
import Image

def correct(im,xy,matrix):
 pim = im.load()
 x,y = xy
 maxx,maxy = im.size
 n = 0
 sumr,sumg,sumb = 0,0,0
 for dx,dy,w in matrix:
  px,py = x+dx,y+dy
  if px<0 or py<0 or px >= maxx or py >= maxy:
   continue # skip neighbors that fall outside the image
  n += w
  r,g,b = pim[px,py]
  sumr,sumg,sumb = sumr+r*w,sumg+g*w,sumb+b*w
 pim[x,y]=(int(sumr/n),int(sumg/n),int(sumb/n))

w=1/(2**0.5)
matrixd=((0,1,1),(0,-1,1),(1,0,1),(-1,0,1),(-1,-1,w),(-1,1,w),(1,1,w),(1,-1,w))

im = Image.open(sys.argv[1])
xy = tuple(map(int,sys.argv[2].split(',')))
correct(im,xy,matrixd)
im.save('c_'+sys.argv[1],quality=97)

Given an opened image, the function correct() will load the pixel data and loop over the list of pixel offsets and weights given in its matrix argument. It will check whether a neighbor is within the bounds of the image (if the stuck pixel is on an edge it might not be) and if so, sum its RGB components according to the weight of this neighbor. The pixel is finally replaced by the summed values divided by the sum of the weights used. We also take care to convert the RGB components to integers again.

The final lines take care of opening the image and converting the pixel position argument to a tuple of integers before calling the correct() function. The image is saved again with a name prefixed with c_. The quality parameter is set to a very high value to save the image with more or less the same JPEG quality as the original image.

Note that at the moment the code shown works for JPEG images only (or to be precise, only for images with three bands: R, G and B). PNG images may have an extra transparency band and the code does not deal with that, nor does it play nice with indexed formats (like GIF).
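
A possible way around that limitation would be to convert any image to three bands on opening, at the cost of discarding transparency information; a one-line sketch:

im = Image.open(sys.argv[1]).convert('RGB')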

Sunday 7 August 2011

Function annotations in Python, checking parameters in a web application server, part II

In a previous article I illustrated how we could put Python's function annotations to good use to provide us with a syntactically elegant way to describe which kind of parameter content would be acceptable to our simple web application framework. In this article the actual implementation is presented, along with some notes on its usage.

A simple Python application server with full argument checking

The applicationserver module provides an ApplicationRequestHandler class that derives from the Python provided BaseHTTPRequestHandler class. For now all we do is provide a do_GET() method, i.e. we don't bother with HTTP POST methods yet, although it wouldn't take too much effort if we did.

The core of the code is presented below:

class ApplicationRequestHandler(BaseHTTPRequestHandler):
 
 def do_GET(self):
  url=urlsplit(self.path)
  path=url.path.strip('/')
  parts=path.split('/')
  if parts[0]=='':
   parts=['index']
   
  ob=self.application
  try:
   for p in parts:
    if hasattr(ob,p):
     ob=getattr(ob,p)
    else:
     raise AttributeError('unknown path '+ url.path)
  except AttributeError as e:
   self.send_error(404,str(e))
   return
   
  if ('return' in ob.__annotations__ and 
   issubclass(ob.__annotations__['return'],IsExposed)):
   try:
    kwargs=self.parse_args(url.query,ob.__annotations__)
   except Exception as e:
    self.send_error(400,str(e))
    return
    
   try:
    result=ob(**kwargs)
   except Exception as e:
    self.send_error(500,str(e))
    return
  else:
   self.send_error(404,'path not exposed'+ url.path)
   return
   
  self.wfile.write(result.encode())

 @staticmethod
 def parse_args(query,annotations):
  kwargs=defaultdict(list)
  if query != '':
   for p in query.split('&'):
    (name,value)=p.split('=',1)
    if not name in annotations:
     raise KeyError(name+' not annotated')
    else:
     kwargs[name].append(annotations[name](value))
   for k in kwargs:
    if len(kwargs[k])==1:
     kwargs[k]=kwargs[k][0]
  return kwargs
  
The first thing we do in the do_GET() method is split the path into separate components. The path is provided in the path member and is already stripped of hostname and query parameters. We strip from it any leading or trailing slashes as well. If there are no path components we will look for a method called index.

The next step is to check each path component and see if it's a member of the application we registered with the handler. If it is, we retrieve it and check whether the next component is a member of this new object. If any of these path components is missing we raise an error.

If all parts of the path can be resolved as members, we check whether the final path points to an executable with a return annotation. If this return annotation is defined and equal to our IsExposed class we are willing to execute it, otherwise we raise an error.

The next step is to check each argument, so we pass the query part of the URL to the static parse_args() method that will return us a dictionary of values if all values checked out ok. If so we call the method that we found earlier with these arguments and if all went well, write its result to the output stream that will deliver the content to the client.

The parse_args() method is not very complicated: it creates a default dictionary whose default will be an empty list. This way we can create a list of values if the query consists of more than one part with the same name. Then we split the query on the & character and split each part into a name and a value part (these are separated by a = character). Next we check if the name is present in the annotations dictionary and if not raise a KeyError.

If we did find the name in the annotations dictionary, its associated value should be a callable to which we pass the value from the query part. The result of this check (or conversion) is appended to the list in the default dictionary. If the value in the annotation is not a callable, an exception will be raised that will not be caught. Likewise, any exception raised within this callable will bubble up to the calling code. The final lines reduce those entries in the default dictionary that consist of a list with just a single item to just that item before returning.

Conclusion

Python function annotations can be used for many purposes, and this example code shows a rather elegant (I think) way that lets us specify in clear language what we expect of functions that are part of web applications, which makes it quite easy to catch many input errors in a way that can be adapted to almost any input pattern.

Sunday 31 July 2011

Function annotations in Python, checking parameters in a web application server

Parameter annotations in function definitions are a recent addition to the language. In this article we show how this feature can be put to good use when we build a simple web application server that checks its input parameters rigorously.

Creating a simple web application server with HTTPServer

It is of course entirely possible to create a web application in a short time when you use an existing Python web application framework. In previous articles and my book on web applications I've used CherryPy extensively, and although I recommend it for its flexibility and ease of use, it isn't all that difficult to create a web application framework from scratch.

Python's http.server module provides us with the basic building blocks: the HTTPServer class to handle incoming connections and a BaseHTTPRequestHandler class that processes requests and returns an answer. The main part of developing an application server is therefore sub-classing the BaseHTTPRequestHandler class. The minimum it will have to provide is a do_GET() method that returns results based on any parameters it receives.

Using Python parameter annotations

CherryPy uses classes with methods that together serve as an application: requested URLs are mapped to these methods and any parameters are passed along. CherryPy uses an expose decorator to identify the methods that may be called. Non-exposed methods are invisible, i.e. URLs that match those methods do not result in the invocation of those methods. This behavior is what we would like to mimic in our own web application server.

Another important concept in web applications is the screening of input: we would like to check that incoming data (i.e. the arguments that come along with a query) are within the range of things we deem acceptable. For example, a function that adds its arguments should reject anything that cannot be interpreted as a float. We could write code easily enough that checks any function parameters explicitly but wouldn't it be nice if there was a more syntactically pleasing way of writing this?

Enter Python's function parameter annotations. Python allows us to augment each function parameter with an expression that is evaluated when the function is defined and which is stored in the function's __annotations__ field as a dictionary indexed by parameter name. Such an annotation might be as simple as a single string, but it can be anything, even a function reference. This function could be called with the value we would like to pass as a parameter, to check if this value is ok. This could be done before the function is actually called, for example by the do_GET() method of our request handler.
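
To make this concrete, a tiny example of what Python stores when we annotate parameters:

def add(a:float,b:float):
 return a+b

print(add.__annotations__)
# -> {'a': <class 'float'>, 'b': <class 'float'>}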

Assuming we have our applicationserver module available, let's have a look what the definition of a new web application might look like:

from http.server import HTTPServer
from applicationserver import ApplicationRequestHandler,IsExposed

class Application:

 def donothing(self):
  pass
  
 def index(self) -> IsExposed:
  return 'index oink'
 
 def add(self,a:float,b:float) -> IsExposed:
  return str(a+b)
 
 def cat(self,a:str) -> IsExposed:
  return ' '.join(a)
 
 def opt(self,a:int=42) -> IsExposed:
  return str(a)
  
class MyAppHandler(ApplicationRequestHandler):
 application=Application()
 
appserver = HTTPServer(('',8088),MyAppHandler)
appserver.serve_forever()

The overall idea is to subclass the ApplicationRequestHandler and assign an instance of the Application class to its application field. This application handler is then passed to an HTTPServer instance that will forward incoming requests to this handler.

Our ApplicationRequestHandler will try to map URLs of the form http://hostname:8088/foo?a=1&b=2 to member functions of the Application instance. It will only consider member functions with a return annotation equal to our IsExposed class. So even though we have defined a donothing() function, it will not be executed when a URL like http://hostname:8088/donothing is received.

We also use annotations to restrict input values for parameters to functions that are exposed. Remember that annotations can be any expression, and here we employ that fact to annotate the a and b parameters to the add() method with a reference to the built-in float() function. Our ApplicationRequestHandler will pass an argument to any callable it finds as its corresponding annotation and will only execute the method if this callable returns a value (and does not raise an exception). So a URL like http://localhost:8088/add?a=1.23&b=4.56 will return a meaningful result, while http://localhost:8088/add?a=1.23&b=spam will fail with an error. Of course we are not restricted to built-in functions here: we can refer to functions that perform elaborate checking as well, perhaps checking against regular expressions or performing lookups in database tables.

All this shows that Python's function annotations allow for a rather elegant way to describe the expected behavior of methods that perform some sort of action in a web application. In a future article I'll show how to implement the applicationserver module.

Sunday 24 July 2011

More SQLite multithreading woes

In this article we revisit our thread safe persistent dictionary and encounter some irritating performance issues, both SQLite related and caused by Python itself.

sqlite3.OperationalError: database is locked

When testing the persistentdict module with many simultaneous threads (more than twenty) I noticed a great number of errors: the many connections open to the same database caused SQLite to raise a lot of sqlite3.OperationalError, database is locked exceptions. Getting decent performance with SQLite is by no means easy, and because SQLite only locks complete database files and not just tables or rows, there is a big chance that threads accessing the same database have to wait their turn.

sqlite3.connect, the check_same_thread parameter

Python 3.2 comes with a sqlite3 module that implements (but scarcely documents) a check_same_thread parameter that can be set to false to allow threads to use the same Connection object simultaneously. This is nice since this means we no longer have to implement all sorts of code to provide each thread with its own connection.

But we still have to regulate access to this connection because otherwise a commit in one thread may invalidate a longer running execute in another thread, leaving us with errors like sqlite3.InterfaceError: Cursor needed to be reset because of commit/rollback and can no longer be fetched from.

Python thread switching is really slow

Switching between threads, especially on multi-core machines, has never been Python's strongest feature, and making our persistent dict thread safe with a lock might hurt performance a lot, depending on what is going on in the threads themselves (if there is a lot of I/O going on, the impact might not be that big).

A new implementation

The code below shows the new implementation, with a single connection that may be shared by multiple threads. It works, but it is really slow: with 40 threads I get just 10 to 20 dictionary assignments (d[1]=2) per second on my dual core Atom. That isn't the fastest machine around, but those figures are ridiculously low. We will have to rethink our approach if we want to use SQLite in a multithreaded environment and need any kind of performance!

"""
 persistentdict module $Revision: 98 $ $Date: 2011-07-23 14:01:04 +0200 (za, 23 jul 2011) $

 (c) 2011 Michel J. Anders

 This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <http://www.gnu.org/licenses/>.

"""

from collections import UserDict
import sqlite3 as sqlite
from pickle import dumps,loads
import threading 

class PersistentDict(UserDict):

 """
 PersistentDict  a MutableMapping that provides a thread safe,
     SQLite backed persistent storage.
 
 db  name of the SQLite database file, e.g.
   '/tmp/persists.db', default to 'persistentdict.db'
 table name of the table that holds the persistent data,
   useful if more than one persistent dictionary is
   needed. defaults to 'dict'
 
 PersistentDict tries to mimic the behaviour of the built-in
 dict as closely as possible. This means that keys should be hashable.
 
 Usage example:
 
 >>> from persistentdict import PersistentDict
 >>> a=PersistentDict()
 >>> a['number four'] = 4
 
 ... shutdown and then restart application ...
 
 >>> from persistentdict import PersistentDict
 >>> a=PersistentDict()
 >>> print(a['number four'])
 4
 
 Tested with Python 3.2 but should work with 3.x and 2.7.x as well.
 
 run module directly to run test suite:
 
 > python PersistentDict.py
 
 """
 
 def __init__(self, dict=None, **kwargs):
  
  self.db    = kwargs.pop('db','persistentdict.db')
  self.table = kwargs.pop('table','dict')
  #self.local = threading.local()
  self.conn = None
  self.lock = threading.Lock()
  
  with self.lock:
   with self.connect() as conn:
    conn.execute('create table if not exists %s (hash unique not null,key,value);'%self.table)
    
  if dict is not None:
   self.update(dict)
  if len(kwargs):
   self.update(kwargs)
 
 def connect(self):
  if self.conn is None:
   self.conn = sqlite.connect(self.db,check_same_thread=False)
  return self.conn
   
 def __len__(self):
  with self.lock:
   cursor = self.connect().cursor()
   cursor.execute('select count(*) from %s'%self.table)
   return cursor.fetchone()[0]
 
 def __getitem__(self, key):
  with self.lock:
   cursor = self.connect().cursor()
   h=hash(key)
   cursor.execute('select value from %s where hash = ?'%self.table,(h,))
   try:
    return loads(cursor.fetchone()[0])
   except TypeError:
    if hasattr(self.__class__, "__missing__"):
     return self.__class__.__missing__(self, key)
   raise KeyError(key)
   
 def __setitem__(self, key, item):
  h=hash(key)
  with self.lock:
   with self.connect() as conn:
    conn.execute('insert or replace into %s values(?,?,?)'%self.table,(h,dumps(key),dumps(item)))

 def __delitem__(self, key):
  h=hash(key)
  with self.lock:
   with self.connect() as conn:
    conn.execute('delete from %s where hash = ?'%self.table,(h,))

 def __iter__(self):
  with self.lock:
   cursor = self.connect().cursor()
   cursor.execute('select key from %s'%self.table)
   rows = list(cursor.fetchall())
  for row in rows:
   yield loads(row[0])

 def __contains__(self, key):
  h=hash(key)
  with self.lock:
   cursor = self.connect().cursor()
   cursor.execute('select value from %s where hash = ?'%self.table,(h,))
   return not ( None is cursor.fetchone())

 # not implemented def __repr__(self): return repr(self.data)
 
 def copy(self):
  c = self.__class__(db=self.db)
  for key,item in self.items():
   c[key]=item
  return c

Sunday 17 July 2011

A SQLite multiprocessing proxy, part 3

In a previous article I presented a first implementation of a SQLite proxy that makes it possible to distribute the workload of multiple processes with the use of Python's multiprocessing module. In this third part of the series we try to analyze the performance of this setup.

High workload example

In our sample implementation we can vary the workload inside the processes that interact with the SQLite database by varying the size of the table that we query. A table with many rows takes more time to scan for a certain random value than a table with just a few rows.

The first graph we present here is about high workload: the table that we query is initialized with one million records. The graph shows the time to complete 100 queries. The test was done on a machine with 6 processor cores, and in the graph we show the results for 2 (deep purple, back) and 6 (light purple, front) worker processes and a varying number of threads.

The results are more or less what we expect: more worker processes means that the time to complete all tasks is reduced. However, the number of threads is also significant. If the number of threads is less than the number of available worker processes we do not reach the full potential. Basically we need at least as many threads as there are worker processes to keep those processes busy. If we have more threads than worker processes there is no more gain; in fact we see a minute increase in the time needed to complete all tasks. This might be due to the overhead of creating and managing threads in Python.

Low workload example

If we initialize our table with just a single row the workload will be negligible. If we draw a similar graph as for the high workload we see a completely different picture.

Now we see hardly any difference between 2 worker processes and 6, and increasing the number of threads also has no effect. Also the data is rather noisy, i.e. it varies quite a bit in a non-uniform manner, especially for the case with 2 worker processes. The reason for this behavior is not entirely clear to me, although it is obvious that because of the very small workload the time to set up communication with the worker process is a significant factor here.

Wednesday 13 July 2011

Nice discounts on open source titles at Packt

Packt has an offer on open source books, both in print and as e-book, that might interest you. Check out their July offering, there certainly are some interesting titles available, including a few on Python and web development.

Sunday 10 July 2011

A SQLite multiprocessing proxy, part 2

In a previous article we decided to use Python's multiprocessing module to leverage the power of multi-core machines. Our use case is all about web applications served by CherryPy, and so multiprocessing isn't the only interesting part: our application will be multi-threaded as well. In this article we present a first implementation of a multi-threaded application that hands off the heavy lifting to a pool of subprocesses.

The design

The design is centered on the following concepts:

  • The main process consists of multiple threads,
  • The work is done by a pool of subprocesses,
  • Transferring data to and from the subprocesses is left to the pool manager
Schematically we can visualize it as follows:

Sample code

We start off by including the necessary components:

from multiprocessing import Pool,current_process
from threading import current_thread,Thread
from queue import Queue
import sqlite3 as dbapi
from time import time,sleep
from random import random

The most important ones we need are the Pool class from the multiprocessing module and the Thread class from the threading module. We also import queue.Queue to act as a task list for the threads. Note that the multiprocessing module has its own Queue implementation that is not only thread safe but can be used for inter process communication as well; we won't be using that one here, however, but rely on a simpler paradigm as we will see.

The next step is to define a function that may be called by the threads.

def execute(sql,params=tuple()):
 global pool
 return pool.apply(task,(sql,params))

It takes a string argument with SQL code and an optional tuple of parameters, just like the Cursor.execute() method in the sqlite3 module. It merely passes on these arguments to the apply() method of the multiprocessing.Pool instance that is referred to by the global pool variable. Together with the SQL string and parameters a reference to the task() function is passed, which is defined below:
def task(sql,params):
 global connection
 c=connection.cursor()
 c.execute(sql,params)
 l=c.fetchall()
 return l
This function just executes the SQL and returns the results. It assumes the global variable connection contains a valid sqlite3.Connection instance, something that is taken care of by the connect function that will be passed as an initializer to any new subprocess:
def connect(*args):
 global connection
 connection = dbapi.connect(*args)

Before we initialize our pool of subprocess let's have a look at the core function of any thread we start in our main process:

def threadwork(initializer=None,kwargs={}):
 global tasks
 if not ( initializer is None) :
  initializer(**kwargs)
 while(True):
  (sql,params) = tasks.get()
  if sql=='quit': break
  r=execute(sql,params)
It calls an optional thread initializer first and then enters a semi-infinite loop. This loop starts by fetching an item from the global tasks queue. Each item is a tuple consisting of a string and another tuple with parameters. If the string equals quit we terminate the loop, otherwise we simply pass the SQL statement and any parameters on to the execute() function we encountered earlier, which will take care of handing it to the pool of subprocesses. We store the result of this query in the r variable even though we do nothing with it in this example.

For this simple example we also need a database that holds a table with some data we can play with. We initialize this table with rows containing random numbers. When we benchmark the code we can make this table as large as we wish to get meaningful results; after all, our queries should take some time to complete, otherwise there would be no need to use more processes.

def initdb(db,rows=10000):
 c=dbapi.connect(db)
 cr=c.cursor()
 cr.execute('drop table if exists data');
 cr.execute('create table data (a,b)')
 for i in range(rows):
  cr.execute('insert into data values(?,?)',(i,random()))
 c.commit()
 c.close()

The final pieces of code tie everything together:

def thread_initializer():
 # per-thread setup could go here; nothing is needed in this
 # example, but threadwork() expects this function to exist
 pass

if __name__ == '__main__':
 global pool
 global tasks
 
 tasks=Queue()
 db='/tmp/test.db'
 
 initdb(db,100000)
 
 nthreads=10
 
 for i in range(100):
  tasks.put(('SELECT count(*) FROM data WHERE b>?',(random(),)))
 for i in range(nthreads):
  tasks.put(('quit',tuple()))
 
 pool=Pool(2,connect,(db,))
 
 threads=[]
 for t in range(nthreads):
  th=Thread(target=threadwork,kwargs={'initializer':thread_initializer})
  threads.append(th)
  th.start()
 for th in threads:
  th.join()
After creating a queue and initializing the database, the next step is to fill the queue with a fair number of tasks. The final tasks we add to the queue signal a thread to stop; we need as many of those as there will be threads.

We then initialize our pool of processes: just two in this example, but in general the number should be equal to the number of CPUs in the system (if you omit this argument it defaults to exactly that). Next we create and start the number of threads we want. The target argument points to the function we defined earlier that does all the work, i.e. pops tasks from the queue and passes them on to the pool of processes. The final lines simply wait till all threads are finished.

What's next?

In a following article we will benchmark and analyze this code and see how we can improve on this design.

Sunday 3 July 2011

A SQLite multiprocessing proxy

This is the first article in a series on improving the performance of Python web applications by leveraging the possibilities of the multiprocessing module. We'll focus on CherryPy and SQLite, but the conclusions should be general enough for any Python based platform.

Use case

Due to well known restrictions in the most common Python implementation (CPython's global interpreter lock), multithreading solutions will probably not help to solve performance issues (with the possible exception of serving slow network connections). The multiprocessing module offers an API similar to the threading module and might be an alternative when we want to divide the workload on a multicore machine.
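
To illustrate how similar the two APIs are, here is a minimal sketch that runs the same function once in a thread and once in a separate process:

from threading import Thread
from multiprocessing import Process

def work(n):
 print(sum(i*i for i in range(n)))

if __name__ == '__main__':
 t = Thread(target=work, args=(100000,))   # runs in the same process
 p = Process(target=work, args=(100000,))  # runs in a separate process
 t.start(); p.start()
 t.join(); p.join()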

The use case we're interested in is a CherryPy server that serves many requests, backed by a SQLite database. CherryPy is multithreaded by design and this approach is sensible as a web server may spend more time waiting for data to be transmitted over relatively slow network connections than actually doing work.

CherryPy however is also an excellent framework to host web applications and many web applications rely on some sort of database back-end. SQLite is a good choice for such a back-end as it comes bundled with Python (reducing the number of external dependencies), is easy to use and performs well enough. With some tricks it will even play nice in a multithreaded environment.

A disadvantage of using SQLite is that we do not have a separate database server: the SQLite engine is part of the same process that runs the Python interpreter. This means that it has the same handicap as any multithreaded application on CPython (the most common implementation of Python) and will not benefit from any extra cores or processors available on the server.

Now we could switch to MySQL or any other stand-alone database back-end but this would add quite an amount to the maintenance burden of our web application. Wouldn't it be nice if we could devise a way to use SQLite together with the multiprocessing module to have the best of both worlds: the ease of use of SQLite and the performance benefits of a stand-alone database server?

In this series of articles I will explore the possibilities and hopefully will come up with a solution that will provide:

  • a dbapi proxy (we'll use sqlite3 module but it should be general enough for any dbapi compliant database)
  • that will use the multiprocessing module to increase performance and
  • can be used from a multithreaded environment.
It would be nice if the API closely resembles the dbapi (but that is not an absolute requirement).
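
As a rough sketch of the kind of facade we are aiming for, consider the snippet below. ConnectionProxy is just an illustrative name, not an existing class, and opening a connection per call is exactly the kind of overhead a real implementation should avoid:

from multiprocessing import Pool
import sqlite3

def _task(sql, params):
 # illustration only: a real implementation would keep one
 # connection per worker process instead of opening one per call
 with sqlite3.connect('/tmp/test.db') as conn:
  return conn.execute(sql, params).fetchall()

class ConnectionProxy:
 # a dbapi-like facade over a pool of worker processes
 def __init__(self, processes=2):
  self.pool = Pool(processes)
 def execute(self, sql, params=tuple()):
  return self.pool.apply(_task, (sql, params))

if __name__ == '__main__':
 proxy = ConnectionProxy()
 print(proxy.execute('select 1'))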

In the next article in this series I will explore the options to make threads and processes play nice, focusing on inter process communication.

Sunday 26 June 2011

linkcheck a Python module to check for broken links

When maintaining a web site, or even when hosting a web application, it makes sense to check once in a while whether there are any broken links. Here I present a simple module to find broken links in a website; it is pure Python and uses no external modules.

Checking for broken links

Checking for broken links can be as simple as shown in the following example:

from linkcheck import LinkChecker

lc = LinkChecker("http://www.example.org")
if lc.check():
    if not lc.follow():
        print("there were problems")
        print("\n".join(lc.failed))
        print("\n".join(lc.other))
    else:
        print("website OK")
else:
    print("cannot open website or homepage is not html")

The check() method tries to open a URL and checks if its content is HTML. If this went well it returns True. The next step is to use the follow() method to see if we can open any links present in the page and report any failures. If a link points to a page containing HTML this is repeated in a recursive fashion.

The linkcheck module

We highlight some implementation details of the linkcheck module here, but the full code can be found on my website. It is licensed under the GPL and comes with a modest test suite.

The code depends on two crucial elements: LinkParser, an html.parser.HTMLParser derived class, and LinkChecker.

LinkParser acts on just a few HTML start tags, defined in the tagsrefs class variable. Its initializer is passed a baseurl argument that is used to resolve relative URLs and a callback parameter that is called for each relevant tag that holds a reference to another URL.

The LinkChecker class is discussed in detail in an article on my website. For now I highlight just its main methods:

  • __init__() takes just one mandatory parameter, the url we start with,
  • check() verifies that the url passed to __init__() could be opened and points to an HTML file,
  • follow() finds any links within the HTML and marks any of those links it cannot open as failed in the failed instance variable,
  • process() is the callback passed to an instance of the LinkParser class. It recursively creates a new LinkChecker instance to check each link for additional links.

from urllib.request import Request,urlopen
from urllib.parse import urlsplit,urljoin,urlunsplit,urldefrag
from urllib.error import HTTPError,URLError
from html.parser import HTMLParser
from re import compile,MULTILINE,IGNORECASE

class LinkParser(HTMLParser):

 tagsrefs = { 'a':'href', 'img':'src', 'script':'src', 'link':'href' }
 
 def __init__(self,baseurl,callback):
  self.callback = callback
  self.baseurl = baseurl
  super().__init__()
  
 def handle_starttag(self, tag, attrs):
  if tag in self.tagsrefs:
   for name,value in attrs:
    if name == self.tagsrefs[tag]:
     newurl=urljoin(self.baseurl,value)
     self.callback(newurl)
     break

class LinkChecker:

 html=compile(r'^Content-Type:\s+text/html$',MULTILINE|IGNORECASE)

 def __init__(self,url,host=None,seen=None,external=True):
  self.url    = url
  self.host   = urlsplit(url).hostname if host is None else host
  self.failed = []
  self.other  = []
  self.notopened = []
  self.duplicates = 0
  self.seen = set() if seen is None else seen
  self.external = external

 def check(self,open=True):
  self.seen.add(self.url)
  if not open :
   self.notopened.append(self.url)
   return False
  try:
   self.req=urlopen(url=self.url,timeout=10)
  except HTTPError as e:
   self.failed.append(self.url)
   return False
  except URLError as e:
   self.other.append(self.url+' ('+str(e)+')')
   return False
  except Exception as e:
   print('Exception',e,type(e))
   self.failed.append(self.url)
   return False
  headers=str(self.req.info())
  m=self.html.search(headers)
  return not(m is None)
  
 def follow(self):
  parser = LinkParser(self.url,self.process)
  try:
   parser.feed(self.req.read().decode())
  except Exception as e:
   self.other.append(self.url+' ('+str(e)+')')
  return len(self.failed)+len(self.other) == 0
  
 def process(self,newurl):
  newurl=urldefrag(newurl)[0]
  if not newurl in self.seen:
   lc = LinkChecker(newurl,self.host,self.seen,self.external)
   samesite = urlsplit(newurl).hostname == self.host
   if lc.check(self.external or samesite) and samesite:
    lc.follow()
   self.failed.extend(lc.failed)
   self.other.extend(lc.other)
   self.notopened.extend(lc.notopened)
   self.seen.update(lc.seen)
   self.duplicates+=lc.duplicates
  else:
   self.duplicates+=1

Sunday 19 June 2011

A jQuery plugin to add icons to elements

In this article I present a simple jQuery plugin that adds suitable icons to a selection of elements based on the value of the rel attribute.

CSS3 :after selector not suitable?

If the target browser understands CSS3 we could add an icon with the help of the :after selector. This has a couple of disadvantages though:

  • Only the newest browsers support this,
  • Although it is possible to insert a url that points to an image, this element is magic, i.e. not visible in the DOM, so we cannot style it with CSS,
  • We can select elements based on the contents of an attribute but we cannot construct an element based on the result of this selection.
Fortunately a flexible yet simple solution can be crafted in a few lines of Javascript with the help of the jQuery library.

The iconify plugin

The code presented below assumes that jQuery is already loaded. It is tested with jQuery 1.6.1 but should work with earlier versions just as well. Using the plugin is as simple as $("a").iconify();. This will add an icon (an <img> tag) to every link that has a rel attribute. You need to provide a directory with suitable icons. For example, if you have links with rel="python", an icon with src="Icons/python-icon.png" will be added.

The plugin can be broken down into a few simple steps. The first thing it does is make sure the selection is a jQuery object. The next two lines make sure that if either of the parameters is missing the corresponding default is used.

We then iterate over all elements of the selection and retrieve the rel attribute. If this attribute is present and not empty, we check whether it can be found in the iconmap object; if not, we construct a generic name. Next we construct an <img> element with a suitable src attribute and a meaningful alt attribute and append this element to the current element in the iteration. Finally we return the original selection to allow this plugin to be chained just like most jQuery functions.

$.fn.iconify = function (icondir,iconmap) {
 var $this = $(this);
 icondir = icondir || $.fn.iconify.icondir;
 iconmap = iconmap || $.fn.iconify.iconmap;
 
 $this.each(function(i,e){
  e=$(e);
  var rel=e.attr('rel');
  var src="";
  var alt="";
  
  if(rel !== undefined && rel != ''){
   if(iconmap[rel]!==undefined){
    src='src="'+icondir+'/'+iconmap[rel]+'"';
   }else{
    src='src="'+icondir+'/'+rel+'-icon.png'+'"';
   }
   alt='alt="'+rel+' icon"';
   var img=$('<img '+src+' '+alt+'/>');
   e.append(img);
  }
 });
 
 return $this;
};
$.fn.iconify.icondir='Icons';
$.fn.iconify.iconmap={
'python' :'python-logo-24.png',
'blender' :'blender-logo-24.png',
'photo'  :'camera-logo-24.png',
'wikipedia' :'wikipedia-logo-24.png'
};

The CSS rel attribute

Using the rel attribute for this purpose might be considered misuse (see for example this page) so you might consider rewriting the plugin to check another attribute, for instance class.

Friday 17 June 2011

A SQLite thread safe password store, revisited

In a previous article I showed how to implement a thread safe persistent password store that was based on SQLite. In this article a reimplementation of that module is presented, based on the persistentdict module.

A SQLite thread safe password store

As you can see in the code presented below, we can put the PersistentDict class developed earlier to good use. Because we use two instances of PersistentDict to store the salt and the hashed passwords instead of interacting with a SQLite database ourselves, the code is much cleaner and therefore easier to maintain.

'''
 dbpassword.py Copyright 2011, Michel J. Anders

 $Revision: 70 $ $Date: 2011-06-10 16:34:28 +0200 (vr, 10 jun 2011) $
 
 This program is free software: you can redistribute it
 and/or modify it under the terms of the GNU General Public
 License as published by the Free Software Foundation,
 either version 3 of the License, or (at your option) any
 later version.

 This program is distributed in the hope that it will be 
 useful, but WITHOUT ANY WARRANTY; without even the implied
 warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
 PURPOSE. See the GNU General Public License for more
 details.

 You should have received a copy of the GNU General Public
 License along with this program.  If not, see 
 www.gnu.org/licenses.
'''

import hashlib
from random import SystemRandom as sr
from persistentdict import PersistentDict

class dbpassword:

 @staticmethod
 def hashpassword(name,salt,plaintextpassword,n=10):
  if n<1 : raise ValueError("n < 1")
  d = hashlib.new(name,(salt+plaintextpassword).encode()).digest()
  while n:
   n -= 1
   d = hashlib.new(name,d).digest()
  return hashlib.new(name,d).hexdigest()

 @staticmethod
 def getsalt(randombits=64):
  if randombits<16 : raise ValueError("randombits < 16")
  return "%016x"%sr().getrandbits(randombits)

 def __init__(self,db='password.db',
    secure_hash='sha256',iterations=1000,saltbits=64):
  self.saltdict = PersistentDict(db=db,table='salt')
  self.pwdict = PersistentDict(db=db,table='password')
  self.secure_hash = secure_hash
  self.iterations = iterations
  self.saltbits = saltbits
  
 def update(self,user,plaintextpassword):
  salt=dbpassword.getsalt(self.saltbits)
  self.saltdict[user]=salt
  self.pwdict[user]=dbpassword.hashpassword(
     self.secure_hash,salt,plaintextpassword,
     self.iterations)

 def check(self,user,plaintextpassword):
  salt=self.saltdict[user]
  return self.pwdict[user]==dbpassword.hashpassword(
   self.secure_hash,salt,plaintextpassword,
   self.iterations)
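
Using the class is straightforward; a quick sketch, assuming the persistentdict module from the article below is importable and the database file is writable:

dbp = dbpassword('/tmp/password.db')
dbp.update('alice','s3cret')        # stores a fresh salt and the hashed password
print(dbp.check('alice','s3cret'))  # True
print(dbp.check('alice','wrong'))   # False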

Sunday 12 June 2011

A Python module providing thread safe SQLite backed persistent storage

In a previous article I wrote about some research I did on Python modules that could provide me with a persistent storage solution for dictionaries. I didn't quite find what I needed, especially as none of the modules provided a thread safe solution. In the end I decided to write my own.

Thread safe, SQLite backed persistent storage

The module I wrote, persistentdict, is freely available on my website along with some notes and examples. It also has a fairly extensive test suite. It provides a single class, PersistentDict, that behaves almost exactly like a native Python dict but stores its keys and values in a SQLite table instead of keeping them in memory.
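
In practice it looks like working with an ordinary dict; a minimal sketch, using the db and table keyword arguments that also appear in the password store article above:

from persistentdict import PersistentDict

d = PersistentDict(db='/tmp/example.db', table='settings')
d['theme'] = 'dark'      # stored in the SQLite table, not in memory
print(d['theme'])
print('theme' in d)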

Friday 10 June 2011

SQLite as persistent backing to a Python dict

Doing some research can save you a lot of work: while looking around for a way to use SQLite as a persistent backing store for a Python dictionary I found at least two decent implementations. This blog post contains my research notes.

Requirements

  • pure Python, to ensure cross platform portability
  • no additional external dependencies, to facilitate easy packaging
  • portable data back-end format
  • thread safe
  • well written
  • well documented
  • open source

The requirement to have a portable back-end format makes SQLite based implementations a strong preference, as SQLite interfaces are available in a number of other programming languages as well, notably C.

Thread safety is a strong requirement because I want to use this solution in CherryPy. It is not enough to restrict access to an object with some form of locking, because if some database activity takes place, for example a database transaction, this activity itself must be multithreading proof. Some database engines in Python are thread safe, and SQLite can be made to work in a multithreaded environment as long as each thread has its own connection. Check this post to see how this may be accomplished.
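
One common way to give each thread its own connection is to stash it in a threading.local object so that every thread lazily creates and reuses one of its own; a minimal sketch:

import sqlite3
import threading

local = threading.local()

def get_connection(db='/tmp/test.db'):
 # create one connection per thread on first use and cache it
 if not hasattr(local, 'connection'):
  local.connection = sqlite3.connect(db)
 return local.connection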

Python's shelve module

Python's shelve module is Python specific and not thread safe.

Seb Sauvages's dbdict

Seb Sauvages's dbdict is an interesting starting point, although it is not thread safe.

Erez' FileDict

Erez' FileDict is a more complete implementation but not thread safe either.

Tokyo Cabinet

Tokyo Cabinet feels a bit too complex for my taste and is another package that is not thread safe.

Preliminary conclusion

Finding a thread safe solution to providing a persistent database backing for Python dictionaries is not as easy as I hoped. Finding one that meets my exact requirements may take more time than writing something from scratch which is a bit of a disappointment.

Sunday 5 June 2011

Packt Python Book Idea Generator

Packt Publishing invites current and future readers to send in suggestions for subjects to cover in books on Python.

What Python subject do you want authors to write about?

I think this is a pretty neat idea. As a writer myself I choose subjects that interest me most and if a publisher thinks there is a market for it, the game is on. But I have many Python related interests and that probably goes for other authors as well, so if people had a way to show what they would like to see covered in a Python book, the publisher and authors could pick a subject that would certainly please future readers. That's a classic win-win situation. The people at Packt now offer such a platform, so if you have a (Python related) subject you would love to see a book about, check out this poll.

Sunday 29 May 2011

Visualizing Python call graphs with jsPlumb

When analyzing or debugging complex code it can be a great help to have a call graph at your disposal. In this article we look at Python's bundled trace module to generate a call graph and at jsPlumb, a Javascript library to render an interactive visual representation.

Trace can do more than coverage analysis

In a previous article we saw how we could use Python's trace module to see which lines in our code were actually executed when we run some function. The same module can provide us with a list of functions that are called by other functions. This is as simple as providing some extra parameters, as can be seen in the example code below:
import trace
 
t=trace.Trace(count=1,trace=0,countfuncs=1,countcallers=1)
t.runfunc(f)
r=t.results()
This snippet will run the function f and in the end we will have a CoverageResults object in r. We can print these results with the write_results() method, but a more graphical representation is preferred. Now the underlying data structures are not well documented, but some inspection of the source code shows that we have a member called callers that holds a record of which function called which. It is a dictionary with a complex key and a simple value: the value contains a count of how many times one function is called by another, and the key is a tuple of tuples (caller, callee) where each member has three parts: (module, filename, functionname). Let's see how we can use this information to visualize the call graph.
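
For example, listing the recorded caller/callee pairs is just a matter of iterating over this dictionary; a short sketch, continuing from the snippet above:

for (caller, callee), count in r.callers.items():
 # caller and callee are 3-tuples; the last element is the function name
 print('%s -> %s (%s)' % (caller[2], callee[2], count))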

jsPlumb

The idea is to represent each function as a simple box and draw lines to indicate which function calls which other function. Now we could try to build an application based on some graphics library, but rendering boxes is something web browsers are good at, so why not use HTML to represent our graph and a web browser to render the stuff? All we need is some way to draw attractive lines connecting those boxes. Enter jsPlumb, a versatile Javascript library to do just that, and a lot more as well: once a graph is drawn it lets you interact with the boxes to rearrange them as you see fit.
The example code provided at the end of this article will generate a number of div elements, one for each Python function encountered. It also generates a bit of Javascript to let jsPlumb draw its connecting lines. The HTML produced looks like this:
<html><head>
<script type="text/javascript" src="http://explorercanvas.googlecode.com/svn/trunk/excanvas.js"></script>
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.4/jquery.min.js"></script>
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jqueryui/1.8.2/jquery-ui.min.js"></script>
<script type="text/javascript" src="jquery.jsPlumb-1.2.5-all-min.js "></script>
<link rel="stylesheet" href="callgraph.css" type="text/css" media="all" />
</head><body>
<div id="i" class="box" style="left:200px;top:200px">i</div>
<div id="h" class="box" style="left:200px;top:400px">h</div>
<div id="Trace-runfunc" class="box" style="left:200px;top:100px">Trace-runfunc</div>
<div id="g" class="box" style="left:200px;top:300px">g</div>
<div id="f" class="box" style="left:300px;top:200px">f</div>
<script>
jsPlumb.Defaults.Connector = new jsPlumb.Connectors.Bezier(50);
jsPlumb.Defaults.DragOptions = { cursor: 'pointer', zIndex:2000 };
jsPlumb.Defaults.PaintStyle = { strokeStyle:'gray', lineWidth:2 };
jsPlumb.Defaults.EndpointStyle = { radius:7, fillStyle:'gray' };
jsPlumb.Defaults.Anchors = [ "BottomCenter", "TopCenter" ];

$("#Trace-runfunc").plumb({target:"f"});
$("#g").plumb({target:"i"});
$("#f").plumb({target:"g"});
$("#g").plumb({target:"h"});
</script>
</body></html>
It consists of a number of script tags to include the jQuery library and jsPlumb, together with the explorer canvas so everything will work on Internet Explorer just as well.
Next is a list of div elements with a box class and a final script element that calls the plumb method for each connection. This final bit of Javascript is preceded by a few lines that set some defaults for the lines that will be drawn. These are fully documented on the jsPlumb site.
When this HTML and Javascript is viewed with a web browser, the boxes are rendered with Bezier curves connecting them. Creating a layout for a graph in which nothing overlaps is a difficult job and clearly fails here, but the beauty of jsPlumb is that it provides an interactive graph: you can click and drag any box on the screen to rearrange the graph in any way, for example to provide a clear separation between the boxes.

From CoverageResults to an interactive graph

So our problem boils down to generating suitable HTML and Javascript from the data provided in the CoverageResults object. I'll provide the full code without much explanation as it should be clear enough. It certainly isn't a sterling example of clean coding, but it works:
def f():
 for i in range(10):
  g(i)
  
def g(n):
 return h(n)+i(n)

def h(n):
 return n+10

def i(n):
 return n*8
 
def box(name,pos=(400,300)):
 return ('<div id="%s" class="box" style="left:%dpx;top:%dpx">%s</div>'
   %(name,pos[0],pos[1],name))

def connect(f,t):
 return '$("#%s").plumb({target:"%s"});'%(f,t)

if __name__ == "__main__":
 import trace
 
 t=trace.Trace(count=1,trace=0,countfuncs=1,countcallers=1)
 t.runfunc(f)
 r=t.results()

 print("""<html><head>
<script type="text/javascript" src="http://explorercanvas.googlecode.com/svn/trunk/excanvas.js"></script>
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.4/jquery.min.js"></script>
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jqueryui/1.8.2/jquery-ui.min.js"></script>
<script type="text/javascript" src="jquery.jsPlumb-1.2.5-all-min.js "></script>
<link rel="stylesheet" href="callgraph.css" type="text/css" media="all" />
</head><body>""")

 boxen={} # unique names, values are positions
 for f,t in r.callers:
  fk=f[2].replace('.','-') # identifiers with dots cannot be used as (x)html ids
  tk=t[2].replace('.','-')
  if not fk in boxen:
   boxen[fk]=[200,100]
  if not tk in boxen:
   boxen[tk]=[200,100]
  boxen[tk][1]=boxen[fk][1]+100
 for b in boxen:
  for b2 in boxen:
   if b != b2 and boxen[b] == boxen[b2]:
    boxen[b2][0]+=100
 for b in boxen:
  print(box(b,boxen[b]))

 print("""<script>
jsPlumb.Defaults.Connector = new jsPlumb.Connectors.Bezier(50);
jsPlumb.Defaults.DragOptions = { cursor: 'pointer', zIndex:2000 };
jsPlumb.Defaults.PaintStyle = { strokeStyle:'gray', lineWidth:2 };
jsPlumb.Defaults.EndpointStyle = { radius:7, fillStyle:'gray' };
jsPlumb.Defaults.Anchors = [ "BottomCenter", "TopCenter" ];
""")
 for f,t in r.callers:
  fk=f[2].replace('.','-')
  tk=t[2].replace('.','-')
  print(connect(fk,tk))
 print("</script>")
 print("</body></html>")