Start Small: July 2011

Sunday 31 July 2011

Function annotations in Python, checking parameters in a web application server

Parameter annotations in function definitions are a recent addition to the language. In this article we show how this feature can be put to good use when we build a simple web application server that checks its input parameters rigoreously.

Creating a simple web application server with HTTPServer

It is of course entirely possible to create a web application in a short time when you use an existing Python web application framework. In previous articles and my book on web applications I've used CherryPy extensively and although I recommend it for its flexibility an ease of use, it isn't all that difficult to create a web application framework from scratch.

Python's https.server module provides us with the basic building blocks: the HTTPServer class to handle incoming connections and a BaseHTTPRequestHandler class that processes requests and returns an answer. The main part of developing an application server is therefore sub-classing the BaseHTTPRequestHandler class. The minimum it will have to provide is a do_GET() method that will return results based on any parameters it receives.

Using Python parameter annotations

CherryPy uses classes with methods that serve together as an application: requested URLs are mapped to these methods and any parameters are passed along. CherryPy uses an expose decorator to identify the methods that may be called. Non exposed methods are invisible, i.e. URLs that match those methods do not result in the invocation of that method. This behavior is what we like to mimic in our own web application server.

Another important concept in web applications is the screening of input: we would like to check that incoming data (i.e. the arguments that come along with a query) are within the range of things we deem acceptable. For example, a function that adds its arguments should reject anything that cannot be interpreted as a float. We could write code easily enough that checks any function parameters explicitly but wouldn't it be nice if there was a more syntactically pleasing way of writing this?

Enter Python's function parameter annotations. Python allows us to augment each function parameter with an expression that is evaluated when the function is defined and which is stores in the functions __annotation__ field as a dictionary indexed by parameter name. Such an annotation might be as simple a single string but it can be anything, even a function reference. This function could be called with the value we would like to pass as an parameter to check if this value is ok. This could be done before the function is actually called, for example by the do_GET() method of our request handler.

Assuming we have our applicationserver module available, let's have a look what the definition of a new web application might look like:

from http.server import HTTPServer
from applicationserver import ApplicationRequestHandler,IsExposed

class Application:

 def donothing:
  pass
  
 def index(self) -> IsExposed:
  return 'index oink'
 
 def add(self,a:float,b:float) -> IsExposed:
  return str(a+b)
 
 def cat(self,a:str) -> IsExposed:
  return ' '.join(a)
 
 def opt(self,a:int=42) -> IsExposed:
  return str(a)
  
class MyAppHandler(ApplicationRequestHandler):
 application=Application()
 
appserver = HTTPServer(('',8088),MyAppHandler)
appserver.serve_forever()

The overall idea is to subclass the ApplicationRequestHandler and assign an instance of the Application class to its application field (line 21). This applicationhandler is then passed to a HTTPServer instance that will forward incoming requests to this handler (line 24).

Our ApplicationRequestHandler will try to map URLs of the form http://hostname:8080/foo?a=1&b=2 to member functions of the Application instance. It will only consider member function with a return annotation equal to an IsExposed object. So even though we have defined a donothing() function, it will not be executed when a URL like http://hostname:8080/donothing is received.

We also use annotations to restrict input values for parameters to functions that are exposed. Remember that annotations can be any expression and here we employ that fact to annotate the a and b parameters to the add() method with a reference to the built-in float() function. Our ApplicationRequestHandler will pass an argument to any callable it finds as its corresponding annotation and will only execute the method if this callable returns a value (and not raise an exception). So a URL like http://localhost:8080/add?a=1.23&b=4.56 will return a meaningful result while http://localhost:8080/add?a=1.23&b=spam will fail with an error. Off course we are not restricted to built-in functions here: we can refer to functions that may perform elaborate checking as well, perhaps checking against regular expressions or performing lookups in database tables.

All this shows that Python's function annotations allow for a rather elegant way to describe the expected behavior of methods that perform some sort of action in a web application. In a future article I'll show how to implement the applicationserver module.

Sunday 24 July 2011

More SQLite multithreading woes

In this article we revisit our thread safe persistent dictionary and encounter some irritating performance issues, both SQLite related and caused by Python itself.

SQLite.OperationalError, database is locked

When testing the persistentdict module with many simultaneous threads (more than twenty) I noticed a great number of errors: The many connections open to the same database caused SQLite to raise a lot of SQLite.OperationalError, database is locked exceptions. Getting decent performance with SQLite is by no means easy and because SQLite only locks complete database files and not just tables or rows there is a big chance that threads accessing the same database have to wait to get their turn.

sqlite3.connect, the check_same_thread parameter

Python 3.2 comes with a sqlite3 module that implements (but scarcely documents) a check_same_thread parameter that can be set to false to allow threads to use the same Connection object simultaneously. This is nice since this means we no longer have to implement all sorts of code to provide each thread with its own connection.

But we still have to regulate access to this connection because otherwise a commit in one thread may invalidate a longer running execute in another thread, leaving us with errors like sqlite3.InterfaceError: Cursor needed to be reset because of commit/rollback and can no longer be fetched from.

Python thread switching is really slow

Switching between threads, especially on multi core machines, has never been Python's strongest feature and making our persistent dict thread safe with a lock might hurt performance a lot, depending on what is going on in the threads itself (if there is a lot of I/O going on the impact might not be that big).

A new implementation

The code below shows the new implementation, with a single connection that may be shared by multiple threads. It works but it is really slow: with 40 threads I get just 10 to 20 dictionary assignments (d[1]=2) per second on my dual core Atom. That isn't the fastest machine around but those figures are ridiculously low. We will have to rethink our approach if we want to use SQLite in a multithreaded environment if we need any kind of performance!

"""
 persistentdict module $Revision: 98 $ $Date: 2011-07-23 14:01:04 +0200 (za, 23 jul 2011) $

 (c) 2011 Michel J. Anders

 This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see .

"""

from collections import UserDict
import sqlite3 as sqlite
from pickle import dumps,loads
import threading 

class PersistentDict(UserDict):

 """
 PersistentDict  a MutableMapping that provides a thread safe,
     SQLite backed persistent storage.
 
 db  name of the SQLite database file, e.g.
   '/tmp/persists.db', default to 'persistentdict.db'
 table name of the table that holds the persistent data,
   usefull if more than one persistent dictionary is
   needed. defaults to 'dict'
 
 PersistentDict tries to mimic the behaviour of the built-in
 dict as closely as possible. This means that keys should be hashable.
 
 Usage example:
 
 >>> from persistentdict import PersistentDict
 >>> a=PersistentDict()
 >>> a['number four'] = 4
 
 ... shutdown and then restart applicaion ...
 
 >>> from persistentdict import PersistentDict
 >>> a=PersistentDict()
 >>> print(a['number four'])
 4
 
 Tested with Python 3.2 but should work with 3.x and 2.7.x as well.
 
 run module directly to run test suite:
 
 > python PersistentDict.py
 
 """
 
 def __init__(self, dict=None, **kwargs):
  
  self.db    = kwargs.pop('db','persistentdict.db')
  self.table = kwargs.pop('table','dict')
  #self.local = threading.local()
  self.conn = None
  self.lock = threading.Lock()
  
  with self.lock:
   with self.connect() as conn:
    conn.execute('create table if not exists %s (hash unique not null,key,value);'%self.table)
    
  if dict is not None:
   self.update(dict)
  if len(kwargs):
   self.update(kwargs)
 
 def connect(self):
  if self.conn is None:
   self.conn = sqlite.connect(self.db,check_same_thread=False)
  return self.conn
   
 def __len__(self):
  with self.lock:
   cursor = self.connect().cursor()
   cursor.execute('select count(*) from %s'%self.table)
   return cursor.fetchone()[0]
 
 def __getitem__(self, key):
  with self.lock:
   cursor = self.connect().cursor()
   h=hash(key)
   cursor.execute('select value from %s where hash = ?'%self.table,(h,))
   try:
    return loads(cursor.fetchone()[0])
   except TypeError:
    if hasattr(self.__class__, "__missing__"):
     return self.__class__.__missing__(self, key)
   raise KeyError(key)
   
 def __setitem__(self, key, item):
  h=hash(key)
  with self.lock:
   with self.connect() as conn:
    conn.execute('insert or replace into %s values(?,?,?)'%self.table,(h,dumps(key),dumps(item)))

 def __delitem__(self, key):
  h=hash(key)
  with self.lock:
   with self.connect() as conn:
    conn.execute('delete from %s where hash = ?'%self.table,(h,))

 def __iter__(self):
  with self.lock:
   cursor = self.connect().cursor()
   cursor.execute('select key from %s'%self.table)
   rows = list(cursor.fetchall())
  for row in rows:
   yield loads(row[0])

 def __contains__(self, key):
  h=hash(key)
  with self.lock:
   cursor = self.connect().cursor()
   cursor.execute('select value from %s where hash = ?'%self.table,(h,))
   return not ( None is cursor.fetchone())

 # not implemented def __repr__(self): return repr(self.data)
 
 def copy(self):
  c = self.__class__(db=self.db)
  for key,item in self.items():
   c[key]=item
  return c

Sunday 17 July 2011

A SQLite multiprocessing proxy, part 3

In a previous article I presented a first implementation of a SQLite proxy that makes it possible to distribute the workload of multiple processes with the use of Python's multiprocessing module. In this third part of the series we try to analyze the performance of this setup.

High workload example

In our sample implementation we can vary the workload inside the processes that interact with the SQLite database by varying the size of the table that we query. A table with many rows takes more time to scan for a certain random value than a table with just a few rows.

The first graph we present here is about high workload: the table that we query is initialized with one million records. The table shows the time to complete 100 queries. The test was done on a machine with 6 processor cores and in the graph we show the results for 2 (deep purple, back) and 6 (light purple, front) worker processes and a varying number of threads.

The results are more or less what we expect: more worker processes means that the time to complete all tasks is reduced. However the number of threads is also significant. If the number of threads is less than the number of available worker process we do not reach the full potential. Basically we need at least as many threads a there are worker processes to keep those processes busy. If we have more threads than worker processes there is no more gain, in fact we see a minute increase in the time needed to complete all tasks. This might be due to the overhead of creating and managing threads in Python.

Low workload example

If we initialize our table with just a single row the workload will be negligible. If we draw a similar graph as for the high workload we see a completely different picture.

Now we see hardly any difference between 2 work processes or 6 and increasing the number of threads also has no effect. Also the data is rather noisy, i.e. varies quite a bit in a non-uniform manner, especially for the case with 2 worker processes. The reason for this behavior is not entirely clear to me, although it is obvious that because of the very small workload the time to setup communication with the worker process is a significant factor here.

Wednesday 13 July 2011

Nice discounts on open source titles at Packt

Packt has an offer on open source books, both in print and as e-book, that might interest you. Check out their July offering, there certainly are some interesting titles available, including a few on Python and web development.

Sunday 10 July 2011

A SQLite multiprocessing proxy, part 2

In a previous article we decided to use Python's multiprocessing module to leverage the power of multi-core machines. Our use case is all about web applications served by CherryPy and so multi-processing isn't the only interesting part: our application will be multi-threaded as well. IN this article we present a first implementation of a multi-threaded application that hands off the heavy lifting to a pool of subprocesses.

The design

The design is centered on the following concepts:

The main process consists of multiple threads,
The work is done by a pool of subprocesses,
Transferring data to and from the subprocesses is left to the pool manager

Schematically we can visualize it as follows:

Sample code

We start of by including the necessary components:

from multiprocessing import Pool,current_process
from threading import current_thread,Thread
from queue import Queue
import sqlite3 as dbapi
from time import time,sleep
from random import random

The most important ones we need are the Pool class from the multiprocessing module and the Thread class from the threading module. We also import queue.Queue to act as a task list for the threads. Note that the multiprocessing module has its own Queue implementation that is not only thread safe but can be used for inter process communication as well but we won't be using that one here but rely on a simpler paradigm as we will see.

The next step is to define a function that may be called by the threads.

def execute(sql,params=tuple()):
 global pool
 return pool.apply(task,(sql,params))

It takes a string argument with SQL code and an optional tuple of parameters just like the Cursor.execute() method in the sqlite3 module. It merely passes on these arguments to the apply() method of the multiprocessing.Pool instance that is referred to by the global pool variable. Together with SQL string and parameters a reference to the task() function is passed, which is defined below:

def task(sql,params):
 global connection
 c=connection.cursor()
 c.execute(sql,params)
 l=c.fetchall()
 return l

This function just executes the SQL and returns the results. It assumes the global variable connection contains a valid sqlite3.Connection instance, something that is taken care of by the connect function that will be passed as an initializer to any new subprocess:

def connect(*args):
 global connection
 connection = dbapi.connect(*args)

Before we initialize our pool of subprocess let's have a look at the core function of any thread we start in our main process:

def threadwork(initializer=None,kwargs={}):
 global tasks
 if not ( initializer is None) :
  initializer(**kwargs)
 while(True):
  (sql,params) = tasks.get()
  if sql=='quit': break
  r=execute(sql,params)

It calls an optional thread initializer first and then enters a semi infinite loop in line 5. This loops starts by fetching an item from the global tasks queue. Each item is a tuple consisting of a string and another tuple with parameters. If the string is equal to quit we do terminate the loop otherwise we simple pass on the SQL statement and any parameters to the execute function we encountered earlier, which will take care of passing it to the pool of subprocesses. We store the result of this query in the r variable even though we do nothing with it in this example.

For this simple example we also need an database that holds a table with some data we can play with. We initialize this table with rows containing random numbers. When we benchmark the code we can make this as large as we wish to get meaningful results; after all, our queries should take some time to complete otherwise there would be no need to use more processes.

def initdb(db,rows=10000):
 c=dbapi.connect(db)
 cr=c.cursor()
 cr.execute('drop table if exists data');
 cr.execute('create table data (a,b)')
 for i in range(rows):
  cr.execute('insert into data values(?,?)',(i,random()))
 c.commit()
 c.close()

The final pieces of code tie everything together:

if __name__ == '__main__':
 global pool
 global tasks
 
 tasks=Queue()
 db='/tmp/test.db'
 
 initdb(db,100000)
 
 nthreads=10
 
 for i in range(100):
  tasks.put(('SELECT count(*) FROM data WHERE b>?',(random(),)))
 for i in range(nthreads):
  tasks.put(('quit',tuple()))
 
 pool=Pool(2,connect,(db,))
 
 threads=[]
 for t in range(nthreads):
  th=Thread(target=threadwork,kwargs={'initializer':thread_initializer})
  threads.append(th)
  th.start()
 for th in threads:
  th.join()

After creating a queue in line 5 and initializing the database in line 8, the next step is to fill a queue with a fair number of tasks (line 12). The final tasks we add to the queue signal a thread to stop (line 14). We need as many of them as there will be threads.

In line 17 we initialize our pool of processes. Just two in this example, but in general the number should be equal to the number of cpu's in the system. If you omit this argument the number will default to exactly that. Next we create (line 21) and start (line 23) the number of threads we want. The target argument points to the function we defined earlier that does all the work, i.e. pops tasks from the queue and passes these on to the pool of processes. The final lines simply wait till all threads are finished.

What's next?

In a following article we will benchmark and analyze this code and see how we can improve on this design.

Sunday 3 July 2011

A SQLite multiprocessing proxy

This is the first article in a series on improving the performance of Python web applications by leveraging the possibilities of the multiprocessing module. We'll focus on CherryPy and SQLite but the conclusions should be general enough for any Python based platform

Use case

Due to well known restrictions in the most common Python implementation, multithreading solutions will probably not help to solve performance issues (with the possible exception of serving slow network connections). The multiprocessing module offers an API similar to the threading module and might be an alternative when we want to divide the workload on a multicore machine.

The use case we're interested in is a CherryPy server that serves many requests, backed by a SQLite database. CherryPy is multithreaded by design and this approach is sensible as a web server may spend more time waiting for data to be transmitted over relatively slow network connections than actually doing work.

CherryPy however is also an excellent framework to host web applications and many web applications rely on some sort of database back-end. SQLite is a good choice for such a back-end as it comes bundled with Python (reducing the number of external dependencies), is easy to use and performs well enough. With some tricks it will even play nice in a multithreaded environment.

A disadvantage of using SQLite is that we do not have a separate database server: the SQLite engine is part of the same process that runs the Python interpreter. This means that it has the same handicap as any multithreaded application on CPython (the most common implementation of Python) and will not benefit from any extra cores or processors available on the server.

Now we could switch to MySQL or any other stand-alone database back-end but this would add quite an amount to the maintenance burden of our web application. Wouldn't it be nice if we could devise a way to use SQLite together with the multiprocessing module to have the best of both worlds: the ease of use of SQLite and the performance benefits of a stand-alone database server?

In this series of articles I will explore the possibilities and hopefully will come up with a solution that will provide:

a dbapi proxy (we'll use sqlite3 module but it should be general enough for any dbapi compliant database)
that will use the multiprocessing module to increase performance and
can be used from a multithreaded environment.

It would be nice if the API closely resembles the dbapi (but that is not an absolute requirement).

In the next article in this series I will explore the options to make threads and processes play nice, focusing on inter process communication.