Sunday, 10 October 2010

Sqlite multithreading woes

Registering a user defined regexp function with Sqlite from CherryPy

Multithreading can be tricky and can actually trip you in unexpected ways. And in some situations multithreading is almost unavoidable, for example when using CherryPy as it usually instantiates a fair number of threads to efficiently serve multiple HTTP requests in parallel. The situation that caught me unawares was the combination with Sqlite.

Sqlite can be used in a multithreaded fashion but you must make sure that each thread has it's own connection object to communicate with the database. This is easily accomplished by registering a function with CherryPy that will be called for each newly started thread. This might look as follows:

import sqlite3
import threading

data=None
db='/tmp/example.db'

def initdb():
    global data,db
    sql='create table if not exists mytable (col_a, col_b);'
    conn=sqlite3.connect(db)
    c = conn.cursor()
    c.execute(sql)
    conn.commit()
    conn.close() 
    data=threading.local()  

def connect(thread_index): 
    global data,db 
    data.conn = sqlite3.connect(db) 
    data.conn.row_factory = sqlite3.Row

if __name__ == "__main__": 
    initdb() 
    cherrypy.engine.subscribe('start_thread', connect)  
    <... code to start the cherrypy engine ...>

There are two functions in the example above. The first one, initdb(), is used to initialize the database, that is, to create any tables necessary if they are not defined yet and to prepare some storage that is unique for each thread. Normally, all global data is shared between threads so we have to take special measures to provide each thread with its own private data. This is accomplished by the call to threading.local(). The resulting object can be used to store data as attributes and this data is private to each thread. initdb() needs to be called only once before starting the CherryPy engine.

The second function, connect(), should be called once for every thread. It creates a database connection and stores a reference to this connection in the conn attribute of the global variable data. Because this was setup to be private data for each thread, we can use it to store a separate connection object.

In the main section of the code we simply call initdb() once and use the cherrypy.engine.subscribe() function to register our connect() function to be executed at the start of a new thread. The code to actually start CherryPy is not shown in this example.

User defined functions

Now how can this simple setup cause any troubles? Well, most database configuration actions in Sqlite are performed on connection objects and when we want them to work in a consistent way we should apply them to each and every connection. In other words, those configuration actions should be part of the connect() function. An example of that is shown in the last line of the connect() function where we assign a sqlite3.Row factory to the row_factory attribute of a connection object. Because we do it here we make sure that we may consistently access columns by name in any record returned from a query.

What I failed to do and what prompted this post was register a user defined function for each connection. Somehow it seemed logical to do it only once when initializing the database, but even if that connection wasn't closed it was impossible to use that function in a query. And user defined functions are not a luxury but a bare necessity if you want to use regular expressions in Sqlite!

Sqlite supports the REGEXP operator in queries so you may use a query like:


select * from mytable where a regexp '^a.*b$';

This will select any record that has a value in its a column that starts with an a and ends with a b. However, although the syntax is supported, it still raises a Sqlite3.OperationalError exception because the regexp function that is called by the regexp operator is not defined. If we want to use regular expressions in Sqlite we have to supply an implementation of the regexp function ourselves. Fortunately this is quite simple, a possible implementation is shown below:

import re

def regex(pattern,string):
    if string is None : string = ''
    return re.search(pattern,str(string))!=None

Note that this isn't a very efficient implementation as we compile a pattern again and again each time the function is called even when it may be called hundreds of times with the same pattern in a single query. It does the job however.

All that is left to do now, is register this function. Not, as I did, as part of the initdb() function, but as part of the connect() function that is called for each thread:

def connect(thread_index):
    global data,db 
    data.conn = sqlite3.connect(db)
    data.conn.row_factory = sqlite3.Row
    data.conn.create_function('regexp',2,regex)

The create_function() method will make our newly defined function available. It takes a name, the number of arguments and a reference to our new function as arguments. Note that despite what the Sqlite documentation states, our regular expression function should be registered with the name regexp (not regex!).

A side note on multiprocessing

If you have a multiprocessor or multicore machine, multithreading will in general not help you to tap into the full processing power of your server. In this article I explore ways to use Python's multiprocessing module in combination with Sqlite.

No comments:

Post a Comment