linkcheck, a Python module to check for broken links

When maintaining a website, or even when hosting a web application, it makes sense to check once in a while whether there are any broken links. Here I present a simple module to find broken links in a website. The module is pure Python and uses no external modules.

Checking for broken links

Checking for broken links can be as simple as shown in the following example:

from linkcheck import LinkChecker

lc = LinkChecker("http://www.example.org")
if lc.check():
    if not lc.follow():
        print("there were problems")
        print("\n".join(lc.failed))
        print("\n".join(lc.other))
    else:
        print("website OK")
else:
    print("cannot open website or homepage is not html")

The check() method tries to open a URL and checks if its content is HTML. If this went well it returns True. The next step is to use the follow() method to see if we can open any links present in the page and report any failures. If a link points to a page containing HTML, this is repeated in a recursive fashion.
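The "is this HTML?" test can be illustrated without network access. The sketch below is not how the linkcheck module does it (the module matches the raw headers with a regular expression); it assumes instead that we have a header object like the one returned by urlopen(...).info(), which is an http.client.HTTPMessage and therefore a subclass of email.message.Message. The helper name is_html is made up for this example.

```python
from email.message import Message

def is_html(headers):
    """Return True if the Content-Type header announces an HTML document."""
    # get_content_type() drops parameters such as "; charset=utf-8"
    # and falls back to "text/plain" when the header is missing.
    return headers.get_content_type() == "text/html"

# Build header objects by hand so the example runs offline.
page = Message()
page["Content-Type"] = "text/html; charset=utf-8"

image = Message()
image["Content-Type"] = "image/png"

print(is_html(page), is_html(image))
```

A check like this also accepts headers that carry a charset parameter, which is what most web servers send.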

The linkcheck module

We highlight some implementation details of the linkcheck module here, but the full code can be found on my website. It is licensed under the GPL and comes with a modest test suite.

The code depends on two crucial elements: LinkParser, a class derived from html.parser.HTMLParser, and LinkChecker. LinkParser acts on just a few HTML start tags defined in the tagsrefs class variable (line 9). Its initializer is passed a baseurl argument that is used to resolve relative URLs and a callback parameter that is called for each of the relevant tags that hold a reference to another URL.

The LinkChecker class is discussed in detail in an article on my website. For now I highlight just its main methods:

  • __init__(), takes just one mandatory parameter, the url we start with (line 28)
  • check(), checks if the url passed to __init__() could be opened and points to an HTML file (line 38)
  • follow(), finds any links within the HTML and records each link it cannot open in the failed instance variable (line 59)
  • process(), is the callback passed to an instance of the LinkParser class. It recursively creates a new LinkChecker instance to check the link for additional links (line 67)

from urllib.request import Request,urlopen
from urllib.parse import urlsplit,urljoin,urlunsplit,urldefrag
from urllib.error import HTTPError,URLError
from html.parser import HTMLParser
from re import compile,MULTILINE,IGNORECASE

class LinkParser(HTMLParser):

 tagsrefs = { 'a':'href', 'img':'src', 'script':'src', 'link':'href' }
 
 def __init__(self,baseurl,callback):
  self.callback = callback
  self.baseurl = baseurl
  super().__init__()
  
 def handle_starttag(self, tag, attrs):
  if tag in self.tagsrefs:
   for name,value in attrs:
    if name == self.tagsrefs[tag]:
     newurl=urljoin(self.baseurl,value)
     self.callback(newurl)
     break

class LinkChecker:

 html=compile(r'^Content-Type:\s*text/html\b',MULTILINE|IGNORECASE)

 def __init__(self,url,host=None,seen=None,external=True):
  self.url    = url
  self.host   = urlsplit(url).hostname if host is None else host
  self.failed = []
  self.other  = []
  self.notopened = []
  self.duplicates = 0
  self.seen = set() if seen is None else seen
  self.external = external

 def check(self,open=True):
  self.seen.add(self.url)
  if not open :
   self.notopened.append(self.url)
   return False
  try:
   self.req=urlopen(url=self.url,timeout=10)
  except HTTPError as e:
   self.failed.append(self.url)
   return False
  except URLError as e:
   self.other.append(self.url+' ('+str(e)+')')
   return False
  except Exception as e:
   print('Exception',e,type(e))
   self.failed.append(self.url)
   return False
  headers=str(self.req.info())
  m=self.html.search(headers)
  return not(m is None)
  
 def follow(self):
  parser = LinkParser(self.url,self.process)
  try:
   parser.feed(self.req.read().decode())
  except Exception as e:
   self.other.append(self.url+' ('+str(e)+')')
  return len(self.failed)+len(self.other) == 0
  
 def process(self,newurl):
  newurl=urldefrag(newurl)[0]
  if not newurl in self.seen:
   lc = LinkChecker(newurl,self.host,self.seen,self.external)
   samesite = urlsplit(newurl).hostname == self.host
   if lc.check(self.external or samesite) and samesite:
    lc.follow()
   self.failed.extend(lc.failed)
   self.other.extend(lc.other)
   self.notopened.extend(lc.notopened)
   self.seen.update(lc.seen)
   self.duplicates+=lc.duplicates
  else:
   self.duplicates+=1
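To see the callback mechanism in isolation, here is a reduced, self-contained sketch. MiniLinkParser and collect are hypothetical names for this illustration only; the parser handles just two tags and the callback merely records what it receives instead of opening anything.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class MiniLinkParser(HTMLParser):
    # Hypothetical reduced version: only <a href> and <img src> are handled.
    refs = {"a": "href", "img": "src"}

    def __init__(self, baseurl, callback):
        super().__init__()
        self.baseurl = baseurl
        self.callback = callback

    def handle_starttag(self, tag, attrs):
        wanted = self.refs.get(tag)  # None for tags we do not care about
        for name, value in attrs:
            if name == wanted and value is not None:
                # Resolve relative references against the page URL.
                self.callback(urljoin(self.baseurl, value))
                break

found = []

def collect(url):
    # Strip the fragment so page.html#a and page.html#b count as one URL.
    found.append(urldefrag(url)[0])

parser = MiniLinkParser("http://www.example.org/docs/index.html", collect)
parser.feed('<a href="intro.html#top">intro</a> <img src="/img/logo.png">')
print(found)
```

The relative reference is resolved against the directory of the base URL, while the reference starting with a slash is resolved against the host.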

A jQuery plugin to add icons to elements

In this article I present a simple jQuery plugin that adds suitable icons to a selection of elements based on the value of the rel attribute.

CSS3 :after selector not suitable?

If the target browser understands CSS3 we could add an icon with the help of the :after selector. This has a couple of disadvantages though:

  • Only the newest browsers support this,
  • Although it is possible to insert a url that points to an image, this element is magic, i.e. not visible in the DOM, so we cannot style it with CSS,
  • We can select elements based on the contents of an attribute but we cannot construct an element based on the result of this selection.
Fortunately a flexible yet simple solution can be crafted in a few lines of Javascript with the help of the jQuery library.

The iconify plugin

The code presented below assumes that jQuery is already loaded. It is tested with jQuery 1.6.1 but should work with earlier versions just as well. Using the plugin is as simple as $("a").iconify();. This will add an icon (an <img> tag) to every link that has a non-empty rel attribute. You need to provide a directory with suitable icons. For example, if you have links with rel="python" and no entry for python in the icon map, an icon will be added with a src attribute of src="Icons/python-icon.png".

The plugin can be broken down into a few simple steps. The first thing it does (in line 2) is to make sure the selection is a jQuery object. The next two lines make sure that if any of the parameters is missing, the corresponding default is used.

In line 6 we then iterate over all elements of the selection and retrieve the rel attribute (line 8). If this attribute is present and not empty, we check if it can be found in the iconmap object (line 13); if not, we construct a generic name. Next we construct an <img> element with a suitable src attribute and a meaningful alt attribute and append this element to the current element in the iteration. Finally (line 24) we return the original selection to allow this plugin to be chained just like most jQuery functions.

$.fn.iconify = function (icondir,iconmap) {
 var $this = $(this);
 icondir = icondir || $.fn.iconify.icondir;
 iconmap = iconmap || $.fn.iconify.iconmap;
 
 $this.each(function(i,e){
  e=$(e);
  var rel=e.attr('rel');
  var src="";
  var alt="";
  
  if(rel !== undefined && rel != ''){
   if(iconmap[rel]!==undefined){
    src='src="'+icondir+'/'+iconmap[rel]+'"';
   }else{
    src='src="'+icondir+'/'+rel+'-icon.png'+'"';
   }
   alt='alt="'+rel+' icon"';
   var img=$('<img '+src+' '+alt+'>');
   e.append(img);
  }
 });
 
 return $this;
};
$.fn.iconify.icondir='Icons';
$.fn.iconify.iconmap={
'python' :'python-logo-24.png',
'blender' :'blender-logo-24.png',
'photo'  :'camera-logo-24.png',
'wikipedia' :'wikipedia-logo-24.png'
};

The HTML rel attribute

Using the rel attribute for this purpose might be considered misuse (see for example this page) so you might consider rewriting the plugin to check another attribute, for instance class.

A SQLite thread safe password store, revisited

In a previous article I showed how to implement a thread safe persistent password store that was based on SQLite. In this article I present a reimplementation of that module based on the persistentdict module.

A SQLite thread safe password store

As you can see in the code presented below, we can put our PersistentDict class developed earlier to good use. Because we use two instances of PersistentDict (lines 45, 46) to store the salt and the hashed passwords instead of interacting with a SQLite database ourselves, the code is much cleaner and therefore easier to maintain.

'''
 dbpassword.py Copyright 2011, Michel J. Anders

 $Revision: 70 $ $Date: 2011-06-10 16:34:28 +0200 (vr, 10 jun 2011) $
 
 This program is free software: you can redistribute it
 and/or modify it under the terms of the GNU General Public
 License as published by the Free Software Foundation,
 either version 3 of the License, or (at your option) any
 later version.

 This program is distributed in the hope that it will be 
 useful, but WITHOUT ANY WARRANTY; without even the implied
 warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
 PURPOSE. See the GNU General Public License for more
 details.

 You should have received a copy of the GNU General Public
 License along with this program.  If not, see 
 www.gnu.org/licenses.
'''

import hashlib
from random import SystemRandom as sr
from persistentdict import PersistentDict

class dbpassword:

 @staticmethod
 def hashpassword(name,salt,plaintextpassword,n=10):
  if n<1 : raise ValueError("n < 1")
  d = hashlib.new(name,(salt+plaintextpassword).encode()).digest()
  while n:
   n -= 1
   d = hashlib.new(name,d).digest()
  return hashlib.new(name,d).hexdigest()

 @staticmethod
 def getsalt(randombits=64):
  if randombits<16 : raise ValueError("randombits < 16")
  return "%016x"%sr().getrandbits(randombits)

 def __init__(self,db='password.db',
    secure_hash='sha256',iterations=1000,saltbits=64):
  self.saltdict = PersistentDict(db=db,table='salt')
  self.pwdict = PersistentDict(db=db,table='password')
  self.secure_hash = secure_hash
  self.iterations = iterations
  self.saltbits = saltbits
  
 def update(self,user,plaintextpassword):
  salt=dbpassword.getsalt(self.saltbits)
  self.saltdict[user]=salt
  self.pwdict[user]=dbpassword.hashpassword(
     self.secure_hash,salt,plaintextpassword,
     self.iterations)

 def check(self,user,plaintextpassword):
  salt=self.saltdict[user]
  return self.pwdict[user]==dbpassword.hashpassword(
   self.secure_hash,salt,plaintextpassword,
   self.iterations)
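The salt, hash and compare cycle can be tried out without a database. The following is a minimal sketch and not the module above: plain dictionaries stand in for the two PersistentDict instances, the salt comes from the standard secrets module, and the comparison uses hmac.compare_digest. It also assumes that a check for an unknown user should simply return False. All function names are made up for this example.

```python
import hashlib
import hmac
import secrets

def hash_password(salt, password, iterations=1000, name="sha256"):
    """Hash salt+password, then re-hash the digest `iterations` times."""
    digest = hashlib.new(name, (salt + password).encode()).digest()
    for _ in range(iterations):
        digest = hashlib.new(name, digest).digest()
    return digest.hex()

# Assumption: in-memory stand-ins for the salt and password tables.
salts, hashes = {}, {}

def update(user, password):
    salts[user] = secrets.token_hex(8)  # 8 random bytes = 64 bits of salt
    hashes[user] = hash_password(salts[user], password)

def check(user, password):
    if user not in salts:
        return False
    candidate = hash_password(salts[user], password)
    # Constant-time comparison of the two hex strings.
    return hmac.compare_digest(hashes[user], candidate)

update("alice", "s3cret")
print(check("alice", "s3cret"), check("alice", "wrong"), check("bob", "s3cret"))
```

Because every user gets a fresh salt on each update, two users with the same password end up with different stored hashes.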

A Python module providing thread safe SQLite backed persistent storage

In a previous article I wrote about some research I did on Python modules that could provide me with a persistent storage solution for dictionaries. I didn't quite find what I needed especially as none of the modules provided a thread safe solution. In the end I decided to write my own.

Thread safe, SQLite backed persistent storage

The module I wrote, persistentdict, is freely available on my website along with some notes and examples. It also has a fairly extensive test suite. It provides a single class PersistentDict that behaves almost exactly like a native Python dict but stores its keys and values in a SQLite table instead of keeping them in memory.
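To give an idea of what a dict backed by a SQLite table can look like, here is a deliberately small sketch. It is not the persistentdict implementation: the class name SqliteDict is hypothetical, it stores only text keys and values, it uses a single connection and it is not thread safe.

```python
import sqlite3
from collections.abc import MutableMapping

class SqliteDict(MutableMapping):
    """Hypothetical single-threaded sketch of a dict stored in a SQLite table."""

    def __init__(self, db=":memory:", table="items"):
        self.table = table  # assumption: the table name is trusted, it is not escaped
        self.conn = sqlite3.connect(db)
        self.conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} (key TEXT PRIMARY KEY, value TEXT)")

    def __getitem__(self, key):
        row = self.conn.execute(
            f"SELECT value FROM {self.table} WHERE key = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return row[0]

    def __setitem__(self, key, value):
        with self.conn:  # commits on success, rolls back on an exception
            self.conn.execute(
                f"INSERT OR REPLACE INTO {self.table} (key, value) VALUES (?, ?)",
                (key, value))

    def __delitem__(self, key):
        with self.conn:
            cur = self.conn.execute(
                f"DELETE FROM {self.table} WHERE key = ?", (key,))
        if cur.rowcount == 0:
            raise KeyError(key)

    def __iter__(self):
        for (key,) in self.conn.execute(f"SELECT key FROM {self.table}"):
            yield key

    def __len__(self):
        return self.conn.execute(
            f"SELECT COUNT(*) FROM {self.table}").fetchone()[0]

d = SqliteDict()
d["spam"] = "eggs"
d["ham"] = "toast"
print(len(d), d["spam"], "ham" in d)
```

Deriving from MutableMapping means that methods such as get(), update(), items() and the in operator come for free once the five methods above are defined.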

SQLite as persistent backing to a Python dict

Doing some research can save you a lot of work: while looking around for a way to use SQLite as a persistent backing store for a Python dictionary I found at least two decent implementations. This blog post contains my research notes.

Requirements

  • pure Python, to ensure cross platform portability
  • no additional external dependencies, to facilitate easy packaging
  • portable data back-end format
  • thread safe
  • well written
  • well documented
  • open source

The requirement to have a portable back-end format is why I strongly prefer SQLite based implementations: SQLite interfaces are available in a number of other programming languages as well, notably C.

Thread safety is a strong requirement because I want to use this solution in CherryPy. It is not enough to restrict access to an object with some form of locking because if some database activity takes place, for example database transactions, this activity itself must be multithreading proof. Some database engines in Python are thread safe and SQLite can be made to work in a multithreading environment as long as each thread has its own connection. Check this post to see how this may be accomplished.

Python's shelve module

Python's shelve module is Python specific and not thread safe.

Seb Sauvage's dbdict

Seb Sauvage's dbdict is an interesting starting point, although it is not thread safe.

Erez' FileDict

Erez' FileDict is a more complete implementation but not thread safe either.

Tokyo Cabinet

Tokyo Cabinet feels a bit too complex for my taste and is another package that is not thread safe.

Preliminary conclusion

Finding a thread safe solution that provides a persistent database backing for Python dictionaries is not as easy as I hoped. Finding one that meets my exact requirements may take more time than writing something from scratch, which is a bit of a disappointment.

Packt Python Book Idea Generator

Packt Publishing invites current and future readers to send in suggestions for subjects to cover in books on Python.

What Python subject do you want authors to write about?

I think this is a pretty neat idea. As a writer myself I choose subjects that interest me most and if a publisher thinks there is a market for it, the game is on. But I have many Python related interests and that probably goes for other authors as well, so if people had a way to show what they would like to see covered in a Python book, the publisher and authors could pick up a subject that would certainly please future readers. That's a classic win-win situation. The people at Packt now offer such a platform, so if you have a (Python related) subject you would love to see a book about, check out this poll.