linkcheck: a Python module to check for broken links

When maintaining a website, or even when hosting a web application, it makes sense to check once in a while whether there are any broken links. Here I present a simple module to find broken links in a website; it is pure Python and uses no external modules.

Checking for broken links

Checking for broken links can be as simple as shown in the following example:

  from linkcheck import LinkChecker

  lc = LinkChecker("http://www.example.org")
  if lc.check():
      if not lc.follow():
          print("there were problems")
          print("\n".join(lc.failed))
          print("\n".join(lc.other))
      else:
          print("website OK")
  else:
      print("cannot open website or homepage is not html")

The check() method tries to open a URL and checks whether its content is HTML; if so, it returns True. The next step is to use the follow() method to see if we can open any links present in the page, reporting any failures. If a link points to a page containing HTML, this is repeated recursively.
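For a quick check from the command line, the same pattern can be wrapped in a small script. The sketch below is mine, not part of the module; it simply feeds the first command-line argument to LinkChecker:

  import sys
  from linkcheck import LinkChecker

  # usage: python checksite.py http://www.example.org
  lc = LinkChecker(sys.argv[1])
  if not lc.check():
      sys.exit("cannot open website or homepage is not html")
  if lc.follow():
      print("website OK")
  else:
      print("there were problems")
      print("\n".join(lc.failed + lc.other))
      sys.exit(1)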

The linkcheck module

We highlight some implementation details of the linkcheck module here, but the full code can be found on my website. It is licensed under the GPL and comes with a modest test suite.

The code depends on two crucial elements: LinkParser, a class derived from html.parser.HTMLParser, and LinkChecker.

LinkParser acts on just a few HTML start tags, defined in the tagsrefs class variable (line 9). Its initializer is passed a baseurl argument that is used to resolve relative URLs and a callback parameter that is called for each relevant tag that holds a reference to another URL.
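For example, LinkParser can be used on its own. A minimal sketch (the HTML snippet and the found list are made up for illustration):

  from linkcheck import LinkParser

  found = []                       # collects every URL the parser reports
  parser = LinkParser("http://www.example.org/docs/", found.append)
  parser.feed('<a href="intro.html">intro</a><img src="/logo.png">')
  print(found)
  # ['http://www.example.org/docs/intro.html', 'http://www.example.org/logo.png']

Note how both URLs are resolved against baseurl: the relative href is joined to the /docs/ path, while the absolute src path replaces it.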

The LinkChecker class is discussed in detail in an article on my website. For now I highlight just its main methods:

  • __init__(), takes just one mandatory parameter, the URL we start with (line 28)
  • check(), checks whether the URL passed to __init__() could be opened and points to an HTML file (line 38)
  • follow(), finds any links within the HTML and marks any of those links it cannot open as failed in the failed instance variable (line 59)
  • process(), is the callback passed to an instance of the LinkParser class; it recursively creates a new LinkChecker instance to check the link for additional links (line 67)

  1. from urllib.request import Request,urlopen  
  2. from urllib.parse import urlsplit,urljoin,urlunsplit,urldefrag  
  3. from urllib.error import HTTPError,URLError  
  4. from html.parser import HTMLParser  
  5. from re import compile,MULTILINE,IGNORECASE  
  6.   
  7. class LinkParser(HTMLParser):  
  8.   
  9.  tagsrefs = { 'a':'href', 'img':'src', 'script':'src', 'link':'href' }  
  10.    
  11.  def __init__(self,baseurl,callback):  
  12.   self.callback = callback  
  13.   self.baseurl = baseurl  
  14.   super().__init__()  
  15.     
  16.  def handle_starttag(self, tag, attrs):  
  17.   if tag in self.tagsrefs:  
  18.    for name,value in attrs:  
  19.     if name == self.tagsrefs[tag]:  
  20.      newurl=urljoin(self.baseurl,value)  
  21.      self.callback(newurl)  
  22.      break  
  23.   
  24. class LinkChecker:  
  25.   
  26.  html=compile(r'^Content-Type:\s+text/html',MULTILINE|IGNORECASE)  
  27.   
  28.  def __init__(self,url,host=None,seen=None,external=True):  
  29.   self.url    = url  
  30.   self.host   = urlsplit(url).hostname if host is None else host  
  31.   self.failed = []  
  32.   self.other  = []  
  33.   self.notopened = []  
  34.   self.duplicates = 0  
  35.   self.seen = set() if seen is None else seen  
  36.   self.external = external  
  37.   
  38.  def check(self,open=True):  
  39.   self.seen.add(self.url)  
  40.   if not open:  
  41.    self.notopened.append(self.url)  
  42.    return False  
  43.   try:  
  44.    self.req=urlopen(url=self.url,timeout=10)  
  45.   except HTTPError as e:  
  46.    self.failed.append(self.url)  
  47.    return False  
  48.   except URLError as e:  
  49.    self.other.append(self.url+' ('+str(e)+')')  
  50.    return False  
  51.   except Exception as e:  
  52.    print('Exception',e,type(e))  
  53.    self.failed.append(self.url)  
  54.    return False  
  55.   headers=str(self.req.info())  
  56.   m=self.html.search(headers)  
  57.   return m is not None  
  58.     
  59.  def follow(self):  
  60.   parser = LinkParser(self.url,self.process)  
  61.   try:  
  62.    parser.feed(self.req.read().decode())  
  63.   except Exception as e:  
  64.    self.other.append(self.url+' ('+str(e)+')')  
  65.   return len(self.failed)+len(self.other) == 0  
  66.     
  67.  def process(self,newurl):  
  68.   newurl=urldefrag(newurl)[0]  
  69.   if not newurl in self.seen:  
  70.    lc = LinkChecker(newurl,self.host,self.seen,self.external)  
  71.    samesite = urlsplit(newurl).hostname == self.host  
  72.    if lc.check(self.external or samesite) and samesite:  
  73.     lc.follow()  
  74.    self.failed.extend(lc.failed)  
  75.    self.other.extend(lc.other)  
  76.    self.notopened.extend(lc.notopened)  
  77.    self.seen.update(lc.seen)  
  78.    self.duplicates+=lc.duplicates  
  79.   else:  
  80.    self.duplicates+=1
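
The extra parameters of __init__() allow the crawl to be restricted. The following sketch is based on the signature above: with external=False, links that point outside the starting host are recorded in notopened instead of being fetched, and duplicates counts references that were skipped because they had been seen before:

  lc = LinkChecker("http://www.example.org", external=False)
  if lc.check():
      lc.follow()
  print("failed:", lc.failed)
  print("not opened (external):", lc.notopened)
  print("duplicate references:", lc.duplicates)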