Sunday, 26 June 2011

linkcheck, a Python module to check for broken links

When maintaining a website, or even when hosting a web application, it makes sense to check for broken links once in a while. Here I present a simple module that finds broken links in a website. It is pure Python and uses nothing outside the standard library.

Checking for broken links

Checking for broken links can be as simple as shown in the following example:

from linkcheck import LinkChecker

lc = LinkChecker("http://www.example.org")
if lc.check():
    if not lc.follow():
        print("there were problems")
        print("\n".join(lc.failed))
        print("\n".join(lc.other))
    else:
        print("website OK")
else:
    print("cannot open website or homepage is not html")

The check() method tries to open a URL and checks whether its content is HTML. If both succeed it returns True. The next step is to use the follow() method to see whether we can open the links present in the page and to report any failures. If a link points to a page on the same host that contains HTML, this is repeated recursively.
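The "is it HTML" test can be illustrated without touching the network. The following is only a sketch: is_html is a hypothetical helper that is not part of linkcheck, and the Message object stands in for the headers of a real response (for HTTP URLs, the object returned by urlopen(...).info() is a subclass of email.message.Message).

```python
from email.message import Message


def is_html(headers):
    """Return True if the Content-Type header denotes an HTML document."""
    # get_content_type() lowercases the value and drops parameters
    # such as "; charset=UTF-8"
    return headers.get_content_type() == "text/html"


# Stand-in for the headers of a real response (hypothetical values)
headers = Message()
headers["Content-Type"] = "text/html; charset=UTF-8"
print(is_html(headers))
```

Comparing the parsed content type rather than the raw header line keeps the test independent of any charset parameter the server may add.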

The linkcheck module

I highlight some implementation details of the linkcheck module here; the full code can be found on my website. It is licensed under the GPL and comes with a modest test suite.

The code depends on two crucial elements: LinkParser, a class derived from html.parser.HTMLParser, and LinkChecker. LinkParser acts on just a few HTML start tags defined in the tagsrefs class variable (line 9). Its initializer is passed a baseurl argument that is used to resolve relative URLs and a callback parameter that is called for each of the relevant tags that hold a reference to another URL.
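To see what the callback receives, here is a stripped-down sketch of the same idea. MiniLinkParser, its two-entry tag table and the example URLs are made up for illustration; the real class is shown in the listing further down.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MiniLinkParser(HTMLParser):
    # tag -> attribute that holds the reference
    tagrefs = {"a": "href", "img": "src"}

    def __init__(self, baseurl, callback):
        super().__init__()
        self.baseurl = baseurl
        self.callback = callback

    def handle_starttag(self, tag, attrs):
        attr = self.tagrefs.get(tag)  # None for tags we do not care about
        for name, value in attrs:
            if name == attr and value is not None:
                # resolve relative references against the base URL
                self.callback(urljoin(self.baseurl, value))
                break


found = []
parser = MiniLinkParser("http://www.example.org/docs/", found.append)
parser.feed('<a href="intro.html">Intro</a> <img src="/logo.png"> <p>no link</p>')
print(found)
```

Passing found.append as the callback simply collects the absolute URLs in a list. LinkChecker passes its own process() method instead.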

The LinkChecker class is discussed in detail in an article on my website. For now I highlight just its main methods:

  • __init__() takes just one mandatory parameter, the URL we start with (line 28)
  • check() checks whether the URL passed to __init__() can be opened and points to an HTML file (line 38)
  • follow() finds any links within the HTML and records every link it cannot open in the failed or other instance variables (line 59)
  • process() is the callback passed to an instance of the LinkParser class. It creates a new LinkChecker instance for each link that has not been seen before and, for pages on the same host, checks that page for additional links recursively. (line 67)
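The interplay between process() and the shared seen set can be sketched on an in-memory "site", so that no network access is needed. Everything here is hypothetical (the SITE dictionary, the crawl function and the example URLs). It mimics the bookkeeping only and is not the module's actual code.

```python
from urllib.parse import urldefrag

# Hypothetical site: page URL -> links found on that page
SITE = {
    "http://a.example/": ["http://a.example/one", "http://a.example/two#top"],
    "http://a.example/one": ["http://a.example/", "http://a.example/missing"],
    "http://a.example/two": ["http://a.example/one"],
}


def crawl(url, seen, failed):
    """Visit url, then recurse into every link not seen before."""
    seen.add(url)
    if url not in SITE:  # stands in for a failed urlopen()
        failed.append(url)
        return
    for link in SITE[url]:
        link = urldefrag(link)[0]  # drop the #fragment part
        if link not in seen:
            crawl(link, seen, failed)


seen, failed = set(), []
crawl("http://a.example/", seen, failed)
print(failed)
```

Because every call shares the same seen set, pages that link back to each other are visited only once and the recursion terminates.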

from urllib.request import Request,urlopen
from urllib.parse import urlsplit,urljoin,urlunsplit,urldefrag
from urllib.error import HTTPError,URLError
from html.parser import HTMLParser
from re import compile,MULTILINE,IGNORECASE

class LinkParser(HTMLParser):

 tagsrefs = { 'a':'href', 'img':'src', 'script':'src', 'link':'href' }
 
 def __init__(self,baseurl,callback):
  self.callback = callback
  self.baseurl = baseurl
  super().__init__()
  
 def handle_starttag(self, tag, attrs):
  if tag in self.tagsrefs:
   for name,value in attrs:
    if name == self.tagsrefs[tag]:
     newurl=urljoin(self.baseurl,value)
     self.callback(newurl)
     break

class LinkChecker:

 html=compile(r'^Content-Type:\s*text/html\s*(;.*)?$',MULTILINE|IGNORECASE) # also accept "; charset=..."

 def __init__(self,url,host=None,seen=None,external=True):
  self.url    = url
  self.host   = urlsplit(url).hostname if host is None else host
  self.failed = []
  self.other  = []
  self.notopened = []
  self.duplicates = 0
  self.seen = set() if seen is None else seen
  self.external = external

 def check(self,open=True):
  self.seen.add(self.url)
  if not open :
   self.notopened.append(self.url)
   return False
  try:
   self.req=urlopen(url=self.url,timeout=10)
  except HTTPError as e:
   self.failed.append(self.url)
   return False
  except URLError as e:
   self.other.append(self.url+' ('+str(e)+')')
   return False
  except Exception as e:
   print('Exception',e,type(e))
   self.failed.append(self.url)
   return False
  headers=str(self.req.info())
  m=self.html.search(headers)
  return m is not None
  
 def follow(self):
  parser = LinkParser(self.url,self.process)
  try:
   parser.feed(self.req.read().decode())
  except Exception as e:
   self.other.append(self.url+' ('+str(e)+')')
  return len(self.failed)+len(self.other) == 0
  
 def process(self,newurl):
  newurl=urldefrag(newurl)[0]
  if newurl not in self.seen:
   lc = LinkChecker(newurl,self.host,self.seen,self.external)
   samesite = urlsplit(newurl).hostname == self.host
   if lc.check(self.external or samesite) and samesite:
    lc.follow()
   self.failed.extend(lc.failed)
   self.other.extend(lc.other)
   self.notopened.extend(lc.notopened)
   self.seen.update(lc.seen)
   self.duplicates+=lc.duplicates
  else:
   self.duplicates+=1
