When maintaining a website, or even when hosting a web application, it makes sense to check once in a while whether there are any broken links. Here I present a simple module for finding broken links in a website; it is pure Python and uses no external modules.
Checking for broken links
Checking for broken links can be as simple as shown in the following example:
    from linkcheck import LinkChecker

    lc = LinkChecker("http://www.example.org")
    if lc.check():
        if not lc.follow():
            print("there were problems")
            print("\n".join(lc.failed))
            print("\n".join(lc.other))
        else:
            print("website OK")
    else:
        print("cannot open website or homepage is not html")
The check() method tries to open a URL and checks whether its content is HTML. If this went well it returns True. The next step is to use the follow() method to see if we can open any links present in the page and report any failures. If a link points to a page containing HTML this is repeated in a recursive fashion.
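To make the recursive idea concrete, here is a minimal sketch that runs without any network access. It is not the linkcheck module itself: the SITE dictionary and the crawl() function are hypothetical stand-ins, with a URL that is missing from the dictionary playing the role of a broken link.

```python
from urllib.parse import urldefrag

# Hypothetical in-memory "website": each URL maps to the links on that page.
# A URL that is absent from the dictionary stands in for a broken link.
SITE = {
    "http://www.example.org/": [
        "http://www.example.org/a.html",
        "http://www.example.org/missing.html",
    ],
    "http://www.example.org/a.html": [
        "http://www.example.org/",           # link back to the start page
        "http://www.example.org/b.html#top",  # fragment is ignored
    ],
    "http://www.example.org/b.html": [],
}


def crawl(url, seen=None, failed=None):
    """Visit url and, recursively, every page reachable from it."""
    seen = set() if seen is None else seen
    failed = [] if failed is None else failed
    url = urldefrag(url)[0]      # strip any #fragment
    if url in seen:              # never visit the same URL twice
        return failed
    seen.add(url)
    links = SITE.get(url)
    if links is None:            # "cannot open": record as failed
        failed.append(url)
        return failed
    for link in links:
        crawl(link, seen, failed)
    return failed


print(crawl("http://www.example.org/"))
# ['http://www.example.org/missing.html']
```

Sharing one seen set across all recursive calls is what keeps the crawl from looping forever when pages link back to each other.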
The linkcheck module
We highlight some implementation details of the linkcheck module here, but the full code can be found on my website. It is licensed under the GPL and comes with a modest test suite.
The code depends on two crucial elements: LinkParser, a class derived from html.parser.HTMLParser, and LinkChecker. LinkParser looks only at the tags listed in its tagsrefs class variable (line 9), which maps each tag name to the attribute that holds its URL. Its initializer is passed a baseurl argument that is used to resolve relative URLs and a callback parameter that is called for each of the relevant tags that hold a reference to another URL.
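The following stripped-down sketch shows the same parsing idea in isolation. HrefCollector and its attrs_by_tag table are hypothetical names chosen for this illustration; instead of invoking a callback it simply stores each resolved URL in a list, so the result is easy to inspect.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Hypothetical, simplified parser that collects link targets."""

    # tag name -> attribute that holds the URL
    attrs_by_tag = {"a": "href", "img": "src"}

    def __init__(self, baseurl):
        super().__init__()
        self.baseurl = baseurl
        self.found = []

    def handle_starttag(self, tag, attrs):
        wanted = self.attrs_by_tag.get(tag)  # None for tags we ignore
        for name, value in attrs:
            if name == wanted and value is not None:
                # resolve relative references against the page's own URL
                self.found.append(urljoin(self.baseurl, value))
                break


collector = HrefCollector("http://www.example.org/docs/index.html")
collector.feed(
    '<a href="intro.html">Intro</a>'
    '<img src="/logo.png">'
    '<a href="http://other.example.com/">Elsewhere</a>'
    '<p class="note">no link here</p>'
)
print(collector.found)
# ['http://www.example.org/docs/intro.html',
#  'http://www.example.org/logo.png',
#  'http://other.example.com/']
```

Note how urljoin() handles all three cases: a path relative to the current page, a path relative to the site root, and an absolute URL that is left untouched.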
The LinkChecker class is discussed in detail in an article on my website. For now I highlight just its main methods:
- __init__() takes just one mandatory parameter, the URL we start with (line 28).
- check() checks whether the URL passed to __init__() could be opened and points to an HTML file (line 38).
- follow() finds any links within the HTML and marks any of those links it cannot open as failed in the failed instance variable (line 59).
- process() is the callback passed to an instance of the LinkParser class. It will recursively create a new LinkChecker instance to check the link for additional links (line 67).
    from urllib.request import Request, urlopen
    from urllib.parse import urlsplit, urljoin, urlunsplit, urldefrag
    from urllib.error import HTTPError, URLError
    from html.parser import HTMLParser
    from re import compile, MULTILINE, IGNORECASE

    class LinkParser(HTMLParser):

        tagsrefs = {'a': 'href', 'img': 'src', 'script': 'src', 'link': 'href'}

        def __init__(self, baseurl, callback):
            self.callback = callback
            self.baseurl = baseurl
            super().__init__()

        def handle_starttag(self, tag, attrs):
            if tag in self.tagsrefs:
                for name, value in attrs:
                    if name == self.tagsrefs[tag]:
                        newurl = urljoin(self.baseurl, value)
                        self.callback(newurl)
                        break

    class LinkChecker:

        html = compile(r'^Content-Type:\s+text/html$', MULTILINE | IGNORECASE)

        def __init__(self, url, host=None, seen=None, external=True):
            self.url = url
            self.host = urlsplit(url).hostname if host is None else host
            self.failed = []
            self.other = []
            self.notopened = []
            self.duplicates = 0
            self.seen = set() if seen is None else seen
            self.external = external

        def check(self, open=True):
            self.seen.add(self.url)
            if not open:
                self.notopened.append(self.url)
                return False
            try:
                self.req = urlopen(url=self.url, timeout=10)
            except HTTPError as e:
                self.failed.append(self.url)
                return False
            except URLError as e:
                self.other.append(self.url + ' (' + str(e) + ')')
                return False
            except Exception as e:
                print('Exception', e, type(e))
                self.failed.append(self.url)
                return False
            headers = str(self.req.info())
            m = self.html.search(headers)
            return m is not None

        def follow(self):
            parser = LinkParser(self.url, self.process)
            try:
                parser.feed(self.req.read().decode())
            except Exception as e:
                self.other.append(self.url + ' (' + str(e) + ')')
            return len(self.failed) + len(self.other) == 0

        def process(self, newurl):
            newurl = urldefrag(newurl)[0]
            if newurl not in self.seen:
                lc = LinkChecker(newurl, self.host, self.seen, self.external)
                samesite = urlsplit(newurl).hostname == self.host
                if lc.check(self.external or samesite) and samesite:
                    lc.follow()
                self.failed.extend(lc.failed)
                self.other.extend(lc.other)
                self.notopened.extend(lc.notopened)
                self.seen.update(lc.seen)
                self.duplicates += lc.duplicates
            else:
                self.duplicates += 1
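One detail worth being aware of: the html pattern only matches a header line that ends right after text/html, so a response that also declares a character set (for example text/html; charset=utf-8) would not be recognised as HTML. The sketch below illustrates this and shows one possible alternative, get_content_type(), which is available on the message object that urlopen(...).info() returns. The make_headers() helper is hypothetical and builds a plain email.message.Message as a stand-in for real response headers, so the example runs without a network connection.

```python
from email.message import Message
from re import compile, MULTILINE, IGNORECASE

# The same pattern as in LinkChecker
html = compile(r'^Content-Type:\s+text/html$', MULTILINE | IGNORECASE)


def make_headers(content_type):
    """Hypothetical helper: fake response headers with one Content-Type."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg


plain = make_headers("text/html")
with_charset = make_headers("text/html; charset=utf-8")

# The regular expression only accepts the bare form ...
print(html.search(str(plain)) is not None)         # True
print(html.search(str(with_charset)) is not None)  # False

# ... while get_content_type() ignores parameters such as charset.
print(plain.get_content_type() == "text/html")         # True
print(with_charset.get_content_type() == "text/html")  # True
```

If this matters for the sites you check, comparing self.req.info().get_content_type() to 'text/html' inside check() might be a drop-in alternative to the regular expression, though I have not verified that against the module's test suite.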