automagic Python urllib basic HTTP authentication

So, I have a Python script here at work that needs to use urllib to grab some pages from a site that uses HTTP basic access authentication. I had to work through some issues on my own after reading the code, and after not finding many references on Google I decided to document it here in case someone else wants it. There were two basic problems I had to figure out.

  • By default, if you use urllib.urlopen() to request a page that is protected by HTTP basic auth in IDLE or some other interactive prompt, you are prompted to enter your user name and password using the prompt_user_passwd() function defined in the urllib.FancyURLopener class. If you want to automate your login to the web server, you have to override this method to return the user name and password.
  • The other bit, which was harder (for me) to figure out, is how to handle timeouts correctly. Since Python expects the authentication to happen manually, it doesn’t do anything to keep an automatically provided user name and password pair from being retried indefinitely if it is incorrect. So we must override the http_error_401() method as well, so that it gives up (times out) after a limited number of tries. Fortunately, we can use the urllib.FancyURLopener attribute maxtries, which is set on instantiation (to 10 by default), as the limit on the number of authentication tries in the case of an incorrect password. This attribute is originally used by the http_error_302() method to prevent infinite looping due to redirect recursion. Because both methods use the same counter, we could end up with slightly fewer tries to authenticate if we go through a few redirects before getting the 401 error requiring us to authenticate, but since we only need one successful try at authenticating it shouldn’t be a big deal.
So basically, we’ll create our own class, inheriting from urllib.FancyURLopener, and override those two methods. The code:

    import re
    import urllib

    class basicAuth(urllib.FancyURLopener):
        def prompt_user_passwd(self, host, realm):
            # Supply the credentials directly instead of prompting on the terminal.
            return "our_username", "our_password"

        def http_error_401(self, url, fp, errcode, errmsg, headers, data=None):
            """Error 401 -- authentication required. This function supports Basic authentication only."""
            # Count this attempt and give up once maxtries is reached.
            self.tries += 1
            if self.maxtries and self.tries >= self.maxtries:
                self.tries = 0
                return self.http_error_default(url, fp, 500, "HTTP Basic Auth timed out after " + str(self.maxtries) + " attempts.", headers)
            # URLopener.http_error_default() raises IOError, so no return is needed below.
            if 'www-authenticate' not in headers:
                urllib.URLopener.http_error_default(self, url, fp, errcode, errmsg, headers)
            stuff = headers['www-authenticate']
            match = re.match('[ \t]*([^ \t]+)[ \t]+realm="([^"]*)"', stuff)
            if not match:
                urllib.URLopener.http_error_default(self, url, fp, errcode, errmsg, headers)
            scheme, realm = match.groups()
            if scheme.lower() != 'basic':
                urllib.URLopener.http_error_default(self, url, fp, errcode, errmsg, headers)
            name = 'retry_' + self.type + '_basic_auth'
            if data is None:
                result = getattr(self, name)(url, realm)
            else:
                result = getattr(self, name)(url, realm, data)
            # Reset the counter once the retry has returned, as http_error_302() does.
            self.tries = 0
            return result
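
To see what the regular expression in http_error_401() pulls out of the WWW-Authenticate header, here is a small sketch that runs it on its own, with no server involved. The helper name parse_challenge and the sample header values are made up for illustration; only the pattern itself comes from the code above.

```python
import re

# Same pattern as in http_error_401(): an auth scheme, whitespace, then realm="...".
CHALLENGE = re.compile('[ \t]*([^ \t]+)[ \t]+realm="([^"]*)"')

def parse_challenge(header_value):
    """Hypothetical helper: return (scheme, realm), or None if the header doesn't match."""
    match = CHALLENGE.match(header_value)
    if not match:
        return None
    scheme, realm = match.groups()
    # Lower-case the scheme, since the handler compares it case-insensitively.
    return scheme.lower(), realm

# Made-up header values, just to exercise the pattern.
print(parse_challenge('Basic realm="Staff Only"'))
print(parse_challenge('Digest realm="Staff Only", nonce="abc123"'))
print(parse_challenge('Basic'))
```

The first header gives the scheme basic and the realm, which is the case where the handler goes on to retry with our credentials. The Digest header parses as well but would fail the scheme check, and a header with no realm does not match at all, so in both of those cases the handler falls back to the default error.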