PyZine
 


Issue 7 - Revision 6  /   February 27, 2005 


 


Writing Web Clients in Python

- - - - - - - - - - - -

By Michael Foord | November 20, 2004


If you want to write a program to monitor your stock portfolio, gather news headlines, or just keep tabs on your favorite blogger, Python makes it easy. With just a little bit of Python code, you can fetch any web page.

However, as soon as you encounter any difficulties, you’ll need an understanding of the Hypertext Transfer Protocol (HTTP), the transfer protocol of the World Wide Web. [1]

This article gives you an overview of HTTP as used from Python. (If you want to understand all of the details about HTTP, you can read the related RFCs [2].)

 
 
Sidebar - Resources
Traffic Analysis http://www.ethereal.org
Sometimes the only way of working out what is happening is to watch the actual traffic between your script and the Internet. Also, if you can't see why something doesn't work, you can watch what your browser sends in a certain situation and replicate that. Ethereal is a program that displays your Internet traffic. It uses the pcap (packet capture) library, and there is at least one Python binding to this library if you fancy building your own version of Ethereal.
TCPWatch http://hathawaymix.org/Software/TCPWatch
TCPWatch is a Python proxy server that’s great for showing you HTTP requests and responses. Just point your browser at 127.0.0.1 and TCPWatch 'proxies' all HTTP traffic and shows you what is happening. Very useful and simpler to use than Ethereal.
Urlparse
Sometimes it's convenient to be able to break a URL up into its component parts: scheme, server address, path and fragment. The right function to do this is urlparse.urlparse() from the standard library.
Common User Agent Problems http://www.w3.org/TR/cuap
Being a good User-Agent. This is an official W3C reference about how User-Agents ought to behave. It has lots of good advice and pitfalls to avoid if you are writing a web client.
HTTP Pocket Reference
Published by O’Reilly, this book is an invaluable source of reference on HTTP.
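As a quick illustration of the Urlparse entry above, this minimal sketch splits a URL into its parts. The path, query and fragment in the URL are made up for the example, and the fallback import is an assumption for readers on Python 3, where the same function lives in urllib.parse.

```python
try:
    from urlparse import urlparse          # Python 2, as used in this article
except ImportError:
    from urllib.parse import urlparse      # Python 3 home of the same function

# Hypothetical URL, chosen only to exercise the components.
theurl = 'http://www.pyzine.com/articles/index.html?issue=7#top'

# urlparse() returns a 6-tuple: scheme, server, path, params, query, fragment
scheme, server, path, params, query, fragment = urlparse(theurl)

print('%s | %s | %s | %s | %s' % (scheme, server, path, query, fragment))
```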
 
Hello? Internet?

HTTP is a protocol for transmitting web pages. Put simply, HTTP is an agreement about how web clients and web servers should interact with one another, including how a client must request resources from a server and how that server should respond.

Underlying HTTP (and many other Internet protocols, such as FTP and SSH) is the TCP/IP protocol, which governs how two computers exchange data via a network. TCP/IP ensures that messages are delivered to the right place in the right order, but otherwise isn’t concerned with what’s being communicated. In other words, if HTTP is a letter, then TCP/IP is the envelope.

Like HTTP and TCP/IP, Python’s Internet access libraries are also layered, with some libraries providing high-level features such as web access, and others providing basic connectivity.

The highest-level Python library for fetching Internet resources is called “urllib2.” As its name implies, urllib2 fetches resources via standard URLs. It’s intended to be very easy to use. In turn, urllib2 is built on “httplib,” which is built upon the socket library. At the lowest level, the socket library handles the two-way communication that occurs across networks and the Internet.

Thankfully, to use the high-level interface that urllib2 provides, you don't need to understand the details of the socket library or delve into the ins and outs of TCP/IP.

And because HTTP is a protocol designed by the kind of geeks who used UNIX type systems, its commands are largely just text instructions[3]. If you watch an HTTP transaction taking place[4], you can read most of the requests and responses and can understand most of what’s happening.

Decoding HTTP

A web site is just a computer somewhere with an Internet connection and a web server. The web server is a program that receives requests for web pages and returns replies. If the requested page exists and the requestor has the appropriate access rights, the server’s reply contains the requested web page (or other resource, as the case may be).

urllib2.urlopen() is the Python function to fetch web pages and other resources. You pass it either a URL, or you build a request object using urllib2.Request, which represents the HTTP request that you want to make. urlopen() returns a file handle on the page you asked for, or it throws an exception. If urlopen() succeeds, the file-like object returned has various useful attributes.

Here’s an example of pulling the home page from the Pyzine web site (all examples shown have been tested on Python 2.2.3).

import urllib2

txdata = None      # we're not POSTing any data in this example
txheaders = {}     # an empty headers dictionary, we're not adding any request headers
CHUNKSIZE = 1024   # how much data to read at a time from the page
rxbuffer = []      # a temporary variable we'll use whilst reading
rxdata = ''        # this variable will contain the data when we've finished
theurl = 'http://www.pyzine.com'      # this is the url we are fetching

req = urllib2.Request(theurl, txdata, txheaders)     # create a request object
try:
    filehandle = urllib2.urlopen(req)      # attempt to fetch the page
except IOError, filehandle:
    # an error that carries a page has a read method and can be used like a filehandle
    if not hasattr(filehandle, 'read'):
        print 'An error occurred - no page was returned.'
        raise                             # re-raise the error

# the info and geturl methods contain details of the page we have fetched
info = filehandle.info()
realurl = filehandle.geturl()

while True:               # read the page using a loop
    # We could have just used filehandle.read() to fetch the page in one go,
    # or the readlines method to return it as a list of lines
    chunk = filehandle.read(CHUNKSIZE)
    if not chunk: break
    # if we were processing data as we read it we could do it here, e.g.
    # feeding the page to a parser
    rxbuffer.append(chunk)

rxdata = ''.join(rxbuffer)     # turn the buffer list into a single string
# delete the temporary variables from our namespace and free their memory
del rxbuffer
del chunk

print info.gettype()           # this is the 'type' of the resource we have just fetched
print realurl
print rxdata[:100]

After setting up some variables, the code creates a request object:

  • txheaders can hold any headers that you might want to add to the request.
  • To make a POST request, txdata should contain the body of the request (more on both of these later). This example does not use either and the code could just omit them.
  • CHUNKSIZE is how much data to read at a time from the page. rxdata contains the resource fetched, and rxbuffer is a temporary variable used to hold the data.
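The chunked read loop does not depend on the network, so it can be tried offline. In this sketch io.BytesIO stands in for the file-like object that urlopen() returns, and read_in_chunks is a hypothetical helper name; neither comes from the article, and the io module assumes Python 2.6 or later.

```python
import io

CHUNKSIZE = 1024    # how much data to read at a time

def read_in_chunks(filehandle, chunksize=CHUNKSIZE):
    """Read a file-like object to the end, one chunk at a time."""
    rxbuffer = []
    while True:
        chunk = filehandle.read(chunksize)
        if not chunk:           # an empty result means end of data
            break
        rxbuffer.append(chunk)  # a parser could be fed here instead
    return b''.join(rxbuffer)

# Stand-in for a fetched page: 2500 bytes, so the loop reads 1024 + 1024 + 452.
fakepage = io.BytesIO(b'x' * 2500)
rxdata = read_in_chunks(fakepage)
print(len(rxdata))
```

Collecting the chunks in a list and joining once at the end avoids building a new string on every pass through the loop.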

If the error object has a read attribute, then the server returned a page, even though an error occurred. (The page probably has a description of the problem, but this example doesn’t examine the error any further and just displays the page. Proper error handling is shown in a moment.)

urlopen() returns the file-like object, which has two interesting methods: info() and geturl().

info() returns an RFC822 message object. This contains some of the details of the underlying HTTP transaction, and you can use it to look at the headers the server returned, for example.

The message object also has methods gettype() and getmaintype(). gettype() yields what kind of resource was fetched. Common resources are “text/html” for the body of a web page, or “image/jpeg” for a jpeg image.[5] getmaintype() returns the first part of the gettype() string. For instance, for “text/html,” getmaintype() returns “text.”

The RFC822 message object also behaves like a dictionary, keyed by the headers the server sent. These are called the response headers. (Don't try and just iterate over a message object, though: the __iter__ method is missing in versions of Python before 2.4. Instead, iterate over messageobject.keys().)
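To try the dictionary-style access without fetching anything, header text can be parsed by hand. This is a sketch under an assumption: it uses email.message_from_string() from the standard library as a stand-in for the object that info() returns. On Python 3 that object is built on the same email.message.Message class, and gettype() and getmaintype() are spelled get_content_type() and get_content_maintype() there. The header values are copied from the sample server response shown later in this article.

```python
import email

# Header block as a server might send it, ending with a blank line.
rawheaders = (
    'Server: Apache/1.3.6 (Unix)\n'
    'Content-length: 327\n'
    'Content-Type: text/html\n'
    '\n'
)
info = email.message_from_string(rawheaders)

# Iterate over keys() rather than over the message object itself.
for header in info.keys():
    print('%s: %s' % (header, info[header]))

# Lookups ignore the case of the header name.
length = int(info['content-length'])

fulltype = info.get_content_type()        # the whole maintype/subtype string
maintype = info.get_content_maintype()    # just the part before the slash
```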

The method geturl() returns the *real* URL of the resource fetched. This can differ from the URL you asked for, because urlopen() automatically follows redirects, which spares you the hassle of interpreting and acting upon redirect responses. If you need to know the real location of the resource you have just fetched (say, to follow relative links in an HTML page), use geturl().
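Following a relative link is one place where the real URL matters. A minimal sketch, assuming realurl holds what geturl() returned; the URLs here are invented for illustration, and urljoin() comes from the urlparse module (urllib.parse on Python 3).

```python
try:
    from urlparse import urljoin           # Python 2
except ImportError:
    from urllib.parse import urljoin       # Python 3

# Pretend this came back from filehandle.geturl() after a redirect.
realurl = 'http://www.pyzine.com/articles/index.html'

# Relative links are resolved against the page's real location.
nextpage = urljoin(realurl, 'page2.html')
logo = urljoin(realurl, '../images/logo.png')
# A link that is already absolute is returned as it is.
absolute = urljoin(realurl, 'http://www.python.org/')

print(nextpage)
print(logo)
print(absolute)
```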

Making Sense of Failure

So far, the details of the HTTP transaction have largely been hidden. But there are lots of reasons a request might fail. When an error occurs, you’ll need some inside knowledge to deduce what failed and why.

Luckily, HTTP records what happened in HTTP response codes (not all response codes are errors). As a second example, let's try and fetch a URL from a non-existent server.

import urllib2
theurl = 'http://www.anonexistentserver.com'
filehandle = urllib2.urlopen(theurl)

Traceback (most recent call last):
   File "", line 1, in -toplevel-
     filehandle = urllib2.urlopen(theurl)
   File "C:\Python23\Lib\urllib2.py", line 129, in urlopen
     return _opener.open(url, data)
   File "C:\Python23\Lib\urllib2.py", line 326, in open
     '_open', req)
   File "C:\Python23\Lib\urllib2.py", line 306, in _call_chain
     result = func(*args)
   File "C:\Python23\Lib\urllib2.py", line 901, in http_open
     return self.do_open(httplib.HTTP, req)
   File "C:\Python23\Lib\urllib2.py", line 886, in do_open
     raise URLError(err)
URLError:

By sheer coincidence this is *exactly* the same error you get if you try to fetch a real URL but don’t have an Internet connection. So, a useful function is one that attempts to detect whether or not the computer has an Internet connection. That’s what isonline() tries to do (the code is not foolproof, as the sidebar explains):

import urllib2

def isonline(reliableserver='http://www.google.com'):
    """Returns True if we appear to have an internet connection or False.
    It defaults to using google as a test server, but you can supply an alternative if you want."""
    from urllib2 import urlopen
    try:
        urlopen(reliableserver)
        return True
    except IOError:
        return False

theurl = 'http://www.pyzine.com'
try:
    filehandle = urllib2.urlopen(theurl)
except IOError:
    if isonline():
        print theurl, "doesn't seem to be available."
    else:
        print "We don't seem to have an internet connection."
else:
    print filehandle.read()

Let's try something else: what happens if you try to fetch a page that doesn't exist, from a server that does exist?

import urllib2
theurl = 'http://www.pyzine.com/nonexistentpage.html'
filehandle = urllib2.urlopen(theurl)

Traceback (most recent call last):
   File "", line 1, in ?
      filehandle = urllib2.urlopen(theurl)
   File "C:\PYTHON22\lib\urllib2.py", line 138, in urlopen
     return _opener.open(url, data)
   File "C:\PYTHON22\lib\urllib2.py", line 328, in open
      '_open', req)
   File "C:\PYTHON22\lib\urllib2.py", line 307, in _call_chain
      result = func(*args)
   File "C:\PYTHON22\lib\urllib2.py", line 824, in http_open
      return self.do_open(httplib.HTTP, req)
   File "C:\PYTHON22\lib\urllib2.py", line 818, in do_open
      return self.parent.error('http', req, fp, code, msg, hdrs)
   File "C:\PYTHON22\lib\urllib2.py", line 354, in error
      return self._call_chain(*args)
   File "C:\PYTHON22\lib\urllib2.py", line 307, in _call_chain
      result = func(*args)
   File "C:\PYTHON22\lib\urllib2.py", line 406, in http_error_default
      raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found

Here, urlopen() raises an HTTPError, a subclass of URLError. HTTPError means something went wrong with the HTTP transaction. Normally, you won't deliberately try and fetch a page that you know doesn't exist, so how can this program tell what went wrong? This is where the details of HTTP start to show through, and you need to know things that don't appear in the Python manual.

  • In a nutshell, if the error occurs before a request gets to a server, then the error object will have a 'reason' attribute. The 'reason' attribute is a tuple with an error number and a description.
  • If the request reaches a real server, but an error still occurs, then the error object will have a 'code' attribute. The code is an integer that relates to the problem.
  • If a call to urlopen() fails without reaching a server, you may want to check if you have an active connection. That way you can give your user an appropriate error message.
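The distinction between the two attributes can be tried without a network by building the exception objects by hand. This is a sketch: describe_failure is a hypothetical helper, the exceptions are constructed directly rather than raised by urlopen(), and the fallback import assumes Python 3, where the contents of urllib2 live in urllib.request. It tests for 'code' first because later Python versions give HTTPError a 'reason' attribute as well.

```python
import io
try:
    import urllib2                          # Python 2, as used in this article
except ImportError:
    import urllib.request as urllib2        # Python 3 home of the same names

def describe_failure(e):
    """Say whether an error came from a server or happened before one was reached."""
    if hasattr(e, 'code'):
        return 'server replied with code %d' % e.code
    elif hasattr(e, 'reason'):
        return 'no server reached: %s' % (e.reason,)
    return 'unknown failure'

# Errors built by hand, standing in for what urlopen() would raise.
noserver = urllib2.URLError('host not found')
notfound = urllib2.HTTPError('http://www.pyzine.com/nonexistentpage.html',
                             404, 'Not Found', {}, io.BytesIO(b''))

print(describe_failure(noserver))
print(describe_failure(notfound))
```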

So, what do the error codes mean? Code 404 (perhaps familiar) indicates 'page not found'. But 404 isn’t the only response code that servers send. It's just one of the few that most web clients actually pass on to the end user to see. Most other response codes tend to be dealt with internally. (You can find a good list of all the responses in RFC2616, Section 10 [8].)

For Python, there is a useful dictionary of the error response codes, and their meanings, buried in the BaseHTTPServer module of the standard library.

###############
# Borrowed from BaseHTTPServer in the python standard library
# This is the table of HTTP errors (to which I've added code 400)

errorlist = {
  400: ('Bad Request', 'The Server thinks your request was malformed.'),
  401: ('Unauthorized',
     'No permission -- see authorization schemes'),
  402: ('Payment required',
     'No payment -- see charging schemes'),
  403: ('Forbidden',
     'Request forbidden -- authorization will not help'),
  404: ('Not Found', 'Nothing matches the given URI'),
  405: ('Method Not Allowed',
     'Specified method is invalid for this server.'),
  406: ('Not Acceptable', 'URI not available in preferred format.'),
  407: ('Proxy Authentication Required', 'You must authenticate with '
     'this proxy before proceeding.'),
  408: ('Request Time-out', 'Request timed out; try again later.'),
  409: ('Conflict', 'Request conflict.'),
  410: ('Gone',
     'URI no longer exists and has been permanently removed.'),
  411: ('Length Required', 'Client must specify Content-Length.'),
  412: ('Precondition Failed', 'Precondition in headers is false.'),
  413: ('Request Entity Too Large', 'Entity is too large.'),
  414: ('Request-URI Too Long', 'URI is too long.'),
  415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
  416: ('Requested Range Not Satisfiable',
     'Cannot satisfy request range.'),
  417: ('Expectation Failed',
     'Expect condition could not be satisfied.'),

  500: ('Internal error', 'Server got itself in trouble'),
  501: ('Not Implemented',
     'Server does not support this operation'),
  502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
  503: ('Service temporarily overloaded',
     'The server cannot process the request due to a high load'),
  504: ('Gateway timeout',
     'The gateway server did not receive a timely response'),
  505: ('HTTP Version not supported', 'Cannot fulfil request.'),
  }

Some of these errors aren't seen in the wild too often! (For example, the micropayment scheme the creators of error 402 originally envisioned hasn’t yet been implemented.)
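Rather than copying the table, it can also be read straight from the standard library at run time. A hedged sketch: explain is a hypothetical helper, and the fallback import assumes Python 3, where BaseHTTPServer was renamed http.server. The table's contents and exact wording vary between Python versions.

```python
try:
    from BaseHTTPServer import BaseHTTPRequestHandler    # Python 2
except ImportError:
    from http.server import BaseHTTPRequestHandler       # Python 3

# Maps each response code to a (short message, long message) pair.
responses = BaseHTTPRequestHandler.responses

def explain(code):
    """Return a one-line description of an HTTP response code."""
    if code in responses:
        short, long_ = responses[code]
        return '%d %s: %s' % (code, short, long_)
    return '%d: not in the table' % code

print(explain(404))
print(explain(999))
```

Checking membership before the lookup means an unexpected code from a server produces a message rather than a KeyError.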

Catching Up

The following piece of code illustrates what’s been covered so far (it depends on the isonline() function shown above):

def testurl(inurl):
    """Tests if a url can be reached - and prints an appropriate message depending on the result."""
    from urllib2 import urlopen
    try:
        urlopen(inurl)
    except IOError, e:
        if hasattr(e, 'code'):
            # tested first, because later Python versions also give HTTPError a 'reason'
            # unknown codes fall back to a placeholder instead of raising KeyError
            errortype, errormsg = errorlist.get(e.code, ('Unknown', 'Code not in errorlist'))
            print 'Server returned an error code.'
            print 'Error Code : ', e.code
            print 'Error type : ', errortype
            print 'Error msg : ', errormsg
        elif hasattr(e, 'reason'):
            print 'Access failed before we reached a server.'
            print 'Reason : ', e.reason
            if isonline():
                print "We have an internet connection - so either the server is down or doesn't exist."
            else:
                print "We don't appear to have an internet connection - this is likely to be the source of the problem."
    else:
        print 'Success'

url1 = 'http://www.anonexistentserver.com'
url2 = 'http://www.pyzine.com/nonexistentpage.html'
url3 = 'http://www.pyzine.com'
testurl(url1)
testurl(url2)
testurl(url3)

Overview of HTTP

To put the previous examples in context, let's look at a simple HTTP transaction. It consists of a request from the client and a response from the server.

As mentioned before, the request and response consist of text instructions, which are called headers. Both requests and responses can also have a 'body' of data. In a request, the body is the information you’re sending to the server, usually from an HTML form. In a response, the body is the page or resource you asked for. This could be an HTML page, a JPEG image, a zip file, or just about anything else.

For instance, here is a simple HTTP request:

GET /index.html HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, */*
Accept-Language: en-us
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)
Host: pyzine.com

There are several different types of request that can be made, but overwhelmingly the most common are GET and POST requests. The example above shows a GET request, which is generally used for just fetching resources. It has no body: it consists entirely of the request line and header lines.

  • The first line tells the server that this is a GET request for the file '/index.html' using HTTP version 1.1. (There are two common versions of HTTP that are used: 1.0 and 1.1. However, this isn't the place to explain the differences, and if you use urllib2, it handles the differences for you.)
  • The next headers tell the server what sort of media the client can handle, what language it expects, and what program is making the request (this is the User-Agent header).
  • The last line specifies what host the request is for. The host header is necessary for the request to be a valid HTTP/1.1 request and is needed if the server is hosting several websites. Again, urllib2 handles this for you.
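Because a request is plain text, it can be assembled and inspected as an ordinary string. The sketch below only builds the text and sends nothing; build_get_request is a hypothetical helper written for this illustration, not part of urllib2 or httplib.

```python
def build_get_request(host, path, headers):
    """Return the text of a simple HTTP/1.1 GET request."""
    lines = ['GET %s HTTP/1.1' % path]           # the request line
    for name, value in headers:
        lines.append('%s: %s' % (name, value))   # one header per line
    lines.append('Host: %s' % host)              # required by HTTP/1.1
    # Lines end with CRLF, and a blank line marks the end of the headers.
    return '\r\n'.join(lines) + '\r\n\r\n'

request = build_get_request('pyzine.com', '/index.html', [
    ('Accept', 'image/gif, image/x-xbitmap, image/jpeg, */*'),
    ('Accept-Language', 'en-us'),
    ('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'),
])

# The first line splits back into method, path and protocol version.
method, path, version = request.split('\r\n')[0].split(' ')
print(request)
```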

So far, the examples haven’t specified any headers with urlopen(). That’s common, because in normal use, urllib2 adds the 'Host' header and a 'User-Agent' header for you. However, you can change values in existing headers or add new headers.

Let’s see an example.

All the popular web browsers and their different versions render HTML and Javascript *slightly* differently. Web applications can use the User-Agent header to tailor their content according to the browser being used.

The default value that urllib2 uses for 'User-Agent' is 'Python-urllib/2.1'. (“2.1” is the version number in Python 2.3.4; it might be different in other versions of Python.) This tells the web server and all web applications that the web client is 'Python-urllib'.

Some web sites, however, don't like automatic programs using their services (Google being one of them). To work around this, you can specify a user agent string that mimics a real browser. One “user agent” that’s often used is 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'.[10] You can set this with:

txheaders = {'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
req = urllib2.Request(theurl, txdata, txheaders)

or with:

req = urllib2.Request(theurl)
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
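A request object can be inspected before anything is sent, which is a handy way to check what headers it carries. This is a sketch, assuming a Python version whose Request objects have get_header() (Python 2.7 and Python 3 do), with a fallback import for Python 3. Recent versions store header names capitalized, so 'User-Agent' is kept as 'User-agent'.

```python
try:
    import urllib2                          # Python 2
except ImportError:
    import urllib.request as urllib2        # Python 3

theurl = 'http://www.pyzine.com'
useragent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

req = urllib2.Request(theurl)
req.add_header('User-Agent', useragent)

# Nothing has been sent yet; the request only describes what will be sent.
stored = req.get_header('User-agent')     # note the capitalization
method = req.get_method()                 # 'GET' because there is no body
print('%s %s' % (method, req.get_full_url()))
print(stored)
```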

Another common header is the 'Referer' header (the misspelling is part of the HTTP standard). Browsers send this to tell a server what URL referred them to the requested URL. Web site owners find this information useful, because it shows where traffic comes from. Some web applications actually depend on this header, though, so it’s possible you’ll have to replicate it. [11]

And if you’re making a POST request, then urllib2 also adds a 'Content-type' and a 'Content-length' header, if you don't specify them yourself.

So much for the request. The other half of the transaction is the response. For example, this is the server response to a request made for the Pyzine home page.

HTTP/1.1 200 OK
Date: Thur, 28 Sep 2004 20:15:03 GMT
Server: Apache/1.3.6 (Unix)
Last-Modified: Mon, 06 Dec 1999 12:01:03 GMT
ETag: "2f5cd-964-381e1bd6"
Accept-Ranges: bytes
Content-length: 327
Connection: close
Content-Type: text/html

You can see that the first line contains a '200 OK', a response code that isn't an error code, but instead reports that everything is fine. (Again, not all of the response codes indicate errors.)

You can examine the response headers by looking at the 'headers' attribute of a filehandle returned by urlopen(), or by using the dictionary-like properties of the RFC822 message object returned by filehandle.info(), as in the first example.

It *does* matter how you react to these response codes, particularly the error codes. For a very good discussion of how an 'atom feed aggregator' ought to respond to error codes see http://diveintomark.org/archives/2003/07/21/atom_aggregator_behavior_http_level . This discussion should be of general interest to people writing web clients.

By the way, in a full response a blank line separates the last header line from the page body. Forgetting to include that blank line can be a cause of frustrating ‘Error Code 500’s when writing CGIs.

Often, clients aren’t fetching web pages. In many cases, clients are communicating with web applications.

To access a web application, you use a URL as before. However, additional protocol support is required to pass information to the script — information such as fields from forms, values of radio buttons and the like, even file uploads.

In a GET request, information is sent to applications as a query string appended to the URL. In a POST request, information is sent as the body of a request. No matter which method is used, certain characters, such as spaces, have to be 'escaped' for safe transmission. This is done using urllib.urlencode(). (This will all be covered in more detail in the next article, which looks at how a client communicates with a web application.)
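A short sketch of both styles, without contacting any server. The address www.example.com/cgi-bin/search and the form fields are invented for illustration; urlencode() is urllib.urlencode() in the Python of this article and urllib.parse.urlencode() on Python 3.

```python
try:
    import urllib2
    from urllib import urlencode            # Python 2
except ImportError:
    import urllib.request as urllib2        # Python 3
    from urllib.parse import urlencode

baseurl = 'http://www.example.com/cgi-bin/search'    # hypothetical script
fields = [('name', 'Michael Foord'), ('topic', 'web clients & HTTP')]

# Spaces become '+', and '&' inside a value becomes '%26'.
query = urlencode(fields)

# GET: the encoded fields ride along in the URL after a '?'.
getreq = urllib2.Request(baseurl + '?' + query)

# POST: the same encoded fields become the body of the request.
postreq = urllib2.Request(baseurl, query.encode('ascii'))

print(query)
print('%s %s' % (getreq.get_method(), getreq.get_full_url()))
print(postreq.get_method())
```

Passing a list of pairs rather than a dictionary keeps the fields in a predictable order.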

Cookies, Handlers and Openers

urllib2 can fetch resources using several different Internet application protocols. For example, urlopen() quite happily fetches resources via HTTP, HTTPS, or FTP. It does this through a series of 'handlers' that handle the different protocols and situations.

There are a few situations where you might need to specify your own handler. All requests made via urllib2 are done with 'openers' that have these handlers built into them. urlopen() just uses the default opener. If you want to roll your own opener then you need to build one from your handler. You can then either use your opener directly or you can install it as the default opener.
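To see what an opener is made of, one can build the default opener and list its handlers. A sketch with a fallback import for Python 3; the 'handlers' attribute is used here only for inspection, and the exact set of handlers depends on the Python version and build.

```python
try:
    import urllib2                          # Python 2
except ImportError:
    import urllib.request as urllib2        # Python 3

# With no arguments, build_opener() returns an opener holding the defaults.
opener = urllib2.build_opener()
names = [handler.__class__.__name__ for handler in opener.handlers]
for name in names:
    print(name)

# Either call opener.open(url) directly, or make it the opener
# that urllib2.urlopen() uses from now on:
urllib2.install_opener(opener)
```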

Cookies are one such situation. Fetching information from many web sites may well require you to set the right cookie header in your request. Thankfully, there is a very good extension library for Python called ClientCookie [12] that handles cookies transparently for you *and* allows you to load and save cookies across sessions.

The handler to create is called an HTTPCookieProcessor and it uses a CookieJar to store cookies from response headers. When you install this handler, it checks the CookieJar before making any requests. If it needs to send a cookie, it is sent automatically.

This next example uses an instance of the LWPCookieJar class that loads and saves cookies. The boilerplate here imports the libraries and then loads the cookies into the cookiejar, if the cookiefile exists.

import urllib2, ClientCookie, os
COOKIEFILE = 'cookies.lwp' # The filename we are saving cookies in
cj = ClientCookie.LWPCookieJar()
if os.path.isfile(COOKIEFILE):
   cj.load(COOKIEFILE)

ClientCookie does some of the things that urllib2 normally does. The functions ClientCookie.build_opener() and ClientCookie.install_opener() are direct equivalents of the urllib2 build_opener() and install_opener() functions. (In Python 2.4, ClientCookie's cookie handling has been moved into the standard library, as cookielib and urllib2.HTTPCookieProcessor.)

First, create the handler:

handler = ClientCookie.HTTPCookieProcessor(cj)

Having created the handler, build an opener from it. You could pass several different handlers to create a single opener if you wanted; any opener you build also has the default handlers in it.

opener = ClientCookie.build_opener(handler)

Now you can call the opener directly:

opener.open(theurl)
# request objects should be made using calls to ClientCookie.Request

Also, you can create several cookie jars and use several different openers.

The alternative is to install your own opener as the default opener, meaning that all calls to ClientCookie.urlopen use the same cookiejar.

ClientCookie.install_opener(opener)
ClientCookie.urlopen(theurl)

If you want to save the cookies from this session you can use:

cj.save(COOKIEFILE) # save the cookies again
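For readers on Python 2.4 or later, the same steps can be sketched with the standard library alone, since cookielib and urllib2 provide LWPCookieJar and HTTPCookieProcessor (http.cookiejar and urllib.request on Python 3). This sketch makes no request, so the jar stays empty; writing the cookie file to a temporary directory is an assumption made to keep the example tidy.

```python
import os
import tempfile
try:
    import urllib2                          # Python 2.4 and later
    import cookielib
except ImportError:
    import urllib.request as urllib2        # Python 3 names
    import http.cookiejar as cookielib

COOKIEFILE = os.path.join(tempfile.mkdtemp(), 'cookies.lwp')

cj = cookielib.LWPCookieJar()
if os.path.isfile(COOKIEFILE):              # reuse cookies from an earlier run
    cj.load(COOKIEFILE)

# The handler consults the jar on every request made through this opener.
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# opener.open(theurl) would go here; it is left out to stay offline.

cj.save(COOKIEFILE)                         # write the jar back to disk

# A later session can pick the cookies up again.
cj2 = cookielib.LWPCookieJar()
cj2.load(COOKIEFILE)
print(len(cj2))
```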

Another example of a situation where you might want to install another handler is a proxy-handler. Some people only have an Internet connection via a proxy that they have to authenticate with. If you create a web client, make sure that you give people the option of specifying a proxy to connect through. If the proxy doesn't require authentication, proxies are another thing that urllib2 handles automatically.

Authentication

Authentication is yet another situation that needs a special handler. (For a full explanation of basic authentication using Python, see http://www.voidspace.org.uk/atlantibots/recipebook.html#auth .) If you submit a request and the server replies with error code 401, you must log in to the server to complete the request. Logging in requires that a username and password be encoded into the request.

Unfortunately, error code 401 is accompanied by a 'realm'. To use the normal authentication handler, you need to know what the realm is. Instead, you might try a combination of the HTTPBasicAuthHandler with an HTTPPasswordMgrWithDefaultRealm, which allows you to force the request to use the password/username for a 401 error without having to explicitly know the realm.

The toplevelurl is the first url that requires authentication. This is usually a 'super-url' of any others in the same realm. This url mustn’t include the protocol (the `http://` bit of the url) or it will fail. It should just be the server name and path.

passwordmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()            # create a password manager
passwordmgr.add_password(None, toplevelurl, username, password)   # add the username and password
handler = urllib2.HTTPBasicAuthHandler(passwordmgr)               # create the handler
opener = urllib2.build_opener(handler)                            # from handler to opener
urllib2.install_opener(opener)                                    # install the opener
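The password manager can be exercised on its own, which makes it easy to check what credentials a given URL would receive. In this sketch the server name, realm name, user name and password are all invented, and only the bare server name is registered. find_user_password() is the lookup the handler performs when a 401 arrives.

```python
try:
    import urllib2                          # Python 2
except ImportError:
    import urllib.request as urllib2        # Python 3

passwordmgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# None as the realm means "use these credentials whatever the realm is".
passwordmgr.add_password(None, 'www.example.com', 'michael', 'secret')

handler = urllib2.HTTPBasicAuthHandler(passwordmgr)
opener = urllib2.build_opener(handler)

# Ask the manager what it would supply for two URLs and some realm name.
inside = passwordmgr.find_user_password(
    'Members Area', 'http://www.example.com/protected/page.html')
outside = passwordmgr.find_user_password(
    'Members Area', 'http://www.python.org/')

print(inside)
print(outside)
```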

Using the Web Page

Of course, once you've fetched a web page you need to extract the information from it.[13] There are various ways of doing this and the best tool to use depends largely on the job you want to do.

The details of web page “parsing” tools are beyond the bounds of this article, but there are lots of references on the web to this part of the job.

Do not use HTMLParser, the standard HTML parser in Python’s standard library. Unfortunately, it is so strictly built that it typically fails on anything but completely correct HTML. This makes it basically useless for parsing most webpages.

BeautifulSoup [14] is an HTML parser that doesn't care! It's very effective at extracting information. scraper.py [15] is also a very simple parser. It does a similar job to HTMLParser.HTMLParser, but doesn’t choke on bad HTML. HTMLTidy [16] can be used to turn bad HTML into valid XHTML. This means you can use *any* XML or XHTML parser (including HTMLParser).

It may be that regular expressions are powerful enough to pull out the information you want from a page. Whatever you use, enjoy it, and if possible, share your code and experiences with the rest of the Python community.
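As a small illustration of the regular expression approach, this sketch pulls link targets out of a deliberately sloppy fragment of HTML. The fragment and the pattern are written for this example only; a pattern this simple will miss unquoted attribute values and other legal variations.

```python
import re

# Sloppy HTML: unclosed tags, mixed case and mixed quoting.
page = """<HTML><body>
<p>See <A HREF="http://www.python.org/">Python</a>
<p>and <a href='/articles/index.html'>the articles
<br><a class=x href = "page2.html">next</a>
"""

# Match href, optional spaces around '=', then a quoted value.
linkpattern = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
links = linkpattern.findall(page)

for link in links:
    print(link)
```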

Hopefully this article has given you enough of an overview of web access from Python to experiment confidently.

Sidebar - Are we connecting?

Recently, there was a discussion on comp.lang.python[6] about how you can tell if a computer has an internet connection or not. Strangely enough, the conclusion was that there is no easy way of telling.

There is a Windows function (part of the Windows API and callable from Python using win32all or ctypes) called InternetCheckConnection [7], but all it does is solicit a “well-known” Internet server for a connection. If the server responds, the function concludes that the computer is online. If the server does not respond, the function assumes there’s no viable connection.

You'd think there would be a more reliable method, but in fact, the Internet is just another network as far as your computer is concerned. All it is able to know is whether or not it can reach a particular location.
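That "can I reach a particular location" test can be sketched with the socket library directly. This is an illustration, not a reliable connectivity check: canreach is a hypothetical helper, socket.create_connection() needs Python 2.6 or later, and the host and port to try are left to the caller.

```python
import socket

def canreach(host, port=80, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        conn = socket.create_connection((host, port), timeout)
    except socket.error:        # refused, unreachable, lookup failure or timeout
        return False
    conn.close()
    return True

# Port 1 on the local machine is normally closed, so this usually prints False.
print(canreach('127.0.0.1', 1, 0.5))
```

A True result only says that one particular server accepted a connection; a False result cannot tell a missing connection apart from a server that is down.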

[1] For those who don't know, the 'HyperText Transfer Protocol' is the language used to describe the transmission of webpages (hypertext), and other resources, across the internet.

[2] The RFCs are the official definitions and specifications for all kinds of internet protocols. You can find them all at http://www.rfc-editor.org/

[3] If it had been designed by the kind of geeks who write 'other' operating systems, it might have been comprised of single character control codes - which are arguably more convenient for the computer, but less convenient for humans.

[4] See other resources, Ethereal and TCPWatch.

[5] They follow the pattern maintype/subtype. The official repository of registered types is at ftp://ftp.isi.edu/in-notes/iana/assignments/media-types/

[6] comp.lang.python is the official Python newsgroup. It is normally accessed via NNTP, an Internet protocol different from HTTP. A good web interface exists at http://groups.google.co.uk/groups?q=comp.lang.python

[7] See http://msdn.microsoft.com/library/en-us/wininet/wininet/internetcheckconnection.asp . You should be able to access it via win32all or ctypes if you really want to.

[8] For a better reference - see http://libraries.ucsd.edu/about/tools/http-response-codes.html

[9] The relevant bit is here: http://256.com/gray/docs/rfc2616/14.html (it is technical, though).

[10] A better way to access google is to use the google API and the python interface to this.

[11] If you do run into this problem then you may find the 'mechanize' library helpful. http://wwwsearch.sourceforge.net/mechanize/

[12] http://wwwsearch.sourceforge.net/ClientCookie/

[13] There is an example of using regular expressions to extract chess match statistics from webpages in an article called ‘Python Squeezes the Web’ - http://www.linuxplanet.com/linuxplanet/tutorials/1132/

[14] http://www.crummy.com/software/BeautifulSoup/

[15] http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/286269

[16] http://tidy.sourceforge.net/ for the main project and http://utidylib.berlios.de/ for a python binding.


Michael Foord can be contacted at michael@foord.me.uk. He has a home page of Python projects at http://www.voidspace.org.uk/atlantibots/pythonutils.html.



Reproduction of material from any of PyZine's pages without prior written permission is strictly prohibited. Copyright 2003 - 2005 PyZine Zope/Plone hosting by Nidelven IT