
Web scraping with Python (Part 2)

zihua | 2014-01-20 23:01:38 | Views: 932


by Javier Collado | August 2009 | Content Management Open Source

This article by Javier Collado expands the set of web scraping techniques shown in his previous article, Web Scraping with Python, by looking closely at a more complex problem that cannot be solved with the tools explained there.

This article will show how to extract the desired information, using the same three steps, when the web page is not written directly in HTML but is auto-generated by JavaScript code that updates the DOM tree.

As you may remember from that article, web scraping is the ability to automatically extract information from a set of web pages that were designed only to display information nicely to humans, and that might not be suitable when a machine needs to retrieve that information. The three basic steps recommended for any scraping task were the following:

  • Explore the website to find out where the desired information is located in the HTML DOM tree
  • Download as many web pages as needed
  • Parse downloaded web pages and extract the information from the places found in the exploration step

What should be taken into account when the content is not directly coded in the HTML DOM tree? The main difference, as you have probably already noted, is that the downloading methods suggested in the previous article (urllib2 or mechanize) just don't work. They generate an HTTP request to get the web page and deliver the received HTML directly to the scraping script. However, the pieces of information that are auto-generated by JavaScript code are not yet in that HTML file, because the code has not been executed as it would be when the page is displayed in a web browser.
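This difference can be seen with a tiny illustration. The markup below is invented for the example (the real NASA page is far more complex), but it shows the situation an HTTP-only fetch leaves you in: the gallery container arrives empty, because the script that fills it never runs outside a browser.

```python
import re

# What a urllib2-style fetch would deliver: the markup exactly as served.
# The gallery container is empty; the <img> exists only inside the script
# source, which no browser has executed. (Illustrative markup only.)
raw_html = (
    '<div id="gallery_image_area"></div>'
    '<script>document.getElementById("gallery_image_area")'
    '.innerHTML = \'<img src="iotd.jpg">\';</script>'
)

# Parsing the delivered HTML finds nothing inside the container:
container = re.search(r'<div id="gallery_image_area">(.*?)</div>', raw_html)
print(repr(container.group(1)))  # '' -- the image appears only after the JS runs
```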

Hence, instead of relying on a library that generates HTTP requests, we need a library that behaves as a real web browser, or even better, one that interacts with a real web browser, so that we are sure we obtain the same data we see when manually opening the page. Please remember that the aim of web scraping is to parse the data that a human user sees, so interacting with a real web browser is a really nice feature.

Is there any tool out there to do this? Fortunately, the answer is yes. In particular, there are a couple of tools used for web test automation that can solve the JavaScript execution problem: Selenium and Windmill. Windmill is used for the code samples in the sections below, but either choice would be fine, as both are well-documented, stable tools ready for production use.

Let's now follow the same three steps suggested in the previous article to scrape the contents of a web page that is partly generated by JavaScript code.


Imagine that you are a fan of the NASA Image of the Day gallery. You want to get a list of the names of all the images in the gallery, together with the link to each full-resolution picture, just in case you decide to download one later to use as a desktop wallpaper.

The first thing to do is to locate the data that has to be extracted on the desired web page. In the case of the Image of the day gallery (see screenshot below), there are three elements that are important to note:

  • Title of the image that is currently being displayed
  • Link to the image's full-resolution file
  • Next link that makes it possible to navigate through all the images
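In code, the first two pieces of information can be held in a small dict per image, which is the shape used by the listings later in this article. This is just a sketch; the values below are placeholders, not real gallery data:

```python
# A hypothetical record for one gallery entry; the field names match the
# scraper code shown later, the values are placeholders.
image_info = {
    'name': 'A Parting Look',                       # caption title (the h3 text)
    'link': 'http://www.example.com/full_res.jpg',  # placeholder full-size URL
}

print(sorted(image_info))  # ['link', 'name']
```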

To find out the location of each piece of interesting information, as already suggested in the previous article, it's best to use a tool such as Firebug, whose inspect functionality can be really useful. The following picture, for example, shows the location of the image title inside an h3 tag:

The other two fields can be located as easily as the title, so no further explanation will be given here. Please refer to the previous article for further information.
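Once Firebug has shown where the title lives, pulling it out of the markup is a one-liner. The fragment below is a minimal, well-formed stand-in for what Firebug points at, parsed with the standard library just for illustration (the real page needs a forgiving parser such as BeautifulSoup):

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for the caption markup located with Firebug.
fragment = '<div id="caption_region"><h3>A Parting Look</h3></div>'

root = ET.fromstring(fragment)
title = root.find('h3').text  # the text inside the h3 tag
print(title)  # A Parting Look
```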


As explained in the introduction, to download the content of the web page we will use Windmill, as it allows the JavaScript code to execute in the web browser before the page content is retrieved.

Because Windmill is mostly a testing library, instead of writing a script that calls the Windmill API, I will write a Windmill test case that navigates through all the image web pages. The code for the test is as follows:

 1 def test_scrape_iotd_gallery():
 2     """
 3     Scrape NASA Image of the Day Gallery
 4     """
 5     # Extra data massage for BeautifulSoup
 6     my_massage = get_massage()

 8     # Open main gallery page
 9     client = WindmillTestClient(__name__)

12     # Page isn't completely loaded until image gallery data
13     # has been updated by javascript code
14     client.waits.forElement(xpath=u"//div[@id='gallery_image_area']/img",
15                             timeout=30000)

17     # Scrape all images information
18     images_info = {}
19     while True:
20         image_info = get_image_info(client, my_massage)

22         # Break if image has been already scraped
23         # (that means that all images have been parsed
24         # since they are ordered in a circular ring)
25         if image_info['link'] in images_info:
26             break

28         images_info[image_info['link']] = image_info

30         # Click to get the information for the next image

33     # Print results to stdout ordered by image name
34     for image_info in sorted(images_info.values(),
35                              key=lambda image_info: image_info['name']):
36         print ("Name: %(name)s\n"
37                "Link: %(link)s\n" % image_info)

As can be seen, the usage of Windmill is similar to that of other libraries such as mechanize. First of all, a client object has to be created to interact with the browser (line 9), and then the main web page, which is going to be used to navigate through all the information, has to be opened (line 10). Nevertheless, Windmill also includes some facilities that take JavaScript code into account, as shown at line 14. In this line, the waits.forElement method is used to look for a DOM element that is filled in by the JavaScript code, so that when that element (in this case the big image in the gallery) is displayed, the rest of the script can proceed. It is important to note here that the web page processing doesn't start when the page is downloaded (this happens after line 10), but when there's some evidence that the JavaScript code has finished manipulating the DOM tree.
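Conceptually, waits.forElement is a poll-until-present helper: it repeatedly checks the browser's DOM for the element and gives up after the timeout. The sketch below is not Windmill's implementation (which talks to a real browser), just the polling pattern it embodies, shown with a plain Python condition:

```python
import time

def wait_for(condition, timeout=30.0, interval=0.5):
    """Poll condition() until it returns a truthy value or timeout expires.

    A conceptual sketch of a waits.forElement-style helper; the real
    Windmill call queries the browser DOM instead of a Python callable.
    """
    deadline = time.time() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.time() >= deadline:
            raise RuntimeError('timed out after %.1f seconds' % timeout)
        time.sleep(interval)

# A condition that only becomes true on the third poll, standing in for
# an element that the JavaScript code eventually inserts:
polls = {'count': 0}
def element_present():
    polls['count'] += 1
    return polls['count'] >= 3

print(wait_for(element_present, timeout=5.0, interval=0.01))  # True
```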

Navigating through all the pages that contain the needed information is just a matter of clicking on the next arrow (line 30). As the images are ordered in a circular buffer, the script stops when the same image link has been parsed twice (line 25).
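The stop condition is worth isolating: keep pulling entries until a link repeats, which in a circular gallery means you are back at the start. The sketch below simulates a three-image ring with a plain function instead of a browser click, but the loop is the same as in the test above:

```python
def scrape_ring(next_image):
    """Collect images from a circular gallery until a link repeats."""
    images_info = {}
    while True:
        image_info = next_image()
        if image_info['link'] in images_info:
            break  # back at the first image: every entry has been seen
        images_info[image_info['link']] = image_info
    return images_info

# Simulate a three-image ring that wraps around, standing in for
# repeated clicks on the gallery's next arrow:
ring = [{'link': 'a.jpg'}, {'link': 'b.jpg'}, {'link': 'c.jpg'}]
state = {'i': 0}
def next_image():
    image = ring[state['i'] % len(ring)]
    state['i'] += 1
    return image

collected = scrape_ring(next_image)
print(len(collected))  # 3
```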

To execute the script, instead of launching it as we normally would for a Python script, we call it through the windmill command so that the environment is properly initialized:

$ windmill firefox

As can be seen in the following screenshot, Windmill takes care of opening a browser window (Firefox in this case) and a controller window in which it's possible to see the commands the script is executing (several clicks on next, in this example):

The controller window is really interesting because it not only displays the progress of the test cases, but also allows you to enter and record actions interactively, which is a nice feature when trying things out. In particular, recording may in some situations be used to replace Firebug in the exploration step, because the captured actions can be stored in a script without spending much time on XPath expressions.

For more information about how to use Windmill and the complete API, please refer to the Windmill documentation.


The parsing of the web page can be performed with BeautifulSoup, as explained in the previous article. The only thing to take into account is that the page contents have to be retrieved, every time the JavaScript code changes the DOM tree, by using the commands.getPageText() method of the Windmill client object.

Please see below the code that extracts the image information for the image of the day gallery example:

 1 def get_image_info(client, my_massage):
 2     """
 3     Parse HTML page and extract featured image name and link
 4     """
 5     # Get Javascript updated HTML page
 6     response = client.commands.getPageText()
 7     assert response['status']
 8     assert response['result']

10     # Create soup from HTML page and get desired information
11     soup = BeautifulSoup(response['result'], markupMassage=my_massage)
12     image_info = {'name': soup.find(id='caption_region').h3.string,
13                   'link': urlparse.urljoin('',
14                                            soup.find(attrs='Full_Size')['href'])}
15     return image_info

Code and results

The complete code that performs the scraping and prints a simple report to the standard output is:

 1 # Generated by the windmill services transformer
 2 from windmill.authoring import WindmillTestClient
 3 from BeautifulSoup import BeautifulSoup

 5 import re, urlparse
 6 from copy import copy

 8 def get_image_info(client, my_massage):
 9     """
10     Parse HTML page and extract featured image name and link
11     """
12     # Get Javascript updated HTML page
13     response = client.commands.getPageText()
14     assert response['status']
15     assert response['result']

17     # Create soup from HTML page and get desired information
18     soup = BeautifulSoup(response['result'], markupMassage=my_massage)
19     image_info = {'name': soup.find(id='caption_region').h3.string,
20                   'link': urlparse.urljoin('',
21                                            soup.find(attrs='Full_Size')['href'])}
22     return image_info


25 def get_massage():
26     """
27     Provide extra data massage to solve HTML problems in BeautifulSoup
28     """
29     # Javascript code in this page generates HTML markup
30     # that isn't parsed correctly by BeautifulSoup.
31     # To avoid this problem, all document.write fragments are removed
32     my_massage = copy(BeautifulSoup.MARKUP_MASSAGE)
33     my_massage.append((re.compile(u"document.write(.+);"), lambda match: ""))
34     my_massage.append((re.compile(u'alt=".+">'), lambda match: ">"))
35     return my_massage


38 def test_scrape_iotd_gallery():
39     """
40     Scrape NASA Image of the Day Gallery
41     """
42     # Extra data massage for BeautifulSoup
43     my_massage = get_massage()

45     # Open main gallery page
46     client = WindmillTestClient(__name__)

49     # Page isn't completely loaded until image gallery data
50     # has been updated by javascript code
51     client.waits.forElement(xpath=u"//div[@id='gallery_image_area']/img",
52                             timeout=30000)

54     # Scrape all images information
55     images_info = {}
56     while True:
57         image_info = get_image_info(client, my_massage)

59         # Break if image has been already scraped
60         # (that means that all images have been parsed
61         # since they are ordered in a circular ring)
62         if image_info['link'] in images_info:
63             break

65         images_info[image_info['link']] = image_info

67         # Click to get the information for the next image

70     # Print results to stdout ordered by image name
71     for image_info in sorted(images_info.values(),
72                              key=lambda image_info: image_info['name']):
73         print ("Name: %(name)s\n"
74                "Link: %(link)s\n" % image_info)

Some interesting things to note that were not commented on in previous sections:

  • The get_massage function (lines 25-35) is needed to prevent BeautifulSoup parsing errors from stopping the script. This is because some pages use markup in a non-standard way that breaks the parser.
  • The urlparse library is used to transform relative URLs into absolute ones.
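The relative-to-absolute conversion urljoin performs is easy to see in isolation. In Python 3 the same function lives in urllib.parse rather than the article's urlparse module; the base URL below is illustrative (in the script the gallery page's URL would serve as the base):

```python
from urllib.parse import urljoin  # urlparse.urljoin in the article's Python 2

# Illustrative base URL standing in for the gallery page.
base = 'http://www.example.com/multimedia/imagegallery/index.html'

# A root-relative href, as found in the page, resolved against the base:
absolute = urljoin(base, '/images/full/liftoff.jpg')
print(absolute)  # http://www.example.com/images/full/liftoff.jpg
```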

A fragment of the output that, at the time of writing, can be obtained by executing the code above, is:

Name: 3-2-1 and Liftoff of GOES-O


Name: A Ghost Remains


Name: A Parting Look


Name: A Super-Efficient Particle Accelerator




This article showed how to scrape information from a web page whose content is partially generated by JavaScript code. Aside from the three steps (explore, download, and parse) explained in the previous article, the use of a tool capable of executing that code, or of interacting with a real web browser, is fundamental to obtain, at any moment, the real DOM tree with the information that is being displayed to the user.

In the example, Windmill is used successfully to:

  • Open the main page.
  • Perform a check that makes sure the JavaScript code has executed before any data is scraped.
  • Click on the next control, that is, navigate through all the content just as a human user would do.
  • Get the updated DOM tree.

This is simple but powerful functionality that can be used to scrape a large number of web pages. As in any scraping task, the only maintenance a script using this library needs is to keep track of the changes the page creator may introduce in the future to improve the page's look and feel.

About the Author :

Javier Collado is a software developer and a test design engineer with extensive experience in high availability telecommunications products. He also holds a position as an associate professor, which he enjoys a lot because it allows him to share and learn simultaneously.

Once a year, he takes a break and travels as far as possible to get to know different cultures.


