In the previous posts, I have shown ways of scraping web pages using Beautiful Soup. Beautiful Soup is a brilliant HTML parser and makes parsing HTML easy. I normally use the urllib2 module to open a URL, and then use this handle to create a Beautiful Soup object. The typical script I write looks like this:

import urllib2

from bs4 import BeautifulSoup

url = "http://www.regulations.gov/#!docketBrowser;rpp=50;po=0;dct=PS;D=OSHA-2013-0020"
page = urllib2.urlopen(url)   # open the URL
soup = BeautifulSoup(page)    # build the soup from the response

The above approach of opening the web page with urllib2 and then parsing it with BeautifulSoup works like a charm if the page content is plain HTML, with nothing loaded through JavaScript. But recently I came across this website (regulations.gov), which requires JavaScript. This meant that the normal way of opening a web page would not work. The website checks whether JavaScript is enabled and renders the page only after the JavaScript has run following the page load. When I opened the page with urllib2, the response contained only the message "You need to have javascript enabled to view this page". I was clueless about how to scrape this page. After a lot of searching, the main possibilities were Selenium and PhantomJS. PhantomJS impressed me because it is a headless browser, so I don't need extra drivers or web browsers installed, as is the case with Selenium. So this was the plan:

  1. Use PhantomJS to open the page.
  2. Save it as a local file using the PhantomJS File System module API.
  3. Use this local file to create a BeautifulSoup object and then parse the page.

PhantomJS Script to Load the Web Page

Below is the script I used to load the web page with PhantomJS and save it as a local file.

var page = require('webpage').create();
var fs = require('fs');                     // File System module
var system = require('system');             // needed for system.args
var args = system.args;                     // command-line arguments (not used here)
var output = './temp_htmls/test1.html';     // path for saving the local file
page.open('http://www.regulations.gov/#!docketBrowser;rpp=50;po=0;dct=PS;D=OSHA-2013-0020', function() { // open the page
  fs.write(output, page.content, 'w');      // write the rendered HTML (page.content) to the local file
  phantom.exit();                           // exit PhantomJS
});
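
To tie steps 1 and 2 into the Python workflow, the PhantomJS script can be launched from Python with the subprocess module. This is only a minimal sketch: it assumes PhantomJS is installed and on the PATH, and that the script above is saved under a hypothetical name such as load_page.js.

import subprocess

# Run the PhantomJS script shown above (assumed here to be saved as load_page.js;
# adjust the name/path to wherever you keep it). PhantomJS must be on the PATH.
subprocess.call(["phantomjs", "load_page.js"])

# Once the call returns, ./temp_htmls/test1.html should contain the
# JavaScript-rendered HTML, ready to be parsed with Beautiful Soup.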

With this script, PhantomJS opens the page and saves it locally. On inspecting the contents of the file, we can see that the JavaScript ran and there is no error message about JavaScript being required. We can now open the local file and scrape it with the code below.

from bs4 import BeautifulSoup

page_name = "./temp_htmls/test1.html"      # file saved by the PhantomJS script
local_page = open(page_name, "r")
soup = BeautifulSoup(local_page, "lxml")
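
As a quick sanity check (just an illustration; the tags you actually extract will depend on the page's markup), you can print a few basics from the soup object created above:

# Confirm that the JavaScript-rendered markup made it into the soup object.
print(soup.title)                 # title of the rendered page
print(len(soup.find_all("a")))    # number of links in the rendered HTML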

If you want to learn more through examples and sample code, I have authored a book on Beautiful Soup 4, and you can find more details here:

Getting Started with Beautiful Soup

Happy Scraping :-)

Packt have kindly provided 3 free eBook copies of my newly published book, Getting Started with Beautiful Soup.


How to Enter?

All you need to do is head over to the book page (http://www.packtpub.com/getting-started-with-beautiful-soup/book), look through the product description of the book, and drop a line via the comments below this post to let us know what interests you the most about this book. It's that simple.

The winners will be chosen from the comments below and contacted by email so please use a valid email address when you post your comment.

The contest ends when I run out of free copies to give away. Good luck!

Update 1 ( 5th April 2014 )


Packtpub has provided 2 more copies to give away, which means 5 copies can be won :-). The contest will end soon.

Update 2 (7th April 2014 )


The first winner is Cm. Congratulations! Someone from Packt will be in contact with you shortly.

Update 3 ( 11th April 2014)

The second winner is Elizabeth.  Congratulations! Someone from Packt will be in contact with you shortly.

Update 4 ( 16th April 2014)

The third winner is Joel.  Congratulations! Someone from Packt will be in contact with you shortly.

Update 5  ( 16th April 2014)

The fourth winner is Graham O' Malley. Congratulations! Someone from Packt will be in contact with you shortly.

Update 6  ( 21st April 2014)

The final winner is Nick Fiorentini. Congratulations! Someone from Packt will be in contact with you shortly.

The contest is now closed. Thanks, everyone, for participating.