Before reading the post below: if you want to learn more and see further examples and sample code, I have authored a book on Beautiful Soup 4. You can find more details here:

Getting Started with Beautiful Soup

Update 1: 

You have a chance to win free copies of Getting Started with Beautiful Soup by Packt Publishing. Please follow the instructions in this post.

Web scraping is one of the easiest jobs you can find. It is quite useful because even if you don’t have access to a website’s database, you can still get the data out of the site by scraping it. For web data extraction, we create a script that visits the website and extracts the data we want, without any extra authentication, and with such scripts it is easy to get a lot of data from these sites in very little time.
I have always relied on Python for tasks like this, and here too there is a good third-party library: Beautiful Soup. The official site itself has good, clearly written documentation. For those who don’t want to read that lengthy documentation and just want to try something with Beautiful Soup and Python, here is a simple script with an explanation.
Task: extract all U.S. university names and URLs from the University of Texas website in CSV (comma-separated values) format.
Dependencies: Python and Beautiful Soup

Script with explanation:

from BeautifulSoup import BeautifulSoup
import urllib2

 

We use urllib2 to open the URL. Before we proceed further, we should know this: web scraping is effective only if we can find patterns the website uses to mark up its content. For example, on the University of Texas website, if you view the source of the page you can see that all university names follow a common format, as shown in the screenshot below.

[Screenshot: view source of the University of Texas page — patterns found in the page. Each university entry is an anchor tag of the form <a class="institution" href="...">University name</a>.]

url="http://www.utexas.edu/world/univ/alpha/"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

Here we open the University of Texas page using urllib2.urlopen(url) and create a BeautifulSoup object with soup = BeautifulSoup(page.read()). Now we can manipulate the web page using the methods of the soup object.
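You don’t need a live page to experiment with the soup object — BeautifulSoup will parse any HTML string you hand it. A minimal sketch, written in modern Python 3 / bs4 syntax (the HTML fragment here is made up for illustration):

```python
from bs4 import BeautifulSoup

# A made-up fragment, just to show the soup object's methods.
html = ('<html><head><title>Demo</title></head><body>'
        '<a class="institution" href="http://example.edu/">Example University</a>'
        '</body></html>')
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)           # -> Demo
link = soup.find("a", {"class": "institution"})
print(link["href"], link.string)   # -> http://example.edu/ Example University
```

Once this works on a small string, pointing the same calls at page.read() is the only change needed.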

universities=soup.findAll('a',{'class':'institution'})

Here we use the findAll method, which searches through the soup object to match text, HTML tags, their attributes, or anything else within the page. We know that each university name and URL follows a pattern: an ‘a’ tag carrying the CSS class ‘institution’.
That is why we use soup.findAll('a', {'class': 'institution'}). The CSS class institution filters the search; if we simply called findAll('a'), the script would return every link on the page. We could have done the same thing with a regular expression, but BeautifulSoup is better than a regexp in this case.
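To see why, here is roughly what the regular-expression route would look like — a sketch in Python 3, run against a made-up fragment mirroring the page’s markup. The pattern has to hard-code the attribute order, quoting style, and whitespace, all of which BeautifulSoup handles for us:

```python
import re

# Made-up sample of the markup pattern described above.
html = '''
<a class="institution" href="http://www.abilene.edu/">Abilene Christian University</a>
<a class="institution" href="http://www.acu.edu/">Other University</a>
<a href="http://example.com/">Unrelated link</a>
'''

# The regex breaks as soon as the page reorders attributes or
# switches quote styles; findAll does not care about either.
pattern = re.compile(r'<a\s+class="institution"\s+href="([^"]+)">([^<]+)</a>')
for href, name in pattern.findall(html):
    print(href + "," + name)
```

Note that the unrelated link is skipped, just as the class filter in findAll skips it.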

for eachuniversity in universities:
    print eachuniversity['href']+","+eachuniversity.string

Here we traverse the list of universities. During each iteration of the loop, eachuniversity['href'] gives us the link to the university, because in the pattern we saw earlier the link is held in the ‘a’ tag’s href attribute, and the name of the university is the string inside the ‘a’ tag, which is why we use eachuniversity.string.

from BeautifulSoup import BeautifulSoup
import urllib2
url="http://www.utexas.edu/world/univ/alpha/"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
universities=soup.findAll('a',{'class':'institution'})
for eachuniversity in universities:
   print eachuniversity['href']+","+eachuniversity.string

Download Beautiful Soup, install it, and run this script; you should see around 2128 U.S. university names along with their URLs.
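To get a real CSV file instead of comma-joined print output (which would break if a university name itself contained a comma), Python’s standard csv module quotes fields properly. A sketch in Python 3 syntax, assuming the scraped results have been collected into a list of (url, name) tuples — the sample data here is made up:

```python
import csv

# Made-up sample of the (url, name) pairs the scraper collects.
universities = [
    ("http://www.abilene.edu/", "Abilene Christian University"),
    ("http://www.example.edu/", "Example A&M, Main Campus"),
]

with open("universities.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "name"])   # header row
    writer.writerows(universities)     # one row per university
```

The second sample name contains a comma on purpose: csv.writer wraps that field in quotes, so the file still parses into exactly two columns.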

Posted in Expert Advice.

33 thoughts on “Let’s scrape the page, using Python Beautiful Soup”

  1. Good work buddy..
    very concise and great content for neophytes like me…
    although to import Beautifulsoup you may want to edit the command published in this web page..

    from bs4 import BeautifulSoup

    Reply
  2. I get following error after typing the page statement. Please let me know the solution to the issue.

    >>> from BeautifulSoup import BeautifulSoup
    >>> import urllib2
    >>> url="http://www.texas.edu/world/univ/alpha/"
    >>> page=urllib2.urlopen(url)

    Traceback (most recent call last):
    File "", line 1, in
    page=urllib2.urlopen(url)
    File "C:\Users\us67771\Python27\lib\urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
    File "C:\Users\us67771\Python27\lib\urllib2.py", line 400, in open
    response = self._open(req, data)
    File "C:\Users\us67771\Python27\lib\urllib2.py", line 418, in _open
    '_open', req)
    File "C:\Users\us67771\Python27\lib\urllib2.py", line 378, in _call_chain
    result = func(*args)
    File "C:\Users\us67771\Python27\lib\urllib2.py", line 1207, in http_open
    return self.do_open(httplib.HTTPConnection, req)
    File "C:\Users\us67771\Python27\lib\urllib2.py", line 1177, in do_open
    raise URLError(err)
    URLError:

    Reply
  3. The syntax error is likely due to the last line not being indented if you just c+p the above code.

    Needs to be.

    for eachuniversity in universities:
        print eachuniversity['href']+","+eachuniversity.string

    Reply
    • My indent didn’t work then. I’ll try again. Ignore the dots before print

      for eachuniversity in universities:
      ……print eachuniversity['href']+","+eachuniversity.string

      Reply
  4. Encountered this error:

    UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position

    What is it?
    How do I avoid it?

    Reply
  5. Thank you, I found this extremely useful and straightforward. For my code, I eventually added a try/except clause in there in case the page I was looking for didn’t exist:

    try:
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    /…other code…/
    except (urllib2.HTTPError):
    /…more code…/

    Reply
  6. Hello and thanks for the tutorial. When I essentially just copied and pasted your code into my IDE, it worked. Also I was able to follow along. However, when I tried to modify it for my own purposes I ran into trouble, maybe you can help? I’m trying to scrape all the comments from articles on various websites. So, taking this very website as an example:

    I noticed that the comments appear under the following tag:

    And that is about as far as I got, which isn’t very far! My code looked like this:

    from BeautifulSoup import BeautifulSoup
    import urllib2
    url="http://kochi-coders.com/2011/05/30/lets-scrape-the-page-using-python-beautifulsoup/"
    page=urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    replies=soup.findAll('a',{'class':'comment'})
    for eachcomment in replies:
        print eachcomment['span7']+","+eachcomment.string

    And it did nothing. Anyone want to give me any hints? Thanks!!

    Reply
    • Hi Kristina,
      In this page comments appear under the div tag with class "comment-content span7". Try replies=soup.findAll('div',{'class':'comment-content'})
      for eachcomment in replies:
      print eachcomment

      Reply
  7. I am getting the following error. I am a newbie.

    print eachuniversity['href']+","+eachuniversity.string
    ^
    IndentationError: expected an indented block

    Reply
  8. Pingback: How to extract the critical information from an html file with BeautifulSoup? | CopyQuery

  9. Pingback: My Bookkochi-coders.com | kochi-coders.com

  10. Great job! I converted the code over to Python 3.3 ( and BeautifulSoup BS4):

    import urllib.request
    import urllib.parse
    from bs4 import BeautifulSoup
    url="http://www.utexas.edu/world/univ/alpha/"
    page = urllib.request.urlopen(url)
    
    soup = BeautifulSoup(page.read())
    print("soup.title ", soup.title)
    
    universities=soup.findAll('a',{'class':'institution'})
    for school in universities:
        print(school['href']+","+school.string)
    
    Reply
  11. Pingback: Python Beautiful Soup 4 Example - Data extraction from website using Beautiful Soup4 | kochi-coders.com

  12. I enjoy what you guys tend to be up too. This type
    of clever work and coverage! Keep up the awesome works guys I’ve incorporated you guys to my
    own blogroll.

    Reply
  13. Nice article. Recently I came across some information about Python libraries for interacting with the web: 1. urllib2 and 2. requests. You can find more information on the Python website.

    Reply
  14. I am getting the desired result but I need to know about writing the output data to a CSV or JSON file for future use.

    Reply
