Packt have kindly provided 3 free eBook copies of my newly published book, Getting Started with Beautiful Soup.

Getting Started with Beautiful Soup

How to Enter?

All you need to do is head over to the book page (http://www.packtpub.com/getting-started-with-beautiful-soup/book), look through the product description, and drop a line in the comments below this post to let us know what interests you the most about the book. It's that simple.

The winners will be chosen from the comments below and contacted by email so please use a valid email address when you post your comment.

The contest ends when I run out of free copies to give away. Good luck!

Update 1 (5th April 2014)

Packt has kindly provided 2 more copies to give away, which means 5 copies to be won :-). The contest will end soon.

Update 2 (7th April 2014)

The first winner is Cm. Congratulations! Someone from Packt will be in contact with you shortly.

Update 3 (11th April 2014)

The second winner is Elizabeth. Congratulations! Someone from Packt will be in contact with you shortly.

Update 4 (16th April 2014)

The third winner is Joel. Congratulations! Someone from Packt will be in contact with you shortly.

Update 5 (16th April 2014)

The fourth winner is Graham O’ Malley. Congratulations! Someone from Packt will be in contact with you shortly.

I was thinking about porting my blog post on scraping a website with Python 2.7 and Beautiful Soup 3 to Python 3 and Beautiful Soup 4. Thanks to Steve for his code, which made it easy for me. In this blog post I will scrape the same website using Beautiful Soup 4.

Task: Extract all U.S. university names and URLs from the University of Texas website in CSV (comma-separated values) format.
Dependencies: Python 3 and Beautiful Soup 4

Script with Explanation

  • Importing Beautiful Soup 4
from bs4 import BeautifulSoup

This is a major difference between Beautiful Soup 3 and 4. In Beautiful Soup 3 it was just

from BeautifulSoup import BeautifulSoup

But in Beautiful Soup 4 the import is entirely different: the package is now named bs4.

Next, we import the urllib.request module for opening the URL.

import urllib.request

We now need to open the page at the above URL.

url="http://www.utexas.edu/world/univ/alpha/"
page = urllib.request.urlopen(url)
  • Creating the Soup
soup = BeautifulSoup(page.read())
  • Finding the pattern in the page

Web scraping is effective only if we can find patterns in how a website marks up its content. For example, on the University of Texas website, if you view the source of the page you can see that all university names share a common format, as shown in the screenshot below.

[Screenshot: View Source]

From the pattern we can see that every university is inside an <a> tag with the CSS class institution. So we need to find all the <a> tags whose class is institution. We can use Beautiful Soup 4's find_all() method to accomplish this.

universities = soup.find_all('a', class_='institution')

In the above line, we used Beautiful Soup 4's find_all() method to find all the <a> tags with class institution. In Beautiful Soup 4 we can use the keyword argument class_ to search based on CSS classes (class by itself is a reserved word in Python). We can iterate over each university using the code below.

for university in universities:
    print(university['href'] + "," + university.string)

On this page each university name is stored as the string of its <a> tag, and the URL is stored in the href attribute. So in the above code, university.string gives each university name and university['href'] gives the university URL.
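Note that tag.string returns None when a tag contains anything more than a single string, so if any of these links wrapped their text in extra markup, the more forgiving get_text() method would be the safer call. A minimal variant of the loop using it:

for university in universities:
    # get_text() concatenates all text inside the tag, unlike .string
    print(university['href'] + "," + university.get_text())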

  • Putting it all together

The complete script for scraping the University of Texas website using Beautiful Soup 4 is below.

from bs4 import BeautifulSoup
import urllib.request

url = "http://www.utexas.edu/world/univ/alpha/"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read(), 'html.parser')  # explicit parser; otherwise BS4 picks the best one installed
universities = soup.find_all('a', class_='institution')
for university in universities:
    print(university['href'] + "," + university.string)
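Since the task asks for CSV output, here is a minimal sketch that writes the rows to a file with Python's csv module instead of printing them; the output file name universities.csv is an arbitrary choice for this example.

import csv
import urllib.request
from bs4 import BeautifulSoup

url = "http://www.utexas.edu/world/univ/alpha/"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read(), 'html.parser')

# Write one "url,name" row per university; the csv module handles any quoting.
with open('universities.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for university in soup.find_all('a', class_='institution'):
        writer.writerow([university['href'], university.string])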

If you want to learn and understand more examples and sample code, I have authored a book on Beautiful Soup 4 and you can find more details here:

Getting Started with Beautiful Soup

Update 1:

You have a chance to win free copies of Getting Started with Beautiful Soup by Packt Publishing. Please follow the instructions in this post.

Happy Scraping :-)

In this tutorial we will look at the simple task of reading an Excel spreadsheet using Python. It is very easy to do by following the steps below.

  1. Installing the required library.
    xlrd is used for reading Excel sheets. It can be easily installed on both Linux and Windows machines by following the steps below.

    • On Linux machines

    Open a terminal and install using the command below.

    sudo pip install xlrd

    On Windows, the same pip install xlrd command works from a command prompt.
  2. Opening an Excel spreadsheet
    import xlrd
    excel_sheet = xlrd.open_workbook('workbook.xls')

    Here we used the open_workbook method to open the Excel spreadsheet workbook.xls.

  3. Opening a worksheet of an Excel spreadsheet
    sheet1 = excel_sheet.sheet_by_name('Sheet1')

    Here we used the sheet_by_name method to open the specific sheet named Sheet1. The names of all sheets within a spreadsheet can be listed using,

    print excel_sheet.sheet_names()
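    A sheet can also be opened by its zero-based position instead of its name, using xlrd's sheet_by_index (a small sketch; the variable name first_sheet is just for illustration):

    first_sheet = excel_sheet.sheet_by_index(0)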
  4. Reading row by row in a worksheet
    We can find the number of rows in the sheet using,

    sheet1 = excel_sheet.sheet_by_name('Sheet1')
    max_rows = sheet1.nrows

    Here we used sheet1.nrows to get the number of rows in the sheet. To read a row we can use the corresponding row index. For example,

    row = sheet1.row(0)
    print row

    will give us a list of xlrd.sheet.Cell objects and the result is similar to the output below.

    [empty:'', text:u'OWP :: 28 Oct - 1 Nov 2013', Number:5441, Date:39682, empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'', empty:'']

    first_cell = row[0]
    print type(first_cell)

    From this we can understand that first_cell is an xlrd.sheet.Cell object with cell type text and value u'OWP :: 28 Oct - 1 Nov 2013'.

    We can get the cell type using

    print first_cell.ctype

    Or

    print sheet1.cell_type(row_index,col_index)

    The different cell types for a Cell object are,

    0 denoting the cell value is Empty 
    1 denoting the cell value is of type Text
    2 denoting the cell value is of type Number
    3 denoting the cell value is of type Date
    4 denoting the cell value is of type Boolean
    5 denoting the cell value is of type Error
    and 6 denoting the cell value is Blank
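
    xlrd also exposes these type codes as module-level constants (XL_CELL_EMPTY, XL_CELL_TEXT, and so on), so comparisons can avoid bare numbers. A small sketch assuming the first_cell object from above:

    import xlrd
    if first_cell.ctype == xlrd.XL_CELL_TEXT:
        print 'first cell holds text'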

    We can read each row in the sheet by using the sample code below.

    curr_row_index = 0
    while curr_row_index < max_rows:
        row = sheet1.row(curr_row_index)
        curr_row_index += 1
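
    More idiomatically, the same row traversal can be written as a for loop over range (a sketch assuming the sheet1 object from above):

    for row_index in range(sheet1.nrows):
        row = sheet1.row(row_index)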
  5. Reading cell by cell in a worksheet

Just as we read rows, we can read a sheet cell by cell. For that we need the column index as well as the row index. As we saw above, to get the cell value type at index (0,0) we use,

cell_type = sheet1.cell_type(0, 0)

print cell_type

This will print one of the values from 0-6, denoting the type of value stored in the cell. Similarly, to get the actual value stored in the cell we can use,

value = sheet1.cell_value(0, 0)

print value

To get every cell value in a sheet we need the maximum number of rows as well as the maximum number of columns. The maximum number of columns in a sheet can be found using

max_cols = sheet1.ncols

So we can read each cell value using the code below,

curr_row_index = 0
while curr_row_index < max_rows:
    curr_col_index = 0
    while curr_col_index < max_cols:
        # cell_value takes a row index and a column index
        value = sheet1.cell_value(curr_row_index, curr_col_index)
        curr_col_index += 1  # advance the column, otherwise the inner loop never ends
    curr_row_index += 1
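
As a more compact alternative, the same cell-by-cell traversal can be written with for loops over range (a sketch assuming the same sheet1, max_rows and max_cols as above):

for row_index in range(max_rows):
    for col_index in range(max_cols):
        value = sheet1.cell_value(row_index, col_index)
        print value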

So next time, use Python when you want to deal with Excel spreadsheets :-)