Scraping information from the internet can be very handy, but the information you want is often buried somewhere within the HTML tags of a webpage.

Beautiful Soup makes it easy to pull this information out into a format that another program can use.

An Example

In this example we shall retrieve the Barclays Premier League table from the Premier League website.

To open this webpage we shall use urllib2, which is part of the Python 2 standard library.

import urllib2

webpage = urllib2.urlopen('http://www.premierleague.com/en-gb/matchday/league-table.html')
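On Python 3, where urllib2 is no longer available, the same function lives in urllib.request. The following is a minimal sketch of the equivalent step; the helper name fetch_html and the 10-second timeout are illustrative choices, not part of the original example.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the raw bytes of the page at `url` (hypothetical helper)."""
    # urlopen returns a file-like response; the with-block closes it for us.
    with urlopen(url, timeout=timeout) as response:
        return response.read()


# Usage (needs network access, so it is left commented out here):
# webpage = fetch_html('http://www.premierleague.com/en-gb/matchday/league-table.html')
```

The bytes returned by this helper can be handed to BeautifulSoup in the same way as the urllib2 response object.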

We can then “soup” up the webpage using Beautiful Soup.

from bs4 import BeautifulSoup

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(webpage, 'html.parser')

To retrieve the information we shall use the find function in Beautiful Soup, which returns the first element with the given tag name, in this case table.

table = soup.find('table')
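Because find stops at the first match, it helps to know the alternatives when a page holds more than one table. The sketch below uses a made-up HTML snippet (the ids first and second are invented for illustration) to contrast find, find_all, and a search narrowed by attribute.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two tables, purely for illustration.
html = """
<table id="first"><tr><td>A</td></tr></table>
<table id="second"><tr><td>B</td></tr></table>
"""
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('table')               # first match only, or None if absent
all_tables = soup.find_all('table')      # every match, in document order
by_id = soup.find('table', id='second')  # narrow the search by attribute

print(first['id'])          # first
print(len(all_tables))      # 2
print(by_id.td.get_text())  # B
```

If the league table were not the first table on the page, a narrowed search like the last one would be one way to pick it out, assuming the page gives it a usable attribute.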

Finally, to extract the information we make use of the tags tr and td, which mark the rows and the cells within each row respectively.

import re

string = ''
for row in table.find_all('tr'):
    for column in row.find_all('td'):
        string += "%s," % re.sub(r'\s+', ' ', column.get_text())
    string += '\n'

This leaves the required table in the variable string, with one line per row and a comma after each cell.
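Joining cells by hand leaves a trailing comma on each row, skips header cells marked with th, and breaks if a cell itself contains a comma. One possible alternative, sketched here for Python 3, is to let the standard csv module do the quoting. The helper name table_to_csv and the sample HTML are invented for illustration.

```python
import csv
import io
import re

from bs4 import BeautifulSoup

# Made-up two-team table, purely for illustration.
html = """
<table>
  <tr><th>Pos</th><th>Club</th><th>Pts</th></tr>
  <tr><td>1</td><td>Team   A</td><td>10</td></tr>
  <tr><td>2</td><td>Team B, FC</td><td>8</td></tr>
</table>
"""


def table_to_csv(table):
    """Serialise a <table> tag to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    for row in table.find_all('tr'):
        # Include <th> as well as <td> so the header row is not lost,
        # and collapse runs of whitespace inside each cell.
        cells = [re.sub(r'\s+', ' ', cell.get_text()).strip()
                 for cell in row.find_all(['td', 'th'])]
        if cells:  # skip rows that contain no cells at all
            writer.writerow(cells)
    return buffer.getvalue()


table = BeautifulSoup(html, 'html.parser').find('table')
print(table_to_csv(table))
```

The csv writer quotes the cell containing a comma, so the third line of output reads 2,"Team B, FC",8 and still parses back into three fields.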

And that’s it!


The full code is below:

from bs4 import BeautifulSoup
import urllib2
import re

webpage = urllib2.urlopen('http://www.premierleague.com/en-gb/matchday/league-table.html')
# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(webpage, 'html.parser')

# extract the first table on the page
table = soup.find('table')

# walk through each row of the table, then each cell (column) in that row;
# collapse whitespace in the cell text and follow each cell with a comma,
# then end each row with a newline

string = ''
for row in table.find_all('tr'):
    for column in row.find_all('td'):
        string += "%s," % re.sub(r'\s+', ' ', column.get_text())
    string += '\n'