
WebScraping with Python, Beautiful Soup and Urllib3


In this day and age, information is key. Through the internet we have a seemingly unlimited amount of information and data at our disposal. The problem, however, is that the sheer abundance of information can leave us, the users, overwhelmed. Fortunately, programmers can develop scripts that do the sorting, organizing, and extracting of that data for us. Work that would take hours to complete by hand can be accomplished with just over 50 lines of code and run in under a minute. Today, using Python, Beautiful Soup, and Urllib3, we will do a little web scraping and even scratch the surface of exporting the extracted data to a spreadsheet.

To start, we do a little research

The website that we will be working with is books.toscrape.com. It's one of those websites made specifically for practicing the art of web scraping. Before we begin, please understand that we won't be rotating our IP addresses or user agents. On other websites, though, that may be a good idea, since they will most likely block you if you're not "polite". I'll talk more about the concept of being polite in follow-on posts; for now, just know that it means spacing out the amount of time between your individual requests. What we will do is place our entire script into a function.

OK, let's take a look at our target.

Basically, what we want from this website is a list of every book title and price. We notice that the prices are in British pounds, so we'll want to convert them into US dollars. If we scroll to the bottom of the page, we notice that there are 50 pages' worth of books. Therefore, our script will have to iterate 50 times while altering the base URL each time. The URL changes by one number from page to page, so a simple for loop should do the trick.

Basic Setup with Urllib3 and Beautiful Soup

Here’s a breakdown of our tasks:

  • Import the required modules and create two master lists (titles and prices)
  • Using Urllib3 and Beautiful Soup, set up the environment to parse the first page
  • Collect every book title from the page and append it to the titles master list
  • Collect every book price from the page, convert it to USD, and append it to the prices master list
  • Convert both master lists into a single dictionary
  • Export to a CSV

Now that we have our outline, we can get to work. Since we'll be putting everything into a function, be mindful of your indentation. Let's begin!

First, let's import our modules and define our function.

import urllib3, re
from bs4 import BeautifulSoup
from csv import DictReader, DictWriter

#The file name will be whatever you decide when running the function
def get_book_data(filename):

    #These will be our Master Lists and must remain outside of any loops
    titles = []
    prices = []

Urllib3 is an HTTP client for Python. It's pretty versatile and perfect for what we need; for more information, check out the docs. Throughout most of your web scraping, there will be times when regex comes in handy. A prime example here is that all of the prices on the page have a pound symbol in front of the numbers, and one of the easiest ways to remove and replace that symbol is with a regular expression. Finally, since we want to write our information to a CSV via a dictionary, it only makes sense to use the csv module.
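To make that concrete, here is a tiny standalone sketch of removing the pound symbol from a single price string with a regular expression (the sample value is purely illustrative):

#Standalone sketch: strip the pound symbol from one scraped price string
import re

raw_price = '£51.77'                            #example of what a scraped price string looks like
number_only = re.sub('[^0-9.]', '', raw_price)  #drop everything except digits and the decimal point
print(number_only)         #'51.77'
print(float(number_only))  #51.77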

Next, we define our function as get_book_data and pass in the argument filename, which will be the name we choose for our CSV.

    #Convert British Pounds to USD (as of 20190801)
    def gbp_to_usd(amount):
        return f'$ {round((amount * 1.21255), 2)}'

As of August 1st, the conversion rate from British pounds to US dollars is 1.21255. By defining this function, we are able to call it later when the time comes. Then, by wrapping our calculation in the round function, we limit the number of digits after the decimal point to two.
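Here is a quick, self-contained sanity check of the conversion logic (the helper is copied out of the function just for this standalone test):

#Quick standalone test of the conversion helper
def gbp_to_usd(amount):
    return f'$ {round((amount * 1.21255), 2)}'

print(gbp_to_usd(51.77))  #'$ 62.77' (51.77 * 1.21255 = 62.7737..., rounded to two places)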

Web Scraping is just a fancy way to say Downloading Information

So far, so good. Now let's get into it.

    #Prepare to scrape all 50 pages
    for i in range(1,51):
        #All of the page URLs follow the same format, with the exception of the one number that follows 'page-'
        url = f'https://books.toscrape.com/catalogue/category/books_1/page-{i}.html'
        req = urllib3.PoolManager()
        res = req.request('GET', url)
        soup = BeautifulSoup(res.data, 'html.parser')
        contents = soup.find_all(class_='product_pod')

Due to there being 50 pages, our range will need to run from 1 to 51 in order to capture all of them. Our URL takes us to the first page, and with each iteration 1 is added to i, giving us a new URL each time.

The PoolManager object handles arbitrary requests while transparently keeping track of connection pools for us. The type of request that we are initiating against our URL is a GET request, which means all we want to do is retrieve data.
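Here is a minimal standalone sketch of that request cycle against the first catalogue page, with a status check added so we can confirm the fetch actually worked:

#Standalone sketch of a single urllib3 GET request
import urllib3

http = urllib3.PoolManager()
response = http.request('GET', 'https://books.toscrape.com/catalogue/category/books_1/page-1.html')

print(response.status)     #200 if the page was fetched successfully
print(len(response.data))  #size of the raw HTML payload in bytes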

Using Beautiful Soup, we parse the data from our response, specifically the HTML. If we take a look at the source code of the webpage, we'll notice that all of the products fall under the class product_pod. By calling the find_all method, we collect every HTML element with the class product_pod.
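Continuing from the request sketch above (so response is already defined), here is how the parsing step looks, plus a quick peek at the first product_pod block:

#Parse the HTML and grab the product_pod blocks (sketch, continues from the request above)
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.data, 'html.parser')
contents = soup.find_all(class_='product_pod')

print(len(contents))                                      #the site lists 20 books per page
first_book = contents[0]
print(first_book.h3.a['title'])                           #the full book title
print(first_book.find(class_='price_color').get_text())   #the price string, e.g. '£51.77'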

Scope is important in Python: Watch your indentations

        #Based off of the title parameter within the site html
        for i in soup.find_all(title=True):
            titles.append(i['title'])

If we look once again at the source data, we notice that the title appears twice: once as the inner text and once as the value of the title attribute. Normally, we would extract the inner text, but since the inner text cuts off most of the title, we have to extract the value from the a tag's title attribute instead.
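If you would rather stay inside the product_pod blocks we already collected in contents (instead of searching the whole soup), a sketch of the same idea looks like this:

        #Alternative sketch: pull the title attribute from each product_pod block instead
        for book in contents:
            titles.append(book.h3.a['title'])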

        #Temporary lists for British currency conversions
        pounds = []
        c = []
        for i in contents:
            c.append(i.find(class_='price_color').get_text())

        for number in c:
            #Strip the British pound symbol and join the numbers and decimal points back together
            amount = re.compile('[0-9]+.')
            num = amount.findall(number)
            pounds.append(float(''.join(num)))

Just like we did for the titles, we're going to extract all of the HTML that falls within the price_color class. Remember that all of the prices are in British pounds, so we need to remove the pound symbol and replace it with a dollar sign. Before that, we extract all of the numbers and decimal points from the gathered prices using regex. Once we have a list of numbers and decimal points, we join them and append the resulting floats to our temporary pounds list. Note that the second loop runs after we have collected every price string from the page; nesting it inside the first loop would process the same prices over and over.
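To see exactly what that regex produces, here is a quick standalone run on a single price string:

#Standalone sketch of the regex step on one price string
import re

amount = re.compile('[0-9]+.')
num = amount.findall('£51.77')
print(num)                  #['51.', '77']
print(float(''.join(num)))  #51.77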

        #Create a temporary list for the current loop and append to the master list after we run the conversion function
        temp = list(map(gbp_to_usd, pounds))
        for t in temp:
            prices.append(t)

    #Combine both lists into a dictionary
    res = dict(zip(titles,prices))

Yet again, we create another temporary list and call our conversion function on the pounds list we just built. Finally, we run a quick for loop and append those items to our master prices list.

Now that we have both master lists, we're able to create a dictionary using the zip function. Make sure that this is outside of any loops; otherwise it won't work as intended.
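Here is a toy, self-contained example of that zip step (the titles and prices below are made up, just to show the shape of the result):

#Toy sketch of combining two lists into a dictionary with zip
titles = ['Book A', 'Book B']
prices = ['$ 12.13', '$ 24.25']

print(dict(zip(titles, prices)))  #{'Book A': '$ 12.13', 'Book B': '$ 24.25'}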

Conclusion

    #Write the dictionary to a CSV file (which opens nicely in Excel)
    with open(filename, 'w', newline='') as file:
        headers = ('Book Title', 'Price (in usd)')
        csv_writer = DictWriter(file, fieldnames=headers)
        csv_writer.writeheader()
        for k, v in res.items():
            csv_writer.writerow({
                'Book Title': k,
                'Price (in usd)': v
            })

get_book_data('BookResearch_Beta.csv')

We open the specified filename with the intent to write. In Python 3, when writing a CSV this way, we have to include the argument newline=''; otherwise the spreadsheet will have blank lines between each entry. We then run a simple iteration through the dictionary, making sure our keys match the headers we specified. Finally, we execute get_book_data with our file name and extension, wait about 20-30 seconds, and boom: we have a CSV file, ready to open in Excel, with all of the titles and prices from our target website!
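As an optional sanity check, you can read the CSV straight back with DictReader and print the first few rows (a small standalone sketch using the same filename we passed in above):

#Optional sanity check: read the CSV back and print the first few rows
from csv import DictReader

with open('BookResearch_Beta.csv', newline='') as file:
    for row in list(DictReader(file))[:3]:
        print(row['Book Title'], row['Price (in usd)'])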

If you found value in this article, please share it and leave a comment below. Let me know what projects you are working on or would like to see in the future. Be sure to follow me on Facebook, Twitter, and Instagram. You can find all of the source code on my GitHub page. Until next time!

PyTechDrae
