Build a Web Scraper with Python in 5 Minutes

14.02.22

In this article, I will show you how to create a web scraper from scratch in Python.

Data scientists are often expected to collect large amounts of data to derive business value for their organizations. Unfortunately, this skill is often overlooked, as most data science courses don't teach you how to collect external data. Instead, a lot of emphasis is placed on model building and training.
If you aren’t already familiar with the term, a web scraper is an automated tool that can extract large amounts of data from sites. You can collect up to hundreds of thousands of data points in just a few minutes with the help of web scraping.
We will scrape a site called “Quotes to Scrape” in this tutorial. This is a site that was specifically built to practice web scraping.
Before we start building the scraper, make sure you have the following libraries installed: pandas, BeautifulSoup, and requests.
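If you don't have them yet, all three can be installed from PyPI. Note that BeautifulSoup is published under the package name beautifulsoup4:

```shell
pip install pandas beautifulsoup4 requests
```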
Once that’s done, let’s take a look at the site we want to scrape, and decide on the data points to extract from it.
This site consists of a list of quotes by prominent figures. There are three main bits of information displayed for each entry: the quote, its author, and a few tags associated with it.
The site has ten pages, and we will scrape all the information available on it.
Let’s start by importing the following libraries:
import pandas as pd
from bs4 import BeautifulSoup
import requests
Then, using the requests library, we will get the page we want to scrape and extract its HTML:
f = requests.get('http://quotes.toscrape.com/')
Next, we will pass the site’s HTML text to BeautifulSoup, which will parse this raw data so it can be easily scraped:
soup = BeautifulSoup(f.text, 'html.parser')
All of the page's data is now stored in the soup object. We can run BeautifulSoup's built-in methods on this object to extract the data we want.
For example, if we wanted to extract all of the text available on the web page, we can easily do it with the following lines of code:
print(soup.get_text())
You should see all of the site’s text appear on your screen.
Now, let's start by scraping the quotes listed on the site. Right-click on any of the quotes and select "Inspect Element." The Chrome Developer Tools will appear on your screen.
BeautifulSoup has methods like find() and find_all() that you can use to extract specific HTML tags from the web page.
In this case, notice that a <span> tag with the class text is highlighted. This is because you right-clicked on one of the quotes on the page, and every quote is wrapped in a tag of this class.
We need to extract all the data in this class:
for i in soup.find_all("div", {"class": "quote"}):
    print(i.find("span", {"class": "text"}).text)
Once you run the code above, you should see the quotes printed to your screen.
There are ten quotes on the web page, and our scraper has successfully collected all of them. Great!
Now, we will scrape author names using the same approach.
If you right-click on any of the author names on the web page and click Inspect, you will see that they are contained within the <small> tag, and the class name is author.
We will use the find() and find_all() methods to extract all the author names within this tag.
for i in soup.find_all("div", {"class": "quote"}):
    print(i.find("small", {"class": "author"}).text)
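The two loops above can be combined into a single pass that also grabs each quote's tags, with the results collected into a pandas DataFrame (which is why we imported pandas earlier). The sketch below parses a small inline HTML snippet that mirrors the site's markup, so the parsing logic is visible without a network request; to run it against the real site, replace sample_html with f.text, and loop over http://quotes.toscrape.com/page/1/ through page/10/ to cover all ten pages.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Inline HTML mimicking the structure of quotes.toscrape.com,
# so the parsing logic can be demonstrated without a network request.
sample_html = """
<div class="quote">
  <span class="text">"Be yourself; everyone else is already taken."</span>
  <small class="author">Oscar Wilde</small>
  <div class="tags">
    <a class="tag">be-yourself</a>
    <a class="tag">honesty</a>
  </div>
</div>
<div class="quote">
  <span class="text">"A day without sunshine is like, you know, night."</span>
  <small class="author">Steve Martin</small>
  <div class="tags">
    <a class="tag">humor</a>
  </div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Collect one row per quote: text, author, and the list of tags.
rows = []
for quote in soup.find_all("div", {"class": "quote"}):
    rows.append({
        "quote": quote.find("span", {"class": "text"}).text,
        "author": quote.find("small", {"class": "author"}).text,
        "tags": [t.text for t in quote.find_all("a", {"class": "tag"})],
    })

df = pd.DataFrame(rows)
print(df)
```

Because each row is built inside the same div.quote loop, the quote, author, and tags always stay matched up, which is safer than running three separate find_all() calls over the whole page.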