An Introduction To Web Scraping

Amogh | Aug 21, 2024

I recently started seeing a lot of posts about illegal web scraping by AI companies, all thanks to the frequency illusion. In this blog, we’ll discuss the basics of web scraping and learn how to scrape things from the web.

Legality

Before jumping straight into the weeds, let’s address the elephant in the room: “Is it legal to scrape the web?” Well, it depends on whom we ask. In a broader sense, whatever we find on the web is either copyrighted or copylefted. Some things may be free to use, but with restrictions. So it’s better to read the terms and conditions of a particular website (a tedious process, for sure!) before we venture out.

From a technical standpoint, anyone can write 10 lines of HTML, put up some information, and then scrape that information back out with a few tools by referring to the HTML tags. Sounds fun, right? Don’t worry, it’s all covered in the next sections.

Web scraping is a gray area; it depends on a lot of things, such as context, scale, licensing, and permission. Follow these three golden rules before you start downloading cat pictures from the Internet!

Check the robots.txt: It’s a text file that lives at the root of a website, for example https://google.com/robots.txt, and it tells web crawlers and bots which pages on the site should or should not be accessed. Now go have fun appending /robots.txt to every website you come across to check what is allowed and what is not.

Terms and conditions: As mentioned above, please check the terms and conditions, privacy policy, and so on.

Rate limiting: Don’t overload servers with your requests; consider adding a delay between them. In other words, if you find a website you can scrape, don’t abuse it by hammering it with continuous requests (like refreshing the page over and over just to get new dad jokes). There’s a small sketch right after these rules that puts the first and third points into practice.
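To make the first and third rules concrete, here’s a minimal sketch that uses only Python’s standard library. The example.com URLs are placeholders, so swap in the site you actually care about and tune the delay to whatever its policy asks for.

import time
import urllib.robotparser

# Point the parser at the site's robots.txt (example.com is a placeholder).
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

# Ask whether a generic client ('*') may fetch a couple of paths.
for page in ['https://example.com/', 'https://example.com/secret-page']:
    allowed = rp.can_fetch('*', page)
    print(page, '->', 'allowed' if allowed else 'disallowed by robots.txt')
    time.sleep(2)  # rule 3: keep a delay like this between any real requests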

Web Scraping

Web scraping is hard. Not because it requires extensive knowledge of technologies, but because of the ever-changing, dynamic nature of the web. If we write a web scraper for a website today, it might not work three years from now. Why? Because over time the website might change in terms of technology, business logic, framework, and so on. Generally, companies don’t like others scraping their sites, and they employ a lot of methods to block scraping.

Lucky are the ones who can still scrape websites that were made 10 years ago. Thanks to the people who haven’t changed their code in a very long time!

There are ways to create scrapers that last a long time, but there is no such thing as a universal scraper, at least not one that never needs changes.

If you have heard about web scraping before, there is a strong chance you have also heard about ‘bots’ doing all the magic. Well, in this blog, we are not going to use any ‘bots’ to scrape the web; we’ll use Python.

requests

I used the term ‘sending requests’ in the previous sections. Call it a happy accident that Python has an HTTP client library called Requests. It lets us perform the basic CRUD (Create, Read, Update, Delete) operations that are fundamental to web development and database management. If you don’t have it installed, you can do so with pip3 install requests. In this blog, we’ll use https://jsonplaceholder.typicode.com/ to understand how requests work. This website is handy when we need some fake data to test things and play around.

  1. Create - As the name suggests, we are going to CREATE something. In the language of HTTP, we call it POST. When we POST something, it gets CREATED in a database or on a server.
import requests

url = 'https://jsonplaceholder.typicode.com/posts'
data = {
    'title': 'Inception',
    'body': 'dream on',
    'userId': 1
}
response = requests.post(url, json=data)

print("Create Response:")
print(response.json())
print("Response code:", response.status_code)

In the above code, we import requests and use it to post (i.e., requests.post(url, json=data)) some data to the JSONPlaceholder website, then print the result. The response code indicates whether our request was successfully fulfilled; we are good to go if we get 200 or 201.

#OUTPUT

Create Response:
{'title': 'Inception', 'body': 'dream on', 'userId': 1, 'id': 101}
Response code: 201
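
As a side note, rather than eyeballing the number, we can let requests flag failures for us; response.ok and raise_for_status() are built into the library:

# response.ok is True for any status code below 400.
print("OK?", response.ok)

# raise_for_status() raises requests.exceptions.HTTPError on 4xx/5xx responses,
# which is handy inside larger scripts.
response.raise_for_status()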
  2. Read - Read is technically a GET request, and with it we can do something fancy. Let’s retrieve our IP address and other juicy details about our location.
import requests

response = requests.get('http://ipinfo.io/')
print("Response Text:", a.text)
print("Response code:", response.status_code)

In the above code, we get information (i.e., requests.get('http://ipinfo.io/')) about our public IP and location from ipinfo.io.

#OUTPUT

Response Text: {
  "ip": "123.456.789.42",
  "city": "Area 52",
  "region": "Mars",
  "country": "No Country For Old Men",
  "loc": "370.4242,370.4242",
  "org": "Orange Is The Right Spelling",
  "postal": "424242",
  "timezone": "Mars/dark region",
  "readme": "https://ipinfo.io/missingauth"
}
Response code: 200
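
GET requests can also carry query parameters. If we pass a params dictionary, requests builds the query string for us; JSONPlaceholder, for instance, lets us filter posts by user. A small sketch:

import requests

# requests turns the params dict into ?userId=1 on the URL for us
response = requests.get('https://jsonplaceholder.typicode.com/posts',
                        params={'userId': 1})

print("Final URL:", response.url)
print("Posts by user 1:", len(response.json()))
print("Response code:", response.status_code)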
  3. Update/Patch - With Update, i.e., a PUT request, we replace an entire resource, whereas PATCH updates only the fields we send. Let’s try both on the post with id 1.
import requests

url = 'https://jsonplaceholder.typicode.com/posts/1'
data = {
    'title': 'Insomnia',
    'body': 'No sleeep to dream',
    'userId': 1
}
response = requests.put(url, json=data)

print("Update Response:")
print(response.json())
print("Response code:", response.status_code)
#OUTPUT

Update Response:
{'title': 'Insomnia', 'body': 'No sleeep to dream', 'userId': 1, 'id': 1}
Response code: 200
import requests

url = 'https://jsonplaceholder.typicode.com/posts/1'
data = {
    'title': 'Batman',
    'body': 'Vigilante'
}
response = requests.patch(url, json=data)

print("Partial Update Response:")
print(response.json())
print("Response code:", response.status_code)
#OUTPUT

Partial Update Response:
{'userId': 1, 'id': 1, 'title': 'Batman', 'body': 'Vigilante'}
Response code: 200

In the above two code snippets, we update (i.e., requests.put(url, json=data)) and patch (i.e., requests.patch(url, json=data)) some data.

  4. Delete - Finally, a DELETE request removes a resource.
import requests

url = 'https://jsonplaceholder.typicode.com/posts/1'
response = requests.delete(url)

print("Delete Response:")
print(response.status_code)
#OUTPUT

Delete Response:
200

In the above code snippet, we delete (i.e., requests.delete(url)) the data.

If you are a curious kid, you might wonder: it’s easy to understand how GET works, since it just fetches data that’s already there, but what about the other operations? Are we really creating and deleting stuff on the website? Well, no. JSONPlaceholder simulates the behavior of an actual API. It returns a successful response to indicate that our request was processed correctly, even though no actual changes are made to its database.
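
An easy way to convince ourselves: delete a post and then fetch it again. The DELETE appears to succeed, yet the post is still there on the next GET, because nothing was actually persisted.

import requests

url = 'https://jsonplaceholder.typicode.com/posts/1'

# The delete 'succeeds'...
print("Delete code:", requests.delete(url).status_code)

# ...but the post is still there on a fresh GET.
print("Still there:", requests.get(url).json()['title'])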

Now we’ll move on to the cool stuff. As mentioned earlier, web scraping is mostly just pulling out data associated with HTML tags. Let’s prove that. Copy the code below and save it as scrape.html.

Caution: Don’t follow the coding practices demonstrated in the snippet below; it’s just for show. Sprinkling <p> tags, classes, and <div> tags around like this, please no!

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Web Scraping</title>
</head>
<body>
    <p class="scrape"><span>This is a scrapable site</span></p> <br>
    <strong>Believe it or not</strong> <br>
    Click <a href="https://chatgpt.com">here</a> to scrape the internet. <br>
    <div id="quote">Knowledge is power</div> <br>
    <p><em>Eminem</em> is emphasized</p>
    <a href="mailto:[email protected]">Email</a>   
</body>
</html>

Now for the magical stuff. We’ll use a library named Beautiful Soup for pulling data out of HTML and XML files. It can be installed with pip3 install beautifulsoup4. We’ll also use lxml, which parses the HTML for us; it can be installed with pip3 install lxml (if it’s not installed already). Save the file below as soup1.py.

import bs4
 
# Parse the local HTML file with the lxml parser
myfile = open('scrape.html')
soup = bs4.BeautifulSoup(myfile, "lxml")

# Pull out elements by tag, CSS id, and tag name, then print their contents
print("ChatGPT:", soup.find_all('a')[0]['href'])
print("Quote of the day:", soup.select('#quote')[0].getText())
print("Span content:", soup.select('span')[0].getText())
print("Bold content:", soup.select('strong')[0].getText())
print("Emphasized content:", soup.select('em')[0].getText())

# Looping through the <a> tags to find the one with mailto: link
email = None
for tag in soup.find_all('a', href=True):
    if tag['href'].startswith('mailto:'):
        email = tag['href'].replace('mailto:', '')
        break

if email:
    print("Email:", email)
else:
    print("Email not found")
#OUTPUT

ChatGPT: https://chatgpt.com
Quote of the day: Knowledge is power
Span content: This is a scrapable site
Bold content: Believe it or not
Emphasized content: Eminem
Email: [email protected]

In the above code, we first open our file scrape.html, then create a soup object that parses the HTML document into a structured format we can navigate and query easily.

First, we find the first <a> tag in the HTML and extract its href attribute.

Next, we find the #quote, span, strong, and em elements and print their content.

Finally, we grab the email by looping through the <a> tags to find the one whose href starts with mailto:, then we extract the address by stripping the mailto: prefix.
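
As a side note, Beautiful Soup can also take a function as an attribute filter, so the loop above collapses into a single find() call; here’s a small sketch of that:

# find() accepts a function for an attribute value; this matches the first <a>
# whose href starts with 'mailto:'.
mail_tag = soup.find('a', href=lambda h: h and h.startswith('mailto:'))
if mail_tag:
    print("Email:", mail_tag['href'].replace('mailto:', ''))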

Now let’s do something even cooler. We’ll download some images from the internet using a few lines of code. The cat pics dream coming true! Copy the code below and save it as unsplash.py.

import requests
from bs4 import BeautifulSoup

search_query = 'cats'
url = f'https://unsplash.com/s/photos/{search_query}'

response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Find image URLs
image_tags = soup.find_all('img', {'src': True})
images = [tag['src'] for tag in image_tags if 'src' in tag.attrs][:5]

# Download images
for i, img_url in enumerate(images, 1):
    img_data = requests.get(img_url).content
    filename = f"IMG_{i}.jpg"
    with open(filename, 'wb') as f:
        f.write(img_data)
    print(f"Downloaded {filename}: {img_url}")

In the above code, we import requests and bs4. We’ll use unsplash.com to get the images. Next, we create a soup object to parse the HTML content, and then find all <img> tags in the parsed HTML that have a src attribute (which usually contains the image URL). In the next line, we extract the src attribute from the first 5 <img> tags; we can increase that number if we want. Finally, we download each image and store it with a .jpg extension using the f"IMG_{i}.jpg" format.

If you don’t want cat pictures, you can just change the search_query to whatever you want. And if you want to increase or decrease the number of images, change the number in images = [tag['src'] for tag in image_tags if 'src' in tag.attrs][:5] to any number you want.

Finally, we’ll wrap this up by getting some details about restaurants in Bali (or any other major city you want) from the Yellow Pages. Save the file as yellow.py.

from bs4 import BeautifulSoup
import urllib.request
import urllib.error

# Location and what to find
location = "Bali"
search_query = "restaurants"
search_url = f"https://www.yellowpages.com/search?search_terms={search_query}&geo_location_terms={location.replace(' ', '%20')}"

# Function to fetch HTML content with error handling
def fetch_html(url):
    try:
        req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        with urllib.request.urlopen(req) as response:
            return response.read()
    except urllib.error.HTTPError as e:
        print(f"HTTPError: {e.code} - {e.reason}")
    except urllib.error.URLError as e:
        print(f"URLError: {e.reason}")
    except Exception as e:
        print(f"Unexpected error: {e}")

# Fetching the search page
s_html = fetch_html(search_url)
if s_html:
    soup = BeautifulSoup(s_html, "lxml")

    # Extracting business listings
    businesses = soup.find_all('div', class_='result')

    for business in businesses:
        name_tag = business.find('a', class_='business-name')
        address_tag = business.find('div', class_='street-address')
        phone_tag = business.find('div', class_='phones phone primary')

        name = name_tag.get_text(strip=True) if name_tag else None
        address = address_tag.get_text(strip=True) if address_tag else None
        phone = phone_tag.get_text(strip=True) if phone_tag else None

        # Print the extracted information
        print(f"Name: {name}")
        print(f"Address: {address}")
        print(f"Phone: {phone}")
        print("-------------------")
#OUTPUT

Name: Bj's Pizza House
Address: 6301 Monroe Hwy
Phone: (318) 640-2983
-------------------
Name: Crazy CaJun Restaurant
Address: 6300 Monroe Hwy
Phone: (318) 640-6699
-------------------
Name: Paradise Catfish Kitchen
Address: 4820 Monroe Hwy
Phone: (318) 640-5032
-------------------
Name: El Rodeo
Address: 6005 Monroe Hwy
Phone: (318) 641-0204
-------------------
Name: Subway
Address: 5826 Monroe Hwy
Phone: (318) 640-7827
-------------------
Name: Dairy Queen Grill & Chill
Address: 5830 Monroe Hwy
Phone: (318) 640-0959
......

First, we import stuff as usual; in addition to bs4, we also need urllib.request and urllib.error for sending HTTP requests and handling errors. Next, we add our search query; we’ll be using https://www.yellowpages.com/ to search for information. After that, we define a small function with some error-handling logic, fetch the HTML content of the search results, and create a soup object to handle the parsing.

Next comes the actual logic. We find all div elements with the class “result”, which contain the individual business listings, and loop through them to pull out each name, address, and phone number. If you want information about other cities, just change the location.

As I mentioned at the start of the article, scraping is mainly about extracting information from HTML tags. In most cases, class attributes play an important role, since the information we want is embedded inside the elements they identify.
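
For instance, the Yellow Pages extraction above can also be written with CSS selectors, which address classes directly; this is just an alternative phrasing of the same idea, using the same class names as before:

# Same extraction as before, this time via CSS class selectors.
for business in soup.select('div.result'):
    name_tag = business.select_one('a.business-name')
    phone_tag = business.select_one('div.phones.phone.primary')
    name = name_tag.get_text(strip=True) if name_tag else None
    phone = phone_tag.get_text(strip=True) if phone_tag else None
    print(name, '|', phone)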

Well, that’s it for the blog. Hope you enjoyed it!