Web Crawler written in Python using Beautiful Soup

A web crawler, commonly referred to as a spider, is a piece of software that automatically browses the web by following links from one page to another. In this guide, we'll look at how to build a web crawler using Python, the Beautiful Soup library, and the Requests module.

The web crawler scans through mavin.io search results and saves the retrieved information in a structured format (.csv).

Each complete result includes:

a. product name, b. date sold, c. price sold, d. shipping cost, and e. the pictures.

All of the above fields are extracted into separate columns in a CSV:

https://mavin.io/search?q=&bt=sold&cat=261332&sort=EndTimeSoonest&page=201 (pages greater than 200 but less than 40,000 are scraped)

https://mavin.io/search?q=&cat=216 - approximately 2,400,000 results
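
The page parameter in the first URL above is what the crawler steps through. A minimal sketch of generating those paginated URLs is shown below; the small range used here is only for illustration.

BASE = "https://mavin.io/search?q=&bt=sold&cat=261332&sort=EndTimeSoonest&page={page}"

# Illustrative range only; the README notes that pages greater than 200
# and below 40,000 are scraped.
for page in range(201, 211):
    url = BASE.format(page=page)
    print(url)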

Getting Started

Prerequisites

To create a web crawler, you will need to have the following installed on your computer:

  • Python 3.x

  • Beautiful Soup 4

  • Requests module

Installation

Once you have Python installed on your computer, you can install Beautiful Soup and Requests by running the following commands in your terminal:

pip install beautifulsoup4
pip install requests

Program Design

To create a web crawler, you will need to design the program's structure and functionality. Here is a basic outline of what your program should do:

  1. Request a web page

  2. Parse the HTML content of the web page

  3. Extract relevant information from the parsed HTML

  4. Follow links on the web page to other pages

  5. Repeat steps 1-4 for each new page
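
Put together, these steps form a loop over a queue of pages. The sketch below is one way to structure that loop, using a visited set to avoid re-requesting the same URL; it is illustrative only and does not reflect the exact structure of main.py.

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=50):
    """Breadth-first crawl: request, parse, extract, follow links, repeat."""
    to_visit = [start_url]
    visited = set()

    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)
        if url in visited:
            continue
        visited.add(url)

        response = requests.get(url, timeout=30)             # step 1: request a page
        soup = BeautifulSoup(response.text, "html.parser")   # step 2: parse the HTML

        # step 3: extract relevant information (here, just the page title)
        title = soup.title.string if soup.title else ""
        print(url, "-", title)

        # step 4: follow links on the page to other pages
        for link in soup.find_all("a"):
            href = link.get("href")
            if href and href.startswith("http") and href not in visited:
                to_visit.append(href)
        # step 5: the while-loop repeats steps 1-4 for each new page

crawl("https://mavin.io/search?q=&bt=sold&cat=261332&sort=EndTimeSoonest&page=201")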

Program Implementation

To implement the web crawler program, follow these steps:

  1. Import the necessary modules:
import requests
from bs4 import BeautifulSoup
  2. Define the starting URL for the web crawler:
start_url = "https://mavin.io/search?q=&bt=sold&cat=261332&sort=EndTimeSoonest&page=201"
# page starts from 201 and runs up to the n'th page, usually in the range 201 - 400,000
  3. Send a request to the starting URL using the Requests module:
response = requests.get(start_url)
  4. Parse the HTML content of the starting URL using Beautiful Soup:
soup = BeautifulSoup(response.text, "html.parser")
  5. Extract relevant information from the parsed HTML. For example, if you want to extract all the links on the page, you can use the following code:
links = []
for link in soup.find_all('a'):
    href = link.get('href')
    if href and href.startswith("http"):
        links.append(href)
  6. Follow links on the web page to other pages. To do this, you can use a loop to repeat steps 3-5 for each new page:
for link in links:
    response = requests.get(link)
    soup = BeautifulSoup(response.text, "html.parser")
    # extract relevant information from the parsed HTML
  7. Save the extracted information to a file or database, as in the sketch below.
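
For example, the collected rows could be written with Python's built-in csv module. This is a minimal sketch; the column names and sample values are assumptions for illustration and may differ from the program's actual output.

import csv

# Assumed column layout; the real results.csv may use different headers.
fieldnames = ["product_name", "date_sold", "price_sold", "shipping_cost", "picture_url"]

rows = [
    # each dict would come from the extraction step above (placeholder values shown)
    {"product_name": "example item", "date_sold": "2023-01-15",
     "price_sold": "12.50", "shipping_cost": "3.99",
     "picture_url": "https://example.com/img.jpg"},
]

with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)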

The program takes the item number [ self.num = ("2956") ] set at [ line 11 ] of the main.py file and scrapes all results corresponding to that item number.

The only line that needs modification is [ line 11 ], since it selects which of the remaining items gets scraped. Let the program run so it can collect the large volume of information related to that specific item. You can also check the [ log.csv ] file; once the number of items scraped reaches 7 million, stop the program with:

Ctrl + C
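
One quick way to check progress is to count the rows in log.csv, assuming it holds one row per scraped item (this structure is an assumption, not documented behaviour of the program):

import csv

# Count rows in log.csv to see how many items have been scraped so far.
with open("log.csv", newline="", encoding="utf-8") as f:
    count = sum(1 for _ in csv.reader(f))

print(f"{count} items scraped so far")
if count >= 7_000_000:
    print("Target reached - stop the crawler with Ctrl + C")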

Results are not duplicated, and network issues do not stop the program, as it automatically restarts whenever the connection is not strong enough.
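
A hedged sketch of the kind of retry behaviour described above: the request is retried after a delay whenever a connection error or timeout occurs. The function name and retry strategy here are illustrative and may differ from what the program actually does.

import time
import requests

def get_with_retry(url, retries=5, delay=10):
    """Fetch a URL, retrying on network errors so a weak connection
    does not stop the crawl."""
    for attempt in range(retries):
        try:
            return requests.get(url, timeout=30)
        except requests.exceptions.RequestException:
            # connection dropped or timed out - wait and try again
            time.sleep(delay)
    raise RuntimeError(f"Failed to fetch {url} after {retries} attempts")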

Conclusion

This documentation provides a basic overview of how to create a web crawler for https://mavin.io using Python, but the program's functionality can be expanded and customized to suit your specific needs.

For enquiries, contact: