Meftahul Jannat Mila

Web Scraping Project: Extracting Data from Wikipedia Using Python

In this project, I used Python to scrape a table of Bangladeshi companies from Wikipedia and convert it into a clean CSV file. The idea was to automatically collect and organize data from a web page without manually copying and pasting the information.

I'll walk you through the process step-by-step, including what each part of the code does and some challenges I faced during the project.

🔧 Tools & Libraries Used

Pandas: For handling tabular data.
Requests: To make HTTP requests and fetch web pages.
BeautifulSoup: To parse and extract data from HTML.
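
If these libraries aren't installed yet, they can be added with pip (a typical setup command; your environment may differ):

pip install pandas requests beautifulsoup4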

Step 1: Import the Required Libraries

import pandas as pd
import requests
from bs4 import BeautifulSoup

We import the necessary Python libraries to perform web scraping (requests and BeautifulSoup) and data handling (pandas).

Step 2: Request the Wikipedia Page

url = 'https://en.wikipedia.org/wiki/List_of_companies_of_Bangladesh'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

We fetch the HTML content of the Wikipedia page using requests.get(), then parse it using BeautifulSoup with the HTML parser.
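
As an optional safety check (not part of the original script), you can confirm the request succeeded before parsing, for example:

page = requests.get(url, timeout=10)   # timeout is an optional precaution
page.raise_for_status()                # raises an HTTPError for 4xx/5xx responses
soup = BeautifulSoup(page.text, 'html.parser')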

Step 3: Locate the Target Table

table = soup.find('table', class_='wikitable sortable')

We find the specific HTML table that contains the list of Bangladeshi companies. Wikipedia uses a table with the class 'wikitable sortable'.
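
If Wikipedia ever changes the table's markup, soup.find() returns None and the later calls fail with a confusing error. A small guard (an optional addition, not in the original code) makes the failure obvious:

if table is None:
    raise ValueError("No table with class 'wikitable sortable' was found on the page")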

Step 4: Extract Table Headers

c_titles = table.find_all('th', attrs={"rowspan": "2"})
c_table_titles = [title.text.strip() for title in c_titles]

We extract the table headers (column titles) by finding <th> tags with rowspan="2" (which identifies the actual column names).
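
A quick print is an easy way to confirm the selector picked up the intended column titles before building the DataFrame:

print(c_table_titles)  # should show the six column names mentioned in the Challenges section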

Step 5: Set Up the DataFrame

df = pd.DataFrame(columns=c_table_titles)

We create an empty DataFrame with the correct column names. This sets up the structure for the company data (the final DataFrame is rebuilt in Step 7, so this step mainly confirms the headers look right).

Step 6: Extract Data Rows

column_data = table.find_all('tr')
headers = [th.get_text(strip=True) for th in table.find_all('th', attrs={'rowspan': '2'})]
expected_columns = len(headers)

data_rows = []

for row in column_data[2:]:  # skip header rows
    row_data = row.find_all('td')
    individual_row_data = [td.get_text(strip=True) for td in row_data]

    # Remove extra columns if they exist
    if len(individual_row_data) > expected_columns:
        individual_row_data = individual_row_data[:expected_columns]

    # Skip rows with wrong column count
    if len(individual_row_data) != expected_columns:
        continue

    data_rows.append(individual_row_data)


We loop through all the table rows (skipping the first two header rows) and extract the text from each <td> cell. We also make sure each row has the expected number of columns, trimming extra values and skipping any irregular rows.
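
As an aside, pandas can also parse HTML tables directly via read_html(), which would be a much shorter (though less instructive) route. A rough sketch, assuming the company table is among those returned:

from io import StringIO

tables = pd.read_html(StringIO(page.text))  # parses every <table> on the page into DataFrames
# requires an HTML parser backend such as lxml or html5lib;
# the company table would then need to be picked out of the list,
# e.g. by inspecting each DataFrame's columns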

Step 7: Create and Save the Final DataFrame

df = pd.DataFrame(data_rows, columns=headers)
df.to_csv('Companies_Of_BD.csv')


We create a final DataFrame using the collected data and headers, then export it to a CSV file named 'Companies_Of_BD.csv'.
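
By default, to_csv() also writes the DataFrame's index as an unnamed first column. If you only want the scraped columns in the file, an optional tweak (not in the original script) is:

df.to_csv('Companies_Of_BD.csv', index=False)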

Challenges I Faced

Every project has its share of hiccups. Here is the main issue I ran into:

In the table I scraped from Wikipedia, there were six column headers, so each row should have six data values. However, some rows had eight values due to extra columns like status indicators and footnote references. This mismatch could cause errors when creating the DataFrame. I needed to remove the extra values to ensure each row had only six pieces of data, allowing the final CSV file to be clean and usable.

Final Output
The final result is a clean CSV file that contains a structured list of companies in Bangladesh from Wikipedia. This dataset can now be used for analysis, visualizations, or just general reference.
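
As a quick sanity check (my own suggestion, not part of the original workflow), the CSV can be loaded back into pandas and previewed:

companies = pd.read_csv('Companies_Of_BD.csv')
print(companies.shape)   # rows and columns captured
print(companies.head())  # first few entries for a visual check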

Conclusion
This was a great beginner-friendly project to learn about web scraping, HTML structure, and data cleaning in Python. It taught me how to be careful with real-world web data and handle unexpected formatting issues.
