Meftahul Jannat Mila

Web Scraping Project: Extracting Data from Wikipedia Using Python

In this project, I used Python to scrape a table of Bangladeshi companies from Wikipedia and convert it into a clean CSV file. The idea was to automatically collect and organize data from a web page without manually copying and pasting the information.

I'll walk you through the process step-by-step, including what each part of the code does and some challenges I faced during the project.

🔧 Tools & Libraries Used

Pandas: For handling tabular data.
Requests: To make HTTP requests and fetch web pages.
BeautifulSoup: To parse and extract data from HTML.
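
If these libraries aren't installed yet, they can be added with pip (a typical setup command; your environment may differ):

pip install pandas requests beautifulsoup4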

Step 1: Import the Required Libraries

import pandas as pd
import requests
from bs4 import BeautifulSoup

We import the necessary Python libraries to perform web scraping (requests and BeautifulSoup) and data handling (pandas).

Step 2: Request the Wikipedia Page

url = 'https://en.wikipedia.org/wiki/List_of_companies_of_Bangladesh'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

We fetch the HTML content of the Wikipedia page using requests.get(), then parse it using BeautifulSoup with the HTML parser.
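
As an optional safety check (not part of the original script), you can confirm the request succeeded before parsing, for example:

page = requests.get(url, timeout=10)   # timeout is an optional precaution
page.raise_for_status()                # raises an HTTPError for 4xx/5xx responses
soup = BeautifulSoup(page.text, 'html.parser')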

Step 3: Locate the Target Table

table = soup.find('table', class_='wikitable sortable')

We find the specific HTML table that contains the list of Bangladeshi companies. Wikipedia uses a table with the class 'wikitable sortable'.
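
If Wikipedia ever changes the table's markup, soup.find() returns None and the later calls fail with a confusing error. A small guard (an optional addition, not in the original code) makes the failure obvious:

if table is None:
    raise ValueError("No table with class 'wikitable sortable' was found on the page")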

Step 4: Extract Table Headers

c_titles = table.find_all('th', attrs={"rowspan": "2"})
c_table_titles = [title.text.strip() for title in c_titles]

We extract the table headers (column titles) by finding <th> tags with rowspan="2" (which identifies the actual column names).
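
A quick print is an easy way to confirm the selector picked up the intended column titles before building the DataFrame:

print(c_table_titles)  # should show the six column names mentioned in the Challenges section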

Step 5: Set Up the DataFrame

df = pd.DataFrame(columns=c_table_titles)

We create an empty DataFrame with the correct column names. This sets up the structure for the company data (the final DataFrame is rebuilt in Step 7, so this step mainly confirms the headers look right).

Step 6: Extract Data Rows

column_data = table.find_all('tr')
headers = [th.get_text(strip=True) for th in table.find_all('th', attrs={'rowspan': '2'})]
expected_columns = len(headers)

data_rows = []

for row in column_data[2:]:  # skip header rows
    row_data = row.find_all('td')
    individual_row_data = [td.get_text(strip=True) for td in row_data]

    # Remove extra columns if they exist
    if len(individual_row_data) > expected_columns:
        individual_row_data = individual_row_data[:expected_columns]

    # Skip rows with wrong column count
    if len(individual_row_data) != expected_columns:
        continue

    data_rows.append(individual_row_data)


We loop through all the table rows (skipping the first two header rows) and extract the text from each <td> cell. We also make sure each row has the expected number of columns, trimming extra values and skipping any irregular rows.
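
As an aside, pandas can also parse HTML tables directly via read_html(), which would be a much shorter (though less instructive) route. A rough sketch, assuming the company table is among those returned:

from io import StringIO

tables = pd.read_html(StringIO(page.text))  # parses every <table> on the page into DataFrames
# requires an HTML parser backend such as lxml or html5lib;
# the company table would then need to be picked out of the list,
# e.g. by inspecting each DataFrame's columns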

Step 7: Create and Save the Final DataFrame

df = pd.DataFrame(data_rows, columns=headers)
df.to_csv('Companies_Of_BD.csv')


We create a final DataFrame using the collected data and headers, then export it to a CSV file named 'Companies_Of_BD.csv'.
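
By default, to_csv() also writes the DataFrame's index as an unnamed first column. If you only want the scraped columns in the file, an optional tweak (not in the original script) is:

df.to_csv('Companies_Of_BD.csv', index=False)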

Challenges I Faced

Every project has its share of hiccups. Here is the main issue I ran into:

In the table I scraped from Wikipedia, there were six column headers, so each row should have six data values. However, some rows had eight values due to extra columns like status indicators and footnote references. This mismatch could cause errors when creating the DataFrame. I needed to remove the extra values to ensure each row had only six pieces of data, allowing the final CSV file to be clean and usable.

Final Output
The final result is a clean CSV file that contains a structured list of companies in Bangladesh from Wikipedia. This dataset can now be used for analysis, visualizations, or just general reference.
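
As a quick sanity check (my own suggestion, not part of the original workflow), the CSV can be loaded back into pandas and previewed:

companies = pd.read_csv('Companies_Of_BD.csv')
print(companies.shape)   # rows and columns captured
print(companies.head())  # first few entries for a visual check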

Conclusion
This was a great beginner-friendly project to learn about web scraping, HTML structure, and data cleaning in Python. It taught me how to be careful with real-world web data and handle unexpected formatting issues.
