DEV Community

SOMTOO CHUKWUEZE
SOMTOO CHUKWUEZE

Posted on

PYTHON DATA EXCEL SCRAPER

PYTHON DATA EXCEL SCRAPER

Introduction

This Python script, authored by Somtoochukwu Chukwueze, serves as a powerful tool for automating the extraction of essential customer data from Excel workbooks. The script is designed to scrape information such as customer names, phone numbers, vehicle numbers, dates, emails, and customer complaints, creating a comprehensive database for further analysis.

Features

  • Dynamic Pathway Handling: Easily specify the directory containing the Excel files you wish to scrape.
  • Data Extraction Functions: Utilizes regex patterns and OpenPyXL to extract data from specific cells in the Excel sheets.
  • Flexible Output Formats: Outputs the extracted data into both Excel and CSV formats for convenient data management.
  • Automated Workbook Creation: Includes a function to open a new workbook, making it easy to store and organize the scraped data.

How to Use

Set Pathway: Define the directory path where your Excel files are located.

python

import os
pathway = r'C:\Users\user\Desktop\renamed'
list_of_files= os.listdir(pathway)

Enter fullscreen mode Exit fullscreen mode

import the required libraries: many python libraries could be imported to perform this project task but for the sake of ease, we would implement the quickest process and that is with the openpyxl, regular expressions a.k.a regex(re) and csv(comma seperated variables, in case you wish to save the document in that format.

python
import openpyxl

from openpyxl import load_workbook
from openpyxl import Workbook
from openpyxl.utils import get_column_letter
import re

import csv

from openpyxl.worksheet import worksheet
Enter fullscreen mode Exit fullscreen mode

create dictionaries to for each of the variables: create the variables for the data set you wish to extract.

python

names = ['Customer name']
phone = ['VEHICLE TYPE']
reg_no = ['Vehicle number']
dates = ['date']
emails=['email']
complaint=['complaints']

Enter fullscreen mode Exit fullscreen mode

create functions for vehicle number

python
def vehicle_no(data):
    # use regex to get the cell value
    pattern = r'(\d+)'
    l = re.split(pattern, data)
Enter fullscreen mode Exit fullscreen mode

THE CONCATENATE VARIABLE
the cocatenate variable adds/concatenates the regex value to the numerical value of the cell. e.g A+1= A1. get it?

python
concatenate_your_ass = int(l[1])+7

    cell = 'E' + str(concatenate_your_ass)
    print(cell)
    record = (wb[cell].value)
    print(f'reg_no:{record}')
    '''The if and else commaand below here is create a condition where the selected cell has no value so it's to return 'none'''
    if record == None:
        reg_no.append('NIL')
    else:
        reg_no.append(record)  # this appends the record of the cell value to the dictionary i created earlier above
    return reg_no 
Enter fullscreen mode Exit fullscreen mode

creat functions for customer_name

python
def customer_name(row_no):
    pattern = r'(\d+)' # introducing the rgex function
    l= re.split(pattern, row_no)# split the pattern of the regex functions

    cell = 'B' + str(l[1])  #this contactenates the regex functions to the cell value
    print(cell)
    record=(wb[cell].value)
    print(f'customer name:{record}')
    if record == None:
        names.append('NIL')
    else:
        names.append(record)
    return names
Enter fullscreen mode Exit fullscreen mode

create functions for dates , phone numbers and so on

python

def date(data):

    pattern = r'(\d+)'
    l= re.split(pattern, data)

    cell = 'B' + str(int(l[1])+1)
    print(cell)
    record=(wb[cell].value)
    print(f'customer name:{record}')
    if record == None:
        dates.append('NIL')
    else:
        dates.append(record)
    return dates


'this function saves the phone numbers'

def phone_number(data):
    pattern = r'(\d+)'
    l = re.split(pattern, data)

    check_PRO=int(l[1])+ 2
    cell = 'B' + str(check_PRO)
    print(cell)
    record = wb[cell].value
    print(f'phone number:{record}')
    if record == None:
        phone.append('NIL')
    else:
        phone.append(record)
    return phone


def complaints(data):
    pattern = r'(\d+)'
    l = re.split(pattern, data)

    cell = 'B' + str(int(l[1]) + 8)
    print(cell)
    record = (wb[cell].value)
    print(f'customer name:{record}')

    if record == None:

        complaint.append('NIL')
    else:
        complaint.append(record)
    return complaint


def email(data,wb):
    pattern = r'(\d+)'
    l = re.split(pattern, data)
    check_PRO = int(l[1]) + 5
    cell = 'B' + str(check_PRO)
    print(cell)
    record = wb[cell].value
    print(f'this is cell cordinate{record}')

    #omo=wb.cell(l[0],l[1])
    #cell_nemesis=wb.cell(row=l[0], column=l[1])
    print(f'email:{record}')
    if record is None:
        emails.append('NIL')
    else:
        emails.append(record)`
Enter fullscreen mode Exit fullscreen mode

Run the Script: Execute the script to iterate through each Excel file, extracting relevant customer data.

Data Output: The script creates a new workbook and CSV file containing the scraped data.

Top comments (0)