PYTHON DATA EXCEL SCRAPER
Introduction
This Python script, authored by Somtoochukwu Chukwueze, serves as a powerful tool for automating the extraction of essential customer data from Excel workbooks. The script is designed to scrape information such as customer names, phone numbers, vehicle numbers, dates, emails, and customer complaints, creating a comprehensive database for further analysis.
Features
- Dynamic Pathway Handling: Easily specify the directory containing the Excel files you wish to scrape.
- Data Extraction Functions: Utilizes regex patterns and OpenPyXL to extract data from specific cells in the Excel sheets.
- Flexible Output Formats: Outputs the extracted data into both Excel and CSV formats for convenient data management.
- Automated Workbook Creation: Includes a function to open a new workbook, making it easy to store and organize the scraped data.
How to Use
Set Pathway: Define the directory path where your Excel files are located.
python
import os
pathway = r'C:\Users\user\Desktop\renamed'
list_of_files= os.listdir(pathway)
import the required libraries: many python libraries could be imported to perform this project task but for the sake of ease, we would implement the quickest process and that is with the openpyxl, regular expressions a.k.a regex(re) and csv(comma seperated variables, in case you wish to save the document in that format.
python
import openpyxl
from openpyxl import load_workbook
from openpyxl import Workbook
from openpyxl.utils import get_column_letter
import re
import csv
from openpyxl.worksheet import worksheet
create dictionaries to for each of the variables: create the variables for the data set you wish to extract.
python
names = ['Customer name']
phone = ['VEHICLE TYPE']
reg_no = ['Vehicle number']
dates = ['date']
emails=['email']
complaint=['complaints']
create functions for vehicle number
python
def vehicle_no(data):
# use regex to get the cell value
pattern = r'(\d+)'
l = re.split(pattern, data)
THE CONCATENATE VARIABLE
the cocatenate variable adds/concatenates the regex value to the numerical value of the cell. e.g A+1= A1. get it?
python
concatenate_your_ass = int(l[1])+7
cell = 'E' + str(concatenate_your_ass)
print(cell)
record = (wb[cell].value)
print(f'reg_no:{record}')
'''The if and else commaand below here is create a condition where the selected cell has no value so it's to return 'none'''
if record == None:
reg_no.append('NIL')
else:
reg_no.append(record) # this appends the record of the cell value to the dictionary i created earlier above
return reg_no
creat functions for customer_name
python
def customer_name(row_no):
pattern = r'(\d+)' # introducing the rgex function
l= re.split(pattern, row_no)# split the pattern of the regex functions
cell = 'B' + str(l[1]) #this contactenates the regex functions to the cell value
print(cell)
record=(wb[cell].value)
print(f'customer name:{record}')
if record == None:
names.append('NIL')
else:
names.append(record)
return names
create functions for dates , phone numbers and so on
python
def date(data):
pattern = r'(\d+)'
l= re.split(pattern, data)
cell = 'B' + str(int(l[1])+1)
print(cell)
record=(wb[cell].value)
print(f'customer name:{record}')
if record == None:
dates.append('NIL')
else:
dates.append(record)
return dates
'this function saves the phone numbers'
def phone_number(data):
pattern = r'(\d+)'
l = re.split(pattern, data)
check_PRO=int(l[1])+ 2
cell = 'B' + str(check_PRO)
print(cell)
record = wb[cell].value
print(f'phone number:{record}')
if record == None:
phone.append('NIL')
else:
phone.append(record)
return phone
def complaints(data):
pattern = r'(\d+)'
l = re.split(pattern, data)
cell = 'B' + str(int(l[1]) + 8)
print(cell)
record = (wb[cell].value)
print(f'customer name:{record}')
if record == None:
complaint.append('NIL')
else:
complaint.append(record)
return complaint
def email(data,wb):
pattern = r'(\d+)'
l = re.split(pattern, data)
check_PRO = int(l[1]) + 5
cell = 'B' + str(check_PRO)
print(cell)
record = wb[cell].value
print(f'this is cell cordinate{record}')
#omo=wb.cell(l[0],l[1])
#cell_nemesis=wb.cell(row=l[0], column=l[1])
print(f'email:{record}')
if record is None:
emails.append('NIL')
else:
emails.append(record)`
Run the Script: Execute the script to iterate through each Excel file, extracting relevant customer data.
Data Output: The script creates a new workbook and CSV file containing the scraped data.
Top comments (0)