Business problem - Introduction
1. A description of the problem and a discussion of the background
Traffic accidents are one of the leading causes of death worldwide and a major source of economic expenditure. Despite the many measures and campaigns deployed every year to raise awareness of the problem, accidents still occur frequently. Their impact on society and the economy is high: human losses are compounded by large expenditures on health care, awareness campaigns, mobilization of specialized personnel, etc. The WHO puts the economic impact of road accidents in a developed country at 2 to 3% of GDP, a significant figure for any country. Reducing these losses has become a matter of general interest.
Defining the problem:
What are the factors that have a high impact on road accidents?
Is there a pattern to them?
Are the factors correlated?
We will have to analyze the data to get a clearer picture and draw conclusions.
Introduction
Note that this work is the final project of the IBM certification course, for which we have been provided with the data used to develop the project.
These data have been collected and shared by the Seattle Police Department (Traffic Records) and are provided by Coursera for downloading through a link.
The data cover the period from 2004 to the present and record information on the severity of each traffic accident, its location, the type of collision, weather and road conditions, visibility, number of people involved, etc.
The objective is to define the problem and to find the factors that carry relevant weight in the number and severity of accidents, so that any agency, company or organization interested in reducing these figures can focus its resources on the points where these conditions converge.
To provide greater clarity, I will analyze the data to see whether there are relationships or patterns, especially in high-impact accidents, so that preventive measures can focus on these points as a first prevention strategy.
Data to be used
2. A description of the data and how it will be used to solve the problem
Accurately predicting the magnitude of the damage caused by accidents requires a large number of traffic accident reports with accurate data to train prediction models. The data set provided for this work contains a record of about 200,000 accidents in the city of Seattle, from 2004 to the date of issue, with 37 attributes or variables per accident, and allows the type of accident to be coded according to 84 codes. The following information can be extracted from it:
speed information
information on road conditions and visibility
type of collision
affected persons, etc
The data will be used so that we can determine which attributes are most common in traffic accidents in order to target prevention at these high-incidence points.
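One way to see which attribute values go together with more severe accidents is a row-normalised cross-tabulation. The following is a minimal sketch on a handful of invented records (the values are not taken from the real data set); only the column names `SEVERITYCODE` and `ROADCOND` come from this project.

```python
import pandas as pd

# Toy records, made up for illustration only; the real data has many more rows.
toy = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 1, 2, 2, 1, 1],
    "ROADCOND": ["Dry", "Wet", "Dry", "Wet", "Wet", "Dry", "Dry", "Ice"],
})

# Row-normalised cross-tabulation: for each road condition,
# the share of accidents that fall in each severity class.
severity_share = pd.crosstab(toy["ROADCOND"], toy["SEVERITYCODE"], normalize="index")
print(severity_share)
```

Each row sums to 1, so conditions can be compared directly even when some occur far more often than others.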
Data Source
Data Source: These data have been collected and shared by the Seattle Police Department (Traffic Records) and are provided by Coursera for downloading through a link.
Data Location: Coursera_Capstone/Data assets
Data set name: Data-Collisions (1)_shaped.csv
Methodology
Objective: The objective of this project is to predict the severity of a traffic accident based on the other characteristics contained in the report.
Packages and libraries: We will use libraries and packages for data manipulation and data visualization (pandas, NumPy, SciPy, Matplotlib, Seaborn) and scikit-learn for the models.
An exploratory data analysis will be performed to determine which methodology and machine learning approach is most appropriate, and to get a first look at the data that we find most relevant for this project.
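A useful first check in such an analysis is how the target classes are distributed, since that affects how the later scores should be read. A small sketch, assuming an invented `SEVERITYCODE` column rather than the real one:

```python
import pandas as pd

# Hypothetical target column; the values are invented for illustration.
severity = pd.Series([1, 1, 2, 1, 1, 2, 1, 1, 1, 2], name="SEVERITYCODE")

# Relative frequency of each class. A strongly uneven split means that
# accuracy alone can look good for a model that always predicts the majority class.
class_share = severity.value_counts(normalize=True)
print(class_share)
```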
Obtaining and cleaning data
Importing libraries and packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
print('imported')
Loading the data
df_data_1 = pd.read_csv('Data-Collisions.csv')
df_data_1.head()
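Since only three columns are used further on, one option is to load just those with the `usecols` parameter of `pd.read_csv`. The sketch below reads a tiny in-memory CSV that stands in for the real file; its rows and the extra `WEATHER` column are assumptions made for illustration.

```python
import io
import pandas as pd

# A tiny stand-in for the real CSV file (contents invented for this sketch).
csv_text = """SEVERITYCODE,SPEEDING,ROADCOND,WEATHER
1,,Wet,Raining
2,Y,Dry,Clear
1,,,Overcast
"""

# usecols restricts parsing to the listed columns, which saves memory on wide files.
df = pd.read_csv(io.StringIO(csv_text), usecols=["SEVERITYCODE", "SPEEDING", "ROADCOND"])
print(df.shape)
print(list(df.columns))
```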
# choosing the data we will work with
test = ['SEVERITYCODE', 'SPEEDING','ROADCOND']
df_data_1 = df_data_1[test].copy()  # copy so later column assignments do not act on a view
# inspecting the unique values of each feature
for feature in ["SPEEDING", "ROADCOND"]:
    print(df_data_1[feature].unique())
['N' 'Y']
['Wet' 'Dry' 'Unknown' 'Snow/Slush' 'Ice' 'Other' 'Sand/Mud/Dirt'
'Standing Water' 'Oil']
# in SPEEDING we replace NaN with 'N' (not speeding)
df_data_1['SPEEDING'] = df_data_1['SPEEDING'].fillna('N')
# in ROADCOND we replace NaN with 'Unknown'
df_data_1['ROADCOND'] = df_data_1['ROADCOND'].fillna('Unknown')
# checking the values once again...
for feature in ["SPEEDING", "ROADCOND"]:
    print(df_data_1[feature].unique())
['N' 'Y']
['Wet' 'Dry' 'Unknown' 'Snow/Slush' 'Ice' 'Other' 'Sand/Mud/Dirt'
'Standing Water' 'Oil']
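The unique values alone do not show how many entries were missing before imputation. A minimal sketch, using an invented toy column rather than the real data, of counting missing values with `isna().sum()` and then applying the same `fillna` rule:

```python
import numpy as np
import pandas as pd

# Toy column with missing entries (values invented for illustration).
speeding = pd.Series(["Y", np.nan, np.nan, "Y", np.nan], name="SPEEDING")

missing_before = speeding.isna().sum()   # number of missing entries before imputation
filled = speeding.fillna("N")            # same imputation rule as in the text
print(missing_before)
print(filled.value_counts())             # distribution after imputation
```

Knowing the share of imputed entries helps to judge how much of the 'N' category is an assumption rather than a recorded observation.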
# We group the road conditions into two classes ('Unknown' and 'Other' are treated as 'Normal')
road_groups = {'Wet': 'Dangerous', 'Dry': 'Normal', 'Unknown': 'Normal', 'Snow/Slush': 'Dangerous', 'Ice': 'Dangerous', 'Other': 'Normal', 'Sand/Mud/Dirt': 'Dangerous', 'Standing Water': 'Dangerous', 'Oil': 'Dangerous'}
df_data_1['ROADCOND'] = df_data_1['ROADCOND'].map(road_groups)
# We encode both features as integers: SPEEDING N=0, Y=1; ROADCOND Dangerous=0, Normal=1
df_data_1['SPEEDING'] = df_data_1['SPEEDING'].map({'N': 0, 'Y': 1})
df_data_1['ROADCOND'] = df_data_1['ROADCOND'].map({'Dangerous': 0, 'Normal': 1})
test_condition = df_data_1[['SPEEDING','ROADCOND']]
test_condition.head()
| | SPEEDING | ROADCOND |
| --- |:---:|:---:|
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 2 | 0 | 1 |
| 3 | 0 | 1 |
| 4 | 0 | 0 |
Training the model
x = test_condition
y = df_data_1['SEVERITYCODE'].values.astype(str)
x = preprocessing.StandardScaler().fit(x).transform(x)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1234)
# obtaining data dimensions
print("Training set: ", x_train.shape, y_train.shape)
print("Testing set: ", x_test.shape, y_test.shape)
Training set: (155738, 2) (155738,)
Testing set: (38935, 2) (38935,)
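A common variation on this step, shown here only as a sketch and not as what was run for the results below, is to stratify the split on the target and to fit the scaler on the training rows only. The data in the example are synthetic, with an assumed 70/30 class split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the two binary features and the target.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 2)).astype(float)
y = rng.choice(["1", "2"], size=1000, p=[0.7, 0.3])

# stratify=y keeps the class proportions the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234, stratify=y
)

# Fit the scaler on the training rows only, then apply it to both subsets,
# so that no statistics from the test rows leak into training.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```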
Selecting the methods: Tree model, Logistic Regression and KNN methodology
#Tree model
Tree_model = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
Tree_model.fit(x_train, y_train)
predicted = Tree_model.predict(x_test)
Tree_f1 = f1_score(y_test, predicted, average='weighted')
Tree_acc = accuracy_score(y_test, predicted)
#Logistic Regression
LR_model = LogisticRegression(C=0.01, solver='liblinear').fit(x_train, y_train)
predicted = LR_model.predict(x_test)
LR_f1 = f1_score(y_test, predicted, average='weighted')
LR_acc = accuracy_score(y_test, predicted)
#KNN methodology
KNN_model = KNeighborsClassifier(n_neighbors = 4).fit(x_train, y_train)
predicted = KNN_model.predict(x_test)
KNN_f1 = f1_score(y_test, predicted, average='weighted')
KNN_acc = accuracy_score(y_test, predicted)
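Accuracy and a weighted F1-score summarise each model in a single number. To see how each class is handled, `confusion_matrix` and `classification_report` (the latter is already imported above) can be applied to the same predictions. A minimal sketch with invented labels and predictions:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Invented labels and predictions, for illustration only.
y_true = np.array(["1", "1", "1", "2", "2", "1", "2", "1"])
y_pred = np.array(["1", "1", "1", "1", "2", "1", "1", "1"])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=["1", "2"])
print(cm)

# Per-class precision, recall and F1; zero_division=0 avoids warnings
# when a class is never predicted.
print(classification_report(y_true, y_pred, labels=["1", "2"], zero_division=0))
```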
Results
Comparing the results obtained
results = {
    "Method of Analysis": ["KNN", "Decision Tree", "LogisticRegression"],
    "F1-score": [KNN_f1, Tree_f1, LR_f1],
    "Accuracy": [KNN_acc, Tree_acc, LR_acc]
}
results = pd.DataFrame(results)
results
| | Method of Analysis | F1-score | Accuracy |
| --- |:---:|:---:|:---:|
| 0 | KNN | 0.591378 | 0.69675 |
| 1 | Decision Tree | 0.576051 | 0.699679 |
| 2 | LogisticRegression | 0.576051 | 0.699679 |
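An accuracy of about 0.70 is hard to judge without a reference point. One common check, which is not part of the original analysis, is to compare against a trivial model that always predicts the most frequent class, using scikit-learn's `DummyClassifier`. The sketch below uses synthetic labels with an assumed 70/30 class split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels with a 70/30 class split (an assumption for this sketch).
y_train = np.array(["1"] * 70 + ["2"] * 30)
y_test = np.array(["1"] * 14 + ["2"] * 6)
X_train = np.zeros((len(y_train), 2))   # features are ignored by the dummy model
X_test = np.zeros((len(y_test), 2))

# Always predicts the class that is most frequent in the training labels.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
predicted = baseline.predict(X_test)

baseline_acc = accuracy_score(y_test, predicted)
baseline_f1 = f1_score(y_test, predicted, average="weighted", zero_division=0)
print(baseline_acc, baseline_f1)
```

If the baseline scores on the real data come out close to those in the table, the models add little over the trivial rule.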
# Inspecting the intercept and coefficients of the logistic regression
results = {
    "Intercept": LR_model.intercept_,
    "SPEEDING": LR_model.coef_[:, 0],
    "ROADCOND": LR_model.coef_[:, 1],
}
results = pd.DataFrame(results)
results
| | Intercept | SPEEDING | ROADCOND |
| --- |:---:|:---:|:---:|
| 0 | -0.853729 | 0.067702 | -0.068295 |
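Looking at the logistic regression coefficients, speeding has a positive coefficient and normal road conditions (encoded as 1) a negative one, which suggests that speed and road conditions influence the severity of traffic accidents. The effect is small, however: both coefficients are close to zero, and the Decision Tree and Logistic Regression models obtain identical scores (F1 0.576051, accuracy 0.699679), so these two features alone separate the severity classes only weakly.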
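Logistic regression coefficients are easier to read as odds ratios. The sketch below exponentiates the two coefficients reported in the table above; because the features were standardized before fitting, each ratio refers to a change of one standard deviation in the scaled feature, not to a switch from 0 to 1.

```python
import numpy as np

# Coefficients as reported in the table above (fitted on standardized features).
coefficients = {"SPEEDING": 0.067702, "ROADCOND": -0.068295}

# exp(coefficient) is the multiplicative change in the odds of the positive
# class for a one-standard-deviation increase in the (scaled) feature.
odds_ratios = {name: float(np.exp(value)) for name, value in coefficients.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

A ratio above 1 means higher odds of the positive class as the feature increases, and a ratio below 1 means lower odds.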