<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Siddhesh shankar</title>
    <description>The latest articles on DEV Community by Siddhesh shankar (@siddheshcodemaster).</description>
    <link>https://dev.to/siddheshcodemaster</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F547517%2Fe100b3c9-9c17-4e03-a733-68ed60154711.jpeg</url>
      <title>DEV Community: Siddhesh shankar</title>
      <link>https://dev.to/siddheshcodemaster</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/siddheshcodemaster"/>
    <language>en</language>
    <item>
      <title>EDA, Feature Engineering and ML Model Creation - 100,000 UK Used Car Data Set (Kaggle)</title>
      <dc:creator>Siddhesh shankar</dc:creator>
      <pubDate>Thu, 20 Jan 2022 06:40:51 +0000</pubDate>
      <link>https://dev.to/siddheshcodemaster/eda-feature-engineering-and-ml-model-creation-100000-uk-used-car-data-set-kaggle-288p</link>
      <guid>https://dev.to/siddheshcodemaster/eda-feature-engineering-and-ml-model-creation-100000-uk-used-car-data-set-kaggle-288p</guid>
      <description>&lt;h2&gt;
  
  
  Introduction:
&lt;/h2&gt;

&lt;p&gt;Learning the concepts of Exploratory Data Analysis and Machine Learning, as well as the life-cycle of a Data Science project, has not only helped me gain knowledge but also improved my ability to interpret data correctly.&lt;br&gt;
It has also helped me think rationally about how the process works: how the data is collected, processed and analyzed to extract the insights that matter.&lt;/p&gt;

&lt;p&gt;In this article, I talk about the Exploratory Data Analysis, Feature Engineering and model creation that I did on the 100,000 UK Used Car data set from Kaggle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr9c0lblrtpxt7vqu6pxc.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr9c0lblrtpxt7vqu6pxc.jpg" alt="UK Audi Used Car" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Description of the data set:
&lt;/h3&gt;

&lt;p&gt;We all know how people upgrade themselves by buying a new car, and car ownership has been booming over the last decade. The data set consists of used cars from the United Kingdom, where a study showed that over &lt;strong&gt;&lt;em&gt;1.63 million new cars&lt;/em&gt;&lt;/strong&gt; were registered in the year 2020. For some people a car is not only for travelling but also a status upgrade. &lt;br&gt;
The data set covers several car manufacturers whose used cars are up for sale in the United Kingdom (UK); we have chosen to work with the Audi cars data.&lt;/p&gt;

&lt;p&gt;Here's the link to my Kaggle notebook:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.kaggle.com/siddheshshankarcoder/audi-data-set-96-accurate-model-creation-eda" rel="noopener noreferrer"&gt;Audi Data Set- 96 % Accurate Model Creation &amp;amp; EDA - Kaggle&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Let's start the project:
&lt;/h3&gt;

&lt;p&gt;We always need to know what our data set is and which libraries the project needs. This is the first step towards &lt;strong&gt;&lt;em&gt;"glory"&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So, &lt;strong&gt;&lt;em&gt;we import the libraries:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt; &lt;span class="c1"&gt;# linear algebra
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt; &lt;span class="c1"&gt;# data processing, CSV file I/O (e.g. pd.read_csv)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt; &lt;span class="c1"&gt;# For visualzations and graph creations
&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt; &lt;span class="c1"&gt;# For advanced visualizations and graph creations
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LabelEncoder&lt;/span&gt; &lt;span class="c1"&gt;# For Feature Engineering Method - Label Encoding
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MinMaxScaler&lt;/span&gt; &lt;span class="c1"&gt;# For Normalization
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt; &lt;span class="c1"&gt;# For Splitting the data into train data and test data
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressor&lt;/span&gt; &lt;span class="c1"&gt;# For Creation of Random Forest Regressor Model
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt; &lt;span class="c1"&gt;# For Creation of Linear Regression Model
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;catboost&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CatBoostRegressor&lt;/span&gt; &lt;span class="c1"&gt;# For Creation of CatBoost Regressor Model
&lt;/span&gt;
&lt;span class="c1"&gt;# Libraries for calculation Metrics of the Model we create:
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mean_squared_error&lt;/span&gt; 
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;r2_score&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After importing the libraries, we &lt;strong&gt;&lt;em&gt;import the data set:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;audidata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;../input/used-car-dataset-ford-and-mercedes/audi.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;audidata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We imported the Audi data records with the .read_csv() function of the Pandas library.&lt;/p&gt;

&lt;p&gt;With the help of the .info() function we get information about the columns in the data set: their data types, non-null counts, etc. The .isna().sum() and .shape calls give the per-column null counts and the overall dimensions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;audidata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;audidata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;audidata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data set contains 10,668 entries/records with 9 columns. So, the columns and their descriptions are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model --&amp;gt; Model Name.
&lt;/li&gt;
&lt;li&gt;year --&amp;gt; The year it was bought.
&lt;/li&gt;
&lt;li&gt;price --&amp;gt; The price at which the used car will be sold.
&lt;/li&gt;
&lt;li&gt;transmission --&amp;gt; The transmission type i.e. manual, automatic or semi-automatic.&lt;/li&gt;
&lt;li&gt;mileage --&amp;gt; The number of miles the used car has been driven.
&lt;/li&gt;
&lt;li&gt;fuelType --&amp;gt; The fuel type of the car i.e. petrol, diesel or hybrid.
&lt;/li&gt;
&lt;li&gt;tax --&amp;gt; The tax that will be applied on the selling price of that used car.
&lt;/li&gt;
&lt;li&gt;mpg --&amp;gt; The miles per gallon ratio telling us how many miles it can drive per gallon of fuel.&lt;/li&gt;
&lt;li&gt;engineSize --&amp;gt; The engine size of the used car.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are no null records, which means that one common type of noise is absent from the data set.&lt;/p&gt;

&lt;h3&gt;
  
  
  Outlier Removal:
&lt;/h3&gt;

&lt;p&gt;Although we know that there are no NA values/records in the data set, that doesn't mean the data contains no noisy points. Removing outliers therefore becomes necessary, and we used boxplots to detect them.&lt;br&gt;
&lt;/p&gt;
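&lt;p&gt;The boxplots below work on a data frame called a_clean. The notebook presumably starts it off as a plain working copy of the loaded data, roughly like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Assumption: a_clean is a working copy of the raw data, so the original
# audidata frame stays untouched while we filter out outliers
a_clean = audidata.copy()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;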

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;box1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mileage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5idaox0stlqk2hwrsqo9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5idaox0stlqk2hwrsqo9.png" alt="Boxplot showing Outliers in Mileage Column" width="352" height="262"&gt;&lt;/a&gt;&lt;br&gt;
We can see that there is a car with more than 300,000 miles driven. This is an outlier, so we removed it by restricting the mileage range to 0-200,000 miles.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;a_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mileage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;200000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;We removed {} outliers!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audidata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2dt1qowo3n67pnqnaw7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2dt1qowo3n67pnqnaw7.png" alt="Boxplot showing Outliers in Mileage Column after cleaning" width="369" height="262"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6c4fy5c292ne3buf9j28.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6c4fy5c292ne3buf9j28.png" alt="Removed Outlier Count" width="800" height="42"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly, we followed the same procedure for the other numerical columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;box1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tax&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuk6rz4x7ld2g3dhxzjdk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuk6rz4x7ld2g3dhxzjdk.png" alt="Boxplot showing Outliers in Tax Column" width="357" height="262"&gt;&lt;/a&gt;&lt;br&gt;
We can observe that there are outliers in the ranges 0 €-100 € and 500 €-600 €. We removed these outliers from the tax column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;a_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tax&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;We removed {} outliers!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audidata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="n"&gt;box1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tax&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After cleaning the outliers, this is the boxplot for the tax column.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqhx7ip3n29zt446giyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhqhx7ip3n29zt446giyy.png" alt="Boxplot showing outliers in Tax Column after Cleaning" width="352" height="262"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffdplzglwqmikzl3rkuf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffdplzglwqmikzl3rkuf.png" alt=" " width="800" height="42"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After removing all the outliers from the numerical columns, we dive straight into Exploratory Data Analysis (EDA).&lt;/p&gt;
&lt;h3&gt;
  
  
  Exploratory Data Analysis(EDA):
&lt;/h3&gt;

&lt;p&gt;To start the EDA, we plotted a heatmap using the Seaborn library, passing it the correlation matrix of the cleaned data and using 'Reds' as the colour map.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;cmap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;annot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Correlation HeatMap/ Matrix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ov3ou06ephndgai6ltv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ov3ou06ephndgai6ltv.png" alt=" " width="402" height="264"&gt;&lt;/a&gt;&lt;br&gt;
The .corr() function of the Pandas library uses Pearson's correlation as the default method, but we can of course change it by passing the method name like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Kendall&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the Correlation Matrix we get the following information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;There is a negative correlation between price and mileage. A car that has been driven more has a higher mileage, and therefore a lower price, since it is more used.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There is a negative correlation between mpg (miles per gallon) and price. Sports cars tend to have a lower miles-per-gallon ratio, whereas normal cars have a higher one; hence the sportier models carry higher prices while the normal models are cheaper.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There is a positive correlation between the price of the car and its engine size. Cars with larger engines tend to be listed at higher prices.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There is a small positive correlation between tax and the price of the car. Cars with higher taxes on them are costlier: Total Price = Selling Price + VAT (tax applied). The exact correlation coefficients behind these points can be pulled from the matrix, as sketched just after this list.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
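&lt;p&gt;If you want the exact numbers behind these observations, the correlations with price can be read straight off the matrix; a small sketch (not from the original notebook):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Correlation of every numerical column with price, strongest first
price_corr = a_clean.corr()['price'].sort_values(ascending=False)
print(price_corr)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;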

&lt;p&gt;After taking insights from the correlation matrix, we can also use histplots to check the skewness of the data. We can create a grid of plots with three columns in two rows using plt.subplots(figsize = (12,10), nrows = 2, ncols = 3).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subplots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;nrows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ncols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mileage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tax&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engineSize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;axes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnciczefwf8hre7r7yma.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnciczefwf8hre7r7yma.png" alt="Sub Plots (HistPlots)" width="746" height="589"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The year column is left-skewed (its tail stretches towards the older years), which means that most of the cars are from 2015 to 2020 (the numeric skewness check sketched after this list is a quick way to verify these directions). &lt;/li&gt;
&lt;li&gt;The mileage column is right-skewed (a long tail of high-mileage cars), and most of the cars listed have been driven for more than 5,000 miles. &lt;/li&gt;
&lt;li&gt;For the engineSize column, most of the used cars have an engine size between 1.5 L and 2 L.&lt;/li&gt;
&lt;/ul&gt;
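&lt;p&gt;As a quick numeric cross-check of these skew directions, Pandas can compute the skewness coefficients directly; a minimal sketch using the column names of this data set:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Positive skew = long right tail, negative skew = long left tail
print(a_clean[['year', 'mileage', 'tax', 'mpg', 'engineSize', 'price']].skew())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;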

&lt;p&gt;We can explore the categorical column 'transmission' using Seaborn's countplot like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;countplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transmission&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Transmission Types&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyag26vyp1ijvifhnczx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyag26vyp1ijvifhnczx.png" alt="Countplot showing transmission column" width="395" height="278"&gt;&lt;/a&gt;&lt;br&gt;
This countplot shows us that there are around 4,000+ cars with Manual transmission, around 2,500+ cars with Automatic transmission, and around 3,500+ cars with Semi-Auto transmission in the UK.&lt;/p&gt;
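<p>&lt;p&gt;The exact counts behind the bars can be read off with value_counts(), roughly like this:&lt;/p&gt;</p>

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Number of listings per transmission type
print(a_clean['transmission'].value_counts())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;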

&lt;p&gt;Now to get the unique Model names that have been listed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7nzeieffc6rtdbiqz4q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7nzeieffc6rtdbiqz4q.png" alt="Unique Car Model Names Listed" width="800" height="76"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, some very important insights were taken from this data set. They are:&lt;br&gt;
Firstly,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lineplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tax&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Taxes applied on the car based on the Number of Years old&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lineplot shows us the taxes applied on the listed price of the used car over the years.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa053outxq14y434qgymx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa053outxq14y434qgymx.png" alt="Lineplot Year vs Tax" width="420" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;From this lineplot we can see that taxes of at least 150 € are applied on cars that are relatively new, i.e. 1-2 years old. In the UK, road tax has to be paid on every car, whether used or new. There can be some deviations in the taxes for cars of a specific age. From this we can form a hypothesis:&lt;/li&gt;
&lt;li&gt;The taxes also vary with the type of the car; SUVs and sedans, for example, have higher taxes applied to them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Secondly, if we plot price vs the number of years old, we can see that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;From this lineplot we can see that cars which are relatively new have higher prices, which is expected because they have travelled less distance. But there are some deviations in the prices of cars that are around 4-5 years old: the sellers have tried to maximise their profit, and the buyers haven't always seen through it logically and mathematically.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lineplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Price based on the number of Years old&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ac6l0a4pw0xej91zn78.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ac6l0a4pw0xej91zn78.png" alt="LinePlot showing Price vs year" width="410" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this we completed the Exploratory Data Analysis. There are a lot of other plots and insights from this data set, but I have mentioned the important ones above. After this, we went directly into Feature Engineering.&lt;/p&gt;
&lt;h3&gt;
  
  
  Feature Engineering:
&lt;/h3&gt;

&lt;p&gt;We can see that this data set has several categorical columns. We had to convert those categories into numbers so that during model creation we are only passing numbers for computation. &lt;br&gt;
So, we applied Label Encoding. &lt;br&gt;
Label Encoding is an encoding method that converts the categorical values in a column into numbers such as 0, 1, 2, ... depending on the distinct categories in that column.&lt;br&gt;
We applied it to the categorical columns one by one, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LabelEncoder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;model_mapping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classes_&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="n"&gt;model_mapping&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
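&lt;p&gt;The same pattern can presumably be repeated for the remaining categorical columns, transmission and fuelType; a sketch of the idea (not the exact notebook code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Encode each remaining categorical column with its own LabelEncoder,
# keeping the integer-to-label mapping for later reference
for col in ['transmission', 'fuelType']:
    enc = LabelEncoder()
    a_clean[col] = enc.fit_transform(a_clean[col])
    print(col, {index: label for index, label in enumerate(enc.classes_)})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;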



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92bmghy7xfkep0wox05e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92bmghy7xfkep0wox05e.png" alt="Label Encoding done on Model Column" width="800" height="445"&gt;&lt;/a&gt;&lt;br&gt;
So, after applying this on all categorical columns, the result was this:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq0x136orwak58sna6w40.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq0x136orwak58sna6w40.png" alt="Label Encoding Result" width="718" height="200"&gt;&lt;/a&gt;&lt;br&gt;
Before applying the machine learning algorithms to this data set, a final touch was required so that all the numerical values of the columns would lie in the range 0 to 1. For this we used the MinMax Scaler, which transforms the features into a given range. This is, in essence, normalization of the data set.&lt;br&gt;
&lt;/p&gt;
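&lt;p&gt;The scaler below is fitted on a feature matrix x, with the price column as the target y. That split of features and target is presumably done along these lines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Assumption: price is the target, every other column is a feature
x = a_clean.drop('price', axis=1)
y = a_clean['price']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;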

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MinMaxScaler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;copy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feature_range&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6tqgelc5c12x7al1e37v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6tqgelc5c12x7al1e37v.png" alt="MinMax Scaler" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, the best part of the project, the one we have all been waiting for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Creation:
&lt;/h3&gt;

&lt;p&gt;We split the data into two parts. The first part was used to train our machine learning models. The other part was used to test them and see how accurate they are.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;test_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Shape of the x_train: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Shape of the x_test: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Shape of y_train:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Shape of the y_test: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpn7m3vx6poujglo56ncu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpn7m3vx6poujglo56ncu.png" alt=" " width="800" height="99"&gt;&lt;/a&gt; &lt;br&gt;
We split the data into a 65% training set and a 35% test set.&lt;br&gt;
We built three machine learning models:&lt;/p&gt;
&lt;h4&gt;
  
  
  Linear Regression Model:
&lt;/h4&gt;

&lt;p&gt;First, a Linear Regression model, which wasn't all that accurate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;LinearRegressionModel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fit_intercept&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;normalize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;copy_X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_jobs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;LinearRegressionModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Linear Regression Train Score is : &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LinearRegressionModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Linear Regression Test Score is : &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LinearRegressionModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;----------------------------------------------------&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LinearRegressionModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Predicted Value for Linear Regression is : &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffx6kafriht1o49jvjftv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffx6kafriht1o49jvjftv.png" alt=" " width="800" height="126"&gt;&lt;/a&gt;&lt;br&gt;
We assumed a linear relationship between the variable to be predicted, i.e. the price y, and the feature variables x that affect it. We then applied Linear Regression, setting the parameters like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;copy_X = True --&amp;gt; It was set to True because we didn't want the value of X to be overwritten.&lt;/li&gt;
&lt;li&gt;n_jobs = -1 --&amp;gt; The number of jobs used for the computation; -1 means all available processors are used.&lt;/li&gt;
&lt;li&gt;fit_intercept = True --&amp;gt; We wanted the model to calculate the intercept. Hence we gave it as True.
&lt;/li&gt;
&lt;li&gt;normalize = True --&amp;gt; We wanted the regressors to be normalized before the regression was applied; this setting only has an effect when fit_intercept is True.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As we can see, we got only 80.236% as our test score (the R² that .score() reports for regression models). &lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3e1lm7gh8v8idtw93u6v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3e1lm7gh8v8idtw93u6v.png" alt=" " width="399" height="214"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
There is a visible difference between the actual listed car price and the price predicted by the Linear Regression model, since the model's score is just 80.23%.&lt;br&gt;
We weren't satisfied with this, hence we decided to try a different model.&lt;/p&gt;
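&lt;p&gt;A comparison plot like the one above can be produced by overlaying the actual and predicted prices for a slice of the test set; a minimal sketch, not the exact notebook code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Compare the first 50 test samples: actual prices vs. model predictions
plt.figure(figsize = (10, 5))
plt.plot(y_test[:50].values, label = 'Actual price')
plt.plot(y_pred[:50], label = 'Predicted price')
plt.title("Actual vs Predicted Price (Linear Regression)")
plt.legend()
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;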
&lt;h4&gt;
  
  
Random Forest Regressor Model:
&lt;/h4&gt;

&lt;p&gt;This model performed better in terms of train as well as test prediction accuracy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;RandomForestRegressorModel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;RandomForestRegressorModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Random Forest Regressor Train Score is : &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressorModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Random Forest Regressor Test Score is : &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressorModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Random Forest Regressor No. of features are : &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressorModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n_features_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;----------------------------------------------------&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomForestRegressorModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Predicted Value for Random Forest Regressor is : &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fct9bfcsdpvvhsjg699pm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fct9bfcsdpvvhsjg699pm.png" alt=" " width="800" height="146"&gt;&lt;/a&gt;&lt;br&gt;
We then applied the Random Forest Regressor model, setting the parameters as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;n_estimators = 100 --&amp;gt; The number of trees that the model will use. We gave 100 as the parameter value to n_estimators.&lt;/li&gt;
&lt;li&gt;max_depth = 11 --&amp;gt; The maximum depth of each tree. We gave 11 as the parameter value to max_depth.&lt;/li&gt;
&lt;li&gt;random_state = 33 --&amp;gt; Controls the randomness of the bootstrapping of the samples used when building the trees. We gave 33 as the parameter value to random_state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a significant improvement over the Linear Regression model's accuracy.&lt;br&gt;
We were able to achieve around 95.56% accuracy on the test data.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uv7rrjysxn5r6epprmb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uv7rrjysxn5r6epprmb.png" alt=" " width="389" height="199"&gt;&lt;/a&gt;&lt;br&gt;
We can observe that there is very little visible difference between the actual listed price of the car and the price predicted by the Random Forest Regressor model, since it is 95.56% accurate.&lt;/p&gt;
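
&lt;p&gt;As an additional check, here is a minimal sketch (not part of the original notebook) of how absolute error metrics could be computed for the fitted model, assuming the x_test/y_test split and the RandomForestRegressorModel defined above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Predictions from the fitted Random Forest model (assumed available from above)
y_pred_rf = RandomForestRegressorModel.predict(x_test)

# Mean Absolute Error: average absolute gap between listed and predicted prices
mae = mean_absolute_error(y_test, y_pred_rf)

# Root Mean Squared Error: penalizes larger pricing errors more heavily
rmse = np.sqrt(mean_squared_error(y_test, y_pred_rf))

print('Random Forest MAE  : ', mae)
print('Random Forest RMSE : ', rmse)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;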
&lt;h4&gt;
  
  
  CatBoost Regressor Model by Yandex:
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;catModel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CatBoostRegressor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;verbose&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;catModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;catModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;r2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;r2_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CatBoost Regressor Model by Yandex r2 score : &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r2&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqon9wtpipxm9ot9kyhr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffqon9wtpipxm9ot9kyhr.png" alt=" " width="800" height="41"&gt;&lt;/a&gt;&lt;br&gt;
We then applied the CatBoost Regressor model, setting the parameters as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;verbose = 0 --&amp;gt; Suppresses the per-iteration training output, so the loss value is not printed for every boosting iteration.&lt;/li&gt;
&lt;li&gt;random_state = 33 --&amp;gt; Fixes the random seed so that the results are reproducible across runs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We achieved a whopping 96.002% accuracy on the test data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pricePredicted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Actual Price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Predicted Price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;pricePredicted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pricePredicted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;pricePredicted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclprnt48xyztj8aneves.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclprnt48xyztj8aneves.png" alt=" " width="354" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can observe that there is very little visible difference between the actual listed price of the car and the price predicted by the CatBoost Regressor model, since it is 96% accurate.&lt;/p&gt;
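
&lt;p&gt;Since CatBoost is a tree-based model, it can also report which features drive the price predictions. Here is a minimal sketch (not part of the original notebook) using the fitted catModel; get_feature_importance is part of the CatBoost API, and the column lookup assumes x_train is a pandas DataFrame:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Feature importances from the fitted CatBoost model, paired with the column names
importances = catModel.get_feature_importance()

ranked = sorted(zip(x_train.columns, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(name, round(score, 2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;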

&lt;h3&gt;
  
  
  Conclusion:
&lt;/h3&gt;

&lt;p&gt;This was part of my Data Science Specialization project. &lt;br&gt;
I did, however, want to apply hyperparameter tuning, which would have improved the model accuracy further; I plan to apply hyperparameter tuning concepts in another project soon. A small sketch of what that could look like follows below.&lt;/p&gt;
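
&lt;p&gt;As a pointer for that future work, here is a minimal, illustrative sketch of how hyperparameter tuning could look with scikit-learn's GridSearchCV, assuming the x_train/y_train split from above; the grid values are only an assumption, not tuned choices:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative search space (an assumption, not taken from the article)
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [7, 9, 11, 13],
}

grid_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=33),
    param_grid=param_grid,
    cv=5,               # 5-fold cross-validation
    scoring='r2',       # same metric as the scores reported above
    n_jobs=-1,
)
grid_search.fit(x_train, y_train)

print('Best parameters :', grid_search.best_params_)
print('Best CV r2 score:', grid_search.best_score_)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;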

&lt;p&gt;Here is my LinkedIn Profile:&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/siddhesh-shankar-b94711210/" rel="noopener noreferrer"&gt;Siddhesh Shankar&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is my GitHub Repository link:&lt;br&gt;
&lt;a href="https://github.com/SiddheshCodeMaster/UK-Used-Cars-Data-Set-EDA-Model-Creation" rel="noopener noreferrer"&gt;100,000 UK Used Car Data Set&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;I hope that you enjoyed the code walkthrough and explanation.&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Thanks for reading; you can reach me by email at &lt;a href="mailto:barnali.siddhesh@gmail.com"&gt;barnali.siddhesh@gmail.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>analytics</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Beginner EDA on Video Game Sales - Kaggle Dataset</title>
      <dc:creator>Siddhesh shankar</dc:creator>
      <pubDate>Fri, 12 Nov 2021 11:28:29 +0000</pubDate>
      <link>https://dev.to/siddheshcodemaster/beginner-eda-on-video-game-sales-kaggle-dataset-3ef2</link>
      <guid>https://dev.to/siddheshcodemaster/beginner-eda-on-video-game-sales-kaggle-dataset-3ef2</guid>
      <description>&lt;h2&gt;
  
  
  Introduction:
&lt;/h2&gt;

&lt;p&gt;Data Science has become a booming field in the last couple of years. It is one of the hottest topics right now, alongside AI/ML and Deep Learning. Since data is abundant and big companies need to know their audience to maximize profits, Data Analysis and Data Science concepts are widely applied. &lt;/p&gt;

&lt;p&gt;In any ML project, the steps are followed like this:&lt;br&gt;
1] Data Collection&lt;br&gt;
2] EDA&lt;br&gt;
3] Feature Engineering&lt;br&gt;
4] Feature Selection&lt;br&gt;
5] Outlier Treatment&lt;br&gt;
6] Model Creation&lt;br&gt;
7] Deployment&lt;br&gt;
I am here to talk about the EDA that I have done on the Video Game Sales data set from Kaggle.&lt;br&gt;
Exploratory Data Analysis is the most crucial part of a Data Science project. It can &lt;strong&gt;&lt;em&gt;"Make Or Break"&lt;/em&gt;&lt;/strong&gt; your project, as most of the important insights as well as predictions come from EDA concepts.&lt;/p&gt;

&lt;p&gt;I have applied Concepts of EDA on Video Game Sales Data set of Kaggle. &lt;br&gt;
Here's the link to my Kaggle Notebook: &lt;br&gt;
&lt;a href="//kaggle.com/siddheshshankarcoder/eda-with-video-game-sales-data/notebook"&gt;Basic EDA on Video Game Sales - Kaggle&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Let's start the project:
&lt;/h2&gt;

&lt;p&gt;First, we need to know what our data set contains and which libraries the project needs.&lt;br&gt;
There are 16,598 records; 2 records were dropped due to incomplete information.&lt;/p&gt;

&lt;p&gt;The Video Game Sales data includes 11 fields/columns. They are: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rank -&amp;gt; It shows the rank of the game.&lt;/li&gt;
&lt;li&gt;Name -&amp;gt; Name of the game.&lt;/li&gt;
&lt;li&gt;Platform -&amp;gt; The platform on which it was published.&lt;/li&gt;
&lt;li&gt;Year -&amp;gt; The year in which the game was first released.&lt;/li&gt;
&lt;li&gt;Genre -&amp;gt; The genre or topic on which the game is based.&lt;/li&gt;
&lt;li&gt;Publisher -&amp;gt; The publisher who published the game in the gaming market.&lt;/li&gt;
&lt;li&gt;NA_Sales -&amp;gt; The Sales of the game in the North American Gaming Market (in millions)&lt;/li&gt;
&lt;li&gt;EU_Sales -&amp;gt; The Sales of the game in the European Union Gaming Market (in millions)&lt;/li&gt;
&lt;li&gt;JP_Sales -&amp;gt; The Sales of the game in the Japanese Gaming Market (in millions)&lt;/li&gt;
&lt;li&gt;Other_Sales -&amp;gt; The Sales of the game in other gaming markets (rest of Asia, the Arab world, Russia, etc.) (in millions)&lt;/li&gt;
&lt;li&gt;Global_Sales -&amp;gt; The Total Sales of the game in the Gaming Market (in millions)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, &lt;strong&gt;&lt;em&gt;we import the libraries:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt; &lt;span class="c1"&gt;# linear algebra
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt; &lt;span class="c1"&gt;# data processing, CSV file I/O (e.g. pd.read_csv)
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt; &lt;span class="c1"&gt;# for data visualization
&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt; &lt;span class="c1"&gt;# for advanced data visualizations
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After importing the libraries, we &lt;strong&gt;&lt;em&gt;import the data set:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;vgSales&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;../input/videogamesales/vgsales.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the data file was a .csv file i.e. a comma separated value file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;VariableName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_name.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Comma Separated Values means that the values in each record are separated by a comma, i.e. ','.&lt;br&gt;
Here, to see the first 5 records of the data frame we created using the Pandas library, I have used the .head() function.&lt;br&gt;
Now, after importing the data set successfully, we are ready to move on to data cleaning.&lt;/p&gt;

&lt;p&gt;To get information from the data about its column names, its data types, Null Count etc., we use .info() and .isna().sum():&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can observe that the &lt;strong&gt;Year&lt;/strong&gt; column has 271 NA records and the &lt;strong&gt;Publisher&lt;/strong&gt; column has 58 NA records, which makes a total of 329 NA records in the data set.&lt;br&gt;
Hence, we need to handle these before visualizing the data and drawing useful insights from it.&lt;/p&gt;
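
&lt;p&gt;Before deciding how to handle them, a quick, illustrative way to peek at the affected rows, assuming the vgSales DataFrame loaded above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rows where the Year value is missing
print(vgSales[vgSales['Year'].isnull()].head())

# Rows where the Publisher value is missing
print(vgSales[vgSales['Publisher'].isnull()].head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;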

&lt;p&gt;Apart from the NA values, we can also find other anomalies present in the data set.&lt;/p&gt;

&lt;p&gt;This data set holds records only up to the year 2017. But when we find the maximum value of the &lt;strong&gt;Year&lt;/strong&gt; column, we get a crucial anomaly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Max Year Value: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this code we can get the maximum value of any column we want. It showed that the &lt;strong&gt;Year&lt;/strong&gt; column had a maximum value of 2020. I then pulled out that record to investigate it carefully.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;maxEntry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;idxmax&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;iloc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;maxEntry&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I observed that it returned a game record that was not actually released in 2020 but was published by Ubisoft in 2009. This was the output:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2bjj1bxu3b9cx91i6vi2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2bjj1bxu3b9cx91i6vi2.png" alt="Output" width="611" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I then replaced the value with the correct one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;2020.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;2009.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Max Year Value: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, this was just one anomaly. I found out that there were many anomalies like these. So, I decided to replace them with the year 2009, as the visualization of the data showed that 2009 was the year in which most of the games were produced or released.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;YearAnamoly&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;()][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The year records having such anomaly: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;YearAnamoly&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There were 233 such records.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The most games produced in a specific year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mip7kfkieylyeacqm3w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6mip7kfkieylyeacqm3w.png" alt="The most games produced in the year" width="705" height="668"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;2009.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this, all the NA values were removed from the &lt;strong&gt;Year&lt;/strong&gt; column. I then converted the Year column's data type from float64 to int64 like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Year&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;int64&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I then removed the 58 NA records in &lt;strong&gt;Publisher&lt;/strong&gt; using the .dropna() function, since they were unnecessary noise in the data and dropping them would not affect the visualization.&lt;/p&gt;
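
&lt;p&gt;A minimal sketch of that step, assuming the same vgSales DataFrame (the exact call in the notebook may differ):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Drop only the rows whose Publisher is missing; other columns are untouched
vgSales = vgSales.dropna(subset=['Publisher'])

# Sanity check: should now print 0
print(vgSales['Publisher'].isnull().sum())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;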

&lt;p&gt;After cleaning the anomalies and NA records, I looked at reducing the skewness of the numeric columns such as &lt;strong&gt;NA_Sales&lt;/strong&gt;, &lt;strong&gt;EU_Sales&lt;/strong&gt;, &lt;strong&gt;JP_Sales&lt;/strong&gt;, &lt;strong&gt;Other_Sales&lt;/strong&gt; and &lt;strong&gt;Global_Sales&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Skew Count of NA_Sales Column is:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NA_Sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;skew&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Skew Count of EU_Sales Column is:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EU_Sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;skew&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Skew Count of JP_Sales Column is:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JP_Sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;skew&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The Skew Count of Other_Sales Column is:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Other_Sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;skew&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dvb3ojgvqmsc735ksrq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dvb3ojgvqmsc735ksrq.png" alt="Skewness Output" width="646" height="187"&gt;&lt;/a&gt;&lt;br&gt;
This was the output, which showed that all these columns are highly positively skewed.&lt;br&gt;
So, we normalize the data with the help of a square-root transformation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NA_Sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NA_Sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EU_Sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EU_Sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JP_Sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JP_Sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Other_Sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Other_Sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Global_Sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NA_Sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EU_Sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JP_Sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Other_Sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that the cleaning and normalization of the data were over, the Exploratory Data Analysis and Visualization could start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Exploratory Data Analysis and Visualization:
&lt;/h2&gt;

&lt;h3&gt;
  
  
  To get to know which genre sold the most globally:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Genre&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Global_Sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Genre&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsb0x3h468q2uuwy6zr2h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsb0x3h468q2uuwy6zr2h.png" alt="which genre sold the most globally" width="257" height="440"&gt;&lt;/a&gt;&lt;br&gt;
The genre-wise Global Sales totals were computed with the help of the .groupby() function.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;The best-selling genre of games globally is the Action genre. The second highest-selling genre globally is the Sports genre.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
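
&lt;p&gt;As a quick visual companion to that table, the same grouped totals could be plotted as a bar chart; this is a small sketch assuming the matplotlib import from the start of the notebook:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Total Global_Sales per genre, sorted so the best-selling genre comes first
genreSales = vgSales[['Genre', 'Global_Sales']].groupby('Genre').sum()
genreSales = genreSales.sort_values('Global_Sales', ascending=False)

plt.figure(figsize=(12, 6))
plt.bar(genreSales.index, genreSales['Global_Sales'])
plt.title('Global Sales by Genre')
plt.xticks(rotation=45)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;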

&lt;h3&gt;
  
  
  To get the most games produced in a specific Gaming Platform:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;figure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vgSales&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Most Games produced in Specific Gaming Platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcp2srqa5j66hhb4ns36b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcp2srqa5j66hhb4ns36b.png" alt="the most games produced in a specific Gaming Platform" width="728" height="670"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;We can see that DS and PS2 are tied for the top spot in this.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion:
&lt;/h2&gt;

&lt;p&gt;With this, I did the basic EDA of this data set. &lt;br&gt;
Here is my LinkedIn profile: &lt;br&gt;
&lt;a href="https://www.linkedin.com/in/siddhesh-shankar-b94711210/" rel="noopener noreferrer"&gt;Siddhesh Shankar&lt;/a&gt;&lt;br&gt;
Here is my GitHub repository link:&lt;br&gt;
&lt;a href="https://github.com/SiddheshCodeMaster/EDA-with-Video-Game-Sales---Kaggle" rel="noopener noreferrer"&gt;SiddheshCodeMaster&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy Data Cleaning ! Happy Exploration ! Happy Learning !&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>analytics</category>
      <category>python</category>
      <category>jupyter</category>
    </item>
    <item>
      <title>Learning MySQL with basic SQL Commands</title>
      <dc:creator>Siddhesh shankar</dc:creator>
      <pubDate>Fri, 08 Oct 2021 06:52:07 +0000</pubDate>
      <link>https://dev.to/siddheshcodemaster/learning-mysql-with-sql-commands-eb0</link>
      <guid>https://dev.to/siddheshcodemaster/learning-mysql-with-sql-commands-eb0</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fencrypted-tbn0.gstatic.com%2Fimages%3Fq%3Dtbn%3AANd9GcQMytSZWMtAirIFDn4DSx9_vylkmvs8Kzw8pQ%26usqp%3DCAU" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fencrypted-tbn0.gstatic.com%2Fimages%3Fq%3Dtbn%3AANd9GcQMytSZWMtAirIFDn4DSx9_vylkmvs8Kzw8pQ%26usqp%3DCAU" alt="alt text" width="225" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Introduction:
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;MySQL is an open-source relational database management system. MySQL is free and open-source. MySQL is ideal for both small and large applications.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To interact with a MySQL database or MySQL server, &lt;strong&gt;&lt;em&gt;we have to learn Structured Query Language (SQL).&lt;/em&gt;&lt;/strong&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Installation:
&lt;/h3&gt;

&lt;p&gt;But before we start, here are the links documenting the installation of MySQL and its components:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.mysql.com/doc/refman/8.0/en/installing.html" rel="noopener noreferrer"&gt;MySQL Installation Documentation Official&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.javatpoint.com/how-to-install-mysql" rel="noopener noreferrer"&gt;MySQL Installation Step By Step Guide - Javatpoint&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Awesome Stuff:
&lt;/h3&gt;

&lt;p&gt;There are different types of Commands in SQL:&lt;br&gt;
1] Data Definition Language (DDL) Commands&lt;br&gt;
2] Data Manipulation Language (DML) Commands&lt;br&gt;
3] Data Control Language (DCL) Commands&lt;br&gt;
4] Transaction Control Language (TCL) Commands&lt;br&gt;
5] Data Query Language (DQL) Commands&lt;/p&gt;

&lt;h4&gt;
  
  
  DATA DEFINITION LANGUAGE (DDL) Commands:
&lt;/h4&gt;

&lt;p&gt;When we want to define the structure of the table or database which we create, DDL Commands come in handy. &lt;/p&gt;

&lt;h5&gt;
  
  
  CREATE Command:
&lt;/h5&gt;

&lt;p&gt;&lt;strong&gt;It is very necessary to create a database first before going ahead to create tables.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CREATE DATABASE databaseName;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After the creation of the database, we have to tell the MySQL Command Line Client that we are going to work on the above-created database. We can do so like this:&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;code&gt;USE databaseName;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now, we can start our MySQL learning journey.&lt;/p&gt;

&lt;p&gt;MySQL keeps data in the form of tables. There can be more than one table in a database. We query for the data through the tables. Most of the commands work on the table or tables we create in the database. &lt;br&gt;
So to create a table, we can do this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CREATE TABLE tableName(columnName1 dataType1, columnName2 dataType2, ...);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We can have any number of columns, and specifying the data type of each column is necessary.&lt;br&gt;
The data types used in MySQL are shown in this link: &lt;br&gt;
&lt;a href="https://www.w3schools.com/sql/sql_datatypes.asp" rel="noopener noreferrer"&gt;Data types in MySQL Documentation&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  DROP Command:
&lt;/h5&gt;

&lt;p&gt;Sometimes we realize that a table we created is not necessary, as its purpose is already served by another table.&lt;br&gt;
In that case we use the DROP command.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;DROP TABLE TableName;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Also, we can delete the whole database.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;DROP DATABASE databaseName;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Though it is highly unlikely that we will have to delete an entire database with records in it, DROP comes in very handy when we have to delete a table. Remember, if another table references the current table through a foreign key, we first have to drop that referencing table. After dropping it, we can safely drop the current table; otherwise MySQL will produce an error.&lt;/p&gt;
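
&lt;p&gt;For example, with two hypothetical tables where &lt;code&gt;orders&lt;/code&gt; references &lt;code&gt;customers&lt;/code&gt; through a foreign key, the referencing table is dropped first:&lt;br&gt;
&lt;code&gt;DROP TABLE orders;&lt;/code&gt;&lt;br&gt;
&lt;code&gt;DROP TABLE customers;&lt;/code&gt;&lt;/p&gt;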

&lt;h5&gt;
  
  
  ALTER Command:
&lt;/h5&gt;

&lt;p&gt;Now, to change the structure of our table, such as a column's data type, we can use the ALTER command. ALTER can be used not only to modify existing columns but also to add a new column to the table, like this:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ALTER TABLE tableName ADD columnName columnDefinition;&lt;/code&gt;&lt;br&gt;
&lt;code&gt;ALTER TABLE tableName MODIFY columnName columnDefinition;&lt;/code&gt;&lt;br&gt;
example: &lt;br&gt;
&lt;code&gt;ALTER TABLE details ADD (Address VARCHAR(50));&lt;/code&gt;&lt;br&gt;
&lt;code&gt;ALTER TABLE details MODIFY Name VARCHAR(30);&lt;/code&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  TRUNCATE Command:
&lt;/h5&gt;

&lt;p&gt;In MySQL, the data inserted into a table is stored in the form of rows. These records can be inserted as well as deleted.&lt;br&gt;
So, to delete all records from the table, we can use truncate in this way: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;TRUNCATE TABLE tableName;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;But use TRUNCATE carefully, as it removes every row and also frees up the space that was taken by the records.&lt;/p&gt;

&lt;p&gt;With the TRUNCATE command, we have completed the data definition language (DDL) commands.&lt;/p&gt;

&lt;h4&gt;
  
  
  DATA MANIPULATION LANGUAGE (DML) COMMANDS:
&lt;/h4&gt;

&lt;p&gt;DML commands are used to make changes to the data in the database.&lt;br&gt;
But there is a catch: within a transaction, DML changes are not permanent until they are committed, which means they can still be rolled back.&lt;/p&gt;

&lt;h5&gt;
  
  
  INSERT command:
&lt;/h5&gt;

&lt;p&gt;The command we are all most eager to use is the INSERT command. It answers the question: &lt;br&gt;
&lt;strong&gt;&lt;em&gt;How to insert a record into the table?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;INSERT INTO tableName(Col1,Col2,Col3,....,Coln) VALUES(Val1, Val2, Val3,....,Valn);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;INSERT INTO detail(Name,Rollnumber,City,Country) VALUES('Siddhesh Shankar',19070122168,'Mumbai','India'),('John Jones', 1907134444,'Amsterdam','Netherlands');&lt;/code&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  UPDATE Command:
&lt;/h5&gt;

&lt;p&gt;The UPDATE command is used to change existing records in the table. Incorrect records can cause a lot of problems while querying, so as soon as a wrong record is put into the database, it needs to be rectified immediately.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;UPDATE tableName SET ColumnName1 = Val1, ColumnName2 = value2, ..... ColumnName n = valn WHERE condition;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;UPDATE details SET Name = 'Roger Kook', City= 'Riga' WHERE ID= 1;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Here, the WHERE clause condition is very important. Without the correct condition, we can end up updating far more rows than intended, which causes even bigger problems in the database.&lt;/p&gt;

&lt;h5&gt;
  
  
  DELETE Command:
&lt;/h5&gt;

&lt;p&gt;Now, after seeing the UPDATE as well as the TRUNCATE command, I wondered whether there is a command that understands that a user might want to delete only some rows instead of all the records in the table. &lt;/p&gt;

&lt;p&gt;Yes, there is: the DELETE command. &lt;br&gt;
Again, the WHERE clause condition is very important. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;DELETE FROM TableName WHERE Condition;&lt;/code&gt;&lt;/p&gt;
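
&lt;p&gt;Example (using the same hypothetical details table as above):&lt;br&gt;
&lt;code&gt;DELETE FROM details WHERE City = 'Mumbai';&lt;/code&gt;&lt;br&gt;
Only the rows matching the condition are removed; the rest of the table stays intact.&lt;/p&gt;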

&lt;h4&gt;
  
  
  TRANSACTION CONTROL LANGUAGE (TCL) Commands:
&lt;/h4&gt;

&lt;p&gt;TCL commands control transactions, i.e. they decide whether the changes made by DML commands are made permanent or undone. (DCL commands such as GRANT and REVOKE, which grant and take back authority from a database user, are a separate category and are not covered here.)&lt;/p&gt;

&lt;h5&gt;
  
  
  COMMIT Command:
&lt;/h5&gt;

&lt;p&gt;The COMMIT command is used to permanently save all the changes we have made to the database.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;COMMIT;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;DELETE FROM details WHERE ID = 1;&lt;br&gt;
COMMIT;&lt;/code&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  ROLLBACK Command:
&lt;/h5&gt;

&lt;p&gt;The ROLLBACK command undoes changes that have not been committed (saved) yet. It applies to DML changes such as INSERT, UPDATE and DELETE; note that a DDL command like TRUNCATE cannot be rolled back in MySQL.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ROLLBACK;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;DELETE FROM details WHERE ID = 1;&lt;br&gt;
ROLLBACK;&lt;/code&gt; &lt;/p&gt;

&lt;h5&gt;
  
  
  SAVEPOINT Command:
&lt;/h5&gt;

&lt;p&gt;We can create a save point, which means we save the state of our changes so that if we later want to ROLLBACK to that spot, we can do so easily. &lt;br&gt;
It is like a game where we save our progress before the next mission: if we fail the mission, we restart from the last save point instead of playing the whole game again. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;SAVEPOINT savepointName;&lt;/code&gt;&lt;/p&gt;
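
&lt;p&gt;To return to a previously created save point, ROLLBACK is combined with the save point name; a small illustrative example:&lt;br&gt;
&lt;code&gt;SAVEPOINT beforeDelete;&lt;/code&gt;&lt;br&gt;
&lt;code&gt;DELETE FROM details WHERE ID = 1;&lt;/code&gt;&lt;br&gt;
&lt;code&gt;ROLLBACK TO beforeDelete;&lt;/code&gt;&lt;/p&gt;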

&lt;h4&gt;
  
  
  DATA QUERY LANGUAGE (DQL) Command:
&lt;/h4&gt;

&lt;p&gt;Here it gets super exciting as we get to work through the records in the database via querying.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;There is only the SELECT command, but there are plenty of variations with different clauses that can be used with it.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;If we want to get all the records in a specific table:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;SELECT * FROM TableName;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;But sometimes we need to fetch records specific to a particular use case. This is where the clauses come in. Some of them are: &lt;/p&gt;

&lt;h5&gt;
  
  
  LIKE Clause:
&lt;/h5&gt;

&lt;p&gt;LIKE is used when we want to perform pattern matching. &lt;br&gt;
&lt;code&gt;SELECT * FROM tableName WHERE columnName LIKE 'Condition';&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Here the condition is of utmost importance, as the pattern matching is performed according to it. It uses wildcards like % and _.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Here are some examples of how LIKE Clause can be used:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT * FROM details WHERE Name LIKE 'a%';&lt;/code&gt;&lt;br&gt;
This shows all the records in the table details whose names start with 'a'.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT * FROM details WHERE Name LIKE '%s';&lt;/code&gt;&lt;br&gt;
This shows all the records in the table details whose names end with s.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT * FROM details WHERE Name LIKE '_t%';&lt;/code&gt;&lt;br&gt;
This shows all the records in the table details whose names have 't' in the second position.&lt;/p&gt;

&lt;p&gt;Lastly, &lt;br&gt;
&lt;code&gt;SELECT * FROM details WHERE Name LIKE 't___%';&lt;/code&gt;&lt;br&gt;
This shows all the records in the table details whose names start with 't' and are at least 4 characters long.&lt;/p&gt;

&lt;p&gt;With the use of wildcards, we can get the records from the table easily.&lt;/p&gt;

&lt;h5&gt;
  
  
  ORDER BY Clause:
&lt;/h5&gt;

&lt;p&gt;The ORDER BY clause helps a lot when we want the records in a sorted order. &lt;br&gt;
Ascending or descending order gives a lot of information. For example, if we want the maximum of the price column, we can either use the SQL maximum function, i.e. &lt;strong&gt;MAX()&lt;/strong&gt;, or we can display the whole set of records with the highest price in the first spot.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT * FROM TableName ORDER BY col1,col2,...,coln ASC;&lt;/code&gt; &lt;br&gt;
This is for the ascending order format.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT * FROM TableName ORDER BY col1,col2,...,coln DESC;&lt;/code&gt; &lt;br&gt;
This is for the descending order format.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;code&gt;SELECT name, prices, id FROM details ORDER BY prices DESC;&lt;/code&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  LIMIT Clause:
&lt;/h5&gt;

&lt;p&gt;The LIMIT clause is used to specify the maximum number of records to return.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT ColName FROM TableName WHERE condition LIMIT number;&lt;/code&gt;&lt;br&gt;
for example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT * FROM details WHERE Name LIKE 'a%' LIMIT 5;&lt;/code&gt;&lt;br&gt;
This query will return at most 5 records that match the condition.&lt;/p&gt;

&lt;p&gt;So, these were some of the basic commands I learned first when I started learning SQL in the MySQL Command Line Client.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/siddhesh-shankar-b94711210/" rel="noopener noreferrer"&gt;Here is my LinkedIn Profile Link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Happy Querying ! Happy Learning !&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>mysql</category>
      <category>sql</category>
      <category>database</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
