These days, there are tons of options available for exploring datasets. One of my favorite environments is R Studio. It's free, open-source, and powerful. R Studio is an IDE for R, a popular, long-established programming language tailored toward data analysis and visualization. R has thousands of packages that are particularly useful for data analysis, with specialized functions for a variety of scientific disciplines.
In this article, we will learn how to convert a .csv file into a data frame in R Studio, and explore the data via the data frame's built-in capabilities.
Go ahead and download and install the latest version of R for your operating system, and then install R Studio.
Now that you have R Studio installed, the first thing we're going to do is install a handy general purpose library for manipulating data in R.
install.packages("dplyr")
We will need some sample data in .csv form. You can use whatever .csv file happens to be on hand (there are plenty available for free on the internet). I grabbed a .csv file with 10k records; see the link here:
http://eforexcel.com/wp/wp-content/uploads/2017/07/10000-Sales-Records.zip
The basic ingredients are in play. Now we need to load our .csv file into R.
First, check our R Studio working directory. In the R Studio console, type
getwd()
My working directory is Documents, so I'm going to place my .csv file in my Documents folder. Then, I will go ahead and load that .csv file, and convert it into a native R data frame.
# Assumes the file was saved as my_records.csv in the working directory
my_records <- read.csv("my_records.csv")
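If you want to try the load step before downloading anything, it can be sketched end to end with a throwaway file. This is only an illustration: the `sample_sales` data frame and its values are made up, and the headers are assumed to contain spaces so you can see why the column names later in this article use dots (`read.csv` converts spaces in headers to dots by default).

```r
# A tiny, made-up data set; check.names = FALSE keeps the spaces in the headers
sample_sales <- data.frame(
  Region         = c("Europe", "Asia", "Europe"),
  "Total Cost"   = c(500.25, 1500.00, 2500.50),
  "Total Profit" = c(100.00, 300.50, 450.25),
  check.names    = FALSE
)

# Write it to a temporary .csv file, then read it back the same way
# you would read a file from your working directory
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_sales, csv_path, row.names = FALSE)

loaded <- read.csv(csv_path)

# Headers with spaces come back as Total.Cost and Total.Profit
colnames(loaded)

# Inspect the column types that read.csv inferred
str(loaded)
```

Running `str()` right after loading is a quick way to confirm that numeric columns were actually read as numbers rather than text.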
Now, my data is living in a data frame. This means that I get access to a lot of handy methods.
To view the number of rows
nrow(my_records)
To view the number of columns
ncol(my_records)
To view the column names
colnames(my_records)
To view the first 10 rows of data
head(my_records, 10)
To view the last 10 rows of data
tail(my_records, 10)
To get the sum of the values in a column, by column name
sum(my_records['Total.Profit'])
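Single brackets return a one-column data frame, which `sum()` happens to accept. Many functions expect a plain vector instead, so it helps to know the other two ways of pulling out a column. A short sketch on a made-up data frame (the `sales` name and its values are illustrative only):

```r
sales <- data.frame(Total.Profit = c(100, 300.5, 450.25))

one_col_df  <- sales["Total.Profit"]    # still a data frame with one column
as_vector   <- sales$Total.Profit       # plain numeric vector
also_vector <- sales[["Total.Profit"]]  # same vector, column name given as a string

# Both forms give the same total
sum(one_col_df)
sum(as_vector)
```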
Sample 5 random rows
dplyr::sample_n(my_records, 5)
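Random sampling returns different rows on every run. If you want repeatable output, setting a seed first helps. Here is a minimal base R sketch of the idea on a made-up data frame (the `sales` name, its values, and the seed 42 are all arbitrary choices for illustration):

```r
sales <- data.frame(
  Order.ID   = 1:20,
  Total.Cost = seq(100, 2000, by = 100)
)

# Fix the random seed so the same rows are drawn each time
set.seed(42)
first_draw <- sales[sample(nrow(sales), 5), ]

# Resetting the same seed reproduces the same draw
set.seed(42)
second_draw <- sales[sample(nrow(sales), 5), ]

identical(first_draw, second_draw)
```

Calling `set.seed()` before `dplyr::sample_n()` should have the same effect, since dplyr draws from R's own random number generator.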
Filter data frame by column values, where Total Cost is greater than 1000
dplyr::filter(my_records, Total.Cost > 1000)
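For comparison, the same kind of row filter can be written in base R without dplyr. A minimal sketch on a made-up data frame (the `sales` name, the `Region` column, and the values are assumptions for illustration):

```r
sales <- data.frame(
  Region     = c("Europe", "Asia", "Europe", "Africa"),
  Total.Cost = c(500, 1500, 2500, 900)
)

# Logical indexing: keep the rows where the condition is TRUE
expensive <- sales[sales$Total.Cost > 1000, ]

# subset() does the same and lets you refer to columns by bare name;
# conditions can be combined with & (and) or | (or)
expensive_europe <- subset(sales, Total.Cost > 1000 & Region == "Europe")
```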
Remove Duplicate Rows
dplyr::distinct(my_records)
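Before dropping duplicates, it can be useful to know how many there are. A small base R sketch on a made-up data frame (names and values are illustrative only):

```r
sales <- data.frame(
  Order.ID   = c(1, 2, 2, 3),
  Total.Cost = c(500, 1500, 1500, 900)
)

# duplicated() flags each row that repeats an earlier row
n_dupes <- sum(duplicated(sales))

# unique() keeps the first occurrence of each row
deduped <- unique(sales)
```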
Select the 5 records with the highest Total.Cost (note that top_n() does not sort the result)
dplyr::top_n(my_records, 5, Total.Cost)
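`top_n()` keeps the rows with the largest values but leaves them in their original order. If you want the result sorted as well, one option is base R's `order()`. A minimal sketch on a made-up data frame, keeping the top 3 rather than 5 to keep the example small:

```r
sales <- data.frame(
  Order.ID   = c(1, 2, 3, 4, 5, 6),
  Total.Cost = c(500, 1500, 2500, 900, 3200, 50)
)

# Sort by Total.Cost, largest first, then keep the first 3 rows
sorted <- sales[order(sales$Total.Cost, decreasing = TRUE), ]
top3 <- head(sorted, 3)
```

In recent dplyr versions, `slice_max()` is the suggested replacement for `top_n()`, and it returns the rows already sorted.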
Average of Total.Cost
dplyr::summarise(my_records, mean(Total.Cost))
Min value in column Total.Cost
dplyr::summarise(my_records, min(Total.Cost))
Max value in column Total.Cost
dplyr::summarise(my_records, max(Total.Cost))
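The three summaries can also be computed in a single `summarise()` call by naming each result. A short sketch, assuming dplyr is installed and using a made-up data frame (the `sales` name and the `avg_cost`, `min_cost`, and `max_cost` labels are my own choices, not anything dplyr requires):

```r
sales <- data.frame(Total.Cost = c(500, 1500, 2500, 900))

# Each named argument becomes one column in the one-row result
stats <- dplyr::summarise(
  sales,
  avg_cost = mean(Total.Cost),
  min_cost = min(Total.Cost),
  max_cost = max(Total.Cost)
)

stats
```

If a column contains missing values, passing `na.rm = TRUE` to `mean()`, `min()`, and `max()` keeps the results from coming back as `NA`.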