
Jason M


A Beginner's guide to examining .CSV data in R Studio

These days, there are tons of options available for exploring datasets. One of my favorite environments is R Studio. It's free, open-source, and powerful. R Studio is an IDE for R, a long-established programming language designed for data analysis and visualization. R has thousands of packages, many of them with specialized functions for a particular scientific discipline.

In this article, we will learn how to convert a .csv file into a data frame in R Studio, and explore the data via the data frame's built-in capabilities.

Go ahead and download and install the latest version of R for your operating system, and then install R Studio.

Now that you have R Studio installed, the first thing we're going to do is install dplyr, a handy general-purpose package for manipulating data in R. In the R Studio console, run:


install.packages("dplyr")

We will need some sample data in .csv form. You can use whatever .csv file happens to be on hand (there are plenty available for free on the internet). I grabbed a .csv file with 10k records from the link below:

http://eforexcel.com/wp/wp-content/uploads/2017/07/10000-Sales-Records.zip

The basic ingredients are in place, so now we need to load our .csv file into R.

First, check the R Studio working directory. In the R Studio console, type:

getwd()

My working directory is my Documents folder, so I'm going to place my .csv file there. Then I will load that .csv file with read.csv(), which converts it into a native R data frame.

# assumes the file was saved as my_records.csv in the working directory
my_records <- read.csv("my_records.csv")
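If you want to try the loading step before downloading anything, here is a minimal sketch that writes a tiny .csv to a temporary file and reads it back. The rows, the column headers, and the `sample_records` name are made up for illustration; they are not taken from the real dataset.

```r
# Hypothetical sample data, written to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Region,Total Cost,Total Profit",
  "Europe,1200.50,300.25",
  "Asia,800.00,150.75"
), csv_path)

# Confirm the file is where we expect before trying to read it
file.exists(csv_path)

# read.csv() returns a data frame. By default it rewrites headers such as
# "Total Cost" into syntactically valid names such as "Total.Cost"
sample_records <- read.csv(csv_path)
colnames(sample_records)
```

That header rewriting is why a column can be referred to with a dot, as in Total.Cost, even when the header in the file contains a space.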

Now my data lives in a data frame, which gives me access to a lot of handy functions.

To view the number of rows

 nrow(my_records)

To view the number of columns

 ncol(my_records)

To view the column names

 colnames(my_records)

To view the first 10 rows of data

  head(my_records, 10)

To view the last 10 rows of data

  tail(my_records, 10)
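A few more base R functions are useful for a first look. The sketch below uses a small made-up data frame (`sample_records` and its values are hypothetical) so it can be run on its own; the same calls should work on my_records.

```r
# Hypothetical stand-in for the real data
sample_records <- data.frame(
  Region = c("Europe", "Asia", "Europe"),
  Total.Cost = c(1200.50, 800.00, 950.25),
  Total.Profit = c(300.25, 150.75, 210.00)
)

# Rows and columns in a single call
dim(sample_records)

# Column types plus the first few values of each column
str(sample_records)

# Per-column summary statistics (min, quartiles, mean, max for numeric columns)
summary(sample_records)
```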

To get the sum of the values in a column, by column name

  sum(my_records['Total.Profit'])
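Single brackets with a column name give back a one-column data frame, while `$` gives back a plain vector. As a small sketch with made-up values (the `sample_records` data frame is hypothetical), here is the vector form, along with what happens when a value is missing:

```r
# Hypothetical column with one missing value
sample_records <- data.frame(Total.Profit = c(300.25, NA, 210.00))

# `$` extracts the column as a numeric vector
profit <- sample_records$Total.Profit

# A single NA makes the whole sum NA
sum(profit)

# na.rm = TRUE drops missing values before summing
total <- sum(profit, na.rm = TRUE)
total
```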

Sample 5 random rows

  dplyr::sample_n(my_records, 5)

Filter data frame by column values, where Total Cost is greater than 1000

  dplyr::filter(my_records, Total.Cost > 1000)
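Filters can take more than one condition. Here is a minimal sketch on a made-up data frame (`sample_records`, its values, and the Region column are hypothetical), assuming dplyr is installed:

```r
# Hypothetical stand-in for the real data
sample_records <- data.frame(
  Region = c("Europe", "Asia", "Europe"),
  Total.Cost = c(1200.50, 800.00, 950.25)
)

# Comma-separated conditions are combined with AND
expensive_europe <- dplyr::filter(
  sample_records,
  Total.Cost > 900,
  Region == "Europe"
)

# filter() returns a new data frame; the original is left unchanged
nrow(expensive_europe)
nrow(sample_records)
```

To keep the filtered rows around, assign the result to a variable as above.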

Remove Duplicate Rows

  dplyr::distinct(my_records)
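To see how many duplicates were actually dropped, compare row counts before and after. A small sketch with made-up data (`sample_records` and its values are hypothetical), assuming dplyr is installed:

```r
# Hypothetical data where the first and third rows are identical
sample_records <- data.frame(
  Region = c("Europe", "Asia", "Europe"),
  Total.Cost = c(1200.50, 800.00, 1200.50)
)

# Keep only one copy of each fully identical row
deduplicated <- dplyr::distinct(sample_records)

# Number of duplicate rows that were dropped
nrow(sample_records) - nrow(deduplicated)

# Naming a column returns the distinct values of just that column
regions <- dplyr::distinct(sample_records, Region)
regions
```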

Select the 5 records with the highest Total.Cost (the result keeps the original row order, and ties can return extra rows)

  dplyr::top_n(my_records, 5, Total.Cost)
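In recent versions of dplyr, top_n() is marked as superseded in favor of slice_max() and slice_min(). Here is a sketch of the same selection with slice_max(), assuming dplyr 1.0.0 or later and using made-up data (`sample_records`, Order.ID, and the values are hypothetical):

```r
# Hypothetical stand-in for the real data
sample_records <- data.frame(
  Order.ID = 1:6,
  Total.Cost = c(500, 1200, 800, 300, 950, 1100)
)

# Pick the 5 rows with the largest Total.Cost;
# slice_max() returns them ordered from highest to lowest
top_by_cost <- dplyr::slice_max(sample_records, order_by = Total.Cost, n = 5)
top_by_cost
```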

Average of Total.Profit

  dplyr::summarise(my_records, mean(Total.Profit))

Min value in column Total.Profit

  dplyr::summarise(my_records, min(Total.Profit))

Max value in column Total.Profit

  dplyr::summarise(my_records, max(Total.Profit))
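The three summaries can also be computed in a single summarise() call, with a name for each result. A minimal sketch on made-up values (`sample_records` and the output column names are hypothetical), assuming dplyr is installed:

```r
# Hypothetical stand-in for the real data
sample_records <- data.frame(Total.Profit = c(300.25, 150.75, 210.00))

# Each named argument becomes one column in the one-row result
profit_summary <- dplyr::summarise(
  sample_records,
  mean_profit = mean(Total.Profit),
  min_profit = min(Total.Profit),
  max_profit = max(Total.Profit)
)
profit_summary
```

Naming the results makes the output easier to read than the default column names, which are just the text of each expression.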
