Importing data into R is a task in itself. There seems to be a different importing function for almost every file format, and it is easy to mix up those functions and their arguments. But once you know which packages and commands to use for which file types, it is fairly simple. To help you with this, here is a comprehensive tutorial for your reference, covering the commands to import files into your R session.
So, next time you find yourself searching ‘how to load xyz type of file into R’, you’d know where to look.
We will discuss importing the most common file types such as .txt, .csv, .json, Excel worksheets, SAS/SPSS/Stata datasets, MATLAB files, etc., with examples and code. We will also see how to use ODBC connections to import tables from relational databases directly into R. You are also in for a handy importing hack which I find very useful for quick-and-dirty ad-hoc analysis.
Let’s dive in!!
How to prepare workspace environment before data importing
Generally, files to be loaded for a particular task are stored in a single directory. Further, this directory can be assigned as the “working directory” inside your R environment for ease of importing. To know the location of your current working directory use -
getwd()
You could bring your files to the location returned by getwd(). It could also be that you want to change the working directory to a location of your choice. Then you must use –
setwd("")
R would now know which folder you’re working in. This makes it easier to import and export files using relative path rather than the long absolute path to file location.
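As a small, self-contained sketch of this idea (using a temporary scratch folder in place of a real project directory; the file name sample.csv is invented):

```r
# A minimal sketch: switch the working directory to a scratch folder,
# write a file there, and read it back by its relative name alone.
old_wd <- getwd()
dir <- file.path(tempdir(), "project_data")
dir.create(dir, showWarnings = FALSE)
setwd(dir)

writeLines(c("x,y", "1,2"), "sample.csv")  # lands in the working directory
df <- read.csv("sample.csv")               # a relative path is enough now

setwd(old_wd)                              # restore the original directory
```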
Your workspace environment would often be filled with data and values from the last R session. It is often required to clean the environment before proceeding afresh. You can do it using
rm(list=ls())
It’s better to start with a clean environment. An alternative can be - not saving the workspace on ending the R session. Objects from the previous R sessions can lead to errors which are often very hard to debug. The above command gets rid of all the variables, functions and objects in your current environment. My advice – use it wisely.
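A tiny sketch of what the command does (the objects x and f are throwaways made up for illustration):

```r
# Create a couple of objects, list them, then clear the environment.
x <- 1
f <- function(a) a + 1
ls()            # lists "f" and "x" (among anything else you defined)
rm(list = ls()) # wipes all objects in the current environment
```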
Yayy, you’re finally done with the setup phase!
However, do keep in mind that the setup phase is important for eliminating the majority of kinks you would otherwise come across in subsequent phases of your project. Now, it's time to get down to business.
Loading TXT/CSV/JSON/other separator files into R
Reading text files
Data in text files is of a form where values are separated by a delimiter such as tab, or comma or semicolon. Your file will look similar to the below format –
Category V1 V2
A 3 2
B 5 6
B 2 3
A 4 8
A 7 3
Above is an example of a tab delimited file. Similarly, a comma or semicolon could replace the tab separator as shown in the next section. Flat files like these can be read using the function read.table() –
df <- read.table("", header = FALSE)
(Note that you could skip writing the path in the above example if your file is in the working directory)
read.table() is the most common way of bringing simple files into your workspace. It is also very flexible. For instance, to import files delimited by a character other than tab, you can pass the sep argument indicating which character is the separator.
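For example, a semicolon-delimited file (file name and contents invented here) can be read like this:

```r
# Sketch: write a small semicolon-delimited file, then read it back
# with read.table(), passing sep = ";" and header = TRUE.
path <- file.path(tempdir(), "scores.txt")
writeLines(c("Category;V1;V2", "A;3;2", "B;5;6"), path)

df <- read.table(path, header = TRUE, sep = ";")
str(df)   # 2 observations of 3 variables
```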
Reading CSV files
A CSV file is one where entries are separated by a comma (,) or a semicolon (;). Your file will look like this –
Category,V1,V2
A,3,2
B,5,6
B,2,3
A,4,8
A,7,3
To load files like this, you can use the read.csv() or read.csv2() function. For (,) separated files use read.csv(), and for (;) separated files use read.csv2(). Remember, files of both kinds can also be read using the read.table() function; read.csv() and read.csv2() are essentially just wrapper functions around read.table(). Look at their source code:
read.csv
function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
dec = dec, fill = fill, comment.char = comment.char, ...)
read.csv2
function (file, header = TRUE, sep = ";", quote = "\"", dec = ",",
fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
dec = dec, fill = fill, comment.char = comment.char, ...)
You'll find that read.csv and read.csv2 are just wrapper functions around read.table with a few changes to their default arguments.
Essentially,
df <- read.table("", sep = ",", header = TRUE)
is the same as:
df <- read.csv("")
and
df <- read.table("", sep = ";", dec = ",", header = TRUE)
is the same as:
df <- read.csv2("")
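A quick sketch verifying this equivalence on a throwaway file (the file name and contents are invented):

```r
# Check the wrapper claim: for the same file, read.csv() and
# read.table(sep = ",", header = TRUE) should give identical data frames.
path <- file.path(tempdir(), "demo.csv")
writeLines(c("Category,V1,V2", "A,3,2", "B,5,6"), path)

via_csv   <- read.csv(path)
via_table <- read.table(path, sep = ",", header = TRUE)
identical(via_csv, via_table)   # TRUE
```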
Similarly, for files with other separators, you can use read.table() with the appropriate sep argument, or read.delim() for tab-delimited files.
Copy and paste from files directly into R
This is a quick and dirty hack that can save you a lot of time and effort. Select the data, copy it, and run this command to bring it into R –
df <- read.table("clipboard")
Sometimes, when your data is not in an appropriate format, it won't be read correctly. But this is a good compromise when you want to do some ad-hoc analysis on the fly.
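Note that the "clipboard" connection works out of the box on Windows; on macOS you would read from pipe("pbpaste") instead. When no clipboard is available at all (say, on a server), a related trick is to paste the copied text into a string and read it through textConnection() – a minimal sketch, with the pasted block standing in for whatever you copied:

```r
# Sketch: read a pasted block of text as if it were a file.
pasted <- "Category V1 V2
A 3 2
B 5 6"

df <- read.table(textConnection(pasted), header = TRUE)
df$V2   # 2 6
```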
Packages
We've seen how to import files through base R commands. You can also do this by loading packages and using their functions. In fact, almost all of the more complicated file types require a dedicated package to be imported into R.
Some Workspace Prepping (yet again)
To use the import functions in a package, you first need to install the package and load it into your environment. Simply run the following two commands:
install.packages("")
library("")
Reading JSON files
For reading JSON files into R, you'll need the 'rjson' package. Run this command to download and install the package –
install.packages("rjson")
And load it into the environment by
library(rjson)
Now, you can use the fromJSON() function to read JSON data.
# 1. Import data from a JSON file in your working directory
JsonData <- fromJSON(file = "")
# 2. Import data from a JSON file at a specific URL
JsonData <- fromJSON(file = "")
Tip: The input JSON file is stored in the JsonData object in your R environment as a list. You can view its contents by printing JsonData. You can also convert it into a data frame and print that:
JsonDF <- as.data.frame(JsonData)
print(JsonDF) # prints the final data frame
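A self-contained sketch of the round trip, assuming the rjson package is installed (the file name and contents are invented):

```r
library(rjson)

# Write a tiny JSON file to a temporary location, then read it back.
path <- file.path(tempdir(), "people.json")
writeLines('{"name": ["Ana", "Bo"], "age": [31, 45]}', path)

JsonData <- fromJSON(file = path)    # a named list
JsonDF   <- as.data.frame(JsonData)  # two rows, columns name and age
```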
Importing XML data and data from HTML tables
Mostly while working with data on the World Wide Web, you’ll have to deal with HTML and XML formats.
Both XML and HTML data can be imported easily by the XML package in R. It offers many approaches for both reading and creating XML (and HTML) documents (including DTDs), both local and accessible via HTTP or FTP.
First, install and load the package, just as demonstrated above. Then use the xmlTreeParse() function to parse the XML file, as shown in the code snippet –
# Activate the libraries
library(XML)
library(RCurl)
# Read an XML file by URL
XmlData <- xmlTreeParse("")
# Read an XML file in the working directory
XmlData <- xmlTreeParse("")
Tip: Again, the xmlTreeParse() function returns a list into the XmlData object. The same data can be read in the more usable data-frame format by running the code below –
xmldataframe <- xmlToDataFrame("")
print(xmldataframe)
Another useful function from the XML package reads data from one or more HTML tables. The readHTMLTable() function and its methods provide robust ways of extracting data from the HTML tables in a document, whether that document comes from a file, an (http: or ftp:) URL, or has already been parsed with htmlParse() (XML package). One can also pick out a specific table via the which argument.
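A minimal sketch with an invented in-memory table, assuming the XML package is installed; which = 1 picks the first table and returns it directly as a data frame:

```r
library(XML)

html <- '<html><body><table>
  <tr><th>Category</th><th>V1</th></tr>
  <tr><td>A</td><td>3</td></tr>
  <tr><td>B</td><td>5</td></tr>
</table></body></html>'

doc <- htmlParse(html, asText = TRUE)            # parse the HTML string
tab <- readHTMLTable(doc, which = 1,
                     stringsAsFactors = FALSE)   # first table as a data frame
```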