Lately I have had my hands on some raw, unclean data for a school assignment. Originally I thought that cleaning messy data was about filling in blank values and getting text, numbers, and strings into the right format. But when I went on to analyze my data in R, I found that it could not be handled in the shape it was in. There was a key concept I was missing about setting up data the right way: wide and long data.
What is Wide Data?
In wide data (also known as unstacked data), each attribute of a subject is in a separate column.
Person | Age | Weight |
---|---|---|
Buttercup | 24 | 110 |
Bubbles | 24 | 105 |
Blossom | 24 | 107 |
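As a minimal sketch, the wide table above could be built in R with base `data.frame`. The name `df` matches the data frame used later in this post; the column types (character and numeric) are my assumption, since the table does not state them.

```r
# Build the wide table: one row per person, one column per attribute.
df <- data.frame(
  Person = c("Buttercup", "Bubbles", "Blossom"),
  Age    = c(24, 24, 24),
  Weight = c(110, 105, 107)
)

# Inspect the shape: 3 rows (people) by 3 columns (Person, Age, Weight).
dim(df)
str(df)
```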
What is Long Data?
Long data (also known as narrow or stacked data) has one column containing all the values and another column listing the context of each value.
Person | Variable | Value |
---|---|---|
Buttercup | Age | 24 |
Buttercup | Weight | 110 |
Bubbles | Age | 24 |
Bubbles | Weight | 105 |
Blossom | Age | 24 |
Blossom | Weight | 107 |
It is often easier to do analysis in R with data in the long form. This concept might seem weird at first. We are used to seeing and analyzing data in the wide form, but with practice it gets easier over time. R has an awesome package called reshape2 to convert your data from wide to long.
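To make that a little more concrete, here is a rough sketch using only base R. It types in the long table from above by hand and summarises every measured variable with a single `aggregate` call. The names `long` and `means` are just illustrative, and this is one possible benefit of the long form, not the only one.

```r
# The long table from above, entered by hand.
long <- data.frame(
  Person   = rep(c("Buttercup", "Bubbles", "Blossom"), each = 2),
  Variable = rep(c("Age", "Weight"), times = 3),
  Value    = c(24, 110, 24, 105, 24, 107)
)

# One formula covers every measured variable: the mean of Value
# is computed separately for each level of Variable.
means <- aggregate(Value ~ Variable, data = long, FUN = mean)
means
```

If a new measurement (say, height) were added as extra rows, the same call would pick it up without any changes.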
First, install the R package and load the library.
install.packages("reshape2")
library(reshape2)
Using the wide table above, we will split our variables into two groups: identifier variables and measured variables.
Identifier variable: Person
Measured variables: Age, Weight
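With only three columns it is easy to list the two groups by hand. For wider tables, one possible approach is to name the identifier columns and treat everything else as measured. This is a small base R sketch; `id_vars` and `measure_vars` are hypothetical helper names, not part of reshape2.

```r
# The wide table from this post.
df <- data.frame(
  Person = c("Buttercup", "Bubbles", "Blossom"),
  Age    = c(24, 24, 24),
  Weight = c(110, 105, 107)
)

# Name the identifier column(s); every other column counts as measured.
id_vars <- "Person"
measure_vars <- setdiff(names(df), id_vars)
measure_vars
```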
To transform this wide data into long data, we will use the melt function. You “melt” data so that each row is a unique id-variable combination.
df
Person Age Weight
1 Buttercup 24 110
2 Bubbles 24 105
3 Blossom 24 107
ppg <- melt(df, id.vars = "Person", measure.vars = c("Age", "Weight"))
ppg
Person variable value
1 Buttercup Age 24
2 Bubbles Age 24
3 Blossom Age 24
4 Buttercup Weight 110
5 Bubbles Weight 105
6 Blossom Weight 107
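Going the other way is also possible. As a sketch, reshape2's `dcast` can turn the melted data back into the wide form. The example rebuilds `df` so it runs on its own and assumes reshape2 is installed; `wide` is just an illustrative name.

```r
library(reshape2)

# Rebuild the wide table and melt it, as in the steps above.
df <- data.frame(
  Person = c("Buttercup", "Bubbles", "Blossom"),
  Age    = c(24, 24, 24),
  Weight = c(110, 105, 107)
)
ppg <- melt(df, id.vars = "Person", measure.vars = c("Age", "Weight"))

# Cast back to wide: one row per Person, one column per level of `variable`,
# filled with the numbers stored in the `value` column.
wide <- dcast(ppg, Person ~ variable, value.var = "value")
wide
```

The rows may come back in a different order than in the original table, so compare by Person, not by row position.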
Resources
For official documentation about the reshape library, see the site from its creator, Hadley Wickham.
For more about wide vs. long data, check out The Analysis Factor.
For more information about cleaning and shaping messy data into tidy data, check out Hadley Wickham’s paper Tidy Data.