Chapter 3 Tidy
3.1 Introduction
Reshaping the data is an important (if necessary) step to exploratory data analysis and preparatory data cleaning for modeling or creating specialized visualization. Further concepts about tidy data can be found in this paper Tidy Data.
The principles of tidy data are:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
3.1.1 tidyr
tidyr
: An evolution of ‘reshape2’. It’s designed specifically for data tidying (not general reshaping or aggregating) and works well with ‘dplyr’ data pipelines. Tidy data complements R’s vectorized operations. R will automatically preserve observations as you manipulate variables. No other format works as intuitively with R.
3.1.1.1 Examples
library(tidyr)
library(insuranceData)
library(magrittr)
data("AutoBi")
head(AutoBi)
## CASENUM ATTORNEY CLMSEX MARITAL CLMINSUR SEATBELT CLMAGE LOSS
## 1 5 1 1 NA 2 1 50 34.940
## 2 13 2 2 2 1 1 28 10.892
## 3 66 2 1 2 2 1 5 0.330
## 4 71 1 1 1 2 2 32 11.037
## 5 96 2 1 4 2 1 30 0.138
## 6 97 1 2 1 2 1 35 0.309
# create an interaction column
g <- AutoBi %>% unite(col = MARITAL_CLMAGE, MARITAL, CLMAGE, sep = "*")
head(g)
## CASENUM ATTORNEY CLMSEX MARITAL_CLMAGE CLMINSUR SEATBELT LOSS
## 1 5 1 1 NA*50 2 1 34.940
## 2 13 2 2 2*28 1 1 10.892
## 3 66 2 1 2*5 2 1 0.330
## 4 71 1 1 1*32 2 2 11.037
## 5 96 2 1 4*30 2 1 0.138
## 6 97 1 2 1*35 2 1 0.309
Not all data that needs to be tidied comes in “long” format (i.e. the spread()
function), so this data set below is put into tidy or cleaned format using base R functions and general manipulation. This data set has its observations stored in the row names field, delimited with a semicolon “;”. The original column contains a mis-transformed value for the population density metric. The only variables’ names are also a concatenation of the entire data set’s column names delimited with a period “.”.
library(insuranceData)
library(magrittr)
data("Thirdparty")
head(Thirdparty) # rownames have column values
## lga.sd.claims.accidents.ki.population.pop_density
## ASHFIELD;1;1103;2304;920;124850;0 499001
## AUBURN;1;1939;2660;1465;143500;0 148379
## BANKSTOWN;1;4339;7381;3864;470700;0 205407
## BAULKHAMHILLS;1;1491;3217;1554;311300;0 25879
## BLACKTOWN;1;3801;6655;4175;584900;0 81222
## BOTANY;1;387;2013;854;106350;0 178143
cols = strsplit(colnames(Thirdparty) , "." , fixed=T) # a new vector to use later
dput(unlist(cols))
## c("lga", "sd", "claims", "accidents", "ki", "population", "pop_density"
## )
rows2df <- sapply(rownames(Thirdparty), strsplit, ";", USE.NAMES = FALSE)
tidy_3PD <- data.frame(matrix(unlist(rows2df), nrow = nrow(Thirdparty), byrow=T))
colnames(tidy_3PD) <- c("lga", "sd", "claims", "accidents", "ki", "population",
"pop_density")
tidy_3PD$pop_density <- Thirdparty$lga.sd.claims.accidents.ki.population.pop_density / 1000000
head(tidy_3PD)
## lga sd claims accidents ki population pop_density
## 1 ASHFIELD 1 1103 2304 920 124850 0.499001
## 2 AUBURN 1 1939 2660 1465 143500 0.148379
## 3 BANKSTOWN 1 4339 7381 3864 470700 0.205407
## 4 BAULKHAMHILLS 1 1491 3217 1554 311300 0.025879
## 5 BLACKTOWN 1 3801 6655 4175 584900 0.081222
## 6 BOTANY 1 387 2013 854 106350 0.178143
Every variable forms a column and every observation forms a row which makes for a table!