Chapter 3 Tidy

3.1 Introduction

Reshaping the data is an important (if necessary) step to exploratory data analysis and preparatory data cleaning for modeling or creating specialized visualization. Further concepts about tidy data can be found in this paper Tidy Data.

The principles of tidy data are:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

3.1.1 tidyr

  • tidyr: An evolution of ‘reshape2’. It’s designed specifically for data tidying (not general reshaping or aggregating) and works well with ‘dplyr’ data pipelines. Tidy data complements R’s vectorized operations. R will automatically preserve observations as you manipulate variables. No other format works as intuitively with R.

3.1.1.1 Examples

library(tidyr)
library(insuranceData)
library(magrittr)
data("AutoBi")
head(AutoBi)
##   CASENUM ATTORNEY CLMSEX MARITAL CLMINSUR SEATBELT CLMAGE   LOSS
## 1       5        1      1      NA        2        1     50 34.940
## 2      13        2      2       2        1        1     28 10.892
## 3      66        2      1       2        2        1      5  0.330
## 4      71        1      1       1        2        2     32 11.037
## 5      96        2      1       4        2        1     30  0.138
## 6      97        1      2       1        2        1     35  0.309
# create an interaction column
g <- AutoBi %>%  unite(col = MARITAL_CLMAGE, MARITAL, CLMAGE, sep = "*")

head(g)
##   CASENUM ATTORNEY CLMSEX MARITAL_CLMAGE CLMINSUR SEATBELT   LOSS
## 1       5        1      1          NA*50        2        1 34.940
## 2      13        2      2           2*28        1        1 10.892
## 3      66        2      1            2*5        2        1  0.330
## 4      71        1      1           1*32        2        2 11.037
## 5      96        2      1           4*30        2        1  0.138
## 6      97        1      2           1*35        2        1  0.309

Not all data that needs to be tidied comes in “long” format (i.e. the spread() function), so this data set below is put into tidy or cleaned format using base R functions and general manipulation. This data set has its observations stored in the row names field, delimited with a semicolon “;”. The original column contains a mis-transformed value for the population density metric. The only variables’ names are also a concatenation of the entire data set’s column names delimited with a period “.”.

library(insuranceData)
library(magrittr)
data("Thirdparty")

head(Thirdparty) # rownames have column values
##                                         lga.sd.claims.accidents.ki.population.pop_density
## ASHFIELD;1;1103;2304;920;124850;0                                                  499001
## AUBURN;1;1939;2660;1465;143500;0                                                   148379
## BANKSTOWN;1;4339;7381;3864;470700;0                                                205407
## BAULKHAMHILLS;1;1491;3217;1554;311300;0                                             25879
## BLACKTOWN;1;3801;6655;4175;584900;0                                                 81222
## BOTANY;1;387;2013;854;106350;0                                                     178143
cols = strsplit(colnames(Thirdparty) , "." , fixed=T)   # a new vector to use later 
dput(unlist(cols))
## c("lga", "sd", "claims", "accidents", "ki", "population", "pop_density"
## )
rows2df <- sapply(rownames(Thirdparty), strsplit, ";", USE.NAMES = FALSE)

tidy_3PD <- data.frame(matrix(unlist(rows2df), nrow = nrow(Thirdparty), byrow=T))
colnames(tidy_3PD) <- c("lga", "sd", "claims", "accidents", "ki", "population",
"pop_density")
tidy_3PD$pop_density <- Thirdparty$lga.sd.claims.accidents.ki.population.pop_density / 1000000

head(tidy_3PD)
##             lga sd claims accidents   ki population pop_density
## 1      ASHFIELD  1   1103      2304  920     124850    0.499001
## 2        AUBURN  1   1939      2660 1465     143500    0.148379
## 3     BANKSTOWN  1   4339      7381 3864     470700    0.205407
## 4 BAULKHAMHILLS  1   1491      3217 1554     311300    0.025879
## 5     BLACKTOWN  1   3801      6655 4175     584900    0.081222
## 6        BOTANY  1    387      2013  854     106350    0.178143

Every variable forms a column and every observation forms a row which makes for a table!