R · Data cleaning · Statistics

Data preprocessing in R: cleaning a smoking and lung-capacity dataset

Preprocessing is the step that turns a raw CSV into a reliable dataset: read delimiters correctly, assign variable types, review missing values, detect outliers and leave the data ready for descriptive and inferential analysis.

Import

Check separators, decimal symbols, column names and encoding before trusting the file.

Types

Convert categorical variables to factors and keep numeric variables usable for statistics.

Quality

Review missing values, impossible ranges, duplicated records and inconsistent categories.

Output

Save a clean dataset that can be reused by the later statistical-analysis post.

Data file upload and brief description

When you open the file with a simple text editor such as Notepad or WordPad for Windows, or gedit or VIM for Linux, you can see that “,” is the separator symbol and the dot corresponds to the decimal value. Therefore, you have to read the file with the function read.csv().

The data to be treated correspond to a medical investigation aimed at studying the lung capacity of people depending on whether they were smokers or not. Each person was asked, through a questionnaire, their gender, age, sports habits, city of residence, if they were a smoker, and if so, how many cigarettes per day on average they smoked and how many years they had been smoking. In addition, weight, height and lung capacity were measured from an expelled air test, from which the FEF (forced expiratory flow) measurement was taken as lung capacity, which is the speed of the air leaving the lung during the central portion of a forced expiration. It is measured in liters / second.

The file is named Smokers_raw.csv, contains 300 records and 9 variables. These variables are: Sex, Sport, Years, Cig, PC, City, Weight, Age, Height

If you need an equivalent public set, you can start from the dataset smoking_youth_tbl_df

2 Indicate the type of statistical variable of each of the variables

# factor
var.factor <- c(1,2,6)
var.integer <- 3:4
var.numeric <- c(5,7,8,9)
var.tipus <- vector(mode="character",length=ncol(mydata))
var.tipus[var.factor] <- "factor"
var.tipus[var.integer] <- "integer"
var.tipus[var.numeric] <- "numeric"

They are nominal qualitative variables: Sex, City

They are ordinal qualitative variables: Sport

They are discrete quantitative variables: Years, Cig

They are continuous quantitative variables: PC, Weight, Age, Height

3 Assign each variable the appropriate R variable type

Reading the file with the function read.csv() has made the following assignment to each variable

res <- sapply(mydata,class)
kable(data.frame(variables=names(res),clase=as.vector(res)))

variables	class
sex	factor
Sports	integrate
Years	integrate
Cig	integrate
PC	factor
City	factor
Weight	integrate
Age	integrate
Height	integrate

var_wrong <- n.var[res != var.tipus]

Therefore, the variables with wrong assignment that need to be corrected are: Sport, PC, Weight, Age, Height

The assignment to be made is:

kable(data.frame(variables= var_wrong, clase= c("factor","numeric", "numeric", "numeric", "numeric")))

variables	class
Sports	factor
PC	numeric
Weight	numeric
Age	numeric
Height	numeric

3.1 Transform the Sport variable to factor type

You only need to use the function factor() or ordered() with the appropriate coding according to the criterion: 1 equal to N, 2 equal to S, 3 equal to R and 4 equal to E.

mydata[,var_wrong[1]] <- factor(mydata[,var_wrong[1]], levels= c(1,2,3,4), labels= c("N","S","R","E"))

table(mydata[,var_wrong[1]])

##
##   N   S   R   E
## 127  83  48  42

Now we can verify that the Sport variable is of type factor.

3.2 Correct errors in quantitative variables with decimal separator confusion

The variables that have confusion with the decimal separator are PC.

Now the “,” will be changed to the value “.”, the summary of the variable values is:

#Change "," to ".""
i <- 2
mydata[,var_wrong[i]] <-  as.numeric( sub(",","\\.",mydata[,var_wrong[i]]) )

summary(mydata[,var_wrong[i]])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   1.557   2.913   3.554   3.332   3.794   4.466       2

Now we can verify that the variable PC is of type numeric.

3.3 Convert Age, Weight and Height variables to numeric type

You just need to use the function as.numeric()

for (i in 3:5) {
          mydata[,var_wrong[i]] <- as.numeric(mydata[,var_wrong[i]])
}

Now we can verify that the variables Weight, Age, Height are of type numeric, numeric, numeric, respectively.

4 Normalize/Standardize quantitative variables

The precision to analyze the variable PC It does not require such a large number of decimals. Therefore, we round the values to 3 decimal places.

mydata[,5] <- round(mydata[,5],3)

To check if the changes have been made correctly, 6 values are listed:

mydata[1:6,5]

## [1] 2.579 1.557    NA 2.762 3.487 4.075

5 Normalize/Standardize qualitative variables

5.1 Gender

The first step is to show the table with the values of the different registered genres and their frequencies. This way you can see if there are any problems.

table(mydata$Sex)

##
##      f      F   f      F         m      M   m      M         f      F
##      6     18      2      9      5     23      4      8     24     54
##   f      F         m      M   m      M
##      5     19     19     76     10     18

It is observed that it is necessary to standardize the variable. The result of each phase is shown:

Remove spaces before and after each observed value. The result is:

# Treure espais abans i després
kk <- trimws( mydata$Sex )
table(kk )

## kk
##   f   F   m   M
##  37 100  38 125

Convert the observed values to uppercase. The result is:

# change  upper
kk <- toupper(kk)
table(kk)

## kk
##   F   M
## 137 163

# Update
mydata$Sex <-as.factor(kk)

5.2 Sports

It has already been normalized in step 3.1

5.3 City

The first step is to show the table with the values of the different registered cities and their frequencies. This way you can see if there are any problems.

table(mydata$City)

##
##                 alcanar                     barcelona
##                          1                          3
##                  Barcelona               barcelona
##                         12                          2
##               Barcelona                        blanes
##                          6                          2
##                  Blanes                       Cardona
##                          1                          1
##                 Cardona                        Girona
##                          1                          3
##                  girona                     Girona
##                          1                          1
##                  La Bisbal                     lleida
##                          2                          1
##                  lleida                     Lleida
##                          1                          2
##                    Montgat              Pineda de Mar
##                          1                          1
##           Pineda de Mar                     Puigcerda
##                          1                          2
##                     Ripoll      Sant Boi de Llobregat
##                          1                          2
##   Sant Boi de Llobregat                     Sitges
##                          1                          2
##                    solsona                  Tarragona
##                          1                          3
##                   terrassa                   Terrassa
##                          3                          6
##                terrassa                   Terrassa
##                          1                          4
##                      Tossa                      valls
##                          2                          2
##                      Valls                    alcanar
##                          2                          2
##                    Alcanar                 Alcanar
##                          8                          1
##                  barcelona                  Barcelona
##                         10                         57
##               barcelona                  Barcelona
##                          4                          8
##                     blanes                     Blanes
##                          2                          4
##                  Blanes                      Cadaques
##                          2                          2
##                   cardedeu                   Cardedeu
##                          1                          2
##                    cardona                    Cardona
##                          2                          2
##                     girona                     Girona
##                          2                          1
##                  Girona                     la bisbal
##                          2                          2
##                  La Bisbal                     lleida
##                          3                          1
##                     Lleida                  Lleida
##                          7                          1
##                    montgat                    Montgat
##                          1                          2
##                 montgat                    Montgat
##                          1                          1
##              Pineda de Mar           Pineda de Mar
##                          5                          4
##               puigcerda                        ripoll
##                          1                          1
##                     Ripoll                  Ripoll
##                          2                          2
##      sant boi de llobregat      Sant Boi de Llobregat
##                          2                          5
##   Sant Boi de Llobregat                        sitges
##                          2                          3
##                     Sitges                  sitges
##                          5                          2
##                  Sitges                       Solsona
##                          1                          2
##                 solsona                    Solsona
##                          1                          1
##                  tarragona                  Tarragona
##                          5                          2
##               tarragona                  Tarragona
##                          2                          2
##                   terrassa                   Terrassa
##                          5                         18
##                terrassa                   Terrassa
##                          1                          4
##                   tossa                      Tossa
##                          2                          3
##                      valls                      Valls
##                          3                          6
##                   valls                      Valls
##                          1                          1

It is observed that the variable must be standardized. The result of each phase is shown:

Remove spaces before and after each observed value. The result is:

# Treure espais abans i després
kk <- trimws( mydata$City )
table(kk )

## kk
##               alcanar               Alcanar             barcelona
##                     3                     9                    19
##             Barcelona                blanes                Blanes
##                    83                     4                     7
##              Cadaques              cardedeu              Cardedeu
##                     2                     1                     2
##               cardona               Cardona                girona
##                     2                     4                     3
##                Girona             la bisbal             La Bisbal
##                     7                     2                     5
##                lleida                Lleida               montgat
##                     3                    10                     2
##               Montgat         Pineda de Mar             puigcerda
##                     4                    11                     1
##             Puigcerda                ripoll                Ripoll
##                     2                     1                     5
## sant boi de llobregat Sant Boi de Llobregat                sitges
##                     2                    10                     5
##                Sitges               solsona               Solsona
##                     8                     2                     3
##             tarragona             Tarragona              terrassa
##                     7                     7                    10
##              Terrassa                 tossa                 Tossa
##                    32                     2                     5
##                 valls                 Valls
##                     6                     9

Convert the observed values to title format. The result is:

# change  str_to_title
kk <- str_to_title(kk)
table(kk)

## kk
##               Alcanar             Barcelona                Blanes
##                    12                   102                    11
##              Cadaques              Cardedeu               Cardona
##                     2                     3                     6
##                Girona             La Bisbal                Lleida
##                    10                     7                    13
##               Montgat         Pineda De Mar             Puigcerda
##                     6                    11                     3
##                Ripoll Sant Boi De Llobregat                Sitges
##                     6                    12                    13
##               Solsona             Tarragona              Terrassa
##                     5                    14                    42
##                 Tossa                 Valls
##                     7                    15

We change the preposition “De” to lowercase “de”

kk <- sub("De","de",kk)

# Update
mydata$City <-as.factor(kk)

To check the changes, the frequency table is presented.

table(mydata$City)

##
##               Alcanar             Barcelona                Blanes
##                    12                   102                    11
##              Cadaques              Cardedeu               Cardona
##                     2                     3                     6
##                Girona             La Bisbal                Lleida
##                    10                     7                    13
##               Montgat         Pineda de Mar             Puigcerda
##                     6                    11                     3
##                Ripoll Sant Boi de Llobregat                Sitges
##                     6                    12                    13
##               Solsona             Tarragona              Terrassa
##                     5                    14                    42
##                 Tossa                 Valls
##                     7                    15

6 Review possible inconsistencies between variables:

6.1 Years vs Cig

The inconsistency between the Years vs Cig variables occurs when a non-smoking individual (Years = 0) responds that he or she smokes a cigarette (Cig > 0) or also when a smoker individual (Years > 0) responds that he or she does not smoke any cigarettes (Cig = 0). The correction criterion is to assign the value of zero to the two variables, Years and Cig.

6.1.1 Case (Years = 0) and (Cig > 0)

# find wrong values
kk <- which(mydata$Years== 0 & mydata$Cig > 0)

21 individuals have appeared with this inconsistency. The records are: 9, 11, 17, 41, 60, 82, 90, 91, 103, 113, 129, 131, 132, 176, 181, 183, 257, 264, 266, 276, 298

And the inconsistent values are:

kable(t(mydata[kk,3:4]))

	9	11	17	41	60	82	90	91	103	113	129	131	132	176	181	183	257	264	266	276	298
Years	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
Cig	1	5	5	1	3	1	2	1	1	2	3	3	2	5	3	3	5	5	3	1	2

The next step is to correct it:

# posar 0 al valors de Cig

mydata[kk,4] <- 0

6.1.2 Case (Years > 0) and (Cig = 0)

# find wrong values
kk <- which(mydata$Years > 0 & mydata$Cig == 0)

21 individuals have appeared with this inconsistency. The records are: 35, 69, 83, 85, 106, 107, 124, 138, 139, 143, 157, 189, 193, 205, 213, 220, 233, 234, 262, 263, 295

And the inconsistent values are:

kable(t(mydata[kk,3:4]))

	35	69	83	85	106	107	124	138	139	143	157	189	193	205	213	220	233	234	262	263	295
Years	5	5	5	3	5	3	5	1	5	2	2	5	3	1	1	2	4	2	4	3	4
Cig	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0

The next step is to correct it:

# posar 0 al valors de Cig

mydata[kk,3] <- 0

As a final check, the following table is presented:

table( mydata[,3]==0, mydata[,4]==0)

##
##         FALSE TRUE
##   FALSE   131    0
##   TRUE      0  169

7 Find outliers in quantitative variables

A boxplot of each quantitative variable is presented. In addition, a table is made with the robust and non-robust estimates of central tendency and dispersion for each quantitative variable.

res <- sapply(mydata,class)
res <- which(res=="numeric")

7.1 Present a boxplot for each quantitative variable

The boxplots of the variables Years, Cig, PC, Weight, Age, Height are shown

par(mfrow=c(2,2))
for (i in 1:4){
            boxplot(mydata[,res[i]], main=names(mydata)[res[i]],col="gray")
}

Boxplot

par(mfrow=c(1,1))

par(mfrow=c(1,2))
for (i in 5:6){
            boxplot(mydata[,res[i]], main=names(mydata)[res[i]],col="gray")
}

Boxplot

par(mfrow=c(1,1))

Even though some atypical values are observed in the boxplots of the variables Years and Cig We can consider them to be an artifact because there are two groups of data: zeros and non-zeroes, with a similar proportion.

Now the outlier values of the rest of the quantitative variables are presented:

# Outlier values
vars.cuantitativas <- res[-c(1,2)]
for(i in 1:length(vars.cuantitativas) ) {
            # i <- 1
  #print(boxplot.stats(mydata[,i])$out)
  indices <- which(mydata[,vars.cuantitativas[i]]  %in%  boxplot.stats(mydata[,vars.cuantitativas[i]])$out)
  cat(names(mydata)[vars.cuantitativas[i]],":", toString(indices), "\n" )
  }

PC : 2
Weight : 9, 11, 13, 14, 21, 34, 48, 53, 61, 94, 104, 110, 125, 133, 147, 180, 207, 213, 222, 267
Age : 247, 254
Height :

The variable that has the greatest number of outliers is Weight and clearly outside the logical magnitudes for weights of people expressed in Kg. This suggests that it is an error, as indicated in the statement. Possibly this is a change in units of measurement, instead of kg they are measured in grams. The rest of the variables present atypical values that cannot be considered erroneous.

Let's move on to correct the wrong values of the variable Weight:

i <- 2

indices <- which(mydata[,vars.cuantitativas[i]]  %in%  boxplot.stats(mydata[,vars.cuantitativas[i]])$out)

mydata[indices, vars.cuantitativas[i]] <- mydata[indices, vars.cuantitativas[i]]/1000

Now the changes made are checked with a boxplot:

boxplot(mydata[,vars.cuantitativas[i]], main=names(mydata)[vars.cuantitativas[i]],col="gray")

7.2 Table of central tendency and dispersion estimates (robust and non-robust) for each quantitative variable

mean.n <- as.vector(sapply( mydata[,res ],mean,na.rm=TRUE ) )
std.n <- as.vector(sapply(mydata[,res ],sd, na.rm=TRUE))
median.n <- as.vector(sapply(mydata[,res],median, na.rm=TRUE))
mean.trim.0.05 <- as.vector(sapply(mydata[,res],mean, na.rm=TRUE, trim=0.05))
mean.winsor.0.05 <- as.vector(sapply(mydata[,res],winsor.mean, na.rm=TRUE,trim=0.05))
IQR.n <- as.vector(sapply(mydata[,res],IQR, na.rm=TRUE))
mad.n <- as.vector(sapply(mydata[,res],mad, na.rm=TRUE))


kable(data.frame(variables= names(mydata)[res],
                 Media = mean.n,
                 Mediana = median.n,
                 Media.recort.0.05= mean.trim.0.05,
                 Media.winsor.0.05= mean.winsor.0.05
                 ),
      digits=2, caption="Estimaciones de Tendencia Central")

Central Tendency Estimates
variables	Average	Median	Average.crop.0.05	Media.winsor.0.05
Years	8.46	0.00	7.01	8.11
Cig	7.27	0.00	6.11	7.05
PC	3.33	3.55	3.36	3.33
Weight	67.72	68.00	67.73	67.71
Age	45.59	46.00	45.49	45.54
Height	171.44	172.00	171.47	171.43

kable(data.frame(variables= names(mydata)[res],
                 Desv.Standard = std.n,
                 IQR = IQR.n,
                 MAD = mad.n
                 ),
      digits=2, caption="Estimaciones de Dispersión")

Dispersion Estimates
variables	Standard Dev.	IQR	MAD
Years	12.54	15.25	0.00
Cig	10.41	13.00	0.00
PC	0.63	0.88	0.54
Weight	3.83	6.00	4.45
Age	10.63	14.00	10.38
Height	5.74	10.00	7.41

It can be seen that for the variables Years and Cig Robust measures differ greatly from ordinary measures because the data contain a large number of zeros.

8 Missing values

The variables that have missing values are:

v.p.p <- which(names(mydata)==v.p)

The variable with missing values is 5 from the list of variables.

This variable has missing values in positions 3, 7

8.1 Impute the values from the k-nearest neighbors using the Gower distance with the information of all the variables.

# kNN Imputation i Gower distance
# 6 last variables

mydata.completo <- kNN(mydata, variable=colnames(mydata)[v.p.p])

# Asignation
mydata[v.p.p.v,v.p.p] <- mydata.completo[v.p.p.v,v.p.p]

The imputed values are: 3.747, 2.773

9 Finally, make a brief descriptive study of the data once purified.

A descriptive study grouped by the type of statistical variable is presented.

options(knitr.kable.NA = '')

kable(summary(mydata)[,var.factor],
      digits=2, align='l', caption="Estadística descriptiva de variables cualitativas")

Descriptive statistics of qualitative variables
sex	Sports	City
F:137	N:127	Barcelona:102
M:163	S: 83	Terrassa: 42
	A: 48	Valleys: 15
	E: 42	Tarragona: 14
		Lleida: 13
		Sitges: 13
		(Other) :101

kable(summary(mydata)[,var.integer],
      digits=2, align='l', caption="Estadística descriptiva de variables cuantitativas discretas")

Descriptive statistics of discrete quantitative variables
	Years	Cig
	Min.: 0.000	Min.: 0.000
	1st Qu.: 0.000	1st Qu.: 0.000
	Median: 0.000	Median: 0.000
	Mean: 8,463	Mean: 7,273
	3rd Qu.:15,250	3rd Qu.:13,000
	Max. :51,000	Max. :47,000

kable(summary(mydata)[,var.numeric],
      digits=2, align='l', caption="Estadística descriptiva de variables cuantitativas continuas")

Descriptive statistics of continuous quantitative variables
PC	Weight	Age	Height
Min.:1,557	Min.:57.00	Min.:19.00	Min.:158.0
1st Qu.:2,909	1st Qu.:65.00	1st Qu.:38.00	1st Qu.:166.0
Median:3,554	Average:68.00	Average:46.00	Median:172.0
Mean:3,331	Mean:67.72	Mean:45.59	Mean:171.4
3rd Qu.:3,793	3rd Qu.:71.00	3rd Qu.:52.00	3rd Qu.:176.0
Max. :4,466	Max. :79.00	Max. :77.00	Max. :186.0

10 Create the clean file

my.newfile <- "Fumadors_clean.csv"
write.csv(mydata, file=my.newfile, row.names = FALSE)