A boxplot of each quantitative variable is presented. In addition, a table is made with the robust and non-robust estimates of central tendency and dispersion for each quantitative variable.
Present a boxplot for each quantitative variable
The boxplots of the variables Years, Cig, PC, Weight, Age and Height are shown below.
par(mfrow=c(2,2))
for (i in 1:4){
boxplot(mydata[,res[i]], main=names(mydata)[res[i]],col="gray")
}
par(mfrow=c(1,1))
par(mfrow=c(1,2))
for (i in 5:6){
boxplot(mydata[,res[i]], main=names(mydata)[res[i]],col="gray")
}
par(mfrow=c(1,1))
Although some atypical values appear in the boxplots of Years and Cig, they can be regarded as an artifact: each of these variables mixes two groups of observations, zeros and non-zeros, in similar proportions.
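A quick way to check this kind of claim is to compute the share of exact zeros in each variable. The sketch below uses a small made-up data frame (`toy`, which is not the real `mydata`) purely for illustration; with the real data the same `colMeans()` call would be applied to the Years and Cig columns.

```r
# Hypothetical toy data standing in for mydata: half of the
# observations are non-smokers, so Years and Cig are zero for them.
toy <- data.frame(
  Years = c(0, 0, 0, 0, 0, 12, 20, 25, 30, 40),
  Cig   = c(0, 0, 0, 0, 0, 5, 10, 20, 20, 40)
)

# Proportion of exact zeros in each column
prop.zeros <- colMeans(toy == 0)
print(prop.zeros)

# With this many zeros the lower quartiles sit at zero, so the boxplot
# describes a mix of two groups rather than a single distribution.
print(sapply(toy, quantile, probs = c(0.25, 0.5, 0.75)))
```

When the proportion of zeros is close to one half, points flagged in the upper tail of the boxplot are better read as members of the non-zero group than as errors.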
Now the outlier values of the rest of the quantitative variables are presented:
# Row indices of the outliers of each remaining quantitative variable
vars.cuantitativas <- res[-c(1,2)]   # drop the first two variables (Years, Cig)
for (i in seq_along(vars.cuantitativas)) {
  x <- mydata[, vars.cuantitativas[i]]
  indices <- which(x %in% boxplot.stats(x)$out)
  cat(names(mydata)[vars.cuantitativas[i]], ":", toString(indices), "\n")
}
PC : 2
Weight : 9, 11, 13, 14, 21, 34, 48, 53, 61, 94, 104, 110, 125, 133, 147, 180, 207, 213, 222, 267
Age : 247, 254
Height :
Weight has by far the most outliers, and their values lie clearly outside the plausible range for a person's weight in kg. This suggests a recording error, as indicated in the statement, most likely a change in units of measurement: these values appear to be expressed in grams instead of kilograms. The atypical values of the other variables cannot be considered erroneous.
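For reference, `boxplot.stats()` reports a value as an outlier when it lies more than 1.5 times the hinge spread beyond the lower or upper hinge. A minimal sketch on made-up weights (the last one is assumed to have been recorded in grams) shows the rule at work; the vector `w` is hypothetical and not taken from `mydata`.

```r
# Hypothetical weights in kg; the last value is assumed to be in grams
w <- c(60, 62, 65, 66, 68, 70, 71, 73, 75, 68000)

st <- boxplot.stats(w)            # default coef = 1.5
hinges <- st$stats[c(2, 4)]       # lower and upper hinge
spread <- diff(hinges)            # hinge spread (close to the IQR)

# Fences beyond which a value is reported as an outlier
fences <- c(hinges[1] - 1.5 * spread, hinges[2] + 1.5 * spread)
print(fences)

# Values outside the fences, and their positions in the vector
print(st$out)
print(which(w %in% st$out))
```

Only the gram-scale value falls outside the fences here, which mirrors the pattern seen in Weight.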
The erroneous values of Weight are now corrected by dividing them by 1000 (grams to kilograms):
i <- 2
indices <- which(mydata[,vars.cuantitativas[i]] %in% boxplot.stats(mydata[,vars.cuantitativas[i]])$out)
mydata[indices, vars.cuantitativas[i]] <- mydata[indices, vars.cuantitativas[i]]/1000
Now the changes made are checked with a boxplot:
boxplot(mydata[,vars.cuantitativas[i]], main=names(mydata)[vars.cuantitativas[i]],col="gray")
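Dividing every flagged value by 1000 assumes that all boxplot outliers of Weight are gram values. A slightly more defensive variant, sketched here on made-up data, rescales only the values above a plausibility threshold. The threshold of 500 and the vector `w` are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical weights: mostly kg, two assumed to be recorded in grams,
# plus one genuine low value (48 kg) that must not be rescaled
w <- c(48, 62, 65, 66, 68, 70, 71, 73, 68000, 71500)

max.plausible.kg <- 500           # assumed upper bound for a weight in kg
in.grams <- !is.na(w) & w > max.plausible.kg

w.fixed <- w
w.fixed[in.grams] <- w[in.grams] / 1000   # grams -> kilograms

print(which(in.grams))            # positions that were rescaled
print(w.fixed)
print(range(w.fixed))
```

With this rule a legitimate but unusual weight in kg is left untouched even if the boxplot flags it.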

Table of central tendency and dispersion estimates (robust and non-robust) for each quantitative variable
mean.n <- as.vector(sapply( mydata[,res ],mean,na.rm=TRUE ) )
std.n <- as.vector(sapply(mydata[,res ],sd, na.rm=TRUE))
median.n <- as.vector(sapply(mydata[,res],median, na.rm=TRUE))
mean.trim.0.05 <- as.vector(sapply(mydata[,res],mean, na.rm=TRUE, trim=0.05))
mean.winsor.0.05 <- as.vector(sapply(mydata[,res],winsor.mean, na.rm=TRUE,trim=0.05))
IQR.n <- as.vector(sapply(mydata[,res],IQR, na.rm=TRUE))
mad.n <- as.vector(sapply(mydata[,res],mad, na.rm=TRUE))
kable(data.frame(variables= names(mydata)[res],
Media = mean.n,
Mediana = median.n,
Media.recort.0.05= mean.trim.0.05,
Media.winsor.0.05= mean.winsor.0.05
),
digits=2, caption="Central Tendency Estimates")
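Central Tendency Estimates

| Variable | Mean | Median | 5% trimmed mean | 5% winsorized mean |
|:---------|-----:|-------:|----------------:|-------------------:|
| Years | 8.46 | 0.00 | 7.01 | 8.11 |
| Cig | 7.27 | 0.00 | 6.11 | 7.05 |
| PC | 3.33 | 3.55 | 3.36 | 3.33 |
| Weight | 67.72 | 68.00 | 67.73 | 67.71 |
| Age | 45.59 | 46.00 | 45.49 | 45.54 |
| Height | 171.44 | 172.00 | 171.47 | 171.43 |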
kable(data.frame(variables= names(mydata)[res],
Desv.Standard = std.n,
IQR = IQR.n,
MAD = mad.n
),
digits=2, caption="Dispersion Estimates")
Dispersion Estimates

| Variable | Std. deviation | IQR | MAD |
|:---------|---------------:|----:|----:|
| Years | 12.54 | 15.25 | 0.00 |
| Cig | 10.41 | 13.00 | 0.00 |
| PC | 0.63 | 0.88 | 0.54 |
| Weight | 3.83 | 6.00 | 4.45 |
| Age | 10.63 | 14.00 | 10.38 |
| Height | 5.74 | 10.00 | 7.41 |
For the variables Years and Cig, the robust estimates differ greatly from the ordinary ones because the data contain a large number of zeros: for Years, the median and the MAD are both 0.00, against a mean of 8.46 and a standard deviation of 12.54.
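To make the contrast concrete, the following sketch computes the same kinds of estimators on a small made-up zero-inflated sample. The helper `winsor_mean_sketch()` is a hypothetical base-R approximation (it clamps values to the `trim` and `1 - trim` quantiles before averaging) and is not guaranteed to match the `winsor.mean()` used above in every detail.

```r
# Hypothetical zero-inflated sample: 12 zeros and 8 positive values
x <- c(rep(0, 12), 5, 10, 15, 20, 25, 30, 35, 40)

# Hypothetical helper: clamp values to the trim and 1 - trim quantiles,
# then average (a simple form of winsorized mean)
winsor_mean_sketch <- function(x, trim = 0.05, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  q <- unname(quantile(x, probs = c(trim, 1 - trim)))
  mean(pmin(pmax(x, q[1]), q[2]))
}

central <- c(
  mean    = mean(x),
  median  = median(x),
  trimmed = mean(x, trim = 0.05),
  winsor  = winsor_mean_sketch(x, trim = 0.05)
)
spread <- c(sd = sd(x), IQR = IQR(x), MAD = mad(x))

print(round(central, 2))
print(round(spread, 2))
```

In this toy sample more than half of the values are zero, so the median and the MAD are both exactly 0 while the mean is 9. The trimmed and winsorized means land in between, because they discard or clamp only the most extreme 5% at each end.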