Introduction
In this activity, we will use the file produced by the preprocessing project at https://1938.com.es/preprocesamiento-datos-r . Remember that this file stores the data of a medical investigation on the lung capacity of several people, with the aim of studying whether health habits and smoking habits influence lung capacity. To carry out the study, a sample of 300 people was collected. Each person was asked through a questionnaire about their gender, their sports habits, whether they smoked and, if so, how many cigarettes a day they smoked on average and for how many years they had been smoking. In addition, the lung capacity of each person was measured with an expelled air test, from which the FEF (forced expiratory flow) was taken as the lung capacity. FEF is the speed of the air leaving the lungs during the central portion of a forced expiration, and it is measured in liters/second. Other personal data collected are height, weight and city of residence. The file includes an additional column, “PC5Y”, which is the lung capacity of each person measured 5 years after the first test. It is assumed that the personal conditions of each person have not changed significantly during this time.
In this activity we will use the “clean” smokers file, that is, after the preprocessing has been carried out. Once the file is prepared for analysis, we will apply descriptive and inferential statistics analysis.
Data loading
Load the data file “Fumadores_clean_5Y.csv” and validate that the data types are interpreted correctly.
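# stringsAsFactors=TRUE keeps text columns as factors (since R 4.0 the default is FALSE)
data <- read.csv( "Fumadores_clean_5Y.csv", stringsAsFactors=TRUE )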
head(data)
##   Sex Sport Years Cig    PC                  City Weight Age Height  PC5Y
## 1   M     E    25  10 2.579             Barcelona     65  49    171 2.529
## 2   F     E    18  32 1.557              Terrassa     65  35    166 1.444
## 3   M     S     0   0 3.747             La Bisbal     69  38    175 3.730
## 4   M     N    25  14 2.762                Blanes     70  55    176 2.670
## 5   M     E     0   0 3.487 Sant Boi de Llobregat     72  55    178 3.487
## 6   F     S     0   0 4.075             Barcelona     64  42    165 4.052
sapply( data, class)
##       Sex     Sport     Years       Cig        PC      City    Weight
##  "factor"  "factor" "integer" "integer" "numeric"  "factor" "integer"
##       Age    Height      PC5Y
## "integer" "integer" "numeric"
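The type validation above can also be automated. The following is a minimal sketch, assuming the expected classes are the ones printed by `sapply( data, class )`. The data frame `demo` is an illustrative stand-in built from the six rows shown by `head(data)`, and the names `demo`, `expected`, `mismatch` and `types.ok` are hypothetical.

```r
# Illustrative stand-in: the six rows printed by head(data) above.
demo <- data.frame(
  Sex    = c("M", "F", "M", "M", "M", "F"),
  Sport  = c("E", "E", "S", "N", "E", "S"),
  Years  = c(25L, 18L, 0L, 25L, 0L, 0L),
  Cig    = c(10L, 32L, 0L, 14L, 0L, 0L),
  PC     = c(2.579, 1.557, 3.747, 2.762, 3.487, 4.075),
  City   = c("Barcelona", "Terrassa", "La Bisbal", "Blanes",
             "Sant Boi de Llobregat", "Barcelona"),
  Weight = c(65L, 65L, 69L, 70L, 72L, 64L),
  Age    = c(49L, 35L, 38L, 55L, 55L, 42L),
  Height = c(171L, 166L, 175L, 176L, 178L, 165L),
  PC5Y   = c(2.529, 1.444, 3.730, 2.670, 3.487, 4.052),
  stringsAsFactors = TRUE
)

# Expected class of every column (taken from the output above).
expected <- c(Sex = "factor", Sport = "factor", Years = "integer",
              Cig = "integer", PC = "numeric", City = "factor",
              Weight = "integer", Age = "integer", Height = "integer",
              PC5Y = "numeric")

# Compare actual and expected classes column by column.
actual   <- sapply(demo, class)
mismatch <- names(expected)[actual[names(expected)] != expected]
types.ok <- length(mismatch) == 0
types.ok
```

With the real file, `demo` would be replaced by `data`; any column listed in `mismatch` would need an explicit conversion before the analysis.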
Descriptive Statistics
First of all, we will study the central and dispersion values of some variables in the data set. Follow the steps specified below.
Central values
Calculate the mean, the median and Tukey's five-number summary of lung capacity. Display the sample values in a boxplot. Are extreme values (outliers) detected in the diagram? Next, show lung capacity using box plots, separating the female and male genders. Finally, show, using box plots, the comparison between the original PC value and the value after 5 years. Interpret the results.
x.five <- fivenum( data$PC )
x.median <- median( data$PC )
x.mean <-mean( data$PC )
boxplot( data$PC, main="PC" )

boxplot( PC~Sex, data, main="PC")

boxplot( data$PC, data$PC5Y, names=c("PC", "PC after 5Y"), main="PC")

Interpretation: The lung capacity in the sample ranges from a minimum of 1.557 to a maximum of 4.466. The box plot is left-skewed, that is, the bottom 50% of the values are more spread out than the top half. The median is 3.554 and the mean is 3.33099. One extreme value is observed at the lower end.
Regarding the PC boxplot by sex, women's lung capacity is lower than men's, and its dispersion is greater.
In the boxplot comparing lung capacity in the first test and after five years, there are no notable differences in the upper half of the data. The lower half of the PC values shows more dispersion in the test at five years, together with a slight decrease. It remains to be seen whether this difference is significant.
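The extreme value mentioned above is what `boxplot` draws as an isolated point: an observation lying more than 1.5 times the hinge spread beyond a hinge. A minimal sketch of that rule follows. The helper `tukey.outliers` and the vector `x.demo` are hypothetical illustrations and are not taken from the study data.

```r
# Hypothetical helper: values outside the boxplot whiskers (Tukey's rule).
tukey.outliers <- function(x, coef = 1.5) {
  h      <- fivenum(x)          # min, lower hinge, median, upper hinge, max
  spread <- h[4] - h[2]         # distance between the hinges
  lower  <- h[2] - coef * spread
  upper  <- h[4] + coef * spread
  x[x < lower | x > upper]
}

# Illustrative values only (not from the data set).
x.demo <- c(3.2, 3.4, 3.5, 3.6, 3.7, 3.8, 1.5)
out    <- tukey.outliers(x.demo)
out

# The same points are reported by the function that boxplot() relies on.
boxplot.stats(x.demo)$out
```

On the real data, calling `tukey.outliers( data$PC )` should list the values drawn as isolated points in the first boxplot.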
Dispersion
Calculate the dispersion of lung capacity using the measurements: variance, standard deviation and interquartile range.
# Dispersion measures
x.iqr <- IQR( data$PC )
x.var <- var( data$PC )
x.sd <- sd( data$PC )
Interquartile range: 0.88475
Variance: 0.3937751
Standard deviation: 0.627515
Manual dispersion calculation
Calculate the standard deviation of lung capacity manually and compare the result with the corresponding R function.
N<-length( data$PC )
N
## [1] 300
sample.mean <- sum( data$PC )/N
sample.mean
## [1] 3.33099
sample.sd <- sqrt( sum((data$PC - sample.mean)^2) / (N-1) )
The result of the manually calculated sample standard deviation is: 0.6275. According to the function sd, the result is 0.6275.
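For reference, the manual calculation above corresponds to the usual sample mean and sample standard deviation, written here with $x_i$ for the individual PC values and $N$ for the sample size. The divisor $N-1$ is the same one that `sd` uses, which is why both results agree.

```latex
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i ,
\qquad
s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}
```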
Histogram
Represent a histogram of the PC variable. If necessary, adjust the histogram parameters so that the distribution is shown with adequate resolution.
N<-length( data$PC )
hist( data$PC, breaks=40, prob=TRUE, main="Distribution of PC values" )

N
## [1] 300
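The value `breaks=40` above was chosen by hand. As an alternative, R offers rules that suggest the number of classes from the data. The sketch below is only illustrative: it uses the six PC values shown by `head(data)` as a stand-in (`x.demo` is a hypothetical name), so the suggested counts will differ from those obtained with the full sample.

```r
# Stand-in sample: the six PC values printed by head(data).
x.demo <- c(2.579, 1.557, 3.747, 2.762, 3.487, 4.075)

# Number of classes suggested by two common rules.
n.sturges <- nclass.Sturges(x.demo)   # ceiling(log2(n) + 1)
n.fd      <- nclass.FD(x.demo)        # Freedman-Diaconis, based on the IQR

# hist() also accepts the rule name; plot=FALSE returns the bins only.
h <- hist(x.demo, breaks = "FD", plot = FALSE)
h$breaks
h$counts
```

With the real data, `hist( data$PC, breaks="FD", prob=TRUE )` would draw the histogram with the suggested binning; whether that is preferable to 40 fixed breaks is a judgment call.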
Categorical data
For the variables Sex, Sport and City, produce a numerical summary and draw a pie chart that shows the proportion of cases in each category.
summary(data)
##  Sex     Sport       Years             Cig               PC
##  F:137   E:127   Min.   : 0.000   Min.   : 0.000   Min.   :1.557
##  M:163   N: 83   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.:2.909
##          R: 48   Median : 0.000   Median : 0.000   Median :3.554
##          S: 42   Mean   : 8.463   Mean   : 7.273   Mean   :3.331
##                  3rd Qu.:15.250   3rd Qu.:13.000   3rd Qu.:3.793
##                  Max.   :51.000   Max.   :47.000   Max.   :4.466
##
##         City         Weight           Age            Height
##  Barcelona:102   Min.   :57.00   Min.   :19.00   Min.   :158.0
##  Terrassa : 42   1st Qu.:65.00   1st Qu.:38.00   1st Qu.:166.0
##  Valls    : 15   Median :68.00   Median :46.00   Median :172.0
##  Tarragona: 14   Mean   :67.72   Mean   :45.59   Mean   :171.4
##  Lleida   : 13   3rd Qu.:71.00   3rd Qu.:52.00   3rd Qu.:176.0
##  Sitges   : 13   Max.   :79.00   Max.   :77.00   Max.   :186.0
##  (Other)  :101
##       PC5Y
##  Min.   :1.444
##  1st Qu.:2.744
##  Median :3.543
##  Mean   :3.289
##  3rd Qu.:3.796
##  Max.   :4.472
##
summary(data$Sport)
## E N R S
## 127 83 48 42
#Sex
par(mfrow=c(1,2))
table( data$Sex )
##
## F M
## 137 163
pie( table(data$Sex),main="Sex")
#Sport
pie( table(data$Sport),main="Sport")

pie( table(data$City),main="City")
par(mfrow=c(1,1))
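Pie charts show proportions only visually, so it can help to compute them explicitly. The sketch below starts from the category counts printed above for Sex and Sport and converts them to percentages; the variable names are hypothetical.

```r
# Counts taken from the summaries printed above.
sex.counts   <- c(F = 137, M = 163)
sport.counts <- c(E = 127, N = 83, R = 48, S = 42)

# Proportion of cases in each category, as percentages.
sex.pct   <- round(100 * prop.table(sex.counts), 1)
sport.pct <- round(100 * prop.table(sport.counts), 1)
sex.pct
sport.pct

# Labels that combine category name and percentage, usable in pie().
sport.labels <- paste0(names(sport.pct), " (", sport.pct, "%)")
if (interactive()) {
  pie(sport.counts, labels = sport.labels, main = "Sport")
}
```

With the data frame loaded, the same percentages could be obtained with `prop.table( table(data$Sport) )`.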

Inferential statistics
Confidence interval
Calculate the 97% confidence interval of the lung capacity of the population.
By the central limit theorem, for samples of size greater than 30 we can use the normal distribution to calculate the confidence interval of the mean.
n<-nrow( data )
alpha<-1-0.97
# Standard error
errorTipic <- sd(data$PC) / sqrt( n )
errorTipic
## [1] 0.0362296
# z value
z<-qnorm( 1-alpha/2 )
z
## [1] 2.17009
# Margin of error
error<- z * errorTipic
error
## [1] 0.0786215
# Interval
c( mean(data$PC) - error, mean(data$PC) + error )
## [1] 3.252369 3.409611
# Check with Student's t.
# The result is not exactly the same because a normal distribution was assumed above.
t.test( data$PC, conf.level=0.97 )
##
## One Sample t-test
##
## data: data$PC
## t = 91.941, df = 299, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 97 percent confidence interval:
## 3.251991 3.409989
## sample estimates:
## mean of x
## 3.33099
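The small gap between the two intervals comes only from the quantile used. The sketch below recomputes both from the summary values printed above (mean 3.33099, standard error 0.0362296, n = 300); the helper `ci.from.summary` is a hypothetical name introduced for illustration.

```r
# Hypothetical helper: confidence interval of a mean from summary values.
# With df = NULL the normal quantile is used, otherwise Student's t.
ci.from.summary <- function(xbar, se, conf, df = NULL) {
  alpha <- 1 - conf
  q <- if (is.null(df)) qnorm(1 - alpha / 2) else qt(1 - alpha / 2, df)
  c(lower = xbar - q * se, upper = xbar + q * se)
}

# Values printed in the output above.
xbar <- 3.33099
se   <- 0.0362296

ci.z <- ci.from.summary(xbar, se, conf = 0.97)            # normal quantile
ci.t <- ci.from.summary(xbar, se, conf = 0.97, df = 299)  # t with n - 1 df
ci.z
ci.t
```

The t quantile is slightly larger than the normal one, so the t interval is slightly wider, as seen in the `t.test` output.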
Analyze the lung capacity of women
We assume that we know the average lung capacity of the population, which is equal to 3.30. Can we say that women's lung capacity is lower than the population average, with a 95% confidence level? To answer this question, follow the steps provided.
Write the null and alternative hypothesis
$H_0: \mu = \mu_0$
$H_1: \mu < \mu_0$
where $\mu_0 = 3.30$.
Method
Indicate which is the most appropriate method to perform this analysis, depending on the characteristics of the sample and the objective of the analysis.
Since the sample size is greater than 30, we apply the central limit theorem and use the normal distribution. A one-sided (lower-tail) hypothesis test on the mean is applied.
Calculate the contrast statistic, the critical value and the p value.
# Subset of women
Fem <- data[data$Sex=='F',]
mu0<-3.30
n<-nrow( Fem )
mean(Fem$PC)
## [1] 3.215241
# Test statistic
t<- (mean(Fem$PC)-mu0) / (sd(Fem$PC)/sqrt(n))
# Critical value at 95%
# One-sided (lower-tail) test
tcritical <- qnorm( 1-0.95, lower.tail=TRUE )
# p-value
pvalue <- pnorm( t, lower.tail=TRUE )
t
## [1] -1.592563
tcritical
## [1] -1.644854
pvalue
## [1] 0.05562909
# Check
t.test( Fem$PC, mu=3.30, alternative="less" )
##
## One Sample t-test
##
## data: Fem$PC
## t = -1.5926, df = 136, p-value = 0.05679
## alternative hypothesis: true mean is less than 3.3
## 95 percent confidence interval:
## -Inf 3.303383
## sample estimates:
## mean of x
## 3.215241
Interpret the result
The critical value is -1.64 and the test statistic is -1.59, which does not fall below it. We are therefore in the acceptance region of the null hypothesis: at the 95% confidence level, we cannot affirm that women's lung capacity is lower than the population average of 3.3.
The p-value leads to the same conclusion: p = 0.0556, whereas rejecting the null hypothesis at the 95% confidence level requires p < 0.05.
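The manual calculation and `t.test` give slightly different p-values because they use different reference distributions for the same statistic. The sketch below reproduces both from the statistic printed above (-1.592563) and the number of women in the sample (137); the variable names are hypothetical.

```r
# Values printed in the output above.
z.obs <- -1.592563   # test statistic
n.fem <- 137         # number of women in the sample
alpha <- 0.05        # significance level for 95% confidence

# Lower-tail p-value under each reference distribution.
p.normal  <- pnorm(z.obs)               # normal approximation
p.student <- pt(z.obs, df = n.fem - 1)  # Student's t, as in t.test
c(p.normal, p.student)

# Decision: reject H0 only if the p-value is below alpha.
reject.normal  <- p.normal < alpha
reject.student <- p.student < alpha
c(reject.normal, reject.student)
```

Both p-values stay above 0.05, so the conclusion does not depend on which of the two distributions is used here.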
Comparison between smokers and non-smokers
We wonder if the lung capacity of smokers is lower than the lung capacity of non-smokers. Apply a hypothesis test to contrast the previous hypothesis with 95% confidence and interpret the result.
Follow the steps below.
Null and alternative hypotheses
Write the null and alternative hypothesis.
$H_0: \mu_{Fum} = \mu_{NoFum}$
$H_1: \mu_{Fum} < \mu_{NoFum}$
Method
Explain the method you will apply to make this contrast and justify it.
Two-sample test on the difference of means. Both samples are large (n > 30), so the data do not need to be normally distributed and the normal approximation can be used. The test is one-sided.
Calculation
Perform the calculation. As before, you cannot use R functions or libraries that compute the test directly; the calculation must be done manually. You can use functions such as qnorm, pnorm, qt and pt.
str( data$Cig )
## int [1:300] 10 32 0 14 0 0 15 0 0 12 ...
Fum <- data[data$Cig > 0 ,]
NoFum <- data[data$Cig==0, ]
n.fum <- nrow( Fum )
n.nofum <- nrow( NoFum )
# Check
nrow( data )
## [1] 300
n.fum
## [1] 131
n.nofum
## [1] 169
n.fum + n.nofum
## [1] 300
# Calculations
mean.fum <- mean( Fum$PC )
sd.fum <- sd( Fum$PC )
c( mean.fum, sd.fum )
## [1] 2.7571985 0.4935864
mean.no.fum <- mean( NoFum$PC )
sd.no.fum <- sd( NoFum$PC )
c( mean.no.fum, sd.no.fum )
## [1] 3.7757633 0.2378603
# Show the boxplot
boxplot( Fum$PC, NoFum$PC, main="PC", names=c("Smoker", "Non-smoker"))

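# Standard error of the difference of means
S <- sqrt( sd.fum^2/n.fum + sd.no.fum^2/n.nofum)
# Statistic defined as non-smokers minus smokers, so H1 corresponds to the upper tail
zobs <- (mean.no.fum-mean.fum)/ S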
zobs
## [1] 21.74292
alfa <- 1-0.95
zcritical <- qnorm( alfa, lower.tail=FALSE )
zcritical
## [1] 1.644854
pvalue<-pnorm( zobs, lower.tail=FALSE ) # upper tail only: the test is one-sided
pvalue
## [1] 4.030158e-105
And at 99% confidence? Redo the calculations
alfa <- 1-0.99
zcritical <- qnorm( alfa, lower.tail=FALSE )
zcritical
## [1] 2.326348
pvalue<-pnorm( zobs, lower.tail=FALSE ) # upper tail only: the test is one-sided
pvalue
## [1] 4.030158e-105
Interpretation
At both the 95% (p-value < 0.05) and the 99% (p-value < 0.01) confidence levels, we reject the null hypothesis that smokers and non-smokers have the same lung capacity, in favor of the alternative hypothesis that the lung capacity of smokers is lower.
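As a cross-check, the statistic can be rebuilt from the group summaries printed above (means, standard deviations and group sizes) without access to the data file. This is a sketch under that assumption; the variable names are hypothetical.

```r
# Group summaries printed in the output above.
mean.fum    <- 2.7571985; sd.fum    <- 0.4935864; n.fum   <- 131  # smokers
mean.no.fum <- 3.7757633; sd.no.fum <- 0.2378603; n.nofum <- 169  # non-smokers

# Standard error of the difference of means (large samples, unequal variances).
se.diff <- sqrt(sd.fum^2 / n.fum + sd.no.fum^2 / n.nofum)

# Statistic as non-smokers minus smokers: H1 (smokers lower) is the upper tail.
z.obs <- (mean.no.fum - mean.fum) / se.diff
z.obs

# Compare against the one-sided critical values at 95% and 99%.
z.crit    <- qnorm(c(0.05, 0.01), lower.tail = FALSE)
reject.h0 <- z.obs > z.crit
reject.h0
```

The observed statistic is far beyond both critical values, so the decision is the same at either confidence level.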
After 5 years
After 5 years, the lung capacity of the same people in the study is measured again. The PC5Y column contains the lung capacity of the same subjects at 5 years. We ask whether lung capacity has changed significantly, with a 95% confidence level, in the case of smokers and in the case of non-smokers. Answer the following questions.
Calculate whether there are significant differences in smokers between initial lung capacity and lung capacity after 5 years.
Perform the necessary steps: Write the null and alternative hypothesis, the chosen method and the calculations.
Null and alternative hypotheses
Write the null and alternative hypothesis.
$H_0: \mu_{Fum} = \mu_{Fum5Y}$
$H_1: \mu_{Fum} \neq \mu_{Fum5Y}$
which results in:
$H_0: \mu_{dif} = 0$
$H_1: \mu_{dif} \neq 0$
where $dif = Fum - Fum5Y$.
Method
Since the samples are paired, the difference is calculated element by element and a one-sample hypothesis test on the mean difference is applied. The test is two-sided.
Calculation
# Two-tailed paired test (normal approximation)
# Input: two paired samples and the confidence level
test.paired <- function( d1, d2, cl ){
  dif <- d1 - d2                      # element-wise differences
  n <- length( dif )
  dif.mean <- mean( dif )
  dif.sd <- sd( dif )
  mu <- 0                             # mean difference under H0
  alfa <- 1 - cl
  z.obs <- (dif.mean - mu) / (dif.sd/sqrt(n))
  z.critical <- qnorm( alfa/2, lower.tail=FALSE )
  pvalue <- pnorm( abs(z.obs), lower.tail=FALSE )*2 # two tails
  cat( "sample mean=", dif.mean, " sd=", dif.sd, " sample length=", n, "\n",
       "z obs= ", z.obs, "\n",
       "z critical: ", z.critical, "\n",
       "p value", pvalue, "\n" )
  return( pvalue )
}
test.paired( Fum$PC, Fum$PC5Y, 0.95)
## sample mean= 0.09820611 sd= 0.0970989 sample length= 131
## z obs= 11.57604
## z critical: 1.959964
## p value 5.45091e-31
## [1] 5.45091e-31
Perform the same calculation but now for non-smokers
test.paired( NoFum$PC, NoFum$PC5Y, 0.95 )
## sample mean= -0.002100592 sd= 0.02664195 sample length= 169
## z obs= -1.024989
## z critical: 1.959964
## p value 0.3053686
## [1] 0.3053686
Interpret the results obtained in the two contrasts
In the case of smokers, significant differences in lung capacity are observed between the original test and the test after 5 years, with a confidence level of 95%. The sign of the mean difference indicates that PC decreases after 5 years. A one-sided test should be applied to confirm that PC decreases after 5 years.
In the case of non-smokers, no significant differences are observed between the initial PC and the PC after five years, with a confidence level of 95%.
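The one-sided test mentioned for smokers can be sketched from the summary values printed by `test.paired` (mean difference 0.09820611, standard deviation 0.0970989, n = 131). The hypotheses assumed here are $H_0: \mu_{dif} = 0$ against $H_1: \mu_{dif} > 0$, with the difference taken as PC minus PC5Y; the variable names are hypothetical.

```r
# Summary of the paired differences (PC - PC5Y) for smokers, from the output above.
dif.mean <- 0.09820611
dif.sd   <- 0.0970989
n        <- 131

# Test statistic under H0: mean difference = 0.
z.obs <- dif.mean / (dif.sd / sqrt(n))
z.obs

# One-sided test, H1: mean difference > 0 (PC decreased after 5 years).
alfa        <- 1 - 0.95
z.critical  <- qnorm(alfa, lower.tail = FALSE)
p.one.sided <- pnorm(z.obs, lower.tail = FALSE)
reject.h0   <- z.obs > z.critical
c(z.critical, p.one.sided)
reject.h0
```

The one-sided p-value is half the two-sided one reported above, so under these assumptions the decrease in smokers' lung capacity is also significant at the 95% level.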