
Descriptive and inferential statistics: lung capacity case study

This guide reframes the original R notebook as a complete statistics workflow: inspect the data, summarize lung capacity, compare groups and use inferential tests to decide whether the observed differences are statistically meaningful.

What is being analyzed?

The dataset contains lung-capacity measurements together with sex, sport habits, smoking status, cigarettes per day, years smoking, height, weight, age and a second lung-capacity measurement five years later.

Descriptive stage

Summaries, boxplots and group comparisons show the center, spread and possible outliers of lung capacity.

Confidence interval

The interval estimates the plausible population mean instead of relying on the sample mean alone.

Hypothesis tests

t-tests evaluate whether differences by sex, smoking status or time are compatible with random variation.

Interpretation

The key is to connect each p-value and interval with the health question that motivated the analysis.

Why this page is useful beyond the notebook

The original exercise contains many generated R outputs. This introduction clarifies the statistical storyline: first describe the sample, then estimate uncertainty, and finally test hypotheses about smoking, sex and changes over time. That structure helps readers understand the analysis before reading the raw tables.


Introduction

In this activity, we use the project file produced in https://1938.com.es/preprocesamiento-datos-r . This file stores the data of a medical study on lung capacity, whose aim is to determine whether health habits and smoking influence lung capacity. A sample of 300 people was collected. Through a questionnaire, each person reported their sex, sports habits, whether they smoked and, if so, how many cigarettes per day they smoked on average and for how many years they had been smoking. In addition, each person's lung capacity was measured with an expelled-air test. Lung capacity is taken to be the FEF (forced expiratory flow): the speed of the air leaving the lungs during the central portion of a forced expiration, measured in liters per second. Other personal data collected are height, weight, age and city of residence. The file includes an additional column, “PC5Y”, with each person's lung capacity measured 5 years after the first test. It is assumed that the person's circumstances have not changed significantly during this time.

In this activity we will use the “clean” smokers file, that is, after the preprocessing has been carried out. Once the file is prepared for analysis, we will apply descriptive and inferential statistics analysis.

1 Data loading

Load the data file “Fumadores_clean_5Y.csv” and validate that the data types are interpreted correctly.

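# stringsAsFactors = TRUE keeps text columns as factors (the default changed in R 4.0)
data <- read.csv( "Fumadores_clean_5Y.csv", stringsAsFactors = TRUE )
head(data)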
##   Sex Sport Years Cig    PC                  City Weight Age Height  PC5Y
## 1   M     E    25  10 2.579             Barcelona     65  49    171 2.529
## 2   F     E    18  32 1.557              Terrassa     65  35    166 1.444
## 3   M     S     0   0 3.747             La Bisbal     69  38    175 3.730
## 4   M     N    25  14 2.762                Blanes     70  55    176 2.670
## 5   M     E     0   0 3.487 Sant Boi de Llobregat     72  55    178 3.487
## 6   F     S     0   0 4.075             Barcelona     64  42    165 4.052
 sapply( data, class)
##       Sex     Sport     Years       Cig        PC      City    Weight
##  "factor"  "factor" "integer" "integer" "numeric"  "factor" "integer"
##       Age    Height      PC5Y
## "integer" "integer" "numeric"
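
The type check can also be automated. The following is a minimal sketch, assuming a small data frame copied from the first rows printed above. The name `toy`, the `expected` vector and the consistency rule that non-smokers should report zero years of smoking are illustrative assumptions, not part of the original notebook.

```r
# Toy data frame built from the rows shown by head(data);
# in the real analysis this would be the data frame returned by read.csv.
toy <- data.frame(
  Sex   = factor(c("M", "F", "M", "M", "M", "F")),
  Sport = factor(c("E", "E", "S", "N", "E", "S")),
  Years = c(25L, 18L, 0L, 25L, 0L, 0L),
  Cig   = c(10L, 32L, 0L, 14L, 0L, 0L),
  PC    = c(2.579, 1.557, 3.747, 2.762, 3.487, 4.075),
  PC5Y  = c(2.529, 1.444, 3.730, 2.670, 3.487, 4.052)
)

# Expected class of each column (assumption based on the sapply output above).
expected <- c(Sex = "factor", Sport = "factor", Years = "integer",
              Cig = "integer", PC = "numeric", PC5Y = "numeric")

# Columns whose actual class differs from the expected one.
actual <- sapply(toy, class)
mismatch <- names(expected)[actual[names(expected)] != expected]

# Extra sanity checks: no missing values, and nobody with zero
# cigarettes per day reports a positive number of years smoking.
has.na <- anyNA(toy)
inconsistent <- sum(toy$Cig == 0 & toy$Years > 0)
```

If `mismatch` is empty, `has.na` is `FALSE` and `inconsistent` is 0, the columns were read with the intended types and the smoking fields are coherent.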

2 Descriptive Statistics


First, we study measures of central tendency and dispersion for some variables in the data set. Follow the steps below.

2.1 Central values

Calculate the mean, the median and Tukey's five-number summary of lung capacity. Display the sample values in a boxplot. Does the plot show any extreme values (outliers)? Next, draw boxplots of lung capacity separately for women and men. Finally, use boxplots to compare the original PC value with the value after 5 years. Interpret the results.

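# Tukey's five-number summary, median and mean of PC
x.five <- fivenum( data$PC )
x.median <- median( data$PC )
x.mean <- mean( data$PC )

# Boxplot of the whole sample
boxplot( data$PC, main="PC" )

# Boxplots by sex
boxplot( PC~Sex, data, main="PC")

# Initial measurement versus measurement after 5 years
boxplot( data$PC, data$PC5Y, names=c("PC", "PC after 5Y"), main="PC")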

Interpretation: Lung capacity in the sample ranges from a minimum of 1.557 to a maximum of 4.466. The boxplot is skewed to the left: the lower 50% of the values are more spread out than the upper half. The median is 3.554 and the mean is 3.33099. One outlier is observed at the low end.

In the boxplot of PC by sex, women's lung capacity is lower, and its dispersion is greater than men's.

In the boxplot comparing the first test with the test after five years, there are no notable differences in the upper half of the data. The lower half of the PC values shows more dispersion at five years, together with a slight decrease. It remains to be seen whether this difference is significant.
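
The outliers drawn by a boxplot can also be listed numerically. The sketch below assumes the usual rule behind R's boxplots (points more than 1.5 times the hinge spread beyond the hinges) and uses a short hypothetical vector `x`; in the analysis, `x` would be `data$PC`.

```r
# Hypothetical values with one clearly low observation.
x <- c(3.1, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0, 1.2)

# Hinges from Tukey's five-number summary (positions 2 and 4).
five <- fivenum(x)
hinge.spread <- five[4] - five[2]

# Fences at 1.5 times the hinge spread beyond each hinge.
lower.fence <- five[2] - 1.5 * hinge.spread
upper.fence <- five[4] + 1.5 * hinge.spread

# Values outside the fences are the points a boxplot draws individually.
flagged <- x[x < lower.fence | x > upper.fence]

# boxplot.stats() applies the same rule and should agree.
same.as.boxplot <- isTRUE(all.equal(flagged, boxplot.stats(x)$out))
```

Running `boxplot.stats(data$PC)$out` on the real data would list the extreme value mentioned in the interpretation above.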

2.2 Dispersion

Calculate the dispersion of lung capacity using the measurements: variance, standard deviation and interquartile range.

# Dispersion measures
x.iqr <- IQR( data$PC )
x.var <- var( data$PC )
x.sd  <- sd( data$PC )

Interquartile range: 0.88475
Variance: 0.3937751
Standard deviation: 0.627515

2.3 Manual dispersion calculation

Calculate the standard deviation of lung capacity manually and compare the result with the corresponding R function.

N<-length( data$PC )
N
## [1] 300
sample.mean <- sum( data$PC )/N
sample.mean
## [1] 3.33099
sample.sd <- sqrt( sum((data$PC - sample.mean)^2) / (N-1) )

The result of the manually calculated sample standard deviation is: 0.6275. According to the function sd, the result is 0.6275.
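
For reference, the manual calculation above follows the usual sample standard deviation formula, written here with $N$ for the sample size and $\bar{x}$ for the sample mean:

$$
s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}
$$

The divisor $N-1$ is what makes the result match `sd`. Dividing by $N$ instead would scale the result by $\sqrt{(N-1)/N} = \sqrt{299/300} \approx 0.9983$, giving roughly 0.6265 for this sample, so the two versions differ only slightly at this sample size.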

2.4 Histogram

Plot a histogram of the PC variable. If necessary, adjust the histogram parameters so that the shape of the distribution is clearly visible.

N<-length( data$PC )
hist( data$PC, breaks=40, prob=TRUE, main="Distribution of PC values" )

N
## [1] 300
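
The choice `breaks=40` is a manual setting. As a rough guide, R also ships rule-based bin counts that can be compared before settling on a value. The sketch below uses simulated values as a hypothetical stand-in for `data$PC` (same sample size, arbitrary mean and spread), so only the Sturges result, which depends on the sample size alone, carries over to the real data.

```r
# Simulated stand-in for data$PC: 300 values, arbitrary mean and sd.
set.seed(1)
x <- rnorm(300, mean = 3.3, sd = 0.6)

# Suggested number of bins under three common rules (grDevices).
bins <- c(Sturges = nclass.Sturges(x),
          Scott   = nclass.scott(x),
          FD      = nclass.FD(x))
bins
```

For 300 observations the Sturges rule suggests 10 bins, so `breaks=40` is a deliberately finer resolution than the default.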

2.5 Categorical data

For the variables Sex, Sport and City, produce a numerical summary and draw a pie chart showing the proportion of cases in each category.

summary(data)
##  Sex     Sport       Years             Cig               PC
##  F:137   E:127   Min.   : 0.000   Min.   : 0.000   Min.   :1.557
##  M:163   N: 83   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.:2.909
##          R: 48   Median : 0.000   Median : 0.000   Median :3.554
##          S: 42   Mean   : 8.463   Mean   : 7.273   Mean   :3.331
##                  3rd Qu.:15.250   3rd Qu.:13.000   3rd Qu.:3.793
##                  Max.   :51.000   Max.   :47.000   Max.   :4.466
##
##         City         Weight           Age            Height
##  Barcelona:102   Min.   :57.00   Min.   :19.00   Min.   :158.0
##  Terrassa : 42   1st Qu.:65.00   1st Qu.:38.00   1st Qu.:166.0
##  Valls    : 15   Median :68.00   Median :46.00   Median :172.0
##  Tarragona: 14   Mean   :67.72   Mean   :45.59   Mean   :171.4
##  Lleida   : 13   3rd Qu.:71.00   3rd Qu.:52.00   3rd Qu.:176.0
##  Sitges   : 13   Max.   :79.00   Max.   :77.00   Max.   :186.0
##  (Other)  :101
##       PC5Y
##  Min.   :1.444
##  1st Qu.:2.744
##  Median :3.543
##  Mean   :3.289
##  3rd Qu.:3.796
##  Max.   :4.472
## 
summary(data$Sport)
##   E   N   R   S
## 127  83  48  42
#Sex
par(mfrow=c(1,2))
table( data$Sex )
##
##   F   M
## 137 163
pie( table(data$Sex),main="Sex")

#Sport
pie( table(data$Sport),main="Sport")

pie( table(data$City),main="City")

par(mfrow=c(1,1))
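
Pie charts show proportions only visually. A small sketch of how the proportions could be computed and turned into labels follows. It starts from the counts printed above rather than from the data file, and the names `sex.counts`, `sport.counts` and `sex.labels` are illustrative.

```r
# Counts taken from the table() and summary() outputs above.
sex.counts   <- c(F = 137, M = 163)
sport.counts <- c(E = 127, N = 83, R = 48, S = 42)

# Percentages of each category, rounded to one decimal.
sex.pct   <- round(100 * prop.table(sex.counts), 1)
sport.pct <- round(100 * prop.table(sport.counts), 1)

# Labels that combine the category name and its percentage.
sex.labels <- paste0(names(sex.counts), " (", sex.pct, "%)")
sex.labels
```

The labels can then be passed to the chart, for example `pie(sex.counts, labels = sex.labels, main = "Sex")`.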

3 Inferential statistics

3.1 Confidence interval

Calculate the 97% confidence interval for the population mean lung capacity.

By the central limit theorem, the sample mean is approximately normally distributed for samples larger than 30, so the confidence interval for the mean can be built from the normal distribution.

n<-nrow( data )
alpha<-1-0.97

# Standard error
errorTipic <- sd(data$PC) / sqrt( n )
errorTipic
## [1] 0.0362296
# z value
z<-qnorm( 1-alpha/2 )
z
## [1] 2.17009
# Margin of error
error<- z * errorTipic
error
## [1] 0.0786215
# Interval
c( mean(data$PC) - error, mean(data$PC) + error )
## [1] 3.252369 3.409611
# Check with Student's t.
# The result is not exactly the same because a normal distribution was assumed above.
t.test( data$PC, conf.level=0.97 )
##
##  One Sample t-test
##
## data:  data$PC
## t = 91.941, df = 299, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 97 percent confidence interval:
##  3.251991 3.409989
## sample estimates:
## mean of x
##   3.33099

3.2 Analyze the lung capacity of women

We assume that we know the average lung capacity of the population, which is equal to 3.30. Can we say that women's lung capacity is lower than the population average, with a 95% confidence level? To answer this question, follow the steps provided.

3.2.1 Write the null and alternative hypothesis

$H_{0}: \mu = \mu_{0}$

$H_{1}: \mu < \mu_{0}$

where $\mu_{0} = 3.30$.

3.2.2 Method

Indicate which is the most appropriate method to perform this analysis, depending on the characteristics of the sample and the objective of the analysis.

Since the sample of women (n = 137) is larger than 30, the central limit theorem lets us use the normal distribution. We apply a one-sided (left-tailed) hypothesis test on the mean.

3.2.3 Calculate the contrast statistic, the critical value and the p value.

# Subsample of women
Fem <- data[data$Sex=='F',]
mu0<-3.30
n<-nrow( Fem )

mean(Fem$PC)
## [1] 3.215241
# Test statistic
t<- (mean(Fem$PC)-mu0) / (sd(Fem$PC)/sqrt(n))

# Critical value at 95%
# One-sided (left-tailed) test
tcritical <- qnorm( 1-0.95, lower.tail=TRUE)

# p value
pvalue <- pnorm( t, lower.tail=TRUE)

t
## [1] -1.592563
tcritical
## [1] -1.644854
pvalue
## [1] 0.05562909
# Check with t.test (it uses Student's t, so the p value differs slightly)
t.test( Fem$PC, mu=3.30, alternative="less" )
##
##  One Sample t-test
##
## data:  Fem$PC
## t = -1.5926, df = 136, p-value = 0.05679
## alternative hypothesis: true mean is less than 3.3
## 95 percent confidence interval:
##      -Inf 3.303383
## sample estimates:
## mean of x
##  3.215241

3.2.4 Interpret the result

The critical value is -1.64 and the test statistic is -1.59. Because the statistic is not below the critical value, it falls outside the rejection region and we do not reject the null hypothesis. We cannot conclude that women's lung capacity is lower than the population mean of 3.3.

The p value leads to the same conclusion: p = 0.0556, whereas rejecting the null hypothesis at the 95% confidence level requires p < 0.05.
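
Because the p value is close to 0.05, it is worth checking that the conclusion does not depend on using the normal distribution instead of Student's t. A minimal sketch, starting from the test statistic and sample size reported above (variable names are illustrative):

```r
# Reported test statistic and sample size for the subsample of women.
t.obs <- -1.592563
n     <- 137
alpha <- 0.05

# Left-tailed p values under the normal and Student's t distributions.
p.norm <- pnorm(t.obs)
p.t    <- pt(t.obs, df = n - 1)

# Left-tailed critical values under both distributions.
crit.norm <- qnorm(alpha)
crit.t    <- qt(alpha, df = n - 1)

# Reject H0 only if the statistic falls below the critical value.
reject.norm <- t.obs < crit.norm
reject.t    <- t.obs < crit.t
```

Both versions give a p value slightly above 0.05 and neither rejects the null hypothesis, in line with the `t.test` output.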

3.3 Comparison between smokers and non-smokers

We ask whether the lung capacity of smokers is lower than that of non-smokers. Apply a hypothesis test with 95% confidence and interpret the result.

Follow the steps below.

3.3.1 Null and alternative hypotheses

Write the null and alternative hypothesis.

$H_0: \mu_{Fum} = \mu_{NoFum}$

$H_1: \mu_{Fum} < \mu_{NoFum}$

3.3.2 Method

Explain the method you will apply to make this contrast and justify it.

We apply a two-sample test on the difference of means for independent samples. Both samples are large, so the normal approximation holds even if the data are not normally distributed. The test is one-sided.
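
Written out, the statistic for this large-sample case takes the following form, where $\bar{x}$, $s$ and $n$ denote the sample mean, sample standard deviation and size of each group. The difference is taken as non-smokers minus smokers, so the alternative hypothesis corresponds to large positive values:

$$
z_{obs} = \frac{\bar{x}_{NoFum} - \bar{x}_{Fum}}{\sqrt{\dfrac{s_{Fum}^{2}}{n_{Fum}} + \dfrac{s_{NoFum}^{2}}{n_{NoFum}}}}
$$

$H_0$ is rejected at significance level $\alpha$ when $z_{obs} > z_{1-\alpha}$, where $z_{1-\alpha}$ is the $1-\alpha$ quantile of the standard normal distribution.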

3.3.3 Calculation

Perform the calculation. As before, do not use R functions or libraries that perform the test directly; the calculation must be done manually. You may use functions such as qnorm, pnorm, qt and pt.

str( data$Cig )
##  int [1:300] 10 32 0 14 0 0 15 0 0 12 ...
Fum <- data[data$Cig > 0 ,]
NoFum <- data[data$Cig==0, ]
n.fum <- nrow( Fum )
n.nofum <- nrow( NoFum )

# Check
nrow( data )
## [1] 300
n.fum
## [1] 131
n.nofum
## [1] 169
n.fum + n.nofum
## [1] 300
# Calculations
mean.fum <- mean( Fum$PC )
sd.fum <- sd( Fum$PC )
c( mean.fum, sd.fum )
## [1] 2.7571985 0.4935864
mean.no.fum <- mean( NoFum$PC )
sd.no.fum <- sd( NoFum$PC )
c( mean.no.fum, sd.no.fum )
## [1] 3.7757633 0.2378603
# Show boxplot
boxplot( Fum$PC, NoFum$PC, main="PC", names=c("Smoker", "Non-smoker"))

# Standard error of the difference of means
S <- sqrt( sd.fum^2/n.fum + sd.no.fum^2/n.nofum)
# The statistic is built as NoFum - Fum, so H1 corresponds to the upper tail
zobs <- (mean.no.fum-mean.fum)/ S
zobs
## [1] 21.74292
alfa <- 1-0.95
zcritical <- qnorm( alfa, lower.tail=FALSE )
zcritical
## [1] 1.644854
# One-sided p value; abs() is not needed and would hide a difference in the wrong direction
pvalue<-pnorm( zobs, lower.tail=FALSE )
pvalue
## [1] 4.030158e-105

3.3.4 And at 99% confidence? Redo the calculations

alfa <- 1-0.99
zcritical <- qnorm( alfa, lower.tail=FALSE )
zcritical
## [1] 2.326348
pvalue<-pnorm( zobs, lower.tail=FALSE )
pvalue
## [1] 4.030158e-105

3.3.5 Interpretation

We reject the null hypothesis that smokers and non-smokers have the same mean lung capacity in favor of the alternative hypothesis that smokers' lung capacity is lower, at both the 95% (p value < 0.05) and 99% (p value < 0.01) confidence levels. The observed statistic (21.74) far exceeds both critical values (1.64 and 2.33).
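
The whole contrast can be reproduced from the summary statistics alone, which is handy when the raw data are not at hand. The helper `z.two.sample` below is a hypothetical function written for this illustration, not part of the original notebook. It assumes large independent samples and tests whether the second group's mean is greater than the first's.

```r
# Hypothetical helper: one-sided large-sample z test for
# mean(group 2) > mean(group 1), from summary statistics only.
z.two.sample <- function(m1, s1, n1, m2, s2, n2) {
  se <- sqrt(s1^2 / n1 + s2^2 / n2)          # standard error of the difference
  z  <- (m2 - m1) / se                       # observed statistic
  list(z = z, p.value = pnorm(z, lower.tail = FALSE))
}

# Group 1 = smokers, group 2 = non-smokers (values reported above).
res <- z.two.sample(m1 = 2.7571985, s1 = 0.4935864, n1 = 131,
                    m2 = 3.7757633, s2 = 0.2378603, n2 = 169)

# Critical values at 95% and 99% confidence, and the resulting decisions.
crit   <- qnorm(c(0.95, 0.99))
reject <- res$z > crit
```

Both entries of `reject` are `TRUE`, matching the conclusion at the two confidence levels.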

3.4 After 5 years

After 5 years, the lung capacity of the same people is measured again; the PC5Y column holds these measurements. We ask whether lung capacity has changed significantly, at a 95% confidence level, for smokers and for non-smokers. Answer the following questions.

3.4.1 Calculate whether there are significant differences in smokers between initial lung capacity and lung capacity after 5 years.

Perform the necessary steps: Write the null and alternative hypothesis, the chosen method and the calculations.

3.4.1.1 Null and alternative hypotheses

Write the null and alternative hypothesis.

$H_0: \mu_{Fum} = \mu_{Fum5Y}$

$H_1: \mu_{Fum} \neq \mu_{Fum5Y}$

which is equivalent to:

$H_0: \mu_{dif} = 0$

$H_1: \mu_{dif} \neq 0$

where dif = Fum - Fum5Y.

3.4.1.2 Method

Since the samples are paired, the difference is calculated element by element and a one-sample hypothesis test on the mean of the differences is applied. The test is two-sided.

3.4.1.3 Calculation

# Two-tailed paired test
# Input: two paired samples and the confidence level
test.paired <- function( d1, d2, cl ){
  dif <- d1 - d2
  n <- length( dif )
  dif.mean <- mean( dif )
  dif.sd <- sd( dif )
  mu <- 0
  alfa <- 1-cl
  z.obs <- (dif.mean - mu) / (dif.sd/sqrt(n))
  z.critical <- qnorm( alfa/2, lower.tail=FALSE )
  pvalue <- pnorm( abs(z.obs), lower.tail=FALSE )*2  # two tails
  cat ("sample mean=", dif.mean, "   sd=", dif.sd, "  sample length=", n, "\n",
       "z obs= ", z.obs, "\n",
       "z critical: ", z.critical, "\n",
       "p value", pvalue, "\n")
  return (pvalue)
}
test.paired( Fum$PC, Fum$PC5Y, 0.95)
## sample mean= 0.09820611    sd= 0.0970989   sample length= 131
##  z obs=  11.57604
##  z critical:  1.959964
##  p value 5.45091e-31
## [1] 5.45091e-31

3.4.2 Perform the same calculation but now for non-smokers

test.paired( NoFum$PC, NoFum$PC5Y, 0.95 )
## sample mean= -0.002100592    sd= 0.02664195   sample length= 169
##  z obs=  -1.024989
##  z critical:  1.959964
##  p value 0.3053686
## [1] 0.3053686

3.4.3 Interpret the results obtained in the two contrasts

For smokers, significant differences in lung capacity are observed between the original test and the test after 5 years, at a 95% confidence level. The mean difference PC - PC5Y is positive, so PC decreases after 5 years; confirming that decrease formally would require a one-sided test.

For non-smokers, no significant differences are observed between the initial PC and the PC after five years, at a 95% confidence level.
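
A one-sided version of the smokers' contrast can be sketched from the summary values printed by `test.paired` (mean difference 0.09820611, standard deviation 0.0970989, n = 131). The hypotheses assumed here are $H_0: \mu_{dif} = 0$ against $H_1: \mu_{dif} > 0$, with dif = PC - PC5Y; the variable names are illustrative.

```r
# Summary of the paired differences PC - PC5Y for smokers (reported above).
d.mean <- 0.09820611
d.sd   <- 0.0970989
n      <- 131

# Observed statistic under H0: mean difference equal to 0.
z.obs <- d.mean / (d.sd / sqrt(n))

# Upper-tail critical value and p value at 95% confidence.
z.crit <- qnorm(0.95)
p.one  <- pnorm(z.obs, lower.tail = FALSE)

# TRUE means the data support a decrease in PC after 5 years.
decrease <- z.obs > z.crit
```

The one-sided p value is half the two-sided value reported earlier, so under these assumptions the decrease in smokers' lung capacity is also significant at the 95% level.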