Variable selection: tria i remena

Cristian Tebé Cordomí, MPH, PhD

Biostatistics Support and Research Unit - IGTP

2024-12-07

Tria i remena? ("Pick and shuffle?")



Don't rest even for a moment:
if you stir for a good long while
the mixture turns out better
and you leave the customer happy.
Remena, remena, nena ("stir it, stir it, girl").

Enderrock: https://www.enderrock.cat/noticia/24742/veritable-historia-remena-nena

Summary

  • Objective

  • Variables and data

  • The classics

  • Regularization

  • Machine learning

  • Beyond the selection process

  • Final messages

What is the objective of our model?

  • To predict an outcome.
  • To estimate the causal effect of A on B.
  • To estimate the impact of an intervention on an outcome variable.
  • To classify / discriminate according to an outcome variable.

Which set of variables?

  • Literature review

  • Available data

  • Sit and talk to a clinician (an expert)

Is your data ready?

Know your data:

  • Summarize your data.
  • Look for outliers and quantify the number of missing values.
  • Estimate your outcome incidence / prevalence / expected value.

Addressing missing values is worthwhile:

  • Mean, median, mode.
  • Last observation carried forward.
  • Modelling.
  • Multiple imputation.

Are your continuous variables on a comparable scale?

  • Regression and clustering are methods sensitive to variable scales.

  • To homogenise the scales, standardisation is a good option:

\[Z_{x_i}=\frac{x_i-\bar x}{s_x}\]

where \(\bar x\) and \(s_x\) are the sample mean and standard deviation of \(x\).

Check for multicollinearity among predictors, as it can affect the selection process and the stability of the model.

\[\text{Variance Inflation Factor: } VIF_j=\frac{1}{1-R_j^2}\]

For each candidate \(X_j\), a regression on the remaining predictors, \(X_j=\beta_0+\sum_{k\ne j}\beta_kX_k+\epsilon\), is fitted and its \(R_j^2\) is estimated.

Rule of thumb: \(VIF_j>10\) implies \(R_j^2>0.9\).

For description:

  • gtsummary: https://cran.r-project.org/web/packages/gtsummary/
  • compareGroups: https://cran.r-project.org/web/packages/compareGroups/
# Libraries
library(dplyr)      # data handling: select() and pipe-friendly verbs
library(gtsummary)  # descriptive tables

## Description of included subjects
data |> select(-id, -exitus_ms) |>
  tbl_summary(missing = "ifany") |>
  bold_labels()

## Event incidence
data |> select(exitus_ms) |>
  tbl_summary(missing = "ifany") |>
  bold_labels()

## Description by event
data |> select(-id) |>
  tbl_summary(by = exitus_ms, missing = "ifany", percent = "row") |>
  bold_labels()

To deal with missing data:

  • mice: Multivariate Imputation by Chained Equations https://cran.r-project.org/web/packages/mice/index.html
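
For example, a minimal sketch of multiple imputation with mice (a sketch under assumptions: the data frame data from the description slides and default settings; m, method and the predictor matrix should be chosen for the actual data):

library(mice)  # Multivariate Imputation by Chained Equations

imp <- mice(data, m = 5, seed = 1072024, printFlag = FALSE)  # 5 imputed data sets
summary(imp)                                                 # imputation method used for each variable
data_imp <- complete(imp, 1)                                 # first completed data set
# for modelling, with() and pool() combine the results across the m imputations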

Collinearity:

  • vif function from car: Companion to Applied Regression https://cran.r-project.org/web/packages/car/index.html
  • cor function from base R.
  • Use Principal Component Analysis (PCA).
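
A minimal sketch combining the first two options above (assuming the data frame data and the outcome exitus_ms used in these slides):

library(car)  # vif()

m <- glm(exitus_ms ~ ., data = data[, names(data) != "id"], family = binomial)
car::vif(m)  # VIF (GVIF for factors) of each predictor

num_vars <- data[, sapply(data, is.numeric) & names(data) != "id"]
cor(num_vars, use = "pairwise.complete.obs")  # pairwise correlations among the numeric variables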

Standardisation:

  • scale function from base R.
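
For example, a minimal sketch standardising the numeric predictors with scale (assuming the data frame data used above; the id and the outcome are left untouched):

num_cols <- sapply(data, is.numeric) & !names(data) %in% c("id", "exitus_ms")
data_std <- data
data_std[num_cols] <- scale(data[num_cols])  # centre to mean 0 and scale to SD 1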

Big message

The classics

Forward selection:

  1. Start with the empty model: \(y=\beta_0+\epsilon\)
  2. For each \(X_i\) not in the model, fit: \(y=\beta_0+\beta_i X_i+\epsilon\)
  3. Define the selection criterion: p-value, AIC, BIC, adjusted \(R^2\).
  4. Add to the model the predictor \(X_i\) that most improves the criterion.
  5. Repeat steps 2 to 4 until no significant improvement is seen.

Backward elimination:

  1. Start with the full model: \(y=\beta_0+\sum_{j=1}^m\beta_jX_j+\epsilon\)
  2. For each \(X_i\) in the model, fit the model without it: \(y=\beta_0+\sum_{j\ne i}\beta_jX_j+\epsilon\)
  3. Define the selection criterion: p-value, AIC, BIC, adjusted \(R^2\).
  4. Remove from the model the predictor \(X_i\) whose exclusion most improves the criterion.
  5. Repeat steps 2 to 4 until no significant improvement is seen.

Stepwise selection:

  1. Alternate forward and backward steps.
  2. Define the selection criterion: p-value, AIC, BIC, adjusted \(R^2\).
  3. Add or remove the predictor \(X_i\) that most improves the criterion.
  4. Repeat steps 1 to 3 until no significant improvement is seen.
library(MASS)  # stepAIC(): forward, backward and stepwise selection using AIC

dt <- data[, !names(data) %in% c("id")]  # drop the identifier

fullmodel <- glm(exitus_ms ~ ., data = dt, family = binomial(link = "logit"))

step.model <- MASS::stepAIC(fullmodel, direction = "both", trace = FALSE)

summary(step.model)    # the final model

step.model$anova$Step  # which variable is added or removed at each step

step.model$anova$AIC   # AIC at each step

The classics (reloaded)

  1. Generate a new dataset by taking a sample with replacement from your original data.
  2. Fit a model using the data from step 1: \(y=\beta_0+\sum_{j=1}^m\beta_jX_j+\epsilon\)
  3. Define the selection criterion: p-value, AIC, BIC, adjusted \(R^2\).
  4. Use stepwise selection to add or remove predictors until no significant improvement is seen.
  5. Repeat steps 1 to 4 B times.
library(bootStepAIC)  # boot.stepAIC(): bootstrapped stepwise selection using AIC

voltes <- 750  # number of bootstrap samples; WARNING: be patient if voltes and/or your data are large

boot.model <- bootStepAIC::boot.stepAIC(fullmodel, data = dt, B = voltes, verbose = TRUE, seed = 1072024)

saveRDS(boot.model, file = "boot.model2.rds")  # save the (slow) fit for later reuse

boot.model$Covariates  # % of bootstrap samples in which each covariate appears in the final model

Regularization methods

These methods add a penalty to the ordinary least squares (OLS) objective function.

\[min\sum_{i=1}^n(y_i-(\beta_0+x_i\beta))^2\]

The objective of OLS is to find the best possible fit for the data by minimising the sum of squared residuals.

Lasso: the penalty is the sum of the absolute values of the coefficients.

\[min(\frac{1}{2n}\sum_{i=1}^n(y_i-x_i\beta)^2+\alpha\sum_{j=1}^p|\beta_j|)\]

Ridge: the penalty is the sum of the squared values of the coefficients.

\[min(\frac{1}{2n}\sum_{i=1}^n(y_i-x_i\beta)^2+\alpha\sum_{j=1}^p\beta_j^2)\]

WARNING: Ridge shrinks coefficients towards zero but does not set them exactly to zero, so it does not perform variable selection.

Elastic net: the penalty combines the properties of Lasso and Ridge regression.

\[min(\frac{1}{2n}\sum_{i=1}^n(y_i-x_i\beta)^2+\alpha_1\sum_{j=1}^p|\beta_j| +\alpha_2\sum_{j=1}^p\beta_j^2)\]

Where \(\alpha_1\) is the weight of the Lasso penalty, and \(\alpha_2\) of the Ridge penalty.

An alternative is to parameterise the penalty with a single weight and a mixing parameter:

\[min(\frac{1}{2n}\sum_{i=1}^n(y_i-x_i\beta)^2+\lambda(\alpha\sum_{j=1}^p|\beta_j| +(1-\alpha)\sum_{j=1}^p\beta_j^2))\]

where \(\lambda\) is the overall weight of the penalty and \(\alpha\) controls the mix between the Lasso (\(\alpha=1\)) and Ridge (\(\alpha=0\)) terms. This is, up to constants, the parameterisation used by glmnet.

library(glmnet)  # regularised regression: Lasso, Ridge and elastic net

set.seed(1072024)

dt_n <- dt
dt_n[] <- lapply(dt_n, function(x) if (is.factor(x)) as.numeric(x) else x)  # recode factors as numbers

x <- as.matrix(dt_n[, !names(dt_n) %in% c("exitus_ms")])  # predictor matrix
y <- dt$exitus_ms                                         # response variable

# alpha = 0.5 gives an elastic net penalty; alpha = 1 would be the Lasso, alpha = 0 the Ridge
lasso_model0 <- glmnet(x, y, family = "binomial", alpha = 0.5, lambda = NULL)

print(lasso_model0)               # the fitting path, one row per lambda value
plot(lasso_model0, label = TRUE)  # coefficients vs log(lambda), with variable labels
coef(lasso_model0)                # coefficients along the lambda path

# cross-validation to choose lambda (here with alpha = 1, i.e. the Lasso)
cv_lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1)

coef_df <- as.data.frame(as.matrix(coef(lasso_model0)))
row.names(coef_df[coef_df$s35 != 0, ])  # variables with non-zero coefficient at the 36th lambda of the path

# refit the Lasso at the cross-validated lambda
lasso_model1 <- glmnet(x, y, family = "binomial", alpha = 1, lambda = cv_lasso$lambda.min)
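
As a short follow-up (assuming lasso_model1 fitted above), the variables kept at the cross-validated lambda can be listed with:

cf <- as.matrix(coef(lasso_model1))  # coefficients at lambda.min
rownames(cf)[cf[, 1] != 0]           # selected variables (the intercept is always listed)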

Machine learning

Random forest is a machine learning algorithm that combines the output of multiple decision trees to reach a single result.

This method makes it possible to rank variables by their relevance:

  • Gini
  • Permutation
  • Boruta algorithm

Gini Importance: measures the importance of a variable based on the total decrease in node impurity when it is used to split a node in the tree.

library(randomForest)

set.seed(1072024)

x <- as.matrix(dt_n[, !names(dt_n) %in% c("exitus_ms")])  # predictor variables
y <- dt$exitus_ms                                         # response variable

# In a real analysis, tuning parameters (ntree, mtry, ...) should also be specified
model_rdf <- randomForest(x, y, importance = TRUE)

importance(model_rdf)                       # importance table
varImpPlot(model_rdf, main = "Importance")  # importance plot

Permutation importance: assesses the importance of a variable by measuring the decrease in the model’s accuracy when the feature’s values are randomly shuffled.

library(randomForest)
library(caret)

set.seed(1072024)

x <- as.matrix(dt_n[, !names(dt_n) %in% c("exitus_ms")])  # predictor variables
y <- dt$exitus_ms                                         # response variable

# importance = TRUE is needed so that the permutation (accuracy-based) measure is computed
model_rdf <- randomForest(x, y, importance = TRUE)

imp <- varImp(model_rdf, scale = FALSE)  # permutation importance via caret

Boruta Algorithm: it creates shadow variables by shuffling the original variables and then assesses the importance of each original variable by comparing it to the highest importance score among the shadow variables.

library(Boruta)

set.seed(1072024)

boruta <- Boruta(exitus_ms ~ ., data = dt_n, doTrace = 2, maxRuns = 500)  # can take a while

# In the talk a precomputed fit was loaded instead of refitting:
# boruta <- readRDS("boruta.model.rds")

print(boruta)

plot(boruta, las = 2, cex.axis = 0.7)  # importance of each variable vs the shadow variables

plotImpHistory(boruta)  # importance across iterations

boruta2 <- TentativeRoughFix(boruta)  # resolve the variables left as tentative
print(boruta2)

Other methodologies

  • Principal Component Analysis (PCA)

  • Partial Least Squares

  • Bayesian Variable Selection
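
As an illustration of the first option, a minimal PCA sketch with base R (assuming the numeric predictor matrix x built in the glmnet example; how many components to keep is a separate decision):

pca <- prcomp(x, center = TRUE, scale. = TRUE)  # PCA on the standardised predictors
summary(pca)                                    # variance explained by each component
head(pca$rotation[, 1:2])                       # loadings of the first two components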

Beyond the selection process

Likelihood ratio test

Model 1: exitus_ms ~ scale(total_leuc) + depresion
Model 2: exitus_ms ~ scale(total_leuc) * depresion
  #Df  LogLik Df  Chisq Pr(>Chisq)   
1   3 -1030.4                        
2   4 -1026.8  1 7.1987   0.007296 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
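
The output above compares nested logistic models with and without the interaction between scale(total_leuc) and depresion. A minimal sketch of how such a test can be run (assuming the data frame dt used earlier and the lmtest package, which prints its output in this format):

library(lmtest)  # lrtest() for likelihood ratio tests of nested models

m1 <- glm(exitus_ms ~ scale(total_leuc) + depresion, data = dt, family = binomial)  # main effects only
m2 <- glm(exitus_ms ~ scale(total_leuc) * depresion, data = dt, family = binomial)  # with interaction

lrtest(m1, m2)  # does the interaction significantly improve the fit?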

Final messages

  • Keep in mind the goals of the analysis.

  • Sit and talk (with an expert).

  • Know your data.

  • Avoid the classics and rules of thumb.

  • Try not to fall in love with a single method.

  • Use bootstrapping and/or cross-validation.

  • You are not alone… and ChatGPT is not a friend… better call a colleague.

Questions

Cristian Tebé Cordomí: ctebe@igtp.cat

Variable selection: tria i remena




Thank you very much!