Repeat points 1 to 3 until no significant improvement is seen.
library(MASS) # forward, backward and stepwise functions using AIC
dt <- data[, !names(data) %in% c("id")]
fullmodel <- glm(exitus_ms ~ ., data = dt, family = binomial(link = "logit"))
step.model <- MASS::stepAIC(fullmodel, direction = "both", trace = FALSE)
summary(step.model)   # Get the final model
step.model$anova$Step # Which variable is discarded at each step
step.model$anova$AIC  # AIC improvement at each step
Use stepwise selection to add or remove predictors until no significant improvement is seen.
Repeat steps 1 to 4 B times.
library(bootStepAIC) # bootstrap stepwise function using AIC
voltes <- 750 # WARNING: be patient if voltes and/or your dataset is large
boot.model <- bootStepAIC::boot.stepAIC(fullmodel, B = voltes, data = dt, verbose = T, seed = 1072024)
saveRDS(boot.model, file = r"(boot.model2.rds)")
boot.model$Covariates # show the % of appearance of each factor in a final model
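As a follow-up, a minimal sketch of how the bootstrap selection frequencies could be turned into a shortlist; the 60% cut-off is an arbitrary choice for illustration, and it assumes boot.model$Covariates is a one-column matrix of selection percentages with the covariate names as row names (check str(boot.model) in your own session):

cov_tab <- boot.model$Covariates # % of bootstrap models in which each covariate was selected
# Hypothetical rule: keep the covariates selected in at least 60% of the bootstrap models
rownames(cov_tab)[cov_tab[, 1] >= 60]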
Adds a penalty to the ordinary least squares (OLS) objective function.
\[\min\sum_{i=1}^n(y_i-(\beta_0+x_i\beta))^2\] The main objective of the OLS method is to find the best possible fit for a set of data by minimising the sum of squares of the residuals.
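As a small illustration of this objective (simulated data, not the dataset used elsewhere in this document), the sketch below checks that lm() returns the same coefficients as directly minimising the residual sum of squares with optim():

set.seed(1)
x_sim <- rnorm(50)
y_sim <- 2 + 3 * x_sim + rnorm(50)

fit <- lm(y_sim ~ x_sim) # ordinary least squares fit

rss <- function(b) sum((y_sim - (b[1] + b[2] * x_sim))^2) # the OLS objective
opt <- optim(c(0, 0), rss) # numerical minimisation of the same objective

coef(fit) # beta0 and beta1 from lm()
opt$par   # essentially the same values, obtained by minimising the RSS directly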
In the Lasso, the penalty is the sum of the absolute values of the coefficients; in Ridge regression, it is the sum of their squares.
The elastic net combines both penalties: \[\min\left(\frac{1}{2n}\sum_{i=1}^n(y_i-x_i\beta)^2+\alpha_1\sum_{j=1}^p|\beta_j|+\alpha_2\sum_{j=1}^p\beta_j^2\right)\] Where \(\alpha_1\) is the weight of the Lasso penalty, and \(\alpha_2\) that of the Ridge penalty.
An alternative is to parameterise the penalty with a single overall weight and a mixing parameter, which is the form used by glmnet:
\[\min\left(\frac{1}{2n}\sum_{i=1}^n(y_i-x_i\beta)^2+\lambda\left(\alpha\sum_{j=1}^p|\beta_j|+\frac{1-\alpha}{2}\sum_{j=1}^p\beta_j^2\right)\right)\] Where \(\lambda\) is the overall weight of the penalty, and \(\alpha\) controls the mix between the Lasso and Ridge penalties (\(\alpha=1\) gives the Lasso, \(\alpha=0\) the Ridge).
library(glmnet) ## Regularization methods
set.seed(1072024)
dt_n <- dt
dt_n[] <- lapply(dt_n, function(x) if (is.factor(x)) as.numeric(x) else x) # factors to numbers
x <- as.matrix(dt_n[, !names(dt_n) %in% c("exitus_ms")]) # Predictor variables
y <- dt$exitus_ms # Response variable
lasso_model0 <- glmnet(x, y, family = "binomial", alpha = 0.5, lambda = NULL)
print(lasso_model0) # show the procedure step by step
plot(lasso_model0, label = TRUE) # show selected variables, coefficients and log(lambda)
coef(lasso_model0) # coefficients along the whole lambda path
# cross-validation to obtain the best lambda
cv_lasso <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coef_df <- as.data.frame(as.matrix(coef(lasso_model0)))
row.names(coef_df[coef_df$s35 != 0, ]) # variables with non-zero coefficients at step s35 of the path
lasso_model1 <- glmnet(x, y, family = "binomial", alpha = 1, lambda = cv_lasso$lambda.min)
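A short follow-up sketch (not in the original code) to list the predictors that keep a non-zero coefficient in lasso_model1, the model refitted at the cross-validated cv_lasso$lambda.min:

coef_min <- as.matrix(coef(lasso_model1)) # coefficients at lambda.min
selected <- rownames(coef_min)[coef_min[, 1] != 0] # predictors kept by the Lasso
setdiff(selected, "(Intercept)") # drop the intercept from the list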
Random forest is a machine learning algorithm that combines the output of multiple decision trees to reach a single result.
This method allows variables to be ranked according to their relevance:
Gini
Permutation
Boruta algorithm
Gini importance: measures the importance of a variable based on the total decrease in node impurity produced when it is used to split a node, averaged over all the trees in the forest.
library(randomForest)
set.seed(1072024)
x <- as.matrix(dt_n[, !names(dt_n) %in% c("exitus_ms")]) # Predictor variables
y <- dt$exitus_ms # Response variable
# This is a minimal call; in practice more parameters (ntree, mtry, ...) should be specified
model_rdf <- randomForest(x, y, importance = TRUE)
importance(model_rdf) # importance table (the MeanDecreaseGini column is the Gini importance)
varImpPlot(model_rdf, main = "Importance") # importance plot
Permutation importance: assesses the importance of a variable by measuring the decrease in the model's accuracy when that variable's values are randomly shuffled.
library(randomForest)
library(caret)
set.seed(1072024)
x <- as.matrix(dt_n[, !names(dt_n) %in% c("exitus_ms")]) # Predictor variables
y <- dt$exitus_ms # Response variable
# importance = TRUE is needed so that permutation (accuracy-based) importance is computed
model_rdf <- randomForest(x, y, importance = TRUE)
imp <- varImp(model_rdf, scale = FALSE) # importance table
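Alternatively (not in the original code), the permutation importance can be read directly from the randomForest object, provided the model was fitted with importance = TRUE as above:

perm_imp <- importance(model_rdf, type = 1, scale = FALSE) # type = 1: mean decrease in accuracy (permutation importance)
perm_imp[order(perm_imp[, 1], decreasing = TRUE), , drop = FALSE] # variables ranked by permutation importance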
Boruta algorithm: creates shadow variables by shuffling the original variables, and then assesses the importance of each original variable by comparing it with the highest importance score among the shadow variables.
library(Boruta)
set.seed(1072024)
boruta <- Boruta(exitus_ms ~ ., data = dt_n, doTrace = 2, maxRuns = 500)
# boruta <- readRDS("boruta.model.rds") # alternatively, load a previously saved run instead of re-fitting
print(boruta)
plot(boruta, las = 2, cex.axis = 0.7)
plotImpHistory(boruta) # importance history across the Boruta runs
boruta2 <- TentativeRoughFix(boruta) # resolve the attributes left as tentative
print(boruta2)
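As a short follow-up (not in the original code), the confirmed attributes can be extracted directly from the Boruta result:

getSelectedAttributes(boruta2, withTentative = FALSE) # variables confirmed as important
attStats(boruta2) # summary table: importance statistics and final decision per variable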