Hepatocellular carcinoma (HCC) is the most common type of primary liver cancer. It occurs most often in people with chronic liver disease, such as cirrhosis caused by hepatitis B or hepatitis C infection.
In this R project we will study the survival of patients with hepatocellular carcinoma using three different algorithms: logistic regression, K-nearest neighbors (KNN), and Naive Bayes.
You will need to install the mice library:

```r
install.packages("mice")
```
MICE (Multivariate Imputation via Chained Equations) creates multiple imputations, which, compared to a single imputation (such as the mean), accounts for the uncertainty in missing values. MICE assumes that the missing data are Missing at Random (MAR), meaning that the probability that a value is missing depends only on the observed values and can therefore be predicted from them. It imputes the data on a variable-by-variable basis by specifying an imputation model per variable.
Make the necessary imports and get the features and labels from the dataset.
Use MICE to impute the missing values, as sketched below.
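A minimal imputation sketch, assuming the raw dataset has been loaded into a data frame called `data`; the number of imputations (m = 5) and the `pmm` method are illustrative defaults, not necessarily the project's exact settings:

```r
library(mice)

# Build 5 imputed datasets with predictive mean matching,
# then keep the first completed one
imp <- mice(data, m = 5, method = "pmm", seed = 123)
data_imputed <- complete(imp, 1)
```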
Training the data:
We load the class feature as a factor for prediction:
```r
# Drop unused columns and encode the class label as a factor
library(caret)
set.seed(123)
data_imputed[c(27:29)] <- NULL
data_imputed$Class <- make.names(data_imputed$Class)  # valid R names, e.g. 2 -> X2
data_imputed$Class <- as.factor(data_imputed$Class)
```
Regression: a statistical relation between two or more variables in which a change in the independent variable is associated with a change in the dependent variable.
Logistic regression is a method for fitting a regression curve y = f(x) when y is a categorical variable (0 or 1) [binomial logistic regression], used here because the variable to predict is binary [binary classification].
Linear regression: y is always a continuous variable, fitted with the lm() function; the predicted y can exceed the 0 to 1 range, so it is unsuited to modelling probabilities.
Logistic regression (binary dependent variable), fitted with the glm() function: it gives a probability score that reflects the probability of the occurrence of the event, and the predicted probability P always lies between 0 and 1.
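As a minimal standalone sketch (the project itself fits the model through caret's train() below), a binomial glm can be fitted directly on the imputed data:

```r
# Fit a binomial logistic regression; Class must be a two-level factor
fit <- glm(Class ~ ., data = data_imputed, family = binomial)

# type = "response" returns probabilities in (0, 1) rather than log-odds
probs <- predict(fit, type = "response")
head(probs)
```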
The project evaluates each model with a manual K-fold loop around caret's train():

```r
crossValidation <- function(data, method, folds, grid = NULL, indx, tune = NULL) {
  accuracy <- NULL
  sensitivity <- NULL
  # twoClassSummary + classProbs let caret compute the "Sens" metric used for tuning
  ctrl <- trainControl(summaryFunction = twoClassSummary, classProbs = TRUE)
  for (i in folds) {
    training_set <- data[-i, ]  # all rows outside the fold
    testing_set  <- data[i, ]   # the held-out fold
    fit <- train(Class ~ ., data = training_set, method = method,
                 tuneGrid = grid, preProc = c("center", "scale"),
                 trControl = ctrl, metric = "Sens", tuneLength = tune)
    y_pred <- predict(fit, newdata = testing_set[-indx])
    cm <- table(testing_set[, indx], y_pred)  # rows = actual, cols = predicted
    accuracy    <- c(accuracy, (cm[1, 1] + cm[2, 2]) / sum(cm))
    sensitivity <- c(sensitivity, cm[1, 1] / (cm[1, 1] + cm[1, 2]))
  }
  if (is.numeric(tune)) {
    cat("Best tune:\n")
    print(fit$bestTune)
  }
  cat("For", method, "Model\n")
  cat("Accuracy :", mean(accuracy), "\n")
  cat("Sensitivity :", mean(sensitivity), "\n")
}

set.seed(123)
folds <- createFolds(data_imputed$Class, k = 10)
crossValidation(data_imputed, "glm", indx = 50, folds = folds)
```
Output:

```
For glm Model
Accuracy : 0.6463235
Sensitivity : 0.5131746
```
The K-nearest neighbors (KNN) algorithm uses feature similarity to predict the values of new data points: a new data point is assigned a value based on how closely it matches the points in the training set.
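A quick self-contained illustration of the idea, using the built-in iris data rather than the project's dataset:

```r
library(class)  # provides the classic knn() implementation
set.seed(123)

# Scale the features so distances are comparable, then hold out 30 rows
X   <- scale(iris[, 1:4])
idx <- sample(nrow(iris), 120)

# Each held-out point gets the majority class among its 9 nearest neighbours
pred <- knn(train = X[idx, ], test = X[-idx, ], cl = iris$Species[idx], k = 9)
mean(pred == iris$Species[-idx])  # hold-out accuracy
```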
```r
crossValidation(data_imputed, "knn", indx = 50, folds = folds, grid = expand.grid(k = 9))
```
Output:

```
For knn Model
Accuracy : 0.6830882
Sensitivity : 0.2606349
```
Naive Bayes is a probabilistic machine learning algorithm, based on Bayes' theorem with a naive independence assumption between features, that can be used in a wide variety of classification tasks.
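caret's "naive_bayes" method is backed by the naivebayes package; a minimal standalone sketch on the imputed data, mirroring the tuning grid used below:

```r
library(naivebayes)

# Gaussian Naive Bayes (usekernel = FALSE means Gaussian class densities)
nb <- naive_bayes(Class ~ ., data = data_imputed, laplace = 0, usekernel = FALSE)

# Per-class posterior probabilities for the first rows
head(predict(nb, newdata = data_imputed[-50], type = "prob"))
```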
```r
crossValidation(data_imputed, "naive_bayes", folds = folds, indx = 50,
                grid = expand.grid(laplace = 0, usekernel = FALSE, adjust = 1))
```
Output:

```
For naive_bayes Model
Accuracy : 0.6466912
Sensitivity : 0.1626984
```
Cross-validation (CV) is a set of methods for measuring the performance of a predictive model on new test data. It works by dividing the data into two sets: a training set used to fit the model and a test set used to evaluate it. CV is also known as a resampling method, because it involves fitting the same statistical method multiple times using different subsets of the data. During CV, model performance metrics are used to estimate the prediction error of the model.

Among the different cross-validation methods, K-fold CV is a robust method for estimating the accuracy of a model. Algorithm: randomly split the data into k folds; for each fold, train the model on the other k-1 folds and evaluate it on the held-out fold; then average the k performance estimates. The sketch below shows how createFolds() implements the splitting step.
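caret's createFolds() returns, by default, a list of held-out (test) row indices, one element per fold, which is why crossValidation() above trains on data[-i, ] and tests on data[i, ]:

```r
library(caret)
set.seed(123)

# Ten stratified folds over a toy two-class label vector
y <- factor(rep(c("X1", "X2"), each = 50))
f <- createFolds(y, k = 10)
str(f[1:2])  # each element holds the ~10 row indices held out for that fold
```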
Each fold's predictions are summarized in a confusion matrix (rows = actual class, columns = predicted class):

| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | TP | FN |
| Actual negative | FP | TN |

From it, Accuracy = (TP + TN) / (TP + TN + FP + FN) and Sensitivity = TP / (TP + FN), the two quantities averaged across folds in crossValidation() above.
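For a concrete check with hypothetical counts (TP = 20, FN = 5, FP = 8, TN = 15):

```r
# Hypothetical 2x2 confusion matrix (rows = actual, cols = predicted)
cm <- matrix(c(20,  5,
                8, 15),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("pos", "neg"),
                             predicted = c("pos", "neg")))

sum(diag(cm)) / sum(cm)  # accuracy    = (20 + 15) / 48 ~ 0.729
cm[1, 1] / sum(cm[1, ])  # sensitivity = 20 / (20 + 5)  = 0.8
```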
For the numerical features, we want to select the best subset of features to increase our accuracy.
```r
# Collect the column indices of the selected numerical features;
# names_num holds their column names
indx_num <- NULL
for (i in 1:9) {
  indx_num <- c(indx_num, which(colnames(data_imputed) == names_num[i]))
}
data_imputed[indx_num]
```
Output:
Class | ALP | Hemoglobin | Albumin | Dir_Bil | Total_Bil | Major_Dim | INR | AST |
---|---|---|---|---|---|---|---|---|
2 | 178 | 11.7 | 3.4 | 0.3 | 3.9 | 1.8 | 1.01 | 112 |
The chi-squared test is a statistical method for checking whether there is a significant association between two categorical variables from the same population. It compares the observed frequencies of each category with the frequencies expected if the variables were independent.

Null hypothesis (H0): the two variables are independent.

A hypothesis test allows a mathematical model to validate or reject the null hypothesis within a certain confidence level (alpha).

P-value:
Less than alpha (significance level): reject H0; the variables are associated, so the feature is kept.
Greater than alpha (significance level): fail to reject H0; the feature is dropped. A sketch of this selection step follows.
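A minimal sketch of that selection step, assuming data_imputed holds the categorical features; the 0.05 threshold is an assumed, conventional choice:

```r
alpha <- 0.05  # assumed significance level

# Test the association between the class label and one categorical feature
p_val <- chisq.test(table(data_imputed$Class, data_imputed$Ascites))$p.value

# Keep the feature only if the association is significant
if (p_val < alpha) cat("Ascites kept, p =", p_val, "\n")
```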
Output (selected categorical columns):
Class | PS | Symptoms | Ascites | Metastasis | PVT | Encephalopathy |
---|---|---|---|---|---|---|
X2 | 0 | 0 | 1 | 0 | 0 | 1 |
```r
set.seed(123)
folds <- createFolds(data_selected$Class, k = 10)
crossValidation(data_selected, "glm", indx = 1, folds = folds)
```
Output:

```
For glm Model
Accuracy : 0.7014706
Sensitivity : 0.5404762
```
Running the KNN model with the best k = 23:
```r
crossValidation(data_selected, "knn", indx = 1, folds = folds,
                grid = expand.grid(k = 23))
```
Output:

```
For knn Model
Accuracy : 0.6529412
Sensitivity : 0.1619048
```
Running the Naive Bayes model:
```r
crossValidation(data_selected, "naive_bayes", folds = folds, indx = 1,
                grid = expand.grid(laplace = 0, usekernel = FALSE, adjust = 1))
```
Output:

```
For naive_bayes Model
Accuracy : 0.6584559
Sensitivity : 0.1261905
```