Hepatocellular carcinoma (HCC) is the most common type of primary liver cancer. It occurs most often in people with chronic liver disease, such as cirrhosis caused by hepatitis B or hepatitis C infection.
In this R project we will study the survival of patients with hepatocellular carcinoma using three different algorithms: logistic regression, K-nearest neighbors (KNN), and Naive Bayes.
You will need to install the mice library:

```r
install.packages("mice")
```
MICE (Multivariate Imputation via Chained Equations) creates multiple imputations, which, compared to a single imputation (such as the mean), accounts for the uncertainty in missing values. MICE assumes that the missing data are Missing at Random (MAR), meaning that the probability that a value is missing depends only on the observed values and can therefore be predicted from them. It imputes the data on a variable-by-variable basis by specifying an imputation model per variable.
Make the necessary imports and get the features and labels from the dataset.
Use MICE to impute the missing values, as sketched below.
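A minimal imputation sketch, assuming the raw dataset has been loaded into a data frame called `data`; the number of imputations (m = 5) and the `pmm` method are illustrative defaults, not necessarily the project's exact settings:

```r
library(mice)

# Build 5 imputed datasets with predictive mean matching,
# then keep the first completed one
imp <- mice(data, m = 5, method = "pmm", seed = 123)
data_imputed <- complete(imp, 1)
```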
Training the data:
We load the class feature as a factor for prediction:
```r
# Drop unused columns and encode the class label as a factor
library(caret)
set.seed(123)
data_imputed[c(27:29)] <- NULL
data_imputed$Class <- make.names(data_imputed$Class)  # valid R names, e.g. 2 -> X2
data_imputed$Class <- as.factor(data_imputed$Class)
```
Regression: a statistical relation between two or more variables in which a change in the independent variable is associated with a change in the dependent variable.
Logistic regression is a method for fitting a regression curve y = f(x) when y is a categorical variable (0 or 1) [binomial logistic regression], used here because the variable to predict is binary [binary classification].
Linear regression: y is always a continuous variable, fitted with the lm() function; the predicted y can exceed the 0 to 1 range, so it is unsuited to modelling probabilities.
Logistic regression (binary dependent variable), fitted with the glm() function: it gives a probability score that reflects the probability of the occurrence of the event, and the predicted probability P always lies between 0 and 1.
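As a minimal standalone sketch (the project itself fits the model through caret's train() below), a binomial glm can be fitted directly on the imputed data:

```r
# Fit a binomial logistic regression; Class must be a two-level factor
fit <- glm(Class ~ ., data = data_imputed, family = binomial)

# type = "response" returns probabilities in (0, 1) rather than log-odds
probs <- predict(fit, type = "response")
head(probs)
```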
The project evaluates each model with a manual K-fold loop around caret's train():

```r
crossValidation <- function(data, method, folds, grid = NULL, indx, tune = NULL) {
  accuracy <- NULL
  sensitivity <- NULL
  # twoClassSummary + classProbs let caret compute the "Sens" metric used for tuning
  ctrl <- trainControl(summaryFunction = twoClassSummary, classProbs = TRUE)
  for (i in folds) {
    training_set <- data[-i, ]  # all rows outside the fold
    testing_set  <- data[i, ]   # the held-out fold
    fit <- train(Class ~ ., data = training_set, method = method,
                 tuneGrid = grid, preProc = c("center", "scale"),
                 trControl = ctrl, metric = "Sens", tuneLength = tune)
    y_pred <- predict(fit, newdata = testing_set[-indx])
    cm <- table(testing_set[, indx], y_pred)  # rows = actual, cols = predicted
    accuracy    <- c(accuracy, (cm[1, 1] + cm[2, 2]) / sum(cm))
    sensitivity <- c(sensitivity, cm[1, 1] / (cm[1, 1] + cm[1, 2]))
  }
  if (is.numeric(tune)) {
    cat("Best tune:\n")
    print(fit$bestTune)
  }
  cat("For", method, "Model\n")
  cat("Accuracy :", mean(accuracy), "\n")
  cat("Sensitivity :", mean(sensitivity), "\n")
}

set.seed(123)
folds <- createFolds(data_imputed$Class, k = 10)
crossValidation(data_imputed, "glm", indx = 50, folds = folds)
```
Output:

```
For glm Model
Accuracy : 0.6463235
Sensitivity : 0.5131746
```
The K-nearest neighbors (KNN) algorithm uses feature similarity to predict the values of new data points: a new data point is assigned a value based on how closely it matches the points in the training set.
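A quick self-contained illustration of the idea, using the built-in iris data rather than the project's dataset:

```r
library(class)  # provides the classic knn() implementation
set.seed(123)

# Scale the features so distances are comparable, then hold out 30 rows
X   <- scale(iris[, 1:4])
idx <- sample(nrow(iris), 120)

# Each held-out point gets the majority class among its 9 nearest neighbours
pred <- knn(train = X[idx, ], test = X[-idx, ], cl = iris$Species[idx], k = 9)
mean(pred == iris$Species[-idx])  # hold-out accuracy
```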
```r
crossValidation(data_imputed, "knn", indx = 50, folds = folds, grid = expand.grid(k = 9))
```
Output:

```
For knn Model
Accuracy : 0.6830882
Sensitivity : 0.2606349
```
Naive Bayes is a probabilistic machine learning algorithm, based on Bayes' theorem with a naive independence assumption between features, that can be used in a wide variety of classification tasks.
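caret's "naive_bayes" method is backed by the naivebayes package; a minimal standalone sketch on the imputed data, mirroring the tuning grid used below:

```r
library(naivebayes)

# Gaussian Naive Bayes (usekernel = FALSE means Gaussian class densities)
nb <- naive_bayes(Class ~ ., data = data_imputed, laplace = 0, usekernel = FALSE)

# Per-class posterior probabilities for the first rows
head(predict(nb, newdata = data_imputed[-50], type = "prob"))
```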
```r
crossValidation(data_imputed, "naive_bayes", folds = folds, indx = 50,
                grid = expand.grid(laplace = 0, usekernel = FALSE, adjust = 1))
```
Output:

```
For naive_bayes Model
Accuracy : 0.6466912
Sensitivity : 0.1626984
```
Cross-validation (CV) is a set of methods for measuring the performance of a predictive model on new test data. It works by dividing the data into two sets: a training set used to fit the model and a test set used to evaluate it. CV is also known as a resampling method, because it involves fitting the same statistical method multiple times using different subsets of the data. During CV, model performance metrics are used to estimate the prediction error of the model.

Among the different cross-validation methods, K-fold CV is a robust method for estimating the accuracy of a model. Algorithm: randomly split the data into k folds; for each fold, train the model on the other k-1 folds and evaluate it on the held-out fold; then average the k performance estimates. The sketch below shows how createFolds() implements the splitting step.
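caret's createFolds() returns, by default, a list of held-out (test) row indices, one element per fold, which is why crossValidation() above trains on data[-i, ] and tests on data[i, ]:

```r
library(caret)
set.seed(123)

# Ten stratified folds over a toy two-class label vector
y <- factor(rep(c("X1", "X2"), each = 50))
f <- createFolds(y, k = 10)
str(f[1:2])  # each element holds the ~10 row indices held out for that fold
```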
Each fold's predictions are summarized in a confusion matrix (rows = actual class, columns = predicted class):

| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | TP | FN |
| Actual negative | FP | TN |

From it, Accuracy = (TP + TN) / (TP + TN + FP + FN) and Sensitivity = TP / (TP + FN), the two quantities averaged across folds in crossValidation() above.
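For a concrete check with hypothetical counts (TP = 20, FN = 5, FP = 8, TN = 15):

```r
# Hypothetical 2x2 confusion matrix (rows = actual, cols = predicted)
cm <- matrix(c(20,  5,
                8, 15),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("pos", "neg"),
                             predicted = c("pos", "neg")))

sum(diag(cm)) / sum(cm)  # accuracy    = (20 + 15) / 48 ~ 0.729
cm[1, 1] / sum(cm[1, ])  # sensitivity = 20 / (20 + 5)  = 0.8
```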
For the numerical features, we want to select the best subset of features to increase our accuracy.
```r
# Collect the column indices of the selected numerical features;
# names_num holds their column names
indx_num <- NULL
for (i in 1:9) {
  indx_num <- c(indx_num, which(colnames(data_imputed) == names_num[i]))
}
data_imputed[indx_num]
```
Output:
Class | ALP | Hemoglobin | Albumin | Dir_Bil | Total_Bil | Major_Dim | INR | AST |
---|---|---|---|---|---|---|---|---|
2 | 178 | 11.7 | 3.4 | 0.3 | 3.9 | 1.8 | 1.01 | 112 |
The chi-squared test is a statistical method for checking whether there is a significant association between two categorical variables from the same population. It compares the observed frequencies of each category with the frequencies expected if the variables were independent.

Null hypothesis (H0): the two variables are independent.

A hypothesis test allows a mathematical model to validate or reject the null hypothesis within a certain confidence level (alpha).

P-value:
Less than alpha (significance level): reject H0; the variables are associated, so the feature is kept.
Greater than alpha (significance level): fail to reject H0; the feature is dropped. A sketch of this selection step follows.
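A minimal sketch of that selection step, assuming data_imputed holds the categorical features; the 0.05 threshold is an assumed, conventional choice:

```r
alpha <- 0.05  # assumed significance level

# Test the association between the class label and one categorical feature
p_val <- chisq.test(table(data_imputed$Class, data_imputed$Ascites))$p.value

# Keep the feature only if the association is significant
if (p_val < alpha) cat("Ascites kept, p =", p_val, "\n")
```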
Output (selected categorical columns):
Class | PS | Symptoms | Ascites | Metastasis | PVT | Encephalopathy |
---|---|---|---|---|---|---|
X2 | 0 | 0 | 1 | 0 | 0 | 1 |
```r
set.seed(123)
folds <- createFolds(data_selected$Class, k = 10)
crossValidation(data_selected, "glm", indx = 1, folds = folds)
```
Output:

```
For glm Model
Accuracy : 0.7014706
Sensitivity : 0.5404762
```
Running the KNN model with the best k = 23:
```r
crossValidation(data_selected, "knn", indx = 1, folds = folds,
                grid = expand.grid(k = 23))
```
Output:

```
For knn Model
Accuracy : 0.6529412
Sensitivity : 0.1619048
```
Running the Naive Bayes model:
```r
crossValidation(data_selected, "naive_bayes", folds = folds, indx = 1,
                grid = expand.grid(laplace = 0, usekernel = FALSE, adjust = 1))
```
Output:

```
For naive_bayes Model
Accuracy : 0.6584559
Sensitivity : 0.1261905
```