ISYE 6501 Course homework assignment one solution
Describe a situation or problem from your job, everyday life, current events, etc., for which a classification model would be appropriate. List some (up to 5) predictors that you might use.
Answer
Since my friends are fairly into reading, I sometimes have to pick a book as a gift for their birthdays. To narrow down whether a friend of mine will like a book, I would check Goodreads for some of the following information, which would make good predictors:
The files credit_card_data.txt (without headers) and credit_card_data-headers.txt (with headers) contain a dataset with 654 data points, 6 continuous and 4 binary predictor variables. It has anonymized credit card applications with a binary response variable (last column) indicating if the application was positive or negative. The dataset is the “Credit Approval Data Set” from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Credit+Approval) without the categorical variables and without data points that have missing values.
df <- read.table("C:/Users/tungh/OneDrive/Georgia Tech/ISYE6501/Module 2/hw1/data 2.2/credit_card_data-headers.txt",
                 header = TRUE)
head(df)
dim(df)
We first run the starter code from the homework; with the default parameter C = 100 we get close to 86.4% accuracy.
We will also print the coefficients a1 through am and the intercept a0.
# install.packages('kernlab')
library("kernlab")
data <- as.matrix(df)
# call ksvm. vanilladot is a simple linear kernel.
model <- ksvm(as.matrix(data[, 1:10]), data[, 11],
              type = "C-svc", kernel = "vanilladot",
              C = 100, scaled = TRUE)
# calculate a1...am
a <- colSums(model@xmatrix[[1]] * model@coef[[1]])
a
# calculate a0
a0 <- -model@b
a0
# see what the model predicts
pred <- predict(model, data[, 1:10])
pred
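The 86.4% figure quoted earlier can be reproduced by comparing these predictions against the response column. A minimal sketch, continuing from the `pred` and `data` objects defined above:

```r
# Fraction of the 654 applications the linear SVM classifies correctly
accuracy <- sum(pred == data[, 11]) / nrow(data)
accuracy
```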
Above you can see that with a polynomial kernel of degree 2 we get higher accuracy. We can write a function to tune C as well.
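The degree-2 polynomial model referred to above is not shown in the extracted text; a sketch consistent with the call used inside the tuning function below, assuming the same `data` matrix and C = 100:

```r
library(kernlab)

# Degree-2 polynomial kernel; kpar passes the degree to polydot
model.poly <- ksvm(as.matrix(data[, 1:10]), data[, 11],
                   type = "C-svc", kernel = "polydot",
                   kpar = list(degree = 2),
                   C = 100, scaled = TRUE)
pred.poly <- predict(model.poly, data[, 1:10])
sum(pred.poly == data[, 11]) / nrow(data)
</imports>
```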
library(kernlab)

# Define the function to evaluate models with varying C values
evaluate_svm <- function(data, C_values = 10^seq(-3, 3, by = 1)) {
  best_accuracy <- 0
  best_C <- NA

  for (C_value in C_values) {
    # Train the SVM model with the current C value
    model <- ksvm(as.matrix(data[, 1:10]), data[, 11],
                  type = "C-svc", kernel = "polydot",
                  kpar = list(degree = 2),
                  C = C_value, scaled = TRUE)

    # Make predictions
    pred <- predict(model, data[, 1:10])

    # Calculate accuracy
    accuracy <- sum(pred == data[, 11]) / nrow(data)

    # Print the accuracy for the current C value
    cat("C:", C_value, "Accuracy:", sprintf("%.2f%%", accuracy * 100), "\n")

    # Check if this is the best accuracy so far
    if (accuracy > best_accuracy) {
      best_accuracy <- accuracy
      best_C <- C_value
    }
  }

  # Return the best C value and corresponding accuracy
  cat("\nThe best C value:", best_C,
      "\nBest Accuracy:", sprintf("%.2f%%", best_accuracy * 100), "\n")
  return(list(best_C = best_C, best_accuracy = best_accuracy))
}

# Example usage
result <- evaluate_svm(data)
Next we try the radial basis function kernel, which is what the class reading discusses (https://pyml.sourceforge.net/doc/howto.pdf).
model.4 <- ksvm(as.matrix(data[, 1:10]), data[, 11],
                type = "C-svc", kernel = "rbfdot",
                C = 100, scaled = TRUE)
pred <- predict(model.4, data[, 1:10])
sum(pred == data[, 11]) / nrow(data)
It performs exceptionally well. I wonder whether it overfits when the data are unbalanced (many examples of one class but few of the others).
pred
Looking at the results, it doesn't simply predict all 1's or all 0's, so there is no obvious evidence of overfitting on unbalanced data. Without separate test and validation sets, however, we can't determine this conclusively.
Based on this, I conclude that the Gaussian (RBF) kernel performs best, followed by the degree-2 polynomial kernel with C = 100 as the second-best model.
# install.packages('kknn')
library("kknn")
Using Euclidean distance, the optimal value of k appears to be around 13, with about 89% accuracy on the validation set. Note that we split the data 80%/20% into training and validation sets. Below we print the predictions for the validation data; they look fairly close to the predictions made by the SVM model.
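The `evaluate_knn` helper called below is not included in the extracted text. A minimal sketch consistent with the call and with the 80/20 split described above, assuming the response column is named `R1` (as in credit_card_data-headers.txt) and that the seed argument is a hypothetical addition for reproducibility:

```r
library(kknn)

# Sketch of evaluate_knn: fit k-nearest neighbors on a random 80% training
# split and report accuracy on the remaining 20% validation split.
evaluate_knn <- function(df, k, print_pred = FALSE, seed = 1) {
  set.seed(seed)
  idx <- sample(nrow(df), size = floor(0.8 * nrow(df)))
  train <- df[idx, ]
  valid <- df[-idx, ]

  # distance = 2 gives the Euclidean (Minkowski order-2) metric
  fit <- kknn(R1 ~ ., train = train, test = valid, k = k, distance = 2)

  # Round the fitted values to 0/1 class labels
  pred <- round(fitted(fit))
  if (print_pred) print(pred)

  sum(pred == valid$R1) / nrow(valid)
}
```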
evaluate_knn (df, k = 13, print_pred = TRUE)
# table(df$R1)