This report documents the process of developing a predictive model to classify the manner in which subjects perform weight lifting exercises. Using accelerometer and gyroscope data from wearable devices, we trained a Random Forest classifier on a labeled dataset and generated predictions for a 20-case test set. The main objectives were to ensure data quality, select relevant features, validate model performance, and produce accurate predictions for submission.
# Load the required packages
library(dplyr)
library(caret)
library(randomForest)

setwd("C:/Users/Compumax/Desktop/ProyectoML")
training <- read.csv("C:/Users/Compumax/Desktop/pml-training.csv", stringsAsFactors = FALSE)
testing  <- read.csv("C:/Users/Compumax/Desktop/pml-testing.csv", stringsAsFactors = FALSE)
# Drop columns that are mostly NA, plus identifier and timestamp columns
training <- training %>%
  select_if(~ mean(is.na(.)) < 0.95) %>%
  select(-c(X, user_name, raw_timestamp_part_1, raw_timestamp_part_2,
            cvtd_timestamp, new_window, num_window))
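As an optional sanity check (not part of the original pipeline), caret's `nearZeroVar` can confirm that no degenerate predictors survive the NA filter:

```r
# Hypothetical check: flag predictors with near-zero variance,
# which carry little information and can slow the forest down.
nzv <- nearZeroVar(training, saveMetrics = TRUE)

# Names of any columns still flagged after the NA filter
rownames(nzv)[nzv$nzv]
```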
# Split the labeled data into a 70/30 training/validation partition
set.seed(123)
inTrain  <- createDataPartition(training$classe, p = 0.7, list = FALSE)
trainSet <- training[inTrain, ]
testSet  <- training[-inTrain, ]

trainSet$classe <- as.factor(trainSet$classe)
testSet$classe  <- factor(testSet$classe, levels = levels(trainSet$classe))
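To confirm that the 70/30 split preserved the class distribution, the class proportions in the two partitions can be compared directly (a quick check, not in the original code). `createDataPartition` stratifies on `classe`, so the proportions should be nearly identical:

```r
# Compare class proportions across the two partitions;
# stratified sampling should keep these very close.
round(prop.table(table(trainSet$classe)), 3)
round(prop.table(table(testSet$classe)), 3)
```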
# Train the Random Forest Model with Shared Clean Variables
common_vars <- intersect(names(trainSet), names(testing))
common_vars <- setdiff(common_vars, "classe")
train_reduced <- trainSet[, c(common_vars, "classe")]
# Coerce predictors to numeric, suppressing coercion warnings
train_reduced[, common_vars] <- suppressWarnings(
  lapply(train_reduced[, common_vars], function(x) as.numeric(as.character(x)))
)
# Remove columns with missing values
non_na_cols <- colSums(is.na(train_reduced)) == 0
train_reduced <- train_reduced[, non_na_cols]
# Train final model
final_model <- randomForest(classe ~ ., data = train_reduced, ntree = 100)
# Plot variable importance (optional but insightful)
varImpPlot(final_model)
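Beyond the plot, the raw importance scores can be inspected numerically. This short sketch (using the `final_model` object above) lists the ten predictors with the highest mean decrease in Gini impurity:

```r
# Rank predictors by mean decrease in Gini impurity
imp <- importance(final_model)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)
```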
# Use only the variables included in the final model
final_vars <- setdiff(names(train_reduced), "classe")
test_reduced <- testSet[, final_vars]
# Coerce the validation predictors to numeric, suppressing coercion warnings
test_reduced <- data.frame(
  suppressWarnings(
    lapply(test_reduced, function(x) as.numeric(as.character(x)))
  )
)
# Make predictions and evaluate the model
pred_eval <- predict(final_model, newdata = test_reduced)
confusionMatrix(pred_eval, testSet$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 4 0 0 0
## B 0 1131 3 0 0
## C 0 4 1023 9 6
## D 0 0 0 955 4
## E 0 0 0 0 1072
##
## Overall Statistics
##
## Accuracy : 0.9949
## 95% CI : (0.9927, 0.9966)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9936
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9930 0.9971 0.9907 0.9908
## Specificity 0.9991 0.9994 0.9961 0.9992 1.0000
## Pos Pred Value 0.9976 0.9974 0.9818 0.9958 1.0000
## Neg Pred Value 1.0000 0.9983 0.9994 0.9982 0.9979
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1922 0.1738 0.1623 0.1822
## Detection Prevalence 0.2851 0.1927 0.1771 0.1630 0.1822
## Balanced Accuracy 0.9995 0.9962 0.9966 0.9949 0.9954
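The expected out-of-sample error follows directly from the validation accuracy above; a small sketch using the stored `confusionMatrix` result:

```r
# Estimated out-of-sample error = 1 - validation accuracy
cm <- confusionMatrix(pred_eval, testSet$classe)
oos_error <- 1 - cm$overall["Accuracy"]
round(oos_error, 4)  # roughly 0.0051, given the 99.49% accuracy reported above
```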
testing_final <- testing[, final_vars]
testing_final <- data.frame(lapply(testing_final, function(x) as.numeric(as.character(x))))
final_predictions <- predict(final_model, newdata = testing_final)
final_predictions
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
# Explanation: These are the 20 predictions required for submission.
# Copy them in the format the course's prediction quiz expects.
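If individual answer files are needed for submission, a helper along the following lines can write one text file per prediction. The function name `pml_write_files` and the `problem_id_` filename pattern are assumptions, not part of the code above:

```r
# Hypothetical helper: write one plain-text file per prediction
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}

pml_write_files(as.character(final_predictions))
```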
The final Random Forest model achieved 99.49% accuracy on the held-out validation set, with per-class balanced accuracy near or above 99.5%, implying an estimated out-of-sample error of roughly 0.51%. The workflow combined careful data cleaning, validation on a held-out partition, and alignment of the predictor set with the final test cases. The model was then used to predict the 20 cases provided, completing the practical machine learning assignment.