Introduction

This report documents the process of developing a predictive model to classify the manner in which subjects perform weight lifting exercises. Using accelerometer and gyroscope data from wearable devices, we trained a Random Forest classifier on a labeled dataset and generated predictions for a 20-case test set. The main objectives were to ensure data quality, select relevant features, validate model performance, and produce accurate predictions for submission.

1. Load Required Packages

library(dplyr)         # %>% pipe, select_if(), select()
library(caret)         # createDataPartition(), confusionMatrix()
library(randomForest)  # randomForest(), varImpPlot()

Set the working directory

setwd("C:/Users/Compumax/Desktop/ProyectoML")

2. Load the Data

training <- read.csv("C:/Users/Compumax/Desktop/pml-training.csv", stringsAsFactors = FALSE)
testing  <- read.csv("C:/Users/Compumax/Desktop/pml-testing.csv", stringsAsFactors = FALSE)

3. Clean the training set

training <- training %>%
  select_if(~ mean(is.na(.)) < 0.95) %>%
  select(-c(X, user_name, raw_timestamp_part_1, raw_timestamp_part_2,
            cvtd_timestamp, new_window, num_window))

4. Split into training and internal validation sets

set.seed(123)
inTrain <- createDataPartition(training$classe, p = 0.7, list = FALSE)
trainSet <- training[inTrain, ]
testSet  <- training[-inTrain, ]
trainSet$classe <- as.factor(trainSet$classe)
testSet$classe  <- factor(testSet$classe, levels = levels(trainSet$classe))

5. Train the model using only the variables shared with the testing set and free of NAs

# Train the Random Forest Model with Shared Clean Variables
common_vars <- intersect(names(trainSet), names(testing))
common_vars <- setdiff(common_vars, "classe")
train_reduced <- trainSet[, c(common_vars, "classe")]

# Convert to numeric without printing warnings about coercion
train_reduced[, common_vars] <- suppressWarnings(
  lapply(train_reduced[, common_vars], function(x) as.numeric(as.character(x)))
)

# Remove columns with missing values
non_na_cols <- colSums(is.na(train_reduced)) == 0
train_reduced <- train_reduced[, non_na_cols]

# Train final model
final_model <- randomForest(classe ~ ., data = train_reduced, ntree = 100)

# Plot variable importance (optional but insightful)
varImpPlot(final_model)
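Beyond the plot, the numeric importance scores can be inspected directly. A short sketch, assuming `final_model` is the Random Forest fitted above (`importance()` is the standard randomForest accessor; for a classification forest trained without `importance = TRUE` it returns a matrix with a `MeanDecreaseGini` column):

```r
# Top 10 predictors ranked by mean decrease in Gini impurity
imp <- importance(final_model)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)
```

This gives the same ranking as varImpPlot(), but in a form that can be reused, e.g. to refit a smaller model on only the top predictors.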

6. Evaluate the model with the validation subset

# Use only the variables included in the final model
final_vars <- setdiff(names(train_reduced), "classe")
test_reduced <- testSet[, final_vars]

# Convert to numeric format without displaying warnings
test_reduced <- data.frame(
  suppressWarnings(
    lapply(test_reduced, function(x) as.numeric(as.character(x)))
  )
)

# Make predictions and evaluate the model
pred_eval <- predict(final_model, newdata = test_reduced)
confusionMatrix(pred_eval, testSet$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    4    0    0    0
##          B    0 1131    3    0    0
##          C    0    4 1023    9    6
##          D    0    0    0  955    4
##          E    0    0    0    0 1072
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9949          
##                  95% CI : (0.9927, 0.9966)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9936          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9930   0.9971   0.9907   0.9908
## Specificity            0.9991   0.9994   0.9961   0.9992   1.0000
## Pos Pred Value         0.9976   0.9974   0.9818   0.9958   1.0000
## Neg Pred Value         1.0000   0.9983   0.9994   0.9982   0.9979
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1922   0.1738   0.1623   0.1822
## Detection Prevalence   0.2851   0.1927   0.1771   0.1630   0.1822
## Balanced Accuracy      0.9995   0.9962   0.9966   0.9949   0.9954
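As a sanity check, the overall accuracy can be reconstructed directly from the counts printed in the confusion matrix above, and the expected out-of-sample error is simply its complement:

```r
# Diagonal of the confusion matrix: correctly classified validation cases
correct <- 1674 + 1131 + 1023 + 955 + 1072
# Off-diagonal entries: the misclassified cases
errors  <- 4 + 3 + 4 + 9 + 6 + 4
total   <- correct + errors

accuracy  <- correct / total   # ~0.9949, matching the reported accuracy
oos_error <- 1 - accuracy      # expected out-of-sample error, ~0.0051
round(c(accuracy = accuracy, oos_error = oos_error), 4)
```

So the model's estimated out-of-sample error on held-out data is about 0.51%.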

7. Prepare the testing set with variables used by the model

# Keep only the model's variables and convert to numeric, suppressing coercion warnings
testing_final <- testing[, final_vars]
testing_final <- data.frame(
  suppressWarnings(
    lapply(testing_final, function(x) as.numeric(as.character(x)))
  )
)

8. Predict the 20 cases from the testing set

final_predictions <- predict(final_model, newdata = testing_final)
final_predictions
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

Explanation: These are the 20 predictions required for submission. Copy them in the format your course’s prediction quiz expects.
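If the quiz accepts file uploads rather than pasted answers, one common approach is to write each prediction to its own text file. A minimal sketch, assuming `final_predictions` from the step above (the `problem_id_` file-name pattern here is illustrative, not a required format):

```r
# Write each prediction to its own file: problem_id_1.txt, problem_id_2.txt, ...
save_predictions <- function(preds) {
  for (i in seq_along(preds)) {
    writeLines(as.character(preds[i]), con = paste0("problem_id_", i, ".txt"))
  }
}
save_predictions(final_predictions)
```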

Conclusion

The final Random Forest model achieved an internal validation accuracy of 99.49% (an estimated out-of-sample error of about 0.51%), with balanced accuracy near or above 99.5% for every class. The modeling process included careful data cleaning, an internal validation split, and alignment of the feature set with the final test set. The model was then used to predict the outcomes for the 20 cases provided, completing the practical machine learning assignment.