In this vignette, we will implement logistic regression from scratch.
Problem
Logistic regression models the probability that an observation belongs to a particular class. Given a feature matrix \(X\) and a weight vector \(\beta\), the predicted probability is:
\[P(y = 1 | X) = \sigma(X \beta + \alpha)\]
where \(\sigma\) is the logistic function \(\sigma(z) = \frac{1}{1 + e^{-z}}\) and \(\alpha\) is the bias term.
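As a quick numerical illustration in plain base R (not part of the {anvil} model itself), the logistic function maps any real number into the interval (0, 1):
sigmoid <- function(z) 1 / (1 + exp(-z))
sigmoid(c(-2, 0, 2))
## [1] 0.1192029 0.5000000 0.8807971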
In practice, logistic regression is typically fitted using Iteratively Reweighted Least Squares (IRLS), which is a form of Newton’s method. Here, we will use a simple gradient descent algorithm for didactic purposes.
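Concretely, each gradient descent step moves the parameters a small distance against the gradient of the loss, scaled by a learning rate \(\eta\):
\[\beta^{(t+1)} = \beta^{(t)} - \eta \, \nabla_\beta \mathcal{L}, \qquad \alpha^{(t+1)} = \alpha^{(t)} - \eta \, \nabla_\alpha \mathcal{L}\]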
Data
We’ll use the classic Titanic dataset from base R, which contains
survival data for the most famous shipwreck in history. The
Titanic object is a contingency table, so we first expand
it into individual observations.
library(anvil)
set.seed(42)
# Expand contingency table to individual observations
titanic_df <- as.data.frame(Titanic)
titanic <- titanic_df[rep(seq_len(nrow(titanic_df)), titanic_df$Freq), 1:4]
rownames(titanic) <- NULL
# Create design matrix with dummy variables
X <- model.matrix(~ Class + Sex + Age, data = titanic)[, -1]
y <- as.integer(titanic$Survived == "Yes")
# Standardize features for better gradient descent convergence
X <- scale(X)
n <- nrow(X)
p <- ncol(X)
The dataset contains 2201 observations; after dummy encoding, the design
matrix has 5 predictor columns. Our goal is to predict survival
(Survived) based on three categorical features:
- Class: Passenger class (1st, 2nd, 3rd, or Crew) - encoded as dummy variables with 1st class as the reference category
- Sex: Male or Female - encoded with Male as the reference category
- Age: Child or Adult - encoded with Child as the reference category
summary(titanic)
##    Class         Sex          Age          Survived
##  1st :325   Male  :1731   Child: 109   No :1490
##  2nd :285   Female: 470   Adult:2092   Yes: 711
##  3rd :706
##  Crew:885
The overall survival rate was 32.30%.
Now we convert the data to AnvilTensors.
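A minimal sketch of this conversion (assuming nv_tensor() accepts matrix data with the same dtype and shape arguments used for the initial parameters below):
# Sketch: argument usage mirrors the nv_tensor() call for beta_init below
X_tensor <- nv_tensor(X, dtype = "f32", shape = c(n, p))
y_tensor <- nv_tensor(y, dtype = "f32", shape = c(n, 1L))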
Model
The logistic regression model consists of computing the linear combination of features and then applying the logistic function.
predict_proba <- function(X, beta, alpha) {
  logits <- X %*% beta + alpha
  nv_logistic(logits)
}
For binary classification, we use the binary cross-entropy loss:
\[\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]\]
We need to be careful with numerical stability when computing the logarithm of probabilities close to 0 or 1. We’ll add a small epsilon to avoid taking the log of exactly 0.
binary_cross_entropy <- function(y_true, y_pred) {
  eps <- 1e-7
  y_pred_clipped <- nv_clamp(eps, y_pred, 1 - eps)
  loss <- -(y_true * log(y_pred_clipped) + (1 - y_true) * log(1 - y_pred_clipped))
  mean(loss)
}
We combine the prediction and loss computation into a single function.
model_loss <- function(X, y, beta, alpha) {
  y_pred <- predict_proba(X, beta, alpha)
  binary_cross_entropy(y, y_pred)
}
Using {anvil}’s automatic differentiation, we can obtain the gradients of the loss with respect to the model parameters.
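For reference, with \(p_i = \sigma(x_i^\top \beta + \alpha)\), the gradients computed by automatic differentiation have the well-known closed form:
\[\nabla_\beta \mathcal{L} = \frac{1}{n} X^\top (p - y), \qquad \frac{\partial \mathcal{L}}{\partial \alpha} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)\]
In the training loop below, model_loss_grad() denotes this gradient function obtained from model_loss().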
Training
We will implement the training loop using nv_while().
This keeps the entire training loop within a single compiled function,
which is more efficient than repeatedly calling a JIT-compiled function
from R, especially for small models.
fit_logreg <- jit(function(X, y, beta, alpha, n_epochs, lr) {
  output <- nv_while(
    # Loop state: the parameters plus an epoch counter
    list(beta = beta, alpha = alpha, epoch = nv_scalar(0L)),
    # Condition: keep iterating until n_epochs is reached
    \(beta, alpha, epoch) epoch < n_epochs,
    # Body: one gradient-descent update of beta and alpha
    \(beta, alpha, epoch) {
      grads <- model_loss_grad(X, y, beta, alpha)
      list(
        beta = beta - lr * grads$beta,
        alpha = alpha - lr * grads$alpha,
        epoch = epoch + 1L
      )
    }
  )
  list(beta = output$beta, alpha = output$alpha)
})
We initialize the parameters and train the model with a single function call.
beta_init <- nv_tensor(rnorm(p), dtype = "f32", shape = c(p, 1L))
alpha_init <- nv_scalar(0, dtype = "f32")
result <- fit_logreg(
  X_tensor, y_tensor,
  beta_init, alpha_init,
  nv_scalar(50000L),
  nv_scalar(0.1)
)
result
## $beta
## AnvilTensor
## -0.3419
## -0.8300
## -0.4206
## 0.9920
## -0.2304
## [ CPUf32{5,1} ]
##
## $alpha
## AnvilTensor
## -0.8538
## [ CPUf32{} ]
Let’s verify our implementation by comparing with R’s built-in
glm().
glm_fit <- glm(y ~ X, family = binomial)
Now let’s compare the coefficients:
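One way such a comparison table can be assembled (a sketch: as.numeric() is assumed here to convert the AnvilTensors back to base R vectors, and the leading "X" is stripped from the glm() coefficient names):
# Sketch: as.numeric() is assumed to extract values from the AnvilTensors
data.frame(
  Parameter = sub("^X", "", names(coef(glm_fit))),
  anvil = c(as.numeric(result$alpha), as.numeric(result$beta)),
  glm = unname(coef(glm_fit))
)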
## Parameter anvil glm
## 1 (Intercept) -0.8538059 -0.8538077
## 2 Class2nd -0.3418888 -0.3418906
## 3 Class3rd -0.8299901 -0.8299946
## 4 ClassCrew -0.4206346 -0.4206313
## 5 SexFemale 0.9919737 0.9919786
## 6 AgeAdult -0.2303528 -0.2303616
The estimates are very close, confirming that our gradient descent
implementation converges to the same solution as glm().