
Calculating the Hessian of the Logistic Log Likelihood

I may be the only person who feels this way, but it’s awfully easy to read a paper or a book, see some equations, think about them a bit, then sort of nod your head and think you understand them. However, when you actually go to implement them, you look back and the jump from the symbols on the page to code that runs on a computer is a little bigger than you thought. So this is mostly me thinking aloud: I was reading about optimization methods that rely on the Hessian, and I wrote this out to make sure I understood it well enough to calculate it if I want.

I picked some random training data. First, we set up the design matrix x, the dependent variable y, and the theta (or beta) at which we will evaluate the Hessian:

$$ x = \left[ \begin{array}{cc} 2 & 3 \\ 4 & 7 \\ 5 & 6 \end{array} \right]; \quad y = \left[ \begin{array}{c} 0 \\ 1 \\ 1 \end{array} \right]; \quad \theta = \left[ \begin{array}{c} 1 \\ 2 \end{array} \right] $$
y <- matrix(nrow=3, c(0, 1, 1))                              # dependent variable, one 0/1 outcome per training example
x <- matrix(nrow=3, ncol=2, byrow=T, c(2, 3, 4, 7, 5, 6))    # design matrix, one row per training example
theta <- matrix(nrow=2, c(1, 2))                             # point at which we evaluate the Hessian

Typically you’d put a column of ones on the left of the design matrix as an intercept term, but I didn’t set my problem up that way. The Hessian is the n-by-n matrix of second derivatives of a scalar-valued function. In our case there are two parameters, i.e. two explanatory variables, so

$$ H[l] = \left[ \begin{array}{cc} \frac{\partial^2 l}{\partial \theta_1^2} & \frac{\partial^2 l}{\partial \theta_1\,\partial \theta_2} \\ \frac{\partial^2 l}{\partial \theta_2\,\partial \theta_1} & \frac{\partial^2 l}{\partial \theta_2^2} \end{array} \right] $$

Note that we denote the ith of m training examples as

$$ (x^{(i)}, y^{(i)}), \quad i = 1, \ldots, m $$

The superscript in parentheses is an index over training examples, not exponentiation. Here each x is a column vector and each y is either 0 or 1.
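
For example, with the data above, the second training example is

$$ x^{(2)} = \left[ \begin{array}{c} 4 \\ 7 \end{array} \right], \quad y^{(2)} = 1 $$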

We also need some functions. R supports closures, so you don’t have to pass x and y around.

$$ g(\theta; x) = \frac{ 1 }{ 1 + \exp( - x^{ \mathrm{T} }\theta ) } $$

$$ l(\theta; x, y) = \sum_{i=1}^{m} y^{(i)} \log\left( g(x^{(i)}) \right) + \left( 1 - y^{(i)} \right) \log\left( 1 - g(x^{(i)}) \right) $$
g <- function(x, theta) 1 / (1 + exp(-1 * x %*% theta))   # the logistic function

logistic_loglik <- function(theta){
  # closes over the x and y defined above
  sum(log(g(x, theta)) * y) + sum((1 - y) * log(1 - g(x, theta)))
}
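
Before handing things to the computer, it helps to write out what differentiating l twice actually gives; the standard result (with x_j denoting the jth component of a training example) is what the hand calculation below implements:

$$ \frac{\partial l}{\partial \theta_j} = \sum_{i=1}^{m} \left( y^{(i)} - g(x^{(i)}) \right) x_j^{(i)} $$

$$ \frac{\partial^2 l}{\partial \theta_j \, \partial \theta_k} = - \sum_{i=1}^{m} x_j^{(i)} \, x_k^{(i)} \, g(x^{(i)}) \left( 1 - g(x^{(i)}) \right) $$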

Finally, we can use the numDeriv package to calculate the Hessian and compare with a hand calculation:

require(numDeriv)

H <- hessian(logistic_loglik, theta)

#
# hand calculate the Hessian at theta just to make sure I understand
#
m <- nrow(x)   # i.e., the number of training examples
H_hand <- matrix(nrow=nrow(theta), ncol=nrow(theta))
for (row in 1:nrow(H_hand)){
  for (col in 1:ncol(H_hand)){
    H_hand[row, col] <- 0
    for (j in 1:m){
      h <- g(x[j, ], theta)
      H_hand[row, col] <- H_hand[row, col] + x[j, row] * x[j, col] * h * (1 - h)
    }
  }
}
H_hand <- H_hand * -1

# error
err <- norm(H - H_hand)
print(sprintf('error: norm = %f', err))
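
The triple loop above is just the matrix identity

$$ H = - X^{\mathrm{T}} W X, \quad W = \mathrm{diag}\left( g(x^{(i)}) \left( 1 - g(x^{(i)}) \right) \right) $$

written out element by element. A minimal vectorized sketch in R (the names h and H_vec are mine, not part of the original script):

h <- as.vector(g(x, theta))                  # fitted probabilities, one per training example
H_vec <- -(t(x) %*% (x * (h * (1 - h))))     # scale each row of x by h*(1-h), then form -X'WX
print(norm(H - H_vec))                       # should match the loop version up to numerical error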

You can clearly see why any optimization algorithm requiring the Hessian will be slow: the hand calculation iterates over every training example once for each pair of explanatory variables, so the work grows with m times n squared.

Also, MathJax is an awesome and painless way to put LaTeX on your blog.