Regression Questions: Logistic Regression Probabilities

Assume we have a logistic regression with linear predictor $\beta_0 + \beta_1 x$, and for value $x_0$ we predict success probability $p(x_0)$. Which of the following is correct?

a) $p(x_0 + 1) = \beta_1 + p(x_0)$

b) $\frac{ p(x_0 + 1) }{ 1 - p(x_0 + 1) } = \exp(\beta_1) \frac{ p(x_0) }{ 1 - p(x_0) }$

c) $p(x_0 + 1) = \Phi(\beta_0 + \beta_1 (x_0 + 1))$
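For reference, the logistic model makes the log-odds linear in $x$:

$$\frac{p(x)}{1 - p(x)} = \exp(\beta_0 + \beta_1 x)$$

so that

$$\frac{p(x_0 + 1)}{1 - p(x_0 + 1)} = \exp(\beta_1) \exp(\beta_0 + \beta_1 x_0) = \exp(\beta_1) \frac{p(x_0)}{1 - p(x_0)},$$

i.e., a one-unit increase in $x$ multiplies the odds of success by $\exp(\beta_1)$.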

Assume we run a logistic regression on the 1-dimensional data below. What happens?

a) $-\infty < \beta_0 < \infty$; $\beta_1 \rightarrow \infty$

b) $\beta_0 = 0$, $\beta_1 = 0$

c) $\beta_0 = 0$; $\beta_1 \rightarrow -\infty$

d) none of the above

Regression Questions: A Coin Teaser

This is a simple-looking question that tests whether you really understand regression, particularly the ceteris paribus interpretation of multiple regression coefficients.

• let $Y$ be the total value of change in your pocket;
• let $X_1$ be the total number of coins;
• let $X_2$ be the total number of quarters.

Now regress $Y$ on $X_1$ alone, or $Y$ on $X_2$ alone. In either simple regression, the slope ($\beta_1$ or $\beta_2$, respectively) would be positive.

If you instead regress $Y$ on both $X_1$ and $X_2$ together, what are the signs of $\beta_1$ and $\beta_2$?

Consider holding $X_1$ constant: for a fixed number of coins, if $X_2$ increases then $Y$ surely increases, so $\beta_2$ is positive.

Consider holding $X_2$ constant: for a fixed number of quarters, the additional coins are all low-value coins, so each extra coin adds very little; in a given sample it is entirely possible for the estimated $\beta_1$ to come out negative.

Interview Questions in R

Previously, I wrote about a common interview question: given an array of words, output them in decreasing frequency order. I provided solutions in Java, Java 8, and Python.

Here’s the reason I love R: this can be accomplished in 3 lines of code.

produces

Java 8 Improvements

Java 8 has a bunch of nice improvements, and over the holidays I’ve had time to play with them a bit.

First, say goodbye to requiring Apache Commons for really simple functionality, like joining strings!
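The snippet isn't reproduced in this chunk; the feature being celebrated is `String.join`, shown here with made-up strings:

```java
import java.util.Arrays;

public class JoinDemo {
    public static void main(String[] args) {
        // Java 8's String.join replaces StringUtils.join from Apache Commons.
        String joined = String.join(", ", Arrays.asList("foo", "bar", "baz"));
        System.out.println(joined); // prints: foo, bar, baz
    }
}
```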

Java 8 also massively cleans up some common operations. A common interview question: given an array or list of words, print them in descending order by count, or return the top $n$ sorted by count descending. A standard program might go like this: create a map from string to count; invert the map to go from count to the words with that count; then descend to the correct depth.

The dummy data provided has these counts:

this will produce output like:

Using Java 8 streams, we can clean up much of this. For starters, creating the map from word -> count is essentially built in.

Java 8 also directly supports inverting a map, replacing the need to either do it by hand or use Guava’s bidirectional map. In the common case, where values are unique, this will suffice:

Unfortunately, in my case that throws an exception, because more than one word has the same count. So it’s slightly more complicated:

But I really want a TreeMap, so I can iterate over the keys in order. Fortunately, I can specify which type of map I want:
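The individual snippets aren't reproduced in this chunk; a sketch combining the steps described above: groupingBy plus counting for the word counts, then inversion into a descending TreeMap (dummy words are mine):

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCounts {
    public static void main(String[] args) {
        // Dummy words standing in for the post's data.
        List<String> words = Arrays.asList("apple", "pear", "apple", "fig", "pear", "apple");

        // Step 1: word -> count is essentially built in (groupingBy + counting).
        Map<String, Long> counts = words.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        // Step 2: invert to count -> words, asking for a TreeMap with a reversed
        // comparator so the keys iterate from the largest count down; mapping()
        // gathers words that share a count into a list instead of colliding.
        TreeMap<Long, List<String>> byCount = counts.entrySet().stream()
                .collect(Collectors.groupingBy(
                        Map.Entry::getValue,
                        () -> new TreeMap<>(Comparator.reverseOrder()),
                        Collectors.mapping(Map.Entry::getKey, Collectors.toList())));

        byCount.forEach((count, ws) -> System.out.println(count + ": " + ws));
        // prints:
        // 3: [apple]
        // 2: [pear]
        // 1: [fig]
    }
}
```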

It’s worth noting that the Python version is simpler still…
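The Python snippet isn't shown in this chunk, but the standard-library version is essentially one call (dummy words are mine):

```python
from collections import Counter

words = ["apple", "pear", "apple", "fig", "pear", "apple"]  # dummy data

# Counter builds the word -> count map; most_common sorts by count descending.
for word, count in Counter(words).most_common():
    print(word, count)
# prints:
# apple 3
# pear 2
# fig 1
```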

Probability Problems: Coin Flips 01

You have an urn with 10 coins in it: 9 fair, and one that is heads only. You draw a coin at random from the urn, then flip it 5 times. What is the probability that you get a head on the 6th flip given you observed head on each of the first 5 flips?

Let $H_i$ be the event we observe head on the $i$th flip, and let $C_i$ be the event we draw the $i$th coin, $i = 1,…,10$.

Then we wish to calculate (using range syntax for brevity) $$P(H_6 | H_1 H_2 H_3 H_4 H_5) = P(H_6 | H_{1:5})$$

Conditioning on which coin we drew, and exploiting the symmetry between coins 1 to 9:

\begin{align} P(H_6 | H_{1:5}) & = \sum_{i=1}^{10} P(H_6 | H_{1:5}, C_{i}) P(C_i | H_{1:5} ) \\ & = 9 \cdot P(H_6 | H_{1:5}, C_1) P(C_1 | H_{1:5}) + P(H_6 | H_{1:5}, C_{10}) P(C_{10} | H_{1:5} ) \end{align}

So it just remains to calculate $P(C_i | H_{1:5})$. This can be done via Bayes’ rule:

$$P(C_i | H_{1:5}) = \frac{ P(H_{1:5} | C_i ) P(C_i) }{ P(H_{1:5}) }$$

where, playing the same conditioning trick:

\begin{align} P(H_{1:5}) &= \sum_{i=1}^{10} P(H_{1:5} | C_i ) P(C_i) \\ & = \sum_{i=1}^{9}P(H_{1:5} | C_i) P(C_i) + P(H_{1:5} | C_{10}) P(C_{10}) \\ & = 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \frac{1}{10} \end{align}

Thus:

\begin{align} P(C_1 | H_{1:5}) & = \frac{ P(H_{1:5} | C_1 ) P(C_1) }{ 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \frac{1}{10} } \\ & = \frac{ \left( \frac{1}{2} \right)^5 \frac{1}{10} }{ 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \frac{1}{10} } \\ & = \frac{1}{9 + 2^5} \\ & = \frac{1}{41} \\ & \\ P(C_{10} | H_{1:5}) & = \frac{ P(H_{1:5} | C_{10} ) P(C_{10}) }{ 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \frac{1}{10} } \\ & = \frac{ 1^5 \frac{1}{10} }{ 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \frac{1}{10} } \\ & = \frac{32}{9 + 32} \\ & = \frac{32}{41} \\ \end{align}

Note that we can quickly self-test these posteriors by verifying $\sum_{i=1}^{10} P(C_i | H_{1:5}) = 9 \cdot \frac{1}{41} + \frac{32}{41} = 1$.

Returning to our earlier expression for $P(H_6 | H_{1:5})$:

\begin{align} P(H_6 | H_{1:5}) & = 9 \cdot P(H_6 | H_{1:5}, C_1) P(C_1 | H_{1:5}) + P(H_6 | H_{1:5}, C_{10}) P(C_{10} | H_{1:5} ) \\ & = 9 \cdot \frac{1}{2} \frac{1}{41} + 1 \cdot \frac{32}{41} \\ & = \frac{73}{82} \end{align}

Alternatively, you can use R to calculate the probability via brute force by repeatedly sampling according to our problem and counting the number of heads observed.

My sample run produced
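The R snippet and its output aren't reproduced in this chunk; here is the same brute-force check sketched in Python (function name and trial count are mine):

```python
import random

def simulate(trials=200_000, seed=1):
    """Estimate P(H6 | H1..H5) by rejection sampling the urn experiment."""
    random.seed(seed)
    kept = heads6 = 0
    for _ in range(trials):
        heads_only = random.randrange(10) == 0   # 1 of the 10 coins is heads-only
        flip = lambda: heads_only or random.random() < 0.5
        if all(flip() for _ in range(5)):        # keep only runs with 5 straight heads
            kept += 1
            heads6 += flip()                     # then record the 6th flip
    return heads6 / kept

print(simulate())  # should land near 73/82 ~ 0.8902
```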

C++ and Const. Sigh.

Constness has well-known benefits.

What on earth? It turns out the winner is this beautiful bit of syntax:

Beautiful.

So, for all future Googlers, this is how you declare const double arrays and const multidimensional arrays in C++.

Splitting Files With Awk

To split files (e.g. for test/train splits or k-folds) without having to load them into R or Python, awk will do a fine job.

For example, to crack a file into 16 equal parts, using the row number modulo 16 to assign rows to files:
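The one-liner itself isn't shown in this chunk; a sketch of the modulus approach (filenames are mine):

```shell
# Stand-in data file (the real filename isn't shown in the post).
seq 1 100 > data.txt

# Route row NR to one of 16 part files keyed by NR mod 16.
awk '{ print > ("part_" NR % 16 ".txt") }' data.txt
```

This creates part_0.txt through part_15.txt, each holding every 16th row.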

Or to crack a file into an 80/20 train/test split:
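Again the command isn't reproduced here; one way to sketch it (filenames are mine):

```shell
# Stand-in data file.
seq 1 1000 > data.txt

# rand() < 0.8 routes a row to train, the rest to test; srand() seeds the RNG.
awk 'BEGIN { srand(42) } { if (rand() < 0.8) print > "train.txt"; else print > "test.txt" }' data.txt
```

The split is random per row, so expect roughly (not exactly) 800/200; the exact counts also vary across awk implementations, since each seeds its RNG differently.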

And finally, if your data file has a header row that you don’t want to end up in a random file, you can dump the header into both output files, then tell your awk script to append (and use tail to skip the header row):
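The script isn't included in this chunk; a sketch of the header-preserving variant (filenames are mine):

```shell
# Stand-in data file with a header row.
{ echo 'id,value'; seq 1 50; } > data.csv

# Dump the header row into both output files...
head -n 1 data.csv | tee train.csv > test.csv

# ...then split the body, appending with >>; tail -n +2 skips the header.
tail -n +2 data.csv | awk 'BEGIN { srand(7) } { if (rand() < 0.8) print >> "train.csv"; else print >> "test.csv" }'
```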

Wordpress to Octopress Migration

As mentioned, I’m moving my blog to Octopress. I got tired of PHP, php-cgi, WordPress, security holes, constant updates that broke random plugins, 5-second page loads, fragile caching plugins, and all the various nonsense that WordPress brings to the table.

An aside: php-cgi is so fragile and crashes so often that I ran a screen session as root that just attempted to restart it every 5 seconds (attached below for any poor souls stuck using this tech).

For Googlers who want to move from WordPress to Octopress, here’s how I moved 70-odd posts with minimal pain.

1 – Get thomasf’s excellent Python script (accurately named exitwp) that converts WordPress posts to Octopress posts. It will create one Octopress post per WordPress post in the source directory.

2 – I simultaneously moved URLs from blog.earlh.com to earlh.com/blog, so I needed to 301 all the old posts. I did that by getting the awesome WordPress post-exporter script contributed by Mike Schinkel. I curled that to create a list of URLs to forward, then built a TSV of tab-separated old-URL/new-URL pairs. The awk script below will print nginx forward rules:

The rules look like:
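The script and sample output aren't reproduced in this chunk; a hypothetical reconstruction (the paths are made up):

```shell
# Hypothetical input: one tab-separated old-path / new-URL pair per line.
printf '/2010/01/some-post/\thttp://earlh.com/blog/2010/01/some-post/\n' > redirects.tsv

# Emit one permanent (301) nginx rewrite rule per pair.
awk -F'\t' '{ printf "rewrite ^%s$ %s permanent;\n", $1, $2 }' redirects.tsv
```

This prints rules like `rewrite ^/2010/01/some-post/$ http://earlh.com/blog/2010/01/some-post/ permanent;`.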

Add them to your site nginx.conf file inside the server configuration block.

I’ll update with solutions for better image embedding.

C++ Is Horrific

I’m poking at some C++ after not touching it for a decade. C++11 has apparently become roughly as capable as Java was pre-2000; it can now create threads! But the error messages. Oh, the error messages.

What was the cause of this monstrosity?

So yeah: you can’t copy thread objects, which the library enforces by deleting the copy constructor. Still, the amount of knowledge it takes to translate from the error message to the actual error is pretty amazing.

Replacing Sort | Uniq

A code snippet: when poking at columnar data in the shell, you’ll often find yourself asking questions like: what are the unique values of a particular column, or what are the unique values and their counts? R would accomplish this via unique or table, but if your data is largish it may be quite annoying to load into R. I often use bash to quickly pick out a column, à la:
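The pipeline itself isn't shown in this chunk; based on the description that follows, it goes along these lines (the sample data is mine):

```shell
# Tiny stand-in for the real data: 8 comma-separated columns per row.
printf 'a,b,c,d,e,f,g,red\na,b,c,d,e,f,g,blue\na,b,c,d,e,f,g,red\n' > data.csv

# Pick out column 8, sort it so uniq can group, count unique values,
# then sort by count descending.
cat data.csv | awk -F, '{ print $8 }' | sort | uniq -c | sort -rn
```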

In order: bash cats my data, tells awk to print just column 8 using , as the field separator, sorts the data so that uniq can group it, asks uniq to print the counts along with the unique strings, then sorts by the counts descending (-n interprets the field as a number and -r sorts descending). The obvious inefficiency: if your data is a couple of GB, you have to sort all of it just so uniq can work. Instead, you can add the script below to your path and replace the above with:

Not only is this a lot less typing, but it will also be significantly faster, since you don’t have to hold all the data in RAM and sort it.
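The script itself isn't included in this chunk; the one-pass idea can be sketched with an awk associative array (filenames are mine):

```shell
# Stand-in for a column of data already extracted.
printf 'red\nblue\nred\nred\n' > col.txt

# One streaming pass: count values in an awk hash, then sort only the
# (much smaller) set of unique values and their counts.
awk '{ counts[$0]++ } END { for (v in counts) print counts[v], v }' col.txt | sort -rn
```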