Stochastic Nonsense

Put something smart here.

Regression Questions: Logistic Regression Probabilities

Assume we have a logistic regression of the form $\beta_0 + \beta_1 x$, and for value $x_0$ we predict success probability $p(x_0)$. Which of the following is correct?

a) $p(x_0 + 1) = \beta_1 + p(x_0)$

b) $\frac{ p(x_0 + 1) }{ 1 - p(x_0 + 1) } = \exp(\beta_1) \frac{ p(x_0) }{ 1 - p(x_0) }$

c) $p(x_0 + 1) = \Phi(\beta_0 + \beta_1 ( x_0 + 1 ))$
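
To see which, recall that under the logistic model the log-odds are linear in $x$: $\log \frac{ p(x) }{ 1 - p(x) } = \beta_0 + \beta_1 x$. Exponentiating at $x_0 + 1$,

$$ \frac{ p(x_0 + 1) }{ 1 - p(x_0 + 1) } = e^{ \beta_0 + \beta_1 (x_0 + 1) } = e^{\beta_1} \, e^{ \beta_0 + \beta_1 x_0 } = \exp(\beta_1) \frac{ p(x_0) }{ 1 - p(x_0) } $$

so (b) is correct: a one-unit increase in $x$ multiplies the odds by $\exp(\beta_1)$. (a) would have probabilities moving linearly, and (c) is the probit link, not the logistic.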

Assume we run a logistic regression on the 1-dimensional data below. What happens?

a) $-\infty < \beta_0 < \infty$; $\beta_1 \rightarrow \infty$

b) $\beta_0 = 0$, $\beta_1 = 0$

c) $\beta_0 = 0$; $\beta_1 \rightarrow -\infty$

d) none of the above
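
The figure with the data hasn't survived, but the crux of this question (given the answer choices) is perfect separation: if some threshold splits the two classes exactly, the likelihood keeps improving as the slope steepens, so the fitted $\beta_1$ diverges instead of converging. A minimal sketch of the phenomenon in python, using hypothetical separable data and sklearn's $\ell_2$ penalty as a knob: as the penalty weakens, $\hat{\beta}_1$ runs away.

import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical 1-d data, perfectly separated at x = 0
X = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1])

# sklearn always regularizes; as the l2 penalty weakens (C grows),
# the fitted slope grows without bound (the intercept stays near 0 here
# only because this toy data is symmetric about the boundary)
for C in [1.0, 1e2, 1e4, 1e6]:
    m = LogisticRegression(C=C, max_iter=10_000).fit(X, y)
    print(f"C={C:>9g}  beta_0={m.intercept_[0]: .3f}  beta_1={m.coef_[0, 0]:.3f}")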

Regression Questions: A Coin Teaser

This is a straightforward question that tests whether you understand regression, particularly the ceteris paribus (all else held equal) interpretation of multiple regression coefficients.

  • let $Y$ be the total value of change in your pocket;
  • let $X_1$ be the total number of coins;
  • let $X_2$ be the total number of quarters.

Now, regress $Y$ on $X_1$ alone, or $Y$ on $X_2$ alone. In each simple regression the slope would be positive: pockets with more coins, or with more quarters, tend to hold more money.

If you instead regress $Y$ on $X_1$ and $X_2$ together, what are the signs of $\beta_1$ and $\beta_2$?

Consider holding $X_1$ constant: for a fixed number of coins, if $X_2$ increases then $Y$ surely increases, so $\beta_2$ is positive.

Consider holding $X_2$ constant: for a fixed number of quarters, each additional coin is a penny, nickel, or dime, so it adds far less value than the simple-regression slope suggests. $\beta_1$ collapses toward the average value of a non-quarter coin, and with regressors this correlated the fitted coefficient can easily land near zero. (The classic sign-flip version of this teaser uses pennies for $X_2$: holding the number of coins fixed, each extra penny displaces a more valuable coin, so that coefficient really is negative.)
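
A quick simulation sketch of that decomposition (the coin mix below is hypothetical; only the qualitative pattern matters): the slope on $X_1$ drops from the simple-regression value toward the average value of a non-quarter coin once $X_2$ enters the model.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# hypothetical pockets: some quarters plus a mix of pennies, nickels, and dimes
quarters = rng.poisson(2, n)
others = rng.poisson(6, n)
other_value = np.array([rng.choice([1, 5, 10], size=k).sum() for k in others])

y = 25 * quarters + other_value    # Y:  total value, in cents
x1 = quarters + others             # X1: total number of coins
x2 = quarters                      # X2: number of quarters

def ols(y, *cols):
    # least squares with an intercept column; returns [b0, b1, ...]
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("Y on X1 alone:  ", ols(y, x1))      # slope ~10c: coin count proxies for quarters too
print("Y on X2 alone:  ", ols(y, x2))      # slope ~25c (quarters drawn independently here)
print("Y on X1 and X2: ", ols(y, x1, x2))  # b1 falls to ~5.3c (avg non-quarter coin), b2 ~19.7c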

Interview Questions in R

Previously, I wrote about a common interview question (given an array of words, output them in decreasing frequency order) and provided solutions in java, java8, and python.

Here’s the reason I love R: this can be accomplished in 3 lines of code.

# count occurrences with table(), sort the counts descending, take the top depth entries
tt <- sort(table(c("a", "b", "a", "a", "b", "c", "a1", "a1", "a1")), decreasing = TRUE)
depth <- 3
tt[1:depth]

produces

 a a1  b
 3  3  2

Java8 Improvements

java8 has a bunch of nice improvements, and over the holidays I’ve had time to play with them a bit.

First, say goodbye to requiring Apache Commons for really simple functionality, like joining a list of strings!

import static java.util.stream.Collectors.joining;

import java.util.Arrays;
import java.util.List;

/**
 * demo: printing a list of strings, java6 style vs the java8 joining collector
 */
public class StringUtils {

  public static void main(String[] args){
    List<String> words = Arrays.asList("a", "b", "a", "a", "b", "c", "a1", "a1", "a1");

    // old style: print each element of a list via Arrays.toString(list.toArray())
    puts("java6 style %s", Arrays.toString(words.toArray()));
    puts("java8 style [%s]", words.stream().collect(joining(", ")));

  }

  public static void puts(String s){ System.out.println(s); }
  public static void puts(String format, Object... args){ puts(String.format(format, args)); }
}
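
(Worth noting: java8 also added String.join(", ", words), which is shorter still; the stream-plus-joining version earns its keep once you want to map or filter the elements on the way through.)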

java8 also massively cleans up some common operations. A common interview question: given an array or list of words, print them in descending order by count, or return the top n sorted by count descending. A standard program does this in three steps: build a map from word to count; invert it into a map from count to the list of words with that count; then walk the counts in descending order to the desired depth.

The dummy data provided has these counts:

 a a1  b  c
 3  3  2  1
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * get the n highest frequency words
 */
public class WordCounts {

  public static void main(String[] args){
    String[] words = new String[]{"a", "b", "a", "a", "b", "c", "a1", "a1", "a1"};

    for(int depth = 0; depth < 4; depth++){
      List<String> result = getMostFrequentWords(words, depth);
      puts("depth %d -> %s", depth, Arrays.toString(result.toArray()));
      puts("");
    }
  }

  public static List<String> getMostFrequentWords(String[] words, int depth){
    if(words == null || words.length == 0 || depth <= 0)
      return Collections.emptyList();

    // word -> counts
    HashMap<String, Integer> counts = new HashMap<>();
    for(String word : words){
      if(counts.containsKey(word))
        counts.put(word, counts.get(word) + 1);
      else
        counts.put(word, 1);
    }

    // count -> list of words with that count
    TreeMap<Integer, ArrayList<String>> countmap = new TreeMap<>();
    for(Map.Entry<String, Integer> entry : counts.entrySet()){
      if(countmap.containsKey(entry.getValue()))
        countmap.get(entry.getValue()).add(entry.getKey());
      else {
        ArrayList<String> l = new ArrayList<>();
        l.add(entry.getKey());
        countmap.put(entry.getValue(), l);
      }
    }

    // walk the counts from highest to lowest, stopping at the requested depth
    ArrayList<String> result = new ArrayList<>();
    for(Integer i : countmap.descendingKeySet()){
      for(String s : countmap.get(i)){
        result.add(s);
        if(result.size() == depth)
          return result;
      }
    }
    // fewer than depth distinct words: return everything we found
    return result;
  }

  public static void puts(String s){ System.out.println(s); }
  public static void puts(String format, Object... args){ puts(String.format(format, args)); }
}

this will produce output like:

depth 0 -> []

depth 1 -> [a1]

depth 2 -> [a1, a]

depth 3 -> [a1, a, b]

Using java8 streams, we can clean up much of this. For starters, creating the map from word -> word count is essentially built in:

  // word -> counts
  Map<String, Long> counts = Arrays.stream(words)
    .collect(Collectors.groupingBy(s -> s, Collectors.counting()));

Java8 also directly supports inverting or reversing a map, replacing the need to either do it by hand or pull in guava's bi-directional map. In the common case, where the values are unique, this will suffice:

  // count -> list of words: reverse the counts map
  Map<Long, String> countmap = counts.entrySet().stream()
    .collect(Collectors.toMap(Map.Entry::getValue, Map.Entry::getKey));
  puts("countmap: %s", countmap);

Unfortunately, in my case that throws an IllegalStateException: Collectors.toMap can't handle duplicate keys, and more than one word has the same count. So it's slightly more complicated:

  // count -> list of words: reverse a map with duplicate values, collecting duplicates in an ArrayList
  Map<Long, ArrayList<String>> countmap = counts.entrySet().stream()
  .collect(Collectors.groupingBy(Map.Entry<String, Long>::getValue, Collectors.mapping(Map.Entry<String, Long>::getKey, Collectors.toCollection(ArrayList::new))));

But I really want a treemap, so I can iterate over the keys in order. Fortunately, groupingBy lets me specify which type of map I want:

  TreeMap<Long, ArrayList<String>> countmap = counts.entrySet().stream()
      .collect(Collectors.groupingBy(Map.Entry<String, Long>::getValue, TreeMap::new,
          Collectors.mapping(Map.Entry<String, Long>::getKey, Collectors.toCollection(ArrayList::new))));

It's worth noting that the python is simpler still…

from collections import defaultdict

def get_most_frequent_words(words, depth):
  if words is None or len(words) == 0 or depth <= 0:
    return []

  counts = defaultdict(int)
  for word in words:
    counts[word] += 1

  countmap = defaultdict(list)
  for word, count in counts.items():
    countmap[count].append(word)

  result = []
  for key in sorted(countmap.keys(), reverse=True):
    if len(result) + len(countmap[key]) < depth:
      result.extend(countmap[key])
    else:
      for w in countmap[key]:
        result.append(w)
        if len(result) == depth:
          return result

  return result


words = ["a", "b", "a", "a", "b", "c", "a1", "a1", "a1"]
for depth in range(0, 4):
  print('depth %d -> [%s]' % (depth, (', '.join(get_most_frequent_words(words, depth)))))
  print()

Probability Problems: Coin Flips 01

You have an urn with 10 coins in it: 9 fair, and one that is heads only. You draw a coin at random from the urn, then flip it 5 times. What is the probability that you get a head on the 6th flip given you observed head on each of the first 5 flips?

Let $H_i$ be the event that we observe a head on the $i$th flip, and let $C_i$ be the event that we drew the $i$th coin, $i = 1, \ldots, 10$, with coin 10 the heads-only coin.

Then we wish to calculate (writing $H_{1:5}$ as shorthand for $H_1 H_2 H_3 H_4 H_5$):

$$ P(H_6 | H_1 H_2 H_3 H_4 H_5) = P(H_6 | H_{1:5}) $$

Conditioning on which coin we drew, and exploiting the symmetry between coins 1 to 9:

$$ \begin{align} P(H_6 | H_{1:5}) & = \sum_{i=1}^{10} P(H_6 | H_{1:5}, C_i) P(C_i | H_{1:5}) \\ & = 9 \cdot P(H_6 | H_{1:5}, C_1) P(C_1 | H_{1:5}) + P(H_6 | H_{1:5}, C_{10}) P(C_{10} | H_{1:5}) \end{align} $$

So it just remains to calculate $P(C_i | H_{1:5})$; given the coin, the flips are independent, so $P(H_6 | H_{1:5}, C_i)$ is just $\frac{1}{2}$ for the fair coins and $1$ for coin 10. The posteriors come from Bayes' rule:

$$ P(C_i | H_{1:5}) = \frac{ P(H_{1:5} | C_i) P(C_i) }{ P(H_{1:5}) } $$

where, playing the same conditioning trick:

$$ \begin{align} P(H_{1:5}) &= \sum_{i=1}^{10} P(H_{1:5} | C_i) P(C_i) \\ & = \sum_{i=1}^{9} P(H_{1:5} | C_i) P(C_i) + P(H_{1:5} | C_{10}) P(C_{10}) \\ & = 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \cdot \frac{1}{10} \end{align} $$

Thus:

$$ \begin{align} P(C_1 | H_{1:5}) & = \frac{ P(H_{1:5} | C_1 ) P(C_1) }{ 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \cdot \frac{1}{10} } \\ & = \frac{ \left( \frac{1}{2} \right)^5 \frac{1}{10} }{ 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \cdot \frac{1}{10} } \\ & = \frac{1}{9 + 2^5} \\ & = \frac{1}{41} \\ & \\ P(C_{10} | H_{1:5}) & = \frac{ P(H_{1:5} | C_{10} ) P(C_{10}) }{ 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \cdot \frac{1}{10} } \\ & = \frac{ 1^5 \cdot \frac{1}{10} }{ 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \cdot \frac{1}{10} } \\ & = \frac{32}{9 + 32} \\ & = \frac{32}{41} \\ \end{align} $$

Note that we can quickly self-check: the posteriors sum to one, $\sum_{i=1}^{10} P(C_i | H_{1:5}) = 9 \cdot \frac{1}{41} + \frac{32}{41} = 1$.

Returning to the conditioning expansion above:

$$ \begin{align} P(H_6 | H_{1:5}) & = 9 \cdot P(H_6 | H_{1:5}, C_1) P(C_1 | H_{1:5}) + P(H_6 | H_{1:5}, C_{10}) P(C_{10} | H_{1:5}) \\ & = 9 \cdot \frac{1}{2} \cdot \frac{1}{41} + 1 \cdot \frac{32}{41} \\ & = \frac{73}{82} \end{align} $$

Alternatively, you can use R to estimate the probability by brute force: repeatedly sample a coin and flip it, keep only the runs whose first 5 flips are all heads, and count how often the 6th flip comes up heads.

max_itrs <- 5*1e6
used_itrs <- 0
skipped_itrs <- 0
num_H <- 0

for(itr in 1:max_itrs){
  coin <- sample(1:10, 1)

  if( coin <= 9 ){
    # fair coin: keep this run only if the first 5 flips are all heads
    if( !all( runif(5) <= 0.5)){
      skipped_itrs <- skipped_itrs + 1
      next
    }
    # the 6th flip
    num_H <- num_H + ifelse( runif(1) <= 0.5, 1, 0)
  } else {
    # heads-only coin: it always shows the first 5 heads, and the 6th
    num_H <- num_H + 1
  }
  used_itrs <- used_itrs + 1
}

num_H / used_itrs

# vs 73/82
num_H / used_itrs - 73/82

my sample run produced

> num_H / used_itrs
[1] 0.8901657
> 
> # vs 73/82
> num_H / used_itrs - 73/82
[1] -7.821522e-05

C++ and Const. Sigh.

The benefits of constness are well known, so I dutifully const-qualified a parameter:

// function:
bool learn_tree(const float** target, unsigned int num_classes);

// and I'm attempting to call with:
float** residuals;

// compiling produces:
error: cannot initialize a parameter of type 'const float **' with an lvalue of type 'float **'

what on earth? It turns out the winner is this beautiful bit of syntax:

bool learn_tree(const float* const* target, unsigned int num_classes);

beautiful.

So for all future googlers, that is how you declare const double-pointer (or const multidimensional-array) parameters in c++. The reason the compiler rejects passing a float** where const float** is expected: if it were allowed, the callee could stash a pointer to a const float into one of the float* slots through the const float** view, and the caller could then write through that float* and silently modify a const object. Making the intermediate level const as well (const float* const*) closes the loophole, so that conversion is legal.

Splitting Files With Awk

To split files (eg for test / train splits or k-folds) without having to load into R or python, awk will do a fine job.

For example, to crack a file into 16 equal parts, using the row number mod 16 to assign rows to files:

split files into equal parts with awk
$ cat LARGE_FILE | awk '{print $0 >("part_" int(NR%16))}'

Or to crack a file into an 80/20 train/test split (rows where NR mod 10 is 0 or 1, i.e. 20% of them, go to the test file):

create test/train split using awk
cat LARGE_FILE | awk '{if( NR % 10 <= 1){ print $0 > "data.test.20"} else {print $0 > "data.train.80"}}'

And finally, if your data file has a header that you don’t want to end up in a random file, you can dump the header row into both files, then tell your awk script to append (and use tail to skip the header row)

create test/train split with proper header handling
head -1 LARGE_FILE > data.test.20
head -1 LARGE_FILE > data.train.80
tail -n+2 LARGE_FILE | awk '{if( NR % 10 <= 1){ print $0 >> "data.test.20"} else {print $0 >> "data.train.80"}}'

Wordpress to Octopress Migration

As mentioned, I’m moving my blog to octopress. I got tired of php, php-cgi, wordpress, security holes, constant updates that broke random plugins, 5 second page loads, fragile caching plugins, and all the various nonsense that wordpress brings to the table.

An aside: php-cgi is so fragile and crashes so often that I ran a screen session as root that just attempted to restart it every 5 seconds (attached below for any poor souls stuck using this tech).

# run as root inside screen
for (( ; ; )); do /usr/bin/spawn-fcgi -u nginx -g nginx -f /usr/bin/php-cgi -a 127.0.0.1 -p 53217 -P /var/run/fastcgi-php.pid; sleep 5; done

For googlers who want to move from wordpress to octopress, here’s how I moved 70-odd posts with minimal pain.

1 – Get thomasf’s excellent python script (accurately named exitwp) that converts wordpress posts to octopress posts. This will create one octopress post per wordpress post in the source directory.

2 – I simultaneously moved urls from blog.earlh.com to earlh.com/blog, so I needed to 301 all the old posts. I did that by getting this awesome wordpress post exporter script contributed by Mike Schinkel. I curled that to create a list of urls to forward, then built a tsv of old url\tnew url pairs. The awk script below then prints the nginx forward rules:

cat posts.tsv | awk -F"\t" '{print "\tlocation " $1 " {\n\t\treturn 301 " $2 ";\n\t\tbreak;\n\t}"}' | sed "s/http:\/\/blog.earlh.com//"

The rules look like:

location /index.php/2009/06/cleaning-data-in-r-csv-files {
        return 301 http://earlh.com/blog/2009/06/29/cleaning-data-in-r-csv-files/;
        break;
}

Add them to your site nginx.conf file inside the server configuration block.

I’ll update with solutions for better image embedding.

C++ Is Horrific

I'm poking at some c++ after not touching it for a decade. c++11 has apparently gotten roughly as capable as java pre-2000; it can now create threads! But the error messages. Oh, the error messages:

$ clang++ -std=c++11 -stdlib=libc++ src/test.cpp
In file included from src/test.cpp:4:
In file included from /usr/bin/../lib/c++/v1/thread:93:
In file included from /usr/bin/../lib/c++/v1/functional:465:
/usr/bin/../lib/c++/v1/memory:1677:31: error: calling a private constructor of class 'std::__1::thread'
            ::new((void*)__p) _Up(_VSTD::forward<_Args>(__args)...);
                              ^
/usr/bin/../lib/c++/v1/memory:1604:18: note: in instantiation of function template specialization
      'std::__1::allocator<std::__1::thread>::construct<std::__1::thread, const std::__1::thread &>' requested here
            {__a.construct(__p, _VSTD::forward<_Args>(__args)...);}
                 ^
/usr/bin/../lib/c++/v1/memory:1488:14: note: in instantiation of function template specialization
      'std::__1::allocator_traits<std::__1::allocator<std::__1::thread> >::__construct<std::__1::thread, const std::__1::thread &>' requested here
            {__construct(__has_construct<allocator_type, pointer, _Args...>(),
             ^
/usr/bin/../lib/c++/v1/vector:1471:25: note: in instantiation of function template specialization
      'std::__1::allocator_traits<std::__1::allocator<std::__1::thread> >::construct<std::__1::thread, const std::__1::thread &>' requested here
        __alloc_traits::construct(this->__alloc(),
                        ^
src/test.cpp:36:13: note: in instantiation of member function 'std::__1::vector<std::__1::thread, std::__1::allocator<std::__1::thread> >::push_back'
      requested here
    threads.push_back(t);
            ^
/usr/bin/../lib/c++/v1/thread:261:5: note: implicitly declared private here
    thread(const thread&);
    ^
1 error generated.

what was the cause of this monstrosity?

std::vector< std::thread > threads;
// [...]
std::thread t(thread_func, i);
threads.push_back(t);

so yeah, you can't copy thread objects, which libc++ enforces here by declaring the copy constructor private. The fix is to move the thread into the vector, threads.push_back(std::move(t)), or construct it in place with threads.emplace_back(thread_func, i). Still, the amount of knowledge it takes to translate from the error message to the actual error is pretty amazing.

Replacing Sort | Uniq

A code snippet: when poking at columnar data in the shell, you'll often find yourself asking questions like: what are the unique values of a particular column, or what are the unique values and their counts? R would accomplish this via unique or table, but if your data is largish it can be quite annoying to load into R. I often use bash to quickly pick out a column, a la

pick out a column
$ cat data.csv | awk -F, '{print $8}' | sort | uniq -c | sort -r -n

In order: bash cats my data, tells awk to print just column 8 using , as the field separator, sorts all the data so that uniq can work, asks uniq to prefix each unique string with its count, then sorts by the counts descending (-n compares numerically and -r reverses the order). The obvious inefficiency: if your data is a couple of gb, you have to sort the whole thing just so uniq can collapse adjacent duplicates. Instead, you can add the script below to your path and replace the above with:

pick out a column
$ cat data.csv | awk -F, '{print $8}' | count

not only is this a lot less typing, but it is significantly faster on large inputs: instead of sorting the entire file, the script only keeps a hash of the distinct values and their counts, and sorts that (much smaller) set at the end.

replace sort | uniq
#!/usr/bin/ruby

# replaces sort | uniq -c | sort -r -n: count distinct input lines
cnts = {}
$stdin.each do |line|
  line = line.chomp
  cnts[line] = 1 + (cnts[line] || 0)
end

# print counts descending, tab separated
cnts.sort_by{ |word, cnt| -cnt }.
    each{ |word, cnt| puts "#{cnt}\t#{word}" }
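
To use it, drop the script somewhere on your PATH as count and chmod +x it; any pipeline that used to end in sort | uniq -c | sort -r -n can then end in count instead.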