Stochastic Nonsense

Put something smart here.

Vi Swapping Order of Tables of Counts and Labels

A note to myself: if you have a table of counts and labels, perhaps created by something | something | uniq -c, you can swap the counts and labels (i.e., reorder the columns) in vi with the following substitution.

1 labelone
2 labeltwo
3 labelthree
99 label99

highlight the lines in visual mode with V, then run the following substitution: s/\(\d\+\)\s\+\([a-zA-Z0-9_]*\)/\2 \1/

labelone 1
labeltwo 2
labelthree 3
label99 99

or turn them into the correct format for a python dict via s/\(\d\+\)\s\+\([a-zA-Z0-9_]*\)/'\2': \1,/

'labelone': 1,
'labeltwo': 2,
'labelthree': 3,
'label99': 99,
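If you want the same transformation outside of vi, the equivalent substitution works with python's re module (a quick sketch of mine, not part of the original note):

```python
import re

lines = ["1 labelone", "2 labeltwo", "99 label99"]

# swap "count label" -> "label count"; same capture groups as the vi regex
swapped = [re.sub(r'(\d+)\s+(\w+)', r'\2 \1', line) for line in lines]
print(swapped)   # ['labelone 1', 'labeltwo 2', 'label99 99']

# or emit python dict entries, mirroring the second substitution
entries = [re.sub(r'(\d+)\s+(\w+)', r"'\2': \1,", line) for line in lines]
print(entries)   # ["'labelone': 1,", "'labeltwo': 2,", "'label99': 99,"]
```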

Shell Utilities for Data Analysis

Quick utilities to help with data analysis from the shell:

Print numbered column names of a csv or tsv. You can specify a file or it will read from stdin. It will also guess the separator, whichever of tab or comma is more common; or you may specify with --separator. This is particularly useful if you want to use awk to select columns.

colnum
#!/usr/bin/python
# if you use macports, you probably want the first line to be exactly #!/opt/local/bin/python

# Copyright 2015 Earl Hathaway rblog Ray at Bans earlh dot com (take my sunglasses off to email me)
# License: The author or authors of this code dedicate any and all copyright interest in this code to the public domain.

#
# print numbered column names or headers from a file or stdin if present with an optional field separator
# tested to work with python from 2.7 to 3.4

from __future__ import print_function
import argparse
import math
import os.path
import sys

stdin = not sys.stdin.isatty()
parser = argparse.ArgumentParser(description='print numbered column headers')
parser.add_argument('file', nargs='?', help='filename (default: stdin if not a tty)')
parser.add_argument('--separator', dest='separator', nargs=1, help='specify the field separator (default: whichever of comma or tab is more common)')
parser.add_argument('--python_dict', dest='pydict', action="store_true", help='emit a python dict?')

args = parser.parse_args(sys.argv[1:])
if args.file is not None and not os.path.isfile(args.file):
  print('File "%s" does not exist' % args.file)
  sys.exit(1)

first = None
if stdin:
  first = sys.stdin.readline()
elif args.file is not None:
  with open(args.file, 'r') as f:
    first = f.readline()
else:
  print('no file specified and nothing on stdin')
  parser.print_help()
  sys.exit(0)

sep = None
if args.separator is None:
  n_comma = first.count(',')
  n_tabs = first.count('\t')
  sep = "\t" if n_tabs >= n_comma else ","
else:
  sep = args.separator[0]

fields = first.split(sep)

# emit a python dict to copy into code; should be zero based
if args.pydict:
  pydict = '{' + (', '.join(['\'%s\': %d' % (val.strip(), idx) for idx, val in enumerate(fields)])) + '}'
  print(pydict)
  sys.exit(0)


# calculate indentation for field numbers so the columns don't stagger
width = len(str(len(fields)))
format = ' %%%dd %%s' % width
for idx, val in enumerate(fields):
  print(format % (idx + 1, val.strip()))

Example one:

$ colnum data.csv
 1 request_id
 2 quotes
 3 location_id
 4 category_id
 5 creation_time
 6 week
 7 month

or head -1 data.csv | colnum or colnum --separator , data.csv etc.

There are two options: --separator forces a separator, and --python_dict prints a zero-index based lookup dict like so:

{'request_id': 0, 'quotes': 1, 'location_id': 2, 'category_id': 3, 'creation_time': 4, 'week': 5, 'month': 6}
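The dict output pastes straight into code, so you can index split rows by name rather than by magic numbers. A quick sketch using a few of the column names above (the sample line is made up):

```python
# pasted from colnum --python_dict (abridged)
cols = {'request_id': 0, 'quotes': 1, 'location_id': 2}

line = "r-123,4,sf-07"
row = line.split(',')
print(row[cols['location_id']])   # -> sf-07
```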

Fastmail Thoughts

last updated Saturday 20 June 2015

I recently switched to fastmail in lieu of gmail, mostly because I increasingly dislike google’s stance on privacy, their integration between products, and their ongoing updates to gmail. I unfortunately updated gmail on my phone, and their new material design ethos was designed by an idiot who thinks that they should have whitespace everywhere, wasting tons of space already in short supply. I now can only see 5.5 messages in the inbox view, whereas I used to be able to see 8, an incredibly annoying change in the most important screen. So I switched.

A review of fastmail several months in:

tl;dr: gmail is a better web application, and a better android application. Choose fastmail if you value privacy; choose gmail otherwise.

Positives

  • it’s not gmail
  • privacy
  • gmail shrunk the view window on android for some stupid flat design rationale; they appear to assume everyone reads email on a 6 inch phone

Negatives

  • fastmail pretends to be a gmail style email client where the unit of manipulation is a conversation, not a message. But the underlying message orientation peeks through in many cases.
    • When deleting a conversation, it has repeatedly asked if I want to delete the entire conversation (what else would I want?) and had a Yes/No for don’t ask me again. I’ve clicked “don’t ask again” at least 3 times. It doesn’t take.
    • if you archive a conversation, the sent emails also move to archive out of sent. This is wrong.
  • Settings feels like my first javascript project.
    • routing rules have to be very simple and sometimes don’t work.
    • The UI for setting up routing rules is shit; you have to add them, click, add, then scroll to the top of a very long page and click “apply all changes” for the rules to take (yes, I missed that while porting rules from my old webmail and had to redo 40+ rules). It’s essentially two-phase commit ala git; not at all what I expected for a webmail ui.
    • The rules don’t work as you would expect: eg messages from “a@b.com” do not match “sender ends with” “b.com”.
    • Rules can only filter on one thing at a time — no compound rules on eg sender and subject. When you create a rule, it doesn’t offer to apply to existing messages in the inbox.
  • Rules can’t use “or”. So if you filter on receiver, you can’t say a@b.com or b@b.com or c@b.com; instead, you need one rule for each. By the time you have 100+ of these, it’s damn annoying.
  • spam filtering is crappy:
    • When you mark something as not spam, it is delivered to your inbox and skips rules.
    • there’s no ability to sort by spam score. Hopefully the most likely nonspam would have the lowest score, so it would be convenient to sort by that to find nonspam.
    • the spam filter doesn’t learn: I’ve had to mark a loan payment confirmation email as not-spam every single month I’ve used fastmail
  • It sometimes loses the send button while composing messages.
  • No option to “filter emails like this”; instead, you have to copy and paste eg the address you want to filter into a screen 3 clicks away.
  • By default, it doesn’t load images in html email. There is a link that tries to load the images in the email you’re viewing; it works perhaps 2/3 of the time.
  • The rich editor is crap.
    • For just one of a long list: paste tsv data in there; it strips all the tabs. Awesome. So a b c pastes as abc. Wat?
    • the mobile site on firefox lags typing like 10+ seconds if you have a quoted reply in the message box. It’s strictly amateur hour.
  • The application disables access to files on mobile phones even after requesting the desktop site in firefox. Surprise!

Potential Dealbreakers:

  • They attempt to monetize security in an incredibly stupid way. If you set up two factor authentication to text a code to your phone, they charge for the sms messages — $0.12 each! — even on $40/year accounts. That’s just chintzy. Better yet, because they’re run by cheap dicks, purchased sms credits expire after a year!!! When I saw that, it felt like buying-a-prepaid-cellphone-at-a-gas-station levels of cheap. They’re seriously pricing at roughly 16 times what twilio lists on their web page for joe-random-user, not even considering volume discounts. Monetizing security makes you an asshole.

In summary, there’s just a lot of annoyances that make me assume the devs don’t use their own product or they’d fix it out of sheer annoyance. But they don’t sell your information, or decide to shrink the number of messages viewable in your inbox in order to conform to some stupid corporate design ethos.

Regression Questions: Logistic Regression Probabilities

Assume we have a logistic regression of the form $\beta_0 + \beta_1 x$, and for value $x_0$ we predict success probability $p(x_0)$. Which of the following is correct?

a) $p(x_0 + 1) = \beta_1 + p(x_0)$

b) $\frac{ p(x_0 + 1) }{ 1 - p(x_0 + 1) } = \exp(\beta_1) \frac{ p(x_0) }{ 1 - p(x_0) }$

c) $p(x_0 + 1) = \Phi(\beta_0 + \beta_1( x_0 + 1))$
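These are easy to poke at numerically; for instance, the odds-ratio relation in (b) can be checked with arbitrary coefficients (the values below are mine, not from the post):

```python
import math

b0, b1, x0 = -0.5, 0.8, 1.3   # arbitrary illustrative values

def p(x):
    # logistic success probability
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(x):
    return p(x) / (1.0 - p(x))

# a unit increase in x multiplies the odds by exp(b1)
print(odds(x0 + 1) / odds(x0), math.exp(b1))
```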

Assume we run a logistic regression on the 1-dimensional data below. What happens?

a) $-\infty < \beta_0 < \infty;\ \beta_1 \rightarrow \infty$

b) $\beta_0 = 0$, $\beta_1 = 0$

c) $\beta_0 = 0$; $\beta_1 \rightarrow -\infty$

d) none of the above

Regression Questions: A Coin Teaser

This is a straightforward question that elucidates whether you understand regression, particularly the ceteris paribus interpretation of multiple regression.

  • let $Y$ be the total value of change in your pocket;
  • let $X_1$ be the total number of coins;
  • let $X_2$ be the total number of quarters.

Now, regress $Y$ on $X_1$ or $Y$ on $X_2$ alone. Both $\beta_1$ and $\beta_2$ would be positive.

If you regress $Y$ on both $X_1$ and $X_2$, what are the signs of $\beta_1$ and $\beta_2$?

Consider holding $X_1$ constant: for a fixed number of coins, if $X_2$ increases then $Y$ surely increases, so $\beta_2$ is positive.

Consider holding $X_2$ constant: for a fixed number of quarters, increasing the total number of coins will often decrease $Y$, so it is entirely possible that $\beta_1$ is negative.

Interview Questions in R

Previously, I wrote about a common interview question: given an array of words, output them in decreasing frequency order, and I provided solutions in java, java8, and python.

Here’s the reason I love R: this can be accomplished in 3 lines of code.

tt <- sort(table(c("a", "b", "a", "a", "b", "c", "a1", "a1", "a1")), dec=T)
depth <- 3
tt[1:depth]

produces

 a a1  b
 3  3  2
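For comparison, python's collections.Counter gets you roughly the same one-liner (ties keep first-insertion order in CPython 3.7+):

```python
from collections import Counter

words = ["a", "b", "a", "a", "b", "c", "a1", "a1", "a1"]
print(Counter(words).most_common(3))   # [('a', 3), ('a1', 3), ('b', 2)]
```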

Java8 Improvements

java8 has a bunch of nice improvements, and over the holidays I’ve had time to play with them a bit.

First, say goodbye to requiring Apache Commons for really simple functionality, like joining a list of strings!

import static java.util.stream.Collectors.joining;

import java.util.Arrays;
import java.util.List;

/**
 * java8 string joining vs the old Arrays.toString approach
 */
public class StringUtils {

  public static void main(String[] args){
    List<String> words = Arrays.asList("a", "b", "a", "a", "b", "c", "a1", "a1", "a1");

    // old style print each element of a list: Arrays.toString(result.toArray())
    puts("java6 style %s", Arrays.toString(words.toArray()));
    puts("java8 style [%s]", words.stream().collect(joining(", ")));

  }

  public static void puts(String s){ System.out.println(s); }
  public static void puts(String format, Object... args){ puts(String.format(format, args)); }
}

java8 also massively cleans up some common operations. A common interview question is: given an array or list of words, print them in descending order by count, or return the top n sorted by count descending. A standard solution goes like this: create a map from word to count; invert it to map each count to the words with that count; then iterate over counts in descending order to the desired depth.

The dummy data provided has these counts:

 a a1  b  c
 3  3  2  1
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * get the n highest frequency words
 */
public class WordCounts {

  public static void main(String[] args){
    String[] words = new String[]{"a", "b", "a", "a", "b", "c", "a1", "a1", "a1"};

    for(int depth = 0; depth < 4; depth++){
      List<String> result = getMostFrequentWords(words, depth);
      puts("depth %d -> %s", depth, Arrays.toString(result.toArray()));
      puts("");
    }
  }

  public static List<String> getMostFrequentWords(String[] words, int depth){
    if(words == null || words.length == 0 || depth <= 0)
      return Collections.emptyList();

    // word -> counts
    HashMap<String, Integer> counts = new HashMap<>();
    for(String word : words){
      if(counts.containsKey(word))
        counts.put(word, counts.get(word) + 1);
      else
        counts.put(word, 1);
    }

    // count -> list of words with that count
    TreeMap<Integer, ArrayList<String>> countmap = new TreeMap<>();
    for(Map.Entry<String, Integer> entry : counts.entrySet()){
      if(countmap.containsKey(entry.getValue()))
        countmap.get(entry.getValue()).add(entry.getKey());
      else {
        ArrayList<String> l = new ArrayList<>();
        l.add(entry.getKey());
        countmap.put(entry.getValue(), l);
      }
    }

    // iterate through the treemap keys in descending order to the desired depth
    // (a single pass; looping again would append duplicates when depth exceeds
    // the number of distinct words)
    ArrayList<String> result = new ArrayList<>();
    for(Integer i : countmap.descendingKeySet()){
      ArrayList<String> list = countmap.get(i);
      if (list.size() + result.size() < depth){
        result.addAll(list);
      } else {
        for(String s : list){
          result.add(s);
          if(result.size() == depth)
            return result;
        }
      }
    }
    return result;
  }

  public static void puts(String s){ System.out.println(s); }
  public static void puts(String format, Object... args){ puts(String.format(format, args)); }
}

this will produce output like:

depth 0 -> []

depth 1 -> [a1]

depth 2 -> [a1, a]

depth 3 -> [a1, a, b]

Using java8 streams, we can clean up much of this. For starters, creating the map from word -> count is essentially built in.

  // word -> counts
  Map<String, Long> counts = Arrays.stream(words)
    .collect(Collectors.groupingBy(s -> s, Collectors.counting()));

Java8 also directly supports inverting or reversing a map, replacing the need to either do it by hand or use guava’s bi-directional map. In the common case, where values are unique, this will suffice:

  // count -> list of words: reverse the counts map
  Map<Long, String> countmap = counts.entrySet().stream()
    .collect(Collectors.toMap(Map.Entry::getValue, Map.Entry::getKey));
  puts("countmap: %s", countmap);

Unfortunately, in my case that throws an exception because there is more than one word with the same count. So it’s slightly more complicated:

  // count -> list of words: reverse a map with duplicate values, collecting duplicates in an ArrayList
  Map<Long, ArrayList<String>> countmap = counts.entrySet().stream()
  .collect(Collectors.groupingBy(Map.Entry<String, Long>::getValue, Collectors.mapping(Map.Entry<String, Long>::getKey, Collectors.toCollection(ArrayList::new))));

But I really want a TreeMap, so I can iterate over the keys in order. Fortunately, I can specify which type of map I want:

  TreeMap<Long, ArrayList<String>> countmap = counts.entrySet().stream()
              .collect(Collectors.groupingBy(Map.Entry<String, Long>::getValue, TreeMap::new, Collectors.mapping(Map.Entry<String, Long>::getKey, Collectors.toCollection(ArrayList::new))));

it’s worth noting that python is simpler still…

from collections import defaultdict

def get_most_frequent_words(words, depth):
  if words is None or len(words) == 0 or depth <= 0:
    return []

  counts = defaultdict(lambda: 0)
  for word in words:
    counts[word] += 1

  countmap = defaultdict(lambda: [])
  for word, count in counts.items():  # .iteritems() on python 2
    countmap[count].append(word)

  result = []
  for key in sorted(countmap.keys(), reverse=True):
    if len(result) + len(countmap[key]) < depth:
      result.extend(countmap[key])
    else:
      for w in countmap[key]:
        result.append(w)
        if len(result) == depth:
          return result

  return result


words = ["a", "b", "a", "a", "b", "c", "a1", "a1", "a1"]
for depth in range(0, 4):
  print('depth %d -> [%s]' % (depth, (', '.join(get_most_frequent_words(words, depth)))))
  print('\n')

Probability Problems: Coin Flips 01

You have an urn with 10 coins in it: 9 fair, and one that is heads only. You draw a coin at random from the urn, then flip it 5 times. What is the probability that you get a head on the 6th flip given you observed head on each of the first 5 flips?

Let $H_i$ be the event we observe head on the $i$th flip, and let $C_i$ be the event we draw the $i$th coin, $i = 1,\ldots,10$, with coin 10 the heads-only coin.

Then we wish to calculate (using range syntax for brevity) $$( P(H_6 | H_1 H_2 H_3 H_4 H_5) = P(H_6 | H_{1:5}) $$)

Conditioning on which coin we drew, and exploiting the symmetry between coins 1 to 9:

$$( \begin{align} P(H_6 | H_{1:5}) & = \sum_{i=1}^{10} P(H_6 | H_{1:5}, C_{i}) P(C_i | H_{1:5} ) \\ & = 9 \cdot P(H_6 | H_{1:5}, C_1) P(C_1 | H_{1:5}) + P(H_6 | H_{1:5}, C_{10}) P(C_{10} | H_{1:5} ) \end{align} $$)

So it just remains to calculate $P(C_i | H_{1:5})$. This can be done via bayes rule:

$$( P(C_i | H_{1:5}) = \frac{ P(H_{1:5} | C_i ) P(C_i) }{ P(H_{1:5}) } $$)

where, playing the same conditioning trick:

$$( \begin{align} P(H_{1:5}) &= \sum_{i=1}^{10} P(H_{1:5} | C_i ) P(C_i) \\ & = \sum_{i=1}^{9}P(H_{1:5} | C_i) P(C_i) + P(H_{1:5} | C_{10}) P(C_{10}) \\ & = 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \frac{1}{10} \end{align} $$)

Thus:

$$( \begin{align} P(C_1 | H_{1:5}) & = \frac{ P(H_{1:5} | C_1 ) P(C_1) }{ 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \frac{1}{10} } \\ & = \frac{ \left( \frac{1}{2} \right)^5 \frac{1}{10} }{ 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \frac{1}{10} } \\ & = \frac{1}{9 + 2^5} \\ & = \frac{1}{41} \\ & \\ P(C_{10} | H_{1:5}) & = \frac{ P(H_{1:5} | C_{10} ) P(C_{10}) }{ 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \frac{1}{10} } \\ & = \frac{ 1^5 \frac{1}{10} }{ 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \frac{1}{10} } \\ & = \frac{32}{9 + 32} \\ & = \frac{32}{41} \\ \end{align} $$)

Note that we can quickly self-test and verify $ \sum_{i=1}^{10} P(C_i | H_{1:5}) = \frac{9}{41} + \frac{32}{41} = 1 $.

Returning to the conditioning expansion above:

$$( \begin{align} P(H_6 | H_{1:5}) & = 9 \cdot P(H_6 | H_{1:5}, C_1) P(C_1 | H_{1:5}) + P(H_6 | H_{1:5}, C_{10}) P(C_{10} | H_{1:5} ) \\ & = 9 \cdot \frac{1}{2} \frac{1}{41} + 1 \cdot \frac{32}{41} \\ & = \frac{73}{82} \end{align} $$)

Alternatively, you can use R to calculate the probability via brute force by repeatedly sampling according to our problem and counting the number of heads observed.

max_itrs <- 5*1e6
used_itrs <- 0
skipped_itrs <- 0
num_H <- 0

for(itr in 1:max_itrs){
  coin <- sample(1:10, 1)

  if( coin <= 9 ){
    if( !all( runif(5) <= 0.5)){
      skipped_itrs <- skipped_itrs + 1
      next
    }
    num_H <- num_H + ifelse( runif(1) <= 0.5, 1, 0)
  } else {
    num_H <- num_H + 1
  }
  used_itrs <- used_itrs + 1
}

num_H / used_itrs

# vs 73/82
num_H / used_itrs - 73/82

my sample run produced

> num_H / used_itrs
[1] 0.8901657
> 
> # vs 73/82
> num_H / used_itrs - 73/82
[1] -7.821522e-05
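For what it's worth, the same brute force translates directly to python (my own sketch of the R loop above):

```python
import random

random.seed(1)
num_h = used = 0
for _ in range(200_000):
    coin = random.randrange(10)                 # coins 0-8 fair, coin 9 heads-only
    if coin < 9:
        # require five heads from the fair coin, else resample
        if not all(random.random() < 0.5 for _ in range(5)):
            continue
        num_h += random.random() < 0.5          # the 6th flip
    else:
        num_h += 1                              # heads-only coin always shows heads
    used += 1

print(num_h / used)   # close to 73/82 ~ 0.8902
```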

C++ and Const. Sigh.

Everyone knows the well known benefits of constness. And yet:

// function:
bool learn_tree(const float** target, unsigned int num_classes);

// and I'm attempting to call with:
float** residuals;

// compiling produces:
error: cannot initialize a parameter of type 'const float **' with an lvalue of type 'float **'

what on earth? The conversion from float** to const float** is disallowed because it could be used to silently modify a const object through the aliased pointer. It turns out the winner is this beautiful bit of syntax:

bool learn_tree(const float* const* target, unsigned int num_classes);

beautiful.

So for all future googlers, this is how you declare const double-pointer or const multidimensional array parameters in c++.

Splitting Files With Awk

To split files (eg for test / train splits or k-folds) without having to load into R or python, awk will do a fine job.

For example, to crack into 16 equal parts using modulus to assign rows to files:

split files into equal parts with awk
$ cat LARGE_FILE | awk '{print $0 >("part_" int(NR%16))}'

Or to crack a file into a 80/20 test/train split:

create test/train split using awk
awk '{if( NR % 10 <= 1){ print $0 > "data.test.20"} else {print $0 > "data.train.80"}}' LARGE_FILE

And finally, if your data file has a header that you don’t want to end up in a random file, you can dump the header row into both files, then tell your awk script to append (and use tail to skip the header row)

create test/train split with proper header handling
head -1 LARGE_FILE > data.test.20
head -1 LARGE_FILE > data.train.80
tail -n+2 LARGE_FILE | awk '{if( NR % 10 <= 1){ print $0 >> "data.test.20"} else {print $0 >> "data.train.80"}}'
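As a sanity check on the modulus arithmetic, NR % 10 <= 1 selects row numbers ending in 0 or 1 (two of every ten rows), which is exactly the 20% test split:

```python
# fraction of row numbers 1..10000 that land in the test split
test_rows = sum(1 for nr in range(1, 10001) if nr % 10 <= 1)
print(test_rows / 10000)   # -> 0.2
```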