Stochastic Nonsense

Put something smart here.

Splitting Audio With Ffmpeg

Here’s a quick utility that uses a set list and ffmpeg to split a single audio file into multiple tracks, setting the song name, artist, and album id3 tags on each. The script is crude, but it’s a quick start.

I used this to split a couple concerts by two of my favorite artists: James McMurtry in Concert July 14 2013 and Ray Wylie Hubbard in Nashville, TN performing tracks from A. Enlightenment B. Endarkenment (Hint: There is no C). You can find the set lists below.

(splitter.py) download
# use ffmpeg to split an mp3 according to a set list
from __future__ import print_function
import subprocess as sp

# NB: set do_copy = True to stream-copy if going mp3 -> mp3; otherwise the audio is transcoded to mp3
setfile = './Ray Wylie Hubbard in Concert -  December 17, 2010 at Tennessee State Museum.set'
mp3file = './Ray Wylie Hubbard full concert - DASH.m4a'
outdir = './Ray Wylie Hubbard in Concert -  December 17, 2010 at Tennessee State Museum'
meta_src = 'https://www.youtube.com/watch?v=y9_xBIuV9nE'
do_copy = False
print('splitting \'%s\' according to sets in \'%s\' into directory \'%s\'' % (mp3file, setfile, outdir))

metas = {'artist':'Ray Wylie Hubbard', 'album':'Acoustic Live in Concert December 17 2010 at Tennessee State Museum / Hippie Jacks'}
metastring = ' '.join([ '-metadata %s="%s"' % (k,v) for k,v in metas.items()])

songs = []
with open(setfile) as f:
  for line in f:
    start, title = line.strip().split(' ', 1)
    songs.append((start, title))

print('read %d lines' % len(songs))

# set id3 tags ala http://jonhall.info/how_to/create_id3_tags_using_ffmpeg
cmds = []
for i in range(len(songs) - 1):
  cmd = (songs[i][0], songs[i+1][0], songs[i][1])
  cmds.append(cmd)

  run = 'ffmpeg -i \"%s\" -acodec %s -ss %s -to %s ' % (mp3file, 'copy' if do_copy else 'libmp3lame -qscale:a 2', cmd[0], cmd[1])
  run += '%s -metadata title="%s" -metadata track="%d/%d" -metadata publisher="%s" ' % (metastring, cmd[2], i+1, len(songs) - 1, meta_src)
  run += '"%s/%02d.mp3"' % (outdir, i + 1)
  # print("\tcmd: %s" % run)
  sp.call(run, shell=True)

You’ll have to adjust params at the top of splitter.py:

  • setfile is the set file
  • mp3file is the audio file
  • outdir is the output directory (you should probably mkdir this beforehand)
  • meta_src is the source
  • do_copy should be True if your source is an mp3 file and False if you want to transcode to mp3
  • replace artist and album in metas

Then just run it: python splitter.py.
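If you want to sanity-check the splits before letting ffmpeg loose, the command construction can be dry-run separately. This sketch only builds and prints the commands; `build_commands` is an illustrative helper, not part of splitter.py, and the artist/album metadata flags are elided for brevity:

```python
# dry-run sketch: build the ffmpeg commands from set-list lines without
# executing them; build_commands is a hypothetical helper, and the
# artist/album id3 flags from splitter.py are omitted here
def build_commands(set_lines, infile, outdir, do_copy=False):
    songs = [line.strip().split(' ', 1) for line in set_lines if line.strip()]
    codec = 'copy' if do_copy else 'libmp3lame -qscale:a 2'
    cmds = []
    # the final set-list entry (eg "finit") only marks the end of the last track
    for i in range(len(songs) - 1):
        start, title = songs[i]
        end = songs[i + 1][0]
        cmds.append('ffmpeg -i "%s" -acodec %s -ss %s -to %s '
                    '-metadata title="%s" "%s/%02d.mp3"'
                    % (infile, codec, start, end, title, outdir, i + 1))
    return cmds

for cmd in build_commands(['00:00:00 intro', '00:03:18 Count My Blessings',
                           '00:06:43 finit'], 'concert.m4a', 'out'):
    print(cmd)
```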

(James_McMurtry_in_Concert_-__July_14,_2013.set) download
00:00:00 chirping
00:00:48 Bayou Tortous
00:05:04 Red Dress
00:11:00 talking / tuning
00:11:52 What's the Matter Now
00:16:09 talking / real sensitive little love ballads and baptists
00:17:18 How'm I Gonna Find You Now
00:21:28 talking / buy cds
00:22:09 Just Us Kids
00:27:00 applause
00:27:27 You'd A' Thought (Leonard Cohen Must Die)
00:33:34 clapping and get ready to dance
00:33:47 Choctaw Bingo (with new verse) (introduction)
00:34:30 Choctaw Bingo (with new verse)
00:46:22 applause / belly rubbing dances
00:46:49 These things i've come to know
00:50:26 talking / introduction to the band / unfortunately song still relevant
00:51:44 We Can't Make it Here Anymore
00:58:37 talking about ice fishing and Tim Holt
00:59:37 Copper Canteen
01:04:22 talking / back to stompin' songs
01:04:41 Freeway View
01:08:03 talking / feral hogs in michigan
01:08:33 Lobo Town
01:14:00 applause
01:14:07 Restless
01:19:17 talking / goes out to memory of max crawford
01:20:02 Levelland
01:26:40 talking / thanks crowd / reminds crowd to bring whiskey / be nice to the nice officer
01:28:21 Too Long in the Wasteland
01:35:36 applause
01:35:50 Lights of Cheyenne (acoustic)
01:41:51 thanks and watch for deer (god bless wisconsin)
01:42:42 finit
(Ray_Wylie_Hubbard_in_Concert_-__December_17,_2010_at_Tennessee_State_Museum.set) download
00:00:00 hippie jacks intro
00:03:03 talking / rwh enters
00:03:18 Count My Blessings
00:06:43 talking / glad to be here / band on weekend work release
00:07:09 Rabbit
00:10:02 talking / get what you pay for
00:10:23 Snake Farm
00:14:14 talking / Hayes Carll / smell
00:16:18 Drunken Poet's Dream
00:20:00 talking / get more attention burning down a barn
00:21:29 Down Home Country Blues
00:24:18 talking / hardly strictly bluegrass / commitment
00:26:25 The Ballad of the Crimson Kings
00:30:57 talking / Without Love banter
00:32:40 Without Love
00:36:28 tuning
00:36:43 Dust of the Chase
00:40:50 applause / talking / tuning / remembering Mambo John Traynor
00:43:30 Name Droppin'
00:46:44 talking / ain't no rock and roll roosters
00:47:30 Rooster
00:50:12 talking / gambler and a drinker's gold bullet
00:52:35 Mississippi Flush
00:56:39 talking / no name dropping / slaid cleaves / turned it over to the da
00:59:34 Conversation With the Devil
01:03:30 talking / hard core serious hillbilly redneck country bar
01:05:37 Redneck Mother
01:09:26 talking / ralph stanley and bill monroe
01:11:00 Wanna Rock and Roll
01:14:03 talking / band should get instruments and learn to play 'em
01:15:03 The Messenger
01:19:38 talking / thank you
01:20:43 Mother Hubbard's Blues
01:27:13 thanks / post show
01:29:05 finit

Vi Swapping Order of Tables of Counts and Labels

A note to myself: if you have a table of counts and labels, perhaps created by something | something | uniq -c, you can swap the order of the labels and the counts / change the order of columns in vi via the following regex.

1 labelone
2 labeltwo
3 labelthree
99 label99

highlight in visual mode with V then run the following regex: s/\(\d\+\)\s\+\([a-zA-Z0-9_]*\)/\2 \1/

labelone 1
labeltwo 2
labelthree 3
label99 99

or turn them into the correct format for a python dict via s/\(\d\+\)\s\+\([a-zA-Z0-9_]*\)/'\2': \1,/

'labelone': 1,
'labeltwo': 2,
'labelthree': 3,
'label99': 99,
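The same substitutions work outside vi too; in python, for instance, re.sub with an equivalent pattern does the trick (`\w` matches the same `[a-zA-Z0-9_]` class):

```python
import re

lines = ['1 labelone', '2 labeltwo', '3 labelthree', '99 label99']

# swap the count and label columns, mirroring s/\(\d\+\)\s\+\([a-zA-Z0-9_]*\)/\2 \1/
swapped = [re.sub(r'(\d+)\s+(\w+)', r'\2 \1', line) for line in lines]
print(swapped)   # ['labelone 1', 'labeltwo 2', 'labelthree 3', 'label99 99']

# or emit python dict entries, mirroring the second substitution
entries = [re.sub(r'(\d+)\s+(\w+)', r"'\2': \1,", line) for line in lines]
print(entries)
```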

Shell Utilities for Data Analysis

Quick utilities to help with data analysis from the shell:

Print numbered column names of a csv or tsv. You can specify a file or it will read from stdin. It will also guess the separator, whichever of tab or comma is more common; or you may specify with --separator. This is particularly useful if you want to use awk to select columns.

(colnum) download
#!/usr/bin/python
# if you use macports, you probably want the first line to be exactly #!/opt/local/bin/python

# Copyright 2015 Earl Hathaway rblog Ray at Bans earlh dot com (take my sunglasses off to email me)
# License: The author or authors of this code dedicate any and all copyright interest in this code to the public domain.

#
# print numbered column names or headers from a file or stdin if present with an optional field separator
# tested to work with python from 2.7 to 3.4

from __future__ import print_function
import argparse
import math
import os.path
import sys

stdin = not sys.stdin.isatty()
parser = argparse.ArgumentParser(description='print numbered column headers')
parser.add_argument('file', nargs='?', help='filename(default: stdin if not a tty)')
parser.add_argument('--separator', dest='separator', nargs=1, help='specify the field separator (default: whichever of comma or tab is more common)')
parser.add_argument('--python_dict', dest='pydict', action="store_true", help='emit a python dict?')

args = parser.parse_args(sys.argv[1:])
if args.file is not None and not os.path.isfile(args.file):
  print('File "%s" does not exist' % args.file)
  sys.exit(0)

first = None
if stdin:
  first = sys.stdin.readline()
elif args.file is not None:
  with open(args.file, 'r') as f:
    first = f.readline()
else:
  print('no file specified and nothing on stdin')
  parser.print_help()
  sys.exit(0)

sep = None
if args.separator is None:
  n_comma = first.count(',')
  n_tabs = first.count('\t')
  sep = "\t" if n_tabs >= n_comma else ","
else:
  sep = args.separator[0]

fields = first.split(sep)

# emit a python dict to copy into code; should be zero based
if args.pydict:
  pydict = '{' + (', '.join(['\'%s\': %d' % (val.strip(), idx) for idx, val in enumerate(fields)])) + '}'
  print(pydict)
  sys.exit(0)


# calculate indentation for the field numbers so they don't stagger
width = len(str(len(fields)))
format = ' %%%dd %%s' % width
for idx, val in enumerate(fields):
  print(format % (idx + 1, val.strip()))

Example one:

$ colnum data.csv
 1 request_id
 2 quotes
 3 location_id
 4 category_id
 5 creation_time
 6 week
 7 month

or head -1 data.csv | colnum or colnum --separator , data.csv etc.

There are two options: --separator forces a separator, and --python_dict prints a zero-based lookup dict like so:

{'request_id': 0, 'quotes': 1, 'location_id': 2, 'category_id': 3, 'creation_time': 4, 'week': 5, 'month': 6}

Fastmail Thoughts

last updated Saturday 20 June 2015

I recently switched from gmail to fastmail, mostly because I increasingly dislike google’s stance on privacy, their integration between products, and their ongoing updates to gmail. I unfortunately updated gmail on my phone, and their new material design ethos was designed by an idiot who thinks whitespace should go everywhere, wasting tons of space already in short supply. I can now see only 5.5 messages in the inbox view, whereas I used to be able to see 8, an incredibly annoying change to the most important screen. So I switched.

A review of fastmail several months in:

tl;dr: gmail is a better web application, and a better android application. Choose fastmail if you value privacy; choose gmail otherwise.

Positives

  • it’s not gmail
  • privacy
  • gmail shrunk the view window on android for some stupid flat design rationale; they appear to assume everyone reads email on a 6 inch phone

Negatives

  • fastmail pretends to be a gmail style email client where the unit of manipulation is a conversation, not a message. But the underlying message orientation peeks through in many cases.
    • When deleting a conversation, it has repeatedly asked if I want to delete the entire conversation (what else would I want?) and had a Yes/No for don’t ask me again. I’ve clicked “don’t ask again” at least 3 times. It doesn’t take.
    • if you archive a conversation, the sent emails also move to archive out of sent. This is wrong.
  • Settings feels like my first javascript project.
    • routing rules have to be very simple and sometimes don’t work.
    • The UI for setting up routing rules is shit; you have to add a rule, click add, then scroll to the top of a very long page and click “apply all changes” for the rules to take (yes, I missed that while porting rules from my old webmail and had to redo 40+ rules). It’s essentially a two-phase commit à la git, not at all what I expected from a webmail ui.
    • The rules don’t work as you would expect: eg messages from “a@b.com” do not match “sender ends with” “b.com”.
    • Rules can only filter on one thing at a time — no compound rules on eg sender and subject. When you create a rule, it doesn’t offer to apply to existing messages in the inbox.
    • Rules can’t use “or”. So if you filter on receiver, you can’t say a@b.com or b@b.com or c@b.com. Instead, you have to have one rule for each. By the time you have 100+ of these, it’s damn annoying.
  • spam filtering is crappy:
    • When you mark something as not spam, it is delivered to your inbox and skips rules.
    • there’s no ability to sort by spam score. Presumably the most likely nonspam has the lowest score, so it would be convenient to sort by score when hunting for false positives.
    • the spam filter doesn’t learn: I’ve had to mark a loan payment confirmation email as not-spam every single month I’ve used fastmail
  • It sometimes loses the send button while composing messages.
  • No option to “filter emails like this”; instead, you have to copy and paste eg the address you want to filter into a screen 3 clicks away.
  • By default, it doesn’t load images in html email. There is a link that tries to load the images in the email you’re viewing; it works perhaps 2/3 of the time.
  • The rich editor is crap.
    • For just one of a long list: paste tsv data in there; it strips all the tabs. Awesome. So a b c pastes as abc. Wat?
    • the mobile site on firefox lags typing like 10+ seconds if you have a quoted reply in the message box. It’s strictly amateur hour.
  • The application disables access to files on mobile phones, even after requesting the desktop site in firefox. Surprise!

Potential Dealbreakers:

  • They attempt to monetize security in an incredibly stupid way. If you set up two factor authentication to text a code to your phone, they charge for the sms messages — 0.12 each! — even on $40/year accounts. That’s just chintzy. Better yet, because they’re run by cheap dicks, purchased sms credits expire after a year!!! When I saw that, it felt like purchasing-a-prepaid-cellphone-at-a-gas-station levels of cheap. They’re seriously pricing at 1,600 times what twilio lists on their web page for joe-random-user, not even considering volume discounts. Monetizing security makes you an asshole.
  • Their calendar implementation doesn’t understand meeting requests from Outlook. For example, I got a meeting request for 3pm PDT (sent as 2200 Greenwich; see the excerpt from the calendar invite below) that Fastmail interpreted as 2pm PDT / 10pm BST. What on earth? Exchange is only the most common professional calendar server; you’d think fastmail would interoperate with Outlook.
DTSTART;TZID=Greenwich Standard Time:20150729T220000
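For what it’s worth, “Greenwich Standard Time” is a Windows time zone identifier meaning UTC+0 with no DST, so the intended time is easy to verify; in this sketch PDT (UTC-7) and BST (UTC+1) are hard-coded as fixed offsets for illustration:

```python
from datetime import datetime, timedelta, timezone

# DTSTART;TZID=Greenwich Standard Time:20150729T220000
# "Greenwich Standard Time" is the Windows name for UTC+0 with no DST,
# so the invite means 22:00 UTC
start_utc = datetime(2015, 7, 29, 22, 0, tzinfo=timezone.utc)

pdt = timezone(timedelta(hours=-7), 'PDT')  # fixed offsets for illustration
bst = timezone(timedelta(hours=1), 'BST')

print(start_utc.astimezone(pdt).strftime('%H:%M %Z'))  # 15:00 PDT, ie the intended 3pm
print(start_utc.astimezone(bst).strftime('%H:%M %Z'))  # 23:00 BST, not fastmail's 10pm
```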

In summary, there’s just a lot of annoyances that make me assume the devs don’t use their own product or they’d fix it out of sheer annoyance. But they don’t sell your information, or decide to shrink the number of messages viewable in your inbox in order to conform to some stupid corporate design ethos.

Regression Questions: Logistic Regression Probabilities

Assume we have a logistic regression of the form $\beta_0 + \beta_1 x$, and for value $x_0$ we predict success probability $p(x_0)$. Which of the following is correct?

a) $p(x_0 + 1) = \beta_1 + p(x_0)$

b) $\frac{ p(x_0 + 1) }{ 1 - p(x_0 + 1) } = \exp(\beta_1) \frac{ p(x_0) }{ 1 - p(x_0) }$

c) $p(x_0 + 1) = \Phi(\beta_0 + \beta_1 (x_0 + 1))$
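For the first question, the options are easy to check numerically against a concrete logistic model; the coefficients below are arbitrary, chosen only for illustration:

```python
import math

# arbitrary illustrative coefficients for a logistic model b0 + b1*x
b0, b1 = -0.7, 1.3

def p(x):
    """success probability under the logistic model"""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(q):
    return q / (1.0 - q)

x0 = 0.4
# option (b): a unit increase in x multiplies the odds by exp(b1)
print(odds(p(x0 + 1)) / odds(p(x0)), math.exp(b1))
# option (a) fails: p(x0 + 1) is not b1 + p(x0)
print(p(x0 + 1), b1 + p(x0))
```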

Assume we run a logistic regression on the 1-dimensional data below. What happens?

a) $-\infty < \beta_0 < \infty$; $\beta_1 \rightarrow \infty$

b) $\beta_0 = 0$, $\beta_1 = 0$

c) $\beta_0 = 0$; $\beta_1 \rightarrow -\infty$

d) none of the above

Regression Questions: A Coin Teaser

This is a straightforward question that reveals whether you understand regression, particularly the ceteris paribus interpretation of multiple regression.

  • let $Y$ be the total value of change in your pocket;
  • let $X_1$ be the total number of coins;
  • let $X_2$ be the total number of pennies, nickels, and dimes.

Now, regress $Y$ on $X_1$ or $Y$ on $X_2$ alone. Both $\beta_1$ and $\beta_2$ would be positive.

If you regress $Y$ on both $X_1$ and $X_2$, what are the signs of $\beta_1$ and $\beta_2$?

Consider holding $X_2$ constant: if $X_1$ increases by 1, the extra coin must be a quarter (the count of smaller coins hasn’t changed), so $Y$ surely increases. Therefore $\beta_1$ is positive.

Now consider holding $X_1$ constant and increasing $X_2$. If the number of pennies, nickels, and dimes increases while the total number of coins stays constant, you’re replacing quarters with lower valued coins. Thus increasing $X_2$ decreases $Y$, so $\beta_2$ is negative.
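The sign flip is easy to see in simulation; here is a sketch assuming numpy is available, with pocket contents drawn at random:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000

# random pockets: counts of pennies, nickels, dimes, and quarters
pennies, nickels, dimes, quarters = rng.integers(0, 10, size=(4, n))

y = 1 * pennies + 5 * nickels + 10 * dimes + 25 * quarters  # value in cents
x1 = pennies + nickels + dimes + quarters                   # total coins
x2 = pennies + nickels + dimes                              # small coins only

# regress y on an intercept, x1, and x2 via least squares
X = np.column_stack([np.ones(n), x1, x2])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]
print(b1, b2)  # b1 is positive, b2 is negative
```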

Updated 26 August 2015.

Interview Questions in R

Previously, I wrote about a common interview question: given an array of words, output them in decreasing frequency order, and I provided solutions in java, java8, and python.

Here’s the reason I love R: this can be accomplished in 3 lines of code.

tt <- sort(table(c("a", "b", "a", "a", "b", "c", "a1", "a1", "a1")), dec=T)
depth <- 3
tt[1:depth]

produces

 a a1  b
 3  3  2

Java8 Improvements

java8 has a bunch of nice improvements, and over the holidays I’ve had time to play with them a bit.

First, say goodbye to requiring Apache Commons for really simple functionality, like joining a list of strings!

import static java.util.stream.Collectors.joining;

import java.util.Arrays;
import java.util.List;

/**
 * demonstrate java8 string joining via Collectors.joining
 */
public class StringUtils {

  public static void main(String[] args){
    List<String> words = Arrays.asList("a", "b", "a", "a", "b", "c", "a1", "a1", "a1");

    // old style print each element of a list: Arrays.toString(result.toArray())
    puts("java6 style %s", Arrays.toString(words.toArray()));
    puts("java8 style [%s]", words.stream().collect(joining(", ")));

  }

  public static void puts(String s){ System.out.println(s); }
  public static void puts(String format, Object... args){ puts(String.format(format, args)); }
}

java8 also massively cleans up some common operations. A common interview question is given an array or list of words, print them in descending order by count, or return the top n sorted by count descending. A standard program to do this may go like this: create a map from string to count; reverse the map to go from count to array of words with that count, then descend to the correct depth.

The dummy data provided has these counts:

 a a1  b  c
 3  3  2  1
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * get the n highest frequency words
 */
public class WordCounts {

  public static void main(String[] args){
    String[] words = new String[]{"a", "b", "a", "a", "b", "c", "a1", "a1", "a1"};

    for(int depth = 0; depth < 4; depth++){
      List<String> result = getMostFrequentWords(words, depth);
      puts("depth %d -> %s", depth, Arrays.toString(result.toArray()));
      puts("");
    }
  }

  public static List<String> getMostFrequentWords(String[] words, int depth){
    if(words == null || words.length == 0 || depth <= 0)
      return Collections.emptyList();

    // word -> counts
    HashMap<String, Integer> counts = new HashMap<>();
    for(String word : words){
      if(counts.containsKey(word))
        counts.put(word, counts.get(word) + 1);
      else
        counts.put(word, 1);
    }

    // count -> list of words with that count
    TreeMap<Integer, ArrayList<String>> countmap = new TreeMap<>();
    for(Map.Entry<String, Integer> entry : counts.entrySet()){
      if(countmap.containsKey(entry.getValue()))
        countmap.get(entry.getValue()).add(entry.getKey());
      else {
        ArrayList<String> l = new ArrayList<>();
        l.add(entry.getKey());
        countmap.put(entry.getValue(), l);
      }
    }

    // iterate through the treemap, highest count first, to the desired depth;
    // a single pass suffices since each word appears under exactly one count
    ArrayList<String> result = new ArrayList<>();
    for(Integer i : countmap.descendingKeySet()){
      ArrayList<String> list = countmap.get(i);
      if (list.size() + result.size() < depth){
        result.addAll(list);
      } else {
        for(String s : list){
          result.add(s);
          if(result.size() == depth)
            return result;
        }
      }
    }
    return result;
  }

  public static void puts(String s){ System.out.println(s); }
  public static void puts(String format, Object... args){ puts(String.format(format, args)); }
}

this will produce output like:

depth 0 -> []

depth 1 -> [a1]

depth 2 -> [a1, a]

depth 3 -> [a1, a, b]

Using java8 streams, we can clean up much of this. For starters, creating the map from word -> word count is essentially built in.

  // word -> counts
  Map<String, Long> counts = Arrays.stream(words)
    .collect(Collectors.groupingBy(s -> s, Collectors.counting()));

Java8 also directly supports inverting or reversing a map, replacing the need to either do it by hand or use guava’s bi-directional map. In the common case, where values are unique, this will suffice:

  // count -> list of words: reverse the counts map
  Map<Long, String> countmap = counts.entrySet().stream()
    .collect(Collectors.toMap(Map.Entry::getValue, Map.Entry::getKey));
  puts("countmap: %s", countmap);

Unfortunately, in my case that throws an exception because there is more than one word with the same count. So it’s slightly more complicated:

  // count -> list of words: reverse a map with duplicate values, collecting duplicates in an ArrayList
  Map<Long, ArrayList<String>> countmap = counts.entrySet().stream()
  .collect(Collectors.groupingBy(Map.Entry<String, Long>::getValue, Collectors.mapping(Map.Entry<String, Long>::getKey, Collectors.toCollection(ArrayList::new))));

But I really want a treemap, so I can iterate over the keys in order. Fortunately, I can specify which type of map I want:

  TreeMap<Long, ArrayList<String>> countmap = counts.entrySet().stream()
              .collect(Collectors.groupingBy(Map.Entry<String, Long>::getValue, TreeMap::new, Collectors.mapping(Map.Entry<String, Long>::getKey, Collectors.toCollection(ArrayList::new))));

it’s worth noting that the python is simpler still…

from collections import defaultdict

def get_most_frequent_words(words, depth):
  if words is None or len(words) == 0 or depth <= 0:
    return []

  counts = defaultdict(lambda: 0)
  for word in words:
    counts[word] += 1

  countmap = defaultdict(lambda: [])
  for word, count in counts.items():
    countmap[count].append(word)

  result = []
  for key in sorted(countmap.keys(), reverse=True):
    if len(result) + len(countmap[key]) < depth:
      result.extend(countmap[key])
    else:
      for w in countmap[key]:
        result.append(w)
        if len(result) == depth:
          return result

  return result


words = ["a", "b", "a", "a", "b", "c", "a1", "a1", "a1"]
for depth in range(0, 4):
  print('depth %d -> [%s]' % (depth, (', '.join(get_most_frequent_words(words, depth)))))
  print('')
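and in real code (as opposed to interview whiteboards), the python standard library collapses the whole exercise: collections.Counter counts and sorts in one call, with ties kept in first-encountered order:

```python
from collections import Counter

words = ["a", "b", "a", "a", "b", "c", "a1", "a1", "a1"]
# most_common sorts by count descending; equal counts keep first-seen order
print(Counter(words).most_common(3))  # [('a', 3), ('a1', 3), ('b', 2)]
```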

Probability Problems Coin Flips 01

You have an urn with 10 coins in it: 9 fair, and one that is heads only. You draw a coin at random from the urn, then flip it 5 times. What is the probability that you get a head on the 6th flip given you observed head on each of the first 5 flips?

Let $H_i$ be the event we observe head on the $i$th flip, and let $C_i$ be the event we draw the $i$th coin, $i = 1,…,10$.

Then we wish to calculate (using range syntax for brevity) $$( P(H_6 | H_1 H_2 H_3 H_4 H_5) = P(H_6 | H_{1:5}) $$)

Conditioning on which coin we drew, and exploiting the symmetry between coins 1 to 9:

$$( \begin{align} P(H_6 | H_{1:5}) & = \sum_{i=1}^{10} P(H_6 | H_{1:5}, C_{i}) P(C_i | H_{1:5} ) \\ & = 9 \cdot P(H_6 | H_{1:5}, C_1) P(C_1 | H_{1:5}) + P(H_6 | H_{1:5}, C_{10}) P(C_{10} | H_{1:5} ) \end{align} $$)

So it just remains to calculate $P(C_i | H_{1:5})$. This can be done via bayes rule:

$$( P(C_i | H_{1:5}) = \frac{ P(H_{1:5} | C_i ) P(C_i) }{ P(H_{1:5}) } $$)

where, playing the same conditioning trick:

$$( \begin{align} P(H_{1:5}) &= \sum_{i=1}^{10} P(H_{1:5} | C_i ) P(C_i) \\ & = \sum_{i=1}^{9}P(H_{1:5} | C_i) P(C_i) + P(H_{1:5} | C_{10}) P(C_{10}) \\ & = 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \frac{1}{10} \end{align} $$)

Thus:

$$( \begin{align} P(C_1 | H_{1:5}) & = \frac{ P(H_{1:5} | C_1 ) P(C_1) }{ 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \frac{1}{10} } \\ & = \frac{ \left( \frac{1}{2} \right)^5 \frac{1}{10} }{ 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \frac{1}{10} } \\ & = \frac{1}{9 + 2^5} \\ & = \frac{1}{41} \\ & \\ P(C_{10} | H_{1:5}) & = \frac{ P(H_{1:5} | C_{10} ) P(C_{10}) }{ 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \frac{1}{10} } \\ & = \frac{ 1^5 \frac{1}{10} }{ 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \frac{1}{10} } \\ & = \frac{32}{9 + 32} \\ & = \frac{32}{41} \\ \end{align} $$)

Note that we can quickly self-test and verify $ \sum_{i=1}^{10} P(C_i | H_{1:5}) = 1 $.

Returning to eqn (2)

$$( \begin{align} P(H_6 | H_{1:5}) & = 9 \cdot P(H_6 | H_{1:5}, C_1) P(C_1 | H_{1:5}) + P(H_6 | H_{1:5}, C_{10}) P(C_{10} | H_{1:5} ) \\ & = 9 \cdot \frac{1}{2} \frac{1}{41} + 1 \cdot \frac{32}{41} \\ & = \frac{73}{82} \end{align} $$)
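The arithmetic can also be double-checked exactly with python’s fractions module:

```python
from fractions import Fraction

half, tenth = Fraction(1, 2), Fraction(1, 10)

# P(H_{1:5}): condition on the drawn coin (9 fair coins, 1 heads-only coin)
p_h15 = 9 * half**5 * tenth + 1 * tenth

# posterior coin probabilities via Bayes' rule
p_c1 = (half**5 * tenth) / p_h15   # any single fair coin
p_c10 = tenth / p_h15              # the heads-only coin

# P(H_6 | H_{1:5})
p_h6 = 9 * half * p_c1 + 1 * p_c10
print(p_h6)  # 73/82
```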

Alternatively, you can use R to estimate the probability by brute force: repeatedly sample according to the problem setup and count the proportion of heads observed on the 6th flip.

max_itrs <- 5*1e6
used_itrs <- 0
skipped_itrs <- 0
num_H <- 0

for(itr in 1:max_itrs){
  coin <- sample(1:10, 1)

  if( coin <= 9 ){
    if( !all( runif(5) <= 0.5)){
      skipped_itrs <- skipped_itrs + 1
      next
    }
    num_H <- num_H + ifelse( runif(1) <= 0.5, 1, 0)
  } else {
    num_H <- num_H + 1
  }
  used_itrs <- used_itrs + 1
}

num_H / used_itrs

# vs 73/82
num_H / used_itrs - 73/82

my sample run produced

> num_H / used_itrs
[1] 0.8901657
> 
> # vs 73/82
> num_H / used_itrs - 73/82
[1] -7.821522e-05

C++ and Const. Sigh.

In pursuit of the well known benefits of constness, I declared:

// function:
bool learn_tree(const float** target, unsigned int num_classes);

// and I'm attempting to call with:
float** residuals;

// compiling produces:
error: cannot initialize a parameter of type 'const float **' with an lvalue of type 'float **'

what on earth? It turns out the winner is this beautiful bit of syntax:

bool learn_tree(const float* const* target, unsigned int num_classes);

beautiful.

So for all future googlers, this is how you declare const double-pointer arrays or const multidimensional arrays in c++.