Stochastic Nonsense

Splitting Audio with ffmpeg

2015-08-08T13:55:38-07:00

Here’s a quick utility to use a set list and ffmpeg to split single audio files into multiple tracks. It splits audio files via a setlist then sets the song name, artist, album id3 tags. The script is crude, but it’s a quick start.

I used this to split a couple concerts by two of my favorite artists: James McMurtry in Concert July 14 2013 and Ray Wylie Hubbard in Nashville, TN performing tracks from A. Enlightenment B. Endarkenment (Hint: There is no C). You can find the set lists below.

(splitter.py) download

# use ffmpeg to split an mp3 according to a set list
from __future__ import print_function
import subprocess as sp

# NB: add -copy if from mp3 -> mp3; else no -copy
setfile = './Ray Wylie Hubbard in Concert -  December 17, 2010 at Tennessee State Museum.set'
mp3file = './Ray Wylie Hubbard full concert - DASH.m4a'
outdir = './Ray Wylie Hubbard in Concert -  December 17, 2010 at Tennessee State Museum'
meta_src = 'https://www.youtube.com/watch?v=y9_xBIuV9nE'
do_copy = False
print('splitting \'%s\' according to sets in \'%s\' into directory \'%s\'' % (mp3file, setfile, outdir))

metas = {'artist':'Ray Wylie Hubbard', 'album':'Acoustic Live in Concert December 17 2010 at Tennessee State Museum / Hippie Jacks'}
metastring = ' '.join([ '-metadata %s="%s"' % (k,v) for k,v in metas.iteritems()])

songs = []
with open(setfile) as f:
  for line in f:
    start, title = line.strip().split(' ', 1)
    songs.append((start, title))

print('read %d lines' % len(songs))

# set id3 tags ala http://jonhall.info/how_to/create_id3_tags_using_ffmpeg
cmds = []
for i in range(len(songs) - 1):
  cmd = (songs[i][0], songs[i+1][0], songs[i][1])
  cmds.append(cmd)

  run = 'ffmpeg -i \"%s\" -acodec %s -ss %s -to %s ' % (mp3file, 'copy' if do_copy else 'libmp3lame -qscale:a 2', cmd[0], cmd[1])
  run += '%s -metadata title="%s" -metadata track="%d/%d" -metadata publisher="%s" ' % (metastring, cmd[2], i+1, len(songs) - 1, meta_src)
  run += '"%s/%02d.mp3"' % (outdir, i + 1)
  # print("\tcmd: %s" % run)
  sp.call(run, shell=True)

You’ll have to adjust params at the top of splitter.py:

setfile is the set file
mp3file is the audio file
outdir is the output directory (you should probably mkdir this beforehand)
meta_src is the source
do_copy should be True if your source is an mp3 file and False if you want to transcode to mp3
replace artist and album in metas

Then just run it: python splitter.py.

(James_McMurtry_in_Concert_-__July_14,_2013.set) download

00:00 chirping
00:48 Bayou Tortous
05:04 Red Dress
11:00 talking / tuning
11:52 What's the Matter Now
16:09 talking / real sensitive little love ballads and baptists
17:18 How'm I Gonna Find You Now
21:28 talking / buy cds
22:09 Just Us Kids
27:00 applause
27:27 You'd A' Thought (Leonard Cohen Must Die)
33:34 clapping and get ready to dance
33:47 Choctaw Bingo (with new verse) (introduction)
34:30 Choctaw Bingo (with new verse)
46:22 applause / belly rubbing dances
46:49 These things i've come to know
50:26 talking / introduction to the band / unfortunately song still relevant
51:44 We Can't Make it Here Anymore
58:37 talking about ice fishing and Tim Holt
59:37 Copper Canteen
04:22 talking / back to stompin' songs
04:41 Freeway View
08:03 talking / feral hogs in michigan
08:33 Lobo Town
14:00 applause
14:07 Restless
19:17 talking / goes out to memory of max crawford
20:02 Levelland
26:40 talking / thanks crowd / reminds crowd to bring whiskey / be nice to the nice officer
28:21 Too Long in the Wasteland
35:36 applause
35:50 Lights of Cheyenne (acoustic)
41:51 thanks and watch for deer (god bless wisconsin)
42:42 finit

(Ray_Wylie_Hubbard_in_Concert_-__December_17,_2010_at_Tennessee_State_Museum.set) download

00:00 hippie jacks intro
03:03 talking / rwh enters
03:18 Count My Blessings
06:43 talking / glad to be here / band on weekend work release
07:09 Rabbit
10:02 talking / get what you pay for
10:23 Snake Farm
14:14 talking / Hayes Carll / smell
16:18 Drunken Poet's Dream
20:00 talking / get more attention burning down a barn
21:29 Down Home Country Blues
24:18 talking / hardly strictly bluegrass / commitment
26:25 The Ballad of the Crimson Kings
30:57 talking / Without Love banter
32:40 Without Love
36:28 tuning
36:43 Dust of the Chase
40:50 applause / talking / tuning / remembering Mambo John Traynor
43:30 Name Droppin'
46:44 talking / ain't no rock and roll roosters
47:30 Rooster
50:12 talking / gambler and a drinker's gold bullet
52:35 Mississippi Flush
56:39 talking / no name dropping / slaid cleaves / turned it over to the da
59:34 Conversation With the Devil
03:30 talking / hard core serious hillbilly redneck country bar
05:37 Redneck Mother
09:26 talking / ralph stanley and bill monroe
11:00 Wanna Rock and Roll
14:03 talking / band should get instruments and learn to play 'em
15:03 The Messenger
19:38 talking / thank you
20:43 Mother Hubbard's Blues
27:13 thanks / post show
29:05 finit

vi swapping order of tables of counts and labels

2015-06-13T12:29:49-07:00

A note to myself: if you have a table of counts and labels, perhaps created by something | something | uniq -c, you can swap the order of the labels and the counts / change the order of columns in vi via the following regex.

labelone
labeltwo
labelthree
label99

highlight in visual mode with V then run the following regex: s/$\d\+$\s\+$[a-zA-Z0-9_]*$/\2 \1/

labelone 1
labeltwo 2
labelthree 3
label99 99

or turn them into the correct format for a python dict via s/$\d\+$\s\+$[a-zA-Z0-9_]*$/'\2': \1,/

'labelone': 1,
'labeltwo': 2,
'labelthree': 3,
'label99': 99,

shell utilities for data analysis

2015-06-11T18:55:00-07:00

Quick utilities to help with data analysis from the shell:

Print numbered column names of a csv or tsv. You can specify a file or it will read from stdin. It will also guess the separator, whichever of tab or comma is more common; or you may specify with --separator. This is particularly useful if you want to use awk to select columns.

(colnum) download

#!/usr/bin/python
# you use macports, you probably want the first line to be exactly #!/opt/local/bin/python

# Copyright 2015 Earl Hathaway rblog Ray at Bans earlh dot com (take my sunglasses off to email me)
# License: The author or authors of this code dedicate any and all copyright interest in this code to the public domain.

#
# print numbered column names or headers from a file or stdin if present with an optional field separator
# tested to work with python from 2.7 to 3.4

from __future__ import print_function
import argparse
import math
import os.path
import sys

stdin = not sys.stdin.isatty()
parser = argparse.ArgumentParser(description='print numbered column headers')
parser.add_argument('file', nargs='?', help='filename(default: stdin if not a tty)')
parser.add_argument('--separator', dest='separator', nargs=1, help='specify the field separator (default: whichever of comma or tab is more common)')
parser.add_argument('--python_dict', dest='pydict', action="store_true", help='emit a python dict?')

args = parser.parse_args(sys.argv[1:])
if args.file is not None and not os.path.isfile(args.file):
  print('File "%s" does not exist' % args.file)
  sys.exit(0)

first = None
if stdin:
  first = sys.stdin.readline()
elif args.file is not None:
  with open(args.file, 'r') as f:
    first = f.readline()
else:
  print('no file specified and nothing on stdin')
  parser.print_help()
  sys.exit(0)

sep = None
if args.separator is None:
  n_comma = first.count(',')
  n_tabs = first.count('\t')
  sep = "\t" if n_tabs >= n_comma else ","
else:
  sep = args.separator[0]

fields = first.split(sep)

# emit a python dict to copy into code; should be zero based
if args.pydict:
  pydict = '{' + (', '.join(['\'%s\': %d' % (val.strip(), idx) for idx, val in enumerate(fields)])) + '}'
  print(pydict)
  sys.exit(0)


# calculate indentation for fields so they don't stagger
width = 0 if len(fields) < 10 else int(math.ceil(math.log10(len(fields))))
format = ' %%%dd %%s' % width
for idx, val in enumerate(fields):
  print(format % (idx + 1, val.strip()))

Example one:

$ colnum data.csv
request_id
quotes
location_id
category_id
creation_time
week
month

or head -1 data.csv | colnum or colnum --separator , data.csv etc.

There are two options: --separator forces a separator, and --python_dict prints a zero-index based lookup dict like so:

{'request_id': 0, 'quotes': 1, 'location_id': 2, 'category_id': 3, 'creation_time': 4, 'week': 5, 'month ': 6}

fastmail thoughts

2015-06-06T23:36:06-07:00

last updated Saturday 20 June 2015

I recently switched to fastmail in lieu of gmail, mostly because I increasingly dislike google’s stance on privacy, their integration between products, and their ongoing updates to gmail. I unfortunately updated gmail on my phone, and their new material design ethos was designed by an idiot who thinks that they should have whitespace everywhere, wasting tons of space already in short supply. I now can only see 5.5 messages in the inbox view, whereas I used to be able to see 8, an incredibly annoying change in the most important screen. So I switched.

A review of fastmail a several months in:

tl;dr: gmail is a better web application, and a better android application. Choose fastmail if you value privacy; choose gmail otherwise.

Positives

it’s not gmail
privacy
gmail shrunk the view window on android for some stupid flat design rationale; they appear to assume everyone reads email on a 6 inch phone

Negatives

fastmail pretends to be a gmail style email client where the unit of manipulation is a conversation, not a message. But the underlying message orientation peeks through in many cases.
- When deleting a conversation, it has repeatedly asked if I want to delete the entire conversation (what else would I want?) and had a Yes/No for don’t ask me again. I’ve clicked “don’t ask again” at least 3 times. It doesn’t take.
- if you archive a conversation, the sent emails also move to archive out of sent. This is wrong.
Settings feels like my first javascript project.
- routing rules have to be very simple and sometimes don’t work.
- The UI for setting up routing rules is shit; you have to add them, click, add, then scroll to the top of a very long page and click “apply all changes” for the rules to take (yes, I missed that while porting rules from my old webmail and had to redo 40+ rules). It’s essentially two-phase commit ala git; not at all what I expected for a webmail ui.
- The rules don’t work as you would expect: eg messages from “a@b.com” do not match “sender ends with” “b.com”.
- Rules can only filter on one thing at a time — no compound rules on eg sender and subject. When you create a rule, it doesn’t offer to apply to existing messages in the inbox.
- Rules can’t use “or”. So if you filter on receiver, you can’t say a@b.com or b@b.com or c@b.com. Instead, you have to have one rule per each. By the time you have 100+ of these, it’s damn annoying.
spam filtering is crappy:
- When you mark something as not spam, it is delivered to your inbox and skips rules.
- there’s no ability to sort by spam score. Hopefully the most likely nonspam would have the lowest score, so it would be convenient to sort by that to find nonspam.
- the spam filter doesn’t learn: I’ve had to mark a loan payment confirmation email as not-spam every single month I’ve used fastmail
It sometimes loses the send button while composing messages.
No option to “filter emails like this”; instead, you have to copy and paste eg the address you want to filter into a screen 3 clicks away.
By default, it doesn’t load images in html email. There is a link that tries to load the images in the email you’re viewing; it works perhaps 2/3 of the time.
The rich editor is crap.
- For just one of a long list: paste tsv data in there; it strips all the tabs. Awesome. So a b c pastes as abc. Wat?
- the mobile site on firefox lags typing like 10+ seconds if you have a quoted reply in the message box. It’s strictly amateur hour.
The application disables access to files on mobile phones even after using requesting desktop in firefox. surprise!

Potential Dealbreakers:

They attempt to monetize security in an incredibly stupid way. If you setup two factor authentication to text a code to your phone, they charge for the sms messages — 0.12 each! — even on $40/year accounts. That’s just chintzy. Better yet, because they’re run by cheap dicks, purchased sms credits expire after a year!!! When I saw that it felt like purchasing a prepaid cellphone at a gas station level cheap. They’re seriously pricing at 1600 hundred times the pricing twilio has on their web page for joe-random-user, not even considering volume discounts. Monetizing security makes you an asshole.
Their calendar implementation doesn’t understand meeting requests from Outlook. For example, I got a meeting request for 3pm PDT (sent as 2200 Greenwich; see excerpt from the calendar invite below) that Fastmail interpreted as 2pm PDT / 10pm BST. What on earth? Exchange is only the most common professional calendar server; why would you assume fastmail interoperates with Outlook?

DTSTART;TZID=Greenwich Standard Time:20150729T220000

In summary, there’s just a lot of annoyances that make me assume the devs don’t use their own product or they’d fix it out of sheer annoyance. But they don’t sell your information, or decide to shrink the number of messages viewable in your inbox in order to conform to some stupid corporate design ethos.

Regression Questions: Logistic Regression Probabilities

2015-01-18T22:51:52-08:00

Assume we have a logistic regression of the form $\beta_0 + \beta_1 x$, and for value $x_0$ we predict success probability $p(x_0)$. Which of the following is correct?

a) $p(x_0 + 1) = B_1 + p(x_0)$

b) $\frac{ p(x_0 + 1) }{ 1 – p(x_0 + 1) } = \exp(\beta_1) \frac{p(x_0) }{1 – p(x_0)}$

c) $p(x_0 + 1) = \Phi(B_0 + B_1( x_0 + 1))$

Assume we run a logistic regression on the 1-dimensional data below. What happens?

a) $– \infty < B_0 < \infty; B_1 \rightarrow \infty$

b) $\beta_0 = 0$, $\beta_1 = 0$

c) $\beta_0 = 0$; $\beta_1 \rightarrow –\infty$

d) none of the above

Regression Questions: A Coin Teaser

2015-01-17T22:51:52-08:00

This is a straightforward question that elucidates whether you understand regression, particularly the ceteris paribus interpretation of multiple regression.

let $Y$ be the total value of change in your pocket;
let $X_1$ be the total number of coins;
let $X_2$ be the total number of pennies, nickels, and dimes.

Now, regress $Y$ on $X_1$ or $Y$ on $X_2$ alone. Both $\beta_1$ and $\beta_2$ would be positive.

If you regress $Y$ on $X_1 + X_2$, what are the signs of $\beta_1$ and $\beta_2$?

Consider holding $X_2$ constant: if $X_1$ increases by 1, ie you turn a penny, nickle, or dime into a quarter, then $Y$ surely increases. Therefore $\beta_1$ is positive.

Now consider holding $X_1$ constant and increasing $X_2$. If the number of pennies, nickles, and dimes increases while the total number of coins stays constant, you’re replacing quarters with a lower valued coin. Thus increasing $X_2$ can decrease $Y$, so it is entirely possible that $\beta_2$ is negative.

Updated 26 August 2015.

interview questions in R

2014-12-01T17:37:31-08:00

Previously, I wrote about a common interview question: given an array of words, output them in decreasing frequency order, and I provided solutions in java, java8, and python.

Here’s the reason I love R: this can be accomplished in 3 lines of code.

tt <- sort(table(c("a", "b", "a", "a", "b", "c", "a1", "a1", "a1")), dec=T)
depth <- 3
tt[1:depth]

produces

 a a1  b
 3  3  2

java8 improvements

2014-12-01T15:52:39-08:00

java8 has a bunch of nice improvements, and over the holidays I’ve had time to play with them a bit.

First, say goodbye to requiring Apache Commons for really simple functionality, like joining a string!

import static java.util.stream.Collectors.joining;

import java.util.Arrays;
import java.util.List;

/**
 * 
 */
public class StringUtils {

  public static void main(String[] args){
    List<String> words = Arrays.asList("a", "b", "a", "a", "b", "c", "a1", "a1", "a1");

    // old style print each element of a list: Arrays.toString(result.toArray())
    puts("java6 style %s", Arrays.toString(words.toArray()));
    puts("java8 style [%s]", words.stream().collect(joining(", ")));

  }

  public static void puts(String s){ System.out.println(s); }
  public static void puts(String format, Object... args){ puts(String.format(format, args)); }
}

java8 also massively cleans up some common operations. A common interview question is given an array or list of words, print them in descending order by count, or return the top n sorted by count descending. A standard program to do this may go like this: create a map from string to count; reverse the map to go from count to array of words with that count, then descend to the correct depth.

The dummy data provided has these counts:

 a a1  b  c
 3  3  2  1

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * get the n highest frequency words
 */
public class WordCounts {

public static void main(String[] args){
    String[] words = new String[]{"a", "b", "a", "a", "b", "c", "a1", "a1", "a1"};

    for(int depth = 0; depth < 4; depth++){
      List<String> result = getMostFrequentWords(words, depth);
      puts("depth %d -> %s", depth, Arrays.toString(result.toArray()));
      puts("");
    }
  }

  public static List<String> getMostFrequentWords(String[] words, int depth){
    if(words == null || words.length == 0 || depth <= 0)
      return Collections.emptyList();

    // word -> counts
    HashMap<String, Integer> counts = new HashMap<>();
    for(String word : words){
      if(counts.containsKey(word))
        counts.put(word, counts.get(word) + 1);
      else
        counts.put(word, 1);
    }

    // count -> list of words with that count
    TreeMap<Integer, ArrayList<String>> countmap = new TreeMap<>();
    for(Map.Entry<String, Integer> entry : counts.entrySet()){
      if(countmap.containsKey(entry.getValue()))
        countmap.get(entry.getValue()).add(entry.getKey());
      else {
        ArrayList<String> l = new ArrayList<>();
        l.add(entry.getKey());
        countmap.put(entry.getValue(), l);
      }
    }

    // iterate through treemap to desired depth
    ArrayList<String> result = new ArrayList<>();
    while(result.size() <= depth){
      for(Integer i : countmap.descendingKeySet()){
        ArrayList<String> list = countmap.get(i);
        if (list.size() + result.size() < depth){
          result.addAll(list);
        } else {
          for(String s : list){
            result.add(s);
            if(result.size() == depth)
              return result;
          }
        }
      }
    }
    return result;
  }

  public static void puts(String s){ System.out.println(s); }
  public static void puts(String format, Object... args){ puts(String.format(format, args)); }
}

this will produce output like:

depth 0 -> []

depth 1 -> [a1]

depth 2 -> [a1, a]

depth 3 -> [a1, a, b]

Using java8 streams, we can clean up much of this. For starters, creating the map from word –> word count is essentially build in.

  // word -> counts
  Map<String, Long> counts = Arrays.stream(words)
    .collect(Collectors.groupingBy(s -> s, Collectors.counting()));

Java8 also directly supports inverting or reversing a map, replacing the need to either do it by hand or use guava’s bi-directional map. In the common case, where values are unique, this will suffice:

  // count -> list of words: reverse the counts map
  Map<Long, String> countmap = counts.entrySet().stream()
    .collect(Collectors.toMap(Map.Entry::getValue, Map.Entry::getKey));
  puts("countmap: %s", countmap);

Unfortunately, in my case that throws an exception because there is more than one word with the same count. So it’s slightly more complicated:

  // count -> list of words: reverse a map with duplicate values, collecting duplicates in an ArrayList
  Map<Long, ArrayList<String>> countmap = counts.entrySet().stream()
  .collect(Collectors.groupingBy(Map.Entry<String, Long>::getValue, Collectors.mapping(Map.Entry<String, Long>::getKey, Collectors.toCollection(ArrayList::new))));

But I really want a treemap, so I can iterate over they keys in order. Fortunately, I can specify which type of map I want

  TreeMap<Long, ArrayList<String>> countmap = counts.entrySet().stream()
              .collect(Collectors.groupingBy(Map.Entry<String, Long>::getValue, TreeMap::new, Collectors.mapping(Map.Entry<String, Long>::getKey, Collectors.toCollection(ArrayList::new))));

it’s worth noting the python is simpler still…

from collections import defaultdict

def get_most_frequent_words(words, depth):
  if words is None or len(words) == 0 or depth <= 0:
    return []

  counts = defaultdict(lambda: 0)
  for word in words:
    counts[word] += 1

  countmap = defaultdict(lambda: [])
  for word, count in counts.iteritems():
    countmap[count].append(word)

  result = []
  for key in sorted(countmap.keys(), reverse=True):
    if len(result) + len(countmap[key]) < depth:
      result.extend(countmap[key])
    else:
      for w in countmap[key]:
        result.append(w)
        if len(result) == depth:
          return result

  return result


words = ["a", "b", "a", "a", "b", "c", "a1", "a1", "a1"]
for depth in range(0, 4):
  print('depth %d -> [%s]' % (depth, (', '.join(get_most_frequent_words(words, depth)))))
  print('\n')

probability problems coin flips 01

2014-08-21T14:42:53-07:00

You have an urn with 10 coins in it: 9 fair, and one that is heads only. You draw a coin at random from the urn, then flip it 5 times. What is the probability that you get a head on the 6th flip given you observed head on each of the first 5 flips?

Let $H_i$ be the event we observe head on the $i$th flip, and let $C_i$ be the event we draw the $i$th coin, $i = 1,…,10$.

Then we wish to calculate (using range syntax for brevity) $$( P(H_6 | H_1 H_2 H_3 H_4 H_5) = P(H_6 | H_{1:5}) $$)

Conditioning on which coin we drew, and exploiting the symmetry between coins 1 to 9:

$$( \begin{align} P(H_6 | H_{1:5}) & = \sum_{i=1}^{10} P(H_6 | H_{1:5}, C_{i}) P(C_i | H_{1:5} ) \\ & = 9 \cdot P(H_6 | H_{1:5}, C_1) P(C_1 | H_{1:5}) + P(H_6 | H_{1:5}, C_{10}) P(C_{10} | H_{1:6} ) \end{align} $$)

So it just remains to calculate $P(C_i | H_{1:5})$. This can be done via bayes rule:

$$( P(C_i | H_{1:5}) = \frac{ P(H_{1:5} | C_i ) P(C_i) }{ P(H_{1:5}) } $$)

where, playing the same conditioning trick:

$$( \begin{align} P(H_{1:5}) &= \sum_{i=1}^{10} P(H_{1:5} | C_i ) P(C_i) \\ & = \sum_{i=1}^{9}P(H_{1:5} | C_i) P(C_i) + P(H_{1:5} | C_{10}) P(C_{10}) \\ & = 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \frac{1}{10} \end{align} $$)

Thus:

$$( \begin{align} P(C_1 | H_{1:5}) & = \frac{ P(H_{1:5} | C_1 ) P(C_1) }{ 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \frac{1}{10} } \\ & = \frac{ \left( \frac{1}{2} \right)^5 \frac{1}{10} }{ 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \frac{1}{10} } \\ & = \frac{1}{9 + 2^5} \\ & = \frac{1}{41} \\ & \\ P(C_{10} | H_{1:5}) & = \frac{ P(H_{1:5} | C_{10} ) P(C_{10}) }{ 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \frac{1}{10} } \\ & = \frac{ 1^5 \frac{1}{10} }{ 9 \cdot \left( \frac{1}{2} \right)^5 \frac{1}{10} + 1^5 \frac{1}{10} } \\ & = \frac{32}{9 + 32} \\ & = \frac{32}{41} \\ \end{align} $$)

Note that we can quickly self-test and verify $ \sum_{i=1}^{10} P(C_i) = 1 $.

Returning to eqn (2)

$$( \begin{align} P(H_6 | H_{1:5}) & = 9 \cdot P(H_6 | H_{1:5}, C_1) P(C_1 | H_{1:5}) + P(H_6 | H_{1:5}, C_{10}) P(C_{10} | H_{1:6} ) \\ & = 9 \cdot \frac{1}{2} \frac{1}{41} + 1 \cdot \frac{32}{41} \\ & = \frac{73}{82} \end{align} $$)

Alternatively, you can use R to calculate the probability via brute force by repeatedly sampling according to our problem and counting the number of heads observed.

max_itrs <- 5*1e6
used_itrs <- 0
skipped_itrs <- 0
num_H <- 0

for(itr in 1:max_itrs){
  coin <- sample(1:10, 1)

  if( coin <= 9 ){
    if( !all( runif(5) <= 0.5)){
      skipped_itrs <- skipped_itrs + 1
      next
    }
    num_H <- num_H + ifelse( runif(1) <= 0.5, 1, 0)
  } else {
    num_H <- num_H + 1
  }
  used_itrs <- used_itrs + 1
}

num_H / used_itrs

# vs 73/82
num_H / used_itrs - 73/82

my sample run produced

> num_H / used_itrs
[1] 0.8901657
> 
> # vs 73/82
> num_H / used_itrs - 73/82
[1] -7.821522e-05

c++ and const. sigh.

2014-04-10T13:46:55-07:00

well known benefits of constness.

// function:
bool learn_tree(const float** target, unsigned int num_classes);

// and I'm attempting to call with:
float** residuals;

// compiling produces:
error: cannot initialize a parameter of type 'const float **' with an lvalue of type 'float **'

what on earth? It turns out the winner is this beautiful bit of syntax:

bool learn_tree(const float* const* target, unsigned int num_classes);

beautiful.

So for all future googlers, this is how you declare const double arrays or const multidimensional arrays in c++.

splitting files with awk

2014-03-26T17:52:09-07:00

To split files (eg for test / train splits or k-folds) without having to load into R or python, awk will do a fine job.

For example, to crack into 16 equal parts using modulus to assign rows to files:

split files into equal parts with awk

$ cat LARGE_FILE | awk '{print $0 >("part_" int(NR%16))}'

Or to crack a file into a 80/20 test/train split:

create test/train split using awk

awk '{if( NR % 10 <= 1){ print $0 > "data.test.20"} else {print $0 > "data.train.80"}}'

And finally, if your data file has a header that you don’t want to end up in a random file, you can dump the header row into both files, then tell your awk script to append (and use tail to skip the header row)

create test/train split with proper header handling

head -1 LARGE_FILE > data.test.20
head -1 LARGE_FILE > data.train.80
tail -n+2 LARGE_FILE | awk '{if( NR % 10 <= 1){ print $0 >> "data.test.20"} else {print $0 >> "data.train.80"}}'

Wordpress to Octopress Migration

2014-03-26T17:23:53-07:00

As mentioned, I’m moving my blog to octopress. I got tired of php, php-cgi, wordpress, security holes, constant updates that broke random plugins, 5 second page loads, fragile caching plugins, and all the various nonsense that wordpress brings to the table.

An aside: php-cgi is so fragile and crashes so often I ran a screen session as root that just attempted to restart it every 5 seconds (attached below for any poor souls stuck using this tech.)

# run as root inside screen
for (( ; ; )); do /usr/bin/spawn-fcgi -u nginx -g nginx -f /usr/bin/php-cgi -a 127.0.0.1 -p 53217 -P /var/run/fastcgi-php.pid; sleep 5; done

For googlers who want to move from wordpress to octopress, here’s how I moved 70-odd posts with minimal pain.

1 – Get thomasf’s excellent python script (accurately named exitwp) that converts wordpress posts to octopress posts. This will create one octopress post per wordpress post in the source directory.

2 – I simultaneously moved urls from blog.earlh.com to earlh.com/blog so I needed to 301 all the old posts. I did that by getting this awesome wordpress post exporter script contributed by Mike Schinkel. I curled that to create a list of urls to forward, then built a tsv of pairs of old url\tnewurl. Then the below awk script will print nginx forward rules:

cat posts.tsv | awk -F"\t" '{print "\tlocation " $2 "{\n\t\treturn 301 " $1 ";\n\t\tbreak;\t}"'} | sed "s/http:\/\/blog.earlh.com//"

The rules look like:

location /index.php/2009/06/cleaning-data-in-r-csv-files {
        return 301 http://earlh.com/blog/2009/06/29/cleaning-data-in-r-csv-files/;
        break;
}

Add them to your site nginx.conf file inside the server configuration block.

I’ll update with solutions for better image embedding.

c++ is horrific

2014-03-24T12:21:00-07:00

I’m poking at some c++ after not touching it for a decade. c++11 has apparently gotten roughly as capable as java pre 2000; it now can create threads! But the error messages. Oh, the error messages

$ clang++ -std=c++11 -stdlib=libc++ src/test.cpp
In file included from src/test.cpp:4:
In file included from /usr/bin/../lib/c++/v1/thread:93:
In file included from /usr/bin/../lib/c++/v1/functional:465:
/usr/bin/../lib/c++/v1/memory:1677:31: error: calling a private constructor of class 'std::__1::thread'
            ::new((void*)__p) _Up(_VSTD::forward<_Args>(__args)...);
                              ^
/usr/bin/../lib/c++/v1/memory:1604:18: note: in instantiation of function template specialization
      'std::__1::allocator::construct' requested here
            {__a.construct(__p, _VSTD::forward<_Args>(__args)...);}
                 ^
/usr/bin/../lib/c++/v1/memory:1488:14: note: in instantiation of function template specialization
      'std::__1::allocator_traits >::__construct' requested here
            {__construct(__has_construct(),
             ^
/usr/bin/../lib/c++/v1/vector:1471:25: note: in instantiation of function template specialization
      'std::__1::allocator_traits >::construct' requested here
        __alloc_traits::construct(this->__alloc(),
                        ^
src/test.cpp:36:13: note: in instantiation of member function 'std::__1::vector >::push_back'
      requested here
    threads.push_back(t);
            ^
/usr/bin/../lib/c++/v1/thread:261:5: note: implicitly declared private here
    thread(const thread&);
    ^
1 error generated.

what was the cause of this monstrosity?

std::vector< std::thread > threads;
// [...]
std::thread t(thread_func, i);
threads.push_back(t);

so yeah, you can’t copy thread objects, enforced by having a private constructor. Still, the amount of knowledge it takes to translate from the error message to the error is pretty amazing.

replacing sort | uniq

2014-03-21T12:59:00-07:00

A code snippet: when poking at columnar data in the shell, you’ll often find yourself asking questions like what are the unique values of a particular column, or the unique values and their counts. R would accomplish this via unique or table, but if your data is largish it may be quite annoying to load into R. I often use bash to quickly pick out a column, ala

pick out a column

$ cat data.csv | awk -F, '{print $8}' | sort | uniq -c | sort -r -n

In order: bash cats my data, tells awk to print just column 8 using , as the separator field, sorts all the data so that I can use uniq, asks uniq to print the counts and then the unique strings, then sorts by the counts descending (-n interprets as a number and -r sorts descending). The obvious inefficiency here is if your data is a couple of gb, you have to sort in order for uniq to work. Instead, you can add the script below to your path and replace the above with:

pick out a column

$ cat data.csv | awk -F, '{print $8'} | count

not only is this a lot less typing, but it will be significantly faster since you don’t have to hold all the data in ram and sort it.

replace sort | uniq

#!/usr/bin/ruby

# replaces sort | uniq -c
cnts={}
$stdin.each{ |i|
  cnts[i] = 1 + (cnts[i] || 0)
}

cnts.sort_by{ |word, cnt| -1*cnt }.
    each{ |word, cnt| puts "#{cnt}\t#{word}" }

Hiring Software Engineers

2013-06-30T20:58:00-07:00

I perpetually see employers, on hacker news and elsewhere, complaining about difficulty hiring. I haven’t had such issues, so a (perhaps bold) guide to hiring software engineers:

are you paying market salaries?
1. are you really paying market salaries, or are employees supposed to join your company because you’re a special snowflake?
2. even if you are paying market, why should an employee go to your firm? What is the upside to them for leaving a boss and company that they know? Because it would be really convenient for you, the hirer, is not a good answer.
do you make the interviewing process decent, or do you scatter caltrops in front of potential employees?
1. good employees do not need to crash study then regurgitate graph algorithms that your company never uses on the whiteboard. They also have jobs and value their vacation time and don’t care to spend a week consulting for you.
2. how long does it take you to respond to resumes that come in? You should be able to say yes/no/maybe within 2 business days. Do your recruiters / interviewers actually read the cover letters / resumes? Last time I changed job a big sf / yc startup let my first interviewer roll into the interview room just shy of 20 minutes late without having read my resume. That’s a complete dick move, and it’s part of why I turned them down.
3. when potential employees send you github links, do you have an engineer actually bloody look at them (almost never in my experience)?
4. do you actually expend effort to meet potential employees and grow a bunch of warm leads, or do you wait until 3 weeks before you want someone to start then gripe because you can’t convert cold leads in 1 week plus a 2 week resignation period for their current employer?
5. do you use shit software like that jobvite bullshit that badly ocrs then expects me to hand proof their shitty ocr job, or do you directly accept pdf resumes?
6. for the love of god, I do not have a copy of ms word and wouldn’t take one if it were free. I will not put my resume into word format.
do you take some pains to grow employees?
1. do you hire people out of university? Take a chance on people?
2. Like one of my former employers, do you do a good job hiring new grads from schools besides stanford / berkeley / mit / cmu, but then 18 months in after employees have demonstrated their value refuse to bring them up to market rates and lose them?
when you send out offers, do you actually put out a good offer or do you throw numbers out that are 10% or more under your ceiling then expect employees to negotiate hard with you? I just turned down an sf startup because they did this; the ceo who successfully hired me said, “Earl: when I was an employee I hated negotiating, so I’m going to make you a great offer. This also means I’m not going to negotiate.” And you know what? It was a great offer, and I said yes the next morning. It also avoids starting your new job after a confrontational exercise.
if you have recruiters contacting people, do you have them make clear they’re internal not external?
1. do your recruiters actually read peoples’ linkedin profiles before contacting them? I used rails a bit 3 employers ago and had to remove that word from my profile because I got spammed with rails stuff.
do your job postings on linkedin, craigslist, message boards, and your website tell potential employees why he or she should work for you, or like the vast majority, is it simply a long list of desiderata?
just like the easiest sale is an upsell to a customer you have, the easiest recruit is the good employee you already have that you keep happy and prevent from leaving
1. like Rand says, do you know off the top of your head the career goals of your employees? What are you doing to help them get there?
2. do you give your employees raises to keep them at or above market, or can they get a $20k raise by swapping companies? If that raise is on offer, exactly why should they continue to work for you?
3. on that note… 0.2% of an A round company isn’t golden handcuffs. It’s more like paper handcuffs.

moving to octopress

2013-06-17T22:25:00-07:00

I finally got tired of wordpress, and I’m trying out octopress. If you have a blog, I think you would probably also be happier using octopress.

Reasons to switch:

wordpress doesn’t just work; it takes endless supervision. This is complicated by the seeming inability to get only security related updates, so I always feel forced to stay on the version treadmill for security reasons. This frequently breaks plugins.
wordpress treats security as an afterthought. For instance, the first thing you may think of is locking down the wp-admin directory to your home ip, but last time I tried this breaks the site.
php is security-hole ridden junk. This seems to have improved over the years but I’m still a little uneasy about using it, whereas only serving static html should be very secure and only require a single serving program making it much easier to keep up with security.
serving php with nginx is fragile; there’s multiple ways to set it up and none of them seem to work particularly reliably
there’s always 10 plugins to do any given task, none of which fully work or fully integrate with wordpress. I tried to get markdown working with wordpress last weekend and it was a nightmare
wordpress is slow, and caching plugins are brittle; serving static html from octopress should be lightning fast.
octopress comes with a bunch of nice features like syntax highlighting that doesn’t require loading 15 different javascript files ala the syntax highlighter I’m using

Reasons I want to switch:

I really like the idea of serving a static site and deploying with rsync
I like using git to version my site and vim to write posts

I will miss comments, but I hope people will email instead. That said, of the nearly 20,000 comments my site has received I believe fewer than thirty weren’t spam. In fact, wordpress has a whole cottage industry selling a comment spam control tool called Akismet created to fix how easy wordpress makes comment spam.

Equifax are Scum Who Sell Your Email Address to Scammers

2013-05-20T10:53:11-07:00

There’s only one company that (should) have ever seen the highlighted email address. It’s also not a common word that you would find in a dictionary attack.

Useful tweaks for Hadoop on EMR

2013-05-07T03:16:56-07:00

more ram for the workers: modify mapred-site.xml and add

mapred.child.java.opts-Xmx3192m

To push the changes to all the machines, use the script to modify mapper or reducer count on a running emr cluster.

Modifying the Number of Mappers or Reducers on a Running EMR Cluster

2013-05-02T23:58:59-07:00

Amazon emr unfortunately doesn’t give you an easy way to change the number of mappers and reducers on a running cluster. To do so before booting the cluster, add

--bootstrap-action="s3://elasticmapreduce/bootstrap-actions/configure-hadoop"  \
   --args "-m,mapred.tasktracker.map.tasks.maximum=4,-m,mapred.tasktracker.reduce.tasks.maximum=2"

as appropriate to the elastic-mapreduce.rb command.

For a running emr cluster, you can use the following scripts. Navigate to the conf directory; it will be in a path similar to /home/hadoop/.versions/1.0.3/conf

Edit mapred-site.xml and replace either or both of

mapred.tasktracker.map.tasks.maximum

mapred.tasktracker.reduce.tasks.maximum

Then copy and paste these commands:

$ # distribute the file to all nodes
hadoop job -list-active-trackers | sed "s/^.*_//" | sed "s/:.*//" | xargs -t -I{} -P10 scp -o StrictHostKeyChecking=no  mapred-site.xml hadoop@{}:.versions/1.0.3/conf/
$
$ # bounce the tasktrackers on each node
hadoop job -list-active-trackers | sed "s/^.*_//" | sed "s/:.*//" | xargs -t -I{} -P10 ssh -o StrictHostKeyChecking=no hadoop@{}   sudo /etc/init.d/hadoop-tasktracker stop
$
$ # restart the jobtracker on the headnode
sudo /etc/init.d/hadoop-jobtracker stop

One way to verify this worked is on the jobtracker web page.

Building lush on OSX Lion

2012-12-05T18:03:58-08:00

If building lush2 errors out with this compilation error

g++ -L/opt/local/lib -DHAVE_CONFIG_H -DNO_DEBUG -Wall -O3 -mmmx -msse -I../include  -I/opt/local/include -I/opt/local/include/freetype2  -o lush2 at.o binary.o cref.o calls.o arith.o check_func.o date.o dh.o dump.o eval.o fileio.o fltlib.o fpu.o function.o event.o graphics.o htable.o idx1.o idx2.o idx3.o idx4.o index.o io.o list.o main.o math.o misc.o cmm.o module.o number.o oostruct.o regex.o storage.o string.o symbol.o toplevel.o user.o weakref.o ps_driver.o rng.o lisp_driver.o x11_driver.o unix.o   cpp.o -L/opt/local/lib -lXft -lSM -lICE -lX11 -liconv -lreadline -lcurses -lutil -ldl -lm
Undefined symbols for architecture x86_64:
  "_FcNameParse", referenced from:
      _getfont in x11_driver.o
  "_FcPatternDestroy", referenced from:
      _getfont in x11_driver.o
  "_FcPatternGet", referenced from:
      _getfont in x11_driver.o
  "_FcPatternDel", referenced from:
      _getfont in x11_driver.o
  "_FcPatternAdd", referenced from:
      _getfont in x11_driver.o
  "_FcNameUnparse", referenced from:
      _getfont in x11_driver.o
ld: symbol(s) not found for architecture x86_64
collect2: ld returned 1 exit status
make[1]: *** [lush2] Error 1
make: *** [all] Error 2

Jason Aten was kind enough to fix this for Snow Leopard and later, as detailed in the lush mailing list archive. Grab Jason’s lush2 git repo from github.