Stochastic Nonsense

Put something smart here.

Unique Is Broken in R

Are you kidding me?

1
2
3
$ R
> unique(1,1,2,3,4)
[1] 1

This was the source of yesterday’s nasty to track down bug. What you really want is unique on a vector, as in:

1
2
> unique(c(1,1,2,3,4))
[1] 1 2 3 4

I can’t believe someone decided to let this silently fail in the manner most likely to screw the user. Note that other functions such as sum and max behave as expected:

1
2
3
4
> max(1,2,3,4) == max(c(1,2,3,4))
[1] TRUE
> sum(1,2,3,4) == sum(c(1,2,3,4))
[1] TRUE

Formatted Numbers in Ruby

In C or C++, it’s can be a pain to get thousands separators in printf. In ruby, it can be trivial, as long as you use the right libraries. If you have ActiveSupport installed (which I believe comes with Rails), you’re all set. Note that you don’t have to be using Rails; this will work in a plain ruby script.

1
2
3
4
5
6
7
8
9
$ irb
irb(main):001:0> require 'action_view'
=> true
irb(main):002:0> include ActionView::Helpers::NumberHelper
=> Object
irb(main):003:0> number_with_delimiter(123456)
=> "123,456"
irb(main):004:0> number_to_human(123456)
=> "123 Thousand"

number_with_delimiter is great, and number_to_human is a nice bonus.

For the record, software versions and the install command:

1
2
3
4
5
$ ruby --version
ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-darwin10.7.4]
$ gem --version
1.3.7
$ gem install actionpack

Finding the Sort Order of an Array in R or Ruby

Suppose you have an array that you’d like to sort by another array. A common use case might be a set of arrays of somethings and for each something you generate a score in say [0,1]. Now you’d like to sort your somethings by their scores.

Concretely, say you have an array of scores:

1
2
scores: [0.3347867, 0.9069004, 0.4391635, 0.8376249, 0.7133011]
indices: [0, 1, 2, 3, 4]

and you want the indices of the sorted scores, ie

1
2
scores_sorted: [0.3347867, 0.4391635, 0.7133011, 0.8376249, 0.9069004]
indices_sorted: [0, 2, 4, 3, 1]

in R, you can always use order, as in

1
2
3
4
5
6
7
8
9
10
11
12
13
14
$ R
> scores <- runif(5)
> perm <- order(scores)
> data.frame(score=scores, order=perm)
     score order
1 0.3347867     1
2 0.9069004     3
3 0.4391635     5
4 0.8376249     4
5 0.7133011     2
>
> # and just to check
> scores[ perm ]
[1] 0.3347867 0.4391635 0.7133011 0.8376249 0.9069004

You can do something similar in ruby:

1
2
3
4
5
6
7
8
9
10
11
12
13
irb > scores = [0.3347867, 0.9069004, 0.4391635, 0.8376249, 0.7133011]
=> [0.3347867, 0.9069004, 0.4391635, 0.8376249, 0.7133011]
irb > scores.zip( (1..scores.length).to_a )
=> [[0.3347867, 1], [0.9069004, 2], [0.4391635, 3], [0.8376249, 4], [0.7133011, 5]]
irb >
irb > scores.zip( (1..scores.length).to_a ).sort_by{ |e| e.first }
=> [[0.3347867, 1], [0.4391635, 3], [0.7133011, 5], [0.8376249, 4], [0.9069004, 2]]
irb >
irb > perm = scores.zip( (1..scores.length).to_a ).sort_by{ |e| e.first }.map{ |e| e[1] - 1 }
=> [0, 2, 4, 3, 1]
irb > 
irb > scores.values_at(*perm)
=> [0.3347867, 0.4391635, 0.7133011, 0.8376249, 0.9069004]

And finally in C++, you can leverage qsort_r; this function was designed to be a reentrant / threadsafe qsort so you’re given a void* to pass a block of memory into your comparison function. You can use this to sort the indices array by the scores:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
#include<stdlib.h>

// [...]
// utility fns that join an array of u (unsigned int) or f (double) into a string
char* vsprintf_u(char* buff, unsigned int* array, unsigned int len){
  char* orig = buff; buff += sprintf(buff, "["); for (int i=0; i < len; i++) buff += sprintf(buff, "%3u, ", array[i]); sprintf(buff, "]"); 
  return orig;
}
char* vsprintf_f(char* buff, double* array, unsigned int len){
  char* orig = buff;
  buff += sprintf(buff, "["); for (int i=0; i < len; i++) buff += sprintf(buff, "%1.4f, ", array[i]); sprintf(buff, "]");
  return orig;
}

/**
 * qsort_r comparison fn: sort array indices by scores
 */
int score_comparator(void* scoresv, const void* leftv, const void* rightv){
  unsigned int* left = (unsigned int*)leftv;
  unsigned int* right = (unsigned int*)rightv;
  double* scores = (double*)scoresv;
  
  if (scores[ *left ] < scores[ *right ])
      return -1;
  else if (scores[ *left ] == scores[ *right ])
      return 0;
  return 1;
}

// [...]

printf("test code:\n");
double scores[5] = {0.3347867, 0.9069004, 0.4391635, 0.8376249, 0.7133011};
unsigned int perm[] = {0, 1, 2, 3, 4};
char buff[4192];
  
printf("presort:  %s\n", vsprintf_f(buff, scores, 5));
printf("presort:  %s\n", vsprintf_u(buff, perm, 5));
  
qsort_r(perm, 5, sizeof(unsigned int), scores, &score;_comparator);
  
printf("postsort: %s\n", vsprintf_u(buff, perm, 5));
printf("postsort: [");
for (unsigned int i=0; i < 5u; i++)
  printf("%1.4f, ", scores[ perm[ i ] ]);
printf("]\n");

which produces when run

1
2
3
4
5
$ ./a.out
presort:  [0.3348, 0.9069, 0.4392, 0.8376, 0.7133, ]
presort:  [  0,   1,   2,   3,   4, ]
postsort: [  0,   2,   4,   3,   1, ]
postsort: [0.3348, 0.4392, 0.7133, 0.8376, 0.9069, ]

Note the canonical way to sort somethings by a float in c++ is to bang everything into a struct or class and leverage qsort on the structs/classes directly. However, this is often pretty inconvenient, and if you have a lot of whatever you want to sort, it’s too memory intensive to put everything into structs/classes with the sole addition of your score field.

I think it’s obvious why I prefer to program in R.

NB: I am developing for OS X; if you are targeting linux you’ll have to figure out how to link qsort_r yourself. I think someone also decided to permute the argument order. Sigh.

Getting the Value of a Variable From a String in R

It’s often convenient to use reflection to get the value of a variable from the name as a string. In R, you can use the get function to do this.

In R :

1
2
3
4
5
blog $ R
> x = 3
> get('x')
[1] 3
>

In Ruby:

1
2
3
4
5
$ irb
irb(main):001:0> x = 3
=> 3
irb(main):002:0> eval 'x'
=> 3

though Ruby’s eval is more general, and is equivalent to eval in R allowing you to evaluate arbitrary code in a string.

Windows Still Sucks, Can’t Read Exfat Formatted on a Mac

Now that OS X Snow Leopard supports exfat / fat64 and Microsoft Windows theoretically does, you might naively assume that you can share external drives between Vista and Snow Leopard. This is true with one giant caveat: you can’t format the drive on your Mac. You have to format it first on your windows box due to some unknown issue from the morons from Redmond. Posted to hopefully save you the two hours I wasted debugging this. Of course, if you say, already have 300+GB of data on your external drive that you originally formatted on a mac, this knowledge isn’t the most helpful thing ever. Sigh.

Prepping the Reuters 21578 Classification Sample Dataset

I’ve been playing around with some topic models and decided to look at the Reuters 21578 dataset. For your convenience, this dataset is stored as xml split between 20 files or so. And invalid xml at that. I prefer to work with flat text files, so this bit of ruby turns the xml into a single file, slightly cleaned of trailing periods and commas, with one line per input document. Hopefully this will save someone some time, whereas I got to spend an hour or more trying to figure out how to strip invalid UTF-8. This works in ruby 1.9 with the obvious gems installed and the reuters dataset as of today.

Updated May 2015: For those of you using ruby 1.9+ or having trouble installing iconv, you can use this instead:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
# dump the reuters dataset from http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
# into a single file, one line per article

require 'hpricot'

# config 
directory = 'reuters21578'
do_downcase = true
strip_reuter = true               # remove the reuter if it's the last elem of the array
strip_all_reuter = false      # remove all words that match reuter
strip_trailing_comma = true
strip_trailing_period = true

output = File.new('all.txt', 'w')

Dir.entries(directory).select{ |i| i=~ /sgm$/}.each do |filename|
  file = File.new("#{ directory }/#{ filename }", 'r').read
  xml = Hpricot(file)
  articles = xml.search('/REUTERS/*/BODY')

  puts "reading #{filename} : #{ articles.length}"

  articles.each{ |article|
    a = article.innerHTML
    # strip some bad unicode in reut2-017.sgm
    a.encode!("UTF-8", "binary", :invalid => :replace, :undef => :replace, :replace => "?")
    a = a.split(/\s/).map{ |i| i.chomp }.select{ |i| nil == (i =~ /^&/) }

    a.map!{ |i| i.downcase } if do_downcase
    a = a[0..-2] if strip_reuter and a.last =~ /reuter/i
    a.select!{ |i| nil == (i =~ /reuter/ ) } if strip_all_reuter


    a.map!{ |i| i.sub( /,$/, '') } if strip_trailing_comma
    a.map!{ |i| i.sub( /\.$/, '') } if strip_trailing_period
    a = a.select{ |i| i != '' }.join(' ')
    output.puts(a)
  }
end

output.close

Original:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
# dump the reuters dataset from http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
# into a single file, one line per article

require 'hpricot'
require 'iconv'

# config 
directory = 'reuters21578'
do_downcase = true
strip_reuter = true               # remove the reuter if it's the last elem of the array
strip_all_reuter = false      # remove all words that match reuter
strip_trailing_comma = true
strip_trailing_period = true

output = File.new('all.txt', 'w')
iconv = Iconv.new('UTF-8//IGNORE', 'UTF-8') # used to turn invalid utf-8 into valid

Dir.entries(directory).select{ |i| i=~ /sgm$/}.each do |filename|
    file = File.new("#{ directory }/#{ filename }", 'r').read
    xml = Hpricot(file)
    articles = xml.search('/REUTERS/*/BODY')
    
    puts "reading #{filename} : #{ articles.length}"

    articles.each{ |article|
            a = iconv.iconv(article.innerHTML)
            a = a.split(/\s/).map{ |i| i.chomp }.select{ |i| nil == (i =~ /^&/) }
            
            a.map!{ |i| i.downcase } if do_downcase
            a = a[0..-2] if strip_reuter and a.last =~ /reuter/i
            a.select!{ |i| nil == (i =~ /reuter/ ) } if strip_all_reuter

            
            a.map!{ |i| i.sub( /,$/, '') } if strip_trailing_comma
            a.map!{ |i| i.sub( /\.$/, '') } if strip_trailing_period
            a = a.select{ |i| i != '' }.join(' ')
            output.puts(a)
    }
end

output.close

Thousands Separator in Printf in C++

I’ve unfortunately been writing some C++. It’s the crappiest language in the world. I just wasted 90 perfectly good minutes attempting to put thousands separators in numbers that I’m printf ing. If you naively read the man pages, it looks easy — just include ' in your format specifier. Unfortunately, the hidden requirement is that you call this earlier in your code:

1
2
3
4
5
6
7
#include<locale.h>
[...]
setlocale(LC_ALL, "");

// then
unsigned int x = 12345u;
printf("%'8u\n", x);  // -> 12,345 as desired

Edit: I was developing under OS X with XCode 3.2.6.

1
2
3
4
$ sw_vers
ProductName:  Mac OS X
ProductVersion:   10.6.7
BuildVersion: 10J4138

I also verified this works under linux, or at least centos. The man page for printf doesn’t even mention apostrophe as an option, but it still works as long as you call setlocale.

1
2
3
4
$ uname -r
2.6.18-164.6.1.el5.centos.plus
$ cat /proc/version 
Linux version 2.6.18-164.6.1.el5.centos.plus (mockbuild@builder10.centos.org) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Wed Nov 4 09:31:39 EST 2009

Watching Lecture Videos on Your Computer

I’ve recently been watching some of the lecture videos available on videolectures.net. The site is a great resource, but often the lecturers speak too slowly. I really prefer to watch lecture videos at a higher speed, otherwise I lose focus, try to multitask and browse the internet in the background, and end up retaining neither the lecture material nor the browsing.

Instead, you can use VLC or MPlayer, both available for OS X. In VLC, you can use

 apple key plus +/-

to speed or slow down video playback. You may have to go to preferences –> audio –> all –> enable time stretching audio so that the audio is resampled to preserve pitch so the speakers don’t sound like chipmunks.

In MPlayer,

 apple key + [/]

speeds or slows video playback.

Unfortunately, both of these techniques require downloading the video, and some sites, like video lectures, don’t directly allow you to do so. Fortunately, both VLC and MPlayer allow you to do so, though with some caveats, and both are a bit finicky. For example, both will die if your internet connection goes out and force you to restart download.

Nonetheless, in VLC, you can use the following shell script (a tip I found somewhere I can’t remember on the internet):

$ cat getV.sh
/Applications/VLC.app/Contents/MacOS/VLC -I rc $1 --sout $2 vlc://quit;
$
$ /get.sh "mms://[snip] " gradient01.wmv

whence you will see a bunch of spew ala

[0x10472e368] [rc] lua interface: Listening on host "*console".
VLC media player 1.1.9 The Luggage
Remote control interface initialized. Type `help' for help.
> [0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: no data received
[0x10473ca88] access_mms access error: cannot connect to server
[0x10473ca88] access_mms access error: cannot read data 2
[0x10473c7c8] mux_avi mux: stream[0] duration:3520 totalsize:316578836 frames:88018 fps:25.000000 KiB/s:702
[0x10473c7c8] mux_avi mux: stream[1] duration:3520 totalsize:28180137 frames:18951 fps:5.383371 KiB/s:62
[0x105f1ce18] dummy demux: command `quit'

and in mplayer

$ cat getM.sh
  mplayer -dumpstream -dumpfile $2 "$1"

eg

$ ./getM.sh "[snip mms url here]" test.wmv

Finally, in order to use these to download lectures from videolectures, go to the page of a lecture you wish to watch, view source, and find the link that starts with “mms://”. I’ve had better luck using VLC than mplayer.

Horizontal Paging of Greenplum or Postgres Queries

When using gpsql or pgsql to query greenplum or postgres respectively, query results which exceed the width of your term will wrap in a very annoying fashion. To get horizontal paging, set the environmental variable PAGER:

export PAGER='less -RSFX'

then either in your psql or gpsql session, or in your .psqlrc file,

\pset pager always

Note that if you just want the pager for long output, not wide, you can omit the always from the pset command.

Interactive Plotting in R

There are many ways to compare univariate distributions; one of my favorites is violin plots. However, if you are only comparing two distributions, then the best solution is often a scatter plot. To that end, I’ve build some code that creates an interactive scatter plot of two distributions and allows you to interactively print arbitrary strings on the graph when you select / deselect points. This creates a slightly kludgy but very handy tool for hand comparing distributions.

Unfortunately, truly interactive plotting isn’t really a part of R and you are thus forced to lean on external tools. I picked JGR, the java gui for R. This is best used by getting the JGR launch tool.

Basically, I have data with multiple tests; a single line shows the results for one item across several tests. I wish to compare the distributions.

1
2
3
4
5
6
7
8
> head(age)
name    default      test1      test2
1 item 1 0.02110710 0.01900870 0.02030870
2 item 2 0.03160770 0.02926650 0.03345660
3 item 3 0.03909570 0.03702500 0.04016650
4 item 4 0.00262195 0.00225917 0.00302822
5 item 5 0.01668860 0.01555010 0.01783400
6 item 6 0.04223370 0.03904630 0.04123270

test data

You can use this function to throw up a window, and allow you to draw a box around items to see their information displayed in the upper left.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
library('iplots')

visCompare <- function(dat, xname, yname){

  # override this to display your preferred text
  makeDispString <- function(row){
    sprintf('%s : %s = %0.3f; %s = %0.3f; diff = %0.3f', row$name,
      xname, row[[xname]], yname, row[[yname]], row[[xname]] - row[[yname]])
  }

  ypoint <- 0.05 + max(dat[[yname]])

  iplot(x=dat[[xname]], y=dat[[yname]], xlab=xname, ylab=yname,
    ylim=c(0, ypoint + 0.05), xlim=c(0, max(dat[[xname]])), lwd=2)

  iabline(coef=c(0,1))
  d <- iplot.data()
  cat('Select break from the menu to exit loop')

  txtObj <- NULL

  while (!is.null(ievent.wait())){
    if (iset.sel.changed()){
      cat("sel changed\n")
      s <- iset.selected()

      if (length(s) >= 1){
        if (!is.null(txtObj) ){
          iobj.rm( txtObj )
        }

        aa <- paste( makeDispString(dat[s[1:min(3, length(s))],]), collapse="\n")
        cat(paste(aa, "\n"))
        txtObj <- itext(x=0, y=ypoint, labels=aa)
      }

    } else {
      if ( !is.null(txtObj)){

        cat(paste('removing ', txtObj, "\n"))
        iobj.rm( txtObj )
        txtObj <- NULL
      }
    }

  }
}

To test, you can use these two bits of code:

1
2
3
4
5
6
7
8
9
if (F){
    read.csv(file='iplot.test.csv.txt', header=T, sep=',')
    visCompare(age, 'default', 'test2')
}
if (F){
    read.csv(file='iplot.test.csv.txt', header=T, sep=',')
    myDispFn <- function(a){ return(paste(a$name, 'blah blah', sep=' : ') }
    visCompare(age, 'default', 'test2', myDispFn)
}

And here are the results: first, a visual check via scatterplot of the differences of the two distributions:

and with the ability to highlight points and see what you’re looking at: