Stochastic Nonsense

Put something smart here.

Replacing Sort | Uniq

A code snippet: when poking at columnar data in the shell, you’ll often find yourself asking questions like: what are the unique values of a particular column, or what are the unique values and their counts? R would accomplish this via unique or table, but if your data is largish it may be quite annoying to load into R. I often use bash to quickly pick out a column, a la:

pick out a column
$ cat data.csv | awk -F, '{print $8}' | sort | uniq -c | sort -r -n

In order: bash cats my data, tells awk to print just column 8 using , as the field separator, sorts all the rows so that uniq works (uniq only collapses adjacent duplicates), asks uniq to prepend counts to the unique strings, then sorts by count descending (-n compares numerically and -r reverses the order). The obvious inefficiency here is that if your data is a couple of GB, you have to sort the entire file just so uniq can do its job. Instead, you can add the script below to your path and replace the above with:

pick out a column
$ cat data.csv | awk -F, '{print $8}' | count

Not only is this a lot less typing, but it can be significantly faster: you only hold the unique values and their counts in RAM, rather than sorting the entire data set.
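To see what the classic pipeline produces, here’s a tiny worked example; the sample rows are made up purely for illustration:

```shell
# three made-up rows; we pull column 2 and count its values
$ printf 'x,a\nx,b\nx,a\n' | awk -F, '{print $2}' | sort | uniq -c | sort -r -n
      2 a
      1 b
```

(The exact leading whitespace in uniq’s output varies by implementation.)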

replace sort | uniq
#!/usr/bin/ruby

# replaces sort | uniq -c | sort -r -n: count occurrences of each
# line on stdin, then print count and line, most frequent first.
# chomp so a final line without a trailing newline still matches.
cnts = Hash.new(0)
$stdin.each { |line| cnts[line.chomp] += 1 }

cnts.sort_by { |word, cnt| -cnt }.
     each    { |word, cnt| puts "#{cnt}\t#{word}" }
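The same counting idiom can be exercised over an in-memory array instead of $stdin; the sample values here are made up for illustration:

```ruby
# count occurrences with a default-0 hash, then sort by count descending
cnts = Hash.new(0)
%w[a b a a c].each { |v| cnts[v] += 1 }

# "a" (count 3) prints first; the tied count-1 values follow in
# unspecified relative order
cnts.sort_by { |word, cnt| -cnt }.
     each    { |word, cnt| puts "#{cnt}\t#{word}" }
```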