A code snippet: when poking at columnar data in the shell, you'll often find yourself asking questions like: what are the unique values of a particular column, or what are the unique values and their counts? R would accomplish this via `unique` or `table`, but if your data is largish it may be quite annoying to load into R. I often use bash to quickly pick out a column from, say, a data.csv, a la:
```bash
cat data.csv | awk -F, '{print $8}' | sort | uniq -c | sort -nr
```
In order: bash `cat`s my data, tells `awk` to print just column 8 using `,` as the field separator, `sort`s all the data so that `uniq` can collapse adjacent duplicates, asks `uniq` to print each count followed by the unique string, then `sort`s by the counts, descending (`-n` compares numerically and `-r` reverses the order).
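On a hypothetical column of HTTP status codes, the output would look something like this (values and counts invented for illustration):

```
   9041 200
    312 404
     87 500
```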
The obvious inefficiency here is that if your data is a couple of GB, you have to sort the whole thing just so `uniq` can work. Instead, you can add the script below to your path (I'll call it `count` here) and replace the above with:
```bash
cat data.csv | awk -F, '{print $8}' | count
```
Not only is this a lot less typing, but it will be significantly faster: there's no sort over the full data set, and the script only has to keep the distinct values in memory rather than every row.
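A minimal sketch of such a script in awk (a single-pass tally; memory scales with the number of distinct values, not the number of rows):

```awk
#!/usr/bin/awk -f
# count: tally identical input lines in a single pass.
# No sort needed up front; the hash keeps one entry per distinct value.
{ counts[$0]++ }

END {
    # Output order is arbitrary; pipe through `sort -nr`
    # if you want the biggest counts first.
    for (value in counts)
        print counts[value], value
}
```

Save it as `count` somewhere on your `PATH` and `chmod +x` it. The same trick works inline too, if you'd rather not keep a script around: `awk -F, '{c[$8]++} END {for (v in c) print c[v], v}' data.csv`.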