Picking Subsets of CSV/TSV Files With Awk

Say you have a csv or tsv file, and you want to select only the rows where a particular column is not zero. Start with a csv like this:

```
earl$ head ttt
104834, 0, 206, 104578, false
104837, 4, 206, 103566, false
104854, 0, 193, 101063, false
104856, 0, 195, 101851, false
8469683, 0, 149, 50191, false
121867, 4, 207, 107816, false
2252883, 0, 149, 49456, false
13234551, 4, 252, 119802, false
104841, 0, 209, 104289, false
8469506, 0, 149, 50183, false
```
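
The post title mentions tsv as well; the only thing that changes for a tab-separated file is the field separator passed to -F. As a quick sketch (ttt.tsv here is a hypothetical tab-delimited version of the same data):

```
# peek at column 2 of a tab-delimited file (ttt.tsv is hypothetical)
awk -F'\t' '{print $2}' ttt.tsv | head
```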

I’m interested in the values of column 2, which I can examine as follows. The -F option tells awk to use a comma followed by a space as the field separator instead of the default whitespace. I then sort the data and ask uniq to output the unique values and their counts:

```
naughtywords earl$ awk -F", " '{print $2}' ttt | sort | uniq -c
20264 0
  274 22
    2 26
 2378 4
   21 8
```
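
As a side note, the sort | uniq -c step can also be done inside awk itself with an associative array. This is just an equivalent alternative, not needed for anything below; the output order is whatever awk's hash iteration happens to give you:

```
# count occurrences of column 2 entirely within awk
awk -F", " '{counts[$2]++} END {for (k in counts) print counts[k], k}' ttt
```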

Now, say I want only the lines in this csv where column 2 is not zero: I can use an if statement in the awk command. Here $0 refers to the entire line, so print $0 prints the line. I pipe to head so the whole file doesn’t spew down my terminal:

```
naughtywords earl$ awk -F", " '{if($2 != 0){print $0}}' ttt | head
104837, 4, 206, 103566, false
121867, 4, 207, 107816, false
13234551, 4, 252, 119802, false
11251559, 4, 186, 107663, false
112913, 4, 190, 98530, false
222067, 4, 166, 94450, false
222083, 4, 191, 100498, false
8385060, 4, 175, 95544, false
16684109, 4, 236, 115062, false
102869, 4, 164, 92748, false
naughtywords earl$
```
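
Incidentally, the if isn’t strictly required: an awk rule is a pattern followed by an action, and a pattern with no action defaults to printing the whole line. So this shorter form should give the same result:

```
# pattern-only form: print every line whose second field is nonzero
awk -F", " '$2 != 0' ttt | head
```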

Finally, let’s check that we see the same counts as above:

```
naughtywords earl$ awk -F", " '{if($2 != 0){print $0}}' ttt | awk -F", " '{print $2}' | sort | uniq -c
 274 22
   2 26
2378 4
  21 8
```
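
To actually keep the subset around, redirect to a file instead of piping to head, and the check above can be collapsed into a single awk call by printing column 2 inside the filter. A sketch (the output file name ttt_nonzero is just an example):

```
# save the filtered rows (file name is arbitrary)
awk -F", " '$2 != 0' ttt > ttt_nonzero

# same count check as above, with one awk instead of two
awk -F", " '$2 != 0 {print $2}' ttt | sort | uniq -c
```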