Examining CSV Data Columns From a Shell

It’s very handy to be able to pop open a shell and peek inside your CSV files. awk is the right tool for the job: it splits each line into fields, on whitespace by default or on a separator you specify with -F. Here’s a one-liner that prints the 8th column of a comma-separated file, sorts it, and prints the unique values. It runs in a couple of seconds on a 10MM-row CSV file.

awk -F, '{print $8}' aad.uploadnewb.csv | sort | uniq

2009-04-22
2009-04-26
2009-05-01
2009-05-02
2009-05-03
2009-05-04
2009-05-05
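
If the -F behavior is new to you, a toy demonstration makes it concrete (the echoed strings here are just made-up input):

echo 'a,b,c' | awk '{print $2}'        # whitespace split: the whole line is one field, so this prints an empty line
echo 'a,b,c' | awk -F, '{print $2}'    # comma split: prints b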

If you’d like uniq to print counts alongside each value, it will do so with the -c flag:

#  awk -F, '{print $8}' aad.uploadnewb.csv | sort | uniq -c
341770 2009-04-22
445955 2009-04-26
242915 2009-05-01
135366 2009-05-02
943205 2009-05-03
545772 2009-05-04
745088 2009-05-05
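
To rank the values by how often they occur, a standard follow-on is to pipe the counts through sort -rn (reverse numeric):

awk -F, '{print $8}' aad.uploadnewb.csv | sort | uniq -c | sort -rn

which here would put 2009-05-03, the date with 943205 rows, at the top.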

For the record, the CSV file in question looks like this:

# head aad.uploadnewb.csv
0,78,1,1.79487,0.0128205,0.14,0.14,2009-04-22,2009-04-22,2009-07-08 02:55:52
1,168,3,1.25,0.0178571,0.07,0.21,2009-04-22,2009-04-22,2009-07-08 02:55:52
2,168,31,10.4167,0.184524,0.0564516,1.75,2009-04-22,2009-04-22,2009-07-08 02:55:52
4,297,38,20.0673,0.127946,0.156842,5.96,2009-04-22,2009-04-22,2009-07-08 02:55:52
6,235,23,4.51064,0.0978723,0.046087,1.06,2009-04-22,2009-04-22,2009-07-08 02:55:52
12,52,4,1.34615,0.0769231,0.0175,0.07,2009-04-22,2009-04-22,2009-07-08 02:55:52
19,75,4,2.13333,0.0533333,0.04,0.16,2009-04-22,2009-04-22,2009-07-08 02:55:53
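
A quick sanity check before trusting $8: awk’s built-in NF variable holds the number of fields in the current record, so printing it for the first line (and exiting) tells you how many comma-separated columns the file has. Assuming the rest of the file is shaped like the sample rows above, it should print 10:

awk -F, '{print NF; exit}' aad.uploadnewb.csv
10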

Equivalent code in R would be something like

> d <- read.csv(file='./aad.uploadnewb.csv', header=F, sep=',')
>
> # with header=F, read.csv names the columns V1 ... V10, so the 8th column is d$V8
> # just the unique values
> unique(d$V8)
>
> # include counts
> table(d$V8)