shell utilities for data analysis

Quick utilities to help with data analysis from the shell:

Print numbered column names of a csv or tsv. You can specify a file or it will read from stdin. It will also guess the separator, whichever of tab or comma is more common; or you may specify with --separator. This is particularly useful if you want to use awk to select columns.

(colnum) download

#!/usr/bin/python
# you use macports, you probably want the first line to be exactly #!/opt/local/bin/python

# Copyright 2015 Earl Hathaway rblog Ray at Bans earlh dot com (take my sunglasses off to email me)
# License: The author or authors of this code dedicate any and all copyright interest in this code to the public domain.

#
# print numbered column names or headers from a file or stdin if present with an optional field separator
# tested to work with python from 2.7 to 3.4

from __future__ import print_function
import argparse
import math
import os.path
import sys

stdin = not sys.stdin.isatty()
parser = argparse.ArgumentParser(description='print numbered column headers')
parser.add_argument('file', nargs='?', help='filename(default: stdin if not a tty)')
parser.add_argument('--separator', dest='separator', nargs=1, help='specify the field separator (default: whichever of comma or tab is more common)')
parser.add_argument('--python_dict', dest='pydict', action="store_true", help='emit a python dict?')

args = parser.parse_args(sys.argv[1:])
if args.file is not None and not os.path.isfile(args.file):
  print('File "%s" does not exist' % args.file)
  sys.exit(0)

first = None
if stdin:
  first = sys.stdin.readline()
elif args.file is not None:
  with open(args.file, 'r') as f:
    first = f.readline()
else:
  print('no file specified and nothing on stdin')
  parser.print_help()
  sys.exit(0)

sep = None
if args.separator is None:
  n_comma = first.count(',')
  n_tabs = first.count('\t')
  sep = "\t" if n_tabs >= n_comma else ","
else:
  sep = args.separator[0]

fields = first.split(sep)

# emit a python dict to copy into code; should be zero based
if args.pydict:
  pydict = '{' + (', '.join(['\'%s\': %d' % (val.strip(), idx) for idx, val in enumerate(fields)])) + '}'
  print(pydict)
  sys.exit(0)


# calculate indentation for fields so they don't stagger
width = 0 if len(fields) < 10 else int(math.ceil(math.log10(len(fields))))
format = ' %%%dd %%s' % width
for idx, val in enumerate(fields):
  print(format % (idx + 1, val.strip()))

Example one:

$ colnum data.csv
request_id
quotes
location_id
category_id
creation_time
week
month

or head -1 data.csv | colnum or colnum --separator , data.csv etc.

There are two options: --separator forces a separator, and --python_dict prints a zero-index based lookup dict like so:

{'request_id': 0, 'quotes': 1, 'location_id': 2, 'category_id': 3, 'creation_time': 4, 'week': 5, 'month ': 6}

Stochastic Nonsense

Put something smart here.

Shell Utilities for Data Analysis