Stochastic Nonsense

Put something smart here.

Howto Transform TSV to CSV, or Just Remove Tabs

Unfortunately, statistics and machine learning seem to degenerate into a giant mess of getting data from multiple sources, munging it together, transforming it, and formatting the output, even before you can get to the work proper. A common problem is taking tab separate value (tsv) files, perhaps produced as the output of a mysql or postgres query, and turning them into comma separated value (csv) files.

Here’s one method, using sed and pretty standard regexp syntax:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
work earl$ head for_tyler.csv
day   referer ip
2009-07-06    http://powertwitter.me/ 111113333
2009-07-06    http://powertwitter.me/ 111113333
2009-07-06    http://powertwitter.me/ 111113333
2009-07-06    http://powertwitter.me/ 111113333
2009-07-06    http://twitter.com/ 111113333
2009-07-06    http://twitter.com/home 111113333
2009-07-06    http://twitter.com/home 111113333
2009-07-06    http://twitter.com/home 111113333
2009-07-06    http://twitter.com/home 111113333
work earl$
work earl$ head for_tyler.csv | sed "s/\t/,/g"
day,referer,ip
2009-07-06,http://powertwitter.me/,111113333
2009-07-06,http://powertwitter.me/,111113333
2009-07-06,http://powertwitter.me/,111113333
2009-07-06,http://powertwitter.me/,111113333
2009-07-06,http://twitter.com/,111113333
2009-07-06,http://twitter.com/home,111113333
2009-07-06,http://twitter.com/home,111113333
2009-07-06,http://twitter.com/home,111113333
2009-07-06,http://twitter.com/home,111113333
work earl$ 
work earl$

The key bit above is this: "s/\t/,/g". That says turn every tab (\t) into a comma (,). If you instead preferred to just remove tabs from the file period, you could use sed on "s/\t//g".

So, yet another thing I learned today: the version of sed that ships with MacOS, even through 10.5.7, doesn’t support special character sequences. If the above isn’t working for you, and instead is just replacing every t character in the file with a comma, then try this:

1
$ echo "   a       " | sed "s/\t/,/g"

Note that to type those tabs, you’ll have to hit ctrl-v (^V). If the output isn’t ",a,", then you have to type literal tabs in your sed command. The \t works under reasonable versions of linux; you’ll have to use literal tabs under OS X. Bleh.