
Splitting Files With Awk

To split files (e.g. for test/train splits or k-folds) without having to load them into R or Python, awk does a fine job.

For example, to crack a file into 16 equal parts, using the modulus of the row number to assign rows to output files:

split files into equal parts with awk
$ awk '{ print $0 > ("part_" NR % 16) }' LARGE_FILE
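
The same modulus trick generalizes to k-fold splits. As a sketch (the k variable and the fold_ filename prefix here are just illustrative choices, not anything from above), you can pass the fold count in with -v:

$ awk -v k=5 '{ print $0 > ("fold_" NR % k) }' LARGE_FILE

This writes fold_0 through fold_4; each fold can then serve as the held-out set in turn.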

Or to crack a file into an 80/20 train/test split:

create test/train split using awk
awk '{ if (NR % 10 <= 1) { print $0 > "data.test.20" } else { print $0 > "data.train.80" } }' LARGE_FILE
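
Note that this split is deterministic: the first two rows of every block of ten land in the test file. If you would rather sample rows at random, one sketch uses awk's rand() (srand() with no argument seeds from the current time, so results differ between runs; pass a fixed seed like srand(42) if you need reproducibility):

awk 'BEGIN { srand() } { if (rand() < 0.2) { print $0 > "data.test.20" } else { print $0 > "data.train.80" } }' LARGE_FILE

With this variant the split sizes are only approximately 80/20, since each row is assigned independently.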

And finally, if your data file has a header row that you don't want ending up in one of the split files at random, you can write the header into both files first, then tell your awk script to append (and use tail to skip the header row):

create test/train split with proper header handling
head -1 LARGE_FILE > data.test.20
head -1 LARGE_FILE > data.train.80
tail -n +2 LARGE_FILE | awk '{ if (NR % 10 <= 1) { print $0 >> "data.test.20" } else { print $0 >> "data.train.80" } }'
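
As a quick sanity check, the two output files together should contain exactly one more line than the original, since the header row is written to both:

wc -l LARGE_FILE data.test.20 data.train.80

wc prints a count per file plus a total, so it's easy to eyeball that nothing was dropped.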