Splitting Files With Awk

Mar 26th, 2014

To split files (e.g. for test/train splits or k-folds) without having to load them into R or Python, awk will do a fine job.

For example, to crack into 16 equal parts using modulus to assign rows to files:

split files into equal parts with awk

$ awk '{print $0 > ("part_" (NR % 16))}' LARGE_FILE
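The same modulus trick stretches to k-folds. This is a rough sketch, assuming k is small enough that awk can keep 2k output files open at once; the helper name kfold_split and the test_i / train_i file names are made up for illustration:

```shell
# kfold_split FILE K: for each fold i in 0..K-1, write test_i (rows where
# NR % K == i) and train_i (every other row).
# Hypothetical helper and file names, not from the original post.
kfold_split() {
    awk -v k="$2" '
        {
            fold = NR % k
            for (i = 0; i < k; i++) {
                # the row is test data for its own fold, training data elsewhere
                out = (i == fold ? "test_" : "train_") i
                print > out
            }
        }' "$1"
}
```

Calling kfold_split LARGE_FILE 5 would leave five test/train pairs in the current directory.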

Or to crack a file into an 80/20 test/train split:

create test/train split using awk

awk '{if( NR % 10 <= 1){ print $0 > "data.test.20"} else {print $0 > "data.train.80"}}' LARGE_FILE
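One caveat: the modulus rule always sends the same row positions to the test file, so a sorted input gives a patterned split. A minimal sketch of a randomized alternative, swapping the modulus for awk's rand(); the helper name random_split, the default seed and the 0.2 threshold are my assumptions, and the split is only approximately 80/20:

```shell
# random_split FILE [SEED]: send roughly 20% of rows to data.test.20 and the
# rest to data.train.80. Hypothetical helper, not part of the original post.
random_split() {
    awk -v seed="${2:-42}" '
        BEGIN { srand(seed) }   # fixed seed keeps the split repeatable
        {
            if (rand() < 0.2) print > "data.test.20"
            else              print > "data.train.80"
        }' "$1"
}
```

The exact rows chosen for a given seed can differ between awk implementations, so keep the seed and the awk version together if you need to reproduce a split.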

And finally, if your data file has a header row that you don’t want to land in only one of the output files, you can write the header to both files first, then have awk append the remaining rows (using tail to skip the header):

create test/train split with proper header handling

head -n 1 LARGE_FILE > data.test.20
head -n 1 LARGE_FILE > data.train.80
tail -n+2 LARGE_FILE | awk '{if( NR % 10 <= 1){ print $0 >> "data.test.20"} else {print $0 >> "data.train.80"}}'
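The header handling can also be folded into a single awk pass. A sketch, assuming the header is exactly one line; split_with_header is a made-up name. Within one awk run, > truncates a file only the first time it is opened, so later prints to the same name append:

```shell
# split_with_header FILE: copy the header to both outputs, then route data
# rows with the same NR % 10 rule. Hypothetical helper name.
split_with_header() {
    awk '
        NR == 1 {               # header row: write it to both files
            print > "data.test.20"
            print > "data.train.80"
            next
        }
        {
            # NR - 1 is the data row number, matching the tail -n+2 numbering
            if ((NR - 1) % 10 <= 1) print > "data.test.20"
            else                    print > "data.train.80"
        }' "$1"
}
```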

Wordpress to Octopress Migration

Mar 26th, 2014

As mentioned, I’m moving my blog to octopress. I got tired of php, php-cgi, wordpress, security holes, constant updates that broke random plugins, 5 second page loads, fragile caching plugins, and all the various nonsense that wordpress brings to the table.

An aside: php-cgi is so fragile and crashes so often I ran a screen session as root that just attempted to restart it every 5 seconds (attached below for any poor souls stuck using this tech.)

# run as root inside screen
for (( ; ; )); do /usr/bin/spawn-fcgi -u nginx -g nginx -f /usr/bin/php-cgi -a 127.0.0.1 -p 53217 -P /var/run/fastcgi-php.pid; sleep 5; done

For googlers who want to move from wordpress to octopress, here’s how I moved 70-odd posts with minimal pain.

1 – Get thomasf’s excellent python script (accurately named exitwp) that converts wordpress posts to octopress posts. This will create one octopress post per wordpress post in the source directory.

2 – I simultaneously moved urls from blog.earlh.com to earlh.com/blog so I needed to 301 all the old posts. I did that by getting this awesome wordpress post exporter script contributed by Mike Schinkel. I curled that to create a list of urls to forward, then built a tsv of pairs of old url\tnewurl. Then the below awk script will print nginx forward rules:

awk -F"\t" '{print "\tlocation " $1 " {\n\t\treturn 301 " $2 ";\n\t\tbreak;\n\t}"}' posts.tsv | sed 's/http:\/\/blog\.earlh\.com//'

The rules look like:

location /index.php/2009/06/cleaning-data-in-r-csv-files {
        return 301 http://earlh.com/blog/2009/06/29/cleaning-data-in-r-csv-files/;
        break;
}

Add them to your site nginx.conf file inside the server configuration block.

I’ll update with solutions for better image embedding.

C++ Is Horrific

Mar 24th, 2014

I’m poking at some c++ after not touching it for a decade. c++11 has apparently gotten roughly as capable as java pre-2000; it can now create threads! But the error messages. Oh, the error messages.

$ clang++ -std=c++11 -stdlib=libc++ src/test.cpp
In file included from src/test.cpp:4:
In file included from /usr/bin/../lib/c++/v1/thread:93:
In file included from /usr/bin/../lib/c++/v1/functional:465:
/usr/bin/../lib/c++/v1/memory:1677:31: error: calling a private constructor of class 'std::__1::thread'
            ::new((void*)__p) _Up(_VSTD::forward<_Args>(__args)...);
                              ^
/usr/bin/../lib/c++/v1/memory:1604:18: note: in instantiation of function template specialization
      'std::__1::allocator<std::__1::thread>::construct<std::__1::thread, const std::__1::thread &>' requested here
            {__a.construct(__p, _VSTD::forward<_Args>(__args)...);}
                 ^
/usr/bin/../lib/c++/v1/memory:1488:14: note: in instantiation of function template specialization
      'std::__1::allocator_traits<std::__1::allocator<std::__1::thread> >::__construct<std::__1::thread, const std::__1::thread &>' requested here
            {__construct(__has_construct<allocator_type, pointer, _Args...>(),
             ^
/usr/bin/../lib/c++/v1/vector:1471:25: note: in instantiation of function template specialization
      'std::__1::allocator_traits<std::__1::allocator<std::__1::thread> >::construct<std::__1::thread, const std::__1::thread &>' requested here
        __alloc_traits::construct(this->__alloc(),
                        ^
src/test.cpp:36:13: note: in instantiation of member function 'std::__1::vector<std::__1::thread, std::__1::allocator<std::__1::thread> >::push_back'
      requested here
    threads.push_back(t);
            ^
/usr/bin/../lib/c++/v1/thread:261:5: note: implicitly declared private here
    thread(const thread&);
    ^
1 error generated.

what was the cause of this monstrosity?

std::vector< std::thread > threads;
// [...]
std::thread t(thread_func, i);
threads.push_back(t);

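so yeah, you can’t copy thread objects; libc++ enforces this by declaring the copy constructor private. The fix is to move the thread into the vector instead: threads.push_back(std::move(t)). Still, the amount of knowledge it takes to translate from the error message to the error is pretty amazing.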

Replacing Sort | Uniq

Mar 21st, 2014

A code snippet: when poking at columnar data in the shell, you’ll often find yourself asking questions like what are the unique values of a particular column, or the unique values and their counts. R would accomplish this via unique or table, but if your data is largish it may be quite annoying to load into R. I often use bash to quickly pick out a column, à la

pick out a column

$ cat data.csv | awk -F, '{print $8}' | sort | uniq -c | sort -r -n

In order: bash cats my data, tells awk to print just column 8 using , as the field separator, sorts all the data so that I can use uniq, asks uniq to print the counts and then the unique strings, then sorts by the counts descending (-n interprets as a number and -r sorts descending). The obvious inefficiency here is that if your data is a couple of GB, you have to sort all of it just so uniq can work. Instead, you can add the script below to your path and replace the above with:

pick out a column

$ cat data.csv | awk -F, '{print $8}' | count

not only is this a lot less typing, but it will be significantly faster: only the distinct values and their counts are held in ram, and only those get sorted rather than every row.

replace sort | uniq

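#!/usr/bin/env ruby

# replaces sort | uniq -c | sort -r -n
# counts default to 0, so unseen lines need no special case
cnts = Hash.new(0)
$stdin.each_line { |line|
  # chomp so a final line without a trailing newline is not counted separately
  cnts[line.chomp] += 1
}

# print "count<TAB>value", highest count first
cnts.sort_by { |word, cnt| -cnt }.
    each { |word, cnt| puts "#{cnt}\t#{word}" }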
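If ruby isn’t on the box, roughly the same thing can be sketched in awk. This is my own equivalent, not the script above; count_lines is a made-up name, and the final sort only sees one line per distinct value:

```shell
# count_lines: read lines on stdin, print "count<TAB>line" with the highest
# count first. Hypothetical name; only distinct lines are held in memory.
count_lines() {
    awk '{ c[$0]++ } END { for (k in c) print c[k] "\t" k }' | sort -rn
}
```

Usage mirrors the ruby version: cat data.csv | awk -F, '{print $8}' | count_lines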

Hiring Software Engineers

Jun 30th, 2013

I perpetually see employers, on hacker news and elsewhere, complaining about difficulty hiring. I haven’t had such issues, so a (perhaps bold) guide to hiring software engineers:

are you paying market salaries?
1. are you really paying market salaries, or are employees supposed to join your company because you’re a special snowflake?
2. even if you are paying market, why should an employee go to your firm? What is the upside to them for leaving a boss and company that they know? Because it would be really convenient for you, the hirer, is not a good answer.
do you make the interviewing process decent, or do you scatter caltrops in front of potential employees?
1. good employees do not need to crash study then regurgitate graph algorithms that your company never uses on the whiteboard. They also have jobs and value their vacation time and don’t care to spend a week consulting for you.
2. how long does it take you to respond to resumes that come in? You should be able to say yes/no/maybe within 2 business days. Do your recruiters / interviewers actually read the cover letters / resumes? Last time I changed jobs, a big sf / yc startup let my first interviewer roll into the interview room just shy of 20 minutes late without having read my resume. That’s a complete dick move, and it’s part of why I turned them down.
3. when potential employees send you github links, do you have an engineer actually bloody look at them (almost never in my experience)?
4. do you actually expend effort to meet potential employees and grow a bunch of warm leads, or do you wait until 3 weeks before you want someone to start then gripe because you can’t convert cold leads in 1 week plus a 2 week resignation period for their current employer?
5. do you use shit software like that jobvite bullshit that badly ocrs then expects me to hand proof their shitty ocr job, or do you directly accept pdf resumes?
6. for the love of god, I do not have a copy of ms word and wouldn’t take one if it were free. I will not put my resume into word format.
do you take some pains to grow employees?
1. do you hire people out of university? Take a chance on people?
2. Like one of my former employers, do you do a good job hiring new grads from schools besides stanford / berkeley / mit / cmu, but then 18 months in after employees have demonstrated their value refuse to bring them up to market rates and lose them?
when you send out offers, do you actually put out a good offer or do you throw numbers out that are 10% or more under your ceiling then expect employees to negotiate hard with you? I just turned down an sf startup because they did this; the ceo who successfully hired me said, “Earl: when I was an employee I hated negotiating, so I’m going to make you a great offer. This also means I’m not going to negotiate.” And you know what? It was a great offer, and I said yes the next morning. It also avoids starting your new job after a confrontational exercise.
if you have recruiters contacting people, do you have them make clear they’re internal not external?
1. do your recruiters actually read people’s linkedin profiles before contacting them? I used rails a bit 3 employers ago and had to remove that word from my profile because I got spammed with rails stuff.
do your job postings on linkedin, craigslist, message boards, and your website tell potential employees why they should work for you, or like the vast majority, are they simply a long list of desiderata?
just like the easiest sale is an upsell to a customer you have, the easiest recruit is the good employee you already have that you keep happy and prevent from leaving
1. like Rand says, do you know off the top of your head the career goals of your employees? What are you doing to help them get there?
2. do you give your employees raises to keep them at or above market, or can they get a $20k raise by swapping companies? If that raise is on offer, exactly why should they continue to work for you?
3. on that note… 0.2% of an A round company isn’t golden handcuffs. It’s more like paper handcuffs.

Moving to Octopress

Jun 17th, 2013

I finally got tired of wordpress, and I’m trying out octopress. If you have a blog, I think you would probably also be happier using octopress.

Reasons to switch:

wordpress doesn’t just work; it takes endless supervision. This is complicated by the seeming inability to get only security related updates, so I always feel forced to stay on the version treadmill for security reasons. This frequently breaks plugins.
wordpress treats security as an afterthought. For instance, the first thing you may think of is locking down the wp-admin directory to your home ip, but last time I tried, this broke the site.
php is security-hole ridden junk. This seems to have improved over the years but I’m still a little uneasy about using it, whereas only serving static html should be very secure and only require a single serving program making it much easier to keep up with security.
serving php with nginx is fragile; there are multiple ways to set it up and none of them seem to work particularly reliably.
there’s always 10 plugins to do any given task, none of which fully work or fully integrate with wordpress. I tried to get markdown working with wordpress last weekend and it was a nightmare
wordpress is slow, and caching plugins are brittle; serving static html from octopress should be lightning fast.
octopress comes with a bunch of nice features like syntax highlighting that doesn’t require loading 15 different javascript files à la the syntax highlighter I’m using.

Reasons I want to switch:

I really like the idea of serving a static site and deploying with rsync
I like using git to version my site and vim to write posts

I will miss comments, but I hope people will email instead. That said, of the nearly 20,000 comments my site has received I believe fewer than thirty weren’t spam. In fact, wordpress has a whole cottage industry selling a comment spam control tool called Akismet created to fix how easy wordpress makes comment spam.

Equifax Are Scum Who Sell Your Email Address to Scammers

May 20th, 2013

equifax are scum

There’s only one company that (should) have ever seen the highlighted email address. It’s also not a common word that you would find in a dictionary attack.

Useful Tweaks for Hadoop on EMR

May 7th, 2013

more ram for the workers: modify mapred-site.xml and add

<property><name>mapred.child.java.opts</name><value>-Xmx3192m</value></property>

To push the changes to all the machines, use the script to modify mapper or reducer count on a running emr cluster.

Modifying the Number of Mappers or Reducers on a Running EMR Cluster

May 2nd, 2013

Amazon emr unfortunately doesn’t give you an easy way to change the number of mappers and reducers on a running cluster. To do so before booting the cluster, add

--bootstrap-action="s3://elasticmapreduce/bootstrap-actions/configure-hadoop"  \
   --args "-m,mapred.tasktracker.map.tasks.maximum=4,-m,mapred.tasktracker.reduce.tasks.maximum=2"

as appropriate to the elastic-mapreduce.rb command.

For a running emr cluster, you can use the following scripts. Navigate to the conf directory; it will be in a path similar to /home/hadoop/.versions/1.0.3/conf

Edit mapred-site.xml and change the value of either or both of:

mapred.tasktracker.map.tasks.maximum

mapred.tasktracker.reduce.tasks.maximum

Then copy and paste these commands:

# distribute the file to all nodes
hadoop job -list-active-trackers | sed "s/^.*_//" | sed "s/:.*//" | xargs -t -I{} -P10 scp -o StrictHostKeyChecking=no  mapred-site.xml hadoop@{}:.versions/1.0.3/conf/

# bounce the tasktrackers on each node
hadoop job -list-active-trackers | sed "s/^.*_//" | sed "s/:.*//" | xargs -t -I{} -P10 ssh -o StrictHostKeyChecking=no hadoop@{}   sudo /etc/init.d/hadoop-tasktracker stop

# restart the jobtracker on the headnode
sudo /etc/init.d/hadoop-jobtracker stop

One way to verify this worked is on the jobtracker web page.
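Both xargs pipelines above lean on the two sed expressions to turn tracker names into hostnames, so it is worth dry-running that part before pointing scp or ssh at the result. A small sketch; tracker_hosts is a made-up name, and the sample tracker line is an assumed format rather than output captured from a real cluster:

```shell
# tracker_hosts: reduce tasktracker names on stdin to bare hostnames, using
# the same two sed expressions as the commands above. Hypothetical helper.
tracker_hosts() {
    sed "s/^.*_//" | sed "s/:.*//"
}

# Assumed sample line; on a real cluster feed in
# `hadoop job -list-active-trackers` instead.
echo "tracker_ip-10-0-0-12.ec2.internal:localhost/127.0.0.1:41234" | tracker_hosts
```

The first expression drops everything up to the last underscore and the second drops everything from the first colon, so a hostname containing an underscore would get mangled.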

Building Lush on OSX Lion

Dec 5th, 2012

If building lush2 errors out with this link error:

g++ -L/opt/local/lib -DHAVE_CONFIG_H -DNO_DEBUG -Wall -O3 -mmmx -msse -I../include  -I/opt/local/include -I/opt/local/include/freetype2  -o lush2 at.o binary.o cref.o calls.o arith.o check_func.o date.o dh.o dump.o eval.o fileio.o fltlib.o fpu.o function.o event.o graphics.o htable.o idx1.o idx2.o idx3.o idx4.o index.o io.o list.o main.o math.o misc.o cmm.o module.o number.o oostruct.o regex.o storage.o string.o symbol.o toplevel.o user.o weakref.o ps_driver.o rng.o lisp_driver.o x11_driver.o unix.o   cpp.o -L/opt/local/lib -lXft -lSM -lICE -lX11 -liconv -lreadline -lcurses -lutil -ldl -lm
Undefined symbols for architecture x86_64:
  "_FcNameParse", referenced from:
      _getfont in x11_driver.o
  "_FcPatternDestroy", referenced from:
      _getfont in x11_driver.o
  "_FcPatternGet", referenced from:
      _getfont in x11_driver.o
  "_FcPatternDel", referenced from:
      _getfont in x11_driver.o
  "_FcPatternAdd", referenced from:
      _getfont in x11_driver.o
  "_FcNameUnparse", referenced from:
      _getfont in x11_driver.o
ld: symbol(s) not found for architecture x86_64
collect2: ld returned 1 exit status
make[1]: *** [lush2] Error 1
make: *** [all] Error 2

Jason Aten was kind enough to fix this for Snow Leopard and later, as detailed in the lush mailing list archive. Grab Jason’s lush2 git repo from github.


Stochastic Nonsense