Stochastic Nonsense

Put something smart here.

Finding the Sort Order of an Array in R or Ruby

Suppose you have an array that you’d like to sort by another array. A common use case might be a set of arrays of somethings and for each something you generate a score in say [0,1]. Now you’d like to sort your somethings by their scores.

Concretely, say you have an array of scores:

1
2
scores: [0.3347867, 0.9069004, 0.4391635, 0.8376249, 0.7133011]
indices: [0, 1, 2, 3, 4]

and you want the indices of the sorted scores, ie

1
2
scores_sorted: [0.3347867, 0.4391635, 0.7133011, 0.8376249, 0.9069004]
indices_sorted: [0, 2, 4, 3, 1]

in R, you can always use order, as in

1
2
3
4
5
6
7
8
9
10
11
12
13
14
$ R
> scores <- runif(5)
> perm <- order(scores)
> data.frame(score=scores, order=perm)
     score order
1 0.3347867     1
2 0.9069004     3
3 0.4391635     5
4 0.8376249     4
5 0.7133011     2
>
> # and just to check
> scores[ perm ]
[1] 0.3347867 0.4391635 0.7133011 0.8376249 0.9069004

You can do something similar in ruby:

1
2
3
4
5
6
7
8
9
10
11
12
13
irb > scores = [0.3347867, 0.9069004, 0.4391635, 0.8376249, 0.7133011]
=> [0.3347867, 0.9069004, 0.4391635, 0.8376249, 0.7133011]
irb > scores.zip( (1..scores.length).to_a )
=> [[0.3347867, 1], [0.9069004, 2], [0.4391635, 3], [0.8376249, 4], [0.7133011, 5]]
irb >
irb > scores.zip( (1..scores.length).to_a ).sort_by{ |e| e.first }
=> [[0.3347867, 1], [0.4391635, 3], [0.7133011, 5], [0.8376249, 4], [0.9069004, 2]]
irb >
irb > perm = scores.zip( (1..scores.length).to_a ).sort_by{ |e| e.first }.map{ |e| e[1] - 1 }
=> [0, 2, 4, 3, 1]
irb > 
irb > scores.values_at(*perm)
=> [0.3347867, 0.4391635, 0.7133011, 0.8376249, 0.9069004]

And finally in C++, you can leverage qsort_r; this function was designed to be a reentrant / threadsafe qsort so you’re given a void* to pass a block of memory into your comparison function. You can use this to sort the indices array by the scores:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
#include<stdlib.h>

// [...]
// utility fns that join an array of u (unsigned int) or f (double) into a string
char* vsprintf_u(char* buff, unsigned int* array, unsigned int len){
  char* orig = buff; buff += sprintf(buff, "["); for (int i=0; i < len; i++) buff += sprintf(buff, "%3u, ", array[i]); sprintf(buff, "]"); 
  return orig;
}
char* vsprintf_f(char* buff, double* array, unsigned int len){
  char* orig = buff;
  buff += sprintf(buff, "["); for (int i=0; i < len; i++) buff += sprintf(buff, "%1.4f, ", array[i]); sprintf(buff, "]");
  return orig;
}

/**
 * qsort_r comparison fn: sort array indices by scores
 */
int score_comparator(void* scoresv, const void* leftv, const void* rightv){
  unsigned int* left = (unsigned int*)leftv;
  unsigned int* right = (unsigned int*)rightv;
  double* scores = (double*)scoresv;
  
  if (scores[ *left ] < scores[ *right ])
      return -1;
  else if (scores[ *left ] == scores[ *right ])
      return 0;
  return 1;
}

// [...]

printf("test code:\n");
double scores[5] = {0.3347867, 0.9069004, 0.4391635, 0.8376249, 0.7133011};
unsigned int perm[] = {0, 1, 2, 3, 4};
char buff[4192];
  
printf("presort:  %s\n", vsprintf_f(buff, scores, 5));
printf("presort:  %s\n", vsprintf_u(buff, perm, 5));
  
qsort_r(perm, 5, sizeof(unsigned int), scores, &score;_comparator);
  
printf("postsort: %s\n", vsprintf_u(buff, perm, 5));
printf("postsort: [");
for (unsigned int i=0; i < 5u; i++)
  printf("%1.4f, ", scores[ perm[ i ] ]);
printf("]\n");

which produces when run

1
2
3
4
5
$ ./a.out
presort:  [0.3348, 0.9069, 0.4392, 0.8376, 0.7133, ]
presort:  [  0,   1,   2,   3,   4, ]
postsort: [  0,   2,   4,   3,   1, ]
postsort: [0.3348, 0.4392, 0.7133, 0.8376, 0.9069, ]

Note the canonical way to sort somethings by a float in c++ is to bang everything into a struct or class and leverage qsort on the structs/classes directly. However, this is often pretty inconvenient, and if you have a lot of whatever you want to sort, it’s too memory intensive to put everything into structs/classes with the sole addition of your score field.

I think it’s obvious why I prefer to program in R.

NB: I am developing for OS X; if you are targeting linux you’ll have to figure out how to link qsort_r yourself. I think someone also decided to permute the argument order. Sigh.