Blog

How to find k-subsets, grey codes and polynomials

Jun 5, 2014

I have recently been online hunting for some interesting programming/math related questions and found a question which essentially boiled down to finding the 2-subsets of a set. There could be many applications for an algorithm which can do this efficiently, such as finding all possible edges in a graph, or in the general case of k-subsets (subsets of size k), we may want to find all combinations of groups of k people in a population, etc,etc.

Anyway, the question is an interesting one. The naive way to go about this is perhaps to start with the first k elements, then move the last guy to the right, that's a new set, then keep moving him and so on... Then when he's exhausted the list you move the k-1'th guy to position k and the last guy to position k+1 and keep going... This is basically just getting all the n choose k subsets in lexicographic order.

Recall:

n choose k = n! / [ k! (n-k)! ]

Sorry I don't have LaTeX on the blog, but basically the idea is that if k = n/2 we get n! / (n/2)! = n(n-1)(n-2)...(n/2 + 1)(n/2) which still basically on the order n!

This gives a least upper bound on the algorithm.

Anyway, so I looked into it, and the above algorithm reminded me of counting in some arbitrary base. Then I thought: we are finding a special kind of linear combination, a binary one. So basically another way to see this is that we want to find all binary strings with k 1's... So how to find this?

I looked into how to generate all binary strings, you can use grey codes. Grey codes are basically a way of enumerating all binary strings. Look at what I do here:

0000
0001
0011
0010
0110
0111
0101
0100

So, I'm flipping 1 bit each time, and there are no duplicates... and each time I flip the right-most possible bit...

Ok, easy to get these codes. Also, checking Wikipedia there is a simple explicit formula for the n'th gray code... Ok... So, how do I know which ones have exactly k 1's in them?

Lets start with k=2. Some simple Python code shows that the indices of the grey codes which have exactly 2 1's are as follows: [2, 4, 6, 8, 12, 14, 16, 24, 28, 30, 32]. Maybe you say a pattern. I didn't. I did notice that they are multiples of 2 and that their differences were quite low. So, I then formed the differences. I get: [2,2,2,4,2,2,8,4,2,2,16,8,4,2,2,32,16,8,4,2,2,64,...]. Ok, lets partition it so we can see what's really going on: [[2],[2,2],[4,2,2],[8,4,2,2],[16,8,4,2,2],[32,16,8,4,2,2],...] See what's going on?

So in fact we can find all 2-subsets in linear time (provided we have random access). I then looked into 3-subsets. I found this by first looking at the sequence [5, 9, 11, 13, 17, 19, 23, 25, 27, 29, 33, 35, 39,...] I formed the differences and found sort of a sequence... I don't have much more time today to look at this further but I thought I'd include the screenshot of the partition I made... there seems to be some sort of pattern there right?

Anyway, if it turns out theres a pattern here (and it feels like there is), hopefully we can just plug an arbitrary k in and generate all k-subsets that way.

<< Go back to the previous page