I've lost my NP-completeness virginity.
I watch a fair amount of fansubbed anime on my computer. When I'm done watching a series, I like to burn it to CDs. Sometimes, when you try to burn a series, in order, to a set of CDs, the last CD only has one or two small files on it. With some rearranging, you can usually fit those files in the wasted space on previous CDs. This takes a lot of work, and, as a programmer, I hate manual labor. :) So my friend Python and I started a little side project to make my life easier...
Thus, partition.py is born! (See the new scripts section on my web site.) partition.py splits a set of files up into fixed-sized buckets, using one of five algorithms.
But! It turns out that the general problem of fitting a set of files into the minimum number of buckets possible (bin packing) is NP-complete. To those of you not versed in computational theory, this means that nobody knows an algorithm that solves the problem in a reasonable (polynomial) amount of time, and most researchers believe none exists. NP-complete problems usually have approximation algorithms that come close to the ideal solution in a reasonable amount of time, though...
partition.py provides four approximations, plus the exact algorithm just for kicks. In the running time and storage analyses below, n is the number of files, B is the number of buckets, and W is the maximum bucket size. Rough sketches of each algorithm follow the list.
- Ordered - This is how most people burn CDs. Just put files into buckets in (alphabetical) order, trying to fit as many files as possible into each bucket. It's not optimal, but it maintains the most spatial locality. I use this algorithm if the ones below don't produce better results. Running time: O(n) Storage: O(n + B)
- Decreasing Fit First - I found DFF on everything2.com, under the bin packing node (it's more commonly known as First Fit Decreasing). Easy to implement, reasonably fast, and it gives good results... This one is probably the best. Running time: O(nB) Storage: O(n + B)
- Huffman - I wrote this algorithm thinking that the greedy strategy used to build Huffman coding trees might give decent results... Boy, was I ever wrong. Basically, you start with a bucket for each file, then repeatedly combine the two smallest buckets until no two can be combined without exceeding the maximum bucket size. This gives horrible results; the naive ordered algorithm often beats it by a large margin. Running time: O(n^2) Storage: O(n + B)
- Multiple Knapsack - This takes the 0-1 knapsack algorithm and runs it repeatedly until all of the files have been placed in buckets. The first bucket has almost no wasted space, the second a little more, and the last buckets waste the most. This one seems like it would be ideal, since each bucket is filled optimally, but filling each bucket as full as possible doesn't guarantee the fewest buckets overall, and I've seen it lose to DFF. Multiple Knapsack works fine if you have fewer than 40 files, but when I tried running it overnight on a set of 100, my machine ran out of memory. Running time: O(B*2^n) Storage: O(nW + B)
- Slow - The most worthless algorithm, and the second one I implemented. It's so slow that I've never seen it finish with 8 or more files. Basically, it does a depth-first search of the entire space of possible assignments of files to buckets, remembering the best solution it has seen so far. Running time: Absurdly high. Storage: Not too bad, since it's a depth-first search instead of a breadth-first one.
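For the curious, here's roughly what each strategy looks like. These are minimal sketches in plain Python, not partition.py's actual code: they assume `files` is a list of `(name, size)` pairs with integer sizes, `capacity` is the bucket size, and no single file is bigger than a bucket. The ordered strategy just walks the list in order, starting a new bucket whenever the current one fills up:

```python
def pack_ordered(files, capacity):
    """Pack files in the given order, one bucket at a time."""
    buckets = [[]]
    free = capacity                    # space left in the current bucket
    for name, size in files:
        if size > free:                # doesn't fit: start a fresh bucket
            buckets.append([])
            free = capacity
        buckets[-1].append(name)
        free -= size
    return buckets
```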
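DFF sorts the files largest-first and drops each into the first bucket that has room, opening a new bucket only when none fits. A sketch under the same assumptions:

```python
def pack_dff(files, capacity):
    """First Fit Decreasing: biggest files first, first bucket with room."""
    buckets = []                       # list of [free_space, names] pairs
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        for bucket in buckets:
            if size <= bucket[0]:      # first bucket with enough room wins
                bucket[0] -= size
                bucket[1].append(name)
                break
        else:                          # no existing bucket had room
            buckets.append([capacity - size, [name]])
    return [names for free, names in buckets]
```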
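The Huffman-inspired strategy starts with one bucket per file and keeps merging the two smallest. Since the two smallest buckets have the smallest combined size, once they won't fit together, no pair will, so the sketch can stop there. (Re-sorting every pass is the lazy way to find the two smallest; a heap would be faster.)

```python
def pack_huffman(files, capacity):
    """Repeatedly merge the two smallest buckets while the result fits."""
    buckets = [(size, [name]) for name, size in files]
    while len(buckets) > 1:
        buckets.sort(key=lambda b: b[0])     # smallest buckets first
        (s1, n1), (s2, n2) = buckets[0], buckets[1]
        if s1 + s2 > capacity:               # even the two smallest won't fit
            break
        buckets[:2] = [(s1 + s2, n1 + n2)]
    return [names for size, names in buckets]
```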
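Multiple Knapsack repeatedly solves a 0-1 knapsack over the remaining files, with each file's value equal to its size, so each pass packs one bucket as full as possible. This sketch shows the idea, not what partition.py actually does; note the O(n*W) DP table, which is also a hint at why memory explodes when W is a CD's capacity in bytes:

```python
def pack_knapsack(files, capacity):
    """Fill one bucket optimally per pass via 0-1 knapsack DP."""
    remaining = list(files)
    buckets = []
    while remaining:
        sizes = [size for name, size in remaining]
        n = len(sizes)
        # best[i][w] = most space fillable using the first i files within w
        best = [[0] * (capacity + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for w in range(capacity + 1):
                best[i][w] = best[i - 1][w]
                if sizes[i - 1] <= w:
                    cand = best[i - 1][w - sizes[i - 1]] + sizes[i - 1]
                    if cand > best[i][w]:
                        best[i][w] = cand
        # trace back through the table to see which files were chosen
        chosen, w = set(), capacity
        for i in range(n, 0, -1):
            if best[i][w] != best[i - 1][w]:
                chosen.add(i - 1)
                w -= sizes[i - 1]
        buckets.append([remaining[i][0] for i in sorted(chosen)])
        remaining = [f for i, f in enumerate(remaining) if i not in chosen]
    return buckets
```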
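And the exhaustive search, which explains why Slow earns its name: every file can go into any existing bucket or a brand-new one, and the search just tries them all, pruning branches that already use at least as many buckets as the best solution found so far.

```python
def pack_exhaustive(files, capacity):
    """Try every assignment of files to buckets; keep the fewest buckets."""
    best = [None]                       # best bucket list found so far

    def dfs(i, buckets, free):
        if best[0] is not None and len(buckets) >= len(best[0]):
            return                      # can't beat the best so far; prune
        if i == len(files):
            best[0] = [list(b) for b in buckets]
            return
        name, size = files[i]
        for j in range(len(buckets)):   # try every existing bucket...
            if size <= free[j]:
                buckets[j].append(name)
                free[j] -= size
                dfs(i + 1, buckets, free)
                buckets[j].pop()
                free[j] += size
        buckets.append([name])          # ...and opening a new bucket
        free.append(capacity - size)
        dfs(i + 1, buckets, free)
        buckets.pop()
        free.pop()

    dfs(0, [], [])
    return best[0]
```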
Feel free to correct my analyses. They aren't rigorous.
Python is great. It really lets you focus on the fun stuff -- the algorithms. But sometimes the lack of static typing makes the code hard to read. Or maybe that's just me. And the standard library could really use more data types and algorithms, along the lines of C++'s STL. Python 2.3 provides a hash-based Set class and a heap-based priority queue, but I didn't see any easy way to remove duplicates from lists or to build tree-based sets or maps.
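To be fair, the Set class gets you most of the way there if you don't care about order; keeping the list's original order still takes a manual loop. A quick sketch against Python 2.3's sets module:

```python
from sets import Set  # Python 2.3's hash-based set type

def unique(items):
    """Remove duplicates from a list; the original order is lost."""
    return list(Set(items))
```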
Oh well. It was a fun excursion, and the result does make my life easier. :)
Though your approach to solving the problem with the multiple knapsack concept is correct, you are not using it properly. You are running the dynamic programming algorithm, and that one is optimal only for a single knapsack (it runs in pseudo-polynomial time, O(nB), where B is the knapsack's capacity). The correct solution uses a branch-and-bound approach and some internal algorithms that make the upper and lower bounds as tight as possible. The latter is accomplished by using relaxation algorithms (Lagrangian, surrogate, or linear programming) and subset-sum algorithms. Once you accomplish that, you'll have lots of fun.
Gabriel