Python Algorithms - Knapsack Problem

Contents

前面两篇都写得很长，然后就导致一个多月都没写新的，有种唱歌调起高了的感觉。以后还是想到什么就记录一下吧，不要搞得太严肃各种有压力啊。

上周Stanford算法设计与分析课的作业是经典的背包问题，用来练习刚刚习得的Dynamic Programming技能。按照课上老师讲的方法，用Python写一下具体实现，先求解问题，再还原最优解中所有的Item：

def knapsack(maxweight, items):
    bestvalues = [[0] * (maxweight + 1)
                  for _ in xrange(len(items) + 1)]
 
    for i, (value, weight) in enumerate(items):
        for capacity in xrange(maxweight + 1):
            if weight > capacity:
                bestvalues[i + 1][capacity] = bestvalues[i][capacity]
            else:
                case1 = bestvalues[i][capacity]
                case2 = bestvalues[i][capacity - weight] + value
 
                bestvalues[i + 1][capacity] = max(case1, case2)
 
    trackback = []
    n = len(items)
    w = maxweight
    for i in xrange(n, 0, -1):
        if bestvalues[i][w] != bestvalues[i - 1][w]:
            trackback.append(items[i - 1])
            w -= items[i - 1][1]
 
    trackback.reverse()
 
    return bestvalues[len(items)][maxweight], trackback

恩，这个应该大家都很熟悉了，先分割子问题，然后用个二维数组来记录之前求得的结果避免重复计算，这样复杂度是O(nw)。不过这个作业有两问，第二问的数据量一下子大了很多，有2000个item，背包容量达2000000……恩，估计是给奥特曼背的……再用这个程序一跑，很快就出了Memory Error，数组太大啦！

怎么优化呢？比较自然的想法就是只计算那些有必要计算的子问题，从顶向下来跑，用递归和cache，也是比较好实现的：

def knapsack(items, maxweight):
    
    @memoize
    def bestvalue(i, j):
        if i == 0: return 0
        value, weight = items[i - 1]
        if weight > j:
            return bestvalue(i - 1, j)
        else:
            return max(bestvalue(i - 1, j),
                       bestvalue(i - 1, j - weight) + value)
 
    j = maxweight
    result = []
    for i in xrange(len(items), 0, -1):
        if bestvalue(i, j) != bestvalue(i - 1, j):
            result.append(items[i - 1])
            j -= items[i - 1][1]
    result.reverse()
    return bestvalue(len(items), maxweight)

还尝试了下对我来说很新潮的decorator哈哈！这个东西看起来真是各种专业，感兴趣的同学可以看耗子哥的介绍。下面的代码来自官方wiki。

def memoize(obj):
    cache = obj.cache = {}
 
    @functools.wraps(obj)
    def memoizer(*args, **kwargs):
        if args not in cache:
            cache[args] = obj(*args, **kwargs)
        return cache[args]
    return memoizer

信心满满开始跑大case！然后很快就又报错说递归层数超限制了……好吧，改下限制：

1 2	import sys sys.setrecursionlimit(3000)

本以为这样搞一下应该瞬间出结果了吧，结果还是跑了近2分钟（110秒）……不过看下总共解的子问题数量已经降到了6227855 / (2000000 * 2000)，也就是原来的0.156%！

算法课的老师说，我们搞算法，最重要的一个问题就是：“Can we do better?” 像我这样的有志青年，怎么能满足于2分钟才能跑出来的结果呢？！！于是我点开课程讨论区，看看大家有没有什么更好的解法……果然！有一位来自大洋彼岸的名叫大卫的同学提出了稀疏Column的解法，其思想就是在原始的子问题矩阵中，其实有很多是重复的值（因为重量往往比较大比较分散，所以很多计算其实都无法加上新item而只是沿用之前的值（而且还得取两个值算一次max呢，感觉很亏！）。所以事实上我们只需要记录在一列中发生数值变化的那几个点就可以了！这样将大大减少计算量以及存储矩阵所需要的空间！

有了思路就可以来实现了！大致就是轮一遍所有的Item，搞个Hash来存总Value，Key就是当前的总Weight，每到新一个Item的时候取出Hash中所有的Key，然后加上当前轮Item的Weight即为新的发生变化的点，判断下不要超出Max Weight，如果与之前的Key有重合就取比较大的结果，总的来说思路跟原始的bottom-up做法是一样的。代码如下，命名比较乱，而且没有去实现回溯……

def knapsack(items, maxweight):
    # history = []
    result = {}
    result[0] = 0
    for i in range(len(items)):
        v, w = items[i]
        new_res = dict(result)
        for c in sorted(result.keys()):
            if c + w > maxweight:
                break
            if result.get(c+w):
                new_res[c+w] = max(result[c+w], result[c] + v)
            else:
                new_res[c+w] = result[c] + v
        result = dict(new_res)
        # history.append(result)
 
    return result[max(result.keys())]

不过理想是美好的，现实总是残酷的。实现完后一跑就出问题了，跟正确答案总是差那么一点点……是哪里不对了呢？然后自己搞了个小数据量的Case看了下，发现少考虑了一种情况！那就是如果有一个Item的Value很大，Weight也很大（比如正好比一半多一点），当背包里只放它一件物品的时候总Value就超过其它所有物品的Value和了，那么当计算到它时必须把后面比它小的那些Value都给删掉！这样才能确保获得正确答案！这样一来感觉就有点麻烦了啊，还要删掉Value……试了下用数组用哈希都可以，数组的话就是用两组index的数组像merge sort那样分别取Weight出来保证有序，如果比之前的Value小就不写入新result里了。还有就是下面还是用Hash做的方法：

def knapsack(items, maxweight):
    # history = []
    result = {}
    result[0] = 0
    for i in range(len(items)):
        v, w = items[i]
        new_res = {}
        changed_weight = dict.fromkeys(result.keys(), 0)
        new_weights = dict(changed_weight)
        for k in changed_weight:
            new_weight = k + w
            if new_weight <= maxweight:
                if new_weight in changed_weight:
                    new_weights[new_weight] = 1
                else:
                    new_weights[new_weight] = -1
        last_val = -1
        for (k, c) in sorted(new_weights.items(), key=lambda x: x[0]):
            if c == 0:
                new_val = result[k]
            elif c == 1:
                new_val = max(result[k], result[k-w] + v)
            else:
                new_val = result[k-w] + v
            if new_val > last_val:
                new_res[k] = new_val
                last_val = new_val
        result = dict(new_res)
        # history.append(result)
 
    return result[max(result.keys())]

哈哈，还是挺有趣的吧！跑了一下大约4.3s完成！速度提升25x！用cProfile, line_profiler, memory_profiler等Python的性能工具对比看了下，这个Sparse Column的实现占用内存大约11.4MB，而Top-down Recursive的实现要占用1159MB！

Line #    Mem usage    Increment   Line Contents
================================================
    12    8.641 MiB    0.000 MiB   @profile
    13                             def main():
    14    8.645 MiB    0.004 MiB       cache = {}
    15    9.031 MiB    0.387 MiB       total_weight, total_item, items = generate_items('knapsack_big.txt')
    16    9.031 MiB    0.000 MiB       def memo(i, j):
    17                                     if (i, j) in cache:
    18                                         return cache[(i, j)]
    19                                     if i == 0:
    20                                         return 0
    21                                     v, w = items[i - 1]
    22                                     if j - w < 0:
    23                                         result = memo(i-1, j)
    24                                         cache[(i, j)] = result
    25                                         return result
    26                                     else:
    27                                         result = max(memo(i-1, j), memo(i-1, j-w) + v)
    28                                         cache[(i, j)] = result
    29                                         return result
    30
    31 1159.055 MiB 1150.023 MiB       print memo(total_item, total_weight)

Sparse Column的实现只消耗了Top-down实现1%的内存！

Line #    Mem usage    Increment   Line Contents
================================================
     9    9.008 MiB    0.000 MiB   @profile
    10                             def knapsack(items, maxweight):
    11                                 #history = []
    12    9.012 MiB    0.004 MiB       result = {}
    13    9.012 MiB    0.000 MiB       result[0] = 0
    14   11.398 MiB    2.387 MiB       for i in range(len(items)):
    15   11.398 MiB    0.000 MiB           v, w = items[i]
    16   11.398 MiB    0.000 MiB           new_res = {}
    17   11.398 MiB    0.000 MiB           changed_weight = dict((k, 0) for k in result.keys())
    18   11.398 MiB    0.000 MiB           new_weights = dict(changed_weight)
    19   11.398 MiB    0.000 MiB           for k in changed_weight:
    20   11.398 MiB    0.000 MiB               new_weight = k + w
    21   11.398 MiB    0.000 MiB               if new_weight <= maxweight:
    22   11.398 MiB    0.000 MiB                   if new_weight in changed_weight:
    23   11.398 MiB    0.000 MiB                       new_weights[new_weight] += 1
    24                                             else:
    25   11.398 MiB    0.000 MiB                       new_weights[new_weight] = -1
    26   11.398 MiB    0.000 MiB           last_val = -1
    27   11.398 MiB    0.000 MiB           for (k, c) in sorted(new_weights.items(), key=lambda x: x[0]):
    28   11.398 MiB    0.000 MiB               if c == 0:
    29   11.398 MiB    0.000 MiB                   new_val = result[k]
    30   11.398 MiB    0.000 MiB               elif c == 1:
    31   11.398 MiB    0.000 MiB                   new_val = max(result[k], result[k-w] + v)
    32                                         else:
    33   11.398 MiB    0.000 MiB                   new_val = result[k-w] + v
    34   11.398 MiB    0.000 MiB               if new_val > last_val:
    35   11.398 MiB    0.000 MiB                   new_res[k] = new_val
    36   11.398 MiB    0.000 MiB                   last_val = new_val
    37   11.398 MiB    0.000 MiB           result = dict(new_res)
    38                                     #history.append(result)
    39
    40   11.398 MiB    0.000 MiB       return result[max(result.keys())]

无论是运行速度还是内存占用都很有优势啊！（update：发现是我乌龙了……这个Sparse Column实现并没有储存历史记录，无法回溯，所以才占用这么小……要减少子问题的数量还是得用后面的剪枝啊！）另外我还发现cProfile之类的工具分析递归好像很困难……在Sparse Column的实现中每次对key的sort挺消耗时间，改成数组的实现还可以快0.5秒以上。一开始在初始化Hash的时候用了个内部循环changed_weight = dict((k, 0) for k in result.keys())，看了下profiling数据：

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           @profile
    10                                           def knapsack(items, maxweight):
    11                                               #history = []
    12         1            3      3.0      0.0      result = {}
    13         1            2      2.0      0.0      result[0] = 0
    14      2001         3032      1.5      0.0      for i in range(len(items)):
    15      2000         3375      1.7      0.0          v, w = items[i]
    16      2000        23585     11.8      0.1          new_res = {}
    17      2000       926182    463.1      2.9          changed_weight = dict((k, 0) for k in result.keys())
    18      2000       111575     55.8      0.3          new_weights = dict(changed_weight)
    19   1663355      1963616      1.2      6.1          for k in changed_weight:
    20   1661355      2033315      1.2      6.3              new_weight = k + w
    21   1661355      1958858      1.2      6.1              if new_weight <= maxweight:
    22   1279701      1591834      1.2      5.0                  if new_weight in changed_weight:
    23    254768       368270      1.4      1.1                      new_weights[new_weight] += 1
    24                                                           else:
    25   1024933      1316114      1.3      4.1                      new_weights[new_weight] = -1
    26      2000         2371      1.2      0.0          last_val = -1
    27   2688288      5822873      2.2     18.2          for (k, c) in sorted(new_weights.items(), key=lambda x: x[0]):
    28   2686288      3260883      1.2     10.2              if c == 0:
    29   1406587      1813914      1.3      5.7                  new_val = result[k]
    30   1279701      1525904      1.2      4.8              elif c == 1:
    31    254768       455276      1.8      1.4                  new_val = max(result[k], result[k-w] + v)
    32                                                       else:
    33   1024933      1419824      1.4      4.4                  new_val = result[k-w] + v
    34   2686288      3245556      1.2     10.1              if new_val > last_val:
    35   1662811      2142042      1.3      6.7                  new_res[k] = new_val
    36   1662811      1991281      1.2      6.2                  last_val = new_val
    37      2000        98447     49.2      0.3          result = dict(new_res)
    38                                                   #history.append(result)
    39
    40         1           63     63.0      0.0      return result[max(result.keys())]

可以看到17行这个Hash初始化的per hit time高居榜首！我们改一下实现变成changed_weight = dict.fromkeys(result.keys(), 0)：

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           @profile
    10                                           def knapsack(items, maxweight):
    11                                               #history = []
    12         1            5      5.0      0.0      result = {}
    13         1            2      2.0      0.0      result[0] = 0
    14      2001         3095      1.5      0.0      for i in range(len(items)):
    15      2000         3313      1.7      0.0          v, w = items[i]
    16      2000        25807     12.9      0.1          new_res = {}
    17      2000       154480     77.2      0.5          changed_weight = dict.fromkeys(result.keys(), 0)
    18      2000       125462     62.7      0.4          new_weights = dict(changed_weight)
    19   1663355      1756998      1.1      6.1          for k in changed_weight:
    20   1661355      1807104      1.1      6.2              new_weight = k + w
    21   1661355      1739799      1.0      6.0              if new_weight <= maxweight:
    22   1279701      1444483      1.1      5.0                  if new_weight in changed_weight:
    23    254768       355288      1.4      1.2                      new_weights[new_weight] += 1
    24                                                           else:
    25   1024933      1188602      1.2      4.1                      new_weights[new_weight] = -1
    26      2000         2112      1.1      0.0          last_val = -1
    27   2688288      5982674      2.2     20.6          for (k, c) in sorted(new_weights.items(), key=lambda x: x[0]):
    28   2686288      2859058      1.1      9.9              if c == 0:
    29   1406587      1687684      1.2      5.8                  new_val = result[k]
    30   1279701      1332986      1.0      4.6              elif c == 1:
    31    254768       504591      2.0      1.7                  new_val = max(result[k], result[k-w] + v)
    32                                                       else:
    33   1024933      1354702      1.3      4.7                  new_val = result[k-w] + v
    34   2686288      2870422      1.1      9.9              if new_val > last_val:
    35   1662811      1957315      1.2      6.7                  new_res[k] = new_val
    36   1662811      1739647      1.0      6.0                  last_val = new_val
    37      2000       124322     62.2      0.4          result = dict(new_res)
    38                                                   #history.append(result)
    39         1           63     63.0      0.0      return result[max(result.keys())]

效果相当明显啊！从400多降到了77微秒！

另外Python里也可以追踪object的数量，做引用分析之类，鉴于这只是个小脚本……就随便看下效果吧：

(Pdb) import objgraph
(Pdb) objgraph.show_growth()
list                           2195     +2195
function                       1242     +1242
wrapper_descriptor             1046     +1046
builtin_function_or_method      685      +685
method_descriptor               537      +537
dict                            470      +470
weakref                         428      +428
tuple                           396      +396
member_descriptor               193      +193
getset_descriptor               183      +183
(Pdb) objgraph.show_backrefs([result], filename="/Users/Bytes/backrefs.png")

Profiling工具还是相当丰富的嘛！不知道虚拟机，GC什么的能不能像JVM那样细致观察调优……

性能优化工作到这里这还没结束。继续尝试了下剪枝，发现这个数据sample还是效果挺明显的！首先用 Value / Weight 对所有Item进行排序，然后在运算过程中看一下剩下的背包空间再去乘这个比例还会不会有1以上的总Value提升。如果没有的话那之后的所有Item都是小于这个单位重量的价值的，就没有必要继续了。只需要加一个sort和两行判断：

def generate_items(filename):
    with open(filename) as f:
        total_weight, total_item = [int(x) for x in f.readline().strip().split()]
        items = []
        for line in f:
            items.append([int(x) for x in line.strip().split()])
        # Sort the items by value / weight
        sorted_items = sorted(items, key=lambda x: x[0] / x[1], reverse=True)
    return total_weight, sorted_items

剪枝：

...
curr_max_weight = max(result.keys())
if (maxweight - curr_max_weight) * v / w < 1: break
...

这样搞完之后，只需要0.4秒就跑完了！性能提升275x！当然有这么大的效果也是因为数据比较特别，如果所有物品的这个比率都比较接近，加这些判断就反而会拖累运行速度了……所以要具体案例具体分析。

恩，总结一下，性能优化很有趣，上课做作业有人讨论问题的感觉真好哈哈！