optimal merge pattern

🏠

On this page we'll take a look at the optimal merge pattern discussed in this great lecture. Please note we use the optimal merge pattern to build our huffman encoding tree, although in this article we use a more efficient method (at the expense of being less concise).

This time, rather than explicitely making a call to min as we did in our huffman encoding example, we use a heapq .

using a heapq for efficiency

The heapq, a binary tree, remains sorted, and has O(logn) push and pop. Using a heapq to generate our optimal merge pattern is therefore done in O(nlogn), while using min, which takes O(n) would have been in the order of O(n^2) time.

graph TD 5 --> C[C:2] 5 --> D[D:3] 10 --> B[B:5] 10 --> 5 16 --> A[A:6] 16 --> 10
 1import heapq
 2
 3class MergeNode:
 4
 5    def __init__(self, name=0, size=0, l=0, r=0):
 6        self.name = name
 7        self.size = size if size else l.size + r.size
 8        self.l = l
 9        self.r = r
10
11    def __lt__(self, other):
12        return self.size < other.size
13
14    def __repr__(self):
15        return f"[{self.name}:{self.size}{self.l}:{self.r}]"
16
17    def count(self):
18        if self.l or self.r:
19            lc = self.l.count() if self.l else 0
20            rc = self.r.count() if self.r else 0
21            return lc + rc + self.size
22        return 0
23
24def gen_optimal_merge(items, print_tree=False):
25    heapq.heapify(items)
26
27    while len(items) > 1:
28        l = heapq.heappop(items)
29        r = heapq.heappop(items)
30        if print_tree:
31            if l.name:
32                print(f"{l.size+r.size} --> {l.name}[{l.name}:{l.size}]")
33            else:
34                print(f"{l.size+r.size} --> {l.size}")
35            if r.name:
36                print(f"{l.size+r.size} --> {r.name}[{r.name}:{r.size}]")
37            else:
38                print(f"{l.size+r.size} --> {r.size}")
39        heapq.heappush(items, MergeNode( name='', 
40                                    size=0,
41                                    l=l,
42                                    r=r))
43    return next(iter(items))
44
45def main():
46    from itertools import combinations as comb
47    from string import ascii_lowercase as asc
48    from random import randint
49    r = iter(range(1, 1000000))
50    items = [MergeNode(''.join(x), next(r)) for x in comb(asc, 6)]
51    print(f'lists to merge: {len(items)}')
52    print_tree = False
53    if print_tree:
54        print('graph TD')
55
56    optimal_merge = gen_optimal_merge(items, print_tree)
57
58    if print_tree:
59        print(optimal_merge)
60    print(optimal_merge.count())
61
62if __name__ == '__main__':
63    main()

larger merge patterns

Let's try with a slightly larger tree. Please note this is graphed using the same technique as generating callgraphs in python with mermaid.

graph TD 3 --> a[a:1] 3 --> b[b:2] 6 --> c[c:3] 6 --> 3 9 --> d[d:4] 9 --> e[e:5] 12 --> f[f:6] 12 --> 6 15 --> g[g:7] 15 --> h[h:8] 18 --> 9 18 --> i[i:9] 21 --> j[j:10] 21 --> k[k:11] 24 --> l[l:12] 24 --> 12 27 --> m[m:13] 27 --> n[n:14] 30 --> o[o:15] 30 --> 15 33 --> p[p:16] 33 --> q[q:17] 36 --> 18 36 --> r[r:18] 39 --> s[s:19] 39 --> t[t:20] 42 --> u[u:21] 42 --> 21 45 --> v[v:22] 45 --> w[w:23] 48 --> 24 48 --> x[x:24] 51 --> y[y:25] 51 --> z[z:26] 57 --> 27 57 --> 30 69 --> 33 69 --> 36 81 --> 39 81 --> 42 93 --> 45 93 --> 48 108 --> 51 108 --> 57 150 --> 69 150 --> 81 201 --> 93 201 --> 108 351 --> 150 351 --> 201

merging huge sequences

Here we build a optimal merge tree for a quarter of a million merge sequences, on one thread, in roughly 4 seconds.

1shyal:[~/dev/monolith] time python3 tmp.py
2lists to merge: 230230
3465300318788
4
5real   0m4.144s
6user   0m4.004s
7sys    0m0.119s