Heaps are simple data structures for efficiently supporting the priority queue operations. Min or Max item can be found in O(1) constant time. The operations supported by heaps are
- Find Min/Max
- Remove Min/Max
Min and Max are symmetrical. Let us start with the a problem and solve it in different ways. In the process we will see how heaps solves the problem elegantly. The example code is written in Python language.
Top K Problem
You have n unique items and you need to identity the Top k of them.
The easiest way to solve this is to sort n items in the descending order and pick up the first k items. Lots of problems in Computer Science can be solved by Sorting. This takes O(n* log n) time. Can we improve this?
def topKBySorting(l,k): s = sorted(l,reverse=True) return s[:min(k,len(l))] l = [5,1,20,2,25,3,44,22,12] topK = 5 topKResults = topKBySorting(l,topK) assert [44,25,22,20,12] == topKResults print 'Using Sorting Top', topK , 'from', l , 'is', topKResults #Output: Using Sorting Top 5 from [5, 1, 20, 2, 25, 3, 44, 22, 12] is [44, 25, 22, 20, 12]
In the selection method
- Find the maximum item from n
- Remove the maximum and append it to the top k list
- Repeat step 1 until k items are found
Finding max takes O(n) and since we do this k times the total time taken is O(k * n). Is this better than O(n * log n)?. For small values of k, O(k * n) would be better. But as k grows bigger O(n * log n) is going to very fast. This means if k = log n the time taken for Selection and Sorting will be the same. If (k > log n) then Sorting it be better. Remember Log base 2 for 1 billion is approximately 30. This means we need a better algorithm. Can we improve this?
def topKBySelection(l,k): topk =  while len(topk) < k and len(l) != 0: max = l maxIndex = 0 for i in range(1,len(l)): if l[i] > max: max = l[i] maxIndex = i topk.append(max) del l[maxIndex] return topk l = [5,1,20,2,25,3,44,22,12] topK = 5 topKResults = topKBySelection(list(l),topK) assert [44,25,22,20,12] == topKResults print 'Using Selection Top', topK , 'from', l , 'is', topKResults #Output: Using Selection Top 5 from [5, 1, 20, 2, 25, 3, 44, 22, 12] is [44, 25, 22, 20, 12]
If we do not care of about the ordering in Top k, turns out that we can improve the performance tremendously. This runs in O(n) time. I had hard time in understanding the recurrence relation for this. It runs fast because of the condition that Top k is in a random order. It works the following way
- Randomly pick up an element r from n
- From n, place smaller elements to the right of r
- From n, place bigger elements to the left of r
- The bigger elements to the left of r will fall into 4 cases as explained in the code
import random def partition(l,v): smaller =  bigger =  for i in l: if i < v: smaller.append(i) elif i > v: bigger.append(i) return (bigger,[v],smaller) def topkByPartition(l,k): v = l[random.randrange(len(l))] (left,middle,right) = partition(l,v) if len(left) == k: #Case 1 return left elif len(left) + 1 == k: #Case 2 return left + [v] elif len(left) > k: #Case 3 return topkByPartition(left,k) return left + [v] + topkByPartition(right,k - len(left) - 1) #Case 4 l = [5,1,20,2,25,3,44,22,12] topK = 5 topKResults = topkByPartition(l,topK) assert 5 == len(topKResults) for i in [44,25,22,20,12]: #Since the order is random I am using the in operator assert i in topKResults print 'Using Partition Top', topK , 'from', l , 'is', topKResults #Output: Using Partition Top 5 from [5, 1, 20, 2, 25, 3, 44, 22, 12] is [25, 44, 22, 20, 12]
Alright we used 3 different ways to solve the Top k problem and the best we can do is O(k * n)[when k < (log n)]. Can we do better? Heaps can solve it in O(n * log k). Before solving this problem let us understand how does heaps work. Given below is an example of Max Heap. The root will have the maximum value.
Following are the properties of Heap.
- Balanced Binary tree. Except for the nodes at the leaf level all other nodes should be complete
- The root will contain the maximum value in Max Heaps. For Min heaps it will contain the minimum value.
- It is partially ordered. Value of the parent will be greater than all its children(Max Heaps).
Heaps can also be represented by Arrays. By not using explicit pointers, some memory is saved. Parent/Child relationship can be determined by using the array indices. To simplify the calculations I am using the index 0 as a sentinel in the code.
#pos is the index(location) in the array def parent(pos): if pos <= 1: return -1 return pos/2 # l is the list(array) def leftChild(l,pos): p = 2 * pos if p < len(l): return p return -1 # l is the list(array) def rightChild(l,pos): p = leftChild(l,pos) if p != -1: p = p + 1 if p < len(l): return p return -1
For inserting an element into the heap
- Add the element to the end of the list
- If the value of the new element is greater than its parent, then swap
- Go to the new parent location and repeat
Insert operation takes O(log n) time. This technique is called as bubbleUp. The code given below is for Max Heaps. Changing it to be a Min Heap should be trivial.
# Add the value to the end of the list and then bubble up def insert(l,value): newPos = len(l) l.append(value) bubbleUp(l,newPos) def bubbleUp(l,pos): p = parent(pos) if p == -1: return if l[pos] > l[p]: l[pos] , l[p] = l[p] , l[pos] bubbleUp(l,p)
Finding the max is a constant operation O(1).
# The top most element. Remember the first element is sentinel def findMax(l): if len(l) >= 2: return l return None
Removing the element takes O(log n) time. The technique used is bubbleDown.
- Take the first element in the list.
- Replace it with the last element
- Compare the value with its two children and if it is lesser then swap
def remove(l): max = findMax(l) if max == None: return None if len(l) == 2: del l else: n = len(l) - 1 l = l[n] del l[n] bubbleDown(l,1) return max def bubbleDown(l,pos): v = l[pos] maxPos = pos leftChildPos = leftChild(l,pos) rightChildPos = rightChild(l,pos) if leftChildPos != - 1 and l[leftChildPos] > v: if rightChildPos == -1 or l[leftChildPos] > l[rightChildPos]: maxPos = leftChildPos else: maxPos = rightChildPos elif rightChildPos != -1 and l[rightChildPos] > v: maxPos = rightChildPos if maxPos != pos: #Change in position. swap and then continue l[pos], l[maxPos] = l[maxPos],l[pos] bubbleDown(l,maxPos)
Sample code using the insert,find and remove methods
# sentinel value to make parent/child pointer calculations easier l =  insert(l,12) insert(l,10) insert(l,100) insert(l,2) print 'List is' ,l print 'Should be 100 and got',findMax(l) print 'Should be 100 and got',remove(l) print 'Should be 12 and got', remove(l) print 'Should be 10 and got',remove(l) print 'Should be 2 and got',remove(l) print 'Should be None and got', remove(l) #Output #List is [0, 100, 10, 12, 2] 0 is the sentinel value #Should be 100 and got 100 #Should be 100 and got 100 #Should be 12 and got 12 #Should be 10 and got 10 #Should be 2 and got 2 #Should be none and got None
Now back to the original problem of finding the Top k from n items.
- Create a Min Heap. The code above uses a Max Heap. The change should be trivial
- Insert the first k items from n to the heap
- The top most position is the minimum value
- For each of the remaining n – k items if it is greater than min;swap and then bubbleDown
def topKUsingMinHeaps(p,topK): l =  # sentinel for i in range(0,topK): insert(l,p[i]) #Store the first k items to the heap for j in range(topK,len(p)): #For the rest of the n - k items if p[j] > findMin(l): #If it is bigger than min l = p[j] #replace bubbleDown(l,1) #bubbleDown so that min is at the top top =  for i in range(0,topK): top.append(remove(l)) return top x = [5,1,20,2,25,3,44,22,12] topK = 5 l = topKUsingMinHeaps(x,topK) for i in range(len(l) - 1,-1,-1): print l[i], #Output: 44 25 22 20 12
Each remove operation takes (log k) time and hence for n items it should take n * (log k). This is much better than all the other approaches.
(n * log k) < (n * k) < (n * log n)
Remember Heaps when you encounter problems for finding Top k.