Given below is the information about employees in a software company. The company has 100 employees.

Id |
Name |
Date of Birth |

1 | Tom | 10-Nov-1980 |

2 | Peter | 12-Oct-1981 |

3 | Alex | 08-Jun-1979 |

… | ||

100 | Jian | 10-Jan-1977 |

If you know the employeeId how fast can you retrieve their information? If we store all the employee information in an array of size 100 and use the employeeId as an **index** into the **array** then we can access their information in constant time **O(1)**. You cannot do better than this.

## Hashing

One fine day the company decides to use the **social security number (SSN)** as the employee identifier. SSN has 9 digits and it is of the format **000-00-0000**. For simplicity let us assume that all the digits from 0 to 9 are possible in all the 9 positions. This gives us **10 ^ 9** permutations which results in 1 billion possible numbers. Using an array of size 1 billion to store 100 employee information is not the correct solution. We need a better way. What if we keep the array size as 100 and somehow map the SSN into the array index in a deterministic way. We can use the modulo operator to solve this problem.

arrayIndex = SSN mod arraySize 89 = 123456789 mod 100

The mod operator guarantees that the reminder will be between **0 and arraySize – 1**. The function that does this mapping is called as the **hash function**. In this example SSN happened to be a number and hence we can use the modulo operator. In real life we have identifiers that are strings. How do we use modulo operator on strings? You need to convert strings to number and then use the modulo operator. Let us try to convert the string “JAVA” to a number.

Character |
Number |

A | 1 |

B | 2 |

C | 3 |

… | |

Z | 26 |

I created a simple mapping table for each character. What if we add their corresponding numbers?

J = 10 A = 1 V = 22 A = 1 10 + 1 + 22 + 1 =34

For **JAVA** we get **34** and we can apply the modulo operator on this number. But there is a problem. Let us take the identifier **ZZZZZZZZZZ**. By applying the addition rule we get **26 * 10 = 260**. This means that all the keys up to 10 characters of length should fit into an array of size 260. If all the characters are possible then the total identifiers will come to **26 ^ 10**. Fitting at least a fraction of them in an array of size 260 will not result in constant time operation. Is there a better way? What if we borrow the idea of place value notation from numbers. If you move from left to right each digit position value is 10 times as big as the position to its right.

723 = 7 * 10 ^ 2 + 2 * 10 ^ 1 + 3 * 10 ^ 0 723 = 700 + 20 + 3

Applying the same logic for JAVA we get 177009 and hence we can the modulo operator on keys that are strings.

JAVA = 10 * 26 ^ 3 + 1 * 26 ^ 2 + 22 * 26 ^ 1 + 1 * 26 ^ 0 177009 = 175760 + 676 + 572 + 1

## Collisions

Let us take two employee identifiers 22 and 32 and the array size is 10. The modulo operator for both these identifiers will be 2.

22 mod 10 = 2 32 mod 10 = 2

Two different keys getting hashed to the same index is called as **hash collision**. This situation is unavoidable. We need a strategy to handle this situation. One strategy is called as **separate chaining**. The array index will not store the employee information. Instead it will point to a list which contains the employee information. Take a look at the following data.

Id |
Name |

16 | Tim |

32 | Alice |

19 | Peter |

31 | Jian |

Let us assume that the array size is 8

- Tim identifier is 16. When we 16 mod 8 we get the index as 0. We add the information to the list at position 0.
- Alice identifier is 32. When we 32 mod 8 we get the index as 0. We add Alice information after Tim [
**key collision**] - Peter identifier is 19. When we 19 mod 8 we get the index as 3. We add the information to the list at position 3.
- Jian identifier is 31. When we 31 mod 8 we get the index as 7. We add the information to the list at position 7.

**Key operations on hash table**

Following are the key operations

Operation |
Description |
Time |

put(key, value) | Add the key and the value. Overwrite the value if the key already exists | O(1) |

get(key) | Get the value corresponding to the key | O(1) |

remove(key) | Removes the key and its corresponding value | O(1) |

The next question is how to represent the hash table. From the above image it is very clear that we can represent it using a list of lists. Given below is the code written in Python to create the hash table. This code is not thread safe. In reality you should use **dictionaries** in python. In real projects if you write your own hash table most likely you are doing something wrong. I did this for better understanding.

**Node and Hash table data structures**

class Node: def __init__(self, key, value): self.key = key self.value = value class Hashtable: def __init__(self, size = 10): self.buckets = [] for i in range(size): self.buckets.append([]) self.size = size ...

**getBucket()**

def getBucket(self, key): return hash(key) % self.size

**Put() and Get()**

def put(self, key, value): bucketNo = self.getBucket(key); l = self.buckets[bucketNo] if len(l) == 0: n = Node(key, value) l.append(n) else: keyFound = False for n in l: if n.key == key: n.value = value keyFound = True if not keyFound: n = Node(key, value) l.append(n) def get(self, key): bucketNo = self.getBucket(key) l = self.buckets[bucketNo] if len(l) != 0: for n in l: if n.key == key: return n.value

**Remove()**

def remove(self, key): bucketNo = self.getBucket(key) l = self.buckets[bucketNo] if len(l) != 0: for n in l: if n.key == key: l.remove(n) break

## Rehashing

The goal of the hash table is to give constant access time for its key operations. In order to do that we need to have a good hash function. Writing a good hash function is very hard. The code that I have written is for understanding purposes only. In real projects use the inbuilt function. The characteristic of a good hash function are.

- It should minimize collisions
- Distribute the keys evenly in the hash table
- Keep the load factor under 1

What is a load factor? Hash table contains N elements distributed over an array of M slots. Load factor is defined as

Load Factor = N / M

If N is greater M then it is very hard to get constant time operations. Hence if the load factor is greater than some ratio then the array size gets doubled. This process of increasing the array size is called as **rehashing**. When this happens you need to move the contents to different buckets in the hash table. Why? With a different array size the hash code for the keys will change.

Ideal load factor is between 0.6 and 0.8. Let us take a hash table of array size 64. If it contains 32 elements then the load factor is 32 / 64 = 0.5. Given below is the code which does rehashing. The code is not thread safe.

def rehash(self): oldBuckets = self.buckets newSize = self.size * 2 newBuckets = [] for i in range(newSize): newBuckets.append([]) self.size = newSize self.buckets = newBuckets for l in oldBuckets: for n in l: self.put(n.key, n.value)

## Problems

**1. You have an unordered array of unique numbers. Write a program to check if there exists a pair of number whose sum add up to the input number. **

Consider the array [1, 2, 3, 4, 5] and the input number 3. The method should return true as 1 + 2 will result in 3. For the input number 22 it should return false as there is no pair that adds up to 22. Using hash tables we can solve this problem in linear O(n) time. In the code I am using python dictionaries.

def chekIfPairWithSum(l, n): d = {} for i in l: d[i] = i for i in l: o = n - i if o != i and o in d: return True return False

**2. You have two unordered arrays. Find out if the contents of the arrays are the same.**

Consider two arrays [1, 2, 2, 3] and [2, 2, 3, 1]. For this input the method should return true. For this case [1,2,3,3] and [2, 1, 1] the method should return false. With hash tables you can solve the problem in linear time.

def checkIfArrayContentsAreSame(l1, l2): d = {} for i in l1: if i in d: d[i] = d[i] + 1 else: d[i] = 1 for i in l2: if i in d: d[i] = d[i] - 1 else: return False for v in d.values(): if v != 0: return False return True

If you have a key and want constant time operations you need to think of hash tables.

Nice. Looking forward to more technical articles like this… Raja