Given below are 5 friends and how they are connected. If I ask you to find the shortest path for Peter to reach Tom then you will immediately answer that he should go through Sachin [Peter -> Sachin -> Tom]. It requires some effort if you want to make the computer answer the same question. In this post let us see how a computer answers this question.
We need a way to store the names of the friends and their connections. Graph is a data structure which is used to store this information. Using the friends image let us understand the vocabulary used in graphs.
Vertex – Each circle in the image is called as a vertex. Since there are 5 circles there are 5 vertices.
Edge – Peter and Sachin are friends since they are directly connected. This connection is called as an edge. There are 6 edges in total. In the image Peter -> Sachin also implies that Sachin -> Peter. The edge does not have a direction and hence called as an undirected edge. You can find undirected edges in Facebook and LinkedIn. I have a twitter account and I follow Warren Buffett. But he does not follow me. In this case the edge has a direction [Me -> Buffett] and hence it is called as directed edge.
Edge Weights – Peter is connected to his best friend Sachin and he can chose to give more importance to their friendship by giving it a weight of 10 (best). Rahul is a normal friend of Peter and he can chose to give a weight of 1 (normal). If the edge is weighted then it is called as weighted edge. If not it is an unweighted edge. In this example we are dealing with unweighted edge as all the edges are treated alike.
Path – A sequence of vertices that are connected by edges is called as a path. [Peter -> Sachin -> Tom] is one path.
A graph is defined as a collection of vertex and edges. An undirected graph with n vertices can have up to n * (n – 1) / 2 edges. Given below is a proof for that.
Consider a graph with n vertices and a vertex does not connect to itself. Each vertex can connect to (n - 1) other vertices. Total no of edges = (n - 1) + (n - 1) + ... + (n - 1) [n times] Total no of edges = n * (n - 1) Since A -> B and B -> A is counted twice we have to divide it by 2 Total no of edges = [n * (n - 1)] / 2
If a graph has few edges then it is called as sparse graph. If it has a lot of edges close to maximum n * (n – 1) / 2 then it is called as dense graph.
There are two ways to represent a graph in memory – adjacency matrix and adjacency list.
Adjacency Matrix
You can represent the graph as a 2 dimensional matrix as shown below. Peter and Rahul are connected and hence you can find X mark in Peter -> Rahul and Rahul -> Peter. Since Peter is not connected to Tom, I have left their intersection cell empty. Adjacency matrix is very easy to understand and implement. But there is one problem. For 5 people we created a matrix with 25 entries (5 *5). For 1000 people we need to create a matrix of 1,000,000 entries (1000 * 1000). Most of these cells are going to be empty and hence we will be wasting a lot of memory. Hence this approach is rarely used. If you are dealing with a dense graph with edges close to maximum n * (n – 1) / 2 then you can use this approach.
Peter | Sachin | Rahul | Tom | Tim | |
Peter | X | X | |||
Sachin | X | X | X | ||
Rahul | X | X | X | ||
Tom | X | X | |||
Tim | X | X |
Adjacency List
Each vertex will only store information about other vertices it is directly connected to. The edges are captured as a list and hence the name adjacency list. Any time you are dealing with a sparse graph you should use adjacency list. Most of the real life use cases deal with sparse graph. Hence I will be using adjacency list.
Given below is the code in Python to represent a vertex. The name of the person is stored in the variable key. The friends that each person directly connects to is stored in the variable neighbors.
class Vertex: def __init__(self, key): self.key = key self.neighbors = [] #Adjacency list def getKey(self): return self.key def addNeighbor(self, n): self.neighbors.append(n) def getNeighbors(self): return self.neighbors def __str__(self): return str(self.key) + ' -> ' + str([x.key for x in self.getNeighbors()])
Given below is the code to represent a graph. It contains a dictionary with the name as the key and vertex as its value. If the graph is undirected then the variable directed is set to false otherwise it is set to true.
class Graph: def __init__(self, directed=False): self.vertices = {} self.directed = directed def addVertex(self, key): vertex = Vertex(key) self.vertices[key] = vertex return vertex def getVertex(self, key): if key in self.vertices: return self.vertices[key] return None def isDirected(self): return self.directed def getVertices(self): return self.vertices.keys() def addEdge(self, v1, v2): if v1 not in self.vertices: self.addVertex(v1) if v2 not in self.vertices: self.addVertex(v2) self.getVertex(v1).addNeighbor(self.getVertex(v2)) if(not self.directed): self.getVertex(v2).addNeighbor(self.getVertex(v1)) def __str__(self): s = [] for x in self.vertices: s.append(str(self.vertices[x])) return str(s) #Code to create the graph as given in the image g = Graph() g.addEdge('Peter', 'Sachin') g.addEdge('Peter', 'Rahul') g.addEdge('Sachin', 'Rahul') g.addEdge('Sachin', 'Tom') g.addEdge('Rahul', 'Tim') g.addEdge('Tim', 'Tom')
We have everything to find the shortest path for Peter to reach Tom. In order to do this we need to explore (traverse) the graph. There are two ways to traverse the graph (1) Breadth First Search (2) Depth First Search.
Breadth First Search
Let us assume that Peter is getting married. He will personally invite his direct friends Sachin and Rahul. He also requests Sachin and Rahul to invite their friends. Sachin has 3 friends [Peter, Rahul, and, Tom]. Peter and Rahul already has the invitation. Hence Sachin will invite Tom. Rahul also has 3 friends [Peter, Sachin, and, Tim]. Peter and Sachin already has the invitation. Hence Rahul will invite Tim. Tom has 2 friends [Sachin and Tim] but both of them already has an invitation. Hence he need not invite anyone. Tim also has 2 friends [Rahul and Tom] but both of them already has an invitation. Hence he need not invite anyone.
We are traversing the graph by radiating out slowly from the starting vertex Peter. Hence this kind of traversal is called as Breadth First Search (BFS). Each vertex is processed in the First-In-First-Out (FIFO) order and hence we will use queue data structure for processing. For each vertex we need to know if it is (1) discovered (2) processed and we will maintain two dictionaries for this. We start the search with Peter and add him to the discovered vertex.
def breadthFirstSearch(graph, start, handler): discovered = {} processed = {} parents = {} for k in graph.getVertices(): discovered[k] = False for k in graph.getVertices(): processed[k] = False processQueue = [graph.getVertex(start)] discovered[graph.getVertex(start)] = True while len(processQueue) > 0: v = processQueue.pop(0) handler.processVertexEarly(v) processed[v] = True for nbr in v.getNeighbors(): if((nbr not in processed) or (graph.isDirected())): handler.processEdge(v, nbr) if(nbr not in discovered): processQueue.append(nbr) discovered[nbr] = True parents[nbr.getKey()] = v.getKey() handler.processVertexLate(v) return parents
Given below is the code to call breadthFirstSearch with Peter as the starting vertex. Methods on the Handler class gets invoked when a vertex/edge is processed. The Handler class separates the logic of traversal and processing.
class Handler: def processVertexEarly(self, v): print 'Processing Vertex Early: ' + v.getKey() def processEdge(self, v1, v2): print 'Processing Edge: ' + str(v1.getKey()) + ':' + str(v2.getKey()) def processVertexLate(self, v): print 'Processing Vertex Late: ' + v.getKey() print '***Breadth first search***' parents = breadthFirstSearch(g, 'Peter', Handler())
Output from BFS starting at vertex Peter.
Processing Edge: Peter:Sachin Processing Edge: Peter:Rahul Processing Edge: Sachin:Rahul Processing Edge: Sachin:Tom Processing Edge: Rahul:Tim Processing Edge: Tom:Tim
While doing BFS we store the parent vertex for each vertex. The parent for each vertex is
Tim parent is Rahul Tom parent is Sachin Sachin parent is Peter Rahul parent is Peter
The graph is redrawn with only parent edges.
Following the parent edges you are guaranteed to find the shortest path from Peter to any vertices. The parent for Tom is Sachin and the parent for Sachin is Peter. The shortest path from Peter to Tom is [Peter -> Tom -> Sachin]. For this logic to work the edges should unweighted and the search should start from Peter.
#BFS finds the shortest path. No weights should be present for this to work. def findPath(start, end, parents, path): if(start == end or end == None): path.append(start) else: findPath(start, parents[end], parents, path) path.append(end) path = [] findPath('Peter', 'Tom', parents, path) print '***Finding path***' print path #Output of path: ['Peter', 'Sachin', 'Tom']
Depth First Search
Depth First search (DFS) is another way for traversing a graph. You start at a vertex and explore as far as possible along each branch before backtracking. The image given below shows the difference between DFS and BFS traversal. In DFS we we will use recursion to take care of backtracking.
Let us understand DFS by solving a problem. I wanted to complete a project which has 4 tasks. Task 2 and 3 can only be started after Task 1 is completed. Task 4 can only be started after Task 2 and Task 3 are completed. How many ways can I complete the project?
The project can be completed in 2 ways [Task 1 -> Task 2 -> Task 3 -> Task 4] or [Task 1 -> Task 3 -> Task 2 -> Task 4]. DFS can be used to solve this problem to tell one of the ways to complete the project. This problem is known as topological sort. For doing a topological sort the graph should be (1) Directed (2) Acyclic (no cycles).
A topological sort or topological ordering of a directed graph is a linear ordering of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering.
The code is similar to BFS except (1) Stack (Recursion) is used to keep track of vertex processing (2) Processing of the child vertex gets completed before the parent vertex.
discovered = {} processed = {} parents = {} def depthFirstSearch(graph, start, handler): for k in graph.getVertices(): discovered[k] = False for k in graph.getVertices(): processed[k] = False return depthFirstSearchInternal(graph, graph.getVertex(start), handler) def depthFirstSearchInternal(graph, v, handler): handler.processVertexEarly(v) discovered[v] = True for nbr in v.getNeighbors(): if nbr not in discovered: parents[nbr] = v handler.processEdge(v, nbr) depthFirstSearchInternal(graph, nbr, handler) # recursion elif ((nbr not in processed) or (graph.isDirected())): handler.processEdge(v, nbr) # Parent vertices are finished after the child vertex handler.processVertexLate(v) processed[v] = True
In the class TopologySortHandler we have a stack. In DFS processing of child vertex gets completed before the parent vertex. In the processVertexLate() we push to vertex into the stack.
class TopologySortHandler: def __init__(self, stack): self.stack = stack def processVertexEarly(self, v): print 'Processing Vertex Early: ' + v.getKey() def processEdge(self, v1, v2): print 'Processing Edge: ' + str(v1.getKey()) + ':' + str(v2.getKey()) def processVertexLate(self, v): print 'Processing Vertex Late: ' + v.getKey() self.stack.insert(0, v) # Insert the vertex in the stack g = Graph(True) g.addEdge('Task 1', 'Task 2') g.addEdge('Task 1', 'Task 3') g.addEdge('Task 2', 'Task 4') g.addEdge('Task 3', 'Task 4') print '***Depth first search***' stack = [] depthFirstSearch(g, 'Task 1', TopologySortHandler(stack)) for l in stack: print l.key
Given below is the order in which vertices are pushed into the stack. Stack uses last-in-first-out ordering. Hence popping items from the stack will produce the output [Task 1 -> Task 3 -> Task 2 -> Task 4].
Processing Vertex Late: Task 4 Processing Vertex Late: Task 2 Processing Vertex Late: Task 3 Processing Vertex Late: Task 1
Graphs are powerful data structures which are used to solve complicated problems. In this post I scratched the surface of it. I did not cover weighted graphs and shortest path algorithms. If you are really interested in learning about them you can view the online lectures from MIT open courseware.