Error! submit again

# GitHub Contributions Graph: Analyzing PageRank & Proving The 6 Handshakes Theory

source{d} has recently published a dataset with metadata on 462,000,000 commits: data.world. It allows you to build the contributions graph. For example, these are the neighbors around Armin Ronacher:

Neighbors around Armin Ronacher, 2 generations - the repositories he contributed to and their contributors. Armin is in the center. 8k nodes, 11k edges. The graph was produced with Gephi.

The fans on the above image are communities around some popular open source projects, e.g. rust-lang/rust on the bottom.

Neighbors around Armin Ronacher, 3 generations. Armin is again somewhere in the center. 200k nodes, 300k edges.

These are the neighbors around Rob Pike:

Neighbors around Rob Pike, 3 generations, zoomed center region. Nodes are highlighted with the heatmap tool to reflect the distance from Rob. Hubs: upper-left: golang/go*, upper-right: onef9day/gowiki, lower-left: cmars/oo, lower-right: cmars/tools.

Schematic abstraction of the previous graph. Edge weights are the number of commits.

Actually, Rob Pike has never contributed to cmars/oo and cmars/tools. The owner of those repos must have used git filter-branch or something similar to forge the history. This is why it is so hard to distinguish the real contributions in Git world!

The contributions graph is bipartite and is represented by an extremely sparse adjacency matrix:

The square adjacency matrix of the contributions graph. Cij is the number of commits done by developer i to repository j; Cij = Cji.

I continue mining the October 2016 snapshot, at that time the matrix was about 23 million by 23 million and contained 48 million non-zero elements. Of course, these numbers are approximate - we depend on our identity matching here. The identity matching is the way to merge several email addresses into a single personality and it is not an easy task because we have to make assumptions. The public dataset has hashes instead of email addresses so it is impossible to perform the identity matching on that data. You can download the graph here. It is a pickled scipy.sparse.csr_matrix, the following Python code loads it:

import pickle
with open("graph_blog_post.pickle", "rb") as fin:
ndevs = graph.shape[0] - len(repos)
print(repr(graph))
print("Number of developers:", ndevs)
print("Number of repositories:", len(repos))
inv_repos = {r: i + ndevs for i, r in enumerate(repos)}
name = "src-d/go-git"
print("Repository %s has %d contributors, %d commits" % (
name, len(graph[inv_repos[name]].indices), graph[inv_repos[name]].sum()
))

<23223056x23223056 sparse matrix of type '<class 'numpy.float32'>'
with 47866516 stored elements in Compressed Sparse Row format>
Number of developers: 6621684
Number of repositories: 16601372
Repository src-d/go-git has 14 contributors, 169 commits


We ship only the second part of the matrix index, the repository name → index mapping. Please note the following:

1. We were unable to process some large repositories back in October. Particularly, about 20% of the most highly rated ones. That is, unfortunately, there is no data for git/git, torvalds/linux, rails/rails, etc.
2. Some secondary repositories were confused with the main ones. E.g. there is no golang/go but rather some random 4ad/go. That was a bug we’re now fixing.
3. Initially we had 18M repos, but filtered 1.5M duplicates not marked as forks; read this blog post how.

Let’s conduct two funny experiments with our graph:

## The Handshake Theory

The six handshakes theory, or Six degrees of separation, states that everybody in the world can reach out everybody else using the chain of sequentially familiar people of the size smaller than or equal to 6. Supposing that you know all the people who contributed to the same projects you contributed yourself, we can accept or reject this assumption on GitHub contributions graph.

The plan will be as follows.

1. Find all the connected components of the graph.

2. Pick the core component, or simply “the core”. Calculate the size of the representative sample of the pairs of nodes.

3. Calculate the distances between sampled node pairs.

4. Plot the histogram, draw the conclusion.

We need to determine the connected components because it is impossible to find a path between nodes lying in different components and every shortest path algorithm has to scan all the nodes before returning the negative result. We’ve got 23M dots, remember.

#### Connected components

While our graph is directed, every edge has the corresponding backward edge of the same weight, so the weak connectivity automatically means strong connectivity. In other words, we do not need to apply complex algorithms designed to find the strongly connected components, but rather conduct the series of graph traversals. For example, the following code works:

def RFS(v_start):
visited = set()
pending = {v_start}
while pending:
v = pending.pop()
if len(visited) % 500000 == 0:
print(len(visited))
for i in range(graph.indptr[v], graph.indptr[v + 1]):
nv = graph.indices[i]
if nv not in visited:
return visited

unvisited = set(range(graph.shape[0]))
components = []
while unvisited:
v = unvisited.pop()
c = RFS(v)
components.append(c)
if len(components) % 100000 == 0:
print(len(components))
unvisited -= c
print("Number of connected components:", len(components))
clens = list(sorted((len(c) for c in components), reverse=True))
print(clens[:10])


The traversal algorithm is neither depth-first nor breadth-first. It is random-first as Python’s set is an unordered container. Random-first traversal is sufficient for our task of visiting every connected vertex once and is faster. It takes about 5 minutes to execute. The result is:

Number of connected components: 3430141
[10200912, 10229, 8456, 2917, 2910, 2614, 2612, 2604, 2576, 2139]


We’ve the clear core with 10M nodes! How many developers are in it?

c0 = components[0]
core_devs = [m for m in c0 if m < ndevs]
print(len(core_devs))

2153559


That is, the “contributional active” GitHub users are 2.2M or 33% or ⅓ of the whole users we analysed, here we analysed about 73% of all the users who made at least 1 commit, and the official number of users is reported to be 20M (including users without any commits). Thus the final ratio is 11%.

#### Representative sample

We need to determine how many distances between random nodes in the core must be evaluated to achieve a statistically significant distribution. Let’s find a rough estimate. The number of possible distance pairs between the core developers is $$\frac{N_ {core} (N_ {core} - 1)}{2}$$.

print("Number of possible pairs:", len(core_devs) * (len(core_devs) - 1) / 2)

Number of possible pairs: 2318907106461


Now we should use the special calculator to find out the size of the representative sample from 2.3 trillion. For example, checkmarket.com. We set the population size to 2318907106461, the margin of error to 1% and the confidence level to 95% and see roughly 10,000. Not bad! This is how to sample 10k random vertices:

samples = numpy.random.choice(core_devs, 10000 * 2, replace=False).reshape((10000, 2))


#### Distances calculation

There are many shortest path algorithms out there. I can recall the following three from scratch:

1. Floyd–Warshall allows to calculate the distances from everybody to everybody but has a tiny drawback - it works in O(V3) time and eats O(V2) memory. We cannot afford ourselves to store 23M × 23M = 5e14 elements (2 PB) and wait several years.

2. Good ol’ Dijkstra allows to find all the shortest distances from one fixed node to all the rest in O(|E| + |V|log|V|) time and O(|V|) space. It works but it is not efficient on our graph. The problem is with the “generational explosion”, each next node generation is exponentially larger than the previous one, so Dijkstra’s algorithm ends up with inspecting almost all the nodes every time.

3. Bidirectional search spreads simultaneous waves from the both nodes and suits our task best since this algorithm is less sensitive to the “generational explosion”.

Here is the code I used. It leverages the multiprocessing package to calculate many shortest paths in parallel.

def path_length(first, second):
visited_first_set = set()
visited_second_set = set()
visited_first_paths = {}
visited_second_paths = {}
pending_first = [(first, 0)]
pending_second = [(second, 0)]
pending_first_set = {first}
pending_second_set = {second}
pmax = graph.shape[0]
path = pmax

def step(visited_set, visited_paths, pending, pending_set):
v, p = pending.pop(0)
pending_set.remove(v)
visited_paths[v] = p
for i in range(graph.indptr[v], graph.indptr[v + 1]):
nv = graph.indices[i]
if nv not in visited_set and nv not in pending_set:
pending.append((nv, p + 1))

while (path == pmax) and (pending_first or pending_second):
if pending_first:
step(visited_first_set, visited_first_paths, pending_first, pending_first_set)
if pending_second:
step(visited_second_set, visited_second_paths, pending_second, pending_second_set)
isect = visited_first_set.intersection(visited_second_set)
for v in isect:
p = visited_first_paths[v] + visited_second_paths[v]
path = min(path, p)
return path / 2

from multiprocessing import Pool
with Pool() as p:
paths = p.starmap(path_length, samples)


The script took less than 10 minutes to finish on our 32-core machine with 256 gigs of RAM. Each process takes less than 4GB of RAM which means 32 processes need less than 128GB.

We divide each path by 2 because the nodes corresponding to repositories should not be taken into account in the shortest path’s size calculation.

#### The histogram

85% of the core is less than or equal to 6 handshakes from each other. The average is 4.9, the maximum is 18. The average is close to what Facebook had in 2011. ∎

## GitHub PageRank

As soon as we have the contributions graph, we want to find out whose GitHub is bigger. The natural way of doing this is to calculate some centrality measure. We decided to study eigenvector centralities which make nodes important if they are referenced by other important nodes.

Armin Ronacher’s contributions, simplified PageRank version.

$$x_ {AR} = \frac{w_ 1 x_ 1 + w_ 2 x_ 2 + w_ 3 x_ 3 + …}{\lambda} = \frac{1}{\lambda} \sum w_ i x_ i$$

$$w_ i$$ depend on the impact to the project or to the ratio of the developer’s efforts. If our adjacency matrix contained strictly positive elements (everybody is connected to everybody) then by Perron–Frobenius theorem $$\vec{x}$$ is the eigenvector of this matrix which corresponds to the greatest eigenvalue. Moreover, if the matrix is stochastic, that is, the sum of values in every column equals to 1, $$\lambda=1$$. The greatest eigenvalue and the corresponding eigenvector can be found by various efficient methods such as power iteration or Arnoldi iteration.

Unfortunately, this is not applicable to our case: we’ve got a sparse matrix with a lot of zeros. Practically speaking, the greatest eigenvalue corresponds to an eigenvector with negative elements. We need to do something. Luckily, we are not the only ones who hit this problem. So did Larry and Sergey back in 1998. They studied the similar adjacency matrix of WWW pages. Each element $$C_ {ij}$$ is 1 if page i links to page j and 0 otherwise. Larry and Sergey found the nice trick to make the adjacency matrix well-conditioned and suitable for * iteration methods which they called PageRank.

There is a good explanation of PageRank in Stanford’s CS246. I am not going to repeat it but rather briefly describe how the trick applies to our centrality problem using the terms from CS246.

1. We normalize each column to 1: $$\displaystyle \sum_ i{C_ {ij}} = 1$$. Now the matrix is not symmetric and $$C_ {ij} \neq C_ {ji}$$, however, it becomes stochastic.
2. We multiply all the elements by $$\beta$$ and add $$\frac{1-\beta}{N} E$$ where $$E$$ is N by N matrix filled with 1 (not the identity matrix!). $$\beta$$ is the so-called dampening or random walk factor and can be set to 0.85. Our matrix becomes acyclic and irreducible, aka the Google matrix.
3. We run power iteration and find PageRank values for every node.

#### Normalization

The following code performs the L1 normalization.

norms = 1 / numpy.asarray(graph.sum(axis=1)).ravel()
from scipy.sparse import diags
graph_normed = H = graph.dot(diags(norms, format="csc"))


Here is what happens: $$norms_ i = \frac{1}{\sum\limits_ j C_ {ij}}$$
$$diags = \left( \begin{array}{cccc} norms_ 0 & 0 & … & 0 \\ 0 & norms_ 1 & … & 0 \\ … & … & \ddots & … \\ 0 & 0 & … & norms_ {N-1} \end{array} \right) \\$$
$$H=\left( \begin{array}{cccc} C_ {0,0} & C_ {0,1} & … & C_ {0,N-1} \\ C_ {1,0} & C_ {1,1} & … & C_ {1,N-1} \\ … & … & \ddots & … \\ C_ {N-1,0} & C_ {N-1,1} & … & C_ {N-1,N-1} \end{array} \right)\times diags=$$ $$=\left( \begin{array}{cccc} \frac{C_ {0,0}}{norms_ 0} & \frac{C_ {0,1}}{norms_ 1} & … & \frac{C_ {0,N-1}}{norms_ {N-1}} \\ \frac{C_ {1,0}}{norms_ 0} & \frac{C_ {1,1}}{norms_ 1} & … & \frac{C_ {1,N-1}}{norms_ {N-1}} \\ … & … & \ddots & … \\ \frac{C_ {N-1,0}}{norms_ 0} & \frac{C_ {N-1,1}}{norms_ 1} & … & \frac{C_ {N-1,N-1}}{norms_ {N-1}} \end{array} \right)$$ scipy.sparse API makes the diagonal matrix multiplication the only way to normalize the columns.

$$G = \beta H + \frac{1-\beta}{N}\left( \begin{array}{cccc} 1 & 1 & … & 1 \\ 1 & 1 & … & 1 \\ … & … & \ddots & … \\ 1 & 1 & … & 1 \end{array} \right)$$ We are not going to ever calculate it explicitly because that would involve storing N × N elements! Instead, we will create a custom power iteration algorithm.

#### Power iteration

The idea of the power iteration method is dead simple: if we want to find the vector $$\vec{x}$$ such that $$G\cdot \vec{x} = \vec{x}$$ then we repeat the same matrix multiplication until the convergence: $$x_ {i+1} = G\cdot x_ i$$ Provided that the matrix is well-conditioned, the convergence is guaranteed. Here is the code which calculates $$\vec{x}$$.

def page_rank(m, beta=0.85, niter=80):
N = m.shape[0]
x = numpy.ones(N, dtype=numpy.float32) / N
for i in range(niter):
x_next = m.dot(x) * beta
x_next += (1 - beta) / N  # ***
xdiff = numpy.linalg.norm(x - x_next, ord=1)
x = x_next
print("iter #%d: %f" % (i + 1, xdiff))
if i % 10 == 0:
print(x)
return x

page_rank(graph_normed)


# *** every edge has the opposite one, so there are no “dead ends”. Otherwise, we would have to replace beta with x_next.sum().

page_rank() executes fast and the result is returned within a minute.

#### Who is the most important on GitHub?

Ryan Baumann aka ryanfb! He is eventually a good example how to study how GitHub’s cache works (spoiler: it invalidates after some time).

github.com/ryanfb takes several seconds to be generated.

Ryan knows what it means to be productive.

I am joking, of course. Although he is on the top, he is actually the living illustration of the weaknesses PageRank algorithm has. There used to be days when Ryan created 1,000 repositories and of course he is not a super human - they were automatically generated. Yet he got many incoming links and PageRank rated him the highest.

The essential move would be to ignore repositories with a single contributor, and that definitely helps, though other fun effects still reflect the light. After all, only the ratio of PageRank-s makes sense. For example, after the mono repository filtering, Armin Ronacher has 8.0e-6, our CTO Maximo Cuadros has 2.6e-6 and I have 1.6e-6.

#### What is the most important on GitHub?

Here is the descending top 200:

It’s a pity we were unable to process some important repos like torvalds/linux back in October, as I noted in the beginning, so they are absent. bmorganatlas2/firstrepo got to the top due to the failure of the identity matching: we’ve got the clique which is PageRank’s Achilles’ heel. This is not much of our fault: the guy created 30,000 commits, each under a different, obviously fake, author. How were we supposed to realize it without using GitHub API? The same failure happened with bmorganatlas/fusiontest5 (besides, it has 10,000+ branches!). Other repositories seem to be legit and typically have thousands of contributors, hence high PageRank. We are currently building out a large data pipeline that will increase the quality of our data and expand it beyond GitHub to almost all git repositories on the web.

## Summary

This post studies the contributions graph which source{d} collected from GitHub repositories. The October 2016 dataset is open on data.world and can be used to build it, however, it can be downloaded fully baked from our Google Drive. We’ve had fun with proving the theory of 6 handshakes between GitHub core developers and measuring the community members’ importance by applying PageRank algorithm.

## Update

This is what you get if you visualize the Google matrix with LargeVis:

I used hexbin from matplotlib:

hexbin(graph[:, 0], graph[:, 1],
gridsize=2048, bins="log", cmap="inferno")