Software Heritage: Analyzing the Global Graph of Public Software Development

December 7, 2022


Campus de Beaulieu, Salle Jersey, building 12D

Talk by Stefano Zacchiroli, faculty member at Télécom Paris,
Polytechnic Institute of Paris, as part of the Computer Science department seminar series.


The Software Heritage project has assembled the largest existing archive of publicly available software source code and associated development history, comprising more than 10 billion unique source code files and 2 billion unique commits, coming from more than 190 million software development projects.

In this talk we will review the project's background and current status, with a focus on its graph-based data model and its research applications. The archive is a Merkle DAG whose nodes stand for source code development artifacts such as source files, code trees, commits, releases, and version control system (VCS) snapshots. The graph is typed, fully deduplicated, and global, which makes it possible to keep track of all the different places (e.g., different VCS repositories) from which a given artifact has been distributed. The graph is huge, with about 20 billion nodes and 200 billion edges, and it is growing exponentially, doubling in size every 2 years.
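
To make the data model concrete, here is a minimal sketch of a typed, deduplicated Merkle DAG. It is an illustration only, not the archive's implementation: the `MerkleDAG` class, the `import_repo` helper, the node type names, the example URLs, and the use of SHA-256 are all assumptions made for this example, and the archive's real identifier scheme is not reproduced here. Each node's identifier is a hash of its type, its payload, and the identifiers of its children, so identical artifacts collapse into a single node regardless of where they were found.

```python
import hashlib


class MerkleDAG:
    """Toy content-addressed graph (hypothetical, for illustration only)."""

    def __init__(self):
        self.nodes = {}    # node id -> (node type, payload, sorted child entries)
        self.origins = {}  # node id -> set of origin URLs where it was observed

    def add(self, node_type, payload=b"", children=()):
        """Insert a node and return its id; re-adding an identical node is a no-op."""
        entries = tuple(sorted(children))  # (name, child id) pairs, order-independent
        h = hashlib.sha256()
        h.update(node_type.encode() + b"\0" + payload)
        for name, child_id in entries:
            h.update(b"\0" + name.encode() + b"\0" + child_id.encode())
        node_id = h.hexdigest()
        self.nodes.setdefault(node_id, (node_type, payload, entries))
        return node_id

    def record_origin(self, node_id, url):
        """Remember that this node was seen at the given origin."""
        self.origins.setdefault(node_id, set()).add(url)


dag = MerkleDAG()


def import_repo(url):
    """Load the same tiny repository from a given (made-up) origin URL."""
    blob = dag.add("file", b"print('hello')\n")
    tree = dag.add("tree", children=[("hello.py", blob)])
    commit = dag.add("commit", b"Initial commit", children=[("root", tree)])
    snapshot = dag.add("snapshot", children=[("refs/heads/main", commit)])
    dag.record_origin(snapshot, url)
    return snapshot


s1 = import_repo("https://example.org/alice/hello.git")
s2 = import_repo("https://example.org/bob/hello-fork.git")
```

In this sketch both imports yield the same snapshot identifier, the graph holds only four nodes in total, and the snapshot is associated with both origins. This mirrors, at toy scale, how a global deduplicated graph can record every place an artifact was distributed from without storing it twice.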

We will discuss the state of the art in operating, analyzing, and querying the Software Heritage graph, highlighting recent research results obtained by using it as a large-scale dataset in the field of empirical software engineering.
