HFCommunity: An extraction process and relational database to analyze Hugging Face Hub data
Ait A., Cánovas Izquierdo J.L., Cabot J.
Science of Computer Programming, vol. 234, art. no. 103079, 2024
Social coding platforms such as GitHub or GitLab have become the de facto standard for developing Open-Source Software (OSS) projects. With the emergence of Machine Learning (ML), platforms specifically designed for hosting and developing ML-based projects have appeared, with Hugging Face Hub (HFH) being one of the most popular. HFH aims at sharing datasets, pre-trained ML models and the applications built with them. With over 400K repositories and growing fast, HFH is becoming a promising source of empirical data on all aspects of ML project development. However, apart from the API provided by the platform, there are no easy-to-use solutions to collect the data, nor prepackaged datasets to explore the different facets of HFH. We present HFCommunity, an extraction process for HFH data and a relational database to facilitate empirical analysis of the growing number of ML projects.