Discussions of fairness in machine learning datasets typically focus on the content of those datasets, in how they (mis)represent certain groups of data subjects. Comparatively less attention has been paid to to the histories, values, and norms that are embedded in the creation of these datasets. This paper remedies this gap in the literature, laying out a framework for understanding the labor embedded in dataset construction, and thereby articulating new avenues for contesting these data.
• What do the problems and limitations of benchmark machine learning datasets mean for machine learning models? • Why is interdisciplinarity important to machine learning research? • What can the machine learning community to ensure appropriate responsibility for the “afterlife” of machine learning datasets?
• We must move beyond (only) focusing on the statistical properties of datasets, and move towards examining the contextual and contingent conditions of datasets’ construction • Datasets act as infrastructure for machine learning research and development, which limits the avenues of contesting these datasets • Genealogy provides a method of de-naturalizing data infrastructure, opening up new avenues of contestation and intervention