Understanding the Representation and Representativeness of Age in AI Data Sets

Joon Sung Park, Michael S. Bernstein, Robin N. Brewer, Ece Kamar, Meredith Ringel Morris
AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society, 2021
A diverse representation of different demographic groups in AI training data sets is important in ensuring that the models will work for a large range of users. To this end, recent efforts in AI fairness and inclusion have advocated for creating AI data sets that are well-balanced across race, gender, socioeconomic status, and disability status. In this paper, we contribute to this line of work by focusing on the representation of age by asking whether older adults are represented proportionally to the population at large in AI data sets. We examine publicly-available information about 92 face data sets to understand how they codify age as a case study to investigate how the subjects' ages are recorded and whether older generations are represented. We find that older adults are very under-represented; five data sets in the study that explicitly documented the closed age intervals of their subjects included older adults (defined as older than 65 years), while only one included oldest-old adults (defined as older than 85 years). Additionally, we find that only 24 of the data sets include any age-related information in their documentation or metadata, and that there is no consistent method followed across these data sets to collect and record the subjects' ages. We recognize the unique difficulties in creating representative data sets in terms of age, but raise it as an important dimension that researchers and engineers interested in inclusive AI should consider.