Thanks to recent technological advancements, we can now use AI both to detect faces and to identify their genders. We believe that this type of software can help us identify newsroom gender biases by uncovering who we actually depict.
By focusing on images, we could aggregate news media content from all over the world to get a glimpse of how we are doing on a global level. With the help of AI - and two committed AIJO team members from Nikkei - we can do this analysis at scale.
We collectively had a hunch that men would be depicted more than women. But how can we know without actually counting them? As noted by the Centre for Data Ethics and Innovation, “data gives us a powerful weapon to see where bias is occurring [...].”
We needed data, and lots of it.
Coming from eight different news organizations, we all have different content management systems and varying abilities to go back and gather historical data about what we have published. As we were determined to do this experiment collectively, we had to plan ahead and set a timeframe in which we all gathered data - images - in a streamlined way (there is little value in comparing apples to pears!).
After team discussions, we finally decided to include in our self-assessment all images published between November 9th and 15th, 2020 (7 days).
Setting the time frame
Choosing the length and timing of the experiment was not an easy task. We decided to limit it to one week of coverage due to the short time we had to do this work. One week was also the smallest sample possible, since content can be more or less gender balanced depending on the day of the week. For example, the typically vast amount of sports news on Sundays could make it a very “masculine” day in terms of representation.
A key obstacle was the timing of the LSE Collab, occurring right in the middle of a historic U.S. presidential election with two male candidates in the running and likely dominating news media imagery across the world. Still, we chose to run the analysis and make it clear that we expected some influence from this - assuming that U.S. election coverage might contribute to widening “the gender gap”.
But keep in mind that our goal was to explore how a tool like computer vision could be used, rather than to produce statistically perfect results (remember: we are exploring how we can use technology to spur cultural change). We therefore decided that the experiment still made sense to run.
Deciding what type of images to assess
To streamline the data collection, we decided to only include images displayed on the front pages of our respective news sites. This means we used the images displayed on the equivalent “first impression” of our sites, rather than looking at specific niches (which vary across our brands).
Getting the data to Nikkei
The analysis would be conducted under the administration of two of our team members, Yosuke Suzuki and Issei Mori at Nikkei. In order for them to access data from all of the collaborating media outlets, we each set up individual solutions for data gathering. Depending on the sophistication of our different content management systems, this at times included scraping our own sites. For Reuters and AFP, due to the massive number of images they produce every week, a random sample was taken.
We all created datasets including all of the images from the decided time frame - November 9th to 15th - and made them available to Nikkei via Google Drive.
In total, 28,051 images were analysed by the AI system under Nikkei’s supervision. Split across the participating brands, the distribution looked like this:
Schibsted: 7536 (spread over 9 brands)
Reach PLC: 4881 (spread over 10 brands)
La Nacion: 1238
Nice Matin: 521
About the AI Models
We decided to work on the "facial detection" and "gender classification" tasks separately.
For the facial detection model, we used "Retina Face," a state-of-the-art model introduced in 2020. The model is trained on the "WIDER FACE" hard-difficulty set and detects almost every visible face. This model also detects masked faces, which we expected to appear repeatedly in our data.
For the gender classification model, we used the "Insightface Gender Age Model." This model achieved 96% accuracy on its validation data and has over 90% accuracy on large and clear faces in news images.
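To make the two-stage split concrete, here is a minimal toy sketch of the pipeline shape: detection first, then gender classification per detected face. The two helper functions are hypothetical stand-ins for RetinaFace and the InsightFace gender-age model and operate on fake data; they are not the project's actual code.

```python
# Toy sketch of the two-stage pipeline: face detection, then per-face
# gender classification. detect_faces() and classify_gender() are
# hypothetical stand-ins for RetinaFace and the InsightFace model.

def detect_faces(image):
    # A real detector returns bounding boxes; here we pretend the
    # "image" is simply a list of pre-cropped faces.
    return image

def classify_gender(face):
    # A real classifier returns a label plus a confidence score.
    return face["label"], face["confidence"]

def analyse_image(image):
    # Detect every face, then classify each one independently.
    return [classify_gender(face) for face in detect_faces(image)]

image = [
    {"label": "female", "confidence": 1.57},
    {"label": "male", "confidence": 0.91},
]
print(analyse_image(image))  # [('female', 1.57), ('male', 0.91)]
```

Keeping the two tasks separate means either model can be swapped out (or human-corrected) without touching the other.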
Following the results and discussion in previous work by e.g. Pew Research, we decided to evaluate each image by computing its female ratio (the percentage share of females represented in the image), and then aggregate all the data to get the average ratio of all images across brands.
If a picture did not include human faces, we did not take it into account. We calculated the ratio for each image, then calculated the average ratio per publisher, and finally aggregated all the publishers’ ratios.
This is not necessarily "the right" way to do it. We considered many alternative ways of calculating gender representation, discussing e.g. dominance (if the share of females were higher, the whole image would be marked as “female dominant”), pixel representation (how much of the image’s surface represented male/female faces), or a simple count (total number of females/males represented in images). Taking into account the various limitations of these approaches, we decided to go for the average ratio described above.
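The average-ratio approach described above can be sketched in a few lines. This is an illustrative reconstruction with made-up data, not the project's actual code; note that the final step is an unweighted mean across publishers, so small and large publishers count equally.

```python
# Sketch of the chosen metric: per-image female ratio, averaged per
# publisher, then averaged across publishers. Data is illustrative.

def female_ratio(faces):
    """faces is a list of 'female'/'male' labels for one image."""
    return faces.count("female") / len(faces)

def publisher_average(images):
    # Images without any detected faces are excluded from the count.
    ratios = [female_ratio(faces) for faces in images if faces]
    return sum(ratios) / len(ratios)

publishers = {
    "A": [["male", "female"], ["male"]],              # ratios 0.5, 0.0
    "B": [["female", "female", "male"], ["female"]],  # ratios ~0.67, 1.0
}
per_pub = {name: publisher_average(imgs) for name, imgs in publishers.items()}
overall = sum(per_pub.values()) / len(per_pub)  # unweighted mean across publishers
print(per_pub, round(overall, 3))  # {'A': 0.25, 'B': 0.833...} 0.542
```

A dominance or raw-count metric would only change the `female_ratio`/aggregation functions, which is what made the alternatives easy to compare.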
If you are curious about the details of our solution, please visit the GitHub repository for our project.
Of the total 28,051 images, our AI determined that 16,666 included faces (31,660 faces, to be exact). But were they male or female?
Aggregating the results of the 8 different publishers, the computer vision model determined the average ratio of females represented in images as 22.9%. Among the participating publishers, this ratio ranged from 16.6% to 35.7%.
An important note on this result is that the numbers change significantly after human review. As we only reviewed 10% of the total images with a human eye, we cannot tell for certain whether the mistakes we identified in our review are representative of the full dataset. But what we can say is that the average female ratio in the sampled, human-reviewed ten percent went up to 27.2%. After human review, the female ratio among publishers ranged from 19.4% to 36.2%.
We believe that the “true” number is somewhere between 22.9% and 27.2%.
As noted in the section about the limitations of the model, it tends to mistake females for males more often than the reverse - meaning that the share of females tends to go up after human review. In one case, though, the female ratio went down after human review.
Data is not The Truth
Data is always a product of human decisions. There are many choices, big and small, that we have made during this project that influenced the final results of our analysis. Two of them include setting threshold values for what faces to include and deciding whether to count groups (Schibsted and Reach) as one or multiple entities.
In some instances, given the goal of this experiment (to learn about gender representation), it might not be appropriate to count a human face when the image - to a human eye - is clearly intended to show something else. With this in mind, we set some threshold values. When calculating the share of females in images, we ignored human faces that were blurry or very small. We set the clearness threshold at 35 (so faces used for analysis should score above 35) and required the face area to be larger than 0.13% of the picture containing the face. The clearness level was calculated from the variance of pixel colors in the face region (0 to +inf); a lower clearness value indicates that the face was out of focus. We arrived at these threshold values through a few iterations of processing images and manually evaluating the outcomes.
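The two filters above can be expressed as a simple predicate. The threshold values (35 and 0.13%) come from the text; the function name and the pixel-area representation are our illustrative assumptions, not the project's actual implementation.

```python
# Sketch of the two face filters: a clearness (pixel-colour variance)
# floor and a minimum face-to-image area share. Thresholds are from
# the text; the representation is an assumption for illustration.

CLEARNESS_MIN = 35          # faces at or below this count as out of focus
AREA_MIN_SHARE = 0.0013     # face must cover more than 0.13% of the image

def keep_face(clearness, face_area, image_area):
    """Return True if a detected face should enter the gender count."""
    return clearness > CLEARNESS_MIN and face_area / image_area > AREA_MIN_SHARE

# For a 1920x1080 image (2,073,600 px) the area cutoff is ~2,696 px.
image_area = 1920 * 1080
print(keep_face(clearness=80, face_area=5000, image_area=image_area))  # True
print(keep_face(clearness=20, face_area=5000, image_area=image_area))  # False: blurry
print(keep_face(clearness=80, face_area=1000, image_area=image_area))  # False: small
```

Tuning both numbers by iterating and eyeballing the results, as described above, is a pragmatic alternative to learning them from labelled data.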
Two publisher groups - Schibsted (SE/NO) and Reach PLC (UK) - participated in the analysis with data from multiple individual brands. If, for example, we had counted each of Schibsted’s brands as an individual publisher, the total aggregated share of females would have increased slightly. After discussing this in the team, we still decided to count each group as one publisher, since doing otherwise would have been misleading given the global nature of our assessment (we would have ended up with 9 Scandinavian brands, compared to e.g. 1 Asian).
Understanding Limitations Through Human Review
The gender classification model used for the AIJO project showed poor performance on a number of different types of cases. To investigate what these cases were - and to achieve greater accuracy in our experiment - we decided to manually assess the analysis done by the model and correct any errors we found. Here, we randomly sampled 10% of the images from each publisher for human review.
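Drawing a per-publisher 10% sample is a one-liner with the standard library. The function below is an illustrative sketch (the seed and sizing rule are our assumptions, not the project's actual code); seeding the generator makes the draw reproducible, which matters when several people split the review work.

```python
# Sketch of a reproducible per-publisher 10% sample for human review.
# Image IDs, the seed, and the rounding rule are illustrative choices.
import random

def sample_for_review(image_ids, share=0.10, seed=42):
    rng = random.Random(seed)           # fixed seed -> same sample every run
    k = max(1, round(len(image_ids) * share))
    return rng.sample(image_ids, k)     # sampling without replacement

images = [f"img_{i:04d}" for i in range(500)]
review_set = sample_for_review(images)
print(len(review_set))  # 50
```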
We also discussed only manually assessing images where the AI had a low confidence score (you'll learn what that means below!), but concluded that we wanted to leverage our journalistic skills to investigate the model more qualitatively.
The most common error (53% of the error cases in the randomly sampled images reviewed by a human) was that the model detected a female face as male. In 32% of cases, the mistake was the opposite (a male face detected as female).
We also saw generally low performance on technically difficult cases such as side angles, blurry faces, and small (but still within the threshold) faces. Other limitations we found in our model include cases of male children being identified as women, and people wearing face masks not being identified as human faces at all.
And finally, as in the illustration above, there were some odd cases - like our AI categorising a ferret as a man.
As others (e.g. The Algorithmic Justice League) before us have shown, our human biases are often encoded into AI systems and solutions, creating a harmful cycle that cements e.g. racial or gender biases. While our study did not assess this with statistical rigour, we consider this an incredibly important field of research and hope to find ways of further contributing to it in the future.
Bias or other shortcomings can also be related to AI’s inability to uncover nuances (see The Alan Turing Institute for an interesting technical description). If it is hard for a human eye to categorise something, it will be close to impossible for AI.
Assessing the Confidence of the Model
The computer vision model we used predicts male and female confidence values, and it is from these values that we determine whether the face in the image is male or female. “Undetermined” does not exist as a label for the model, but the confidence score is how the model expresses how sure it is about its analysis. Each confidence value ranges from -2.0 (the model is confident the face is not female/male) to 2.0 (the model is confident the face is female/male).
For example, for one image, the female confidence was 1.574161 and the male confidence was -1.612680. The model predicted the face in the image as female, and the prediction was correct when manually assessed. In another case, the female confidence was -0.100542 and the male confidence was 0.085507. The model predicted male, but the prediction was incorrect.
When manually assessing 10% of the AI-analysed images, we found quite a few errors. We learned that if the male or female confidence is high enough – compared to the confidence for the other gender – the model's prediction is usually good. But when the male or female confidence is around 0.0, the model failed more often. In other words, the model performs worst in cases of uncertainty, i.e. when it has a low confidence score.
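The pattern above suggests a simple rule: pick the higher-confidence gender, and flag faces where the two confidences are close for human review. This is a sketch of that idea with the two examples from the text; the 0.5 margin is our illustrative choice, not a value from the model.

```python
# Sketch: turn the two confidence values (each in [-2.0, 2.0]) into a
# label plus a "needs human review" flag. The 0.5 margin is an assumed,
# illustrative cutoff, not part of the actual model.

def predict(female_conf, male_conf, review_margin=0.5):
    label = "female" if female_conf > male_conf else "male"
    # Close confidences signal uncertainty -> route to a human reviewer.
    needs_review = abs(female_conf - male_conf) < review_margin
    return label, needs_review

# The two examples from the text:
print(predict(1.574161, -1.612680))  # ('female', False): confident, was correct
print(predict(-0.100542, 0.085507))  # ('male', True): uncertain, was wrong
```

A rule like this is what a "review only low-confidence cases" workflow would build on, though we ultimately chose to review a random sample instead.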
AI technologies are by no means perfect. For the AIJO Project, a global learning exercise, we chose to focus on exploring the approach rather than advancing the technological state of the art. For future projects in neighbouring domains, we suggest paying more attention to how models can be improved to contribute to more diverse representation.
We would likely have gotten more accurate results had we utilized commercial services. Still, this was a learning experience and we wanted to see what doing the analysis ourselves would be like.
As AI is not good with nuances, we chose to only assess binary genders and hence went with a method for calculating gender representation that, in hindsight, may be hurtful to some reading this study. We want to take the opportunity to state that while the lack of an “undetermined” label in our AI model may be a technological limitation, the realities of the individuals behind such gender identities should be recognised - not least by publishers like ourselves. As we continue to explore AI in our newsrooms, we are eager to find better ways of holding ourselves accountable to that.
There are many other, more detailed and/or nuanced aspects of image representation that we would like to explore in future projects, given fewer time and organisational constraints. Image-related questions that we would be interested in looking at with the help of AI include, but are not limited to:
In what respect does the gender of those depicted correlate to topics? Who is depicted in sports/business/entertainment content?
Where are people placed in photos? Who is depicted in the front vs the back of pictures? Who is the center of attention?
How many people are in the photo? Are there differences in male/female or old/young representations as part of groups?
How long are the photos displayed on the front page? Are men/women given the same “airtime”?
How are people posing? Are there differences in the angles of photos? Which gender do we find represented in close-ups, and which more in medium or wide shots?
What pictures do we choose to represent top politicians? Is there a difference when representing men and women?
How many individuals are responsible for the total number of faces? How much do we depict e.g. Donald Trump or Kamala Harris?
There are endless issues to explore, and through this first small experiment, we feel confident that AI can help us look into them.