
Experimenting with Natural Language Processing

The way we speak and write reflects who we are. Sadly the real diversity of our society is not reflected in how we write many articles. Partly this is because diversity is rare in top positions across society. It is also because our own news industry is not diverse enough. Finally, the imbalance can be the result of our own unconscious biases as writers.

So, what can be done to uncover these disparities in our content with facts rather than intuition? In the early 1950s, French anthropologist Claude Lévi-Strauss mounted a passionate defence of mathematics (in fact, cybernetics!) as a tool for research in linguistics. This is finally happening in journalism as well.

Our Collab selected one tool to try to measure bias in our text content. The Gender Gap Tracker (GGT) is an automated software system created by the Discourse Processing Lab at Simon Fraser University that measures the presence of men's and women’s voices in news text.

GGT has now been used for almost two years to show gender ratios of seven English-language Canadian news sites in real time.

Together with Maite Taboada, director of the Discourse Processing Lab, four members of our Collab (UK’s Reach Plc, Japanese newspaper Nikkei, German TV group Deutsche Welle and international news agency Agence France-Presse) decided to run a pilot experiment on their content over five days, from November 16 to 20.

The goal of the study was to test the tool and check how it could be adapted for our respective newsrooms.

What we found was no surprise. Women represented 21% of the people mentioned in the articles, whereas about 73.3% were men; 5.6% of the names could not be identified. Among the sources we quoted, 21.9% were women and 73.4% were men (unidentified sources represented about 5% of the total).

Interestingly, the percentage of women quoted as sources is slightly higher than the percentage of women featured more generally. Asked about this finding, Maite Taboada told us that “it is quite possible that reporters are aware of the diversity of sources, and try harder to quote more women (...) We have the same results for the main GGT data”.

We also listed our "Top sources" -- the most frequently quoted personalities. We found that the two top sources among men, not surprisingly Donald Trump and Joe Biden, were quoted 24 and 16 times respectively, whereas the two most quoted women, Ursula von der Leyen and Angela Merkel, each appeared as sources four times. The third most quoted man was Abiy Ahmed, Prime Minister of Ethiopia and Nobel Peace Prize winner (13 times), and third place on the podium on the female side went to Pamela Rendi-Wagner, chairwoman of the Social Democratic Party in Austria (3 times).

What happened in the world that week and our own editorial priorities -- from financial and business news for some, to international affairs and national affairs for others -- obviously had an impact on the results.

That week, for instance, the protagonists of many international breaking news stories were men: Donald Trump refusing to concede; Abiy Ahmed leading a military offensive against leaders of the dissident northern region of Tigray; Moderna and Pfizer’s CEOs announcing final results of trials on Covid vaccines; the SpaceX Dragon crew (three men, one woman) docking at the International Space Station; and G20 leaders meeting at a virtual summit. The picture of the summit shows 22 people: two women and 20 men.

News reflects society.

But bias can also be unconscious, and that is why we wanted to look into the length of quotes: were they shorter for women?

The answer is yes. The average quote length was 103 characters for men and 97 for women.
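
For readers who want to repeat this measurement on their own content, the comparison comes down to grouping quotes by the speaker's gender and averaging their length in characters. The snippet below is only a minimal sketch: the function name and the sample quotes are hypothetical, and it assumes each quote has already been attributed to a gender.

```python
from statistics import mean

# Hypothetical (gender, quote) pairs for illustration only.
# These are NOT quotes or figures from the pilot study.
SAMPLE_QUOTES = [
    ("male", "We expect results by the end of the week."),
    ("male", "No comment."),
    ("female", "The committee will publish its findings on Friday."),
]


def average_quote_length(quotes):
    """Return the mean quote length in characters for each gender label."""
    lengths = {}
    for gender, quote in quotes:
        lengths.setdefault(gender, []).append(len(quote))
    return {gender: mean(values) for gender, values in lengths.items()}


if __name__ == "__main__":
    for gender, avg in average_quote_length(SAMPLE_QUOTES).items():
        print(f"{gender}: {avg:.1f} characters on average")
```

A difference of a few characters, as in our pilot, should be read with caution on a sample this small.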


We analysed 1,430 articles in English. The results were presented as an aggregate of percentages of individual publishers’ results. This was a pilot study and it has limited statistical relevance, given its short time frame. It was also limited to a binary analysis (male/female) that does not reflect society as a whole.
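
A note on what "an aggregate of percentages" can mean in practice. The sketch below assumes it is an unweighted average of each publisher's own percentages, which is our reading for illustration and not a description of the exact method used in the report. It contrasts that with pooling all counts first. The publisher names and counts are hypothetical.

```python
from statistics import mean

# Hypothetical counts of quoted sources per publisher (NOT the pilot's data).
COUNTS = {
    "publisher_a": {"female": 20, "male": 70, "unknown": 10},
    "publisher_b": {"female": 60, "male": 120, "unknown": 20},
}


def shares(counts):
    """Convert raw counts into percentages of the total."""
    total = sum(counts.values())
    return {key: 100 * value / total for key, value in counts.items()}


def aggregate_of_percentages(counts_by_publisher):
    """Average each publisher's percentages, giving every publisher equal weight."""
    per_publisher = [shares(c) for c in counts_by_publisher.values()]
    categories = per_publisher[0].keys()
    return {key: mean(p[key] for p in per_publisher) for key in categories}


def pooled_percentages(counts_by_publisher):
    """Sum the counts across publishers first, then compute percentages."""
    pooled = {}
    for counts in counts_by_publisher.values():
        for key, value in counts.items():
            pooled[key] = pooled.get(key, 0) + value
    return shares(pooled)


if __name__ == "__main__":
    print("Average of percentages:", aggregate_of_percentages(COUNTS))
    print("Pooled counts:", pooled_percentages(COUNTS))
```

The two approaches diverge when publishers contribute very different volumes of articles: averaging percentages stops a large publisher from dominating the result, while pooling reflects the overall body of text.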

But it was for us a great learning experiment that will help us conduct further investigations back in our newsrooms.

What did we learn?

First and foremost: We can do it!

With an open source tool available here, you can use Natural Language Processing (NLP) to identify who is mentioned (names) and who is quoted (sources) in the texts. To predict the gender of names, the system uses a simple API that is free for up to a thousand names a day. It can be complemented by another, paid service: Gender-API. The Discourse Processing Lab was able to deliver the analysis in a week. With simple training, the tool can be used in any newsroom that has a developer.
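
To give developers a feel for the two steps described above (finding who is quoted, then predicting gender from the name), here is a deliberately simplified sketch. It is not the Gender Gap Tracker's code: a single regular expression stands in for the NLP step, and a small hypothetical lookup table stands in for the name-to-gender API. It only recognises sentences of the form "quote," said First Last, with straight quotation marks.

```python
import re
from collections import Counter

# Hypothetical lookup table standing in for a name-to-gender service.
GENDER_BY_FIRST_NAME = {"angela": "female", "ursula": "female", "joe": "male"}

# Matches one pattern only: "quoted text" said|says Firstname Lastname
QUOTE_THEN_SPEAKER = re.compile(
    r'"(?P<quote>[^"]+)"\s+(?:said|says)\s+'
    r"(?P<speaker>[A-Z][\w-]+(?:\s+[A-Z][\w-]+)+)"
)

SAMPLE_TEXT = (
    '"We will keep working on this," said Angela Merkel. '
    '"The count is not over," said Joe Biden on Tuesday. '
    '"No comment," said Alex Example.'
)


def extract_quotes(text):
    """Return (speaker, quote) pairs for the single pattern handled here."""
    return [
        (match["speaker"], match["quote"].rstrip(","))
        for match in QUOTE_THEN_SPEAKER.finditer(text)
    ]


def guess_gender(full_name):
    """Look up the first name; anything not in the table is 'unknown'."""
    first_name = full_name.split()[0].lower()
    return GENDER_BY_FIRST_NAME.get(first_name, "unknown")


def source_counts(text):
    """Count quoted sources by predicted gender."""
    return Counter(guess_gender(speaker) for speaker, _ in extract_quotes(text))


if __name__ == "__main__":
    for speaker, quote in extract_quotes(SAMPLE_TEXT):
        print(f"{speaker} ({guess_gender(speaker)}): {quote}")
    print(source_counts(SAMPLE_TEXT))
```

Real news copy attributes quotes in many more ways (indirect speech, pronouns, titles instead of names), which is why a pattern like this would miss most sources and a proper NLP pipeline is needed. The "unknown" bucket mirrors the unidentified names and sources in our results.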

Second lesson: there is much more to be done.

The Gender Gap Tracker also offers topic analysis. Given the small number of articles submitted, we could not run that study. And what about languages beyond English? Last but not least, there is ongoing research on aspects as subtle as the use of verbs, pronouns and adjectives, and how they can uncover prejudice.

Third: this AI tool can be used in many other ways.

NLP can help uncover strong imbalances in the length of quotes for candidates running in an election, for example. One partner in this collaboration now wants to quantify the representation of particular minorities in its journalism. Another hopes to extend the experiment to a group of international broadcasters. The results show there is much work to be done for gender equality, but we all hope this tool will help us improve and monitor our progress.

For further technical documentation and background to the Gender Gap Tracker Tool, please have a look at the attached Pilot Study Report.


