Evaluating Gender Bias in Speech Translation



The scientific community is increasingly aware of the need to embrace pluralism and to consistently represent major and minor social groups. Accordingly, there is an urgent need for evaluation sets and protocols that measure existing biases in automatic systems. This paper introduces WinoST, a new freely available challenge set for evaluating gender bias in speech translation. WinoST is the speech version of WinoMT, an MT challenge set, and both follow the same evaluation protocol to measure gender accuracy. Using a state-of-the-art end-to-end speech translation system, we report the gender bias evaluation on 4 language pairs and show that gender accuracy in speech translation is more than 23% lower than in MT.


Marta R. Costa-jussà, Christine Basta, Gerard I. Gállego (footnote 1)
TALP Research Center, Universitat Politècnica de Catalunya, Barcelona

Keywords: gender bias, challenge set, multilinguality, direct speech-to-text translation

1 Introduction

There is a massive lack of representation of diverse gender, race, and cultural groups in power structures. These problems permeate data science, causing unbalanced progress that misrepresents social groups by gender, race, and nationality. Consequently, machine translation systems offer higher-quality translations for high-resource languages than for low-resource ones [1]; image recognition systems perform much better for European and American faces [8]; and automatic speech recognition has a lower error rate for male voices than for female ones [27]. While this remains an open challenge in most artificial intelligence tasks, the scientific community is making huge efforts to bring it to light. In the long term, there is multidisciplinary work to do in education, politics, and communications. In the short term, researchers can detect this bias, devote resources to evaluating it, and propose methods to mitigate it, among others.

Speech Translation (ST) is at the intersection of Automatic Speech Recognition (ASR) and Machine Translation (MT). Within these tasks, biases have been detected and studied from different perspectives. ASR is biased with respect to gender and dialect [27]. Similarly, MT perpetuates and amplifies stereotypes [22]. To address this problem in MT, some works propose evaluating gender bias [26], creating multilingual balanced data to train fairer systems [11], or using equalizing techniques to mitigate the effect of unbalanced training data [16, 23].

Finally, in ST, the MuST-SHE corpus [3] is a natural benchmark for automatically evaluating gender bias. The benchmark evaluates two categories of gender bias: one related to the speaker's gender and another to the utterance content. It is limited to the English-to-French and English-to-Italian language pairs. In contrast, the main contribution of our work is the first large-scale multilingual ST challenge set (footnote 2), which follows an evaluation protocol for the analysis of gender bias previously proposed for MT [26].

2 Bias Statement

As proposed in previous work [6] and suggested in related venues (footnote 3), we formulate the bias statement of our work. Our work proposes a challenge set that benefits from an existing evaluation protocol, applied here to ST. Our challenge set serves to measure how accurately gendered source spoken words are translated. It includes stereotypical and anti-stereotypical examples, adding value to the evaluation analysis. The main contribution of our proposal is an objective evaluation protocol that can measure how biased our ST systems are. Note that our work is based on a synthetic data set simplified to binary gender, which does not aim to mirror reality. Still, we contribute to showing that current systems contain biases, while we maintain that long-term policies are required to solve this problem [13].

3 Gender Bias within MT systems and Related Work

Gendered languages have a richer grammar for expressing gender. In these languages, gender has to be assigned to all nouns, and consequently, all the articles, verbs, and adjectives have to agree with the gender of the noun. This leads to incorrect translations from a less-gendered language (English) to a highly gendered language (Spanish) due to the lack of explicit evidence of gender in the source [18]. Thus, gender bias occurs due to gender information loss and the over-prevalence of certain gendered forms in the training data. An illustrative example is the translation of ’The doctor spoke to the responsible nurse in the hospital.’ into Spanish. The Spanish translation would be ’El médico habló con la enfermera responsable en el hospital.’, which assumes the nurse to be female and the doctor to be male. Although nothing in the text mentions their genders, the systems translate in a biased manner. Generally speaking, MT systems have been shown to produce biased outputs. Even when more context is available, translations seem to ignore it in favor of masculine and stereotyped gender roles [26, 23]. The main reason for such bias is that models are trained on human-biased data [12].

Researchers have recently dedicated efforts to resolving such bias. We can describe three research lines on gender bias in MT: mitigating gender bias, detecting issues in translation, and creating challenge test sets to evaluate gender bias in the systems.

Regarding the first research line, approaches have been dedicated to solving the translation bias from a non-gendered language to a gendered one. Adding a gender tag of the speaker during training enhances the translation quality, as demonstrated in [28]. It helps to predict gender correctly when translating from English to gendered languages, giving control over the gender of the translation hypothesis. This was confirmed in recent work by [2], who showed that adding gender information helps to increase the accuracy of gendered translations. Moreover, the authors showed that increasing context has a positive effect on gendered translations, leading to higher performance. [20] incorporate gender information by prepending a short phrase to each sentence at inference time, which acts as a gender label. Recent work [23] has treated gender debiasing as a domain-adaptation problem, in which the system is fine-tuned instead of retrained to mitigate gender bias. They fine-tune the MT system on a set of synthetic sentences with equal numbers of entities using masculine and feminine inflections. [16] introduced the idea of adjusting the word embeddings, which improved performance on an English-Spanish occupations task.

Concerning research on detecting gender translation issues, [22] examined Google translations and showed that mentions of stereotyped professions are more reliably translated than anti-stereotyped ones. In their study, they used sentence templates filled with word lists of professions and adjectives. The authors of [10] also studied pronoun translations for the Korean language using sentence templates. The recent work [18] proposed a BERT-based perturbation method to identify gender issues in MT automatically. The technique discovers new sources of bias beyond the word lists used previously.

The line of research most closely related to our work is the creation of challenge sets. The first MT challenge set, called WinoMT, was introduced by [26]. WinoMT is a test set of 3888 sentences, where each sentence contains two human entities, one of which is co-referent with a pronoun. The evaluation compares the gender of the translated entity with the gold gender, with the objective of a correctly gendered translation. The authors identified three metrics for the evaluation: accuracy, $\Delta_G$ and $\Delta_S$. The accuracy is the percentage of translated entities whose gender matches the gender of the gold entities. $\Delta_G$ is the difference in F1 score between the set of sentences with male entities and the set with female entities. $\Delta_S$ is the difference in accuracy between the set of sentences with pro-stereotypical entities and the set with anti-stereotypical entities. A pro-stereotypical set identifies ’developer’ as male and ’hairdresser’ as female. An anti-stereotypical set identifies the former as female and the latter as male. As far as we know, the only existing related work studying gender bias in speech translation is presented in [3]. They created a benchmark dataset for two language pairs (English-Italian/English-French) and evaluated their systems accordingly.
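To make the three WinoMT metrics concrete (accuracy, the male/female F1 gap, and the pro/anti-stereotypical accuracy gap), the following is a minimal sketch and not the official evaluation script. The `Example` record and all function names are hypothetical, and it assumes that the gender of the translated entity has already been extracted from the system output and that F1 is computed per gender class.

```python
from dataclasses import dataclass


@dataclass
class Example:
    gold: str        # gold gender of the entity: "male" or "female"
    pred: str        # gender found in the translation (may be "unknown")
    stereotype: str  # "pro" or "anti" (stereotypical role assignment)


def accuracy(examples):
    """Percentage of examples whose predicted gender matches the gold gender."""
    if not examples:
        return 0.0
    return 100.0 * sum(e.gold == e.pred for e in examples) / len(examples)


def f1(examples, gender):
    """F1 score (in %) for one gender class, treated as the positive class."""
    tp = sum(e.gold == gender and e.pred == gender for e in examples)
    fp = sum(e.gold != gender and e.pred == gender for e in examples)
    fn = sum(e.gold == gender and e.pred != gender for e in examples)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100.0 * 2 * precision * recall / (precision + recall)


def gender_metrics(examples):
    """Return (accuracy, F1 gap male - female, accuracy gap pro - anti)."""
    acc = accuracy(examples)
    delta_g = f1(examples, "male") - f1(examples, "female")
    pro = [e for e in examples if e.stereotype == "pro"]
    anti = [e for e in examples if e.stereotype == "anti"]
    delta_s = accuracy(pro) - accuracy(anti)
    return acc, delta_g, delta_s


examples = [
    Example("male", "male", "pro"),
    Example("male", "male", "anti"),
    Example("female", "male", "anti"),
    Example("female", "female", "pro"),
]
acc, delta_g, delta_s = gender_metrics(examples)
```

In this toy input one female entity is rendered as male, so both gaps are positive: the system favors male forms and pro-stereotypical assignments.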

4 Speech Translation System

We trained an ST system to evaluate its gender bias with the methodology we are presenting. We used an end-to-end ST approach that directly translates the utterance without producing intermediate transcriptions. This task was introduced by [5], and it has recently attracted growing interest in the research community [29, 9, 19]. We trained the system on the MuST-C corpus, which consists of speech fragments from TED Talks, their transcriptions, and their translations into 8 European languages [14].

The architecture we used is the S-Transformer, a popular adaptation of the Transformer for ST [17]. It applies a stack of convolutions and self-attention layers to process the log-Mel spectrograms extracted from the speech utterances. The two two-dimensional (2D) convolutional layers are in charge of capturing local patterns of the spectrogram, in both the time and frequency dimensions. Moreover, they reduce the feature maps by a factor of four, which is crucial to avoid the memory issues that arise when feeding the Transformer with overly long sequences. Then, the two 2D self-attention layers, which were introduced by [15], model long-range dependencies that convolutional layers cannot capture. Finally, the self-attention layers of the Transformer encoder also include a logarithmic distance penalty that biases them towards the local context [25]. Following the common approach, introduced by [4] and recommended by the authors of the S-Transformer, we pre-trained the S-Transformer encoder for ASR to improve the performance of the final ST system.
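The factor-of-four reduction can be checked with the standard output-length formula for a strided convolution. The sketch below uses the kernel size of 3 and stride of 2 reported in the system details; the padding of 1 is an assumption, since the paper does not state it, and the function names are hypothetical.

```python
def conv_output_length(length, kernel_size=3, stride=2, padding=1):
    """Output length of a strided convolution along one dimension.

    padding=1 is an assumption; the paper does not report it.
    """
    return (length + 2 * padding - kernel_size) // stride + 1


def downsampled_length(num_frames, num_layers=2):
    """Sequence length after stacking `num_layers` stride-2 convolutions."""
    for _ in range(num_layers):
        num_frames = conv_output_length(num_frames)
    return num_frames


# 1000 spectrogram frames -> 500 after the first layer -> 250 after the second
print(downsampled_length(1000))  # 250
```

Since self-attention cost grows quadratically with sequence length, shortening the input by four reduces the attention matrix size by roughly sixteen.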

5 Proposed Gender Evaluation: WinoST challenge set

WinoST is the speech version of WinoMT. It consists of English speech audio recorded as a voice-over by an American female speaker. By design, the gender information in WinoST sentences lies in the utterance content, not in the speaker’s voice. An example of these sentences is ’The developer argued with the designer because she did not like the design.’, where ’she’ refers to the developer, meaning that the developer is female.

WinoST serves as an input of the ST system to be evaluated, and the output text of the systems follows the same evaluation protocol as WinoMT [26]. Figure 1 shows the block diagram of this procedure. As a side-product, and not shown in the figure, WinoST can also be used as a challenge set for evaluating ASR gender bias.

Figure 1: WinoST Evaluation Block Diagram for Speech Translation

Further technical details on WinoST are reported in Table 1, including the number of files, total hours and words, and the audio recording format. The voice mastering process we applied to the recordings includes dynamic voice processing, broadcasting, equalization and filtering. WinoST is available under the MIT License (footnote 4) with the limitation that recordings cannot be used for speech synthesis, text to speech, voice conversion or other applications where the speaker's voice is imitated or reproduced.

# Files
# Hours
# Words
Audio format WAV ( KHz, -bit)
Table 1: WinoST details.

6 Experiments

This section describes the first experiments with WinoST: the baseline ST system that we use and the results that we obtain in gender accuracy.

6.1 Data preprocessing

Before training the S-Transformer model, we preprocessed both speech and text data. We extracted 40-dimensional log-Mel spectrograms from the audio files, using a window size of 25 ms and a hop length of 10 ms, with XNMT [21] (footnote 5). For the text data, we normalized the punctuation, tokenized it, and de-escaped special characters using the Moses scripts (footnote 6). Furthermore, we lowercased the transcriptions and removed their punctuation. We used the BPE algorithm [24] to encode the translation texts, with a vocabulary size of 8000 for each language, and a character-level encoding for the transcriptions.
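As a rough guide to the size of the resulting features, the sketch below computes the spectrogram shape from the 25 ms window, 10 ms hop and 40 Mel bins given above. The 16 kHz sample rate and the absence of padding are assumptions, not values from the paper, and the function names are hypothetical; this is not the XNMT extraction code.

```python
def num_frames(num_samples, sample_rate=16000, window_ms=25, hop_ms=10):
    """Number of analysis frames without padding (sample rate is an assumption)."""
    window = sample_rate * window_ms // 1000  # 400 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000        # 160 samples at 16 kHz
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


def spectrogram_shape(duration_s, n_mels=40, sample_rate=16000):
    """(frames, Mel bins) for an utterance of the given duration in seconds."""
    samples = int(duration_s * sample_rate)
    return (num_frames(samples, sample_rate), n_mels)


print(spectrogram_shape(1.0))   # (98, 40)
print(spectrogram_shape(14.0))  # (1398, 40)
```

Under these assumptions, an utterance at the 14-second training cut-off yields about 1400 frames before the convolutional downsampling.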

6.2 System Details

The model we used has two convolutional layers with a kernel size of 3, 64 channels and a stride of 2. The Transformer has an embedding size of 512, 6 layers in both the encoder and the decoder, 8 self-attention heads, and a feed-forward network hidden size of 1024. We trained the S-Transformer with the Adam optimizer, with a learning rate of , and an inverse square root scheduler. The training has a warm-up stage of 4000 updates, in which the learning rate grows from . We used a cross-entropy loss with label smoothing by a factor of 0.1. Moreover, a dropout of 0.1 and gradient clipping to 20 were applied. We generated the outputs with a beam search of size 5. We loaded 8 sentences per step and accumulated gradients over 64 steps per update, which gives an effective batch size of 512. Audios longer than 14 seconds and sentences with more than 300 tokens were not used during training.
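The shape of an inverse square root schedule with a linear warm-up can be sketched as follows. The peak and initial learning rates are placeholder values chosen for illustration only, because the actual values are not recoverable from the text; only the 4000 warm-up updates and the 8 x 64 batch arithmetic come from the paragraph above.

```python
import math


def inverse_sqrt_lr(step, peak_lr, warmup_updates=4000, init_lr=0.0):
    """Linear warm-up to peak_lr, then decay proportional to 1/sqrt(step).

    peak_lr and init_lr are hypothetical; the paper's values are not shown here.
    """
    if step < warmup_updates:
        return init_lr + (peak_lr - init_lr) * step / warmup_updates
    return peak_lr * math.sqrt(warmup_updates / step)


PEAK_LR = 1e-3  # placeholder value, not taken from the paper

for step in (2000, 4000, 16000):
    print(step, inverse_sqrt_lr(step, PEAK_LR))

# Effective batch size: sentences per step times accumulation steps
effective_batch_size = 8 * 64
```

With this schedule the learning rate peaks at the end of the warm-up and halves each time the number of updates quadruples.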

6.3 Results

This section describes the results of evaluating the ST system on WinoST and its performance in terms of gender. We are also interested in evaluating the English ASR transcriptions to see whether they contain any gender bias.

Language    ASR (WER)    ST (BLEU)
Table 2: WER and BLEU (%) scores for the MuST-C corpus.

General ASR and ST Evaluation: We use the standard WER and BLEU measures to report the ASR and ST performance, respectively in Table 2. Our results concur with the results in [14].
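For reference, WER is the word-level edit distance between the hypothesis and the reference, divided by the number of reference words. The following is a minimal sketch with hypothetical function names, assuming whitespace tokenization; it is not the scoring tool used for Table 2.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent for whitespace-tokenized strings."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# One substitution (developer -> develop) and one deletion (the) over 6 words
score = wer("the developer argued with the designer",
            "the develop argued with designer")
```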

Gender Bias Evaluation in ST: Our main objective is to evaluate the gender accuracy of the systems for each of the language pairs. A high accuracy demonstrates that the system is able to translate the gender of the entities correctly. We also report $\Delta_G$ and $\Delta_S$ in Table 3. Ideally, these values should be close to 0. A high $\Delta_G$ indicates that the system translates males better, and a high $\Delta_S$ denotes that the system tends to translate pro-stereotypical entities better than anti-stereotypical entities.

The English-to-German (en-de) system has the highest accuracy, 51%. This system also shows the smallest difference between male and female translations (lowest $\Delta_G$, 1.7) and the smallest difference between pro-stereotypical and anti-stereotypical entities (lowest $\Delta_S$, 1.5). The surprising behaviour comes with the English-to-Italian (en-it) system, which has the lowest accuracy, 37.3%, but still performs reasonably on anti-stereotypical entities, with the second lowest $\Delta_S$ (5.6). However, the system still favors male translations, with a high $\Delta_G$ (23.6). English-to-Spanish (en-es) and English-to-French (en-fr) have similar accuracies (45.2% and 43.2%, respectively). However, they differ considerably in $\Delta_G$, which is much higher for en-es (25.7), showing a stronger bias towards male translations. These accuracy results show that the four translation directions present a significant amount of bias and are far from gender parity in performance.

Moreover, after manually inspecting the translation outputs, we observe that some professions are not correctly translated. For example, ’physician’ is always translated to the male version in en-es and en-it, and ’developer’ is always translated to the male version in en-it and en-fr, showing that stereotypes are perpetuated in ST.

Gender Bias in ST vs MT: Note that even when using a state-of-the-art ST system, results are much more biased than those of the commercial MT systems reported in the original WinoMT paper [26], where the best accuracies reached 74.1% in en-de, 59.4% in en-es, 63.6% in en-fr and 42.4% in en-it. This may be because ST is much more challenging than MT, and lower system performance implies higher bias. The gap narrows when comparing in terms of $\Delta_G$ and $\Delta_S$. In this case, ST comes closer to MT (in absolute terms) and even shows better results in $\Delta_G$ for en-it (in MT, 27.8) and in $\Delta_S$ for en-de (in MT, 12.5) and en-it (in MT, 9.4).

ST    Acc.    $\Delta_G$    $\Delta_S$
Table 3: WinoMT gender evaluation for four language pairs. Acc. is the percentage of instances in which the translation had the correct gender (the higher the better); $\Delta_G$ denotes the difference in F1 score between masculine and feminine sentences (the higher the worse); and $\Delta_S$ denotes the difference in accuracy between pro- and anti-stereotypical sentences (the higher the worse).

Gender Bias Evaluation in ASR: ASR systems contain gender biases, e.g., they perform better for male than female speakers [27]. However, gender bias associated with the context has not been studied in ASR yet, and WinoST allows this analysis. We may expect that ASR is less prone to show gender bias in contextual patterns because of the nature of the task, which inherently combines the purpose of acoustic and language modeling. The acoustic part does not consider long context information, but it tends to benefit from local context information [25]. However, the language modeling part takes into consideration the long-range context, and thus it may induce bias [7].

When using WinoST for ASR Gender Bias Evaluation, we need to distinguish between the transcription errors related to gender, from the ones that are not. In this sense, we computed the global accuracy in WinoST for the ASR best system in table 2, en-fr, and got a accuracy. However, this global accuracy includes misspelled professions. Discarding these misspelling errors, we obtained a accuracy predicting pronouns, showing that the amount of gender bias at the context level is quite low in ASR.
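One way to separate gender-related transcription errors from unrelated ones is to compare only the pronouns of the reference and the hypothesis. The sketch below is an illustration of that idea and not the authors' exact procedure, which is not detailed in the text; the pronoun list and function names are hypothetical.

```python
PRONOUNS = {"he", "she", "him", "her", "his", "hers"}  # illustrative list


def pronouns(sentence):
    """Gendered pronouns of a sentence, in order, ignoring case and punctuation."""
    words = (w.strip(".,;:!?") for w in sentence.lower().split())
    return [w for w in words if w in PRONOUNS]


def pronoun_accuracy(references, hypotheses):
    """Percentage of utterances whose pronoun sequence matches the reference."""
    if not references:
        return 0.0
    correct = sum(pronouns(r) == pronouns(h)
                  for r, h in zip(references, hypotheses))
    return 100.0 * correct / len(references)


refs = [
    "the developer argued with the designer because she did not like the design",
    "the physician told the nurse that he was busy",
]
hyps = [
    # misspelled profession, correct pronoun: not counted as a gender error
    "the develop argued with the designer because she did not like the design",
    # correct professions, wrong pronoun: counted as a gender error
    "the physician told the nurse that she was busy",
]
result = pronoun_accuracy(refs, hyps)
```

A misspelled profession lowers the global transcription accuracy but leaves this pronoun-level score unchanged, which is the distinction the paragraph above draws.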

7 Conclusions

This paper presents a new freely available challenge set for evaluating gender bias in ST. This challenge set, WinoST, benefits from an evaluation protocol that is widely used for MT. Our set evaluates systems only on the utterance content, where gender information is extracted from the context and not from the audio signal.

We used a state-of-the-art end-to-end ST system and evaluated its accuracy in terms of gender bias with this new challenge set. Results show that gender accuracy is much lower for ST than for MT, but we have to take into account that ST also has a lower quality than MT. Finally, we show that ASR can exhibit gender bias at the contextual level.

WinoST shares the main limitation of WinoMT, namely that it is a synthetic challenge set. A synthetic set is positive because it provides a controlled evaluation, but it is also negative because it might introduce some artificial biases. Therefore, further work could look for in-the-wild transcriptions (with parallel speech utterances) that contain the valuable patterns designed in WinoMT.


  1. This project is supported in part by the Catalan Agency for Management of University and Research Grants (AGAUR) through the FI PhD Scholarship. This work is also supported in part by the Spanish Ministerio de Ciencia e Innovación, the European Regional Development Fund, the Agencia Estatal de Investigación through the postdoctoral senior grant Ramón y Cajal and the projects EUR2019-103819, PCIN-2017-079 and PID2019-107579RB-I00 / AEI / 10.13039/501100011033.
    ©2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
  2. Freely available in Zenodo (10.5281/zenodo.4139080)
  3. https://genderbiasnlp.talp.cat/gebnlp2020/how-to-write-a-bias-statement/
  4. https://github.com/gabrielStanovsky/mt_gender/blob/master/LICENSE
  5. https://github.com/neulab/xnmt
  6. https://github.com/moses-smt/mosesdecoder


  1. R. Aharoni, M. Johnson and O. Firat (2019) Massively multilingual nmt. In NAACL.
  2. C. Basta, M. R. Costa-jussà and J. A. R. Fonollosa (2020) Towards mitigating gender bias in a decoder-based neural machine translation model by adding contextual information. In 4th WiNLP.
  3. L. Bentivogli, B. Savoldi, M. Negri, M. A. Di Gangi, R. Cattoni and M. Turchi (2020) Gender in danger? evaluating speech translation technology on the MuST-SHE corpus. In ACL.
  4. A. Bérard, L. Besacier, A. C. Kocabiyikoglu and O. Pietquin (2018) End-to-End ASR of Audiobooks. In ICASSP.
  5. A. Bérard, O. Pietquin, L. Besacier and C. Servan (2016) Listen and Translate: A Proof of Concept for End-to-End Speech-to-Text Translation. In NIPS Workshop on end-to-end learning for speech and audio processing.
  6. S. L. Blodgett, S. Barocas, H. Daumé III and H. Wallach (2020) Language (technology) is power: a critical survey of “bias” in NLP. In ACL.
  7. S. Bordia and S. R. Bowman (2019) Identifying and reducing gender bias in word-level language models. In NAACL: SRW.
  8. J. A. Buolamwini (2017) Gender shades: intersectional phenotypic and demographic evaluation of face datasets and gender classifiers. Massachusetts Institute of Technology.
  9. L. C. Vila, C. Escolano, J. A. R. Fonollosa and M. R. Costa-Jussà (2018) End-to-End Speech Translation with the Transformer. In IberSPEECH.
  10. W. I. Cho, J. W. Kim, S. M. Kim and N. S. Kim (2019) On measuring gender bias in translation of gender-neutral pronouns. In GeBNLP Workshop.
  11. M. R. Costa-jussà, P. L. Lin and C. España-Bonet (2020) GeBioToolkit: automatic extraction of gender-balanced multilingual corpus of wikipedia biographies. In LREC.
  12. M. R. Costa-jussà (2019) An analysis of gender bias studies in natural language processing. Nature Machine Intelligence 1.
  13. C. D’Ignazio and L. Klein (2018) Data feminism.
  14. M. A. Di Gangi, R. Cattoni, L. Bentivogli, M. Negri and M. Turchi (2019) MuST-c: a multilingual speech translation corpus. In NAACL.
  15. L. Dong, S. Xu and B. Xu (2018) Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In ICASSP.
  16. J. E. Font and M. R. Costa-jussà (2019) Equalizing gender bias in nmt with word embeddings techniques. In GeBNLP Workshop, pp. 147–154.
  17. M. A. D. Gangi, M. Negri and M. Turchi (2019) Adapting Transformer to End-to-End Spoken Language Translation. In Interspeech.
  18. H. Gonen and K. Webster (2020) Automatically identifying gender issues in machine translation using perturbations. arXiv:2004.14065.
  19. Y. Liu, H. Xiong, J. Zhang, Z. He, H. Wu, H. Wang and C. Zong (2019) End-to-End Speech Translation with Knowledge Distillation. In Interspeech.
  20. A. Moryossef, R. Aharoni and Y. Goldberg (2019) Filling gender & number gaps in neural machine translation with black-box context injection. arXiv preprint arXiv:1903.03467.
  21. G. Neubig, M. Sperber, X. Wang, M. Felix, A. Matthews, S. Padmanabhan, Y. Qi, D. Sachan, P. Arthur, P. Godard, J. Hewitt, R. Riad and L. Wang (2018) XNMT: the extensible nmt toolkit. In AMTA.
  22. M. O.R. Prates, P. H. Avelar and L. C. Lamb (2020) Assessing gender bias in mt: a case study with google translate. Neural Comput and Applic.
  23. D. Saunders and B. Byrne (2020) Reducing gender bias in nmt as a domain adaptation problem. In ACL.
  24. R. Sennrich, B. Haddow and A. Birch (2016) NMT of rare words with subword units. In ACL.
  25. M. Sperber, J. Niehues, G. Neubig, S. Stüker and A. Waibel (2018) Self-Attentional Acoustic Models. In InterSpeech.
  26. G. Stanovsky, N. A. Smith and L. Zettlemoyer (2019) Evaluating gender bias in mt. In ACL.
  27. R. Tatman (2017) Gender and dialect bias in YouTube’s automatic captions. In Proc. 1st ACL Workshop on Ethics in NLP.
  28. E. Vanmassenhove, C. Hardmeier and A. Way (2018) Getting gender right in nmt. In EMNLP.
  29. R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu and Z. Chen (2017) Seq2seq models can directly translate foreign speech. In Interspeech.