Apple has just published a paper, in collaboration with USC, that explores the machine learning methods used to give users of its iOS 18 operating system greater choice about gender in translation.
Source: https://support.apple.com/guide/iphone/translate-text-voice-and-conversations-iphd74cb450f/ios
Though the problems tackled in the work (which Apple has announced here) engage, to some extent, with current debates around definitions of gender, the paper centers on a far older problem: the fact that 84 of the 229 languages surveyed in the World Atlas of Language Structures use a sex-based gender system.

Source: https://wals.info/feature/31A#map
Surprisingly, the English language falls into the sex-based category, since it assigns masculine or feminine singular pronouns.
In contrast, all Romance languages (including Spanish, with over half a billion speakers) – and many other widely spoken languages, such as Russian – require gender agreement in ways that force translation systems to deal with sex assignment in language.
The new paper illustrates this by considering all possible Spanish translations of a gender-ambiguous English sentence:

Source: https://arxiv.org/pdf/2407.20438
Naïve translation is far from sufficient for longer texts, which may establish gender at the outset and thereafter not refer to gender again. Nonetheless, the translation must remember the assigned gender of the participant throughout.
This can be difficult for token-based approaches that handle translations in discrete chunks, and risk losing the assigned gender context over the course of the content.
Worse, systems that provide alternative translations for biased gender assignments cannot do so indiscriminately, i.e., by merely substituting the gendered noun; they must ensure that all other parts of speech agree with the modified gender.
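To make the agreement problem concrete, here is a minimal Python sketch (illustrative only, not Apple's code, with a hypothetical three-word lexicon) showing why naive substitution fails in Spanish: swapping the noun's gender also forces the article and the adjective to change.

```python
# Toy illustration: changing gender in a Spanish phrase requires
# agreement changes in the article and adjective, not just the noun.

# Hypothetical mini-lexicon mapping feminine forms to masculine ones.
FEM_TO_MASC = {
    "la": "el",                # definite article
    "enfermera": "enfermero",  # noun: nurse
    "cansada": "cansado",      # adjective: tired
}

def naive_swap(sentence: str) -> str:
    """Swap only the gendered noun -- leaves agreement broken."""
    return sentence.replace("enfermera", "enfermero")

def agreement_aware_swap(sentence: str) -> str:
    """Swap every word that participates in gender agreement."""
    return " ".join(FEM_TO_MASC.get(w, w) for w in sentence.split())

src = "la enfermera está cansada"
print(naive_swap(src))            # broken: "la enfermero está cansada"
print(agreement_aware_swap(src))  # correct: "el enfermero está cansado"
```

Real systems cannot rely on a lookup table, of course; the sketch only shows how many words a single gender choice can touch.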
In this instance from the Apple/USC paper, we see that though the entity has been assigned a male gender, the singular past participle has been left as feminine:

A translation system must also address the eccentricities of particular languages in regard to gender. As the paper points out, the pronoun 'I' is gendered in Hindi, which provides an unusual clue to gender.
Gender Issues
In the new paper, the Apple and USC researchers propose a semi-supervised method to convert gender-ambiguous entities into an array of entity-level alternatives.
The system, which was used to inform translation in the Apple Translate app in iOS 18, constructs a language schema both by using large language models (LLMs) and by fine-tuning pre-trained open-source machine translation models.
The translation outputs from these systems were then trained into an architecture containing gender structures – groups of phrases containing differently gendered forms of nouns that represent the same entity.
The paper states:
The approach that the researchers arrive at effectively turns a translation from a single token to a user-controlled array.
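One way to picture such a user-controlled array (a sketch of the general idea only; the paper's actual data format is not reproduced here, and the example phrases are my own) is a translation split into fixed spans and gender-alternative spans that can be toggled consistently:

```python
# Sketch: a translation represented as fixed text plus gender-alternative
# spans, rather than as a single fixed string.
# Each segment is either fixed text or a dict of gender alternatives.
segments = [
    {"M": "El doctor", "F": "La doctora"},
    "está",
    {"M": "cansado", "F": "cansada"},
]

def realize(segments, choice: str) -> str:
    """Realize one surface translation for a chosen gender."""
    return " ".join(s if isinstance(s, str) else s[choice] for s in segments)

def all_alternatives(segments):
    """Enumerate every consistent gendered rendering."""
    return sorted({realize(segments, g) for g in ("M", "F")})

print(realize(segments, "F"))   # "La doctora está cansada"
print(all_alternatives(segments))
```

Because one choice is applied to every alternative span at once, agreement between the article, noun, and adjective is preserved by construction.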
The model Apple and USC developed was evaluated on the GATE and MT-GenEval test sets. GATE comprises source sentences with up to three gender-ambiguous entities, while MT-GenEval comprises material where gender cannot be inferred, which, the authors state, aids in understanding when alternative gender options should not be offered to the user.
In both cases, the test sets had to be re-annotated to align with the goals of the project.
To train the system, the researchers relied on a novel automatic data augmentation algorithm, in contrast to the aforementioned test sets, which were annotated by humans.
Contributing datasets for the Apple curation were Europarl, WikiTitles, and WikiMatrix. The corpora were divided into G-Tag (with 12,000 sentences), encompassing sentences with head words for all entities, along with a gender-ambiguity annotation; and G-Trans (with 50,000 sentences), containing gender-ambiguous entities and gender alignments.
The authors assert:
Datasets and various data for the project have been made available on GitHub. The data features five language pairs, pairing English with Russian, German, French, Portuguese, and Spanish.
The authors leveraged a previous approach from 2019 to endow the model with the ability to output gender alignments, training with cross-entropy loss and an additional alignment loss.
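The shape of such a two-term objective can be sketched in a few lines of Python (a simplified illustration with toy probabilities; the 0.1 weighting is my assumption, not a value from the paper):

```python
import math

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target_idx])

def combined_loss(trans_probs, trans_target,
                  align_probs, align_target, alignment_weight=0.1):
    """Translation loss plus a weighted auxiliary alignment loss.
    The 0.1 weight is an illustrative assumption, not from the paper."""
    l_trans = cross_entropy(trans_probs, trans_target)
    l_align = cross_entropy(align_probs, align_target)
    return l_trans + alignment_weight * l_align

# Toy distributions: translation over 3 classes, alignment over 2.
loss = combined_loss([0.7, 0.2, 0.1], 0, [0.6, 0.4], 0)
print(round(loss, 4))
```

The auxiliary term lets one network learn alignments alongside translation without a separate model, at the cost of tuning the extra weight.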
For the data augmentation routine, the authors eschewed traditional rule-based methods in favor of a data-centric approach, fine-tuning a pre-trained BERT language model on the G-Tag dataset.
Double-Take
For cases where ambiguous gender entities are detected, Apple and USC explored two methods – the fine-tuning of pre-trained language models, and the use of LLMs.
Regarding the first method, the paper states:

In the image above, we see the fine-tuned text in the lower middle column and the desired output in the right column, with the underlying rationale illustrated above.
For this approach, the authors made use of a lattice rescoring method from an earlier 2020 work. To ensure that only the target domain (gender) was addressed, a constrained beam search was used as a filter.
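A crude sketch of the filtering idea (my own simplified logic, not the paper's lattice rescoring: keep only candidate rewrites that differ from the original in pre-identified gendered slots):

```python
def differs_only_at(candidate, original, allowed_positions):
    """True if candidate and original differ only at allowed token slots."""
    cand, orig = candidate.split(), original.split()
    if len(cand) != len(orig):
        return False
    return all(c == o or i in allowed_positions
               for i, (c, o) in enumerate(zip(cand, orig)))

original = "la doctora está cansada"
candidates = [
    "el doctor está cansado",     # gender-only edit: keep
    "la doctora parece cansada",  # meaning changed: reject
]
gendered_slots = {0, 1, 3}  # positions allowed to vary (hypothetical)
kept = [c for c in candidates if differs_only_at(c, original, gendered_slots)]
print(kept)  # ['el doctor está cansado']
```

The effect is the same as constraining the search space: edits outside the gendered positions are discarded rather than offered as alternatives.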
For the LLM approach, the authors devised a technique that uses an LLM as an editor, rewriting the supplied translations to provide gender assignments.

With the results from both approaches concatenated, the model was subsequently fine-tuned to classify source tokens into one of two classes (indicated by '1' and '2' in the schema below).
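This classification step amounts to standard token tagging. The following toy sketch shows only the input/label format (the example sentence and labeling are hypothetical; the semantics of the two classes follow the paper's schema, which is not reproduced here):

```python
# Toy sketch of a token-classification training example: each source
# token is paired with a class id ('1' or '2', as in the schema).
example = {
    "tokens": ["The", "doctor", "is", "tired"],
    "labels": [1, 2, 1, 1],  # hypothetical labeling
}

def tokens_with_label(example, label):
    """Return the tokens carrying a given class id."""
    return [t for t, l in zip(example["tokens"], example["labels"])
            if l == label]

print(tokens_with_label(example, 2))  # ['doctor']
```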

Data and Tests
The detector used for the project was developed by fine-tuning Facebook AI's xlm-roberta-large model with the Transformers library. For this, the combined G-Tag dataset was used across all five language pairs.
In the first of the two aforementioned approaches, the M2M 1.2B model was trained in Fairseq, jointly with bi-text data from the G-Trans dataset, with gender inflections provided by Wiktionary.
For the LLM method, the authors used GPT-3.5-turbo. For the alignment of gender structures, xlm-roberta-large was again used, this time with gender alignments extracted from G-Trans.
Metrics for the evaluation covered the quality of the generated alternatives, the gender structures, and the alignments.
Though the first two of these are largely self-explanatory, alignment accuracy measures the percentage of output gender structures that conform to the known correct source entity, and uses the δ-BLEU method, in accordance with the methodology for MT-GenEval.
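As a rough sketch of what such an accuracy figure computes (my own simplified formulation, not the paper's exact definition):

```python
def alignment_accuracy(predicted, reference):
    """Fraction of output gender structures whose predicted source
    entity matches the reference. Simplified illustration."""
    if not reference:
        return 0.0
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)

# Hypothetical case: 3 of 4 predicted structure-to-entity links match.
print(alignment_accuracy(["e1", "e2", "e2", "e3"],
                         ["e1", "e2", "e1", "e3"]))  # 0.75
```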
Below are the results for the data augmentation pipeline:

Here the authors comment:
The researchers also compared the data augmentation system's performance, via M2M, against GATE's sentence-level gender re-writer, on GATE's own stated terms.

Here the paper states:
Finally, the authors trained various 'vanilla' multilingual baseline models. The contributing datasets were WikiMatrix, WikiTitles, Multi-UN, NewsCommentary, and Tilde.
Two additional vanilla models were trained: one incorporating the G-Trans dataset with a prefixed tag, which was employed as the supervised baseline; and a third incorporating gender structures and alignments (on the smaller local model, since using GPT's API-based services would have been very expensive for this purpose).
The models were tested against the 2022 FloRes dataset.

The paper summarizes these results:
The authors conclude by noting that the success of the model should be considered within the broader context of NLP's struggle to rationalize gender assignment in translation, and they note that this remains an open problem.
Though the researchers concede that the results obtained do not fully achieve the goal of generating entity-level gender-neutral translations and/or gender disambiguations, they consider the work to be a 'powerful instrument' for future explorations into one of the most difficult areas of machine translation.