Avoiding Pitfalls When Machine Translating Proper Names

Multilingual investigations are common for both organizations and enforcement agencies due to the global nature of supply chains and illicit networks. While it would be ideal for investigators to possess proficiency in the language of the target jurisdiction, this is not always feasible.

Machine translation solutions like Google Translate and AI tools such as ChatGPT and Gemini are often used to tackle these multi-jurisdiction investigations, but they have limitations, and an over-reliance on machine translation can lead to costly mistakes. 

Translating proper names is especially challenging for machine translation tools. Let’s look at three scenarios involving machine translation of proper names and tips for ensuring these translations are accurate.

Addressing translation challenges of a proper name

In some cases, translating text from one language to another can be performed reliably with machine translation tools. 

Figure 1: Two examples of machine translation

However, there’s not always a 1:1 correspondence in these translation scenarios. A company may not translate its name, translate only part of its name, or use a combination of translation and transliteration, as in the Chinese name in Figure 1. Companies from Spanish-speaking countries will often forgo translation entirely and operate globally under their native Spanish names (e.g., Banco Santander). U.S.-based investigators and compliance teams would not translate names like these into English to find where the company operates in the U.S.; they would search for the Spanish name as written.

In these instances, investigators should be aware of regional tendencies in language use and the fact that word-by-word translations are not always going to reflect how a company name will appear abroad.

Addressing transliteration challenges of a proper name

Transliteration converts characters or letters between writing systems while preserving pronunciation. For example, the translation of “Γεια σου,” an informal greeting in Greek, is “hello.” A transliteration of this phrase would be “geia sou.”
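As a rough illustration of the difference, transliteration can be sketched as a character-by-character mapping between scripts. The table below is a hypothetical, hand-built mapping that covers only the letters in this one phrase. It is not a standard romanization scheme, and real schemes also handle digraphs, accents, and context.

```python
# Hypothetical mapping: covers only the letters needed for this example.
GREEK_TO_LATIN = {
    "γ": "g", "ε": "e", "ι": "i", "α": "a",
    "σ": "s", "ς": "s", "ο": "o", "υ": "u",
}

def transliterate(text: str) -> str:
    """Map each Greek letter to a Latin letter; leave other characters as-is."""
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in text.lower())

print(transliterate("Γεια σου"))  # geia sou
```

Note that the function never consults a dictionary of meanings: it produces “geia sou” without knowing the phrase means “hello.”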

Machine translation solutions choose to transliterate rather than translate when no direct, meaningful translation exists, such as with proper nouns and branding. Transliteration arguably poses a bigger challenge to investigators than translation because it can create a multitude of potential correct answers or different potential aliases of an individual or company. Not taking this into account can mean missing key information in an investigation.

For example, مُحَمَّد can take on a variety of forms when transliterated into Latin characters. A human might recognize these different iterations as being the same, but a computer parsing this information or looking for exact character-to-character matches against another database may have a harder time making this determination.

Figure 2: Potential results from transliteration

To cast a wider net during an investigation, investigators should account for the fact that a target’s name may be written in several different ways as a result of transliteration. Querying multiple possible variations, or using Boolean operators or fuzzy matching to broaden results, are good approaches in this situation.
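A minimal sketch of both approaches follows, using only the Python standard library. The spellings, the 0.7 similarity threshold, and the helper names are illustrative assumptions, not values or functions from any particular search tool, and the Boolean query syntax will differ by platform.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_matches(query: str, candidates: list, threshold: float = 0.7) -> list:
    """Return the candidates whose similarity to the query meets the threshold."""
    return [c for c in candidates if similarity(query, c) >= threshold]

def boolean_or_query(variants: list) -> str:
    """Join quoted spelling variants with OR (generic syntax, an assumption)."""
    return " OR ".join(f'"{v}"' for v in variants)

# Illustrative names as they might appear in a records database.
records = ["Muhammad", "Mohamed", "Mohammad", "Ibrahim"]

print(fuzzy_matches("Mohammed", records))
print(boolean_or_query(["Mohammed", "Muhammad", "Mohamed"]))
```

An exact string comparison against “Mohammed” would match none of these records, while the fuzzy comparison keeps the three spelling variants and drops the unrelated name. A lower threshold widens the net further at the cost of more false positives.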

In addition, awareness of regional spelling conventions can help ensure investigations start on the right track. For example, Mohammed tends to be spelled differently in former French colonies like Algeria (where the name tends to start with “mo”) than in some of the Gulf states (where it generally starts with “mu”).
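One way to put such regional tendencies to work is to try the likeliest spellings first. The sketch below is hypothetical: the lookup table contains only the two tendencies mentioned above, and the region labels and function name are assumptions made for illustration.

```python
# Hypothetical hints, limited to the two tendencies described in the text.
PREFERRED_PREFIX = {"Algeria": "mo", "Gulf states": "mu"}

def order_variants(variants: list, region: str) -> list:
    """Put spellings that match the region's usual prefix first."""
    prefix = PREFERRED_PREFIX.get(region)
    if prefix is None:
        return list(variants)  # no hint available: keep the original order
    # sorted() is stable, so ties keep their original relative order.
    return sorted(variants, key=lambda v: not v.lower().startswith(prefix))

print(order_variants(["Muhammad", "Mohamed", "Mohammed"], "Algeria"))
# ['Mohamed', 'Mohammed', 'Muhammad']
```

The ordering only prioritizes queries. Every variant is still returned, since these are tendencies rather than rules.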

Addressing phonetic transcription challenges of a proper name

Logographic writing systems like Chinese use characters to represent words or concepts, as opposed to the individual letters used in alphabetic writing systems like English. Mandarin pronunciation is often represented using a phonetic transcription system called pinyin, which uses the Latin alphabet.

Figure 3: Translated and pinyin versions of original Mandarin

Pinyin transcriptions are not bidirectional: many different characters share the same pinyin spelling, so a pinyin name does not map back to a single set of characters. When investigators run a pinyin name through a machine translation tool in an attempt to return Chinese characters, they are often frustrated when it returns no result or a result that does not correspond to an actual company. In the worst-case scenario, investigators may not realize the result is incorrect, leaving them vulnerable to costly and time-consuming errors.
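The ambiguity of going from pinyin back to characters can be illustrated with a small sketch. The lookup table below is a tiny hand-picked sample of common characters for two toneless syllables, an assumption made for illustration; a full dictionary would list many more characters per syllable.

```python
from itertools import product

# Tiny illustrative sample: a few common characters per toneless syllable.
CHARS_BY_PINYIN = {
    "hua": ["华", "花", "化", "话"],
    "shi": ["是", "事", "市", "石", "时"],
}

def candidate_strings(pinyin: str) -> list:
    """List every character string that could be written as the given pinyin.

    Raises KeyError for syllables missing from the sample table.
    """
    syllables = pinyin.lower().split()
    options = [CHARS_BY_PINYIN[s] for s in syllables]
    return ["".join(combo) for combo in product(*options)]

candidates = candidate_strings("hua shi")
print(len(candidates))  # 20
print(candidates[:3])
```

Even with this small sample, a two-syllable name has 20 possible spellings in characters, and the number multiplies with each added syllable. A tool asked to reverse the transcription has to guess among them.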

Because of this complexity, machine translating from pinyin back to the original characters is unlikely to succeed. Investigators should instead find alternate ways to identify the name, such as searching for the company’s website to find its registered English name in the site footer.

How machine translation is used in Sayari products

Machine translation can provide significant assistance when conducting cross-border investigations, which is why Sayari’s products offer in-application access to Google Translate to facilitate on-demand translation of record text. Given that machine translation is not always correct or helpful, the original language text is preserved within Sayari to support further investigation. In addition, where possible, Sayari products include aliases (e.g., a Chinese company name and its official, registered English trade name).

For additional guidance on techniques to get the most out of machine translations, watch our webinar Navigating Language Barriers in Investigations and Due Diligence.