Here at Datactics, we’ve recently completed a number of transliteration matching tasks, helping customers with Japanese, Russian Cyrillic and Arabic data sets.
Transliteration matching can seem challenging, especially when you’re presented with text you don’t understand. But with the right techniques a lot can be achieved – the key is to really understand the problem and to have some proven techniques for dealing with it:
- Matching data within a single character set
We have a long-standing Chinese customer who routinely matches data sets of hundreds of millions of customer records, all in Chinese. Even with messy data this is a relatively straightforward task, as long as your matching algorithms handle Unicode properly: fuzzy matching within a single character set (e.g. a Chinese customer database against a Chinese marketing database) is very similar to the same task in a Latin character set, albeit with some tweaks to fuzzy match tolerances.
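As a rough illustration of what “handling Unicode properly” and “tweaked tolerances” mean in practice, the sketch below normalises both strings to a canonical Unicode form before computing a simple similarity score. It is a minimal sketch using only the Python standard library; the company names and the 0.85 threshold are illustrative assumptions, not recommended settings.

```python
import unicodedata
from difflib import SequenceMatcher

def normalise(text: str) -> str:
    # NFKC normalisation folds full-width/half-width variants and other
    # compatibility characters down to a single canonical form.
    return unicodedata.normalize("NFKC", text).strip().lower()

def similarity(a: str, b: str) -> float:
    # Character-level similarity; works on any Unicode string, including
    # CJK text, because it compares code points directly.
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

# Hypothetical records from two Chinese-language databases
crm_name = "北京华宇软件股份有限公司"
marketing_name = "北京华宇软件　股份有限公司"   # contains a full-width space

# CJK names are short, so one differing character moves the score more
# than it would in Latin text – hence the tweaked tolerance.
THRESHOLD = 0.85  # assumed value for illustration
print(similarity(crm_name, marketing_name) >= THRESHOLD)
```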
- Frequency Analysis
Another very useful technique is to perform frequency analysis on the input text to help identify ‘noise text’, such as company legal forms within company names, that can either be eliminated from the match or matched with lower importance than the rest of the company name. For example, frequency analysis on a Japanese entity master database may reveal a large number of company names containing the Kanji “株式会社” or “株” – the Japanese equivalent of ‘Limited’ (or ‘Ltd.’ in abbreviated form). The beauty of this technique is that it can be applied to any language or character set, as the sketch below shows.
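As a rough sketch of the idea, the snippet below counts how often short leading and trailing substrings occur across a list of company names; in a Japanese entity master the count for 株式会社 would dominate, flagging it as a candidate ‘noise’ token. The sample names and the affix lengths are illustrative assumptions.

```python
from collections import Counter

def affix_frequencies(names, max_len=5):
    """Count short prefixes and suffixes across a list of names.

    Very frequent affixes are good candidates for 'noise' tokens such as
    legal forms, because they appear in a large share of the records.
    """
    counts = Counter()
    for name in names:
        name = name.strip()
        for n in range(1, min(max_len, len(name)) + 1):
            counts[name[:n]] += 1   # leading substring
            counts[name[-n:]] += 1  # trailing substring
    return counts

# Hypothetical sample of Japanese company names
names = [
    "株式会社サンプル商事",
    "株式会社テスト工業",
    "例示電機株式会社",
    "サンプル食品株式会社",
]

for affix, count in affix_frequencies(names).most_common(5):
    print(affix, count)
```

The same counting approach works unchanged for Latin-script data (where it would surface tokens such as ‘Ltd’ or ‘GmbH’), which is what makes the technique language-agnostic.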
- Matching between character sets using Transliteration, fuzzy and phonetic matching
A common requirement in the AML/KYC space is matching account names in Chinese, Japanese, Cyrillic and other scripts to sanctions and PEP lists, which are usually published in Latin script. To do this, a process called ‘transliteration’ is required. Transliteration converts text in one character set to another, but the results of raw transliteration are not always directly usable: the transliterated text is often more of a ‘pronunciation guide’ than how a native speaker would actually write the name in Latin script. However, by using a combination of fuzzy and phonetic matching on the transliterated string, it is possible to obtain very accurate matching.
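To make that pipeline concrete, here is a minimal sketch of the three steps – transliterate, then apply fuzzy and phonetic comparison to the Latin output. It assumes the third-party unidecode and jellyfish packages purely as convenient stand-ins, and the blending weights are illustrative assumptions rather than tuned values.

```python
from difflib import SequenceMatcher

from unidecode import unidecode   # raw transliteration to ASCII
import jellyfish                  # phonetic encoding (Metaphone)

def match_score(non_latin_name: str, watchlist_name: str) -> float:
    # Step 1: raw transliteration – closer to a pronunciation guide
    # than to the spelling a native speaker would choose.
    translit = unidecode(non_latin_name).lower().strip()
    target = watchlist_name.lower().strip()

    # Step 2: fuzzy comparison of the transliterated string.
    fuzzy = SequenceMatcher(None, translit, target).ratio()

    # Step 3: phonetic comparison, tolerant of spelling variants that
    # sound alike after transliteration.
    same_sound = jellyfish.metaphone(translit) == jellyfish.metaphone(target)
    phonetic = 1.0 if same_sound else 0.0

    # Blend the two signals; the 70/30 weighting is purely illustrative.
    return 0.7 * fuzzy + 0.3 * phonetic

# Hypothetical Cyrillic account name vs. Latin-script watchlist entries
print(match_score("Михаил Иванов", "Mikhail Ivanov"))
print(match_score("Михаил Иванов", "Michail Ivanoff"))
```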