The Transliteration Problem in OSINT

Updated: May 2


Open source investigations routinely rely on translation tools to overcome language barriers, because targets may post or engage online in multiple languages. When English is the base language, any material collected in a language that does not use the Latin (Roman) alphabet must be transliterated before it can be processed and analysed.

Transliteration is inherently vulnerable to unintentional error when the information is converted back to its original form for reporting and evidentiary purposes. Inaccurate reporting of this kind could lead to failed prosecutions, because an error in a name carries into any further collection that pivots from it to find additional information about an individual. Linguistic details are paramount. Literal accuracy matters.

The Arabic language provides a prime vignette for addressing the procedural issues that corrupt the accuracy of the intelligence cycle when collecting open source information. This paper explores the problem and provides a simple procedural solution to ensure accuracy in producing intelligence from open source information. It assumes a basic understanding of the intelligence cycle and of how collection, processing and analysis lead to intelligence production.

The Transliteration Problem

The requirement to translate non-native languages, such as Arabic to English, creates an inherent risk that the translated information will deviate from the original[1]. Details pertaining to the human and physical terrain are vital to ensure accurate and legal investigations based on the direction given. The name of a person must be valid if it is to serve as the basis for investigation and prosecution, just as locations must be accurate when supporting ground-based operations such as surveillance and the reporting required after the operation.

Historically, the requirement to transliterate from other languages into English and back has been addressed by introducing technological tools that apply common transliteration schemes to standardise naming references in English. However, these tools address only the consistency of the information. They do not provide a reference point for validating it during analysis or when an intelligence product is disseminated.

For evidentiary purposes, and for any warrant-based operation where the target originates from a non-Western country, the individual's name will ultimately need to be available in the native language. However, not every individual conducting the investigation will necessarily be able to read and write that language.

Failing to properly validate the individual's name in the native language could result in two completely different individuals being confused and the wrong one inadvertently targeted. To give linguistic context to the problem, the largest known English dictionary contains roughly 600,000 non-repeated words, while the Arabic language has over 12,300,000[2]. Mathematically, this disparity creates significant potential for error when translating between the two languages.

The problem is not a failure of any individual in the intelligence cycle but rather the absence of a validation process covering the information from its collection through to its dissemination as intelligence.

Inputs & Outputs

The following diagram depicts a common interaction with a partner force element and its association with the intelligence cycle:

Figure 1 - Processing open source information through the intelligence cycle

Without validating the input against the output, there is significant potential for cascading errors to occur throughout the intelligence cycle. This starts with the discarding of the native language version of the information, in this case, Arabic. The result is an inability to conduct validation against the native language during the analysis and dissemination phases.
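The point about retaining the native-language input can be illustrated with a minimal sketch. It assumes Python, and the names `NameRecord` and `matches_input` are hypothetical, introduced here for illustration only and not taken from any tool discussed in this paper. The idea is simply that the raw Arabic string is stored alongside its English working form at collection, so that any Arabic string appearing in a later product can be compared against it.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class NameRecord:
    """Hypothetical record that keeps the raw input next to its working form."""
    native: str           # name exactly as collected, in the original script
    transliteration: str  # English working form used for processing and analysis


def normalise(text: str) -> str:
    # Unicode NFC normalisation plus whitespace trimming, so that visually
    # identical strings with different encodings still compare as equal.
    return unicodedata.normalize("NFC", text.strip())


def matches_input(record: NameRecord, output_native: str) -> bool:
    """Return True only if the output string is the name originally collected."""
    return normalise(record.native) == normalise(output_native)


record = NameRecord(native="عباس", transliteration="Abbas")
print(matches_input(record, "عباس"))  # same spelling as collected
print(matches_input(record, "عباص"))  # different final letter
```

If the `native` field is discarded after transliteration, a check of this kind has nothing to compare against, which is the failure described above.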


Using the flowchart provided in Figure 1, the following example serves to highlight the potential deviation in information accuracy as part of the intelligence cycle when no native language validation is conducted. The name provided is a less-common Arabic name to demonstrate the potential deviations that can occur. Note: the transliteration is not inclusive of all variants. The error rate would grow with more variants included.

  1. Input variables:

  • Information collected online: A person’s first name

  • Arabic name collected in raw form: عباس

  2. Processing variables:

  • English transliteration variants (samples): Abbas, Abas, Abaas, Aubaas

  3. Output variables:

  • Arabic versions (samples) that could be translated back out from the transliterated variants: عباس, عباص, ابباص, ابباس, اباس, اباص, ابعاس

Figure 2 - Output possibility matrix

Following the information flow in Figure 2, if the name were fused with other forms of intelligence and a new intelligence product were produced for dissemination (e.g. to support a warrant), it is highly plausible that the transliterated name would become incorrect when translated back into Arabic. This is problematic when prosecuting a target: there is a possibility of mistaken identity, or the action itself could become invalid during judicial processing because the name provided with the associated evidence does not align with the individual apprehended.
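The one-to-many nature of the return trip can be sketched in a few lines of Python. The `REVERSE` table below is an invented toy mapping that covers only the letters a, b and s. It is an assumption made for illustration and does not represent any romanisation standard or real tool. The function name `back_transliterate` is likewise hypothetical.

```python
from itertools import product

# Toy reverse mapping (assumption, illustration only): each Latin letter maps
# to the Arabic letters it might stand for. "" means "not written in Arabic",
# e.g. a short vowel.
REVERSE = {
    "a": ["ا", "ع", ""],
    "b": ["ب"],
    "s": ["س", "ص"],
}


def back_transliterate(latin: str) -> list[str]:
    """Enumerate every Arabic spelling the toy table allows for a Latin name.

    Raises KeyError for letters outside the toy table.
    """
    options = []
    previous = ""
    for ch in latin.lower():
        choices = list(REVERSE[ch])
        # A doubled Latin letter may represent a single Arabic letter,
        # so the repeat is allowed to vanish.
        if ch == previous and "" not in choices:
            choices.append("")
        options.append(choices)
        previous = ch
    # Combine one choice per position and drop duplicate spellings.
    return sorted({"".join(combo) for combo in product(*options)})


candidates = back_transliterate("Abbas")
print(len(candidates))       # number of distinct candidate spellings
print("عباس" in candidates)  # the original is present, but not identifiable
```

With this toy table, the single spelling "Abbas" expands to 36 distinct Arabic candidates. The name originally collected is among them, but nothing in the transliteration alone identifies which one it is.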

The Validation Loop