Expression, conversation, and inflection are all complex facets of communication that we use unconsciously every day. Language comprises more than the written and spoken elements of English; it can also be found in gestures, in numbers, in word of mouth, or simply at the end of a pen. More than a form of socialization, all these different types of language are, at their core, a means of transferring information from one party to another. If the communication is not in the words that are spoken, it is found in the grammatical structure, the intonation of certain sounds, the symbols used, or how the language is written.

Languages across the globe are both cultural constructs and forms of cultural understanding. Much of the knowledge expressed between humans is understood automatically, through assumption or instinct. Constructs like gender roles, social hierarchies, and cultural values are all communicated through the cadences of the “speaker’s” intention.

With that said, it is fair to state that different languages contain different data. This is especially relevant when dealing with language from a data science perspective. In data, different languages may communicate different information with the exact same sentence. Elements such as sarcasm, metaphor, and so-called ‘outdated terminology’ may be misinterpreted or lost entirely.

For example, take the English sentence “I have a cat.” An AI trained in English would not be able to derive much information from this beyond the speaker being in possession of a cat. However, in a gendered language like Arabic, the sentence “I have a cat.” gains an additional layer of data due to grammatical gender: the gender of the cat is implied in the wording. In Arabic, the word for “cat” changes depending on the cat’s gender, so a simple statement about possessing a cat conveys more information than is explicitly stated. An AI trained with this in mind would be able to derive more information thanks to the different grammatical rules, adding layers to the conversation that would otherwise have been overlooked. This can create issues when processing information in some languages, especially if your database consists of examples from multiple languages. This is where it becomes particularly important to incorporate language-specific rules when dealing with language in a data format.
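The cat example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LEXICON` table and the `extract_features` function are hypothetical names invented for this sketch, the lexicon is hand-written, and a real system would rely on a proper morphological analyzer for each language rather than a lookup table.

```python
# Hypothetical, hand-written lexicon for illustration only.
# Each language maps a word to the features that the word itself carries.
LEXICON = {
    "en": {
        "cat": {"concept": "cat"},
    },
    "ar": {
        "قط": {"concept": "cat", "gender": "male"},
        "قطة": {"concept": "cat", "gender": "female"},
    },
}


def extract_features(tokens, lang):
    """Return the lexicon features of every token known in the given language."""
    entries = LEXICON.get(lang, {})
    return [dict(entries[token]) for token in tokens if token in entries]


# "I have a cat." in English, already split into lowercase tokens.
english = extract_features(["i", "have", "a", "cat"], "en")

# One way to say "I have a cat." in Arabic, using the feminine noun.
arabic = extract_features(["لدي", "قطة"], "ar")

print(english)
print(arabic)
```

The same sentence yields a bare concept in English, while the Arabic noun also carries the cat's gender. Keeping the rules keyed by language, as in this sketch, is one simple way to avoid applying one language's assumptions to another when a dataset mixes several languages.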

Extending language processing programs beyond an English-only understanding helps avoid the confusion that such overlap may cause. Even then, however, some cultural knowledge may be lost. It is important to have someone fluent in the language (and in its cultural intricacies) review the results of the processing and catch any information the AI may have missed, so that the program can be adjusted to account for the cultural information that was lacking. This not only recovers information lost in translation, but also humanizes the data, allowing for a friendlier understanding. Once a proper data analysis infrastructure is in place, along with any cultural resources, the ecosystem.Ai workbench can help organize the results of your dataset. Linguistic data can be dense and difficult to understand in a variety of ways. Using your specifically configured environment in the workbench will allow you to view the general mood and the personality types of the speakers in your dataset.


Daniel Nordfors

Jessica Nicole