A team of scientists is looking for writings in the Cypriot dialect to create a bot for the internet and smartphones
A team of four data scientists and a linguist has been combing the island for anyone who writes in Cypriot Greek, also known as the Cypriot dialect, in an attempt to build a bot that speaks Cypriot. The pioneering initiative aims to put Cypriot on the artificial intelligence map.
Comprising Nick Sorros, founder of MantisNLP; Christos Christodoulou, industry programme manager at the CaSToRC centre of The Cyprus Institute; Spyros Armostis, linguist from the University of Cyprus; and Loukia Taxitari, lecturer of experimental psychology at Neapolis University Paphos, the team is quick to say that the intelligence behind the bot, which already exists for most languages, is envisioned to have many uses.
“We’re mainly interested in the challenge of seeing whether we can build a bot for Cyprus,” explains Sorros. “We know that such applications can be quite useful. With this bot, for example, our smartphones could suggest for the correct Cypriot Greek autocomplete instead of having to write Greek or English but there are other applications that we can mention as well.”
From the start of the project, the team was aware that a Cypriot Greek bot didn’t exist. “We’re also very aware of the technology used for other languages, so we thought we could explore and see whether it would be a good idea to see if we can do something similar for Cypriot,” says Sorros.
From a technical and coding point of view, the challenge was therefore more or less straightforward. The difficulty lies in the language itself and its current standing on the internet.
“Being considered as a dialect, Cypriot Greek lacks standardisation, it doesn’t have a standardised grammar in the sense that somebody has written a grammar book for Cypriot Greek, there is no such thing,” says Armostis. “There are dictionaries and grammar books on Cypriot Greek but they’re not of a prescriptive nature; there are no norms like you would have in standardised languages…This means that we have a lack of, or anything that supports writing in Cypriot Greek.”
Facing this lack of language resources, the team is currently appealing to anyone who has text written in Cypriot Greek: authors, poets, scriptwriters and directors, to name a few. This text will then be used to enrich the data the bot learns from. “We do have printed dictionaries and some electronic online dictionaries but not fully developed ones; we don’t have anything more practical, anything that can make our life, especially in the digital world, easier in using Cypriot Greek in electronic communication,” says Armostis.
Considering that an equivalent bot for the Greek language uses approximately three billion words to work properly, gathering as much data as possible is crucial.
“For the Greek language bot, they were able to gather data because they used information from Greece’s parliament proceedings, the European parliament and the internet written in Greek, all of which you can easily find online,” explains Sorros. The search for an equivalent source in Cypriot Greek was unsuccessful, which prompted the team to first define the problem at hand and then gather the data themselves.
“We’re at the stage where we know that the hard part is finding the data but then I think it’s highly likely that we will be able to do something similar to what has been done with other languages, and acquire a resource of Cypriot Greek text that is of equivalent scale and which we could use to train our robot or model,” says Sorros.
“What’s becoming interesting with Cypriot Greek is that we’re moving towards a world where these models (bots) can do many things, but given the fact that there isn’t enough data in Cypriot Greek on the internet and in the data set that is being used for these models, it’s (Cypriot Greek) being left out of this so-called revolution. So we’re trying to fill a gap.”
How the bot will work
In a nutshell, the bot calculates statistics about the language. “For example it will learn how often a word appears after another, so when you start typing something in Cypriot Greek it can autocomplete that… these applications are usually modelled with the probability of the next word or after a phrase and this happens recursively,” explains Sorros. This is precisely why the corpus has to be vast and contain enough distinct words to produce reliable statistics and train the bot to understand Cypriot Greek and learn the logic behind it.
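To make the idea concrete, here is a minimal sketch of the kind of next-word statistics Sorros describes. It is an illustration only, not the team's actual model: the function names, the tiny English placeholder corpus and the simple word-pair counting are all assumptions made for this example, and a real system would be trained on vastly more Cypriot Greek text.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word is followed by each other word."""
    follow_counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            follow_counts[current_word][next_word] += 1
    return follow_counts


def next_word_probabilities(follow_counts, word):
    """Turn the raw counts for one word into next-word probabilities."""
    counts = follow_counts.get(word.lower(), Counter())
    total = sum(counts.values())
    if total == 0:
        return {}  # word never seen with a follower in the training text
    return {candidate: count / total for candidate, count in counts.items()}


def autocomplete(follow_counts, start, max_words=3):
    """Repeatedly append the most frequent next word, one step at a time."""
    words = [start.lower()]
    for _ in range(max_words):
        counts = follow_counts.get(words[-1])
        if not counts:
            break  # no statistics for this word, so stop suggesting
        words.append(counts.most_common(1)[0][0])
    return " ".join(words)


# Placeholder corpus (hypothetical, in English for readability).
toy_corpus = [
    "the bot learns the language",
    "the bot learns from text",
    "the bot needs more text",
]

model = train_bigrams(toy_corpus)
print(next_word_probabilities(model, "bot"))
print(autocomplete(model, "the", max_words=2))
```

With only three sentences the suggestions are crude, and any word that never appears in the corpus gets no suggestion at all, which is the practical reason a very large body of text is needed.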
“What the model is doing is similar to how children learn their mother language, because you never teach them the rules of the language, the grammar or the syntax or the formal rules of the language. So these models can help us understand how children learn language and the way we learn language,” says Taxitari.
Looking further ahead, the team also plans to add speech as an extension to the bot. “We’re starting with text but we want to extend it to speech also, so for a TV series, for example, it’s difficult to have auto transcription into Cypriot Greek. The speech aspect can help with this,” adds Sorros.
For Armostis, the stigma around Cypriot Greek is slowly lifting, although it has not disappeared completely. “There is a lot of change, especially in the last decade and that’s why we may observe more production in Cypriot Greek and therefore more need to support that production and that’s where we come in,” he says.
As for Sorros, his motivation in the project admittedly lies in the challenge and curiosity, but most importantly perhaps, this bot is a starting point for Cypriot Greek to be part of the AI space.