iTranslate Converse goes OFFLINE using Deep Neural Networks

iTranslate Converse, winner of an Apple Design Award in 2018, is still one of the leading translation apps on the App Store.

But the revolutionary speech translation app is not only turning phones into translation devices – it’s also turning heads again with its brand new Offline Mode! What makes this update special is that we integrated our own Deep Neural Network models for speech recognition and translation to make the app work completely offline, meaning there’s no need to be connected to the internet anymore. Use the app whenever and wherever you need it – on the plane, deep in the jungle of Brazil, or any other place where you’re unable to get a signal or Wi-Fi.

Continue reading to get all the details on the new Offline Mode and what part Neural Machine Translation played in creating it!

How to use the new Offline Mode in the App

We wanted to keep iTranslate Converse an easy-to-use translation app, especially when adding a significant new feature. To use iTranslate Converse offline, simply switch to “Offline Mode” on the main screen, and you’re all set. Use the app as usual: swipe up the screen and choose between 5 languages and dialects: English, Spanish, French, German, and Chinese (Mandarin). Now select the desired language pair, and with the whole screen still acting as the translation button, press and hold to speak. The app automatically detects which of the two selected languages is being spoken, for a fast and accurate translation.

But let’s dive a little deeper into the technology of Deep Neural Networks, because the high consumption of coffee and chocolate waffle cuts by our dev and machine learning team shouldn’t have been in vain! Applying Deep Neural Networks to human language is perhaps one of the most challenging tasks in artificial intelligence, given language’s natural ambiguity and flexibility.

A gentle introduction to Deep Neural Networks and their application to Machine Translation and Speech Recognition

Initially inspired by neurobiology, Deep Neural Networks (DNNs) are a set of algorithms designed to recognize patterns in example inputs, like images, sound (as in speech), time series, or text. A Neural Network typically consists of an input layer, a varying number of hidden layers, and an output layer. When the number of hidden layers increases (i.e., more than two), the network is known as a Deep Neural Network.
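To make the layer structure concrete, here is a minimal sketch in plain Python (not iTranslate’s actual code): an input vector passes through three hidden layers – “deep” by the definition above – and into an output layer. The weights are random, so the outputs are meaningless; the point is the layered forward pass.

```python
import math
import random

random.seed(0)

def layer(inputs, n_out):
    # A fully connected layer: each output neuron computes a weighted sum
    # of all inputs and squashes it with a sigmoid activation.
    outputs = []
    for _ in range(n_out):
        weights = [random.uniform(-1, 1) for _ in inputs]
        z = sum(w * x for w, x in zip(weights, inputs))
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    return outputs

# Input layer: a 4-dimensional feature vector.
x = [0.5, -0.2, 0.1, 0.9]

# Three hidden layers make this network "deep" in the sense above.
h1 = layer(x, 8)
h2 = layer(h1, 8)
h3 = layer(h2, 8)

# Output layer: e.g. scores for two classes.
out = layer(h3, 2)
print(len(out))  # 2
```

In a real network the weights would of course be learned from data rather than drawn at random.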

Credit: Xenonstack | Simple Neural Network and Deep Neural Network

Thanks to their powerful feature-learning and representation abilities, DNNs have made significant breakthroughs in speech recognition and image processing. Various kinds of DNNs are used for different types of input and output data, but all of them share the same goal: learning the syntactic and semantic representations of the data being processed.

Neural Machine Translation (NMT) uses these Deep Neural Networks to convert a sequence of words from a source language, like English, into a sequence of words in a target language, like French or Chinese. The strength of NMT lies in its ability to learn the mapping from source to target text directly, in an end-to-end fashion, no longer requiring the pipeline of specialized systems used in Statistical Machine Translation (SMT). Although SMT was the dominant translation standard for decades, it suffers from a narrow focus on the phrases being translated: it ignores important syntactic characteristics and hence loses the broader context of the target text.

Neural Machine Translation, however, attempts to build and train a single, extensive Neural Network that reads a sentence and outputs a correct translation. 
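The encoder–decoder structure behind this idea can be sketched in a few lines of toy Python. Everything below is a hypothetical stand-in for a real NMT model: the encoder folds the source sentence into a single state vector, and the decoder emits target words one at a time until it produces an end-of-sequence marker. The weights are random, so the “translation” is meaningless – the point is the end-to-end sequence-to-sequence shape.

```python
import random

random.seed(42)

SRC_VOCAB = ["the", "cat", "sleeps"]
TGT_VOCAB = ["</s>", "le", "chat", "dort"]
DIM = 6

# Random word embeddings; a trained model would learn these.
embed = {w: [random.gauss(0, 1) for _ in range(DIM)] for w in SRC_VOCAB}
out_w = {w: [random.gauss(0, 1) for _ in range(DIM)] for w in TGT_VOCAB}

def encode(tokens):
    # Recurrently mix each source token's embedding into a hidden state.
    state = [0.0] * DIM
    for tok in tokens:
        state = [0.5 * s + 0.5 * e for s, e in zip(state, embed[tok])]
    return state

def decode(state, max_len=10):
    # Greedy decoding: repeatedly pick the highest-scoring target word
    # until the end-of-sequence marker "</s>" appears.
    result = []
    for _ in range(max_len):
        scores = {w: sum(a * b for a, b in zip(state, v))
                  for w, v in out_w.items()}
        word = max(scores, key=scores.get)
        if word == "</s>":
            break
        result.append(word)
        # Stand-in for the decoder's recurrent state update.
        state = [s - v for s, v in zip(state, out_w[word])]
    return result

print(decode(encode(["the", "cat", "sleeps"])))
```

A production system replaces the toy update rules with trained recurrent or attention-based layers, but the read-whole-sentence-then-generate flow is the same.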

Source: Sutskever, Vinyals, V. Le, 2014 (https://bit.ly/2HyBW3y)

Similarly, automatic speech recognition (ASR) (i.e., the mapping from audio to spoken text) can be solved by a Deep Neural Network that maps the sequential audio input to a sequence of words. As in Neural Machine Translation, DNNs can learn this mapping directly in an end-to-end manner, thereby replacing more complex systems that have been used for speech recognition in the past.
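One common trick in such end-to-end speech recognizers is CTC-style greedy decoding: the network outputs one character distribution per audio frame, and collapsing repeated characters and removing a special “blank” symbol yields the transcript. The sketch below illustrates only that collapsing step; the per-frame labels are made up, standing in for a real acoustic model’s output.

```python
BLANK = "_"  # the CTC blank symbol, emitted for frames between characters

def ctc_greedy_decode(frame_labels):
    # Collapse consecutive repeats, then drop blanks:
    # ["h","h","_","e"] -> "he"
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

# Per-frame argmax labels for a short (invented) utterance:
frames = ["h", "h", "_", "e", "l", "l", "_", "l", "o", "o"]
print(ctc_greedy_decode(frames))  # hello
```

Note how the blank between the two “l” runs is what allows the double letter in “hello” to survive the collapsing step.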

How we used Deep Neural Networks for the Offline Mode

We’ve been using our in-house Deep Neural Networks in all iTranslate apps for quite a while already – but only online. Integrating those networks for offline usage is a whole different challenge.

Our team at iTranslate has been working hard on its very own on-device speech recognition and translation models to increase their accuracy and to make them more robust to variations in speaking rate, pause duration, and voice input quality. The Offline Mode eliminates the limitations of server-based processing and thereby increases our users’ privacy.

The functionality of the app is composed of three main subtasks: converting audio input into text, translating this text into another language, and converting the translated text back into audio output. We have implemented the first two subtasks with our Deep Neural Network models; the third is handled by Apple’s on-device text-to-speech algorithm. In addition to converting the spoken audio to text, we detect the language the speaker is using with a probabilistic acoustic model. This is necessary because it determines which models are used in the rest of the pipeline. So far, we have implemented the models for recognizing and translating between 5 languages and dialects (English, Spanish, French, German, and Chinese (Mandarin)) – with more languages already in the pipeline.
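The three subtasks plus language detection can be summarized as a simple pipeline. The sketch below is purely illustrative: every function name is a hypothetical stand-in, not iTranslate’s actual API, and each stage returns canned values where the real app would run an on-device model.

```python
def detect_language(audio, pair):
    # A probabilistic acoustic model would score the audio under each of
    # the two selected languages; here we simply pretend the first won.
    return pair[0]

def speech_to_text(audio, lang):
    return "hello"  # stand-in for the on-device speech recognition model

def translate(text, src, tgt):
    return {"hello": "hallo"}.get(text, text)  # stand-in for the NMT model

def text_to_speech(text, lang):
    return f"<audio:{text}>"  # stand-in for Apple's on-device TTS

def converse(audio, pair=("en", "de")):
    # 1. Detect which of the two selected languages is being spoken.
    src = detect_language(audio, pair)
    tgt = pair[1] if src == pair[0] else pair[0]
    # 2. Speech -> text, 3. translate, 4. text -> speech.
    text = speech_to_text(audio, src)
    translated = translate(text, src, tgt)
    return text_to_speech(translated, tgt)

print(converse(b"..."))  # <audio:hallo>
```

The language-detection step comes first because its result decides which recognition and translation models the rest of the pipeline loads.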