Graphical user interfaces exist to enable communication between humans and computers. The first user interface was the command line (or the terminal), where users have to type explicit commands that the computer can understand. It's not that suitable for people who aren't tech-savvy, but a lot of computer programmers still use it today for certain tasks. The introduction of the desktop user interface brought computers to the masses. It still required a learning curve, but a much gentler one than the terminal. Next, the mobile phone revolution brought the multi-touch interface, where the finger became the primary point of interaction with the device. A more intuitive solution, but users still have to be taught first. Also, there are different operating systems (OS) on the market, each with its own specific features – there isn't a unified user interface that works the same on all devices.
But what if we don't even need a graphical user interface? We have one natural way to express our thoughts, our wishes and our feelings – our voice. Why don't we give commands to machines the same way we give commands to other people? This idea is not new; however, it needs some time before it really takes off. The reason for this is that understanding language is one of the most challenging tasks in computer science.
Why is language complex?
Language itself has a broad range of uses – we use it both to explain complex computer science problems and to express feelings. There are a lot of challenges that appear even in communication between people.
One is ambiguity – people will say something and we might interpret it in a totally different way. Then we have punctuation – even something as small as a comma can completely change the meaning of a sentence. We also have figurative speech – metaphors, oxymorons, similes etc. – that might point to a completely different thing than what we expect. In order to properly understand what a person says, we need to have some context – what that person's beliefs are, what was previously said in the conversation. Then we have uncertainty – we might encounter words which are not known to us and have to guess their meaning. We also have implications – meanings that are inferred or suggested, but not directly said in a sentence.
Basic NLP concepts
These challenges are sometimes difficult for humans as well, so you can imagine how difficult they are for computers. The sub-field of Artificial Intelligence that deals with understanding language is called Natural Language Processing (NLP). Let's familiarise ourselves with a few NLP concepts.
Now, I’m writing a blog post about understanding language on iOS.
When we analyse this sentence, in order to figure out what it is about, we need several things. First, we need to know the goal of the sentence – what the user is trying to do. This is called an intent. Next, we need to know the parameters of that action, which are called entities. In our case, we have a few different types of entities. The first one is "understanding language", which is an entity that gives information about the subject of the blog post. We also have "now", which is a time-based entity that tells us when this action is happening. We also have "iOS", which might be interpreted as a "technology" entity – it tells us which technology we are writing the post about. Then if we say something like:
I will share some insights about it.
It's clear that I'm referring to the "understanding language" part. We as humans can easily deduce that, but for computers it's not that obvious. Remembering the previous state of the conversation is called context.
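These concepts can be modeled quite simply. Here's a minimal sketch (the type and property names are my own, not from any particular NLP SDK) of how a parsed phrase and its context might be represented:

```swift
// A minimal, illustrative model of NLP results – not tied to any real SDK.
struct Entity {
    let type: String   // e.g. "subject", "time", "technology"
    let value: String
}

struct Intent {
    let name: String       // the goal of the sentence
    let entities: [Entity] // the parameters of the action
}

// "Now, I'm writing a blog post about understanding language on iOS."
let parsed = Intent(
    name: "WriteBlogPost",
    entities: [
        Entity(type: "subject", value: "understanding language"),
        Entity(type: "time", value: "now"),
        Entity(type: "technology", value: "iOS")
    ]
)

// Context: remembering previous intents, so a follow-up like
// "I will share some insights about it" can resolve what "it" refers to.
var conversationContext: [Intent] = [parsed]
```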
Now that we know the basics of NLP, let's see how we can integrate a voice interface into our application. In order to do that, first we need to have an agent, trained with domain knowledge specific to the business rules of our product or service. Then, we need to expose that agent to the mobile application, either as a REST service or as a mobile framework integrated in our project. Afterwards, we need a mechanism that converts the user's spoken phrase into machine-readable text. The NLP step comes next – we need to extract the intent of the phrase, along with its parameters. With this, we have everything we need in order to execute the user's request.
There are several technologies on the market that give you the possibility to integrate a conversational interface into your app. The next image shows these products split into 4 different sections, based on ease of integration and flexibility. As you might imagine, Apple's technologies fare pretty low on the flexibility and customization part. SiriKit is the easiest to integrate, while Core ML is a bit harder, since it requires machine learning expertise. The sweet spot is the easy and flexible section, where we have a few NLU platforms, like Dialogflow, Wit.ai, LUIS, Watson and Amazon Lex. In the last section, we can always do everything by ourselves, which is both hard and expensive (we will need a lot of infrastructure, storage and sophisticated algorithms), but we have the flexibility to do it our way.
SiriKit was announced at last year's WWDC. It enables third-party apps to provide functionality to Siri, without the app being in the foreground. Everything happens in Siri's context. Here, Apple does all the heavy lifting – the speech recognition and the natural language understanding. On the developer side, you just need to implement delegate methods, which give you the intent and the values of its parameters. However, this comes with a trade-off: you don't have much flexibility if you want to customise the flow or the type of intent. You can only use SiriKit in certain pre-defined domains, like booking a ride, payments, creating lists and notes, accessing photos and a few more. More details on SiriKit here.
Implementing SiriKit involves three steps. The first step is Resolve, where you have to determine whether you can handle the parameters provided by the intent (Siri). If something is missing, you can tell Siri to ask the user for that additional information. For example, if the user is booking a ride but hasn't specified their starting location, the resolve step is where that information is obtained from the user. The second step is Confirm, where you confirm to Siri that you can handle the intent, by providing a response which contains the details displayed in the Siri popup. The user also has to confirm that they agree with the option (for example, a ride) provided by your app. The last step, Handle, is where you actually perform the action you've confirmed in the previous step. Here's an example with the ride-booking functionality in Siri.
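The three steps can be sketched as follows for the ride-booking domain. The shape follows SiriKit's `INRequestRideIntentHandling` protocol; the actual booking logic is omitted, and the response details are simplified:

```swift
import Intents

// A sketch of the resolve/confirm/handle flow for ride booking.
class RideIntentHandler: NSObject, INRequestRideIntentHandling {

    // Resolve: make sure we have a pickup location, or ask Siri to get one.
    func resolvePickupLocation(for intent: INRequestRideIntent,
                               with completion: @escaping (INPlacemarkResolutionResult) -> Void) {
        if let pickup = intent.pickupLocation {
            completion(.success(with: pickup))
        } else {
            completion(.needsValue()) // Siri prompts the user for it
        }
    }

    // Confirm: tell Siri we are able to book this ride.
    func confirm(intent: INRequestRideIntent,
                 completion: @escaping (INRequestRideIntentResponse) -> Void) {
        completion(INRequestRideIntentResponse(code: .ready, userActivity: nil))
    }

    // Handle: actually perform the booking.
    func handle(intent: INRequestRideIntent,
                completion: @escaping (INRequestRideIntentResponse) -> Void) {
        let response = INRequestRideIntentResponse(code: .success, userActivity: nil)
        // In a real app, response.rideStatus would be filled in with ride details.
        completion(response)
    }
}
```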
Dialogflow (formerly known as api.ai) is a natural language understanding platform owned by Google. It provides a web application, where you can create agents and train them with domain-specific knowledge. You can also use agents already trained by the service. For every agent, you define the intents it supports, as well as the entities of those intents.
In the example, we are creating an AddProduct intent, with several different grocery list products as entities. We train the agent by providing as many sentences as possible and marking the entities. The yellow ones indicate that a product should be added to a grocery list, while the orange ones indicate that it should be removed. Over time, the agent learns how to handle even previously unseen sentences. You can access the service via a REST interface.
Here’s an overview of the process. After the spoken phrase is provided by the user, it is translated to plain text using Apple’s Speech framework. Then, the text is sent to Dialogflow, which processes the text and extracts the intent and the entities associated with it, returning a JSON response. We save this information and present it to the user in our table view. We have used this flow to implement our ShoppieTalkie app (iOS / Android). You can see a demo integration with Masterpass here.
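A rough sketch of this flow follows. The speech recognition part uses Apple's Speech framework (iOS 10+); the Dialogflow request assumes the legacy api.ai v1 query endpoint, and the access token and session id are placeholders:

```swift
import Speech

// Step 1: speech-to-text with Apple's Speech framework.
// Requires the NSSpeechRecognitionUsageDescription key in Info.plist.
func recognizeSpeech(from url: URL, completion: @escaping (String) -> Void) {
    let recognizer = SFSpeechRecognizer()
    let request = SFSpeechURLRecognitionRequest(url: url)
    recognizer?.recognitionTask(with: request) { result, error in
        guard let result = result, result.isFinal else { return }
        completion(result.bestTranscription.formattedString)
    }
}

// Step 2: send the recognized text to Dialogflow. The endpoint and payload
// follow the legacy api.ai v1 query API; "YOUR_CLIENT_ACCESS_TOKEN" is a placeholder.
func queryDialogflow(text: String, completion: @escaping ([String: Any]) -> Void) {
    var request = URLRequest(url: URL(string: "https://api.api.ai/v1/query?v=20150910")!)
    request.httpMethod = "POST"
    request.setValue("Bearer YOUR_CLIENT_ACCESS_TOKEN", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    let body = ["query": text, "lang": "en", "sessionId": "demo-session"]
    request.httpBody = try? JSONSerialization.data(withJSONObject: body)

    URLSession.shared.dataTask(with: request) { data, _, _ in
        guard let data = data,
              let json = (try? JSONSerialization.jsonObject(with: data)) as? [String: Any],
              let result = json["result"] as? [String: Any] else { return }
        // result["metadata"]["intentName"] holds the intent,
        // result["parameters"] holds the extracted entities.
        completion(result)
    }.resume()
}
```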
A service similar to Dialogflow is Wit.ai. It also has a web application, through which you can train your agents by marking the entities. It has an iOS SDK which has the Speech framework integrated, so you don't have to do the speech recognition part yourself.
You can also use Core ML for speech analysis, for example by doing sentiment analysis. Sentiment analysis is the process of computationally identifying and categorizing opinions.
To do this, we need two steps. The first one is all about machine learning. Choosing the right dataset, implementing the right machine learning algorithm, and fine-tuning the parameters requires a lot of expertise in this area. The second step, the iOS integration, is where developers step in. They can easily integrate the Core ML model and focus on what they do best – creating apps, utilizing the established mobile technologies and concepts.
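The integration step really is small. Assuming a trained sentiment model has been converted and added to the project as `SentimentClassifier.mlmodel` (a hypothetical model name – Xcode generates a Swift class automatically for any model you drop in, and the input/output feature names depend on how the model was trained), using it looks roughly like this:

```swift
import CoreML

// "SentimentClassifier" is the hypothetical Xcode-generated model class;
// "text" and "label" are assumed input/output feature names.
func sentiment(for text: String) -> String? {
    let model = SentimentClassifier()
    guard let output = try? model.prediction(text: text) else {
        return nil
    }
    return output.label // e.g. "positive" or "negative"
}
```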
The main role of Core ML currently is to bridge the gap between the academia (that does the process of researching, designing algorithms and training datasets) and the developers (that don’t have much machine learning expertise, but know how to bring production-ready apps to the real world). More details on Core ML here.
Do it yourself
If none of these technologies look interesting to you, you can always go ahead and implement everything by yourself. The NSLinguisticTagger class from Apple's Foundation framework is a good starting point. It provides methods for extracting words, phrases, sentences, person names and everything else you need for basic natural language processing. You can, for example, implement the TF-IDF algorithm for extracting keywords from documents. TF-IDF is a popular algorithm for finding the most relevant words in a text. It consists of two parts, TF and IDF. The first part, term frequency (TF), measures how many times a term occurs in a document. Since there are words like 'the' which appear very often but are not important to the meaning of a text, the inverse document frequency is introduced. With inverse document frequency (IDF), we count in how many of the other documents a word appears. If it appears in a lot of them (like 'the'), the weight of the word is diminished. You can find more details about this project here.
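A rough sketch of the two pieces, using the iOS 11 NSLinguisticTagger API for tokenization and a straightforward TF-IDF computation (smoothing details vary between implementations; this uses the common log(N / df) form):

```swift
import Foundation

// Tokenize a document into lowercase words with NSLinguisticTagger (iOS 11+).
func words(in text: String) -> [String] {
    let tagger = NSLinguisticTagger(tagSchemes: [.tokenType], options: 0)
    tagger.string = text
    var tokens: [String] = []
    let range = NSRange(location: 0, length: text.utf16.count)
    tagger.enumerateTags(in: range, unit: .word, scheme: .tokenType,
                         options: [.omitWhitespace, .omitPunctuation]) { _, tokenRange, _ in
        if let swiftRange = Range(tokenRange, in: text) {
            tokens.append(text[swiftRange].lowercased())
        }
    }
    return tokens
}

// TF-IDF score for every word of one document, measured against a corpus.
func tfidf(document: String, corpus: [String]) -> [String: Double] {
    let docWords = words(in: document)
    let corpusWordSets = corpus.map { Set(words(in: $0)) }
    var scores: [String: Double] = [:]
    for word in Set(docWords) {
        // TF: how often the term occurs in this document.
        let tf = Double(docWords.filter { $0 == word }.count) / Double(docWords.count)
        // IDF: in how many corpus documents the term appears.
        let df = corpusWordSets.filter { $0.contains(word) }.count
        let idf = log(Double(corpus.count) / Double(max(df, 1)))
        scores[word] = tf * idf
    }
    return scores
}
```

Common words like 'the' appear in nearly every document, so their IDF (and thus their final score) approaches zero, while domain-specific keywords rise to the top.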
The space of natural language understanding and speech analysis is really exciting. Things are changing pretty fast and we can expect a lot of improvements and innovation in this area.