Playing with Speech Framework


At the latest WWDC (2016), Apple announced SiriKit, which enables developers to provide extensions to Siri with their apps’ functionality. We will talk about SiriKit in other posts. For now, we will focus on another brand new framework, one that was probably left in the shadow of SiriKit – the Speech framework. Although it only got one short (11-minute) prerecorded video at WWDC, the functionality it offers might be very interesting to developers. The Speech framework is actually the same voice recognition system that SiriKit uses.

What does the Speech framework offer? It recognizes both live and prerecorded speech, creates transcriptions and alternative interpretations of the recognized text, and provides confidence levels indicating how accurate the transcription is. Sounds similar to what Siri does, so what’s the difference between SiriKit and the Speech framework?

The Speech framework does only the speech recognition and transcription parts. It’s meant to be used inside your apps, to get the user’s input in a more efficient way than the standard one – typing on a keyboard. You have to be inside your app in order to start recording and recognizing speech, and you have to do this in a way that is transparent to the user. You will need to ask for permission to access speech recognition and the microphone, but also make it clear to the user when you are recording. When the phone is locked and/or the app is in the background, you can’t start the recording by giving a voice command.

With SiriKit, on the other hand, the extensions that your app provides to Siri are available even from a locked phone. When the user says ‘Hey Siri, book me a ride using YourApp’, Siri will recognize that your app needs to be asked to execute the request. Your app will get called (but not shown in the foreground) to handle the request, and Siri will present the results to the user. This comes with some limitations, though. First, your app is never started; it just serves as a helper to Siri. Second, you can only help Siri in certain predefined domains, like the mentioned ride booking, payments, messaging etc. If the business case of your app doesn’t fit in any of those domains, you can’t use SiriKit.

This is where the Speech framework might come in handy – you can simplify the user experience by providing voice input to your apps. For example, let’s say you have a todo list or grocery list. You want to be able to say something like ‘Add milk’, which will add the product to your grocery list. Later, when you’ve bought it and don’t need it anymore, you can say ‘Remove milk’, which will remove it from the list. Let’s see how we can implement this using the Speech framework.


The scope of this tutorial is adding/removing predefined items to a grocery list. There won’t be any fancy entity extraction with machine learning, for which you could use a REST service such as Google’s. We will just have a list of products that we expect to be added to the list. We will also define some removal words, like ‘delete’ and ‘remove’; if one of them appears before a product, that product will be deleted from the list. In any other case, we will just add items to the list.

As mentioned above, you can’t start recording in a way that is non-transparent to the user. For this, we will have a button with which the user can start the recording. The Speech framework might also access a web service in order to perform the speech recognition. This means that, in order for it to remain free for every app, Apple imposes some limits on the service. They don’t disclose how many requests you can make per day, but they warn you to be prepared to handle failures when this limit is hit. If you hit the limit too often, you should contact Apple to discuss it. A restriction is also enforced on the duration of the recording – Apple recommends no more than a minute. This means you can’t just record all the time and wait for the user to start speaking at some point – the service has to be turned on on demand. Apart from clicking the button to stop the recording, we will also allow the user to say something like ‘I’m done’, which will stop the recording. This is pretty straightforward to implement: since the recording is already in progress, we just need to check whether the transcribed text contains our defined stopping word or phrase.

After creating a new Xcode project (let’s call it SpeechPlayground), we will first add the needed permissions for accessing the feature in the Info.plist file. The permissions we need are ‘Privacy – Speech Recognition Usage Description’ and ‘Privacy – Microphone Usage Description’:

(Screenshot: Info.plist with the two privacy usage descriptions added)
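If you prefer editing the raw plist source, these two entries correspond to the following keys (the description strings below are just placeholders – put your own user-facing explanations there):

```xml
<key>NSSpeechRecognitionUsageDescription</key>
<string>We use speech recognition to manage your grocery list with voice commands.</string>
<key>NSMicrophoneUsageDescription</key>
<string>We record audio from the microphone to transcribe your voice commands.</string>
```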

Next, let’s define a few products that we will support in the grocery list. Create a new products.json file and put some products in there:

{"products": ["milk", "vegetables", "tomato", "fruit", "cucumber", "potato", "cheese", "orange"]}

These are the products that we will look for when we get a transcription from the Speech framework. Now let’s dive into some coding. The first thing we need to do is check whether the user has granted us the permissions we need to access the speech recognition feature. If that’s not the case, we will show an alert dialog:

(Screenshot: requesting speech recognition authorization and showing an alert on failure)
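The permission check could be structured roughly like this – a sketch, not the author’s exact code; `recordingButton` and `showAlert` are assumed helpers defined elsewhere in the view controller:

```swift
import Speech
import UIKit

// Inside the view controller.
func checkSpeechPermissions() {
    SFSpeechRecognizer.requestAuthorization { authStatus in
        // The handler may be called on a background queue,
        // so dispatch UI work back to the main queue.
        OperationQueue.main.addOperation { [unowned self] in
            switch authStatus {
            case .authorized:
                self.recordingButton.isEnabled = true
            default:
                // Denied, restricted or not yet determined.
                self.recordingButton.isEnabled = false
                self.showAlert(message: "Speech recognition is not available.")
            }
        }
    }
}
```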

Next, let’s define a class called SpeechHelper, which will be in charge of returning the key words that we need to recognize – the products, the stopping words and the removal words. We’ve extracted this into a separate class in order to isolate the loading of the words – currently they are hardcoded, but they could easily be returned from a (maybe language-dependent) web service without changing our main code:

(Screenshot: the SpeechHelper class loading the words)
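A minimal sketch of what SpeechHelper might look like (the exact method and property names here are assumptions based on the description above):

```swift
import Foundation

class SpeechHelper {
    // Words that trigger removal of the product spoken after them.
    let removalWords = ["remove", "delete"]
    // Words that stop the recording session.
    let stoppingWords = ["done"]

    // Parses a {"products": [...]} payload into a list of product names.
    func products(from data: Data) -> [String] {
        guard let object = try? JSONSerialization.jsonObject(with: data),
              let json = object as? [String: Any],
              let products = json["products"] as? [String] else {
            return []
        }
        return products
    }

    // Loads the supported products from products.json in the app bundle.
    func loadProducts() -> [String] {
        guard let url = Bundle.main.url(forResource: "products", withExtension: "json"),
              let data = try? Data(contentsOf: url) else {
            return []
        }
        return products(from: data)
    }
}
```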

Nothing special in the chunk of code above – it just loads the products from the products.json file we created earlier. Next, we will add some helper methods for the error messages that might appear because of permission errors or device limitations:

(Screenshot: helper methods for presenting error alerts)

Now, let’s start with the interesting part. The user interface for this app is pretty simple – there is a button which triggers the start/stop of the recording, a text view which shows what the Speech framework transcribed for us, and a table view which lists the products we need to buy from our grocery list.

(Screenshot: the storyboard with the recording button, text view and table view)

We will add an IBAction to the recording button, which will call the handleRecordingStateChange method. This method checks the state of the audio session and, based on that, either starts or stops the recording session.

(Screenshot: the handleRecordingStateChange method)
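Based on the description that follows, the overall shape of the method might be something like this – a sketch; `audioEngine`, `recognitionRequest`, `cancelCalled` and the button outlet are the author’s stored properties as described in the text:

```swift
// Inside the view controller.
@IBAction func handleRecordingStateChange() {
    if audioEngine.isRunning {
        // A recording is in progress – stop it and apply the results.
        audioEngine.stop()
        recognitionRequest?.endAudio()
        updateProducts()
        recordingButton.setTitle("Start recording", for: .normal)
    } else {
        // No recording yet – start a fresh session.
        cancelCalled = false
        startRecording()
        recordingButton.setTitle("Stop recording", for: .normal)
    }
}
```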

In order to understand this method and its two states, we will introduce a few new types of objects. What is the audioEngine object used for? It’s an instance of the AVAudioEngine class, which contains a group of nodes (AVAudioNodes). These nodes perform the job of audio signal creation, processing and I/O tasks. Without those nodes, the engine wouldn’t be able to do its job, but AVAudioNodes also do not currently provide useful functionality until attached to an engine. You can create your own AVAudioNodes and attach them to the engine using the attach(_ node: AVAudioNode) method.

Let’s introduce another new object, recognitionRequest, of the class SFSpeechAudioBufferRecognitionRequest. This class is used for requests that recognize live audio or in-memory content, which is what we need in this case. We want to show and update the transcript when the user says something. There’s also another class, which is called SFSpeechURLRecognitionRequest, and this one is used to perform recognition on a prerecorded, on-disk audio file.

The third object that we will introduce is recognitionTask (SFSpeechRecognitionTask). With this task, you can monitor the recognition process. The task can be starting, running, finishing, canceling or completed. This kind of object is what we get when we ask the speech recognizer to start listening to the user input and get back to us with what it heard. The speech recognizer is represented by the SFSpeechRecognizer class, and this is the class that does the actual speech recognition. It only supports one language; using the default initializer init?() returns a speech recognizer for the device’s current locale (if a recognizer is supported for that locale). If we want to be sure that an English transcription will be used, we can create the speech recognizer by explicitly stating its locale:

private let speechRecognizer: SFSpeechRecognizer! =
    SFSpeechRecognizer(locale: Locale(identifier: "en-US"))

That’s a lot of new classes – let’s see how to put them all together. With the isRunning property, we check whether there’s an audio session in progress at the moment. If there’s no such session, we do several things (we will ignore the cancelCalled flag for now and get back to it later). First, we check whether we already have a recognition task in progress. If there is one, we just cancel it and nil it out:

(Screenshot: canceling an in-progress recognition task)

Next, we start the audio session. We set the category of the session to AVAudioSessionCategoryRecord, which only records the session. If you want to also play it back later, you should use AVAudioSessionCategoryPlayAndRecord instead. Then we set the mode of the session to AVAudioSessionModeMeasurement. Apple recommends using this mode if your app is performing measurement of audio input or output, because it does minimal signal processing on the audio. We also create the recognition request. Note that we set shouldReportPartialResults to true, which means that the task will report progress all the time, not only when the recording finishes. This enables us to show (and update) the text view holding the transcript on each new spoken word.

(Screenshot: configuring the audio session and creating the recognition request)
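In Swift 3 era code, that setup might look roughly like this (a sketch with error handling left to the caller; `recognitionRequest` is assumed to be a stored property):

```swift
import AVFoundation
import Speech

// Inside the view controller.
func prepareAudioSession() throws {
    // Record-only category with minimal signal processing.
    let audioSession = AVAudioSession.sharedInstance()
    try audioSession.setCategory(AVAudioSessionCategoryRecord)
    try audioSession.setMode(AVAudioSessionModeMeasurement)
    try audioSession.setActive(true, with: .notifyOthersOnDeactivation)

    // Live-audio recognition request; ask for partial results so the
    // transcript updates while the user is still speaking.
    recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
    recognitionRequest?.shouldReportPartialResults = true
}
```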

Finally, we can now start recording. The startRecording method does just that:

(Screenshot: the startRecording method)
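A condensed sketch of what startRecording does, per the walkthrough below – `createProductsArraysForSession`, `showAudioError` and `recognizedText` are assumed helpers/outlets, and the keyword-extraction part of the result handler is omitted here:

```swift
// Inside the view controller.
func startRecording() {
    // Fresh arrays for the products added/removed in this session.
    createProductsArraysForSession()

    guard let inputNode = audioEngine.inputNode else {
        showAudioError()
        return
    }

    recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest!) { [unowned self] result, error in
        if let recognized = result?.bestTranscription.formattedString {
            // Show the transcript live while the user is speaking.
            self.recognizedText.text = recognized
        }
        if error != nil || result?.isFinal == true {
            // Recording finished – tear everything down.
            inputNode.removeTap(onBus: 0)
            self.recognitionRequest = nil
            self.recognitionTask = nil
            self.audioEngine.stop()
        }
    }

    // Feed microphone buffers into the recognition request.
    let recordingFormat = inputNode.outputFormat(forBus: 0)
    inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { buffer, _ in
        self.recognitionRequest?.append(buffer)
    }

    audioEngine.prepare()
    try? audioEngine.start()
}
```

(In newer SDKs, inputNode is no longer optional, so the guard is only needed on the iOS 10 era SDK this tutorial targets.)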

There are a lot of things going on here, but as you will see, it’s not that complicated. First, we check whether there’s an inputNode available for the engine and show an error if there isn’t. Then we start the recognition task for the speech recognizer, with the recognition request we created above. We will get back to the callback later; first, let’s see how to start the audio engine. We do this by installing an audio tap on the bus of the input node, with a buffer size of 1024 audio frames. Then we try to start the audio engine, first preallocating many of the resources the engine requires with the prepare method and then starting the engine with the start method.

In the callback, we initialize two arrays (in the createProductsArraysForSession method) that we need for keeping track of which products to add to and remove from the grocery list. When there’s a result in the resultHandler of the task, we get the best transcription by calling result?.bestTranscription.formattedString. If you want to show a popup with other transcriptions and let the user choose the one that fits best, you can call result?.transcriptions, which will give you an array of the possible transcriptions. Every SFTranscription contains two properties – formattedString and segments (SFTranscriptionSegment). Segments contain other information that you might find helpful, like confidence (on a scale of 0 to 1), which gives you an indication of how confident the Speech framework is that this string is the one the user spoke. This property is used when figuring out which transcription is the bestTranscription. You can also get the timestamp and the duration of the spoken segment, as well as some alternative interpretations of the segment.

We will use the segments array of the best transcription to iterate through all the spoken words. Our implementation for extracting the key words will be simple in this tutorial, but you can use more sophisticated solutions with machine learning and natural language processing that can extract entities from a given sentence and determine which action should be performed (like LUIS). So first, we check whether there’s a removal word (remove or delete) before a given word. We track this with the shouldDelete flag. If the flag is set, then we add the word to the deletedProducts array; otherwise it goes into the sessionProducts array (it’s a new product that the user has spoken). We also check whether the word is a stopping word (like ‘done’) and if it is, we just return from the method.
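Stripped of the Speech framework plumbing, the extraction loop described above boils down to plain string matching over the spoken words. Here is a self-contained sketch of that logic (the word lists and input sentence are illustrative):

```swift
import Foundation

let products = ["milk", "vegetables", "tomato", "cheese"]
let removalWords = ["remove", "delete"]
let stoppingWords = ["done"]

var sessionProducts = [String]()
var deletedProducts = [String]()
var shouldDelete = false
var stopped = false

// In the real app these would be the lowercased segment substrings
// of the best transcription.
let spokenWords = "add milk remove cheese done".components(separatedBy: " ")

for word in spokenWords {
    if stoppingWords.contains(word) {
        // Stopping word found – end the session.
        stopped = true
        break
    }
    if removalWords.contains(word) {
        // The next recognized product should be deleted, not added.
        shouldDelete = true
        continue
    }
    if products.contains(word) {
        if shouldDelete {
            deletedProducts.append(word)
        } else {
            sessionProducts.append(word)
        }
        shouldDelete = false
    }
}
// sessionProducts is now ["milk"], deletedProducts is ["cheese"].
```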

As mentioned previously, we would like to write and update the transcription as the user gives voice commands. That’s why we set the recognized text on our text view (self.recognizedText.text = recognized). Then we check whether the recording has finished. If it has, we remove the audio tap from the input node’s bus that we added at the beginning, nil out the request and the task, and stop the audio engine.

Now let’s go back to the cancelCalled flag. This flag is used to stop subsequent calls to the method that updates everything when the recording state changes (handleRecordingStateChange). Subsequent calls can happen because the result handler is invoked on every sound that’s recognized – which means it can be called even after the triggering stop word is found.

(Screenshot: the result handler using the cancelCalled flag)

We’ve covered everything we do when the session is not currently running and we need to start it. Now, let’s see the other state – when we have a recording that we need to stop and update the list based on the transcription.

(Screenshot: the stopping branch of handleRecordingStateChange)

One interesting method here is updateProducts, which first stores the currently displayed products in a temporary variable. Then it adds the ones that should be added in the current session. It then goes through all products and checks whether they are in the deleted products list. This means that our current logic deletes all occurrences of an item from the list if a removal word is found before it. That’s the simplest approach; you can implement it in a more sophisticated manner if you prefer.

(Screenshot: the updateProducts method)
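The merge logic just described can be sketched as pure array manipulation (variable names are assumed from the description above):

```swift
// Products currently shown in the table view.
var products = ["milk", "cheese"]
// Collected during the recognition session.
let sessionProducts = ["tomato"]
let deletedProducts = ["cheese"]

func updateProducts() {
    // Start from what is already displayed, append the new items,
    // then drop every occurrence of the deleted ones.
    var temp = products
    temp.append(contentsOf: sessionProducts)
    products = temp.filter { !deletedProducts.contains($0) }
}

updateProducts()
// products is now ["milk", "tomato"]
```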

Apart from updating the products, we also need to stop the audio engine, end the recognition request and update the state of the recording button.

The last pieces of code that we haven’t covered are the UITableViewDataSource methods. They are pretty trivial, so we will not go into much detail explaining them:

(Screenshot: the UITableViewDataSource methods)

We’ve gone a long way to get everything up and running, and here’s the end result:

The Speech framework is a very interesting framework, since it offers solid speech recognition functionality for your apps. It’s still not a perfect product and it comes with a few limitations, as discussed at the beginning. The source code for this simple project can be found on GitHub.
There are also some other tutorials you might find helpful, like this one from AppCoda.

