We’ve seen in the previous post how an iOS device can understand and transcribe the voice commands we give it (speech to text). In this post, we will see the opposite – how the device can communicate information our app holds as a string, with speech. We will extend the grocery list app from the previous post (make sure to check that one out first) by adding a feature that tells the user which remaining products they need to buy from the list. We will also provide a way to customize the voice that does the speaking, through a settings page.
In order to accomplish this, we will need a different class (AVSpeechSynthesizer) from a different framework (AVFoundation). As the Apple docs tell us, this class produces synthesized speech from text on an iOS device, and provides methods for controlling or monitoring the progress of ongoing speech – which is exactly what we need, so let’s get started!
Let’s first create an object of the class we need. There’s also a new variable for the audioSession, since we will use it in several methods and don’t want to call the sharedInstance class method every time:
```swift
private var speechSynthesizer = AVSpeechSynthesizer()
private var audioSession = AVAudioSession.sharedInstance()
```
In order for the class to speak your text, you need to provide it an object of type AVSpeechUtterance. An AVSpeechUtterance is the basic unit of speech synthesis. It holds the text to be spoken, along with parameters that customize the voice, pitch, rate, and delay of the spoken text. The speech synthesizer keeps a queue (a FIFO data structure) of utterances to be spoken: it checks whether it is currently speaking, and if it is, it simply adds the next utterances to the queue. It also provides methods to pause, resume and stop the speech, which might be useful if you want to develop an audio books app.
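As a minimal sketch of the API described above (the parameter values here are illustrative, not the ones the app will end up using), creating and enqueuing an utterance looks like this:

```swift
import AVFoundation

let utterance = AVSpeechUtterance(string: "Milk, eggs, bread")
utterance.rate = AVSpeechUtteranceDefaultSpeechRate // speaking rate, 0.0–1.0
utterance.pitchMultiplier = 1.0                     // allowed range 0.5–2.0
utterance.volume = 1.0                              // 0.0–1.0
utterance.preUtteranceDelay = 0.0                   // seconds of silence before speaking
utterance.voice = AVSpeechSynthesisVoice(language: "en-US")

let synthesizer = AVSpeechSynthesizer()
// If the synthesizer is already speaking, this utterance is queued
// and spoken after the current one finishes.
synthesizer.speak(utterance)
```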
We will persist the user preferences about the speech parameters between app launches and for this we will create a new class – SettingsManager. Here’s a part of the class, it just provides methods for saving and getting the values for the parameters:
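A plausible sketch of that class, backed by UserDefaults (the key names, default values, and the decision to use static methods are assumptions; the original class may differ), could look like this:

```swift
import Foundation

class SettingsManager {
    // Hypothetical key names for the persisted values.
    private static let rateKey = "speechRate"
    private static let languageKey = "speechLanguage"

    static func setRate(_ rate: Float) {
        UserDefaults.standard.set(rate, forKey: rateKey)
    }

    static func rate() -> Float {
        // Fall back to a sensible default if nothing was saved yet.
        guard UserDefaults.standard.object(forKey: rateKey) != nil else { return 0.5 }
        return UserDefaults.standard.float(forKey: rateKey)
    }

    static func setLanguage(_ language: String) {
        UserDefaults.standard.set(language, forKey: languageKey)
    }

    static func language() -> String {
        return UserDefaults.standard.string(forKey: languageKey) ?? "en-US"
    }

    // Analogous setters and getters would exist for pitch, volume and delay.
}
```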
Now, let’s extend our storyboard with new screens and some design updates:
We’ve changed the root view controller to be a UINavigationController, in order to be able to push view controllers from the initial grocery list screen. On the grocery list screen, there are two new buttons – “Settings” (which will open the Settings screen) and “Tell me the remaining products”, which will invoke the text to speech feature. There’s also another screen opened from the Settings screen, LanguageViewController, which will show a list of the available languages that can be used in the speech utterance.
The SettingsViewController is pretty simple: it has 4 sliders for the parameters needed to customize the voice, and a button that will show the language selection. The sliders’ minimum and maximum values are set in Interface Builder (less code -> easier maintenance), based on Apple’s documentation for the allowed values. The volume and the rate take float values from 0 to 1, the pitch multiplier from 0.5 to 2, and for the delay we’ve set a limit of 5 seconds. The values in the sliders are read from the SettingsManager we saw above:
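A sketch of how the controller might read the persisted values into the sliders (the outlet names and the SettingsManager accessors are illustrative assumptions):

```swift
import UIKit

class SettingsViewController: UIViewController {
    // Hypothetical outlet names; the real storyboard may use different ones.
    @IBOutlet weak var volumeSlider: UISlider!
    @IBOutlet weak var rateSlider: UISlider!
    @IBOutlet weak var pitchSlider: UISlider!
    @IBOutlet weak var delaySlider: UISlider!

    override func viewDidLoad() {
        super.viewDidLoad()
        // Min/max values are configured in Interface Builder; only the
        // current values come from the persisted settings.
        volumeSlider.value = SettingsManager.volume()
        rateSlider.value = SettingsManager.rate()
        pitchSlider.value = SettingsManager.pitch()
        delaySlider.value = SettingsManager.delay()
    }
}
```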
When the save button is clicked, we will just store the changed values of the sliders in the SettingsManager and pop the Settings screen:
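Sketched out, the save action could look like this (action and accessor names are assumptions):

```swift
// Persist the slider values and pop back to the grocery list screen.
@IBAction func saveButtonClicked(_ sender: UIButton) {
    SettingsManager.setVolume(volumeSlider.value)
    SettingsManager.setRate(rateSlider.value)
    SettingsManager.setPitch(pitchSlider.value)
    SettingsManager.setDelay(delaySlider.value)
    navigationController?.popViewController(animated: true)
}
```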
The LanguageViewController also doesn’t do much: it gets the available voice languages from the AVSpeechSynthesisVoice class and displays them in a table view. When a row is selected, the checkmark is moved to that row and the table view is reloaded. A better approach would be to keep the previously selected index path and update only the changed cells, but we will stick with the simple implementation for now:
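A sketch of that controller, assuming a UITableViewController subclass, a prototype cell with the reuse identifier "LanguageCell", and the language accessors on SettingsManager (all of these names are assumptions):

```swift
import UIKit
import AVFoundation

class LanguageViewController: UITableViewController {
    // Languages of all voices the device ships with, e.g. "en-US", "de-DE".
    private let languages = AVSpeechSynthesisVoice.speechVoices().map { $0.language }

    override func tableView(_ tableView: UITableView,
                            numberOfRowsInSection section: Int) -> Int {
        return languages.count
    }

    override func tableView(_ tableView: UITableView,
                            cellForRowAt indexPath: IndexPath) -> UITableViewCell {
        let cell = tableView.dequeueReusableCell(withIdentifier: "LanguageCell",
                                                 for: indexPath)
        let language = languages[indexPath.row]
        cell.textLabel?.text = language
        // Checkmark on the currently selected language only.
        cell.accessoryType = language == SettingsManager.language() ? .checkmark : .none
        return cell
    }

    override func tableView(_ tableView: UITableView,
                            didSelectRowAt indexPath: IndexPath) {
        SettingsManager.setLanguage(languages[indexPath.row])
        // Simple approach from the post: reload everything instead of
        // updating just the old and new selected cells.
        tableView.reloadData()
    }
}
```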
Let’s go back to the grocery list screen. When the user taps the button for the remaining products, the playRemainingText method is called:
In this method, we first check whether the speech synthesizer is currently speaking. If it is, we just let it continue doing its thing. Otherwise, we call the speak method with a newly created utterance. The createUtterance method builds that utterance by reading the parameters the user set on the settings screen. The text that will be spoken is created in createRemainingText, which goes through the products list and adds the items to the text. We use commas to separate the words, since the synthesizer takes this into consideration and pronounces them in a more natural way. Without a comma, it would just rush through the items.
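Put together, those three methods might look like this (the `products` array and the SettingsManager accessors are assumptions carried over from the earlier sketches):

```swift
import AVFoundation

@IBAction func playRemainingText(_ sender: UIButton) {
    // If the synthesizer is already speaking, let it finish.
    guard !speechSynthesizer.isSpeaking else { return }
    speechSynthesizer.speak(createUtterance())
}

private func createUtterance() -> AVSpeechUtterance {
    let utterance = AVSpeechUtterance(string: createRemainingText())
    // Apply the parameters the user chose on the settings screen.
    utterance.rate = SettingsManager.rate()
    utterance.pitchMultiplier = SettingsManager.pitch()
    utterance.volume = SettingsManager.volume()
    utterance.preUtteranceDelay = TimeInterval(SettingsManager.delay())
    utterance.voice = AVSpeechSynthesisVoice(language: SettingsManager.language())
    return utterance
}

private func createRemainingText() -> String {
    guard !products.isEmpty else {
        return "You don't have any remaining products on your grocery list."
    }
    // Commas make the synthesizer pause between items instead of
    // rushing through them.
    return "You need to buy " + products.joined(separator: ", ") + "."
}
```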
If you run the app now and press the remaining button first, you will hear a voice saying “You don’t have any remaining products on your grocery list.”. That’s great – it’s what we expect. Now let’s add some products with the recording and speech recognition we implemented in the previous tutorial. When you tap the remaining button again, nothing happens. What’s the problem, and why did it stop working? After the recording, the audio session is no longer able to play sound.
And that’s exactly what happens. When we discussed the startAudioSession method in the previous post, we said that the category of the audio session would be AVAudioSessionCategoryRecord. That was good enough for us then, but our new feature also requires playing sound. That’s why we will change the category to AVAudioSessionCategoryPlayAndRecord:
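A sketch of the updated session setup, assuming the `audioSession` property from the beginning of the post and the mode/activation options from the previous tutorial:

```swift
import AVFoundation

private func startAudioSession() throws {
    // Recording alone was enough for speech recognition, but playing the
    // synthesized speech as well requires the play-and-record category.
    try audioSession.setCategory(AVAudioSessionCategoryPlayAndRecord)
    try audioSession.setActive(true, with: .notifyOthersOnDeactivation)
}
```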
Another improvement we can make to our grocery list is to get rid of the stopping words. We don’t want to always say “I’m done” – the recorder should be smart enough to stop recording when there’s no speech for some time. Since the Speech framework currently doesn’t provide this functionality, we can implement it ourselves. We can use a timer, which is re-created on every call to the result handler of the speech recognition task. If a timer manages to live for two seconds (long enough for its scheduled method to be invoked), the recording will be stopped:
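This timer trick can be sketched as follows; the snippet assumes the `audioEngine` and `recognitionRequest` properties from the previous post live on the same view controller, and the method names are illustrative:

```swift
import UIKit
import Speech

private var stopTimer: Timer?

// Called from the recognition task's result handler every time a new
// partial result arrives: throw away the old timer and start a fresh one.
private func restartStopTimer() {
    stopTimer?.invalidate()
    stopTimer = Timer.scheduledTimer(timeInterval: 2.0,
                                     target: self,
                                     selector: #selector(stopRecording),
                                     userInfo: nil,
                                     repeats: false)
}

// Fires only if two seconds pass with no new speech: end the session,
// just as if the user had said the stopping words.
@objc private func stopRecording() {
    audioEngine.stop()
    recognitionRequest?.endAudio()
    stopTimer?.invalidate()
}
```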
That’s the last detail we needed to handle. You can now go ahead, add some products and play with the voice parameters (hint: the pitch parameter might be fun). We now have two-way speech communication – from speech to text and vice versa. Considering that conversational interfaces are starting to have a huge impact on how we design user experiences, knowledge of the Speech framework and AVFoundation (specifically the AVSpeechSynthesizer class) can be of huge benefit. If you want to see the complete code of this app, please check this GitHub repo.