Text recognition on iOS 13 with Vision, SwiftUI and Combine

Introduction

This year, WWDC was full of new and exciting features. The biggest one was probably SwiftUI, Apple’s new UI framework. However, there were a lot of other cool announcements, especially in the machine learning and augmented reality areas. Text recognition is now supported directly in the Vision framework. Previously, you had to create your own machine learning model and combine it with Vision’s text detection feature. To see how things worked before iOS 13, please check my post Text recognition using Vision and Core ML.

In this post, we will build a brand new text recognizer using this new Vision feature. We will do it in SwiftUI, which (as you will see) is a lot of fun to use.

Implementation

To follow along, you will need Xcode 11 and iOS 13 (beta at the time of writing). To get started, create a new project called TextRecognizerSwiftUI. Make sure to have “Use SwiftUI” selected.

[Screenshot: Xcode’s new project dialog with the “Use SwiftUI” checkbox selected]

This will create a project set up to use SwiftUI. Have a look at the generated files: apart from the AppDelegate, we now also have a SceneDelegate, which is what creates the window. To do this, it creates a UIHostingController with a root view called ContentView. The hosting controller is a view controller capable of hosting the new SwiftUI views.

func scene(_ scene: UIScene, willConnectTo session: UISceneSession, options connectionOptions: UIScene.ConnectionOptions) {
     // Use a UIHostingController as window root view controller
     let window = UIWindow(frame: UIScreen.main.bounds)
     window.rootViewController = UIHostingController(rootView: ContentView())
     self.window = window
     window.makeKeyAndVisible()
}

Next, let’s build the content view. Our app will have one text view, which will display the detected text, and a button, which will let us scan a document from which we will extract the text. Translated into code, it looks like this:

struct ContentView : View {

    @ObjectBinding var recognizedText: RecognizedText = RecognizedText()

    var body: some View {
        NavigationView {
            VStack {
                Text(recognizedText.value)
                    .lineLimit(nil)
                Spacer()
                NavigationButton(destination: ScanningView(recognizedText: $recognizedText.value)) {
                    Text("Scan document")
                }
            }
        }
    }
}

There are already many great resources on SwiftUI, so I will not go into much detail about this structure. In a nutshell, each SwiftUI view has one body, in which we declaratively describe the components that make up that view. At the root is a navigation view (since we will show another screen). Inside it is a vertical stack, which consists of the text view for the recognized text, a spacer and a navigation button. When the button is tapped, the scanning view is presented.

The most interesting part here is the @ObjectBinding property wrapper for the recognized text. @ObjectBinding marks the source of truth for this view: it holds the view’s state, and whenever that state changes, the (immutable) view hierarchy is re-rendered in a smart and efficient way. @ObjectBinding is used when we need external state, i.e. a value that can be modified by other views as well. For simpler cases, when the scope of change is limited to a single view, you can use the @State property wrapper instead.
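To make the distinction concrete, here is a small, hypothetical example (not part of our app) where @State is enough, because only this view reads and mutates the value:

struct CounterView: View {

    // Local state, owned exclusively by this view.
    @State private var count = 0

    var body: some View {
        Button(action: { self.count += 1 }) {
            Text("Tapped \(count) times")
        }
    }
}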

In order to use the @ObjectBinding property wrapper, we need a class that conforms to the BindableObject protocol. This protocol has only one requirement: providing a didChange publisher. Here’s the implementation of our RecognizedText class.

import Combine
import SwiftUI

final class RecognizedText: BindableObject {

    let didChange = PassthroughSubject<RecognizedText, Never>()

    var value: String = "Scan document to see its contents" {
        didSet {
            didChange.send(self)
        }
    }

}

To satisfy this requirement, we use the PassthroughSubject type from Apple’s new Combine framework. A passthrough subject simply broadcasts the values it is sent to its subscribers. In our case, whenever the recognized text changes, we send the change through the subject.

By using @ObjectBinding, we are in charge of notifying SwiftUI when the data changes, and this publisher is the mechanism for doing it.
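SwiftUI subscribes to this publisher for us, but just to illustrate what the subject does, here is a small sketch (not part of the app) that observes it manually with Combine’s sink:

let recognizedText = RecognizedText()

// The closure runs every time the subject sends a value.
let subscription = recognizedText.didChange.sink { object in
    print("New value: \(object.value)")
}

recognizedText.value = "Hello"   // prints "New value: Hello"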

Next, let’s see the ScanningView. One specific thing here is that we are using a VNDocumentCameraViewController, which is a standard VisionKit view controller, not a SwiftUI view. This means we need to build a wrapper in the ScanningView to make this controller usable from SwiftUI.

The way to do this is to implement the UIViewControllerRepresentable protocol. For this, we need to implement several methods. But first, we need to specify the type of the view controller that we are wrapping, in this case VNDocumentCameraViewController.

Next, we can optionally specify a coordinator. We will need a custom coordinator in order to specify the delegate for the VNDocumentCameraViewController (more on that later). Then, in the makeUIViewController method, we create the view controller that needs to be presented by SwiftUI.

import SwiftUI
import VisionKit

struct ScanningView: UIViewControllerRepresentable {

    @Binding var recognizedText: String

    typealias UIViewControllerType = VNDocumentCameraViewController

    func makeCoordinator() -> Coordinator {
        return Coordinator(recognizedText: $recognizedText)
    }

    func makeUIViewController(context: UIViewControllerRepresentableContext<ScanningView>) -> VNDocumentCameraViewController {
        let documentCameraViewController = VNDocumentCameraViewController()
        documentCameraViewController.delegate = context.coordinator
        return documentCameraViewController
    }

    func updateUIViewController(_ uiViewController: VNDocumentCameraViewController,
                                context: UIViewControllerRepresentableContext<ScanningView>) {
        // Nothing to update here.
    }
}

Note how we specified a @Binding property wrapper for the recognized text. A @Binding does not own its data; it references state owned elsewhere (remember that we defined the @ObjectBinding in the ContentView). With this binding in place, whenever the text changes here, it also changes in the ContentView.
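As a tiny, hypothetical illustration of this two-way flow (the names below are made up and not part of our app), a child view can mutate state owned by its parent through a binding:

struct NameEditor: View {

    // Owned by the parent; this view only references it.
    @Binding var name: String

    var body: some View {
        Button(action: { self.name = "Updated" }) {
            Text("Current name: \(name)")
        }
    }
}

The parent creates it with NameEditor(name: $someValue) and sees every change the child makes.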

Next, let’s see the Coordinator implementation. Our coordinator will basically implement the VNDocumentCameraViewControllerDelegate and do the text recognition.

class Coordinator: NSObject, VNDocumentCameraViewControllerDelegate {

    @Binding var recognizedText: String
    private let textRecognizer: TextRecognizer

    init(recognizedText: Binding<String>) {
        _recognizedText = recognizedText
        textRecognizer = TextRecognizer(recognizedText: recognizedText)
    }

    func documentCameraViewController(_ controller: VNDocumentCameraViewController, didFinishWith scan: VNDocumentCameraScan) {
        // Collect the scanned pages as CGImages.
        var images = [CGImage]()
        for pageIndex in 0 ..< scan.pageCount {
            let image = scan.imageOfPage(at: pageIndex)
            if let cgImage = image.cgImage {
                images.append(cgImage)
            }
        }
        // Run text recognition on the collected images.
        textRecognizer.recognizeText(from: images)
        controller.navigationController?.popViewController(animated: true)
    }
}

When the delegate method of the document camera view controller is called, we take the scanned pages as images and pass them to our text recognizer. Note how we again specify the @Binding for the recognized text here.

Finally, let’s see the TextRecognizer, as well as the new text recognition API. The new request type is called VNRecognizeTextRequest. On this request, we can specify the recognition level. There are two levels: fast and accurate. Fast is recommended for real-time scanning, for example in AR apps. Accurate is recommended for documents and already captured images; it takes a bit longer than the fast level, since it uses neural networks to determine the characters. We will use the accurate level for our use case.

import SwiftUI
import Vision

public struct TextRecognizer {

    @Binding var recognizedText: String

    private let textRecognitionWorkQueue = DispatchQueue(label: "TextRecognitionQueue",
                                                         qos: .userInitiated,
                                                         attributes: [],
                                                         autoreleaseFrequency: .workItem)

    func recognizeText(from images: [CGImage]) {
        self.recognizedText = ""
        // Run the recognition off the main thread.
        textRecognitionWorkQueue.async {
            var tmp = ""
            let textRecognitionRequest = VNRecognizeTextRequest { (request, error) in
                guard let observations = request.results as? [VNRecognizedTextObservation] else {
                    print("The observations are of an unexpected type.")
                    return
                }
                // Concatenate the recognised text from all the observations.
                let maximumCandidates = 1
                for observation in observations {
                    guard let candidate = observation.topCandidates(maximumCandidates).first else { continue }
                    tmp += candidate.string + "\n"
                }
            }
            // Accurate is slower, but better suited for scanned documents.
            textRecognitionRequest.recognitionLevel = .accurate
            for image in images {
                let requestHandler = VNImageRequestHandler(cgImage: image, options: [:])
                do {
                    try requestHandler.perform([textRecognitionRequest])
                } catch {
                    print(error)
                }
                tmp += "\n\n"
            }
            // Update the binding on the main thread, so the UI is refreshed correctly.
            DispatchQueue.main.async {
                self.recognizedText = tmp
            }
        }
    }

}

Finally, we dispatch the work to a background queue and go through the images, creating a VNImageRequestHandler for each one. Then, we perform the text recognition request with the created handler. At the end, back on the main queue, we update the recognized text binding, which automatically updates the recognized text everywhere in the app.

Conclusion

In this post, we’ve built an app that recognizes text from an image, using the new text recognition API from iOS 13. We’ve implemented this feature using the SwiftUI framework. SwiftUI looks very cool and powerful and it will probably be the future of building iOS apps.

You can find the source code for this app here.

What are your thoughts on SwiftUI and text recognition? Do you have any suggestions on how to improve the code? Please share your thoughts in the comments section.

19 Comments

  1. This looks great but it’s not working. Appears that certain things have been deprecated. Also, wonder if you could do an update for image recognition rather than text recognition or explain how it would be done? Thanks!


      1. Hi Martin,

        Thank you very much for the code! The program does compile under Xcode 11 beta 5, but it crashes under iOS 13 beta 6 on my iPhone X, apparently due to @ObjectBinding, which has now been replaced by @ObservedObject and ObservableObject. It crashes in the NavigationView.

        Best
        Frank


      2. Hi Frank,

        It seems to be an issue with Xcode 11 beta 5 itself at the moment; we can’t deploy it on real devices.

        You can run it from the simulator, but since you have no camera there, you can extend the image picker to take images from the gallery.
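        Roughly something like this could work, following the same wrapping pattern as the ScanningView (an untested sketch, and the names here are made up):

        import SwiftUI
        import UIKit

        struct PhotoPickerView: UIViewControllerRepresentable {

            @Binding var recognizedText: String

            func makeCoordinator() -> PhotoCoordinator {
                PhotoCoordinator(recognizedText: $recognizedText)
            }

            func makeUIViewController(context: UIViewControllerRepresentableContext<PhotoPickerView>) -> UIImagePickerController {
                let picker = UIImagePickerController()
                picker.sourceType = .photoLibrary
                picker.delegate = context.coordinator
                return picker
            }

            func updateUIViewController(_ uiViewController: UIImagePickerController,
                                        context: UIViewControllerRepresentableContext<PhotoPickerView>) {
                // Nothing to update here.
            }
        }

        class PhotoCoordinator: NSObject, UIImagePickerControllerDelegate, UINavigationControllerDelegate {

            @Binding var recognizedText: String
            private let textRecognizer: TextRecognizer

            init(recognizedText: Binding<String>) {
                _recognizedText = recognizedText
                textRecognizer = TextRecognizer(recognizedText: recognizedText)
            }

            func imagePickerController(_ picker: UIImagePickerController, didFinishPickingMediaWithInfo info: [UIImagePickerController.InfoKey: Any]) {
                // Reuse the same TextRecognizer with the picked image.
                if let image = info[.originalImage] as? UIImage, let cgImage = image.cgImage {
                    textRecognizer.recognizeText(from: [cgImage])
                }
                picker.dismiss(animated: true)
            }
        }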

        In the meantime, we should wait for a new beta.

        Best,
        Martin


  2. Hi Martin,

    Yes, hope the next beta is coming soon.
    I tried to circumvent the crash with @EnvironmentObject, but I got stuck in the TextRecognizer so far.

    Best
    Frank


      1. Do you have any code examples to extend this to use the photo library, Martin? The Apple docs are incredibly cryptic and poorly written, with no example code and no hints at how to implement what they are talking about… only someone who has been coding for 72 years has a slight chance at deciphering those hieroglyphics. I’m sure 90 percent of developers get their code from Stack Overflow and other sites. I’m glad the new SwiftUI team wrote better example docs than were written previously.


  3. Hi, I was taking a look at your previous text recognition post and found this one; it looks impressive. Would it be extensible to dynamic writing (for detecting letters while writing with the Apple Pencil or a finger)?


      1. Hi again Martin, I was going to test your app, but it seems to show a black screen at the beginning, so it’s impossible to interact with it.


      2. Hey guys, I’ve fixed the issue with the black screen. However, the bindings are not working correctly with this new Xcode version. Next week I will have more time to look into this. If you have some ideas in the meantime, feel free to contribute.


  4. I want to detect non-English languages. I tried setting ISO language codes on request.recognitionLanguages, but it is not working as expected. Also:

    let revision = VNRecognizeTextRequest.currentRevision
    var possibleLanguages: [String] = []
    do {
        possibleLanguages = try VNRecognizeTextRequest.supportedRecognitionLanguages(for: .accurate, revision: revision)
    } catch {
        print("Error getting the supported languages.")
    }

    print("Possible languages for revision \(revision):\n\(possibleLanguages.joined(separator: "\n"))")

    This always returns en-US.

    Do you know how to add support for other languages as well?


  5. Hi Martin, this post has been wonderfully helpful, but I’m finding the observations often miss or mistake capital-I and the number 1, especially in sans-serif typefaces. Do you have any recommendations for how to improve accuracy for these sorts of cases, if I’m already using accurate recognition?

    Thanks!


    1. Hi Steve,
      Thanks, I’m glad it was helpful for you. About your question, here are some tips I can think of:
      – Do some post-processing: a plain character replacement (I with 1), or keep a list of valid words and compute the Levenshtein distance (https://en.wikipedia.org/wiki/Levenshtein_distance) between the recognized word and a valid word.
      – Another option is to use the customWords property on the text recognition request, to add some custom words that will be favoured over others (see the sketch below).
      – If the results are still not OK for some fonts, you might want to try ML Kit from Google; they also have text recognition that works pretty well.
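      For the customWords idea, a minimal sketch could look like this (the word list is just an example):

      let request = VNRecognizeTextRequest { request, error in
          // Handle the observations as shown in the post.
      }
      request.recognitionLevel = .accurate
      // Favour domain-specific vocabulary during recognition.
      request.customWords = ["SwiftUI", "Combine", "VisionKit"]
      request.usesLanguageCorrection = true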
      Hope that helps.
      Best,
      Martin
