Building a smart Q&A app with CoreML, SwiftUI and Combine

 

Introduction

Imagine an app that lets you scan a document and then ask questions about its content. Something like this:

This app has several challenges:

  • detecting edges on a document
  • scanning and recognizing the text in the document
  • answering questions based on the content of the scanned document.

Such an app can be quite useful for anyone who doesn't have the time, or is too lazy, to read the whole text. For example, you can use the app to scan this post and learn the most important things about how to implement it yourself, without reading the post at all.


Implementation

We will be using a lot of cool technologies to implement this app, such as SwiftUI, Combine, CoreML and Vision. To get started, first create a new SwiftUI project (I’ve called mine SmartAssistant).

Text recognition

The first thing we need to do is scan the document and extract its text. Once we have it as a string, it will be much easier to analyze. You can find a lot more details about text recognition in my other post Text recognition on iOS 13 with Vision, SwiftUI and Combine.

In a nutshell, we need to implement a SwiftUI wrapper for the VNDocumentCameraViewController, which is a UIKit view controller from the VisionKit framework. This controller shows the camera scanner that does edge detection.

import SwiftUI
import VisionKit

struct ScanningView: UIViewControllerRepresentable {
    
    let documentsRepository: DocumentsRepository
    var modalShown: Binding<Bool>
    var loadingViewShown: Binding<Bool>
    
    typealias UIViewControllerType = VNDocumentCameraViewController
    
    func makeCoordinator() -> Coordinator {
        return Coordinator(withDocumentsRepository: documentsRepository,
                           modalShown: modalShown,
                           loadingViewShown: loadingViewShown)
    }
    
    func makeUIViewController(context: UIViewControllerRepresentableContext<ScanningView>) -> VNDocumentCameraViewController {
        let documentCameraViewController = VNDocumentCameraViewController()
        documentCameraViewController.delegate = context.coordinator
        return documentCameraViewController
    }
    
    func updateUIViewController(_ uiViewController: VNDocumentCameraViewController, context: UIViewControllerRepresentableContext<ScanningView>) {
        
    }
    
    class Coordinator: NSObject, VNDocumentCameraViewControllerDelegate {
        
        let documentsRepository: DocumentsRepository
        var modalShown: Binding<Bool>
        var loadingViewShown: Binding<Bool>
        
        init(withDocumentsRepository documentsRepository: DocumentsRepository,
             modalShown: Binding<Bool>,
             loadingViewShown: Binding<Bool>) {
            self.documentsRepository = documentsRepository
            self.modalShown = modalShown
            self.loadingViewShown = loadingViewShown
        }
        
        public func documentCameraViewController(_ controller: VNDocumentCameraViewController, didFinishWith scan: VNDocumentCameraScan) {
            var images = [CGImage]()
            for pageIndex in 0..<scan.pageCount {
                let image = scan.imageOfPage(at: pageIndex)
                if let cgImage = image.cgImage {
                    images.append(cgImage)
                }
            }
            self.loadingViewShown.wrappedValue = true
            DispatchQueue.global(qos: .userInitiated).async { [unowned self] in
                let text = TextRecognizer.recognizeText(from: images)
                let number = self.documentsRepository.documents.count + 1
                let title = "Document \(number)"
                let document = Document(id: UUID().uuidString, title: title, content: text)
                DispatchQueue.main.async {
                    self.documentsRepository.add(document: document)
                    self.loadingViewShown.wrappedValue = false
                    self.modalShown.wrappedValue = false
                }
            }
        }
    }

}

When the user scans a new document, the TextRecognizer class tries to recognize the text in each page image. Using the recognized text, we create a new document and save it in our document repository (which is an environment object).
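The TextRecognizer helper itself is covered in the linked text recognition post and isn't shown here. As a rough idea, here's a minimal sketch, assuming it simply runs an accurate VNRecognizeTextRequest per page and joins the top candidates:

import Vision

enum TextRecognizer {

    // Runs an accurate text recognition request on every scanned page and
    // concatenates the top candidate of each observation into one string.
    static func recognizeText(from images: [CGImage]) -> String {
        var recognizedText = ""
        for image in images {
            let request = VNRecognizeTextRequest()
            request.recognitionLevel = .accurate
            let handler = VNImageRequestHandler(cgImage: image, options: [:])
            try? handler.perform([request])
            guard let observations = request.results as? [VNRecognizedTextObservation] else {
                continue
            }
            let pageText = observations
                .compactMap { $0.topCandidates(1).first?.string }
                .joined(separator: " ")
            recognizedText += pageText + "\n"
        }
        return recognizedText
    }
}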

Our Document type is a simple struct, which contains an identifier, title and the content of the document. The document repository is even simpler – it just stores an array of saved documents and publishes any changes to its subscribers.

struct Document: Identifiable {
    let id: String
    let title: String
    let content: String
}

class DocumentsRepository: ObservableObject {

    @Published var documents = [Document]()

    func add(document: Document) {
        documents.append(document)
    }

}

This demo app doesn’t have any persistence, which means the scanned documents will be lost if you close the app.

Now that we have our scanning view, let’s integrate it into our main view. It will be shown as a modal view when the user taps the plus icon.

struct ContentView: View {

    @EnvironmentObject var documentsRepository: DocumentsRepository
    @State var modalShown: Bool = false
    @State var loadingViewShown: Bool = false

    var body: some View {
        NavigationView {
            DocumentsList()
            .navigationBarTitle("Documents")
            .navigationBarItems(trailing:
                Button(action: {
                    self.modalShown = true
                }) {
                    Image(systemName: "plus")
                    .renderingMode(.original)
                }
            )
        }.sheet(isPresented: $modalShown) {
            LoadingView(isShowing: self.$loadingViewShown) {
                ScanningView(documentsRepository: self.documentsRepository,
                             modalShown: self.$modalShown,
                             loadingViewShown: self.$loadingViewShown)
            }
        }
    }
}
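The LoadingView wrapping the scanner isn't part of this post's snippets. Here's one possible minimal sketch, assuming it just blurs and disables its content and shows a simple label while isShowing is true:

import SwiftUI

struct LoadingView<Content: View>: View {

    @Binding var isShowing: Bool
    let content: () -> Content

    var body: some View {
        ZStack {
            // The wrapped content is blurred and disabled while loading.
            content()
                .disabled(isShowing)
                .blur(radius: isShowing ? 3 : 0)

            if isShowing {
                Text("Loading...")
                    .padding()
                    .background(Color.white)
                    .cornerRadius(8)
                    .shadow(radius: 4)
            }
        }
    }
}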

The DocumentsList view subscribes to the changes of the documentsRepository and updates its content when new documents are added.

struct DocumentsList: View {

    @EnvironmentObject var documentsRepository: DocumentsRepository

    var body: some View {
        List(documentsRepository.documents) { document in
            NavigationLink(destination: DocumentDetail(document: document)) {
                DocumentRow(document: document)
            }
        }
    }

}
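The DocumentRow used in the list isn't shown here either. A minimal sketch, assuming the row only needs to display the document's title, could be:

struct DocumentRow: View {

    let document: Document

    var body: some View {
        Text(document.title)
    }
}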

When we tap an entry in the DocumentsList, a DocumentDetail view with the selected document is shown. At the top, this view has an input field that users can use to ask questions about the recognized content shown below it.

struct DocumentDetail: View {
    let document: Document
    let bert = BERT()

    @State private var question: String = ""
    @State private var answer: String = ""
    @State private var loadingViewShown = false

    var body: some View {
        LoadingView(isShowing: $loadingViewShown) {
            VStack {
                HStack {
                    TextField("Insert your question here...", text: self.$question)
                    Button(action: {
                        self.findAnswer()
                    }) {
                        Text("Answer")
                    }
                }.padding()
                Text(self.answer).padding()
                ScrollView(.vertical) {
                    Text(self.document.content).padding()
                }
            }
            .navigationBarTitle(self.document.title)
        }
    }

    private func findAnswer() {
        self.loadingViewShown = true
        DispatchQueue.global(qos: .userInitiated).async {
            // The BERT prediction can take a while, so run it off the main
            // thread and hop back to the main queue before updating @State.
            let answer = String(self.bert.findAnswer(for: self.question, in: self.document.content))
            DispatchQueue.main.async {
                self.answer = answer
                self.loadingViewShown = false
            }
        }
    }

}

There are three @State variables in this view. First, we need to keep track of the question that the user enters. The question string is bound to the text field. When the question changes and the user presses the “answer” button, the answer changes (more on that below). When the answer is changed, the label that displays that answer should also be updated.

Questions and answers model

Now, to the most interesting part – how do we answer a question? Luckily, there’s a CoreML port of the BERT model. BERT stands for Bidirectional Encoder Representations from Transformers and is the result of the research work and paper by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.

The model accepts text from a document and a question, in natural English, about the document. The model responds with the location of a passage within the document text that answers the question. You can find more details about how the CoreML model works in the Apple sample code.

Basically, it takes the question and the content where the answer should be found. It splits the words of both strings into tokens using the Natural Language framework. Then each word is converted into an ID, based on a pre-defined BERTVocabulary (you can find it in the text file bert-base-uncased-vocab.txt in the repo). If a word token doesn’t exist in the vocabulary, the method looks for subtokens, or wordpieces. Next, the word IDs are transformed into an MLMultiArray and provided as input to the CoreML model, which does the prediction. For more details about this process, please refer to the Apple sample code.
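As an illustration of just the first step, here's a minimal sketch of word tokenization with the Natural Language framework (the BERTVocabulary lookup, the wordpiece fallback and the MLMultiArray packing from the sample code are left out):

import NaturalLanguage

// Splits a string into lowercased word tokens, similar to the first step
// the BERT sample performs before looking tokens up in its vocabulary.
func wordTokens(from text: String) -> [String] {
    let tokenizer = NLTokenizer(unit: .word)
    tokenizer.string = text
    var tokens = [String]()
    tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, _ in
        tokens.append(String(text[range]).lowercased())
        return true
    }
    return tokens
}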

For us, as users of this model, we just need to call:

self.answer = String(self.bert.findAnswer(for: self.question, in: self.document.content))

That’s everything we need to do. When the BERT prediction is done, the answer is updated. Since the answer is bound to the text view, it’s automatically updated on screen.

Testing

You can test the app with several different types of documents. I tried it by scanning the App Store page of my Drawland app (check the video above). I asked several questions, like:

  • How Drawland helps?
  • How many sketches it has?
  • Who is it good for?

The model gave pretty good results. For example, in the second and third questions I didn’t even mention Drawland, but the model was able to figure that out by itself. The third question also didn’t use the word “perfect” (which appears on Drawland’s description page) but a different word, “good”. The model still correctly returned the app’s target group – kids and anyone who wants to learn how to draw.

Conclusion

Machine learning is making some impressive advancements. Without being an expert in natural language processing, you can make your apps smarter by using the knowledge and expertise of the researchers and data scientists.

One limitation – at the moment, the model is quite big (over 200 MB), so you probably won’t use it in a production app because of that. There’s also a limit on the number of tokens – if you have more than 389 tokens, your question will not be answered.
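If you want to skip the prediction for inputs that are clearly too long, a rough pre-check could look like the sketch below. It assumes a plain word-token count is a close enough approximation of the model's own tokenization (wordpiece splitting can push the real count higher), so treat it only as an early warning:

import NaturalLanguage

// Returns true if a rough word-token count of the question and the
// document content stays within the given limit.
func isWithinTokenLimit(question: String, content: String, limit: Int = 389) -> Bool {
    let tokenizer = NLTokenizer(unit: .word)
    var total = 0
    for text in [question, content] {
        tokenizer.string = text
        total += tokenizer.tokens(for: text.startIndex..<text.endIndex).count
    }
    return total <= limit
}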
