Hand tracking with Turi Create and Core ML

Introduction

The task of real-time hand tracking on a mobile device is both very interesting and challenging. The hand is one of the more difficult parts of the body to detect and track. The reason for this is that hands can look very different, both in their form (although this holds for other body parts as well) and in the position of the fingers. A hand can go from a punch to a high-five in just a few moments. This makes it quite difficult to collect a dataset with appropriate annotations of the different hand states, taken from different angles. In this post, we will see an approach to doing this with Turi Create, Apple’s framework for creating Core ML models.

Object detection and tracking

The machine learning task we need here is object detection. Object detection lets you not only detect whether an object is present in the current camera frame, but also find that object’s position. This enables you to draw a visual indication (like a rectangle) or present a virtual model at the detected location. Once you find the object, you can use, for example, Vision’s tracking feature to update the visual indication as the object of interest moves.

If you just want to find out whether an object is in the picture, you need image classification. For that, you can use Create ML, Apple’s other framework for creating machine learning models.

Turi Create

Turi Create is a tool that simplifies the creation of custom machine learning models that can easily be exported to Apple’s Core ML format. This means you don’t have to be a machine learning expert to add some intelligence to your apps. The tasks supported by Turi Create include recommendations, image classification, object detection, style transfer, activity classification and much more. Although it requires a little bit of Python coding, it’s still very easy to get started, as we will see in this post.

You can find details on how to install Turi Create in its GitHub repo. Basically, you need to run the following from your terminal:

pip install -U turicreate
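
If you want to quickly verify the installation, you can import the package in a Python shell and print its version:

import turicreate as tc
print(tc.__version__)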

Once Turi Create is installed, our next task is to find and prepare the data.

Preparing the data

One of the biggest challenges of machine learning is finding enough data to train the models. As we discussed at the beginning, detecting the correct bounds of a hand can be a bit tricky. That’s why we need a very good dataset. Let’s see what I’ve found.

The dataset that I used can be found here. I tried several options, and this one proved to be the best. It’s quite big (around 6 GB), and it can be quite challenging for a Mac without a GPU to train a model on it. Other options that I found are EgoHands and the Oxford Hand dataset. I was thinking of combining all three to get a better machine learning model, but my Mac couldn’t handle that.

Now, let’s look at the data. The data from the VIVA hand detection challenge that I used is split into two folders, pos and posGt, both under the train folder. The pos folder contains all the images, while posGt contains all the annotations in CSV format, along with information about the hand (left or right). Each CSV entry describes a hand’s bounding box using the top-left point, a width and a height [x y w h] in the 2D image plane.

What does Turi Create expect?

Turi Create, on the other hand, expects an SFrame, a tabular data structure in which you can put both the images and the corresponding annotations. The annotations are in JSON format: every image has an array of objects with the keys coordinates and label. The coordinates value contains the information for the bounding box, while the label describes what is inside the bounding box. In our case, it’s either a left or a right hand.

[ {'coordinates': {'height': 104, 'width': 110, 'x': 115, 'y': 216},
'label': 'left'}, ...]

The x and y coordinates represent the centre of the rectangle, followed by its width and height, which is different from the way the data is organised in the hands dataset (there we have the top-left point instead of the centre of the rectangle).
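
To make the conversion concrete, here is a tiny sketch (with numbers chosen to match the sample annotation above) of turning a top-left based box from the dataset into the structure Turi Create expects:

# Box from the dataset: top-left point, width and height [x y w h]
x, y, w, h = 60, 164, 110, 104

# Turi Create wants the centre of the box instead of its top-left corner
annotation = {
    'coordinates': {'x': x + w / 2, 'y': y + h / 2, 'width': w, 'height': h},
    'label': 'left'
}
print(annotation)
# {'coordinates': {'x': 115.0, 'y': 216.0, 'width': 110, 'height': 104}, 'label': 'left'}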

In order to create that data structure, we will do some Python coding. The following script transforms the images and CSV files from the dataset into an SFrame, which can then be used to create the Turi Create model.

import turicreate as tc
from os import listdir
from os.path import isfile, join

path = 'train/posGt'
imagesDir = "train/pos"

# Collect all annotation files from the posGt folder
files = [f for f in listdir(path) if isfile(join(path, f))]
annotations = []
for fname in files:
    if fname != ".DS_Store":
        lines = tuple(open(path + "/" + fname, 'r'))
        count = 0
        entries = []
        for line in lines:
            # Skip the first (header) line of every annotation file
            if count > 0:
                words = line.split()
                # The first word encodes which hand was annotated
                passengerLabel = words[0]
                label = "left"
                if passengerLabel.find("left") == -1:
                    label = "right"
                # The dataset stores the top-left point, width and height
                x = int(words[1])
                y = int(words[2])
                width = int(words[3])
                height = int(words[4])
                # Turi Create expects the centre of the bounding box instead
                xCenter = x + width / 2
                yCenter = y + height / 2
                coordinates = {'height': height, 'width': width, 'x': xCenter, 'y': yCenter}
                entry = {'coordinates': coordinates, 'label': label}
                entries.append(entry)
            count = count + 1
        annotations.append(entries)

# Load the images and attach the annotations in the same order
sf_images = tc.image_analysis.load_images(imagesDir, random_order=False, with_path=False)
sf_images["annotations"] = annotations
sf_images['image_with_ground_truth'] = \
    tc.object_detector.util.draw_bounding_boxes(sf_images['image'], sf_images['annotations'])
sf_images.save('handsFrame.sframe')

To accomplish this, we first go through the annotation files, parsing the CSV and creating the coordinates JSON, while also transforming the top-left coordinates into centre coordinates. Next, we load the images using the helper functions from the turicreate package. Then we simply attach the annotations, relying on the order being preserved. Finally, the SFrame is saved as handsFrame.sframe.

You can also call sf_images.explore() to get a visualization of the images and their bounding boxes. However, you should try this with only a few images; otherwise it will just load forever.
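
For example, exploring only the first few rows keeps it responsive (a quick sketch, using the sf_images SFrame from the script above):

# Explore only a handful of rows instead of the whole dataset
sf_images.head(10).explore()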

[Image: exploring the SFrame, with the ground-truth bounding boxes drawn over the images]

The next step is to use the SFrame to create the Core ML model. This means we should do another round of Python coding.

import turicreate as tc

# Load the data
data = tc.SFrame('handsFrame.sframe')

# Make a train-test split
train_data, test_data = data.random_split(0.8)

# Create a model
model = tc.object_detector.create(train_data, feature='image', max_iterations=120)

# Save predictions to an SArray
predictions = model.predict(test_data)

# Evaluate the model and save the results into a dictionary
metrics = model.evaluate(test_data)

# Export for use in Core ML
model.export_coreml('Hands.mlmodel')

Here we first load the SFrame we created with the first script. Then we randomly split the data into train and test sets with an 80/20 ratio. Next, using the object_detector.create method from Turi Create, we create the model from the training data. You can play with the max_iterations parameter (my machine crashed at 150, so 120 is the best I could do). Afterwards, we run predictions and evaluate the model. In the last step, we export the model in the Core ML format.
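
Before exporting, it can be useful to take a quick look at what the evaluation and the predictions contain (a small sketch continuing from the script above; the exact metric keys depend on your Turi Create version):

# The evaluation result is a dictionary of metrics computed on the test split
print(metrics)

# predict() returns one list of detected bounding boxes per test image
print(predictions[0])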

iOS implementation

Now that we have a Core ML model, it’s easy to integrate it into an iOS app: just drag and drop it into an Xcode project.

Let’s examine the created model. Its type is Pipeline, which works really nicely with Apple’s Vision framework starting from iOS 12. As input, it accepts an image of size 416×416. As output, it provides two MLMultiArrays containing the confidence and the coordinates of the detected objects.

[Image: the generated Hands.mlmodel as shown in Xcode]

The good news is that you don’t have to handle these complicated multi-dimensional arrays yourself: the Vision framework does this automatically for you (for pipeline models created with Turi Create) and provides you with VNRecognizedObjectObservation results. This type contains the bounding box (as a CGRect), along with the confidence. Now, when you run the Vision request, you just need to check whether the results are of that type and draw the appropriate bounding boxes.

func handleNewHands(request: VNRequest, error: Error?) {
    DispatchQueue.main.async {
        // Perform all the UI updates on the main queue
        guard let results = request.results as? [VNRecognizedObjectObservation] else { return }
        for result in results {
            print("confidence=\(result.confidence)")
            // Only start tracking observations above our confidence threshold
            if result.confidence >= self.confidence {
                self.shouldScanNewHands = false
                let trackingRequest = VNTrackObjectRequest(detectedObjectObservation: result,
                                                           completionHandler: self.handleHand)
                trackingRequest.trackingLevel = .accurate
                self.trackingRequests.append(trackingRequest)
            }
        }
    }
}

Once the object is detected, we can tell Vision to track it. To do this, we create an object of type VNTrackObjectRequest, passing it the recognised object observation, and start the tracking. The tracking rectangle is then updated every time the completion handler handleHand is called.

Source code

This was the most important part of the iOS implementation. You can find the full source code here, with all the Vision detection and tracking details.

Conclusion

This was a very interesting machine learning exercise for me. Turi Create is a powerful tool for creating machine learning models, and the models it creates work seamlessly with iOS apps.

There’s a lot of room for improvement in this project. First, the model should be trained with a lot more data, so that it behaves correctly in all lighting conditions and hand positions. Also, the iOS code can be improved to better handle the tracking requests. At the moment, it can happen that multiple rectangles are recognised for the same hand.

Another cool thing would be to track not only the whole hand, but the fingers as well.

That’s everything for this post. Do you think that detecting body parts would be a useful task for our future apps? What about machine learning in general? Leave any comments in the section below.
