Explore the Action & Vision app

Back to WWDC 2020

Explore the Action & Vision app

It's now easy to create an app for fitness or sports coaching that takes advantage of machine learning — and to prove it, we built our own. Learn how we designed the Action & Vision app using Object Detection and Action Classification in Create ML along with the new Body Pose Estimation, Trajectory Detection, and Contour Detection features in the Vision framework. Explore how you can create an immersive application for gameplay or training from setup to analysis and feedback. And follow along in Xcode with a full sample project. To get the most out of this session, you should have familiarity with the Vision framework and Create ML's Action Classifier tools. To learn more, we recommend watching “Build an Action Classifier with Create ML,” “Explore Computer Vision APIs,” and “Detect Body and Hand Pose with Vision.” We also recommend exploring the Action & Vision sample project to learn more about adopting these technologies. Whether you are building a fitness coaching app, or exploring new ways of interacting, consider the incredible features that you can build by combining machine learning with the rich set of computer vision features. By bringing Create ML, Core ML, and Vision API together, there's almost no end to the magic you can bring to your app.

Resources
Related Videos

WWDC 2020
Download

Hello and welcome to WWDC.
Hi. My name is Frank Doepke, and together with my colleague, Brent Dimick, we're going to explore the Action and Vision Application.
The theme that we would like to set for today is that we use the phone as an observer and give feedback to our users. What do I mean by that? We already see that at plenty of sporting events, we have people using their phones to record and film it. But our thought is, can we use our iPhone or iPad and step in as a coach? We already have it with us, even when we go to the gym, but we want to use the camera and the sensors to actually, instead of us looking and observing what's on the device, have it observe what we are doing and give us some real-time feedback. You already have the perfect tool in your pocket. We have a high-quality camera, a fast CPU, GPU and neural engine. Now, we have a set of comprehensive and cooperate APIs that make it easy for you to take advantage of all the hardware.
Last, but not least, all of this can happen on device. Now, that is important for two reasons.
First, we actually want to make sure that we preserve the privacy of our users by keeping all the data on the device. And second, you actually don't have to wait for any kind of latency by analyzing something in the cloud.
So today, in sports and fitness, we can actually see that analysis of the sport can really help everyone to improve. So we have billions of sports enthusiasts up to the professional athlete who can benefit from sports analysis. But instead of just looking at the device as we do today for many things and looking at videos of how to do it, we want to take it to the next level. We want to improve in the sports and fitness. So we do some common analysis, and we think this can be done. So, first, what did our body do? How did we move? Then we might want to look at which objects are in motion. Think of the ball in the soccer game or in a tennis game. We need to understand the field of play. The court in the tennis court, or the soccer goal. And then we need to give feedback to the user of what we actually saw and what happened.
So, for our session, we picked an example of a sport that is simple and easy to understand. And then we wrote a fun little application around it that you can actually download the source code for to actually follow along. So let me introduce the Action and Vision Application.
We picked the game of bean bag toss. It is a fun game for everyone to play. It's very simple. We have two boards set up 25 feet apart. Each of the boards has a regulation size of two feet by four feet. and it has a six-inch-diameter hole right in the center. Players now take the bean bags and throw them on the boards. And they take turns and score different scores for if you land on the board, and an extra score when you actually land in the hole. Now, you might think of bean bag as just a pastime, but still, everybody's competitive and wants to win.
So you might ask yourself the question of, like, "Why did I miss my shot?" To answer that, we want to see, how was the bag flying? What was my body pose when I released the bag? And how fast was my throw? Of course, I need to keep score, and perhaps I want to show off in front of my friends by doing some different shots. So, let's head outside and actually play a game.
All right. Here we have our phone already set up. We now go into the live-action mode to actually record the session.
The first thing we need to do is find our board. So we panned our camera over. Now it's stable. And we are waiting for the player. And there we go. We have our player. Let's see how I did.
You can see in the orange line it saw how you are actually throwing, and I could see my trajectory of my bean bag flying.
Wow, I got lucky on that shot.
The other part I would like you to pay attention to is that we have a skeleton on top of the player, so we see all the key points of our movements.
With that, we can understand the release angle when we're throwing, and we also can see what kind of throw we did-- overhand, underhand.
And even there, you see, a trick shot, where we're throwing under the leg.
You also see the speed of which the bean bag was traveling, and you have the score on the bottom left and the kind of throws that we did on the bottom right. Now, once we are done with all of the eight throws, we actually get a summary view. I can see all my different trajectories and can see what worked best for me. I see my average speed at which I was throwing, as well as the release angle. And, of course, the final score. So that is the Action and Vision App. We hope that you will enjoy it. Now let's look how we actually created this application. We have some key algorithms that have to play together to make all of this happen. We start with the prerequisite phase. As you saw, the camera was on a tripod, or somehow we need to have it stabilized. Then we have a game setup. This is kind of where we understand the playing field. And then, of course, comes the game play. Now, the prerequisite is first, is you find the board. That was the very first panning move that you saw. Once we have the board, we now need to ensure that we have scene stability. That means we know that the camera is, for instance, on a tripod or somehow otherwise stabilized. Now we're getting ready to play in the game setup part. We measure out the boards, then we find our player, and we are ready to roll. Now, when the game play starts, we can actually find all the throws. Then we analyze the kind of throw type. Is it an overhand, underhand or under leg? And last but not least, we measure the speed. Now, of course we're interested-- Which algorithms do we use? So, to find the boards, we trained a custom model, and we're using the VNCoreMLRequest to actually run the inference on that model. And it tells us where the board is. Once we have the board, we can now use VNTranslationalImageRequest to analyze for scene stability. We measure the boards by actually running the VNDetectContourRequest, which gives us the outline of the board itself. And then we use the VNDetectHumanBodyPoseRequest, which is new this year, to find the human. Now, when we are ready to play, we use the VNDetectTrajectoriesRequest, which finds the throw of the bag. And then we have a new model that we trained in Create ML and run through CoreML to actually classify what kind of throw we have done. Last but not least, we can use the measurements of the board, together with the analysis of the trajectory, to measure the speed of our throw. Now, to guide you through all of this, we actually have an icon that will help you. So you see that we have our prerequisite stage, the game setup stage and the game play. Let's dive into the details. The first part is that we need to detect the boards and recognize them. So we created a Custom Object Detection Model. We used Create ML with its object detection template. We have our own training data that we brought along, where we found images with the boards in them and negatives where there's actually no board in it. And we trained the model. You will hear later on in the session a little bit more about how we did the training. Now, once we have the model, we can run the inference through Vision. We saw that we need to fixate the camera. Why do we have to do this? Some of our algorithms actually require a stable scene. But it also gives us some other advantages because we only need to analyze the playing field once. We know after that, it doesn't change. Another neat part about this is that it shows that the user has clear intent of actually capturing it. So we don't need a start button. You don't need to touch the screen to do anything. We're doing the scene stability through registration. For that, we're using the VNTranslationalImageRegistrationRequest. That is a mouthful. What it does, it analyzes the movement from one frame to the next. And what you saw when the camera was panning is that we had a movement of ten pixels between each of the frames. Once the camera came to rest, that movement went down to zero. So we are now below a certain threshold, and we know that our scene is stable. The camera is not moving anymore. Next, we do the contour detection. For that, we use the VNDetectContourRequest. To do that, we use the bounding box that we got from our object detection as the region of interest. Then we simplify the contours for the analysis. By using these two techniques together, we only look at the contours that we have from the board, and not of the whole scene. If you want to learn more about the contour detection, you can look at our "Explore Computer Vision APIs" session. What we need next is our player. For that, we use the VNDetectHumanBodyPoseRequest. It gives us the points of the body joints, like your elbows, the shoulders and the wrist and the legs. We can use these points to analyze the angle between the joints, so we can know, for instance, our arm was bent. More details on how to use the BodyPoseRequest can be found in the "Understand Body and Hand Pose Using Vision" session. Also, keep this in mind because we're gonna use this for the Action Classification as well. Now, after so many slides, you want to see some code. So let me hand it over to my colleague Brent, who'll walk you through that. Brent? Thanks, Frank. Hi. I'm Brent from the CoreML team. I'm going to be walking through some of the code of the app that Frank was talking about. And as Frank mentioned, not only did we build this app for our session today, but we're also making it available for download. So if you'd like, you can pause here, download the app and follow along with me. You can find it linked to this session in the resources section.
All right, let's dive in.
The first thing I'd like to show you is how the app progresses through various states of the game. The app uses a GameManager to manage its state and communicate that state with the view controllers.
We'll see these states as we progress through the app.
And the GameManager will notify its listening view controllers of state changes. Also note that the GameManager is a singleton, which we'll find used throughout the app.
Next, let's take a look at the main storyboard.
When a user launches the app, it begins with the start and setup instructions screens.
Next, the source picker is brought up so that either the live camera or an uploaded video is used as input.
The SourcePickerViewController handles the selection of these input options.
After that, the app segues to the RootViewController.
The RootViewController is responsible for a couple of things in our app, so let's take a closer look at it.
The first thing the RootViewController is responsible for is hosting the CameraViewController.
When the RootViewController loads, it creates an instance of the CameraViewController to manage the buffers of frames coming from either the camera or the video.
The CameraViewController has an OutputDelegate which is used to pass those buffers to the appropriate delegate ViewControllers.
Once the RootViewController sets up the CameraViewController, then it calls startObservingStateChanges, which will register it to be notified by the GameManager of game-state changes.
This corresponds to the second responsibility of the RootViewController, which is to present and dismiss overlaying view controllers based on the game state.
The RootViewController has an extension where it conforms to the GameStateChange protocol. As the GameManager notifies its observers of game-state changes, the RootViewController will listen to these state changes to determine which other ViewController to present. This could be the SetupViewController, the GameViewController or the SummaryViewController. The SetupViewController and GameViewController classes have extensions to conform to both the GameStateChangeObserver and CameraViewControllerOutputDelegate protocols. Which means when one of those ViewControllers is presented by the RootViewController, it is also added as a GameStateChangeObserver, and it becomes the CameraViewControllerOutputDelegate. We just talked about how the app progresses through game states and how it passes buffers to the ViewControllers. Next, let's take a closer look at some of the key functionality that Frank was talking about. We're going to jump into the SetupViewController because this is the first ViewController that the RootViewController presents. When the SetupViewController appears, it creates a VNCoreMLRequest using the Object Detection model that we created to detect our boards using Create ML. We used the new Object Detection transfer learning algorithm in the Object Detection template. Then as the SetupViewController starts receiving buffers from the CameraViewController in its CameraViewControllerOutputDelegate extension, it starts performing these Vision requests on each buffer in the detectBoard function.
Here, the app is taking the results from the requests, which are the detected objects-- in our case, our game boards-- and filtering out low-confidence results. If it finds a result with a high enough confidence, it draws a bounding box on the screen around the detected object, and then progresses from detecting the board to detecting the board placement. The app then instructs the user to align the bounding box with the boardLocationGuide, which is already present on the screen. Once the object bounding box is placed within the boardLocationGuide, the app progresses to determining scene stability. One thing to note is that the app will only guide the user to move the board into the boardLocationGuide when the user is using live camera mode. It will not guide the user to move the board during video playback. During video playback, the app assumes that the board is placed on the right side of the video.
We just saw how the app detects our game boards and guides our users to place the board in the expected location in the camera frames.
Next, let's take a look at how the app determines that the scene is stable.
Let's go back and look at the CameraViewControllerOutputDelegate extension of the SetupViewController. It uses a VNSequenceRequestHandler because there will be a Vision request performed across a series of frames to determine the scene stability. This app iterates over 15 frames in order to make sure the scene is actually stable. As the SetupViewController receives buffers from the CameraViewController, it performs VNTranslationalImageRegistration requests for each buffer on the previous buffer to determine how aligned those two buffers are. When the ViewController receives the results from the request, it appends the points from the transform to the sceneStabilityHistoryPoints array, and then updates the Setup State again. When the Setup State is detecting player state, as it is now, the ViewController uses a read-only computed property called sceneStability to calculate whether the scene is stable or not. This property calculates the moving average of the points stored in the sceneStabilityHistoryPoints array. If the distance of the moving average is less than ten pixels, then this app considers the scene to be stable. Once scene stability is found, the app can progress to detecting the contours of our game board. So now, let's take a look at how the app does that.
We'll take a look back at the CameraViewControllerOutputDelegate extension of the SetupViewController. This time, when the ViewController receives a buffer, the setupStage is detectingBoardContours, so the ViewController calls detectBoardContours. This function uses the new VNDetectContoursRequest. Notice that the boardBoundingBox that was found earlier when we were detecting our boards is used to set a regionOfInterest for this request. This will cause the request to only be performed in that region. The app then performs an analysis of those contours to find the edge of the board and the hole of the board. Once the app finishes detecting contours, the game state moves to DetectedBoardState. Since the SetupViewController is also a GameStateChangeObserver, the following code will be run on GameStateChanges. In the case, the game state is DetectedBoardState, so the app lets the user know that a board has been detected, and the game state is changed to DetectingPlayerState. At this point, the app has found our game board, made sure it's placed correctly, determined that our scene is stable and found the contours on the game board. That completes the responsibilities of the SetupViewController, so we can move on to the next ViewController. We'll take a quick look back at the RootViewController, where we can see that since the game state is now DetectingPlayerState, the next ViewController that will be presented is the GameViewController. This means that the GameViewController will be added as a GameStateChangeObserver, and it will become the CameraViewControllerOutputDelegate. Since the GameViewController is now the CameraViewControllerOutputDelegate, it will be receiving buffers from the CameraViewController and executing the following code on each buffer. The GameViewController will perform its detectPlayerRequest, which is an instance of the VNDetectHumanBodyPoseRequest. When the ViewController receives the results from this request, it passes them to the humanBoundingBox function. This function filters out low-confidence observations and returns the bounding box of the person who enters the frame. Once this happens, the app moves its game state to the next phase. Let's remember this humanBoundingBox function because we'll hear about it again a little later.
Next, Frank is going to tell you about detecting trajectories of the bean bags while the game is being played. Frank? All right. Now we have our game play. So let's look at the trajectory detection. The VNDetectTrajectoriesRequest finds objects that moves through a scene, and it also filters out the noise from movements that we might not be interested in. But to actually use it, we need to have a better understanding as to how it works. So let's look at this. Here, we actually will see now, a throw, but I want to peel back a little bit the covers so that you can see under the hood what we are actually using for the analysis. So, this was our throw. But this is not what the algorithm looks at. It looks at what we call a frame differential. And we can now just highlight much easier the objects that are moving because they change from frame to frame. There's a bit of noise that we filter out. So, what we're going to do is we actually have a whole sequence, and we see in this action the bean bag flying. So we use our new VNDetectTrajectoryRequest, but it's a bit of a special request that's new in Vision this year. It's what we call a "stateful request." That means we need to keep this request around, which is not a mandatory part for other requests in Vision. But here, we need to do this because it builds state over time. So we feed it the first frame, and nothing happens. Continue feeding frames to it. Four frames in. Now we get to the fifth frame, and we actually get a trajectory detected, because we now have enough evidence. We cannot see from a single frame if something is moving. We need a bit of evidence over time,

May	JUN	Jul
	12
2020	2021	2022

Resources

Related Videos

WWDC 2020