Optimize Metal apps and games with GPU counters

WWDC 2020

GPU counters can help you precisely measure GPU utilization to pinpoint bottlenecks and optimize workloads for your Metal apps and games. We'll walk you through the tools available in the Metal System Trace instrument and Metal Debugger in Xcode 12 to profile your graphics workload, and show you how to use collected data to discover underused and overworked stages of your GPU pipeline. Discover how you can act on that data to improve your app's capabilities. To get the most out of the session, you should understand the tile-based deferred rendering architecture of Apple GPUs and familiarize yourself with our recommended best practices for performance optimization. For a primer, check out “Delivering optimized Metal apps and games” and “Harness Apple GPUs with Metal.” Once you've learned how to act on GPU counter data to optimize your Metal apps, see how you can use those skills to "Bring your Metal app to Apple silicon Macs" and "Optimize Metal Performance for Apple silicon Macs".

Related Videos

WWDC 2021
- Discover Metal debugging, profiling, and asset creation tools
- Optimize high-end games for Apple GPUs
WWDC 2019
- Delivering Optimized Metal Apps and Games
WWDC 2018
- Metal Shader Debugging and Profiling

Hello and welcome to WWDC.
Guillem Vinals Gangolells: Hello and welcome to this session. I am Guillem Vinals from the Metal Ecosystem team. Today I will talk about how to optimize your game or app using GPU performance counters. This talk will walk you through the architecture of modern Apple GPUs and explain its performance metrics. We will start with an introduction to both our GPUs and the performance counters. Then we will cover several groups of GPU performance counters.
We'll talk about performance limiters, memory bandwidth, occupancy, and hidden surface removal. All of these GPU performance counters will help us understand the Apple GPUs much better. We will start with an introduction to the GPU and its performance counters. The GPU is a central part of Apple processors such as A13. So let's do a quick recap of Apple GPUs first.
Apple GPUs are part of the Apple processors, which are very power efficient. Apple processors have unified memory architecture where the CPU and the GPU share System Memory. The GPU has on-chip Tile Memory. Notice that the GPU does not have dedicated Video Memory, so bandwidth could be a problem if the content has not been tuned. To be fast and efficient without Video Memory, our GPUs are TBDRs, or Tile Based Deferred Renderers.
This diagram shows the Apple GPU rendering pipeline. We have covered the pipeline in more detail in other talks, so I will just provide a quick overview. The rendering pipeline has two distinct phases: First, Tiling, where all of the geometry will be processed. Second, Rendering, where all of the pixels will be processed. So let's recap both phases, starting with the Tiling Phase.
During the Tiling Phase, the GPU will, for the entire render pass, split the viewport into a list of tiles, shade all of the vertices, and bin the transformed primitives into tiles.
Then, during the Rendering Phase, the GPU is going to shade all of these tiles separately. Each GPU core will shade at least one tile at a time. For each tile in the render pass, the GPU will execute the load action, rasterize and compute the visibility for all of the primitives, shade all of the visible pixels, and then execute the store action.
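As a rough illustration of the binning step in the Tiling Phase, here is a minimal CPU-side sketch. It is not how the hardware is implemented: the 32-pixel tile size, the `bin_triangles` name, and the bounding-box test are all assumptions chosen for clarity.

```python
from collections import defaultdict

# Assumed tile edge in pixels; real tile dimensions depend on the GPU and the render pass.
TILE_SIZE = 32


def bin_triangles(triangles, width, height, tile_size=TILE_SIZE):
    """Map each tile (tx, ty) to the indices of the triangles that may touch it.

    Each triangle is a list of three (x, y) screen-space vertices, i.e. the
    output of vertex shading. Binning uses the bounding box, so it is
    conservative: a tile may list a triangle that does not actually cover it.
    """
    tiles_x = (width + tile_size - 1) // tile_size
    tiles_y = (height + tile_size - 1) // tile_size
    bins = defaultdict(list)
    for index, triangle in enumerate(triangles):
        xs = [x for x, _ in triangle]
        ys = [y for _, y in triangle]
        # Triangles entirely outside the viewport land in no tile.
        if max(xs) < 0 or max(ys) < 0 or min(xs) >= width or min(ys) >= height:
            continue
        # Convert the bounding box to tile indices and clamp it to the tile grid.
        tx0 = max(int(min(xs)) // tile_size, 0)
        ty0 = max(int(min(ys)) // tile_size, 0)
        tx1 = min(int(max(xs)) // tile_size, tiles_x - 1)
        ty1 = min(int(max(ys)) // tile_size, tiles_y - 1)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)


triangles = [
    [(0, 0), (10, 0), (0, 10)],      # fits inside tile (0, 0)
    [(20, 20), (40, 20), (20, 40)],  # straddles four tiles
]
bins = bin_triangles(triangles, width=64, height=64)
print(bins[(0, 0)])  # [0, 1]
print(bins[(1, 1)])  # [1]
```

Once every primitive has been binned, each tile's list is all that the Rendering Phase needs in order to process that tile independently of the others.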
This is how our design can scale so well. The more GPU cores we have, the more tiles we can shade at the same time. Before concluding this overview, let's have a closer look at the GPU configuration.
Apple GPUs have multiple cores. A GPU core contains a Shader Core, a Texture Unit, and a Pixel Backend, as well as a dedicated pool of Tile Memory. Notice that Tile Memory is just part of the hierarchy. Both the ALU in the Shader Core and the TPU, the Texture Unit, have dedicated L1s. All of the GPU cores share a last level cache. And then, of course, there's System Memory, which is basically DRAM.
This talk will assume some familiarity with the Apple TBDR architecture as well as the Metal Best Practices. Check out these two talks to brush up on both topics. I would actually recommend starting with "Harness Apple GPUs with Metal" and then looking at the Best Practices.
So, let's build up some context around GPU profiling first. In order to render a frame, the GPU needs to process multiple render passes. Each render pass will be executed across multiple GPU cores. And each GPU core will, in turn, process different tasks, such as shading or texturing.
All of those tasks will be executed on different hardware units, such as the ALU or the TPU. And of course, every single one of these units has a different throughput, which is measured with different metrics. For example, we will use FLOPS to measure the ALU throughput or megabytes per second to measure the TPU throughput. So, there are multiple metrics to look at. What metrics should we look at then?
Well, enter GPU performance counters. GPU performance counters will measure how the GPU is being utilized. They will help us find out if the GPU doesn't have enough work, or if the GPU has too much work. They will help us identify performance bottlenecks, and also help us optimize the commands that take the longest.
Cool, so let's review the GPU performance counters for our Apple GPUs. Well, that's actually quite a list. There are over 150 GPU counters to look at. Maybe at this point, there's just far too much data to parse.
So how can we make sense of all those numbers? The answer is tooling. Our GPU tools will help you navigate all that data, starting with Metal System Trace, which is part of Instruments. You will want to use Metal System Trace for performance overview. You will see both the CPU and the GPU timelines. Your workload will be affected by thermals and dynamic system changes.
Metal System Trace is already part of the Game Performance template in Instruments. You can also enable GPU performance counters which can be used to identify potential GPU or memory bottlenecks at different points during the frame. Of course, there's also the Metal Debugger which is part of Xcode.
You will want to use this tool for a deep performance investigation. You will see both a detailed GPU timeline as well as the Metal API usage of your game. And your workload will be unaffected by thermals or dynamic system changes. Xcode also supports GPU performance counters and exposes every single one of them at encoder granularity.
There's also a large subset of counters available per draw call.
Xcode is where all of the counters are listed, so it's definitely the right tool to correlate metrics. So what exactly do those values mean? By now you know that there are a ton of counters, and that the tools will help you focus on the important ones.
The rest of the talk will walk you through different groups of counters and explain them in more detail.
We will start with performance limiters, arguably the GPU counters you should always look at first. Limiters are very important due to the parallel nature of GPUs.
The GPU can execute a ton of work in parallel: arithmetic, memory accesses, as well as rasterization tasks. The limiter counters will measure the activity of multiple GPU subsystems. They will help you find work being executed, as well as find stalls that prevent work from being executed. Remember, the GPU is only as fast as the slowest part. Limiters will point you to that part for you to investigate. Time for a demo. Please welcome Sam for a cool demo of Metal System Trace.
Sam: Thanks, Guillem. I've got my iPad Pro, and I'm playing Respawnables Heroes, a game by our friends over at Digital Legends. It looks great. It's got reflections, beautiful dynamic lighting with shadows, and many more post-processing effects.
But to get a sense of how well it's running, I'm going to show you how to record the performance limiters in Instruments. Let's switch back to my computer where I've already got Instruments open.
First, I'll select the Game Performance template. Then, I'll make sure that my device and the game are selected. I'm gonna long-press on the Record button and click on Recording Options.
Then, I'll switch to the Metal Application recording options and make sure that Performance Limiters is selected under the GPU Counter Set.
I'm also going to enable the new Shader Timeline, and you'll see why in a sec. But for now, let's click on the Record button.
Instruments is now recording the game, and when we're done, we can click Stop.
The Game Performance template gathers a lot of information about the state of the system, but for now, we're interested in the GPU.
So I'm going to disclose the A12Z track to see what was running.
I'm going to hold Option and left-click and drag to zoom into a frame.
We can now see a timeline of all of the command buffers and encoders that were running, color-coded by frame.
We can see that Respawnables Heroes first renders a shadow map. This is then followed by a Deferred Phase Encoder, which looks like a roughly 50-50 split between the vertex and fragment shaders, but the fragment shader takes a little bit longer. In this case, 1.29 milliseconds. After this is a bunch of post-processing effects.
Now, I'm going to take a close look at the Deferred Phase Encoder because it's taking the longest time. So I can disclose the fragment track to see the new Shader Timeline, which shows me which shaders are running at certain sample times during the execution of my command encoder.
This fine-grained detail makes it really easy to see and identify longer-running shaders, and helps to explain why a given encoder is taking a certain amount of time. If I select the track and a region, I can actually see which shaders were running in the table below, along with how many samples they were running for and an approximate GPU time.
We can also see the performance limiters in Instruments.
So, the first track is the top performance limiter track. Now, if I scrub my mouse over this track, we can see that during the deferred phase, the ALU Limiter is the highest. And during the post-processing, it's the Texture Sampler.
Now, this makes a lot of sense. But don't worry if you don't know what they mean. Guillem will later explain each limiter and what to do if you see a high value.
Below the Top Performance Limiter tracks are the individual limiters themselves, such as ALU, Texture Sampler, and many, many more.
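The idea behind the top performance limiter track can be sketched as picking the highest individual limiter at each sample. This is only a guess at the concept, not how Instruments computes the track. The `top_limiter` helper and all values below are made up for illustration, not numbers from the recorded trace.

```python
def top_limiter(sample):
    """Return the (name, percent) pair of the highest limiter in one sample."""
    name = max(sample, key=sample.get)
    return name, sample[name]


# Hypothetical limiter values in percent, one sample per phase of the frame.
samples = {
    "deferred phase": {"ALU": 72.0, "Texture Sampler": 35.0, "Buffer Read": 18.0},
    "post-processing": {"ALU": 30.0, "Texture Sampler": 64.0, "Buffer Read": 9.0},
}

for phase, sample in samples.items():
    name, value = top_limiter(sample)
    print(f"{phase}: {name} ({value:.0f}%)")
# deferred phase: ALU (72%)
# post-processing: Texture Sampler (64%)
```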
Now, back to Guillem.
Guillem Vinals Gangolells: Excellent. Thank you, Sam. Now we know where to find the GPU performance limiters. So, let's focus on some of them.
We will talk about Arithmetic, Texture Read and Write, Tile Memory Load and Store, Buffer Read and Write, GPU Last Level Cache, and Fragment Input Interpolation limiters. As we go through the list, we will also be putting them in the context of the Apple GPU. Also, I will show you how to find them in Xcode, starting with the ALU limiter.
Before looking at the limiter, we will build some context first. The ALU is part of the shader core. It processes arithmetic, bit-wise, and relational operations. It is optimized for both floating-point arithmetic and coherent execution. So, let's review the relative throughput of the different operations first. At the top, we can see 16-bit floating point operations, which are run at double rate. Then, we also have 32-bit floating point operations, which are run at full rate.
Finally, we also have 32-bit integer and complex operations, which are run at half rate or less.
For example, we should prefer F16 over F32 when possible.
Also, watch out for complex operations. The best case is shown here. Some complex operations, such as a square root, will actually have a lower rate. Great. So let's talk about the execution model of our shader core.
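To see what those relative rates imply, here is a back-of-the-envelope sketch that converts an instruction mix into "full-rate FP32 equivalents". The rates are the best-case figures quoted in the talk. The `relative_alu_cost` helper, the category names, and the instruction counts are assumptions, and a real shader's cost depends on much more than its ALU mix.

```python
# Relative issue rates from the talk (FP32 = 1.0). The half rate for 32-bit
# integer and complex operations is the best case; some are slower.
RELATIVE_RATE = {"f16": 2.0, "f32": 1.0, "i32": 0.5, "complex": 0.5}


def relative_alu_cost(op_counts):
    """Estimate ALU cost in units of one full-rate FP32 operation."""
    return sum(count / RELATIVE_RATE[op] for op, count in op_counts.items())


print(relative_alu_cost({"f32": 100}))                 # 100.0
print(relative_alu_cost({"f16": 100}))                 # 50.0
print(relative_alu_cost({"f32": 80, "complex": 10}))   # 100.0
```

In this model, moving 100 operations from FP32 to F16 halves the estimated cost, while only 10 complex operations cost as much as 20 FP32 operations.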
Each shader core has multiple SIMD units, as well as dedicated Tile Memory and a pool of Register Memory. Each SIMD unit has 32 threads, and each thread in the SIMD executes the same instruction. This is very important when it comes to authoring shaders.
Each SIMD has 32 threads but a single program counter. This is ideal when all of the threads execute the same instruction.
In this case, the condition "a" is equal for all of the threads. That's what we call coherent execution. All of the threads will execute the same instruction. And the total time to execute this program will be 40 cycles. There is no penalty for the "if" branch other than the extra temporary registers required and not utilized.
In this case, we have divergent execution. Some of the threads will evaluate "a" to "true." The whole SIMD has to execute all of the instructions. The threads that don't take the branch will mask out the execution, but still spend the cycles. In this case, the total cost will be 70 cycles. Notice that we have the extra 30 cycles from the "if" branch.
One last note on the execution model. There are some cases where only a few threads of a SIMD will actually need to run. This will, of course, have an impact on performance, since most of the threads are wasting cycles.
Okay, so with that in mind, we can now look at the limiter. So what can we do if we are actually limited by the ALU? Well, in most cases, we may want to celebrate. That's exactly what we want: the GPU is crunching numbers, and that's exactly what the GPU is for. But what if we actually want to reduce the ALU load? In that case, we will want to replace complex calculations with either approximations or lookup tables. Also, try to replace full-precision floats with half-precision ones. Try to avoid implicit conversions. Avoid FP32 inputs such as textures or buffers. And also make sure that all of the shaders are compiled using the Metal "-ffast-math" flag.
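The coherent and divergent cases from the execution model discussion can be reproduced with a tiny cost model. The 40-cycle and 30-cycle figures come from the talk's example. The program shape (common work plus one `if (a)` block) and the helper names are assumptions, since the slide with the shader code is not part of this transcript.

```python
SIMD_WIDTH = 32  # threads per SIMD, as stated in the talk


def simd_cycles(takes_branch, base_cycles=40, branch_cycles=30):
    """Cycles for one SIMD running `common work; if (a) { branch work }`.

    With a single program counter, the branch body is issued for the whole
    SIMD as soon as any thread takes it.
    """
    if len(takes_branch) != SIMD_WIDTH:
        raise ValueError("expected one flag per thread in the SIMD")
    return base_cycles + (branch_cycles if any(takes_branch) else 0)


def wasted_thread_cycles(takes_branch, branch_cycles=30):
    """Thread-cycles spent masked out while other threads run the branch."""
    taken = sum(1 for flag in takes_branch if flag)
    if taken == 0:
        return 0  # nobody takes the branch, so it is skipped entirely
    return (len(takes_branch) - taken) * branch_cycles


coherent = [False] * SIMD_WIDTH                  # "a" is false for every thread
divergent = [True] + [False] * (SIMD_WIDTH - 1)  # a single thread takes the branch

print(simd_cycles(coherent))            # 40
print(simd_cycles(divergent))           # 70
print(wasted_thread_cycles(divergent))  # 930
```

In this model a single divergent thread is enough to make the whole SIMD pay for the branch, which is why the last note about only a few threads doing useful work matters.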
