Matt Rajca

Speeding Up TensorFlow with Metal Performance Shaders

November 26, 2016

In Getting Started with Deep MNIST and TensorFlow on iOS, I walked through the process of getting TensorFlow set up such that we can perform inference on a pre-trained model directly on an iOS device. Even though we were able to get an iPad Pro to classify 5,000 images of dimensions 28x28 in a little over 5 seconds, we can do even better by leveraging the compute power modern GPUs provide on data-parallel tasks. While TensorFlow offers GPU support for CUDA- and OpenCL-enabled devices, iOS supports neither, so in this article, we'll implement the inference pipeline ourselves with Metal.


Our CPUs are good at performing long-running tasks such as compiling code or rendering audio on a handful of cores (2-4 on today's iOS devices). This is known as task parallelism. Meanwhile, our GPUs are optimized for running hundreds or thousands of short operations simultaneously – think applying a transformation matrix to thousands of vertices in a 3D game. This is known as data parallelism. While the exact architectural details of Apple's AX chips are not available, we know the A9X chip found in the iPad Pro has 12 GPU cores, and each of those likely contains several processing elements.

Since performing inference with deep networks involves repeating lots of short calculations over millions of elements in a data-parallel fashion, we can get a noticeable speed-up by moving this work to the GPU. More concretely, the convolutions and matrix multiplies that are typically performed as data is propagated through a deep network can be parallelized at the data level. For example, a matrix multiply can be thought of as N * N dot products of length N, each of which can be computed independently. Similarly, in a convolution, a small filter (of size 5x5 in the Deep MNIST example) is multiplied by a region of the same size surrounding each pixel in an image. Each of those operations can be performed independently as well.
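
To make the "independent dot products" point concrete, here is a minimal Python sketch (not part of the original project; the function names are made up for illustration). Every entry of the result is computed without looking at any other entry, which is exactly the property a GPU exploits:

```python
def dot(row, col):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(row, col))

def matmul(A, B):
    """Multiply two square matrices stored as lists of rows."""
    n = len(A)
    columns = list(zip(*B))  # columns of B
    # No (i, j) entry depends on any other entry, so these N * N dot
    # products could all be evaluated at the same time.
    return [[dot(A[i], columns[j]) for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(A, B)
print(C)  # [[19, 22], [43, 50]]
```

On a CPU the list comprehension still runs sequentially; the sketch only shows why the work can be split up.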


Since iOS does not support CUDA or OpenCL, we'll have to use Metal to perform work on the GPUs found in iOS devices. Prior to iOS 10, we'd have to implement the programs that run on the GPU (known as kernels) ourselves. While the Metal Shading Language is quite similar to OpenCL's variant of C, writing high-performance kernels is as much of an art as a science. For example, understanding GPU memory access patterns, taking advantage of local memory across workgroups, and avoiding operations that are expensive on GPUs (such as modulus) can involve weeks of work and a fair number of mathematical tricks. Moreover, ALUs on Apple's AX chips are only 16 bits wide, so if we implement a naïve 1:1 port of an OpenCL kernel that uses 32-bit floats (instead of 16-bit floats known as halfs), we'll see subpar performance.

To make this easier, Apple introduced support for deep network operations in the Metal Performance Shaders framework in iOS 10. This API is optimized to squeeze every drop of power out of the GPUs found in AX chips and saves us from having to write Metal kernels that perform convolutions, matrix multiplications, and more.

Getting Started

We'll build on top of the Getting Started with Deep MNIST and TensorFlow on iOS article and move inference to the GPU with Metal. To get started, we'll link the target with the MetalPerformanceShaders framework and import its umbrella header file:

    #import <MetalPerformanceShaders/MetalPerformanceShaders.h>

One of the things we did in our TensorFlow implementation was load a graph of our deep network. This is no longer necessary as we'll hardcode the structure of our network with Metal APIs (plus, Metal doesn't know anything about TensorFlow graphs anyway).

Metal also doesn't know anything about our exported "checkpoint" file containing our learned parameters. Instead of writing a parser for it, we can simply modify our training script to export each of our 8 variables (4 weight tensors + 4 bias vectors) in a binary format. The resulting file will simply be a list of floating-point (IEEE 754) values stored in C (row-major) order. In other words, if we have a 2D matrix of dimensions 4x4, the resulting file will be 64 bytes in size since we have 16 floats that are each 4 bytes in size; further, the data will be laid out row-wise.
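
As a quick sanity check of that arithmetic, here is a small sketch using only Python's struct module. The matrix values are placeholders, and little-endian byte order is an assumption (it matches both the iOS devices and typical training machines):

```python
import struct

# A hypothetical 4x4 matrix; the values are placeholders.
matrix = [[float(r * 4 + c) for c in range(4)] for r in range(4)]

# Flatten row by row (C order) and pack as little-endian 32-bit floats.
flat = [value for row in matrix for value in row]
blob = struct.pack('<%df' % len(flat), *flat)
print(len(blob))  # 64

# Reading the values back gives the rows one after another.
restored = struct.unpack('<16f', blob)
print(restored[4:8] == tuple(matrix[1]))  # True
```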

One thing we have to watch out for is the order in which Metal expects learned parameters: [outputChannels][{source/kernel}Height][{source/kernel}Width][inputChannels].

This differs from TensorFlow, which stores things in [{source/kernel}Height][{source/kernel}Width][inputChannels][outputChannels] order.

To re-order the dimensions of a matrix before exporting it, we can use the tf.transpose function. Here is how we export W_conv1 at the end of our training script:

    # sess is assumed to be the tf.Session used for training.
    with open('W_conv1', 'wb') as f:
      W_conv1_p = tf.transpose(W_conv1, perm=[3, 0, 1, 2])
      f.write(sess.run(W_conv1_p).tobytes())

We permute the tensor such that the output channels come first. Then, we export it as a binary tensor of floats.
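
To see what the permutation does to the underlying bytes, here is a pure-Python sketch of the same re-ordering on a flat list. The helper name and the tiny example tensor are hypothetical; tf.transpose does the equivalent work for us in the training script:

```python
def tf_to_metal_order(flat, height, width, in_ch, out_ch):
    """Reorder a flat [H][W][I][O] weight list into [O][H][W][I] order."""
    reordered = []
    for o in range(out_ch):
        for h in range(height):
            for w in range(width):
                for i in range(in_ch):
                    # Position of element (h, w, i, o) in TensorFlow's layout.
                    src = ((h * width + w) * in_ch + i) * out_ch + o
                    reordered.append(flat[src])
    return reordered

# A tiny 1x2 kernel with 1 input channel and 3 output channels.
# In TensorFlow order, the 3 output channels of each pixel sit together.
tf_weights = [10, 11, 12, 20, 21, 22]
metal_weights = tf_to_metal_order(tf_weights, 1, 2, 1, 3)
print(metal_weights)  # [10, 20, 11, 21, 12, 22]
```

After the re-ordering, all the weights belonging to one output channel are contiguous.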

The rest of the data is exported in a similar fashion:

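    # sess is assumed to be the training tf.Session; biases are 1D, so no transpose is needed.
    with open('b_conv1', 'wb') as f:
      f.write(sess.run(b_conv1).tobytes())
    with open('W_conv2', 'wb') as f:
      W_conv2_p = tf.transpose(W_conv2, perm=[3, 0, 1, 2])
      f.write(sess.run(W_conv2_p).tobytes())
    with open('b_conv2', 'wb') as f:
      f.write(sess.run(b_conv2).tobytes())
    with open('W_fc1', 'wb') as f:
      W_fc1_shp = tf.reshape(W_fc1, [7, 7, 64, 1024])
      W_fc1_p = tf.transpose(W_fc1_shp, perm=[3, 0, 1, 2])
      f.write(sess.run(W_fc1_p).tobytes())
    with open('b_fc1', 'wb') as f:
      f.write(sess.run(b_fc1).tobytes())
    with open('W_fc2', 'wb') as f:
      W_fc2_shp = tf.reshape(W_fc2, [1, 1, 1024, 10])
      W_fc2_p = tf.transpose(W_fc2_shp, perm=[3, 0, 1, 2])
      f.write(sess.run(W_fc2_p).tobytes())
    with open('b_fc2', 'wb') as f:
      f.write(sess.run(b_fc2).tobytes())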

Note that we don't have to re-order the bias variables as they're 1D vectors. Moreover, the original code flattens W_fc1 and W_fc2 into 2D matrices to perform matrix multiplies. To re-order their dimensions in 4D space, we have to reshape them back into 4D tensors prior to permuting the dimensions.
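
The reshape works because each row of the flattened W_fc1 corresponds to exactly one (row, column, channel) position of the 7x7x64 pooled image. Here is a minimal sketch of that correspondence, assuming the row-major flattening that tf.reshape uses (the helper names are made up):

```python
SIDE, CHANNELS = 7, 64

def row_to_hwc(row):
    """Map a row index of the flattened W_fc1 to (height, width, channel)."""
    hw, c = divmod(row, CHANNELS)
    h, w = divmod(hw, SIDE)
    return h, w, c

def hwc_to_row(h, w, c):
    """Inverse mapping: (height, width, channel) back to the flat row index."""
    return (h * SIDE + w) * CHANNELS + c

print(row_to_hwc(0))     # (0, 0, 0)
print(row_to_hwc(64))    # (0, 1, 0)
print(row_to_hwc(3135))  # (6, 6, 63)
```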

Now we can re-run the training script to export our 8 variables as binary tensors.

Loading the Model

Next, we'll drag the 8 binary files into our Xcode project (be sure to include them as bundle resources as well).

We'll also define a helper function for loading them. Since the data is already in the correct format, this is pretty straightforward:

    static float *loadTensor(NSString *baseName, NSUInteger length) {
      NSString *path = [[NSBundle mainBundle] pathForResource:baseName ofType:nil];
      NSData *data = [NSData dataWithContentsOfFile:path];
      float *tensor = new float[length];
      // Copy the raw 32-bit floats out of the file one element at a time.
      for (NSUInteger i = 0; i < length; i++) {
        [data getBytes:&tensor[i] range:NSMakeRange(i * sizeof(float), sizeof(float))];
      }

      return tensor;
    }

Now, we'll define a new -testGPU: method for running inference on the GPU. We'll start by loading our weights and biases using the loadTensor function we just defined:

    float *conv1weights = loadTensor(@"W_conv1", 5 * 5 * 1 * 32);
    float *conv1biases = loadTensor(@"b_conv1", 32);
    float *conv2weights = loadTensor(@"W_conv2", 5 * 5 * 32 * 64);
    float *conv2biases = loadTensor(@"b_conv2", 64);
    float *fc1weights = loadTensor(@"W_fc1", 7 * 7 * 64 * 1024);
    float *fc1biases = loadTensor(@"b_fc1", 1024);
    float *fc2weights = loadTensor(@"W_fc2", 1024 * 10);
    float *fc2biases = loadTensor(@"b_fc2", 10);

We'll also load the test images we'll be working with, as well as their labels:

    NSString *imagesPath = [[NSBundle mainBundle] pathForResource:@"images" ofType:nil];
    NSString *labelsPath = [[NSBundle mainBundle] pathForResource:@"labels" ofType:nil];
    NSData *imageData = [NSData dataWithContentsOfFile:imagesPath];
    NSData *labelsData = [NSData dataWithContentsOfFile:labelsPath];

    uint8_t *expectedLabels = new uint8_t[kUsedExamples];

    float *x = new float[kUsedExamples * kInputLength];
    size_t xIndex = 0;

    for (auto exampleIndex = 0; exampleIndex < kUsedExamples; exampleIndex++) {
      // Actual labels start at offset 8.
      [labelsData getBytes:&expectedLabels[exampleIndex] range:NSMakeRange(8 + exampleIndex, 1)];

      for (auto i = 0; i < kInputLength; i++) {
        uint8_t pixel;
        // Actual image data starts at offset 16.
        [imageData getBytes:&pixel range:NSMakeRange(16 + exampleIndex * kInputLength + i, 1)];
        // Scale each 8-bit pixel to the [0, 1] range.
        x[xIndex++] = pixel / 255.0f;
      }
    }

This is nearly identical to the code in the original article, so I won't explain it again.
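
For readers who skipped the original article, the offsets come from the headers of the MNIST files: the images file starts with four big-endian 32-bit integers (16 bytes) and the labels file with two (8 bytes). Here is a rough Python equivalent of the loop above, run on synthetic data rather than the real files (the function names are made up):

```python
import struct

def parse_images(data, count):
    """Skip the 16-byte header and scale each pixel byte to [0, 1]."""
    _magic, _total, rows, cols = struct.unpack('>IIII', data[:16])
    pixels_per_image = rows * cols
    images = []
    for n in range(count):
        start = 16 + n * pixels_per_image
        images.append([b / 255.0 for b in data[start:start + pixels_per_image]])
    return images

def parse_labels(data, count):
    """Skip the 8-byte header; each label is a single byte."""
    return list(data[8:8 + count])

# Synthetic stand-ins for the real files: two 2x2 images and two labels.
fake_images = struct.pack('>IIII', 2051, 2, 2, 2) + bytes([0, 255, 51, 102, 255, 0, 0, 0])
fake_labels = struct.pack('>II', 2049, 2) + bytes([7, 3])

images = parse_images(fake_images, 2)
labels = parse_labels(fake_labels, 2)
print(images[0], labels)  # [0.0, 1.0, 0.2, 0.4] [7, 3]
```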

Metal Pipeline

Now, we're ready to create the Metal pipeline, starting with the Metal device and the command queue we'll use to submit work to it:

    id<MTLDevice> device = MTLCreateSystemDefaultDevice();
    if (device == nil) {
      NSLog(@"No Metal device");
      return;  // Nothing to do without a GPU.
    }

    id<MTLCommandQueue> queue = [device newCommandQueue];

If you've worked with OpenCL before, you'll find a lot of the terminology familiar.

Convolutional Layers

Next, we'll have to set up the structure of our deep network in code. If we look back at our training script, we'll see that it starts with a convolutional layer that uses a 5x5 filter, unit stride, zero padding, 1 input channel, 32 output channels, and a ReLU activation function:

    def conv2d(x, W):
      return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
    W_conv1 = weight_variable([5, 5, 1, 32])
    h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)

To translate this to Metal, we set up an MPSCNNConvolution:

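    const MPSCNNNeuronReLU *reluUnit = [[MPSCNNNeuronReLU alloc] initWithDevice:device a:0];
    // 5x5 kernel, 1 input channel, 32 output channels, ReLU activation.
    MPSCNNConvolutionDescriptor *conv1descriptor = [MPSCNNConvolutionDescriptor cnnConvolutionDescriptorWithKernelWidth:5
        kernelHeight:5 inputFeatureChannels:1 outputFeatureChannels:32 neuronFilter:reluUnit];
    MPSCNNConvolution *conv1layer = [[MPSCNNConvolution alloc] initWithDevice:device
        convolutionDescriptor:conv1descriptor kernelWeights:conv1weights biasTerms:conv1biases flags:MPSCNNConvolutionFlagsNone];
    MPSImageDescriptor *conv1outdescriptor = [MPSImageDescriptor imageDescriptorWithChannelFormat:MPSImageFeatureChannelFormatFloat16
        width:kImageSide height:kImageSide featureChannels:32];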

We directly pass in the kernel width/height and the number of input/output channels; we also specify the ReLU activation function. The default stride is already {1,1} and the default padding is already zero, so we don't have to explicitly set those. If you'd like to change them, see the edgeMode and strideInPixels{X/Y} properties on MPSCNNConvolution. conv1outdescriptor simply describes the format and dimensions of the output matrix: [28][28][32]. This is also the size of h_conv1 in the training script. Note that we use float16s for storage and computation as the AX GPUs have 16-bit ALUs. Metal will convert our weights and biases from 32-bit floats automatically.

Next, we set up our max pooling layer. In the training script, this looks as follows:

    def max_pool_2x2(x):
      return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                            strides=[1, 2, 2, 1], padding='SAME')
    h_pool1 = max_pool_2x2(h_conv1)

Now let's translate it to Metal:

    MPSCNNPoolingMax *pool1layer = [[MPSCNNPoolingMax alloc] initWithDevice:device
        kernelWidth:2 kernelHeight:2 strideInPixelsX:2 strideInPixelsY:2];
    pool1layer.offset = (MPSOffset) { 1, 1, 0 };
    pool1layer.edgeMode = MPSImageEdgeModeClamp;
    MPSImageDescriptor *pool1outdescriptor = [MPSImageDescriptor imageDescriptorWithChannelFormat:MPSImageFeatureChannelFormatFloat16
        width:kImageSide / 2 height:kImageSide / 2 featureChannels:32];

The kernel size of {2,2} and stride of {2,2} are directly specified in MPSCNNPoolingMax's initializer. Since the kernel is positioned around its center, we set its starting offset to {1, 1} so that it doesn't run off the top and left edges of the image. In other words, we want to start pooling at {0,0}, not {-1,-1}.
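
One way to picture this is to list the pixels each pooling window covers along a single axis. The sketch below assumes that a window centred at pixel c covers c - kernel // 2 through c - kernel // 2 + kernel - 1 (so a 2-wide window covers c - 1 and c). That convention is an assumption on my part, chosen to be consistent with the behaviour described above, and the function name is made up:

```python
def pooling_windows(image_side, kernel=2, stride=2, offset=0):
    """Return (first, last) pixel indices covered by each window on one axis.

    Assumption: a window centred at c covers c - kernel // 2 up to
    c - kernel // 2 + kernel - 1, so a 2-wide window covers [c - 1, c].
    """
    windows = []
    for k in range(image_side // stride):
        centre = offset + k * stride
        first = centre - kernel // 2
        windows.append((first, first + kernel - 1))
    return windows

default_windows = pooling_windows(4, offset=0)
shifted_windows = pooling_windows(4, offset=1)
print(default_windows)  # [(-1, 0), (1, 2)]  starts off the edge
print(shifted_windows)  # [(0, 1), (2, 3)]   tiles the image exactly
```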

Since we're dealing with even image sizes, the edge mode doesn't actually matter here: our 2x2 kernel will never run off the edges of the image. However, suppose we did run off the right edge of the image and the two values inside the image were negative. With zero padding, the other two values are 0:

    -0.412641 | 0
    -2.104933 | 0

If we take the max of this 2x2 region, we get 0, but that's not really correct since the zeros lie outside of the image. For this reason, it's better to use clamped padding, which will repeat the values closest to the missing ones:

    -0.412641 | -0.412641
    -2.104933 | -2.104933

Now if we take the max of this 2x2 region, it's -0.412641, which is more correct.
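
The same comparison can be written as a few lines of Python. This is only an illustration of the two edge modes using the numbers above, not how Metal implements them, and the function name is made up:

```python
def max_pool_at_edge(column, edge_mode):
    """Max over a 2x2 window whose right column lies outside the image."""
    if edge_mode == 'zero':
        outside = [0.0 for _ in column]   # missing pixels read as 0
    elif edge_mode == 'clamp':
        outside = list(column)            # missing pixels repeat the nearest ones
    else:
        raise ValueError('unknown edge mode: %s' % edge_mode)
    return max(column + outside)

edge_column = [-0.412641, -2.104933]
zero_result = max_pool_at_edge(edge_column, 'zero')
clamp_result = max_pool_at_edge(edge_column, 'clamp')
print(zero_result, clamp_result)  # 0.0 -0.412641
```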

That's it for our first convolutional layer with max pooling. The second one is fairly similar; the only things that change are the dimensions (the 2x2 max pool halves the size of the image in both dimensions). To make this easier to follow, I defined two new constants:

    static constexpr int kImageSide2 = kImageSide / 2;
    static constexpr int kImageSide4 = kImageSide / 4;

The 32 output channels from the first convolution are now passed in as input channels to the second convolution.

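    MPSCNNConvolutionDescriptor *conv2descriptor = [MPSCNNConvolutionDescriptor cnnConvolutionDescriptorWithKernelWidth:5
        kernelHeight:5 inputFeatureChannels:32 outputFeatureChannels:64 neuronFilter:reluUnit];
    MPSCNNConvolution *conv2layer = [[MPSCNNConvolution alloc] initWithDevice:device
        convolutionDescriptor:conv2descriptor kernelWeights:conv2weights biasTerms:conv2biases flags:MPSCNNConvolutionFlagsNone];
    MPSImageDescriptor *conv2outdescriptor = [MPSImageDescriptor imageDescriptorWithChannelFormat:MPSImageFeatureChannelFormatFloat16
        width:kImageSide2 height:kImageSide2 featureChannels:64];
    MPSCNNPoolingMax *pool2layer = [[MPSCNNPoolingMax alloc] initWithDevice:device
        kernelWidth:2 kernelHeight:2 strideInPixelsX:2 strideInPixelsY:2];
    pool2layer.offset = (MPSOffset) { 1, 1, 0 };
    pool2layer.edgeMode = MPSImageEdgeModeClamp;
    MPSImageDescriptor *pool2outdescriptor = [MPSImageDescriptor imageDescriptorWithChannelFormat:MPSImageFeatureChannelFormatFloat16
        width:kImageSide4 height:kImageSide4 featureChannels:64];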

For reference, the training script sets this up as:

    h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
    h_pool2 = max_pool_2x2(h_conv2)

Now, we're ready to move on to our final two fully-connected layers.

Fully-Connected Layers

Our first fully-connected layer maps the output of our second max pooling operation to 1024 hidden units:

    h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
    h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

In Metal, the implementation is similar to that of our convolutional layers, but we now use MPSCNNFullyConnected instead:

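    // The kernel covers the whole 7x7x64 input, so the output is 1x1x1024.
    MPSCNNConvolutionDescriptor *fc1descriptor = [MPSCNNConvolutionDescriptor cnnConvolutionDescriptorWithKernelWidth:kImageSide4
        kernelHeight:kImageSide4 inputFeatureChannels:64 outputFeatureChannels:1024 neuronFilter:reluUnit];
    MPSCNNFullyConnected *fc1layer = [[MPSCNNFullyConnected alloc] initWithDevice:device
        convolutionDescriptor:fc1descriptor kernelWeights:fc1weights biasTerms:fc1biases flags:MPSCNNConvolutionFlagsNone];
    MPSImageDescriptor *fc1outdescriptor = [MPSImageDescriptor imageDescriptorWithChannelFormat:MPSImageFeatureChannelFormatFloat16
        width:1 height:1 featureChannels:1024];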

The output is now simply a 1024-unit vector.
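
If the dimensions are getting hard to track, the following sketch walks a 28x28 input through the layers defined so far. It only does shape arithmetic, and the helper names are made up. The weight count it reports for the fully-connected layer matches the length we passed to loadTensor for W_fc1:

```python
def conv_same(shape, out_channels):
    """'SAME' convolution with unit stride: spatial size is unchanged."""
    height, width, _ = shape
    return (height, width, out_channels)

def max_pool_2x2(shape):
    """2x2 max pool with stride 2: spatial size is halved (rounded up)."""
    height, width, channels = shape
    return ((height + 1) // 2, (width + 1) // 2, channels)

def fully_connected(shape, units):
    """A kernel covering the whole input leaves a single 1x1 position."""
    height, width, channels = shape
    weight_count = height * width * channels * units
    return (1, 1, units), weight_count

shape = (28, 28, 1)
shape = conv_same(shape, 32)      # (28, 28, 32)
shape = max_pool_2x2(shape)       # (14, 14, 32)
shape = conv_same(shape, 64)      # (14, 14, 64)
shape = max_pool_2x2(shape)       # (7, 7, 64)
pooled_shape = shape
fc1_shape, fc1_weight_count = fully_connected(shape, 1024)
print(pooled_shape, fc1_shape, fc1_weight_count)  # (7, 7, 64) (1, 1, 1024) 3211264
```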

Finally, we'll implement our second fully-connected layer that will map our 1024-unit vector to a 10-unit vector and take the softmax instead of using a ReLU activation function.

In the training script, this looked as follows:

    y_conv = tf.nn.softmax(tf.matmul(h_fc1, W_fc2) + b_fc2, name="softmax")

First, we'll port the second fully-connected layer:

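    MPSCNNConvolutionDescriptor *fc2descriptor = [MPSCNNConvolutionDescriptor cnnConvolutionDescriptorWithKernelWidth:1
        kernelHeight:1 inputFeatureChannels:1024 outputFeatureChannels:10 neuronFilter:nil];
    MPSCNNFullyConnected *fc2layer = [[MPSCNNFullyConnected alloc] initWithDevice:device
        convolutionDescriptor:fc2descriptor kernelWeights:fc2weights biasTerms:fc2biases flags:MPSCNNConvolutionFlagsNone];
    MPSImageDescriptor *fc2outdescriptor = [MPSImageDescriptor imageDescriptorWithChannelFormat:MPSImageFeatureChannelFormatFloat16
        width:1 height:1 featureChannels:10];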

This is fairly similar to the first fully-connected layer, but note that we pass in nil for neuronFilter. Instead, we'll set up a separate softmax layer:

    MPSImageDescriptor *softmaxOutput = [MPSImageDescriptor imageDescriptorWithChannelFormat:MPSImageFeatureChannelFormatFloat16
        width:1 height:1 featureChannels:10];
    MPSCNNSoftMax *softmaxLayer = [[MPSCNNSoftMax alloc] initWithDevice:device];
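
As a reminder of what the softmax layer computes, here is a minimal Python sketch. The logits are made-up values, and subtracting the maximum first is a common numerical-stability trick rather than something the article's code does explicitly:

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 2.0, 3.0]  # hypothetical outputs of the last layer
probabilities = softmax(logits)
print(probabilities)
```

The largest logit always ends up with the largest probability, which is why taking the argmax of the softmax output gives the predicted class.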

Before we finish up, let's also define an image descriptor for our input test images:

    MPSImageDescriptor *inputDescriptor = [MPSImageDescriptor imageDescriptorWithChannelFormat:MPSImageFeatureChannelFormatFloat32
        width:kImageSide height:kImageSide featureChannels:1];

Even though they're 32-bit floats, Metal will convert them to 16-bit halfs for us with a negligible loss in accuracy (the pixel values all lie between 0 and 1).
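
To get a feel for how little the conversion to halfs costs, the sketch below round-trips every possible normalized pixel value through IEEE 754 half precision using Python's struct module (the 'e' format). Pixels that are exactly 0 or 1 survive unchanged, and the intermediate grey levels are off by at most 2^-12. This only measures the rounding of the inputs, not its effect on the network's output:

```python
import struct

def to_half_and_back(value):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack('<e', struct.pack('<e', value))[0]

# Every value an 8-bit pixel can take after dividing by 255.
pixels = [level / 255.0 for level in range(256)]
errors = [abs(p - to_half_and_back(p)) for p in pixels]
max_error = max(errors)

print(to_half_and_back(0.0), to_half_and_back(1.0))  # 0.0 1.0
print(max_error <= 2 ** -12)  # True
```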

Now we're ready to iterate through 5,000 test examples and run each one through our pipeline.

Running the Pipeline

Before we do so, we'll create two arrays to store the command buffer and softmax buffer for each test image. Keeping references to the command buffers will let us track and wait on any work sent to the GPU, which runs asynchronously. The softmax buffers will simply store the class probability distribution for each test image.

    NSMutableArray<id<MTLCommandBuffer>> *pendingBuffers = [[NSMutableArray alloc] init];
    NSMutableArray<MPSImage *> *results = [[NSMutableArray alloc] init];

Now, we'll begin timing the code and run through each test image.

    const auto start = CACurrentMediaTime();
    for (size_t inputIndex = 0; inputIndex < kUsedExamples; inputIndex++) {
      id<MTLCommandBuffer> buffer = [queue commandBuffer];
      MPSImage *inputImage = [[MPSImage alloc] initWithDevice:device imageDescriptor:inputDescriptor];
      [inputImage.texture replaceRegion:MTLRegionMake2D(0, 0, kImageSide, kImageSide)
                            mipmapLevel:0
                              withBytes:x + inputIndex * kInputLength
                            bytesPerRow:sizeof(float) * kImageSide];

We create a new command buffer that will be used to encode GPU compute commands and create a new MPSImage object that will represent our current test image. We load in the image data from x with a call to -replaceRegion:....

Whereas MPSImages can be accessed from both the host and the GPU, MPSTemporaryImages can only be accessed from the GPU but are faster to work with. We'll use MPSImages for our input images and softmax outputs; we'll use MPSTemporaryImages for all of the tensors we allocate in-between.

The +[MPSTemporaryImage prefetchStorageWithCommandBuffer:...] method can pre-allocate temporary images for us and optimize them for re-use, so we'll call it now with all of our output image descriptors.

      [MPSTemporaryImage prefetchStorageWithCommandBuffer:buffer imageDescriptorList:@[conv1outdescriptor, pool1outdescriptor, conv2outdescriptor, pool2outdescriptor, fc1outdescriptor, fc2outdescriptor]];

Now, we'll simply allocate our temporary images and enqueue each layer onto our command buffer. This is fairly mechanical:

      MPSTemporaryImage *c1o = [MPSTemporaryImage temporaryImageWithCommandBuffer:buffer imageDescriptor:conv1outdescriptor];
      [conv1layer encodeToCommandBuffer:buffer sourceImage:inputImage destinationImage:c1o];
      MPSTemporaryImage *p1o = [MPSTemporaryImage temporaryImageWithCommandBuffer:buffer imageDescriptor:pool1outdescriptor];
      [pool1layer encodeToCommandBuffer:buffer sourceImage:c1o destinationImage:p1o];
      MPSTemporaryImage *c2o = [MPSTemporaryImage temporaryImageWithCommandBuffer:buffer imageDescriptor:conv2outdescriptor];
      [conv2layer encodeToCommandBuffer:buffer sourceImage:p1o destinationImage:c2o];
      MPSTemporaryImage *p2o = [MPSTemporaryImage temporaryImageWithCommandBuffer:buffer imageDescriptor:pool2outdescriptor];
      [pool2layer encodeToCommandBuffer:buffer sourceImage:c2o destinationImage:p2o];
      MPSTemporaryImage *fc1tdi = [MPSTemporaryImage temporaryImageWithCommandBuffer:buffer imageDescriptor:fc1outdescriptor];
      [fc1layer encodeToCommandBuffer:buffer sourceImage:p2o destinationImage:fc1tdi];
      MPSTemporaryImage *fc2tdi = [MPSTemporaryImage temporaryImageWithCommandBuffer:buffer imageDescriptor:fc2outdescriptor];
      [fc2layer encodeToCommandBuffer:buffer sourceImage:fc1tdi destinationImage:fc2tdi];

The final softmax buffer will be created as an MPSImage rather than an MPSTemporaryImage so we can access it from the host. We'll also add it to our array of results, and we'll add the command buffer to our array of buffers. Finally, we commit the buffer for execution. Note that results will not be available immediately as work happens asynchronously.

      MPSImage *resultsImage = [[MPSImage alloc] initWithDevice:device imageDescriptor:softmaxOutput];
      [softmaxLayer encodeToCommandBuffer:buffer sourceImage:fc2tdi destinationImage:resultsImage];
      [results addObject:resultsImage];
      [buffer commit];
      [pendingBuffers addObject:buffer];
    }

Once all the work is enqueued, we wait for it to finish and log the time it took:

    [pendingBuffers enumerateObjectsUsingBlock:^(id<MTLCommandBuffer> buffer, NSUInteger idx, BOOL *stop) {
        [buffer waitUntilCompleted];
    }];
    NSLog(@"Time: %g seconds", CACurrentMediaTime() - start);
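
The "submit everything, then wait" structure is not specific to Metal. As a loose analogy only, here is the same pattern in Python using a thread pool, where submitting a task plays the role of committing a command buffer and asking for its result plays the role of waiting for completion. The classify function is a made-up stand-in for the GPU work:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def classify(image_index):
    """Made-up stand-in for one command buffer's worth of work."""
    return image_index % 10

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    # Submit all of the work first (analogous to committing each buffer) ...
    pending = [executor.submit(classify, i) for i in range(100)]
    # ... and only then block on each result (analogous to waiting on each buffer).
    outputs = [future.result() for future in pending]
elapsed = time.perf_counter() - start
print('Time: %g seconds' % elapsed)
```

Submitting everything before waiting keeps the workers busy; waiting after each submission would serialize the work.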

On my iPad Pro, this took 3.29s, down from 5.4s (a 40% improvement). If you try running the project now, please note that Metal Performance Shaders are not available in the iOS Simulator.
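
For reference, here is the arithmetic behind those numbers, using only the two timings quoted above and the 5,000 test images:

```python
cpu_seconds = 5.4    # TensorFlow on the CPU (from the original article)
gpu_seconds = 3.29   # Metal Performance Shaders on the GPU
image_count = 5000

reduction = (cpu_seconds - gpu_seconds) / cpu_seconds  # fraction of time saved
speedup = cpu_seconds / gpu_seconds                    # how many times faster
ms_per_image = gpu_seconds / image_count * 1000        # GPU time per image

print('%.0f%% less time, %.2fx faster, %.3f ms per image'
      % (reduction * 100, speedup, ms_per_image))
```

This works out to roughly 39% less time, a 1.64x speed-up, and about 0.66 ms per image on the GPU.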

Finally, we can compute the accuracy on our test set:

    __block int correctExamples = 0;
    [pendingBuffers enumerateObjectsUsingBlock:^(id<MTLCommandBuffer> buffer, NSUInteger idx, BOOL *stop) {
      // Each texture slice holds 4 feature channels, so we read the output slice by slice.
      const size_t numSlices = (results[idx].featureChannels + 3)/4;
      float16_t halfs[numSlices * 4];
      for (size_t i = 0; i < numSlices; i += 1) {
        [results[idx].texture getBytes:&halfs[i * 4] bytesPerRow:8 bytesPerImage:8 fromRegion:MTLRegionMake3D(0, 0, 0, 1, 1, 1) mipmapLevel:0 slice:i];
      }

      // Named to avoid shadowing the outer results array.
      float probabilities[kOutputs];

      vImage_Buffer fullResultVImagebuf;
      fullResultVImagebuf.data = probabilities;
      fullResultVImagebuf.height = 1;
      fullResultVImagebuf.width = kOutputs;
      fullResultVImagebuf.rowBytes = kOutputs * 4;

      vImage_Buffer halfResultVImagebuf;
      halfResultVImagebuf.data = halfs;
      halfResultVImagebuf.height = 1;
      halfResultVImagebuf.width = kOutputs;
      halfResultVImagebuf.rowBytes = kOutputs * 2;

      // Convert the 16-bit halfs to 32-bit floats.
      vImageConvert_Planar16FtoPlanarF(&halfResultVImagebuf, &fullResultVImagebuf, 0);

      // Take the argmax of the class probabilities.
      int bestIndex = -1;
      float bestProbability = 0;
      for (auto i = 0; i < kOutputs; i++) {
        const auto probability = probabilities[i];
        if (probability > bestProbability) {
          bestProbability = probability;
          bestIndex = i;
        }
      }

      if (bestIndex == expectedLabels[idx]) {
        correctExamples++;
      }
    }];

    NSLog(@"Accuracy: %f", static_cast<float>(correctExamples) / kUsedExamples);

Unfortunately, getting the softmax values is a little complicated since Metal stores them in an "image" with a planar RGBA layout. Fortunately, we have the vImageConvert_Planar16FtoPlanarF function to help us. Once we have the softmax values, we can simply take their argmax and compare it to the expected class label, as we did in the original article. When I ran this, I obtained 98.6% accuracy.
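
To make the layout less mysterious, here is a Python sketch of the same read-back logic on synthetic data. It assumes what the code above relies on: each slice holds 4 half-precision values (8 bytes), so 10 output channels need 3 slices, with the last 2 values unused. The function names and the example distribution are made up:

```python
import struct

def unpack_softmax(slices, num_channels):
    """Decode 8-byte slices of 4 halfs each and drop the padding values."""
    values = []
    for blob in slices:
        values.extend(struct.unpack('<4e', blob))
    return values[:num_channels]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

num_channels = 10
num_slices = (num_channels + 3) // 4  # 3 slices for 10 channels

# A hypothetical softmax output that is certain the digit is a 7.
probabilities = [0.0] * num_channels
probabilities[7] = 1.0
padded = probabilities + [0.0] * (num_slices * 4 - num_channels)
slices = [struct.pack('<4e', *padded[i * 4:(i + 1) * 4]) for i in range(num_slices)]

decoded = unpack_softmax(slices, num_channels)
predicted = argmax(decoded)
print(num_slices, predicted)  # 3 7
```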


That's what it takes to implement inference ourselves for a deep network pre-trained with TensorFlow. Using Metal Performance Shaders yields performance that is around 40% better and reduces our dependency on TensorFlow. Because we no longer have to link in the TensorFlow and Protocol Buffers static libraries, the size of our binary drops from 40 MB to 160 KB.

The original project on GitHub has been updated with Metal support.