It’s been a long time coming, but Microsoft has finally delivered on its promise of a complete set of developer tools for Windows on Arm, from hardware to software, with support for Arm-specific platform features. It’s just in time, too, as the latest generation of Surface devices puts Microsoft’s Qualcomm-based SQ3 processor in the same hardware as Intel’s i7.
What’s perhaps most interesting about Microsoft’s Arm focus is its link to building machine learning–driven applications into Windows. The Surface launch showcased the SQ3’s neural processing unit (NPU) delivering real-time audio and visual processing, with pretrained models handling background blurring and removing distracting noises. Microsoft states that the SQ3 NPU can deliver 15 trillion operations a second. That translates to running a model 80 to 90 times faster than on a CPU and 20 times faster than on a GPU, so there’s plenty of scope for using it in your own code.
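To put those numbers in perspective, here is a rough back-of-envelope sketch in Python. It assumes the quoted speedups can be read as simple throughput ratios, which the vendor figures do not actually state, so treat the results as an illustration rather than a benchmark. The function name is made up for this example.

```python
# Back-of-envelope only: treats the quoted speedups as throughput ratios,
# which is an assumption of this sketch, not a vendor statement.
NPU_OPS_PER_SECOND = 15e12  # Microsoft's quoted figure for the SQ3 NPU


def implied_ops_per_second(npu_ops: float, speedup: float) -> float:
    """Throughput a slower device would have if the NPU is `speedup` times faster."""
    return npu_ops / speedup


cpu_low = implied_ops_per_second(NPU_OPS_PER_SECOND, 90)
cpu_high = implied_ops_per_second(NPU_OPS_PER_SECOND, 80)
gpu = implied_ops_per_second(NPU_OPS_PER_SECOND, 20)

print(f"CPU equivalent: {cpu_low / 1e12:.2f} to {cpu_high / 1e12:.2f} TOPS")
print(f"GPU equivalent: {gpu / 1e12:.2f} TOPS")
```

Read this way, the CPU would be sustaining well under a fifth of a trillion operations a second on the same model, which is why offloading matters for anything that has to run in real time.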
Accelerate AI with neural processing
NPUs are new to Windows, but these specialized accelerators are a key feature of the hyperscale cloud. An Arm system on a chip with its own NPUs is an important part of Microsoft CEO Satya Nadella’s “intelligent edge.” Microsoft has already delivered an example of this in its Azure Percept IoT appliances, which use Intel NPU hardware and NXP Arm processors. In this approach, machine learning models are built and trained in the cloud before being exported into standard formats and executed on local runtimes that leverage the NPU hardware.
That’s all very well for Microsoft’s own models, but how do we build our own? There are still questions to be answered, but much of the picture has finally come into focus.
First, we need to start with affordable developer hardware. The Arm-powered Surface Pro 9 and its Surface Pro X predecessors are lovely pieces of hardware, but they’re high-end personal devices and unlikely to be used as developer hardware. Microsoft is well aware of this, announcing its “Project Volterra” Arm-based developer hardware at Build earlier this year.
After a series of supply chain–related delays, the hardware is now shipping as Windows Dev Kit 2023, an affordable $599 desktop Arm PC based on the Snapdragon 8cx Gen 3 processor, with 32GB of RAM and 512GB of storage. You can think of it as an Arm-powered NUC, using a variant of the mobile hardware used in Arm-based Windows devices, though without the 5G connectivity.
It’s designed to be stackable, with the intent that developers can have more than one device on their desks so they can code, build, and test easily. That’s enough power and storage to run a full Visual Studio instance as well as any test harness. More complex tasks can be offloaded to the Ampere Altra servers running in Azure.
NPUs in Windows
Microsoft is presenting the NPU as the future of Windows, so it needs an NPU-centric development platform like these Arm-powered boxes. Qualcomm claims 29 TOPS for its AI Engine NPU, which it states supports “cloud-class models running on thin and light laptops.” Offloading these models from the CPU to the NPU should allow applications to remain responsive, leaving the eight Arm cores and the GPU free to consume and render NPU outputs. The sample applications Microsoft demonstrated at its recent Surface launch show this approach in action: The SQ3’s NPU manages complex audio and video tasks, with the results composited and passed through the existing camera application to tools such as Teams.
At the heart of Microsoft’s NPU support is the ONNX (Open Neural Network Exchange) portable neural network format. This allows you to take advantage of the compute capabilities of Azure’s Machine Learning platform to build and train models before exporting them to run locally on a Windows device through either the Windows ML or ML.NET APIs or directly using the Qualcomm Neural Processing SDK.
For now, Windows Arm devices will need to use the Qualcomm tool to access their NPUs. Windows ML and ML.NET support is likely if Microsoft and Qualcomm collaborate on a DirectCompute wrapper for the AI Engine APIs. Until then, it looks as though you have to build separate versions of your applications if you want to run ONNX models on both Qualcomm-based Arm devices and Intel hardware. As there’s already a Qualcomm-optimized version of the .NET ONNX Runtime library, this should be relatively easy for higher-level tools to implement.
Microsoft and Qualcomm provide a complete toolchain for building NPU applications on the Windows Dev Kit 2023 hardware, with an Arm build of Visual Studio 2022 available as a preview, along with an Arm-optimized release of the upcoming .NET 7. Alongside these, sign up for the closed Windows release of Qualcomm’s Neural Processing SDK for AI, often referred to by its old name: the Snapdragon Neural Processing Engine (SNPE). This comes with an SNPE runtime for ONNX models.
Build an NPU application on Windows Dev Kit hardware
Getting a model running on the 8cx AI Engine NPU isn’t entirely straightforward. It requires both the Linux and Windows versions of the Qualcomm NPU tool to create an optimized ONNX file from a pretrained model. As Windows 11 and WSL2 support Arm binaries, you can do this all on the Dev Kit system, first setting up an Ubuntu WSL environment and then installing Qualcomm’s tools and configuring your Linux environment to use them. You can, of course, use any Ubuntu system; for complex models, you may prefer Azure Arm instances to process models.
Start by installing Python 3 in your Linux system, then use pip to install the ONNX tools, following the instructions on GitHub. You can now install the SNPE tools, first unzipping the distribution file into a directory and then running SNPE’s dependency checker to ensure you have everything needed to run the SDK. Once all the prerequisites are in place, use its configuration script to set environment variables for use with ONNX.
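Before going further, it can help to confirm that the pieces are actually in place. The following is a minimal sketch of such a check, not part of Qualcomm’s tooling. It assumes the SDK’s configuration script exports an `SNPE_ROOT` environment variable and uses Python 3.6 as a placeholder minimum version; check both against the SDK documentation for your release. The `check_environment` helper is hypothetical.

```python
import importlib.util
import os
import sys


def check_environment(env=None, required_modules=("onnx",), min_python=(3, 6)):
    """Return a list of problems; an empty list means the basics look fine."""
    env = os.environ if env is None else env
    problems = []

    # min_python is a placeholder; use the version your SDK release asks for.
    if sys.version_info < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]} or later is needed")

    # find_spec returns None when a top-level package cannot be imported.
    for name in required_modules:
        if importlib.util.find_spec(name) is None:
            problems.append(f"Python package '{name}' not found; try: pip install {name}")

    # Assumption: the SDK's configuration script sets SNPE_ROOT.
    if not env.get("SNPE_ROOT"):
        problems.append("SNPE_ROOT is not set; run the SDK's configuration script first")

    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Run it inside your WSL shell after sourcing the configuration script. No output means the checks passed.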
You’re now ready to process an existing ONNX model for use with a Qualcomm NPU. Download your model and a sample file along with its label data. The example Microsoft uses is an image recognizer, so you’ll need the SNPE tools to preprocess the sample image before converting the ONNX model into SNPE’s internal DLC format. Once that process is complete, use SNPE to quantize the model before exporting an ONNX-wrapped DLC file ready for use in your code.
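If you’re wondering what image preprocessing involves, the sketch below shows the general idea in plain Python: scale each pixel, normalize it per channel, and pack the result as raw 32-bit floats. This is an illustration only and not Qualcomm’s preprocessing script. The mean and standard deviation values are the commonly used ImageNet constants, and the height, width, channel ordering is an assumption; your model may expect something different. The `pixels_to_raw` helper is hypothetical.

```python
import struct

# Commonly used ImageNet normalization constants. These are an assumption;
# check what your own model was trained with.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def pixels_to_raw(pixels, mean=MEAN, std=STD):
    """Convert rows of (R, G, B) 0-255 tuples to normalized float32 bytes.

    Values are written little-endian in height, width, channel order.
    """
    values = []
    for row in pixels:
        for rgb in row:
            for channel, m, s in zip(rgb, mean, std):
                values.append((channel / 255.0 - m) / s)
    return struct.pack(f"<{len(values)}f", *values)


if __name__ == "__main__":
    # A tiny 2x2 stand-in image; a real input would be resized to the
    # dimensions the model expects before this step.
    tiny = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (128, 128, 128)]]
    raw = pixels_to_raw(tiny)
    print(len(raw), "bytes")  # 2 * 2 pixels * 3 channels * 4 bytes each
```

In practice you would load and resize the image with an imaging library first, then write the returned bytes to a file for the conversion and quantization steps.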
Copy the ONNX file from WSL into Windows. You can now install the Microsoft.ML.OnnxRuntime.Snpe package from NuGet, ready to use in your applications. This is an optimized version of Microsoft’s existing ONNX Runtime tooling, so it should be relatively simple to add to existing code or build into a new app. If you need hints, the sample C# code in the example Windows SNPE repository will help you use the sample ONNX model in a basic console application.
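The samples are in C#, but the underlying pattern is the same in any ONNX Runtime binding: ask for the SNPE execution provider and fall back to the CPU when it isn’t there. Here is a hedged Python sketch of that pattern. It assumes an ONNX Runtime build that includes the SNPE provider under the name `SNPEExecutionProvider`; whether such a Python build is available for your device is an assumption to verify. The `choose_providers` and `create_session` helpers are hypothetical names for this example.

```python
def choose_providers(available,
                     preferred=("SNPEExecutionProvider", "CPUExecutionProvider")):
    """Keep the preferred providers that this runtime build reports, in order."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError("none of the preferred execution providers are available")
    return chosen


def create_session(model_path):
    """Open a model with the best available provider (needs onnxruntime installed)."""
    # Imported lazily so choose_providers can be used without the package.
    import onnxruntime as ort

    providers = choose_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)


if __name__ == "__main__":
    # On a build without the SNPE provider, this falls back to the CPU.
    print(choose_providers(["CPUExecutionProvider"]))
```

Keeping the selection logic separate from session creation means the same application code can run on both Qualcomm and Intel hardware, with only the provider list changing.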
There’s enough in the combination of Qualcomm’s machine learning tool and Microsoft’s Arm platform to get you building code for this first generation of Windows NPU hardware. Microsoft’s own NPU-powered video and audio features in Windows 11 should help inspire your own code, but there’s a lot more that can be done with hardware neural network accelerators. For example, you could use them to speed up image processing tools like those in Adobe’s creative applications, or use NPU-accelerated Arm hardware running Windows 11 IoT Enterprise at the edge of your network to preprocess data before delivering it to Azure IoT Hubs.
This is the early stage of a new direction for Windows, and while Microsoft has been using these tools internally for some time, they’re now available to us all—ready to take advantage of a new generation of Arm-based Windows systems on our desks, on the edge, and in the cloud.
Copyright © 2022 IDG Communications, Inc.
Originally posted on October 26, 2022 @ 10:30 am