Neural networks are used in a wide range of applications today. To help data scientists focus on their work on neural networks, several neural network frameworks (PyTorch, TensorFlow, Caffe, …) have emerged, each with its own strengths and weaknesses. What they have in common is that they model the network as a graph of layers, where each layer performs a specific computational operation on the data. These operations are implemented by mapping them onto existing BLAS or specialized neural network libraries, which are provided by hardware manufacturers and promise peak performance on the underlying hardware.
While this approach is convenient and benefits from optimized implementations of compute-intensive operations, it does not take the actual structure of the neural network or its data path into consideration. Especially in memory-bound layers, this causes constant cache thrashing and significant performance drops. To address this, the developers of these optimized libraries have started to create specialized implementations for sequences of layers. However, this only circumvents the actual problem. The structure of a neural network can be seen as a computer program. For C/C++ and other languages we have compilers that tune the code to run optimally, but no comparable technology exists for neural networks that takes the structure of the network into account and tries to map it optimally onto the computing hardware.
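To make the cost of memory-bound layers concrete, the following back-of-the-envelope sketch estimates how much data a chain of element-wise layers moves through main memory. It assumes that the tensor does not fit into the cache, so every layer reads its full input and writes its full output; the chain of three layers and the helper names are illustrative assumptions, not measurements.

```python
# Rough memory-traffic estimate for a chain of element-wise layers.
# Assumption: the tensor is too large for the cache, so each layer reads its
# input from main memory and writes its output back to main memory.

def tensor_bytes(shape, bytes_per_element=4):
    """Size of a dense tensor in bytes (float32 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element

def layerwise_traffic(shape, num_layers):
    """Bytes moved when every layer reads and writes the full tensor."""
    return 2 * num_layers * tensor_bytes(shape)

def fused_traffic(shape):
    """Bytes moved when the whole chain runs in one pass: one read, one write."""
    return 2 * tensor_bytes(shape)

shape = (32, 3, 224, 224)  # a batch of 32 RGB images of size 224x224
num_layers = 3             # hypothetical chain of three element-wise layers

separate = layerwise_traffic(shape, num_layers)
fused = fused_traffic(shape)
print(f"layer by layer: {separate / 1e6:.1f} MB")
print(f"single pass:    {fused / 1e6:.1f} MB")
```

Under these assumptions the traffic grows linearly with the number of layers when they run one after another, while a single pass over the data keeps it constant.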
The Sol Project
The mission of the Sol project is to transparently accelerate neural network workloads with as little computing overhead as possible. We integrate Sol into neural network frameworks, where it attaches to the neural network.
Sol takes over the control of neural networks and reshapes their execution process. It analyzes the underlying structure and applies a series of optimizations (including operation reordering, loop merging, kernel fusion, etc.) to maximize data reuse in neural network layers and to utilize caches and other on-chip memories more efficiently. While we alter the computations, we ensure that we do not alter the results. We therefore rely solely on optimizations to instructions, caching, and workflows and do not employ alternative algorithms, approximations, data types with lower accuracy or other methods that could influence the results. Our initial prototype already achieves performance improvements of up to 41.1% and 35.7% on CPUs and GPUs, respectively, for prediction on state-of-the-art neural networks, compared to PyTorch. Please refer to our technical report for more details and results from Sol.
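The idea of reordering execution without changing results can be illustrated with a small, self-contained sketch. It is not Sol's implementation: the three element-wise layers, the tile size and all function names are assumptions chosen for illustration. The first function runs each layer over the whole input before starting the next one, the second pushes one small tile through all layers before touching the next tile, so intermediate data stays small enough to remain in fast memory.

```python
# Illustrative only: three element-wise "layers" applied to a flat list of floats.
def scale(x):
    return x * 0.5

def shift(x):
    return x + 1.0

def relu(x):
    return x if x > 0.0 else 0.0

LAYERS = [scale, shift, relu]

def run_layer_by_layer(data, layers):
    """Each layer processes the full input before the next layer starts."""
    current = list(data)
    for layer in layers:
        # A full-size intermediate result is produced for every layer.
        current = [layer(v) for v in current]
    return current

def run_depth_first(data, layers, tile_size=4):
    """Each tile passes through all layers before the next tile is touched."""
    out = []
    for start in range(0, len(data), tile_size):
        tile = data[start:start + tile_size]
        for layer in layers:
            # Intermediate results are only tile-sized.
            tile = [layer(v) for v in tile]
        out.extend(tile)
    return out

data = [float(i - 8) for i in range(16)]
reference = run_layer_by_layer(data, LAYERS)
reordered = run_depth_first(data, LAYERS)

# Every element sees the same operations in the same order, so the results
# are identical, not merely close.
assert reordered == reference
```

Because only the traversal order changes and each element is still processed by the same operations in the same sequence, the outputs match exactly, which mirrors the constraint that optimizations must not influence the results.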
Our approach aims to assist data scientists in their work, not to push them to become high-performance computing experts. For this reason we designed Sol to have a very simple API. The following example shows how to initialize a DenseNet-201 using PyTorch, optimize it using Sol and execute the forward pass. To enable Sol, only the two lines shown as comments need to be added to the source code (the example shows the syntax of the newest Sol development version).
from torch.autograd import Variable
import torch
from torchvision import models
#import sol.pytorch as sol

model = models.__dict__['densenet201']()
#model = sol.optimize(model, [0, 3, 224, 224])
model(Variable(torch.rand(32, 3, 224, 224)))
To support more than a single neural network framework, we designed Sol to operate as middleware that can plug into several different frameworks as frontends and use different kinds of compute devices as backends.
The frontends ensure compatibility between the Sol core and the neural network framework. They interface directly with the framework, read the structure of the neural network, pass it to the core, take care of execution control from inside the framework, and supply the data in the framework’s own tensor format.
The optimizer performs the main work in Sol. It analyzes the neural network structure, identifies optimizable layers and applies various optimizations. It then uses a generic SIMD processor model to fine-tune the code towards the underlying hardware and generates device-specific code using the corresponding backends.
The backends are very slim: they only provide recipes for generating device-specific code based on the generic SIMD processor model and handle device-specific API calls at runtime.
The scheduler is a runtime controller that manages the execution of the optimized layers.
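The division of labour between the four components can be sketched as a toy pipeline. Everything here is hypothetical: the class names, the set of layer kinds treated as mergeable, and the string that stands in for generated code are assumptions made for illustration and do not reflect Sol's actual interfaces.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Layer:
    name: str
    kind: str  # e.g. "conv", "relu", "batchnorm"

# Hypothetical set of layer kinds the optimizer is able to merge.
FUSABLE = {"relu", "batchnorm", "pool"}

class Frontend:
    """Reads the network structure from a framework (here: a list of tuples)."""
    def read(self, framework_model: List[Tuple[str, str]]) -> List[Layer]:
        return [Layer(name, kind) for name, kind in framework_model]

class Optimizer:
    """Groups consecutive mergeable layers; all other layers stay on their own."""
    def group(self, layers: List[Layer]) -> List[List[Layer]]:
        groups, current = [], []
        for layer in layers:
            if layer.kind in FUSABLE:
                current.append(layer)
            else:
                if current:
                    groups.append(current)
                    current = []
                groups.append([layer])
        if current:
            groups.append(current)
        return groups

class Backend:
    """Turns a group into a device-specific kernel description (a string here)."""
    def __init__(self, device: str):
        self.device = device

    def generate(self, group: List[Layer]) -> str:
        return f"{self.device}:" + "+".join(layer.name for layer in group)

class Scheduler:
    """Executes the generated kernels in order (here it only records them)."""
    def run(self, kernels: List[str]) -> List[str]:
        return list(kernels)

model = [("conv1", "conv"), ("bn1", "batchnorm"), ("relu1", "relu"),
         ("pool1", "pool"), ("conv2", "conv"), ("relu2", "relu")]

layers = Frontend().read(model)
groups = Optimizer().group(layers)
kernels = [Backend("cpu").generate(group) for group in groups]
executed = Scheduler().run(kernels)
print(executed)
```

Swapping `Backend("cpu")` for another device name changes only the code generation step, which is the property the middleware design is meant to provide: frontends and backends can be exchanged without touching the optimizer.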
BrainSlug: Transparent Acceleration of Deep Learning Through Depth-First Parallelism
Nicolas Weber, Florian Schmidt, Mathias Niepert and Felipe Huici
Technical Report, arXiv, 2018
Extended Abstract, DeepMobile, 2018
Sol is at an early stage. For now we support PyTorch v0.3.* as frontend and Intel CPUs and NVIDIA GPUs as compute devices. It can only optimize the inference/prediction pass for CNN-based layers. In the coming months we want to extend the number of optimizable functions, implement a frontend for TensorFlow, add support for the NEC Aurora vector processor and also enable the optimization of neural network training.