Nowadays, deep learning (DL) is at the core of many embedded applications, due to its unprecedented predictive performance. Direct on-device deployment of DL models can provide advantages in terms of latency predictability, energy efficiency, and data privacy. However, the tight power, latency, and memory constraints of embedded devices only allow the deployment of highly optimized models. Finding a sufficiently compact yet accurate Deep Neural Network (DNN) manually is a long trial-and-error process that usually leads to suboptimal solutions.
To solve this issue, Neural Architecture Search (NAS) tools, which improve and automate the exploration of DNN design spaces, have emerged. Specifically, NAS tools for embedded systems usually look for solutions that co-optimize predictive performance along with a computational cost metric (e.g., the number of parameters or the number of operations per inference). First-generation NAS tools were based on extremely time-consuming reinforcement learning or evolutionary algorithms (up to thousands of GPU-hours for a single search), while a more efficient recent alternative is Differentiable NAS (DNAS), which simultaneously trains the DNN weights and optimizes its architecture using gradient descent.
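As a minimal illustration of the DNAS idea (with hypothetical names, not the exact objective of the tools described below), the regular weights and the trainable architecture parameters can be updated together by gradient descent on a single loss that adds a differentiable cost term, weighted by a regularization strength, to the task loss:

```python
import torch

def dnas_training_step(model, batch, labels, optimizer, criterion, reg_strength=1e-6):
    # 'model' is assumed to expose a differentiable cost estimate (e.g., parameters
    # or operations per inference) computed from its architecture parameters;
    # 'architecture_cost' is a hypothetical method used only for illustration.
    optimizer.zero_grad()
    outputs = model(batch)
    loss = criterion(outputs, labels) + reg_strength * model.architecture_cost()
    loss.backward()   # gradients flow to both the weights and the architecture parameters
    optimizer.step()
    return loss.item()
```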
In this contribution, we describe a novel set of DNAS tools for edge devices, which optimize a DNN model for deployment with minimal overhead compared to a standard training.
The first optimization performed by our tools is the tuning of some of the key geometrical hyper-parameters of the network (the number of features, the kernel size, etc.). In this regard, we propose a novel mechanism that explores these hyper-parameters with a so-called “mask-based” DNAS, in which optimized architectures are obtained “by subtraction”, i.e., by pruning unimportant portions of a large initial “seed” model. Furthermore, we also introduce a new formulation of the DNAS optimization problem, which allows co-optimizing energy (or latency) and predictive accuracy under a fixed memory constraint.
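To make the “mask-based” idea concrete, the sketch below is a simplified illustration (not the actual implementation of our tools; all names are hypothetical): a trainable gate is attached to each output channel of a convolution in the seed model, the gates scale the channels during training so that they receive gradients, a differentiable size proxy built from the gates is added to the loss, and channels whose gates fall below a threshold are pruned after the search.

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Module):
    """Convolution whose output channels are scaled by trainable gates (illustrative sketch)."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        # One trainable gate per output channel, initialized to 1 (all channels kept).
        self.gate = nn.Parameter(torch.ones(out_ch))

    def forward(self, x):
        y = self.conv(x)
        # Scale each output channel by its gate so that gradients reach the gates.
        return y * self.gate.view(1, -1, 1, 1)

    def size_cost(self):
        # Differentiable proxy of the layer's parameter count: each surviving
        # output channel contributes the weights of one filter.
        per_channel_params = self.conv.weight[0].numel()
        return torch.abs(self.gate).sum() * per_channel_params

# Search phase (sketch): task loss plus a cost regularizer built from the gates, e.g.
#   loss = task_loss + 1e-6 * sum(m.size_cost() for m in model.modules()
#                                 if isinstance(m, MaskedConv2d))
# In a constrained formulation, a memory proxy like size_cost() would instead be
# penalized only when it exceeds a fixed budget, while an energy/latency proxy is minimized.
# After the search, channels with |gate| below a threshold are removed ("by subtraction").
```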
A second optimization step focuses on quantization, i.e., the replacement of floating-point data with low-precision integers for storage and computation in DNNs, which is fundamental to reduce the memory occupation and improve the energy efficiency of the inference process, especially on embedded nodes. In detail, we leverage a DNAS to tune the quantization bit-width used in different fine-grained portions of the network, enabling the deployment of so-called channel-wise mixed-precision DNNs.
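As a simplified illustration of how a DNAS can select per-channel bit-widths (again a hypothetical sketch, not our exact method), each output channel can hold one trainable score per candidate precision; during the search, a softmax over the scores blends fake-quantized versions of the weights, and at the end each channel keeps its highest-scoring bit-width.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedPrecisionWeights(nn.Module):
    """Channel-wise mixed-precision weight quantization searched via softmax scores (sketch)."""
    def __init__(self, weight: torch.Tensor, bit_options=(2, 4, 8)):
        super().__init__()
        self.weight = nn.Parameter(weight)
        self.bit_options = bit_options
        out_ch = weight.shape[0]
        # One trainable score per (output channel, candidate bit-width).
        self.alpha = nn.Parameter(torch.zeros(out_ch, len(bit_options)))

    def fake_quant(self, w, bits):
        # Symmetric per-tensor fake quantization with a straight-through estimator
        # so that gradients pass through the rounding operation.
        scale = w.abs().max() / (2 ** (bits - 1) - 1) + 1e-12
        q = torch.round(w / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale
        return w + (q - w).detach()

    def forward(self):
        probs = F.softmax(self.alpha, dim=-1)                       # (out_ch, n_bits)
        quantized = torch.stack(
            [self.fake_quant(self.weight, b) for b in self.bit_options], dim=-1
        )                                                            # (..., n_bits)
        probs = probs.view(probs.shape[0], *([1] * (self.weight.dim() - 1)), -1)
        # Each channel's effective weight is a softmax-weighted mix of the candidate precisions.
        return (quantized * probs).sum(dim=-1)

# After the search, each channel keeps the bit-width with the largest alpha score, and an
# expected-memory proxy (softmax probabilities times the bit-widths, summed over channels)
# can be added to the loss to steer the search toward smaller models.
```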
Lastly, we describe how these tools can be used to explore and optimize DNN architectures for a great variety of real-world, edge-relevant use-cases (e.g., bio-signal analysis, keyword spotting, and presence detection), reaching up to 150x memory compression and 5.5x energy and latency reduction at iso-accuracy with respect to a manually tuned model.