





# The importance of memory in the next generation of real-time systems

Paolo Burgio paolo.burgio@unimore.it





Industry 4.0 · Cities · Machine to Machine · Smarter Planet · Cyber-Physical Systems · Internet of Things · Industrial Internet

#### The four horsemen

- **1. Heavy workloads**
  - Sensor-fusion and image-processing
- **2. Reduced power consumption**
  - Smaller batteries and renewable power sources
- **3. Quickly interact** with the environment
  - Prompt processing of sensor data
- **4. Run highest criticality workloads**
  - Replacing safety-critical human activities
#### Artificial intelligence



IWES @Rome, September 8, 2017

#### Internet-of-Things



Cyber-physical systems




Health and medicine

Autonomous driving


©2017 Universit



Multi- and many-core platforms are the solution for 1-2(-3)

- Climbing "the power wall"
- High performance at low power (few Watts)



Real-time system: produces its result within a guaranteed (bounded) amount of time

- ✓ By construction
- ✓ Application fields: automotive, avionics, industry, medical...

The keyword: predictability

- Provide the correct result... when expected
- ✓ The system must be simple to analyze



## Real-Time systems – traditional approach



Scheduling (also, mapping)



Architectural bottlenecks

- ✓ Shared memory banks
- ✓ Caches (\$)
- ✓ I/Os





Beyond traditional techniques

- 1. More parameters
  - Shared resources (e.g., memory, SSDs, I/Os, caches...)
  - The complexity of the analysis grows exponentially with the number of cores
- 2. Memory accesses: instead of thin lines, big bars
  - Memory is the most accessed resource in the system
  - Traditional techniques are too conservative (bounds too high)
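To make the "bounds too high" point concrete, here is a toy calculation that is not taken from the slides. It assumes a hypothetical round-robin memory arbiter in which every access of the analyzed task may have to wait for one access from each other core. The function name and all numbers are illustrative only.

```python
def conservative_wcet_ns(compute_ns, n_accesses, access_ns, n_cores):
    """Pessimistic execution-time bound under the assumed round-robin arbiter."""
    # Worst case per access: our own access plus one access from every other core.
    worst_access_ns = access_ns * n_cores
    return compute_ns + n_accesses * worst_access_ns


if __name__ == "__main__":
    # Hypothetical task: 1000 ns of computation plus 100 memory accesses of 50 ns each.
    for cores in (1, 2, 4, 8):
        print(cores, "cores:", conservative_wcet_ns(1000, 100, 50, cores), "ns")
```

In this model the memory term is charged in full for every extra core, even if the other cores rarely touch memory, which is one way a bound can end up far above the measured behaviour.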







- Thousands of cores arranged in clusters
- ✓ Host-accelerator architecture (e.g., GP-GPUs)
- ✓ ...even worse!





## Knowledge of the platform is power

#### Two motivating examples

- Both from real systems
- 1. Many-core accelerator-based platforms
  - Quad-/Octa-core as host
  - Integrated GPU (iGPU) or FPGA
  - Powerful enough to run neural networks
- 2. Reference industrial system
  - Multi-core ARM
  - Multi-OS (embedded Linux + Win for UI)
  - Hypervisor-based







### Testbed #1: "automotive" platforms

Qualitatively analyze and characterize the conflicts due to parallel accesses to main memory by both CPU cores and iGPU

- 1. NVIDIA Tegra K1 w/Kepler GPU
- 2. NVIDIA Tegra X1 w/Maxwell GPU
- 3. NVIDIA Tegra X2 (Parker) w/Pascal GPU, automotive-grade
- 4. Intel i7-6700 w/Intel GPU
- 5. Xilinx Zynq UltraScale+ multi-core + FPGA (+GPU)

# HERCULES

Roberto Cavicchioli, Nicola Capodieci and Marko Bertogna, "Memory Interference Characterization between CPU cores and integrated GPUs in Mixed-Criticality Platforms", 22nd IEEE International Conference on Emerging Technologies And Factory Automation



- ✓ Shared memory between CPU/GPU complex
  - "Unified Virtual Memory"
  - Unlike traditional "discrete" GPU systems

#### Notable contention points









- Last-generation FPGA-based heterogeneous SoC
  - FPGA = (re-)programmability
- ✓ ARM A53 Quad-core as host "PS"
- ✓ FPGA as accelerator "PL"

Notable contention points (1)





## Test 'A' - Xilinx Zynq

Figure: A1 - Sequential read, sequential interference. Latency [ns] vs. working-set size (WSS) [B]; curves: Alone, Interf 1, Interf 2, Interf 3; cache limit marked.



























- ✓ Interfere with the prefetching mechanism
- ✓ Interfering cores read at increasing, strided addresses
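The interfering access pattern can be sketched as a list of byte offsets. This is a minimal illustration that assumes a fixed stride and a hypothetical buffer size, since the slides do not state the strides actually used. It only generates offsets and does not measure latency.

```python
def strided_offsets(buffer_bytes, stride_bytes):
    """Byte offsets visited by one interfering core, in increasing order."""
    if stride_bytes <= 0:
        raise ValueError("stride_bytes must be positive")
    return list(range(0, buffer_bytes, stride_bytes))


if __name__ == "__main__":
    # Hypothetical 1 KiB buffer read with a 256-byte stride.
    print(strided_offsets(1024, 256))
```

A stride larger than one cache line makes consecutive reads land on different lines, which is assumed here to be how the test disturbs a sequential prefetcher.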





### Testbed #2: industrial platform




- ✓ NXP iMX6 from Egicon
  - Components for F1 teams, industrial telescopic arms
  - Credits to Francesco Bellei

Figure: NXP i.MX6 block diagram. Recoverable blocks:

| Block                        | Components |
|------------------------------|------------|
| CPU platform                 | ARM Cortex-A9 cores; 32 KB I-cache and 32 KB D-cache per core; NEON and PTM per core; 256 KB-1 MB L2 cache |
| System control               | Secure JTAG; PLL, Osc.; Clock and Reset; Smart DMA; IOMUX; Timer x3; PWM x4; Watch Dog x2 |
| Power management             | Power supplies; Temperature |
| Memory                       | ROM; RAM |
| Security                     | RNG; Security Cntrl.; TrustZone; Secure RTC; Ciphers; eFuses |
| Multimedia                   | Hardware graphics accelerators (3D, 2D, vector graphics); video codecs (1080p30 Enc/Dec); audio (ASRC); image processing unit (resizing and blending, image enhancement, inversion/rotation) |
| Display and camera interface | HDMI and PHY; 24-bit RGB, LVDS (x2); MIPI DSI; MIPI CSI2; EPDC |
| Connectivity                 | MMC 4.4; UART x5 (5 Mbps); I²C x3; SPI x5; ESAI, I2S/SSI; 3.3V GPIO; Keypad; S-ATA and PHY (3 Gbps); USB2 OTG and PHY; USB2 Host and PHY; USB2 HSIC Host x2; MIPI HSI; S/PDIF Tx/Rx; FlexCAN x2; MLB150 + DTCP; 1 Gb Ethernet + IEEE 1588; NAND Cntrl. (BCH40); LP-DDR2, DDR3/LV-DDR3 (x32/64, 533 MHz) |



✓ More "traditional"



# Memory latency - sequential (ns)









# What do we do with this knowledge?





✓ A set of techniques to change the view that software has of the system...



# PREM - PRedictable Execution Models



- Group memory accesses at the beginning of every software task
- Co-schedule memory accesses and the task-to-core mapping
- Greatly reduces the complexity of the scheduling problem
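The three points above can be illustrated with a toy scheduler. It is a sketch under stated assumptions, not the PREM implementation: each task has one memory phase followed by one compute phase, tasks are served in list order, and only one memory phase may use the shared memory at a time. The `Task` fields and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    core: int
    mem: int   # duration of the memory (prefetch) phase, arbitrary time units
    comp: int  # duration of the compute phase, assumed to run from local memory


def prem_schedule(tasks):
    """Serialize memory phases; each compute phase follows its own memory phase."""
    memory_free_at = 0   # time at which the shared memory becomes available
    core_free_at = {}    # per-core time at which the core becomes idle
    schedule = []
    for t in tasks:
        # A memory phase needs both the shared memory and the task's core.
        start = max(memory_free_at, core_free_at.get(t.core, 0))
        mem_end = start + t.mem
        comp_end = mem_end + t.comp
        memory_free_at = mem_end        # memory is released after the prefetch
        core_free_at[t.core] = comp_end
        schedule.append((t.name, t.core, start, mem_end, comp_end))
    return schedule


if __name__ == "__main__":
    tasks = [Task("A", 0, 2, 5), Task("B", 1, 3, 4), Task("C", 0, 1, 2)]
    for entry in prem_schedule(tasks):
        print(entry)
```

Because memory phases never overlap, each task sees the memory as if it were alone during its prefetch, while compute phases on different cores still run in parallel. The analysis then only has to order the memory phases instead of considering every possible interleaving of individual accesses.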

#### ...and increases performance

Up to 4.5× predictable performance on a many-core platform

| # Cores/threads              | 1     | 2     | 4     | 8     |
|------------------------------|-------|-------|-------|-------|
| No-PREM – Worst (Analytical) | 0.026 | 0.047 | 0.088 | 0.170 |
| PREM – Worst (Analytical)    | 0.010 | 0.014 | 0.022 | 0.038 |
| Speedup                      | 2.6×  | 3.4×  | 4.0×  | 4.5×  |
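The speedup row can be re-derived from the two analytical rows as a quick consistency check. The values are copied from the table; the slides do not state their unit, and the variable and function names are illustrative.

```python
# Worst-case (analytical) values from the table, keyed by number of cores/threads.
no_prem = {1: 0.026, 2: 0.047, 4: 0.088, 8: 0.170}
prem = {1: 0.010, 2: 0.014, 4: 0.022, 8: 0.038}


def speedups(baseline, improved):
    """Ratio baseline / improved per core count, rounded to one decimal."""
    return {n: round(baseline[n] / improved[n], 1) for n in baseline}


if __name__ == "__main__":
    for cores, s in speedups(no_prem, prem).items():
        print(f"{cores} cores: {s}x")
```

The ratios reproduce the table's 2.6×, 3.4×, 4.0× and 4.5×.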

2015 paper @ RTEST



# Thank you!

Paolo Burgio paolo.burgio@unimore.it





http://hipert.unimore.it



## Backup



- 1. One observed core reads sequentially within a variable-sized working set, while the other cores interfere sequentially
- 2. One observed core reads randomly within a variable-sized working set, while the other cores interfere sequentially
- 3. One observed core reads sequentially within a variable-sized working set, while the other cores interfere randomly
- 4. One observed core reads randomly within a variable-sized working set, while the other cores interfere randomly
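The four scenarios form a 2x2 matrix of access patterns, which can be enumerated as below. This is a sketch only: the 64-byte line size, the fixed seed, and the choice to visit every line exactly once in the random case are assumptions, and the code generates offsets rather than timing real memory reads.

```python
import random

PATTERNS = ("sequential", "random")


def scenario_matrix():
    """(observed, interfering) pattern pairs, in the order of the numbered list."""
    return [(obs, intf) for intf in PATTERNS for obs in PATTERNS]


def read_offsets(pattern, wss_bytes, line_bytes=64, seed=0):
    """Byte offsets one core reads inside a working set of wss_bytes."""
    offsets = list(range(0, wss_bytes, line_bytes))
    if pattern == "random":
        # Assumed meaning of "random": same lines, shuffled order.
        random.Random(seed).shuffle(offsets)
    elif pattern != "sequential":
        raise ValueError(f"unknown pattern: {pattern}")
    return offsets


if __name__ == "__main__":
    for number, (obs, intf) in enumerate(scenario_matrix(), start=1):
        print(number, "observed:", obs, "| interference:", intf)
```

Sweeping `wss_bytes` across the cache sizes is what would expose the cache limit in the latency plots.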



- ✓ Shared memory between CPU/GPU complex
  - "Unified Virtual Memory"





- ✓ x86\_64 powerful host + iGPU
  - Sharing L3\$, External DRAM...

Notable contention points (1)

















## Test case B – iGPU interference on CPU

- 1. One CPU core reads sequentially within a variable working set, while the GPU accesses memory according to different paradigms:
  - CUDA memcpy
  - CUDA memcpy on UVM
  - CUDA memcpy on pinned mem
  - CUDA memset (0)
- 2. Same, but CPU core reads randomly









- 1. The CPU generates sequential interfering memory accesses, while the GPU accesses memory according to different paradigms:
  - CUDA memcpy
  - CUDA memcpy on UVM
  - CUDA memcpy on pinned mem
  - CUDA memset (0)
- 2. Same, but CPU core interference is random













