# A High Precision Power Model for the Tegra K1 CPU, GPU and RAM



Simula Research Laboratory

Power modelling is an important topic in many areas of computing, for example to save energy in texture streaming for gaming[1] or to select efficient H.264 video encoding parameters[2]. However, researchers' view of how hardware consumes power is limited. They typically resort to rate-based models to describe the energy consumption of hardware, where power usage is correlated directly with hardware access rates (for example instructions or cache misses per second)[3,4,5,6]. This approach ignores many mechanisms that affect the power usage of a system, such as rail voltages, core and clock gating, frequency scaling and the variable cost of instruction execution. As a result, rate-based models can mispredict power by up to 70 % on the Tegra K1. We show that by taking all these factors into account, with sufficient hardware knowledge, it is possible to bridge the gap between power usage and software execution and build power models that are over 98 % accurate across all CPU, GPU and memory frequency combinations.
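
To make the limitation concrete, the following is a minimal sketch of a rate-based model as described above: a base power plus a fixed energy cost per hardware event. The function name, the event names and all numbers are hypothetical and chosen only for illustration. Note that rail voltage and frequency do not appear anywhere in the estimate.

```python
def rate_based_power(base_watts, rates, joules_per_event):
    """Rate-based estimate: base power plus a fixed cost per hardware event.

    rates            -- events per second for each counter (hypothetical names)
    joules_per_event -- fixed energy cost assigned to each counter
    """
    return base_watts + sum(
        joules_per_event[name] * rate for name, rate in rates.items()
    )


# Illustrative, made-up values: not measurements from the Tegra K1.
example_rates = {"instructions": 1.0e9, "cache_misses": 1.0e6}
example_costs = {"instructions": 5.0e-10, "cache_misses": 2.0e-8}

estimate = rate_based_power(1.0, example_rates, example_costs)
print(f"Estimated power: {estimate:.2f} W")
```

Because the per-event cost is a constant, the same workload receives the same estimate at every voltage and frequency setting, which is the blind spot the abstract points to.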



| Rail    | Predictor           | Description                                      | Coefficient      |
|---------|---------------------|--------------------------------------------------|------------------|
| GPU     | $V_{gpu}$           | GPU voltage                                      | $I_{gpu,leak}$   |
|         | $\rho_{gpu,clock}$  | Total clock cycles per second                    | $C_{gpu,clock}$  |
|         | $\rho_{gpu,L2R}$    | L2 cache 32B reads per second                    | $C_{gpu,L2R}$    |
|         | $\rho_{gpu,L1R}$    | L1 cache 4B reads per second                     | $C_{gpu,L1R}$    |
|         | $\rho_{gpu,L1W}$    | L1 cache 4B writes per second                    | $C_{gpu,L1W}$    |
|         | $\rho_{gpu,INT}$    | Integer instructions per second                  | $C_{gpu,INT}$    |
|         | $\rho_{gpu,F32}$    | Float (32-bit) instructions per second           | $C_{gpu,F32}$    |
|         | $\rho_{gpu,F64}$    | Float (64-bit) instructions per second           | $C_{gpu,F64}$    |
|         | $\rho_{gpu,CNV}$    | Conversion instructions per second               | $C_{gpu,CNV}$    |
|         | $\rho_{gpu,MSC}$    | Miscellaneous instructions per second            | $C_{gpu,MSC}$    |
| Memory  | $\rho_{mem,clock}$  | Total clock cycles per second                    | $C_{mem,clock}$  |
|         | $\beta_{mem,204}$   | Power offset at 204 MHz                          | $P_{mem,204}$    |
|         | $\beta_{mem,300}$   | Power offset at 300 MHz                          | $P_{mem,300}$    |
|         | $\rho_{mem,CPU}$    | CPU busy memory cycles per second                | $C_{mem,cpu}$    |
|         | $\rho_{mem,OTH}$    | Other (GPU) busy memory cycles per second        | $C_{mem,oth}$    |
| Core    | $V_{core}$          | Core rail voltage (always powered)               | $I_{core,leak}$  |
|         | $\rho_{core,clk}$   | Active clock cycles per second (LP core)         | $C_{core,clk}$   |
| HP      | $V_{hp}$            | HP rail voltage (when powered)                   | $I_{hp,leak}$    |
|         | $\rho_{hp,clk1}$    | Active clock cycles per second (first core)      | $C_{hp,clk1}$    |
|         | $\rho_{hp,clk2}$    | Active clock cycles per second (second core)     | $C_{hp,clk2}$    |
|         | $\rho_{hp,clk3}$    | Active clock cycles per second (third core)      | $C_{hp,clk3}$    |
|         | $\rho_{hp,clk4}$    | Active clock cycles per second (fourth core)     | $C_{hp,clk4}$    |
| HP+Core | $V_{com,online}$    | Rail voltage when any core is online (not gated) | $I_{com,leak}$   |
|         | $\rho_{com,l1l2}$   | Cache maintenance, L1 and L2                     | $C_{com,l1l2}$   |
|         | $\rho_{com,l2ram}$  | Cache maintenance, L2 and RAM                    | $C_{com,l2ram}$  |
|         | $\rho_{com,ips}$    | Instructions per second (workload-specific)      | $C_{com,ips}$    |
| Other   | $P_{base}$          | Base power                                       | -                |
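
The table pairs each rail voltage with a leakage-current coefficient ($I$) and each rate predictor ($\rho$) with a capacitance-like coefficient ($C$). The poster text does not spell out how these combine, so the sketch below assumes the standard CMOS relations: static power is roughly $V \cdot I_{leak}$ and dynamic power is roughly $V^2 \sum_i C_i \rho_i$. The function name and all coefficient values are hypothetical placeholders, not fitted values for the Tegra K1.

```python
def rail_power(voltage, leak_current, rates, capacitances):
    """Estimate the power of one rail in watts (assumed CMOS-style model).

    voltage      -- rail voltage V in volts
    leak_current -- leakage coefficient I_leak in amperes
    rates        -- events per second for each predictor (the rho values)
    capacitances -- coefficient C for each predictor, in farads
    """
    static = voltage * leak_current
    dynamic = voltage ** 2 * sum(
        capacitances[name] * rate for name, rate in rates.items()
    )
    return static + dynamic


# Made-up predictor rates and coefficients, for illustration only.
gpu_rates = {"clock": 852e6, "L2R": 50e6, "INT": 400e6}
gpu_caps = {"clock": 1.0e-9, "L2R": 2.0e-9, "INT": 0.5e-9}

low = rail_power(0.8, 0.3, gpu_rates, gpu_caps)
high = rail_power(1.0, 0.3, gpu_rates, gpu_caps)
print(f"{low:.3f} W at 0.8 V, {high:.3f} W at 1.0 V")
```

Under this assumed form, the same event rates cost more at a higher rail voltage, which a purely rate-based model cannot express.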

| Benchmark    | Description                          |
|--------------|--------------------------------------|
| Idle CPU     | GPU off, CPU in idle state.          |
| CPU-workload | GPU off, CPU processing.             |
| Idle GPU     | GPU on and idle, CPU in idle state.  |
| L2 Read      | Stresses L2 cache reads only.        |
| L1 Read      | Stresses L1 cache reads.             |
| L1 Write     | Stresses L1 cache writes.            |
| RAM          | Stresses RAM activity (GPU EMC).     |
| Integer      | Stresses integer arithmetic unit.    |
| Float32      | Stresses floating point unit.        |
| Float64      | Stresses floating point unit.        |
| Conversion   | Stresses conversion instructions.    |
| Misc         | Stresses miscellaneous instructions. |
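
The poster text does not say how model coefficients are obtained from these benchmarks. One common approach, assumed here purely for illustration, is a least-squares fit of measured power against the predictors. The sketch below fits a single leakage current and a single capacitance for one rail from synthetic samples; the function name, the model form $P = I_{leak} V + C V^2 \rho$ and the data are all hypothetical.

```python
def fit_leak_and_capacitance(samples):
    """Least-squares fit of P = I_leak * V + C * V^2 * rate.

    samples -- list of (voltage, rate, measured_power) tuples
    Returns (I_leak, C) by solving the 2x2 normal equations.
    """
    s11 = s12 = s22 = b1 = b2 = 0.0
    for voltage, rate, power in samples:
        x1 = voltage                # feature multiplying I_leak
        x2 = voltage ** 2 * rate    # feature multiplying C
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        b1 += x1 * power
        b2 += x2 * power
    det = s11 * s22 - s12 * s12
    if det == 0.0:
        raise ValueError("samples do not determine both coefficients")
    i_leak = (b1 * s22 - s12 * b2) / det
    capacitance = (s11 * b2 - s12 * b1) / det
    return i_leak, capacitance


# Synthetic, noise-free samples generated from I_leak = 0.2 and C = 1.5,
# with rates expressed in units of 10^9 events per second.
true_leak, true_cap = 0.2, 1.5
operating_points = [(0.8, 0.2), (0.9, 0.4), (1.0, 0.6), (1.1, 0.8)]
samples = [
    (v, r, true_leak * v + true_cap * v ** 2 * r) for v, r in operating_points
]

fitted_leak, fitted_cap = fit_leak_and_capacitance(samples)
print(f"I_leak = {fitted_leak:.4f}, C = {fitted_cap:.4f}")
```

Running a benchmark at several voltage and frequency settings is what makes the two coefficients separable in this sketch; samples taken at a single operating point would leave the normal equations singular.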

Kristoffer Robin Stokke, Håkon Kvale Stensland, Pål Halvorsen

Department of Informatics, University of Oslo, Norway

