

## Improved hard real-time scheduling and transformations for embedded Streaming Applications

Spasic, J.

## Citation

Spasic, J. (2017, November 14). *Improved hard real-time scheduling and transformations for embedded Streaming Applications*. Retrieved from https://hdl.handle.net/1887/59459

| Version:         | Not Applicable (or Unknown)                                                                                                            |
|------------------|----------------------------------------------------------------------------------------------------------------------------------------|
| License:         | <u>Licence agreement concerning inclusion of doctoral thesis in the</u><br><u>Institutional Repository of the University of Leiden</u> |
| Downloaded from: | https://hdl.handle.net/1887/59459                                                                                                      |

Note: To cite this publication please use the final published version (if applicable).

Cover Page



# Universiteit Leiden



The following handle holds various files of this Leiden University dissertation: <u>http://hdl.handle.net/1887/59459</u>

Author: Spasic, J. Title: Improved hard real-time scheduling and transformations for embedded Streaming Applications Issue Date: 2017-11-14

## **Chapter 5**

# Exploiting Parallelism in Hard Real-Time Systems to Minimize Energy

Jelena Spasic, Di Liu, Todor Stefanov, "Energy-Efficient Mapping of Real-Time Applications on Heterogeneous MPSoCs using Task Replication", *In Proceedings of the IEEE/ACM/IFIP International Conference on HW/SW Codesign and System Synthesis* (CODES+ISSS'16), pp. 28:1–28:10, Pittsburgh, Pennsylvania, USA, October 2-7, 2016.

**I** N this chapter, we devise an approach to exploit the right amount of parallelism in streaming applications in order to minimize the energy consumption of streaming applications with throughput constraints when they are mapped on cluster heterogeneous MPSoCs. That is, this chapter describes our solution to **Problem 3** given in Section 1.3.

The problem of energy minimization by exploiting parallelism in streaming applications is further described in Section 5.1. Then, our contributions are summarized in Section 5.2. The related work is addressed in Section 5.3. Section 5.4 gives a motivational example to demonstrate the need for a new energy minimization approach. The considered system model and energy model are described in Sections 5.5 and 5.6. Then, the new energy minimization approach, that is, our proposed solution approach, is given in Section 5.7. Our proposed approach is experimentally evaluated in Section 5.8. The concluding discussion is given in Section 5.9.

### 5.1 Problem Statement

As introduced in Sections 1.2.2 and 1.3, cluster heterogeneous MPSoCs with per-cluster VFS capability are recognized as energy-efficient platforms for embedded systems. In order to efficiently utilize these cluster heterogeneous platforms to achieve all the desired requirements, the underlying hardware platform and the running streaming application have to be closely related. This requires the embedded designer to expose the right amount of parallelism available in the application and to decide how to allocate and schedule the tasks of the application on the available processing elements such that the platform is utilized efficiently and the timing and energy consumption constraints are met. As explained in Chapter 4 already, the given initial parallel application specification is often constructed without fully considering the computational capacity and power consumption profile of an MPSoC platform. This may lead to an application specification which consists of highly imbalanced tasks in terms of the task workload, that is, task utilization. This may further lead to unnecessary increase in the energy consumption of such a system because when several tasks are mapped onto the same cluster in cluster heterogeneous MPSoCs, the one with the heaviest utilization will determine the required voltage and frequency of the whole cluster and will significantly increase the energy consumption of the other tasks mapped on the same cluster. As shown in Chapter 4, by applying task replication to application tasks with heavy utilization, their utilization can be decreased while still providing the same application performance. Thus, to better utilize the underlying MPSoC platform while minimizing the energy consumption, the initial specification of an application, that is, the initial task graph, should be transformed to an alternative one that exposes more parallelism while preserving the same application behavior and timing performance. However, as mentioned in Chapter 4, having more tasks' replicas than necessary introduces more overheads in code and data memory, scheduling and inter-tasks communication, which in turn will result in higher energy consumption. Therefore, in this chapter, we investigate how to **exploit the right amount of parallelism**, that is, find the proper values of replication (unfolding) factors, depending on the underlying MPSoC platform, to achieve the required performance and timing guarantees while minimizing the energy consumption.

## 5.2 Contributions

As a solution to the research problem described in Section 5.1, we propose a novel algorithm to efficiently map real-time streaming applications onto cluster heterogeneous MPSoCs, which are subject to throughput constraints, such that the energy consumption of the cluster heterogeneous MPSoC is reduced by using task replication and per-cluster VFS. The specific novel contributions of this chapter are the following:

- We propose a novel polynomial-time algorithm, called Data Parallel Energy Minimization (DPEM), to map and schedule hard real-time streaming applications modeled as acyclic SDF graphs onto a cluster heterogeneous MPSoC such that the energy consumption is minimized while the throughput constraints are guaranteed. By using the hard realtime scheduling framework for CSDF graphs, presented in Chapter 3, we propose within our DPEM algorithm an efficient way to determine a suitable processor type for each task in an (C)SDF graph such that the energy consumption is minimized and the throughput constraint is met. Then, by using the unfolding graph transformation in Chapter 4, we propose a method in DPEM to determine a replication factor for each task in an SDF graph such that the distribution of the workload on the same type of processors is balanced, which enables processors to run at a lower frequency, hence reducing the energy consumption.
- We show, on a set of real-life streaming applications, that our proposed energy minimization approach outperforms related approaches in terms of energy consumption while meeting the same throughput constraints.

## 5.3 Related Work

Energy-efficient mapping and scheduling of streaming applications represented as dataflow graphs which guarantees certain throughput has been extensively studied. The related works can be divided into several categories depending on the MPSoC platform they consider: *homogeneous* [SDK13, DSB<sup>+</sup>13, ZSJ08, LW13, NMM<sup>+</sup>11, BL13, HNP<sup>+</sup>15], or *heterogeneous* [HMGM13, SJE11]. Depending on the VFS technique they apply to minimize the energy consumption, the related works can be divided into those considering *per-core VFS* [SDK13, DSB<sup>+</sup>13, ZSJ08, LW13, NMM<sup>+</sup>11, HMGM13, BL13], those considering *global VFS* [HMGM13, HNP<sup>+</sup>15] and the works which do not consider VFS but they utilize platform heterogeneity to achieve energy-efficiency [SJE11]. The approaches in [HMGM13, LW13, NMM<sup>+</sup>11, SJE11, HNP<sup>+</sup>15] convert an initial SDF graph into equivalent Homogeneous SDF (HSDF) graph to exploit the parallelism of an application and achieve energy-efficiency. However, the HSDF graph obtained from the initial SDF graph may grow in size exponentially, making the analysis performed on the HSDF graph time-consuming. Instead, the approaches in [SDK13, ZSJ08, DSB+13, BL13] perform energy minimization directly on an SDF graph. Works [SDK13] and [ZSJ08] perform design space exploration at design time to find the energy-efficient mapping solution of an SDF scheduled in self-timed manner on a homogeneous MPSoC platform with per-core VFS capability such that certain throughput is guaranteed. In addition, the approach in [SDK13] has a run-time phase where slack created at run time is exploited to further minimize the energy consumption. In  $[DSB^+13]$  the authors propose a heuristic to find per-core voltage-frequency points for a given task mapping and the execution order such that the throughput constraint is met. The authors in [BL13] propose a technique to transform an SDF graph at run time into its equivalent SDF graph to adapt to environmental and demand changes. One possible scenario where the SDF graph should be transformed to adapt to the new circumstances is when some processors become available on a homogeneous platform with per-core VFS capability. In that case the tasks in the SDF graph are replicated such that all processors are occupied, which enables processors to run at a lower frequency hence consuming less energy. However, the authors in [BL13] focus more on the transformation itself and not on the energy minimization. In contrast to all related works, discussed above, our approach: 1) considers heterogeneous MPSoC platforms with per-cluster VFS capability, which is a good trade-off in terms of energy-efficiency and the implementation cost; 2) utilizes an unfolding graph transformation to balance the workload put on the MPSoC and to reduce energy consumption by finding how many times each task in a graph should be replicated; 3) uses preemptive hard real-time scheduling to schedule the tasks which gives more opportunities to meet the lowest frequency for schedulability supported by the platform.

Energy-efficient mapping and scheduling of periodic hard real-time tasks has been widely researched in the past. [BMAB16] gives a comprehensive review of works dealing with energy-aware scheduling for real-time systems. As stated in [BMAB16], most of the existing work considers homogeneous MPSoCs and in recent years people started considering heterogeneous platforms and platforms with voltage/frequency levels shared among multiple processors as energy-efficient design solutions. Regarding the considered heterogeneous MPSoC platforms, the closest to our work are the works in [CKR14] and [LSCS15]. The approach in [CKR14] proposes and evaluates several partitioned Earliest Deadline First (EDF) scheduling strategies for real-time tasks mapped on cluster heterogeneous platforms in terms of energy-efficiency. However, because of the bin-packing issue in partitioned scheduling, the approach in [CKR14] may not fully utilize the energy-efficient cores in a cluster heterogeneous MPSoC, hence the energy minimization is limited. In contrast, by replicating the tasks with heavy utilization, we can reduce their utilization and hence fully utilize the energy-efficient cores. The approach in [LSCS15] considers cluster scheduling for cluster heterogeneous MPSoCs where tasks are allowed to migrate at run-time among processors within the same cluster in order to achieve better resource utilization. However, cluster scheduling suffers from high scheduling overhead caused by task migration and increased context switching. Moreover, the frequency of some clusters in [LSCS15] is still determined by the tasks with the heaviest utilization. In contrast, in our approach, we use partitioned scheduling which has low scheduling overhead and we avoid the capacity loss and we lower the operating frequency by replicating the tasks with heavy utilizations.

The works in [XKD12], [WYK<sup>+</sup>10] and [Lee09] consider parallel execution of task replicas to achieve energy efficiency, as we do. The authors in [XKD12] consider frame-based tasks with an implicit deadline and a homogeneous platform with per-core VFS capability where the frequency of a core may be changed for each task. In contrast, in our work, we consider more general periodic task model and more realistic heterogeneous platform with per-cluster VFS capability, hence our approach is more applicable in practice than the approach in [XKD12]. The approach in [WYK<sup>+</sup>10] exploits the data parallelism in an application by replicating the tasks of the application over all processors available in an MPSoC. This means that, in distributed memory architectures, the code of the whole application has to be replicated on all the processors in an MPSoC. By contrast, in our approach, only certain tasks of the application have to be replicated, which reduces significantly the memory overhead of our approach compared to the one in [WYK<sup>+</sup>10]. Moreover, the work in [WYK<sup>+</sup>10] assumes homogeneous systems with per-core VFS and continuous frequencies, while we consider heterogeneous systems with per-cluster VFS capability, which is more practical in modern embedded systems. The approach presented in [Lee09] replicates computation-intensive tasks which yields to a more balanced load on processors, and in turn allows the system to run at a lower frequency. In addition, the authors in [Lee09] consider systems with discrete set of operating frequencies and homogeneous platforms with per-core VFS capability. As discussed earlier, per-core VFS is not practical in modern many-core systems. Hence, our work considers heterogeneous platforms with per-cluster VFS capability. The approach in [Lee09] is devised and, hence efficient only for platforms with performance-efficient processors. This means that the approach in [Lee09] would never replicate the tasks which are going to be mapped on energy-efficient processors. In addition, if the total number of tasks with heavy utilization is equal to the number of processors in a platform, tasks will not be replicated in [Lee09]. In contrast, our approach will replicate the tasks mapped on energy-efficient processors and it will replicate the tasks even if the number of heavy tasks is equal to the number of processors if this leads to more energy-efficient design.

### 5.4 Motivational Example

In the first part of this section, we motivate the need for using the unfolding graph transformation, presented in Chapter 4, to achieve energy-efficient MPSoC design under a throughput constraint. We first show the drawback of the energy minimization approaches for heterogeneous MPSoCs and hard real-time scheduling, that is, the approaches in [CKR14] and [LSCS15]. We analyze three different designs obtained by mapping the SDF graph *G* in Figure 5.1 to a heterogeneous platform consisting of one performance efficient (PE) cluster with 2 PE processors and one energy-efficient (EE) cluster with 2 EE processors, namely, the platform given in column 1, row 2 in Table 5.1, under a throughput constraint of 1 output token per 100  $\mu$ s. An example of PE and EE clusters and processors is given in Figure 1.1, where a cluster of 'big' cores corresponds to a PE cluster, and a cluster of 'LITTLE' corresponds to an EE cluster.

The first design out of the three analyzed is obtained by using the best mapping approach evaluated in [CKR14] and we refer to that approach as CKR. The CKR approach allocates actors  $v_2$  and  $v_3$  to PE processors in one-to-one manner, and it allocates actors  $v_1$  and  $v_4$  to one EE processor, while the other EE processor is switched-off. Once the actors are allocated, the minimum frequency which ensures the schedulability of the actors mapped to a processor in a cluster is selected from a discrete set of frequencies per cluster. The energy consumption of such design is given in Table 5.1, column 2, row 2. After applying the approach in [LSCS15], we obtain the second design where actors  $v_2$  and  $v_3$  are allocated to the PE cluster, while actors  $v_1$  and  $v_4$  are allocated to the EE cluster. The corresponding energy consumption after applying the approach in [LSCS15], denoted by FDM, is given in Table 5.1, column 3, row 2. If we apply our approach presented in Section 5.7 which uses the unfolding transformation in Chapter 4 on graph *G* in Figure 5.1,





**Figure 5.2:** A CSDF graph G' obtained by unfolding SDF graph G in Figure 5.1 with  $\vec{f} = [1, 2, 2, 1]$ .

**Table 5.1:** *Different MPSoC designs for G in Figure 5.1.* 

| MPSoC            | CKR $[\mu J]$ | FDM $[\mu J]$ | SDK $[\mu J]$ | our [µJ] | WYL $[\mu J]$ |
|------------------|---------------|---------------|---------------|----------|---------------|
| (2PE)(2EE)       | 343.55        | 343.55        | 346.30        | 97.76    | 343.55        |
| (PE)(PE)(PE)(PE) | 357.94        | 392.18        | 389.09        | 192.80   | 240.32        |

under the same throughput constraint as in the CKR and FDM approaches, we can lower the utilization of the actors with high utilization,  $v_2$  and  $v_3$ , and achieve better load balancing on the processors of the same type and hence, the frequency of the power-hungry processors can be lowered further than in [CKR14], [LSCS15]. For example, by unfolding actors  $v_2$  and  $v_3$  twice, as given in Figure 5.2, our approach in Section 5.7 allocates  $v_{2,0}$  and  $v_{3,0}$  one-to-one to PE processors, and it allocates  $v_{2,1}$  to an EE processor and  $v_{3,1}$ ,  $v_1$  and  $v_4$  to another EE processor. The energy consumption value for this third design is given in Table 5.1, column 5, row 2. We can see that our approach reduces the energy consumption by 71% when compared to the CKR and FDM approaches.

Now, we would like to analyze an approach which was devised for homogeneous platforms with per-core VFS capability, that is, the approach in [SDK13], denoted by SDK in Table 5.1. To this end, we compare the energy consumption of two designs in which the SDF graph *G* in Figure 5.1 is mapped to a homogeneous MPSoC consisting of four PE clusters with 1 processor per cluster, under a throughput constraint of 1 output token per  $100 \, \mu s$ . The SDK approach will allocate actors to processors in one-to-one manner, while our

approach, proposed in Section 5.7, will replicate actors  $v_2$  and  $v_3$  twice, as shown in Figure 5.2, to lower their utilization. We can see from columns 4 and 5, row 3 in Table 5.1 that our approach reduces the energy consumption by 50% when compared to the SDK approach. The main reason is that we are using the unfolding graph transformation to reduce the influence of heavy actors and hence minimize the energy of an MPSoC. Another reason is that the SDK approach uses self-timed scheduling which is non-preemptive, hence, less flexible for scheduling and that the SDK only minimizes dynamic energy consumption. The energy consumption values of the approaches CKR, FDM and SDK in Table 5.1 which were not discussed above are given only for completeness. However, we can see that these values are always higher than the corresponding energy consumption of our approach in Section 5.7. Thus, our approach outperforms these related approaches.

Above, we motivated the need to use the unfolding transformation, presented in Chapter 4, within our new approach described in Section 5.7 to achieve energy-efficiency for MPSoCs under a throughput constraint. Now, we would like to motivate the need for our whole approach, presented in Section 5.7, which efficiently finds task replication factors and task mappings to achieve further reductions in the energy consumption. Although the approach in [Lee09] exploits the energy-saving capability of data-parallel execution for a homogeneous MPSoC with per-core VFS capability, that approach is not efficient in terms of energy reduction, especially in the case of platforms with EE processors. Below we show its inefficiency for the homogeneous platform in column 1, row 3, and for the heterogeneous MPSoC platform in column 1, row 2, in Table 5.1. The approach in [Lee09], called the WYL approach, considers platforms which power consumption curve is increasing "fast" with the increase of processor utilization. Such power consumption curve corresponds to PE processors. Let us consider the mapping of the SDF graph in Figure 5.1 on the homogeneous platform under a throughput constraint of 1 output token per 100  $\mu$ s. The WYL approach will classify actors  $v_2$  and  $v_3$ as "heavy" tasks, that is, tasks eligible for replication. However, because the platform contains only 4 processors, the WYL will decide that the number of processors is not sufficient to replicate both actors and it will only replicate actor  $v_2$  twice. In contrast, our algorithm will replicate both actors  $v_2$  and  $v_3$ twice, which will lead to an energy reduction of 20%, see Table 5.1, row 3, columns 5 and 6.

Let us now analyze the designs obtained by applying the WYL and our new approach for mapping graph *G* in Figure 5.1 on the heterogeneous platform in column 1, row 2, in Table 5.1. Given that the power consumption curve

of EE processors is a "slowly" increasing curve with the increase of processor utilization, the WYL approach will never replicate actors assigned to EE processors. In contrast, our approach presented in Section 5.7 will replicate actors assigned to EE processors as well if their replication leads to more energy-efficient MPSoC. We can see in row 2, columns 5 and 6, in Table 5.1, that our approach leads to a design with 71% reduction in energy consumption when compared to the WYL approach, for the heterogeneous MPSoC with one PE and one EE cluster each containing 2 processors. This happens because after the classifications of actors into PE and EE in order to satisfy the throughput constraint of 1 output token per 100  $\mu$ s, EE actors will not be considered for replicated because of the algorithm in [Lee09] which does not replicate the actors once the number of "heavy" actors is equal to the number of (PE) cores, which happens for this platform.

From the above examples, we can see the necessity and usefulness of our approach, presented in Section 5.7, which uses the graph unfolding transformation, given in Chapter 4, to obtain energy-efficient cluster heterogeneous MPSoC designs.

### 5.5 System Model

In this section, we describe the system model we use in this chapter. We consider a cluster heterogeneous MPSoC containing two types of clusters – performance-efficient (PE) clusters and energy-efficient (EE) clusters. Each cluster has a number of identical PE processors, denoted as  $N_p^{PE}$ , or a number of EE processors, denoted as  $N_p^{EE}$ . Thus, in total, a cluster heterogeneous MPSoC contains  $N_c^{PE} \times N_p^{PE}$  PE processors and  $N_c^{EE} \times N_p^{EE}$  EE processors, where  $N_c^{PE}$  and  $N_c^{EE}$  represent the total number of PE clusters and the total number of EE clusters, respectively. All processors on the same cluster operate at the same voltage and frequency level. The voltage and frequency level of a cluster can be changed to control the power consumption. A cluster can be switched-off, thereby consuming no power.

Since the actors of a (C)SDF graph *G* modeling an application may run on two different types of processors (PE and EE), the worst-case execution time value  $C_i(\varphi)$  for each phase  $\varphi$  of an actor  $v_i$  has two values –  $C_i^{PE}(\varphi)$  and  $C_i^{EE}(\varphi)$ . The total utilizations of the actors/tasks assigned to PE cluster *j* and EE cluster *k* can be calculated by:

$$u_{j}^{PE} = \sum_{v_{i} \in \mathcal{V}_{j}^{PE}} \frac{\sum_{\varphi=1}^{\varphi=\phi_{i}} C_{i}^{PE}(\varphi)}{T_{i}}, \ u_{k}^{EE} = \sum_{v_{i} \in \mathcal{V}_{k}^{EE}} \frac{\sum_{\varphi=1}^{\varphi=\phi_{i}} C_{i}^{EE}(\varphi)}{T_{i}},$$
(5.1)

where  $\mathcal{V}_{j}^{PE}$  and  $\mathcal{V}_{k}^{EE}$  represent sets of CSDF actors/tasks assigned to PE cluster *j* and EE cluster *k*, respectively.

#### 5.6 Energy Model

This section defines the energy model used in this chapter. Given that all processors in the same cluster operate at the same voltage and frequency level, we can reduce the energy consumption of a cluster heterogeneous MPSoC by using per-cluster VFS and by switching-off some clusters. The authors in [LSCS15] give a power model for cluster heterogeneous MPSoC systems with discrete voltage and frequency levels based on real measurements performed on the ODROID XU-3 [ODR] board containing an MPSoC with two clusters – one quad core Cortex A15 big (PE) cluster and one quad core Cortex A7 LITTLE (EE) cluster. The power model of a cluster is given by:

$$P(f) = \alpha f^{\flat} + \beta N_{p,ac} + P_s(f), \qquad (5.2)$$

where the first term is the dynamic power consumption,  $\beta$  is the static power consumption of one processor and  $N_{p,ac}$  is the number of active processors on the cluster,  $P_s(f)$  is the "uncore" power consumption and f is the frequency level. The "uncore" power consumption is the power consumption from some components not pertaining to a processor, such as a shared cache, an integrated memory controller, and others. Parameters  $\alpha$ , b and  $\beta$ , and  $P_s(f)$  depend on the platform and cluster type, and they are determined in [LSCS15].

We calculate the total energy consumption for an application graph G mapped onto a cluster heterogeneous MPSoC over one hyperperiod  $T_G$  by:

$$E = E^{PE} + E^{EE}. (5.3)$$

 $E^{PE}$  in Equation (5.3) contains the total energy consumption of *PE* clusters and is given by:

$$E^{PE} = T_G \bigg( \sum_{j=1}^{N_{ac}^{PE}} \bigg( u_j^{PE} \alpha^{PE} (f_j)^{b^{PE}} + \beta^{PE} N_{p,ac_j}^{PE} + P_s^{PE} (f_j) \bigg) \bigg),$$
(5.4)

where  $N_{ac}^{PE}$  is the number of active PE clusters,  $N_{p,ac_j}^{PE}$  is the number of active processors on PE cluster *j*,  $u_j^{PE}$  is the total utilization for tasks successfully scheduled by a partitioned scheduling algorithm on the corresponding PE cluster *j*,  $f_j$  is the operating frequency for the corresponding PE cluster *j*, and  $\alpha^{PE}$ ,  $b^{PE}$  and  $\beta^{PE}$  are the power parameters for PE clusters taken from [LSCS15].

The total energy consumption of *EE* clusters,  $E^{EE}$  in Equation (5.3), is given by:

$$E^{EE} = T_G \bigg( \sum_{k=1}^{N_{ac}^{EE}} \bigg( u_k^{EE} \alpha^{EE} (f_k)^{b^{EE}} + \beta^{EE} N_{p,ac_k}^{EE} + P_s^{EE} (f_k) \bigg) \bigg),$$
(5.5)

where  $N_{ac}^{EE}$  is the number of active PE clusters,  $N_{p,ac_k}^{EE}$  is the number of active processors on EE cluster k,  $u_k^{EE}$  is the total utilization for tasks successfully scheduled by a partitioned scheduling algorithm on EE cluster k,  $f_k$  is the operating frequency for the corresponding EE cluster k, and  $\alpha^{EE}$ ,  $b^{EE}$  and  $\beta^{EE}$  are the power parameters for EE clusters taken from [LSCS15].

### 5.7 The Proposed Energy Minimization Approach

In this section, we present our novel energy minimization approach called Data-Parallel Energy Minimization (DPEM) which energy-efficiently exploits a given cluster heterogeneous MPSoC platform when mapping a hard real-time streaming application under a throughput constraint. The logic behind our energy minimization approach is the following: our approach replicates the tasks with heavy utilization to reduce their utilization and lower the operating frequency, thereby reducing the energy consumption; it tries to map as many tasks as possible to EE processors such that the energy consumption is further reduced, while the throughput constraint is met. The DPEM approach is given in Algorithm 6 and explained in Section 5.7.1 while its constituents are desribed in Section 5.7.2 and Section 5.7.3.

#### 5.7.1 The Data-Parallel Energy Minimization Algorithm

In this section, we present our integral algorithm for Data-Parallel Energy Minimization (DPEM). The inputs to DPEM are an SDF graph *G*, a cluster heterogeneous MPSoC, and a throughput constraint  $\mathcal{R}_{out}$ . The outputs are a vector of unfolding factors  $\vec{f}_{best}$  according to which each actor in the initial SDF graph should be replicated, the task mapping to processors in the clusters  $\mathcal{C}_{best}$ , a vector of operating frequencies for clusters  $\vec{F}_{best}$  and the minimum

energy consumption of the system  $E_{best}$ . The DPEM algorithm is shown in Algorithm 6. Line 1 in Algorithm 6 initializes each unfolding factor of an actor in graph G to 1. In Lines 2 and 3, the initial graph G is converted to periodic tasks by the ISPS approach in Chapter 3, where periods for each actor in G are set, by using scaling factor *s* in Line 3, to be as large as possible while meeting the throughput constraint  $\mathcal{R}_{out}$ . The corresponding hyperperiod  $T_G$ of graph G is calculated as well in Line 3. Line 4 finds the bottleneck actor in G. The bottleneck actor is the actor with the heaviest workload among the actor workloads for PE type of processors during one hyperperiod. If multiple actors have the same maximum workload, then the one with the smallest code size is selected to be the bottleneck. Note that stateful actors and input and output actors are not unfolded. In Line 5, Algorithm 7, explained in Section 5.7.2, is applied to classify actors into two groups – EE and PE. Here, by splitting actors into two groups, the required throughput of G under ISPS is guaranteed. Line 6 uses Algorithm 8, described in Section 5.7.3, to energy-efficiently map graph *G* on the input MPSoC platform. It may happen that the input platform is not big enough to map the input application, that is, graph G. In that case Algorithm 8 will return an empty mapping,  $C_{best} = \emptyset$ . If this happens, the algorithm terminates and signals failure in Line 8. Otherwise, after obtaining the initial energy-efficient solution in Line 6, we further search to reduce the energy consumption by exploiting task replication via the unfolding, Lines 9 to 20.

Line 9 checks if the upper bound on the unfolding factor for the bottleneck actor has been reached and if the bottleneck actor is one of the actors which cannot be unfolded (input, output actors and stateful actors). If one of these happens, Algorithm 6 terminates and returns in Line 21 the most energyefficient solution found so far. Otherwise, the initial SDF graph is transformed into an equivalent CSDF graph by replicating, in Line 10, the bottleneck actor previously found in Line 4. The graph transformation is performed in Line 11 by using the unfolding transformation method given by Algorithm 3, described in Chapter 4. Given that the transformed graph contains more actors than the original one, the WCETs of the actors have to be recomputed because the worst-case communication time may change. This is done in Line 12. Once the WCETs in the CSDF graph are recalculated, actors in the CSDF graph are transformed into periodic tasks by using the ISPS approach in Chapter 3. The unfolding graph transformation is usually used to increase the throughput of a graph by exposing more parallelism through task replication. However, here we want just to meet the same throughput constraint  $\mathcal{R}_{out}$  as the initial graph, and use the unfolding transformation to change the utilization of the periodic **Algorithm 6:** Data-Parallel Energy Minimization (DPEM).

**Input**: An SDF graph  $G = (\mathcal{V}, \mathcal{E})$ , a cluster heterogeneous MPSoC and a throughput constraint  $\mathcal{R}_{out}$ . **Output**: Vector of unfolding factors  $f_{hest}$ , task mapping to processors in the clusters  $\mathcal{C}_{best}$ , vector of operating frequencies for clusters  $\vec{F}_{best}$  and the minimum energy consumption  $E_{best}$ . 1  $\vec{f} = [1, 1, \cdots, 1];$ <sup>2</sup> Calculate WCETs for each actor  $v_i$  in G by using Equation (3.2); <sup>3</sup> Calculate period  $T_i$  for PE type of processors for each actor  $v_i$  in G by using Equation (4.3) and  $s = \left\lfloor \frac{\phi_{out} \cdot r_{out}}{\mathcal{R}_{out} \cdot \operatorname{lcm}(\vec{r})} \right\rfloor$ ;  $T_{best} = T_G = r_{out} \cdot T_{out}$ ; <sup>4</sup> Find the bottleneck actor  $v_{h,k}$  in *G*; 5  $\mathcal{V}^{EE}$ ,  $\mathcal{V}^{PE} \leftarrow \text{Classify actors in } G$  by Algorithm 7( $G, \frac{\phi_{out}}{T_{out}}$ ); 6 Find  $C_{best}$ ,  $\vec{F}_{best}$ ,  $E_{best}$  by Algorithm 8( $\mathcal{V}^{EE}$ ,  $\mathcal{V}^{PE}$ ); 7 if  $C_{best} = \emptyset$  then return Unschedulable; 9 while  $f_b < (N_c^{EE} \times N_p^{EE} + N_c^{PE} \times N_p^{PE}) \land v_{b,k}$  not stateful/in/out do  $f_b = f_b + 1;$ 10 Get G' by unfolding G using the method in Chapter 4 (Algorithm 3); 11 Calculate WCETs for each actor  $v'_i$  in G' by using Equation (3.2); 12 Calculate period  $T'_i$  for each actor  $v'_i$  by using Equation (4.3) and  $s = \left| \frac{\varphi'_{out} \cdot r'_{out}}{\mathcal{R}_{out} \cdot \operatorname{lcm}(\vec{r}')} \right|$ ; 13  $T_{G'} = r'_{out} \cdot T'_{out};$ Find the bottleneck actor  $v_{b,k}$  in G'; 14  $\mathcal{V}^{\prime EE}, \mathcal{V}^{\prime PE} \leftarrow \text{Classify actors in } G' \text{ by Algorithm 7}(G', \frac{\phi_{out}'}{T_{out}'});$ 15 Find  $C_{best,u}$ ,  $\vec{F}_{best,u}$ ,  $E_{best,u}$  by Algorithm 8( $\mathcal{V}^{\prime EE}$ ,  $\mathcal{V}^{\prime PE}$ ); 16 if  $C_{best.u} = \emptyset$  then 17 go to 9; 18 19 20 21 return  $\vec{f}_{best}, C_{best}, \vec{F}_{best}, E_{best};$ 

tasks. To meet throughput constraint  $\mathcal{R}_{out}$  and keep the throughput as close as possible to the initial throughput in Line 3, we scale the periods of the periodic tasks obtained after the conversion by scaling factor *s*, which is given in Line 13. Then, we find in Line 14 the bottleneck actor in the equivalent CSDF graph *G'*, which is replicated in the next pass of the algorithm. The actors in *G'* are classified into PE and EE actors and the minimum energy of mapping the tasks corresponding to actors in *G'* onto the MPSoC is calculated in Lines 15 and 16. If there is no feasible mapping we continue with the task replication, Lines 17

and 18. On the other hand, if we could map G' on the MPSoC, the obtained energy is compared against the best, that is, the minimum, energy obtained so far over the same time interval in Line 19. If we detect that the energy consumption of the current solution is smaller than the energy consumption of the best solution found so far, the current solution becomes the best one in Line 20. Line 9 checks whether the termination criteria for Algorithm 6 is met. If it is not, the algorithm will repeat Lines 10 to 20. Otherwise, the best solution is returned in Line 21.

Finally, we can analyze the time complexity of our DPEM algorithm in the worst case. The complexity of Algorithm 6 is determined by the **while loop** in Lines 9 to 20. In the worst case, the while loop will be executed until all the actors in the initial graph are replicated in the equivalent graph maximum number of times, which is equal to the number of processors *N* in the platform. So, the while loop will be executed  $|\mathcal{V}|N$  times in the worst case. The complexity of the graph unfolding algorithm in Chapter 4, Algorithm 3, which is called in Line 11, is  $O(|\mathcal{E}|N^2\phi)$ , where  $\phi$  is the maximum number of execution phases per actor in the equivalent CSDF graph obtained after unfolding,  $\phi = \max_{v_i \in \mathcal{V}'} {\phi_i}$ . The complexity of the other parts of the while loop is determined by Algorithm 8, see Section 5.7.3. Thus, the worst-case complexity of Algorithm 6 is  $O(N|\mathcal{V}| \cdot (N^2\phi|\mathcal{E}| + (N|\mathcal{V}|)^2 \log(N|\mathcal{V}|)))$ , which is polynomial.

#### 5.7.2 Task Classification for Energy Minimization

In Algorithm 6, we used Algorithm 7 in Lines 5 and 15 to classify tasks of a graph into two groups, depending on the processor type they should be executed. Selecting the processor type to execute a task in an application is very important because different type of processors in a heterogeneous MPSoC have significantly different power and timing profiles. Algorithm 7 gives our task classification method. It takes a CSDF graph *G* and a throughput requirement  $\frac{\phi_{out}}{T_{out}}$  as inputs and it produces PE and EE subsets of tasks in *G*.

First, we sort the tasks in order of increasing workload assuming all of them are assigned to EE processors – see Line 1 in Algorithm 7. Then, with the sorted tasks, we use the hyperperiod  $r_{out} \cdot T_{out}$  as the classification threshold such that throughput requirement  $\frac{\phi_{out}}{T_{out}}$  is met and the energy consumption is minimized, and deploy a binary search algorithm in Line 2 to find the pivotal point by which we can split the sorted tasks into two sets, one for the EE type of processor and another for the PE type of processor. The goal is to put as many tasks as possible to EE processors to reduce the energy consumption while satisfying the throughput requirement. All the tasks, which do not

**Algorithm 7:** Procedure to classify tasks according to processor type.

**Input**: A CSDF graph  $G = (\mathcal{V}, \mathcal{E})$  and a throughput constraint  $\frac{\varphi_{out}}{T_{out}}$ . **Output**: Subsets  $\mathcal{V}^{PE}$  and  $\mathcal{V}^{EE} \subset \mathcal{V}$ .

- 1  $V \leftarrow$  Sort actors  $v_i$  in  $\mathcal{V}$  in increasing order of  $W_i^{EE}$ ;
- 2 *b* ← Binary search to find the position in V with the biggest index where actor  $v_i$  can meet  $W_i^{EE} \le r_{out}T_{out}$ ;
- $\mathcal{V}^{EE} \leftarrow \mathcal{V}[0:b];$
- $4 \mathcal{V}^{PE} \leftarrow \mathcal{V} \mathcal{V}^{EE};$
- 5 return  $\mathcal{V}^{EE}$ ,  $\mathcal{V}^{PE}$ ;

violate the throughput constraint, that is, the hyperperiod  $r_{out} \cdot T_{out}$ , when assigned to EE processors are classified as EE tasks, Line 3, and all the rest as PE tasks, Line 4. In this way we guarantee that the throughput requirement will be met while minimizing the energy consumption.

Since the sorting algorithm in Line 1 has the worst-case complexity of  $O(|\mathcal{V}| \log |\mathcal{V}|)$  and the worst-case complexity of the binary search in Line 2 is  $O(\log |\mathcal{V}|)$ , the worst-case complexity of Algorithm 7 is  $O(|\mathcal{V}| \log |\mathcal{V}|)$ .

#### 5.7.3 Task Mapping for Energy Minimization

In Algorithm 6, once the actors in a graph are classified by Algorithm 7 in Lines 5 and 15 into two sets of EE and PE actors, each set is mapped by Algorithm 8 in Lines 6 and 16 onto the corresponding type of clusters, EE and PE clusters, such that the energy consumption of the whole cluster heterogeneous MPSoC is minimized. Our algorithm of energy-efficient tasks mapping is given in Algorithm 8.

Algorithm 8 takes sets  $\mathcal{V}^{EE}$  and  $\mathcal{V}^{PE}$  of actors and a cluster heterogeneous MPSoC, and it returns the task mapping on processors in the clusters C, a vector of operating frequencies for clusters  $\vec{F}$  and the minimum energy consumption E. The authors in [AY03] showed that the most balanced workload distribution leads to the least energy consumption, and that the the most balanced distribution is obtained when the Worst-Fit Decreasing (WFD) heuristic [CGJ96] is used to allocate tasks to processors. Thus, in this work, we use the WFD heuristic for task allocation. First, Algorithm 8 checks in Lines 1 to 4 whether the input MPSoC has enough resource to map (allocate) and schedule the tasks by using the WFD allocation heuristic [CGJ96], applied among the processors of the same type, and a given per-processor schedulability test [LL73] when processors are running at the maximum available frequency for each processor type. If there is no enough EE type of processors,

**Algorithm 8:** Procedure to find the minimum energy when the given tasks are mapped onto a cluster heterogeneous MPSoC.

**Input**: Sets of actors  $\mathcal{V}^{EE}$  and  $\mathcal{V}^{PE}$  and a cluster heterogeneous MPSoC. Output: Task mapping to processor in the clusters C, vector of operating frequencies for clusters  $\vec{F}$  and the minimum energy consumption *E*. 1 if  $\mathcal{V}^{EE}$  cannot be scheduled on  $N_c^{EE} \times N_p^{EE}$  processors by WFD algorithm and max frequency  $f_{max}^{EE}$  then <sup>2</sup> Move some actors  $v_i \in \mathcal{V}^{EE}$  to PE set  $\mathcal{V}^{PE}$  in order of non-increasing  $u_i$  such that  $\mathcal{V}^{EE}$  is schedulable on  $N_c^{EE} \times N_p^{EE}$  processors; <sup>3</sup> **if**  $\mathcal{V}^{PE}$  cannot be scheduled on  $N_c^{PE} \times N_p^{PE}$  processors by WFD algorithm and max frequency  $f_{max}^{PE}$  then 4 **return**  $C \leftarrow \emptyset, \vec{F} \leftarrow \emptyset, E = \infty;$ 5 if  $|\mathcal{V}^{EE}| = 0$  then 6  $\int \mathcal{C}^{EE} \leftarrow \emptyset, \vec{F}^{EE} \leftarrow \emptyset, E^{EE} = 0;$ 7 else  $\begin{bmatrix} n_{lb}^{EE} = \left\lceil \frac{\left\lceil u^{EE} \right\rceil}{N_p^{EE}} \right\rceil, n_{ub}^{EE} = \min\left\{ \left\lceil \frac{\left| \mathcal{V}^{EE} \right\rceil}{N_p^{EE}} \right\rceil, N_c^{EE} \right\}; \\ \text{Find } \mathcal{C}^{EE}, \vec{F}^{EE}, E^{EE} \text{ by Algorithm 9}(n_{lb}^{EE}, n_{ub}^{EE}, \mathcal{V}^{EE}, \text{ Equation (5.5)}); \end{bmatrix}$ 10 if  $|\mathcal{V}^{PE}| = 0$  then 11  $\int \mathcal{C}^{PE} \leftarrow \emptyset, \vec{F}^{PE} \leftarrow \emptyset, E^{PE} = 0;$ 12 else  $n_{lb}^{PE} = \left\lceil \frac{[u^{PE}]}{N_p^{PE}} \right\rceil, n_{ub}^{PE} = \min\{\left\lceil \frac{[\mathcal{V}^{PE}]}{N_p^{PE}} \right\rceil, N_c^{PE}\};$ 14 Find  $C^{PE}$ ,  $\vec{F}^{PE}$ ,  $E^{PE}$  by Algorithm 9( $n_{lb}^{PE}$ ,  $n_{ub}^{PE}$ ,  $\mathcal{V}^{PE}$ , Equation (5.4)); 15  $C = \{C^{EE}, C^{PE}\}, \vec{F} = \{\vec{F}^{EE}, \vec{F}^{PE}\}, E = E^{EE} + E^{PE};$ 16 return  $C, \vec{F}, E;$ 

we select some actors from set  $\mathcal{V}^{EE}$  and assign them to set  $\mathcal{V}^{PE}$ . The actors are selected in order of decreasing utilization and the selection is terminated as soon as the tasks corresponding to actors in set  $\mathcal{V}^{EE}$  are schedulable on the EE processors. However, if there is no enough PE type of processors, that means the application is not schedulable on the input MPSoC. The algorithm terminates and signals the failure by returning an empty set for tasks-to-processors mapping C in Line 4. Line 5 checks if there are tasks that should be mapped on processors in EE clusters. If no task should be mapped to EE clusters, then EE clusters will not be used within the input MPSoC, hence they will not contribute to the total energy consumption, Line 6. Otherwise, the bounds on the number of active EE clusters are calculated in Line 8 and the energy consumption of mapping task set  $\mathcal{V}^{EE}$  to EE clusters is calculated in Line 9.

The lower bound  $n_{lb}^{EE}$  corresponds to the minimum possible number of active clusters to schedule the tasks because it is determined according to the ceiling of the utilization  $u^{EE}$  of EE tasks. The upper bound  $n_{ub}^{EE}$  is selected to be the minimum value among the case when tasks are mapped onto processors in one-to-one manner, and the case when all clusters available on the platform are active. We find the minimum energy for mapping the tasks on EE clusters by using Algorithm 9 (described later) in Line 9. Similarly, Line 10 checks whether there are tasks that should be mapped onto processors in PE clusters. If there are such tasks, lower and upper bounds of active PE clusters are calculated in Line 13 and the minimum energy for mapping the tasks on PE clusters by using Algorithm 9 is obtained in Line 14. Finally, the EE solution and the PE solution mappings are grouped together in Line 15 and the integral solution mapping of the given tasks onto the given MPSoC which results in minimum energy consumption is returned in Line 16 of Algorithm 8.

Within Algorithm 8, described above, Algorithm 9 is used to map the tasks which are in the same group, EE or PE, such that the energy consumption is minimized. Algorithm 9 takes the bounds on the number of active clusters of certain type (PE or EE),  $n_{lb}$  and  $n_{ub}$ , tasks  $\mathcal{V}$  that are going to be mapped onto PE/EE clusters, the corresponding equation, Equation (5.4) or (5.5) – see Section 5.6, for the calculation of the energy consumption and returns the task partitions among the processors in the clusters  $C_{best}$  and a vector of operating frequencies for clusters  $\vec{F}_{hest}$  which lead to the minimal energy consumption  $E_{best}$ . In Lines 2 to 15 in Algorithm 9, the best task mapping and the frequency assignment is determined among different number of active clusters in the range from  $n_{lb}$  to  $n_{ub}$ . For each number of active clusters  $n, n \in [n_{lb}, n_{ub}]$ , the algorithm in Line 4 performs the WFD allocation heuristic [CGJ96] and uses a given per-processor schedulability test [LL73] to check the schedulability of the tasks. In this way, we want to achieve load balancing among the processors of the same type. If all tasks are allocated on processors, Line 5, we group processors into clusters according to their workload such that all processors in one cluster run at the frequency which matches their workload as much as possible. This is done in Lines 6 and 7, where processors  $\pi_i \in \Pi$  are first sorted in non-increasing order of their workload, that is, their utilization  $u_i$ , and then starting from the processor with the highest utilization, every  $N_v$ processors are grouped into a cluster. For each cluster, we select the smallest frequency which guarantees the schedulability and is supported by the cluster type, that is, it is in the set  $\mathcal{F}$  of available frequencies, Lines 9 to 11. The energy consumption of the mapping is calculated in Lines 12 and 13 of Algorithm 9. In Lines 14 and 15, we check whether the energy consumption obtained by

**Algorithm 9:** Procedure to find the minimum energy when the given tasks are mapped onto the same type of clusters.

**Input**: Lower  $n_{lb}$  and upper  $n_{ub}$  bound on the number of clusters, set  $\mathcal{V}$  of tasks that should be mapped onto clusters, equation *Eq*. for calculating the energy consumption. **Output**: Task mapping to processor in the clusters  $C_{best}$ , vector of operating frequencies for clusters  $\vec{F}_{best}$  and the minimum energy consumption  $E_{best}$ . 1  $E_{best} = \infty, F_{best} \leftarrow \emptyset, C_{best} \leftarrow \emptyset;$ <sup>2</sup> for  $n = n_{lb}$  to  $n_{ub}$  do Create a set  $\Pi$  of  $n \times N_p$  empty processors,  $\forall \pi_i \in \Pi : u_i = 0$ ; 3 Perform WFD allocation heuristic and a corresponding schedulability test for all 4 tasks in  $\mathcal{V}$ : **if** all  $v_i \in \mathcal{V}$  can be scheduled on  $\Pi$  **then** 5  $\Pi \leftarrow \text{Sort } \Pi \text{ in non-increasing order of } u_i;$ 6  $C \leftarrow$  group every  $N_p$  processors in  $\Pi$  to a cluster  $C_k, k \in [1, n]$ ; 7  $E = 0, F_k = 0, k \in [1, n];$ 8 for cluster  $C_k \in C$  do 9 Find processor  $\pi_i \in C_k$  with the highest utilization  $u_i$ ,  $u_{max} = u_i$ ; 10 Compute frequency of  $C_k$  as  $F_k \ge u_{max} \cdot f_{max} \wedge F_k \in \mathcal{F}$ ; 11 Calculate energy  $E_k$  for cluster  $C_k$  by using Eq.; 12  $E = E + E_k;$ 13 if  $E < E_{best}$  then 14  $E_{best} = E, \vec{F}_{best} \leftarrow \vec{F}, C_{best} \leftarrow C;$ 15 16 return  $E_{best}$ ,  $F_{best}$ ,  $C_{best}$ ;

mapping the tasks on the current number of active clusters n is the smallest one obtained so far. If that is the case, the mapping on the current number of active clusters becomes the best mapping solution. Finally, in Line 16, Algorithm 9 returns the minimum energy  $E_{best}$  obtained after mapping the tasks on clusters of the same type, the frequency assignment  $\vec{F}_{best}$  for clusters and the cluster partitions  $C_{best}$ .

Let us now analyze the time complexity of Algorithm 9 and Algorithm 8 in the worst case. The complexity of Algorithm 9 is determined by the **for loop** in Lines 2 to 15. Due to the sorting algorithms used within the WFD heuristic, in Lines 4, and in Line 6, the complexity of Algorithm 9 is  $O(N_c|\mathcal{V}|\log|\mathcal{V}|)$ , where  $N_c$  is the number of active clusters. The worst-case complexity of Algorithm 8 is then determined by Line 2, which is executed in the worst case  $|\mathcal{V}|$  times, and every time the WFD allocation heuristic is applied, thus the complexity of Algorithm 8 is  $O(|\mathcal{V}|^2 \log |\mathcal{V}|)$ .

| Benchmark                       |     | $ \mathcal{E} $ | $\mathcal{R}_{out}[1/\text{time unit}]$ |
|---------------------------------|-----|-----------------|-----------------------------------------|
| Discrete cosine transform (DCT) |     | 7               | 1/47616                                 |
| Fast Fourier transform (FFT)    |     | 16              | 1/12032                                 |
| Filterbank                      |     | 99              | 1/11312                                 |
| Time delay equalization (TDE)   |     | 28              | 1/36960                                 |
| Data encryption standard (DES)  | 53  | 60              | 1/1024                                  |
| Serpent                         | 120 | 128             | 1/3336                                  |
| Bitonic Sorting                 | 40  | 46              | 1/95                                    |
| MPEG2                           | 23  | 26              | 1/7680                                  |
| Vocoder                         | 114 | 147             | 1/9105                                  |
| FMRadio                         | 43  | 53              | 1/1434                                  |
| Channel Vocoder                 | 55  | 70              | 1/35500                                 |

Table 5.2: Benchmarks used for evaluation.

## 5.8 Evaluation

We have performed three experiments to evaluate the efficiency of our DPEM approach in comparison to the related energy minimization approaches in [CKR14], [LSCS15], [SDK13] and [Lee09]. We have selected the approaches in [CKR14] and [LSCS15] for comparison because they consider the same task and system models as we do. We selected to compare with the approach in [SDK13] because it is a very good representative among the approaches for energy-efficient mapping and scheduling of streaming applications modeled as SDF graphs. Finally, we compare our approach with the approach in [Lee09] which is the only approach among the related approaches which considers task replication for energy minimization for classical periodic real-time tasks. In the first two experiments, we compare the approaches when the streaming applications are executed on a cluster heterogeneous platform. We apply our task classification method, given in Algorithm 7, for the approaches in [SDK13] and [Lee09] which were originally devised for homogeneous platforms and then we apply these approaches on the two sets of tasks, PE and EE, obtained by the classification. Since two of the related approaches, [SDK13] and [Lee09], originally consider homogeneous platforms with per-core VFS capability, in the third experiment, we compare our approach with these related approaches on this type of platform.

The experiments have been performed on the real-life applications from the StreamIt benchmarks suit [TA10], given in Table 5.2.  $|\mathcal{V}|$  denotes the number of actors in an SDF graph, while  $|\mathcal{E}|$  denotes the number of communication channels.  $\mathcal{R}_{out}$  is the maximum achievable throughput, computed by using Equation (3.5) and (3.22), when the applications are scheduled by the ISPS approach described in Chapter 3. We consider these throughput values as the

throughput constraints in our experiments.

In the experiments on heterogeneous MPSoC platforms, we consider the same MPSoC platforms considered in [LSCS15]. These platforms have the same number of PE processors and EE processors but they have different cluster granularities, that is, different number of processors per cluster, and hence, different number of clusters. We use the same MPSoC notation MPSoC x\_pe\_ee as in [LSCS15]. For example, MPSoC\_2\_20\_28 corresponds to an MPSoC platform with 2 processors per cluster, 20 PE clusters and 28 EE clusters. The approaches in [CKR14], [LSCS15] and [Lee09] use hard real-time scheduling algorithms to schedule the tasks on an MPSoC while the approach in [SDK13] uses self-timed scheduling. The application tasks are permanently assigned to processors in [CKR14], [Lee09] and [SDK13], while in [LSCS15], the tasks are permanently assigned to clusters, but within a cluster tasks are scheduled by a global scheduling algorithm, hence, they can migrate. In the experiments, we use the EDF [LL73] scheduling algorithm within our DPEM approach which is also used in [CKR14] and [Lee09]. In all experiments, we use the power parameters in [LSCS15] obtained from real measurements performed on the ODROID XU-3 [ODR] board. The results of the evaluations are shown in Figure 5.3, Figure 5.4 and Figure 5.5. In all these figures, we show the energy reduction obtained by our DPEM approach in comparison with the related approaches. The energy reduction *r* is computed by:

$$r = \frac{E_{rel} - E_{DPEM}}{E_{rel}},\tag{5.6}$$

where  $E_{rel}$  is the energy consumption of an application to MPSoC mapping configuration obtained by a related approach and  $E_{DPEM}$  denotes the energy consumption achieved by our DPEM approach.

#### 5.8.1 Comparison with [CKR14], [LSCS15], [SDK13] on Heterogeneous MPSoCs

In this section, we compare the energy consumption on cluster heterogeneous MPSoCs obtained by our proposed DPEM approach with the energy consumption delivered by the related approaches which do not consider task replication [CKR14] – CKR, [LSCS15] – FDM, [SDK13] – SDK.

The comparison results with the CKR, FDM and SDK approaches on the three considered heterogeneous MPSoCs are given in Figure 5.3(a)-5.3(c). In each of these figures, the x-axis shows the application benchmarks and the y-axis shows the energy reduction. Both approaches CKR and FDM are devised for cluster heterogeneous MPSoCs and both of them use preemptive





**Figure 5.3:** Comparison of our proposed DPEM approach with related approaches on heterogeneous MPSoCs.

hard real-time scheduling algorithms, which is also the case in our DPEM algorithm. We can see in Figure 5.3 that our DPEM approach reduces the energy consumption when compared to CKR and FDM for all but two considered benchmarks. The two benchmarks for which our approach results in the same energy consumption as CKR and FDM are Filterbank and Channel **Vocoder**. The workload which these two benchmarks put on the considered MPSoCs is balanced among processors, hence our approach will not replicate tasks of these benchmarks which leads to the same energy consumption as obtained by the CKR and FDM approaches. The average energy reduction of our approach when compared to the CKR approach is 62%, 62% and 63.1% for the three MPSoCs with 2, 4 and 8 processors per cluster, respectively. When compared to the FDM approach the corresponding average energy reductions are 61.6%, 61.4% and 61.6%. When compared to the SDK approach, our approach achieves energy reduction for all benchmarks because we use both task replication and preemptive scheduling. Note that we only use the design-time phase in the SDK approach for the comparison because our approach is a design-time approach. Our approach obtains on average the energy reduction of 65%, 65.1% and 66% for the three MPSoCs with 2, 4 and 8 processors per cluster, respectively, when compared to the SDK approach. We can conclude from these results that our approach achieves large energy reduction by utilizing task replication.

#### 5.8.2 Comparison with [Lee09] on Heterogeneous MPSoCs

In this section, we compare the energy consumption on cluster heterogeneous MPSoCs of our DPEM approach with the related approach in [Lee09], denoted by WYL, which considers task replication as well. The results are given in Figure 5.4. Here again, both approaches will not replicate tasks in Filterbank and Channel Vocoder and hence both approaches will lead to the same energy consumption in these two cases. Given that the task classification in the WYL approach is based on the power consumption curve of a processor, the WYL approach will never replicate tasks assigned to EE processors. In addition, the WYL approach will never replicate the tasks of an application once the total number of heavy tasks is equal to the number of processors on an MPSoC platform. All these limitations of WYL explain the energy reduction achieved when our approach is used to map the benchmarks in Table 5.2 onto the three considered MPSoCs. The average energy reduction obtained by our DPEM approach is 51.3%, 57.2% and 60.7% for the MPSoCs with 2, 4 and 8 processors per cluster, respectively.



Figure 5.4: Comparison between DPEM and WYL on heterogeneous MPSoCs.

#### 5.8.3 Comparison on Homogeneous MPSoC

Given that both the SDK and WYL approaches were originally proposed for homogeneous platforms with per-core VFS capability, in this section, we compare the energy consumption on such systems when our DPEM approach is used with the energy consumption values when the SDK and WYL approaches are used. The results of the energy reduction on a homogeneous MPSoC platform consisting of 96 PE processors with per-core VFS capability are given in Figure 5.5. Here, we also give the results of energy reduction when our DPEM approach is compared with the CKR and FDM approaches for completeness.

The benchmarks Filterbank and Channel Vocoder were the only two benchmarks for which our approach could not obtain any reduction in energy consumption on heterogeneous MPSoCs when compared to the approaches which use hard real-time scheduling algorithms – CKR, FDM and WYL. In the case of a homogeneous platform, we can see in Figure 5.5 that there is still no difference in energy consumption between our DPEM and the CKR, FDM and WYL for the Filterbank benchmark. This happens because when mapped onto a homogeneous MPSoC, Filterbank has balanced workload among the processors, hence both our DPEM and the WYL approaches will not replicate tasks. However, in the case of the Channel Vocoder benchmark, we see in Figure 5.5 that the situation changes, that is, now there is a reduction in the energy consumption when our approach is compared to the CKR and FDM, because our approach will replicate tasks to balance the workload of Channel



Figure 5.5: Comparison on homogeneous MPSoC.

Vocoder on a homogeneous platform. The WYL approach will replicate tasks as well, leading to the same energy consumption as obtained by our DPEM approach. Although the WYL approach was devised for homogeneous platforms with types of processors which match the PE type and with per-core VFS capability, still our DPEM approach outperforms the WYL approach by reducing the energy consumption on average by 10.4%, and in the best case up to 22%. The reason is that our task replication procedure is more flexible than the procedure in the WYL approach.

When compared to the another approach devised for homogeneous MP-SoCs with per-core VFS capability, that is, the SDK approach, our DPEM approach leads to an energy reduction of 36% on average and up to 90% in the best case. The reason is that our approach replicates tasks to lower the utilization per-processor, and hence, lower operating frequencies can be achieved. In addition, the SDK approach minimizes only the dynamic energy consumption and uses non-preemptive scheduling which both lead to higher total energy consumption.

Finally, when compared to the CKR and FDM approaches on a homogeneous platform, our DPEM approach delivers systems with energy reduction of 21.2% and 25.6% on average, respectively. The main reason is the task replication which our approach uses to lower the utilization per processor while keeping the application throughput.

We performed an additional experiment to evaluate the influence of the

number of processors in an MPSoC on the energy reduction of our DPEM approach in comparison with the related approaches. In this experiment, beside the MPSoC platform with 96 PE processors, we considered two additional platforms with 48 and 192 PE processors with per-core VFS capability. On the 48-processor platform, our DPEM approach resulted in the energy reduction of 6%, 18.5%, 20.5% and 30.3% when compared with the WYL, CKR, FDM and SDK approaches, respectively. In comparison with the same related approaches, our DPEM approach obtains on the 192-processor platform the following energy reductions – 10.7%, 24.9%, 29.4% and 39.8%. We can conclude that the energy reduction of our approach with regard to the related approaches slowly increases with the increase of the number of processors in the platform.

#### 5.8.4 Overhead and Time Complexity Analysis

In this section, we briefly discuss the code and data memory overhead of our approach when compared to the related approaches and the time complexity of our and the related approaches. The code and data memory overhead of our approach on heterogeneous platforms when compared to the WYL approach is 2 times higher on average, and 2.3 times higher on average than the approaches which do not consider task replication, that is, approaches CKR, FDM and SDK. The memory overhead of our DPEM approach on the homogeneous platform is 16% higher on average when compared to the WYL approach, and 85% higher on average when compared to the CKR, FDM and SDK approaches. Given that the actual memory increase in the worst case is 213 KB and given the size of memory available in modern embedded systems, we can conclude that the memory overhead introduced by our approach is acceptable.

The time complexity in the worst-case of our DPEM approach and the approaches CKR, FDM and WYL is polynomial, while the worst-case time complexity of the SDK approach is exponential. In the worst-case, our approach needs 62 minutes, the WYL approach needs 5 minutes, the CKR approach takes 11 minutes, the FDM less than 1 second and the SDK approach needs 6 days to find an energy-efficient solution. Given that our DPEM approach is a design-time approach and that it delivers solutions of better quality, we can conclude that our approach outperforms the related approaches.

## 5.9 Discussion

In this chapter, we proposed a novel energy minimization mapping approach to reduce the energy consumption of embedded multiprocessor streaming systems with throughput constraints. To map energy-efficiently an SDF graph onto cluster-heterogeneous MPSoC, our polynomial-time solution approach: 1) determines a processor type for each task in an SDF graph such that the throughput constraint is met and the energy consumption is minimized; 2) determines a replication factor for each task in an SDF graph such that the distribution of the workload on the same type of processors is balanced, which enables processors to run at a lower frequency, hence reducing the energy consumption. The experiments on a set of real-life streaming applications showed that our approach reduces energy consumption by 66% on average among all the performed experiments while meeting the same throughput requirement when compared to related energy minimization mapping approaches.