Skip to content

Latest commit

 

History

History
215 lines (108 loc) · 84.3 KB

Exploring-Exploration.md

File metadata and controls

215 lines (108 loc) · 84.3 KB

Exploring Exploration: A Tutorial Introduction to Embedded Systems Design Space Exploration

Andy D. Pimentel @ University of Amsterdam

Designers of modern embedded systems face several daunting challenges since these systems typically have to meet a range of stringent, and often conflicting, design requirements. As many embedded systems target mass production and battery-based devices or devices that cannot use active cooling, they should be cheap and power efficient. At the same time, a great deal of these systems must, increasingly, support multiple applications and standards for which they need to provide real-time performance. For example, mobile devices must support different standards for communication and coding of digital contents. Furthermore, modern embedded systems also need to be reliable as well as flexible such that they can easily be updated and extended with future applications and standards. The latter calls for a high degree of programmability of these systems, whereas performance, power consumption, and cost constraints require implementing substantial parts of these systems in dedicated hardware blocks. As a result, modern embedded systems often have a heterogeneous multiprocessor system architecture. They consist of processors that range from fully programmable cores to fully dedicated hardware blocks for time-critical application tasks. Increasingly, the components in such systems are integrated onto a single chip, yielding heterogeneous multiprocessor system-on-chip (MPSoC) architectures [1].

现代嵌入式系统的设计者面临几个严峻的挑战,因为这些系统一般需要满足几个苛刻的,通常是冲突的,设计需求。因为很多嵌入式系统面向大规模制造,和基于电池的设备,或不能进行主动散热的设备,这样的系统应该是廉价的,功耗低的。同时,这些系统大部分需要支持多个应用和标准,需要提供实时的性能。比如,移动设备必须支持通信和数字内容编码的不同标准。而且,现代嵌入式系统也需要既可靠,又灵活,这样可以很容易的随着未来的应用和标准进行更新和拓展。后者要求这些系统必须有很高的可编程性,而性能,功耗和代价约束要求在专用的硬件模块中实现这些系统的核心部分。结果是,现代嵌入式系统通常有异质多处理器系统架构。他们包含的处理器,从完全可编程的核,到完全专用的硬件模块,用于对时间要求很高的应用任务。这种系统中的组成部分,越来越多的集成到单个芯片中,产生异质多处理片上系统架构。

To cope with the design complexity of such systems, we have witnessed the emergence of a new design methodology in the past 15–20 years, called electronic system-level (ESL) design. It aims at raising the level of abstraction of the design process to improve the design productivity. Key enablers to this end are the use of architectural MPSoC platforms to facilitate reuse of IP components and the concept of high-level system modeling and simulation [2], [3]. The latter allows for capturing the behavior of platform components and their interactions at a high level of abstraction. As such, these high-level models minimize the modeling effort and are optimized for execution speed, and can therefore be applied during the very early design stages to perform design space exploration (DSE) [4]. During such DSE, a large variety of different design alternatives can be explored, such as the number and type of processors deployed in the platform architecture, the type of interconnection network used to connect system components, or the spatial binding and temporal binding (i.e., scheduling) of application tasks to processor cores. This process of early DSE is of paramount importance as the considered design choices may heavily influence the success or failure of the final product. However, the process of DSE also is highly challenging because the design space that needs to be explored typically is vast, especially during the early stages of design. For instance, the design space for exploring different mappings of application tasks to processing resources and trying to optimize the mapping for, e.g., performance or power consumption exponentially grows with the number of application tasks and processors, and is generally considered to be an NP-hard problem [5]. Therefore, the development of efficient and effective DSE methods has received significant research attention in recent years. In this article, we will provide a tutorial introduction to the topic of embedded systems DSE.

为处理这样系统的设计复杂度,我们在过去15-20年中看到了一种新的设计方法的兴起,称为电子系统级(ESL)的设计。其目标是提升设计过程的抽象层次,以改进设计的生产力。使这种技术成为可能的关键,是使用了架构MPSoC平台,促进了IP组成部分的重用,和高层次系统建模和仿真的概念。后者可以捕获平台组成部分的行为,以及其互动,并在较高的抽象层次上。这样,这些高层次模型使建模的工作最小化,而且对执行速度进行了优化,可以在设计的非常早期就进行应用,以进行设计空间探索[4]。在这样的DSE中,可以探索大量不同的设计替代品,比如部署在平台架构中的处理器的数量和类型,用于连接系统组成部分的互联网络的类型,或应用任务在处理器核上的空间绑定和时间绑定(即,调度)。早期DSE的这个过程,是具有很大的重要性的,因为考虑的设计选项可能会严重影响最终产品的成功或失败。但是,DSE的过程也是非常有挑战的,因为要探索的设计空间,一般是很大的,尤其是在设计的早期阶段。比如,探索应用任务到处理资源的不同映射的设计空间,以及尝试优化这个映射对性能或功耗,这随着应用任务和处理器的数量的增长而呈指数级增长,一般认为是一个NP难题。因此,开发高效有效的DSE方法,最近几年收到了明显的研究关注。本文中,我们会给出嵌入式系统DSE的教程式介绍。

1. DSE: Basic concepts

During the DSE of embedded systems, multiple optimization objectives, such as performance, power/energy consumption, and cost, should be considered simultaneously. This is called multiobjective DSE. Since the objectives are often in conflict, there cannot be a single optimal solution that simultaneously optimizes all objectives. Therefore, optimal decisions need to be taken in the presence of tradeoffs between design criteria.

在嵌入式系统的DSE过程中,多个优化目标,如性能,功耗,价格,应当同时被考虑。这称为多目标DSE。由于目标通常是冲突的,不可能有单个最优解同时最优化所有目标。因此,最优决策应当是存在设计规则之间的折中的。

Given a set of m decision variables, which are the degrees of freedom (e.g., MPSoC system parameters like the number and type of processors, application mapping, etc.) that are explored during DSE, a so-called fitness function must optimize the n objective values. The fitness function is defined as

给定m个决策变量的集合,这是设计自由度(如,MPSoC系统参数,如处理器的数量和类型,应用映射,等),是DSE需要探索的,一个所谓的适应度函数需要优化n个目标值。适应度函数定义为

$$f_i: R^m → R^1$$(1)

A potential solution x ∈ R^m is an assignment of the m decision variables. The fitness function f_i translates a point in the solution space X into the i-th objective value (where 1 ≤ i ≤ n). For example, a particular fitness function f_i could assess the performance or energy efficiency of a certain solution x (representing a specific design instance). The combined fitness function f(x) subsequently translates a point in the solution space into the objective space Y. Formally, a multi-objective optimization problem (MOP) that tries to identify a solution x for the m decision variables that minimizes the n objective values using objective functions f_i with 1 ≤ i ≤ n

一个可能的解x ∈ R^m是m个决策变量的指定值。适应度函数f_i将解空间X的一个点映射到第i个目标值,1≤i≤n。比如,一个特定的适应度函数f_i可以评估一个特定解x(表示一个特定的设计实例)的性能或功耗。结合的适应度函数f(x)将解空间的一个点,映射到目标空间Y中。形式上,一个多目标优化问题,试图发现一个解x,对m个决策变量,最小化n个目标值,使用的目标函数为f_i,1≤i≤n

$$Minimize y = f(x) = (f_1(x), f_2(x),..., f_n(x)), where x = (x_1, x_2, ..., x_m)∈ X, y = (y_1, y_2, ..., y_n)∈ Y$$

In the remainder of this discussion, we assume a minimization procedure, but without loss of generality, this minimization procedure can be converted into a maximization problem by multiplying the fitness values y_i with −1.

在本文剩下部分,我们假设一个最小化过程,但不失一般性,这个最小化过程可以转换到最大化问题,只要将适应度值y_i乘以-1。

With an optimization of a single objective, the comparison of solutions is trivial. A better fitness (i.e., objective value) means a better solution. With multiple objectives, however, the comparison becomes nontrivial. Take, for example, two different MPSoC designs: a high-performance MPSoC and a slower but much cheaper MPSoC. In case there is no preference defined with respect to the objectives and there are also no restrictions for the objectives, one cannot say if the high-performance MPSoC or the low-cost MPSoC is better. A MOP can have even more different objectives, like the performance, energy consumption, cost, and reliability of an MPSoC-based embedded system. To compare different solutions in the case of multiple objectives, the Pareto dominance relation is typically used. Here, a solution x_1 ∈ X is said to dominate solution x_2 ∈ X if and only if x1 < x2

在单目标优化的情况下,解的对比是无意义的。更好的适应度(即,目标值)意味着更好的解。但是,在多目标的情况下,比较就变得有意义了。比如,有两个不同的MPSoC设计:一个高性能的MPSoC,和一个更慢但是更便宜的MPSoC。在对目标函数没有倾向性,而且对目标函数没有限制的情况下,不能说是高性能的MPSoC好,还是价值低的MPSoC好。一个多目标优化问题甚至可以有更多不同的目标,比如一个基于MPSoC的嵌入式系统的性能,功耗,价格和可靠性。为在多目标的情况下比较不同的解,通常是使用Pareto占优关系。这里,在下列条件成立的情况下,并且只在这个情况下,一个解x_1 ∈ X对另一个解x_2 ∈ X占优

$$x_1 < x_2 ⇔ ∀i ∈ {1,2,…,n}: f_i(x_1) ≤ f_i(x_2) ∧ ∃i ∈ {1,2,…,n}: f_i(x_1) < f_i(x_2)$$

Hence, a solution x_1 dominates x_2 if its objective values are superior to the objective values of x_2. For all of the objectives, x_1 must not have a worse objective value than solution x_2. Additionally, there must be at least one objective in which solution x_1 is better (otherwise they are equal).

因此,一个解x_1对x_2占优,如果其目标值优于x_2的目标值。对于所有目标值,x_1都不能劣于x_2的目标值。另外,至少有一个目标值上,解x_1是更好的(否则它们就是相等的)。

An example of the dominance relation is given in Figure 1, which illustrates a 2-D MOP. For solution H, the dominance relations are shown. Solution H is dominated by solutions B, C, and D as all of them have a lower value for both f1 and f2. On the other hand, solution H is superior to solutions M, N, and O. Finally, some of the solutions are not comparable to H. These solutions are better for one objective but worse for the other.

图1给出了占优关系的一个例子,描述的是2D多目标优化问题。给出了解H的占优关系。解H被解B,C,D占优,因为它们的f1和f2值都要更低。另一方面,解H对解M,N,O占优。最后,一些解与H不能比较。这些解在一个目标函数上要更好,但是另一个目标函数则更差。

The Pareto dominance relation only provides a partial ordering. For example, the solutions A to F of the example in Figure 1 cannot be ordered using the ordering relation. Since not all solutions x ∈ X can be ordered, the result of a MOP is not a single solution, but a front of nondominated solutions, called the Pareto front. A set X' is defined to be a Pareto front of the set of solutions X as follows:

Pareto占优关系只提供了一个偏序。比如,图1中的解A-F不能用排序关系进行排序。由于并不是所有的解x ∈ X都可以被排序,多目标优化问题的结果并不是一个解,而是不被占优解的front,称为Pareto front。集合X'定义为解X集合的Pareto front如下:

$${ x ∈ X' | \nexists x_i ∈ X: x_i < x}$$

The Pareto front of Figure 1 contains six solutions: A − F. Each of these solutions does not dominate the other. An improvement on objective f1 is matched by the worse value for f2. Generally, it is up to the designer to decide which of the solutions provides the best tradeoff.

图1中的Pareto front包含6个解:A - F。这些解中的每个都不对其他解占优。目标f1得到了改进,则f2就会恶化。一般来说,要由设计者来决定哪个解给出了最好的tradeoff。

The search for Pareto optimal design points with respect to multiple design criteria entails two distinct elements [4]: 1) the evaluation of a single design point using the fitness function(s) regarding all the objectives in question like system performance, power/energy consumption, and so on; and 2) the search strategy for covering the design space during the DSE process. Figure 2 shows a simple taxonomy for DSE approaches, recognizing the two DSE elements as well as different properties of these DSE elements. As will be discussed in more detail later on, there usually exists a tradeoff between the accuracy and speed with which the fitness of single design points can be evaluated. In addition to this, the various fitness evaluation techniques also differ with respect to the implementation effort and the capability of evaluating the fitness for a wide range of systems, involving issues such as modularity, reusability of models etc.

对多个设计准则来搜索Pareto最优设计点,涉及两个清晰的元素:1)使用适应度函数对所有目标的单个设计点的评估,包括系统性能,功耗,等等;2)在DSE过程中覆盖设计空间的搜索策略。图2给出了DSE方法的简单taxonomy,包括两个DSE元素,以及这些DSE元素的不同性质。后面会详细讨论,评估单个设计点的时候,适应度函数计算的准确率和速度通常会存在tradeoff。除了这个,不同的适应度评估技术的不同之处还在于,实现的工作量,评估很多系统的适应度的能力,包括模块性,模型的可重用性,等。

Regarding the search strategy element of DSE, the confidence characteristic denotes how certain we are that the design points returned by the DSE include the true optimum, or alternatively, how close they are to the true optimum. In many search algorithms, confidence is obtained by avoiding local optima and ensuring sufficient design space coverage. Clearly, an exhaustive search in which every single point in the design space is evaluated and compared would provide a 100% confidence. However, such exhaustive search is usually prohibitive due to the sheer size of the design space. In those cases, heuristic search techniques can be used to search the design space for optimal solutions using only a finite number of design point evaluations. The convergence property denotes the speed of evaluating a range of design points, and, more specifically, the rate at which the DSE search algorithm manages to converge to an optimum. Finally, analogous with the effort property in the case of evaluating a single design point, the effort for searching the design space refers to the implementation of the search method and setting its parameters, as well as setting up, running, and evaluating the results of the exploration experiment. In the two subsequent sections, we will provide a more detailed overview of the different techniques, and their properties, applied in each of the two elements of DSE.

至于DSE的搜索策略元素,置信度特征表示,DSE返回的设计点包括真正的最优设计点的确定程度,或者,它们距离真正的最优值有多近。在很多搜索算法中,得到置信度,是通过避免局部最优值,确保足够的设计空间覆盖。很明显,穷举式搜索会得到100%的置信度,设计空间中的每个点都会被评估和比较。但是,这种穷举式搜索通常是不可行的,因为设计空间的规模太大。在这些案例中,启发式搜索技术可以用于搜索设计空间的最优解,只使用有限数量的设计点评估。收敛性质表示的是,评估一定范围内的设计点的速度,更具体的,DSE搜索算法收敛到最优值的速度。最终,与评估单个设计点的情况下的effort性质类似的,搜索设计空间的effort,指的是搜索方法的实现,以及设置其参数,以及探索试验的设置,运行和结果评估。在后面两个部分中,我们会给出不同技术及其性质的更详细的概览,包括DSE的两个不同的元素。

2. Evaluation of a single design point

Methods for evaluating the fitness of a single design point in the design space roughly fall into one of three categories: 1) measurements on a (prototype) implementation; 2) simulation-based evaluations; and 3) estimations based on some kind of analytical model. Each of these methods has different properties with regard to evaluation time and accuracy. Evaluation of prototype implementations provides the highest accuracy, but long development times prohibit evaluation of many design options. Analytical estimations are considered the fastest, but accuracy is limited since they are typically unable to sufficiently capture particular intricate system behavior. Simulation-based evaluation fills up the range in between these two extremes: both highly accurate (but slower) and fast (but less accurate) simulation techniques are available. This tradeoff between accuracy and speed is very important, since successful DSE depends both on the ability to evaluate a single design point as well as being able to efficiently search the entire design space. As current DSE efforts in the domain of embedded systems design typically use simulation or analytical models to evaluate single design points, the remainder of this section will focus on these methods.

评估设计空间中的单个设计点的适应度的方法,大致可以分为三类:1)在一个(原型)实现上度量;2)基于仿真的评估;3)基于某种解析模型的估计。每种方法在评估时间和准确率上都有不同的性质。原型实现的评估给出了最高的准确率,但开发时间过长使很多设计选项的评估不可能。解析估计一般是最快的,但准确率有限,因为一般不能充分捕获特定的复杂系统行为。基于仿真的评估在这两种极端之间:有很准确但比较慢的仿真技术,也有快速但不那么准确的仿真技术。这种准确率和速度之间的tradeoff非常重要,因为成功的DSE依赖于两点,既有评估单个设计点的能力,也有高效的搜索整个设计空间。由于在嵌入式系统设计领域,目前的DSE工作一般都使用仿真或解析模型来评估单个设计点,本节目剩下的部分就关注这些方法。

2.1 Simulative fitness evaluation

Simulating system components can, as was already mentioned above, be performed at different levels of abstraction. The higher the abstraction level, the less intricately the system components are modeled and, therefore, the higher the simulation speed is. Evidently, such efficiency improvements come at the cost of a less accurate fitness estimation because of the fact that particular system details are not taken into account. This simulation speed-accuracy tradeoff is shown in Figure 3. This figure depicts several widely used simulation abstraction levels, and it does so for both the simulation of processor components as well as the simulation of communication between system components.

上面已经提到,仿真系统组成部分可以在不同的抽象层次上进行。抽象层次越高,系统组成部分建模的就没那么复杂,因此,因此仿真速度就越快。明显的,这种效率提升的代价是,适应度估计就没那么准确了,因为特定的系统细节没有进行考虑。这种仿真的速度-准确率tradeoff如图3所示。图中给出了几个广泛使用的仿真抽象层次,对于处理器组成部分进行仿真,也对系统组成部分的通信进行仿真。

For both the simulation of processor and communication components, the lowest level of abstraction for simulating a digital system is the register-transfer level (RTL). At this level of abstraction, the flow of digital signals between registers and combinational logic is explicitly simulated. This yields a highly accurate but also very slow simulation. As a result, the use of RTL simulation in the process of DSE is confined to only relatively small and narrow design spaces focusing on, for example, the design of one specific system component. Performing system-level DSE is infeasible using RTL simulation.

对于处理器和通信组成部分的仿真,仿真一个数字系统最低层的抽象是RTL。在这个抽象层次,数字信号在寄存器和组合逻辑之间的流动是显式仿真的。这得到的是非常准确,但是也很慢的仿真。结果是,在DSE的过程中使用RTL仿真,就只能限定在相对较小和窄的设计空间中,比如说,聚焦在一个特定系统组成部分的设计。使用RTL仿真进行系统级DSE是不可行的。

Raising the level of abstraction, one can simulate system components at the cycle accurate level. This means that the system components are simulated on a cycle-by-cycle basis and, as such, that the simulated system state conforms to the cycle-by-cycle behavior of the target design. This results in more efficient simulations as compared to RTL simulation at the cost of a somewhat reduced accuracy since the system state between cycles is not accounted for. Cycle-accurate simulation is a popular technique for simulating microprocessors: so-called cycle-accurate instruction set simulation (ISS). These ISS simulators try to capture the cycle-by-cycle behavior of the microarchitectural components of a microprocessor, such as the pipeline logic, out-of-order processing, branch predictors, caches, and so on. To account for power consumption behavior, ISS simulators often use activity-based power models that accumulate the power consumption of the relevant microarchitecture components based on their activity ratio. A good example is the widely used cycle-accurate Gem5 ISS [7], which can be extended to also support area and power predictions using activity-based modeling frameworks such CACTI [8] and McPAT [9]. Although these ISS simulators can be deployed to perform microarchitectural DSE for processor components, they are typically still too slow for performing full system-scale DSE.

提高抽象的级别,可以在周期准确的级别仿真系统组成部分。这意味着系统组成部分在每个周期的基础上进行仿真,这样,仿真的系统状态符合目标设计每周期的行为。与RTL仿真相比,这得到的仿真更加高效,代价是准确率有一定的下降,因为周期之间的系统状态并没有进行考虑。周期准确的仿真是仿真微处理器的流行技术:所谓的周期准确的指令集仿真(ISS)。这些ISS仿真器尝试捕获一个微处理器的微架构组成部分的周期到周期的行为,比如流水线逻辑,乱序处理,分支预测器,缓存,等等。为考虑功耗行为,ISS仿真器通常使用基于行为的能量模型,对相关的微架构组成部分,基于其行为比率累积其功耗。一个很好的例子是广泛使用的周期准确的Gem5 ISS,可以进行拓展以支持面积和功耗预测,使用基于行为的建模框架,比如CACTI和McPAT。虽然这些ISS仿真器可以部署对处理器组成部分进行微架构DSE,但是他们一般仍然太慢了,不能进行完整系统级的DSE。

In cycle-accurate ISS simulators, the fetching, decoding, and execution of instructions are explicitly simulated. To further optimize the speed of such simulators, one could translate the instructions from the target binary to be simulated to an equivalent sequence of instructions (using static or dynamic just-in-time translation) that can be executed on the simulation host computer. This so-called binary translation technique, which is, e.g., deployed in the widely used QEMU simulator [10], aims at reducing the overhead of explicitly simulating the instruction fetch and decode stages. The translated instruction sequences are often instrumented with additional code to keep track of the extra-functional behavior, such as timing and power consumption, of the original code as it would have been executed on the target processor.

在周期准确的ISS仿真器中,取指、译码和执行是显式仿真的。为进一步优化这种仿真器的速度,可以将指令从目标二进制翻译到等价的指令序列(使用静态或动态的即时翻译),可以在仿真宿主计算机上进行执行的仿真。这种所谓的二进制翻译技术,在广泛使用QEMU仿真器中进行了部署,其目标是降低显式的仿真取指和译码阶段的代价。翻译的指令序列通常会带有额外的编码,以追踪原始代码额外的功能行为,比如时序和功耗,因为要在目标处理器中进行执行。

For simulating communication between system components, one could use so-called bus-cycle accurate simulation [11] to speed up the simulation process. In this type of simulation, all signals of the communication bus are modeled explicitly in a cycle accurate fashion, but this accuracy is only maintained for the signals on the communication bus and not for the logic around it. The surrounding components can thus use more abstract timing models.

在系统组成部分之间通信的仿真,可以使用所谓的总线周期准确仿真[11],来加速仿真过程。在这种类型的仿真中,通信总线上的所有信号以周期准确的方式进行显式建模,但这种准确率只对在通信总线上的信号进行维护,并没有对周围的逻辑进行维护。周围的组成部分可以因此使用更抽象的时序模型。

Raising the abstraction level even further for processor simulation yields so-called host-compiled simulation [12]. In this technique, the source code of the target program is directly compiled into a binary program that can run on the host computer. In addition, and similar to the binary translation technique, the source code can be instrumented with a timing and power consumption model based on the target architecture. Since these simulations are efficient as they directly execute target programs on the host computer, they are very suitable for system-level DSE. However, at this level of abstraction, it is difficult to accurately capture intricate microarchitectural behavior, like pipeline and cacheing behavior. Another drawback of this simulation approach is that one needs to have access to the source code of a target program.

对处理器仿真进一步提升抽象层次,得到所谓的宿主编译的仿真。在这种技术中,目标程序的源码,直接编译成可以在宿主计算机运行的二进制程序。此外,与二进制翻译技术类似,源码可以加入基于目标架构的时序和功耗模型。因为这种仿真很高效,因为在宿主计算机上直接执行目标程序,它们非常适用于系统级DSE。但是,在这种抽象层次,很难准确的捕获复杂的微架构行为,比如流水线和缓存行为。这种仿真的另一种缺陷是,需要访问目标程序的源码。

For simulating communication, transaction-level modeling (TLM) [11] provides the highest level of abstraction. In TLM, communication details at the level of signals and protocols are abstracted away by means of encapsulation into entire transactions between system components. At this level, the emphasis is more on the functionality of the data transfers, i.e., what data are transferred to and from what locations, rather than on their actual implementation. Evidently, the extra-functional behavior in TLM simulation models is also captured at the level of entire transactions.

对于通信的仿真,事务级的建模(TLM)是最高级的抽象。在TLM中,信号和协议级的通信细节都抽象掉了,封装进了系统组成部分之间的整个事务中。在这个级别,强调的更多的是数据传输的功能,即,传输的是什么数据,来自什么位置,而不是其实际的实现。显然,在TLM仿真模型中的额外的功能性的行为,在整个事务级被捕获。

The above processor simulation techniques are all execution-driven simulation methods as they are directly driven by the execution of a program. Alternatively, there are also trace-driven simulation techniques in which the simulation is driven by event traces that have been collected through the execution of a program (e.g., [13] and [14]). These trace events can focus on the evaluation of specific system elements such as memory address traces for cache simulation. However, an event trace may also consist of the full sequence of executed instructions, thereby allowing full, trace-driven microprocessor simulation for the purpose of performance and/or power estimation. To optimize for simulation speed, the trace events may also represent computations (and, if needed, communication) at a higher level of abstraction than the level of machine instructions, like at the level of the execution of basic blocks or even entire functions. Another advantage of trace-driven simulation is the fact that the event traces often only need to be generated once (i.e., executing the program to collect the traces once), after which they can be reused in the DSE process. Drawbacks of trace-driven simulation evidently are the need for storing the event traces which can become extremely large in size, and the fact that trace-driven simulation does not allow for simulating all intricate system behavior, such as the effects of speculative instruction execution in microprocessors.

上述处理器仿真技术,都是执行驱动的仿真方法,因为都是由程序的执行来直接驱动的。此外,还有迹驱动的仿真技术,仿真是由事件迹驱动的,在程序的执行期间收集到这些事件迹,如[13,14]。这些迹事件可以聚焦在特定系统元素的评估,比如内存地址迹,进行缓存的仿真。但是,一个事件迹也可能是执行的指令的完整序列组成,因此可以进行完整的,迹驱动的微处理器仿真,估计性能和/或功耗。为优化仿真速度,迹事件也可以表示计算(如果需要的话,通信也可以),其表示比机器指令抽象层次要高,就像在基本模块执行的层次,或整个函数执行的层次。迹驱动仿真的另一个优势是,事件迹通常只需要生成一次(即,执行程序来收集迹一次),在这之后,可以在DSE过程中重复使用。迹驱动的仿真的缺陷很明显,需要存储事件迹,其大小通常很大,而且迹驱动的仿真不能仿真所有复杂的系统行为,比如微处理器中预测指令执行的效果。

An example of a high-level, trace-driven MPSoC simulation environment is the Sesame system-level modeling and simulation framework [15]. Sesame is based on the Y-chart methodology [16], and accordingly it recognizes separate application and architecture models. The application models are explicitly mapped onto the architecture models by means of trace-driven simulation. The workload of an application is captured by instrumenting the application model, which is a parallel specification of the application, with annotations that describe the application’s computational and communication actions at a coarse-grained level (typically at the level of the execution of entire functions). By executing this instrumented application model, these annotations cause the generation of traces of application events that subsequently drive the underlying architecture model. This architecture model, capturing the system resources and their constraints, then simulates the consequences of the consumed computation and communication events in terms of extra-functional system behavior (performance, power consumption, etc.). Figure 4 depicts Sesame’s layered organization, illustrating the mapping of two multimedia applications (an MP3 encoder and video decoder) onto a bus-based MPSoC platform. A special mapping layer in Sesame provides the scheduling of application events in the case multiple application processes are mapped onto a single processing resource.

高层次的,迹驱动的MPSoC仿真环境的一个例子是Sesame系统级建模和仿真框架[15]。Sesame是基于Y-chart方法论的,因此它可以认出不同的应用和架构模型。应用模型通过迹驱动的仿真,显式的映射到架构模型中。一个应用的workload,是通过加入应用模型来捕获到的,这是应用的并行规范说明,带有标注,在粗糙粒度的层次(通常是在整个函数执行的层次)描述了应用的计算和通信行为。通过执行这个增强的应用模型,这些标注会导致应用事件的迹的生成,随后会驱动潜在的架构模型。这个架构模型捕获的是系统的资源和其约束,然后仿真消耗的计算和通信事件的结果,以额外功能性的系统行为(性能,功耗,等)。图4展示了Sesame的分层的组织,描述了两个多媒体应用(一个MP3编码器和视频解码器)映射到基于总线的MPSoC的平台上。Sesame中的一个特殊映射层,在多个应用过程映射到一个处理资源时,给出了应用事件的调度。

Orthogonal to most of the (processor) simulation methods described above, there are additional techniques to further improve the simulation speed [17]. In sampled simulation, for example, the simulation does not cover the execution of an entire program but only simulates relatively small samples of the program’s execution. Here, the challenge is to select the samples in such a manner that they sufficiently represent the behavior as if the entire program was simulated. Another technique for speeding up simulation is statistical simulation. Rather than using real (benchmark) programs for simulation, it uses a statistical program profile. This profile captures the distributions of important program characteristics, and is used for generating a synthetic instruction trace that drives a simple trace-driven simulator. As the synthetic trace is randomly generated from a statistical profile, this type of simulations can converge to a set of performance predictions fairly quickly.

除了上面描述的多数处理器仿真方法,还有一些技术可以进一步改进仿真速度[17]。比如,在采样的仿真中,仿真并不会覆盖整个程序的执行,而只仿真程序执行的较小一部分样本。这里,挑战之处在于,选择样本的方式要足够有代表性,仿佛整个程序都被仿真了。另一种加速仿真的技术是统计仿真。这种技术不使用真实的(基准测试)程序进行仿真,而是使用统计程序的概要。这个概要捕获了重要程序特征的分布,用于生成合成的指令迹,驱动一个简单的迹驱动的仿真器。由于合成的迹是从统计的概要中随机生成的,这种类型的仿真可以很快的收敛到性能预测集。

2.2 Analytical fitness evaluation

In comparison to simulation, analytical models allow for much more efficient prediction of the extra-functional system behavior at the expense of a reduced accuracy. This makes analytical models very suitable for exploring large design spaces to rapidly identify regions of interest that can be later explored in more detail using simulation. Another advantage of analytical models is that they can provide direct insight into the relationship between model parameters (representing design choices) and the predicted extra-functional behavior. For simulative methods, such understanding would require a large number of simulations.

与仿真比起来,解析模型可以进行额外的功能性的系统行为的更高效的预测,其代价是准确率下降了。这使解析模型非常适用于探索大型设计空间,以快速识别感兴趣的区域,后续再使用仿真来进行更细节的探索。解析模型的另一个优势是,可以给出模型参数(表示设计选择)与预测的额外功能性行为的关系的洞见。对于仿真性的方法,这样的理解会需要大量仿真。

Analytical models can roughly be divided into three classes [17]: 1) mechanistic (or whitebox) models; 2) empirical (or blackbox) models; and 3) a hybrid combination of mechanistic and empirical modeling. Mechanistic models are based on first principles, which implies that they are built in a bottom-up fashion starting from a basic understanding of the mechanics of the modeled system. For example, in a mechanistic microprocessor performance model, penalties due to cache misses, branch mispredictions, the execution of instructions with different latencies, etc., are explicitly captured in the model.

解析模型可以大致分成三类:1)机理式的(白盒)模型;2)经验(黑盒)模型;3)机理式的建模和经验建模的混合。机理式模型是基于第一准则的,这意味着,这是在自下而上的方式构建的,从被建模系统的原理的基础理解开始。比如,在微处理器性能的机理模型中,由于缓存缺失,分支预测错误,不同延迟的指令执行等得到的惩罚,是在模型中显式捕获到的。

In empirical models, statistical inference and machine learning techniques, like regression models or neural networks, are used to automatically synthesize a model through the process of learning from training data. For example, using a set of microarchitectural parameters such as pipeline depth, issue width, caches sizes, etc., one could train a model that predicts the Instructions Per Cycle (IPC) or Cycles Per Instruction (CPI) of a microprocessor. Inferring a model by means of automatic training typically is easier than developing a mechanistic model because it does not require intimate understanding of the mechanics of the modeled system. Evidently, the latter is also an immediate drawback as empirical models also tend to provide less insight than mechanistic models.

在经验模型中,统计推断和机器学习技术,如回归模型或神经网络,用于通过从训练数据中学习的过程,自动的合成一个模型。比如,使用微架构参数的集合,比如流水线深度,发射宽度,缓存大小,等,可以训练一个模型,预测微处理器的IPC或CPI。通过自动训练推断一个模型,一般是比开发一个机理式的模型要容易,因为不需要对被建模系统的机制有详细的理解。明显的,后者也是一个明显的缺陷,因为经验式模型比机理式模型给出的洞见也更少。

In hybrid mechanistic-empirical modeling, which is sometimes referred to as greybox modeling, extra-functional system aspects are captured using a formula that has been derived from insights in the underlying system. However, this formula includes a number of unknown parameters, which are then inferred through fitting (e.g., using regression), similarly to empirical modeling. Such hybrid mechanistic-empirical modeling is motivated by the fact that it provides insight (like mechanistic modeling) while easing the construction of the model (like empirical modeling).

在混合的机理式经验式建模中,有时候也称为灰盒建模,使用一个公式来获得额外的功能性系统方面,这是从潜在的系统的洞见中推导得到的。但是,这个公式包含一些未知的参数,这是通过拟合来推断得到的(如,使用回归),与经验式建模类似。这种混合的机理式经验式建模,可以提供洞见(像机理式建模),还使模型的构建变得容易(像经验式建模)。

3. Searching the design space

As explained before, searching a design space is a multiobjective optimization process. This process will evidently benefit from a good tradeoff between speed, accuracy, and effort of the method for evaluating the fitness of a single design point, as discussed in the previous section. But, even if this tradeoff is ideal, we still have to make sure that each evaluation of a design point contributes as much as possible to an effective and efficient search of the design space. A crucial component toward this goal is the search algorithm that navigates the design space toward areas of interest by proposing which design points to evaluate next. Regardless of the specific type of search method that is used for such a design space traversal, its success depends on three major concerns, as was shown in Figure 2: confidence, convergence, and effort. These concerns typically cannot be considered in isolation, as they are highly interdependent, contradictory, and sometimes overlapping. The state of the art in DSE can be summarized as finding a good tradeoff between these concerns.

之前解释过,搜索一个设计空间是一个多目标优化的过程。这个过程会从速度、准确率和评估单个设计点的适应度的方法的工作的tradeoff中获益。但是,即使这种tradeoff是理想的,我们仍然要确定,一个设计点的每次评估,要尽可能多的为设计空间的高效有效搜索做贡献。朝这个目标的关键组成部分是搜索算法,在设计空间中导航,朝向感兴趣区域,提出下一步评估哪个点。不管用于设计空间遍历的搜索方法是哪种类型,其成功依赖于三个主要考虑,如图2所示:置信度,收敛性,和努力。这些考虑一般不能单独考虑,因为它们是高度互相关的,互相矛盾的,有时候还是重叠的。DSE中目前最好的算法可以总结为在这些考虑中找到一个很好的折中。

DSE search algorithms can be divided into exact and heuristic methods. In exact DSE methods, like those implemented using integer linear programming (ILP) solutions (e.g., [18] and [19]) or branch & bound algorithms (e.g., [20]), the optimum is guaranteed to be found. As such methods generally are compute intensive, they typically use design space pruning (i.e., discarding unsuitable design points) to optimize the efficiency of the search, thereby allowing to handle larger design spaces. However, for realistic design problems with design spaces that are vast, these methods may still be less suited. Alternatively, in heuristic methods, meta-heuristics are used to find a design point in the known design space that meets the design requirements as best as possible. To this end, these methods search the design space for optimal solutions using only a finite number of design point evaluations, and can thus handle larger design spaces. However, there is no guarantee that the global optimum will be found using meta-heuristics, and therefore the result can be a local optimum within the design space. Examples of meta-heuristics are hill climbing, tabu search, simulated annealing, ant colony optimization, particle swarm optimization, and genetic algorithms. In this tutorial, we will focus on methods to navigate the design space that are based on genetic algorithms (GA). GA-based DSE has been widely studied in the domain of system-level embedded design (e.g., [21] and [22]) and has been demonstrated to yield good results. Moreover, GAs can be used in their basic (domain-independent) form or, as will also be explained later on, with custom extensions that incorporate domain-dependent knowledge in order to improve search performance even further.

DSE搜索算法可以分成精确方法和启发式方法。在精确DSE方法中,如使用整数线性规划(integer linear programming, ILP)或分支界定算法实现的,一定会找到最优解。因为这种方法一般计算量很大,一般会使用设计空间修剪(即,抛弃不合适的设计点),以优化搜索的效率,因此可以处理更大的设计空间。但是,对于实际的设计问题,其设计空间是巨大的,这些方法仍然不是很适用。此外,在启发式方法中,元启发式用于在已知的设计空间找到一个设计点,尽可能的满足设计需求。为此,这些方法使用有限数量的设计点评估,来搜索设计空间得到最优解,因此可以处理更大的设计空间。但是,不能保证使用元启发式能够找到全局最优,因此结果可能是设计空间中的一个局部最优。元启发式算法的例子如,爬坡算法,tabu搜索,模拟退火,蚁群优化,粒子群优化,和遗传算法。在这个教程中,我们会关注基于遗传算法(GA)在设计空间中导航的方法。基于GA的DSE在系统级嵌入式设计中被广泛研究,已经证明可以得到很好的结果。而且,GA可以用于可以以其基本形式(领域不相关)使用,或带有定制的扩展使用,将领域相关的知识纳入到其中,以进一步改进搜索性能。

3.1 GA-based DSE

GAs operate by searching through the solution space (spanned by the design variables/decisions being explored) where each possible solution is encoded as a string-like representation, often referred to as the chromosome [23]. A (randomly initialized) population of these chromosomes is then iteratively modified by performing a fixed sequence of actions that are inspired by their counterparts from biology: fitness evaluation and selection, crossover, and mutation. A fundamental design choice of a GA is the genetic representation of the solution space, because each of the crossover and mutation steps depends on it. To illustrate how such a genetic representation could look like, let us use a widely studied DSE problem in the domain of system-level embedded system design as an example: optimizing the mapping of a (set of) concurrent application(s) onto an underlying (heterogeneous) MPSoC platform architecture [5]. As a convenient mapping description for an application with n tasks, we use a vector of size n with processor identifiers p_i, where p_i indicates the mapping target of task i

GA搜索设计空间(由要探索的设计变量/决策张成),其中每个可能的解编码为字符串样的表示,通常称为染色体。这些染色体的随机初始化的族群,通过进行固定顺序的行为进行迭代修正,这是由生物学中的对应行为启发得到的:适应度评估和选择,交叉,和变异。GA的一个基本设计选择是解空间的遗传表示,因为每个交叉和变异步骤都依赖于这个。为描述这样的遗传表示应当看起来是怎样,我们使用在系统级嵌入式系统设计领域中广泛研究的DSE问题作为例子:优化并发应用到潜在的异质MPSoC平台架构的映射[5]。作为一个有n个任务的应用的映射描述,我们使用大小为n的向量,处理器标志符p_i,其中p_i表示任务i的映射目标

$$[p_0, ..., p_i, ..., p_{n-1}]$$

This commonly used description is very suitable to serve as the chromosome representation for a GA. A valid mapping specification is a feasible partitioning of all n tasks. With feasible, we mean that tasks are mapped onto processing elements that can execute those tasks (i.e., there are no functional restrictions of the processing element in question, like an ASIC component that only allows the execution of one particular piece of functionality), and that communicating tasks are mapped onto processing elements that can actually communicate with each other (i.e., there are no topological communication restrictions). In case an infeasible mapping is created by the genetic operators of a GA (crossover and mutation), a mechanism is required that either discards or repairs such a chromosome. Repairing a chromosome implies that it is transformed into a valid chromosome (mapping) that is as close as possible to the original, invalid one. Moreover, note that task partitions specifying a mapping may also be empty [particular processor(s) not in use] or contain all n tasks (a single processor system). A processor that is not assigned any tasks (having an empty task partition) can be considered idle or nonexistent.

这种常用的表示非常适合于作为GA的染色体表示。一个有效的映射指标,是所有n个任务的一个可行分割。我们说可行的意思是,这些任务映射到了可以执行这些任务的处理单元上(即,我们讨论的处理单元没有功能上的限制,比如一个ASIC组成部分只能执行特定功能),通信任务要映射到可以实际互相通信的处理单元中(即,没有拓扑通信限制)。在GA的基因算子(交叉和变异)创建了一个不可行映射的情况下,需要一种机制来抛弃或修复这样的染色体。修复染色体意味着,要变换到一个有效的染色体(映射),还要与原始的无效染色体尽可能的接近。而且,注意指定了一个映射的任务分割,还可能是孔的(特定的处理器不在使用中),或包含所有n个任务(单处理器系统)。一个并没有指定任何任务的处理器(空任务分割)可以认为是空闲或不存在的。

In Figure 5a, the different steps of a GA are shown. This figure also illustrates the mapping representation of a chromosome for an application with six tasks and a 4-processor bus-based MPSoC platform. Starting from a (randomly initialized) population of chromosomes, representing the different mapping design instances, the fitness of the mapping solutions in the population is first evaluated. To this end, any of the previously discussed analytical or simulative techniques can be used. Subsequently, based on the fitness evaluation, a selection of chromosomes is made that will be used to create offspring. This offspring is created by combining genetic material from two parents using a crossover operation, as illustrated in the top part of in Figure 5b. There exist various forms of this crossover operator, of which the uniform, onepoint, and two-point crossovers are the most popular. Next, new genetic material is introduced in the offspring by means of a mutation operator as illustrated at the bottom of Figure 5b. Such a mutation randomly changes one or more genes within chromosomes. Finally, the newly created offspring is used to update the population by either replacing it or by deploying so-called elitism. Such elitism involves the combination of the new offspring with a small number of the best solutions from the original population to avoid loosing strong solutions.

在图5a中展示了GA的不同步骤。图中还展示了在一个应用中,有6个任务,MPSoC平台是基于总线的4处理器平台,一个染色体的映射表示。从随机初始化的染色体族群开始,表示不同的映射设计实例,首先评估族群中的映射解的适应度。为此,可以使用前面讨论的解析技术或仿真性技术的任何一种。然后,基于适应度评估,开始选择染色体,用于创建子代。子代的产生,是从两个父辈中使用交叉操作结合基因材料,如图5b的上图。交叉算子有各种不同的形式,最流行的是均匀交叉,一点交叉,和两点交叉。下一步,在子代中通过编译算子引入新的基因材料,如图5b的下部所示。这样一种变异随机改变染色体中的一个或多个基因。最后,新创建的子代用于更新族群,可以通过替代,或部署所谓的精英。这样的精英是新的子代与少量原始族群中最优解的组合,以避免损失强解。

To provide a small example of the results a GA-based DSE could obtain, we present some results of a small-scale case study where the design space consists of an application with 11 tasks that is to be mapped onto a 4-processor MPSoC architecture with a crossbar interconnect [6]. The mapping design space contains more than four million design points, of which 175 000 are unique ones (as the target platform is a homogeneous, symmetric MPSoC). Because of the relatively small design space, in this particular case, we were also able to perform an exhaustive search, allowing us to evaluate the quality of the GA-based search results. To account for the stochastic behavior of GAs, all results are averages over 300 GA runs. The fitness of mapping solutions has been evaluated using the Sesame MPSoC simulation framework [15] (see also the Simulative fitness evaluation section). Figure 6 shows the results of the GA-based DSE with different population sizes (10, 15, 40, or 80 chromosomes), a constant mutation rate (0.1) and crossover probability (0.9), and a uniform crossover in a so-called probability-quality (P-Q) plot. Regarding the top part of this plot, the horizontal axis (top x-axis) represents the quality of the result as a percentile toward the true optimum (a lower percentile indicates a result closer to the optimum) and the vertical axis represents the probability of achieving a result with that quality. The straight lines in the graph represent the theoretically derived probabilities of finding results using a simple, uniform random search. We have also computed the 80%–95% confidence intervals of the mean fitness value (execution time in cycles, in this case) of mapping solutions found by the GA, averaged over the 300 runs of each GA search. These confidence intervals, shown at the bottom of the graph in Figure 6, indicate how certain (as specified by the confidence level) we are that the real mean lies within the confidence interval. The more the confidence intervals for different experiments are nonoverlapping, the more significant the difference of the mean behavior (which is clearly the case in the example of Figure 6). The results from this particular case study show that the GA-based DSE with the largest population size can find mapping solutions that are always very close to the real optimum: within the 0.1 percentile, implying that they belong to the best 175 000/1000 = 175 solutions. A larger population size, however, comes with a higher number of fitness evaluations during the search and thus requires a longer search time (assuming the number of search iterations remains constant). According to Figure 6, a population size of 40 may therefore provide a good compromise.

为给出基于GA的DSE可以得到的结果的一些例子,我们给出小规模案例研究的一些结果,其中的设计空间由11个任务的应用组成,映射到了一个4处理器MPSoC架构中,其连接为crossbar。映射设计空间包含超过了4百万个设计点,其中17.5万个是唯一的(目标平台是一个同质的对称的MPSoC)。因为设计空间相对较小,在这个特定的案例中,我们也能够进行穷举式搜索,使我们能评估基于GA的搜索结果的质量。为计入GA的统计行为,所有的结果都是在300次GA runs上平均得到的。映射解的适应度使用Sesame MPSoC仿真框架进行评估的[15](见仿真性的适应度评估一节)。图6展示了基于GA的DSE在不同的族群规模时的结果(10,15,40或80个染色体),变异率为常数0.1,交叉概率为常数0.9,均匀交叉,图为所谓的概率-质量(probability-quality, P-Q)图。图中的上半部分,水平轴(上面的x轴)表示相对于真最优值结果质量的百分比(百分比更低,意味着到最优值更近),垂直轴表示得到这样质量的解的结果的概率。图中的直线表示理论推导出的找到结果的概率,使用的是一个简单的均匀随机搜索。我们还计算了GA找到的映射解的平均适应度值的80%-95%置信度间隔(在这种情况下,执行时间是周期),每个GA搜索运行了300次进行了平均。这些置信度区间如图6的下部所示,表示了我们有多确定,真实的均值就在置信度区间中(由置信度等级指定)。不同的实现的置信度区间不重叠的越多,平均行为的差异越明显(图6中例子很明显就是这种情况)。这个特定的案例研究的结果表明,基于GA的DSE在最大的族群规模下,可以找到的映射解,与真正的最优解是很接近的:在0.1的百分比内,意味着这属于最好的175 000/1000 = 175个解。族群规模越大,在搜索过程需要评估的适应度的数量越多,因此需要更长的搜索时间(假设搜索的迭代次数保持常数)。根据图6,族群规模40可以得到很好的折中效果。

3.2 Optimizing GA-based DSE

There are various methods for making the search process of a GA-based DSE more efficient. This allows the DSE process to either find the design candidates quicker (i.e., improve the convergence behavior of the DSE) or to spend the same amount of time to evaluate more design points. The latter can be used to enable the search of larger design spaces or to improve the chance of finding better design candidates (i.e., improve the confidence property of the DSE). One approach for optimizing the GA-based search is to enrich the genetic operators of the GA with domain knowledge such that they produce more diverse offspring or offspring with a higher probability of being closer to the optimum. For example, in [24], new GA operators have been proposed that optimize the search performance by 1) reducing the redundancy present in chromosome representations (e.g., mapping symmetries in the case of homogeneous, symmetrical MPSoC platforms); or 2) using a new crossover operator that is based on a mapping distance metric that provides a measure of similarity between mappings. Using this mapping distance information, the new crossover operator aims at retaining the strong chromosome parts of both of the parents. In [25], a new mutation operator has been proposed that considers the affinity of tasks with respect to processors, the communication cost between tasks, and the differences of processor workloads to steer the mutation in such a way that offspring is produced with a higher probability of being (near) optimal.

有几种方法,可以使得基于GA的DSE搜索过程更加高效。这使得DSE过程可以更快的找到设计候选(即,改进DSE的收敛行为),或耗费相同的时间评估更多的设计点。后者可以用于搜索更大的设计空间,或改进找到更好的设计候选的机会(即,改进DSE的置信度性质)。一种优化基于GA的搜索的方法是,用领域知识扩充GA的基因算子,这样可以产生更多样化的子代,或可能更靠近最优值的子代。比如,在[24]中,提出了新的GA算子通过下面的方法优化搜索性能:1)降低染色比表示中的冗余(如,在同质化,对称的MPSoC平台上的映射对称性),2)使用一种新的交叉算子,基于的映射距离度量可以给出映射之间的相似度。使用这种映射距离信息,新的交叉算子的目标是,保持两个父辈中很强的染色体部分。在[25]中,提出了一种新的变异算子,考虑了相对于处理来说,任务之间的亲和性,任务之间的通信代价,处理器workloads之间的差异,操纵变异的方式是,子代会有更高的概率成为/接近最优。

Another approach for optimizing GA-based DSE concerns the reduction of the time taken to evaluate the fitness of solutions during the GA’s execution. As mentioned before, DSE approaches typically use either simulation or an analytical model to evaluate the fitness of design points, where simulative approaches prohibit the evaluation of many design options due to the higher evaluation performance costs and analytical approaches suffer from accuracy issues. Therefore, in [26], a hybrid form of mapping DSE has been proposed that combines simulation with analytical estimations to prune the design space in terms of application mappings that need to be evaluated using simulation. To this end, the DSE technique uses an analytical model that estimates the expected throughput of an application given a certain architectural configuration and application-to-architecture mapping. In the majority of the search iterations of the DSE process, this analytical throughput estimation avoids the use of simulations to evaluate the design points. However, since the analytical estimations may in some cases be less accurate, the analytical estimations still need to be interleaved with simulative evaluations in order to ensure that the DSE process is steered into the right direction. A similar approach is taken in [27], where an iterative DSE methodology is proposed exploiting the statistical properties of the design space to infer, by means of an empirical analytic model, the design points to be analyzed with low-level simulations. The knowledge of a few design points is used to predict the expected improvement of unknown configurations. Alternatively, in hierarchical DSE (e.g., [28], [29], and [30]), DSE is first performed using analytical or symbolic models to quickly find the interesting parts in the design space, after which simulation-based DSE is performed to more accurately search for the optimal design points.

另一种优化基于GA的DSE方法,是关于减少在GA执行的过程中,评估适应度函数的时间。之前提到过,DSE方法一般使用仿真模型或解析模型,来评估设计点的适应度,仿真性的方法不能进行很多设计选项的评估,因为评估性能代价较高,而解析方法则有准确率的问题。因此,在[26]中,提出了一种混合形式的映射DSE,将仿真估计和解析估计结合到了一起,以修剪设计空间,在应用映射方面需要使用仿真进行评估。为此,DSE技术使用解析模型,在给定特定的架构配置和应用到架构的映射时,估计一个应用期望的吞吐量。在DSE过程的大部分搜索迭代中,这个解析的吞吐量估计避免使用仿真来估计设计点。但是,由于解析估计在一些案例中没那么准确,接续估计仍然需要与仿真评估交替进行,以确保DSE过程往正确的方向行进。[27]中采用了类似的方法,其中提出了一种迭代的DSE方法,利用设计空间的统计性质,用一个经验解析模型,来推理得到设计点,用低层次仿真进行分析。使用一些设计点的知识来预测未知配置的期望改进。此外,在层次化DSE中(如,[28,29,30]),DSE首先使用解析或符号模型进行,以快速找到设计空间中有趣的部分,然后进行基于仿真的DSE,以更准确的搜索最优设计点。

3.3 Workload models: Static versus dynamic

The DSE techniques discussed so far focus on the evaluation and exploration of MPSoC architectures under static, single-application workloads. Todays MPSoC systems, however, often require supporting an increasing number of applications and standards, where multiple applications can run simultaneously and concurrently contend for system resources. For each single application, there may also be different execution modes (or program phases) with different computational and communication requirements. For example, in software-defined radio appliances, a radio may change its behavior according to resource availability, such as the long-term evolution (LTE) standard which uses adaptive modulation and coding to dynamically adjust modulation schemes and transport block sizes based on channel conditions. Another example would be a video application that dynamically lowers its resolution to decrease its computational demands in order to save battery life. As a consequence, the behavior of application workloads executing on the embedded system can change dramatically over time.

目前讨论的DSE技术,聚焦在静态、单应用workloads下,MPSoC架构的探索和评估。但是,现在的MPSoC系统通常需要支持越来越多的应用和标准,多个应用会同事和并发的运行,对系统资源形成竞争关系。对每个单独的应用,也可能有不同的执行模式(或程序阶段),有着不同的计算和通信需求。比如,在软件定义无线电的应用中,一个无线电会根据资源可用性,来改变其行为,比如LTE标准使用的是自适应调制和编码,基于通道条件来动态的调整调制方案和发送块大小。另一个例子是视频应用,可以动态的降低分辨率,以降低其计算量需求,以节省电池寿命。结果是,应用workloads在嵌入式系统上执行的行为,可以随着时间有着极大的变化。

To capture the dynamism in application workload behavior during the design process, this section describes the concept of application scenarios [31] as well as scenario-based DSE [32], [33]. Like in the previous section, we will again use the example of application mapping exploration to illustrate the concepts. Application scenarios are able to describe the dynamism of embedded applications and the interaction between the different applications on the embedded system. An application scenario consists of two parts: an inter-application scenario and an intra-application scenario. An inter-application scenario describes the interaction between multiple applications, i.e., which applications are concurrently executing at a certain moment in time. Interapplication scenarios can be used to prevent the overdesign of a system. If some of the applications cannot run concurrently, then there is no need of reserving resources for the situation where these applications are running together. Intra-application scenarios, on the other hand, describe the different execution modes for each individual application.

为在设计过程中捕获应用workload的动态性,本节描述了应用场景的概念,和基于场景的DSE。与之前的节很像,我们会再次使用应用映射探索的例子,来描述这个概念。应用场景能够描述嵌入应用的动态性,和不同应用在嵌入式系统中的相互作用。一个应用场景包含两个部分:一个应用内的场景,和一个应用间的场景。应用间的场景描述的是多个应用之间的相互作用,即,哪些应用在特定时刻是同时执行的。应用间的场景可以用于防止一个系统的过度设计。如果一些应用不能同时运行,那么就不需要为这些应用共同运行时来保留资源。另一方面,应用内场景描述的是对每个单独的应用的不同的执行模式。

The number of different application scenarios grows exponentially with the number of applications involved. So, to perform DSE with these application scenarios, this so-called scenario-based DSE needs to solve the problem that the number of possible application scenarios is too large to exhaustively evaluate the fitness of design points with all of these scenarios. Therefore, a small but representative subset of scenarios must be selected for the evaluation of MPSoC design points. This representative subset must be used for comparing mappings and should lead to the same performance ordering as would have been produced when the complete set of the application scenarios would have been used. That is, if mapping m1 is better than mapping m2, the representative subset should be able to give a better predicted fitness to mapping m1 than it assigns to mapping m2. However, the selection of such a representative subset is not trivial [34]. This is because the representative subset is dependent on the current set of mappings that are being explored. Depending on the set of mappings, a different subset of application scenarios may reflect the relative mapping qualities of the majority of the application scenarios.

不同的应用场景的数量,会随着涉及到的应用的数量的增多,而呈指数级增加。所以,为用这些应用场景进行DSE,这些所谓的基于场景的DSE需要解决这个问题,即可能的应用场景的数量太大了,不能穷举式评估所有场景的设计点的适应度。因此,必须选择一个小型但有代表性的场景子集,评估MPSoC设计点。这个代表性的子集,必须用于比较映射,与完整集合的应用场景用于产生性能排序时,应当带来相同的性能排序。即,如果映射m1比映射m2号,代表性子集应当对映射m1给出一个更好的预测适应性,对映射m2给出的要差一些。但是,这样一个代表性子集的选择是不简单的。。这是因为代表性子集依赖于正在探索的当前映射集合。依赖于映射的集合,不同的应用场景子集可能会反应主要的应用场景的相对映射质量。

As a result, the representative subset cannot statically be selected. For a static selection, one would need to have a large fraction of the mappings that are going to be explored during the MPSoC DSE. However, since these mappings are only available during DSE, a dynamic selection method must be used. Thus, both the set of optimal mappings and the representative subset of scenarios need to be coexplored simultaneously such that the representative subset is able to adapt to the set of mappings that are currently being explored. Figure 7 shows the scenario-based DSE framework. The left part of the picture provides a general overview of the exploration flow, whereas the right part illustrates the scenario-based DSE in more detail. As an input, the scenario-based DSE requires a scenario database, application models, and an MPSoC platform architecture model. The description of the application workload is split into two parts: 1) the structure and 2) the behavior. The structure of applications is described using application models (as described before), whereas a scenario database [35] explicitly stores all the possible multiapplication workload behaviors in terms of application scenarios (i.e., intra-application and inter-application scenarios). In the scenario-based DSE framework, two separate components are recognized that simultaneously perform the coexploration tasks: the design explorer searches for the set of optimal mappings while the subset selector tries to select a representative subset of scenarios. To this end, they exchange data in an asynchronous fashion after every search iteration. Here, the design explorer sends a sample of the current mapping population to the subset selector, whereas the subset selector makes the most representative subset available for the fitness prediction in the design explorer.

结果是,代表性子集不能被静态的选择。对于一个静态的选择,就需要有很大一部分映射需要在MPSoC DSE时被探索。但是,由于这个映射只在DSE的过程中可用,必须要使用一个动态选择方法。因此,最优映射的集合和代表性子集场景两者都需要同时被探索,这样代表性子集需要适应当前探索的映射的集合。图7展示了基于场景的DSE框架。图片左边的部分给出了探索流的概览,右边详细描述基于场景的DSE。基于场景的DSE的输入,需要一个场景数据库,应用模型,和MPSoC平台架构模型。应用workload的描述分成了两部分:1)结构,2)行为。应用的结构是用应用模型描述的(之前描述过),而场景数据库以应用场景显式的存储了所有可能的多应用workload行为(即,应用内和应用间的场景)。在基于场景的DSE框架中,两个分离的组成部分会同时进行共同探索任务:设计探索器搜索最优映射的集合,而子集选择器尝试选择场景的代表性子集。为此,他们在每次搜索迭代后,都以异步的方式交换数据。这里,设计探索器将目前的映射族的样本送到子集选择器,而子集选择器使最有代表性的子集对设计探索器中的适应度预测是可用的。

The design explorer performs a traditional mapping DSE using a GA, as discussed in the previous section. As explained above, it uses a representative subset of scenarios to evaluate the fitness of mapping solutions. At every iteration of the GA, the design explorer reads in the most recent representative scenario subset from the subset selector and submits the current population of mapping solutions to the subset selector in order to allow the latter to select the appropriate representative subset. This subset selection is not trivial as there are many scenarios to pick from, leading to a huge number of possible scenario subsets. Therefore, the subset selector uses the set of mappings it regularly receives from the design explorer to train the scenario subset such that it is representative for the current population in the design explorer. As the population of the design explorer slowly changes over time, the representative subset will change accordingly. In [33], three different techniques for selecting a representative scenario subset are presented and evaluated: a GA-based scenario space search (which means that two GAs are running concurrently, one for the design explorer and one for the subset selector), a feature selection (FS)-based search algorithm, and a hybrid combination (HYB) between these two. The latter aims at combining the strengths of both the GA-based and FS-based searches. That is, a GA is capable of quickly exploring the space of potential scenario subsets, but due to its stochastic nature, it is susceptible to missing the optimal scenario subsets. This is not the case with the feature selection algorithm as it more systematically explores the local neighborhood of a scenario subset.

设计探索器使用GA进行传统的映射DSE,前一节已经讨论过了。就像之前解释的,使用场景的代表性子集来评估映射解的适应度。在GA的每次迭代中,设计探索器从最近的代表性场景子集中从子集选择器中读取,将目前的映射解的族提交到了子集选择器,使后者选择合适的代表性子集。这个子集选择并不是很简单的,因为有很多场景要选择,带来大量可能的场景子集。因此,子集选择器使用的映射的集合,是其从设计探索器经常收到的,用于训练场景子集,这样对于设计探索器中的当前族群是典型的。随着设计探索器的族群随着时间缓慢变化,代表性子集也会相应的变化。在[33]中给出了3种不同的选择代表性子集技术并进行了评估:一种基于GA的场景空间搜索(意思是,两个GA同时运行,一个用于设计探索器,一个用于子集选择器),一个基于特征选择的搜索算法,和一个两者混合的算法。后者的目标是将基于GA的和基于FS的搜索的优势结合起来。即,GA能够迅速的探索可能的场景子集的空间,但由于其随机的本质,可能会丢失最优场景子集。对于特征选择算法则不是这个情况,因为其会更加系统性的探索一个场景子集的局部邻域。

To give a feeling of the performance of the three different fitness prediction techniques, Figure 8 shows the results of a scenario-based DSE experiment in which the three techniques are compared for three different scenario subset sizes: 1%, 4%, and 8% of the total number of application scenarios. In this experiment, the mapping of ten applications with a total of 58 tasks and 75 communication channels is explored. The multiapplication workload consists of 4607 different application scenarios in total. The target platform is a heterogeneous MPSoC with four general-purpose processors, two ASIPs and two ASICs, all connected using a crossbar network. In this experiment, a DSE with a fixed duration of 100 min is performed for all three subset selector approaches. The results have been averaged over nine runs. To evaluate the fitness of mapping solutions, we have again deployed the Sesame MPSoC simulation framework (see the Simulative fitness evaluation section). To determine the efficiency of the multiobjective DSE, we obtain the distance of the estimated Pareto front (execution time versus energy consumption of mapping solutions) to the optimal Pareto front. For this purpose, we normalized execution time and energy consumption to a range from 0 to 1. As the optimal Pareto front is not exactly known since the design space is too large to exhaustively search it, we have used the combined Pareto front of all our experiments for this.

为了感觉一下三种不同的适应度预测技术的性能,图8展示了一个基于场景的DSE试验的结果,其中三种技术在三种不同的场景子集大小中进行比较:应用场景的总计数量的1%,4%和8%。在这个试验中,探索了10个应用的映射,总计58个任务,和75个通信通道。多应用workload总计包含4607个不同的应用场景。目标平台是一个异质MPSoC,有4个通用目标的处理器,2个ASIP,2个ASIC,都使用crossbar网络连接到一起。在这个试验中,对三个子集选择器方法,都进行了固定100分钟的DSE。进行了9次运行,结果进行了平均。为评估映射解的适应度,我们再次部署了Sesame MPSoC仿真框架(见仿真性适应度评估节)。为确定多目标DSE的效率,我们得到估计的Pareto front(执行时间与映射解的能耗)到最优Pareto front的距离。为此,我们将执行时间和功耗归一化到了0-1。由于最优Pareto front并不是精确已知的,因为设计空间太大了,不能穷举式搜索,我们在所有试验中对此使用了结合的Pareto front。

The size of the scenario subset provides a trade-off between accuracy and convergence of the search. That is, a larger scenario subset will lead to a more accurate fitness prediction of mappings in the design explorer at the cost of a larger computational overhead to obtain the fitness of a single mapping causing a slower convergence of the search. This can be seen in Figure 8. The GA and the FS subset selection methods have worse results when the subset becomes larger (remember that we use a fixed DSE duration of 100 min). For a subset size of 4%, the hybrid selector is, however, still able to benefit from a subset with a higher accuracy. The slower convergence only starts to effect the efficiency for the 8% subset. Comparing the different methods, the hybrid method shows the best results. The only exception is for the 1% subset. In this case, the GA is still able to search the smaller design space of possible subsets. Still, the result of the hybrid method at 4% is better than the result of the GA at 1%. With the larger subset sizes, the hybrid method can exploit both the benefits of the feature selection and the GA.

场景子集的规模,在搜索的准确率和收敛性之间给出了折中。即,更大的场景子集会带来设计探索器中,更精确的映射适应度预测,其代价是得到一个映射的适应度会有更大的计算量,导致搜索收敛更慢。具体见图8。GA和FS子集选择方法的效果,在子集变得更大的时候,效果会更差一些(我们使用了固定的DSE时间,100 min)。对于4%的子集大小,混合选择器仍然能够得到更高的准确率。慢收敛性在8%子集大小时,开始影响效率。比较不同的方法,混合方法给出了最好的结果。例外是在1%的子集上。在这种情况中,GA能够搜索更小的设计空间,得到可能的子集。在更大的子集规模中,混合方法可以利用FS和GA方法的优势。

4. Conclusion and Future Work

In this article, we have presented various aspects of the state of the art in embedded systems DSE. Here, we have organized our discussion along the lines of the two primary elements of DSE: the evaluation of single design points and the search strategy for covering the design space. For the coming years, there are still many open research challenges for this domain. Just to give a few examples, first, embedded systems more and more need to become adaptive systems due to increasingly dynamic application workload behavior (as was previously discussed), the need for QoS management to dynamically trade off different system qualities such as performance, precision, and power consumption, and the fact that we have reached a technology level where our circuits are no longer fully reliable, increasing the chances of transient and permanent faults. This calls for research to take system adaptivity, in which a system can continuously customize itself at runtime according to the application workload at hand and the state of the system (e.g., [5] and [36]), into account in the process of DSE.

本文中,我们给出了目前最好的嵌入式系统DSE的各个方面。这里,我们沿着DSE的两个基本元素来组织我们的讨论:单个设计点的评估,和覆盖设计空间的搜索策略。未来的几年中,这个领域仍然会有很多开放的研究挑战。几个例子是,第一,嵌入式系统越来越多的要成为自适应系统,因为应用workload的行为越来越动态变化,QoS管理的需求,来动态的在不同的系统质量之间tradeoff,比如性能,精度和功耗,我们已经达到了一个技术层次,我们的电路并不是完全可靠的,瞬态和永久错误的可能性增加了。这呼吁研究在DSE的过程中将系统自适应性纳入考虑,即系统在运行的时候可以持续的根据手头的应用workload和系统的状态进行定制。

Second, the trend toward cyberphysical systems and the IoT makes the process of DSE even more complicated since DSE in this context requires taking the behavior of the physical environment (including user behavior) into account. This calls for renewed research into the speed-accuracy tradeoff for the different models and their possible co-simulation applied in DSE for this domain.

第二,cyberphysical系统和IoT的趋势,使DSE的过程更复杂,因为这个上下文中的DSE要将物理环境的行为(包括用户行为)纳入考虑。这要求研究要更新,对这个领域中的DSE,对不同的模型和其可能的共同仿真的速度-准确率折中进行研究。

A final research direction involves the introduction of new design objectives in the process of (early) DSE, in addition to the traditional objectives such as system performance and power/energy consumption. Arguably, a good example is the need for taking system security into account as an optimization objective. As embedded systems are becoming increasingly ubiquitous and interconnected, they attract a worldwide attention of attackers, which makes the security aspect more important than ever during the design of those systems. Currently, system security is still mostly considered as an afterthought and typically is not taken into account during the very early design stages. However, any security measure that may eventually be taken later in the design process does affect the already established tradeoffs with respect to the other system objectives such as performance, power/energy consumption, cost, etc. Thus, covering the security aspect in the earliest phases of design is necessary to design systems that are, in the end, optimal with regard to all system objectives. However, this poses great difficulties because unlike the earlier mentioned conventional system objectives like performance and power consumption, security is hard to quantify. This necessitates research on techniques that make it possible to incorporate security as an objective in early DSE.

最后一个研究方向涉及的是,在早期DSE的过程中,在传统的目标,比如系统性能和功耗/能耗之外,引入新的设计目标。一个好的例子是,需要将系统安全性纳入考虑,作为一个优化目标。由于嵌入式系统正变得无处不在和相互连接,它们吸引了世界范围内攻击者的注意力,这使安全方面在设计这些系统的时候更加重要。目前,系统安全性仍然主要认为是事后的思考,在设计的早期,一般不进行考虑。但是,任何在后面的设计过程中采取安全措施,都会影响已经确定的其他系统目标的tradeoff,比如性能,功耗/能耗,价格,等等。因此,在设计的最早期,将安全方面纳入考虑,是非常必要的,设计出来的系统,最终要在所有系统目标方面都是最优的。但是,这提出了很大的困难,因为与之前提到的传统系统目标不同的是,如性能和功耗,安全很难进行量化。这使得研究将安全纳入到早期DSE的目标成为必要。