
GARNET: A Detailed On-Chip Network Model inside a Full-System Simulator

Niket Agarwal et al., Princeton University

0. Abstract

Until very recently, microprocessor designs were computation-centric. On-chip communication was frequently ignored. This was because of fast, single-cycle on-chip communication. The interconnect power was also insignificant compared to the transistor power. With uniprocessor designs providing diminishing returns and the advent of chip multiprocessors (CMPs) in mainstream systems, the on-chip network that connects different processing cores has become a critical part of the design. Transistor miniaturization has led to high global wire delay, and interconnect power comparable to transistor power. CMP design proposals can no longer ignore the interaction between the memory hierarchy and the interconnection network that connects various elements. This necessitates a detailed and accurate interconnection network model within a full-system evaluation framework. Ignoring the interconnect details might lead to inaccurate results when simulating a CMP architecture. It also becomes important to analyze the impact of interconnection network optimization techniques on full system behavior.


In this light, we developed a detailed cycle-accurate interconnection network model (GARNET) inside the GEMS full-system simulation framework. GARNET models a classic five-stage pipelined router with virtual channel (VC) flow control. Microarchitectural details, such as flit-level input buffers, routing logic, allocators and the crossbar switch, are modeled. GARNET, along with GEMS, provides a detailed and accurate memory system timing model. To demonstrate the importance and potential impact of GARNET, we evaluate a shared and private L2 CMP with a realistic state-of-the-art interconnection network against the original GEMS simple network. The objective of the evaluation was to figure out which configuration is better for a particular workload. We show that not modeling the interconnect in detail might lead to an incorrect outcome. We also evaluate Express Virtual Channels (EVCs), an on-chip network flow control proposal, in a full-system fashion. We show that by improving on-chip network latency-throughput, EVCs do lead to better overall system runtime; however, the impact varies widely across applications.


1. Introduction

With continued transistor scaling providing chip designers with billions of transistors, architects have embraced many-core architectures to deal with increasing design complexity and power consumption [13, 14, 29]. With increasing core counts, the on-chip network becomes an integral part of future chip multiprocessor (CMP) systems. Future CMPs, with dozens to hundreds of nodes, will require a scalable and efficient on-chip communication fabric. There are several ways in which on-chip communication can affect higher-level system design. Contention delay in the network, as a result of constrained bandwidth, impacts system message arrivals. In multi-threaded applications, spin locks and other synchronization mechanisms magnify small timing variations into very different execution paths [2]. Network protocols also impact the ordering of messages. A different order of message arrival can impact the memory system behavior substantially. Especially for cache coherence protocols, protocol-level deadlocks are carefully avoided by designing networks that obey specific ordering properties among various protocol messages [8]. The manner in which the ordering is implemented in the network leads to different messages seeing different latencies and again impacts message arrivals. Communication affects not only performance, but can also be a significant consumer of system power [18].


Not only do network characteristics impact system-level behavior, the memory system also impacts network design to a huge extent. Co-designing the interconnect and the memory system provides the network with realistic traffic patterns and leads to better fine-tuning of network characteristics. System-level knowledge can highlight which metric (delay/throughput/power) is more important. The interconnect also needs to be aware of the specific ordering requirements of higher levels of design. Figure 1 shows how various components of a CMP system are coupled together. The interconnection network is the communication backbone of the memory system. Thus, interconnection network details can no longer be ignored during memory system design.


To study the combined effect of system and interconnect design, we require a simulation infrastructure that models these aspects to a sufficient degree of detail. In most cases, it is difficult to implement a detailed and accurate model that is fast enough to run realistic workloads. Adding detailed features increases the simulation overhead and slows it down. However, there are some platforms that carefully trade off accuracy and performance to sufficiently abstract important system characteristics while still having reasonable speed of simulation on realistic workloads. One such platform is the GEMS [20] full-system simulation platform. It does a good job in capturing the detailed aspects of the processing cores, cache hierarchy, cache coherence, and memory controllers. This has led to widespread use of GEMS in the computer architecture research community. There has been a huge body of work that has used GEMS for validating research ideas. One limitation of GEMS, however, is its approximate interconnect model. The interconnection substrate in GEMS serves as a communication fabric between various cache and memory controllers. The model is basically a set of links and nodes that can be configured for various topologies, with each link having a particular latency and bandwidth. For a message to traverse the network, it goes hop by hop towards the destination, stalling when there is contention for link bandwidth. This is an approximate implementation and far removed from what a state-of-the-art interconnection network [10, 14] looks like. GEMS does not model a detailed router or a network interface. By not modeling a detailed router microarchitecture, GEMS ignores buffer contention, switch and virtual channel (VC) arbitration, realistic link contention and pipeline bubbles. The GEMS interconnect model also assumes perfect hardware multicast support in the routers. In on-chip network designs, supporting fast and low-power hardware multicast is a challenge [15]. These and other limitations in the interconnect model can significantly affect the results reported by the current GEMS implementation. Also, researchers focusing on low-level interconnection network issues have not adopted GEMS, typically relying instead on trace-driven simulation with synthetic or actual traffic traces. In a trace-driven approach, the functional simulation is not impacted by the timing simulator and hence timing-dependent effects are not captured. This is because the trace is generated a priori on a fixed system, and the timing variation caused in the network does not affect the message trace. For example, if a network optimization speeds up a message, that can lead to faster computation, which results in faster injection of the next message, thereby increasing injection rates. Trace-driven techniques also do not capture program variability [2] the way a full-system evaluation can.


In the light of the above issues, we have developed GARNET, a detailed timing model of a state-of-the-art interconnection network, modeled in detail down to the microarchitecture level. A classical five-stage pipelined router [10] with virtual channel flow control is implemented. Such a router is typically used for high-bandwidth on-chip networks [14]. We describe our model in detail in Section 2. It should be noted that the original interconnect model in GEMS is better than that of other simulators, which only model no-load network latency. We will discuss various state-of-the-art simulators in Section 6.


To demonstrate the strong correlation of the interconnection fabric and the memory system, we evaluate a shared and private L2 CMP with a realistic state-of-the-art interconnection network against the original GEMS simple network. We wish to evaluate which configuration (shared/private) performs better for a particular benchmark. To correctly study the effect of the network, we kept all other system parameters the same. We show that not modeling the interconnect in detail might lead to an incorrect outcome and a wrong system design choice.


To demonstrate the need and value of full-system evaluations on network architectures, we also evaluate Express Virtual Channels (EVCs) [19], a novel network flow control proposal, in a full-system simulation fashion. The original EVC paper had presented results with synthetic network traffic. Although it had provided results on scientific benchmarks, it was done in a trace-driven fashion with network-only simulation. Thus, it had failed to capture the system-level implications of the proposed technique. The trace-driven evaluation approach adopted by the authors showed the network latency/throughput/power benefits of EVCs but could not predict the full-system impact of the novel network design. We integrate EVCs into GARNET and evaluate it with scientific benchmarks on a 64-core CMP system against an interconnect with conventional virtual channel flow control [9]. Our evaluation shows the impact EVCs have in improving system performance.


We believe that the contribution of GARNET is three-fold:

  • It enables system-level optimization techniques to be evaluated with a state-of-the-art interconnection network and obtain correct results.

  • It enables the evaluation of novel network proposals in a full-system fashion to find out the exact implications of the technique on the entire system as opposed to only the network.

  • It enables the implementation and evaluation of techniques that simultaneously use the interconnection network as well as other top-level system components, like caches, memory controllers, etc. Such techniques are difficult to evaluate faithfully without a full-system simulator that models the interconnection network as well as other components in detail.

Availability: GARNET was distributed as part of the GEMS official release (version 2.1) in February 2008 and is available at http://www.cs.wisc.edu/gems/. A network-only version of GARNET can be found at http://www.princeton.edu/~niketa/garnet, to allow network researchers to fully debug their network-on-chip (NoC) designs prior to full-system evaluations. GARNET is starting to gather a user base, and has already been cited by published articles [6, 15]. GARNET has also been plugged into an industrial simulation infrastructure [21] and is being used by industrial research groups at Intel Corp. for their evaluations.

The rest of the paper is organized as follows: Section 2 describes the GARNET model in detail. Section 3 provides validation for the GARNET model. Section 4 evaluates a shared and a private CMP configuration with a state-of-the-art interconnection network and compares it to the original GEMS’ simple network model. Section 5 describes EVCs and presents full-system results. Section 6 touches upon related work and Section 7 concludes.


2. GARNET

We next present details of GARNET.

2.1 Base GARNET model design

State-of-the-art on-chip interconnect: Modern state-of-the-art on-chip network designs use a modular packet-switched fabric in which network channels are shared over multiple packet flows. We model a classic five-stage state-of-the-art virtual channel router. The router can have any number of input and output ports depending on the topology and configuration. The major components that constitute a router are the input buffers, route computation logic, VC allocator, switch allocator and crossbar switch. Since on-chip designs need to adhere to tight power budgets and small router footprints, we model flit-level buffering rather than packet-level buffering. The high bandwidth requirements of cache-coherent CMPs also demand sophisticated network designs, such as credit-based VC flow control at every router [14]. A five-stage router pipeline was selected to adhere to high-clock-frequency network designs. Every VC has its own private buffer. The routing is dimension-ordered. Since research in providing hardware multicast support is still in progress and state-of-the-art on-chip networks do not have such support, we do not model it inside the routers.


A head flit, on arriving at an input port, first gets decoded and gets buffered according to its input VC in the buffer write (BW) pipeline stage. In the same cycle, a request is sent to the route computation (RC) unit simultaneously, and the output port for this packet is calculated. The header then arbitrates for a VC corresponding to its output port in the VC allocation (VA) stage. Upon successful allocation of an output VC, it proceeds to the switch allocation (SA) stage where it arbitrates for the switch input and output ports. On winning the switch, the flit moves to the switch traversal (ST) stage, where it traverses the crossbar. This is followed by link traversal (LT) to travel to the next node. Body and tail flits follow a similar pipeline except that they do not go through RC and VA stages, instead inheriting the VC allocated by the head flit. The tail flit on leaving the router, deallocates the VC reserved by the packet.

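The stage sequence above can be summarized in a minimal sketch (stage names follow the text; the function and type names are ours, not identifiers from the GARNET source):

```python
# Stages of the classic five-stage VC router described above:
# BW (buffer write) and RC (route computation) share a cycle, then
# VA (VC allocation), SA (switch allocation), ST (switch traversal),
# followed by LT (link traversal) to the next node.

def pipeline_stages(flit_type):
    """Return the pipeline stages a flit passes through, by flit type."""
    if flit_type == "head":
        return ["BW+RC", "VA", "SA", "ST", "LT"]
    if flit_type in ("body", "tail"):
        # Body and tail flits skip RC and VA: they inherit the route
        # and the VC allocated by the head flit.
        return ["BW", "SA", "ST", "LT"]
    raise ValueError(flit_type)
```

A tail flit follows the same shortened path as a body flit; the VC deallocation it performs on leaving the router is bookkeeping outside this sketch.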

Router microarchitectural components: Keeping in mind on-chip area and energy considerations, single-ported buffers and a single shared port into the crossbar from each input were designed. Separable VC and switch allocators, as proposed in [25], were modeled. This was done because these designs are fast and of low complexity, while still providing reasonable throughput, making them suitable for the high clock frequencies and tight area budgets of on-chip networks. The individual allocators are round-robin in nature.


Interactions between memory system and GARNET: As shown in Figure 1, the interconnection network acts as the communication backbone for the entire memory system on a CMP. The various L1 and L2 cache controllers and memory controllers communicate with each other using the interconnection network. Note that we are talking about a shared L2 system here [16]. The network interface acts as the interface between various modules and the network. On a load/store, the processor looks in the L1 cache. On a L1 cache miss, the L1 cache controller places the request in the request buffer. The network interface takes the message and breaks it into network-level units (flits) and routes it to the appropriate destinations which might be a set of L1 controllers as well as L2 controllers. The destination network interfaces combine the flits into the original request and pass it on to the controllers. The responses use the network in a similar manner for communication. Some of these messages might be on the critical path of memory accesses. A poor network design can degrade the performance of the memory system and also the overall system performance. Thus, it is very important to architect the interconnection network efficiently.

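The network interface's segmentation of a protocol message into flits, and reassembly at the destination, can be sketched as follows (illustrative only; the flit-type convention for single-flit packets is our assumption, not something the text specifies):

```python
import math

def packetize(message_bytes, flit_bytes):
    """Break a memory-system message into network-level flits, as the
    network interface described above would."""
    n_flits = max(1, math.ceil(message_bytes / flit_bytes))
    if n_flits == 1:
        # Assumed convention: a single-flit packet plays both roles.
        return ["head+tail"]
    return ["head"] + ["body"] * (n_flits - 2) + ["tail"]

def packet_complete(flits):
    """The destination NIC can recombine the flits into the original
    message once the tail flit has arrived."""
    return flits[-1] in ("tail", "head+tail")
```

For example, a 72-byte message over 16-byte flits becomes five flits: one head, three body, one tail.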

There is ongoing debate whether future CMPs will employ shared or private last-level on-chip caches [31]. While shared on-chip caches increase the overall on-chip cache capacity (by avoiding replication), they also increase the L2 access time on an L1 cache miss. It is not entirely clear what kind of cache organization will be employed in future CMPs. For a shared-cache CMP architecture, as shown in Figure 1, the on-chip network is used for communication between the various L1’s and L2 banks and also between the L2 banks and on-chip memory controllers. In contrast, for a private-cache CMP architecture, the on-chip network comes into play only for the interaction between private L2’s and on-chip memory controllers. The more communication there is on the on-chip network, the more critical it becomes to model it in detail. This does not, however, mean that for a private-cache CMP architecture, the on-chip network design details are unimportant. The interconnect is typically architected to tolerate average network traffic. Thus, during moderate to high loads, the on-chip network can experience high latencies. Similarly, if a configuration has a low bandwidth requirement in the average case, the network chosen will be of low bandwidth and will congest much earlier. Thus, modeling the details of the on-chip interconnect is important to know the exact effect of the network on the entire system.


Point-to-point ordering: Some coherence protocols require the network to support point-to-point ordering. This implies that if two messages are sent one after the other from a particular source node to the same destination node, the order in which they are received at the destination has to be the same as the order in which they were sent. To provide this support, GARNET models features wherein the VC and switch allocators support system-level ordering. In the VC allocation phase, if VCs contend for an output VC at the same output port, then the request of the packet that arrived at the router first is serviced first. During switch allocation, if multiple VCs at the same input port contend for the same output port, then the request of the packet that arrived at the router first is serviced before the others. This, together with deterministic routing, guarantees that packets with the same source and destination will not be re-ordered in the network.

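The arbitration rule above amounts to oldest-first selection among contenders. A minimal sketch (the tuple layout and names are our own illustration):

```python
def oldest_first(requests):
    """Pick the winner among requests contending for the same output.
    Each request is a (packet_id, arrival_cycle) tuple; the packet that
    arrived at the router earliest is serviced first, which, combined
    with deterministic routing, preserves point-to-point ordering."""
    winner = min(requests, key=lambda req: req[1])
    return winner[0]
```

For example, with contenders that arrived at cycles 14, 9 and 20, the cycle-9 packet wins the allocation.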

Network power: GARNET incorporates the Orion power models [30]. The performance counters required for power calculations are recorded during the simulation. These counters keep track of the amount of switching at various components of the network. GARNET models per-component events (e.g., when a read occurs in a router buffer), but not bit-wise switching activity. An average bit-switching activity (0.5) is assumed per component activity. The SWITCHING FACTOR variable inside Orion can be used to vary this factor. The statistics collected per router are the number of reads and writes to router buffers, the total activity at the local and global (VC and switch) arbiters, and the total number of crossbar traversals. The total activity at each network link is also recorded. The total power consumption of the network is equal to the total energy consumed times the clock frequency. The total energy consumption in the network is the sum of the energy consumption of all routers and links. The energy consumed inside a router, as shown in Equation 1, is the summation of the energy consumed by all components inside the router. The energy of each component, in turn, is the sum of its dynamic and leakage energy. The dynamic energy is defined as E = 0.5αCV^2, where α is the switching activity, C is the capacitance and V is the supply voltage. The capacitance is a function of various physical (transistor width, wire length, etc.) and architectural (buffer size, number of ports, flit size, etc.) parameters. The physical parameters can be specified in the Orion configuration file and the architectural parameters are extracted from GARNET. GARNET feeds the per-component activity into Orion. With all the above parameters available, the total network dynamic power is calculated. The total leakage power consumed in the network is the sum of the leakage power of the router buffers, crossbar, arbiters and links. Since GEMS does not simulate actual data values, we do not feed them into Orion. Orion uses the leakage power models described in [7] for its calculations.


$$E_{\mathrm{router}} = E_{\mathrm{buffer\_write}} + E_{\mathrm{buffer\_read}} + E_{\mathrm{vc\_arb}} + E_{\mathrm{sw\_arb}} + E_{\mathrm{xb}} \qquad (1)$$
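As a concrete sketch of the calculation, the per-event dynamic energy E = 0.5αCV^2 and the per-router sum of Equation 1 can be written as follows (the input values below are made-up, not Orion defaults):

```python
def dynamic_energy(events, capacitance, vdd, switching=0.5):
    """Dynamic energy for `events` component events, each costing
    E = 0.5 * alpha * C * V^2. The default switching activity of 0.5
    mirrors the average bit-switching assumption in the text."""
    return events * 0.5 * switching * capacitance * vdd ** 2

def router_energy(buffer_write, buffer_read, vc_arb, sw_arb, xb):
    """Equation (1): router energy is the sum of its component energies
    (buffer writes/reads, VC and switch arbitration, crossbar)."""
    return buffer_write + buffer_read + vc_arb + sw_arb + xb
```

Leakage energy would be added per component on top of these dynamic terms, and total network power then follows by summing over routers and links and multiplying by the clock frequency, as described above.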

2.2 GARNET configuration and statistics

We next present details about the configuration and statistics in GARNET.

Configuration: GARNET can be easily configured to model a desired interconnect. The various input parameters of GARNET are shown in Table 1. The various configurable elements of GARNET are as follows:


  1. Network topology: In GARNET, the network topology is configurable. GEMS allows the number of processors, L2 banks, memory banks, etc., to be specified. GARNET models network interfaces (NICs) that interface the various cache and memory controllers with the network. Each NIC is connected to a network router. The topology of the interconnect is specified by a set of links between the network routers in a configuration file. The configuration file can be used to specify complex network configurations like fat tree [10], flattened butterfly [17], etc. Since the topology in GARNET is a point-to-point specification of links, irregular topologies can be evaluated as well.

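A point-to-point link specification of this kind might look like the following sketch (the tuple format is hypothetical, for illustration only; GARNET's actual configuration file syntax differs):

```python
# Hypothetical link list for a 2x2 mesh: (router_a, router_b, latency, weight).
#   0 -- 1
#   |    |
#   2 -- 3
MESH_2X2 = [
    (0, 1, 1, 1),  # X-direction links
    (2, 3, 1, 1),
    (0, 2, 1, 2),  # Y-direction links
    (1, 3, 1, 2),
]

def neighbors(links, router):
    """Derive a router's neighbor set (and hence its port count) from
    the link list; irregular topologies are handled identically, since
    the topology is just a set of point-to-point links."""
    out = set()
    for a, b, _latency, _weight in links:
        if a == router:
            out.add(b)
        if b == router:
            out.add(a)
    return sorted(out)
```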

  2. Interconnect bandwidth: Interconnect bandwidth can be configured through the flit size, which specifies the link bandwidth, i.e., the number of bytes per cycle per network link. Links with lower bandwidth than this (for instance, some off-chip links) can be modeled by specifying a longer latency across them in the topology configuration file.


  3. Router parameters: Since the topology of a GARNET network is configurable, routers in the network can have an arbitrary number of input and output ports. The number of VCs per port and the number of buffers per VC are parameterized and can be specified.


  4. Routing algorithm: Each router in GARNET has a routing table that is populated at configuration time. The routing table, as shown in Figure 2, specifies the output port to which a particular packet needs to be forwarded to reach a particular destination. GARNET reads the topology configuration file and populates the routing tables by calculating a minimal path between the nodes. More than one minimal path can exist from a source to a destination, and GARNET puts all such entries in the routing table. The routing table also has a field, link weight, that can be specified in the topology configuration file on a link-by-link basis. During the RC phase inside the routers, these weights are looked up and compared. The output link with the minimal weight is selected for routing. This allows various routing algorithms to be modeled in GARNET. Figure 3 demonstrates how X-Y routing can be modeled by assigning all X-direction links a lower weight than the Y-direction links. This leads to routing conflicts being broken in favor of X-direction links. Other routing protocols, such as X-Y-Z routing, left-first, south-last, etc., can be similarly specified using the configuration file. In its current distribution, GARNET relies on the routing protocol to avoid deadlocks in the network. Thus, the assignment of link weights should be done such that it leads to a deadlock-free routing scheme. Since GARNET adopts a table-based approach to routing, adaptive routing can be easily implemented by choosing routes based on dynamic variables, such as link activity, buffer utilization, etc., rather than on link weights. A deadlock recovery/avoidance scheme would also have to be implemented in that case. Table-based routing also allows the opportunity to employ fancier routing schemes, such as region-based routing [12] and other non-minimal routing algorithms. The table population algorithm would have to be modified for that, which can be done easily. The power consumption of the routing tables is, however, not currently modeled in GARNET, since it is not required for simpler routing schemes, such as X-Y routing. It can, however, be easily modeled similar to the router buffers.

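The weight-based selection during the RC phase can be sketched as follows (the table layout is our illustration of the idea behind Figure 2, not GARNET's actual data structure):

```python
def route_compute(routing_table, destination):
    """Among the minimal-path entries for `destination`, return the
    output port whose link weight is smallest. With X-direction links
    weighted below Y-direction links, ties between minimal paths are
    broken in favor of X first, i.e., X-Y routing."""
    candidates = routing_table[destination]  # list of (output_port, weight)
    port, _weight = min(candidates, key=lambda entry: entry[1])
    return port

# Hypothetical entries at one mesh router: two minimal paths to node 7,
# with the X-direction ("east") link carrying the lower weight.
TABLE = {7: [("east", 1), ("north", 2)]}
```

An adaptive variant would rank the candidates by a dynamic quantity (link activity, buffer occupancy) instead of the static weight, as the text notes.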

  5. Configuring the router pipeline: GEMS uses a queue-driven event model to simulate timing. Various components communicate using message buffers of varying latency and bandwidth, and the component at the receiving end of the buffer is scheduled to wake up when the next message is available to be read from the buffer. The simulation proceeds by invoking the wakeup method for the next scheduled event on the event queue. Since GARNET is a part of GEMS, it is also implemented in a similar event-driven fashion. The dependencies between various modules are as follows. The NIC puts flits into a queue and schedules the network link after a specific number of cycles. The network link wakes up and picks up the specific flit and schedules the input port of the attached router next. The input port wakes up and picks up the flit from the network link and writes into a particular VC buffer. The subsequent router pipeline stages (VA, SA and ST) follow in a similar event-driven fashion. The event-driven nature of the implementation allows the GARNET pipeline to be readily changed to a different one with a bit of programming effort. Different stages can be merged with each other to shorten the pipeline to one with fewer stages. Additional stages can similarly be added by simply following the event-driven approach. The modular design also allows the allocators to be easily modified. The VC and switch allocators currently implemented are separable [25] with the local and global arbitration being round-robin in nature. If a different allocation scheme is desired (e.g., matrix arbiter), only the arbitration code would need to be modified. As an illustration of the ease with which GARNET can be modified, it took us just one week to code up and evaluate EVCs in the GEMS+GARNET infrastructure.

配置路由器流水线:GEMS使用一种队列驱动的事件模型,来仿真时序。各种组成部分的通信,使用的是不同延迟和带宽的message buffers,在buffer的接收端的组成部分,当下一条消息准备好从buffer中读取时,被安排wake up。仿真的进行是对事件队列的下一个安排的事件来调用wakeup方法。由于GARNET是GEMS的一部分,所以也是以类似的事件驱动的方式。各种模块的依赖关系如下。NIC将flits放到队列中,在特定的周期数量后,安排进行网络连接。网络连接wake up,pick up特定的flit,schedule连接的路由器的输入端口。输入端口wake up,接收到网络连接来的flit,将其写入到特定的VC buffer中。后续的路由器流水线阶段(VA, SA和ST)按照类似的事件驱动的方式进行。实现的事件驱动的本质,使GARNET的流水线可以很容易的改变成不同的配置,只需要很少量的代码。不同的阶段可以相互合并,以缩短流水线成更少级数的流水线。按照事件驱动方法,也可以类似的加入额外的流水线阶段。模块化的设计,使得allocator也可以很容易的修改。VC和switch allocator目前的实现的分离的,其local和global仲裁就是round-robin。如果期望不同的allocation方案(如,矩阵仲裁器),只需要修改仲裁代码。作为GARNET可以很容易修改的例证,我们只需要一个星期的编程,就用GEMS+GARNET评估了EVCs。
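
The NIC-to-link-to-input-port wakeup chain described above can be sketched as a minimal event-driven loop. The class and method names below are hypothetical stand-ins, not the real GEMS/GARNET classes; only the schedule/wakeup discipline mirrors the text.

```python
# Minimal sketch of a queue-driven event model (illustrative names).
import heapq

class EventQueue:
    """Global queue of (time, component) wakeups."""
    def __init__(self):
        self.time, self._q, self._seq = 0, [], 0
    def schedule(self, component, delay):
        self._seq += 1                           # tie-breaker for equal times
        heapq.heappush(self._q, (self.time + delay, self._seq, component))
    def run(self):
        while self._q:
            self.time, _, comp = heapq.heappop(self._q)
            comp.wakeup()

class NetworkLink:
    """Carries one flit; wakes after its latency and hands the flit on."""
    def __init__(self, eq, latency, dest_port):
        self.eq, self.latency, self.dest_port = eq, latency, dest_port
        self.flit = None
    def send(self, flit):
        self.flit = flit
        self.eq.schedule(self, self.latency)     # link wakes after `latency`
    def wakeup(self):
        self.dest_port.pending = self.flit
        self.eq.schedule(self.dest_port, 1)      # input port wakes next cycle

class InputPort:
    """BW stage: on wakeup, writes the pending flit into a VC buffer."""
    def __init__(self, eq):
        self.eq, self.pending, self.vc_buffer = eq, None, []
    def wakeup(self):
        self.vc_buffer.append((self.pending, self.eq.time))

eq = EventQueue()
port = InputPort(eq)
link = NetworkLink(eq, latency=1, dest_port=port)
link.send("head-flit")       # injected at cycle 0
eq.run()                     # flit lands in the VC buffer at cycle 2
```

The later pipeline stages (VA, SA, ST) would be further `wakeup` targets chained in the same way, which is why merging or adding stages is a local change.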

Flexible pipeline model: Some system designers may wish to vary the NoC pipeline length to explore the sensitivity of full-system performance to per-hop router delay, while still faithfully modeling link, switch and buffer constraints. GARNET includes a “flexible-pipeline” model to enable such evaluations. This model implements an output-queued router that adds a variable number of cycles after a flit reaches an output of a router, modeling the router pipeline delay. A head flit, on arriving at a router, moves to a destined output port and output VC, which are decided beforehand: the output port and VC are arbitrated for one hop in advance, because, on arrival at the router, the flit has to be guaranteed a buffer. The flit gets buffered at the output queue for the number of pipeline stages specified. Meanwhile, the head flit sends a VC arbitration request to the downstream router, which calculates the output port for the flit and looks for a free VC. On successfully getting a free VC, the router signals the upstream router that VC arbitration for the flit has completed. The head flit then arbitrates for the output physical link and moves out of the router. Body and tail flits follow the same course, except that they do not arbitrate for a VC; tail flits also release the VC after they exit the router. A flit, prior to its departure from a router, checks for free buffers in the downstream router, so finite buffering is also modeled. This implementation models link contention, VA, router pipeline delay and finite buffering. What it idealizes is the flow control, specifically how the router crossbar switch is allocated.

灵活的流水线模型:一些系统设计者可能会希望NoC流水线长度进行变化,以探索full-system性能对每跳路由器延迟的敏感性,同时还能忠实的建模link, switch和buffer约束。GARNET包含了一个灵活流水线模型,以使这种评估成为可能。这种模型实现了一种输出队列的路由器,在一个flit到达路由器的输出后加上了不同数量的cycles。这建模了路由器流水线延迟。一个head flit到达路由器后,移动到目的输出端口,并输出VC,这是事先决定好的。输出端口和VC提前仲裁一跳。这是因为,在到达路由器后,flit必须有一个buffer。flit在输出队列中进入buffer,其中流水线级数是指定的。同时,head flit向下游路由器发送一个VC仲裁请求,计算flit的输出端口,查找一个可用的VC。在成功的得到可用VC后,路由器给上游路由器发信号,对这个flit的VC仲裁已经结束。这个head flit现在仲裁输出物理link,从这个路由器中移出。Body和tail flits按照相同的流程进行,除了它们不需要仲裁VC。Tail flits在退出路由器时,还需要释放VC。一个flit在离开路由器之前,会在下游路由器中检查可用的buffers。因此,有限的buffering也进行了建模。这种实现建模了link竞争,VA,路由器流水线延迟,和有限buffering。其理想化的是流控制,具体的是,路由器crossbar switch是怎样分配的。
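
Under the flexible-pipeline model, the per-flit timing reduces to a simple rule: a flit leaves a configurable number of cycles after reaching the output queue, but no earlier than a downstream buffer is free. A sketch of that rule (illustrative, not GARNET code):

```python
# Timing rule of an output-queued router with a configurable pipeline
# depth and finite downstream buffering (a sketch, not simulator code).

def departure_cycle(arrival, pipeline_depth, downstream_free_at):
    """Cycle at which a flit leaves the router.

    arrival            : cycle the flit reached this router's output queue
    pipeline_depth     : modeled per-hop router delay (configurable)
    downstream_free_at : first cycle a buffer is free downstream; with
                         finite buffering, departure stalls until then
    """
    ready = arrival + pipeline_depth
    return max(ready, downstream_free_at)
```

Sweeping `pipeline_depth` is exactly the per-hop-delay sensitivity study the text describes, while the `max` term keeps link and buffer contention in the model.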

  1. Network-only simulation: Some network studies require an interconnect model to be evaluated with synthetic traffic types (uniform random, tornado, bit complement, etc.) as inputs. Such studies are very common in the interconnection network research community. Synthetic traffic stresses various network resources and provides an estimate of the network’s performance/power under various scenarios. Keeping this in mind, GARNET has been designed to run in a network-only mode also. Different synthetic traffic types can be fed to the network and various network performance/power statistics extracted.

只有网络的仿真。一些网络研究需要互联模型在合成流量类型作为输入中被评估(uniform random, tornado, bit complement, 等)。这样的研究在互联网络研究团体中很常见。合成流量在很多网络资源中给出压力,对网络的性能/功耗在多种场景中给出估计。要记住,GARNET的设计也可以在只有网络的情况中运行。不同的合成流量类型可以送入网络,然后提取出各种不同的网络性能/功耗统计等。
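
The synthetic patterns mentioned can be sketched with their textbook definitions. These are the standard formulas, not GARNET's traffic-injection code; `tornado` below uses one common definition of the halfway shift, and `bit_complement` assumes a power-of-two node count.

```python
# Standard synthetic traffic destination functions for a k x k mesh with
# nodes numbered row-major (textbook definitions; assumptions noted above).
import random

def uniform_random(src, k, rng=random):
    """Each packet picks a destination uniformly at random."""
    return rng.randrange(k * k)

def bit_complement(src, k):
    """Destination is the bitwise complement of the source id
    (node count k*k assumed to be a power of two)."""
    return (~src) & (k * k - 1)

def tornado(src, k):
    """Each dimension shifts roughly halfway around the mesh."""
    x, y = src % k, src // k
    shift = (k + 1) // 2 - 1
    return ((y + shift) % k) * k + (x + shift) % k
```

Feeding such destination functions at a swept injection rate and recording average latency yields the latency-throughput curves used in the validation section that follows.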

Statistics: The statistics that GARNET outputs include the total number of packets/flits injected into the network, link utilization (average and per-link), average load on a VC, average network latency (inside the network as well as queuing at the interfaces) and network power numbers (dynamic and leakage). Apart from these, various counters can easily be added to modules and their statistics displayed.

统计:GARNET输出的各种统计是,注入到网络中的packets/flits的总计数量,link利用率(每个link的,和平均的),在一个VC上的平均load,平均网络延迟(网络中的,以及在接口的队列中的),和网络功率数值(动态的,和泄露的)。除了这些,各种计数器也可以很容易的加入到模块中,显示出各种统计数据。
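
The counters behind such statistics can be sketched as follows. The structure and names are hypothetical (GARNET's actual counters differ); the point is that per-link busy cycles give utilization and per-flit timestamps give average latency.

```python
# Sketch of network statistics counters (hypothetical structure).

class NetworkStats:
    def __init__(self, n_links):
        self.flits_injected = 0
        self.flits_ejected = 0
        self.latency_sum = 0
        self.link_busy = [0] * n_links           # busy cycles per link

    def record_injection(self):
        self.flits_injected += 1

    def record_ejection(self, inject_cycle, eject_cycle):
        self.flits_ejected += 1
        self.latency_sum += eject_cycle - inject_cycle

    def record_link_traversal(self, link_id):
        self.link_busy[link_id] += 1

    def avg_latency(self):
        return self.latency_sum / max(self.flits_ejected, 1)

    def link_utilization(self, total_cycles):
        """Returns (average utilization, per-link utilization)."""
        per_link = [busy / total_cycles for busy in self.link_busy]
        return sum(per_link) / len(per_link), per_link
```

Adding a new statistic then amounts to one more counter plus a line in the report, which matches the text's claim about extensibility.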

3 GARNET Validation

We validated the GARNET model by running network-only simulations with synthetic traffic (uniform random and tornado) and comparing the results against previously published ones [19, 22]. We also simulated other synthetic traffic patterns and validated them against the PoPNet [1] network simulator, establishing that the latency-throughput curves match.

我们验证GARNET模型的方式,是用合成流量(uniform random和tornado)来运行只有网络的仿真,与之前发表的结果[19, 22]进行比较。我们还仿真了其他合成流量模式,与PoPNet [1]网络仿真器进行验证,确定了延迟-吞吐量曲线是匹配的。

3.1 Validation of uniform random traffic with EVCs

In [19], results were presented for EVCs for uniform random traffic on a network with the classic five-stage pipeline router. That work used a cycle-accurate in-house interconnect simulator that modeled all major components of the router pipeline, viz., buffer management, routing algorithm, VA, SA, and flow control between routers. GARNET, similarly, models all the above aspects. We evaluated the detailed network configuration in GARNET on a 7×7 mesh with dimension-ordered routing, five ports per router, and six VCs per port with shared 24-flit buffers per port. Traffic was generated in a manner similar to [19]. We obtained the actual latency numbers from the authors of [19] and plotted them against latency values observed in GARNET in Figure 4. As shown in the figure, the plots saturate at similar injection rates. The low-load latencies of GARNET also conform to that of EVC’s baseline network.

在[19]中,对EVCs在uniform random流量上在经典5级流水线路由器的网络进行了仿真并给出了结果。那个工作使用了cycle-accurate机构内的互联仿真器,对路由器流水线的主要组成部分都进行了建模,即,buffer管理,路由算法,VA,SA和路由器之间的流控制。类似的,GARNET也对上述方面进行了建模。我们在GARNET中评估了详细的网络配置,在一个7x7的网格上,有维度顺序的路由,每个路由器有5个端口,每个端口有6个VCs,每个端口有共享的24-flit buffers。流量的生成方式与[19]类似。我们从[19]中得到实际的延迟数值,将其与GARNET观察到的延迟数值一起,在图4中画出来。如图所示,饱和点是在类似的注入率上。GARNET的低load延迟也符合EVC的基准网络。

3.2 Validation of tornado traffic with ViChaR

In [22], results were presented for ViChaR for tornado traffic on a baseline network with the classic five-stage pipelined router. That work used an in-house cycle-accurate on-chip network simulator that operated at the granularity of individual architectural components. The simulator modeled pipelined routers and network links very similar to how GARNET models them. We tried to reproduce their results by simulating a similar network configuration on an 8×8 mesh with dimension-ordered routing, five ports per router, and four VCs per port with each VC having 4-flit buffers. The latency plot that we obtained from GARNET is shown in Figure 5. This closely matches results shown in Figure 6 (GEN-TN-16 curve) in [22]. GARNET saturates slightly earlier than ViChaR’s baseline network. This could be an artifact of different kinds of switch and VC allocators used in GARNET and ViChaR’s baseline router. GARNET uses a round-robin separable allocation scheme. Although ViChaR’s baseline router uses separable allocators, the exact arbiters are not mentioned in the paper. The low-load latencies also seem to conform to that of ViChaR’s baseline.

在[22]中,给出了ViChaR对于tornado流量的结果,基准网络是经典的5级流水线路由器。那个工作使用了机构内的cycle-accurate片上网络仿真器,在单个架构组成部分的粒度上操作。仿真器建模了流水线路由器和网络links,与GARNET对其进行建模的方式非常类似。我们尝试复现其结果,仿真了类似的网络配置,8x8的网格,维度顺序的路由,每个路由器5个端口,每个端口4个VCs,每个VC有4-flit buffers。图5给出了我们从GARNET得到的延迟曲线。这与[22]中图6(GEN-TN-16曲线)的结果匹配得非常好。GARNET比ViChaR的基准网络饱和得略早。这可能是GARNET和ViChaR的基准路由器中的switch和VC allocator的类型不同的结果。GARNET使用的是round-robin分离allocation方案。虽然ViChaR的基准路由器使用的是可分离allocator,但准确的仲裁器在论文中并没有提及。低load延迟也似乎符合ViChaR的基准。

3.3 Simulation time overhead

In the SIMICS + GEMS setup, the major simulation overhead is the interfacing of the additional timing modules (RUBY, OPAL) with the functional module (SIMICS). Additional detail in the timing modules does not have a huge impact on simulation time. For instance, to simulate the same number of instructions, GEMS + GARNET took about 21% more simulation time than GEMS + the simple network in our experiments.

在SIMICS+GEMS的设置中,主要的仿真代价是额外的时序模块(RUBY, OPAL)与功能模块(SIMICS)的接口。在时序模块额外的细节,对仿真时间并没有大的影响。比如,为仿真相同数量的指令,GEMS+GARNET比GEMS+简单网络的仿真时间只多了21%。

4 Evaluation of a System Design Decision with and without GARNET

As discussed briefly in Section 2, there is no clear answer yet as to whether future CMP systems would have shared or private last-level caches. We performed an experiment in which we ran a scientific benchmark in full-system fashion and evaluated a shared-cache and a private-cache CMP system separately. We kept all other system parameters, benchmark data set, processor configuration, coherence protocol, total on-chip cache capacity, on-chip network, memory capacity, etc., the same. The aim of the experiment was to conclude whether a private-cache or a shared-cache configuration was preferable for the chosen benchmark. We performed separate experiments, one with the simple pipeline of the original GEMS and one with the detailed pipeline of GARNET.

就像在第2部分简要讨论的一样,未来的CMP系统的最后一级缓存应当是共享的,还是私有的,还尚未清楚。我们进行了试验,以full-system的形式运行了一个科学基准测试,分别评估了一个共享缓存和私有缓存CMP系统。我们让其他系统参数,基准测试数据集,处理器配置,一致性协议,总计片上缓存容量,片上网络,存储的容量等,都一样。试验的目标是给出结论,是私有缓存还是共享缓存的配置,对于给定的基准测试更好一些。我们进行了单独的试验,一个是带有GEMS自带的简单流水线,一个是带有GARNET的详细流水线。

4.1 Target system

We simulated a tiled 16-core CMP system with the parameters shown in Table 2. Each tile consists of a two-issue in-order SPARC processor with 32 KB L1 I&D caches. It also includes a 1 MB private/2 MB shared L2 cache bank. To keep the entire on-chip cache capacity the same (32 MB) for both the shared and private configurations, a total of 16 MB of on-chip directory caches was used for the private L2 configuration. A DRAM was attached to the CMP via four memory controllers along the edges. The DRAM access latency was modeled as 275 cycles. The on-chip network was chosen to be a 4×4 mesh, consisting of 16-byte links with deterministic dimension-ordered XY routing. We simulated a 4-cycle router pipeline with a 1-cycle link. This is the default GARNET network configuration. Each input port contains eight virtual channels and four buffers per virtual channel. Since the simple network assumes a one-cycle router delay, to mimic the same pipeline, we assume the delay of every on-chip link to be four cycles. The simple network assumes packet-level buffering and does not model the router pipeline. It also assumes infinite packet buffering at routers. Hence, the best we could do was to match the per-hop latencies in the simple model to that of the GARNET model. To better isolate the contention effects that GARNET captures, and not overwhelm them with the latency effects of a short vs. long pipeline, we selected the same “per-hop” latencies in both the models.

我们仿真了一个tiled 16核 CMP系统,参数如表2所示。每个tile包含一个双发射顺序SPARC处理器,有32KB L1 I&D缓存。还包含1MB私有2MB共享 L2缓存bank。为对于共享和私有配置保证整个片上缓存容量一样(32 MB),对于私有L2配置的情况,还使用了16MB片上directory缓存。DRAM通过四个内存控制器接入到CMP中。DRAM访问延迟建模为275周期。片上网络选择为4x4网格,包含16 byte links,带有确定性的维度排序的XY路由。我们仿真了一个4周期的路由器流水线,带有1周期的link。这是默认的GARNET网络配置。每个输入端口包含8个虚拟通道,每个虚拟通道4个buffers。由于简单网络假设单周期路由器延迟,为模仿相同的流水线,我们假设每个片上link的延迟为4周期。简单网络假设packet级的buffering,并不建模路由器流水线。它还在路由器中假设无限的packet buffering。因此,我们能做的最好的,就是将简单模型中逐跳的延迟,与GARNET模型进行匹配。为更好的孤立GARNET捕获的竞争效果,不overwhelm短流水线vs长流水线的延迟效果,我们在两个模型中选择相同的逐跳延迟。
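
The per-hop matching described above is simple arithmetic; a quick check of the reasoning (a sketch, not simulator code): GARNET's 4-cycle router plus 1-cycle link equals the simple network's 1-cycle router plus a stretched 4-cycle link at zero load.

```python
# Zero-load per-hop delay under the two configurations in the text.

def per_hop_latency(router_cycles, link_cycles):
    """One hop costs a router traversal plus a link traversal."""
    return router_cycles + link_cycles

garnet_hop = per_hop_latency(router_cycles=4, link_cycles=1)  # detailed model
simple_hop = per_hop_latency(router_cycles=1, link_cycles=4)  # stretched link
```

Because the zero-load per-hop costs are identical, any runtime difference between the two models is attributable to the contention that only GARNET captures.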

4.2 Workload

We ran the RADIX and LU applications from the SPLASH-2 [28] application suite on the above-mentioned configurations. SPLASH-2 is a suite of scientific multi-threaded applications that has been used in academic evaluations for the past two decades. The benchmarks were warmed up and checkpointed to avoid cold-start effects. We ensured that caches were warm by restoring the cache contents captured as part of our checkpoint creation process. We ran the parallel portion of each benchmark to completion. To address the variability in parallel workloads, we simulated each design point multiple times with small, pseudo-random perturbations of request latencies to cause alternative paths to be taken in each run [2]. We averaged the results of the runs.

我们以上述配置运行SPLASH-2应用包中的RADIX和LU应用。SPLASH-2是一组多线程科学应用,在过去20年中都应用到学术评估中。这个基准测试进行了预热和保存checkpoint,以免冷启动的效果。我们确保了缓存是预热过的,从checkpoint中恢复出缓存的内容。我们运行每个基准测试的并行部分,直到结束。为处理并行workloads的变化,我们对每个设计点仿真数次,用小的伪随机的扰动加入到请求的延迟中,导致在每次运行中都会采取不同的运行路径。我们将运行结果进行平均。
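
The multi-run perturbation methodology (following [2]) can be sketched as below. `run_simulation` is a hypothetical stand-in for a full-system run, and the jitter range and seed are assumptions for illustration.

```python
# Sketch of averaging over perturbed runs to handle parallel-workload
# variability (illustrative; run_simulation is a stand-in).
import random

def perturbed_runs(run_simulation, base_latency, n_runs, jitter=2, seed=0):
    """Average a metric over n_runs runs, each with a slightly
    perturbed request latency so execution can take different paths."""
    rng = random.Random(seed)                 # reproducible perturbations
    results = []
    for _ in range(n_runs):
        latency = base_latency + rng.randint(0, jitter)
        results.append(run_simulation(latency))
    return sum(results) / n_runs              # report the mean across runs
```

The perturbation deliberately changes event timing, so OS scheduling and lock-acquisition order can differ between runs; averaging then reports a representative figure rather than one arbitrary interleaving.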

4.3 Evaluation results

Figure 7 shows the normalized runtime of the LU benchmark with the simple GEMS network as well as the GARNET network. Both network results indicate that a shared-cache CMP configuration would perform better for the LU benchmark and the particular configuration. However, the magnitude of speedup obtained from the different networks is different. The evaluations with the GARNET network report a 47.3% reduction in overall system runtime as opposed to the 60% reported with the simple network configuration. Contention modeling in the GARNET network increased the average memory latency of the shared-cache configuration by 18%, while increasing the average memory latency of the private-cache configuration by only 9%. This contention latency could not be captured by the simple GEMS network. Thus, by having a realistic network model like GARNET, more accurate full-system performance numbers can be obtained.

图7展示了LU基准测试用简单GEMS网络以及GARNET网络的归一化的运行结果。两个网络的结果都说明,共享缓存的CMP配置对LU基准测试以及这个特定的配置效果更好。但是,不同的网络得到的加速幅度是不一样的。GARNET网络的评估结果得到了47.3%的总体运行时间降低,简单网络的配置有60%的时间降低。GARNET网络的竞争建模,增加了18%的共享缓存配置平均内存延迟,而增加了9%的私有缓存配置的平均内存延迟。竞争延迟在简单GEMS网络下是捕获不到的。因此,有了一个真实的网络模型,比如GARNET,可以得到更精确的full-system性能数据。

Figure 8 shows the normalized runtime of the RADIX benchmark with the simple GEMS network as well as the GARNET network. The simple network results indicate that a shared-cache CMP configuration performs better for RADIX on the specific configurations. However, the GARNET results show that a private-cache CMP architecture far outperforms the shared-cache CMP architecture. Contention modeling in GARNET leads to an increase of 8% in average memory latency for the shared-cache configuration while increasing the average memory latency by only 0.02% for the private-cache configuration. Although there was just an 8% increase in average memory latency for the shared-cache configuration, the full-system runtime increased by more than 3×. The contribution of user instructions to this increase was only 20%, while the OS instructions increased by 3×. As discussed earlier, timing variations in the network can cause parallel programs to take different execution paths, causing differences in the number of instructions and memory accesses. This happens due to different OS scheduling decisions, the different order in which threads attain spin locks, etc. This reiterates the importance of an accurate network model, which captures realistic timing characteristics, inside a full-system framework.

图8展示了RADIX基准测试在简单GEMS网络以及GARNET网络的归一化运行时间。简单网络的结果说明,对于RADIX在这个特定的配置上,共享缓存CMP的配置更好一些。但是,GARNET的结果表明,私有缓存的CMP架构远比共享缓存的架构要好。GARNET中的竞争建模,为共享缓存配置,带来了平均内存延迟8%的提升,而对私有缓存配置只增加了0.02%的平均内存延迟。虽然对于共享缓存配置,在平均内存延迟上只有8%的增加,full-system运行时间增加了超过3倍。用户指令对这个增加的贡献只有20%,而OS指令的增加为3x。之前讨论过,网络中的时序变化可以导致并行程序采取不同的执行路径,导致指令和内存访问的数量的差异。这是因为不同的OS调度决策,线程获得spin lock等的顺序不同等。这重申了精确网络模型的重要性,因为可以在full-system框架中可以捕获真实的时序特征。

5 Express Virtual Channels

As explained in Section 2.1, a packet in a state-of-the-art network needs to traverse multiple stages of the router pipeline at each hop along its route. Hence, energy/delay in such networks is dominated largely by contention at intermediate routers, resulting in a high router-to-link energy/delay ratio. EVCs [19] were introduced as a flow-control and router microarchitecture design to reduce the router energy/delay overhead by providing virtual express lanes in the network which can be used by packets to bypass intermediate routers along a dimension. The flits traveling on these lanes circumvent buffering (BW) and arbitration (VA and SA) by getting forwarded as soon as they are received at a router, reducing the router pipeline to just two stages (Figure 9). This reduces both the latency (due to bypassing of the router pipeline) and dynamic power (as buffer reads/writes and VC/switch arbitrations are skipped). This technique also leads to a reduction in the total number of buffers required in the network to support the same bandwidth, which in turn reduces the leakage power.

在2.1节中解释过,在目前最好的网络中,一个packet在其路由的每一跳中,需要经过多级路由器流水线。因此,在这种网络中能耗/延迟,主要是由在中间级路由器的竞争决定的,得到很高的router-to-link能耗/延迟比。EVCs作为一种流控制和路由器微架构被引入,以降低路由器的能耗/延迟消耗,在网络中提供虚拟快速通道,可以被packets所用,以绕过中间路由器。在这些路上行进的flits避免了buffering(BW)和仲裁(VA and SA),一旦在一个路由器端被收到,就被转发,降低了路由器的流水线到2级(图9)。这降低了延迟(因为绕过了路由器流水线),和动态功耗(因为buffer read/write和VC/switch仲裁被跳过了)。这种技术还减少了网络中需要的buffer的总数量,可以支持相同的带宽,这又降低了泄露能耗。
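
A back-of-the-envelope sketch of the latency saving: the 5-stage and 2-stage pipeline depths follow the text, while the 1-cycle link is an assumption carried over from the configurations described elsewhere in this paper.

```python
# Zero-load path latency with some routers bypassed via EVCs
# (illustrative arithmetic, not simulator code).

def path_latency(hops, bypassed, full_pipe=5, bypass_pipe=2, link=1):
    """Latency over `hops` hops when `bypassed` of the intermediate
    routers are traversed on an express lane (2-stage pipeline)."""
    normal = hops - bypassed
    return normal * (full_pipe + link) + bypassed * (bypass_pipe + link)
```

For a 6-hop path, bypassing four intermediate routers cuts zero-load latency from 36 to 24 cycles in this sketch, and each bypass also skips the buffer reads/writes and arbitrations that cost dynamic power.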

The express virtual lanes are created by statically designating the VCs at all router ports as either normal virtual channels (NVCs), which carry flits one hop; or k-hop EVCs (k can take all values between 2 and some l_max), which can carry flits k hops at a time, bypassing the k−1 routers in between, in that dimension (XY routing is assumed). In dynamic EVCs, discussed in the original paper, each router can act as either a source/sink node, which allows flits to get buffered and move through the normal router pipeline, or as a bypass node which gives priority to flits on EVCs to pass straight through the switch, without having to be buffered. In the source/sink routers, the head flits arbitrate for k-hop EVCs or NVCs, depending on their route (i.e., the number of hops remaining in that dimension) and the availability of each type of VCs. The body and tail flits follow on the same VC and release it when the tail flit leaves. Once a k-hop EVC is obtained, the intermediate k−1 nodes can be bypassed since the EVC flits send lookaheads, one cycle in advance, to set up the switches at the bypass nodes (to ensure uncontended access to the output ports). All packets try to use a combination of EVCs to best match their route, such that they are able to bypass most nodes in a dimension. These ideas are illustrated in Figure 10.

快速虚拟通道的创建,是在所有路由器端口上静态地将VCs指定为:正常虚拟通道(NVCs),携带flits一跳;或k-hop EVCs(k的值可以在2和某个l_max之间),可以一次将flits携带k跳,绕过该维度中之间的k-1个路由器(假设XY路由)。在原论文中讨论的动态EVCs中,每个路由器都可以作为源/漏节点,这使flits可以进行buffer,以正常路由器的流水线进行移动,或作为一个绕过节点,使EVCs上的flits有优先级直接通过switch,不需要进行buffer。在源/漏路由器中,head flits对k-hop EVCs或NVCs进行仲裁,依赖于其路由(即,在这个维度中剩余的hops数)和每种类型VCs的可用性。Body和tail flits跟随相同的VC,当tail flit离开时,释放之。一旦得到一个k-hop EVC,中间的k-1个节点可以绕过,因为EVC flits提前一个周期发送lookaheads,以在绕过节点设置switch(以确保对输出端口是无竞争的访问)。所有的packets都试图使用EVCs的组合,以最好地匹配其路由,这样它们可以绕过维度中的多数节点。这些思想如图10所示。
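
The EVC/NVC choice at a source/sink node can be sketched as a greedy rule: take the longest free k-hop EVC that fits the remaining hops in the dimension, else fall back to a normal VC. This is a hypothetical illustration of the selection logic, not the EVC paper's implementation.

```python
# Sketch of EVC selection at a source/sink router (illustrative).

def pick_vc(remaining_hops, free_evc_lengths, l_max):
    """free_evc_lengths: set of k values (2..l_max) with a free k-hop EVC.
    Returns the chosen k; k == 1 means a normal VC (NVC)."""
    for k in range(min(remaining_hops, l_max), 1, -1):
        if k in free_evc_lengths:
            return k          # bypass k-1 intermediate routers
    return 1                  # no suitable EVC free: use an NVC
```

Repeating this choice at each sink node is how a packet assembles the "combination of EVCs" the text describes to cover most of its route on express lanes.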

5.1 Evaluation of EVCs in GARNET

We evaluated EVCs on GEMS with GARNET as the interconnect model, and EVCs coded into it, to study the system-level impact of this network optimization.

我们在GEMS上评估EVCs,以GARNET作为互联模型,并将EVCs编码到其中,以研究这种网络优化的系统级影响。

Target system: We simulated a 64-core CMP with shared L2 distributed in a tiled fashion [16]. Each core consists of a two-issue in-order SPARC processor with 64 KB L1 I&D caches. Each tile also includes a 1 MB L2 bank. A DRAM is attached to the CMP via eight memory controllers placed along the edges. The DRAM access latency was modeled as 275 cycles. The on-chip network was chosen to be an 8×8 mesh consisting of 16-byte links with deterministic XY dimension-ordered routing. Each message class was assumed to contain eight virtual channels with finite buffering.

目标系统。我们仿真了一个64核CMP,共享L2,以tiled方式分布。每个core由一个双发射顺序SPARC处理器组成,64KB L1 I&D 缓存。每个tile还包括了1 MB L2 bank。一个DRAM通过8个内存控制器接入到CMP中,分布在边缘。DRAM访问延迟建模为275个周期。片上网络选择为8x8网格,由16-byte links组成,带有确定性的XY维度顺序的路由。每个消息类别假设包含8个虚拟通道,带有有限的buffering。

Protocols: We evaluated the MOESI based directory protocol, which is part of the GEMS release, for the network with EVCs, and compared it to the baseline network modeled in GARNET.

协议。对于带有EVCs的网络,我们评估基于MOESI的directory协议,这是GEMS中的一部分,将其与GARNET中建模的基准网络进行比较。

Workloads: We ran SPLASH-2 applications [28] on the above-mentioned configuration. We ran the parallel portion of each workload to completion for each configuration. All benchmarks were warmed up and checkpointed to avoid cold-start effects. We ensured that caches were warm by restoring the cache contents captured as part of our checkpoint creation process.

workloads。我们以上述配置运行SPLASH-2应用。我们运行每个workload的并行部分,在每个配置下运行完成。所有基准测试都进行预热和保存checkpoint,以避免冷启动效果。我们通过从checkpoint中恢复缓存内容,确保缓存是预热过的。

Evaluation results: The system-level impact of a network optimization like EVCs can vary depending upon the system configuration. For instance, if there is a large number of L2 misses and most requests go off-chip (either because of a small on-chip cache or non-local access patterns), consuming hundreds of cycles, saving a few tens of cycles on-chip by router bypassing would not give a huge speedup for the whole system. However, if most requests are satisfied on-chip, which only consumes tens of cycles, then reducing the latency of transfer of requests and data on the chip would decrease the overall miss latency as well. Figure 11 shows the comparison of the normalized network latencies of each benchmark against the baseline. We observe an average network latency improvement of 21.5% and, interestingly, this number is almost uniform across all benchmarks. This is because the benchmarks that we ran did not stress the on-chip network and mostly ran at low loads; thus, EVCs provide similar performance improvements to all of them over the baseline. We also observe an almost uniform 22.2% reduction in average miss latencies across all benchmarks. This can be understood by studying the cache hit rates for the benchmarks in the baseline network for this particular chip configuration. We observe that, on average, 86% of the L1 misses are satisfied on-chip, and less than 5% of the total memory accesses go off-chip as L2 misses. This means that on-chip latency dominates the average miss latency value in our case. Hence, the network speedup by EVCs translates to a reduction in miss latencies of the same order. However, Figure 12 shows that the overall system speedup is not the same for all benchmarks. For instance, RADIX and FFT have a smaller overall speedup than the others. This can be attributed to the change in the number of instructions executed by the benchmarks. For RADIX and FFT, the total number of instructions executed by the program with an EVC network increases over the baseline case. This could be due to a different path taken by the execution because of changes in the timing of delivery of messages, as discussed before. This also changes the cache hit/miss dynamics, and we observe that RADIX has a 4% higher L2 miss rate than the baseline case, which in turn results in more off-chip accesses, thus reducing the impact of the reduction of on-chip latency by EVCs. This is an instance of a case where the performance gain from a network optimization might not translate to an overall system speedup of the same order.

评估结果。像EVCs这样的网络优化的系统级影响,视系统配置的不同而变化。比如,如果L2 misses的数量很大,多数请求都会off-chip(要么是因为小型片上缓存,或非局部访问模式),消耗了数百个周期,通过路由器旁路来节约片上的几十个周期,不会对整个系统带来很大的加速。但是,如果多数请求都在片上被满足了,这都只消耗几十个周期,那么降低请求迁移的延迟,和数据在片上的延迟,也会降低总体的miss延迟。图11展示了每个基准测试相对比基准的归一化网络延迟。我们观察到,评估的网络延迟改进有21.5%,有趣的是,这个数值在所有基准测试中几乎都一样。这是因为,我们运行的基准测试,并没有给片上网络太大压力,主要是以低loads运行。因此,EVCs在所有测试中对基准给出了相似的性能改进。我们还观察到,在所有基准测试中,平均miss延迟都降低了大约22.2%。通过研究基准测试在基准网络中对这种特定芯片配置的缓存hit率,就可以理解。我们观察到,平均来说,86%的L1 misses都在片上被满足,少于5%的总计内存访问到片外成为L2 misses。这意思是,在我们的情况中,片上延迟在平均miss延迟值中占主要部分。因此,EVCs的网络加速会以相同的order成为miss延迟的降低。但是,图12表明,总体系统加速对所有基准测试并不是一样的。比如,RADIX和FFT的总体加速会比其他更小。这可以归因为,基准测试执行的指令数量的变化。对于RADIX和FFT,程序以EVC网络执行的指令的总体数量,比基准的情况要多。这可能是因为执行采取了不同的路径,因为消息发送的时序的变化,之前讨论过。这还改变了缓存hit/miss的动态,我们观察到RADIX比基准情况的L2 miss率高了4%,这也会导致更多的片外访问,因此降低了EVC片上延迟降低的影响。网络优化的性能改进,并不一定会成为相同的总体的系统加速,这是一个例子。
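
A weighted-average sketch of why on-chip latency dominates in this configuration: the roughly 95% on-chip service fraction follows from the text (less than 5% of accesses go off-chip), while the 40-cycle on-chip and 315-cycle off-chip service times are assumptions chosen for illustration (DRAM at 275 cycles plus an assumed on-chip traversal).

```python
# Weighted average of L1-miss service time (illustrative numbers; the
# on-chip/off-chip cycle counts are assumptions, not measured values).

def avg_miss_latency(frac_onchip, onchip_cycles, offchip_cycles):
    return frac_onchip * onchip_cycles + (1 - frac_onchip) * offchip_cycles

avg = avg_miss_latency(0.95, 40, 315)     # roughly 53.75 cycles
onchip_share = (0.95 * 40) / avg          # fraction contributed on-chip
```

With these assumed numbers the on-chip term contributes the majority of the average miss latency, which is why a network-latency reduction carries through to miss latency; conversely, any shift toward off-chip accesses (as with RADIX's higher L2 miss rate) dilutes the benefit.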

Evaluating a network optimization like EVCs from a system's perspective can help make design decisions, such as using EVCs to send critical control messages (e.g., congestion alerts) faster, or to send coherence messages that may be on the critical path (e.g., invalidates), and to study the overall impact. GARNET provides a useful platform to study these effects in total, rather than in isolation.

从系统的角度评估一个网络优化如EVCs,可以帮助进行设计决策,比如使用EVCs来更快的发送关键的控制信息,如congestion alerts,或发送一致性信息,如invalidates,这可能在关键路径上,等,以及研究总体的影响。GARNET提供了一个有用的平台,来在总体上研究这些效果,而不是单独研究。

6 Related Work

Simulation is one of the most valuable techniques used by computer architects to evaluate design innovations. In the past, SimpleScalar [4] was widely used in the architecture research community to evaluate uniprocessor systems. However, it and other such simulators run only user-mode single-threaded workloads. With the rapid adoption of many-core chips, designers are now interested in simulation infrastructures that can run multithreaded workloads and model multiple processing cores along with the memory system. Multithreaded workloads depend upon many OS services (e.g., I/O, synchronization, thread scheduling, and migration). Full-system simulators that can run realistic parallel workloads are thus essential for evaluating many-core chips. While there exist full-system simulators, like RSIM [23] and SESC [27], that can simulate multiprocessor systems, the interconnection network (specifically the NoC) is not modeled accurately in these frameworks. In the many-core era, the on-chip network is an integral part of the memory system and not modeling its details leads to an approximate model. While PharmSim [5] models the entire system, including the on-chip network, to a certain degree of detail, it is not publicly available for use.

仿真是计算机体系架构师使用的最宝贵的技术之一,来评估设计的创新。在过去,SimpleScalar广泛用在架构研究上,来评估单处理器系统。但是,这和其他这样的仿真器,都是在用户模式的单线程workload上运行的。随着多核芯片的迅速采用,设计者现在对可以运行多线程的workloads并建模多处理器核的仿真基础设施感兴趣。多线程workloads依赖于很多OS服务(如,I/O,同步,线程调度,和migration)。可以运行真实的并行workloads的full-system仿真器,对于评估多核芯片是最关键的。存在一些full-system仿真器,如RSIM,SESC,可以仿真多处理器系统,在这些框架中,互联网络(具体的是NoC)并没有进行准确的建模。在多核时代,片上网络是存储系统的有机组成部分,不对其进行详细建模,只能带来近似的模型。PharmSim建模了整个系统,包含片上网络,细节到了一定程度,但并不是公开可用的。

There are various network-only simulators, like NOXIM [24] and SICOSYS [26], that the NoC community uses for experiments, but they cannot be used to perform full-system evaluations. Although SICOSYS has been plugged into RSIM for simulating symmetric multiprocessor systems, the platform is not used for studying CMP systems with an NoC. ASIM [11] is a full-system simulator, used in industrial laboratories, that models the entire processor and memory system. From what we could gather, it models only ring interconnects and thus cannot be used in design explorations that use interconnects other than a ring. There have been recent efforts [3] to build FPGA-based simulation infrastructures for many-core systems. These, however, require detailed register-transfer-level implementation, which is time-consuming. To the best of our knowledge, there exists no publicly available simulation infrastructure that models the entire processing, memory and interconnect system at the same level of detail as the proposed GEMS+GARNET toolset.

有很多只有网络的仿真器,如NOXIM和SICOSYS,NoC团体用其进行试验,但是它们并不能进行full-system评估。虽然SICOSYS插入到了RSIM中,以仿真对称多处理器系统,但这个平台并没有用于研究带有NoC的CMP系统。ASIM是一个full-system仿真器,在工业实验室中使用,建模了整个处理器和存储系统。从我们能收集到的信息看,它只建模了环形互联,因此不能用于使用其他互联类型的设计探索。最近有一些工作,研究基于FPGA的仿真基础设施,用于多核系统。但是,这需要详细的RTL实现,耗时很长。据我们所知,除了提出的GEMS+GARNET工具集,并没有公开可用的仿真基础设施,能以这样的详细程度建模整个处理、存储和互联系统。

7 Conclusion

With on-chip networks becoming a critical component of present and future CMP designs, understanding the system-level implications of network techniques becomes very important. It also becomes necessary to evaluate CMP design proposals with a realistic interconnect model. In this work, we presented GARNET, a detailed network model, which is incorporated inside a full-system simulator (GEMS). We analyzed the impact of a realistic interconnection network model on system-level design space explorations. We also evaluated EVCs, a network flow control technique, in a full-system fashion and observed overall system performance improvements that cannot be predicted by network-only simulations. Our results indicate that the close-knit interactions between the memory system and the network can no longer be ignored. We thus believe that system designs should also model the on-chip communication fabric, and on-chip network designs should be evaluated in a full-system manner.

随着片上网络成为现在和未来CMP设计的关键的组成部分,理解网络技术的系统级影响,这变得非常重要。评估CMP设计建议,带有实际的互联模型,这也变得非常重要。本文中,我们给出了GARNET,一个详细的网络模型,与full-system仿真器GEMS结合到了一起。我们分析了一个实际的互联网络模型对系统级设计空间探索的影响。我们还以full-system的模式评估了EVCs,一种网络流控制技术,观察了整体系统性能改进,而只有网络的仿真是无法进行预测的。我们的结果表明,存储系统和网络系统的紧密互相影响,不能被忽略了。我们因此相信,系统设计也应当建模片上通信的结构,片上网络设计应当以full-system的方式进行评估。