
Analyzing CUDA Workloads Using a Detailed GPU Simulator

Ali Bakhoda et al. @ University of British Columbia, Vancouver, BC, Canada

0. Abstract

Modern Graphic Processing Units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight in designing tomorrow’s manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data level parallelism, do not achieve peak performance. This paper characterizes several non-graphics applications written in NVIDIA’s CUDA programming model by running them on a novel detailed microarchitecture performance simulator that runs NVIDIA’s parallel thread execution (PTX) virtual instruction set. For this study, we selected twelve non-trivial CUDA applications demonstrating varying levels of performance improvement on GPU hardware (versus a CPU-only sequential version of the application). We study the performance of these applications on our GPU performance simulator with configurations comparable to contemporary high-end graphics cards. We characterize the performance impact of several microarchitecture design choices including choice of interconnect topology, use of caches, design of memory controller, parallel workload distribution mechanisms, and memory request coalescing hardware. Two observations we make are (1) that for the applications we study, performance is more sensitive to interconnect bisection bandwidth rather than latency, and (2) that, for some applications, running fewer threads concurrently than on-chip resources might otherwise allow can improve performance by reducing contention in the memory system.


1. Introduction

While single-thread performance of commercial superscalar microprocessors is still increasing, a clear trend today is for computer manufacturers to provide multithreaded hardware that strongly encourages software developers to provide explicit parallelism when possible. One important class of parallel computer hardware is the modern graphics processing unit (GPU) [22,25]. With contemporary GPUs recently crossing the teraflop barrier [2,34] and specific efforts to make GPUs easier to program for non-graphics applications [1, 29, 33], there is widespread interest in using GPU hardware to accelerate non-graphics applications.


Since its introduction by NVIDIA Corporation in February 2007, the CUDA programming model [29,33] has been used to develop many applications for GPUs. CUDA provides an easy to learn extension of the ANSI C language. The programmer specifies parallel threads, each of which runs scalar code. While short vector data types are available, their use by the programmer is not required to achieve peak performance, thus making CUDA a more attractive programming model to those less familiar with traditional data parallel architectures. This execution model has been dubbed a single instruction, multiple thread (SIMT) model [22] to distinguish it from the more traditional single instruction, multiple data (SIMD) model. As of February 2009, NVIDIA has listed 209 third-party applications on their CUDA Zone website [30]. Of the 136 applications listed with performance claims, 52 are reported to obtain a speedup of 50× or more, and of these 29 are reported to obtain a speedup of 100× or more. As these applications already achieve tremendous benefits, this paper instead focuses on evaluating CUDA applications with reported speedups below 50× since this group of applications appears most in need of software tuning or changes to hardware design.


This paper makes the following contributions:

  • It presents data characterizing the performance of twelve existing CUDA applications collected on a research GPU simulator (GPGPU-Sim).
  • It shows that the non-graphics applications we study tend to be more sensitive to bisection bandwidth versus latency.
  • It shows that, for certain applications, decreasing the number of threads running concurrently on the hardware can improve performance by reducing contention for on-chip resources.
  • It provides an analysis of application characteristics including the dynamic instruction mix, SIMD warp branch divergence properties, and DRAM locality characteristics.

We believe the observations made in this paper will provide useful guidance for directing future architecture and software research.

The rest of this paper is organized as follows. In Section 2 we describe our baseline architecture and the microarchitecture design choices that we explore before describing our simulation infrastructure and the benchmarks used in this study. Our experimental methodology is described in Section 3 and Section 4 presents and analyzes results. Section 5 reviews related work and Section 6 concludes the paper.


2. Design and Implementation

In this section we describe the GPU architecture we simulated, provide an overview of our simulator infrastructure and then describe the benchmarks we selected for our study.


2.1. Baseline Architecture

Figure 1(a) shows an overview of the system we simulated. The applications evaluated in this paper were written using CUDA [29,33]. In the CUDA programming model, the GPU is treated as a co-processor onto which an application running on a CPU can launch a massively parallel compute kernel. The kernel is comprised of a grid of scalar threads. Each thread is given a unique identifier which can be used to help divide up work among the threads. Within a grid, threads are grouped into blocks, which are also referred to as cooperative thread arrays (CTAs) [22]. Within a single CTA, threads have access to a common fast memory called the shared memory and can, if desired, perform barrier synchronizations.

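As a concrete illustration of the model described above (a minimal sketch written for this summary, not code from the paper), the kernel below shows a grid of scalar threads, a per-CTA shared memory buffer, and a barrier synchronization:

```cuda
// Minimal CUDA sketch (illustration only): each scalar thread derives a unique
// identifier from its block and thread indices, the CTA stages data in shared
// memory, and __syncthreads() acts as the per-CTA barrier.
__global__ void scale_kernel(const float *in, float *out, float k, int n)
{
    __shared__ float tile[256];                        // per-CTA shared memory
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // unique thread identifier
    if (tid < n)
        tile[threadIdx.x] = in[tid];                   // assumes 256 threads per CTA
    __syncthreads();                                   // barrier within the CTA
    if (tid < n)
        out[tid] = k * tile[threadIdx.x];
}

// Host-side launch of a grid of (n + 255) / 256 CTAs with 256 threads each:
//   scale_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, 2.0f, n);
```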

Figure 1(a) also shows our baseline GPU architecture. The GPU consists of a collection of small data-parallel compute cores, labeled shader cores in Figure 1, connected by an interconnection network to multiple memory modules (each labeled memory controller). Each shader core is a unit similar in scope to a streaming multiprocessor (SM) in NVIDIA terminology [33]. Threads are distributed to shader cores at the granularity of entire CTAs, while per-CTA resources, such as registers, shared memory space, and thread slots, are not freed until all threads within a CTA have completed execution. If resources permit, multiple CTAs can be assigned to a single shader core, thus sharing a common pipeline for their execution. Our simulator omits graphics specific hardware not exposed to CUDA.


Figure 1(b) shows the detailed implementation of a single shader core. In this paper, each shader core has a SIMD width of 8 and uses a 24-stage, in-order pipeline without forwarding. The 24-stage pipeline is motivated by details in the CUDA Programming Guide [33], which indicates that at least 192 active threads are needed to avoid stalling for true data dependencies between consecutive instructions from a single thread (in the absence of long latency memory operations). We model this pipeline with six logical pipeline stages (fetch, decode, execute, memory1, memory2, writeback) with superpipelining of degree 4 (memory1 is an empty stage in our model). Threads are scheduled to the SIMD pipeline in a fixed group of 32 threads called a warp [22]. All 32 threads in a given warp execute the same instruction with different data values over four consecutive clock cycles in all pipelines (the SIMD cores are effectively 8-wide). We use the immediate post-dominator reconvergence mechanism described in [11] to handle branch divergence where some scalar threads within a warp evaluate a branch as “taken” and others evaluate it as “not taken”.

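The kind of intra-warp divergence handled by that mechanism can be seen in a trivial kernel like the one below (an illustration written for this summary, not from the paper): even and odd lanes of the same 32-thread warp take different paths, and the warp reconverges at the immediate post-dominator of the branch.

```cuda
__global__ void divergent_kernel(int *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0)
        data[tid] += 1;    // even lanes execute this path while odd lanes are masked off
    else
        data[tid] -= 1;    // then odd lanes execute this path
    data[tid] *= 2;        // immediate post-dominator: the full warp is active again here
}
```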

Threads running on the GPU in the CUDA programming model have access to several memory regions (global, local, constant, texture, and shared [33]) and our simulator models accesses to each of these memory spaces. In particular, each shader core has access to a 16KB low latency, highly-banked per-core shared memory; to global texture memory with a per-core texture cache; and to global constant memory with a per-core constant cache. Local and global memory accesses always require off chip memory accesses in our baseline configuration. For the per-core texture cache, we implement a 4D blocking address scheme as described in [14], which essentially permutes the bits in requested addresses to promote spatial locality in a 2D space rather than in linear space. For the constant cache, we allow single cycle access as long as all threads in a warp are requesting the same data. Otherwise, a port conflict occurs, forcing data to be sent out over multiple cycles and resulting in pipeline stalls [33]. Multiple memory accesses from threads within a single warp to a localized region are coalesced into fewer wide memory accesses to improve DRAM efficiency. To alleviate the DRAM bandwidth bottleneck that many applications face, a common technique used by CUDA programmers is to load frequently accessed data into the fast on-chip shared memory [40].

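The sketch below (written for this summary using the CUDA 2.x-era texture reference API; it is not code from the paper) shows how the memory spaces listed above appear in source code:

```cuda
__constant__ float coeff[16];                      // constant memory, cached per core
texture<float, 1, cudaReadModeElementType> lut;    // texture memory, cached per core

__global__ void spaces_kernel(float *g_out, int n) // g_out points into global memory
{
    __shared__ float tile[128];                    // slice of the 16KB per-core shared memory
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // assumes 128 threads per CTA
    if (tid < n) {
        tile[threadIdx.x] = tex1Dfetch(lut, tid);  // read-only fetch through the texture cache
        // All threads of a warp read the same constant word, so the constant
        // cache can service the access in a single cycle (no port conflict).
        g_out[tid] = tile[threadIdx.x] * coeff[blockIdx.x % 16];
    }
}
```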

Thread scheduling inside a shader core is performed with zero overhead on a fine-grained basis. Every 4 cycles, warps ready for execution are selected by the warp scheduler and issued to the SIMD pipelines in a loose round robin fashion that skips non-ready warps, such as those waiting on global memory accesses. In other words, whenever any thread inside a warp faces a long latency operation, all the threads in the warp are taken out of the scheduling pool until the long latency operation is over. Meanwhile, other warps that are not waiting are sent to the pipeline for execution in a round robin order. The many threads running on each shader core thus allow a shader core to tolerate long latency operations without reducing throughput.


In order to access global memory, memory requests must be sent via an interconnection network to the corresponding memory controllers, which are physically distributed over the chip. To avoid protocol deadlock, we model physically separate send and receive interconnection networks. Using separate logical networks to break protocol deadlock is another alternative, but one we did not explore. Each on-chip memory controller then interfaces to two off-chip GDDR3 DRAM chips. Figure 2 shows the physical layout of the memory controllers in our 6x6 mesh configuration as shaded areas. The address decoding scheme is designed in a way such that successive 2KB DRAM pages [19] are distributed across different banks and different chips to maximize row locality while spreading the load among the memory controllers.

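One possible address decoding of the kind described above is sketched below; the exact bit fields used by the simulator are not given in this section, so the layout here is an assumption for illustration only:

```cuda
// Host/simulator-side sketch: consecutive 2KB DRAM pages map first to different
// banks, then to different chips/controllers, so streaming accesses spread the
// load across the memory system while preserving row (page) locality.
struct DramCoord { unsigned col, bank, chip, row; };

DramCoord decode_address(unsigned long long addr,
                         unsigned num_banks, unsigned num_chips)
{
    DramCoord c;
    unsigned long long page = addr / 2048;                // index of the 2KB DRAM page
    c.col  = (unsigned)(addr % 2048);                     // offset within the open row
    c.bank = (unsigned)(page % num_banks);                // successive pages -> different banks
    c.chip = (unsigned)((page / num_banks) % num_chips);  // then different chips/controllers
    c.row  = (unsigned)(page / (num_banks * num_chips));  // remaining bits select the row
    return c;
}
```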

2.2. GPU Architectural Exploration

This section describes some of the GPU architectural design options explored in this paper. Evaluations of these design options are presented in Section 4.


2.2.1. Interconnect. The on-chip interconnection network can be designed in various ways based on its cost and performance. Cost is determined by complexity and number of routers as well as density and length of wires. Performance depends on latency, bandwidth and path diversity of the network [9]. (Path diversity indicates the number of routes a message can take from the source to the destination.)


Butterfly networks offer minimal hop count for a given router radix while having no path diversity and requiring very long wires. A crossbar interconnect can be seen as a 1-stage butterfly and scales quadratically in area as the number of ports increases. A 2D torus interconnect can be implemented on chip with nearly uniformly short wires and offers good path diversity, which can lead to a more load balanced network. Ring and mesh interconnects are both special types of torus interconnects. The main drawback of a mesh network is its relatively higher latency due to a larger hop count. As we will show in Section 4, our benchmarks are not particularly sensitive to latency, so we chose a mesh network as our baseline while exploring the other choices for interconnect topology.


2.2.2. CTA distribution. GPUs can use the abundance of parallelism in data-parallel applications to tolerate memory access latency by interleaving the execution of warps. These warps may either be from the same CTA or from different CTAs running on the same shader core. One advantage of running multiple smaller CTAs on a shader core rather than using a single larger CTA relates to the use of barrier synchronization points within a CTA [40]. Threads from one CTA can make progress while threads from another CTA are waiting at a barrier. For a given number of threads per CTA, allowing more CTAs to run on a shader core provides additional memory latency tolerance, though it may imply increasing register and shared memory resource use. However, even if sufficient on-chip resources exist to allow more CTAs per core, if a compute kernel is memory-intensive, completely filling up all CTA slots may reduce performance by increasing contention in the interconnection network and DRAM controllers. We issue CTAs in a breadth-first manner across shader cores, selecting a shader core that has a minimum number of CTAs running on it, so as to spread the workload as evenly as possible among all cores.

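A simulator-side sketch of that breadth-first heuristic is shown below (an illustration written for this summary, not GPGPU-Sim source): the next CTA is issued to the shader core currently running the fewest CTAs, subject to its per-core resource limit.

```cuda
// Returns the index of the shader core that should receive the next CTA,
// or -1 if every core is already at its CTA limit.
int select_core(const int *ctas_running, const int *cta_limit, int num_cores)
{
    int best = -1;
    for (int c = 0; c < num_cores; ++c) {
        if (ctas_running[c] >= cta_limit[c])
            continue;                                        // core is full
        if (best < 0 || ctas_running[c] < ctas_running[best])
            best = c;                                        // fewest CTAs so far
    }
    return best;
}
```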

2.2.3. Memory Access Coalescing. The minimum granularity access for GDDR3 memory is 16 bytes and typically scalar threads in CUDA applications access 4 bytes per scalar thread [19]. To improve memory system efficiency, it thus makes sense to group accesses from multiple, concurrently-issued, scalar threads into a single access to a small, contiguous memory region. The CUDA programming guide indicates that parallel memory accesses from every half-warp of 16 threads can be coalesced into fewer wide memory accesses if they all access a contiguous memory region [33]. Our baseline models similar intra-warp memory coalescing behavior (we attempt to coalesce memory accesses from all 32 threads in a warp).

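The difference between access patterns that do and do not coalesce can be seen in the two kernels below (an illustration written for this summary, not from the paper):

```cuda
// The 32 threads of a warp read 32 consecutive 4-byte words, so their requests
// coalesce into a few wide accesses to one contiguous memory region.
__global__ void coalesced_copy(float *out, const float *in)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid];
}

// With a large stride, threads in the same warp touch widely separated
// addresses, so the accesses cannot be merged and each thread generates its
// own narrow memory request.
__global__ void strided_copy(float *out, const float *in, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid * stride];
}
```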

A related issue is that since the GPU is heavily multi-threaded, a balanced design must support many outstanding memory requests at once. While microprocessors typically employ miss-status holding registers (MSHRs) [21] that use associative comparison logic to merge simultaneous requests for the same cache block, the number of outstanding misses that can be supported is typically small (e.g., the original Intel Pentium 4 used four MSHRs [16]). One way to support a far greater number of outstanding memory requests is to use a FIFO for outstanding memory requests [17]. Similarly, our baseline does not attempt to eliminate multiple requests for the same block of memory on cache misses or local/global memory accesses. However, we also explore the possibility of improving performance by coalescing read memory requests from later warps that require access to data for which a memory request is already in progress due to another warp running on the same shader core. We call this inter-warp memory coalescing. We observe that inter-warp memory coalescing can significantly reduce memory traffic for applications that contain data-dependent accesses to memory. The data for inter-warp merging quantifies the benefit of supporting large capacity MSHRs that can detect a secondary access to an outstanding request [45].

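A minimal sketch of such inter-warp read merging is shown below (an assumption about the mechanism written for this summary, not GPGPU-Sim source): before sending a read to memory, the controller checks a table of outstanding requests for the same memory block and, on a match, simply attaches the new warp to the request already in flight.

```cuda
#include <map>
#include <vector>

struct PendingReads {
    // block address -> ids of warps waiting on that block
    std::map<unsigned long long, std::vector<int>> waiters;

    // Returns true if a new request must actually be sent to memory; returns
    // false if the read was merged with an outstanding request for the block.
    bool issue_read(unsigned long long addr, int warp_id, unsigned block_size)
    {
        unsigned long long block = addr / block_size;
        bool is_new = (waiters.find(block) == waiters.end());
        waiters[block].push_back(warp_id);
        return is_new;
    }
};
```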

2.2.4. Caching. While coalescing memory requests captures spatial locality among threads, memory bandwidth requirements may be further reduced with caching if an application contains temporal locality or spatial locality within the access pattern of individual threads. We evaluate the performance impact of adding first level, per-core L1 caches for local and global memory access to the design described in Section 2.1. We also evaluate the effects of adding a shared L2 cache on the memory side of the interconnection network at the memory controller. While threads can only read from texture and constant memory, they can both read and write to local and global memory. In our evaluation of caches for local and global memory we model non-coherent caches. (Note that threads from different CTAs in the applications we study do not communicate through global memory.)


2.3. Extending GPGPU-Sim to Support CUDA

We extended GPGPU-Sim, the cycle-accurate simulator we developed for our earlier work [11]. GPGPU-Sim models various aspects of a massively parallel architecture with highly programmable pipelines similar to contemporary GPU architectures. A drawback of the previous version of GPGPU-Sim was the difficult and time-consuming process of converting/parallelizing existing applications [11]. We overcome this difficulty by extending GPGPU-Sim to support the CUDA Parallel Thread Execution (PTX) [35] instruction set. This enables us to simulate the numerous existing, optimized CUDA applications on GPGPU-Sim. Our current simulator infrastructure runs CUDA applications without source code modifications on Linux based platforms, but does require access to the application’s source code. To build a CUDA application for our simulator, we replace the common.mk makefile used in the CUDA SDK with a version that builds the application to run on our microarchitecture simulator (while other more complex build scenarios may require more complex makefile changes).


Figure 3 shows how a CUDA application can be compiled for simulation on GPGPU-Sim and compares this compilation flow to the normal CUDA compilation flow [33]. Both compilation flows use cudafe to transform the source code of a CUDA application into host C code running on the CPU and device C code running on the GPU. The GPU C code is then compiled into PTX assembly (labeled “.ptx” in Figure 3) by nvopencc, an open source compiler provided by NVIDIA based on Open64 [28,36]. The PTX assembler (ptxas) then assembles the PTX assembly code into the target GPU’s native ISA (labeled “cubin.bin” in Figure 3(a)). The assembled code is then combined with the host C code and compiled into a single executable linked with the CUDA Runtime API library (labeled “libcuda.a” in Figure 3) by a standard C compiler. In the normal CUDA compilation flow (used with NVIDIA GPU hardware), the resulting executable calls the CUDA Runtime API to set up and invoke compute kernels onto the GPU via the NVIDIA CUDA driver.


When a CUDA application is compiled to use GPGPU-Sim, many steps remain the same. However, rather than linking against the NVIDIA supplied libcuda.a binary, we link against our own libcuda.a binary. Our libcuda.a implements “stub” functions for the interface defined by the header files supplied with CUDA. These stub functions set up and invoke simulation sessions of the compute kernels on GPGPU-Sim (as shown in Figure 3(b)). Before the first simulation session, GPGPU-Sim parses the text format PTX assembly code generated by nvopencc to obtain code for the compute kernels. Because the PTX assembly code has no restriction on register usage (to improve portability between different GPU architectures), nvopencc performs register allocation using far more registers than typically required to avoid spilling. To improve the realism of our performance model, we determine the register usage per thread and shared memory used per CTA using ptxas. We then use this information to limit the number of CTAs that can run concurrently on a shader core. The GPU binary (cubin.bin) produced by ptxas is not used by GPGPU-Sim. After parsing the PTX assembly code, but before beginning simulation, GPGPU-Sim performs an immediate post-dominator analysis on each kernel to annotate branch instructions with reconvergence points for the stack-based SIMD control flow handling mechanism described by Fung et al. [11]. During a simulation, a PTX functional simulator executes instructions from multiple threads according to their scheduling order as specified by the performance simulator. When the simulation completes, the host CPU code is then allowed to resume execution. In our current implementation, host code runs on a normal CPU, thus our performance measurements are for the GPU code only.

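A hypothetical stub of this kind is sketched below; the simulator-side names (gpgpu_ptx_sim_launch_kernel and the stand-in types) are illustrative assumptions, not the actual GPGPU-Sim interface, and only cudaLaunch itself matches the CUDA 2.x runtime API.

```cuda
typedef int cudaError_t;                    // stand-in for the real runtime type
static const cudaError_t cudaSuccess = 0;

// Assumed simulator hook that starts a simulation session for one kernel.
void gpgpu_ptx_sim_launch_kernel(const char *kernel_entry);

// Stub with the CUDA 2.x runtime signature: instead of handing the kernel to
// the NVIDIA driver, it hands the launch (whose grid/CTA dimensions and
// arguments were recorded by earlier stub calls such as cudaConfigureCall and
// cudaSetupArgument) to the performance simulator.
extern "C" cudaError_t cudaLaunch(const char *kernel_entry)
{
    gpgpu_ptx_sim_launch_kernel(kernel_entry);
    return cudaSuccess;
}
```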

2.4. Benchmarks

Our benchmarks are listed in Table 1 along with the main application properties, such as the organization of threads into CTAs and grids as well as the different memory spaces on the GPU exploited by each application. Multiple entries separated by semi-colons in the grid and CTA dimensions indicate the application runs multiple kernels.


For comparison purposes we also simulated the following benchmarks from NVIDIA’s CUDA software development kit (SDK) [32]: Black-Scholes Option Pricing, Fast Walsh Transform, Binomial Option Pricing, Separable Convolution, 64-bin Histogram, Matrix Multiply, Parallel Reduction, Scalar Product, Scan of Large Arrays, and Matrix Transpose. Due to space limitations, and since most of these benchmarks already perform well on GPUs, we only report details for Black-Scholes (BLK), a financial options pricing application, and Fast Walsh Transform (FWT), widely used in signal and image processing and compression. We also report the harmonic mean of all SDK applications simulated, denoted as SDK in the data bar charts in Section 4.


Below, we describe the CUDA applications not in the SDK that we use as benchmarks in our study. These applications were developed by the researchers cited below and run unmodified on our simulator.


AES Encryption (AES) [24] This application, developed by Manavski [24], implements the Advanced Encryption Standard (AES) algorithm in CUDA to encrypt and decrypt files. The application has been optimized by the developer so that constants are stored in constant memory, the expanded key stored in texture memory, and the input data processed in shared memory. We encrypt a 256KB picture using 128-bit encryption.

Graph Algorithm: Breadth-First Search (BFS) [15] Developed by Harish and Narayanan [15], this application performs breadth-first search on a graph. As each node in the graph is mapped to a different thread, the amount of parallelism in this application scales with the size of the input graph. BFS suffers from performance loss due to heavy global memory traffic and branch divergence. We perform breadth-first search on a random graph with 65,536 nodes and an average of 6 edges per node.

Coulombic Potential (CP) [18,41] CP is part of the Parboil Benchmark suite developed by the IMPACT research group at UIUC [18,41]. CP is useful in the field of molecular dynamics. Loops are manually unrolled to reduce loop overheads and the point charge data is stored in constant memory to take advantage of caching. CP has been heavily optimized (it has been shown to achieve a 647× speedup versus a CPU version [40]). We simulate 200 atoms on a grid size of 256×256.

gpuDG (DG) [46] gpuDG is a discontinuous Galerkin time-domain solver, used in the field of electromagnetics to calculate radar scattering from 3D objects and analyze wave guides, particle accelerators, and EM compatibility [46]. Data is loaded into shared memory from texture memory. The inner loop consists mainly of matrix-vector products. We use the 3D version with polynomial order of N=6 and reduce time steps to 2 to reduce simulation time.

3D Laplace Solver (LPS) [12] Laplace is a highly parallel finance application [12]. As well as using shared memory, care was taken by the application developer to ensure coalesced global memory accesses. We observe that this benchmark suffers some performance loss due to branch divergence. We run one iteration on a 100x100x100 grid.

LIBOR Monte Carlo (LIB) [13] LIBOR performs Monte Carlo simulations based on the London Interbank Offered Rate Market Model [13]. Each thread reads a large number of variables stored in constant memory. We find the working set for constant memory fits inside the 8KB constant cache per shader core that we model. However, we find memory bandwidth is still a bottleneck due to a large fraction of local memory accesses. We use the default inputs, simulating 4096 paths for 15 options.

MUMmerGPU (MUM) [42] MUMmerGPU is a parallel pairwise local sequence alignment program that matches query strings consisting of standard DNA nucleotides (A,C,T,G) to a reference string for purposes such as genotyping, genome resequencing, and metagenomics [42]. The reference string is stored as a suffix tree in texture memory and has been arranged to exploit the texture cache’s optimization for 2D locality. Nevertheless, the sheer size of the tree means high cache miss rates, causing MUM to be memory bandwidth-bound. Since each thread performs its own query, the nature of the search algorithm makes performance also susceptible to branch divergence. We use the first 140,000 characters of the Bacillus anthracis str. Ames genome as the reference string and 50,000 25-character queries generated randomly using the complete genome as the seed.

Neural Network (NN) [5] Neural network uses a convolutional neural network to recognize handwritten digits [5]. Pre-determined neuron weights are loaded into global memory along with the input digits. We modified the original source code to allow recognition of multiple digits at once to increase parallelism. Nevertheless, the last two kernels utilize blocks of only a single thread each, which results in severe under-utilization of the shader core pipelines. We simulate recognition of 28 digits from the Modified National Institute of Standards and Technology (MNIST) database of handwritten digits.

N-Queens Solver (NQU) [37] The N-Queen solver tackles a classic puzzle of placing N queens on an NxN chess board such that no queen can capture another [37]. It uses a simple backtracking algorithm to try to determine all possible solutions. The search space implies that the execution time grows exponentially with N. Our analysis shows that most of the computation is performed by a single thread, which explains the low IPC. We simulate N=10.

Ray Tracing (RAY) [26] Ray-tracing is a method of rendering graphics with near photo-realism. In this implementation, each pixel rendered corresponds to a scalar thread in CUDA [26]. Up to 5 levels of reflections and shadows are taken into account, so thread behavior depends on what object the ray hits (if it hits any at all), making the kernel susceptible to branch divergence. We simulate rendering of a 256x256 image.

StoreGPU (STO) [4] StoreGPU is a library that accelerates hashing-based primitives designed for middleware [4]. We chose to use the sliding-window implementation of the MD5 algorithm on an input file of size 192KB. The developers minimize off-chip memory traffic by using the fast shared memory. We find STO performs relatively well.

Weather Prediction (WP) [27] Numerical weather prediction uses the GPU to accelerate a portion of the Weather Research and Forecast model (WRF), which can model and predict condensation, fallout of various precipitation and related thermodynamic effects of latent heat release [27]. The kernel has been optimized to reduce redundant memory transfer by storing the temporary results for each altitude level in the cell in registers. However, this requires a large number of registers, thus limiting the maximum allowed number of threads per shader core to 192, which is not enough to cover global and local memory access latencies. We simulate the kernel using the default test sample for 10 timesteps.

3. Methodology

Table 2 shows the simulator’s configuration. Rows with multiple entries show the different configurations that we have simulated. Bold values show our baseline. To simulate the mesh network, we used a detailed interconnection network model, incorporating the configurable interconnection network simulator introduced by Dally et al. [9]. Table 3 shows the interconnection configuration used in our simulations.


We simulate all benchmarks to completion to capture all the distinct phases of each kernel in the benchmarks, especially the behavior at the tail end of the kernels, which can vary drastically compared to the beginning. If the kernels are relatively short and are frequently launched, the difference in performance when not simulating the benchmark to completion can be significant.


We note that the breadth-first CTA distribution heuristic described in Section 2.2.2 can occasionally lead to counter-intuitive performance results due to a phenomenon we will refer to as CTA load imbalance. This CTA load imbalance can occur when the number of CTAs in a grid exceeds the number that can run concurrently on the GPU. For example, consider six CTAs on a GPU with two shader cores where at most two CTAs can run concurrently on a shader core. Assume running one CTA on one core takes time T and running two CTAs on one core takes time 2T (e.g., no off-chip accesses and six or more warps per CTA, enough for one CTA to keep our 24-stage pipeline full). If each CTA in isolation takes equal time T, total time is 3T (2T for the first round of four CTAs plus T for the last two CTAs which run on separate shader cores). Suppose we introduce an enhancement that causes CTAs to run in time 0.90T to 0.91T when run alone (i.e., faster). If both CTAs on the first core now finish ahead of those on the other core at time 1.80T versus 1.82T, then our CTA distributor will issue the remaining 2 CTAs onto the first core, causing the load imbalance. With the enhancement, this actually causes an overall slowdown since now 4 CTAs need to be completed by the first core, requiring a total time of at least 3.6T. We carefully verified that this behavior occurs by plotting the distribution of CTAs to shader cores versus time for both configurations being compared. This effect would be less significant in a real system with larger data sets and therefore grids with a larger number of CTAs. Rather than attempt to eliminate the effect by modifying the scheduler (or the benchmarks) we simply note where it occurs.

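The arithmetic of that example, written compactly (same numbers as in the text):

```latex
\text{baseline: } \underbrace{2T}_{\text{first 4 CTAs, 2 per core}} + \underbrace{T}_{\text{last 2 CTAs, 1 per core}} = 3T,
\qquad
\text{with the enhancement: } 4 \times 0.90T = 3.6T \text{ on the first core alone} > 3T.
```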

In Section 4.7 we measure the impact of running greater or fewer numbers of threads. We model this by varying the number of concurrent CTAs permitted by the shader cores, which is possible by scaling the amount of on-chip resources available to each shader core. There are four such resources: Number of concurrent threads, number of registers, amount of shared memory, and number of CTAs. The values we use are shown in Table 2. The amount of resources available per shader core is a configurable simulation option, while the amount of resources required by each kernel is extracted using ptxas.


4. Experimental Results

In this section we evaluate the designs introduced in Section 2. Figure 4 shows the classification of each benchmark’s instruction type (dynamic instruction frequency). The Fused Multiply-Add and ALU Ops (other) sections of each bar show the proportion of total ALU operations for each benchmark (which varies from 58% for NQU to 87% for BLK). Only DG, CP and NN utilize the Fused Multiply-Add operations extensively. Special Function Unit (SFU) instructions are also only used by a few benchmarks. CP is the only benchmark that has more than 10% SFU instructions.


The memory operations portion of Figure 4 is further broken down in terms of type as shown in Figure 5. Note that “param” memory refers to parameters passed through the GPU kernel call, which we always treat as cache hits. There is a large variation in the memory instruction types used among benchmarks: for CP over 99% of accesses are to constant memory while for NN most accesses are to global memory.


4.1. Baseline

We first simulated our baseline GPU configuration with the bolded parameters shown in Table 2. Figure 6 shows the performance of our baseline configuration (for the GPU only) measured in terms of scalar instructions per cycle (IPC). For comparison, we also show the performance assuming a perfect memory system with zero memory latency. Note that the maximum achievable IPC for our configuration is 224 (28 shader cores x 8-wide pipelines). We also validated our simulator against an NVIDIA GeForce 8600GTS (a “low end” graphics card) by configuring our simulator to use 4 shaders and two memory controllers. The IPC of the GPU hardware, as shown in Figure 8(a), was estimated by dividing the dynamic instruction count measured (in PTX instructions) in simulation by the product of the measured runtime on hardware and the shader clock frequency [31]. Figure 8(b) shows the scatter plot of IPC obtained with our simulations mimicking the 8600GTS normalized to the theoretical peak IPC versus the normalized IPC data measured using the 8600GTS. The correlation coefficient was calculated to be 0.899. One source of difference, as highlighted by the data for CP which actually achieves a normalized IPC over 1, is likely due to compiler optimizations in ptxas which may reduce the instruction count on real hardware. Overall, the data shows that applications that perform well in real GPU hardware perform well in our simulator and applications that perform poorly in real GPU hardware also perform poorly in our simulator. In the following sections, we explore reasons why some benchmarks do not achieve peak performance.
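
Restated as a formula (a paraphrase of the estimation procedure just described, not notation from the paper):

```latex
\mathrm{IPC}_{\mathrm{HW}} \approx
\frac{\text{dynamic PTX instruction count measured in simulation}}
     {\text{measured hardware runtime} \times \text{shader clock frequency}}
```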

4.2. Branch Divergence

Branch divergence was highlighted by Fung et al. as a major source of performance loss for multithreaded SIMD architectures with intra-warp branch divergence [11]. Figure 7 shows warp occupancies (number of active threads in an issued warp) over the entire runtime of the benchmarks. This metric can be seen as a measure of how much GPU throughput potential is wasted due to unfilled warps. The control flow portion of the bars in Figure 4 shows that BFS, LPS, and NQU contain from 13% to 28% control flow operations. However, intensive control flow does not necessarily translate into high branch divergence; it depends more on whether or not all threads in a warp branch in the same direction. NN has the lowest warp occupancy even though it contains only 7% control flow operations. On the other hand, LPS with 19% control flow has full warp occupancy 75% of the time. It is best to analyze Figure 7 with Table 1 in mind, particularly in the case of NN. In NN, two of the four kernels have only a single thread in a block and they take up the bulk of the execution time, meaning that the unfilled warps in NN are not due to branch divergence. Some benchmarks (such as AES, CP, LIB, and STO) do not incur significant branch divergence, while others do. MUM experiences severe performance loss in particular because more than 60% of its warps have fewer than 5 active threads. BFS also performs poorly since threads in adjacent nodes in the graph (which are grouped into warps) behave differently, causing more than 75% of its warps to have less than 50% occupancy. Warp occupancy for NN and NQU is low due to large portions of the code being spent in a single thread.

4.3. Interconnect Topology

Figure 9 shows the speedup of various interconnection network topologies compared to a mesh with 16 Byte channel bandwidth. On average our baseline mesh interconnect performs comparably to a crossbar with input speedup of two for the workloads that we consider. We also have evaluated two torus topologies: “Torus - 16 Byte Channel BW”, which has double the bisection bandwidth of the baseline “Mesh” (a determining factor in the implementation cost of a network); and “Torus - 8 Byte Channel BW”, which has the same bisection bandwidth as “Mesh”. The “Ring” topology that we evaluated has a channel bandwidth of 64 bytes. The “Crossbar” topology has a parallel iterative matching (PIM) allocator as opposed to an iSLIP allocator for other topologies. The two-stage butterfly and crossbar employ destination tag routing while others use dimension-order routing. The ring and mesh networks are the simplest and least expensive networks to build in terms of area.

As Figure 9 suggests, most of the benchmarks are fairly insensitive to the topology used. In most cases, a change in topology results in less than 20% change in performance from the baseline, with the exception of the Ring and Torus with 8 Byte channel bandwidth. BLK experiences a performance gain with Ring due to the CTA load imbalance phenomenon described in Section 3. BLK has 256 CTAs. For the Ring configuration, the number of CTAs executed per shader core varies from 9 to 10. However, for the baseline configuration, one of the shader cores is assigned 11 CTAs due to small variations in time coupled with our greedy CTA distribution heuristic. When more CTAs run on a shader core, all CTAs on that shader core take longer to complete, resulting in a performance loss for the baseline configuration for BLK.

As we will show in the next section, one of the reasons why different topologies do not change the performance of most benchmarks dramatically is that the benchmarks are not sensitive to small variations in latency, as long as the interconnection network provides sufficient bandwidth.

4.4. Interconnection Latency and Bandwidth

Figure 10 shows the IPC results for various mesh network router latencies. Without affecting peak throughput, we add an extra pipelined latency of 4, 8, or 16 cycles to each router on top of our baseline router’s 2-cycle latency. An extra 4 cycle latency per router is easily tolerated for most benchmarks and causes only 3.5% performance loss when harmonically averaged across all benchmarks. BLK and CP experience a performance gain due to the CTA load imbalance phenomenon described in Section 3. With 8 extra cycles of latency per router, the performance degradation is noticeable (slowdown by 9% on average) and becomes much worse (slowdown by 25% on average) at 16 cycles of extra latency. Note that these experiments are only intended to show the latency sensitivity of benchmarks.

We also modify the mesh interconnect bandwidth by varying the channel bandwidth from 8 bytes to 64 bytes. Figure 11 shows that halving the channel bandwidth from 16 bytes to 8 bytes has a noticeable negative impact on most benchmarks, but doubling and quadrupling channel bandwidth only results in a small gain for a few workloads, i.e., BFS and DG.

DG is the most bandwidth sensitive workload, getting a 31% speedup and a 53% slowdown for flit sizes of 32 and 8 bytes, respectively. The reason why DG does not exhibit further speedup with a flit size of 64 is because at this point, the interconnect has already been overprovisioned. Our analysis shows that for the baseline configuration, the input port to the return interconnect from memory to the shader cores is stalled 16% of the time on average. Increasing the flit size to 32 completely eliminates these stalls, which is why there is no further speedup for an interconnect flit size of 64. Note that our memory read request packet sizes are 8 bytes, allowing them to be sent to the memory controllers in a single flit for all of the configurations shown in Figure 11.

Overall, the above data suggests that performance is more sensitive to interconnect bandwidth than to latency for the non-graphics workloads that we study. In other words, restricting channel bandwidth causes the interconnect to become a bottleneck.

4.5. DRAM Utilization and Efficiency

In this section we explore the impact that memory controller design has on performance. Our baseline configuration uses an Out-of-Order (OoO) First-Ready First-Come First-Serve (FR-FCFS) [39] memory controller with a capacity of 32 memory requests. Each cycle, the OoO memory controller prioritizes memory requests that hit an open row in the DRAM over requests that require a precharge and activate to open a new row. Against this baseline, we compare a simple First-In First-Out (FIFO) memory controller that services memory requests in the order that they are received, as well as a more aggressive FR-FCFS OoO controller with an input buffer capacity of 128 (OoO128). We measure two metrics besides performance: The first is DRAM efficiency, which is the percentage of time spent sending data across the pins of DRAM over the time when there are any memory requests being serviced or pending in the memory controller input buffer; the second is DRAM utilization, which is the percentage of time spent sending data across the DRAM data pins over the entire kernel execution time. These two measures can differ if an application contains GPU computation phases during which it does not access DRAM (e.g., if it has been heavily optimized to use “shared memory”).
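
Written out as ratios (a restatement of the two definitions above, both reported as percentages):

```latex
\text{DRAM efficiency} =
\frac{\text{time spent transferring data across the DRAM data pins}}
     {\text{time during which any request is pending or being serviced at that controller}},
\qquad
\text{DRAM utilization} =
\frac{\text{time spent transferring data across the DRAM data pins}}
     {\text{total kernel execution time}}.
```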

Figure 12 compares the performance of our baseline to FIFO and OoO128. We observe that AES, CP, NQU, and STO exhibit almost no slowdown for FIFO. Figure 14 shows AES and STO obtain over 75% DRAM efficiency. Close examination reveals that at any point in time all threads access at most two rows in each bank of each DRAM, meaning that a simple DRAM controller policy suffices. Furthermore, Figure 13 shows that AES and STO have low DRAM utilization despite the fact that they process large amounts of data. Both these applications make extensive use of shared memory (see Figure 5). NQU and CP have very low DRAM utilization, making them insensitive to memory controller optimizations (CP slows down for OoO128 due to variations in CTA load distribution). Performance is reduced by over 40% when using FIFO for BFS, LIB, MUM, RAY, and WP. These benchmarks all show drastically reduced DRAM efficiency and utilization with this simple controller.

4.6. Cache Effects

Figure 15 shows the effects on IPC of adding caches to the system. The first 3 bars show the relative speedup of adding a 16KB, 32KB or 64KB cache to each shader core. The last two bars show the effects of adding a 128KB or 256KB L2 cache to each memory controller in addition to a 64KB L1 cache in each shader. CP, RAY and FWT exhibit a slowdown with the addition of L1 caches. Close examination shows that CP experiences a slowdown due to the CTA load imbalance phenomenon described in Section 3, whereas RAY and FWT experience a slowdown due to the way write misses and evictions of dirty lines are handled. For the baseline (without caches for local/global accesses), writes to memory only cause the memory controller to read data out of DRAM if only a portion of a 16-byte region is modified due to writes that are not coalesced. When caches are added for local and global accesses, for simplicity, a write miss prevents a warp from being scheduled until the cache block is read from DRAM. Furthermore, when a dirty line is evicted, the entire line is written to DRAM even if only a single word of that line is modified. We leave exploration of better cache policies to future work.

Benchmarks that make extensive use of “shared memory”, namely AES, LPS, NQU, and STO, do not respond significantly to caches. On the other hand, BFS and NN have the highest ratio of global memory instructions to all instructions (at 19% and 27% respectively) and so they experience the highest speedup among workloads.

4.7. Are More Threads Better?

Increasing the number of simultaneously running threads can improve performance by having a greater ability to hide memory access latencies. However, doing so may result in higher contention for shared resources, such as interconnect and memory. We explored the effects of varying the resources that limit the number of threads and hence CTAs that can run concurrently on a shader core, without modifying the source code for the benchmarks. We vary the amount of registers, shared memory, threads, and CTAs from 25% to 200% of those available to the baseline. The results are shown in Figure 16. For the baseline configuration, some benchmarks are already resource-constrained to only 1 or 2 CTAs per shader core, making them unable to run using a configuration with fewer resources. We do not show bars for configurations that for this reason are unable to run. NQU shows little change when varying the number of CTAs since it has very few memory operations. For LPS, NN, and STO, performance increases as more CTAs per core are used. LPS cannot take advantage of additional resources beyond the baseline (100%) because all CTAs in the benchmark can run simultaneously for the baseline configuration. Each CTA in STO uses all the shared memory in the baseline configuration, therefore increasing shared memory by half for the 150% configuration results in no increase in the number of concurrently running CTAs. AES and MUM show clear trends of decreasing performance as the number of CTAs increases. We observed that with more concurrent CTAs, AES and MUM experience increased contention in the memory system, resulting in 8.6× and 5.4× worse average memory latency, respectively, comparing 200% resources to 50%. BFS, RAY, and WP show distinct optima in performance when the CTA limit is at 100%, 100%, and 150% of the baseline shader, respectively. Above these limits, we observe DRAM efficiencies decrease and memory latencies increase, again suggesting increased contention in the memory system. For configurations with limits below the optima, the lack of warps to hide memory latencies reduces performance. CP suffers CTA load imbalance due to CTA scheduling for the 50% and 100% configurations. Similarly, DG suffers CTA load imbalance in the 150% configuration.

Given the widely-varying workload-dependent behavior, always scheduling the maximal number of CTAs supported by a shader core is not always the best scheduling policy. We leave for future work the design of dynamic scheduling algorithms to adapt to the workload behavior.

4.8. Memory Request Coalescing

Figure 17 presents data showing the improvement in performance when enabling the inter-warp memory coalescing described in Section 2.2.3. The harmonic mean speedup versus intra-warp coalescing is 6.1%. CP’s slowdown with inter-warp coalescing is due to load imbalance in CTA distribution. Accesses in AES, DG, and MUM are to data-dependent locations which makes it harder to use the explicitly managed shared memory to capture locality. These applications use the texture cache to capture this locality and inter-warp merging effectively eliminates additional requests for the same cache block at the expense of associative search hardware.

It is interesting to observe that the harmonic mean speedup of the CUDA SDK benchmarks is less than 1%, showing that these highly optimized benchmarks do not benefit from inter-warp memory coalescing. Their careful program optimizations ensure less redundancy in memory requests generated by each thread.

5. Related Work

Existing graphics-oriented GPU simulators include Qsilver [43], which does not model programmable shaders, and ATTILA [10], which focuses on graphics specific features. Ryoo et al. [41] use CUDA to speed up a variety of relatively easily parallelizable scientific applications. They explore the use of conventional code optimization techniques and take advantage of the different memory types available on NVIDIA’s 8800GTX to obtain speedup. While their analysis is performed by writing and optimizing applications to run on actual CUDA hardware, we use our novel performance simulator to observe the detailed behavior of CUDA applications upon varying architectural parameters.

There have been acceleration architectures proposed besides the GPU model that we analyze in this paper. Mahesri et al. introduce a class of applications for visualization, interaction, and simulation [23]. They propose using an accelerator architecture (xPU) separate from the GPU to improve performance of their benchmark suite. The Cell processor [7, 38] is a hardware architecture that can function like a stream processor with appropriate software support. It consists of a controlling processor and a set of SIMD co-processors each with independent program counters and instruction memory. Merrimac [8] and Imagine [3] are both streaming processor architectures developed at Stanford.

Khailany et al. [20] explore VLSI costs and performance of a stream processor as the number of streaming clusters and ALUs per cluster scales. They use an analytical cost model. The benchmarks they use also have a high ratio of ALU operations to memory references, which is a property that eases memory requirements of streaming applications. The UltraSPARC T2 [44] microprocessor is a multithreading, multicore CPU which is a member of the SPARC family and comes in 4, 6, and 8 core variations, with each core capable of running 8 threads concurrently. It has a crossbar between the L2 and the processor cores (similar to our placement of the L2 in Figure 1(a)). Although the T1 and T2 support many concurrent threads (32 and 64, respectively) compared to other contemporary CPUs, this number is very small compared to the number on a high end contemporary GPU (e.g., the Geforce 8800 GTX supports 12,288 threads per chip).

We quantified the effects of varying cache size, DRAM bandwidth and other parameters, an analysis which, to our knowledge, has not been published previously. While the authors of the CUDA applications which we use as benchmarks have published work, the emphasis of their papers was not on how changes in the GPU architecture can affect their applications [4, 5, 12, 13, 15, 24, 26, 27, 37, 41, 42, 46]. In terms of streaming multiprocessor design, all of the above-mentioned works have different programming models from the CUDA programming model that we employ.

6. Conclusions

In this paper we studied the performance of twelve contemporary CUDA applications by running them on a detailed performance simulator that simulates NVIDIA’s parallel thread execution (PTX) virtual instruction set architecture. We presented performance data and detailed analysis of performance bottlenecks, which differ in type and scope from application to application. First, we found that generally performance of these applications is more sensitive to interconnection network bisection bandwidth than to (zero load) latency: Reducing interconnect bandwidth by 50% is even more harmful than increasing the per-router latency by 5.3× from 3 cycles to 19 cycles. Second, we showed that caching global and local memory accesses can cause performance degradation for benchmarks where these accesses do not exhibit temporal or spatial locality. Third, we observed that sometimes running fewer CTAs concurrently than the limit imposed by on-chip resources can improve performance by reducing contention in the memory system. Finally, aggressive inter-warp memory coalescing can improve performance in some applications by up to 41%.