Preface
This is the fourth installment of the A Week series.
It opens a new series of notes on Parallel Programming by the master Paul E. McKenney. Since the book has no official Chinese translation yet, these notes include more translation and commentary than usual.
Chapter 1
Is Parallel Programming Hard, And, If So, What Can You Do About It?
The author, Paul E. McKenney, is a world-class expert on parallel programming, the maintainer of the RCU implementation and the rcutorture test module in the Linux kernel, and a co-inventor of RCU.
The book opens with a guide on how to use it, mainly explaining the intent behind its code samples and exercises.
Now, on to the chapters proper:
Chapter 2 Introduction
New technologies that are difficult to use at introduction invariably become easier over time. That is, new things are always hard at the start, but they grow easier as time and adoption go on.
It instead examines the reasons why parallel programming is difficult, and then works to help the reader to overcome these difficulties.
Parallel programming really is hard, but all of these difficulties can be overcome.
The difficulties include the following:
- The historic high cost and relative rarity of parallel systems.
- The typical researcher's and practitioner's lack of experience with parallel systems.
- The paucity of publicly accessible parallel code.
- The lack of a widely understood engineering discipline of parallel programming.
- The high overhead of communication relative to that of processing, even in tightly coupled shared-memory computers.
Beyond sheer unfamiliarity, then, part of the difficulty stems from complex interdependencies, high computational cost, and cumbersome communication.
Most existing software is merely concurrent, but as computer architecture advances and operating-system kernels, database systems, and message-processing systems improve, the potential of parallel programming is being unlocked. Communication costs are still comparatively high, yet there is a distinct feeling that the hard part is finally paying off.
The three goals of parallel programming:
- Performance
- Productivity
- Generality
The three form a triangle, forever balancing and pulling against one another.
Parallelism is the way to go for those wanting to avail themselves of the full performance of their systems. That is, parallelism is how you tap a machine's full performance.
One of the inescapable consequences of the rapid decrease in the cost of hardware is that software productivity becomes increasingly important. That is, as hardware costs fall, software productivity grows correspondingly more important.
In fact, this economic force explains much of the maniacal focus on portability, which can be seen as an important special case of generality. For compatibility, you always give up some performance.
A balance among performance, productivity, and generality must always be struck, as it is in C/C++, Java, MPI, and the rest.
But there are other ways to gain performance, each expanded below:
- Run multiple instances of a sequential application.
- Make the application use existing parallel software.
- Optimize the serial application.
1. If your program is analyzing a large number of different scenarios, or is analyzing a large number of independent data sets, one easy and effective approach is to create a single sequential program that carries out a single analysis, then use any of a number of scripting environments (for example the bash shell) to run a number of instances of that sequential program in parallel.
In other words, run many copies of one sequential analysis program in parallel over the data sets.
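To make this concrete, here is a minimal C sketch (the ./analyze binary and its one-argument convention are hypothetical): the driver forks one process per data set and lets the operating system run the sequential analyses in parallel, much as a bash loop ending in & followed by wait would.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	/* Each command-line argument names an independent data set. */
	for (int i = 1; i < argc; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			perror("fork");
			exit(EXIT_FAILURE);
		}
		if (pid == 0) {
			/* Child: run the sequential analyzer on one data set. */
			execl("./analyze", "analyze", argv[i], (char *)NULL);
			perror("execl");	/* reached only if exec fails */
			_exit(EXIT_FAILURE);
		}
	}
	/* Parent: wait for every analysis to finish. */
	while (wait(NULL) > 0)
		continue;
	return 0;
}
```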
2. There is no longer any shortage of parallel software environments that can present a single-threaded programming environment, including relational databases [Dat82], web-application servers, and map-reduce environments.
Besides offloading to a GPU, you can also lean on relational databases, web-application servers, and map-reduce environments.
3. Now that Moore's Law no longer delivers significant single-threaded performance improvements, parallel programming is the best way to break through performance bottlenecks. But the choice of data structures and algorithms still matters, and adjusting the data layout can also improve performance by reducing the time the CPU spends waiting on disk; a data-layout sketch follows the quote below.
Of course, the easier it is to parallelize your program, the more attractive parallelization becomes as an optimization.
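As one hedged illustration of the data-layout point (my own sketch, not an example from the book): storing a hot field contiguously lets each 64-byte cacheline serve several iterations, whereas an array of large structs wastes most of every line it fetches.

```c
#include <stddef.h>

/* Array-of-structs: each x is surrounded by 56 bytes of unrelated
 * payload, so summing x drags in one full cacheline per element. */
struct point_aos {
	double x;
	double payload[7];
};

double sum_x_aos(const struct point_aos *p, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += p[i].x;
	return sum;
}

/* Struct-of-arrays: the x values are contiguous, so eight of them
 * fit in each 64-byte cacheline and the prefetcher sees a simple
 * sequential stream. */
double sum_x_soa(const double *x, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += x[i];
	return sum;
}
```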
But parallel programming involves two-way communication, with a program's performance and scalability being the communication from the machine to the human. In short, the human writes a program telling the computer what to do, and the computer critiques this program via the resulting performance and scalability. Therefore, appeals to abstractions or to mathematical analyses will often be of severely limited utility.
So parallel programming is not just writing code: it involves this two-way communication and the pursuit of scalability, and it demands skill in abstraction and mathematical analysis. In short, the human factor looms large.
As the book's figure suggests: the demands on the software developer are high.
Work Partitioning
A well-designed parallel program partitions its work over suitable data structures: the state space it needs and the I/O devices it touches stay reasonable, and it does not thrash the CPU caches. A badly partitioned program, by contrast, invites contention and becomes hard to understand.
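In this minimal example of my own (not the book's code), each thread sums a disjoint slice of an array, so the threads never write shared data until the final, serial combine.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4
#define N 1000000

static double a[N];
static double partial[NTHREADS];

/* Each thread sums its own contiguous slice: disjoint reads,
 * and each thread writes only its own partial[] slot. */
static void *sum_slice(void *arg)
{
	long t = (long)arg;
	size_t lo = (size_t)t * (N / NTHREADS);
	size_t hi = (t == NTHREADS - 1) ? N : lo + N / NTHREADS;
	double s = 0.0;

	for (size_t i = lo; i < hi; i++)
		s += a[i];
	partial[t] = s;
	return NULL;
}

double parallel_sum(void)
{
	pthread_t tid[NTHREADS];
	double total = 0.0;

	for (long t = 0; t < NTHREADS; t++)
		pthread_create(&tid[t], NULL, sum_slice, (void *)t);
	for (int t = 0; t < NTHREADS; t++) {
		pthread_join(tid[t], NULL);
		total += partial[t];	/* serial combine after join */
	}
	return total;
}
```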
Parallel Access Control
The first parallel-access-control issue is whether the form of access to a given resource depends on that resource’s location.
The other parallel-access-control issue is how threads coordinate access to the resources.
Synchronization always brings some access-control mechanism with it, and problems such as deadlock, livelock, and transaction rollback never really go away.
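As one concrete coordination pitfall (a sketch of my own, not code from the book): if two threads acquire the same pair of pthread mutexes in opposite orders, each can end up holding one lock while waiting forever for the other. Imposing a single global lock order breaks the cycle.

```c
#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

/* Deadlock-prone pattern: thread 1 takes A then B while thread 2
 * takes B then A, so each may hold one lock and wait on the other.
 *
 * Deadlock-free version: every thread acquires the locks in the
 * same global order (A before B), so a cycle of waiters cannot form. */
void update_both(void)
{
	pthread_mutex_lock(&lock_a);
	pthread_mutex_lock(&lock_b);
	/* ... critical section touching both resources ... */
	pthread_mutex_unlock(&lock_b);
	pthread_mutex_unlock(&lock_a);
}
```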
Resource Partitioning and Replication
The most effective parallel algorithms and systems exploit resource parallelism, so much so that it is usually wise to begin parallelization by partitioning your write-intensive resources and replicating frequently accessed read-mostly resources.
That is: partition your write-intensive resources, replicate your frequently accessed read-mostly ones so that accesses can proceed in parallel, and be prepared to dynamically change the partitioning over time.
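The canonical example of partitioning a write-intensive resource is the statistical counter; the sketch below is my simplified C11 version in the spirit of the book's, not its exact code. Each thread increments its own padded slot, so updates never contend, and the occasional reader sums across all slots.

```c
#include <stdatomic.h>

#define NR_THREADS 16

/* One counter per thread, padded out to a full 64-byte cacheline
 * so updates by different threads never share a line. */
struct thread_count {
	atomic_long counter;
	char pad[64 - sizeof(atomic_long)];
};

static struct thread_count counts[NR_THREADS];

/* Hot path: each thread touches only its own slot. */
void inc_count(int tid)
{
	atomic_fetch_add_explicit(&counts[tid].counter, 1,
				  memory_order_relaxed);
}

/* Cold path: a reader sums every per-thread slot. */
long read_count(void)
{
	long sum = 0;

	for (int i = 0; i < NR_THREADS; i++)
		sum += atomic_load_explicit(&counts[i].counter,
					    memory_order_relaxed);
	return sum;
}
```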
So these, then, are the challenges of parallel programming.
Interacting With Hardware
Hardware interaction is normally the domain of the operating system, the compiler, libraries, or other software-environment infrastructure.
But when performance is the goal and we want to squeeze every last drop out of the machine, we must understand how software interacts with the hardware, including at least:
- cache geometry,
- system topology,
- the interconnect protocol of the target hardware.
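One place cache geometry bites directly (a hypothetical sketch, not from the book): two unrelated counters that share a 64-byte cacheline will ping-pong that line between cores, and aligning each to its own line removes the false sharing.

```c
#include <stdalign.h>

/* False sharing: these two counters, each updated by a different
 * thread, most likely land in the same 64-byte cacheline, so every
 * update invalidates the other core's copy of the line. */
struct stats_bad {
	long thread0_hits;
	long thread1_hits;
};

/* Fix: give each counter its own cacheline. 64 bytes is a common
 * line size, but check the cache geometry of your target hardware. */
struct stats_good {
	alignas(64) long thread0_hits;
	alignas(64) long thread1_hits;
};
```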
Writing concurrent programs has a reputation for being exotic and difficult. I believe it is neither. You need a system that provides you with good primitives and suitable libraries, you need a basic caution and carefulness, you need an armory of useful techniques, and you need to know of the common pitfalls. I hope that this paper has helped you towards sharing my belief.
Since the 1980s,
humanity has never stopped rising to the challenge of parallel programming,
so there is no reason at all to refuse the challenges we face today, in the 21st century!
Chapter 3 Hardware and its Habits
This chapter therefore looks at the cost of synchronization and communication within a shared-memory system.
A CPU does not actually run along one straight track.
In contrast, the CPUs of the late 1990s and of the 2000s execute many instructions simultaneously, using pipelines; superscalar techniques; out-of-order instruction and data handling; speculative execution, and more [HP17, HP11] in order to optimize the flow of instructions and data through the CPU.
Pipelined CPUs
Suitable control flow can be provided by a program that executes primarily in tight loops, for example, arithmetic on large matrices or vectors. The CPU guesses at control flow as it executes, and a wrong guess is expensive: pipelines work well for programs with predictable control flow, but on an unpredictable branch a mispredicting CPU must throw away every instruction it executed past that branch, wasting a great deal of time.
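The classic way to feel this cost (a hypothetical micro-example of my own, not the book's): the very same loop runs markedly faster over sorted input than over random input, because sorting makes the branch outcome predictable.

```c
#include <stddef.h>

/* Counts elements below `cut`. On ascending sorted data the branch
 * goes true, true, ..., then false, false, ..., so the predictor is
 * almost always right; on random data it mispredicts roughly half
 * the time, and each mispredict flushes in-flight instructions. */
long count_below(const int *a, size_t n, int cut)
{
	long c = 0;

	for (size_t i = 0; i < n; i++)
		if (a[i] < cut)
			c++;
	return c;
}
```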
This gets even worse in the increasingly common case of hyperthreading (or SMT, if you prefer), especially on pipelined superscalar out-of-order CPU featuring speculative execution.
Therefore, the execution of one hardware thread can and often is perturbed by the actions of other hardware threads sharing that core.
In theory, this contention is avoidable, but in practice CPUs must choose very quickly and without the benefit of clairvoyance. In particular, adding an instruction to a tight loop can sometimes actually cause execution to speed up.
The CPU simply runs too fast for its judgments about the instruction stream to keep up.
Memory References
Memory references often pose severe obstacles to modern CPUs.
These caches require highly predictable data-access patterns to successfully hide those latencies.
Unfortunately, common operations such as traversing a linked list have extremely unpredictable memory-access patterns.
Latency is a constant headache when the CPU reaches out to memory, and atomic operations bring headaches of their own.
Atomic Operations
Atomic operations appear to take effect in one indivisible step, even though they are in fact being executed piece-at-a-time, with one common trick being to identify all the cachelines containing the data to be atomically operated on, ensure that these cachelines are owned by the CPU executing the atomic operation, and only then proceed with the atomic operation while ensuring that these cachelines remain owned by this CPU.
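In C11, for example, these hardware-backed atomics surface through <stdatomic.h>; the worker below is my own minimal sketch, not the book's code.

```c
#include <stdatomic.h>

atomic_int hits;	/* incremented concurrently by many threads */

void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < 1000000; i++)
		atomic_fetch_add(&hits, 1); /* indivisible read-modify-write */
	/* A plain `hits++` on a non-atomic int here would be a data
	 * race: the read, the add, and the write could interleave. */
	return NULL;
}
```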
Memory Barriers
Cache Misses
I/O Operations
CSAPP gives a more complete treatment of memory barriers, cache misses, and I/O operations.
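For memory barriers specifically, here is a minimal C11 message-passing sketch (mine, not from either book): the release/acquire pair keeps the store to data from being reordered past the store to the flag, so a consumer that sees the flag also sees the payload.

```c
#include <stdatomic.h>

int data;		/* payload, written before the flag */
atomic_int ready;	/* flag guarded by release/acquire */

void producer(void)
{
	data = 42;
	/* Release: all earlier stores are visible before ready = 1. */
	atomic_store_explicit(&ready, 1, memory_order_release);
}

int consumer(void)
{
	/* Acquire: once ready == 1 is observed, data == 42 is too. */
	while (!atomic_load_explicit(&ready, memory_order_acquire))
		continue;
	return data;
}
```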
A major goal of parallel hardware design is to reduce this ratio of communication overhead to instruction-execution cost as needed to achieve the relevant performance and scalability goals.
Overheads
Don’t design bridges in ignorance of materials, and don’t design low-level software in ignorance of the underlying hardware.
Now we move on to the hardware system structure.
This simplified sequence is just the beginning of a discipline called cache-coherency protocols.
Same-CPU compare-and-swap (CAS) also appears here, and CSAPP covers it too.
This part is very detailed; I recommend reading the original text directly.
CAS is an atomic operation in which the hardware compares the contents of the specified memory location to a specified "old" value, and if they compare equal, stores a specified "new" value, in which case the CAS operation succeeds. Many optimizations, and changes in how computations are structured, build on this.
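A typical use of CAS (a minimal C11 sketch of my own, not the book's example) is the retry loop that turns it into an arbitrary atomic update, here an atomic maximum:

```c
#include <stdatomic.h>

atomic_int maximum;

/* Atomically raise `maximum` to at least `v`. On failure, the CAS
 * reloads the current value into `old`, and the loop retries. */
void atomic_max(int v)
{
	int old = atomic_load(&maximum);

	while (old < v &&
	       !atomic_compare_exchange_weak(&maximum, &old, v))
		continue;
}
```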
Hardware Optimizations
1. One hardware optimization is large cachelines.
2. A second related hardware optimization is cache prefetching, in which the hardware reacts to consecutive accesses by prefetching subsequent cachelines, thereby evading speed-of-light delays for these subsequent cachelines.
3. A third hardware optimization is the store buffer, which allows a string of store instructions to execute quickly even when the stores are to non-consecutive addresses and when none of the needed cachelines are present in the CPU's cache.
4. A fourth hardware optimization is speculative execution, which can allow the hardware to make good use of the store buffers without resulting in memory misordering.
5. A fifth hardware optimization is large caches, allowing individual CPUs to operate on larger datasets without incurring expensive cache misses.
6. A final hardware optimization is read-mostly replication, in which data that is frequently read but rarely updated is present in all CPUs' caches.
Hardware Free Lunch?
The major reason that concurrency has been receiving so much focus over the past few years is the end of Moore's-Law-induced single-threaded performance increases.
The end of Moore's Law has pushed the spotlight onto concurrency.
But everything comes at a price.
The discussion now turns to 3D integration.
Well, I never studied my digital and analog electronics properly, and now I shed tears of regret...
Special-Purpose Accelerators
Special-purpose accelerators for parallel workloads.
Existing Parallel Software
Existing parallel software can, in fact, already solve a fair share of industrial problems.
Software Design Implications
The lesson should be quite clear: Parallel algorithms must be explicitly designed with these hardware properties firmly in mind.
OK, that wraps up Chapter 3!
It is mostly about how to achieve excellent parallel performance and scalability, with plenty of low-level detail.
Let me leave you with one final question.
Question:
OK, if we are going to have to apply distributed-programming techniques to shared-memory parallel programs, why not just always use these distributed techniques and dispense with shared memory?
Answer:
Because it is often the case that only a small fraction of the program is performance-critical. Shared-memory parallelism allows us to focus distributed-programming techniques on that small fraction, allowing simpler shared-memory techniques to be used on the non-performance-critical bulk of the program.
To be continued~
Thanks for reading this far, and have a great weekend~