Lec 3-parallel programming models

  • SIMD: single instruction, multiple data; multiple ALUs run simultaneously, controlled by a single control unit (CU) through instruction broadcast

  • SPMD: single program, multiple data; many instances of the same program run concurrently, and at any given moment different instances may be executing different instructions

ISPC

ISPC implements the gang abstraction using SIMD instructions

A call to an ISPC function spawns a “gang” of ISPC “program instances”. All instances run the ISPC code concurrently. Upon return, all instances have completed.

programCount: number of concurrent instances in the gang (value is uniform among them), set by runtime system

programIndex: id of the current instance in the gang (value varies)

uniform: type modifier marking a value that is identical across all instances in the gang; use it for optimization (not correctness)

```ispc
for (uniform int i = 0; i < N; i += programCount) {
    int idx = i + programIndex;   // each instance handles a different element
}
```

programIndex = 0, 1, ..., programCount-1

the program does NOT run on different threads, but on a single thread using SIMD vector instructions; upon return, all instances have completed. This means that one gang can only make use of one core.

so programCount is actually (a small multiple of) the SIMD width

work assignment:

  • blocked 01/23/45/67 = (0246+1357)

    drawback: the workload may not be evenly distributed, so different instances may not finish at the same time; also, since a gang runs on a single thread, each SIMD load must gather strided elements (0,2,4,6, then 1,3,5,7), so there is no contiguous-access optimization

  • interleave 04/15/26/37 = (0123+4567)

    advantage: the data loaded into each round's vector is contiguous, which greatly speeds up the loads

interleaving is better
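The two assignment schemes can be sketched in plain Python (the gang size of 4 and N = 8 are made-up illustration values; function names are mine):

```python
def blocked(N, gang):
    # instance p gets a contiguous chunk of N // gang elements
    per = N // gang
    return {p: list(range(p * per, (p + 1) * per)) for p in range(gang)}

def interleaved(N, gang):
    # instance p gets elements p, p + gang, p + 2*gang, ...
    return {p: list(range(p, N, gang)) for p in range(gang)}

# With N = 8 and a gang of 4:
# blocked(8, 4)     -> {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
# interleaved(8, 4) -> {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
```

In the interleaved scheme, iteration k touches elements [k*gang, k*gang + gang), which is exactly one contiguous vector load; in the blocked scheme each vector iteration must gather strided elements.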

Let the system assign iterations automatically: foreach

```ispc
float partial = 0.0;
foreach (i = 0 ... N) partial += x[i];
uniform float sum = reduce_add(partial);   // uniform needs an explicit type
```

the reduce_add() function adds together the value of the given expression across all active program instances; its return value is uniform
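A scalar emulation of the foreach + reduce_add pattern may help; this Python sketch is mine (the exact iteration order ISPC uses is up to the compiler, here I use an interleaved split):

```python
def gang_sum(x, gang=4):
    # each program instance keeps its own per-instance partial sum (varying value)
    partial = [0.0] * gang
    for i, v in enumerate(x):
        partial[i % gang] += v   # "foreach" distributes iterations over the gang
    return sum(partial)          # "reduce_add": cross-instance reduction -> uniform
```

For example, gang_sum(range(10)) returns 45.0, the same result a serial sum would give.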

Three models of communication

the abstractions aren’t tied rigidly to specific hardware.

shared address space model

threads communicate by writing/reading shared variables, while each thread can have private variables as well

How do we maintain performance as more CPUs and memory are added?

bus: only one thing can be on it at a time, so it does not scale

How does a thread know where the data it wants lives? How does it know the data it reads is up to date? Synchronize through mutual-exclusion locks (mutexes).
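A minimal shared-address-space sketch in Python: two threads increment a shared counter, and a mutex makes each read-modify-write atomic (the variable names are illustrative):

```python
import threading

counter = 0                  # shared variable, visible to all threads
lock = threading.Lock()      # mutex protecting the shared variable

def worker(n):
    global counter
    for _ in range(n):
        with lock:           # only one thread in this critical section at a time
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
# counter is now exactly 20000; without the lock, increments could be lost
```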

  • uniform memory access

    Uniform memory access time: the cost of accessing an uncached memory address is the same for all processors (same distance). Processors are connected via an interconnect (bus); the bus connects to memory and I/O.

  • non-uniform memory access(NUMA)

    Each processor accesses its directly attached memory region at high speed, but can also reach the memory regions of other processors over the interconnect. Since a processor mostly works on its own memory, access is fast and puts little pressure on the interconnect, preserving interconnect bandwidth.

NUMA scales relatively well

message passing model

Threads operate within their own private address spaces.

Threads communicate by sending/receiving messages

  • send: specifies recipient, buffer to be transmitted, and optional message identifier (“tag”)

  • receive: sender, specifies buffer to store data, and optional message identifier

  • Sending messages is the only way to exchange data between threads 1 and 2

  • Only a network is needed to implement this approach. It’s easy to build large machines this way

typical example: MPI (message passing interface)
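Message passing can be sketched with two Python threads and an explicit channel; here a queue.Queue stands in for the network, and the tuple's first field plays the role of the message "tag" (real MPI programs would use MPI_Send/MPI_Recv instead):

```python
import threading
import queue

channel = queue.Queue()   # the ONLY way the two threads exchange data

def sender():
    for i in range(3):
        channel.put(("tag_data", i))    # send: tag + payload buffer
    channel.put(("tag_done", None))     # sentinel telling the receiver to stop

received = []
def receiver():
    while True:
        tag, payload = channel.get()    # receive: blocks until a message arrives
        if tag == "tag_done":
            break
        received.append(payload)

t1 = threading.Thread(target=sender)
t2 = threading.Thread(target=receiver)
t1.start(); t2.start(); t1.join(); t2.join()
# received == [0, 1, 2]
```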

data-parallel model

map a function onto a large amount of data

SIMD/SPMD

stream programming model:

  • streams: collections of elements, stream<float> input;

  • kernels: side-effect-free functions that apply to the elements

  • stream_gather(input, indices, tmp_input)

  • stream_scatter(tmp_output, indices, output)

advantage:

  • stream<float> tells the system the access pattern in advance, so elements can be prefetched, which helps locality

  • if there are multiple steps, the intermediate result of each step needn’t be stored to memory, but can go straight to the next step

drawbacks:

  • need a library

  • rigid program structure
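The stream abstractions can be sketched in Python; the function names mirror the stream_gather/stream_scatter calls above, but these implementations (and stream_map) are my own illustration:

```python
def kernel_square(x):
    # kernels are side-effect-free functions applied per element
    return x * x

def stream_map(kernel, stream):
    # apply the kernel to every element of the stream
    return [kernel(v) for v in stream]

def stream_gather(stream, indices):
    # tmp[i] = stream[indices[i]]  (indexed load)
    return [stream[i] for i in indices]

def stream_scatter(tmp, indices, out):
    # out[indices[i]] = tmp[i]  (indexed store)
    for i, v in zip(indices, tmp):
        out[i] = v

data = [1.0, 2.0, 3.0, 4.0]
tmp = stream_gather(data, [3, 1])            # [4.0, 2.0]
squared = stream_map(kernel_square, tmp)     # [16.0, 4.0]
out = [0.0] * 4
stream_scatter(squared, [3, 1], out)         # out == [0.0, 4.0, 0.0, 16.0]
```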

conclusion

  • Shared address space: very little structure

  • Message passing: highly structured communication

  • Data-parallel: very rigid computation structure

  • modern practice: mixed programming models

Lec 4-Parallel programming basics

For a fixed amount of computation, $Speedup=\frac{Time(1\ processor)}{Time(p\ processors)}$

Thought process

  1. Decomposition: identify which work can be done in parallel

    Main idea: expose enough parallel tasks to keep all processors busy

  2. Assignment: decide how to partition the work and assign it to the processors

  3. Orchestration: structure data access, inter-thread communication, and synchronization

  4. Mapping: confirm that the implementation can run on the hardware

Amdahl’s law

Let F be the fraction of the program that must execute serially and N the number of processors; then:

$$ Speedup\leq\frac{1}{F+\frac{1-F}{N}} $$
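The bound is easy to transcribe into a small function (a direct translation of the formula above, with F the serial fraction and N the processor count):

```python
def amdahl_speedup(F, N):
    # upper bound on speedup: 1 / (F + (1 - F) / N)
    return 1.0 / (F + (1.0 - F) / N)

# With no serial work the bound is simply N;
# with 10% serial work, even 100 processors give less than 10x.
```

For example, amdahl_speedup(0.0, 4) is exactly 4.0, while amdahl_speedup(0.1, 100) is about 9.17, approaching the 1/F = 10x ceiling as N grows.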