Lec 3-parallel programming models

  • SIMD: single instruction, multiple data; multiple ALUs run simultaneously, controlled by a single control unit (CU) through instruction broadcast

  • SPMD: single program, multiple data; many instances of the same program run concurrently, and at any given moment different instances may be executing different instructions

ISPC

ISPC implements the gang abstraction using SIMD instructions

A call to an ISPC function spawns a “gang” of ISPC “program instances”. All instances run the ISPC code concurrently. Upon return, all instances have completed.

programCount: number of concurrent instances in the gang (value is uniform among them), set by runtime system

programIndex: id of the current instance in the gang (value varies)

uniform: type modifier marking a value that is identical across all instances in the gang; use it for optimization (not correctness)

```ispc
for (uniform int i = 0; i < N; i += programCount) {
    int idx = i + programIndex;   // each instance handles a different element
}
```

programIndex = 0, 1, ..., programCount-1

the program does NOT run on different threads, but on a single thread using SIMD vector instructions; upon return, all instances have completed. This means that one gang can only make use of one core.

so programCount is actually (a small multiple of) the SIMD width

work assignment:

  • blocked 01/23/45/67 = (0246+1357)

    drawback: the workload may not be evenly distributed, so different instances may not finish at the same time; also, since a gang runs on a single thread, each SIMD load must gather strided elements (0,2,4,6, then 1,3,5,7), so there is no contiguous-access optimization

  • interleave 04/15/26/37 = (0123+4567)

    advantage: the data loaded into each round's vector is contiguous, which greatly speeds up the loads

interleaving is better
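The two assignment schemes can be sketched in plain Python (the gang size of 4 and N = 8 are made-up illustration values; function names are mine):

```python
def blocked(N, gang):
    # instance p gets a contiguous chunk of N // gang elements
    per = N // gang
    return {p: list(range(p * per, (p + 1) * per)) for p in range(gang)}

def interleaved(N, gang):
    # instance p gets elements p, p + gang, p + 2*gang, ...
    return {p: list(range(p, N, gang)) for p in range(gang)}

# With N = 8 and a gang of 4:
# blocked(8, 4)     -> {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
# interleaved(8, 4) -> {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
```

In the interleaved scheme, iteration k touches elements [k*gang, k*gang + gang), which is exactly one contiguous vector load; in the blocked scheme each vector iteration must gather strided elements.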

Let the system assign iterations automatically: foreach

```ispc
float partial = 0.0;
foreach (i = 0 ... N) partial += x[i];
uniform float sum = reduce_add(partial);   // uniform needs an explicit type
```

the reduce_add() function adds together the value of the given expression across all active program instances; its return value is uniform
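A scalar emulation of the foreach + reduce_add pattern may help; this Python sketch is mine (the exact iteration order ISPC uses is up to the compiler, here I use an interleaved split):

```python
def gang_sum(x, gang=4):
    # each program instance keeps its own per-instance partial sum (varying value)
    partial = [0.0] * gang
    for i, v in enumerate(x):
        partial[i % gang] += v   # "foreach" distributes iterations over the gang
    return sum(partial)          # "reduce_add": cross-instance reduction -> uniform
```

For example, gang_sum(range(10)) returns 45.0, the same result a serial sum would give.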

Three models of communication

the abstractions aren’t tied rigidly to specific hardware.

shared address space model

threads communicate by writing/reading shared variables, while each thread can have private variables as well

How do we maintain performance as more CPUs and memory are added?

bus: only one thing can be on it at a time, so it does not scale

How does a thread know where the data it wants lives? How does it know the data it reads is up to date? Synchronize through mutual-exclusion locks (mutexes).
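A minimal shared-address-space sketch in Python: two threads increment a shared counter, and a mutex makes each read-modify-write atomic (the variable names are illustrative):

```python
import threading

counter = 0                  # shared variable, visible to all threads
lock = threading.Lock()      # mutex protecting the shared variable

def worker(n):
    global counter
    for _ in range(n):
        with lock:           # only one thread in this critical section at a time
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
# counter is now exactly 20000; without the lock, increments could be lost
```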

  • uniform memory access

    Uniform memory access time: the cost of accessing an uncached memory address is the same for all processors (same distance). Processors are connected via an interconnect (bus); the bus connects to memory and I/O.

  • non-uniform memory access(NUMA)

    Each processor accesses its directly attached memory region at high speed, but can also reach the memory regions of other processors over the interconnect. Since a processor mostly works on its own memory, access is fast and puts little pressure on the interconnect, preserving interconnect bandwidth.

NUMA scales relatively well

message passing model

Threads operate within their own private address spaces.

Threads communicate by sending/receiving messages

  • send: specifies recipient, buffer to be transmitted, and optional message identifier (“tag”)

  • receive: sender, specifies buffer to store data, and optional message identifier

  • Sending messages is the only way to exchange data between threads 1 and 2

  • Only a network is needed to implement this approach. It’s easy to build large machines this way

typical example: MPI (message passing interface)
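Message passing can be sketched with two Python threads and an explicit channel; here a queue.Queue stands in for the network, and the tuple's first field plays the role of the message "tag" (real MPI programs would use MPI_Send/MPI_Recv instead):

```python
import threading
import queue

channel = queue.Queue()   # the ONLY way the two threads exchange data

def sender():
    for i in range(3):
        channel.put(("tag_data", i))    # send: tag + payload buffer
    channel.put(("tag_done", None))     # sentinel telling the receiver to stop

received = []
def receiver():
    while True:
        tag, payload = channel.get()    # receive: blocks until a message arrives
        if tag == "tag_done":
            break
        received.append(payload)

t1 = threading.Thread(target=sender)
t2 = threading.Thread(target=receiver)
t1.start(); t2.start(); t1.join(); t2.join()
# received == [0, 1, 2]
```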

data-parallel model

map a function onto a large amount of data

SIMD/SPMD

stream programming model:

  • streams: collections of elements, stream<float> input;

  • kernels: side-effect-free functions that apply to the elements

  • stream_gather(input, indices, tmp_input)

  • stream_scatter(tmp_output, indices, output)

advantage:

  • stream<float> tells the system the access pattern in advance, so elements can be prefetched, which helps locality

  • if there are multiple steps, the intermediate result of each step needn’t be stored to memory, but can go straight to the next step

drawbacks:

  • need a library

  • rigid program structure
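The stream abstractions can be sketched in Python; the function names mirror the stream_gather/stream_scatter calls above, but these implementations (and stream_map) are my own illustration:

```python
def kernel_square(x):
    # kernels are side-effect-free functions applied per element
    return x * x

def stream_map(kernel, stream):
    # apply the kernel to every element of the stream
    return [kernel(v) for v in stream]

def stream_gather(stream, indices):
    # tmp[i] = stream[indices[i]]  (indexed load)
    return [stream[i] for i in indices]

def stream_scatter(tmp, indices, out):
    # out[indices[i]] = tmp[i]  (indexed store)
    for i, v in zip(indices, tmp):
        out[i] = v

data = [1.0, 2.0, 3.0, 4.0]
tmp = stream_gather(data, [3, 1])            # [4.0, 2.0]
squared = stream_map(kernel_square, tmp)     # [16.0, 4.0]
out = [0.0] * 4
stream_scatter(squared, [3, 1], out)         # out == [0.0, 4.0, 0.0, 16.0]
```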

conclusion

  • Shared address space: very little structure

  • Message passing: highly structured communication

  • Data-parallel: very rigid computation structure

  • modern practice: mixed programming models

Lec 4-Parallel programming basics

For a fixed amount of computation, $Speedup=\frac{Time(1\ processor)}{Time(p\ processors)}$

Thought process

  1. Decomposition: identify which work can be done in parallel

    Main idea: expose enough parallel tasks to keep all processors busy

  2. Assignment: decide how to partition the work and assign it to the processors

  3. Orchestration: structure data access, inter-thread communication, and synchronization

  4. Mapping: confirm that the implementation can run on the hardware

Amdahl’s law

Let F be the fraction of the program that must execute serially and N the number of processors; then:

$$ Speedup\leq\frac{1}{F+\frac{1-F}{N}} $$
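The bound is easy to transcribe into a small function (a direct translation of the formula above, with F the serial fraction and N the processor count):

```python
def amdahl_speedup(F, N):
    # upper bound on speedup: 1 / (F + (1 - F) / N)
    return 1.0 / (F + (1.0 - F) / N)

# With no serial work the bound is simply N;
# with 10% serial work, even 100 processors give less than 10x.
```

For example, amdahl_speedup(0.0, 4) is exactly 4.0, while amdahl_speedup(0.1, 100) is about 9.17, approaching the 1/F = 10x ceiling as N grows.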