Parallel programming (preview)
Lec 3-parallel programming models
SIMD: single instruction, multiple data. Multiple ALUs run simultaneously, controlled by a single control unit that broadcasts one instruction to all of them.
SPMD: single program, multiple data. Multiple instances of the same program run on different data; at any given moment, different instances may be executing different instructions.
ISPC
ISPC implements the gang abstraction using SIMD instructions
Call to ISPC function spawns “gang” of ISPC “program instances”. All instances run ISPC code concurrently. Upon return, all instances have completed.
programCount: number of concurrent instances in the gang (value is uniform among them), set by the runtime system
programIndex: id of the current instance in the gang (value varies per instance)
uniform: type modifier marking a value shared by all instances in the gang; use it for optimization (not correctness)
for (uniform int i = 0; i < N; i += programCount) { ... }
programIndex = 0, 1, ..., programCount-1
The gang does NOT run on multiple threads; it runs on a single thread using SIMD vector instructions, and upon return all instances have completed. This means that one gang can only make use of one core.
so programCount is actually (a small multiple of) the SIMD width
work assignment:
blocked 01/23/45/67 = (0246 + 1357): each instance owns a contiguous chunk of elements, so each vector iteration loads the strided set 0,2,4,6 and then 1,3,5,7
drawback: the workload may not be evenly distributed, so different instances may not finish at the same time; and since ISPC runs the whole gang on one thread, the strided loads get no contiguous-access optimization
interleaved 04/15/26/37 = (0123 + 4567): each instance's elements are strided, so each vector iteration loads the contiguous set 0,1,2,3 and then 4,5,6,7
advantage: the data loaded in each vector iteration is contiguous, which greatly speeds up the loads
interleaving is better
Let the system assign elements automatically: foreach
float partial = 0.0;
The reduce_add() function adds together the values of the given expression across all active program instances; its return value is uniform.
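The full pattern (a private partial per instance, then a gang-wide reduction) can be emulated in scalar C; the gang size here is an assumed stand-in for programCount:

```c
#include <assert.h>

#define PROGRAM_COUNT 4   /* assumed gang size */

/* Scalar emulation: each instance accumulates a private `partial`
   over its interleaved elements; reduce_add() is modeled by summing
   the per-instance partials into one uniform value. */
float gang_sum(const float *x, int n) {
    float partial[PROGRAM_COUNT] = {0};
    for (int i = 0; i < n; i += PROGRAM_COUNT)
        for (int lane = 0; lane < PROGRAM_COUNT; lane++)   /* lane = programIndex */
            if (i + lane < n)
                partial[lane] += x[i + lane];
    float total = 0;                                       /* reduce_add(partial) */
    for (int lane = 0; lane < PROGRAM_COUNT; lane++)
        total += partial[lane];
    return total;
}
```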
Three models of communication
These abstractions are not rigidly tied to specific hardware.
shared address space model
threads communicate by reading/writing shared variables, and they can have private variables as well
How do we keep things fast as more CPUs and memory are added?
bus: only one transaction can be on it at a time, so it does not scale
How does a thread know where the data it wants lives? How does it know whether the data it reads is up to date? Synchronization is handled with mutex locks.
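A minimal pthreads sketch of this model: two threads communicate through one shared counter, each also keeping private state, with a mutex making the shared update safe (the counter and iteration counts are illustrative):

```c
#include <assert.h>
#include <pthread.h>

/* Two threads communicate through the shared variable `counter`;
   each also has private (stack) state. The mutex makes the
   read-modify-write of the shared variable atomic. */
static long counter = 0;                      /* shared variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    long iters = (long)arg;                   /* private variable */
    for (long i = 0; i < iters; i++) {
        pthread_mutex_lock(&lock);
        counter++;                            /* synchronized shared write */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

long run(long iters_per_thread) {
    counter = 0;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)iters_per_thread);
    pthread_create(&t2, NULL, worker, (void *)iters_per_thread);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```

Without the lock, the two read-modify-write sequences could interleave and lose updates, which is exactly why the notes say synchronization is needed.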
uniform memory access
Uniform memory access time: the cost of accessing an uncached memory address is the same for all processors (same distance). Processors are connected via an interconnect (bus), and the bus connects to memory and I/O.
non-uniform memory access(NUMA)
Each processor can access its directly attached memory at high speed, but it can also reach other processors' memory over the interconnect. Since each processor mostly works on its own memory, accesses are fast and put little pressure on the interconnect, preserving its bandwidth.
Scales well
message passing model
Threads operate within their own private address spaces.
Threads communicate by sending/receiving messages
send: specifies recipient, buffer to be transmitted, and optional message identifier (“tag”)
receive: specifies sender, buffer in which to store the data, and optional message identifier
Sending messages is the only way to exchange data between threads 1 and 2
Only a network is needed to implement this approach, so it is easy to build large machines this way.
typical example: MPI (Message Passing Interface)
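The send/receive pattern can be sketched with two OS processes whose address spaces really are private, using POSIX pipes in place of the network (the doubling protocol is just an illustration, not MPI):

```c
#include <assert.h>
#include <unistd.h>
#include <sys/wait.h>

/* The parent sends a message; the child receives it, doubles it, and
   sends the result back. All data exchange goes through explicit
   send (write) / receive (read); nothing is shared. */
int round_trip(int msg) {
    int to_child[2], to_parent[2];
    pipe(to_child);
    pipe(to_parent);
    if (fork() == 0) {                         /* child: private address space */
        int m;
        read(to_child[0], &m, sizeof m);       /* receive(sender, buffer) */
        m *= 2;
        write(to_parent[1], &m, sizeof m);     /* send(recipient, buffer) */
        _exit(0);
    }
    write(to_child[1], &msg, sizeof msg);      /* send to the child */
    read(to_parent[0], &msg, sizeof msg);      /* receive the reply */
    wait(NULL);
    return msg;
}
```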
data-parallel model
map a function onto a large amount of data
SIMD/SPMD
stream programming model:
streams: collections of elements,
stream<float> input;
kernels: side-effect-free functions that apply to the elements
stream_gather(input, indices, tmp_input)
stream_scatter(tmp_output, indices, output)
advantage:
stream<float> means that elements can be prefetched, which helps locality. If there are multiple steps, the output of each step need not be stored to memory but can flow directly into the next step.
drawbacks:
requires a library
rigid program structure
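A scalar sketch of the kernel and gather ideas, with stream_gather modeled as an indexed copy and the kernel as a side-effect-free function mapped over each element (function names here are illustrative, not a real stream API):

```c
#include <assert.h>

/* A side-effect-free "kernel" mapped over every element of a stream;
   since elements are independent, the loop could run data-parallel. */
static float kernel_scale(float x) { return 2.0f * x; }

static void map_kernel(float (*kernel)(float),
                       const float *in, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = kernel(in[i]);
}

/* stream_gather(input, indices, tmp): tmp[i] = input[indices[i]] */
static void gather(const float *in, const int *indices, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = in[indices[i]];
}
```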
conclusion
Shared address space: very little structure
Message passing: highly structured communication
Data-parallel: very rigid computation structure
modern practice: mixed programming models
Lec 4-Parallel programming basics
For a fixed amount of computation, $Speedup=\frac{Time(1\ processor)}{Time(p\ processors)}$
Thought process
Decomposition: identify which work can be done in parallel
Main idea: carve out enough parallel tasks to keep the processors busy
Assignment: decide how to divide the work and distribute it among the processors
Orchestration: design the data accesses, inter-thread communication, and synchronization
Mapping: make sure this implementation can actually run on the hardware
Amdahl's Law
Let F be the fraction of the program that must be serialized and N the number of processors. Then:
$$ Speedup\leq\frac{1}{F+\frac{1-F}{N}} $$
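The bound is easy to evaluate; a small helper (a hypothetical name, not from the lecture) shows, for example, that with F = 0.5 even a huge number of processors gives a speedup of at most 2x:

```c
#include <assert.h>

/* Amdahl's law: with serial fraction F and N processors,
   speedup is bounded by 1 / (F + (1 - F) / N). */
double amdahl_bound(double F, int N) {
    return 1.0 / (F + (1.0 - F) / N);
}
```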