These are my notes and thoughts as I work through this course.
Lecture notes are at
https://gfxcourses.stanford.edu/cs149/fall23/lecture/multicore/
There are many forms of parallelism.
Superscalar – This is when the processor, without any help from the programmer, figures out that there is instruction-level parallelism.
Note that you also need hardware support: you need to be able to fetch these independent instructions in the same cycle, and you also need independent execution units. In the lecture's figure you can see that the example processor can support up to two instructions in parallel.
The way to think about this is in terms of a single instruction stream: within this stream, the hardware is trying to find individual instructions that it can execute in parallel.
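To make that concrete, here is a tiny sketch (my own example, not from the lecture; `ilp_demo` is a made-up name): the two multiplies have no dependence on each other, so a 2-wide superscalar core can issue them in the same cycle.

```cpp
// A minimal ILP sketch: a and b are independent, so a 2-wide
// superscalar core can compute them in the same cycle. The final
// add depends on both, so it has to wait.
float ilp_demo(float x, float y) {
    float a = x * x;   // independent
    float b = y * y;   // independent of a
    return a + b;      // depends on a and b
}
```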
Multiple Cores
The core idea here is that instead of using chip real estate on a complex architecture that makes a single instruction stream run as fast as possible, you replicate simpler cores instead. The individual cores are not as fancy (they may not do the superscalar stuff, for example), but you can run multiple instruction streams, potentially copies of the same program, in parallel.
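Here is a minimal sketch of that idea (my own example, using C++ and `std::thread`; `sinx_range` is a hypothetical helper name): split the work in half and let a second core run the same loop on the other half.

```cpp
#include <cmath>
#include <thread>
#include <vector>

// Hypothetical helper: apply sin to a sub-range of the array.
void sinx_range(int begin, int end, const float* x, float* result) {
    for (int i = begin; i < end; i++)
        result[i] = std::sin(x[i]);
}

int main() {
    const int N = 1024;
    std::vector<float> x(N, 1.0f), result(N);
    // A second core runs the first half while the main thread runs the second.
    std::thread t(sinx_range, 0, N / 2, x.data(), result.data());
    sinx_range(N / 2, N, x.data(), result.data());
    t.join();
    return 0;
}
```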
SIMD
Single Instruction Multiple Data
Another idea is to amortize the cost of fetching and decoding an instruction across multiple ALUs: the same instruction from our original program executes on many pieces of data at once.
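A minimal sketch of what that looks like in practice (my own example, assuming an x86 CPU with AVX): one instruction performs eight float additions at once, so the fetch/decode cost is paid once for eight ALU lanes.

```cpp
#include <immintrin.h>

// One _mm256_add_ps instruction adds 8 floats at once across 8 ALU lanes.
void add8(const float* a, const float* b, float* out) {
    __m256 va = _mm256_loadu_ps(a);                // load 8 floats
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_add_ps(va, vb));  // 8 adds, 1 instruction
}
```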
Multi-threading
Caches
Cache is a hardware implementation. It's memory on-chip, so it occupies chip real estate. It brings storage very close to the execution units, so the time to load is very small. Multiple levels of cache exist: L1, L2, L3. The way to think about it: if you want a book on a certain subject from the library, you borrow that book, along with whatever other books you might need for your research later, and put them in your cupboard. This locality is what improves performance.
==Forms of data locality==
Spatial Locality – When you load an address into the cache, you load nearby addresses too.
Temporal Locality – When you load an address into the cache, you may reuse it again sometime in the future.
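A quick sketch of spatial locality (my own example): the row-major loop walks consecutive addresses, so every cache line it loads serves several accesses, while the column-major loop strides N floats per access and misses far more often.

```cpp
const int N = 1024;
static float grid[N][N];

float sum_row_major() {        // good spatial locality: consecutive addresses
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += grid[i][j];
    return s;
}

float sum_col_major() {        // poor spatial locality: strides N floats
    float s = 0.0f;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += grid[i][j];
    return s;
}
```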
Now look at the ==sinx== program from the lecture (a rough sketch of it is below) and think of the superscalar architecture: there isn't much instruction-level parallelism to exploit. Think of ==sinx== as a single instruction stream; you need to execute it once for each ==x[i]==. This is in contrast to, say, a multi-core or SIMD processor, where the same instruction stream can be replicated across multiple cores and thereby execute in parallel for each value.
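For reference, here is a rough sketch of that ==sinx== program (a Taylor-series approximation of sin), paraphrased from memory rather than copied from the slides:

```cpp
// Taylor series: sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
// Keep terms small (around 5 or less) so the int denominator doesn't overflow.
void sinx(int N, int terms, const float* x, float* result) {
    for (int i = 0; i < N; i++) {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        int denom = 6;   // 3!
        int sign = -1;
        for (int j = 1; j <= terms; j++) {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2 * j + 2) * (2 * j + 3);
            sign *= -1;
        }
        result[i] = value;
    }
}
```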
SIMD Processing
In the lecture's example there are 16 cores, so we can run 16 independent instruction streams. Each stream can further run SIMD instructions.
So with 8-wide SIMD per core we could be processing 16 × 8 = 128 values in parallel. This is the best case.
There is a potential problem with SIMD, especially with conditional execution. For instance, if we have an if/else branch and some elements across the ALU lanes take the 'if' path while others take the 'else' path, then since the lanes share a single SIMD instruction stream, only some ALUs are active in a given clock cycle while the 'if' path runs, and the remaining ('else') lanes run afterwards.
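A tiny sketch of where divergence shows up (my own example): if the eight lanes of a SIMD vector hold a mix of positive and negative values, the hardware runs the 'if' path with the negative lanes masked off, then the 'else' path with the positive lanes masked off, so no cycle uses all eight ALUs.

```cpp
void scale_or_negate(int N, const float* x, float* result) {
    for (int i = 0; i < N; i++) {
        if (x[i] > 0.0f)
            result[i] = 2.0f * x[i];   // only the "if" lanes are active here
        else
            result[i] = -x[i];         // then only the "else" lanes run
    }
}
```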
This is a very important concept called instruction stream coherence.
Coherent execution is very important for SIMD instructions to run well.