Computer Organization and Architecture: Instruction Level Parallelism and Superscalar Processors
Outline
-
Overview of Superscalar
-
Design Issues of Superscalar
-
Superscalar in Pentium
-
Superscalar in ARM CORTEX-A8
Overview of Superscalar

Ideal pipeline
-
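Execution in an ideal instruction pipeline
-
Instruction execution is divided into 6 stages that share no resources
-
One instruction completes in every time unit
-
With enough instructions, throughput approaches 6 times that of unpipelined execution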
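The sixfold figure follows from a simple cycle count. The following is a minimal sketch, assuming every stage takes exactly one time unit and the pipeline never stalls; the function names are illustrative and not part of the slides.

```python
def unpipelined_time(n, k):
    """Time units to run n instructions one after another, k stages each."""
    return n * k


def pipelined_time(n, k):
    """Time units on an ideal k-stage pipeline: k for the first instruction,
    then one more time unit for each further instruction."""
    return k + (n - 1)


def speedup(n, k):
    """Ratio of unpipelined time to ideal pipelined time."""
    return unpipelined_time(n, k) / pipelined_time(n, k)


for n in (1, 10, 1000):
    print(n, round(speedup(n, 6), 2))
# 1 1.0
# 10 4.0
# 1000 5.97
```

The speedup only approaches 6 as the number of instructions grows, because the first instruction still needs all 6 stages before any result appears.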
Actual pipeline
-
Not all instructions require the same steps
- Some pipeline stages are idle
-
Different pipeline stages take different amounts of time
- Time is wasted in the shorter stages
-
Instructions are not independent of each other
- Dependencies stall the pipeline
Problems with pipelining
-
The pipeline stalls because of dependencies between instructions; such a stall is called a pipeline hazard
-
There are three types of dependencies
-
Data dependence
-
Control dependence
-
Resource dependence
-
Question
-
Is instruction pipelining truly parallel?
-
Yes: There are indeed multiple instructions in the pipeline being processed at the same time
-
No: multiple instructions do not enter the pipeline at the same time
-
-
How to further improve the execution efficiency of instructions?
-
Optimize the pipeline: superpipelining
-
True instruction level parallelism: superscalar pipelining
-
Superpipeline
-
In an ordinary pipeline, each clock cycle can complete processing of one pipeline stage
-
Many pipeline stages need less than half a clock cycle
-
Superpipeline
-
A doubled internal clock rate is used to schedule instructions
-
At double the internal clock rate, two tasks complete per external clock cycle
-
-
This doubles instruction throughput

-
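A four-stage pipeline divides each instruction into 4 stages
-
In an ordinary pipeline, each stage takes 1 clock cycle. Throughput is at most 4 times that of unpipelined execution, with one instruction result produced per clock cycle
-
In a superpipeline, the internal clock is doubled, so one instruction stage completes every 0.5 external clock cycles. Throughput is at most 8 times that of unpipelined execution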
Limits of Superpipelining
-
The effect is similar to raising the clock frequency
-
It is still not true instruction level parallelism
-
Overall performance is limited by the clock cycle and by how long each instruction stage takes to execute
- Long execution stages limit overall performance
-
Another technique: superscalar
Vector and Scalar
-
Scalar
-
Also called a “non-vector” quantity: a physical quantity that has only a magnitude and no direction. Its value may be positive or negative
-
A single number used to represent a single attribute of a thing
-
For example: temperature, length
-
-
Vector
-
Originally, a quantity with both magnitude and direction
-
An ordered group of numbers used to describe the quantitative characteristics of a thing
-
For example: the position of a point in the plane coordinate system, $(x, y)$
-
Scalar instruction
-
Instructions that have no vector processing capability and operate on a single quantity, that is, a scalar, are called scalar instructions
-
Most instructions are scalar
Vector instruction
-
The basic operating object is a vector, that is, a group of numbers arranged in order
-
The instruction gives the address of the vector operand and specifies, directly or implicitly, vector parameters such as the increment and the vector length
-
A vector instruction directs the processor to apply the same operation to every element of the vector, which can improve processing speed
-
Some mainframes provide a full-featured vector instruction set

Superscalar
-
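A superscalar processor uses 2 independent pipelines
-
Each pipeline can itself overlap the execution of instructions
-
2 instructions can be executed in parallel in every stage
-
In steady state, throughput can reach 8 times that of unpipelined execution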
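The ordinary pipeline, the superpipeline and the superscalar pipeline can be compared with a cycle count. The following is a minimal sketch, assuming a 4-stage pipeline with no stalls, a superpipeline that starts a new instruction every 1/m of a base cycle, and a superscalar processor that starts m instructions per base cycle; the function names and the choice of 6 instructions are illustrative.

```python
from math import ceil


def base_cycles(n, k):
    """Base cycles for n instructions on an ordinary k-stage pipeline."""
    return k + n - 1


def superpipelined_cycles(n, k, m):
    """Degree-m superpipeline: each instruction still takes k base cycles,
    but a new one starts every 1/m of a base cycle."""
    return k + (n - 1) / m


def superscalar_cycles(n, k, m):
    """Degree-m superscalar: m instructions start together in each base cycle."""
    return k + ceil(n / m) - 1


n, k, m = 6, 4, 2
print(base_cycles(n, k))               # 9
print(superpipelined_cycles(n, k, m))  # 6.5
print(superscalar_cycles(n, k, m))     # 6
```

For long instruction streams both degree-2 designs approach 2 instructions per base cycle, which for 4 stages corresponds to the eightfold figure relative to unpipelined execution.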

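General Superscalar organization
-
The organization contains 2 integer units, 2 floating-point units, and one memory unit
-
The integer units allow 2 instructions to execute in parallel, and the floating-point units likewise allow 2 floating-point instructions to run at the same time. Meanwhile, 1 memory operation can also execute in parallel
-
In this organization, up to 5 instructions can execute in parallel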
Key problems of superscalar
-
Superscalar implementations raise a number of complex design issues related to the instruction pipeline
-
First, the dependencies within a single pipeline still exist
-
Multiple pipelines bring more complex dependency problems
-
-
The compiler needs more sophisticated optimization techniques to achieve greater instruction level parallelism
Application of superscalar
-
Superscalar technology was proposed and developed alongside RISC technology
-
RISC processors also tend to use superscalar technology
-
Although RISC machines lend themselves readily to superscalar techniques, the superscalar approach can be used on either a RISC or CISC architecture
-
The superscalar approach has now become the standard method for implementing high-performance microprocessors
Factors limiting parallelism
-
Instruction level parallelism
- The degree to which program instructions can be executed in parallel
-
Compiler capabilities
- How far the compiler can maximize the instruction level parallelism of a program
-
Hardware techniques
- How well the hardware supports parallel execution of instructions
Limitations
-
The most important factor limiting instruction level parallelism is the dependencies between instructions in the program
-
The dependencies between instructions include
-
True data dependency
-
Output dependency
-
Anti-dependency
-
Procedural dependency
-
Resource conflicts
-
True Data Dependency

-
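i0 and i1 can be fetched and decoded at the same time
-
However, because i1 uses the result of i0 as an operand, i1 cannot execute until i0 has finished executing
-
The second instruction is delayed by one clock cycle
-
The write must come before the read, so this is also called a read-after-write (RAW) dependency
-
This dependency is tied strictly to the order in which the instructions execute; it is a true dependency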
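The one-cycle delay can be reproduced with a small timing model. The following is a minimal sketch, assuming both instructions are fetched in cycle 1 and decoded in cycle 2, each execute stage takes one cycle, and functional units are unlimited; the register names and the `execute_cycles` helper are hypothetical.

```python
def execute_cycles(program, first_execute_cycle=3):
    """Return the cycle in which each instruction executes.

    program: list of (dest, sources) in program order. An instruction
    cannot execute until every register it reads has been produced.
    """
    ready = {}    # register -> cycle in which its value was produced
    cycles = []
    for dest, sources in program:
        produced = max((ready.get(r, 0) for r in sources), default=0)
        cycle = max(first_execute_cycle, produced + 1)
        cycles.append(cycle)
        ready[dest] = cycle
    return cycles


# i1 reads R1, which i0 writes: a true (read-after-write) dependency.
dependent = [("R1", ("R2", "R3")), ("R4", ("R1", "R5"))]
# No shared registers: both instructions can execute together.
independent = [("R1", ("R2", "R3")), ("R4", ("R5", "R6"))]

print(execute_cycles(dependent))    # [3, 4]
print(execute_cycles(independent))  # [3, 3]
```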
Procedural dependency

-
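Instructions before and after a branch cannot execute in parallel
-
If instructions are not of fixed length, an instruction must be decoded to determine how many bytes to fetch
-
With variable-length instructions, the preceding instruction must be at least partly decoded before the following instruction can be fetched; otherwise the processor does not know where in memory the next instruction starts. This prevents simultaneous fetching
-
This is one reason superscalar techniques suit RISC architectures better: RISC instructions are fixed length, so this dependency does not arise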
Resource conflict
-
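A resource conflict is also called a resource dependency
-
Instructions i0 and i1 both need the same functional unit during execution, so they cannot execute in parallel and must be processed serially. One clock cycle is wasted here
-
A resource conflict behaves much like a data dependency, but it can be resolved by duplicating resources, for example by adding another dryer, as discussed earlier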
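The effect of duplicating a resource can be shown with a count of execute cycles. The following is a minimal sketch, assuming the instructions are mutually independent, each occupies its functional unit for one cycle, and units are not pipelined; the unit name `alu` and the helper name are hypothetical.

```python
from collections import Counter
from math import ceil


def execute_cycles_needed(units_required, units_available):
    """Cycles needed to execute a non-empty group of independent instructions.

    units_required: the functional unit each instruction needs
    units_available: dict mapping unit name -> number of copies
    """
    demand = Counter(units_required)
    return max(ceil(count / units_available[unit])
               for unit, count in demand.items())


# i0 and i1 both need the ALU.
print(execute_cycles_needed(["alu", "alu"], {"alu": 1}))  # 2, serialized
print(execute_cycles_needed(["alu", "alu"], {"alu": 2}))  # 1, conflict removed
```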
Design Issues of Superscalar
Parallelism
-
Factors limiting parallelism
-
Instruction level parallelism
-
Compiler capabilities
-
Hardware techniques
-
-
Instruction level parallelism
-
Exists when instructions can be executed in parallel
-
Instructions in a sequence are independent
-
Execution can be overlapped
-
Governed by data and procedural dependency
-
Machine Parallelism
-
Ability to take advantage of instruction level parallelism
-
Governed by number of parallel pipelines
-
The ability to find independent instructions and obtain instruction level parallelism
-
-
Instructions that can be executed in parallel
- Instructions that cannot be executed in parallel
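One way to quantify the difference between the two cases is the length of the longest chain of true dependencies. The following is a minimal sketch; the two three-instruction sequences are assumed examples written for illustration, not recovered from the slides, and `critical_path` is a hypothetical helper.

```python
def critical_path(program):
    """Length of the longest chain of true (read-after-write) dependencies.

    program: list of (dest, sources) in program order; dest is None for
    an instruction that writes no register, such as a store.
    """
    depth = {}      # register -> chain length ending at its latest writer
    longest = 0
    for dest, sources in program:
        d = 1 + max((depth.get(r, 0) for r in sources), default=0)
        if dest is not None:
            depth[dest] = d
        longest = max(longest, d)
    return longest


independent = [
    ("R1", ("R2",)),        # Load  R1 <- R2
    ("R3", ("R3",)),        # Add   R3 <- R3, 1
    ("R4", ("R4", "R2")),   # Add   R4 <- R4, R2
]
dependent = [
    ("R3", ("R3",)),        # Add   R3 <- R3, 1
    ("R4", ("R3", "R2")),   # Add   R4 <- R3, R2
    (None, ("R4", "R0")),   # Store [R4] <- R0
]

print(critical_path(independent))  # 1: all three can execute together
print(critical_path(dependent))    # 3: each must wait for the previous one
```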
Instruction issue
-
Instruction issue: the process of initiating instruction execution in the processor's functional units
- Instruction issue occurs when an instruction moves from the decode stage of the pipeline to the first execute stage of the pipeline
-
To improve parallelism, instructions may need to be issued in an order other than the original program order
-
In essence, instruction issue is a strategy for finding instructions that can enter the pipeline and be executed
Orderings in instruction issue
-
Three orderings are involved in instruction issue
-
Order in which instructions are fetched
-
Order in which instructions are executed
-
Order in which instructions change registers and memory
-
-
The one constraint on the processor is that the result must be correct
-
Instruction issue policy refers to the protocol used to issue instructions for execution
Instruction issue policy
-
The original instruction stream itself has dependencies
-
To improve the parallelism of execution, the processor may change the order in which instructions are executed
-
The more sophisticated the processor, the less it is bound by a strict relationship between these orderings
-
There are three issue policies
-
In-order issue with in-order completion
-
In-order issue with out-of-order completion
-
Out-of-order issue with out-of-order completion
-
In-order issue/in-order completion
-
Issue instructions in the order they occur
-
Write the results in the same order to complete the execution of instructions
-
Very inefficient; even scalar pipelines do not use this policy
-
In a superscalar pipeline
-
May fetch more than one instruction
-
To ensure orderly completion, when the functional units conflict, or the execution of the functional units requires multiple cycles, instruction issue must wait
-

-
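The superscalar processor has 2 independent pipelines and can fetch 2 instructions at a time
-
There are 3 execution units and 2 write-back units
-
I1 takes 2 cycles to execute; I3 and I4 need the same functional unit at the same time and therefore conflict; I5 depends on the result of I4; I5 and I6 need the same functional unit at the same time and therefore conflict
-
Instructions are fetched in pairs and passed to the decode unit. I1 takes 2 clock cycles to execute, so I3 and I4 cannot start executing until the fourth cycle. Because I3 and I4 have a resource conflict, they must execute one after the other. I5 depends on the result of I4, and I5 and I6 have a resource conflict, so I5 and I6 must execute serially
-
In total, 8 clock cycles are needed to complete all the instructions
-
Because instructions take different amounts of time to execute, when instructions are issued to the execution units together, the next issue must wait until all of them have finished
-
If there is a data dependency between instructions, issue must stop until the required conditions are met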
In-order issue/out-of-order completion

-
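I1 and I2 are issued to the execution units together. Because I2 needs only 1 cycle, I2 can complete first. I3 can execute at the same time as I1 and then enter the write stage
-
I4 has a resource conflict with I3, so it must wait until I3 completes before it can execute. I5 depends on the result of I4, so it must also wait for I4; likewise, I6 must wait for I5
-
Overall, 7 cycles are needed to complete the instructions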
Output (write-write) dependency
- With out-of-order completion, instructions finish in an order different from program order, which can create output dependency problems
- Analyze
- I2 depends on the result of I1 - data dependency
- If I3 completes before I1, I1's later write overwrites I3's result and leaves the wrong value for later instructions - output (write-write) dependency
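The write-write problem can be seen by applying the same two writes in different completion orders. The following is a minimal sketch, assuming I1 and I3 both write register R3 and that I1 comes first in program order; the values 10 and 30 and the helper name are hypothetical.

```python
def final_registers(writes, completion_order):
    """Apply register writes in the order in which the instructions complete.

    writes: dict mapping instruction name -> (register, value)
    completion_order: instruction names in the order they finish
    """
    regs = {}
    for name in completion_order:
        reg, value = writes[name]
        regs[reg] = value
    return regs


writes = {"I1": ("R3", 10), "I3": ("R3", 30)}

print(final_registers(writes, ["I1", "I3"]))  # {'R3': 30}, as the program intends
print(final_registers(writes, ["I3", "I1"]))  # {'R3': 10}, stale value from I1
```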
How?
-
Adopt a dynamic scheduling strategy
-
Idea: move the dependent instructions out of the way of independent ones, so that the independent ones can execute
- Rest areas for dependent instructions: reservation stations
-
Monitor the source “values” of each instruction in the rest area
-
When all source “values” of an instruction are available, “fire” (i.e. dispatch) the instruction
- Instructions are dispatched in data-flow order, not control-flow order
-
Benefit
-
Latency tolerance: Allows independent instructions to execute and complete in the presence of a long latency operation
-
Instructions with dependencies are scheduled sensibly
-
-
Problem with in-order issue
-
During decoding, if a dependency or conflict is found, decoding must stop
-
Subsequent instructions therefore cannot be decoded
-
The processor then cannot check whether any later instruction is independent and could be executed in the pipeline
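The reservation-station idea described above, where a dependent instruction waits aside and fires once all of its source values are available, can be sketched as a data-flow schedule. This is a simplified model rather than a full reservation-station design: it assumes unlimited functional units, no anti- or output dependencies, and that registers never written by the program are ready from the start. The instruction names, registers and latencies are hypothetical.

```python
def dataflow_schedule(program):
    """Fire each instruction in the first cycle after all its sources are ready.

    program: list of (name, dest, sources, latency) in program order.
    Returns a dict mapping name -> (start_cycle, finish_cycle).
    """
    ready_at = {}     # register -> cycle in which its value becomes available
    schedule = {}
    for name, dest, sources, latency in program:
        start = 1 + max((ready_at.get(r, 0) for r in sources), default=0)
        finish = start + latency - 1
        ready_at[dest] = finish
        schedule[name] = (start, finish)
    return schedule


program = [
    ("I1", "R1", ("R9",), 3),        # long-latency operation writing R1
    ("I2", "R2", ("R1",), 1),        # needs R1, so it waits for I1
    ("I3", "R3", ("R4", "R5"), 1),   # independent of I1 and I2
]
for name, span in dataflow_schedule(program).items():
    print(name, span)
# I1 (1, 3)
# I2 (4, 4)
# I3 (1, 1)
```

I3 completes in cycle 1 even though it follows the long-latency I1 in program order, which is the latency tolerance listed among the benefits.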
Out-of-order issue/out-of-order completion
-
Solution: decouple decode from execution
-
Decode
-
The decode stage can fetch and decode continuously
-
Each decoded instruction is placed in a buffer
-
As long as the buffer is not full, fetching and decoding can continue
-
-
Execution
-
When a functional unit becomes available, an executable instruction is issued to it
-
Because the instructions have already been decoded, the processor can tell in advance whether each one can be executed
-

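- Cycle 1: I1 and I2 are decoded and then enter the issue buffer. The execution units are idle, so I1 and I2 are issued for execution
- Cycle 2: I3 and I4 are decoded and then enter the issue buffer. Because I3 and I4 share an execution unit, only I3 is issued for execution
- Cycle 3: I5 and I6 are decoded and then enter the issue buffer. I3 has now finished executing, so I4 can be issued. Because I5 has a data dependency on I4, I5 cannot be issued, so I6 is issued for execution instead
- Cycle 4: no instructions remain to be decoded
- Cycle 5: no instructions remain to be decoded. Only I5 is left in the issue buffer. I6 has finished executing, so I5 can be issued. I5 completes execution in the fifth clock cycle and completes its write in the sixth
- The whole process takes 6 clock cycles, 1 cycle fewer than before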
Anti-dependency (Write-after-read)
-
Out-of-order issue/out-of-order completion is also subject to constraints
-
Anti-dependencies can occur
- Analyze
- I3 cannot complete before I2 starts, because I2 needs the value in R3 and I3 changes R3
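The three register dependencies can be told apart by comparing the destination and source registers of two instructions. The following is a minimal sketch; the four-instruction sequence is an assumed example chosen to be consistent with the analysis (I2 reads R3, I3 writes R3), not the slide's own listing, and `classify` is a hypothetical helper.

```python
def classify(earlier, later):
    """List the dependencies of `later` on `earlier`, each given as (dest, sources).

    This is a pairwise check only: it ignores any write to the same
    register by an instruction lying between the two.
    """
    e_dest, e_src = earlier
    l_dest, l_src = later
    kinds = []
    if e_dest in l_src:
        kinds.append("true (RAW)")     # later reads what earlier writes
    if l_dest in e_src:
        kinds.append("anti (WAR)")     # later overwrites what earlier reads
    if l_dest == e_dest:
        kinds.append("output (WAW)")   # both write the same register
    return kinds


I1 = ("R3", ("R3", "R5"))   # R3 := R3 op R5
I2 = ("R4", ("R3",))        # R4 := R3 + 1
I3 = ("R3", ("R5",))        # R3 := R5 + 1
I4 = ("R7", ("R3", "R4"))   # R7 := R3 op R4

print(classify(I1, I2))  # ['true (RAW)']
print(classify(I2, I3))  # ['anti (WAR)']
print(classify(I1, I3))  # ['anti (WAR)', 'output (WAW)']
print(classify(I3, I4))  # ['true (RAW)']
```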
Dependency Analysis
-
A true data dependency reflects a real dependency between data values
-
In essence, anti-dependencies and output dependencies are caused by register conflicts
- Register contents may not reflect the correct ordering from the program
-
Instruction issue stops and the pipeline stalls
- The processor pauses for one cycle
-
The situation is more serious when register optimization techniques are used
-
Register optimization techniques maximize the use of registers to improve performance
-
Register conflicts become more significant
-
Register renaming
-
Registers are dynamically allocated by hardware
-
When an instruction that has a register as its destination operand is executed, a new register is allocated
-
Later instructions that read the original register must be changed to refer to the newly allocated register, to maintain consistency
-
This avoids the dependencies caused by register conflicts
- Original
-
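I1 and I2 have a true data dependency
-
I3 and I4 have a true data dependency
-
I3 and I2 have an anti-dependency (write after read)
-
I3 and I1 have an output dependency (write after write)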
-
Register renaming
-
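Under register renaming, R3 in I1 is renamed to R3b and R3 in I3 is renamed to R3c
-
The anti-dependency between I3 and I2 disappears, as does the output dependency between I3 and I1, so I3 can be issued immediately
-
True data dependencies cannot be resolved by register renaming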
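The renaming rule can be sketched in a few lines. This is a minimal illustration, assuming every register starts as version `a` and each write allocates the next letter, which reproduces the R3b and R3c names mentioned above. The four-instruction sequence is an assumed example consistent with the listed dependencies, and `rename` is a hypothetical helper limited to 25 writes per register.

```python
from string import ascii_lowercase


def rename(program):
    """Give every register write a fresh name (R3 -> R3b, R3c, ...).

    program: list of (dest, sources) in program order. Sources are
    rewritten to the version that is current when they are read.
    """
    version = {}      # architectural register -> index of its current letter
    renamed = []
    for dest, sources in program:
        new_sources = tuple(r + ascii_lowercase[version.get(r, 0)]
                            for r in sources)
        version[dest] = version.get(dest, 0) + 1
        renamed.append((dest + ascii_lowercase[version[dest]], new_sources))
    return renamed


program = [
    ("R3", ("R3", "R5")),   # I1: R3 := R3 op R5
    ("R4", ("R3",)),        # I2: R4 := R3 + 1
    ("R3", ("R5",)),        # I3: R3 := R5 + 1
    ("R7", ("R3", "R4")),   # I4: R7 := R3 op R4
]
for line in rename(program):
    print(line)
# ('R3b', ('R3a', 'R5a'))
# ('R4b', ('R3b',))
# ('R3c', ('R5a',))
# ('R7b', ('R3c', 'R4b'))
```

After renaming, I3 writes R3c while I1 writes R3b and I2 reads R3b, so only the true dependencies (I1 to I2, I3 to I4, I2 to I4) remain.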
Analysis of the three techniques
-
Techniques for improving performance in superscalar processors
-
Duplication of Resources
-
Out-of-order issue
-
Renaming
-
-
Resources are the foundation
- Sufficient resources are needed to run multiple pipelines
-
Out-of-order issue is the method
- Supplies executable instructions by issuing them out of order
-
Renaming is the guarantee
- The renaming mechanism reduces the dependencies between instructions
About the instruction window
-
Out-of-order issue: an instruction window is used to buffer instructions after decoding
-
Through the instruction window, the processor can identify independent instructions that can be brought into the execute stage
-
If the instruction window is very small, the chance of finding an independent instruction is very low
-
The instruction window must be large enough to find independent instructions and use the hardware more effectively
-
In practice the window needs more than 8 entries
Effect of technology

Without Procedural Dependencies
-
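Base: no functional units are duplicated
-
+ld/st: a load/store unit is added
-
+alu: an ALU is added
-
+both: both an ALU and a load/store unit are added
-
Procedural dependencies are not considered
-
Without register renaming, adding hardware has little visible effect. With register renaming, adding an ALU clearly improves the speedup
-
As for the issue window, growing it from 8 to 16 entries has a clear effect, while growing it from 16 to 32 helps somewhat less
-
Resource duplication, out-of-order issue, and register renaming all affect one another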
Consideration of control dependence
-
Also called branch hazard
-
At a branch instruction, the processor cannot yet determine which instruction will execute after the branch
-
If the pipeline has prefetched the wrong instructions, they must be discarded and the correct ones fetched, so the pipeline cannot run at full load
Methods
-
Methods for handling control dependence
-
Multiple Streams
-
Prefetch Branch Target
-
Loop buffer
-
Branch prediction
-
Delayed branching
-
-
Goal: keep the pipeline full
About delayed branch
-
Delayed branching is often used in RISC processors
-
Calculate the result of the branch before unusable instructions are prefetched
-
An instruction that is not affected by the branch is placed immediately after the branch and is always executed
-
This keeps the pipeline full while the new instruction stream is fetched
-
-
Not as good for superscalar
- Multiple instructions need to execute in the delay slot
- This raises instruction dependence problems
- Superscalar processors often use branch prediction instead
Superscalar execution
-
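Through instruction fetch and branch prediction, the static program is turned into a dynamic instruction stream
-
The processor checks the instruction stream for dependencies and removes the unnecessary ones, such as anti-dependencies and output dependencies. The instructions are then placed in the execution window to wait for execution
-
Instructions in the execution window are ordered by their true data dependencies. The processor issues instructions to the execution units according to true data dependencies and resource availability
-
Finally, the results must go through a commit step, because the instructions were not executed in their original order, and because branch prediction and speculative execution mean some results have to be discarded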
-
Simultaneously fetch multiple instructions
-
Multiple fetching and decoding
-
Branch prediction logic
-
-
Logic to determine true dependencies involving register values
- Determine which instructions are truly dependent on one another
-
Dealing with unnecessary dependencies
- Anti-dependency and output dependency
-
Mechanisms to initiate multiple instructions in parallel
-
Instruction window
-
Out-of-order issue logic
-
-
Resources for parallel execution of multiple instructions
- The system has sufficient resources
-
Mechanisms for committing process state in correct order
- Commit results in the original instruction order
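The last requirement, committing state in the correct order, can be sketched as follows. This is a simplified model of in-order commit, assuming an instruction may commit only once it and every earlier instruction in program order have completed; it does not model discarding speculative results. The instruction names and the `commit_log` helper are hypothetical.

```python
def commit_log(program_order, completion_order):
    """Return, for each completion event, the instructions that commit then.

    program_order: instruction names in original program order
    completion_order: the same names in the order they finish executing
    """
    completed = set()
    next_to_commit = 0      # index of the oldest uncommitted instruction
    log = []
    for name in completion_order:
        completed.add(name)
        committed_now = []
        while (next_to_commit < len(program_order)
               and program_order[next_to_commit] in completed):
            committed_now.append(program_order[next_to_commit])
            next_to_commit += 1
        log.append((name, committed_now))
    return log


print(commit_log(["I1", "I2", "I3"], ["I2", "I3", "I1"]))
# [('I2', []), ('I3', []), ('I1', ['I1', 'I2', 'I3'])]
```

I2 and I3 finish first, but their results are held back until I1 completes, so registers and memory change in program order.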
Summary
-
Resources are the foundation
- Machine parallelism
-
Out-of-order issue is the method
- Instruction level parallelism
-
Renaming is a guarantee
- Methods of improving instruction level parallelism
-
With a superscalar design, multiple pipelines run at the same time, achieving true parallelism at the instruction level