
Computer Organization and Architecture

Instruction Level Parallelism and Superscalar Processors

Outline

  • Overview of Superscalar

  • Design Issues of Superscalar

  • Superscalar in Pentium

  • Superscalar in ARM CORTEX-A8

Overview of Superscalar

/img/Computer Organization and Architecture/chapter14-1.png
Ideal pipeline
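  • Execution in an ideal instruction pipeline

  • Instruction execution is divided into 6 stages that share no resources

  • One instruction completes execution in every time unit

  • With enough instructions, throughput approaches 6 times that of non-pipelined execution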
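The "6 times" figure is a limit that is only approached for long instruction streams. A minimal sketch of the arithmetic, assuming every stage takes exactly one time unit and no stalls occur (the function names are illustrative, not from the slides):

```python
def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles for an ideal pipeline: fill it once, then finish one instruction per cycle."""
    return n_stages + (n_instructions - 1)


def speedup(n_instructions: int, n_stages: int) -> float:
    """Speedup over running every stage of every instruction back to back."""
    unpipelined_cycles = n_instructions * n_stages
    return unpipelined_cycles / pipelined_cycles(n_instructions, n_stages)


if __name__ == "__main__":
    for n in (1, 6, 100, 10_000):
        print(f"{n:>6} instructions: speedup {speedup(n, 6):.2f}")
```

For a single instruction there is no gain at all; the speedup climbs toward the stage count as the stream grows.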


Actual pipeline

  • Not all instructions require the same steps

    • Some pipeline stages are idle
  • Different pipeline stages take different amounts of time

    • Faster stages wait for the slowest one, so part of their time is wasted
  • Instructions are not independent of each other

    • Dependent instructions stall the pipeline

Problem about pipeline

  • The pipeline stalls because of dependencies between instructions; such a stall is called a pipeline hazard

  • There are three types of dependencies

    • Data dependence

    • Control dependence

    • Resource dependence

Question

  • Is instruction pipelining truly parallel?

    • Yes: multiple instructions are in the pipeline being processed at the same time

    • No: multiple instructions do not enter the pipeline at the same time

  • How to further improve the execution efficiency of instructions?

    • Optimize the pipeline: superpipelining

    • True instruction-level parallelism: superscalar pipelining

Superpipeline

  • In an ordinary pipeline, each pipeline stage takes one clock cycle

  • Many pipeline stages need less than half a clock cycle

  • Superpipeline

    • The internal clock runs at twice the external clock rate

    • The doubled internal clock lets two stage tasks complete per external clock cycle

  • Result: up to twice the instruction throughput

/img/Computer Organization and Architecture/chapter14-2.png
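  • A four-stage pipeline: each instruction is divided into 4 stages

  • In the ordinary pipeline, each stage takes 1 clock cycle. Throughput peaks at 4 times that of non-pipelined execution, with one instruction result produced per clock cycle

  • In the superpipeline, the doubled internal clock lets one instruction stage complete every 0.5 external clock cycle. Throughput peaks at 8 times that of non-pipelined execution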

Limit of Superpipeline
  • The effect is similar to raising the clock frequency

  • It is still not true instruction-level parallelism

  • Overall performance is limited by the clock cycle and by how long each instruction stage takes

    • Long execution stages hold back overall performance
  • An alternative technique: superscalar

Vector and Scalar

  • Scalar

    • Also called a non-vector quantity: a physical quantity that has magnitude but no direction. Its value may be positive or negative

    • A single number that represents a single attribute of a thing

    • For example: temperature, length

  • Vector

    • Originally, a quantity with both magnitude and direction

    • An ordered group of numbers that together describe a thing quantitatively

    • For example: the position of a point in a plane coordinate system, $(x, y)$

Scalar instruction
  • Scalar instructions have no vector-processing capability and operate on a single quantity, that is, a scalar

  • Most instructions are scalar

Vector instruction
  • The basic operand is a vector, that is, an ordered group of numbers

  • The instruction gives the address of the vector operand and specifies, explicitly or implicitly, vector parameters such as the increment (stride) and the vector length

  • One vector instruction applies the same operation to every element of the vector, which can effectively raise processing speed

  • Some mainframes provide full-featured vector instruction sets
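The difference between the two instruction kinds can be sketched in software. This is only an illustration: `scalar_add` and `vector_add` are hypothetical names, the flat `memory` list stands in for main memory, and a real vector unit would process the elements in hardware rather than in a Python loop.

```python
def scalar_add(a: float, b: float) -> float:
    """A scalar instruction: one operation on one pair of values."""
    return a + b


def vector_add(memory, src1, src2, dst, length, stride=1):
    """A vector instruction: one opcode, `length` element operations.

    src1, src2 and dst are starting addresses; stride is the address increment
    between consecutive elements.
    """
    for k in range(length):
        offset = k * stride
        memory[dst + offset] = memory[src1 + offset] + memory[src2 + offset]


memory = [1, 2, 3, 4, 10, 20, 30, 40, 0, 0, 0, 0]
vector_add(memory, src1=0, src2=4, dst=8, length=4)
print(memory[8:])
```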

/img/Computer Organization and Architecture/chapter14-3.png
Superscalar
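  • The superscalar design uses 2 independent pipelines

  • Each pipeline still overlaps successive instructions, as in ordinary pipelining

  • Two instructions can be in the same stage at the same time

  • In steady state, throughput can reach 8 times that of non-pipelined execution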
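The three organizations can be compared with simple timing formulas. The sketch below assumes k one-cycle stages, n independent instructions, and no stalls; time is measured in base clock cycles, and the function names are illustrative.

```python
import math


def base_time(n: int, k: int) -> float:
    """Ordinary pipeline: fill k stages, then one completion per cycle."""
    return k + (n - 1)


def superpipelined_time(n: int, k: int, degree: int) -> float:
    """Superpipeline: a new instruction starts every 1/degree of a cycle."""
    return k + (n - 1) / degree


def superscalar_time(n: int, k: int, degree: int) -> float:
    """Superscalar: `degree` instructions start together each cycle."""
    return k + math.ceil(n / degree) - 1


n, k = 6, 4
print(base_time(n, k), superpipelined_time(n, k, 2), superscalar_time(n, k, 2))
```

Under these assumptions both degree-2 designs approach twice the base throughput for long streams; the superpipeline lags slightly at the start because its instructions enter half a cycle apart instead of in pairs.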

/img/Computer Organization and Architecture/chapter14-4.png
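General superscalar organization
  • The organization contains 2 integer units, 2 floating-point units, and 1 memory unit

  • The integer units allow 2 integer instructions to execute in parallel, and the floating-point units allow 2 floating-point instructions to execute at the same time. One memory operation can proceed in parallel with them

  • This organization allows up to 5 instructions to execute in parallel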

Key problem of superscalar

  • Superscalar implementations raise a number of complex design issues related to the instruction pipeline

    • First, the dependencies within a single pipeline still exist

    • Multiple pipelines add more complex dependencies between pipelines

  • The compiler needs more sophisticated optimization techniques to achieve greater instruction-level parallelism

Application of superscalar
  • Superscalar techniques were proposed and developed alongside RISC

  • RISC processors tend to use superscalar techniques

  • Although RISC machines lend themselves readily to superscalar techniques, the superscalar approach can be used on either a RISC or a CISC architecture

  • The superscalar approach has become the standard method for implementing high-performance microprocessors

Factors limiting parallelism

  • Instruction level parallelism

    • The degree to which a program's instructions can be executed in parallel
  • Compiler capabilities

    • How well the compiler can expose and maximize the program's instruction-level parallelism
  • Hardware techniques

    • How well the hardware supports executing instructions in parallel

Limitations

  • The main limit on instruction-level parallelism is the dependencies between instructions in the program

  • These dependencies include

    • True data dependency

    • Output dependency

    • Anti-dependency

    • Procedural dependency

    • Resource conflicts


True Data Dependency

/img/Computer Organization and Architecture/chapter14-5.png
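  • i0 and i1 can be fetched and decoded at the same time

  • Because an operand of i1 is the result of i0, i1 cannot execute until i0 has finished executing

  • The second instruction is delayed by one clock cycle

  • The write must come before the read; this is also called a read-after-write (RAW) dependency

  • This dependency follows strictly from the order of the instructions, so it is a true dependency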


Procedural dependency

/img/Computer Organization and Architecture/chapter14-6.png
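  • Instructions before and after a branch cannot execute in parallel

  • If instructions are not of fixed length, an instruction must be decoded to find out how many bytes to fetch

  • With variable-length instructions, the previous instruction must be partly decoded before the next one can be fetched; otherwise the processor does not know where in memory the next instruction starts. This prevents simultaneous fetching

  • This is one reason superscalar techniques suit RISC architectures: RISC instructions are fixed length, so this dependency does not arise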


Resource conflict

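  • A resource conflict is also called a resource dependency

  • Instructions i0 and i1 both need the same functional unit during execution, so they cannot execute in parallel and must be handled one after the other. One clock cycle is wasted

  • A resource conflict looks much like a data dependency, but it can be resolved by duplicating the resource, for example by adding a second dryer as in the earlier laundry example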

Design Issues of Superscalar

Parallelism

  • Factors limiting parallelism

    • Instruction level parallelism

    • Compiler capabilities

    • Hardware techniques

  • Instruction level parallelism

    • A property of the program: how far its instructions can execute in parallel

    • Exists when the instructions in a sequence are independent

    • Independent instructions can overlap in execution

    • Governed by data and procedural dependencies

Machine Parallelism
  • Machine Parallelism

    • Ability to take advantage of instruction level parallelism

    • Governed by number of parallel pipelines

    • The ability to find independent instructions and obtain instruction level parallelism

  • Instructions that can be executed in parallel

    Load R1 <- R2(23)
    Add R3 <- R3, "1"
    Add R4 <- R4, R2

  • Instructions that cannot be executed in parallel

    Add R3 <- R3, "1"
    Add R4 <- R3, R2
    Store [R4] <- R0
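Whether a sequence like the two above can run in parallel comes down to whether any instruction reads a register that an earlier instruction writes. A minimal sketch of that check, assuming each instruction is described by hand as (text, destination, sources), and that the store reads both R4 (the address) and R0 (the data):

```python
def true_dependencies(program):
    """Return (producer_index, consumer_index) pairs, 0-based.

    program: list of (text, dest, sources); dest is None for a store.
    """
    last_writer = {}   # register -> index of the latest instruction that wrote it
    deps = []
    for index, (_text, dest, sources) in enumerate(program):
        for reg in sources:
            if reg in last_writer:
                deps.append((last_writer[reg], index))
        if dest is not None:
            last_writer[dest] = index
    return deps


parallel = [
    ('Load R1 <- R2(23)', "R1", ("R2",)),
    ('Add R3 <- R3, "1"', "R3", ("R3",)),
    ('Add R4 <- R4, R2', "R4", ("R4", "R2")),
]
serial = [
    ('Add R3 <- R3, "1"', "R3", ("R3",)),
    ('Add R4 <- R3, R2', "R4", ("R3", "R2")),
    ('Store [R4] <- R0', None, ("R4", "R0")),
]
print(true_dependencies(parallel))
print(true_dependencies(serial))
```

The first sequence yields no dependency pairs, so all three instructions are independent. The second forms a chain: each instruction needs the result of the one before it.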

Instruction issue

  • Instruction issue: the process of initiating instruction execution in the processor's functional units

    • Instruction issue occurs when an instruction moves from the decode stage of the pipeline to the first execute stage
  • To improve parallelism, the processor needs a well-chosen issue order rather than the original program order

  • In essence, instruction issue is a strategy for finding instructions that can enter the pipeline and be executed

Order about instruction issue
  • Three orderings are involved in instruction issue

    • Order in which instructions are fetched

    • Order in which instructions are executed

    • Order in which instructions change registers and memory

  • The one constraint on the processor is that the result must be correct

  • Instruction issue policy refers to the protocol used to start the execution of instructions

Instruction issue policy
  • The original instruction stream itself has dependencies

  • To improve the parallelism of execution, the processor may change the order in which instructions are executed

  • The more sophisticated the processor, the less it is bound by a strict relationship between these orderings

  • There are three issue policies

    • In-order issue with in-order completion

    • In-order issue with out-of-order completion

    • Out-of-order issue with out-of-order completion

In-order issue/in-order completion
  • Issue instructions in the order in which they occur in the program

  • Write results in that same order as instructions complete

  • Very inefficient; even scalar pipelines do not use this policy

  • In a superscalar pipeline

    • More than one instruction may be fetched at a time

    • To keep completion in order, instruction issue must wait whenever functional units conflict or a functional unit needs more than one cycle

/img/Computer Organization and Architecture/chapter14-7.png
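  • The superscalar processor has 2 independent pipelines and can fetch 2 instructions at a time

  • It has 3 execution units and 2 write-back units

  • I1 needs 2 cycles to execute; I3 and I4 need the same functional unit, which causes a conflict; I5 depends on the result of I4; I5 and I6 need the same functional unit, which causes a conflict

  • Instructions are fetched in pairs and passed to the decode unit. I1 takes 2 clock cycles to execute, so I3 and I4 cannot start executing until the fourth cycle. Because I3 and I4 conflict for a resource, they execute one after the other. I5 depends on the result of I4, and I5 and I6 conflict for a resource, so I5 and I6 also execute serially

  • The 6 instructions take 8 clock cycles in total

  • Because execution times differ, instructions issued together must all finish executing before the next issue can take place

  • If instructions have a data dependency, issue stalls until the dependency is satisfied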


In-order issue/out-of-order completion
/img/Computer Organization and Architecture/chapter14-8.png
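  • I1 and I2 are issued to the execution units together. I2 needs only 1 cycle, so it completes first. I3 can execute alongside I1 and move on to the write stage

  • I4 conflicts with I3 for a resource, so it must wait until I3 completes. I5 depends on the result of I4, so it must also wait for I4; likewise, I6 must wait for I5

  • Overall, the instructions take 7 cycles to complete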


Output (write-write) dependency

  • With out-of-order completion, results are written in a different order from the program order, which can cause output dependency problems

    R3 = R3 + R5 ;(I1)
    R4 = R3 + 1  ;(I2)
    R3 = R5 + 1  ;(I3)

  • Analyze
    • I2 depends on the result of I1: a true data dependency
    • If I3 completes before I1, R3 is left holding the result of I1 instead of I3, so later instructions read the wrong value: an output (write-write) dependency
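The analysis can be mechanized by comparing the registers each pair of instructions reads and writes. The sketch below is a simplified pairwise check, not a full hazard detector: it assumes hand-written (label, destination, sources) tuples and ignores whether an intervening write hides an older one.

```python
def classify_dependencies(program):
    """program: list of (label, dest, sources). Returns (earlier, later, kind) triples."""
    found = []
    for i, (early, early_dest, early_srcs) in enumerate(program):
        for late, late_dest, late_srcs in program[i + 1:]:
            if early_dest in late_srcs:
                found.append((early, late, "RAW"))   # true data dependency
            if late_dest in early_srcs:
                found.append((early, late, "WAR"))   # anti-dependency
            if late_dest == early_dest:
                found.append((early, late, "WAW"))   # output dependency
    return found


program = [
    ("I1", "R3", ("R3", "R5")),   # R3 = R3 + R5
    ("I2", "R4", ("R3",)),        # R4 = R3 + 1
    ("I3", "R3", ("R5",)),        # R3 = R5 + 1
]
for dependency in classify_dependencies(program):
    print(dependency)
```

Besides the I1/I2 true dependency and the I1/I3 output dependency, the check also reports write-after-read pairs, because I3 overwrites R3, which both I1 and I2 read.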

How?

  • Adopt dynamic scheduling strategy

  • Idea: Move the dependent instructions out of the way of independent ones (s.t. independent ones can execute)

    • Rest areas for dependent instructions: Reservation stations
  • Monitor the source “values” of each instruction in the resting area

  • When all source “values” of an instruction are available, “fire” (i.e. dispatch) the instruction

    • Instructions are dispatched in data-flow order, not control-flow order
  • Benefit

    • Latency tolerance: Allows independent instructions to execute and complete in the presence of a long latency operation

    • Reasonably schedule instructions with dependencies
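The firing rule can be sketched as a small simulation. It is a toy model under stated assumptions: all instructions already sit in reservation stations at cycle 0, every destination name is unique, latencies are at least 1, and there are enough functional units for every ready instruction. The instruction mix and latencies are made up for illustration.

```python
def dataflow_fire_cycles(program, ready_values, max_cycles=1000):
    """Return {name: cycle} giving the cycle in which each instruction fires.

    program: list of (name, dest, sources, latency) tuples in program order.
    ready_values: values that are available before cycle 0.
    """
    available_at = {value: 0 for value in ready_values}
    stations = list(program)          # instructions still waiting for operands
    fired = {}
    cycle = 0
    while stations:
        if cycle > max_cycles:
            raise RuntimeError("some source value is never produced")
        waiting = []
        for name, dest, sources, latency in stations:
            if all(available_at.get(s, float("inf")) <= cycle for s in sources):
                fired[name] = cycle
                available_at[dest] = cycle + latency   # result usable once execution ends
            else:
                waiting.append((name, dest, sources, latency))
        stations = waiting
        cycle += 1
    return fired


program = [
    ("I1", "r1", ("r0",), 4),        # long-latency load
    ("I2", "r2", ("r1", "r0"), 1),   # needs the load result
    ("I3", "r3", ("r4", "r5"), 2),   # independent of the load
    ("I4", "r6", ("r3", "r4"), 1),   # needs I3 only
]
fired = dataflow_fire_cycles(program, ready_values={"r0", "r4", "r5"})
print(fired)
```

I3 and I4 fire while I2 is still waiting for the load, which is the latency tolerance described above: I4 fires in cycle 2 and I2 only in cycle 4, even though I2 comes first in program order.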


Problem with in-order issue

  • When decoding reaches an instruction with a dependency or a conflict, decoding must stop

  • Subsequent instructions then cannot be decoded

  • While decoding is stopped, the processor cannot look ahead to check whether a later instruction is independent and could be executed


Out-of-order issue/out-of-order completion
  • Solution: decouple decode from execution

  • Decode

    • The front end keeps fetching and decoding

    • Each decoded instruction is placed in a buffer

    • As long as the buffer is not full, fetching and decoding can continue

  • Execution

    • When a functional unit becomes available, an instruction that is ready to execute is issued to it

    • Because buffered instructions are already decoded, the processor can identify in advance which of them can be executed

/img/Computer Organization and Architecture/chapter14-9.png
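  • Cycle 1: I1 and I2 are decoded and then placed in the issue buffer. The execution units are idle, so I1 and I2 are issued for execution
  • Cycle 2: I3 and I4 are decoded and then placed in the issue buffer. I3 and I4 share an execution unit, so only I3 is issued
  • Cycle 3: I5 and I6 are decoded and then placed in the issue buffer. I3 has finished executing, so I4 can be issued. I5 has a data dependency on I4 and cannot be issued, so I6 is issued instead
  • Cycle 4: no instructions remain to be decoded
  • Cycle 5: no instructions remain to be decoded. Only I5 is left in the issue buffer. I6 has finished executing, so I5 can be issued. I5 completes execution in the fifth clock cycle and writes its result in the sixth
  • The whole sequence takes 6 clock cycles, 1 cycle fewer than the previous policy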
Anti-dependency (write-after-read)
  • Out-of-order issue with out-of-order completion is still subject to constraints

  • A new kind of dependency arises: the anti-dependency

    R3 = R3 + R5 ;I1
    R4 = R3 + 1  ;I2
    R3 = R5 + 1  ;I3
    R7 = R3 + R4 ;I4

  • Analyze
    • I3 cannot complete before I2 has started and read its operands, because I2 needs the value in R3 and I3 changes R3

Dependency Analyzing

  • A true data dependency reflects a real flow of data between instructions

  • Anti-dependencies and output dependencies are, in essence, register conflicts

    • Register contents may not reflect the correct ordering from the program
  • Without further measures, instruction issue stops and the pipeline stalls

    • The processor pauses for one cycle
  • The problem is more serious when register optimization techniques are used

    • Register optimization maximizes the use of registers to improve performance

    • Register conflicts therefore become more frequent


Register renaming

  • Registers are allocated dynamically by hardware

  • When an instruction with a register as its destination operand is executed, a new register is allocated

  • Later instructions that read the original register are redirected to the newly allocated register, which keeps the program consistent

  • This avoids the dependencies caused by register conflicts


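  • Original

    R3 = R3 + R5 ;I1
    R4 = R3 + 1  ;I2
    R3 = R5 + 1  ;I3
    R7 = R3 + R4 ;I4

  • I1 and I2 have a true data dependency

  • I3 and I4 have a true data dependency

  • I3 and I2 have an anti-dependency (write after read)

  • I3 and I1 have an output dependency (write after write)

  • Register renaming

    R3b = R3a + R5a ;I1
    R4b = R3b + 1   ;I2
    R3c = R5a + 1   ;I3
    R7b = R3c + R4b ;I4

  • Under register renaming, the R3 written by I1 becomes R3b and the R3 written by I3 becomes R3c

  • The anti-dependency between I3 and I2 and the output dependency between I3 and I1 both disappear, so I3 can issue immediately

  • True data dependencies cannot be removed by register renaming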
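The renaming rule can be sketched in a few lines. The letter suffixes mirror the notation of the example above and are an assumption of this sketch: a real processor maps architectural registers to physical registers taken from a free list. Every write gets a fresh version here, so the destination of I4 also receives a suffix.

```python
import re


def rename(program):
    """program: list of (dest, expression) pairs using register names like R3.

    Returns the same program with every register tagged by a version letter.
    """
    version = {}   # architectural register -> current version letter

    def current(reg):
        # A register that has not been written yet is at its initial version "a".
        return reg + version.setdefault(reg, "a")

    renamed = []
    for dest, expression in program:
        # Sources are read under their current names, before the destination changes.
        new_expression = re.sub(r"R\d+", lambda m: current(m.group(0)), expression)
        # Each write allocates a fresh version of the destination register.
        version[dest] = chr(ord(version.get(dest, "a")) + 1)
        renamed.append((dest + version[dest], new_expression))
    return renamed


program = [
    ("R3", "R3 + R5"),   # I1
    ("R4", "R3 + 1"),    # I2
    ("R3", "R5 + 1"),    # I3
    ("R7", "R3 + R4"),   # I4
]
for dest, expression in rename(program):
    print(f"{dest} = {expression}")
```

After renaming, I3 writes R3c while I2 still reads R3b, so only the true dependencies (I1 to I2, and I3 and I2 to I4) remain.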

Analysis of three techniques

  • Techniques for improving performance in superscalar processors

    • Duplication of resources

    • Out-of-order issue

    • Renaming

  • Resources are the foundation

    • Enough resources are needed to run multiple pipelines
  • Out-of-order issue is the method

    • Out-of-order issue supplies instructions that are ready to execute
  • Renaming is the guarantee

    • The renaming mechanism reduces the dependencies between instructions

About instruction window

  • Out-of-order issue uses an instruction window to buffer instructions after decoding

  • Through the instruction window, the processor can identify independent instructions that can be sent to the execute stage

  • If the instruction window is very small, the chance of finding such instructions is low

  • The instruction window must be large enough to find independent instructions and use the hardware effectively

  • A sufficiently large instruction window is needed (more than 8 entries)
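The effect of window size can be seen in a toy issue model. The assumptions are deliberately crude: every instruction takes one cycle, up to `width` instructions issue per cycle, and the window always holds the oldest unissued instructions. The dependency pattern is invented for illustration.

```python
def cycles_to_issue(deps, window, width=2):
    """Cycles needed to issue every instruction.

    deps[i] is the set of earlier instruction indices that instruction i needs.
    """
    total = len(deps)
    issued_in = {}     # instruction index -> cycle in which it was issued
    next_decode = 0
    buffer = []        # the instruction window, kept in program order
    cycle = 0
    while len(issued_in) < total:
        # Decode in program order until the window is full.
        while len(buffer) < window and next_decode < total:
            buffer.append(next_decode)
            next_decode += 1
        # An instruction is ready once all its producers issued in an earlier cycle.
        ready = [i for i in buffer
                 if all(d in issued_in and issued_in[d] < cycle for d in deps[i])]
        for i in ready[:width]:
            issued_in[i] = cycle
            buffer.remove(i)
        cycle += 1
    return cycle


# A four-instruction dependency chain followed by four independent instructions.
deps = [set(), {0}, {1}, {2}, set(), set(), set(), set()]
for window in (2, 4, 8):
    print(window, cycles_to_issue(deps, window))
```

With a 2-entry window the independent instructions stay hidden behind the chain and the sequence needs 6 cycles; 4 entries bring it to 5 cycles and 8 entries to 4.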


Effect of technology

/img/Computer Organization and Architecture/chapter14-10.png
Without Procedural Dependencies
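  • Base: no functional units are duplicated

  • +ld/st: a load/store unit is added

  • +alu: an ALU is added

  • +both: both an ALU and a load/store unit are added

  • Procedural dependencies are not considered

  • Without register renaming, adding hardware brings little improvement. With register renaming, adding an ALU raises the speedup noticeably

  • For the issue window, growing from 8 to 16 entries has a clear effect; growing from 16 to 32 helps somewhat less

  • Resource duplication, out-of-order issue, and register renaming all influence one another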

Consideration of control dependence

  • Also called branch hazard

  • At a branch instruction, the processor cannot yet determine which instruction will execute after the branch

  • If the pipeline has prefetched the wrong instructions, it must discard them and fetch again, so the pipeline cannot run at full load

Methods

  • Ways of handling control dependence

    • Multiple Streams

    • Prefetch Branch Target

    • Loop buffer

    • Branch prediction

    • Delayed branching

  • Goal: Keep the pipeline running full

About delayed branch

  • Delayed branching is often used in RISC processors

  • Calculate the result of the branch before unusable instructions are prefetched

    • An instruction that the branch does not affect is placed immediately after the branch

    • This keeps the pipeline full while the new instruction stream is fetched

  • Not as good for superscalar

    • Multiple instructions would need to execute in the delay slot
    • This raises instruction dependence problems
    • Superscalar processors often use branch prediction instead

Superscalar execution

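  • Instruction fetch and branch prediction turn the static program into a dynamic instruction stream

  • The processor checks this stream for dependencies and removes the unnecessary ones, such as anti-dependencies and output dependencies. The instructions are then placed in the execution window to await execution

  • Instructions in the execution window are ordered by their true data dependencies. The processor issues them to the execution units according to true data dependencies and resource availability

  • The final results need a commit step, because instructions do not execute in their original order, and branch prediction and speculative execution mean that some results must be discarded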


  • Simultaneously fetch multiple instructions

    • Multiple fetching and decoding

    • Branch prediction logic

  • Logic to determine true dependencies involving register values

    • Identify which instructions are linked by true dependencies
  • Dealing with unnecessary dependencies

    • Anti-dependency and output dependency
  • Mechanisms to initiate multiple instructions in parallel

    • Instruction window

    • Out of order issue logic

  • Resources for parallel execution of multiple instructions

    • The system has sufficient resources
  • Mechanisms for committing process state in correct order

    • Commit results in the original program order
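The last mechanism, committing in program order even though execution finishes out of order, can be sketched with a queue of in-flight instructions. This resembles what is commonly called a reorder buffer, a name the slides do not use; the function and the completion order below are illustrative assumptions.

```python
from collections import deque


def commit_in_order(program_order, completion_order):
    """Return a log of (completed_instruction, instructions_committed_as_a_result)."""
    in_flight = deque(program_order)   # oldest instruction at the left
    finished = set()
    log = []
    for name in completion_order:
        finished.add(name)
        committed = []
        # Only the oldest instruction may commit, and only once it has finished.
        while in_flight and in_flight[0] in finished:
            committed.append(in_flight.popleft())
        log.append((name, committed))
    return log


log = commit_in_order(
    program_order=["I1", "I2", "I3", "I4"],
    completion_order=["I2", "I1", "I4", "I3"],
)
for completed, committed in log:
    print(f"{completed} finished, committed: {committed}")
```

I2 finishes first but its result is held back until I1 has finished; registers and memory therefore change in program order, and a result from a mispredicted path can be dropped before it commits.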

Summary

  • Resources are the foundation

    • Machine parallelism
  • Out of order issue is the method

    • Instruction level parallelism
  • Renaming is a guarantee

    • Methods of improving instruction level parallelism
  • In a superscalar design, multiple pipelines run at the same time, giving true parallelism at the instruction level