Computer Organization and Architecture
Processor Structure and Function
Outline
- Processor Organization
- Register Organization
- Instruction Cycle
- Instruction Pipelining
Processor Organization
- A CPU must be able to
  - Fetch instructions from memory
  - Decode each instruction to determine what action to take
  - Fetch data
  - Process data
  - Write data
- The CPU must be able to hold some data temporarily while it processes that data
- The CPU must remember the location of the next instruction so that, after the current instruction completes, it can find the next one
- Instructions and data must be stored during processing
- Besides the ALU and the control unit (CU), the CPU therefore needs a set of storage elements: registers
- The CPU connects to the other components of the computer through the system bus, which consists of the control bus, the data bus, and the address bus
- An internal CPU bus links the ALU, the registers, and the CU, and carries data between the registers and the ALU
- The control unit controls the registers, the internal bus, and the ALU, directing each component to carry out the processing the instruction requires
- The ALU itself contains smaller components, such as the status flags, a shifter, a complementer, and the arithmetic and Boolean logic circuits
Register Organization
- CPU must have some working space (temporary storage)
  - Called registers
- Number and function vary between processor designs
- One of the major design decisions
- Top level of the memory hierarchy
- Registers in the CPU fall into two types
  - User-visible registers
    - Reduce accesses to main memory and improve instruction processing efficiency by optimizing register usage
  - Control and status registers
    - Used by the control unit
    - Used by the privileged operating system to control the operation of the CPU and the execution of programs
User-visible registers
- User-visible registers can be divided into four categories according to purpose
  - General purpose: assignable to a variety of uses
  - Data: hold data only
  - Address: used for certain addressing modes
  - Condition codes: also called the flag register; holds flags describing operation results
- General purpose registers
  - True general purpose
    - Registers and opcodes are orthogonal in the instruction set
    - Any register can be paired with any opcode
  - Restricted general purpose registers
    - Dedicated to floating-point or stack operations
  - In some cases, general purpose registers can be used for addressing
    - Register indirect addressing
    - Displacement addressing
- Data registers
  - Can only be used to hold data, not for addressing
  - Example: the accumulator
- Address registers
  - Used for a specific addressing mode
    - Segment pointer
    - Index register
    - Stack pointer
General or Specialized
- Whether registers are general or specialized affects the design of the instruction set
- Specialized
  - An opcode can implicitly use a particular register or register group
  - Smaller instructions
  - Less flexibility
- General purpose
  - Increases flexibility and programmer options
  - Increases instruction size and complexity
- More registers require more bits to specify a register in an instruction
- Fewer registers require more memory accesses
- Too many registers do not reduce memory references noticeably and take up processor real estate
  - Between 8 and 32 is appropriate
- RISC designs use register files to take advantage of a larger number of registers
- Address registers must be large enough to hold a full address
- Data registers must be large enough to hold a full word
- Sometimes two data registers are combined to hold double-length data
  - In C, for example, a double or long integer may occupy two words
Condition code registers
- Also called flag registers; some flags are visible to users
- After an operation, the CPU sets the condition bits according to the result
- An arithmetic operation may produce a positive, negative, zero, or overflow result
- These conditions are recorded as bits in the condition code register
- Programs are allowed to read the condition codes, e.g. for conditional branches
- Condition codes cannot be directly modified by the program
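To make the flag setting concrete, here is a minimal Python sketch (an illustration, not any particular ISA) of how N/Z/C/V condition codes could be derived from an 8-bit addition:

```python
def add8_with_flags(a: int, b: int):
    """Add two 8-bit values and compute typical condition codes.

    Returns the 8-bit result plus N (negative), Z (zero),
    C (carry out), and V (signed overflow) flags.
    """
    raw = a + b
    result = raw & 0xFF                      # keep the low 8 bits
    n = (result >> 7) & 1                    # sign bit of the result
    z = int(result == 0)                     # result is zero
    c = int(raw > 0xFF)                      # unsigned carry out of bit 7
    # Signed overflow: both operands have a sign that differs from the result's
    v = int(((a ^ result) & (b ^ result) & 0x80) != 0)
    return result, {"N": n, "Z": z, "C": c, "V": v}

# 0x7F + 0x01 = 0x80: positive + positive -> negative, so V = 1
print(add8_with_flags(0x7F, 0x01))
```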
Control & status registers
- Four registers are essential to instruction execution
  - Program Counter (PC)
  - Instruction Register (IR)
  - Memory Address Register (MAR)
  - Memory Buffer Register (MBR)
- Not all processors have an MAR and MBR, but every design needs registers that play these two roles
Program status word
- The PSW contains status information
- The flags include
  - Sign, zero, carry, equal, overflow
  - Interrupt enable/disable
  - Supervisor: indicates whether the CPU is executing in supervisor or user mode
- Supervisor mode
  - Not available to user programs
  - Used by the operating system (system calls)
  - Certain privileged instructions can be executed only in supervisor mode
- Other status and control registers
  - Pointer to the process control block
  - Interrupt vector register, in machines with vectored interrupts
  - System stack pointer
  - Page table pointer register, in systems with virtual memory
  - Registers related to I/O operations
- Design considerations for control and status registers
  - Need to support the operating system
  - How to divide storage between registers and memory
Instruction Cycle
- The instruction cycle includes a fetch cycle and an execute cycle
- In the execute cycle, the instruction is first decoded to determine the operation type
- If the instruction has operands, the operand specifiers in the instruction are obtained
  - Immediate: no memory access
  - Register: no memory access
  - Direct addressing: one memory access
  - Indirect addressing: may require more memory accesses
    - Hence the extra "indirect cycle"
Instruction cycle with indirect
- The instruction cycle includes the fetch and execute cycles, and may also include an indirect cycle and an interrupt cycle
- After fetching, decoding determines whether the instruction contains operands that need indirect addressing; if so, the indirect cycle is entered
- After the current instruction finishes, the CPU checks for interrupts; if one is pending, the interrupt cycle is entered
- Within the instruction cycle, the instruction is fetched first and then decoded
- If operands are involved, the operand address is calculated and the operand is fetched
- The data operation is then performed; if the result must be stored to memory, its address is calculated and the result is written
- After the instruction completes, the CPU checks for interrupts; if there is none, the next instruction is executed, otherwise the interrupt is handled according to the interrupt-handling rules
- With indirect addressing, the operand address itself must be computed, so fetching an operand may involve several memory accesses
- An indirect cycle can occur both when fetching operands and when storing results
Data flow (instruction fetch)
- Depends on CPU design
- Fetch, in general:
  - PC contains the address of the next instruction
  - Address moved to MAR
  - Address placed on the address bus
  - Control unit requests a memory read
  - Result placed on the data bus, copied to MBR, then to IR
  - Meanwhile, PC is incremented by 1
Data flow (fetch diagram)
- Initially, the address of the next instruction is in the PC
- The address is moved to the MAR
- The address is placed on the address bus
- The control unit issues a read request
- Memory places the data, i.e. the instruction, on the data bus
- The MBR captures the contents of the data bus, and the instruction is passed to the IR
- The control unit also increments the PC so that it points to the next instruction
- The IR is examined
- If there is no indirect addressing, the execute cycle is entered
- If there is indirect addressing, an indirect cycle is performed first
  - Rightmost N bits of MBR transferred to MAR
  - Control unit requests a memory read
  - Result (the address of the operand) moved to MBR
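To make these register transfers concrete, here is a minimal Python sketch of the fetch and indirect cycles (a toy model with an invented instruction encoding, not any real processor):

```python
# Toy machine state: an instruction word is (opcode, indirect_flag, address).
memory = {
    0: ("LOAD", True, 10),   # instruction at address 0; its operand address is indirect
    10: 42,                  # location 10 holds the operand's real address
    42: 7,                   # the operand itself
}
pc = 0

# --- Fetch cycle ---
mar = pc                      # PC -> MAR
mbr = memory[mar]             # control unit requests a memory read; result -> MBR
ir = mbr                      # MBR -> IR
pc += 1                       # meanwhile, PC is incremented

# --- Indirect cycle (only if the IR says the address is indirect) ---
opcode, indirect, addr = ir
if indirect:
    mar = addr                # address bits of the IR/MBR -> MAR
    mbr = memory[mar]         # memory read: MBR now holds the operand's address
    addr = mbr

# --- Operand fetch (start of the execute cycle) ---
mar = addr
operand = memory[mar]
print(opcode, "operand =", operand)   # LOAD operand = 7
```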
Data flow (execute)
- May take many forms
- Depends on the instruction being executed
- May include
  - Memory read/write
  - Input/output
  - Register transfers
  - ALU operations
Data flow (interrupt)
- Simple and predictable
- The current PC is saved to allow resumption after the interrupt
  - Contents of PC copied to MBR
  - A special memory location (e.g. from the stack pointer) is loaded into MAR
  - MBR written to memory
- PC loaded with the address of the interrupt handling routine
- The first instruction of the interrupt handler is then fetched
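Continuing in the same toy style (the state layout and the full descending stack are assumptions of the sketch), interrupt entry is just a few more register transfers:

```python
def enter_interrupt(state, handler_address):
    """Save the PC to the stack and jump to the interrupt handler."""
    state["sp"] -= 1                               # assume a full descending stack
    state["mar"] = state["sp"]                     # stack location -> MAR
    state["mbr"] = state["pc"]                     # PC -> MBR
    state["memory"][state["mar"]] = state["mbr"]   # MBR written to memory
    state["pc"] = handler_address                  # PC <- handler address; its first
                                                   # instruction is fetched next cycle

state = {"pc": 101, "sp": 200, "mar": 0, "mbr": 0, "memory": {}}
enter_interrupt(state, handler_address=500)
print(state["pc"], state["memory"][199])           # 500 101
```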
Instruction Pipelining
Why do we need pipelining?
- The development of computer applications requires continuous improvement in processing capacity
- Advances in integrated circuits, clock frequency, registers, caches, etc. have reduced instruction processing time and improved processing capability
- It is increasingly difficult to gain performance by relying on hardware alone
- The goal is the execution efficiency of instructions
- Better organization is needed to improve the efficiency of instruction execution
What is a pipeline?
- Borrows the working model of a factory assembly line
- Divide the execution of an instruction into several stages
- Different stages of multiple instructions can then be processed in parallel
- Although the execution time of each individual instruction is not shortened, the execution time of a group of instructions is, thanks to this parallelism
- This is the basic idea of instruction pipelining
Prefetch
- Without pipelining, the next instruction is fetched only after the current instruction finishes executing
- With pipelining, more than one instruction is in the pipeline, each in a different stage
- How to get instructions becomes a question
  - Fetching accesses main memory
  - Execution usually does not access memory
  - So the next instruction can be fetched during execution of the current instruction
- This is called instruction prefetch
Advantage
- While one instruction is executing, a new instruction has already entered the pipeline
- When the current instruction finishes, the next one can be executed immediately
  - The next instruction has already finished fetching
  - This saves the fetch time
- Fetching requires a memory access
  - On a cache hit, the instruction is taken directly from the cache
  - On a cache miss, main memory is accessed
- In practice, the instruction cycle is divided into finer stages: more pipeline stages mean more overlap and more efficient instruction execution
Which instruction is prefetched?
- Which instruction is appropriate to prefetch?
- The instruction following the current one?
  - If execution is sequential, no problem
  - If there is a branch, the next instruction depends on the branch condition
  - Hard to predict
- Does a misprediction in prefetching affect correctness?
  - No, prefetched data at a "mis-predicted" address is simply not used
  - There is no need for state recovery
Basic characteristics
- In modern systems, prefetching is usually done at cache-block granularity
- Prefetching is a technique that can reduce both
  - Miss rate
  - Miss latency
- Prefetching can be done by
  - Hardware
  - Compiler
  - Programmer
Prefetching: the four questions
- What
  - What addresses to prefetch
- When
  - When to initiate a prefetch request
- Where
  - Where to place the prefetched data
- How
  - Software, hardware, execution-based, cooperative
Two stage instruction pipeline

- A simple, unpipelined instruction flow is serial: fetch, execute, fetch, execute; this is inefficient
- With a two-stage pipeline, the next instruction is fetched while the current instruction executes
- If, after the current instruction completes, the next instruction is not the one that was prefetched, a new fetch is required
- Fetch and execute overlap in time, which saves time
- But because fetch and execute take different amounts of time, the speed does not double

Instruction pipelining
- Execution in a two-stage pipeline
- The execute stage of one instruction overlaps in time with the fetch stage of the next
- The total execution time of each instruction is unchanged, but the execution time of the whole group of instructions is shortened
- If fetch and execute took the same time, the pipelined execution time would be half of serial execution: a twofold performance gain
Improved performance
- But not doubled
  - Fetch is usually shorter than execution
  - Instruction execution is complex and time-consuming
  - The execution time determines how much is gained
- Jump or branch instructions
  - Mean that prefetched instructions may not be the required instructions
  - The actual next instruction is fetched according to the branch outcome
- Add more stages to improve performance further
Improve concurrency
- Goal: more concurrency → higher instruction throughput
- Method: while an instruction is using some resources in one processing phase, process other instructions on the idle resources
  - Fetch the next instruction while an instruction is being decoded
  - Decode an instruction while another is being executed
  - Execute the next instruction while the current instruction is accessing memory
  - While an instruction is writing its result into the register file, access data memory for the next instruction
Summary
- Analogy: "assembly line processing" of instructions
- Pipeline the execution of multiple instructions
  - Divide the instruction processing cycle into distinct "stages"
  - Ensure there are enough hardware resources to process one instruction in each stage
  - Process a different instruction in each stage
  - Instructions are executed in program order
- Benefit: increases instruction processing throughput

- Pipelining an add instruction
- The instruction is divided into 4 stages: fetch, decode, execute, write result, each taking time $t$
- With a 4-stage pipeline whose stages are fully independent, $n$ instructions need $nt+3t$ time
- That is roughly $\frac{1}{4}$ of the serial time
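As a worked check of that claim, compare serial and pipelined time for $n=100$ instructions:

$$ T_{serial}=4nt=400t, \qquad T_{pipe}=(n+3)t=103t, \qquad \frac{400t}{103t}\approx 3.9 \approx 4 $$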
In practice

- Drying a load of clothes takes 2 time units, so fully serial processing of 4 loads takes 20 time units
- With a pipeline, there is time spent waiting for the dryer
- The 4 loads need 11 time units
- The theoretical speedup over the non-pipelined case would be $2.5\times$
- The slowest step determines the throughput of the whole system
How

- The dryer becomes the bottleneck of the whole system
- Add resources: install a second dryer
- After the next load finishes washing, it need not wait for the previous load to dry; it uses the other dryer
- Adding resources at the critical step restores the throughput
- The cost is the extra resources
Goal
- Increase instruction throughput with little increase in cost
  - Process instructions in the order required by the program
  - Hardware cost must not increase too much
  - Instruction throughput can be greatly increased
An ideal pipeline
- Repetition of identical operations
  - The same operation, applied to different objects
  - Cars of the same model can be produced on one assembly line
  - Different operations require different steps, which disrupts the pipeline
    - Cars and motorcycles require different steps and cannot share one assembly line
- The objects being operated on are independent of each other
  - There is no dependency between objects
  - For example, cars on an assembly line are unrelated to each other
  - Objects with sequential dependencies interfere with each other when processed in parallel
- A complete operation can be decomposed into several sub-operations
  - Each sub-operation takes the same time
  - Each sub-operation uses its own resources and shares none
  - If sub-operations take different times, some must wait
  - Resource sharing leads to resource contention
- For instruction pipelining, we divide the execution of an instruction into six stages
  - Fetch instruction (FI)
  - Decode instruction (DI)
  - Calculate operands (CO)
  - Fetch operands (FO)
  - Execute instruction (EI)
  - Write result (WO)
- Overlap these operations
Timing of pipeline

- Execution in an ideal instruction pipeline
- Instruction execution is divided into 6 stages that share no resources
- Executing in pipelined fashion, from the sixth time unit onward one instruction completes in every time unit
- When the number of instructions is large enough, execution is about 6 times as efficient; a small simulation of this timing follows
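A minimal Python sketch (assuming the ideal conditions above: equal stage times, no hazards, no shared resources) that prints the classic time/stage diagram for the six-stage pipeline:

```python
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]

def ideal_pipeline_diagram(n_instructions: int):
    """Print which stage each instruction occupies in each time unit."""
    total_time = n_instructions + len(STAGES) - 1   # (n + k - 1) time units
    for i in range(n_instructions):
        row = ["    "] * total_time
        for s, stage in enumerate(STAGES):
            row[i + s] = f" {stage} "               # instruction i enters stage s at time i+s
        print(f"I{i+1}: " + "|".join(row))
    print(f"{n_instructions} instructions finish in {total_time} time units "
          f"(serial execution would take {n_instructions * len(STAGES)})")

ideal_pipeline_diagram(9)   # 9 instructions: 14 time units vs 54 serial
```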
Summary
- The total execution time of each individual instruction is not changed by pipelining
  - It still takes a full instruction cycle to make it all the way through the processor
- Pipelining doesn't speed up instruction execution time
- It does speed up program execution time by increasing the number of instructions finished per unit time
Branch in a pipeline

- Instructions 1 and 2 execute normally
- Instruction 3, in time slot 8, must branch to instruction 15
- The work already done for instructions 4-7 is discarded
- The fetch of instruction 15 must start from scratch
- In time slots 9 to 12, no instruction completes; this is called the branch penalty
- The more branches there are, the larger the total branch penalty and the lower the instruction throughput of the whole program
Six stage instruction pipeline

- The first step is instruction fetch, followed by decode and operand address calculation
- At this point, the CPU checks whether the instruction is an unconditional branch; if so, it updates the PC, flushes the pipeline, and resumes fetching
- If there is no branch, execution proceeds normally: fetch operands, execute, and write the result
- Finally, check whether a branch is taken, or whether there is an interrupt; if so, as with an unconditional branch, update the PC, flush the pipeline, and continue with the subsequent instruction stream
Other factors
- Data transmission between stages takes time
- Theoretically, the more stages, the higher the instruction execution efficiency
  - But the more stages, the more complex the control between them
  - Latching and buffering between stages takes a certain time
  - A reasonable design balance is needed
- Speedup factor with instruction pipelining
  - Suppose n instructions are executed on a k-stage pipeline; the speedup of pipelined over non-pipelined execution is defined as
$$ S_k=\frac {nk}{k+n-1} $$
- As the number of instructions grows, the speedup approaches the number of pipeline stages
- The more instructions, the closer the speedup gets to its theoretical limit; as the number of stages grows, the speedup increases only slowly
- More pipeline stages bring a better potential speedup, but also many problems: branches force pipeline flushes, and inter-stage latch delay must be considered
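For example, plugging $n=100$ instructions and $k=6$ stages into the formula above:

$$ S_6=\frac{100\times 6}{6+100-1}=\frac{600}{105}\approx 5.7 $$

close to, but still below, the stage count of 6.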
Analysis of instruction pipeline
- What are the characteristics of an ideal pipeline?
  - Repetition of identical operations
  - Objects being operated on are independent of each other
  - A complete operation can be decomposed into several sub-operations
- Identical operations … NOT!
  - Different instructions → not all need the same stages
  - Forcing different instructions to go through the same pipe stages leaves some pipeline stages idle
  - The wasted time is called external fragmentation
Independent operations … NOT!
- Instructions are not independent of each other
- Inter-instruction dependencies must be detected and resolved to ensure the pipeline produces correct results
- The pipeline stalls frequently due to branches
- The pipeline operates poorly
Uniform sub-operations … NOT!
- Different pipeline stages → not the same latency
- Every stage must be driven by the same clock
- Some pipe stages are too fast, but all take the same clock cycle time
- This wasted time is called internal fragmentation
Issues in pipeline design
- Reasonably divide the instruction into stages
  - How many stages is the instruction cycle divided into?
  - What is done in each stage?
- Handling exceptions and interrupts
- Keeping the pipeline correct, moving, and full
  - Data dependences
  - Control dependences
  - Resource conflicts
  - Long-latency (or multi-cycle) operations
- Causes of pipeline stalls
  - Pipeline stall: a condition in which the pipeline stops moving
  - Causes of stalls
    - Resource contention
    - Dependences between instructions, including data dependences and control dependences
    - Long-latency (multi-cycle) operations
Dependences and Their Types
- A dependence that stalls the pipeline is called a "hazard", and the resulting idle slot a "pipeline bubble"
- Dependences dictate ordering requirements between instructions
- Two types
  - Data dependence
  - Control dependence
- Resource contention is sometimes called resource dependence
- When a dependence cannot be satisfied in time, the pipeline is suspended: a pipeline hazard
Resource hazards
- Two (or more) instructions in the pipeline need the same resource
- The affected instructions are executed serially rather than in parallel for part of the pipeline
- Also called a structural hazard
- Caused by an unreasonable structure or insufficient resources
  - Such as two instructions needing the same register port or memory port
- The solution is generally to add resources, such as adding a dryer in the earlier laundry example
Example

- In clock cycle 3, $I_1$ needs to read memory to fetch its operand, while $I_3$ needs to fetch its instruction
- Both instructions need to read memory, so a resource conflict occurs
- $I_3$ must wait one clock cycle and fetch in clock cycle 4 instead
- One clock cycle is wasted because of the resource conflict
- If there is only one ALU, instruction execution can conflict in the same way
Handling resource contention
- Solution 1: eliminate the cause of contention
  - Duplicate the resource or increase its throughput
  - E.g., use separate instruction and data memories (caches)
  - E.g., use multiple ports for memory structures
- Solution 2: detect the resource contention and stall one of the instructions
  - Need to decide which one to stall
Data hazards
- A conflict in access to an operand
  - E.g., two instructions access the same memory or register operand
- If the two instructions are executed strictly serially, one starting only after the previous one finishes, there is no problem
- In a pipeline, however, the operand value could be updated in a different order, producing a result different from strict sequential execution
- A data hazard is caused by conflicting accesses to the same operand location
Types of data hazard
- Types of data dependences:
  - Read after write (RAW)
    - Called "true dependence"
  - Write after read (WAR)
    - Called "anti-dependence"
  - Write after write (WAW)
    - Called "output dependence"
True dependence
- Read after write (RAW), or true dependence
  - An instruction modifies a register or memory location
  - A succeeding instruction reads the data in that location
  - A hazard occurs if the read takes place before the write completes
- What the succeeding instruction needs to read is the modified data
- In a pipeline, the read may instead return the data from before the write
- Example
  - The first instruction writes $r_3$
  - The second instruction reads $r_3$
  - The second instruction must wait until the first has finished executing and written $r_3$ before it can fetch its operand; otherwise the $r_3$ value it reads is not the one required
Anti dependence
- Write after read (WAR), or anti-dependence
  - An instruction reads a register or memory location
  - A succeeding instruction writes to that location
  - A hazard occurs if the write completes before the read takes place
  - The data read by the first instruction would then be incorrect
- Example
  - The first instruction reads $r_1$
  - The second instruction writes $r_1$
  - If the second instruction executes first, the result is also incorrect
  - This situation arises in superscalar processors
Output dependence
- Write after write (WAW), or output dependence
  - Two instructions both write to the same location
  - A hazard occurs if the writes take place in the reverse of the intended order
- The data that should remain stored is the data written by the second instruction
- In the pipeline, the data actually left in place may be the data written by the first instruction
- The memory location or register then does not hold the required data
- Example
  - The first instruction writes $r_3$
  - The third instruction also writes $r_3$
  - If the third instruction executes first, the result is again incorrect
  - This situation arises in superscalar processors
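The three cases can be detected mechanically by comparing source and destination registers. A small Python sketch (the instructions and their operands are invented for the example):

```python
# Each instruction: (name, destination register, set of source registers)
program = [
    ("I1: add r3 = r1 + r2", "r3", {"r1", "r2"}),
    ("I2: sub r4 = r3 - r5", "r4", {"r3", "r5"}),   # reads r3  -> RAW after I1
    ("I3: mul r1 = r6 * r7", "r1", {"r6", "r7"}),   # writes r1 -> WAR after I1
    ("I4: add r3 = r8 + r9", "r3", {"r8", "r9"}),   # writes r3 -> WAW after I1
]

def find_dependences(prog):
    """Yield (earlier, later, type) for every register dependence."""
    for i, (n1, d1, s1) in enumerate(prog):
        for n2, d2, s2 in prog[i + 1:]:
            if d1 in s2:
                yield n1, n2, "RAW (true dependence)"
            if d2 in s1:
                yield n1, n2, "WAR (anti-dependence)"
            if d1 == d2:
                yield n1, n2, "WAW (output dependence)"

for earlier, later, kind in find_dependences(program):
    print(f"{kind}: {earlier!r} -> {later!r}")
```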
How?
- True dependences always need to be obeyed because they constitute a true dependence on a value
- Anti and output dependences exist only because of the limited number of architectural registers
  - They are dependences on a name, not a value
- Without special hardware and specific avoidance algorithms, they result in inefficient pipeline usage

Data hazard diagram
- In the fifth clock cycle, the add instruction writes EAX
- In the fourth clock cycle, the subtract instruction needs EAX
- If the second instruction did not wait, the EAX it reads would still be the old value, not the result of the add
- So the subtract instruction must stall for 2 clock cycles and fetch its operands in the sixth clock cycle
- 2 clock cycles are wasted
Handling methods
- True dependences are the more interesting case
  - There is a genuine data dependence, so waiting is required
- Anti and output dependences are easier to handle
  - They are all about writes
  - Use more registers
  - Using different registers eliminates the false dependence
- Some fundamental ways of handling true dependences
  - Detect the dependence and wait until the value is available in the register file
  - Detect and eliminate the dependence at the software level
  - Register renaming (a sketch follows below)
    - Discussed later
  - Predict the needed value(s), execute "speculatively", and verify
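As a hedged sketch of the register-renaming idea mentioned above (the renaming table and physical-register names are invented for illustration), writes are redirected to fresh physical registers, so dependences on a name, WAR and WAW, disappear:

```python
from itertools import count

# Architectural registers are renamed to an endless supply of physical ones.
free_physical = (f"p{i}" for i in count(1))
rename_table = {}   # architectural register -> current physical register

def rename(dest, sources):
    """Rename one instruction's operands; returns (new_dest, new_sources)."""
    # Sources read whatever physical register currently holds the value.
    new_sources = [rename_table.get(s, s) for s in sources]
    # Each write gets a brand-new physical register, so a later write to the
    # same architectural register never collides with an earlier one (no WAW)
    # and never overwrites a value an earlier instruction still reads (no WAR).
    new_dest = next(free_physical)
    rename_table[dest] = new_dest
    return new_dest, new_sources

print(rename("r3", ["r1", "r2"]))   # ('p1', ['r1', 'r2'])  I1: r3 now lives in p1
print(rename("r1", ["r6", "r7"]))   # ('p2', ['r6', 'r7'])  I3: WAR on r1 is gone
print(rename("r3", ["r8", "r9"]))   # ('p3', ['r8', 'r9'])  I4: WAW on r3 is gone
```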
Control dependence
- Also called a "control hazard" or "branch hazard"
- A special case of data dependence (the dependence is on the PC)
- Occurs when the pipeline makes a wrong judgment about a branch
- It brings instructions into the pipeline that must subsequently be discarded
- The pipeline cannot run at full load
Solutions
- Multiple streams
- Prefetch branch target
- Loop buffer
- Branch prediction
- Delayed branching
Multiple streams
- Have two pipelines, one for each branch path
  - Prefetch each path into a separate pipeline
- Finally, choose which pipeline to use according to the branch outcome
- Shortcomings
  - Leads to bus and register contention
  - Multiple branches lead to still more pipelines being needed
Prefetch branch target
- The branch target is prefetched in addition to the instructions following the branch
- The target is not executed after prefetching, only fetched and decoded
- The target is kept until the branch is executed
- This saves the fetch and decode time if the branch is taken
- Used by the IBM 360/91
Loop buffer
- A small, very fast memory
- Contains the n most recently fetched instructions, in order
- When a branch may occur, first check whether the branch target is already in the buffer
- Very good for small loops or short jumps
Branch prediction
- There are two types of branch prediction
  - Static branch prediction: the prediction does not depend on execution history
  - Dynamic branch prediction: the prediction depends on execution history
- Static approaches
  - Predict never taken
  - Predict always taken
  - Predict by opcode
- Dynamic approaches
  - Taken/not-taken switch
  - Branch history table
Static branch prediction
- Predict never taken
  - Assume the jump will not happen
  - Always fetch the next sequential instruction
  - Used by the 68020 and VAX 11/780
- Predict always taken
  - Assume the jump will happen
  - Always fetch the target instruction
  - "Predict always taken" is the more commonly used of the two
- Predict by opcode
  - Some instructions are more likely to result in a jump than others
  - Can achieve up to about 75% success
Dynamic branch prediction
- Record the history of conditional branch instructions in a program
- Taken/not-taken switch: one or more bits indicate the recent history of the instruction
  - The branch prediction is based on these bits
  - Based on previous history
  - Good for loops; a sketch of a 2-bit scheme follows
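A minimal Python sketch of the classic 2-bit saturating-counter switch (one common realization of this idea; the initial state is an arbitrary choice):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken.

    One surprising branch outcome only nudges the counter, so a loop branch
    that is almost always taken is mispredicted just once per loop exit.
    """
    def __init__(self):
        self.counter = 2            # start in the weakly-taken state

    def predict(self) -> bool:
        return self.counter >= 2    # True = predict taken

    def update(self, taken: bool):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

p = TwoBitPredictor()
hits = 0
for taken in [True] * 9 + [False]:  # a loop branch: taken 9 times, then exits
    hits += (p.predict() == taken)
    p.update(taken)
print(f"{hits}/10 correct")          # 9/10: only the loop exit is mispredicted
```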
Branch history table
- If a branch is predicted taken, the target address can only be obtained by decoding the instruction
  - A waiting time is required
  - How can this be made more efficient?
- A storage area called the branch target buffer is designed for this
  - Also called the branch history table
  - It records information related to branches: the branch instruction address, history bits, and target address information
- Target address information
  - Can be the target instruction itself
    - Use this instruction directly
    - Less time
    - Takes up more space
  - Can be the target instruction's address
    - Less space
    - More time
  - Whether to save time or space depends on the specific situation
Branch history table (diagram)
- With branch prediction, instruction prefetch first looks up the branch history table
- On a hit, the prediction is made from the recorded state: fetch either the target address or the next sequential address
- On a miss, the next sequential instruction is fetched
- When the branch instruction executes, the state bits in the table are updated according to whether the branch was actually taken
- If a conditional branch instruction is not in the table, it is added, replacing an existing entry; many replacement policies can be used, similar to cache replacement strategies
- The branch history table is maintained dynamically and automatically
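A hedged Python sketch of this lookup-and-update flow (the table layout and the 1-bit history are simplifications chosen for the example; real tables also use tags and multi-bit state):

```python
class BranchTargetBuffer:
    """Maps branch instruction address -> (predict_taken bit, target address)."""
    def __init__(self):
        self.table = {}

    def lookup(self, pc: int, next_pc: int) -> int:
        """Return the address to fetch next for the instruction at pc."""
        entry = self.table.get(pc)
        if entry is None:
            return next_pc                    # miss: fetch sequentially
        predict_taken, target = entry
        return target if predict_taken else next_pc

    def update(self, pc: int, taken: bool, target: int):
        """Called when the branch resolves; insert or refresh the entry."""
        self.table[pc] = (taken, target)      # 1-bit history for simplicity

btb = BranchTargetBuffer()
print(btb.lookup(pc=100, next_pc=104))        # 104: first encounter, table miss
btb.update(pc=100, taken=True, target=200)    # branch resolved as taken
print(btb.lookup(pc=100, next_pc=104))        # 200: now predicted taken
```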
Correlation-based prediction
- The branch history table performs well on loop statements
- In more complex structures, the outcome of a branch instruction correlates directly with the outcomes of related branch instructions
- A method called correlation-based branch prediction has been proposed
  - Keep a global branch history in addition to per-branch history
  - Predict by combining the global history with the current branch instruction's history
Delayed Branch
- A method of instruction rearrangement
- Delayed branching analyzes the impact of the branch and finds instructions that are not affected by it, instead of prefetching instructions that may be discarded
- Such an instruction is placed, and executed, immediately after the branch instruction
- Executing this instruction keeps the pipeline full, so no clock cycle is wasted waiting for the branch to resolve