Computer Organization and Architecture Reduced Instruction Set Computers
Computer Organization and Architecture
Reduced Instruction Set Computers
Outline
-
Register and instruction architecture
-
Reduced Instruction Set Architecture
-
The Use of a Large Register File
-
Compiler-Based Register Optimization
-
RISC
Pipelining -
RISC
VersusCISC
Controversy
Register and instruction architecture
Major advanced in computers
-
The family concept
-
IBM System/360 1964,DEC PDP-8
-
Separates architecture from implementation
-
-
Microprogrammed control unit
-
Idea by Wilkes 1951,Produced by IBM S/360 1964
-
Easier controller design and implementation
-
-
Cache
-
IBM S/360 model 85 1969
-
Greatly improves the performance of computer
-
-
Microprocessors
-
Intel 4004 1971
-
Reduced the size of the computer
-
-
Pipelining
-
Introducing parallelism into instruction execution
-
Greatly improves instruction throughput
-
-
Multiple processors
-
Multiple processors combine to form a new architecture
-
Further improve the processing capacity of the computer
-
-
RISC
-
Major challenges to traditional CPU
-
It has been widely used
-
Learn from RISC and be widely used in different fields
-
-
Solid State RAM
- Access speed is much faster than mechanical hard disk
- Greatly improves the computer’s I/O performance
Register and ISAs
-
Accumulators
-
Early stored-program computers had one register
-
Very inconvenient to use
-
Requires a memory-based operand-addressing mode in instruction
-
Example
|
|
-
Add the accumulator to the word in memory at address 200
-
Place the sum back in the accumulator
-
Next step, more registers…
-
Dedicated registers
-
separate accumulators for multiply or divide instructions
-
op-of-stack pointer
-
-
Extended Accumulator
- Increase bits of register
- Increase the number of registers
-
-
More flexible
-
Next step, more registers…
- General-purpose registers
- Registers can be used for any purpose E.g. MIPS, ARM, x86
- General-purpose registers
-
Instruction architecture
-
Register-memory architectures
- One operand may be in memory (i.e. 80386 processors)
-
Register-register architectures (load-store)
- All operands must be in registers E.g. MIPS, ARM
-
Comparison of processors
-
指令数量、长度、寻址方式等方面,CISC明显高于RISC
-
RISC和超标量的通用寄存器数量比CISC要多
-
RISC和超标量一般采用硬布线控制,所以没有配置控制存储器
Reduced Instruction Set Architecture
CISC
-
CISC–Complex Instruction Set Computer
-
With the development of computer technology, an instruction set design concept is proposed
-
The gap between high-level programming languages and instruction sets is growing
-
Need more complex compilers to translate high-level languages
-
A high-level language statement requires multiple instructions to complete,low execution efficiency
-
-
CISC proposed by instruction set designer
-
Provides more types of instructions, and even uses a single instruction to implement complex high-level language statements
-
Provide more addressing modes to meet the needs of high-level languages for various addressing
-
Characteristics of CISC
-
There are many kinds of instruction opcodes
- The opcode of X86 is 1-2 bytes
-
Variable instruction length
- The instruction length of X86 is 1-16 bytes
-
Various addressing modes
- X86 has 9 addressing modes, such as base address proportional index band offset
Ideal effect of CISC
-
Compilers are easy to write
-
The instruction set provides many types of instructions
-
The compiler can use the most appropriate instructions to translate statements in high-level languages. Less instructions after compilation and less space
-
The instruction set provides rich addressing modes
-
Meet the requirements of high-level language for flexible addressing mode
-
-
Improve the execution efficiency of high-level language statements
-
It used to require multiple instructions to complete statements in a high-level language, but now it can be completed with one instruction, and some high-level language statements can be implemented at the hardware level
-
Instruction level addressing mode to help implement complex instructions
-
Instruction execution characteristics
-
Operations performed
- Functions that can be completed by CPU and interaction with memory
-
Operands used
- Type and frequency of operands,which determine the organization and addressing mode of the storage system
-
Execution sequencing
- Control function and pipeline organization
Operations
-
It is best to optimize the most used and time-consuming statements
-
Assignments
- Movement of data
-
Condition(Loop and IF)
- Sequence control
-
Procedure call-return
- Very time consuming
-
过程调用和返回是典型的高级语言程序中最耗时的操作
-
循环语句、条件语句和幅值语句也占很大的比重
- Operands
-
主要是局部标量变量
-
优化方向应集中于本地变量的存储和访问
-
Procedure calls
-
Very time consuming
-
Depends on number of parameters passed
-
Depends on level of nesting
-
-
Characteristics
-
Number of parameter is mostly less than 6
-
Most variables are local
-
Most programs do not do a lot of calls followed by lots of returns
-
-
It further explains that operand access is highly localized
Summary
-
Assignment statement
-
high proportion, valuable to improve efficiency
-
Need to access cache or storage
-
Use register access to reduce memory access and improve efficiency
-
-
Condition and procedure calls
-
time consuming, high proportion, valuable to improve efficiency
-
Influence the execution of the pipeline
-
Design a better pipeline to reduce the impact of transfer statements on the water line
-
单纯依靠提供接近于高级语言的指令并不一定能提高典型语句的执行效率
-
Ideal 1 of CISC: Compilers are easy to write
-
Implementation method: complex instruction
-
Compiler simplification?
-
Because of strict requirements of instruction design,compiler needs high-level language strictly meet the instruction
-
The compiler needs to optimize machine instructions to reduce the length of generated machine code and meet the requirements of pipeline operation. This is also difficult to achieve for complex instructions Optimization more difficult
-
-
Ideal 2 of CISC: smaller programs
-
Implementation method: mores instruction
-
Smaller programs?
-
Program looks using less memory
-
Memory is now cheap
-
May not occupy less bits, just look shorter in symbolic form
-
More instructions require longer op-codes
-
CISC has no fewer machine instructions than RISC
-
-
Ideal 3 of CISC: high efficiency
-
Implementation method: more instructions and more addressing mode
-
Faster programs?
-
Compiler bias towards use of simpler instructions
-
CISC need more complex control unit
-
Microprogram control store larger
-
Simple instructions take longer to execute
-
Conclusion
-
The goal CISC hopes to achieve is actually contradictory to the way CISC realizes it
-
Target: improve operational efficiency by optimizing the most frequently used and time-consuming functions
-
Following methods may be better choices
-
More registers to reduce memory access
-
Careful design of pipeline to improve the efficiency of the pipeline
-
Careful designed simple instruction set to improve the efficiency of instruction execution
-
RISC
-
RISC: Reduced Instruction Set Computer
-
Key features
-
Large number of general purpose registers
-
Compiler technology to optimize register use
-
Limited and simple instruction set
-
Emphasis on optimising the instruction pipeline
-
Main contributors of RISC
-
John Cocke: IBM
- Influences: Known as “the father of RISC Architecture”. Turing Award Recipient and National Medal of Science
-
Dave Patterson: UC Berkeley
- Influences: Sun SPARC from his achievements
-
John L. Hennessy: Stanford
- Influences: In 1984, MIPS (Microprocessor without interlocked pipelined stages) was founded
Typical features of RISC
-
Simplified instruction set
-
Standardized, fixed length instruction format (ARM instructions are all 32-bit)
-
Limited operation types, only 8-bit operation code
-
Fetching and decoding instruction become easier
-
One instruction per machine cycle
-
Hardwired design (no microcode)
-
-
Use registers whenever possible
-
Except for load/save instructions, other instructions are for register operations
-
Data operations can only be performed in registers
-
Memory access has only three addressing modes
-
-
Better pipeline design
-
Instruction pipeline is carefully designed to better meet the impact of conditional branches and procedure calls on the flow pipeline
-
Each instruction is conditionally executed, which can reduce branches
-
Influence of RISC concept
-
RISC’s design concept gave birth to a series of new computer architectures
- Simplified pipeline design on RISC instruction set is becoming more and more attractive
-
RISC and CISC learn from each other
-
Make up for the disadvantage of CISC pipeline implementation
-
RISC is also learning from CISC, and both sides are learning from each other
-
-
Current situation
-
With the development of architecture and microelectronics technology, the so-called disadvantage of CISC in structure is gradually reduced
-
RISC’s theory of superiority has gradually died down, and RISC camp has been losing ground
Focus on the design and implementation of micro structures and physics, and explore the possibilities buried in operating systems, compilers and upper applications
Summary
-
The number of available registers greatly influenced the instruction set architecture (ISA)
-
Complex Instruction Set Computers were very complex
-
CISC was necessary
-
The processor of X86 architecture dominates the server and desktop market
-
X86 draws many advantages from RISC
-
-
RISC is still widely concerned and applied
-
ARM occupies a major share of the embedded market
-
Mobile phones, tablets PC and various sensors in daily life mostly adopt ARM architecture
-
ARM borrows a bit from both RISC and CISC
-
The Use of a Large Register File
-
Target: Keep frequently accessed operands in registers
-
Software solution
-
Require compiler to allocate registers
-
Allocate based on most used variables in a given time
-
Requires sophisticated program analysis
-
-
Hardware solution
-
Have more registers
-
Thus more variables will be in registers
-
How and problem?
-
Limited number of registers requires reasonable use, and the locality principle provides the possibility
-
Store local scalar variables in registers
-
Reduces memory access
-
-
Every procedure (function) call changes locality
-
Parameters must be passed
-
Results must be returned
-
Variables from calling programs must be restored
-
Register windows
-
Characteristics of procedure call
-
Only few parameters
-
Limited range of depth of call
-
-
Divide the available registers into several small registers set
-
Calls switch to a different set of registers
-
Returns switch back to a previously used set of registers
-
-
Set of registers called register windows
-
Three areas within a register set
-
Parameter registers
-
Local registers
-
Temporary registers
-
-
Temporary registers from one set overlap parameter registers from the next
- This allows parameter passing without moving data
-
At any time, only one register window is visible

Overlapping register windows
-
本级的临时变量寄存器和下一级的参数寄存器在物理上是同一个,在传递参数时,不需要移动数据
-
程序中过程的调用和返回的数量不确定,所以寄存器窗口应该足够多,以保证所有的过程调用都能用到
-
由于寄存器的数量有限,只能保证少数最近的过程能够使用寄存器。更早的过程调用还是需要保存到存储器中。当嵌套深度减少的时候,再将数据从存储器恢复到寄存器中
-
这种方式称为环形缓冲窗口

Circular buffer diagram
-
寄存器窗口以一种部分重叠的形式形成一个环形。当环形寄存器窗口都充满了后,再有过程调用,把最早的寄存器窗口保存到存储器
-
调用时,移动当前窗口指针以显示当前活动的寄存器窗口
-
如果所有窗口都在使用中,将生成一个中断,并将最早的窗口(调用嵌套中最远的窗口)保存到内存中
-
保存的窗口指针标识最近保存在内存中的窗口
-
当过程返回的时候,CWP会回退一个。当CWP回退到和SWP一样的时候,就会引起一个中断,导致保存到存储器中的寄存器窗口恢复
-
嵌套层数不会太深,所以一般不会保存到存储器中
Global variables
-
Allocated by the compiler to memory
-
Inefficient for frequently accessed variables
-
Frequent access to memory, low efficiency
-
-
A set of registers for global variables
-
compiler determines which global variables can be placed in global registers
-
Replacement is also determined by the compiler
-
Registers v cache
-
Cache
- Inserting cache between processor and memory can solve the problem of speed difference
-
Register
- organized in the form of windows, which is similar to a small fast buffer. It stores a subset of all variables that may be used many times
-
寄存器组中保存的是所有局部标量变量。cache保存的是最近使用过的标量变量
-
寄存器组中保存的是个别的变量。cache中保存的是内存中的一个块
-
寄存器组方案中,需要编译器来决定全局变量的保存。cache中则是根据最近使用原则进行管理
-
寄存器的数据保存或者恢复,依赖的是过程调用嵌套的深度。cache根据替换算法进行替换
-
寄存器组采用的是寄存器寻址。cache采用的是内存寻址
-
The register saves time because all local scalar variables are retained
- Not efficient use of space, because not all procedures will need the full window space allocated to them
-
The cache may make more efficient use of space because it stores necessary data dynamically

Register access
-
要访问基于窗口的寄存器组中的一个标量变量,需要给出窗口号和一个寄存器号
-
通过一个相对简单的译码器,就可以得到对应的寄存器,读出这个数据

Cache access
-
对于cache访问,需要生成一个完整的地址,操作的复杂度和寻址方式有关
-
进行对比,看数据是否命中
-
如果命中,就可以读取数据
-
如果没有命中,那么就需要先替换cache行,然后才能得到数据
Compiler-Based Register Optimization
-
HLL programs have no explicit references to registers
-
Optimizing use is up to compiler
-
Assign symbolic or virtual register to each candidate variable
-
Map symbolic registers to real registers
-
Symbolic registers that do not overlap can share real registers
-
If you run out of real registers,some variables use memory
-
-
The essence is to judge which data needs to be put in the register at any time
Graph coloring
-
Symbol registers is more than register
-
determine which symbol registers can use the actual registers
-
Using Graph Coloring of topology
-
Given a graph of nodes and edges
-
Assign a colour to each node
-
Adjacent nodes have different colours
-
Use minimum number of colours
-
-
Nodes are symbolic registers
-
Two registers that are live in the same program fragment are joined by an edge
-
Try to colour the graph with n colours, where n is the number of real registers
-
Nodes that can not be coloured are placed in memory
Graph colouring approach

-
构造无向图:A和BC在时间上有重叠,A和BC有连线;B和所有的节点都有时间重叠,B和所有的节点都有连线。C和ABD有时间上的重叠,C和ABD有连线
-
从A开始,给A赋一个灰色,B和C必须要和A不一样,并且B和C也不能一样,给B附一个顺斜杠,C赋一个反斜杠。D节点和BC相连,和A不相连,D点可以用灰色。E节点和BD相连,E可以用C的颜色。F和BED相连,而BDE分别是正斜杠、灰色和反斜杠,F必须要用到第四个颜色
-
如果物理寄存器只有3个的话,那么F就需要保存到存储器中了,通过加载和保存来处理
Large Register vs Compiler
-
When the number of registers is small, the effect will be better by optimizing the registers
-
When the number of registers is large, the effect of register optimization will not be very good
-
Optimization of registers is mainly for the case of a small number of registers
RISC Pipelining
-
Most instructions are register to register
-
Two phases of execution
-
I: Instruction fetch
-
E: Execute
- ALU operation with register input and output
-
-
For load and store
-
I: Instruction fetch
-
E: Execute: Calculate memory address
-
D: Register to memory or memory to register operation
-


-
图a是没有采用流水线技术,完全按照顺序来执行
-
图b是采用两阶段流水线的执行情况。由于同时只能有1个存储器访问,取指和存储会冲突,导致取指会延后一个时钟周期
-
存储器的访问限制导致了时钟周期的浪费
-
如果存储支持2个访问,可以用三阶段流水线
-
指令相关性:Add rC$\leftarrow$rA+rB,指令需要的操作数为rA和rB。而第二条指令要到第四个时钟周期才能得到rB。所以要插入一个空指令NOOP
-
指令执行阶段,通常涉及到寄存器的读和ALU的操作,把E阶段进一步分为E1和E2。其中E1阶段完成寄存器的读,而E2阶段则完成ALU操作和寄存器的写操作
-
使用四阶段流水线来提高效率。但同样需要考虑相关性问题
Optimization of Pipelining
-
Dependency of data and branch will disrupt the pipeline and affect the efficiency
-
Two methods: Delayed branch, Loop Unrolling
-
Delayed branch
-
Branch instruction affects only the instructions that follow it
-
This following instruction is the delay slot
-
Arrange a useful instruction to replace the NOOP instruction
-
delayed branch
-
Calculate result of branch before unusable instructions pre-fetched
-
Instructions that are not affected by branches are immediately followed by branch
-
Keeps pipeline full while fetching new instruction stream
-
-
Not as good for superscalar
-
Multiple instructions need to execute in delay slot
-
Instruction dependence problems
-
Often use branch prediction
-
-
Problem: How do you find instructions to fill the delay slots?
-
Branch must be independent of delay slot instructions
-
Unconditional branch: Easier to find instructions to fill the delay slot
-
Conditional branch: Condition computation should not depend on instructions in delay slots → difficult to fill the delay slot
-
Advantage
-
Keeps the pipeline full with useful instructions in a simple way assuming
-
Number of delay slots = number of instructions to keep the pipeline full before the branch resolves
-
All delay slots can be filled with useful instructions
-
Disadvantage
-
Not easy to fill the delay slots (even with a 2-stage pipeline)
-
Number of delay slots increases with pipeline depth, superscalar execution width
-
Number of delay slots should be variable with variable latency operations
-
Another method-Loop Unrolling
-
Replicate body of loop a number of times
- Iterate loop fewer times
- Reduces loop overhead
- Increases instruction parallelism
-
During the execution of the loop body, because of the locality principle, some data can be used for many times, which can reduce the number of times to access the
|
|
-
原始的循环是要做一个一维数组的变化。循环体中有一个语句需要执行
-
把这个循环体进行拆解,变成了2个语句,一个循环体执行了2个迭代操作
|
|
- 2个指令可以并行进行。原来每次循环需要访问3次存储器,现在相当于2次循环只需要访问4次存储器,降低了存储器访问的次数
RISC Versus CISC Controversy
-
Not clear cut
-
RISC designs may benefit from the inclusion of some CISC features and that
-
CISC designs may benefit from the inclusion of some RISC features
-
-
Many designs borrow from both philosophies
-
PowerPC are no longer “pure” RISC
-
Pentium II and later Pentium models do incorporate some RISC characteristics
-