Computer Organization and Architecture
Parallel Processing & Multicore Computers
Outline
- Parallel Processing
  - Multiple Processor Organizations
  - Symmetric Multiprocessors
  - Clusters
  - Nonuniform Memory Access
  - Vector Computation
- Multicore Computers
Multiple Processor Organizations
Types of multiple processor organization
- Single instruction, single data stream – SISD
- Single instruction, multiple data stream – SIMD
- Multiple instruction, single data stream – MISD
- Multiple instruction, multiple data stream – MIMD
SISD Organizations
- An SISD organization contains one control unit (CU), one processing unit (PU), and one memory unit (MU)
- The CU sends the instruction stream to the PU, and the MU supplies the data stream to the PU
- The PU operates on the data stream from the MU according to the instruction stream issued by the CU and produces the results
- SISD has no parallel capability: the PU simply carries out the operations dictated by the CU's instruction stream

SIMD Organizations
- SIMD (single instruction, multiple data stream) has one control unit and multiple processing units, each with its own local memory
- The control unit broadcasts the instruction stream to the processing units, which execute it synchronously in lockstep
- The different processors execute the same instruction on different data sets and produce different results
- In essence, the same processing is applied to different data sets, yielding a set of results in parallel; this parallelism improves efficiency
- Vector and array processors belong to the SIMD category
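As a rough software analogy (an assumed example, not part of the original slides), an elementwise NumPy operation applies one operation across many data elements at once, which is the same idea SIMD lockstep execution implements in hardware:

```python
# One "instruction" (add) is applied to every element of the data sets at once;
# NumPy dispatches this to vectorized (often SIMD) machine code internally.
import numpy as np

a = np.arange(8, dtype=np.float64)   # data set 1: 0, 1, ..., 7
b = np.full(8, 10.0)                 # data set 2: eight copies of 10.0

c = a + b                            # same operation, different data elements
print(c)                             # [10. 11. 12. 13. 14. 15. 16. 17.]
```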
MISD
- A single sequence of data
- Transmitted to a set of processors
- Each processor executes a different instruction sequence on that data
- Never been implemented

MIMD Organizations
- MIMD (multiple instruction, multiple data stream) has multiple control units (CUs) and multiple processing units (PUs); for memory there are two organizations
  - Shared memory: all PUs share one memory, and the data are stored in the shared memory
  - Distributed memory: each PU has its own local memory, and the machines are connected through an interconnection network
- A set of processors simultaneously executes different instruction sequences, each on its own data set (a small sketch follows this list)
- Symmetric multiprocessors (SMP), clusters, and nonuniform memory access (NUMA) systems all belong to the MIMD class
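A minimal MIMD-flavoured sketch (an assumed example): two independent processes run different instruction sequences on different data sets at the same time.

```python
# Each process is a separate instruction stream working on its own data set.
from multiprocessing import Process, Queue

def sum_task(data, out):            # instruction stream 1
    out.put(("sum", sum(data)))

def max_task(data, out):            # instruction stream 2
    out.put(("max", max(data)))

if __name__ == "__main__":
    results = Queue()
    p1 = Process(target=sum_task, args=([1, 2, 3, 4], results))
    p2 = Process(target=max_task, args=([9, 7, 5, 3], results))
    p1.start(); p2.start()
    p1.join(); p2.join()
    print(results.get(), results.get())   # e.g. ('sum', 10) ('max', 9)
```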
Symmetric Multiprocessors
SMP
- Tightly coupled
- Processors share memory and I/O
  - Share a single memory or pool of memory
  - A shared bus is used to access memory
  - A public area in the shared memory stores status information, providing communication between the processors
  - Memory access time to a given region of memory is approximately the same for each processor
- Characteristics of SMP
  - Two or more processors of similar capability
  - All processors share memory and I/O
  - All processors share access to I/O
  - All processors can perform the same functions
  - The system is controlled by a centralized operating system

Symmetric Multiprocessor Organization
- Each processor has its own L1 cache and may also have its own L2 cache
- The processors are all attached to a shared system bus and share access to main memory
- The I/O subsystem is also attached to the system bus, and the processors share access to it
SMP Advantages
- High performance
  - Greatly improved performance if some of the work can be done in parallel
- High availability
  - All processors can perform the same functions
  - Failure of a single processor does not halt the system
- Incremental growth
  - Flexible system expansion
  - Users can enhance performance by adding processors
- Scaling
  - Vendors can offer a range of products based on the number of processors
  - Different products have different prices and performance, giving users more choice
Design Issues
- An SMP system is managed by a single, unified operating system
- The operating system is responsible for scheduling processes and resources
- The operating system must handle
  - Simultaneous concurrent processes
  - Scheduling
  - Synchronization
  - Memory management
  - Reliability and fault tolerance
- Simultaneous concurrent processes
  - Allow multiple processors to execute the same piece of OS code at the same time
  - Manage OS tables and other structures so as to avoid deadlock
- Scheduling
  - Schedule processes onto processors efficiently
- Synchronization
  - Provide synchronization mechanisms that ensure mutual exclusion and ordering of memory and I/O accesses (a small sketch follows this list)
- Memory management
  - Handle concurrency and consistency problems
  - Ensure performance and correctness with multiple processors
- Reliability and fault tolerance
  - On a processor failure, the operating system should be able to restructure the system so that it can continue to run in a degraded mode
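A minimal sketch (an assumed example) of the mutual-exclusion idea using a lock; real SMP kernels build this on hardware atomic instructions, but the principle is the same:

```python
# Four threads update a shared counter; the lock serializes the critical section
# so that updates from different threads cannot interleave and be lost.
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:                  # mutual exclusion around the shared update
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                      # always 400000 with the lock in place
```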
Clusters
- Loosely coupled
- A collection of independent uniprocessors or SMPs
- Interconnected to form a cluster
- Communication via fixed paths or network connections
- Composition of a cluster
  - A group of interconnected whole computers
  - Working together as a unified resource
  - Illusion of being one machine
  - Each computer is called a node
- Characteristics
  - High performance
  - High availability
  - An alternative to SMP
  - Used for server applications
Cluster Benefits
- Absolute scalability
  - Hundreds or even thousands of independent computers can be combined into one large cluster whose processing capacity can far exceed that of the largest standalone machine
  - Each machine in the cluster can be a uniprocessor or a multiprocessor
- Incremental scalability
  - New nodes can be added to the cluster step by step, so processing capacity can be expanded gradually
  - Very flexible capacity expansion
- High availability
  - Each node is an independent computer
  - The failure of one or more nodes does not stop the cluster from being used
  - Node fault diagnosis and fault tolerance are handled automatically by the system
- Superior price/performance
  - Mature commodity computers are combined into a cluster
  - The resulting system performance can far exceed that of a single large server
  - High performance per unit cost
Blade Servers
- A common implementation of a cluster
- The server houses multiple server modules (blades) in a single chassis
  - Saves space
  - Improves system management
- The chassis provides the power supply
- Each blade has its own processor, memory, and disk

Cluster v. SMP
- Both provide multiprocessor support for high-demand applications
- Both are available commercially
  - SMP has been on the market far longer
- SMP
  - Easier to manage and control
  - Closer to single-processor systems
  - Scheduling is the main added concern
  - Less physical space
  - Lower power consumption
- Clustering
  - Superior incremental and absolute scalability
  - Superior availability through redundancy
Nonuniform Memory Access
NUMA
- Tightly coupled
- Nonuniform memory access
  - Access times to different regions of memory may differ
- Main objectives
  - Overcome the limit on the number of processors in an SMP
  - Solve the problems caused by each node in a cluster having its own private memory
- An alternative to SMP and clustering
- Uniform memory access (UMA)
  - All processors have access to all parts of memory
  - Access uses load and store instructions
  - Access time to all regions of memory is the same
  - Access time to memory is the same for all processors
  - This is the organization used by SMP
- Nonuniform memory access (NUMA)
  - All processors have access to all parts of memory
  - Access uses load and store instructions
  - A processor's access time differs depending on the region of memory
  - Different processors access different regions of memory at different speeds
- Cache-coherent NUMA (CC-NUMA)
  - Cache coherence is maintained among the caches of the various processors
  - A NUMA system without cache coherence maintenance behaves much like a cluster
  - CC-NUMA is the variant discussed here
  - Significantly different from both SMP and clusters
- Motivation
  - SMP has a practical limit on the number of processors
    - Bus traffic limits it to between 16 and 64 processors
  - In a cluster, each node has its own memory
    - Applications do not see a large global memory
    - Coherence is maintained by software, not hardware
  - NUMA retains the SMP flavour while supporting large-scale multiprocessing
    - e.g. the Silicon Graphics Origin NUMA system with 1024 MIPS R10000 processors
- Objective
  - Maintain a transparent system-wide memory while permitting multiprocessor nodes, each with its own bus or internal interconnection system

CC-NUMA Organization
- A NUMA system consists of multiple nodes; each node contains several processors, each with its own L1 and L2 caches, and the node has its own internal bus, its own main memory, and its own I/O
- When a processor accesses memory, its cache is checked first; on a hit, the data is fetched over the internal bus. On a miss, the cache accesses the local main memory; if the data is not there either, the cache issues a request that travels over the interconnection network to the remote node, and the returned data is placed on the local bus, where the requesting cache reads it. All of this happens automatically and is transparent to the processor and its cache (a small latency sketch follows).
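A toy model (an assumed example; the latency numbers are purely illustrative) of why access is "nonuniform": a reference that stays on the local node is cheaper than one that must cross the interconnection network.

```python
# Modelled memory access cost in a two-node NUMA system (illustrative numbers).
LOCAL_LATENCY_NS = 100       # access to the node's own main memory
REMOTE_LATENCY_NS = 300      # access that must cross the interconnection network

def access_latency(cpu_node: int, page_node: int) -> int:
    """Return the modelled latency for a CPU on cpu_node reading memory on page_node."""
    return LOCAL_LATENCY_NS if cpu_node == page_node else REMOTE_LATENCY_NS

print(access_latency(0, 0))  # 100 -> local access
print(access_latency(0, 1))  # 300 -> remote access, hence "nonuniform"
```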
Vector Computation
- Mathematical problems involving physical processes present particular difficulties for computation
  - Aerodynamics, seismology, meteorology
  - Continuous field simulation
- Requirements
  - High precision
  - Repeated floating-point calculations on large arrays of numbers (a small sketch follows this list)
- Solution 1: the supercomputer
  - Hundreds of millions of floating-point operations per second
  - Optimized for vector computation
  - $10-15 million
  - Limited market
    - Research, government agencies, meteorology
- Solution 2: the array processor
  - An alternative to the supercomputer
  - Configured as a peripheral to a mainframe or minicomputer
  - Runs only the vector portion of problems
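To make "repeated floating-point calculations on large arrays" concrete, here is a small assumed timing sketch contrasting an element-at-a-time loop with a whole-array (vector) operation, the pattern vector hardware is built to accelerate:

```python
# Scalar loop vs. whole-array operation on the same data (illustrative only).
import time
import numpy as np

n = 1_000_000
a = np.random.rand(n)
b = np.random.rand(n)

t0 = time.perf_counter()
c_loop = [a[i] * b[i] + 1.0 for i in range(n)]   # element-at-a-time
t1 = time.perf_counter()
c_vec = a * b + 1.0                              # one vector operation over the arrays
t2 = time.perf_counter()

print(f"loop:   {t1 - t0:.3f} s")
print(f"vector: {t2 - t1:.3f} s")                # typically far faster than the loop
```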
Multicore Computers
What is a Multicore Computer?
- Also known as a single-chip multiprocessor
- Two or more processors are integrated on a single chip; each processor is called a core
- Each core contains all the components of an independent processor: register set, ALU, pipeline hardware, control unit, and L1 data and instruction caches
- Some multicore processors also include an L2 cache and an L3 cache on the chip
Hardware Performance Issues
- Microprocessors have seen an exponential increase in performance
  - Improved organization
  - Increased clock frequency
- Increase in parallelism
  - Pipelining
  - Superscalar execution
  - Simultaneous multithreading (SMT)
- Simultaneous multithreading
  - SMT fetches instructions from multiple threads and can execute instructions from different threads at the same time
  - An SMT design provides multiple program counters and multiple register sets, while the underlying instruction and data caches are shared, so the pipeline resources are shared among the threads
  - With SMT the hardware adapts dynamically, issuing instructions from different threads simultaneously where possible; when one thread hits a long-latency event, another thread is allowed to use all of the execution units (a toy simulation follows)
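A toy, assumed simulation of the SMT idea: two threads with their own program counters compete for one issue slot per cycle, and when one thread stalls on a long-latency instruction the other keeps the pipeline busy.

```python
# Toy model: one issue slot per cycle shared by two hardware threads.
def simulate(threads, cycles):
    """threads: dict name -> list of instruction latencies (in cycles)."""
    pc = {t: 0 for t in threads}            # one program counter per thread
    ready_at = {t: 0 for t in threads}      # cycle when the thread can issue again
    trace = []
    for cycle in range(cycles):
        for t in threads:                   # issue from the first ready thread
            if cycle >= ready_at[t] and pc[t] < len(threads[t]):
                trace.append((cycle, t, pc[t]))
                ready_at[t] = cycle + threads[t][pc[t]]
                pc[t] += 1
                break
    return trace

# Thread A hits a long-latency instruction (e.g. a cache miss costing 10 cycles);
# thread B fills the otherwise idle issue slots.
for cycle, thread, instr in simulate({"A": [1, 10, 1], "B": [1, 1, 1, 1]}, cycles=8):
    print(f"cycle {cycle}: issue {thread}[{instr}]")
```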
Hardware Performance Issues
- Processor performance continues to improve through
  - Adjustments to the chip organization
  - Increases in clock frequency
- Diminishing returns
  - More complexity requires more logic
  - More chip area is needed for coordination and signal-transfer logic
  - Harder to design, fabricate, and debug
  - Hardware performance reaches a bottleneck that is very difficult to push past
- Power consumption
  - Power requirements grow exponentially with chip density and clock frequency
  - Increased power consumption causes CPU cooling problems
  - It is increasingly difficult to improve performance simply by increasing chip integration
- One solution is to use more of the chip area for cache
  - Memory transistors require little power
  - Cache is close to the CPU and fast
  - By 2015, roughly 100 billion transistors on a 300 mm² die: a 100 MB cache would still leave 1 billion transistors for logic
  - Large on-chip caches provide a basic resource for multicore processors
Pollack’s rule
- Pollack's rule
  - Performance increase is roughly proportional to the square root of the increase in complexity
  - Doubling the complexity gives only about 40% more performance (see the sketch below)
- So integrating multiple processor cores on one chip becomes the better solution
  - Multicore brings the performance improvement close to linear
  - It is unlikely that a single core could use all of the on-chip cache effectively
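The arithmetic behind Pollack's rule, as a small sketch (an assumed example):

```python
# Pollack's rule: performance ~ sqrt(complexity).
import math

def pollack_performance(complexity_ratio: float) -> float:
    """Relative performance gained from scaling a single core's complexity."""
    return math.sqrt(complexity_ratio)

print(round(pollack_performance(2.0), 2))   # 1.41 -> doubling complexity buys ~40%
# Two simpler cores instead of one doubly complex core give ~2x peak throughput,
# provided the workload can actually be parallelized (see Amdahl's law below).
```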
Software Performance Issues
-
Performance benefits dependent on effective exploitation of parallel resources
-
Amdahl’s Law
-
Even small amounts of serial code impact performance
-
10% inherently serial on 8 processor system gives only 4.7 times performance
-
-
Other factors affecting performance: communication, distribution of work and cache coherence overheads
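Amdahl's law as a small sketch (an assumed example), reproducing the 4.7x figure quoted above:

```python
# Amdahl's law: speedup = 1 / (f_serial + (1 - f_serial) / n_processors)
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

print(round(amdahl_speedup(0.10, 8), 2))   # 4.71 -> the "only 4.7 times" figure
print(round(amdahl_speedup(0.00, 8), 2))   # 8.0  -> ideal speedup with no serial code
```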

Performance Effect of Multiple Cores
- Figure (a) shows the effect of the serial-code fraction on speedup. With no serial code, the speedup would in theory grow in proportion to the number of processors; because of the serial portion, the achieved speedup is much smaller than this theoretical value.
- Figure (b) shows the effect of management overhead on speedup. Speedup peaks at around five processors; as the number of cores grows further, the overhead causes the performance gain to fall off.
Effective Applications
- Some applications exploit multicore processors effectively
- Database servers handling independent transactions
- Multi-threaded native applications, such as Lotus Domino and Siebel CRM
- Multi-process applications, such as Oracle, SAP, and PeopleSoft
- Java applications
  - The JVM is multi-threaded, with its own scheduling and memory management
  - Sun's Java Application Server, BEA's WebLogic, IBM WebSphere, Tomcat
- Multi-instance applications
  - One application running multiple times
- Game software
Multicore Organization
- Main design variables
  - Number of core processors on the chip
  - Number of levels of cache on the chip
  - Amount of cache that is shared
- The next slide gives an example of each organization
  - (a) ARM11 MPCore
  - (b) AMD Opteron
  - (c) Intel Core Duo
  - (d) Intel Core i7
- Individual core architecture
  - Intel Core Duo uses superscalar cores
  - Intel Core i7 uses simultaneous multithreading (SMT)
    - Scales up the number of threads supported
    - 4 SMT cores, each supporting 4 threads, appear as 16 cores (see the sketch below)
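A small, assumed illustration of that last point: the operating system reports logical processors, i.e. physical cores times hardware threads per core when SMT is enabled.

```python
# The OS-visible processor count includes SMT threads, not just physical cores.
import os

logical = os.cpu_count()   # e.g. 16 for 4 SMT cores with 4 threads each
print(f"logical processors visible to software: {logical}")
```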
Intel x86 Multicore Organization
Example: Core Duo and Core i7
ARM11 MPCore
- Up to 4 processors, each with its own L1 instruction and data caches
- Distributed interrupt controller
- Timer per CPU
- Watchdog
  - Issues warning alerts on software failures
  - Counts down from a predetermined value
  - Issues a warning when it reaches zero (a software analogy follows this list)
- CPU interface
  - Interrupt acknowledgement, masking, and completion acknowledgement
- CPU
  - A single ARM11 core, called MP11
- Vector floating-point unit
  - A floating-point coprocessor
- L1 cache
- Snoop control unit
  - Maintains L1 cache coherency
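A toy software analogy (an assumed example) of the watchdog mechanism: it counts down from a preset timeout and raises a warning when it reaches zero unless the monitored code "kicks" it in time.

```python
# Minimal watchdog analogy: if kick() is not called again before the timeout
# expires, the countdown reaches zero and a warning is issued.
import threading

class Watchdog:
    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self._timer = None

    def kick(self):
        # Healthy software calls this periodically to restart the countdown.
        if self._timer:
            self._timer.cancel()
        self._timer = threading.Timer(self.timeout_s, self._expired)
        self._timer.start()

    def _expired(self):
        print("watchdog warning: monitored task appears hung")

wd = Watchdog(timeout_s=2.0)
wd.kick()   # with no further kicks, the warning fires after 2 seconds
```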
Summary of Parallelism
- Inside the CPU
  - Pipelining
  - Superscalar execution
  - Simultaneous multithreading (SMT)
- On the chip
  - Multicore
- Inside the machine
  - SMP
  - NUMA
  - Array processor
- Across machines
  - Cluster