
Computer Organization and Architecture

Parallel Processing & Multicore Computers

Outline

  • Parallel Processing

    • Multiple Processor Organizations
    • Symmetric Multiprocessors
    • Clusters
    • Nonuniform Memory Access
    • Vector Computation
  • Multicore Computers

Multiple Processor Organizations

Types of multiple processor organization

  • Single instruction, single data stream – SISD

  • Single instruction, multiple data stream – SIMD

  • Multiple instruction, single data stream – MISD

  • Multiple instruction, multiple data stream - MIMD

SISD Organizations

  • An SISD organization contains one control unit (CU), one processing unit (PU), and one memory unit (MU)

  • The CU sends an instruction stream to the PU, and the MU supplies a data stream to the PU

  • The PU operates on the data stream from the MU according to the instruction stream issued by the CU, and produces the results

  • SISD has no parallel capability: the PU simply performs the operations dictated by the single instruction stream from the CU

SIMD Organizations

  • SIMD (single instruction, multiple data stream) has one control unit and multiple processing units, each with its own local memory

  • The control unit broadcasts the instruction stream to all processing units, which execute it synchronously in lockstep

  • Different processors execute the same instruction on different data sets and produce different results

  • In essence, the same processing is applied to different data sets; a whole set of results is produced in parallel, which improves efficiency

  • Vector and array processors belong to the SIMD category
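
As a software-level illustration of "same instruction, different data", here is a minimal C loop of the kind that a vectorizing compiler or an array processor maps onto SIMD hardware; the array size is an arbitrary example value:

```c
#include <stdio.h>

#define N 8  /* arbitrary example size */

/* One operation ("add"), applied to many data elements.
 * On a SIMD machine each processing element handles one (or a few)
 * of the iterations; a vectorizing compiler can map this loop onto
 * SIMD instructions such as SSE/AVX or NEON. */
void vector_add(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];   /* same instruction, different data */
}

int main(void)
{
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    vector_add(a, b, c, N);

    for (int i = 0; i < N; i++)
        printf("%.1f ", c[i]);
    printf("\n");
    return 0;
}
```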

MISD

  • Sequence of data

  • Transmitted to set of processors

  • Each processor executes different instruction sequence

  • Never been implemented

/img/Computer Organization and Architecture/chapter17-1.png
MIMD Organizations
  • MIMD (multiple instruction, multiple data stream) has multiple control units (CUs) and multiple processing units (PUs). On the memory side, there are two organizations

    • Shared memory: all PUs share a single memory, and data resides in that shared memory

    • Distributed memory: each PU has its own local memory (LM), and the machines are connected by an interconnection network

  • A set of processors simultaneously executes different instruction sequences, each on its own data set

  • Symmetric multiprocessors (SMP), clusters, and nonuniform memory access (NUMA) systems are all MIMD architectures

Symmetric Multiprocessors

SMP

  • Tightly Coupled

  • Processors share memory and I/O

    • Share single memory or pool

    • Shared bus to access memory

    • A shared region of memory holds status information, through which the processors communicate

    • Memory access time to given area of memory is approximately the same for each processor

Characteristic of SMP

  • Two or more processors of comparable capability

    • All processors share access to main memory

    • All processors share access to I/O devices

    • All processors can perform the same functions

  • Controlled by a centralized operating system

/img/Computer Organization and Architecture/chapter17-2.png
Symmetric Multiprocessor Organization
  • Each processor has its own L1 cache, and may also have its own L2 cache

  • All processors are attached to a shared system bus and share access to main memory

  • The I/O subsystem is also attached to the system bus, and the processors share access to it

SMP Advantages

  • High performance

    • Greatly improved performance if some work can be done in parallel
  • High availability

    • All processors can perform the same functions

    • Failure of a single processor does not halt the system

  • Incremental growth

    • Flexible system expansion

    • User can enhance performance by adding additional processors

  • Scaling

    • Vendors can offer range of products based on number of processors

    • Different products have different prices and performance, which can give users more choices

Design issues

  • An SMP system is managed by a single, unified operating system

  • The operating system is responsible for scheduling processes and managing resources

  • The operating system must address the following

    • Simultaneous concurrent processes

    • Scheduling

    • Synchronization

    • Memory management

    • Reliability and fault tolerance

  • Simultaneous concurrent processes

    • Allow multiple processors to execute the same piece of OS code at the same time

    • Manage OS tables and other structures to avoid deadlocks

  • Scheduling

    • Schedule processes efficiently across the available processors
  • Synchronization

    • Provide synchronization mechanisms to ensure mutual exclusion and ordered access to memory and I/O (a minimal sketch follows this list)
  • Memory management

    • Resolve concurrency and consistency problems
    • Ensure performance and correctness when memory is shared by multiple processors
  • Reliability and fault tolerance

    • When a processor fails, the operating system should be able to reconfigure the system so that it continues running in a degraded mode
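
To illustrate the mutual-exclusion requirement called out under Synchronization above, here is a minimal POSIX-threads sketch of several threads on an SMP incrementing a shared counter under a mutex; the thread and iteration counts are arbitrary illustration values:

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4          /* arbitrary example values */
#define ITERATIONS 100000

static long counter = 0;                 /* shared memory, visible to all processors */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&lock);       /* mutual exclusion across processors */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    /* Without the mutex, updates from different processors could interleave
     * and the final value would generally fall short of the expected total. */
    printf("counter = %ld (expected %ld)\n", counter, (long)NTHREADS * ITERATIONS);
    return 0;
}
```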

Clusters

  • Loosely Coupled

  • Collection of independent uniprocessors or SMPs

  • Interconnected to form a cluster

  • Communication via fixed paths or network connections (see the message-passing sketch after these bullets)


  • Composition of cluster

    • A group of interconnected whole computers

    • Working together as unified resource

    • Illusion of being one machine

    • Each computer called a node

  • Characteristics

    • High performance

    • High availability

    • Alternative to SMP

  • Server applications
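
Because cluster nodes do not share memory, they cooperate by exchanging messages over the interconnect. Below is a minimal sketch using MPI; MPI is not discussed in the source and is shown here only as a widely used message-passing interface for clusters:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id within the cluster job */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of participating processes/nodes */

    if (rank == 0) {
        int value = 42;                     /* arbitrary payload */
        for (int dest = 1; dest < size; dest++)
            MPI_Send(&value, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
    } else {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("node %d received %d from node 0\n", rank, value);
    }

    MPI_Finalize();
    return 0;
}
```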

Cluster Benefits

  • Absolute scalability

    • Hundreds or even thousands of independent computers can be combined into a large cluster whose processing capacity may far exceed that of the largest standalone computer

    • Each machine in the cluster can itself be a uniprocessor or a multiprocessor system

  • Incremental scalability

    • New nodes can be added to the cluster incrementally, so processing capacity can be expanded step by step

    • Very flexible capacity expansion

  • High availability

    • Each node is an independent computer

    • The failure of one or more nodes will not affect the use of the cluster system

    • Fault detection and fault tolerance are handled automatically by the cluster software

  • Superior price/performance

    • Mature, commodity computers are combined into a cluster

    • The aggregate performance can far exceed that of a single large server

    • The cost is much lower, giving excellent price/performance

Blade Servers

  • Common implementation of cluster

  • Server houses multiple server modules (blades) in single chassis

    • Save space

    • Improve system management

    • Chassis provides power supply

    • Each blade has processor, memory, disk

/img/Computer Organization and Architecture/chapter17-3.png
Blade Servers

Cluster v. SMP

  • Both provide multiprocessor support to high demand applications

  • Both available commercially

  • SMP has been available for much longer


  • SMP

    • Easier to manage and control

    • Closer to single processor systems

    • Scheduling is important

    • Less physical space

    • Lower power consumption

  • Clustering

    • Superior incremental & absolute scalability

    • Superior availability

      • Redundancy

Nonuniform Memory Access

NUMA

  • Tightly Coupled

  • Nonuniform memory access

    • Access times to different regions of memory may differ
  • Main objective

    • Overcome the limit on the number of processors in an SMP

    • Avoid the drawback of clusters, where each node has its own private memory and applications see no large global memory

  • Alternative to SMP & clustering


  • Uniform memory access

    • All processors have access to all parts of memory

    • Using load & store

    • Access time to all regions of memory is the same

    • Access time to memory for different processors same

    • As used by SMP

Nonuniform Memory Access

  • All processors have access to all parts of memory

  • Using load & store

  • Access time of processor differs depending on region of memory

  • Different processors access different regions of memory at different speeds


  • Cache-coherent NUMA (CC-NUMA)

    • Cache coherence is maintained among the caches of the various processors

    • A NUMA system without hardware cache coherence is more or less equivalent to a cluster

    • CC-NUMA is discussed here

    • Significantly different from SMP and clusters

Motivation

  • SMP has practical limit to number of processors

    • Bus traffic limits the practical size to somewhere between 16 and 64 processors
  • In a cluster, each node has its own private memory

    • Apps do not see large global memory
    • Coherence maintained by software not hardware
  • NUMA retains SMP flavour while giving large scale multiprocessing

    • e.g. Silicon Graphics Origin: a NUMA system with up to 1024 MIPS R10000 processors
  • Objective

    • Maintain transparent system-wide memory while permitting multiprocessor nodes

    • each with own bus or internal interconnection system

/img/Computer Organization and Architecture/chapter17-4.png
CC-NUMA Organization
  • A NUMA system consists of multiple nodes. Each node contains several processors; each processor has its own L1 and L2 caches, and each node has its own internal bus, its own main memory, and its own I/O

  • On a memory access, the processor first checks its cache. On a miss, if the address lies in the node's local memory, the cache fetches the word over the internal bus; otherwise the cache issues a request that travels across the interconnection network to the remote node holding the data, the word is delivered onto the local bus, and the requesting cache reads it from there. All of this is automatic and transparent to the processor and its cache
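
The lookup order described above can be captured as a toy decision function. This is only an illustration of the sequence of checks; the predicates and constants below are made up and do not model any real machine:

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy model of a CC-NUMA memory access; all names are hypothetical. */
enum source { FROM_CACHE, FROM_LOCAL_MEMORY, FROM_REMOTE_NODE };

/* Stub predicates standing in for the real hardware checks. */
static bool in_local_cache(unsigned long addr)  { return addr % 3 == 0; }
static bool in_local_memory(unsigned long addr) { return addr % 3 == 1; }

static enum source access_word(unsigned long addr)
{
    if (in_local_cache(addr))
        return FROM_CACHE;            /* hit in the processor's own L1/L2 */
    if (in_local_memory(addr))
        return FROM_LOCAL_MEMORY;     /* fetched over the node's internal bus */
    return FROM_REMOTE_NODE;          /* request crosses the interconnection
                                         network to the owning node, then the
                                         word is placed on the local bus for
                                         the requesting cache */
}

int main(void)
{
    const char *name[] = { "cache", "local memory", "remote node" };
    for (unsigned long addr = 0; addr < 6; addr++)
        printf("addr %lu -> %s\n", addr, name[access_word(addr)]);
    return 0;
}
```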

Vector Computation

  • Mathematical problems involving physical processes present a different kind of computational challenge

    • Aerodynamics, seismology, meteorology

    • Continuous field simulation

  • Requirement

    • High precision

    • Repeated floating point calculations on large arrays of numbers


  • Solution 1: supercomputer

    • Performs hundreds of millions of floating-point operations per second

    • Optimized for Vector Computation

    • $10-15 million

    • Limited market

    • Research, government agencies, meteorology

  • Solution 2: Array processor

    • Alternative to supercomputer

    • Configured as a peripheral attached to a mainframe or minicomputer

    • Runs only the vector portion of the problem
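
The workloads above reduce to repeated floating-point operations over large arrays. Here is a minimal C sketch of the classic SAXPY kernel ($y \leftarrow a\,x + y$), the kind of loop a vector processor streams through its pipelined floating-point units; the array size is an arbitrary example value:

```c
#include <stdio.h>

#define N 1000000  /* arbitrary example size */

/* SAXPY: y[i] = a*x[i] + y[i]. A vector processor treats the whole loop
 * as a single vector operation fed through a pipelined FP unit. */
void saxpy(float a, const float *x, float *y, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy(3.0f, x, y, N);
    printf("y[0] = %.1f\n", y[0]);   /* 3*1 + 2 = 5.0 */
    return 0;
}
```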

Multicore Computers

What is Multicore Computers?

  • Also known as single chip multiprocessor

  • Two or more processors are integrated on a single chip, and each processor is called a core

  • Each core consists of all components of an independent processor, including register set, ALU, pipeline hardware, control unit, and L1 data and instruction cache

  • Some multicore processors also include L2 cache and L3 cache on the chip

Hardware Performance Issues

  • Microprocessors have seen an exponential increase in performance

    • Improved organization

    • Increased clock frequency

  • Increase in Parallelism

    • Pipelining

    • Superscalar

    • Simultaneous multithreading

Simultaneous multithreading

  • Simultaneous multithreading (SMT) fetches instructions from multiple threads and can execute instructions from different threads at the same time

  • An SMT architecture provides multiple program counters and multiple register sets, while the instruction cache, data cache, and pipeline resources are shared among the threads

  • With SMT, the processor adapts dynamically, issuing instructions from different threads simultaneously whenever possible; when one thread stalls on a long-latency event, another thread is allowed to use all of the execution units

Hardware Performance Issues

  • Processor performance continues to improve

    • Refinements to the chip organization

    • Increases in clock frequency

  • Diminishing returns

    • More complexity requires more logic

    • Need more chip area for coordinating and signal transfer logic

    • Harder to design, make and debug

    • Single-core hardware performance is reaching a bottleneck that is increasingly difficult to push past

Power consumption

  • Power requirements grow exponentially with chip density and clock frequency

  • Increased power consumption causes CPU cooling problems

  • It is increasingly difficult to improve performance by improving chip integration


  • One solution is use more chip area for cache

    • Memory transistors consume relatively little power

    • Cache is close to CPU and fast

    • By 2015, roughly 100 billion transistors on a 300 mm² die were expected, enough for about 100 MB of cache while still leaving about 1 billion transistors for logic

  • A large on-chip cache provides a basic resource for multicore processors

Pollack’s rule

  • Pollack’s rule

    • Performance is roughly proportional to square root of increase in complexity

    • Doubling the complexity gives only about 40% more performance (see the calculation after this list)

  • So, integrating multiple processor cores on one chip becomes the better solution

    • Multicore makes performance close to linear improvement

    • Unlikely that one core can use all cache effectively
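
The 40% figure follows from the square-root relationship; with $C$ for complexity and $P$ for performance:

$P \propto \sqrt{C} \quad\Rightarrow\quad \dfrac{P_{2C}}{P_{C}} = \sqrt{2} \approx 1.4$

Doubling a single core's complexity therefore buys roughly 40% more performance, whereas two cores of the original complexity can approach a 2x gain on sufficiently parallel workloads.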

Software Performance Issues

  • Performance benefits dependent on effective exploitation of parallel resources

    • Amdahl’s Law

    • Even small amounts of serial code impact performance

    • 10% inherently serial code on an 8-processor system gives only about a 4.7x speedup (see the check after this list)

  • Other factors affecting performance: communication, distribution of work and cache coherence overheads
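
The 4.7x figure is Amdahl's Law, $\text{speedup} = 1 / \big(f + (1-f)/N\big)$, with serial fraction $f = 0.1$ and $N = 8$ processors. A minimal C check:

```c
#include <stdio.h>

/* Amdahl's Law: speedup = 1 / (f + (1 - f) / n),
 * where f is the inherently serial fraction and n the processor count. */
static double amdahl(double f, int n)
{
    return 1.0 / (f + (1.0 - f) / n);
}

int main(void)
{
    printf("f = 0.10, n = 8 -> speedup = %.2f\n", amdahl(0.10, 8)); /* ~4.71 */
    printf("f = 0.00, n = 8 -> speedup = %.2f\n", amdahl(0.00, 8)); /* ideal 8.00 */
    return 0;
}
```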

/img/Computer Organization and Architecture/chapter17-5.png
Performance Effect of Multiple Cores
  • Figure (a) shows the effect of the serial-code fraction on speedup. With no serial code, speedup would in theory grow in proportion to the number of cores; even a small serial fraction makes the achieved speedup much smaller than this ideal

  • Figure (b) shows the effect of management overhead on speedup. In this example the speedup peaks at five processors; as the number of cores grows further, the overhead causes the performance gain to diminish

Effective Applications

  • Some applications effectively exploit multicore processors

    • Database

    • Servers handling independent transactions

    • Multi-threaded native applications, such as Lotus Domino, Siebel CRM

    • Multi-process applications, such as Oracle, SAP, PeopleSoft

  • Java applications

    • The JVM is inherently multi-threaded, with threads for scheduling and memory management

    • Sun’s Java Application Server, BEA’s Weblogic, IBM Websphere, Tomcat

  • Multi-instance applications

    • One application running multiple times
  • Game Software

Multicore Organization

  • Number of processor cores on the chip

  • Number of levels of cache on chip

  • Amount of shared cache

  • Examples of each organization

    • (a) ARM11 MPCore

    • (b) AMD Opteron

    • (c) Intel Core Duo

    • (d) Intel Core i7

Individual Core Architecture

  • Intel Core Duo uses superscalar cores

  • Intel Core i7 uses simultaneous multi-threading (SMT)

    • Scales up the number of threads supported
    • e.g. an i7 with 4 SMT cores, each supporting 2 threads, appears to software as 8 logical processors
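
A minimal sketch of how software observes this, assuming a Linux/glibc-style system where `_SC_NPROCESSORS_ONLN` is available (a common extension rather than strict POSIX): the OS reports one logical processor per hardware thread.

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* With SMT, the OS reports one logical processor per hardware thread,
     * so this count exceeds the number of physical cores. */
    long logical = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical processors online: %ld\n", logical);
    return 0;
}
```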

Intel x86 Multicore Organization

Example: Core Duo and Core i7

ARM11 MPCore

  • Up to 4 processors each with own L1 instruction and data cache

  • Distributed interrupt controller

  • Timer per CPU

  • Watchdog

    • Warning alerts for software failures
    • Counts down from predetermined values
    • Issues warning at zero
  • CPU interface

    • Interrupt acknowledgement, masking and completion acknowledgement
  • CPU

    • Single ARM11 called MP11
  • Vector floating-point unit

    • FP co-processor
  • L1 cache

  • Snoop control unit

    • Maintain L1 cache coherency

Summary of parallel

  • Within the CPU

    • Pipeline

    • Superscalar

    • Simultaneous multithreading (SMT)

  • On chip

    • Multicore
  • Within the machine

    • SMP

    • NUMA

    • Array processor

  • Across machines

    • Cluster