CMU 15-445 Lecture #23: Distributed OLTP Databases


  • So far we have not discussed how to ensure that all nodes agree to commit a txn, and how to make sure it actually commits once the DBMS decides it should.
    • → What happens if a node fails?
    • → What happens if messages show up late?
    • → What happens if the system does not wait for every node to agree to commit?

Atomic Commit Protocols

  • Multiple nodes must commit a txn atomically.
  • Approaches:
    • Two-Phase Commit (1970s)
    • Three-Phase Commit (1983)
    • Paxos (1989)
    • Raft (2013)
    • ZAB (2008? Zookeeper Atomic Broadcast protocol, Apache Zookeeper)
    • Viewstamped Replication (1988)

Two-Phase Commit

/img/CMU 15-445 Database Systems/chapter23-1.png
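Two-Phase Commit, Phase 1 (Prepare): the coordinator asks the other nodes whether they are ready to commit.
/img/CMU 15-445 Database Systems/chapter23-2.png
Two-Phase Commit: the other nodes decide the txn can commit and reply OK.
/img/CMU 15-445 Database Systems/chapter23-3.png
Two-Phase Commit, Phase 2 (Commit): the coordinator sends a commit request to the other nodes.
/img/CMU 15-445 Database Systems/chapter23-4.png
Two-Phase Commit: each node replies OK after it finishes committing locally.
/img/CMU 15-445 Database Systems/chapter23-5.png
The coordinator tells the application that the commit succeeded.
  • If any node sends ABORT, the coordinator tells the application that the txn was rolled back and then sends an ABORT message to all nodes.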

  • 2PC OPTIMIZATIONS

    • Early Prepare Voting (Rare)

      • → If you send a query to a remote node that you know will be the last one to execute in this txn, then that node will also return its vote for the prepare phase with the query result. (A node that knows it is running the txn's last statement votes on its own, without waiting for a further instruction.)
    • Early Ack After Prepare (Common)

      • → If all nodes vote to commit a txn, the coordinator can send the client an acknowledgement that its txn was successful before the commit phase finishes. (The application is told the txn succeeded before the commit actually completes, which carries some risk to data safety.)
/img/CMU 15-445 Database Systems/chapter23-6.png
Early Ack After Prepare
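
The message flow above can be sketched as a small in-memory simulation. This is only an illustrative sketch, not how any particular DBMS implements 2PC: the `Participant` class, its `can_commit` flag, and the `two_phase_commit` function are hypothetical names, and there is no networking, logging, or failure handling. The `early_ack` flag models the "Early Ack After Prepare" optimization by moving the client acknowledgement ahead of phase 2.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical participant; `can_commit` stands in for local validation."""
    name: str
    can_commit: bool = True
    state: str = "active"

    def prepare(self) -> bool:
        # Phase 1: vote OK only if the txn can be committed locally.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants, early_ack=False):
    """Return (outcome, events); `events` lists the steps in the order they happen."""
    events = []
    # Phase 1 (Prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    if not all(votes):
        # Any ABORT vote aborts the txn: tell the client, then tell every node.
        events.append("ack client: aborted")
        for p in participants:
            p.abort()
        return "aborted", events
    if early_ack:
        # Early Ack After Prepare: acknowledge before phase 2 has finished.
        events.append("ack client: committed")
    # Phase 2 (Commit): tell every participant to commit.
    for p in participants:
        p.commit()
    events.append("all participants committed")
    if not early_ack:
        events.append("ack client: committed")
    return "committed", events
```

Calling `two_phase_commit([Participant("n1"), Participant("n2", can_commit=False)])` takes the abort path, while passing `early_ack=True` with all-OK participants puts the client acknowledgement first in `events`.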

PAXOS

  • Consensus protocol where a coordinator proposes an outcome (e.g., commit or abort) and then the participants vote on whether that outcome should succeed.

  • Does not block if a majority of participants are available and has provably minimal message delays in the best case. (As long as a majority is available, the majority decides.)

/img/CMU 15-445 Database Systems/chapter23-7.png
When proposals conflict, one of them has to be discarded.
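
To make the rejection of a conflicting proposal concrete, here is a minimal single-decree Paxos sketch. It is a simplification under stated assumptions: everything runs in one process, messages are plain method calls, and the names `Acceptor`, `on_propose`, `on_commit`, and `run_round` are made up for illustration (the propose/commit wording follows these notes; other texts say prepare/accept).

```python
class Acceptor:
    def __init__(self):
        self.promised = -1          # highest proposal number promised so far
        self.accepted_n = -1        # number of the proposal accepted, if any
        self.accepted_value = None  # value of the proposal accepted, if any

    def on_propose(self, n):
        """Promise to ignore proposals numbered below n, if n is the newest seen."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, None, None

    def on_commit(self, n, value):
        """Accept the value unless a higher-numbered proposal was promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def run_round(acceptors, n, value):
    """Run one proposal; return the chosen value, or None if it was rejected."""
    majority = len(acceptors) // 2 + 1
    replies = [a.on_propose(n) for a in acceptors]
    granted = [(an, av) for ok, an, av in replies if ok]
    if len(granted) < majority:
        return None
    # If some acceptor already accepted a value, the proposer must adopt the
    # value of the highest-numbered accepted proposal instead of its own.
    prior_n, prior_value = max(granted, key=lambda reply: reply[0])
    if prior_n >= 0:
        value = prior_value
    accepts = sum(a.on_commit(n, value) for a in acceptors)
    return value if accepts >= majority else None
```

In this sketch, if one proposer has sent proposal 1 and a second proposer then completes a round with proposal 2, the acceptors reject the late commit message for proposal 1 because they have already promised a higher number.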

  • Two Phase Commit: Blocks if coordinator fails after the prepare message is sent, until coordinator recovers.

  • Paxos: Nonblocking if a majority of participants are alive, provided there is a sufficiently long period without further failures.

  • Raft: Similar to Paxos but with fewer node types. Only nodes with most up-to-date log can become leaders.

  • Multi-Paxos: If the system elects a single leader that oversees proposing changes for some period, then it can skip the propose phase. The system periodically renews who the leader is using another Paxos round. When there is a failure, the DBMS can fall back to full Paxos.
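
The Raft restriction that only nodes with the most up-to-date log can become leaders is enforced when nodes vote. A hedged sketch of just that comparison follows; the function name and parameters are illustrative, and a real Raft node also checks the election term and whether it has already voted.

```python
def candidate_log_is_up_to_date(candidate_last_term, candidate_last_index,
                                voter_last_term, voter_last_index):
    """Return True if the candidate's log is at least as up-to-date as the voter's.

    Logs are compared by the term of their last entry first; if the terms are
    equal, the longer log (higher last index) is the more up-to-date one.
    """
    if candidate_last_term != voter_last_term:
        return candidate_last_term > voter_last_term
    return candidate_last_index >= voter_last_index
```

A voter that gets `False` from this check refuses its vote, so a candidate with a stale log cannot gather a majority.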

Replication

  • Replicas: store data redundantly to keep the system stable.

  • Design Decisions:

    • → Replica Configuration
    • → Propagation Scheme
    • → Propagation Timing
    • → Update Method

  • Replica Configuration

  • Approach #1: Primary-Replica

    • → All updates go to a designated primary for each object.
    • → The primary propagates updates to its replicas by shipping logs.
    • → Read-only txns may be allowed to access replicas.
    • → If the primary goes down, then hold an election to select a new primary.
  • Approach #2: Multi-Primary

    • → Txns can update data objects at any replica.
    • → Replicas must synchronize with each other using an atomic commit protocol.
/img/CMU 15-445 Database Systems/chapter23-8.png
  • K-Safety
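    • Ensure that at least K replicas of the data are online.
    • If the number of replicas drops below this threshold, the DBMS takes itself offline.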
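
The K-safety rule amounts to a threshold check over the replicas that are currently online. A minimal sketch, assuming a hypothetical bookkeeping map from each data object to the set of nodes holding an online copy:

```python
def objects_below_k(live_replicas, k):
    """Return the objects whose number of online replicas has dropped below k.

    `live_replicas` maps an object name to the set of nodes that currently
    hold an online copy of it (a made-up structure for this example).
    """
    return sorted(obj for obj, nodes in live_replicas.items() if len(nodes) < k)


def should_stay_online(live_replicas, k):
    # The DBMS stays online only if every object still has at least k replicas.
    return not objects_below_k(live_replicas, k)
```

For example, with `{"A": {"n1", "n2"}, "B": {"n1"}}` the system is 1-safe but not 2-safe, because object `B` has only one online copy.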

  • Propagation Scheme

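    • Synchronous scheme: when the primary commits a txn, it blocks while it notifies the replicas, and only tells the client the txn succeeded after the replicas have also committed.

    • Asynchronous scheme: the primary acknowledges the commit as soon as it finishes locally and does not wait for the replicas.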

    • /img/CMU 15-445 Database Systems/chapter23-9.png
      synchronous, asynchronous
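    • Compromise: semi-synchronous. The primary waits until the log has been shipped to the replica, but not until the replica has finished applying it (MySQL).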
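
The three schemes differ only in how much replica work has finished by the time the client is acknowledged. The sketch below models that difference in memory; `Scheme`, `Replica`, and `commit_on_primary` are hypothetical names, and real systems ship log records over the network rather than calling methods.

```python
from enum import Enum


class Scheme(Enum):
    SYNCHRONOUS = "synchronous"
    SEMI_SYNCHRONOUS = "semi-synchronous"
    ASYNCHRONOUS = "asynchronous"


class Replica:
    def __init__(self):
        self.received = []  # log records that arrived but are not applied yet
        self.applied = []   # log records the replica has applied

    def receive(self, record):
        self.received.append(record)

    def apply_pending(self):
        self.applied.extend(self.received)
        self.received.clear()


def commit_on_primary(record, replicas, scheme):
    """Do the replica work the primary waits for before acknowledging the client.

    Returns the (replica, record) pairs that still have to be shipped after
    the acknowledgement; this list is non-empty only for the asynchronous scheme.
    """
    unsent = []
    for replica in replicas:
        if scheme is Scheme.ASYNCHRONOUS:
            unsent.append((replica, record))  # shipped some time after the ack
            continue
        replica.receive(record)               # wait until the log has arrived
        if scheme is Scheme.SYNCHRONOUS:
            replica.apply_pending()           # also wait until it is applied
    return unsent  # the client is acknowledged at this point
```

At acknowledgement time the replica has applied the record under the synchronous scheme, has only received it under the semi-synchronous scheme, and has seen nothing yet under the asynchronous scheme.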


  • Propagation Timing
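    • Continuous: the primary keeps sending log messages to the replicas as it generates them. The hard part is rollback, because the primary and the replicas must roll back together.
    • On Commit: the DBMS only sends the log messages for a txn to the replicas once the txn commits, so no time is wasted on rollbacks at the replicas.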

  • Active vs Passive

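    • Active-Active: the primary sends SQL statements to the replicas, and each replica re-executes them.

    • Active-Passive: the primary sends physical log records to the replicas, which simply apply them.

    • Physical logs are large and re-executing SQL is slow, so real systems often mix the two (MySQL).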

CAP Theorem

  • Consistent

  • Always Available

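  • Network Partition Tolerant (the system keeps working even when the network is cut and the cluster splits)

  • A system can never guarantee all three of these at the same time.

  • Aside: federated databases use a unified API to combine different kinds of DBMSs into a single cluster, but there are currently hardly any such products in industry.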

/img/CMU 15-445 Database Systems/chapter23-10.png