Jungle's Blog

Database SQL Engine Basics (OceanBase-MiniOB)

junglece430@gmail.com (Jungle) — Sun, 21 Jul 2024 01:18:11 +0800

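Database SQL Engine Basics (OceanBase-MiniOB)

Engine Architecture Overview

MySQL engine architecture (the part inside the red box)
OceanBase engine architecture
Common structure of a SQL statement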


-- The number after each clause is its logical evaluation order.
SELECT XXX    -- (7)
FROM XXX      -- (1)
JOIN XXX      -- (3)
ON XXX        -- (2)
WHERE XXX     -- (4)
GROUP BY XXX  -- (5)
HAVING XXX    -- (6)
ORDER BY XXX  -- (8)
LIMIT XXX     -- (9)

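Parsing & Executing SQL

Parser

Converts the SQL text into a data structure the database can work with.

Resolver

Also converts the SQL into a new, more engineering-oriented data structure.

Transformer & Optimizer

These two layers turn the structure from the previous layer into the "operator tree" of the volcano model (a term I made up to make it easier to understand) and optimize the query plan along the way.
Generating the query plan
Optimizing the query plan
The plan optimization part of OceanBase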

Executor

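The most common execution model is the volcano model.
All algebraic operations are treated as operators (each one is a node in the figure).
Each operator treats the operator below it as a table; every call to next() fetches one tuple from that table.
Pros: simple; it keeps the query plan flexible and therefore easy to optimize.
Cons: a large number of virtual function calls, broken CPU pipelining, and poor efficiency because the CPU cannot exploit out-of-order execution (but the model was born in the 1990s, when I/O mattered far more than CPU).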
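To make the next() interface concrete, here is a minimal sketch of a volcano-style operator tree. It is not MiniOB or OceanBase code: the class names (`Operator`, `SeqScan`, `Filter`, `Project`), the `run` helper, and the in-memory `students` table are all made up for illustration.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


class Operator:
    """Base class: every operator exposes next(), returning one row or None."""

    def next(self) -> Optional[Row]:
        raise NotImplementedError


class SeqScan(Operator):
    """Leaf operator: hands out the rows of an in-memory table one at a time."""

    def __init__(self, table: List[Row]) -> None:
        self.table = table
        self.pos = 0

    def next(self) -> Optional[Row]:
        if self.pos >= len(self.table):
            return None  # end of table
        row = self.table[self.pos]
        self.pos += 1
        return row


class Filter(Operator):
    """Pulls rows from its child until one satisfies the predicate."""

    def __init__(self, child: Operator, predicate: Callable[[Row], bool]) -> None:
        self.child = child
        self.predicate = predicate

    def next(self) -> Optional[Row]:
        while True:
            row = self.child.next()
            if row is None or self.predicate(row):
                return row


class Project(Operator):
    """Keeps only the requested columns of each row produced by its child."""

    def __init__(self, child: Operator, columns: List[str]) -> None:
        self.child = child
        self.columns = columns

    def next(self) -> Optional[Row]:
        row = self.child.next()
        if row is None:
            return None
        return {c: row[c] for c in self.columns}


def run(root: Operator) -> List[Row]:
    """The executor just keeps calling next() on the root until it is drained."""
    rows = []
    while True:
        row = root.next()
        if row is None:
            return rows
        rows.append(row)


students = [
    {"id": 1, "name": "ann", "score": 90},
    {"id": 2, "name": "bob", "score": 40},
    {"id": 3, "name": "cat", "score": 75},
]
# Roughly: SELECT name FROM students WHERE score >= 60
plan = Project(Filter(SeqScan(students), lambda r: r["score"] >= 60), ["name"])
result = run(plan)
print(result)
```

Every row costs one next() call per operator in the tree, which is exactly the per-tuple call overhead listed under the cons above.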
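Two good articles on the problems of the volcano model and on vectorization and JIT-style dynamic code compilation/generation; I learned a lot from them:
- 为什么需要向量化执行引擎 (Why a vectorized execution engine is needed)
- 数据库计算引擎的优化技术：向量化执行与代码生成 (Optimization techniques for database compute engines: vectorized execution and code generation)
Other optimization points
- Operator fusion: for example, fusing Scan and Filter so that they run together
- Pull model vs. push model: in the volcano model the upper operators drive execution by pulling; a push model instead has the lower operators push data upward
In a push engine, a lower operator receives the requirements of the operator above it and returns results that already satisfy them, rather than blindly returning tuples as in a pull engine.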

Fast Parser & Plan Cache

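Cache the plan produced by earlier SQL parsing, so that large, frequently executed SQL statements are not parsed over and over, which hurts efficiency.
OceanBase's optimization: a SQL statement can first look for a similar statement and substitute only the literal values, which immediately yields its own query plan.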
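The idea of matching "similar" statements can be sketched as follows: replace the literals with placeholders and use the resulting template as the cache key. This is only an illustration of the principle, not how OceanBase implements its fast parser. The regex, `parameterize`, `PlanCache`, and `fake_planner` are all hypothetical, and a real fast parser handles far more cases than quoted strings and plain numbers.

```python
import re

# Matches single-quoted strings or standalone numbers (a deliberately crude rule).
_LITERAL = re.compile(r"'[^']*'|\b\d+(?:\.\d+)?\b")


def parameterize(sql):
    """Return (template, params): the SQL with literals replaced by '?'."""
    params = []

    def replace(match):
        params.append(match.group(0))
        return "?"

    template = _LITERAL.sub(replace, sql)
    template = " ".join(template.split())  # normalize whitespace
    return template, params


class PlanCache:
    """Caches one plan per SQL template; `planner` stands in for the optimizer."""

    def __init__(self, planner):
        self.planner = planner
        self.plans = {}
        self.hits = 0
        self.misses = 0

    def get_plan(self, sql):
        template, params = parameterize(sql)
        plan = self.plans.get(template)
        if plan is None:
            self.misses += 1
            plan = self.planner(template)  # the expensive parse + optimize step
            self.plans[template] = plan
        else:
            self.hits += 1
        return plan, params


def fake_planner(template):
    # Placeholder for real parsing and optimization.
    return f"PLAN[{template}]"


cache = PlanCache(fake_planner)
plan_a, params_a = cache.get_plan("SELECT * FROM t WHERE id = 42")
plan_b, params_b = cache.get_plan("SELECT * FROM t  WHERE id = 7")
print(plan_a, params_a, params_b, cache.hits, cache.misses)
```

The second statement differs only in its literal, so it reuses the cached plan and just carries its own parameter list.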

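Common Terms

An efficient SQL statement depends on many factors
- disk
- memory
- cpu
- whether the SQL plan is a good one and whether the chosen algorithms are reasonable

Basic Relational Algebra Notation & Optimization Operations

Related articles
- CMU 15-445 Lecture #11: Joins Algorithms
- SQL 改写系列九：外连接转内连接的常见场景与错误 (SQL Rewrite Series, Part 9: common scenarios and mistakes when converting outer joins to inner joins)

Equivalence rules for transforming relational expressions (just treat them as a reference dictionary)

Example of optimization using algebraic laws: push selections down so the joins become smaller
Because joins are commutative, even a handful of joins yields a huge number of plans by simple combinatorics, and in practice the database cannot compute the cost of every one of them
Interesting sort order (if the result of an earlier join is already sorted, prefer joining it with a sorted table next, so that a sort-merge join can be used to speed things up)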
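To get a feel for how fast the plan space grows with the number of joined tables, here is a small counting sketch. The formulas are standard combinatorics rather than something from the lecture notes: n! orderings for left-deep trees and (2(n-1))!/(n-1)! for all bushy trees, counting every ordering including cross products. The function names are made up.

```python
from math import factorial


def left_deep_join_orders(n):
    """Number of left-deep join orders for n tables: any permutation works."""
    return factorial(n)


def bushy_join_trees(n):
    """Number of ordered binary join trees over n tables: (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)


for n in (2, 4, 6, 8, 10):
    print(n, left_deep_join_orders(n), bushy_join_trees(n))
```

Even restricting the search to left-deep trees leaves millions of candidates at ten tables, which is why optimizers prune with heuristics and dynamic programming instead of costing every plan.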

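Left-Deep Trees

Given the optimizations above, left-deep trees have an advantage over right-deep trees

Heuristic Rules

Distilled from years of experience and papers

Statistics

Use selectivity and cardinality to decide between an index and a full table scan (an index does not always beat a full scan, for example on a column like sex)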

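Closing Notes

Leaving one article to read while slacking off at my desk next week (time to apply for access again)
- TiDB 源码阅读系列文章（七）基于规则的优化 | PingCAP (TiDB Source Code Reading Series, Part 7: Rule-Based Optimization)
Now that I have reached Lab 3 of 15-445, the amount of background knowledge I need has grown dramatically (this part is inherently broad, whereas the earlier bpm and hash index were narrow and self-contained). Halfway through the lab handout I was already pushed into reading several articles on the volcano model and vectorized processing (mainly because that is all I can read at my desk; the two Zhihu articles really are high quality and I learned a lot from them). I also chatted with a friend I met on Nowcoder who now works on databases at the 2012 Lab. Watching him fix bugs, I realized that computer architecture and operating system topics such as instruction pipelining, memory alignment, compiling and linking, and parallel computing keep coming up in this kind of low-level work, and a solid foundation speeds up finding the root cause of a bug. So my copies of Computer Architecture: A Quantitative Approach and CSAPP cannot keep gathering dust; I need to find time to shore up my fundamentals.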

CMU 15-445 Lecture #24: Distributed OLAP Databases

junglece430@gmail.com (Jungle) — Tue, 18 Jun 2024 09:58:42 +0800

CMU 15-445 Database Systems

Lecture #24: Distributed OLAP Databases

Decision Support Systems

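Another name for OLAP: analyzing data to support a company's business decisions
Two data schemas
- Star Schema
- Snowflake Schema

Star Schema
- One central table connected to the outer tables through foreign keys
- The bottom two fields are the core fields; the ones above are used to link to other tables
- One center, many descriptions
- Drawback: some enumerated values that could be stored as a number plus a lookup table must be written out as real attribute values in this schema, which causes data redundancy

Snowflake Schema
- The outer layer can itself extend outward, which solves the data redundancy problem above

Comparison
- Snowflake saves more space and avoids normalization problems (in a star schema people may describe an enumerated value with different words, e.g. one person records low while another records bad)
- Snowflake queries are more complex and run slower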

Execution Models

PUSH QUERY TO DATA
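- The query plan is handed to a gateway node, which distributes plan fragments to the other nodes based on the query and the data distribution; each node executes as much of its fragment as it can and then returns the result to the application
PULL DATA TO QUERY
- The data is not local, so to execute the query I must first load the data to the local node (in a shared-disk architecture this means pulling pages to the local node)

Query Planning

Physical Operators
- Cut the query plan into small fragments and push them to the corresponding data nodes; this is the common approach
SQL
- What gets cut up is the SQL statement itself

Distributed Join Algorithms

Move the parts that can be joined onto a single node and join them there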

Cloud Systems

Newer systems are starting to blur the lines between shared-nothing and shared-disk.
- Example: You can do simple filtering on Amazon S3 before copying data to compute nodes.

Managed DBMSs
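- The database software is unchanged; it is really a virtual machine plus the software (Tencent Cloud and Alibaba Cloud databases work this way and host the database for you)
- Cheap, and easy to migrate to (from a local MySQL to a cloud database)
Cloud-Native DBMSs
- Designed specifically to run in the cloud
Shared disk allows compute resources to be allocated dynamically
Some people also want to assemble a new cloud database out of components from different vendors, mainly by unifying the file format, but that is very hard...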

CMU 15-445 Lecture #23: Distributed OLTP Databases

junglece430@gmail.com (Jungle) — Mon, 17 Jun 2024 17:09:19 +0800

CMU 15-445 Database Systems

Lecture #23: Distributed OLTP Databases

We have not discussed how to ensure that all nodes agree to commit a txn and then to make sure it does commit if the DBMS decides it should.
- → What happens if a node fails?
- → What happens if messages show up late?
- → What happens if the system does not wait for every node to agree to commit?

Atomic Commit Protocols

Multiple nodes need to commit atomically
Solutions
- Two-Phase Commit (1970s)
- Three-Phase Commit (1983)
- Paxos (1989)
- Raft (2013)
- ZAB (2008? Zookeeper Atomic Broadcast protocol, Apache Zookeeper)
- Viewstamped Replication (1988)

Two-Phase Commit

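Two-Phase Commit, Phase 1: Prepare: the coordinator asks the other nodes whether they are ready

Two-Phase Commit: the other nodes consider the commit feasible and reply OK

Two-Phase Commit, Phase 2: Commit: the coordinator sends the commit request to the other nodes

Two-Phase Commit: each node replies OK after committing locally

The coordinator tells the application that the commit succeeded

If any node sends ABORT, the coordinator tells the application that the transaction was rolled back and then sends ABORT to all nodes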
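The message flow above can be sketched in a few lines. This is a single-process toy, assuming no failures, timeouts, or logging, so it shows only the voting logic; `Participant` and `two_phase_commit` are made-up names.

```python
class Participant:
    """A node taking part in the transaction; `can_commit` fakes its local check."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def prepare(self):
        # Phase 1: vote OK (True) or ABORT (False).
        self.state = "PREPARED" if self.can_commit else "ABORTED"
        return self.can_commit

    def commit(self):
        # Phase 2: make the change durable locally.
        self.state = "COMMITTED"

    def abort(self):
        self.state = "ABORTED"


def two_phase_commit(participants):
    """Coordinator logic: commit only if every participant voted OK."""
    votes = [p.prepare() for p in participants]  # Phase 1: Prepare
    if all(votes):
        for p in participants:                   # Phase 2: Commit
            p.commit()
        return "COMMIT"
    for p in participants:                       # any ABORT vote aborts everyone
        p.abort()
    return "ABORT"


happy = [Participant("node1"), Participant("node2"), Participant("node3")]
outcome_ok = two_phase_commit(happy)

unhappy = [Participant("node1"), Participant("node2", can_commit=False)]
outcome_bad = two_phase_commit(unhappy)
print(outcome_ok, outcome_bad)
```

What the toy leaves out is precisely what makes 2PC hard in practice: every step has to be logged, and a participant that has voted OK is stuck waiting if the coordinator disappears.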
2PC OPTIMIZATIONS
- Early Prepare Voting (Rare)
  - → If you send a query to a remote node that you know will be the last one to execute in this txn, then that node will also return their vote for the prepare phase with the query result. (If the node knows this is the last statement of the transaction, it votes on its own right away without waiting for a further instruction.)
- Early Ack After Prepare (Common)
  - → If all nodes vote to commit a txn, the coordinator can send the client an acknowledgement that their txn was successful before the commit phase finishes. (The application is told the txn succeeded before the commit actually finishes, which carries some risk to data safety.)

Early Ack After Prepare

PAXOS

Consensus protocol where a coordinator proposes an outcome (e.g., commit or abort) and then the participants vote on whether that outcome should succeed (the coordinator puts forward a proposal and everyone votes on it).
Does not block if a majority of participants are available and has provably minimal message delays in the best case (as long as a majority is available, the majority decides).

When transactions conflict, one of them has to be discarded

Two Phase Commit: Blocks if coordinator fails after the prepare message is sent, until coordinator recovers.
Paxos: Nonblocking if a majority participants are alive, provided there is a sufficiently long period without further failures.
Raft: Similar to Paxos but with fewer node types. Only nodes with most up-to-date log can become leaders.
Multi-Paxos: If the system elects a single leader that oversees proposing changes for some period, then it can skip the propose phase. The system periodically renews who the leader is using another Paxos round. When there is a failure, the DBMS can fall back to full Paxos.

Replication

Replicas: redundant copies of the data for reliability
Design Decisions:
- → Replica Configuration
- → Propagation Scheme
- → Propagation Timing
- → Update Method

Replica Configuration
Approach #1: Primary-Replica
- → All updates go to a designated primary for each object.
- → The primary propagates updates to its replicas by shipping logs.
- → Read-only txns may be allowed to access replicas.
- → If the primary goes down, then hold an election to select a new primary.
Approach #2: Multi-Primary
- → Txns can update data objects at any replica.
- → Replicas must synchronize with each other using an atomic commit protocol.

K-Safety
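- Ensure that at least K replicas of the data are online
- If the number falls below this threshold, the database goes offline

Propagation Scheme
- synchronous scheme: when committing a transaction the primary blocks and notifies the replicas; the user is told the commit succeeded only after the replicas have also committed
- asynchronous scheme: the primary is done as soon as it commits and does not wait for the replicas
- (Figure: synchronous vs. asynchronous)
- Compromise: semi-synchronous: wait until the log has been shipped to the replica, without waiting for the replica to apply it (MySQL)

Propagation Timing
- Continuous: log messages are sent to the replicas continuously; the hard part is rollback, since the primary and the replicas must roll back together
- On Commit: The DBMS only sends the log messages for a txn to the replicas once the txn commits (the primary ships the log only after it has committed), so no time is wasted on rollbacks

Active vs Passive
- Active-Active: the primary sends SQL statements to the replica, and the replica has to execute them again
- Active-Passive: the primary sends physical log records to the replica, and the replica simply applies them
- Physical logs are large and re-executing SQL is slow, so real systems often mix the two (MySQL)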
CAP Theorem

→ Consistent
→ Always Available
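→ Network Partition Tolerant (the system keeps working after the network is cut and the cluster splits)
The three properties above can never all be achieved at the same time
Extension: federated databases: a unified API combines different kinds of databases into one cluster, but there are hardly any such products in the industry today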

CMU 15-445 Lecture #22: Introduction to Distributed Databases

junglece430@gmail.com (Jungle) — Mon, 17 Jun 2024 10:20:12 +0800

CMU 15-445 Database Systems

Lecture #22: Introduction to Distributed Databases

Introduction Distributed DBMSs

Distinguishing the concepts: Parallel DBMSs vs Distributed DBMSs
Parallel DBMSs:
- Nodes are physically close to each other.
- Nodes connected with high-speed LAN(Local Area Network).
- Communication cost is assumed to be small.
Distributed DBMSs:
- Nodes can be far from each other.
- Nodes connected using public network.
- Communication cost and problems cannot be ignored.
The main difference is communication cost: the former can ignore it, the latter cannot

Single node -> Distributed DBMSs
Questions to consider
- Optimization & Planning
- Concurrency Control
- Logging & Recovery

System Architectures

A distributed DBMS’s system architecture specifies what shared resources are directly accessible to CPUs.
This affects how CPUs coordinate with each other and where they retrieve/store objects in the database.

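Every database node is on the same machine (an extreme case), as in parallel DBMSs; ignore it.

Note about the figure above: a resource being shared by many XXX does not mean the other machines lack that component. To run at all, every machine needs a basic OS, CPU, memory, and disk; the figure only shows which parts take part in the distributed system

shared memory
A large memory and a large disk, both shared by many CPUs over a network
- Nodes access a common memory address space via a fast interconnect.
- Can still use local memory / disk for intermediate results
This looks a lot like shared everything. Nobody does this. (I think the reason is that the CPU has to reach memory over the network, so memory's biggest advantage, its speed, is simply lost.)
It is used within a single machine, though: multi-socket servers (multiple CPUs, each with multiple cores to increase concurrency, connected by a high-speed bus)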

shared disk
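A large disk shared by many CPUs and memories
- Nodes access a single logical disk via an interconnect, but each has its own private memory.
- Scale execution layer independently from the storage layer (the execution layer and the storage layer are decoupled, so each can be scaled independently to boost that layer's performance).
- This architecture facilitates data lakes and serverless systems. (??? I did not fully understand this)
This architecture is used extremely widely in recent database products. The reason is the rise of cloud technology in recent years: most companies tend to buy cloud databases, and this architecture is easy to build in a cloud vendor's data center (cloud computing was designed with compute and storage separated, so a data center usually consists of separate compute servers and disk clusters, and taking some of each gives you this architecture)
A representative of this architecture in China: PolarDB
The query process: the execution layer and the storage layer are separated; the number of nodes determines compute capacity and the storage size determines storage capacity
The biggest problem this architecture faces is cache synchronization: when Node1 updates data and does not flush it to disk right away (because flushing takes too long), the update must be propagated to the caches of the other nodes. In fact every one of these architectures has to face this kind of problem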

shared nothing
Most common situations
Each DBMS node has its own CPU, memory, and local disk.
Nodes only communicate with each other via network
- Better performance & efficiency.
- Harder to scale capacity.
- Harder to ensure consistency.
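- Scaling a single layer (adding a machine raises every capability at once, and the extra capacity that is not the bottleneck is wasted) and ensuring consistency are both done poorly (the data is not on the same disk)
- Benefit: compared with shared disk above, the disk is split up and placed locally, which improves some performance
This kind of architecture is also widely used in today's distributed database products; a representative product is Redis

This involves data partitioning, and scaling out brings the problem of repartitioning the data

Data that is not on the local node requires calling other nodes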

Design Issues

Homogeneous Nodes vs. Heterogeneous Nodes:

Homogeneous Nodes
- Every node in the cluster can perform the same set of tasks (albeit on potentially different partitions of data).
- Makes provisioning and failover “easier”.

Heterogeneous Nodes
- Nodes are assigned specific tasks.
- Can allow a single physical node to host multiple “virtual” node types for dedicated tasks.

Heterogeneous Nodes on mongodb

DATA TRANSPARENCY
- Applications should not be required to know where data is physically located in a distributed DBMS.
- Any query that runs on a single-node DBMS should produce the same result on a distributed DBMS.
! In practice, developers need to be aware of the communication costs of queries to avoid excessively “expensive” data movement.

DATABASE PARTITIONING
- Split database across multiple resources:
  - Disks, nodes, processors.
  - Often called “sharding” in NoSQL systems.
The DBMS executes query fragments on each partition and then combines the results to produce a single answer.
The DBMS can partition a database physically(shared nothing) or logically (shared disk).

NAÏVE TABLE PARTITIONING
- Assign an entire table to a single node.
- Assumes that each node has enough storage space for an entire table.
- Ideal if queries never join data across tables stored on different nodes and access patterns are uniform.
- The main weakness is JOIN

HORIZONTAL PARTITIONING
- Split a table’s tuples into disjoint subsets based on some partitioning key and scheme.
- Choose column(s) that divides the database equally in terms of size, load, or usage.
Partitioning Schemes:

→ Hashing

→ Ranges

→ Predicates

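Hash partitioning: if the attribute in the WHERE clause is not the hash column, every node has to be searched, which makes for a bad query; queries should include a restriction on the hash column whenever possible

Scaling out usually means recomputing hashes and repartitioning

CONSISTENT HASHING
A common problem in distributed systems
Classic interview question: what is consistent hashing?

P4 takes over part of P3's data; P3 only needs to hand that part over to P4, which avoids the large-scale data migration described earlier

P2 and P6 store replicas of part of P1's data

Storage thereby also becomes redundant (several copies are kept)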
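A minimal consistent-hashing ring shows why adding a node moves only one neighbor's data. This is a sketch under simplifying assumptions: no virtual nodes, no replication, and MD5 picked purely as a convenient stable hash. `HashRing` and the `user:N` keys are made up, and which existing node gives up data depends on where the hashes happen to land.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a position on the ring (MD5 is an arbitrary stable choice)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=()):
        self._points = []   # sorted ring positions of the nodes
        self._owner = {}    # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        point = _hash(node)
        bisect.insort(self._points, point)
        self._owner[point] = node

    def lookup(self, key):
        """A key belongs to the first node at or after its position, wrapping around."""
        i = bisect.bisect_left(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[i]]


keys = [f"user:{i}" for i in range(1000)]
ring = HashRing(["P1", "P2", "P3"])
before = {k: ring.lookup(k) for k in keys}

ring.add_node("P4")  # scale out by one node
after = {k: ring.lookup(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys moved, all of them to P4")
```

With plain `hash(key) % node_count` partitioning, going from three to four nodes would remap most keys; on the ring, only the keys that now fall in P4's arc move, and they all come from a single neighbor.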

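Physical Partitioning and Logical Partitioning

A distributed database on a shared-disk architecture does not partition physically; its partitioning is logical, assigning queries for different data to different nodes. Physical partitioning is common in shared-nothing architectures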

Distributed Concurrency Control

If our DBMS supports multi-operation and distributed txns, we need a way to coordinate their execution in the system.
Two different approaches:
- → Centralized: Global “traffic cop”.
- → Decentralized: Nodes organize themselves.

TP MONITORS
A TP Monitor is an example of a centralized coordinator for distributed DBMSs. Originally developed in the 1970-80s to provide txns between terminals and mainframe databases.
- → Examples: ATMs, Airline Reservations.
- Standardized protocol from 1990s: X/Open XA

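One node in the system is dedicated to managing every node's locks and transaction commits; a database node must ask it for permission when executing a transaction

Improvement: merge data routing and the transaction manager into one

Not used much today; the single point easily becomes a performance bottleneck

DECENTRALIZED COORDINATOR

A leader is elected among the nodes (older versions said master + slave; the terms were later dropped for political reasons and newer database documentation no longer uses them), and the elected leader takes on the transaction duties of the TP Monitor above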

DISTRIBUTED CONCURRENCY CONTROL
- Many of the same protocols from single-node DBMSs can be adapted.
- This is harder because of:
  - → Replication.
  - → Network Communication Overhead.
  - → Node Failures (Permanent + Ephemeral).
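  - → Clock Skew. (clocks are not synchronized)

2PL in a distributed database deadlocks easily, deadlocks are hard to resolve once they occur, and synchronizing locks over the network is expensive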

CMU 15-445 Lecture #21: Database Crash Recovery

junglece430@gmail.com (Jungle) — Mon, 22 Apr 2024 11:11:57 +0800

CMU 15-445 Database Systems

Lecture #21: Database Crash Recovery

Crash Recovery

The DBMS relies on its recovery algorithms to ensure database consistency(C), transaction atomicity(A), and durability(D) despite failures.
Each recovery algorithm is comprised of two parts:
- Actions during normal transaction processing to ensure that the DBMS can recover from a failure
- Actions after a failure to recover the database to a state that ensures the atomicity, consistency, and durability of transactions.
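Problems with checkpoints
- Performance: the whole DBMS stalls while flushing to disk
- Recovery has to look at the log both before and after the checkpoint, which also wastes time
- There is no ideal flush frequency: too frequent gives many short stalls, too infrequent gives periodic long stalls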
Algorithms for Recovery and Isolation Exploiting Semantics(ARIES)
- Developed at IBM Research in early 1990s for the DB2 DBMS
- There are three key concepts in the ARIES recovery protocol:
  - Write Ahead Logging (WAL): Any change is recorded in log on stable storage before the database change is written to disk (STEAL + NO-FORCE, the buffer pool write policy).
  - Repeating History During Redo: On restart, retrace actions and restore database to exact state before crash.
  - Logging Changes During Undo: Record undo actions to log to ensure action is not repeated in the event of repeated failures.

WAL Records

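Every log record now includes a globally unique log sequence number (LSN).

Log record categories

Each data page has a pageLSN, the LSN of the most recent modification to that page
The system keeps a flushedLSN: log records up to it are on disk, and records after it are still in memory
Necessary condition for writing a dirty page back to disk: $pageLSN \le flushedLSN$. The log records for every modification made to the page must reach disk before the page itself can be written back
flushedLSN must be updated on every log flush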
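The pageLSN/flushedLSN rule can be expressed as a small check in the buffer pool. This sketch tracks only the LSN bookkeeping, with no real I/O; `WalBufferPool` and its methods are hypothetical names and not BusTub's API.

```python
class WalBufferPool:
    """Tracks just enough state to enforce pageLSN <= flushedLSN."""

    def __init__(self):
        self.next_lsn = 1       # LSN to assign to the next log record
        self.flushed_lsn = 0    # every record with LSN <= this is on disk
        self.page_lsn = {}      # page id -> LSN of its latest modification
        self.pages_on_disk = set()

    def write(self, page_id, change):
        """Append a log record (in memory) and stamp the page with its LSN."""
        lsn = self.next_lsn
        self.next_lsn += 1
        self.page_lsn[page_id] = lsn
        return lsn

    def flush_log(self, up_to_lsn):
        """Pretend to fsync the log tail; flushedLSN only ever moves forward."""
        newest = self.next_lsn - 1
        self.flushed_lsn = max(self.flushed_lsn, min(up_to_lsn, newest))

    def can_flush_page(self, page_id):
        return self.page_lsn.get(page_id, 0) <= self.flushed_lsn

    def flush_page(self, page_id):
        """Write a dirty page back, forcing the log out first if necessary."""
        if not self.can_flush_page(page_id):
            self.flush_log(self.page_lsn[page_id])
        self.pages_on_disk.add(page_id)


pool = WalBufferPool()
pool.write("P1", "x = 1")   # LSN 1
pool.write("P2", "y = 2")   # LSN 2
blocked_at_first = not pool.can_flush_page("P1")  # nothing flushed yet
pool.flush_log(1)
p1_ok = pool.can_flush_page("P1")
p2_ok = pool.can_flush_page("P2")
pool.flush_page("P2")       # forces the log up to LSN 2 before writing the page
print(blocked_at_first, p1_ok, p2_ok, pool.flushed_lsn)
```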

Normal Execution

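Scenario: every transaction reads and writes data and ends with either commit or rollback
Assumptions
- All log records fit in a single page
- Disk writes are atomic
- Strict 2PL is used
- STEAL + NO-FORCE
COMMIT
- Append a COMMIT record to the log
- All log records of the transaction must be flushed before COMMIT returns; the flush is a sequential, synchronous write
- A TXN-END record is appended later, when the dirty pages are flushed
- Once flushed, the flushed data can be dropped from memory
- COMMIT means the commit succeeded; TXN-END means the dirty pages have been written back
ROLLBACK
- Add a prevLSN field: it records the location of the transaction's previous log record (like the prev pointer in a doubly linked list)
- Rolling back means logging the inverse operations
- Steps:
  - Append ABORT
  - Undo the modifications, appending the corresponding rollback log record for each
  - When cleanup is done, append TXN-END
  - Note: the cleanup itself can never be rolled back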

Checkpointing

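Problems with checkpoints
- New transactions must stop being processed, and all running transactions must finish before the flush; this hurts efficiency badly
- Improvement: pause all in-flight transactions, or lock all the data that needs flushing, instead of waiting for them to finish
Active Transaction Table (ATT)
- The table of transactions still active at the time of the checkpoint
- One entry per currently active txn.
  - → txnId: Unique txn identifier.
  - → status: The current “mode” of the txn.
  - → lastLSN: Most recent LSN created by txn
  - Remove entry after the TXN-END record. (a txn only counts as inactive after TXN-END)
- Txn Status Codes:
  - R → Running
  - C → Committing
  - U → Candidate for Undo
Dirty Page Table (DPT)
- The dirty pages at the time of the checkpoint
- One entry per dirty page in the buffer pool:
  - → recLSN: The LSN of the log record that first caused the page to be dirty.
The checkpoint record is annotated with this extra information

Fuzzy Checkpoints
- Other transactions keep running during the checkpoint
- The checkpoint turns from a point in time into an interval (POINT -> BEGIN+END)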
- BEGIN+END

ARIES Recovery

Analysis: Read the WAL to identify dirty pages in the buffer pool and active transactions at the time of the crash. At the end of the analysis phase the ATT tells the DBMS which transactions were active at the time of the crash. The DPT tells the DBMS which dirty pages might not have made it to disk.
Redo: Repeat all actions starting from an appropriate point in the log (even txns that will abort).
Undo: Reverse the actions of transactions that did not commit before the crash.

The recovery process

CMU 15-445 Lecture #20: Database Logging

junglece430@gmail.com (Jungle) — Sun, 21 Apr 2024 15:08:57 +0800

CMU 15-445 Database Systems

Lecture #20: Database Logging

Crash Recovery

Scenario: the power goes out while the database is running

Recovery algorithms are techniques to ensure database consistency(C), transaction atomicity(A), and durability(D) despite failures (for example, a power loss).
The key primitives that used in recovery algorithms are UNDO and REDO. Not all algorithms use both primitives.
- UNDO: The process of removing the effects of an incomplete or aborted transaction.
- REDO: The process of re-applying the effects of a committed transaction for durability.

FAILURE CLASSIFICATION

Type #1 – Transaction Failures
- Logical Errors:→ Transaction cannot complete due to some internal error condition (e.g., integrity constraint violation).
- Internal State Errors:→ DBMS must terminate an active transaction due to an error condition (e.g., deadlock).
Type #2 – System Failures
- Software Failure:→ Problem with the OS or DBMS implementation (e.g., uncaught divide-by-zero exception).
- Hardware Failure:
  - → The computer hosting the DBMS crashes (e.g., power plug gets pulled).
  - → Fail-stop Assumption: Non-volatile storage contents are assumed to not be corrupted by system crash.
Type #3 – Storage Media Failures
- Non-Repairable Hardware Failure:
  - → A head crash or similar disk failure destroys all or part of non-volatile storage.
  - → Destruction is assumed to be detectable (e.g., disk controller use checksums to detect failures).
  - The recovery protocol can’t recover from this! Database must be restored from an archived version.
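Consideration: disk is much slower than memory, so the DBMS loads pages into the buffer pool, modifies them there, and flushes them to disk later
- Guarantees: once a commit succeeds the data is never lost (short of the disk being destroyed); if a transaction fails midway, it should be as if it never happened

Design choices: whether to flush immediately after commit, and what to do at flush time with changes made to the same page by other uncommitted transactions (think about rollback)

STEAL POLICY
- STEAL: Is allowed. (pages containing other txns' uncommitted changes may be flushed)
- NO-STEAL: Is not allowed. (uncommitted changes cannot be flushed to disk)
FORCE POLICY
- FORCE: Is required. (flush immediately on commit)
- NO-FORCE: Is not required.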

NO-STEAL FORCE

NO-STEAL FORCE
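- Pros: easy to implement; no undo or redo needed
- Cons: inefficient. Flushes are too frequent, and without undo and redo everything has to be held in the buffer pool, which also hurts efficiency; the amount of data a transaction can modify is bounded by the buffer pool size (the buffer pool serves as the staging area, and all data not yet written to disk must stay in it)

Shadow Paging

Concrete implementation: SHADOW PAGING
Copy the needed data and modify the copy, flushing the copy to disk as it is modified; after commit, swing the pointer to the new pages and finally remove the original pages
- Undo: throw away all the copied pages
- Redo: not needed, because commit must flush
Real-world use: SQLITE (PRE-2010)
- Keep the original version on disk (undo) and flush at commit
Drawback: heavy random disk reads and writes, so performance is poor
Idea: random writes => sequential writes: WAL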

Write-Ahead Logging（WAL）

With write-ahead logging, the DBMS records all the changes made to the database in a log file (on stable storage) before the change is made to a disk page
The log contains sufficient information to perform the necessary undo and redo actions to restore the database after a crash.
The DBMS must write to disk the log file records that correspond to changes made to a database object before it can flush that object to disk.
Buffer Pool Policy: STEAL + NO-FORCE
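Typically a transaction's log has a begin record and an end record; on COMMIT all of its log records must be flushed to disk
Log record format
- Transaction Id
- Object Id
- Before Value (UNDO)
- After Value (REDO)
- Before Value and After Value are not necessary when using append-only MVCC
- Flush the log first, then the data pages; commit means the log flush succeeded
- After a crash, recover from the log
Problem: every user commit requires flushing the log to disk, but flushing too often hurts efficiency
- Optimization: group commit: hold commits back until several transactions have accumulated, then flush them together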

Logging Schemes

Physical Logging:
- Record the byte-level changes made to a specific location in the database.
- Example: git diff
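- Drawback: write amplification (an UPDATE over an entire table => a huge physical log)
Logical Logging:
- Records the SQL statement
- Drawback: slow recovery, plus the weaknesses of replaying SQL (NOW() cannot be replayed, LIMIT does not guarantee the same rows every time, the replica may lack the primary's indexes)
Physiological Logging
- Based on physical logging mixed with SQL-level information; byte offsets are replaced by slot numbers
The three kinds of logs
Why MySQL keeps undo and redo separate: safety, and the undo log can also be used for MVCC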

Checkpoints

Blocking / Consistent Checkpoint Protocol:
- → Pause all queries.
- → Flush all WAL records in memory to disk.
- → Flush all modified pages in the buffer pool to disk.
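- → Write a <CHECKPOINT> entry to WAL and flush to disk.
- → Resume queries.
If the log is never cleaned up it will eventually fill the disk
After a crash you need to know where to recover from, like a save point in a game (resting at an idol or a bonfire)
At a checkpoint everything pauses, the log and all dirty pages are flushed, and a <CHECKPOINT> record is written to the log to show that everything before it is on disk
Note that a checkpoint can have both fully committed and partially finished transactions around it: T2 uses redo, T3 uses undo
T1 was completely flushed before the checkpoint and needs nothing. T2 has BEGIN before the checkpoint and COMMIT after it, so it is recovered with REDO. T3 has BEGIN before the checkpoint and no COMMIT after it, so it is recovered with UNDO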
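The T2/T3 case can be replayed with a toy log. This is a sketch under the notes' assumptions (strict 2PL, STEAL + NO-FORCE, before/after values in every update record), and it ignores checkpoints and LSNs altogether. The tuple layout of the records and the `recover` function are made up for illustration.

```python
def recover(disk, log):
    """Rebuild a consistent state from what was on disk plus the WAL.

    Log records are tuples:
      ("BEGIN", txn), ("UPDATE", txn, key, before, after), ("COMMIT", txn)
    """
    committed = {rec[1] for rec in log if rec[0] == "COMMIT"}
    db = dict(disk)

    # REDO pass (forward): re-apply the after-value of committed updates,
    # because NO-FORCE means their pages may never have reached disk.
    for rec in log:
        if rec[0] == "UPDATE" and rec[1] in committed:
            _, _, key, _, after = rec
            db[key] = after

    # UNDO pass (backward): restore the before-value of uncommitted updates,
    # because STEAL means their dirty pages may already be on disk.
    for rec in reversed(log):
        if rec[0] == "UPDATE" and rec[1] not in committed:
            _, _, key, before, _ = rec
            db[key] = before

    return db


log = [
    ("BEGIN", "T2"),
    ("UPDATE", "T2", "A", 1, 2),
    ("BEGIN", "T3"),
    ("UPDATE", "T3", "C", 7, 99),
    ("COMMIT", "T2"),
    # crash here: T3 never committed
]
# State on disk at the crash: T2's page was not flushed, T3's dirty page was.
disk_at_crash = {"A": 1, "C": 99}
recovered = recover(disk_at_crash, log)
print(recovered)
```

T2's committed write to A is redone and T3's stolen write to C is undone, matching the REDO/UNDO split described above.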

CONCLUSION

Write-Ahead Logging is (almost) always the best approach to handle loss of volatile storage.
Use incremental updates (STEAL + NO-FORCE) with checkpoints.
On Recovery: undo uncommitted txns + redo committed txns.

CMU 15-445 Lecture #19: Multi-Version Concurrency Control

junglece430@gmail.com (Jungle) — Sun, 21 Apr 2024 10:39:27 +0800

CMU 15-445 Database Systems

Lecture #19: Multi-Version Concurrency Control

Multi-Version Concurrency Control

Often used together with 2PL and T/O as a complementary technique
The DBMS maintains multiple physical versions of a single logical object in the database (keeping multiple historical versions, like git)
- When a txn writes to an object, the DBMS creates a new version of that object (nothing is modified in place; a new version is created).
- When a txn reads an object, it reads the newest version that existed when the txn started.

The first implementations were Rdb/VMS and InterBase at DEC in the early 1980s.

→ Both were by Jim Starkey, co-founder of NuoDB.

→ DEC Rdb/VMS is now “Oracle Rdb”.

→ InterBase was open-sourced as Firebird.

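Problems it solves
- Writers do not block readers.
- Readers do not block writers.
- A reader simply reads a historical version
Read-only txns can read a consistent snapshot without acquiring locks, as if reading static data.
- Use timestamps to determine visibility.
- MVCC naturally supports Snapshot Isolation (SI).
Easily support time-travel queries: you can read the historical version as of some point in time, much like an IDE rolling back to yesterday's code (other schemes can hardly do this because they overwrite the historical data).

MVCC write

MVCC read

To prevent cascading rollbacks, only the latest committed data is read

The figure above shows that T1 and T2 cannot be made fully serializable: T2 did not read the data that T1 committed. So MVCC alone cannot provide full serializability. Oracle's highest isolation level is what the figure shows: snapshot isolation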

There are five important MVCC design considerations:
1. Concurrency Control Protocol
2. Version Storage
3. Garbage Collection
4. Index Management
5. Deletes

Concurrency Control Protocol
- Approach #1: Timestamp Ordering: Assign txns timestamps that determine serial order.
- Approach #2: Optimistic Concurrency Control: Three-phase protocol from last class,Use private workspace for new versions.
- Approach #3: Two-Phase Locking: Txns acquire appropriate lock on physical version before they can read/write a logical tuple.

Design consideration: Version Storage

Version Storage
- The DBMS uses the tuples’ pointer field to create a version chain per logical tuple
  - This allows the DBMS to find the version that is visible to a particular txn at runtime.
  - Indexes always point to the “head” of the chain.
- Approach #1: Append-Only Storage: New versions are appended to the same table space.
- Append-Only Storage
  - Two ways to link versions: insert at the head or at the tail of the chain; head insertion makes lookups faster (most txns want the newest data)
- Approach #2: Time-Travel Storage: Old versions are copied to separate table space.
  - Time-Travel Storage
- Approach #3: Delta Storage: The original values of the modified attributes are copied into a separate delta record space.
  - Delta Storage is the space-saving variant of Time-Travel Storage: it stores only the deltas
  - MySQL uses this approach
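A version chain with timestamp-based visibility can be sketched as follows. It uses head insertion (newest version first) and assumes each version carries a [begin_ts, end_ts) validity range. The names `Version` and `VersionedObject` are made up, and only committed writes are modeled.

```python
from dataclasses import dataclass
from typing import Optional

INF = float("inf")


@dataclass
class Version:
    value: object
    begin_ts: int                       # commit timestamp of the writer
    end_ts: float = INF                 # timestamp at which it was superseded
    older: Optional["Version"] = None   # next (older) version in the chain


class VersionedObject:
    """One logical tuple with a newest-to-oldest chain of physical versions."""

    def __init__(self):
        self.head = None  # indexes would point here

    def write(self, value, commit_ts):
        # Never update in place: close the current version, prepend a new one.
        if self.head is not None:
            self.head.end_ts = commit_ts
        self.head = Version(value, commit_ts, INF, self.head)

    def read(self, snapshot_ts):
        # Walk from the newest version until one is visible to this snapshot.
        version = self.head
        while version is not None:
            if version.begin_ts <= snapshot_ts < version.end_ts:
                return version.value
            version = version.older
        return None  # the object did not exist yet at snapshot_ts


account = VersionedObject()
account.write(100, commit_ts=1)
account.write(200, commit_ts=5)

old_view = account.read(snapshot_ts=3)     # a txn that started before the 2nd write
new_view = account.read(snapshot_ts=5)
before_birth = account.read(snapshot_ts=0)
print(old_view, new_view, before_birth)
```

Reading with an old snapshot timestamp is also all a time-travel query needs, as long as garbage collection has not reclaimed that version yet.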

Design consideration: Garbage Collection

Garbage Collection
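- Historical versions cannot be kept forever (that would waste a lot of storage), so unused versions must be reclaimed periodically
- How is a version judged unused?
  - No running transaction can see it any more (Snapshot Isolation)
  - The transaction that created it was rolled back
- Two questions
  - How to find expired versions?
  - How to decide when it is safe to reclaim memory?
Approach #1: Tuple-level
- Find old versions by examining tuples directly.
- Background Vacuuming vs. Cooperative Cleaning
- Background Vacuuming: cleanup in the background
- Use a bitmap to mark which pages have been updated and scan only those pages, which reduces GC pressure
- Cooperative Cleaning: clean up in passing while executing queries
Approach #2: Transaction-level
- Txns keep track of their old versions so the DBMS does not have to scan tuples to determine visibility.
- Each transaction records what it changed; the GC uses timestamps to scan the transactions' operation records and then removes the useless data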

Design consideration: Index Management

Primary key indexes point to version chain head
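Updating a primary key = delete followed by insert
Secondary indexes
- Approach #1: Logical Pointers: store a logical address for the tuple, such as the primary key value
- Approach #2: Physical Pointers: store the physical address
- A serious consequence of pointing to physical addresses is that when the version chain changes, a whole batch of secondary-index pointers into the chain must be updated as well
- Logical addresses do not have this problem: only the primary key index changes and the secondary indexes stay untouched
- A compromise is to add a table between the physical-address indexes and the version chain that acts as a proxy from index to version chain
Deleting and then re-inserting can also cause problems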
Delete
- Approach #1: Deleted Flag: Maintain a flag to indicate that the logical tuple has been deleted after the newest physical version. This can either be in the tuple header or a separate column.
- Approach #2: Tombstone Tuple: Create an empty physical version to indicate that a logical tuple is deleted. Use a separate pool for tombstone tuples with only a special bit pattern in version chain pointer to reduce storage overhead.

CMU 15-445 Lecture #18: Timestamp Ordering Concurrency Control

junglece430@gmail.com (Jungle) — Sat, 20 Apr 2024 20:07:38 +0800

CMU 15-445 Database Systems

Lecture #18: Timestamp Ordering Concurrency Control

Timestamp Ordering Concurrency Control

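Relying purely on locks hurts performance; locking is a pessimistic approach
The optimistic approach: use timestamps
If $TS(T_i) < TS(T_j)$, then the DBMS must ensure that the execution schedule is equivalent to the serial schedule where $T_i$ appears before $T_j$.
Multiple implementation strategies:
- → System/Wall Clock: can never be perfectly accurate, so it is generally not used
- → Logical Counter: what single-node systems generally use
- → Hybrid: what distributed systems use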

Basic Timestamp Ordering (BASIC T/O)

Every object X is tagged with timestamp of the last txn that successfully did read/write (so there are two kinds of timestamps):
- → W-TS(X) – Write timestamp on X
- → R-TS(X) – Read timestamp on X
Check timestamps for every operation:
- → If txn tries to access an object “from the future”, it aborts and restarts.
BASIC T/O – READS
- Don’t read stuff from the “future.”
- Action: Transaction Ti wants to read object X.
- If TS(Ti) < W-TS(X), this violates the timestamp order of Ti with regard to the writer of X.
  - → Abort Ti and restart it with a new TS.
- Else:
  - → Allow Ti to read X.
  - → Update R-TS(X) to max(R-TS(X), TS(Ti))
  - → Make a local copy of X to ensure repeatable reads for Ti.
BASIC T/O – WRITES
- Can’t write if a future transaction has read or written to the object.
- Action: Transaction Ti wants to write object X.
- If TS(Ti) < R-TS(X) or TS(Ti) < W-TS(X)
  - → Abort and restart Ti.
- Else:
  - → Allow Ti to write X and update W-TS(X)
  - → Also, make a local copy of X to ensure repeatable reads.
Thomas Write Rule
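- An optimization of the rule above
- If TS(Ti) < R-TS(X) (a later transaction has already read this object)
  - Abort and Restart Ti
- If TS(Ti) < W-TS(X) (a later transaction has already written this object, which is equivalent to my write happening and then being overwritten)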
  - The DBMS can instead ignore the write and allow the transaction to continue instead of aborting and restarting it. This is called the Thomas Write Rule.
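The read and write checks, with the Thomas Write Rule as an optional switch, fit in a few lines. This is a sketch of the rules as stated in the notes; the local copy kept for repeatable reads is left out, and `TimestampedObject`, `basic_to_read`, `basic_to_write`, and `Abort` are made-up names.

```python
class Abort(Exception):
    """Raised when a txn must abort and restart with a new timestamp."""


class TimestampedObject:
    def __init__(self, value):
        self.value = value
        self.r_ts = 0  # R-TS(X): timestamp of the latest reader
        self.w_ts = 0  # W-TS(X): timestamp of the latest writer


def basic_to_read(obj, ts):
    if ts < obj.w_ts:
        raise Abort("would read a value written 'in the future'")
    obj.r_ts = max(obj.r_ts, ts)
    return obj.value


def basic_to_write(obj, ts, value, thomas_write_rule=False):
    """Return True if the write was applied, False if it was skipped."""
    if ts < obj.r_ts:
        raise Abort("a later txn already read this object")
    if ts < obj.w_ts:
        if thomas_write_rule:
            return False  # obsolete write: ignore it and let the txn continue
        raise Abort("a later txn already wrote this object")
    obj.value = value
    obj.w_ts = ts
    return True


x = TimestampedObject(10)
basic_to_read(x, ts=5)               # R-TS(X) becomes 5
basic_to_write(x, ts=7, value=20)    # W-TS(X) becomes 7

try:
    basic_to_read(x, ts=6)           # 6 < W-TS(X) = 7
    stale_read_aborted = False
except Abort:
    stale_read_aborted = True

# TS = 6 is older than the last writer (7) but newer than the last reader (5):
skipped = basic_to_write(x, ts=6, value=99, thomas_write_rule=True)
print(stale_read_aborted, skipped, x.value)
```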
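BASIC T/O summary
- Pros: no locks, no deadlocks
- Cons: long transactions may starve (rolling back again and again); if an earlier transaction modifies data and then rolls back, later transactions may have read invalid data; reads require a local copy, so reading a lot of data is expensive

Transaction 2's data came from transaction 1; once transaction 1 rolls back, transaction 2 cannot be recovered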

Optimistic Concurrency Control (OCC)

Also based on timestamp

The DBMS creates a private workspace for each txn.
- → Any object read is copied into workspace.
- Modifications are applied to workspace. (Writes go only to the private workspace and do not need to be written to the shared database yet.)
When a txn commits, the DBMS compares workspace write set to see whether it conflicts with other txns.
If there are no conflicts, the write set is installed into the “global” database.

OCC PHASES
- #1 – Read Phase: Track the read/write sets of txns and store their writes in a private workspace.
- #2 – Validation Phase: When a txn commits, check whether it conflicts with other txns.
- #3 – Write Phase: If validation succeeds, apply private changes to database. Otherwise abort and restart the txn.

OCC assigns a transaction its timestamp only when it commits

OCC – READ PHASE
- Track the read/write sets of txns and store their writes in a private workspace.
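- The DBMS copies every tuple that the txn accesses from the shared database to its workspace to ensure repeatable reads.
OCC – VALIDATION PHASE
Decision diagram for the validation phase: at commit time a transaction must not find that a later transaction has read or written data it wrote (by serial order, the later transaction should read what this transaction committed, but this transaction has not committed yet, so the later transaction read the old data!)
- Approach #1: Backward Validation: validate against transactions that have already committed
- Approach #2: Forward Validation: validate against transactions that have not yet committed
OCC – WRITE PHASE
- Serial Commits: → Use a global latch to limit a single txn to be in the Validation/Write phases at a time. (Writing under a lock on the whole table solves the concurrency problem, and because the data to write is already prepared, the write takes very little time and barely affects concurrency.)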
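The three phases can be sketched with backward validation: at commit, a txn checks whether any txn that committed after it started wrote something it read. This is a single-threaded toy in which the serial-commit latch is implicit. `OccDatabase` and `OccTxn` are hypothetical names, and the integer `clock` stands in for real timestamps.

```python
class OccTxn:
    def __init__(self, db, start_ts):
        self.db = db
        self.start_ts = start_ts
        self.workspace = {}   # private copies of everything read or written
        self.read_set = set()
        self.write_set = {}

    def read(self, key):
        # Read phase: copy into the workspace so later reads are repeatable.
        if key not in self.workspace:
            self.workspace[key] = self.db.data.get(key)
            self.read_set.add(key)
        return self.workspace[key]

    def write(self, key, value):
        # Writes stay private until the write phase.
        self.workspace[key] = value
        self.write_set[key] = value


class OccDatabase:
    def __init__(self):
        self.data = {}
        self.clock = 0
        self.committed = []   # (commit_ts, keys written) for past txns

    def begin(self):
        return OccTxn(self, self.clock)

    def commit(self, txn):
        # Validation phase (backward): conflict if a txn that committed after
        # we started overwrote something we read.
        for commit_ts, written in self.committed:
            if commit_ts > txn.start_ts and written & txn.read_set:
                return False  # caller should abort and restart
        # Write phase: install the private writes into the global database.
        self.clock += 1
        self.data.update(txn.write_set)
        self.committed.append((self.clock, frozenset(txn.write_set)))
        return True


db = OccDatabase()
setup = db.begin()
setup.write("A", 1)
db.commit(setup)

t1 = db.begin()
t2 = db.begin()
t1.read("A")                       # t1 sees A = 1
t2.write("A", 2)
t2_ok = db.commit(t2)              # t2 validates first and wins
t1.write("B", t1.read("A") + 1)    # based on a value that is now stale
t1_ok = db.commit(t1)              # fails validation, nothing is installed
print(t2_ok, t1_ok, db.data)
```

Note that t1 did all of its work before discovering the conflict, which is the wasted effort listed among the problems with OCC.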
OCC works well when the # of conflicts is low:
- → All txns are read-only (ideal).
- → Txns access disjoint subsets of data.
If the database is large and the workload is not skewed, then there is a low probability of conflict, so again locking is wasteful.
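Problems with OCC
- Extra overhead from the local copies
- The validation logic at commit is complicated and costs performance
- The write phase locks the table, which can also become a performance bottleneck
- If anything goes wrong, all the work done so far is rolled back, which is wasteful; 2PL can detect a deadlock midway and roll the transaction back right then, so it loses less than OCC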

Dynamic Databases and The Phantom Problem

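Both 2PL and OCC have a hole when it comes to full serializability... $\rightarrow$ phantom reads
Everything discussed so far concerned reads and updates, not inserts and deletes

A phantom read scenario

Why 2PL and OCC have this hole: they only control data that already exists and do not handle insertion or deletion of data

THE PHANTOM PROBLEM
Approach #1: Re-Execute Scans
- Record the operations that may produce phantoms (SELECT … FROM … GROUP BY …/insert/delete), then scan again at commit to check for concurrency problems
- Drawback: the rescan is too expensive to be acceptable
Approach #2: Predicate Locking
- First invented for System R
- Shared lock on the predicate in a WHERE clause of a SELECT query.
- Exclusive lock on the predicate in a WHERE clause of any UPDATE, INSERT, or DELETE query.
- It is rarely implemented in systems; an example of a system that uses it is HyPer (precision locking).
- Predicate locks control the data races
Approach #3: Index Locking
- If there is an index, lock the index; without one, a coarse-grained lock such as a column lock or table lock is needed
MySQL's solution: gap locks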

Isolation Levels

It is hard for a database to be fully serializable, and many workloads do not need it, so there are different isolation levels
Isolation Levels (Strongest to Weakest):
- SERIALIZABLE: No Phantoms, all reads repeatable, and no dirty reads.
  - Possible implementation: Index locks + Strict 2PL.
- REPEATABLE READS: Phantoms may happen.
  - Possible implementation: Strict 2PL.
- READ-COMMITTED: Phantoms and unrepeatable reads may happen.
  - Possible implementation: Strict 2PL for exclusive locks, immediate release of the shared lock after a read.
- READ-UNCOMMITTED: All anomalies may happen.
  - Possible implementation: Strict 2PL for exclusive locks, no shared locks for reads.
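If a table is explicitly declared READ ONLY, the database optimizes accordingly (no locking); some databases detect this automatically and optimize when there are no writes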

CMU 15-445 Lecture #17: Two-Phase Locking

junglece430@gmail.com (Jungle) — Fri, 19 Apr 2024 17:58:06 +0800

CMU 15-445 Database Systems

Lecture #17: Two-Phase Locking

Transaction Locks

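While operating on data, a lock is placed on it through the DBMS's lock manager, which avoids concurrent data races
But the scheme for when to lock and unlock needs to be designed

Using locks to keep data safe

Lock Types
- S-LOCK: Shared locks for reading (read lock)
- X-LOCK: Exclusive locks for writing (write lock)
Locking only around each W (or R) and unlocking as soon as it is done cannot fix the serializability problem, because the operation is only one step of a transaction and the lock does not cover the transaction as a whole

Serializability problem under lock -> W(R) -> unlock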

Two-Phase Locking(2PL)

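Later researchers proposed this theory to address the fact that the locking above still did not solve the concurrency problem. Its biggest difference from the schemes above is that it does not need to know the whole transaction in advance (many of the earlier schemes only work in hindsight, and in a real system you cannot go back and redo things)
2PL has two phases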
- Phase #1– Growing: In the growing phase, each transaction requests the locks that it needs from the DBMS’s lock manager. The lock manager grants/denies these lock requests.
- Phase #2– Shrinking: Transactions enter the shrinking phase immediately after they release their first lock. In the shrinking phase, transactions are only allowed to release locks. They are not allowed to acquire new ones.

2PL Two Phases

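The problem with 2PL: cascading aborts

If a transaction rolls back during the shrinking phase, other transactions will have read data it modified but never committed

Solution: Strong Strict 2PL (aka Rigorous 2PL)

Strong Strict 2PL solves the cascading aborts problem

Deadlock Handling

Another problem with 2PL: deadlocks

Strong Strict 2PL cannot solve deadlocks; once a cycle of locks forms it basically cannot untangle itself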

Two ways of dealing with deadlocks:
- → Approach #1: Deadlock Detection: the DBMS maintains a waits-for graph describing which concurrent transaction is waiting for whose lock
  - Nodes are transactions
  - Edge from $T_i$ to $T_j$ if $T_i$ is waiting for $T_j$ to release a lock.
  - The system periodically checks for cycles in waits-for graph and then decides how to break it.
  - When the DBMS detects a deadlock, it will select a “victim” transaction to roll back (rollback or restart) to break the cycle.
  - Trade-offs: checking more often shortens the time a deadlock lasts but costs more overhead; the other question is which transaction to kill (by running time, young or old, number of SQL statements executed, number of locks held)
  - Deadlock handling: rollback length
    - Approach #1: Completely → Rollback entire txn and tell the application it was aborted.
    - Approach #2: Partial (Savepoints) → DBMS rolls back a portion of a txn (to break deadlock) and then attempts to re-execute the undone queries.
- → Approach #2: Deadlock Prevention
  - Give every transaction a timestamp: earlier transactions are older, later ones are younger
  - Older Timestamp = Higher Priority (e.g., T1 > T2)
  - Wait-Die (“Old Waits for Young”):
    - If requesting txn has higher priority than holding txn, then requesting txn waits for holding txn. (an old transaction that finds a young one holding the lock waits until the young one releases it)
    - Otherwise requesting txn aborts. (conversely, a young transaction that wants a lock held by an old one rolls itself back)
  - Wound-Wait (“Young Waits for Old”)
    - If requesting txn has higher priority than holding txn, then holding txn aborts and releases lock. (an old transaction that wants a lock held by a young one rolls the young one back and takes the lock)
    - Otherwise requesting txn waits. (a young transaction that finds the lock held by an old one waits for the old one to release it)
  - The main idea is to break the “hold and wait” condition among the conditions for deadlock: on a conflict the lock is contested right away
  - Note: a restarted txn keeps its original timestamp; otherwise it may starve
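Both approaches are short enough to sketch. Detection is a depth-first search for a cycle in the waits-for graph; prevention is a pure function of the two timestamps. The function names are made up, the graph is a plain dict from each txn to the txns it waits for, and choosing the youngest txn as the victim is just one of the policies mentioned above.

```python
def find_cycle(waits_for):
    """Return a list of txns forming a cycle in the waits-for graph, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited / on current path / finished
    color = {}
    path = []

    def dfs(txn):
        color[txn] = GRAY
        path.append(txn)
        for holder in waits_for.get(txn, ()):
            state = color.get(holder, WHITE)
            if state == GRAY:                       # back edge: found a cycle
                return path[path.index(holder):]
            if state == WHITE:
                cycle = dfs(holder)
                if cycle:
                    return cycle
        path.pop()
        color[txn] = BLACK
        return None

    for txn in list(waits_for):
        if color.get(txn, WHITE) == WHITE:
            cycle = dfs(txn)
            if cycle:
                return cycle
    return None


def wait_die(requester_ts, holder_ts):
    """Old waits for young; a young requester dies (smaller ts = older)."""
    return "WAIT" if requester_ts < holder_ts else "ABORT_REQUESTER"


def wound_wait(requester_ts, holder_ts):
    """An old requester wounds the young holder; a young requester waits."""
    return "ABORT_HOLDER" if requester_ts < holder_ts else "WAIT"


# T1 waits for T2, T2 for T3, T3 for T1: a deadlock.
graph = {"T1": ["T2"], "T2": ["T3"], "T3": ["T1"], "T4": ["T1"]}
timestamps = {"T1": 1, "T2": 2, "T3": 3, "T4": 4}

cycle = find_cycle(graph)
victim = max(cycle, key=timestamps.get)   # one possible policy: kill the youngest
print(cycle, victim)
print(wait_die(1, 2), wait_die(2, 1), wound_wait(1, 2), wound_wait(2, 1))
```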

Lock Granularities

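When acquiring a lock, should it be an attribute lock, a row lock, a table lock, or a database lock? The DBMS is responsible for this; it should keep the number of locks as small as possible (one billion row locks vs. one table lock) while also considering the effect on concurrency
Intention Lock: a lock at a higher level carries a marker saying whether anything below it is locked (for example a table lock records whether any of its rows are locked), which saves the cost of searching downward
- Intention locks also come in S and X variants
Hierarchical locks work very well in real engineering
LOCK ESCALATION
- If there are too many lower-level locks, the DBMS automatically escalates to a higher-level lock (isn't this very similar in spirit to the JVM's lock upgrade mechanism?)
Locking is normally handled automatically by the DBMS, but users can also lock manually with SQL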


LOCK TABLE <table> <mode>;

SELECT * FROM <table>
WHERE <qualification> FOR UPDATE;
# This tells MySQL to take a write lock instead of a read lock (an UPDATE will follow)

CONCLUSION

2PL is used in almost every DBMS.
Automatically generates correct interleaving:
- Locks + protocol (2PL, SS2PL …)
- Deadlock detection + handling
- Deadlock prevention

CMU 15-445 Lecture #16: Concurrency Control Theory

junglece430@gmail.com (Jungle) — Mon, 15 Apr 2024 16:04:09 +0800

CMU 15-445 Database Systems

Lecture #16: Concurrency Control Theory

Motivation

Lost Update Problem (Concurrency Control): data races
Durability Problem (Recovery): failure recovery

Transactions

Properties: ACID
Atomicity: Atomicity ensures that either all actions in the transaction happen, or none happen.
Consistency: If each transaction is consistent and the database is consistent at the beginning of the transaction, then the database is guaranteed to be consistent when the transaction completes. Data is consistent if it satisfies all validation rules such as constraints, cascades and triggers.
Isolation: Isolation means that when a transaction executes, it should have the illusion that it is isolated from other transactions. Isolation ensures that concurrent execution of transactions should have the same resulting database state as a sequential execution of the transactions.
Durability: If a transaction commits, then its effects on the database should persist.

ACID: Atomicity

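Approach #1: Logging: the common method is to write logs, typically an undo log; logging can also improve performance (disk flushes can be asynchronous)
Approach #2: Shadow Paging: keep copies of the pages the transaction modifies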

ACID: Consistency

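Later transactions can see the changes made by earlier transactions
Consistency at the business level is the responsibility of the backend programmer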

ACID: Isolation

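It feels as if I am the only one using the database
But in reality many txns are running together
There are two major schools here as well
- Pessimistic control
- Optimistic control + rollback
Three kinds of conflicts
- R-W (also called an unrepeatable read)
- W-R (also called a dirty read)
- W-W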
ACID: Durability

Once a transaction commits, it must be guaranteed to be persisted on disk