Skip to content

What Goes Around Comes Around... And Around...

Published: at 12:00 AM

Table of contents

Open Table of contents

Introduction

大概是五年前,我开始学习数据库的底层知识。有一本很好的入门教材,叫作Readings in Database Systems 简称小红书。里面开篇就是推荐了《What Goes Around Comes Around》这篇论文。在这篇论文里介绍了当时的数据库的发展情况。然后我今天想要分享的是20年后的更新之作 《What Goes Around Comes Around… And Around…》,这是 Michael Stonebraker 和 Andy Pavlo 合作完成的,而且 Stonebraker 就是前作的作者之一。 这篇论文非常好的总结了过去二十年数据库领域的变化和趋势,非常值得一读。

TL;DR

这篇 Paper 里细节很多,我印象比较深刻的是:

Notes

Data Models & Query Languages

MapReduce System

性能太差,然后死掉了。

MR’s Legacy

Key/Value Stores

关系数据库可以轻易地模拟 KV 数据库

如果需要 embedded DBMS,可以使用 SQLite 和 DuckDB

新架构上有一个趋势是使用 embedded KV stores 来作为 full-featured DBMSs 的底层 storage manager,比如 Meta 给 MySQL 替换了一个基于 RocksDB 的 MyRocks 引擎,MongoDB 使用 WiredTiger’s KV Store 替代 MMAP

The key/value (KV) data model is the simplest model possible. It represents the following binary relation:

(key,value)

Document Databases

The document data model represents a database as a collection of record objects.

在 2000s 的时候,关系数据库的 scalability 很差。NoSQL 提供了很好的 scalability,然后就火了,比如 MongoDB。当时有两个 marketing messages

讽刺的是,at the end of 2010s, almost every NoSQL DBMS added a SQL interface.

  • DynamoDB PartiQL
  • Cassandra CQL
  • Aerospike AQL
  • Couchbase SQL++
  • Mongo added SQL for their Atlas service in 2021

有些也添加了 strongly consistent (ACID) transactions

SQL:2016 开始添加JSON的支持,SQL:2023添加了 JSON 的类型,主要的难度在于 SQL optimizers,尤其是早期的,slow and ineffective

My Thoughts on MongoDB

我在2014-2016年创业的时候就重度使用 MongoDB,一开始的时候还没有 WiredTiger,性能非常差。但是对于小型业务来说非常好,他有着很好的 scalability 和 flexibility。 对于 Node.js 来说又是非常的好用,直接把 JSON 存起来实在是太方便了,aggregation pipeline 也很好用。Replication 和 Sharding 也很好用。

前段时间我又重新去看一眼 MongoDB,发现功能上并没有什么大的变化。当然性能上应该提升了很多,毕竟一个数据库其实大部分的工作就是提升性能、可靠性,功能性其实挺谨慎的。 让我震惊的是 MongoDB 和 Redis 都支持了 vector search。有点担心 vector database 厂商的前景了。

Column-Family Databases aka wide-column

Text Search Engines

By tokenizing documents into a “bag of words” and then building full-text indexes (aka inverted indexes) on those tokens to support queries on their contents.

Array Databases

常见的 array data representation usage:

System Challenges with storing and querying real-world array data sets
Discussion

Vector Databases

Similar to how the column-family model is a reduction of the document model, the vector data model simplifies the array data model to one-dimensional rasters.

developers use them to store single-dimension embeddings generated from AI tools. 导致了 vector databases 的流行

听过一个 LanceDB 创始人的分享,他们一开始并不知道自己的数据库有什么用,还是 AI 厂商找到了他们,说他们的数据库可以用来存储 embedding。

For example, one could convert each Wikipedia article into an embedding using Google BERT and store them in a vector database along with additional article meta-data: (title, date, author, [embedding-vector]) sizes:

One compelling feature of vector DBMSs is that they provide better integration with AI tools (e.g., ChatGPT, LangChain) than RDBMSs.

vector DBMSs are essentially document-oriented DBMSs with specialized ANN indexes.

In 2023, many of the major RDBMSs added vector indexes, including Oracle, SingleStore, Rockset, and Clickhouse. Contrast this with JSON support in RDBMSs. NoSQL systems like MongoDB and CouchDB became popular in the late 2000s and it took several years for RDBMSs to add support for it

现有数据库引入 vector index 也不困难,有很多开源的库

Graph Databases

有两种流行的表达:

Key Challenge

these systems have to overcome is that it is possible to simulate a graph as a collection of tables:

Node (node_id, node_data)
Edge (node_id_1, node_id_2, edge_data)

This means that RDBMSs are always an option to support graphs. 但是 vanilla SQL is not expressive enough for graph queries and thus require multiple client-server roundtrips for traversal operations.

在石墨文档,基本也是这么搞的。也是这个导致我对数据库产生了兴趣

Summary

A reasonable conclusion from the above section is that non-SQL, non-relational systems are either a niche market or are fast becoming SQL/RM systems.

要么有自己的合适的市场,要么成为 SQL/RM

System Architectures

Columnar Systems

Data warehouse applications have common properties that are distinct from OLTP workloads:

  1. They are historical in nature (i.e., they are loaded periodically and then are read-only).
  2. Organizations retain everything as long as they can afford the storage — think terabytes to petabytes.
  3. Queries typically only access a small subset of attributes from tables and are ad-hoc in nature.
Organizing the DBMS’s storage by columns instead of rows has several benefits

In summary, column stores are new DBMS implementations with specialized optimizers, executors, and storage formats. They have taken over the data warehouse marketplace because of their superior performance.

Data Lakes / Lakehouse

这是一个趋势,从单体、专门的 OLAP data warehouses for OLAP workloads 转向 data lakes backed by object stores.

Original Data Warehouse

ingesting -> stored with proprietary formats

Vendors viewed their DBMSs as the “gatekeepers” for all things related to data in an organization.

但是在过去,很多组织特别是科技公司都不这样搞

Data lake architecture
  1. applications upload files to a distributed object store, bypassing the traditional route through the DBMS
    • applications write data to data lakes using open-source, disk-resident file formats. The two most popular formats are Twitter/Cloudera’s Parquet and Meta’s ORC
    • Apache Arrow is a similar binary format for exchanging in-memory data between systems.
  2. Users then execute queries and processing pipelines on these accumulated files using a lakehouse (a portmanteau of data warehouse and data lake) execution engine
    • These lakehouse systems provide a unified infrastructure supporting SQL and non-SQL workloads.
      • non-SQL 是至关重要的,很多数据科学家和 ML practitioner 经常使用 Python-based notebooks that use Panda’s DataFrame API to access data instead of SQL
DataLakes真香?

a data lake seems like a terrible idea for an organization:

但是几乎所有的主流的 cloud vendors 提供了 some variation of a managed data lake service. why?

NewSQL Systems

many organizations could not use these NoSQL systems because their applications could not give up strong transactional requirements. But the existing RDBMSs were not able to scale across multiple machines 所以,NewSQL 出现了。

two main groups of NewSQL systems

但是切换 NewSQL 也不是很如意,毕竟大家在切换 OLTP DBMSs 时更加谨慎。

Aftermath of NewSQL

distributed, transactional SQL RDBMSs

主流的 NoSQL 也在添加事务。。。 MongoDB, Cassandra, and DynamoDB,甚至 Google Spanner

Google said this cogently when they discarded eventual consistency in favor of real transactions with Spanner in 2012

Hardware Accelerators

to accelerate queries

OLAP workloads will continue to move aggressively to the cloud, but special purpose hardware is not likely to find acceptance unless it is built by the cloud vendor.

The only place that custom hardware accelerators will succeed is for the large cloud vendors.

Amazon did this already with their Redshift AQUA accelerators. Google BigQuery has custom components for in-memory shuffles.

Blockchain Databases

The ideal use case for blockchain databases is peer-to-peer applications where one cannot trust anybody.

At the present time, cryptocurrencies (Bitcoin) are the only use case for blockchains

够讽刺的了

Legitimate businesses are unwilling to pay the performance price (about five orders of magnitude) to use a blockchain DBMS

To the best of our knowledge, all the major cryptocurrency exchanges run their businesses off traditional RDBMSs and not blockchain systems.

最关键的是什么?

No sensible company would rely on random participants on the Internet as the backup solution for mission-critical databases.

Summary

Parting Comments

Conclusion

One of us will likely still be alive and out on bail in two decades, and thus fully expects to write a follow-up to this paper in 2044.

LOL