
On-demand Container Loading in AWS Lambda


These are my notes based on the paper On-demand Container Loading in AWS Lambda.
This paper won the Best Paper Award at USENIX ATC ’23, and a presentation video is available at the link.


Introduction

AWS Lambda supports two deployment methods: Zip and Container. Container deployment is more flexible, but it makes cold starts longer, because the image still has to be pulled from ECR onto the machine that runs your Lambda. If your image is several GB, the pull alone eats a lot of time and pushes your cold start into unacceptable territory. I ran into exactly this in a previous project. Container deployment is convenient, but for a commercial product latency is life or death. We eventually switched to Zip: the deployment dropped from several GB to a few tens of MB, and the cold start dropped to a very acceptable level.

I stumbled across this paper in August this year, read it out of curiosity, and came away inspired. Even though AWS has put a lot of effort into improving the performance of container deployment, I still recommend deploying with Zip (unless you can keep your image very small).

As the title already tells you, the paper is mainly about loading, done on demand.

AWS Lambda’s Design Goals

All of these are important, but the one I personally care about most is the last: cold start.

Where are the challenges?

security

customer code and data is not trusted, and the only communication between the workload inside the MicroVM and the shared worker components is over a simple, well tested, and formally verified implementation of virtio (specifically virtio-net and virtio-blk).

efficiency, latency, cost

How are they solved?

caching, deduplication, convergent encryption, erasure coding, block-level demand loading (sparse loading)

All of this is covered in detail in the paper; I only list the parts I find interesting.

A few things I found interesting

Architecture

AWS Lambda Worker

(Figures: AWS Lambda Invoke Path; AWS Lambda Worker Components)

Reducing Data Movement

When we launched AWS Lambda, we recognized that reducing data movement during these cold starts was critical.

Simply moving and unpacking a 10GiB image for each of these 15,000 containers would require 150Pb/s of network bandwidth.

When writing C++, one of the classic performance rules is the same idea: move whenever you can, never copy.
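To make the scale concrete, here is a quick back-of-envelope in Python. The container rate and image size are the numbers quoted above; the conversion into bytes per second is my own rough arithmetic, not a figure from the paper.

```python
# Back-of-envelope: how much image data a naive "pull and unpack the whole
# image" approach would move at the scale quoted above.
containers_per_second = 15_000   # new containers started per second
image_size_gib = 10              # container image size in GiB

gib_per_second = containers_per_second * image_size_gib
terabytes_per_second = gib_per_second * 2**30 / 1e12

print(f"{gib_per_second:,} GiB of image data per second "
      f"(~{terabytes_per_second:.0f} TB/s), before any caching or dedup")
```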

Three Factors That Simplify the Problem

The core of it is a cache.

Block-Level Loading

(Figure: Local Agent and Worker Local Cache)

Lambda Worker

Written in Rust
We used the tokio runtime, and reqwest and hyper for HTTP
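The real on-worker components are written in Rust, as the quote above says; what follows is only a minimal Python sketch of the block-level demand-loading idea: a guest read is translated into fixed-size chunk fetches, served from the worker-local cache when possible. The 512 KiB chunk size and the `fetch_remote` callback are assumptions of this sketch, not details taken from the paper.

```python
CHUNK_SIZE = 512 * 1024  # fixed chunk size assumed for the flattened image

class DemandLoadedImage:
    """Serve block-device reads by loading image chunks on demand."""

    def __init__(self, fetch_remote):
        self.fetch_remote = fetch_remote  # callable: chunk_index -> bytes
        self.local_cache = {}             # chunk_index -> bytes

    def _chunk(self, index):
        # Fetch a chunk only the first time it is actually read.
        if index not in self.local_cache:
            self.local_cache[index] = self.fetch_remote(index)
        return self.local_cache[index]

    def read(self, offset, length):
        """Answer a virtio-blk style read(offset, length) from chunks."""
        out, end = bytearray(), offset + length
        while offset < end:
            index, within = divmod(offset, CHUNK_SIZE)
            take = min(CHUNK_SIZE - within, end - offset)
            out += self._chunk(index)[within:within + take]
            offset += take
        return bytes(out)
```

Only the chunks a function actually touches are ever downloaded, which is why sparse loading pays off when much of a large image is never read.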

Deduplication

Base container images, such as the official Docker alpine, ubuntu, and node.js are extremely widely used

Approximately 80% of newly uploaded Lambda functions result in zero unique chunks, and are just re-uploads of images that had been uploaded in the past

To deduplicate or not to deduplicate, that is the question

For most customer images, a lot of the content is actually shared. But too much deduplication increases the blast radius: if one chunk of, say, alpine goes missing or gets corrupted, a huge number of customers are affected. It has to be balanced.
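As a concrete illustration of chunk-level deduplication, here is a minimal Python sketch: chunks are keyed by their SHA-256 digest, so identical chunks across different images are stored only once. The store layout and names are mine, not the paper's.

```python
import hashlib

CHUNK_SIZE = 512 * 1024  # same fixed chunk size as the sketch above

def chunks(image_bytes):
    for i in range(0, len(image_bytes), CHUNK_SIZE):
        yield image_bytes[i:i + CHUNK_SIZE]

class ChunkStore:
    """Content-addressed chunk store: identical chunks are kept once."""

    def __init__(self):
        self.by_digest = {}  # sha256 hex digest -> chunk bytes

    def add_image(self, image_bytes):
        manifest, new_chunks = [], 0
        for chunk in chunks(image_bytes):
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.by_digest:
                self.by_digest[digest] = chunk
                new_chunks += 1
            manifest.append(digest)  # the image is just a list of digests
        return manifest, new_chunks
```

A re-upload of an already-seen image comes back with `new_chunks == 0`, which matches the roughly 80% of newly uploaded functions mentioned above. The flip side is exactly the blast-radius concern: the more manifests reference one stored chunk, the more functions a lost or corrupted copy of that chunk would break.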

Convergent Encryption

A cryptographic hash of each block (in the case of Farsite a file block, in our case a chunk of a flattened container image) is used to deterministically derive a cryptographic key that is used for encrypting the block.

How?

The flattening process takes each chunk, derives a key from it by computing its SHA256 digest, and then encrypts the block using AES-CTR (with the derived key).

AES-CTR is used with a deterministic (all-zero) IV, ensuring that the same plaintext always produces the same ciphertext.
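A minimal Python sketch of the scheme described above: the key is derived from the chunk's own SHA-256 digest and AES-CTR is used with an all-zero IV, so identical plaintext chunks always yield identical ciphertext and can still be deduplicated. I am using the `cryptography` package here; the real system's key handling certainly has more to it.

```python
import hashlib
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

ZERO_IV = bytes(16)  # deterministic all-zero IV, as described above

def convergent_encrypt(chunk: bytes) -> tuple[bytes, bytes]:
    """Encrypt a chunk under a key derived from its own SHA-256 digest."""
    key = hashlib.sha256(chunk).digest()  # 32 bytes -> AES-256 key
    encryptor = Cipher(algorithms.AES(key), modes.CTR(ZERO_IV)).encryptor()
    return key, encryptor.update(chunk) + encryptor.finalize()

def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    decryptor = Cipher(algorithms.AES(key), modes.CTR(ZERO_IV)).decryptor()
    return decryptor.update(ciphertext) + decryptor.finalize()
```

Because the ciphertext is deterministic, deduplication works directly on the encrypted chunks, so the shared tiers never need the plaintext.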

Tiered Cache

Chunks are fetched through a tiered cache: worker-local cache -> AZ-level (availability-zone-level) shared cache -> S3.
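A minimal sketch of the lookup path that hierarchy implies: try the worker-local cache, then the AZ-level shared cache, then fall back to S3, back-filling the faster tiers on the way back. The back-fill policy and the dict-like cache client are assumptions of this sketch.

```python
class TieredChunkCache:
    """Chunk lookup: worker-local cache -> AZ-level shared cache -> S3."""

    def __init__(self, az_cache, s3_fetch):
        self.local = {}           # worker-local cache (in memory here)
        self.az_cache = az_cache  # dict-like shared cache client (assumed)
        self.s3_fetch = s3_fetch  # callable: digest -> bytes, authoritative

    def get(self, digest):
        chunk = self.local.get(digest)
        if chunk is not None:
            return chunk                    # local hit: cheapest path

        chunk = self.az_cache.get(digest)
        if chunk is None:
            chunk = self.s3_fetch(digest)   # miss everywhere: go to origin
            self.az_cache[digest] = chunk   # back-fill the shared tier

        self.local[digest] = chunk          # back-fill the local tier
        return chunk
```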

(Figure: AZ-level cache)

Performance

(Figure: performance from the worker’s perspective)

Summary

1. Container deployment performance still falls short; if you can, go with Zip.
2. In any scenario, reducing data movement is a good way to improve performance.
3. Design your cache well.
4. Want faster programs? Know your hardware! Deep dive into the details.