Why JSON Handling in Node.js Sucks?

Open Table of contents

JSON 和 JavaScript
TL;DR
Bigint?
- Using simdjson
  - JavaScript非常不幸的采用了 UTF-16 (technically, UCS-2 initially)
- 所以 Nodejs 里如何提高呢？
Summary

JSON 和 JavaScript

JSON stands for “JavaScript Object Notation”，是一种简单的数据交换格式。 JSON 由 Douglas Crockford 开发的，灵感来自 JavaScript 的对象字面量语法。所以在类型上 JSON 天生就是一个不善于表达的类型。尽管如此，由于前端的流行，JSON 也成为目前世界上最流行的交换格式了。

对于一家公司来说，JSON 的开销非常大(10%~40%)。更快的 JSON.parse 可以省掉一大笔钱。

去年的时候，我当时在搞 Ingester，要处理 JSON 格式数据的入库，当时就想去尝试使用 Node.js 来完成这个功能，但是后来发现 Node.js 的 JSON 处理速度实在是太太太太太太慢了，于是就放弃了，最后使用了 C++。

TL;DR

simdjson is the true king (Always use simdjson if you are using C++ to parse JSON)
Node.js currently does not have built-in support for SIMD and simdjson.parse(simdjson_nodejs) is slower than JSON.parse
JSON.parse with no reviver is the fastest.
Try bytedance/sonic if you are using Golang

Bigint?

直入主题

首先，JSON 不支持 bigint，所以一般是采用 string 的方式来保存。在 JSON.parse 的时候首先想到的就是使用 JSON.parse with reviver。但实际上，reviver is run after the value is parsed. 如果成功还好，如果失败了，就更慢了。

Using simdjson

simdjson is the king

simdjson 是这个世界上最快的 JSON parser。所以我自然想到了使用 Node.js bindings for simdjson 来解决这个问题。

而且 simdjson 是超越 JSON spec 的，所以是可以正确 parse bigint 的。当然只是为了解决bigint的问题的话，Node.js 也是有库可用的 sidorares/json-bigint，但还是慢啊！

luizperes/simdjson_nodejs 是一个 Node.js bindings for the simdjson library. 但是，它的速度比 JSON.parse 还要慢。为啥呢？先看一组benchmark

filesize (KB)	JSON.parse(ms)	simdjson.parse (ms)	JSON.parse (MB/s)	simdjson.parse (MB/s)	X faster
1.16	0.002	0.007	483.85	159.29	0.33

JSON.parse x 416,390 ops/sec ±0.25% (98 runs sampled)
SIMDJSON.parse x 137,086 ops/sec ±1.02% (92 runs sampled)

我就不列跑 benchmark 的机器和环境了。

在simdjson_nodejs的文档里道出了真相：

Obs.: Please see that the overhead of converting a C++ object to a JS object might make the parsing time in the NodeJS slower for the simdjson. Therefore, parsing it lazily is preferrable. For more information check issue #5.

JavaScript非常不幸的采用了 UTF-16 (technically, UCS-2 initially)

JS中，需要转成 UTF-16，比如string.length就是针对的utf-16。

Major JS VMs(V8, JSC, and Hermes) keep strings as one of two representations: UTF-16 or ASCII ASCII is safe and character positions of ASCII strings are the same as in UTF-16.

但是 simdjson 使用的是UTF-8，所以需要

JSON parse internal flow

先转成 UTF-16，然后转成 UTF8 parse JSON，然后再转成 UTF-16，大量的时间花费在了 transcode 上。

所以 Nodejs 里如何提高呢？

搞 SIMD 的 Daniel Lemire 就经常给 Node.js 的库提交代码，当然知道要搞 SIMD 了。

simdjson + simdutf，在 transcode 的时候也提高一下

// First, estimate how much capacity we need to transcode
auto utf8_capacity = simdutf::utf8_length_from_utf16(str, utf16_size);

// Then, allocate a string buffer
std::unique_ptr<char[]> utf8_str{new char[utf8_capacity]};

// Now actually transcode
auto utf8_size = simdutf::convert_utf16_to_utf8(str, utf16_size, utf8_str.get());

结果呢？JSON.parse() 比原来的版本快了 1.2-1.3x。LOL

所以啊，我放弃了，最后还是用了 C++!

Summary

Node.js目前并没有 built-in support for SIMD(Single Instruction, Multiple Data) operations. 所以 Node.js 并不能充分利用现代计算机硬件的能力来提高 JSON.parse。怎么办呢？我也不知道。或许使用其他的 JavaScript Runtime 会是一个答案。