JSON Optimization
- RFC PR: Open in the new tab
- Tracking Issue: Open in the new tab
Summary
This RFC describes the design of JSON performance optimization.
Motivation
Currently, Databend Semi-Structured data is stored as the raw text JSON format and use serde_json
to do serialization and deserialization.
It has the following performance problems:
- JSON must be parsed every time used, and text parsing is very slow, especially for large complex object data.
- Query a single key path requires reading and parsing the whole JSON data, which takes more parsing time and takes up more memory.
In order to make the query performance of Semi-Structured data as fast as other data types. This RFC introduced the following two proposals:
- Automatically detects frequently queried key paths of JSON columns and extracts them as virtual columns to achieving similar query performance as other columns.
- Use JSONB instead of JSON as the underlying binary encoding format for Semi-Structured data types, which will optimize the parsing speed and benefit for key path access.
Guide-level explanation
None.
Reference-level explanation
Virtual Column
The schema of JSON can be changed arbitrarily. However, in practice use, JSON data is often generated by machine and has a fairly rigid and predictable structure. Taking advantage of this feature, we can detect and extract frequently queried key path of JSON as virtual column to speed up query processing.
Collect frequently accessed JSON key paths
Extracting and storing virtual columns requires extra parsing processes and storage resources, so it is not appropriate to generate virtual columns for all key paths. We should only generate virtual columns for frequently queried JSON key paths.
In order to know which key paths are frequently queried, we use the
Extract virtual columns
The operation of extracting virtual columns is performed asynchronous, so that it will not affect the performance of the data insertion. Since Databend will compact the table blocks periodically, we can do extracting the virtual column as an additional operation, if the column data type is JSON. In this way, the data has been read for compact can be reused, which will reduce unnecessary read amplification.
The process of extracting virtual columns is as follows:
- Collect key paths which the accesses counts exceeds the threshold generated by FPGrowth algorithm.
- Parse JSON data, infer JSON schema for all the key paths, include the data type of value, and the row numbers.
- Check all key paths, if the data type of the value is the same, and there is data in each row, extract data from this key path to generate virtual columns and store them in a separate parquet file.
Take the following JSON as an example.
{"id":1,"title":"a","user":{"id":1,"name":"a"},"tags":["t1","t2"]}
{"id":2,"title":"b","user":{"id":2,"name":"b"},"tags":["t3","t4"]}
{"id":3,"title":"c","user":{"id":3,"name":3}}
{"id":4,"title":"d","user":{"id":4,"name":4}}
The key path id
, title
and user:id
have the same data type, we can extract them as virtual columns.
For user.name
, the value of data type is not same.
For tags
, it isn't on the every row. We don't generate virtual columns for them.
Push down key path access to storage
There are two ways to access json key path: dot notation such as col:level1:level2
and bracket notation such as col['level1'][0]
.
For example:
create table test (id int8, v json);
insert into test values(1, parse_json('{"k1":{"k2":"v"},"a":[0,1,2]}'));
select v:k1:k2, v['a'][0] from test;
+---------+-----------+
| v:k1:k2 | v['a'][0] |
+---------+-----------+
| "v" | 0 |
+---------+-----------+
Currently, the access of JSON key path will be converted to a get
scalar function, and traverse the JSON by keys to fetch the value.
In order to use the virtual columns to improve the performance, we need to modify the query plan and push down the access of key path to the storage layer.
Since we use JSONB instead of JSON, even for key paths that don't generated virtual columns, pushing down to the storage layer will bring benefits.
JSONB
JSONB stands for JSON Binary
or JSON better
, which is an optimized binary format data type supported by PostgreSQL since v9.4.
CockroachDB also implements JSONB based on the PostgreSQL design.
By storing data as JSONB instead of JSON, we can get the following benefits:
- The parsing and querying process of JSONB is significantly faster.
- Accessing a particular key path don't need read the full data, which will greatly speed up overall performance.
- External indexes can be added to further speed up queries.
binary encoding
JSONB is a tree structure. Each node consists of three parts, a 32-bit header, several 32-bit JEntry
, and the variable-length raw content.
- The header is also known as the container header, it includes a 3-bit type (include
array
,object
andscalar
) and a 28-bit to indicate the number of elements inarray
orobject
. - The
JEntry
has a 3-bit value type(includetrue
,false
,null
,string
,number
,array
andobject
), a 28-bit provides the length or offset of the raw content, and a 1-bit flag to indicate whether length or offset is used. - The last part is the raw content values, which can be located by the length or offset in
JEntry
.
Take this JSON as an example, we can see the encoding format of JSONB as follow.
{"a":1,"b":[true,2,"v"]}
.--------------. .-----------------------------------.
| header | -> | type(object) | item nums(2) |
.--------------. .-----------------------------------.
| key1 JEntry | -> | flag | val type(string) | len/off | -+
.--------------. .-----------------------------------. |
| key2 JEntry | -> | flag | val type(string) | len/off | -+-+
.--------------. .-----------------------------------. | |
| val1 JEntry | -> | flag | val type(number) | len/off | -+-+-+
.--------------. .-----------------------------------. | | |
| val2 JEntry | -> | flag | val type(string) | len/off | -+-+-+-+
.--------------. .-----------------------------------. | | | |
| "a" | <-----------------------------------------+ | | |
.--------------. | | |
| "b" | <-------------------------------------------+ | |
.--------------. | |
| 1 | <---------------------------------------------+ |
.--------------. |
| [true,2,"v"] | <-----------------------------------------------+
.--------------.
| .--------------+ .-----------------------------------.
+---------> | header | -> | type(array) | item nums(3) |
.--------------+ .-----------------------------------.
| item1 JEntry | -> | flag | val type(bool true) |
.--------------+ .-----------------------------------.
| item2 JEntry | -> | flag | val type(number) | len/off | -+
.--------------+ .-----------------------------------. |
| item3 JEntry | -> | flag | val type(string) | len/off | -+-+
.--------------+ .-----------------------------------. | |
| 2 | <-----------------------------------------+ |
.--------------. |
| "v" | <-------------------------------------------+
.--------------.
For more detailed information, see the design of
Benefit from the tree struct of JSONB, searching for key paths only need parse part of the full data. In each node, object keys can be accessed in O(𝑙𝑜𝑔(𝑛)), since keys are sorted and binary search can be used, and array element can be accessed in O(1), since the length of JEntry is fixed.
Drawbacks
- Extracting virtual columns requires additional asynchronous tasks.
- JSONB may take more disk space than JSON.
Rationale and alternatives
serde
.There are also some optimizations for plain text json, such as
Prior art
The main idea of this RFC comes from the
Both PostgreSQL and CockroachDB use JSONB as optimized JSON format, we can implement Rust version JSONB based on their design.
Unresolved questions
None.
Future possibilities
- Extract virtual columns for key paths with value have different data types
- Add extra index for JSONB