Slide 1 - Lu Jiaheng's homepage

BSON
Chengyin Xia
2012.9.20
Introduction
BSON is a computer data interchange format used mainly as a data storage
and network transfer format in the MongoDB database. It is a binary form for
representing simple data structures and associative arrays (called objects or
documents in MongoDB).
BSON is a binary format in which zero or more key/value pairs are stored as a
single entity. We call this entity a document. The name "BSON" is based on the
term JSON and stands for "Binary JSON".
Key/value pair examples:
{"hello": "world"}
{"BSON": ["awesome", 5.05, 1986]}
Before grammar
• Terminal and nonterminal
• In computer science, terminal and nonterminal symbols are the lexical
elements used in specifying the production rules that constitute a
formal grammar. The terminals and nonterminals of a particular grammar
are two disjoint sets.
• BNF syntax
• BNF is short for Backus–Naur Form. It is a notation for describing grammars.
Terminal and nonterminal
• Terminal
Terminal symbols are literal characters that can appear in the inputs to or
outputs from the production rules of a formal grammar and that cannot be
broken down into "smaller" units. To be precise, terminal symbols cannot
be changed using the rules of the grammar. A formal language defined by a
particular grammar is the set of strings that can be produced by the grammar
and that consist only of terminal symbols.
• Nonterminal
Nonterminals are the symbols that can be replaced. A formal grammar
includes a start symbol, a designated member of the set of nonterminals
from which all the strings in the language may be derived by successive
applications of the production rules. In fact, the language defined by a
grammar is precisely the set of terminal strings that can be so derived.
BNF
A BNF specification is a set of derivation rules, written as
<symbol> ::= _expression_
where <symbol> is a nonterminal, and the _expression_ consists
of one or more sequences of symbols. Multiple sequences are
separated by the vertical bar, '|', indicating a choice; each
sequence is a possible substitution for the symbol on the left. Symbols
that never appear on a left side are terminals. On the other hand,
symbols that appear on a left side are non-terminals and are
always enclosed between the pair < >.
The '::=' means that the symbol on the left must be replaced with
the expression on the right.
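As a rough illustration of how derivation rules generate a language, the sketch below encodes a toy grammar as a Python dictionary and repeatedly replaces the leftmost nonterminal until only terminals remain. The grammar (`<digits>`, `<digit>`) and the `derive` helper are made up for this example and are not part of BSON.

```python
# Toy grammar (hypothetical, for illustration only):
#   <digits> ::= <digit> | <digit> <digits>
#   <digit>  ::= "0" | "1"
GRAMMAR = {
    "<digits>": [["<digit>"], ["<digit>", "<digits>"]],
    "<digit>": [["0"], ["1"]],
}


def derive(symbols, max_len):
    """Yield every terminal string of at most max_len symbols derivable from symbols."""
    # In this toy grammar every symbol produces at least one character,
    # so longer sequences can be pruned safely.
    if len(symbols) > max_len:
        return
    for i, sym in enumerate(symbols):
        if sym in GRAMMAR:  # leftmost nonterminal: try each alternative
            for alternative in GRAMMAR[sym]:
                yield from derive(symbols[:i] + alternative + symbols[i + 1:], max_len)
            return
    # Only terminals are left, so this is a string of the language.
    yield "".join(symbols)


print(sorted(set(derive(["<digits>"], 2))))
# ['0', '00', '01', '1', '10', '11']
```

Here `<digits>` plays the role of the start symbol, and the printed strings are the members of the language with at most two characters.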
Grammar
The grammar specifies version 1.0 of the BSON standard. We've written
the grammar using a pseudo-BNF syntax. Valid BSON data is represented
by the document non-terminal.
Basic Types
The following basic types are used as terminals in the
rest of the grammar. Each type must be serialized in
little-endian format.
• byte 1 byte (8 bits)
• int32 4 bytes (32-bit signed integer)
• int64 8 bytes (64-bit signed integer)
• double 8 bytes (64-bit IEEE 754 floating point)
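A minimal sketch of these four basic types, assuming Python's standard `struct` module is used for serialization. The `pack_*` helper names are invented for this example; the `<` prefix in each format string selects little-endian byte order.

```python
import struct


def pack_byte(value: int) -> bytes:
    return struct.pack("<B", value)   # 1 byte, unsigned


def pack_int32(value: int) -> bytes:
    return struct.pack("<i", value)   # 4 bytes, signed, little-endian


def pack_int64(value: int) -> bytes:
    return struct.pack("<q", value)   # 8 bytes, signed, little-endian


def pack_double(value: float) -> bytes:
    return struct.pack("<d", value)   # 8 bytes, IEEE 754, little-endian


print(pack_int32(1986))    # b'\xc2\x07\x00\x00'
print(pack_double(5.05))   # b'333333\x14@'
print(len(pack_int64(1)))  # 8
```

The two printed byte strings are the same ones that appear for 1986 and 5.05 in Example 2.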
Derivation rules
The following derivation rules specify the rest of the BSON
grammar. Note that quoted strings represent terminals, and should
be interpreted with C semantics (e.g. "\x01" represents the byte
0000 0001). Also note that we use the * operator as shorthand for
repetition (e.g. ("\x01"*2) is "\x01\x01"). When used as a unary
operator, * means that the repetition can occur 0 or more times.
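Python byte literals happen to follow the same escape convention, so the notation can be checked directly. This is only an analogy for reading the grammar; the variable names are illustrative.

```python
# "\x01" is a single byte with the bit pattern 0000 0001.
one = b"\x01"
print(format(one[0], "08b"))   # 00000001

# ("\x01"*2) in the grammar corresponds to the byte repeated twice.
print(one * 2 == b"\x01\x01")  # True

# (byte*12), used later in the rules, stands for a run of exactly 12 bytes.
twelve_bytes = bytes(12)
print(len(twelve_bytes))       # 12
```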
Rules
document ::= int32 e_list "\x00"              BSON Document
e_list   ::= element e_list                   Sequence of elements
           | ""
element  ::= "\x01" e_name double             Floating point
           | "\x02" e_name string             UTF-8 string
           | "\x03" e_name document           Embedded document
           | "\x04" e_name document           Array
           | "\x05" e_name binary             Binary data
           | "\x06" e_name                    Undefined (deprecated)
           | "\x07" e_name (byte*12)          ObjectId
           | "\x08" e_name "\x00"             Boolean "false"
           | "\x08" e_name "\x01"             Boolean "true"
           | "\x09" e_name int64              UTC datetime
           | "\x0A" e_name                    Null value
           | "\x0B" e_name cstring cstring    Regular expression
           | "\x0C" e_name string (byte*12)   DBPointer (deprecated)
           | "\x0D" e_name string             JavaScript code
           | "\x0E" e_name string             Symbol (deprecated)
           | "\x0F" e_name code_w_s           JavaScript code w/ scope
           | "\x10" e_name int32              32-bit integer
           | "\x11" e_name int64              Timestamp
           | "\x12" e_name int64              64-bit integer
           | "\xFF" e_name                    Min key
           | "\x7F" e_name                    Max key
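Because every element begins with a one-byte type tag, the rules above can be read as a lookup table. The sketch below assumes a non-empty document and inspects only its first element; `ELEMENT_TYPES` and `first_element_type` are names chosen for this example.

```python
# Type tag -> description, copied from the element rules.
ELEMENT_TYPES = {
    0x01: "Floating point",
    0x02: "UTF-8 string",
    0x03: "Embedded document",
    0x04: "Array",
    0x05: "Binary data",
    0x06: "Undefined (deprecated)",
    0x07: "ObjectId",
    0x08: "Boolean",
    0x09: "UTC datetime",
    0x0A: "Null value",
    0x0B: "Regular expression",
    0x0C: "DBPointer (deprecated)",
    0x0D: "JavaScript code",
    0x0E: "Symbol (deprecated)",
    0x0F: "JavaScript code w/ scope",
    0x10: "32-bit integer",
    0x11: "Timestamp",
    0x12: "64-bit integer",
    0xFF: "Min key",
    0x7F: "Max key",
}


def first_element_type(document: bytes) -> str:
    # Bytes 0..3 hold the int32 length, so the first type tag is byte 4.
    # For an empty document byte 4 is the "\x00" terminator and this raises KeyError.
    return ELEMENT_TYPES[document[4]]


doc = b"\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00"
print(first_element_type(doc))  # UTF-8 string
```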
Rules
e_name   ::= cstring                          Key name
string   ::= int32 (byte*) "\x00"             String
cstring  ::= (byte*) "\x00"                   CString
binary   ::= int32 subtype (byte*)            Binary
subtype  ::= "\x00"                           Binary / Generic
           | "\x01"                           Function
           | "\x02"                           Binary (Old)
           | "\x03"                           UUID (Old)
           | "\x04"                           UUID
           | "\x05"                           MD5
           | "\x80"                           User defined
code_w_s ::= int32 string document            Code w/ scope
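The cstring, string and binary rules can be sketched as small encoders. This assumes, following the BSON specification, that the int32 in a string counts the payload bytes plus the trailing "\x00", and that the int32 in a binary counts only the payload bytes. The function names simply mirror the nonterminals and are not taken from any library.

```python
import struct


def cstring(text: str) -> bytes:
    raw = text.encode("utf-8")
    if b"\x00" in raw:
        # A cstring is terminated by the first zero byte, so it cannot contain one.
        raise ValueError("cstring cannot contain a zero byte")
    return raw + b"\x00"


def string(text: str) -> bytes:
    raw = text.encode("utf-8") + b"\x00"
    # The length prefix includes the trailing zero byte.
    return struct.pack("<i", len(raw)) + raw


def binary(payload: bytes, subtype: int = 0x00) -> bytes:
    # The length prefix counts the payload only, not the subtype byte.
    return struct.pack("<i", len(payload)) + bytes([subtype]) + payload


print(cstring("hello"))      # b'hello\x00'
print(string("world"))       # b'\x06\x00\x00\x00world\x00'
print(binary(b"\x01\x02"))   # b'\x02\x00\x00\x00\x00\x01\x02'
```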
Example 1
• {"hello": "world"}
Derivation
document ::= int32 e_list "\x00"
  int32    = \x16\x00\x00\x00
  e_list ::= element e_list
    element ::= "\x02" e_name string
      e_name ::= cstring ::= (byte*) "\x00"
               = hello \x00
      string ::= int32 (byte*) "\x00"
               = \x06\x00\x00\x00 world \x00
    e_list ::= ""
  "\x00"
Result
{"hello": "world"}
\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00
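The same bytes can be rebuilt step by step. The sketch below handles only string elements, which is all Example 1 needs, and assumes the leading int32 is the total size of the document including the prefix itself and the final "\x00". The helper names are hypothetical.

```python
import struct


def encode_string_element(name: str, value: str) -> bytes:
    e_name = name.encode("utf-8") + b"\x00"   # e_name ::= cstring
    body = value.encode("utf-8") + b"\x00"    # (byte*) "\x00"
    # element ::= "\x02" e_name string
    return b"\x02" + e_name + struct.pack("<i", len(body)) + body


def encode_document(e_list: bytes) -> bytes:
    # document ::= int32 e_list "\x00"
    total = 4 + len(e_list) + 1
    return struct.pack("<i", total) + e_list + b"\x00"


encoded = encode_document(encode_string_element("hello", "world"))
print(len(encoded))  # 22, i.e. 0x16
print(encoded)       # b'\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00'
```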
Example 2
{"BSON": ["awesome", 5.05, 1986]}
"\x31\x00\x00\x00\x04BSON\x00&\x00\x00\x00
\x020\x00\x08\x00\x00\x00awesome\x00
\x011\x00333333\x14@
\x102\x00\xc2\x07\x00\x00
\x00\x00"
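• \x31\x00\x00\x00 : length of the document (0x31 = 49 bytes)
• \x04 (the fifth byte) : element type; \x04 is the array type
• BSON\x00 : element name, terminated by '\0'; here the name is BSON
• &\x00\x00\x00 : an array is stored as an embedded document, and this is that document's length ("&" is 0x26 = 38 bytes); the embedded document is actually {"0":"awesome","1":5.05,"2":1986}
• \x02 : element type; "awesome" is a string
• 0\x00 : "0" followed by '\0'; the element name is "0", and the name ends with '\0'
• \x08\x00\x00\x00awesome\x00 : length + "awesome" + '\0'
• \x01 : element type; 5.05 is a floating point value
• 1\x00 : "1" followed by '\0'; the element name is "1"
• 333333\x14@ : the value 5.05
• \x10 : element type; 1986 is a 32-bit integer
• 2\x00 : "2" followed by '\0'; the element name is "2"
• \xc2\x07\x00\x00 : the value 1986
• \x00 : end of the embedded document, i.e. the array
• \x00 : end of the document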
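The byte-by-byte walk above can be turned into a small decoder. This is a sketch that supports only the element types appearing in the example (double, string, embedded document, array, int32); `decode_document` is an illustrative helper, not a library function.

```python
import struct

EXAMPLE_2 = (
    b"\x31\x00\x00\x00\x04BSON\x00&\x00\x00\x00"
    b"\x020\x00\x08\x00\x00\x00awesome\x00"
    b"\x011\x00333333\x14@"
    b"\x102\x00\xc2\x07\x00\x00"
    b"\x00\x00"
)


def decode_document(data: bytes, offset: int = 0):
    """Return (decoded dict, offset just past the document)."""
    (length,) = struct.unpack_from("<i", data, offset)
    end = offset + length
    pos = offset + 4
    result = {}
    while data[pos] != 0x00:                      # "\x00" closes the e_list
        tag = data[pos]
        name_end = data.index(b"\x00", pos + 1)   # e_name is a cstring
        name = data[pos + 1:name_end].decode("utf-8")
        pos = name_end + 1
        if tag == 0x01:                           # double
            (value,) = struct.unpack_from("<d", data, pos)
            pos += 8
        elif tag == 0x02:                         # string: int32 (byte*) "\x00"
            (size,) = struct.unpack_from("<i", data, pos)
            value = data[pos + 4:pos + 4 + size - 1].decode("utf-8")
            pos += 4 + size
        elif tag in (0x03, 0x04):                 # embedded document or array
            value, pos = decode_document(data, pos)
            if tag == 0x04:                       # array keys are "0", "1", ...
                value = [value[str(i)] for i in range(len(value))]
        elif tag == 0x10:                         # int32
            (value,) = struct.unpack_from("<i", data, pos)
            pos += 4
        else:
            raise NotImplementedError(f"type tag 0x{tag:02x} is not handled in this sketch")
        result[name] = value
    if pos + 1 != end:
        raise ValueError("length prefix does not match the document contents")
    return result, end


decoded, consumed = decode_document(EXAMPLE_2)
print(decoded)    # {'BSON': ['awesome', 5.05, 1986]}
print(consumed)   # 49
```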
The point of BSON
BSON is designed to be efficient in space, but in many cases is not
much more efficient than JSON. In some cases BSON uses even
more space than JSON. The reason for this is another of the BSON
design goals: traversability. BSON adds some "extra" information to
documents, like length prefixes, that make it easy and fast to
traverse.
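To make the traversability point concrete, the sketch below skips over the whole array in Example 2 using only its length prefix, without decoding any of its elements. The variable names are illustrative.

```python
import struct

data = (
    b"\x31\x00\x00\x00\x04BSON\x00&\x00\x00\x00"
    b"\x020\x00\x08\x00\x00\x00awesome\x00"
    b"\x011\x00333333\x14@"
    b"\x102\x00\xc2\x07\x00\x00"
    b"\x00\x00"
)

# Layout: int32 length | "\x04" | "BSON\x00" | embedded document | "\x00"
name_end = data.index(b"\x00", 5)            # end of the key name "BSON"
(sub_len,) = struct.unpack_from("<i", data, name_end + 1)
after_array = name_end + 1 + sub_len         # one jump past the entire array

print(sub_len)                               # 38
print(after_array == len(data) - 1)          # True
print(data[after_array] == 0x00)             # True: the document terminator
```

A JSON parser, by contrast, would have to scan every character of the array to find where it ends.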
BSON is also designed to be fast to encode and decode. For
example, integers are stored as 32 (or 64) bit integers, so they don't
need to be parsed to and from text. This uses more space than JSON
for small integers, but is much faster to parse.
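A quick size comparison, assuming compact JSON output from Python's standard `json` module and the BSON bytes from Example 1. It illustrates the space trade-off only and does not measure speed.

```python
import json
import struct

json_text = json.dumps({"hello": "world"}, separators=(",", ":"))
bson_bytes = b"\x16\x00\x00\x00\x02hello\x00\x06\x00\x00\x00world\x00\x00"
print(len(json_text.encode("utf-8")), len(bson_bytes))   # 17 22

# A small integer value: one character as JSON text, four bytes as a BSON int32.
json_int = json.dumps(7)
bson_int = struct.pack("<i", 7)
print(len(json_int), len(bson_int))                      # 1 4

# Reading the int32 back is a fixed-size unpack rather than text parsing.
(value,) = struct.unpack("<i", bson_int)
print(value == int(json_int))                            # True
```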