From ec559db1c80c350e4a54f52715ab4555813733bb Mon Sep 17 00:00:00 2001 From: Igor Mandrigin Date: Mon, 17 Feb 2020 19:56:30 +0300 Subject: [PATCH] Formal semantics: more instructions (#366) --- docs/programmers_guide/witness_formal_spec.md | 603 ++++++++++++++---- 1 file changed, 477 insertions(+), 126 deletions(-) diff --git a/docs/programmers_guide/witness_formal_spec.md b/docs/programmers_guide/witness_formal_spec.md index e39f918bf..2b5e2d750 100644 --- a/docs/programmers_guide/witness_formal_spec.md +++ b/docs/programmers_guide/witness_formal_spec.md @@ -2,75 +2,179 @@ The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://tools.ietf.org/html/rfc2119). -## The Stack +## Data Types -Witnesses are executed on a stack machine that builds tries/calculates hashes. +### Basic -The stack consists of stack items. -Each item is a pair: `(node, hash)`, where `node` is an object representing the Merkle trie node, and `hash` is 32 byte node hash value. The exact implementation details of `node`s and `hash`es are not the part of the spec. +`nil` - an empty value. -Each witness is a queue of instructions. +`Any` - any data type. MUST NOT be `nil`. -In every execution cycle a single instruction gets dequeued and a matching substitution rule gets applied to the stack. +`Int` - an integer value. We treat the domain of integers as infinite, +the overflow behaviour or mapping to the actual data types is undefined +in this spec and should be up to implementation. -In the end, when there are no more instructions left, there MUST be only one item left on the stack. +`Byte` - a single byte. + +`Hash` - 32 byte value, representing a result of Keccak256 hashing. + +### Composite + +`()` - an empty array of arbitrary type. + +`(Type...)` - an array of a type `Type`. MUST NOT be empty. + +`{field:Type}` - an object with a field `field` of type `Type`. + + - full notation: `type T = {field:Type}` + + - inline `type TBase = T1{field:Type}|T2{field2:Type2}` + +### Type Definitions + +The type definitions are a bit similar to [Haskell](https://en.wikibooks.org/wiki/Haskell/Type_declarations). +The key differences are how the arrays and type fields are defined. ### Nodes -There are multiple types of nodes that can be present on the stack. +``` +type Node = HashNode{raw_hash:Hash} + | ValueNode{raw_value:(Byte...)} + | AccountNode{nonce:Int balance:Int storage:nil|Node code:nil|CodeNode|HashNode} + | LeafNode{key:(Byte...) value:ValueNode|AccountNode} + | ExtensionNode{key:(Byte...) child:Node} + | BranchNode{child0:nil|Node child1:nil|Node child3:nil|Node ... child15:nil|Node} + | CodeNode{code:(Byte... )} +``` + +### Witness + +Witness MUST have at least 1 element. + +``` +type WitnessHeader = {version:Int} +type Instruction = {code:Int parameter:Any...} +type Witness = (Node|Instruction...) +``` + + +## Execution Enviroment + +The witness execution environment MUST contain the following 2 elements: + +- **WitnessHeader** -- a header containing the version of the witness. The `version` MUST be 1. + +- **Witness** -- a witness to be executed; + +- **Substitution Rules** -- a list of all possible substitution rules. + + +## Execution process + +Initially, the Witness MUST BE an array of `Instruction`s. + +Then, as substitution rules are applied to the witness, some elements of the +array are replaces with `Node`s. + +The execution continues until there are no substitution rules left to execute. + +Here is how the execution code might look like in Go for building a single trie. + +```go +witness := GetInitialWitness() +rules := GetSubstitutionRules() +numberOfRulesApplied := 1 // initial state + +for numberOfRulesApplied > 0 { + witness, numberOfRulesApplied := ApplyRules(witness, rules) +} + +if len(witness) == 1 { + trie.root = witness[0] +} else { + panic("witness execution failed") +} + +``` + + +And here is an example of the execution process (we will use the set of rules +form the **Substitution Rules** section of this document): + +* **Step 1**. Witness: `(HASH h1 HASH h2 BRANCH 0b101 HASH h3 BRANCH 0b11)` + +* **Step 2**. Apply `HASH` substitution rules. +Witness: `(HashNode{h1} HashNode{h2} BRANCH 0b101 HashNode{h3} BRANCH 0b11)` + +* **Step 3**. Apply `BRANCH` substitution rules (only once, because `BRANCH 0b11` +doesn't pass its `GUARD` statements just yet). +Witness: `(BranchNode{0: HashNode{h1} 2:HashNode{h2}} HashNode{h3} BRANCH 0b11)` + +* **Step 4**. Apply `BRANCH` substitution rules again. +Witness: `(BranchNode{0: BranchNode{0: HashNode{h1} 2:HashNode{h2}} 1:HashNode{h3}})` + +* **Step 5**. No more rules are applicable, the witness contains only one + element, the execution ends successfully. + + +## End Criteria + +The execution ends when there are no substitution rules applicable for this +witness. + +### Building a single trie from the witness + +If we are building a single trie from the witness, then the only SUCCESS +executon is when the following are true: + +- The execution state MUST match the End Criteria +- There MUST be only one item left in the witness +- This item MUST be of the type `Node` + +In that case, this last item will be the root of the built trie. + +Every other end state is considered a FAILURE. + + +### Building a Forest + +We also can build a forest of tries with this approach, by adding a new +Instruction `NEW_TRIE` and adjusting the success criteria a bit: + +- The execution state MUST match the End Criteria; +- The items that are left in the witness MUST follow this pattern: `(Node + NEW_TRIE ... Node)` +- Each `Node` element will be a root of a trie. + +Every other end state is considered a FAILURE. + + +## Instructions & Parameters + +A single instruction consists of substitution rules and parameters. + +Each instruction MAY have one or more parameters. +The parameters values MUST be encoded in the witness. + +That makes it different from the helper function parameters that MAY come from the stack or MAY come from the witness. -- `hashNode` -- `branchNode` -- `accountNode` -- `valueNode` -- `extensionNode` ## Substitution rules -Here is an example substitution rule. The current instruction is named `INSTR1`. +A substitution rule consists of 3 parts: -``` -STACK(node1, hash1) STACK(node2, hash2) INSTR1(params) |=> -STACK(helper1(node1, node2), helper2(hash1, hash2)) -``` +`[GUARD] PATTERN |=> RESULT` -This substitution rule replaces 2 stack items `(node1, hash1)` and `(node2, hash2)` -with a single stack item `(helper1(node1, node2), helper2(hash1, hash2))`. +- to the left of the `|=>` sign: -Where `helper1` and `helper2` are pure functions defined in pseudocode (see the example below). + - optional `GUARD` statements; -``` -helper1 (value1, value2) { - return value1 + value2 -} + - the pattern to match against; -helper2 (hash1, hash2) { - return KECCAK(hash1 + hash2) -} -``` +- result, to the right of the `|=>` sign. -The specification for a substitution rule -``` -[GUARD ...] - -[ STACK(, ), ... ] [()] |=> -STACK(node-value, hash-value), ... -``` - -There MUST be one and only one substitution rule applicable to the execution state. If an instruction has multiple substitution rules, the applicability is defined by the `GUARD` statements. -The substitution rule MAY have one or more GUARD statements. -The substitution rule MAY have one or more STACK statements before the instruction. -The substitution rule MUST have exactly one instruction. -The substitution rule MAY have parameters for the instruction. -The substitution rule MUST have at least one STACK statement after the arrow. - -So, the minimal substitution rule is for the `HASH` instruction that pushes one hash to the stack: -``` -HASH(hashValue) |=> STACK(hashNode(hashValue), hashValue) -``` - -## GUARDs +### `GUARD`s Each substitution rule can have zero, one or multiple `GUARD` statements. Each `GUARD` statement looks like this: @@ -79,23 +183,110 @@ Each `GUARD` statement looks like this: GUARD ``` -That means that for the substitution rule to be applicable the `` in the guard statement must be true. +For a substitution rule to be applicable, the `` in its `GUARD` statement MUST be true. -If a substitution rule has multiple guard statements, all of the conditions specified there should be satisfied. +If a substitution rule has multiple `GUARD` statements, all of them MUST BE satisfied. -## `TRAP` statements +If there are no `GUARD` statements, the substitution rule's applicability is +only defined by the PATTERN. -If no substitution rules are applicable for an instruction, then the execution MUST stop (`TRAP` instruction) and the partial results MUST NOT be used. +Example: +``` + GUARD NBITSET(mask) == 2 +|---- GUARD STATEMENT ---| -Implementation of the `TRAP` instruction is undefined, but it should stop the execution flow. (e.g. in Golang it might be a function returning an error or a panic). + Node(n0) Node(n1) BRANCH(mask) |=> + BranchNode{MAKE_VALUES_ARRAY(mask, n0, n1)} +``` -## Instructions & Parameters +For the example rule to be applicable both facts MUST be true: -Each instruction MAY have one or more parameters. -The parameters values MUST be situated in the witness. -The parameter values MUST NOT be taken from the stack. +1. `mask` contains only 2 bits set to 1 (the rest are set to 0); + +2. to the left of `BRANCH` instruction there is at least 2 `Node`s. + +Fact (1) comes from the `GUARD` statement. + + +### PATTERN + +`[NodeType(boundVar1)... NodeType(boundVarN)] Instruction[(param1... paramN)]` + +A pattern MUST contain a single instruction. +A pattern MAY contain one or more `Node`s to the left of the instruction to +match. +An instruction MAY have one or more parameters. + +Pattern matching is happening by the types. `Node` type means any node is +matched, some specific node type will require a specific match. + +Pattern can have bound variable names for both matched nodes and instruction +parameters (if present). + +Match: + +``` +HASH h0 HashNode{h1} HashNode{h2} BRANCH 0b11 + |------------------- MATCH -----------| + +HASH h0 BranchNode{0: HashNode{h1} 1: HashNode{h2}} + |----------- SUBSTITUTED -------------------| +``` + +No match (not enough nodes to the left of the instruction): + +``` +HASH h0 HASH h1 HashNode{h2} BRANCH 0b11 +``` + +### Result + +`NodeType(HELPER_FUNCTION(arguments))` + +The result is a single `Node` statement that replaces the pattern in the +witness if it matches and the guards are passed. + +The result MAY contain helper functions or might have in-line computation. +The result MUST have a specific node type. No generic `Node` is allowed. + +Helper functions or inline computations might use bound variables from the +pattern. + +Example + +``` + +Node(n0) Node(n1) BRANCH(mask) |=> +BranchNode{MAKE_VALUES_ARRAY(mask, n0, n1)} + ^ ^-- ^--- BOUND NODES + |---- BOUND INSTRUCTION PARAM + |------ HELPER CALL ------------| +|----------------- RESULT ------------------| + +``` + +### Bringing it all together + + +So the full syntax is this: + +``` +[GUARD ...] [ NodeType(bound_variable1)... ] INSTRUCTION[(param1 ...)] |=> +Node() +``` + +`NodeType` is one of the types of nodes to match. Can also be `Node` to match +any non-nil node. + +Substitution rules MUST be non-ambiguous. Even though, there can be multiple +substitution rules applicable to the whole witness at the same time, there MUST +be only one rule that is applicable to a certain position in the witness. + +So, the minimal substitution rule is the one for the `HASH` instruction that pushes one hash to the stack: +``` +HASH(hashValue) |=> HashNode{hashValue} +``` -That makes it different from the helper function parameters that MAY come from the stack or MAY come from the witness. ## Helper functions @@ -106,82 +297,126 @@ Helper functions MUST have at least one argument. Helper functions MAY have variadic parameters: `HELPER_EXAMPLE(arg1, arg2, list...)`. Helper functions MAY contain recursion. -## Data types - -INTEGER - we treat the domain of integers as infinite, the overflow behaviour or mapping to the actual data types is undefined in this spec and should be dependent on implementation. - -## Execution flow - -Let's look at the example. - -Our example witness would be `HASH h1; HASH h2; BRANCH 0b11`. - -Initial state of the stack is ` `; - ---- - -**1. Executing `HASH h1`**: push a `hashNode` to the stack. - -Witness: `HASH h2; BRANCH 0b11` - -Stack: `(hashNode(h1), h1)` - ---- - -**2. Executing `HASH h2`**: push one more `hashNode` to the stack. - -Witness `BRANCH 0b11` - -Stack: `(hashNode(h2), h2); (hashNode(h1), h1)` - ---- - -**3. Executing `BRANCH 0b11`**: replace 2 items on the stack with a single `fullNode`. - -Witness: ` ` - -Stack: `(fullNode{0: hashNode(h2), 1: hashNode(h1)}, KECCAK(h2+h1))` - ---- - -So our stack has exactly a single item and there are no more instructions in the witness, the execution is completed. - -## Modes - -There are two modes of execution for this stack machine: - -(1) **normal execution** -- the mode that constructs a trie; - -(2) **hash only execution** -- the mode that calculates the root hashes of a trie without constructing the trie itself; - -In the mode (2), a stack item MUST only contain the node hash. - -Stack item, mode (1): `(node, hash)`. - -Stack item, mode (2): `(hash)`. - -The exact implementation details are undefined in this spec. - ## Instructions +### `LEAF key raw_value` + +**Substitution rules** + +Replaces the instruction with a `ValueNode` wrapped with a `LeafNode`. + +``` +LEAF(key, raw_value) |=> LeafNode{key, ValueNode(raw_value)} +``` + +### `EXTENSION key` + +Wraps a node to the left of the instruction with an `ExtensionNode`. + +**Substitution rules** + +``` +Node(node) EXTENSION(key) |=> ExtensionNode{key, node} +``` + +### `HASH raw_hash` + +Replaces the instruction with a `HashNode`. + +**Substitution rules** + +``` +HASH(hash_value) |=> HashNode{hash_value} +``` + +### `CODE raw_code` + +Replaces the instruction with a `CodeNode`. + +``` +CODE(raw_code) |=> CodeNode{raw_code} +``` + +### `ACCOUNT_LEAF key nonce balance has_code has_storage` + +Replaces the instruction and, optionally, up to 2 nodes to the left of the +instructon with a single `AccountNode` wrapped with a `LeafNode`. + +**Substitution rules** + +``` +GUARD has_code == true +GUARD has_storage == true + +CodeNode(code) Node(storage_hash_node) ACCOUNT_LEAF(key, nonce, balance, has_code, has_storage) |=> +LeafNode{key, AccountNode{nonce, balance, storage_root, code}} + +--- + +GUARD has_code == true +GUARD has_storage == true + +HashNode(code) Node(storage_hash_node) ACCOUNT_LEAF(key, nonce, balance, has_code, has_storage) |=> +LeafNode{key, AccountNode{nonce, balance, storage_root, code}} + +--- + +GUARD has_code == false +GUARD has_storage == true + +Node(storage_root) ACCOUNT_LEAF(key, nonce, balance, has_code, has_storage) |=> +LeafNode{key, AccountNode{nonce, balance, storage_root, nil}} + +--- + +GUARD has_code == true +GUARD has_storage == false + +CodeNode(code) ACCOUNT_LEAF(key, nonce, balance, has_code, has_storage) |=> +LeafNode{key, AccountNode{nonce, balance, nil, nil, code}} + +--- + +GUARD has_code == true +GUARD has_storage == false + +HashNode(code) ACCOUNT_LEAF(key, nonce, balance, has_code, has_storage) |=> +LeafNode{key, AccountNode{nonce, balance, nil, nil, code}} + +--- + +GUARD has_code == false +GUARD has_storage == false + +ACCOUNT_LEAF(key, nonce, balance, has_code, has_storage) |=> +LeafNode{key, AccountNode{nonce, balance, nil, nil, nil}} + +``` + +### `NEW_TRIE` + +No substitution rules. This instruction is used as a "divider" when building +a forest of tries. + ### `BRANCH mask` -This instruction pops `NBITSET(mask)` items from both node stack and hash stack (up to 16 for each one). Then it pushes a new branch node on the node stack that has children according to the stack; it also pushes a new hash to the hash stack. +Replaces `NBITSET(mask)` `Node`s to the left of the instruction with a single +`BranchNode` with these nodes as children according to `mask`. **Substitution rules** ``` -GUARD NBITSET(mask) == 2 +GUARD NBITSET(mask) == 2 + +Node(n0) Node(n1) BRANCH(mask) |=> +BranchNode{MAKE_VALUES_ARRAY(mask, n0, n1)} -STACK(n0, h0) STACK(n1, h1) BRANCH(mask) |=> -STACK(branchNode(MAKE_VALUES_ARRAY(mask, n0, n1)), keccak(CONCAT(MAKE_VALUES_ARRAY(mask, h0, n1)))) --- - GUARD NBITSET(mask) == 3 -STACK(n0, h0) STACK(n1, h1) STACK(n2, h2) BRANCH(mask) |=> -STACK(branchNode(MAKE_VALUES_ARRAY(mask, n0, n1, n2)), keccak(CONCAT(MAKE_VALUES_ARRAY(mask, h0, n1, n2)))) +Node(n0) Node(n1) Node(n2) BRANCH(mask) |=> +BranchNode{MAKE_VALUES_ARRAY(mask, n0, n1, n2)} --- @@ -191,8 +426,8 @@ STACK(branchNode(MAKE_VALUES_ARRAY(mask, n0, n1, n2)), keccak(CONCAT(MAKE_VALUES GUARD NBITSET(mask) == 16 -STACK(n0, h0) STACK(n1, h1) ... STACK(n15, h15) BRANCH(mask) |=> -STACK(branchNode(MAKE_VALUES_ARRAY(mask, n0, n1, ..., n15)), keccak(CONCAT(MAKE_VALUES_ARRAY(mask, h0, n1, ..., n15)))) +Node(n0) Node(n1) ... Node(n15) BRANCH(mask) |=> +BranchNode{MAKE_VALUES_ARRAY(mask, n0, n1, ..., n15)} ``` ## Helper functions @@ -221,6 +456,12 @@ MAKE_VALUES_ARRAY(mask, idx, values...) { } ``` + +### `RLP(value)` + +returns the RLP encoding of a value + + ### `NBITSET(number)` returns number of bits set in the binary representation of `number`. @@ -245,4 +486,114 @@ returns the first value in the specified array ### `REST(array)` -returns the array w/o the first item \ No newline at end of file +returns the array w/o the first item + +### `KECCAK(bytes)` + +returns a keccak-256 hash of `bytes` + + +## Serialization + +The format for serialization of everything except hashes (that we know the +length of) is [CBOR](https://cbor.io). It is RFC-specified and concise. + +For hashes we use the optimization of knowing the lengths, so we just read 32 +bytes + +### Block Witness Format + +Each block witness consists of a header followed by a list of instructions. + +There is no length of witness specified anywhere, the code expects to just reach `EOF`. + +Serialized Witness: `(HEADER, OP1, OP2, ..., OPn-1, OPn, EOF)` + +#### Encoding + +##### CBOR + +The parts of the key that are encoded with CBOR are marked by the `CBOR` function. + +##### Keys + +Keys are also using custom encryption to make them more compact. + +The nibbles of a key are encoded in a following way `[FLAGS NIBBLE1+NIBBLE2 NIBBLE3+NIBBLE4 NIBBLE5... ]` + +*FLAGS* +* bit 0 -- 1 if the number of nibbles were odd +* bit 1 -- 1 if the nibbles end with 0x10 (the terminator byte) + +This is shown later as `ENCODE_KEY` function. + +#### Header + +format: `version:byte` + +encoded as `[ version ]` + +the current version is 1. + +#### Instructions + +Each instruction starts with an opcode (`uint`). + +Then it might contain some data. + +##### `LEAF` + +format: `LEAF key:[]byte value:[]byte` + +encoded as `[ 0x00 CBOR(ENCODE_KEY(key))... CBOR(value)... ]` + + +##### `EXTENSION` + +format: `EXTENSION key:[]byte` + +encoded as `[ 0x01 CBOR(ENCODE_KEY(key))... ]` + + +##### `BRANCH` + +format: `BRANCH mask:uint32` + +*mask* defines which children are present +(e.g. `0000000000001011` means that children 0, 1 and 3 are present and the other ones are not) + +encoded as `[ 0x02 CBOR(mask)...]` + + +##### `HASH` + +format: `HASH hash:[32]byte` + +encoded as `[ 0x03 hash_byte_1 ... hash_byte_32 ]` + + +##### `CODE` + +format: `CODE code:[]byte` + +encoded as `[ 0x04 CBOR(code)... ]` + + +##### `ACCOUNT_LEAF` + +format: `ACCOUNT_LEAF key:[]byte flags [nonce:uint64] [balance:[]byte]` + +encoded as `[ 0x05 CBOR(ENCODE_KEY(key))... flags /CBOR(nonce).../ /CBOR(balance).../ ]` + +*flags* is a bitset encoded in a single bit (see [`witness_operators_test.go`](../../trie/witness_operators_test.go) to see flags in action). +* bit 0 defines if **code** is present; if set to 1, then `has_code=true`; +* bit 1 defines if **storage** is present; if set to 1, then `has_storage=true`; +* bit 2 defines if **nonce** is not 0; if set to 0, *nonce* field is not encoded; +* bit 3 defines if **balance** is not 0; if set to 0, *balance* field is not encoded; + +##### `NEW_TRIE` + +format: `NEW_TRIE` + +encoded as `[ 0xBB ]` +