Formal semantics: more instructions (#366)

2024-12-22 03:30:37 +00:00 · 2020-02-17 19:56:30 +03:00 · 2020-02-17 19:56:30 +03:00 · ec559db1c8
commit ec559db1c8
parent 21ce7b4c0d
1 changed files with 477 additions and 126 deletions
--- a/docs/programmers_guide/witness_formal_spec.md
+++ b/docs/programmers_guide/witness_formal_spec.md
@ -2,75 +2,179 @@

 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119](https://tools.ietf.org/html/rfc2119).

-## The Stack
+## Data Types

-Witnesses are executed on a stack machine that builds tries/calculates hashes.
+### Basic

-The stack consists of stack items.
-Each item is a pair: `(node, hash)`, where `node` is an object representing the Merkle trie node, and `hash` is 32 byte node hash value. The exact implementation details of `node`s and `hash`es are not the part of the spec.
+`nil` - an empty value.

-Each witness is a queue of instructions.
+`Any` - any data type. MUST NOT be `nil`.

-In every execution cycle a single instruction gets dequeued and a matching substitution rule gets applied to the stack.
+`Int` - an integer value. We treat the domain of integers as infinite,
+the overflow behaviour or mapping to the actual data types is undefined
+in this spec and should be up to implementation.

-In the end, when there are no more instructions left, there MUST be only one item left on the stack.  
+`Byte` - a single byte.
+
+`Hash` - 32 byte value, representing a result of Keccak256 hashing.
+
+### Composite
+
+`()` - an empty array of arbitrary type.
+
+`(Type...)` - an array of a type `Type`. MUST NOT be empty.
+
+`{field:Type}` - an object with a field `field` of type `Type`.
+
+  - full notation: `type T = {field:Type}`
+
+  - inline `type TBase = T1{field:Type}|T2{field2:Type2}`
+
+### Type Definitions
+
+The type definitions are a bit similar to [Haskell](https://en.wikibooks.org/wiki/Haskell/Type_declarations).
+The key differences are how the arrays and type fields are defined.

 ### Nodes

-There are multiple types of nodes that can be present on the stack.
+```
+type Node = HashNode{raw_hash:Hash}
+          | ValueNode{raw_value:(Byte...)}
+          | AccountNode{nonce:Int balance:Int storage:nil|Node code:nil|CodeNode|HashNode}
+          | LeafNode{key:(Byte...) value:ValueNode|AccountNode}
+          | ExtensionNode{key:(Byte...) child:Node}
+          | BranchNode{child0:nil|Node child1:nil|Node child3:nil|Node ... child15:nil|Node}
+          | CodeNode{code:(Byte... )}
+```
+
+### Witness
+
+Witness MUST have at least 1 element.
+
+```
+type WitnessHeader = {version:Int}
+type Instruction = {code:Int parameter:Any...}
+type Witness = (Node|Instruction...)
+```
+
+
+## Execution Enviroment
+
+The witness execution environment MUST contain the following 2 elements:
+
+- **WitnessHeader** -- a header containing the version of the witness. The `version` MUST be 1.
+
+- **Witness** -- a witness to be executed;
+
+- **Substitution Rules** -- a list of all possible substitution rules.
+
+
+## Execution process
+
+Initially, the Witness MUST BE an array of `Instruction`s.
+
+Then, as substitution rules are applied to the witness, some elements of the
+array are replaces with `Node`s.
+
+The execution continues until there are no substitution rules left to execute.
+
+Here is how the execution code might look like in Go for building a single trie.
+
+```go
+witness := GetInitialWitness()
+rules := GetSubstitutionRules()
+numberOfRulesApplied := 1 // initial state
+
+for numberOfRulesApplied > 0 {
+    witness, numberOfRulesApplied := ApplyRules(witness, rules)
+}
+
+if len(witness) == 1 {
+    trie.root = witness[0]
+} else {
+    panic("witness execution failed")
+}
+
+```
+
+
+And here is an example of the execution process (we will use the set of rules
+form the **Substitution Rules** section of this document):
+
+* **Step 1**. Witness: `(HASH h1 HASH h2 BRANCH 0b101 HASH h3 BRANCH 0b11)`
+
+* **Step 2**. Apply `HASH` substitution rules.
+Witness: `(HashNode{h1} HashNode{h2} BRANCH 0b101 HashNode{h3} BRANCH 0b11)`
+
+* **Step 3**. Apply `BRANCH` substitution rules (only once, because `BRANCH 0b11`
+doesn't pass its `GUARD` statements just yet).
+Witness: `(BranchNode{0: HashNode{h1} 2:HashNode{h2}} HashNode{h3} BRANCH 0b11)`
+
+* **Step 4**. Apply `BRANCH` substitution rules again.
+Witness: `(BranchNode{0: BranchNode{0: HashNode{h1} 2:HashNode{h2}} 1:HashNode{h3}})`
+
+* **Step 5**. No more rules are applicable, the witness contains only one
+    element, the execution ends successfully.
+
+
+## End Criteria
+
+The execution ends when there are no substitution rules applicable for this
+witness.
+
+### Building a single trie from the witness
+
+If we are building a single trie from the witness, then the only SUCCESS
+executon is when the following are true:
+
+- The execution state MUST match the End Criteria
+- There MUST be only one item left in the witness
+- This item MUST be of the type `Node`
+    
+In that case, this last item will be the root of the built trie.
+
+Every other end state is considered a FAILURE.
+
+
+### Building a Forest 
+
+We also can build a forest of tries with this approach, by adding a new
+Instruction `NEW_TRIE` and adjusting the success criteria a bit:
+
+- The execution state MUST match the End Criteria;
+- The items that are left in the witness MUST follow this pattern: `(Node
+    NEW_TRIE ... Node)`
+- Each `Node` element will be a root of a trie.
+
+Every other end state is considered a FAILURE.
+
+
+## Instructions & Parameters
+
+A single instruction consists of substitution rules and parameters.
+
+Each instruction MAY have one or more parameters.
+The parameters values MUST be encoded in the witness.
+
+That makes it different from the helper function parameters that MAY come from the stack or MAY come from the witness.

- `hashNode`
- `branchNode`
- `accountNode`
- `valueNode`
- `extensionNode`

 ## Substitution rules

-Here is an example substitution rule. The current instruction is named `INSTR1`.
+A substitution rule consists of 3 parts: 

-```
-STACK(node1, hash1) STACK(node2, hash2) INSTR1(params) |=> 
-STACK(helper1(node1, node2), helper2(hash1, hash2))
-```
+`[GUARD] PATTERN |=> RESULT`

-This substitution rule replaces 2 stack items `(node1, hash1)` and `(node2, hash2)`
-with a single stack item `(helper1(node1, node2), helper2(hash1, hash2))`.
+- to the left of the `|=>` sign:

-Where `helper1` and `helper2` are pure functions defined in pseudocode (see the example below).
+    - optional `GUARD` statements;

-```
-helper1 (value1, value2) {
-    return value1 + value2
-}
+    - the pattern to match against;

-helper2 (hash1, hash2) {
-    return KECCAK(hash1 + hash2)
-}
-```
+- result, to the right of the `|=>` sign.

-The specification for a substitution rule

-```
-[GUARD <CONDITION> ...]
-
-[ STACK(<node-var-name>, <hash-var-name>), ... ] <INSTRUCTION>[(<params>)] |=>
-STACK(node-value, hash-value), ...
-```
-
-There MUST be one and only one substitution rule applicable to the execution state. If an instruction has multiple substitution rules, the applicability is defined by the `GUARD` statements.
-The substitution rule MAY have one or more GUARD statements.
-The substitution rule MAY have one or more STACK statements before the instruction.
-The substitution rule MUST have exactly one instruction.
-The substitution rule MAY have parameters for the instruction.
-The substitution rule MUST have at least one STACK statement after the arrow.
-
-So, the minimal substitution rule is for the `HASH` instruction that pushes one hash to the stack:
-```
-HASH(hashValue) |=> STACK(hashNode(hashValue), hashValue)
-```
-
-## GUARDs
+### `GUARD`s

 Each substitution rule can have zero, one or multiple `GUARD` statements.
 Each `GUARD` statement looks like this:
@ -79,23 +183,110 @@ Each `GUARD` statement looks like this:
 GUARD <CONDITION>
 ```

-That means that for the substitution rule to be applicable the `<CONDITION>` in the guard statement must be true.
+For a substitution rule to be applicable, the `<CONDITION>` in its `GUARD` statement MUST be true.

-If a substitution rule has multiple guard statements, all of the conditions specified there should be satisfied.
+If a substitution rule has multiple `GUARD` statements, all of them MUST BE satisfied.

-## `TRAP` statements
+If there are no `GUARD` statements, the substitution rule's applicability is
+only defined by the PATTERN.

-If no substitution rules are applicable for an instruction, then the execution MUST stop (`TRAP` instruction) and the partial results MUST NOT be used.
+Example:
+```
+ GUARD NBITSET(mask) == 2
+|---- GUARD STATEMENT ---|

-Implementation of the `TRAP` instruction is undefined, but it should stop the execution flow. (e.g. in Golang it might be a function returning an error or a panic).
+ Node(n0) Node(n1) BRANCH(mask) |=> 
+ BranchNode{MAKE_VALUES_ARRAY(mask, n0, n1)}
+```

-## Instructions & Parameters
+For the example rule to be applicable both facts MUST be true:

-Each instruction MAY have one or more parameters.
-The parameters values MUST be situated in the witness.
-The parameter values MUST NOT be taken from the stack.
+1. `mask` contains only 2 bits set to 1 (the rest are set to 0);
+
+2. to the left of `BRANCH` instruction there is at least 2 `Node`s.
+
+Fact (1) comes from the `GUARD` statement.
+
+
+### PATTERN
+
+`[NodeType(boundVar1)... NodeType(boundVarN)] Instruction[(param1... paramN)]`
+
+A pattern MUST contain a single instruction.
+A pattern MAY contain one or more `Node`s to the left of the instruction to
+match.
+An instruction MAY have one or more parameters.
+
+Pattern matching is happening by the types. `Node` type means any node is
+matched, some specific node type will require a specific match.
+
+Pattern can have bound variable names for both matched nodes and instruction
+parameters (if present).
+
+Match:
+
+```
+HASH h0 HashNode{h1} HashNode{h2} BRANCH 0b11
+       |------------------- MATCH -----------|
+
+HASH h0 BranchNode{0: HashNode{h1} 1: HashNode{h2}}
+       |----------- SUBSTITUTED -------------------|
+```
+
+No match (not enough nodes to the left of the instruction):
+
+```
+HASH h0 HASH h1 HashNode{h2} BRANCH 0b11
+```
+
+### Result
+
+`NodeType(HELPER_FUNCTION(arguments))`
+
+The result is a single `Node` statement that replaces the pattern in the
+witness if it matches and the guards are passed.
+
+The result MAY contain helper functions or might have in-line computation.
+The result MUST have a specific node type. No generic `Node` is allowed.
+
+Helper functions or inline computations might use bound variables from the
+pattern. 
+
+Example
+
+```
+                             
+Node(n0) Node(n1) BRANCH(mask) |=>
+BranchNode{MAKE_VALUES_ARRAY(mask, n0, n1)}
+                             ^     ^-- ^--- BOUND NODES
+                             |---- BOUND INSTRUCTION PARAM
+          |------ HELPER CALL ------------|
+|----------------- RESULT ------------------|
+
+```
+
+### Bringing it all together
+
+
+So the full syntax is this:
+
+```
+[GUARD <CONDITION> ...] [ NodeType(bound_variable1)... ] INSTRUCTION[(param1 ...)] |=>
+Node(<HELPER_FUNCTION_OR_COMPUTATION>)
+```
+
+`NodeType` is one of the types of nodes to match. Can also be `Node` to match
+any non-nil node.
+
+Substitution rules MUST be non-ambiguous. Even though, there can be multiple
+substitution rules applicable to the whole witness at the same time, there MUST
+be only one rule that is applicable to a certain position in the witness.
+
+So, the minimal substitution rule is the one for the `HASH` instruction that pushes one hash to the stack:
+```
+HASH(hashValue) |=> HashNode{hashValue}
+```

-That makes it different from the helper function parameters that MAY come from the stack or MAY come from the witness.

 ## Helper functions

@ -106,82 +297,126 @@ Helper functions MUST have at least one argument.
 Helper functions MAY have variadic parameters: `HELPER_EXAMPLE(arg1, arg2, list...)`.
 Helper functions MAY contain recursion.

-## Data types
-
-INTEGER - we treat the domain of integers as infinite, the overflow behaviour or mapping to the actual data types is undefined in this spec and should be dependent on implementation.
-
-## Execution flow 
-
-Let's look at the example.
-
-Our example witness would be `HASH h1; HASH h2; BRANCH 0b11`.
-
-Initial state of the stack is ` <empty> `;
-
---
-
-**1. Executing `HASH h1`**: push a `hashNode` to the stack.
-
-Witness: `HASH h2; BRANCH 0b11`
-
-Stack: `(hashNode(h1), h1)`
-
---
-
-**2. Executing `HASH h2`**: push one more `hashNode` to the stack.
-
-Witness `BRANCH 0b11`
-
-Stack: `(hashNode(h2), h2); (hashNode(h1), h1)`
-
---
-
-**3. Executing `BRANCH 0b11`**: replace 2 items on the stack with a single `fullNode`.
-
-Witness: ` <empty> `
-
-Stack: `(fullNode{0: hashNode(h2), 1: hashNode(h1)}, KECCAK(h2+h1))`
-
---
-
-So our stack has exactly a single item and there are no more instructions in the witness, the execution is completed.
-
-## Modes
-
-There are two modes of execution for this stack machine:
-
-(1) **normal execution** -- the mode that constructs a trie;
-
-(2) **hash only execution** -- the mode that calculates the root hashes of a trie without constructing the trie itself;
-
-In the mode (2), a stack item  MUST only contain the node hash.
-
-Stack item, mode (1): `(node, hash)`.
-
-Stack item, mode (2): `(hash)`.
-
-The exact implementation details are undefined in this spec.
-
 ## Instructions

+### `LEAF key raw_value`
+
+**Substitution rules**
+
+Replaces the instruction with a `ValueNode` wrapped with a `LeafNode`.
+
+```
+LEAF(key, raw_value) |=> LeafNode{key, ValueNode(raw_value)}
+```
+
+### `EXTENSION key`
+
+Wraps a node to the left of the instruction with an `ExtensionNode`.
+
+**Substitution rules**
+
+```
+Node(node) EXTENSION(key) |=> ExtensionNode{key, node}
+```
+
+### `HASH raw_hash`
+
+Replaces the instruction with a `HashNode`.
+
+**Substitution rules**
+
+```
+HASH(hash_value) |=> HashNode{hash_value}
+```
+
+### `CODE raw_code`
+
+Replaces the instruction with a `CodeNode`.
+
+```
+CODE(raw_code) |=> CodeNode{raw_code}
+```
+
+### `ACCOUNT_LEAF key nonce balance has_code has_storage`
+
+Replaces the instruction and, optionally, up to 2 nodes to the left of the
+instructon with a single `AccountNode` wrapped with a `LeafNode`.
+
+**Substitution rules**
+
+```
+GUARD has_code == true
+GUARD has_storage == true
+
+CodeNode(code) Node(storage_hash_node) ACCOUNT_LEAF(key, nonce, balance, has_code, has_storage) |=>
+LeafNode{key, AccountNode{nonce, balance, storage_root, code}}
+
+---
+
+GUARD has_code == true
+GUARD has_storage == true
+
+HashNode(code) Node(storage_hash_node) ACCOUNT_LEAF(key, nonce, balance, has_code, has_storage) |=>
+LeafNode{key, AccountNode{nonce, balance, storage_root, code}}
+
+---
+
+GUARD has_code == false
+GUARD has_storage == true
+
+Node(storage_root) ACCOUNT_LEAF(key, nonce, balance, has_code, has_storage) |=>
+LeafNode{key, AccountNode{nonce, balance, storage_root, nil}}
+
+---
+
+GUARD has_code == true
+GUARD has_storage == false
+
+CodeNode(code) ACCOUNT_LEAF(key, nonce, balance, has_code, has_storage) |=>
+LeafNode{key, AccountNode{nonce, balance, nil, nil, code}}
+
+---
+
+GUARD has_code == true
+GUARD has_storage == false
+
+HashNode(code) ACCOUNT_LEAF(key, nonce, balance, has_code, has_storage) |=>
+LeafNode{key, AccountNode{nonce, balance, nil, nil, code}}
+
+---
+
+GUARD has_code == false
+GUARD has_storage == false
+
+ACCOUNT_LEAF(key, nonce, balance, has_code, has_storage) |=>
+LeafNode{key, AccountNode{nonce, balance, nil, nil, nil}}
+
+```
+
+### `NEW_TRIE`
+
+No substitution rules. This instruction is used as a "divider" when building
+a forest of tries.
+
 ### `BRANCH mask`

-This instruction pops `NBITSET(mask)` items from both node stack and hash stack (up to 16 for each one). Then it pushes a new branch node on the node stack that has children according to the stack; it also pushes a new hash to the hash stack.
+Replaces `NBITSET(mask)` `Node`s to the left of the instruction with a single
+`BranchNode` with these nodes as children according to `mask`.

 **Substitution rules**
 ```

-GUARD NBITSET(mask) == 2 
+GUARD NBITSET(mask) == 2
+
+Node(n0) Node(n1) BRANCH(mask) |=> 
+BranchNode{MAKE_VALUES_ARRAY(mask, n0, n1)}

-STACK(n0, h0) STACK(n1, h1) BRANCH(mask) |=> 
-STACK(branchNode(MAKE_VALUES_ARRAY(mask, n0, n1)), keccak(CONCAT(MAKE_VALUES_ARRAY(mask, h0, n1))))
 ---

-
 GUARD NBITSET(mask) == 3

-STACK(n0, h0) STACK(n1, h1) STACK(n2, h2) BRANCH(mask) |=> 
-STACK(branchNode(MAKE_VALUES_ARRAY(mask, n0, n1, n2)), keccak(CONCAT(MAKE_VALUES_ARRAY(mask, h0, n1, n2))))
+Node(n0) Node(n1) Node(n2) BRANCH(mask) |=> 
+BranchNode{MAKE_VALUES_ARRAY(mask, n0, n1, n2)}

 ---

@ -191,8 +426,8 @@ STACK(branchNode(MAKE_VALUES_ARRAY(mask, n0, n1, n2)), keccak(CONCAT(MAKE_VALUES

 GUARD NBITSET(mask) == 16

-STACK(n0, h0) STACK(n1, h1) ... STACK(n15, h15) BRANCH(mask) |=>
-STACK(branchNode(MAKE_VALUES_ARRAY(mask, n0, n1, ..., n15)), keccak(CONCAT(MAKE_VALUES_ARRAY(mask, h0, n1, ..., n15))))
+Node(n0) Node(n1) ... Node(n15) BRANCH(mask) |=>
+BranchNode{MAKE_VALUES_ARRAY(mask, n0, n1, ..., n15)}
 ```

 ## Helper functions
@ -221,6 +456,12 @@ MAKE_VALUES_ARRAY(mask, idx, values...) {
 }
 ```

+
+### `RLP(value)`
+
+returns the RLP encoding of a value
+
+
 ### `NBITSET(number)`

 returns number of bits set in the binary representation of `number`.
@ -245,4 +486,114 @@ returns the first value in the specified array

 ### `REST(array)`

-returns the array w/o the first item
+returns the array w/o the first item
+
+### `KECCAK(bytes)`
+
+returns a keccak-256 hash of `bytes`
+
+
+## Serialization
+
+The format for serialization of everything except hashes (that we know the
+length of) is [CBOR](https://cbor.io). It is RFC-specified and concise.
+
+For hashes we use the optimization of knowing the lengths, so we just read 32
+bytes
+
+### Block Witness Format
+
+Each block witness consists of a header followed by a list of instructions.
+
+There is no length of witness specified anywhere, the code expects to just reach `EOF`.
+
+Serialized Witness: `(HEADER, OP1, OP2, ..., OPn-1, OPn, EOF)`
+
+#### Encoding
+
+##### CBOR
+
+The parts of the key that are encoded with CBOR are marked by the `CBOR` function.
+
+##### Keys
+
+Keys are also using custom encryption to make them more compact.
+
+The nibbles of a key are encoded in a following way `[FLAGS NIBBLE1+NIBBLE2 NIBBLE3+NIBBLE4 NIBBLE5... ]`
+
+*FLAGS*
+* bit 0 -- 1 if the number of nibbles were odd
+* bit 1 -- 1 if the nibbles end with 0x10 (the terminator byte)
+
+This is shown later as `ENCODE_KEY` function.
+
+#### Header
+
+format: `version:byte`
+
+encoded as `[ version ]`
+
+the current version is 1.
+
+#### Instructions 
+
+Each instruction starts with an opcode (`uint`).
+
+Then it might contain some data.
+
+##### `LEAF`
+
+format: `LEAF key:[]byte value:[]byte` 
+
+encoded as `[ 0x00 CBOR(ENCODE_KEY(key))... CBOR(value)... ]`
+
+
+##### `EXTENSION`
+
+format: `EXTENSION key:[]byte` 
+
+encoded as `[ 0x01 CBOR(ENCODE_KEY(key))... ]`
+
+
+##### `BRANCH`
+
+format: `BRANCH mask:uint32`
+
+*mask* defines which children are present 
+(e.g. `0000000000001011` means that children 0, 1 and 3 are present and the other ones are not)
+
+encoded as `[ 0x02 CBOR(mask)...]`
+
+
+##### `HASH` 
+
+format: `HASH hash:[32]byte`
+
+encoded as `[ 0x03 hash_byte_1 ... hash_byte_32 ]`
+
+
+##### `CODE`
+
+format: `CODE code:[]byte`
+
+encoded as `[ 0x04 CBOR(code)... ]`
+
+
+##### `ACCOUNT_LEAF`
+
+format: `ACCOUNT_LEAF key:[]byte flags [nonce:uint64] [balance:[]byte]` 
+
+encoded as `[ 0x05 CBOR(ENCODE_KEY(key))... flags /CBOR(nonce).../ /CBOR(balance).../ ]`
+
+*flags* is a bitset encoded in a single bit (see [`witness_operators_test.go`](../../trie/witness_operators_test.go) to see flags in action).
+* bit 0 defines if **code** is present; if set to 1, then `has_code=true`;
+* bit 1 defines if **storage** is present; if set to 1, then `has_storage=true`;
+* bit 2 defines if **nonce** is not 0; if set to 0, *nonce* field is not encoded;
+* bit 3 defines if **balance** is not 0; if set to 0, *balance* field is not encoded;
+
+##### `NEW_TRIE`
+
+format: `NEW_TRIE`
+
+encoded as `[ 0xBB ]`
+