rfc | start_date | decision_date | pr | status |
---|---|---|---|---|
0009 |
2018-08-02 |
2018-08-30 |
approved |
This RFC proposes an algorithm to hash entries with a well defined set of attributes, and a way to hash from the values to the whole entry.
This RFC is incompatible with the reference implementation
The current specfication doesn't define what is the algorithm to get from the entry data to the entry hash for the given hashing algorithm (from now on SHA-256).
The reference implementation (ORJ) reuses the item canonicalisation
algorithm to canonicalise the entry JSON representation and then hash the
result. This means that any attribute currently part of the reference
implementation Entry
entity is serialised and used to compute the hash. As
an example, the index entry number, part of the experimental work on Indexes
has affected the result of the entry hash even though it is (a) not stable and
(b) not part of the real identity of an entry.
Finally, the current implementation relies on JSON and on the attribute names serialised as strings.
This RFC aims to provide an algorithm that doesn't rely on specific attribute
names (e.g. entry-number
vs number
, item-hash
vs itemHash
, etc) which
supports a better decoupling between the data model and the REST API (we don't
have plans for it but a hypothetical GraphQL interface wouldn't work with the
current attribute names and yet conceptually an entry is the same).
Note that the item hash algorithm is not affected by this RFC.
This section describes the algorithm twice. First in a high level and then a more step by step with intermediate values.
For reference, this is the abstract definition of an entry:
type Entry =
{ number : Integer
, key : ID
, timestamp : Timestamp
, item : Set Hash
}
When this algorithm operates on hashes (e.g. tag, concatenate) it is done on the byte level, not the hexadecimal string representation that the latter example shows as partial representations.
- Let number be the string representation of the entry number.
- Let key be the string representation of the entry key.
- Let timestamp be the string representation of the entry timestamp.
- Let items be the set of item hashes as bytes. Note that
the item hashes must not have the prefix (i.e.
sha-256:
) when transformed to bytes. - Let hashList be an empty list.
- Apply the hashValue function to number tagged with
i
(Integer). And append the result to hashList. - Apply the hashValue function to key tagged with
u
(String). And append the result to hashList. - Apply the hashValue function to timestamp tagged with
t
(Timestamp). And append the result to hashList. - Let itemsBytes be an empty set.
- For each value item in the items set:
- Apply the hashValue function to item tagged with
r
(Hash) and append the result to itemsBytes. - Sort the elements of itemsBytes, concatenate them and
- Apply the hashValue function tagged with
s
(Set). And append the result to hashList. - Concatenate the elements of hashList in order (i.e.
[numberHash, keyHash, timestampHash, itemsHash]
), tag it withl
(List) and hash the result.
The sorting algorithm for a set of hashes is done by comparing the list of
bytes one by one. For example, given a set ["foo", "bar"]
you'll get the
folllowing byte lists after hashing them as unicode:
[ [166,166,229,231,131,195,99,205,149,105,62,193,137,194,104,35,21,217,86,134,147,151,115,134,121,181,99,5,242,9,80,56]
, [227,3,206,11,208,244,193,253,254,76,193,232,55,215,57,18,65,226,224,71,223,16,250,97,1,115,61,193,32,103,93,254]
]
In this case, the set is already sorted given that 166
is smaller than
227
.
To walk through the algorithm I'll use the following entry:
Entry
{ number = 6
, key = ID "GB"
, timestamp = Timestamp 2016 4 5 13 23 5 Utc
, item = Set [Sha256 "6b18693874513ba13da54d61aafa7cad0c8f5573f3431d6f1c04b07ddb27d6bb"]
}
And this is the current serialisation in JSON:
[
{
"index-entry-number":"6",
"entry-number":"6",
"entry-timestamp":"2016-04-05T13:23:05Z",
"key":"GB",
"item-hash":["sha-256:6b18693874513ba13da54d61aafa7cad0c8f5573f3431d6f1c04b07ddb27d6bb"]
}
]
Notice they are not exaclty the same even though they represent the same
entity. Also notice that the ID
type is not yet an approved RFC. For this
RFC this shouldn't be a problem as it works the same as if the type were a
string.
Each value MUST be tagged by a byte that identifies the type, in a similar way as the objecthash algorithm.
type Tag
= Hash
| Integer
| Set
| String
| Timestamp
| List
tagToChar : Tag -> Char
tagToChar tag =
case tag of
Hash ->
'r'
Integer ->
'i'
Set ->
's'
String ->
'u'
Timestamp ->
't'
List ->
'l'
To keep the explanation simpler hashValue
assumes SHA-256 as the hashing
algorithm.
hashValue : Tag -> String -> Hash
Where:
Tag
is the tag for theValue
type andString
is the string representation of the value to hash,Hash
is the list of bytes resulting of the hashing algorithm.
Following the JSON example from above, these are the hashes for each value:
numberValue = toString 6
numberValue == "6"
numberHash = hashValue Integer numberValue
toHex numberHash == "396ee89382efc154e95d7875976cce373a797fe93687ca8a27589116644c4bcd"
timestampValue = toString (Timestamp 2016 4 5 13 23 5 Utc)
toHex timestampValue == "2016-04-05T13:23:05Z"
timestampHash = hashValue Timestamp timestampValue
toHex timestampHash == "f22ecc4464f22c8fee624769189665a0afd7ef10a2775a000082c47cbd9f6419"
keyValue = toString (ID "GB")
keyValue == "GB"
keyHash = hashValue String keyValue
toHex keyHash == "fff7021c7df4426be0f9a3c83f236eb6f85d159e624b010d65e6dde267889c21"
The item is a set of values rather than a leaf value and each value is already
hashed. This means that each element of the set needs to be tagged with r
(Hash) and hashed again.
itemsBytes = map (hashValue Hash) (Set [Sha256 "6b18693874513ba13da54d61aafa7cad0c8f5573f3431d6f1c04b07ddb27d6bb"])
Then sort the set concatenate and tag with s
(Set) and hash the result.
itemHash = hashValue Set (concat (sort itemsBytes))
toHex itemHash == "cff910f74878650a3cceb54039bdb62707de9d20e80d4385127732a4e444bd57"
The order is:
hashList =
[ numberHash
, keyHash
, timestampHash
, itemHash
]
Note: The order has been chosen arbitrarily.
And finally, tag and hash the result of concatenating the list of hashes:
hash : List Hash -> Hash
entryHash = hash hashList
entryHash == "51a02cd5692c6a03ba78330cb68f8e26e976c5933af0aa8d779589a1e6264e4b"