rfc | start_date | decision_date | pr | status |
---|---|---|---|---|
8 |
2018-08-03 |
2018-08-31 |
approved |
This RFC proposes an algorithm to hash items such that data can be redacted at any granularity without breaking the integrity of the register.
This RFC is not backwards compatible
The current item hashing algorithm makes impossible to redact a bit of data inside the item. All you can do is to redact the full blob. This don't end well when for example a GDPR request forces redacting a piece of information replicated in many items. Which is likely to happen given that every time a new item is introduced with a change, the rest of the data is copied.
The following explanation is an implementation of the objecthash algorithm.
Items are stored as a set of attribute-value pairs where values are strings. Types and casting from the string representation are handled in a different phase of the process. To align with this fact and to align with the features provided by objecthash values are tagged according to two possible scenarios: a string value and a set of string values (i.e. a cardinality n value).
type Tag
= Dict -- 'd'
| String -- 'u'
| Set -- 's'
Note that a Set is unordered at origin so order is applied as part of the hashing process.
For reference, this is the abstract definition of an item:
type Value
= StringValue String -- cardinality 1
| SetValue (Set String) -- cardinality n
type Item =
Dict String Value
Note: Attribute names should be of type Fieldname
but it's not yet defined
in a RFC. In any case, Fieldname
is a subset of String
so this RFC is
intended to be forward compatible.
When this algorithm operates on hashes (e.g. tag, concatenate) it is done on the byte level, not the hexadecimal string representation that the latter example shows as partial representations.
- Let item be the normalised blob of data to hash.
- Let hashList be an empty list.
- Let valueHash be null.
- Foreach (attr, value) pair in item:
- If value is null, continue.
- If value is a Set:
- Let elList be an empty list.
- Foreach el in value:
- If el starts with
**REDACTED**
, append el without**REDACTED**
to elList. - Otherwise, normalise el according to string normalisation
tag it with
u
(String), hash it and append it to elList. - Concatenate elList elements, sort them, tag it with
s
(Set), hash it and set it to valueHash.
- If el starts with
- If value starts with
**REDACTED**
, set valueHash with value without**REDACTED**
. - Otherwise, normalise value according to string normalisation
tag it with
u
(String), hash it and set valueHash. - Tag attr with
u
(String), hash it and set attrHash. - Concat attrHash and valueHash in this order, and append to hashList.
- Sort hashList.
- Concat hashList elements, tag with
d
, hash it and return.
Note: Any time the algorithm says "tag with" it means to prepend the byte
corresponding to the tag (e.g. 0x70
) to the list of bytes of the thing to tag.
Note: Blob normalisation will be addressed in another RFC.
The sorting algorithm for a set of hashes is done by comparing the list of
bytes one by one. For example, given a set ["foo", "bar"]
you'll get the
folllowing byte lists after hashing them as unicode:
[ [166,166,229,231,131,195,99,205,149,105,62,193,137,194,104,35,21,217,86,134,147,151,115,134,121,181,99,5,242,9,80,56]
, [227,3,206,11,208,244,193,253,254,76,193,232,55,215,57,18,65,226,224,71,223,16,250,97,1,115,61,193,32,103,93,254]
]
In this case, the set is already sorted given that 166
is smaller than
227
.
A string should be in the NFC form as defined by the Unicode standard: https://en.wikipedia.org/wiki/Unicode_equivalence
Redaction works in the same way as objecthash.
In summary, these two items are equivalent:
i_0 =
Dict [ ("foo", "abc")
, ("bar", "xyz")
]
i_1 =
Dict [ ("foo", "**REDACTED**2a42a9c91b74c0032f6b8000a2c9c5bcca5bb298f004e8eff533811004dea511")
, ("bar", "xyz")
]
The reason they are equivalent is because the hash for "abc"
is
"2a42a9c91b74c0032f6b8000a2c9c5bcca5bb298f004e8eff533811004dea511". So when
the hashing algorithm is applied to the whole item, the redacted value is used
as is (without the redaction tag).
Note: To simplify the code, the hashing algorithm SHA-256
is implied. A full
implementation must acknowledge the possibility of a different algorithm.
To walk through the algorithm I'll use the following item:
Dict
[ ("id", "GB")
, ("official-name", "The United Kingdom of Great Britain and Northern Ireland")
, ("name", "United Kingdom")
, ("citizen-names", Set ["Briton", "British citizen"])
]
Which is serialised as JSON as:
{
"id": "GB",
"official-name": "The United Kingdom of Great Britain and Northern Ireland",
"name": "United Kingdom",
"citizen-names": ["Briton", "British citizen"]
}
The following definitions set the context of operation:
type Tag
= Dict
| Set
| String
tagToChar : Tag -> Char
tagToChar tag =
case tag of
Dict ->
'd'
Set ->
's'
String ->
'u'
hashTagged : String -> Hash
hashString : String -> Hash
hashSet : Set -> Hash
hashValue : Value -> Hash
hashPair : (String, Value) -> List Byte
hash : Item -> Hash
The hash
function, step by step would roughly look like:
id = hashPair ("id", "GB")
id == "17b788a70eeccbdc2fcb2d2d3db216c02fa88ac668beeb164bb2328c864bf3f4fff7021c7df4426be0f9a3c83f236eb6f85d159e624b010d65e6dde267889c21"
officialName = hashPair ("official-name", "The United Kingdom of Great Britain and Northern Ireland")
officialName == "cf09bea8c0107bd2150b073150d48db0a5b24c83defc7960ed698378d9f84b93bf1860175c77869938cf9f4b37edb00f2f387be7b361f9c2c4a2ac202c1ba2e5"
name = hashPair ("name", "United Kingdom")
name == "5c0be87ed7434d69005f8bbd84cad8ae6abfd49121b4aaeeb4c1f4a2e298771194099b1e0b9a1e673bafee513080197fa1980895ca27e091fdd4c54fab2bed24"
citizenNames = hashPair ("citizen-names", Set ["Briton", "British citizen"])
citizenNames == "bb3a7ac86d4f90c20d099992de0bd09bf3c4f27169c2cd873836762b01d5a2be16897987a6ee59d9ffdb456ed02df34a79b05346498d4360172568101ae157c1"
Then combine the partial results and tag it as a dictionary:
itemHash = hashDict [id, officialName, name, citizenNames]
itemHash == "45d9392ad17cead3fa46501eba3e5ac237cb46a39f1e175905f00ef6a6667257"
Now, let's redact part of the original item. From:
Dict
[ ("id", "GB")
, ("official-name", "The United Kingdom of Great Britain and Northern Ireland")
, ("name", "United Kingdom")
, ("citizen-names", Set ["Briton", "British citizen"])
]
To:
Dict
[ ("id", "GB")
, ("official-name", "**REDACTED**bf1860175c77869938cf9f4b37edb00f2f387be7b361f9c2c4a2ac202c1ba2e5")
, ("name", "United Kingdom")
, ("citizen-names", Set ["Briton", "British citizen"])
]
The resulting hash would be the same "45d9392ad17cead3fa46501eba3e5ac237cb46a39f1e175905f00ef6a6667257".
This algorithm is as secure as the previous one.