Canonicalization

One of the hardest problems to solve around digital signatures is generating a digest or hash of the source data consistently, despite any structural changes that may have happened while in transit. The objective of canonicalization is to ensure that the data is logically equivalent at the source and destination so that the digest can be calculated reliably on both sides, and thus be used in digital signatures.

In the world of XML, the Canonical XML Version 1.1 W3C recommendation sets out rules for creating consistent documents. Anyone who has worked with XML signatures (XML-DSIG), however, knows that despite the best intentions and libraries, it can still be difficult to get the expected results, especially when using different languages at the source and destination.

JSON, conversely, lacks a clearly defined or actively maintained industry standard for canonicalization, despite having a much simpler syntax. Indeed, the JSON Web Token specification gets around canonicalization issues by including the actual signed payload data as a Base64 string inside the signatures.

One of the objectives of GOBL is to create a document that could potentially be stored in any key-value format as an alternative to JSON, like YAML, Protobuf, or maybe even XML. Perhaps GOBL documents need to be persisted to a document database like CouchDB or a JSONB field in PostgreSQL. It should not matter what the underlying format or persistence engine is, as long as the logical contents are exactly the same. Thus, when signing documents, it's essential to have a reliable canonical version of JSON, even if the data is stored somewhere else.

The c14n package included in GOBL is inspired by the works of others and aims to define a simple standardized approach to canonical JSON that could potentially be implemented easily in other languages.
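To illustrate the problem, the following minimal sketch hashes two JSON documents that carry the same logical content but differ in key order and whitespace. The `digest` helper is a hypothetical name used only for this illustration and is not part of GOBL.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// digest returns the hex-encoded SHA-256 hash of the raw document bytes.
// It is a hypothetical helper for illustration only.
func digest(doc string) string {
	return fmt.Sprintf("%x", sha256.Sum256([]byte(doc)))
}

func main() {
	// Same logical content; only key order and whitespace differ.
	a := `{"a":56,"foo":"bar"}`
	b := `{ "foo": "bar", "a": 56 }`

	// The hashes are computed over bytes, so they do not match.
	fmt.Println(digest(a) == digest(b)) // false
}
```

Without a canonical form, a signature calculated over the first document could not be verified against the second, even though both describe the same data.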

GOBL JSON C14n

GOBL considers the following JSON values as explicit types:

a string
a number, which extends the JSON spec and is split into:
- an integer
- a float
an object
an array
a boolean
null

JSON in canonical form:

Must be encoded in valid UTF-8. A document with invalid character encoding will be rejected.
Must not include superfluous or non-semantic whitespace.
Must order the attributes of objects lexicographically by the code points of their names.
Must remove attributes from objects whose value is null.
Must not remove null values from arrays.
Must represent numbers that are mathematically integers—i.e., those with a zero-valued fractional part—using the canonical JSON integer form. These numbers must not be represented with:
- a leading minus sign when the value is zero (i.e., use 0, not -0);
- a decimal point (e.g., 3, not 3.0);
- exponent notation (e.g., 1000, not 1e3);
- leading zeroes (e.g., 42, not 042), as already prohibited by the JSON specification.
Must represent floating-point numbers in exponential notation, adhering to the following format:
- A nonzero single-digit integer part to the left of the decimal point (e.g., 1.23E3, not 12.3E2);
- A nonempty fractional part to the right of the decimal point (e.g., 1.2E3, not 1.E3);
- No trailing zeroes in the fractional part, unless required to satisfy the condition above;
- A capital E as the exponent separator (not lowercase e);
- No plus sign (+) in either the mantissa or the exponent;
- No leading zeroes in the exponent (e.g., 1.2E3, not 1.2E003).
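The floating-point rules above can be sketched in Go as follows. This is only an illustration under stated assumptions: `formatFloat` is a hypothetical helper, not the c14n package's own code, it covers floats only (integers would be handled separately), and it treats zero as `0.0E0` to match the usage example further down. NaN and infinities are not handled because JSON cannot express them.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// formatFloat renders f in the exponential form described above.
// Hypothetical helper for illustration; not taken from the c14n package.
func formatFloat(f float64) string {
	// Shortest round-trip representation, e.g. "1.234E+02" or "1E+03".
	s := strconv.FormatFloat(f, 'E', -1, 64)
	mant, exp, _ := strings.Cut(s, "E")

	// The fractional part must not be empty.
	if !strings.Contains(mant, ".") {
		mant += ".0"
	}

	// Drop the plus sign and leading zeroes from the exponent,
	// keeping a minus sign when the exponent is negative.
	neg := strings.HasPrefix(exp, "-")
	exp = strings.TrimLeft(exp, "+-0")
	if exp == "" {
		exp = "0"
	} else if neg {
		exp = "-" + exp
	}
	return mant + "E" + exp
}

func main() {
	fmt.Println(formatFloat(123.4))   // 1.234E2
	fmt.Println(formatFloat(0.00012)) // 1.2E-4
}
```

Starting from `strconv.FormatFloat` with precision `-1` avoids trailing zeroes in the mantissa, so only the exponent needs to be cleaned up afterwards.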

Must represent all strings, including object attribute keys, in their minimal length UTF-8 encoding:

using two-character escape sequences where possible for characters that require escaping, specifically:

Character	Escape Sequence	Unicode
`"` Quotation Mark	`\"`	`U+0022`
`\` Reverse Solidus (backslash)	`\\`	`U+005C`
`⌫` Backspace	`\b`	`U+0008`
`⇥` Character Tabulation (tab)	`\t`	`U+0009`
`␊` Line Feed (newline)	`\n`	`U+000A`
`␌` Form Feed	`\f`	`U+000C`
`↵` Carriage Return	`\r`	`U+000D`

using six-character \u00XX uppercase hexadecimal escape sequences for control characters that require escaping but lack a two-character sequence described previously, and
rejecting any string that contains invalid encoding.
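A minimal sketch of the string rules listed above might look like the following. The `canonicalString` function is a hypothetical name for illustration and is not the c14n package's API; it assumes its input is an already-decoded Go string.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode/utf8"
)

// canonicalString returns s as a quoted JSON string using the escaping
// rules above. Hypothetical helper; not part of the c14n package.
func canonicalString(s string) (string, error) {
	if !utf8.ValidString(s) {
		return "", errors.New("invalid UTF-8 encoding")
	}
	var b strings.Builder
	b.WriteByte('"')
	for _, r := range s {
		switch r {
		case '"':
			b.WriteString(`\"`)
		case '\\':
			b.WriteString(`\\`)
		case '\b':
			b.WriteString(`\b`)
		case '\t':
			b.WriteString(`\t`)
		case '\n':
			b.WriteString(`\n`)
		case '\f':
			b.WriteString(`\f`)
		case '\r':
			b.WriteString(`\r`)
		default:
			if r < 0x20 {
				// Remaining control characters: six-character
				// escape with uppercase hexadecimal digits.
				fmt.Fprintf(&b, `\u%04X`, r)
			} else {
				// Everything else is written as plain UTF-8.
				b.WriteRune(r)
			}
		}
	}
	b.WriteByte('"')
	return b.String(), nil
}

func main() {
	out, err := canonicalString("tab\there")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Characters outside the control range are never escaped in this sketch, which keeps the output at its minimal UTF-8 length.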

The GOBL JSON c14n package has been designed to operate using any raw JSON source and uses the Go encoding/json library’s streaming methods to parse and recreate a document in memory. A simplified object model is used to map JSON structures ready to be converted into canonical JSON.
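As a rough illustration of mapping JSON into a simple in-memory model and writing it back out, the sketch below sorts object keys and drops null-valued attributes. It is an assumption-laden simplification: it decodes into Go's generic values with `Decoder.Decode` rather than walking the token stream, the names `writeValue` and `reorder` are hypothetical, and strings and numbers are emitted by `encoding/json` as-is, so they are not in canonical form.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// writeValue emits v with sorted object keys and without null-valued
// object attributes. Nulls inside arrays are kept. Hypothetical helper.
func writeValue(buf *bytes.Buffer, v any) error {
	switch t := v.(type) {
	case map[string]any:
		keys := make([]string, 0, len(t))
		for k, val := range t {
			if val != nil { // drop attributes whose value is null
				keys = append(keys, k)
			}
		}
		// Byte order, which matches code point order for valid UTF-8.
		sort.Strings(keys)
		buf.WriteByte('{')
		for i, k := range keys {
			if i > 0 {
				buf.WriteByte(',')
			}
			if err := writeValue(buf, k); err != nil {
				return err
			}
			buf.WriteByte(':')
			if err := writeValue(buf, t[k]); err != nil {
				return err
			}
		}
		buf.WriteByte('}')
	case []any:
		buf.WriteByte('[')
		for i, item := range t {
			if i > 0 {
				buf.WriteByte(',')
			}
			if err := writeValue(buf, item); err != nil {
				return err
			}
		}
		buf.WriteByte(']')
	default:
		// Strings, numbers, booleans and null are delegated to
		// encoding/json here; this part is NOT canonical.
		raw, err := json.Marshal(t)
		if err != nil {
			return err
		}
		buf.Write(raw)
	}
	return nil
}

// reorder parses src and rewrites it without whitespace, with sorted
// keys and without null attributes. Hypothetical helper.
func reorder(src string) (string, error) {
	dec := json.NewDecoder(strings.NewReader(src))
	dec.UseNumber() // keep number literals exactly as written
	var v any
	if err := dec.Decode(&v); err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := writeValue(&buf, v); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := reorder(`{ "foo":"bar", "a": 56, "y":null, "z":[null,1] }`)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // {"a":56,"foo":"bar","z":[null,1]}
}
```

A complete implementation would replace the default branch with dedicated number and string formatting that follows the rules listed earlier.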

Usage Example

package main

import (
  "fmt"
  "strings"

  "github.com/invopop/gobl/c14n"
)

func main() {
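  // Raw JSON with unordered keys, extra whitespace, and a null attribute.
  d := `{ "foo":"bar", "c": 123.4, "a": 56, "b": 0.0, "y":null}`
  r := strings.NewReader(d)
  // CanonicalJSON reads the raw document from r and returns its canonical form.
  res, err := c14n.CanonicalJSON(r)
  if err != nil {
    panic(err)
  }
  fmt.Printf("Result: %v\n", string(res))
  // Output: {"a":56,"b":0.0E0,"c":1.234E2,"foo":"bar"}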
}

Prior Art

This specification and implementation are based on the gibson042 canonicaljson specification, with simplifications concerning invalid UTF-8 characters and null values in objects, and with a more explicit reference implementation that should be easier to recreate in other programming languages. The gibson042 specification is in turn based on the now-expired JSON Canonical Form Internet-Draft, which lacks clarity on the handling of integer numbers, is missing details on escape sequences, and doesn't consider invalid UTF-8 characters. The canonical representation of floats is consistent with XML Schema 2, section 3.2.4.2, and integer numbers are expected to have no exponential component, as defined in RFC 7638 (JSON Web Key Thumbprint).
