Thursday, January 18, 2018

The Perils of the One Obvious Way to Encode JSON

Background

JSON is one of the most common data encoding formats, allowing arbitrarily structured data to be represented in a normalized manner. JSON is used widely in logs and other information security related data sources. For example, IDS such as Suricata and LaikaBOSS can be configured to produce JSON logs. Threat intelligence sources, such as Censys and VirusTotal, provide APIs that return JSON-formatted data. For many tasks, JSON is my preferred data format. Indeed, I frequently use JSON in systems that I implement.

The Issue

Recently, I've run across an issue that has caused me some annoyance when working with JSON data. By default, certain implementations, notably the Python standard library JSON encoder, write JSON that uses only ASCII characters, employing Unicode escaping (\uXXXX) for any non-ASCII characters. As a result, non-ASCII characters cannot be found directly in the raw JSON representation, which can break retrieval of the data. I've seen popular search systems fail to behave according to user expectations due to the mismatch between ASCII-encoded and UTF-8-encoded JSON.
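This behavior is easy to demonstrate in a few lines of Python (the sample string is an arbitrary non-ASCII value chosen for illustration):
import json

# By default, json.dumps escapes every non-ASCII character as \uXXXX
print(json.dumps({"organization": u"中企"}))
# {"organization": "\u4e2d\u4f01"}

# With ensure_ascii=False, the characters pass through unescaped
print(json.dumps({"organization": u"中企"}, ensure_ascii=False))
# {"organization": "中企"}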

While UTF-8 encoding is the norm for JSON data on the internet, many data sources and APIs provide JSON data in which non-ASCII characters are escaped. Beyond the obvious disconnect between escaped and UTF-8 data, I've also seen this Unicode escaping trigger subtle implementation incompatibilities, such as those related to surrogate pairs, that would not occur if the data were simply UTF-8 encoded.
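The surrogate pair issue arises because any character outside the Basic Multilingual Plane must be escaped as two \uXXXX sequences, which a careless consumer can fail to reassemble; a minimal sketch (U+1F600 is an arbitrary astral-plane character chosen for illustration):
import json

# Outside the BMP, the escaped form is a UTF-16 surrogate pair
print(json.dumps(u"\U0001F600"))                      # "\ud83d\ude00"

# The UTF-8 form is a single four-byte sequence, with no pair to mishandle
print(json.dumps(u"\U0001F600", ensure_ascii=False))  # "😀"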

Example

To illustrate, let's take an example from censys.io. Certificate cbd2dd40350b8fe782d1f554b00ca5e394865f0700ac2250da265163e890cb9a has non-ASCII characters in the Subject Organization field. However, if you view the raw JSON data, you'll see that this data is escaped. This behavior is not unique to Censys and is actually very common.

This issue can be further illustrated by downloading the raw JSON file. The easiest way to do this without a Censys account is to simply copy/paste from the raw JSON page. If the raw JSON for the Censys report is downloaded, then the escaped organization can be viewed:
$ cat cert_ascii.json | grep -A 1 organization
      "organization": [
        "\u4e2d\u4f01\u52a8\u529b\u79d1\u6280\u80a1\u4efd\u6709\u9650\u516c\u53f8"
--
      "organization": [
        "GeoTrust Inc."
If a tool such as jq is used to display the JSON, then the non-ASCII characters are displayed normally:
$ cat cert_ascii.json | jq . | grep -A 1 organization
      "organization": [
        "中企动力科技股份有限公司"
--
      "organization": [
        "GeoTrust Inc."
If the default Python JSON encoder is used, then the data is escaped, just as it was when originally downloaded:
$ cat cert_ascii.json | python -m json.tool | grep -A 1 organization
            "organization": [
                "GeoTrust Inc."
--
            "organization": [
                "\u4e2d\u4f01\u52a8\u529b\u79d1\u6280\u80a1\u4efd\u6709\u9650\u516c\u53f8"
Imagine a system where searches occur on the raw JSON without un-escaping the non-ASCII characters, but the data is displayed using a tool that renders the Unicode characters. For simplicity, we can use grep as the search tool and jq as the display tool, but similar issues can arise in more refined systems. If non-ASCII characters are used, data that is displayed to the user cannot be searched directly.
$ cat cert_ascii.json | jq -r ".parsed.subject.organization[]"
中企动力科技股份有限公司
$ grep -o -F "中企动力科技股份有限公司" cert_ascii.json
To search non-ASCII data, our best bet is to re-encode the JSON data. There's more than one way to do it, but the most straightforward and reliable method in Python (2.x) is as follows:
#!/usr/bin/env python
"""
Simple script to re-encode JSON data using UTF-8 encoding instead of ASCII
"""

import sys
import json

def main():
    # Read the (possibly ASCII-escaped) JSON from stdin
    json_in = sys.stdin.read()
    obj = json.loads(json_in)
    # ensure_ascii=False leaves non-ASCII characters unescaped;
    # encode the resulting string explicitly as UTF-8 for output
    json_out = json.dumps(obj, ensure_ascii=False).encode('utf8')
    print(json_out)

if __name__ == "__main__":
    main()
Note the use of the "ensure_ascii" parameter and explicit encoding as UTF-8. If we use this script to re-encode the JSON data, then the non-ASCII string can be found:
$ cat cert_ascii.json | ./json_encode_utf8.py > cert_utf8.json
$ grep -o -F "中企动力科技股份有限公司" cert_utf8.json
中企动力科技股份有限公司
中企动力科技股份有限公司
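On Python 3, the same approach applies, though dumps() returns text rather than bytes, so the explicit encoding step moves to the output; a minimal sketch:
#!/usr/bin/env python3
"""
Re-encode JSON data using UTF-8 encoding instead of ASCII (Python 3)
"""

import sys
import json

def main():
    obj = json.load(sys.stdin)
    # ensure_ascii=False leaves non-ASCII characters unescaped;
    # write UTF-8 bytes directly to the underlying binary stream
    sys.stdout.buffer.write(json.dumps(obj, ensure_ascii=False).encode('utf-8'))
    sys.stdout.buffer.write(b'\n')

if __name__ == "__main__":
    main()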

Solutions

The biggest problem with escaping non-ASCII data is that retrieval issues can go unnoticed, especially if non-ASCII data is rare. Some systems just work, doing the normalization necessary to search data as expected without any intervention, but many don't. Others allow queries to be normalized at search time, but this can be very inefficient depending on the implementation. In most situations, the best option is to normalize the JSON data to unescaped UTF-8 before storage. Some systems allow this to be done as part of the native ingestion pipeline, but the JSON can always be re-encoded externally as needed. My preference, however, would be to avoid the cost of re-encoding altogether. Data producers should simply generate JSON data encoded in UTF-8 without escaping. I also question the wisdom of JSON encoders that default to ASCII escaping when UTF-8 is the recognized norm. Enabling systems to avoid handling UTF-8 encoded data probably isn't doing anyone a favor.
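As a sketch of what that looks like for a Python producer, passing ensure_ascii=False at serialization time is all it takes (the record and filename here are hypothetical):
import io
import json

# Hypothetical record with a non-ASCII field
record = {"organization": u"中企动力科技股份有限公司"}

# ensure_ascii=False emits the characters directly; writing through a
# UTF-8 text wrapper yields UTF-8 encoded JSON with no escaping
with io.open("log.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False))
    f.write(u"\n")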