Background

JSON is one of the most common data encoding formats, allowing arbitrarily structured data to be represented in a normalized manner. JSON is used widely in logs and other information security related data sources. For example, IDS such as Suricata and LaikaBOSS can be configured to provide JSON logs. Threat intelligence sources, such as Censys and Virustotal, provide APIs that return JSON formatted data. For many tasks, JSON is my preferred data format. Indeed, I frequently use JSON in systems that I implement.
The Issue

Recently, I've run into an issue that has caused me some annoyance when working with JSON data. By default, certain implementations, notably the Python standard library JSON encoder, write JSON that uses only ASCII characters, employing Unicode escaping (\uXXXX) for any non-ASCII characters. As a result, non-ASCII strings cannot be found directly in the raw JSON representation, which can hamper retrieval of data. I've seen popular search systems behave contrary to user expectations because of the mismatch between ASCII-escaped and UTF-8 encoded JSON.
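As a minimal sketch of this default, assuming Python 3 (the sample string is only an illustration), the standard library encoder escapes non-ASCII characters unless `ensure_ascii=False` is passed:

```python
import json

org = "中企动力"  # sample non-ASCII text, chosen for illustration

# Default behavior: ensure_ascii=True, so each non-ASCII character
# is written as a \uXXXX escape and the output is pure ASCII.
escaped = json.dumps(org)

# With ensure_ascii=False the characters are written as-is.
literal = json.dumps(org, ensure_ascii=False)

print(escaped)
print(literal)

# Both forms decode to the same string...
assert json.loads(escaped) == json.loads(literal) == org

# ...but only the unescaped form contains the text verbatim,
# which is what a raw-text search would look for.
print(org in escaped, org in literal)
```

Any tool that searches the serialized text rather than the parsed values will only match the second form.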
While UTF-8 encoding is the norm for JSON data on the internet, many data sources and APIs provide JSON data where non-ASCII characters are escaped. Beyond the obvious disconnect between escaped and UTF-8 data, I've also seen this Unicode escaping trigger subtle implementation incompatibilities, such as those related to surrogate pairs, that would not occur if the data were simply UTF-8 encoded.
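To make the surrogate pair point concrete, here is a small sketch, assuming Python 3 and using U+1F600 purely as an illustrative character outside the Basic Multilingual Plane. It shows one way escaping can go wrong, not the specific incompatibilities I ran into:

```python
import json

# U+1F600 does not fit in a single \uXXXX escape, so an ASCII-only
# encoder has to emit two escapes (a UTF-16 surrogate pair).
ch = "\U0001F600"

escaped = json.dumps(ch)
utf8 = json.dumps(ch, ensure_ascii=False).encode("utf-8")

print(escaped)    # two \u escapes between the quotes
print(len(utf8))  # two quote bytes plus one 4-byte UTF-8 sequence

# A correct decoder recombines the pair into one character.
assert json.loads(escaped) == ch

# If only half of the pair survives (for example after truncation),
# Python's decoder still accepts it, but the result is a lone
# surrogate that cannot be encoded as UTF-8.
lone = json.loads('"\\ud83d"')
try:
    lone.encode("utf-8")
    lone_is_encodable = True
except UnicodeEncodeError:
    lone_is_encodable = False
print(lone_is_encodable)
```

With UTF-8 encoded JSON there is no pair to split or recombine, so this class of problem does not arise.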
Example

To illustrate, let's take an example from censys.io. Certificate cbd2dd40350b8fe782d1f554b00ca5e394865f0700ac2250da265163e890cb9a has non-ASCII characters in the Subject Organization field. However, if you view the raw JSON data, you'll see that this data is escaped. This behavior is not unique to Censys and is actually very common.
This issue can be further illustrated by downloading the raw JSON file. The easiest way to do this without a Censys account is to simply copy/paste from the raw JSON page. If the raw JSON for the Censys report is downloaded, then the escaped organization can be viewed:
    $ cat cert_ascii.json | grep -A 1 organization
    "organization": [
    "\u4e2d\u4f01\u52a8\u529b\u79d1\u6280\u80a1\u4efd\u6709\u9650\u516c\u53f8"
    --
    "organization": [
    "GeoTrust Inc."

If a tool such as jq is used to display the JSON, then the non-ASCII characters are displayed as usual:
    $ cat cert_ascii.json | jq . | grep -A 1 organization
    "organization": [
    "中企动力科技股份有限公司"
    --
    "organization": [
    "GeoTrust Inc."

If the default Python JSON encoder is used, then the data is escaped as it was when originally downloaded:
    $ cat cert_ascii.json | python -m json.tool | grep -A 1 organization
    "organization": [
    "GeoTrust Inc."
    --
    "organization": [
    "\u4e2d\u4f01\u52a8\u529b\u79d1\u6280\u80a1\u4efd\u6709\u9650\u516c\u53f8"

Imagine a system where the searches occur on the raw JSON without un-escaping the non-ASCII characters, but the data is displayed using a tool that displays the Unicode characters. For simplicity, we can use grep for the search tool and jq as the display tool, but similar issues can arise with more refined systems. If non-ASCII characters are used, data that is displayed to the user cannot be searched directly.
    $ cat cert_ascii.json | jq -r ".parsed.subject.organization"
    中企动力科技股份有限公司
    $ grep -o -F "中企动力科技股份有限公司" cert_ascii.json

To search non-ASCII data, our best bet is to re-encode the JSON data. There's more than one way to do it, but the most straightforward and reliable method in Python (2.x) is as follows:
    #!/usr/bin/env python
    """
    Simple script to re-encode json data using utf8 encoding instead of ascii
    """
    import sys
    import json

    def main():
        json_in = sys.stdin.read()
        obj = json.loads(json_in)
        json_out = json.dumps(obj, ensure_ascii=False).encode('utf8')
        print(json_out)

    if __name__ == "__main__":
        main()

Note the use of the "ensure_ascii" parameter and explicit encoding as UTF-8. If we use this script to re-encode the JSON data, then the non-ASCII string can be found:
    $ cat cert_ascii.json | ./json_encode_utf8.py > cert_utf8.json
    $ grep -o -F "中企动力科技股份有限公司" cert_utf8.json
    中企动力科技股份有限公司
    中企动力科技股份有限公司
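The script above targets Python 2.x. For Python 3, where `str` is already Unicode and printing a `bytes` object would show its `b'...'` representation, a rough equivalent might look like the following sketch. The `reencode` helper name and the trimmed-down sample document are my own assumptions for illustration, not part of the Censys data format:

```python
import json


def reencode(json_text):
    """Parse JSON text and serialize it again without \\uXXXX escaping."""
    return json.dumps(json.loads(json_text), ensure_ascii=False)


# Escaped input, reusing the organization value shown earlier
# inside a hypothetical, heavily simplified document.
escaped = (
    '{"organization": ["\\u4e2d\\u4f01\\u52a8\\u529b\\u79d1\\u6280'
    '\\u80a1\\u4efd\\u6709\\u9650\\u516c\\u53f8"]}'
)

# Encode explicitly so the bytes written out are UTF-8
# regardless of the locale settings.
utf8_bytes = reencode(escaped).encode("utf-8")

# The literal search term is now present in the encoded data.
term = "中企动力科技股份有限公司".encode("utf-8")
print(term in utf8_bytes)
```

To use this as a filter like the original script, one option is to read the input from `sys.stdin.buffer` and write the encoded bytes to `sys.stdout.buffer`, which avoids depending on the terminal's encoding.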