Thursday, January 18, 2018

The Perils of the One Obvious Way to Encode JSON

Background

JSON is one of the most common data encoding formats, allowing arbitrarily structured data to be represented in a normalized manner. JSON is used widely in logs and other information security related data sources. For example, IDS such as Suricata and LaikaBOSS can be configured to provide JSON logs. Threat intelligence sources, such as Censys and Virustotal, provide APIs that return JSON formatted data. For many tasks, JSON is my preferred data format. Indeed, I frequently use JSON in systems that I implement.

The Issue

Recently, I've run across an issue that has caused me some annoyance when working with JSON data. By default, certain implementations, namely the python standard library JSON encoder, writes JSON that uses only ASCII characters, employing Unicode escaping (\uXXXX) for any non-ASCII characters. The result of this is that any non-ASCII characters cannot be found directly in the raw JSON representation. The impact of this is that retrieval of data can be impacted. I've seen popular search systems not behave according to user expectations due to the mismatch between ASCII encoded and UTF-8 encoded JSON.

While UTF-8 encoding is the norm for JSON data on the internet, many data sources and APIs provide JSON data where non-ASCII characters are escaped. Beyond the obvious disconnect between escaped and UTF-8 data, I've also seen this Unicode escaping trigger subtle implementation incompatibilities, such as those related to surrogate pairs, that would not occur if the data was simply UTF-8 encoded.

Example

To illustrate, let's take an example from censys.io. Certificate cbd2dd40350b8fe782d1f554b00ca5e394865f0700ac2250da265163e890cb9a has non-ASCII characters in the Subject Organization field. However, if you view the raw JSON data, you'll see that this data is escaped. This behavior is not unique to Censys and is actually very common.

This issue can be further illustrated by downloading the raw JSON file. The easiest way to do this without a Censys account is to simply copy/paste from the raw JSON page. If the raw JSON for the Censys report is downloaded, then the escaped organization can be viewed:
$ cat cert_ascii.json | grep -A 1 organization
      "organization": [
        "\u4e2d\u4f01\u52a8\u529b\u79d1\u6280\u80a1\u4efd\u6709\u9650\u516c\u53f8"
--
      "organization": [
        "GeoTrust Inc."
If a tool such as jq is used to display the json, then the non-ASCII characters are displayed as usual:
$ cat cert_ascii.json | jq . | grep -A 1 organization
      "organization": [
        "中企动力科技股份有限公司"
--
      "organization": [
        "GeoTrust Inc."
If the default python JSON encoder is used, then the data is escaped as it was when originally downloaded:
$ cat cert_ascii.json | python -m json.tool | grep -A 1 organization
            "organization": [
                "GeoTrust Inc."
--
            "organization": [
                "\u4e2d\u4f01\u52a8\u529b\u79d1\u6280\u80a1\u4efd\u6709\u9650\u516c\u53f8"
Imagine a system where the searches occur on the raw JSON without un-escaping the non-ASCII characters, but the data is displayed using a tool that displays the Unicode characters. For simplicity, we can use grep for the search tool and jq as the display tool, but similar issues can arise with more refined systems. If non-ASCII characters are used, data that is displayed to the user cannot be searched directly.
$ cat cert_ascii.json | jq -r ".parsed.subject.organization[]"
中企动力科技股份有限公司
$ grep -o -F "中企动力科技股份有限公司" cert_ascii.json
To search non-ASCII data, our best bet is to re-encode the JSON data. There's more than than one way to do it, but the most straightforward and reliable method in python (2.x) is as follows:
#!/usr/bin/env python
"""
Simple script to re-encode json data using utf8 encoding instead of ascii
"""

import sys
import json

def main():
    json_in = sys.stdin.read()
    obj = json.loads(json_in)
    json_out = json.dumps(obj,ensure_ascii=False).encode('utf8')
    print(json_out)

if __name__ == "__main__":
    main()
Note the use of the "ensure_ascii" parameter and explicit encoding as UTF-8. If we use this script to re-encode the JSON data, then the non-ASCII string can be found:
$ cat cert_ascii.json | ./json_encode_utf8.py > cert_utf8.json
$ grep -o -F "中企动力科技股份有限公司" cert_utf8.json
中企动力科技股份有限公司
中企动力科技股份有限公司

Solutions

The biggest problem with escaping non-ASCII data is that retrieval issues can go unnoticed, especially if non-ASCII data is rare. Some systems just work, doing the normalization necessary to search data as expected, without any intervention, but many don't. Some systems allow the queries to be normalized at search time, but this can be very inefficient depending upon the implementation. In most situations, the best option is to normalize the JSON data to unescaped UTF-8 before storage. Some systems allow this to be done as part of the native ingestion pipeline, but the JSON can always be re-encoded externally as needed. My preference, however, would be to avoid the cost of re-encoding altogether. Data producers should simply generate JSON data encoded in UTF-8 without escaping. I also question the wisdom of JSON encoders that use ASCII encoding as the default when UTF-8 is the recognized norm. Enabling systems to avoid handling UTF-8 encoded data probably isn't doing anyone a favor.

Thursday, August 6, 2015

What Laika BOSS means to NIDS

I’ve long advocated for analysis of files transferred through the network. I’d like to explain how I feel Laika BOSS relates to NIDS. Just as we need NIDS to decode network protocols, apply signatures, and collect metadata on network traffic, so too we need a file-centric analysis system. Laika BOSS is this system, or at least an instance of this class of system. It performs recursive file format decoding, signature matching appropriate for files (yara), and metadata extraction. Laika BOSS is the widget to which NIDS should offload analysis of payload objects. Laika BOSS is the polished implementation of the file analysis portion of the vision that Ruminate IDS advanced.

As explained in my plea for NIDS to adopt an interface for external analysis of payload objects, the motivation for this separation comes down to specialization/Unix philosophy and general need for file analysis capabilities. There is a market for really good NIDS. I see the need for really good file-centric IDS (FIDS). They should be plumbed together, but the file analysis systems need to support more than just NIDS sources. If Laika BOSS’s relatively simple zeroMQ message format becomes the de-facto interface NIDS use to provide files for external analysis, I’ll be flattered.

Some NIDS will be tempted to try to build Laika BOSS style analysis into their core engine. While some steps towards greater payload visibility will always be welcome, going too far down this path will be a distraction. This will ultimately be similar to NIDS that have tried to bring in external log analysis/SIM functionality. A NIDS specific file analysis framework will always lag in content compared to a general purpose file IDS. More importantly though, there is still much to be done in the realm of network protocol analysis. The security community is much better served with NIDS engines that focus on support for protocols such as RDP, SMB, and HTTP2 over file formats such as JAR, SWF, and PDF. If a NIDS project wants to implement their own FIDS, I'd recommend keeping them separate, similar to Snort/Razorback.

In addition to opportunity cost, today’s NIDS just can't support the depth of analysis on files that some users want. This isn’t to say that current NIDS aren’t making an effort to provide these features. Both Bro and Suricata are both making strides towards this goal. I think the lua based rules that Emerging Threats has developed for Suricata, including functionality such as decompressing a JAR and analyzing constituent files, has given the community a taste of what must be.

However, today’s NIDS, at least as currently architected, can’t support the depth of analysis that is needed for today’s complex file formats. It’s not just the complexity of desired workloads that is a problem, it’s also inability to scale to ever increasingly parallel and economic platforms. It’s easy to pay for servers totaling dozens of CPU cores, but current NIDS architectures can’t scale well beyond about 16 threads, let alone operate on cloud resources. Suricata is the only mainstream NIDS that I know of that supports dynamic load balancing, but in practice this doesn’t scale well, with most high speed implementations using static load balancing for relatively independent worker threads. (And no, mechanisms such as the AF_PACKET rollover option do not help if they balance packets from the same stream to different workers, breaking stream reassembly).

The Laika BOSS white paper reports statistics from a real network using Laika BOSS configured with many modules including CPU intensive detection mechanisms for some file formats. Most object scan times are under 50ms but the average time spent processing a portable executable is 83s. My previous studies on this topic have convinced me that this level of variance in throughput will never be feasible in the current NIDS paradigm because as one scales up the number of NIDS threads, the limit on scalability remains the worst case load distribution. However, many want to be able to do this depth of analysis and are more than happy to pay for commodity CPUs to do it. Most users are willing to put depth of analysis in front of IPS capabilities and the associated real time constraints. If NIDS provide files to Laika BOSS asynchronously, the additional layer of dynamic load balancing makes this level of analysis feasible.

We’re very close to being able to provide payload files to external systems in current NIDS. Both Suricata and Bro can write files out to a filesystem. Furthermore, lua scripts in Suricata have access to reassembled file payloads. Bro has a new file analysis framework. There are a few limitations with all of these approaches. In many cases, providing network layer context for a file is not straightforward. In other cases, access to the file is provided in chunks instead of in a single buffer. However, I hope that in the near future, we’ll see efficient integration of Laika BOSS with NIDS, allowing the offloading of analysis of all objects passing through the network. In the meantime, there are organizations using Vortex/Ruminate derivatives to provide network payloads in a scalable, if not ugly way.

I heartily endorse Laika BOSS. I’m proud that Lockheed Martin has made this tool open source. I feel it has a good shot at becoming the file analysis engine that the community needs. I believe that the ease of creating modules and the general interface suitable for many use cases will make it popular among users. I hope that NIDS engines are soon able to efficiently integrate with Laika BOSS to offload file-centric analysis.

Saturday, June 21, 2014

Tactics for Storing Security Event Data

I’d like to share some tactics that I have found to be useful for storing and retrieving security event data. These approaches can certainly be backed by theory, but I've also come to learn their value through real world experience, including some hard knocks. One impetus for finally putting these thoughts down is to explain the motivation behind MongoR, a wrapper around MongoDB which was written by a close colleague and recently open sourced by my employer. However, MongoR only addresses some of these considerations, and I believe these issues should be addressed by all data stores that want to better support users who warehouse security event data. Many data stores already utilize these tactics or make it possible for the user to do it on their own, but few leverage all the optimizations possible for storing logs.

Storing and retrieving security event data, such as network infrastructure or sensor logs, is non-trivial due to the high insert rate required. In addition to large volume, most off-the-shelf data stores are poorly tuned to rolling window style of data retention that is usually desired. For example, a typical use case would be to store 90 days of logs and to be able to search them based on key indicator types such as IP addresses.

Security event data typically has the following properties:
  • Organized primarily by time
  • Limited, Fixed Retention
  • Immutable
  • Not Highly Relational
And queries usually have the following properties:
  • Very high insert to query ratio
  • Core indicator types (ex. IP) comprise most searches
  • Searches are typically for rare data (ex. attack activity)

These properties are nearly diametrically opposed to that of a typical database driven app such as a service ticket, e-commerce, or ERP system. As such, it’s not surprising that it requires some work to adapt databases to storing event streams and that many systems that are successful look more like search engines than databases.

Why Mongo?

NoSQL data stores like MongoDB work very well for security event data, fitting the not highly relational property well, and providing a lot of flexibility when a fixed schema just isn’t possible. Ex. Storing and accessing arbitrary HTTP headers in something like Mongo is great. When there are search engines like Elastic Search that store documents with flexible schemas, why would you even use something like Mongo? There are a few reasons, but one major driver is that Mongo and similar databases readily support alternative access methods, such as map/reduce. In speaking about tactics for log storage, I’m generally not going to differentiate between NoSQL databases like MongoDB and search engines like ElasticSearch as the delineation between them is blurry anyway. I merely wanted to note that there is room for both approaches in the security event storage realm.

Events are Stored to be Deleted

Providing the capacity to support an adequate insert rate can often be difficult, but paying nearly as much for deleting an old event as inserting a new event adds insult to injury. The time honored approach is to use some sort of time based data partitioning where each shard contains a time slice of events, whether the trigger for a new shard be date based or size based. Glossing over issues such as fragmentation, the expensive part here is not pruning the data itself, but removing entries from indexes. Trimming documents out of the indexes is expensive and significantly increases the database's working set. Deleting event data should be cheap. Many data stores don’t support date based sharding with efficient deletes so this must be done by the user. This is one of the core benefits of MongoR: pruning old data is a simple collection (table) drop.

Working Set Blues

Generalizing beyond efficient deletes, the major driver for data partitioning is keeping the working set of a database manageable. The only way to maintain high insert rates is to keep most of the area of active writing cached in RAM. This is especially imperative for indexes as these usually involve very random writes (the document data itself can usually be written sequentially). Sharding helps keep the working set manageable by bounding the size of the data actively written. The downside to this approach is that you have to check every shard during a query, multiplying IOPS by shard count. For security event data, this is very often a wise trade-off. This is the other major benefit MongoR adds on top of MongoDB: managing many collections that are small enough to fit in RAM.

Tiered Storage Please

While MongoR addresses some of the most fundamental issues required to make MongoDB serviceable for high rate security event data, there is more to be done and it’s not specific to MongoDB. One my pet peeves are databases that don’t readily support separating indexes from raw data. There are many people who want to index a relatively small part of raw events, say just the IP address from web server logs. When indexes are mingled with document data, the IOPS required for writing the indexes can be much higher because the indexes may be spread over a larger number of blocks than would occur if indexes are grouped together. To the degree that document data and indexes have different read/write patterns and especially when indexes are smaller than base data, using tiered storage is beneficial.

Indexing the Kitchen Sink

Many systems used to store security event data support indexing a portion of the data in event records, while some support an all or nothing approach. Some are flexible in what fields can be indexed and some are fixed. Flexibility is always appreciated by advanced users who can say in advance what they will typically search. The all or nothing approach can make the use of the system very expensive and can lead to excluding data useful for analysis but not likely to be searched directly.

Stop the Sync Shell Game

One of the fundamental trade-offs a data store has to make is between efficiency and data safety. You can cache heavily and get high performance or constantly sync data to disks and get data safety, but not both. This dichotomy, however, neglects a few optimizations. First of all, security event data is by design immutable. If done right, the raw event data would be dumped with simple sequential writes and loaded with unsynchronized reads.

Another optimization lies in the presumption of some sort of sharding where only the current data collection needs to be loaded in RAM (and for proper performance needs to be wholly in RAM). If the indexes are the expensive part of a data store, requiring all sorts of IOPS to keep synced, why are they written to disk before the shard is finished? In the event of a failure, the indexes can always be re-created from the source data (sequential access). This also opens the door to using efficient data structures for the in memory indexes that may well differ from what is best for the on disk indexes or indexes that are created incrementally. Once you’ve relegated yourself to keeping your current shard in memory, embrace it and leverage the resulting possible optimizations.

There is no Trie

Moving beyond binary trees for indexes is liberating. Various alternative indexing mechanisms, most of them involving hashing of some flavor, can provide constant time lookups and require fewer IOPS than binary trees. For example, I’ve been using discodbs based on cmph, and have had fantastic results. The minimal perfect hash functions used result in indexes that are less complex than a binary tree, involve much fewer IOPS for queries, and are much smaller than an optimal bloom filter. The trade-off is that they are immutable and cannot be efficiently updated or created incrementally, but that is not a concern for immutable log data.

Proponents of binary trees will claim that prefix matches are a killer feature. I’m not sure they are, and they are easily provided through other techniques anyway. Domain names are a prime example of a data type that does not work well with the prefix searches supported by typical binary tree indexes. Sure, you can reverse the domains and then prefix matching works, but you could also extract the prefixes you want searchable and insert them each as keys in your constant time lookup indexes. The point is that you often have to invest extra effort to make binary trees work for many common security event indicator types.

The security domain analogues of all the pre-processing that goes into full text indexing (tokenization, stemming, stopwords, natural language analysis, etc.) are all very immature. As this field matures, I feel we’ll be better able to say what is required to support any searching beyond exact term matches. Regardless, once you move past storing IP addresses as integers, it’s not clear whether binary tree prefix matching buys much in queries on security event data. In a day when general purpose and reliability focused filesystems are using alternatives to binary trees (ex. htree of ext4), security event data stores should be too.

Lower the Resolution Please

One place I think the security community should look for cost savings is in lowering the resolution of indexes from the per event/row level to groups/block of events. I’ve seen great savings in this in practice and in some cases it can even improve performance on an absolute scale, let alone economics. When the desired result is a very small number of records, low resolution indexes coupled with processing of a very small number of blocks to do the final filtering can beat out bigger, more expensive, and more precise indexes. One use case that benefits from this approach is pivoting on indicators of threat actors, which are increasingly often exchanged in threat intelligence sharing communities. In these searches, it is most often desirable to search the widest time window possible (full data set), to search on a common indicator type, and for there to be few or no results. Often the low resolution answer is adequate--there is no need to drill down to individual events. I’ve seen low resolution indexes that are orders of magnitude smaller than the raw source data provide utility through low latency query times that row level indexes can never touch because of fundamental differences in size. Significantly ameliorating the cost of very common terms, ex. your web server’s IP, and thereby naturally diminishing the need for stopword mechanisms is a nice side effect of dropping index resolution.

Trading CPU for IOPS

I’m reticent to mention compression as a method of improving performance for fear of being flamed as a heretic, but I think it has to be mentioned. I know disk is cheap, but IOPS aren’t. Getting adequate IOPS is often among the biggest considerations (and cost) of the hardware for security event data stores (RAM used for disk caching is usually up there too). Huge sequentially accessible gzipped files are expensive to use, especially to retrieve a small number of records. But compression doesn’t have to be that way. You can use smaller blocks to get good compression ratios and reasonable random access. Considering only simple tweaks to gzip, pigz and dictzip address the single threaded and random access limitations of standard gzip.

As an oversimplified example, imagine you have a server with a disk drive that provides 150 MB/s sequential read rate, CPU cores that do decompression at about 20 MB/s (compressed size) per core/thread, and data that compresses with a compression rate of 5:1. If you want to do analytics, say map/reduce or zgrep/awk, on the whole data store, you are better off using compression if you can dedicate 2 or more cores to decompression. If you can dedicate 8 cores, you are going to be able to stream at 750 MB/s instead of 150 MB/s.

The CPU/IOPS tradeoff is not just about improving sequential compressed log access, that is just a simple example that everyone who has ever zgreped text logs understands. A better example is the very compact indexes created using perfect hash functions such as done in discodb which require relatively high CPU but low IOPS to query.

Conclusions

Security event data has special characteristics that enable various optimizations. For the data stores that do not yet make these techniques easy, there is an opportunity to better cater to the security log use case. MongoR demonstrates temporal data partitioning that keeps the working set in memory resulting in scalable, fast inserts and efficient deletes.


Saturday, October 19, 2013

BlueLight: A Firefox Extension for Notification of SSL Monitoring

Recently I built a Firefox Extension that was useful for my particular, if not peculiar needs. I wanted to share with any who might find it useful. BlueLight is designed to provide notification of SSL inspection--the type that is common for organizations to perform on the traffic of their consenting users.

There are many tools, and no shortage of Firefox extensions, that relate to SSL security and detecting MitM attacks. CipherFox and similar extensions are very useful, but didn’t fit my specific need because they aren’t quite noisy enough--I wanted more active and intrusive notification when CA’s of interest were used. Certificate Patrol and similar systems are useful for detecting the introduction of SSL inspection, but these systems don’t fit the scenario of overt, consensual, and long term SSL inspection. BlueLight is based heavily on Cert Alert. In fact, if the alerting criteria in Cert Alert wasn’t hardcoded, I’d probably be using it for this purpose.

BlueLight is useful when SSL Inspection is occurring, usually through a MitM scenario on web proxies using organizational PKI. Obviously, being notified of this on a per site basis is only useful when the organization is selective about what traffic is inspected—if everything is MitM’d then this notification provides no value.

Some claim that users are more secure when their traffic is subject to the organizations protections and monitoring. In this case, BlueLight provides re-assuring feedback to the user, letting them know that they are covered. Others may want to use BlueLight to know when they are under the purview of surveillance. It may deter them from taking some action while being subject to monitoring. In the case that monitoring should not occur on specific traffic, it provides useful notification to the user, so that the erroneous inspection can be rectified. In this vein, I’ve seen BlueLight be particularly useful as it alerts for all SSL elements of the page, not just the main URL (it alerts on the first occurrence, and only the first occurrence).

BlueLight isn’t designed to be useful for other scenarios such as detecting unauthorized SSL MitM attacks or any other covert SSL malfeasance. However, since BlueLight can be configured to alert on basically any certificate issuer, it may well be useful for other similar uses.

BlueLight has to be configured by the user to be useful. As it is, it’s probably only useful to reasonably technically savvy folk. In sharing BlueLight with the larger community, I hope it might be useful to others. BlueLight can be downloaded from addons.mozzilla.org or from csmutz.com/bluelight.

Monday, August 5, 2013

Intelligence Compliance: Checklist Mentality For Cyber Intelligence

The past few years have seen a sharp increase in the amount of targeted attack intelligence shared and the number of venues used for sharing. There is a geeky fervor emerging in the community rooted in the premise that if we accelerate sharing and consumption of intelligence, even to the point of automation, we could significantly improve network defense. Momentum is snowballing for adoption of specific standards for intel sharing, the foremost of which is the mitre suite of STIX/TAXII/MAEC. There is a seemingly self-evident necessity to share more intel, share it wider, and share it faster.

As someone who seeks to apply technology to incident response, I see great promise in standards, technology, and investments to accelerate the distribution and application of threat intelligence. I’ve spent years developing capabilities to perform security intelligence and have seen first-hand the benefits of information sharing. The amount of security data exchanged will likely continue to grow. However, I have concerns about the shift from security intelligence to intelligence compliance and about the fundamental benefit of comprehensive distribution of threat intelligence.

Compliance Supplanting Analysis

Information sharing has been critical to the success of intelligence based network defense. Direct competitors from various industries have come together to battle the common problem of espionage threats. Threat intelligence sharing has been wildly successful as sharing has included relevant context, has been timely, and as the attackers have been common to those sharing. Years ago, much of this sharing was informal, based primarily in direct analyst to analyst collaboration. Over time, formal intel sharing arrangements have evolved and are proliferating today, increasing in count of sharing venues, the number of participants, and the volume of intel.

The primary concern I have with this increase in intel sharing is that it is often accompanied by a compliance mindset. If there’s anything we should have learned from targeted attacks, it is that compliance based approaches will not stop highly motivated attacks. It’s inevitable that conformance will fail, given enough attacker effort. For example, targeted attackers frequently have access to zero-day exploits that aren’t even on the red/yellow/green vulnerability metric charts, let alone affected by our progress in speeding patching from months to days. The reactive approach to incident response is focused primarily on preventing known attacks. As a community, we have developed intelligence driven network defense to remedy this situation. It allows us to get ahead of individual attacks by keeping tabs on persistent attackers, styled campaigns in the vernacular, in addition to a proper vulnerability focused approaches. The beauty of intelligence driven incident response is that it gives some degree of assurance that if you have good intelligence on an attack group, you will detect/mitigate subsequent attacks if they maintain some of the patterns they have exhibited in the past. This may seem like a limited guarantee, and indeed it is narrow, but it’s the most effective way to defeat APT. Intelligence compliance, on the other hand, promises completeness in dispatching with all documented and shared attacks, but it makes no promise for previously unknown attacks.

To explain in detail, the point of kill chain and associated analysis isn’t merely to extract a proscribed set of data to be used as mitigation fodder, but to identify multiple durable indicators to form a campaign. This has been succinctly explained by my colleague, Mike Cloppert, in his blog post on defining campaigns. The persistence of these indicators serves not only as the method of aggregating individual attacks into campaigns, but the presence of these consistencies is the substance of the assurance that you can reliably defeat a given campaign. By definition, individual attacks from the same campaign have persistent attributes. If you have a few of these, 3 seems to be the magic number, across multiple phases of the kill chain, you have good assurance that you will mitigate subsequent attacks, even if one of these previously static attributes change. If you can’t identify or your IDS can’t detect enough of these attributes, security intelligence dictates that you either dig deeper to identify these and/or upgrade your detection mechanisms to support these durable indicators. Ergo, defense driven by intelligence dictates you do analysis to identify persistent attack attributes and build your counter-measures around these.

Intelligence compliance, on the other hand, provides no similar rational basis for preparation against previously unseen attacks. Surely, a compliance focused approach has some appeal. It is often viewed as seeking to ensure consistency in an activity that is highly valuable for security intelligence. In other cases, less critical drivers overshadow primary mission success. Intel compliance provides a highly structured process that can be easily metered—very important in bureaucratic environments. The one guarantee that intelligence compliance does give is that you have the same defenses, at least those based on common intelligence, as everyone else in your sharing venue. This is important in when covering your bases is the primary driver. This provides no guarantee about new attacks or the actual security of your data, but does allow you to ensure that known attacks are mitigated, which is arguably most important for many forms of liability. Lastly, giving, taking, or middle manning intel can all be used as chips in political games ranging from true synergies to contrived intrigues. Intelligence compliance provides a repeatable and measurable process which caters to managerial ease and is also able to be aligned with legal and political initiatives.

There are limitless ways in which intelligence compliance can go wrong. Most failings can be categorized as either supplanting more effective activities or as a shortcoming in the mode of intelligence sharing which reduces its value. It is also fair to question if perfect and universal intelligence compliance would even be effective. Remember, intelligence compliance usually isn’t adopted on technical merits. The best, if not naïve, argument for adoption of this compliance mindset is that intel sharing has been useful for security intelligence, hence, by extrapolation, increasing threat data sharing must be better. Sadly, the rationale for a compliance approach to intel sharing frequently digresses to placing short-sighted blame avoidance in front of long term results.

The primary way in which intel compliance goes awry is when it displaces the capacity for security intelligence. If analysts spend more time on superfluous processing of external information than doing actual intelligence analysis, you’ve got serious means/ends inversion problems. Unfortunately, it’s often easier to process another batch of external intel verses digging deeper on a campaign to discover resilient indicators. This can be compared to egregious failings in software vulnerability remediation where more resources are spent on digital paper pushing, such as layers of statistics on compliance rates or elaborate exception justification and approval documentation, than is expended actually patching systems. An important observation is that the most easily shared and mitigated indicators, say IP addresses (firewalls) or email addresses (email filters), are also easily modified by attackers. For that reason, some of the most commonly exchanged indicators are also the most ephemeral, although this does depend on the specific attack group. If an indicator isn’t reused by an attacker, then sharing is useful for detecting previous attacks (hopefully before severe impact) but doesn’t prevent new attacks. A focus on short-lived intel can result in a whack-a-mole cycle that saps resources cleaning up compromises. This vicious cycle is taken to a meta level when human intensive processes are used despite increased intel sharing volume, putting organizations too far behind to make technological and process improvements that would facilitate higher intel processing efficiency. This plays into the attacker’s hand. This is exactly the scenario that security intelligence, including kill chain analysis, disrupts--allowing defenders to turn an attacker’s persistence into a defensive advantage.

Another intel sharing tragedy occurs when attack data is exchanged as though it is actionable intelligence and yet it’s no more than raw attack data. There are many who would advocate extracting a consistent set of metadata from attacks and regurgitating that as intelligence for sharing to their peers. I’m all for sharing raw attack data, but it must be analyzed to produce useful intelligence. If the original source doesn’t do any vetting of the attack data and shares a condensed subset, then the receiver will be forced to vet the data to see if it’s useful for countermeasures, but often with less context. The canonical example which illustrates the difference between raw attack data and intelligence is the IP address of an email server that sent a targeted malicious email, where that server is part of a common webmail service. Clearly this is valid attack data, but it’s about as specific to the targeted attacker as the license plate number of a vehicle the terrorist once rode in is to that terrorist, given that vehicle was a public bus. Some part of this vetting can and should be automated, but effective human adjudication is usually still necessary. Furthermore, many of the centralized clearinghouses don’t have the requisite data to adequately vet these indicators, so they are happily brokered to all consumers. To make matters more difficult, different organizations are willing to accept different types of collateral mitigation, the easiest solution being to devolve into the common denominator for the community which is a subset of the actionable intelligence for each organization. For example, given an otherwise legitimate website that is temporarily compromised to host malicious content, some members of a community may be able to block the whole website while others may not be able to accept the business impact. The easiest solution is for the community is to reject the overly broad mitigation causing collateral impact, while the optimal use of the intelligence requires risk assessment by each organization.

While ambiguity between raw attack data and vetted intelligence is the most obscene operationally, because it can result blocking benign user activity, there are other issues related to incomplete context on so called intelligence. An important aspect of security intelligence is proper prioritization. For example, many defenders invest significantly more analyst time in espionage threats while rapidly dispatching with commodity crimeware. If this context is not provided, improper triage might results in wasted resources. Ideally, this would include not only a categorization of the class of threat, but the actual threat identity, i.e. campaign name. Similarly, often intelligence is devoid of useful context such as whether the IP address reported is used as the source of attacks against vulnerable web servers, for email delivery, or for command and control. This can lead to imprecise or inappropriate countermeasures. Poorly vetted or ambiguous intel is analogous to (and sometimes turns into) the noisy signatures in your IDS—they waste time and desensitize.

With all that being said, I’m an advocate of intel sharing. I’m an advocate of sharing raw attack data, which is useful for those with the capacity to extract the useful intelligence. Realizing that this isn’t scalable, I’m also an advocate of sharing well vetted intelligence, with the requisite context to be actionable. However, even if your shop doesn’t have the ability to process raw attack data at a high volume, sharing that data with those who can ostensibly results in sharing of intel back to you that you couldn’t synthesize yourself. My main concern with intelligence compliance is that it robs time and resources from security intelligence while providing no guarantee of efficacy.

Intelligence Race to Zero

Beyond supplanting security intelligence, my other concern with the increase in information sharing is that as we become more proficient at ubiquitous sharing, the value of the intelligence will be diminished. This will occur whether the intelligence is revealed directly to the attackers or if lack of success causes them to evolve. Either way, I question if intelligence applied universally for counter-measures can ever be truly effective. Almost all current intelligence sharing venues at least give lip service to confidentiality of the shared intelligence and I believe many communities do a pretty good job. However, as intel is shared more widely, it is increasingly likely that the intel will be directly leaked to attackers. This principle drives the strict limitations on distribution normally associated with high value intelligence. It is also generally accepted that it’s impractical to keep data that is widely distributed for enforcement secure from attackers. The vast majority of widely deployed mitigations, such as AV signatures and spam RBLs are accepted to be available to attackers. In this scenario, you engage in an intelligence race, trying to speed the use of commodity intel. This is the antithesis of security intelligence which seeks to mitigate whole campaigns with advanced persistent intelligence.

Note that even if your raw intel isn’t exposed to attackers, the effects are known to the attacker—their attacks are no longer successful. Professional spooks have always faced the dilemma of leveraging intelligence versus protecting sources and methods. If some intelligence is used, especially intelligence from exclusive sources, then the source is compromised. As an example, the capability to decrypt axis messages during WWII was jealously protected. The tale that Churchill sacrificed a whole city to German bombers is hyperbole, but it certainly is representative of the type of trade-offs that must be made when protecting sources and methods. Note that this necessity to protect intel affects it’s use through the whole process, not just merely the decision for use at the end. For example, if two pieces of information are collected that when combined would solidify actionable intelligence, but these are compartmentalized, then the dots will never be connected and the actionable intelligence will never be produced. We see this play out in so called failures of the counter-terrorism intelligence community, where conspiracy theorists ascribe the failings to malice but the real cause is, more often than not, hindrances to analysis.

It’s worth considering how sources and methods apply specifically to network defense. Generally, I believe there is a small subset of intelligence that can be obtained solely through highly sensitive sources that is also useful for network defense. In most cases, if you can use an indicator for counter-measures, you can also derive it yourself, because it must be visible to the end defender. Also, while some sources may be highly sensitive, the same or similar information about attack activity (not attribution), is often available through open sources or through attack activity itself. Obviously, this notion isn’t absolutely true, but I believe it to be the norm. As a counter-example, imagine that a super sensitive source reveals that an attacker has added a drive by exploit to an otherwise legitimate website frequented by the intended victim audience. In this example the intel is still hard to leverage and relatively ephemeral: one still has to operationalize this knowledge in a very short time frame and this knowledge is by definition related specifically to this single attack.

Resting on the qualitative argument of indicator existentialism, the vast majority of counter-measures can be derived from attacker activity visible to the end network defender. This is necessarily true of the most durable indicators. Therefore, I don’t consider protecting sources (for network defense) the biggest consideration and advocate wide sharing of raw attack data. However, that certainly doesn’t mean that the analysis techniques and countermeasure capabilities are not sensitive. Indeed, most of my work in incident response has been about facilitating deeper analysis of network data, allowing durable and actionable intelligence to be created and leveraged. Competitive advantage in this realm has typically been found by either looking deeper or being more clever. In a spear phishing attack, for example, this may be in analysis of obscure email header data or malicious document metadata or weaponization techniques. Often the actionable intelligence is an atomic indicator, say a document author value, which could presumably be changed by the attacker if known. Some may require more sophistication on the part of the defender: requiring novel detection algorithms, correlations, or computational techniques such as that which my pdfrate performs. Either way, the doctrine of security intelligence is based in the premise that persistent indicators can be found for persistent attackers, even if it requires significant analysis to elucidate them. This analysis to identify reliable counter-measures is what security intelligence dictates and is often the opportunity cost of intelligence compliance. I’ve seen some strong indicators continue to be static for years, allowing new attacks to be mitigated despite other major changes in the attacks.

It is my belief, backed by personal experience and anecdotal evidence, that if what would otherwise be a strong mitigation, if kept secret, is used widely, then the lifespan of that indicator will be decreased. In the end I’m not sure it matters too much if the intelligence is directly revealed or if the attackers are forced to evolve due to lack of success, but that probably affects how quickly attackers change. In my experience, it is true that the greater the sophistication on the part of the defender and the greater the technical requirements for security systems, then the less likely useful indicators are to be subverted. However, it’s possible that continued efficacy has more to do with narrow application due to the small number of organizations able to implement the mitigation rather than difficulty of attackers changing their tactics. Often I wonder if, like outrunning the proverbial bear, today’s success in beating persistent adversaries may be more about being better than other potential victims than actually directly beating the adversary. While intelligence driven security, and by extension information sharing, is much more effective than classic incident response, I think it is yet to be proven if ubiquitous intel sharing can actually get ahead of targeted attacks or if attackers will win the ensuing intelligence/evolution race.

One benefit of the still far from fully standardized information sharing and defense systems of today is diversity. Each organization has their own special techniques for incident prevention--their own secret sauce for persistent threats. It’s inevitable that intelligence gaps will occur and some attacks, at least new ones, will not be stopped as early as desired. The diversity of exquisite detections among organizations combined with attack information sharing, even that of one-off indicators, allows for a better chance of response to novel attacks, even if this response is sub-optimal. A trend to greater standardization of intelligence sharing, driven by compliance, will tend to remove this diversity over time, as analysts, systems, and processes will be geared to greater intel volume and lower latency at the expense of intelligence resiliency.

Long Road Ahead

While I’m primarily concerned about being let down when we get there, it’s also important to note that as a community, we have a long pilgrimage before we make it to the ubiquitous intelligence sharing promised land. Mitre’s STIX et al are widely being accepted across the community as the path forward, which is great progress. Now that the high level information exchange format and transport is agreed upon, we still have a lot of minutia to work out. For example, much of that actual schema for sharing is still wide open. For example, many indicator types still have to be defined; standards for normalization and presentation still need to be agreed upon, and the fundamental meaning of the indicators still need to be agreed upon across the community.

I think it’s instructive to compare the STIX suite to the CVE/CVSS/CWE/CWSS/OVAL suite, even though the comparison is not perfect. These initiatives were designed to drive standardization, automation, and improve latency of closing vulnerabilities. There is plethora of information tracked through these initiatives: from (machine readable) lists of vulnerable products, to the taxonomy of the vulnerability type, to relatively complicated ratings of the vulnerability. Despite this wealth of information, I don’t think we’ve achieved the vulnerability assessment, reporting, and remediation nirvana these mechanisms were built to support. Of all the information exchanged through these initiatives, probably the most important, and admittedly the most mundane, is the standardized CVE identifier, which the community uses to reference a given vulnerability. This is one area where current sharing communities can improve—standardized identifiers for malware families, attack groups, and attacker techniques. While many groups have these defined informally, more structured and consistent definitions would be helpful to the community, especially as indicators are tied to these names to provide useful context to the indicators (and provide objective definitions of the named associations). Community agreement on these identifiers is more difficult than the same for vulnerabilities, and building the lexicon for translations between sharing communities is also necessary, as defining these labels is less straightforward and occurs on a per community basis. As we better define these intelligence groupings and use them more consistently in intel sharing, we’ll have more context for proper prioritization, help ensure both better vetted intel and more clear campaign definitions, and have better assurances that our intelligence is providing the value we aim to achieve out of sharing.

In helping assess the effectiveness of information sharing, I think the following questions are useful:
  • How relevant to your organization is the shared intelligence?
  • Is the intelligence shared with enough context for appropriate prioritization and use?
  • How well is actionable intelligence vetted? Is raw attack data shared (for those who want it)?
  • How durable is the shared intelligence? How long does it remain useful?
  • How timely is the shared intel? Is it only useful for detecting historical activity or does it also allow you to mitigate future activity?
  • Do you invest in defensive capabilities based on shared intelligence and intelligence gaps?
  • Do you have metrics which indicate your most effective intelligence sources, including internal analysis?
  • Do you have technology that speeds the mundane portion of intelligence processing, reserving human resources for analysis?

Closing Argument

I’m sure that there are some who will argue that it’s possible to have both security intelligence and intelligence compliance. I must concede that it is possible in theory. However, as there is plenty of room for progress in both arenas, and resources are scarce, I don’t believe there is an incident response shop that claims to do either to the fullest degree possible, let alone both. Also, the two mindsets, analysis and compliance, are very much different and come with a vastly different culture. Most organizations are left with a choice—to focus on analysis and security intelligence or to choose box checking and information sharing compliance.

Similarly, I question the seemingly self-evident supposition that sharing security data ubiquitously and instantaneously will defeat targeted attacks. While there will almost certainly be some raising of the bar, causing some less sophisticated threats to be ineffective, we’ll also likely see an escalation among effective groups. This will force a relatively expensive increase in speed and volume of intel sharing among defenders while attackers modify their tactics to make sharing less effective.

As we move forward with increased computer security intelligence sharing, we can’t let the compliance mindset and information sharing processes become ends unto themselves. Up until the time when our adversaries cease to be highly motivated and focused on long term, real world success, let’s not abandon the analyst mindset and security intelligence which have been the hallmark of the best incident responders.

Saturday, October 27, 2012

PDFrate Update: API and Community Classifier

I am very pleased with the activity on pdfrate.com in the last few weeks. There have been a good number of visitors and some really good submissions. I’m really impressed at the number of targeted PDFs that were submitted and I’m happy with the pdfrate’s ability to classify these. I really appreciate those who have taken the time to label their submissions (assuming they know if they are malicious or not) so that the service can be improved through better training data.

There is now an API for retrieval of scan results. See the API section for more details, but as an example, you can view the report (JSON) for the Yara 1.6 manual.

This API may be unconventional, but I do like how easy it is to get scan results. You submit a file and get the JSON object back synchronously. I’ve split the metadata out from the scan results for a couple reasons. First, the metadata can be very large. Second, the metadata is currently presented as text blob, and I wasn’t sure how people would want it stuffed into JSON. If you want both, you have to make two requests. You can also view the metadata blob for the Yara 1.6 manual.

I’m happy that there have already been enough submissions, including ones that weren’t classified well by the existing data sets, that I’ve generated a community classifier based on PDFrate.com user submissions and voting. I’m thrilled that there were submissions matching categories of malicious PDFs that I know are floating around but simply aren’t in the existing data sets. I expect that if the current submission rate stays the same or goes up, the community classifier will become the most accurate classifier, because it will contain fresher and more relevant training data. Again, as an example, you can check out the report for the Yara 1.6 manual which now includes a score from the community classifier.

If a submission had votes before Oct 25th, it was included in the community classifier. Some users will note that even though they themselves did not vote on their submissions, they have votes. I reviewed many interesting submissions and placed votes on them so that they could be included in the community classifier. I decided to not do a bulk rescan of all documents already submitted. It wasn't for technical reasons. Note that the ratings occur solely based on the previously extracted metadata and as such are very fast. I did so because I didn’t want to provide potentially deceptive results to users. If a document is in the training set, it is generally considered an unfair test to use the resulting classifier on it, as the classifier will almost always provide good results. Regardless, if you want to have a submission re-scanned, just submit the file over again.

Again, I’m pleased with the PDFrate so far. I hope this service continues to improve and that it provides value to the community.