Saturday, June 21, 2014

Tactics for Storing Security Event Data

I’d like to share some tactics that I have found useful for storing and retrieving security event data. These approaches can certainly be backed by theory, but I’ve also come to learn their value through real world experience, including some hard knocks. One impetus for finally putting these thoughts down is to explain the motivation behind MongoR, a wrapper around MongoDB that was written by a close colleague and recently open sourced by my employer. However, MongoR addresses only some of these considerations, and I believe they should be taken up by any data store that wants to better support users who warehouse security event data. Many data stores already utilize some of these tactics or make it possible for users to apply them on their own, but few leverage all the optimizations possible for storing logs.

Storing and retrieving security event data, such as network infrastructure or sensor logs, is non-trivial due to the high insert rate required. Beyond sheer volume, most off-the-shelf data stores are poorly tuned for the rolling window style of data retention that is usually desired. For example, a typical use case would be to store 90 days of logs and to be able to search them based on key indicator types such as IP addresses.

Security event data typically has the following properties:
  • Organized primarily by time
  • Limited, Fixed Retention
  • Immutable
  • Not Highly Relational
And queries usually have the following properties:
  • Very high insert to query ratio
  • Core indicator types (ex. IP) comprise most searches
  • Searches are typically for rare data (ex. attack activity)

These properties are nearly diametrically opposed to those of a typical database-driven app such as a service ticket, e-commerce, or ERP system. As such, it’s not surprising that adapting databases to storing event streams takes some work, and that many of the systems that succeed look more like search engines than databases.

Why Mongo?

NoSQL data stores like MongoDB work very well for security event data, fitting the not highly relational property well and providing a lot of flexibility when a fixed schema just isn’t possible. For example, storing and accessing arbitrary HTTP headers in something like Mongo is great. When there are search engines like Elasticsearch that store documents with flexible schemas, why would you even use something like Mongo? There are a few reasons, but one major driver is that Mongo and similar databases readily support alternative access methods, such as map/reduce. In speaking about tactics for log storage, I’m generally not going to differentiate between NoSQL databases like MongoDB and search engines like Elasticsearch, as the delineation between them is blurry anyway. I merely want to note that there is room for both approaches in the security event storage realm.

Events are Stored to be Deleted

Providing the capacity to support an adequate insert rate can often be difficult, but paying nearly as much to delete an old event as to insert a new one adds insult to injury. The time honored approach is to use some sort of time based data partitioning where each shard contains a time slice of events, whether the trigger for a new shard is date based or size based. Glossing over issues such as fragmentation, the expensive part here is not pruning the data itself but removing entries from indexes. Trimming documents out of the indexes is expensive and significantly increases the database’s working set. Deleting event data should be cheap. Many data stores don’t support date based sharding with efficient deletes, so this must be done by the user. This is one of the core benefits of MongoR: pruning old data is a simple collection (table) drop.
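
To make this concrete, here is a minimal sketch of drop-based expiry using pymongo (the database name, the per-day collection naming, and the 90 day retention are my assumptions for illustration, not MongoR’s actual implementation; the insert call reflects the pymongo API of this era):

    from datetime import datetime, timedelta
    from pymongo import MongoClient

    client = MongoClient()
    db = client.events  # hypothetical database holding one collection per day

    def collection_for(ts):
        # One collection per UTC day, e.g. "events_20140621"
        return db["events_" + ts.strftime("%Y%m%d")]

    def insert_event(event):
        # Inserts always land in the current day's collection
        collection_for(datetime.utcnow()).insert(event)

    def expire(retention_days=90):
        # Expiry is a cheap metadata operation: drop whole collections
        # instead of removing documents (and their index entries) one by one.
        cutoff = (datetime.utcnow() - timedelta(days=retention_days)).strftime("%Y%m%d")
        for name in db.collection_names():
            if name.startswith("events_") and name[len("events_"):] < cutoff:
                db.drop_collection(name)

Because no index entries are removed document by document, expiry stays cheap no matter how large the old collections have grown.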

Working Set Blues

Generalizing beyond efficient deletes, the major driver for data partitioning is keeping the working set of a database manageable. The only way to maintain high insert rates is to keep most of the area of active writing cached in RAM. This is especially imperative for indexes as these usually involve very random writes (the document data itself can usually be written sequentially). Sharding helps keep the working set manageable by bounding the size of the data actively written. The downside to this approach is that you have to check every shard during a query, multiplying IOPS by shard count. For security event data, this is very often a wise trade-off. This is the other major benefit MongoR adds on top of MongoDB: managing many collections that are small enough to fit in RAM.
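
The flip side, continuing the same hypothetical per-day collection layout as above, is that a search has to fan out over every collection in the window; a rough sketch:

    from datetime import datetime, timedelta
    from pymongo import MongoClient

    db = MongoClient().events  # same hypothetical per-day collections as above

    def search_indicator(field, value, days=90):
        # The query visits every time-sliced collection in the window,
        # multiplying IOPS by shard count, in exchange for cheap inserts
        # and deletes.
        existing = set(db.collection_names())
        hits = []
        for age in range(days):
            name = "events_" + (datetime.utcnow() - timedelta(days=age)).strftime("%Y%m%d")
            if name in existing:
                hits.extend(db[name].find({field: value}))
        return hits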

Tiered Storage Please

While MongoR addresses some of the most fundamental issues required to make MongoDB serviceable for high rate security event data, there is more to be done, and it’s not specific to MongoDB. One of my pet peeves is databases that don’t readily support separating indexes from raw data. Many people want to index a relatively small part of the raw events, say just the IP address from web server logs. When indexes are mingled with document data, the IOPS required for writing the indexes can be much higher because the indexes may be spread over a larger number of blocks than would occur if the indexes were grouped together. To the degree that document data and indexes have different read/write patterns, and especially when indexes are smaller than the base data, using tiered storage is beneficial.

Indexing the Kitchen Sink

Many systems used to store security event data support indexing a portion of the data in event records, while others take an all or nothing approach. Some are flexible in what fields can be indexed and some are fixed. Flexibility is always appreciated by advanced users who can say in advance what they will typically search. The all or nothing approach can make using the system very expensive and can lead to excluding data that is useful for analysis but unlikely to be searched directly.
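
As a small sketch of the flexible approach (field names, values, and the per-day collection are illustrative assumptions, and the insert call again reflects the pymongo API of this era), only the core indicator fields get indexes while the full flexible-schema record is still stored:

    from pymongo import MongoClient, ASCENDING

    coll = MongoClient().events["events_20140621"]  # hypothetical per-day collection

    # Index only the core indicator fields that dominate searches; the rest of
    # the flexible-schema document is stored but not indexed, keeping index
    # size and index write IOPS down.
    coll.create_index([("src_ip", ASCENDING)])
    coll.create_index([("dst_ip", ASCENDING)])

    # Arbitrary additional fields (ex. raw HTTP headers) are still stored and
    # retrievable, just not directly searchable.
    coll.insert({"src_ip": "192.0.2.10",
                 "dst_ip": "198.51.100.7",
                 "headers": {"User-Agent": "curl/7.30", "X-Forwarded-For": "203.0.113.9"}})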

Stop the Sync Shell Game

One of the fundamental trade-offs a data store has to make is between efficiency and data safety. You can cache heavily and get high performance, or constantly sync data to disk and get data safety, but not both. This dichotomy, however, neglects a few optimizations. First of all, security event data is immutable by design. Done right, the raw event data can be dumped with simple sequential writes and loaded with unsynchronized reads.

Another optimization lies in the presumption of some sort of sharding where only the current data collection needs to be loaded in RAM (and for proper performance needs to be wholly in RAM). If the indexes are the expensive part of a data store, requiring all sorts of IOPS to keep synced, why are they written to disk before the shard is finished? In the event of a failure, the indexes can always be re-created from the source data (sequential access). This also opens the door to using efficient data structures for the in-memory indexes that may well differ from what is best for on-disk or incrementally created indexes. Once you’ve resigned yourself to keeping your current shard in memory, embrace it and leverage the resulting optimizations.
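
A bare-bones sketch of the idea in plain Python (not any particular product’s implementation): raw events are appended sequentially, the index lives only in RAM, and it hits disk in a single pass when the shard is closed; after a crash, the index is rebuilt by re-reading the raw file:

    import json
    import os
    from collections import defaultdict

    class ShardWriter(object):
        """Sequentially append raw events; keep the index in RAM and
        persist it in one pass only when the shard is finished."""

        def __init__(self, path):
            self.raw = open(path + ".events", "a")  # sequential, append-only
            self.raw.seek(0, os.SEEK_END)           # make tell() reflect the end
            self.index = defaultdict(list)          # in memory until close()
            self.index_path = path + ".idx"

        def append(self, event, indicators):
            offset = self.raw.tell()
            self.raw.write(json.dumps(event) + "\n")
            for term in indicators:                 # e.g. IPs seen in the event
                self.index[term].append(offset)

        def close(self):
            # The index hits disk exactly once, as one sequential write; on a
            # crash it can be rebuilt from the raw event file instead.
            self.raw.close()
            with open(self.index_path, "w") as f:
                json.dump(self.index, f)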

There is no Trie

Moving beyond B-trees for indexes is liberating. Various alternative indexing mechanisms, most of them involving hashing of some flavor, can provide constant time lookups and require fewer IOPS than B-trees. For example, I’ve been using discodbs, based on cmph, and have had fantastic results. The minimal perfect hash functions used result in indexes that are less complex than a B-tree, involve far fewer IOPS for queries, and are much smaller than an optimal bloom filter. The trade-off is that they are immutable and cannot be efficiently updated or created incrementally, but that is not a concern for immutable log data.
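
For a sense of what this looks like in practice, here is a sketch using the discodb Python binding (treat the exact API as an assumption based on my reading of its documentation): a finished shard’s indicators are baked into an immutable, perfect-hash-backed index:

    from discodb import DiscoDB  # Python binding for the cmph-backed discodb

    # Build an immutable index for a finished shard: indicator -> event IDs.
    # The minimal perfect hash gives constant time lookups with very few IOPS;
    # the trade-off is that the structure cannot be updated incrementally,
    # which is fine for immutable log data.
    index = DiscoDB({
        "192.0.2.10":   ["evt-0001", "evt-0042"],
        "198.51.100.7": ["evt-0007"],
    })

    with open("shard_20140621.discodb", "wb") as f:
        index.dump(f)

    with open("shard_20140621.discodb", "rb") as f:
        db = DiscoDB.load(f)

    print(list(db["192.0.2.10"]))  # -> ['evt-0001', 'evt-0042']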

Proponents of B-trees will claim that prefix matches are a killer feature. I’m not sure they are, and they are easily provided through other techniques anyway. Domain names are a prime example of a data type that does not work well with the prefix searches supported by typical B-tree indexes. Sure, you can reverse the domains and then prefix matching works, but you could also extract the domain suffixes you want searchable and insert each of them as keys in your constant time lookup indexes. The point is that you often have to invest extra effort to make B-trees work for many common security event indicator types.
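
For example, a sketch of that extraction (how many suffix levels to emit is a policy choice):

    def domain_keys(fqdn):
        """Emit each searchable suffix of a domain as its own index key, so an
        exact-match (hash) lookup can answer "anything under example.com"
        without tree-based prefix matching."""
        labels = fqdn.lower().rstrip(".").split(".")
        # "a.b.example.com" -> ["a.b.example.com", "b.example.com", "example.com"]
        # (the bare TLD is intentionally skipped)
        return [".".join(labels[i:]) for i in range(len(labels) - 1)]

    print(domain_keys("a.b.example.com"))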

The security domain analogues of all the pre-processing that goes into full text indexing (tokenization, stemming, stopwords, natural language analysis, etc.) are very immature. As this field matures, I feel we’ll be better able to say what is required to support any searching beyond exact term matches. Regardless, once you move past storing IP addresses as integers, it’s not clear whether B-tree prefix matching buys much in queries on security event data. In a day when general purpose and reliability focused filesystems are using alternatives to plain B-trees (ex. the htree of ext4), security event data stores should be too.

Lower the Resolution Please

One place I think the security community should look for cost savings is in lowering the resolution of indexes from the per event/row level to groups/blocks of events. I’ve seen great savings from this in practice, and in some cases it can even improve performance in absolute terms, not just economically. When the desired result is a very small number of records, low resolution indexes coupled with processing of a very small number of blocks to do the final filtering can beat out bigger, more expensive, and more precise indexes. One use case that benefits from this approach is pivoting on indicators of threat actors, which are increasingly exchanged in threat intelligence sharing communities. In these searches, it is most often desirable to search the widest time window possible (the full data set), to search on a common indicator type, and for there to be few or no results. Often the low resolution answer is adequate--there is no need to drill down to individual events. I’ve seen low resolution indexes that are orders of magnitude smaller than the raw source data provide utility through low latency query times that row level indexes can never touch because of fundamental differences in size. A nice side effect of dropping index resolution is that it significantly ameliorates the cost of very common terms, ex. your web server’s IP, and thereby naturally diminishes the need for stopword mechanisms.
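
A minimal sketch of block-level indexing in plain Python (the block size and the indicator extraction are left as assumptions):

    from collections import defaultdict

    BLOCK_SIZE = 10000  # events per block; coarser blocks mean a smaller index

    def build_block_index(events, extract_indicators):
        """Map each indicator to the set of blocks it appears in, rather than
        to individual events.  A query consults this small index first, then
        scans only the matching blocks to do the final filtering."""
        index = defaultdict(set)
        for n, event in enumerate(events):
            block_id = n // BLOCK_SIZE
            for term in extract_indicators(event):
                index[term].add(block_id)
        return index

    # A very common term (ex. your own web server's IP) costs at most one entry
    # per block instead of one per event, which naturally bounds its index cost
    # without any stopword machinery.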

Trading CPU for IOPS

I’m reluctant to mention compression as a method of improving performance for fear of being flamed as a heretic, but I think it has to be mentioned. I know disk is cheap, but IOPS aren’t. Getting adequate IOPS is often among the biggest considerations (and costs) of the hardware for security event data stores (RAM used for disk caching is usually up there too). Huge sequentially accessible gzipped files are expensive to use, especially to retrieve a small number of records. But compression doesn’t have to be that way. You can use smaller blocks to get good compression ratios and reasonable random access. Considering only simple tweaks to gzip, pigz and dictzip address the single threaded and random access limitations of standard gzip, respectively.

As an oversimplified example, imagine you have a server with a disk drive that provides a 150 MB/s sequential read rate, CPU cores that decompress at about 20 MB/s (compressed size) per core/thread, and data that compresses at a ratio of 5:1. If you want to do analytics, say map/reduce or zgrep/awk, on the whole data store, you are better off using compression if you can dedicate 2 or more cores to decompression. If you can dedicate 8 cores, you will be able to stream at 750 MB/s instead of 150 MB/s.
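
The break-even arithmetic is easy to check with the same assumed numbers:

    DISK_MBPS = 150.0   # sequential read rate of the drive
    CORE_MBPS = 20.0    # decompression rate per core, measured in compressed bytes
    RATIO = 5.0         # compression ratio (uncompressed : compressed)

    def effective_stream_rate(cores):
        # Whichever of the disk or the decompression cores saturates first is
        # the bottleneck; the uncompressed stream rate is that times the ratio.
        compressed_rate = min(DISK_MBPS, cores * CORE_MBPS)
        return compressed_rate * RATIO

    for cores in (1, 2, 4, 8):
        print(cores, "cores:", effective_stream_rate(cores), "MB/s uncompressed")
    # 2 cores (200 MB/s) already beat reading uncompressed data at 150 MB/s;
    # 8 cores saturate the disk for 750 MB/s of uncompressed throughput.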

The CPU/IOPS trade-off is not just about improving sequential compressed log access; that is just a simple example that everyone who has ever zgrepped text logs understands. A better example is the very compact indexes created using perfect hash functions, such as those in discodb, which require relatively high CPU but low IOPS to query.

Conclusions

Security event data has special characteristics that enable various optimizations. For the data stores that do not yet make these techniques easy, there is an opportunity to better cater to the security log use case. MongoR demonstrates temporal data partitioning that keeps the working set in memory, resulting in scalable, fast inserts and efficient deletes.


Saturday, October 19, 2013

BlueLight: A Firefox Extension for Notification of SSL Monitoring

Recently I built a Firefox extension that was useful for my particular, if not peculiar, needs. I wanted to share it with anyone who might find it useful. BlueLight is designed to provide notification of SSL inspection--the type that organizations commonly perform on the traffic of their consenting users.

There are many tools, and no shortage of Firefox extensions, that relate to SSL security and detecting MitM attacks. CipherFox and similar extensions are very useful, but didn’t fit my specific need because they aren’t quite noisy enough--I wanted more active and intrusive notification when CAs of interest were used. Certificate Patrol and similar systems are useful for detecting the introduction of SSL inspection, but these systems don’t fit the scenario of overt, consensual, and long term SSL inspection. BlueLight is based heavily on Cert Alert. In fact, if the alerting criteria in Cert Alert weren’t hardcoded, I’d probably be using it for this purpose.

BlueLight is useful when SSL inspection is occurring, usually through a MitM scenario on web proxies using organizational PKI. Obviously, being notified of this on a per site basis is only useful when the organization is selective about what traffic is inspected—if everything is MitM’d then this notification provides no value.

Some claim that users are more secure when their traffic is subject to the organization’s protections and monitoring. In this case, BlueLight provides reassuring feedback to the user, letting them know that they are covered. Others may want to use BlueLight to know when they are under the purview of surveillance; it may deter them from taking some action while being subject to monitoring. In the case that monitoring should not occur on specific traffic, it provides useful notification to the user so that the erroneous inspection can be rectified. In this vein, I’ve seen BlueLight be particularly useful because it alerts for all SSL elements of the page, not just the main URL (it alerts on the first occurrence, and only the first occurrence).

BlueLight isn’t designed to be useful for other scenarios such as detecting unauthorized SSL MitM attacks or any other covert SSL malfeasance. However, since BlueLight can be configured to alert on basically any certificate issuer, it may well be useful for other similar uses.

BlueLight has to be configured by the user to be useful. As it is, it’s probably only useful to reasonably technically savvy folk. In sharing BlueLight with the larger community, I hope it might be useful to others. BlueLight can be downloaded from addons.mozilla.org or from csmutz.com/bluelight.

Monday, August 5, 2013

Intelligence Compliance: Checklist Mentality For Cyber Intelligence

The past few years have seen a sharp increase in the amount of targeted attack intelligence shared and the number of venues used for sharing. There is a geeky fervor emerging in the community, rooted in the premise that if we accelerate sharing and consumption of intelligence, even to the point of automation, we could significantly improve network defense. Momentum is snowballing for adoption of specific standards for intel sharing, the foremost of which is the MITRE suite of STIX/TAXII/MAEC. There is a seemingly self-evident necessity to share more intel, share it wider, and share it faster.

As someone who seeks to apply technology to incident response, I see great promise in standards, technology, and investments to accelerate the distribution and application of threat intelligence. I’ve spent years developing capabilities to perform security intelligence and have seen first-hand the benefits of information sharing. The amount of security data exchanged will likely continue to grow. However, I have concerns about the shift from security intelligence to intelligence compliance and about the fundamental benefit of comprehensive distribution of threat intelligence.

Compliance Supplanting Analysis

Information sharing has been critical to the success of intelligence based network defense. Direct competitors from various industries have come together to battle the common problem of espionage threats. Threat intelligence sharing has been wildly successful as sharing has included relevant context, has been timely, and as the attackers have been common to those sharing. Years ago, much of this sharing was informal, based primarily in direct analyst to analyst collaboration. Over time, formal intel sharing arrangements have evolved and are proliferating today, increasing in count of sharing venues, the number of participants, and the volume of intel.

The primary concern I have with this increase in intel sharing is that it is often accompanied by a compliance mindset. If there’s anything we should have learned from targeted attacks, it is that compliance based approaches will not stop highly motivated attacks. It’s inevitable that conformance will fail, given enough attacker effort. For example, targeted attackers frequently have access to zero-day exploits that aren’t even on the red/yellow/green vulnerability metric charts, let alone affected by our progress in speeding patching from months to days. The reactive approach to incident response is focused primarily on preventing known attacks. As a community, we have developed intelligence driven network defense to remedy this situation. It allows us to get ahead of individual attacks by keeping tabs on persistent attackers, styled campaigns in the vernacular, in addition to proper vulnerability focused approaches. The beauty of intelligence driven incident response is that it gives some degree of assurance that if you have good intelligence on an attack group, you will detect/mitigate subsequent attacks if they maintain some of the patterns they have exhibited in the past. This may seem like a limited guarantee, and indeed it is narrow, but it’s the most effective way to defeat APT. Intelligence compliance, on the other hand, promises completeness in dispatching with all documented and shared attacks, but it makes no promise for previously unknown attacks.

To explain in detail, the point of kill chain and associated analysis isn’t merely to extract a prescribed set of data to be used as mitigation fodder, but to identify multiple durable indicators to form a campaign. This has been succinctly explained by my colleague, Mike Cloppert, in his blog post on defining campaigns. The persistence of these indicators serves not only as the method of aggregating individual attacks into campaigns, but the presence of these consistencies is the substance of the assurance that you can reliably defeat a given campaign. By definition, individual attacks from the same campaign have persistent attributes. If you have a few of these, 3 seems to be the magic number, across multiple phases of the kill chain, you have good assurance that you will mitigate subsequent attacks, even if one of these previously static attributes changes. If you can’t identify enough of these attributes, or your IDS can’t detect them, security intelligence dictates that you either dig deeper to identify them and/or upgrade your detection mechanisms to support these durable indicators. Ergo, defense driven by intelligence dictates that you do analysis to identify persistent attack attributes and build your counter-measures around these.

Intelligence compliance, on the other hand, provides no similar rational basis for preparation against previously unseen attacks. Surely, a compliance focused approach has some appeal. It is often viewed as seeking to ensure consistency in an activity that is highly valuable for security intelligence. In other cases, less critical drivers overshadow primary mission success. Intel compliance provides a highly structured process that can be easily metered—very important in bureaucratic environments. The one guarantee that intelligence compliance does give is that you have the same defenses, at least those based on common intelligence, as everyone else in your sharing venue. This is important when covering your bases is the primary driver. It provides no guarantee about new attacks or the actual security of your data, but it does allow you to ensure that known attacks are mitigated, which is arguably most important for many forms of liability. Lastly, giving, taking, or middle manning intel can all be used as chips in political games ranging from true synergies to contrived intrigues. Intelligence compliance provides a repeatable and measurable process which caters to managerial ease and is also able to be aligned with legal and political initiatives.

There are limitless ways in which intelligence compliance can go wrong. Most failings can be categorized either as supplanting more effective activities or as shortcomings in the mode of intelligence sharing that reduce its value. It is also fair to question whether perfect and universal intelligence compliance would even be effective. Remember, intelligence compliance usually isn’t adopted on technical merits. The best, if not naïve, argument for adoption of this compliance mindset is that intel sharing has been useful for security intelligence; hence, by extrapolation, increasing threat data sharing must be better. Sadly, the rationale for a compliance approach to intel sharing frequently digresses to placing short-sighted blame avoidance in front of long term results.

The primary way in which intel compliance goes awry is when it displaces the capacity for security intelligence. If analysts spend more time on superfluous processing of external information than doing actual intelligence analysis, you’ve got serious means/ends inversion problems. Unfortunately, it’s often easier to process another batch of external intel than to dig deeper on a campaign to discover resilient indicators. This can be compared to egregious failings in software vulnerability remediation where more resources are spent on digital paper pushing, such as layers of statistics on compliance rates or elaborate exception justification and approval documentation, than are expended actually patching systems. An important observation is that the most easily shared and mitigated indicators, say IP addresses (firewalls) or email addresses (email filters), are also easily modified by attackers. For that reason, some of the most commonly exchanged indicators are also the most ephemeral, although this does depend on the specific attack group. If an indicator isn’t reused by an attacker, then sharing is useful for detecting previous attacks (hopefully before severe impact) but doesn’t prevent new attacks. A focus on short-lived intel can result in a whack-a-mole cycle that saps resources cleaning up compromises. This vicious cycle is taken to a meta level when human intensive processes are used despite increased intel sharing volume, putting organizations too far behind to make the technological and process improvements that would facilitate higher intel processing efficiency. This plays into the attacker’s hand. This is exactly the scenario that security intelligence, including kill chain analysis, disrupts--allowing defenders to turn an attacker’s persistence into a defensive advantage.

Another intel sharing tragedy occurs when attack data is exchanged as though it were actionable intelligence and yet it’s no more than raw attack data. There are many who would advocate extracting a consistent set of metadata from attacks and regurgitating that as intelligence for sharing with their peers. I’m all for sharing raw attack data, but it must be analyzed to produce useful intelligence. If the original source doesn’t do any vetting of the attack data and shares a condensed subset, then the receiver will be forced to vet the data to see if it’s useful for countermeasures, but often with less context. The canonical example which illustrates the difference between raw attack data and intelligence is the IP address of an email server that sent a targeted malicious email, where that server is part of a common webmail service. Clearly this is valid attack data, but it’s about as specific to the targeted attacker as the license plate number of a vehicle the terrorist once rode in is to that terrorist, given that the vehicle was a public bus. Some part of this vetting can and should be automated, but effective human adjudication is usually still necessary. Furthermore, many of the centralized clearinghouses don’t have the requisite data to adequately vet these indicators, so they are happily brokered to all consumers. To make matters more difficult, different organizations are willing to accept different types of collateral mitigation, the easiest solution being to devolve to the common denominator for the community, which is a subset of the actionable intelligence for each organization. For example, given an otherwise legitimate website that is temporarily compromised to host malicious content, some members of a community may be able to block the whole website while others may not be able to accept the business impact. The easiest solution for the community is to reject the overly broad mitigation causing collateral impact, while the optimal use of the intelligence requires risk assessment by each organization.

While ambiguity between raw attack data and vetted intelligence is the most obscene problem operationally, because it can result in blocking benign user activity, there are other issues related to incomplete context on so called intelligence. An important aspect of security intelligence is proper prioritization. For example, many defenders invest significantly more analyst time in espionage threats while rapidly dispatching with commodity crimeware. If this context is not provided, improper triage might result in wasted resources. Ideally, this would include not only a categorization of the class of threat, but the actual threat identity, i.e. the campaign name. Similarly, intelligence is often devoid of useful context such as whether the IP address reported is used as the source of attacks against vulnerable web servers, for email delivery, or for command and control. This can lead to imprecise or inappropriate countermeasures. Poorly vetted or ambiguous intel is analogous to (and sometimes turns into) the noisy signatures in your IDS—they waste time and desensitize.

With all that being said, I’m an advocate of intel sharing. I’m an advocate of sharing raw attack data, which is useful for those with the capacity to extract the useful intelligence. Realizing that this isn’t scalable, I’m also an advocate of sharing well vetted intelligence, with the requisite context to be actionable. Even if your shop doesn’t have the ability to process raw attack data at a high volume, sharing that data with those who can will ostensibly result in intel being shared back to you that you couldn’t synthesize yourself. My main concern with intelligence compliance is that it robs time and resources from security intelligence while providing no guarantee of efficacy.

Intelligence Race to Zero

Beyond supplanting security intelligence, my other concern with the increase in information sharing is that as we become more proficient at ubiquitous sharing, the value of the intelligence will be diminished. This will occur whether the intelligence is revealed directly to the attackers or lack of success causes them to evolve. Either way, I question whether intelligence applied universally for counter-measures can ever be truly effective. Almost all current intelligence sharing venues at least give lip service to confidentiality of the shared intelligence, and I believe many communities do a pretty good job. However, as intel is shared more widely, it is increasingly likely that the intel will be directly leaked to attackers. This principle drives the strict limitations on distribution normally associated with high value intelligence. It is also generally accepted that it’s impractical to keep data that is widely distributed for enforcement secure from attackers. The vast majority of widely deployed mitigations, such as AV signatures and spam RBLs, are accepted to be available to attackers. In this scenario, you engage in an intelligence race, trying to speed the use of commodity intel. This is the antithesis of security intelligence, which seeks to mitigate whole campaigns with advanced persistent intelligence.

Note that even if your raw intel isn’t exposed to attackers, the effects are known to the attacker—their attacks are no longer successful. Professional spooks have always faced the dilemma of leveraging intelligence versus protecting sources and methods. If some intelligence is used, especially intelligence from exclusive sources, then the source is compromised. As an example, the capability to decrypt Axis messages during WWII was jealously protected. The tale that Churchill sacrificed a whole city to German bombers is hyperbole, but it certainly is representative of the type of trade-offs that must be made when protecting sources and methods. Note that this necessity to protect intel affects its use through the whole process, not merely the decision for use at the end. For example, if two pieces of information are collected that when combined would solidify actionable intelligence, but these are compartmentalized, then the dots will never be connected and the actionable intelligence will never be produced. We see this play out in so called failures of the counter-terrorism intelligence community, where conspiracy theorists ascribe the failings to malice but the real cause is, more often than not, hindrances to analysis.

It’s worth considering how sources and methods apply specifically to network defense. Generally, I believe there is a small subset of intelligence that can be obtained solely through highly sensitive sources that is also useful for network defense. In most cases, if you can use an indicator for counter-measures, you can also derive it yourself, because it must be visible to the end defender. Also, while some sources may be highly sensitive, the same or similar information about attack activity (not attribution), is often available through open sources or through attack activity itself. Obviously, this notion isn’t absolutely true, but I believe it to be the norm. As a counter-example, imagine that a super sensitive source reveals that an attacker has added a drive by exploit to an otherwise legitimate website frequented by the intended victim audience. In this example the intel is still hard to leverage and relatively ephemeral: one still has to operationalize this knowledge in a very short time frame and this knowledge is by definition related specifically to this single attack.

Resting on the qualitative argument of indicator existentialism, the vast majority of counter-measures can be derived from attacker activity visible to the end network defender. This is necessarily true of the most durable indicators. Therefore, I don’t consider protecting sources (for network defense) the biggest consideration and advocate wide sharing of raw attack data. However, that certainly doesn’t mean that the analysis techniques and countermeasure capabilities are not sensitive. Indeed, most of my work in incident response has been about facilitating deeper analysis of network data, allowing durable and actionable intelligence to be created and leveraged. Competitive advantage in this realm has typically been found by either looking deeper or being more clever. In a spear phishing attack, for example, this may be in analysis of obscure email header data or malicious document metadata or weaponization techniques. Often the actionable intelligence is an atomic indicator, say a document author value, which could presumably be changed by the attacker if known. Some require more sophistication on the part of the defender: novel detection algorithms, correlations, or computational techniques such as those my PDFrate performs. Either way, the doctrine of security intelligence is based in the premise that persistent indicators can be found for persistent attackers, even if it requires significant analysis to elucidate them. This analysis to identify reliable counter-measures is what security intelligence dictates and is often the opportunity cost of intelligence compliance. I’ve seen some strong indicators continue to be static for years, allowing new attacks to be mitigated despite other major changes in the attacks.

It is my belief, backed by personal experience and anecdotal evidence, that when an indicator that would be a strong mitigation if kept secret is instead used widely, its lifespan will be shortened. In the end I’m not sure it matters too much whether the intelligence is directly revealed or the attackers are forced to evolve due to lack of success, but that probably affects how quickly attackers change. In my experience, it is true that the greater the sophistication on the part of the defender and the greater the technical requirements for security systems, the less likely useful indicators are to be subverted. However, it’s possible that continued efficacy has more to do with narrow application due to the small number of organizations able to implement the mitigation than with the difficulty of attackers changing their tactics. Often I wonder if, like outrunning the proverbial bear, today’s success in beating persistent adversaries may be more about being better than other potential victims than actually directly beating the adversary. While intelligence driven security, and by extension information sharing, is much more effective than classic incident response, I think it is yet to be proven whether ubiquitous intel sharing can actually get ahead of targeted attacks or whether attackers will win the ensuing intelligence/evolution race.

One benefit of the still far from fully standardized information sharing and defense systems of today is diversity. Each organization has their own special techniques for incident prevention--their own secret sauce for persistent threats. It’s inevitable that intelligence gaps will occur and some attacks, at least new ones, will not be stopped as early as desired. The diversity of exquisite detections among organizations combined with attack information sharing, even that of one-off indicators, allows for a better chance of response to novel attacks, even if this response is sub-optimal. A trend to greater standardization of intelligence sharing, driven by compliance, will tend to remove this diversity over time, as analysts, systems, and processes will be geared to greater intel volume and lower latency at the expense of intelligence resiliency.

Long Road Ahead

While I’m primarily concerned about being let down when we get there, it’s also important to note that as a community, we have a long pilgrimage before we make it to the ubiquitous intelligence sharing promised land. MITRE’s STIX et al. are being widely accepted across the community as the path forward, which is great progress. Now that the high level information exchange format and transport are agreed upon, we still have a lot of minutiae to work out. For example, much of the actual schema for sharing is still wide open: many indicator types still have to be defined, standards for normalization and presentation still need to be settled, and the fundamental meaning of the indicators still needs to be agreed upon across the community.

I think it’s instructive to compare the STIX suite to the CVE/CVSS/CWE/CWSS/OVAL suite, even though the comparison is not perfect. These initiatives were designed to drive standardization and automation and to improve the latency of closing vulnerabilities. There is a plethora of information tracked through these initiatives: from (machine readable) lists of vulnerable products, to the taxonomy of the vulnerability type, to relatively complicated ratings of the vulnerability. Despite this wealth of information, I don’t think we’ve achieved the vulnerability assessment, reporting, and remediation nirvana these mechanisms were built to support. Of all the information exchanged through these initiatives, probably the most important, and admittedly the most mundane, is the standardized CVE identifier, which the community uses to reference a given vulnerability. This is one area where current sharing communities can improve—standardized identifiers for malware families, attack groups, and attacker techniques. While many groups have these defined informally, more structured and consistent definitions would be helpful to the community, especially as indicators are tied to these names to provide useful context to the indicators (and provide objective definitions of the named associations). Community agreement on these identifiers is more difficult than it is for vulnerabilities, and building the lexicon for translations between sharing communities is also necessary, as defining these labels is less straightforward and occurs on a per community basis. As we better define these intelligence groupings and use them more consistently in intel sharing, we’ll have more context for proper prioritization, help ensure both better vetted intel and clearer campaign definitions, and have better assurance that our intelligence is providing the value we aim to achieve from sharing.

In helping assess the effectiveness of information sharing, I think the following questions are useful:
  • How relevant to your organization is the shared intelligence?
  • Is the intelligence shared with enough context for appropriate prioritization and use?
  • How well is actionable intelligence vetted? Is raw attack data shared (for those who want it)?
  • How durable is the shared intelligence? How long does it remain useful?
  • How timely is the shared intel? Is it only useful for detecting historical activity or does it also allow you to mitigate future activity?
  • Do you invest in defensive capabilities based on shared intelligence and intelligence gaps?
  • Do you have metrics which indicate your most effective intelligence sources, including internal analysis?
  • Do you have technology that speeds the mundane portion of intelligence processing, reserving human resources for analysis?

Closing Argument

I’m sure that there are some who will argue that it’s possible to have both security intelligence and intelligence compliance. I must concede that it is possible in theory. However, as there is plenty of room for progress in both arenas, and resources are scarce, I don’t believe there is an incident response shop that claims to do either to the fullest degree possible, let alone both. Also, the two mindsets, analysis and compliance, are very different and come with vastly different cultures. Most organizations are left with a choice—to focus on analysis and security intelligence or to choose box checking and information sharing compliance.

Similarly, I question the seemingly self-evident supposition that sharing security data ubiquitously and instantaneously will defeat targeted attacks. While there will almost certainly be some raising of the bar, causing some less sophisticated threats to be ineffective, we’ll also likely see an escalation among effective groups. This will force a relatively expensive increase in speed and volume of intel sharing among defenders while attackers modify their tactics to make sharing less effective.

As we move forward with increased computer security intelligence sharing, we can’t let the compliance mindset and information sharing processes become ends unto themselves. Up until the time when our adversaries cease to be highly motivated and focused on long term, real world success, let’s not abandon the analyst mindset and security intelligence which have been the hallmark of the best incident responders.

Saturday, October 27, 2012

PDFrate Update: API and Community Classifier

I am very pleased with the activity on pdfrate.com in the last few weeks. There have been a good number of visitors and some really good submissions. I’m really impressed by the number of targeted PDFs that were submitted and I’m happy with PDFrate’s ability to classify them. I really appreciate those who have taken the time to label their submissions (assuming they know whether they are malicious or not) so that the service can be improved through better training data.

There is now an API for retrieval of scan results. See the API section for more details, but as an example, you can view the report (JSON) for the Yara 1.6 manual.

This API may be unconventional, but I do like how easy it is to get scan results. You submit a file and get the JSON object back synchronously. I’ve split the metadata out from the scan results for a couple of reasons. First, the metadata can be very large. Second, the metadata is currently presented as a text blob, and I wasn’t sure how people would want it stuffed into JSON. If you want both, you have to make two requests. You can also view the metadata blob for the Yara 1.6 manual.
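
For illustration, retrieving results might look roughly like the following (a hedged sketch only: the endpoint paths and the fileid field below are hypothetical placeholders, not the documented API; see the API section for the real details):

    import requests  # third-party HTTP client

    # Illustrative sketch: submit a file, receive the JSON report synchronously,
    # and fetch the metadata blob with a second request.  The URLs and the
    # "fileid" field are placeholders, not PDFrate's actual interface.
    with open("yara-1.6-manual.pdf", "rb") as f:
        report = requests.post("http://pdfrate.com/submit", files={"file": f}).json()
    print(report)

    metadata = requests.get("http://pdfrate.com/metadata/" + report["fileid"]).text
    print(metadata)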

I’m happy that there have already been enough submissions, including ones that weren’t classified well by the existing data sets, that I’ve generated a community classifier based on PDFrate.com user submissions and voting. I’m thrilled that there were submissions matching categories of malicious PDFs that I know are floating around but simply aren’t in the existing data sets. I expect that if the current submission rate stays the same or goes up, the community classifier will become the most accurate classifier, because it will contain fresher and more relevant training data. Again, as an example, you can check out the report for the Yara 1.6 manual which now includes a score from the community classifier.

If a submission had votes before Oct 25th, it was included in the community classifier. Some users will note that even though they themselves did not vote on their submissions, they have votes. I reviewed many interesting submissions and placed votes on them so that they could be included in the community classifier. I decided not to do a bulk rescan of all documents already submitted. This wasn’t for technical reasons--the ratings are computed solely from the previously extracted metadata and as such are very fast. Rather, I didn’t want to provide potentially deceptive results to users. If a document is in the training set, it is generally considered an unfair test to use the resulting classifier on it, as the classifier will almost always provide good results. Regardless, if you want to have a submission re-scanned, just submit the file over again.

Again, I’m pleased with PDFrate so far. I hope this service continues to improve and that it provides value to the community.

Saturday, September 15, 2012

Announcing PDFrate Public Service

I’m excited to announce PDFrate: a website that provides malicious document identification using machine learning based on metadata and structural features. The gory details of the underlying mechanisms will be presented at ACSAC 2012.

I’ve been working on this research since 2009, a year in which the stream of PDF 0-days being leveraged by targeted attackers was nearly unbroken. I’ve refined the underlying techniques to a place where they are very effective in real operations and are addressed rigorously enough for academic acceptance. Note that I originally designed this for the purpose of detecting APT malicious documents but have found it to be largely effective on broad based crimeware PDFs also. Furthermore, it is pretty effective at distinguishing between the two. I can say from personal experience that the mechanisms underlying PDFrate provide a strong complement to signature and dynamic analysis detection mechanisms.

Those that are interested should head over to the pdfrate site and check out the “about” page in particular which explains the mechanisms and points to some good examples.

PDFrate demonstrates a well refined mechanism for detecting malicious documents. It currently operates on PDF documents, and I am close to extending it to office documents. But I see this paradigm extending much farther than just malicious documents. I see wise (and deep) selection of features and machine learning being effective for many other things, such as emails, network transactions like HTTP, web pages, and other file formats such as SWF and JAR.
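
To illustrate the paradigm in the abstract (this is a toy sketch with made-up feature names and data, not PDFrate’s actual features or model, which are described in the ACSAC paper):

    from sklearn.ensemble import RandomForestClassifier

    # Toy illustration of the general paradigm only: each document is reduced
    # to a vector of metadata/structural counts and an ensemble classifier is
    # trained on labeled examples.
    FEATURES = ["obj_count", "stream_count", "javascript_count",
                "page_count", "title_length"]  # hypothetical feature names

    X_train = [[120, 40, 2, 1, 0],      # made-up malicious-looking sample
               [800, 150, 0, 42, 28]]   # made-up benign-looking sample
    y_train = ["malicious", "benign"]

    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)

    print(clf.predict([[95, 30, 3, 1, 0]]))  # classify a new document's features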

I’m happy to provide the PDFrate service to the community so that others can leverage (and critique) this mechanism. Providing this as a service is a really good way for others to be able to use it because it removes a lot of the difficulty of implementation and configuration, the hardest part of which is collecting and labeling a training set. High quality training data is critical for high quality classification and this data is often hard for a single organization/individual to compile. While the current data sets/classifiers provided on the site are fine for detecting similar attacks, there is room for improvement and generalization which I hope will come from community submissions and ratings. So please vote on submissions, malicious or not, as this will speed the development and evolution of a community driven classifier. This service could benefit from some additional recent targeted PDFs.

In addition to the classification that PDFrate provides, it also provides one of the best document metadata extraction capabilities that I’ve seen. While there are many tools for PDF analysis, the metadata and structure extraction capabilities used by PDFrate provide a great mix of speed, simplicity, robustness, saliency, and transparency. Even if you aren’t sold on using PDFrate for classification, you might see if you like the metadata it provides. Again, the about page provides illustrative examples.

I hope this service is useful to the community. I look forward to describing in depth in December at ACSAC!

Saturday, July 28, 2012

Security Vanity: Public Key Crypto for Secure Boot

For the last few months there’s been a bit of chatter about the restrictions Microsoft will be imposing on hardware vendors, requiring them to implement a specific flavor of “secure boot” where the hardware verifies the signature of the bootloader. The claim is that signing code from the hardware up will improve security, especially helping defeat boot loader viruses. This mechanism obviously makes installing other operating systems more difficult. In the case of ARM processors, there are further restrictions, basically preventing manual circumvention.

The responses from the organizations interested in other operating systems and consumer interests have been mixed. Some, like Fedora and Ubuntu, have drafted courses that seem surprisingly compliant, while others, like FSF, don’t seem to be so keen on the idea. Linus provided a fairly apathetic response noting that the proposed method for secure boot doesn’t help security much but probably doesn’t hurt usability that much either. Most of these responses make sense considering the interests they represent.

It may be fine for non-security folk to trust the advice of security professionals when we say that something is necessary for “security”. However, we do ourselves and society a great disservice when we allow security to be used as a cover for actions based in ulterior motives or when we support ineffective controls. Whether you sit on the conspiracy theory side of the house or the incompetence-is-more-likely-than-malice side, allowing the cause of security to be invoked recklessly hurts. It hurts our reputation and ability to enact sensible security measures in the future. It hurts the very cause we are professing to support, which for me is securing liberty. For example, I believe it will take years for our society to recognize how harmful the security theater of airport screening is. On the flip side, there has been no coherent demonstration of any significant effect in preventing terrorism, but our efforts have been fabulously effective at magnifying terror.

We have to call out security vanity when we see it. The use of public key crypto (without much of the complementary infrastructure) in secure boot needs to be questioned. Code signing, and digital signatures in general, can add a high degree of trust to computing. However, experience has shown that PKI, and crypto in general, gets busted over time, especially through key disclosure. It only takes one disclosure to cause a serious issue, especially if robust key management regimes are not in place. By design, there will be no practical way to keep this PKI up to date without significant effort on the part of the user. No removal of broken root keys, no certificate revocations, etc. So the types of mechanisms that are used for keeping current code signing, SSL, etc. up to date can’t apply here. Note that all of these mechanisms for key management are used by necessity—examples of individual keys/certificates being revoked occur in the wild, even if you exclude really spooky attacks like Stuxnet or the RSA breach. The whole class of things designed to restrict how users use their devices, things like DVD’s CSS and HDCP, illustrates an important principle. It’s not a question of whether these mechanisms can be defeated; it’s a question of whether anyone cares to do it and how long it will take. It doesn’t matter how good the crypto algorithms are in a PKI based secure boot system; if proper key management can’t occur, the system is critically flawed.

If someone is pushing a defective by design security mechanism, are people justified in questioning their motives? I believe so. The purpose of this post isn’t to rant about anti-competitive practices, anti-consumer restrictions on hardware devices, use of “digital locks” laws to inhibit otherwise legal activity, etc. It is to point out that the use of PKI based secure boot is either based in motives other than security or it is based in bad engineering.

Let’s give Microsoft the benefit of the doubt. Let’s say that boot loader viruses are an issue that needs to be addressed, or at least something that we believe could become an issue during the life of the systems that will be shipping in the near future. The question at this point is, why use PKI? Are there alternatives? Granted, there aren’t too many options for trust at this level, but I believe there are options. For example, I’ve long thought it would be useful for both desktop and mobile systems to have a small portion of storage which is writable only with a physical switch that would typically be left in the read only position. This sort of thing could be really useful for things like bootloaders, which are updated relatively infrequently. You can criticize this approach, and debate whether it’s more likely for a lay person to make the correct decision about when to flip the switch to make their secure storage area writable than for them to do manual key management, but the point is that there are alternatives. Note that I’ve intentionally ignored the stiffer requirements for ARM devices.

Using public key crypto for secure boot where proper key management cannot occur is a horrible idea. It will not provide security against determined threats over time. It is incumbent on the security community to clarify this, so that an effective solution can be implemented or the true motives can be discussed. The security community cannot tacitly be party to security vanity.