I can’t count how many times I’ve seen machine learning supposedly applied to solve a problem in the realm of information security. In my estimation, the vast majority of these attempts are a waste of resources that never demonstrate any real-world value. It saddens me to consistently see lots of effort and brainpower wasted on a field that I believe has a lot of potential. I’d like to share my thoughts on how machine learning can be effectively applied to incident detection. My focus is to address this topic in a manner and forum that is accessible to people in industry, especially those who fund, lead, or execute cyber security R&D. I hope some people in academia might find it useful, assuming they can stomach the information as presented here (including lack of academic formality and original empirical evidence). For what it’s worth, I consider myself to have a pretty good amount of real world experience in targeted attack detection and a fair amount of academic experience in machine learning.
Before I get too far, a few definitions are in order. Specifically, I need to clarify what I mean by “Machine Learning”. As used here, “Machine Learning” indicates the use of computer algorithms that provide categorization capabilities beyond simple signatures or thresholds and which implement generalized or fuzzy matching capabilities. Typically, the machine is trained with some examples of data of interest (usually attack and benign data) from which it learns through construction of a model that can be used to classify a larger corpus of observations (usually as attack or benign) even when the larger corpus contains observations that don’t exactly match the observations in the training data.
With “Incident Detection”, I’m trying to be a little more broad than the classic definition of Intrusion Detection or NIDS by adding in connotations relative to Incident Response. I almost used CND, but that isn’t quite right because CND is a very broad topic. “Using Machine Learning for CNA Detection” would be an accurate alternate title. While I’ll be using NIDS heavily in my examples, note that for me NIDS isn’t merely about detecting malicious activity on the network, it’s also about detecting and providing forensics capabilities to analyze otherwise benign attack activity performed by targeted, persistent attackers (or in other words supporting cyber kill chain analysis).
During this short essay, I’ll reference two academic papers. The first is the PhD thesis of my friend, mentor, and former boss: Rohan Amin. His thesis, Detecting Targeted Malicious Email through Supervised Classification of Persistent Threat and Recipient Oriented Features, is the best example of the useful application of machine learning to the problem of incident detection I’ve ever seen. I’ve conversed with Rohan on his research from start to finish and have largely been waiting to write this essay until he finished his thesis so I would have a positive example to talk about. His research is refreshing: from choosing one of the most pressing security problems of the APT age to making brilliant technical contributions. If rated against the recommendations I will make herein, Rohan’s paper scores very high.
My second reference is Outside the Closed World: On Using Machine Learning For Network Intrusion Detection which was presented at IEEE S+P 2010. Robin Sommer and Vern Paxson are academic researchers with some serious credentials in the field of NIDS. They are probably best known in industry for their contributions to Bro IDS. Their paper is geared to academics but tries to encourage some amount of real world relevancy in research. It makes me laugh with cynicism sometimes at the political correctness and positive tone with which they make recommendations to researchers such as “Understand what the system is doing.” While I don’t agree with everything Sommer and Paxson say, they say a lot that is spot on, the paper is well written, it provides a good view into how academics think, and it even explicitly, albeit briefly, calls out the difference in approach required for opportunistic and targeted attacks.
Solve a Problem (worth solving)
Sommer and Paxson said it so well:
The intrusion detection community does not benefit any further from yet another study measuring the performance of some previously untried combination of a machine learning scheme with a particular feature set, applied to something like the DARPA dataset.
Amen. The Engineer in me and my personality scoff at what I see as a too haphazard and inefficient process of invention which involves combining one of the set of machine learning techniques with one of the set of possible problems, often apparently pseudo-randomly, until a good fit is found through empirical evaluation. Sure, there are numerous examples of where this general approach has worked in the past. For example, Goodyear’s invention of sulfur vulcanization for rubber is often thought to have happened by luck. Certainly this methodology is at least compatible with Edison’s maxim of “Genius is 1 percent inspiration and 99 percent perspiration.” While systematically testing every permutation of machine learning algorithms, problems, and other options such as data sets and feature selections is perfectly valid, I don’t like it. Most people investing in research probably shouldn’t either. One of the problems I see with this in the real world is that many people have what they think is a whiz bang machine learning algorithm, possibly even working well in a different domain. Since cyber security is a hot topic, people try to port the whiz bang mechanism to the problème du jour, e.g. cyber security. Often these efforts fail not because there isn’t some way in which the whiz bang mechanism could provide value in the cyber security realm, but because the whiz bang mechanism isn’t applied to a specific enough or relevant enough problem, poor data is used for evaluation, etc.
One strong predictor of the relevancy of the research being conducted and the technology that will come from it is the relevancy of the data being evaluated. Could it be any more clear that if you are using data that is too old to reflect current conditions, you can have little confidence that your resulting technology will address today’s threats? Furthermore, if you are using synthetic data, you may be able to show empirically that your solution solves a possible problem under certain conditions, but you have no guarantee that the problem is a problem worth solving or that the conditions assumed will ever be reached in the real world. Sommer and Paxson largely trash any research that relies predominately on the DARPA 1998-2000 Intrusion Detection Evaluation data sets, with which I passionately agree.
While the relevancy of the data being evaluated is a pretty good litmus test for the relevancy of the technology coming from the research, I believe it’s much more fundamental than that. Below I present two models for R&D. In the S-P-D process, novelty is ensured by taking a solution and using increasing innovation and discovery to find a problem and then a data set/feature set for which the solution can be empirically shown to be valid. This correlates to the all too frequently played out example I alluded to above where a whiz bang machine learning algorithm is applied to a new domain such as cyber security. The researcher spends most of his time figuring out how to apply the solution to a problem, including finding or creating data that shows how the solution solves a problem. Clearly, there is little guarantee for real world relevancy, but academic novelty is assured throughout the process. On the other hand, in the D-P-S process, relevancy is ensured because the data is drawn from real world observation. By evaluating data from real world events, a problem is discovered, described, and prioritized. Resources are dedicated to research, and a useful solution is sought. Academic novelty is not necessarily guaranteed, but relevancy is systemic. Rohan’s PhD research exemplifies the D-P-S process. Between 2003 and 2006 Targeted Malicious Email (TME) evolved as the principal attack vector for highly targeted sophisticated attacks. As the problem of APT attacks became more severe and more was learned about the attacks, TME detection was identified as a critical capability. Analysis of the data (real attacks) revealed consistent patterns between attacks that current security systems could not effectively detect. Rohan recognized the potential of machine learning to improve detection capabilities and did the hard work of refining and demonstrating his ideas.
While I’m normally not a fan of these sorts of models and diagrams, I want to make this point clear to the people funding cyber R&D. If you want to improve the ROI of your cyber R&D, make sure you are funding D-P-S projects, not S-P-D research. What does that mean for non-business types? The most important thing cyber security researchers need today is Data demonstrating real Problems. In the current climate, there is an overabundance of money being poured into cyber R&D. I agree with the vast majority of the recommendations given by Sommer and Paxson regarding data, including the recommendation that NIDS researchers secure access to a large production network. Researchers also need to understand the threat environment of that network. I will add that if individual organizations, industries, and governments want to advance current cyber security R&D, the most important thing they can do is provide researchers access to the data demonstrating the biggest problems they are facing, including required context. For more coverage on the topic of sharing attack information with researchers, see my post on how Keeping Targeted Attacks Secret Kills R&D.
On Problem Selection
In my very first blog post, I discussed Developing Relevant Information Security Systems. Some of the ideas presented there apply to the discussion at hand.
Machine Learning as applied to intrusion detection is often considered synonymous with anomaly detection. Even Sommer and Paxson equate the two. Maybe this springs from the classic taxonomy of NIDS that branches at signature matching and anomaly detection. Personally, I question the value of this taxonomy. Certainly NIDS like Bro somewhat break this taxonomy, requiring it to be expanded to at least misuse detection or anomaly detection. Even that division isn’t fully comprehensive. Detecting activity from persistent malicious actors, even if that activity isn’t malicious per se, is an important task of NIDS also, but doesn’t fall cleanly under traditional definitions of either misuse detection or anomaly detection.
Regardless of how you classify your NIDS, I don’t agree with equating machine learning and anomaly detection. Machine learning can be applied to misuse detection, can’t it? While Rohan’s PhD work isn’t fully integrated with any public NIDS, it very well could be. Similarly, anomaly detection systems as discussed in academia often use machine learning to create models for detection, but it’s equally possible for anomaly detection systems to use human expert created thresholds or models.
The biggest problem I have with equating machine learning with anomaly detection is that anomaly detection is largely a nebulous and silly problem. Equating the two trivializes machine learning. It’s pretty easy to identify statistically significant outliers in data sets. The problem is that the designation as anomalous is often rather arbitrary, with most researchers doing little to demonstrate the real world relevancy of any anomalous detections. Furthermore, for all but the most draconian of environments, anomaly detection is silly anyway. Anyone with any operational experience knows that the mind numbingly vast majority of “anomalous” activity is actually benign. Furthermore, highly targeted attacks quite often are, by design, made to blend in with “normal” activity.
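To put rough numbers on why "mostly benign anomalies" hurts so much in operations, here is a small arithmetic sketch in Python. Every figure in it is hypothetical: the event volume, the number of attack events, and the detector's rates are assumptions I picked for illustration, not measurements from any network.

```python
def alert_precision(total_events, true_attacks, detection_rate, false_alarm_rate):
    """Return (alerts raised, fraction of those alerts that are real attacks)."""
    benign_events = total_events - true_attacks
    true_alerts = true_attacks * detection_rate      # attacks flagged as anomalous
    false_alerts = benign_events * false_alarm_rate  # benign activity flagged as anomalous
    alerts = true_alerts + false_alerts
    return alerts, true_alerts / alerts

# Hypothetical day on a network: 1,000,000 events, 10 of them attack related.
# Hypothetical detector: flags 90% of attack events and 1% of benign events.
alerts, precision = alert_precision(1_000_000, 10, 0.90, 0.01)
print(f"alerts raised: {alerts:.0f}")
print(f"share of alerts that are real attacks: {precision:.4%}")
```

Under these made-up numbers the analyst is handed roughly ten thousand alerts, of which fewer than one in a thousand is tied to a real attack. And that assumes the attacks look anomalous in the first place, which targeted attacks are often designed not to.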
Most of the discussion heretofore has been targeted at people making high level decisions about R&D. Now, I’ll provide some more concrete principles that can be applied by people actually implementing machine learning for Incident Detection. They are as follows:
- Use Machine Learning for Complex Relationships
- Serve the Analyst
- Features are the Most Important
- Use the Right Algorithm
Use Machine Learning for Complex Relationships (with Many Variables)
When should you use Machine Learning instead of other traditional approaches such as signature matching or simple thresholds? When you have to combine many variables in a complex manner to provide reliable detections. Why?
Traditional methods work very well for detection mechanisms based on a small number of features. For example, skilled analysts often combine two low fidelity conditions into one high fidelity condition using correlation engines or complex rule definitions. I’ve seen this done manually with three or more variables, but it gets real ugly really quickly as the number of variables increases, especially when each dimension is more complex than a simple binary division.
On the other hand, machines, if properly designed, function very well with high dimensional models. Computers are adept at analyzing complex relationships in n-dimensional space.
Why not use machine learning for low dimensional analysis? Because it’s usually an unnecessary complication. Furthermore, humans are usually more accurate than machines at dealing with the low dimensional case because they are able to add contextual knowledge often not directly derivable from a training set.
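As a rough illustration of the low dimensional case, here is a sketch of an analyst written rule that combines two low fidelity indicators, plus a count of how many combinations an analyst would have to reason about as variables are added. The indicator names (sender_never_seen_before, attachment_is_executable) and the helper names are hypothetical ones I made up for this sketch, not fields from any particular product.

```python
from dataclasses import dataclass
from math import prod

@dataclass
class Email:
    sender_never_seen_before: bool   # hypothetical low fidelity indicator
    attachment_is_executable: bool   # hypothetical low fidelity indicator

def analyst_rule(email):
    """Two weak indicators combined into one stronger condition."""
    return email.sender_never_seen_before and email.attachment_is_executable

def rule_table_size(levels_per_variable):
    """Number of distinct value combinations a hand built rule set must consider."""
    return prod(levels_per_variable)

print(analyst_rule(Email(True, True)))   # both indicators present
print(rule_table_size([2, 2]))           # two binary indicators
print(rule_table_size([2, 2, 2]))        # three binary indicators
print(rule_table_size([5] * 10))         # ten variables with five levels each
```

Two binary indicators give 4 combinations, which a person can hold in their head. Ten variables with five levels each give 9,765,625, which is where hand written rules stop being practical and a learned model starts to earn its keep.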
Serve the Analyst
Any advanced detection mechanism must serve the analyst. It will fail otherwise. By serving the analyst, I mean empowering and magnifying the efforts of the analyst. The human should ultimately be the master of the tool. To me it seems ridiculous, but there are actually people, including a lot of researchers, who believe (or purport to believe) that tools such as IDS should (and can) be made to house all the intelligence of the system and that the role of humans is merely to service and vet alerts. This is so backwards that I can’t even believe some people seriously believe this. It’s sad to see it play out in practice. Much like airport security, which has gotten out of hand with increasingly intrusive screening that provides little to no value, I have to question the motives of the people pushing this mindset. Is it even possible for them to believe this is the right way to go? Are they just ignorant and reckless? Maybe it just comes down to greed or gross self-interest. Regardless of the reason, this mindset is broken.
Toggling back to the positive side, machine learning has great potential to empower analysis. Advanced data mining, including machine learning, should be used not only to aid the analyst in automating detections but also in understanding and visualizing previous attack data so that new detections can be created.
It is vital that the analyst understand how any machine learning mechanisms work under the hood. For example, an expert should understand and review the models generated by the machine so that the expert can provide a sanity check and so that the human can understand the significance of the patterns the machine identifies. One of the coolest parts of Rohan’s PhD thesis is that he uncovered many pertinent patterns in the data, such as the most targeted job classes. In addition, as the accuracy of the classifier begins to wane over time, it is the expert analyst who will be able to recommend the appropriate changes to the system, such as new features to be included in analysis.
Part of empowering the analysts is giving the analyst the data needed to understand any alerts or detections. Any alert should be accompanied with a method of determining what activity triggered the alert and why the activity is thought to be malicious. Many machine learning mechanisms fail because they don’t do this well. They will tell an operator that they think something may be bad, but can’t or won’t tell the operator why, let alone providing sufficient context, making the operator’s job of vetting the alert that much harder. Incidentally, if the machine learning based detection mechanism provides adequate context, it lowers the cost and pain of validating false positives, lessening their adverse impact on operations.
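To make the idea of an alert that carries its own explanation a bit more concrete, here is a minimal sketch of what such an alert record could look like. The class names, field names, and the sample values are all hypothetical; this is one possible shape under my own assumptions, not the format of any real IDS.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Evidence:
    feature: str   # name of the feature that contributed to the detection
    value: str     # value observed in the triggering activity
    reason: str    # why this value is considered suspicious

@dataclass
class Alert:
    summary: str
    score: float        # classifier confidence between 0.0 and 1.0
    raw_data_ref: str   # where the analyst can pull the triggering activity
    evidence: List[Evidence] = field(default_factory=list)

    def render(self) -> str:
        """Format the alert so the 'what' and the 'why' travel together."""
        lines = [f"{self.summary} (score {self.score:.2f})",
                 f"  raw data: {self.raw_data_ref}"]
        for e in self.evidence:
            lines.append(f"  {e.feature} = {e.value}: {e.reason}")
        return "\n".join(lines)

# Hypothetical example values, for illustration only.
alert = Alert(
    summary="Possible targeted malicious email",
    score=0.87,
    raw_data_ref="mail-archive/message-0001",
    evidence=[
        Evidence("recipient_job_class", "executive", "frequently targeted in past attacks"),
        Evidence("attachment_type", "pdf", "matches attachment type of prior attacks"),
    ],
)
print(alert.render())
```

The design choice that matters here is that the evidence list and the pointer to raw data are part of the alert itself, so the operator doesn’t have to reverse engineer the detection before vetting it.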
For an advanced detection mechanism to have success in an operational environment, it must be made with the goal of serving the expert analyst. I believe much of the “Semantic Gap” described by Sommer and Paxson arises from ignoring this principle.
Features are the Most Important
The most important thing to consider when applying machine learning to computer security is feature selection. Remember the 2007 financial system meltdown? The author of much of the software that “facilitated” the meltdown wrote an article describing his work and how it was abused by reckless investment banks. Glossing over the details (which are very different), the high level misuse case is often the same as cases of abuse of machine learning: People hope that by putting low value meat scraps into some abstract and complicated meat grinder of a machine they get some output that is better than the ingredients put in. It’s a very appealing idea. If one can turn things you don’t want to look at into hot dogs or sausage by running them through a meat grinder, why can’t we turn them into steak with a really big, complex meat grinder? Machine learning mechanisms can be very good at targeting specific and complex patterns in data, but at the end of the day, GIGO still applies.
Expressiveness of Features
The most important part of using machine learning for IDS is to ensure that the machine is trained with features that expose attributes that are useful for discriminating individual observations. A classic example from the world of NIDS is the inadequacy of network monitoring tools that operate at layer 3 or layer 4 to detect layer 7 (or deeper) attacks. When I get on the network payload analysis soapbox (which I often do) one of my favorite examples is as follows:
Imagine you have an open email relay that sends your organization two emails. Both are about the same size, both contain an attachment of the same type, and both contain content relevant to your organization. One is a highly targeted malicious email, the other is benign.
Can you discriminate between the two based on netflow? Not a chance. There is nothing about the layer 3 or layer 4 data that is malicious. Remember, the malicious content is the attachment, not anything done at the network layer by the unwitting relay. It doesn’t matter how many features you extract from netflow or how much you process it, you’re not going to be able to make a meaningful and reliable differentiation.
It’s crucial when using machine learning as a detection mechanism that you have some level of confidence that the features can actually be used to draw meaningful conclusions. The straightforward way to do this is to have analysts identify low fidelity indicators that when combined in complex ways, will yield meaningful results. Sure, some data mining may be involved here, and the process may be iterative, but you’ve got to have expressive and meaningful features. In my estimation, the biggest contribution Rohan makes with his study is demonstrating the value of features that most other mechanisms ignore (and incidentally, are harder for attackers to change).
Disparate Data Sources as a Red Herring
One claim made in support of machine learning is that with machine learning, you can correlate disparate data sources. This is really a red herring. You don’t necessarily need machine learning to do this. I’ve seen traditional SIMS, processing a wide variety of data feeds, used to make really impressive detections based on analyst crafted rules that aren’t particularly complex, in and of themselves, but which require a lot of work and technological horsepower behind the scenes because they leverage data from multiple sources. Sure, machine learning facilitates use of complex relationships in data, but those relationships don’t necessarily have to be from disparate data sources.
That being said, machine learning can be wildly successful at leveraging complex relationships within disparate data sources. Rohan’s PhD work demonstrates this fabulously. One temptation, however, is to try to unnaturally “enrich” data, often consisting of inadequate features to begin with, by joining yet other features. The hope is to improve the quality of the models generated. This is all fine and well if the data joined provides some utility in classification. Also, for most machine learning techniques, if all the classes in the training data set are adequately represented and the training set has adequate entropy, no serious harm can be done by joining features with no value in improving classification. However, if some classes are under-represented (as is often the case with the “bad” examples) or if the training data doesn’t have adequate entropy (as is often the case with artificial data), “enriching” data with other data sources can incorrectly improve measures of statistical significance and performance of the machine learner in a way that wouldn’t apply to real world data. Returning to our example of the email which can’t be detected with netflow data, let’s assume the benign email is sent by the relay with an ephemeral source port of 36865 and the malicious email is sent with a source port of 36866. Now let’s say that the researcher wants to “enrich” his data by adding all sorts of lookups based on the layer 3 and layer 4 parameters such as geoip lookups, etc. If the researcher joins IANA assigned port numbers into the mix, the machine’s model will discover that the benign email was sent with a source port of “kastenxpipe” and the malicious email has a source port of “unassigned”. The spurious conclusion is clear: malicious emails sent through ignorant relays originate from “unassigned” source ports. This example is contrived, but this sort of thing actually occurs.
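The contrived port example can be played out in a few lines of Python. This is only a sketch under stated assumptions: the port lookup is a stub holding just the one service name mentioned above (every other port is treated as “unassigned”), the size values are made up, and the “learner” is a toy single feature rule of my own, standing in for whatever real algorithm a researcher might use.

```python
from collections import Counter, defaultdict

# Stub lookup with only the one service name from the example above;
# treating every other port as "unassigned" is an assumption of this sketch.
PORT_NAMES = {36865: "kastenxpipe"}

def enrich(email):
    """Join the service name for the source port onto an observation."""
    enriched = dict(email)
    enriched["src_port_name"] = PORT_NAMES.get(email["src_port"], "unassigned")
    return enriched

def train_one_rule(rows, feature, label="label"):
    """Toy learner: map each value of one feature to its majority label."""
    votes = defaultdict(Counter)
    for row in rows:
        votes[row[feature]][row[label]] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in votes.items()}

def accuracy(rule, rows, feature, label="label"):
    """Fraction of rows whose label matches what the rule predicts."""
    hits = sum(rule.get(row[feature]) == row[label] for row in rows)
    return hits / len(rows)

# The two emails from the example: same (hypothetical) size, same attachment
# type, differing only in the ephemeral source port picked by the relay.
training = [enrich(e) for e in (
    {"size_kb": 120, "attachment": "pdf", "src_port": 36865, "label": "benign"},
    {"size_kb": 120, "attachment": "pdf", "src_port": 36866, "label": "malicious"},
)]

rule = train_one_rule(training, "src_port_name")
print(rule)
print("training accuracy:", accuracy(rule, training, "src_port_name"))

# A later benign email that happens to draw the other ephemeral port.
new_benign = [enrich({"size_kb": 95, "attachment": "pdf",
                      "src_port": 36866, "label": "benign"})]
print("accuracy on new benign email:", accuracy(rule, new_benign, "src_port_name"))
```

With one example per class, the joined feature separates the training data perfectly (training accuracy 1.0), and the resulting rule then mislabels the very next benign email that comes from an “unassigned” port (accuracy 0.0). The enrichment added a column, not information.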
By far the most important thing to get right when applying machine learning to the field of incident detection is operating on meaningful features.
Use the Right Algorithm (but don’t fret about it)
One aspect of applying machine learning to incident detection is choosing the right algorithm. This is also the one aspect that is usually belabored the most in academia, especially in research that is farthest from being applicable to real world problems. There are a lot of religious battles that go on in this realm also. However, very little of this provides real world value.
My suggestion is to choose the algorithm or one of the set of algorithms that makes sense for your data and how your system is going to operate. Don’t fret too much about it. I think of this selection much like choosing a cryptographic algorithm. The primary factor in doing this is choosing the type of cryptographic function: hash, digital signature, block cipher, complete secure channel, etc. To a large degree, it probably doesn’t matter if you choose SSL, SSH, or IPSEC for use as a secure channel. Sure, some small factors or even external factors may make one slightly more desirable, but at the end of the day, any from the palette of choices will likely provide you an adequately secure channel, all other things being equal.
Also, similar to making choices for crypto systems, you should avoid inventing or rolling your own unless you have a compelling reason to do so and you know what you are doing. All too often, I see exotic and home-grown machine learning techniques applied to information security. Often I see ROC charts, figures on performance, and other convoluted diagrams justifying these sorts of things. Just like with crypto, I think it’s appropriate to hold researchers to a high burden of proof to demonstrate the real world benefit of any “bleeding edge” machine learning mechanisms being applied to incident detection.
Again, Rohan’s PhD work is exemplary of the principles I’m trying to express. He chose a machine learning mechanism that fit his data and use cases well. While he did spend a fair amount of time and effort trying to tweak the classifier (see cost sensitive stuff), this had marginal benefit. He provides few suggestions for future work in improving the machine learning mechanisms. However, he recommends, and I agree with his recommendation, that the overall system could be improved by exposing more relevant features (such as file attachment metadata) and tightening outcome classes by separating the “bad” in classification into multiple groupings based on similarity of attacks.
With that high level principle out of the way, I’ll say a little about specific classes of mechanisms or specific algorithms. In doing so I’ll express a few biases and religious beliefs that aren’t backed with the same level of objectivity contained in the rest of this essay.
I love Random Forests. Lots of other people do too. Random Forests works well with numerical data as well as other data types like categorical data. While Random Forests may not be the simplest example, tree based classification mechanisms are very easy to understand and, once a classifier is trained, insanely efficient at classifying new observations. The algorithm takes care of identifying variable importance and tuning the classifier accordingly. Many other mechanisms can only do part or all of this, require a large amount of manual tuning, require manual data normalization, etc. Random Forests is easy and works very well in many situations.
Text Based Mechanisms
Text based mechanisms are all the rage. They are awesome for helping make sense of human to human communication. For example, Bayesian algorithms used in SPAM filtering mechanisms are actually rather effective at identifying and filtering high fidelity SPAM based on the text intended for human consumption. Document clustering mechanisms are very effective at weeding through large corpuses of documents, identifying those about similar topics. There is a huge amount of contemporary research on and new whiz bang mechanisms related to text mining, natural language processing, etc.
For the part of information assurance that requires operating on human to human communication, text based machine learning mechanisms hold high potential. However, most communication of interest in incident detection isn’t human to human, but is computer to computer. A large portion of computer to computer communication is done through exchange of numerical data. However, it is somewhat humorous to see researchers attempt to apply text classification mechanisms to predominately numerical data, such as network sensor data. While there may be legitimate reasons to do this, I see these efforts with the same cynical doubts concerning longevity with which I regard efforts to vectorize logical problems into problems suitable for floating point operations so GPUs can be leveraged.
R: Freedom in Stats and Machine Learning
One tool that I have to give a quick shout out to is R. Many people call R the free version of S (S is a popular stats tool), just like people say Linux is the free version of Unix. It’s a pretty close analogy. R is not only free as in beer, but is very free as in speech. There’s a huge and growing community supporting it. People who like Linux, Perl, and the CLI will love R. One thing I like about R is that everything you do is done via commands. Those commands are stored in a history, just like bash. If you want to automate something you’ve done manually, all you do is turn your R history into an R script. It’s easy to process stats, create graphs, or run machine learning algorithms without ever touching a GUI. It is much like LaTeX in that it has a steep learning curve, but people who master it are usually happy with the things they can do with it.
I hope that in the future there will be a greater measure of success in applying machine learning to incident detection. I hope those funding and directing research will help ensure a greater measure of relevancy by providing researchers with the data and problems necessary to conduct relevant research. I also hope that the principles I’ve laid out will be useful for people other than myself in helping to guide research in the future.