Saturday, October 27, 2012

PDFrate Update: API and Community Classifier

I am very pleased with the activity on pdfrate.com in the last few weeks. There have been a good number of visitors and some really good submissions. I’m really impressed by the number of targeted PDFs that were submitted and I’m happy with PDFrate’s ability to classify them. I really appreciate those who have taken the time to label their submissions (assuming they know whether they are malicious or not) so that the service can be improved through better training data.

There is now an API for retrieval of scan results. See the API section for more details, but as an example, you can view the report (JSON) for the Yara 1.6 manual.

This API may be unconventional, but I do like how easy it is to get scan results. You submit a file and get the JSON object back synchronously. I’ve split the metadata out from the scan results for a couple of reasons. First, the metadata can be very large. Second, the metadata is currently presented as a text blob, and I wasn’t sure how people would want it stuffed into JSON. If you want both, you have to make two requests. You can also view the metadata blob for the Yara 1.6 manual.
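
To illustrate the flow just described, here is a rough sketch of a scripted submission using Python's requests library. The endpoint paths and JSON field names below are my assumptions for illustration, not documented parts of the API; consult the API section on pdfrate.com for the real details.

import requests

# Submit a file; the scan report comes back synchronously as JSON.
# Note: the URL paths and the "fileid" key are assumptions, not the documented API.
with open("yara-1.6-manual.pdf", "rb") as f:
    report = requests.post("https://pdfrate.com/submitfile", files={"file": f}).json()
print(report)

# The metadata is a separate (large, plain text) blob, so it takes a second request.
metadata = requests.get("https://pdfrate.com/view_metadata/" + str(report["fileid"]))
print(metadata.text)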

I’m happy that there have already been enough submissions, including ones that weren’t classified well by the existing data sets, that I’ve generated a community classifier based on PDFrate.com user submissions and voting. I’m thrilled that there were submissions matching categories of malicious PDFs that I know are floating around but simply aren’t in the existing data sets. I expect that if the current submission rate stays the same or goes up, the community classifier will become the most accurate classifier, because it will contain fresher and more relevant training data. Again, as an example, you can check out the report for the Yara 1.6 manual which now includes a score from the community classifier.

If a submission had votes before Oct 25th, it was included in the community classifier. Some users will note that even though they themselves did not vote on their submissions, they have votes. I reviewed many interesting submissions and placed votes on them so that they could be included in the community classifier. I decided not to do a bulk rescan of all documents already submitted. This wasn't for technical reasons; the ratings are based solely on the previously extracted metadata and as such are very fast. Rather, I didn’t want to provide potentially deceptive results to users. If a document is in the training set, it is generally considered an unfair test to use the resulting classifier on it, as the classifier will almost always provide good results. Regardless, if you want to have a submission re-scanned, just submit the file over again.

Again, I’m pleased with PDFrate so far. I hope this service continues to improve and that it provides value to the community.

Saturday, September 15, 2012

Announcing PDFrate Public Service

I’m excited to announce PDFrate: a website that provides malicious document identification using machine learning based on metadata and structural features. The gory details of the underlying mechanisms will be presented at ACSAC 2012.

I’ve been working on this research since 2009, a year in which the stream of PDF 0-days being leveraged by targeted attackers was nearly unbroken. I’ve refined the underlying techniques to a place where they are very effective in real operations and are addressed rigorously enough for academic acceptance. Note that I originally designed this for the purpose of detecting APT malicious documents but have found it to be largely effective on broad based crimeware PDFs also. Furthermore, it is pretty effective at distinguishing between the two. I can speak from personal experience that the mechanisms underlying PDFrate provide a strong complement to signature and dynamic analysis detection mechanisms.

Those who are interested should head over to the pdfrate site and check out the “about” page in particular, which explains the mechanisms and points to some good examples.

PDFrate demonstrates a well refined mechanism for detecting malicious documents. It currently operates on PDF documents, and I am close to extending it to office documents. But I see this paradigm extending much farther than just malicious documents. I see wise (and deep) selection of features and machine learning being effective for many other things, such as emails, network transactions such as HTTP, web pages, and other file formats such as SWF and JAR.

I’m happy to provide the PDFrate service to the community so that others can leverage (and critique) this mechanism. Providing this as a service is a really good way for others to be able to use it because it removes a lot of the difficulty of implementation and configuration, the hardest part of which is collecting and labeling a training set. High quality training data is critical for high quality classification and this data is often hard for a single organization/individual to compile. While the current data sets/classifiers provided on the site are fine for detecting similar attacks, there is room for improvement and generalization which I hope will come from community submissions and ratings. So please vote on submissions, malicious or not, as this will speed the development and evolution of a community driven classifier. This service could benefit from some additional recent targeted PDFs.

In addition to the classification that PDFrate provides, it also provides one of the best document metadata extraction capabilities that I’ve seen. While there are many tools for PDF analysis, the metadata and structure extraction capabilities used by PDFrate provide a great mix of speed, simplicity, robustness, saliency, and transparency. Even if you aren’t sold on using PDFrate for classification, you might see if you like the metadata it provides. Again, the about page provides illustrative examples.

I hope this service is useful to the community. I look forward to describing it in depth in December at ACSAC!

Saturday, July 28, 2012

Security Vanity: Public Key Crypto for Secure Boot

For the last few months there’s been a bit of chatter about the restrictions Microsoft will be imposing on hardware vendors, requiring them to implement a specific flavor of “secure boot” where the hardware verifies the signature of the bootloader. The claim is that signing code from the hardware up will improve security, especially helping defeat boot loader viruses. This mechanism obviously makes installing other operating systems more difficult. In the case of ARM processors, there are further restrictions, basically preventing manual circumvention.

The responses from the organizations interested in other operating systems and consumer interests have been mixed. Some, like Fedora and Ubuntu, have drafted courses that seem surprisingly compliant, while others, like FSF, don’t seem to be so keen on the idea. Linus provided a fairly apathetic response noting that the proposed method for secure boot doesn’t help security much but probably doesn’t hurt usability that much either. Most of these responses make sense considering the interests they represent.

It may be fine for non-security folk to trust the advice of security professionals when we say that something is necessary for “security”. However, we do ourselves and society a great disservice when we allow security to be used as a cover for actions based in ulterior motives or when we support ineffective controls. Whether you lean toward conspiracy theory or toward incompetence being more likely than malice, allowing the cause of security to be used recklessly hurts. It hurts our reputation and ability to enact sensible security measures in the future. It hurts the very cause we are professing to support, which for me is securing liberty. For example, I believe it will take years for our society to recognize how harmful the security theater of airport screening is. There has been no coherent demonstration of any significant effect in preventing terrorism, but our efforts have been fabulously effective at magnifying terror.

We have to call out security vanity when we see it. The use of public key crypto (without much of the complementary infrastructure) in secure boot needs to be questioned. Code signing, and digital signatures in general, can add a high degree of trust to computing. However, experience has shown that PKI, and crypto in general, gets busted over time, especially through key disclosure. It only takes one disclosure to cause a serious issue, especially if robust key management regimes are not in place. By design, there will be no practical way to keep this PKI up to date without significant effort on the part of the user. No removal of broken root keys, no certificate revocations, etc. So the types of mechanisms used for keeping current code signing, SSL, etc. up to date can’t apply here. Note that all of these key management mechanisms exist by necessity: revocations of individual keys/certificates occur in the wild, even if you exclude really spooky attacks like Stuxnet or the RSA breach. The whole class of things designed to restrict how users use their devices, things like DVD’s CSS and HDCP, illustrates an important principle. It’s not a question of whether these mechanisms can be defeated; it’s a question of whether anyone cares to do it and how long it will take. It doesn’t matter how good the crypto algorithms are in a PKI based secure boot system; if proper key management can’t occur, the system is critically flawed.

If someone is pushing a defective by design security mechanism, are people justified in questioning their motives? I believe so. The purpose of this post isn’t to rant about anti-competitive practices, anti-consumer restrictions on hardware devices, use of “digital locks” laws to inhibit otherwise legal activity, etc. It is to point out that the use of PKI based secure boot is either based in motives other than security or it is based in bad engineering.

Let’s give Microsoft the benefit of the doubt. Let’s say that boot loader viruses are an issue that needs to be addressed, or at least something that could become an issue during the life of the systems that will be shipping in the near future. The question at this point is, why use PKI? Are there alternatives? Granted, there aren’t too many options for trust at this level, but I believe there are options. For example, I’ve long thought it would be useful for both desktop and mobile systems to have a small portion of storage which is writable only with a physical switch that would typically be in the read only position. This sort of thing could be really useful for things like bootloaders, which are updated relatively infrequently. You can criticize this approach, and debate whether it’s more likely for a lay person to be able to make the correct decision about when to press the button to make their secure storage area writable or for them to do manual key management, but the point is that there are alternatives. Note that I’ve intentionally ignored the stiffer requirements for ARM devices.

Using public key crypto for secure boot where proper key management cannot occur is a horrible idea. It will not provide security against determined threats over time. It is incumbent on the security community to clarify this, so that an effective solution can be implemented or the true motives can be discussed. The security community cannot tacitly be party to security vanity.

Saturday, May 19, 2012

NIDS need an Interface for Object Analysis

The information security community needs NIDS projects to provide an interface for analysis of objects transferred through the network. Preferably, this interface would be standardized across NIDS. Here I will provide my rationale for this assertion and some opinions on what that should look like.

Why

Security practitioners have benefited from project/vendor agnostic interfaces such as Milter, ICAP, etc in their respective realms. These interfaces provide the operator with compatibility, and therefore, flexibility. In this paradigm, the operator can choose, for example, a single mail server and then independently choose one or more mail filters. Decoupling the two allows the operator to focus selection of each product on how well it does its primary job, mail transfer and mail filtering in our example. This is a classic example of the Unix philosophy of doing one thing well.
The NIDS world needs the same thing. We need a standardized interface that allows NIDS to focus on analyzing network traffic and allows a quasi-independent client object analysis system to focus on analyzing objects transferred through the network.
In the past it has been hard for me to understand why NIDS haven’t followed the huge shift up the stack from exploits that occur at the network layer (ex. Sasser worm or WuFTD exploits) to exploits that occur in client objects (ex. PDF exploits or Mac Flashback Trojan). I’ve heard many people say NIDS are for detecting network exploits and if you want to detect client exploits, use something on the client. I would note that this mentality has been an excuse for some NIDS to not do full protocol decoding (ex. base64 decoding of MIME, gzip decoding of HTTP, etc). This has left network defenders with the motive (detect badness as soon as possible, preferably before it gets to end hosts) and the opportunity (NIDS see the packets bearing badness into the network), but without the means (NIDS blind to many forms of client attacks).
To be fair to the various NIDS projects, every relevant NIDS is taking steps to support some amount of client payload analysis. Actually, it seems like it is a relatively high priority for many of them. I know the Bro guys are diligently working on a file analysis framework for Bro 2.1. Furthermore, there are some very valid reasons why it’s in the security community’s best interest to abstract NIDS from detecting client object malfeasance, directly that is. First, back to the Unix philosophy mentioned above, we want our NIDS to be good at network protocol analysis. I can think of plenty of room for improvement in NIDS while remaining focused on network protocol analysis. Second, and central to the argument made here, we don’t want to fragment our client payload analysis just because of differences in where files are found (ex. network vs. host). To illustrate this last point, think about AV. Imagine if you had to have a separate AV engine, signature language, etc just because of where the object came from even though you are looking at the exact same objects. This, again, is one of the main reasons Milter, ICAP, etc exist. I cite VRT’s Razorback as a great example of this. From the beginning it supported inputs from various sources so the analysis performed could be uniform, regardless of the original location of the objects. Lastly, one could argue that network protocol analysis and client object analysis are different enough that they require frameworks that look different because they tackle different problems. I’ve built both NIDS and file analysis systems. While they do similar things, their goals, and therefore their engineering priorities, are largely different. Anecdotally, I see NIDS focusing on reassembly and object analyzers focusing on decoding, NIDS caring about timing of events while file analyzers largely consider an object the same independent of time, etc.
What I’m arguing is that NIDS should be separated or abstracted from client file analysis frameworks. This certainly doesn’t mean a NIDS vendor/project can’t also have a complementary file analysis framework. Again, note Razorback and its association with Snort. What I would like to see, however, is some sort of standard interface used between the NIDS and the file analysis framework. Why can’t I use Razorback with Bro or Suricata? Sure, I could with a fair amount of coding, but it could and should be easier than that. Instead of hacking the NIDS itself, I should be able to write some glue code that would be roughly equivalent to the glue required to create a milter or ICAP for the file analysis service. Furthermore, this interface needs to move beyond the status quo (ex. dump all files in a directory) to enable key features such as scan results/object metadata being provided back to the NIDS engine.
Lastly, the existence of a standard interface would make it easier for users to demand client object analysis functionality for things flowing through the network while making it easier for NIDS developers to deliver this capability.
The security community needs a cross-NIDS standard for providing client objects to an external analysis framework.

What

The following are my ideas on what this interface between a NIDS and a file analysis framework should look like.

Division of Labor

It is probably best to start by defining more clearly what should go through this interface for embedded object analysis. When I say client objects I mean basically anything transferred through the network that makes sense as a file when divorced from the network. My opinion is that anything that is predominantly content transferred through the network is a candidate for analysis external to the NIDS. Canonical examples of what I mean include PDF documents, HTML files, Zip files, etc. Anything that is predominantly overhead used to transfer data through the network should be analyzed by the NIDS. Canonical examples include packet headers, HTTP protocol data, etc.
Unfortunately, the line gets a little blurry when you consider things like emails, certificates, and long text blobs like directory listings. I don’t see this ambiguity as a major issue. In an ideal world, the NIDS framework would support external analysis of objects even if it can do some analysis internally also.
I will discuss this in greater detail later, but the NIDS also has the responsibility for any network layer defragmentation, decoding, etc.

An API, not a Protocol

Fundamentally, I see this as an API backed by a library, not a network protocol (like ICAP, Milter, etc). In the vast majority of cases, I see this library providing communication between a NIDS and a client object analyzer on a single system with mechanisms that support low latency and high performance. In many cases, I expect the file analysis may well go and do other things with the object, such as transfer it over the network or write it to disk. The type of object analysis frameworks I’m envisioning would likely have a small stub that would talk to the NIDS interface API and to the object analysis framework’s submission system. Using Razorback’s parlance, a relatively small “Collector” would use this API to get objects from the NIDS and send them into the framework for analysis.
I see the various components looking roughly as follows:

+------+  objects and metadata   +------+  /------>
|      | ---> +-----------+ ---> | File | /  Object
| NIDS |      | Interface |      | Anal\| -------->
|      | <--- +-----------+ <--- | ysis | \ Routing
+------+   response (optional)   +------+  \------>
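
To make the diagram a little more concrete, below is a minimal sketch of what the interface and a collector stub could look like. This is illustrative code of my own devising; every name in it is hypothetical, and a real implementation would be a C library working on NIDS-owned buffers for the performance reasons discussed below.

# Hypothetical sketch only: none of these names come from an existing project.
# A real implementation would be a C library operating on NIDS-owned buffers.
from typing import Callable, Dict, Optional


class Verdict:
    """Optional feedback from the object analysis framework back to the NIDS."""
    def __init__(self, disposition: str, detail: str = "",
                 metadata: Optional[Dict[str, str]] = None):
        self.disposition = disposition   # e.g. "good" or "bad"
        self.detail = detail             # e.g. a signature name
        self.metadata = metadata or {}   # e.g. document author, PE compile time


class ObjectInterface:
    """Glue between the NIDS and one or more collector stubs."""
    def __init__(self):
        self.collectors = []             # callables taking (payload, metadata)

    def register_collector(
            self, collector: Callable[[bytes, Dict[str, str]], Optional[Verdict]]):
        self.collectors.append(collector)

    def submit(self, payload: bytes, metadata: Dict[str, str]) -> Optional[Verdict]:
        """Called by the NIDS once it has a reassembled, decoded object."""
        for collector in self.collectors:
            verdict = collector(payload, metadata)
            if verdict is not None:      # feedback is optional
                return verdict
        return None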

From the NIDS to the Object Framework

The primary thing the NIDS has to do is give an object to the external framework. This is fundamentally going to be a stream of bytes. It is most easily thought of as a file. The biggest question to me is who owns the memory and whether there is a buffer copy involved. My opinion is that to make performance reasonable you have to be able to do some sort of analysis in the framework without a buffer copy. You need to be able to run the interface, and at least some simple object analyzers, on a buffer owned by the NIDS, probably running in the NIDS’ process space. Obviously, if you need to do extensive analysis, you’ll probably want it to be truly external. That’s fine; the framework can let you have the buffer and you can make a copy of it, etc.
The other thing that the NIDS needs to provide to an external analyzer is the metadata associated with the object. At a minimum this needs to include enough to trace the object back to the network transaction (ex. IPs, ports, timestamp). Ideally this would include much more: things that are important for understanding the context of the object. For example, I can imagine many of the HTTP headers, both request and response, being useful or even necessary for analysis. The following are metadata items I’d like to see from HTTP for external analysis: Method, Resource, Response Code, Referer, Content-Type, Last-Modified, User-Agent, and Server.
Delving into details, I would think it prudent for the interface between the NIDS and the external framework to define the format for data to be transferred without specifying exactly what that data is. The interface should say something like “use HTTP/SMTP style headers, YAML, or XML of this general format” and define some general conventions without trying to define this exhaustively or make it too complicated. In most cases, it seems that identifiers can be a direct reflection of the underlying protocol.
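
As a concrete (and purely illustrative) example, the metadata handed across the interface for an object extracted from an HTTP response might look something like the following; the keys simply mirror the protocol fields named above, and all of the values are made up.

# Illustrative metadata for an object extracted from an HTTP response.
# The identifiers mirror the underlying protocol, as suggested above.
http_object_metadata = {
    "src_ip": "10.1.2.3", "src_port": "49152",
    "dst_ip": "192.0.2.10", "dst_port": "80",
    "timestamp": "2012-05-19T14:03:22Z",
    "Method": "GET",
    "Resource": "/files/report.pdf",
    "Response-Code": "200",
    "Referer": "http://www.example.com/",
    "Content-Type": "application/pdf",
    "Last-Modified": "Fri, 18 May 2012 21:10:01 GMT",
    "User-Agent": "Mozilla/5.0",
    "Server": "Apache",
}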

From the Object Framework back to the NIDS

One important capability that this interface should enable is feedback from the external analysis framework back into the NIDS. I’d really like to see NIDS not only pass objects out for analysis but also accept data back; I think it’s crucial if they want to avoid marginalization. I see two important things coming back: a scan result and object metadata. In its simplest form the scan result is something like “bad” or “good” and could probably be represented with 1 bit. In reality, I can see a little more expressiveness being desired, like why the object is considered bad (signature name, etc). In addition to descriptions of badness or goodness, I see great value in the external framework being able to respond back with metadata from the object (not necessarily an indication of good or bad). This could be all sorts of relevant data including the document author, PE compile time, video resolution, etc. This data can be audited immediately through policies in your NIDS (or SIMS) and can be saved for network forensics.
Of course, this feedback from the external object analysis system needs to be optional. I can think of perfectly good reasons to pass objects out of the NIDS without expecting any feedback. Basically all NIDS to external analysis connections are a one way street today. However, as this paradigm advances, it will be important to have this bi-directional capability. I can also think of plenty of things that a NIDS could do if it received feedback from the external analyzer. Having this capability, even if it is optional, is critical to developing this paradigm.

Timing and Synchronization

An important consideration is the level of synchronization between the NIDS and the external object analyzer. I believe it would be possible for an interface to support a wide range, from basically no synchronization to low latency synchronization suitable for IPS. Obviously, the fire and forget paradigm is easy, and to support this paradigm nicely you could make feedback from the analyzer optional and make whatever scanning occurs asynchronous. However, an external analysis interface that expects immediate feedback can easily be made asynchronous by simply copying the inputs, returning a generic response, and then proceeding to do analysis. On the other hand, morphing an asynchronous interface into a synchronous one can be difficult. For that reason, it would be good for this to be built in from the beginning, even if it is optional.
Can this sort of interface be practical for IPS/inline or is it destined to be a technology used only passively? My answer is an emphatic yes. I challenge NIDS to find a way to allow external analysis whose latency is low enough to be reasonable for use inline. I’ll share my thoughts on payload fragmentation below, but even if objects can’t be analyzed until the NIDS has seen the whole object, the NIDS still has recourses if it gets an answer back from the external analyzer quickly enough. It would also be nice if NIDS started to support actions based on relatively high latency responses from an external analysis framework even if the payload in question has already been transferred through the network. Options for actions include things like passive alerting, adding items to black lists (IP, URL, etc) that can be applied in real time, and addition of signatures for specific payloads or hashes of specific payloads. It seems inevitable that the interface between the NIDS and the external analyzer will include some sort of configurable timeout capability to protect the NIDS from blocking too long.

De-fragmentation and Decoding:

I envision a situation where the NIDS is responsible for network layer normalization and the file analysis framework does anything necessary for specific file formats. The NIDS is responsible for all network protocol decoding such as gzip and chunked encoding of HTTP. However, things like MIME decoding can be done internally to the NIDS or externally, depending on whether it is desired to pass out complete emails or individual attachments.
The NIDS should be responsible for any network layer reassembly/de-fragmentation required. Ex. IPfrag, TCP, etc. I think some exceptions to this are reasonable. First of all, it makes a lot of sense for the interface to allow payloads to be passed externally in sequential pieces, with the responsibility of the NIDS to do all the re-ordering, segment overlap deconfliction, etc. necessary. This would support important optimizations on both sides of the interface. It should also be considered acceptable for the NIDS to punt in situations where the application, not the network stack, fragments the payload but that fragmentation is visible in the network application layer protocol. For example, it would be desirable for the NIDS to reassemble HTTP 206 fragments, if possible, but I can see the argument that this reassembly can be pushed on the external analyzer, especially considering that there are situations where completely reassembling the payload object simply isn’t possible.
It should be clear that any sort of defragmentation required for analyzing files, such as javascript includes or multi-file archives, is the responsibility of the object analysis framework.

Filtering

To be effective, any system of this sort should support filtering of objects before passing them out for external analysis. While there is always room for improvement, I think most NIDS are already addressing this in some manner. The external file analysis framework probably needs to support some filtering internally, in order to route files to the appropriate scanning modules. Razorback is a good example here.
It would be possible for the interface between the two to support some sort of sampling of payloads whereby the external analyzer is given the first segment of a payload object and given the option to analyze or ignore the rest of it. I consider this an unnecessary complication. I think some would desire, and it could be natural, for the interface to support feedback every time the NIDS provides payload segments (assuming payloads are provided in fragments) to the external analyzer. I personally don’t see this as a top priority because unless you are doing the most superficial analysis or are operating on extremely simple objects, the external analysis framework will not be able to provide reliable feedback until the full object has been seen. While both the NIDS and the external analysis framework will likely want to implement object filtering, there is no strong driver for this filtering to be built directly into the interface and this is probably just extra baggage.

Use Cases

To make sure what I’m describing is clear, I offer up four use cases that cover most of the critical requirements. These use cases should be easy to implement in practice and would likely be useful to the community.
Use Case 1: External Signature Matching
Use something like yara to do external signature matching. Yara provides greater expressiveness in signatures than most NIDS natively support, and many organizations have a large corpus of signatures for client objects like PDF documents, PE executables, etc. If not yara, then another signature matching solution, like one of the many AV libraries (say clam), would be an acceptable test. The flags/signature names should be returned to the NIDS. This should be fast enough that it can be done synchronously.
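
As a rough sketch of the analyzer side of this use case (the post envisions tighter, library-level integration; the rule and file names here are placeholders), the matching step with yara-python might look like this:

# Sketch of Use Case 1 using yara-python; the rule and object file are placeholders.
import yara

rules = yara.compile(source="""
rule suspicious_pdf_launch
{
    strings:
        $a = "/Launch"
    condition:
        $a
}
""")

def scan_object(payload: bytes):
    """Return the matching signature names, to be handed back to the NIDS."""
    return [match.rule for match in rules.match(data=payload)]

with open("extracted_object.pdf", "rb") as f:
    print(scan_object(f.read()))
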
Use Case 2: External Object Analysis Framework
Use something like Razorback to do flexible and iterative file decoding and analysis. Since Razorback is purely passive and intentionally is not bound by tight real time constraints, no feedback is sent back into the NIDS and the scanning is done asynchronously. All alerting is done by Razorback, independent of the NIDS. For extra credit, splice in your interface between Snort as a Collector (used to collect objects to send into Razorback) and Razorback.
Use Case 3: External File Metadata Extraction
Use something to do client file metadata extraction. Feed this metadata back into the NIDS to be incorporated into the signature/alerting/auditing/logging framework. Unfortunately, I don’t have any really good recommendations of specific metadata extractors to use. One could use libmagic or something similar, but some NIDS already incorporate this or something like it to do filtering of objects, etc. Magic number inspection doesn’t quite get to the level of file metadata I’m looking for anyway. There are a few file metadata extraction libraries out there, but I can’t personally recommend any of them for this application. As such, I’ve written a simple C function/program, docauth.c, that does simple document author extraction from a few of the most commonly used (and pwnd) document formats. This should be easy enough to integrate with the type of interface I envision. In this case, you would be able to provide file metadata back into the NIDS. The NIDS could incorporate this metadata into its detection engine. This should be fast enough to be done synchronously.
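
As a very rough analogue of that idea (this is not docauth.c, just a naive illustration of pulling an author string out of a raw PDF buffer so it could be fed back as metadata):

# Naive illustration only: grab /Author from a raw PDF buffer with a regex.
# Real documents (string encodings, object streams, encryption) need much more care.
import re

AUTHOR_RE = re.compile(rb"/Author\s*\(([^)]*)\)")

def pdf_author(payload: bytes) -> str:
    match = AUTHOR_RE.search(payload)
    return match.group(1).decode("latin-1", "replace") if match else ""

with open("extracted_object.pdf", "rb") as f:
    print(pdf_author(f.read()))
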
Use Case 4: File Dumper
This is a lame use case, but I include it to show that the most common way to support external file analysis in NIDS, dumping files to a directory, can be achieved easily using the proposed interface. This relieves the NIDS of dealing with the files. The NIDS provides the file to be scanned and metadata about the file to the interface. The external analyzer simply writes this to disk. There is no feedback.
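
A sketch of this degenerate case, written against the same hypothetical interface sketched earlier, is just a collector that writes each object and its metadata out and returns no verdict (the dump directory is an arbitrary choice):

# Sketch of Use Case 4: dump each object and its metadata to a directory, no feedback.
import hashlib
import json
import os

DUMP_DIR = "/var/tmp/nids_objects"   # arbitrary location for this example

def file_dumper(payload: bytes, metadata: dict):
    os.makedirs(DUMP_DIR, exist_ok=True)
    name = hashlib.sha1(payload).hexdigest()
    with open(os.path.join(DUMP_DIR, name), "wb") as f:
        f.write(payload)
    with open(os.path.join(DUMP_DIR, name + ".json"), "w") as f:
        json.dump(metadata, f)
    return None                      # fire and forget: no verdict for the NIDS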

Requirements:

I’ll also provide what I think are reasonable requirements (intentionally lacking formality) for a standard interface used between a NIDS and an external client object analysis system.

The interface will provide the NIDS the capability to:
  • Pass files to an external system 
    • Files may be passed in ordered fragments
  • Pass metadata associated with the files to the external system
    • The metadata is provided in a standardized format, but the actual meaning of the data passed is not defined by the interface; rather, it must be agreed upon by the NIDS and the external analyzer.
  • Optionally receive feedback from the external analyzer asynchronously
  • Optionally timeout external analysis if it exceeds a configured threshold
The NIDS performs the following functionality:
  • Normalization (decoding, decompression, reassembly, etc) of all network protocols to extract files as would be seen by clients
  • Provide relevant metadata in conjunction with extracted files for external analysis
  • Optionally, provide mechanisms to filter objects being passed to external analyzer
  • Optionally, incorporate feedback from external analysis into the NIDS detection engine
The Object analysis framework performs the following:
  • Receive objects and metadata from the NIDS
  • Perform analysis, which may include object decoding
  • Optionally, provide feedback to the NIDS

Conclusion

I’ve provided my opinion that it is in the best interest of the security community to have a standardized interface between NIDS and client object analysis. This would provide flexibility for users. It would help NIDS remain relevant while allowing them to stay focused. I envision the interface as a library that supports providing objects and metadata to an object analyzer and receiving feedback in return.
I’m confident I’m not the only one who sees the need for abstraction between NIDS and object scanning, but I hope this article helps advance this cause. NIDS developers are already moving towards greater support for object analysis. I’d be gratified if the ideas presented here help shape, or at least confirm demand for, this functionality. It’s important that users request more mature support of object analysis in NIDS and push for standardization across the various projects/vendors.

Saturday, March 10, 2012

Flushing out Leaky Taps v2

Note: this post is a re-write of a previous post, Flushing out Leaky Taps which I originally posted in June 2010.

Many organizations rely heavily on their network monitoring tools. These tools, which rely on data from network taps, are often assumed to have complete network visibility. While most network monitoring tools provide stats on the packets dropped internally, most don’t tell you how many packets were lost externally to the appliance. I suspect that very few organizations do an in depth verification of the completeness of tapped data or quantify the amount of loss that occurs in their tapping infrastructure before packets arrive at network monitoring tools. Since I’ve seen very little clear documentation on the topic, this post will focus on techniques and tools for detecting and measuring tapping issues.

Impact of Leaky Taps


How many packets does your tapping infrastructure drop before ever reaching your network monitoring devices? How do you know?

I’ve seen too many environments where tapping problems have caused network monitoring tools to provide incorrect or incomplete results. Often these issues last for months or years without being discovered, if ever. Making decisions based on, or otherwise relying on, bad data is never good. Many public packet traces also include the type of visibility issues I will discuss.

In most instances, you need to worry about packet loss in your monitoring devices before you worry about loss in tapping. In most devices there are multiple places where loss can occur, resulting in multiple places where loss is reported. For example, if running a network monitoring application on Linux, the possible places where loss can occur and be reported (if it is reported) are as follows:

Loss occurring                   Is reported at
between Kernel and Application   application (pcap) dropped
between NIC and Kernel           ifconfig dropped
between Tap Feed and NIC         ethtool -S link error
between Network and Tap Feed     N/A

This post focuses primarily on the last item where loss is not directly observable in the network monitor. In seeking to understand loss occurring outside the network monitor, we must assume a lack of loss inside the network monitor. In other words, the methods presented here seek to identify loss comprehensively. If loss is observed or inferred, and any loss in the network monitor device can be ruled out, then loss external to the network monitor device can be identified.

I’m not going to discuss in detail the many things that can go wrong in getting packets from your network to a network monitoring tool. For a quick overview on different strategies for tapping, I’d recommend this article by the argus guys. I will focus largely on the resulting symptoms and how to detect, and to some degree, quantify them. I’m going to focus on two very common cases: low volume packet loss and unidirectional (simplex) visibility.

Low volume packet loss is common in many tapping infrastructures, from span ports up to high end regenerative tapping devices. I feel that many people wrongly assume that taps either work 100% or not at all. In practice, it is common for tapping infrastructures to drop some packets such that your network monitoring device never even gets the chance to inspect them. Many public packet traces include loss that could have been caused by the issues discussed here. Very often this loss isn’t even recognized, let alone quantified.

The impact of this loss depends on what you are trying to do. If you are collecting netflow, then the impact probably isn’t too bad since you’re looking at summaries anyway. You’ll have slightly incorrect packet and byte counts, but overall the impact is going to be small. Since most flows contain many packets, totally missing a flow is unlikely. If you’re doing signature matching IDS, such as snort, then the impact is probably very small, unless you win the lottery and the packet dropped by your taps is the one containing the attack you want to detect. Again, stats are in your favor here. Most packet based IDSs are pretty tolerant of packet loss. However, if you are doing comprehensive deep payload analysis, the impact can be pretty severe. Let’s say you have a system that collects and/or analyzes all payload objects of a certain type: anything from emails to multi-media files. If you lose just one packet used to transfer part of the payload object, you can impact your ability to effectively analyze that payload object. If you have to ignore or discard the whole payload object, the impact of a single lost packet can be significantly multiplied in that many packets’ worth of data can’t be analyzed.

Another common problem is unidirectional visibility. There are sites and organizations that do asymmetric routing such that they actually intend to tap and monitor unidirectional flows. Obviously, this discussion only applies to situations where one intends to tap a bi-directional link but only ends up analyzing one direction. One notorious example of a public data set suffering from this issue is the 2009 Inter-Service Academy Cyber Defense Competition.

Unidirectional capture is common, for example, when using regenerative taps which split tapped traffic into two links based on direction but only one directional link makes it into the monitoring device. Most netflow systems are actually designed to operate well on simplex links, so the adverse effect is that you only get data for one direction. Simple packet based inspection works fine, but more advanced, and usually rare, rules or operations using both directions obviously won’t work. Multi-packet payload inspection may still be possible on the visible direction, but it often requires severe assumptions to be made about reassembly, opening the door to classic IDS evasion. As such, some deep payload analysis systems, including vortex and others based on libnids, just won’t work on unidirectional data. Simplex visibility is usually pretty easy to detect and deal with, but it often goes undetected because most network monitoring equipment functions well without full duplex data.

External Verification


Probably the best strategy for verifying network tapping infrastructure is to perform some sort of comparison of data collected passively with data collected inline. This could be comparing packet counts on routers or end devices to packet counts on a network monitoring device. For higher order verification, you should do something like compare higher order network transaction logs from an inline or end device against passively collected transaction logs. For example, you could compare IIS or Apache webserver logs to HTTP transaction logs collected by an IDS such as Bro or Suricata. These verification techniques are often difficult. You’ve got to try to deal with issues such as clock synchronization and offsets (caused by buffers in tapping infrastructure or IDS devices), differences in the data sources/logs used for concordance, etc. This is not trivial, but often can be done.

Usually the biggest barrier to external verification of tapping infrastructure is the lack of any comprehensive external data source. Many people rely on passive collection devices for their primary and authoritative network monitoring. Often times, there just isn’t another data source to which you can compare your passive network monitoring tools.

One tactic I’ve used to prove loss in taps is to use two sets of taps such that packets must traverse both taps. If one tap sees a packet traverse the network and another tap doesn’t, and both monitoring tools claim 0 packet loss, you know you’ve got a problem. I’ve actually seen situations where one network monitoring device didn’t see some packets and the other network monitoring device also didn’t see some packets, but the missing packets from the two traces didn’t overlap.

Another strategy that I've heard proposed is to use some sort of periodic heartbeat, such as a ping packet with a certain byte sequence in it, which the network monitor can then observe. If the periodic heartbeat isn't observed, then the network monitor can alert on the lack of this heartbeat and the potential monitor visibility gaps can be investigated. I'm not a huge fan of this strategy. While it may work for some cases, I see many situations where alerts for visibility gaps would be caused much more frequently by conditions other than monitor visibility issues. Also, if both loss and heartbeats are a very small fraction of overall traffic, it would be possible for loss to occur without the heartbeat alert being set off. We certainly don't need another false-positive- and false-negative-prone alert to ignore.

Inferring Tapping Issues


While not easy, and not necessarily as precise or as complete as comparing to external data, using network monitoring tools to infer visibility gaps in the data they are seeing is possible. Many network protocols, most notably TCP, provide mechanisms specifically designed to ensure reliable transport of data, even in the event of packet drops. Unlike an endpoint, however, a passive observer can’t simply ask for a retransmission when a packet is dropped. Even so, a passive observer can use the mechanisms the endpoints use for reliable transport to infer if it missed packets passed between endpoints. For example, if Alice sends a packet to Bob which the passive observer Eve doesn’t see, but Bob acknowledges receipt with Alice and Eve sees the acknowledgement, Eve can infer that she missed a packet.
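
To make the Eve example concrete, here is a rough sketch of that inference using the dpkt library. It assumes an Ethernet-framed pcap (the filename is a placeholder), ignores TCP sequence wraparound, and is nowhere near as robust as the tools discussed below; it is only meant to show the logic.

# Rough sketch of inferring monitor visibility gaps from TCP acknowledgements.
# Assumes Ethernet framing and ignores sequence wraparound; filename is a placeholder.
import socket
import dpkt

contig = {}  # (src, sport, dst, dport) -> end of contiguously captured data from src

with open("capture.pcap", "rb") as f:
    for ts, buf in dpkt.pcap.Reader(f):
        eth = dpkt.ethernet.Ethernet(buf)
        if not isinstance(eth.data, dpkt.ip.IP) or not isinstance(eth.data.data, dpkt.tcp.TCP):
            continue
        ip, tcp = eth.data, eth.data.data
        src, dst = socket.inet_ntoa(ip.src), socket.inet_ntoa(ip.dst)
        fwd = (src, tcp.sport, dst, tcp.dport)   # data sent by src
        rev = (dst, tcp.dport, src, tcp.sport)   # data sent by dst

        if tcp.flags & dpkt.tcp.TH_SYN:
            contig[fwd] = tcp.seq + 1            # SYN consumes one sequence number
        elif tcp.data and fwd in contig and tcp.seq <= contig[fwd]:
            # Contiguous (or retransmitted) data we actually captured.
            contig[fwd] = max(contig[fwd], tcp.seq + len(tcp.data))

        # A cumulative ACK only covers data the endpoint really received, so an ACK
        # beyond our contiguous coverage means the endpoint saw bytes the monitor did not.
        if tcp.flags & dpkt.tcp.TH_ACK and rev in contig and tcp.ack > contig[rev] + 1:
            print("possible monitor gap: %d byte(s) from %s:%d acked but never captured"
                  % (tcp.ack - contig[rev], dst, tcp.dport))
            contig[rev] = tcp.ack                # resync so each gap is reported once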

It's important to note the distinction between network packet drops and network monitor visibility gaps. When IP networks are overloaded, they drop packets. In the vast majority of cases, it's a normal (and desirable) part of flow control for networks to drop a small number of packets. Usually, when these packets are dropped, the endpoints slow communication a bit and the dropped packet is retransmitted. On the other hand, it's almost always undesirable for your network monitoring device to not see packets that are successfully transferred through the network. While a network monitor will not see the packets dropped in the network or lost due to a visibility gap, these cases are very different. The former is characterized by lack of endpoint acknowledgement and packet retransmissions while the latter usually is accompanied by endpoint acknowledgement of the "lost" packet and a lack of re-transmission. Again, this post focuses on analyzing unwanted network monitor loss, not normal network drops.

Data and Tools


For those who would like to follow along, I’ve created 3 simple pcaps. The full pcap contains all the packets from an HTTP download of the ASCII “Alice in Wonderland” from Project Gutenberg. The loss pcap is the same except that one packet, packet 50, was removed. The half pcap is the same as the full pcap, but only contains the packets going to the server, without the packets going to the client.

In addition to these pcaps I'll also use a larger pcap of about 100MB which contains 125338 packets. This packet trace was collected on a low rate network where I have confidence that no packets were lost in the taps or the monitor (but normal network packet drops are likely). Using editcap, I've created 2 altered versions of this capture where I removed 100 and 500 packets respectively:

$ capinfos example.pcap | grep "^Number"
Number of packets: 125338
$ editcap example.pcap ex_loss_100.pcap `for i in {1..100}; do echo $[ ( $RANDOM * 125338 ) / 32767 ]; done`
$ capinfos ex_loss_100.pcap | grep "^Number"
Number of packets: 125238
$ editcap example.pcap ex_loss_500.pcap `for i in {1..500}; do echo $[ ( $RANDOM * 125338 ) / 32767 ]; done`
$ /usr/sbin/capinfos ex_loss_500.pcap | grep "^Number"
Number of packets: 124842

Note that I actually only removed 496 instead of 500 in the latter trace because some packets were randomly selected for removal twice. These packet traces will not be shared publicly, but those following along should be able to use their own capture file and obtain similar results.

For tools, I’ll be using argus and tshark to infer packet loss in the tap. Argus is a network flow monitoring tool. Tshark is the CLI version of the ever popular wireshark. Since deep payload analysis systems are often greatly affected by packet loss, I’ll explain how the two types of packet loss affect vortex.

Low Volume Loss in Taps


Detecting and quantifying low volume loss can be difficult. For a long time, the most effective tool I used for measuring this was tshark, especially the tcp analysis lost_segment and ack_lost_segment flags.

Note that this easily identifies the lost packet at position 50:


$ tshark -r alice_full.pcap -R tcp.analysis.lost_segment
$ tshark -r alice_loss.pcap -R tcp.analysis.lost_segment
50 0.410502 152.46.7.81 -> 66.173.221.158 TCP [TCP Previous segment lost] [TCP segment of a reassembled PDU]


Unfortunately, this in and of itself doesn't help us know for sure if this packet was dropped in the network or was lost in the network monitor.

Theoretically, that's what tcp.analysis.ack_lost_segment is for.

$ tshark -r alice_full.pcap -R tcp.analysis.ack_lost_segment
$ tshark -r alice_loss.pcap -R tcp.analysis.ack_lost_segment


What's going on? This packet should have been acknowledged by the endpoint (it was originally in the trace) and the ACK for this packet was not removed. Unfortunately, this functionality doesn't always work reliably. I have seen it work sometimes, but as shown here, it doesn't work all the time. tshark does reliably flag packets that are lost, but doesn't reliably flag those that are lost but ACK'd. This differentiation is important because it allows us to separate packets that are lost in the network from those that are lost in the tapping/capture infrastructure. This bug was reported to the wireshark development team by György Szaniszló. As far as I know, the fixes that György proposed still have not been implemented. In addition to his explanation of current tshark functionality, György included an additional test pcap and a patch to fix this functionality. I recommend reading this bug report and considering using the patch he provided if you are serious about using tshark to help infer loss external to your network monitors. I'd love to see the wireshark team fix this functionality.

Please note that while the "ack_lost_segment" doesn't appear to work reliably, the "lost_segment" appears to work as expected. This can help you validate your network tapping infrastructure, especially if you can quantify the packets dropped in the network. At the very least, the loss reported here could be considered a reflection of the upper bound of the loss external to your monitor. I’ve created a simple (but inefficient) script that can be used on many pcaps. Since tshark doesn’t release memory, you’ll need to use pcap slices smaller than the amount of memory in your system. The script is as follows:


#!/bin/bash

while read file
do
total=`tcpdump -r $file -nn "tcp" 2>/dev/null | wc -l`
errors=`tshark -r $file -R tcp.analysis.lost_segment | wc -l`
percent=`echo $errors $total | awk '{ print $1*100/$2 }'`
bandwidth=`capinfos $file | grep "bits/s" | awk '{ print $3" "$4 }'`
echo "$file: $percent% $bandwidth "
done


It is operated by piping it a list of pcap files. For example, here are the results from my private example traces:


ls example.pcap ex_loss_100.pcap ex_loss_500.pcap | ./calc_tcp_loss.sh
example.pcap: 0.00261199% 40351.43 bits/s
ex_loss_100.pcap: 0.0392112% 40321.12 bits/s
ex_loss_500.pcap: 0.192323% 40194.42 bits/s
...


I believe the small amount of loss in the unmodified example.pcap resulted from normal network packet drops. Note that the loss percentage reported scales up nicely with our simulated network monitor loss.

In the case of low volume loss in taps, argus historically hasn’t been the most helpful:


$ argus -X -r alice_full.pcap -w full.argus
$ ra -r full.argus -n -s stime flgs saddr sport daddr dport spkts dpkts loss
10:12:54.474330 e 66.173.221.158.55812 152.46.7.81.80 87 121 0
$ argus -X -r alice_loss.pcap -w loss.argus
$ ra -r loss.argus -n -s stime flgs saddr sport daddr dport spkts dpkts loss
10:12:54.474330 e 66.173.221.158.55812 152.46.7.81.80 87 120 0


Note that there is one less dpkt (destination packet). Other than the packet counts, there is no way to know that packet loss occurred. For as long as I've used argus, it does a good job of identifying and quantifying normal network packet drops, as evidenced by retransmissions, etc. This is reported by flags of "s" and "d", for source and destination loss respectively, as well as various *loss stats.

Very recently, however, the argus community and developers have added mechanisms to argus to directly address inference of network monitor packet loss and differentiate it from normal network packet drops. The result is flags and counters for what is termed "gaps" in traffic visibility. This functionality is included in recent development versions (my examples are made using argus-3.0.5.10 and argus-clients-3.0.5.34). The inferred lapses in monitor visibility are denoted with the flag of "g" and the statistics "sgap" and "dgap" which measure the bytes of an inferred gap in network traffic.

For example let's look again at the alice example, regenerating it with the new argus that has gap detection capabilities and looking at the "dgap" instead of "loss" statistic:

$ /usr/local/sbin/argus -X -r alice_full.pcap -w full.argus
$ ra -r full.argus -n -s stime flgs saddr sport daddr dport spkts dpkts dgap
10:12:54.474330 e 66.173.221.158.55812 152.46.7.81.80 87 121 0
$ argus -X -r alice_loss.pcap -w loss.argus
$ ra -r loss.argus -n -s stime flgs saddr sport daddr dport spkts dpkts dgap
10:12:54.474330 e g 66.173.221.158.55812 152.46.7.81.80 87 120 1460


Now argus correctly flags the connection as having a gap in it and identifies the size of the gap. Let's see how argus does on my private example:

$ argus -X -r example.pcap -w - | ra -nn -r - | grep g | wc -l
3
$ argus -X -r ex_loss_100.pcap -w - | ra -nn -r - | grep g | wc -l
36
$ argus -X -r ex_loss_500.pcap -w - | ra -nn -r - | grep g | wc -l
102


Argus seems to scale up nicely as the number of packets lost increases. I suspect that the reason the increase in flagged flows isn't linear is that some flows may have more than one gap in them. Note that argus can't detect every lost packet, but it does detect a large portion of them. Factors such as the ratio of reliable to unreliable protocols used, the size of individual flows, the number of concurrent connections, and the percentage of packets lost parametrize the ratio of lost packets to flagged flows. I think it's safe to assume that if all these parameters are held constant and a large enough number of observations is used, one can estimate the quantity of packets lost based on the number of flows flagged with a high degree of accuracy. One could also base estimates on the number of bytes in the sgap and dgap stats, but this also involves making assumptions (size of packets in gaps).

One concerning result of the above demonstration, however, is that there are some flows flagged as having gaps in the unmodified packet trace which I believe to be free of any network monitor loss (visibility gaps). There are a few reasons why some flows may be flagged erroneously. Probably the most common is due to a given connection being split across multiple flow records, causing a gap to be detected due to the boundary condition. This can be rectified by using longer flow status intervals so that this boundary case occurs less. Increasing the flow status interval from the default of 5 seconds to 60 seconds is enough to fix our falsely flagged flows:

$ argus -X -S 60 -r example.pcap -w - | ra -nn -r - | grep g | wc -l
0
$ argus -X -S 60 -r ex_loss_100.pcap -w - | ra -nn -r - | grep g | wc -l
25
$ argus -X -S 60 -r ex_loss_500.pcap -w - | ra -nn -r - | grep g | wc -l
64

So increasing the flow status interval is enough to have my private example report 0 flows with gaps (from which we infer network monitor loss). It also decreases the number of flagged flows in the case of legitimate loss: as more connection data is bundled into a single flow record, flagged flows contain more than one lost packet. This functionality is very new to argus, so expect it to improve over time. The developers are now discussing how to improve things such as filtering, aggregation, and flow boundary conditions. One of the most exciting facets of this functionality being built into argus is that this data is available at no additional effort going forward. This makes periodic or even continuous validation of network monitor visibility extremely easy.

Updated 03/12/2012: See the comments section for results using Bro.

Vortex basically gives up on trying to reassemble a TCP stream if there is a packet that is lost and the TCP window is exceeded. The stream gets truncated at the first hole and the stream remains in limbo until it idles out or vortex closes.


$ vortex -r alice_full.pcap -e -t full
Couldn't set capture thread priority!
full/tcp-1-1276956774-1276956775-c-168169-66.173.221.158:55812s152.46.7.81:80
full/tcp-1-1276956774-1276956775-c-168169-66.173.221.158:55812c152.46.7.81:80
VORTEX_ERRORS TOTAL: 0 IP_SIZE: 0 IP_FRAG: 0 IP_HDR: 0 IP_SRCRT: 0 TCP_LIMIT: 0 TCP_HDR: 0 TCP_QUE: 0 TCP_FLAGS: 0 UDP_ALL: 0 SCAN_ALL: 0 VTX_RING: 0 OTHER: 0
VORTEX_STATS PCAP_RECV: 0 PCAP_DROP: 0 VTX_BYTES: 168169 VTX_EST: 1 VTX_WAIT: 0 VTX_CLOSE_TOT: 1 VTX_CLOSE: 1 VTX_LIMIT: 0 VTX_POLL: 0 VTX_TIMOUT: 0 VTX_IDLE: 0 VTX_RST: 0 VTX_EXIT: 0 VTX_BSF: 0

$ vortex -r alice_loss.pcap -e -t loss
Couldn't set capture thread priority!
loss/tcp-1-1276956774-1276956774-e-31056-66.173.221.158:55812s152.46.7.81:80
loss/tcp-1-1276956774-1276956774-e-31056-66.173.221.158:55812c152.46.7.81:80
VORTEX_ERRORS TOTAL: 2 IP_SIZE: 0 IP_FRAG: 0 IP_HDR: 0 IP_SRCRT: 0 TCP_LIMIT: 0 TCP_HDR: 0 TCP_QUE: 2 TCP_FLAGS: 0 UDP_ALL: 0 SCAN_ALL: 0 VTX_RING: 0 OTHER: 0
Hint--TCP_QUEUE: Investigate possible packet loss (if PCAP_LOSS is 0 check ifconfig for RX dropped).
VORTEX_STATS PCAP_RECV: 0 PCAP_DROP: 0 VTX_BYTES: 31056 VTX_EST: 1 VTX_WAIT: 0 VTX_CLOSE_TOT: 1 VTX_CLOSE: 0 VTX_LIMIT: 0 VTX_POLL: 0 VTX_TIMOUT: 0 VTX_IDLE: 0 VTX_RST: 0 VTX_EXIT: 1 VTX_BSF: 0


Note that there are fewer bytes collected, vortex warns about packet loss, there are TCP_QUEUE errors, and the stream doesn’t close cleanly in the loss pcap.

Simplex Capture


Simplex capture is actually pretty simple to identify. It’s only problematic because many tools don’t warn you if it is occurring, so you often don’t even know it is happening. The straightforward approach is to use netflow and look for flows with packets in only one direction.


$ argus -X -r alice_half.pcap -w half.argus
$ ra -r half.argus -n -s stime flgs saddr sport daddr dport spkts dpkts loss
10:12:54.474330 e 66.173.221.158.55812 152.46.7.81.80 87 0 0


This couldn’t be more clear. There are only packets in one direction. If you use a really small flow record interval, you’ll want to do some flow aggregation to ensure you will get packets from both directions in a given flow record. Note that argus by default creates bidirectional flow records. If your netflow system does unidirectional flow records, you need to do a little more work like associating the two unidirectional flows and making sure both sides exist.

You can use one of many tools, such as tcpdump or tshark, and see that for a given connection, you only see packets in one direction.

Vortex handles simplex network traffic in a straightforward, albeit somewhat lackluster manner--it just ignores it. LibNIDS, on which vortex is based, is designed to overcome NIDS TCP evasion techniques through exactly mirroring the functionality of the TCP stack but assumes full visibility (no packet loss) to do so. If it doesn’t see both sides of a TCP handshake, it won’t follow the stream because a full handshake hasn’t occurred. As such, the use of vortex on the half pcap is rather uneventful:


$ vortex -r alice_half.pcap -e -t half
Couldn't set capture thread priority!
VORTEX_ERRORS TOTAL: 0 IP_SIZE: 0 IP_FRAG: 0 IP_HDR: 0 IP_SRCRT: 0 TCP_LIMIT: 0 TCP_HDR: 0 TCP_QUE: 0 TCP_FLAGS: 0 UDP_ALL: 0 SCAN_ALL: 0 VTX_RING: 0 OTHER: 0
VORTEX_STATS PCAP_RECV: 0 PCAP_DROP: 0 VTX_BYTES: 0 VTX_EST: 0 VTX_WAIT: 0 VTX_CLOSE_TOT: 0 VTX_CLOSE: 0 VTX_LIMIT: 0 VTX_POLL: 0 VTX_TIMOUT: 0 VTX_IDLE: 0 VTX_RST: 0 VTX_EXIT: 0 VTX_BSF: 0


The most optimistic observer will point out that at least vortex makes it clear when you don’t have full duplex traffic--because you see nothing.

Conclusion


I hope the above is helpful to others who rely on passive network monitoring tools. I’ve discussed the two most prevalent tapping issues I’ve seen personally. One topic I’ve intentionally avoided, because it’s hard to discuss and debug, is interleaving of aggregated taps, especially issues with timing. For example, assume you do some amount of tap aggregation, especially aggregation of simplex flows, either using an external tap aggregator or bonded interfaces inside your network monitoring system. If enough buffering occurs, it may be possible for packets from each simplex flow to be interleaved incorrectly. For example, a SYN-ACK may end up in front of the corresponding SYN. There are other subtle tapping issues, but the two I discussed above are by far the most prevalent problems I’ve seen. Verifying or quantifying the loss in your tapping infrastructure once is above and beyond what many organizations do. If you rely heavily on the validity of your data, you may consider doing this periodically or automatically so you detect any changes or failures. Even better, for those that heavily rely on their network monitors, building this self-validation into the tools themselves seems like the right thing to do.

Saturday, February 18, 2012

The Net Defender’s Long March

In my experience, activities in the spring, especially the month of March, set the tone for network defense efforts for the coming months. I’ve patterned this poem after a military classic: the Long March. I hope the Incident Response community finds this both enjoyable and encouraging.

The Net Defender’s Long March

The Net Defender fears not the tempo of the Long March
Esteeming as routine the perennial surge
Waves of spear phishes are swatted like gnats
And more zero days are revealed, streams of bytes
Ardent are the intruders in assaying the defenses for an unmitigated vector
Tepid are the vendors in acknowledging and remediating vulnerabilities
Long hours are joyfully passed in forensic analysis
The Defenders unite to share intel, each heart resolute