Tuesday, April 27, 2010

Snort Releases Near Real-Time Extension

Is that a pig I see flying? No, but VRT has released a near real-time extension to Snort.

I'm far from the first to discuss it, but figured I had to mention it because so much of the content on this blog has been, and will be, about near real-time network analysis.

My initial reaction is that I thought the day would never come. It was not too long ago that near real-time IDS was the domain of a few hardcore net defenders who built their own tools. Having built a platform for NRT and seen it used with great success, I can't advocate the technique zealously enough.

I'm really happy to see Sourcefire making this step toward the paradigm for which a few of us have been clamoring for years. Regardless of the implementation, just recognizing the validity of the paradigm and its value is an important step. Furthermore, the definition of NRT that VRT is using is very similar to the definition I've been using with my colleagues for some time. There seems to be a true understanding of what is being asked for, not just buzzword reflection.

While I haven't been able to play with it as much as I'd like, I have a few quick comments/thoughts:

If you have problems with libtool during compilation, delete the ltmain.sh from the unpacked tarball and replace it with the ltmain.sh from your distro's libtool package (rpm -ql libtool will show you where it is).
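As a sketch, the swap can be done like this (the paths are illustrative and vary by distro; on Debian-likes substitute dpkg -L for rpm -ql):

```shell
# find where the distro's libtool package installs ltmain.sh
ltmain=$(rpm -ql libtool 2>/dev/null | grep ltmain.sh | head -1)
# overwrite the copy bundled in the unpacked source tree, if found
if [ -n "$ltmain" ]; then
    cp "$ltmain" ./ltmain.sh
fi
```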

Other than that little issue, the install was easy for me.

The documentation is basically non-existent. Browsing through the source code, I got a bit of a feel for what was going on, but I don't fully understand how everything fits together. A howto guide explaining how to do NRT on arbitrary data would be nice, but who am I to complain about poor documentation :)

One thing that I was surprised to see, however, was an implementation of the pdf parsing routines in C. They utilized other C code written by Didier Stevens, but they didn't use his python implementation of what I think is similar functionality. I believe making use of existing code, with the smallest amount of re-factoring possible, is an important enabler for agility in NRT analysis. After all, in my view, NRT is about taking the detection tools used in other domains and applying them to data extracted passively from the network.

From what I can see, snort-nrt looks very promising.

Saturday, April 24, 2010

Vortex Howto Series: Near Real-Time IDS

This installment of the vortex howto series will build upon previous installments to demonstrate additional features of vortex relevant to implementing a near-real time IDS.

Most mainstream IDSs are extremely packet focused. There are many reasons for this, but at least one is to support IPS, where the “P” is for prevention. The rationale is that to block attacks, one must be able to decide whether to block or pass a packet in a very short period of time. Conventional IDSs focus heavily on efficiency, usually having a very strict C API for analysis modules.

Vortex supports a very different philosophy. Vortex takes a stream-centric approach. The focus is on supporting analysis of the data traveling through the network, not the mechanism for transporting the data (packets). Vortex doesn’t even try to support preventing attacks but focuses on facilitating deep analysis of network payload data, especially processor intensive or high latency analysis. Vortex has a very flexible API, one which anyone familiar with Linux/Unix will appreciate. I think of it as a find command for network payload data.

For this installment we’re going to improve upon the example provided in the readme. We’re going to use ssdeep to do fuzzy hash comparisons against known attack signatures. We’ll call our IDS ssdeep-n. We’re using ssdeep because it’s relatively computationally expensive. Actually, it’s extremely slow. While ssdeep has a very easy to use API, we’re intentionally not going to use it because we want to demonstrate the ability to use vortex to take any Unix command line tool and use it for network analysis.

So without further ado, here is our analyzer:
#!/bin/sh
#simple script to run ssdeep on network stream (or any list of files)
#output should be piped to log file or logging system (logger)

while read file
do
    result=`ssdeep -m /etc/ssdeep-n.sigs -b "$file"`
    if ! echo "$result" | grep matches > /dev/null
    then
        #no match: purge the stream
        rm -f "$file"
    else
        #match: archive the stream and emit an alert
        mv "$file" /var/lib/ssdeep-n/hits/
        echo "$result" | sed 's/ \/etc\/ssdeep-n\.sigs:/ /g'
    fi
done
You can download it here.

While contrived and not the most efficient solution, this is sufficiently generalized to be representative of what could be done with basically any Unix command, including those that don’t support multiple files per invocation or situations where you need to capture/parse the output of the command. We execute ssdeep on the stream file provided by vortex and capture the output. We check the captured output for what we find interesting. If we don’t detect a match, we purge the stream. If we do detect a match, we archive the stream file to /var/lib/ssdeep-n/hits/ and output an alert, massaging the alert text a small amount.
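To make the alert massaging concrete, here is the script’s sed substitution applied to a sample match line (the line mimics the format of ssdeep’s -m mode output; treat the specific filenames and score as illustrative):

```shell
# a sample match line, as produced by 'ssdeep -m sigfile -b file'
line='tcp-431134-1249152504 matches /etc/ssdeep-n.sigs:Attack ABC (97)'
# strip the signature-file prefix so only the signature name remains
echo "$line" | sed 's/ \/etc\/ssdeep-n\.sigs:/ /g'
# prints: tcp-431134-1249152504 matches Attack ABC (97)
```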

For a data set, the defcon17 CTF packet captures will be used. I downloaded the packet captures and used mergecap to combine them back into one pcap with the following properties:
$ capinfos ctf_dc17.pcap
File name: ctf_dc17.pcap
File type: Wireshark/tcpdump/... - libpcap
File encapsulation: Ethernet
Number of packets: 38994342
File size: 7780760337 bytes
Data size: 7156850841 bytes
Capture duration: 185602.101865 seconds
Start time: Fri Jul 31 13:26:38 2009
End time: Sun Aug 2 17:00:00 2009
Data rate: 38560.18 bytes/s
Data rate: 308481.46 bits/s
Average packet size: 183.54 bytes
Closely related to the data set is the signature set we’ll be using. You can download it from here. The signature file contains ssdeep hashes for an assortment of attack data, some of which will match against the defcon 17 data set. Fearing to depart too much from the standards set by the security industry, the signature names are painfully useless :)

Now we’re ready to actually get our near real time IDS to run. Based on the knowledge from some of the previous articles in this series, the following is a good starting point:
$ vortex -r ctf_dc17.pcap -e -t /dev/shm/ssdeep-n \
-S 1000000000 -C 1000000000 |./ssdeep-n.sh
One of the most important vortex options, at least for those of us who care about security, is the -u option. Live captures usually require root privileges to open the capture device, but we’d like to not run as root any longer than necessary. The -u option tells vortex to setuid down to a non-root user after opening the capture device/file. Changing the command so it is executed as root but quickly drops to the unprivileged user nobody yields the following:
# vortex -r ctf_dc17.pcap -u nobody -e -t /dev/shm/ssdeep-n \
-S 1000000000 -C 1000000000 | su nobody -c './ssdeep-n.sh'
While we aren’t reading from a live interface, we very easily could be. We’re using su so that the analyzer also runs under the non-root account.

Libnids, on which vortex is built, has some statically sized hash tables. In general we want these hash tables to be large enough that they are never filled, but not too much larger than necessary, as they consume a fair amount of memory. One of these is the main connection hash table. Each active connection which vortex is capturing requires an entry in this table. When it fills up, vortex ignores additional connections until active connections are closed. The default value of 1M is pretty good, but for demonstrative purposes, we’re going to set this to 2M by using -s 2097152. You will know you need to increase this if you ever have errors of the category “TCP_LIMIT”. Similarly, libnids has a static hash table for IP fragment reassembly which can be sized with -H. We’ll leave this at the default, but if you have a network where IP fragmentation is actually used routinely, you may want to increase it.
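A sketch of watching for that condition in the logs follows; the TCP_LIMIT category name comes from vortex, but the exact layout of the error line here is an assumption, so adjust the pattern to whatever your build actually logs:

```shell
# a stand-in for a vortex error-stats line from syslog (layout assumed)
line='ssdeep-n: VORTEX_ERRORS TOTAL: 484 TCP_LIMIT: 12'
# flag a non-zero TCP_LIMIT count: connections were being ignored
if echo "$line" | grep -q 'TCP_LIMIT: [1-9]'; then
    echo "connection hash table filled; consider raising -s"
fi
```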

Vortex doesn’t provide the data to the external analyzer until all the requested data from the stream has been gathered or until the connection has successfully closed. For various reasons, vortex can’t always detect when communication has terminated. To prevent connections from being followed indefinitely, even after they have been abandoned by one or both ends, the -K option provides a timeout. Note, however, that this timeout is only reset when data is transferred through the connection, not when other possibly valid TCP traffic, such as keepalives, ACKs, etc, is observed. Vortex has an especially hard time detecting the end of many of the connections in the defcon data set we are using, so we definitely need to set this option. In practice, the -K option also helps guard against benign or malicious resource exhaustion. Common settings for this option range from 1s to 3600s. We’ll set this to 600s with -K 600.

Adding the hash table size options and timeout yields:
# vortex -r ctf_dc17.pcap -u nobody -s 2097152 -K 600 \
-e -t /dev/shm/ssdeep-n -S 1000000000 -C 1000000000 \
| su nobody -c './ssdeep-n.sh'
Another important aspect of running vortex for long periods of time, as you would do with a near real-time IDS, is logging of health/status. By default vortex dumps error and performance stats at program termination, but it can be configured to dump this data periodically. The -E and -T options set the reporting intervals for error and performance statistics, which are output to syslog and STDERR. We’ll use 3600 for each so we get stats back every hour. The -L option sets the syslog tag so that different instances of vortex can be differentiated from each other. We’ll use -L ssdeep-n.

One subtle item of note here is that while basically all aspects of vortex timings are based on the time loaded from the packet captures, either live or dead, the periods for error and performance stats logging are implemented in system time (not pcap time). In this example, we’ll see the multi-day packet capture processed in a couple hours. The times from the packet captures, including the -K idle timeout will be based on pcap time, while the error and stats messages will be based on local system time.

Adding logging yields the following:
# vortex -r ctf_dc17.pcap -u nobody -s 2097152 -K 600 \
-e -t /dev/shm/ssdeep-n -E 3600 -T 3600 -L ssdeep-n \
-S 1000000000 -C 1000000000 | su nobody -c './ssdeep-n.sh \
| logger -s -p local0.info -t ssdeep-n'
We’re taking the output of ssdeep-n and feeding it to logger such that logs are echoed back to the terminal and sent to the system log.

So now we’re ready to actually run our near real time IDS.

The results look something like the following:
Apr 24 15:23:18 localhost ssdeep-n: VORTEX_STATS PCAP_RECV: 0
Apr 24 15:23:18 localhost ssdeep-n: VORTEX_ERRORS TOTAL: 0
Apr 24 15:27:10 localhost ssdeep-n: tcp-30216-1249077951
matches Command DGB (75)
Apr 24 15:28:20 localhost ssdeep-n: tcp-56998-1249080094
matches Response DGB (93)
Apr 24 15:28:20 localhost ssdeep-n: tcp-56998-1249080094
matches Response DGB (93)
Apr 24 15:28:54 localhost ssdeep-n: tcp-62766-1249080436
matches Response DGB (66)
Apr 24 15:34:30 localhost ssdeep-n: tcp-112145-1249083434
matches Response DGB (94)
Apr 24 15:36:25 localhost ssdeep-n: tcp-129781-1249084423
matches Response DGB (94)
Apr 24 16:23:18 localhost ssdeep-n: VORTEX_STATS PCAP_RECV: 0
PCAP_DROP: 0 VTX_BYTES: 374632450 VTX_EST: 370486 VTX_WAIT: 9999
Apr 24 16:23:18 localhost ssdeep-n: VORTEX_ERRORS TOTAL: 484
Apr 24 16:33:29 localhost ssdeep-n: tcp-394718-1249150608
matches Attack ABC (97)
Apr 24 16:33:30 localhost ssdeep-n: tcp-394734-1249150609
matches Attack ABC (97)
Apr 24 16:49:00 localhost ssdeep-n: tcp-431134-1249152504
matches Attack ABC (97)
Apr 24 17:23:18 localhost ssdeep-n: VORTEX_STATS PCAP_RECV: 0
PCAP_DROP: 0 VTX_BYTES: 642622346 VTX_EST: 532289 VTX_WAIT: 9999
Apr 24 17:23:18 localhost ssdeep-n: VORTEX_ERRORS TOTAL: 713
A few of the signatures matched, with varying degrees of similarity. Since we’ve archived the matches, we can go examine them. For example, let’s look at one of the very popular “Attack ABC” hits:
[csmutz@master ~]$ hexdump -v /var/lib/ssdeep-n/hits\
/tcp-431134-1249152504-1249152504-i-2056-\ | head
0000000 9090 9090 9090 9090 9090 9090 9090 9090
0000010 9090 9090 9090 9090 9090 9090 9090 9090
0000020 7dbf b830 3110 66c9 f0b9 db01 d9d9 2474
0000030 58f4 7831 8310 04c0 7803 9f0c edc5 ba99
0000040 5975 c5cc b196 c5e6 fd66 1d82 fe98 9d72
0000050 0165 5a8d d5e0 9b73 be14 1aee 86eb 0c74
0000060 f715 c888 718e d758 4eca 2858 2b2b b48a
0000070 73a1 f051 83b4 2fa5 1321 e837 0734 0939
0000080 d8c9 f5c6 1e36 1d43 5fc8 f2b3 c55e cb35
0000090 e124 2c38 99d9 bcad 9949 a0ab da68 454b
I don’t know much about what is supposed to be going on here, but I do know that starting your conversation off with a NOP sled is, in computer etiquette, not the nicest way to begin. While contrived, we’ve “detected” an attack. We could look more, but I think that’s a sufficient discussion of our results.
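As an aside, a quick way to quantify a suspected sled is to count the run of leading 0x90 bytes. A fabricated sample stands in for the archived stream file here; point the od command at the real file instead:

```shell
# fabricate a tiny sample: four NOP (0x90) bytes followed by 'ABC'
printf '\220\220\220\220ABC' > /tmp/sample-stream
# hex-encode the file, isolate the initial run of 0x90, count the bytes
od -An -tx1 /tmp/sample-stream | tr -d ' \n' \
    | grep -oE '^(90)+' | awk '{print length($0)/2 " leading NOP bytes"}'
# prints: 4 leading NOP bytes
```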

We’ve demonstrated how to use vortex to build a near real-time IDS. While ssdeep is probably not something you’d ever want to run on bare network streams, we’ve shown how easy it is to take basically any Unix command that operates on files, including computationally expensive ones, and apply the same functionality to network streams in near real time. While we used a program written in C with a straightforward API, we could just as easily have used a perl/python/ruby script, a java program, or even a VB script written for windows which runs via mono or wine. No re-implementation is required to take a valuable detection mechanism and run it on network traffic in near real time. One of the most valuable things vortex could be used for is the type of decoding and/or data extraction that just isn’t possible with a mainstream IDS. For example, if the signature matching capabilities of Snort aren’t good enough for you, what about extracting MS Office documents from network traffic and running officecat on them? Similarly, if you like Bro-style transaction logs for network protocols, why not extract metadata from pdfs traversing the network with pdftk or one of Didier Stevens’ PDF tools?
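As a sketch of that pdf-metadata idea, the analyzer loop keeps the same shape as ssdeep-n.sh; pdftk, the syslog tag, and the assumption that every stream handed to us is a pdf are all illustrative choices, not vortex requirements:

```shell
# per-stream handler: dump pdf metadata to syslog, then discard the stream
analyze_pdf_stream() {
    # guarded so this sketch is a no-op where pdftk is not installed
    if command -v pdftk >/dev/null 2>&1; then
        pdftk "$1" dump_data 2>/dev/null | logger -t pdf-meta
    fi
    rm -f "$1"
}

# vortex writes one completed stream filename per line to our stdin
while read stream
do
    analyze_pdf_stream "$stream"
done
```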

While we’ve run our near real-time IDS from a dead capture file, it could just as easily be done from live capture. Vortex includes some example init scripts that could be used to run vortex in a daemon mode, such as you would need to do for a network sensor. Vortex facilitates the creation of agile and flexible near real-time detection mechanisms.

As we’ll show in the next installment of this series, vortex removes the real-time constraints inherent in network packet capture from our content analysis. Vortex also can be used to take detection mechanisms as we’ve implemented here and scale them across highly parallel systems.

Tuesday, April 20, 2010

Keeping Targeted Attacks Secret Kills R&D

I’m really impressed with Google’s response to what has been coined Operation Aurora by others. I’m impressed for lots of reasons. I’m impressed because they recognize the value of their intellectual property, and when they realized it was threatened, they took decisive action to protect their interests. I think it’s sad that so many companies in a similar situation would be so blinded by short-sighted lust for the “emerging market” that they fail to protect themselves and fail to recognize that the same market is far from fair or open. I’m impressed that when they apparently felt that the espionage was backed or at least condoned by the Chinese government, they called them out. Most of all, I’m happy they made this public.

That being said, I’m not too impressed that Google, and the majority of the computer security industry for that matter, were taken off guard by these attacks. The level of sophistication and determination is not new, nor is the type of data targeted. For the purpose of this article, when I refer to targeted and sophisticated attacks, I’m referring to attacks where one or more attacker groups repeatedly seek to (re-)penetrate an organization’s computer systems for ends specific to the victim organization, typically exfiltration of sensitive information. These attacks are characterized by a high degree of knowledge of the victims, often a high degree of social engineering, adequate technical sophistication, and a high degree of organization/coordination on the part of the attackers. I refrain from using the term advanced persistent threat (APT) because, while it has had a fairly precise meaning among the people using it for some time, the meaning has been blurred quite a bit of late. For the purposes of this article, the specific identities of the attackers, including affiliation with or backing by nation-states, are not important. A few public reports of these sorts of attacks go back to at least the 2003-2005 timeframe, probably earlier, but that’s when I started paying attention. Maybe the one thing that is new is the type of industry targeted. I think Google should have known it was coming. I’ll bet they had some warnings they chose to ignore, but I guess I can’t fault them too much.

The response by the security industry to these attacks is pitiful. Many people recognize that the state of the art, including mainstream enterprise security tools, can’t stop, let alone detect, this sort of activity. While there are a few valiant incident responders who have been dealing with sophisticated targeted attacks for some time, many with a good deal of success, the security vendors have basically ignored their pleas and ideas for improved security tools. I’ve heard vendors say “You don’t want to do that” and “the market for that isn’t big enough for us to implement it”.

What has to happen for the security industry to realize they need to deal with sophisticated targeted attacks? First, organizations need to realize the value of their intellectual property. Second, they need to realize that it’s at risk. I think most organizations are at this point. Third, they need to realize that conventional security wisdom, practices, and tools won’t protect them against this, for some people new, class of attacker. Unfortunately, all too often, this epiphany only comes after personal and painful experience. Fourth, enough people need to start demanding effective solutions that vendors feel compelled to deliver them and academia recognizes the problems that need researching. Lastly, the solutions--a capable workforce, processes and practices, technology, etc.--need to be developed.

While there are many hindrances, one of the biggest obstacles to effectively dealing with targeted attacks is silence. While this class of attack is far from new, basically no one talks about it. While there are plenty of examples of good public documentation of sophisticated attacks, ex. the Businessweek E-espionage threat, NG’s report on Chinese Espionage, and Mandiant M-trends, basically no one credible steps up and confirms the validity of the data, leaving many to dismiss these reports as sensational journalism, conspiracy theories, and marketing hype. Given the lack of solid public data, I guess I don’t blame people for questioning the reality of this threat until they experience it personally.

This code of silence related to compromises is very detrimental to solving the problem through the various available avenues: political/diplomatic, legal, and security systems, including technology and people. There are a lot of legitimate reasons for not broadcasting your status as the victim of a sophisticated attack and/or the type of details required to help prevent future occurrences. Most of them I wouldn’t agree with, especially if everyone in the same industry/sector is in the same boat and you all know it. One of the few legitimate reasons to keep details of these attacks secret is that defending against persistent attackers is best achieved through an attacker focused or security intelligence driven approach. But how long is your threat intelligence still useful? Surely keeping specific attack data secret past a year or two doesn’t buy you much in terms of security intelligence, as the most aggressive attackers change tactics and techniques more frequently than this. Hopefully it doesn’t reveal too much about your capabilities either, as they need to be evolving that quickly also. Does acknowledging you’ve been attacked after your incident response is finished, or at least well under way, cost you anything in terms of threat intelligence? I don’t think so. I admire Google for going public and doing something about it. I’m happy to see some public details, but more details and official acknowledgement from Google would be nice. Sadly, Google is right when they say they’ve already been more open than most others in the industry.

The organizations that keep targeted attacks and the details of them secret are part of the problem, or at the very least, aren’t doing everything necessary to help solve the problem. I think it’s a little hypocritical for organizations to complain about the security industry and academia not addressing this class of threat when no one will talk about the problem publicly with the requisite level of certainty and specificity.

Focusing on security R&D, there are a few things I think need to happen before the security tools industry and academia can start to address targeted attacks. The people doing R&D need to know what type of attacks are actually occurring, they need to understand the importance of a threat focused response model, and they need some decent data.

Understanding the Targeted Attack Scenario

One of the major problems with current academic and applied research is that most researchers don’t understand the basics of a highly targeted attack scenario. They don’t know how serious the problem is. If you tell an academic that the sky is falling because of targeted attacks and give them a high level overview, they’ll either yawn or laugh at you. Case in point, the following hypothetical conversation:

Boots on Ground Responder: We’ve got to do something about these highly socially engineered spear-phishing attacks!

Heads in Clouds Researcher: If you graph the social network, how many nodes away is the sender from the recipient?

Boots on Ground Responder: Uh, 1. Sometimes 2. Sometimes more, it depends.

Heads in Clouds Researcher: Ok, what about the malware? Rootkit? Polymorphism? Any Red pill/Blue pill?

Boots on Ground Responder: In this case, nothing like that. Just simple malware that provides a minimal backdoor. The malware isn’t even packed.

Heads in Clouds Researcher: Ok, this stuff isn’t being detected by your AV, IDS, etc but it’s still making it through firewalls, proxies, etc. Any interesting data hiding techniques?

Boots on Ground Responder: No, not really. Malware evades AV because it’s never been seen before. In cases where they need to evade our IDS, they use trivial obfuscation like Caesar ciphers. Usually, though, they just hide in plain sight.

Heads in Clouds Researcher: Doesn’t sound too interesting to me. Just patch your systems and tell your users not to click on unsolicited email.

Boots on Ground Responder: Yeah, right. Still, we see repeated patterns in all of these attacks. I can’t give you details, but there’s got to be a way to catch these guys.

Heads in Clouds Researcher: Ok, well I’m going to go back to musing on the trusting trust problem…

The sad part is there are some really interesting problems, true academic problems, but for the most part, academia isn’t seeing them. I don’t think it’s because academia isn’t trying to find good problems to solve, I think it’s because the interesting details aren’t being shared.

Researchers need to learn how different targeted attacks are from opportunistic attacks. They need to understand how the goals and methods differ. They need to understand how different the targeting mechanisms are. They need to understand how valuable an intelligence driven response model is. However, they won’t learn it until someone shows them.

Supporting threat focused response

So much conventional security wisdom and basically all academic research takes a vulnerability focused approach. The focus is on detecting and mitigating individual attacks, not persistent campaigns comprising series of attacks. That’s the best approach for many classes of attacks, but isn’t the best if determined attackers continue attacking the same target over and over again. So many other people have spoken on this topic, that I’ll defer to them and steer my ramblings toward application of these principles to security tool development. For the reader’s reference, I recommend this podcast by some of the thought leaders in this realm. If what they are saying is news to you, check out their blogs, etc.

People doing security R&D have to learn about intelligence driven incident response. While some products support this approach, almost none fully embrace it. Even worse, academia is basically mute on the topic.

One aspect of a threat focused response model that is very important for security R&D is the prioritization of response. While I have seen some products and research that recognize the importance of prioritization based on the vulnerability/exploit, basically no security R&D addresses prioritization based on intelligence or attacker identity. Given the following choice, which would you rather detect/block: a stealthy rootkit installed by a botnet for the purpose of identity theft/fraud, or an email containing a link to an exploit which, when visited, gives a sophisticated attacker user level access to the compromised computer? Most academics and many in the security industry would take the former because of the impact on the system, but a small group of security professionals will lean hard toward the latter because of the impact on the organization’s overall mission.

Another important aspect of threat focused response is the relative importance of prevention and detection. For an intelligence driven response model, detection is king, and prevention is a distant second. In fact, in some cases it might actually be beneficial to not mitigate attacker activity if the attack is or will be mitigated further along the attack sequence (or kill chain) and if blocking the attack prevents collection of further threat intelligence (ex. firewall block). On the flip side, being able to detect an attack, even if it wasn’t or couldn’t be blocked, is imperative. If you look at the bigger picture, being able to block an attack is always best, but if you can’t, or didn’t detect it in real time, detecting it in near real time is often almost as good. While many don’t appreciate it, being able to do historical detections, or understanding how intrusions started, including attacker activity preceding the actual attack, is also important to an intelligence driven response.

Lastly, analysis of unsuccessful attacks is almost ignored by conventional tools and research. However, successful incident responders know the importance of analyzing unsuccessful attacks and developing mitigations across all facets of the attack sequence.

People doing security R&D have to learn to build features supporting threat intelligence into their tools and research.

Irrelevant Data Supports Irrelevant Research

One of the biggest hurdles to overcome for basically any sort of research is obtaining good data. The relative dearth of data related to target attacks kills research. If you were a researcher, would you choose a problem for which there is no public data? How could you? Even if you are doing more applied R&D, getting good data isn’t so easy.

There are a couple of approaches to getting data for research: you can either gather the data yourself, or you can use someone else’s data, usually a public data set. The problem with gathering the data yourself is that most researchers will never be able to gather data on targeted attacks. By their very nature, traditional computer security collection mechanisms such as honeypots, honey monkeys, etc will normally never see a targeted attack, and definitely not a persistent campaign of targeted attacks. Even the researchers and vendors that do end up seeing samples representing one phase of targeted attacks, say malware, don’t see the full attack lifecycle. How can you address all phases of the attack if you only see one?

So there are good public data sets and there are some that aren’t so great; however, it seems that once a reasonably valid data set is used, it gets used over and over again. I admire folk who put together quality data sets for the community. One infamous example in the realm of incident detection is the DARPA 99 Intrusion Detection Evaluation dataset. While probably a decent data set at the time, and while memories of winnuke, etc may well be indelibly seared into the minds of some cyber war horses, these sorts of attacks are about as far from targeted attacks as you can get. DARPA 99 has been used and abused for a long time, but people still use it! Why? There aren’t many other options for public data sets. Other decent options for some types of research include packet captures from events like the Defcon CTF and the NSA/West Point competition, but these events are by their very nature very poor sources of persistent and highly targeted attacks.

While it will be necessary to develop good data sets involving targeted attacks, it’s going to be a hard effort. First, to demonstrate a persistent attacker, you need months, even years, of data. As attacks have moved up the protocol stack and become incredibly personalized, sanitizing data is going to be a lot more difficult than scrubbing IP addresses and hostnames. To truly address targeted attacks, tools will have to be configured with information about the data and people using the computer systems (not just the computer systems themselves). What that means for researchers is that to understand the significance of a targeted attack, you have to understand the targeted organization and targeted individuals. Lastly, as incident responders know, to be effective, data needs to be integrated from all phases of the attack, and it comes in all sorts of formats: logs, netflow or packet captures, malware, etc. It’s clear that a perfect public data set for targeted attacks will never exist, but organizations can make steps by releasing older data.

While I doubt that any quality public data sets will be coming soon, organizations need to learn the value of collecting an internal data set. By nature, Incident Responders aren’t always the most disciplined at things like collecting and labeling data for historical purposes, especially considering the conditions in which they operate. Regardless, a little bit of effort to compile historical attack data for future reference, including labeling of data, pays huge dividends both in responding to future attacks and providing good training/test data for new tools.

Keeping quiet about sophisticated targeted attacks kills, among other things, intelligence driven tool R&D. For the technology to catch up with the threat, the problem needs to be discussed publicly and more details need to be shared. Publicly sharing attack information is critical to the research and development required to catch up technologically with sophisticated attacks. If the code of silence isn't broken, incident responders will continue to flounder with mainstream security tools while security tool vendors will continue to have watershed moments.

Saturday, April 3, 2010

Vortex Howto Series: Network Forensics

In my last installment in the vortex howto series, I showed how to use the most basic features of vortex to build a network surveillance tool. In this post, I will demonstrate more features of vortex through the example of an exercise in network forensics.

As stated in the first article, the primary purpose of these howtos is to demonstrate how to use vortex to perform various tasks. I’ll go out of my way to explain some of the capabilities and features of vortex, as many of them aren’t particularly intuitive. In the course of doing so, I’ll compare and contrast vortex with some of the other tools out there. While it will be clear that not much effort is being invested in building the tools demonstrated in this series, the tools should be just interesting enough to demonstrate the type of thing that could be done in conjunction with vortex. Lastly, most of the data analyzed in this series is admittedly lame.

Our goal in this installment will be to use vortex for network forensics. More specifically, we’re going to be doing forensic analysis of a web site that was attacked. In this case, we’re going to be looking at a password guessing attack, but the same techniques would be useful for other attacks such as SQL injection, other protocols tunneled over HTTP, etc.

To further clarify our goals, let’s assume the fact that the attack occurred is known already. What needs to be done is to dissect the attack. We need to understand what the attacker did, how it was done, and what the result was. While the type of data you would collect and how you would report it depends largely upon your goal--legal prosecution, damage assessment, or security intelligence--we’ll take a relatively general approach.

Good public attack data is hard to find. The data for this installment comes from a live production network from which I was able to obtain this packet trace. It is available for download here. I’ve taken all data out except for the data relevant to the single attack we will investigate. Specifically, the web site attacked is a wordpress blog at http://www.elderhaskell.com. The attacker’s address is one about which I know very little, and care even less. Let’s pretend we were notified of a potential attack and asked to investigate. For the sake of simplicity, let’s say all we know is that a potential attack occurred--no additional information or data will be given (e.g. server logs, IDS alerts, etc.). Unfortunately, this sort of engagement is all too common in the realm of network forensics.

At this point most sane people would use wireshark/tshark to get a quick look at the data. Ok, here’s a screenshot from wireshark.

There seems to be a pattern of GETs followed by POSTs for the login page. Looking at a few of the login attempts, it appears the attacker is trying to guess credentials for the site. Were any of the login attempts successful? Were these attempts manual or automated? What was the sequence of events and timing for the various transactions?

While all this information could be extracted and compiled from wireshark/tshark, it would be a very manual process, and I prefer to script things. Furthermore, since the whole point of this blog series is to use vortex, I guess we’d better use it.

Before we extract the streams, let’s look at a few more of the vortex options.

One really important option to understand is the -k option. Why would you ever want to “disable libNIDS TCP/IP checksum processing”? This is useful in cases where legitimate traffic has invalid TCP checksums, usually as an artifact of the capture mechanism. One of the most common causes is that the packet capture is performed on the same machine as the client or the server, so the packet capture libraries never see packets with valid TCP checksums. This happens, for instance, on linux when the kernel, instead of performing the checksum calculation itself, offloads the calculation to the network card, which does it after the point in the TCP/IP stack where the packets are captured. Anyhow, if you are trying to analyze a pcap or perform live analysis with systematically invalid checksums (often 0), try the -k option. Since checksums are rarely legitimately bad, this should have minimal adverse impact in most situations, even though disabling all TCP checksum checks may be more than is absolutely necessary. If you aren’t doing a live capture, another good option is to use tcprewrite, e.g. with the --fixcsum option.

While not particularly relevant in this case, because the pcap has already been filtered to contain only attack-relevant traffic, understanding vortex’s filtering mechanisms is important. Let’s say, for example, we started with a larger pcap that had traffic to and from other clients and servers. To continue the example, let’s say we want to filter out only traffic going to a web server at a given IP, running on port 80. You could use a filter expression like “host <server IP> and tcp port 80” to create a BPF, but there are a few problems with this:

Problem #1: IP frag. While not particularly common, it’s out there and can even make you vulnerable to evasion if you aren’t careful. LibNIDS, on which vortex is based, goes to great lengths to accurately reassemble network traffic without introducing this sort of loophole. The following is taken directly from the libNIDS documentation:

filters like ''tcp dst port 23'' will NOT correctly handle appropriately fragmented traffic, e.g. 8-byte IP fragments; one should add "or (ip[6:2] & 0x1fff != 0)" at the end of the filter to process reassembled packets.

Parenthetically, any filter with “src” or “dst” alone will likely break libNIDS, and therefore vortex, which, unlike some other IDS systems, requires seeing both sides of a conversation. Note, however, that the filter “tcp port 23” is also vulnerable to IP frag evasion as described above.

Problem #2: Packet Filtering isn’t Stream Filtering. Even in the absence of other complications like IP frag, you still might find the filter expression above a little imprecise. While it may sound a little out there, what if our web server’s IP connects as a client, using source port 80, to a server on port 25 on some other machine? If you used the filter above, you’d pick these connections up also, which might just confuse you in your analysis. While you could further convolute your BPF with an expression such as “(dst host <server IP> and dst port 80) or (src host <server IP> and src port 80)”, vortex, via libBSF, provides a better way: stream filtering semantics.
If vortex is compiled with support for libBSF, then the -g and -G options are available. These are analogous to -f and -F except that instead of compiling a BPF, a BSF is compiled. The BSF is applied to each stream as it is established. For the example above, we could use a BSF such as “svr host <server IP> and svr port 80”, which makes it very clear what streams we are looking for. However, since vortex has to do a lot more work to apply a filter to streams than to packets, and since filtering often occurs in an external system that doesn’t know BSF, a BPF or other packet filter is often used in front of a BSF. E.g. we could use something like “host <server IP> and (tcp port 80 or (ip[6:2] & 0x1fff != 0))” as a packet filter in addition to the BSF above.
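Putting the pieces together, a paired packet filter and stream filter might look like the following sketch. The address 192.0.2.10 is a placeholder documentation address standing in for the web server, and the vortex invocation in the comment assumes a build with libBSF support:

```shell
# Placeholder server address (TEST-NET-1, for documentation only)
server=192.0.2.10

# Coarse packet filter: the server's traffic on port 80, plus any
# non-first IP fragments so libNIDS can reassemble them safely
bpf="host $server and (tcp port 80 or (ip[6:2] & 0x1fff != 0))"

# Precise stream filter: only streams whose server side is $server:80
bsf="svr host $server and svr port 80"

echo "packet filter: $bpf"
echo "stream filter: $bsf"
# Hypothetical invocation (requires vortex built with libBSF):
#   vortex -r big_capture.pcap -e -t streams -f "$bpf" -g "$bsf"
```

The BPF does the cheap, coarse cut; the BSF then states the intent unambiguously in stream terms.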

One other option that should be mentioned is -v. The -v option outputs empty streams. Why would you want to do that? If you ask vortex to provide both to server and to client streams, it will always give you the to server stream and then the to client stream. This pairing and ordering is guaranteed, except in the case where one of the simplex streams is empty but the other half of the conversation is not. By default, empty simplex streams in an active (albeit one-sided) conversation are not output. Imagine you have an analyzer that expects both files; some TCP streams may only produce one file, which may throw your processing off. The -v option rectifies this, ensuring to server and to client streams are always paired, by creating empty files when necessary.

Probably the most important option for the task at hand is -e. The -e option causes quite a bit more metadata to be put in the filenames, which then look something like tcp-1-1229100756-1229100756-c-390-. The README provides good information on how to decode this metadata:


We’ll be using some of this extended metadata for this task, namely the serial number and timestamps. This extended metadata is one clear reason why you would use vortex over something like tcpflow for this type of task. While it might sound far-fetched, I’ve run into situations where in a short space of time, the tcp quads were repeated and output files got clobbered by this. One other thing worth noting is that the connection_size metadata is the size of the data collected from both flows, and as such, the only difference in the filename for the to server and to client flows is the single character direction flag which is either “s” or “c”.
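As a quick illustration, the fields can be peeled out of a filename with awk. The filename below is the (truncated) example from above; the field meanings follow the description in the text:

```shell
# Truncated example filename following the -e naming pattern
f="tcp-1-1229100756-1229100756-c-390-"

serial=`echo $f | awk -F- '{ print $2 }'`  # connection serial number
start=`echo $f | awk -F- '{ print $3 }'`   # timestamp
end=`echo $f | awk -F- '{ print $4 }'`     # timestamp
dir=`echo $f | awk -F- '{ print $5 }'`     # direction flag: s or c
size=`echo $f | awk -F- '{ print $6 }'`    # connection size (both flows)

echo "$serial $start $end $dir $size"
# prints: 1 1229100756 1229100756 c 390
```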
With that background instruction, let’s extract the flows:
$ mkdir streams
$ vortex -r net_4n6_data.pcap -v -e -t streams
Couldn't set capture thread priority!
X_BSF: 0
VTX_EST: 92
VTX_CLOSE: 92

Before we continue, a little explanation of the output at the end is in order. The ERRORS and STATS printouts show various error counts and statistics. Paying attention to these is a good thing to do. The README provides details of what they mean, and vortex provides hints in many cases where a certain class of error is strongly indicative of a possible problem. Zero errors is always a good thing. Just like tcpdump, and most pcap-based apps for that matter, vortex doesn’t report packet received/dropped counts for dead captures. VTX_EST: 92 tells us that 92 TCP connections were monitored, and VTX_CLOSE: 92 tells us that all 92 were closed with a normal TCP close (the FIN/ACK business).

Ok, let’s get down to some real forensics.

Since this post is already long, I’m not going to include my complete analysis notes, but if you’d like to view them, they are here. In the course of looking at the data, I developed a script to summarize the requests and responses. It is as follows:

#!/bin/sh
# summarize.sh: reads "to server" stream filenames on stdin, one per line
while read line; do
	id=`echo $line | awk -F- '{ print $2 }'`;
	timestamp=`echo $line | awk -F- '{ print $4 }'`;
	time=`date +%H:%M:%S -d @$timestamp`;
	date=`date -d @$timestamp`;
	action=`head -n 1 $line | awk '{ print $1" "$2}'`;
	req_digest=`grep -v -E "^(Content-Length|log=)" $line | \
		md5sum | head -c 6`;
	resp_digest=`echo $line | sed s/s/c/ | xargs grep \
		-v -E "^(Date|Last-Modified)" | md5sum | head -c 6`;
	creds=`grep -E "^log=" $line | awk -F'&' '{ print \
		$1" "$2 }' | sed -r 's/(log=|pwd=)//g'`;
	echo "$id $time $action $req_digest $resp_digest $creds";
done

When executed it creates a summary as follows:
$ ls tcp*s* | sort -k 2 -g -t- | ./summarize.sh
1 10:12:01 GET /wp-login.php 17beda 1be989
2 10:12:02 POST /wp-login.php 3f8d0a 0f713b admin admin
3 10:12:03 GET /wp-login.php 43ab32 1be989
4 10:12:04 POST /wp-login.php 3f8d0a 0f713b admin simple1
5 10:12:05 GET /wp-login.php 43ab32 1be989
6 10:12:06 POST /wp-login.php 3f8d0a 0f713b admin password
7 10:12:07 GET /wp-login.php 43ab32 1be989
8 10:12:08 POST /wp-login.php 3f8d0a 0f713b admin 123456
9 10:12:08 GET /wp-login.php 43ab32 1be989
10 10:12:09 POST /wp-login.php 3f8d0a 0f713b admin qwerty
11 10:12:10 GET /wp-login.php 43ab32 1be989
12 10:12:11 POST /wp-login.php 3f8d0a 0f713b admin abc123
...
89 10:13:20 GET /wp-login.php 43ab32 1be989
90 10:13:21 POST /wp-login.php 3f8d0a 98ee8f wp-admin wp_password
91 10:13:22 GET /wp-login.php 43ab32 1be989
92 10:13:23 POST /wp-login.php 3f8d0a 98ee8f wp-admin wpadmin

The first column is the stream number. The second column is the time. The third column is the HTTP method (GET or POST). The fourth is the HTTP resource, which is the same for all activity. The fifth column is a digest (the first 6 characters of the md5) of the request stream, with variable data such as the Content-Length header and the form data (credentials) removed. Similarly, the sixth column is a digest of the response, minus the Date and Last-Modified headers. These digests allow us to quickly see which requests/responses are the same so that we can manually inspect the few unique requests and responses.
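To make the digest trick concrete, here is a minimal sketch with two made-up requests that differ only in their volatile fields; after stripping those, the digests match:

```shell
# Two fabricated requests differing only in Content-Length and credentials
req1='POST /wp-login.php HTTP/1.1
Content-Length: 42
log=admin&pwd=admin'
req2='POST /wp-login.php HTTP/1.1
Content-Length: 45
log=admin&pwd=password'

# Same normalization as the summary script: drop volatile lines, then hash
d1=`printf '%s\n' "$req1" | grep -v -E '^(Content-Length|log=)' | md5sum | head -c 6`
d2=`printf '%s\n' "$req2" | grep -v -E '^(Content-Length|log=)' | md5sum | head -c 6`

echo "$d1 $d2"   # the two digests come out identical
```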

A quick analysis of the summaries and the unique requests/responses shows that the attacker followed a set pattern of a GET followed by a POST. The attacker tried a list of 23 passwords for each of the two usernames: admin and wp-admin. None of the attempts to log in were successful.
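The arithmetic behind that count can be sanity-checked directly from the stream totals:

```shell
streams=92                 # total TCP connections vortex reported
guesses=$((streams / 2))   # each guess is one GET plus one POST
per_user=$((guesses / 2))  # split across the two usernames tried
echo "$guesses guesses, $per_user passwords per username"
# prints: 46 guesses, 23 passwords per username
```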

It’s highly likely this was an automated attack. It’s also likely it wasn’t particularly targeted, as it appears the same attacker was hitting other web sites at approximately the same time. For example, note the following entry from http://northstarlearning.org/logs/access_100222.log: - - [20/Feb/2010:20:33:28 -0800] "GET /logs/access_091214.logwp-login.php HTTP/1.1" 404 - "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)" "northstarlearning.org"

Usually forensics involves some sort of report. We’ve made ours in the form of an interactive timeline of the attack using SIMILE Timeplot.

If the timeline doesn't appear, or to view it at full width, try here.

The timeline shows the stream number, the HTTP method, and the username and password, if applicable. Clicking on each event summary brings up more details: The TCP parameters, request and response hashes (minus variable data as mentioned above), and the timestamp.

While the attack we analyzed wasn’t particularly special nor interesting, I hope it is clear how vortex could be applied to other situations. For example, if tracking a sophisticated and persevering attacker, much information could be extracted to collect security intelligence, aiding in a threat focused defense. For a more traditional vulnerability focused security approach, there is much information that could be used to drive future mitigations.

I’ve demonstrated how to use vortex to perform network forensics. One clear advantage this approach has over manually inspecting every packet in wireshark/tshark is scalability: we can easily process large amounts of data using simple scripting. While tshark allows this sort of approach for a large set of protocols by letting the user select fields to display, if one needs to analyze protocols or payload data not supported by tshark, using vortex and an external analyzer is often a wise approach. While I’ve written a pretty lame shell script in this example, many would use a more powerful programming language and its associated repository of protocol parsing code for simple access to the data. For example, using perl and HTTP::Parser would make sense for this sort of thing. Vortex also has some small but significant advantages over the likes of tcpflow because of features such as the extended metadata.

In future installments in this series, we’ll show how to use vortex to perform near real-time intrusion detection on a live network, and then how to do deep content analysis in a highly scalable manner suitable for today’s highly parallel general-purpose systems.