Saturday, April 3, 2010

Vortex Howto Series: Network Forensics

In my last installment in the vortex howto series, I showed how to use the most basic features of vortex to build a network surveillance tool. In this post, I will demonstrate more features of vortex through the example of an exercise in network forensics.

As stated in the first article, the primary purpose of these howtos is to demonstrate how to use vortex to perform various tasks. I'll go out of my way to explain some of the capabilities and features of vortex, as many of them aren't particularly intuitive. In the course of doing so, I'll compare and contrast vortex with some of the other tools out there. While it will be clear that not much effort is being invested in building the tools demonstrated in this series, the tools should be just interesting enough to demonstrate the type of thing that could be done in conjunction with vortex. Lastly, most of the data analyzed in this series is admittedly lame.

Our goal in this installment will be to use vortex for network forensics. More specifically, we're going to be doing forensic analysis of a web site that was attacked. In this case, we're going to be looking at a password guessing attack, but the same techniques would be useful for other attacks such as SQL injection, other protocols tunneled over HTTP, etc.

To further clarify our goals, let's assume it is already known that the attack occurred. What needs to be done is to dissect the attack. We need to understand what the attacker did, how it was done, and what the result was. While the type of data you would collect and how you would report it depends largely upon your goal--legal prosecution, damage assessment, or security intelligence--we'll take a relatively general approach.

Good public attack data is hard to find. The data for this installment comes from a live production network from which I was able to obtain this packet trace. It is available for download here. I've taken all data out except for the data relevant to the single attack we will investigate. Specifically, the web site attacked is a wordpress blog at http://www.elderhaskell.com. The attacker's address is 193.226.51.2, about which I know very little, and care even less. Let's pretend we were notified of a potential attack and asked to investigate. For the sake of simplicity, let's say all we know is that a potential attack occurred--no additional information or data will be given (e.g. server logs, IDS alerts, etc.). Unfortunately, this sort of engagement is all too common in the realm of network forensics.

At this point most sane people would use wireshark/tshark to get a quick look at the data. Ok, here’s a screenshot from wireshark.



There seems to be a pattern of GETs followed by POSTs for the login page. Looking at a few of the login attempts, the attacker appears to be trying to guess credentials for the site. Were any of the login attempts successful? Were these attempts manual or automated? What was the sequence of events and timings for the various transactions?

While all this information could be extracted and compiled from wireshark/tshark, this would be a very manual process. I'd rather script things. Furthermore, since the whole point of this blog series is to use vortex, I guess we'd better use it.

Before we extract the streams, let’s look at a few more of the vortex options.

One really important option to understand is the -k option. Why would you ever want to "disable libNIDS TCP/IP checksum processing"? This is useful in cases where legitimate traffic has invalid TCP checksums, usually because of an artifact of the capture mechanism. One of the most common reasons for this is that packet captures are performed on the same machine as the client or the server, so the packet capture libraries never see packets with valid TCP checksums. This happens, for instance, on linux when the kernel offloads the checksum calculation to the network card instead of performing it itself; the checksum is then computed after the point in the TCP/IP stack where the packets are captured. Anyhow, if you are trying to analyze a pcap or perform live analysis with systematically invalid checksums (often 0), try the -k option. Since checksums are rarely legitimately bad, this should have minimal adverse impact in most situations, even though disabling all TCP checksum checks may be more than is absolutely necessary. If you aren't doing a live capture, another good option is to use tcprewrite, e.g. its --fixcsum option.
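For example, if a capture taken on the server itself shows consistently bad checksums, either of the following approaches should work (the pcap name below is just a placeholder):

$ vortex -k -r capture_from_server.pcap -e -t streams
$ tcprewrite --fixcsum --infile=capture_from_server.pcap --outfile=capture_fixed.pcap
$ vortex -r capture_fixed.pcap -e -t streams

The first invocation simply tells vortex to skip checksum validation; the second repairs the checksums up front so that any downstream tool will accept the rewritten pcap.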

While not particularly relevant in this case because the pcap has already been filtered to only contain attack relevant traffic, understanding vortex filtering mechanisms is important. Let's say, for example, we started with a larger pcap that had traffic to and from other clients and servers. To continue the example, let's say we want to select only the traffic going to a web server with an IP of 192.168.1.1 and running on port 80. You could use a filter expression like "host 192.168.1.1 and tcp port 80" to create a BPF, but there are a few problems with this:

Problem #1: IP frag. While not particularly common, it's out there and can even make you vulnerable to evasion if you aren't careful. LibNIDS, on which vortex is based, goes to great lengths to accurately reassemble network traffic without introducing this sort of loophole. The following is taken directly from the libNIDS documentation:

filters like ''tcp dst port 23'' will NOT correctly handle appropriately fragmented traffic, e.g. 8-byte IP fragments; one should add "or (ip[6:2] & 0x1fff != 0)" at the end of the filter to process reassembled packets.

Parenthetically, any filter using "src" or "dst" alone will likely break libNIDS, and therefore vortex, because unlike some other IDS systems they require seeing both sides of a conversation. Even without "src" or "dst", the filter "tcp port 23" is still vulnerable to IP frag evasion as described above.

Problem #2: Packet Filtering isn't Stream Filtering. Even in the absence of other complications like IP frag, you still might find the filter expression above a little imprecise. While it may sound a little out there, what if 192.168.1.1 connects as a client using port 80 to a server on port 25 on 10.1.1.1? If you used the filter above, you'd pick these connections up also, which might just confuse you in your analysis. While you could further convolute your BPF with an expression such as "(dst host 192.168.1.1 and dst port 80) or (src host 192.168.1.1 and src port 80)", vortex, via libBSF, provides a better way: stream filtering semantics.
If vortex is compiled with support for libBSF, then the -g and -G options are available. These are analogous to -f and -F except that instead of compiling a BPF, a BSF is compiled. The BSF is applied to each stream as it is established. For the example above, we could use a BSF such as "svr host 192.168.1.1 and svr port 80", which makes it very clear which streams we are looking for. However, since vortex has to do a lot more work to apply a filter to streams than to packets, and since filtering often occurs in an external system that doesn't know BSF, a BPF or other packet filter is often used in front of a BSF. For example, we could use something like "host 192.168.1.1 and (tcp port 80 or (ip[6:2] & 0x1fff != 0))" as a packet filter in addition to the BSF above.
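Putting those pieces together, a minimal sketch of such an invocation against a larger capture might look something like the following (the pcap name is hypothetical; the filter expressions are the ones discussed above, and this assumes a vortex built with libBSF support):

$ vortex -r big_capture.pcap -e -t streams \
    -f "host 192.168.1.1 and (tcp port 80 or (ip[6:2] & 0x1fff != 0))" \
    -g "svr host 192.168.1.1 and svr port 80"

The frag-tolerant BPF keeps the packet-level work cheap, while the BSF guarantees that only streams where 192.168.1.1 is actually the server on port 80 get written out.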

One other option that should be mentioned is -v. The -v option outputs empty streams. Why would you want to do that? If you ask vortex to provide both to server and to client streams, it will always give you the to server stream and then the to client stream. This pairing and ordering is guaranteed, except in the case where one of the simplex streams is empty but the other half of the conversation is not. By default, empty simplex streams in an active (albeit one-sided) conversation are not output. Imagine you have an analyzer that expects both files; some TCP connections may yield only one file, which may throw your processing off. The -v option rectifies this, ensuring that to server and to client streams are always paired by creating empty files when necessary.
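As a contrived illustration of why that pairing guarantee matters, a downstream script can read the file names two at a time and always hand an analyzer both halves of the conversation. This assumes, in keeping with vortex's pipeline-oriented design, that the stream file names are what arrives on standard output; analyze_pair.sh is a hypothetical analyzer:

$ vortex -r net_4n6_data.pcap -v -e -t streams | while read to_server
do
    read to_client                               # with -v, the to client file always follows
    ./analyze_pair.sh "$to_server" "$to_client"  # hypothetical analyzer that wants both halves
done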

Probably the most important option for the task at hand is -e. The -e option causes quite a bit more metadata to be put in the filenames. Files go from looking like 10.1.1.1:1954s172.16.1.1:80 to tcp-1-1229100756-1229100756-c-390-10.1.1.1:1954s172.16.1.1:80. The README provides good information on how to decode this metadata:

{proto}-{connection_serial_number}-{connection_start_time}-{connection_end_time}-{connection_end_reason}-{connection_size}-{client_ip}:{client_port}{direction}{server_ip}:{server_port}

We'll be using some of this extended metadata for this task, namely the serial number and timestamps. This extended metadata is one clear reason why you would use vortex over something like tcpflow for this type of task. While it might sound far-fetched, I've run into situations where, in a short space of time, the same TCP quad was reused and output files got clobbered as a result. One other thing worth noting is that the connection_size metadata is the size of the data collected from both flows; as such, the only difference in the filenames for the to server and to client flows is the single character direction flag, which is either "s" or "c".
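As a quick illustration, the fields of an extended filename can be pulled apart with the same sort of shell one-liners we'll use below (the example filename is the one from the README discussion above):

$ f=tcp-1-1229100756-1229100756-c-390-10.1.1.1:1954s172.16.1.1:80
$ echo $f | awk -F- '{ print "serial="$2, "start="$3, "end="$4, "reason="$5, "size="$6 }'
serial=1 start=1229100756 end=1229100756 reason=c size=390
$ date -d @1229100756    # render the start time in human readable form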
With that background out of the way, let's extract the flows:
$ mkdir streams
$ vortex -r net_4n6_data.pcap -v -e -t streams
Couldn't set capture thread priority!
streams/tcp-1-1266678719-1266678721-c-2916-193.226.51.2:16118s66.173.221.158:80
streams/tcp-1-1266678719-1266678721-c-2916-193.226.51.2:16118c66.173.221.158:80
...
streams/tcp-92-1266678802-1266678803-c-3460-193.226.51.2:26574s66.173.221.158:80
streams/tcp-92-1266678802-1266678803-c-3460-193.226.51.2:26574c66.173.221.158:80
VORTEX_ERRORS TOTAL: 0 IP_SIZE: 0 IP_FRAG: 0 IP_HDR: 0 IP_SRCRT: 0 TCP_LIMIT: 0 TCP_HDR: 0 TCP_QUE: 0 TCP_FLAGS: 0 UDP_ALL: 0 SCAN_ALL: 0 VTX_RING: 0 OTHER: 0
VORTEX_STATS PCAP_RECV: 0 PCAP_DROP: 0 VTX_BYTES: 296501 VTX_EST: 92 VTX_WAIT: 0 VTX_CLOSE_TOT: 92 VTX_CLOSE: 92 VTX_LIMIT: 0 VTX_POLL: 0 VTX_TIMOUT: 0 VTX_IDLE: 0 VTX_RST: 0 VTX_EXIT: 0 VTX_BSF: 0

Before we continue, a little explanation of the output at the end is in order. The ERRORS and STATS printouts show various error counts and statistics. Paying attention to these is a good habit. The README provides details of what these mean, and vortex provides hints in many cases where a certain class of error is strongly indicative of a possible problem. Zero errors is always a good thing. Just like tcpdump, and most pcap-based apps for that matter, vortex doesn't report packet received/dropped counts for dead captures. The VTX_EST: 92 tells us that 92 TCP connections were monitored, and the VTX_CLOSE: 92 tells us that all 92 were closed with a normal TCP close (the FIN/ACK business).

Ok, let’s get down to some real forensics.

Since this post is already long, I'm not going to include my complete analysis notes, but if you'd like to view them, they are here. In the course of looking at the data, I developed a script to summarize the requests and responses. It is as follows:
#!/bin/bash
# summarize.sh: reads vortex to-server stream file names on stdin and prints
# a one-line summary of each HTTP transaction

while read line
do
    # connection serial number and end time from the -e filename metadata
    id=`echo $line | awk -F- '{ print $2 }'`;
    timestamp=`echo $line | awk -F- '{ print $4 }'`;
    time=`date +%H:%M:%S -d @$timestamp`;
    date=`date -d @$timestamp`;
    # HTTP method and resource from the request line
    action=`head -n 1 $line | awk '{ print $1" "$2 }'`;
    # digest of the request with variable data (Content-Length, form data) removed
    req_digest=`grep -v -E "^(Content-Length|log=)" $line | md5sum | head -c 6`;
    # digest of the matching to-client file (flip the direction flag), minus variable headers
    resp_digest=`echo $line | sed s/s/c/ | xargs grep -v -E "^(Date|Last-Modified)" | md5sum | head -c 6`;
    # username and password from the POST body, if present
    creds=`grep -E "^log=" $line | awk -F'&' '{ print $1" "$2 }' | sed -r 's/(log=|pwd=)//g'`;
    echo "$id $time $action $req_digest $resp_digest $creds";
done

When executed it creates a summary as follows:
$ ls tcp*s* | sort -k 2 -g -t- | ./summarize.sh
1 10:12:01 GET /wp-login.php 17beda 1be989
2 10:12:02 POST /wp-login.php 3f8d0a 0f713b admin admin
3 10:12:03 GET /wp-login.php 43ab32 1be989
4 10:12:04 POST /wp-login.php 3f8d0a 0f713b admin simple1
5 10:12:05 GET /wp-login.php 43ab32 1be989
6 10:12:06 POST /wp-login.php 3f8d0a 0f713b admin password
7 10:12:07 GET /wp-login.php 43ab32 1be989
8 10:12:08 POST /wp-login.php 3f8d0a 0f713b admin 123456
9 10:12:08 GET /wp-login.php 43ab32 1be989
10 10:12:09 POST /wp-login.php 3f8d0a 0f713b admin qwerty
11 10:12:10 GET /wp-login.php 43ab32 1be989
12 10:12:11 POST /wp-login.php 3f8d0a 0f713b admin abc123
...
89 10:13:20 GET /wp-login.php 43ab32 1be989
90 10:13:21 POST /wp-login.php 3f8d0a 98ee8f wp-admin wp_password
91 10:13:22 GET /wp-login.php 43ab32 1be989
92 10:13:23 POST /wp-login.php 3f8d0a 98ee8f wp-admin wpadmin

The first column is the stream number. The second column is the time. The third column is the HTTP method (GET or POST). The fourth is the HTTP resource, which is the same for all activity. The fifth column is a digest (first 6 characters of the md5) of the request stream, with variable data such as the Content-Length header and the form data (credentials) removed. Similarly, the sixth column is a digest of the response, with the Date and Last-Modified headers removed. The last two columns, when present, are the username and password submitted in the POST. These digests allow us to quickly see which requests/responses are the same so that we can manually inspect the few unique requests and responses.
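For instance, saving the summary and counting distinct digest values makes the handful of unique requests and responses obvious (summary.txt is just an arbitrary name for the saved output):

$ ls tcp*s* | sort -k 2 -g -t- | ./summarize.sh > summary.txt
$ awk '{ print $5 }' summary.txt | sort | uniq -c    # count each distinct request digest
$ awk '{ print $6 }' summary.txt | sort | uniq -c    # count each distinct response digest

From there, grepping the summary for a particular digest points straight at an example stream file to inspect by hand.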

A quick analysis of the summaries and the unique requests/responses shows that the attacker followed a set pattern of a GET followed by a POST. The attacker tried a list of 23 passwords for each of the two usernames: admin and wp-admin. None of the attempts to log in were successful.

It's highly likely this was an automated attack. It's also likely it wasn't particularly targeted, as it appears the same attacker was hitting other websites at approximately the same time. For example, the following entry from http://northstarlearning.org/logs/access_100222.log shows the same address probing for wp-login.php:

193.226.51.2 - - [20/Feb/2010:20:33:28 -0800] "GET /logs/access_091214.logwp-login.php HTTP/1.1" 404 - "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)" "northstarlearning.org"

Usually forensics involves some sort of report. We've made a report in the form of an interactive timeline of the attack using simile timeplot.




If the timeline doesn't appear, or to view it at full width, try here.

The timeline shows the stream number, the HTTP method, and the username and password, if applicable. Clicking on each event summary brings up more details: the TCP parameters, request and response hashes (minus variable data, as mentioned above), and the timestamp.

While the attack we analyzed wasn't particularly special or interesting, I hope it is clear how vortex could be applied to other situations. For example, if tracking a sophisticated and persistent attacker, much information could be extracted to collect security intelligence, aiding in a threat focused defense. For a more traditional vulnerability focused security approach, there is much information that could be used to drive future mitigations.

I've demonstrated how to use vortex to perform network forensics. One clear advantage that this approach has over manually inspecting every packet in wireshark/tshark is scalability: we can easily process large amounts of data using simple scripting. While tshark allows this sort of approach for a large set of protocols by letting the user select fields to display, if one needs to analyze protocols or payload data not supported by tshark, using vortex and an external analyzer is often a wise approach. While I've created a pretty lame shell script in this example, many would use more powerful programming languages and their associated repositories of protocol parsing code to get simple access to the data. For example, using perl and HTTP::Parser would make sense for this sort of thing. Vortex also has some small but significant advantages over the likes of tcpflow because of features such as the extended metadata.

In future installments in this series on how to use vortex, we'll show how to perform near real time intrusion detection on a live network, and then how to do deep content analysis in a highly scalable manner suitable for today's highly parallel general purpose systems.
