Thursday, January 13, 2011

Gnawing on HTTP 206 Fragmented Payloads with Ruminate

I've been madly working on getting Ruminate to a point where I can recommend it to people in industry for use, hopefully by the end of January 2011. I've done a huge amount of work on HTTP decoding including a working implementation of HTTP 206 defragmentation which I consider a "killer feature" when dealing with payloads transferred through the network. I wanted to take a break from the documentation and code packaging that Ruminate so badly needs to discuss the importance of this mechanism, along with some examples. This discussion should also help clarify the areas where Ruminate is seeking to innovate.

HTTP 206 Partial Content



As NIDS begin to earnestly address true layer 7 decoding and embedded object analysis (ex. files transferred through the network), they will run into complications like HTTP 206. I haven't heard much about HTTP 206 defrag, so I assume this isn't on most people's radar.

What is HTTP 206? It's basically HTTP's method of fragmenting payload objects. 206 is the response code, just like 200 or 404. If you want to download just part of a file, you can ask the server to give you a specific set (or sets) of bytes and compliant servers will respond with only the data you asked for via a 206 response.
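To make the mechanics concrete, here is a small Python sketch of mine (not part of Ruminate) of the two headers that drive a 206 exchange: the Range header the client sends and the Content-Range header a compliant server returns. The byte values are taken from example A below.

```python
import re

def build_range_header(start, end):
    # Ask for bytes start..end inclusive, e.g. "Range: bytes=0-32767"
    return "Range: bytes=%d-%d" % (start, end)

def parse_content_range(value):
    # Parse "bytes <start>-<end>/<total>" from a 206 response's Content-Range
    m = re.match(r"bytes (\d+)-(\d+)/(\d+)", value)
    return tuple(map(int, m.groups())) if m else None

print(build_range_header(0, 32767))
print(parse_content_range("bytes 0-32767/555523"))
```

The `/total` suffix in Content-Range is what lets a passive observer know how big the whole object is before all the fragments have been seen.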

If you're not looking for malicious content in HTTP 206 transactions, you should be. Who really cares about HTTP 206 transactions if they represent a very small number of total HTTP transactions on a network? One oft-overlooked detail is that HTTP 206 is actually used to transfer a significant amount (often up to 20%) of the most interesting payloads, such as PDF documents or PE executables. Even though HTTP 206 is often used naively by unwitting clients, it transfers malicious content just as well as benign content, making life harder for your NIDS in the process.

Layer 7 and Embedded Object Defrag


One of Ruminate's goals is to address layer 7 and payload object analysis with the same level of vigor that current NIDS address layer 3 and layer 4. Part of this analysis necessarily involves layer 7 and payload object defrag/reassembly just like layer 3 and layer 4 defrag/reassembly have been big topics for the current generation NIDS. HTTP 206 is a perfect example of layer 7 fragmentation that is loosely analogous to ipfrag, etc. What is an example of client application object fragmentation? Imagine you have malicious javascript and you want to evade NIDS that are smart enough to decode basic javascript obfuscation like hex armoring. One option is to split your javascript across multiple files (which all get included at run time), possibly across multiple servers/domains.

The next release of Ruminate will include thousands of lines of new and improved HTTP parsing code, including a new 206defrag service. When an individual HTTP parser node comes across an HTTP 206 response, it feeds the fragmented payload to the 206defrag service, which does the defragmentation. When the 206defrag service has all the pieces of the file, the reassembled payload is passed through the object multiplexer to the appropriate analysis service(s), ex. PDF.
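The bookkeeping behind such a service can be sketched in a few lines of Python. This is a toy model of mine, not Ruminate's actual implementation; in particular, the key fields (client IP, host, URI) are my guess at what identifies an object across transactions:

```python
class DefragBuffer:
    """Toy 206defrag model: buffer fragments per object, emit when complete."""

    def __init__(self):
        # (object_key, total_len) -> {offset: payload bytes}
        self.pending = {}

    def add_fragment(self, object_key, total_len, offset, data):
        """Buffer one fragment; return the full payload once it is complete."""
        frags = self.pending.setdefault((object_key, total_len), {})
        frags[offset] = data
        return self._try_assemble(object_key, total_len)

    def _try_assemble(self, object_key, total_len):
        frags = self.pending[(object_key, total_len)]
        out, pos = bytearray(), 0
        for offset in sorted(frags):
            data = frags[offset]
            if offset > pos:
                return None                 # gap: a fragment is still missing
            if offset + len(data) > pos:
                out += data[pos - offset:]  # drop any overlapping prefix
                pos = offset + len(data)
        if pos < total_len:
            return None                     # tail not yet seen
        del self.pending[(object_key, total_len)]
        return bytes(out)

# hypothetical key: (client IP, host, URI)
key = ("10.101.84.70", "cs.gmu.edu", "/paper.pdf")
buf = DefragBuffer()
print(buf.add_fragment(key, 10, 5, b"56789"))   # not complete yet
print(buf.add_fragment(key, 10, 0, b"01234"))   # reassembled payload
```

Fragments can arrive in any order and even from different TCP streams; the buffer only releases an object once the fragments tile its full length.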

I'm very pleased at the progress I've made to address HTTP 206. First of all, it actually works! In operation so far, I've been able to look at a lot of interesting payloads that I wouldn't have been able to otherwise.

I wanted to share some examples that demonstrate uses of HTTP 206 in the wild. The first example is very straightforward and is the type of thing you'll see most often. The other two examples demonstrate characteristics that are less common, but still happen in the real world. None of the examples were contrived or fabricated--they were taken from real network traffic that I had no direct influence on. I will, however, use them to show what I believe to be useful functionality of Ruminate. I anonymized the client IP addresses, but other than that, the data is just as observed. Note that other than interesting examples of HTTP 206 in action, there is absolutely no malicious, sensitive, private, or otherwise interesting data in the pcaps. The 206_examples.zip download includes the pcaps of the examples and the relevant logs from Ruminate. For those stout of heart enough to actually tinker with Ruminate in its current state, I've also included the new HTTP code in the download.

Example A


Example A is a canonical example of HTTP 206 fragmentation. Let’s start with the logs:

[csmutz@master 206_examples]$ cat http_a.log
Jan 12 01:47:39 node1 http[26350]: tcp-198786717-1294814857-1294814859-c-33510-10.101.84.70:10977c129.174.93.161:80_http-0 1.1 GET cs.gmu.edu /~tr-admin/papers/GMU-CS-TR-2010-20.pdf 0 32768 206 1292442029 application/pdf TG ALHEk http://cs.gmu.edu/~tr-admin/papers/GMU-CS-TR-2010-20.pdf - "zh-CN,zh;q=0.8" "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10" "Apache"
Jan 12 01:47:39 master 206defrag: input tcp-198786717-1294814857-1294814859-c-33510-10.101.84.70:10977c129.174.93.161:80_http-0 555523 0 32768 cs.gmu.edu /~tr-admin/papers/GMU-CS-TR-2010-20.pdf 10.101.84.70
Jan 12 01:48:17 node4 http[26947]: tcp-198787353-1294814861-1294814896-c-523548-10.101.84.70:10978c129.174.93.161:80_http-0 1.1 GET cs.gmu.edu /~tr-admin/papers/GMU-CS-TR-2010-20.pdf 0 522755 206 1292442029 application/pdf TG ALHEk http://cs.gmu.edu/~tr-admin/papers/GMU-CS-TR-2010-20.pdf - "zh-CN,zh;q=0.8" "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10" "Apache"
Jan 12 01:48:17 master 206defrag: input tcp-198787353-1294814861-1294814896-c-523548-10.101.84.70:10978c129.174.93.161:80_http-0 555523 32768 522755 cs.gmu.edu /~tr-admin/papers/GMU-CS-TR-2010-20.pdf 10.101.84.70
Jan 12 01:48:17 master 206defrag: output tcp-198786717-1294814857-1294814859-c-33510-10.101.84.70:10977c129.174.93.161:80_http-0_206defrag normal 2 555523 5a484ada9c816c0e8b6d2d3978e3f503 tcp-198786717-1294814857-1294814859-c-33510-10.101.84.70:10977c129.174.93.161:80_http-0,tcp-198787353-1294814861-1294814896-c-523548-10.101.84.70:10978c129.174.93.161:80_http-0
[csmutz@master 206_examples]$ cat object_a.log
Jan 12 01:48:17 master object_mux[11977]: tcp-198786717-1294814857-1294814859-c-33510-10.101.84.70:10977c129.174.93.161:80_http-0_206defrag 555523 5a484ada9c816c0e8b6d2d3978e3f503 pdf PDF document, version 1.4

Unfortunately I don't have time to explain the log formats in full. Hopefully I'll document that somewhere more accessible than the code soon :). The first log line shows the first HTTP transaction, in which the client asks the server for the first 32k of the PDF and the server obliges.

Headers are as follows:

GET /~tr-admin/papers/GMU-CS-TR-2010-20.pdf HTTP/1.1
Host: cs.gmu.edu
Connection: keep-alive
Referer: http://cs.gmu.edu/~tr-admin/papers/GMU-CS-TR-2010-20.pdf
Accept: */*
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10
Accept-Encoding: gzip,deflate,sdch
Accept-Language: zh-CN,zh;q=0.8
Accept-Charset: GBK,utf-8;q=0.7,*;q=0.3
Range: bytes=0-32767

HTTP/1.1 206 Partial Content
Date: Wed, 12 Jan 2011 06:47:37 GMT
Server: Apache
Last-Modified: Wed, 15 Dec 2010 19:40:29 GMT
ETag: "56010f-87a03-497781c080540"
Accept-Ranges: bytes
Content-Length: 32768
Content-Range: bytes 0-32767/555523
Connection: close
Content-Type: application/pdf

That’s all straightforward. The HTTP parser realizes that it doesn’t have a complete payload object so instead of passing it to the object multiplexer it sends it to the 206defrag service. The next log line shows the 206defrag service receiving this fragment. Since it doesn’t have the whole object yet, it holds on to it.

After sampling the first 32k, the client gets the rest of the PDF. Headers as follows:

GET /~tr-admin/papers/GMU-CS-TR-2010-20.pdf HTTP/1.1
Host: cs.gmu.edu
Connection: keep-alive
Referer: http://cs.gmu.edu/~tr-admin/papers/GMU-CS-TR-2010-20.pdf
Accept: */*
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10
Accept-Encoding: gzip,deflate,sdch
Accept-Language: zh-CN,zh;q=0.8
Accept-Charset: GBK,utf-8;q=0.7,*;q=0.3
Range: bytes=32768-555522
If-Range: "56010f-87a03-497781c080540"

HTTP/1.1 206 Partial Content
Date: Wed, 12 Jan 2011 06:47:41 GMT
Server: Apache
Last-Modified: Wed, 15 Dec 2010 19:40:29 GMT
ETag: "56010f-87a03-497781c080540"
Accept-Ranges: bytes
Content-Length: 522755
Content-Range: bytes 32768-555522/555523
Connection: close
Content-Type: application/pdf

Again, this is very straightforward. The client gets the rest of the file. Note the "ETag" and "If-Range" headers. If clients and servers consistently used this convention it might make reassembly easier. Alas, it's frequently not used. The server was nice enough to report a content type of "application/pdf" for both fragments and doesn't use any other content-encoding or transfer-encoding, etc. If only all transactions were this simple!

After receiving the 2nd fragment on the 4th log line, the 206defrag service realizes it has the whole payload now. Line 5 shows the service sending this payload object off for analysis. In line 6 the object multiplexer decides to send this file on to the PDF analyzer. Not shown here, but the PDF analysis service deems this PDF well worth the time reading :)
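As a quick sanity check (mine, not Ruminate output), the offsets and lengths reported in the two 206defrag "input" log lines do tile the object exactly:

```python
# (offset, length) pairs taken from the two 206defrag "input" log lines
frags = [(0, 32768), (32768, 522755)]
total = 555523  # full object size reported in both log lines

pos = 0
for offset, length in sorted(frags):
    # each fragment must start exactly where the previous one ended
    assert offset == pos, "gap or overlap at offset %d" % offset
    pos = offset + length
assert pos == total
print("fragments cover all %d bytes contiguously" % total)
```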

This is a very simple and clean example of HTTP 206 fragmentation. Most uses of HTTP 206 are similar to this, even if not quite this simple. In very many cases, instead of being split across separate TCP streams, the fragments are sent serially in the same stream a la pipelined request/responses. This general scenario is very common for PDFs.

One point I'd like to make here is that if your NIDS doesn't do HTTP 206 defrag, you lose the opportunity to analyze a significant portion of PDFs, at least with any analysis that requires looking at the whole PDF at once.

Example B


Example B is interesting for a couple reasons. Again, let’s start with the logs:

[csmutz@master 206_examples]$ cat http_b.log
Jan 12 02:17:56 node4 http[27618]: tcp-198921731-1294816073-1294816075-i-936869-192.168.72.14:3254c65.54.95.206:80_http-0 1.1 GET au.download.windowsupdate.com /msdownload/update/software/uprl/2011/01/windows-kb890830-v3.15-delta_7d99803eaf3b6e8dfa3581348bc694089579d25a.exe 0 816896 206 1294342831 application/octet-stream TP AEk - - "" "Microsoft BITS/6.6" "Microsoft-IIS/7.5"
Jan 12 02:17:56 master 206defrag: input tcp-198921731-1294816073-1294816075-i-936869-192.168.72.14:3254c65.54.95.206:80_http-0 1022920 0 816896 au.download.windowsupdate.com /msdownload/update/software/uprl/2011/01/windows-kb890830-v3.15-delta_7d99803eaf3b6e8dfa3581348bc694089579d25a.exe 192.168.72.14
Jan 12 02:17:56 node4 http[27618]: tcp-198921731-1294816073-1294816075-i-936869-192.168.72.14:3254c65.54.95.206:80_http-1 1.1 GET au.download.windowsupdate.com /msdownload/update/software/uprl/2011/01/windows-kb890830-v3.15-delta_7d99803eaf3b6e8dfa3581348bc694089579d25a.exe 0 0 - - - - AEk - - "" "Microsoft BITS/6.6" ""
Jan 12 02:33:26 node1 http[26761]: tcp-199054360-1294817575-1294817576-r-206649-192.168.72.14:3257c65.54.95.14:80_http-0 1.1 GET au.download.windowsupdate.com /msdownload/update/software/uprl/2011/01/windows-kb890830-v3.15-delta_7d99803eaf3b6e8dfa3581348bc694089579d25a.exe 0 206024 206 1294342831 application/octet-stream TP AEk - - "" "Microsoft BITS/6.6" "Microsoft-IIS/7.5"
Jan 12 02:33:26 master 206defrag: input tcp-199054360-1294817575-1294817576-r-206649-192.168.72.14:3257c65.54.95.14:80_http-0 1022920 816896 206024 au.download.windowsupdate.com /msdownload/update/software/uprl/2011/01/windows-kb890830-v3.15-delta_7d99803eaf3b6e8dfa3581348bc694089579d25a.exe 192.168.72.14
Jan 12 02:33:26 master 206defrag: output tcp-198921731-1294816073-1294816075-i-936869-192.168.72.14:3254c65.54.95.206:80_http-0_206defrag normal 2 1022920 fc13fee1d44ef737a3133f1298b21d28 tcp-198921731-1294816073-1294816075-i-936869-192.168.72.14:3254c65.54.95.206:80_http-0,tcp-199054360-1294817575-1294817576-r-206649-192.168.72.14:3257c65.54.95.14:80_http-0
[csmutz@master 206_examples]$ cat object_b.log
Jan 12 02:33:26 master object_mux[3282]: tcp-198921731-1294816073-1294816075-i-936869-192.168.72.14:3254c65.54.95.206:80_http-0_206defrag 1022920 fc13fee1d44ef737a3133f1298b21d28 null PE32 executable for MS Windows (GUI) Intel 80386 32-bit

At first glance, this looks a lot like the last example, but there are some subtle yet notable differences. First of all, the first TCP stream contains two requests, not one. While the first transaction looks normal, the log for the second is incomplete. The size of the response payload is "-", there is no response code, and none of the response headers are set. What is happening here is that Ruminate can validate and parse the request but can't do so with the response, so it just gives the metadata for the request. What is going on here? To find out, we'll have to go to the packets...

Looking at packet 956, we see the second pipelined request. Presumably everything is still normal at this point:

[csmutz@master 206_examples]$ tshark -nn -r 206_example_b.pcap | grep "^956 "
956 1.259759 192.168.72.14 -> 65.54.95.206 HTTP GET /msdownload/update/software/uprl/2011/01/windows-kb890830-v3.15-delta_7d99803eaf3b6e8dfa3581348bc694089579d25a.exe HTTP/1.1

If we go farther down the packet trace we get to the point that the client receives the header for the 2nd response in packet 1213:

[csmutz@master 206_examples]$ tshark -nn -r 206_example_b.pcap | grep -C 2 "^1213 "
1211 1.407243 192.168.72.14 -> 65.54.95.206 TCP [TCP Dup ACK 1101#52] 3254 > 80 [ACK] Seq=581 Ack=899890 Win=65535 Len=0 SLE=935155 SRE=965425
1212 1.407254 65.54.95.206 -> 192.168.72.14 TCP [TCP segment of a reassembled PDU]
1213 1.407255 65.54.95.206 -> 192.168.72.14 HTTP HTTP/1.1 206 Partial Content (application/octet-stream)
1214 1.407347 192.168.72.14 -> 65.54.95.206 TCP [TCP Dup ACK 1101#53] 3254 > 80 [ACK] Seq=581 Ack=899890 Win=65535 Len=0 SLE=935155 SRE=965425
1215 1.407465 192.168.72.14 -> 65.54.95.206 TCP [TCP Dup ACK 1101#54] 3254 > 80 [ACK] Seq=581 Ack=899890 Win=65535 Len=0 SLE=935155 SRE=965425

Already we see something amiss. The client is incessantly ACKing data at a point partway into the payload of the 2nd response. As it turns out, the client never ACKs any more data, even though the server tries to ram the whole response down the client's buffer. It appears that the whole payload for the 2nd response is transferred over the wire, but the client never ACKs it. Ruminate handles this case by assuming the client threw away the unACKed data and doing essentially the same. Since the whole response can't be reconstructed, Ruminate punts: it provides no metadata about the response in the log and doesn't send the payload fragment to the 206defrag service, considering it invalid. Some could argue that it would be nice if Ruminate were a little more promiscuous in the TCP reassembly and HTTP parsing. While I can see the argument that it would be nice to provide some information about the response, the current behavior is relatively simple and safe. I suspect that some other NIDS and network forensics utilities would actually use all the unACKed data, opening the door to analyzing the whole payload at this point. I can see the appeal of this approach. I'm not 100% sure I've analyzed this situation correctly, but I think Ruminate does the right thing in this case.

It seems apparent that the client discarded this unACKed data because several minutes later, it requests the second fragment over again, which it receives successfully. After the client receives this second fragment, Ruminate splices it together and the exe is sent off for analysis. The interesting part about this 2nd attempt for the 2nd fragment is that this time the client chose a different mirror to download from--it’s on the same subnet but is a different IP.

I chose this example because it points out a few things. First it demonstrates how the classic layer 4 defrag accuracy problem can influence the layer 7 defrag problem. Similarly, it alludes to the same problems applied to layer 7. What do you do if layer 7, ex. HTTP 206 fragments, overlap? Which version do you keep if it’s different? Can this be used for NIDS evasion like it was in the layer 4 case? These are the type of interesting questions I hope Ruminate aids in studying.

I believe this example also helps validate some of the architecture of Ruminate, from dynamic load balancing of streams to a service-based approach. Since the two layer 7 fragments were sent between distinct client/server IP pairs, you have no guarantee that the conventional method of static header load balancing would send the layer 7 fragments to the same HTTP analysis node. If you are going to do this the conventional NIDS way, you are forced to accept a high cost in synchronization between the two analyzer nodes, because layer 7 defrag can involve large amounts of data spread over long periods of time. The service-based approach not only factors in the realities of today's commodity IT infrastructure, but also makes this problem look relatively simple.

Example C



Instead of leading off with the logs for this example, I need to explain one more wrinkle of HTTP 206. I didn't learn about this until I was trying to implement 206defrag and was disappointed to see that many of the PDFs I tried to download on my own machine weren't being successfully reconstructed by Ruminate (my computer almost always uses HTTP 206 when downloading PDFs). If the client requests more than one byte range in a single request, the server puts the various responses in a MIME blob that separates the byte ranges much like multiple attachments to an email, but from what I've seen, sans the base64 encoding. If I understand correctly, this is very similar to how some POSTs are encoded.

This is how it looks in practice:

GET /courses/ECE545/viewgraphs_F04/loCarb_VHDL_small.pdf HTTP/1.1
Host: teal.gmu.edu
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 ( .NET CLR 3.5.30729; .NET4.0C) Creative ZENcast v1.02.10
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
X-REMOVED: Range
X-Behavioral-Ad-Opt-Out: 1
X-Do-Not-Track: 1
Range: bytes=1-1,0-4095

HTTP/1.1 206 Partial Content
Date: Mon, 10 Jan 2011 17:02:50 GMT
Server: Apache
Last-Modified: Sat, 20 Nov 2004 02:05:07 GMT
ETag: "25fb6-79bec-d67fac0"
Accept-Ranges: bytes
Content-Length: 4303
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: multipart/byteranges; boundary=49980f01bf1635062

--49980f01bf1635062
Content-type: application/pdf
Content-range: bytes 1-1/498668

P
--49980f01bf1635062
Content-type: application/pdf
Content-range: bytes 0-4095/498668

%PDF-1.4
...

In this case you see the client asking for and the server responding with the second byte of the PDF, then the first 4K of it.
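Parsing such a multipart/byteranges body can be sketched in Python. This is my own rough illustration (assuming CRLF line endings and naively stripping the CRLFs around each boundary delimiter), not how any particular client, server, or Ruminate itself does it:

```python
import re

def parse_byteranges(body, boundary):
    """Split a multipart/byteranges body into (start, end, total, payload)
    tuples, keyed off each part's Content-range header."""
    parts = []
    for chunk in body.split(b"--" + boundary):
        chunk = chunk.strip(b"\r\n")
        if not chunk or chunk == b"--":       # preamble or closing delimiter
            continue
        head, _, payload = chunk.partition(b"\r\n\r\n")
        m = re.search(rb"[Cc]ontent-[Rr]ange: bytes (\d+)-(\d+)/(\d+)", head)
        if m:
            start, end, total = map(int, m.groups())
            parts.append((start, end, total, payload))
    return parts

# A miniature body modeled on the capture above
body = (b"--49980f01bf1635062\r\n"
        b"Content-type: application/pdf\r\n"
        b"Content-range: bytes 1-1/498668\r\n\r\n"
        b"P\r\n"
        b"--49980f01bf1635062\r\n"
        b"Content-type: application/pdf\r\n"
        b"Content-range: bytes 0-4095/498668\r\n\r\n"
        b"%PDF-1.4\r\n"
        b"--49980f01bf1635062--")
parts = parse_byteranges(body, b"49980f01bf1635062")
print([(s, e, t, len(p)) for s, e, t, p in parts])
```

A real parser would need to honor the exact byte counts from Content-range rather than trusting the boundary split, since payload bytes can themselves contain CRLFs.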

For brevity’s sake, I’ll only display the 206defrag “output” log:

Jan 10 12:04:02 master 206defrag: output tcp-170962418-1294678989-1294679016-c-233988-10.45.179.94:19950c129.174.93.170:80_http-0-part-1_206defrag normal 70 498668 94046a5fb1c5802d0f1e6d704cf3e10e tcp-170962418-1294678989-1294679016-c-233988-10.45.179.94:19950c129.174.93.170:80_http-0-part-1,tcp-170962418-1294678989-1294679016-c-233988-10.45.179.94:19950c129.174.93.170:80_http-1-part-1,tcp-170962841-1294678990-1294679016-c-305932-10.45.179.94:19953c129.174.93.170:80_http-1-part-4,tcp-170962418-1294678989-1294679016-c-233988-10.45.179.94:19950c129.174.93.170:80_http-6-part-1,tcp-170962418-1294678989-1294679016-c-233988-10.45.179.94:19950c129.174.93.170:80_http-7-part-2,tcp-170962841-1294678990-1294679016-c-305932-10.45.179.94:19953c129.174.93.170:80_http-2-part-1,...

In case you're curious, yes, the "70" early in the log means that the payload was assembled from 70 fragments. Furthermore, the "normal" means that the fragments were spliced together from contiguous segments without any portions of the fragments overlapping. Note that the duplication of byte 1 numerous times doesn't affect this because it's not necessary to use those fragments. In the future, I could be more granular with the logic and logging for special cases where fragments are duplicated, fragments overlap, etc. I have little knowledge of how specific HTTP clients handle situations like overlapping fragments.
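A sketch of how such a "normal" verdict might be computed (my approximation of the logic, not Ruminate's actual code): walk the fragments in offset order, ignore ones that add nothing new, and flag any partial overlap:

```python
def classify(frags, total):
    """frags: iterable of (offset, length) pairs. Returns 'normal',
    'overlapping', or 'incomplete'. Fully redundant fragments (like the
    repeated byte-1 requests) are simply skipped."""
    pos, overlap = 0, False
    for offset, length in sorted(frags):
        end = offset + length
        if end <= pos:
            continue              # fully redundant fragment, adds nothing
        if offset > pos:
            return "incomplete"   # gap in coverage
        if offset < pos:
            overlap = True        # partial overlap, spliced anyway
        pos = end
    if pos < total:
        return "incomplete"       # tail missing
    return "overlapping" if overlap else "normal"

# Toy subset: the duplicate 1-1 fragments don't break a "normal" result
print(classify([(1, 1), (0, 4096), (1, 1), (4096, 14319)], 18415))
```

Redundant fragments like the repeated byte-1 requests fall out of the walk naturally, which matches the behavior described in the log.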

One other thing of note is that these fragments are being transferred through two simultaneous TCP connections (client ports 19950 and 19953) using multiple HTTP 1.1 transactions. Another thing I find interesting about this example is the seemingly sporadic order in which the fragments are requested:

The following shows the client TCP port, the HTTP transaction index in that TCP connection, the MIME part index, the fragment start offset, and the fragment length.

[csmutz@master 206_examples]$ cat http_c.log | grep input | sed -r 's/tcp-.*:([0-9]+)c.*-([0-9]+-part-[0-9]+) /\1.\2 /' | awk '{ print $7" "$9" "$10 }'
19953.0-part-0 1 1
19950.0-part-0 1 1
19953.0-part-1 487541 4096
19950.0-part-1 0 4096
19953.1-part-0 1 1
19950.1-part-0 1 1
19950.1-part-1 4096 14319
19953.1-part-1 478933 1325
19953.1-part-2 477152 1781
19950.2-part-0 1 1
19953.1-part-3 480258 803
19953.1-part-4 18415 2540
19950.2-part-1 494520 4096
19953.1-part-5 481061 697
19950.3-part-0 1 1
19953.2-part-0 1 1
19953.2-part-1 32255 13312
19950.3-part-1 498616 52
19953.3-part-0 1 1
19950.4-part-0 1 1
19953.3-part-1 52049 5315
19953.3-part-2 483154 1646
19950.4-part-1 491637 2883
19953.3-part-3 57364 5529
19953.3-part-4 485870 46
...

I’m not sure I can discern any pattern in the manner in which the fragments are transferred, but it’s definitely not in order. While this looks like a bit of a shotgun (double-barreled in this case) approach to getting this file, it’s not overly haphazard, as the fragments line up nicely. I did quickly look at the byte ranges themselves to see if they correlated to the internal structure of the PDF (objects/streams) but didn’t see anything too obvious in the couple I examined. I’m also not sure why the client wants to request the second byte so frequently. According to my reckoning, the payload was reconstructed from 70 fragments, using 22 HTTP transactions, through 2 unique TCP connections. While definitely the exception rather than the norm, this is an example where the buffer-then-analyze model of Ruminate has significant benefits over the stateful incremental analysis model of conventional packet-based NIDS.

While they exercise rare conditions, examples B and C demonstrate the type of issues I’ve built Ruminate to be able to study and address. As attacks continue to move up the stack, NIDS research needs to follow.

Descending out of the clouds into the real world, example A isn’t as uncommon as many might suppose. I’m hoping that the upcoming release of Ruminate, with vastly improved HTTP parsing capabilities, will prove useful to some in operational environments. I feel it important to reiterate that Ruminate is a research oriented tool--it’s somewhere between experimental and proof of concept. The last thing I want is for Ruminate to be used in a manner that misleads someone with a false sense of security. It should go without saying, but only those who are willing to accept its limitations (presumably without knowing all of them) or are willing to do adequate vetting themselves should rely on Ruminate in production environments. That being said, I’ve been pleasantly surprised with what I’ve been able to do with Ruminate so far.

In the next couple weeks I’m going to work on refining, packaging, and documenting Ruminate so it will be easier for those who want to play with it. I hope to have this done around the end of the month.

5 comments:

  1. Hey Charles, have you looked at the HTP library from Suricata to see how they handle this?

    http://www.openinfosecfoundation.org/index.php/download-suricata

    ReplyDelete
  2. Thanks, great post. I had no idea 206 was being used so much.

    I looked at the traces and as you pointed out found strange stuff.

    bytes 1-1 was asked for in almost every chunk - wonder why ?

    "bytes=1-1,52049-57363,483154-484799,57364-62892,485870-485915,62893-75939"

    some of the ranges are contiguous so can be collapsed like 52049-57363 , 57364-62892 , 62893-75939 can be requested as 52049-75939 ! Why were they not ? Maybe the browser wanted a multipart/bytes back ?

    Nice work. I will try out your code over the weekend.

    ReplyDelete
  3. Richard,

    libHTP, http://sourceforge.net/projects/libhtp/, which Suricata uses, appears to be very promising for parsing HTTP transactions and exposing HTTP payload data. In fact, when I get to the point in Ruminate where I worry more about efficiency, robustness, etc than new features and chasing crazy ideas, I’ll seriously consider swapping out the Perl implementation using HTTP::Parser with something using libHTP. Perl/Python bindings for libHTP would be a welcome facilitator. However, as far as I understand, reassembling the payload fragments across multiple HTTP transactions is out of scope for libHTP. It should be clear from the architecture of Ruminate that to me it makes a lot of sense to abstract away the inter layer 7 transaction processing (206defrag) from the regular layer 7 transaction processing (http_parser). Traditional NIDS aren’t concerned with comprehensive payload object extraction and reassembly because their detection models don’t need/support it. Ruminate needs to have access to complete payload objects to be able to analyze them. One thing I have to give Suricata credit for is pushing the envelope on layer 7 decoding, e.g. gzip content encoding a la libHTP.

    Vivek,

    I don’t know why some clients request byte 1 over and over again. I know my desktop does this routinely though when viewing PDFs in firefox. For the moment, I’m focusing my efforts on building the network instrumentation necessary to understand what’s going on.

    ReplyDelete
  4. Thanks Charles, for doing and publishing this research and also for the pcaps. The reassembly process looks fairly straightforward, except for the multiple sources cases.

    Without having looked at the issue at all yet, I'm wondering about evasion possibilities by exploiting server/client specific handling of corner cases. In IP-defrag and TCP-stream reassembly in Suricata we are spending a lot of effort to do it right based on the target OS (Snort has shown the way here). Would similar issues be possible here? IE handling 206-frag overlaps differently from FF, Webkit from some download manager, etc?

    ReplyDelete
  5. Victor,

    Yes, I believe the type of issues that you mentioned and to which I alluded could be interesting academic problems. Back in the days that most attacks/evasion operated at layers 3 and 4, these sorts of corner cases were all the rage in academia. I don’t understand why the layer 7 and embedded payload object equivalents aren’t being discussed as rigorously today. While I’m starting with the normal case of HTTP 206, I hope Ruminate helps promote research into these types of corner cases at layer 7 and above.

    That being said, in practice, I think it’s critical to be able to adequately handle the normal cases before worrying about the exceptional cases. For example, it’s often more important in the real world to be able to routinely operate on adequate amounts of reassembled/decoded data (classic flow depth problem) than to be able to handle extremely rare cases (such as TCP defrag corner cases). While I’m nowhere near there yet with HTTP 206 defrag in Ruminate, one pragmatic approach that likely won’t earn academic accolades because it isn’t sexy but just might help in the real world is to handle ambiguous cases by simply trying multiple possible reconstructions. Ex. if you are reassembling fragments, and two fragments overlap, and the overlapping fragments actually differ, then try both possible reconstructions.

    Circling back to my comments to Richard above, while some other examples could be cited, layer 7 defrag is most critical if you are doing things that require looking at the reassembled payload objects--like Ruminate or VRT’s Razorback, http://labs.snort.org/razorback/, are attempting to do. If you are just going to apply a short signature or string match to a payload, full reassembly really isn’t as important. If you want to undo file format encoding, such as compression in PDF, SWF, or DOCX, then reassembly is critical. If you want to inspect PEs, throwing all or some of them in a sandbox, full reassembly is essential. This sort of analysis is what Ruminate is all about, hence my interest in HTTP 206 defrag--both the normal and corner cases.

    ReplyDelete