<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-471870110572762709</id><updated>2012-02-18T06:34:34.235-08:00</updated><category term='technology'/><category term='memories'/><category term='cyber'/><category term='cli'/><category term='devel'/><category term='cyberwar'/><category term='opinion'/><category term='supercomputer'/><category term='idle'/><category term='apt'/><category term='near real-time IDS'/><category term='ruminate'/><category term='machine learning'/><category term='vortex howto'/><category term='security intelligence'/><category term='packet capture'/><category term='snort'/><category term='humor'/><title type='text'>SmuSec</title><subtitle type='html'>Charles Smutz's thoughts on computer security and software development, including topics such as intrusion detection systems and incident response.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><generator version='7.00' 
uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>29</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-1762110709417810226</id><published>2011-07-19T16:17:00.001-07:00</published><updated>2011-07-20T05:32:45.957-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='humor'/><category scheme='http://www.blogger.com/atom/ns#' term='cyber'/><title type='text'>Cyberprefixation Benchmark</title><content type='html'>Continuing my crusade against senseless buzzword use which I began with a &lt;a href="http://smusec.blogspot.com/2011/06/for-jargon-file-cyberprefixation.html"&gt;recent post&lt;/a&gt;, I've created a &lt;a href="http://www.csmutz.com/tools/cyberprefixation_benchmark.php"&gt;cyberprefixation benchmark&lt;/a&gt; tool. It ranks pages based on numerous variables.&lt;br /&gt;&lt;br /&gt;Flaming examples that demonstrate all of the input variables and the upper end of the rating scale include two pages from DHS: &lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.csmutz.com/tools/cyberprefixation_benchmark.php?url=http%3A%2F%2Fwww.dhs.gov%2Ffiles%2Fcybersecurity.shtm"&gt;Cybersecurity&lt;/a&gt; (score: 66.2, rating: compulsive)&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.csmutz.com/tools/cyberprefixation_benchmark.php?url=http%3A%2F%2Fwww.dhs.gov%2Fxabout%2Fstructure%2Feditorial_0839.shtm"&gt;National Cyber Security Division&lt;/a&gt; (score: 188.5, rating: hazardous)&lt;/li&gt;&lt;/ul&gt; &lt;br /&gt;&lt;br /&gt;If this tool helps even one person then I'll feel like the (tiny) time spent on this was worth it :)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-1762110709417810226?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://smusec.blogspot.com/feeds/1762110709417810226/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2011/07/cyberprefixation-benchmark.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/1762110709417810226'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/1762110709417810226'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2011/07/cyberprefixation-benchmark.html' title='Cyberprefixation Benchmark'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-187601885809812588</id><published>2011-06-25T17:11:00.001-07:00</published><updated>2011-06-25T18:42:03.123-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><category scheme='http://www.blogger.com/atom/ns#' term='ruminate'/><title type='text'>Ruminate 06/22 Weekendly Build</title><content type='html'>It's been way too long since I've posted any public updates on Ruminate. I'd like to highlight two things from the &lt;a href="http://ruminate-ids.org/files/weekendly/ruminate-20110622.zip"&gt;06/22 Weekendly Build&lt;/a&gt;. Enough has changed to warrant a "release", except that I haven't been able to do as much testing as I normally do.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Splitting Vortex and Ruminate Server&lt;/h3&gt;&lt;br /&gt;Previous versions of Ruminate have been based on essentially a fork of Vortex. 
Now Ruminate relies on Vortex (or something similar, like tcpflow) to generate network streams and Ruminate takes it from there. This allows Ruminate to benefit immediately from any updates to Vortex and better fits the implementation paradigm I've chosen for Ruminate (loose composition of many small components). This also allows for a single instance of the stream capture mechanism (instead of one per protocol).&lt;br /&gt;&lt;br /&gt;The new architecture looks like:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/-DUS1wPs4sx4/TgZ_R3-2jvI/AAAAAAAAAB0/Fi1gIw6hAOE/s1600/Ruminate%2BComponents.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 273px; height: 320px;" src="http://2.bp.blogspot.com/-DUS1wPs4sx4/TgZ_R3-2jvI/AAAAAAAAAB0/Fi1gIw6hAOE/s320/Ruminate%2BComponents.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5622321129880719090" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Now you start the stream distribution mechanism on the capture server with something like:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;vortex {options} | ruminate_server&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Note that ruminate_server doesn't take the same options as the old one. I haven't yet decided how I want to specify some of the options (like which streams get classified as which protocol and which port those streams are distributed on) so these are set in the code. In the future, I hope to make this much more flexible, allowing for protocol selection to be based not only on port, but also on content. Right now, streams are processed by the first, and only the first, protocol parser whose filter is matched by the stream. 
In the future, I'd like to support more than one, probably by giving a copy to each parser that wants the stream.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Significant Fix to http_parser&lt;/h3&gt;&lt;br /&gt;Those who have used Ruminate extensively will know that it occasionally comes across a stream that just kills the performance of http_parser. It's not that big a deal if one of many http_parsers churns for a long period of time if you have a lot of them, but it's clearly not ideal. From what I can tell, the major cause of this situation is inefficient code in http_parser in the case of an HTTP response that doesn't include a Content-Length header. I've put in a fix for this that provides orders of magnitude improvement in this case, especially if the response payload is large.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Going Forward&lt;/h3&gt;&lt;br /&gt;There are a few things I'm looking at doing going forward. I mentioned enhancing stream distribution mechanisms above.&lt;br /&gt;&lt;br /&gt;I may also try to publicly share some performance stats of Ruminate running on a large network (~1 Gbps) so that I can demonstrate that Ruminate really does scale well. Most of the data I've published has involved Ruminate being used on data sets much smaller than I would have liked.&lt;br /&gt;&lt;br /&gt;I'm thinking of creating a Flash scanning service similar to the PDF service. Exploitation of SWF vulnerabilities is rampant. Like PDFs, some of the complications of SWFs (like file format compression and internal script language) are good for demonstrating the benefits of Ruminate. 
&lt;br /&gt;&lt;br /&gt;The point of these object analyzers has primarily been for demonstrating the value of the framework and the associated mechanisms but in the future I hope to innovate in detection mechanisms also.&lt;br /&gt;&lt;br /&gt;While my primary purpose in building Ruminate is to conduct research, I hope sharing my implementation will be helpful to some, notwithstanding the many imperfections.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-187601885809812588?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/187601885809812588/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2011/06/ruminate-0622-weekendly-build.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/187601885809812588'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/187601885809812588'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2011/06/ruminate-0622-weekendly-build.html' title='Ruminate 06/22 Weekendly Build'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-DUS1wPs4sx4/TgZ_R3-2jvI/AAAAAAAAAB0/Fi1gIw6hAOE/s72-c/Ruminate%2BComponents.png' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-7824368208513839749</id><published>2011-06-18T12:27:00.000-07:00</published><updated>2011-06-18T14:39:46.368-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='humor'/><category scheme='http://www.blogger.com/atom/ns#' term='cyber'/><title type='text'>For the Jargon File: Cyberprefixation</title><content type='html'>Cyberprefixation: n.&lt;br /&gt;&lt;br /&gt;1. the compulsive, excessive, or vain use of the term “cyber” before other words, forming words or word sequences not used in typical dialogue. In this context, “cyber” frequently indicates a narrow meaning such as information security; however, the precise meaning is almost always ambiguous. In cyberprefixation, the addition of the term “cyber” doesn’t necessarily provide any meaningful description or clarification, but rather is used predominantly for its value as a buzzword. Cyberprefixation may be used to describe all cases where “cyber” precedes the word it modifies, whether separated by punctuation or combined to form a single word. 
Cyberprefixation often results in the creation of &lt;a href="http://en.wikipedia.org/wiki/Nonce_word"&gt;nonce words&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Cyberprefixation is closely related to cyalliteration, but differs in that cyberprefixation refers to the use of "cyber" in its entirety, whereas cyalliteration leverages only the first syllable “cy”.&lt;br /&gt;&lt;br /&gt;Example: That press release was a prime example of &lt;span style="font-weight:bold;"&gt;cyberprefixation&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;See also: &lt;a href="http://www.catb.org/jargon/html/C/cybercrud.html"&gt;cybercrud&lt;/a&gt; and &lt;a href="http://www.catb.org/jargon/html/B/buzzword-compliant.html"&gt;buzzword-compliant&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-7824368208513839749?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/7824368208513839749/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2011/06/for-jargon-file-cyberprefixation.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/7824368208513839749'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/7824368208513839749'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2011/06/for-jargon-file-cyberprefixation.html' title='For the Jargon File: Cyberprefixation'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-4219038240907216787</id><published>2011-05-31T17:05:00.000-07:00</published><updated>2011-06-25T17:10:54.819-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><category scheme='http://www.blogger.com/atom/ns#' term='cli'/><title type='text'>Join: Relational Queries the CLI Way</title><content type='html'>In this post, I hope to share a little CLI-fu that I’ve learned that I haven’t seen used very frequently by my fellow practitioners. My hope is that I may be able to return the favor many others have extended to me by showing how to use a nifty CLI tool.&lt;br /&gt;&lt;br /&gt;Occasionally, one comes across the need to perform relational queries on data that is stored in flat files. One legitimate tactic for doing so is to just load the data at hand into a relational database such as sqlite or mysql. There are many situations where this is less than desirable or just not practical. In such situations, I’ve seen people hack together bash/perl/whatever scripts, many of which are extremely inefficient, harder than they need to be, and/or just plain ugly. Using “join”, in conjunction with classic line based text processing utils, can provide very elegant solutions in some of these situations. Never heard of join? Keep reading as I extol its virtues!&lt;br /&gt;&lt;br /&gt;I learned to use join working with &lt;a href="http://ruminate-ids.org/"&gt;Ruminate IDS&lt;/a&gt;. Ruminate creates logs for each of the processing layers data traverses. In its current state, correlating these logs at various layers of the processing stack is left for an external log aggregation/correlation system. 
In exploring events in flat file logs, I use join to splice multiple layers of the processing stack together, similar to what you would do using a join in SQL.&lt;br /&gt;&lt;br /&gt;The following is some data I will use for this example. In order to sanitize the smallest amount of data possible, I’ve only included the log entries I will be using here. Feel free to intersperse (or imagine interspersing) additional dummy logs if you like (make sure to maintain ordering on the keys used for joining; see the explanation far below). This data represents the processing associated with the same malicious pdf transferred over the network in two separate transactions:&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;$ cat clamav.log&lt;br /&gt;tcp-1305036479-10.1.1.1:51770c114.143.209.62:80_http-0 Exploit.PDF-22632&lt;br /&gt;tcp-1305128525-10.1.1.1:57460c89.114.97.13:80_http-0 Exploit.PDF-22632&lt;br /&gt;$ cat object.log&lt;br /&gt;tcp-1305036479-10.1.1.1:51770c114.143.209.62:80_http-0 2080 9e9dfd9534fe89518ba997deac07e90d PDF document, version 1.6&lt;br /&gt;tcp-1305128525-10.1.1.1:57460c89.114.97.13:80_http-0 2080 9e9dfd9534fe89518ba997deac07e90d PDF document, version 1.6&lt;br /&gt;$ cat http.log&lt;br /&gt;tcp-1305036479-10.1.1.1:51770c114.143.209.62:80_http-0 GET haeied.net /1.pdf&lt;br /&gt;tcp-1305128525-10.1.1.1:57460c89.114.97.13:80_http-0 GET haeied.net /1.pdf&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Note that these log files are presented in reverse order of the processing stack. HTTP processing extracts objects and creates network protocol logs. File metadata is extracted from those objects. The objects are then multiplexed to analyzers like clamav for analysis.&lt;br /&gt;&lt;br /&gt;Let’s say I want to look at all the files transferred over the network that matched the clamav signature “Exploit.PDF-22632”. 
I use the classic grep:&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;$ grep -F "Exploit.PDF-22632" clamav.log&lt;br /&gt;tcp-1305036479-10.1.1.1:51770c114.143.209.62:80_http-0 Exploit.PDF-22632&lt;br /&gt;tcp-1305128525-10.1.1.1:57460c89.114.97.13:80_http-0 Exploit.PDF-22632&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Unfortunately, the TCP quad and timestamp don’t provide us much useful context. Let’s join in the http.log data:&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;$ grep -F "Exploit.PDF-22632" clamav.log | join - http.log&lt;br /&gt;tcp-1305036479-10.1.1.1:51770c114.143.209.62:80_http-0 Exploit.PDF-22632 GET haeied.net /1.pdf&lt;br /&gt;tcp-1305128525-10.1.1.1:57460c89.114.97.13:80_http-0 Exploit.PDF-22632 GET haeied.net /1.pdf&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Whoa, that was easy. Note that join assumed that we wanted to use the first column as the key for joining. While we’re at it, let’s join in the object.log data, only selecting the columns we are interested in:&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;$ grep -F "Exploit.PDF-22632" clamav.log | join - http.log | join - object.log | cut -d" " -f 2-6,8-&lt;br /&gt;Exploit.PDF-22632 GET haeied.net /1.pdf 2080 PDF document, version 1.6&lt;br /&gt;Exploit.PDF-22632 GET haeied.net /1.pdf 2080 PDF document, version 1.6&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;One big advantage of join is that it is easy to use in conjunction with other filter programs such as grep, sed, and zcat. You might use sed to convert tcp quads from IDS alerts and firewall logs into exactly the same format so you can join them on the tcp quad as the key. Join works very well on large files, including compressed files that are decompressed on the fly (with zcat, for example), so you can efficiently get the data you want. 
The following is the same, with the difference of operating on compressed files:&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;$ gzip -c clamav.log &amp;gt; clamav.log.gz&lt;br /&gt;$ gzip -c object.log &amp;gt; object.log.gz&lt;br /&gt;$ gzip -c http.log &amp;gt; http.log.gz&lt;br /&gt;$&lt;br /&gt;$ zgrep -F "Exploit.PDF-22632" clamav.log.gz | join - &amp;lt;(zcat http.log.gz) | join - &amp;lt;(zcat object.log.gz) | cut -d" " -f 2-6,8- &lt;br /&gt;Exploit.PDF-22632 GET haeied.net /1.pdf 2080 PDF document, version 1.6 &lt;br /&gt;Exploit.PDF-22632 GET haeied.net /1.pdf 2080 PDF document, version 1.6&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Again, very easy to get a nice little report using data spanning multiple files.&lt;br /&gt;&lt;br /&gt;To continue demonstrating join, I’m going to refer to the data used in an &lt;a href="http://www.sql-tutorial.net/SQL-JOIN.asp"&gt;SQL JOIN tutorial&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I used the data in CSV form as follows:&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;$ cat customers.csv&lt;br /&gt;1,John,Smith,John.Smith@yahoo.com,2/4/1968,626 222-2222&lt;br /&gt;2,Steven,Goldfish,goldfish@fishhere.net,4/4/1974,323 455-4545&lt;br /&gt;3,Paula,Brown,pb@herowndomain.org,5/24/1978,416 323-3232&lt;br /&gt;4,James,Smith,jim@supergig.co.uk,20/10/1980,416 323-8888&lt;br /&gt;&lt;br /&gt;$ cat sales.csv&lt;br /&gt;2,5/6/2004,100.22&lt;br /&gt;1,5/7/2004,99.95&lt;br /&gt;3,5/7/2004,122.95&lt;br /&gt;3,5/13/2004,100.00&lt;br /&gt;4,5/22/2004,555.55&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;First, let’s start by generating a report for the marketing folk showing when each person has placed orders:&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;$ cat sales.csv | join -t, - customers.csv | sort -t, -k 1 | awk -F, '{ print $2","$4","$5 }'&lt;br /&gt;5/6/2004,Steven,Goldfish&lt;br /&gt;5/13/2004,Paula,Brown&lt;br /&gt;5/7/2004,Paula,Brown&lt;br 
/&gt;5/22/2004,James,Smith&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Wow, didn’t that feel like you were using a relational database, albeit in a CLI type of way? Note that we had to specify the delimiter (same syntax as sort). Also, we sorted the output on customerid to ensure orders by the same person are contiguous. The astute reader, however, will notice that the report isn’t complete. We missed one of the sales on 5/7/2004. Why? From the man page we get the following critical nugget:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;Important: FILE1 and FILE2 must be sorted on the join fields.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In this case we were joining on the customerid columns, which are not in the same order in the sales and customers tables. As such, we failed to join the records that weren’t sorted the same in both files. While this could be seen as a limitation of join, it is also what makes it efficient and makes it work so well with other utilities: all join operations occur with a single sequential pass through each file. Remember that “real” databases have indexes to make this sort of thing more efficient than a single full table scan. No frets though: for occasional queries, using sort to put the join fields in the same order works quite well. Also note that for a lot of security data, where the data is sorted chronologically, this requirement is frequently met with no additional effort, as shown in the Ruminate logs above. 
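&lt;br /&gt;&lt;br /&gt;The sort-then-join pattern described above can be sketched as a small shell function. This is only a minimal sketch under stated assumptions: the name sorted_join is hypothetical (it is not part of coreutils or of the examples in this post), both files are assumed to be comma-delimited with the key in the first field, and mktemp is assumed to be available. It sorts both inputs as text under the same C locale rather than numerically, because join compares keys as strings:&lt;br /&gt;&lt;pre&gt;
```shell
# Hypothetical helper (an illustration, not from the original post):
# sort two comma-delimited files on field 1 using the same byte-wise
# collation, then join them on that field.
sorted_join() {
    sj_left=$(mktemp)
    sj_right=$(mktemp)
    # join expects both inputs ordered the way it compares keys (as text),
    # so sort each file on the first field only, under the C locale.
    LC_ALL=C sort -t, -k1,1 "$1" > "$sj_left"
    LC_ALL=C sort -t, -k1,1 "$2" > "$sj_right"
    # Join on the first field of each sorted copy.
    LC_ALL=C join -t, "$sj_left" "$sj_right"
    # Clean up the temporary sorted copies.
    rm -f "$sj_left" "$sj_right"
}
```
&lt;/pre&gt;For example, running sorted_join sales.csv customers.csv should pair every sale with its customer regardless of the row order in either file, at the cost of sorting both files first.&lt;br /&gt;&lt;br /&gt;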
In this case, we’ll sort sales to put customerid in the same order as the customers table:&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;$ cat sales.csv | sort -t, -k 1 -g | join -t, - customers.csv | awk -F, '{ print $2","$4","$5 }'&lt;br /&gt;5/7/2004,John,Smith&lt;br /&gt;5/6/2004,Steven,Goldfish&lt;br /&gt;5/13/2004,Paula,Brown&lt;br /&gt;5/7/2004,Paula,Brown&lt;br /&gt;5/22/2004,James,Smith&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Now the order from John Smith shows up correctly.&lt;br /&gt;&lt;br /&gt;Let’s do another simple query for the marketing folk: Report of all the customers that have placed individual purchases of over $100—the high rollers:&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;$ cat sales.csv | awk -F, '{ if ($3 &amp;gt; 100) print $0}' | sort -t, -k 1 -g | join -t, customers.csv - | cut -d, -f2,3,8&lt;br /&gt;Steven,Goldfish,100.22&lt;br /&gt;Paula,Brown,122.95&lt;br /&gt;James,Smith,555.55&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Again, this is simple and straightforward (in an esoteric CLI type of way). If we were doing an SQL tutorial, we would have just introduced a WHERE clause. 
If I were going to translate this as literally as possible to SQL I would do so as follows:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;table border="1" cellspacing="1"&gt;&lt;tbody valign="top"&gt;&lt;tr&gt;&lt;td&gt;&lt;span style="font-weight:bold;"&gt;CLI&lt;/span&gt;&lt;/td&gt;&lt;td&gt;&lt;span style="font-weight:bold;"&gt;pseudo-SQL&lt;/span&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;cat sales.csv&lt;/td&gt;&lt;td&gt;FROM sales&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;awk -F, '{ if ($3 &amp;gt; 100) print $0}'&lt;/td&gt;&lt;td&gt;WHERE saleamount &amp;gt; 100&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;sort -t, -k 1 -g&lt;/td&gt;&lt;td&gt;USING INDEX customerid&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;join -t, customers.csv&lt;/td&gt;&lt;td&gt;JOIN customers ON customerid&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;cut -d, -f2,3,8&lt;/td&gt;&lt;td&gt;SELECT firstname, lastname, saleamount&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div&gt;With the full pseudo-SQL as follows:&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;SELECT firstname, lastname, saleamount FROM sales JOIN customers ON customerid USING INDEX customerid WHERE saleamount &amp;gt; 100&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;For the last example, I’ll do the gratuitously ugly example from the tutorial whose data we are using. 
Let’s calculate the total spent by each customer:&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;$ cat sales.csv | awk -F"," '{SUMS[$1]+=$3} END { for (x in SUMS) { print x","SUMS[x]} }' | sort -t, -k 1 -g | join -t, customers.csv - | cut -d, -f2,3,7&lt;br /&gt;John,Smith,99.95&lt;br /&gt;Steven,Goldfish,100.22&lt;br /&gt;Paula,Brown,222.95&lt;br /&gt;James,Smith,555.55&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Alright, so this isn’t so pretty, but it works.&lt;br /&gt;&lt;br /&gt;In summary, join makes it easy to splice together data from multiple flat files. It works well in the classic *nix CLI analysis paradigm, using sequential passes through files containing one record per line. Join is particularly useful for infrequent queries on large files, including compressed files. Join plays well with the other CLI utils such as sed, awk, cut, etc and can be used to perform relational queries like those done in a database. I hope this short primer has been useful in demonstrating the power of join.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-4219038240907216787?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/4219038240907216787/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2011/05/join-relational-queries-cli-way.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/4219038240907216787'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/4219038240907216787'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2011/05/join-relational-queries-cli-way.html' title='Join: Relational Queries the 
CLI Way'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-6895064482622415031</id><published>2011-04-22T04:50:00.000-07:00</published><updated>2011-04-23T18:14:53.642-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='opinion'/><title type='text'>Faith and Security</title><content type='html'>I’d like to share some thoughts on how faith applies to those of us seeking to provide “security”, especially those in operational environments who expend large portions of our time and effort to achieve this goal. I personally think it quite appropriate to speak of faith and religion openly and that our public/professional lives can’t (and shouldn’t) be fully abstracted from what many expect to be our private devotions. That being said, I’m going to try to avoid both general evangelization and pushing specific sectarian dogmas. I hope my remarks resonate with those sincerely trying to live their faith. I also hope these comments provide some perspective to help those who don’t believe in God better understand those who do.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Peace is greater than Security&lt;/h3&gt;&lt;br /&gt;For people of faith, security is a profane goal, and frequently an arrogantly vain one. Peace is the heavenly good that should be sought after. Absolute security is not only undesirable, but is contrary to our earthly existence. I believe it necessary to live in a fallen condition, such as our current mortal existence, where we are free to grow through choosing between good and bad, including facing nearly constant adversity. 
Removing all opposition to good would frustrate our eternal progression. Regardless of your belief in our raison d’etre, religious and ethical codes guide adherents in how they react to adversity and hostility, including the heavenly pursuit of peace. For example, Christians believe a greater measure of &lt;a href="http://lds.org/scriptures/nt/john/14.27?lang=eng#26"&gt;peace&lt;/a&gt; can be found through Christ than possible through worldly means. Peace may be had in the absence of comprehensive security, requires a great measure of discipline, and doesn’t come at the cost of sacrifices to freedom. In an ideal world, we’d all be seeking peace.&lt;br /&gt;&lt;br /&gt;The world isn’t perfect. One responsibility we all have is to uphold freedom and provide an appropriate level of security. Ironically, one of the primary methods of pursuing security is through force and compulsion. Often people find extreme measures, such as warfare, the best option for achieving security. While there is some variance, most religions justify violence under certain conditions. Sadly, we find ourselves in a world of turmoil and warfare. Even though we often take a less effective path to security than we might otherwise hope for, our faith must be able to provide guidance to us during such pursuits.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Seek Divine Help&lt;/h3&gt;&lt;br /&gt;If we want to succeed in any endeavor, seeking divine help is always wise. Providing an appropriate level of security, one that ensures individual liberties, is an honorable pursuit in which God will assist us. One important aspect of our lives is doing honest and honorable work. Important experiences occur as we seek and receive God’s assistance in our labors. While it would be nice if we as a society devoted fewer resources to preventing bad things from happening and more to ensuring good things happened, I think most people performing work in “security” do honest and honorable work. 
As such, we should seek the help God has &lt;a href="http://lds.org/scriptures/bofm/alma/34.18-26?lang=eng#17"&gt;offered&lt;/a&gt; to those that follow him.  &lt;br /&gt;I like to separate God’s help into two major classes: direct help and inspiration. An example of direct help would be unexpected severe weather impeding an opponent’s advancement through the countryside. On the flip side, leaders might be inspired to advance, retreat, or do seemingly odd things, often in opposition to reason. Clearly, these two forms of assistance can go hand in hand. A classic scriptural example is the &lt;a href="http://lds.org/scriptures/nt/heb/11.23-30?lang=eng#22"&gt;Exodus of Israel&lt;/a&gt; from Egypt.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Ask and Ye Shall Receive&lt;/h3&gt;&lt;br /&gt;The first thing one should do when seeking God’s help is to ask. God has promised us great blessings, if we but &lt;a href="http://lds.org/scriptures/nt/matt/21.22?lang=eng#21"&gt;ask&lt;/a&gt;. Certainly God always knows what we need and want, but many blessings are contingent upon our sincere supplication to him. I can think of nothing more natural than praying for safety, protection, and assistance in defense. Sadly, I think this very important step is often overlooked. &lt;br /&gt;&lt;br /&gt;Outside of scriptural accounts, when thinking about the importance of prayer in security, I often visualize Arnold Friberg’s painting of &lt;a href="http://www.revolutionary-war-and-beyond.com/prayer-at-valley-forge.html"&gt;Washington’s prayer at Valley Forge&lt;/a&gt;. I acknowledge that the facts surrounding the Isaac Potts story and other accounts are often disputed. Regardless, based on what I know of Washington, I believe it to be plausible that he (and many others) offered numerous sincere prayers to bring about the miraculous shift in the war that eventually resulted in American independence. 
The first step to providence is asking.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Keep Yourself Worthy of Divine Help&lt;/h3&gt;&lt;br /&gt;I wholeheartedly agree with the maxim that God helps those who help themselves. God frequently extends mercy and assistance to those who have done all in their power. While we don’t always understand the judgments of God, he also frequently withholds assistance from those who have neglected to do what they can, especially those who do so knowingly. If we want the Lord’s assistance, certainly we should be doing the very best work we can do.  &lt;br /&gt;&lt;br /&gt;Just as vigilance in preparation, maintenance, and practice is required to ensure proper operation of implements of security (e.g. personal firearms, fighter jets, electronic surveillance systems, etc.), the same vigilance is required to maintain channels of divine assistance. For example, regular prayer, scripture study, and meditation are essential to ensuring constant guidance through heavenly inspiration. Obedience to laws and commandments, such as Sabbath day observance, health codes, morality and chastity, fasting, etc., brings with it promised blessings and power. Most of us know what we need to do; we just need to be vigilant in doing so. Consistently doing what is right, even if we don’t feel an acute need for help at the moment, is very much what faith is about. This sort of faithfulness invariably results in confidence and answers to prayers when the time of need does come. The parable of &lt;a href="http://lds.org/scriptures/nt/matt/25.1.1-13?lang=eng#1"&gt;the ten virgins&lt;/a&gt; beautifully advocates diligence in preparation. &lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Have the Faith to Act&lt;/h3&gt;&lt;br /&gt;If God extends help, it’s important to act upon it. Sadly, people often don’t have adequate faith to be guided by the wisdom of God over the wisdom of man. 
Admittedly, it often requires great faith to do so, especially in a world that is largely ruled by agnostic (and often short-sighted) reason. Faith- and principle-based decisions are frequently hard to justify, especially in the face of empiricism (well-founded or not). On the other hand, when we are given strong assurances through faith, we shouldn’t be afraid to proceed with what human wisdom deems silly courses of action. Some of the biggest disappointments of my career have occurred as I’ve ignored inspiration concerning my work. On the other hand, as we act in faith we become more confident in doing so in the future. Through experiences in small things our faith will grow to the point where we can do great things.&lt;br /&gt;&lt;br /&gt;The scriptures are replete with examples of those who have had faith to act and those who haven’t. Infamous examples of those who lacked faith at key moments include &lt;a href="http://lds.org/scriptures/ot/1-sam/15.19-23?lang=eng#18"&gt;Saul (the Old Testament King)&lt;/a&gt; and &lt;a href="http://lds.org/scriptures/nt/matt/27.19-26?lang=eng#18"&gt;Pontius Pilate&lt;/a&gt;. On the other hand, demonstrations of great faith include those by &lt;a href="http://lds.org/scriptures/ot/1-sam/17.32-51?lang=eng#31"&gt;David&lt;/a&gt;, &lt;a href="http://lds.org/scriptures/ot/judg/7?lang=eng"&gt;Gideon&lt;/a&gt;, and &lt;a href="http://lds.org/scriptures/ot/2-kgs/6.8-20?lang=eng#7"&gt;Elisha&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Give Credit Where It’s Due&lt;/h3&gt;&lt;br /&gt;One principle that the secular world understands well, at least at first blush, is giving credit where it’s due. The sad reality is that the world makes it very hard to give adequate credit to God. In some cases it’s appropriate to keep highly miraculous or personal miracles to yourself. However, when others attribute the positive outcomes of divine assistance to you, it’s important to try to set the record straight. 
I feel it’s appropriate to use words such as “blessing” and “providence” to convey my belief in divine intervention to those who want to understand without unduly imposing on those who don’t. We can, and always should, give thanks to God directly through personal prayer.&lt;br /&gt;&lt;br /&gt;One of my favorite examples of this principle is that of &lt;a href="http://www.ehow.com/about_6459000_history-little-italy-baltimore.html"&gt;the preservation of Little Italy&lt;/a&gt; from the &lt;a href="http://en.wikipedia.org/wiki/Great_Baltimore_Fire"&gt;Great Fire of Baltimore&lt;/a&gt;. In 1904 the core of Baltimore City burned to the ground. As the fire swept across the city, many in the neighborhood of Little Italy met in the local church to pray for deliverance. In general, most people agree that the wind changing direction prevented the fire from crossing the Jones Falls river, saving Little Italy from the inferno. Some attribute this outcome to providence and some to chance. It’s clear, however, what the people in Little Italy believed at the time.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;No Room for Pride and Hatred&lt;/h3&gt;&lt;br /&gt;If we want God’s help, both in the short and long term, we have to do things in God’s way. Faith engenders love and teaches us to avoid the pitfalls of hate and pride. While there are numerous sources I could cite, I couldn’t resist the universality of Yoda teaching about the consequences of fear and hate:&lt;br /&gt;&lt;br /&gt;"Fear is the path to the dark side. Fear leads to anger. Anger leads to hate. 
Hate leads to suffering."&lt;br /&gt;&lt;br /&gt;Lest you think this principle is merely fiction, I point out that &lt;a href="http://lds.org/scriptures/ot/prov/29.25?lang=eng#24"&gt;Proverbs&lt;/a&gt; warns about fear of man and that &lt;a href="http://lds.org/scriptures/nt/2-tim/1.7?lang=eng#6"&gt;Timothy&lt;/a&gt; was taught that fear is not of God.&lt;br /&gt;&lt;br /&gt;While pride is not explicitly mentioned here, I consider it to be implied by, or at least compatible with, the above quote. Pride is the grease that makes the slide from prosperity to degeneracy smooth, both for individuals and societies. While we are often forced to take extreme actions against our adversaries, we should be careful to not hate our enemies. We should beware lest we cause our own downfall and estrangement from God through our pride.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;But if not...&lt;/h3&gt;&lt;br /&gt;While our faith can guide us and bring about miracles in our efforts to secure freedom, what about the times that prayers seem unanswered or the miracle doesn’t occur? We must be patient and remember that the faithful have to face the same opposition common to all mankind. We must remember that the demonstration of our faith &lt;a href="http://lds.org/scriptures/bofm/ether/12.6?lang=eng#5"&gt;precedes the miracle&lt;/a&gt;. We may think we know the ideal solution to a problem or the right timing for the solution, but often &lt;a href="http://lds.org/scriptures/ot/isa/55.8-9?lang=eng#7"&gt;God knows differently&lt;/a&gt;. Our faith must be able to sustain us, bringing us peace, even in times when our efforts to ensure security seem to fail, at least in the short term.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Conclusion&lt;/h3&gt;&lt;br /&gt;I’ve shared some principles that, if followed, I earnestly believe can help those of us seeking to provide security find a greater measure of success through divine assistance. I hope these words are encouraging to those who are seeking to live their faith. 
For those who don’t seek to live by faith, I hope this post helps you better understand those who do.&lt;br /&gt;&lt;br /&gt;I wish you all a happy and peaceful Easter.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-6895064482622415031?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/6895064482622415031/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2011/04/faith-and-security.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/6895064482622415031'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/6895064482622415031'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2011/04/faith-and-security.html' title='Faith and Security'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-2696932596397994175</id><published>2011-03-26T14:12:00.000-07:00</published><updated>2011-06-22T05:29:44.878-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><category scheme='http://www.blogger.com/atom/ns#' term='apt'/><category scheme='http://www.blogger.com/atom/ns#' term='devel'/><title type='text'>Passive Network Monitoring of Strong Authentication</title><content type='html'>There’s been a fair amount of consternation and FUD concerning the effectiveness of 
“strong authentication” in defending against APT. For example, in their &lt;a href="http://www.mandiant.com/news_events/forms/m-trends_2011"&gt;M-trends 2011&lt;/a&gt; report, Mandiant has demonstrated how smart cards are being subverted. If that isn’t bad enough, RSA has &lt;a href="http://www.rsa.com/node.aspx?id=3872"&gt;recently revealed&lt;/a&gt; that they’ve been the victim of attacks which they believe are attributable to APT and which resulted in attackers gaining access to information that may weaken the effectiveness of SecurID.&lt;br /&gt;&lt;br /&gt;Unfortunately, like most people blogging about these issues, I can’t provide any more authoritative information on the topic other than to say that, based on my personal experience, targeting and subverting strong authentication mechanisms is a common practice for some targeted, persistent attackers. It’s hard to predict the impact of any of these weaknesses. Additionally, people who have found out the hard way usually aren’t particularly open about sharing their hard knocks. &lt;br /&gt;&lt;br /&gt;Nevertheless, I’d like to advance the suitability of passive network monitoring as a method for helping to audit authentication, especially strong authentication mechanisms. While auditing is more properly conducted using logs provided by the devices that actually perform authentication (and authorization, access control, etc. if you want to be pedantic), there are real operational and organizational issues that may well make passive network monitoring one of the most effective means of gathering the information necessary to perform auditing of strong authentication.&lt;br /&gt;&lt;br /&gt;The vast majority of password-based authentication mechanisms bundle the username with the password and provide both to the server either in the clear or encrypted. It is possible to provide the username in the clear and the password encrypted, which would improve monitoring capabilities at the possible expense of privacy. 
In general, this bundling of credentials is done because confidentiality is provided through mechanisms that operate at a different layer of the stack, e.g. a username and password sent through an SSL tunnel.&lt;br /&gt;&lt;br /&gt;On the other hand, many authentication mechanisms provide the username/user identifier in the clear. For these protocols, passive network monitoring provides the ability to collect information necessary to provide some amount of auditing of user activity. In this post I advance two quick and dirty examples of how this information could be collected. For simplicity’s and brevity’s sake, I’ll focus solely on collecting usernames. I’ve chosen two protocols that are very frequently used in conjunction with strong authentication mechanisms: RADIUS and SSL/TLS client certificate authentication.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;RADIUS&lt;/h3&gt;&lt;br /&gt;RADIUS isn’t exactly the most secure authentication protocol in the world. Since it has some serious weaknesses, it’s normally not used over hostile networks (like the internet). However, it is frequently used inside organizations. In fact, it is very frequently used in conjunction with strong credentials such as RSA SecurID. One nice thing about RADIUS is that the username is passed in the clear in authentication requests. As such it’s pretty simple to build a monitoring tool to expose this data to auditing.&lt;br /&gt;&lt;br /&gt;In my example of monitoring RADIUS, I’ll use this &lt;a href="http://www.wand.net.nz/trac/libtrace/browser/trunk/test/traces/radius.pcap"&gt;packet capture&lt;/a&gt; taken from the testing data sets for libtrace.&lt;br /&gt;&lt;br /&gt;In my experience tcpdump is very useful for monitoring and parsing older and simpler protocols, especially ones that usually don’t span multiple packets, like DNS or RADIUS. 
The following shows how tcpdump parses one RADIUS authentication request:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;/usr/sbin/tcpdump -nn -r radius.pcap -s 0 -v "dst port 1812" -c 1&lt;br /&gt;reading from file radius.pcap, link-type EN10MB (Ethernet)&lt;br /&gt;18:42:58.228064 IP (tos 0x0, ttl  64, id 47223, offset 0, flags [DF], proto: UDP (17), length: 179) 10.1.12.20.1034 &gt; 192.107.171.165.1812: RADIUS, length: 151&lt;br /&gt;        Access Request (1), id: 0x2e, Authenticator: 36ea5ffd15130961caafc039b5909d34&lt;br /&gt;          Username Attribute (1), length: 6, Value: test&lt;br /&gt;          NAS IP Address Attribute (4), length: 6, Value: 10.1.12.20&lt;br /&gt;          NAS Port Attribute (5), length: 6, Value: 0&lt;br /&gt;          Called Station Attribute (30), length: 31, Value: 00-02-6F-21-EC-52:CRCnet-test&lt;br /&gt;          Calling Station Attribute (31), length: 19, Value: 00-02-6F-21-EC-5F&lt;br /&gt;          Framed MTU Attribute (12), length: 6, Value: 1400&lt;br /&gt;          NAS Port Type Attribute (61), length: 6, Value: Wireless - IEEE 802.11&lt;br /&gt;          Connect Info Attribute (77), length: 22, Value: CONNECT 0Mbps 802.11&lt;br /&gt;          EAP Message Attribute (79), length: 11, Value: .&lt;br /&gt;          Message Authentication Attribute (80), length: 18, Value: ...eE.*.B.._..).&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Note that we intentionally haven’t turned the verbosity up all the way. While there’s a lot of other good info in there, let’s say we only want to extract the UDP quad and the username and then send them to our SIMS so we can audit them. 
Assuming a configuration of syslog that sends logs somewhere to be audited appropriately, the following demonstrates how to do so:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;tcpdump -nn -r radius.pcap -s 0 -v "dst port 1812" | awk '{ if ( $1 ~ "^[0-9][0-9]:" ) { if ( SRC != "" ) { print SRC" "DST" "USER }; SRC=$18; DST=$20; USER="" }; if ( $0 ~ "  Username Attribute" ) {  USER=$NF } } END { if ( SRC != "" ) { print SRC" "DST" "USER } }' | logger -t radius_request&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This example generates syslog entries that appear as follows:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;Mar 26 14:45:15 monitor radius_request: 10.1.12.20.1034 192.107.171.165.1812: test&lt;br /&gt;Mar 26 14:45:15 monitor radius_request: 10.1.12.20.1034 192.107.171.165.1812: test&lt;br /&gt;Mar 26 14:45:15 monitor radius_request: 10.1.12.20.1034 192.107.171.165.1812: test&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I’ve done no significant validation to ensure that it’s complete, but this very well could be used on a large corporate network as is. Obviously, you’d need to replace the -r pcapfile with the appropriate -i interface.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;SSL/TLS Client Certificate&lt;/h3&gt;&lt;br /&gt;Another opportunity for simple passive monitoring is SSL/TLS when a client certificate is used. It is very common for this mechanism to be used to authenticate users with either soft or hard (i.e. smart card) certificates to web sites. This mechanism relies on PKI, which involves the use of a public and a private key. While the private key should never be transferred over the network, and in many cases never leaves the smart card, the public key is openly shared. 
In the case of SSL/TLS client certificate based authentication the public key, along with other information such as the client user identification, is passed in the clear during authentication as the client certificate.&lt;br /&gt;&lt;br /&gt;To have data for this example, I generated my own. I took the following steps based on the &lt;a href="http://wiki.wireshark.org/SSL"&gt;wireshark SSL wiki&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;openssl req -new -x509 -out server.pem -nodes -keyout privkey.pem -subj /CN=localhost/O=pwned/C=US&lt;br /&gt;openssl req -new -x509 -nodes -out client.pem -keyout client.key -subj /CN=Foobar/O=pwned/C=US&lt;br /&gt;&lt;br /&gt;openssl s_server -ssl3 -cipher AES256-SHA -accept 4443 -www -CAfile client.pem -verify 1 -key privkey.pem&lt;br /&gt;&lt;br /&gt;#start another shell&lt;br /&gt;tcpdump -i lo -s 0 -w ssl_client.pcap "tcp port 4443"&lt;br /&gt;&lt;br /&gt;#start another shell&lt;br /&gt;(echo GET / HTTP/1.0; echo ; sleep 1) | openssl s_client -connect localhost:4443 -ssl3 -cert client.pem -key client.key&lt;br /&gt;&lt;br /&gt;#kill tcpdump and server&lt;br /&gt;&lt;br /&gt;#fix pcap by converting back to 443 and fixing checksums (offload problem)&lt;br /&gt;tcprewrite --fixcsum --portmap=4443:443 --infile=ssl_client.pcap --outfile=ssl_client_443.pcap&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;You can download the resulting pcap &lt;a href="http://www.csmutz.com/smusec_files/ssl_client_443.pcap"&gt;here&lt;/a&gt;. 
&lt;br /&gt;&lt;br /&gt;The client certificate appears as follows:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;$ openssl x509 -in client.pem -noout -text&lt;br /&gt;Certificate:&lt;br /&gt;    Data:&lt;br /&gt;        Version: 3 (0x2)&lt;br /&gt;        Serial Number:&lt;br /&gt;            b0:cc:6b:94:b4:83:0f:78&lt;br /&gt;        Signature Algorithm: sha1WithRSAEncryption&lt;br /&gt;        Issuer: CN=Foobar, O=pwned, C=US&lt;br /&gt;        Validity&lt;br /&gt;            Not Before: Mar 26 13:13:12 2011 GMT&lt;br /&gt;            Not After : Apr 25 13:13:12 2011 GMT&lt;br /&gt;        Subject: CN=Foobar, O=pwned, C=US&lt;br /&gt;        Subject Public Key Info:&lt;br /&gt;            Public Key Algorithm: rsaEncryption&lt;br /&gt;            RSA Public Key: (1024 bit)&lt;br /&gt;                Modulus (1024 bit):&lt;br /&gt;                    00:e5:d6:78:cd:95:4e:89:0c:88:bd:78:98:26:86:&lt;br /&gt;                    0b:f1:be:df:85:98:a2:93:c1:66:65:44:d2:aa:08:&lt;br /&gt;                    69:2d:4c:a9:9d:50:08:79:1d:58:6e:6d:b4:2b:24:&lt;br /&gt;                    ca:37:90:d6:91:9f:6d:73:5f:51:5a:10:af:f0:ce:&lt;br /&gt;                    85:85:d6:e4:42:7b:ca:b0:af:0c:52:8b:60:1c:5b:&lt;br /&gt;                    3f:54:10:cc:c4:35:18:a8:a6:a7:c8:ae:df:b7:ab:&lt;br /&gt;                    a9:d9:20:cf:f7:5c:43:01:2e:12:cf:96:45:87:e7:&lt;br /&gt;                    7e:87:f7:5e:8f:25:23:1b:ee:bd:0a:79:48:07:99:&lt;br /&gt;                    ba:cc:68:16:53:43:56:e9:a1&lt;br /&gt;                Exponent: 65537 (0x10001)&lt;br /&gt;        X509v3 extensions:&lt;br /&gt;            X509v3 Subject Key Identifier:&lt;br /&gt;                BD:C2:84:BF:76:17:B7:15:BC:2F:8C:7E:A6:E6:18:B1:47:60:A3:B6&lt;br /&gt;            X509v3 Authority Key Identifier:&lt;br /&gt;                keyid:BD:C2:84:BF:76:17:B7:15:BC:2F:8C:7E:A6:E6:18:B1:47:60:A3:B6&lt;br /&gt;                
DirName:/CN=Foobar/O=pwned/C=US&lt;br /&gt;                serial:B0:CC:6B:94:B4:83:0F:78&lt;br /&gt;&lt;br /&gt;            X509v3 Basic Constraints:&lt;br /&gt;                CA:TRUE&lt;br /&gt;    Signature Algorithm: sha1WithRSAEncryption&lt;br /&gt;        4c:28:ea:47:20:38:d5:17:dd:cf:aa:f8:13:3e:d0:5f:cf:05:&lt;br /&gt;        7d:c7:a1:c3:f4:3e:d7:db:56:f7:d4:d6:d6:c6:f4:5c:47:5b:&lt;br /&gt;        99:f6:9c:23:2d:dc:75:ab:51:8b:96:df:26:3b:9e:59:8f:2c:&lt;br /&gt;        08:d1:84:bf:4f:98:65:b4:0f:b7:32:9d:2f:eb:d9:a5:a6:69:&lt;br /&gt;        b6:75:ce:03:f4:ad:3b:f2:e6:3a:a1:ff:44:ea:8a:98:40:34:&lt;br /&gt;        cc:dd:e0:d8:35:0e:8b:97:20:30:e4:7b:07:52:98:63:11:32:&lt;br /&gt;        5e:6e:cb:c7:f1:10:67:1c:cd:e2:03:3a:99:98:8b:2f:f8:94:&lt;br /&gt;        03:6f&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;For auditing, we are interested in extracting the CN, which in this case is “Foobar”. As the client certificate is transferred over the network, the CN appears as follows:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;000002e0  00 3f 0d 00 00 37 02 01  02 00 32 00 30 30 2e 31  |.?...7....2.00.1|&lt;br /&gt;000002f0  0f 30 0d 06 03 55 04 03  13 06 46 6f 6f 62 61 72  |.0...U....Foobar|&lt;br /&gt;00000300  31 0e 30 0c 06 03 55 04  0a 13 05 70 77 6e 65 64  |1.0...U....pwned|&lt;br /&gt;00000310  31 0b 30 09 06 03 55 04  06 13 02 55 53 0e 00 00  |1.0...U....US...|&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Immediately preceding the string “Foobar” is the following sequence (in hex):&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;06 03 55 04 03  13 06&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The "06 03" is the ASN.1 DER header of the object identifier that follows: tag 06 (OBJECT IDENTIFIER) and length 03, so it should be invariant wherever a CN appears in a client certificate. The "55 04 03" is indicative of the following data being a CN. 
This is an x509/ASN.1 thing where this sequence maps to the OID 2.5.4.3. The "13" can vary among a few common values (it specifies the data type) and the "06" indicates the length of the data (6 ASCII characters). Using this knowledge of SSL certificates we can create a tool to extract and log all CNs as follows:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;$ mkdir /dev/shm/ssl_client_streams&lt;br /&gt;$ cd /dev/shm/ssl_client_streams/&lt;br /&gt;$ vortex -r ssl_client_443.pcap -S 0 -C 10240 -g "svr port 443" | xargs -t -I+ pcregrep -o -H "\x06\x03\x55\x04\x03..[A-Za-z0-9]{1,100}" + | sed -r "s/\x06\x03\x55\x04\x03../ /" | sed 's/c/ /' | logger -t client_cert&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;This generates logs as follows:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;Mar 26 15:26:05 sr2s4 client_cert: 127.0.0.1:41143 127.0.0.1:443: localhost1&lt;br /&gt;Mar 26 15:26:05 sr2s4 client_cert: 127.0.0.1:41143 127.0.0.1:443: localhost1&lt;br /&gt;Mar 26 15:26:05 sr2s4 client_cert: 127.0.0.1:41143 127.0.0.1:443: Foobar1&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;If you are new to vortex, check out my &lt;a href="http://smusec.blogspot.com/search/label/vortex%20howto"&gt;vortex howto&lt;/a&gt; series. Basically we’re snarfing the first 10k of SSL streams transferred from the client to the server as files then analyzing them. Note that since we’re pulling all CNs out of all the certificates in the certificate chain provided by the client, we’re getting not only “Foobar” but “localhost” who is the CA in this case. Also note the trailing garbage we were too lazy to remove.&lt;br /&gt;&lt;br /&gt;While this works, this is a little too dirty even for me. The biggest problem is that the streams which are snarfed by vortex are never purged. 
Second, we’re doing a lot of work in an inefficient manner on each SSL stream, even those that don’t include client certs.&lt;br /&gt;&lt;br /&gt;Let’s refactor this slightly. First, we’re going to immediately weed out all streams we don’t want to look at. In this example I’m looking for client certs in general, but you could easily change the signature to be the CA for the certificates which you are interested in monitoring, e.g. “Pwned Org CA”:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;$ vortex -e -r ssl_client_443.pcap -S 0 -C 10240 -g "svr port 443" | xargs pcregrep -L "\x06\x03\x55\x04\x03" | xargs rm&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;That will leave all the streams which we want to inspect in the current dir. If we do something like the following in an infinite loop or very frequent cron job, then we’ll do the logging and purging we need:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;find -cmin +1 -type f | while read file&lt;br /&gt;do&lt;br /&gt;  pcregrep -o -H "\x06\x03\x55\x04\x03..[A-Za-z0-9]{1,100}" $file | sed -r "s/\x06\x03\x55\x04\x03../ /" | sed 's/c/ /' | logger -t client_cert&lt;br /&gt;  rm $file&lt;br /&gt;done&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This implementation is also probably suitable for use on a large network or pretty close to it.&lt;br /&gt;&lt;br /&gt;For these examples, it’s assumed that the logs are streamed to a log storage, aggregation, or correlation tool for real time auditing or for historical forensics. I would not be surprised if there were flaws in the examples as presented, so use at your own risk or perform the validation and tweaking necessary for your environment. These examples are intended to be merely that: a demonstration of feasibility. 
While I’ve discussed two specific protocols/mechanisms there are others that lend themselves to passive network monitoring as well as many that don’t.&lt;br /&gt;&lt;br /&gt;In this post I’ve shown how passive network monitoring could be used to help audit the use or misuse of strong authentication mechanisms. I’ve given quick and dirty examples which are probably suitable or are close to something that would be suitable for use on enterprise networks. Notwithstanding the weaknesses in my examples, I hope they provide ideas for what can be done to “trust, but verify” strong authentication mechanisms through data collection done on passive network sensors.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-2696932596397994175?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/2696932596397994175/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2011/03/passive-network-monitoring-of-strong.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/2696932596397994175'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/2696932596397994175'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2011/03/passive-network-monitoring-of-strong.html' title='Passive Network Monitoring of Strong Authentication'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-9163381893130979624</id><published>2011-03-19T10:12:00.000-07:00</published><updated>2011-06-22T05:30:05.665-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><category scheme='http://www.blogger.com/atom/ns#' term='ruminate'/><title type='text'>Update on Ruminate</title><content type='html'>It’s been a couple of weeks, but I wanted to say a little bit about the Feb 26 release of Ruminate.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Who should be interested in Ruminate?&lt;/h3&gt;&lt;br /&gt;This release is close to the level of refinement and capabilities necessary for use in an operational environment. Ruminate will be useful for people who are willing to spend extensive effort integrating their own network monitoring tools. I doubt many people will want to use it exactly as it is out of the box, but many of the components or even the whole framework (with custom detections running on top) may be useful to others. Ruminate as currently constituted is not for those who want a simple install process. Ruminate doesn’t do alerting or event correlation. It is up to the user to integrate Ruminate with an external log correlation and alerting system.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;The Good&lt;/h3&gt;&lt;br /&gt;I think the Ruminate architecture is very promising. It makes some things that are very hard to do in conventional NIDS look very easy. 
The following diagram shows the layout of the Ruminate components:&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/-AhDO3yipiTI/TYTmt3PQERI/AAAAAAAAABo/wmZP_w2AVbM/s1600/ruminate_components-20110226.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 273px; height: 320px;" src="http://2.bp.blogspot.com/-AhDO3yipiTI/TYTmt3PQERI/AAAAAAAAABo/wmZP_w2AVbM/s320/ruminate_components-20110226.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5585843113442677010" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;If you are totally new to Ruminate, I still suggest reading the &lt;a href="http://cs.gmu.edu/~tr-admin/papers/GMU-CS-TR-2010-20.pdf"&gt;technical report&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The improved HTTP parser is pretty groovy. I’m really pleased with the attempts I’m making at things like HTTP 206 defrag. I think my HTTP log format, which includes compact single character flags inspired by network flow records (e.g. argus), is pretty cute.&lt;br /&gt;&lt;br /&gt;Since I haven’t documented it anywhere else, let me do it here. 
The fields in the logs (with examples) are as follows:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;&lt;br /&gt;Jan 12 01:47:39 node1 http[26350]: tcp-198786717-1294814857-1294814859-c-33510-10.101.84.70:10977c129.174.93.161:80_http-0 1.1 GET cs.gmu.edu /~tr-admin/papers/GMU-CS-TR-2010-20.pdf 0 32768 206 1292442029 application/pdf TG ALHEk http://cs.gmu.edu/~tr-admin/papers/GMU-CS-TR-2010-20.pdf - "zh-CN,zh;q=0.8" "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10" "Apache"&lt;br /&gt;&lt;br /&gt;Transaction ID: tcp-198786717-1294814857-1294814859-c-33510-10.101.84.70:10977c129.174.93.161:80_http-0&lt;br /&gt;Request Version: 1.1&lt;br /&gt;Request Method: GET&lt;br /&gt;Request Host: cs.gmu.edu&lt;br /&gt;Request Resource: /~tr-admin/papers/GMU-CS-TR-2010-20.pdf&lt;br /&gt;Request Payload Size: 0&lt;br /&gt;Response Payload Size: 32768&lt;br /&gt;Response Code: 206&lt;br /&gt;Response Last-Modified (unix timestamp): 1292442029&lt;br /&gt;Response Content-Type: application/pdf&lt;br /&gt;Response Flags: TG&lt;br /&gt;Request Flags: ALHEk&lt;br /&gt;Request Referer: http://cs.gmu.edu/~tr-admin/papers/GMU-CS-TR-2010-20.pdf&lt;br /&gt;Request X-Forwarded-For: -&lt;br /&gt;Request Accept-Language (in quotes): "zh-CN,zh;q=0.8"&lt;br /&gt;Request User-Agent (in quotes): "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10"&lt;br /&gt;Response Server (in quotes): "Apache"&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The Request Flags are as follows:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;C =&gt; existence of "Cookie" header&lt;br /&gt;Z =&gt; existence of "Authorization" header&lt;br /&gt;T =&gt; existence of "Date" header&lt;br /&gt;F =&gt; existence of "From" header&lt;br /&gt;A =&gt; existence of "Accept" header&lt;br /&gt;L 
=&gt; existence of "Accept-Language" header&lt;br /&gt;H =&gt; existence of "Accept-Charset" header&lt;br /&gt;E =&gt; existence of "Accept-Encoding" header&lt;br /&gt;k =&gt; "keep-alive" value in Connection&lt;br /&gt;c =&gt; "close" value in Connection&lt;br /&gt;o =&gt; other value in Connection&lt;br /&gt;V =&gt; existence of "Via" header&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The Response Flags are as follows:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;C =&gt; existence of "Set-Cookie" header&lt;br /&gt;t =&gt; existence of "Transfer-Encoding" header, presumably chunked&lt;br /&gt;g =&gt; gzip content encoding&lt;br /&gt;d =&gt; deflate content encoding&lt;br /&gt;o =&gt; other content encoding&lt;br /&gt;T =&gt; existence of "Date" header&lt;br /&gt;L =&gt; existence of "Location" header&lt;br /&gt;V =&gt; existence of "Via" header&lt;br /&gt;G =&gt; existence of "ETag" header&lt;br /&gt;P =&gt; existence of "X-Powered-By" header&lt;br /&gt;i =&gt; starts with inline for Content-Disposition&lt;br /&gt;a =&gt; starts with attach for Content-Disposition&lt;br /&gt;f =&gt; starts with form-d for Content-Disposition&lt;br /&gt;c =&gt; other Content-Disposition&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;While not standard in any way, this log format should be very useful for my research. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;The Bad&lt;/h3&gt;&lt;br /&gt;Ruminate is rough. It’s nowhere near the level of refinement of the leading NIDS. This is not likely to change in the short term.&lt;br /&gt;Ruminate is based on a really old version of vortex. There are lots of reasons this isn’t optimal but the biggest issue is performance on high speed networks. 
Soon I’ll release a new version that is either based on the latest version of vortex or totally separate from, but dependent on, vortex.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Yara Everywhere&lt;/h3&gt;&lt;br /&gt;I’ve added yara to basically every layer of Ruminate. This is useful for those in operational environments because many people are used to yara and have existing signatures written for it. Since Ruminate is very object focused (not network focused), yara makes a lot of sense. While applying signatures to raw streams is not what Ruminate is about, it was easy to do and may even be useful for environments struggling with limitations in signature matching NIDS. Lastly, the use of yara, with its extensive meta-signature rule definitions, helps fill a gap in Ruminate which can’t reasonably be filled by an external event correlation engine.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Ruminate or Razorback (or both)&lt;/h3&gt;&lt;br /&gt;I’ve been asked, and it’s a good question, how Ruminate and Razorback compare. Before I express my candid opinions, I want to say that I’m very pleased with what the VRT guys are doing with Razorback. While there is some overlap in what I’m doing and what they’re doing (at least in high level goals), there’s more than enough room for multiple people innovating in the network payload object analysis space. If nothing else, the more people in the space, the more legitimate the problem of analyzing client payload objects (files) becomes. It seems unfathomable to me, but there are many who still question the value of using NIDS for detecting attacks against client applications (Adobe, IE) versus the traditional server exploits (IIS, WuFTP) or detection of today’s reconnaissance (Google search) versus old school reconnaissance (port scan).&lt;br /&gt;&lt;br /&gt;To date, Ruminate’s unique contributions are very much focused on scalable payload object collection, decoding, and reconstruction. 
Notable features include dynamic and highly scalable load balancing of network streams, full protocol decoding for HTTP and SMTP/MIME, and object fragment reassembly (ex. HTTP 206 defrag). If you want to comprehensively analyze payloads transferred through a large network, Ruminate is the best openly available tool for exposing the objects to analysis. The actual object analysis is pretty loose in Ruminate today, but is definitely simple and scalable. Ruminate’s biggest shortcoming is its rough implementation and relatively low level of refinement. This isn’t a problem for academia and other research, but it is a barrier to widespread adoption.&lt;br /&gt;&lt;br /&gt;Razorback is largely tackling the other end of the problem: what to do once you get the objects off the network (or host or other source for that matter). Razorback has a robust and well defined framework for client object analysis. While definitely in early beta state, Razorback is a whole lot more refined and “cleaner” than Ruminate. Razorback has a centrally controlled object distribution model, which has obvious advantages and disadvantages over what Ruminate is doing. Razorback’s limitations in network payload object extraction are inherited largely from its reliance on the Snort 2.0 framework, which, to be fair, was never designed for this sort of analysis.&lt;br /&gt;&lt;br /&gt;While I’ve never actually done it, if there were a brave soul who wanted to combine the best of both Ruminate and Razorback, it would be possible to use Ruminate to extract objects off the network and use Razorback to analyze the objects. Using the parlance of both respectively, one could modify Ruminate’s object multiplexer (object_mux) to be a collector for Razorback. 
The point I'm trying to make is that the innovations found in Ruminate and Razorback may be more complementary than competing.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Take what you want (or leave it)&lt;/h3&gt;&lt;br /&gt;I’m sharing what I’ve implemented in hopes that it helps advance academic research and the solutions used in industry. Please take Ruminate as a whole, some components, or simply the ideas or paradigm and run with them. I’m always interested in hearing feedback on Ruminate or the ideas it advances. I’m also open to working with others on research using or continued development of Ruminate.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-9163381893130979624?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/9163381893130979624/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2011/03/update-on-ruminate.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/9163381893130979624'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/9163381893130979624'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2011/03/update-on-ruminate.html' title='Update on Ruminate'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://2.bp.blogspot.com/-AhDO3yipiTI/TYTmt3PQERI/AAAAAAAAABo/wmZP_w2AVbM/s72-c/ruminate_components-20110226.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-5062415539797548214</id><published>2011-01-22T16:37:00.000-08:00</published><updated>2011-01-22T16:47:59.650-08:00</updated><title type='text'>Shameless plug for Colleagues' DC3 Presentations</title><content type='html'>Is it shameful to engage in cronyism, if you disclose it up front? I hope not.&lt;br /&gt;&lt;br /&gt;While I’m not going to be attending the &lt;a href="http://www.dodcybercrime.com/11CC/"&gt;DoD Cyber Crime Conference&lt;/a&gt; this year, I’d like to draw attention to some of my colleagues who will be. Since I’m not attending, I haven’t looked at who else is speaking.&lt;br /&gt;&lt;br /&gt;Sam Wenck, who co-presented with me last year and works side by side with me daily, is presenting on Threat Intelligence Knowledge Management for Incident Response. In essence, he’ll be speaking on how to implement the technology necessary to support intelligence driven CND. If you are interested in improving your organization’s ability to record, maintain, and leverage threat intelligence, you should attend. &lt;br /&gt;&lt;br /&gt;Kieth Gould will be speaking to the title of “When did it happen? Are you sure about that?” I believe the original title of this preso was “How to score a date with your PC” (which Kieth routinely does). Frankly, I’m just not deep enough into host based forensics to fully appreciate the subject matter. Kieth has a reputation for his aptitude for and thorough attention to esoteric technical detail. 
This presentation might break the Geek Meter scale.&lt;br /&gt;&lt;br /&gt;Having had previews of the content, I expect both these presentations to contain an abundance of pragmatic technical content and be free from annoying marketing rhetoric.&lt;br /&gt;&lt;br /&gt;I also believe Mike Cloppert is going to be on a panel (not sure which one), but he doesn’t need any help drawing crowds.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-5062415539797548214?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/5062415539797548214/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2011/01/shameless-plug-for-colleagues-dc3.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/5062415539797548214'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/5062415539797548214'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2011/01/shameless-plug-for-colleagues-dc3.html' title='Shameless plug for Colleagues&apos; DC3 Presentations'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-6651048799254428804</id><published>2011-01-13T16:56:00.000-08:00</published><updated>2011-06-22T05:31:06.421-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><category 
scheme='http://www.blogger.com/atom/ns#' term='ruminate'/><title type='text'>Gnawing on HTTP 206 Fragmented Payloads with Ruminate</title><content type='html'>I've been madly working on getting Ruminate to a point where I can recommend it to people in industry for use, hopefully by the end of January 2011. I've done a huge amount of work on HTTP decoding including a working implementation of HTTP 206 defragmentation which I consider a "killer feature" when dealing with payloads transferred through the network. I wanted to take a break from the documentation and code packaging that Ruminate so badly needs to discuss the importance of this mechanism, along with some examples. This discussion should also help clarify the areas where Ruminate is seeking to innovate.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;HTTP 206 Partial Content&lt;/h3&gt;&lt;br /&gt;&lt;br /&gt;As NIDS begin to earnestly address true layer 7 decoding and embedded object analysis (ex. files transferred through network), they will run into complications like HTTP 206. I haven't heard much about HTTP 206 defrag so I assume this isn't on most people's radar.&lt;br /&gt;&lt;br /&gt;What is HTTP 206? It's basically HTTP's method of fragmenting payload objects. 206 is the response code, just like 200 or 404. If you want to download just part of a file, you can ask the server to give you a specific set (or sets) of bytes and compliant servers will respond with only the data you asked for via a 206 response.&lt;br /&gt;&lt;br /&gt;If you're not looking for malicious content in HTTP 206 transactions, you should be. Who really cares about HTTP 206 transactions if they represent a very small number of total HTTP transactions on a network? One oft overlooked detail is that HTTP 206 is actually used to transfer a significant amount (often up to 20%) of the most interesting payloads, such as PDF documents or PE executables. 
Even though HTTP 206 is often used naively by unwitting clients, it is used to transfer malicious content just as well as benign content, making life harder for your NIDS in the process.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Layer 7 and Embedded Object Defrag&lt;/h3&gt;&lt;br /&gt;One of Ruminate's goals is to address layer 7 and payload object analysis with the same level of vigor that current NIDS address layer 3 and layer 4. Part of this analysis necessarily involves layer 7 and payload object defrag/reassembly just like layer 3 and layer 4 defrag/reassembly have been big topics for the current generation NIDS. HTTP 206 is a perfect example of layer 7 fragmentation that is loosely analogous to ipfrag, etc. What is an example of client application object fragmentation? Imagine you have malicious JavaScript and you want to evade NIDS that are smart enough to decode basic JavaScript obfuscation like hex armoring. One option is to split your JavaScript across multiple files (which all get included at run time), possibly across multiple servers/domains.&lt;br /&gt;&lt;br /&gt;The next release of Ruminate will include thousands of lines of new and improved HTTP parsing code, including a new 206defrag service. When an individual HTTP parser node comes across an HTTP 206 response, it feeds the fragmented payload to the 206defrag service which does the defragmentation. When the 206defrag service has all the pieces of the file, the reassembled payload is passed through the object multiplexer to the appropriate analysis service(s), ex. PDF.&lt;br /&gt;&lt;br /&gt;I'm very pleased at the progress I've made to address HTTP 206. First of all, it actually works! In operation so far, I've been able to look at a lot of interesting payloads that I wouldn't have been able to otherwise.&lt;br /&gt;&lt;br /&gt;I wanted to share some examples that demonstrate uses of HTTP 206 in the wild. The first example will be very straightforward and is the type of thing you’ll see most often. 
The other two examples demonstrate characteristics that are less common, but still happen in the real world. None of the examples were contrived or fabricated--they were taken from real network traffic that I had no direct influence on. I will, however, use them to show what I believe to be useful functionality of Ruminate. I anonymized the client IP addresses, but other than that, the data is just as observed. Note that other than interesting examples of HTTP 206 in action, there is absolutely no malicious, sensitive, private or otherwise interesting data in the pcaps. The &lt;a href="http://www.ruminate-ids.org/files/206_examples.zip"&gt;206_examples.zip&lt;/a&gt; download includes the pcaps of the examples and the relevant logs from Ruminate. For those stout of heart enough to actually tinker with Ruminate in its current state, I’ve also included the new HTTP code in the download.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Example A &lt;/h3&gt;&lt;br /&gt;Example A is a canonical example of HTTP 206 fragmentation. 
Let’s start with the logs:&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;[csmutz@master 206_examples]$ cat http_a.log&lt;br /&gt;Jan 12 01:47:39 node1 http[26350]: tcp-198786717-1294814857-1294814859-c-33510-10.101.84.70:10977c129.174.93.161:80_http-0 1.1 GET cs.gmu.edu /~tr-admin/papers/GMU-CS-TR-2010-20.pdf 0 32768 206 1292442029 application/pdf TG ALHEk http://cs.gmu.edu/~tr-admin/papers/GMU-CS-TR-2010-20.pdf - "zh-CN,zh;q=0.8" "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10" "Apache"&lt;br /&gt;Jan 12 01:47:39 master 206defrag: input tcp-198786717-1294814857-1294814859-c-33510-10.101.84.70:10977c129.174.93.161:80_http-0 555523 0 32768 cs.gmu.edu /~tr-admin/papers/GMU-CS-TR-2010-20.pdf 10.101.84.70&lt;br /&gt;Jan 12 01:48:17 node4 http[26947]: tcp-198787353-1294814861-1294814896-c-523548-10.101.84.70:10978c129.174.93.161:80_http-0 1.1 GET cs.gmu.edu /~tr-admin/papers/GMU-CS-TR-2010-20.pdf 0 522755 206 1292442029 application/pdf TG ALHEk http://cs.gmu.edu/~tr-admin/papers/GMU-CS-TR-2010-20.pdf - "zh-CN,zh;q=0.8" "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10" "Apache"&lt;br /&gt;Jan 12 01:48:17 master 206defrag: input tcp-198787353-1294814861-1294814896-c-523548-10.101.84.70:10978c129.174.93.161:80_http-0 555523 32768 522755 cs.gmu.edu /~tr-admin/papers/GMU-CS-TR-2010-20.pdf 10.101.84.70&lt;br /&gt;Jan 12 01:48:17 master 206defrag: output tcp-198786717-1294814857-1294814859-c-33510-10.101.84.70:10977c129.174.93.161:80_http-0_206defrag normal 2 555523 5a484ada9c816c0e8b6d2d3978e3f503 tcp-198786717-1294814857-1294814859-c-33510-10.101.84.70:10977c129.174.93.161:80_http-0,tcp-198787353-1294814861-1294814896-c-523548-10.101.84.70:10978c129.174.93.161:80_http-0&lt;br /&gt;[csmutz@master 206_examples]$ cat object_a.log&lt;br /&gt;Jan 12 01:48:17 master object_mux[11977]: 
tcp-198786717-1294814857-1294814859-c-33510-10.101.84.70:10977c129.174.93.161:80_http-0_206defrag 555523 5a484ada9c816c0e8b6d2d3978e3f503 pdf PDF document, version 1.4&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Unfortunately I don’t have time to explain in full the log formats, etc. Hopefully I'll document that somewhere more accessible than the code soon :). The first log line demonstrates the 1st HTTP transaction where the client asks the server for the first 32k of the PDF and the server obliges.&lt;br /&gt;&lt;br /&gt;Headers are as follows:&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;GET /~tr-admin/papers/GMU-CS-TR-2010-20.pdf HTTP/1.1&lt;br /&gt;Host: cs.gmu.edu&lt;br /&gt;Connection: keep-alive&lt;br /&gt;Referer: http://cs.gmu.edu/~tr-admin/papers/GMU-CS-TR-2010-20.pdf&lt;br /&gt;Accept: */*&lt;br /&gt;User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10&lt;br /&gt;Accept-Encoding: gzip,deflate,sdch&lt;br /&gt;Accept-Language: zh-CN,zh;q=0.8&lt;br /&gt;Accept-Charset: GBK,utf-8;q=0.7,*;q=0.3&lt;br /&gt;Range: bytes=0-32767 &lt;br /&gt;&lt;br /&gt;HTTP/1.1 206 Partial Content&lt;br /&gt;Date: Wed, 12 Jan 2011 06:47:37 GMT&lt;br /&gt;Server: Apache&lt;br /&gt;Last-Modified: Wed, 15 Dec 2010 19:40:29 GMT&lt;br /&gt;ETag: "56010f-87a03-497781c080540"&lt;br /&gt;Accept-Ranges: bytes&lt;br /&gt;Content-Length: 32768&lt;br /&gt;Content-Range: bytes 0-32767/555523&lt;br /&gt;Connection: close&lt;br /&gt;Content-Type: application/pdf&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;That’s all straightforward. The HTTP parser realizes that it doesn’t have a complete payload object so instead of passing it to the object multiplexer it sends it to the 206defrag service. The next log line shows the 206defrag service receiving this fragment. 
Since it doesn’t have the whole object yet, it holds on to it.&lt;br /&gt;&lt;br /&gt;After sampling the first 32k, the client gets the rest of the PDF. Headers are as follows:&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;GET /~tr-admin/papers/GMU-CS-TR-2010-20.pdf HTTP/1.1 &lt;br /&gt;Host: cs.gmu.edu &lt;br /&gt;Connection: keep-alive &lt;br /&gt;Referer: http://cs.gmu.edu/~tr-admin/papers/GMU-CS-TR-2010-20.pdf&lt;br /&gt;Accept: */*&lt;br /&gt;User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10&lt;br /&gt;Accept-Encoding: gzip,deflate,sdch&lt;br /&gt;Accept-Language: zh-CN,zh;q=0.8&lt;br /&gt;Accept-Charset: GBK,utf-8;q=0.7,*;q=0.3&lt;br /&gt;Range: bytes=32768-555522&lt;br /&gt;If-Range: "56010f-87a03-497781c080540" &lt;br /&gt;&lt;br /&gt;HTTP/1.1 206 Partial Content&lt;br /&gt;Date: Wed, 12 Jan 2011 06:47:41 GMT&lt;br /&gt;Server: Apache&lt;br /&gt;Last-Modified: Wed, 15 Dec 2010 19:40:29 GMT&lt;br /&gt;ETag: "56010f-87a03-497781c080540"&lt;br /&gt;Accept-Ranges: bytes&lt;br /&gt;Content-Length: 522755&lt;br /&gt;Content-Range: bytes 32768-555522/555523&lt;br /&gt;Connection: close &lt;br /&gt;Content-Type: application/pdf&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Again, this is very straightforward. The client gets the rest of the file. Note the “ETag” and “If-Range” headers. If clients and servers consistently used this convention it might make reassembly easier. Alas, it’s frequently not used. The server was nice enough to report a content type of “application/pdf” for both fragments, and it doesn’t use any other content-encoding or transfer-encoding, etc. If only all transactions were this simple!&lt;br /&gt;&lt;br /&gt;After receiving the 2nd fragment on the 4th log line, the 206defrag service realizes it has the whole payload now. Line 5 shows the service sending this payload object off for analysis. 
In line 6 the object multiplexer decides to send this file on to the PDF analyzer. Not shown here, but the PDF analysis service deems this PDF well worth the time to read :)&lt;br /&gt;&lt;br /&gt;This is a very simple and clean example of HTTP 206 fragmentation. Most uses of HTTP 206 are similar to this, even if not quite this simple. In very many cases, instead of being split across separate TCP streams, the fragments are sent serially in the same stream a la pipelined request/responses. This general scenario is very common for PDFs.&lt;br /&gt;&lt;br /&gt;One point I’d like to make here is that if your NIDS doesn’t do HTTP 206 defrag, you lose the opportunity to analyze a significant portion of PDFs, at least any analysis that requires looking at the whole PDF at once.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Example B&lt;/h3&gt;&lt;br /&gt;Example B is interesting for a couple reasons. Again, let’s start with the logs:&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;[csmutz@master 206_examples]$ cat http_b.log&lt;br /&gt;Jan 12 02:17:56 node4 http[27618]: tcp-198921731-1294816073-1294816075-i-936869-192.168.72.14:3254c65.54.95.206:80_http-0 1.1 GET au.download.windowsupdate.com /msdownload/update/software/uprl/2011/01/windows-kb890830-v3.15-delta_7d99803eaf3b6e8dfa3581348bc694089579d25a.exe 0 816896 206 1294342831 application/octet-stream TP AEk - - "" "Microsoft BITS/6.6" "Microsoft-IIS/7.5"&lt;br /&gt;Jan 12 02:17:56 master 206defrag: input tcp-198921731-1294816073-1294816075-i-936869-192.168.72.14:3254c65.54.95.206:80_http-0 1022920 0 816896 au.download.windowsupdate.com /msdownload/update/software/uprl/2011/01/windows-kb890830-v3.15-delta_7d99803eaf3b6e8dfa3581348bc694089579d25a.exe 192.168.72.14&lt;br /&gt;Jan 12 02:17:56 node4 http[27618]: tcp-198921731-1294816073-1294816075-i-936869-192.168.72.14:3254c65.54.95.206:80_http-1 1.1 GET au.download.windowsupdate.com 
/msdownload/update/software/uprl/2011/01/windows-kb890830-v3.15-delta_7d99803eaf3b6e8dfa3581348bc694089579d25a.exe 0 0 - - - - AEk - - "" "Microsoft BITS/6.6" ""&lt;br /&gt;Jan 12 02:33:26 node1 http[26761]: tcp-199054360-1294817575-1294817576-r-206649-192.168.72.14:3257c65.54.95.14:80_http-0 1.1 GET au.download.windowsupdate.com /msdownload/update/software/uprl/2011/01/windows-kb890830-v3.15-delta_7d99803eaf3b6e8dfa3581348bc694089579d25a.exe 0 206024 206 1294342831 application/octet-stream TP AEk - - "" "Microsoft BITS/6.6" "Microsoft-IIS/7.5"&lt;br /&gt;Jan 12 02:33:26 master 206defrag: input tcp-199054360-1294817575-1294817576-r-206649-192.168.72.14:3257c65.54.95.14:80_http-0 1022920 816896 206024 au.download.windowsupdate.com /msdownload/update/software/uprl/2011/01/windows-kb890830-v3.15-delta_7d99803eaf3b6e8dfa3581348bc694089579d25a.exe 192.168.72.14&lt;br /&gt;Jan 12 02:33:26 master 206defrag: output tcp-198921731-1294816073-1294816075-i-936869-192.168.72.14:3254c65.54.95.206:80_http-0_206defrag normal 2 1022920 fc13fee1d44ef737a3133f1298b21d28 tcp-198921731-1294816073-1294816075-i-936869-192.168.72.14:3254c65.54.95.206:80_http-0,tcp-199054360-1294817575-1294817576-r-206649-192.168.72.14:3257c65.54.95.14:80_http-0&lt;br /&gt;[csmutz@master 206_examples]$ cat object_b.log&lt;br /&gt;Jan 12 02:33:26 master object_mux[3282]: tcp-198921731-1294816073-1294816075-i-936869-192.168.72.14:3254c65.54.95.206:80_http-0_206defrag 1022920 fc13fee1d44ef737a3133f1298b21d28 null PE32 executable for MS Windows (GUI) Intel 80386 32-bit&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;At first glance, this looks a lot like the last example. There are some subtle but notable differences. First of all, the first TCP stream contains two requests, not one. While the first transaction looks normal, the log for the second is incomplete. The size of the response payload is logged as 0, the response code is “-”, and none of the response headers are set. 
What is happening here is that Ruminate can validate and parse the request but not the response, so it only logs the metadata for the request. Why can’t it parse the response? To find out, we’ll have to go to the packets... &lt;br /&gt;&lt;br /&gt;Looking at packet 956, we see the second pipelined request. Presumably everything is still normal at this point:&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;[csmutz@master 206_examples]$ tshark -nn -r 206_example_b.pcap | grep "^956 "&lt;br /&gt;956   1.259759 192.168.72.14 -&gt; 65.54.95.206 HTTP GET /msdownload/update/software/uprl/2011/01/windows-kb890830-v3.15-delta_7d99803eaf3b6e8dfa3581348bc694089579d25a.exe HTTP/1.1&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;If we go farther down the packet trace, we get to the point where the client receives the header for the 2nd response in packet 1213:&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;[csmutz@master 206_examples]$ tshark -nn -r 206_example_b.pcap | grep -C 2 "^1213 "&lt;br /&gt;1211   1.407243 192.168.72.14 -&gt; 65.54.95.206 TCP [TCP Dup ACK 1101#52] 3254 &gt; 80 [ACK] Seq=581 Ack=899890 Win=65535 Len=0 SLE=935155 SRE=965425&lt;br /&gt;1212   1.407254 65.54.95.206 -&gt; 192.168.72.14 TCP [TCP segment of a reassembled PDU]&lt;br /&gt;1213   1.407255 65.54.95.206 -&gt; 192.168.72.14 HTTP HTTP/1.1 206 Partial Content  (application/octet-stream)&lt;br /&gt;1214   1.407347 192.168.72.14 -&gt; 65.54.95.206 TCP [TCP Dup ACK 1101#53] 3254 &gt; 80 [ACK] Seq=581 Ack=899890 Win=65535 Len=0 SLE=935155 SRE=965425&lt;br /&gt;1215   1.407465 192.168.72.14 -&gt; 65.54.95.206 TCP [TCP Dup ACK 1101#54] 3254 &gt; 80 [ACK] Seq=581 Ack=899890 Win=65535 Len=0 SLE=935155 SRE=965425&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;Already we see something amiss. The client is incessantly ACKing data at a point that is partway into the payload of the 2nd response. 
As it turns out, the client never ACKs any more data, even though the server tries to ram the whole response down the client’s buffer. It appears that the whole payload for the 2nd response is transferred over the wire, but the client never ACKs it. Ruminate handles this case by assuming the client threw away the unACKed data and doing essentially the same. Since the whole response can’t be reconstructed, Ruminate punts and provides no metadata about the response in the log and doesn't send the payload fragment to the 206defrag service, considering it invalid. Some could argue that it would be nice if Ruminate were a little more promiscuous in the TCP reassembly and HTTP parsing. While I could see the argument that it would be nice to provide some information about the response, the current behavior is relatively simple and safe. I suspect that some other NIDS and network forensics utilities would actually use all the unACKed data, opening the door to analyze the whole payload at this point. I can see the appeal of this approach. I’m not 100% sure I’ve analyzed this situation correctly, but I think Ruminate does the right thing in this case.&lt;br /&gt;&lt;br /&gt;It seems apparent that the client discarded this unACKed data because several minutes later, it requests the second fragment over again, which it receives successfully. After the client receives this second fragment, Ruminate splices the two fragments together and the exe is sent off for analysis. The interesting part about this 2nd attempt for the 2nd fragment is that this time the client chose a different mirror to download from--it’s on the same subnet but is a different IP. &lt;br /&gt;&lt;br /&gt;I chose this example because it points out a few things. First, it demonstrates how the classic layer 4 defrag accuracy problem can influence the layer 7 defrag problem. Similarly, it alludes to the same problems applied to layer 7. What do you do if layer 7 fragments, ex. HTTP 206 fragments, overlap? 
Which version do you keep if it’s different? Can this be used for NIDS evasion like it was in the layer 4 case? These are the types of interesting questions I hope Ruminate aids in studying.&lt;br /&gt;&lt;br /&gt;I believe this example also helps validate some of the architecture of Ruminate, from dynamic load balancing of streams to a service based approach. Since the two layer 7 fragments were sent from distinct client/server IP pairs, you have no guarantee that the conventional method of static header load balancing would send the layer 7 fragments to the same HTTP analysis node. If you are going to do this the conventional NIDS way, you are forced to accept a high cost in synchronization between the two analyzer nodes because layer 7 defrag can involve large amounts of data spread through long periods of time. The service based approach not only factors in realities of today’s commodity IT infrastructure, but makes this problem look relatively simple.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Example C&lt;/h3&gt;&lt;br /&gt;&lt;br /&gt;Instead of leading off with the logs for this example, I need to explain one more wrinkle of HTTP 206. I didn’t learn about this until I was trying to implement 206defrag and was disappointed to see that many of the PDFs I tried to download on my own machine weren’t being successfully reconstructed by Ruminate (my computer almost always does HTTP 206 when downloading PDFs). If the client requests more than one byte range in a single request, the server puts the various responses in a MIME blob that separates the byte ranges much like multiple attachments to an email, but from what I’ve seen, sans the base64 encoding. 
If I understand correctly, this is very similar to how some POSTs are encoded.&lt;br /&gt;&lt;br /&gt;This is how it looks in practice:&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;GET /courses/ECE545/viewgraphs_F04/loCarb_VHDL_small.pdf HTTP/1.1&lt;br /&gt;Host: teal.gmu.edu &lt;br /&gt;User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 ( .NET CLR 3.5.30729; .NET4.0C) Creative ZENcast v1.02.10 &lt;br /&gt;Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8&lt;br /&gt;Accept-Language: en-us,en;q=0.5&lt;br /&gt;Accept-Encoding: gzip,deflate&lt;br /&gt;Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7 &lt;br /&gt;Keep-Alive: 115 &lt;br /&gt;Connection: keep-alive&lt;br /&gt;X-REMOVED: Range&lt;br /&gt;X-Behavioral-Ad-Opt-Out: 1&lt;br /&gt;X-Do-Not-Track: 1 &lt;br /&gt;Range: bytes=1-1,0-4095 &lt;br /&gt; &lt;br /&gt;HTTP/1.1 206 Partial Content&lt;br /&gt;Date: Mon, 10 Jan 2011 17:02:50 GMT &lt;br /&gt;Server: Apache&lt;br /&gt;Last-Modified: Sat, 20 Nov 2004 02:05:07 GMT&lt;br /&gt;ETag: "25fb6-79bec-d67fac0"&lt;br /&gt;Accept-Ranges: bytes&lt;br /&gt;Content-Length: 4303&lt;br /&gt;Keep-Alive: timeout=15, max=100&lt;br /&gt;Connection: Keep-Alive&lt;br /&gt;Content-Type: multipart/byteranges; boundary=49980f01bf1635062&lt;br /&gt;  &lt;br /&gt;--49980f01bf1635062&lt;br /&gt;Content-type: application/pdf&lt;br /&gt;Content-range: bytes 1-1/498668&lt;br /&gt; &lt;br /&gt;P&lt;br /&gt;--49980f01bf1635062&lt;br /&gt;Content-type: application/pdf&lt;br /&gt;Content-range: bytes 0-4095/498668 &lt;br /&gt;&lt;br /&gt;%PDF-1.4&lt;br /&gt;...&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;In this case you see the client asking for and the server responding with the second byte of the PDF, then the first 4K of it.&lt;br /&gt;&lt;br /&gt;For brevity’s sake, I’ll only display the 206defrag “output” log:&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;Jan 10 
12:04:02 master 206defrag: output tcp-170962418-1294678989-1294679016-c-233988-10.45.179.94:19950c129.174.93.170:80_http-0-part-1_206defrag normal 70 498668 94046a5fb1c5802d0f1e6d704cf3e10e tcp-170962418-1294678989-1294679016-c-233988-10.45.179.94:19950c129.174.93.170:80_http-0-part-1,tcp-170962418-1294678989-1294679016-c-233988-10.45.179.94:19950c129.174.93.170:80_http-1-part-1,tcp-170962841-1294678990-1294679016-c-305932-10.45.179.94:19953c129.174.93.170:80_http-1-part-4,tcp-170962418-1294678989-1294679016-c-233988-10.45.179.94:19950c129.174.93.170:80_http-6-part-1,tcp-170962418-1294678989-1294679016-c-233988-10.45.179.94:19950c129.174.93.170:80_http-7-part-2,tcp-170962841-1294678990-1294679016-c-305932-10.45.179.94:19953c129.174.93.170:80_http-2-part-1,...&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;In case you’re curious, yes the “70” early in the log means that the payload was assembled from 70 fragments. Furthermore, the “normal” means that the fragments were spliced together from contiguous segments without any portions of the fragments overlapping. Note that the duplication of byte 1 numerous times doesn’t affect this because it’s not necessary to use those fragments. In the future, I could be more granular with the logic and logging for special cases where fragments are duplicated, fragments overlap, etc. I have little knowledge of how specific HTTP clients handle situations like overlapping fragments.&lt;br /&gt;&lt;br /&gt;One other thing of note is that these fragments are being transferred through two simultaneous TCP connections (client ports 19950 and 19953) using multiple HTTP 1.1 transactions. 
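&lt;br /&gt;&lt;br /&gt;As a rough illustration of the splicing logic described above, here is a minimal Python sketch. It is not the actual 206defrag code: the function name, the (start offset, data) tuple layout, and the “overlap”, “incomplete”, and “invalid” labels are all assumptions made for this example. Fragments are sorted by start offset and laid end to end, and any fragment that is already fully covered is simply not used:&lt;br /&gt;
&lt;pre&gt;
```python
from typing import List, Optional, Tuple

# Hypothetical fragment layout for this sketch: (start_offset, data).
Fragment = Tuple[int, bytes]


def reassemble(fragments: List[Fragment],
               total_length: int) -> Tuple[Optional[bytes], str]:
    """Splice byte-range fragments into a single payload.

    Returns (payload, status). The status is "normal" when every fragment
    that was used lined up end to end with no overlap, "overlap" when a
    used fragment partially overlapped bytes already placed, "incomplete"
    when a gap remains, and "invalid" when a fragment runs past the end.
    The payload is None unless the status is "normal" or "overlap".
    """
    payload = bytearray(total_length)
    position = 0  # next byte offset that still needs to be filled
    status = "normal"
    # Sort by start offset; at equal offsets try the longest fragment first.
    for start, data in sorted(fragments, key=lambda f: (f[0], -len(f[1]))):
        end = start + len(data)
        if end > total_length:
            return None, "invalid"
        if position >= end:
            continue  # already fully covered (e.g. a repeated byte 1)
        if start > position:
            return None, "incomplete"  # gap before this fragment
        if position > start:
            status = "overlap"  # only the uncovered tail is used
        payload[position:end] = data[position - start:]
        position = end
    if total_length > position:
        return None, "incomplete"
    return bytes(payload), status
```
&lt;/pre&gt;
With input shaped like the example above, such as a fragment covering bytes 0-4095 plus several copies of the single byte at offset 1, the single-byte fragments are skipped and the result is still reported as “normal”.&lt;br /&gt;&lt;br /&gt;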
One other thing that I think is interesting about this example is the seemingly sporadic order in which the fragments are requested:&lt;br /&gt;&lt;br /&gt;The following shows the client TCP port, the HTTP transaction index in that TCP connection, the MIME part index, the fragment start index, and the fragment length.&lt;br /&gt;&lt;span style="font-family: monospace; font-size: 9pt;"&gt;&lt;br /&gt;[csmutz@master 206_examples]$ cat http_c.log | grep input | sed -r 's/tcp-.*:([0-9]+)c.*-([0-9]+-part-[0-9]+) /\1.\2 /' | awk '{ print $7" "$9" "$10 }'&lt;br /&gt;19953.0-part-0 1 1&lt;br /&gt;19950.0-part-0 1 1&lt;br /&gt;19953.0-part-1 487541 4096&lt;br /&gt;19950.0-part-1 0 4096&lt;br /&gt;19953.1-part-0 1 1&lt;br /&gt;19950.1-part-0 1 1&lt;br /&gt;19950.1-part-1 4096 14319&lt;br /&gt;19953.1-part-1 478933 1325&lt;br /&gt;19953.1-part-2 477152 1781&lt;br /&gt;19950.2-part-0 1 1&lt;br /&gt;19953.1-part-3 480258 803&lt;br /&gt;19953.1-part-4 18415 2540&lt;br /&gt;19950.2-part-1 494520 4096&lt;br /&gt;19953.1-part-5 481061 697&lt;br /&gt;19950.3-part-0 1 1&lt;br /&gt;19953.2-part-0 1 1&lt;br /&gt;19953.2-part-1 32255 13312&lt;br /&gt;19950.3-part-1 498616 52&lt;br /&gt;19953.3-part-0 1 1&lt;br /&gt;19950.4-part-0 1 1&lt;br /&gt;19953.3-part-1 52049 5315&lt;br /&gt;19953.3-part-2 483154 1646&lt;br /&gt;19950.4-part-1 491637 2883&lt;br /&gt;19953.3-part-3 57364 5529&lt;br /&gt;19953.3-part-4 485870 46&lt;br /&gt;...&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;I’m not sure I can discern any pattern to the manner in which the fragments are transferred, but it’s definitely not in order. While this looks like a bit of a shotgun (double barreled in this case) approach to getting this file, it’s not overly haphazard as the fragments line up nicely. I did quickly look at the byteranges themselves to see if they correlated to the internal structure of the PDF (objects/streams) but didn’t see anything too obvious in the couple I examined. 
I’m also not sure why the client wants to request the second byte so frequently. According to my reckoning, the payload was reconstructed from 70 fragments, using 22 HTTP transactions, through 2 unique TCP connections. While definitely the exception rather than the norm, this is an example where the buffer then analyze model of Ruminate has significant benefits over the stateful incremental analysis model of conventional packet based NIDS.&lt;br /&gt; &lt;br /&gt;While they are rare conditions, examples B and C demonstrate the type of issues I’ve built Ruminate to be able to study and address. As attacks continue to move up the stack, NIDS research needs to as well.&lt;br /&gt;&lt;br /&gt;Descending out of the clouds into the real world, example A isn’t as uncommon as many might suppose. I’m hoping that the upcoming release of Ruminate, with vastly improved HTTP parsing capabilities, will prove useful to some in operational environments. I feel it important to reiterate that Ruminate is a research oriented tool--it’s somewhere between experimental and proof of concept. The last thing I want is for Ruminate to be used in a manner that misleads someone into a false sense of security. It should go without saying, but only those who are willing to accept any limitations (presumably without knowing all of them) or are willing to do adequate vetting themselves should rely on Ruminate in production environments. That being said, I’ve been pleasantly surprised with what I’ve been able to do with Ruminate so far.&lt;br /&gt;&lt;br /&gt;In the next couple weeks I’m going to work on refining, packaging, and documenting Ruminate so it will be easier for those who want to try to play with it. 
I hope to have this done around the end of the month.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-6651048799254428804?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/6651048799254428804/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2011/01/gnawing-on-http-206-fragmented-payloads.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/6651048799254428804'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/6651048799254428804'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2011/01/gnawing-on-http-206-fragmented-payloads.html' title='Gnawing on HTTP 206 Fragmented Payloads with Ruminate'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-8342082616478564649</id><published>2011-01-01T06:42:00.000-08:00</published><updated>2011-01-01T10:56:45.853-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='opinion'/><category scheme='http://www.blogger.com/atom/ns#' term='humor'/><title type='text'>5 Saddest Conspiracy Theories of 2010</title><content type='html'>Is it not obligatory for bloggers to make some sort of list at New Year? Well here is mine. I’m posting what I call the saddest conspiracy theories of 2010. 
These are all events that are clouded by secrecy and/or controversy, implying some amount of foul play or reckless incompetence. While all are somehow related to security or technology, some are on the periphery of the topics normally discussed in this blog. I’ll only give sensational one-sided coverage for these conspiracy theories. While I won’t even try to argue the “truth” of any of these, what makes them sad is that the level of plausibility is much higher than zero.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;1. Another US Gov Sponsored Backdoor&lt;/h3&gt;&lt;br /&gt;The FBI has been &lt;a href="http://marc.info/?l=openbsd-tech&amp;m=129236621626462&amp;w=2"&gt;accused&lt;/a&gt; of trying to put backdoors into the IPSEC implementation of OpenBSD. It appears, at least to the founder and leader of OpenBSD, that the &lt;a href="http://www.informationweek.com/news/security/vulnerabilities/showArticle.jhtml?articleID=228900037"&gt;FBI did contract people to modify OpenBSD&lt;/a&gt; for the purpose of introducing bugs. However, it’s unclear if the intended audience for these bugs was the whole world (unlikely), organizations with &lt;a href="http://mickey.lucifier.net/b4ckd00r.html"&gt;specific hardware&lt;/a&gt;, or just an &lt;a href="https://twitter.com/ejhilbert/status/14891845825863680"&gt;internal experiment&lt;/a&gt;. I’d be receptive to the experiment explanation if it was done openly (like my dabbling in breaking &lt;a href="http://mason.gmu.edu/~csmutz/re/"&gt;forward secrecy through OS level random escrow&lt;/a&gt;) or if it never touched the internet. The commits to a public project are kind of scary. The jury is still out on this one. However, if this turns out anything like the alleged &lt;a href="http://www.schneier.com/blog/archives/2007/11/the_strange_sto.html"&gt;NSA backdoor in the Windows PRNG&lt;/a&gt;, we won’t hear anything much more conclusive on this. 
The sad part is the community isn’t wondering if the three letter agencies are trustworthy participants in the design and implementation of crypto. The answer is clear: No. The real question is how many more of these are lingering both in open and closed source software.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;2. Security Theater Turns Peep Show&lt;/h3&gt;&lt;br /&gt;Yes, I had to include it. The security theater that is TSA screening at airports was bad enough in the past. It has provided basically no improvement in security, has &lt;a href="http://www.schneier.com/blog/archives/2010/11/causing_terror.html"&gt;amplified the effects of terrorism&lt;/a&gt;, and has been an unjustified encroachment on civil liberties. This year sees the widespread deployment of X-ray backscatter machines, also known as full body scanners. The &lt;a href="http://www.schneier.com/blog/archives/2010/11/tsa_backscatter.html"&gt;public backlash&lt;/a&gt; is heating up. While there’s plenty of controversy, and probably not a lot of conspiracy, the current state of airport security is just plain sad. Let’s hope we can find a way to apply the same logic and tactics which are being used so effectively for “real world” security to the field of cyber security.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;3. Big Brother Breathes New Life Into Wiretapping Laws&lt;/h3&gt;&lt;br /&gt;Up until a few years ago, most people thought wiretapping laws were in place to prevent people from being covertly spied on by others, especially police and spooks that are wont to do things like &lt;a href="http://en.wikipedia.org/wiki/NSA_warrantless_surveillance_controversy"&gt;warrantless wiretapping&lt;/a&gt;. Those of us who questioned the purpose of these wiretapping laws (or the constitution for that matter) back in the 2007-2009 time frame now have some consolation. 
In 2010, it has become &lt;a href="http://www.techdirt.com/articles/20100603/0859019675.shtml"&gt;common practice&lt;/a&gt; for police to use local and state wiretapping laws to retaliate against people who try to hold them accountable through recording of police in public settings. With a little luck and even more creative interpretation of laws, even the federal wiretapping laws may be useful in the future.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;4. Traditional Journalism: Too Big to Fail&lt;/h3&gt;&lt;br /&gt;While I don’t want to delve into the whole Wikileaks affair, one thing I’ve seen coming out of it is a lot of criticism of Wikileaks. Most of the criticism from the media seems rooted more in a desire to maintain their traditional role in filtering, pushing, and disseminating news than in ensuring important news is uncovered and the public is informed. For example, when Floyd Abrams discusses &lt;a href="http://online.wsj.com/article/SB10001424052970204527804576044020396601528.html"&gt;Why WikiLeaks Is Unlike the Pentagon Papers&lt;/a&gt; he focuses more on the narrow topic of why Wikileaks is a threat to traditional journalism instead of more fundamental topics like freedom of press or government accountability. To me it seems that the very wiki model is being attacked, not because it’s inherently wrong, but because it continues to marginalize the role of established information channels. The writing is on the wall that traditional news “sources” are an endangered species so they’re in survival mode. It seems that they are often more worried about fighting turf wars and ingratiating themselves with The Man than serving their more fundamental role of public watchdog. It really doesn’t matter where you fall on the professional vs. 
crowdsource information flow argument; when media is more worried about &lt;a href="http://blog.heritage.org/2010/06/07/the-ftc-confuses-newspapers-with-journalism-as-it-seeks-new-media-tax/"&gt;getting and maintaining government support&lt;/a&gt; than fulfilling their core mission, we ought to be scared. Don’t worry though, the next iteration of Wikileaks, OpenLeaks, is going to put the traditional media folk &lt;a href="http://en.wikipedia.org/wiki/OpenLeaks#Conduit"&gt;back into the loop&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;5. US-China Diplomacy vis-à-vis Intellectual Property&lt;/h3&gt;&lt;br /&gt;So of all the conspiracy theories, this is the 800 pound panda. While many are still waking up to it, the ever widening scope of cyber espionage being conducted by targeted, persistent attackers is alarming. Many open sources, including &lt;a href="http://googleblog.blogspot.com/2010/01/new-approach-to-china.html"&gt;Google&lt;/a&gt;, attribute these attacks to actors in China, with largely unsupported and varying claims about the level of the Chinese Government’s involvement. The US should be pursuing diplomatic solutions to this problem, the economic portion of which has been aptly seen &lt;a href="http://www.newyorker.com/reporting/2010/11/01/101101fa_fact_hersh?currentPage=all"&gt;“as a trade issue that we have not dealt with.”&lt;/a&gt; So Hillary Clinton says with &lt;a href="http://www.state.gov/secretary/rm/2010/01/135105.htm"&gt;big words&lt;/a&gt; that China should investigate and the American people will be updated as the “facts become clear”. What have we heard so far on the cyber espionage front? Not much. That’s OK though because the US has been very active this year in other tough diplomatic discussions with China. For example, Attorney General Holder visited China &lt;a href="http://www.itworld.com/legal/124630/us-working-china-intellectual-property-rights"&gt;late this year&lt;/a&gt; to discuss intellectual property rights. 
Apparently, China promised to crack down on illegal distribution of music, movies, and software.&lt;br /&gt;&lt;br /&gt;What a big win. First of all, we wouldn’t want to go lax on software piracy enforcement, especially not in light of recent extensive abuse by oppressive regimes. The problem is so bad that Microsoft, one of the most draconian companies when it comes to software piracy and one of the most permissive when it comes to “local” law (like search result filtering), recently extended free licenses to the type of organizations where unequal software piracy enforcement is used as a pretext for &lt;a href="http://www.nytimes.com/2010/10/17/world/17russia.html?_r=1"&gt;oppressing dissidents&lt;/a&gt;. I can definitely see how the relatively extreme punishments imposed on the relatively few people actually caught pirating music and videos in the US would fit well with the Chinese model of law enforcement. Not only that, but this could help fill in some of the pretext for abuse taken away by liberal software licensing. Best yet, continued discussions like this could lay the groundwork for expansion of intellectual property protection that even other western countries refuse to get caught up in. For example, wouldn’t it be great if software patents, one of the US’s greatest forms of meta-innovation of late, were enforced with the same vigor and uniformity in China as they are in the US?&lt;br /&gt;&lt;br /&gt;Whether you feel like getting out your tinfoil hat or your tissue to catch your tears, I hope these critical reflections on 2010 have been amusing, even comical. 
Let’s all hope for better in 2011.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-8342082616478564649?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/8342082616478564649/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2011/01/5-saddest-conspiracy-theories-of-2010.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/8342082616478564649'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/8342082616478564649'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2011/01/5-saddest-conspiracy-theories-of-2010.html' title='5 Saddest Conspiracy Theories of 2010'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-2430422941945357505</id><published>2010-12-17T14:26:00.000-08:00</published><updated>2011-06-22T05:31:27.772-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><category scheme='http://www.blogger.com/atom/ns#' term='near real-time IDS'/><category scheme='http://www.blogger.com/atom/ns#' term='ruminate'/><title type='text'>Announcing Ruminate IDS</title><content type='html'>I’m pleased to announce that &lt;a href="http://www.ruminate-ids.org/"&gt;Ruminate IDS&lt;/a&gt;, a system I’m building in order to conduct my PhD research, has been released 
as open source. &lt;br /&gt;&lt;br /&gt;The goal of Ruminate is to demonstrate the feasibility and value of flexible and scalable analysis of objects transferred through the network, e.g. PDFs, SWFs, ZIPs, DOCs, XLSs, GIFs, etc. To the best of my knowledge, there is no other IDS out there that focuses heavily on or provides comprehensive facilities to do this today. Ruminate doesn’t do the stuff that contemporary NIDS do well, such as signature matching, individual packet analysis, port scan detection, etc. If you’re interested in learning about Ruminate, reading the &lt;a href="http://cs.gmu.edu/~tr-admin/papers/GMU-CS-TR-2010-20.pdf"&gt;technical report&lt;/a&gt; is the best place to start.&lt;br /&gt;&lt;br /&gt;The current implementation that is available for download is built largely to gather statistics useful for academic research. I’m hoping to release a version early in 2011 that will be more appropriate for people seeking to use it in operational environments. Regardless, I was somewhat surprised by the ability of Ruminate IDS as presently constituted to detect live attacks by highly targeted and sophisticated actors when used on a production campus network.&lt;br /&gt;&lt;br /&gt;Ruminate is a great example of the type of IDS that could be built on top of the utility provided by vortex. It would probably be fair to consider Ruminate a fabulous example (and facilitator) of &lt;a href="http://teddziuba.com/2010/10/taco-bell-programming.html"&gt;Taco Bell Programming&lt;/a&gt; with both the good and bad connotations. 
&lt;br /&gt;&lt;br /&gt;Despite the many imperfections and limitations, I hope Ruminate IDS may be of value to both academia and network defenders alike.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-2430422941945357505?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/2430422941945357505/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2010/12/announcing-ruminate-ids.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/2430422941945357505'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/2430422941945357505'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2010/12/announcing-ruminate-ids.html' title='Announcing Ruminate IDS'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-8313504981148130761</id><published>2010-12-11T19:32:00.000-08:00</published><updated>2011-06-22T05:31:50.587-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='opinion'/><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><category scheme='http://www.blogger.com/atom/ns#' term='machine learning'/><title type='text'>Machine Learning Disabilities in Incident Detection</title><content type='html'>&lt;h2&gt;Intro&lt;/h2&gt;&lt;br /&gt;I can’t count how many times I’ve 
seen machine learning supposedly applied to solve a problem in the realm of information security. In my estimation, the vast majority of these attempts are a waste of resources that never demonstrate any real-world value. It saddens me to consistently see lots of effort and brainpower wasted on a field that I believe has a lot of potential. I’d like to share my thoughts on how machine learning can be effectively applied to incident detection. My focus is to address this topic in a manner and forum that is accessible to people in industry, especially those who fund, lead, or execute cyber security R&amp;D. I hope some people in academia might find it useful, assuming they can stomach the information as presented here (including lack of academic formality and original empirical evidence). For what it’s worth, I consider myself to have a pretty good amount of real world experience in targeted attack detection and a fair amount of academic experience in machine learning.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;Definitions&lt;/h2&gt;&lt;br /&gt;Before I get too far, a few definitions are in order. Specifically, I need to clarify what I mean by “Machine Learning”. As used here, “Machine Learning” indicates the use of computer algorithms that provide categorization capabilities beyond simple signatures or thresholds and which implement generalized or fuzzy matching capabilities. Typically, the machine is trained with some examples of data of interest (usually attack and benign data) from which it learns through construction of a model that can be used to classify a larger corpus of observations (usually as attack or benign) even when the larger corpus contains observations that don’t exactly match the observations in the training data. &lt;br /&gt;With “Incident Detection”, I’m trying to be a little more broad than the classic definition of Intrusion Detection or NIDS by adding in connotations relative to Incident Response. 
I almost used CND, but that isn’t quite right because CND is a very broad topic. “Using Machine Learning for CNA Detection” would be an accurate alternate title. While I’ll be using NIDS heavily in my examples, note that for me NIDS isn’t merely about detecting malicious activity on the network, it’s also about detecting and providing forensics capabilities to analyze otherwise benign attack activity performed by targeted, persistent attackers (or in other words supporting &lt;a href="http://papers.rohanamin.com/?p=15"&gt;cyber kill chain analysis&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;References&lt;/h2&gt;&lt;br /&gt;During this short essay, I’ll reference two academic papers. The first is the PhD thesis of my friend, mentor, and former boss: Rohan Amin. His thesis, &lt;a href="http://papers.rohanamin.com/wp-content/uploads/papers.rohanamin.com/2010/11/Amin2011-dissertation.pdf"&gt;Detecting Targeted Malicious Email through Supervised Classification of Persistent Threat and Recipient Oriented Features&lt;/a&gt;, is the best example of the useful application of machine learning to the problem of incident detection I’ve ever seen. I’ve conversed with Rohan on his research from start to finish and have largely been waiting to write this essay until he finished his thesis so I would have a positive example to talk about. His research is refreshing: from choosing one of the most pressing security problems of the APT age to making brilliant technical contributions. If rated against the recommendations I will make herein, Rohan’s paper scores very high.&lt;br /&gt;&lt;br /&gt;My second reference is &lt;a href="http://www.icir.org/robin/papers/oakland10-ml.pdf"&gt;Outside the Closed World: On Using Machine Learning For Network Intrusion Detection&lt;/a&gt; which was presented at &lt;a href="http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5504793"&gt;IEEE S+P 2010&lt;/a&gt;. 
Robin Sommer and Vern Paxson are academic researchers with some serious credentials in the field of NIDS. They are probably best known in industry for their contributions to &lt;a href="http://bro-ids.org/"&gt;Bro IDS&lt;/a&gt;. Their paper is geared to academics but tries to encourage some amount of real world relevancy in research. It makes me laugh with cynicism sometimes at the political correctness and positive tone with which they make recommendations to researchers such as “Understand what the system is doing.” While I don’t agree with everything Sommer and Paxson say, they say a lot that is spot on, the paper is well written, it provides a good view into how academics think, and it even explicitly, albeit briefly, calls out the difference in approach required for opportunistic and targeted attacks.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;Solve a Problem (worth solving)&lt;/h2&gt;&lt;br /&gt;Sommer and Paxson said it so well:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;The intrusion detection community does not benefit any further from yet another study measuring the performance of some previously untried combination of a machine learning scheme with a particular feature set, applied to something like the DARPA dataset.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Amen. The Engineer in me and my personality scoffs at what I see as a too haphazard and inefficient process of invention which involves combining one of the set of machine learning techniques with one of the set of possible problems, often apparently pseudo-randomly, until a good fit is found through empirical evaluation. Sure, there are numerous examples of where this general approach has worked in the past. Ex. &lt;a href="http://en.wikipedia.org/wiki/Vulcanization#Goodyear.27s_contribution"&gt;Goodyear’s invention of sulfur vulcanization for rubber&lt;/a&gt; is often thought to have happened by luck. 
Certainly this methodology is at least compatible with Edison’s maxim of “Genius is 1 percent inspiration and 99 percent perspiration.” While systematically testing every permutation of machine learning algorithms, problems, and other options such as data sets and feature selections, is perfectly valid, I don’t like it. Most people investing in research probably shouldn’t either. One of the problems I see with this in the real world is that many people have what they think is a whiz bang machine learning algorithm, possibly even working well in a different domain. Since cyber security is a hot topic, people try to port the whiz bang mechanism to the problème du jour, e.g. cyber security. Often these efforts fail not because there isn’t some way in which the whiz bang mechanism could provide value in the cyber security realm, but because the whiz bang mechanism isn’t applied to a specific enough or relevant enough problem, poor data is used for evaluation, etc.&lt;br /&gt;&lt;br /&gt;One strong predictor of the relevancy of the research being conducted and the technology that will come from it is the relevancy of the data being evaluated. Could it be any more clear that if you are using data that is too old to reflect current conditions, you can have little confidence that your resulting technology will address today’s threats? Furthermore, if you are using synthetic data, you may be able to show empirically that your solution solves a possible problem under certain conditions, but you have no guarantee that the problem is a problem worth solving or that the conditions assumed will ever be reached in the real world. 
Sommer and Paxson largely trash any research that relies predominantly on the &lt;a href="http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/data/index.html"&gt;DARPA 1998-2000 Intrusion Detection Evaluation data sets&lt;/a&gt;, with which I passionately agree.&lt;br /&gt;&lt;br /&gt;While the relevancy of the data being evaluated is a pretty good litmus test for the relevancy of the technology coming from the research, I believe it’s much more fundamental than that. Below I present two models for R&amp;D. In the S-P-D process, novelty is ensured by taking a solution and using increasing innovation and discovery to find a problem and then a data set/features set for which the solution can be empirically shown to be valid. This correlates to the all too frequently played out example I alluded to above where a whiz bang machine learning algorithm is applied to a new domain such as cyber security. The researcher spends most of his time figuring out how to apply the solution to a problem including finding or creating data that shows how the solution solves a problem. Clearly, there is little guarantee for real world relevancy, but academic novelty is assured throughout the process. On the other hand, in the D-P-S process, relevancy is ensured because the data is drawn from real world observation. By evaluating data from real world events, a problem is discovered, described, and prioritized. Resources are dedicated to research, and a useful solution is sought. Academic novelty is not necessarily guaranteed, but relevancy is systemic. Rohan’s PhD research exemplifies the D-P-S process. Between 2003 and 2006 Targeted Malicious Email (TME) evolved as the principal attack vector for highly targeted sophisticated attacks. As the problem of APT attacks became more severe and more was learned about the attacks, TME detection was identified as a critical capability. 
Analysis of the data (real attacks) revealed consistent patterns between attacks that current security systems could not effectively detect. Rohan recognized the potential of machine learning to improve detection capabilities and did the hard work of refining and demonstrating his ideas.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_g1XmJJW8J_g/TQREXV16DZI/AAAAAAAAABQ/lMSCeEEHKtk/s1600/Data-Problem-Solution.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 115px;" src="http://2.bp.blogspot.com/_g1XmJJW8J_g/TQREXV16DZI/AAAAAAAAABQ/lMSCeEEHKtk/s320/Data-Problem-Solution.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5549635808618220946" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;While I’m normally not a fan of these sorts of models and diagrams, I want to make this point clear to the people funding cyber R&amp;D. If you want to improve the ROI of your cyber R&amp;D, make sure you are funding D-P-S projects, not S-P-D research. What does that mean for non-business types? The most important thing cyber security researchers need today is Data demonstrating real Problems. In the current climate, there is an overabundance of money being poured into cyber R&amp;D. I agree with the vast majority of the recommendations given by Sommer and Paxson regarding data, including the recommendation that NIDS researchers secure access to a large production network. Researchers should also understand the threat environment of that network. I will add that if individual organizations, industries, and governments want to advance current cyber security R&amp;D, the most important thing they can do is provide researchers access to the data demonstrating the biggest problems they are facing, including required context. 
For more coverage on the topic of sharing attack information with researchers, see my post on how &lt;a href="http://smusec.blogspot.com/2010/04/keeping-targeted-attacks-secret-kills-r.html"&gt;Keeping Targeted Attacks Secret Kills R&amp;D&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;On Problem Selection&lt;/h2&gt;&lt;br /&gt;In my very first blog post, I discussed &lt;a href="http://smusec.blogspot.com/2010/03/developing-relevant-information.html"&gt;Developing Relevant Information Security Systems&lt;/a&gt;. Some of the ideas presented there apply to the discussion at hand.&lt;br /&gt;&lt;br /&gt;Machine Learning as applied to intrusion detection is often considered synonymous with anomaly detection. Even Sommer and Paxson equate the two. Maybe this springs from the classic taxonomy of NIDS that branches at signature matching and anomaly detection. Personally, I question the value of this taxonomy. Certainly NIDS like Bro somewhat break this taxonomy, requiring it to be expanded to at least misuse detection or anomaly detection. Even that division isn’t fully comprehensive. Detecting activity from persistent malicious actors, even if that activity isn’t malicious per se, is an important task of NIDS also, but doesn’t fall cleanly under traditional definitions of either misuse detection or anomaly detection.&lt;br /&gt;&lt;br /&gt;Regardless of how you classify your NIDS, I don’t agree with equating machine learning and anomaly detection. Machine learning can be applied to misuse detection, can’t it? While Rohan’s PhD work isn’t fully integrated with any public NIDS, it very well could be. Similarly, anomaly detection systems as discussed in academia often use machine learning to create models for detection, but it’s equally possible for anomaly detection systems to use human expert created thresholds or models. 
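&lt;br /&gt;&lt;br /&gt;To make that last point concrete, here is a minimal Python sketch under loudly stated assumptions: the traffic numbers, the names (BASELINE_MB, EXPERT_THRESHOLD_MB, learned_threshold, is_anomalous), and the mean-plus-three-standard-deviations rule are all hypothetical and chosen only for illustration, not taken from any real system or from Sommer and Paxson. It shows the same anomaly check driven either by a limit an analyst picks or by a limit derived from baseline data.&lt;br /&gt;
&lt;pre&gt;
```python
from statistics import mean, stdev

# Hypothetical baseline: outbound megabytes per hour for one host on quiet days.
BASELINE_MB = [12, 15, 11, 14, 13, 16, 12, 15]

# Option 1: an analyst picks the limit from operational experience (assumed value).
EXPERT_THRESHOLD_MB = 50.0


def learned_threshold(samples, k=3.0):
    """Option 2: derive the limit from data as mean plus k standard deviations."""
    return mean(samples) + k * stdev(samples)


def is_anomalous(value, threshold):
    """Flag a value that exceeds the threshold, however the threshold was chosen."""
    return value > threshold


if __name__ == "__main__":
    learned = learned_threshold(BASELINE_MB)
    for observed in (14, 30, 80):
        print(observed,
              "expert:", is_anomalous(observed, EXPERT_THRESHOLD_MB),
              "learned:", is_anomalous(observed, learned))
```
&lt;/pre&gt;
Either way the detector is doing anomaly detection; only the second option involves anything resembling a learned model, and neither says whether the flagged activity is actually malicious.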
&lt;br /&gt;&lt;br /&gt;The biggest problem I have with equating machine learning with anomaly detection is that anomaly detection is largely a nebulous and silly problem. Equating the two trivializes machine learning. It’s pretty easy to identify statistically significant outliers in data sets. The problem is that the designation as anomalous is often rather arbitrary, with most researchers doing little to demonstrate the real world relevancy of any anomalous detections. Furthermore, for all but the most draconian of environments, anomaly detection is silly anyway. Anyone with any operational experience knows that the mind numbingly vast majority of “anomalous” activity is actually benign. Furthermore, highly targeted attacks quite often are, by design, made to blend in with “normal” activity.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;4 Principles&lt;/h2&gt;&lt;br /&gt;Most of the discussion heretofore has been targeted at people making high level decisions about R&amp;D. Now, I’ll provide some more concrete principles that can be applied by people actually implementing machine learning for Incident Detection. They are as follows:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt;Use Machine Learning for Complex Relationships&lt;br /&gt;&lt;li&gt;Serve the Analyst&lt;br /&gt;&lt;li&gt;Features are the Most Important&lt;br /&gt;&lt;li&gt;Use the Right Algorithm&lt;br /&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Use Machine Learning for Complex Relationships (with Many Variables)&lt;/h3&gt;&lt;br /&gt;When should you use Machine Learning instead of other traditional approaches such as signature matching or simple thresholds? When you have to combine many variables in a complex manner to provide reliable detections. Why?&lt;br /&gt;&lt;br /&gt;Traditional methods work very well for detection mechanisms based on a small number of features. 
For example, skilled analysts often combine two low fidelity conditions into one high fidelity condition using correlation engines or complex rule definitions. I’ve seen this done manually with three or more variables, but it gets really ugly really quickly as the number of variables increases, especially when each dimension is more complex than a simple binary division. &lt;br /&gt;&lt;br /&gt;On the other hand, machines, if properly designed, function very well with high dimensional models. Computers are adept at analyzing complex relationships in n-dimensional space. &lt;br /&gt;Why not use machine learning for low dimensional analysis? Because it’s usually an unnecessary complication. Furthermore, humans are usually more accurate than machines at dealing with the low dimensional case because they are able to add contextual knowledge often not directly derivable from a training set.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Serve the Analyst&lt;/h3&gt;&lt;br /&gt;Any advanced detection mechanism must serve the analyst. It will fail otherwise. By serving the analyst, I mean empowering and magnifying the efforts of the analyst. The human should ultimately be the master of the tool. To me it seems ridiculous, but there are actually people, including a lot of researchers, who believe (or purport to believe) that tools such as IDS should (and can) be made to house all the intelligence of the system and that the role of humans is merely to service and vet alerts. This is ridiculous. This is so backwards that I can’t even believe some people seriously believe this. It’s sad to see it play out in practice. Much as with airport security, which has gotten out of hand with increasingly intrusive screening that provides little to no value, I have to question the motives of the people pushing this mindset. Is it even possible for them to believe this is the right way to go? Are they just ignorant and reckless? Maybe it just comes down to greed or gross self-interest. 
Regardless of the reason, this mindset is broken. &lt;br /&gt;Toggling back to the positive side, machine learning has a great potential to empower analysis. Advanced data mining, including machine learning, should be used not only to aid the analyst in automating detections but also in understanding and visualizing previous attack data so that new detections can be created.&lt;br /&gt;It is vital that the analyst understand how any machine learning mechanisms work under the hood. For example, an expert should understand and review the models generated by the machine so that the expert can provide a sanity check and so that the human can understand the significance of the patterns the machine identifies. One of the coolest parts of Rohan’s PhD thesis is that he uncovered many pertinent patterns in the data, such as the most targeted job classes. In addition, as the accuracy of the classifier begins to wane over time, it is the expert analyst who will be able to recommend the appropriate changes to the system, such as new features to be included in analysis.&lt;br /&gt;Part of empowering the analyst is giving the analyst the data needed to understand any alerts or detections. Any alert should be accompanied with a method of determining what activity triggered the alert and why the activity is thought to be malicious. Many machine learning mechanisms fail because they don’t do this well. They will tell an operator that they think something may be bad, but can’t or won’t tell the operator why, let alone provide sufficient context, making the operator’s job of vetting the alert that much harder. Incidentally, if the machine learning based detection mechanism provides adequate context, it lowers the cost and pain of validating false positives, lessening their adverse impact on operations.&lt;br /&gt;&lt;br /&gt;For an advanced detection mechanism to succeed in an operational environment, it must be made with the goal of serving the expert analyst. 
I believe much of the “Semantic Gap” described by Sommer and Paxson arises from ignoring this principle.&lt;br /&gt; &lt;br /&gt;&lt;h3&gt;Features are the Most Important&lt;/h3&gt;&lt;br /&gt;The most important thing to consider when applying machine learning to computer security is feature selection. Remember the 2007 financial system meltdown? The author of much of the software that “facilitated” the meltdown wrote an &lt;a href="http://nymag.com/news/business/55687/"&gt;article&lt;/a&gt; describing his work and how it was abused by reckless investment banks. Glossing over the details (which are very different), the high level misuse case is often the same as cases of abuse of machine learning: People hope that by putting low value meat scraps into some abstract and complicated meat grinder of a machine they get some output that is better than the ingredients put in. It’s a very appealing idea. If one can turn things you don’t want to look at into hot dogs or sausage by running them through a meat grinder, why can’t we turn them into steak with a really big, complex meat grinder? Machine learning mechanisms can be very good at targeting specific and complex patterns in data, but at the end of the day, &lt;a href="http://en.wikipedia.org/wiki/Garbage_In,_Garbage_Out"&gt;GIGO&lt;/a&gt; still applies.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Expressiveness of Features&lt;/h4&gt;&lt;br /&gt;The most important part of using machine learning for IDS is to ensure that the machine is trained with features that expose attributes that are useful for discriminating individual observations. A classic example from the world of NIDS is the inadequacy of network monitoring tools that operate at layer 3 or layer 4 to detect layer 7 (or deeper) attacks. When I get on the network payload analysis soapbox (which I often do) one of my favorite examples is as follows:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Imagine you have an open email relay that sends your organization two emails. 
Both are about the same size, both contain an attachment of the same type, and both contain content relevant to your organization. One is a highly targeted malicious email, the other is benign.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Can you discriminate between the two based on netflow? Not a chance. There is nothing about the layer 3 or layer 4 data that is malicious. Remember, the malicious content is the attachment, not anything done at the network layer by the unwitting relay. It doesn’t matter how many features you extract from netflow or how much you process it, you’re not going to be able to make a meaningful and reliable differentiation.&lt;br /&gt;&lt;br /&gt;It’s crucial when using machine learning as a detection mechanism that you have some level of confidence that the features can actually be used to draw meaningful conclusions. The straightforward way to do this is to have analysts identify low fidelity indicators that, when combined in complex ways, will yield meaningful results. Sure, some data mining may be involved here, and the process may be iterative, but you’ve got to have expressive and meaningful features. In my estimation, the biggest contribution Rohan makes with his study is demonstrating the value of features that most other mechanisms ignore (and incidentally, are harder for attackers to change).&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Disparate Data Sources as a Red Herring&lt;/h4&gt;&lt;br /&gt;One claim made in support of machine learning is that with machine learning, you can correlate disparate data sources. This is really a red herring. You don’t necessarily need machine learning to do this. I’ve seen traditional SIMs, processing a wide variety of data feeds, used to make really impressive detections based on analyst crafted rules that aren’t particularly complex, in and of themselves, but which require a lot of work and technological horsepower behind the scenes because they leverage data from multiple sources. 
Sure, machine learning facilitates use of complex relationships in data, but those relationships don’t necessarily have to be from disparate data sources.&lt;br /&gt;&lt;br /&gt;That being said, machine learning can be wildly successful at leveraging complex relationships within disparate data sources. Rohan’s PhD work demonstrates this fabulously. One temptation, however, is to try to unnaturally “enrich” data, often consisting of inadequate features to begin with, by joining yet other features. The hope is to improve the quality of the models generated. This is all well and good if the data joined provides some utility in classification. Also, for most machine learning techniques, if all the classes in the training data set are adequately represented and the training set has adequate entropy, no serious harm can be done by joining features with no value in improving classification. However, if some classes are under-represented (as is often the case with the “bad” examples) or if the training data doesn’t have adequate entropy (as is often the case with artificial data), “enriching” data with other data sources can incorrectly improve measures of statistical significance and performance of the machine learner in a way that wouldn’t apply to real world data. Returning to our example of the email which can’t be detected with netflow data, let’s assume the benign email is sent by the relay with an ephemeral source port of 36865 and the malicious email is sent with a source port of 36866. Now let’s say that the researcher wants to “enrich” his data by adding all sorts of lookups based on the layer 3 and layer 4 parameters such as geoip lookups, etc. If the researcher joins IANA assigned port numbers into the mix, the machine’s model will discover that the benign email was sent with a source port of “kastenxpipe” and the malicious email has a source port of “unassigned”. 
The spurious conclusion is clear: malicious emails sent through ignorant relays originate from “unassigned” source ports. This example is contrived, but this sort of thing actually occurs.&lt;br /&gt;&lt;br /&gt;By far the most important thing to get right when applying machine learning to the field of incident detection is operating on meaningful features. &lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Use the Right Algorithm (but don’t fret about it)&lt;/h3&gt;&lt;br /&gt;One aspect of applying machine learning to incident detection is choosing the right algorithm. This is also the one aspect that is usually belabored the most in academia, especially in research that is farthest from being applicable to real world problems. There are a lot of religious battles that go on in this realm also. However, very little of this provides real world value.&lt;br /&gt;&lt;br /&gt;My suggestion is to choose the algorithm or one of the set of algorithms that makes sense for your data and how your system is going to operate. Don’t fret too much about it. I think of this selection much like choosing a cryptographic algorithm. The primary factor in doing this is choosing the type of cryptographic function: hash, digital signature, block cipher, complete secure channel, etc. To a large degree, it probably doesn’t matter if you choose SSL, SSH, or IPsec for use as a secure channel. Sure, there may be some small factors or even external factors that make one slightly more desirable, but at the end of the day, any from the palette of choices will likely provide you an adequately secure channel, all other things being equal.&lt;br /&gt;&lt;br /&gt;Also, similar to making choices for crypto systems, you should avoid inventing or rolling your own unless you have a compelling reason to do so and you know what you are doing. All too often, I see exotic and home-grown machine learning techniques applied to information security. 
Often I see ROC charts, figures on performance, and other convoluted diagrams justifying these sorts of things. Just like with crypto, I think it’s appropriate to hold researchers to a high burden of proof to demonstrate the real world benefit of any “bleeding edge” machine learning mechanisms being applied to incident detection.&lt;br /&gt;&lt;br /&gt;Again, Rohan’s PhD work is exemplary of the principles I’m trying to express. He chose a machine learning mechanism that fit his data and use cases well. While he did spend a fair amount of time and effort trying to tweak the classifier (see cost sensitive stuff), this had marginal benefit. He provides few suggestions for future work in improving the machine learning mechanisms. However, he recommends, and I agree with his recommendation, that the overall system could be improved by exposing more relevant features (such as file attachment metadata) and tightening outcome classes by separating the “bad” classification into multiple groupings based on similarity of attacks.&lt;br /&gt;&lt;br /&gt;With that high level principle out of the way, I’ll say a little about specific classes of mechanisms or specific algorithms. In doing so I’ll express a few biases and religious beliefs that aren’t backed with the same level of objectivity contained in the rest of this essay.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Random Forests&lt;/h4&gt;&lt;br /&gt;I love Random Forests. Lots of other people do too. Random Forests works well with numerical data as well as other data types like categorical data. While Random Forests may not be the simplest example, tree based classification mechanisms are very easy to understand and, once a classifier is trained, insanely efficient at classifying new observations. The algorithm takes care of identifying variable importance and tuning the classifier accordingly. Many other mechanisms can do only part of this, require a large amount of manual tuning, require manual data normalization, etc. 
Random Forests is easy and works very well in many situations.&lt;br /&gt;&lt;br /&gt;&lt;h4&gt;Text Based Mechanisms&lt;/h4&gt;&lt;br /&gt;Text based mechanisms are all the rage. They are awesome for helping make sense of human to human communication. For example, Bayesian algorithms used in SPAM filtering mechanisms are actually rather effective at identifying and filtering high fidelity SPAM based on the text intended for human consumption. Document clustering mechanisms are very effective at weeding through large corpuses of documents, identifying those about similar topics. There is a huge amount of contemporary research on and new whiz bang mechanisms related to text mining, natural language processing, etc.&lt;br /&gt;&lt;br /&gt;For the part of information assurance that requires operating on human to human communication, text based machine learning mechanisms hold high potential. However, most communication of interest in incident detection isn’t human to human, but is computer to computer. A large portion of computer to computer communication is done through exchange of numerical data. It is therefore somewhat humorous to see researchers attempt to apply text classification mechanisms to predominantly numerical data, such as network sensor data. While there may be legitimate reasons to do this, I view these efforts with the same cynical doubts concerning longevity with which I regard efforts to vectorize logical problems into problems suitable for floating point operations so GPUs can be leveraged.   &lt;br /&gt; &lt;br /&gt;&lt;h4&gt;R: Freedom in Stats and Machine Learning&lt;/h4&gt;&lt;br /&gt;One tool that I have to give a quick shout out to is &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;. Many people call R the free version of S (S is a popular stats tool), just like people say Linux is the free version of Unix. It’s a pretty close analogy. R is not only free as in beer, but is very free as in speech. 
There’s a huge and growing community supporting it. People who like Linux, Perl, and the CLI will love R. One thing I like about R is that everything you do is done via commands. Those commands are stored in a history, just like bash. If you want to automate something you’ve done manually, all you do is turn your R history into an R script. It’s easy to process stats, create graphs, or run machine learning algorithms without ever touching a GUI. It is much like LaTeX in that it has a steep learning curve, but people who master it are usually happy with the things they can do with it.&lt;br /&gt;&lt;br /&gt;&lt;h2&gt;Conclusion&lt;/h2&gt;&lt;br /&gt;I hope that in the future there will be a greater measure of success in applying machine learning to incident detection. I hope those funding and directing research will help ensure a greater measure of relevancy by providing researchers with the data and problems necessary to conduct relevant research. I also hope that the principles I’ve laid out will be useful for people other than myself in helping to guide research in the future.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-8313504981148130761?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/8313504981148130761/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2010/12/machine-learning-disabilities-in.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/8313504981148130761'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/8313504981148130761'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2010/12/machine-learning-disabilities-in.html' 
title='Machine Learning Disabilities in Incident Detection'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_g1XmJJW8J_g/TQREXV16DZI/AAAAAAAAABQ/lMSCeEEHKtk/s72-c/Data-Problem-Solution.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-3379703482135896953</id><published>2010-10-14T18:55:00.000-07:00</published><updated>2010-10-14T19:13:12.435-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='idle'/><category scheme='http://www.blogger.com/atom/ns#' term='supercomputer'/><title type='text'>Touching Number 1</title><content type='html'>While visiting Oak Ridge, I was given the opportunity to not only see, but also touch the &lt;a href="http://www.top500.org/system/10184"&gt;#1 ranked supercomputer&lt;/a&gt; in the world. 
This blackberry snapped photo shows just enough detail to capture the nerdy nirvana of the event:&lt;div&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_g1XmJJW8J_g/TLe11gjyxtI/AAAAAAAAABI/59CNCLyv2XI/s1600/Jaguar.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 240px;" src="http://2.bp.blogspot.com/_g1XmJJW8J_g/TLe11gjyxtI/AAAAAAAAABI/59CNCLyv2XI/s320/Jaguar.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5528086998498330322" /&gt;&lt;/a&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Amazing!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-3379703482135896953?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/3379703482135896953/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2010/10/touching-number-1.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/3379703482135896953'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/3379703482135896953'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2010/10/touching-number-1.html' title='Touching Number 1'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' 
url='http://2.bp.blogspot.com/_g1XmJJW8J_g/TLe11gjyxtI/AAAAAAAAABI/59CNCLyv2XI/s72-c/Jaguar.jpg' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-4677140445182926947</id><published>2010-09-18T18:35:00.001-07:00</published><updated>2010-09-18T19:04:45.039-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='opinion'/><category scheme='http://www.blogger.com/atom/ns#' term='cyberwar'/><category scheme='http://www.blogger.com/atom/ns#' term='apt'/><title type='text'>Are Targeted Attacks on Industry Cyberwar?</title><content type='html'>I’m writing this post to try to enter the conversation on cyberwar, etc. My motivation in doing so is not only to share my opinions on the topic, but also to add my witness to the few others out there which testify that targeted attacks pose a real and extant threat to our long term national prosperity.&lt;br /&gt;&lt;br /&gt;Before I start, I need to clarify my viewpoint. I’m a technical person. I do technical work--like programming computers. I don’t have any political, social, or economic influence. I do have a lot of operational experience doing incident response, especially against highly sophisticated attacks. However, since my current and past employers and universities don’t allow me to speak about specifics of attacks, I can only cite general observations and trends. I stand to gain very little from the comments I’ll be making. My primary goal is to help shape public opinion.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Targeted, Persistent Attacks&lt;/h3&gt;&lt;br /&gt;Throughout this article, I’ll be speaking about highly targeted, persistent attacks perpetrated by well organized attack groups for the apparent purpose of stealing sensitive information including trade secrets. Many people use the term Advanced Persistent Threat (APT) to describe this category of attackers. 
Some people use it to describe some specific subset (which they often imply isn’t a strict subset) of this attack class, and as such, use it as a proper noun. Even though many imply some coherent rationale for their grouping, they usually won’t elucidate in public. I tend to use terms like targeted attacks and persistent attackers to ensure people understand I’m talking about the general attack class. That being said, the vast majority of what has been said by people in the know about APT applies to what I’ll be saying, regardless of whether you consider APT a general attack class or specific attack group. Just to be explicit, examples of APT discussions that I believe to be on the mark are those by &lt;a href="http://blogs.sans.org/computer-forensics/2010/06/21/security-intelligence-knowing-enemy/"&gt;Mike Cloppert&lt;/a&gt; and &lt;a href="http://taosecurity.blogspot.com/2010/07/my-article-on-advanced-persistent.html"&gt;Richard Bejtlich&lt;/a&gt;. On the other hand, examples of wantonly ignorant discussions about APT include those by &lt;a href="http://www.computerworld.com/s/article/9174484/McAfee_Amateur_malware_not_used_in_Google_attacks"&gt;McAfee and Damballa&lt;/a&gt;. One quick litmus test is that if someone supposedly discussing APT closely relates the activity to botnets, identity theft, or insider threat, they’re not talking about the same thing I am.&lt;br /&gt;&lt;br /&gt;Most of my discussion will focus on highly targeted attacks for the purpose of compromising sensitive information, especially against industry. I’ll intentionally avoid speculating on important issues such as the ability of terrorists to use vulnerable computer systems to cause mass disruption and destruction. The one thing I will say is that there are a lot of projections about how information systems could be exploited for malicious intent. Many of these are still hypothetical. 
APT attacks are real today and are becoming more prevalent as time passes.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Attacks on Industry&lt;/h3&gt;&lt;br /&gt;One of the most disturbing aspects of highly targeted and persistent attacks is that these attacks are becoming more common against private industry. Governments have always had to worry about spies breaking into their systems, and have supposedly been developing systems to counter APT level threats for some time. Private industry isn’t used to having to defend against APT class attacks. Companies like Google are being caught off guard. These highly targeted attacks are resulting in information being compromised that normally isn’t--things like trade secrets and proprietary information. This is really scary. The perpetrators aren’t going after credit cards or SSNs, they’re going after trade secrets. Many people consider this sort of information one of the most valuable classes of assets in the US economy. The use of this information by competitors represents a serious threat to the long term prosperity of any information based company, and by extension, the competitiveness of the US economy. This is really scary. Even the military types recognize the risk. I think it demonstrates some serious means/ends inversion, but when military types start talking about threats to US prosperity inhibiting our ability to conduct war, we ought to listen. We need to remember that self defense is merely a means to an end of freedom, peace, and prosperity. Highly targeted attacks don’t just endanger short term national security; they are a serious threat to the US’s long term peace and prosperity. Throughout this post, I’m going to be focusing primarily on attacks against industry.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Cyberwar?&lt;/h3&gt;&lt;br /&gt;Are targeted, persistent attackers waging cyberwar? This is a hard question. 
First, modern society has confounded the meaning of war, using it for things like “Cold War”, “War on Terror”, and even “War on Christmas”. It’s hard to clearly define what warfare is.&lt;br /&gt;&lt;br /&gt;Clearly, cyber- (e.g. something related to computers or networking) is used pervasively in modern warfare. Militaries have driven many of the developments in technology and communication that are now used by civilians. The military uses computers, networks, and robots extensively to conduct warfare. While using cyber- in this context probably lines up with other prefixes such as modern- (e.g. using gunpowder) and chemical-, this doesn’t comprise all of what most people mean when they say cyberwar, including the US military.&lt;br /&gt;&lt;br /&gt;The US military has applied a much broader meaning to &lt;a href="http://en.wikipedia.org/wiki/Cyberwarfare"&gt;cyberwar&lt;/a&gt;: defining it as a battle space or domain much like land, air and sea. I’m not sure I fully agree with the rationale behind this definition, but it’s theirs to make. However, using this definition, targeted, persistent attacks with the apparent goal of collecting sensitive information don’t line up with cyberwar, because no disruption occurs. Using US government parlance, this activity is probably better categorized as cyber-espionage.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Cyber-Espionage?&lt;/h3&gt;&lt;br /&gt;If persistent, targeted attacks seeking sensitive information aren’t classed as warfare, maybe they are appropriately classed as cyber-espionage. Recently, Gen. Michael Hayden spoke at Black Hat on this very subject. What &lt;a href="http://www.darkreading.com/security/cybercrime/showArticle.jhtml?articleID=226400063"&gt;he said&lt;/a&gt; seems to be basically in line with the rest of what the US government has said on these topics. His basic assertion was that intelligence gathering isn’t cyberwar. 
He basically said that attacks targeting sensitive information like what I’ve been speaking of are just part of business as usual, at least for cyber-spies. He expounded the partitioning of the cyber domain into 3 sub-domains: CND (defense--stopping the other two), CNE (exploitation--for espionage), and CNA (attack--for disruption or destruction). A lot of what he said makes sense, as he dispels a lot of FUD. At the very least, most of what he said is technically correct.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Information as the End&lt;/h3&gt;&lt;br /&gt;A couple months ago, I would have agreed with this categorization of APT attacks as cyber-espionage. Then I listened to this &lt;a href="http://www.visiblerisk.com/podcast/"&gt;podcast&lt;/a&gt;. Something Rob Lee said struck a chord with me. He said, in short, that information is an asset over which modern wars are being fought, much like the riches of land or gold in previous centuries. I’d never thought of information as the end of warfare, simply as the means. I think this way of looking at targeted attacks warrants more discussion. What if cyberwar isn’t just about aggressors using IT as a means to conduct warfare? What if the purpose of cyberwar is to wrest highly valuable information away from the enemy, just like land or gold in traditional warfare? This isn’t information warfare, because the information targeted is not necessarily about warfare. Attacks targeting industry trade secrets aren’t espionage by most people’s definition because the secrets being taken aren’t military or political in nature--they are largely economic. This is essentially economic espionage.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Cyber-Piracy?&lt;/h3&gt;&lt;br /&gt;It’s a shame that people in industry have used the term piracy for actions that are more comparable to petty theft. If it weren’t already taken, cyber-piracy would be a good way to describe the theft of sensitive information of economic value using military-like force. 
That’s really what’s happening to industry now. Persistent attackers are forcibly stealing highly valuable trade secrets. One reason I’d compare this to naval piracy is that it must be perpetrated by a military-like force and is usually best answered with military or para-military force. I can visualize trade secrets being exfiltrated by hackers as gold or other goods being carried off by pirates in ships. The value of the data lost due to targeted attacks is immensely high, but it is not normally discussed and is easy to conceal. Regardless, if the value of the data stolen from private industry through targeted attacks were known, it would probably be considered a justifiable reason to wage a war against the perpetrators.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;On Attribution&lt;/h3&gt;&lt;br /&gt;One thing that many people seem to get preoccupied with is the issue of attribution for highly targeted attacks. Many facets of these attacks make it very unlikely that the attacks are perpetrated merely by organized crime without some level of support or tolerance by national governments. For example, highly persistent attackers usually target information that is not highly liquid and as such could only be of value to a small set of possible markets. Are these attacks directly sponsored, indirectly guided, or loosely condoned by foreign nations? Most of us will never know that answer. For most people, it really doesn’t matter. The actions that should be taken to solve the targeted attack problem don’t change that much regardless of how much foreign government support is behind these attacks. Lay people should be pushing for diplomatic, legal, and possibly military pressure to stop them.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;China&lt;/h3&gt;&lt;br /&gt;Numerous open sources have implicated China in targeted attacks. 
My favorites include the &lt;a href="http://www.uscc.gov/researchpapers/2009/NorthropGrumman_PRC_Cyber_Paper_FINAL_Approved%20Report_16Oct2009.pdf"&gt;NG report on PRC cyber-warfare and CNE&lt;/a&gt; and &lt;a href="http://www.nartv.org/mirror/shadows-in-the-cloud.pdf"&gt;Shadows in the Cloud&lt;/a&gt;. The attacks on Google earlier this year and the &lt;a href="http://googleblog.blogspot.com/2010/01/new-approach-to-china.html"&gt;subsequent response by Google&lt;/a&gt; are probably the best known public example. The most compelling evidence of Chinese involvement is that Chinese human rights activists were targeted by these attacks. It is hard to imagine anyone other than a supporter of the Chinese government having adequate motivation to conduct this sort of attack. Of course, this doesn’t mean that the attacks are perpetrated by agents of the Chinese government. Indeed, the Chinese government often claims that it is itself a victim of hacking. Clearly the Chinese government has other high priority issues to address, such as ensuring that the constitutionally granted right to free speech is protected.&lt;br /&gt;&lt;br /&gt;That being said, I think the focus on China is a little myopic. I find it hard to believe that all targeted attacks on industry are from one source. Even if they are, how long will it stay that way?&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;It Takes Two to Fight&lt;/h3&gt;&lt;br /&gt;As mentioned previously, the extent of the damage caused by targeted persistent attacks is probably great enough to justify a war. If there’s one element missing from cyberwar, it’s our response. I’ve heard the terms cyber-Pearl Harbor and cyber-9/11 bandied about, but up to this point, there has not been a single decisive attack and associated response that even comes close to earning these titles. I doubt such an event will ever occur in connection with targeted attacks on industry. Sure, terrorists and the like may well perpetrate an event that might earn an appellation of cyber-9/11. 
Terrorists intentionally perpetrate highly visible and dramatic attacks, but APT attacks are exactly the opposite: they are stealthy and deceptively mundane in methods. Unlike terrorists, whose goal is to gain attention, targeted, persistent attackers seem to prefer keeping things quiet. To make matters worse, most of the victims of these attacks like to keep their losses secret as well. In the past, I’ve discussed how &lt;a href="http://smusec.blogspot.com/2010/04/keeping-targeted-attacks-secret-kills-r.html"&gt;keeping targeted attacks secret stifles the development of technical solutions&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;From everything I can tell, the US is not fighting back to protect industry from targeted, persistent cyber attacks. The military is trying to hash out their internal turf wars about who will own the cyber domain. Beyond that, the US government is still trying to figure out who, if anyone, is going to help defend industry against cyber threats. Based on the &lt;a href="http://www.foreignaffairs.com/articles/66552/william-j-lynn-iii/defending-a-new-domain"&gt;recent reports&lt;/a&gt; of a huge breach in the government’s classified networks, it appears the government and military are struggling to defend their own networks. While DHS claims to have a &lt;a href="http://www.dhs.gov/xabout/structure/editorial_0839.shtm"&gt;division dedicated to cyber security&lt;/a&gt;, it appears that they are not concerned about the theft of trade secrets from industry, preferring to focus their efforts on protecting critical infrastructure from attacks like those terrorists would like to be able to perpetrate. Defending industry from targeted attacks is not a battle anyone is openly fighting, even though industry is getting roughed up.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Cyberwar?, Cyber Espionage?&lt;/h3&gt;&lt;br /&gt;Returning to the title of this post, do targeted attacks on industry constitute cyberwar? Probably not, especially if there is no reciprocation. 
Is it espionage? Not really, at least not according to most people’s definition, because the data targeted isn’t directly related to the government but is largely economic in nature. If I were going to put targeted, persistent attacks on industry under a single moniker, I’d label them as “Economic Espionage”.&lt;br /&gt;&lt;br /&gt;A major motivation in writing this post is to voice my concern about a very serious threat to our long term prosperity and to add my voice to the others claiming that these attacks are real: they are happening today at an alarming rate. I normally don’t like doing it this way, but I’ve pointed out a serious problem without providing any suggestions for remedying it. I hope to provide my thoughts on what needs to be done in a future post. Targeted attacks on industry are real. They pose a serious threat to our long term prosperity.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-4677140445182926947?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/4677140445182926947/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2010/09/are-targeted-attacks-on-industry.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/4677140445182926947'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/4677140445182926947'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2010/09/are-targeted-attacks-on-industry.html' title='Are Targeted Attacks on Industry Cyberwar?'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-2292903171576035854</id><published>2010-08-26T17:43:00.001-07:00</published><updated>2011-06-22T05:32:20.784-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><category scheme='http://www.blogger.com/atom/ns#' term='vortex howto'/><title type='text'>Vortex Howto Series: Demo VM Image</title><content type='html'>&lt;span style="font-style:italic;"&gt;(Updated 10/16/2010) Doug Burks just informed me that he's &lt;a href="http://securityonion.blogspot.com/2010/10/security-onion-live-20101010-edition.html"&gt;included vortex&lt;/a&gt; in his &lt;a href="http://code.google.com/p/security-onion/"&gt;Security Onion liveCD&lt;/a&gt;. See comments. In many ways, this is probably a superior way to kick the wheels on vortex because if you run it on real hardware with multiple cores, you can actually see the benefits of parallelism. You can also easily and directly compare vortex to full IDS platforms like Snort or Bro as well as other smaller utilities like tcpick (vortex hopefully providing some value add somewhere). Note that Security Onion Live doesn't include libBSF, but most people don't use that extensively anyway. I gave Security Onion Live a quick test drive and highly recommend it. The VM image below will remain available for (slow) download in the event anyone finds it useful. &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;In order to make &lt;a href="http://sourceforge.net/projects/vortex-ids/"&gt;vortex&lt;/a&gt;, especially my &lt;a href="http://smusec.blogspot.com/search/label/vortex%20howto"&gt;vortex howto series&lt;/a&gt;, more accessible, I've created a vmware image. 
The image is a basic install of centos with all the prerequisites for the vortex howto series installed, including the html instructions for offline reading. Only the small pcaps are included, but scripts that download the other data sets are included.&lt;br /&gt;&lt;br /&gt;The intent is to make basic demonstration of vortex very easy. It's as easy as I dare make it. I've tested the content from installments 1 and 2, which were very easy to execute. Unfortunately, installment 3, and especially installment 4, are difficult to demonstrate in a VM due to the small number of processor cores, use of 32-bit for portability, etc.&lt;br /&gt;&lt;br /&gt;The image can be downloaded &lt;a href="http://www.csmutz.com/smusec_files/Vortex%20Demo.zip"&gt;here&lt;/a&gt;. Please excuse the slow download rates. See the included README for more details.&lt;br /&gt;&lt;br /&gt;One erratum I've already noticed is that to install the defcon data set using the script provided, you'll need to install ctorrent. For example: sudo yum install ctorrent. Also, I seemed to have trouble using mergecap to create the whole 7 GB pcap file for defcon. 
It fails at the 2GB mark, but this amount of data should be adequate for demonstration purposes anyway.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-2292903171576035854?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/2292903171576035854/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2010/08/vortex-howto-series-demo-vm-image.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/2292903171576035854'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/2292903171576035854'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2010/08/vortex-howto-series-demo-vm-image.html' title='Vortex Howto Series: Demo VM Image'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-1402303158556266619</id><published>2010-08-26T17:14:00.000-07:00</published><updated>2010-08-26T17:42:47.741-07:00</updated><title type='text'>Nergal uncovers another cool 'sploit</title><content type='html'>I'm really happy to see that Rafal Wojtczuk has gotten a fair amount of press, including a mention on &lt;a href=""&gt;slashdot&lt;/a&gt;, for his recent &lt;a href="http://www.invisiblethingslab.com/resources/misc-2010/xorg-large-memory-attacks.pdf"&gt;disclosure&lt;/a&gt; of a vulnerability allowing execution of 
code with root privileges. It's not the first of this sort for him and hopefully not the last.&lt;br /&gt;&lt;br /&gt;Rafal is the primary developer and maintainer of &lt;a href="http://libnids.sourceforge.net/"&gt;libnids&lt;/a&gt;, the library on which vortex is based. My only contact with Rafal was a short email thread seeking help with libnids: he was most helpful.&lt;br /&gt;&lt;br /&gt;Go Nergal!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-1402303158556266619?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/1402303158556266619/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2010/08/nergal-uncovers-another-cool-sploit.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/1402303158556266619'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/1402303158556266619'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2010/08/nergal-uncovers-another-cool-sploit.html' title='Nergal uncovers another cool &apos;sploit'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-1862417731802534428</id><published>2010-07-12T16:02:00.000-07:00</published><updated>2010-07-12T17:49:07.731-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='apt'/><title 
type='text'>Reflections on Sans 4n6 and IR summit</title><content type='html'>I was really pleased with how the &lt;a href="http://www.sans.org/forensics-incident-response-summit-2010/"&gt;Sans 4n6 and IR Summit&lt;/a&gt; turned out. More than anything else, it was a great opportunity to network with and hear from some of the thought leaders in 4n6 and IR. Coming from a team that has a lot of experience with IR, especially APT, I probably gained more from side conversations than anything else. I was really impressed with the heavy focus on APT, and the surprisingly on point discussions about APT. Rob Lee did a great job organizing this.&lt;br /&gt;&lt;br /&gt;Being primarily focused on IR tool development, I was happy with the high amount of respect SW developers were given. More than once, the point was made that you need really smart people creating capabilities if your (really smart) analysts are to have a chance to keep up with APT. When I romanticize my work, I fancy myself as Q, equipping our 00* analysts with the best armaments out there. Normally SW engineers are second only to end users when it comes to abuse by security folk. Overall, there was very limited bashing of end users, and even less bashing of SW engineers. I think this demonstrates the level of understanding of APT at the summit, including the realization that persistent attackers are best dealt with through a threat focused response, or as Mike Cloppert has so effectively expressed: &lt;a href="http://blogs.sans.org/computer-forensics/author/mikecloppert/"&gt;security intelligence&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I was impressed with the amount of discussion on community involvement at the conference, from technical folk volunteering to help local law enforcement to the quiescent response to APT by the federal government. 
In fact, in my mind, the award for best slides of the summit should go to Richard Bejtlich for his slides on what the &lt;a href="http://files.sans.org/summit/forensics10/PDFs/32%20bejtlich_apt_panel.pdf"&gt;US gov. should do in response to APT&lt;/a&gt;. If you want an uncomfortable chuckle, they’re definitely worth the click.&lt;br /&gt;&lt;br /&gt;For those who haven’t found them yet, the slides are &lt;a href="http://files.sans.org/summit/forensics10/"&gt;here&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-1862417731802534428?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/1862417731802534428/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2010/07/reflections-on-sans-4n6-and-ir-summit.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/1862417731802534428'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/1862417731802534428'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2010/07/reflections-on-sans-4n6-and-ir-summit.html' title='Reflections on Sans 4n6 and IR summit'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-758742973006693986</id><published>2010-06-23T05:09:00.000-07:00</published><updated>2011-07-07T16:44:19.353-07:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='technology'/><category scheme='http://www.blogger.com/atom/ns#' term='packet capture'/><title type='text'>Flushing out Leaky Taps</title><content type='html'>Many organizations rely heavily on their network monitoring tools. Network monitoring tools that operate on passive taps are often assumed to have complete network visibility. While most network monitoring tools provide stats on the packets dropped internally, most don’t tell you how many packets were lost externally to the appliance. I suspect that very few organizations do an in depth verification of the completeness of tapped data nor quantify the amount of loss that occurs in their tapping infrastructure before packets arrive at network monitoring tools. Since I’ve seen very little discussion on the topic, this post will focus on techniques and tools for detecting and measuring tapping issues.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Impact of Leaky Taps&lt;/h3&gt;&lt;br /&gt;How many packets does your tapping infrastructure drop before ever reaching your network monitoring devices? How do you know?&lt;br /&gt;&lt;br /&gt;I’ve seen too many environments where tapping problems have caused network monitoring tools to provide incorrect or incomplete results. Often these issues last for months or years without being discovered, if ever. Making decisions or relying on bad data is never good. Many public packet traces also include the type of visibility issues I will discuss.&lt;br /&gt;&lt;br /&gt;One thing to keep in mind when worrying about loss due to tapping is that you should probably solve, or at least quantify, any packet loss inside your network monitoring devices before you worry about packet loss in the taps. You need to have strong confidence in the accuracy of your network monitoring devices before you use data from them to debug loss by your taps. Remember, in most network monitoring systems there are multiple places where packet loss is reported. 
For example, using tcpdump on Linux, you have the dropped packets reported by tcpdump and the packets dropped by the network interface (ifconfig).&lt;br /&gt;&lt;br /&gt;I’m not going to discuss in detail the many things that can go wrong in getting packets from your network to a network monitoring tool. For a quick overview on different strategies for tapping, I’d recommend this &lt;a href="http://www.qosient.com/argus/sensorPerformance.shtml"&gt;article&lt;/a&gt; by the argus guys. I will focus largely on the resulting symptoms and how to detect, and to some degree, quantify them. I’m going to focus on two very common cases: low volume packet loss and unidirectional (simplex) visibility.&lt;br /&gt;&lt;br /&gt;Low volume packet loss is common in many tapping infrastructures, from span ports up to high end regenerative tapping devices. I feel that many people wrongly assume that taps either work 100% or not at all. In practice, it is common for tapping infrastructures to drop some packets such that your network monitoring device never even gets the chance to inspect them. Many public packet traces include this type of loss. Very often this loss isn’t even recognized, let alone quantified.&lt;br /&gt;&lt;br /&gt;The impact of this loss depends on what you are trying to do. If you are collecting netflow, then the impact probably isn’t too bad since you’re looking at summaries anyway. You’ll have slightly incorrect packet and byte counts, but overall the impact is going to be small. Since most flows contain many packets, totally missing a flow is unlikely. If you’re doing signature matching IDS, such as snort, then the impact is probably very small, unless you win the lottery and the packet dropped by your taps is the one containing the attack you want to detect. Again, stats are in your favor here. Most packet based IDSs are pretty tolerant of packet loss. However, if you are doing comprehensive deep payload analysis, the impact can be pretty severe. 
Let’s say you have a system that collects and/or analyzes all payload objects of a certain type--it could be anything from emails to multi-media files. If you lose just one packet used to transfer part of the payload object, you can impact your ability to effectively analyze that payload object. If you have to ignore or discard the whole payload object, the impact of a single lost packet can be significantly multiplied in that many packets worth of data can’t be analyzed.&lt;br /&gt;&lt;br /&gt;Another common problem is unidirectional visibility. There are sites and organizations that do asymmetric routing such that they actually intend to tap and monitor unidirectional flows. Obviously, this discussion only applies to situations where one intends to tap a bi-directional link but only ends up analyzing one direction. One notorious example of a public data set suffering from &lt;a href="http://datasetsfortheresearchcommunity.blogspot.com/2009/08/misconfiguration-issue-of-nsa-span-port.html"&gt;this issue&lt;/a&gt; is the &lt;a href="http://www.itoc.usma.edu/research/dataset/index.html"&gt;2009 Inter-Service Academy Cyber Defense Competition&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Unidirectional capture is common, for example, when using regenerative taps which split tapped traffic into two links based on direction but only one directional link makes it into the monitoring device. Most netflow systems are actually designed to operate well on simplex links, so the only adverse effect is that you get data on just one direction. Simple packet based inspection works fine, but more advanced, and usually rare, rules or operations using both directions obviously won’t work. Multi-packet payload inspection may still be possible on the visible direction, but it often requires severe assumptions to be made about reassembly, opening the door to classic IDS evasion. As such, some deep payload analysis systems, including vortex and others based on libnids, just won’t work on unidirectional data. 
Simplex visibility is usually pretty easy to detect and deal with, but it often goes undetected because most network monitoring equipment functions well without full duplex data. &lt;br /&gt;&lt;br /&gt;&lt;h3&gt;External Verification&lt;/h3&gt;&lt;br /&gt;Probably the best strategy for verifying network tapping infrastructure is to perform some sort of comparison of data collected passively with data collected inline. This could be comparing packet counts on routers or end devices to packet counts on a network monitoring device. For higher order verification, you should do something like compare higher order network transaction logs from an inline or end device against passively collected transaction logs. For example, you could compare IIS or Apache webserver logs to HTTP transaction logs collected by an IDS such as Bro or Suricata. These verification techniques are often difficult. You’ve got to try to deal with issues such as clock synchronization and offsets (caused by buffers in tapping infrastructure or IDS devices), differences in the data sources/logs used for concordance, etc. This is not trivial, but often can be done. &lt;br /&gt;&lt;br /&gt;Usually the biggest barrier to external verification of tapping infrastructure is the lack of any comprehensive external data source. Many people rely on passive collection devices for their primary and authoritative network monitoring. Often, there just isn’t another data source to which you can compare your passive network monitoring tools. &lt;br /&gt;&lt;br /&gt;One tactic I’ve used to prove loss in taps is to use two sets of taps such that packets must traverse both taps. If one tap sees a packet traverse the network and another tap doesn’t, and both monitoring tools claim 0 packet loss, you know you’ve got a problem. 
I’ve actually seen situations where one network monitoring device didn’t see some packets and the other network monitoring device didn’t see some packets, but the missing packets from the two traces didn’t overlap.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Inferring Tapping Issues&lt;/h3&gt;&lt;br /&gt;While it is not easy, and not necessarily as precise or as complete as comparing to external data, it is possible to use network monitoring tools to infer visibility gaps in the data they are seeing. Many network protocols, namely TCP, provide mechanisms specifically designed to ensure reliable transport of data. Unlike an endpoint, a passive observer can’t simply ask for a retransmission when a packet is lost. However, a passive observer can use the mechanisms the endpoints use to infer if it missed packets passed between endpoints. For example, if Alice sends a packet to Bob which the passive observer Eve doesn’t see, but Bob acknowledges receipt with Alice and Eve sees the acknowledgement, Eve can infer that she missed a packet.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Data and Tools&lt;/h3&gt;&lt;br /&gt;To keep the examples simple and easily comparable, I’ve created 3 pcaps. The &lt;a href="http://www.csmutz.com/smusec_files/alice_full.pcap"&gt;full pcap&lt;/a&gt; contains all the packets from an HTTP download of the ASCII “Alice in Wonderland” from &lt;a href="http://www.gutenberg.org/etext/11"&gt;Project Gutenberg&lt;/a&gt;. The &lt;a href="http://www.csmutz.com/smusec_files/alice_loss.pcap"&gt;loss pcap&lt;/a&gt; is the same except that one packet, packet 50, was removed. The &lt;a href="http://www.csmutz.com/smusec_files/alice_half.pcap"&gt;half pcap&lt;/a&gt; is the same as the full pcap, but only contains the packets going to the server, without the packets going to the client.&lt;br /&gt;&lt;br /&gt;For tools, I’ll be using argus and tshark to infer packet loss in the tap. Argus is a network flow monitoring tool. Tshark is the CLI version of the ever popular wireshark. 
Since deep payload analysis systems are often greatly affected by packet loss, I’ll explain how the two types of packet loss affect vortex.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Low Volume Loss in Taps&lt;/h3&gt;&lt;br /&gt;Detecting and quantifying low volume loss can be difficult. The most effective tool I’ve found for measuring this is tshark, especially the tcp analysis lost segment flag.&lt;br /&gt;&lt;br /&gt;Note that this easily identifies the lost packet at position 50:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;br /&gt;$ tshark -r alice_full.pcap -R tcp.analysis.lost_segment&lt;br /&gt;$ tshark -r alice_loss.pcap -R tcp.analysis.lost_segment&lt;br /&gt; 50   0.410502  152.46.7.81 -&gt; 66.173.221.158 TCP [TCP Previous segment lost] [TCP segment of a reassembled PDU] &lt;br /&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;I’ve created a simple (but inefficient) script that can be used on many pcaps. Since tshark doesn’t release memory, you’ll need to use pcap slices smaller than the amount of memory in your system. The script is as follows:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;br /&gt;#!/bin/bash&lt;br /&gt;&lt;br /&gt;while read -r file&lt;br /&gt;do&lt;br /&gt;  total=`tcpdump -r "$file" -nn "tcp" 2&gt;/dev/null | wc -l`&lt;br /&gt;  errors=`tshark -r "$file" -R tcp.analysis.lost_segment | wc -l`&lt;br /&gt;  percent=`echo $errors $total | awk '{ print $1*100/$2 }'`&lt;br /&gt;  bandwidth=`capinfos "$file" | grep "bits/s" | awk '{ print $3" "$4 }'`&lt;br /&gt;  echo "$file:     $percent%      $bandwidth     "&lt;br /&gt;done&lt;br /&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Updated 02/21/2011: Most people will want to use &lt;b&gt;"tcp.analysis.ack_lost_segment"&lt;/b&gt; instead of "tcp.analysis.lost_segment". See bottom of post for details.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;To run it, pipe it a list of pcap files. 
For example, here are the results from the slices of the &lt;a href="http://www.ddtek.biz/dc17.html"&gt;defcon17 Capture the Flag packet captures&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;br /&gt;$ ls ctf_dc17.pcap0* | calc_tcp_packet_loss.sh&lt;br /&gt;ctf_dc17.pcap000:     0.44235%      34751.40 bits/s&lt;br /&gt;ctf_dc17.pcap001:     0.584816%      210957.26 bits/s&lt;br /&gt;ctf_dc17.pcap002:     0.615856%      173889.57 bits/s&lt;br /&gt;ctf_dc17.pcap003:     0.51238%      165425.21 bits/s&lt;br /&gt;ctf_dc17.pcap004:     0.343817%      253283.86 bits/s&lt;br /&gt;...&lt;br /&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Note that I haven’t done any sort of serious analysis of this data set. I assume there were some packets lost, but don’t know for sure. I’m just inferring. Also, assuming there are some packets missing, I will never know if this was a tapping issue, network monitoring/packet capture issue, or both.&lt;br /&gt;&lt;br /&gt;In the case of low volume loss in taps, netflow isn’t always the most useful. &lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;br /&gt;$ argus -X -r alice_full.pcap -w full.argus&lt;br /&gt;$ ra -r full.argus -n -s stime flgs saddr sport daddr dport spkts dpkts loss&lt;br /&gt;   10:12:54.474330  e          66.173.221.158.55812         152.46.7.81.80           &lt;b&gt;87      121          0&lt;/b&gt;&lt;br /&gt;$ argus -X -r alice_loss.pcap -w loss.argus&lt;br /&gt;$ ra -r loss.argus -n -s stime flgs saddr sport daddr dport spkts dpkts loss&lt;br /&gt;   10:12:54.474330  e          66.173.221.158.55812         152.46.7.81.80           &lt;b&gt;87      120          0&lt;/b&gt;&lt;br /&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Note that there is one less dpkt (destination packet). Other than the packet counts, there is no way to know that packet loss occurred. I’d swear I’ve seen other cases where argus actually gave an indication of packet loss in either the loss count or the flags, but that’s definitely not occurring here. 
Note loss in most network flow monitoring tools refers to packets lost by the network itself (observed by retransmission) not loss in the taps which has to be inferred.&lt;br /&gt;&lt;br /&gt;Vortex basically gives up on trying to reassemble a TCP stream if there is a packet that is lost and the TCP window is exceeded. The stream gets truncated at the first hole and the stream remains in limbo until it idles out or vortex closes.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;br /&gt;$ vortex -r alice_full.pcap -e -t full&lt;br /&gt;Couldn't set capture thread priority!&lt;br /&gt;full/tcp-1-1276956774-1276956775-c-168169-66.173.221.158:55812s152.46.7.81:80&lt;br /&gt;full/tcp-1-1276956774-1276956775-c-168169-66.173.221.158:55812c152.46.7.81:80&lt;br /&gt;VORTEX_ERRORS TOTAL: 0 IP_SIZE: 0 IP_FRAG: 0 IP_HDR: 0 IP_SRCRT: 0 TCP_LIMIT: 0 TCP_HDR: 0 TCP_QUE: 0 TCP_FLAGS: 0 UDP_ALL: 0 SCAN_ALL: 0 VTX_RING: 0 OTHER: 0&lt;br /&gt;VORTEX_STATS PCAP_RECV: 0 PCAP_DROP: 0 VTX_BYTES: 168169 VTX_EST: 1 VTX_WAIT: 0 VTX_CLOSE_TOT: 1 VTX_CLOSE: 1 VTX_LIMIT: 0 VTX_POLL: 0 VTX_TIMOUT: 0 VTX_IDLE: 0 VTX_RST: 0 VTX_EXIT: 0 VTX_BSF: 0&lt;br /&gt;&lt;br /&gt;$ vortex -r alice_loss.pcap -e -t loss&lt;br /&gt;Couldn't set capture thread priority!&lt;br /&gt;loss/tcp-1-1276956774-1276956774-e-31056-66.173.221.158:55812s152.46.7.81:80&lt;br /&gt;loss/tcp-1-1276956774-1276956774-e-31056-66.173.221.158:55812c152.46.7.81:80&lt;br /&gt;VORTEX_ERRORS TOTAL: 2 IP_SIZE: 0 IP_FRAG: 0 IP_HDR: 0 IP_SRCRT: 0 TCP_LIMIT: 0 TCP_HDR: 0 TCP_QUE: 2 TCP_FLAGS: 0 UDP_ALL: 0 SCAN_ALL: 0 VTX_RING: 0 OTHER: 0&lt;br /&gt;Hint--TCP_QUEUE: Investigate possible packet loss (if PCAP_LOSS is 0 check ifconfig for RX dropped).&lt;br /&gt;VORTEX_STATS PCAP_RECV: 0 PCAP_DROP: 0 VTX_BYTES: 31056 VTX_EST: 1 VTX_WAIT: 0 VTX_CLOSE_TOT: 1 VTX_CLOSE: 0 VTX_LIMIT: 0 VTX_POLL: 0 VTX_TIMOUT: 0 VTX_IDLE: 0 VTX_RST: 0 VTX_EXIT: 1 VTX_BSF: 0&lt;br /&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Note that there are fewer bytes collected, vortex warns 
about packet loss, there are TCP_QUEUE errors, and the stream doesn’t close cleanly in the loss pcap.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Simplex Capture&lt;/h3&gt;&lt;br /&gt;Simplex Capture is actually pretty simple to identify. It’s only problematic because many tools don’t warn you if it is occurring, so you often don’t even know it is happening. The straightforward approach is to use netflow and look for flows with packets in only one direction.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;br /&gt;$ argus -X -r alice_half.pcap -w half.argus&lt;br /&gt;$ ra -r half.argus -n -s stime flgs saddr sport daddr dport spkts dpkts loss&lt;br /&gt;   10:12:54.474330  e          66.173.221.158.55812         152.46.7.81.80           &lt;b&gt;87        0          0&lt;/b&gt;&lt;br /&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;This couldn’t be more clear. There are only packets in one direction. If you use a really small flow record interval, you’ll want to do some flow aggregation to ensure you will get packets from both directions in a given flow record. Note that argus by default creates bidirectional flow records. If your netflow system does unidirectional flow records, you need to do a little more work like associating the two unidirectional flows and making sure both sides exist.&lt;br /&gt;&lt;br /&gt;You could also use tshark or tcpdump and see that for a given connection, you only see packets in one direction.&lt;br /&gt;&lt;br /&gt;Vortex handles simplex network traffic in a straightforward, albeit somewhat lackluster manner--it just ignores it. LibNIDS, on which vortex is based, is designed to overcome NIDS TCP evasion techniques through exactly mirroring the functionality of TCP stack but assumes full visibility (no packet loss) to do so. If it doesn’t see both sides of a TCP handshake, it won’t follow the stream because a full handshake hasn’t occurred. 
As such the use of vortex on the half pcap is rather uneventful:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;br /&gt;$ vortex -r alice_half.pcap -e -t half&lt;br /&gt;Couldn't set capture thread priority!&lt;br /&gt;VORTEX_ERRORS TOTAL: 0 IP_SIZE: 0 IP_FRAG: 0 IP_HDR: 0 IP_SRCRT: 0 TCP_LIMIT: 0 TCP_HDR: 0 TCP_QUE: 0 TCP_FLAGS: 0 UDP_ALL: 0 SCAN_ALL: 0 VTX_RING: 0 OTHER: 0&lt;br /&gt;VORTEX_STATS PCAP_RECV: 0 PCAP_DROP: 0 VTX_BYTES: 0 VTX_EST: 0 VTX_WAIT: 0 VTX_CLOSE_TOT: 0 VTX_CLOSE: 0 VTX_LIMIT: 0 VTX_POLL: 0 VTX_TIMOUT: 0 VTX_IDLE: 0 VTX_RST: 0 VTX_EXIT: 0 VTX_BSF: 0&lt;br /&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;The most optimistic observer will point out that at least vortex makes it clear when you don’t have full duplex traffic--because you see nothing.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Conclusion&lt;/h3&gt;&lt;br /&gt;I hope the above is helpful to others who rely on passive network monitoring tools. I’ve discussed the two most prevalent tapping issues I’ve seen personally. One topic I’ve intentionally avoided because it’s hard to discuss and debug is interleaving of aggregated taps, especially issues with timing. For example, assume you do some amount of tap aggregation, especially aggregation of simplex flows, either using an external tap aggregator or bonded interfaces inside your network monitoring system. If enough buffering occurs, it may be possible for packets from each simplex flow to be interleaved incorrectly. For example, a SYN-ACK, may end up in front of the corresponding SYN. There are other subtle tapping issues, but the two I discussed above are by far the most prevalent problems I’ve seen. Verifying or quantifying the loss in your tapping infrastructure once is above and beyond what many organizations do. 
If you rely heavily on the validity of your data, you may consider doing this periodically or automatically so you detect any changes or failures.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;br /&gt;&lt;b&gt;Updated 02/21/2011:&lt;/b&gt; I need to clarify and correct the discussion about low volume packet loss. The point of this post was to talk about packet loss in tapping infrastructure--packets that are successfully transferred through the network, but which don’t make it to passive monitoring equipment. This is actually pretty common in low end tapping equipment, such as span ports of switches or routers. My intention was not to talk about normal packet loss that occurs in networks, usually due to network congestion. I messed up. I have two versions of the script above floating around--one that measures packets “missed” by the network monitor and one that measures total packets “lost” on the network. I used the wrong one above.&lt;br /&gt;&lt;br /&gt;Let me explain more. When I say “missed” I mean packets that traversed the network being monitored, but didn’t make it to the monitor device. That is, they were lost during tapping/capture. When I say “lost” packets, I mean packets that the monitor device anticipated, but didn’t see for whatever reason. They could be dropped on the network (e.g. congestion) or could be dropped in the tapping/capture process. One really cool feature of tshark is that you can easily differentiate between the two. The tcp.analysis.ack_lost_segment filter matches all packets which ACK a packet (or packets) which were not seen in the packet trace. The official description is: “ACKed Lost Packet (This frame ACKs a lost segment)”. While your monitor device didn’t see the ACK’d packets, the other endpoint in the communications presumably did because it sent an ACK. The implication is that you can infer with strong confidence that the absent packets were actually transferred through the network but were “missed” by your capture. 
This feature of tshark is the best way I’ve found to identify packet loss that is occurring in passive network tapping devices or in network monitors which isn’t reported in the normal places in network sensors (pcap dropped, ifconfig dropped, ethtool -S). In normal networks with properly functioning passive monitoring devices “ack_lost_segment” should be zero.&lt;br /&gt; &lt;br /&gt;On the other hand, the mechanism which I mistakenly demonstrated above calculates packets lost for any reason, usually either congestion on the network being monitored or deficiencies in network monitoring equipment. The description of tcp.analysis.lost_segment is: “Previous Segment Lost (A segment before this one was lost from the capture)”. For the purposes of verifying the accuracy of your network monitoring equipment, any loss due to congestion is a red herring. While this mechanism certainly does report packets “missed” by your network monitoring equipment, it will also report those “lost” for any other reason. I keep this version of the script around to look at things like loss due to congestion. It may well be useful for passively studying where loss due to congestion is occurring such as you might do if you are studying &lt;a href="https://gettys.wordpress.com/2010/12/06/whose-house-is-of-glasse-must-not-throw-stones-at-another/"&gt;buffer bloat&lt;/a&gt;. In networks subject to normal congestion, “lost_segment” should be non-zero.&lt;br /&gt;&lt;br /&gt;Please excuse this mistake. I try hard to keep my technical blog posts strictly correct, very often providing real examples.&lt;br /&gt;&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;i&gt;&lt;br /&gt;&lt;b&gt;Updated 07/07/2011:&lt;/b&gt; György Szaniszló has proposed a fix for wireshark that ensures that all “ack_lost_segment” instances are actually reported as such. 
In older versions of tshark, there were false negatives (instances where “ack_lost_segment” should have been reported but wasn't) but no false positives (all instances of the “ack_lost_segment” were correct). As such, with György's fix, tshark should provide more accurate numbers in the event of loss in tapping infrastructure. The old versions of tshark are still useful for confirming that you have problems with your tapping infrastructure (I've had decent success with them), but clearly are not as accurate for comprehensively quantifying all instances of loss in your taps. In his &lt;a href="https://bugs.wireshark.org/bugzilla/show_bug.cgi?id=6081"&gt;bug report&lt;/a&gt;, he does a great job explaining the different types of loss, which he terms “Network-side” and “Monitor-side”. He also provides an additional trace for testing.&lt;br /&gt;&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-758742973006693986?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/758742973006693986/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2010/06/flushing-out-leaky-taps.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/758742973006693986'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/758742973006693986'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2010/06/flushing-out-leaky-taps.html' title='Flushing out Leaky Taps'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-7078834686159264113</id><published>2010-06-23T03:54:00.000-07:00</published><updated>2010-06-28T17:04:29.844-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='opinion'/><category scheme='http://www.blogger.com/atom/ns#' term='apt'/><category scheme='http://www.blogger.com/atom/ns#' term='security intelligence'/><title type='text'>Cloppert on Defining APT Campains</title><content type='html'>Michael Cloppert has posted another installment in his long-running series on security intelligence. In his latest, &lt;a href="http://blogs.sans.org/computer-forensics/2010/06/21/security-intelligence-knowing-enemy/"&gt;Defining APT Campaigns&lt;/a&gt;, he discusses the how and why behind a threat focussed approach to categorizing attack activity. More important than the how, when combined with his previous articles in this series, he gives a clear explanation of the why.&lt;br /&gt;&lt;br /&gt;If you are somehow responsible for responding to targeted attackers you should understand why security intelligence or a threat focussed response is so critical. This is how you consistently stop and analyze attacks before compromises occur. This is how you build resilient defenses that transcend the vulnerability du jour. This is how you get a leg up on the attackers and make repeated attacks harder for them.&lt;br /&gt;&lt;br /&gt;I have to say, when I was first exposed to security intelligence, I was a little skeptical. My thought was "that's cool we can understand so much about the attacker, but what's the point?". Well, the point is, the more visibility you have into an attack sequence, the more an attacker has to change to make the next attack successful. 
You can also stop attacks sooner, saving time on damage assessment and cleanup, which allows you to spend more time preparing for the next attack. After seeing how effective this approach is against APT, I'm a believer. I can't count the number of attacks, including 0-day exploits, that I have seen effectively mitigated because of common indicators or techniques used between attacks in the same campaign.&lt;br /&gt;&lt;br /&gt;Lastly, he touches on the criticality of developing tools for threat focussed incident response and detection. Clearly this warms my heart.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-7078834686159264113?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/7078834686159264113/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2010/06/cloppert-on-defining-apt-campains.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/7078834686159264113'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/7078834686159264113'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2010/06/cloppert-on-defining-apt-campains.html' title='Cloppert on Defining APT Campains'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-3572766734063330891</id><published>2010-06-16T08:12:00.000-07:00</published><updated>2011-06-22T05:42:13.059-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='idle'/><category scheme='http://www.blogger.com/atom/ns#' term='memories'/><title type='text'>Flashback to my Commodore 128</title><content type='html'>While some of my colleagues were having nightmares responding to more &lt;a href="http://www.adobe.com/support/security/bulletins/apsb10-14.html"&gt;adobe 0-days&lt;/a&gt;, I've been on vacation, having pleasant flashbacks of my own.&lt;br /&gt;&lt;br /&gt;While I don't normally indulge like this, I just couldn't pass up posting this picture, which I found going through old pictures with my grandfather. It's a picture of our newly set up &lt;a href="http://en.wikipedia.org/wiki/Commodore_128"&gt;Commodore 128&lt;/a&gt;, complete with joysticks and tractor-feed dot matrix printer. I think the picture was taken on or soon after Christmas in 1987, give or take a year.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_g1XmJJW8J_g/TBjvcOj4rKI/AAAAAAAAAA4/Z32SmmkwT4M/s1600/commodore128.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 250px; height: 320px;" src="http://3.bp.blogspot.com/_g1XmJJW8J_g/TBjvcOj4rKI/AAAAAAAAAA4/Z32SmmkwT4M/s320/commodore128.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5483395814547565730" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;I think all of us have sweet memories of our first computers. The commodore 128 was mine. I have great memories of playing games and using &lt;a href="http://en.wikipedia.org/wiki/The_Print_Shop"&gt;The Print Shop&lt;/a&gt;. 
It makes me laugh seeing the huge smile I had on my face then.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-3572766734063330891?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/3572766734063330891/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2010/06/flashback-to-my-commodore-128.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/3572766734063330891'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/3572766734063330891'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2010/06/flashback-to-my-commodore-128.html' title='Flashback to my Commodore 128'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_g1XmJJW8J_g/TBjvcOj4rKI/AAAAAAAAAA4/Z32SmmkwT4M/s72-c/commodore128.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-8541157085350070110</id><published>2010-05-29T19:01:00.000-07:00</published><updated>2010-06-02T08:25:03.463-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='opinion'/><category scheme='http://www.blogger.com/atom/ns#' term='apt'/><title type='text'>Security Engineering Is Not The Solution to Targeted Attacks</title><content type='html'>Recent publicity and lessons 
from the school of hard knocks have significantly increased the visibility of targeted attacks. Many organizations react to targeted attacks by pouring on yet more of the traditional reactive security measures that didn’t work in the first place. Many also institute draconian rules and procedures for their users. While stepping up security infrastructure and user awareness training is often necessary, it can never completely solve the targeted attack problem, at least not without inflicting unreasonable and probably impractical restrictions on the organization’s personnel and IT infrastructure. &lt;br /&gt;&lt;br /&gt;There’s been a fair amount of buzz about Michal Zalewski’s article entitled &lt;a href="http://www.zdnet.com/blog/security/security-engineering-broken-promises/6503"&gt;Security engineering: broken promises&lt;/a&gt;. He does a very good job of summarizing some of the open issues with security engineering. I do think he’s probably a little pessimistic, missing some opportunities to give credit, and I think it’s unfair to claim security engineering has failed for not developing a unified model that can ensure security. However, he’s pulled together a lot of different facets of security engineering in a short article. The field of security engineering does need to continue to seek to eliminate vulnerabilities that are being exploited widely, and do it in an efficient manner. Much of his discussion can be generalized beyond software security to general information security.&lt;br /&gt;&lt;br /&gt;While Zalewski didn’t address or mention APT, I’ve heard similar (but usually not so complete or well worded) rants about the failings of security best practices in regard to APT. It really pains me to hear people trash security engineering, especially in the context of Aurora and similar attacks. I’ve also heard a fair amount of “sky is falling” and “security best practices can’t keep you safe from APT”. 
Blaming security engineering for failing to stop targeted attacks doesn’t make sense when it was never a requirement of most systems. Furthermore, we don’t want security engineering alone to solve this problem anyway. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Engineering&lt;/h3&gt;&lt;br /&gt;Engineering is about applying science to provide solutions that meet well defined parameters. These parameters involve all sorts of things like functionality, cost, reliability, etc. Many of these parameters are conflicting, at least apparently. Because we live in a world with scarce resources, engineering seeks to provide the optimum value for all the various parameters.&lt;br /&gt;&lt;br /&gt;While security has some unique characteristics, it can be viewed as another parameter of a system. While I agree that if done right, security doesn’t have to be as painful as we often make it, security does often conflict with other parameters such as flexibility, cost, and functionality. As such, a wise engineer only invests as much effort in making a system secure as is required. &lt;br /&gt;&lt;br /&gt;It amazes me that &lt;a href="http://en.wikipedia.org/wiki/Physics_envy"&gt;physics envy&lt;/a&gt; rages so strongly in some people’s hearts and minds that they actually lose sight of the imperfections of both theoretical and applied physics. People who expect a comprehensive model to cover all aspects of security, much like the ever nebulous &lt;a href="http://en.wikipedia.org/wiki/Theory_of_everything"&gt;theory of everything&lt;/a&gt;, have a long time to wait, very possibly infinitely long, but even that is probably impossible to prove. Furthermore, many of the simple physics models are hard to actually apply in the real world due to the many different phenomena that need to be modeled simultaneously. The massless, frictionless, point objects that we hear so much of in physics exercises must only exist in a vacuum, because I’ve never seen them. 
Practical application of physics isn’t as easy as the models often make it appear. That’s alright though. Using classic Newtonian physics works in a great many situations and helps me understand the world around me. Scientific models always have limitations in their applicability, but that doesn’t negate their value. While it quite often literally requires a group of rocket scientists, very often using a mix of multiple models and simulations, we’ve been able to do a great many things based on physics without having a theory of everything.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Success of Security Engineering&lt;/h3&gt;&lt;br /&gt;Formal methods are hard. Formally verifying a system is only practical on the simplest of systems. However, it has been done. Examples include flight control systems and highly secure classified systems. We refrain from formally verifying all systems we build not because we can’t or don’t know how, but because it’s just too hard. It requires too much effort and restricts the functionality and flexibility of the resulting systems too much for most people’s tastes. Most people couldn’t, or at least wouldn’t want to, do their day-to-day work on one of these highly verified, and therefore highly restricted, systems.&lt;br /&gt;&lt;br /&gt;While not my favorite, using risk mitigation strategies is very effective in certain circumstances. It’s useful where the risk can be quantified and accurately predicted. A prime example of this is the risk associated with identity theft. Many financial institutions effectively apply risk mitigation calculations to determine whether a given measure which will reduce losses due to identity theft will cost more to implement than just accepting the losses. As long as the losses can be accurately calculated a priori, this method is very valid.&lt;br /&gt;&lt;br /&gt;Again, I realize that there is plenty of room for improvement in the field of security engineering. 
Regardless, for most threats, security engineering is rather successful overall. We know how to make systems more secure than they are now, but we prefer not to. So if systems aren’t as secure as they should be, it’s usually because we didn’t design them to be secure. Sure, part of engineering is finding solutions that satisfy multiple parameters at the same time. These advances will continue to make security more compatible with ease of use and flexibility. Improved standards will continue to raise the minimum bar of security, while minimizing the additional cost of doing so. However, I believe most systems are secure enough, or at least as secure as we wanted them to be.&lt;br /&gt;&lt;br /&gt;It should be noted that the adequacy of the security provided by most systems is not provided solely by the system itself but is supported by external factors such as legal protections. For example, the physical security of most houses is only good enough to make it difficult, well maybe even only inconvenient, for would-be burglars. The vast majority of the deterrence comes in fear of getting caught. Furthermore, insurance provides a very cost effective means of protecting your investments despite the remote risk of burglary.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Security Engineering Can’t Solve Targeted Attacks&lt;/h3&gt;&lt;br /&gt;The biggest problem with targeted attacks isn’t that security engineering couldn’t provide effective solutions. Our current systems aren’t secure enough to protect us from targeted attacks because we haven’t asked them to be that secure. Furthermore, I don’t think we want them to be that secure. Even if it was possible to make a machine that was 100% secure, I doubt it could ever be used for much of consequence while maintaining that level of security due to weaknesses in the environment, people, and processes.&lt;br /&gt;&lt;br /&gt;Let’s return to the example of the residential physical security. 
Imagine if you took away the deterrence offered by law enforcement. It’s hard to imagine, but let’s say would-be attackers had basically no external deterrence and the only thing between them and your possessions in your house was you and your house. You’d have to go to some very extreme measures to keep your house secure. Simple locks and even an alarm system wouldn’t cut it. Basically, in the absence of any other deterrent, to defeat a rational burglar the defenses on your house would have to cost the attacker more to circumvent than the value that he could gain from sacking the house. This is a tough asymmetric situation where your defenses have to be perfect and the persistent burglar only has to find one weakness or one weak moment. He can try over and over again, as failed attempts don’t cost him much. It doesn’t take much imagination to see how living in a house like this wouldn’t be much fun.&lt;br /&gt;&lt;br /&gt;Ok, now pretend you have something valuable to a small set of burglars. Let’s say you have something like a highly coveted recipe for cinnamon rolls. Let’s say a small set of burglars really want to make their own sweet buns instead of buying yours, and possibly sell them to your customers. The problem with this is that you can’t take out an insurance policy on your roll recipe very easily. How could you quantify the cost of exposure? How could you prove the secret was really lost if you suspect it was? How many times would insurance compensate you: only on the first loss or on all subsequent losses? Insurance just doesn’t work in this case. Insurance policies work great for easily replaceable items like televisions, cars, etc., but they just don’t work well for things like trade secrets.&lt;br /&gt;&lt;br /&gt;Targeted attacks are much like the scenario laid out above. Sadly, there is little to no deterrence. Usually the information targeted is highly valuable, but not easily quantifiable. 
Lastly, while it technically would be possible to engineer defenses that would be effective, very few people really want to live in the resulting vault in Fort Knox, let alone pay for the construction.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Alternatives to Security Engineering&lt;/h3&gt;&lt;br /&gt;So if it’s not feasible to pursue a pure engineering solution to defend against targeted attacks, what is to be done? First of all, a lot of other non-technical solutions should be pursued. I’ll refrain from discussing legal, political, diplomatic, military, etc. solutions because most of us only have a minor influence on these domains and my experience is pretty thin in these areas. However, I do think it’s clear that in many cases, non-technical solutions would be the most effective solutions to the problem. It should also be clear from the empty public statements made by many leaders and decision makers in this realm that non-technical solutions on an international scale are going to take a while, if they ever come.&lt;br /&gt;&lt;br /&gt;Security engineering is part of the solution. In many cases, we do need to engineer more secure solutions. We need to make security cheaper and easier. However, even with the best minds on the problem, this will only help so much. While our users need to improve their resilience to social engineering, in many cases, targeted attacks are done so well that I couldn’t fault a user for being duped.&lt;br /&gt;&lt;br /&gt;Previously I discussed how &lt;a href="http://smusec.blogspot.com/2010/04/keeping-targeted-attacks-secret-kills-r.html"&gt;keeping targeted attacks secret kills R&amp;D&lt;/a&gt;. In that case, I wasn’t speaking of security engineering as applied to all IT systems, but was referring to the small subset of IT infrastructure dedicated primarily to security (e.g. IDS, SIMS, etc.). 
In that post I echoed the claim of others that threat focused response or security intelligence is one of the most effective approaches to responding to targeted attacks. Surely, this incident response approach will require some engineering of tools to support this approach, in addition to the general security engineering that will come out of proper incident response. Correctly prioritizing your engineering resources to deal with targeted attacks will often result in allocation of resources to tools that support an intelligence driven response.&lt;br /&gt;&lt;br /&gt;I often imagine that a well functioning threat focused incident response team facing targeted attacks is much like the wolf and sheepdog cartoon. While the sheep aren’t particularly well protected, and really can’t be if they are to graze successfully, the sheepdog watches for the ever present wolf. The sheepdog keeps track of the wolf and counters his efforts directly, instead of trying to remedy every possible vulnerability. I recognize that as the sheepdog is invariably successful, this comparison is a little more ideal than reality will probably ever be. However, focusing a concentrated intelligence effort on a relatively small group of highly sophisticated attackers makes a lot of sense as long as the group of advanced attackers is small and the effort to defend against them is much higher than against other vanilla threats.&lt;br /&gt;&lt;br /&gt;I’ve done both security engineering and engineering for security intelligence. Both have their place. Both have their success stories and both have numerous opportunities for improvement. However, blaming security engineering for the impact of targeted attacks is a herring as red as they come. A world where security engineering actually tried to solve highly targeted and determined attackers would not be a fun place in which to live. In absence of other solutions, an intelligence driven incident response model is your best bet. 
If I haven’t been able to convince you of this, then all I have to say is that Chewbacca lives on Endor and that just doesn’t make sense… Blaming security engineering for targeted attacks: that does not make sense.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-8541157085350070110?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/8541157085350070110/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2010/05/security-engineering-is-not-solution-to.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/8541157085350070110'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/8541157085350070110'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2010/05/security-engineering-is-not-solution-to.html' title='Security Engineering Is Not The Solution to Targeted Attacks'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-7066800412122164822</id><published>2010-05-20T16:30:00.000-07:00</published><updated>2010-05-20T16:36:55.326-07:00</updated><title type='text'>Panel and Preso at SANS 4n6 and IR Summit</title><content type='html'>I’m honored to have been asked to be part of the &lt;a href="http://www.sans.org/forensics-incident-response-summit-2010/"&gt;SANS 2010 What Works in Forensics and 
Incident Response Summit&lt;/a&gt;. I’ll be part of a panel discussion on network forensics and will be presenting on the topic of “Network Payload Analysis for Advanced Persistent Threats”.&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://blogs.sans.org/computer-forensics/2010/05/20/2010-digital-foreniscs-incident-response-summit-final-agenda-released/"&gt;agenda&lt;/a&gt; includes some presentations and panel discussions by a large number of the thought leaders in the field of incident response and digital forensics. This is an excellent opportunity to hear from those with experience responding to highly targeted attacks. I'm really looking forward to participating.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-7066800412122164822?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/7066800412122164822/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2010/05/panel-and-preso-at-sans-4n6-and-ir.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/7066800412122164822'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/7066800412122164822'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2010/05/panel-and-preso-at-sans-4n6-and-ir.html' title='Panel and Preso at SANS 4n6 and IR Summit'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' 
src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-6172214436688355864</id><published>2010-05-01T19:20:00.000-07:00</published><updated>2011-06-22T05:42:49.803-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><category scheme='http://www.blogger.com/atom/ns#' term='vortex howto'/><category scheme='http://www.blogger.com/atom/ns#' term='near real-time IDS'/><title type='text'>Vortex Howto Series: Parallel NRT IDS</title><content type='html'>To fulfill all the major tasks I promised when I &lt;a href="http://smusec.blogspot.com/2010/03/vortex-howto-series-network.html"&gt;began&lt;/a&gt; this series of &lt;a href="http://smusec.blogspot.com/search/label/vortex%20howto"&gt;vortex howto&lt;/a&gt; articles, this installment will focus on scaling up the network analysis done with vortex in a way that leverages the highly parallel nature of modern servers. While the techniques shared in this post are applicable to all the uses of vortex demonstrated so far, it’s especially applicable to near-real time network analysis, a major goal of which is to support detections not possible with conventional IDS architectures, including high latency and/or highly computationally expensive analysis. If you are new to NRT IDS and its goals, I recommend reading about &lt;a href="http://labs.snort.org/nrt/"&gt;snort-nrt&lt;/a&gt; especially this &lt;a href="http://vrt-sourcefire.blogspot.com/2010/04/new-detection-framework.html"&gt;blog post&lt;/a&gt; which explains why some very useful detection just can’t be done in traditional IDS architectures. As we’re going to build upon the work done in installment 3, I highly recommend reading it if you haven’t. &lt;br /&gt;&lt;br /&gt;Many of us learned about multiprocessing and its advantages in college. 
In cases where you have high latency analysis, which often is caused by IO such as querying a DB, multiprocessing allows you to efficiently keep your processor(s) busy while accomplishing many high latency tasks in parallel. Traditionally, if you want to do computationally expensive tasks that can’t be done on a single processor, you have two options: use a faster processor or use multiple processors in parallel. Well, if you haven’t noticed, processor speeds haven’t increased for quite some time, but the number of processors in computers has increased fairly steadily. Therefore, as you scale up computationally expensive work on commodity hardware, your only serious choice is to parallelize. While the hard real time constraints of IPS make high latency analysis impossible and computationally expensive analysis difficult, if you are satisfied with near real-time, it’s a lot easier to efficiently leverage parallel processing.&lt;br /&gt;&lt;br /&gt;Note that throughout this article, I’m not going to make a clear distinction between multi-threading, multi-processing, and multi-system processing. While textbooks make a stark differentiation, modern hardware and software somewhat blur the differences. For the purposes of this article, the distinction isn’t really important anyway.&lt;br /&gt;&lt;br /&gt;Vortex is a platform for network analysis, but it doesn’t care if the analyzer you use is single or multi-threaded. Vortex works well either way. However, xpipes, which is distributed with vortex, makes it easy to turn a single threaded analyzer into a highly parallel analyzer even if, or especially in the cases where, the analyzer is written in a language that doesn’t support threading.&lt;br /&gt;&lt;br /&gt;Xpipes borrows much of its philosophy (and name) from xargs. Like xargs, it reads a list of data items (very often filenames) from STDIN and is usually used in conjunction with a pipe, taking input from another program. 
While xargs takes inputs and plops them in as arguments to another program, xpipes takes inputs and divides them between multiple pipes feeding other programs. If you are in a situation where xargs works for you, then by all means, use it. Xpipes was written to be able to fit right between vortex and a vortex analyzer without modifying either, thereby maintaining the vortex interface. Xpipes spawns multiple independent instances of the analyzer program and divides traffic between the analyzers, feeding each stream to the next available analyzer. In general, xpipes is pretty efficient.&lt;br /&gt;&lt;br /&gt;Slightly simplifying our ssdeep-n network NRT IDS from our last installment, we get:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;vortex -r ctf_dc17.pcap -e -t /dev/shm/ssdeep-n \&lt;br /&gt;-K 600 | ./ssdeep-n.sh | logger -t ssdeep-n&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;To convert this to a multithreaded NRT IDS, we would do the following:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;vortex -r ctf_dc17.pcap -e -t /dev/shm/ssdeep-n \&lt;br /&gt;-K 600 | xpipes -P 12 -c './ssdeep-n.sh | logger -t ssdeep-n'&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Now instead of a single instance of the analyzer we will have 12. Our system has 16 processors so this doesn’t fully load the system, but now a larger fraction of the total computing resources is used. 
Here is how this looks in top:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;top - 12:56:25 up 102 days, 19:35,  4 users,  load average: 17.30, 16.94, 9.&lt;br /&gt;Tasks: 295 total,   7 running, 288 sleeping,   0 stopped,   0 zombie&lt;br /&gt;Cpu(s): 16.5%us, 54.7%sy,  0.1%ni, 28.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%&lt;br /&gt;Mem:  74175036k total, 73891608k used,   283428k free,   338324k buffers&lt;br /&gt;Swap: 76218360k total,   155056k used, 76063304k free, 72417572k cached&lt;br /&gt;&lt;br /&gt;  PID  VIRT  RES S %CPU %MEM COMMAND&lt;br /&gt;10345 66128 3492 R 22.6  0.0 ssdeep-n.sh&lt;br /&gt;10346 66128 3452 S 22.3  0.0 ssdeep-n.sh&lt;br /&gt;10322 66128 3456 R 21.9  0.0 ssdeep-n.sh&lt;br /&gt;10336 66120 3440 R 21.9  0.0 ssdeep-n.sh&lt;br /&gt;10337 66124 3464 R 21.9  0.0 ssdeep-n.sh&lt;br /&gt;10343 66128 3488 R 21.9  0.0 ssdeep-n.sh&lt;br /&gt;10330 66128 3464 R 21.6  0.0 ssdeep-n.sh&lt;br /&gt;10342 66128 3444 S 21.6  0.0 ssdeep-n.sh&lt;br /&gt;10351 66120 3476 S 21.6  0.0 ssdeep-n.sh&lt;br /&gt;10326 66132 3476 S 20.9  0.0 ssdeep-n.sh&lt;br /&gt;10340 66124 3448 S 20.9  0.0 ssdeep-n.sh&lt;br /&gt;10329 66120 3452 S 19.9  0.0 ssdeep-n.sh&lt;br /&gt;10302  350m 297m S 11.3  0.4 vortex&lt;br /&gt;    5     0    0 S  0.3  0.0 migration/1&lt;br /&gt;   32     0    0 S  0.3  0.0 migration/10&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Beautiful, isn’t it?&lt;br /&gt;&lt;br /&gt;If run to completion, the multithreaded version finishes in minutes while the single threaded version took hours.&lt;br /&gt;&lt;br /&gt;As should be clear from the above, the -P option specifies the number of child processes to spawn. Typical values range from 2 to a few less than the number of processors in the system for highly computationally expensive analyzers. 
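&lt;br /&gt;&lt;br /&gt;As a rough sketch of that rule of thumb, the snippet below derives a -P value from the processor count. The function name suggest_workers and the default reserve of 4 processors are my own assumptions for illustration; neither is part of vortex or xpipes.&lt;br /&gt;&lt;pre&gt;
```shell
#!/bin/bash
# Hypothetical helper (not part of vortex or xpipes): suggest a -P value
# that leaves a few processors free for vortex itself and the OS.
# Usage: suggest_workers CPU_COUNT [RESERVE]   (RESERVE defaults to 4)
suggest_workers() {
  local cpus=$1
  local reserve=${2:-4}
  local workers=$(( cpus - reserve ))
  # Never go below 2, the low end of the typical range.
  if [ "$workers" -lt 2 ]; then
    workers=2
  fi
  echo "$workers"
}

# Print a suggestion for the machine this runs on.
suggest_workers "$(getconf _NPROCESSORS_ONLN)"
```
&lt;/pre&gt;With 16 processors and the assumed reserve of 4, suggest_workers 16 prints 12, the value used in the example above.&lt;br /&gt;&lt;br /&gt;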
For high latency analyzers you can use quite a few more but there is an arbitrary limit of 1000.&lt;br /&gt;&lt;br /&gt;One of the coolest features of xpipes is that it provides a unique identifier for each child process in the form of an environment variable. For each child process it spawns, xpipes sets the environment variable XPIPES_INDEX to an incrementing integer starting at zero. Furthermore, since the command specified is interpreted by the shell, XPIPES_INDEX can be used in the command. Imagine that instead of using logger to write a log, we want to write directly to a file. If you try something like:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;$ vortex | xpipes -P 8 -c "analyzer &gt; log.txt"&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;You would find that the log file gets clobbered by multiple instances trying to write to the file at the same time. However, you could do the following:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;$ vortex | xpipes -P 8 -c 'analyzer &gt; log_$XPIPES_INDEX.txt'&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;You’d end up with 8 log files, log_0.txt through log_7.txt, which you could cat together if you wanted. Note the single quotes: they keep your interactive shell from expanding XPIPES_INDEX before xpipes ever sees the command. Similarly, if you want to lock each analyzer to a separate core, say cores 2-9, you could do something like the following:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;$ vortex | xpipes -P 8 -c 'taskset -c $(( XPIPES_INDEX + 2 )) analyzer'&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;I think you get the idea. Just having a predictable identifier available to both the interpreter shell and the program opens a lot of doors.&lt;br /&gt;&lt;br /&gt;Note that if you want to specify the command on the command line you can do so with the -c option. This can admittedly get a little tricky at times because of multiple layers of quoting etc. Alternatively, xpipes can read the command to execute from a file. 
For example:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;$ echo 'analyzer "crazy quoted options"' &gt; analyzer.cmd&lt;br /&gt;$ vortex | xpipes -P 8 -f analyzer.cmd&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Those are the basics of parallel processing for NRT IDS the vortex way. So while vortex takes care of all the real time constraints and heavy lifting of network stream reassembly, xpipes takes care of multithreading so all your analyzer has to do is analysis. While vortex’s primary goal has never been absolute performance, I have seen vortex used to perform both computationally expensive and relatively high latency analysis that would break a conventional IDS.&lt;br /&gt;&lt;br /&gt;This largely fulfills the obligation I took on when I started this series of vortex howto articles. I hope this has been helpful to the community. I hope that someone who has read the series would be able to use vortex without too much trouble if a situation ever arose where it was the right tool for the job. &lt;br /&gt;&lt;br /&gt;If there are other topics you would like discussed/explained, feel free to suggest a topic. For example, I’ve considered an article on tuning linux and vortex for lossless packet capture, but I think the README and error messages cover this pretty well. I’ve also considered discussing the details of the performance relevant parameters in vortex and xpipes, but most of these work very well for most situations without any changes.&lt;br /&gt;&lt;br /&gt;Again, I hope this series will help people derive some benefit from vortex. 
I also want to reiterate my acknowledgments to Lockheed Martin for sharing vortex with the community as open source.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-6172214436688355864?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/6172214436688355864/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2010/05/vortex-howto-series-parallel-nrt-ids.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/6172214436688355864'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/6172214436688355864'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2010/05/vortex-howto-series-parallel-nrt-ids.html' title='Vortex Howto Series: Parallel NRT IDS'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-4667679603616690899</id><published>2010-04-27T15:32:00.000-07:00</published><updated>2011-06-22T05:43:15.372-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><category scheme='http://www.blogger.com/atom/ns#' term='snort'/><category scheme='http://www.blogger.com/atom/ns#' term='near real-time IDS'/><title type='text'>Snort Releases Near Real-Time Extension</title><content type='html'>Is that a pig I see flying? 
No, but VRT has released &lt;a href="http://labs.snort.org/nrt/"&gt;a near real-time extension to snort&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;I'm far from the first to discuss it, but figured I had to mention it because so much of the content on this blog has been, and will be, about near real-time network analysis.&lt;br /&gt;&lt;br /&gt;My initial reaction is that I thought the day would never come. It was not too long ago that near real-time IDS was the domain of a few hardcore net defenders who built their own tools. Having built a platform for NRT and seen it used with great success, I can't advocate the technique zealously enough. &lt;br /&gt;&lt;br /&gt;I'm really happy to see Sourcefire making this step toward the paradigm for which a few of us have been clamoring for years. Regardless of the implementation, just recognizing the validity of the paradigm and its value is an important step. Furthermore, the definition of NRT that VRT is using is very similar to the definition I've been using with my colleagues for some time. There seems to be a true understanding of what is being asked for, not just buzzword reflection.&lt;br /&gt;&lt;br /&gt;While I haven't been able to play with it as much as I'd like, I have a few quick comments/thoughts:&lt;br /&gt;&lt;br /&gt;If you have problems with libtool during compilation, delete the ltmain.sh from the unpacked tarball and replace it with the ltmain.sh provided by your distro. This file should be in the libtool package (rpm -ql libtool).&lt;br /&gt;&lt;br /&gt;Other than that little issue, the install was easy for me.&lt;br /&gt;&lt;br /&gt;The documentation is basically non-existent. Browsing through the source code, I got a bit of a feel for what was going on, but I don't fully understand how everything fits together. 
A howto guide, explaining how to do NRT on arbitrary data would be nice, but who am I to complain about poor documentation :)&lt;br /&gt;&lt;br /&gt;One thing that I was surprised to see, however, was an implementation of the pdf parsing routines in C. They utilized other C code written by &lt;a href="http://blog.didierstevens.com/programs/pdf-tools/"&gt;Didier Stevens&lt;/a&gt;, but they didn't use his python implementation of what I think is similar functionality. I believe making use of existing code, with the smallest amount of re-factoring possible, is an important enabler for agility in NRT analysis. After all, in my view, NRT is about taking the detection tools used in other domains and applying them to data extracted passively from the network.&lt;br /&gt;&lt;br /&gt;From what I can see, snort-nrt looks very promising.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-4667679603616690899?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/4667679603616690899/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2010/04/snort-releases-near-real-time-extension.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/4667679603616690899'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/4667679603616690899'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2010/04/snort-releases-near-real-time-extension.html' title='Snort Releases Near Real-Time Extension'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-3908703194463486682</id><published>2010-04-24T17:12:00.000-07:00</published><updated>2011-06-22T05:43:30.044-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><category scheme='http://www.blogger.com/atom/ns#' term='vortex howto'/><category scheme='http://www.blogger.com/atom/ns#' term='near real-time IDS'/><title type='text'>Vortex Howto Series: Near Real-Time IDS</title><content type='html'>This installment of the &lt;a href="http://smusec.blogspot.com/search/label/vortex%20howto"&gt;vortex howto&lt;/a&gt; series will build upon previous installments to demonstrate additional features of vortex relevant to implementing a near-real time IDS.&lt;br /&gt;&lt;br /&gt;Most mainstream IDSs are extremely packet focused. There are many reasons for this, but at least one of them is to support IPS where the “P” is for prevention. The rationale is that to block attacks, one must be able to make a decision on whether to block or pass a packet in a very short period of time. Conventional IDSs focus heavily on efficiency, usually having a very strict C API for analysis modules.&lt;br /&gt;&lt;br /&gt;Vortex supports a very different philosophy. Vortex takes a stream-centric approach. The focus is on supporting analysis on the data traveling through the network, not the mechanism for transporting the data (packets). Vortex doesn’t even try to support preventing attacks but focuses on facilitating deep analysis of network payload data, especially processor intensive or high latency analysis. Vortex has a very flexible API, one which anyone familiar with Linux/Unix will appreciate. 
I think of it as a find command for network payload data.&lt;br /&gt;&lt;br /&gt;For this installment we’re going to improve upon the example provided in the readme. We’re going to use &lt;a href="http://ssdeep.sourceforge.net/"&gt;ssdeep&lt;/a&gt; to do fuzzy hash comparisons against known attack signatures. We’ll call our IDS ssdeep-n. We’re using ssdeep because it’s relatively computationally expensive. Actually, it’s extremely slow. While ssdeep has a very easy to use API, we’re intentionally not going to use it because we want to demonstrate the ability to use vortex to take any Unix command line tool and use it for network analysis.&lt;br /&gt;&lt;br /&gt;So without further ado, here is our analyzer:&lt;br /&gt;&lt;pre&gt;#!/bin/bash&lt;br /&gt;#simple script to run ssdeep on network stream (or any list of files)&lt;br /&gt;#output should be piped to log file or logging system (logger)&lt;br /&gt;&lt;br /&gt;#read one stream file name per line from STDIN&lt;br /&gt;while read -r file&lt;br /&gt;do&lt;br /&gt;#compare the stream against the known signatures&lt;br /&gt;result=`ssdeep -m /etc/ssdeep-n.sigs -b "$file"`&lt;br /&gt;if ! echo "$result" | grep -q matches&lt;br /&gt;then&lt;br /&gt;  #no match: purge the stream file&lt;br /&gt;  rm "$file"&lt;br /&gt;else&lt;br /&gt;  #match: archive the stream and print the alert without the signature file path&lt;br /&gt;  mv "$file" /var/lib/ssdeep-n/hits/&lt;br /&gt;  echo $result | sed 's/ \/etc\/ssdeep-n\.sigs:/ /g'&lt;br /&gt;fi&lt;br /&gt;done&lt;br /&gt;&lt;/pre&gt;You can download it &lt;a href="http://www.csmutz.com/smusec_files/ssdeep-n.sh"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;While contrived and not the most efficient solution, this is sufficiently generalized to be representative of what could be done with basically any Unix command, including those that don’t support multiple files per invocation or situations where you need to capture/parse the output of the command. We execute ssdeep on the stream file provided by vortex and capture the output. We check the captured output for what we find interesting. If we don’t detect a match, we purge the stream. 
If we do detect a match, we archive the stream file to /var/lib/ssdeep-n/hits/ and output an alert, massaging the alert text a small amount.&lt;br /&gt;&lt;br /&gt;For a data set, the &lt;a href="http://www.ddtek.biz/dc17.html"&gt;defcon17 CTF packet captures&lt;/a&gt; will be used. I downloaded the packet captures and used mergecap to combine them back into one pcap with the following properties:&lt;br /&gt;&lt;pre&gt;$ capinfos ctf_dc17.pcap&lt;br /&gt;File name: ctf_dc17.pcap&lt;br /&gt;File type: Wireshark/tcpdump/... - libpcap&lt;br /&gt;File encapsulation: Ethernet&lt;br /&gt;Number of packets: 38994342&lt;br /&gt;File size: 7780760337 bytes&lt;br /&gt;Data size: 7156850841 bytes&lt;br /&gt;Capture duration: 185602.101865 seconds&lt;br /&gt;Start time: Fri Jul 31 13:26:38 2009&lt;br /&gt;End time: Sun Aug  2 17:00:00 2009&lt;br /&gt;Data rate: 38560.18 bytes/s&lt;br /&gt;Data rate: 308481.46 bits/s&lt;br /&gt;Average packet size: 183.54 bytes&lt;br /&gt;&lt;/pre&gt;Closely related to the data set is the signature set we’ll be using. You can download it from &lt;a href="http://www.csmutz.com/smusec_files/ssdeep-n.sigs"&gt;here&lt;/a&gt;. The signature file contains ssdeep hashes for an assortment of attack data, some of which will match against the defcon 17 data set. Fearing to depart too much from the standards set by the security industry, the signature names are painfully useless :)&lt;br /&gt;&lt;br /&gt;Now we’re ready to actually get our near real time IDS to run. Based on the knowledge from some of the previous articles in this series, the following is a good starting point:&lt;br /&gt;&lt;pre&gt;$ vortex -r ctf_dc17.pcap -e -t /dev/shm/ssdeep-n \&lt;br /&gt;-S 1000000000 -C 1000000000 |./ssdeep-n.sh&lt;br /&gt;&lt;/pre&gt;One of the most important vortex options, at least for those of us that care about security, is the -u option. 
Live captures usually require root privileges to open the capture device but we’d like to not run as root any longer than necessary. The -u option tells vortex to drop privileges (setuid) to a non-root user after opening the capture device/file. Changing the command so it can be executed as root, but quickly drops to the user nobody, which has limited permissions, yields the following:&lt;br /&gt;&lt;pre&gt;# vortex -r ctf_dc17.pcap -u nobody -e -t /dev/shm/ssdeep-n \&lt;br /&gt;-S 1000000000 -C 1000000000 | su nobody -c './ssdeep-n.sh'&lt;br /&gt;&lt;/pre&gt;While we aren’t reading from a live interface, we very easily could be. We’re using su so the analyzer also runs with the non-root account.&lt;br /&gt;&lt;br /&gt;Libnids, on which vortex is built, has some statically sized hash tables. In general we want these hash tables to be large enough that they are never filled, but not too much larger than necessary as they consume a fair amount of memory. One of these hash tables is the main connection hash table. Each active connection which vortex is capturing requires an entry in this hash table. When this hash table fills up, vortex ignores additional connections until active connections are closed. The default value of 1M is pretty good, but for demonstrative purposes, we’re going to set this to 2M by using -s 2097152. You will know you need to increase this if you ever have errors of the category “TCP_LIMIT”. Similarly, libnids has a static hash table for IP Frag which can be set with -H. We’ll leave this at the default, but if you have a network where IP frag is actually used routinely, you may want to increase this.&lt;br /&gt;&lt;br /&gt;Vortex doesn’t provide the data to the external analyzer until all the requested data from the stream has been gathered or until the connection has successfully closed. For various reasons, vortex can’t always detect when communication has terminated. 
To prevent connections from being followed indefinitely, even after the connections have been abandoned by one or both ends, the -K option provides a timeout. Note, however, that this timeout is only reset when data is transferred through the connection, not when other possibly valid TCP traffic, such as keepalives, ACKs, etc., is observed. Vortex has an especially hard time detecting the end of many of the connections in the defcon data set we are using, so we definitely need to set this option. In practice, the -K option also helps guard against benign or malicious resource exhaustion. Common settings range from 1s to 3600s. We’ll set this to 600s with -K 600.&lt;br /&gt;&lt;br /&gt;Adding the hash table size options and timeout yields:&lt;br /&gt;&lt;pre&gt;# vortex -r ctf_dc17.pcap -u nobody -s 2097152 -K 600 \&lt;br /&gt;-e -t /dev/shm/ssdeep-n -S 1000000000 -C 1000000000 \&lt;br /&gt;| su nobody -c './ssdeep-n.sh'&lt;br /&gt;&lt;/pre&gt;Another important aspect of running vortex for long periods of time, as you would do with a near-real time IDS, is logging of health/status. By default vortex dumps error and performance stats at program termination, but vortex can be configured to dump this data periodically. The -E and -T options set the reporting interval for error and performance statistics, which are output to syslog and STDERR. We’ll use 3600 for each so we get stats back every hour. The -L option sets the syslog tag so that different instances of vortex can be differentiated from each other. We’ll use -L ssdeep-n.&lt;br /&gt;&lt;br /&gt;One subtle item of note here is that while basically all aspects of vortex timings are based on the time loaded from the packet captures, either live or dead, the periods for error and performance stats logging are implemented in system time (not pcap time). In this example, we’ll see the multi-day packet capture processed in a couple of hours. 
The times from the packet captures, including the -K idle timeout, will be based on pcap time, while the error and stats messages will be based on local system time.&lt;br /&gt;&lt;br /&gt;Adding logging yields the following:&lt;br /&gt;&lt;pre&gt;# vortex -r ctf_dc17.pcap -u nobody -s 2097152 -K 600 \&lt;br /&gt;-e -t /dev/shm/ssdeep-n -E 3600 -T 3600 -L ssdeep-n \&lt;br /&gt;-S 1000000000 -C 1000000000 | su nobody -c './ssdeep-n.sh \&lt;br /&gt;| logger -s -p local0.info -t ssdeep-n'&lt;br /&gt;&lt;/pre&gt;We’re taking the output of ssdeep-n and feeding it to logger such that logs are echoed back to the terminal via STDERR (the -s option) and sent to the system log.&lt;br /&gt;&lt;br /&gt;So now we’re ready to actually run our near real time IDS.&lt;br /&gt;&lt;br /&gt;The results look something like the following:&lt;br /&gt;&lt;pre&gt;Apr 24 15:23:18 localhost ssdeep-n: VORTEX_STATS PCAP_RECV: 0&lt;br /&gt;PCAP_DROP: 0 VTX_BYTES: 0 VTX_EST: 0 VTX_WAIT: 0&lt;br /&gt;VTX_CLOSE_TOT: 0 VTX_CLOSE: 0 VTX_LIMIT: 0 VTX_POLL: 0&lt;br /&gt;VTX_TIMOUT: 0 VTX_IDLE: 0 VTX_RST: 0 VTX_EXIT: 0 VTX_BSF: 0&lt;br /&gt;Apr 24 15:23:18 localhost ssdeep-n: VORTEX_ERRORS TOTAL: 0&lt;br /&gt;IP_SIZE: 0 IP_FRAG: 0 IP_HDR: 0 IP_SRCRT: 0 TCP_LIMIT: 0&lt;br /&gt;TCP_HDR: 0 TCP_QUE: 0 TCP_FLAGS: 0 UDP_ALL: 0 SCAN_ALL: 0&lt;br /&gt;VTX_RING: 0 OTHER: 0&lt;br /&gt;Apr 24 15:27:10 localhost ssdeep-n: tcp-30216-1249077951&lt;br /&gt; -1249079141-i-4425-10.31.8.30:53668c10.31.6.2:1787&lt;br /&gt; matches Command DGB (75)&lt;br /&gt;Apr 24 15:28:20 localhost ssdeep-n: tcp-56998-1249080094&lt;br /&gt; -1249080224-i-155949-10.31.8.30:56248s10.31.5.2:1787&lt;br /&gt; matches Response DGB (93)&lt;br /&gt;Apr 24 15:28:20 localhost ssdeep-n: tcp-56998-1249080094&lt;br /&gt; -1249080224-i-155949-10.31.8.30:56248c10.31.5.2:1787&lt;br /&gt; matches Response DGB (93)&lt;br /&gt;Apr 24 15:28:54 localhost ssdeep-n: tcp-62766-1249080436&lt;br /&gt; -1249080721-i-156483-10.31.8.30:36129s10.31.6.2:1787&lt;br /&gt; matches Response 
DGB (66)&lt;br /&gt;Apr 24 15:34:30 localhost ssdeep-n: tcp-112145-1249083434&lt;br /&gt; -1249083605-i-80684-10.31.8.30:36729s10.31.1.2:1787&lt;br /&gt; matches Response DGB (94)&lt;br /&gt;Apr 24 15:36:25 localhost ssdeep-n: tcp-129781-1249084423&lt;br /&gt; -1249084581-i-80510-10.31.8.30:41222s10.31.10.2:1787&lt;br /&gt; matches Response DGB (94)&lt;br /&gt;Apr 24 16:23:18 localhost ssdeep-n: VORTEX_STATS PCAP_RECV: 0&lt;br /&gt;PCAP_DROP: 0 VTX_BYTES: 374632450 VTX_EST: 370486 VTX_WAIT: 9999&lt;br /&gt;VTX_CLOSE_TOT: 366266 VTX_CLOSE: 0 VTX_LIMIT: 0 VTX_POLL: 0&lt;br /&gt;VTX_TIMOUT: 0 VTX_IDLE: 186789 VTX_RST: 179477 VTX_EXIT: 0 VTX_BSF: 0&lt;br /&gt;Apr 24 16:23:18 localhost ssdeep-n: VORTEX_ERRORS TOTAL: 484&lt;br /&gt;IP_SIZE: 0 IP_FRAG: 0 IP_HDR: 0 IP_SRCRT: 2 TCP_LIMIT: 0 TCP_HDR: 12&lt;br /&gt;TCP_QUE: 470 TCP_FLAGS: 0 UDP_ALL: 0 SCAN_ALL: 0 VTX_RING: 0 OTHER: 0&lt;br /&gt;Apr 24 16:33:29 localhost ssdeep-n: tcp-394718-1249150608&lt;br /&gt; -1249150608-r-2056-10.31.5.5:47377s10.31.3.2:4343&lt;br /&gt; matches Attack ABC (97)&lt;br /&gt;Apr 24 16:33:30 localhost ssdeep-n: tcp-394734-1249150609&lt;br /&gt; -1249150610-r-2568-10.31.5.5:32478s10.31.4.2:4343&lt;br /&gt; matches Attack ABC (97)&lt;br /&gt;...&lt;br /&gt;Apr 24 16:49:00 localhost ssdeep-n: tcp-431134-1249152504&lt;br /&gt;-1249152504-i-2056-10.31.5.5:57596s10.31.2.2:4343&lt;br /&gt; matches Attack ABC (97)&lt;br /&gt;Apr 24 17:23:18 localhost ssdeep-n: VORTEX_STATS PCAP_RECV: 0&lt;br /&gt;PCAP_DROP: 0 VTX_BYTES: 642622346 VTX_EST: 532289 VTX_WAIT: 9999&lt;br /&gt;VTX_CLOSE_TOT: 525121 VTX_CLOSE: 0 VTX_LIMIT: 0 VTX_POLL: 0&lt;br /&gt;VTX_TIMOUT: 0 VTX_IDLE: 269566 VTX_RST: 255555 VTX_EXIT: 0 VTX_BSF: 0&lt;br /&gt;Apr 24 17:23:18 localhost ssdeep-n: VORTEX_ERRORS TOTAL: 713&lt;br /&gt;IP_SIZE: 0 IP_FRAG: 0 IP_HDR: 0 IP_SRCRT: 2 TCP_LIMIT: 0 TCP_HDR: 30&lt;br /&gt;TCP_QUE: 681 TCP_FLAGS: 0 UDP_ALL: 0 SCAN_ALL: 0 VTX_RING: 0 OTHER: 0&lt;br /&gt;...&lt;br /&gt;&lt;/pre&gt;A few of the signatures 
matched, with varying degrees of similarity. Since we’ve archived the matches, we can go examine them. For example, let’s look at one of the very popular “Attack ABC” hits:&lt;br /&gt;&lt;pre&gt;[csmutz@master ~]$ hexdump -v /var/lib/ssdeep-n/hits\&lt;br /&gt;/tcp-431134-1249152504-1249152504-i-2056-10.31.5.5:57596s\&lt;br /&gt;10.31.2.2:4343 | head&lt;br /&gt;0000000 9090 9090 9090 9090 9090 9090 9090 9090&lt;br /&gt;0000010 9090 9090 9090 9090 9090 9090 9090 9090&lt;br /&gt;0000020 7dbf b830 3110 66c9 f0b9 db01 d9d9 2474&lt;br /&gt;0000030 58f4 7831 8310 04c0 7803 9f0c edc5 ba99&lt;br /&gt;0000040 5975 c5cc b196 c5e6 fd66 1d82 fe98 9d72&lt;br /&gt;0000050 0165 5a8d d5e0 9b73 be14 1aee 86eb 0c74&lt;br /&gt;0000060 f715 c888 718e d758 4eca 2858 2b2b b48a&lt;br /&gt;0000070 73a1 f051 83b4 2fa5 1321 e837 0734 0939&lt;br /&gt;0000080 d8c9 f5c6 1e36 1d43 5fc8 f2b3 c55e cb35&lt;br /&gt;0000090 e124 2c38 99d9 bcad 9949 a0ab da68 454b&lt;br /&gt;&lt;/pre&gt;I don’t know much about what is supposed to be going on here, but I do know that starting your conversation off with a NOP sled is, in computer etiquette, not the nicest way to begin. While the example is contrived, we’ve “detected” an attack. We could look more, but I think that’s a sufficient discussion of our results.&lt;br /&gt;&lt;br /&gt;We’ve demonstrated how to use vortex to build a near real-time IDS. While ssdeep is probably not something you’d ever want to run on bare network streams, we’ve shown how easy it is to take basically any Unix command that operates on files, including computationally expensive ones, and apply the same functionality to network streams in near real time. While we used a program written in C with a straightforward API, we could just as easily have used a Perl/Python/Ruby script, a Java program, or even a VB script written for Windows which runs via Mono or Wine. No re-implementation is required to take a valuable detection mechanism and run it on network traffic in near real time. 
I think one of the most valuable things vortex could be used for is doing the type of decoding and/or data extraction that just isn’t possible with mainstream IDS. For example, assuming the signature matching capabilities of Snort aren't good enough for you, what about extracting MS documents from network traffic and running &lt;a href="http://www.snort.org/vrt/vrt-resources/officecat"&gt;officecat&lt;/a&gt; on them? Similarly, if you like Bro-style transaction logs for network protocols, why not extract metadata from PDFs traversing the network with &lt;a href="http://www.accesspdf.com/pdftk/"&gt;pdftk&lt;/a&gt; or &lt;a href="http://blog.didierstevens.com/programs/pdf-tools/"&gt;one of Didier Stevens’ PDF tools&lt;/a&gt;?&lt;br /&gt;&lt;br /&gt;While we’ve run our near real-time IDS from a dead capture file, it could just as easily be done from live capture. Vortex includes some example init scripts that could be used to run vortex in a daemon mode, such as you would need to do for a network sensor. Vortex facilitates the creation of agile and flexible near real-time detection mechanisms.&lt;br /&gt;&lt;br /&gt;As we’ll show in the next installment of this series, vortex removes the real-time constraints inherent in network packet capture from our content analysis. 
Vortex also can be used to take detection mechanisms as we’ve implemented here and scale them across highly parallel systems.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-3908703194463486682?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/3908703194463486682/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2010/04/vortex-howto-series-near-real-time-ids.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/3908703194463486682'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/3908703194463486682'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2010/04/vortex-howto-series-near-real-time-ids.html' title='Vortex Howto Series: Near Real-Time IDS'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-2289396612116155322</id><published>2010-04-20T15:03:00.000-07:00</published><updated>2011-06-22T05:43:45.148-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='opinion'/><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><category scheme='http://www.blogger.com/atom/ns#' term='apt'/><category scheme='http://www.blogger.com/atom/ns#' term='devel'/><title type='text'>Keeping Targeted Attacks Secret Kills R&amp;D</title><content type='html'>I’m 
really impressed with Google’s response to what has been coined Operation Aurora by others. I’m impressed for lots of reasons. I’m impressed because they recognize the value of their intellectual property and when they realized that it was threatened, they took decisive actions to protect their interests. I think it’s sad that so many companies in a similar situation would be so blinded by short-sighted lust for the “emerging market” that they fail to protect themselves and fail to recognize that the same market is far from fair or open. I’m impressed that when they apparently felt that the espionage was backed or at least condoned by the Chinese government, they called them out. Most of all, I’m happy they made this public.&lt;br /&gt;&lt;br /&gt;That being said, I’m not too impressed that Google, and the majority of the computer security industry for that matter, were taken off guard by these attacks. The level of sophistication and determination is not new, nor is the type of data targeted. For the purpose of this article, when I refer to targeted and sophisticated attacks, I’m referring to attacks where one or more attacker groups repeatedly seek to (re-)penetrate an organization’s computer systems for ends specific to the victim organization, typically exfiltration of sensitive information. These attacks are characterized by a high degree of knowledge of the victims, often a high degree of social engineering, adequate technical sophistication, and a high degree of organization/coordination on the part of the attackers. I refrain from using the term advanced persistent threat (APT), because while it has had a fairly precise meaning among the people using the term for some time, the meaning has been blurred quite a bit of late. For the purposes of this article, the specific identities of the attackers, including affiliation or backing by nation-states, are not important. 
A few public reports of these sorts of attacks go back to at least the 2003-2005 timeframe, probably earlier, but that’s when I started paying attention. Maybe the one thing that is new is the type of industry targeted. I think Google should have known it was coming. I’ll bet they had some warnings they chose to ignore, but I guess I can't fault them too much.&lt;br /&gt;&lt;br /&gt;The response by the security industry to these attacks is pitiful. Many people recognize that the state of the art, including mainstream enterprise security tools, can’t stop, let alone detect, this sort of activity. While there are a few valiant incident responders who have been dealing with sophisticated targeted attacks for some time, many with a good deal of success, the security vendors have basically ignored their pleas and ideas for improved security tools. I’ve heard vendors say “You don’t want to do that” and “the market for that isn’t big enough for us to implement it”.&lt;br /&gt;&lt;br /&gt;What has to happen for the security industry to realize they need to deal with sophisticated targeted attacks? First, organizations need to realize the value of their intellectual property. Second, they need to realize that it’s at risk. I think most organizations are at this point. Third, they need to realize that conventional security wisdom, practices, and tools won’t protect them against this class of attacker, which for some people is new. Unfortunately, all too often, this epiphany only comes after personal and painful experience. Fourth, enough people need to start demanding effective solutions that vendors feel compelled to deliver them and academia recognizes the problems that need researching. Lastly, the solutions (a capable workforce, processes and practices, technology, etc.) need to be developed.&lt;br /&gt;&lt;br /&gt;While there are many hindrances, one of the biggest obstacles to effectively dealing with targeted attacks is silence. 
While this class of attack is far from new, basically no one talks about it. While there are plenty of examples of good public documentation of sophisticated attacks, e.g. &lt;a href="http://www.businessweek.com/magazine/content/08_16/b4080032218430.htm?chan=magazine+channel_top+stories"&gt;Businessweek E-espionage threat&lt;/a&gt;, &lt;a href="http://www.uscc.gov/researchpapers/2009/NorthropGrumman_PRC_Cyber_Paper_FINAL_Approved%20Report_16Oct2009.pdf"&gt;NG’s report on Chinese Espionage&lt;/a&gt;, and &lt;a href="http://blogs.sans.org/computer-forensics/2010/01/25/m-trends-the-advanced-persistent-threat/"&gt;Mandiant M-trends&lt;/a&gt;, basically no one credible steps up and confirms the validity of the data, leaving many to dismiss these reports as sensational journalism, conspiracy theories, and marketing hype. Based on solid public data, I guess I don’t blame people for questioning the reality of this threat until they experience it personally.&lt;br /&gt;&lt;br /&gt;This code of silence related to compromises is very detrimental to solving the problem through the various available avenues: political/diplomatic, legal, and security systems including technology and people. There are a lot of reasons offered for not broadcasting your status as a victim of a sophisticated attack and/or the type of details required to help prevent future occurrences. Most of them I wouldn’t agree with, especially if everyone in the same industry/sector is in the same boat and you all know it. One of the few legitimate reasons to keep details of these attacks secret is that defending against persistent attackers is best achieved through an attacker-focused or security-intelligence-driven approach. But how long is your threat intelligence still useful? Surely keeping specific attack data secret past a year or two doesn’t buy you much in terms of security intelligence, as the most aggressive attackers change tactics and techniques more frequently than this. 
Hopefully it doesn't reveal too much about your capabilities either, as they need to be evolving that quickly also. Does acknowledging you’ve been attacked after your incident response is finished, or at least well under way, buy you anything in terms of threat intelligence? I don’t think so. I admire Google for going public and doing something about it. I’m happy to see some public details, but more details and official acknowledgement from Google would be nice. Sadly, Google is right when they say they’ve already been more open than most others in the industry.&lt;br /&gt;&lt;br /&gt;The organizations that keep targeted attacks and the details of them secret are part of the problem, or at the very least, aren’t doing everything necessary to help solve the problem. I think it’s a little hypocritical for organizations to complain about the security industry and academia not addressing this class of threat when no one will talk about the problem publicly with the requisite level of certainty and specificity.&lt;br /&gt;&lt;br /&gt;Focusing on security R&amp;amp;D, there are a few things I think need to happen before the security tools industry and academia can start to address targeted attacks. The people doing R&amp;amp;D need to know what types of attacks are actually occurring, they need to understand the importance of a threat-focused response model, and they need some decent data.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Understanding the Targeted Attack Scenario&lt;/h3&gt;&lt;br /&gt;One of the major problems with current academic and applied research is that most researchers don’t understand the basics of a highly targeted attack scenario. They don’t know how serious the problem is. If you tell an academic that the sky is falling because of targeted attacks and give them a high level overview, they’ll either yawn or laugh at you. 
Case in point, the following hypothetical conversation:&lt;br /&gt;&lt;br /&gt;Boots on Ground Responder: &lt;span style="font-style: italic;"&gt;We’ve got to do something about these highly socially engineered spear-phishing attacks!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Heads in Clouds Researcher: &lt;span style="font-style: italic;"&gt;If you graph the social network, how many nodes away is the sender from the recipient?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Boots on Ground Responder: &lt;span style="font-style: italic;"&gt;Uh, 1. Sometimes 2. Sometimes more, it depends.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Heads in Clouds Researcher: &lt;span style="font-style: italic;"&gt;Ok, what about the malware? Rootkit? Polymorphism? Any Red pill/Blue pill?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Boots on Ground Responder: &lt;span style="font-style: italic;"&gt;In this case nothing like that. Just simple malware that provides a minimal backdoor. Malware isn’t even packed.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Heads in Clouds Researcher: &lt;span style="font-style: italic;"&gt;Ok, this stuff isn’t being detected by your AV, IDS, etc. but it’s still making it through firewalls, proxies, etc. Any interesting data hiding techniques?&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Boots on Ground Responder: &lt;span style="font-style: italic;"&gt;No, not really. Malware evades AV because it’s never been seen before. In cases where they need to evade our IDS, they use trivial obfuscation like Caesar ciphers. Usually though, they just hide in plain sight.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Heads in Clouds Researcher: &lt;span style="font-style: italic;"&gt;Doesn’t sound too interesting to me. Just patch your systems and tell your users not to click on unsolicited email.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Boots on Ground Responder: &lt;span style="font-style: italic;"&gt;Yeah, right. Still, we see repeated patterns in all of these attacks. 
I can’t give you details, but there’s got to be a way to catch these guys.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Heads in Clouds Researcher: &lt;span style="font-style: italic;"&gt;Ok, well I’m going to go back to musing on the trusting trust problem…&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The sad part is there are some really interesting problems, true academic problems, but for the most part, academia isn’t seeing them. I don’t think it’s because academia isn’t trying to find good problems to solve; I think it’s because the interesting details aren’t being shared.&lt;br /&gt;&lt;br /&gt;Researchers need to learn how different targeted attacks are from opportunistic attacks. They need to understand how the goals and methods differ. They need to understand how different the targeting mechanisms are. They need to understand how valuable an intelligence-driven response model is. However, they won’t learn it until someone shows them.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Supporting Threat Focused Response&lt;/h3&gt;&lt;br /&gt;So much conventional security wisdom and basically all academic research takes a vulnerability-focused approach. The focus is on detecting and mitigating individual attacks, not persistent campaigns comprising a series of attacks. That’s the best approach for many classes of attacks, but isn’t the best if determined attackers continue attacking the same target over and over again. So many other people have spoken on this topic that I’ll defer to them and steer my ramblings toward application of these principles to security tool development. For the reader’s reference, I recommend this &lt;a href="http://www.visiblerisk.com/podcast/"&gt;podcast&lt;/a&gt; by some of the thought leaders in this realm. If what they are saying is news to you, check out their blogs, etc.&lt;br /&gt;&lt;br /&gt;People doing security R&amp;amp;D have to learn about intelligence-driven incident response. While some products support this approach, almost none fully embrace it. 
Even worse, academia is basically mute on the topic.&lt;br /&gt;&lt;br /&gt;One aspect of a threat-focused response model that is very important for security R&amp;amp;D is prioritization of response. While I have seen some products and research that recognize the importance of prioritization based on the vulnerability/exploit, basically no security R&amp;amp;D addresses prioritization based on intelligence or attacker identity. Given the following choice, which would you rather detect/block: a stealthy rootkit installed by a botnet for the purpose of identity theft/fraud, or an email containing a link to an exploit which, when visited, gives a sophisticated attacker user-level access to the compromised computer? Most academics and many in the security industry would take the former because of impact on the system, but a small group of security professionals will lean hard towards the latter because of impact to the organization’s overall mission.&lt;br /&gt;&lt;br /&gt;Another important aspect of threat-focused response is the relative importance of prevention and detection. For an intelligence-driven response model, detection is king, and prevention is a distant second. In fact, in some cases, it might actually be beneficial to not mitigate attacker activity if the attack is or will be mitigated further in the attack sequence (or &lt;a href="http://blogs.sans.org/computer-forensics/2009/10/14/security-intelligence-attacking-the-kill-chain/"&gt;kill chain&lt;/a&gt;) and if blocking the attack prevents collection of further threat intelligence (e.g. a firewall block). On the flip side, being able to detect an attack, even if it wasn’t or couldn’t be blocked, is imperative. If you look at the bigger picture, being able to block an attack is always the best, but if you can’t or didn’t detect it in real time, detecting it in near real time is often almost as good. 
While many don’t appreciate it, being able to do historical detections, or understanding how intrusions started, including attacker activity preceding the actual attack, is also important to an intelligence-driven response.&lt;br /&gt;&lt;br /&gt;Lastly, analysis of unsuccessful attacks after the fact is almost ignored by conventional tools and research. However, successful incident responders know the importance of analyzing unsuccessful attacks and developing mitigations across all facets of the attack sequence.&lt;br /&gt;&lt;br /&gt;People doing security R&amp;amp;D have to learn to build features supporting threat intelligence into their tools and research.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Irrelevant Data Supports Irrelevant Research&lt;/h3&gt;&lt;br /&gt;One of the biggest hurdles to overcome for basically any sort of research is obtaining good data. The relative dearth of data related to targeted attacks kills research. If you were a researcher, would you choose a problem for which there is no public data? How could you? Even if you are doing more applied R&amp;amp;D, getting good data isn’t so easy.&lt;br /&gt;&lt;br /&gt;There are a couple of approaches to getting data for research: you can either gather the data for yourself, or you can use someone else’s data, usually a public data set. The problem with gathering the data yourself is that most researchers will never be able to gather data on targeted attacks. By their very nature, traditional computer security collection mechanisms such as honeypots, honey monkeys, etc. will not normally ever see a targeted attack, and definitely not a persistent campaign of targeted attacks. Even the researchers and vendors that do end up seeing samples representing one phase of targeted attacks, say malware, don’t see the full attack lifecycle. 
How can you address all phases of the attack if you only see one?&lt;br /&gt;&lt;br /&gt;There are good public data sets and there are some that aren’t so great; however, it seems that once a reasonably valid data set is used, it gets used over and over again. I admire folks who put together quality data sets for the community. One infamous example in the realm of incident detection is the &lt;a href="http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/data/index.html"&gt;DARPA 99 Intrusion Detection Evaluation dataset&lt;/a&gt;. While probably a decent data set at the time, and while memories of winnuke, etc. may well be indelibly seared into the minds of some cyber war horses, these sorts of attacks are about as far from targeted attacks as you can get. DARPA 99 has been used and abused for a long time, but people still use it! Why? There aren’t many other options for public data sets. Other decent options for some types of research include packet captures from events like the &lt;a href="http://ddtek.biz/"&gt;Defcon CTF&lt;/a&gt; and &lt;a href="http://www.itoc.usma.edu/research/dataset/index.html"&gt;NSA/West Point Competition&lt;/a&gt;, but these events are by their very nature very poor sources for persistent and highly targeted attacks.&lt;br /&gt;&lt;br /&gt;While it will be necessary to develop good data sets involving targeted attacks, it’s going to be a hard effort. First, to demonstrate a persistent attacker, you need months, even years of data. As attacks have moved up the protocol stack and have become incredibly personalized, sanitizing data is going to be a lot more difficult than scrubbing IP addresses and hostnames. To truly address targeted attacks, tools will have to be configured with information about the data and people using the computer systems (not just the computer systems themselves). 
What that means for researchers is that to understand the significance of a targeted attack, you have to understand the targeted organization and targeted individuals. Lastly, as incident responders know, to be effective, data needs to be integrated from all phases of the attack and come in all sorts of formats: logs, netflow or packet captures, malware, etc. It’s clear that a perfect public data set for targeted attacks will never exist, but organizations can make steps by releasing older data.&lt;br /&gt;&lt;br /&gt;While I doubt that any quality public data sets will be coming soon, organizations need to learn the value of collecting an internal data set. By nature, incident responders aren’t always the most disciplined at things like collecting and labeling data for historical purposes, especially considering the conditions in which they operate. Regardless, a little bit of effort to compile historical attack data for future reference, including labeling of data, pays huge dividends both in responding to future attacks and providing good training/test data for new tools.&lt;br /&gt;&lt;br /&gt;Keeping quiet about sophisticated targeted attacks kills, among other things, intelligence-driven tool R&amp;amp;D. For the technology to catch up with the threat, the problem needs to be discussed publicly and more details need to be shared. Publicly sharing attack information is critical to the research and development required to catch up technologically with sophisticated attacks. 
If the code of silence isn't broken, incident responders will continue to flounder with mainstream security tools while security tool vendors will continue to have watershed moments.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-2289396612116155322?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/2289396612116155322/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2010/04/keeping-targeted-attacks-secret-kills-r.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/2289396612116155322'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/2289396612116155322'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2010/04/keeping-targeted-attacks-secret-kills-r.html' title='Keeping Targeted Attacks Secret Kills R&amp;D'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-1690424697463388102</id><published>2010-04-03T11:45:00.000-07:00</published><updated>2011-06-22T05:44:02.115-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='technology'/><category scheme='http://www.blogger.com/atom/ns#' term='vortex howto'/><title type='text'>Vortex Howto Series: Network Forensics</title><content type='html'>In my &lt;a 
href="http://smusec.blogspot.com/2010/03/vortex-howto-series-network.html"&gt;last installment&lt;/a&gt; in the vortex howto series, I showed how to use the most basic features of vortex to build a network surveillance tool. In this post, I will demonstrate more features of vortex through the example of an exercise in network forensics.&lt;br /&gt;&lt;br /&gt;As stated in the first article, the primary purpose of these howtos is to demonstrate how to use vortex to perform various tasks. I’ll go out of my way to explain some of the capabilities and features of vortex, as many of them aren’t particularly intuitive. In the course of doing so, I’ll compare and contrast vortex to some of the other tools out there. While it will be clear that not much effort is being invested in building the tools demonstrated in this series, the tools should be just interesting enough to demonstrate the type of thing that could be done in conjunction with vortex. Lastly, most of the data analyzed in this series is admittedly lame.&lt;br /&gt;&lt;br /&gt;Our goal in this installment will be to use vortex for network forensics. More specifically, we’re going to be doing forensic analysis for a web site that was attacked. In this case, we’re going to be looking at a password guessing attack, but the same techniques would be useful for other attacks such as SQL injection, other protocols tunneled over HTTP, etc.&lt;br /&gt;&lt;br /&gt;To further clarify our goals, let’s assume the fact that the attack occurred is known already. What needs to be done is to dissect the attack. We need to understand what the attacker did, how it was done, and what the result was. While the type of data you would collect and how you would report it depends largely upon your goal (legal prosecution, damage assessment, or security intelligence), we’ll take a relatively general approach.&lt;br /&gt;&lt;br /&gt;Good public attack data is hard to find. 
The data for this installment comes from a live production network from which I was able to obtain this packet trace. It is available for download &lt;a href="http://www.csmutz.com/smusec_files/net_4n6_data.pcap"&gt;here&lt;/a&gt;. I’ve taken all data out except for the data relevant to the single attack we will investigate. Specifically, the web site attacked is a wordpress blog at &lt;a href="http://www.elderhaskell.com/"&gt;http://www.elderhaskell.com&lt;/a&gt;. The attacker’s address is 193.226.51.2, about which I know very little, and care even less. Let’s pretend we were notified of a potential attack and asked to investigate. For the sake of simplicity, let’s say all we know is that a potential attack occurred; no additional information or data will be given (e.g. server logs, IDS alerts, etc.). Unfortunately, this sort of engagement is all too common in the realm of network forensics.&lt;br /&gt;&lt;br /&gt;At this point most sane people would use wireshark/tshark to get a quick look at the data. Ok, here’s a screenshot from wireshark.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_g1XmJJW8J_g/S7eQSkakb0I/AAAAAAAAAAM/BvXpFpPSPsg/s1600/net_4n6_screenshot.jpg"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 287px; height: 320px;" src="http://4.bp.blogspot.com/_g1XmJJW8J_g/S7eQSkakb0I/AAAAAAAAAAM/BvXpFpPSPsg/s320/net_4n6_screenshot.jpg" alt="" id="BLOGGER_PHOTO_ID_5455988122269806402" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;There seems to be a pattern of GETs followed by POSTs for the login page. Looking at a few of the login attempts, the attacker appears to be trying to guess credentials for the site. Were any of the login attempts successful? Were these attempts manual or automated? 
What was the sequence of events and timings for the various transactions?&lt;br /&gt;&lt;br /&gt;While all this information could be extracted and compiled from wireshark/tshark, this would be a very manual process. I script things. Furthermore, since the whole point of this blog series is to use vortex, I guess we’d better use it.&lt;br /&gt;&lt;br /&gt;Before we extract the streams, let’s look at a few more of the vortex options.&lt;br /&gt;&lt;br /&gt;One really important option to understand is the -k option. Why would you ever want to “disable libNIDS TCP/IP checksum processing”? This is useful in cases where legitimate traffic has invalid TCP checksums, usually because of an artifact of the capture mechanism. One of the most common reasons for this is that packet captures are performed on the same machine as the client or the server and the packet capture libraries don’t have a view of packets with valid TCP checksums. This happens, for instance, on Linux when the kernel, instead of performing the checksum calculation itself, offloads the calculation to the network card, which occurs after the point in the TCP/IP stack where the packets are captured. Anyhow, if you are trying to analyze a pcap or perform live analysis with systemically invalid checksums (often 0), try the -k option. Since checksums are rarely legitimately bad, this should have minimal adverse impact in most situations even though disabling all TCP checksum checks may be more than is absolutely necessary. If you aren’t doing a live capture, another good option is to use &lt;a href="http://tcpreplay.synfin.net/wiki/tcprewrite"&gt;tcprewrite&lt;/a&gt;, specifically the --fixcsum option.&lt;br /&gt;&lt;br /&gt;While not particularly relevant in this case because the pcap has already been filtered to only contain attack relevant traffic, understanding vortex filtering mechanisms is important. Let’s say, for example, we started with a larger pcap that had traffic to and from other clients and servers. 
To continue the example, let’s say we want to keep only traffic going to a web server with an IP of 192.168.1.1 and running on port 80. You could use a filter expression like “host 192.168.1.1 and tcp port 80” to create a BPF but there are a few problems with this:&lt;br /&gt;&lt;br /&gt;Problem #1: IP frag. While not particularly common, it’s out there and can even make you vulnerable to evasion if you aren’t careful. LibNIDS, on which vortex is based, goes to great lengths to accurately reassemble network traffic without introducing this sort of loophole. The following is taken directly from the libNIDS documentation:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;filters like ''tcp dst port 23'' will NOT correctly handle appropriately fragmented traffic, e.g. 8-byte IP fragments; one should add "or (ip[6:2] &amp;amp; 0x1fff != 0)" at the end of the filter to process reassembled packets.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Parenthetically, any filter with “src” or “dst” alone will likely break libNIDS, and therefore vortex, which requires seeing both sides of a conversation, unlike some other IDS systems. However, the filter “tcp port 23” is also vulnerable to IP frag evasion as described above.&lt;br /&gt;&lt;br /&gt;Problem #2: Packet Filtering isn’t Stream Filtering. Even in the absence of other complications like IP frag, you still might find the filter expression above a little imprecise. While it may sound a little out there, what if 192.168.1.1 connects as a client using port 80 to a server on port 25 on 10.1.1.1? If you used the filter above, you’d also pick up these connections, which might just confuse you in your analysis. 
While you could further convolute your BPF with an expression such as “(dst host 192.168.1.1 and dst port 80) or (src host 192.168.1.1 and src port 80)”, vortex, via libBSF, provides a better way: stream filtering semantics.&lt;br /&gt;If vortex is compiled with support for &lt;a href="http://sourceforge.net/projects/vortex-ids/files/libbsf"&gt;libBSF&lt;/a&gt;, then the -g and -G options are available. These are analogous to -f and -F except that instead of compiling a BPF, a BSF is compiled. The BSF is applied to each stream as it is established. For the example above, we could use a BSF such as “svr host 192.168.1.1 and svr port 80”, which makes it very clear what streams we are looking for. However, since vortex has to do a lot more work to apply a filter to streams than to apply a packet based filter, and since filtering often occurs in an external system that doesn’t know BSF, a BPF or other packet filter is often used in front of a BSF. For example, we could use something like “host 192.168.1.1 and (tcp port 80 or (ip[6:2] &amp;amp; 0x1fff != 0))” as a packet filter in addition to the BSF above.&lt;br /&gt;&lt;br /&gt;One other option that should be mentioned is -v. The -v option outputs empty streams. Why would you want to do that? If you ask vortex to provide both to-server and to-client streams, it will always give you the to-server stream then the to-client stream. This pairing and ordering is guaranteed, except in the case that one of the simplex streams is empty but the other half of the conversation is not. By default, empty simplex streams in an active (albeit one-sided) conversation are not output. Imagine you have an analyzer that expects both files. Some TCP streams may only have one file, which may throw your processing off. The -v option rectifies this, ensuring to-server and to-client streams are always paired by creating empty files when necessary.&lt;br /&gt;&lt;br /&gt;Probably the most important option for the task at hand is -e. 
The -e option causes quite a bit more metadata to be put in the filenames. Files go from looking like 10.1.1.1:1954s172.16.1.1:80 to tcp-1-1229100756-1229100756-c-390-10.1.1.1:1954s172.16.1.1:80. The readme provides good information on how to decode this metadata:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;{proto}-{connection_serial_number}-{connection_start_time}-{connection_end_time}-{connection_end_reason}-{connection_size}-{client_ip}:{client_port}{direction}{server_ip}:{server_port}&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;We’ll be using some of this extended metadata for this task, namely the serial number and timestamps. This extended metadata is one clear reason why you would use vortex over something like &lt;a href="http://www.circlemud.org/%7Ejelson/software/tcpflow/"&gt;tcpflow&lt;/a&gt; for this type of task. While it might sound far-fetched, I’ve run into situations where in a short space of time, the tcp quads were repeated and output files got clobbered by this. One other thing worth noting is that the connection_size metadata is the size of the data collected from both flows, and as such, the only difference in the filename for the to server and to client flows is the single character direction flag which is either “s” or “c”.&lt;br /&gt;With that background instruction, let’s extract the flows:&lt;br /&gt;&lt;pre&gt;$ mkdir streams&lt;br /&gt;$ vortex -r net_4n6_data.pcap -v -e -t streams&lt;br /&gt;Couldn't set capture thread priority!&lt;br /&gt;streams/tcp-1-1266678719-1266678721-c-2916-193.226.51.2:16118s&lt;br /&gt;66.173.221.158:80&lt;br /&gt;streams/tcp-1-1266678719-1266678721-c-2916-193.226.51.2:16118c&lt;br /&gt;66.173.221.158:80&lt;br /&gt;...&lt;br /&gt;streams/tcp-92-1266678802-1266678803-c-3460-193.226.51.2:26574s&lt;br /&gt;66.173.221.158:80&lt;br /&gt;streams/tcp-92-1266678802-1266678803-c-3460-193.226.51.2:26574c&lt;br /&gt;66.173.221.158:80&lt;br /&gt;VORTEX_ERRORS TOTAL: 0 IP_SIZE: 0 IP_FRAG: 0 IP_HDR: 0 IP_SRCRT&lt;br /&gt;: 0 TCP_LIMIT: 0 
TCP_HDR: 0 TCP_QUE: 0 TCP_FLAGS: 0 UDP_ALL: 0&lt;br /&gt;SCAN_ALL: 0 VTX_RING: 0 OTHER: 0&lt;br /&gt;VORTEX_STATS PCAP_RECV: 0 PCAP_DROP: 0 VTX_BYTES: 296501 VTX_ES&lt;br /&gt;T: 92 VTX_WAIT: 0 VTX_CLOSE_TOT: 92 VTX_CLOSE: 92 VTX_LIMIT: 0&lt;br /&gt;VTX_POLL: 0 VTX_TIMOUT: 0 VTX_IDLE: 0 VTX_RST: 0 VTX_EXIT: 0 VT&lt;br /&gt;X_BSF: 0&lt;/pre&gt;&lt;br /&gt;Before we continue, a little explanation of the output at the end is in order. The ERRORS and STATS printouts show various error counts and statistics. Paying attention to these is a good thing to do. The README provides details of what these mean, and vortex provides hints in many cases where a certain class of error is strongly indicative of a possible problem. Zero errors is always a good thing. Just like tcpdump, and almost all pcap based apps for that matter, vortex doesn’t report packet received/dropped counts for dead captures, so PCAP_RECV and PCAP_DROP read 0 here. The VTX_EST: 92 tells us that there were 92 TCP connections monitored, and the VTX_CLOSE: 92 tells us that all 92 have been closed with a normal TCP close (FIN/ACK business).&lt;br /&gt;&lt;br /&gt;Ok, let’s get down to some real forensics.&lt;br /&gt;&lt;br /&gt;Since this post is already long, I’m not going to include my complete analysis notes, but if you’d like to view them, they are &lt;a href="http://www.csmutz.com/smusec_files/net_4n6_notes.txt"&gt;here&lt;/a&gt;. In the course of looking at the data, I developed a script to summarize the requests and responses. 
It is as follows:&lt;br /&gt;&lt;pre&gt;#!/bin/bash&lt;br /&gt;&lt;br /&gt;while read line&lt;br /&gt;do&lt;br /&gt; id=`echo $line | awk -F- '{ print $2 }'`;&lt;br /&gt; timestamp=`echo $line | awk -F- '{ print $4 }'`;&lt;br /&gt; time=`date +%H:%M:%S -d @$timestamp`;&lt;br /&gt; date=`date -d @$timestamp`;&lt;br /&gt; action=`head -n 1 $line | awk '{ print $1" "$2}'`;&lt;br /&gt; req_digest=`grep -v -E "^(Content-Length|log=)" $line | \&lt;br /&gt;md5sum | head -c 6`;&lt;br /&gt; resp_digest=`echo $line | sed s/s/c/ | xargs grep \&lt;br /&gt;-v -E "^(Date|Last-Modified)" | md5sum | head -c 6`;&lt;br /&gt; creds=`grep -E "^log=" $line | awk -F'&amp;amp;' '{ print \&lt;br /&gt;$1" "$2 }' | sed -r 's/(log=|pwd=)//g'`;&lt;br /&gt; echo "$id $time $action $req_digest $resp_digest $creds";&lt;br /&gt;done&lt;/pre&gt;&lt;br /&gt;When executed it creates a summary as follows:&lt;br /&gt;&lt;pre&gt;$ ls tcp*s* | sort -k 2 -g -t- | ./summarize.sh&lt;br /&gt;1 10:12:01 GET /wp-login.php 17beda 1be989&lt;br /&gt;2 10:12:02 POST /wp-login.php 3f8d0a 0f713b admin admin&lt;br /&gt;3 10:12:03 GET /wp-login.php 43ab32 1be989&lt;br /&gt;4 10:12:04 POST /wp-login.php 3f8d0a 0f713b admin simple1&lt;br /&gt;5 10:12:05 GET /wp-login.php 43ab32 1be989&lt;br /&gt;6 10:12:06 POST /wp-login.php 3f8d0a 0f713b admin password&lt;br /&gt;7 10:12:07 GET /wp-login.php 43ab32 1be989&lt;br /&gt;8 10:12:08 POST /wp-login.php 3f8d0a 0f713b admin 123456&lt;br /&gt;9 10:12:08 GET /wp-login.php 43ab32 1be989&lt;br /&gt;10 10:12:09 POST /wp-login.php 3f8d0a 0f713b admin qwerty&lt;br /&gt;11 10:12:10 GET /wp-login.php 43ab32 1be989&lt;br /&gt;12 10:12:11 POST /wp-login.php 3f8d0a 0f713b admin abc123&lt;br /&gt;...&lt;br /&gt;89 10:13:20 GET /wp-login.php 43ab32 1be989&lt;br /&gt;90 10:13:21 POST /wp-login.php 3f8d0a 98ee8f wp-admin wp_password&lt;br /&gt;91 10:13:22 GET /wp-login.php 43ab32 1be989&lt;br /&gt;92 10:13:23 POST /wp-login.php 3f8d0a 98ee8f wp-admin wpadmin&lt;/pre&gt;&lt;br /&gt;The first 
column is the stream number. The second column is the time. The third column is the HTTP method (GET or POST). The fourth is the HTTP resource, which is the same for all activity. The fifth column is a digest (first 6 chars of md5) that was made of the request stream, with the variable data such as the Content-Length header and the form data (credentials) removed. Similarly, the sixth column is a digest of the response, minus the Date and Last-Modified headers. These digests allow us to quickly see which requests/responses are the same so that we can manually inspect the few unique requests and responses.&lt;br /&gt;&lt;br /&gt;A quick analysis of the summaries and the unique requests/responses shows that the attacker followed a set pattern of a GET followed by a POST. The attacker tried a list of 23 passwords for each of the two usernames: admin and wp-admin. None of the attempts to log in were successful.&lt;br /&gt;&lt;br /&gt;It’s highly likely this was an automated attack. It’s also likely it wasn’t particularly targeted, as it appears the same attacker was hitting other websites at approximately the same time. For example, note the following entry from the log at &lt;a href="http://northstarlearning.org/logs/access_100222.log"&gt;http://northstarlearning.org/logs/access_100222.log&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;193.226.51.2 - - [20/Feb/2010:20:33:28 -0800] "GET /logs/access_091214.logwp-login.php HTTP/1.1" 404 - "http://www.google.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MRA 5.1 (build 02228); .NET CLR 1.1.4322; InfoPath.2; .NET CLR 2.0.50727)" "northstarlearning.org"&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Usually forensics involves some sort of report. 
We’ve made a report in the form of an interactive timeline of the attack using &lt;a href="http://simile.mit.edu/timeline/"&gt;SIMILE Timeline&lt;/a&gt;.&lt;br /&gt;&lt;script src="http://www.csmutz.com/smusec_files/simile_helper.js" type="text/javascript"&gt;&lt;/script&gt;&lt;br /&gt;&lt;script src="http://simile.mit.edu/timeline/api/timeline-api.js" type="text/javascript"&gt;&lt;/script&gt;&lt;br /&gt;&lt;div id="my-timeline" style="height: 350px; border: 1px solid rgb(170, 170, 170);"&gt;&lt;/div&gt;&lt;br /&gt;&lt;script src="http://www.csmutz.com/smusec_files/simile_main.js" type="text/javascript"&gt;&lt;/script&gt;&lt;br /&gt;If the timeline doesn't appear, or to view it at full width, try &lt;a href="http://www.csmutz.com/smusec_files/simile.html"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The timeline shows the stream number, the HTTP method, and the username and password, if applicable. Clicking on each event summary brings up more details: the TCP parameters, request and response hashes (minus variable data as mentioned above), and the timestamp.&lt;br /&gt;&lt;br /&gt;While the attack we analyzed was neither particularly special nor interesting, I hope it is clear how vortex could be applied to other situations. For example, if tracking a sophisticated and persistent attacker, much information could be extracted to collect security intelligence, aiding in a threat-focused defense. For a more traditional vulnerability-focused security approach, there is much information that could be used to drive future mitigations.&lt;br /&gt;&lt;br /&gt;I’ve demonstrated how to use vortex to perform network forensics. One clear advantage that this approach has over manually inspecting every packet in wireshark/tshark is scalability. We can easily process large amounts of data using simple scripting. 
While tshark supports this sort of approach for a large set of protocols by letting the user select fields to display, if one needs to analyze protocols or payload data not supported by tshark, using vortex and an external analyzer is often a wise approach. While I’ve created a pretty lame shell script in this example, many would use more powerful programming languages and their associated repositories of protocol parsing code to have simple access to the data. For example, using perl and HTTP::Parser would make sense for this sort of thing. Vortex has some small but significant advantages over the likes of tcpflow because of features such as the extended metadata.&lt;br /&gt;&lt;br /&gt;In future installments in this series on how to use vortex, we’ll show how to use vortex to perform near real-time intrusion detection on a live network and then we’ll show how to do deep content analysis in a highly scalable manner suitable for today’s highly parallel general purpose systems.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-1690424697463388102?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/1690424697463388102/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2010/04/vortex-howto-series-network-forensics.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/1690424697463388102'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/1690424697463388102'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2010/04/vortex-howto-series-network-forensics.html' title='Vortex Howto Series: Network Forensics'/><author><name>Charles 
Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_g1XmJJW8J_g/S7eQSkakb0I/AAAAAAAAAAM/BvXpFpPSPsg/s72-c/net_4n6_screenshot.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-493487845419870400</id><published>2010-03-22T20:38:00.000-07:00</published><updated>2010-04-03T13:31:05.182-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='vortex howto'/><title type='text'>Vortex Howto Series: Network Surviellance</title><content type='html'>While my last post was very high level, clearly in the realm of pontification, I’d like to come down to the other extreme and present a series of very technical howtos related to &lt;a href="http://sourceforge.net/projects/vortex-ids/"&gt;vortex&lt;/a&gt;, a utility for analysis of TCP stream data.&lt;br /&gt;&lt;br /&gt;Throughout the series I’d like to demonstrate the use of vortex through the following examples:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;How to use vortex to build a network surveillance tool&lt;br /&gt;&lt;/li&gt;&lt;li&gt;How to use vortex to build a near real-time deep content analysis IDS&lt;br /&gt;&lt;/li&gt;&lt;li&gt;How to use vortex as a network forensics tool&lt;br /&gt;&lt;/li&gt;&lt;li&gt;How to use vortex (and xpipes) to do the above in a highly scalable manner including leveraging highly parallel processors&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;In demonstrating these uses of vortex, my focus will be to explain through example the non-intuitive aspects of vortex and to show in rough proof-of-concept form what vortex can be used to do. 
I’ll also try to compare vortex to other tools to show where vortex adds value and where you’re better off using something else.&lt;br /&gt;&lt;br /&gt;Before I begin, I’d like to refer the reader to another blog, &lt;a href="http://securityfu.blogspot.com/"&gt;securityfu&lt;/a&gt;, which has a nice introduction to vortex entitled &lt;a href="http://securityfu.blogspot.com/2010/02/vortex-ids-get-super-snagadocious-on.html"&gt;Vortex IDS - Get Super Snagadocious on Ubuntu&lt;/a&gt;. Toosmooth provides an excellent overview of vortex. He also introduces some ideas for tools that could be built on top of vortex, especially deep email analysis, which seems to be something for which vortex is very well suited.&lt;br /&gt;&lt;br /&gt;I’d also like to clarify my relationship to vortex. Vortex was written and shared with the community by Lockheed Martin. The Charles Smutz mentioned in the changelogs, etc. is the same person as the Charles Smutz who authors this blog, the only difference being that the former writes as a Lockheed Martin employee and the latter as an individual. To be explicit, this blog is in no way sponsored or endorsed by Lockheed Martin and expresses my personal views and opinions as a security researcher.&lt;br /&gt;&lt;br /&gt;Ok, now on with the real material. Our goal in this segment of the vortex howto series will be to develop a mail relay (client) fingerprinting tool. This could just as easily be an FTP, HTTP, etc. client fingerprinting tool.&lt;br /&gt;&lt;br /&gt;The point will be to demonstrate how to use vortex to collect network payload data in a user friendly way. We will be collecting characterizations of network clients which will be useful for historical analysis. In many ways, this is very similar to &lt;a href="http://www.sourcefire.com/products/3D/rna"&gt;Sourcefire RNA&lt;/a&gt;, but instead of characterizing network servers, we’ll be focusing on the clients. 
Also, we will not be focusing on creating transaction logs, for which &lt;a href="http://www.bro-ids.org/"&gt;Bro IDS&lt;/a&gt; is often very effective, depending on the information you want to collect. We will be focusing on building up an archive of network client fingerprints.&lt;br /&gt;&lt;br /&gt;The first thing we need to do is collect network streams so we can analyze them. In the absence of a better data set, we’ll be using the DARPA Intrusion Detection Evaluation 2000 data set: &lt;a href="http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/data/2000/NT_dataset/outside.tcpdump.gz"&gt;http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/data/2000/NT_dataset/outside.tcpdump.gz&lt;/a&gt;. It’s not very interesting but it does have a fair number of complete SMTP connections so it is adequate for demonstrating what we want to do. It's also freely available so you can follow along on your own if you want.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;The Basics&lt;/h3&gt;First of all, as with &lt;a href="http://www.tcpdump.org/"&gt;tcpdump&lt;/a&gt;, you use the -i option to collect network data from a live capture device. To replay dead packets, e.g. from a pcap file, use the -r option. In this example, we’ll be replaying pcaps but the same techniques would work equally well on a live network analyzer.&lt;br /&gt;&lt;br /&gt;Next, we need to specify where to put the data. The -t option does this. If you don’t specify anything, stream files end up in your current working directory. If you’re going to be spooling raw streams to disk for archival, specifying a directory on disk is fine. However, if you are going to be processing the streams and selectively writing small portions of the data to disk/DB, spooling the streams to a ramdisk of some sort is often the right thing to do. 
/dev/shm is the location of a tmpfs mount common on modern Linux distros, which works perfectly for this purpose.&lt;br /&gt;&lt;br /&gt;Since we’re going to concern ourselves with SMTP for the moment, we’ll use a BPF of “tcp port 25”. There are some issues with doing this that we’ll address in one of the other articles in this series.&lt;br /&gt;&lt;br /&gt;So to extract the streams we’re interested in we’d do the following:&lt;br /&gt;&lt;pre&gt;$ mkdir streams&lt;br /&gt;$ vortex -r outside.tcpdump -f "tcp port 25" -t streams&lt;br /&gt;Couldn't set capture thread priority!&lt;br /&gt;streams/196.37.75.158:1052s172.16.114.50:25&lt;br /&gt;streams/196.37.75.158:1052c172.16.114.50:25&lt;br /&gt;streams/196.37.75.158:1104s172.16.114.169:25&lt;br /&gt;streams/196.37.75.158:1104c172.16.114.169:25&lt;br /&gt;streams/196.37.75.158:1106s172.16.114.207:25&lt;br /&gt;streams/196.37.75.158:1106c172.16.114.207:25&lt;br /&gt;…&lt;br /&gt;streams/197.218.177.69:22094s172.16.114.207:25&lt;br /&gt;streams/197.218.177.69:22094c172.16.114.207:25&lt;br /&gt;streams/195.115.218.108:30802s172.16.114.168:25&lt;br /&gt;streams/195.115.218.108:30802c172.16.114.168:25&lt;br /&gt;VORTEX_ERRORS TOTAL: 0 IP_SIZE: 0 IP_FRAG: 0 IP_HDR: 0 IP_SRCRT: 0 TCP_LIMIT: 0 TCP_HDR: 0 TCP_QUE: 0 TCP_FLAGS: 0 UDP_ALL: 0 SCAN_ALL: 0 VTX_RING: 0 OTHER: 0&lt;br /&gt;VORTEX_STATS PCAP_RECV: 0 PCAP_DROP: 0 VTX_BYTES: 5455133 VTX_EST: 1719 VTX_WAIT: 0 VTX_CLOSE_TOT: 1719 VTX_CLOSE: 1718 VTX_LIMIT: 0 VTX_POLL: 0 VTX_TIMOUT: 0 VTX_IDLE: 0 VTX_RST: 0 VTX_EXIT: 1 VTX_BSF: 0&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;We’ve extracted all the relevant TCP streams from the pcap and stored them in files. 
At this point, we haven’t done anything that couldn’t be done just as easily with a myriad of other tools such as &lt;a href="http://www.circlemud.org/%7Ejelson/software/tcpflow/"&gt;tcpflow&lt;/a&gt;, &lt;a href="http://tcpick.sourceforge.net/"&gt;tcpick&lt;/a&gt;, etc., so let’s keep moving.&lt;br /&gt;&lt;br /&gt;Since we’re interested in characterizing the SMTP client (relay forwarding an email), let’s look at an example stream showing data transmitted from the client to the server:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;$ head -n 15 streams/135.13.216.191:11896s172.16.113.204:25&lt;br /&gt;EHLO alpha.apple.edu&lt;br /&gt;HELO alpha.apple.edu&lt;br /&gt;MAIL From: &amp;lt;ansgarz@alpha.apple.edu&amp;gt;&lt;br /&gt;RCPT To: &amp;lt;jouniw@goose.eyrie.af.mil&amp;gt;&lt;br /&gt;DATA&lt;br /&gt;Received: (from mail@localhost) by alpha.apple.edu (SMI-8.6/SMI-SVR4)&lt;br /&gt;    id: CAA16711; Sat,  7 Aug 1999 14:19:07 -0400&lt;br /&gt;Date: Sat,  7 Aug 1999 14:19:07 -0400&lt;br /&gt;To: jouniw@goose.eyrie.af.mil&lt;br /&gt;Subject:  To Introduction exposes us an object&lt;br /&gt;Message-Id: &lt;19990807141907.caa16711&gt;&lt;br /&gt;&lt;br /&gt;        To Introduction exposes us an object can cause&lt;br /&gt;        The type your own memory improved over The&lt;br /&gt;        normal density classifier parameters; and&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;Let’s say for the sake of this exercise, we’re interested in the HELO command (some clients start with HELO and some lead off with EHLO), the sender’s domain (usually each SMTP relay forwards mail for a relatively small number of domains, often only one), and the relay server's hostname and software name as reported in the first Received line.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Our Analyzer&lt;/h3&gt;The following shell script, when combined with vortex, would collect this info and store it in a file per IP address, each line containing a unique fingerprint:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;#!/bin/bash&lt;br /&gt;&lt;br 
/&gt;while read STREAM_FILE&lt;br /&gt;do&lt;br /&gt;CLIENT_IP=`basename $STREAM_FILE | awk -F: '{ print $1 }'`&lt;br /&gt;HELO_CMD=`head -n 1 $STREAM_FILE | awk '{ print $1 }'`&lt;br /&gt;SENDER_DOMAIN=`grep -i "^MAIL FROM:" $STREAM_FILE | \&lt;br /&gt;  sed -r 's/^.*@(.*)&gt;.*$/\1/g'`&lt;br /&gt;BY_STRING=`grep -E -o -h "by [0-9a-zA-Z.-]+( \(.*\))?" \&lt;br /&gt; $STREAM_FILE | head -n 1 | sed 's/by //g'`&lt;br /&gt;&lt;br /&gt;FINGERPRINT="$HELO_CMD $SENDER_DOMAIN $BY_STRING"&lt;br /&gt;&lt;br /&gt;if ! grep -F "$FINGERPRINT" "$CLIENT_IP" 2&gt;/dev/null&lt;br /&gt;then&lt;br /&gt;    echo "$FINGERPRINT" &gt;&gt; "$CLIENT_IP"&lt;br /&gt;fi&lt;br /&gt;&lt;br /&gt;rm $STREAM_FILE&lt;br /&gt;&lt;br /&gt;done&lt;br /&gt;&lt;/pre&gt;&lt;a href="http://www.csmutz.com/files/smtp_fingerprint.sh"&gt;download smtp_fingerprint.sh&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Note that we’ve followed the basic paradigm of a vortex analyzer: read a filename from STDIN, analyze it, delete it.&lt;br /&gt;&lt;br /&gt;Also note that this shell script is very quick and dirty. There are so many things wrong with it, we won’t even list them. However, it doesn’t take much vision to see what 30-50 lines of perl/python/ruby code could do, possibly with a DB.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Pulling it Together&lt;/h3&gt;We’ve got a couple of other things left to work out. First, we are only interested in “s” streams--the streams going from the TCP client to the server. As such, we’re going to set the client collection size (-C) to zero. 
While collecting complete client-to-server streams is fine, in our case we’re really only interested in the first few lines of the stream, so we’re going to set the to-server collection size (-S) to 4 KB.&lt;br /&gt;&lt;br /&gt;Since we’re snarfing, analyzing, then purging, we’re going to store the streams temporarily in ramdisk (/dev/shm).&lt;br /&gt;&lt;br /&gt;Our finished product is as follows:&lt;br /&gt;&lt;pre&gt;$ mkdir smtp_fingerprints&lt;br /&gt;$ cd smtp_fingerprints&lt;br /&gt;$ vortex -r ../outside.tcpdump -t /dev/shm -S 4096 -C 0 -f \&lt;br /&gt;"tcp port 25" | smtp_fingerprint.sh&lt;br /&gt;Couldn't set capture thread priority!&lt;br /&gt;EHLO jupiter.cherry.org jupiter.cherry.org (SMI-8.6/SMI-SVR4)&lt;br /&gt;EHLO jupiter.cherry.org jupiter.cherry.org (SMI-8.6/SMI-SVR4)&lt;br /&gt;EHLO finch.eyrie.af.mil finch.eyrie.af.mil (SMI-8.6/SMI-SVR4)&lt;br /&gt;EHLO mars.avocado.net mars.avocado.net (SMI-8.6/SMI-SVR4)&lt;br /&gt;…&lt;br /&gt;EHLO alpha.apple.edu alpha.apple.edu (SMI-8.6/SMI-SVR4)&lt;br /&gt;VORTEX_ERRORS TOTAL: 0 IP_SIZE: 0 IP_FRAG: 0 IP_HDR: 0 IP_SRCRT: 0 TCP_LIMIT: 0 TCP_HDR: 0 TCP_QUE: 0 TCP_FLAGS: 0 UDP_ALL: 0 SCAN_ALL: 0 VTX_RING: 0 OTHER: 0&lt;br /&gt;VORTEX_STATS PCAP_RECV: 0 PCAP_DROP: 0 VTX_BYTES: 3323358 VTX_EST: 1719 VTX_WAIT: 0 VTX_CLOSE_TOT: 1719 VTX_CLOSE: 1455 VTX_LIMIT: 263 VTX_POLL: 0 VTX_TIMOUT: 0 VTX_IDLE: 0 VTX_RST: 0 VTX_EXIT: 1 VTX_BSF: 0&lt;br /&gt;Hint--VTX_LIMIT: Streams truncated due to size limits. 
If not desired,&lt;br /&gt;adjust stream size limits accordingly (-C, -S).&lt;br /&gt;EHLO alpha.apple.edu alpha.apple.edu (SMI-8.6/SMI-SVR4)&lt;br /&gt;…&lt;br /&gt;EHLO pluto.plum.net pluto.plum.net (SMI-8.6/SMI-SVR4)&lt;br /&gt;EHLO epsilon.pear.com epsilon.pear.com (SMI-8.6/SMI-SVR4)&lt;br /&gt;$&lt;br /&gt;&lt;/pre&gt;Alright, let’s look at the output of our masterpiece.&lt;br /&gt;&lt;pre&gt;$ ls&lt;br /&gt;135.13.216.191  172.16.112.194  172.16.112.50   172.16.113.84 172.16.114.169  194.27.251.21    195.73.151.50   197.182.91.233 &lt;br /&gt;135.8.60.182    172.16.112.20   172.16.113.105  172.16.114.148 172.16.114.207  194.7.248.153    196.227.33.189  197.218.177.69&lt;br /&gt;172.16.112.149  172.16.112.207  172.16.113.204  172.16.114.168 172.16.114.50   195.115.218.108  196.37.75.158&lt;br /&gt;&lt;/pre&gt;Cool, a file per client IP as we expected.&lt;br /&gt;&lt;br /&gt;Let’s take a peek at a few:&lt;br /&gt;&lt;pre&gt;$ cat 135.13.216.191&lt;br /&gt;EHLO alpha.apple.edu alpha.apple.edu (SMI-8.6/SMI-SVR4)&lt;br /&gt;$ cat 172.16.112.194&lt;br /&gt;EHLO falcon.eyrie.af.mil falcon.eyrie.af.mil (SMI-8.6/SMI-SVR4)&lt;br /&gt;$ cat 172.16.112.20&lt;br /&gt;EHLO zeno.eyrie.af.mil hobbes.eyrie.af.mil (8.8.7/8.8.7)&lt;br /&gt;&lt;/pre&gt;The fingerprints, just as we wanted. We analyzed 1719 SMTP streams. How many fingerprints are there?&lt;br /&gt;&lt;pre&gt;$ wc -l *&lt;br /&gt;1 135.13.216.191&lt;br /&gt;1 135.8.60.182&lt;br /&gt;1 172.16.112.149&lt;br /&gt;…&lt;br /&gt;1 196.227.33.189&lt;br /&gt;1 196.37.75.158&lt;br /&gt;2 197.182.91.233&lt;br /&gt;1 197.218.177.69&lt;br /&gt;24 total&lt;br /&gt;&lt;/pre&gt;A relatively low number, indicating a lot of duplicate fingerprints. Ok, most files have one line, which is to be expected. 
Let's look at the one with more than one.&lt;br /&gt;&lt;pre&gt;$ cat 197.182.91.233&lt;br /&gt;EHLO marslistserv.com mars.avocado.net (SMI-8.6/SMI-SVR4)&lt;br /&gt;EHLO mars.avocado.net mars.avocado.net (SMI-8.6/SMI-SVR4)&lt;br /&gt;&lt;/pre&gt;Perfect, email relays that relay mail for more than one domain have more than one fingerprint.&lt;br /&gt;&lt;br /&gt;What can we do with this monstrosity we’ve created? Let’s run a few queries on our awkward DB, CLI style:&lt;br /&gt;&lt;br /&gt;What server relay software is used and with what frequency?&lt;br /&gt;&lt;pre&gt;$ cat * | awk '{ print $NF }' | sort | uniq -c&lt;br /&gt;  1 (8.8.0/8.8.5)&lt;br /&gt;  1 (8.8.7/8.8.7)&lt;br /&gt; 22 (SMI-8.6/SMI-SVR4)&lt;br /&gt;&lt;/pre&gt;Clearly this is a data miner’s paradise ;) A whole 3 different types of SMTP relay software are used.&lt;br /&gt;&lt;br /&gt;Who has ever delivered mail for (or claimed to be) orange.com?&lt;br /&gt;&lt;pre&gt;$ grep "orange.com" *&lt;br /&gt;195.73.151.50:EHLO lambda.orange.com lambda.orange.com (SMI-8.6/SMI-SVR4)&lt;br /&gt;&lt;/pre&gt;I think you get the idea.&lt;br /&gt;&lt;br /&gt;So the dataset we used is pretty limited and the analyzer we created is certainly contrived, but I hope this demonstrates the type of thing you could do using vortex as the basis for a network surveillance tool. You could collect and store or mine just about any data you wanted to. Because vortex is used, the analyst doesn’t have to worry about extracting data from packets. Network data appears as files, which is perfect for the CLI ninja. While not explicitly shown here, since vortex handles all the real-time constraints, with a few minor modifications, our script could run on a production network of decent size and still perform fine.&lt;br /&gt;&lt;br /&gt;I’ve explained the most basic usage of vortex and demonstrated its use for something that can’t as easily be done with any other tools that I know of. 
In future installments of this series we’ll demonstrate various other aspects of vortex and how it is used.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/471870110572762709-493487845419870400?l=smusec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/493487845419870400/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2010/03/vortex-howto-series-network.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/493487845419870400'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/493487845419870400'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2010/03/vortex-howto-series-network.html' title='Vortex Howto Series: Network Surviellance'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-471870110572762709.post-7661586552867401772</id><published>2010-03-03T15:34:00.000-08:00</published><updated>2010-04-20T15:48:15.637-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='opinion'/><category scheme='http://www.blogger.com/atom/ns#' term='devel'/><title type='text'>Developing Relevant Information Security Systems</title><content type='html'>In January, I presented at DC3 on &lt;a href="http://www.dodcybercrime.com/10CC/descriptions.asp#DIB_adf"&gt;Agile Development for Incident Response&lt;/a&gt;. 
I firmly believe that rapid engineering of information security systems is necessary to effectively combat sophisticated threats. I’ve also been struck lately by the lack of relevance of so much information security research and development.&lt;br /&gt;&lt;br /&gt;One thing that I am adamant about, but that has largely been ignored by the mainstream security community, is the need to face sophisticated and determined attackers with a threat-focused response. A few others have already written extensively on this topic. The one reference I will make is to Mike Cloppert’s explanation of security intelligence, specifically his article on &lt;a href="http://blogs.sans.org/computer-forensics/2009/10/14/security-intelligence-attacking-the-kill-chain/%20"&gt;attacking the kill chain&lt;/a&gt;, which takes a conventional military construct (the kill chain) and applies it to information security.&lt;br /&gt;&lt;br /&gt;Threat-focused analysis is necessary, but is not sufficient. Unfortunately, current off-the-shelf security systems do not adequately support this approach. To effectively perform security intelligence, new security tools must be developed. Sadly, sophisticated attackers are not static targets. They change and evolve. What’s more, the enemies themselves change over time.&lt;br /&gt;&lt;br /&gt;Working in the defense sector, I often try to contrast the cyber security world with the physical security world. I do this predominantly to find ways to apply lessons of the past to present problems. The world has a long history of fighting wars and developing weapons systems. There must be some lessons to be learned from conventional weapons systems that can be applied to the realm of cyber security. 
As such, I’m going to use four conventional weapons systems to express allegorically some of my recent musings on effectively developing threat-focused information security systems.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Too Much, Too Late&lt;/h3&gt;&lt;br /&gt;Not too long ago, I visited the final manufacturing plant for the &lt;a href="http://en.wikipedia.org/wiki/F-22_Raptor"&gt;F22 Raptor&lt;/a&gt;. I have to admit, seeing the F22 in person makes the technological marvel it is that much sexier. However, while the F22 largely meets the goals its engineers set out to accomplish so many years ago and truly is far superior to any other fighter out there, the US decided we didn’t need it anymore, especially at a cost of ~$150 million per plane.&lt;br /&gt;&lt;br /&gt;What went wrong? Latency. The threat landscape has changed significantly in the last 3 decades. If we had active enemies with technology that could only be adequately matched by the F22, then the F22 would be a bargain. However, since F22s aren’t particularly useful in wars like Iraq and Afghanistan, the cost is unjustifiable. To add insult to injury, it is conceivable that in a decade or two we could have a real need for the F22 that justifies the high price tag, but since the production lines and engineering will have long ceased, simply building more of them won’t be an easy option.&lt;br /&gt;&lt;br /&gt;The information security equivalents of the F22 exist. They are technologically magnificent. They operate well for the missions they were designed for. Unfortunately, the cheese has moved. It’s hard to say if the technologies will be relevant in the future, but if they’re not relevant enough to justify further investment today, it will likely mean starting again from scratch.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Smart Bomb&lt;/h3&gt;&lt;br /&gt;In stark contrast to the F22 is a humble artillery round called the &lt;a href="http://en.wikipedia.org/wiki/M982_Excalibur"&gt;M982 Excalibur&lt;/a&gt;. 
This thing is everything the F22 isn’t: mundane, relatively cheap, and fabulously effective against today’s threats. It’s been very popular in Iraq and Afghanistan because its precision allows its use against insurgents close to non-targets or in complex terrain.&lt;br /&gt;&lt;br /&gt;What makes the Excalibur great? Was it lack of technical challenges and problems during development? No. Radical new technologies? No.&lt;br /&gt;&lt;br /&gt;The Excalibur is great because it is an ingenious marriage of technologies from other high tech devices (insanely expensive guided missiles) with a widely deployed, reliable, and economical infrastructure (&lt;a href="http://en.wikipedia.org/wiki/M198_howitzer"&gt;howitzer artillery&lt;/a&gt;). While the Excalibur is relatively economical, the &lt;a href="http://en.wikipedia.org/wiki/XM1156_Precision_Guidance_Kit"&gt;XM1156&lt;/a&gt; promises to make similar capabilities really cheap.&lt;br /&gt;&lt;br /&gt;We need more Excaliburs in the field of information assurance. We need to take our existing IT infrastructure and security tools and make the relatively minor tweaks necessary to keep pace with the changing threat landscape. Just as howitzer munitions have changed over time to keep pace with enemies, we often need only minor adjustments to our core IT infrastructure to respond to today’s attackers. However, if we can’t get the requisite features in a timely manner, we are often forced to make do without or employ a whole new tool just to fill a relatively small role. One general example I can think of is audit logs. 
All too often, the inclusion of one small piece of information is all that is required to turn a vanilla IT system into a widely deployed IDS.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Waiting for Godot&lt;/h3&gt;&lt;br /&gt;The &lt;a href="http://en.wikipedia.org/wiki/Expeditionary_Fighting_Vehicle"&gt;Expeditionary Fighting Vehicle&lt;/a&gt; (EFV) is an amphibious landing craft being developed for the Marines. The EFV is recognized as one of the top acquisition priorities for the Marines, but the program is floundering. I guess it doesn’t take much imagination to figure out how fundamental landing craft are to the mission of the Marines. The EFV was supposed to be in service over a decade ago, but reliability issues have kept that from happening. The current projected deployment date is far enough out that it might slip again or that the project might get canceled or changed drastically.&lt;br /&gt;&lt;br /&gt;There are too many EFVs in the realm of information security. There are lots of reasons why this occurs so often, which I don’t want to discuss at the moment. At the risk of being called an existentialist, I declare that a system that isn’t deployable yet doesn’t exist. We’ve got to stop building and waiting on vaporware. I’ve been burnt too many times by waiting for systems that are perpetually just around the corner. I wish it weren't true, but I have my own fair share of culpability in this regard. I do believe that applying agile instead of waterfall development methods will help curtail perpetually late projects. Clearly, professional integrity is also required.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Freedom as in Speech&lt;/h3&gt;&lt;br /&gt;Probably the least well-known weapon system I will use as an example is &lt;a href="http://www.stsc.hill.af.mil/crosstalk/2004/11/0411Kerr.html"&gt;Acoustic Rapid COTS Insertion&lt;/a&gt; (ARCI). 
In short, ARCI delivers rapid improvements to the sonar systems of the US submarine fleet through frequent deployments of both new software and hardware, building largely from off-the-shelf hardware, such as Intel and AMD processors, and commercial or open source software such as the Linux operating system. ARCI has demonstrated the value of shifting from completely custom and proprietary solutions to leveraging off-the-shelf platforms in order to focus R&amp;amp;D resources on the features unique to the mission of the system. ARCI delivers new capabilities to the fleet at a previously unknown rate and has become a shining example of the Navy’s quest to acquire open systems. While lacking a historical track record, the &lt;a href="http://en.wikipedia.org/wiki/Littoral_combat_ship"&gt;Littoral Combat Ship&lt;/a&gt; promises to take this open systems approach to a meta level, making a ship a platform for modular mission systems that can be developed and deployed rapidly to fulfill current missions. I see great promise in this open systems approach to weapons systems.&lt;br /&gt;&lt;br /&gt;There are already many good examples of openness in the realm of information security systems, but we need more. To remain relevant in the face of changing threats, information security systems must provide flexibility at the architectural, platform, and component level. Re-inventing the wheel is a waste of time that we can’t afford. We need to build upon established technologies and focus new development on the capabilities specific to the threats we face. We have to build openness and flexibility into our information security systems. 
My personal experience with ARCI has changed the way I think about developing highly specialized systems.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;Security Development Call to Keyboards&lt;/h3&gt;&lt;br /&gt;I’ve intentionally masked my complaints about information security systems development with analogies to military weapons systems, so as not to have to name any specific information security tools. Whether you agree with my hasty analysis of these weapons systems or not, I hope that the characterizations I’ve tried to establish allow you to identify the allegorical class of information security systems. The information security community must do better at defending against sophisticated attacks. A portion of the need for improvement rests on the security tool development sector and the people who direct it.&lt;br /&gt;&lt;br /&gt;As security system developers, we need to create open systems that are relevant to today’s threats. We need to build flexibility into our systems at the architectural, platform, and component level. We need to build tools that ease customization, extension, and integration with other tools. We need to rapidly respond to our users’ requests for changes to functionality. We have to shed the blinders of entrenched methods and truly innovate. We have to stop peddling vaporware.&lt;br /&gt;&lt;br /&gt;As people who buy or direct development of security tools, we require open systems that both meet our needs today and provide us the freedom to react to changes in the future. We must be judicious in asking for highly specialized tools that aren’t possible to develop in a short time frame and which might be irrelevant before they are completed. We must find ways to motivate our vendors to provide what we need and not more. 
When our vendors can’t or won’t provide the capabilities we need, we have to roll up our sleeves and do it ourselves.</content><link rel='replies' type='application/atom+xml' href='http://smusec.blogspot.com/feeds/7661586552867401772/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://smusec.blogspot.com/2010/03/developing-relevant-information.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/7661586552867401772'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/471870110572762709/posts/default/7661586552867401772'/><link rel='alternate' type='text/html' href='http://smusec.blogspot.com/2010/03/developing-relevant-information.html' title='Developing Relevant Information Security Systems'/><author><name>Charles Smutz</name><uri>http://www.blogger.com/profile/05098439824931378207</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://2.bp.blogspot.com/_g1XmJJW8J_g/S-igCjTBm_I/AAAAAAAAAAY/7Dc7fU1NH5g/S220/CharlesSmutz2009.jpg'/></author><thr:total>1</thr:total></entry></feed>
