Occasionally, one comes across the need to perform relational queries on data that is stored in flat files. One legitimate tactic is simply to load the data into a relational database such as SQLite or MySQL, but there are many situations where this is less than desirable or just not practical. In such situations, I’ve seen people hack together bash/perl/whatever scripts, many of which are extremely inefficient, harder than they need to be, and/or just plain ugly. Using “join” in conjunction with classic line-based text processing utils can provide very elegant solutions in some of these situations. Never heard of join? Keep reading as I extol its virtues!
I learned to use join working with Ruminate IDS. Ruminate creates logs for each of the processing layers data traverses. In its current state, correlating these logs at various layers of the processing stack is left for an external log aggregation/correlation system. In exploring events in flat file logs, I use join to splice multiple layers of the processing stack together, similar to what you would do using a join in SQL.
The following is some data I will use for this example. In order to sanitize the smallest amount of data possible, I’ve only included the log entries I will be using here. Feel free to intersperse (or imagine interspersing) additional dummy logs if you like (just make sure to maintain ordering on the keys used for joining; see the explanation further below). This data represents the processing associated with the same malicious pdf transferred over the network in two separate transactions:
$ cat clamav.log
tcp-1305036479-10.1.1.1:51770c114.143.209.62:80_http-0 Exploit.PDF-22632
tcp-1305128525-10.1.1.1:57460c89.114.97.13:80_http-0 Exploit.PDF-22632
$ cat object.log
tcp-1305036479-10.1.1.1:51770c114.143.209.62:80_http-0 2080 9e9dfd9534fe89518ba997deac07e90d PDF document, version 1.6
tcp-1305128525-10.1.1.1:57460c89.114.97.13:80_http-0 2080 9e9dfd9534fe89518ba997deac07e90d PDF document, version 1.6
$ cat http.log
tcp-1305036479-10.1.1.1:51770c114.143.209.62:80_http-0 GET haeied.net /1.pdf
tcp-1305128525-10.1.1.1:57460c89.114.97.13:80_http-0 GET haeied.net /1.pdf
Note that these log files are presented in reverse order of the processing stack. HTTP processing extracts objects and creates network protocol logs. File metadata is extracted from those objects. The objects are then multiplexed to analyzers like clamav for analysis.
Let’s say I want to look at all the files transferred over the network that matched the clamav signature “Exploit.PDF-22632”. I use the classic grep:
$ grep -F "Exploit.PDF-22632" clamav.log
tcp-1305036479-10.1.1.1:51770c114.143.209.62:80_http-0 Exploit.PDF-22632
tcp-1305128525-10.1.1.1:57460c89.114.97.13:80_http-0 Exploit.PDF-22632
Unfortunately, the TCP quad and timestamp don’t provide much useful context. Let’s join in the http.log data:
$ grep -F "Exploit.PDF-22632" clamav.log | join - http.log
tcp-1305036479-10.1.1.1:51770c114.143.209.62:80_http-0 Exploit.PDF-22632 GET haeied.net /1.pdf
tcp-1305128525-10.1.1.1:57460c89.114.97.13:80_http-0 Exploit.PDF-22632 GET haeied.net /1.pdf
Whoa, that was easy. Note that join assumed that we wanted to use the first column as the key for joining. While we’re at it, let’s join in the object.log data, only selecting the columns we are interested in:
$ grep -F "Exploit.PDF-22632" clamav.log | join - http.log | join - object.log | cut -d" " -f 2-6,8-
Exploit.PDF-22632 GET haeied.net /1.pdf 2080 PDF document, version 1.6
Exploit.PDF-22632 GET haeied.net /1.pdf 2080 PDF document, version 1.6
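A quick aside on keys: if the join field isn’t the first column, you can name it explicitly with -1 and -2 (the field number in the first and second input, respectively). For instance, if the transaction id happened to be the third field of some proxy.log (a made-up file here, purely for illustration), the join might look like this, with both inputs still sorted on their respective join fields:
$ grep -F "Exploit.PDF-22632" clamav.log | join -1 1 -2 3 - proxy.log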
One big advantage of join is that it is easy to use in conjunction with other filter programs such as grep, sed, and zcat. You might use sed to convert the tcp quads from IDS alerts and firewall logs into exactly the same format so you can join the two on the quad as the key (there’s a sketch of this after the next example). Join also works very well on large files, including compressed files decompressed on the fly with zcat or zgrep, letting you get at the data you want efficiently. The following is the same query as before, just operating on compressed files:
$ gzip -c clamav.log > clamav.log.gz
$ gzip -c object.log > object.log.gz
$ gzip -c http.log > http.log.gz
$
$ zgrep -F "Exploit.PDF-22632" clamav.log.gz | join - <(zcat http.log.gz) | join - <(zcat object.log.gz) | cut -d" " -f 2-6,8-
Exploit.PDF-22632 GET haeied.net /1.pdf 2080 PDF document, version 1.6
Exploit.PDF-22632 GET haeied.net /1.pdf 2080 PDF document, version 1.6
Again, very easy to get a nice little report using data spanning multiple files.
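As promised earlier, here is roughly what normalizing keys with sed might look like. Suppose, purely hypothetically, that an IDS alert log starts each line with the quad written as "10.1.1.1:51770 -> 114.143.209.62:80" while a firewall log writes it as "10.1.1.1 51770 114.143.209.62 80". Two sed expressions can rewrite both prefixes into one common key before joining (the file names and formats below are invented for this sketch):
$ # hypothetical ids.log lines: "src:sport -> dst:dport alert ..."
$ # hypothetical fw.log lines:  "src sport dst dport action ..."
$ join <(sed -E 's|^([^ ]+) -> ([^ ]+)|\1-\2|' ids.log | sort -k 1,1) <(sed -E 's|^([^ ]+) ([^ ]+) ([^ ]+) ([^ ]+)|\1:\2-\3:\4|' fw.log | sort -k 1,1)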
To continue demonstrating join, I’m going to refer to the data used in an SQL JOIN tutorial.
I used the data in CSV form as follows:
$ cat customers.csv
1,John,Smith,John.Smith@yahoo.com,2/4/1968,626 222-2222
2,Steven,Goldfish,goldfish@fishhere.net,4/4/1974,323 455-4545
3,Paula,Brown,pb@herowndomain.org,5/24/1978,416 323-3232
4,James,Smith,jim@supergig.co.uk,20/10/1980,416 323-8888
$ cat sales.csv
2,5/6/2004,100.22
1,5/7/2004,99.95
3,5/7/2004,122.95
3,5/13/2004,100.00
4,5/22/2004,555.55
First, let’s start by generating a report for the marketing folk showing when each person has placed orders:
$ cat sales.csv | join -t, - customers.csv | sort -t, -k 1 | awk -F, '{ print $2","$4","$5 }'
5/6/2004,Steven,Goldfish
5/13/2004,Paula,Brown
5/7/2004,Paula,Brown
5/22/2004,James,Smith
Wow, didn’t that feel like you were using a relational database, albeit in a CLI type of way? Note that we had to specify the delimiter (same syntax as sort). Also, we sorted the output on customerid to ensure orders by the same person are contiguous. The astute reader, however, will notice that the report isn’t complete. We missed one of the sales on 5/7/2004. Why? From the man page we get the following critical nugget:
Important: FILE1 and FILE2 must be sorted on the join fields.
In this case we were joining on the customerid column, which is not in the same order in the sales and customers files. As such, we failed to join the records that weren’t ordered the same way in both files. While this could be seen as a limitation of join, it is also what makes it efficient and lets it work so well with other utilities: all join operations happen in a single sequential pass through each file. Remember that “real” databases maintain indexes to make this sort of thing more efficient than a full table scan. Not to fret, though: for occasional queries, using sort to put the join fields in the same order works quite well. Also note that for a lot of security data, which is written chronologically, this requirement is frequently met with no additional effort, as in the Ruminate logs above. Here, we’ll sort sales so customerid is in the same order as in the customers file:
$ cat sales.csv | sort -t, -k 1 -g | join -t, - customers.csv | awk -F, '{ print $2","$4","$5 }'
5/7/2004,John,Smith
5/6/2004,Steven,Goldfish
5/13/2004,Paula,Brown
5/7/2004,Paula,Brown
5/22/2004,James,Smith
Now the order from John Smith shows up correctly.
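More generally, when neither file is already ordered on the key, you can sort both on the fly with process substitution (a bash-ism). One caveat: join expects keys in lexicographic (collating) order, i.e. what plain sort produces. With the single-digit customerids here, sort -g happens to give the same order, but for larger keys numeric and lexicographic order can differ, and recent GNU join will complain about unsorted input. A sketch that reproduces the report above:
$ join -t, <(sort -t, -k 1,1 sales.csv) <(sort -t, -k 1,1 customers.csv) | awk -F, '{ print $2","$4","$5 }'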
Let’s do another simple query for the marketing folk: Report of all the customers that have placed individual purchases of over $100—the high rollers:
$ cat sales.csv | awk -F, '{ if ($3 > 100) print $0}' | sort -t, -k 1 -g | join -t, customers.csv - | cut -d, -f2,3,8
Steven,Goldfish,100.22
Paula,Brown,122.95
James,Smith,555.55
Again, this is simple and straightforward (in an esoteric CLI type of way). If we were doing an SQL tutorial, we would have just introduced a WHERE clause. If I were going to translate this as literally as possible to SQL I would do so as follows:
CLI                                    pseudo-SQL
cat sales.csv                          FROM sales
awk -F, '{ if ($3 > 100) print $0}'    WHERE saleamount > 100
sort -t, -k 1 -g                       USING INDEX customerid
join -t, customers.csv -               JOIN customers ON customerid
cut -d, -f2,3,8                        SELECT firstname, lastname, saleamount
With the full pseudo-SQL as follows:
SELECT firstname, lastname, saleamount FROM sales JOIN customers ON customerid USING INDEX customerid WHERE saleamount > 100
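While we’re translating to SQL, join can also approximate a LEFT OUTER JOIN: the -a option prints unpairable lines from the given file alongside the matches. For example, to keep customers with no purchase over $100 in the high-roller report (a sketch):
$ cat sales.csv | awk -F, '{ if ($3 > 100) print $0}' | sort -t, -k 1 -g | join -t, -a 1 customers.csv - | cut -d, -f2,3,8
John Smith should now show up as well, just without a sale amount, since his only purchase was under $100.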
For the last example, I’ll do the gratuitously ugly example from the tutorial whose data we are using. Let’s calculate the total spent by each customer:
$ cat sales.csv | awk -F"," '{SUMS[$1]+=$3} END { for (x in SUMS) { print x","SUMS[x]} }' | sort -t, -k 1 -g | join -t, customers.csv - | cut -d, -f2,3,7
John,Smith,99.95
Steven,Goldfish,100.22
Paula,Brown,222.95
James,Smith,555.55
Alright, so this isn’t so pretty, but it works.
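As a small cleanup, join’s -o option can select and order the output fields directly, using FILENUM.FIELD specifiers, which does away with the trailing cut:
$ cat sales.csv | awk -F"," '{SUMS[$1]+=$3} END { for (x in SUMS) { print x","SUMS[x]} }' | sort -t, -k 1 -g | join -t, -o 1.2,1.3,2.2 customers.csv -
This should produce the same firstname, lastname, total report as above.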
In summary, join makes it easy to splice together data from multiple flat files. It works well in the classic *nix CLI analysis paradigm, using sequential passes through files containing one record per line. Join is particularly useful for infrequent queries on large files, including compressed files. Join plays well with the other CLI utils such as sed, awk, cut, etc and can be used to perform relational queries like those done in a database. I hope this short primer has been useful in demonstrating the power of join.