I’d like to share some tactics that I have found to be useful for storing and retrieving security event data. These approaches can certainly be backed by theory, but I've also come to learn their value through real world experience, including some hard knocks. One impetus for finally putting these thoughts down is to explain the motivation behind MongoR, a wrapper around MongoDB which was written by a close colleague and recently open sourced by my employer. However, MongoR only addresses some of these considerations, and I believe these issues should be addressed by all data stores that want to better support users who warehouse security event data. Many data stores already utilize these tactics or make it possible for the user to do it on their own, but few leverage all the optimizations possible for storing logs.
Storing and retrieving security event data, such as network infrastructure or sensor logs, is non-trivial due to the high insert rate required. In addition to large volume, most off-the-shelf data stores are poorly tuned to rolling window style of data retention that is usually desired. For example, a typical use case would be to store 90 days of logs and to be able to search them based on key indicator types such as IP addresses.Security event data typically has the following properties:
- Organized primarily by time
- Limited, Fixed Retention
- Not Highly Relational
- Very high insert to query ratio
- Core indicator types (ex. IP) comprise most searches
- Searches are typically for rare data (ex. attack activity)
These properties are nearly diametrically opposed to that of a typical database driven app such as a service ticket, e-commerce, or ERP system. As such, it’s not surprising that it requires some work to adapt databases to storing event streams and that many systems that are successful look more like search engines than databases.
NoSQL data stores like MongoDB work very well for security event data, fitting the not highly relational property well, and providing a lot of flexibility when a fixed schema just isn’t possible. Ex. Storing and accessing arbitrary HTTP headers in something like Mongo is great. When there are search engines like Elastic Search that store documents with flexible schemas, why would you even use something like Mongo? There are a few reasons, but one major driver is that Mongo and similar databases readily support alternative access methods, such as map/reduce. In speaking about tactics for log storage, I’m generally not going to differentiate between NoSQL databases like MongoDB and search engines like ElasticSearch as the delineation between them is blurry anyway. I merely wanted to note that there is room for both approaches in the security event storage realm.
Events are Stored to be Deleted
Providing the capacity to support an adequate insert rate can often be difficult, but paying nearly as much for deleting an old event as inserting a new event adds insult to injury. The time honored approach is to use some sort of time based data partitioning where each shard contains a time slice of events, whether the trigger for a new shard be date based or size based. Glossing over issues such as fragmentation, the expensive part here is not pruning the data itself, but removing entries from indexes. Trimming documents out of the indexes is expensive and significantly increases the database's working set. Deleting event data should be cheap. Many data stores don’t support date based sharding with efficient deletes so this must be done by the user. This is one of the core benefits of MongoR: pruning old data is a simple collection (table) drop.
Working Set Blues
Generalizing beyond efficient deletes, the major driver for data partitioning is keeping the working set of a database manageable. The only way to maintain high insert rates is to keep most of the area of active writing cached in RAM. This is especially imperative for indexes as these usually involve very random writes (the document data itself can usually be written sequentially). Sharding helps keep the working set manageable by bounding the size of the data actively written. The downside to this approach is that you have to check every shard during a query, multiplying IOPS by shard count. For security event data, this is very often a wise trade-off. This is the other major benefit MongoR adds on top of MongoDB: managing many collections that are small enough to fit in RAM.
Tiered Storage Please
While MongoR addresses some of the most fundamental issues required to make MongoDB serviceable for high rate security event data, there is more to be done and it’s not specific to MongoDB. One my pet peeves are databases that don’t readily support separating indexes from raw data. There are many people who want to index a relatively small part of raw events, say just the IP address from web server logs. When indexes are mingled with document data, the IOPS required for writing the indexes can be much higher because the indexes may be spread over a larger number of blocks than would occur if indexes are grouped together. To the degree that document data and indexes have different read/write patterns and especially when indexes are smaller than base data, using tiered storage is beneficial.
Indexing the Kitchen Sink
Many systems used to store security event data support indexing a portion of the data in event records, while some support an all or nothing approach. Some are flexible in what fields can be indexed and some are fixed. Flexibility is always appreciated by advanced users who can say in advance what they will typically search. The all or nothing approach can make the use of the system very expensive and can lead to excluding data useful for analysis but not likely to be searched directly.
Stop the Sync Shell Game
One of the fundamental trade-offs a data store has to make is between efficiency and data safety. You can cache heavily and get high performance or constantly sync data to disks and get data safety, but not both. This dichotomy, however, neglects a few optimizations. First of all, security event data is by design immutable. If done right, the raw event data would be dumped with simple sequential writes and loaded with unsynchronized reads.
Another optimization lies in the presumption of some sort of sharding where only the current data collection needs to be loaded in RAM (and for proper performance needs to be wholly in RAM). If the indexes are the expensive part of a data store, requiring all sorts of IOPS to keep synced, why are they written to disk before the shard is finished? In the event of a failure, the indexes can always be re-created from the source data (sequential access). This also opens the door to using efficient data structures for the in memory indexes that may well differ from what is best for the on disk indexes or indexes that are created incrementally. Once you’ve relegated yourself to keeping your current shard in memory, embrace it and leverage the resulting possible optimizations.
There is no Trie
Moving beyond binary trees for indexes is liberating. Various alternative indexing mechanisms, most of them involving hashing of some flavor, can provide constant time lookups and require fewer IOPS than binary trees. For example, I’ve been using discodbs based on cmph, and have had fantastic results. The minimal perfect hash functions used result in indexes that are less complex than a binary tree, involve much fewer IOPS for queries, and are much smaller than an optimal bloom filter. The trade-off is that they are immutable and cannot be efficiently updated or created incrementally, but that is not a concern for immutable log data.
Proponents of binary trees will claim that prefix matches are a killer feature. I’m not sure they are, and they are easily provided through other techniques anyway. Domain names are a prime example of a data type that does not work well with the prefix searches supported by typical binary tree indexes. Sure, you can reverse the domains and then prefix matching works, but you could also extract the prefixes you want searchable and insert them each as keys in your constant time lookup indexes. The point is that you often have to invest extra effort to make binary trees work for many common security event indicator types.
The security domain analogues of all the pre-processing that goes into full text indexing (tokenization, stemming, stopwords, natural language analysis, etc.) are all very immature. As this field matures, I feel we’ll be better able to say what is required to support any searching beyond exact term matches. Regardless, once you move past storing IP addresses as integers, it’s not clear whether binary tree prefix matching buys much in queries on security event data. In a day when general purpose and reliability focused filesystems are using alternatives to binary trees (ex. htree of ext4), security event data stores should be too.
Lower the Resolution Please
One place I think the security community should look for cost savings is in lowering the resolution of indexes from the per event/row level to groups/block of events. I’ve seen great savings in this in practice and in some cases it can even improve performance on an absolute scale, let alone economics. When the desired result is a very small number of records, low resolution indexes coupled with processing of a very small number of blocks to do the final filtering can beat out bigger, more expensive, and more precise indexes. One use case that benefits from this approach is pivoting on indicators of threat actors, which are increasingly often exchanged in threat intelligence sharing communities. In these searches, it is most often desirable to search the widest time window possible (full data set), to search on a common indicator type, and for there to be few or no results. Often the low resolution answer is adequate--there is no need to drill down to individual events. I’ve seen low resolution indexes that are orders of magnitude smaller than the raw source data provide utility through low latency query times that row level indexes can never touch because of fundamental differences in size. Significantly ameliorating the cost of very common terms, ex. your web server’s IP, and thereby naturally diminishing the need for stopword mechanisms is a nice side effect of dropping index resolution.
Trading CPU for IOPS
I’m reticent to mention compression as a method of improving performance for fear of being flamed as a heretic, but I think it has to be mentioned. I know disk is cheap, but IOPS aren’t. Getting adequate IOPS is often among the biggest considerations (and cost) of the hardware for security event data stores (RAM used for disk caching is usually up there too). Huge sequentially accessible gzipped files are expensive to use, especially to retrieve a small number of records. But compression doesn’t have to be that way. You can use smaller blocks to get good compression ratios and reasonable random access. Considering only simple tweaks to gzip, pigz and dictzip address the single threaded and random access limitations of standard gzip.
As an oversimplified example, imagine you have a server with a disk drive that provides 150 MB/s sequential read rate, CPU cores that do decompression at about 20 MB/s (compressed size) per core/thread, and data that compresses with a compression rate of 5:1. If you want to do analytics, say map/reduce or zgrep/awk, on the whole data store, you are better off using compression if you can dedicate 2 or more cores to decompression. If you can dedicate 8 cores, you are going to be able to stream at 750 MB/s instead of 150 MB/s.
The CPU/IOPS tradeoff is not just about improving sequential compressed log access, that is just a simple example that everyone who has ever zgreped text logs understands. A better example is the very compact indexes created using perfect hash functions such as done in discodb which require relatively high CPU but low IOPS to query.
Security event data has special characteristics that enable various optimizations. For the data stores that do not yet make these techniques easy, there is an opportunity to better cater to the security log use case. MongoR demonstrates temporal data partitioning that keeps the working set in memory resulting in scalable, fast inserts and efficient deletes.