I’m excited to announce PDFrate: a website that provides malicious document identification using machine learning based on metadata and structural features. The gory details of the underlying mechanisms will be presented at ACSAC 2012.
I’ve been working on this research since 2009, which was a year where the stream of PDF 0-days being leveraged by targeted attackers was nearly unbroken. I’ve refined the underlying techniques to a place where they are very effective in real operations and are addressed rigorously enough for academic acceptance. Note that I originally designed this for the purpose of detecting APT malicious documents but have found it to be largely effective on broad based crimeware PDFs also. Furthermore, it is pretty effective at distinguishing between the two. I can speak from personal experience that mechanisms underlying PDFrate provide a strong compliment to signature and dynamic analysis detection mechanisms.
Those that are interested should head over to the pdfrate site and check out the “about” page in particular which explains the mechanisms and points to some good examples.
PDFrate demonstrates a well refined mechanism for detecting malicious documents. This currently operates on PDF documents. I am close to extending this to office documents. But I see this paradigm extending much farther than just malicious documents. I see wise (and deep) selection of features and machine learning being effective for many things other things such as emails, network transactions such as HTTP, web pages, and other file formats such as SWF and JAR.
I’m happy to provide the PDFrate service to the community so that others can leverage (and critique) this mechanism. Providing this as a service is a really good way for others to be able to use it because it removes a lot of the difficulty of implementation and configuration, the hardest part of which is collecting and labeling a training set. High quality training data is critical for high quality classification and this data is often hard for a single organization/individual to compile. While the current data sets/classifiers provided on the site are fine for detecting similar attacks, there is room for improvement and generalization which I hope will come from community submissions and ratings. So please vote on submissions, malicious or not, as this will speed the development and evolution of a community driven classifier. This service could benefit from some additional recent targeted PDFs.
In addition to the classification that PDFrate provides, it also provides one of the best document metadata extraction capabilities that I’ve seen. While there are many tools for PDF analysis, the metadata and structure extraction capabilities used by PDFrate provide a great mix of speed, simplicity, robustness, saliency, and transparency. Even if you aren’t sold on using PDFrate for classification, you might see if you like the metadata it provides. Again, the about provides illustrative examples.
I hope this service is useful to the community. I look forward to describing in depth in December at ACSAC!