Saturday, October 27, 2012

PDFrate Update: API and Community Classifier

I am very pleased with the activity on pdfrate.com in the last few weeks. There have been a good number of visitors and some really good submissions. I’m impressed by the number of targeted PDFs that were submitted, and I’m happy with PDFrate’s ability to classify them. I really appreciate those who have taken the time to label their submissions (assuming they know whether they are malicious or not) so that the service can be improved through better training data.

There is now an API for retrieval of scan results. See the API section for more details, but as an example, you can view the report (JSON) for the Yara 1.6 manual.

This API may be unconventional, but I do like how easy it is to get scan results. You submit a file and get the JSON object back synchronously. I’ve split the metadata out from the scan results for a couple of reasons. First, the metadata can be very large. Second, the metadata is currently presented as a text blob, and I wasn’t sure how people would want it stuffed into JSON. If you want both, you have to make two requests. You can also view the metadata blob for the Yara 1.6 manual.
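As a rough sketch of the two-request pattern described above, here is how a client might pull the scan report and the metadata blob separately. The base URL, the `/report/` and `/metadata/` paths, the `file_id` identifier, and the function names are all assumptions made for illustration; they are not the documented PDFrate endpoints, which are listed in the API section.

```python
import json
from urllib.request import urlopen

# Hypothetical base URL and paths, for illustration only.
# The real endpoints are described in the API section.
BASE_URL = "https://pdfrate.example/api"


def http_get(url):
    """Fetch a URL and return the response body as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def get_report(file_id, fetch=http_get):
    """Request the scan results, which come back as a JSON object."""
    return json.loads(fetch(f"{BASE_URL}/report/{file_id}"))


def get_metadata(file_id, fetch=http_get):
    """Request the metadata separately; it is a text blob, not JSON."""
    return fetch(f"{BASE_URL}/metadata/{file_id}")
```

The `fetch` parameter is only there so the HTTP call can be swapped out for testing. A caller who wants both pieces would call `get_report` and `get_metadata` in turn.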

I’m happy that there have already been enough submissions, including ones that weren’t classified well by the existing data sets, that I’ve generated a community classifier based on PDFrate.com user submissions and voting. I’m thrilled that there were submissions matching categories of malicious PDFs that I know are floating around but simply aren’t in the existing data sets. I expect that if the current submission rate stays the same or goes up, the community classifier will become the most accurate classifier, because it will contain fresher and more relevant training data. Again, as an example, you can check out the report for the Yara 1.6 manual, which now includes a score from the community classifier.

If a submission had votes before Oct 25th, it was included in the community classifier. Some users will note that even though they themselves did not vote on their submissions, they have votes. I reviewed many interesting submissions and placed votes on them so that they could be included in the community classifier. I decided not to do a bulk rescan of all documents already submitted. This wasn’t for technical reasons: the ratings are computed solely from the previously extracted metadata, so they are very fast. Rather, I didn’t want to provide potentially deceptive results to users. If a document is in the training set, it is generally considered an unfair test to use the resulting classifier on it, as the classifier will almost always provide good results. Regardless, if you want to have a submission re-scanned, just submit the file again.
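To illustrate the fairness point, a minimal sketch of keeping training documents out of an evaluation set might look like the following. The `id` field and the function name are hypothetical, and this is not how PDFrate itself is implemented; it only shows the general idea of scoring a classifier on documents it was not trained on.

```python
def held_out(submissions, training_ids):
    """Return only the submissions that were not used for training."""
    training_ids = set(training_ids)  # set lookup keeps the filter fast
    return [s for s in submissions if s["id"] not in training_ids]
```

Scoring only the documents returned by `held_out` avoids the overly optimistic results you would get from testing a classifier on its own training data.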

Again, I’m pleased with PDFrate so far. I hope this service continues to improve and that it provides value to the community.
