There are many ways to analyze malware. In this blog post, we illustrate a typical analysis method: comparing an unknown sample with a known sample to determine whether the unknown sample is malicious.
During one of our engagements, we came across a PDF document that triggered our anti-virus. What intrigued us was that the document had the same title and almost the same size as another document we knew to be benign. Did our anti-virus find a trojanized document? Let’s find out!
Usually, when performing PDF document analysis, PDFiD is the starting place: it gives us an idea of what we can expect to find inside the document. We start there too, but we will not examine the PDFiD report itself in detail.
First, we will compare the reports for the known and unknown sample:
Comparing the two reports with diffdump.py tells us that these reports are identical (except for the filename, which is included in the report). We can go one step further, by using option -a to generate a report for all names found in the PDF, instead of only names that might indicate malicious behavior.
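The comparison of the two reports can be sketched in a few lines of Python. This is a hypothetical illustration, not the implementation of diffdump.py: it assumes the reports are plain text and that the analyzed filename appears only on the report's header line (which, for PDFiD, starts with the tool name).

```python
import difflib

def diff_reports(report_a: str, report_b: str) -> list[str]:
    """Return the differing lines between two PDFiD reports,
    skipping the header line that contains the analyzed filename."""
    lines_a = [l for l in report_a.splitlines() if not l.startswith("PDFiD")]
    lines_b = [l for l in report_b.splitlines() if not l.startswith("PDFiD")]
    return [d for d in difflib.unified_diff(lines_a, lines_b, lineterm="")
            if d.startswith(("+", "-")) and not d.startswith(("+++", "---"))]
```

An empty result means the reports are identical apart from the filename, exactly the situation we encountered with our two documents.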
And we have the same result: the reports are identical. From a lexical PDF language point of view, these documents are identical. But we know they are not, they have a different cryptographic hash and one triggers our anti-virus, while the other does not.
Time to dig a bit deeper into the syntax and semantics of these documents with the help of pdf-parser. pdf-parser has a little-known option (-a) to calculate statistics of the elements and objects found inside a PDF document:
Comparing the statistics for our two samples reveals that they are identical:
This is important information: the fact that our sample and the original document have an identical number and type of elements and objects is a strong indication that they are related.
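The idea behind such a statistical comparison can be approximated with a simple counter. This is a rough sketch, not what pdf-parser actually computes: it just tallies PDF names (like /Page or /Font) found in the raw bytes, which is enough to show why two byte-wise different files can still have identical statistics.

```python
import re
from collections import Counter

def pdf_name_stats(data: bytes) -> Counter:
    """Count occurrences of PDF names (e.g. /Page, /Font) in raw PDF bytes.
    A crude approximation of pdf-parser's statistics."""
    return Counter(re.findall(rb"/[A-Za-z]+", data))
```

Because name tokens never contain end-of-line characters, a document and its end-of-line-converted twin produce identical statistics, just as we observed.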
To try to learn more about the differences, we let pdf-parser produce a full report for both documents:
Here again, the reports are practically identical, except for some characters at the beginning of the reports. This is our first important clue as to the difference between these documents. When we compare the comments at the beginning of these files, we notice they are identical except for the end-of-line characters: \n for our sample and \r\n for our original:
On Windows, the end-of-line is defined with 2 characters: carriage-return + newline (0x0D 0x0A or \r\n). On Linux, it is a single character: newline (0x0A or \n).
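A small helper can make this check explicit. This is a sketch we use for illustration (the function name is ours, not from any tool): it counts both sequences at the byte level, taking care not to count a \r\n pair as a bare \n.

```python
def eol_style(data: bytes) -> str:
    """Guess the dominant end-of-line convention of a byte stream."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # bare \n, not part of a \r\n pair
    if crlf > lf:
        return "CRLF (\\r\\n, Windows)"
    return "LF (\\n, Unix)"
```

Running such a check on the first bytes of each document confirms the observation: the original uses \r\n, the sample uses \n.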
Maybe our sample is a version of the original document that was somehow processed on a Linux machine, resulting in a change of end-of-line character. Time to define and test a hypothesis: the documents are identical, except for the end-of-line characters.
We use the stream editor sed to replace all \r\n instances in our original document with \n, and then we compare the sample with our transformed original:
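The same test can be done in Python instead of sed. The sketch below (with hypothetical variable names) normalizes the end-of-line characters of both files and compares their SHA-256 hashes; note that blindly replacing \r\n inside a binary PDF could in principle corrupt compressed streams, so a match must be verified, as we did with a byte-wise comparison.

```python
import hashlib

def normalize_eol(data: bytes) -> bytes:
    """Replace every CRLF (\\r\\n) with LF (\\n)."""
    return data.replace(b"\r\n", b"\n")

def same_after_eol_conversion(original: bytes, sample: bytes) -> bool:
    """True when two documents differ only in end-of-line characters."""
    digest_a = hashlib.sha256(normalize_eol(original)).hexdigest()
    digest_b = hashlib.sha256(normalize_eol(sample)).hexdigest()
    return digest_a == digest_b
```

In our case, this comparison succeeds: after normalization, both documents hash to the same value.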
The files are identical!
With this analysis, we show that both documents are identical, except for the end-of-line character(s). We trust our original not to be malicious, and since we can convert our original into the sample with a simple end-of-line conversion, we can only conclude that the sample cannot be malicious. Often, a differential analysis will not be so clear-cut; nevertheless, it is an important method in the arsenal of the reverse engineer.
We have reported this false positive to our anti-virus vendor. The original sample is a private document that we will not share. We don’t know why exactly this sample triggered a false positive.
Want to learn more? Please do join us at the upcoming BruCON training on malicious documents, which was authored by NVISO’s experts!
About the author
Didier Stevens is a malware expert working for NVISO. Didier is a SANS Internet Storm Center senior handler and Microsoft MVP, and has developed numerous popular tools to assist with malware analysis. You can find Didier on Twitter and LinkedIn.