PDF files are everywhere. The story of this article begins with a fellow CS student asking for the easiest way to convert a number of PNG images into one PDF file. That was in 2010 and led to me reading (parts of) Adobe's PDF specification. After understanding the basic PDF structure I used that knowledge to write a tool for some everyday tasks regarding PDFs. There's just one feature which I never implemented: converting a number of PNG images into one PDF.


Of course, there are the obvious applications for a PDF manipulation tool: joining PDF files, extracting pages from a given file, rotating pages, et cetera. But the PDFtool also features some not-so-standard functions, which I'd like to present in the following sections.

Splitting pages

Imagine you scan a book with an industrial-grade office scanner. As a result you have hundreds of pages, each of which shows a double page. That is nice -- but wouldn't it be way more handy to have one book page per PDF page? This is handy in more than one regard: It's somewhat easier to read the scan on a screen, and this is even more true for smaller screens -- think eBook readers and Pod-Pad-things. But wait, there's more: This is especially handy when creating a reprint of the scan, for now all it needs is a PDF brochure print. And believe me, a reproduction of a rare book can be a nice-slash-grotesque gift :)
For more info on the splitpage option, check out pdftool/use_splitpage.html.
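To make the idea concrete, here is a minimal sketch of the geometry behind such a split. It assumes that a double-page scan is split by showing the same page twice, each time with a page box that covers only one half. Whether the PDFtool does it exactly this way is an assumption, and the function name split_double_page is made up for this illustration.

```python
def split_double_page(media_box):
    """Return (left_box, right_box) for a double-page scan.

    media_box is (llx, lly, urx, ury) in PDF units, the same order a
    /MediaBox array uses. The split is a plain vertical cut through the
    middle; a real scan may need an offset for the book's gutter.
    """
    llx, lly, urx, ury = media_box
    mid = (llx + urx) / 2
    left_box = (llx, lly, mid, ury)
    right_box = (mid, lly, urx, ury)
    return left_box, right_box


# A landscape A4 scan (842 x 595 units) holding two portrait book pages.
left, right = split_double_page((0, 0, 842, 595))
print(left)   # (0, 0, 421.0, 595)
print(right)  # (421.0, 0, 842, 595)
```

Each of the two boxes would then become the visible area of one output page.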

Side note on industrial-grade office scanners: They save pictures in a CCITT-compatible format, i.e., as if the page were a telefax. If it's hard b/w, one bit is used per pixel, nicely combined with a classic RLE. However, I never got into that level of detail.
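For the curious, the run-length idea can be sketched in a few lines. This is only the first step of a fax-style encoding: it turns a row of 1-bit pixels into alternating white and black run lengths, starting with a white run as the fax codes do. The actual CCITT formats then replace those lengths with variable-length code words, which is not shown here. The function name run_lengths and the convention 0 = white, 1 = black are choices made for this example.

```python
from itertools import groupby


def run_lengths(row):
    """Alternating white/black run lengths of a 1-bit pixel row.

    Convention assumed here: 0 = white, 1 = black. The list always
    starts with a white run, so a row that begins with black gets a
    leading white run of length 0.
    """
    runs = [(bit, len(list(group))) for bit, group in groupby(row)]
    lengths = [length for _, length in runs]
    if runs and runs[0][0] == 1:
        lengths.insert(0, 0)
    return lengths


print(run_lengths([0, 0, 0, 1, 1, 0, 0, 0, 0, 1]))  # [3, 2, 4, 1]
print(run_lengths([1, 1, 0]))                       # [0, 2, 1]
```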

Brochure print

Before PDF readers or printer dialogs started offering the brochure print feature, preparing this kind of print was a royal pain in the pooper (remember: I started writing this tool in 2010, back then running Windows XP was still considered somewhat normal, Osama Bin Laden was still alive, news anchors around the world struggled to pronounce "Eyjafjallajökull", and Wikileaks published some thrilling documents leaked by Bradley Manning who is already back to civilian life now that I write these lines -- WTF, this "project" has been idling in my lap longer than Manning needed to betray his country, get sentenced for treason and get a "Get out of jail and piss off all the others who serve their sentence"-card, holy shit).
Anyway, my point: Back in the day a brochure-print feature was not very common, as double-sided printing wasn't common either. Thus the PDFtool features a preprocessing helper which essentially does page reordering and inserting of empty pages. Check out the brochure option at pdftool/use_brochure.html.
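The reordering itself is the classic booklet imposition, sketched below. It assumes two book pages per sheet side, sheets that are stacked and folded once in the middle, and a page count padded with empty pages to a multiple of four. The function name brochure_order is hypothetical and the PDFtool's own ordering may differ in detail, for instance depending on how the printer flips the sheet.

```python
def brochure_order(page_count):
    """Page order for a folded brochure, two pages per sheet side.

    Returns 1-based page numbers in print order; None marks an empty
    page inserted as padding. Every group of four entries is one sheet:
    front left, front right, back left, back right.
    """
    padded = -(-page_count // 4) * 4  # round up to a multiple of 4

    def page(n):
        return n if n <= page_count else None

    order = []
    low, high = 1, padded
    while low < high:
        order += [page(high), page(low), page(low + 1), page(high - 1)]
        low += 2
        high -= 2
    return order


print(brochure_order(8))  # [8, 1, 2, 7, 6, 3, 4, 5]
print(brochure_order(6))  # [None, 1, 2, None, 6, 3, 4, 5]
```

With six pages, the two None entries are the empty pages that end up at the back of the brochure.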

Security-relevant operations: Hiding files in PDFs, hiding pages in PDFs

PDF files are essentially containers for arbitrary objects. These objects can form a PDF document with pages, but they don't have to. That is, a file may contain unreferenced pages -- things which would be a page if, sweet if, they were referenced in the list of pages contained in the document. Or it may just contain arbitrary files camouflaged as stream objects. I implemented both options as they allow you to test security scanners in two regards: a) will malicious files hidden as an object in a PDF file be found? b) will text on invisible pages be found correctly (think data leakage prevention)?
While embedding a file needs some preparation, unlinking a page from the page tree is manageable with almost any binary-safe editor -- just replace the page's object reference in the page tree with spaces (keeping the byte length unchanged, so there's no need to rebuild the XREF section) and the page has "disappeared".
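The editor trick can also be scripted. The following is a minimal sketch under several assumptions: the page tree is flat (only the first /Kids array in the file is touched), the page tree object is stored uncompressed, and the /Count entry is left alone, just as in the manual procedure described above. How tolerant a given viewer is of that is not something this sketch checks. The function name unlink_page is made up for this example.

```python
import re


def unlink_page(pdf_bytes, obj_num, gen_num=0):
    """Blank out one page reference in the first /Kids array.

    The reference "N G R" is overwritten with the same number of
    spaces, so every byte offset in the file stays where it was.
    """
    ref = b"%d %d R" % (obj_num, gen_num)
    kids = re.search(rb"/Kids\s*\[([^\]]*)\]", pdf_bytes)
    if kids is None:
        raise ValueError("no /Kids array found")
    start, end = kids.span(1)
    # The lookbehind keeps "5 0 R" from matching inside "15 0 R".
    pattern = rb"(?<!\d)" + re.escape(ref)
    blanked, hits = re.subn(pattern, b" " * len(ref),
                            pdf_bytes[start:end], count=1)
    if hits == 0:
        raise ValueError("reference not found in /Kids")
    return pdf_bytes[:start] + blanked + pdf_bytes[end:]


sample = b"2 0 obj\n<< /Type /Pages /Kids [3 0 R 15 0 R 5 0 R] /Count 3 >>\nendobj\n"
hidden = unlink_page(sample, 5)
print(len(hidden) == len(sample))  # True
```

The page object itself stays in the file; it is merely no longer listed in the page tree, which is exactly what makes it interesting for testing scanners.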

Further Reading

Since the pdftool documentation is a bit more extensive, I gave it its own directory -- check out pdftool/index.html. There you will find a command line overview and downloads of source and binaries.

Inspired by aSheep.