How can wkhtmltopdf be used without introducing a security vulnerability?


Background

Per the project website, wkhtmltopdf is a "command line tool to render HTML into PDF using the Qt WebKit rendering engine. It runs entirely "headless" and does not require a display or display service."

The website also states that "Qt 4 (which wkhtmltopdf uses) hasn’t been supported since 2015, the WebKit in it hasn’t been updated since 2012."

And finally, it makes the recommendation "Do not use wkhtmltopdf with any untrusted HTML – be sure to sanitize any user-supplied HTML/JS, otherwise it can lead to complete takeover of the server it is running on!"


Context

My intention is to provide wkhtmltopdf as part of an application to be installed on a Windows computer. This may or may not be relevant to the question.


Qualifiers / Additional Information

  • wkhtmltopdf provides a flag to disable JavaScript (--disable-javascript). This question assumes that the flag works correctly, so all <script> tags are treated as benign and are of no concern.
  • This question is not related to the invocation of wkhtmltopdf – the source HTML will be provided via a file (not the CLI / STDIN) and the actual command to run wkhtmltopdf has no chance of being vulnerable.
  • Specifically, this question relates to "untrusted HTML" and "sanitize any user-supplied HTML/JS".
  • Any malicious user that is able to send "untrusted" HTML to this application will not receive the resultant PDF back. That PDF will only temporarily exist for the purpose of printing and then be immediately deleted.
  • Even someone with complete working knowledge of the wkhtmltopdf/WebKit/Qt source code cannot state with certainty that the code has zero vulnerabilities, or how to safeguard against unknown ones. This question is not seeking guarantees, just an understanding of the known approaches to compromising this or similar software.
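
To make the assumptions above concrete, here is a rough sketch of how I picture the invocation. The function names, file paths and timeout are hypothetical, and the hardening flags beyond --disable-javascript are ones I believe exist in the 0.12.x releases; they should be checked against the --help output of the bundled binary.

```python
import subprocess


def build_wkhtmltopdf_command(html_path, pdf_path, exe="wkhtmltopdf"):
    """Return an argv list for wkhtmltopdf (hypothetical helper).

    Passing a list rather than a single string means no shell is
    involved, so the HTML file name is never interpreted as a command.
    """
    return [
        exe,
        "--disable-javascript",         # assumed to work, per the bullet above
        "--disable-local-file-access",  # assumption: blocks file:// reads from the page
        "--disable-plugins",            # assumption: no plugin content is loaded
        "--disable-external-links",     # assumption: no clickable remote links in the PDF
        "--no-images",                  # only if images can be given up entirely
        html_path,
        pdf_path,
    ]


def render(html_path, pdf_path, timeout_seconds=30):
    """Run the conversion with a time limit so a hostile page cannot hang it."""
    cmd = build_wkhtmltopdf_command(html_path, pdf_path)
    subprocess.run(cmd, check=True, timeout=timeout_seconds)
```

None of these flags address memory-corruption bugs in the renderer itself; they only narrow what a well-behaved renderer is asked to do.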

Questions

What is the goal of sanitization in this context? Is the goal to guard against unexpected external resources? (e.g. <iframe>, <img>, <link> tags). Or are there entirely different classes of vulnerabilities that we can’t even safely enumerate? For instance, IE6 could be crashed with a simple line of HTML/CSS… could some buffer overflow exist that causes this old version of WebKit to be vulnerable to code injection?

What method of sanitizing should be employed? Should we whitelist HTML tags/attributes and CSS properties/values? Should we remove all references to external URI protocols (http, https, ftp, etc.)?
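
As an illustration of the allowlist idea, here is a minimal sketch using only Python's standard html.parser module. The tag and attribute sets are placeholders I made up for the example, and this is not a vetted sanitizer: it drops every tag and attribute that is not explicitly allowed (including src, href and style, so no URI of any protocol survives), discards the contents of <script> and <style>, and re-escapes all text.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlists: adjust to what the documents actually need.
ALLOWED_TAGS = {
    "p", "br", "b", "i", "u", "em", "strong", "h1", "h2", "h3",
    "ul", "ol", "li", "table", "thead", "tbody", "tr", "th", "td",
    "span", "div",
}
ALLOWED_ATTRS = {"colspan", "rowspan"}  # no src, href, style or on* handlers
VOID_TAGS = {"br"}                      # tags that never get a closing tag
DROP_CONTENT = {"script", "style"}      # remove these tags and their contents


class AllowlistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in ALLOWED_ATTRS
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in ALLOWED_TAGS or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Text was unescaped by the parser, so escape it again on output.
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    """Return html_text reduced to allowlisted tags, attributes and text."""
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

For example, sanitize('<p onclick="x()">Hi <img src="http://example.invalid/x.png"><b>there</b></p>') keeps only the paragraph, the bold element and the text. Whether reducing the input this way is enough to protect an old WebKit from parser or layout bugs is exactly what I am unsure about.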

Does rendering of images in general provide an attack surface? Perhaps the document contains an inline/data-uri image whose contents are somehow malicious but this cannot reasonably be detected by an application whose scope is to simply trade HTML for a rendered PDF. Do images need to be disabled entirely to safely use wkhtmltopdf?