Testing PDF content with PHP and Behat

If you have a PDF generation functionality in your app, and since most of the libraries out there build the PDF content in an internal structure before outputting it to the file system (FPDF, TCPDF). A good way to write a test for it is to test the output just before the rendering process.

Recently however, and due to this process being a total pain in the ass, people switched to using tools like wkhtmltopdf or some of its PHP wrappers (phpwkhtmltopdfsnappy) that let you build your pages in html/css and use a browser engine to render the PDF for you, and while this technique is a lot more developer friendly, you loose control over the building process.

So if you’re using one of those tools or just need to test for the existence of some string inside a PDF, here’s how to write a BDD style acceptance test for it using Behat.

Setup

Add this your composer.json then run composer install

{
    "minimum-stability": "dev",
    "require": {
        "smalot/pdfparser": "*",
        "behat/behat": "3.*@stable",
        "behat/mink": "1.6.*@stable",
        "phpunit/phpunit": "4.*"
    },
    "config": {
        "bin-dir": "bin/"
    }
}

Initialize Behat

bin/behat --init

This command creates the initial features directory and a blank FeatureContext class.

If everything worked as expected, your project directory should look like this :

├── bin
│   ├── behat -> ../vendor/behat/behat/bin/behat
│   └── phpunit -> ../vendor/phpunit/phpunit/phpunit
├── composer.json
├── composer.lock
├── features
│   └── bootstrap
└── vendor
    ├── autoload.php
    ├── behat
    ├── composer
    ├── doctrine
    ├── phpdocumentor
    ├── phpspec
    ├── phpunit
    ├── sebastian
    ├── smalot
    ├── symfony
    └── tecnick.com

All right, it’s time to create some features, create a new file inside /feature, I’ll name mine pdf.feature

Feature: Pdf export
 
  Scenario: PDF must contain text
    Given I have pdf located at "samples/sample1.pdf"
    When I parse the pdf content
    Then the the page count should be "1"
    Then page "1" should contain
    """
Document title  Calibri : Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    """

Run Behat (I know we didn’t write any testing code yet, just run it, trust me !)

bin/behat

An awesome feature of Behat is it detects any missing steps and provides you with boilerplate code you can use in your FeatureContext. This is the output of the last command:

Feature: Pdf export
 
  Scenario: PDF must contain text                     # features/pdf.feature:3
    Given I have pdf located at "samples/sample1.pdf"
    When I parse the pdf content
    Then the the page count should be "1"
    Then page "1" should contain
      """
      Document title  Calibri : Lorem ipsum dolor sit amet, consectetur adipiscing elit.
      """
 
1 scenario (1 undefined)
4 steps (4 undefined)
0m0.01s (9.28Mb)
 
--- FeatureContext has missing steps. Define them with these snippets:
 
    /**
     * @Given I have pdf located at :arg1
     */
    public function iHavePdfLocatedAt($arg1)
    {
        throw new PendingException();
    }
 
    /**
     * @When I parse the pdf content
     */
    public function iParseThePdfContent()
    {
        throw new PendingException();
    }
 
    /**
     * @Then the the page count should be :arg1
     */
    public function theThePageCountShouldBe($arg1)
    {
        throw new PendingException();
    }
 
    /**
     * @Then page :arg1 should contain
     */
    public function pageShouldContain($arg1, PyStringNode $string)
    {
        throw new PendingException();
    }

Cool right ? copy/paste the method definitions to you FeatureContext.php and let’s get to it, step by step :

Step 1

Given I have pdf located at "samples/sample1.pdf"

In this step we only need to make sure the filename we provided is readable then store it in a class property so we can use it in later steps :

 /**
     * @Given I have pdf located at :filename
     */
    public function iHavePdfLocatedAt($filename)
    {
        if (!is_readable($filename)) {
            Throw new \InvalidArgumentException(
                sprintf('The file [%s] is not readable', 
                $filename)
            );
        }
 
        $this->filename = $filename;
    }

Step 2

When I parse the pdf content

The heavy lifting is done here, we need to parse the PDF and store its content and metadata in a usable format:

    /**
     * @When I parse the pdf content
     */
    public function iParseThePdfContent()
    {
        $parser = new Parser();
        $pdf    = $parser->parseFile($this->filename);
        $pages  = $pdf->getPages();
        $this->metadata = $pdf->getDetails();
 
        foreach ($pages as $i => $page) {
            $this->pages[++$i] = $page->getText();
        }
    }

Step 3

Then the the page count should be "1"

since we already know how many pages the PDF contains, this is a piece of cake, so let’s not reinvent the wheel and use PHPUnit assertions :

 /**
     * @Then the the page count should be :pageCount
     * @param int $pageCount
     */
    public function theThePageCountShouldBe($pageCount)
    {
        PHPUnit_Framework_Assert::assertEquals( 
            (int) $pageCount, 
            $this->metadata['Pages']
        );
    }

Step 4

Then page "1" should contain
    """
Document title  Calibri : Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    """

Same method, we have an array containing all content from all pages, a quick assertion does the trick:

    /**
     * @Then page :pageNum should contain
     * @param int $pageNum
     * @param PyStringNode $string
     */
    public function pageShouldContain($pageNum, PyStringNode $string)
    {
        PHPUnit_Framework_Assert::assertContains(
            (string) $string, 
            $this->pages[$pageNum]
        );
    }

Et voilà ! you should have green

Feature: Pdf export
 
  Scenario: PDF must contain text                     # features/pdf.feature:3
    Given I have pdf located at "samples/sample1.pdf" # FeatureContext::iHavePdfLocatedAt()
    When I parse the pdf content                      # FeatureContext::iParseThePdfContent()
    Then the the page count should be "1"             # FeatureContext::theThePageCountShouldBe()
    Then page "1" should contain                      # FeatureContext::pageShouldContain()
      """
      Document title  Calibri : Lorem ipsum dolor sit amet, consectetur adipiscing elit.
      """
 
1 scenario (1 passed)
4 steps (4 passed)

For the purpose of this article, we’re relying on the PDF parser library which has many encoding and white space issues, feel free to use any PHP equivalent or a system tool like xpdf for better results.

If you want to make your test more decoupled (and you should). One way is to create a PDFExtractor interface then implement it for each tool you want to use, that way you can easily swap libraries.

The source code behind this article is provided here, any feedback is most welcome.

Source: matmati.net:80/testing-pdf-with-behat-and-php

Comments

  1. Post
    Author
    sasha

    I think the example is focusing on a fairly common application component like a PDF Reading library, and writing some pretty generic tests that would likely be covered in the original library, strictly for the purposes of outlining how Behavior Driven Development works.
    You wouldn’t necessarily be writing tests for a PDF library that your application uses in the wild just to see if the library works. As long as you’ve researched your third party libraries appropriately and ran all of its tests yourself, it should be a safe assumption that the library you choose does work in the way you need it to.
    What should be tested is your application’s utilization of that library through its service classes. So in the case of something like application-specific invoices being generated in PDF format, I could see new tests being created for those custom service classes as an exercise in CYA more than anything.

Leave a Reply

Your email address will not be published.