Software Development
Testing PDF content with PHP and Behat
If you have a PDF generation functionality in your app, and since most of the libraries out there build the PDF content in an internal structure before outputting it to the file system (FPDF, TCPDF). A good way to write a test for it is to test the output just before the rendering process.
Recently however, and due to this process being a total pain in the ass, people switched to using tools like wkhtmltopdf or some of its PHP wrappers (phpwkhtmltopdf, snappy) that let you build your pages in html/css and use a browser engine to render the PDF for you, and while this technique is a lot more developer friendly, you loose control over the building process.
So if you’re using one of those tools or just need to test for the existence of some string inside a PDF, here’s how to write a BDD style acceptance test for it using Behat.
Setup framework
Add this your composer.json then run composer install
{
"minimum-stability": "dev",
"require": {
"smalot/pdfparser": "*",
"behat/behat": "3.*@stable",
"behat/mink": "1.6.*@stable",
"phpunit/phpunit": "4.*"
},
"config": {
"bin-dir": "bin/"
}
}
Initialize Behat
bin/behat --init
This command creates the initial features directory and a blank FeatureContext class.
If everything worked as expected, your project directory should look like this :
├── bin
│ ├── behat -> ../vendor/behat/behat/bin/behat
│ └── phpunit -> ../vendor/phpunit/phpunit/phpunit
├── composer.json
├── composer.lock
├── features
│ └── bootstrap
└── vendor
├── autoload.php
├── behat
├── composer
├── doctrine
├── phpdocumentor
├── phpspec
├── phpunit
├── sebastian
├── smalot
├── symfony
└── tecnick.com
All right, it’s time to create some features, create a new file inside /feature, I’ll name mine pdf.feature
Feature: Pdf export
Scenario: PDF must contain text
Given I have pdf located at "samples/sample1.pdf"
When I parse the pdf content
Then the the page count should be "1"
Then page "1" should contain
"""
Document title Calibri : Lorem ipsum dolor sit amet, consectetur adipiscing elit.
"""
Run Behat (I know we didn’t write any testing code yet, just run it, trust me!)
bin/behat
An awesome feature of Behat is it detects any missing steps and provides you with boilerplate code you can use in your FeatureContext. This is the output of the last command:
Feature: Pdf export
Scenario: PDF must contain text # features/pdf.feature:3
Given I have pdf located at "samples/sample1.pdf"
When I parse the pdf content
Then the the page count should be "1"
Then page "1" should contain
"""
Document title Calibri : Lorem ipsum dolor sit amet, consectetur adipiscing elit.
"""
1 scenario (1 undefined)
4 steps (4 undefined)
0m0.01s (9.28Mb)
--- FeatureContext has missing steps. Define them with these snippets:
/**
* @Given I have pdf located at :arg1
*/
public function iHavePdfLocatedAt($arg1)
{
throw new PendingException();
}
/**
* @When I parse the pdf content
*/
public function iParseThePdfContent()
{
throw new PendingException();
}
/**
* @Then the the page count should be :arg1
*/
public function theThePageCountShouldBe($arg1)
{
throw new PendingException();
}
/**
* @Then page :arg1 should contain
*/
public function pageShouldContain($arg1, PyStringNode $string)
{
throw new PendingException();
}
Cool right? copy/paste the method definitions to you FeatureContext.php and let’s get to it, step by step :
Step 1
Given I have pdf located at "samples/sample1.pdf"
In this step we only need to make sure the filename we provided is readable then store it in a class property so we can use it in later steps:
/**
* @Given I have pdf located at :filename
*/
public function iHavePdfLocatedAt($filename)
{
if (!is_readable($filename)) {
Throw new \InvalidArgumentException(
sprintf('The file [%s] is not readable',
$filename)
);
}
$this->filename = $filename;
}
Step 2
When I parse the pdf content
The heavy lifting is done here, we need to parse the PDF and store its content and metadata in a usable format:
/**
* @When I parse the pdf content
*/
public function iParseThePdfContent()
{
$parser = new Parser();
$pdf = $parser->parseFile($this->filename);
$pages = $pdf->getPages();
$this->metadata = $pdf->getDetails();
foreach ($pages as $i => $page) {
$this->pages[++$i] = $page->getText();
}
}
Step 3
Then the the page count should be "1"
Since we already know how many pages the PDF contains, this is a piece of cake, so let’s not reinvent the wheel and use PHPUnit assertions:
/**
* @Then the the page count should be :pageCount
* @param int $pageCount
*/
public function theThePageCountShouldBe($pageCount)
{
PHPUnit_Framework_Assert::assertEquals(
(int) $pageCount,
$this->metadata['Pages']
);
}
Step 4
Then page "1" should contain
"""
Document title Calibri : Lorem ipsum dolor sit amet, consectetur adipiscing elit.
"""
Same method, we have an array containing all content from all pages, a quick assertion does the trick:
/**
* @Then page :pageNum should contain
* @param int $pageNum
* @param PyStringNode $string
*/
public function pageShouldContain($pageNum, PyStringNode $string)
{
PHPUnit_Framework_Assert::assertContains(
(string) $string,
$this->pages[$pageNum]
);
}
Et voilà! you should have green
Feature: Pdf export
Scenario: PDF must contain text # features/pdf.feature:3
Given I have pdf located at "samples/sample1.pdf" # FeatureContext::iHavePdfLocatedAt()
When I parse the pdf content # FeatureContext::iParseThePdfContent()
Then the the page count should be "1" # FeatureContext::theThePageCountShouldBe()
Then page "1" should contain # FeatureContext::pageShouldContain()
"""
Document title Calibri : Lorem ipsum dolor sit amet, consectetur adipiscing elit.
"""
1 scenario (1 passed)
4 steps (4 passed)
For the purpose of this article, we’re relying on the PDF parser library which has many encoding and white space issues, feel free to use any PHP equivalent or a system tool like xpdf for better results.
If you want to make your test more decoupled (and you should). One way is to create a PDFExtractor interface then implement it for each tool you want to use, that way you can easily swap libraries.
The source code behind this article is provided here, any feedback is most welcome.
Source: matmati.net:80/testing-pdf-with-behat-and-php
-
Marketing Tips2 days ago
What is my Instagram URL? How to Find & Copy Address [Guide on Desktop or Mobile]
-
Business Imprint4 days ago
About Apple Employee and Friends&Family Discount in 2024
-
App Development3 days ago
How to Unlist your Phone Number from GetContact
-
News5 days ago
Open-Source GPT-3/4 LLM Alternatives to Try in 2024
-
Crawling and Scraping5 days ago
Comparison of Open Source Web Crawlers for Data Mining and Web Scraping: Pros&Cons
-
Grow Your Business2 days ago
Best Instagram-like Apps and their Features
-
Grow Your Business5 days ago
How to Become a Prompt Engineer in 2024
-
Marketing Tips2 days ago
B2B Instagram Statistics in 2024
I think the example is focusing on a fairly common application component like a PDF Reading library, and writing some pretty generic tests that would likely be covered in the original library, strictly for the purposes of outlining how Behavior Driven Development works.
You wouldn’t necessarily be writing tests for a PDF library that your application uses in the wild just to see if the library works. As long as you’ve researched your third party libraries appropriately and ran all of its tests yourself, it should be a safe assumption that the library you choose does work in the way you need it to.
What should be tested is your application’s utilization of that library through its service classes. So in the case of something like application-specific invoices being generated in PDF format, I could see new tests being created for those custom service classes as an exercise in CYA more than anything.