The good and the bad of PDFs

From OpenGovData


Points to Add

Other points that we should talk about:

  • what "PDF/A" is and which problems it solves and doesn't solve
  • what actually should be done to make sure a PDF is disability-accessible
  • bad PDFs that make it impossible to extract the text; alternative is XHTML? XML+XSL? ODF?
  • what PDFs shouldn't be used for as a sole means of disclosure, like voting records (this is done in Congressional committees though we don't have to call them out by name) and other tabular data.
  • benefits of the format being open
  • use of digital signatures

Document Draft

Contributors: Kevin Lyons

PDF Basics

Anyone who has spent time working (or playing!) on the web has come across Adobe's Portable Document Format, or PDF, files. Used for countless purposes, PDF files are a very popular method of making information available and printable with very little mess or fuss. The popular Adobe Acrobat Reader, free for download and use, allows people to quickly open, read and print PDF files in their web browser and is the primary tool for accessing PDF files, both online and offline. Reader has had more than 500 million downloads (1) over the years and is a technology that even people who are not comfortable with computers seem to grasp easily.
Indeed, the PDF standard is so widely adopted that there are hundreds of both commercial and free software packages to create, edit and transform PDF files. Microsoft Word (2), OpenOffice (3) and other common word processing tools offer a direct conversion to PDF, as do spreadsheets, presentation software and many other software applications.
Given how popular the PDF standard itself is, it shouldn't be a surprise that the term PDF actually covers a wide variety of different types of files. While all PDF files fit the PDF standard, there are several different subtypes of PDF that are helpful in the government world. The following covers a few of the major ones.
Standard PDF Files
A standard PDF file is based on Adobe's Portable Document Format standard, with version 1.7 being the latest version as of the time this article was authored. Generally, when people discuss a "PDF File", they're referring to the standard PDF format we're discussing here. While other, useful PDF subtypes exist, most work with PDFs will stay in the standard PDF v1.7 specification.
The specification itself stems from the early 1990s. The original versions of the standard, v1.0 through v1.6, were all closed standards kept by Adobe as proprietary information. In fact, prior to the opening of the standard, the legal stance on PDF files was actually a bit murky. While many people developed standards and best practices based around PDF documents, it was often unclear when one had to license the right to generate new PDFs or edit existing PDFs.
Then, in a move designed to "provide an umbrella for the current alphabet soup of Adobe PDF standards," (11) Adobe made their 1.7 PDF specification into an ISO standard on July 1, 2008.
By taking this step, Adobe allowed their users in such major industries as Health Care (who created the PDF/H standard) and Engineering (who created the PDF/E standard) to try and bring all of their ideas under a single roof. In return, Adobe was provided a greater degree of feedback and integration of ideas which they could roll into the main PDF standard.
PDF/A Files
In recent years, a new subset of the PDF standard has emerged, one specifically for archiving data.
As many folks will tell you, when storing electronic data for a long period of time, the problem often isn't data loss, it's data retrieval. It's usually fairly simple, though often time consuming, to pull a copy of data from a backup location. But say the file is five or more years old. Are you sure the file will open with a newer version of your software? Or do you have to keep around older copies of your software just in case you need to open a specific file? Think about how much software licensing would cost you at that point...
To help alleviate this issue, a PDF/A file contains all the information needed to recreate the original document and all of its associated formatting. This includes all font data, graphics data, text, etc. The idea is to make the document self-contained so that it does not rely on external data which itself might be changed or corrupted over time.
While the PDF/A standard does seem interesting, it's best to note that there are a few trade-offs to using it. File sizes tend to be larger than those of normal PDFs, because all data such as fonts must be embedded and some compression methods are not permitted. In addition, given the self-contained nature of PDF/A, you can't reuse common document elements. For instance, if you have a common image that's used for, say, a letterhead, that image is going to be embedded in each and every document as opposed to being stored once and linked into multiple documents.
PDF/UA
Another subset of the PDF world is the work-in-progress called PDF/UA. Here, the UA stands for Universal Access, a reference to the proposed standard for accessibility. While PDF/UA is not currently a fully developed standard, the project, currently run by the AIIM Institute, is working towards publication of standards in the near future.
The PDF/UA standard is meant to address accessibility issues for software such as screen readers for the visually impaired. The goal, as detailed on the project wiki, is "to set standards for PDF authoring such that conforming PDF files are accessible and usable to all, including those who use assistive technology" (12).
As one might note, now that the PDF standard is open, projects such as PDF/UA are springing up to help refine and expand upon both the standard and implementations of the standard. The PDF/UA project is a prime example of the benefits that both the creator of the standard and the community surrounding the standard can receive when standards are made open.
Summary
So, there you have the basics on PDF files. As you might guess, these various different PDF standards are a big reason why PDF files are so commonplace in today's technology sector. And, like any other common technology, this means that PDF files have been used in places where they are, perhaps, not the best tool for the job.
With that in mind, let's discuss some of the ins and outs of PDF files, specifically what they can and can't do for your applications and what they should and shouldn't be used for.

The Good

So let's begin with what PDF files are good for. At the top of the list, PDF files make wonderful tools for print-friendly publications, especially ones that you want to look professional. Given that PDF files are based on PostScript (4), a sort of universal printing language, they naturally lend themselves to print-formatted applications.
Print Formatting
Let's check out an example:
  Sam and Alice work for the Governor's Budget Office. The Governor has just announced his spending plan for the current year,
  something that many, many people are interested in reading. After hours upon hours of work, they finally have the report done, complete
  with graphs, charts and images galore. A grand total of over 400 pretty pages ready for viewing.
  
  The report is something that hundreds, maybe thousands of people are going to want to read, which presents an issue for Sam and Alice. 
  They could print off hundreds or thousands of copies of the report, leave some in their office for people to pick up, mail out others and
  inter-office some more. When they look at the printing bill, however, neither Alice nor Sam is very happy with the cost.
  
  After a little research, Alice finds out about PDF files. With a quick plugin for her copy of Word, she can create an electronic version of
  the report, which her tech people can then post on the Governor's public web site. Now, rather than printing off tens of thousands of pages, 
  all they need to do is distribute a single link and the information is instantly accessible. 
In a case like this, a PDF is your best friend for a number of reasons. You've got an electronic document that preserves all the hard work spent formatting the document, can quickly and easily be distributed to a large number of people (just email a link to the PDF) and folks can print it out at their leisure. No more toting around massive stacks of paper, just a quick press of a few buttons and you're done.
Archival
But the good doesn't stop there. PDF files also make for a nice archival tool. Most scanner software these days scans directly to PDF, and creating electronic versions of filed information via PDF is quick and easy. Let's look at another example, the Nebraska electronic UCC filing process (5):
  In keeping with federal regulations, Jim needs to file an original Uniform Commercial Code statement with the Nebraska Secretary of State.
  So, logging into his Nebraska.gov account, he types a few key pieces of information into a web form, clicks a button and seconds later
  he's done. No hassle, no fuss.
  
  While Jim is done with his part, the work is just beginning for the UCC Division. The data needs to be stored, processed and, by statute, 
  they have to have a printed copy of the UCC filing on record. Since the filing was done electronically, there's no paper, which presents 
  a bit of a problem. But not to worry, PDF files to the rescue!
  
  Nightly, all the day's electronic filings are collected and the data is written out to PDF files. These files can then be printed, stored 
  and archived as official records of the filing, thus meeting the statutory requirements without a change in the law. Quick, simple and easy.
Again we can see that PDF files can ease our workload. By creating faux paper copies of electronic information, we can create electronic archives of important data in a format we can easily read. In addition, that information can then be posted to the web for further use.
Now, given that search engines such as Google and Yahoo can read, index and return results for PDF files (with some major exceptions listed below), PDF files can fill an important role in making archived data available. So we can create, archive and display information all with one quick format. All in all, seems pretty handy, right?
Well, there are some limitations as we'll discuss below and they can be fairly serious ones. However, for the most part, PDF files do make a wonderful tool for archiving data, especially in a situation where you want to retain print capability.
Digital Signatures
Another nice feature that's relatively new in the PDF specification is the ability to digitally sign information. As we're all likely aware, a great many facets of the governmental world seem to hinge on having a signed copy of a form, bill or some other piece of paper. Oftentimes this requirement is statutory, so it's not something we can just throw out. The PDF standard gives us a way to handle this through its implementation of digital signatures.
The PDF signature process is not too different from actually signing a piece of paper. When given a document in PDF format, a signature field can be added, which is then populated either by signing using the mouse as a pen, or with a previously saved image copy of your signature. This is then loaded into the signature field and the user is done. The person receiving the document can then "verify" the signature.
When you receive a digitally signed PDF, you can obtain a few helpful pieces of information. First, the PDF will know if it's been changed at all since the signature was added and will let you know, a feature not available in traditional signed paper. As well, you are given the opportunity to verify the signature, meaning you accept that it's valid or can reject it as you see fit.
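The change-detection idea can be illustrated with a minimal Python sketch. This is not how PDF signatures are actually implemented: real PDF signatures rely on public-key certificates, whereas this sketch substitutes a shared-secret HMAC from the standard library purely to show why any change to the signed bytes becomes detectable. The key, the document bytes and the function names are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical shared secret; a real PDF signature would use the signer's private key.
SECRET_KEY = b"example-signing-key"

def sign(document: bytes) -> str:
    """Return a hex digest that depends on every byte of the document."""
    return hmac.new(SECRET_KEY, document, hashlib.sha256).hexdigest()

def is_unchanged(document: bytes, signature: str) -> bool:
    """Recompute the digest and compare it with the stored signature."""
    return hmac.compare_digest(sign(document), signature)

# Stand-ins for the bytes of a signed document and an altered copy of it.
original = b"Budget report: total spending 400 million"
signature = sign(original)
tampered = b"Budget report: total spending 900 million"

print(is_unchanged(original, signature))  # the untouched document verifies
print(is_unchanged(tampered, signature))  # the edited document does not
```

Because the digest covers the whole document, even a one-character edit produces a mismatch, which is the property a signed PDF uses to warn you about changes.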
It should be noted that this technology has not yet been widely adopted, and there are a few caveats involved. For example, the free Adobe Reader does not allow an individual to digitally sign a PDF (13). In addition, in order to establish a trusted signature, you may need to register your signature with a third party trust (13).
So while the process may not be perfect, the addition of digital signatures to the PDF standard is still something that could very well have a great deal of benefit to the average governmental agency in the coming years.


Now that you've heard a bit about some of the great things you can get from PDF files, let's take some time and cover the less positive aspects of the PDF world. As with any tool, PDFs can be misused, so it's important to spend some time getting familiar with their limitations.

The Bad

While it's still fresh in our minds, let's revisit the previous concept - storage and archival of data. Here we run across a major drawback of PDF files - scanned data.

PDF Archival Issues

A huge project that's on many people's minds right now is converting older paper or microfiche information to an electronic format that can be used on the web. Even in a smaller state it's not uncommon for a single agency's records to comprise hundreds of thousands of pages of public information. Filings add up, as do reports, and soon people are buried under mounds of paper.
Now we've talked about how great PDF files are for storing data, so why not just convert our old pages to PDF? After all, scanners all seem to support PDF conversion these days, so how hard would it really be to just scan in all our old documents? Sure it would take some time but hey, we could just throw them out on the web and call it good, right?
To get right down to it, that may not be the best idea. See, the most common way for a scanner to store the printed page is as one big image. So normally, when you scan a document, the scanner just takes a picture of the page (like a camera) and stores the picture, not the actual words on the page. This causes us a big problem if we want to pull information back off that page.
Unlike a PDF file created straight from the text of a document, such as one exported from Word, a PDF created out of an image can't be indexed or read by software. What this means is that if you want people to be able to search a document for specific information, such as names, addresses or titles, a plain scanned version won't work.
Again, let's go over an example:
  Sue is in charge of archiving her office's old budget reports. Her boss, ever so helpful, gave her a $50 scanner he picked up at Best Buy,
  and told her to get to work. With no better ideas, Sue sits down and starts scanning pages. 
  
  Within a few hours, she manages to get about a hundred or so reports done. Happy with her work, she saves the files to a central file 
  share and continues her work. It doesn't take too long before she's got a few years' worth of work stored up. So far, everything seems 
  to be great! 
   
  A short while later, her boss returns with a new request. The Auditor's office has been by and they need copies of all the reports that
  detail a specific type of distribution to a specific county. Figuring that she has all the data scanned, Sue thinks this should be an 
  easy task.
  
  Soon enough, Sue learns the awful truth. The only way she can get the data off her scanned PDF files is to read each and every one to find
  it. Her Windows file search doesn't work, nor does the Google archive of all of their PDF files. Now she's stuck with a bunch of information
  that she can't search, can't index and can't handle in any electronic format. 
Addressing PDF Archival Issues
Now, this issue is not insurmountable. The field of Optical Character Recognition or OCR for short (6) has come a long way in recent years. While still not perfect, it does have a very high degree of accuracy in converting scanned images to actual text. This does require another layer of software, or in some cases the purchase of hardware such as OCR scanners.
These purchases, of course, take money. While OCR solutions can be found for as little as $100 (7), the cheaper purchases often do not provide the level of service required by state or federal contracts and guidelines. But perhaps more important, they fail to provide a 100% accuracy rate. Given that one small mistake in scanning may cause a large amount of legal problems, one has to measure the risk vs. the reward. After all, one mistake in, say, a taxation statute, could cause lost revenue for the state and possibly even open the state up to a lawsuit. While certainly not likely to happen, it is a risk all the same.
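To put the accuracy concern in rough numbers, here is a small Python sketch. The accuracy figures, the characters per page and the page count are made-up assumptions for illustration, not measurements of any OCR product.

```python
def expected_errors(accuracy: float, chars_per_page: int, pages: int) -> float:
    """Expected number of misread characters, assuming each character
    is recognised independently with the given accuracy."""
    return (1.0 - accuracy) * chars_per_page * pages

# Hypothetical archive: 100,000 pages of roughly 2,000 characters each.
PAGES = 100_000
CHARS_PER_PAGE = 2_000

for accuracy in (0.99, 0.999, 0.9999):
    errors = expected_errors(accuracy, CHARS_PER_PAGE, PAGES)
    print(f"{accuracy:.2%} accurate: about {errors:,.0f} misread characters")
```

Under these assumptions, even a very accurate OCR pass leaves a large absolute number of misread characters across a big archive, which is why the risk has to be weighed against the reward.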

Human Accessibility Issues

Another risk in the PDF world falls in the realm of the Section 508 compliance standards. While many government web sites simply overlook 508 compliance, it is a federal requirement and may be one for your state as well (8). Unless specific steps are taken, a typical PDF does not contain the markup necessary to meet 508 compliance standards (9), which in turn may leave you open to lawsuits or other legal liabilities.
A non-accessible PDF will cause issues for screen readers for the visually impaired, as well as potential navigation issues for individuals with motor skill impairments, just to name a few. While not a majority of your average web site user base, they are just a few of the individuals protected under the Americans With Disabilities Act, which itself takes a specific stance on IT issues (10).
Addressing Accessibility Issues
As with the archival issues listed above, this limitation can be worked around as well. PDF files do have a number of accessibility options built in, which the PDF/UA project has been working to expand. However, one should note that accessibility options are not normally enabled by default in most software, so special care needs to be taken when creating a PDF if you want to meet these standards.
In general, accessibility issues are addressed inside of a PDF by "tagging". The process of tagging a PDF is simply adding context to the data inside in a manner in which programs such as screen readers can use. Generally, this is done through Adobe's Acrobat suite of products, which includes a PDF editing tool. Few other products, commercial or not, support tagging functionality within PDF files.
Having said that, however, Microsoft Word, OpenOffice and InDesign all share the ability to create a tagged PDF that meets accessibility requirements. As such, you may already have the tools at hand to produce accessible PDF files. Check the options of your PDF generating tools for "accessibility" or "tagging" settings to be sure.
For more information on how to check your PDF files to see if they are accessible, you may wish to visit the Web Accessibility Center for more details.

Computer Accessibility Issues

One final issue that should be brought up is again an accessibility issue, though this time of a different variety. Not only can PDF files have issues relating to human interaction, they can present a wide array of issues when dealing with computer-to-computer interactions.
While PDF files are wonderful for display of data for people, they have severe limitations when dealing with electronic access to data. When one is in a situation where software applications may require input, PDF files simply are not the way to go. Let's illustrate with an example:
   Anna is a software developer for a smaller company. While working on a payroll system, she finds that her state's laws require a 
   certain tax rate for income. Looking back at historical data, she finds that this tax rate tends to change every couple of years.
   Wanting to assure that her company stays in compliance with state and federal law, she realizes that she needs to make sure her 
   payroll application updates itself with the latest tax rate.
   
   Searching around on the web, Anna finds that the tax rate is published on-line. The problem is, it's only published in PDF format,
   so Anna doesn't have a good way of grabbing that data from the PDF. She could make her software try to read the whole PDF and figure
   out just where the tax rate is, but that's a pretty time consuming and tricky thing to do and she can't ensure that it would work
   100% of the time.
   
   In the end, Anna decides to just have a person look up and enter the tax rate, so she provides the users a link to the PDF and calls
   it a day. 
   
   One year later, Anna's company receives a visit from the state's department of revenue. Their tax filings were off as they calculated
   payroll for six months using the wrong tax rate. Digging into the problem, she finds that the person responsible for updating the tax
   rate left the company and didn't bother to tell their replacement to update the rate. 
   
In the above example, if the tax rate were distributed in a machine readable format the company would have been able to automatically update their application, as would many other companies out there. This is a prime example of where a PDF file, while useful, is not an ideal single solution.
At any point in time when you have data that may need to be read by a machine as opposed to a person, a PDF is not an ideal, or even very feasible answer. While it may make a great supplement to your data, it can cause a wide variety of issues for people if a PDF is your sole means of disclosure of information.
Addressing Computer Accessibility Issues
In this case, the solution involves leaving PDF files behind altogether. Fortunately, there are a large number of formats that can take their place. File formats such as XML and CSV are established, proven methods of enabling machine-to-machine communication quickly and easily. Given the wide proliferation of XML parsing and processing tools, XML is, right now, one of the simplest and most accessible formats for publishing data.
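As a minimal sketch of what machine-readable publication buys a developer like Anna, the Python below reads the latest tax rate from both a CSV and an XML document using only the standard library. The layouts, field names and rate values are hypothetical; no real agency format is implied.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical published data; in practice this would be downloaded from the agency.
CSV_DATA = """effective_date,rate
2007-01-01,0.0501
2009-01-01,0.0512
"""

XML_DATA = """<taxRates>
  <rate effectiveDate="2007-01-01">0.0501</rate>
  <rate effectiveDate="2009-01-01">0.0512</rate>
</taxRates>"""

def latest_rate_from_csv(text: str) -> float:
    """Return the rate with the most recent effective date.
    ISO-formatted dates sort correctly as plain strings."""
    rows = list(csv.DictReader(io.StringIO(text)))
    newest = max(rows, key=lambda row: row["effective_date"])
    return float(newest["rate"])

def latest_rate_from_xml(text: str) -> float:
    """Same lookup against the hypothetical XML layout."""
    root = ET.fromstring(text)
    newest = max(root.findall("rate"), key=lambda el: el.get("effectiveDate"))
    return float(newest.text)

print(latest_rate_from_csv(CSV_DATA))
print(latest_rate_from_xml(XML_DATA))
```

A payroll application could run a lookup like this on a schedule, so that nobody has to remember to re-key the rate by hand.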
Though this is really a topic for another complete wiki entry, it's important to point out one potential advantage to traveling down the XML path for data distribution. In addition to being able to publish raw data, you can use a formatting language called XSL-FO: an XSLT processor translates the XML into XSL-FO, which a formatter then renders as a PDF. While often a cumbersome process with a fairly steep learning curve, it is possible to have a single XML distribution that you translate on the fly to a human-friendly PDF version of the data.
Indeed, free products such as Apache's FOP and commercial products such as Antenna House Formatter support the XML to PDF transform on a variety of levels. Given those tools, one could fairly easily build a system (or perhaps web service) that outputs XML, which is processed by a second system (or service) that translates it as necessary and passes it on to a third party.
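To give a feel for the intermediate step, here is a rough Python sketch that turns a small XML data file into an XSL-FO document. Python's standard library has no XSLT processor, so the sketch builds the XSL-FO tree directly in code instead of through a stylesheet. The input layout and function names are hypothetical, and the sketch stops at the XSL-FO text, which would still need to be handed to a formatter such as Apache FOP to become a PDF.

```python
import xml.etree.ElementTree as ET

# Namespace of XSL Formatting Objects.
FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)

def fo(tag: str) -> str:
    """Qualify a tag name with the XSL-FO namespace."""
    return f"{{{FO_NS}}}{tag}"

# Hypothetical machine-readable source data.
SOURCE_XML = """<taxRates>
  <rate effectiveDate="2009-01-01">0.0512</rate>
</taxRates>"""

def to_xsl_fo(source_xml: str) -> str:
    """Build a minimal one-page XSL-FO document with one block per rate."""
    data = ET.fromstring(source_xml)
    root = ET.Element(fo("root"))
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {"master-name": "page"})
    ET.SubElement(page, fo("region-body"))
    sequence = ET.SubElement(root, fo("page-sequence"), {"master-reference": "page"})
    flow = ET.SubElement(sequence, fo("flow"), {"flow-name": "xsl-region-body"})
    for rate in data.findall("rate"):
        block = ET.SubElement(flow, fo("block"))
        block.text = f"Rate effective {rate.get('effectiveDate')}: {rate.text}"
    return ET.tostring(root, encoding="unicode")

print(to_xsl_fo(SOURCE_XML))
```

The point of the design is that the XML stays the single source of truth: machines read it directly, while the PDF is generated from it only for human readers.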

Summary

So there you have a few of the ups and downs of PDF files. Like any other technology, PDFs can be either helpful or harmful depending on your needs. Accordingly, you should take the time to evaluate what your needs are before committing to the use or non-use of PDFs in your work.
Please note that this article is by no means a complete list of either positive or negative aspects of PDFs. The intent here is to provide a good starting point for your research and give a few of the major pros and cons of PDF files. If you have additional questions, comments or concerns, please feel free to contact the author.

Sources

   Author's note: Links to any corporate entity or individual do not constitute endorsement of any product, service or other facet of 
                  said entity or individual either on the author's part nor the Wiki Maintainer's part. Information presented here is 
                  strictly for reference purposes only.
  • [1] About Acrobat, Adobe.com
  • [2] Microsoft Word's PDF Plugin, MSN
  • [3] Open Office PDF Exports, OpenOffice.org
  • [4] PDF Foundations, Wikipedia
  • [5] Nebraska Secretary of State UCC Division, sos.state.ne.us
  • [6] OCR definition and information, Wikipedia
  • [7] OCR Scanner Reviews, The Scanner Store
  • [8] 508 Compliance Laws, section508.gov
  • [9] PDF Accessibility standards, Adobe.com
  • [10] Americans With Disabilities Act, ADA web site
  • [11] Linux Watch News
  • [12] PDF/UA Project Wiki
  • [13] Digital Signature Blog