Portable Document Format (PDF) is a document format created by Adobe in the early 1990s. Its aim is to provide a standard representation of documents that is independent of application software, hardware, and operating systems. PDF files can be opened in Adobe Acrobat Reader as well as in browsers such as Chrome, Safari, and Firefox, through built-in viewers, plugins, or extensions. A PDF file can contain images, text, hyperlinks, rich media, digital signatures, attachments, metadata, and 3D objects.
Users generally convert existing documents into PDF, but PDF files can also be created or manipulated directly by software. Adobe Acrobat is Adobe’s own application for creating PDF files.
History of PDF format
Adobe made the PDF specification available free of charge in 1993. PDF was released as an open standard in July 2008, published by the International Organization for Standardization as ISO 32000-1. In 2008 Adobe also published a Public Patent License to ISO 32000-1, granting royalty-free rights to all patents owned by Adobe that are necessary to make, use, sell, and distribute PDF-compliant implementations.
The first edition, PDF 1.0, went through successive revisions up to PDF 1.7. PDF 1.7, the sixth edition of the specification and the one that became ISO 32000-1, includes some proprietary technologies defined only by Adobe, such as the Adobe XML Forms Architecture (XFA) and the JavaScript extension for Acrobat.
PDF 2.0 was published in July 2017 as ISO 32000-2:2017 and does not include any proprietary, non-standardized technologies.
PDF File Specifications
A PDF file is a sequence of bytes grouped into tokens according to the syntax rules defined by the PDF specification.
File Structure
A PDF file consists of the following four parts, in this order.
File Header
Irrespective of the PDF version, a PDF file starts with a header containing a unique identifier for PDF and the version of the format. For PDF 1.x files this is %PDF-1.x, where x ranges from 0 to 7; PDF 2.0 files start with %PDF-2.0.
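As a minimal sketch of how the header can be read, the following Python function extracts the version from the first bytes of a file. The function name parse_pdf_version is hypothetical, and the sketch assumes the header sits at the very start of the data, which real-world files do not always guarantee.

```python
import re

# Matches the header identifier followed by a single-digit major and minor version.
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def parse_pdf_version(data: bytes):
    """Return (major, minor) from a %PDF-M.m header, or None if it is absent."""
    match = HEADER_RE.match(data)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    print(parse_pdf_version(b"%PDF-1.7\n"))   # (1, 7)
    print(parse_pdf_version(b"plain text"))   # None
```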
File Body
The body of a PDF file consists of a sequence of indirect objects representing the contents of the document. The objects represent components of the document such as fonts, pages, and sampled images. Beginning with PDF 1.5, the body can also contain object streams, each of which holds a sequence of indirect objects.
Cross-Reference Table
The cross-reference table holds information that allows random access to indirect objects, so that the complete file does not need to be read to locate any particular object.
The cross-reference table, also known as an index table, is located near the end of the file and gives the byte offset of each indirect object from the beginning of the file.
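The lookup role of the table can be illustrated with a small parser. This is a sketch for classic (non-stream) cross-reference sections only; it assumes each subsection starts with a first object number and a count, followed by entries of the form "offset generation n" (in use) or "offset generation f" (free). The function name parse_xref_section is hypothetical.

```python
def parse_xref_section(section: bytes) -> dict:
    """Map object number -> (byte offset, generation, in_use) for a classic xref section."""
    tokens = section.split()  # split on any ASCII whitespace
    if not tokens or tokens[0] != b"xref":
        raise ValueError("section does not start with the xref keyword")

    entries = {}
    i = 1
    # Each subsection is "first count" followed by `count` three-token entries.
    while i < len(tokens) and tokens[i] != b"trailer":
        first, count = int(tokens[i]), int(tokens[i + 1])
        i += 2
        for obj_num in range(first, first + count):
            offset = int(tokens[i])
            generation = int(tokens[i + 1])
            in_use = tokens[i + 2] == b"n"
            entries[obj_num] = (offset, generation, in_use)
            i += 3
    return entries


if __name__ == "__main__":
    sample = (
        b"xref\n"
        b"0 3\n"
        b"0000000000 65535 f \n"
        b"0000000009 00000 n \n"
        b"0000000058 00000 n \n"
        b"trailer\n"
    )
    for number, entry in sorted(parse_xref_section(sample).items()):
        print(number, entry)
```

With the table in hand, a reader can seek straight to the recorded offset of an object instead of scanning the body.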
File Trailer
The file trailer enables a reader to quickly find the cross-reference table and certain special objects. The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section.
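Because the trailer sits at the end, a reader typically starts from the tail of the file. The sketch below shows one way to recover the startxref offset; the function name find_startxref and the 1024-byte tail window are assumptions made for illustration, not requirements of the specification.

```python
def find_startxref(data: bytes) -> int:
    """Return the byte offset recorded after the last startxref keyword."""
    tail = data[-1024:]  # assumed window; the trailer lines sit at the end of the file
    eof = tail.rfind(b"%%EOF")
    if eof == -1:
        raise ValueError("missing %%EOF marker")
    pos = tail.rfind(b"startxref", 0, eof)
    if pos == -1:
        raise ValueError("missing startxref keyword")
    # The offset is the first token between startxref and %%EOF.
    tokens = tail[pos + len(b"startxref"):eof].split()
    if not tokens:
        raise ValueError("missing offset after startxref")
    return int(tokens[0])


if __name__ == "__main__":
    # Build a tiny illustrative file so the recorded offset is correct by construction.
    body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    sample = (
        body
        + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
        + b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
        + b"startxref\n" + str(len(body)).encode() + b"\n%%EOF\n"
    )
    offset = find_startxref(sample)
    print(sample[offset:offset + 4])  # b'xref'
```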
PDF Objects
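A PDF file is built from eight basic types of objects: Boolean values, integer and real numbers, strings, names, arrays, dictionaries, streams, and the null object.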
In addition, a file may contain comments, which are introduced with the % sign and may contain 8-bit characters.
Indirect Objects
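Indirect objects are identified by an object number and a generation number and are defined between the obj and endobj keywords. Beginning with PDF 1.5, indirect objects other than streams may also be located in special streams known as object streams. Cross-references to indirect objects are maintained in an index table, marked with the xref keyword, that follows the main body and gives the byte offset of each indirect object from the beginning of the file.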
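To make the relationship between indirect objects and byte offsets concrete, the following sketch scans raw bytes for "N G obj ... endobj" definitions and reports where each begins. It is deliberately naive: it assumes objects are stored directly in the body, not in object streams, and it can be fooled by binary stream data that happens to contain the keywords. The name iter_indirect_objects is hypothetical.

```python
import re

# "<object number> <generation> obj ... endobj", matched lazily across lines.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def iter_indirect_objects(data: bytes):
    """Yield (object number, generation, byte offset, raw content) for each definition."""
    for match in OBJ_RE.finditer(data):
        yield (
            int(match.group(1)),
            int(match.group(2)),
            match.start(),            # the offset a cross-reference entry would record
            match.group(3).strip(),
        )


if __name__ == "__main__":
    sample = b"1 0 obj\n<< /Type /Catalog >>\nendobj\n2 0 obj\n(hello)\nendobj\n"
    for number, generation, offset, content in iter_indirect_objects(sample):
        print(number, generation, offset, content)
```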
Linear and Non-linear PDF layouts
PDF files are laid out either linearly or non-linearly, depending on the target application and other factors.
Linear PDF – Linear (also called linearized or optimized) PDF files are written to disk in a linear fashion, with the objects needed for the first page placed at the start of the file. A browser plugin can therefore display them without waiting for the whole document to load.
Non-linear PDF – Non-linear PDF files use less disk space than linear PDF files. Because the data for the pages of the document is scattered throughout the file, non-linear PDF files are slower to access than linear files.
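A rough way to tell the two layouts apart is to look for the linearization dictionary, which a linearized file carries in its first object and which contains a /Linearized key. The sketch below is a heuristic, not a validator: the function name looks_linearized is hypothetical, and it does not check that the dictionary is well formed or that the file was not modified after linearization.

```python
def looks_linearized(data: bytes) -> bool:
    """Heuristic: report whether /Linearized appears in the first 1024 bytes."""
    return b"/Linearized" in data[:1024]


if __name__ == "__main__":
    linear = b"%PDF-1.4\n1 0 obj\n<< /Linearized 1 /L 1234 >>\nendobj\n"
    plain = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    print(looks_linearized(linear), looks_linearized(plain))  # True False
```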
Objects Overview
The PDF body contains objects, as discussed above. PDF is largely based on PostScript, but without the flow-control features of a programming language, such as if and loop commands. When a PDF is produced from PostScript, the graphical content generated by the PostScript code is collected and tokenized, along with any files, graphics, or fonts the document refers to.
Text
Text in a PDF document is represented by text elements in page content streams. A text element specifies that characters should be drawn at certain positions.
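As an illustration, a simple text element can be written with the BT and ET operators enclosing a font selection (Tf), a position (Td), and a string to show (Tj). The sketch below assembles such an element as text; the helper name text_element is hypothetical, the font resource name F1 is assumed to be defined in the page's resources, and only literal strings with basic escaping are handled.

```python
def text_element(font: str, size: int, x: int, y: int, text: str) -> str:
    """Build a one-line text element that shows `text` at position (x, y)."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"BT /{font} {size} Tf {x} {y} Td ({escaped}) Tj ET"


if __name__ == "__main__":
    print(text_element("F1", 12, 72, 720, "Hello"))
    # BT /F1 12 Tf 72 720 Td (Hello) Tj ET
```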
Graphics
The graphics operators in PDF content streams describe the appearance of pages that are to be reproduced on a raster output device. The graphics operators fall into six main groups.
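To give a feel for how operators from different groups combine, the sketch below emits a short content-stream fragment that saves the graphics state (q), sets the line width (w) and stroke color (RG), constructs a rectangular path (re), strokes it (S), and restores the state (Q). The helper name stroked_rectangle and its parameters are hypothetical and chosen only for illustration.

```python
def stroked_rectangle(x, y, width, height, line_width=1.0, rgb=(0.0, 0.0, 0.0)) -> str:
    """Return content-stream operators that stroke a rectangle outline."""
    r, g, b = rgb
    operators = [
        "q",                                       # save the graphics state
        f"{line_width:g} w",                       # set the line width
        f"{r:g} {g:g} {b:g} RG",                   # set the stroke color (RGB)
        f"{x:g} {y:g} {width:g} {height:g} re",    # construct a rectangle path
        "S",                                       # stroke the path
        "Q",                                       # restore the graphics state
    ]
    return "\n".join(operators)


if __name__ == "__main__":
    print(stroked_rectangle(100, 100, 200, 50, line_width=2, rgb=(1, 0, 0)))
```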