PDF File Format: Exploring its Internal Document Structure


The Portable Document Format, or PDF, is a universal format for sharing information that includes text, images, multimedia, web links, and more. It also supports features such as password protection and embedded JavaScript.

PDFs are designed primarily for viewing, not editing. They are widely used because they preserve document formatting across devices, which makes documents easier to share and ensures a consistent visual experience regardless of the reader’s device.

In this piece, we briefly explore PDF and its structure.

A PDF document has four primary parts in the following sequence:

  • PDF header: identifies the version of the PDF specification the file follows.
  • File body: consists of the individual objects that make up the document.
  • Cross-reference table: records where each of the file’s indirect objects is located.
  • Trailer: gives the location of the cross-reference table and of certain special objects in the body.
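To make the four-part layout concrete, here is a minimal Python sketch that assembles a tiny PDF byte by byte. The function name build_minimal_pdf, the choice of version 1.4, the three objects, and the 612 x 792 page size are all assumptions made for illustration; the result is a blank one-page skeleton that has not been validated against any particular reader.

```python
def build_minimal_pdf() -> bytes:
    """Assemble header, body, xref table, and trailer for a blank one-page PDF."""
    # 1. Header: version comment plus a comment holding high-bit bytes.
    out = bytearray(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n")

    # 2. Body: three indirect objects (catalog, page tree, one page).
    #    These objects are illustrative assumptions, not taken from the article.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] >>",
    ]
    offsets = []
    for number, content in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" begins
        out += b"%d 0 obj\n" % number + content + b"\nendobj\n"

    # 3. Cross-reference table: entry 0 (free) plus one 20-byte entry per object.
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    # 4. Trailer: dictionary, startxref pointer, end-of-file marker.
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

Writing the returned bytes to a file opened in binary mode gives something you can inspect in a hex editor while reading the sections that follow.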

PDF File Header

A PDF file begins with a header containing a unique identifier and the version of the format, such as %PDF-1.x, where x ranges from 0 to 7. Files that follow the newer PDF 2.0 specification use %PDF-2.0.

You can use a hex editor or the xxd command to determine this:

$ xxd temp.pdf | head -n 1

0000000: 2550 4446 2d31 2e33 0a25 c4e5 f2e5 eba7  %PDF-1.3.%......

The output above shows that the file declares PDF specification version 1.3. As in PostScript, comments begin with the % symbol: outside of strings and streams, everything from the % sign to the end of the line is ignored.

The second line of most PDF files is a comment containing several high-bit bytes (values from 0x80 to 0xFF); the specification recommends at least four. This tells email clients and other programs that the file contains binary data and should not be treated as 7-bit ASCII text.
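The same header check can be done programmatically. The sketch below is one possible approach, not a complete parser: the function names read_pdf_version and has_binary_marker are made up for this example, and searching the first 1024 bytes instead of only offset 0 is a leniency assumption.

```python
import re

# Matches the version comment, e.g. %PDF-1.3
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def read_pdf_version(data: bytes):
    """Return (major, minor) from the header, or None if no header is found."""
    # Assumption: tolerate a little leading junk by searching the first 1 KB.
    match = HEADER_RE.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def has_binary_marker(data: bytes) -> bool:
    """True if the second line is a comment containing bytes >= 0x80."""
    lines = data[:1024].splitlines()
    if len(lines) < 2:
        return False
    second = lines[1]
    return second.startswith(b"%") and any(byte >= 0x80 for byte in second)


# The sixteen bytes from the xxd output shown earlier.
sample = bytes.fromhex("2550 4446 2d31 2e33 0a25 c4e5 f2e5 eba7")
print(read_pdf_version(sample))   # (1, 3)
print(has_binary_marker(sample))  # True
```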

File Body

The PDF file’s body contains all the objects that make up the document’s actual content, such as text streams, image data, fonts, and annotations. The body can also hold non-displayed objects that implement the document’s logic, security, and interactive features.

There are eight different kinds of objects that make up a PDF’s body:

  1. Boolean
  2. Numeric (integers and real numbers)
  3. Strings
  4. Names
  5. Arrays
  6. Dictionaries
  7. Streams
  8. The null object
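Each of these eight types has a recognisable written form in the file. The following sketch shows how they look by classifying a single, already-isolated token by its leading characters. The function classify is a hypothetical helper written for this article; it is a rough illustration that does not handle nesting, comments, or indirect references.

```python
import re

# PDF numbers: optional sign, digits with an optional decimal point (no exponents).
NUMBER_RE = re.compile(rb"[+-]?(\d+\.?\d*|\.\d+)")


def classify(token: bytes) -> str:
    """Guess the PDF object type of one isolated token (illustrative only)."""
    token = token.strip()
    if token in (b"true", b"false"):
        return "boolean"
    if token == b"null":
        return "null"
    if NUMBER_RE.fullmatch(token):
        return "numeric"
    if token.startswith(b"/"):
        return "name"
    if token.startswith(b"["):
        return "array"
    if token.startswith(b"<<"):
        # A stream is a dictionary followed by stream ... endstream.
        return "stream" if token.endswith(b"endstream") else "dictionary"
    if token.startswith((b"(", b"<")):
        # Literal strings use parentheses, hexadecimal strings use angle brackets.
        return "string"
    return "unknown"


examples = [
    b"true", b"-3.14", b"(Hello)", b"<48656C6C6F>", b"/Type",
    b"[1 2 3]", b"<< /Type /Page >>",
    b"<< /Length 5 >>\nstream\nhello\nendstream", b"null",
]
for example in examples:
    print(example, "->", classify(example))
```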

Cross-Reference Table

The file’s cross-reference table (xref) stores the byte offsets of all its indirect objects, so a reader can jump straight to any object instead of scanning the entire file (or “walking” a linked list) to find it.

If no changes have been made to the file, the xref table is contiguous and consists of a single section. Each section is divided into subsections, one per block of consecutively numbered objects; a subsection begins with a line giving the first object number and the number of entries. Each entry is exactly 20 bytes long, including the line-ending character(s).

The first ten bytes of an entry hold a ten-digit number: the object’s byte offset for an in-use entry, or the number of the next free object for a free entry. A space follows, then a five-digit generation number, another space, and a single letter giving the entry’s status: ‘n’ for in use or ‘f’ for free. A two-byte end-of-line marker closes the entry.
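Because every entry has the same fixed width, it can be decoded by slicing. The sketch below assumes the classic 20-byte table format described above (not the compressed cross-reference streams used by some newer files); XrefEntry and parse_xref_entry are names invented for this example.

```python
from typing import NamedTuple


class XrefEntry(NamedTuple):
    offset: int       # byte offset ('n' entries) or next free object number ('f' entries)
    generation: int   # generation number
    in_use: bool      # True for 'n', False for 'f'


def parse_xref_entry(entry: bytes) -> XrefEntry:
    """Decode one fixed-width, 20-byte cross-reference table entry."""
    if len(entry) != 20:
        raise ValueError("an xref entry must be exactly 20 bytes long")
    flag = entry[17:18]
    if entry[10:11] != b" " or entry[16:17] != b" " or flag not in (b"n", b"f"):
        raise ValueError("malformed xref entry")
    offset = int(entry[0:10])        # ten-digit field
    generation = int(entry[11:16])   # five-digit field
    return XrefEntry(offset, generation, flag == b"n")


print(parse_xref_entry(b"0000000017 00000 n \n"))
print(parse_xref_entry(b"0000000000 65535 f \n"))
```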

File Trailer

The trailer lets a conforming reader quickly locate the cross-reference table and certain special objects. Readers process a PDF starting from the end of the file: the trailer, which is essentially a dictionary of key-value pairs describing the document, is read first.

The final line of the file should contain only the end-of-file marker, %%EOF. The two lines before it contain the keyword ‘startxref’ and the byte offset from the beginning of the file to the ‘xref’ keyword of the last cross-reference section.

The trailer section starts with the keyword trailer, followed by the trailer dictionary. For example:

trailer
<<
/Size 22
/Root 2 0 R
/Info 1 0 R
>>
startxref
24212
%%EOF
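A reader’s first step, finding the cross-reference table from the end of the file, might look like the following sketch. The function name find_startxref and the 1024-byte search window are assumptions for illustration; real readers differ in how much of the file’s tail they scan.

```python
import re


def find_startxref(data: bytes) -> int:
    """Return the byte offset that follows the last 'startxref' keyword."""
    # Assumption: the startxref / %%EOF pair sits within the last 1 KB.
    tail = data[-1024:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref/%%EOF pair found near the end of the file")
    return int(matches[-1])


sample_tail = (
    b"trailer\n<<\n/Size 22\n/Root 2 0 R\n/Info 1 0 R\n>>\n"
    b"startxref\n24212\n%%EOF\n"
)
print(find_startxref(sample_tail))  # 24212
```

For a real file, read it in binary mode first, for example with open("temp.pdf", "rb"), and pass the bytes to the function.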

Incremental Updates

Can a PDF be edited? Yes. The PDF specification allows new data to be appended to an existing file without rewriting what is already there. An incremental update adds the following components to the end of the file:

  1. New objects that describe the revised content.
  2. A new xref section that records the offsets of the new objects and invalidates the old ones. Objects deleted from a file are still physically present but are marked as free (“f”).
  3. A new file trailer with a /Prev entry that gives the byte offset of the previous xref section.
  4. A new startxref line and %%EOF marker that point to the updated xref section.

The updated file keeps its original header, body, cross-reference table, and trailer intact. The new body, cross-reference, and trailer sections are simply appended after them.
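One way to see the revisions in a file is to start at the last startxref and follow each trailer’s /Prev entry backwards. The sketch below assumes classic xref tables with a trailer keyword (it does not handle cross-reference streams), and xref_chain is a hypothetical helper name.

```python
import re


def xref_chain(data: bytes) -> list:
    """Return xref section offsets, newest first, by following /Prev entries."""
    found = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data)
    if not found:
        raise ValueError("no startxref/%%EOF pair found")
    offset = int(found[-1])  # the last pair describes the newest revision

    chain = []
    while offset is not None and offset not in chain:  # guard against loops
        chain.append(offset)
        trailer_pos = data.find(b"trailer", offset)
        if trailer_pos == -1:
            break
        # The trailer dictionary ends before the startxref keyword that follows it.
        end = data.find(b"startxref", trailer_pos)
        if end == -1:
            end = len(data)
        prev = re.search(rb"/Prev\s+(\d+)", data[trailer_pos:end])
        offset = int(prev.group(1)) if prev else None
    return chain


# Synthetic two-revision file: an original section plus one appended update.
section = b"xref\n0 1\n0000000000 65535 f \ntrailer\n"
original = b"%PDF-1.4\n" + section + b"<< /Size 1 >>\nstartxref\n9\n%%EOF\n"
second = len(original)
updated = (
    original + section + b"<< /Size 1 /Prev 9 >>\nstartxref\n"
    + str(second).encode() + b"\n%%EOF\n"
)
print(xref_chain(updated))  # newest offset first, then 9
```

A chain with more than one offset indicates that the file has been incrementally updated at least once.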
