Extract PDF Pages and Rename Based on Text in Each Page (Python)

I was recently tasked with traversing a directory and its sub-directories to find PDFs and split any multi-page files into single-page files. The end goal was to name each extracted page, now an individual PDF, with the document number printed on that page. There were possibly over 100 PDF files in the directory, and each PDF could have anywhere from one to more than ten pages. At the extreme I could have been looking at around one thousand pages to extract and rename, a task that would have been very time-consuming and mind-numbing to do manually.

The PDFs contained map books produced using Data Driven Pages in ArcGIS. It was conceivable that I could re-open the original MXDs and re-export each map book as individual pages, naming them appropriately based on the document name in the attribute table. Since I was not the creator of any of these PDFs and they all came from different teams, hunting down the correct MXDs and exporting them would have been cumbersome and also very time-consuming. There had to be a more interesting and time-efficient way…

…A quick search on Google for some Python modules and I had what I needed to complete my task in a more automated and time-efficient manner. I needed three modules:
(1) os – for traversing the directories and files and for renaming the files
(2) PyPDF2 – to read/write PDF files and also to extract text from pages
(3) re – the regular expression module to find the text needed to rename the file.
The next step was to write down some pseudocode to map out what needed to be achieved and then to get coding…

Let’s begin by importing the modules at the top of the script.

import os, PyPDF2, re

Define a function to extract the pages. This function takes two parameters: the path to the root directory and the path to a folder to extract the pages to. The ‘extract_to_folder’ needs to be on the same level as, or above, the root directory. Use your operating system to create this folder, named ‘extracted’, and also create a second folder called ‘renamed’.
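As an alternative to creating the folders by hand, they could be created from the script itself. The sketch below is only an illustration: `prepare_output_folder` is a hypothetical helper that is not part of the original script. It assumes the reason for keeping the output folder outside the root directory is that `os.walk` would otherwise also find the pages that have just been extracted.

```python
import os

def prepare_output_folder(root_directory, output_folder):
    """Create output_folder if needed, refusing a folder inside root_directory.

    Hypothetical helper, not part of the original script.
    """
    root_abs = os.path.abspath(root_directory)
    out_abs = os.path.abspath(output_folder)
    try:
        # commonpath returns the longest shared parent of both paths; if that
        # is the root itself, the output folder lies inside (or is) the root.
        inside_root = os.path.commonpath([root_abs, out_abs]) == root_abs
    except ValueError:
        # Raised on Windows when the two paths are on different drives.
        inside_root = False
    if inside_root:
        raise ValueError("output folder must not be inside the root directory")
    os.makedirs(out_abs, exist_ok=True)
    return out_abs
```

Calling it once for the ‘extracted’ folder and once for the ‘renamed’ folder before running the two functions would replace the manual step.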

def split_pdf_pages(root_directory, extract_to_folder):

Next we use the os module to search from the root directory down to find any PDF files and store the full filepath as a variable, one at a time.

 for root, dirs, files in os.walk(root_directory):
  for filename in files:
   basename, extension = os.path.splitext(filename)
   if extension == ".pdf":
    fullpath = root + "\\" + basename + extension
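The same search can be written so that it does not depend on the Windows path separator or on the case of the file extension. This is a minimal sketch rather than a replacement for the code above, and `find_pdfs` is a hypothetical name that does not appear in the original script.

```python
import os

def find_pdfs(root_directory):
    """Yield the full path of every PDF below root_directory.

    Hypothetical helper, not part of the original script.
    """
    for root, dirs, files in os.walk(root_directory):
        for filename in files:
            extension = os.path.splitext(filename)[1]
            # Compare in lower case so that ".PDF" and ".Pdf" also match.
            if extension.lower() == ".pdf":
                # os.path.join inserts the separator for the current platform.
                yield os.path.join(root, filename)
```

Looping with `for fullpath in find_pdfs(root_directory):` would then give one full filepath at a time, as in the snippet above.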

We then open that PDF in read mode.

    opened_pdf = PyPDF2.PdfFileReader(open(fullpath,"rb"))

For each page in the PDF, the page is extracted and saved as a new PDF file in the ‘extracted’ folder. The snippet below was sourced from Stack Overflow.

    for i in range(opened_pdf.numPages):
     output = PyPDF2.PdfFileWriter()
     output.addPage(opened_pdf.getPage(i))
     with open(extract_to_folder + "\\" + basename + "-%s.pdf" % i, "wb") as output_pdf:
      output.write(output_pdf)

That completes our function to strip out individual pages from PDF files in a root directory and down through all corresponding sub-directories. This function might be all you need as you can rename the extracted pages as you save each file. The next task for me, however, was to rename the PDFs based on text contained in each individual file.
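If the extracted pages only need a tidier name, the output filename can be built in a small helper. The sketch below is an assumption about what a tidier name might look like (pages numbered from 1 and zero-padded); `page_output_path` and its `width` parameter are hypothetical and not part of the original script.

```python
import os

def page_output_path(extract_to_folder, basename, page_index, width=3):
    """Build the output path for one extracted page.

    Hypothetical helper; page_index is zero-based, like the loop counter i.
    """
    # Number pages from 1 and pad with zeros so that the files sort in order.
    page_label = str(page_index + 1).zfill(width)
    return os.path.join(extract_to_folder, basename + "-" + page_label + ".pdf")
```

Note that, as in the function above, two PDFs with the same basename in different sub-directories would still map to the same output names, so the later file would overwrite the earlier one.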

Define a function called ‘rename_pdfs’ that takes two arguments: the path to the folder where the extracted pages reside and the path to the renamed folder. Loop through each PDF and build the full filepath to each one.

def rename_pdfs(extracted_pdf_folder, rename_folder):
 for root, dirs, files in os.walk(extracted_pdf_folder):
  for filename in files:
   basename, extension = os.path.splitext(filename)
   if extension == ".pdf":
    fullpath = root + "\\" + basename + extension

Open each PDF in read mode…

    pdf_file_obj = open(fullpath, "rb")
    pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)

…and create a page object.

    page_obj = pdf_reader.getPage(0)

Now we extract the text from the page.

    pdf_text = page_obj.extractText()

My task was made quite easy because each page had a unique document number, and every number began with exactly the same prefix. This meant that I could use a regular expression, via the re module, to find the prefix and then obtain the rest of the document number.

The code below finds the document number prefix in the text extracted from the page and appends the next 14 characters to the prefix to give the full document number.

    for index in re.finditer("THE-DOC-PREFIX-", pdf_text):
     doc_ext = pdf_text[index.end():index.end() + 14]
     doc_num = "THE-DOC-PREFIX-" + doc_ext
     pdf_file_obj.close()
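For comparison, the lookup can be sketched as a small standalone function that returns the first document number or nothing at all. This is not the code used in the script: `find_doc_number` is a hypothetical helper, and it makes the extra assumption that the 14 characters after the prefix contain no whitespace, whereas the loop above takes the next 14 characters whatever they are.

```python
import re

def find_doc_number(page_text, prefix="THE-DOC-PREFIX-", suffix_length=14):
    """Return the first document number in page_text, or None if absent.

    Hypothetical helper; the prefix and length mirror the placeholders above.
    """
    # re.escape guards against prefixes containing regex metacharacters;
    # the group captures exactly suffix_length non-whitespace characters.
    pattern = re.escape(prefix) + r"(\S{%d})" % suffix_length
    match = re.search(pattern, page_text)
    if match is None:
        return None
    return prefix + match.group(1)
```

Returning `None` makes the "no text found" case explicit, which is the case that leaves a page un-renamed.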

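The last thing to do is to use the document number to rename the PDF. This happens inside the loop, so a page with no match is left where it is, and the loop stops after the first match because the file has then been moved.

     os.rename(fullpath, rename_folder + "\\" + doc_num + ".pdf")
     break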

That completes the two functions required to complete the task.
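One case the rename step does not cover is two pages yielding the same document number. On Windows `os.rename` raises an error if the target already exists, while on other platforms the existing file may be silently replaced. A minimal sketch of one way around this, assuming a numeric suffix is acceptable; `unique_pdf_path` is a hypothetical helper, not part of the original script.

```python
import os

def unique_pdf_path(folder, doc_num):
    """Return a path in folder for doc_num that does not exist yet.

    Hypothetical helper, not part of the original script.
    """
    candidate = os.path.join(folder, doc_num + ".pdf")
    counter = 1
    # Append _1, _2, ... until the name is free.
    while os.path.exists(candidate):
        candidate = os.path.join(folder, "%s_%d.pdf" % (doc_num, counter))
        counter += 1
    return candidate
```

In the rename step this could be used as `os.rename(fullpath, unique_pdf_path(rename_folder, doc_num))`.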

Set up the variables required for the function parameters…

root_dir = r"C:\Users\******\Documents\original"
extract_to = r"C:\Users\******\Documents\extracted"
rename_to = r"C:\Users\******\Documents\renamed"

…and then call each function.

split_pdf_pages(root_dir, extract_to)
rename_pdfs(extract_to,rename_to)

Run the script. The original files will remain and the renamed extracted pages will be in the renamed folder. Any PDF page that failed to be renamed will still be in the extracted folder, and you can rename these manually. Whether a page can be renamed depends on the make-up of the PDF, i.e. how it was created or exported from a piece of software. In a test run, 10 out of 206 pages failed to be renamed. When I opened those pages, the select tool was unable to highlight any text because everything was embedded as an image, which is why the script couldn’t read any text to rename the document.
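To see which pages still need renaming by hand, the extracted folder can simply be listed after the run. A minimal sketch, assuming nothing else is written to that folder; `list_unrenamed` is a hypothetical helper name.

```python
import os

def list_unrenamed(extracted_folder):
    """Return the sorted names of PDFs still sitting in extracted_folder."""
    return sorted(
        name for name in os.listdir(extracted_folder)
        if name.lower().endswith(".pdf")
    )
```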

I hope someone out there will find this useful. I am always happy when my code works, but I would appreciate any constructive comments, hints or tips to make the code more efficient.

Here’s the full script…

# import the necessary modules
import os, PyPDF2, re

# function to extract the individual pages from each pdf found
def split_pdf_pages(root_directory, extract_to_folder):
 # traverse down through the root directory to sub-directories
 for root, dirs, files in os.walk(root_directory):
  for filename in files:
   basename, extension = os.path.splitext(filename)
   # if a file is a pdf
   if extension == ".pdf":
    # create a reference to the full filename path
    fullpath = root + "\\" + basename + extension

    # open the pdf in read mode
    opened_pdf = PyPDF2.PdfFileReader(open(fullpath,"rb"))

    # for each page in the pdf
    for i in range(opened_pdf.numPages):
     # write the page to a new pdf
     output = PyPDF2.PdfFileWriter()
     output.addPage(opened_pdf.getPage(i))
     with open(extract_to_folder + "\\" + basename + "-%s.pdf" % i, "wb") as output_pdf:
      output.write(output_pdf)

# function for renaming the single page pdfs based on text in the pdf
def rename_pdfs(extracted_pdf_folder, rename_folder):
 # traverse down through the root directory to sub-directories
 for root, dirs, files in os.walk(extracted_pdf_folder):
  for filename in files:
   basename, extension = os.path.splitext(filename)
   # if a file is a pdf
   if extension == ".pdf":
    # create a reference to the full filename path
    fullpath = root + "\\" + basename + extension

    # open the individual pdf
    pdf_file_obj = open(fullpath, "rb")
    pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)

    # access the individual page
    page_obj = pdf_reader.getPage(0)
    # extract the text
    pdf_text = page_obj.extractText()

    # use regex to find information
    for index in re.finditer("THE-DOC-PREFIX-", pdf_text):
     doc_ext = pdf_text[index.end():index.end() + 14]
     doc_num = "THE-DOC-PREFIX-" + doc_ext
     pdf_file_obj.close()
     # rename the pdf based on the information in the pdf
     os.rename(fullpath, rename_folder + "\\" + doc_num + ".pdf")
     # stop after the first match; the file has now been moved
     break

# parameter variables
root_dir = r"C:\Users\******\Documents\rename_pdf"
extract_to = r"C:\Users\******\Documents\extracted"
rename_to = r"C:\Users\******\Documents\renamed"

# use the two functions
split_pdf_pages(root_dir, extract_to)
rename_pdfs(extract_to,rename_to)

Resources:
PyPDF2
Automate the Boring Stuff with Python

