Extract PDF Pages and Rename Based on Text in Each Page (Python)

I was recently tasked with traversing a directory and its sub-directories to find PDFs and split any multi-page files into single-page files. The end goal was to name each extracted page, now an individual PDF, after the document number present on that page. There were possibly over 100 PDF files in the directory, and each PDF could have anywhere from one to more than ten pages. At the extreme I could have been looking at around one thousand pages to extract and rename, a task that would have been very time-consuming and mind-numbing to do manually.

The PDFs contained map books produced using Data Driven Pages in ArcGIS, so it was conceivable that I could re-open the original MXDs and re-export each map book as individual pages, naming them appropriately based on the document name in the attribute table. Since I was not the creator of any of these PDFs and they all came from different teams, hunting down the correct MXDs and exporting them would be cumbersome and also very time-consuming. There had to be a more interesting and time-efficient way…

…A quick search via Google for suitable Python modules and I had what I needed to complete my task in a more automated and time-efficient manner. I needed three modules:
(1) os – for traversing through the directories and files and for renaming the files
(2) PyPDF2 – to read/write PDF files and also to extract text from pages. The code in this post uses the PyPDF2 1.x names (PdfFileReader, PdfFileWriter, numPages, getPage, addPage, extractText); PyPDF2 3.0 removed these in favour of PdfReader, PdfWriter, reader.pages, add_page and extract_text.
(3) re – the regular expression module to find the text needed to rename the file.
The next step was to write down some pseudocode to map out what needed to be achieved and then to get coding…

Let’s begin by importing the modules at the top of the script.

import os, PyPDF2, re

Define a function to extract the pages. This function takes two parameters: the path to the root directory and the path to a folder to extract the pages to. The ‘extract_to_folder’ needs to be on the same level as or above the root directory, not inside it. Use your operating system to create the folder named ‘extracted’, and also create a second folder called ‘renamed’.

def split_pdf_pages(root_directory, extract_to_folder):

Next we use the os module to search from the root directory down to find any PDF files and store the full filepath as a variable, one at a time.

 for root, dirs, files in os.walk(root_directory):
  for filename in files:
   basename, extension = os.path.splitext(filename)
   if extension == ".pdf":
    fullpath = root + "\\" + basename + extension

We then open that PDF in read mode.

    opened_pdf = PyPDF2.PdfFileReader(open(fullpath,"rb"))

For each page in the PDF, the page is extracted and saved as a new PDF file in the ‘extracted’ folder. The snippet below was sourced from Stack Overflow.

    for i in range(opened_pdf.numPages):
     output = PyPDF2.PdfFileWriter()
     output.addPage(opened_pdf.getPage(i))
     with open(extract_to_folder + "\\" + basename + "-%s.pdf" % i, "wb") as output_pdf:
      output.write(output_pdf)

That completes our function to strip out individual pages from PDF files in a root directory and down through all corresponding sub-directories. This function might be all you need as you can rename the extracted pages as you save each file. The next task for me, however, was to rename the PDFs based on text contained in each individual file.
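The traversal in split_pdf_pages only matches a lower-case ".pdf" extension and builds paths with a hard-coded backslash. As a minimal sketch of a more portable variant, the search could be pulled out into a small generator. The name iter_pdf_paths is hypothetical and not part of the original script, and it uses only the standard library.

```python
import os

def iter_pdf_paths(root_directory):
    """Yield the full path of every PDF found below root_directory."""
    for root, dirs, files in os.walk(root_directory):
        for filename in files:
            extension = os.path.splitext(filename)[1]
            # compare in lower case so ".PDF" and ".Pdf" are found too
            if extension.lower() == ".pdf":
                # os.path.join picks the right separator for the operating system
                yield os.path.join(root, filename)
```

Each yielded path could then be opened in place of the fullpath variable built inside split_pdf_pages.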

Define a function called ‘rename_pdfs’ that takes two arguments: the path to the folder where the extracted pages reside and the path to the renamed folder. Loop through each PDF and create a filepath to each one.

def rename_pdfs(extraced_pdf_folder, rename_folder):
 for root, dirs, files in os.walk(extraced_pdf_folder):
  for filename in files:
   basename, extension = os.path.splitext(filename)
   if extension == ".pdf":
    fullpath = root + "\\" + basename + extension

Open each PDF in read mode…

    pdf_file_obj = open(fullpath, "rb")
    pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)

…and create a page object.

    page_obj = pdf_reader.getPage(0)

Now we extract the text from the page.

    pdf_text = page_obj.extractText()

My task was made quite easy because each page had a unique document number, and every document number started with exactly the same prefix. This meant that I could use a regular expression, via the re module, to find the prefix and then obtain the rest of the document number.

The code below finds the document number prefix in the text extracted from the page and appends the next 14 characters to the prefix to give the full document number.

    for index in re.finditer("THE-DOC-PREFIX-", pdf_text):
     doc_ext = pdf_text[index.end():index.end() + 14]
     doc_num = "THE-DOC-PREFIX-" + doc_ext
     pdf_file_obj.close()

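The last thing to do is to use the document number to rename the PDF. The rename stays inside the loop so that it only runs when the prefix was found, and the break stops the loop after the first match because the file has already been moved.

     os.rename(fullpath, rename_folder + "\\" + doc_num + ".pdf")
     break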

That completes the two functions required to complete the task.
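The prefix search and the 14-character slice can also be expressed as a single regular expression. The sketch below is an illustration rather than the code used in the script: extract_doc_number is a hypothetical helper, and its prefix and length arguments simply default to the placeholder values used above. One behavioural difference to be aware of: it returns None when fewer than 14 characters follow the prefix, whereas the slice in the script would return a shorter string.

```python
import re

def extract_doc_number(pdf_text, prefix="THE-DOC-PREFIX-", length=14):
    """Return the prefix plus the next `length` characters, or None if absent."""
    # re.escape guards against regex metacharacters in the prefix;
    # re.DOTALL lets "." match newlines, as the plain string slice would
    pattern = re.escape(prefix) + ".{%d}" % length
    match = re.search(pattern, pdf_text, flags=re.DOTALL)
    return match.group(0) if match else None
```

A caller can test the result for None and leave the page in the extracted folder when no document number was found.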

Set up the variables required for the function parameters…

root_dir = r"C:\Users\******\Documents\original"
extract_to = r"C:\Users\******\Documents\extracted"
rename_to = r"C:\Users\******\Documents\renamed"

…and then call each function.

split_pdf_pages(root_dir, extract_to)
rename_pdfs(extract_to,rename_to)
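Both calls assume that the ‘extracted’ and ‘renamed’ folders already exist and sit outside the root directory. As a rough sketch, that preparation could be scripted instead of done by hand. prepare_output_folder is a hypothetical helper, and the reason given in its comment for rejecting a folder inside the root is an assumption, not something the original script checks.

```python
import os

def prepare_output_folder(root_directory, folder):
    """Create folder if it is missing, refusing a folder inside root_directory."""
    # normalise both paths so the comparison ignores relative parts and,
    # on Windows, differences in case
    root_abs = os.path.normcase(os.path.abspath(root_directory))
    folder_abs = os.path.normcase(os.path.abspath(folder))
    root_prefix = root_abs.rstrip(os.sep) + os.sep
    # assumption: an output folder inside the root would be walked by
    # os.walk too, so already extracted pages would be split again
    if folder_abs == root_abs or folder_abs.startswith(root_prefix):
        raise ValueError("%s is inside %s" % (folder, root_directory))
    # exist_ok=True makes this safe to call when the folder already exists
    os.makedirs(folder, exist_ok=True)
    return folder
```

Calling it once for extract_to and once for rename_to before the two functions run would replace the manual folder creation.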

Run the script. The original files will remain and the renamed extracted pages will be in the renamed folder. Any PDF page that failed to be renamed will still be in the extracted folder, and you can rename these manually. Whether a page can be renamed depends on the make-up of the PDF, i.e. the way it was exported from a piece of software or how it was created. In a test run, 10 out of 206 pages failed to be renamed. When I opened those pages, the select tool was unable to highlight any text because everything was embedded as an image, which is why the script couldn’t read any text to rename the document.
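To see which pages still need renaming by hand after a run, the leftovers in the extracted folder can be listed. This is a small sketch using only the standard library; list_unrenamed is a hypothetical helper name and is not part of the script.

```python
import os

def list_unrenamed(extracted_folder):
    """Return the sorted names of the PDFs still sitting in extracted_folder."""
    return sorted(
        name for name in os.listdir(extracted_folder)
        if name.lower().endswith(".pdf")
    )
```

Printing the result of list_unrenamed(extract_to) after rename_pdfs has finished gives the list of pages to check manually.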

I hope someone out there will find this useful. I am always happy when my code works, but I would appreciate any constructive comments or hints and tips to make the code more efficient.

Here’s the full script…

# import the necessary modules
import os, PyPDF2, re

# function to extract the individual pages from each pdf found
def split_pdf_pages(root_directory, extract_to_folder):
 # traverse down through the root directory to sub-directories
 for root, dirs, files in os.walk(root_directory):
  for filename in files:
   basename, extension = os.path.splitext(filename)
   # if a file is a pdf
   if extension == ".pdf":
    # create a reference to the full filename path
    fullpath = root + "\\" + basename + extension

    # open the pdf in read mode
    opened_pdf = PyPDF2.PdfFileReader(open(fullpath,"rb"))

    # for each page in the pdf
    for i in range(opened_pdf.numPages):
     # write the page to a new pdf
     output = PyPDF2.PdfFileWriter()
     output.addPage(opened_pdf.getPage(i))
     with open(extract_to_folder + "\\" + basename + "-%s.pdf" % i, "wb") as output_pdf:
      output.write(output_pdf)

# function for renaming the single page pdfs based on text in the pdf
def rename_pdfs(extraced_pdf_folder, rename_folder):
 # traverse down through the root directory to sub-directories
 for root, dirs, files in os.walk(extraced_pdf_folder):
  for filename in files:
   basename, extension = os.path.splitext(filename)
   # if a file is a pdf
   if extension == ".pdf":
    # create a reference to the full filename path
    fullpath = root + "\\" + basename + extension

    # open the individual pdf
    pdf_file_obj = open(fullpath, "rb")
    pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)

    # access the individual page
    page_obj = pdf_reader.getPage(0)
    # extract the text
    pdf_text = page_obj.extractText()

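    # use regex to find information
    for index in re.finditer("THE-DOC-PREFIX-", pdf_text):
     doc_ext = pdf_text[index.end():index.end() + 14]
     doc_num = "THE-DOC-PREFIX-" + doc_ext
     pdf_file_obj.close()
     # rename the pdf based on the information in the pdf
     os.rename(fullpath, rename_folder + "\\" + doc_num + ".pdf")
     # the file has been moved, so stop after the first match
     break
    else:
     # no document number found: close the file and leave the pdf in place
     pdf_file_obj.close()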

# parameter variables
root_dir = r"C:\Users\******\Documents\original"
extract_to = r"C:\Users\******\Documents\extracted"
rename_to = r"C:\Users\******\Documents\renamed"

# use the two functions
split_pdf_pages(root_dir, extract_to)
rename_pdfs(extract_to,rename_to)

Resources:
PyPDF2
Automate the Boring Stuff with Python

17 thoughts on “Extract PDF Pages and Rename Based on Text in Each Page (Python)”

  1. This is absolutely amazing and enormously helpful. I needed to split a 500 page pdf with employee number and name, and your code made it a cinch.

    Thanks so much – your help was greatly appreciated!

  2. Thank you so much for supplying the code above. I am trying to use it now, and running into a snag. The text I would like to use to rename the PDF is the first 6 characters of the extracted text. Any ideas on how I can modify the regex section to accomplish this?

    • The extracted text from the whole document? You don’t need regex, just use os.rename(fullpath, rename_folder + "\\" + pdf_text[0:6] + ".pdf"). If you need the first 6 characters after the text you are searching for, change the 14 to 6, and set doc_num = doc_ext.

      • Thanks for the reply. I was able to get it working. One other question: is there a way that I can select the text that is before the value in re.finditer? The value I need to pull is sometimes the first string, but can also be buried a bit, but it is always before the string “Check”.

        • Change [index.end():index.end() + 14] to [index.start() - 6:index.start()]. This will get the 6 characters before the text (Check) you are searching for. Change the slice to suit your needs. Hope that helps.

          • Thank you for all of your help. I am very new to Python and this is one of my first programs. Would you be able to take a look at this snippet of code I modified from your post? It runs fine in debug, but errors out when I try to run it normally. The error I get is a PermissionError on the shutil.move line. It says the file is being used by another process.

            # function for renaming the single page pdfs based on text in the pdf
            def rename_pdfs(extraced_pdf_folder, rename_folder):
             # traverse down through the root directory to sub-directories
             for root, dirs, files in os.walk(extraced_pdf_folder):
              for filename in files:
               basename, extension = os.path.splitext(filename)
               # if a file is a pdf
               if extension == ".pdf":
                # create a reference to the full filename path
                fullpath = root + "\\" + basename + extension

                # open the individual pdf
                pdf_file_obj = open(fullpath, "rb")
                pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)

                # access the individual page
                page_obj = pdf_reader.getPage(0)
                # extract the text
                pdf_text = page_obj.extractText()

                # Find the check number
                pdf_text = pdf_text.split('Check', 1)[0]
                doc_num = pdf_text[-7:-1]
                pdf_file_obj.close()
                # rename the pdf based on the information in the pdf
                destinationlocation = rename_folder + "\\" + doc_num + ".pdf"
                shutil.move(fullpath, destinationlocation)

  3. Hi, I am trying to run this script on Pycharm. Below is the error that I am getting.

    Traceback (most recent call last):
      File "/Users/xxxxx/PycharmProjects/Proj01/Excel.py", line 43, in <module>
        print(split_pdf_pages(root_dir,extract_to))
      File "/Users/xxxxx/PycharmProjects/Proj01/Excel.py", line 10, in split_pdf_pages
        opened_pdf = PyPDF2.PdfFileReader(open(fullpath, "rb"))
    UnboundLocalError: local variable 'fullpath' referenced before assignment

    I just started learning python. Please could you advise?

  4. New to Python, but found a way to make this work, which was super fast and cool. Just love learning new code. How would I change this to grab 2 pages at a time?

  5. Hi,

    I want to split the PDF based on a keyword on the page. A split document may contain multiple pages, depending on the keyword on each page. Could you help me with that?

  6. Hi, I have a list of PDFs in a folder, and each PDF has page numbers inside it. I want to rename each PDF with the page numbers inside it, so there is no need to split each PDF into single-page PDFs; I just want to rename them. Could you please help me with code specific to this requirement?
