Extract PDF Pages and Rename Based on Text in Each Page (Python)

I was recently tasked with traversing a directory and its sub-directories to find PDFs and split any multi-page files into single-page files. The end goal was to name each extracted page, now an individual PDF, after the document number present on that page. There were possibly over 100 PDF files in the directory, and each PDF could have anywhere from one to more than ten pages. At the extreme I could have been looking at around one thousand pages to extract and rename, a task that would have been very time-consuming and mind-numbing to do manually.

The PDFs contained map books produced using data driven pages in ArcGIS, so it was conceivable that I could re-open the original MXDs and re-export each map book as individual pages, naming them appropriately based on the document name in the attribute table. Since I was not the creator of any of these PDFs and they all came from different teams, hunting down the correct MXDs and exporting would have been cumbersome and also very time-consuming. There had to be a more interesting and time-efficient way…

…Some quick research via Google on a few Python modules and I had what I needed to complete my task in a more automated and time-efficient manner. I needed three modules:
(1) os – for traversing through the directories and files and for renaming the files
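(2) PyPDF2 – to read/write PDF files and also to extract text from pages (the code in this post uses the original camelCase names such as PdfFileReader and getPage, which were removed in PyPDF2 3.0)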
(3) re – the regular expression module to find the text needed to rename the file.
The next step was to write down some pseudocode to map out what needed to be achieved and then to get coding…

Let’s begin by importing the modules at the top of the script.

import os, PyPDF2, re

Define a function to extract the pages. This function takes two parameters: the path to the root directory and the path to a folder to extract the pages to. The ‘extract_to_folder’ needs to be on the same level as or above the root directory, not inside it, otherwise the directory walk could also pick up the extracted pages. Use your operating system to create the folder named ‘extracted’ and also create a second folder called ‘renamed’.

def split_pdf_pages(root_directory, extract_to_folder):

Next we use the os module to search from the root directory down to find any PDF files and store the full filepath as a variable, one at a time.

 for root, dirs, files in os.walk(root_directory):
  for filename in files:
   basename, extension = os.path.splitext(filename)
   if extension.lower() == ".pdf":
    fullpath = os.path.join(root, filename)
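The traversal can be tried on its own before any PDF is opened. The sketch below is a minimal illustration using only the os module; the helper name find_pdfs is made up for this example and is not part of the script. It compares the extension case-insensitively, on the assumption that files named .PDF should be picked up too.

```python
import os


def find_pdfs(root_directory):
    """Yield the full path of every PDF below root_directory (hypothetical helper)."""
    for root, dirs, files in os.walk(root_directory):
        for filename in files:
            extension = os.path.splitext(filename)[1]
            # lower() so that "MAP.PDF" is found as well as "map.pdf"
            if extension.lower() == ".pdf":
                # os.path.join uses the right separator for the operating system
                yield os.path.join(root, filename)
```

Calling sorted(find_pdfs(root_dir)) gives a list you can inspect to confirm that the walk finds the files you expect.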

We then open that PDF in read mode.

    opened_pdf = PyPDF2.PdfFileReader(open(fullpath,"rb"))

For each page in the PDF, the page is extracted and saved as a new PDF file in the ‘extracted’ folder, named with the original basename followed by the page index (starting at 0). The snippet below was sourced from Stack Overflow.

    for i in range(opened_pdf.numPages):
     output = PyPDF2.PdfFileWriter()
     output.addPage(opened_pdf.getPage(i))
     with open(os.path.join(extract_to_folder, basename + "-%s.pdf" % i), "wb") as output_pdf:
      output.write(output_pdf)

That completes our function to strip out individual pages from PDF files in a root directory and down through all corresponding sub-directories. This function might be all you need as you can rename the extracted pages as you save each file. The next task for me, however, was to rename the PDFs based on text contained in each individual file.
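One thing to watch: every page is written into the same folder using only the basename of the file, so two PDFs with the same name in different sub-directories would overwrite each other's pages. A possible way around that, sketched below, is to fold the sub-directory path into the output name. The helper page_filename and the double-underscore separator are assumptions for illustration, not part of the original script.

```python
import os


def page_filename(root_directory, pdf_path, page_index):
    """Build an output name such as 'teamA__book-0.pdf' (hypothetical helper)."""
    # path of the PDF relative to the root directory, without the extension
    relative = os.path.relpath(pdf_path, root_directory)
    stem = os.path.splitext(relative)[0]
    # replace directory separators so the result is a single file name
    flat = stem.replace(os.sep, "__")
    return "%s-%s.pdf" % (flat, page_index)
```

Inside split_pdf_pages the name could then be used as os.path.join(extract_to_folder, page_filename(root_directory, fullpath, i)).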

Define a function called ‘rename_pdfs’ that takes two arguments: the path to the folder where the extracted pages reside and the path to the renamed folder. Loop through each PDF and create a filepath to each one.

def rename_pdfs(extracted_pdf_folder, rename_folder):
 for root, dirs, files in os.walk(extracted_pdf_folder):
  for filename in files:
   basename, extension = os.path.splitext(filename)
   if extension.lower() == ".pdf":
    fullpath = os.path.join(root, filename)

Open each PDF in read mode…

    pdf_file_obj = open(fullpath, "rb")
    pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)

…and create a page object.

    page_obj = pdf_reader.getPage(0)

Now we extract the text from the page.

    pdf_text = page_obj.extractText()

My task was made quite easy because each page had a unique document number, and every document number started with exactly the same prefix. This meant that I could use a regular expression, via the re module, to find the prefix and then obtain the rest of the document number.

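The code below finds the document number prefix in the text extracted from the page and appends the next 14 characters to the prefix to give the full document number. The file is closed first, since the text has already been read and an open file cannot be renamed on Windows.

    pdf_file_obj.close()
    for index in re.finditer("THE-DOC-PREFIX-", pdf_text):
     doc_ext = pdf_text[index.end():index.end() + 14]
     doc_num = "THE-DOC-PREFIX-" + doc_ext

The last thing to do is to use the document number to rename the PDF. The rename sits inside the loop so that it only runs when the prefix was found, and the break stops after the first match because the file has already been moved.

     os.rename(fullpath, os.path.join(rename_folder, doc_num + ".pdf"))
     break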
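The same lookup can also be written as a single pattern that captures the prefix and the 14 characters after it in one go. The sketch below assumes, purely for illustration, that those 14 characters are letters, digits or hyphens; adjust the character class to your own numbering scheme. The names DOC_NUM_PATTERN and find_doc_num are hypothetical.

```python
import re

# the prefix followed by exactly 14 letters, digits or hyphens (assumed format)
DOC_NUM_PATTERN = re.compile(r"THE-DOC-PREFIX-[A-Za-z0-9-]{14}")


def find_doc_num(pdf_text):
    """Return the first document number found in the text, or None."""
    match = DOC_NUM_PATTERN.search(pdf_text)
    return match.group(0) if match else None
```

Returning None for pages without a match makes it easy to skip the rename for those pages, and restricting the characters keeps stray line breaks or spaces out of the new file name.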

That completes the two functions required to complete the task.

Set up the variables required for the function parameters…

root_dir = r"C:\Users\******\Documents\original"
extract_to = r"C:\Users\******\Documents\extracted"
rename_to = r"C:\Users\******\Documents\renamed"

…and then call each function.

split_pdf_pages(root_dir, extract_to)
rename_pdfs(extract_to, rename_to)
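Both output folders have to exist before the calls above are made, and neither should sit inside the root directory. Instead of creating the folders by hand, a small check like the following could be run first. This is only a sketch; prepare_folders is a made-up name and the error message is an assumption.

```python
import os


def prepare_folders(root_directory, *output_folders):
    """Create each output folder, refusing any that lies inside the root."""
    root = os.path.normcase(os.path.abspath(root_directory))
    root_prefix = root.rstrip(os.sep) + os.sep
    for folder in output_folders:
        target = os.path.normcase(os.path.abspath(folder))
        # the trailing separator stops "original2" from matching "original"
        if (target + os.sep).startswith(root_prefix):
            raise ValueError("%s is inside %s" % (folder, root_directory))
        # exist_ok=True means an already existing folder is not an error
        os.makedirs(folder, exist_ok=True)
```

With the variables defined earlier, the call would be prepare_folders(root_dir, extract_to, rename_to).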

Run the script. The original files will remain untouched and the renamed extracted pages will be in the renamed folder. Any PDF page that failed to be renamed will still be in the extracted folder, and you can rename these manually. Whether a page can be renamed depends on the make-up of the PDF, i.e. the way it was exported from a piece of software or how it was created. In a test run, 10 out of 206 pages failed to be renamed. When I opened those pages, the select tool was unable to highlight any text because everything was embedded as an image, which is why the script couldn’t read any text to rename the document.
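To see which pages still need attention after a run, the extracted folder can simply be listed again. The snippet below is a small sketch along those lines; the function names and the printed message are made up for this example.

```python
import os


def leftover_pdfs(extracted_pdf_folder):
    """Return the sorted names of PDFs still sitting in the extracted folder."""
    return sorted(
        name for name in os.listdir(extracted_pdf_folder)
        if name.lower().endswith(".pdf")
    )


def report_leftovers(extracted_pdf_folder):
    """Print each page that was not renamed and return how many there were."""
    leftovers = leftover_pdfs(extracted_pdf_folder)
    for name in leftovers:
        print("not renamed: " + name)
    return len(leftovers)
```

Running report_leftovers(extract_to) after the two functions gives a quick list of the pages to rename by hand.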

I hope someone out there will find this useful. I am always happy when my code works, but I would appreciate any constructive comments or hints and tips to make the code more efficient.

Here’s the full script…

# import the necessary modules
import os, PyPDF2, re

# function to extract the individual pages from each pdf found
def split_pdf_pages(root_directory, extract_to_folder):
 # traverse down through the root directory to sub-directories
 for root, dirs, files in os.walk(root_directory):
  for filename in files:
   basename, extension = os.path.splitext(filename)
   # if a file is a pdf (the check ignores case, so .PDF also counts)
   if extension.lower() == ".pdf":
    # create a reference to the full filename path
    fullpath = os.path.join(root, filename)

    # open the pdf in read mode
    opened_pdf = PyPDF2.PdfFileReader(open(fullpath,"rb"))

    # for each page in the pdf
    for i in range(opened_pdf.numPages):
     # write the page to a new pdf named basename-pageindex.pdf
     output = PyPDF2.PdfFileWriter()
     output.addPage(opened_pdf.getPage(i))
     with open(os.path.join(extract_to_folder, basename + "-%s.pdf" % i), "wb") as output_pdf:
      output.write(output_pdf)

# function for renaming the single page pdfs based on text in the pdf
def rename_pdfs(extracted_pdf_folder, rename_folder):
 # traverse down through the root directory to sub-directories
 for root, dirs, files in os.walk(extracted_pdf_folder):
  for filename in files:
   basename, extension = os.path.splitext(filename)
   # if a file is a pdf (the check ignores case, so .PDF also counts)
   if extension.lower() == ".pdf":
    # create a reference to the full filename path
    fullpath = os.path.join(root, filename)

    # open the individual pdf
    pdf_file_obj = open(fullpath, "rb")
    pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)

    # access the individual page
    page_obj = pdf_reader.getPage(0)
    # extract the text
    pdf_text = page_obj.extractText()
    # close the file before it is renamed
    pdf_file_obj.close()

    # use regex to find information
    for index in re.finditer("THE-DOC-PREFIX-", pdf_text):
     doc_ext = pdf_text[index.end():index.end() + 14]
     doc_num = "THE-DOC-PREFIX-" + doc_ext
     # rename the pdf based on the information in the pdf
     os.rename(fullpath, os.path.join(rename_folder, doc_num + ".pdf"))
     # stop after the first match, the file has already been moved
     break

# parameter variables
root_dir = r"C:\Users\******\Documents\rename_pdf"
extract_to = r"C:\Users\******\Documents\extracted"
rename_to = r"C:\Users\******\Documents\renamed"

# use the two functions
split_pdf_pages(root_dir, extract_to)
rename_pdfs(extract_to, rename_to)

Resources:
PyPDF2
Automate the Boring Stuff with Python

17 thoughts on “Extract PDF Pages and Rename Based on Text in Each Page (Python)”

Gary on May 4, 2017 at 8:04 pm said:

This is absolutely amazing and enormously helpful. I needed to split a 500 page pdf with employee number and name, and your code made it a cinch.

Thanks so much – your help was greatly appreciated!

- clubdebambos on May 5, 2017 at 6:28 am said:
  
  Excellent, delighted to hear it was of use.
  
  - Michele Young on June 24, 2021 at 7:17 pm said:
    
    Could you help me do this with a real file? I have no idea what I am doing, other than I need to do this.
    
Nick on August 23, 2017 at 1:05 pm said:

Thank you so much for supplying the code above. I am trying to use it now, and running into a snag. The text I would like to use to rename the pdf is the first 6 characters from the extracted text. Any ideas on how I can modify the regex section to accomplish this?

- clubdebambos on August 23, 2017 at 1:26 pm said:
  
  The extracted text from the whole document? You don’t need regex, just use os.rename(fullpath, rename_folder + "\\" + pdf_text[0:6] + ".pdf"). If you need the first 6 characters after the text you are searching for, change the 14 to 6, and set doc_num = doc_ext.
  
  - Nick on August 23, 2017 at 1:38 pm said:
    
    Thanks for the reply. I was able to get it working. One other question. Is there a way that I can select the text that is before the value in re.finditer? The value I need to pull is sometimes the first string, but can also be buried a bit, but it is always before the string “Check”.
    
    - clubdebambos on August 23, 2017 at 1:57 pm said:
      
      Change [index.end():index.end() + 14] to [index.start() - 6:index.start()]. This will get the 6 characters before the text (Check) you are searching for. Change the slice to suit your needs. Hope that helps.
      
      - Nick on October 3, 2017 at 2:09 pm said:
        
        Thank you for all of your help. I am very new to Python and this is one of my first programs. Would you be able to take a look at this snippet of code I modified from your post? It runs fine in debug, but errors out when I try to run it normally. The error I get is a PermissionError on the shutil.move line. It says the file is being used by another process.
        
        # function for renaming the single page pdfs based on text in the pdf
        def rename_pdfs(extraced_pdf_folder, rename_folder):
         # traverse down through the root directory to sub-directories
         for root, dirs, files in os.walk(extraced_pdf_folder):
          for filename in files:
           basename, extension = os.path.splitext(filename)
           # if a file is a pdf
           if extension == ".pdf":
            # create a reference to the full filename path
            fullpath = root + "\\" + basename + extension
        
            # open the individual pdf
            pdf_file_obj = open(fullpath, "rb")
            pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)
        
            # access the individual page
            page_obj = pdf_reader.getPage(0)
            # extract the the text
            pdf_text = page_obj.extractText()
        
            # Find the check number
            pdf_text = pdf_text.split('Check', 1)[0]
            doc_num = pdf_text[-7:-1]
            pdf_file_obj.close()
            # rename the pdf based on the information in the pdf
            destinationlocation = rename_folder + "\\" + doc_num + ".pdf"
            shutil.move(fullpath, destinationlocation)
        
        clubdebambos on October 11, 2017 at 7:30 am said:
        
        By the look of it the file you are trying to move is possibly already being accessed. It’s rare I’ve used shutil but maybe look up some resources online such as http://www.pythonforbeginners.com/os/python-the-shutil-module or post a question on StackExchange. I have used shutil.copy2 with success in the past as this overcame permission issues.
        
Chigozie Onyenanu on April 13, 2019 at 1:25 pm said:

Hi, I am trying to run this script on Pycharm. Below is the error that I am getting.

Traceback (most recent call last):
  File "/Users/xxxxx/PycharmProjects/Proj01/Excel.py", line 43, in <module>
    print(split_pdf_pages(root_dir,extract_to))
  File "/Users/xxxxx/PycharmProjects/Proj01/Excel.py", line 10, in split_pdf_pages
    opened_pdf = PyPDF2.PdfFileReader(open(fullpath, "rb"))
UnboundLocalError: local variable 'fullpath' referenced before assignment

I just started learning python. Please could you advise?

Joel Boehme on November 22, 2019 at 12:46 pm said:

New to Python, but found a way to make this work, which was super fast and cool. Just love learning new code. How would I change this to grab 2 pages at a time?

Pallav Jha on January 8, 2020 at 7:52 am said:

Hi,

I want to split a PDF based on a keyword on the page. A split document may contain multiple pages, depending on the keyword on each page. Could you help me with that?

shashidhara reddy on April 28, 2020 at 9:49 am said:

Hi, I have a list of PDFs in a folder, and each PDF has page numbers inside it. I want to rename each PDF using the page numbers inside it, so there is no need to split each PDF into single-page PDFs; I just want to rename them. Could you please help me with the code specific to this requirement?

Simon Laycock (@notsofastmatey) on July 16, 2020 at 11:09 am said:

Just wanted to say thank you for writing this. It has been a massive help today. (I had to change the double backslashes to forward slashes to make it run, but that’s all).

Pingback: Dividere e rinominare un PDF in base al contenuto del file - Eugenio Nappi
europeancataclism on December 18, 2020 at 8:45 am said:

Can this work like a function by saving the script with a name.py?

Michele Young on June 24, 2021 at 7:15 pm said:

I need to do this but have no idea how to employ this – could anyone help me take this conceptually into reality?
