I was recently tasked with traversing a directory and its sub-directories to find PDFs and split any multi-page files into single-page files. The end goal was to name each extracted page, now an individual PDF, after the document number printed on that page. There were possibly over 100 PDF files in the directory, and each PDF could have anywhere from one to more than ten pages. At the extreme I could have been looking at around one thousand pages to extract and rename, a task that would have been very time-consuming and mind-numbing to do manually.
The PDFs contained map books produced using Data Driven Pages in ArcGIS, so it was conceivable that I could re-open the original MXDs and re-export each map book as individual pages, naming them appropriately based on the document name in the attribute table. However, since I was not the creator of any of these PDFs and they all came from different teams, hunting down the correct MXDs and exporting would be cumbersome and also very time-consuming. There had to be a more interesting and time-efficient way…
…A quick search on Google for some Python modules and I had what I needed to complete my task in a more automated and time-efficient manner. I needed three modules:
(1) os – for traversing through the directories and files and for renaming the files
(2) PyPDF2 – to read/write PDF files and also to extract text from pages
(3) re – the regular expression module to find the text needed to rename the file.
The next step was to write down some pseudocode to map out what needed to be achieved and then to get coding…
Let’s begin by importing the modules at the top of the script.
import os, PyPDF2, re
Define a function to extract the pages. This function will take two parameters: the path to the root directory and the path to a folder to extract the pages to. The ‘extract_to_folder’ needs to be on the same level as, or above, the root directory; otherwise the extracted pages could themselves be picked up when the root directory is walked. Use your operating system to create a folder named ‘extracted’ and a second folder called ‘renamed’.
def split_pdf_pages(root_directory, extract_to_folder):
Next we use the os module to search from the root directory down to find any PDF files and store the full filepath as a variable, one at a time.
    for root, dirs, files in os.walk(root_directory):
        for filename in files:
            basename, extension = os.path.splitext(filename)
            if extension == ".pdf":
                fullpath = root + "\\" + basename + extension
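As an aside, the same traversal can be written so that it does not depend on Windows-style backslashes and also picks up files ending in ".PDF". The sketch below is one possible way to do that; `find_pdfs` is a hypothetical helper name and is not part of the script in this post.

```python
import os


def find_pdfs(root_directory):
    """Yield (fullpath, basename) for every PDF found under root_directory."""
    for root, dirs, files in os.walk(root_directory):
        for filename in files:
            basename, extension = os.path.splitext(filename)
            # compare in lower case so ".PDF" is matched as well as ".pdf"
            if extension.lower() == ".pdf":
                # os.path.join uses the correct separator for the current OS
                yield os.path.join(root, filename), basename
```

Each yielded `fullpath` can then be opened exactly as in the next step.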
We then open that PDF in read mode.
                opened_pdf = PyPDF2.PdfFileReader(open(fullpath,"rb"))
For each page in the PDF, the page is extracted and saved as a new PDF file in the ‘extracted’ folder. The snippet below was sourced from Stack Overflow.
                for i in range(opened_pdf.numPages):
                    output = PyPDF2.PdfFileWriter()
                    output.addPage(opened_pdf.getPage(i))
                    with open(extract_to_folder + "\\" + basename + "-%s.pdf" % i, "wb") as output_pdf:
                        output.write(output_pdf)
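If the snippet above raises a deprecation error, it is likely because newer PyPDF2 releases (and the successor package, pypdf) renamed these classes and methods. The sketch below assumes such a release, where `PdfReader`, `PdfWriter`, `reader.pages` and `writer.add_page` are available; the helper names `page_output_path` and `split_one_pdf` are hypothetical and not part of the original script.

```python
import os


def page_output_path(extract_to_folder, basename, page_index):
    """Build the output path for one page, e.g. <folder>/<basename>-0.pdf."""
    return os.path.join(extract_to_folder, "%s-%s.pdf" % (basename, page_index))


def split_one_pdf(fullpath, extract_to_folder):
    """Write every page of the PDF at fullpath to its own single-page PDF."""
    # imported here so the path helper above works without PyPDF2 installed
    from PyPDF2 import PdfReader, PdfWriter

    basename = os.path.splitext(os.path.basename(fullpath))[0]
    # keep the source open while writing, as pages are read on demand
    with open(fullpath, "rb") as source:
        reader = PdfReader(source)
        for i, page in enumerate(reader.pages):
            writer = PdfWriter()
            writer.add_page(page)
            with open(page_output_path(extract_to_folder, basename, i), "wb") as out:
                writer.write(out)
```

As in the original, page numbering in the file names starts at 0.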
That completes our function to strip out individual pages from PDF files in a root directory and down through all corresponding sub-directories. This function might be all you need as you can rename the extracted pages as you save each file. The next task for me, however, was to rename the PDFs based on text contained in each individual file.
Define a function called ‘rename_pdfs’ that takes two arguments: the path to the folder where the extracted pages reside and the path to the renamed folder. Loop through each PDF and create a filepath to each one.
def rename_pdfs(extraced_pdf_folder, rename_folder):
    for root, dirs, files in os.walk(extraced_pdf_folder):
        for filename in files:
            basename, extension = os.path.splitext(filename)
            if extension == ".pdf":
                fullpath = root + "\\" + basename + extension
Open each PDF in read mode…
                pdf_file_obj = open(fullpath, "rb")
                pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)
…and create a page object.
                page_obj = pdf_reader.getPage(0)
Now we extract the text from the page.
                pdf_text = page_obj.extractText()
My task was made quite easy because each page had a unique document number, and every document number began with exactly the same prefix. This meant that I could use a regular expression, via the re module, to find the prefix and then obtain the rest of the document number.
The code below finds the document number prefix in the text extracted from the page and appends the next 14 characters to the prefix to give the full document number.
                for index in re.finditer("THE-DOC-PREFIX-", pdf_text):
                    doc_ext = pdf_text[index.end():index.end() + 14]
                    doc_num = "THE-DOC-PREFIX-" + doc_ext
                    pdf_file_obj.close()
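Since only one document number is expected per page, the same lookup can also be sketched as a small function built on `re.search`, which returns `None` when the prefix is not found. `extract_doc_number` is a hypothetical helper name, "THE-DOC-PREFIX-" is the placeholder prefix used above, and the sample text in the usage lines is made up for illustration.

```python
import re


def extract_doc_number(pdf_text, prefix="THE-DOC-PREFIX-", length=14):
    """Return prefix plus the next `length` characters, or None if not found."""
    # re.escape keeps characters such as "." or "+" in a prefix literal
    match = re.search(re.escape(prefix), pdf_text)
    if match is None:
        return None
    # same slice as the loop above: the characters right after the prefix
    return prefix + pdf_text[match.end():match.end() + length]


sample = "Sheet THE-DOC-PREFIX-0001-0002-0003 Rev A"  # made-up page text
print(extract_doc_number(sample))
print(extract_doc_number("a scanned page with no readable text"))
```

Unlike the loop, this uses the first occurrence of the prefix on the page.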
The last thing to do is to use the document number to rename the PDF, which also moves it into the ‘renamed’ folder.
                    os.rename(fullpath, rename_folder + "\\" + doc_num + ".pdf")
That completes the two functions required to complete the task.
Set up the variables required for the function parameters…
root_dir = r"C:\Users\******\Documents\original"
extract_to = r"C:\Users\******\Documents\extracted"
rename_to = r"C:\Users\******\Documents\renamed"
…and then call each function.
split_pdf_pages(root_dir, extract_to)
rename_pdfs(extract_to,rename_to)
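Rather than creating the ‘extracted’ and ‘renamed’ folders by hand, they could be created from the script before the two functions are called. The sketch below is one way to do this, and it also checks the earlier requirement that the output folders do not sit inside the root directory. `prepare_output_folders` is a hypothetical helper name, not part of the original script.

```python
import os


def prepare_output_folders(root_dir, *folders):
    """Create each output folder, refusing any that sits inside root_dir."""
    root = os.path.normcase(os.path.abspath(root_dir))
    # joining with "" adds a trailing separator, so "original_out"
    # is not mistaken for a sub-folder of "original"
    root_prefix = os.path.join(root, "")
    for folder in folders:
        target = os.path.normcase(os.path.abspath(folder))
        if target == root or target.startswith(root_prefix):
            raise ValueError("%s is inside %s" % (folder, root_dir))
        # exist_ok avoids an error when the folder is already there
        os.makedirs(folder, exist_ok=True)
```

It would be called as `prepare_output_folders(root_dir, extract_to, rename_to)` just before `split_pdf_pages`.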
Run the script. The original files will remain and the renamed extracted pages will be in the renamed folder. Any PDF page that failed to be renamed will still be in the extracted folder, and you can rename these manually. A page fails to be renamed because of the make-up of the PDF, i.e. the way it was exported from a piece of software or how it was created. In a test run, 10 out of 206 pages failed to be renamed. When I opened those pages, the select tool was unable to highlight any text because everything was embedded as an image, which is why the script couldn’t read any text to rename the document.
I hope someone out there will find this useful. I am always happy when my code works, but I would appreciate any constructive comments, hints or tips to make the code more efficient.
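The "leave it in the extracted folder" behaviour can also be made explicit. The sketch below is a hypothetical helper (`rename_if_possible` is not part of the original script) that skips a page when no document number was found, and also skips it when a file with that number already exists in the renamed folder, so nothing is overwritten.

```python
import os


def rename_if_possible(fullpath, rename_folder, doc_num):
    """Move fullpath to rename_folder/<doc_num>.pdf; return the new path or None."""
    if not doc_num:
        # no text could be read (e.g. an image-only page): leave it in place
        return None
    target = os.path.join(rename_folder, doc_num + ".pdf")
    if os.path.exists(target):
        # another page already has this number: leave this one for manual checking
        return None
    os.rename(fullpath, target)
    return target
```

Collecting the paths for which this returns `None` would give a list of the pages that still need to be renamed manually.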
Here’s the full script…
# import the necessary modules
import os, PyPDF2, re

# function to extract the individual pages from each pdf found
def split_pdf_pages(root_directory, extract_to_folder):
    # traverse down through the root directory to sub-directories
    for root, dirs, files in os.walk(root_directory):
        for filename in files:
            basename, extension = os.path.splitext(filename)
            # if a file is a pdf
            if extension == ".pdf":
                # create a reference to the full filename path
                fullpath = root + "\\" + basename + extension
                # open the pdf in read mode
                opened_pdf = PyPDF2.PdfFileReader(open(fullpath,"rb"))
                # for each page in the pdf
                for i in range(opened_pdf.numPages):
                    # write the page to a new pdf
                    output = PyPDF2.PdfFileWriter()
                    output.addPage(opened_pdf.getPage(i))
                    with open(extract_to_folder + "\\" + basename + "-%s.pdf" % i, "wb") as output_pdf:
                        output.write(output_pdf)

# function for renaming the single page pdfs based on text in the pdf
def rename_pdfs(extraced_pdf_folder, rename_folder):
    # traverse down through the root directory to sub-directories
    for root, dirs, files in os.walk(extraced_pdf_folder):
        for filename in files:
            basename, extension = os.path.splitext(filename)
            # if a file is a pdf
            if extension == ".pdf":
                # create a reference to the full filename path
                fullpath = root + "\\" + basename + extension
                # open the individual pdf
                pdf_file_obj = open(fullpath, "rb")
                pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)
                # access the individual page
                page_obj = pdf_reader.getPage(0)
                # extract the text
                pdf_text = page_obj.extractText()
                # use regex to find information
                for index in re.finditer("THE-DOC-PREFIX-", pdf_text):
                    doc_ext = pdf_text[index.end():index.end() + 14]
                    doc_num = "THE-DOC-PREFIX-" + doc_ext
                    pdf_file_obj.close()
                    # rename the pdf based on the information in the pdf
                    os.rename(fullpath, rename_folder + "\\" + doc_num + ".pdf")

# parameter variables
root_dir = r"C:\Users\******\Documents\rename_pdf"
extract_to = r"C:\Users\******\Documents\extracted"
rename_to = r"C:\Users\******\Documents\renamed"

# use the two functions
split_pdf_pages(root_dir, extract_to)
rename_pdfs(extract_to,rename_to)
Resources:
PyPDF2
Automate the Boring Stuff with Python
This is absolutely amazing and enormously helpful. I needed to split a 500 page pdf with employee number and name, and your code made it a cinch.
Thanks so much – your help was greatly appreciated!
Excellent, delighted to hear it was of use.
Could you help me do this with a real file? I have no idea what I am doing, other than I need to do this.
Thank you so much for supplying the code above. I am trying to use it now, and running into a snag. The text I would like to use to rename the PDF is the first 6 characters of the extracted text. Any ideas on how I can modify the regex section to accomplish this?
The extracted text from the whole document? You don’t need regex, just use os.rename(fullpath, rename_folder + "\\" + pdf_text[0:6] + ".pdf"). If you need the first 6 characters after the text you are searching for, change the 14 to 6 and set doc_num = doc_ext.
Thanks for the reply. I was able to get it working. One other question: is there a way that I can select the text that comes before the value in re.finditer? The value I need to pull is sometimes the first string, but can also be buried a bit; it is always before the string “Check”, though.
Change [index.end():index.end() + 14] to [index.start() - 6:index.start()]; this will get the 6 characters before the text (Check) you are searching for. Change the slice to suit your needs. Hope that helps.
Thank you for all of your help. I am very new to Python and this is one of my first programs. Would you be able to take a look at this snippet of code I modified from your post? It runs fine in debug, but errors out when I try to run it normally. The error I get is a PermissionError on the shutil.move line. It says the file is being used by another process.
# function for renaming the single page pdfs based on text in the pdf
def rename_pdfs(extraced_pdf_folder, rename_folder):
    # traverse down through the root directory to sub-directories
    for root, dirs, files in os.walk(extraced_pdf_folder):
        for filename in files:
            basename, extension = os.path.splitext(filename)
            # if a file is a pdf
            if extension == ".pdf":
                # create a reference to the full filename path
                fullpath = root + "\\" + basename + extension
                # open the individual pdf
                pdf_file_obj = open(fullpath, "rb")
                pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)
                # access the individual page
                page_obj = pdf_reader.getPage(0)
                # extract the text
                pdf_text = page_obj.extractText()
                # Find the check number
                pdf_text = pdf_text.split('Check', 1)[0]
                doc_num = pdf_text[-7:-1]
                pdf_file_obj.close()
                # rename the pdf based on the information in the pdf
                destinationlocation = rename_folder + "\\" + doc_num + ".pdf"
                shutil.move(fullpath, destinationlocation)
By the look of it the file you are trying to move is possibly already being accessed. It’s rare I’ve used shutil but maybe look up some resources online such as http://www.pythonforbeginners.com/os/python-the-shutil-module or post a question on StackExchange. I have used shutil.copy2 with success in the past as this overcame permission issues.
Hi, I am trying to run this script in PyCharm. Below is the error that I am getting.
Traceback (most recent call last):
  File "/Users/xxxxx/PycharmProjects/Proj01/Excel.py", line 43, in <module>
    print(split_pdf_pages(root_dir,extract_to))
  File "/Users/xxxxx/PycharmProjects/Proj01/Excel.py", line 10, in split_pdf_pages
    opened_pdf = PyPDF2.PdfFileReader(open(fullpath, "rb"))
UnboundLocalError: local variable 'fullpath' referenced before assignment
I just started learning python. Please could you advise?
New to Python, but found a way to make this work, which was super fast and cool. Just love learning new code. How would I change this to grab 2 pages at a time?
Hi,
I want to split a PDF based on a keyword on the page. A split document may contain multiple pages, depending on the keyword on each page. Could you help me with that?
Hi, I have a list of PDFs in a folder, and each PDF has page numbers inside it. I want to rename each PDF with the page numbers inside it, so there is no need to split each PDF into single-page PDFs; I just want to rename. Could you please help me with code specific to this requirement?
Just wanted to say thank you for writing this. It has been a massive help today. (I had to change the double backslashes to forward slashes to make it run, but that’s all).
Can this work like a function by saving the script as name.py?
I need to do this but have no idea how to employ this – could anyone help me take this conceptually into reality?