PDF to JPG Conversion with Python (for Windows)

Posted on January 10, 2017 by Glen Bambrick

Interested in learning ArcPy? check out this course.

I recently had a torrid time trying to research and implement a Python script that could batch convert from PDF to JPG. There are numerous entries online that aim to help (and did so in parts) but I struggled to find one with a concise workflow from start to finish that satisfied my criteria and involved setting up what’s required to implement such. The below could be slated for not being the most ‘Pythonic’ way to get it done but it certainly made my life easier. I was struggling with Wand and ImageMagick as per most posts until I luckily stumbled across an entry on StackOverflow where floqqi, my new hero, answered my prayers. I felt that if I struggled with this that there must be others out there with the same predicament and I hope that the title of this post will help it come to the forefront of searches and aid fellow Python snippet researchers in finding some salvation.
Note: I am using Python 2.7 32-bit on Windows 7 Professional

1. Install ImageMagick
floqqi recommends downloading the latest version, which at the time of writing this is 7.0.4-3. I had already installed an earlier version while trying to get the Wand module to work. My version is 6.9.7-3. If you hover over the links you should be able to see the full link name http://www.imagemagick.org/download/binaries/ImageMagick-6.9.7-3-Q8-x86-dll.exe, or just click that link to download the same version I did.
Run the installer, accept the license agreement, and click Next on the Information window. In the Select Additional Tasks make sure that Install development headers and libraries for C and C++ is selected.

Click Next and then Install.

2. Install GhostScript
Install the 32-bit Ghostscript AGPL Release

3. Set Environment Variables
Create a new System Variable (Advanced System Settings > Environment Variables) called MAGICK_HOME and insert the Image Magick installation path as the value. This will be similar to C:\Program Files (x86)\ImageMagick-6.9.7-Q8

Click OK and and make sure that the same value (C:\Program Files (x86)\ImageMagick-6.9.7-Q8) is at the start of the Path variable. After this entry in the Path variable insert the entry for GhostScript which will be similar to C:\Program Files (x86)\gs\gs9.20\bin
Note: make sure that the entries are separated by a semi-colon (;)

4. Check if steps 1-3 have been correctly configured
Open the Command Prompt and enter…

convert file1.pdf file2.jpg

where file.pdf and file2.jpg are fully qualified paths for an input PDF and and output JPG (or the current directory contains the file).

If no errors are presented and the JPG has been created you can move on to the next step. Otherwise step into some troubleshooting.

5. Install PythonMagick
I downloaded the Python 2.7 32-bit whl file PythonMagick‑0.9.10‑cp27‑none‑win32.whl and then used pip to install from the command prompt.

pip install C:\Users\glen.bambrick\Downloads\pip install PythonMagick‑0.9.10‑cp27‑none‑win32.whl

Open up a Python IDE and test to see if you can import PythonMagick

We now have everything set up and can begin to write a script that will convert multiple (single page) PDFs to JPGs.

Import the necessary modules.

import os, PythonMagick
from PythonMagick import Image
from datetime import datetime

Ok so datetime isn’t necessary but I like to time my scripts and see if it can be improved upon. Set the start time for the script

start_time = datetime.now()

A couple of global variables, one for the directory that holds the PDFs, and another to hold a hexidecimal value for the background colour ‘white’. After trial and error I noticed that some JPGs were being exported with a black background instead of white and this will be used to force a white background. I found a useful link on StackOverflow to help overcome this.

pdf_dir = r"C:\MyPDFs"
bg_colour = "#ffffff"

We loop through each PDF in the folder

for pdf in [pdf_file for pdf_file in os.listdir(pdf_dir) if pdf_file.endswith(".pdf")]:

Set and read in each PDF. density is the resolution.

    input_pdf = pdf_dir + "\\" + pdf
    img = Image()
    img.density('300')
    img.read(input_pdf)

Get the dimensions of the image.

    size = "%sx%s" % (img.columns(), img.rows())

Build the JPG for output. This part must be the Magick in PythonMagic because for a small portion of it I am mystified. See that last link to StackOverflow for the origin of the code here. The PythonMagick documentation is tough to digest and in various threads read the laments about how poor it is.

    output_img = Image(size, bg_colour)
    output_img.type = img.type
    output_img.composite(img, 0, 0, PythonMagick.CompositeOperator.SrcOverCompositeOp)
    output_img.resize(str(img.rows()))
    output_img.magick('JPG')
    output_img.quality(75)

And lastly we write out our JPG

    output_jpg = input_pdf.replace(".pdf", ".jpg")
    output_img.write(output_jpg)

And see how long it took the script to run.

print datetime.now() - start_time

This places the output JPGs in the same folder as the PDFs. Based on the resolution (density) and quality settings the process can be a bit lengthy. Using the settings above it took 9 minutes to do 20 PDF to JPG Conversions. You will need to figure out the optimum resolution and quality for your purpose. Low res took 46 seconds for all 20.

As always I feel a sense of achievement when I get a Python script to work and hope that this post will spur on some comments to make the above process more efficient. Feel free to post links to any resources, maybe comment to help myself and other readers, or if this helped you in anyway let me know and I’ll pass the thanks on to floqqi and the rest of the crew. This script is the limit of my knowledge with PythonMagick and this is thanks to those that have endeavoured before me and referenced in the links throughout this post. Thanks guys.

Complete script…

import os, PythonMagick
from PythonMagick import Image
from datetime import datetime

start_time = datetime.now()

pdf_dir = r"C:\MyPDFs"
bg_colour = "#ffffff"

for pdf in [pdf_file for pdf_file in os.listdir(pdf_dir) if pdf_file.endswith(".pdf")]:

    input_pdf = pdf_dir + "\\" + pdf
    img = Image()
    img.density('300')
    img.read(input_pdf)

    size = "%sx%s" % (img.columns(), img.rows())

    output_img = Image(size, bg_colour)
    output_img.type = img.type
    output_img.composite(img, 0, 0, PythonMagick.CompositeOperator.SrcOverCompositeOp)
    output_img.resize(str(img.rows()))
    output_img.magick('JPG')
    output_img.quality(75)


    output_jpg = input_pdf.replace(".pdf", ".jpg")
    output_img.write(output_jpg)

print datetime.now() - start_time

25 thoughts on “PDF to JPG Conversion with Python (for Windows)”

Carleton on February 22, 2017 at 9:53 pm said:

Thanks for posting, this is helpful. I’m dealing with a name collision issue with ‘convert’ at step 4. How were you able to avoid this? I tried listing the absolute path to the convert.h file before the convert call, but the header file type wasn’t recognized. -Carleton

LikeLike

Reply ↓
- clubdebambos on February 23, 2017 at 3:21 pm said:
  
  I didn’t encounter any problems here. Have you tried uninstalling Steps 1 and 2 and starting over? Just to make sure you haven’t missed out on anything
  
  LikeLike
  
  Reply ↓
  - Carleton on February 23, 2017 at 6:47 pm said:
    
    Perhaps an update occurred since your original post. Using the command ‘magick’ instead of ‘convert’ did the trick. Thanks for the reply. This is the problem I’m referring to:
    
    http://stackoverflow.com/questions/3060205/error-invalid-parameter-fom-imagemagick-convert-on-windows
    
    LikeLike
    
    Reply ↓
    - clubdebambos on February 23, 2017 at 6:53 pm said:
      
      Excellent! Thanks for the update and link
      
      LikeLike
      
      Reply ↓
suvinks on May 18, 2017 at 12:25 am said:

Thanks for the post. I tried the python code but it is giving me the following error
Traceback (most recent call last):
File “C:/Users/XXXXX/PycharmProjects/untitled/p2i.py”, line 16, in
img.read(input_pdf)
RuntimeError: Magick: UnableToOpenConfigureFile `delegates.xml’ @ warning/configure.c/GetConfigureOptions/712

Am able to use the convert command line utility to do the convert without issues.

LikeLike

Reply ↓
- clubdebambos on May 18, 2017 at 6:47 am said:
  
  The configuration files are usually at C:\Program Files (x86)\ImageMagick-6.9.7-Q8 (for my instance), check that delegates.xml is in here.
  
  If you search the error here https://www.imagemagick.org/script/search.php
  you might get some answers.
  
  I originally went through the process in this post a couple of times with ImageMagick downloads, completely uninstalling the failed attempts until I got one to work. I’ve used it a lot of times since so it was worth it.
  
  LikeLike
  
  Reply ↓
  - suvinks on May 18, 2017 at 3:24 pm said:
    
    Thanks for the quick response
    My delegates xml file is in C:\Program Files\ImageMagick-7.0.5-Q16.
    
    The convert program works well from command line. I tried -debug configuration option and it showed me that it is checking in the above directory for delegates.xml. Not sure why Python code is having an issue. Any way to debug the python process?
    
    LikeLike
    
    Reply ↓
    - suvinks on May 18, 2017 at 3:51 pm said:
      
      Just noticed that the following code works perfectly
      
      img = Image(input_pdf)
      
      Only when I do this am getting error. I actually need to set the density as the default density is too poor to do anything with.
      img = Image()
      img.density(‘600’)
      img.read(input_pdf)
      
      Is there a way I can change the default density being used? This would help me to use Image(input_pdf) and still get my expected result.
      
      LikeLike
      
      Reply ↓
      - clubdebambos on May 22, 2017 at 11:30 am said:
        
        Try this…
        
        img = Image(input_pdf)
        img.density(‘300’)
        
        remove…
        img.read(input_pdf)
        
        LikeLike
        
        Reply ↓
Robinarthur on September 10, 2017 at 7:48 am said:

Thank you for your script. I will try this the next hours. I want to write a pdf -> txt script. All i found were not working, so i have to write it by my own

LikeLiked by 1 person

Reply ↓
robwoodruff on January 31, 2018 at 12:13 am said:

Yes! thanks dude. I’m using python 3.6 via Anaconda/Spyder 64 bit on Windows 7 and after banging my head against the wall for 48 hours with various combos of Wand, ImageMagick and GS this finally worked! My working version uses ImageMagick-6.9.8-9 for 64 bit, which I had to search for a while to find since they only offer version 7+ on the official site. I did this originally because forums told me that Wand wasn’t ready for version 7 yet, so newer versions may also work. And then for PythonMagick I used the most recent version for python 3.6/64 bit. Your post was a godsend. Thanks for taking the time to share!

LikeLiked by 1 person

Reply ↓
- arzoo on May 5, 2018 at 1:10 am said:
  
  can u plzz tell me what was your GS version…?
  
  LikeLike
  
  Reply ↓
- Mauricio on May 4, 2019 at 9:48 pm said:
  
  I am from Ecuador help me with the code please do not work with the files that I download
  
  LikeLike
  
  Reply ↓
- Mauricio on May 4, 2019 at 9:49 pm said:
  
  this is my mail : maur1025@hotmail.com
  
  LikeLike
  
  Reply ↓
Aneesh Krishnan on March 23, 2018 at 5:18 am said:

Thank you very much posting the script. Really helpful. You save my day

LikeLiked by 1 person

Reply ↓
Brooke Reams on April 25, 2018 at 10:10 pm said:

Thank you so much for posting this. It worked perfectly.

LikeLiked by 1 person

Reply ↓
PA on May 15, 2018 at 5:01 am said:

Thank you very much for this post. Works beautifully for first page, but the remaining pages from pdf are missing. Can you please help?

LikeLiked by 1 person

Reply ↓
- clubdebambos on May 17, 2018 at 9:37 am said:
  
  You would need to iterate through each page and export each one to JPG. Look at using PyPDF2 to access each page. Some of the code at https://glenbambrick.com/2016/09/23/extract-rename-pdf-pages/ might help and you will find plenty of help with a Google search. Cheers
  
  LikeLike
  
  Reply ↓
Djalmo Mourato on August 17, 2018 at 7:07 pm said:

Hi,
Thank you for this!
I followed all the steps and everything seems to be fine except when I run it I get this error:
Traceback (most recent call last):
File “.\pdf_to_img_conv2.py”, line 17, in
img.density(‘300’)
Boost.Python.ArgumentError: Python argument types in
Image.density(Image, str)
did not match C++ signature:
density(class Magick::Image {lvalue})
density(class Magick::Image {lvalue}, class Magick::Point)

Any thoughts what could be going wrong?

Thank you!
-DJ

LikeLike

Reply ↓
- arvind upadhyay on October 5, 2018 at 9:52 am said:
  
  DJ, you got this issue resolved? I am getting similar for img.density(‘300’).
  
  LikeLike
  
  Reply ↓
  - franco dark on December 10, 2019 at 9:42 pm said:
    
    did you guys find i fix i got the same error
    
    LikeLike
    
    Reply ↓
Sanjay Kumar on October 28, 2018 at 2:55 am said:

How to do in Linux

LikeLike

Reply ↓
Mauricio on May 4, 2019 at 9:45 pm said:

what is the difference of using wnad and magick

LikeLike

Reply ↓
Diban on October 12, 2019 at 2:23 am said:

Don’t know what I did wrong. when I run the cmd the pdf file dont get converted and I get an error.

LikeLike

Reply ↓
Paul on June 20, 2020 at 4:05 pm said:

This script does NOT work. Aside from PythonMagick being a giant pain, the code here has problems, and I know I am just scratching the surface. Just start with the image density setting. It takes 2 arguments, so unless this code was executed against an older version of PythonMagic AND the image density signature was change to require 2 parameters, then there is no way this code could have been executed successfully as-is, much less be successful at converting PDF’s to images in PPT.

img.density(‘600’)
Boost.Python.ArgumentError: Python argument types in
Image.density(Image, str)
did not match C++ signature:
density(class Magick::Image {lvalue})
density(class Magick::Image {lvalue}, class Magick::Point)

It is frustrating, as I have an urgent need to convert some PDF data over to PPT, and since it was urgent, I was looking for a quick solution. However, it appears that is not possible, and will have to create a working solution from scratch, using any myriad combination of some of the most problematic modules for Python, such as PythonMagick, PyPP2, Wand, and others.

LikeLike

Reply ↓