# OSGP: Measuring Geographic Distributions – Mean Center

(Open Source Geospatial Python)

The ‘What is it?’

The Mean Center is the average X coordinate and Y coordinate for all features in a study area and is the simplest descriptor of a geographic distribution. The Mean Center is generally used to track the changes in a features distribution over time and can also be used to compare the distribution of multiple features.

The Mean Center is also known as the Geographic Center or Center of Concentration for a set of features.

You would calculate the Mean Center for features where there is no travel interaction between the Center and the features of the study. Basically, use it for a study where each event that happens is a recorded location, for example a burglary for crime analysis, or the sighting of wombat for wildlife studies.

The Formula!

For Point features the X and Y coordinates of each feature is used, for Polygons the centroid of each feature represents the X and Y coordinate to use, and for Linear features the mid-point of each line is used for the X and Y coordinate.

The Code…

```from osgeo import ogr
from shapely.geometry import MultiLineString
from shapely import wkt
import numpy as np
import sys

## set the driver for the data
driver = ogr.GetDriverByName("FileGDB")
## path to the FileGDB
gdb = r"C:\Users\Glen B\Documents\my_geodata.gdb"
## ope the GDB in write mode (1)
ds = driver.Open(gdb, 1)

## input layer
input_lyr_name = "Birmingham_Burglaries_2016"

## the output layer
output_fc = "{0}_mean_center".format(input_lyr_name)

## reference the layer using the layers name
if input_lyr_name in [ds.GetLayerByIndex(lyr_name).GetName() for lyr_name in range(ds.GetLayerCount())]:
lyr = ds.GetLayerByName(input_lyr_name)
print "{0} found in {1}".format(input_lyr_name, gdb)

## delete the output feature class if it already exists
if output_fc in [ds.GetLayerByIndex(lyr_name).GetName() for lyr_name in range(ds.GetLayerCount())]:
ds.DeleteLayer(output_fc)
print "Deleting: {0}".format(output_fc)

try:
## assess the geometry of the input feature class
first_feat = lyr.GetFeature(1)
## for each point or polygon in the layer
## get the x and y value of the centroid
## store in a numpy array
if first_feat.geometry().GetGeometryName() in ["POINT", "MULTIPOINT", "POLYGON", "MULTIPOLYGON"]:
xy_arr = np.ndarray((len(lyr), 2), dtype=np.float)
for i, pt in enumerate(lyr):
ft_geom = pt.geometry()
xy_arr[i] = (ft_geom.Centroid().GetX(), ft_geom.Centroid().GetY())

## for lineear we get the midpoint of a line
elif first_feat.geometry().GetGeometryName() in ["LINESTRING", "MULTILINESTRING"]:
xy_arr = np.ndarray((len(lyr), 2), dtype=np.float)
for i, ln in enumerate(lyr):
line_geom = ln.geometry().ExportToWkt()
midpoint = shapely_line.interpolate(shapely_line.length/2)
xy_arr[i] = (midpoint.x, midpoint.y)

## exit gracefully if unknown geometry or input contains no geometry
except Exception:
print "Unknown geometry for {0}".format(input_lyr_name)
sys.exit()

avg_x, avg_y = np.mean(xy_arr, axis=0)

print "Mean Center: {0}, {1}".format(avg_x, avg_y)

## create a new point layer with the same spatial ref as input layer
out_lyr = ds.CreateLayer(output_fc, lyr.GetSpatialRef(), ogr.wkbPoint)

## define and create new fields to hold the mean center coordinates
x_fld = ogr.FieldDefn("X", ogr.OFTReal)
y_fld = ogr.FieldDefn("Y", ogr.OFTReal)
out_lyr.CreateField(x_fld)
out_lyr.CreateField(y_fld)

## create a new point geom for the mean center
pnt = ogr.Geometry(ogr.wkbPoint)

## add the mean center point to the new layer with attributes
feat_dfn = out_lyr.GetLayerDefn()
feat = ogr.Feature(feat_dfn)
feat.SetGeometry(pnt)
feat.SetField("X", avg_x)
feat.SetField("Y", avg_y)
out_lyr.CreateFeature(feat)

print "Created: {0}".format(output_fc)

## free up resources
del ds, lyr, first_feat, feat, out_lyr```

I’d like to give credit to Logan Byers from GIS StackExchange who aided in speeding up the computational time using NumPy and for forcing me to begin learning the wonders of NumPy.

The Example:

I downloaded crime data from DATA.POLICE.UK for the West Midlands Police from January 2016 to December 2016. I used some Python to extract just the Burglary data and made this into a feature class in the File GDB. Next, I downloaded OS Boundary Line data and clipped the Burglary data to just Birmingham. Everything was now in place to find the Mean Center of all burglaries for Birmingham in 2016. (see The Other Scripts section at the bottom of this post for processing the data)

Running the script from The Code section above calculates the Mean Center of all burglaries for 2016 and created a point feature class in the File GDB.

OSGP Mean Center:     407926.695396, 286615.428507
ArcGIS Mean Center:    407926.695396, 286615.428507

What’s Next?…

Central Feature

The Resources:

ESRI Guide to GIS Volume 2: Chapter 2 (I highly recommend this book)
see book review here.

Geoprocessing with Python

Python GDAL/OGR Cookbook

Setting up GDAL/OGR with FileGDB Driver for Python on Windows

< The Other Scripts >

1. Extract Burglary Data for West Midlands

```import csv, os
from osgeo import ogr, osr

## set the driver for the data
driver = ogr.GetDriverByName("FileGDB")

## path to the FileGDB
gdb = r"C:\Users\Glen B\Documents\my_geodata.gdb"

## ope the GDB in write mode (1)
ds = driver.Open(gdb, 1)

## the coordinates in the csv files are lat/long
source = osr.SpatialReference()
source.ImportFromEPSG(4326)

## we need the data in British National Grid
target = osr.SpatialReference()
target.ImportFromEPSG(27700)

transform = osr.CoordinateTransformation(source, target)

## set the output fc name
output_fc = "WM_Burglaries_2016"

## if the output fc already exists delete it
if output_fc in [ds.GetLayerByIndex(lyr_name).GetName() for lyr_name in range(ds.GetLayerCount())]:
ds.DeleteLayer(output_fc)
print "Deleting: {0}".format(output_fc)

out_lyr = ds.CreateLayer(output_fc, target, ogr.wkbPoint)

## define and create new fields
mnth_fld = ogr.FieldDefn("Month", ogr.OFTString)
rep_by_fld = ogr.FieldDefn("Reported_by", ogr.OFTString)
fls_wthn_fld = ogr.FieldDefn("Falls_within", ogr.OFTString)
loc_fld = ogr.FieldDefn("Location", ogr.OFTString)
lsoa_c_fld = ogr.FieldDefn("LSOA_code", ogr.OFTString)
lsoa_n_fld = ogr.FieldDefn("LSOA_name", ogr.OFTString)
crime_fld = ogr.FieldDefn("Crime_type", ogr.OFTString)
outcome_fld = ogr.FieldDefn("Last_outcome", ogr.OFTString)

out_lyr.CreateField(mnth_fld)
out_lyr.CreateField(rep_by_fld)
out_lyr.CreateField(fls_wthn_fld)
out_lyr.CreateField(loc_fld)
out_lyr.CreateField(lsoa_c_fld)
out_lyr.CreateField(lsoa_n_fld)
out_lyr.CreateField(crime_fld)
out_lyr.CreateField(outcome_fld)

root_folder = r"C:\Users\Glen B\Documents\Crime"

## for each csv
for root,dirs,files in os.walk(root_folder):
for filename in files:
if filename.endswith(".csv"):
csv_path = "{0}\\{1}".format(root, filename)
with open(csv_path, "rb") as csvfile:
## create a point with attributes for each burglary
if row[9] == "Burglary":
pnt = ogr.Geometry(ogr.wkbPoint)
pnt.Transform(transform)
feat_dfn = out_lyr.GetLayerDefn()
feat = ogr.Feature(feat_dfn)
feat.SetGeometry(pnt)
feat.SetField("Month", row[1])
feat.SetField("Reported_by", row[2])
feat.SetField("Falls_within", row[3])
feat.SetField("Location", row[6])
feat.SetField("LSOA_code", row[7])
feat.SetField("LSOA_name", row[8])
feat.SetField("Crime_type", row[9])
feat.SetField("Last_outcome", row[10])
out_lyr.CreateFeature(feat)

del ds, feat, out_lyr```

2. Birmingham Burglaries Only

```from osgeo import ogr

## required drivers
shp_driver = ogr.GetDriverByName("ESRI Shapefile")
gdb_driver = ogr.GetDriverByName("FileGDB")

## input boundary shapefile and file gdb
shapefile = r"C:\Users\Glen B\Documents\Crime\Data\GB\district_borough_unitary_region.shp"
gdb = r"C:\Users\Glen B\Documents\my_geodata.gdb"

## open the shapefile in read mode and gdb in write mode
shp_ds = shp_driver.Open(shapefile, 0)
gdb_ds = gdb_driver.Open(gdb, 1)

## reference the necessary layers
shp_layer = shp_ds.GetLayer(0)
gdb_layer = gdb_ds.GetLayerByName("WM_Burglaries_2016")

## filter the shapefile
shp_layer.SetAttributeFilter("NAME = 'Birmingham District (B)'")

## set the name for the output feature class
output_fc = "Birmingham_Burglaries_2016"

## if the output already exists then delete it
if output_fc in [gdb_ds.GetLayerByIndex(lyr_name).GetName() for lyr_name in range(gdb_ds.GetLayerCount())]:
gdb_ds.DeleteLayer(output_fc)
print "Deleting: {0}".format(output_fc)

## create an output layer
out_lyr = gdb_ds.CreateLayer(output_fc, shp_layer.GetSpatialRef(), ogr.wkbPoint)

## copy the schema from the West Midlands burglaries
## and use it for the Birmingham burglaries
lyr_def = gdb_layer.GetLayerDefn()
for i in range(lyr_def.GetFieldCount()):
out_lyr.CreateField (lyr_def.GetFieldDefn(i))

## only get burglaries that intersect the Birmingham region
for shp_feat in shp_layer:
print shp_feat.GetField("NAME")
birm_geom = shp_feat.GetGeometryRef()
for gdb_feat in gdb_layer:
burglary_geom = gdb_feat.GetGeometryRef()
if burglary_geom.Intersects(birm_geom):
feat_dfn = out_lyr.GetLayerDefn()
feat = ogr.Feature(feat_dfn)
feat.SetGeometry(burglary_geom)

## populate the attribute table
for i in range(lyr_def.GetFieldCount()):
feat.SetField(lyr_def.GetFieldDefn(i).GetNameRef(), gdb_feat.GetField(i))
## create the feature
out_lyr.CreateFeature(feat)
feat.Destroy()

del shp_ds, shp_layer, gdb_ds, gdb_layer```

The Usual 🙂

As always please feel free to comment to help make the code more efficient, highlight errors, or let me know if this was of any use to you.

# Setting up GDAL/OGR with FileGDB Driver for Python on Windows

I have decided to venture into the world of GDAL/OGR with Python with my main motivation to mimic some tools from ArcGIS for Desktop. I am hoping that this will help me to improve on a few fronts; my Python coding, increased knowledge regarding open source geospatial libraries, and to better understand the algorithms that churn away behind the scenes when you click a button in a GUI based GIS and perform some sort of geoprocessing or data analysis.

I mainly work with ESRI File Geodatabases and while I know this is not open source ESRI have an API in place to read and write to a gdb via GDAL/OGR. The first step is to setup what I need to start my journey for learning GDAL/OGR with Python for Windows. I will also install a few libraries that will help speed up some computations for more efficient geoprocessing.

I am using…
Python 2.7.13 32bit on Windows 7 Professional

Open the command prompt, change directory to where the whl was downloaded and use pip to install.

`pip install "GDAL‑2.1.3‑cp27‑cp27m‑win32.whl"`

3. Get the File Geodatabase API from ESRI (you will need an ESRI account)

Go to ESRI Dowloads and download File Geodatabase API 1.3 version for Windows (Visual Studio 2008). This will be a zip folder. Open the contents of the API zipped folder and extract FileGDBAPI.dll from the bin folder to

C:\Python27\Lib\site-packages\osgeo

or wherever your site-package folder resides. Just make sure to extract it to osgeo.

4. Create a New Variable in Environmental Variables

In Advanced System Settings create a new Environmental Variable called GDAL_DRIVER_PATH and set the path to the osgeo folder in Step 5.

5. Open __init__.py from osgeo…

… and uncomment line 10.

Save the file.

6. Test the setup

Open a Python interpreter and test using…

If you do not get an errors like the screenshot above then setup has been successful.

*************************************************************************************************
OPTIONAL: these will be used in some capacity for scripting geoprocessing,

7. Download numpy + mkl wheel from the brilliant website of Christoph Gohlke

Open up the command prompt and change directory to where the downloaded file resides. Use pip to install.

`pip install "numpy‑1.11.3+mkl‑cp27‑cp27m‑win32.whl"`

8. Install SciPy

Back we go to Gohlke repository and to the SciPy Wheels. Here, I have downloaded scipy‑0.19.0‑cp27‑cp27m‑win32.whl
Open up the command prompt if you have closed it after Step 1 and change directory to where the downloaded file can be found.
Use pip to install.

`pip install "scipy‑0.19.0‑cp27‑cp27m‑win32.whl"`

9. Install Shapely

You got it, go back to Gohlke and download the Shapely whl file. I grabbed Shapely‑1.5.17‑cp27‑cp27m‑win32.whl. Use pip to install similar to Steps 7 and 8.

Now to immerse myself in learning mode and put GDAL/OGR to some use. Check out OSGP#1.1: Measuring Geographic Distributions – Mean Center for the first attempt.

# PDF to JPG Conversion with Python (for Windows)

I recently had a torrid time trying to research and implement a Python script that could batch convert from PDF to JPG. There are numerous entries online that aim to help (and did so in parts) but I struggled to find one with a concise workflow from start to finish that satisfied my criteria and involved setting up what’s required to implement such. The below could be slated for not being the most ‘Pythonic’ way to get it done but it certainly made my life easier. I was struggling with Wand and ImageMagick as per most posts until I luckily stumbled across an entry on StackOverflow where floqqi, my new hero, answered my prayers. I felt that if I struggled with this that there must be others out there with the same predicament and I hope that the title of this post will help it come to the forefront of searches and aid fellow Python snippet researchers in finding some salvation.
Note: I am using Python 2.7 32-bit on Windows 7 Professional

1. Install ImageMagick
Run the installer, accept the license agreement, and click Next on the Information window. In the Select Additional Tasks make sure that Install development headers and libraries for C and C++ is selected.

Click Next and then Install.

2. Install GhostScript
I
nstall the 32-bit Ghostscript AGPL Release

3. Set Environment Variables
Create a new System Variable (Advanced System Settings > Environment Variables) called MAGICK_HOME and insert the Image Magick installation path as the value. This will be similar to C:\Program Files (x86)\ImageMagick-6.9.7-Q8

Click OK and and make sure that the same value (C:\Program Files (x86)\ImageMagick-6.9.7-Q8) is at the start of the Path variable. After this entry in the Path variable insert the entry for GhostScript which will be similar to C:\Program Files (x86)\gs\gs9.20\bin
Note: make sure that the entries are separated by a semi-colon (;)

4. Check if steps 1-3 have been correctly configured
Open the Command Prompt and enter…

`convert file1.pdf file2.jpg`

where file.pdf and file2.jpg are fully qualified paths for an input PDF and and output JPG (or the current directory contains the file).

If no errors are presented and the JPG has been created you can move on to the next step. Otherwise step into some troubleshooting.

5. Install PythonMagick
I downloaded the Python 2.7 32-bit whl file PythonMagick‑0.9.10‑cp27‑none‑win32.whl and then used pip to install from the command prompt.

`pip install C:\Users\glen.bambrick\Downloads\pip install PythonMagick‑0.9.10‑cp27‑none‑win32.whl`

Open up a Python IDE and test to see if you can import PythonMagick

We now have everything set up and can begin to write a script that will convert multiple (single page) PDFs to JPGs.

Import the necessary modules.

```import os, PythonMagick
from PythonMagick import Image
from datetime import datetime```

Ok so datetime isn’t necessary but I like to time my scripts and see if it can be improved upon. Set the start time for the script

`start_time = datetime.now()`

A couple of global variables, one for the directory that holds the PDFs, and another to hold a hexidecimal value for the background colour ‘white’. After trial and error I noticed that some JPGs were being exported with a black background instead of white and this will be used to force a white background. I found a useful link on StackOverflow to help overcome this.

```pdf_dir = r"C:\MyPDFs"
bg_colour = "#ffffff"```

We loop through each PDF in the folder

`for pdf in [pdf_file for pdf_file in os.listdir(pdf_dir) if pdf_file.endswith(".pdf")]:`

Set and read in each PDF. density is the resolution.

```    input_pdf = pdf_dir + "\\" + pdf
img = Image()
img.density('300')

Get the dimensions of the image.

`    size = "%sx%s" % (img.columns(), img.rows())`

Build the JPG for output. This part must be the Magick in PythonMagic because for a small portion of it I am mystified. See that last link to StackOverflow for the origin of the code here. The PythonMagick documentation is tough to digest and in various threads read the laments about how poor it is.

```    output_img = Image(size, bg_colour)
output_img.type = img.type
output_img.composite(img, 0, 0, PythonMagick.CompositeOperator.SrcOverCompositeOp)
output_img.resize(str(img.rows()))
output_img.magick('JPG')
output_img.quality(75)```

And lastly we write out our JPG

```    output_jpg = input_pdf.replace(".pdf", ".jpg")
output_img.write(output_jpg)```

And see how long it took the script to run.

`print datetime.now() - start_time`

This places the output JPGs in the same folder as the PDFs. Based on the resolution (density) and quality settings the process can be a bit lengthy. Using the settings above it took 9 minutes to do 20 PDF to JPG Conversions. You will need to figure out the optimum resolution and quality for your purpose. Low res took 46 seconds for all 20.

As always I feel a sense of achievement when I get a Python script to work and hope that this post will spur on some comments to make the above process more efficient. Feel free to post links to any resources, maybe comment to help myself and other readers, or if this helped you in anyway let me know and I’ll pass the thanks on to floqqi and the rest of the crew. This script is the limit of my knowledge with PythonMagick and this is thanks to those that have endeavoured before me and referenced in the links throughout this post. Thanks guys.

Complete script…

```import os, PythonMagick
from PythonMagick import Image
from datetime import datetime

start_time = datetime.now()

pdf_dir = r"C:\MyPDFs"
bg_colour = "#ffffff"

for pdf in [pdf_file for pdf_file in os.listdir(pdf_dir) if pdf_file.endswith(".pdf")]:

input_pdf = pdf_dir + "\\" + pdf
img = Image()
img.density('300')

size = "%sx%s" % (img.columns(), img.rows())

output_img = Image(size, bg_colour)
output_img.type = img.type
output_img.composite(img, 0, 0, PythonMagick.CompositeOperator.SrcOverCompositeOp)
output_img.resize(str(img.rows()))
output_img.magick('JPG')
output_img.quality(75)

output_jpg = input_pdf.replace(".pdf", ".jpg")
output_img.write(output_jpg)

print datetime.now() - start_time```

# Table or Feature Class Attributes to CSV with ArcPy (Python)

Here’s a little function for exporting an attribute table from ArcGIS to a CSV file. The function takes two arguments, these are a file-path to the input feature class or table and a file-path for the output CSV file (see example down further).

First import the necessary modules.

`import arcpy, csv`

Inside the function we use ArcPy to get a list of the field names.

```def tableToCSV(input_tbl, csv_filepath):
fld_list = arcpy.ListFields(input_tbl)
fld_names = [fld.name for fld in fld_list]```

We then open a CSV file to write the data to.

```    with open(csv_filepath, 'wb') as csv_file:
writer = csv.writer(csv_file)```

The first row of the output CSV file contains the header which is the list of field names.

`        writer.writerow(fld_names)`

We then use the ArcPy SearchCursor to access the attributes in the table for each row and write each row to the output CSV file.

```        with arcpy.da.SearchCursor(input_tbl, fld_names) as cursor:
for row in cursor:
writer.writerow(row)```

And close the CSV file.

`    csv_file.close()`

Full script example…

```import arcpy, csv

def tableToCSV(input_tbl, csv_filepath):
fld_list = arcpy.ListFields(input_tbl)
fld_names = [fld.name for fld in fld_list]
with open(csv_filepath, 'wb') as csv_file:
writer = csv.writer(csv_file)
writer.writerow(fld_names)
with arcpy.da.SearchCursor(input_tbl, fld_names) as cursor:
for row in cursor:
writer.writerow(row)
print csv_filepath + " CREATED"
csv_file.close()

fc = r"C:\Users\******\Documents\ArcGIS\Default.gdb\my_fc"
out_csv = r"C:\Users\******\Documents\output_file.csv"

tableToCSV(fc, out_csv)
```

Feel free to ask questions, comment, or help build upon this example.

# My First Encounter with arcpy.da.UpdateCursor

I have been using arcpy intermittently over the past year and a half mainly for automating and chaining batch processing to save myself countless hours of repetition. This week, however, I had to implement a facet of arcpy that I had not yet had the opportunity to utilise – the data access module.

The Scenario
A file geodatabase with 75 feature classes each containing hundreds to thousands of features. These feature classes were the product of a CAD (Bentley Microstation) to GIS conversions via FME with data coming from 50+ CAD files. As a result of the conversion each feature class could contain features with various attributes from one or multiple CAD files but each feature class consisted of the same schema which was helpful.

The main issue was that the version number for a chunk of the CAD files had not been corrected. Two things needed to be fixed: i) the ‘REV_NUM’ attribute for all feature classes needed to be ‘Ver2’, there would be a mix of ‘Ver1’ and ‘Ver2’,  and ii) in the ‘MODEL_SUMMARY’ if ‘Ver1’ was found anywhere in the text it needed to be replaced with ‘Ver2’. There was one other issue and this stemmed from creating new features and not attributing them, this would have left a ‘NULL’ value in the ‘MODEL’ field (and the other fields). All features had to have standardised attributes. The script would not fix these but merely highlight the feature classes.

OK so a quick recap…
1. Set the ‘REV_NUM’ for every feature to ‘Ver2’
2. Find and replace ‘Ver1’ with ‘Ver2’ in the text string of ‘MODEL_SUMMARY’ for all features.
3. Find all feature classes that have ‘NULL’ in the ‘MODEL’ field.

The Script
Let’s take a look at the thirteen lines of code required to complete the mission.

```import arcpy

arcpy.env.workspace = r"C:\Users\*****\Documents\CleanedUp\Feature_Classes.gdb"
fc_list = arcpy.ListFeatureClasses()
fields = ["MODEL", "MODEL_SUMMARY", "REV_NUM"]

for fc in fc_list:
with arcpy.da.UpdateCursor(fc, fields) as cursor:
for row in cursor:
if row[0] == None or row[0] == "":
print fc + ": Null value found for MODEL"
break
if row[1] != None:
row[1] = row[1].replace("Ver1", "Ver2")
row[2] = "Ver2"
cursor.updateRow(row)```

The Breakdown
Import the arcpy library (you need ArcGIS installed and a valid license to use)

`import arcpy`

Set the workspace path to the relevant file geodatabase

`arcpy.env.workspace = r"C:\Users\*****\Documents\CleanedUp\Feature_Classes.gdb"`

Create a list of all the feature classes within the file geodatabase.

`fc_list = arcpy.ListFeatureClasses()`

We know the names of the fields we wish to access so we add these to a list.

`fields = ["MODEL", "MODEL_SUMMARY", "REV_NUM"]`

For each feature class in the geodatabase we want to access the attributes of each feature for the relevant fields.

```for fc in fc_list:
with arcpy.da.UpdateCursor(fc, fields) as cursor:
for row in cursor:```

If the ‘MODEL’ attribute has a None (NULL) or empty string value then print the feature class name to the screen. Once one is found we can break out and move onto the next feature class.

```   if row[0] == None or row[0] == "":
print fc + ": Null value found for MODEL"
break```

We know have a list of feature classes that we can fix the attributes manually.

Next we find any instance of ‘Ver1’ in ‘MODEL_SUMMARY’ text strings and replace it with ‘Ver2’….

```   if row[1] != None:
row[1] = row[1].replace("Ver1", "Ver2")```

…and update all ‘REV_NUM’ attributes to ‘Ver2’ regardless of what is already attributed. This is like using the Field Calculator to update.

`   row[2] = "Ver2"`

Perform and commit the above updates for each feature.

`   cursor.updateRow(row)`

Very handy to update the data you need and this script can certainly be extended to handle more complex operations using the arcpy.da.UpdateCursor module.

Check out the documentation for arcpy.da.UpdateCursor

# Extract PDF Pages and Rename Based on Text in Each Page (Python)

I was recently tasked with traversing through a directory and subsequent sub-directories to find PDFs and split any multi-page files into single-page files. The end goal was to name each extracted page, that was now an individual PDF, with a document number present on each page. There was possibly over 100 PDF files in the directory and each PDF could have one to more than ten pages. At the extreme I could have been looking at around one-thousand pages to extract and rename – a task that would have been very time consuming and mind numbing to do manually.

The PDFs contained map books produced using data driven pages in ArcGIS, it was conceivable that I could also re-open the original MXDs and re-export the map book as individual pages and naming appropriately based on the document name in the attribute table. Since I was not the creator of any of these PDFs and they all came from different teams, hunting down the correct MXDs and exporting would be cumbersome and also very time consuming. There had to be a more interesting and time efficient way…

…A quick research via Google on some Python modules and I had what I needed to complete my task in a more automated and time efficient manner. I needed three modules;
(1) os – for traversing through the directories and files and for renaming the files
(2) PyPDF2 – to read/write PDF files and also to extract text from pages
(3) re – the regular expression module to find the text needed to rename the file.
The next step was write down some pseudocode to map out what needed to be achieved and then to get coding…

Let’s begin by importing the modules at the top of the script.

`import os, PyPDF2, re`

Define a function to extract the pages. This function will take two parameters; the path to the root directory and the path to a folder to extract the pages to. The ‘extract_to_folder’ needs to be on the same level or above the root directory. Use your operating system to create the folder named ‘extracted’ and also create a second folder called ‘renamed’.

`def split_pdf_pages(root_directory, extract_to_folder):`

Next we use the os module to search from the root directory down to find any PDF files and store the full filepath as a variable, one at a time.

```for root, dirs, files in os.walk(root_directory):
for filename in files:
basename, extension = os.path.splitext(filename)
if extension == ".pdf":
fullpath = root + "\\" + basename + extension```

We then open that PDF in read mode.

`    opened_pdf = PyPDF2.PdfFileReader(open(fullpath,"rb"))`

For each page in the PDF the page is extracted and saved as a new PDF file in the ‘extracted’ folder. The below snippet was sourced from stackoverflow.

```    for i in range(opened_pdf.numPages):
output = PyPDF2.PdfFileWriter()
with open(extract_to_folder + "\\" + basename + "-%s.pdf" % i, "wb") as output_pdf:
output.write(output_pdf)```

That completes our function to strip out individual pages from PDF files in a root directory and down through all corresponding sub-directories. This function might be all you need as you can rename the extracted pages as you save each file. The next task for me, however, was to rename the PDFs based on text contained in each individual file.

Define a function called ‘rename_pdfs’ that takes two arguments; the path to the folder where the extracted pages reside and the renamed folder. Loop through each PDF and create a filepath to each one.

```def rename_pdfs(extraced_pdf_folder, rename_folder):
for root, dirs, files in os.walk(extraced_pdf_folder):
for filename in files:
basename, extension = os.path.splitext(filename)
if extension == ".pdf":
fullpath = root + "\\" + basename + extension```

Open each PDF in read mode…

```    pdf_file_obj = open(fullpath, "rb")

…and create a page object.

`    page_obj = pdf_reader.getPage(0)`

Now we extract the text from the page.

`    pdf_text = page_obj.extractText()`

My task was made quite easy because each page had a unique document number with a certain amount of characters prefixed the exact same for each. This meant that I could use regular expression, the re module, to find the prefix and then obtain the rest of the document number.

The code below finds the document number prefix in the text extracted from the page and appends the next 14 characters to the prefix to give the full document number.

```    for index in re.finditer("THE-DOC-PREFIX-", pdf_text):
doc_ext = pdf_text[index.end():index.end() + 14]
doc_num = "THE-DOC-PREFIX-" + doc_ext
pdf_file_obj.close()```

The last thing to do is to use the document number to rename the PDF

`    os.rename(fullpath, rename_folder + "\\" + doc_num + ".pdf")`

That completes the two functions required to complete the task.

Set up the variables required for the function parameters…

```root_dir = r"C:\Users\******\Documents\original"
extract_to = r"C:\Users\******\Documents\extracted"
rename_to = r"C:\Users\******\Documents\renamed"```

…and then call each function.

```split_pdf_pages(root_dir, extract_to)
rename_pdfs(extract_to,rename_to)```

Run the script. The original files will remain and the renamed extracted pages will be in the renamed folder. Any PDF page that failed to be renamed will still be in the extracted folder and you can rename these manually. This failure to rename every PDF is because of the make-up of the PDF i.e. the way it was exported from a piece of software or how it was created. In a test run, out of 206 pages, 10 pages failed to be renamed. When I opened the pages the select tool was unable to highlight text and everything was embedded as an image, hence why the script couldn’t read any text to rename the document.

I hope someone out there will find this useful. I am always happy that my code works but appreciate if you have any constructive comments or hints and tips to make the code more efficient.

Here’s the full script…

```# import the neccessary modules
import os, PyPDF2, re

# function to extract the individual pages from each pdf found
def split_pdf_pages(root_directory, extract_to_folder):
# traverse down through the root directory to sub-directories
for root, dirs, files in os.walk(root_directory):
for filename in files:
basename, extension = os.path.splitext(filename)
# if a file is a pdf
if extension == ".pdf":
# create a reference to the full filename path
fullpath = root + "\\" + basename + extension

# open the pdf in read mode

# for each page in the pdf
for i in range(opened_pdf.numPages):
# write the page to a new pdf
output = PyPDF2.PdfFileWriter()
with open(extract_to_folder + "\\" + basename + "-%s.pdf" % i, "wb") as output_pdf:
output.write(output_pdf)

# function for renaming the single page pdfs based on text in the pdf
def rename_pdfs(extraced_pdf_folder, rename_folder):
# traverse down through the root directory to sub-directories
for root, dirs, files in os.walk(extraced_pdf_folder):
for filename in files:
basename, extension = os.path.splitext(filename)
# if a file is a pdf
if extension == ".pdf":
# create a reference to the full filename path
fullpath = root + "\\" + basename + extension

# open the individual pdf
pdf_file_obj = open(fullpath, "rb")

# access the individual page
# extract the the text
pdf_text = page_obj.extractText()

# use regex to find information
for index in re.finditer("THE-DOC-PREFIX-", pdf_text):
doc_ext = pdf_text[index.end():index.end() + 14]
doc_num = "THE-DOC-PREFIX-" + doc_ext
pdf_file_obj.close()
# rename the pdf based on the information in the pdf
os.rename(fullpath, rename_folder + "\\" + doc_num + ".pdf")

# parameter variables
root_dir = r"C:\Users\******\Documents\rename_pdf"
extract_to = r"C:\Users\******\Documents\extracted"
rename_to = r"C:\Users\******\Documents\renamed"

# use the two functions
split_pdf_pages(root_dir, extract_to)
rename_pdfs(extract_to,rename_to)```

# GIS Career Advice Suggestions; my experiences & opinions.

Warning: this is a lengthy read that may borderline a rant, you have been warned!

Career advice is everywhere these days from generic all-round advice via do’s and do not’s encompassing any career to must do’s and must not’s for a more focused delineated career path. There are a few great articles out there but most, however, seem to lack the personal experience of the author, and while I wouldn’t necessarily disagree with some or any of the advice, I feel that it often does not reflect reality, well my reality anyway. With that in mind I decided to put in words some ‘suggestions’ based on my experiences to date. Notice that I have labelled these ‘suggestions’ because I don’t think it’s possible to give someone career advice without knowing their individual personal situation. If the generic career advice was golden we would be all going into interviews and employment like robots acting the same as each other. The suggestions below are not numbered and I’m not initiating that you follow them in any particular systematic way or at all in fact if you don’t feel they’re for you. You might find some of these to be enlightening with a fresh approach or on the other-hand you might completely disagree with some that go completely against the grain, you might even think that one or two are complete tripe and a terrible suggestion, but at the end of the day these are my experiences, my opinions, and what has worked for me and what I have learned along the way. So here we go…

# Be Educated

While experience stands out on a resume it is highly unlikely that your experience would be on your resume without an educational background. If you want a potential employer to take you serious for an entry level or graduate role then having the necessary education is a huge step in the right direction. Many university courses now offer GIS modules as part of a wider discipline such as environmental sciences or engineering for examples. While these modules are a good way to get acquainted with GIS software they will more than likely lack in many aspects. In particular the theory behind GIS and GIScience, the theory behind how the analysis methods and techniques work through their underlying algorithms and these are often fundamental to problem solving in the workplace. One very simple example that I have encountered numerous times is ‘why are my points appearing in the middle of the ocean?’ Because most of use say ‘lat-long’ we assume that latitude is the x-axis and longitude the y when it is the opposite. At the end of the day anyone can make a map, it’s not really the hard part of GIS, but making a good map after performing several geoprocessing tasks and performing geostatistical analysis on data and putting forward competent results is! If you are serious about a career in GIS I suggest that you attend a course that purely focuses on GIS, how and why the software does what it does and how you can use the software to solve questions, problems, and aid in better decision making. I have completed two full-time GIS focused postgraduate courses, the GIS and GISci landscape is vast and ever growing and no course can completely cover every aspect, but they are the foundation for future endeavours both educationally and professionally.

# Don’t Hide Behind a Keyboard (if possible)!

I understand that this approach only works if there are companies located locally or an acceptable commute away. If you are within an hour away from a plethora of potential employers and you’re hiding in room, in your pyjamas, applying to them through the personality of your keyboard, well I suggest that you get your head examined.

# GIS Related Certifications Are Not That Important

Tom Cruise (Jerry Maguire – if you didn’t quite get that one)

Don’t Work for the Sake of Working

This suggestion only really relates to those that are in a position to do so. If your GIS career goal is making simple maps for the rest of your days, not really having to extend your brain power too far, then this does not apply to you, you’re more than likely already living your dream. Before Christmas I was working for a few months and was really only doing so for the small bit of income. I was learning nothing from the role and the monotony from it all was affecting my mood. I decided over Christmas that I wasn’t going to apply for a role or accept any roles that I really had very little interest in. I was going to devote my time to up-skilling and putting some blog posts together to attract interest. This has been my best choice to date. I made a list of the books, online courses, and other learning material that I wanted to go through and put aside around 6 hours a day, almost as if it was my full time job. I could do this because I was in a position to do so. If I had a mortgage, kids, or other responsibilities I would not of had the comfort of going down this route without some support. I have talked to so many people that are ‘just happy to be working’ even though they are miserable and have no progression prospects in those GIS jobs. Does 5 years experience doing the most simplistic GIS tasks look good on your resume? In my opinion, no! It looks like you are happy to stay at that level with little to no ambition. If you’re the type of person that can leave a depressing job in the evening and go enhance your skills at night well then I tip my hat to you. But for the majority of us the energy zapping 9 -5 – I want to shoot myself in the face – job means we’re straight home to the couch to complain about how shit our job is to whoever will listen. I once heard that ‘a job of no interest does more damage to your mental health than being unemployed’, and I support this statement. I went on unemployment benefits, learned a ridiculous amount of material in two months that I was genuinely interested in, and I am now applying some of these up-skilled learnings to a new role getting paid more than decent. Maybe I just got lucky or maybe it was motivation. You will be at different levels at different stages in your career, assess what you want and make the necessary changes to achieve your goals, even if a means to this is through unemployment.

# Teach Others When You Can

Be confident in your abilities as soon as you graduate. When you do eventually land your first role it is highly likely that you will know some things that others in the company do not because your course was tailored differently with modern techniques or many of the GIS users in there are basic users. Always take the time out to help a fellow colleague with troubleshooting or suggesting how you would go about doing things if they are stuck and keep that mentality throughout your career. I once worked for a company where people kept their talents to themselves, they wanted to be one step ahead of each other in a race for promotion, the only problem was that there was no room for promotion and the whole situation made for a sour work environment. One guy even spouted on about his ‘dominance in Arc’ as if he ruled over the rest of us. It’s frustrating sitting around being stuck when someone can help you but won’t. Don’t get caught up in getting ahead. Helping others opens the door for others to help and teach you when you need it and creates a pleasant knowledge driven environment.

# Never Burn Bridges

I have walked out of a job in the past, the company was sold to me as a tight knit family community that helped each other out. This couldn’t have been further from the truth and it wore me down to the point where I just had enough and walked. I’m proud I stuck up for myself (I won’t go into details) but if something similar was to happen now I would handle it in a different way and be more professional about it all. My walkout never had any adverse affects on my GIS career but it’s not something I recommend doing as on a couple of occasions I had to explain why I lasted 3 months in a company and the nature of why I left and you really have to be careful how you choose your words in those situations, especially in an interview environment. On the other-hand I have worked for the same employer three times and it’s a great feeling when someone wants you back and are confident in your abilities. I suggest that you attempt to keep in touch with past employers and fellow colleagues. LinkedIn has made this quite easy and it is something you should be using to your advantage to keep connected with your professional network.

# My Thoughts on Interviews

People are people, they are a fellow human beings who one day were in the position you are now. They should not be feared and you should not see an interview as a daunting experience. Society has painted the interview experience the way its is and it makes people nervous. Not getting the job is not the end of the world and you get better at interviews the more frequent you do them. I always thought I interviewed well. That was until I went for an IT role about 6 months ago and felt completely unnerved. This was simply my mind telling me this isn’t really what I wanted, my answers to questions seemed to always relate back to my GIS career and I knew quite quick that rejection was looming and I was quite happy with that. Shortly after I had an interview for a GIS role and I was back in flying form. I knew what I was talking about, what I wanted, what I could do, what I could bring to the team, my limitations, my aspirations, my goals, my salary expectations (which eventually put me out of the race because our evaluations were way off). People spend hours and even days preparing for an interview, building up the stress levels before they have even sat down with the prospective employer. You do not need to research the ins and outs of a company, you just have a look at their general activities, what industry or industries they operate in. GIS is GIS, pretty much the same or similar wherever you work or whatever industry you work in. I have worked in agriculture, oil and gas, mineral exploration, and engineering and had no clue about these industries before I started. Your GIS skills are often what’s wanted, not the industry that is looking for them.

Have some interview questions ready to ask. A couple of my favourites are; With the GIS landscape ever changing, how do you make sure that staff are kept abreast of best practices and have access to learning new methods and techniques as the industry evolves?, this indicates that you want somewhere where you can evolve and learn as you progress; You have painted this role in a glorified manner, but tell me one negative thing about working at this office location?, this will show confidence and that you are not afraid to ask difficult questions.

Be confident even if you have to fake it. It’s a horrible feeling leaving an interview knowing that nerves have cost you. Sit up straight, make eye contact, speak slow and clear as this will make you seem more relaxed and eventually you will be. The interviewer is not there to catch you out, they have your resume and can see what you have to offer, you wouldn’t have made it this far if they didn’t consider you a potential candidate for the position. Smile, smile, smile, always be smiling, you will stand out a mile from other candidates that may have interviewed better, a constant smile will work wonders.

And remember, you’re interviewing them too, you need to find out if you really want to work there so ask the right questions to get the information you need.

# Make Yourself Indispensable

When you are employed find something niche that adds to your stock. Something that if you were to disappear tomorrow they would be completely lost without you. It’s actually easier to do than you think but involves going way beyond simple GIS techniques and requires a lot of motivation. You may need to get to grips with specialist software such as FME, programming and scripting such as Python and JavaScript, learning geostatistical methods and chaining several intricate processing tasks with automation for examples but certainly not limited to these. I suggest that you find and work towards finding something that makes them fear losing you.

# Suggestions and Opinions Rant Over

If you are looking for GIS career advice make sure that you talk to GIS professionals that have been through what you are beginning to embark on and don’t rely on the generic career advice posts that paint everything in black and white. You need to tailor your do’s and do not’s to suit you as an individual and the path that you want to take. The more you talk to the more likely you will come across someone with similar experiences to you and can point you in the right direction. Avoid taking career advice from academics who like to talk about working in the ‘real world’ but have never done so.

You can use or dismiss some the suggestions above because at the end of the day they are only suggestions and as said at the beginning these are my personal experiences and my opinions.