Extract PDF Pages and Rename Based on Text in Each Page (Python)

Posted on September 23, 2016 by Glen Bambrick

Interested in learning ArcPy? Check out this course.

I was recently tasked with traversing a directory and its sub-directories to find PDFs and split any multi-page files into single-page files. The end goal was to name each extracted page, now an individual PDF, with the document number present on that page. There were possibly over 100 PDF files in the directory and each PDF could have anywhere from one to more than ten pages. At the extreme I could have been looking at around one thousand pages to extract and rename, a task that would have been very time consuming and mind numbing to do manually.

The PDFs contained map books produced using Data Driven Pages in ArcGIS, so it was conceivable that I could re-open the original MXDs and re-export each map book as individual pages, naming them appropriately based on the document name in the attribute table. Since I was not the creator of any of these PDFs and they all came from different teams, hunting down the correct MXDs and exporting would be cumbersome and also very time consuming. There had to be a more interesting and time efficient way…

…A quick search via Google for some Python modules and I had what I needed to complete my task in a more automated and time efficient manner. I needed three modules:
(1) os – for traversing through the directories and files and for renaming the files
(2) PyPDF2 – to read/write PDF files and also to extract text from pages
(3) re – the regular expression module to find the text needed to rename the file.
The next step was to write down some pseudocode to map out what needed to be achieved and then to get coding…

Let’s begin by importing the modules at the top of the script.

import os, PyPDF2, re

Define a function to extract the pages. This function takes two parameters: the path to the root directory and the path to a folder to extract the pages to. The ‘extract_to_folder’ needs to sit beside or above the root directory, not inside it, otherwise os.walk could also pick up the newly extracted pages. Use your operating system to create the folder named ‘extracted’ and also create a second folder called ‘renamed’.
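The folder requirement can be checked in code before any splitting starts. The sketch below is only an illustration and not part of the original script; is_inside is a hypothetical helper name, and it assumes both paths are ordinary local folders.

```python
import os

def is_inside(folder, root_directory):
    """Return True if folder is root_directory itself or sits anywhere below it."""
    folder = os.path.normcase(os.path.abspath(folder))
    root_directory = os.path.normcase(os.path.abspath(root_directory))
    # add a trailing separator so a sibling such as "original2" is not
    # mistaken for a sub-folder of "original"
    prefix = root_directory.rstrip(os.sep) + os.sep
    return folder == root_directory or folder.startswith(prefix)
```

Calling is_inside(extract_to, root_dir) before split_pdf_pages and stopping when it returns True would catch a misplaced ‘extracted’ folder early.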

def split_pdf_pages(root_directory, extract_to_folder):

Next we use the os module to search from the root directory down to find any PDF files and store the full filepath as a variable, one at a time.

 for root, dirs, files in os.walk(root_directory):
  for filename in files:
   basename, extension = os.path.splitext(filename)
   if extension == ".pdf":
    fullpath = root + "\\" + basename + extension

We then open that PDF in read mode.

    opened_pdf = PyPDF2.PdfFileReader(open(fullpath,"rb"))

For each page in the PDF the page is extracted and saved as a new PDF file in the ‘extracted’ folder. The below snippet was sourced from stackoverflow.

    for i in range(opened_pdf.numPages):
     output = PyPDF2.PdfFileWriter()
     output.addPage(opened_pdf.getPage(i))
     with open(extract_to_folder + "\\" + basename + "-%s.pdf" % i, "wb") as output_pdf:
      output.write(output_pdf)

That completes our function to strip out individual pages from PDF files in a root directory and down through all corresponding sub-directories. This function might be all you need as you can rename the extracted pages as you save each file. The next task for me, however, was to rename the PDFs based on text contained in each individual file.
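The traversal used in split_pdf_pages builds paths with a hard-coded backslash and only matches a lower-case .pdf extension. A minimal sketch of a more portable variant follows; find_pdfs is a hypothetical name, and the case-insensitive match is an assumption about what you might want rather than something the original script does.

```python
import os

def find_pdfs(root_directory):
    """Yield (full path, base name) for every PDF under root_directory."""
    for root, dirs, files in os.walk(root_directory):
        for filename in files:
            basename, extension = os.path.splitext(filename)
            # lower() also catches files saved as .PDF
            if extension.lower() == ".pdf":
                # os.path.join picks the right separator for the platform
                yield os.path.join(root, filename), basename
```

The base name is returned alongside the path because the split function reuses it when naming each extracted page.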

Define a function called ‘rename_pdfs’ that takes two arguments; the path to the folder where the extracted pages reside and the renamed folder. Loop through each PDF and create a filepath to each one.

def rename_pdfs(extracted_pdf_folder, rename_folder):
 for root, dirs, files in os.walk(extracted_pdf_folder):
  for filename in files:
   basename, extension = os.path.splitext(filename)
   if extension == ".pdf":
    fullpath = root + "\\" + basename + extension

Open each PDF in read mode…

    pdf_file_obj = open(fullpath, "rb")
    pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)

…and create a page object.

    page_obj = pdf_reader.getPage(0)

Now we extract the text from the page.

    pdf_text = page_obj.extractText()

My task was made quite easy because each page had a unique document number, and every document number started with the same prefix. This meant that I could use a regular expression, via the re module, to find the prefix and then obtain the rest of the document number.

The code below finds the document number prefix in the text extracted from the page and appends the next 14 characters to the prefix to give the full document number.

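    for index in re.finditer("THE-DOC-PREFIX-", pdf_text):
     doc_ext = pdf_text[index.end():index.end() + 14]
     doc_num = "THE-DOC-PREFIX-" + doc_ext
     pdf_file_obj.close()

The last thing to do is to use the document number to rename the PDF. This line belongs inside the loop above so that it only runs when the prefix has been found, and the break stops the loop after the first match because the file has already been moved by then.

     os.rename(fullpath, rename_folder + "\\" + doc_num + ".pdf")
     break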

That completes the two functions required to complete the task.
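The prefix lookup can also be pulled out into a small function that is easy to try on a plain string before pointing it at real PDFs. This is a sketch under assumptions: find_doc_number is a hypothetical name, and it uses re.search (first match only) rather than the re.finditer loop shown above.

```python
import re

def find_doc_number(text, prefix, length=14):
    """Return prefix plus the next `length` characters, or None if absent."""
    match = re.search(re.escape(prefix), text)
    if match is None:
        return None
    suffix = text[match.end():match.end() + length]
    # a short suffix means the text was cut off, so treat it as a failure
    if len(suffix) < length:
        return None
    return prefix + suffix
```

Returning None when nothing is found lets the caller skip the rename and leave the page in the extracted folder.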

Set up the variables required for the function parameters…

root_dir = r"C:\Users\******\Documents\original"
extract_to = r"C:\Users\******\Documents\extracted"
rename_to = r"C:\Users\******\Documents\renamed"

…and then call each function.

split_pdf_pages(root_dir, extract_to)
rename_pdfs(extract_to,rename_to)

Run the script. The original files will remain and the renamed extracted pages will be in the renamed folder. Any PDF page that failed to be renamed will still be in the extracted folder and you can rename these manually. A page fails to be renamed when no text can be extracted from it, which depends on how the PDF was created or exported from a piece of software. In a test run, 10 out of 206 pages failed to be renamed. When I opened those pages the select tool was unable to highlight any text because everything was embedded as an image, so the script had no text to read.

I hope someone out there will find this useful. I am always happy when my code works, but I would appreciate any constructive comments or hints and tips to make the code more efficient.

Here’s the full script…

# import the necessary modules
import os, PyPDF2, re

# function to extract the individual pages from each pdf found
def split_pdf_pages(root_directory, extract_to_folder):
 # traverse down through the root directory to sub-directories
 for root, dirs, files in os.walk(root_directory):
  for filename in files:
   basename, extension = os.path.splitext(filename)
   # if a file is a pdf
   if extension == ".pdf":
    # create a reference to the full filename path
    fullpath = root + "\\" + basename + extension

    # open the pdf in read mode
    opened_pdf = PyPDF2.PdfFileReader(open(fullpath,"rb"))

    # for each page in the pdf
    for i in range(opened_pdf.numPages):
     # write the page to a new pdf
     output = PyPDF2.PdfFileWriter()
     output.addPage(opened_pdf.getPage(i))
     with open(extract_to_folder + "\\" + basename + "-%s.pdf" % i, "wb") as output_pdf:
      output.write(output_pdf)

# function for renaming the single page pdfs based on text in the pdf
def rename_pdfs(extracted_pdf_folder, rename_folder):
 # traverse down through the root directory to sub-directories
 for root, dirs, files in os.walk(extracted_pdf_folder):
  for filename in files:
   basename, extension = os.path.splitext(filename)
   # if a file is a pdf
   if extension == ".pdf":
    # create a reference to the full filename path
    fullpath = root + "\\" + basename + extension

    # open the individual pdf
    pdf_file_obj = open(fullpath, "rb")
    pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)

    # access the individual page
    page_obj = pdf_reader.getPage(0)
    # extract the text
    pdf_text = page_obj.extractText()

    # use regex to find information
    for index in re.finditer("THE-DOC-PREFIX-", pdf_text):
     doc_ext = pdf_text[index.end():index.end() + 14]
     doc_num = "THE-DOC-PREFIX-" + doc_ext
     pdf_file_obj.close()
     # rename the pdf based on the information in the pdf
     os.rename(fullpath, rename_folder + "\\" + doc_num + ".pdf")
     # stop after the first match; the original file has now been moved
     break

# parameter variables
root_dir = r"C:\Users\******\Documents\rename_pdf"
extract_to = r"C:\Users\******\Documents\extracted"
rename_to = r"C:\Users\******\Documents\renamed"

# use the two functions
split_pdf_pages(root_dir, extract_to)
rename_pdfs(extract_to,rename_to)

Resources:
PyPDF2
Automate the Boring Stuff with Python

Reproject a Polygon Shapefile using PyShp and PyProj

Posted on January 24, 2016 by Glen Bambrick


In this post I will use the PyShp library along with the PyProj library to reproject the local authority boundaries of Ireland, in Shapefile format, from Irish Transverse Mercator to WGS 84 using Python.

To follow along, download the admin boundaries from the Central Statistics Office (CSO) and rename the files to Ireland_LA. Move the files to the directory that you want to work from.

You will need to install the libraries. You can use either easy_install or pip by opening a command prompt window and entering one of the following pairs of commands.

easy_install pyshp
easy_install pyproj

pip install pyshp
pip install pyproj

Open an interactive Python window and enter the following to make sure that you have access to the libraries.

>>> import shapefile
>>> from pyproj import Proj, transform

If no errors are returned you are good to go.

My original attempt at converting the data led to this monstrosity…

and I instantly realised that several local authority boundaries were made up of multipart geometry.

We would need two constructs, one for rebuilding single geometry features and one for rebuilding multipart geometry features.

So let’s get to it. In your favourite Python IDE open a new script, import the libraries and save it. There is a link at the bottom of the post to download the code.

import shapefile
from pyproj import Proj, transform

Define a function to create a projection (.prj) file. See the post on Generating a Projection (.prj) file using Python for more info.

def getWKT_PRJ (epsg_code):
    import urllib
    wkt = urllib.urlopen("http://spatialreference.org/ref/epsg/{0}/prettywkt/".format(epsg_code))
    remove_spaces = wkt.read().replace(" ","")
    output = remove_spaces.replace("\n", "")
    return output

Define a path to your working directory where the Ireland_LA files reside. You can create a similar path to mine below or define your own, just make sure the Shapefile is located there.

shp_folder = "C:/blog/pyproj/shp/"

Using PyShp create a Reader object to access the data from the Ireland_LA Shapefile.

shpf = shapefile.Reader(shp_folder + "Ireland_LA.shp")

Create a Writer object to write data to as a new Shapefile.

wgs_shp = shapefile.Writer(shapefile.POLYGON)

Set variables for access to the field information of both the original and new Shapefile.

fields = shpf.fields
wgs_fields = wgs_shp.fields

We will grab all the field info from the original and copy it into the new. The ‘Deletion Flag’ as set in the Shapefile standard will be passed over (the tuple in the if statement), and we want data from the lists that follow the tuple that define the field name, data type and field length. Basically we are simply replicating the field structure from the original into the new.

for name in fields:
    # skip the DeletionFlag, which is stored as a tuple
    if isinstance(name, tuple):
        continue
    else:
        args = name
        wgs_shp.field(*args)
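The tuple check can be tried without a Shapefile at hand. The sketch below uses a made-up fields list in the layout described above (a DeletionFlag tuple followed by lists); the field names and sizes are illustrative only, and copyable_fields is a hypothetical helper.

```python
# made-up sample in the layout described above: one tuple, then lists
sample_fields = [
    ("DeletionFlag", "C", 1, 0),
    ["COUNTYNAME", "C", 50, 0],
    ["Shape_Area", "N", 19, 11],
]

def copyable_fields(fields):
    """Return the field definitions, leaving out the DeletionFlag tuple."""
    return [field for field in fields if not isinstance(field, tuple)]

# each remaining list can be unpacked straight into a field() call
to_copy = copyable_fields(sample_fields)
```

isinstance is used because comparing a type object with the string "tuple" never matches, so the DeletionFlag would otherwise be copied across as a normal field.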

Now we want to populate the fields with attribute information. Create a variable to access the records of the original file.

records = shpf.records()

Copy the records from the original into the new.

for row in records:
    args = row
    wgs_shp.record(*args)

In the above snippet the args variable holds each record as a list and then unpacks that list as arguments in wgs_shp.record(attr_1, attr_2, attr_3….), which creates a record in the dbf file.

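We now have all the attribute data copied over. Let’s begin the quest to convert the data from ITM to WGS 84! Define the input projection (the projection of the original file) and an output projection using PyProj. Note that EPSG:29902 is the code for TM65 / Irish Grid, while Irish Transverse Mercator is EPSG:2157, so use whichever matches the .prj file supplied with your download.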

input_projection = Proj(init="epsg:29902")
output_projection = Proj(init="epsg:4326")

We need to access the geometry of the features in the original file so give yourself access to it.

geom = shpf.shapes()

Now we loop through each feature in the original dataset, access every point that makes up the geometry, convert the coordinates for each point and re-assemble transformed geometry in the new Shapefile. The if statement will handle geometry with only one part making up the feature.

for feature in geom:
    # if there is only one part
    if len(feature.parts) == 1:
        # create empty list to store all the coordinates
        poly_list = []
        # get each coord that makes up the polygon
        for coords in feature.points:
            x, y = coords[0], coords[1]
            # transform the coord
            new_x, new_y = transform(input_projection, output_projection, x, y)
            # put the coord into a list structure
            poly_coord = [float(new_x), float(new_y)]
            # append the coords to the polygon list
            poly_list.append(poly_coord)
        # add the geometry to the shapefile.
        wgs_shp.poly(parts=[poly_list])

The else statement handles geometries with multi-parts.

    else:
        # append the total amount of points to the end of the parts list
        feature.parts.append(len(feature.points))
        # empty list to store all the parts that make up the complete feature
        poly_list = []
        # keep track of the part being added
        parts_counter = 0

        # while the parts_counter is less than the amount of parts
        while parts_counter < len(feature.parts) - 1:
            # keep track of the amount of points added to the feature
            coord_count = feature.parts[parts_counter]
            # number of points in each part
            no_of_points = abs(feature.parts[parts_counter] - feature.parts[parts_counter + 1])
            # create list to hold individual parts - these get added to poly_list[]
            part_list = []
            # cut off point for each part
            end_point = coord_count + no_of_points

            # loop through each part
            while coord_count < end_point:
                for coords in feature.points[coord_count:end_point]:
                    x, y = coords[0], coords[1]
                    # transform the coord
                    new_x, new_y = transform(input_projection, output_projection, x, y)
                    # put the coord into a list structure
                    poly_coord = [float(new_x), float(new_y)]
                    # append the coords to the part list
                    part_list.append(poly_coord)
                    coord_count = coord_count + 1
            # append the part to the poly_list (once per part, inside the outer loop)
            poly_list.append(part_list)
            parts_counter = parts_counter + 1
        # add the multipart geometry to the new file
        wgs_shp.poly(parts=poly_list)
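The bookkeeping in the multipart branch boils down to slicing one flat list of points at the start index of each part. A compact way to express just that slicing step, as a sketch that leaves the coordinate transformation out, is shown below; split_parts is a hypothetical helper and the sample points are made up.

```python
def split_parts(points, parts):
    """Split a flat list of points into one list per part.

    parts holds the start index of each part within points.
    """
    # pair each start index with the next one; the last part runs to the end
    ends = list(parts[1:]) + [len(points)]
    return [points[start:end] for start, end in zip(parts, ends)]

# two made-up rings stored back to back, starting at index 0 and index 4
points = [[0, 0], [1, 0], [1, 1], [0, 0], [5, 5], [6, 5], [6, 6], [5, 5]]
rings = split_parts(points, [0, 4])
```

Each ring returned here could then be transformed point by point and appended to poly_list, which avoids the manual counters.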

Save the Shapefile

wgs_shp.save(shp_folder + "Ireland_LA_wgs.shp")

And generate the projection file for it.

prj = open(shp_folder + "Ireland_LA_wgs.prj", "w")
epsg = getWKT_PRJ("4326")
prj.write(epsg)
prj.close()

Save and run the file. Open the Shapefile in a GIS to inspect. Have a look at the attribute table, nicely populated with the data. You should be able to configure the code for other polygon files, just change the original input Shapefile, set the projections (input and output), and save a new Shapefile. Also don’t forget the projection file!

You can download the source code for this post here. Right-click on the download link on the page, select Save file as…, before saving change the .txt in the filename to .py and save.

If anyone sees a way to make the code more efficient please comment, your feedback is appreciated.

Resources

For more information, see the documentation for the PyShp and PyProj libraries.

Haversine Distances with Python, EURO 2016 & Historic Results

Posted on January 12, 2016 by Glen Bambrick

Ireland kick off their European Championship 2016 Group E on the 13th of June at the Stade de France, Saint-Denis, in the northern suburbs of Paris at 6pm local time (5pm on the Emerald Isle). Our opponents that evening will be Sweden, followed by a trek down south to Bordeaux to face Belgium on the 18th, and then a northern journey to Lille to take on Italy in the final group game on the 22nd.


I have a flight from Dublin to Paris, a train from Paris to Bordeaux, Bordeaux to Lille, Lille to Paris, and a flight back to Dublin. Let’s use the simple Haversine formula to calculate approximately how far I will be travelling to follow the boys in green this summer. Just to note, I am not being pessimistic by not sticking around for the knockout stages. I have every faith that we will get out of the group, but a two week holiday in France is going to take its toll on the bank account! But who knows…

The Haversine formula calculates the distance between two points on a sphere, also known as the great-circle distance. Here’s the Python code, similar to the version on rosettacode.org…

from math import radians, sin, cos, sqrt, asin

def haversine(coords_1, coords_2):

 lon1 = coords_1[0]
 lat1 = coords_1[1]
 lon2 = coords_2[0]
 lat2 = coords_2[1]
 
 # Earth radius in kilometers
 earth_radius = 6372.8

 # convert from degrees to radians
 dLon = radians(lon2 - lon1)
 dLat = radians(lat2 - lat1) 
 lat1 = radians(lat1)
 lat2 = radians(lat2)

 # mathsy stuff for you to research...if you're so inclined to
 a = sin(dLat/2)**2 + cos(lat1)*cos(lat2)*sin(dLon/2)**2
 c = 2*asin(sqrt(a))

 # return the distance rounded to 2 decimal places
 return round((earth_radius * c),2)
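Written out, the “mathsy stuff” in the function is the haversine formula. The symbols are labels chosen here for illustration: φ is latitude and λ is longitude (both in radians), Δφ and Δλ are the differences between the two points, R is the Earth radius and d is the distance.

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\Delta\varphi}{2}\right)
     + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\Delta\lambda}{2}\right) \\
c &= 2 \arcsin\!\left(\sqrt{a}\right) \\
d &= R \, c
\end{aligned}
```

The variables a and c in the code map directly onto the first two lines, and the return statement is the third line with R = 6372.8 km.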

Now to use the function to calculate my approximate travelling distance. First define the coordinates as lists.

# lon - lat
dub_airport = [-6.2700, 53.4214]
cdg_airport = [2.5478, 49.0097]
paris = [2.3831, 48.8390]
bordeaux = [-0.5800, 44.8400]
lille = [3.0583, 50.6278]

And then use the coordinates as parameters for the haversine function. Print the distance for each leg and the total distance.

leg_1 = haversine(dub_airport, cdg_airport)
print "Leg 1: " + str(leg_1) + " Km"

leg_2 = haversine(paris, bordeaux)
print "Leg 2: " + str(leg_2) + " Km"

leg_3 = haversine(bordeaux, lille)
print "Leg 3: " + str(leg_3) + " Km"

leg_4 = haversine(lille, paris)
print "Leg 4: " + str(leg_4) + " Km"

leg_5 = haversine(cdg_airport, dub_airport)
print "Leg 5: " + str(leg_5) + " Km"

total_dist = leg_1 + leg_2 + leg_3 + leg_4 + leg_5
print "Total Distance " + str(total_dist) + " Km"
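The leg-by-leg calls can also be driven from lists of stops. The sketch below repeats the haversine function so that it runs on its own; route_length is a hypothetical helper, and splitting the trip into a flight chain and a train chain is just one way to reflect that the trains leave from central Paris rather than the airport.

```python
from math import radians, sin, cos, sqrt, asin

def haversine(coords_1, coords_2, earth_radius=6372.8):
    """Great-circle distance in km between two (lon, lat) pairs in degrees."""
    lon1, lat1 = coords_1
    lon2, lat2 = coords_2
    d_lon = radians(lon2 - lon1)
    d_lat = radians(lat2 - lat1)
    a = sin(d_lat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(d_lon / 2) ** 2
    return round(earth_radius * 2 * asin(sqrt(a)), 2)

def route_length(stops):
    """Sum the distance between each consecutive pair of stops."""
    return sum(haversine(start, end) for start, end in zip(stops, stops[1:]))

# lon - lat, same coordinates as above
dub_airport = [-6.2700, 53.4214]
cdg_airport = [2.5478, 49.0097]
paris = [2.3831, 48.8390]
bordeaux = [-0.5800, 44.8400]
lille = [3.0583, 50.6278]

# flights: Dublin to Paris and back; trains: the loop around France
flights = route_length([dub_airport, cdg_airport, dub_airport])
trains = route_length([paris, bordeaux, lille, paris])
total_dist = flights + trains
```

Adding a knockout-stage venue would then only mean adding one more stop to the trains list.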

The output from running the complete script is…

Leg 1: 785.3 Km
Leg 2: 498.57 Km
Leg 3: 698.71 Km
Leg 4: 204.79 Km
Leg 5: 785.3 Km
Total Distance 2972.67 Km

Just over 2,970 Km! Ok, so I could have been more accurate by getting the road length from my house to the airport, using the Haversine to find the distance from Dublin Airport to Charles De Gaulle, and then using road and rail networks to calculate my internal travel in France, but the idea here was to introduce you to the Haversine formula.

And here’s a little map to show my upcoming travels. These maps were made by exporting data from ArcGIS as an EMF and importing it into CorelDraw, a graphic design package, for styling.


That’s it for the Haversine, Python and GIS in general. If you have no interest in football you can stop reading here. From here down it’s just historic results against our opponents and opinions.

SWEDEN

We have faced Sweden in two world cup qualifying campaigns, in qualification for one European Championship, and four times in a friendly match. It doesn’t bode well that we have never beaten Sweden in a competitive game, losing four times and drawing twice.

Our dominance over Sweden in friendlies sees us with a 75% win percentage, only losing once in four meetings.

Overall: P10 W3 D2 L5 F13 A16 – WIN% 30

BELGIUM

Similar to our record against Sweden, Ireland have never beaten Belgium in a competitive fixture, although in the seven competitive games we have only been beaten twice, drawing the other five.

We have to look to the friendly results for Ireland to get the better of Belgium, with Ireland edging the score at four wins to three.

Overall: P14 W4 D5 L5 F24 A25 – WIN% 28.57

ITALY

Well would you look at that!! We have beaten at least one of our upcoming opponents in a competitive match. Ok, so one win out of eight, but I am ever the optimist and from 1994 onwards our record against Italy is pretty decent both competitively and in friendlies.

Our overall record against Italy looks dismal but recent results give Ireland hope. Since and including our meeting with Italy at World Cup 94 the head-to-head is two wins each and three draws. If you offered me a draw now against Italy I’d take it. Remember there are now twenty-four teams in the competition, meaning the four best third-placed teams also qualify for the second round.

Overall: P13 W2 D3 L8 F9 A20 – WIN% 15.38

Do we stand a chance of getting out of the group?

Let me know your thoughts. With three teams qualifying from four out of the six groups Ireland have every chance of progressing. A win and a draw will more than likely be the minimum requirement to achieve passage to the next phase, but let’s not forget that Ireland reached the Quarter Finals of Italia 90 having not won a single match! Beat Sweden in the opener and get a draw against Italy. If only it was that simple 🙂

#COYBIG

Data used to create maps was downloaded from Natural Earth.
Historic results collated from the FAI and Wikipedia.

CSV to Shapefile with pyshp

Posted on January 9, 2016 by Glen Bambrick


In this post I will look at extracting point data from a CSV file and creating a Shapefile with the pyshp library. The data consists of the location of trees with various attributes generated by the Fingal County Council in Ireland. The data can be downloaded as a CSV file from dublinked.ie.

pyshp is a pure Python library designed to provide read and write support for the ESRI Shapefile (.shp) format and only utilizes Python’s standard library to achieve this. The library can be downloaded from https://code.google.com/p/pyshp/ and placed in the site-packages folder of your Python installation. Alternatively you can use easy-install…

easy_install pyshp

…or pip.

pip install pyshp

NOTE: You should make yourself familiar with the pyshp library by visiting Joel Lawhead’s examples and documents here.

The full code is at the bottom of the post, the following is a walkthrough. When ready to go open your favourite editor and import the modules required for the task at hand.

import shapefile, csv

We will use the getWKT_PRJ function discussed in a previous post.

def getWKT_PRJ (epsg_code):
 import urllib
 wkt = urllib.urlopen("http://spatialreference.org/ref/epsg/{0}/prettywkt/".format(epsg_code))
 remove_spaces = wkt.read().replace(" ","")
 output = remove_spaces.replace("\n", "")
 return output

Create an instance of the Shapefile Writer( ) class and declare the POINT geometry type.

trees_shp = shapefile.Writer(shapefile.POINT)

Set the autoBalance to 1. This enforces that for every record there must be a corresponding geometry.

trees_shp.autoBalance = 1

Create the field names and data types for each.

trees_shp.field("TREE_ID", "C")
trees_shp.field("ADDRESS", "C")
trees_shp.field("TOWN", "C")
trees_shp.field("TREE_SPEC", "C")
trees_shp.field("SPEC_DESC", "C")
trees_shp.field("COMMONNAME", "C")
trees_shp.field("AGE_DESC", "C")
trees_shp.field("HEIGHT", "C")
trees_shp.field("SPREAD", "C")
trees_shp.field("TRUNK", "C")
trees_shp.field("TRUNK_ACTL", "C")
trees_shp.field("CONDITION", "C")

Create a counter variable to keep track of the number of features written to the Shapefile.

counter = 1

Open the CSV file in read mode.

with open('C:/csv_to_shp/Trees.csv', 'rb') as csvfile:
 reader = csv.reader(csvfile, delimiter=',')

Skip the header.

 next(reader, None)

Loop through each row and assign each attribute in the row to a variable.

 for row in reader:
  tree_id = row[0]
  address = row[1]
  town = row[2]
  tree_species = row[3]
  species_desc = row[4]
  common_name = row[5]
  age_desc = row[6]
  height = row[7]
  spread = row[8]
  trunk = row[9]
  trunk_actual = row[10]
  condition = row[11]
  latitude = row[12]
  longitude = row[13]

Set the geometry for each record based on the longitude and latitude values.

  trees_shp.point(float(longitude),float(latitude))

Create a matching record for the geometry using the attributes.

  trees_shp.record(tree_id, address, town, tree_species, species_desc, common_name, age_desc,height, spread, trunk, trunk_actual, condition)
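The row handling can be tried on a couple of lines of text before running it against the full download. The sketch below is an illustration only: the header names and both data rows are made up, parse_row is a hypothetical helper, and skipping rows with unusable coordinates is an assumption about how you might want bad rows treated.

```python
import csv
import io

# two made-up rows with the attributes first and latitude, longitude last
sample = io.StringIO(
    u"TREE_ID,ADDRESS,TOWN,SPEC,DESC,NAME,AGE,HEIGHT,SPREAD,TRUNK,ACTUAL,COND,LAT,LONG\n"
    u"1,Main St,Swords,AC,Acer,Maple,Mature,5m,2m,20cm,21,Good,53.4597,-6.2181\n"
    u"2,High St,Malahide,QR,Quercus,Oak,Young,3m,1m,10cm,11,Fair,,\n"
)

def parse_row(row):
    """Return (attributes, x, y), or None when the coordinates are unusable."""
    attributes = row[:12]
    try:
        latitude, longitude = float(row[12]), float(row[13])
    except ValueError:
        return None
    # points are given as x, y, which means longitude comes first
    return attributes, longitude, latitude

reader = csv.reader(sample, delimiter=",")
next(reader, None)  # skip the header
parsed = [parse_row(row) for row in reader]
```

With autoBalance switched on, skipping a row before both the point and the record are written keeps the geometry and attribute counts in step.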

Print to screen the current feature number and increase the counter.

  print "Feature " + str(counter) + " added to Shapefile."
  counter = counter + 1

Save the Shapefile to a location and name the file.

trees_shp.save("C:/csv_to_shp/Fingal_Trees")

Create a projection file (.prj)

prj = open("C:/csv_to_shp/Fingal_Trees.prj", "w")
epsg = getWKT_PRJ("4326")
prj.write(epsg)
prj.close()

Save and run the script. The number of features should be printed to the console.

If you open the original CSV file you can see that there are also 33670 records. Navigate to the file location where you saved the Shapefile output. You should see four files shown below.

And just to make sure that the data is correct, here I have opened it up in QGIS.

And the attribute table…

And here’s the full code…

# import libraries
import shapefile, csv

# function to generate a .prj file
def getWKT_PRJ (epsg_code):
 import urllib
 wkt = urllib.urlopen("http://spatialreference.org/ref/epsg/{0}/prettywkt/".format(epsg_code))
 remove_spaces = wkt.read().replace(" ","")
 output = remove_spaces.replace("\n", "")
 return output

# create a point shapefile
trees_shp = shapefile.Writer(shapefile.POINT)

# for every record there must be a corresponding geometry.
trees_shp.autoBalance = 1

# create the field names and data type for each.
trees_shp.field("TREE_ID", "C")
trees_shp.field("ADDRESS", "C")
trees_shp.field("TOWN", "C")
trees_shp.field("TREE_SPEC", "C")
trees_shp.field("SPEC_DESC", "C")
trees_shp.field("COMMONNAME", "C")
trees_shp.field("AGE_DESC", "C")
trees_shp.field("HEIGHT", "C")
trees_shp.field("SPREAD", "C")
trees_shp.field("TRUNK", "C")
trees_shp.field("TRUNK_ACTL", "C")
trees_shp.field("CONDITION", "C")

# count the features
counter = 1

# access the CSV file
with open('C:/csv_to_shp/Trees.csv', 'rb') as csvfile:
 reader = csv.reader(csvfile, delimiter=',')
 # skip the header
 next(reader, None)

 # loop through each of the rows and assign the attributes to variables
 for row in reader:
  tree_id = row[0]
  address = row[1]
  town = row[2]
  tree_species = row[3]
  species_desc = row[4]
  common_name = row[5]
  age_desc = row[6]
  height = row[7]
  spread = row[8]
  trunk = row[9]
  trunk_actual = row[10]
  condition = row[11]
  latitude = row[12]
  longitude = row[13]

  # create the point geometry
  trees_shp.point(float(longitude),float(latitude))
  # add attribute data
  trees_shp.record(tree_id, address, town, tree_species, species_desc, common_name, age_desc,height, spread, trunk, trunk_actual, condition)

  print "Feature " + str(counter) + " added to Shapefile."
  counter = counter + 1

# save the Shapefile
trees_shp.save("C:/csv_to_shp/Fingal_Trees")

# create a projection file
prj = open("C:/csv_to_shp/Fingal_Trees.prj", "w")
epsg = getWKT_PRJ("4326")
prj.write(epsg)
prj.close()

Any problems let me know.

Labelling in ArcGIS with Formatting Tags and Expressions

Posted on January 5, 2016 by Glen Bambrick


I recently sat an interview test where I had to use labelling in ArcGIS Desktop without the aid of the internet or notes for guidance. I must admit I was pretty stumped when it came to formatting labels beyond using the GUI (Labels tab in the Layer Properties) and stepping into the world of expressions, so I decided to rectify this and explore the options. ESRI maintain a fantastic help resource that can be found here (for 10.2), where you can find what you need to get started. The following examples are some neat ways you can format labels using tags and expressions. They’re quite basic but act as a foundation to build upon.

Open the Layer Properties of the layer you wish to label and switch to the Labels tab. Click on the Expression… button to open the Label Expression window. Switch the Parser at the bottom of the window to Python.

In this first example I will simply concatenate a string with an attribute (also a string). The custom string will be placed on the first line of the label and the county name attribute on the second. This is achieved with the following…

"This is the geographic region of\n" + [COUNTYNAME]

[COUNTYNAME] represents the field named COUNTYNAME in the attribute table of the data I am working with. Next we will concatenate the area on a new line and round it to two decimal places. We cast the area to a string so the concatenation can be performed.

"This is the geographic region of\n" + [COUNTYNAME] + "\nArea: " + str(round(float([Shape_Area]),2)) + " sq m"

Next we force labels to be presented in upper case text. The Advanced checkbox must be checked to create multiline expressions. You could also replace upper with lower in the below code snippet to force text to be lower case, or replace with title to capitalize the first letter in each word (proper case).

def FindLabel ([COUNTYNAME]):
 label = [COUNTYNAME]
 label = label.upper()
 return label

Stack text on new lines by using replace. The expression below replaces spaces in the COUNTYNAME attribute with \n, which forces text after a space onto a new line and removes the space.

def FindLabel ([COUNTYNAME]):
 label = [COUNTYNAME]
 label = label.upper().replace(" ", "\n")
 return label
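The string handling in these expressions is ordinary Python, so it can be tested outside ArcGIS. The sketch below is a plain function rather than a label expression: stack_label is a hypothetical name, and the bracketed field reference is replaced by a normal argument because [COUNTYNAME] is only valid inside the Label Expression window.

```python
def stack_label(name, case="upper"):
    """Change the case of a label and put each word on its own line."""
    if case == "upper":
        name = name.upper()
    elif case == "lower":
        name = name.lower()
    elif case == "title":
        name = name.title()
    # every space becomes a line break, so multi-word names are stacked
    return name.replace(" ", "\n")
```

Once the output looks right for a few sample names, the same method calls can be pasted back into FindLabel.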

Let’s make the text bold by using format tags. Each tag has an opening < > and a closing </ > form.

def FindLabel ([COUNTYNAME]):
 label = [COUNTYNAME]
 label = label.upper().replace(" ", "\n")
 return "<BOL>" + label + "</BOL>"

…and then add some colour. Missing RGB values are assumed to be 0.

def FindLabel ([COUNTYNAME]):
 label = [COUNTYNAME]
 label = label.upper().replace(" ", "\n")
 return "<BOL><CLR red='255'>" + label + "</CLR></BOL>"

So how about a custom colour…

def FindLabel ([COUNTYNAME]):
 label = [COUNTYNAME]
 label = label.upper().replace(" ", "\n")
 return "<BOL><CLR red='125' green='105' blue='190'>" + label + "</CLR></BOL>"

…and italics and an underline…

def FindLabel ([COUNTYNAME]):
 label = [COUNTYNAME]
 label = label.upper().replace(" ", "\n")
 return "<UND>" + "REGION\n" + "</UND>" + "<BOL><ITA><CLR red='125' green='105' blue='190'>" + label + "</CLR></ITA></BOL>"

We’ll add the area back in and format the units as m with a superscripted 2 instead of sq m… (use SUB if you need to subscript text)

def FindLabel ([COUNTYNAME], [Shape_Area]):
 label = [COUNTYNAME]
 label = label.upper().replace(" ", "\n")
 area = str(round(float([Shape_Area]),2))
 return "<UND>" + "REGION\n" + "</UND>" + "<BOL><ITA><CLR red='125' green='105' blue='190'>" + label + "</CLR></ITA></BOL>" + "\nArea: " + area + "m" + "<SUP>" + "2" + "</SUP>"

Other format tags are <ACP> for all capitals, <SCP> for small capitals, <CHR spacing='200'> for character spacing or <CHR width='150'> for character width, <WRD spacing='200'> for word spacing and <LIN leading='25'> for line leading.

Style labels based on attributes. If the area is greater than 1,000,000,000 sq m the label will be styled with a colour as in the previous example; if not, it will remain black.

def FindLabel ([COUNTYNAME], [Shape_Area]):
 area = str(round(float([Shape_Area]),2))
 label = [COUNTYNAME]
 label = label.upper().replace(" ", "\n")
 if float([Shape_Area]) > 1000000000:
  return "<UND>" + "REGION\n" + "</UND>" + "<BOL><ITA><CLR red='125' green='105' blue='190'>" + label + "</CLR></ITA></BOL>" + "\nArea: " + area + "m" + "<SUP>" + "2" + "</SUP>"
 else:
  return "<UND>" + "REGION\n" + "</UND>" + "<BOL><ITA>" + label + "</ITA></BOL>" + "\nArea: " + area + "m" + "<SUP>" + "2" + "</SUP>"
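The two return statements differ only in the colour tag, so the shared parts can be built once. The sketch below is a plain Python function for experimenting outside ArcGIS, not a label expression: style_label and its threshold parameter are hypothetical, and the field values are passed in as ordinary arguments.

```python
def style_label(county, shape_area, threshold=1000000000):
    """Build the tagged label text; colour only regions above the threshold."""
    area = str(round(float(shape_area), 2))
    label = county.upper().replace(" ", "\n")
    if float(shape_area) > threshold:
        # wrap only the name in the colour tag
        name = "<CLR red='125' green='105' blue='190'>" + label + "</CLR>"
    else:
        name = label
    return ("<UND>REGION\n</UND><BOL><ITA>" + name + "</ITA></BOL>"
            + "\nArea: " + area + "m<SUP>2</SUP>")
```

Keeping the tags in one place makes it harder for the opening and closing tags to get out of step when the styling changes.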

This has just been a quick intro to using expressions and format tags for labelling. The information was found in the online ArcGIS Help that can be found here.

Geospatiality

make geospatial your speciality
