OSGP: Measuring Geographic Distributions – Standard Distance

(Open Source Geospatial Python)

The ‘What is it?’

The Standard Distance, also known as the Standard Distance Deviation, is the average distance by which features vary from the Mean Center, and it measures the compactness of a distribution. It is a single value, expressed in the distance units of the data, and is usually plotted on a map as a circle centered on the Mean Center with the Standard Distance as the radius, giving a visual indication of dispersion.

The Standard Distance works best in the absence of a strong directional trend. According to Andy Mitchell, if a directional trend is present you are better off using the Standard Deviational Ellipse.

You can use the Standard Distance to compare territories between species (which has the wider territory?) or to compare changes over time, such as the dispersion of burglaries for each calendar month.

In a Normal Distribution you would expect around 68% of all points to fall within the Standard Distance.

The Formula!

The mean x-coordinate is subtracted from the x-coordinate of each point and the difference is squared. The squared differences are summed and the total is divided by the number of points. The same is done for the y-coordinates. The two resulting values are added together, and the square root of that sum returns the value to the original distance units.

First we get the mean X and Y…

Mean Center Formula

…and then the Standard Distance

Standard Distance Formula
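
As a sketch, the two formulas described above can be written out as follows. The notation is an assumption rather than a copy of the original figures: n is the number of features, (x_i, y_i) are the coordinates of feature i, and (X̄, Ȳ) is the Mean Center.

```latex
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} y_i

SD = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n} + \frac{\sum_{i=1}^{n} (y_i - \bar{Y})^2}{n}}
```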

For Point features the X and Y coordinates of each feature are used; for Polygons the centroid of each feature provides the X and Y coordinates; and for Linear features the mid-point of each line is used.
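
Before bringing OGR into it, here is a minimal sketch of the calculation on a handful of made-up points, using only the standard library. The function name standard_distance and the sample coordinates are hypothetical and are not part of the script in The Code section.

```python
import math

def standard_distance(points):
    """Return (mean_x, mean_y, standard_distance) for a list of (x, y) tuples."""
    n = float(len(points))
    # mean center
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # sum of squared differences from the mean, per axis
    sum_sq_x = sum((x - mean_x) ** 2 for x, _ in points)
    sum_sq_y = sum((y - mean_y) ** 2 for _, y in points)
    # divide each by the number of points, add, and take the square root
    sd = math.sqrt(sum_sq_x / n + sum_sq_y / n)
    return mean_x, mean_y, sd

# hypothetical sample: the four corners of a 2 x 2 square
sample_points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mean_x, mean_y, sd = standard_distance(sample_points)
print("Mean Center: {0}, {1}".format(mean_x, mean_y))
print("Standard Distance: {0}".format(sd))
```

Every corner is the same distance from the center of the square, so the Standard Distance here is simply that distance, the square root of 2.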

The Code…

from osgeo import ogr
from shapely.geometry import MultiLineString
from shapely import wkt
import numpy as np
import sys, math

## set the driver for the data
driver = ogr.GetDriverByName("FileGDB")
## path to the FileGDB
gdb = r"C:\Users\Glen B\Documents\ArcGIS\Default.gdb"
## open the GDB in write mode (1)
ds = driver.Open(gdb, 1)

input_lyr_name = "Birmingham_Burglaries_2016"

output_fc = "{0}_standard_distance".format(input_lyr_name)

## reference the layer using the layers name
lyr_names = [ds.GetLayerByIndex(i).GetName() for i in range(ds.GetLayerCount())]
if input_lyr_name in lyr_names:
    lyr = ds.GetLayerByName(input_lyr_name)
    print "{0} found in {1}".format(input_lyr_name, gdb)
else:
    ## stop here, otherwise lyr is undefined further down
    print "{0} not found in {1}".format(input_lyr_name, gdb)
    sys.exit()

## if the output fc already exists delete it
if output_fc in lyr_names:
    ds.DeleteLayer(output_fc)
    print "Deleting: {0}".format(output_fc)

try:
    ## for points and polygons we use the centroid
    first_feat = lyr.GetFeature(1)
    if first_feat.geometry().GetGeometryName() in ["POINT", "MULTIPOINT", "POLYGON", "MULTIPOLYGON"]:
        xy_arr = np.ndarray((len(lyr), 2), dtype=float)
        for i, pt in enumerate(lyr):
            ft_geom = pt.geometry()
            xy_arr[i] = (ft_geom.Centroid().GetX(), ft_geom.Centroid().GetY())

    ## for lines we get the midpoint of a line
    elif first_feat.geometry().GetGeometryName() in ["LINESTRING", "MULTILINESTRING"]:
        xy_arr = np.ndarray((len(lyr), 2), dtype=float)
        for i, ln in enumerate(lyr):
            line_geom = ln.geometry().ExportToWkt()
            shapely_line = MultiLineString(wkt.loads(line_geom))
            midpoint = shapely_line.interpolate(shapely_line.length/2)
            xy_arr[i] = (midpoint.x, midpoint.y)

except Exception:
    print "Unknown geometry for {}".format(input_lyr_name)
    sys.exit()

avg_x, avg_y = np.mean(xy_arr, axis=0)

print "Mean Center: {0}, {1}".format(avg_x, avg_y)

sum_of_sq_diff_x = 0.0
sum_of_sq_diff_y = 0.0

for x, y in xy_arr:
    diff_x = math.pow(x - avg_x, 2)
    diff_y = math.pow(y - avg_y, 2)
    sum_of_sq_diff_x += diff_x
    sum_of_sq_diff_y += diff_y

sum_of_results = (sum_of_sq_diff_x/lyr.GetFeatureCount()) + (sum_of_sq_diff_y/lyr.GetFeatureCount())
standard_distance = math.sqrt(sum_of_results)
print "Standard Distance: {0}".format(standard_distance)

## create a point with the mean center
## and buffer by the standard distance
pnt = ogr.Geometry(ogr.wkbPoint)
pnt.AddPoint(avg_x, avg_y)
polygon = pnt.Buffer(standard_distance, 90)

## create a new polygon layer with the same spatial ref as lyr
out_lyr = ds.CreateLayer(output_fc, lyr.GetSpatialRef(), ogr.wkbPolygon)

## define and create new fields
x_fld = ogr.FieldDefn("X", ogr.OFTReal)
y_fld = ogr.FieldDefn("Y", ogr.OFTReal)
stnd_dst = ogr.FieldDefn("Standard_Distance", ogr.OFTReal)
out_lyr.CreateField(x_fld)
out_lyr.CreateField(y_fld)
out_lyr.CreateField(stnd_dst)

## add the standard distance buffer to the layer
feat_dfn = out_lyr.GetLayerDefn()
feat = ogr.Feature(feat_dfn)
feat.SetGeometry(polygon)
feat.SetField("X", avg_x)
feat.SetField("Y", avg_y)
feat.SetField("Standard_Distance", standard_distance)
out_lyr.CreateFeature(feat)

print "Created {0}".format(output_fc)

## free up resources
del feat, ds, lyr, out_lyr

I’d like to give credit to Logan Byers from GIS StackExchange who aided in speeding up the computational time using NumPy and for forcing me to begin learning the wonders of NumPy (getting there bit by bit).

The Example:

I downloaded crime data from DATA.POLICE.UK for the West Midlands Police from January 2016 to December 2016. I used some Python to extract just the Burglary data and made this into a feature class in the File GDB. Next, I downloaded OS Boundary Line data and clipped the Burglary data to just Birmingham. Everything was now in place to find the Standard Distance of all burglaries for Birmingham in 2016. (see The Other Scripts section at the bottom of this post for processing the data)

Birmingham Burglaries 2016

Running the script from The Code section above calculates the Standard Distance for burglaries in Birmingham for 2016 and creates a polygon feature class in the File GDB.

Standard Distance Circle

OSGP Mean Center:     407926.695396, 286615.428507
ArcGIS Mean Center:    407926.695396, 286615.428507

OSGP Standard Distance:      6416.076596
ArcGIS Standard Distance:    6416.076596

Also See…

Mean Center
Central Feature
Median Center
Initial Data Assessment

The Resources:

ESRI Guide to GIS Volume 2: Chapter 2 (I highly recommend this book)
see book review here.

Geoprocessing with Python

Python GDAL/OGR Cookbook

Setting up GDAL/OGR with FileGDB Driver for Python on Windows

< The Other Scripts >

1. Extract Burglary Data for West Midlands

import csv, os
from osgeo import ogr, osr

## set the driver for the data
driver = ogr.GetDriverByName("FileGDB")

## path to the FileGDB
gdb = r"C:\Users\Glen B\Documents\my_geodata.gdb"

## open the GDB in write mode (1)
ds = driver.Open(gdb, 1)

## the coordinates in the csv files are lat/long
source = osr.SpatialReference()
source.ImportFromEPSG(4326)

## we need the data in British National Grid
target = osr.SpatialReference()
target.ImportFromEPSG(27700)

transform = osr.CoordinateTransformation(source, target)

## set the output fc name
output_fc = "WM_Burglaries_2016"

## if the output fc already exists delete it
if output_fc in [ds.GetLayerByIndex(lyr_name).GetName() for lyr_name in range(ds.GetLayerCount())]:
    ds.DeleteLayer(output_fc)
    print "Deleting: {0}".format(output_fc)

out_lyr = ds.CreateLayer(output_fc, target, ogr.wkbPoint)

## define and create new fields
mnth_fld = ogr.FieldDefn("Month", ogr.OFTString)
rep_by_fld = ogr.FieldDefn("Reported_by", ogr.OFTString)
fls_wthn_fld = ogr.FieldDefn("Falls_within", ogr.OFTString)
loc_fld = ogr.FieldDefn("Location", ogr.OFTString)
lsoa_c_fld = ogr.FieldDefn("LSOA_code", ogr.OFTString)
lsoa_n_fld = ogr.FieldDefn("LSOA_name", ogr.OFTString)
crime_fld = ogr.FieldDefn("Crime_type", ogr.OFTString)
outcome_fld = ogr.FieldDefn("Last_outcome", ogr.OFTString)

out_lyr.CreateField(mnth_fld)
out_lyr.CreateField(rep_by_fld)
out_lyr.CreateField(fls_wthn_fld)
out_lyr.CreateField(loc_fld)
out_lyr.CreateField(lsoa_c_fld)
out_lyr.CreateField(lsoa_n_fld)
out_lyr.CreateField(crime_fld)
out_lyr.CreateField(outcome_fld)

## where the downloaded csv files reside
root_folder = r"C:\Users\Glen B\Documents\Crime"

## for each csv
for root,dirs,files in os.walk(root_folder):
    for filename in files:
        if filename.endswith(".csv"):
            csv_path = os.path.join(root, filename)
            with open(csv_path, "rb") as csvfile:
                reader = csv.reader(csvfile, delimiter=",")
                next(reader,None)
                ## create a point with attributes for each burglary
                for row in reader:
                    if row[9] == "Burglary":
                        pnt = ogr.Geometry(ogr.wkbPoint)
                        pnt.AddPoint(float(row[4]), float(row[5]))
                        pnt.Transform(transform)
                        feat_dfn = out_lyr.GetLayerDefn()
                        feat = ogr.Feature(feat_dfn)
                        feat.SetGeometry(pnt)
                        feat.SetField("Month", row[1])
                        feat.SetField("Reported_by", row[2])
                        feat.SetField("Falls_within", row[3])
                        feat.SetField("Location", row[6])
                        feat.SetField("LSOA_code", row[7])
                        feat.SetField("LSOA_name", row[8])
                        feat.SetField("Crime_type", row[9])
                        feat.SetField("Last_outcome", row[10])
                        out_lyr.CreateFeature(feat)

del ds, feat, out_lyr

2. Birmingham Burglaries Only

from osgeo import ogr

## required drivers
shp_driver = ogr.GetDriverByName("ESRI Shapefile")
gdb_driver = ogr.GetDriverByName("FileGDB")

## input boundary shapefile and file gdb
shapefile = r"C:\Users\Glen B\Documents\Crime\Data\GB\district_borough_unitary_region.shp"
gdb = r"C:\Users\Glen B\Documents\my_geodata.gdb"

## open the shapefile in read mode and gdb in write mode
shp_ds = shp_driver.Open(shapefile, 0)
gdb_ds = gdb_driver.Open(gdb, 1)

## reference the necessary layers
shp_layer = shp_ds.GetLayer(0)
gdb_layer = gdb_ds.GetLayerByName("WM_Burglaries_2016")

## filter the shapefile
shp_layer.SetAttributeFilter("NAME = 'Birmingham District (B)'")

## set the name for the output feature class
output_fc = "Birmingham_Burglaries_2016"

## if the output already exists then delete it
if output_fc in [gdb_ds.GetLayerByIndex(lyr_name).GetName() for lyr_name in range(gdb_ds.GetLayerCount())]:
    gdb_ds.DeleteLayer(output_fc)
    print "Deleting: {0}".format(output_fc)

## create an output layer
out_lyr = gdb_ds.CreateLayer(output_fc, shp_layer.GetSpatialRef(), ogr.wkbPoint)

## copy the schema from the West Midlands burglaries
## and use it for the Birmingham burglaries
lyr_def = gdb_layer.GetLayerDefn()
for i in range(lyr_def.GetFieldCount()):
    out_lyr.CreateField(lyr_def.GetFieldDefn(i))

## only get burglaries that intersect the Birmingham region
for shp_feat in shp_layer:
    print shp_feat.GetField("NAME")
    birm_geom = shp_feat.GetGeometryRef()
    for gdb_feat in gdb_layer:
        burglary_geom = gdb_feat.GetGeometryRef()
        if burglary_geom.Intersects(birm_geom):
            feat_dfn = out_lyr.GetLayerDefn()
            feat = ogr.Feature(feat_dfn)
            feat.SetGeometry(burglary_geom)

            ## populate the attribute table
            for i in range(lyr_def.GetFieldCount()):
                feat.SetField(lyr_def.GetFieldDefn(i).GetNameRef(), gdb_feat.GetField(i))
            ## create the feature
            out_lyr.CreateFeature(feat)
            feat.Destroy()

del shp_ds, shp_layer, gdb_ds, gdb_layer

The Usual 🙂

As always please feel free to comment to help make the code more efficient, highlight errors, or let me know if this was of any use to you.

OSGP: Measuring Geographic Distributions – Weighted Mean Center

(Open Source Geospatial Python)

The ‘What is it?’

See Mean Center.

The unweighted center is mainly used for events that occur at a place and time, such as burglaries. The weighted center, however, is predominantly used for stationary features such as retail outlets, or for delineated areas such as Census tracts. The Weighted Mean Center does not take into account the distance between features in the dataset.

The weight needs to be a numerical attribute. The greater the value, the higher the weight for that feature.

The Formula!

The Weighted Mean Center is calculated by multiplying the x and y coordinates of each feature by that feature’s weight, summing the results for x and for y separately, and then dividing each sum by the sum of all the weights.

Weighted Mean Center Formula

For Point features the X and Y coordinates of each feature are used; for Polygons the centroid of each feature provides the X and Y coordinates; and for Linear features the mid-point of each line is used.
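
To make the arithmetic concrete, here is a minimal sketch using plain Python on made-up values. The function name weighted_mean_center, the sample points and the sample weights are all hypothetical. The script in The Code section does the same thing with NumPy arrays filled from the Shapefile.

```python
def weighted_mean_center(points, weights):
    """Return the weighted mean (x, y) of a list of (x, y) tuples."""
    total_weight = float(sum(weights))
    # multiply each coordinate by its weight and sum x and y separately
    sum_x = sum(x * w for (x, _), w in zip(points, weights))
    sum_y = sum(y * w for (_, y), w in zip(points, weights))
    # divide each sum by the sum of all the weights
    return sum_x / total_weight, sum_y / total_weight

# hypothetical sample: the second point carries three times the weight
sample_points = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
sample_weights = [1, 3, 1]
wx, wy = weighted_mean_center(sample_points, sample_weights)
print("Weighted Mean Center: {0}, {1}".format(wx, wy))
```

The unweighted Mean Center of these three points would be roughly (3.33, 3.33). The heavier second point pulls the weighted center to (6.0, 2.0).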

The Code…

from osgeo import ogr
from shapely.geometry import MultiLineString
from shapely import wkt
import numpy as np
import sys

## set the driver for the data
driver = ogr.GetDriverByName("ESRI Shapefile")
## folder where the shapefile resides
folder = r"C:\Users\glen.bambrick\Documents\GDAL\shp\\"
## name of the shapefile concatenated with folder
shp = "{0}Census2011_Small_Areas_generalised20m.shp".format(folder)
## open the shapefile
ds = driver.Open(shp, 0)
## reference the only layer in the shapefile
lyr = ds.GetLayer(0)

## create an output data source
out_ds = driver.CreateDataSource("{0}{1}_wgt_mean_center.shp".format(folder,lyr.GetName()))

## output mean center weighted filename
output_fc = "{0}{1}_wgt_mean_center".format(folder,lyr.GetName())

## field that has numerical weight
weight_fld = "TOTAL2011"

try:
    first_feat = lyr.GetFeature(1)
    xy_arr = np.ndarray((len(lyr), 2), dtype=float)
    wgt_arr = np.ndarray((len(lyr), 1), dtype=float)
    ## use the centroid for points and polygons
    if first_feat.geometry().GetGeometryName() in ["POINT", "MULTIPOINT", "POLYGON", "MULTIPOLYGON"]:
        for i, pt in enumerate(lyr):
            ft_geom = pt.geometry()
            weight = pt.GetField(weight_fld)
            xy_arr[i] = (ft_geom.Centroid().GetX() * weight, ft_geom.Centroid().GetY() * weight)
            wgt_arr[i] = weight
    ## midpoint of lines
    elif first_feat.geometry().GetGeometryName() in ["LINESTRING", "MULTILINESTRING"]:
        for i, ln in enumerate(lyr):
            line_geom = ln.geometry().ExportToWkt()
            weight = ln.GetField(weight_fld)
            shapely_line = MultiLineString(wkt.loads(line_geom))
            midpoint = shapely_line.interpolate(shapely_line.length/2)
            xy_arr[i] = (midpoint.x * weight, midpoint.y * weight)
            wgt_arr[i] = weight

except Exception:
    print "Unknown geometry or Incorrect field name for {}".format(lyr.GetName())
    sys.exit()

## do the maths
sum_x, sum_y = np.sum(xy_arr, axis=0)
sum_wgt = np.sum(wgt_arr)
weighted_x, weighted_y = sum_x/sum_wgt, sum_y/sum_wgt

print "Weighted Mean Center: {0}, {1}".format(weighted_x, weighted_y)

## create a new point layer with the same spatial ref as lyr
out_lyr = out_ds.CreateLayer(output_fc, lyr.GetSpatialRef(), ogr.wkbPoint)

## define and create new fields
x_fld = ogr.FieldDefn("X", ogr.OFTReal)
y_fld = ogr.FieldDefn("Y", ogr.OFTReal)
out_lyr.CreateField(x_fld)
out_lyr.CreateField(y_fld)

## create a new point for the mean center weighted
pnt = ogr.Geometry(ogr.wkbPoint)
pnt.AddPoint(weighted_x, weighted_y)

## add the mean center weighted to the new layer
feat_dfn = out_lyr.GetLayerDefn()
feat = ogr.Feature(feat_dfn)
feat.SetGeometry(pnt)
feat.SetField("X", weighted_x)
feat.SetField("Y", weighted_y)
out_lyr.CreateFeature(feat)

print "Created: {0}.shp".format(output_fc)

## free up resources
del ds, out_ds, lyr, first_feat, feat, out_lyr

I’d like to give credit to Logan Byers from GIS StackExchange who aided in speeding up the computational time using NumPy and for forcing me to begin learning the wonders of NumPy (which is still a work in progress).

The Example:

I downloaded the Small Areas of Ireland from the CSO. You will have to acknowledge a disclaimer. The data contains population information for the 2011 Census. Once downloaded, unzip Census2011_Small_Areas_generalised20m.zip to your working folder.

Small Areas

Running the script from The Code section above calculates the Weighted Mean Center of all Small Areas based on the population count for each for 2011 and creates a point Shapefile as the output.

Small Areas Weighted Mean Center

OSGP Weighted Mean Center:      238557.427484, 208347.116116
ArcGIS Weighted Mean Center:    238557.427484, 208347.116116

Also See…

Mean Center
Central Feature
Median Center
Initial Data Assessment

The Resources:

ESRI Guide to GIS Volume 2: Chapter 2 (I highly recommend this book)
see book review here.

Geoprocessing with Python

Python GDAL/OGR Cookbook

The Usual 🙂

As always please feel free to comment to help make the code more efficient, highlight errors, or let me know if this was of any use to you.

OSGP: Standard GIS Tools – Initial Data Assessment

(Open Source Geospatial Python)

Here we will look at the general makeup of a downloaded spatial dataset – a Shapefile from the Central Statistics Office in Ireland containing census data from 2011. We will look at getting the spatial reference of the file along with a breakdown of the field names, type, width and precision. We can print the top ten records or the entire attribute table and get a list of unique values for a field and the count of each.

Download the Small Areas of Ireland from the CSO. You will have to acknowledge a disclaimer. Once downloaded, unzip Census2011_Small_Areas_generalised20m.zip to your working folder. We will now begin to interrogate this Shapefile.

Small Areas

First we import the necessary modules…

# import modules
from osgeo import ogr
from tabulate import tabulate
from operator import itemgetter

tabulate will allow us to print out formatted tables. Using ogr we can access the inner workings of the downloaded Shapefile. Please note that osgeo and tabulate are not standard Python libraries and will need to be installed.

Using the ESRI Shapefile driver we open the Shapefile in read mode (0) and access the data (lyr).

# use Shapefile driver
driver = ogr.GetDriverByName("ESRI Shapefile")
# reference Shapefile
shp = r"C:\Users\Glen B\Documents\GDAL\shp\Census2011_Small_Areas_generalised20m.shp"
# open the file
ds = driver.Open(shp, 0)
# reference the only layer in a Shapefile
lyr = ds.GetLayer(0)

Spatial Reference Information

Straight away we can print the spatial reference information associated with the Shapefile (contained in the .prj file).

print lyr.GetSpatialRef()

This will print out…

PROJCS["TM65_Irish_Grid",
    GEOGCS["GCS_TM65",
        DATUM["TM65",
            SPHEROID["Airy_Modified",6377340.189,299.3249646]],
        PRIMEM["Greenwich",0.0],
        UNIT["Degree",0.0174532925199433]],
    PROJECTION["Transverse_Mercator"],
    PARAMETER["False_Easting",200000.0],
    PARAMETER["False_Northing",250000.0],
    PARAMETER["Central_Meridian",-8.0],
    PARAMETER["Scale_Factor",1.000035],
    PARAMETER["Latitude_Of_Origin",53.5],
    UNIT["Meter",1.0]]

You can also access this information individually…

# projected coordinate system
proj_string = lyr.GetSpatialRef().GetAttrValue("PROJCS", 0)
# geographic coordinate system
geog_string = lyr.GetSpatialRef().GetAttrValue("GEOGCS", 0)
# EPSG Code if available
epsg = lyr.GetSpatialRef().GetAttrValue("AUTHORITY", 1)
# datum
datum = lyr.GetSpatialRef().GetAttrValue("DATUM", 0)

print "\nFile: {0}\n\nProjected: {1}\nEPSG: {2}\n".format(lyr.GetName(),proj_string, epsg)
print "Geographic: {0}\nDatum: {1}\n".format(geog_string, datum)

The output…

File: Census2011_Small_Areas_generalised20m

Projected: TM65_Irish_Grid
EPSG: None

Geographic: GCS_TM65
Datum: TM65

If there is an EPSG code in the .prj file it will be printed instead of None.

Geometry Type

If we reference the first feature we can get the geometry type of the Shapefile…

# Shapefile feature IDs start at 0
first_feat = lyr.GetFeature(0)
print "Geometry Type: {0}\n".format(first_feat.geometry().GetGeometryName())

In this instance it is a polygon Shapefile.

Geometry Type: POLYGON

Field Information

Let’s get some information on the data through the Layer Definition.

# https://pcjericks.github.io/py-gdalogr-cookbook/vector_layers.html
lyr_def = lyr.GetLayerDefn()

But before we do we need to create a few list structures. These will be used to hold the accessed information and enable us to neatly print them to screen.

# list to hold headers for field information
header_list = ["FIELD NAME", "TYPE", "WIDTH", "PRECISION"]
# list will be populated with field information
output_list = []
# list will be populated with field names and used for attribute headers
fld_names = []

Cycle through each field and populate the necessary lists…

# for each field
for i in range(lyr_def.GetFieldCount()):
    # reference the field name
    fld_name = lyr_def.GetFieldDefn(i).GetName()
    # reference the field type
    fld_type = lyr_def.GetFieldDefn(i).GetFieldTypeName(lyr_def.GetFieldDefn(i).GetType())
    # reference the field width
    fld_width = lyr_def.GetFieldDefn(i).GetWidth()
    # reference the field precision
    fld_precision = lyr_def.GetFieldDefn(i).GetPrecision()
    # append these as a list to the output_list
    output_list.append([fld_name, fld_type, str(fld_width), str(fld_precision)])
    # append field name to fld_names
    fld_names.append(fld_name)

The output_list is a list of lists holding the name, data type, width and precision of each field, in the same order as the headers in header_list. The fld_names list holds the field names and will be used further down as the headers when printing attributes. Let’s print the field information…

print "{0}\n".format(tabulate(output_list, header_list))

Here’s the output…

FIELD NAME   TYPE     WIDTH   PRECISION
------------ ------ ------- -----------
NUTS1        String       3           0
NUTS1NAME    String       7           0
NUTS2        String       4           0
NUTS2NAME    String      26           0
NUTS3        String       5           0
NUTS3NAME    String      15           0
COUNTY       String       2           0
COUNTYNAME   String      25           0
CSOED        String      11           0
OSIED        String      13           0
EDNAME       String      45           0
SMALL_AREA   String      61           0
GEOGID       String      65           0
MALE2011     Real        20          10
FEMALE2011   Real        20          10
TOTAL2011    Real        20          10
PPOCC2011    Real        20          10
UNOCC2011    Real        20          10
VACANT2011   Real        20          10
HS2011       Real        20          10
PCVAC2011    Real        20          10
CREATEDATE   String      10           0

Attribute Table

Next we print out some attributes for a set of features, the first ten.

# number of features from the first to print attributes for
num_to_return = 10
#num_to_return = lyr.GetFeatureCount()

Use the commented out line if you want to print attributes for all features. Create an empty list to hold the attributes. Some fields contain characters from the Irish language so we account for this so that the attributes are printed correctly.

# list will be populated with attribute data
att_table = []

# for each feature in the Shapefile
for count, feature in enumerate(lyr):
    # up to the number of set features to print
    if count < num_to_return:
        # count will become the Feature ID
        atts = [count]
        # for each field append the data to atts list
        for name in fld_names:
            try:
                # if the attribute is a string then decode with Celtic Languages
                atts.append(feature.GetField(name).decode("iso8859_14"))
            except Exception:
                atts.append(feature.GetField(name))
        # append the data for the feature to the att_table list
        att_table.append(atts)
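
To see what the decode step above is doing, here is a small standalone sketch written for Python 3. The helper name to_text and the sample byte string are hypothetical. It assumes the text attributes arrive as raw bytes encoded in ISO 8859-14, which is how the loop above treats them.

```python
def to_text(value, encoding="iso8859_14"):
    """Decode byte strings with the Celtic languages codec, pass other values through."""
    try:
        return value.decode(encoding)
    except AttributeError:
        # numbers (and already decoded text in Python 3) have no decode method
        return value

# hypothetical raw attribute value: 0xFA is the u-acute character in ISO 8859-14
raw_value = b"D\xfan Laoghaire-Rathdown"
decoded = to_text(raw_value)
population = to_text(275.0)

print(decoded)
print(population)
```

Catching AttributeError rather than a blanket Exception is a design choice for this sketch: it handles the non-text fields without hiding other problems.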

The count becomes the Feature ID but we have no field for this so we will create one…

# add a FID header (count)
fld_names.insert(0, "FID")

So let’s print out the attributes…

print tabulate(att_table, fld_names)
print "{0} out of {1} features".format(num_to_return, lyr.GetFeatureCount())

Here’s the output…

  FID NUTS1   NUTS1NAME   NUTS2   NUTS2NAME            NUTS3   NUTS3NAME         COUNTY COUNTYNAME       CSOED   OSIED EDNAME                           SMALL_AREA GEOGID       MALE2011   FEMALE2011   TOTAL2011   PPOCC2011   UNOCC2011   VACANT2011   HS2011   PCVAC2011 CREATEDATE
----- ------- ----------- ------- -------------------- ------- --------------- -------- -------------- ------- ------- ------------------------------ ------------ ---------- ---------- ------------ ----------- ----------- ----------- ------------ -------- ----------- ------------
    0 IE0     Ireland     IE02    Southern and Eastern IE022   Mid-East              15 Wicklow County   15039  257005 Aughrim                           257005002 A257005002        137          138         275          84          18           15      102        14.7 27-03-2012
    1 IE0     Ireland     IE02    Southern and Eastern IE024   South-East (IE)       01 Carlow County    01054  017049 Tinnahinch                        017049001 A017049001        186          176         362         111          25           24      136        17.6 27-03-2012
    2 IE0     Ireland     IE02    Southern and Eastern IE024   South-East (IE)       01 Carlow County    01053  017032 Marley                            017032001 A017032001        194          173         367         121           8            5      129         3.9 27-03-2012
    3 IE0     Ireland     IE02    Southern and Eastern IE024   South-East (IE)       01 Carlow County    01054  017049 Tinnahinch                        017049002 A017049002         75           75         150          67          29           29       96        30.2 27-03-2012
    4 IE0     Ireland     IE02    Southern and Eastern IE024   South-East (IE)       01 Carlow County    01054  017049 Tinnahinch                        017049003 A017049003         84           81         165          64          16           14       80        17.5 27-03-2012
    5 IE0     Ireland     IE02    Southern and Eastern IE024   South-East (IE)       01 Carlow County    01015  017005 Ballyellin                        017005002 A017005002        105           99         204          71           6            5       77         6.5 27-03-2012
    6 IE0     Ireland     IE02    Southern and Eastern IE024   South-East (IE)       01 Carlow County    01015  017005 Ballyellin                        017005001 A017005001        115          108         223          70           9            8       79        10.1 27-03-2012
    7 IE0     Ireland     IE02    Southern and Eastern IE024   South-East (IE)       01 Carlow County    01033  017033 Muinebeag (Bagenalstown) Rural    017033001 A017033001        201          205         406         143          15           14      158         8.9 27-03-2012
    8 IE0     Ireland     IE02    Southern and Eastern IE024   South-East (IE)       01 Carlow County    01034  017034 Muinebeag (Bagenalstown) Urban    017034002 A017034002        142          116         258          89           9            9       98         9.2 27-03-2012
    9 IE0     Ireland     IE02    Southern and Eastern IE024   South-East (IE)       01 Carlow County    01034  017034 Muinebeag (Bagenalstown) Urban    017034003 A017034003        174          169         343         107           6            4      113         3.5 27-03-2012
10 out of 18488 features

Unique Values and Counts

Next we’ll get a list of the unique COUNTYNAME entries and a count to see how many small areas are in each. (The below works for text fields only)

# reset to first feature
lyr.ResetReading()

# field to return unique list and count of
field = "COUNTYNAME"

# create empty dictionary
values_dict = {}

# for each feature
for feature in lyr:
    attribute = feature.GetField(field).decode("iso8859_14")
    # if the COUNTYNAME is not already in the dictionary add it and assign a value of 1
    if attribute not in values_dict:
        values_dict[attribute] = 1
    # otherwise do not add it and increase the existing value by 1
    else:
        values_dict[attribute] = values_dict[attribute] + 1

## convert dictionary to list for use with tabulate
key_value_list = [[key, value] for key, value in values_dict.items()]

## print results
print "\nTotal Feature Count: {0}\n".format(lyr.GetFeatureCount())
print tabulate(sorted(key_value_list), [field, "Count"])

And here’s the output…

Total Feature Count: 18488

COUNTYNAME               Count
---------------------- -------
Carlow County              210
Cavan County               322
Clare County               511
Cork City                  519
Cork County               1650
Donegal County             761
Dublin City               2202
Dún Laoghaire-Rathdown     760
Fingal                     938
Galway City                307
Galway County              741
Kerry County               701
Kildare County             731
Kilkenny County            372
Laois County               305
Leitrim County             173
Limerick City              258
Limerick County            514
Longford County            179
Louth County               462
Mayo County                643
Meath County               636
Monaghan County            244
North Tipperary            283
Offaly County              286
Roscommon County           303
Sligo County               307
South Dublin               906
South Tipperary            350
Waterford City             198
Waterford County           275
Westmeath County           338
Wexford County             615
Wicklow County             488

Alternatively we could print out based on the highest count descending by replacing the last print statement with…

# http://stackoverflow.com/questions/17555218/python-how-to-sort-a-list-of-lists-by-the-fourth-element-in-each-list
print tabulate(sorted(key_value_list, key = itemgetter(1), reverse = True), [field, "Count"])

…to get…

COUNTYNAME               Count
---------------------- -------
Dublin City               2202
Cork County               1650
Fingal                     938
South Dublin               906
Donegal County             761
...
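
As an aside, the counting and both sort orders above can also be sketched with collections.Counter from the standard library. This is an alternative illustration, not the code used to produce the tables above, and the sample values are made up.

```python
from collections import Counter

# hypothetical attribute values, standing in for the COUNTYNAME of each feature
sample_values = ["Fingal", "Cork City", "Fingal", "Dublin City", "Fingal", "Cork City"]

# Counter does the "add it or increase it by 1" bookkeeping for us
counts = Counter(sample_values)

# alphabetical order, like sorted(key_value_list)
alphabetical = sorted(counts.items())

# highest count first, like sorting on itemgetter(1) with reverse=True
by_count = counts.most_common()

print(alphabetical)
print(by_count)
```

Both results are lists of (value, count) pairs, so either one should be usable as the table data passed to tabulate.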

I will add to these as I come across something useful. If you know of any neat things to add please comment below. Please also comment if anything is unclear or if this was useful to you.

See Also…

Setting up GDAL/OGR with FileGDB Driver for Python on Windows
Measuring Geographic Distributions #1.1 – Mean Center
Measuring Geographic Distributions #2.1 – Central Feature
Measuring Geographic Distributions #3.1 – Median Center

OSGP: Measuring Geographic Distributions – Median Center

(Open Source Geospatial Python)

The ‘What is it?’

Also known as the Center of Minimum Distance, the Median Center is a location that is the shortest total distance to all features in the study area (not to be confused with the Central Feature, which is the feature that is the shortest distance to all others). It can be used to find a suitable location for something that needs to be centrally located. The Median Center will gravitate towards an area with the most features.

The Median Center is good for finding the most accessible location.

The Formula!

There is no single formula or equation for calculating an exact Median Center. According to Andy Mitchell it is an iterative process: calculate the Mean Center, sum the distances from it to each feature, offset the center slightly and sum the distances again, repeating until it eventually zones in on the optimum location that has the lowest sum.

The code below implements either the Yehuda Vardi and Cun-Hui Zhang algorithm or the Weiszfeld algorithm.
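
To give a feel for the iterative process, here is a minimal sketch of a Weiszfeld-style iteration in plain Python on made-up points. The function names, the tolerance, the iteration cap and the sample coordinates are all assumptions for illustration. It clamps very small distances instead of handling a candidate that lands exactly on a data point properly, which is the case the Vardi and Zhang variant deals with.

```python
import math

def total_distance(points, center):
    """Sum of straight-line distances from center to every point."""
    return sum(math.hypot(x - center[0], y - center[1]) for x, y in points)

def weiszfeld_median(points, tol=1e-7, max_iter=1000):
    """Approximate the Median Center of a list of (x, y) tuples."""
    n = float(len(points))
    # start from the Mean Center
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    for _ in range(max_iter):
        num_x = num_y = denom = 0.0
        for x, y in points:
            # clamp tiny distances so we never divide by zero (a simplification)
            dist = max(math.hypot(x - cx, y - cy), 1e-12)
            num_x += x / dist
            num_y += y / dist
            denom += 1.0 / dist
        new_cx, new_cy = num_x / denom, num_y / denom
        moved = math.hypot(new_cx - cx, new_cy - cy)
        cx, cy = new_cx, new_cy
        # stop once the candidate has effectively stopped moving
        if moved < tol:
            break
    return cx, cy

# hypothetical sample: a square of points plus one outlier to the east
sample_points = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0), (10.0, 2.0)]
mean_center = (sum(x for x, _ in sample_points) / 5.0,
               sum(y for _, y in sample_points) / 5.0)
median_center = weiszfeld_median(sample_points)

print("Mean Center: {0}".format(mean_center))
print("Median Center: {0}".format(median_center))
print("Total distance from mean: {0}".format(total_distance(sample_points, mean_center)))
print("Total distance from median: {0}".format(total_distance(sample_points, median_center)))
```

The total distance from the Median Center should come out no larger than the total distance from the Mean Center, which is the whole point of the exercise.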

The Code…

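import math, sys
import numpy as np
from osgeo import ogr
from scipy.spatial.distance import cdist
## needed for the line midpoint branch further down
from shapely.geometry import MultiLineString
from shapely import wkt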

## "W" for Weiszfeld
## "YC" for Yehuda Vardi and Cun-Hui Zhang
algorithm = "YC"

## Weiszfeld
## https://gist.github.com/endolith/2837160
def numersum(test_median,dataPoint):
    ## Provides the inverse-distance weight for a single data point, used in
    ## the numerator of the Weiszfeld algorithm for both the candidate x and y
    return 1/math.sqrt((test_median[0]-dataPoint[0])**2 + (test_median[1]-dataPoint[1])**2)

def denomsum(test_median, xy_arr):
    ## Provides the denominator of the weiszfeld algorithm
    temp = 0.0
    for i in range(0,len(xy_arr)):
        temp += 1/math.sqrt((test_median[0] - xy_arr[i][0])**2 + (test_median[1] - xy_arr[i][1])**2)
    return temp

## Yehuda Vardi and Cun-Hui Zhang
## http://stackoverflow.com/questions/30299267/geometric-median-of-multidimensional-points
## user: orlp
def geometric_median(X, eps=1e-5):
    y = np.mean(X, 0)

    while True:
        D = cdist(X, [y])
        nonzeros = (D != 0)[:, 0]
        Dinv = 1 / D[nonzeros]
        Dinvs = np.sum(Dinv)
        W = Dinv / Dinvs
        T = np.sum(W * X[nonzeros], 0)
        num_zeros = len(X) - np.sum(nonzeros)
        if num_zeros == 0:
            y1 = T
        elif num_zeros == len(X):
            return y
        else:
            R = (T - y) * Dinvs
            r = np.linalg.norm(R)
            rinv = 0 if r == 0 else num_zeros/r
            y1 = max(0, 1-rinv)*T + min(1, rinv)*y
        if np.linalg.norm(y - y1) < eps:
            return y1
        y = y1

## set the driver for the data
driver = ogr.GetDriverByName("FileGDB")

## path to the FileGDB
gdb = r"C:\Users\Glen B\Documents\my_geodata.gdb"

## open the GDB in write mode (1)
ds = driver.Open(gdb, 1)

## input feature class
input_lyr_name = "Birmingham_Secondary_Schools"

## name of output feature class
output_fc = "{0}_median_center".format(input_lyr_name)

## reference the layer using the layers name
if input_lyr_name in [ds.GetLayerByIndex(lyr_name).GetName() for lyr_name in range(ds.GetLayerCount())]:
    lyr = ds.GetLayerByName(input_lyr_name)
    print "{0} found in {1}".format(input_lyr_name, gdb)

## if the output layer already exists then delete it
if output_fc in [ds.GetLayerByIndex(lyr_name).GetName() for lyr_name in range(ds.GetLayerCount())]:
    ds.DeleteLayer(output_fc)
    print "Deleting: {0}".format(output_fc)

## create an array with coordinates of each feature
try:
    first_feat = lyr.GetFeature(1)
    ## centroid for points and polygons
    if first_feat.geometry().GetGeometryName() in ["POINT", "MULTIPOINT", "POLYGON", "MULTIPOLYGON"]:
        xy_arr = np.ndarray((len(lyr), 2), dtype=np.float)
        for i, pt in enumerate(lyr):
            ft_geom = pt.geometry()
            xy_arr[i] = (ft_geom.Centroid().GetX(), ft_geom.Centroid().GetY())

    ## for lines we get the midpoint of a line
    elif first_feat.geometry().GetGeometryName() in ["LINESTRING", "MULTILINESTRING"]:
        xy_arr = np.ndarray((len(lyr), 2), dtype=np.float)
        for i, ln in enumerate(lyr):
            line_geom = ln.geometry().ExportToWkt()
            shapely_line = MultiLineString(wkt.loads(line_geom))
            midpoint = shapely_line.interpolate(shapely_line.length/2)
            xy_arr[i] = (midpoint.x, midpoint.y)

except Exception:
    print "Unknown geometry for {}".format(input_lyr_name)
    sys.exit()

## if using Weiszfeld
if algorithm == "W":
    ## https://gist.github.com/endolith/2837160
    avg_x, avg_y = np.mean(xy_arr, axis=0)
    test_median = [avg_x, avg_y]
    numIter = 50

    ## minimise the objective function
    for x in range(0,numIter):
        denom = denomsum(test_median,xy_arr)
        nextx = 0.0
        nexty = 0.0

        for y in range(0,len(xy_arr)):
            nextx += (xy_arr[y][0] * numersum(test_median,xy_arr[y]))/denom
            nexty += (xy_arr[y][1] * numersum(test_median,xy_arr[y]))/denom

        test_median = [nextx,nexty]

## if using Yehuda Vardi and Cun-Hui Zhang
elif algorithm == "YC":
    test_median = geometric_median(xy_arr)

print "Median Center: {0}, {1}".format(test_median[0], test_median[1])

## create a new point layer with the same spatial ref as lyr
out_lyr = ds.CreateLayer(output_fc, lyr.GetSpatialRef(), ogr.wkbPoint)

## define and create new fields
x_fld = ogr.FieldDefn("X", ogr.OFTReal)
y_fld = ogr.FieldDefn("Y", ogr.OFTReal)
out_lyr.CreateField(x_fld)
out_lyr.CreateField(y_fld)

## create a new point for the median center
pnt = ogr.Geometry(ogr.wkbPoint)
pnt.AddPoint(test_median[0], test_median[1])

## add the median center to the new layer
feat_dfn = out_lyr.GetLayerDefn()
feat = ogr.Feature(feat_dfn)
feat.SetGeometry(pnt)
feat.SetField("X", test_median[0])
feat.SetField("Y", test_median[1])
out_lyr.CreateFeature(feat)

print "Created {0}".format(output_fc)

## free up resources
del ds, lyr, first_feat, feat, out_lyr

I’d like to give credit to…
Logan Byers from GIS StackExchange who aided in speeding up the computational time using NumPy and for forcing me to begin learning the wonders of NumPy.
orlp from Stack Overflow for their implementation of Yehuda Vardi and Cun-Hui Zhang’s algorithm for the geometric median.
Daniel J Lewis (I think) for the implementation of the Weiszfeld algorithm.

The Example:

I downloaded vector data that contains polygons for schools (and other features) from OS Open Map – Local that covered the West Midlands. I also downloaded OS Boundary Line data. Using Python and GDAL/OGR I extracted secondary schools from the data for Birmingham only. Everything was now in place to find the Median Center of all Secondary Schools for Birmingham. (see The Other Scripts section at the bottom of this post for processing the data)

birmingham_secondary_schools

Running the script from The Code section above calculates the coordinates of the Median Center for Secondary Schools in Birmingham and creates a point feature class in the File GDB.

birmingham_secondary_schools_median_center

OSGP Median Center (W):        407658.278755, 286696.905759
OSGP Median Center (YC):      407658.278752, 286696.905769
ArcGIS Median Center:             407658.009375, 286697.53996
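
To put the differences above in perspective, the straight-line offset between two results can be worked out directly. A quick sketch, assuming the coordinates are British National Grid metres; the offset function name is just for illustration:

```python
import math

def offset(a, b):
    # straight-line distance between two (x, y) results
    return math.hypot(a[0] - b[0], a[1] - b[1])

# coordinates copied from the results listed above
osgp_w = (407658.278755, 286696.905759)
osgp_yc = (407658.278752, 286696.905769)
arcgis = (407658.009375, 286697.53996)

yc_vs_arcgis = offset(osgp_yc, arcgis)
w_vs_yc = offset(osgp_w, osgp_yc)
```

yc_vs_arcgis comes out at roughly 0.69, so under that assumption the script and ArcGIS agree to within about 70 cm, while the two algorithms in the script differ from each other by far less than a millimetre.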

What’s Next?

Weighted Mean Center (link will be updated once post is ready)

Also See…

Mean Center
Central Feature

The Resources:

ESRI Guide to GIS Volume 2: Chapter 2
see book review here.

Geoprocessing with Python

Python GDAL/OGR Cookbook

Setting up GDAL/OGR with FileGDB Driver for Python on Windows

< The Other Scripts >

Birmingham Secondary Schools

from osgeo import ogr
import os

## necessary drivers
shp_driver = ogr.GetDriverByName("ESRI Shapefile")
gdb_driver = ogr.GetDriverByName("FileGDB")

## input boundary shapefile and file reference file gdb
shapefile = r"C:\Users\Glen B\Documents\Schools\Data\GB\district_borough_unitary_region.shp"
gdb = r"C:\Users\Glen B\Documents\my_geodata.gdb"

shp_ds = shp_driver.Open(shapefile, 0)
gdb_ds = gdb_driver.Open(gdb, 1)

## filter boundary to just Birmingham
shp_layer = shp_ds.GetLayer(0)
shp_layer.SetAttributeFilter("NAME = 'Birmingham District (B)'")

## name the output
output_fc = "Birmingham_Secondary_Schools"

## if the output feature class already exists then delete it
if output_fc in [gdb_ds.GetLayerByIndex(lyr_name).GetName() for lyr_name in range(gdb_ds.GetLayerCount())]:
    gdb_ds.DeleteLayer(output_fc)
    print "Deleting: {0}".format(output_fc)

## create the output feature class
out_lyr = gdb_ds.CreateLayer(output_fc, shp_layer.GetSpatialRef(), ogr.wkbPolygon)

## the folder that contains the data to extract Secondary Schools from
root_folder = r"C:\Users\Glen B\Documents\Schools\Vector\data"

## traverse through the folders and find ImportantBuildings files
## copy only those that intersect the Birmingham region
## transfer across attributes
count = 1
for root,dirs,files in os.walk(root_folder):
    for filename in files:
        if filename.endswith("ImportantBuilding.shp") and filename[0:2] in ["SP", "SO", "SJ", "SK"]:
            shp_path = "{0}\\{1}".format(root, filename)
            schools_ds = shp_driver.Open(shp_path, 0)
            schools_lyr = schools_ds.GetLayer(0)
            schools_lyr.SetAttributeFilter("CLASSIFICA = 'Secondary Education'")
            lyr_def = schools_lyr.GetLayerDefn()
            if count == 1:
                for i in range(lyr_def.GetFieldCount()):
                    out_lyr.CreateField(lyr_def.GetFieldDefn(i))
                count += 1
            shp_layer.ResetReading()
            for shp_feat in shp_layer:
                birm_geom = shp_feat.GetGeometryRef()

                for school_feat in schools_lyr:
                    school_geom = school_feat.GetGeometryRef()

                    if school_geom.Intersects(birm_geom):
                        feat_dfn = out_lyr.GetLayerDefn()
                        feat = ogr.Feature(feat_dfn)
                        feat.SetGeometry(school_geom)
                        for i in range(lyr_def.GetFieldCount()):
                            feat.SetField(lyr_def.GetFieldDefn(i).GetNameRef(), school_feat.GetField(i))

                        out_lyr.CreateFeature(feat)
                        feat.Destroy()

del shp_ds, shp_layer, gdb_ds

The Usual 🙂

As always please feel free to comment to help make the code more efficient, highlight errors, or let me know if this was of any use to you.

OSGP: Measuring Geographic Distributions – Mean Center

(Open Source Geospatial Python)

The ‘What is it?’

The Mean Center is the average X coordinate and Y coordinate for all features in a study area and is the simplest descriptor of a geographic distribution. The Mean Center is generally used to track the changes in a feature's distribution over time and can also be used to compare the distribution of multiple features.

The Mean Center is also known as the Geographic Center or Center of Concentration for a set of features.

You would calculate the Mean Center for features where there is no travel interaction between the Center and the features of the study. Basically, use it for a study where each event that happens is a recorded location, for example a burglary for crime analysis, or the sighting of a wombat for wildlife studies.

The Formula!

Mean Center Formula
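
In code the formula is just the average of the X values and the average of the Y values. A minimal pure-Python sketch (the mean_center name and the sample points are made up for illustration; the script further down uses NumPy for the same thing):

```python
def mean_center(points):
    # points is a list of (x, y) tuples
    n = float(len(points))
    mean_x = sum(x for x, y in points) / n
    mean_y = sum(y for x, y in points) / n
    return (mean_x, mean_y)

# four corners of a 2 x 4 rectangle
corners = [(0, 0), (2, 0), (2, 4), (0, 4)]
center = mean_center(corners)
```

Here center is (1.0, 2.0), the middle of the rectangle.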

For Point features the X and Y coordinates of each feature are used, for Polygons the centroid of each feature represents the X and Y coordinate to use, and for Linear features the mid-point of each line is used for the X and Y coordinate.
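
The script finds the line mid-point with Shapely's interpolate. To show what 'mid-point' means here, below is a pure-Python sketch that walks along the vertices until half of the line's length has been covered. The line_midpoint name is hypothetical and it only handles a single line given as a list of (x, y) vertices, not multi-part lines.

```python
import math

def line_midpoint(coords):
    # length of each segment between consecutive vertices
    pairs = list(zip(coords, coords[1:]))
    seg_lengths = [math.hypot(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in pairs]
    half = sum(seg_lengths) / 2.0

    travelled = 0.0
    for ((x1, y1), (x2, y2)), seg in zip(pairs, seg_lengths):
        if seg > 0 and travelled + seg >= half:
            # fraction of the way along this segment where half-length falls
            t = (half - travelled) / seg
            return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
        travelled += seg
    # zero-length line: fall back to the first vertex
    return coords[0]

straight = line_midpoint([(0, 0), (10, 0)])
bent = line_midpoint([(0, 0), (4, 0), (4, 4)])
```

Note that for the bent line the mid-point sits on the corner at (4.0, 0.0), half way along the line's length, which is not the same as the average of its vertices.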

The Code…

from osgeo import ogr
from shapely.geometry import MultiLineString
from shapely import wkt
import numpy as np
import sys

## set the driver for the data
driver = ogr.GetDriverByName("FileGDB")
## path to the FileGDB
gdb = r"C:\Users\Glen B\Documents\my_geodata.gdb"
## open the GDB in write mode (1)
ds = driver.Open(gdb, 1)

## input layer
input_lyr_name = "Birmingham_Burglaries_2016"

## the output layer
output_fc = "{0}_mean_center".format(input_lyr_name)

## reference the layer using the layers name
if input_lyr_name in [ds.GetLayerByIndex(lyr_name).GetName() for lyr_name in range(ds.GetLayerCount())]:
    lyr = ds.GetLayerByName(input_lyr_name)
    print "{0} found in {1}".format(input_lyr_name, gdb)

## delete the output feature class if it already exists
if output_fc in [ds.GetLayerByIndex(lyr_name).GetName() for lyr_name in range(ds.GetLayerCount())]:
    ds.DeleteLayer(output_fc)
    print "Deleting: {0}".format(output_fc)

try:
    ## assess the geometry of the input feature class
    first_feat = lyr.GetFeature(1)
    ## for each point or polygon in the layer 
    ## get the x and y value of the centroid 
    ## store in a numpy array
    if first_feat.geometry().GetGeometryName() in ["POINT", "MULTIPOINT", "POLYGON", "MULTIPOLYGON"]:
        xy_arr = np.ndarray((len(lyr), 2), dtype=np.float)
        for i, pt in enumerate(lyr):
            ft_geom = pt.geometry()
            xy_arr[i] = (ft_geom.Centroid().GetX(), ft_geom.Centroid().GetY())

    ## for linear features we get the midpoint of a line
    elif first_feat.geometry().GetGeometryName() in ["LINESTRING", "MULTILINESTRING"]:
        xy_arr = np.ndarray((len(lyr), 2), dtype=np.float)
        for i, ln in enumerate(lyr):
            line_geom = ln.geometry().ExportToWkt()
            shapely_line = MultiLineString(wkt.loads(line_geom))
            midpoint = shapely_line.interpolate(shapely_line.length/2)
            xy_arr[i] = (midpoint.x, midpoint.y)

## exit gracefully if unknown geometry or input contains no geometry
except Exception:
    print "Unknown geometry for {0}".format(input_lyr_name)
    sys.exit()

avg_x, avg_y = np.mean(xy_arr, axis=0)

print "Mean Center: {0}, {1}".format(avg_x, avg_y)

## create a new point layer with the same spatial ref as input layer
out_lyr = ds.CreateLayer(output_fc, lyr.GetSpatialRef(), ogr.wkbPoint)

## define and create new fields to hold the mean center coordinates
x_fld = ogr.FieldDefn("X", ogr.OFTReal)
y_fld = ogr.FieldDefn("Y", ogr.OFTReal)
out_lyr.CreateField(x_fld)
out_lyr.CreateField(y_fld)

## create a new point geom for the mean center
pnt = ogr.Geometry(ogr.wkbPoint)
pnt.AddPoint(avg_x, avg_y)

## add the mean center point to the new layer with attributes
feat_dfn = out_lyr.GetLayerDefn()
feat = ogr.Feature(feat_dfn)
feat.SetGeometry(pnt)
feat.SetField("X", avg_x)
feat.SetField("Y", avg_y)
out_lyr.CreateFeature(feat)

print "Created: {0}".format(output_fc)

## free up resources
del ds, lyr, first_feat, feat, out_lyr

I’d like to give credit to Logan Byers from GIS StackExchange who aided in speeding up the computational time using NumPy and for forcing me to begin learning the wonders of NumPy.

The Example:

I downloaded crime data from DATA.POLICE.UK for the West Midlands Police from January 2016 to December 2016. I used some Python to extract just the Burglary data and made this into a feature class in the File GDB. Next, I downloaded OS Boundary Line data and clipped the Burglary data to just Birmingham. Everything was now in place to find the Mean Center of all burglaries for Birmingham in 2016. (see The Other Scripts section at the bottom of this post for processing the data)

birmingham_burgalries_2016

Running the script from The Code section above calculates the Mean Center of all burglaries for 2016 and creates a point feature class in the File GDB.

birmingham_buglaries_2016_mean_center

OSGP Mean Center:     407926.695396, 286615.428507
ArcGIS Mean Center:    407926.695396, 286615.428507

What’s Next?…

Central Feature

The Resources:

ESRI Guide to GIS Volume 2: Chapter 2 (I highly recommend this book)
see book review here.

Geoprocessing with Python

Python GDAL/OGR Cookbook

Setting up GDAL/OGR with FileGDB Driver for Python on Windows

< The Other Scripts >

1. Extract Burglary Data for West Midlands

import csv, os
from osgeo import ogr, osr

## set the driver for the data
driver = ogr.GetDriverByName("FileGDB")

## path to the FileGDB
gdb = r"C:\Users\Glen B\Documents\my_geodata.gdb"

## open the GDB in write mode (1)
ds = driver.Open(gdb, 1)

## the coordinates in the csv files are lat/long
source = osr.SpatialReference()
source.ImportFromEPSG(4326)

## we need the data in British National Grid
target = osr.SpatialReference()
target.ImportFromEPSG(27700)

transform = osr.CoordinateTransformation(source, target)

## set the output fc name
output_fc = "WM_Burglaries_2016"

## if the output fc already exists delete it
if output_fc in [ds.GetLayerByIndex(lyr_name).GetName() for lyr_name in range(ds.GetLayerCount())]:
    ds.DeleteLayer(output_fc)
    print "Deleting: {0}".format(output_fc)

out_lyr = ds.CreateLayer(output_fc, target, ogr.wkbPoint)

## define and create new fields
mnth_fld = ogr.FieldDefn("Month", ogr.OFTString)
rep_by_fld = ogr.FieldDefn("Reported_by", ogr.OFTString)
fls_wthn_fld = ogr.FieldDefn("Falls_within", ogr.OFTString)
loc_fld = ogr.FieldDefn("Location", ogr.OFTString)
lsoa_c_fld = ogr.FieldDefn("LSOA_code", ogr.OFTString)
lsoa_n_fld = ogr.FieldDefn("LSOA_name", ogr.OFTString)
crime_fld = ogr.FieldDefn("Crime_type", ogr.OFTString)
outcome_fld = ogr.FieldDefn("Last_outcome", ogr.OFTString)

out_lyr.CreateField(mnth_fld)
out_lyr.CreateField(rep_by_fld)
out_lyr.CreateField(fls_wthn_fld)
out_lyr.CreateField(loc_fld)
out_lyr.CreateField(lsoa_c_fld)
out_lyr.CreateField(lsoa_n_fld)
out_lyr.CreateField(crime_fld)
out_lyr.CreateField(outcome_fld)

## where the downloaded csv files reside
root_folder = r"C:\Users\Glen B\Documents\Crime"

## for each csv
for root,dirs,files in os.walk(root_folder):
    for filename in files:
        if filename.endswith(".csv"):
            csv_path = "{0}\\{1}".format(root, filename)
            with open(csv_path, "rb") as csvfile:
                reader = csv.reader(csvfile, delimiter=",")
                next(reader,None)
                ## create a point with attributes for each burglary
                for row in reader:
                    if row[9] == "Burglary":
                        pnt = ogr.Geometry(ogr.wkbPoint)
                        pnt.AddPoint(float(row[4]), float(row[5]))
                        pnt.Transform(transform)
                        feat_dfn = out_lyr.GetLayerDefn()
                        feat = ogr.Feature(feat_dfn)
                        feat.SetGeometry(pnt)
                        feat.SetField("Month", row[1])
                        feat.SetField("Reported_by", row[2])
                        feat.SetField("Falls_within", row[3])
                        feat.SetField("Location", row[6])
                        feat.SetField("LSOA_code", row[7])
                        feat.SetField("LSOA_name", row[8])
                        feat.SetField("Crime_type", row[9])
                        feat.SetField("Last_outcome", row[10])
                        out_lyr.CreateFeature(feat)

del ds, feat, out_lyr

2. Birmingham Burglaries Only

from osgeo import ogr

## required drivers
shp_driver = ogr.GetDriverByName("ESRI Shapefile")
gdb_driver = ogr.GetDriverByName("FileGDB")

## input boundary shapefile and file gdb
shapefile = r"C:\Users\Glen B\Documents\Crime\Data\GB\district_borough_unitary_region.shp"
gdb = r"C:\Users\Glen B\Documents\my_geodata.gdb"

## open the shapefile in read mode and gdb in write mode
shp_ds = shp_driver.Open(shapefile, 0)
gdb_ds = gdb_driver.Open(gdb, 1)

## reference the necessary layers
shp_layer = shp_ds.GetLayer(0)
gdb_layer = gdb_ds.GetLayerByName("WM_Burglaries_2016")

## filter the shapefile
shp_layer.SetAttributeFilter("NAME = 'Birmingham District (B)'")

## set the name for the output feature class
output_fc = "Birmingham_Burglaries_2016"

## if the output already exists then delete it
if output_fc in [gdb_ds.GetLayerByIndex(lyr_name).GetName() for lyr_name in range(gdb_ds.GetLayerCount())]:
    gdb_ds.DeleteLayer(output_fc)
    print "Deleting: {0}".format(output_fc)

## create an output layer
out_lyr = gdb_ds.CreateLayer(output_fc, shp_layer.GetSpatialRef(), ogr.wkbPoint)

## copy the schema from the West Midlands burglaries
## and use it for the Birmingham burglaries
lyr_def = gdb_layer.GetLayerDefn()
for i in range(lyr_def.GetFieldCount()):
    out_lyr.CreateField(lyr_def.GetFieldDefn(i))

## only get burglaries that intersect the Birmingham region
for shp_feat in shp_layer:
    print shp_feat.GetField("NAME")
    birm_geom = shp_feat.GetGeometryRef()
    for gdb_feat in gdb_layer:
        burglary_geom = gdb_feat.GetGeometryRef()
        if burglary_geom.Intersects(birm_geom):
            feat_dfn = out_lyr.GetLayerDefn()
            feat = ogr.Feature(feat_dfn)
            feat.SetGeometry(burglary_geom)

            ## populate the attribute table
            for i in range(lyr_def.GetFieldCount()):
                feat.SetField(lyr_def.GetFieldDefn(i).GetNameRef(), gdb_feat.GetField(i))
            ## create the feature
            out_lyr.CreateFeature(feat)
            feat.Destroy()

del shp_ds, shp_layer, gdb_ds, gdb_layer

The Usual 🙂

As always please feel free to comment to help make the code more efficient, highlight errors, or let me know if this was of any use to you.

Setting up GDAL/OGR with FileGDB Driver for Python on Windows

I have decided to venture into the world of GDAL/OGR with Python, with my main motivation being to mimic some tools from ArcGIS for Desktop. I am hoping that this will help me to improve on a few fronts: my Python coding, my knowledge of open source geospatial libraries, and my understanding of the algorithms that churn away behind the scenes when you click a button in a GUI based GIS and perform some sort of geoprocessing or data analysis.

I mainly work with ESRI File Geodatabases and, while I know this format is not open source, ESRI have an API in place to read and write to a gdb via GDAL/OGR. The first step is to set up what I need to start my journey of learning GDAL/OGR with Python for Windows. I will also install a few libraries that will help speed up some computations for more efficient geoprocessing.

I am using…
Python 2.7.13 32bit on Windows 7 Professional

1. Download and Install Microsoft Visual C++ 2008 Service Pack

Click here to download and then install.

microsoft_visual c++

2. Go to Christoph Gohlke’s website and download the GDAL wheel.

Grab the GDAL whl file. I downloaded GDAL‑2.1.3‑cp27‑cp27m‑win32.whl
Open the command prompt, change directory to where the whl was downloaded and use pip to install.

pip install "GDAL-2.1.3-cp27-cp27m-win32.whl"

gdal_whl installation

3. Get the File Geodatabase API from ESRI (you will need an ESRI account)

Go to ESRI Downloads and download File Geodatabase API 1.3 version for Windows (Visual Studio 2008). This will be a zip folder. Open the contents of the API zipped folder and extract FileGDBAPI.dll from the bin folder to

C:\Python27\Lib\site-packages\osgeo

or wherever your site-package folder resides. Just make sure to extract it to osgeo.

4. Create a New Variable in Environmental Variables

In Advanced System Settings create a new Environmental Variable called GDAL_DRIVER_PATH and set the path to the osgeo folder from Step 3.
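
If you want to confirm from Python that the variable is visible (you may need to open a new command prompt or IDE session after creating it), something along these lines should do. This is a small sketch; check_gdal_driver_path is a made-up helper name and the messages are only suggestions.

```python
import os

def check_gdal_driver_path(env=None):
    # use the real environment unless a dict is passed in for testing
    env = os.environ if env is None else env
    path = env.get("GDAL_DRIVER_PATH")
    if not path:
        return "GDAL_DRIVER_PATH is not set"
    if not os.path.isdir(path):
        return "GDAL_DRIVER_PATH is set but the folder does not exist: " + path
    return "GDAL_DRIVER_PATH looks fine: " + path

status = check_gdal_driver_path()
print(status)
```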

5. Open __init__.py from osgeo…

… and uncomment line 10.

gdal_uncomment_line

Save the file.

6. Test the setup

Open a Python interpreter and test using…

test gdal setup

If you do not get any errors, as in the screenshot above, then setup has been successful.
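
The screenshot is not reproduced here, so the following is only a guess at an equivalent check rather than the exact commands shown in it. ogr.GetDriverByName is the same call used in the scripts on this blog and returns None when a driver is not available; the filegdb_available wrapper is a made-up helper.

```python
def filegdb_available():
    # returns None if GDAL/OGR cannot be imported at all,
    # otherwise True or False depending on whether the FileGDB driver loads
    try:
        from osgeo import ogr
    except ImportError:
        return None
    return ogr.GetDriverByName("FileGDB") is not None

result = filegdb_available()
print("FileGDB driver available: {0}".format(result))
```

A result of True suggests the DLL and GDAL_DRIVER_PATH are being picked up; False or None means it is worth revisiting the earlier steps.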

*************************************************************************************************
OPTIONAL: these will be used in some capacity for scripting geoprocessing.

7. Download numpy + mkl wheel from the brilliant website of Christoph Gohlke

Click here and download the necessary whl file. For my setup I have downloaded numpy‑1.11.3+mkl‑cp27‑cp27m‑win32.whl 
Open up the command prompt and change directory to where the downloaded file resides. Use pip to install.

pip install "numpy-1.11.3+mkl-cp27-cp27m-win32.whl"

numpy_mkl_whl

8. Install SciPy

Back we go to Gohlke repository and to the SciPy Wheels. Here, I have downloaded scipy‑0.19.0‑cp27‑cp27m‑win32.whl
Open up the command prompt if you have closed it after Step 7 and change directory to where the downloaded file can be found.
Use pip to install.

pip install "scipy-0.19.0-cp27-cp27m-win32.whl"

scipy_whl

9. Install Shapely

You got it, go back to Gohlke and download the Shapely whl file. I grabbed Shapely‑1.5.17‑cp27‑cp27m‑win32.whl. Use pip to install similar to Steps 7 and 8.

 

Now to immerse myself in learning mode and put GDAL/OGR to some use. Check out OSGP#1.1: Measuring Geographic Distributions – Mean Center for the first attempt.

PDF to JPG Conversion with Python (for Windows)

I recently had a torrid time trying to research and implement a Python script that could batch convert from PDF to JPG. There are numerous entries online that aim to help (and did so in parts) but I struggled to find one with a concise workflow from start to finish that satisfied my criteria and involved setting up what’s required to implement such. The below could be slated for not being the most ‘Pythonic’ way to get it done but it certainly made my life easier. I was struggling with Wand and ImageMagick as per most posts until I luckily stumbled across an entry on StackOverflow where floqqi, my new hero, answered my prayers. I felt that if I struggled with this that there must be others out there with the same predicament and I hope that the title of this post will help it come to the forefront of searches and aid fellow Python snippet researchers in finding some salvation.
Note: I am using Python 2.7 32-bit on Windows 7 Professional

1. Install ImageMagick
floqqi recommends downloading the latest version, which at the time of writing this is 7.0.4-3. I had already installed an earlier version while trying to get the Wand module to work. My version is 6.9.7-3. If you hover over the links you should be able to see the full link name http://www.imagemagick.org/download/binaries/ImageMagick-6.9.7-3-Q8-x86-dll.exe, or just click that link to download the same version I did.
Run the installer, accept the license agreement, and click Next on the Information window. In the Select Additional Tasks make sure that Install development headers and libraries for C and C++ is selected.

imagemagick

Click Next and then Install.

2. Install GhostScript
Install the 32-bit Ghostscript AGPL Release

3. Set Environment Variables
Create a new System Variable (Advanced System Settings > Environment Variables) called MAGICK_HOME and insert the Image Magick installation path as the value. This will be similar to C:\Program Files (x86)\ImageMagick-6.9.7-Q8

MAGICK_HOME

Click OK and make sure that the same value (C:\Program Files (x86)\ImageMagick-6.9.7-Q8) is at the start of the Path variable. After this entry in the Path variable insert the entry for GhostScript which will be similar to C:\Program Files (x86)\gs\gs9.20\bin
Note: make sure that the entries are separated by a semi-colon (;)
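
The Path ordering is easy to get wrong, so here is a small sketch for checking it. The path_order_ok function is a made-up helper, the folder names are the example ones from this step, and the Path string is passed in as text so you can paste in your own value.

```python
def path_order_ok(path_value, magick_dir, gs_bin_dir):
    # normalise entries so trailing backslashes and letter case do not matter
    def norm(p):
        return p.strip().rstrip("\\").lower()

    entries = [norm(e) for e in path_value.split(";") if e.strip()]
    if not entries or entries[0] != norm(magick_dir):
        return False  # ImageMagick must be the first entry
    # GhostScript must appear somewhere after ImageMagick
    return norm(gs_bin_dir) in entries[1:]

magick = r"C:\Program Files (x86)\ImageMagick-6.9.7-Q8"
gs_bin = r"C:\Program Files (x86)\gs\gs9.20\bin"

good = path_order_ok(magick + ";" + gs_bin + r";C:\Windows", magick, gs_bin)
bad = path_order_ok(gs_bin + ";" + magick, magick, gs_bin)
```

On your own machine you could pass os.environ["PATH"] as the first argument.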

4. Check if steps 1-3 have been correctly configured
Open the Command Prompt and enter…

convert file1.pdf file2.jpg

where file1.pdf and file2.jpg are fully qualified paths for an input PDF and an output JPG (or the current directory contains the file).

convert cmd

If no errors are presented and the JPG has been created you can move on to the next step. Otherwise step into some troubleshooting.

5. Install PythonMagick
I downloaded the Python 2.7 32-bit whl file PythonMagick‑0.9.10‑cp27‑none‑win32.whl and then used pip to install from the command prompt.

pip install C:\Users\glen.bambrick\Downloads\PythonMagick-0.9.10-cp27-none-win32.whl

Open up a Python IDE and test to see if you can import PythonMagick

import PythonMagick

We now have everything set up and can begin to write a script that will convert multiple (single page) PDFs to JPGs. 

Import the necessary modules.

import os, PythonMagick
from PythonMagick import Image
from datetime import datetime

Ok so datetime isn’t necessary but I like to time my scripts and see if it can be improved upon. Set the start time for the script

start_time = datetime.now()

A couple of global variables, one for the directory that holds the PDFs, and another to hold a hexadecimal value for the background colour ‘white’. After trial and error I noticed that some JPGs were being exported with a black background instead of white and this will be used to force a white background. I found a useful link on StackOverflow to help overcome this.

pdf_dir = r"C:\MyPDFs"
bg_colour = "#ffffff"

We loop through each PDF in the folder

for pdf in [pdf_file for pdf_file in os.listdir(pdf_dir) if pdf_file.endswith(".pdf")]:

Set and read in each PDF. density is the resolution.

    input_pdf = pdf_dir + "\\" + pdf
    img = Image()
    img.density('300')
    img.read(input_pdf)

Get the dimensions of the image.

    size = "%sx%s" % (img.columns(), img.rows())

Build the JPG for output. This part must be the Magick in PythonMagick because a small portion of it mystifies me. See that last link to StackOverflow for the origin of the code here. The PythonMagick documentation is tough to digest, and in various threads I have read laments about how poor it is.

    output_img = Image(size, bg_colour)
    output_img.type = img.type
    output_img.composite(img, 0, 0, PythonMagick.CompositeOperator.SrcOverCompositeOp)
    output_img.resize(str(img.rows()))
    output_img.magick('JPG')
    output_img.quality(75)

And lastly we write out our JPG

    output_jpg = input_pdf.replace(".pdf", ".jpg")
    output_img.write(output_jpg)
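
One thing to watch with replace is that it swaps every ".pdf" in the path, not just the extension at the end, so a file or folder name that happens to contain ".pdf" elsewhere would be altered too. A hedged alternative using os.path.splitext from the standard library (jpg_path_for is a made-up helper name, not part of PythonMagick):

```python
import os

def jpg_path_for(pdf_path):
    # split off only the final extension and check it really is a PDF
    base, ext = os.path.splitext(pdf_path)
    if ext.lower() != ".pdf":
        raise ValueError("not a PDF path: " + pdf_path)
    return base + ".jpg"

simple = jpg_path_for("report.pdf")
tricky = jpg_path_for("plan.pdf.backup.pdf")
```

For the second name, replace would have produced plan.jpg.backup.jpg, whereas splitext only changes the final extension.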

And see how long it took the script to run.

print datetime.now() - start_time

This places the output JPGs in the same folder as the PDFs. Based on the resolution (density) and quality settings the process can be a bit lengthy. Using the settings above it took 9 minutes to do 20 PDF to JPG Conversions. You will need to figure out the optimum resolution and quality for your purpose. Low res took 46 seconds for all 20.

As always I feel a sense of achievement when I get a Python script to work and hope that this post will spur on some comments to make the above process more efficient. Feel free to post links to any resources, maybe comment to help myself and other readers, or if this helped you in any way let me know and I’ll pass the thanks on to floqqi and the rest of the crew. This script is the limit of my knowledge with PythonMagick, and that is thanks to those who have endeavoured before me and are referenced in the links throughout this post. Thanks guys.

Complete script…

import os, PythonMagick
from PythonMagick import Image
from datetime import datetime

start_time = datetime.now()

pdf_dir = r"C:\MyPDFs"
bg_colour = "#ffffff"

for pdf in [pdf_file for pdf_file in os.listdir(pdf_dir) if pdf_file.endswith(".pdf")]:

    input_pdf = pdf_dir + "\\" + pdf
    img = Image()
    img.density('300')
    img.read(input_pdf)

    size = "%sx%s" % (img.columns(), img.rows())

    output_img = Image(size, bg_colour)
    output_img.type = img.type
    output_img.composite(img, 0, 0, PythonMagick.CompositeOperator.SrcOverCompositeOp)
    output_img.resize(str(img.rows()))
    output_img.magick('JPG')
    output_img.quality(75)


    output_jpg = input_pdf.replace(".pdf", ".jpg")
    output_img.write(output_jpg)

print datetime.now() - start_time