OSGP: Measuring Geographic Distributions – Weighted Mean Center

(Open Source Geospatial Python)

The ‘What is it?’

See Mean Center.

The unweighted center is mainly used for events that occur at a place and time such as burglaries. The weighted center, however, is predominantly used for stationary features such as retail outlets or delineated areas for example (such as Census tracts). The Weighted Mean Center does not take into account distance between features in the dataset.

The weight needs to be a numerical attribute. The greater the value, the higher the weight for that feature.

The Formula!

The Weighted Mean Center is calculated by multiplying the x and y coordinate by the weight for that feature and summing all for both x and y individually, and then dividing this by the sum of all the weights.

Weighted Mean Center FormulaFor Point features the X and Y coordinates of each feature is used, for Polygons the centroid of each feature represents the X and Y coordinate to use, and for Linear features the mid-point of each line is used for the X and Y coordinate.

The Code…

from osgeo import ogr
from shapely.geometry import MultiLineString
from shapely import wkt
import numpy as np
import sys

## set the driver for the data
driver = ogr.GetDriverByName("ESRI Shapefile")
## folder where the shapefile resides
folder = r"C:\Users\glen.bambrick\Documents\GDAL\shp\\"
## name of the shapefile concatenated with folder
shp = "{0}Census2011_Small_Areas_generalised20m.shp".format(folder)
## open the shapefile
ds = driver.Open(shp, 0)
## reference the only layer in the shapefile
lyr = ds.GetLayer(0)

## create an output data source
out_ds = driver.CreateDataSource("{0}{1}_wgt_mean_center.shp".format(folder,lyr.GetName()))

## output mean center weighted filename
output_fc = "{0}{1}_wgt_mean_center".format(folder,lyr.GetName())

## field that has numerical weight
weight_fld = "TOTAL2011"

try:
    first_feat = lyr.GetFeature(1)
    xy_arr = np.ndarray((len(lyr), 2), dtype=np.float)
    wgt_arr = np.ndarray((len(lyr), 1), dtype=np.float)
    ## use the centroid for points and polygons
    if first_feat.geometry().GetGeometryName() in ["POINT", "MULTIPOINT", "POLYGON", "MULTIPOLYGON"]:
        for i, pt in enumerate(lyr):
            ft_geom = pt.geometry()
            weight = pt.GetField(weight_fld)
            xy_arr[i] = (ft_geom.Centroid().GetX() * weight, ft_geom.Centroid().GetY() * weight)
            wgt_arr[i] = weight
    ## midpoint of lines
    elif first_feat.geometry().GetGeometryName() in ["LINESTRING", "MULTILINESTRING"]:
        for i, ln in enumerate(lyr):
            line_geom = ln.geometry().ExportToWkt()
            weight = ln.GetField(weight_fld)
            shapely_line = MultiLineString(wkt.loads(line_geom))
            midpoint = shapely_line.interpolate(shapely_line.length/2)
            xy_arr[i] = (midpoint.x * weight, midpoint.y * weight)
            wgt_arr[i] = weight

except Exception:
    print "Unknown geometry or Incorrect field name for {}".format(input_lyr_name)
    sys.exit()

## do the maths
sum_x, sum_y = np.sum(xy_arr, axis=0)
sum_wgt = np.sum(wgt_arr)
weighted_x, weighted_y = sum_x/sum_wgt, sum_y/sum_wgt

print "Weighted Mean Center: {0}, {1}".format(weighted_x, weighted_y)

## create a new point layer with the same spatial ref as lyr
out_lyr = out_ds.CreateLayer(output_fc, lyr.GetSpatialRef(), ogr.wkbPoint)

## define and create new fields
x_fld = ogr.FieldDefn("X", ogr.OFTReal)
y_fld = ogr.FieldDefn("Y", ogr.OFTReal)
out_lyr.CreateField(x_fld)
out_lyr.CreateField(y_fld)

## create a new point for the mean center weighted
pnt = ogr.Geometry(ogr.wkbPoint)
pnt.AddPoint(weighted_x, weighted_y)

## add the mean center weighted to the new layer
feat_dfn = out_lyr.GetLayerDefn()
feat = ogr.Feature(feat_dfn)
feat.SetGeometry(pnt)
feat.SetField("X", weighted_x)
feat.SetField("Y", weighted_y)
out_lyr.CreateFeature(feat)

print "Created: {0}.shp".format(output_fc)

## free up resources
del ds, out_ds, lyr, first_feat, feat, out_lyr

I’d like to give credit to Logan Byers from GIS StackExchange who aided in speeding up the computational time using NumPy and for forcing me to begin learning the wonders of NumPy (which is still a work in progress)

The Example:

I downloaded the Small Areas of Ireland from the CSO. You will have to acknowledge a disclaimer. The data contains population information for the 2011 Census. Once downloaded unzip Census2011_Small_Areas_generalised20m.zip to working folder.

Small Areas

Running the script from The Code section above calculates the Weighted Mean Center of all Small Areas based on the population count for each for 2011 and creates a point Shapefile as the output.

Small Areas Weighted Mean Center

OSGP Weighted Mean Center:      238557.427484, 208347.116116
ArcGIS Weighted Mean Center:    238557.427484, 208347.116116

Also See…

Mean Center
Central Feature
Median Center
Initial Data Assessment

The Resources:

ESRI Guide to GIS Volume 2: Chapter 2 (I highly recommend this book)
see book review here.

Geoprocessing with Python

Python GDAL/OGR Cookbook

The Usual 🙂

As always please feel free to comment to help make the code more efficient, highlight errors, or let me know if this was of any use to you.

OSGP: Standard GIS Tools – Initial Data Assessment

(Open Source Geospatial Python)

Here we will look at the general makeup of a downloaded spatial dataset – a Shapefile from the Central Statistics Office in Ireland containing census data from 2011. We will look at getting the spatial reference of the file along with a breakdown of the field names, type, width and precision. We can print the top ten records or the entire attribute table and get a list of unique values for a field and the count of each.

Download the Small Areas of Ireland from the CSO. You will have to acknowledge a disclaimer. Once downloaded unzip Census2011_Small_Areas_generalised20m.zip to working folder. We will now begin to interrogate this Shapefile.

Small Areas

First we import the necessary modules…

# import modules
from osgeo import ogr
from tabulate import tabulate
from operator import itemgetter

tabulate will allow us to print out formatted tables. Using ogr we can access the inner workings of the downloaded Shapefile. Please note that osgeo and tabulate are not standard Python libraries and will need to be installed.

Using the ESRI Shapefile driver we open the Shapefile in read mode (0) and access the data (lyr).

# use Shapefile driver
driver = ogr.GetDriverByName("ESRI Shapefile")
# reference Shapefile
shp = r"C:\Users\Glen B\Documents\GDAL\shp\Census2011_Small_Areas_generalised20m.shp"
# open the file
ds = driver.Open(shp, 0)
# reference the only layer in a Shapefile
lyr = ds.GetLayer(0)

Spatial Reference Information

Straight away we cant print the spatial reference information associated with the Shapefile (contained in the .prj file)

print lyr.GetSpatialRef()

This will print out…

PROJCS["TM65_Irish_Grid",
    GEOGCS["GCS_TM65",
        DATUM["TM65",
            SPHEROID["Airy_Modified",6377340.189,299.3249646]],
        PRIMEM["Greenwich",0.0],
        UNIT["Degree",0.0174532925199433]],
    PROJECTION["Transverse_Mercator"],
    PARAMETER["False_Easting",200000.0],
    PARAMETER["False_Northing",250000.0],
    PARAMETER["Central_Meridian",-8.0],
    PARAMETER["Scale_Factor",1.000035],
    PARAMETER["Latitude_Of_Origin",53.5],
    UNIT["Meter",1.0]]

You can also access this information individually…

# projected coordinate system
proj_string = lyr.GetSpatialRef().GetAttrValue("PROJCS", 0)
# geographic coordinate system
geog_string = lyr.GetSpatialRef().GetAttrValue("GEOGCS", 0)
# EPSG Code if available
epsg = lyr.GetSpatialRef().GetAttrValue("AUTHORITY", 1)
# datum
datum = lyr.GetSpatialRef().GetAttrValue("DATUM", 0)

print "\nFile: {0}\n\nProjected: {1}\nEPSG: {2}\n".format(lyr.GetName(),proj_string, epsg)
print "Geographic: {0}\nDatum: {1}\n".format(geog_string, datum)

The output…

File: Census2011_Small_Areas_generalised20m

Projected: TM65_Irish_Grid
EPSG: None

Geographic: GCS_TM65
Datum: TM65

If there is an EPSG code in the .prj file it will be printed instead of None.

Geometry Type

If we reference the first feature we can get the geometry of the Shapefile

first_feat = lyr.GetFeature(1)
print "Geometry Type: {0}\n".format(first_feat.geometry().GetGeometryName())

In this instance it is a polygon Shapefile.

Geometry Type: POLYGON

Field Information

Let’s get some information on the data through the Layer Definition.

# https://pcjericks.github.io/py-gdalogr-cookbook/vector_layers.html
lyr_def = lyr.GetLayerDefn()

But before we do we need to create a few list structures. These will be used to hold the accessed information and enable us to neatly print them to screen.

# list to hold headers for filed information
header_list = ["FIELD NAME", "TYPE", "WIDTH", "PRECISION"]
# list will be populated with field information
output_list = []
# list will be populated with field names and used for attribute headers
fld_names = []

Cycle through each field and populate the necessary lists…

# for each field
for i in range(lyr_def.GetFieldCount()):
    # reference the field name
    fld_name = lyr_def.GetFieldDefn(i).GetName()
    # reference the field type
    fld_type = lyr_def.GetFieldDefn(i).GetFieldTypeName(lyr_def.GetFieldDefn(i).GetType())
    # reference the field width
    fld_width = lyr_def.GetFieldDefn(i).GetWidth()
    # reference the field precision
    fld_precision = lyr_def.GetFieldDefn(i).GetPrecision()
    # append these as a list to the output_list
    output_list.append([fld_name, fld_type, str(fld_width), str(fld_precision)])
    # append field name to fld_name
    fld_names.append(fld_name)

The output_list is a list of lists containing information for each field, the field name, data type, width and precision, this is matched in the header_list. The fld_names will be used further down to print out attributes, this list hold the field names as headers. Let’s print the field information…

print "{0}\n".format(tabulate(output_list, header_list))

Here’s the output…

FIELD NAME   TYPE     WIDTH   PRECISION
------------ ------ ------- -----------
NUTS1        String       3           0
NUTS1NAME    String       7           0
NUTS2        String       4           0
NUTS2NAME    String      26           0
NUTS3        String       5           0
NUTS3NAME    String      15           0
COUNTY       String       2           0
COUNTYNAME   String      25           0
CSOED        String      11           0
OSIED        String      13           0
EDNAME       String      45           0
SMALL_AREA   String      61           0
GEOGID       String      65           0
MALE2011     Real        20          10
FEMALE2011   Real        20          10
TOTAL2011    Real        20          10
PPOCC2011    Real        20          10
UNOCC2011    Real        20          10
VACANT2011   Real        20          10
HS2011       Real        20          10
PCVAC2011    Real        20          10
CREATEDATE   String      10           0

Attribute Table

Next we print out some attributes for a set of features, the first ten.

# number of features from the first to print attributes for
num_to_return = 10
#num_to_return = lyr.GetFeatureCount()

Use the commented out line if you want to print attributes for all features. Create an empty list to hold the attributes. Some fields contain characters from the Irish language so we account for this so that the attributes are printed correctly.

# list will be populated with attribute data
att_table = []

# for each feature in the Shapefile
for count, feature in enumerate(lyr):
    # up to the number of set features to print
    if count < num_to_return:
        # count will beacome the Feature ID
        atts = [count]
        # for each field append the data to atts list
        for name in fld_names:
            try:
                # if the attribute is a string then decode with Celtic Languages
                atts.append(feature.GetField(name).decode("iso8859_14"))
            except Exception:
                atts.append(feature.GetField(name))
        # append the data for the feature to the att_table list
        att_table.append(atts)

The count becomes the Feature ID but we have no field for this so we will create one…

# add a FID header (count)
fld_names.insert(0, "FID")

So let’s print out the attributes…

print tabulate(att_table, fld_names)
print "{0} out of {1} features".format(num_to_return, lyr.GetFeatureCount())

Here’s the output…

  FID NUTS1   NUTS1NAME   NUTS2   NUTS2NAME            NUTS3   NUTS3NAME         COUNTY COUNTYNAME       CSOED   OSIED EDNAME                           SMALL_AREA GEOGID       MALE2011   FEMALE2011   TOTAL2011   PPOCC2011   UNOCC2011   VACANT2011   HS2011   PCVAC2011 CREATEDATE
----- ------- ----------- ------- -------------------- ------- --------------- -------- -------------- ------- ------- ------------------------------ ------------ ---------- ---------- ------------ ----------- ----------- ----------- ------------ -------- ----------- ------------
    0 IE0     Ireland     IE02    Southern and Eastern IE022   Mid-East              15 Wicklow County   15039  257005 Aughrim                           257005002 A257005002        137          138         275          84          18           15      102        14.7 27-03-2012
    1 IE0     Ireland     IE02    Southern and Eastern IE024   South-East (IE)       01 Carlow County    01054  017049 Tinnahinch                        017049001 A017049001        186          176         362         111          25           24      136        17.6 27-03-2012
    2 IE0     Ireland     IE02    Southern and Eastern IE024   South-East (IE)       01 Carlow County    01053  017032 Marley                            017032001 A017032001        194          173         367         121           8            5      129         3.9 27-03-2012
    3 IE0     Ireland     IE02    Southern and Eastern IE024   South-East (IE)       01 Carlow County    01054  017049 Tinnahinch                        017049002 A017049002         75           75         150          67          29           29       96        30.2 27-03-2012
    4 IE0     Ireland     IE02    Southern and Eastern IE024   South-East (IE)       01 Carlow County    01054  017049 Tinnahinch                        017049003 A017049003         84           81         165          64          16           14       80        17.5 27-03-2012
    5 IE0     Ireland     IE02    Southern and Eastern IE024   South-East (IE)       01 Carlow County    01015  017005 Ballyellin                        017005002 A017005002        105           99         204          71           6            5       77         6.5 27-03-2012
    6 IE0     Ireland     IE02    Southern and Eastern IE024   South-East (IE)       01 Carlow County    01015  017005 Ballyellin                        017005001 A017005001        115          108         223          70           9            8       79        10.1 27-03-2012
    7 IE0     Ireland     IE02    Southern and Eastern IE024   South-East (IE)       01 Carlow County    01033  017033 Muinebeag (Bagenalstown) Rural    017033001 A017033001        201          205         406         143          15           14      158         8.9 27-03-2012
    8 IE0     Ireland     IE02    Southern and Eastern IE024   South-East (IE)       01 Carlow County    01034  017034 Muinebeag (Bagenalstown) Urban    017034002 A017034002        142          116         258          89           9            9       98         9.2 27-03-2012
    9 IE0     Ireland     IE02    Southern and Eastern IE024   South-East (IE)       01 Carlow County    01034  017034 Muinebeag (Bagenalstown) Urban    017034003 A017034003        174          169         343         107           6            4      113         3.5 27-03-2012
10 out of 18488 features

Unique Values and Counts

Next we’ll get a list of the unique COUNTYNAME entries and a count to see how many small areas are in each. (The below works for text fields only)

# rest to first feature
lyr.ResetReading()

# field to return unique list and count of
field = "COUNTYNAME"

# create empty dictionary
values_dict = {}

# for each feature
for feature in lyr:
    attribute = feature.GetField(field).decode("iso8859_14")
    # if the COUNTYNAME is not already in the dictionary add it and assign a value of 1
    if attribute not in values_dict:
        values_dict[attribute] = 1
    # otherwise do not add it and increase the existing value by 1
    else:
        values_dict[attribute] = values_dict[attribute] + 1

## convert dictionary to list for use with tabulate
key_value_list = [[key, value] for key, value in values_dict.items()]

## print results
print "\nTotal Feature Count: {0}\n".format(lyr.GetFeatureCount())
print tabulate(sorted(key_value_list), [field, "Count"])

And here’s the output…

Total Feature Count: 18488

COUNTYNAME               Count
---------------------- -------
Carlow County              210
Cavan County               322
Clare County               511
Cork City                  519
Cork County               1650
Donegal County             761
Dublin City               2202
Dún Laoghaire-Rathdown     760
Fingal                     938
Galway City                307
Galway County              741
Kerry County               701
Kildare County             731
Kilkenny County            372
Laois County               305
Leitrim County             173
Limerick City              258
Limerick County            514
Longford County            179
Louth County               462
Mayo County                643
Meath County               636
Monaghan County            244
North Tipperary            283
Offaly County              286
Roscommon County           303
Sligo County               307
South Dublin               906
South Tipperary            350
Waterford City             198
Waterford County           275
Westmeath County           338
Wexford County             615
Wicklow County             488

Alternatively we could print out based on the highest count descending by replacing the last print statement with…

# http://stackoverflow.com/questions/17555218/python-how-to-sort-a-list-of-lists-by-the-fourth-element-in-each-list
print tabulate(sorted(key_value_list, key = itemgetter(1), reverse = True), [field, "Count"])

…to get…

COUNTYNAME               Count
---------------------- -------
Dublin City               2202
Cork County               1650
Fingal                     938
South Dublin               906
Donegal County             761
...

I will add to these as I come across something useful. If you know of any neat things to add please comment below. Please also comment if anything is unclear or if this was useful to you.

See Also…

Setting up GDAL/OGR with FileGDB Driver for Python on Windows
Measuring Geographic Distributions #1.1 – Mean Center
Measuring Geographic Distributions #2.1 – Central Feature
Measuring Geographic Distributions #3.1 – Median Center

CSV to Shapefile with pyshp

In this post I will look at extracting point data from a CSV file and creating a Shapefile with the pyshp library. The data consists of the location of trees with various attributes generated by the Fingal County Council in Ireland. The data can be downloaded as a CSV file from dublinked.ie.

pyshp is a pure Python library designed to provide read and write support for the ESRI Shapefile (.shp) format and only utilizes Python’s standard library to achieve this. The library can be downloaded from https://code.google.com/p/pyshp/ and placed in the site-packages folder of your Python installation. Alternatively you can use easy-install…

easy_install pyshp

…or pip.

pip install pyshp

NOTE: You should make yourself familiar with the pyshp library by visiting Joel Lawhead’s examples and documents here.

The full code is at the bottom of the post, the following is a walkthrough. When ready to go open your favourite editor and import the modules required for the task at hand.

import shapefile, csv

We will use the getWKT_PRJ function discussed in a previous post.

def getWKT_PRJ (epsg_code):
 import urllib
 wkt = urllib.urlopen("http://spatialreference.org/ref/epsg/{0}/prettywkt/".format(epsg_code))
 remove_spaces = wkt.read().replace(" ","")
 output = remove_spaces.replace("\n", "")
 return output

Create an instance of the Shapefile Writer( ) class and declare the POINT geometry type.

trees_shp = shapefile.Writer(shapefile.POINT)

Set the autoBalance to 1. This enforces that for every record there must be a corresponding geometry.

trees_shp.autoBalance = 1

Create the field names and data types for each.

trees_shp.field("TREE_ID", "C")
trees_shp.field("ADDRESS", "C")
trees_shp.field("TOWN", "C")
trees_shp.field("TREE_SPEC", "C")
trees_shp.field("SPEC_DESC", "C")
trees_shp.field("COMMONNAME", "C")
trees_shp.field("AGE_DESC", "C")
trees_shp.field("HEIGHT", "C")
trees_shp.field("SPREAD", "C")
trees_shp.field("TRUNK", "C")
trees_shp.field("TRUNK_ACTL", "C")
trees_shp.field("CONDITION", "C")

Create a counter variable to keep track of the number of feature written to the Shapefile.

counter = 1

Open the CSV file in read mode.

with open('C:/csv_to_shp/Trees.csv', 'rb') as csvfile:
 reader = csv.reader(csvfile, delimiter=',')

Skip the header.

next(reader, None)

Loop through each row and assign each attribute in the row to a variable.

for row in reader:
 tree_id = row[0]
 address = row[1]
 town = row[2]
 tree_species = row[3]
 species_desc = row[4]
 common_name = row[5]
 age_desc = row[6]
 height = row[7]
 spread = row[8]
 trunk = row[9]
 trunk_actual = row[10]
 condition = row[11]
 latitude = row[12]
 longitude = row[13]

Set the geometry for each record based on the longitude and latitude vales.

trees_shp.point(float(longitude),float(latitude))

Create a matching record for the geometry using the attributes.

trees_shp.record(tree_id, address, town, tree_species, species_desc, common_name, age_desc,height, spread, trunk, trunk_actual, condition)

Print to screen the current feature number and increase the counter.

print "Feature " + str(counter) + " added to Shapefile."
 counter = counter + 1

Save the Shapefile to a location and name the file.

trees_shp.save("C:/csv_to_shp/Fingal_Trees")

Create a projection file (.prj)

prj = open("C:/csv_to_shp/Fingal_Trees.prj", "w")
epsg = getWKT_PRJ("4326")
prj.write(epsg)
prj.close()

Save and run the script. The number of features should be printed to the console.

Number of features

If you open the original CSV file you can see that there are also 33670 records. Navigate to the file location where you saved the Shapefile output. You should see four files shown below.

Shapfile

And just to make sure that the data is correct, here I have opened it up in QGIS.

QGIS - Fingal Trees

And the attribute table…

QGIS Attribute Table

And here’s the full code…

# import libraries
import shapefile, csv

# funtion to generate a .prj file
def getWKT_PRJ (epsg_code):
 import urllib
 wkt = urllib.urlopen("http://spatialreference.org/ref/epsg/{0}/prettywkt/".format(epsg_code))
 remove_spaces = wkt.read().replace(" ","")
 output = remove_spaces.replace("\n", "")
 return output

# create a point shapefile
trees_shp = shapefile.Writer(shapefile.POINT)

# for every record there must be a corresponding geometry.
trees_shp.autoBalance = 1

# create the field names and data type for each.
trees_shp.field("TREE_ID", "C")
trees_shp.field("ADDRESS", "C")
trees_shp.field("TOWN", "C")
trees_shp.field("TREE_SPEC", "C")
trees_shp.field("SPEC_DESC", "C")
trees_shp.field("COMMONNAME", "C")
trees_shp.field("AGE_DESC", "C")
trees_shp.field("HEIGHT", "C")
trees_shp.field("SPREAD", "C")
trees_shp.field("TRUNK", "C")
trees_shp.field("TRUNK_ACTL", "C")
trees_shp.field("CONDITION", "C")

# count the features
counter = 1

# access the CSV file
with open('C:/csv_to_shp/Trees.csv', 'rb') as csvfile:
 reader = csv.reader(csvfile, delimiter=',')
 # skip the header
 next(reader, None)

#loop through each of the rows and assign the attributes to variables
 for row in reader:
  tree_id = row[0]
  address = row[1]
  town = row[2]
  tree_species = row[3]
  species_desc = row[4]
  common_name = row[5]
  age_desc = row[6]
  height = row[7]
  spread = row[8]
  trunk = row[9]
  trunk_actual = row[10]
  condition = row[11]
  latitude = row[12]
  longitude = row[13]

  # create the point geometry
  trees_shp.point(float(longitude),float(latitude))
  # add attribute data
  trees_shp.record(tree_id, address, town, tree_species, species_desc, common_name, age_desc,height, spread, trunk, trunk_actual, condition)

  print "Feature " + str(counter) + " added to Shapefile."
  counter = counter + 1

# save the Shapefile
trees_shp.save("C:/csv_to_shp/Fingal_Trees")

# create a projection file
prj = open("C:/csv_to_shp/Fingal_Trees.prj", "w")
epsg = getWKT_PRJ("4326")
prj.write(epsg)
prj.close()

Any problems let me know.