(Open Source Geospatial Python)
Here we will look at the general makeup of a downloaded spatial dataset – a Shapefile from the Central Statistics Office in Ireland containing census data from 2011. We will look at getting the spatial reference of the file along with a breakdown of the field names, type, width and precision. We can print the top ten records or the entire attribute table and get a list of unique values for a field and the count of each.
Download the Small Areas of Ireland from the CSO. You will have to acknowledge a disclaimer. Once downloaded unzip Census2011_Small_Areas_generalised20m.zip to working folder. We will now begin to interrogate this Shapefile.

First we import the necessary modules…
# import modules
from osgeo import ogr
from tabulate import tabulate
from operator import itemgetter
tabulate will allow us to print out formatted tables. Using ogr we can access the inner workings of the downloaded Shapefile. Please note that osgeo and tabulate are not standard Python libraries and will need to be installed.
Using the ESRI Shapefile driver we open the Shapefile in read mode (0) and access the data (lyr).
# use Shapefile driver
driver = ogr.GetDriverByName("ESRI Shapefile")
# reference Shapefile
shp = r"C:\Users\Glen B\Documents\GDAL\shp\Census2011_Small_Areas_generalised20m.shp"
# open the file
ds = driver.Open(shp, 0)
# reference the only layer in a Shapefile
lyr = ds.GetLayer(0)
Spatial Reference Information
Straight away we cant print the spatial reference information associated with the Shapefile (contained in the .prj file)
print lyr.GetSpatialRef()
This will print out…
PROJCS["TM65_Irish_Grid",
GEOGCS["GCS_TM65",
DATUM["TM65",
SPHEROID["Airy_Modified",6377340.189,299.3249646]],
PRIMEM["Greenwich",0.0],
UNIT["Degree",0.0174532925199433]],
PROJECTION["Transverse_Mercator"],
PARAMETER["False_Easting",200000.0],
PARAMETER["False_Northing",250000.0],
PARAMETER["Central_Meridian",-8.0],
PARAMETER["Scale_Factor",1.000035],
PARAMETER["Latitude_Of_Origin",53.5],
UNIT["Meter",1.0]]
You can also access this information individually…
# projected coordinate system
proj_string = lyr.GetSpatialRef().GetAttrValue("PROJCS", 0)
# geographic coordinate system
geog_string = lyr.GetSpatialRef().GetAttrValue("GEOGCS", 0)
# EPSG Code if available
epsg = lyr.GetSpatialRef().GetAttrValue("AUTHORITY", 1)
# datum
datum = lyr.GetSpatialRef().GetAttrValue("DATUM", 0)
print "\nFile: {0}\n\nProjected: {1}\nEPSG: {2}\n".format(lyr.GetName(),proj_string, epsg)
print "Geographic: {0}\nDatum: {1}\n".format(geog_string, datum)
The output…
File: Census2011_Small_Areas_generalised20m
Projected: TM65_Irish_Grid
EPSG: None
Geographic: GCS_TM65
Datum: TM65
If there is an EPSG code in the .prj file it will be printed instead of None.
Geometry Type
If we reference the first feature we can get the geometry of the Shapefile…
first_feat = lyr.GetFeature(1)
print "Geometry Type: {0}\n".format(first_feat.geometry().GetGeometryName())
In this instance it is a polygon Shapefile.
Geometry Type: POLYGON
Field Information
Let’s get some information on the data through the Layer Definition.
# https://pcjericks.github.io/py-gdalogr-cookbook/vector_layers.html
lyr_def = lyr.GetLayerDefn()
But before we do we need to create a few list structures. These will be used to hold the accessed information and enable us to neatly print them to screen.
# list to hold headers for filed information
header_list = ["FIELD NAME", "TYPE", "WIDTH", "PRECISION"]
# list will be populated with field information
output_list = []
# list will be populated with field names and used for attribute headers
fld_names = []
Cycle through each field and populate the necessary lists…
# for each field
for i in range(lyr_def.GetFieldCount()):
# reference the field name
fld_name = lyr_def.GetFieldDefn(i).GetName()
# reference the field type
fld_type = lyr_def.GetFieldDefn(i).GetFieldTypeName(lyr_def.GetFieldDefn(i).GetType())
# reference the field width
fld_width = lyr_def.GetFieldDefn(i).GetWidth()
# reference the field precision
fld_precision = lyr_def.GetFieldDefn(i).GetPrecision()
# append these as a list to the output_list
output_list.append([fld_name, fld_type, str(fld_width), str(fld_precision)])
# append field name to fld_name
fld_names.append(fld_name)
The output_list is a list of lists containing information for each field, the field name, data type, width and precision, this is matched in the header_list. The fld_names will be used further down to print out attributes, this list hold the field names as headers. Let’s print the field information…
print "{0}\n".format(tabulate(output_list, header_list))
Here’s the output…
FIELD NAME TYPE WIDTH PRECISION
------------ ------ ------- -----------
NUTS1 String 3 0
NUTS1NAME String 7 0
NUTS2 String 4 0
NUTS2NAME String 26 0
NUTS3 String 5 0
NUTS3NAME String 15 0
COUNTY String 2 0
COUNTYNAME String 25 0
CSOED String 11 0
OSIED String 13 0
EDNAME String 45 0
SMALL_AREA String 61 0
GEOGID String 65 0
MALE2011 Real 20 10
FEMALE2011 Real 20 10
TOTAL2011 Real 20 10
PPOCC2011 Real 20 10
UNOCC2011 Real 20 10
VACANT2011 Real 20 10
HS2011 Real 20 10
PCVAC2011 Real 20 10
CREATEDATE String 10 0
Attribute Table
Next we print out some attributes for a set of features, the first ten.
# number of features from the first to print attributes for
num_to_return = 10
#num_to_return = lyr.GetFeatureCount()
Use the commented out line if you want to print attributes for all features. Create an empty list to hold the attributes. Some fields contain characters from the Irish language so we account for this so that the attributes are printed correctly.
# list will be populated with attribute data
att_table = []
# for each feature in the Shapefile
for count, feature in enumerate(lyr):
# up to the number of set features to print
if count < num_to_return:
# count will beacome the Feature ID
atts = [count]
# for each field append the data to atts list
for name in fld_names:
try:
# if the attribute is a string then decode with Celtic Languages
atts.append(feature.GetField(name).decode("iso8859_14"))
except Exception:
atts.append(feature.GetField(name))
# append the data for the feature to the att_table list
att_table.append(atts)
The count becomes the Feature ID but we have no field for this so we will create one…
# add a FID header (count)
fld_names.insert(0, "FID")
So let’s print out the attributes…
print tabulate(att_table, fld_names)
print "{0} out of {1} features".format(num_to_return, lyr.GetFeatureCount())
Here’s the output…
FID NUTS1 NUTS1NAME NUTS2 NUTS2NAME NUTS3 NUTS3NAME COUNTY COUNTYNAME CSOED OSIED EDNAME SMALL_AREA GEOGID MALE2011 FEMALE2011 TOTAL2011 PPOCC2011 UNOCC2011 VACANT2011 HS2011 PCVAC2011 CREATEDATE
----- ------- ----------- ------- -------------------- ------- --------------- -------- -------------- ------- ------- ------------------------------ ------------ ---------- ---------- ------------ ----------- ----------- ----------- ------------ -------- ----------- ------------
0 IE0 Ireland IE02 Southern and Eastern IE022 Mid-East 15 Wicklow County 15039 257005 Aughrim 257005002 A257005002 137 138 275 84 18 15 102 14.7 27-03-2012
1 IE0 Ireland IE02 Southern and Eastern IE024 South-East (IE) 01 Carlow County 01054 017049 Tinnahinch 017049001 A017049001 186 176 362 111 25 24 136 17.6 27-03-2012
2 IE0 Ireland IE02 Southern and Eastern IE024 South-East (IE) 01 Carlow County 01053 017032 Marley 017032001 A017032001 194 173 367 121 8 5 129 3.9 27-03-2012
3 IE0 Ireland IE02 Southern and Eastern IE024 South-East (IE) 01 Carlow County 01054 017049 Tinnahinch 017049002 A017049002 75 75 150 67 29 29 96 30.2 27-03-2012
4 IE0 Ireland IE02 Southern and Eastern IE024 South-East (IE) 01 Carlow County 01054 017049 Tinnahinch 017049003 A017049003 84 81 165 64 16 14 80 17.5 27-03-2012
5 IE0 Ireland IE02 Southern and Eastern IE024 South-East (IE) 01 Carlow County 01015 017005 Ballyellin 017005002 A017005002 105 99 204 71 6 5 77 6.5 27-03-2012
6 IE0 Ireland IE02 Southern and Eastern IE024 South-East (IE) 01 Carlow County 01015 017005 Ballyellin 017005001 A017005001 115 108 223 70 9 8 79 10.1 27-03-2012
7 IE0 Ireland IE02 Southern and Eastern IE024 South-East (IE) 01 Carlow County 01033 017033 Muinebeag (Bagenalstown) Rural 017033001 A017033001 201 205 406 143 15 14 158 8.9 27-03-2012
8 IE0 Ireland IE02 Southern and Eastern IE024 South-East (IE) 01 Carlow County 01034 017034 Muinebeag (Bagenalstown) Urban 017034002 A017034002 142 116 258 89 9 9 98 9.2 27-03-2012
9 IE0 Ireland IE02 Southern and Eastern IE024 South-East (IE) 01 Carlow County 01034 017034 Muinebeag (Bagenalstown) Urban 017034003 A017034003 174 169 343 107 6 4 113 3.5 27-03-2012
10 out of 18488 features
Unique Values and Counts
Next we’ll get a list of the unique COUNTYNAME entries and a count to see how many small areas are in each. (The below works for text fields only)
# rest to first feature
lyr.ResetReading()
# field to return unique list and count of
field = "COUNTYNAME"
# create empty dictionary
values_dict = {}
# for each feature
for feature in lyr:
attribute = feature.GetField(field).decode("iso8859_14")
# if the COUNTYNAME is not already in the dictionary add it and assign a value of 1
if attribute not in values_dict:
values_dict[attribute] = 1
# otherwise do not add it and increase the existing value by 1
else:
values_dict[attribute] = values_dict[attribute] + 1
## convert dictionary to list for use with tabulate
key_value_list = [[key, value] for key, value in values_dict.items()]
## print results
print "\nTotal Feature Count: {0}\n".format(lyr.GetFeatureCount())
print tabulate(sorted(key_value_list), [field, "Count"])
And here’s the output…
Total Feature Count: 18488
COUNTYNAME Count
---------------------- -------
Carlow County 210
Cavan County 322
Clare County 511
Cork City 519
Cork County 1650
Donegal County 761
Dublin City 2202
Dún Laoghaire-Rathdown 760
Fingal 938
Galway City 307
Galway County 741
Kerry County 701
Kildare County 731
Kilkenny County 372
Laois County 305
Leitrim County 173
Limerick City 258
Limerick County 514
Longford County 179
Louth County 462
Mayo County 643
Meath County 636
Monaghan County 244
North Tipperary 283
Offaly County 286
Roscommon County 303
Sligo County 307
South Dublin 906
South Tipperary 350
Waterford City 198
Waterford County 275
Westmeath County 338
Wexford County 615
Wicklow County 488
Alternatively we could print out based on the highest count descending by replacing the last print statement with…
# http://stackoverflow.com/questions/17555218/python-how-to-sort-a-list-of-lists-by-the-fourth-element-in-each-list
print tabulate(sorted(key_value_list, key = itemgetter(1), reverse = True), [field, "Count"])
…to get…
COUNTYNAME Count
---------------------- -------
Dublin City 2202
Cork County 1650
Fingal 938
South Dublin 906
Donegal County 761
...
I will add to these as I come across something useful. If you know of any neat things to add please comment below. Please also comment if anything is unclear or if this was useful to you.
See Also…
Setting up GDAL/OGR with FileGDB Driver for Python on Windows
Measuring Geographic Distributions #1.1 – Mean Center
Measuring Geographic Distributions #2.1 – Central Feature
Measuring Geographic Distributions #3.1 – Median Center