(Open Source Geospatial Python)
Here we will look at the general makeup of a downloaded spatial dataset – a Shapefile from the Central Statistics Office in Ireland containing census data from 2011. We will look at getting the spatial reference of the file along with a breakdown of the field names, type, width and precision. We can print the top ten records or the entire attribute table and get a list of unique values for a field and the count of each.
Download the Small Areas of Ireland from the CSO. You will have to acknowledge a disclaimer. Once downloaded unzip Census2011_Small_Areas_generalised20m.zip to working folder. We will now begin to interrogate this Shapefile.
First we import the necessary modules…
# import modules from osgeo import ogr from tabulate import tabulate from operator import itemgetter
tabulate will allow us to print out formatted tables. Using ogr we can access the inner workings of the downloaded Shapefile. Please note that osgeo and tabulate are not standard Python libraries and will need to be installed.
Using the ESRI Shapefile driver we open the Shapefile in read mode (0) and access the data (lyr).
# use Shapefile driver driver = ogr.GetDriverByName("ESRI Shapefile") # reference Shapefile shp = r"C:\Users\Glen B\Documents\GDAL\shp\Census2011_Small_Areas_generalised20m.shp" # open the file ds = driver.Open(shp, 0) # reference the only layer in a Shapefile lyr = ds.GetLayer(0)
Spatial Reference Information
Straight away we cant print the spatial reference information associated with the Shapefile (contained in the .prj file)
print lyr.GetSpatialRef()
This will print out…
PROJCS["TM65_Irish_Grid", GEOGCS["GCS_TM65", DATUM["TM65", SPHEROID["Airy_Modified",6377340.189,299.3249646]], PRIMEM["Greenwich",0.0], UNIT["Degree",0.0174532925199433]], PROJECTION["Transverse_Mercator"], PARAMETER["False_Easting",200000.0], PARAMETER["False_Northing",250000.0], PARAMETER["Central_Meridian",-8.0], PARAMETER["Scale_Factor",1.000035], PARAMETER["Latitude_Of_Origin",53.5], UNIT["Meter",1.0]]
You can also access this information individually…
# projected coordinate system proj_string = lyr.GetSpatialRef().GetAttrValue("PROJCS", 0) # geographic coordinate system geog_string = lyr.GetSpatialRef().GetAttrValue("GEOGCS", 0) # EPSG Code if available epsg = lyr.GetSpatialRef().GetAttrValue("AUTHORITY", 1) # datum datum = lyr.GetSpatialRef().GetAttrValue("DATUM", 0) print "\nFile: {0}\n\nProjected: {1}\nEPSG: {2}\n".format(lyr.GetName(),proj_string, epsg) print "Geographic: {0}\nDatum: {1}\n".format(geog_string, datum)
The output…
File: Census2011_Small_Areas_generalised20m Projected: TM65_Irish_Grid EPSG: None Geographic: GCS_TM65 Datum: TM65
If there is an EPSG code in the .prj file it will be printed instead of None.
Geometry Type
If we reference the first feature we can get the geometry of the Shapefile…
first_feat = lyr.GetFeature(1) print "Geometry Type: {0}\n".format(first_feat.geometry().GetGeometryName())
In this instance it is a polygon Shapefile.
Geometry Type: POLYGON
Field Information
Let’s get some information on the data through the Layer Definition.
# https://pcjericks.github.io/py-gdalogr-cookbook/vector_layers.html
lyr_def = lyr.GetLayerDefn()
But before we do we need to create a few list structures. These will be used to hold the accessed information and enable us to neatly print them to screen.
# list to hold headers for filed information header_list = ["FIELD NAME", "TYPE", "WIDTH", "PRECISION"] # list will be populated with field information output_list = [] # list will be populated with field names and used for attribute headers fld_names = []
Cycle through each field and populate the necessary lists…
# for each field for i in range(lyr_def.GetFieldCount()): # reference the field name fld_name = lyr_def.GetFieldDefn(i).GetName() # reference the field type fld_type = lyr_def.GetFieldDefn(i).GetFieldTypeName(lyr_def.GetFieldDefn(i).GetType()) # reference the field width fld_width = lyr_def.GetFieldDefn(i).GetWidth() # reference the field precision fld_precision = lyr_def.GetFieldDefn(i).GetPrecision() # append these as a list to the output_list output_list.append([fld_name, fld_type, str(fld_width), str(fld_precision)]) # append field name to fld_name fld_names.append(fld_name)
The output_list is a list of lists containing information for each field, the field name, data type, width and precision, this is matched in the header_list. The fld_names will be used further down to print out attributes, this list hold the field names as headers. Let’s print the field information…
print "{0}\n".format(tabulate(output_list, header_list))
Here’s the output…
FIELD NAME TYPE WIDTH PRECISION
------------ ------ ------- -----------
NUTS1 String 3 0
NUTS1NAME String 7 0
NUTS2 String 4 0
NUTS2NAME String 26 0
NUTS3 String 5 0
NUTS3NAME String 15 0
COUNTY String 2 0
COUNTYNAME String 25 0
CSOED String 11 0
OSIED String 13 0
EDNAME String 45 0
SMALL_AREA String 61 0
GEOGID String 65 0
MALE2011 Real 20 10
FEMALE2011 Real 20 10
TOTAL2011 Real 20 10
PPOCC2011 Real 20 10
UNOCC2011 Real 20 10
VACANT2011 Real 20 10
HS2011 Real 20 10
PCVAC2011 Real 20 10
CREATEDATE String 10 0
Attribute Table
Next we print out some attributes for a set of features, the first ten.
# number of features from the first to print attributes for num_to_return = 10 #num_to_return = lyr.GetFeatureCount()
Use the commented out line if you want to print attributes for all features. Create an empty list to hold the attributes. Some fields contain characters from the Irish language so we account for this so that the attributes are printed correctly.
# list will be populated with attribute data att_table = [] # for each feature in the Shapefile for count, feature in enumerate(lyr): # up to the number of set features to print if count < num_to_return: # count will beacome the Feature ID atts = [count] # for each field append the data to atts list for name in fld_names: try: # if the attribute is a string then decode with Celtic Languages atts.append(feature.GetField(name).decode("iso8859_14")) except Exception: atts.append(feature.GetField(name)) # append the data for the feature to the att_table list att_table.append(atts)
The count becomes the Feature ID but we have no field for this so we will create one…
# add a FID header (count) fld_names.insert(0, "FID")
So let’s print out the attributes…
print tabulate(att_table, fld_names) print "{0} out of {1} features".format(num_to_return, lyr.GetFeatureCount())
Here’s the output…
FID NUTS1 NUTS1NAME NUTS2 NUTS2NAME NUTS3 NUTS3NAME COUNTY COUNTYNAME CSOED OSIED EDNAME SMALL_AREA GEOGID MALE2011 FEMALE2011 TOTAL2011 PPOCC2011 UNOCC2011 VACANT2011 HS2011 PCVAC2011 CREATEDATE
----- ------- ----------- ------- -------------------- ------- --------------- -------- -------------- ------- ------- ------------------------------ ------------ ---------- ---------- ------------ ----------- ----------- ----------- ------------ -------- ----------- ------------
0 IE0 Ireland IE02 Southern and Eastern IE022 Mid-East 15 Wicklow County 15039 257005 Aughrim 257005002 A257005002 137 138 275 84 18 15 102 14.7 27-03-2012
1 IE0 Ireland IE02 Southern and Eastern IE024 South-East (IE) 01 Carlow County 01054 017049 Tinnahinch 017049001 A017049001 186 176 362 111 25 24 136 17.6 27-03-2012
2 IE0 Ireland IE02 Southern and Eastern IE024 South-East (IE) 01 Carlow County 01053 017032 Marley 017032001 A017032001 194 173 367 121 8 5 129 3.9 27-03-2012
3 IE0 Ireland IE02 Southern and Eastern IE024 South-East (IE) 01 Carlow County 01054 017049 Tinnahinch 017049002 A017049002 75 75 150 67 29 29 96 30.2 27-03-2012
4 IE0 Ireland IE02 Southern and Eastern IE024 South-East (IE) 01 Carlow County 01054 017049 Tinnahinch 017049003 A017049003 84 81 165 64 16 14 80 17.5 27-03-2012
5 IE0 Ireland IE02 Southern and Eastern IE024 South-East (IE) 01 Carlow County 01015 017005 Ballyellin 017005002 A017005002 105 99 204 71 6 5 77 6.5 27-03-2012
6 IE0 Ireland IE02 Southern and Eastern IE024 South-East (IE) 01 Carlow County 01015 017005 Ballyellin 017005001 A017005001 115 108 223 70 9 8 79 10.1 27-03-2012
7 IE0 Ireland IE02 Southern and Eastern IE024 South-East (IE) 01 Carlow County 01033 017033 Muinebeag (Bagenalstown) Rural 017033001 A017033001 201 205 406 143 15 14 158 8.9 27-03-2012
8 IE0 Ireland IE02 Southern and Eastern IE024 South-East (IE) 01 Carlow County 01034 017034 Muinebeag (Bagenalstown) Urban 017034002 A017034002 142 116 258 89 9 9 98 9.2 27-03-2012
9 IE0 Ireland IE02 Southern and Eastern IE024 South-East (IE) 01 Carlow County 01034 017034 Muinebeag (Bagenalstown) Urban 017034003 A017034003 174 169 343 107 6 4 113 3.5 27-03-2012
10 out of 18488 features
Unique Values and Counts
Next we’ll get a list of the unique COUNTYNAME entries and a count to see how many small areas are in each. (The below works for text fields only)
# rest to first feature lyr.ResetReading() # field to return unique list and count of field = "COUNTYNAME" # create empty dictionary values_dict = {} # for each feature for feature in lyr: attribute = feature.GetField(field).decode("iso8859_14") # if the COUNTYNAME is not already in the dictionary add it and assign a value of 1 if attribute not in values_dict: values_dict[attribute] = 1 # otherwise do not add it and increase the existing value by 1 else: values_dict[attribute] = values_dict[attribute] + 1 ## convert dictionary to list for use with tabulate key_value_list = [[key, value] for key, value in values_dict.items()] ## print results print "\nTotal Feature Count: {0}\n".format(lyr.GetFeatureCount()) print tabulate(sorted(key_value_list), [field, "Count"])
And here’s the output…
Total Feature Count: 18488 COUNTYNAME Count ---------------------- ------- Carlow County 210 Cavan County 322 Clare County 511 Cork City 519 Cork County 1650 Donegal County 761 Dublin City 2202 Dún Laoghaire-Rathdown 760 Fingal 938 Galway City 307 Galway County 741 Kerry County 701 Kildare County 731 Kilkenny County 372 Laois County 305 Leitrim County 173 Limerick City 258 Limerick County 514 Longford County 179 Louth County 462 Mayo County 643 Meath County 636 Monaghan County 244 North Tipperary 283 Offaly County 286 Roscommon County 303 Sligo County 307 South Dublin 906 South Tipperary 350 Waterford City 198 Waterford County 275 Westmeath County 338 Wexford County 615 Wicklow County 488
Alternatively we could print out based on the highest count descending by replacing the last print statement with…
# http://stackoverflow.com/questions/17555218/python-how-to-sort-a-list-of-lists-by-the-fourth-element-in-each-list print tabulate(sorted(key_value_list, key = itemgetter(1), reverse = True), [field, "Count"])
…to get…
COUNTYNAME Count ---------------------- ------- Dublin City 2202 Cork County 1650 Fingal 938 South Dublin 906 Donegal County 761 ...
I will add to these as I come across something useful. If you know of any neat things to add please comment below. Please also comment if anything is unclear or if this was useful to you.
See Also…
Setting up GDAL/OGR with FileGDB Driver for Python on Windows
Measuring Geographic Distributions #1.1 – Mean Center
Measuring Geographic Distributions #2.1 – Central Feature
Measuring Geographic Distributions #3.1 – Median Center