Geospatial exploratory data analysis with GeoPandas and DuckDB


In this article, I’ll show you how to use two popular Python libraries to perform some geospatial analysis of traffic accident data in the UK.

I was a relatively early adopter of DuckDB, the fast OLAP database, but only recently realised that, through an extension, it offers a large number of potentially useful geospatial functions.

GeoPandas was new to me. It’s a Python library that makes working with geographic data feel like working with regular pandas, but with geometry (points, lines, polygons) built in. You can read/write standard GIS formats (GeoJSON, Shapefile, GeoPackage), manipulate attributes and geometries together, and quickly visualise layers with Matplotlib.

Wanting to try out the capabilities of both, I set about finding a mini-project that would be both interesting and a helpful learning experience.

Long story short, I decided to use both libraries to determine the safest city in the UK for driving or walking around. It’s possible that you could do everything I’m about to show using GeoPandas on its own, but I’m not as familiar with it as I am with DuckDB, which is why I used both.

A few ground rules.

The accident and casualty data I’ll be using is from an official UK government source that covers the entire country. However, I’ll focus on the data for just six of the UK’s largest cities: London, Edinburgh, Cardiff, Glasgow, Birmingham, and Manchester.

My method for determining the “safest” city is to calculate the total number of road traffic-related casualties in each city over a five-year period and divide this number by the area of each city in km². This will be my “safety index”, and the lower that number is, the safer the city.
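As a plain-Python sketch of that calculation (the city figures here are made up purely for illustration):

```python
def safety_index(total_casualties: int, area_km2: float) -> float:
    """Casualties per km² over the five-year period; lower means safer."""
    return total_casualties / area_km2

# Hypothetical city: 1,900 casualties spread over 96 km²
print(round(safety_index(1900, 96.0), 2))  # 19.79
```

We’ll compute the real inputs for this formula over the rest of the article.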

Getting our city boundary data

This was arguably the most difficult part of the whole process.

Rather than treating a “city” as a single administrative polygon, which leads to various anomalies in terms of city areas, I modelled each one by its built-up area (BUA) footprint. I did this using the Office for National Statistics (ONS) Built-Up Areas map data layer and then aggregated all BUA parts that fall inside a sensible administrative boundary (a “mask”) for that location. The mask comes from the official ONS boundaries and is chosen to reflect each wider urban area:

  • London → the London Region (Regions Dec 2023).
  • Manchester → the union of the ten Greater Manchester Local Authority Districts (LADs) for May 2024.
  • Birmingham → the union of the 7 West Midlands Combined Authority LADs.
  • Glasgow → the union of the 8 Glasgow City Region councils.
  • Edinburgh and Cardiff → their single LAD.

How the polygons are built in code

I downloaded boundary data directly from the UK Office for National Statistics (ONS) ArcGIS FeatureServer in GeoJSON format using the requests library. For each city, we first construct a mask from official ONS admin layers: the London Region (Dec 2023) for London; the union of Local Authority Districts (LAD, May 2024) for Greater Manchester (10 LADs), the West Midlands core (7 LADs) and the Glasgow City Region (8 councils); and the single LAD for Edinburgh and Cardiff.

Next, I query the ONS Built-Up Areas (BUA 2022) layer for polygons intersecting the mask’s bounding box, keeping only those that intersect the mask, and dissolve (merge) the results to create a single multipart polygon per city (“BUA 2022 aggregate”). Data are stored and plotted in EPSG:4326 (WGS84), i.e., latitude and longitude expressed in degrees. When reporting areas, we reproject to EPSG:27700 (OSGB) and calculate the area in square kilometres to avoid distortions.
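To see why areas shouldn’t be computed in raw degrees: a degree of latitude covers a roughly constant ground distance, but a degree of longitude shrinks towards the poles. A quick back-of-the-envelope check in plain Python, using the common ~111.32 km-per-degree approximation:

```python
import math

KM_PER_DEG_LAT = 111.32  # roughly constant at all latitudes

def km_per_deg_lon(lat_deg: float) -> float:
    """Approximate ground distance of one degree of longitude at a given latitude."""
    return KM_PER_DEG_LAT * math.cos(math.radians(lat_deg))

# At UK latitudes a "square degree" is far from square:
print(round(km_per_deg_lon(51.5), 1))  # London,    ~69.3 km
print(round(km_per_deg_lon(55.9), 1))  # Edinburgh, ~62.4 km
```

This distortion is exactly why areas are reported after reprojecting to the metre-based EPSG:27700.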

Each city’s boundary data is downloaded to a GeoJSON file and loaded into Python using the geopandas and requests libraries.

To show that the boundary data we have is correct, the individual city layers are then combined into a single GeoDataFrame, reprojected into a consistent coordinate reference system (EPSG:4326), and plotted on top of a clean UK outline derived from the Natural Earth dataset (via a GitHub mirror). To focus only on the mainland, we crop the UK outline to the bounding box of the cities, excluding distant overseas territories. The area of each city is also calculated.

Boundary data licensing

All of the boundary datasets I’ve used are open data with permissive reuse terms.

London, Edinburgh, Cardiff, Glasgow, Birmingham & Manchester

  • Source: Office for National Statistics (ONS).
  • License: Open Government Licence v3.0 (OGL).
  • Terms: You are free to use, modify, and share the data (even commercially) as long as you provide attribution.

UK Outline

  • Source: Natural Earth.
  • License: Public Domain (no restrictions).
  • Citation: Natural Earth. Public domain. Available from: https://www.naturalearthdata.com

City boundary code

Here is the Python code you can use to download the data files for each city and confirm the boundaries on a map.

Before running the main code, please ensure you have installed the following libraries. You can use pip or another method of your choice.

geopandas
matplotlib
requests
pandas
duckdb
jupyter

import requests, json
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt

# ---------- ONS endpoints ----------
LAD24_FS = "https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services/Local_Authority_Districts_May_2024_Boundaries_UK_BGC/FeatureServer/0/query"
REGION23_FS = "https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services/Regions_December_2023_Boundaries_EN_BGC/FeatureServer/0/query"
BUA_FS   = "https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services/BUA_2022_GB/FeatureServer/0/query"

# ---------- Helpers ----------
def arcgis_geojson(url, params):
    r = requests.get(url, params={**params, "f": "geojson"}, timeout=90)
    r.raise_for_status()
    return r.json()

def sql_quote(s: str) -> str:
    return "'" + s.replace("'", "''") + "'"

def fetch_lads_by_names(names):
    where = "LAD24NM IN ({})".format(",".join(sql_quote(n) for n in names))
    data = arcgis_geojson(LAD24_FS, {
        "where": where,
        "outFields": "LAD24NM",
        "returnGeometry": "true",
        "outSR": "4326"
    })
    gdf = gpd.GeoDataFrame.from_features(data.get("features", []), crs="EPSG:4326")
    if gdf.empty:
        raise RuntimeError(f"No LADs found for: {names}")
    return gdf

def fetch_lad_by_name(name):
    return fetch_lads_by_names([name])

def fetch_region_by_name(name):
    data = arcgis_geojson(REGION23_FS, {
        "where": f"RGN23NM={sql_quote(name)}",
        "outFields": "RGN23NM",
        "returnGeometry": "true",
        "outSR": "4326"
    })
    gdf = gpd.GeoDataFrame.from_features(data.get("features", []), crs="EPSG:4326")
    if gdf.empty:
        raise RuntimeError(f"No Region feature for: {name}")
    return gdf

def fetch_buas_intersecting_bbox(minx, miny, maxx, maxy):
    data = arcgis_geojson(BUA_FS, {
        "where": "1=1",
        "geometryType": "esriGeometryEnvelope",
        "geometry": json.dumps({
            "xmin": float(minx), "ymin": float(miny),
            "xmax": float(maxx), "ymax": float(maxy),
            "spatialReference": {"wkid": 4326}
        }),
        "inSR": "4326",
        "spatialRel": "esriSpatialRelIntersects",
        "outFields": "BUA22NM,BUA22CD,Shape__Area",
        "returnGeometry": "true",
        "outSR": "4326"
    })
    return gpd.GeoDataFrame.from_features(data.get("features", []), crs="EPSG:4326")

def aggregate_bua_by_mask(mask_gdf: gpd.GeoDataFrame, label: str) -> gpd.GeoDataFrame:
    """
    Fetch BUA 2022 polygons intersecting a mask (LAD/Region union) and dissolve to 1 polygon.
    Uses Shapely 2.x union_all() to construct the mask geometry.
    """
    # Union the mask polygon(s)
    mask_union = mask_gdf.geometry.union_all()

    # Fetch candidate BUAs via mask bbox, then filter by real intersection with the union
    minx, miny, maxx, maxy = gpd.GeoSeries([mask_union], crs="EPSG:4326").total_bounds
    buas = fetch_buas_intersecting_bbox(minx, miny, maxx, maxy)
    if buas.empty:
        raise RuntimeError(f"No BUAs intersecting bbox for {label}")
    buas = buas[buas.intersects(mask_union)]
    if buas.empty:
        raise RuntimeError(f"No BUAs actually intersect mask for {label}")

    dissolved = buas[["geometry"]].dissolve().reset_index(drop=True)
    dissolved["city"] = label
    return dissolved[["city", "geometry"]]

# ---------- Group definitions ----------
GM_10 = ["Manchester","Salford","Trafford","Stockport","Tameside",
         "Oldham","Rochdale","Bury","Bolton","Wigan"]

WMCA_7 = ["Birmingham","Coventry","Dudley","Sandwell","Solihull","Walsall","Wolverhampton"]

GLASGOW_CR_8 = ["Glasgow City","East Dunbartonshire","West Dunbartonshire",
                "East Renfrewshire","Renfrewshire","Inverclyde",
                "North Lanarkshire","South Lanarkshire"]

EDINBURGH = "City of Edinburgh"
CARDIFF   = "Cardiff"

# ---------- Construct masks ----------
london_region = fetch_region_by_name("London")     # Region mask for London
gm_lads   = fetch_lads_by_names(GM_10)             # Greater Manchester (10)
wmca_lads = fetch_lads_by_names(WMCA_7)            # West Midlands (7)
gcr_lads  = fetch_lads_by_names(GLASGOW_CR_8)      # Glasgow City Region (8)
edi_lad   = fetch_lad_by_name(EDINBURGH)           # Single LAD
cdf_lad   = fetch_lad_by_name(CARDIFF)             # Single LAD

# ---------- Aggregate BUAs by each mask ----------
layers = []

london_bua = aggregate_bua_by_mask(london_region, "London (BUA 2022 aggregate)")
london_bua.to_file("london_bua_aggregate.geojson", driver="GeoJSON")
layers.append(london_bua)

man_bua = aggregate_bua_by_mask(gm_lads, "Manchester (BUA 2022 aggregate)")
man_bua.to_file("manchester_bua_aggregate.geojson", driver="GeoJSON")
layers.append(man_bua)

bham_bua = aggregate_bua_by_mask(wmca_lads, "Birmingham (BUA 2022 aggregate)")
bham_bua.to_file("birmingham_bua_aggregate.geojson", driver="GeoJSON")
layers.append(bham_bua)

glas_bua = aggregate_bua_by_mask(gcr_lads, "Glasgow (BUA 2022 aggregate)")
glas_bua.to_file("glasgow_bua_aggregate.geojson", driver="GeoJSON")
layers.append(glas_bua)

edi_bua = aggregate_bua_by_mask(edi_lad, "Edinburgh (BUA 2022 aggregate)")
edi_bua.to_file("edinburgh_bua_aggregate.geojson", driver="GeoJSON")
layers.append(edi_bua)

cdf_bua = aggregate_bua_by_mask(cdf_lad, "Cardiff (BUA 2022 aggregate)")
cdf_bua.to_file("cardiff_bua_aggregate.geojson", driver="GeoJSON")
layers.append(cdf_bua)

# ---------- Mix & report areas ----------
cities = gpd.GeoDataFrame(pd.concat(layers, ignore_index=True), crs="EPSG:4326")

# ---------- Plot UK outline + the six aggregates ----------
# UK outline (Natural Earth 1:10m, easy countries)
ne_url = "https://raw.githubusercontent.com/nvkelso/natural-earth-vector/master/geojson/ne_10m_admin_0_countries.geojson"
world = gpd.read_file(ne_url)
uk = world[world["ADMIN"] == "United Kingdom"].to_crs(4326)

# Crop frame to our cities
minx, miny, maxx, maxy = cities.total_bounds
uk_crop = uk.cx[minx-5 : maxx+5, miny-5 : maxy+5]

fig, ax = plt.subplots(figsize=(9, 10), dpi=150)
uk_crop.boundary.plot(ax=ax, color="black", linewidth=1.2)
cities.plot(ax=ax, column="city", alpha=0.45, edgecolor="black", linewidth=0.8, legend=True)

# Label each polygon using an interior point
label_pts = cities.representative_point()
for (x, y), name in zip(label_pts.geometry.apply(lambda p: (p.x, p.y)), cities["city"]):
    ax.text(x, y, name, fontsize=8, ha="center", va="center")

ax.set_title("BUA 2022 Aggregates: London, Manchester, Birmingham, Glasgow, Edinburgh, Cardiff", fontsize=12)
ax.set_xlim(minx-1, maxx+1)
ax.set_ylim(miny-1, maxy+1)
ax.set_aspect("equal", adjustable="box")
ax.set_xlabel("Longitude"); ax.set_ylabel("Latitude")
plt.tight_layout()
plt.show()

After running this code, you should have six GeoJSON files in your current directory, and you should also see an output like this, which visually confirms that our city boundary files contain valid data.

Getting our accident data

Our final piece of the data puzzle is the accident data. The UK government publishes road collision data covering five-year periods; the most recent covers 2019 to 2024. This data set covers the whole UK, so we will need to process it to extract only the data for the six cities we’re interested in. That’s where DuckDB comes in, but more on that later.

To view or download the accident data in CSV format (Source: Department for Transport — Road Safety Data), use the link below:

https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-collision-last-5-years.csv

As with the city boundary data, this is also published under the Open Government Licence v3.0 (OGL 3.0) and so has the same licence conditions.

The accident data set contains a large number of fields, but for our purposes, we’re only interested in three of them:

latitude
longitude
number_of_casualties
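To get a feel for the extraction before involving DuckDB, here is a stdlib-only sketch that pulls those three columns out of a CSV. The two data rows are invented stand-ins; the real file has many more columns:

```python
import csv
import io

# Invented miniature stand-in for the real DfT collisions CSV.
sample = io.StringIO(
    "latitude,longitude,number_of_casualties,weather_conditions\n"
    "51.5080,-0.1538,3,1\n"
    "51.4362,-0.1279,1,2\n"
)

# Keep only the three columns of interest, coerced to numeric types.
rows = [
    (float(r["latitude"]), float(r["longitude"]), int(r["number_of_casualties"]))
    for r in csv.DictReader(sample)
]
print(rows[0])                     # (51.508, -0.1538, 3)
print(sum(c for _, _, c in rows))  # 4 casualties in the sample
```

DuckDB will do this same selection (plus type coercion and null filtering) in one SQL statement, far faster, in the next step.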

Obtaining our casualty counts for each city is now just a three-step process.

1/ Loading the accident data set into DuckDB

If you’ve never encountered DuckDB before, the TL;DR is that it’s a super-fast, in-memory (it can also be persistent) database written in C++, designed for analytical SQL workloads.

One of the main reasons I like it is its speed; it’s one of the fastest third-party data analytics libraries I’ve used. It’s also very extensible through extensions such as the spatial one, which we’ll use now.

Now we can load the accident data like this.

import requests
import duckdb

# Distant CSV (last 5 years collisions)
url = "https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-collision-last-5-years.csv"
local_file = "collisions_5yr.csv"

# Download the file
r = requests.get(url, stream=True)
r.raise_for_status()
with open(local_file, "wb") as f:
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)

print(f"Downloaded {local_file}")

# Connect once
con = duckdb.connect(database=':memory:')

# Install + load spatial on THIS connection
con.execute("INSTALL spatial;")
con.execute("LOAD spatial;")

# Create accidents table with geometry
con.execute("""
    CREATE TABLE accidents AS
    SELECT
        TRY_CAST(Latitude AS DOUBLE) AS latitude,
        TRY_CAST(Longitude AS DOUBLE) AS longitude,
        TRY_CAST(Number_of_Casualties AS INTEGER) AS number_of_casualties,
        ST_Point(TRY_CAST(Longitude AS DOUBLE), TRY_CAST(Latitude AS DOUBLE)) AS geom
    FROM read_csv_auto('collisions_5yr.csv', header=True, nullstr='NULL')
    WHERE 
        TRY_CAST(Latitude AS DOUBLE) IS NOT NULL AND 
        TRY_CAST(Longitude AS DOUBLE) IS NOT NULL AND
        TRY_CAST(Number_of_Casualties AS INTEGER) IS NOT NULL
""")

# Quick preview
print(con.execute("DESCRIBE accidents").df())
print(con.execute("SELECT * FROM accidents LIMIT 5").df())

You should see the following output.

Downloaded collisions_5yr.csv
            column_name column_type null   key default extra
0              latitude      DOUBLE  YES  None    None  None
1             longitude      DOUBLE  YES  None    None  None
2  number_of_casualties     INTEGER  YES  None    None  None
3                  geom    GEOMETRY  YES  None    None  None
    latitude  longitude  number_of_casualties  
0  51.508057  -0.153842                     3   
1  51.436208  -0.127949                     1   
2  51.526795  -0.124193                     1   
3  51.546387  -0.191044                     1   
4  51.541121  -0.200064                     2   

                                                geom  
0  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...  
1  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...  
2  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...  
3  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...  
4  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...

2/ Loading the city boundary data using DuckDB spatial functions

The function we’ll use for this is called ST_Read, which can read and import a variety of geospatial file formats via the GDAL library.

city_files = {
    "London": "london_bua_aggregate.geojson",
    "Edinburgh": "edinburgh_bua_aggregate.geojson",
    "Cardiff": "cardiff_bua_aggregate.geojson",
    "Glasgow": "glasgow_bua_aggregate.geojson",
    "Manchester": "manchester_bua_aggregate.geojson",
    "Birmingham": "birmingham_bua_aggregate.geojson"
}

for city, file in city_files.items():
    con.execute(f"""
        CREATE TABLE {city.lower()} AS
        SELECT
            '{city}' AS city,
            geom
        FROM ST_Read('{file}')
    """)

con.execute("""
    CREATE TABLE cities AS
    SELECT * FROM london
    UNION ALL SELECT * FROM edinburgh
    UNION ALL SELECT * FROM cardiff
    UNION ALL SELECT * FROM glasgow
    UNION ALL SELECT * FROM manchester
    UNION ALL SELECT * FROM birmingham
""")

3/ Joining the accidents to the city polygons and counting casualties

The key geospatial function we use this time is called ST_Within. It returns TRUE if the first geometry lies inside the boundary of the second.
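Conceptually, this is a point-in-polygon test. Here is a minimal pure-Python ray-casting sketch of the idea; DuckDB’s actual implementation is far more capable, handling multipolygons, holes, and geometric edge cases:

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: cast a ray from (x, y) towards +x and count how many
    polygon edges it crosses; an odd count means the point is inside.
    `polygon` is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:       # crossing is to the right of the point
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))  # True
print(point_in_polygon(5, 2, square))  # False
```

With that intuition in place, the real join is a single SQL query.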


casualties_per_city = con.execute("""
    SELECT 
        c.city, 
        SUM(a.number_of_casualties) AS total_casualties,
        COUNT(*) AS accident_count
    FROM accidents a
    JOIN cities c
      ON ST_Within(a.geom, c.geom)
    GROUP BY c.city
    ORDER BY total_casualties DESC
""").df()

print("Casualties per city:")
print(casualties_per_city)

Note that I ran the above query on a powerful desktop, and it still took a few minutes to return results, so please be patient. Eventually, you should see output similar to this.

Casualties per city:

         city  total_casualties  accident_count
0      London          134328.0          115697
1  Birmingham           14946.0           11119
2  Manchester            4518.0            3502
3     Glasgow            3978.0            3136
4   Edinburgh            3116.0            2600
5     Cardiff            1903.0            1523

Evaluation

It’s no surprise that London has, overall, the largest number of casualties. But for its size, is it more or less dangerous to drive or walk in than the other cities?

There is clearly an issue with the casualty figures for Manchester and Glasgow: both should be larger, given their city sizes. One suggestion is that many of the busy ring roads and metro fringe roads (M8/M74/M77; M60/M62/M56/M61) associated with each city sit just outside their tight BUA polygons, leading to a significant under-representation of accident and casualty data. I’ll leave that investigation as an exercise for the reader!

For our final determination of driver safety, we need each city’s area so we can calculate the casualty rate per km².

Luckily, DuckDB has a function for that: ST_Area computes the area of a geometry.

# --- Compute areas in km² (CRS84 -> OSGB 27700) ---
print("\nCalculating areas in km^2...")
areas = con.execute("""
SELECT
  city,
  ST_Area(
    ST_MakeValid(
      ST_Transform(
          -- ST_Simplify(geom, 0.001), -- Experiment with epsilon value (e.g., 0.001 degrees)
          geom, 
          'OGC:CRS84','EPSG:27700'
      )
    )
  ) / 1e6 AS area_km2
FROM cities 
ORDER BY area_km2 DESC;
""").df()

print("City Areas:")
print(areas.round(2))

I obtained this output, which looks about right.

          city  area_km2
0      London   1321.45
1  Birmingham    677.06
2  Manchester    640.54
3     Glasgow    481.49
4   Edinburgh    123.00
5     Cardiff     96.08

We now have all the data we need to declare which city is the safest in the UK. Remember: the lower the “safety_index” number, the safer.

   city         area_km2     casualties          safety_index (casualties/area)
0  London       1321.45      134328              101.65
1  Birmingham   677.06       14946               22.07
2  Manchester   640.54        4518                7.05
3  Glasgow      481.49        3978                8.26
4  Edinburgh    123.00        3116               25.33
5  Cardiff       96.08        1903               19.08

I’m not comfortable including the results for Manchester and Glasgow, due to the doubts about their casualty counts mentioned earlier.

Taking that into consideration, and since I’m the boss of this article, I’m declaring Cardiff the winner of the prize for the safest city from a driver and pedestrian perspective. What do you think of these results? Do you live in one of the cities I looked at? If so, do the results match your experience of driving or walking there?

Summary

We examined the feasibility of performing exploratory data analysis on a geospatial dataset. Our goal was to determine which of six leading UK cities was safest to drive or walk in. Using a combination of GeoPandas and DuckDB, we were able to:

  • Use GeoPandas to download city boundary data from an official government website that reasonably accurately represents the size of each city.
  • Download and post-process a detailed five-year accident survey CSV to obtain the three fields of interest: latitude, longitude and number of road accident casualties.
  • Join the accident data to the city boundary data on latitude/longitude using DuckDB geospatial functions, obtaining the total number of casualties for each city over five years.
  • Use DuckDB geospatial functions to calculate the size in km² of each city.
  • Calculate a safety index for each city: its number of casualties divided by its area. We disregarded two of our results due to doubts about the accuracy of some of the data. Cardiff had the lowest safety index and was therefore deemed the safest of the cities we surveyed.

Given the right input data, geospatial analysis using the tools I describe could be a useful aid across many industries. Traffic and accident analysis (as I’ve shown) is one example; flood, deforestation, forest fire and drought analysis come to mind too. Basically, any system or process tied to a spatial coordinate system is ripe for exploratory data analysis.
