Building a Geospatial Lakehouse with Open Source and Databricks


Most data that pertains to a measurable process in the real world has a geospatial aspect to it. Organisations that manage assets over a large geographical area, or have a business process which requires them to consider many layers of geographical attributes that require mapping, may have more complicated geospatial analytics requirements once they start to use this data to answer strategic questions or optimise operations. These geospatially focussed organisations might ask questions of their data such as:

- Where do our customers, assets or events cluster geographically?
- How long does it typically take to travel between our locations, and who is within a given catchment area of a site?
- How is some attribute of interest (habitats, deprivation, risk factors) distributed across the area we operate in?

All of these are valuable geospatial queries, requiring that a variety of data entities be integrated in a common storage layer, and that geospatial joins such as point-in-polygon operations and geospatial indexing be scaled to handle the inputs involved. This article will discuss approaches to scaling geospatial analytics using the features of Databricks and open-source tools, taking advantage of Spark implementations, the common Delta table storage format and Unity Catalog [1], focussing on batch analytics on vector geospatial data.

Solution Overview

The diagram below summarises an open-source approach to building a geospatial Lakehouse in Databricks. Through a variety of ingestion modes (though often through public APIs) geospatial datasets are landed into cloud storage in a variety of formats; with Databricks this might be a volume inside a Unity Catalog catalog and schema. Geospatial data formats mainly include vector formats (GeoJSON, CSV and Shapefile .shp) which represent latitude/longitude points, lines or polygons and their attributes, and raster formats (GeoTIFF, HDF5) for imaging data. Using GeoPandas [2] or Spark-based geospatial tools such as Mosaic [3] or the H3 Databricks SQL functions [4] we can prepare vector files in memory and save them in a unified bronze layer in Delta format, using Well Known Text (WKT) as a string representation of any points or geometries.
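
As a minimal sketch of this landing-to-bronze step (the volume path and table name here are hypothetical), a vector file can be read with GeoPandas, its geometries serialised to WKT, and the result saved as a bronze Delta table:

import geopandas as gpd

# Read a landed vector file from a Unity Catalog volume (hypothetical path)
gdf = gpd.read_file("/Volumes/geo/landing/vectors/regions.geojson")

# Serialise the geometries to WKT strings so the table is plain tabular data
gdf["wkt"] = gdf.geometry.to_wkt()
pdf = gdf.drop(columns="geometry")

# Write the prepared data to a bronze Delta table (hypothetical name)
spark.createDataFrame(pdf).write.format("delta").mode("append").saveAsTable("geo.bronze.regions")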

While the landing to bronze layer represents an audit log of ingested data, the bronze to silver layer is where data preparation and any geospatial joins common to all upstream use-cases can be applied. The finished silver layer should represent a single geospatial view and should integrate with other non-geospatial datasets as part of an enterprise data model; it also offers an opportunity to consolidate multiple tables from bronze into core geospatial datasets which may have multiple attributes and geometries, at the base level of grain required for aggregations upstream. The gold layer is then the geospatial presentation layer where the output of geospatial analytics such as journey time or density calculations can be stored. For use in dashboarding tools such as Power BI, outputs may be materialised as star schemas, whilst cloud GIS tools such as ESRI Online will prefer GeoJSON files for specific mapping applications.

Geospatial Data Preparation

Along with the standard data quality challenges faced when unifying many individual data sources in a data lake architecture (missing data, variable recording practices etc.), geospatial data has unique data quality and preparation challenges. In order to make vectorised geospatial datasets interoperable and easily visualised upstream, it is best to choose a geospatial co-ordinate system such as WGS 84 (the widely used international GPS standard). In the UK many public geospatial datasets use other co-ordinate systems such as OSGB 36, which is optimised for mapping geographical features within the UK with increased accuracy (this format is often written in Eastings and Northings rather than the more typical latitude and longitude pairs), and a transformation to WGS 84 is required for these datasets to avoid inaccuracies in the downstream mapping, as outlined in the Figure below.

Figure: reproduced with permission. Copyright (c) Ordnance Survey 2018 [5].

Most geospatial libraries such as GeoPandas, Mosaic and others have built-in functions to handle these conversions, for instance from the Mosaic documentation:

# Assumes the Mosaic library is installed and enabled on the cluster, e.g.:
# from pyspark.sql.functions import lit
# from mosaic import enable_mosaic, st_setsrid, st_geomfromwkt, st_transform, st_astext
# enable_mosaic(spark, dbutils)
df = (
  spark.createDataFrame([{'wkt': 'MULTIPOINT ((10 40), (40 30), (20 20), (30 10))'}])
  .withColumn('geom', st_setsrid(st_geomfromwkt('wkt'), lit(4326)))
)
df.select(st_astext(st_transform('geom', lit(3857)))).show(1, False)
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|st_astext(st_transform(geom, 3857))                                                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|MULTIPOINT ((1113194.9079327357 4865942.279503176), (4452779.631730943 3503549.843504374), (2226389.8158654715 2273030.926987689), (3339584.723798207 1118889.9748579597))  |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

This converts a multi-point geometry from WGS 84 (EPSG:4326) to the Web Mercator projection (EPSG:3857).
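
For the UK case described above, the equivalent conversion can be sketched in GeoPandas by building point geometries from Eastings/Northings, declaring the CRS as OSGB 36 (EPSG:27700) and reprojecting to WGS 84 (EPSG:4326); the input values here are illustrative:

import pandas as pd
import geopandas as gpd

# Illustrative Eastings/Northings recorded in OSGB 36 (EPSG:27700)
df = pd.DataFrame({"easting": [530000, 385000], "northing": [180000, 398000]})
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["easting"], df["northing"]),
    crs="EPSG:27700",
)

# Reproject to WGS 84 (EPSG:4326) to get interoperable lat/lon geometries
gdf_wgs84 = gdf.to_crs(epsg=4326)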

Another data quality issue unique to vector geospatial data is the concept of invalid geometries, outlined in the Figure below. These invalid geometries will break upstream GeoJSON files or analyses, so it is best to repair them, or delete them if necessary. Most geospatial libraries offer functions to find and attempt to fix invalid geometries.
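
As a sketch of such a find-and-fix pass, using GeoPandas together with Shapely's make_valid helper (available in Shapely 1.8+; the file path is hypothetical):

import geopandas as gpd
from shapely.validation import make_valid

gdf = gpd.read_file("/Volumes/geo/landing/vectors/regions.geojson")

# Flag invalid geometries (self-intersections, unclosed rings, etc.)
invalid = ~gdf.geometry.is_valid
print(f"{invalid.sum()} invalid geometries found")

# Attempt a repair, then drop any rows that still fail validation
gdf.loc[invalid, "geometry"] = gdf.loc[invalid, "geometry"].apply(make_valid)
gdf = gdf[gdf.geometry.is_valid]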

Figure: examples of invalid geometries [6]. Copyright (c) 2024 Christoph Rieke.

These data quality and preparation steps should be implemented early on in the Lakehouse layers; I have done them in the bronze to silver step in the past, together with any reusable geospatial joins and other transformations.

Scaling Geospatial Joins and Analytics

The geospatial aspect of the silver/enterprise layer should ideally represent a single geospatial view that feeds all upstream aggregations, analytics, ML modelling and AI. Along with data quality checks and remediation, it is sometimes useful to consolidate many geospatial datasets with aggregations or unions to simplify the data model, simplify upstream queries and prevent the need to redo expensive geospatial joins. Geospatial joins are often very computationally expensive due to the large number of bits required to represent sometimes complex multi-polygon geometries and the need for many pair-wise comparisons.

A few strategies exist to make these joins more efficient. You can, for example, simplify complex geometries, effectively reducing the number of lat/lon pairs required to represent them; different approaches are available for doing this that may be geared towards different desired outputs (e.g., preserving area, or removing redundant points), and these can be implemented with the libraries, for example in Mosaic:

df = spark.createDataFrame([{'wkt': 'LINESTRING (0 1, 1 2, 2 1, 3 0)'}])
df.select(st_simplify('wkt', 1.0)).show()
+----------------------------+
| st_simplify(wkt, 1.0)      |
+----------------------------+
| LINESTRING (0 1, 1 2, 3 0) |
+----------------------------+

Another approach to scaling geospatial queries is to use a geospatial indexing system, as outlined in the Figure below. By aggregating point or polygon geometry data to a geospatial indexing system such as H3, an approximation of the same information can be represented in a highly compressed form: a short string identifier, which maps to a set of fixed polygons (with visualisable lat/lon pairs) that cover the globe at a range of hexagon/pentagon sizes and resolutions, and that can be rolled up/down in a hierarchy.

Figure: the H3 geospatial indexing system [7][8]. Copyright (c) CARTO 2023. Copyright (c) Uber 2018.
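
Outside of Spark, the same indexing system can be explored with the h3-py package; this sketch assumes its v4 API, and the coordinates are illustrative:

import h3

# Index a lat/lon point at resolution 7 (cells of roughly 5 km^2)
cell = h3.latlng_to_cell(51.5074, -0.1278, 7)  # a short string identifier

# Roll the cell up to a coarser parent in the hierarchy
parent = h3.cell_to_parent(cell, 5)

# Recover the visualisable lat/lon boundary of the hexagonal cell
boundary = h3.cell_to_boundary(cell)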

In Databricks the H3 indexing system is also optimised for use with its Spark SQL engine, so you can write queries such as this point-in-polygon join as approximations in H3, first converting the points and polygons to H3 indexes at the desired resolution (res. 7, which is ~5 km^2) and then using the H3 index fields as keys to join on:

-- Index each point at H3 resolution 7
WITH locations_h3 AS (
    SELECT
        id,
        lat,
        lon,
        h3_pointash3(
            CONCAT('POINT(', lon, ' ', lat, ')'),
            7
        ) AS h3_index
    FROM locations
),
-- Cover each region polygon with H3 cells at the same resolution
regions_h3 AS (
    SELECT
        name,
        explode(
            h3_polyfillash3(
                wkt,
                7
            )
        ) AS h3_index
    FROM regions
)
-- Approximate point-in-polygon join on the shared H3 index
SELECT
    l.id AS point_id,
    r.name AS region_name,
    l.lat,
    l.lon,
    r.h3_index,
    h3_boundaryaswkt(r.h3_index) AS h3_polygon_wkt
FROM locations_h3 l
JOIN regions_h3 r
  ON l.h3_index = r.h3_index;

GeoPandas and Mosaic will also let you do geospatial joins without any approximations if required, but often H3 is a sufficiently accurate approximation for joins and analytics such as density calculations. With a cloud analytics platform you can also make use of APIs to bring in live traffic data and journey time calculations using services such as Open Route Service [9], or enrich geospatial data with additional attributes (e.g., transport hubs or retail locations) using tools such as the Overpass API for OpenStreetMap [10].
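
As an illustrative sketch of that kind of enrichment, the snippet below queries the Overpass API [10] for railway stations in a bounding box (the bounding box and tag filter are arbitrary examples):

import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

# Overpass QL: railway stations within a bounding box (south, west, north, east)
query = """
[out:json];
node["railway"="station"](51.45,-0.25,51.55,0.00);
out body;
"""

response = requests.post(OVERPASS_URL, data={"data": query}, timeout=60)
response.raise_for_status()
stations = [
    {"name": el["tags"].get("name"), "lat": el["lat"], "lon": el["lon"]}
    for el in response.json()["elements"]
]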

Geospatial Presentation Layers

Now that some geospatial queries and aggregations have been done and analytics are ready to be visualised downstream, the presentation layer of a geospatial Lakehouse can be structured according to the downstream tools used for consuming the maps or analytics derived from the data. The Figure below outlines two typical approaches.

When serving a cloud geographic information system (GIS) such as ESRI Online or another web application with mapping tools, GeoJSON files stored in a gold/presentation layer volume, containing all the necessary data for the map or dashboard to be created, can constitute the presentation layer. Using the FeatureCollection GeoJSON type you can create a nested JSON containing multiple geometries and associated attributes ("features"), which may be points, linestrings or polygons. If the downstream dashboarding tool is Power BI, a star schema may be preferred, where the geometries and attributes can be modelled as facts and dimensions to take advantage of its cross-filtering and measure support, with outputs materialised as Delta tables in the presentation layer.
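
As a minimal sketch of the GeoJSON hand-off (the table and volume paths are hypothetical), a gold table with WKT geometries can be read back into GeoPandas, which writes FeatureCollection GeoJSON natively:

import geopandas as gpd

# Pull a presentation-layer table into pandas and rebuild geometries from WKT
pdf = spark.table("geo.gold.region_density").toPandas()
gdf = gpd.GeoDataFrame(
    pdf.drop(columns="wkt"),
    geometry=gpd.GeoSeries.from_wkt(pdf["wkt"]),
    crs="EPSG:4326",
)

# The GeoJSON driver emits a FeatureCollection: each row becomes a feature,
# with the non-geometry columns as its properties
gdf.to_file("/Volumes/geo/gold/maps/region_density.geojson", driver="GeoJSON")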

Platform Architecture and Integrations

Geospatial data will normally represent one part of a wider enterprise data model and portfolio of analytics and ML/AI use-cases, and these require (ideally) a cloud data platform, with a series of upstream and downstream integrations, to deploy, orchestrate and actually see that the analytics prove valuable to an organisation. The Figure below shows a high-level architecture for the kind of Azure data platform I have worked with geospatial data on in the past.

Figure: high-level architecture of an Azure data platform for geospatial analytics. Image by author.

Data is landed using a variety of ETL tools (where possible, Databricks itself is sufficient). Within the workspace(s) a medallion pattern of raw (bronze), enterprise (silver) and presentation (gold) layers is maintained, using the Unity Catalog catalog.schema.table/volume hierarchy to give per-use-case layer separation (particularly of permissions) if needed. When presentable outputs are ready to share, there are a range of options for data sharing, app building, dashboarding and GIS integration.

For example, with ESRI cloud an ADLS Gen2 storage account connector within ESRI allows data written to an external Unity Catalog volume (i.e., GeoJSON files) to be pulled through into the ESRI platform for integration into maps and dashboards. Some organisations may prefer that geospatial outputs be written to downstream systems such as CRMs or other geospatial databases. Curated geospatial data and its aggregations are also frequently used as input features to ML models, and this works seamlessly with geospatial Delta tables. Databricks are developing various AI analytics features built into the workspace (e.g., AI/BI Genie [11] and Agent Bricks [12]) that give the ability to query data in Unity Catalog using English, and the likely long-term vision is for geospatial data to work with these AI tools in the same way as any other tabular data, only one of the visual outputs will be maps.

In Closing

At the end of the day, it's all about making cool maps that are useful for decision making. The figure below shows a few geospatial analytics outputs I've generated over the past few years. Geospatial analytics boils down to knowing things like where people or events or assets cluster, how long it typically takes to get from A to B, and what the landscape looks like in terms of the distribution of some attribute of interest (this might be habitats, deprivation, or some risk factor). All essential things to know for strategic planning (e.g., where do I put a fire station?), knowing your customer base (e.g., who is within 30 minutes of my location?) or operational decision support (e.g., which locations are likely to require additional capacity this Friday?).

Figure: example geospatial analytics outputs. Image by author.

Thanks for reading, and if you're interested in discussing or reading further, please get in touch or check out some of the references below.

https://www.linkedin.com/in/robert-constable-38b80b151/

References

[1] https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/

[2] https://geopandas.org/en/stable/

[3] https://databrickslabs.github.io/mosaic/

[4] https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-h3-geospatial-functions

[5] https://www.ordnancesurvey.co.uk/documents/resources/guide-coordinate-systems-great-britain.pdf

[6] https://github.com/chrieke/geojson-invalid-geometry

[7] https://carto.com/blog/h3-spatial-indexes-10-use-cases

[8] https://www.uber.com/en-GB/blog/h3/

[9] https://openrouteservice.org/dev/#/api-docs

[10] https://wiki.openstreetmap.org/wiki/Overpass_API 

[11] https://www.databricks.com/blog/aibi-genie-now-generally-available

[12] https://www.databricks.com/blog/introducing-agent-bricks
