Geospatial Index 102


Geospatial indexing is a technique that provides an efficient way to manage location-based data. It allows geospatial data to be searched and retrieved quickly, so that a system can deliver the best possible experience to its users. This article demonstrates how that works in practice by applying a geospatial index to real-world data and measuring the performance gain. Let's start. (Note: if you have never heard of geospatial indexes, or would like to learn more about them, take a look at this article.)

The data used in this article is the Chicago Crime Data, which is part of the Google Cloud Public Dataset Program. Anyone with a Google Cloud Platform account can access this dataset free of charge. It consists of roughly 8 million rows (1.52 GB in total) recording incidents of crime that have occurred in Chicago since 2001, where each record carries geographic data indicating the incident's location.

Not only will we use the data from Google Cloud, but we will also use Google BigQuery as the data processing platform. BigQuery provides job execution details for every query it runs, including the amount of data processed and the number of rows read, which is very useful for illustrating the performance gain after optimization.

To demonstrate the power of the geospatial index, we will optimize the performance of a location-based query. In this example, we will use Geohash as the index because of its simplicity and its native support in Google BigQuery.
We are going to retrieve all records of crimes that occurred within 2 km of Chicago Union Station. Before the optimization, let's see what the performance looks like when we run this query on the original dataset:

-- Chicago Union Station coordinates = (-87.6402895591744, 41.87887332682509)
SELECT *
FROM `bigquery-public-data.chicago_crime.crime`
WHERE ST_DISTANCE(
  ST_GEOGPOINT(longitude, latitude),
  ST_GEOGFROMTEXT("POINT(-87.6402895591744 41.87887332682509)")
) <= 2000

Below is what the job information and execution details look like:

Job information (Image by author)
Execution details (Image by author)

From the number of bytes processed and records read, you can see that the query scans the entire table and processes every row to get the result. This means the more data we have, the longer the query takes and the more expensive the processing becomes. Can this be more efficient? Of course, and that's where the geospatial index comes into play.

The issue with the above query is that even though many records are far away from the point of interest (Chicago Union Station), they have to be processed anyway. If we can eliminate those records upfront, the query will be a lot more efficient.

Geohash can be the answer to this issue. Besides encoding coordinates into a short text, another power of geohash is that the hash itself carries geospatial properties. Similarity between hashes implies geographical similarity between the locations they represent. For instance, the two areas represented by wxcgh and wxcgd are close because the two hashes are very similar, while accgh and dydgh are far from each other because the two hashes are very different.
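As a toy illustration of this property (plain Python; the helper name is our own), the length of the shared prefix of two geohashes tells you how small an enclosing cell the two locations share. Note that this is a heuristic: two adjacent locations that happen to fall on opposite sides of a major grid boundary can still have very different hashes.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two geohashes share; a longer shared
    prefix means the locations sit in the same, smaller enclosing cell."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Hashes from the text: wxcgh and wxcgd are close, accgh and dydgh are not.
print(common_prefix_len("wxcgh", "wxcgd"))  # 4
print(common_prefix_len("accgh", "dydgh"))  # 0
```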

We can use this property together with a clustered table to our advantage by computing the geohash of every row upfront. Then we compute the geohash of Chicago Union Station. This way, we can eliminate beforehand all records whose hashes are not close enough to Chicago Union Station's geohash.

Here is how to implement it:

1. Create a new table with a new column that stores the geohash of the coordinates.

CREATE TABLE `..crime_with_geohash_lv5` AS (
  SELECT *, ST_GEOHASH(ST_GEOGPOINT(longitude, latitude), 5) AS geohash
  FROM `bigquery-public-data.chicago_crime.crime`
)

2. Create a clustered table using the geohash column as the cluster key.

CREATE TABLE `..crime_with_geohash_lv5_clustered`
CLUSTER BY geohash
AS (
  SELECT *
  FROM `..crime_with_geohash_lv5`
)

By using the geohash as a cluster key, we create a table in which rows that share the same hash are physically stored together. If you think about it, what actually happens is that the dataset is partitioned by geolocation, because the closer two rows are geographically, the more likely they are to share the same hash.

3. Compute the geohash of Chicago Union Station.
In this article we use this website, but there are many libraries in various programming languages that let you do this programmatically.
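For example, here is a minimal sketch of the standard geohash encoding in pure Python (stdlib only; the function name `geohash_encode` is our own). It repeatedly halves the latitude and longitude ranges, interleaving the resulting bits (longitude first) into base32 characters:

```python
# Standard geohash base32 alphabet (note: no "a", "i", "l", or "o").
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat: float, lon: float, precision: int = 5) -> str:
    """Encode a (lat, lon) pair into a geohash of the given length."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    chars, bit, ch, even = [], 0, 0, True  # even bits refine longitude
    while len(chars) < precision:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        ch = (ch << 1) | (val >= mid)        # append one bit
        rng[0 if val >= mid else 1] = mid    # keep the matching half
        even = not even
        bit += 1
        if bit == 5:  # 5 bits per base32 character
            chars.append(BASE32[ch])
            bit, ch = 0, 0
    return "".join(chars)

# Chicago Union Station falls in cell dp3wj at precision 5.
print(geohash_encode(41.87887332682509, -87.6402895591744))  # dp3wj
```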

Geohash of the Chicago Union Station (Image by author)

4. Add the geohash to the query condition.

geohash = "dp3wj" AND
ST_DISTANCE(ST_GEOGPOINT(longitude, latitude), ST_GEOGFROMTEXT("POINT(-87.6402895591744 41.87887332682509)")) <= 2000

This time the query should only scan the records located in the dp3wj zone, because the geohash is the cluster key of the table. This should save a lot of processing. Let's see what happens.

Job information after creating a clustered table (Image by author)
Execution details after creating a clustered table (Image by author)

From the job information and execution details, you can see that the number of bytes processed and records scanned dropped significantly (from 1.5 GB to 55 MB, and from 7M to 260k rows). By introducing a geohash column and using it as a cluster key, we eliminate all the records that clearly do not satisfy the query, just by looking at one column.

However, we are not finished yet. Look carefully at the number of output rows: it contains only 100k records, whereas the correct result should contain 380k. The result we got is still not correct.

5. Compute the neighbor zones and add them to the query.

In this example, the neighboring hashes are dp3wk, dp3wm, dp3wq, dp3wh, dp3wn, dp3tu, dp3tv, and dp3ty. We use an online geohash explorer for this but, again, this can also be done in code.

Neighbors of dp3wj (Image by author)

Why do we need to add the neighbor zones to the query? Because a geohash is only an approximation of a location. Although we know Chicago Union Station is in dp3wj, we don't know where exactly it sits within that zone: near the top, bottom, left, or right edge? We have no idea. If it's near the top, some data in dp3wm may be closer than 2 km. If it's near the right edge, some data in dp3wn may be closer than 2 km. And so on. That is why all the neighboring hashes must be included in the query to get the correct result.
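The neighbor zones can also be computed in code. One simple approach, sketched below in pure Python (helper names are our own; a compact encoder is included so the snippet stands alone), is to decode the hash to its bounding box and re-encode the center point shifted by one cell width/height in each of the eight directions:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=5):
    """Standard geohash encoding: interleave lon/lat bisection bits."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    chars, bit, ch, even = [], 0, 0, True
    while len(chars) < precision:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        ch = (ch << 1) | (val >= mid)
        rng[0 if val >= mid else 1] = mid
        even = not even
        bit += 1
        if bit == 5:
            chars.append(BASE32[ch])
            bit, ch = 0, 0
    return "".join(chars)

def geohash_bounds(gh):
    """Bounding box of a geohash cell: (lat_lo, lat_hi, lon_lo, lon_hi)."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    even = True
    for c in gh:
        bits = BASE32.index(c)
        for i in range(4, -1, -1):
            rng = lon_rng if even else lat_rng
            rng[0 if (bits >> i) & 1 else 1] = (rng[0] + rng[1]) / 2
            even = not even
    return lat_rng[0], lat_rng[1], lon_rng[0], lon_rng[1]

def geohash_neighbors(gh):
    """The 8 cells surrounding gh (no pole/antimeridian wrapping)."""
    lat_lo, lat_hi, lon_lo, lon_hi = geohash_bounds(gh)
    clat, clon = (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2
    dlat, dlon = lat_hi - lat_lo, lon_hi - lon_lo
    return [geohash_encode(clat + dy * dlat, clon + dx * dlon, len(gh))
            for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dx, dy) != (0, 0)]

# The zones used in the final query below.
print(sorted(geohash_neighbors("dp3wj")))
```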

Note that a level-5 geohash cell is roughly 5 km across. Therefore, all zones other than those in the figure above are too far from Chicago Union Station to matter for a 2 km radius. The precision level is another important design choice, because it has a huge effect: if it is too coarse, we gain very little from clustering; if it is too fine, the query becomes complicated because many more neighbor zones must be included.
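One way to reason about this trade-off (a sketch with our own helper name, using commonly cited approximate cell sizes) is to pick the finest precision whose cells are still at least as large as the query radius, so that the target cell plus its eight neighbors are guaranteed to cover the search circle:

```python
# Approximate geohash cell sizes in km (width, height), per precision level.
GEOHASH_CELL_KM = {
    1: (5000.0, 5000.0),
    2: (1250.0, 625.0),
    3: (156.0, 156.0),
    4: (39.1, 19.5),
    5: (4.9, 4.9),
    6: (1.2, 0.61),
    7: (0.153, 0.153),
}

def pick_precision(radius_km: float) -> int:
    """Finest level whose smaller cell dimension still covers the radius,
    so the center cell plus its 8 neighbors enclose the whole circle."""
    best = 1
    for level, (w, h) in sorted(GEOHASH_CELL_KM.items()):
        if min(w, h) >= radius_km:
            best = level
    return best

print(pick_precision(2.0))  # 5 -- the level used in this article
```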

Here is what the final query looks like:

(
  geohash = "dp3wj" OR
  geohash = "dp3wk" OR
  geohash = "dp3wm" OR
  geohash = "dp3wq" OR
  geohash = "dp3wh" OR
  geohash = "dp3wn" OR
  geohash = "dp3tu" OR
  geohash = "dp3tv" OR
  geohash = "dp3ty"
) AND
ST_DISTANCE(ST_GEOGPOINT(longitude, latitude), ST_GEOGFROMTEXT("POINT(-87.6402895591744 41.87887332682509)")) <= 2000

And this is what happens when we execute the query:

Job information after adding neighbor hashes (Image by author)
Execution details after adding neighbor hashes (Image by author)

Now the result is correct, and the query processes 527 MB and scans 2.5M records in total. Compared with the original query, the geohash clustered table cuts the processing cost by roughly a factor of three. However, nothing comes for free. Applying a geohash adds complexity to the way data is preprocessed and retrieved, such as the precision level that must be chosen upfront and the extra logic in the SQL query.

In this article, we have seen how a geospatial index can improve the processing of geospatial data. However, it has a cost that should be considered carefully upfront. At the end of the day, it is not a free lunch: to make it work properly, a good understanding of both the algorithm and the system requirements is needed.

