The Potential of Machine Learning for Compiling Standardized Zoning Data


Illustration by Rhiannon Newman for the Urban Institute

Zoning codes, the official documents that regulate a jurisdiction’s land use, contain a wealth of data about what buildings can be built in an area and how they can be used, with clear implications for racial equity, housing affordability, economic development, and environmental impact. Yet zoning documents are often long, complex, unstandardized, and sometimes handwritten, making extracting those data difficult. Without clear, standardized zoning data, researchers and policymakers lack the empirical evidence to answer questions about how zoning affects housing supply and to pursue specific reforms.

Many states have begun to build toward a National Zoning Atlas, which would collect data for every zoning district, the smallest regulatory building block within a jurisdiction, in a single place. But building and updating this atlas requires significant manual effort. To ease this process, we partnered with Sara Bronin and the National Zoning Atlas to explore how text analysis and natural language processing methods could help automate the collection of standardized zoning data, publishing a report with findings and reflections from our pilot project.

To gauge how much of the data collection process could be automated, we conducted an illustrative case study with Connecticut data, intending to expand to the rest of the country if we were successful. Importantly for this case study, we had access to “ground truth” data from the Connecticut Zoning Atlas, which we could compare against our results. If the results were sufficiently encouraging, we could extend to other states where we don’t already have benchmark data. Our methodology consisted of four major steps, each of which presented unique challenges, illustrated in the diagram below:

The first step for any data collection process, automated or manual, is to gather all the documents in our “universe”: in this case, all jurisdictions in Connecticut. Luckily for us, the National Zoning Atlas team in Connecticut, led by Sara Bronin, has diligently collected zoning codes and maps from a variety of sources, including municipal websites, GIS repositories, local law databases, and even manual outreach to jurisdiction staff. Although unnecessary for our case study, web scraping techniques could lighten the burden of this step for other states.

After saving documents in PDF, PNG, or JPEG format, we extracted the text using Amazon Web Services’ Textract optical character recognition (OCR) software. Textract returns a series of JSON objects for each document containing the output from text extraction, which we wrangled and appended into pandas DataFrames in Python for further analysis.
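As a minimal sketch of this wrangling step (our own illustration, not the project’s code: the field names “Blocks,” “BlockType,” “Text,” and “Page” follow Textract’s standard response shape, while the function and document names are hypothetical), the LINE blocks of a Textract response can be flattened into a pandas DataFrame:

```python
import pandas as pd

def textract_lines_to_df(response: dict, doc_name: str) -> pd.DataFrame:
    """Flatten the LINE blocks of a Textract-style response into a tidy DataFrame."""
    rows = [
        {"document": doc_name,
         "page": block.get("Page", 1),
         "text": block["Text"]}
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]
    return pd.DataFrame(rows, columns=["document", "page", "text"])

# A tiny hand-made response standing in for real Textract output
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1},
        {"BlockType": "LINE", "Page": 1, "Text": "ARTICLE IV. ZONING DISTRICTS"},
        {"BlockType": "LINE", "Page": 1, "Text": "R-1 Residential 1"},
    ]
}

df = textract_lines_to_df(sample_response, "andover_code.pdf")
print(len(df))  # 2 LINE blocks kept; the PAGE block is dropped
```

DataFrames like this, one row per extracted line, can then be appended across documents for the analysis steps that follow.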

With our dataset compiled, we wanted to identify the names of each zoning district, a deceptively difficult task for multiple reasons: First, zoning districts can have both full and abbreviated names (e.g., “Residential 1” versus “R-1”). Second, the listed districts can vary between the zoning code and the zoning map. And third, district names often don’t appear neatly in a single place in a document.

To address these problems, we implemented a fuzzy matching comparison between text that appears in the legend and labels of a zoning map and text that appears in the zoning code. For the town of Andover, for example, the one-page map lists all the zoning districts in the legend, while a particular page of the zoning code lists the districts just after the phrase “divided into the.” We used that term and many other common terms to reduce the long zoning documents down to more manageable subsets.

We then scored vocabularies (in this case, unique sets of phrases) across the two document types. A score of 100 means two strings are identical, while a score of 0 means they have no characters in common. Using SeatGeek’s thefuzz Python package, we selected thresholds for what constitutes a “match” through iterative testing on a small training set of jurisdictions. We go to all the trouble of fuzzy matching to ensure that near matches across documents are captured. A great example is “ARRD — Andover Rural Residential Design District” in the map and “ARD Andover Rural Design District” in the code, which clearly refer to the same district but would be missed by exact string matching.

We kept pairs that had sufficiently high fuzzy matching scores and filtered out terms that didn’t meet the threshold or appeared in our list of stop words (the insignificant text we don’t want), such as “and,” “the,” and “or.” We erred on the side of casting a wide net to capture as many true district names as possible. Unfortunately, this cautious approach produced quite a few false positives, or algorithmically identified zoning districts that aren’t actual zoning districts.

Once we had sorted out the many false positives, we were able to identify 55.2 percent of all mapped zoning districts in the Connecticut Zoning Atlas data. Given the unstandardized nature of our text data, our methods showed some promise but weren’t a perfect substitute for researcher review.

After manually filling in the gaps in our dataset, we sought to build a corpus, or dataset containing the relevant text, for each of the identified zoning districts. Ideally, each corpus would contain whatever relevant information a natural language model needed to make predictions. For this case study, we tackled just one of the dozens of columns in the National Zoning Atlas data: the type of zoning district, which can be either primarily residential, mixed with residential, or nonresidential. Asking, “What type of zoning district is this?” is a three-class classification problem and a much more concrete task than an open-ended question such as, “What’s the minimum lot size for a district?”

To create the corpus, we searched for instances where an identified zoning district occurred within a certain window of other relevant vocabulary words. For example, the affordable housing category contained terms like “affordable,” “opportunity,” and “workforce” that subject-matter experts have associated with affordable housing districts. (The exact search criteria were slightly more complex than this and involved the use of regular expressions to match special patterns of character combinations in the text.)
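A simplified reconstruction of this window-based search (our own illustration, not the project’s exact criteria; the sample text, district name, and 60-character window are invented):

```python
import re

text = (
    "The R-1 district is intended for single-family homes. "
    "The AH district promotes affordable and workforce housing opportunities."
)

district = "AH"
# Category keywords associated with affordable housing districts
keywords = re.compile(r"\b(affordable|opportunity|workforce)\b", re.IGNORECASE)

window = 60  # characters of context on each side of the district name
hits = []
for m in re.finditer(rf"\b{re.escape(district)}\b", text):
    context = text[max(0, m.start() - window): m.end() + window]
    if keywords.search(context):
        hits.append(context)

print(len(hits))  # 1: the AH mention has "affordable" within its window
```

Text windows that pair a district name with category vocabulary become that district’s corpus entries.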

Unfortunately, we encountered some major limitations at this step. First, we learned that much of the information we were searching for exists in tabular form in many zoning codes. Although the OCR software can extract text from tables and even maintain their original structure, we were unable to automate the process of interpreting the large variety of table formats across the documents (e.g., seeing where a given row and column intersect). Information presented in this fashion was incompatible with our window-based approach, and future efforts will need to find another way to parse tabular data.

Given the density and varying structure of zoning documents, this window approach often returned text that wouldn’t be useful to a natural language model. Many of our datasets didn’t pass the “eye test,” meaning a human reader would be unlikely to find much useful text, let alone a machine reader. Still, we evaluated whether a combination of information from the zoning district names and these limited-quality datasets was enough to classify zoning districts by type.

We used two types of variables for the machine learning portion of this case study. First, we created a set of term frequency-inverse document frequency (TF-IDF) features using the set of all words in the zoning district names collected by the National Zoning Atlas. TF-IDF calculates how often words appear in zoning district names, but it places more weight on words that are more distinctive and less weight on words that are common throughout the data.
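A minimal sketch of these features with scikit-learn (the district names below are hypothetical stand-ins for the Atlas data, and the default vectorizer settings are our assumption, not the project’s exact configuration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

district_names = [
    "Residential 1",
    "Residential 2",
    "Rural Residential Design District",
    "General Business District",
    "Industrial Park",
]

vectorizer = TfidfVectorizer(lowercase=True)
X = vectorizer.fit_transform(district_names)  # one TF-IDF row per district name

# "residential" appears in three of the five names, so it carries a lower
# inverse-document-frequency weight than a rarer word like "industrial".
vocab = vectorizer.vocabulary_
idf = vectorizer.idf_
print(idf[vocab["residential"]] < idf[vocab["industrial"]])  # True
```

The resulting sparse matrix `X` is the kind of feature set a classifier can consume directly.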

We also used information from the text datasets created in the previous step to create term concentration variables for each of the three classes. To do this, we counted the number of times search terms associated with primarily residential, mixed with residential, and nonresidential district types occurred in each dataset, scaling those counts by the average length of the text datasets for that town’s zoning districts.
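A small reconstruction of these term concentration features (the keyword lists and scaling constant here are invented for illustration; the project’s actual search terms came from subject-matter experts):

```python
# Hypothetical category keyword lists, one per district type
CATEGORY_TERMS = {
    "primarily_residential": ["residential", "dwelling", "single-family"],
    "mixed_with_residential": ["mixed", "village", "downtown"],
    "nonresidential": ["industrial", "commercial", "business"],
}

def term_concentrations(district_text: str, avg_town_length: float) -> dict:
    """Count keyword hits per category, scaled by the town's average text length."""
    words = district_text.lower().split()
    return {
        cat: sum(words.count(t) for t in terms) / avg_town_length
        for cat, terms in CATEGORY_TERMS.items()
    }

feats = term_concentrations("light industrial and commercial uses permitted", 100.0)
print(feats["nonresidential"])  # 0.02: two keyword hits over an average length of 100
```

Each district thus gets three concentration scores, one per candidate class.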

We split the districts into a 70 percent training set and a 30 percent test set, experimenting with three types of classifiers: a baseline logistic regression, a random forest, and a support vector classifier. We performed five-fold cross-validation, which entails dividing the training set into five groups of equal size, with the different groups taking turns being used to train and validate the model. This step allowed us to compare different classifiers and ultimately select our best-performing model, a support vector classifier with a linear kernel.
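The split-and-validate workflow can be sketched with scikit-learn (the synthetic features below stand in for the real TF-IDF and term concentration variables, and the random seeds are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

# Synthetic three-class data standing in for the real district features
X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=5, random_state=0)

# 70/30 train-test split, stratified so class proportions are preserved
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Five-fold cross-validation on the training set for model comparison
clf = SVC(kernel="linear")
scores = cross_val_score(clf, X_train, y_train, cv=5)
print(len(scores))  # 5, one accuracy score per fold

# Fit the selected model and evaluate once on the held-out test set
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(round(test_accuracy, 2))
```

In practice, the same cross-validation loop would be run for each candidate classifier before committing to one.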

The table below shows model performance on the 716 districts in the test dataset. All major metrics hover right around 75 percent, significantly better than the baseline “model-less” approach we used, which followed a set of naive classification rules and achieved only about 64 percent accuracy.

Accuracy is the percentage of zoning districts in the test set that the model correctly identifies. Precision is the percentage of zoning districts that the model predicts as falling within one of the three categories (e.g., nonresidential) that actually fall into that category. Recall is the percentage of zoning districts that actually fall into one of the three categories that the model predicts correctly. The F1 score is the harmonic mean of the precision and recall scores.
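To make these definitions concrete, here is a toy example using scikit-learn’s metric functions (the labels are invented, not the report’s actual predictions; macro averaging weights the three classes equally):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# 0 = primarily residential, 1 = mixed with residential, 2 = nonresidential
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

acc = accuracy_score(y_true, y_pred)  # 4 of 6 districts correct
prec = precision_score(y_true, y_pred, average="macro")
rec = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")

print(acc)
print(prec, rec, f1)
```

Because precision and recall penalize different mistakes, reporting all four metrics (as the table does) gives a fuller picture than accuracy alone.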

Clearly, our support vector classifier is learning something from the text information that helps improve predictive power, but the variable importance plot below presents two reasons for caution. First, the occurrence of the words “residential” and “residence” in the zoning district names are the most important predictors, which could explain why precision and recall are particularly high among primarily residential districts and indicates that the model is relying on something any human reader could quickly recognize as well (e.g., a district called “Residential 1” is trivial to classify as primarily residential). Second, our term concentration features, which are missing from the chart below, were among the least predictive variables, suggesting essentially no information derived from our text datasets proved useful to the model.

We learned early in our automation journey that completely removing human researchers from this process was impractical and ineffective. Even with marginal improvements in zoning district identification or the machine learning analysis, the unstandardized nature of zoning documents presents too many barriers for full automation to be feasible. Still, we discovered many areas of promise that are worth pursuing in a hybrid approach, combining the relative strengths of National Zoning Atlas teams and our algorithms.

We believe that automation could make data collection efforts easier in many areas, including web scraping to gather zoning documents, OCR to quickly and effectively extract text from documents, text analysis to flag relevant portions and save time for human reviewers, and data validation rules to speed up the existing manual checks.

Future efforts to improve a fully or partially automated methodology should start with tabular machine learning methods for parsing data in tables, which were beyond the scope of this case study. Perhaps more importantly, advocating for the standardization of zoning documents is crucial. For now, we conclude that a hybrid human-machine approach is the best way to expand the availability of zoning data for policy analysis.

-Judah Axelrod

Want to learn more? Sign up for the Data@Urban newsletter.

