Deriving a Score to Show Relative Socio-Economic Advantage and Disadvantage of a Geographic Area

Publicly accessible data exist that describe the socio-economic characteristics of a geographic location. In Australia, where I reside, the Government, through the Australian Bureau of Statistics (ABS), regularly collects and publishes individual and household data on income, occupation, education, employment and housing at an area level. Some examples of the published data points include:

  • Percentage of individuals on relatively high / low income
  • Percentage of individuals classified as managers in their respective occupations
  • Percentage of individuals with no formal educational attainment
  • Percentage of individuals unemployed
  • Percentage of properties with 4 or more bedrooms

Whilst these data points appear to focus heavily on individual people, they reflect people’s access to material and social resources, and their ability to participate in society in a particular geographic area, ultimately informing the socio-economic advantage and disadvantage of this area.

Given these data points, is there a way to derive a score which ranks geographic areas from the most to the least advantaged?

The goal of deriving a score could be formulated as a regression problem, where each data point or feature is used to predict a target variable, in this scenario a numerical score. This requires the target variable to be available for some instances in order to train the predictive model.

However, as we don’t have a target variable to start with, we may need to approach this problem in another way. For example, under the assumption that each geographic area is different from a socio-economic standpoint, can we aim to understand which data points help explain the most variation, thereby deriving a score based on a numerical combination of these data points?

We can do exactly that using a technique called Principal Component Analysis (PCA), and this article demonstrates how!

The ABS publishes data points indicating the socio-economic characteristics of a geographic area in the “Data Download” section of this webpage, under the “Standardised Variable Proportions data cube”[1]. These data points are published at the Statistical Area 1 (SA1) level, which is a digital boundary dividing Australia into areas with a population of roughly 200–800 people. It is a much more granular digital boundary than the Postcode (Zipcode) or the State digital boundary.

For the purpose of demonstration in this article, I’ll be deriving a socio-economic score based on 14 out of the 44 published data points provided in Table 1 of the data source above (I’ll explain why I chose this subset shortly). These are:

  • INC_LOW: Percentage of individuals living in households with stated annual household equivalised income between $1 and $25,999 AUD
  • INC_HIGH: Percentage of individuals with stated annual household equivalised income greater than $91,000 AUD
  • UNEMPLOYED_IER: Percentage of individuals aged 15 years and over who are unemployed
  • HIGHBED: Percentage of occupied private properties with 4 or more bedrooms
  • HIGHMORTGAGE: Percentage of occupied private properties paying mortgage greater than $2,800 AUD per month
  • LOWRENT: Percentage of occupied private properties paying rent lower than $250 AUD per week
  • OWNING: Percentage of occupied private properties without a mortgage
  • MORTGAGE: Percentage of occupied private properties with a mortgage
  • GROUP: Percentage of occupied private properties that are group occupied private properties (e.g. apartments or units)
  • LONE: Percentage of occupied private properties that are lone person occupied private properties
  • OVERCROWD: Percentage of occupied private properties requiring one or more extra bedrooms (based on Canadian National Occupancy Standard)
  • NOCAR: Percentage of occupied private properties with no cars
  • ONEPARENT: Percentage of one-parent families
  • UNINCORP: Percentage of properties with at least one person who is a business owner

In this section, I’ll be stepping through the Python code for deriving a socio-economic score for an SA1 region in Australia using PCA.

I’ll start by loading in the required Python packages and the data.

## Load the required Python packages

### For dataframe operations
import numpy as np
import pandas as pd

### For PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

### For Visualization
import matplotlib.pyplot as plt
import seaborn as sns

### For Validation
from scipy.stats import pearsonr

## Load data

file1 = 'data/standardised_variables_seifa_2021.xlsx'

### Reading from Table 1, from row 5 onwards, for column A to AT
data1 = pd.read_excel(file1, sheet_name = 'Table 1', header = 5,
usecols = 'A:AT')

## Remove rows with missing value (113 out of 60k rows)

data1_dropna = data1.dropna()
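The loading step above reads every column from A to AT, whereas the score is meant to use only the 14 features listed earlier. A minimal sketch of one way to narrow the table down is shown below. It assumes the column headers in Table 1 match the variable codes exactly (this is an assumption, not something confirmed by the source), and `select_features` is a hypothetical helper rather than part of the original code.

```python
import pandas as pd

# The 14 variable codes listed earlier in the article.
IER_FEATURES = [
    "INC_LOW", "INC_HIGH", "UNEMPLOYED_IER", "HIGHBED", "HIGHMORTGAGE",
    "LOWRENT", "OWNING", "MORTGAGE", "GROUP", "LONE", "OVERCROWD",
    "NOCAR", "ONEPARENT", "UNINCORP",
]

def select_features(df, id_col="SA1_2021", features=IER_FEATURES):
    """Keep the identifier column plus the chosen features (hypothetical helper)."""
    missing = [c for c in features if c not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in data: {missing}")
    return df[[id_col] + list(features)]

# Tiny made-up frame standing in for the ABS table.
demo = pd.DataFrame({"SA1_2021": [1, 2], "OTHER": [0.1, 0.2],
                     **{c: [0.0, 1.0] for c in IER_FEATURES}})
subset = select_features(demo)
print(subset.shape)
```

Raising an error on missing columns, rather than silently skipping them, makes a mismatch between the assumed and actual headers visible straight away.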

An important cleansing step before performing PCA is to standardise each of the 14 data points (features) to a mean of 0 and a standard deviation of 1. This is primarily to ensure the loadings assigned to each feature by PCA (think of them as indicators of how important a feature is) are comparable across features. Otherwise, more emphasis, or a higher loading, may be given to a feature which is actually not significant, or vice versa.

Note that the ABS data source quoted above already has the features standardised. That said, for an unstandardised data source:

## Standardise data for PCA

### Take all but the first column, which is merely a location indicator
data_final = data1_dropna.iloc[:,1:]

### Perform standardisation of data
sc = StandardScaler()
sc.fit(data_final)

### Standardised data
data_final = sc.transform(data_final)
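To illustrate why standardisation matters, the sketch below runs PCA on two made-up, independent features whose scales differ by a factor of 100, once on the raw values and once after `StandardScaler`. The data are synthetic and only meant to show the effect on the PC1 loadings, not to reproduce anything from the ABS data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two made-up, independent features on very different scales.
X = np.column_stack([rng.normal(0, 1, 500), rng.normal(0, 100, 500)])

# PC1 loadings without and with standardisation.
raw_loadings = PCA().fit(X).components_[0]
std_loadings = PCA().fit(StandardScaler().fit_transform(X)).components_[0]

print(np.abs(raw_loadings).round(2))
print(np.abs(std_loadings).round(2))
```

On the raw data, PC1 is driven almost entirely by the large-scale feature; after standardisation, both features carry equal weight.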

With the standardised data, PCA can be performed in just a few lines of code:

## Perform PCA

pca = PCA()
pca.fit_transform(data_final)

PCA aims to represent the underlying data by Principal Components (PCs). The number of PCs provided in a PCA is equal to the number of standardised features in the data. In this instance, 14 PCs are returned.

Each PC is a linear combination of all the standardised features, differentiated only by its respective loadings on the standardised features. For example, the image below shows the loadings assigned to the first and second PCs (PC1 and PC2) by feature.

Image 1: Code to return first two Principal Components. Image by author.
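The code behind Image 1 is not reproduced in the text, although a `df_plot` dataframe is referred to later. A rough sketch of how such a loadings table might be built is given below, assuming the loadings are taken from the `components_` attribute of the fitted scikit-learn `PCA` object. `loadings_frame` is a hypothetical helper and the data are made up.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def loadings_frame(fitted_pca, feature_names, n_components=2):
    """Return loadings as a DataFrame: one row per feature, one column per PC."""
    cols = [f"PC{i + 1}" for i in range(n_components)]
    return pd.DataFrame(fitted_pca.components_[:n_components].T,
                        index=feature_names, columns=cols)

# Made-up data with three features, purely to exercise the helper.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
pca_demo = PCA().fit(X_demo)
df_plot_demo = loadings_frame(pca_demo, ["A", "B", "C"])
print(df_plot_demo)
```

Each row of `components_` is one PC, so the transpose puts features on the rows, which is the layout a heatmap of loadings by feature would expect.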

With 14 PCs, the code below provides a visualization of how much variation each PC explains:


## Create visualization for variations explained by each PC

exp_var_pca = pca.explained_variance_ratio_
plt.bar(range(1, len(exp_var_pca) + 1), exp_var_pca, alpha = 0.7,
label = '% of Variation Explained',color = 'darkseagreen')

plt.ylabel('Explained Variation')
plt.xlabel('Principal Component')
plt.legend(loc = 'best')
plt.show()

As illustrated in the output visualization below, Principal Component 1 (PC1) accounts for the largest proportion of variance in the original dataset, with each subsequent PC explaining less of the variance. To be specific, PC1 explains circa 35% of the variation within the data.

Image 2: Variation explained by PC. Image by author.
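Alongside the bar chart, it can be handy to look at the cumulative share of variation explained as more PCs are included. The sketch below does this on made-up data; the numbers it prints have no connection to the ABS figures.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(300, 5))  # made-up data, 5 features

ratios = PCA().fit(X_demo).explained_variance_ratio_
cumulative = np.cumsum(ratios)
for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC{i}: {r:.1%} explained, {c:.1%} cumulative")
```

When all PCs are kept, the cumulative share reaches 100%, which is a quick sanity check on the fit.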

For the purpose of demonstration in this article, PC1 is chosen as the only PC for deriving the socio-economic score, for the following reasons:

  • PC1 explains a sufficiently large share of the variation within the data on a relative basis.
  • Whilst selecting more PCs potentially allows for (marginally) more variation to be explained, it makes interpretation of the score difficult in the context of the socio-economic advantage and disadvantage of a particular geographic area. For example, as shown in the image below, PC1 and PC2 may provide conflicting narratives as to how a particular feature (e.g. ‘INC_LOW’) influences the socio-economic variation of a geographic area.

## Show and compare loadings for PC1 and PC2

### Using df_plot dataframe per Image 1

sns.heatmap(df_plot, annot = False, fmt = ".1f", cmap = 'summer')
plt.show()

Image 3: Different loadings for PC1 and PC2. Image by author.

To obtain a score for each SA1, we simply multiply the standardised proportion of each feature by its PC1 loading and sum the results. This can be achieved by:


## Obtain raw score based on PC1

### Perform sum product of standardised features and PC loadings,
### then reverse the sign to make the output more interpretable
pca_data_transformed = -1.0*pca.fit_transform(data_final)

### Convert to Pandas dataframe, and join raw score with SA1 column
pca1 = pd.DataFrame(pca_data_transformed[:,0], columns = ['Score_Raw'])
score_SA1 = pd.concat([data1_dropna['SA1_2021'].reset_index(drop = True), pca1]
, axis = 1)

### Inspect the raw score
score_SA1.head()

Image 4: Raw socio-economic score by SA1. Image by author.
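The "sum product" described above can be checked by hand. The sketch below, on made-up data, compares the first column returned by scikit-learn's `transform` with a manual calculation: centre each feature, multiply by its PC1 loading, and add up. The sign flip used in the article is left out here to keep the comparison direct.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X_std = rng.normal(size=(100, 4))  # made-up stand-in for standardised features

pca_demo = PCA().fit(X_std)
scores_sklearn = pca_demo.transform(X_std)[:, 0]

# Manual sum product: centre each feature, then weight by its PC1 loading.
pc1_loadings = pca_demo.components_[0]
scores_manual = (X_std - pca_demo.mean_) @ pc1_loadings

print(np.allclose(scores_sklearn, scores_manual))  # True
```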

The higher the score, the more advantaged an SA1 is in terms of its access to socio-economic resources.

How do we know the score we derived above is even remotely correct?

For context, the ABS actually publishes a socio-economic score called the Index of Economic Resources (IER), defined on the ABS website as:

“The Index of Economic Resources (IER) focuses on the financial aspects of relative socio-economic advantage and disadvantage, by summarising variables related to income and housing. IER excludes education and occupation variables as they are not direct measures of economic resources. It also excludes assets such as savings or equities which, although relevant, cannot be included as they are not collected in the Census.”

Without disclosing the detailed steps, the ABS stated in their Technical Paper that the IER was derived using the same features (14) and methodology (PCA, PC1 only) as what we performed above. That is, if we did derive the correct scores, they should be comparable against the IER scores published here (“Statistical Area Level 1, Indexes, SEIFA 2021.xlsx”, Table 4).

As the published score is standardised to a mean of 1,000 and a standard deviation of 100, we start the validation by standardising the raw score in the same way:

## Standardise raw scores

### The raw PC1 score already has a mean of 0 because PCA centres the data
score_SA1['IER_recreated'] = (
    (score_SA1['Score_Raw']/score_SA1['Score_Raw'].std())*100 + 1000
)

For comparison, we read in the published IER scores by SA1:

## Read in ABS published IER scores
## similarly to how we read in the standardised proportions of the features

file2 = 'data/Statistical Area Level 1, Indexes, SEIFA 2021.xlsx'

data2 = pd.read_excel(file2, sheet_name = 'Table 4', header = 5,
usecols = 'A:C')

### Assumes the score column in Table 4 is headed 'Score'
data2.rename(columns = {'2021 Statistical Area Level 1 (SA1)': 'SA1_2021', 'Score': 'IER_2021'}, inplace = True)

col_select = ['SA1_2021', 'IER_2021']
data2 = data2[col_select]

ABS_IER_dropna = data2.dropna().reset_index(drop = True)

Validation 1: PC1 Loadings

As shown in the image below, comparing the PC1 loadings derived above against the PC1 loadings published by the ABS suggests that they differ by a constant of -45%. As this is merely a scaling difference, it doesn’t impact the derived scores, which are standardised (to a mean of 1,000 and a standard deviation of 100).

Image 5: Compare PC1 loadings. Image by author.

(You should be able to verify the ‘Derived (A)’ column against the PC1 loadings in Image 1).
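The claim that a constant scaling of the loadings leaves the standardised scores unchanged can be checked numerically. The sketch below uses made-up features and loadings, and a hypothetical factor of 0.55 to stand in for a 45% reduction; `to_index` is a hypothetical helper that rescales to a mean of 1,000 and a standard deviation of 100. It assumes the scaling factor is positive, since a negative factor would mirror the scores around the mean.

```python
import numpy as np

def to_index(raw, mean=1000.0, sd=100.0):
    """Rescale raw scores to the given mean and standard deviation."""
    raw = np.asarray(raw, dtype=float)
    return (raw - raw.mean()) / raw.std(ddof=1) * sd + mean

rng = np.random.default_rng(4)
X_std = rng.normal(size=(50, 3))          # made-up standardised features
loadings = np.array([0.6, -0.3, 0.74])    # made-up PC1 loadings

index_a = to_index(X_std @ loadings)
index_b = to_index(X_std @ (0.55 * loadings))  # same loadings, rescaled

print(np.allclose(index_a, index_b))  # True
```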

Validation 2: Distribution of Scores

The code below creates a histogram for both scores, whose shapes look almost identical.

## Check distribution of scores

score_SA1.hist(column = 'IER_recreated', bins = 100, color = 'darkseagreen')
plt.title('Distribution of recreated IER scores')

ABS_IER_dropna.hist(column = 'IER_2021', bins = 100, color = 'lightskyblue')
plt.title('Distribution of ABS IER scores')

plt.show()

Image 6: Distribution of IER scores, recreated vs. published. Image by author.

Validation 3: IER score by SA1

As the ultimate validation, let’s compare the IER scores by SA1:


## Join the two scores by SA1 for comparison
IER_join = pd.merge(ABS_IER_dropna, score_SA1, how = 'left', on = 'SA1_2021')

## Plot scores on x-y axis.
## If the scores are identical, it should show a straight line.

plt.scatter('IER_recreated', 'IER_2021', data = IER_join, color = 'darkseagreen')
plt.title('Comparison of recreated and ABS IER scores')
plt.xlabel('Recreated IER score')
plt.ylabel('ABS IER score')

plt.show()

A diagonal straight line, as shown in the output image below, supports that the two scores are largely identical.

Image 7: Comparison of scores by SA1. Image by author.

To add to this, the code below shows the two scores have a correlation close to 1:

Image 8: Correlation between the recreated and published scores. Image by author.
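The correlation code itself only appears in the image. A minimal sketch of how it might look with `pearsonr` (imported at the start of the walkthrough) is given below. The dataframe here is made up and stands in for `IER_join`, and the step that drops missing pairs is an assumption about how unmatched SA1s from the left join would be handled.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Made-up stand-in for the joined table; the real one is IER_join above.
rng = np.random.default_rng(5)
published = rng.normal(1000, 100, 200)
demo_join = pd.DataFrame({
    "IER_2021": published,
    "IER_recreated": published + rng.normal(0, 1, 200),
})
demo_join.loc[0, "IER_recreated"] = np.nan  # an unmatched SA1 after a left join

# pearsonr is not designed to skip missing values, so drop incomplete pairs first.
pairs = demo_join[["IER_recreated", "IER_2021"]].dropna()
r, p_value = pearsonr(pairs["IER_recreated"], pairs["IER_2021"])
print(round(r, 4))
```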

The demonstration in this article effectively replicates how the ABS calibrates the IER, one of the four socio-economic indexes it publishes, which can be used to rank the socio-economic status of a geographic area.

Taking a step back, what we’ve achieved in essence is a reduction in the dimension of the data from 14 to 1, at the cost of losing some of the information conveyed by the data.

Dimensionality reduction techniques such as PCA are also commonly used to reduce high-dimensional spaces, such as text embeddings, to 2–3 (visualizable) Principal Components.
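As a rough illustration of that last use case, the sketch below projects made-up 384-dimensional vectors onto two Principal Components. The vectors are random numbers standing in for text embeddings, and the dimension of 384 is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
embeddings = rng.normal(size=(20, 384))  # made-up 384-dimensional vectors

# Keep only the first two PCs so each vector can be plotted as an (x, y) point.
pca_2d = PCA(n_components=2)
coords = pca_2d.fit_transform(embeddings)

print(coords.shape)  # (20, 2)
print(pca_2d.explained_variance_ratio_.sum())
```

The summed explained variance ratio indicates how much of the original variation survives the projection, which is worth checking before reading too much into a 2D plot.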
