Air for Tomorrow: Mapping the Digital Air-Quality Landscape, from Repositories and Data Types to Starter Code


road in Lao PDR. The school is 200 meters away. Traffic roars, smoke from burning garbage drifts across the path, and children walk straight through it. What are they breathing today? Without local data, nobody really knows.

Across East Asia and the Pacific, 325 million children [1] breathe toxic air every day, sometimes at levels 10 times above safe limits. The damage is often silent: weakened lungs, asthma, and, in acute cases, missed school days. Their futures are at stake. In the long term, health systems are strained and economies bear the costs.

In many cases, air quality data isn’t even available.

No monitors. No evidence. No protection. 

In this second part of the blog series [2], we look at the data repositories where useful air-quality data is available, how to import it, and how to get it up and running in your notebook. We will also demystify data formats such as GeoJSON, Parquet/GeoParquet, NetCDF/HDF5, COG, GRIB, and Zarr so you can pick the right tool for the job. This builds toward the next part, where we will go step by step through how we developed an open-source air quality model.

In the past few years, there has been a major push to generate and use air-quality data. These data come from different sources, and their quality varies accordingly. A few kinds of repositories help make sense of them: regulatory stations for ground truth, community sensors to understand hyperlocal variations, satellites for regional context, and model reanalyses for estimates (Figure 2). The good news: most of this is open. The better news: the code to get started is relatively short.

Figure 2: Fire hotspots as of 20.04.2024 and the interpolated density map created using multiple data sources. Source: © UNICEF. All rights reserved.

Repository quick-starts (with minimal Python) 

In this section, we move from concepts to practice. Below, we walk through a set of commonly used open-source repositories and show the smallest possible code you need to start pulling data from each of them. All examples assume Python ≥3.10 with pip install as needed.

For every numbered repository, you can see: 

  • a short description of what the data source is and how it is maintained, 
  • typical use-cases (when this source is a good fit), 
  • how to access it (API keys, sign-up notes, or direct URLs), and 
  • a minimal Python code snippet to extract data. 

Think of this as a practical guide: skim the descriptions, pick the source that matches your problem, and then adapt the code to plug directly into your own analysis or model pipeline.

Tip: Keep secrets out of code. Use environment variables for tokens (e.g., export AIRNOW_API_KEY=…). 
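For example, a minimal sketch of reading such a token from the environment in Python (the variable name is whatever you exported):

import os
api_key = os.getenv("AIRNOW_API_KEY")  # returns None if the variable is not set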

1) OpenAQ (global ground measurements; open API) 

OpenAQ [3] is an open-source data platform that hosts global air quality data, such as PM2.5, PM10, and O3. It provides air quality data by partnering with governments, community networks, and air quality sensor companies such as AirGradient and IQAir, among others.

Great for: quick cross-country pulls, harmonised units/metadata, reproducible pipelines. 

Sign up for an OpenAQ API key at https://explore.openaq.org. After signing up, find your API key in your account settings and use it to authenticate requests.

!pip install openaq pandas
import datetime
from datetime import timedelta

import pandas as pd
from pandas import json_normalize
from openaq import OpenAQ

# follow the quickstart to get the API key: https://docs.openaq.org/using-the-api/quick-start
api_key = ''  # enter your API key before executing
client = OpenAQ(api_key=api_key)  # use the API key generated earlier

# get the locations of all sensors in the chosen countries; country codes: https://docs.openaq.org/resources/countries
locations = client.locations.list(
    countries_id=[68, 111],
    limit=1000,
)

data_locations = locations.dict()
df_sensors_country = json_normalize(data_locations['results'])
df_sensors_exploded = df_sensors_country.explode('sensors')
df_sensors_exploded['sensor_id'] = df_sensors_exploded['sensors'].apply(lambda x: x['id'])
df_sensors_exploded['sensor_type'] = df_sensors_exploded['sensors'].apply(lambda x: x['name'])
df_sensors_pm25 = df_sensors_exploded[df_sensors_exploded['sensor_type'] == "pm25 µg/m³"]
df_sensors_pm25

# go through each location and extract the hourly measurements
df_concat_aq_data = pd.DataFrame()
to_date = datetime.datetime.now()
from_date = to_date - timedelta(days=2)  # get the past 2 days of data
sensor_list = df_sensors_pm25.sensor_id

for sensor_id in sensor_list[0:5]:
    print("-----")
    response = client.measurements.list(
        sensors_id=sensor_id,
        datetime_from=from_date,
        datetime_to=to_date,
        limit=500,
    )
    print(response)

    data_measurements = response.dict()
    df_hourly_data = json_normalize(data_measurements['results'])
    df_hourly_data["sensor_id"] = sensor_id
    if len(df_hourly_data) > 0:
        df_concat_aq_data = pd.concat([df_concat_aq_data, df_hourly_data])
        df_concat_aq_data = df_concat_aq_data[["sensor_id", "period.datetime_from.utc", "period.datetime_to.utc", "parameter.name", "value"]]

df_concat_aq_data

2) EPA AQS Data Mart (U.S. regulatory archive; token needed) 

The EPA AQS Data Mart [4] is a U.S. regulatory data archive that hosts quality-controlled air-quality measurements from hundreds of monitoring stations across the country. It provides long-term records for criteria pollutants such as PM₂.₅, PM₁₀, O₃, NO₂, SO₂, and CO, along with detailed site metadata and QA flags, and is freely accessible via an API once you register and obtain an access token. It provides meteorological data as well.

Great for: authoritative QA/QC-d U.S. data. 

Sign up for an AQS Data Mart account on the US EPA website at: https://aqs.epa.gov/aqsweb/documents/data_api.html
Create a .env file in your environment (or export environment variables) with your credentials: your AQS email and AQS key.

# pip install requests pandas 

import os, requests, pandas as pd 
AQS_EMAIL = os.getenv("AQS_EMAIL") 
AQS_KEY   = os.getenv("AQS_KEY") 

url = "https://aqs.epa.gov/data/api/sampleData/byState" 
params = {"email": AQS_EMAIL, "key": AQS_KEY, "param": "88101", "b date":"20250101", "edate": "20250107", "state": "06"} 
r = requests.get(url, params=params, timeout=60) 

df = pd.json_normalize(r.json()["Data"]) 
print(df[["state_name","county_name","date_local","sample_measurement","units_of_measure"]].head()) 

3) AirNow (U.S. real-time indices; API key) 

AirNow [5] is a U.S. government platform that provides near real-time Air Quality Index (AQI) information based on regulatory monitoring data. It publishes current and forecast AQI values for pollutants such as PM₂.₅ and O₃, along with category breakpoints (“Good”, “Moderate”, etc.) that are easy to communicate to the public. Data can be accessed programmatically via the AirNow API once you register and obtain an API key.

Great for: wildfire and public-facing AQI visuals. 

Register for an AirNow API account via the AirNow API portal: https://docs.airnowapi.org/ 

From the Log In page, select “Request an AirNow API Account” and complete the registration form with your email and basic details. After you activate your account, you will find your API key in your AirNow API dashboard; use this key to authenticate all calls to the AirNow web services.

import os, requests, pandas as pd 

API_KEY = os.getenv("AIRNOW_API_KEY") 
url = "https://www.airnowapi.org/aq/commentary/latLong/current/" 
params = {"format":"application/json", "latitude": 37.7749, "longitude": -122.4194, "distance":25, "API_KEY": API_KEY} 
df = pd.DataFrame(requests.get(url, params=params, timeout=30).json()) 

print(df[["ParameterName", "AQI" ,"Category.Name ","DateObserved", "HourObserved"]]) 

4) Copernicus Atmosphere Monitoring Service (CAMS; Atmosphere Data Store)

The Copernicus Atmosphere Monitoring Service [6], implemented by ECMWF for the EU’s Copernicus programme, provides global reanalyses and near-real-time forecasts of atmospheric composition. Through the Atmosphere Data Store (ADS), you can access gridded fields for aerosols, reactive gases (O₃, NO₂, etc.), greenhouse gases, and related meteorological variables, with multi-year records suitable for both research and operational applications. All CAMS products in the ADS are open and free of charge, subject to accepting the Copernicus licence.

Great for: global background fields (aerosols & trace gases), forecasts and reanalyses. 

How to register and get API access 

  1. Go to the Atmosphere Data Store: https://ads.atmosphere.copernicus.eu
  2. Click Login / Register in the top-right corner and create a (free) Copernicus/ECMWF account. 
  3. After confirming your email, log in and visit your profile page to find your ADS API key (UID + key). 
  4. Follow the ADS “How to use the API” instructions to create a configuration file (typically ~/.cdsapirc) with: 
     url: https://ads.atmosphere.copernicus.eu/api 
     key: <UID>:<API-KEY> 
  5. On the web page of each CAMS dataset you want to use, go to the Download data tab and accept the licence at the bottom once; only then will API requests for that dataset succeed. 

Once this is set up, you can use the standard cdsapi Python client to programmatically download CAMS datasets from the ADS.

# pip install cdsapi xarray cfgrib 

import cdsapi 
c = cdsapi.Client() 

# Example: CAMS global reanalysis (EAC4) total column ozone (toy example) 
c.retrieve( 
    "cams-global-reanalysis-eac4", 
    {"variable":"total_column_ozone","date":"2025-08-01/2025-08-02","time":["00:00","12:00"], 
     "format":"grib"}, "cams_ozone.grib") 

5) NASA Earthdata (LAADS DAAC / GES DISC; token/login) 

NASA Earthdata [7] provides unified sign-on access to a wide range of Earth science data, including satellite aerosol and trace-gas products that are crucial for air-quality applications. Two key centres for atmospheric composition are: 

  • LAADS DAAC (Level-1 and Atmosphere Archive and Distribution System DAAC), which hosts MODIS, VIIRS and other instrument products (e.g., AOD, cloud, fire, radiance). 
  • GES DISC (Goddard Earth Sciences Data and Information Services Center), which serves model and satellite products such as MERRA-2 reanalysis, OMI, TROPOMI, and related atmospheric datasets. 

Most of these datasets are free to use but require a NASA Earthdata Login; downloads are authenticated either via HTTP basic auth (username/password stored in .netrc) or via a personal access token (PAT) in request headers.

Great for: MODIS/VIIRS AOD, MAIAC, TROPOMI trace-gas products.  

How to register and get API/download access: 

  1. Create a NASA Earthdata Login account at: 
    https://urs.earthdata.nasa.gov 
  2. Confirm your email and log in to your Earthdata profile. 
  3. Under your profile, generate a personal access token (PAT). Save this token securely; you can use it in scripts via an Authorization: Bearer <token> header or in tools that support Earthdata tokens. 
  4. For standard wget/curl-based downloads, you can alternatively create a ~/.netrc file to store your Earthdata username and password, for example: 
machine urs.earthdata.nasa.gov 
login <username> 
password <password> 

Then set file permissions to user-only (chmod 600 ~/.netrc) so command-line tools can authenticate automatically. 

  5. For LAADS DAAC products, go to https://ladsweb.modaps.eosdis.nasa.gov, log in with your Earthdata credentials, and use the Search & Download interface to build download URLs; you can copy the auto-generated wget/curl commands into your scripts. 
  6. For GES DISC datasets, start from https://disc.gsfc.nasa.gov, select a dataset (e.g., MERRA-2), and use the “Data Access” or “Subset/Get Data” tools. The site can generate script templates (Python, wget, etc.) that already include the correct endpoints for authenticated access. 

Once your Earthdata Login and token are set up, LAADS DAAC and GES DISC behave like standard HTTPS APIs: you can call them from Python (e.g., with requests, xarray + pydap/OPeNDAP, or s3fs for cloud buckets) using your credentials or token for authenticated, scriptable downloads.

#Downloads via HTTPS with Earthdata login. 

# pip install requests 
import requests 
url = "https://ladsweb.modaps.eosdis.nasa.gov/archive/allData/6/MCD19A2/2025/214/MCD19A2.A2025214.h21v09.006.2025xxxxxx.hdf" 

# Requires a valid Earthdata token or login; recommend using .netrc or requests.Session() with auth 
# See NASA docs for token-based download; here we only illustrate the pattern: 
# s = requests.Session(); s.auth = (USERNAME, PASSWORD); r = s.get(url) 
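
A minimal sketch of the token-based pattern, assuming you have stored a personal access token in an EARTHDATA_TOKEN environment variable (the URL above contains placeholders, so substitute a real granule URL):

import os, requests

token = os.getenv("EARTHDATA_TOKEN")
headers = {"Authorization": f"Bearer {token}"}
with requests.get(url, headers=headers, stream=True, timeout=120) as r:
    r.raise_for_status()  # fails early if the token or URL is invalid
    with open("granule.hdf", "wb") as f:
        for chunk in r.iter_content(chunk_size=1 << 20):  # stream in ~1 MB chunks
            f.write(chunk)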

6) STAC catalogues (search satellites programmatically) 

SpatioTemporal Asset Catalog (STAC) [8] is an open specification for describing geospatial assets, such as satellite scenes, tiles, and derived products, in a consistent, machine-readable way. Instead of manually browsing download portals, you query a STAC API with filters like time, bounding box, cloud cover, platform (e.g., Sentinel-2, Landsat-8, Sentinel-5P), or processing level, and get back JSON items with direct links to COGs, NetCDF, Zarr, or other assets.

Great for: discovering and streaming assets (COGs/NetCDF) without bespoke APIs; works well with Sentinel-5P, Landsat, Sentinel-2, and more. 

How to register and get API access: 
STAC itself is just a standard; access depends on the specific STAC API you use: 

  • Many public STAC catalogues (e.g., demo or research endpoints) are fully open and require no registration; you can hit their /search endpoint directly with HTTP POST/GET. 
  • Some cloud platforms that expose STAC (for example, commercial or large cloud providers) require you to create a free account and obtain credentials before you can read the underlying assets (e.g., blobs in S3/Blob storage), even though the STAC metadata is open. 

A generic pattern is: 

  1. Pick a STAC API endpoint for the satellite data you care about (often documented as something like https://<provider>/stac or …/stac/search). 
  2. If the provider requires sign-up, create an account in their portal and obtain the API key or storage credentials they recommend (this could be a token, SAS URL, or cloud access role). 
  3. Use a STAC client library in Python (for example, pystac-client) to search the catalogue: 
# pip install pystac-client 
from pystac_client import Client 

api = Client.open("https://example.com/stac") 
search = api.search( 
    collections=["sentinel-2-l2a"], 
    bbox=[102.4, 17.8, 103.0, 18.2],   # minx, miny, maxx, maxy 
    datetime="2024-01-01/2024-01-31", 
    query={"eo:cloud_cover": {"lt": 20}}, 
    )
items = list(search.get_items()) 
first_item = items[0] 
assets = first_item.assets  # e.g., COGs, QA bands, metadata 
  4. For each returned STAC item, follow the asset href links (often HTTPS URLs or cloud URIs like s3://…) and read them with the appropriate library (rasterio/xarray/zarr etc.). If credentials are needed, configure them via environment variables or your cloud SDK as per the provider’s instructions. 

Once set up, STAC catalogues give you a uniform, programmatic way to search and retrieve satellite data across different providers, without rewriting your search logic each time you switch from one archive to another.

# pip install pystac-client planetary-computer rasterio shapely 
from pystac_client import Client 
from shapely.geometry import box, mapping 

catalog = Client.open("https://earth-search.aws.element84.com/v1") 
aoi = mapping(box(-0.3, 5.5, 0.3, 5.9))  # bbox around Accra
search = catalog.search(collections=["sentinel-2-l2a"], intersects=aoi, limit=5) 
items = list(search.get_items()) 
for it in items: 
    print(it.id, list(it.assets.keys())[:5])   # e.g., "B04", "B08", "SCL", "visual" 

It is preferable to use STAC where possible, as it provides clean metadata, cloud-optimised assets, and easy filtering by time/space.
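
As a follow-up to the Earth Search example above, here is a minimal sketch that streams a small, decimated preview from one returned item with rasterio (the "visual" asset key is an assumption for sentinel-2-l2a on this catalogue; adjust for other collections):

import rasterio

href = items[0].assets["visual"].href  # HTTPS link to a Cloud-Optimised GeoTIFF
with rasterio.open(href) as src:
    # read a heavily downsampled overview via HTTP range requests, not the full file
    preview = src.read(1, out_shape=(src.height // 32, src.width // 32))
    print(src.crs, preview.shape)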

7) Google Earth Engine (GEE; fast prototyping at scale) 

Google Earth Engine [9] is a cloud-based geospatial analysis platform that hosts a large catalogue of satellite, climate, and land-surface datasets (e.g., MODIS, Landsat, Sentinel, reanalyses) and lets you process them at scale without managing your own infrastructure. You write short scripts in JavaScript or Python, and GEE handles the heavy lifting (data access, tiling, reprojection, and parallel computation), making it ideal for fast prototyping, exploratory analyses, and teaching.

However, GEE itself is not open source: it is a proprietary, closed platform whose underlying codebase is not publicly available. This has implications for the open, reproducible workflows discussed in the first blog [add link].
 
Great for: testing fusion/downscaling over a city/region using petabyte-scale datasets. 
 
How to register and get access 

  1. Visit the Earth Engine sign-up page: https://earthengine.google.com
  2. Sign up with a Google account and complete the non-commercial sign-up form, describing your intended use (research, education, or personal, non-commercial projects). 
  3. Once your account is approved, you can: 
  • use the browser-based Code Editor to write JavaScript Earth Engine scripts; and 
  • enable the Earth Engine API in Google Cloud and install the earthengine-api Python package (pip install earthengine-api) to run workflows from Python notebooks. 
  4. When sharing your work, consider exporting key intermediate results (e.g., GeoTIFF/COG, NetCDF/Zarr) and documenting your processing steps in open-source code so that others can re-create the analysis without depending entirely on GEE. 

Used this way, Earth Engine becomes a powerful “rapid laboratory” for testing ideas, which you can then harden into fully open, portable pipelines for production and long-term stewardship.

# pip install earthengine-api 
import ee 

ee.Initialize()  # first run: ee.Authenticate() in a console 
s5p = (ee.ImageCollection('COPERNICUS/S5P/OFFL/L3_NO2')
       .select('NO2_column_number_density')
       .filterDate('2025-08-01', '2025-08-07')
       .mean())

print(s5p.getInfo()['bands'][0]['id']) 

# Exporting and visualization happen inside GEE; you can sample to a grid, then use .getDownloadURL() 
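
As a rough sketch of that last step (the region, scale, and format values below are illustrative assumptions; large regions exceed the direct-download limit and should be exported instead):

region = ee.Geometry.Rectangle([102.4, 17.8, 103.0, 18.2])  # small illustrative AOI in Lao PDR
url = s5p.clip(region).getDownloadURL({
    "region": region, "scale": 1113, "format": "GEO_TIFF"})  # URL for a small GeoTIFF download
print(url)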

8) HIMAWARI

Himawari-8 and Himawari-9 are geostationary meteorological satellites operated by the Japan Meteorological Agency (JMA). Their Advanced Himawari Imager (AHI) provides multi-band visible, near-infrared, and infrared imagery over East Asia and the western–central Pacific, with full-disk scans every 10 minutes and even faster refresh over target regions. This high-cadence view is extremely useful for tracking smoke plumes, dust, volcanic eruptions, convective storms, and the diurnal evolution of clouds: precisely the kinds of processes that modulate near-surface air quality.
 
Great for: tracking diurnal haze/smoke plumes and fire events, generating high-frequency AOD to fill polar-orbit gaps, and rapid situational awareness for cities across SE/E Asia (via JAXA P-Tree L3 products). 

How to access and register 

Via the NOAA open-data buckets on AWS: 

  1. Browse the dataset description on the AWS Registry of Open Data: https://registry.opendata.aws/noaa-himawari/ 
  2. Himawari-8 and Himawari-9 imagery are hosted in public S3 buckets (s3://noaa-himawari8/ and s3://noaa-himawari9/). Since the buckets are world-readable, you can list or download files anonymously, for example: 

aws s3 ls --no-sign-request s3://noaa-himawari9/ 

or access individual objects via HTTPS (e.g., https://noaa-himawari9.s3.amazonaws.com/…). 

  3. For Python workflows, you can use libraries like s3fs, fsspec, xarray, or rasterio to stream data directly from these buckets without prior registration, keeping in mind the attribution guidance from JMA/NOAA when you publish results. 

 

Via JAXA’s P-Tree service: 

  1. Go to the JAXA Himawari Monitor / P-Tree portal: 
    https://www.eorc.jaxa.jp/ptree/ 
  2. Click User Registration / Account request and read the “Precautions” and “Terms of Use”. Data access is restricted to non-profit purposes such as research and education; commercial users are directed to the Japan Meteorological Business Support Center. 
  3. Submit your email address in the account request form. You will receive a temporary acceptance email, then a link to complete your user information. After manual review, JAXA enables your access and notifies you once you can download Himawari Standard Data and geophysical parameter products. 
  4. Once approved, you can log in to download near-real-time and archived Himawari data via the P-Tree FTP/HTTP services, following JAXA’s guidance on non-redistribution and citation. 

In practice, a common pattern is to use the NOAA/AWS buckets for open, scriptable access to raw imagery, and the JAXA P-Tree products when you need value-added parameters (e.g., cloud or aerosol properties) and are working within non-profit research or educational projects.

# install dependencies (not all of them are used in this snippet)
!pip install xarray netCDF4
!pip install rasterio polars_h3
!pip install geopandas pykrige
!pip install polars==1.25.2
!pip install dask[complete] rioxarray h3==3.7.7
!pip install h3ronpy==0.21.1
!pip install geowrangler

# Himawari via JAXA Himawari Monitor / P-Tree
# create your account here and use the username and password sent by email: https://www.eorc.jaxa.jp/ptree/registration_top.html

user = '' # enter the username 
password = '' # enter the password 
from ftplib import FTP
from pathlib import Path
import rasterio
from rasterio.transform import from_origin
import xarray as xr
import os
import matplotlib.pyplot as plt


def get_himawari_ftp_past_2_days(user, password):

    # FTP connection details
    ftp = FTP('ftp.ptree.jaxa.jp')
    ftp.login(user=user, passwd=password)

    # check the directory content : /pub/himawari/L2/ARP/031/
    # details of the AOD directory here: https://www.eorc.jaxa.jp/ptree/documents/README_HimawariGeo_en.txt

    overall_path= "/pub/himawari/L3/ARP/031/"
    directories = overall_path.strip("/").split("/")

    for directory in directories:
      ftp.cwd(directory)

    # List files in the target directory
    date_month_files = ftp.nlst()

    # sort files in ascending order
    date_month_files.sort(reverse=False)
    print("Files in target directory:", date_month_files)

    # get a list of all the month/day directories under "/pub/himawari/L3/ARP/031/" for the past 2 months
    limited_months_list = date_month_files[-2:]

    i=0
    # for each month in limited_months_list, list all the days inside it
    for month in limited_months_list:
      ftp.cwd(month)
      date_day_files = ftp.nlst()
      date_day_files.sort(reverse=False)


      # combine each element of date_day_files with the month: month + "/" + date_day_file
      list_combined_days_month_inter = [month + "/" + date_day_file for date_day_file in date_day_files]
      if i ==0:
        list_combined_days_month= list_combined_days_month_inter
        i=i+1
      else:
        list_combined_days_month= list_combined_days_month + list_combined_days_month_inter
      ftp.cwd("..")

    # remove all elements containing 'daily' or 'monthly' from list_combined_days_month
    list_combined_days_month = [item for item in list_combined_days_month if 'daily' not in item and 'monthly' not in item]

    # get the list of days we want to download: in our case the last 2 days, for NRT
    limited_list_combined_days_month=list_combined_days_month[-2:]


    for month_day_date in limited_list_combined_days_month:
      #navigate to the relevant directory
      ftp.cwd(month_day_date)
      print(f"directory: {month_day_date}")

      # get the list of the hourly files inside each directory
      date_hour_files = ftp.nlst()
      !mkdir -p ./raw_data/{month_day_date}

      # for each hourly file in the list
      for date_hour_file in date_hour_files:
        target_file_path = f"./raw_data/{month_day_date}/{date_hour_file}"
        # Download the target file, only if it doesn't already exist

        if not os.path.exists(target_file_path):
            with open(target_file_path, "wb") as local_file:
              ftp.retrbinary(f"RETR {date_hour_file}", local_file.write)
              print(f"Downloaded {date_hour_file} successfully!")
        else:
            print(f"File already exists: {date_hour_file}")



      print("--------------")
      # go back 2 levels in the FTP tree
      ftp.cwd("..")
      ftp.cwd("..")
def transform_to_tif():
    # get list of files in raw_data folder
    month_file_list = os.listdir("./raw_data")
    month_file_list

    #order month_file_list
    month_file_list.sort(reverse=False)

    nb_errors=0
    # get list of day folders for the past 2 months only

    for month_file in month_file_list[-2:]:
        print(f"-----------------------------------------")
        print(f"Month considered: {month_file}")
        date_file_list=os.listdir(f"./raw_data/{month_file}")
        date_file_list.sort(reverse=False)

        # get list of files for each day folder

        for date_file in date_file_list[-2:]:
            print(f"---------------------------")
            print(f"Day considered: {date_file}")
            hour_file_list=os.listdir(f"./raw_data/{month_file}/{date_file}")
            hour_file_list.sort(reverse=False)

            # process each hourly file into a tif file and transform it into an H3-processed dataframe
            for hour_file in hour_file_list:
                file_path = f"./raw_data/{month_file}/{date_file}/{hour_file}"
                hour_file_tif=hour_file.replace(".nc",".tif")
                output_tif = f"./tif/{month_file}/{date_file}/{hour_file_tif}"
                if os.path.exists(output_tif):
                   print(f"File already exists: {output_tif}")
                else:

                   try:
                      dataset = xr.open_dataset(file_path, engine='netcdf4')
                   except Exception:
                      # go to the next hour_file
                      print(f"error opening {hour_file} file - skipping")
                      nb_errors = nb_errors + 1
                      continue

                   # Access a specific variable
                   variable_name = list(dataset.data_vars.keys())[1] # Merged AOT product
                   data = dataset[variable_name]

                   # Plot data (if it's 2D and compatible)
                   plt.figure()
                   data.plot()
                   plt.title(f'{date_file}')
                   plt.show()

                   # Extract metadata (replace with actual coordinates from your data if available)
                   lon = dataset['longitude'] if 'longitude' in dataset.coords else None
                   lat = dataset['latitude'] if 'latitude' in dataset.coords else None

                   # Handle missing lat/lon (example assumes evenly spaced grid)
                   if lon is None or lat is None:
                        lon_start, lon_step = -180, 0.05 # Example values
                        lat_start, lat_step = 90, -0.05 # Example values
                        lon = xr.DataArray(lon_start + lon_step * range(data.shape[-1]), dims=['x'])
                        lat = xr.DataArray(lat_start + lat_step * range(data.shape[-2]), dims=['y'])

                   # Define the affine transform for georeferencing
                   transform = from_origin(lon.min().item(), lat.max().item(), abs(lon[1] - lon[0]).item(), abs(lat[0] - lat[1]).item())

                   # Save to GeoTIFF
                   !mkdir -p ./tif/{month_file}/{date_file}

                   with rasterio.open(
                   output_tif,
                   'w',
                   driver='GTiff',
                   height=data.shape[-2],
                   width=data.shape[-1],
                   count=1, # Number of bands
                   dtype=data.dtype.name,
                   crs='EPSG:4326', # Coordinate Reference System (e.g., WGS84)
                   transform=transform
                   ) as dst:

                        dst.write(data.values, 1) # Write the data to band 1
                   print(f"Saved {output_tif} successfully!")
                   print(f"{nb_errors} error(s) ")
get_himawari_ftp_past_2_days(user, password)
transform_to_tif()
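
A minimal follow-up sketch to inspect one of the GeoTIFFs produced above with rioxarray (installed earlier); the path below is a hypothetical example, so point it at a file that exists in your ./tif tree:

import rioxarray
da = rioxarray.open_rasterio("./tif/202504/20250420/example_aod.tif")  # hypothetical filename
print(da.rio.crs, da.shape)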

9) NASA — FIRMS [Special Highlight] 

NASA’s Fire Information for Resource Management System (FIRMS) [10] provides near-real-time information on active fires and thermal anomalies detected by instruments such as MODIS and VIIRS. It offers global coverage with low latency (on the order of minutes to hours), supplying attributes such as fire radiative power, confidence, and acquisition time. FIRMS is widely used for wildfire monitoring, agricultural burning, forest management, and as a proxy input for air-quality and smoke dispersion modelling.
 
Great for: pinpointing fire hotspots that drive AQ spikes, tracking plume sources and fire-line progression, monitoring crop-residue/forest burns, and triggering rapid response. Quick access via CSV/GeoJSON/Shapefile and map tiles/API, with 24–72 h rolling feeds and full archives for seasonal analysis. 

How to register and get API access 

  1. Create a free NASA Earthdata Login account at: 
    https://urs.earthdata.nasa.gov 
  2. Confirm your email and sign in with your new credentials. 
  3. Go to the FIRMS site you intend to use, for example the active fire data page at https://firms.modaps.eosdis.nasa.gov/active_fire/. 
  4. Click Login (top right) and authenticate with your Earthdata username and password. Once logged in, you can: 
  • customise map views and download options from the web interface, and 
  • generate or use FIRMS Web Services/API URLs that honour your authenticated session. 
  5. For scripted access, you can call the FIRMS download or web service endpoints (e.g., GeoJSON, CSV) using standard HTTP tools (e.g., curl, requests in Python). If an endpoint requires authentication, supply your Earthdata credentials via a .netrc file or session cookies, as you would for other Earthdata services. 

In practice, FIRMS is a convenient way to pull recent fire locations into an air-quality workflow: you can fetch daily or hourly fire detections for a region, convert them to a GeoDataFrame, and then intersect with wind fields, population grids, or sensor networks to understand potential smoke impacts.

#FIRMS  
!pip install geopandas rtree shapely 
import pandas as pd 
import geopandas as gpd 
from shapely.geometry import Point 
import numpy as np 
import matplotlib.pyplot as plt 
import rtree 

# get boundaries of Thailand 
boundaries_country = gpd.read_file(f'https://github.com/wmgeolab/geoBoundaries/raw/fcccfab7523d4d5e55dfc7f63c166df918119fd1/releaseData/gbOpen/THA/ADM0/geoBoundaries-THA-ADM0.geojson') 
boundaries_country.plot() 

# Real time data source: https://firms.modaps.eosdis.nasa.gov/active_fire/ 
# Past 7 days links: 
modis_7d_url= "https://firms.modaps.eosdis.nasa.gov/data/active_fire/modis-c6.1/csv/MODIS_C6_1_SouthEast_Asia_7d.csv" 
suomi_7d_url= "https://firms.modaps.eosdis.nasa.gov/data/active_fire/suomi-npp-viirs-c2/csv/SUOMI_VIIRS_C2_SouthEast_Asia_7d.csv" 
j1_7d_url= "https://firms.modaps.eosdis.nasa.gov/data/active_fire/noaa-20-viirs-c2/csv/J1_VIIRS_C2_SouthEast_Asia_7d.csv" 
j2_7d_url="https://firms.modaps.eosdis.nasa.gov/data/active_fire/noaa-21-viirs-c2/csv/J2_VIIRS_C2_SouthEast_Asia_7d.csv" 
urls = [modis_7d_url, suomi_7d_url, j1_7d_url, j2_7d_url] 

# Create an empty GeoDataFrame to store the combined data 
gdf = gpd.GeoDataFrame() 

for url in urls: 
    df = pd.read_csv(url) 

    # Create a geometry column from latitude and longitude 
    geometry = [Point(xy) for xy in zip(df['longitude'], df['latitude'])] 
    gdf_temp = gpd.GeoDataFrame(df, crs="EPSG:4326", geometry=geometry)
     
    # Concatenate the temporary GeoDataFrame to the main GeoDataFrame 
    gdf = pd.concat([gdf, gdf_temp], ignore_index=True) 

# Filter to keep only fires within the country boundaries 
gdf = gpd.sjoin(gdf, boundaries_country, how="inner", predicate="within") 

# Display fires on map  
frp = gdf["frp"].astype(float) 
fig, ax = plt.subplots(figsize=(9,9)) 
boundaries_country.plot(ax=ax, facecolor="none", edgecolor="0.3", linewidth=0.8) 
gdf.plot(ax=ax, markersize=frp, color="crimson", alpha=0.55) 
ax.set_title("Fires inside country boundaries (bubble size = Fire Radiative Power )") 
ax.set_axis_off() 
plt.show() 

Data types you will meet (and how to read them right) 

Air-quality work rarely lives in a single, tidy CSV, so it helps to know the file types you will meet. You will move between multidimensional model outputs (NetCDF/GRIB/Zarr), satellite rasters (COG/GeoTIFF), point measurements (CSV/Parquet/GeoParquet), and web-friendly formats (JSON/GeoJSON), often in the same notebook.

This section is a quick field guide to those formats and how to open them without getting stuck.

There is no need to memorise any of this; feel free to skim the list once, then come back when you hit an unfamiliar file extension in the wild.

  1. NetCDF4 / HDF5 (self-describing scientific arrays): Widely used for reanalyses, satellite products, and models. Rich metadata, multi-dimensional (time, level, lat, lon). Usual extensions: .nc, .nc4, .h5, .hdf5 

 

# pip install xarray netCDF4 

import xarray as xr 
ds = xr.open_dataset("modis_aod_2025.nc") 
ds = ds.sel(time=slice("2025-08-01","2025-08-07")) 
print(ds) 
  2. Cloud-Optimised GeoTIFF (COG): Raster format tuned for HTTP range requests (stream just what you need). Common for satellite imagery and gridded products. Usual extensions: .tif, .tiff 

  

# pip install rasterio 

import rasterio 
with rasterio.open("https://example-bucket/no2_mean_2025.tif") as src: 
    window = rasterio.windows.from_bounds(*(-0.3,5.5,0.3,5.9), src.transform) 
    arr = src.read(1, window=window)
  3. JSON (nested) & GeoJSON (features + geometry): Great for APIs and lightweight geospatial data. GeoJSON uses WGS84 (EPSG:4326) by default. Usual extensions: .json, .jsonl, .ndjson, .geojson, .geojsonl, .ndgeojson 

 

# pip install geopandas 

import geopandas as gpd 
gdf = gpd.read_file("points.geojson")  # columns + geometry 
gdf = gdf.set_crs(4326)                # ensure WGS84 
  4. GRIB2 (meteorology, model outputs): Compact, tiled; often used by CAMS/ECMWF/NWP. Usual extensions: .grib2, .grb2, .grib, .grb. In practice, data providers often add compression suffixes too, e.g. .grib2.gz or .grb2.bz2. 

 

# pip install xarray cfgrib 

import xarray as xr 
ds = xr.open_dataset("cams_ozone.grib", engine="cfgrib") 
  1. Parquet & GeoParquet (columnar, compressed): Best for large tables: fast column selection, predicate pushdown, partitioning (e.g., by date/city). GeoParquet adds a typical for geometries. Usual extensions: .parquet, .parquet.gz 

 

# pip install pandas pyarrow geopandas geoparquet 

import pandas as pd, geopandas as gpd 
df = pd.read_parquet("openaq_accra_2025.parquet")   # fast columnar read; select columns as needed 

# Convert a GeoDataFrame -> GeoParquet 
gdf = gpd.read_file("points.geojson") 
gdf.to_parquet("points.geoparquet")  # preserves geometry & CRS 
  6. CSV/TSV (text tables): Simple, universal. Weak at large scale (slow I/O, no schema), no geometry. Usual extensions: .csv, .tsv (also sometimes .tab, less common)

 

# pip install pandas 

import pandas as pd
df = pd.read_csv("measurements.csv", parse_dates=["datetime"], dtype={"site_id":"string"}) 
  7. Zarr (chunked, cloud-native): Ideal for analysis in the cloud with parallel reads (works great with Dask). Usual extension: .zarr (often a directory/store ending in .zarr; occasionally packaged as .zarr.zip) 

  

# pip install xarray zarr s3fs 

import xarray as xr
ds = xr.open_zarr("s3://bucket/cams_eac4_2025.zarr", consolidated=True) 

Note: Shapefile (legacy vector): works, but brittle (many files, 10-character field-name limit). It is a legacy format; prefer alternatives like GeoPackage or GeoParquet. 
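
For completeness, a minimal sketch converting the GeoJSON from earlier into a single-file GeoPackage with GeoPandas:

# pip install geopandas
import geopandas as gpd
gdf = gpd.read_file("points.geojson")
gdf.to_file("points.gpkg", layer="points", driver="GPKG")  # single file, preserves geometry & CRS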

Choosing the right geospatial (or scientific) file format matters because it is not only a storage decision: it directly affects how quickly you can read data, tool compatibility, how easily you can share it, and how well it scales from a desktop workflow to cloud-native processing. The following table (Table 1) provides a practical “format-to-task” cheat sheet: for each common need (from quick API dumps to cloud-scale arrays and web mapping), it lists the most suitable format, the extensions you will typically encounter, and the core reason that format is a good fit. It can serve as a default starting point when designing pipelines, publishing datasets, or choosing what to download from an external repository.

| Need | Best bet | Usual extensions | Why |
| --- | --- | --- | --- |
| Human-readable logs or quick API dumps | CSV/JSON | .csv, .json (also .jsonl, .ndjson) | Ubiquitous, easy to inspect |
| Big tables (millions of rows) | Parquet/GeoParquet | .parquet | Fast scans, column pruning, and partitioning |
| Large rasters over HTTP | COG | .tif, .tiff | Range requests; no full download |
| Multi-dimensional scientific data | NetCDF4/HDF5 | .nc, .nc4, .h5, .hdf5 | Self-describing, units/attrs |
| Meteorological model outputs | GRIB2 | .grib2, .grb2, .grib, .grb | Compact, widely supported by weather tools |
| Cloud-scale arrays | Zarr | .zarr | Chunked + parallel; cloud-native |
| Exchangeable vector file | GeoPackage | .gpkg | Single file; robust |
| Web mapping geometries | GeoJSON | .geojson, .geojsonl, .ndgeojson | Simple; native to web stacks |

Table 1: Picking the right format for the job 

Tip: An interesting talk on STAC and data types (especially GeoParquet): https://github.com/GSA/gtcop-wiki/wiki/June-2025:-GeoParquet,-Iceberg-and-Cloud%E2%80%90Native-Spatial-Data-Infrastructures

Multiple open STAC catalogues are now available, including public endpoints for optical, radar, and atmospheric products (for example, Landsat and Sentinel imagery via providers such as Element 84’s Earth Search or Microsoft’s Planetary Computer). STAC makes it much easier to script “find and download all scenes for this polygon and time range” and to integrate different datasets into the same workflow.

Conclusion: from “where” the data lives to “how” you use it 

Figure 3: Creating exposure maps from hotspots  © UNICEF/UNI724381/Kongchan Phi. All rights reserved. 

Air for Tomorrow: we began with a question on a school road in Lao PDR: what are these children breathing today? This post provides a practical path and tools to help you answer it. You now know where open air-quality data resides: regulatory networks, community sensors, satellite measurements, and reanalyses. You also understand what those file formats are (GeoJSON, Parquet/GeoParquet, NetCDF/HDF5, COG, GRIB, Zarr) and how to retrieve them with compact, reproducible snippets. The goal goes beyond just downloading data; it is to make defensible, fast, and shareable analyses that hold up tomorrow.

You can assemble a credible local picture in hours, not weeks. From fire hotspots (Figure 2) to school-route exposure (Figure 1), you can create exposure maps (Figure 3).

Up next: we will showcase a real air quality model developed at the UNICEF Lao PDR Country Office with UNICEF EAPRO’s Frontier Data Team, walking through an open, end-to-end model pipeline. Where ground-level air quality data streams are available, we will cover how feature engineering, bias correction, normalisation, and a model can be developed into an actionable surface that a regional office can use tomorrow morning.

Contributors: Prithviraj Pramanik, AQAI; Hugo Ruiz Verastegui, Anthony Mockler, Judith Hanan, Frontier Data Lab; Risdianto Irawan, UNICEF EAPRO; Soheib Abdalla, Andrew Dunbrack, UNICEF Lao PDR Country Office; Halim Jun, Daniel Alvarez, Shane O’Connor, UNICEF Office of Innovation.
