Databricks has shaken the data market once more. The company launched the free edition of the Databricks platform, and it's an incredible resource for learning and testing, to say the least.
With that in mind, I created an end-to-end project to help you learn the basics of the main resources inside Databricks.
This project demonstrates a complete Extract, Transform, Load (ETL) workflow inside Databricks. It integrates the OpenWeatherMap API for data retrieval and the OpenAI GPT-4o-mini model to provide personalized, weather-based dressing suggestions.
Let’s learn more about it.
The Project
The project implements a full data pipeline inside Databricks, following these steps:
- Extract: Fetches current weather data for New York City via the OpenWeatherMap API [1].
- Transform: Converts UTC timestamps to New York local time and uses OpenAI's [2] GPT-4o-mini to generate personalized dressing suggestions based on the temperature.
- Load: Persists the data into the Databricks Unity Catalog as both raw JSON files and a structured Delta table (Silver Layer).
- Orchestration: The notebook with this ETL code is added to a job and scheduled to run every hour in Databricks.
- Analytics: The silver layer feeds a Databricks Dashboard that displays relevant weather information alongside the LLM’s suggestions.
Here is the architecture.
Great. Now that we understand what we want to do, let's move on to the hands-on part of this tutorial.
Note: if you still don't have a Databricks account, go to the Databricks Free Edition page [3] and follow the prompts on screen to get your free access.
Extract: Integrating API And Databricks
As I often say, a data project needs data to start, right? So our task here is to integrate the OpenWeatherMap API and ingest data directly into a PySpark notebook inside Databricks. This task may look complicated at first, but trust me, it is not.
On Databricks' initial page, create a new notebook using the + New button, then select Notebook.

For the Extract part, we’ll need:
1. The API key from OpenWeatherMap.
To get that, go to the API's signup page and complete the free registration process. Once logged in to the dashboard, click on the API keys tab, where you'll be able to see it.
2. Import packages
# Imports
import requests
import json
Next, we’re going to create a Python class to modularize our code and make it production-ready as well.
- This class receives the API_KEY we just created, as well as the city and country for the weather fetch.
- Returns the response in JSON format.
# Creating a class to modularize our code
class Weather:

    # Define the constructor
    def __init__(self, API_KEY):
        self.API_KEY = API_KEY

    # Define a method to retrieve weather data
    def get_weather(self, city, country, units='imperial'):
        self.city = city
        self.country = country
        self.units = units

        # Make a GET request to an API endpoint that returns JSON data
        url = f"https://api.openweathermap.org/data/2.5/weather?q={city},{country}&APPID={self.API_KEY}&units={units}"
        response = requests.get(url)

        # Raise an error for failed requests; otherwise parse the response as JSON and return it
        if response.status_code != 200:
            raise Exception(f"Error: {response.status_code} - {response.text}")
        return response.json()
Nice. Now we can run this class. Notice that we use dbutils.widgets.get(). This command reads the Parameters defined in the scheduled job, which we'll see later in this article. It's a best practice to keep the secrets secure.
# Get the API OpenWeatherMap key
API_KEY = dbutils.widgets.get('API_KEY')
# Instantiate the class
w = Weather(API_KEY=API_KEY)
# Get the weather data
nyc = w.get_weather(city='New York', country='US')
nyc
Here is the response.
{'coord': {'lon': -74.006, 'lat': 40.7143},
'weather': [{'id': 804,
'main': 'Clouds',
'description': 'overcast clouds',
'icon': '04d'}],
'base': 'stations',
'main': {'temp': 54.14,
'feels_like': 53.44,
'temp_min': 51.76,
'temp_max': 56.26,
'pressure': 992,
'humidity': 89,
'sea_level': 992,
'grnd_level': 993},
'visibility': 10000,
'wind': {'speed': 21.85, 'deg': 270, 'gust': 37.98},
'clouds': {'all': 100},
'dt': 1766161441,
'sys': {'type': 1,
'id': 4610,
'country': 'US',
'sunrise': 1766146541,
'sunset': 1766179850},
'timezone': -18000,
'id': 5128581,
'name': 'New York',
'cod': 200}
With that response in hand, we can move on to the Transformation part of our project, where we'll clean and transform the data.
Transform: Formatting The Data
In this section, we'll look at the cleaning and transformation tasks performed on the raw data. We'll start by selecting the pieces of information needed for our dashboard. This is simply a matter of getting data out of a dictionary (or a JSON).
# Getting information
id = nyc['id']
timestamp = nyc['dt']
weather = nyc['weather'][0]['main']
temp = nyc['main']['temp']
tmin = nyc['main']['temp_min']
tmax = nyc['main']['temp_max']
country = nyc['sys']['country']
city = nyc['name']
sunrise = nyc['sys']['sunrise']
sunset = nyc['sys']['sunset']
Next, let's convert the timestamps to the New York time zone, since the API returns them in UTC (Greenwich time).
# Transform sunrise and sunset to datetime in NYC timezone
from datetime import datetime, timezone
from zoneinfo import ZoneInfo
import time
# Timestamp, Sunrise and Sunset to NYC timezone
target_timezone = ZoneInfo("America/New_York")
dt_utc = datetime.fromtimestamp(sunrise, tz=timezone.utc)
sunrise_nyc = str(dt_utc.astimezone(target_timezone).time()) # get only the sunrise time
dt_utc = datetime.fromtimestamp(sunset, tz=timezone.utc)
sunset_nyc = str(dt_utc.astimezone(target_timezone).time()) # get only the sunset time
dt_utc = datetime.fromtimestamp(timestamp, tz=timezone.utc)
time_nyc = str(dt_utc.astimezone(target_timezone))
Finally, we format it as a Spark dataframe.
# Create a dataframe from the variables
df = spark.createDataFrame([[id, time_nyc, weather, temp, tmin, tmax, country, city, sunrise_nyc, sunset_nyc]], schema=['id', 'timestamp','weather', 'temp', 'tmin', 'tmax', 'country', 'city', 'sunrise', 'sunset'])

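A quick aside: passing only column names to createDataFrame, as above, lets Spark infer the column types from the Python values. If you prefer to pin the types down explicitly, you could pass a StructType schema instead; here is a minimal sketch of that optional variant (not part of the original pipeline):
# Optional: define an explicit schema instead of letting Spark infer the types
from pyspark.sql.types import StructType, StructField, LongType, StringType, DoubleType

explicit_schema = StructType([
    StructField('id', LongType()),
    StructField('timestamp', StringType()),
    StructField('weather', StringType()),
    StructField('temp', DoubleType()),
    StructField('tmin', DoubleType()),
    StructField('tmax', DoubleType()),
    StructField('country', StringType()),
    StructField('city', StringType()),
    StructField('sunrise', StringType()),
    StructField('sunset', StringType()),
])

df = spark.createDataFrame(
    [[id, time_nyc, weather, temp, tmin, tmax, country, city, sunrise_nyc, sunset_nyc]],
    schema=explicit_schema
)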
The final step in this section is adding the suggestion from an LLM. Here, we're going to pick some of the data fetched from the API and pass it to the model, asking it to return a suggestion of how a person could dress to be prepared for the weather.
- You will need an OpenAI API key.
- Pass the weather condition and the temperatures (weather, tmax, tmin).
- Ask the LLM to return a suggestion about how to dress for the weather.
- Add the suggestion to the final dataframe.
%pip install openai --quiet
from openai import OpenAI
import pyspark.sql.functions as F
from pyspark.sql.functions import col
# Get the OpenAI key
OPENAI_API_KEY = dbutils.widgets.get('OPENAI_API_KEY')

client = OpenAI(
    # This is the default and can be omitted
    api_key=OPENAI_API_KEY
)

response = client.responses.create(
    model="gpt-4o-mini",
    instructions="You are a weatherman that gives suggestions about how to dress based on the weather. Answer in a single sentence.",
    input=f"The weather is {weather}, with max temperature {tmax} and min temperature {tmin}. How should I dress?"
)
suggestion = response.output_text
# Add the suggestion to the df
df = df.withColumn('suggestion', F.lit(suggestion))
display(df)
Cool. We're almost done with the ETL. Now it's all about loading the data. That's the next section.
Load: Saving the Data and Creating the Silver Layer
The last piece of the ETL is loading the data. We'll do it in two different ways.
- Persisting the raw files in a Unity Catalog Volume.
- Saving the transformed dataframe directly into the silver layer, which is a Delta Table ready for Dashboard consumption.
Let's create a Catalog that will hold all of the weather data we get from the API.
-- Creating a Catalog
CREATE CATALOG IF NOT EXISTS pipeline_weather
COMMENT 'This is the catalog for the weather pipeline';
Next, we create a Schema for the Lakehouse. This one will store the volume with the raw JSON files we fetch.
-- Creating a Schema
CREATE SCHEMA IF NOT EXISTS pipeline_weather.lakehouse
COMMENT 'This is the schema for the weather pipeline';
Now, we create the volume for the raw files.
-- Let's create a volume
CREATE VOLUME IF NOT EXISTS pipeline_weather.lakehouse.raw_data
COMMENT 'This is the raw data volume for the weather pipeline';
We also create another Schema to hold the Silver Layer Delta Table.
-- Creating a Schema to hold the transformed data
CREATE SCHEMA IF NOT EXISTS pipeline_weather.silver
COMMENT 'This is the schema for the weather pipeline';
Once we have everything set up, this is how our Catalog looks.

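If you prefer to confirm the setup from the notebook rather than the Catalog Explorer, a couple of SHOW statements work as a quick optional check (a small sketch, assuming the names used above):
# Optional sanity check: list the schemas and the raw-data volume we just created
display(spark.sql("SHOW SCHEMAS IN pipeline_weather"))
display(spark.sql("SHOW VOLUMES IN pipeline_weather.lakehouse"))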
Now, let's save the raw JSON response into our Raw Volume. To keep everything organized and prevent overwriting, we'll attach a unique timestamp to each filename.
By appending these files to the volume rather than simply overwriting them, we're creating a reliable "audit trail". This acts as a safety net: if a downstream process fails or we run into data loss later, we can always go back to the source and re-process the original data whenever we need it.
# Get timestamp
stamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
# Path to save the file
json_path = f'/Volumes/pipeline_weather/lakehouse/raw_data/weather_{stamp}.json'
# Save the data into a JSON file
df.write.mode('append').json(json_path)
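Note that the snippet above writes the transformed Spark dataframe as JSON files under that path. If you also want to keep the untouched API response, Unity Catalog volumes accept plain Python file I/O as well; here is an optional sketch (the raw_ filename prefix is just an illustrative choice, not part of the original pipeline):
# Optional: persist the untouched API response (the `nyc` dict) as a plain JSON file
raw_path = f'/Volumes/pipeline_weather/lakehouse/raw_data/raw_{stamp}.json'
with open(raw_path, 'w') as f:
    json.dump(nyc, f)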
While we keep the raw JSON as our "source of truth," saving the cleaned data into a Delta Table in the Silver layer is where the real magic happens. By using .mode("append") and the Delta format, we ensure our data is structured, schema-enforced, and ready for high-speed analytics or BI tools. This layer turns messy API responses into a reliable, queryable table that grows with every pipeline run.
# Save the transformed data into a Delta table in the silver schema
(
df
.write
.format('delta')
.mode("append")
.saveAsTable('pipeline_weather.silver.weather')
)
Beautiful! With all of this set up, let's check how our table looks now.

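You can also peek at the table straight from the notebook; for example, pulling the most recent rows (a quick optional check):
# Optional: inspect the latest rows of the Silver table
from pyspark.sql.functions import col  # already imported earlier

display(
    spark.table('pipeline_weather.silver.weather')
    .orderBy(col('timestamp').desc())
    .limit(5)
)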
Let’s start automating this pipeline now.
Orchestration: Scheduling the Notebook to Run Automatically
Moving on with the project, it's time to make this pipeline run on its own, with minimal supervision. For that, Databricks has the Jobs & Pipelines tab, where we can easily schedule jobs to run.
- Click the Jobs & Pipelines tab on the left panel.
- Find the Create button and choose Job.
- Click on Notebook to add it to the Job.
- Configure it as shown below.
- Add the API Keys to the Parameters.
- Click Create task.
- Click Run Now to test if it works.


Once you click the Run Now button, it will start running the notebook and display a status message.

If the job is running fine, it's time to schedule it to run automatically (a programmatic alternative using the Jobs API follows the steps below).
- Click on Add trigger on the right side of the screen, under the Schedules & Triggers section.
- Trigger type = Scheduled.
- Schedule type: select Simple.
- Select Every 1 hour from the drop-downs.
- Save it.
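The UI flow above is all you need, but if you prefer configuring things as code, a similar job could be created through the Databricks Jobs API. The sketch below is a hedged example against Jobs API 2.1: the workspace URL, token, and notebook path are placeholders, no cluster is specified (so serverless compute is used), and in practice you would keep the API keys in a secret scope rather than plain job parameters.
import requests

DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder
DATABRICKS_TOKEN = "<personal-access-token>"                       # placeholder

job_spec = {
    "name": "weather-pipeline-hourly",
    "tasks": [
        {
            "task_key": "weather_etl",
            "notebook_task": {
                "notebook_path": "/Workspace/Users/<you>/weather_etl",  # placeholder
                "base_parameters": {
                    "API_KEY": "<openweathermap-key>",
                    "OPENAI_API_KEY": "<openai-key>",
                },
            },
        }
    ],
    # Run at the top of every hour, New York time (Quartz cron: sec min hour dom month dow)
    "schedule": {
        "quartz_cron_expression": "0 0 * * * ?",
        "timezone_id": "America/New_York",
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # returns the new job_id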
Excellent. Our pipeline is on auto-mode now! Every hour, the system will hit the OpenWeatherMap API, get fresh weather information for NYC, and save it to our Silver Layer table.
Analytics: Building a Dashboard for Data-Driven Decisions
The last piece of this puzzle is creating the Analytics deliverable, which will show the weather information and provide the user with actionable details about how to dress for the weather outside.
- Click on Dashboards on the left side panel.
- Click on the Create dashboard button.
- It will open a blank canvas for us to work on.

Dashboards work based on data fetched from SQL queries. Therefore, before we start adding text and graphics to the canvas, we first need to create some metrics that will serve as the variables feeding the dashboard cards and graphics.
So, click on the +Create from SQL button to start a metric and give it a name. For example, to retrieve the latest fetched city name, I used the query that follows.
-- Get the latest city name fetched
SELECT city
FROM pipeline_weather.silver.weather
ORDER BY timestamp DESC
LIMIT 1
We must create one SQL query for each metric. You can see all of them in the GitHub repository [4].
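Before wiring each metric into a dashboard widget, you can sanity-check the query from the notebook; a small optional step:
# Optional: run the metric query in the notebook before adding it to the dashboard
latest_city = spark.sql("""
    SELECT city
    FROM pipeline_weather.silver.weather
    ORDER BY timestamp DESC
    LIMIT 1
""").first()['city']

print(latest_city)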
Next, we click on the Dashboard tab and start dragging and dropping elements onto the canvas.

When you click on the Text element, it lets you insert a box into the canvas and edit the text. When you click on the graphic element, it inserts a placeholder for a graphic and opens the right-side menu for selecting the variables and configuration.

Okay. After all the elements are added, the dashboard will look like this.

So nice! And that concludes our project.
Before You Go
You can easily replicate this project in about an hour, depending on your experience with the Databricks ecosystem. While it's a quick build, it packs a lot in terms of the core engineering skills you'll get to exercise:
- Architectural Design: You'll learn how to structure a modern Lakehouse environment from the ground up.
- Seamless Data Integration: You'll bridge the gap between external web APIs and the Databricks platform for real-time data ingestion.
- Clean, Modular Code: We move beyond simple scripts by using Python classes and functions to keep the codebase organized and maintainable.
- Automation & Orchestration: You'll get hands-on experience scheduling jobs to ensure your project runs reliably on autopilot.
- Delivering Real Value: The goal isn't just to move data; it's to provide value. By transforming raw weather metrics into actionable dressing suggestions via AI, we turn "cold data" into a helpful service for the end user.
If you liked this content, you can find my contacts and more about me on my website.
GitHub Repository
Here is the repository for this project.
https://github.com/gurezende/Databricks-Weather-Pipeline
References
[1] OpenWeatherMap API: https://openweathermap.org/
[2] OpenAI Platform: https://platform.openai.com/
[3] Databricks Free Edition: https://www.databricks.com/learn/free-edition
[4] GitHub Repository: https://github.com/gurezende/Databricks-Weather-Pipeline
