Designing, Building & Deploying an AI Chat App from Scratch (Part 1)


The aim of this project is to learn about the fundamentals of modern, scalable web applications by designing, building and deploying an AI-powered chat app from scratch. We won’t use fancy frameworks or commercial platforms like ChatGPT. This gives a much better understanding of how real-world systems work under the hood, and gives us full control over the language model, infrastructure, data and costs. The focus will be on engineering, backend and cloud deployment, rather than the language model or a fancy frontend.

This is part 1. We’ll design and build a cloud-native app with several APIs, a database, a private network, a reverse proxy, and a simple user interface with sessions. Everything runs on our local computer. In part 2, we’ll deploy our application to a cloud platform like AWS, GCP or Azure with a focus on scalability so real users can reach it over the web.

A quick demo of the app. We start a new chat, come back to that same chat, and start another chat. We’ll now build this app locally and make it available at localhost.

You can find the codebase at https://github.com/jsbaan/ai-app-from-scratch. Throughout this post I’ll link to specific lines of code with this hyperlink robot 🤖 (try it!)

Modern web applications are often built using microservices — small, independent software components with a specific role. Each service runs in its own Docker container — an isolated environment independent of the underlying operating system and hardware. Services communicate with one another over a network using REST APIs.

You can think of a REST API as the interface that defines how to interact with a service by defining endpoints — specific URLs that represent the possible resources or actions, formatted like http://hostname:port/endpoint-name. Endpoints, also called paths or routes, are accessed with HTTP requests, which come in various types like GET to retrieve data or POST to create data. Parameters can be passed in the URL itself or in the request body or headers.
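
As a quick illustration, here is a minimal sketch of both request types using Python’s httpx library (which we’ll also use later in the UI service). The URL and parameters are made up for illustration and don’t correspond to a real service:

import httpx

# GET request: retrieve a resource, with a parameter embedded in the URL path
response = httpx.get("http://localhost:8000/items/42")
print(response.status_code, response.json())

# POST request: create a resource, with parameters passed in the JSON request body
response = httpx.post("http://localhost:8000/items", json={"name": "example"})
print(response.json())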

Let’s make this more concrete. We want a web page where users can chat with a language model and come back to their previous chats. Our architecture will look like this:

Local architecture of the app. Each service runs in its own Docker container and communicates over a private network. Made by the author in draw.io.

The above architecture diagram shows how a user’s HTTP request to localhost on the left flows through the system. We’ll discuss and set up each individual service, starting with the backend services on the right. Finally, we discuss communication, networking and container orchestration.

The structure of this post follows the components in our architecture (click to jump to the section):

  1. Language model API. A llama.cpp language model inference server running the quantized Qwen2.5-0.5B-Instruct model 🤖.
  2. PostgreSQL database server. A database that stores chats and messages 🤖.
  3. Database API. A FastAPI and Uvicorn Python server that queries the PostgreSQL database 🤖.
  4. User interface. A FastAPI and Uvicorn Python server that serves HTML and supports session-based authentication 🤖.
  5. Private Docker network. For communication between microservices 🤖.
  6. Nginx reverse proxy. A gateway between the outside world and the network-isolated services 🤖.
  7. Docker Compose. A container orchestration tool to easily run and manage our services together 🤖.

Setting up the actual language model is pretty easy, nicely demonstrating that ML engineering is often more about engineering than ML. Since I want our app to run on a laptop, model inference should be fast and CPU-based with a low memory footprint.

I looked at several inference engines, like FastChat with vLLM or Hugging Face TGI, but went with llama.cpp because it’s popular, fast, lightweight and supports CPU-based inference. Llama.cpp is written in C/C++ and conveniently provides a Docker image with its inference engine and a simple web server that implements the popular OpenAI API specification. It comes with a basic UI for experimenting, but we’ll build our own UI shortly.

As for the actual language model, I chose the quantized Qwen2.5-0.5B-Instruct model from Alibaba Cloud, whose responses are surprisingly coherent given how small it is.

The beauty of containerized applications is that, given a Docker image, we can have them running in seconds without installing any packages. The docker run command below pulls the llama.cpp server image, mounts the model file that we downloaded earlier into the container’s filesystem, and runs a container with the llama.cpp server listening for HTTP requests on port 80. It uses flash attention and has a maximum generation length of 512 tokens.

# --publish 8000:80 makes the API accessible on localhost
docker run \
  --name lm-api \
  --volume $PROJECT_PATH/lm-api/gguf_models:/models \
  --publish 8000:80 \
  ghcr.io/ggerganov/llama.cpp:server \
  -m /models/qwen2-0_5b-instruct-q5_k_m.gguf --port 80 --host 0.0.0.0 --predict 512 --flash-attn

Eventually we’ll use Docker Compose to run this container together with the others 🤖.

Since Docker containers are completely isolated from everything else on their host machine, i.e., our computer, we can’t reach our language model API yet.

However, we can break through a bit of that network isolation by publishing the container’s port 80 to our host machine’s port 8000 with --publish 8000:80 in the docker run command. This makes the llama.cpp server available at http://localhost:8000.

The hostname localhost resolves to the loopback IP address 127.0.0.1 and is part of the loopback network interface that allows a computer to communicate with itself. When we visit http://localhost:8000, our browser sends an HTTP GET request to our own computer on port 8000, which gets forwarded to the llama.cpp container listening on port 80.

Let’s test the language model server by sending a POST request with a short chat history.

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "assistant", "content": "Hello, how can I assist you today?"},
      {"role": "user", "content": "Hi, what is an API?"}
    ],
    "max_tokens": 10
  }'

The response is JSON and the generated text is under choices[0].message.content: “An API (Application Programming Interface) is a specification…”.

Perfect! Eventually, our UI service will be the one sending requests to the language model API, and it defines the system prompt and opening message 🤖.

Next, let’s look into storing chats and messages. PostgreSQL is a powerful, open-source relational database, and running a PostgreSQL server locally is just another docker run command using its official image. We’ll pass some extra environment variables to configure the username and password.

docker run --name db --publish 5432:5432 --env POSTGRES_USER=myuser --env POSTGRES_PASSWORD=mypassword postgres

After publishing port 5432, the database server is accessible at localhost:5432. PostgreSQL uses its own protocol for communication and doesn’t understand HTTP requests. We can use a PostgreSQL client tool like pg_isready to test the connection.

pg_isready -U myuser -h localhost -d postgres
> localhost:5432 - accepting connections

When we deploy our application in part 2, we’ll use a database managed by a cloud provider to make our lives easier and add more security, reliability and scalability. However, setting one up like this is useful for local development and, perhaps later on, integration tests.

Databases often have a separate API server sitting in front of them to manage access, enforce extra security, and provide a simple, standardized interface that abstracts away the database’s complexity.

We’ll build this API from scratch with FastAPI, a modern framework for building fast, production-ready Python APIs. We’ll run the API with Uvicorn, a high-performance Python web server that handles things like network communication and simultaneous requests.

Let’s quickly get a feel for FastAPI by looking at a minimal example app with a single GET endpoint /hello.

from fastapi import FastAPI

# FastAPI app object that the Uvicorn web server will load and serve
my_app = FastAPI()

# Decorator telling FastAPI that the function below handles GET requests to /hello
@my_app.get("/hello")
def read_hello():
    # Define this endpoint's response
    return {"Hello": "World"}

We can serve our app at http://localhost:8080 by running the Uvicorn server.

uvicorn main:my_app --host 0.0.0.0 --port 8080

If we now send a GET request to our endpoint by visiting http://localhost:8080/hello in our browser, we receive the JSON response {"Hello": "World"}!

On to the actual database API. We define four endpoints in main.py 🤖 for creating or fetching chats and messages. You get a nice visual summary of them in the auto-generated docs, see below. The UI will call these endpoints to process user data.

A cool feature of FastAPI is that it automatically generates interactive documentation according to the OpenAPI specification using Swagger. If the Uvicorn server is running, we can find it at http://hostname:port/docs. Below is a screenshot of the docs page.

The first thing we need to do is connect the database API to the database server. We use SQLAlchemy, a popular Python SQL toolkit and Object-Relational Mapper (ORM) that abstracts away writing manual SQL queries.

We establish this connection in database.py 🤖 by creating the SQLAlchemy engine with a connection URL that includes the database hostname, username and password (remember, we configured these by passing them as environment variables to the PostgreSQL server). We also create a session factory that creates a new database session for each request to the database API.
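
A minimal sketch of what that looks like, assuming the environment variable names we pass to the container later on (the actual database.py 🤖 may differ slightly):

import os

from sqlalchemy import create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

# Build the connection URL from the same credentials we passed to the PostgreSQL container
DATABASE_URL = (
    f"postgresql://{os.environ['POSTGRES_USERNAME']}:{os.environ['POSTGRES_PASSWORD']}"
    f"@{os.environ['POSTGRES_HOST']}/postgres"
)

# The engine manages the actual connections to the database server
engine = create_engine(DATABASE_URL)

# Session factory: each request to the database API gets its own short-lived session
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)

# Base class that our SQLAlchemy models (Chat, Message) will inherit from
Base = declarative_base()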

Now let’s design our database. We define two SQLAlchemy data models in models.py 🤖 that will be mapped to actual database tables. The first is a Message model 🤖 with an id, content, speaker role, owner_id, and session_id (more on this later). The second is a Chat model 🤖, which I’ll show here to get a better feeling for SQLAlchemy models:

class Chat(Base):
    __tablename__ = "chats"

    # Unique identifier for each chat, generated automatically.
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)

    # Username associated with the chat. An index is created for faster lookups.
    username = Column(String, index=True)

    # Session ID associated with the chat.
    # Used to "scope" chats, i.e., users can only access chats from their session.
    session_id = Column(String, index=True)

    # The relationship() function links the Chat model to the Message model.
    # The back_populates flag creates a bidirectional relationship.
    messages = relationship("Message", back_populates="owner")

Database tables are typically created using migration tools like Alembic, but we’ll simply ask SQLAlchemy to create them in main.py 🤖.
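
In main.py this boils down to a single call (a sketch; the import paths are assumptions):

from app import models
from app.database import engine

# Create the tables for all models that inherit from Base (a no-op if they already exist)
models.Base.metadata.create_all(bind=engine)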

Next, we define the CRUD (Create, Read, Update, Delete) methods in crud.py 🤖. These methods use a fresh database session from our factory to query the database and create new rows in our tables. The endpoints in main.py will import and use these CRUD methods 🤖.
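
For instance, a create_chat method might look roughly like this (a sketch, not the exact code from crud.py 🤖; it ignores the chat’s initial messages):

from sqlalchemy.orm import Session

from app import models, schemas


def create_chat(db: Session, chat: schemas.ChatCreate) -> models.Chat:
    # Map the validated request data onto a new SQLAlchemy Chat row
    db_chat = models.Chat(username=chat.username, session_id=chat.session_id)
    db.add(db_chat)
    db.commit()          # write the new row to the database
    db.refresh(db_chat)  # reload it to get the auto-generated id
    return db_chat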

FastAPI is heavily based on Python’s type annotations and the data validation library Pydantic. For each endpoint, we can define a request and response schema that specifies the input/output format we expect. Each request to or response from an endpoint is automatically validated, converted to the right data type, and included in our API’s automatically generated documentation. If something about a request or response is missing or wrong, an informative error is thrown.

We define the Pydantic schemas for the database API in schemas.py 🤖 and use them in the endpoint definitions in main.py 🤖, too. For example, this is the endpoint to create a new chat:

@app.post("/chats", response_model=schemas.Chat)
async def create_chat(chat: schemas.ChatCreate, db: Session = Depends(get_db)):
db_chat = crud.create_chat(db, chat)
return db_chat

We can see that it expects a ChatCreate request body and a Chat response body. FastAPI verifies and converts the request and response bodies according to these schemas 🤖:

class ChatCreate(BaseModel):
    username: str
    messages: List[MessageCreate] = []
    session_id: str


class Chat(ChatCreate):
    id: UUID
    messages: List[Message] = []

Note: our SQLAlchemy models for the database should not be confused with these Pydantic schemas for endpoint input/output validation.

We can serve the database API using Uvicorn, making it available at http://localhost:8001.

cd $PROJECT_PATH/db-api
uvicorn app.main:app --host 0.0.0.0 --port 8001

To run the Uvicorn server in its own Docker container, we create a Dockerfile 🤖 that specifies how to incrementally build the Docker image. We can then build the image and run the container, again making the database API available at http://localhost:8001 after publishing the container’s port 80 to host port 8001. We pass the database credentials and hostname as environment variables.

docker build --tag db-api-image $PROJECT_PATH/db-api
docker run --name db-api --publish 8001:80 --env POSTGRES_USERNAME= --env POSTGRES_PASSWORD= --env POSTGRES_HOST= db-api-image

With the backend in place, let’s build the frontend. A web interface typically consists of HTML for structure, CSS for styling and JavaScript for interactivity. Frameworks like React, Vue, and Angular use higher-level abstractions like JSX that are ultimately transformed into HTML, CSS, and JS files, bundled and served by a web server like Nginx.

Since I want to focus on the backend, I hacked together a simple UI with FastAPI. Instead of JSON responses, its endpoints now return HTML based on template files that are rendered by Jinja, a templating engine that replaces variables in the template with real data like chat messages.
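
A sketch of what such an endpoint can look like (the template file and variable names here are assumptions, not the exact code from the repo):

from fastapi import FastAPI, Request
from fastapi.templating import Jinja2Templates

app = FastAPI()
templates = Jinja2Templates(directory="app/templates")


@app.get("/chats/{chat_id}")
async def chat_page(request: Request, chat_id: str):
    # In the real app the messages come from the database API
    messages = []
    # Jinja replaces the placeholders in chat.html with the actual data
    return templates.TemplateResponse(
        "chat.html", {"request": request, "chat_id": chat_id, "messages": messages}
    )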

To handle user input and interact with the backend (e.g., retrieve chat history from the database API or generate a reply via the language model API), I’ve avoided JavaScript altogether by using HTML forms 🤖 that trigger internal POST endpoints. These endpoints then simply use Python’s httpx library to make HTTP requests 🤖.

Endpoints are defined in main.py 🤖, HTML templates are in the app/templates directory 🤖, and the static CSS file for styling the pages is in the app/static directory 🤖. FastAPI serves the CSS file at http://hostname/static/style.css so the browser can find it.

Screenshot of the UI’s interactive documentation.

The homepage allows users to enter their name to start or return to a chat 🤖. The submit button triggers a POST request to the internal /chats endpoint with the username as form parameter, which calls the database API to create a new chat and then redirects to the chat page 🤖.

The chat page calls the database API to retrieve the chat history 🤖. Users can then enter a message that triggers a POST request to the internal /generate/{chat_id} endpoint with the message as form parameter 🤖.

The generate endpoint calls the database API to add the user’s message to the chat history, and then the language model API with the full chat history to generate a reply 🤖. After adding the reply to the chat history, the endpoint redirects to the chat page, which again retrieves and displays the latest chat history. We send POST requests to the LM API using httpx, but we could also use a more standardized client package like LangChain to invoke its completion endpoint.
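
A rough sketch of that call, assuming llama.cpp’s OpenAI-compatible /v1/chat/completions endpoint (function and variable names are made up):

import httpx


async def generate_reply(chat_history: list[dict], lm_api_url: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            f"{lm_api_url}/v1/chat/completions",
            json={"messages": chat_history, "max_tokens": 512},
            timeout=60,
        )
    # The generated text sits in the first element of the "choices" list
    return response.json()["choices"][0]["message"]["content"]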

So far, all users can access all endpoints and all data. This means anyone can see your chat given your username or chat id. To remedy that, we’ll use session-based authentication and authorization.

We’ll store a first-party & GDPR-compliant signed session cookie in the user’s browser 🤖. This is just a signed, dict-like object in the request/response header. The user’s browser will send that session cookie with each request to our hostname, so that we can identify and verify a user and show them only their own chats.
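
In FastAPI this can be done with Starlette’s SessionMiddleware. A minimal sketch (the secret key would of course come from configuration, and the endpoint is just for illustration):

import uuid

from fastapi import FastAPI, Request
from starlette.middleware.sessions import SessionMiddleware

app = FastAPI()

# Signs the session cookie so users can't tamper with it (requires the itsdangerous package)
app.add_middleware(SessionMiddleware, secret_key="change-me")


@app.get("/")
async def homepage(request: Request):
    # Assign a random session id on the user's first visit; the browser
    # sends the signed cookie back with every subsequent request
    if "session_id" not in request.session:
        request.session["session_id"] = str(uuid.uuid4())
    return {"session_id": request.session["session_id"]}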

As an extra layer of security, we “scope” the database API such that each chat row and each message row in the database contains a session id. For each request to the database API, we include the current user’s session id in the request header and query the database with both the chat id (or username) AND that session id. This way, the database can only ever return chats for the current user with their unique session id 🤖.
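
In the database API’s CRUD code this amounts to filtering on both columns, roughly like this (a sketch; the exact query in crud.py 🤖 may differ):

from sqlalchemy.orm import Session

from app import models


def get_chat(db: Session, chat_id: str, session_id: str):
    # Only return the chat if it belongs to the requesting user's session
    return (
        db.query(models.Chat)
        .filter(models.Chat.id == chat_id, models.Chat.session_id == session_id)
        .first()
    )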

To run the UI in a Docker container, we follow the same recipe as for the database API, adding the hostnames of the database API and language model API as environment variables.

docker build --tag chat-ui-image $PROJECT_PATH/chat-ui
docker run --name chat-ui --publish 8002:80 --env LM_API_URL= --env DB_API_URL= chat-ui-image

How do we know the hostnames of the two APIs? We’ll look at networking and communication next.

Let’s zoom out and look at our architecture again. By now, we have four containers: the UI, DB API, LM API, and PostgreSQL database. What’s missing is the network, the reverse proxy and container orchestration.

Our app’s microservices architecture. Made by the author in draw.io.

Until now, we used our computer’s localhost loopback network to send requests to an individual container. This was possible because we published their ports to our localhost. However, for containers to communicate with each other, they must be connected to the same network and know each other’s hostname/IP address and port.

We’ll create a user-defined bridge Docker network that provides automatic DNS resolution. This means container names are resolved to the container’s dynamic IP address. The network also provides isolation, and therefore security: you have to be on the same network to be able to reach our containers.

docker network create --driver bridge chat-net

We connect all containers to it by adding --network chat-net to their docker run command. Now, the database API can reach the database at db:5432, and the UI can reach the database API at http://db-api and the language model API at http://lm-api. Port 80 is the default for HTTP requests, so we can omit it.

Now, how do we — the user — reach our network-isolated containers? During development we published the UI container’s port to our localhost, but in a realistic scenario you typically use a reverse proxy. This is another web server that acts as a gateway, forwarding HTTP requests to the containers in their private network, enforcing security and isolation.

Nginx is a web server often used as a reverse proxy. We can easily run it using its official Docker image. We also mount a configuration file 🤖 in which we specify how Nginx should route incoming requests. For instance, the simplest possible configuration forwards all requests (location /) from the Nginx container’s port 80 to the UI container at http://chat-ui.

events {}
http {
    server {
        listen 80;
        location / {
            proxy_pass http://chat-ui;
        }
    }
}

Since the Nginx container is in the same private network, we can’t reach it either. However, we can publish its port so it becomes the single access point to our entire app 🤖. A request to localhost now goes to the Nginx container, which forwards it to the UI and sends the UI’s response back to us.

docker run --network chat-net --publish 80:80 --volume $PROJECT_PATH/nginx.conf:/etc/nginx/nginx.conf nginx

In part 2 we’ll see that these gateway servers can also distribute incoming requests over copies of the same containers (load balancing) for scalability; enable secure HTTPS traffic; and do advanced routing and caching. We’ll use an Azure-managed reverse proxy rather than this Nginx container, but I think it’s very useful to understand how reverse proxies work and to set one up yourself. It can also be significantly cheaper compared to a managed reverse proxy.

Let’s put everything together. Throughout this post we manually pulled or built each image and ran its container. However, in the codebase I’m actually using Docker Compose: a tool designed to define, run and stop multi-container applications on a single host like our computer.

To use Docker Compose, we simply specify a compose.yml file 🤖 with build and run instructions for each service. A cool feature is that it automatically creates a user-defined bridge network to connect our services. Docker DNS will resolve the service names to container IP addresses.

Inside the project directory we can start all services with a single command:

docker compose up --build

That wraps it up! We built an AI-powered chat web application that runs on our local computer, learning about microservices, REST APIs, FastAPI, Docker (Compose), reverse proxies, PostgreSQL databases, SQLAlchemy, and llama.cpp. We built it with a cloud-native architecture in mind, so we can deploy the app without changing a single line of code.

We’ll discuss deployment in part 2 and cover Kubernetes, the industry-standard container orchestration tool for large-scale applications across multiple hosts; Azure Container Apps, a serverless platform that abstracts away some of Kubernetes’ complexities; and concepts like load balancing, horizontal scaling, HTTPS, etc.

There’s a lot we could do to improve this app. Here are some things I’d work on given more time.

Language model. We now use a very general instruction-tuned language model as virtual assistant. I originally started this project to have a “virtual representation of me” on my website for visitors to discuss my research with, based on my scientific publications. For such a use case, an important direction is to improve and tweak the language model output. Perhaps that’ll become a part 3 of this series in the future.

Frontend. Instead of a quick FastAPI UI, I’d build a proper frontend using something like React, Angular or Vue to allow things like streaming LM responses and dynamic views rather than reloading the page each time. A more lightweight alternative that I’d like to experiment with is htmx, a library that provides modern browser features directly from HTML rather than JavaScript. It would make implementing LM response streaming pretty straightforward, for example.

Reliability. To make the system more mature, I’d add unit and integration tests and a better database setup with a migration tool like Alembic.

Thanks to Dennis Ulmer, Bryan Eikema and David Stap for initial feedback and proofreading.

I used PyCharm’s Copilot plugin for code completion, and ChatGPT for a first version of the HTML and CSS template files. Towards the end, I started experimenting more with using it for debugging and sparring too, which proved surprisingly useful. For example, I used it to learn about Nginx configurations and session cookies in FastAPI. I didn’t use AI to write this post, though I did use ChatGPT to paraphrase a few badly-running sentences.

Here are some additional resources that I found useful during this project.
