This article is part of the federated learning series I’m doing, and if you just landed here, I’d recommend going through the first part, where we discussed how federated learning works at a high level. For a quick refresher, here is an interactive app that I created in a marimo notebook where you can perform local training, merge models using the Federated Averaging (FedAvg) algorithm and observe how the global model improves across federated rounds.
In this part, our focus will be on implementing the federated logic using the Flower framework.
What happens when models are trained on skewed datasets
In the first part, we discussed how federated learning was used for early COVID screening with Curial AI. If the model had been trained only on data from a single hospital, it would have learnt patterns specific to that hospital alone and generalised badly to out-of-distribution datasets. We know this in theory, but now let us put a number to it.
I’m borrowing an example from the Flower Labs course on DeepLearning.AI since it uses the familiar MNIST dataset, which makes the concept easier to understand without getting lost in details. This example makes it easy to see what happens when models are trained on biased local datasets. We then use the same setup to show how federated learning changes the outcome.
Splitting the Dataset
We start by taking the MNIST dataset and splitting it into three parts to represent data held by different clients, let’s say three different hospitals. Moreover, we remove certain digits from each split so that every client has incomplete data, as shown below. This is done to simulate real-world data silos.

As shown in the image above, client 1 never sees digits 1, 3 and 7. Similarly, client 2 never sees 2, 5 and 8, and client 3 never sees 4, 6 and 9. Even though all three datasets come from the same source, they represent fairly different distributions.
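A minimal sketch of how such splits might be created with torchvision is shown below. The even three-way split, the seed and the normalisation constants are my assumptions, not the course’s exact code.

import torch
from torchvision import datasets, transforms

# Standard MNIST normalisation stats (an assumption, not taken from the course)
transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
)
train_set = datasets.MNIST("./data", train=True, download=True, transform=transform)

# Digits withheld from each client, mirroring the splits described above
excluded = {0: [1, 3, 7], 1: [2, 5, 8], 2: [4, 6, 9]}

# Split the training set into three roughly equal parts, one per client
part_len = len(train_set) // 3
parts = torch.utils.data.random_split(
    train_set,
    [part_len, part_len, len(train_set) - 2 * part_len],
    generator=torch.Generator().manual_seed(42),
)

def remove_digits(part, digits):
    """Drop every example whose label is in `digits` from a Subset."""
    keep = [
        i
        for i, idx in enumerate(part.indices)
        if train_set.targets[idx].item() not in digits
    ]
    return torch.utils.data.Subset(part, keep)

client_splits = [remove_digits(p, excluded[c]) for c, p in enumerate(parts)]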
Training on Biased Data
Next, we train separate models on each dataset using the same architecture and training setup. We use a very simple neural network implemented in PyTorch with just two fully connected layers and train each model for 10 epochs.
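Here is a sketch of what that setup can look like; the hidden-layer width, learning rate and choice of SGD are assumptions on my part, and client_splits comes from the previous snippet.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader

class SimpleNet(nn.Module):
    """Two fully connected layers: 784 -> 128 -> 10 (hidden size is an assumption)."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)  # flatten the 28x28 images
        return self.fc2(F.relu(self.fc1(x)))

def train_local(model, loader, epochs=10, lr=0.01):
    """Plain supervised training on one client's local split."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

# Train one model per client split (client_splits comes from the previous sketch)
model_1 = SimpleNet()
train_local(model_1, DataLoader(client_splits[0], batch_size=32, shuffle=True))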

As can be seen from the loss curves above, the loss progressively goes down during training, which means the models are learning something. Remember, however, that each model is only learning from its own limited view of the data, and it is only when we test it on a held-out set that we will know the true accuracy.
Evaluating on Unseen Data
To test the models, we load the MNIST test dataset with the same normalization applied to the training data. When we evaluate these models on the entire test set (all 10 digits), accuracy lands around 65 to 70 percent, which seems reasonable given that three digits were missing from each training dataset. At least the accuracy is better than the random chance of 10%.
Next, we also evaluate how individual models perform on examples that weren’t represented in their training set. For that, we create three specific test subsets (a sketch for building and evaluating them follows the list):
- Test set [1,3,7] only includes digits 1, 3 and 7
- Test set [2,5,8] only includes digits 2, 5 and 8
- Test set [4,6,9] only includes digits 4, 6 and 9
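Continuing the sketches from above, the subsets can be built by filtering the test labels, and a small helper gives the accuracy; model_1 is the first client’s trained model from the earlier snippet.

import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

# Test set with the same normalisation as the training data
transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
)
test_set = datasets.MNIST("./data", train=False, download=True, transform=transform)

def digit_subset(dataset, digits):
    """Keep only the examples whose label is in `digits`."""
    idx = [i for i, t in enumerate(dataset.targets.tolist()) if t in digits]
    return Subset(dataset, idx)

def accuracy(model, loader):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total

# e.g. model_1 (trained without 1, 3, 7) evaluated on exactly those digits
loader_137 = DataLoader(digit_subset(test_set, [1, 3, 7]), batch_size=256)
print(accuracy(model_1, loader_137))  # ~0.0 in this experiment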

When we evaluate each model only on the digits it never saw during training, accuracy drops to 0 percent. The models completely fail on classes they were never exposed to. This much is expected, since a model cannot learn to recognise patterns it has never seen before. But there is more than meets the eye, so we next look at the confusion matrix to understand the behaviour in more detail.
Understanding the Failure Through Confusion Matrices
Below is the confusion matrix for model 1, which was trained on data excluding digits 1, 3 and 7. Since these digits were never seen during training, the model almost never predicts those labels.
However, in a few cases, the model predicts visually similar digits instead. When label 1 is missing, the model never outputs 1 and instead predicts digits like 2 or 8. The same pattern appears for the other missing classes. This means the model fails by confidently assigning the wrong label. This is definitely not expected.
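If you want to reproduce this view, scikit-learn’s confusion_matrix makes it straightforward; model_1 and test_set are assumed from the earlier sketches.

import torch
from sklearn.metrics import confusion_matrix
from torch.utils.data import DataLoader

def predictions(model, loader):
    """Collect true and predicted labels over a whole loader."""
    model.eval()
    y_true, y_pred = [], []
    with torch.no_grad():
        for images, labels in loader:
            y_true.extend(labels.tolist())
            y_pred.extend(model(images).argmax(dim=1).tolist())
    return y_true, y_pred

y_true, y_pred = predictions(model_1, DataLoader(test_set, batch_size=256))
cm = confusion_matrix(y_true, y_pred, labels=list(range(10)))
# The columns for the unseen digits 1, 3 and 7 stay (almost) empty
print(cm)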

This example shows the limits of training in isolation on skewed data. When each client has only a partial view of the true distribution, models fail in systematic ways that overall accuracy doesn’t capture. This is precisely the problem federated learning is meant to address, and that’s what we’ll implement in the next section using the Flower framework.
What is Flower 🌼?
Flower is an open source framework that makes federated learning easy to implement, even for beginners. It is framework agnostic, so you can use PyTorch, TensorFlow, Hugging Face, JAX and more without worrying about compatibility. Also, the same core abstractions apply whether you’re running experiments on a single machine or training across real devices in production.
Flower models federated learning in a very direct way. A Flower app is built around the same roles we discussed in the previous article: clients, a server and a strategy that connects them. Let’s now look at these roles in more detail.
Understanding Flower Through Simulation
Flower makes it very easy to get started with federated learning without worrying about any complex setup. For local simulation, there are basically two commands you need to care about:
- one to generate the app: flwr new
- one to run it: flwr run
You define a Flower app once and then run it locally to simulate many clients. Even though everything runs on a single machine, Flower treats each client as an independent participant with its own data and training loop. This makes it much easier to experiment and test before moving to a real deployment.
Let us start by installing the latest version of Flower, which at the time of writing this article is 1.25.0.
# Install flower in a virtual environment
pip install -U flwr
# Checking the installed version
flwr --version
Flower version: 1.25.0
The fastest way to create a working Flower app is to let Flower scaffold one for you via flwr new.
flwr new  # to select from a list of templates
or
flwr new @flwrlabs/quickstart-pytorch  # directly specify a template
You now have a complete project with a clean structure to start with.
quickstart-pytorch
├── pytorchexample
│ ├── client_app.py
│ ├── server_app.py
│ └── task.py
├── pyproject.toml
└── README.md
There are three main files in the project (a trimmed sketch of the client side follows the list):
- The task.py file defines the model, dataset and training logic.
- The client_app.py file defines what each client does locally.
- The server_app.py file coordinates training and aggregation, typically using federated averaging, though you can also modify it.
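To make these roles concrete, here is a heavily trimmed sketch of what client_app.py can look like, based on Flower’s NumPyClient API. The helpers imported from task.py (Net, load_data, train, test, get_weights, set_weights) and their signatures are assumptions modelled on the quickstart template; the exact file generated for your Flower version may differ.

# client_app.py (sketch, not the verbatim template)
from flwr.client import ClientApp, NumPyClient
from flwr.common import Context

from pytorchexample.task import Net, load_data, train, test, get_weights, set_weights

class FlowerClient(NumPyClient):
    def __init__(self, trainloader, valloader, local_epochs):
        self.net = Net()
        self.trainloader = trainloader
        self.valloader = valloader
        self.local_epochs = local_epochs

    def fit(self, parameters, config):
        # Receive the current global weights, train locally, return the update
        set_weights(self.net, parameters)
        train(self.net, self.trainloader, self.local_epochs)
        return get_weights(self.net), len(self.trainloader.dataset), {}

    def evaluate(self, parameters, config):
        # Evaluate the global weights on this client's local validation data
        set_weights(self.net, parameters)
        loss, acc = test(self.net, self.valloader)
        return loss, len(self.valloader.dataset), {"accuracy": acc}

def client_fn(context: Context):
    # Each simulated SuperNode gets its own partition of the data
    trainloader, valloader = load_data(
        context.node_config["partition-id"], context.node_config["num-partitions"]
    )
    return FlowerClient(
        trainloader, valloader, context.run_config["local-epochs"]
    ).to_client()

app = ClientApp(client_fn=client_fn)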
Running the federated simulation
We can now run the federation using the commands below.
pip install -e .
flwr run .
The flwr run . command starts the server, creates the simulated clients, assigns data partitions and runs federated training end to end (the pip install -e . step just installs the project locally first).

An important point to note here is that the server and clients never call each other directly. All communication happens through message objects. Each message carries model parameters, metrics and configuration values: model weights are sent using array records, metrics such as loss or accuracy using metric records, and values like the learning rate using config records. During each round, the server sends the current global model to the selected clients, the clients train locally and return updated weights along with metrics, and the server aggregates the results. The server may also run an evaluation step where clients only report metrics without updating the model.
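To ground the aggregation step, here is what federated averaging boils down to: a weighted average of client updates, where each client’s influence is proportional to its number of training examples. This is a plain-NumPy sketch of the idea, not Flower’s internal implementation.

import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of per-client parameter lists (one list of ndarrays each)."""
    total = sum(client_sizes)
    num_layers = len(client_weights[0])
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(num_layers)
    ]

# Two toy clients with a single 2x2 "layer" each
w_a = [np.ones((2, 2))]
w_b = [np.zeros((2, 2))]
print(fedavg([w_a, w_b], client_sizes=[300, 100]))  # every entry is 0.75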
If you look inside the generated pyproject.toml, you can see how the simulation is defined.
[tool.flwr.app.components]
serverapp = "pytorchexample.server_app:app"
clientapp = "pytorchexample.client_app:app"
This section tells Flower which Python objects implement the ServerApp and ClientApp. These are the entry points Flower uses when it launches the federation.
[tool.flwr.app.config]
num-server-rounds = 3
fraction-evaluate = 0.5
local-epochs = 1
learning-rate = 0.1
batch-size = 32
Next, these values define the run configuration. They control how many server rounds are executed, how long each client trains locally and which training parameters are used. These settings are available at runtime through the Flower Context object.
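For instance, the server app can read these values through the Context it receives. Below is a minimal sketch based on the ServerApp API of recent Flower 1.x releases; the template generated for your version may be structured differently.

# server_app.py (sketch)
from flwr.common import Context
from flwr.server import ServerApp, ServerAppComponents, ServerConfig
from flwr.server.strategy import FedAvg

def server_fn(context: Context):
    # Values defined under [tool.flwr.app.config] in pyproject.toml
    num_rounds = context.run_config["num-server-rounds"]
    fraction_evaluate = context.run_config["fraction-evaluate"]

    strategy = FedAvg(fraction_evaluate=fraction_evaluate)
    return ServerAppComponents(
        strategy=strategy, config=ServerConfig(num_rounds=num_rounds)
    )

app = ServerApp(server_fn=server_fn)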
[tool.flwr.federations]
default = "local-simulation"
[tool.flwr.federations.local-simulation]
options.num-supernodes = 10
This section defines the local simulation itself. Setting options.num-supernodes = 10 tells Flower to create ten simulated clients. Each SuperNode runs one ClientApp instance with its own data partition.
Here’s a quick rundown of the steps mentioned above.

Now that we’ve seen how easy it is to run a federated simulation with Flower, we’ll apply this structure to our MNIST example and revisit the skewed data problem we observed earlier.
Improving Accuracy through Collaborative Training
Now let’s return to our MNIST example. We saw that the models trained on individual local datasets didn’t give good results. In this section, we change the setup so that clients collaborate by sharing model updates instead of working in isolation. Each dataset, however, is still missing certain digits as before, and each client still trains locally.
The best part about the project scaffolded in the previous section is that it can now be easily adapted to our use case. I have taken the Flower app generated in the previous section and made a few changes to the client_app, server_app and task files. I configured the training to run for three server rounds, with all clients participating in every round, and each client training its local model for ten local epochs. All these settings can be easily managed via the pyproject.toml file, as sketched below. The local models are then aggregated into a single global model using Federated Averaging.
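For reference, this is roughly what the run configuration looks like for this experiment. The key names mirror the generated template shown earlier; fraction-fit is a standard FedAvg parameter, but whether the scaffolded server app reads it from the config is an assumption, so treat this as a sketch.

[tool.flwr.app.config]
num-server-rounds = 3
fraction-fit = 1.0       # all clients train in every round (assumed wiring)
fraction-evaluate = 1.0
local-epochs = 10
learning-rate = 0.1
batch-size = 32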


Now let’s look at the results. Recall that in the isolated training approach, the three individual models achieved an accuracy of roughly 65 to 70%. Here, with federated learning, we see a huge jump in accuracy to around 96%. This means the global model is much better than any of the individual models trained in isolation.
The global model even performs well on the specific subsets (the digits that were missing from each client’s data), with accuracy jumping from 0% previously to between 94 and 97%.

The confusion matrix above corroborates this finding. It shows that the model learns to classify all digits properly, even those it was never exposed to. We no longer see any columns containing only zeros, and every digit class now receives predictions, showing that collaborative training enabled the model to learn the full data distribution without any single client having access to all digit types.
Looking at the big picture
While this is a toy example, it helps build the intuition behind why federated learning is so powerful. The same principle can be applied to situations where data is distributed across multiple locations and cannot be centralized due to privacy or regulatory constraints.

For instance, if you replace the clients in the above example with, say, three hospitals, each holding local data, you would see that even though each hospital only has its own limited dataset, the overall model trained through federated learning is much better than any individual model trained in isolation. Moreover, the data stays private and secure within each hospital, yet the model benefits from the collective knowledge of all participating institutions.
Conclusion & What’s Next
That’s all for this part of the series. In this article, we implemented an end-to-end federated learning loop with Flower, understood the various components of a Flower app and compared machine learning with and without collaborative learning. In the next part, we’ll explore federated learning from a privacy point of view. While federated learning itself is a data minimization technique, since it prevents direct access to raw data, the model updates exchanged between client and server can still potentially lead to privacy leaks. We’ll touch on this in the next part. For now, it would be a great idea to look into the official documentation.
