From Centralized to Federated Learning


Federated Learning (FL) is a technique for training Machine Learning (ML) models in a distributed setting [1]. The idea is that clients (for instance, hospitals) want to cooperate without sharing their private and sensitive data. In FL, each client keeps its private data and trains an ML model on it. A central server then collects and aggregates the model parameters, building a global model based on information from the whole data distribution. Ideally, this serves as privacy protection by design.

A long line of research has studied FL's efficiency, privacy, and fairness. Here we'll concentrate on the benchmark datasets used to evaluate horizontal FL methods, where the clients share the same task and data type but each has its own data samples.

If you want to know more about Federated Learning and what I work on, visit our research lab website!


There are three kinds of datasets in the literature:

  1. Real FL scenario: an application where FL is a necessary method. It has natural distributions and sensitive data. However, given the nature of FL, if you want to keep the data local you won't publish the dataset online for benchmarking. It is therefore hard to find a dataset of this kind. OpenMined, the organization behind PySyft, is trying to organize an FL community of universities and research labs to host data in a more realistic scenario. Moreover, there are applications where privacy awareness has risen only recently, so there may be publicly available data even though the demand for FL exists. One such application is smart electricity meters [2].
  2. FL benchmark datasets: these datasets are designed to serve as FL benchmarks. The distribution is realistic, but the sensitivity of the data is questionable because they are built from publicly available sources. One example is creating an FL dataset from Reddit posts, using the users as clients and treating each user's posts as one partition. The LEAF project proposed more datasets like this [3].
  3. Distributing standard datasets: a few well-known datasets, for instance CIFAR and ImageNet for images, are used as benchmarks in many Machine Learning works. Here FL researchers define a distribution according to their research questions. This method makes sense if the topic is well studied in a standard ML scenario and one wants to compare an FL algorithm to the centralized SOTA. However, this artificial distribution doesn't reveal every problem with distribution skew, for instance when the clients collect images with very different cameras or in different lighting conditions.

Because the last category isn't distributed by design, past research works split these datasets in several ways. In the rest of this story, I'll summarise the distribution techniques used for the CIFAR dataset in a federated scenario.

CIFAR dataset

The CIFAR-10 and CIFAR-100 datasets contain 32×32 color images labeled with mutually exclusive classes [4]. CIFAR-10 has 10 classes of 6000 images each and CIFAR-100 has 100 classes of 600 images each. They are used in many image classification tasks, and one can access dozens of models evaluated on them, and even browse them on a leaderboard on PapersWithCode.

Uniform distribution

This is considered to be independent and identically distributed (IID) data. Data points are randomly allocated to clients.
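A minimal sketch of such a uniform split, assuming the dataset is addressed by integer sample indices. The function name `split_iid` and the client count are illustrative choices, not taken from any of the cited works.

```python
import numpy as np

def split_iid(num_samples, num_clients, seed=0):
    """Shuffle all sample indices and deal them out in near-equal parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(num_samples)
    # array_split also works when num_samples is not divisible by num_clients.
    return np.array_split(indices, num_clients)

# 50,000 is the size of the CIFAR-10 training split; 20 clients is arbitrary.
client_indices = split_iid(num_samples=50_000, num_clients=20)
print([len(ix) for ix in client_indices[:3]])
```

Because every client receives a random subset, each local class distribution approximates the global one.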

Single (n-) class clients

Data points allocated to a particular client come from the same class or a small number of classes. This can be seen as an extreme non-IID setting. Examples of this distribution are in [1,5–8]. The work that first named the setting Federated Learning [1] uses 200 single-class sets and gives two sets to each client, making them 2-class clients. [5–7] use 2-class clients.
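The shard-based construction can be sketched as follows: sort the indices by label, cut them into single-class shards, and hand each client a fixed number of shards. This is an illustration under assumptions, not the exact code of any cited paper; the toy label array stands in for a real dataset, and `split_by_shards` is a hypothetical helper name.

```python
import numpy as np

def split_by_shards(labels, num_clients, shards_per_client=2, seed=0):
    """Sort indices by label, cut them into shards, give each client a few shards."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    order = np.argsort(labels, kind="stable")
    num_shards = num_clients * shards_per_client
    shards = np.array_split(order, num_shards)
    shard_ids = rng.permutation(num_shards)
    clients = []
    for c in range(num_clients):
        own = shard_ids[c * shards_per_client:(c + 1) * shards_per_client]
        clients.append(np.concatenate([shards[s] for s in own]))
    return clients

# Toy labels (an assumption): 10 balanced classes with 600 samples each.
labels = np.repeat(np.arange(10), 600)
# 100 clients x 2 shards = 200 shards of 30 samples, each from a single class.
clients = split_by_shards(labels, num_clients=100)
print(np.unique(labels[clients[0]]))
```

A shard contains a single class only if the class sizes are multiples of the shard size, as in this toy setup. Otherwise a few shards straddle two classes.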

[9] builds on the hierarchical classes in CIFAR-100: clients have data points from one subclass in each superclass. This way, in the superclass classification task every client has samples from each (super)class, yet a distribution skew is simulated because the data points come from different subclasses. For instance, one client has access to lion images while another has tiger images; the superclass task is to classify both as large carnivores.
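A rough sketch of this subclass-based split, assuming each client receives a distinct subclass from every superclass. The hierarchy below is a toy stand-in (fine class f belongs to superclass f // 5), not the real CIFAR-100 mapping, and `split_by_subclass` is a hypothetical helper rather than the procedure of [9].

```python
import numpy as np

def split_by_subclass(fine_labels, fine_to_coarse, num_clients, seed=0):
    """Give every client one distinct subclass from each superclass."""
    rng = np.random.default_rng(seed)
    fine_labels = np.asarray(fine_labels)
    fine_to_coarse = np.asarray(fine_to_coarse)
    client_classes = [[] for _ in range(num_clients)]
    for coarse in np.unique(fine_to_coarse):
        # Subclasses of this superclass, in random order.
        subclasses = rng.permutation(np.flatnonzero(fine_to_coarse == coarse))
        if len(subclasses) < num_clients:
            raise ValueError("need at least one distinct subclass per client")
        for c in range(num_clients):
            client_classes[c].append(subclasses[c])
    return [np.flatnonzero(np.isin(fine_labels, cls)) for cls in client_classes]

# Toy hierarchy (an assumption): 100 fine classes, 20 superclasses of 5 each.
fine_to_coarse = np.arange(100) // 5
fine_labels = np.repeat(np.arange(100), 50)
clients = split_by_subclass(fine_labels, fine_to_coarse, num_clients=5)
print(len(clients[0]), np.unique(fine_to_coarse[fine_labels[clients[0]]]).size)
```

Every client then sees all 20 superclasses, but through different subclasses.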

Dominant class clients

[5] also uses a mixture of uniform and 2-class clients, meaning that half of the data points come from the two dominant classes and the rest are chosen uniformly from the other classes. [10] uses an 80%-20% partition: 80% is chosen from a single dominant class and the rest is chosen uniformly from the other classes.
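The 80%-20% variant could be sketched like this, assuming each client gets a fixed number of samples and the dominant class is assigned round-robin. Both choices are assumptions made for illustration, and the helper name is hypothetical.

```python
import numpy as np

def split_dominant_class(labels, num_clients, samples_per_client,
                         dominant_fraction=0.8, seed=0):
    """Each client draws most samples from one class, the rest from the others."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Unassigned indices per class, in random order (sampling without replacement).
    pools = {k: list(rng.permutation(np.flatnonzero(labels == k))) for k in classes}
    num_dominant = int(round(dominant_fraction * samples_per_client))
    clients = []
    for c in range(num_clients):
        dominant = classes[c % len(classes)]  # round-robin choice, an assumption
        picked = [pools[dominant].pop() for _ in range(num_dominant)]
        others = [k for k in classes if k != dominant]
        for _ in range(samples_per_client - num_dominant):
            # pop() raises IndexError if a class pool is exhausted.
            picked.append(pools[rng.choice(others)].pop())
        clients.append(np.array(picked))
    return clients

# Toy labels (an assumption): 10 balanced classes with 600 samples each.
labels = np.repeat(np.arange(10), 600)
clients = split_dominant_class(labels, num_clients=10, samples_per_client=500)
print(np.bincount(labels[clients[0]], minlength=10))
```

Setting `dominant_fraction=1.0` recovers single-class clients, and lowering it moves the split towards the uniform case.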

Dirichlet distribution

To understand the Dirichlet distribution, I follow the example of this blog post. Let's say one wants to manufacture dice with probabilities θ=(1/6,1/6,1/6,1/6,1/6,1/6) for the numbers 1–6. In reality, however, nothing is perfect, so each die will be a bit skewed: 4 a bit more likely and 3 a bit less likely, for instance. The Dirichlet distribution describes this variety with a parameter vector α=(α₁,α₂,..,α₆). A larger αᵢ strengthens the weight of that number, and a larger overall sum of the αᵢ values makes the sampled probabilities (dice) more similar to each other. Returning to the dice example, for the dice to be fair on average each αᵢ must be equal, and the larger the α value, the better manufactured the dice are. Because the Dirichlet distribution is a multivariate generalization of the beta distribution, let's display some examples of the beta distribution (the Dirichlet distribution for two variables):

Different beta distributions (Dirichlet distribution for two variables). Figure by the author
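The dice example can be tried out numerically. The sketch below samples many "dice" from a symmetric Dirichlet distribution and measures how much their face probabilities vary; the helper name `dice_spread` and the chosen α values are illustrative assumptions.

```python
import numpy as np

def dice_spread(alpha, num_dice=1000, seed=0):
    """Sample dice from a symmetric Dirichlet and report how much they vary."""
    rng = np.random.default_rng(seed)
    # Each row is one die: six face probabilities that sum to 1.
    dice = rng.dirichlet(alpha * np.ones(6), size=num_dice)
    # Standard deviation of each face probability across dice, averaged over faces.
    return dice, dice.std(axis=0).mean()

for alpha in (0.1, 1.0, 100.0):
    _, spread = dice_spread(alpha)
    print(f"alpha={alpha}: average std of a face probability = {spread:.3f}")
```

The spread shrinks as α grows, which is the "better manufactured dice" effect: with a large α almost every sampled die is close to fair.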

I reproduced the visualization in [11], using the same α value for each αᵢ. This is called a symmetric Dirichlet distribution. We can see that as the α value decreases, unbalanced dice become more likely. The figures below show the Dirichlet distribution for various α values. Here each row represents a class, each column is a client, and the area of the circles is proportional to the probabilities.

Distribution over classes: Sampling 20 clients for 10 classes using different Dirichlet distribution α values. Figure by the author

Distribution over classes: the samples for each client are drawn independently, with each client's class distribution sampled from the Dirichlet distribution. [11, 16] use this version of the Dirichlet distribution.

Distribution over classes: normalized sum of samples by class (10) and by client (20). Figure by the author

Each client has a predetermined number of samples, but the classes are chosen randomly, so the final total class representation will be unbalanced. Within the clients, α→∞ gives the prior (uniform) distribution while α→0 means single-class clients.
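A sketch of the distribution-over-classes variant, under the assumption that each client's class counts are drawn from a multinomial with Dirichlet-sampled proportions. It returns only counts per class. Mapping them to concrete sample indices (and handling classes that run out of samples) is left out, and the function name is hypothetical.

```python
import numpy as np

def class_counts_per_client(num_clients, num_classes, samples_per_client,
                            alpha, seed=0):
    """Fixed client size; class proportions sampled per client from Dir(alpha)."""
    rng = np.random.default_rng(seed)
    # One class-probability vector per client: shape (num_clients, num_classes).
    proportions = rng.dirichlet(alpha * np.ones(num_classes), size=num_clients)
    # Draw each client's class counts from its own proportions.
    return np.stack([rng.multinomial(samples_per_client, p) for p in proportions])

counts = class_counts_per_client(num_clients=20, num_classes=10,
                                 samples_per_client=500, alpha=0.1)
print(counts[:3])         # rows: clients, columns: classes
print(counts.sum(axis=0)) # total per class, generally unbalanced
```

Every row sums to the same client size, while the column sums show the unbalanced global class representation mentioned above.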

Distribution over clients: Sampling 20 clients for 10 classes using different Dirichlet distribution α values. Figure by the author

Distribution over clients: if we know the total number of samples in a class and the number of clients, we can distribute the samples to the clients class by class. This results in clients having different numbers of samples (which is very typical in FL), while the global class distribution stays balanced. [12] uses this variation of the Dirichlet distribution.

Distribution over clients: normalized sum of samples by class (10) and by client (20). Figure by the author
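The distribution-over-clients variant can be sketched by splitting each class's indices among the clients with Dirichlet-sampled shares. This is a minimal illustration with toy labels, not the exact code of [12]; the function name is hypothetical.

```python
import numpy as np

def split_dirichlet_over_clients(labels, num_clients, alpha, seed=0):
    """For every class, split its samples among clients with Dir(alpha) shares."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for k in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == k))
        # Share of class k that goes to each client.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Turn cumulative shares into cut points inside idx.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return [np.array(ix, dtype=int) for ix in client_indices]

# Toy labels (an assumption): 10 balanced classes with 600 samples each.
labels = np.repeat(np.arange(10), 600)
clients = split_dirichlet_over_clients(labels, num_clients=20, alpha=0.5)
print(sorted(len(c) for c in clients))
```

Every sample is assigned exactly once, so the global class balance is preserved while the client sizes differ.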

While works like [11–16] follow and cite one another in using the Dirichlet distribution, they use the two different methods. Moreover, the experiments use different α values, which can lead to very different performance: [11,12] use α=0.1, [13–15] use α=0.5, and [16] gives an overview of various α values. These design choices undermine the original principle of using the same benchmark dataset to evaluate algorithms.

Asymmetric Dirichlet distribution: one can use different αᵢ values to simulate more resourceful clients. For instance, the figure below is produced using αᵢ=1/i for the ith client. To my knowledge it isn't represented in the literature; instead, the Zipf distribution is used in [17].

Asymmetric Dirichlet distribution with αᵢ=1/i. Figure by the author
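A small sketch of the asymmetric case, assuming the αᵢ=1/i vector is used in the distribution-over-clients setting, so that each row gives one class's shares across clients. The function name is a hypothetical choice.

```python
import numpy as np

def asymmetric_client_shares(num_clients, num_classes, seed=0):
    """Sample per-class client shares from a Dirichlet with alpha_i = 1/i."""
    rng = np.random.default_rng(seed)
    alphas = 1.0 / np.arange(1, num_clients + 1)
    # Row k holds the share of class k assigned to clients 1..num_clients.
    return rng.dirichlet(alphas, size=num_classes)

shares = asymmetric_client_shares(num_clients=20, num_classes=10)
print(shares.mean(axis=0).round(3))  # average share per client
```

The expected share of client i is αᵢ divided by the sum of all α values, so low-index clients tend to hold more data.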

Zipf distribution

[17] uses a combination of the Zipf and Dirichlet distributions. It uses the Zipf distribution to determine the number of samples at each client and then selects the class distribution using the Dirichlet distribution.

Probability for rank k in the Zipf distribution, $p(k) = k^{-a} / \zeta(a)$, where $\zeta$ is the Riemann zeta function

In the Zipf (zeta) distribution the frequency of an item is inversely proportional to its rank in a frequency table. Zipf's law can be observed in many real-world datasets, for instance in the word frequency of language corpora [18].

Sampling items using the Zipf distribution. Figure by the author, following the numpy documentation on Zipf
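One way to combine the two distributions is sketched below. It assumes that the client of rank k receives a share of the data proportional to k^(-a), normalized over the finite number of clients rather than by the zeta function, and that the class counts are then drawn with Dirichlet-sampled proportions. The exact procedure in [17] may differ; the function names and parameter values are illustrative.

```python
import numpy as np

def zipf_client_sizes(num_clients, total_samples, a=1.5):
    """Client of rank k gets a share of the data proportional to k**(-a)."""
    ranks = np.arange(1, num_clients + 1, dtype=float)
    weights = ranks ** (-a)
    weights /= weights.sum()
    sizes = np.floor(weights * total_samples).astype(int)
    sizes[0] += total_samples - sizes.sum()  # rounding remainder to rank 1
    return sizes

def zipf_dirichlet_counts(num_clients, num_classes, total_samples,
                          a=1.5, alpha=0.5, seed=0):
    """Zipf-like client sizes, Dirichlet class proportions per client."""
    rng = np.random.default_rng(seed)
    sizes = zipf_client_sizes(num_clients, total_samples, a)
    proportions = rng.dirichlet(alpha * np.ones(num_classes), size=num_clients)
    return np.stack([rng.multinomial(n, p) for n, p in zip(sizes, proportions)])

counts = zipf_dirichlet_counts(num_clients=20, num_classes=10, total_samples=50_000)
print(counts.sum(axis=1))  # samples per client, decreasing with rank
```

Here `a` controls how quickly the client sizes fall off with rank, and `alpha` controls the class skew inside each client.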

Benchmarking federated learning methods is a challenging task. Ideally, one uses predefined real federated datasets. However, if a certain scenario has to be simulated and no good existing dataset covers it, one can use data distribution techniques. Proper documentation for reproducibility and a motivation for the design choice are important. Here I summarized the most common methods already in use for FL algorithm evaluation. Visit this Colab notebook for the code used for this story!

[1] McMahan, B., Moore, E., Ramage, D., Hampson, S., & y Arcas, B. A. (2017, April). Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics (pp. 1273–1282). PMLR.

[2] Savi, M., & Olivadese, F. (2021). Short-term energy consumption forecasting at the edge: A federated learning approach. IEEE Access, 9, 95949–95969.

[3] Caldas, S., Duddu, S. M. K., Wu, P., Li, T., Konečný, J., McMahan, H. B., … & Talwalkar, A. (2019). Leaf: A benchmark for federated settings. Workshop on Federated Learning for Data Privacy and Confidentiality

[4] Krizhevsky, A. (2009). Learning Multiple Layers of Features from Tiny Images. Master’s thesis, University of Toronto.

[5] Liu, W., Chen, L., Chen, Y., & Zhang, W. (2020). Accelerating federated learning via momentum gradient descent. IEEE Transactions on Parallel and Distributed Systems, 31(8), 1754–1766.

[6] Zhang, L., Luo, Y., Bai, Y., Du, B., & Duan, L. Y. (2021). Federated learning for non-iid data via unified feature learning and optimization objective alignment. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 4420–4428).

[7] Zhang, J., Guo, S., Ma, X., Wang, H., Xu, W., & Wu, F. (2021). Parameterized knowledge transfer for personalized federated learning. Advances in Neural Information Processing Systems, 34, 10092–10104.

[8] Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., & Chandra, V. (2018). Federated learning with non-iid data. arXiv preprint arXiv:1806.00582.

[9] Li, D., & Wang, J. (2019). Fedmd: Heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581.

[10] Wang, H., Kaplan, Z., Niu, D., & Li, B. (2020, July). Optimizing federated learning on non-iid data with reinforcement learning. In IEEE INFOCOM 2020-IEEE Conference on Computer Communications (pp. 1698–1707). IEEE.

[11] Lin, T., Kong, L., Stich, S. U., & Jaggi, M. (2020). Ensemble distillation for robust model fusion in federated learning. Advances in Neural Information Processing Systems, 33, 2351–2363.

[12] Luo, M., Chen, F., Hu, D., Zhang, Y., Liang, J., & Feng, J. (2021). No fear of heterogeneity: Classifier calibration for federated learning with non-iid data. Advances in Neural Information Processing Systems, 34, 5972–5984.

[13] Yurochkin, M., Agarwal, M., Ghosh, S., Greenewald, K., Hoang, N., & Khazaeni, Y. (2019, May). Bayesian nonparametric federated learning of neural networks. In International conference on machine learning (pp. 7252–7261). PMLR.

[14] Wang, H., Yurochkin, M., Sun, Y., Papailiopoulos, D., & Khazaeni, Y. (2020). Federated learning with matched averaging. In International Conference on Learning Representations.

[15] Li, Q., He, B., & Song, D. (2021). Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10713–10722).

[16] Hsu, T. M. H., Qi, H., & Brown, M. (2019). Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335.

[17] Wadu, M. M., Samarakoon, S., & Bennis, M. (2021). Joint client scheduling and resource allocation under channel uncertainty in federated learning. IEEE Transactions on Communications, 69(9), 5962–5974.

[18] Fagan, Stephen; Gençay, Ramazan (2010), “An introduction to textual econometrics”, in Ullah, Aman; Giles, David E. A. (eds.), Handbook of Empirical Economics and Finance, CRC Press, pp. 133–153
