Detailed information about the theoretical aspects, framework, and methodologies underpinning the HarmoniFL tool can be found in the accompanying theoretical paper. For a comprehensive understanding of the development process that led to HarmoniFL, please refer to the theoretical paper.
This thesis introduces HarmoniFL, an initiative engineered to navigate the complexities of resource heterogeneity within federated learning frameworks. Focused on enhancing efficiency and inclusivity, HarmoniFL employs a dynamic client selection protocol, leveraging real-time metrics like CPU load, memory availability, and network bandwidth to optimize device participation in learning tasks. Adaptive strategies—ranging from data sampling adjustments to epoch reduction and batch size optimization for high-demand devices—address the challenges of resource diversity. Our experimental analysis focused on two primary goals: minimizing training duration for less capable devices and improving the accuracy of the aggregated global model. Results demonstrate that HarmoniFL effectively reduces training times and enhances model performance, underscoring its potential to foster more equitable device participation in federated learning tasks without sacrificing learning quality.
The training process for each client is designed to be dynamic, allowing for adjustments based on the available computational resources. This ensures both optimal performance and efficient resource utilization across different hardware environments.
To customize the training parameters dynamically, settings are specified in the criteria.yaml
file within the config
folder. This configuration file is pivotal in adjusting training parameters like batch size, learning rate, and more, tailored to the real-time capabilities of each client's hardware.
The criteria.yaml
file contains several criteria, each defined with specific types and configurations. These criteria are used to dynamically adjust training parameters based on hardware utilization metrics such as CPU and memory usage, as well as network bandwidth. Each criterion is structured as follows:
type
: Identifies the adjustment strategy, e.g., adaptive batch size, learning rate adjustment, etc.blocking
: Indicates if the training should be halted when this criterion is not met.active
: Specifies whether the criterion is currently enabled or disabled.config
: Contains the configuration settings for the criterion, including thresholds and adjustment factors.
- Adaptive Learning Rate: Adjusts the learning rate based on CPU utilization, helping to balance computational load and training efficiency.
- Adaptive Batch Size: Modifies the batch size in response to memory utilization, optimizing memory usage without compromising training effectiveness.
- Model Layer Freezing: Reduces computational complexity by freezing model layers when CPU utilization is high, ensuring stable training performance.
Docker must be installed and the Docker daemon running on your server. If you don't already have Docker installed, you can get installation instructions for your specific Linux distribution or macOS from Docker. Besides Docker, the only extra requirement is having Python installed. You don't need to create a new environment for this example since all dependencies will be installed inside Docker containers automatically.
Execute the following command to run the helpers/generate_docker_compose.py
script. This script creates the docker-compose configuration needed to set up the environment.
python helpers/generate_docker_compose.py
Within the script, specify the number of clients (total_clients
) and resource limitations for each client in the client_configs
array. You can adjust the number of rounds by passing --num_rounds
to the above command.
-
Execute Initialization Script:
-
To build the Docker images and start the containers, use the following command:
docker-compose up
-
-
Services Startup:
-
Several services will automatically launch as defined in your
docker-compose.yml
file:- Monitoring Services: Prometheus for metrics collection, Cadvisor for container monitoring, and Grafana for data visualization.
- Flower Federated Learning Environment: The Flower server and client containers are initialized and start running.
-
-
Automated Grafana Configuration:
Grafana is configured to load pre-defined data sources and dashboards for immediate monitoring, facilitated by provisioning files. The provisioning files include
prometheus-datasource.yml
for data sources, located in the./config/provisioning/datasources
directory, anddashboard_index.json
for dashboards, in the./config/provisioning/dashboards
directory. Thegrafana.ini
file is also tailored to enhance user experience:- Admin Credentials: We provide default admin credentials in the
grafana.ini
configuration, which simplifies access by eliminating the need for users to go through the initial login process. - Default Dashboard Path: A default dashboard path is set in
grafana.ini
to ensure that the dashboard with all the necessary panels is rendered when Grafana is accessed.
Visit
http://localhost:3000
to enter Grafana, where the automated setup greets you with a series of pre-configured dashboards, ready for immediate monitoring and customizable to fit specific requirements. Thedashboard_index.json
file, pivotal for dashboard configuration, outlines panel structures and settings for key metrics like model accuracy and CPU usage. This setup, enhanced by direct volume mappings in Docker Compose, ensures Grafana's readiness from startup without additional configuration, providing an insightful snapshot into the federated learning system's performance. - Admin Credentials: We provide default admin credentials in the
-
Begin Training Process:
- The federated learning training automatically begins once all client containers are successfully connected to the Flower server. This synchronizes the learning process across all participating clients.
To conduct experiments and analyze results within the HarmoniFL framework, follow these steps:
-
Data Extraction: In the
experiments/dataExtraction
directory, you'll find scripts designed to extract data from MLflow. These scripts are essential for gathering the metrics and results generated during the federated learning process. -
Plot Generation: After extracting the data, navigate to the
experiments/plotGeneration
folder. Here, scripts are available to create visual representations of the experiment results, such as accuracy over training rounds. -
Pre-configured Data and Results: For your convenience, we've included pre-collected data and results from our experiments. This allows for a quick start in analyzing the performance and efficiency of the HarmoniFL system without the need for immediate experiment replication.
By following these steps, you can efficiently run experiments, extract necessary data, and generate plots to visualize the performance and outcomes of the HarmoniFL framework.
This project is licensed under the MIT License - see the LICENSE file for details.