kubeflow · google-oss-prow · May 20, 2024 · Mar 12, 2024 · Apr 22, 2024 · Apr 26, 2024
diff --git a/content/en/_index.html b/content/en/_index.html
@@ -124,7 +124,7 @@ <h5 class="card-title text-white section-head">AutoML</h5>
  <div class="card-body bg-primary-dark">
  <h5 class="card-title text-white section-head">Model Training</h5>
  <p class="card-text text-white">
- <a href="/docs/components/training/overview/" target="_blank" rel="noopener" >Kubeflow Training Operator</a> is a unified interface for model training on Kubernetes.
+ <a href="/docs/components/training/overview/" target="_blank" rel="noopener" >Kubeflow Training Operator</a> is a unified interface for model training and fine-tuning on Kubernetes.
  It runs scalable and distributed training jobs for popular frameworks including PyTorch, TensorFlow, MPI, MXNet, PaddlePaddle, and XGBoost.
  </p>
  </div>

diff --git a/content/en/docs/components/training/explanation/_index.md b/content/en/docs/components/training/explanation/_index.md
@@ -0,0 +1,5 @@
++++
+title = "Explanation"
+description = "Explanation for Training Operator Features"
+weight = 60
++++
diff --git a/content/en/docs/components/training/explanation/fine-tuning.md b/content/en/docs/components/training/explanation/fine-tuning.md
@@ -0,0 +1,63 @@
++++
+title = "LLM Fine-Tuning with Training Operator"
+description = "Why Training Operator needs a fine-tuning API"
+weight = 10
++++
+
+{{% alert title="Warning" color="warning" %}}
+This feature is in **alpha** stage and the Kubeflow community is looking for your feedback. Please
+share your experience in the [#kubeflow-training-operator Slack channel](https://kubeflow.slack.com/archives/C985VJN9F)
+or on [Kubeflow Training Operator GitHub](https://github.com/kubeflow/training-operator/issues/new).
+{{% /alert %}}
+
+This page explains how the [Training Operator fine-tuning API](/docs/components/training/user-guides/fine-tuning)
+fits into the Kubeflow ecosystem.
+
+In the rapidly evolving landscape of machine learning (ML) and artificial intelligence (AI),
+the ability to fine-tune pre-trained models makes it possible to build custom solutions with
+less effort and time. Fine-tuning allows practitioners to adapt large language models
+(LLMs) like BERT or GPT to their specific needs by training these models on custom datasets.
+This process keeps the model's architecture and starts from its learned parameters, then adjusts
+them to make the model more relevant to particular applications. Whether you're working in natural
+language processing (NLP), image classification, or another ML domain, fine-tuning can substantially
+improve the performance and applicability of pre-existing models on new datasets and problems.
+
+## Why Training Operator Fine-Tune API Matters
+
+The introduction of the Fine-Tune API in the Training Operator Python SDK is a significant step for
+ML practitioners operating within the Kubernetes ecosystem. Historically, Training Operator has
+streamlined the orchestration of ML workloads on Kubernetes, making distributed training more
+accessible. However, fine-tuning tasks often require extensive manual intervention, including the
+configuration of training environments and the distribution of data across nodes. The Fine-Tune API
+aims to simplify this process, offering an easy-to-use Python interface that abstracts away the
+complexity involved in setting up and executing fine-tuning tasks on distributed systems.
+
+## The Rationale Behind Kubeflow's Fine-Tune API
+
+Implementing the Fine-Tune API within Training Operator is a logical step in enhancing the platform's
+capabilities. By providing this API, Training Operator not only simplifies the user experience for
+ML practitioners but also leverages its existing infrastructure for distributed training.
+This approach aligns with Kubeflow's mission to democratize distributed ML training, making it more
+accessible and less cumbersome for users. The API facilitates a seamless transition from model
+development to deployment, supporting the fine-tuning of LLMs on custom datasets without the need
+for extensive manual setup or specialized knowledge of Kubernetes internals.
+
+## Roles and Interests
+
+Different user personas can benefit from this feature:
+
+- **MLOps Engineers:** Can leverage this API to automate and streamline the setup and execution of
+ fine-tuning tasks, reducing operational overhead.
+
+- **Data Scientists:** Can focus more on model experimentation and less on the logistical aspects of
+ distributed training, speeding up the iteration cycle.
+
+- **Business Owners:** Can expect quicker turnaround times for tailored ML solutions, enabling faster
+ response to market needs or operational challenges.
+
+- **Platform Engineers:** Can utilize this API to better operationalize the ML toolkit, ensuring
+ scalability and efficiency in managing ML workflows.
+
+## Next Steps
+
+- Understand [the architecture behind `train` API](/docs/components/training/reference/fine-tuning).
diff --git a/content/en/docs/components/training/images/fine-tune-llm-api.drawio.svg b/content/en/docs/components/training/images/fine-tune-llm-api.drawio.svg
diff --git a/content/en/docs/components/training/reference/fine-tuning.md b/content/en/docs/components/training/reference/fine-tuning.md
@@ -0,0 +1,57 @@
++++
+title = "LLM Fine-Tuning with Training Operator"
+description = "How Training Operator performs fine-tuning on Kubernetes"
+weight = 10
++++
+
+This page shows how Training Operator implements the
+[API to fine-tune LLMs](/docs/components/training/user-guides/fine-tuning).
+
+## Architecture
+
+The following diagram shows how the `train` Python API works:
+
+<img src="/docs/components/training/images/fine-tune-llm-api.drawio.svg"
+ alt="Fine-Tune API for LLMs"
+ class="mt-3 mb-3">
+
+- Once a user executes the `train` API, Training Operator creates a PyTorchJob with appropriate
+  resources to fine-tune the LLM.
+
+- A storage initializer InitContainer is added to PyTorchJob worker 0 to download the
+  pre-trained model and dataset with the provided parameters.
+
+- A PVC with [`ReadOnlyMany` access mode](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes)
+  is attached to each PyTorchJob worker to distribute the model and dataset across Pods. **Note**: Your
+  Kubernetes cluster must support volumes with `ReadOnlyMany` access mode; otherwise, you can use only
+  a single PyTorchJob worker.
+
+- Every PyTorchJob worker runs the LLM Trainer, which fine-tunes the model using the provided parameters.
+
+Training Operator implements the `train` API with these pre-created components:
+
+### Model Provider
+
+A model provider downloads the pre-trained model. Currently, Training Operator supports the
+[HuggingFace model provider](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/storage_initializer/hugging_face.py#L56)
+that downloads models from HuggingFace Hub.
+
+You can implement your own model provider by using [this abstract base class](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/storage_initializer/abstract_model_provider.py#L4).
+
+### Dataset Provider
+
+A dataset provider downloads the dataset. Currently, Training Operator supports
+[AWS S3](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/storage_initializer/s3.py#L37)
+and [HuggingFace](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/storage_initializer/hugging_face.py#L92)
+dataset providers.
+
+You can implement your own dataset provider by using [this abstract base class](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/storage_initializer/abstract_dataset_provider.py).
+
+### LLM Trainer
+
+The trainer implements the training loop to fine-tune the LLM. Currently, Training Operator supports the
+[HuggingFace trainer](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/trainer/hf_llm_training.py#L118-L139)
+to fine-tune LLMs.
+
+You can implement your own trainer for other ML use cases such as image classification,
+voice recognition, etc.
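
The "implement your own dataset provider" note in `reference/fine-tuning.md` can be illustrated with a minimal sketch. It deliberately does not import the Kubeflow SDK: `DatasetProvider` below is a hypothetical stand-in for the linked abstract base class, and the method names `load_config` and `download_dataset`, the JSON-string config, and the `source_dir` key are all assumptions made for illustration. Check the linked source for the real interface before adapting it.

```python
import json
import shutil
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class DatasetProvider(ABC):
    """Hypothetical stand-in for the SDK's abstract dataset provider."""

    @abstractmethod
    def load_config(self, serialised_args):
        """Parse provider parameters passed as a JSON string (assumed format)."""

    @abstractmethod
    def download_dataset(self, target_dir):
        """Place the dataset files under target_dir."""


class LocalDirDatasetProvider(DatasetProvider):
    """Illustrative provider that copies a dataset from a local directory."""

    def load_config(self, serialised_args):
        # "source_dir" is a made-up parameter name for this sketch.
        self.source_dir = Path(json.loads(serialised_args)["source_dir"])

    def download_dataset(self, target_dir):
        # Copy every file so that all workers mounting target_dir can read it.
        shutil.copytree(self.source_dir, target_dir, dirs_exist_ok=True)


def demo():
    """Run the provider against temporary directories and list the result."""
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
        (Path(src) / "train.jsonl").write_text('{"text": "hello", "label": 1}\n')
        provider = LocalDirDatasetProvider()
        provider.load_config(json.dumps({"source_dir": src}))
        provider.download_dataset(dst)
        return sorted(p.name for p in Path(dst).iterdir())


if __name__ == "__main__":
    print(demo())
```

In the architecture described above, the target directory would correspond to the volume that the storage initializer fills and that each PyTorchJob worker then mounts.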
diff --git a/content/en/docs/components/training/user-guides/fine-tuning.md b/content/en/docs/components/training/user-guides/fine-tuning.md
@@ -0,0 +1,97 @@
++++
+title = "How to Fine-Tune LLMs with Kubeflow"
+description = "Overview of LLM fine-tuning API in Training Operator"
+weight = 10
++++
+
+{{% alert title="Warning" color="warning" %}}
+This feature is in **alpha** stage and the Kubeflow community is looking for your feedback. Please
+share your experience in the [#kubeflow-training-operator Slack channel](https://kubeflow.slack.com/archives/C985VJN9F)
+or on [Kubeflow Training Operator GitHub](https://github.com/kubeflow/training-operator/issues/new).
+{{% /alert %}}
+
+This page describes how to use the [`train` API from the Training Python SDK](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/sdk/python/kubeflow/training/api/training_client.py#L112), which simplifies fine-tuning LLMs with
+distributed PyTorchJob workers.
+
+If you want to learn more about how the fine-tuning API fits into the Kubeflow ecosystem, head to the
+[explanation guide](/docs/components/training/explanation/fine-tuning).
+
+## Prerequisites
+
+You need to install the Training Python SDK [with fine-tuning support](/docs/components/training/installation/#install-python-sdk-with-fine-tuning-capabilities)
+to run this API.
+
+## How to Use the Fine-Tuning API
+
+You need to provide the following parameters to use the `train` API:
+
+- Pre-trained model parameters.
+- Dataset parameters.
+- Trainer parameters.
+- Number of PyTorch workers and resources per worker.
+
+For example, you can use the `train` API as follows to fine-tune a BERT model using the Yelp Review dataset
+from HuggingFace Hub:
+
+```python
+import transformers
+from peft import LoraConfig
+
+from kubeflow.training import TrainingClient
+from kubeflow.storage_initializer.hugging_face import (
+    HuggingFaceModelParams,
+    HuggingFaceTrainerParams,
+    HuggingFaceDatasetParams,
+)
+
+TrainingClient().train(
+    name="fine-tune-bert",
+    # BERT model URI and type of Transformer to train it.
+    model_provider_parameters=HuggingFaceModelParams(
+        model_uri="hf://google-bert/bert-base-cased",
+        transformer_type=transformers.AutoModelForSequenceClassification,
+    ),
+    # Use 3000 samples from Yelp dataset.
+    dataset_provider_parameters=HuggingFaceDatasetParams(
+        repo_id="yelp_review_full",
+        split="train[:3000]",
+    ),
+    # Specify HuggingFace Trainer parameters. In this example, we will skip evaluation and model checkpoints.
+    trainer_parameters=HuggingFaceTrainerParams(
+        training_parameters=transformers.TrainingArguments(
+            output_dir="test_trainer",
+            save_strategy="no",
+            evaluation_strategy="no",
+            do_eval=False,
+            disable_tqdm=True,
+            log_level="info",
+        ),
+        # Set LoRA config to reduce number of trainable model parameters.
+        lora_config=LoraConfig(
+            r=8,
+            lora_alpha=8,
+            lora_dropout=0.1,
+            bias="none",
+        ),
+    ),
+    num_workers=4,  # nnodes parameter for torchrun command.
+    num_procs_per_worker=2,  # nproc-per-node parameter for torchrun command.
+    resources_per_worker={
+        "gpu": 2,
+        "cpu": 5,
+        "memory": "10G",
+    },
+)
+```
+
+After you execute `train`, Training Operator will orchestrate the appropriate PyTorchJob resources
+to fine-tune the LLM.
+
+## Next Steps
+
+- Run the example to [fine-tune TinyLlama LLM](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/examples/pytorch/language-modeling/train_api_hf_dataset.ipynb).
+
+- Check this example to compare `create_job` and `train` Python API for
+ [fine-tuning BERT LLM](https://github.com/kubeflow/training-operator/blob/6ce4d57d699a76c3d043917bd0902c931f14080f/examples/pytorch/text-classification/Fine-Tune-BERT-LLM.ipynb).
+
+- Understand [the architecture behind `train` API](/docs/components/training/reference/fine-tuning).
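
The comments on `num_workers` and `num_procs_per_worker` in the `user-guides/fine-tuning.md` example map them to the `nnodes` and `nproc-per-node` options of `torchrun`. The sketch below works through that arithmetic for the values used in the example. The helper name `torchrun_layout` is hypothetical, and this is not how the SDK builds its command; it only shows how the numbers relate.

```python
def torchrun_layout(num_workers, num_procs_per_worker, gpus_per_worker):
    """Derive the distributed layout implied by the train() arguments.

    Hypothetical helper for illustration only; not part of the Kubeflow SDK.
    """
    if num_workers < 1 or num_procs_per_worker < 1:
        raise ValueError("num_workers and num_procs_per_worker must be >= 1")
    return {
        # One torchrun agent per worker Pod, each starting several processes.
        "torchrun_args": [
            f"--nnodes={num_workers}",
            f"--nproc-per-node={num_procs_per_worker}",
        ],
        # Total number of training processes across all workers.
        "world_size": num_workers * num_procs_per_worker,
        # GPUs available to each process inside a worker.
        "gpus_per_proc": gpus_per_worker / num_procs_per_worker,
    }


# Values taken from the example: 4 workers, 2 processes each, 2 GPUs per worker.
layout = torchrun_layout(num_workers=4, num_procs_per_worker=2, gpus_per_worker=2)
print(layout["torchrun_args"])
print(layout["world_size"])
print(layout["gpus_per_proc"])
```

With these values the job runs 8 training processes in total, and each process has one GPU, which is why `"gpu": 2` pairs naturally with `num_procs_per_worker=2`.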