distributed-training

Here are 144 public repositories matching this topic...

PaddlePaddle / Paddle

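PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle (飞桨) core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)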

python machine-learning deep-learning neural-network scalability efficiency paddlepaddle distributed-training

Updated May 23, 2024
C++

PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting a wide range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis, etc.

nlp search-engine compression sentiment-analysis transformers information-extraction question-answering llama pretrained-models embedding bert semantic-analysis distributed-training ernie neural-search uie document-intelligence paddlenlp llm

Updated May 23, 2024
Python

determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.

kubernetes data-science machine-learning deep-learning tensorflow keras pytorch hyperparameter-optimization hyperparameter-tuning hyperparameter-search distributed-training ml-infrastructure mlops ml-platform

Updated May 23, 2024
Go

DeepRec-AI / DeepRec

DeepRec is a high-performance recommendation deep learning framework based on TensorFlow. It is hosted as an incubation project in the LF AI & Data Foundation.

python search-engine machine-learning deep-learning scalability recommendation-engine advertising distributed-training

Updated May 23, 2024
C++

intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

k8s distributed-training llm-training

Updated May 23, 2024
Python

skypilot-org / skypilot

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.

Updated May 23, 2024
Python

foundation-model-stack / fms-fsdp

🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.

pytorch distributed-training llm

Updated May 22, 2024
Python

huggingface / pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNet-V3/V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

Updated May 22, 2024
Python

Hz188 / experiments

Everything is born from a simple experiment.

cmake leetcode learning-by-doing distributed-training

Updated May 23, 2024
Python

NoteDance / Note

Easily implement parallel training and distributed training. Machine learning library. The Note.neuralnetwork.tf package includes Llama2, Llama3, Gemma, CLIP, ViT, ConvNeXt, BEiT, Swin Transformer, Segformer, etc.; these models built with Note are compatible with TensorFlow and can be trained with TensorFlow.

Updated May 22, 2024
Python

l294265421 / my-llm

All about large language models

distributed-training deepspeed large-language-models chatgpt

Updated May 22, 2024

FedML-AI / FedML

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI jobs on any GPU cloud or on-premise cluster. Built on this library, TensorOpera AI (https://TensorOpera.ai) is your generative AI platform at scale.

machine-learning deep-learning inference-engine model-deployment model-serving distributed-training federated-learning mlops edge-ai ai-agent on-device-training

Updated May 23, 2024
Python

saforem2 / ezpz

Distributed training, `ezpz`.

python machine-learning launcher rich distributed-training

Updated May 21, 2024
Python

pytorch / torchx

TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.

python kubernetes components machine-learning airflow deep-learning slurm pipelines pytorch ray aws-batch distributed-training

Updated May 23, 2024
Python

aws / sagemaker-xgboost-container

This is the Docker container based on the open source framework XGBoost (https://xgboost.readthedocs.io/en/latest/) to allow customers to use their own XGBoost scripts in SageMaker.

python training aws machine-learning inference xgboost gbm distributed-training sagemaker

Updated May 20, 2024
Python

chairc / Integrated-Design-Diffusion-Model

IDDM (Industrial, landscape, animate...), supports DDPM, DDIM, PLMS, webui and multi-GPU distributed training. PyTorch implementation, generative model, diffusion model, distributed training.

distributed-computing pytorch generative-model webui industrial unet distributed-training diffusion-models ddpm plms ddim aigc

Updated May 16, 2024
Python

learning-at-home / hivemind

Decentralized deep learning in PyTorch. Built to train models on thousands of volunteers across the world.

distributed-systems machine-learning deep-learning pytorch dht neural-networks asyncio asynchronous-programming volunteer-computing hivemind distributed-training mixture-of-experts

Updated May 13, 2024
Python

HMUNACHI / nanodl

A Jax-based library for designing and training transformer models from scratch.

nlp machine-learning deep-learning transformer attention llama flax gpt attention-mechanism mistral distributed-training jax

Updated May 12, 2024
Python

Oneflow-Inc / libai

LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training

nlp deep-learning transformer large-scale data-parallelism model-parallelism distributed-training self-supervised-learning oneflow pipeline-parallelism vision-transformer

Updated May 12, 2024
Python

harinik05 / LettucifyAI

MLOps Pipeline & fine-tuned deep learning model to classify between various food items 🍎🚀

docker kubernetes distributed-systems anaconda cnn pytorch azureml nvidia-gpu distributed-training azure-devops

Updated May 4, 2024
Jupyter Notebook
