
feat: experimental python packaging and interface #1912

Closed · wants to merge 24 commits
Conversation

@drbh drbh (Collaborator) commented May 16, 2024

This draft PR explores wrapping the launcher and server Rust applications into a Python package, to make it even easier to get started with TGI. This change has many implications and may not be sustainable or practical to maintain in the long term.

In general the goal of this PR is to enable a simple dev experience fully within a Python runtime. An example API may look like:

```python
from tgi import TGI
from huggingface_hub import InferenceClient
import time

llm = TGI(model_id="google/paligemma-3b-mix-224")

# ✂️ startup logic snipped
print("Model is ready!")

client = InferenceClient("http://localhost:3000")
generated = client.text_generation("What are the main characteristics of a cat?")
print(generated)
# Cats are known for their independent nature, curious minds, and affectionate
# nature. Here are the main characteristics of a cat...

llm.close()
```
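
For context, the snipped startup logic amounts to spawning the existing text-generation-launcher binary as a subprocess and polling the router's /health endpoint until it responds. A rough sketch of that shape (the launch_and_wait helper and its defaults are illustrative, not the actual code in this branch):

```python
import subprocess
import time

import requests


def launch_and_wait(model_id: str, port: int = 3000, timeout: float = 300.0) -> subprocess.Popen:
    # text-generation-launcher is TGI's existing CLI entrypoint; 3000 is its default port
    proc = subprocess.Popen(
        ["text-generation-launcher", "--model-id", model_id, "--port", str(port)]
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # the router answers 200 on /health once the model is ready
            if requests.get(f"http://localhost:{port}/health", timeout=1).status_code == 200:
                return proc
        except requests.RequestException:
            pass
        time.sleep(1)
    proc.terminate()
    raise TimeoutError(f"server for {model_id} did not become ready within {timeout}s")
```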

Please see the tgi package README for dev instructions and how to test.

Foreseeable issues

  • packaging the Python dependencies in a reasonable way (handling kernel compilation, etc.) is difficult
  • how a Python install would work alongside, or interfere with, our current install path
  • possible complexities with graceful shutdowns (see the sketch after this list)
  • lots of dev surface area added (to be weighed against the benefits)
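
On the graceful-shutdown point, one plausible approach (purely a sketch, not what this branch implements) is the usual SIGTERM-then-SIGKILL dance on the launcher subprocess:

```python
import subprocess


def close(proc: subprocess.Popen, grace_period: float = 30.0) -> None:
    proc.terminate()  # SIGTERM first, so the launcher can stop its shard processes
    try:
        proc.wait(timeout=grace_period)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL as a last resort
        proc.wait()
```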

Opening this draft PR for visibility and feedback, any ideas/concerns/thoughts would be greatly appreciated 🙏

@fxmarty fxmarty (Collaborator) commented May 20, 2024

Would the wheel bundle the FA / paged attention / quantization kernels?

@drbh drbh (Collaborator, Author) commented May 20, 2024

@fxmarty currently this branch does not include or build any 3rd-party kernels, but in the best case it would provide an easy-to-use upgrade path, and maybe loud warnings/errors if the optimized kernels are not installed.

Curious what your thoughts are, but I think it would be best to avoid the expensive, hardware-specific build process in the default installation (to make the library as easy as possible to start with), and then provide some interface for checking and installing kernels?

At the moment all of the kernel build processes are handled via CLI commands and would require actions outside of the library to add kernels. I wonder if there's a nice way to move that logic into the library, so users could choose the kernels they want/support and build those individually, e.g. something along the lines of the sketch below...
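
To make that concrete, a hypothetical tgi.kernels-style helper could look something like this (module and function names here are assumptions for illustration, not part of this branch):

```python
import importlib.util
import warnings

# extend with whatever other optional kernels the package grows to support
OPTIONAL_KERNELS = ["flash_attn"]


def check_kernels() -> dict:
    available = {}
    for name in OPTIONAL_KERNELS:
        available[name] = importlib.util.find_spec(name) is not None
        if not available[name]:
            warnings.warn(
                f"optimized kernel '{name}' is not installed; "
                "falling back to slower default code paths"
            )
    return available
```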

@fxmarty fxmarty (Collaborator) commented May 20, 2024

I am asking because, for vllm for example, we have forks with some modifications for the NVIDIA & ROCm builds. So if somebody installs vllm from the vllm repo/pip and TGI uses vllm._C, this will not work.

There could be an external tgi-kernels package, maybe, installed with an --extra-index-url like https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#installation
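
i.e. roughly this install flow, in the style of the AutoGPTQ instructions linked above (tgi-kernels is just the name floated here, and the index URL is a placeholder):

```
# default install stays pure-Python; prebuilt kernels come from a separate wheel index
pip install tgi
pip install tgi-kernels --extra-index-url https://example.com/whl/cu121
```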

@drbh drbh (Collaborator, Author) commented May 21, 2024

@fxmarty those are great points. I think a tgi-kernels package may be the best solution. Currently I'm looking into how we might precompile an individual kernel (just flash_attn), as I think this is a prerequisite for all of the kernels, and then I'll look into conditionally including them via --extra-index-url.
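
As a strawman for the "conditionally including them" part: the selector would need to key off the local torch/CUDA/Python combination, similar to the per-environment wheel tags that flash-attn's prebuilt releases use. Illustrative only, the tag format is made up:

```python
import platform

import torch


def kernel_wheel_tag() -> str:
    # e.g. "cu121-torch2.3.0-cp310" on a CUDA 12.1 / torch 2.3.0 / Python 3.10 box
    cuda = f"cu{torch.version.cuda.replace('.', '')}" if torch.version.cuda else "cpu"
    torch_ver = torch.__version__.split("+")[0]
    major, minor, _ = platform.python_version_tuple()
    return f"{cuda}-torch{torch_ver}-cp{major}{minor}"
```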

@drbh drbh (Collaborator, Author) commented May 29, 2024

Closing this PR in favor of a smaller PR that only adds a workflow to precompile kernels. Will revisit after the precompiles are complete: #1970
