
feat: experimental python packaging and interface #1912

Closed · wants to merge 24 commits
Conversation

@drbh drbh (Collaborator) commented May 16, 2024

This draft PR explores wrapping the launcher and server Rust applications into a Python package, to make it even easier to get started with TGI. This change has many implications and may not be sustainable or practical to maintain in the long term.

In general the goal of this PR is to enable a simple dev experience fully within a Python runtime. An example API may look like:

```python
from tgi import TGI
from huggingface_hub import InferenceClient
import time

llm = TGI(model_id="google/paligemma-3b-mix-224")

# ✂️ startup logic snipped
print("Model is ready!")

client = InferenceClient("http://localhost:3000")
generated = client.text_generation("What are the main characteristics of a cat?")
print(generated)
# Cats are known for their independent nature, curious minds, and affectionate
# nature. Here are the main characteristics of a cat...

llm.close()
```
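
For context, the snipped startup logic amounts to spawning the existing text-generation-launcher binary as a subprocess and polling the router's /health endpoint until it responds. A rough sketch of that shape (the launch_and_wait helper and its defaults are illustrative, not the actual code in this branch):

```python
import subprocess
import time

import requests


def launch_and_wait(model_id: str, port: int = 3000, timeout: float = 300.0) -> subprocess.Popen:
    # text-generation-launcher is TGI's existing CLI entrypoint; 3000 is its default port
    proc = subprocess.Popen(
        ["text-generation-launcher", "--model-id", model_id, "--port", str(port)]
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # the router answers 200 on /health once the model is ready
            if requests.get(f"http://localhost:{port}/health", timeout=1).status_code == 200:
                return proc
        except requests.RequestException:
            pass
        time.sleep(1)
    proc.terminate()
    raise TimeoutError(f"server for {model_id} did not become ready within {timeout}s")
```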

Please see the tgi package README for dev instructions and how to test.

Foreseeable issues

  • packaging the Python dependencies in a reasonable way (handling kernel compilation, etc.) is difficult
  • how a Python install would work alongside, or interfere with, our current install path
  • possible complexities with graceful shutdowns (see the sketch after this list)
  • lots of dev surface area added (to be weighed against the benefits)
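
On the graceful-shutdown point, one plausible approach (purely a sketch, not what this branch implements) is the usual SIGTERM-then-SIGKILL dance on the launcher subprocess:

```python
import subprocess


def close(proc: subprocess.Popen, grace_period: float = 30.0) -> None:
    proc.terminate()  # SIGTERM first, so the launcher can stop its shard processes
    try:
        proc.wait(timeout=grace_period)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL as a last resort
        proc.wait()
```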

Opening this draft PR for visibility and feedback, any ideas/concerns/thoughts would be greatly appreciated 🙏

@fxmarty fxmarty (Collaborator) commented May 20, 2024

Would the wheel bundle the FA / paged attention / quantization kernels?

@drbh drbh (Collaborator, Author) commented May 20, 2024

@fxmarty currently this branch does not include or build any 3rd-party kernels, but in the best case it would provide an easy-to-use upgrade path, and maybe loud warnings/errors if the optimized kernels are not installed.

Curious what your thoughts are, but I think it would be best to avoid the expensive, hardware-specific build process in the default installation (to make the library as easy as possible to start with), and then provide some interface for checking and installing kernels?

At the moment all of the kernel build processes are handled via CLI commands and would require actions outside of the library to add kernels. I wonder if there's a nice way to move that logic into the library, so users could choose the kernels they want/support and build those individually, e.g. something along the lines of the sketch below...
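
To make that concrete, a hypothetical tgi.kernels-style helper could look something like this (module and function names here are assumptions for illustration, not part of this branch):

```python
import importlib.util
import warnings

# extend with whatever other optional kernels the package grows to support
OPTIONAL_KERNELS = ["flash_attn"]


def check_kernels() -> dict:
    available = {}
    for name in OPTIONAL_KERNELS:
        available[name] = importlib.util.find_spec(name) is not None
        if not available[name]:
            warnings.warn(
                f"optimized kernel '{name}' is not installed; "
                "falling back to slower default code paths"
            )
    return available
```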

@fxmarty fxmarty (Collaborator) commented May 20, 2024

I am asking because, for vllm for example, we have forks with some modifications for the NVIDIA & ROCm builds. So if somebody installs vllm from the vllm repo/pip and TGI uses vllm._C, this will not work.

There could be an external tgi-kernels package, maybe, installed with an --extra-index-url like https://github.com/AutoGPTQ/AutoGPTQ?tab=readme-ov-file#installation
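
i.e. roughly this install flow, in the style of the AutoGPTQ instructions linked above (tgi-kernels is just the name floated here, and the index URL is a placeholder):

```
# default install stays pure-Python; prebuilt kernels come from a separate wheel index
pip install tgi
pip install tgi-kernels --extra-index-url https://example.com/whl/cu121
```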

@drbh drbh (Collaborator, Author) commented May 21, 2024

@fxmarty those are great points. I think a tgi-kernels package may be the best solution. Currently I'm looking into how we might precompile an individual kernel (just flash_attn), as I think this is a prerequisite for all of the kernels, and then I'll look into conditionally including them via --extra-index-url.
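
As a strawman for the "conditionally including them" part: the selector would need to key off the local torch/CUDA/Python combination, similar to the per-environment wheel tags that flash-attn's prebuilt releases use. Illustrative only, the tag format is made up:

```python
import platform

import torch


def kernel_wheel_tag() -> str:
    # e.g. "cu121-torch2.3.0-cp310" on a CUDA 12.1 / torch 2.3.0 / Python 3.10 box
    cuda = f"cu{torch.version.cuda.replace('.', '')}" if torch.version.cuda else "cpu"
    torch_ver = torch.__version__.split("+")[0]
    major, minor, _ = platform.python_version_tuple()
    return f"{cuda}-torch{torch_ver}-cp{major}{minor}"
```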

@drbh drbh (Collaborator, Author) commented May 29, 2024

Closing this PR in favor of a smaller PR that only adds a workflow to precompile kernels. Will revisit after the precompiles are complete: #1970
