
Fallback to Flash Attention v1 for pre-Ampere GPUs #440

Open
tgaddair opened this issue Apr 26, 2024 · 1 comment · May be fixed by #480
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments

tgaddair (Contributor) commented Apr 26, 2024
We can add back the FA1 implementation from huggingface/text-generation-inference#624 when a compute capability of Volta or Turing is detected. Supporting both may bloat the Docker image somewhat, but this is a common user pain point that we should definitely address.
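A minimal sketch of how the fallback could be gated on compute capability, assuming a PyTorch runtime is available; the FA1/FA2 import paths below are illustrative placeholders, not the project's actual module layout:

```python
import torch


def supports_flash_attention_v2(device_index: int = 0) -> bool:
    """Return True if the GPU is Ampere (SM 8.0) or newer.

    Flash Attention v2 requires compute capability >= 8.0; Volta (7.0)
    and Turing (7.5) GPUs can only run the v1 kernels.
    """
    major, _minor = torch.cuda.get_device_capability(device_index)
    return major >= 8


# Hypothetical dispatch -- module names are illustrative only.
if supports_flash_attention_v2():
    from flash_attn import flash_attn_varlen_func as attention  # FA2 kernels
else:
    # Fall back to the FA1 implementation restored from
    # huggingface/text-generation-inference#624.
    from flash_attn_v1 import attention  # assumed FA1 module name
```

With a check like this done once at server startup, the rest of the attention code can stay agnostic to which kernel was selected.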

tgaddair added the enhancement and good first issue labels on Apr 26, 2024
N1RM4L13 commented May 7, 2024

@tgaddair I would like to contribute to this.

flozi00 linked a pull request on May 21, 2024 that will close this issue