start porting latest tgi #480

flozi00 · 2024-05-20T20:01:00Z

What does this PR do?

@tgaddair this is just for you to track progress for now; please do not merge at the moment.

This PR also introduces FP8 Linear and the FP8 KV cache from vLLM.

Fixes # (issue)

Before submitting

- This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- Was this discussed/approved via a GitHub issue or the Discord / Slack channel? Please add a link to it if that's the case.
- Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

flozi00 · 2024-05-22T10:06:18Z

Mistral + EETQ tested and working

flozi00 · 2024-05-22T10:32:49Z

Llama tested too

flozi00 · 2024-05-22T10:39:00Z

Benchmark vs Main branch:
{"input_tokens_per_second": 14643, "output_tokens_per_second": 218} -- main
{"input_tokens_per_second": 15003, "output_tokens_per_second": 236} -- This PR
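
To put the two results above in relative terms, here is a minimal sketch that computes the percent change of this PR against main. The values are copied from the lines above; the helper name `relative_change_percent` is hypothetical and only for illustration. These appear to be single measurements, so the percentages say nothing about run-to-run variance.

```python
import json

# Benchmark results copied verbatim from the comment above.
main_result = json.loads('{"input_tokens_per_second": 14643, "output_tokens_per_second": 218}')
pr_result = json.loads('{"input_tokens_per_second": 15003, "output_tokens_per_second": 236}')


def relative_change_percent(baseline, candidate):
    """Return the percent change of candidate relative to baseline (hypothetical helper)."""
    return (candidate - baseline) / baseline * 100.0


# Compare each metric reported for main against the same metric for this PR.
changes = {
    key: round(relative_change_percent(main_result[key], pr_result[key]), 2)
    for key in main_result
}

for key, pct in changes.items():
    print(f"{key}: {pct:+.2f}%")
```

Under these numbers, input throughput improves by roughly 2.5% and output throughput by roughly 8.3%.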

flozi00 · 2024-05-22T10:56:41Z

AWQ tested
Sharding tested

tgaddair

Nice! LGTM. Can land and then test it out further on main.

flozi00 and others added 16 commits May 20, 2024 22:00

start porting latest tgi

25bd2b6

awq

1337aa5

refactor attention

8449a94

more layers

e446812

docker

ce4417b

rm awq from docker

262b5f4

ruff

48aecdc

fix docker awq

fe1a883

fix vllm make

679f338

dockerfile

5f0d8b2

ttf imports

5d4848d

ruff

893a0ff

fix layers

9caaf0f

ruff

0791dfd

Update __init__.py

573fb32

Update __init__.py

fe65875

flozi00 linked an issue May 21, 2024 that may be closed by this pull request

Fallback to Flash Attention v1 for pre-Ampere GPUs #440

Closed

flozi00 added 12 commits May 22, 2024 07:34

ruff

d32e6a9

imports

f420eee

imports

1bb9030

ruff

357fe3d

fix

8712ec2

import

629986e

import

78519ea

linear layer import in layers.py

f207d40

add fp8, update FA

9649fad

fix FA2 check

98ac61c

paged attention

6d41fbf

fix llama PA

a20bc03

flozi00 marked this pull request as ready for review May 22, 2024 10:32

awq typo

ab80c15

flozi00 linked an issue May 22, 2024 that may be closed by this pull request

Quantized KV Cache #483

Closed

flozi00 requested a review from tgaddair May 22, 2024 10:56

tgaddair added 3 commits May 24, 2024 15:40

Revert build.yaml

feb06be

Added back channels

503af2a

Revert

dae793e

tgaddair approved these changes May 24, 2024

tgaddair merged commit a2ca687 into main May 24, 2024
3 checks passed

tgaddair deleted the synctgi branch May 24, 2024 22:52
