Releases · predibase/lorax

23 May 16:55

tgaddair

v0.10.0

bd7db80

v0.10.0: Speculative decoding adapters and SGMV + BGMV Latest

Latest

🎉 Enhancements

Added support for Medusa speculative decoding adapters by @tgaddair in #372
Added Medusa adapters per request by @tgaddair in #454
Support jointly trained Medusa + LoRA adapters by @tgaddair in #482
Adds prompt lookup decoding (ngram speculation) by @tgaddair in #375
Use SGMV for prefill BGMV for decode by @tgaddair in #464
Added phi3 by @tgaddair in #445
Added support for C4AI Command-R (cohere) by @tgaddair in #411
Add DBRX by @tgaddair in #423
Refactor adapter interface to support adapters other than LoRA (e.g., speculative decoding) by @tgaddair in #359
Initializing server with an adapter sets it as the default by @tgaddair in #370
Implement Seed Parameter Support for OpenAI-Compatible API Endpoints by @GirinMan in #374
lorax launcher now has --default-adapter-source by @noyoshi in #419
enh: Make client's handling of error responses more robust and user-friendly by @jeffreyftang in #418
Support both medusa v1 and v2 by @tgaddair in #421
use default HF HUB token when checking for base model info by @noyoshi in #428
Added adapter_source and api_token to completions API by @tgaddair in #446
Increase max stop sequences by @tgaddair in #453
Support LORAX_USE_GLOBAL_HF_TOKEN by @tgaddair in #462
Allow setting temperature=0 by @tgaddair in #467
Merge medusa segments by @tgaddair in #471

🐛 Bugfixes

Fix CUDA compile when using long sequence lengths by @tgaddair in #363
Fix CUDA graph compile with speculative decoding by @tgaddair in #381
Fix mixtral for speculative decoding by @tgaddair in #382
Fix import of EntryNotFoundError by @tgaddair in #401
Fix warmup when using spculative decoding by @tgaddair in #402
fix: assign bias directly by @thincal in #398
fix: Enable ignoring botocore ClientError during download_file by @jeffreyftang in #404
Fix Pydantic v2 adapter_id and merged_adapters validation by @claudioMontanari in #408
fix: Suppress pydantic warning over model_id field in DeployedModel by @jeffreyftang in #409
Fix phi by @noyoshi in #410
fix: Missing / in pbase endpoint by @jeffreyftang in #415
Print correct number of key value heads on dimension assertion. by @dstripelis in #414
Fix request variable by @Infernaught in #416
fix: Rename _get_slice to get_slice by @tgaddair in #424
fix: Hack for llama3 eos_token_id by @tgaddair in #427
fix: checking the base_model_name_or_path of adapter_config and early return if null by @thincal in #431
fix: use logits to calculate alternative tokens by @JTS22 in #425
Fixed default pbase endpoint url by @tgaddair in #435
fix: Downloading private adapters from HF by @tgaddair in #443
Fix Outlines compatibility with speculative decoding by @tgaddair in #447
fix: Handle edge case where allowed tokens are out of bounds by @tgaddair in #449
Fix special tokens showing up in the response by @tgaddair in #450
Fix Medusa + LoRA by @tgaddair in #455
Ensure Llama 3 stops on all EOS tokens by @arnavgarg1 in #456
Reuse session per class instance by @gyanesh-mishra in #468

📝 Docs

Fix chat completion and docs by @GirinMan in #358
Added batch processing example by @tgaddair in #386
Medusa docs by @tgaddair in #459
Updated supported base models in docs by @arnavgarg1 in #458
Docs for private HF models by @tgaddair in #460
Auth header docs by @tgaddair in #461

🔧 Maintenance

Add CNAME file for Docs by @martindavis in #364
Update tagging logic and add flake8 linter by @magdyksaleh in #365
Apply black formatting by @tgaddair in #376
Switch formatting and linting to ruff by @tgaddair in #378
Style: change line length to 120 and enforce import sort order by @tgaddair in #383
Bump pydantic version to >2, <3 by @claudioMontanari in #405
refactor: set config into weights for quantization feature support more easily by @thincal in #400
Update Predibase integration to support v2 API by @jeffreyftang in #403
logging by @magdyksaleh in #436
revert by @magdyksaleh in #437
Upgrade to CUDA 12.1 and PyTorch 2.3.0 by @tgaddair in #472
int: Bump Lorax Client to 3.9 by @gyanesh-mishra in #486
Bump lorax client v0.6.0 by @tgaddair in #488

New Contributors

@GirinMan made their first contribution in #358
@martindavis made their first contribution in #364
@thincal made their first contribution in #398
@claudioMontanari made their first contribution in #405
@dstripelis made their first contribution in #414

Full Changelog: v0.9.0...v0.10.0

Contributors

jeffreyftang, thincal, and 11 other contributors

Assets 2

23 Mar 00:10

tgaddair

v0.9.0

8ff0bf5

v0.9.0

🎉 Enhancements

Allow assigning dedicated memory reservation for adapters on GPU by @tgaddair in #303
Enforce adapters cannot be loaded past --adapter-memory-fraction by @tgaddair in #306
Added Qwen2 by @tgaddair in #327
Make max_new_tokens optional, default to max_total_tokens - input_length by @tgaddair in #353
Expose ignore_eos_token option in generate requests by @jeffreyftang in #340
Generate to max_total_tokens during warmup by @tgaddair in #286
Add support for returning alternative tokens by @JTS22 in #297
feat: add repetition_penalty and top_k to openai by @huytuong010101 in #288
Add support for LoRA adapters trained with Rank-Stabilized scaling by @arnavgarg1 in #299
Provide more granular methods to configure the embedded S3 client. by @mitchklusty in #325
Allow specifying base model as model param in OpenAI API by @tgaddair in #331
Add ignore_eos_token param to completions and chat completions endpoints by @jeffreyftang in #344
Log whether SGMV kernel is enabled by @tgaddair in #342
Log generated tokens out to file when streaming by @magdyksaleh in #309

🐛 Bugfixes

Fix tensor parallelism with SGMV to use true rank of the LoRA after splitting by @tgaddair in #324
Fix hanging caused by tqdm stderr not being printed by @tgaddair in #352
Fix dynamic RoPE by @tgaddair in #350
Only update cache during warmup by @tgaddair in #351
Prevent model loading errors from appearing as flash attention import errors by @tgaddair in #328
Make architecture compatibility check non-fatal if base model config cannot be loaded by @tgaddair in #317
Fix Qwen2 LoRA loading by @tgaddair in #345
Remove vec wrapping from OpenAI-compatible response by @jeffreyftang in #273
Disallow early stopping during warmup by @tgaddair in #290
Skip returning EOS token on finish_reason 'stop' by @jeffreyftang in #289
Fixed static adapter loading with same arch by @tgaddair in #300
Ensure model_id is a string when using a model from s3 by @fadebek in #291
Fix name for adapter id by @noyoshi in #284
Update AsyncClient with ignore_eos_token parameter by @jeffreyftang in #341

📝 Docs

Update docs now that we no longer return a list from OpenAI-compatible endpoints by @jeffreyftang in #281
Change guided generation to structured generation by @jeffreyftang in #302
Clarify getting started documentation regarding port number used in pre-built Docker image. by @alexsherstinsky in #313
Added system requirements to README by @tgaddair in #293
Update README.md by @tgaddair in #294

🔧 Maintenance

Split out server and router unit tests by @tgaddair in #275
Add in response headers to streaming endpoint by @noyoshi in #282
Propagate bearer token from header if one exists for OpenAI-compatible endpoints by @jeffreyftang in #278
Update tokenizers to v0.15 to be consistent with server by @tgaddair in #285
Autogen python client docs by @tgaddair in #295
Reporting on total tokens by @noyoshi in #349

New Contributors

@huytuong010101 made their first contribution in #288
@fadebek made their first contribution in #291
@JTS22 made their first contribution in #297
@alexsherstinsky made their first contribution in #313
@mitchklusty made their first contribution in #325

Full Changelog: v0.8.1...v0.9.0

Contributors

alexsherstinsky, jeffreyftang, and 8 other contributors

Assets 2

21 Feb 22:28

tgaddair

v0.8.1

a3b865d

v0.8.1: Gemma support

🎉 Enhancements

Added Gemma by @tgaddair in #267
Pass details param into client by @magdyksaleh in #265

🔧 Maintenance

bump version by @magdyksaleh in #268
Bump by @magdyksaleh in #270

Full Changelog: v0.8.0...v0.8.1

Contributors

tgaddair and magdyksaleh

Assets 2

20 Feb 23:47

tgaddair

v0.8.0

dd68924

v0.8: Structured Output via Outlines

🎉 Enhancements

Added Outlines logits processor for JSON schema validation by @tgaddair in #224
Enable JSON guided generation via OpenAI-compatible API by @jeffreyftang in #243
JSON schema for guided generation now optionally respects field order by @jeffreyftang in #264
Set default adapter source by @magdyksaleh in #223
Pad LoRA ranks to ensure compatibility with SGMV kernel by @tgaddair in #256
Add model and adapter response headers by @magdyksaleh in #220
Add Cors params by @magdyksaleh in #221
Add expose headers by @magdyksaleh in #230

🐛 Bugfixes

Properly split out model_id when retrieving adapter weights downloaded from S3 by @jeffreyftang in #246
Fixed TIES merging to calculate sign before applying weights by @tgaddair in #239
Update s3.py by @llama-shepard in #234
Fix concatenate for flash batch by @tgaddair in #254
Fixed batch merging and filtering to handle Outlines state by @tgaddair in #263

📝 Docs

Add guide for guided generation by @jeffreyftang in #240
Added contributing guide by @tgaddair in #226
Update README to include model merging by @tgaddair in #225
Updated structured output by @tgaddair in #258
Minor corrections to development env setup instructions by @jeffreyftang in #228

🔧 Maintenance

Upgrade docker to use rust 1.75 and ubuntu 22.04 by @tgaddair in #250
Upgrading rust for dependency changes by @DhruvaBansal00 in #248
fix paths on runner by @noyoshi in #242

New Contributors

@jeffreyftang made their first contribution in #228
@DhruvaBansal00 made their first contribution in #248

Full Changelog: v0.7.0...v0.8.0

Contributors

jeffreyftang, tgaddair, and 4 other contributors

Assets 2

01 Feb 22:08

tgaddair

v0.7.0

56dc6e2

v0.7: LoRA Merging (linear, TIES, DARE) per request

🎉 Enhancements

Merge multiple LoRA adapters per request (linear, TIES, DARE) by @tgaddair in #212
Eetq by @flozi00 in #195
hqq JIT Quantization by @flozi00 in #147
Added Bloom dynamic adapter loading by @tgaddair in #187
Added pbase adapter_source and expose api_token in client by @tgaddair in #181
Cloudflare R2 Source by @llama-shepard in #198

🐛 Bugfixes

Fixed Phi for new HF format by @tgaddair in #192
Fixed OpenAI stream response data by @tgaddair in #193
fix: OpenAI response format by @tgaddair in #184
Fix RoPE and YARN scaling by @tgaddair in #202
check for base model earlier in the adapter function by @noyoshi in #196

📝 Docs

Updated quantization docs by @tgaddair in #206

🔧 Maintenance

Upgrade to pytorch==2.2.0 by @tgaddair in #217
upgrade exllama kernel by @flozi00 in #209
Add a model cache to avoid running out of storage by @magdyksaleh in #201

New Contributors

@llama-shepard made their first contribution in #198

Full Changelog: v0.6.0...v0.7.0

Contributors

tgaddair, magdyksaleh, and 3 other contributors

Assets 2

10 Jan 19:38

tgaddair

v0.6.0

64739ad

v0.6: OpenAI compatible API

🎉 Enhancements

OpenAI v1 Completions API by @tgaddair in #170
OpenAI v1 Chat Completions API by @tgaddair in #171
Added prompt_tokens to the response by @tgaddair in #165

🐛 Bugfixes

fix: Handle NaN values during weight conversion by @tgaddair in #168

📝 Docs

docs: OpenAI compatible API by @tgaddair in #174

🔧 Maintenance

fix: Only install stanford-stk on linux by @tgaddair in #169
added separate installation for torch by @asingh9530 in #173

New Contributors

@asingh9530 made their first contribution in #173

Full Changelog: v0.5.0...v0.6.0

Contributors

tgaddair and asingh9530

Assets 2

08 Jan 17:14

tgaddair

v0.5.0

57d5470

v0.5: CUDA graph compilation

🎉 Enhancements

CUDA graph compilation by @tgaddair in #154

🐛 Bugfixes

Fixed deadlock in sgmv_shrink kernel caused by imbalanced segments by @tgaddair in #156
Fixed loading adapter from absolute s3 path by @tgaddair in #161

📝 Docs

Update client docs with new endpoint source by @abidwael in #126
Update client docs with new endpoint source by @abidwael in #146

🔧 Maintenance

Reduce Docker size by removing duplicate torch install by @tgaddair in #144
remove CACHE_MANAGER in flash_causal_lm.py by @michaelfeil in #157

New Contributors

@michaelfeil made their first contribution in #157

Full Changelog: v0.4.1...v0.5.0

Contributors

tgaddair, michaelfeil, and abidwael

Assets 2

18 Dec 19:53

tgaddair

v0.4.1

9ae65b3

v0.4.1

🐛 Bugfixes

fix: Phi LoRA loading by @tgaddair in #136
fix: Triton usage for GPT-Q by @tgaddair in #140

🔧 Maintenance

Optimize SGMV kernel code path to reduce mallocs by @tgaddair in #139
fix sync script to account for subfolder bucket paths by @noyoshi in #135

Full Changelog: v0.4.0...v0.4.1

Contributors

tgaddair and noyoshi

Assets 2

15 Dec 18:15

tgaddair

v0.4.0

ce99dbf

v0.4.0

🎉 Enhancements

Mixtral by @flozi00 in #122
Added Phi by @tgaddair in #132
add support for H100s by @thelinuxkid in #111
upgrade to py 3.10 by @flozi00 in #121
Add predibase as a source for adapters by @magdyksaleh in #125
enh: Add soci indexing to allow Lazy loading of LoRAX images by @gyanesh-mishra in #95

🐛 Bugfixes

fix: Set Mistral sliding window to max position embeddings when None by @tgaddair in #128
Fix Qwen tensor parallelism by @tgaddair in #120
fix: Llama AWQ with GQA by @tgaddair in #114
fix: Mixtral adapter loading wraps lm_head by @tgaddair in #131

📝 Docs

Add Skypilot example and getting started guide by @tgaddair in #117
docs: fix broken link by @Fluder-Paradyne in #133
Added Mixtral and Phi to docs by @tgaddair in #134

🔧 Maintenance

Increase default client timeout to 60s by @tgaddair in #119
Make transpose contiguous for fan-in-fan-out by @tgaddair in #129
remove lorax env var by @geoffreyangus in #113

New Contributors

@gyanesh-mishra made their first contribution in #95
@thelinuxkid made their first contribution in #111
@Fluder-Paradyne made their first contribution in #133

Full Changelog: v0.3.0...v0.4.0

Contributors

thelinuxkid, tgaddair, and 5 other contributors

Assets 2

07 Dec 18:56

tgaddair

v0.3.0

bb950cc

v0.3.0

What's Changed

Enhancements

Add AWQ quantization by @flozi00 in #102
Add support for Qwen by @tgaddair in #103
Add Flash GPT2 by @geoffreyangus in #93
LoRAX-compatible GPT-2 by @geoffreyangus in #109

Bugfixes

decrease the max batch total tokens manually by @flozi00 in #89
Added --max-active-adapters to launcher by @tgaddair in #96
fix gptq fp16 inference by @flozi00 in #104
fix static adapter merge by @geoffreyangus in #106

Maintenance

Update values.yaml tag to always use the latest image by @arnavgarg1 in #87
Update chart version by @abidwael in #88
Warn if there are unused weights in the adapter by @tgaddair in #105
docs: Added client docs for connecting to Predibase endpoints by @tgaddair in #98
Generalized layer types and row parallel split logic by @tgaddair in #110
Mkdocs by @tgaddair in #112

New Contributors

@arnavgarg1 made their first contribution in #87

Full Changelog: v0.2.1...v0.3.0

Contributors

tgaddair, geoffreyangus, and 3 other contributors

Assets 2

Releases: predibase/lorax

v0.10.0: Speculative decoding adapters and SGMV + BGMV

🎉 Enhancements

🐛 Bugfixes

📝 Docs

🔧 Maintenance

New Contributors

Contributors

v0.9.0

🎉 Enhancements

🐛 Bugfixes

📝 Docs

🔧 Maintenance

New Contributors

Contributors

v0.8.1: Gemma support

🎉 Enhancements

🔧 Maintenance

Contributors

v0.8: Structured Output via Outlines

🎉 Enhancements

🐛 Bugfixes

📝 Docs

🔧 Maintenance

New Contributors

Contributors

v0.7: LoRA Merging (linear, TIES, DARE) per request

🎉 Enhancements

🐛 Bugfixes

📝 Docs

🔧 Maintenance

New Contributors

Contributors

v0.6: OpenAI compatible API

🎉 Enhancements

🐛 Bugfixes

📝 Docs

🔧 Maintenance

New Contributors

Contributors

v0.5: CUDA graph compilation

🎉 Enhancements

🐛 Bugfixes

📝 Docs

🔧 Maintenance

New Contributors

Contributors

v0.4.1

🐛 Bugfixes

🔧 Maintenance

Contributors

v0.4.0

🎉 Enhancements

🐛 Bugfixes

📝 Docs

🔧 Maintenance

New Contributors

Contributors

v0.3.0

What's Changed

Enhancements

Bugfixes

Maintenance

New Contributors

Contributors