v0.9.0

Released by @tgaddair on 23 Mar · 67 commits to main since this release · commit 8ff0bf5

🎉 Enhancements

  • Allow assigning a dedicated memory reservation for adapters on the GPU by @tgaddair in #303
  • Enforce that adapters cannot be loaded past --adapter-memory-fraction by @tgaddair in #306
  • Add support for Qwen2 by @tgaddair in #327
  • Make max_new_tokens optional, defaulting to max_total_tokens - input_length by @tgaddair in #353
  • Expose ignore_eos_token option in generate requests by @jeffreyftang in #340 (see the generate sketch after this list)
  • Generate to max_total_tokens during warmup by @tgaddair in #286
  • Add support for returning alternative tokens by @JTS22 in #297
  • Add repetition_penalty and top_k to the OpenAI-compatible API by @huytuong010101 in #288
  • Add support for LoRA adapters trained with Rank-Stabilized scaling by @arnavgarg1 in #299
  • Provide more granular methods to configure the embedded S3 client by @mitchklusty in #325
  • Allow specifying the base model as the model param in the OpenAI API by @tgaddair in #331
  • Add ignore_eos_token param to completions and chat completions endpoints by @jeffreyftang in #344 (see the OpenAI sketch after this list)
  • Log whether SGMV kernel is enabled by @tgaddair in #342
  • Log generated tokens to a file when streaming by @magdyksaleh in #309
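
Taken together, the optional max_new_tokens (#353) and the new ignore_eos_token flag (#340) change the shape of a plain generate request. Below is a minimal generate sketch, assuming a LoRAX server on localhost:8080 and a TGI-style /generate payload; the exact endpoint path and parameter placement are assumptions based on the PR titles:

```python
import requests

# Assumed local deployment; adjust the host and port for your setup.
LORAX_URL = "http://localhost:8080/generate"

payload = {
    "inputs": "Explain rank-stabilized LoRA scaling in one sentence.",
    "parameters": {
        # max_new_tokens is omitted: per #353 it now defaults to
        # max_total_tokens - input_length instead of being required.
        "ignore_eos_token": True,  # keep generating past EOS (#340)
    },
}

response = requests.post(LORAX_URL, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["generated_text"])
```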
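
The OpenAI-compatibility changes also compose: the base model can be passed as the model param (#331), while repetition_penalty, top_k (#288), and ignore_eos_token (#344) ride along as extra parameters. Here is an OpenAI sketch using the openai Python client; the /v1 base path, the extra_body field names, and the example model ID are assumptions:

```python
from openai import OpenAI

# Assumed endpoint for the OpenAI-compatible API. The api_key is sent as a
# bearer token, which the router forwards when one is present (#278).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="local-token")

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",  # base model as model param (#331)
    messages=[{"role": "user", "content": "Write a haiku about LoRA adapters."}],
    extra_body={
        "repetition_penalty": 1.1,  # exposed in #288
        "top_k": 40,                # exposed in #288
        "ignore_eos_token": False,  # exposed in #344
    },
)
print(completion.choices[0].message.content)
```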

🐛 Bugfixes

  • Fix tensor parallelism with SGMV to use true rank of the LoRA after splitting by @tgaddair in #324
  • Fix hanging caused by tqdm stderr not being printed by @tgaddair in #352
  • Fix dynamic RoPE by @tgaddair in #350
  • Only update cache during warmup by @tgaddair in #351
  • Prevent model loading errors from appearing as flash attention import errors by @tgaddair in #328
  • Make architecture compatibility check non-fatal if base model config cannot be loaded by @tgaddair in #317
  • Fix Qwen2 LoRA loading by @tgaddair in #345
  • Remove vec wrapping from OpenAI-compatible response by @jeffreyftang in #273
  • Disallow early stopping during warmup by @tgaddair in #290
  • Skip returning EOS token on finish_reason 'stop' by @jeffreyftang in #289
  • Fix static adapter loading with the same arch by @tgaddair in #300
  • Ensure model_id is a string when using a model from S3 by @fadebek in #291
  • Fix name for adapter ID by @noyoshi in #284
  • Update AsyncClient with ignore_eos_token parameter by @jeffreyftang in #341 (see the sketch after this list)
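
The AsyncClient update in #341 makes the same flag reachable from the Python client. A minimal sketch, assuming the lorax package exposes AsyncClient and that generate accepts ignore_eos_token as a keyword argument (both assumptions are based on the PR title):

```python
import asyncio

from lorax import AsyncClient  # assumed import path for the LoRAX Python client


async def main() -> None:
    client = AsyncClient("http://localhost:8080")
    # ignore_eos_token was added to the client in #341 and is passed
    # through to the server alongside the usual generation parameters.
    response = await client.generate(
        "Summarize SGMV in one sentence.",
        max_new_tokens=64,
        ignore_eos_token=True,
    )
    print(response.generated_text)


asyncio.run(main())
```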

🔧 Maintenance

  • Split out server and router unit tests by @tgaddair in #275
  • Add response headers to the streaming endpoint by @noyoshi in #282
  • Propagate bearer token from header if one exists for OpenAI-compatible endpoints by @jeffreyftang in #278
  • Update tokenizers to v0.15 to be consistent with server by @tgaddair in #285
  • Auto-generate Python client docs by @tgaddair in #295
  • Add reporting on total tokens by @noyoshi in #349

Full Changelog: v0.8.1...v0.9.0