
Releases: vllm-project/vllm

v0.4.2

05 May 04:31
c7f2cf2

Highlights

Features

Models and Enhancements

Dependency Upgrade

  • Upgrade to torch==2.3.0 (#4454)
  • Upgrade to tensorizer==2.9.0 (#4467)
  • Expansion of AMD test suite (#4267)

Progress and Dev Experience

What's Changed


v0.4.1

24 Apr 02:28
468d761

Highlights

Features

  • Support and enhance CommandR+ (#3829), minicpm (#3893), Meta Llama 3 (#4175, #4182), Mixtral 8x22b (#4073, #4002)
  • Support private model registration and update our model support policy (#3871, #3948)
  • Support PyTorch 2.2.1 and Triton 2.2.0 (#4061, #4079, #3805, #3904, #4271)
  • Add option for using LM Format Enforcer for guided decoding (#3868)
  • Add option to skip initializing the tokenizer and detokenizer (#3748)
  • Add option to load models using tensorizer (#3476)
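
As a rough illustration of the guided decoding option above, the sketch below builds a request body for the OpenAI-compatible completions endpoint. The field names `guided_json` and `guided_decoding_backend`, the backend value `lm-format-enforcer`, and the model name are assumptions inferred from the PR titles in this release, not something these notes confirm; check the server documentation for the exact spelling before relying on them.

```python
import json


def build_guided_request(prompt, schema, backend="lm-format-enforcer"):
    """Return a JSON string for a completions request that asks for guided decoding."""
    body = {
        "model": "my-model",                 # hypothetical model name
        "prompt": prompt,
        "max_tokens": 64,
        "guided_json": schema,               # assumed field: JSON schema the output must follow
        "guided_decoding_backend": backend,  # assumed field: selects the guided decoding backend
    }
    return json.dumps(body)


schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
    "required": ["name"],
}
payload = build_guided_request("Describe a user as JSON:", schema)
print(payload)
```

The resulting string can be sent as the body of a POST request with any HTTP client; only the payload construction is shown here.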

Enhancements

Hardware

  • Intel CPU inference backend is added (#3993, #3634)
  • AMD backend is enhanced with Triton kernel and e4m3fn KV cache (#3643, #3290)
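
For readers unfamiliar with the e4m3fn format mentioned above: it is an 8-bit float with 4 exponent bits and 3 mantissa bits, no infinities, and a largest finite value of 448. The sketch below simulates rounding to that grid in plain Python and pairs it with a single scale factor derived from the largest absolute value. The function names and the per-tensor scaling scheme are illustrative assumptions, not the kernel code from #3290.

```python
import math

E4M3FN_MAX = 448.0  # largest finite value representable in e4m3fn


def round_to_e4m3fn(x):
    """Round a float to the nearest e4m3fn-representable value, saturating at +/-448."""
    if x == 0.0 or math.isnan(x):
        return x
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3FN_MAX)
    _, ex = math.frexp(a)   # a = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, -6)     # -6 is the smallest normal exponent; below it the grid is subnormal
    step = 2.0 ** (e - 3)   # 3 mantissa bits give 8 steps per power of two
    return sign * min(round(a / step) * step, E4M3FN_MAX)


def quantize_with_scale(values):
    """Scale values so the largest magnitude maps to 448, then round to e4m3fn."""
    amax = max(abs(v) for v in values)
    scale = amax / E4M3FN_MAX if amax > 0 else 1.0
    return [round_to_e4m3fn(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Recover approximate original values from quantized values and the scale."""
    return [q * scale for q in quantized]


quantized, scale = quantize_with_scale([0.5, -2.0, 1.0])
restored = dequantize(quantized, scale)
print(quantized, restored)
```

The scale is what lets a narrow 8-bit range cover tensors whose magnitudes differ widely; without it, values above 448 would saturate and small values would lose most of their precision.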

What's Changed

  • [Kernel] Layernorm performance optimization by @mawong-amd in #3662
  • [Doc] Update installation doc for build from source and explain the dependency on torch/cuda version by @youkaichao in #3746
  • [CI/Build] Make Marlin Tests Green by @robertgshaw2-neuralmagic in #3753
  • [Misc] Minor fixes in requirements.txt by @WoosukKwon in #3769
  • [Misc] Some minor simplifications to detokenization logic by @njhill in #3670
  • [Misc] Fix Benchmark TTFT Calculation for Chat Completions by @ywang96 in #3768
  • [Speculative decoding 4/9] Lookahead scheduling for speculative decoding by @cadedaniel in #3250
  • [Misc] Add support for new autogptq checkpoint_format by @Qubitium in #3689
  • [Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup by @cadedaniel in #3783
  • [Hardware][Intel] Add CPU inference backend by @bigPYJ1151 in #3634
  • [HotFix] [CI/Build] Minor fix for CPU backend CI by @bigPYJ1151 in #3787
  • [Frontend][Bugfix] allow using the default middleware with a root path by @A-Mahla in #3788
  • [Doc] Fix vLLMEngine Doc Page by @ywang96 in #3791
  • [CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build by @youkaichao in #3801
  • Fix crash when try torch.cuda.set_device in worker by @leiwen83 in #3770
  • [Bugfix] Add __init__.py files for vllm/core/block/ and vllm/spec_decode/ by @mgoin in #3798
  • [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary by @youkaichao in #3803
  • [Speculative decoding] Adding configuration object for speculative decoding by @cadedaniel in #3706
  • [BugFix] Use different mechanism to get vllm version in is_cpu() by @njhill in #3804
  • [Doc] Update README.md by @robertgshaw2-neuralmagic in #3806
  • [Doc] Update contribution guidelines for better onboarding by @michaelfeil in #3819
  • [3/N] Refactor scheduler for chunked prefill scheduling by @rkooo567 in #3550
  • Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) by @AdrianAbeyta in #3290
  • [Misc] Publish 3rd meetup slides by @WoosukKwon in #3835
  • Fixes the argument for local_tokenizer_group by @sighingnow in #3754
  • [Core] Enable hf_transfer by default if available by @michaelfeil in #3817
  • [Bugfix] Add kv_scale input parameter to CPU backend by @WoosukKwon in #3840
  • [Core] [Frontend] Make detokenization optional by @mgerstgrasser in #3749
  • [Bugfix] Fix args in benchmark_serving by @CatherineSue in #3836
  • [Benchmark] Refactor sample_requests in benchmark_throughput by @gty111 in #3613
  • [Core] manage nccl via a pypi package & upgrade to pt 2.2.1 by @youkaichao in #3805
  • [Hardware][CPU] Update cpu torch to match default of 2.2.1 by @mgoin in #3854
  • [Model] Cohere CommandR+ by @saurabhdash2512 in #3829
  • [Core] improve robustness of pynccl by @youkaichao in #3860
  • [Doc]Add asynchronous engine arguments to documentation. by @SeanGallen in #3810
  • [CI/Build] fix pip cache with vllm_nccl & refactor dockerfile to build wheels by @youkaichao in #3859
  • [Misc] Add pytest marker to opt-out of global test cleanup by @cadedaniel in #3863
  • [Misc] Fix linter issues in examples/fp8/quantizer/quantize.py by @cadedaniel in #3864
  • [Bugfix] Fixing requirements.txt by @noamgat in #3865
  • [Misc] Define common requirements by @WoosukKwon in #3841
  • Add option to completion API to truncate prompt tokens by @tdoublep in #3144
  • [Chunked Prefill][4/n] Chunked prefill scheduler. by @rkooo567 in #3853
  • [Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism by @Isotr0py in #3869
  • [CI/Benchmark] add more iteration and use multiple percentiles for robust latency benchmark by @youkaichao in #3889
  • [Core] enable out-of-tree model register by @youkaichao in #3871
  • [WIP][Core] latency optimization by @youkaichao in #3890
  • [Bugfix] Fix Llava inference with Tensor Parallelism. by @Isotr0py in #3883
  • [Model] add minicpm by @SUDA-HLT-ywfang in #3893
  • [Bugfix] Added Command-R GPTQ support by @egortolmachev in #3849
  • [Bugfix] Enable Proper attention_bias Usage in Llama Model Configuration by @Ki6an in #3767
  • [Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations by @mawong-amd in #3782
  • [BugFix][Model] Fix commandr RoPE max_position_embeddings by @esmeetu in #3919
  • [Core] separate distributed_init from worker by @youkaichao in #3904
  • [Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" by @cadedaniel in #3837
  • [Bugfix] Fix KeyError on loading GPT-NeoX by @jsato8094 in #3925
  • [ROCm][Hardware][AMD] Use Triton Kernel for default FA on ROCm by @jpvillam-amd in #3643
  • [Misc] Avoid loading incorrect LoRA config by @jeejeelee in #3777
  • [Benchmark] Add cpu options to bench scripts by @PZD-CHINA in #3915
  • [Bugfix] fix utils.py/merge_dict func TypeError: 'type' object is not subscriptable by @zhaotyer in #3955
  • [Bugfix] Fix logits processor when prompt_logprobs is not None by @huyiwen in #3899
  • [Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty by @tjohnson31415 in #3876
  • [Bugfix][ROCm] Add numba to Dockerfile.rocm by @WoosukKwon in #3962
  • [Model][AMD] ROCm support for 256 head dims for Gemma by @jamestwhedbee in #3972
  • [Doc] Add doc to state our model support policy by @youkaichao in #3948
  • [Bugfix] Remove key sorting for guided_json parameter in OpenAi compatible Server by @dmarasco in #3945
  • [Doc] Fix getting stared to use publicly available model by @fpaupier in #3963
  • [Bugfix] handle hf_config with architectures == None by @tjohnson31415 in #3982
  • [WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators by @youkaichao in #3950
  • [Core][5/N] Fully working chunked prefill e2e by @rkooo567 in #3884
  • [Core][Model] Use torch.compile to accelerate layernorm in commandr by @youkaichao in #3985
  • [Test] Add xformer and flash attn tests by @rkooo567 in #3961
  • [Misc] refactor ops and cache_ops layer by @jikunshang in #3913
  • [Doc][Installation] delete python setup.py develop by @youkaichao in #3989
  • [Ke...

v0.4.0.post1, restore sm70/75 support

02 Apr 20:01
a3c226e

Highlight

v0.4.0 lacks sm70/75 support. This release is a hotfix that restores it.

What's Changed

  • [Kernel] Layernorm performance optimization by @mawong-amd in #3662
  • [Doc] Update installation doc for build from source and explain the dependency on torch/cuda version by @youkaichao in #3746
  • [CI/Build] Make Marlin Tests Green by @robertgshaw2-neuralmagic in #3753
  • [Misc] Minor fixes in requirements.txt by @WoosukKwon in #3769
  • [Misc] Some minor simplifications to detokenization logic by @njhill in #3670
  • [Misc] Fix Benchmark TTFT Calculation for Chat Completions by @ywang96 in #3768
  • [Speculative decoding 4/9] Lookahead scheduling for speculative decoding by @cadedaniel in #3250
  • [Misc] Add support for new autogptq checkpoint_format by @Qubitium in #3689
  • [Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup by @cadedaniel in #3783
  • [Hardware][Intel] Add CPU inference backend by @bigPYJ1151 in #3634
  • [HotFix] [CI/Build] Minor fix for CPU backend CI by @bigPYJ1151 in #3787
  • [Frontend][Bugfix] allow using the default middleware with a root path by @A-Mahla in #3788
  • [Doc] Fix vLLMEngine Doc Page by @ywang96 in #3791
  • [CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build by @youkaichao in #3801
  • Fix crash when try torch.cuda.set_device in worker by @leiwen83 in #3770
  • [Bugfix] Add __init__.py files for vllm/core/block/ and vllm/spec_decode/ by @mgoin in #3798
  • [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary by @youkaichao in #3803

New Contributors

Full Changelog: v0.4.0...v0.4.0.post1

v0.4.0

30 Mar 01:54
51c31bc

Major changes

Models

Production features

  • Automatic prefix caching (#2762, #3703), which lets a long system prompt be cached automatically and reused across requests. Enable it with the --enable-prefix-caching flag.
  • Support json_object in the OpenAI server for arbitrary JSON, a --use-delay flag to improve time to first token across many requests, and min_tokens for EOS suppression.
  • Progress in chunked prefill scheduler (#3236, #3538), and speculative decoding (#3103).
  • Custom all reduce kernel has been re-enabled after more robustness fixes.
  • Replaced cupy dependency due to its bugs.
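
To make the idea behind automatic prefix caching more concrete, here is a toy sketch of how a shared prompt prefix can be recognized across requests: tokens are grouped into fixed-size blocks, and each block is keyed by a hash of everything up to and including it. The block size, class name, and hashing scheme are illustrative assumptions and do not describe vLLM's actual implementation.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block (illustrative value)


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Return one key per full block; each key depends on the whole prefix so far."""
    keys = []
    running = hashlib.sha256()
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        running.update(",".join(map(str, block)).encode() + b";")
        keys.append(running.hexdigest())
    return keys


class ToyPrefixCache:
    def __init__(self):
        self.blocks = {}  # key -> placeholder standing in for cached KV data

    def lookup_and_fill(self, token_ids):
        """Return how many leading tokens were already cached, then cache the rest."""
        hit_blocks = 0
        matching = True
        for key in block_keys(token_ids):
            if matching and key in self.blocks:
                hit_blocks += 1
            else:
                matching = False
                self.blocks[key] = "kv"
        return hit_blocks * BLOCK_SIZE


system_prompt = list(range(10))  # stand-in token ids for a long system prompt
cache = ToyPrefixCache()
first = cache.lookup_and_fill(system_prompt + [100, 101])
second = cache.lookup_and_fill(system_prompt + [200, 201, 202])
print(first, second)
```

Because each key covers the entire prefix, two requests share a cached block only when every token before it also matches, which is why the second request reuses the two blocks fully inside the system prompt but not the block where the requests diverge.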

Hardware

  • Improved Neuron support for AWS Inferentia.
  • CMake based build system for extensibility.

Ecosystem

  • Extensive serving benchmark refactoring (#3277)
  • Usage statistics collection (#2852)

What's Changed


v0.3.3

01 Mar 20:58
82091b8

Major changes

  • StarCoder2 support
  • Performance optimization and LoRA support for Gemma
  • 2/3/8-bit GPTQ support
  • Integrate Marlin Kernels for Int4 GPTQ inference
  • Performance optimization for MoE kernel
  • [Experimental] AWS Inferentia2 support
  • [Experimental] Structured output (JSON, Regex) in OpenAI Server

What's Changed

New Contributors

Full Changelog: v0.3.2...v0.3.3

v0.3.2

21 Feb 19:50
8fbd84b

Major Changes

This version adds support for the OLMo and Gemma models, as well as a seed parameter.

What's Changed

New Contributors

Full Changelog: v0.3.1...v0.3.2

v0.3.1

16 Feb 23:06
5f08050

Major Changes

This version fixes the following major bugs:

  • Memory leak with distributed execution (solved by using CuPy for collective communication).
  • Support for Python 3.8.

It also includes many smaller bug fixes, listed below.

What's Changed

New Contributors

Full Changelog: v0.3.0...v0.3.1

v0.3.0

31 Jan 08:07
1af090b

Major Changes

  • Experimental multi-lora support
  • Experimental prefix caching support
  • FP8 KV Cache support
  • Optimized MoE performance and Deepseek MoE support
  • CI tested PRs
  • Support batch completion in server

What's Changed

New Contributors


v0.2.7

04 Jan 01:36
2e0b6e7

Major Changes

  • Up to 70% throughput improvement for distributed inference by removing serialization/deserialization overheads
  • Fix tensor parallelism support for Mixtral + GPTQ/AWQ

What's Changed

New Contributors

Full Changelog: v0.2.6...v0.2.7

v0.2.6

17 Dec 18:35
671af2b

Major changes

  • Fast model execution with CUDA/HIP graph
  • W4A16 GPTQ support (thanks to @chu-tianxiang)
  • Fix memory profiling with tensor parallelism
  • Fix *.bin weight loading for Mixtral models

What's Changed

New Contributors

Full Changelog: v0.2.5...v0.2.6