Low-Rank LLama2

One clear trend in artificial intelligence (AI) in recent years is the steady growth in the size and complexity of machine learning models. In particular, large language models (LLMs), which rely mainly on transformers as building blocks, now reach very large parameter counts and require a significant amount of compute, and both are expected to keep increasing as larger models are released.

In this blog post and supporting code, we explore low-rankness as a pruning technique for the LLama2-7B base model. We show that, by splitting almost all the linear layer weights into low-rank pairs without fine-tuning, and then leveraging LoRA for custom training, we can achieve the following without implementing custom kernels:

- ~50% reduction in the number of parameters.
- Up to ~50% faster training vs. bitsandbytes’s 8-bit quantization.
- Up to ~1.25x inference speed-up.
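
To make the parameter reduction concrete, here is a minimal sketch of the counting argument behind a low-rank pair. A dense weight of shape `(d_out, d_in)` holds `d_out * d_in` values, while a pair `A` of shape `(d_out, rank)` and `B` of shape `(rank, d_in)` holds `rank * (d_out + d_in)`. The helper names and the uniform 50% per-layer budget are assumptions made for illustration; they are not the ranks used in the repository. The layer shapes are those of Llama2-7B (hidden size 4096, MLP intermediate size 11008).

```python
def dense_params(d_out: int, d_in: int) -> int:
    """Number of weights in a dense linear layer with W of shape (d_out, d_in)."""
    return d_out * d_in


def low_rank_params(d_out: int, d_in: int, rank: int) -> int:
    """Number of weights in a low-rank pair W ~= A @ B.

    A has shape (d_out, rank) and B has shape (rank, d_in).
    """
    return rank * (d_out + d_in)


def max_rank_for_reduction(d_out: int, d_in: int, reduction: float) -> int:
    """Largest rank whose pair keeps at most (1 - reduction) of the dense weights.

    Hypothetical helper: a uniform per-layer budget is an assumption here.
    """
    budget = (1.0 - reduction) * dense_params(d_out, d_in)
    return int(budget // (d_out + d_in))


# Llama2-7B linear layer shapes (hidden size 4096, MLP intermediate size 11008).
shapes = {
    "attention projection": (4096, 4096),
    "MLP projection": (11008, 4096),
}

for name, (d_out, d_in) in shapes.items():
    rank = max_rank_for_reduction(d_out, d_in, reduction=0.5)
    kept = low_rank_params(d_out, d_in, rank) / dense_params(d_out, d_in)
    print(f"{name}: rank <= {rank}, fraction of weights kept = {kept:.3f}")
```

The pair only saves parameters when `rank < d_out * d_in / (d_out + d_in)`, which is why the rank budget differs between the square attention projections and the rectangular MLP projections.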

The blog post is at https://mobiusml.github.io/low-rank-llama2/
and the code is at https://github.com/mobiusml/low-rank-llama2/tree/main/code
