Training Details

The entire training process includes three parts: vocabulary expansion, pre-training, and instruction fine-tuning. Please refer to merge_tokenizers.py for vocabulary expansion; for pre-training and self-instruct fine-tuning, refer to run_clm.py in 🤗transformers and the relevant dataset-processing parts of the Stanford Alpaca project.

Preparation: Vocabulary Expansion

Due to the limited support for Chinese (and other non-English languages) in the original LLaMA,

  • We expanded the Chinese vocabulary by training a 20K-token Chinese vocabulary on a general Chinese corpus with sentencepiece, and then merged it with the original LLaMA model's 32K vocabulary.
  • After removing duplicate tokens, the final Chinese LLaMA vocabulary size is 49,953.
  • It should be noted that during the fine-tuning stage, Alpaca has one more pad token than LLaMA, so the Chinese Alpaca vocabulary size is 49,954.

For more information on the motivation behind expanding the Chinese vocabulary, please refer to the FAQ. If you want to know the details of vocabulary expansion, or want to expand the LLaMA tokenizer with your own custom vocabulary, please check merge_tokenizers.py. The script can be run as follows:

python merge_tokenizers.py \
  --llama_tokenizer_dir llama_tokenizer_dir \
  --chinese_sp_model_file chinese_sp_model_file

where

  • llama_tokenizer_dir: path to the directory that stores the original LLaMA tokenizer
  • chinese_sp_model_file: the Chinese sentencepiece model file generated by sentencepiece

We also release the 20K-vocab Chinese sentencepiece model that was used in vocabulary expansion, available at scripts/merge_tokenizer/chinese_sp.model.
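At its core, the merge adds every piece from the Chinese sentencepiece model that is not already present in the LLaMA tokenizer's model proto. Below is a minimal sketch of that logic, assuming the sentencepiece protobuf API and placeholder file paths; merge_tokenizers.py remains the authoritative implementation.

# Minimal vocabulary-merge sketch (placeholder paths; see merge_tokenizers.py for the real script)
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
from transformers import LlamaTokenizer

llama_tokenizer = LlamaTokenizer.from_pretrained("llama_tokenizer_dir")  # placeholder path
chinese_sp = spm.SentencePieceProcessor()
chinese_sp.Load("chinese_sp.model")  # placeholder path to the Chinese sentencepiece model

# Parse both tokenizers into their sentencepiece model protos
llama_proto = sp_pb2_model.ModelProto()
llama_proto.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())
chinese_proto = sp_pb2_model.ModelProto()
chinese_proto.ParseFromString(chinese_sp.serialized_model_proto())

# Append Chinese pieces that are not already in the 32K LLaMA vocabulary
llama_pieces = {p.piece for p in llama_proto.pieces}
for p in chinese_proto.pieces:
    if p.piece not in llama_pieces:
        new_piece = sp_pb2_model.ModelProto().SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0
        llama_proto.pieces.append(new_piece)

print(f"Merged vocabulary size: {len(llama_proto.pieces)}")  # 49,953 for Chinese LLaMA

with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_proto.SerializeToString())

The merged .model file can then be loaded as a regular LlamaTokenizer (pass it as vocab_file), and the model's embedding matrix must be resized to the new vocabulary size before training.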

Pre-training

Instruction Fine-tuning

  • The task format of the instruction fine-tuning phase is basically the same as that of Stanford Alpaca. The training scheme also uses LoRA for efficient fine-tuning, with a further increase in the number of trainable parameters compared with the pre-training stage.

  • We follow the original Stanford Alpaca prompt template for samples without an "input" field. For data that does contain an "input" value, we simply concatenate the two as f"{instruction}\n{input}" (see the sketch after this list).

  • We release the SFT code scripts/training/run_clm_sft_with_peft.py for reference. See SFT Script for the detailed usage.
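As a concrete illustration of the prompt format described above, here is a minimal sketch using the standard Stanford Alpaca "no input" template; the build_prompt helper is illustrative and not part of the released SFT script.

# Sketch of SFT prompt assembly (illustrative helper; template taken from Stanford Alpaca)
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

def build_prompt(example: dict) -> str:
    """Concatenate instruction and optional input, then fill the no-input template."""
    instruction = example["instruction"]
    if example.get("input"):
        instruction = f"{instruction}\n{example['input']}"  # {instruction}\n{input}
    return PROMPT_TEMPLATE.format(instruction=instruction)

# Usage
print(build_prompt({"instruction": "Translate the sentence into Chinese.", "input": "Hello, world."}))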

Training Data

During the instruction fine-tuning phase, about 2M samples were used for the 7B model and about 3M samples for the 13B model. Details:

| Dataset | Size | Source | Description |
|---|---|---|---|
| Chinese-English Translation | 500K | link | Sampled and cleaned from the original dataset |
| pCLUE | 300K | link | Sampled and cleaned from the original dataset |
| Stanford Alpaca data | 50K | link | Original training data of Stanford Alpaca |
| Stanford Alpaca data (Chinese) | 50K | link | Original Alpaca data translated into Chinese using ChatGPT |
| Self-instruction data | 1-2M | N/A | Generated with the ChatGPT API; see below |

This project provides a script, script/crawl_prompt.py, for dynamically generating prompts for different domains and instruction types.

python script/crawl_prompt.py output-file

  • The idea is similar to the approach used in Stanford Alpaca. It generates 20 sets of data at a time (you can modify the templates), reducing the cost of crawling.
  • The generated file contains data crawled through gpt-3.5-turbo (you must have an OpenAI API key to use it).
  • Although the instruction template asks for output in JSON format, the model does not always return valid JSON, so you need to clean up the returned data accordingly.
  • Since crawling takes a long time, it is recommended to run this script in the background. When running multiple threads, pay attention to the call limit of the OpenAI API.
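The sketch below shows the kind of request the script issues, assuming the legacy openai (<1.0) Python SDK; the seed prompt and output handling are illustrative rather than the project's exact template.

# Illustrative sketch of crawling self-instruction data via gpt-3.5-turbo
import json
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"  # an OpenAI API key is required

SEED_PROMPT = (
    "Generate 20 diverse Chinese task instructions, each with an optional input "
    "and an output, and return them as a JSON list."
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": SEED_PROMPT}],
    temperature=1.0,
)
raw_text = response["choices"][0]["message"]["content"]

try:
    records = json.loads(raw_text)   # the model does not always return valid JSON
except json.JSONDecodeError:
    records = []                     # fall back to manual cleaning of raw_text

with open("output-file", "a", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")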

Experimental Setups

The following are the experimental setups for the basic 7B models. For more details, please refer to our technical report.

| Settings | Pre-training Stage One | Pre-training Stage Two | Instruction Fine-tuning |
|---|---|---|---|
| Batch Size | 1024 | 1024 | 512 |
| Initial Learning Rate | 2e-4 | 1e-4 | 1e-4 |
| Training Steps | 3K | 6K | 6K-10K |
| Max Length | 512 | 512 | 512 |
| Trainable Parameters (%) | 2.97% | 6.06% | 6.22% |
| Training Device | 8 × A40 (48G) | 16 × A40 (48G) | 16 × A40 (48G) |
| Distributed Training | DeepSpeed Zero-2 | DeepSpeed Zero-2 | DeepSpeed Zero-2 |
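
For orientation, the instruction fine-tuning column could be expressed with 🤗 Transformers TrainingArguments roughly as follows; the per-device batch size / gradient accumulation split and the DeepSpeed config filename are assumptions for illustration, not the project's exact launch settings.

# Rough mapping of the SFT column to TrainingArguments (illustrative values)
from transformers import TrainingArguments

sft_args = TrainingArguments(
    output_dir="output_sft",
    per_device_train_batch_size=8,   # 16 GPUs x 8 x 4 accumulation steps = global batch size 512
    gradient_accumulation_steps=4,
    learning_rate=1e-4,              # initial learning rate from the table
    max_steps=10000,                 # 6K-10K steps depending on the run
    deepspeed="ds_zero2.json",       # DeepSpeed Zero-2 configuration (assumed filename)
)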