🇨🇳中文 | 🌐English | 📖文档/Docs | 🤖模型/Models


Text2vec: Text to Vector


Text2vec: Text to Vector, get sentence embeddings. Text vectorization: representing text (words, sentences, and paragraphs) as vectors (a vector matrix).

text2vec implements Word2Vec, RankBM25, BERT, Sentence-BERT, CoSENT and other text representation and text similarity models, and compares their performance on text semantic matching (similarity calculation) tasks.

Guide

Features

Text vector representation models

  • Word2Vec: word-vector lookup based on Tencent AI Lab's open-source large-scale, high-quality Chinese word embeddings (a lightweight version covering 8 million Chinese words; file name: light_Tencent_AILab_ChineseEmbedding.bin, password: tawe). This project builds word2vec sentence vectors by averaging the word vectors of a sentence (see the sketch after this list).
  • SBERT (Sentence-BERT): a sentence vector model that balances performance and efficiency. A classification head supervises training, while prediction uses the cosine similarity of the sentence vectors directly for text matching. This project reproduces Sentence-BERT training and prediction in PyTorch.
  • CoSENT (Cosine Sentence): proposes a ranking loss that brings the training objective closer to the prediction setting; it converges faster and performs better than Sentence-BERT. This project implements CoSENT training and prediction in PyTorch.
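
To make the word-vector-averaging idea above concrete, here is a minimal sketch using gensim and jieba (the embedding file path and the tokenizer are placeholders; the project's Word2Vec class wraps this logic for you):

import numpy as np
import jieba
from gensim.models import KeyedVectors

# Load the lightweight Tencent word vectors (replace the path with your local copy)
w2v = KeyedVectors.load_word2vec_format(
    "light_Tencent_AILab_ChineseEmbedding.bin", binary=True)

def sentence_vector(sentence):
    # Average the word vectors of all in-vocabulary tokens in the sentence
    tokens = [t for t in jieba.lcut(sentence) if t in w2v]
    if not tokens:
        return np.zeros(w2v.vector_size)
    return np.mean([w2v[t] for t in tokens], axis=0)

print(sentence_vector("如何更换花呗绑定银行卡").shape)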

Evaluation

Text Matching

  • Evaluation results on the English matching dataset:
| Arch | Backbone | Model Name | English-STS-B |
|:-----|:---------|:-----------|:-------------:|
| GloVe | glove | Avg_word_embeddings_glove_6B_300d | 61.77 |
| BERT | bert-base-uncased | BERT-base-cls | 20.29 |
| BERT | bert-base-uncased | BERT-base-first_last_avg | 59.04 |
| BERT | bert-base-uncased | BERT-base-first_last_avg-whiten(NLI) | 63.65 |
| SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-cls | 73.65 |
| SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-first_last_avg | 77.96 |
| SBERT | xlm-roberta-base | paraphrase-multilingual-MiniLM-L12-v2 | 84.42 |
| CoSENT | bert-base-uncased | CoSENT-base-first_last_avg | 69.93 |
| CoSENT | sentence-transformers/bert-base-nli-mean-tokens | CoSENT-base-nli-first_last_avg | 79.68 |
  • Evaluation results on the Chinese matching datasets:
| Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
|:-----|:---------|:-----------|:----:|:--:|:-----:|:-----:|:-----:|:---:|:---:|
| CoSENT | hfl/chinese-macbert-base | CoSENT-macbert-base | 50.39 | 72.93 | 79.17 | 60.86 | 80.51 | 68.77 | 3008 |
| CoSENT | Langboat/mengzi-bert-base | CoSENT-mengzi-base | 50.52 | 72.27 | 78.69 | 12.89 | 80.15 | 58.90 | 2502 |
| CoSENT | bert-base-chinese | CoSENT-bert-base | 49.74 | 72.38 | 78.69 | 60.00 | 80.14 | 68.19 | 2653 |
| SBERT | bert-base-chinese | SBERT-bert-base | 46.36 | 70.36 | 78.72 | 46.86 | 66.41 | 61.74 | 3365 |
| SBERT | hfl/chinese-macbert-base | SBERT-macbert-base | 47.28 | 68.63 | 79.42 | 55.59 | 64.82 | 63.15 | 2948 |
| CoSENT | hfl/chinese-roberta-wwm-ext | CoSENT-roberta-ext | 50.81 | 71.45 | 79.31 | 61.56 | 81.13 | 68.85 | - |
| SBERT | hfl/chinese-roberta-wwm-ext | SBERT-roberta-ext | 48.29 | 69.99 | 79.22 | 44.10 | 72.42 | 62.80 | - |
  • Chinese matching evaluation results of the models released by this project:
| Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
|:-----|:---------|:-----------|:----:|:--:|:-----:|:-----:|:-----:|:---:|:---:|
| Word2Vec | word2vec | w2v-light-tencent-chinese | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 33.86 | 23769 |
| SBERT | xlm-roberta-base | paraphrase-multilingual-MiniLM-L12-v2 | 18.42 | 38.52 | 63.96 | 10.14 | 78.90 | 41.99 | 3138 |
| CoSENT | hfl/chinese-macbert-base | shibing624/text2vec-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.30 | 48.25 | 3008 |
| CoSENT | hfl/chinese-lert-large | GanymedeNil/text2vec-large-chinese | 32.61 | 44.59 | 69.30 | 14.51 | 79.44 | 48.08 | 1046 |

Evaluation conclusions:

  • All result values are Spearman correlation coefficients.
  • Each result uses only the train split of the dataset for training and reports performance on the test split, without any external data.
  • The shibing624/text2vec-base-chinese model is trained with the CoSENT method on the Chinese STS-B training data, starting from MacBERT, and reaches SOTA on the Chinese STS-B test set. Run examples/training_sup_text_matching_model.py to train it; the model file has been uploaded to the HuggingFace model hub as shibing624/text2vec-base-chinese and is recommended for Chinese semantic matching tasks.
  • The SBERT-macbert-base model is trained with the SBERT method; run examples/training_sup_text_matching_model.py to train it.
  • The paraphrase-multilingual-MiniLM-L12-v2 model is [sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2), trained with the SBERT method. It is the multilingual version of paraphrase-MiniLM-L12-v2 and supports Chinese, English, and other languages.
  • w2v-light-tencent-chinese is a Word2Vec model built on Tencent word vectors. It is loaded and run on CPU and is suitable for Chinese literal (lexical) matching tasks and cold-start scenarios with little data.
  • Each pretrained model can be loaded through transformers, e.g. the MacBERT model: --model_name hfl/chinese-macbert-base, or a RoBERTa model: --model_name uer/roberta-medium-wwm-chinese-cluecorpussmall.
  • The Chinese matching datasets can be downloaded from the links below (see the [Dataset](#dataset) section).
  • Experiments on Chinese matching tasks show that the best pooling is first_last_avg, i.e. EncoderType.FIRST_LAST_AVG of SentenceModel, whose prediction quality differs only slightly from EncoderType.MEAN (see the sketch after this list).
  • The Chinese matching evaluation results are reproducible: download the Chinese matching datasets to examples/data and run [tests/test_model_spearman.py](https://github.com/shibing624/text2vec/blob/master/tests/test_model_spearman.py) to reproduce them.
  • The GPU used for the QPS test is a Tesla V100 with 32GB of memory.
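
A minimal sketch of switching the pooling strategy mentioned above (this assumes SentenceModel accepts an encoder_type argument and that EncoderType is exported by text2vec, as used in the project's training scripts):

from text2vec import SentenceModel, EncoderType

# FIRST_LAST_AVG averages the hidden states of the first and last transformer layers;
# MEAN pools only the last layer and performs almost as well in the experiments above.
model = SentenceModel(
    "shibing624/text2vec-base-chinese",
    encoder_type=EncoderType.FIRST_LAST_AVG,
)
emb = model.encode("如何更换花呗绑定银行卡")
print(emb.shape)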

Demo

Official Demo: https://www.mulanai.com/product/short_text_sim/

HuggingFace Demo: https://huggingface.co/spaces/shibing624/text2vec

Run the example examples/gradio_demo.py to see the demo:

python examples/gradio_demo.py

Install

pip install torch # conda install pytorch
pip install -U text2vec

or

pip install torch # conda install pytorch
pip install -r requirements.txt

git clone https://github.com/shibing624/text2vec.git
cd text2vec
pip install --no-deps .

Usage

Text Vector Representation

Compute text vectors based on pretrained model:

>>> from text2vec import SentenceModel
>>> m = SentenceModel()
>>> m.encode("如何更换花呗绑定银行卡")
Embedding shape: (768,)

example: examples/computing_embeddings_demo.py

import sys

sys.path.append('..')
from text2vec import SentenceModel
from text2vec import Word2Vec


def compute_emb(model):
    # Embed a list of sentences
    sentences = [
        '卡',
        '银行卡',
        '如何更换花呗绑定银行卡',
        '花呗更改绑定银行卡',
        'This framework generates embeddings for each input sentence',
        'Sentences are passed as a list of string.',
        'The quick brown fox jumps over the lazy dog.'
    ]
    sentence_embeddings = model.encode(sentences)
    print(type(sentence_embeddings), sentence_embeddings.shape)

    # The result is a list of sentence embeddings as numpy arrays
    for sentence, embedding in zip(sentences, sentence_embeddings):
        print("Sentence:", sentence)
        print("Embedding shape:", embedding.shape)
        print("Embedding head:", embedding[:10])
        print()


if __name__ == "__main__":
    # Chinese sentence embedding model (CoSENT), recommended for Chinese semantic matching; supports fine-tuning
    t2v_model = SentenceModel("shibing624/text2vec-base-chinese")
    compute_emb(t2v_model)

    # Multilingual sentence embedding model (Sentence-BERT), recommended for English semantic matching; supports fine-tuning
    sbert_model = SentenceModel("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
    compute_emb(sbert_model)

    # Chinese word vector model (word2vec), suitable for Chinese literal matching and cold-start scenarios
    w2v_model = Word2Vec("w2v-light-tencent-chinese")
    compute_emb(w2v_model)

output:

<class 'numpy.ndarray'> (7, 768)
Sentence: 卡
Embedding shape: (768,)

Sentence: 银行卡
Embedding shape: (768,)
 ... 
  • The returned embeddings are of type numpy.ndarray with shape (sentences_size, model_embedding_size). You can choose any of the three models; the first is recommended.
  • The shibing624/text2vec-base-chinese model is trained with the CoSENT method on the Chinese STS-B dataset and has been uploaded to the HuggingFace model hub as shibing624/text2vec-base-chinese. It is the default model of text2vec.SentenceModel and can be called as in the example above or through the transformers library as shown below. The model is downloaded automatically to the local path ~/.cache/huggingface/transformers.
  • The sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model is a multilingual Sentence-BERT sentence vector model, suitable for paraphrase identification and text matching. It can be called through text2vec.SentenceModel or the sentence-transformers library.
  • w2v-light-tencent-chinese is a Word2Vec model loaded with gensim. It uses the Tencent word vectors Tencent_AILab_ChineseEmbedding.tar.gz to look up each word's vector and obtains the sentence vector by averaging the word vectors. The model is downloaded automatically to the local path ~/.text2vec/datasets/light_Tencent_AILab_ChineseEmbedding.bin.

Usage (HuggingFace Transformers)

Without text2vec, you can use the model like this:

First, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

example: examples/use_origin_transformers_demo.py

import os
import torch
from transformers import AutoTokenizer, AutoModel

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"


# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('shibing624/text2vec-base-chinese')
model = AutoModel.from_pretrained('shibing624/text2vec-base-chinese')
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)

Usage (sentence-transformers)

sentence-transformers is a popular library to compute dense vector representations for sentences.

Install sentence-transformers:

pip install -U sentence-transformers

Then load model and predict:

from sentence_transformers import SentenceTransformer

m = SentenceTransformer("shibing624/text2vec-base-chinese")
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

sentence_embeddings = m.encode(sentences)
print("Sentence embeddings:")
print(sentence_embeddings)

Word2Vec word vector model

Two sets of Word2Vec word vectors are provided; choose one:

Downstream tasks

1. Sentence similarity calculation

example: examples/semantic_text_similarity_demo.py

import sys

sys.path.append('..')
from text2vec import Similarity

# Two lists of sentences
sentences1 = ['如何更换花呗绑定银行卡',
              'The cat sits outside',
              'A man is playing guitar',
              'The new movie is awesome']

sentences2 = ['花呗更改绑定银行卡',
              'The dog plays in the garden',
              'A woman watches TV',
              'The new movie is so great']

sim_model = Similarity()
for i in range(len(sentences1)):
    for j in range(len(sentences2)):
        score = sim_model.get_score(sentences1[i], sentences2[j])
        print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i], sentences2[j], score))

output:

如何更换花呗绑定银行卡 		 花呗更改绑定银行卡 		 Score: 0.9477
如何更换花呗绑定银行卡 		 The dog plays in the garden 		 Score: -0.1748
如何更换花呗绑定银行卡 		 A woman watches TV 		 Score: -0.0839
如何更换花呗绑定银行卡 		 The new movie is so great 		 Score: -0.0044
The cat sits outside 		 花呗更改绑定银行卡 		 Score: -0.0097
The cat sits outside 		 The dog plays in the garden 		 Score: 0.1908
The cat sits outside 		 A woman watches TV 		 Score: -0.0203
The cat sits outside 		 The new movie is so great 		 Score: 0.0302
A man is playing guitar 		 花呗更改绑定银行卡 		 Score: -0.0010
A man is playing guitar 		 The dog plays in the garden 		 Score: 0.1062
A man is playing guitar 		 A woman watches TV 		 Score: 0.0055
A man is playing guitar 		 The new movie is so great 		 Score: 0.0097
The new movie is awesome 		 花呗更改绑定银行卡 		 Score: 0.0302
The new movie is awesome 		 The dog plays in the garden 		 Score: -0.0160
The new movie is awesome 		 A woman watches TV 		 Score: 0.1321
The new movie is awesome 		 The new movie is so great 		 Score: 0.9591

The sentence cosine similarity score ranges over [-1, 1]; the larger the value, the more similar the two sentences.
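
If you already hold the embeddings, the same cosine score can be computed directly with cos_sim (a minimal sketch; cos_sim is the helper also used in the semantic search example below):

from text2vec import SentenceModel, cos_sim

model = SentenceModel("shibing624/text2vec-base-chinese")
emb1 = model.encode("如何更换花呗绑定银行卡")
emb2 = model.encode("花呗更改绑定银行卡")
# cos_sim returns a matrix of pairwise cosine similarities; here it is 1x1
score = float(cos_sim(emb1, emb2))
print(score)  # should be close to the 0.9477 shown above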

2. Text matching search

Typically, the goal is to find the texts in a candidate document set that are most similar to the query; this is common in QA scenarios such as question similarity matching and text similarity retrieval.

example: examples/semantic_search_demo.py

import sys

sys.path.append('..')
from text2vec import SentenceModel, cos_sim, semantic_search

embedder = SentenceModel()

# Corpus with example sentences
corpus = [
    '花呗更改绑定银行卡',
    '我什么时候开通了花呗',
    'A man is eating food.',
    'A man is eating a piece of bread.',
    'The girl is carrying a baby.',
    'A man is riding a horse.',
    'A woman is playing violin.',
    'Two men pushed carts through the woods.',
    'A man is riding a white horse on an enclosed ground.',
    'A monkey is playing drums.',
    'A cheetah is running behind its prey.'
]
corpus_embeddings = embedder.encode(corpus)

# Query sentences:
queries = [
    '如何更换花呗绑定银行卡',
    'A man is eating pasta.',
    'Someone in a gorilla costume is playing a set of drums.',
    'A cheetah chases prey on across a field.']

for query in queries:
    query_embedding = embedder.encode(query)
    hits = semantic_search(query_embedding, corpus_embeddings, top_k=5)
    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")
    hits = hits[0]  # Get the hits for the first query
    for hit in hits:
        print(corpus[hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))

output:

Query: 如何更换花呗绑定银行卡
Top 5 most similar sentences in corpus:
花呗更改绑定银行卡 (Score: 0.9477)
我什么时候开通了花呗 (Score: 0.3635)
A man is eating food. (Score: 0.0321)
A man is riding a horse. (Score: 0.0228)
Two men pushed carts through the woods. (Score: 0.0090)

======================
Query: A man is eating pasta.
Top 5 most similar sentences in corpus:
A man is eating food. (Score: 0.6734)
A man is eating a piece of bread. (Score: 0.4269)
A man is riding a horse. (Score: 0.2086)
A man is riding a white horse on an enclosed ground. (Score: 0.1020)
A cheetah is running behind its prey. (Score: 0.0566)

======================
Query: Someone in a gorilla costume is playing a set of drums.
Top 5 most similar sentences in corpus:
A monkey is playing drums. (Score: 0.8167)
A cheetah is running behind its prey. (Score: 0.2720)
A woman is playing violin. (Score: 0.1721)
A man is riding a horse. (Score: 0.1291)
A man is riding a white horse on an enclosed ground. (Score: 0.1213)

======================
Query: A cheetah chases prey on across a field.
Top 5 most similar sentences in corpus:
A cheetah is running behind its prey. (Score: 0.9147)
A monkey is playing drums. (Score: 0.2655)
A man is riding a horse. (Score: 0.1933)
A man is riding a white horse on an enclosed ground. (Score: 0.1733)
A man is eating food. (Score: 0.0329)

Downstream task support library

similarities library [recommended]

For text similarity calculation and text matching search tasks, the similarities library is recommended. It is compatible with the Word2Vec, SBERT, and CoSENT semantic matching models released by this project, and also supports literal (lexical) similarity calculation, matching search algorithms, and text-and-image search.

Install: pip install -U similarities

Sentence similarity calculation:

from similarities import Similarity

m = Similarity()
r = m.similarity('如何更换花呗绑定银行卡', '花呗更改绑定银行卡')
print(f"similarity score: {float(r)}")  # similarity score: 0.855146050453186

Models

CoSENT model

CoSENT (Cosine Sentence) text matching model, which improves on Sentence-BERT's sentence vector scheme with a cosine-based ranking loss (CosineRankLoss).
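
The ranking loss optimizes log(1 + Σ exp(λ · (cos(j) − cos(i)))) over all pairs of training examples where pair i is labeled more similar than pair j, so that more similar pairs receive higher cosine scores. A minimal PyTorch sketch of this idea (simplified; λ = 20 is a commonly used scale, not necessarily the exact value in this project's code):

import torch

def cosent_loss(labels, cos_scores, scale=20.0):
    # labels: gold similarity labels of the sentence pairs, shape (batch,)
    # cos_scores: predicted cosine similarities of the sentence pairs, shape (batch,)
    scores = cos_scores * scale
    diff = scores[:, None] - scores[None, :]            # diff[i, j] = s_i - s_j
    keep = (labels[:, None] < labels[None, :]).float()  # 1 where pair i is less similar than pair j
    diff = diff - (1.0 - keep) * 1e12                   # mask out all other pairs
    diff = torch.cat([torch.zeros(1, device=diff.device), diff.flatten()])
    return torch.logsumexp(diff, dim=0)                 # log(1 + sum(exp(s_i - s_j)))

# toy usage
labels = torch.tensor([1.0, 0.0, 5.0])
cos_scores = torch.tensor([0.6, 0.1, 0.9])
print(cosent_loss(labels, cos_scores))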

Network structure:

Training:

Inference:

CoSENT Supervised Model

Train and predict CoSENT model:

  • Train and evaluate the CoSENT model on the Chinese STS-B dataset

example: examples/training_sup_text_matching_model.py

cd examples
python training_sup_text_matching_model.py --model_arch cosent --do_train --do_predict --num_epochs 10 --model_name hfl/chinese-macbert-base --output_dir ./outputs/STS-B-cosent
  • Train and evaluate the CoSENT model on the Ant Financial matching dataset ATEC

The following Chinese matching datasets are supported: 'ATEC', 'STS-B', 'BQ', 'LCQMC', 'PAWSX'. For details see the HuggingFace dataset [shibing624/nli_zh](https://huggingface.co/datasets/shibing624/nli_zh).

python training_sup_text_matching_model.py --task_name ATEC --model_arch cosent --do_train --do_predict --num_epochs 10 --model_name hfl/chinese-macbert-base --output_dir ./outputs/ATEC-cosent
  • Train the model on your own Chinese dataset

example: examples/training_sup_text_matching_model_selfdata.py

python training_sup_text_matching_model_selfdata.py --do_train --do_predict
  • Train and evaluate the CoSENT model on the English STS-B dataset

example: examples/training_sup_text_matching_model_en.py

cd examples
python training_sup_text_matching_model_en.py --model_arch cosent --do_train --do_predict --num_epochs 10 --model_name bert-base-uncased  --output_dir ./outputs/STS-B-en-cosent

CoSENT Unsupervised Model

  • Train the CoSENT model on the English NLI dataset and evaluate the effect on the STS-B test set

example: examples/training_unsup_text_matching_model_en.py

cd examples
python training_unsup_text_matching_model_en.py --model_arch cosent --do_train --do_predict --num_epochs 10 --model_name bert-base-uncased --output_dir ./outputs/STS-B-en-unsup-cosent

Sentence-BERT model

Sentence-BERT text matching model, a representation-based sentence vector scheme

Network structure:

Training:

Inference:

SentenceBERT Supervised Model

  • Train and evaluate the SBERT model on the Chinese STS-B dataset

example: examples/training_sup_text_matching_model.py

cd examples
python training_sup_text_matching_model.py --model_arch sentencebert --do_train --do_predict --num_epochs 10 --model_name hfl/chinese-macbert-base --output_dir ./outputs/STS-B-sbert
  • Train and evaluate the SBERT model on the English STS-B dataset

example: examples/training_sup_text_matching_model_en.py

cd examples
python training_sup_text_matching_model_en.py --model_arch sentencebert --do_train --do_predict --num_epochs 10 --model_name bert-base-uncased --output_dir ./outputs/STS-B-en-sbert

SentenceBERT Unsupervised Model

  • Train the SBERT model on the English NLI dataset and evaluate the effect on the STS-B test set

example: examples/training_unsup_text_matching_model_en.py

cd examples
python training_unsup_text_matching_model_en.py --model_arch sentencebert --do_train --do_predict --num_epochs 10 --model_name bert-base-uncased --output_dir ./outputs/STS-B-en-unsup-sbert

BERT-Match model

BERT text matching model using the native BERT matching network structure; an interaction-based sentence-pair matching model

Network structure:

Training and inference:

The training script is the same as above [examples/training_sup_text_matching_model.py](https://github.com/shibing624/text2vec/blob/master/examples/training_sup_text_matching_model.py).

Model Distillation

Since models trained with text2vec can be loaded with the sentence-transformers library, its model distillation method [distillation](https://github.com/UKPLab/sentence-transformers/tree/master/examples/training/distillation) can be used.

  1. Model dimension reduction: refer to dimensionality_reduction.py and use PCA to reduce the dimensionality of the model's output embeddings. This reduces the storage pressure on vector retrieval databases such as milvus and can slightly improve the model's effectiveness (see the sketch after this list).
  2. Model distillation: refer to model_distillation.py and use distillation to transfer the teacher model into a student model with fewer layers, which greatly improves prediction speed at a small cost in quality.
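
A minimal sketch of the dimension-reduction idea from item 1 (the corpus, target dimension, and model name are illustrative; fit PCA on a reasonably large embedding sample and reuse the projection everywhere):

import numpy as np
from sklearn.decomposition import PCA
from text2vec import SentenceModel

model = SentenceModel("shibing624/text2vec-base-chinese")

# Fit PCA on a sample of corpus embeddings (needs at least as many sentences as target dims)
corpus = ["placeholder sentence %d" % i for i in range(256)]
corpus_emb = model.encode(corpus)                # shape: (256, 768)
pca = PCA(n_components=128).fit(corpus_emb)

# Project new embeddings to 128 dimensions before writing them to the vector database
query_emb = model.encode(["如何更换花呗绑定银行卡"])
reduced = pca.transform(query_emb)               # shape: (1, 128)
print(reduced.shape)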

Model Deployment

Two ways to deploy the model and build a service are provided: 1) a gRPC service built on Jina [recommended]; 2) a native HTTP service built on FastAPI.

Jina Service

It uses a C/S architecture to build a high-performance service, supports docker and cloud-native deployment, gRPC/HTTP/WebSocket, serving multiple models simultaneously, and multi-GPU processing.

  • Install: pip install jina

  • Start the service:

example: examples/jina_server_demo.py

from jina import Flow

port = 50001
f = Flow(port=port).add(
    uses='jinahub://Text2vecEncoder',
    uses_with={'model_name': 'shibing624/text2vec-base-chinese'}
)

with f:
    # backend server forever
    f.block()

The model prediction method (executor) has been uploaded to JinaHub, which includes docker and k8s deployment methods.

  • Call the service:
from jina import Client
from docarray import Document, DocumentArray

port = 50001

c = Client(port=port)

data = ['如何更换花呗绑定银行卡',
        '花呗更改绑定银行卡']
print("data:", data)
print('data embs:')
r = c.post('/', inputs=DocumentArray([Document(text=t) for t in data]))
print(r.embeddings)

See example for batch call method: examples/jina_client_demo.py

Fast API service

  • Install: pip install fastapi uvicorn

  • Start the service:

example: examples/fastapi_server_demo.py

cd examples
python fastapi_server_demo.py
  • Call the service:
curl -X 'GET' \
  'http://0.0.0.0:8001/emb?q=hello' \
  -H 'accept: application/json'

Dataset

  • Datasets released by this project:

| Dataset | Description | Download Link |
|:--------|:------------|:--------------|
| shibing624/nli_zh | Chinese semantic matching dataset that merges the ATEC, BQ, LCQMC, PAWSX, and STS-B tasks (5 in total) | https://huggingface.co/datasets/shibing624/nli_zh or Baidu Netdisk (extraction code: qkt6) or github |
| ATEC | Chinese ATEC dataset, Ant Financial Q-Q pair dataset | ATEC |
| BQ | Chinese BQ (Bank Question) dataset, bank Q-Q pair dataset | BQ |
| LCQMC | Chinese LCQMC (large-scale Chinese question matching corpus) dataset, Q-Q pair dataset | LCQMC |
| PAWSX | Chinese PAWSX (Paraphrase Adversaries from Word Scrambling) dataset, Q-Q pair dataset | PAWSX |
| STS-B | Chinese STS-B dataset, a Chinese natural language inference dataset translated from the English STS-B | STS-B |

The Chinese semantic matching dataset shibing624/nli_zh includes ATEC, [BQ](http://icrc.hitsz.edu.cn/info/1037/1162.htm), LCQMC, PAWSX, and [STS-B](https://github.com/pluto-junzeng/CNSD), 5 tasks in total. You can download each dataset from its corresponding link, or download all of them from Baidu Netdisk (extraction code: qkt6). In the netdisk, the senteval_cn directory contains the evaluation datasets and senteval_cn.zip is a packaged archive of that directory; use whichever is more convenient.

Dataset usage example:

pip install datasets
from datasets import load_dataset

dataset = load_dataset("shibing624/nli_zh", "STS-B") # ATEC or BQ or LCQMC or PAWSX or STS-B
print(dataset)
print(dataset['test'][0])

output:

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label'],
        num_rows: 5231
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label'],
        num_rows: 1458
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label'],
        num_rows: 1361
    })
})
{'sentence1': '一个女孩在给她的头发做发型。', 'sentence2': '一个女孩在梳头。', 'label': 2}
Introduction to text vector methods

Question

How should text be represented as vectors, and which model works better for text matching tasks?

The success of many NLP tasks is inseparable from training high-quality, effective text representation vectors, especially for text semantic matching (Semantic Textual Similarity, e.g. paraphrase detection and QA question-pair matching) and dense text retrieval.

Solution

Traditional Approach: Feature-Based Matching

  • Extract lexical, topic and other features of the two texts with algorithms such as TF-IDF, BM25, Jaccard, SimHash and LDA, then train a classifier with a machine learning model (LR, xgboost)
  • Advantages: good interpretability
  • Disadvantages: relies on manual feature engineering, generalizes only moderately well, and because the number of features is limited the model's effectiveness is average

Representative model:

  • BM25

The BM25 algorithm scores a candidate sentence by how well its terms cover the query terms; candidates with higher scores match the query better. It mainly addresses similarity at the lexical level.
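
A minimal BM25 scoring sketch with the rank_bm25 package (the toy corpus and whitespace tokenizer are illustrative; Chinese text would first be segmented, e.g. with jieba):

from rank_bm25 import BM25Okapi

corpus = [
    "a man is eating food",
    "a man is riding a horse",
    "a monkey is playing drums",
]
tokenized_corpus = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "a man is eating pasta".split()
print(bm25.get_scores(query))                 # one lexical-overlap score per document
print(bm25.get_top_n(query, corpus, n=2))     # the two best-matching documents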

Deep Methods: Representation-Based Matching

  • In representation-based matching, the two texts are encoded separately by a deep neural network to obtain text representations (embeddings), and the similarity between the two representations then gives the similarity of the two texts
  • Advantages: BERT-based models achieve good performance on text representation and text matching tasks after supervised fine-tuning
  • Disadvantages: sentence vectors derived from BERT itself (without fine-tuning, by averaging all token vectors) are of low quality, even worse than GloVe results, so they hardly reflect the semantic similarity of two sentences

The main reasons are:

  1. BERT tends to encode all sentences into a small region of the embedding space, which gives most sentence pairs high similarity scores, even pairs that are semantically completely unrelated.

  2. The clustering of BERT sentence vectors is related to the high-frequency words in a sentence. Specifically, when the sentence vector is computed by averaging word vectors, the vectors of high-frequency words dominate it, making it hard to reflect the original semantics. Removing some high-frequency words when computing the sentence vector alleviates the clustering to some extent, but also reduces representational power.

Models:

Since BERT brought sweeping changes to NLP in 2018, models from before 2018 are not discussed or compared here (if you are interested, see the Chinese Academy of Sciences' open-source [MatchZoo](https://github.com/NTMC-Community/MatchZoo) and MatchZoo-py).

Therefore, this project mainly investigates the following vector representation models that outperform native BERT and suit text matching: Sentence-BERT (2019), BERT-flow (2020), SimCSE (2021), and CoSENT (2022).

Deep Approach: Interaction-Based Matching

  • Interaction-based matching argues that computing similarity only at the final stage depends too heavily on the quality of the text representations and loses basic text features (such as morphology and syntax), so text features should interact as early as possible to capture more basic features, with the final matching score computed at a higher level from these basic matching features
  • Advantages: interaction-based matching models are end-to-end and perform well
  • Disadvantages: this type of model (Cross-Encoder) takes two sentences as input and outputs a similarity value for the pair. It does not produce sentence embeddings, and a single sentence cannot be fed to the model, so such models are impractical for tasks that require text vector representations

Models:

Cross-Encoders are suitable for re-ranking in vector retrieval.
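
A minimal re-ranking sketch with sentence-transformers' CrossEncoder (the checkpoint name is an illustrative public model, not one released by this project):

from sentence_transformers import CrossEncoder

# A cross-encoder scores (query, candidate) pairs jointly and produces no embeddings,
# so it is typically used to re-rank a short candidate list retrieved by a bi-encoder or BM25.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "A man is eating pasta."
candidates = [
    "A man is eating food.",
    "A man is riding a horse.",
    "A monkey is playing drums.",
]
scores = reranker.predict([(query, c) for c in candidates])
for text, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.4f}\t{text}")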

Contact

  • Issue (suggestion): GitHub issues
  • Email me: xuming: [email protected]
  • WeChat me: add my WeChat ID: xuming624 with the note: Name-Company-NLP to join the NLP discussion group.

Citation

If you use text2vec in your research, please cite it in the following format:

APA:

Xu, M. Text2vec: Text to vector toolkit (Version 1.1.2) [Computer software]. https://github.com/shibing624/text2vec

BibTeX:

@misc{Text2vec,
  author = {Xu, Ming},
  title = {Text2vec: Text to vector toolkit},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/shibing624/text2vec}},
}

License

This project is licensed under The Apache License 2.0, free for commercial use. Please include a link to text2vec and the license in your product description.

Contribute

The project code is still rough. If you have improvements, you are welcome to submit them back to this project. Before submitting, please note the following two points:

  • Add corresponding unit tests in tests
  • Run python -m pytest -v to execute all unit tests and make sure they all pass

Then you can submit a PR.

Reference