	---
	language:
	- en
	license: apache-2.0
	tags:
	- biencoder
	- sentence-transformers
	- text-classification
	- sentence-pair-classification
	- semantic-similarity
	- semantic-search
	- retrieval
	- reranking
	- generated_from_trainer
	- dataset_size:1047690
	- loss:CoSENTLoss
	base_model: Alibaba-NLP/gte-modernbert-base
	widget:
	- source_sentence: That is evident from their failure , three times in a row , to
	    get a big enough turnout to elect a president .
	  sentences:
	  - 'given a text, decide to which of a predefined set of classes it belongs. examples:
	    language identification, genre classification, sentiment analysis, and spam detection'
	  - Three times in a row , they failed to get a big _ enough turnout to elect a president
	    .
	  - He said the Government still did not know the real reason the original Saudi buyer
	    pulled out on August 21 .
	- source_sentence: these use built-in and learned knowledge to make decisions and
	    accomplish tasks that fulfill the intentions of the user.
	  sentences:
	  - It also features a 4.5 in back-lit LCD screen and memory expansion facilities
	    .
	  - '- set of interrelated components - collect, process, store and distribute info.
	    - support decision-making, coordination, and control'
	  - software programs that work without direct human intervention to carry out specific
	    tasks for an individual user, business process, or software application -siri
	    adapts to your preferences over time
	- source_sentence: any location in storage can be accessed at any moment in approximately
	    the same amount of time.
	  sentences:
	  - your study can adopt the original model used by the cited theorist but you can
	    modify different variables depending on your study of the whole theory
	  - an access method that can access any storage location directly and in any order;
	    primary storage devices and disk storage devices use random access...
	  - Branson said that his preference would be to operate a fully commercial service
	    on routes to New York , Barbados and Dubai .
	- source_sentence: United issued a statement saying it will " work professionally
	    and cooperatively with all its unions . "
	  sentences:
	  - network that acts like the human brain; type of ai
	  - a database system consists of one or more databases and a database management
	    system (dbms).
	  - Senior vice president Sara Fields said the airline " will work professionally
	    and cooperatively with all our unions . "
	- source_sentence: A European Union spokesman said the Commission was consulting EU
	    member states " with a view to taking appropriate action if necessary " on the
	    matter .
	  sentences:
	  - Justice Minister Martin Cauchon and Prime Minister Jean Chretien both have said
	    the government will introduce legislation to decriminalize possession of small
	    amounts of pot .
	  - Laos 's second most important export destination - said it was consulting EU member
	    states ' ' with a view to taking appropriate action if necessary ' ' on the matter
	    .
	  - the form data assumes and the possible range of values that the attribute defined
	    as that type of data may express 1. text 2. numerical
	datasets:
	- redis/langcache-sentencepairs-v1
	pipeline_tag: sentence-similarity
	library_name: sentence-transformers
	metrics:
	- cosine_accuracy
	- cosine_accuracy_threshold
	- cosine_f1
	- cosine_f1_threshold
	- cosine_precision
	- cosine_recall
	- cosine_ap
	- cosine_mcc
	model-index:
	- name: Redis fine-tuned BiEncoder model for semantic caching on LangCache
	  results:
	  - task:
	      type: binary-classification
	      name: Binary Classification
	    dataset:
	      name: val
	      type: val
	    metrics:
	    - type: cosine_accuracy
	      value: 0.7638310529446758
	      name: Cosine Accuracy
	    - type: cosine_accuracy_threshold
	      value: 0.8640533685684204
	      name: Cosine Accuracy Threshold
	    - type: cosine_f1
	      value: 0.6912742186395134
	      name: Cosine F1
	    - type: cosine_f1_threshold
	      value: 0.825770378112793
	      name: Cosine F1 Threshold
	    - type: cosine_precision
	      value: 0.6289243437982501
	      name: Cosine Precision
	    - type: cosine_recall
	      value: 0.7673469387755102
	      name: Cosine Recall
	    - type: cosine_ap
	      value: 0.7353968345121902
	      name: Cosine Ap
	    - type: cosine_mcc
	      value: 0.4778469995044085
	      name: Cosine Mcc
	  - task:
	      type: binary-classification
	      name: Binary Classification
	    dataset:
	      name: test
	      type: test
	    metrics:
	    - type: cosine_accuracy
	      value: 0.7037777526966672
	      name: Cosine Accuracy
	    - type: cosine_accuracy_threshold
	      value: 0.8524033427238464
	      name: Cosine Accuracy Threshold
	    - type: cosine_f1
	      value: 0.7122170715871171
	      name: Cosine F1
	    - type: cosine_f1_threshold
	      value: 0.8118724822998047
	      name: Cosine F1 Threshold
	    - type: cosine_precision
	      value: 0.5989283084033827
	      name: Cosine Precision
	    - type: cosine_recall
	      value: 0.8783612662942272
	      name: Cosine Recall
	    - type: cosine_ap
	      value: 0.6476665223951498
	      name: Cosine Ap
	    - type: cosine_mcc
	      value: 0.44182914870985407
	      name: Cosine Mcc
	---

	# Redis fine-tuned BiEncoder model for semantic caching on LangCache

	This is a [sentence-transformers](https://www.SBERT.net) model fine-tuned from [Alibaba-NLP/gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) on the [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1) dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used to score the similarity of sentence pairs.

	## Model Details

	### Model Description
	- Model Type: Sentence Transformer
	- Base model: [Alibaba-NLP/gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) <!-- at revision e7f32e3c00f91d699e8c43b53106206bcc72bb22 -->
	- Maximum Sequence Length: 8192 tokens
	- Output Dimensionality: 768 dimensions
	- Similarity Function: Cosine Similarity
	- Training Dataset:
	    - [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
	- Language: en
	- License: apache-2.0

	### Model Sources

	- Documentation: [Sentence Transformers Documentation](https://sbert.net)
	- Repository: [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
	- Hugging Face: [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

	### Full Model Architecture

	```
	SentenceTransformer(
	(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'ModernBertModel'})
	(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
	)
	```
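The pooling configuration above enables only `pooling_mode_cls_token`, so the sentence embedding is the hidden state of the first token rather than an average over all tokens. The sketch below illustrates the difference with NumPy, using random numbers in place of real transformer output. The helper names `cls_pool` and `mean_pool` and the toy shapes are assumptions made for illustration; they are not part of the Sentence Transformers API.

```python
import numpy as np


def cls_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Return the first-token vector of each sequence.

    token_embeddings has shape (batch, seq_len, hidden).
    """
    return token_embeddings[:, 0, :]


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the token vectors, ignoring padded positions (shown for contrast)."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


# Random stand-in for the transformer output: 2 sequences, 5 tokens, 768 dims.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 5, 768))
# The second sequence has two padded positions at the end.
mask = np.array([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])

cls_vectors = cls_pool(tokens)
mean_vectors = mean_pool(tokens, mask)
print(cls_vectors.shape, mean_vectors.shape)
```

With CLS pooling, padded positions never enter the result, which is why `cls_pool` does not need the attention mask.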

	## Usage

	### Direct Usage (Sentence Transformers)

	First install the Sentence Transformers library:

	```bash
	pip install -U sentence-transformers
	```

	Then you can load this model and run inference.
	```python
	from sentence_transformers import SentenceTransformer

	# Download from the 🤗 Hub
	model = SentenceTransformer("redis/langcache-embed-v3")
	# Run inference
	sentences = [
	    'A European Union spokesman said the Commission was consulting EU member states " with a view to taking appropriate action if necessary " on the matter .',
	    "Laos 's second most important export destination - said it was consulting EU member states ' ' with a view to taking appropriate action if necessary ' ' on the matter .",
	    'the form data assumes and the possible range of values that the attribute defined as that type of data may express 1. text 2. numerical',
	]
	# encode() returns a NumPy array by default
	embeddings = model.encode(sentences)
	print(embeddings.shape)
	# (3, 768)

	# Get the similarity scores for the embeddings
	similarities = model.similarity(embeddings, embeddings)
	print(similarities)
	# tensor([[1.0078, 0.8789, 0.4961],
	# [0.8789, 1.0000, 0.4648],
	# [0.4961, 0.4648, 1.0078]], dtype=torch.bfloat16)
	```
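Because the model is aimed at semantic caching, a cache lookup can be pictured as a nearest-neighbour search over stored prompt embeddings with a similarity cutoff. The following is a minimal sketch under stated assumptions: the toy 3-dimensional vectors stand in for `model.encode(...)` output, the `lookup` helper is hypothetical, and the default cutoff of 0.86 is only a starting point loosely based on the `cosine_accuracy_threshold` reported for `val` in the Evaluation section. A real deployment would tune the cutoff on its own traffic.

```python
import numpy as np


def cosine_scores(query: np.ndarray, cached: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and each cached vector."""
    q = query / np.linalg.norm(query)
    c = cached / np.linalg.norm(cached, axis=1, keepdims=True)
    return c @ q


def lookup(query: np.ndarray, cached: np.ndarray, threshold: float = 0.86):
    """Return (index, score) of the best cached entry, or (None, score) on a miss."""
    scores = cosine_scores(query, cached)
    best = int(np.argmax(scores))
    best_score = float(scores[best])
    if best_score >= threshold:
        return best, best_score
    return None, best_score


# Toy embeddings standing in for encoded prompts that are already cached.
cached = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
near = np.array([0.95, 0.1, 0.0])  # almost parallel to the first cached vector
far = np.array([0.6, 0.6, 0.5])    # not close to either cached vector

hit_index, hit_score = lookup(near, cached)
miss_index, miss_score = lookup(far, cached)
print(hit_index, round(hit_score, 3))
print(miss_index, round(miss_score, 3))
```

Raising the threshold trades cache hits for fewer false matches, which mirrors the precision and recall trade-off in the evaluation metrics.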

	<!--
	### Direct Usage (Transformers)

	<details><summary>Click to see the direct usage in Transformers</summary>

	</details>
	-->

	<!--
	### Downstream Usage (Sentence Transformers)

	You can finetune this model on your own dataset.

	<details><summary>Click to expand</summary>

	</details>
	-->

	<!--
	### Out-of-Scope Use

	List how the model may foreseeably be misused and address what users ought not to do with the model.
	-->

	## Evaluation

	### Metrics

	#### Binary Classification

	* Datasets: `val` and `test`
	* Evaluated with [<code>BinaryClassificationEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.BinaryClassificationEvaluator)

	| Metric                    | val    | test   |
	|:--------------------------|:-------|:-------|
	| cosine_accuracy           | 0.7638 | 0.7038 |
	| cosine_accuracy_threshold | 0.8641 | 0.8524 |
	| cosine_f1                 | 0.6913 | 0.7122 |
	| cosine_f1_threshold       | 0.8258 | 0.8119 |
	| cosine_precision          | 0.6289 | 0.5989 |
	| cosine_recall             | 0.7673 | 0.8784 |
	| cosine_ap                 | 0.7354 | 0.6477 |
	| cosine_mcc                | 0.4778 | 0.4418 |
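
To make the threshold-based metrics concrete, the sketch below computes accuracy, precision, recall, F1, and MCC from cosine scores and binary labels at a fixed cutoff. It is an illustrative NumPy re-implementation, not the evaluator's own code: the `binary_metrics` helper, the toy scores, and the choice to count `score >= threshold` as a positive are assumptions. The final lines check that the reported `val` F1 is consistent with the reported precision and recall.

```python
import math

import numpy as np


def binary_metrics(scores: np.ndarray, labels: np.ndarray, threshold: float) -> dict:
    """Classification metrics when pairs scoring >= threshold are predicted positive."""
    pred = scores >= threshold
    pos = labels.astype(bool)
    tp = int(np.sum(pred & pos))
    fp = int(np.sum(pred & ~pos))
    fn = int(np.sum(~pred & pos))
    tn = int(np.sum(~pred & ~pos))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Toy cosine scores and labels (1 = same meaning, 0 = different meaning).
scores = np.array([0.95, 0.90, 0.83, 0.70, 0.88, 0.40])
labels = np.array([1, 1, 1, 0, 0, 0])
# Cutoff taken from the val cosine_f1_threshold in the table above.
metrics = binary_metrics(scores, labels, threshold=0.8258)
print(metrics)

# Consistency check: F1 is the harmonic mean of the reported val precision and recall.
reported_precision, reported_recall = 0.6289, 0.7673
f1_from_table = 2 * reported_precision * reported_recall / (reported_precision + reported_recall)
print(round(f1_from_table, 4))
```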

	<!--
	## Bias, Risks and Limitations

	What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.
	-->

	<!--
	### Recommendations

	What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.
	-->

	## Training Details

	### Training Dataset

	#### LangCache Sentence Pairs (all)

	* Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
	* Size: 8,405 training samples
	* Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
	* Approximate statistics based on the first 1000 samples:
	  |         | sentence1 | sentence2 | label |
	  |:--------|:----------|:----------|:------|
	  | type    | string | string | int |
	  | details | <ul><li>min: 6 tokens</li><li>mean: 24.89 tokens</li><li>max: 50 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 24.3 tokens</li><li>max: 43 tokens</li></ul> | <ul><li>0: ~45.80%</li><li>1: ~54.20%</li></ul> |
	* Samples:
	  | sentence1 | sentence2 | label |
	  |:----------|:----------|:------|
	  | <code>He said the foodservice pie business doesn 't fit the company 's long-term growth strategy .</code> | <code>" The foodservice pie business does not fit our long-term growth strategy .</code> | <code>1</code> |
	  | <code>Magnarelli said Racicot hated the Iraqi regime and looked forward to using his long years of training in the war .</code> | <code>His wife said he was " 100 percent behind George Bush " and looked forward to using his years of training in the war .</code> | <code>0</code> |
	  | <code>The dollar was at 116.92 yen against the yen , flat on the session , and at 1.2891 against the Swiss franc , also flat .</code> | <code>The dollar was at 116.78 yen JPY = , virtually flat on the session , and at 1.2871 against the Swiss franc CHF = , down 0.1 percent .</code> | <code>0</code> |
	* Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
	```json
	{
	"scale": 20.0,
	"similarity_fct": "pairwise_cos_sim"
	}
	```
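
For readers unfamiliar with CoSENT, the loss is a ranking objective over cosine similarities: whenever a pair with a lower label scores higher than a pair with a higher label, the scaled difference is penalised. The NumPy sketch below follows my understanding of the formulation in the cited CoSENT post, using the `scale` of 20.0 listed above. The function name `cosent_loss` and the toy inputs are illustrative assumptions, and this is not the library's implementation.

```python
import numpy as np


def cosent_loss(cos_sims, labels, scale: float = 20.0) -> float:
    """log(1 + sum of exp(scale * (s_i - s_j))) over all (i, j) with label_i < label_j."""
    s = scale * np.asarray(cos_sims, dtype=np.float64)
    y = np.asarray(labels, dtype=np.float64)
    diff = s[:, None] - s[None, :]          # diff[i, j] = s_i - s_j
    should_rank_lower = y[:, None] < y[None, :]  # pair i ought to score below pair j
    # The leading 0.0 contributes the "1 +" term inside the logarithm.
    terms = np.concatenate(([0.0], diff[should_rank_lower]))
    # Numerically stable log-sum-exp.
    m = terms.max()
    return float(m + np.log(np.exp(terms - m).sum()))


# Cosine similarities for two sentence pairs; the first pair is labelled similar.
well_ordered = cosent_loss([0.9, 0.2], [1, 0])  # similar pair scores higher
mis_ordered = cosent_loss([0.2, 0.9], [1, 0])   # similar pair scores lower
print(well_ordered, mis_ordered)
```

The loss is close to zero when the ordering of similarities agrees with the labels and grows roughly linearly with the scaled gap when it does not.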

	### Evaluation Dataset

	#### LangCache Sentence Pairs (all)

	* Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v1)
	* Size: 8,405 evaluation samples
	* Columns: <code>sentence1</code>, <code>sentence2</code>, and <code>label</code>
	* Approximate statistics based on the first 1000 samples:
	  |         | sentence1 | sentence2 | label |
	  |:--------|:----------|:----------|:------|
	  | type    | string | string | int |
	  | details | <ul><li>min: 6 tokens</li><li>mean: 24.89 tokens</li><li>max: 50 tokens</li></ul> | <ul><li>min: 6 tokens</li><li>mean: 24.3 tokens</li><li>max: 43 tokens</li></ul> | <ul><li>0: ~45.80%</li><li>1: ~54.20%</li></ul> |
	* Samples:
	  | sentence1 | sentence2 | label |
	  |:----------|:----------|:------|
	  | <code>He said the foodservice pie business doesn 't fit the company 's long-term growth strategy .</code> | <code>" The foodservice pie business does not fit our long-term growth strategy .</code> | <code>1</code> |
	  | <code>Magnarelli said Racicot hated the Iraqi regime and looked forward to using his long years of training in the war .</code> | <code>His wife said he was " 100 percent behind George Bush " and looked forward to using his years of training in the war .</code> | <code>0</code> |
	  | <code>The dollar was at 116.92 yen against the yen , flat on the session , and at 1.2891 against the Swiss franc , also flat .</code> | <code>The dollar was at 116.78 yen JPY = , virtually flat on the session , and at 1.2871 against the Swiss franc CHF = , down 0.1 percent .</code> | <code>0</code> |
	* Loss: [<code>CoSENTLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cosentloss) with these parameters:
	```json
	{
	"scale": 20.0,
	"similarity_fct": "pairwise_cos_sim"
	}
	```

	### Training Logs
	| Epoch | Step | val_cosine_ap | test_cosine_ap |
	|:-----:|:----:|:-------------:|:--------------:|
	| -1    | -1   | 0.7354        | 0.6477         |


	### Framework Versions
	- Python: 3.12.3
	- Sentence Transformers: 5.1.0
	- Transformers: 4.56.0
	- PyTorch: 2.8.0+cu128
	- Accelerate: 1.10.1
	- Datasets: 4.0.0
	- Tokenizers: 0.22.0

	## Citation

	### BibTeX

	#### Sentence Transformers
	```bibtex
	@inproceedings{reimers-2019-sentence-bert,
	title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
	author = "Reimers, Nils and Gurevych, Iryna",
	booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
	month = "11",
	year = "2019",
	publisher = "Association for Computational Linguistics",
	url = "https://arxiv.org/abs/1908.10084",
	}
	```

	#### CoSENTLoss
	```bibtex
	@online{kexuefm-8847,
	title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
	author={Su Jianlin},
	year={2022},
	month={Jan},
	url={https://kexue.fm/archives/8847},
	}
	```

	<!--
	## Glossary

	Clearly define terms in order to be accessible across audiences.
	-->

	<!--
	## Model Card Authors

	Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.
	-->

	<!--
	## Model Card Contact

	Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.
	-->