metadata

license: mit
base_model:
  - Xenova/distiluse-base-multilingual-cased-v2
pipeline_tag: feature-extraction
tags:
  - feature-extraction
  - sentence-embeddings
  - sentence-transformers
  - sentence-similarity
  - semantic-search
  - vector-search
  - retrieval-augmented-generation
  - multilingual
  - cross-lingual
  - low-resource
  - merged-model
  - combined-model
  - tokenizer-embedded
  - tokenizer-integrated
  - standalone
  - all-in-one
  - quantized
  - int8
  - int8-quantization
  - optimized
  - efficient
  - fast-inference
  - low-latency
  - lightweight
  - small-model
  - edge-ready
  - arm64
  - edge-device
  - mobile-device
  - on-device
  - mobile-inference
  - tablet
  - smartphone
  - embedded-ai
  - onnx
  - onnx-runtime
  - onnx-model
  - transformers
  - MiniLM
  - MiniLM-L12-v2
  - paraphrase
  - usecase-ready
  - plug-and-play
  - production-ready
  - deployment-ready
  - real-time
  - fasttext
  - distiluse

🧠 Unified Multilingual Distiluse Text Embedder (ONNX + Tokenizer Merged)

This is a highly optimized, quantized, and fully standalone model for generating sentence embeddings from multilingual text, including Ukrainian, English, Polish, and more.

Built upon distiluse-base-multilingual-cased-v2, the model has been:

🔁 Merged with its tokenizer into a single ONNX file
⚙️ Extended with a custom preprocessing layer
⚡ Quantized to INT8 and ARM64-ready
🧪 Extensively tested across real-world NLP tasks
🛠️ Bug-fixed vs the original sentence-transformers quantized version that produced inaccurate cosine similarity

🚀 Key Features

🧩 Single-file architecture: no need for external tokenizer, vocab, or transformers library.
⚡ 93% faster inference on mobile compared to the original model.
🗣️ Multilingual: robust across many languages, including low-resource ones.
🧠 Output = pure embeddings: pass a string, get a 768-dim vector. That’s it.
🛠️ Ready for production: small, fast, accurate, and easy to integrate.
📱 Ideal for edge-AI, mobile, and offline scenarios.

🤖 Author @vlad-m-dev Built for edge-ai/phone/tablet offline Telegram: https://t.me/dwight_schrute_engineer

🐍 Python Example

import numpy as np
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

sess_options = ort.SessionOptions()
sess_options.register_custom_ops_library(get_library_path())

session = ort.InferenceSession(
    'model.onnx',
    sess_options=sess_options,
    providers=['CPUExecutionProvider']
)

input_feed = {"text": np.asarray(['something..'])}
outputs = session.run(None, input_feed)
embedding = outputs[0]

🐍 JS Example

const session = await InferenceSession.create(EMBEDDING_FULL_MODEL_PATH); 
const inputTensor = new Tensor('string', ['something..'], [1]); 
const feeds = { text: inputTensor };
const outputMap = await session.run(feeds);
const embedding = outputMap.text_embedding.data;