SPLADE Sparse Encoder

This is a SPLADE Sparse Encoder model finetuned from Shuu12121/CodeModernBERT-Finch using the sentence-transformers library. It maps sentences & paragraphs to a 30005-dimensional sparse vector space and can be used for semantic search and sparse retrieval.

Model Details

Model Description

  • Model Type: SPLADE Sparse Encoder
  • Base model: Shuu12121/CodeModernBERT-Finch
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 30005 dimensions
  • Similarity Function: Dot Product

Model Sources

Full Model Architecture

SparseEncoder(
  (0): MLMTransformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'ModernBertForMaskedLM'})
  (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 30005})
)
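
The pooling step listed above can be illustrated with a small, dependency-free sketch. It follows the usual SPLADE formulation (ReLU, then log(1 + x), then a max over input tokens) and is not the library's own code: whether `SpladePooling` applies the log saturation under the name 'relu' is an assumption here, and `splade_pool` is a hypothetical helper name. The toy vocabulary has 4 entries, whereas this model's output has 30005 dimensions.

```python
import math


def splade_pool(token_logits):
    """Collapse per-token vocabulary logits into one sparse vector.

    token_logits: list of rows, one row of vocabulary logits per input token.
    Returns a dict {vocab_index: weight} containing only non-zero weights.
    """
    vocab_size = len(token_logits[0])
    pooled = {}
    for j in range(vocab_size):
        # ReLU, then log(1 + x) saturation, then max over all tokens.
        weight = max(math.log1p(max(row[j], 0.0)) for row in token_logits)
        if weight > 0.0:
            pooled[j] = weight
    return pooled


# Toy input (assumption): 2 tokens, vocabulary of 4 entries.
logits = [
    [-1.0, 0.0, 2.0, -3.0],
    [0.5, -2.0, 1.0, -0.1],
]
vector = splade_pool(logits)
print(vector)  # only dimensions with a positive logit somewhere survive
```

Dimensions whose logits are never positive drop out entirely, which is what makes the resulting vector sparse.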

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SparseEncoder

# Download from the 🤗 Hub
model = SparseEncoder("Shuu12121/Code-SparseEncoder-Finch")
# Run inference
sentences = [
    'Will detect inputs that begin with @MyNamespace/... and replace the namespace with the corresponding path.\n\n@see \\Assetic\\Factory\\AssetFactory::parseInput()',
    'protected function parseInput($input, array $options = array())\n    {\n        $matches = null;\n        // search for @MyNamespace/path/to/asset\n        if (preg_match("|^\\@([a-z_][_a-z0-9]*)/|i", $input, $matches)) {\n            $ns = $matches[1];\n            if (!array_key_exists($ns, $this->namespaces)) {\n                throw new \\RuntimeException("$ns : unknown namespace !");\n            }\n            $input = $this->namespaces[$ns] . substr($input, strlen($ns) + 1);\n        }\n        return parent::parseInput($input, $options);\n    }',
    'function seed_mix() {\n      a ^= b <<  11; d = add(d, a); b = add(b, c);\n      b ^= c >>>  2; e = add(e, b); c = add(c, d);\n      c ^= d <<   8; f = add(f, c); d = add(d, e);\n      d ^= e >>> 16; g = add(g, d); e = add(e, f);\n      e ^= f <<  10; h = add(h, e); f = add(f, g);\n      f ^= g >>>  4; a = add(a, f); g = add(g, h);\n      g ^= h <<   8; b = add(b, g); h = add(h, a);\n      h ^= a >>>  9; c = add(c, h); a = add(a, b);\n    }',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 30005]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[26.3028, 23.1010,  3.4799],
#         [23.1010, 42.4588,  6.9869],
#         [ 3.4799,  6.9869, 59.2962]])
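
Because the similarity function is a dot product over sparse vectors, only dimensions that are active in both vectors contribute to a score. The sketch below shows that idea with plain dictionaries. It is an illustration under stated assumptions, not how the library stores or scores embeddings: `sparse_dot`, `rank`, the dimension indices, and the weights are all made up for the example.

```python
def sparse_dot(query_vec, doc_vec):
    """Dot product of two sparse vectors stored as {dimension: weight} dicts."""
    # Iterate over the smaller dict; only shared dimensions contribute.
    if len(query_vec) > len(doc_vec):
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec[d] for d, w in query_vec.items() if d in doc_vec)


def rank(query_vec, docs):
    """Return (name, score) pairs sorted from best to worst match."""
    scores = [(name, sparse_dot(query_vec, vec)) for name, vec in docs.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical sparse vectors: keys are vocabulary indices, values are weights.
query = {3: 1.5, 17: 0.5}
docs = {
    "parse_input": {3: 2.0, 17: 1.0, 42: 0.3},
    "seed_mix": {8: 1.2, 17: 0.4},
}
ranking = rank(query, docs)
print(ranking)
```

This property is what lets sparse embeddings be served from an inverted index, since documents that share no active dimension with the query always score zero.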

Training Details

Training Dataset

Unnamed Dataset

  • Size: 1,441,500 training samples
  • Columns: text1, text2, and label
  • Approximate statistics based on the first 1000 samples:
                text1           text2            label
    type        string          string           float
    min         3 tokens        28 tokens        1.0
    mean        49.63 tokens    180.64 tokens    1.0
    max         1024 tokens     6082 tokens      1.0
  • Samples:
    Sample 1
      text1: // makeWin32File makes a new win32File from an existing file handle
      text2:
        func makeWin32File(h syscall.Handle) (*win32File, error) {
            f := &win32File{handle: h}
            ioInitOnce.Do(initIo)
            _, err := createIoCompletionPort(h, ioCompletionPort, 0, 0xffffffff)
            if err != nil {
                return nil, err
            }
            err = setFileCompletionNotificationModes(h, cFILE_SKIP_COMPLETION_PORT_ON_SUCCESS|cFILE_SKIP_SET_EVENT_ON_HANDLE)
            if err != nil {
                return nil, err
            }
            f.readDeadline.channel = make(timeoutChan)
            f.writeDeadline.channel = make(timeoutChan)
            return f, nil
        }
      label: 1.0
    Sample 2
      text1: // Convert_v1_FlexPersistentVolumeSource_To_core_FlexPersistentVolumeSource is an autogenerated conversion function.
      text2:
        func Convert_v1_FlexPersistentVolumeSource_To_core_FlexPersistentVolumeSource(in *v1.FlexPersistentVolumeSource, out *core.FlexPersistentVolumeSource, s conversion.Scope) error {
            return autoConvert_v1_FlexPersistentVolumeSource_To_core_FlexPersistentVolumeSource(in, out, s)
        }
      label: 1.0
    Sample 3
      text1: // AddRunCmd is defined on the RunCmdsConfig interface.
      text2:
        func (cfg *cloudConfig) AddRunCmd(args ...string) {
            cfg.attrs["runcmd"] = append(cfg.RunCmds(), strings.Join(args, " "))
        }
      label: 1.0
  • Loss: SpladeLoss with these parameters:
    {
        "loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score')",
        "document_regularizer_weight": 3e-05,
        "query_regularizer_weight": 5e-05
    }
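
The loss configuration above combines an in-batch ranking loss with two sparsity regularizers. The following sketch shows how such a combination can be computed on toy data. It follows the general formulations in the cited papers (cross-entropy over in-batch dot-product scores, plus a FLOPS term equal to the sum over dimensions of the squared mean activation) and is not the library's implementation. The function names and the toy vectors are assumptions made for illustration; only the two weights and `scale=1.0` come from the configuration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def in_batch_ranking_loss(queries, docs, scale=1.0):
    """Cross-entropy where docs[i] is the positive for queries[i];
    every other document in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        scores = [scale * dot(q, d) for d in docs]
        top = max(scores)  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[i]
    return total / len(queries)


def flops(batch):
    """FLOPS regularizer: sum over dimensions of the squared mean activation."""
    n = len(batch)
    dims = len(batch[0])
    return sum((sum(abs(vec[j]) for vec in batch) / n) ** 2 for j in range(dims))


def splade_loss(queries, docs, query_weight=5e-05, document_weight=3e-05):
    """Ranking loss plus weighted sparsity penalties for both sides."""
    return (
        in_batch_ranking_loss(queries, docs)
        + query_weight * flops(queries)
        + document_weight * flops(docs)
    )


# Toy batch (assumption): 2 query/document pairs in a 3-dimensional space.
queries = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
docs = [[2.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
loss = splade_loss(queries, docs)
print(loss)
```

With weights this small, the regularizers barely move the total on a toy batch; their role is to nudge activations toward zero over many steps rather than to dominate the ranking objective.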
    

Evaluation Dataset

Unnamed Dataset

  • Size: 6,000 evaluation samples
  • Columns: text1, text2, and label
  • Approximate statistics based on the first 1000 samples:
                text1           text2            label
    type        string          string           float
    min         3 tokens        29 tokens        1.0
    mean        45.53 tokens    183.92 tokens    1.0
    max         495 tokens      7677 tokens      1.0
  • Samples:
    Sample 1
      text1: // establish data storage, format and dimensions of a renderbuffer object's image
      text2:
        func RenderbufferStorage(target uint32, internalformat uint32, width int32, height int32) {
            syscall.Syscall6(gpRenderbufferStorage, 4, uintptr(target), uintptr(internalformat), uintptr(width), uintptr(height), 0, 0)
        }
      label: 1.0
    Sample 2
      text1:
        // GetObject is a wrapper around gtk_builder_get_object(). The returned result
        // is an IObject, so it will need to be type-asserted to the appropriate type before
        // being used. For example, to get an object and type assert it as a window:
        //
        // obj, err := builder.GetObject("window")
        // if err != nil {
        // // object not found
        // return
        // }
        // if w, ok := obj.(*gtk.Window); ok {
        // // do stuff with w here
        // } else {
        // // not a *gtk.Window
        // }
        //
      text2:
        func (b *Builder) GetObject(name string) (glib.IObject, error) {
            cstr := C.CString(name)
            defer C.free(unsafe.Pointer(cstr))
            c := C.gtk_builder_get_object(b.native(), (*C.gchar)(cstr))
            if c == nil {
                return nil, errors.New("object '" + name + "' not found")
            }
            obj, err := cast(c)
            if err != nil {
                return nil, err
            }
            return obj, nil
        }
      label: 1.0
    Sample 3
      text1:
        // augmentGoroutine processes source files to improve call to be more
        // descriptive.
        //
        // It modifies the routine.
      text2:
        func (c *cache) augmentGoroutine(goroutine *Goroutine) {
            if c.files == nil {
                c.files = map[string][]byte{}
            }
            if c.parsed == nil {
                c.parsed = map[string]*parsedFile{}
            }
            // For each call site, look at the next call and populate it. Then we can
            // walk back and reformat things.
            for i := range goroutine.Stack.Calls {
                c.load(goroutine.Stack.Calls[i].LocalSrcPath)
            }

            // Once all loaded, we can look at the next call when available.
            for i := 0; i < len(goroutine.Stack.Calls)-1; i++ {
                // Get the AST from the previous call and process the call line with it.
                if f := c.getFuncAST(&goroutine.Stack.Calls[i]); f != nil {
                    processCall(&goroutine.Stack.Calls[i], f)
                }
            }
        }
      label: 1.0
  • Loss: SpladeLoss with these parameters:
    {
        "loss": "SparseMultipleNegativesRankingLoss(scale=1.0, similarity_fct='dot_score')",
        "document_regularizer_weight": 3e-05,
        "query_regularizer_weight": 5e-05
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 2
  • gradient_accumulation_steps: 25
  • num_train_epochs: 1
  • warmup_ratio: 0.1
  • fp16: True
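
As a quick sanity check, the arithmetic below relates these hyperparameters to the training set size and to the step and epoch columns in the training logs. It assumes a single training device, which the card does not state; that assumption is consistent with the logged epoch fraction of 0.0173 at step 500. The warmup figure is approximate, since the exact rounding depends on the trainer.

```python
import math

train_samples = 1_441_500   # from the training dataset section
per_device_batch = 2        # per_device_train_batch_size
grad_accum = 25             # gradient_accumulation_steps
num_devices = 1             # assumption: single device, not stated in the card
warmup_ratio = 0.1

# Samples contributing to each optimizer update.
effective_batch = per_device_batch * grad_accum * num_devices

# Forward/backward passes per epoch, then optimizer updates per epoch.
micro_batches = math.ceil(train_samples / (per_device_batch * num_devices))
optimizer_steps = micro_batches // grad_accum

# Approximate number of warmup updates under the linear schedule.
warmup_steps = round(optimizer_steps * warmup_ratio)

# Fraction of the epoch completed at the first logged step.
epoch_at_step_500 = 500 / optimizer_steps

print(effective_batch, optimizer_steps, warmup_steps, round(epoch_at_step_500, 4))
```

Note that gradient accumulation increases the number of samples per update but not the number of in-batch negatives, which is still bounded by the per-device batch size.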

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 8
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 25
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss
0.0173 500 252.5855
0.0347 1000 0.4281
0.0520 1500 0.071
0.0694 2000 0.0579
0.0867 2500 0.04
0.1041 3000 0.0422
0.1214 3500 0.041
0.1387 4000 0.0347
0.1561 4500 0.0341
0.1734 5000 0.0288
0.1908 5500 0.0243
0.2081 6000 0.0249
0.2255 6500 0.0242
0.2428 7000 0.0204
0.2601 7500 0.0206
0.2775 8000 0.0198
0.2948 8500 0.0205
0.3122 9000 0.0176
0.3295 9500 0.0207
0.3469 10000 0.0196
0.3642 10500 0.0132
0.3815 11000 0.016
0.3989 11500 0.0151
0.4162 12000 0.0168
0.4336 12500 0.0161
0.4509 13000 0.0156
0.4683 13500 0.0134
0.4856 14000 0.0156
0.5029 14500 0.0138
0.5203 15000 0.0134
0.5376 15500 0.0146
0.5550 16000 0.0153
0.5723 16500 0.0135
0.5897 17000 0.0136
0.6070 17500 0.0122
0.6243 18000 0.0115
0.6417 18500 0.0132
0.6590 19000 0.0101
0.6764 19500 0.0092
0.6937 20000 0.0117
0.7111 20500 0.0098
0.7284 21000 0.0122
0.7458 21500 0.0102
0.7631 22000 0.0088
0.7804 22500 0.0093
0.7978 23000 0.0101
0.8151 23500 0.0083
0.8325 24000 0.0095
0.8498 24500 0.0081
0.8672 25000 0.0095
0.8845 25500 0.009
0.9018 26000 0.0081
0.9192 26500 0.0065
0.9365 27000 0.009
0.9539 27500 0.0075
0.9712 28000 0.0078
0.9886 28500 0.0094

Framework Versions

  • Python: 3.11.13
  • Sentence Transformers: 5.0.0
  • Transformers: 4.53.1
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.8.1
  • Datasets: 3.6.0
  • Tokenizers: 0.21.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

SpladeLoss

@misc{formal2022distillationhardnegativesampling,
      title={From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
      author={Thibault Formal and Carlos Lassance and Benjamin Piwowarski and Stéphane Clinchant},
      year={2022},
      eprint={2205.04733},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2205.04733},
}

SparseMultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

FlopsLoss

@article{paria2020minimizing,
    title={Minimizing flops to learn efficient sparse representations},
    author={Paria, Biswajit and Yeh, Chih-Kuan and Yen, Ian EH and Xu, Ning and Ravikumar, Pradeep and P{\'o}czos, Barnab{\'a}s},
    journal={arXiv preprint arXiv:2004.05665},
    year={2020}
}

Model size: 40.8M params (Safetensors, tensor type F32)