Instructions to use fyaronskiy/english_code_retriever with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use fyaronskiy/english_code_retriever with sentence-transformers:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("fyaronskiy/english_code_retriever")

sentences = [
    "search_query: Finds the top long, short, and absolute positions.\n\n    Parameters\n    ----------\n    positions : pd.DataFrame\n        The positions that the strategy takes over time.\n    top : int, optional\n        How many of each to find (default 10).\n\n    Returns\n    -------\n    df_top_long : pd.DataFrame\n        Top long positions.\n    df_top_short : pd.DataFrame\n        Top short positions.\n    df_top_abs : pd.DataFrame\n        Top absolute positions.",
    "search_document: def symmetric_ema(xolds, yolds, low=None, high=None, n=512, decay_steps=1., low_counts_threshold=1e-8):\n    '''\n    perform symmetric EMA (exponential moving average)\n    smoothing and resampling to an even grid with n points.\n    Does not do extrapolation, so we assume\n    xolds[0] <= low && high <= xolds[-1]\n\n    Arguments:\n\n    xolds: array or list  - x values of data. Needs to be sorted in ascending order\n    yolds: array of list  - y values of data. Has to have the same length as xolds\n\n    low: float            - min value of the new x grid. By default equals to xolds[0]\n    high: float           - max value of the new x grid. By default equals to xolds[-1]\n\n    n: int                - number of points in new x grid\n\n    decay_steps: float    - EMA decay factor, expressed in new x grid steps.\n\n    low_counts_threshold: float or int\n                          - y values with counts less than this value will be set to NaN\n\n    Returns:\n        tuple sum_ys, count_ys where\n            xs        - array with new x grid\n            ys        - array of EMA of y at each point of the new x grid\n            count_ys  - array of EMA of y counts at each point of the new x grid\n\n    '''\n    xs, ys1, count_ys1 = one_sided_ema(xolds, yolds, low, high, n, decay_steps, low_counts_threshold=0)\n    _,  ys2, count_ys2 = one_sided_ema(-xolds[::-1], yolds[::-1], -high, -low, n, decay_steps, low_counts_threshold=0)\n    ys2 = ys2[::-1]\n    count_ys2 = count_ys2[::-1]\n    count_ys = count_ys1 + count_ys2\n    ys = (ys1 * count_ys1 + ys2 * count_ys2) / count_ys\n    ys[count_ys < low_counts_threshold] = np.nan\n    return xs, ys, count_ys",
    "search_document: def project(self, from_shape, to_shape):\n        \"\"\"\n        Project the polygon onto an image with different shape.\n\n        The relative coordinates of all points remain the same.\n        E.g. a point at (x=20, y=20) on an image (width=100, height=200) will be\n        projected on a new image (width=200, height=100) to (x=40, y=10).\n\n        This is intended for cases where the original image is resized.\n        It cannot be used for more complex changes (e.g. padding, cropping).\n\n        Parameters\n        ----------\n        from_shape : tuple of int\n            Shape of the original image. (Before resize.)\n\n        to_shape : tuple of int\n            Shape of the new image. (After resize.)\n\n        Returns\n        -------\n        imgaug.Polygon\n            Polygon object with new coordinates.\n\n        \"\"\"\n        if from_shape[0:2] == to_shape[0:2]:\n            return self.copy()\n        ls_proj = self.to_line_string(closed=False).project(\n            from_shape, to_shape)\n        return self.copy(exterior=ls_proj.coords)",
    "search_document: def get_top_long_short_abs(positions, top=10):\n    \"\"\"\n    Finds the top long, short, and absolute positions.\n\n    Parameters\n    ----------\n    positions : pd.DataFrame\n        The positions that the strategy takes over time.\n    top : int, optional\n        How many of each to find (default 10).\n\n    Returns\n    -------\n    df_top_long : pd.DataFrame\n        Top long positions.\n    df_top_short : pd.DataFrame\n        Top short positions.\n    df_top_abs : pd.DataFrame\n        Top absolute positions.\n    \"\"\"\n\n    positions = positions.drop('cash', axis='columns')\n    df_max = positions.max()\n    df_min = positions.min()\n    df_abs_max = positions.abs().max()\n    df_top_long = df_max[df_max > 0].nlargest(top)\n    df_top_short = df_min[df_min < 0].nsmallest(top)\n    df_top_abs = df_abs_max.nlargest(top)\n    return df_top_long, df_top_short, df_top_abs"
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

Notebooks
Google Colab
Kaggle

SentenceTransformer

This is a answerdotai/ModernBERT-base model trained on the code_search_net dataset with MultipleNegativesRankingLoss with in-batch negatives. Model can be used for code retrieval and reranking.

Perfomance on code retrieval benchmarks

RTEB

On 14.10.2025 the model is 6th on RTEB leaderbord among models with <500M parameters:

Click

Perfomance per task:

Model	AppsRetrieval	Code1Retrieval (Private)	DS1000Retrieval	FreshStackRetrieval	HumanEvalRetrieval	JapaneseCode1Retrieval (Private)	MBPPRetrieval	WikiSQLRetrieval
english_code_retriever	8.04	75.36	32.42	18.30	71.82	46.59	72.06	87.92

COIR:

Model	AppsRetrieval	COIRCodeSearchNetRetrieval	CodeFeedbackMT	CodeFeedbackST	CodeSearchNetCCRetrieval	CodeTransOceanContest	CodeTransOceanDL	CosQA	StackOverflowQA	SyntheticText2SQL
english_code_retriever	8.04	74.23	44.01	57.79	42.71	60.68	35.16	25.56	56.53	42.79

more information you cand find in MTEB leaderbord

Model Details

Model Description

Model Type: Sentence Transformer
Maximum Sequence Length: 8192 tokens
Output Dimensionality: 768
Similarity Function: Cosine Similarity
Mean pooling

Usage

Using is easy with Sentence Transformers.

Pay attention that model was trained with prefixes 'search_query' for queries and 'search_document' for docs with code. So using with prefixes will improve model retrieving abilities.

import torch
from sentence_transformers import SentenceTransformer, util

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("fyaronskiy/english_code_retriever").to(device)

queries = [
    "Write a Python function that calculates the factorial of a number recursively.",
    "How to check if a given string reads the same backward and forward?",
    "Combine two sorted lists into a single sorted list."
]

corpus = [
    # Relevant for Q1
    """def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n-1)""",

    # Hard negative for Q1 (similar structure but computes sum)
    """def sum_recursive(n):
    if n == 0:
        return 0
    return n + sum_recursive(n-1)""",

    # Relevant for Q2
    """def is_palindrome(s: str) -> bool:
    s = s.lower().replace(" ", "")
    return s == s[::-1]""",

    # Hard negative for Q2 (string reverse but not palindrome check)
    """def reverse_string(s: str) -> str:
    return s[::-1]""",

    # Relevant for Q3
    """def merge_sorted_lists(a, b):
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])
    result.extend(b[j:])
    return result""",

    # Hard negative for Q3 (similar iteration but sums two lists elementwise)
    """def add_lists(a, b):
    return [x + y for x, y in zip(a, b)]"""
]


doc_embeddings = model.encode(corpus, prompt_name='search_query', convert_to_tensor=True, device=device)
query_embeddings = model.encode(queries, prompt_name='search_document', convert_to_tensor=True, device=device)

# Compute cosine similarity and retrieve top-1
for i, query in enumerate(queries):
    scores = util.cos_sim(query_embeddings[i], doc_embeddings)[0]
    best_idx = torch.argmax(scores).item()
    print(f"\n Query {i+1}: {query}")
    print(f"Top-1 match (score={scores[best_idx]:.4f}):\n{corpus[best_idx]}")

''' Query 1: Write a Python function that calculates the factorial of a number recursively.
Top-1 match (score=0.5983):
def factorial(n):
    if n == 0:
        return 1
    return n * factorial(n-1)

 Query 2: How to check if a given string reads the same backward and forward?
Top-1 match (score=0.4925):
def is_palindrome(s: str) -> bool:
    s = s.lower().replace(" ", "")
    return s == s[::-1]

 Query 3: Combine two sorted lists into a single sorted list.
Top-1 match (score=0.6524):
def merge_sorted_lists(a, b):
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])
    result.extend(b[j:])
    return result
'''

Using with Transformers

import torch
from transformers import AutoTokenizer, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "fyaronskiy/english_code_retriever"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to(device)
model.eval()



queries = [
"function of addition of two numbers",
"finding the maximum element in an array",
"sorting a list in ascending order"
]

corpus = [
    "def add(a, b): return a + b",
    "def find_max(arr): return max(arr)",
    "def sort_list(lst): return sorted(lst)"
]

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # (batch_size, seq_len, hidden_dim)
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return (token_embeddings * input_mask_expanded).sum(1) / input_mask_expanded.sum(1).clamp(min=1e-9)

def encode_texts(texts):
    encoded = tokenizer(
        texts,
        padding=True,
        truncation=True,
        return_tensors="pt",
        max_length=8192
    ).to(device)
    with torch.no_grad():
        model_output = model(**encoded)
    return mean_pooling(model_output, encoded["attention_mask"])

doc_embeddings = encode_texts(["search_document: " + document  for document in corpus])
query_embeddings = encode_texts(["search_query: " + query  for query in queries])

# Normalize embeddings for cosine similarity
doc_embeddings = torch.nn.functional.normalize(doc_embeddings, p=2, dim=1)
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)

# Compute cosine similarity and retrieve top-1
for i, query in enumerate(queries):
    scores = torch.matmul(query_embeddings[i], doc_embeddings.T)
    best_idx = torch.argmax(scores).item()
    print(f"\n Query {i+1}: {query}")
    print(f"Top-1 match (score={scores[best_idx]:.4f}):\n{corpus[best_idx]}")

''' Query 1: function of addition of two numbers
Top-1 match (score=0.6047):
def add(a, b): return a + b

 Query 2: finding the maximum element in an array
Top-1 match (score=0.7772):
def find_max(arr): return max(arr)

 Query 3: sorting a list in ascending order
Top-1 match (score=0.7389):
def sort_list(lst): return sorted(lst)
'''

Evaluation

Metrics

Information Retrieval

Dataset: validation part of codesearchnet_val
Size: 30,000 evaluation samples
Evaluated with InformationRetrievalEvaluator

Metric	Value
cosine_accuracy@1	0.8926
cosine_accuracy@3	0.9454
cosine_accuracy@5	0.9545
cosine_accuracy@10	0.9638
cosine_precision@1	0.8926
cosine_precision@3	0.3151
cosine_precision@5	0.1909
cosine_precision@10	0.0964
cosine_recall@1	0.8926
cosine_recall@3	0.9454
cosine_recall@5	0.9545
cosine_recall@10	0.9638
cosine_ndcg@10	0.9313
cosine_mrr@10	0.9206
cosine_map@100	0.9212