pszemraj committed on
Commit 3e7409c · 1 Parent(s): 327fd9c

Update README.md

Files changed (1)
  1. README.md +155 -1
README.md CHANGED
# flan-ul2-text-encoder

The encoder from [flan-ul2](https://huggingface.co/google/flan-ul2). This model is 17.44 GB in `bfloat16` precision.

## basic usage

> Note: this is one way of using the encoder, not the only way. Suggestions and ideas are welcome.

This guide provides a set of functions to calculate the cosine similarity between the embeddings of different texts. The embeddings are calculated using the pre-trained encoder.

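The snippets below are shown without their imports; a reasonable set to run them as written (an assumption, adjust to your setup) is:

```python
from typing import List, Tuple

import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer
```

Note that `device_map="auto"` in `load_model_and_tokenizer` additionally relies on the `accelerate` package being installed.
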
## Functions

### load_model_and_tokenizer

<details>
<summary><b>Details</b></summary>

This function loads the model and tokenizer based on the given model name. It returns a tuple containing the loaded model and tokenizer.

```python
def load_model_and_tokenizer(model_name: str) -> Tuple[AutoModel, AutoTokenizer]:
    """
    Load the model and tokenizer based on the given model name.

    Args:
        model_name (str): The name of the model to be loaded.

    Returns:
        Tuple[AutoModel, AutoTokenizer]: The loaded model and tokenizer.
    """
    model = AutoModel.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model.eval()  # Deactivate Dropout
    return model, tokenizer
```

</details>

### get_embeddings

This function gets the embeddings for the given texts using the provided model and tokenizer. It returns the calculated embeddings.

<details>
<summary><b>Details</b></summary>

```python
def get_embeddings(model: AutoModel, tokenizer: AutoTokenizer, texts: List[str]) -> torch.Tensor:
    """
    Get the embeddings for the given texts using the provided model and tokenizer.

    Args:
        model (AutoModel): The model to be used for getting embeddings.
        tokenizer (AutoTokenizer): The tokenizer to be used for tokenizing the texts.
        texts (List[str]): The texts for which embeddings are to be calculated.

    Returns:
        torch.Tensor: The calculated embeddings.
    """
    # Tokenize input texts
    batch_tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    # Get the last hidden states
    with torch.no_grad():
        last_hidden_state = model(**batch_tokens, output_hidden_states=True, return_dict=True).last_hidden_state

    # Position weights: the token at position i gets weight i
    weights = (
        torch.arange(start=1, end=last_hidden_state.shape[1] + 1)
        .unsqueeze(0)
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float()
        .to(last_hidden_state.device)
    )

    # Attention mask, expanded to the hidden size and moved to the same device as the hidden states
    input_mask_expanded = (
        batch_tokens["attention_mask"]
        .unsqueeze(-1)
        .expand(last_hidden_state.size())
        .float()
        .to(last_hidden_state.device)
    )

    # Perform weighted mean pooling across seq_len: bs, seq_len, hidden_dim -> bs, hidden_dim
    sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded * weights, dim=1)
    sum_mask = torch.sum(input_mask_expanded * weights, dim=1)

    embeddings = sum_embeddings / sum_mask

    return embeddings
```

</details>

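For intuition, the pooling above is the position-weighted mean pooling described in the SGPT repository: the token at position `i` gets weight `i`, and padded positions are masked out. Here is a minimal, self-contained toy check of that formula (the numbers and shapes are made up purely for illustration):

```python
import torch

# Toy tensors: batch of 1, seq_len of 3, hidden_dim of 1
hidden = torch.tensor([[[1.0], [2.0], [3.0]]])         # stands in for last_hidden_state
mask = torch.tensor([[1.0, 1.0, 0.0]]).unsqueeze(-1)   # last position is padding
weights = torch.arange(1, 4).view(1, 3, 1).float()     # position weights 1, 2, 3

pooled = (hidden * mask * weights).sum(dim=1) / (mask * weights).sum(dim=1)
print(pooled)  # (1*1*1 + 2*1*2) / (1*1 + 1*2) = 5/3 ≈ 1.667
```
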
### calculate_cosine_similarity

This function calculates and prints the cosine similarity between the first text and all other texts. It does not return anything.

<details>
<summary><b>Details</b></summary>

```python
def calculate_cosine_similarity(embeddings: torch.Tensor, texts: List[str]) -> None:
    """
    Calculate and print the cosine similarity between the first text and all other texts.

    Args:
        embeddings (torch.Tensor): The embeddings for the texts.
        texts (List[str]): The texts for which cosine similarity is to be calculated.
    """
    # Calculate cosine similarities (cast to float32 on CPU so scipy can consume the tensors)
    for i in range(1, len(embeddings)):
        cosine_sim = 1 - cosine(embeddings[0].float().cpu().numpy(), embeddings[i].float().cpu().numpy())
        print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[i], cosine_sim))
```

</details>

## Usage

To use these functions, you need the `torch`, `transformers`, and `scipy` libraries installed; `accelerate` is also required for `device_map="auto"`. You can install them with pip:

```bash
pip install torch transformers accelerate scipy
```

Then, you can use the functions in your Python code as needed. For example:

```python
model_name = "pszemraj/flan-ul2-text-encoder"
model, tokenizer = load_model_and_tokenizer(model_name)

texts = [
    "deep learning",
    "artificial intelligence",
    "deep diving",
    "artificial snow",
]

embeddings = get_embeddings(model, tokenizer, texts)
calculate_cosine_similarity(embeddings, texts)
```

This will print the cosine similarity between the first text and all other texts in the `texts` list.

<details>
<summary><b>Customization</b></summary>

You can customize the texts by modifying the `texts` list, and you can use a different model by changing the `model_name` variable; a short sketch of a query-vs-corpus comparison follows below.

</details>

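As one possible extension (a minimal sketch, not part of the original guide: the query, the corpus strings, and the use of `torch.nn.functional.cosine_similarity` in place of `scipy` are all illustrative assumptions), the same functions can be used to rank a small corpus against a query:

```python
import torch.nn.functional as F

# Assumes `model`, `tokenizer`, and `get_embeddings` from the snippets above are already defined.
query = "how do transformers encode text?"  # illustrative placeholder
corpus = [
    "Transformer encoders map token sequences to contextual embeddings.",
    "Snow machines produce artificial snow for ski resorts.",
]

embeddings = get_embeddings(model, tokenizer, [query] + corpus)

# Cosine similarity of the query embedding against each corpus embedding
scores = F.cosine_similarity(embeddings[0:1], embeddings[1:], dim=-1)

for text, score in sorted(zip(corpus, scores.tolist()), key=lambda pair: -pair[1]):
    print(f"{score:.3f}\t{text}")
```
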
## References

This guide is based on the examples provided in the [SGPT repository](https://github.com/Muennighoff/sgpt#symmetric-semantic-search-be).