martinhillebrandtd committed · Commit e9e1652 · 1 Parent(s): 01cf913
README.md CHANGED
@@ -1,3 +1,165 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ pipeline_tag: sentence-similarity
+ tags:
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+ - onnx
+ - teradata
+
+ ---
+ # A Teradata Vantage compatible Embeddings Model
+
+ # BAAI/bge-m3
+
+ ## Overview of this Model
+
+ An embedding model that maps text (sentences/paragraphs) into a vector. The [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) model is well known for its effectiveness in capturing semantic meaning in text data. It is a state-of-the-art model trained on a large corpus, capable of generating high-quality text embeddings.
+
+ - 567.75M params (sizes in ONNX format - "int8": 542.57MB, "uint8": 542.57MB)
+ - 8194 maximum input tokens
+ - 1024 dimensions of output vector
+ - License: MIT. The released models can be used for commercial purposes free of charge.
+ - Reference to the original model: https://huggingface.co/BAAI/bge-m3
+
+
+ ## Quickstart: Deploying this Model in Teradata Vantage
+
+ We have pre-converted the model into the ONNX format compatible with BYOM 6.0, eliminating the need for manual conversion.
+
+ **Note:** Ensure you have access to a Teradata Database with BYOM 6.0 installed.
+
+ For detailed information, refer to the ONNXEmbeddings documentation: TODO
+
+ To get started, clone the pre-converted model directly from the Teradata HuggingFace repository.
+
+
+ ```python
+ import teradataml as tdml
+ import getpass
+ from huggingface_hub import hf_hub_download
+
+ model_name = "bge-m3"
+ number_dimensions_output = 1024
+ model_file_name = "model_int8.onnx"
+
+ # Step 1: Download Model from Teradata HuggingFace Page
+ hf_hub_download(repo_id=f"Teradata/{model_name}", filename=f"onnx/{model_file_name}", local_dir="./")
+ hf_hub_download(repo_id=f"Teradata/{model_name}", filename="tokenizer.json", local_dir="./")
+
+ # Step 2: Create Connection to Vantage
+ tdml.create_context(host=input('enter your hostname'),
+                     username=input('enter your username'),
+                     password=getpass.getpass('enter your password'))
+
+ # Step 3: Load Models into Vantage
+ # a) Embedding model
+ tdml.save_byom(model_id=model_name,  # must be unique in the models table
+                model_file=model_file_name,
+                table_name='embeddings_models')
+ # b) Tokenizer
+ tdml.save_byom(model_id=model_name,  # must be unique in the models table
+                model_file='tokenizer.json',
+                table_name='embeddings_tokenizers')
+
+ # Step 4: Test ONNXEmbeddings Function
+ # Note that ONNXEmbeddings expects the 'payload' column to be named 'txt'.
+ # If it has a different name, rename it in a subquery/CTE.
+ input_table = "emails.emails"
+ embeddings_query = f"""
+ SELECT
+     *
+ FROM mldb.ONNXEmbeddings(
+     ON {input_table} AS InputTable
+     ON (SELECT * FROM embeddings_models WHERE model_id = '{model_name}') AS ModelTable DIMENSION
+     ON (SELECT model AS tokenizer FROM embeddings_tokenizers WHERE model_id = '{model_name}') AS TokenizerTable DIMENSION
+     USING
+         Accumulate('id', 'txt')
+         ModelOutputTensor('sentence_embedding')
+         EnableMemoryCheck('false')
+         OutputFormat('FLOAT32({number_dimensions_output})')
+         OverwriteCachedModel('true')
+ ) a
+ """
+ DF_embeddings = tdml.DataFrame.from_query(embeddings_query)
+ DF_embeddings
+ ```
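+
+ Once the query runs, the embeddings are materialized in Vantage. As a quick sanity check you can pull a small sample client-side; a minimal sketch (note that `to_pandas()` transfers rows to the client, so keep the sample small):
+
+ ```python
+ # Pull a few rows of the embeddings client-side for inspection.
+ sample = DF_embeddings.head(5).to_pandas()
+ print(sample.shape)
+ ```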
+
+ ## What Can I Do with the Embeddings?
+
+ Teradata Vantage includes pre-built in-database functions to process embeddings further. Explore the following examples; a minimal TD_VectorDistance sketch follows the list.
+
+ - **Semantic Clustering with TD_KMeans:** [Semantic Clustering Python Notebook](https://github.com/Teradata/jupyter-demos/blob/main/UseCases/Language_Models_InVantage/Semantic_Clustering_Python.ipynb)
+ - **Semantic Distance with TD_VectorDistance:** [Semantic Similarity Python Notebook](https://github.com/Teradata/jupyter-demos/blob/main/UseCases/Language_Models_InVantage/Semantic_Similarity_Python.ipynb)
+ - **RAG-Based Application with TD_VectorDistance:** [RAG and Bedrock Query PDF Notebook](https://github.com/Teradata/jupyter-demos/blob/main/UseCases/Language_Models_InVantage/RAG_and_Bedrock_QueryPDF.ipynb)
+
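+ As a taste of the semantic-distance workflow, here is a minimal sketch of an in-database similarity search with `TD_VECTORDISTANCE`, modeled on [test_teradata.py](./test_teradata.py) in this repository. The table and column names (`emails_embeddings_store`, `emb_0`, ...) are illustrative and assume embeddings were stored as in the quickstart above:
+
+ ```python
+ # Hypothetical example: top-3 nearest neighbors of row id=3 by cosine distance.
+ # Assumes a table with columns id, txt, emb_0 ... emb_1023.
+ similarity_query = f"""
+ SELECT dt.target_id, dt.reference_id, (1.0 - dt.distance) AS similarity
+ FROM TD_VECTORDISTANCE(
+     ON (SELECT * FROM emails_embeddings_store WHERE id = 3) AS TargetTable
+     ON (SELECT * FROM emails_embeddings_store WHERE id <> 3) AS ReferenceTable DIMENSION
+     USING
+         TargetIDColumn('id')
+         TargetFeatureColumns('[emb_0:emb_{number_dimensions_output - 1}]')
+         RefIDColumn('id')
+         RefFeatureColumns('[emb_0:emb_{number_dimensions_output - 1}]')
+         DistanceMeasure('cosine')
+         TopK(3)
+ ) AS dt
+ """
+ tdml.DataFrame.from_query(similarity_query)
+ ```
+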
+
+ ## Deep Dive into Model Conversion to ONNX
+
+ **The steps below outline how we converted the open-source Hugging Face model into an ONNX file compatible with the in-database ONNXEmbeddings function.**
+
+ You do not need to perform these steps; they are provided solely for documentation and transparency. However, they may be helpful if you wish to convert another model to the required format.
+
+
+ ### Part 1. Importing and Converting the Model Using optimum
+
+ We start by importing the pre-trained [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) model from Hugging Face.
+
+ To enhance performance and ensure compatibility with various execution environments, we use the [Optimum](https://github.com/huggingface/optimum) utility to convert the model into the ONNX (Open Neural Network Exchange) format.
+
+ After conversion, we fix the opset version in the ONNX file for compatibility with the ONNX runtime used in Teradata Vantage.
+
+ We generate ONNX files for multiple precisions: int8 and uint8.
+
+ You can find the detailed conversion steps in the file [convert.py](./convert.py).
+
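+ For reference, the opset fix boils down to rebuilding the exported model with a pinned opset and IR version, as done in [convert.py](./convert.py). A condensed sketch (the opset/IR values below are placeholders; the real ones come from `conversion_config.json`):
+
+ ```python
+ import onnx
+
+ # Placeholder values; convert.py reads these from conversion_config.json.
+ opset, IR = 17, 8
+
+ op = onnx.OperatorSetIdProto()
+ op.version = opset
+
+ model = onnx.load("model.onnx")
+ # Rebuild the model so its opset imports and IR version match the target runtime.
+ model = onnx.helper.make_model(model.graph, ir_version=IR, opset_imports=[op])
+ onnx.save(model, "onnx/model_fixed.onnx")
+ ```
+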
+ ### Part 2. Running the Model in Python with onnxruntime and Comparing Results
+
+ Once the fixes are applied, we test the correctness of the ONNX model by calculating the cosine similarity between two texts with both native SentenceTransformers and the ONNX runtime, and comparing the results.
+
+ If the results match closely (small deviations are expected for quantized weights), it confirms that the ONNX model reproduces the native model's behavior, validating its correctness and suitability for further use in the database.
+
+
+ ```python
+ import onnxruntime as rt
+ import transformers
+ from sentence_transformers.util import cos_sim
+ from sentence_transformers import SentenceTransformer
+
+ sentences_1 = 'How is the weather today?'
+ sentences_2 = 'What is the current weather like today?'
+
+ # Calculate ONNX result
+ tokenizer = transformers.AutoTokenizer.from_pretrained("BAAI/bge-m3")
+ predef_sess = rt.InferenceSession("onnx/model_int8.onnx")
+
+ enc1 = tokenizer(sentences_1)
+ embeddings_1_onnx = predef_sess.run(None, {"input_ids": [enc1.input_ids],
+                                            "attention_mask": [enc1.attention_mask]})
+
+ enc2 = tokenizer(sentences_2)
+ embeddings_2_onnx = predef_sess.run(None, {"input_ids": [enc2.input_ids],
+                                            "attention_mask": [enc2.attention_mask]})
+
+ # Calculate embeddings with SentenceTransformer
+ model = SentenceTransformer("BAAI/bge-m3", trust_remote_code=True)
+ embeddings_1_sentence_transformer = model.encode(sentences_1, normalize_embeddings=True)
+ embeddings_2_sentence_transformer = model.encode(sentences_2, normalize_embeddings=True)
+
+ # Compare results (output index 1 is the pooled 'sentence_embedding' tensor)
+ print("Cosine similarity for embeddings calculated with ONNX: " + str(cos_sim(embeddings_1_onnx[1][0], embeddings_2_onnx[1][0])))
+ print("Cosine similarity for embeddings calculated with SentenceTransformer: " + str(cos_sim(embeddings_1_sentence_transformer, embeddings_2_sentence_transformer)))
+ ```
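+
+ Note that int8/uint8 quantization introduces small numeric deviations, so the two scores will rarely be bit-identical. A tolerance check is therefore more robust than eyeballing printed values; a minimal sketch, assuming `numpy` is available:
+
+ ```python
+ import numpy as np
+
+ # Hypothetical tolerance; quantized models deviate slightly from the fp32 original.
+ sim_onnx = float(cos_sim(embeddings_1_onnx[1][0], embeddings_2_onnx[1][0]))
+ sim_native = float(cos_sim(embeddings_1_sentence_transformer, embeddings_2_sentence_transformer))
+ assert np.isclose(sim_onnx, sim_native, atol=1e-2), f"similarity mismatch: {sim_onnx} vs {sim_native}"
+ ```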
+
+ You can find the detailed ONNX vs. SentenceTransformer comparison steps in the file [test_local.py](./test_local.py).
+
config.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:08e8e41086ac9578be304eb4ca596dac64f965595316aabd116adec08d6e3e39
+ size 776
conversion_config.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:707a2be6604284c248968e44faf46d64e6950425bf05fa66e3de780c7c3359d5
+ size 272
convert.py ADDED
@@ -0,0 +1,65 @@
+ import os
+ import json
+ import shutil
+
+ from optimum.exporters.onnx import main_export
+ import onnx
+ from onnxconverter_common import float16
+ import onnxruntime as rt
+ from onnxruntime.tools.onnx_model_utils import *
+ from onnxruntime.quantization import quantize_dynamic, QuantType
+
+ with open('conversion_config.json') as json_file:
+     conversion_config = json.load(json_file)
+
+ model_id = conversion_config["model_id"]
+ number_of_generated_embeddings = conversion_config["number_of_generated_embeddings"]
+ precision_to_filename_map = conversion_config["precision_to_filename_map"]
+ opset = conversion_config["opset"]
+ IR = conversion_config["IR"]
+
+ # Opset descriptor used to pin the opset version on the exported model.
+ op = onnx.OperatorSetIdProto()
+ op.version = opset
+
+ if not os.path.exists("onnx"):
+     os.makedirs("onnx")
+
+ print("Exporting the main model version")
+ main_export(model_name_or_path=model_id, output="./", opset=opset, trust_remote_code=True, task="feature-extraction", dtype="fp32")
+
+ if "fp32" in precision_to_filename_map:
+     print("Exporting the fp32 onnx file...")
+     shutil.copyfile('model.onnx', precision_to_filename_map["fp32"])
+     print("Done\n\n")
+
+ if "fp16" in precision_to_filename_map:
+     print("Exporting the fp16 onnx file...")
+     model_fp16 = float16.convert_float_to_float16(onnx.load('model.onnx'),
+                                                   min_positive_val=1e-7,
+                                                   max_finite_val=1e4,
+                                                   keep_io_types=True,
+                                                   disable_shape_infer=True,
+                                                   op_block_list=None,
+                                                   node_block_list=None)
+     # Rebuild the model to ensure a compatible opset and IR version.
+     model_fp16 = onnx.helper.make_model(model_fp16.graph, ir_version=IR, opset_imports=[op])
+     onnx.save(model_fp16, precision_to_filename_map["fp16"])
+     print("Done\n\n")
+
+ if "int8" in precision_to_filename_map:
+     print("Quantizing fp32 model to int8...")
+     quantize_dynamic("model.onnx", precision_to_filename_map["int8"], weight_type=QuantType.QInt8)
+     print("Done\n\n")
+
+ if "uint8" in precision_to_filename_map:
+     print("Quantizing fp32 model to uint8...")
+     quantize_dynamic("model.onnx", precision_to_filename_map["uint8"], weight_type=QuantType.QUInt8)
+     print("Done\n\n")
+
+ # Remove the intermediate export now that all requested precisions are saved.
+ os.remove("model.onnx")
onnx/model_int8.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fc4692db47a783252a0dba2d5601a3b9ab0c5058186eced9b963bb0e112f011a
+ size 568921578
onnx/model_uint8.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fd0b5b04efbafed0e8ab8c4d563524e16a4818245dadf3bb5262c293ca34cdfb
+ size 568921649
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+ size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8c785abebea9ae3257b61681b4e6fd8365ceafde980c21970d001e834cf10835
+ size 964
test_local.py ADDED
@@ -0,0 +1,49 @@
+ import gc
+ import json
+
+ import onnxruntime as rt
+ import transformers
+ from sentence_transformers.util import cos_sim
+ from sentence_transformers import SentenceTransformer
+
+ with open('conversion_config.json') as json_file:
+     conversion_config = json.load(json_file)
+
+ model_id = conversion_config["model_id"]
+ number_of_generated_embeddings = conversion_config["number_of_generated_embeddings"]
+ precision_to_filename_map = conversion_config["precision_to_filename_map"]
+
+ sentences_1 = 'How is the weather today?'
+ sentences_2 = 'What is the current weather like today?'
+
+ print(f"Testing on cosine similarity between sentences: \n'{sentences_1}'\n'{sentences_2}'\n\n\n")
+
+ tokenizer = transformers.AutoTokenizer.from_pretrained("./")
+ enc1 = tokenizer(sentences_1)
+ enc2 = tokenizer(sentences_2)
+
+ for precision, file_name in precision_to_filename_map.items():
+     onnx_session = rt.InferenceSession(file_name)
+     # Output index 1 is the pooled 'sentence_embedding' tensor.
+     embeddings_1_onnx = onnx_session.run(None, {"input_ids": [enc1.input_ids],
+                                                 "attention_mask": [enc1.attention_mask]})[1][0]
+     embeddings_2_onnx = onnx_session.run(None, {"input_ids": [enc2.input_ids],
+                                                 "attention_mask": [enc2.attention_mask]})[1][0]
+     # Free the session before loading the next precision.
+     del onnx_session
+     gc.collect()
+     print(f'Cosine similarity for ONNX model with precision "{precision}" is {str(cos_sim(embeddings_1_onnx, embeddings_2_onnx))}')
+
+ model = SentenceTransformer(model_id, trust_remote_code=True)
+ embeddings_1_sentence_transformer = model.encode(sentences_1, normalize_embeddings=True)
+ embeddings_2_sentence_transformer = model.encode(sentences_2, normalize_embeddings=True)
+ print('Cosine similarity for original sentence transformer model is ' + str(cos_sim(embeddings_1_sentence_transformer, embeddings_2_sentence_transformer)))
test_teradata.py ADDED
@@ -0,0 +1,106 @@
+ import sys
+ import json
+
+ import teradataml as tdml
+ from tabulate import tabulate
+
+ with open('conversion_config.json') as json_file:
+     conversion_config = json.load(json_file)
+
+ model_id = conversion_config["model_id"]
+ number_of_generated_embeddings = conversion_config["number_of_generated_embeddings"]
+ precision_to_filename_map = conversion_config["precision_to_filename_map"]
+
+ host = sys.argv[1]
+ username = sys.argv[2]
+ password = sys.argv[3]
+
+ print("Setting up connection to teradata...")
+ tdml.create_context(host=host, username=username, password=password)
+ print("Done\n\n")
+
+ print("Deploying tokenizer...")
+ try:
+     tdml.db_drop_table('tokenizer_table')
+ except Exception:
+     print("Can't drop tokenizer table - it does not exist")
+ tdml.save_byom('tokenizer',
+                'tokenizer.json',
+                'tokenizer_table')
+ print("Done\n\n")
+
+ print("Testing models...")
+ try:
+     tdml.db_drop_table('model_table')
+ except Exception:
+     print("Can't drop models table - it does not exist")
+
+ for precision, file_name in precision_to_filename_map.items():
+     print(f"Deploying {precision} model...")
+     tdml.save_byom(precision,
+                    file_name,
+                    'model_table')
+     print(f"Model {precision} is deployed\n")
+
+     print(f"Calculating embeddings with {precision} model...")
+     try:
+         tdml.db_drop_table('emails_embeddings_store')
+     except Exception:
+         print("Can't drop embeddings table - it does not exist")
+
+     tdml.execute_sql(f"""
+     create volatile table emails_embeddings_store as (
+         select
+             *
+         from mldb.ONNXEmbeddings(
+             on emails.emails as InputTable
+             on (select * from model_table where model_id = '{precision}') as ModelTable DIMENSION
+             on (select model as tokenizer from tokenizer_table where model_id = 'tokenizer') as TokenizerTable DIMENSION
+             using
+                 Accumulate('id', 'txt')
+                 ModelOutputTensor('sentence_embedding')
+                 EnableMemoryCheck('false')
+                 OutputFormat('FLOAT32({number_of_generated_embeddings})')
+                 OverwriteCachedModel('true')
+         ) a
+     ) with data on commit preserve rows
+     """)
+     print("Embeddings calculated")
+
+     print(f"Testing semantic search with cosine similarity on the output of the model with precision '{precision}'...")
+     tdf_embeddings_store = tdml.DataFrame('emails_embeddings_store')
+     # Use the email with id 3 as the search target; all other rows are references.
+     tdf_embeddings_store_tgt = tdf_embeddings_store[tdf_embeddings_store.id == 3]
+     tdf_embeddings_store_ref = tdf_embeddings_store[tdf_embeddings_store.id != 3]
+
+     cos_sim_pd = tdml.DataFrame.from_query(f"""
+     SELECT
+         dt.target_id,
+         dt.reference_id,
+         e_tgt.txt as target_txt,
+         e_ref.txt as reference_txt,
+         (1.0 - dt.distance) as similarity
+     FROM
+         TD_VECTORDISTANCE (
+             ON ({tdf_embeddings_store_tgt.show_query()}) AS TargetTable
+             ON ({tdf_embeddings_store_ref.show_query()}) AS ReferenceTable DIMENSION
+             USING
+                 TargetIDColumn('id')
+                 TargetFeatureColumns('[emb_0:emb_{number_of_generated_embeddings - 1}]')
+                 RefIDColumn('id')
+                 RefFeatureColumns('[emb_0:emb_{number_of_generated_embeddings - 1}]')
+                 DistanceMeasure('cosine')
+                 topk(3)
+         ) AS dt
+     JOIN emails.emails e_tgt on e_tgt.id = dt.target_id
+     JOIN emails.emails e_ref on e_ref.id = dt.reference_id;
+     """).to_pandas()
+     print(tabulate(cos_sim_pd, headers='keys', tablefmt='fancy_grid'))
+     print("Done\n\n")
+
+ tdml.remove_context()
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:249df0778f236f6ece390de0de746838ef25b9d6954b68c2ee71249e0a9d8fd4
+ size 17082799
tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b87c8703482b0300d3da30e201519aa641f6a450f5eb5bf1e624afbf70c74d80
+ size 1203