Dan Mo committed on
Commit 975f207 · 1 Parent(s): 712ecb4

Add comprehensive technical reference documentation for the Feelings to Emoji application

  1. REFERENCE.md +131 -0
REFERENCE.md ADDED
@@ -0,0 +1,131 @@
+ # Feelings to Emoji: Technical Reference
+
+ This document provides technical details about the implementation of the Feelings to Emoji application.
+
+ ## Project Structure
+
+ The application is organized into several Python modules:
+
+ - `app.py` - Main application file with Gradio interface
+ - `emoji_processor.py` - Core processing logic for emoji matching
+ - `config.py` - Configuration settings
+ - `utils.py` - Utility functions
+ - `generate_embeddings.py` - Standalone tool to pre-generate embeddings
+
+ ## Embedding Models
+
+ The system uses the following sentence embedding models from the Sentence Transformers library:
+
+ | Model Key | Model ID | Size (parameters) | Description |
+ |-----------|----------|-------------------|-------------|
+ | mpnet | all-mpnet-base-v2 | 110M | Balanced, great general-purpose model |
+ | gte | thenlper/gte-large | 335M | Context-rich, good for emotion & nuance |
+ | bge | BAAI/bge-large-en-v1.5 | 350M | Tuned for ranking & high-precision similarity |
+
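As a rough illustration, the table above could be represented as a dictionary along the following lines. The field names (`id`, `size`, `description`) and the `resolve_model_id` helper are assumptions made for this sketch; the actual layout of `EMBEDDING_MODELS` in `config.py` may differ.

```python
# Hypothetical layout for the model registry; the field names are assumed,
# only the keys, model IDs, sizes and descriptions come from the table.
EMBEDDING_MODELS = {
    "mpnet": {
        "id": "all-mpnet-base-v2",
        "size": "110M",
        "description": "Balanced, great general-purpose model",
    },
    "gte": {
        "id": "thenlper/gte-large",
        "size": "335M",
        "description": "Context-rich, good for emotion & nuance",
    },
    "bge": {
        "id": "BAAI/bge-large-en-v1.5",
        "size": "350M",
        "description": "Tuned for ranking & high-precision similarity",
    },
}


def resolve_model_id(model_key, default_key="mpnet"):
    """Return the model ID for a key, falling back to the default key (hypothetical helper)."""
    entry = EMBEDDING_MODELS.get(model_key, EMBEDDING_MODELS[default_key])
    return entry["id"]


print(resolve_model_id("gte"))
```

Keeping the short key separate from the full model ID lets the user interface show a compact name while the loader receives the full identifier.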
+ ## Emoji Matching Algorithm
+
+ The application uses cosine similarity between sentence embeddings to match text with emojis:
+
+ 1. For each emoji category (emotion and event):
+    - Embed the emoji descriptions using the selected model
+    - Calculate cosine similarity between the input text embedding and each emoji description embedding
+    - Return the emoji with the highest similarity score
+
+ 2. The emoji description embeddings are pre-computed and cached to improve performance:
+    - Stored as pickle files in the `embeddings/` directory
+    - Generated using `generate_embeddings.py`
+    - Loaded at startup to minimize processing time
+
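The matching step can be sketched in plain Python as follows. This is a minimal illustration, not the application's code: the three-dimensional vectors are made-up stand-ins for real sentence embeddings, and `cosine_similarity` and `top_matches` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query, emoji_embeddings, top_n=1):
    """Return the top_n (emoji, score) pairs, highest similarity first."""
    scored = [
        (emoji, cosine_similarity(query, vector))
        for emoji, vector in emoji_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors standing in for cached description embeddings.
toy_embeddings = {
    "😀": [0.9, 0.1, 0.0],
    "😢": [0.0, 0.2, 0.9],
    "😡": [0.1, 0.9, 0.1],
}
query = [0.8, 0.2, 0.1]  # stand-in for the input text embedding

best_emoji, best_score = top_matches(query, toy_embeddings)[0]
print(best_emoji, round(best_score, 3))
```

Because cosine similarity ignores vector length, only the direction of the embeddings affects the ranking.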
+ ## Module Reference
+
+ ### `config.py`
+
+ Contains configuration settings including:
+
+ - `CONFIG`: Dictionary with basic application settings (model name, file paths, etc.)
+ - `EMBEDDING_MODELS`: Dictionary defining the available embedding models
+
+ ### `utils.py`
+
+ Utility functions including:
+
+ - `setup_logging()`: Configures application logging
+ - `kitchen_txt_to_dict(filepath)`: Parses emoji dictionary files
+ - `save_embeddings_to_pickle(embeddings, filepath)`: Saves embeddings to pickle files
+ - `load_embeddings_from_pickle(filepath)`: Loads embeddings from pickle files
+ - `get_embeddings_pickle_path(model_id, emoji_type)`: Generates consistent paths for embedding files
+
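For orientation, the two pickle helpers listed above might look roughly like this. Only the function names and parameters come from the list; the bodies, the directory creation, and the choice to return `None` for a missing file are assumptions made for this sketch.

```python
import os
import pickle
import tempfile


def save_embeddings_to_pickle(embeddings, filepath):
    """Write embeddings to filepath, creating the parent directory if needed (assumed behavior)."""
    parent = os.path.dirname(filepath)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(filepath, "wb") as handle:
        pickle.dump(embeddings, handle)


def load_embeddings_from_pickle(filepath):
    """Return the stored embeddings, or None if the file does not exist (assumed behavior)."""
    if not os.path.exists(filepath):
        return None
    with open(filepath, "rb") as handle:
        return pickle.load(handle)


# Round trip in a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings", "demo_emotion.pkl")
    save_embeddings_to_pickle({"😀": [0.1, 0.2]}, path)
    restored = load_embeddings_from_pickle(path)
    missing = load_embeddings_from_pickle(os.path.join(tmp, "absent.pkl"))

print(restored, missing)
```

Pickle files can execute arbitrary code when loaded, so cached embeddings should only be read from files the application itself produced.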
+ ### `emoji_processor.py`
+
+ Core processing logic:
+
+ - `EmojiProcessor`: Main class for emoji matching and processing
+   - `__init__(model_name=None, model_key=None, use_cached_embeddings=True)`: Initializes the processor with a specific model
+   - `load_emoji_dictionaries(emotion_file, item_file)`: Loads emoji dictionaries from text files
+   - `switch_model(model_key)`: Switches to a different embedding model
+   - `sentence_to_emojis(sentence)`: Processes text to find matching emojis and generate mashup
+   - `find_top_emojis(embedding, emoji_embeddings, top_n=1)`: Finds top matching emojis using cosine similarity
+
+ ### `app.py`
+
+ Gradio interface:
+
+ - `EmojiMashupApp`: Main application class
+   - `create_interface()`: Creates the Gradio interface
+   - `process_with_model(model_selection, text, use_cached_embeddings)`: Processes text with selected model
+   - `get_random_example()`: Gets a random example sentence for demonstration
+
+ ### `generate_embeddings.py`
+
+ Standalone utility to pre-generate embeddings:
+
+ - `generate_embeddings_for_model(model_key, model_info)`: Generates embeddings for a specific model
+ - `main()`: Main function that processes all models and saves embeddings
+
+ ## Emoji Data Files
+
+ - `google-emoji-kitchen-emotion.txt`: Emotion emojis with descriptions
+ - `google-emoji-kitchen-item.txt`: Event/object emojis with descriptions
+ - `google-emoji-kitchen-compatible.txt`: Compatibility information for emoji combinations
+
+ ## Embedding Cache Structure
+
+ The `embeddings/` directory contains pre-generated embeddings in pickle format:
+
+ - `[model_id]_emotion.pkl`: Embeddings for emotion emojis
+ - `[model_id]_event.pkl`: Embeddings for event/object emojis
+
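One way to derive such file names is sketched below. Model IDs such as `thenlper/gte-large` contain a slash, which would otherwise be read as a directory separator; replacing it with an underscore is an assumption of this sketch, and the application's `get_embeddings_pickle_path` may sanitize IDs differently. The `embeddings_path` name and the `base_dir` parameter are hypothetical.

```python
import os


def embeddings_path(model_id, emoji_type, base_dir="embeddings"):
    """Build a cache path of the form <base_dir>/<model_id>_<emoji_type>.pkl.

    The slash replacement is an assumed sanitization rule, not the
    application's confirmed behavior.
    """
    safe_id = model_id.replace("/", "_")
    return os.path.join(base_dir, f"{safe_id}_{emoji_type}.pkl")


print(embeddings_path("thenlper/gte-large", "emotion"))
print(embeddings_path("all-mpnet-base-v2", "event"))
```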
+ ## API Usage Examples
+
+ ### Using the EmojiProcessor Directly
+
+ ```python
+ from emoji_processor import EmojiProcessor
+
+ # Initialize with default model (mpnet)
+ processor = EmojiProcessor()
+ processor.load_emoji_dictionaries()
+
+ # Process a sentence
+ emotion, event, image = processor.sentence_to_emojis("I'm feeling happy today!")
+ print(f"Emotion emoji: {emotion}")
+ print(f"Event emoji: {event}")
+ # image contains the PIL Image object of the mashup
+ ```
+
+ ### Switching Models
+
+ ```python
+ # Switch to a different model
+ processor.switch_model("gte")
+
+ # Process with the new model
+ emotion, event, image = processor.sentence_to_emojis("I'm feeling anxious about tomorrow.")
+ ```
+
+ ## Performance Considerations
+
+ - Embedding generation is computationally intensive but only happens once per model
+ - Using cached embeddings significantly improves response time
+ - Larger models (GTE, BGE) may provide better accuracy but require more resources
+ - The MPNet model offers a good balance of speed and accuracy for most use cases
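
The "once per model" behavior can be illustrated with a small in-memory cache. This is a sketch under assumptions: `get_embeddings`, `fake_compute`, and the module-level `_cache` dictionary are hypothetical names, and the real application may instead rely on the pickle files described earlier.

```python
_cache = {}  # maps model key -> embeddings computed for that model


def get_embeddings(model_key, compute_fn):
    """Return embeddings for model_key, calling compute_fn only on first use."""
    if model_key not in _cache:
        _cache[model_key] = compute_fn(model_key)
    return _cache[model_key]


calls = []  # records how often the expensive step actually runs


def fake_compute(model_key):
    """Stand-in for the expensive embedding generation step."""
    calls.append(model_key)
    return {"😀": [1.0, 0.0]}


first = get_embeddings("mpnet", fake_compute)
second = get_embeddings("mpnet", fake_compute)
print(len(calls))
```

Switching back to a model that was already used then costs a dictionary lookup rather than a full regeneration.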