tolgadev committed
Commit
095e8d7
·
verified ·
1 Parent(s): 0af71fa

upload files

Files changed (7)
  1. .gitattributes +2 -0
  2. LICENSE +21 -0
  3. README.md +118 -14
  4. app.py +416 -0
  5. assets/1.JPG +3 -0
  6. assets/2.JPG +3 -0
  7. requirements.txt +7 -0
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ assets/1.JPG filter=lfs diff=lfs merge=lfs -text
+ assets/2.JPG filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) [2025] [Tolga Kurtulus]
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,14 +1,118 @@
- ---
- title: Langchain Text Chunker
- emoji: 📚
- colorFrom: purple
- colorTo: pink
- sdk: gradio
- sdk_version: 5.34.0
- app_file: app.py
- pinned: false
- license: mit
- short_description: This Gradio application provides an interactive interface fo
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # 🦜️🔗 LangChain Text Chunker
+
+ [![Python 3.8+](https://img.shields.io/badge/Python-3.8%2B-blue?style=flat&logo=python)](https://www.python.org/)
+ [![Gradio](https://img.shields.io/badge/Built%20with-Gradio-FF6600?style=flat&logo=gradio)](https://gradio.app/)
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+
+ ## Description
+
+ Welcome to the 🦜️🔗 LangChain Text Chunker application! Built with Gradio, this interactive tool lets you upload a variety of document types, extract their raw text, and apply a range of LangChain text splitting (chunking) methods. It visualizes how each method breaks text into smaller, manageable chunks, complete with their metadata. For developers and researchers, the application also generates Python code examples on the fly, making it easy to replicate and integrate each chunking strategy.
+
+ ## Features
+
+ * **Multi-Document Type Support**: Seamlessly process text from a wide range of document formats, including:
+     * PDF (`.pdf`)
+     * Microsoft Word (`.docx`)
+     * Plain Text (`.txt`)
+     * HTML (`.html`)
+     * CSS (`.css`)
+     * Python Code (`.py`)
+     * Jupyter Notebooks (`.ipynb`)
+     * CSV (`.csv`)
+ * **Diverse Chunking Strategies**: Explore and compare the output of several LangChain text splitters:
+     * **Recursive Character Text Splitter**: Ideal for general-purpose text; tries a list of separators in order.
+     * **Character Text Splitter**: Splits text on a single, user-defined separator.
+     * **Markdown Text Splitter**: Designed to understand and preserve the structure of Markdown documents.
+     * **Python Code Text Splitter**: Optimized for splitting Python source code while respecting its syntax.
+     * **JavaScript Code Text Splitter**: Uses language-specific rules to chunk JavaScript code effectively.
+ * **Customizable Chunking Parameters**: Fine-tune the chunking process with adjustable parameters:
+     * `Chunk Size`: The maximum size of the generated chunks.
+     * `Chunk Overlap`: The number of characters shared between consecutive chunks.
+     * `Character Splitter Separator`: Custom separators for the Character Chunking method.
+     * `Keep Separator`: Whether the separator is kept in the chunk, and where it is placed.
+     * `Add Start Index to Metadata`: Optionally include each chunk's starting character index in its metadata.
+     * `Strip Whitespace`: Automatically remove leading/trailing whitespace from chunks.
+ * **Interactive Chunk Visualization**: View the resulting chunks in a clear, structured JSON format within the Gradio interface.
+ * **Dynamic Python Code Examples**: For each chunking method, the application generates ready-to-use Python code showing how to reproduce the same chunks programmatically in your own projects.
+ * **User-Friendly Gradio Interface**: An intuitive web interface for experimenting with text chunking without deep programming knowledge.
+
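The `Chunk Size` and `Chunk Overlap` parameters behave like a sliding window over the text. LangChain's splitters pick smarter boundaries, but a toy sketch (an illustration only, not the library's implementation) shows the core size/overlap mechanics:

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive fixed-width chunker: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so consecutive chunks share an overlap."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Real splitters additionally prefer natural boundaries (paragraphs, sentences, code blocks) over hard character cuts, which is what the strategy tabs let you compare.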
+ ## Installation
+
+ To get the application running on your local machine, follow these steps:
+
+ ### Prerequisites
+
+ * Python 3.8 or higher
+
+ ### Steps
+
+ 1. **Clone the repository:**
+     ```bash
+     git clone https://github.com/tolgakurtuluss/langchain-text-chunker.git
+     cd langchain-text-chunker
+     ```
+
+ 2. **Create a virtual environment (recommended):**
+     ```bash
+     python -m venv venv
+     ```
+
+ 3. **Activate the virtual environment:**
+     * **On Windows:**
+       ```bash
+       .\venv\Scripts\activate
+       ```
+     * **On macOS/Linux:**
+       ```bash
+       source venv/bin/activate
+       ```
+
+ 4. **Install dependencies:**
+     ```bash
+     pip install -r requirements.txt
+     ```
+
+ ## Usage
+
+ Once installation is complete, you can run the Gradio application:
+
+ 1. **Run the application:**
+     ```bash
+     python app.py
+     ```
+     This starts the Gradio server; a local URL (e.g., `http://127.0.0.1:7860`) will appear in your terminal. Open it in your web browser.
+
+ 2. **Using the Interface:**
+     * **Upload your document**: Use the "Upload your document" file input to select a file (PDF, DOCX, TXT, HTML, CSS, PY, IPYNB, CSV).
+     * **Adjust Chunking Parameters**: Use the sliders, dropdowns, and checkboxes in the "Chunking Parameters" accordion to customize `Chunk Size`, `Chunk Overlap`, `Character Splitter Separator`, `Keep Separator` behavior, `Add Start Index` to metadata, and `Strip Whitespace`.
+     * **Process Document**: Click the "Process Document" button. The extracted raw text appears, and the results of each chunking method are shown in their respective tabs.
+     * **Explore Chunks**: Navigate through the tabs ("Recursive Chunking", "Character Chunking", etc.) to see the chunks as JSON, along with the total number of chunks for each method.
+     * **Python Example Code**: Each chunking tab also shows dynamically generated Python code that reproduces the same chunking results programmatically.
+
+ ### Inspiration
+
+ This Gradio application is inspired by [Mervin Praison's insightful work](https://mer.vin/2024/03/chunking-strategy/) on "Advanced Chunking Strategies."
+
+ ## Screenshots
+
+ *Interface for interacting with the "Attention Is All You Need" (1706.03762) paper.*
+ ![interface](assets/1.JPG)
+
+ *Chunking results of the Recursive Chunking method.*
+ ![interface](assets/2.JPG)
+
+ ## Contributing
+
+ Contributions are welcome! If you have suggestions for improvements or new features, please follow these steps:
+
+ 1. Fork the repository.
+ 2. Create a new branch (`git checkout -b feature/YourFeature`).
+ 3. Make your changes.
+ 4. Commit your changes (`git commit -m 'Add some feature'`).
+ 5. Push to the branch (`git push origin feature/YourFeature`).
+ 6. Open a Pull Request.
+
+ ## License
+
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
app.py ADDED
@@ -0,0 +1,416 @@
+ # --------------------------------------------- Libraries ----------------------------------------------------------#
+ import gradio as gr
+ from pypdf import PdfReader  # pypdf is the maintained successor to PyPDF2 and matches requirements.txt
+ import nbformat
+
+ from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter, MarkdownTextSplitter, PythonCodeTextSplitter, Language
+ from langchain.docstore.document import Document
+ from langchain_community.document_loaders import Docx2txtLoader, CSVLoader
+
+ # --------------------------------------------- Functions ----------------------------------------------------------#
+
+ def process_uploaded_file(uploaded_file):
+     text = ""
+     display_content = ""
+     file_extension = uploaded_file.name.split(".")[-1]
+
+     if file_extension == "pdf":
+         try:
+             # Gradio's uploaded_file.name provides the path to the temporary file
+             pdf = PdfReader(uploaded_file.name)
+             for page in pdf.pages:
+                 page_text = page.extract_text()
+                 text += page_text + "\n"
+                 display_content += page_text + "\n"
+         except Exception as e:
+             display_content = f"Error reading PDF file: {e}"
+             text = ""
+
+     elif file_extension == "docx":
+         try:
+             docx_loader = Docx2txtLoader(uploaded_file.name)
+             documents = docx_loader.load()
+             text = "\n".join([doc.page_content for doc in documents])
+             display_content = text
+         except Exception as e:
+             display_content = f"Error reading DOCX file: {e}"
+             text = ""
+
+     elif file_extension in ["html", "css", "py", "txt"]:
+         try:
+             with open(uploaded_file.name, "r", encoding="utf-8") as f:
+                 file_content = f.read()
+             display_content = file_content  # Display as plain text in Textbox
+             text = file_content
+         except Exception as e:
+             display_content = f"Error reading {file_extension.upper()} file: {e}"
+             text = ""
+
+     elif file_extension == "ipynb":
+         try:
+             # nbformat.read can take a file path
+             nb_content = nbformat.read(uploaded_file.name, as_version=4)
+             nb_filtered = [cell for cell in nb_content["cells"] if cell["cell_type"] in ["code", "markdown"]]
+
+             for cell in nb_filtered:
+                 if cell["cell_type"] == "code":
+                     display_content += f"```python\n{cell['source']}\n```\n"
+                     text += cell["source"] + "\n"
+                 elif cell["cell_type"] == "markdown":
+                     display_content += f"{cell['source']}\n"
+                     text += cell["source"] + "\n"
+         except Exception as e:
+             display_content = f"Error reading IPYNB file: {e}"
+             text = ""
+
+     elif file_extension == "csv":
+         try:
+             loader = CSVLoader(file_path=uploaded_file.name, encoding="utf-8", csv_args={'delimiter': ','})
+             documents = loader.load()
+             text = "\n".join([doc.page_content for doc in documents])
+             display_content = text  # For CSV, display the concatenated text
+         except Exception as e:
+             display_content = f"Error reading CSV file: {e}"
+             text = ""
+     else:
+         display_content = "Unsupported file type."
+         text = ""
+
+     return text, display_content
+
+
+ def chunk_recursive(text, chunk_size, chunk_overlap, keep_separator, add_start_index, strip_whitespace):
+     if not text:
+         return [], ""
+     text_splitter = RecursiveCharacterTextSplitter(
+         chunk_size=chunk_size,
+         chunk_overlap=chunk_overlap,
+         length_function=len,
+         keep_separator=keep_separator,
+         add_start_index=add_start_index,
+         strip_whitespace=strip_whitespace,
+     )
+     chunks = text_splitter.create_documents([text])
+     formatted_chunks = []
+     for chunk in chunks:
+         if isinstance(chunk, Document):
+             formatted_chunks.append({"content": chunk.page_content, "metadata": chunk.metadata})
+         else:
+             formatted_chunks.append({"content": str(chunk), "metadata": {}})
+
+     code_example = f"""
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+
+ text_content = \"\"\"{text[:50]}...\"\"\"  # Truncated for example
+
+ text_splitter = RecursiveCharacterTextSplitter(
+     chunk_size={chunk_size},
+     chunk_overlap={chunk_overlap},
+     length_function=len,
+     keep_separator={keep_separator},
+     add_start_index={add_start_index},
+     strip_whitespace={strip_whitespace},
+ )
+ chunks = text_splitter.create_documents([text_content])
+ # Access chunks: chunks[0].page_content, chunks[0].metadata
+ """
+     return formatted_chunks, code_example
+
+ def chunk_character(text, chunk_size, chunk_overlap, separator, keep_separator, add_start_index, strip_whitespace):
+     if not text:
+         return [], ""
+
+     if isinstance(separator, list):
+         separator_str = "".join(separator)
+     else:
+         separator_str = separator
+
+     text_splitter = CharacterTextSplitter(
+         separator=separator_str,
+         chunk_size=chunk_size,
+         chunk_overlap=chunk_overlap,
+         length_function=len,
+         keep_separator=keep_separator,
+         add_start_index=add_start_index,
+         strip_whitespace=strip_whitespace,
+     )
+     chunks = text_splitter.create_documents([text])
+     formatted_chunks = []
+     for chunk in chunks:
+         if isinstance(chunk, Document):
+             formatted_chunks.append({"content": chunk.page_content, "metadata": chunk.metadata})
+         else:
+             formatted_chunks.append({"content": str(chunk), "metadata": {}})
+
+     code_example = f"""
+ from langchain.text_splitter import CharacterTextSplitter
+
+ text_content = \"\"\"{text[:50]}...\"\"\"  # Truncated for example
+
+ text_splitter = CharacterTextSplitter(
+     separator=\"\"\"{separator_str}\"\"\",
+     chunk_size={chunk_size},
+     chunk_overlap={chunk_overlap},
+     length_function=len,
+     keep_separator={keep_separator},
+     add_start_index={add_start_index},
+     strip_whitespace={strip_whitespace},
+ )
+ chunks = text_splitter.create_documents([text_content])
+ # Access chunks: chunks[0].page_content, chunks[0].metadata
+ """
+     return formatted_chunks, code_example
+
+ def chunk_python_code(text, chunk_size, chunk_overlap, keep_separator, add_start_index, strip_whitespace):
+     if not text:
+         return [], ""
+     text_splitter = PythonCodeTextSplitter(
+         chunk_size=chunk_size,
+         chunk_overlap=chunk_overlap,
+         keep_separator=keep_separator,
+         add_start_index=add_start_index,
+         strip_whitespace=strip_whitespace,
+     )
+     chunks = text_splitter.create_documents([text])
+     formatted_chunks = []
+     for chunk in chunks:
+         if isinstance(chunk, Document):
+             formatted_chunks.append({"content": chunk.page_content, "metadata": chunk.metadata})
+         else:
+             formatted_chunks.append({"content": str(chunk), "metadata": {}})
+
+     code_example = f"""
+ from langchain.text_splitter import PythonCodeTextSplitter
+
+ text_content = \"\"\"{text[:50]}...\"\"\"  # Truncated for example
+
+ text_splitter = PythonCodeTextSplitter(
+     chunk_size={chunk_size},
+     chunk_overlap={chunk_overlap},
+     keep_separator={keep_separator},
+     add_start_index={add_start_index},
+     strip_whitespace={strip_whitespace},
+ )
+ chunks = text_splitter.create_documents([text_content])
+ # Access chunks: chunks[0].page_content, chunks[0].metadata
+ """
+     return formatted_chunks, code_example
+
+ def chunk_javascript_code(text, chunk_size, chunk_overlap, keep_separator, add_start_index, strip_whitespace):
+     if not text:
+         return [], ""
+     text_splitter = RecursiveCharacterTextSplitter.from_language(
+         language=Language.JS,
+         chunk_size=chunk_size,
+         chunk_overlap=chunk_overlap,
+         keep_separator=keep_separator,
+         add_start_index=add_start_index,
+         strip_whitespace=strip_whitespace,
+     )
+     chunks = text_splitter.create_documents([text])
+     formatted_chunks = []
+     for chunk in chunks:
+         if isinstance(chunk, Document):
+             formatted_chunks.append({"content": chunk.page_content, "metadata": chunk.metadata})
+         else:
+             formatted_chunks.append({"content": str(chunk), "metadata": {}})
+
+     code_example = f"""
+ from langchain.text_splitter import RecursiveCharacterTextSplitter, Language
+
+ text_content = \"\"\"{text[:50]}...\"\"\"  # Truncated for example
+
+ text_splitter = RecursiveCharacterTextSplitter.from_language(
+     language=Language.JS,
+     chunk_size={chunk_size},
+     chunk_overlap={chunk_overlap},
+     keep_separator={keep_separator},
+     add_start_index={add_start_index},
+     strip_whitespace={strip_whitespace},
+ )
+ chunks = text_splitter.create_documents([text_content])
+ # Access chunks: chunks[0].page_content, chunks[0].metadata
+ """
+     return formatted_chunks, code_example
+
+ def chunk_markdown(text, chunk_size, chunk_overlap, keep_separator, add_start_index, strip_whitespace):
+     if not text:
+         return [], ""
+     text_splitter = MarkdownTextSplitter(
+         chunk_size=chunk_size,
+         chunk_overlap=chunk_overlap,
+         length_function=len,
+         keep_separator=keep_separator,
+         add_start_index=add_start_index,
+         strip_whitespace=strip_whitespace,
+     )
+     chunks = text_splitter.create_documents([text])
+     formatted_chunks = []
+     for chunk in chunks:
+         if isinstance(chunk, Document):
+             formatted_chunks.append({"content": chunk.page_content, "metadata": chunk.metadata})
+         else:
+             formatted_chunks.append({"content": str(chunk), "metadata": {}})
+
+     code_example = f"""
+ from langchain.text_splitter import MarkdownTextSplitter
+
+ text_content = \"\"\"{text[:50]}...\"\"\"  # Truncated for example
+
+ text_splitter = MarkdownTextSplitter(
+     chunk_size={chunk_size},
+     chunk_overlap={chunk_overlap},
+     length_function=len,
+     keep_separator={keep_separator},
+     add_start_index={add_start_index},
+     strip_whitespace={strip_whitespace},
+ )
+ chunks = text_splitter.create_documents([text_content])
+ # Access chunks: chunks[0].page_content, chunks[0].metadata
+ """
+     return formatted_chunks, code_example
+
+ def main_interface(uploaded_file, chunk_size, chunk_overlap, separator, keep_separator, add_start_index, strip_whitespace):
+     if uploaded_file is None:
+         # One empty value per output component: display text, raw text,
+         # 5 chunk lists, 5 counts, and 5 code examples (17 in total).
+         return "", "", [], [], [], [], [], "", "", "", "", "", "", "", "", "", ""
+
+     # Ensure chunk_size and chunk_overlap are integers
+     chunk_size = int(chunk_size)
+     chunk_overlap = int(chunk_overlap)
+
+     raw_text, display_content = process_uploaded_file(uploaded_file)
+
+     recursive_chunks, recursive_code = chunk_recursive(raw_text, chunk_size, chunk_overlap, keep_separator, add_start_index, strip_whitespace)
+     character_chunks, character_code = chunk_character(raw_text, chunk_size, chunk_overlap, separator, keep_separator, add_start_index, strip_whitespace)
+     markdown_chunks, markdown_code = chunk_markdown(raw_text, chunk_size, chunk_overlap, keep_separator, add_start_index, strip_whitespace)
+     python_chunks, python_code = chunk_python_code(raw_text, chunk_size, chunk_overlap, keep_separator, add_start_index, strip_whitespace)
+     javascript_chunks, javascript_code = chunk_javascript_code(raw_text, chunk_size, chunk_overlap, keep_separator, add_start_index, strip_whitespace)
+
+     return (
+         display_content,
+         raw_text,
+         recursive_chunks,
+         character_chunks,
+         markdown_chunks,
+         python_chunks,
+         javascript_chunks,
+         f"Number of chunks: {len(recursive_chunks)}",
+         f"Number of chunks: {len(character_chunks)}",
+         f"Number of chunks: {len(markdown_chunks)}",
+         f"Number of chunks: {len(python_chunks)}",
+         f"Number of chunks: {len(javascript_chunks)}",
+         recursive_code,
+         character_code,
+         markdown_code,
+         python_code,
+         javascript_code
+     )
+
+ # --------------------------------------------- Gradio Interface ----------------------------------------------------------#
+
+ with gr.Blocks(theme=gr.themes.Soft(), title="🦜️🔗 LangChain Text Chunker") as demo:
+     gr.Markdown(
+         """
+         # 🦜️🔗 LangChain Text Chunker
+         Welcome to the LangChain Text Chunker application! This tool allows you to upload various document types,
+         extract their text content, and then apply different LangChain text splitting (chunking) methods.
+         You can observe how each method breaks down the text into smaller, manageable chunks, along with their metadata.
+
+         ### How to Use:
+         1. **Upload your document**: Select a file (PDF, DOCX, TXT, HTML, CSS, PY, IPYNB, CSV) using the file input.
+         2. **Adjust Chunking Parameters**: Use the sliders and dropdowns to customize `Chunk Size`, `Chunk Overlap`,
+            `Character Splitter Separator`, `Keep Separator` behavior, `Add Start Index` to metadata, and `Strip Whitespace`.
+         3. **Process Document**: Click the "Process Document" button to see the extracted raw text and the results
+            of various chunking methods in their respective tabs.
+         4. **Explore Chunks**: Each tab will display the chunks as JSON, along with the total number of chunks created.
+         5. **Python Example Code**: You can view dynamically generated Python 🐍 example code.
+         6. **Inspiration**: This Gradio app is inspired by [Mervin Praison's work](https://mer.vin/2024/03/chunking-strategy/) on "Advanced Chunking Strategies".
+         """
+     )
+
+     with gr.Row():
+         with gr.Column(scale=1):
+             file_input = gr.File(label="Upload your document", file_types=[".pdf", ".docx", ".txt", ".html", ".css", ".py", ".ipynb", ".csv"])
+             process_button = gr.Button("Process Document", variant="primary")
+
+             with gr.Accordion("Chunking Parameters", open=False):
+                 chunk_size_input = gr.Slider(minimum=100, maximum=2000, value=250, step=50, label="Chunk Size", info="Maximum size of chunks to return.")
+                 chunk_overlap_input = gr.Slider(minimum=0, maximum=500, value=0, step=10, label="Chunk Overlap", info="Overlap in characters between chunks.")
+                 separator_input = gr.Dropdown(
+                     label="Character Splitter Separator",
+                     choices=["\\n\\n", "\\n", " ", "", "\n", ".", ",", ";", ":", "!", "?", "-",
+                              "—", "(", ")", "[", "]", "{", "}", '"', "'",
+                              "“", "”", "‘", "’", "..."],  # Common separators
+                     value="\\n\\n",
+                     allow_custom_value=True,
+                     multiselect=True,
+                     info="Characters to split on for Character Chunking. Multiple selections will be joined."
+                 )
+                 keep_separator_input = gr.Dropdown(
+                     label="Keep Separator",
+                     choices=[True, False, "start", "end"],
+                     value=False,
+                     info="Whether to keep the separator and where to place it in each corresponding chunk (True='start')."
+                 )
+                 add_start_index_input = gr.Checkbox(label="Add Start Index to Metadata", value=True, info="If checked, includes each chunk's start index in its metadata.")
+                 strip_whitespace_input = gr.Checkbox(label="Strip Whitespace", value=True, info="If checked, strips whitespace from the start and end of every document.")
+
+         with gr.Column(scale=2):
+             raw_text_display = gr.Textbox(label="Extracted Raw Text", lines=10, interactive=False, show_copy_button=True)
+             hidden_raw_text = gr.State("")  # Stores the actual raw text for chunking
+
+             with gr.Tabs():
+                 with gr.TabItem("Recursive Chunking"):
+                     recursive_count_output = gr.Markdown()
+                     recursive_output = gr.JSON(label="Recursive Chunks")
+                     recursive_code_output = gr.Code(label="Python Code Example", language="python", interactive=False)
+                 with gr.TabItem("Character Chunking"):
+                     character_count_output = gr.Markdown()
+                     character_output = gr.JSON(label="Character Chunks")
+                     character_code_output = gr.Code(label="Python Code Example", language="python", interactive=False)
+                 with gr.TabItem("Markdown Chunking"):
+                     markdown_count_output = gr.Markdown()
+                     markdown_output = gr.JSON(label="Markdown Chunks")
+                     markdown_code_output = gr.Code(label="Python Code Example", language="python", interactive=False)
+                 with gr.TabItem("Python Code Chunking"):
+                     python_count_output = gr.Markdown()
+                     python_output = gr.JSON(label="Python Code Chunks")
+                     python_code_output = gr.Code(label="Python Code Example", language="python", interactive=False)
+                 with gr.TabItem("JavaScript Code Chunking"):
+                     javascript_count_output = gr.Markdown()
+                     javascript_output = gr.JSON(label="JavaScript Code Chunks")
+                     javascript_code_output = gr.Code(label="Python Code Example", language="python", interactive=False)
+
+     process_button.click(
+         fn=main_interface,
+         inputs=[
+             file_input,
+             chunk_size_input,
+             chunk_overlap_input,
+             separator_input,
+             keep_separator_input,
+             add_start_index_input,
+             strip_whitespace_input
+         ],
+         outputs=[
+             raw_text_display,
+             hidden_raw_text,
+             recursive_output,
+             character_output,
+             markdown_output,
+             python_output,
+             javascript_output,
+             recursive_count_output,
+             character_count_output,
+             markdown_count_output,
+             python_count_output,
+             javascript_count_output,
+             recursive_code_output,
+             character_code_output,
+             markdown_code_output,
+             python_code_output,
+             javascript_code_output
+         ]
+     )
+
+ demo.launch()
assets/1.JPG ADDED

Git LFS Details

  • SHA256: d4f9463497b762fb45d6e2aaac404e9d51147692547552345bae982cc20ff27e
  • Pointer size: 131 Bytes
  • Size of remote file: 203 kB
assets/2.JPG ADDED

Git LFS Details

  • SHA256: a343b6650d955c17e534eef1f4b85f853c4e3ce35177386a095d0af10d2a9113
  • Pointer size: 131 Bytes
  • Size of remote file: 136 kB
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ gradio==5.33.2
+ langchain
+ langchain_community
+ pypdf
+ python-docx
+ nbformat
+ unstructured