# efficient-context Documentation

## Overview

`efficient-context` is a Python library designed to optimize the handling of context for Large Language Models (LLMs) in CPU-constrained environments. It addresses the challenges of using LLMs with limited computational resources by providing efficient context management strategies.
## Key Features

- **Context Compression**: Reduce memory requirements while preserving information quality
- **Semantic Chunking**: Go beyond token-based approaches for more effective context management
- **Retrieval Optimization**: Minimize context size through intelligent retrieval strategies
- **Memory Management**: Handle large contexts on limited hardware resources
## Installation

```bash
pip install efficient-context
```
## Core Components

### ContextManager

The central class that orchestrates all components of the library.

```python
from efficient_context import ContextManager

# Initialize with default settings
context_manager = ContextManager()

# Add documents
context_manager.add_document("This is a sample document about renewable energy...")
context_manager.add_documents([doc1, doc2, doc3])  # Add multiple documents

# Generate context for a query
optimized_context = context_manager.generate_context(query="Tell me about renewable energy")
```
### Context Compression

The compression module reduces the size of content while preserving key information.

```python
from efficient_context.compression import SemanticDeduplicator

# Initialize with custom settings
compressor = SemanticDeduplicator(
    threshold=0.85,                 # Similarity threshold for deduplication
    embedding_model="lightweight",  # Use a lightweight embedding model
    min_sentence_length=10,         # Minimum length of sentences to consider
    importance_weight=0.3           # Weight given to sentence importance vs. deduplication
)

# Compress content
compressed_content = compressor.compress(
    content="Your large text content here...",
    target_size=1000  # Optional target size in tokens
)
```
### Semantic Chunking

The chunking module divides content into semantically coherent chunks.

```python
from efficient_context.chunking import SemanticChunker

# Initialize with custom settings
chunker = SemanticChunker(
    chunk_size=512,           # Target size for chunks in tokens
    chunk_overlap=50,         # Number of tokens to overlap between chunks
    respect_paragraphs=True,  # Avoid breaking paragraphs across chunks
    min_chunk_size=100,       # Minimum chunk size in tokens
    max_chunk_size=1024       # Maximum chunk size in tokens
)

# Chunk content
chunks = chunker.chunk(
    content="Your large text content here...",
    document_id="doc-1",  # Optional document ID
    metadata={"source": "example", "author": "John Doe"}  # Optional metadata
)
```
### Retrieval Optimization

The retrieval module finds the most relevant chunks for a query.

```python
from efficient_context.retrieval import CPUOptimizedRetriever

# Initialize with custom settings
retriever = CPUOptimizedRetriever(
    embedding_model="lightweight",  # Use a lightweight embedding model
    similarity_metric="cosine",     # Metric for comparing embeddings
    use_batching=True,              # Batch embedding operations
    batch_size=32,                  # Size of batches for embedding
    max_index_size=5000             # Maximum number of chunks to keep in the index
)

# Index chunks
retriever.index_chunks(chunks)

# Retrieve relevant chunks
relevant_chunks = retriever.retrieve(
    query="Your query here...",
    top_k=5  # Number of chunks to retrieve
)
```
### Memory Management

The memory module helps optimize memory usage during operations.

```python
from efficient_context.memory import MemoryManager

# Initialize with custom settings
memory_manager = MemoryManager(
    target_usage_percent=80.0,    # Target memory usage percentage
    aggressive_cleanup=False,     # Whether to perform aggressive garbage collection
    memory_monitor_interval=None  # Interval for memory monitoring in seconds
)

# Use the context manager for memory-intensive operations
with memory_manager.optimize_memory():
    # Run memory-intensive operations here
    results = process_large_documents(documents)

# Get memory usage statistics
memory_stats = memory_manager.get_memory_usage()
print(f"Process memory: {memory_stats['process_rss_bytes'] / (1024 * 1024):.2f} MB")
```
## Advanced Usage

### Customizing the Context Manager

```python
from efficient_context import ContextManager
from efficient_context.compression import SemanticDeduplicator
from efficient_context.chunking import SemanticChunker
from efficient_context.retrieval import CPUOptimizedRetriever
from efficient_context.memory import MemoryManager

# Initialize a fully customized context manager
context_manager = ContextManager(
    compressor=SemanticDeduplicator(threshold=0.85),
    chunker=SemanticChunker(chunk_size=256, chunk_overlap=50),
    retriever=CPUOptimizedRetriever(embedding_model="lightweight"),
    memory_manager=MemoryManager(target_usage_percent=80.0),
    max_context_size=4096
)
```
### Integration with LLMs

```python
from efficient_context import ContextManager
from your_llm_library import LLM  # Replace with your actual LLM library

# Initialize components
context_manager = ContextManager()
llm = LLM(model="lightweight-model")

# Process documents
context_manager.add_documents(documents)

# For each query
query = "Tell me about renewable energy"
optimized_context = context_manager.generate_context(query=query)

# Use the optimized context with the LLM
response = llm.generate(
    prompt=query,
    context=optimized_context,
    max_tokens=512
)
```
## Performance Considerations

- **Memory Usage**: The library is designed to be memory-efficient, but be aware that embedding models may still require significant memory.
- **CPU Performance**: Choose the appropriate embedding model based on your CPU capabilities. The `lightweight` option is recommended for constrained environments.
- **Batch Size**: Adjust the `batch_size` parameter in retrieval to balance memory usage against processing speed.
- **Context Size**: Setting an appropriate `max_context_size` can significantly impact performance, especially when working with limited resources.
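Putting these considerations together, the sketch below combines only parameters documented above into a configuration aimed at constrained hardware; the specific values are illustrative, not tuned recommendations:

```python
from efficient_context import ContextManager
from efficient_context.chunking import SemanticChunker
from efficient_context.retrieval import CPUOptimizedRetriever

# Illustrative low-resource setup: small batches and a modest context
# budget trade some throughput for a smaller memory footprint.
context_manager = ContextManager(
    chunker=SemanticChunker(chunk_size=256, chunk_overlap=25),
    retriever=CPUOptimizedRetriever(
        embedding_model="lightweight",  # lightest model for constrained CPUs
        batch_size=8,                   # small batches cap peak memory
        max_index_size=1000             # bound the size of the in-memory index
    ),
    max_context_size=2048               # keep generated contexts small
)
```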
## Extending the Library

You can create custom implementations of the base classes to adapt the library to your specific needs:

```python
from efficient_context.compression.base import BaseCompressor

class MyCustomCompressor(BaseCompressor):
    def __init__(self, custom_param=None):
        self.custom_param = custom_param

    def compress(self, content, target_size=None):
        # Your custom compression logic here; this placeholder just
        # truncates naively when a target size is given.
        compressed_content = content if target_size is None else content[:target_size]
        return compressed_content
```
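A custom component can then be supplied to the `ContextManager` just like the built-in ones, following the constructor pattern shown under Advanced Usage (the `custom_param` value here is purely illustrative):

```python
from efficient_context import ContextManager

# Swap the custom compressor in for the default one
context_manager = ContextManager(compressor=MyCustomCompressor(custom_param=42))
```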
## Troubleshooting

### High Memory Usage

- Reduce `batch_size` in the retriever
- Use a more lightweight embedding model
- Decrease `max_index_size` to limit the number of chunks stored in memory

### Slow Processing

- Increase `batch_size` (balancing it against memory constraints)
- Increase `threshold` in the `SemanticDeduplicator` to be more aggressive with deduplication
- Reduce `chunk_overlap` to minimize redundant processing
## Example Applications

- **Chatbots on Edge Devices**: Enable context-aware conversations on devices with limited resources
- **Document QA Systems**: Create efficient question-answering systems for large document collections
- **Embedded AI Applications**: Incorporate context-aware LLM capabilities in embedded systems
- **Mobile Applications**: Provide sophisticated LLM features in mobile apps with limited resources