# Multi-GPU Setup Guide

This guide explains how to run the neural OS demo with multiple GPUs and user queue management.

## Architecture Overview

The system has been split into two main components:

1. **Dispatcher** (`dispatcher.py`): Handles WebSocket connections, manages user queues, and routes requests to workers
2. **Worker** (`worker.py`): Runs the actual model inference on individual GPUs

## Files Overview

- `main.py` - Original single-GPU implementation (kept as a backup)
- `dispatcher.py` - Queue management and WebSocket handling
- `worker.py` - GPU worker for model inference
- `start_workers.py` - Helper script to start multiple workers
- `start_system.sh` - Shell script to start the entire system
- `tail_workers.py` - Script to monitor all worker logs simultaneously
- `requirements.txt` - Dependencies
- `static/index.html` - Frontend interface
## Setup Instructions

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Start the Dispatcher

The dispatcher runs on port 7860 and manages user connections and queues:

```bash
python dispatcher.py
```

### 3. Start Workers (One per GPU)

Start one worker for each GPU you want to use. Workers automatically register with the dispatcher.

#### GPU 0:

```bash
python worker.py --gpu-id 0
```

#### GPU 1:

```bash
python worker.py --gpu-id 1
```

#### GPU 2:

```bash
python worker.py --gpu-id 2
```

And so on for additional GPUs. Workers run on ports 8001, 8002, 8003, etc. (8001 + GPU_ID).
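If you prefer to launch all workers from a single shell, a loop like the one below works. This is a minimal sketch (the repository's `start_workers.py` and `start_system.sh` already handle this more robustly); the `NUM_GPUS` variable is purely illustrative.

```bash
# Minimal sketch: launch one worker per GPU in the background
# (start_workers.py / start_system.sh do this more robustly).
NUM_GPUS=4   # illustrative; set to the number of GPUs on your machine
for GPU in $(seq 0 $((NUM_GPUS - 1))); do
    python worker.py --gpu-id "$GPU" &
done
wait   # optional: keep the shell attached to the workers
```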
### 4. Access the Application

Open your browser and go to: `http://localhost:7860`

## System Behavior

### Queue Management

- **No Queue**: Users get the normal timeout behavior (20 seconds of inactivity)
- **With Queue**: Users get a limited session time (60 seconds) with warnings and grace periods
- **Grace Period**: If the queue becomes empty during the grace period, time limits are removed

### User Experience

1. **Immediate Access**: If GPUs are available, users start immediately
2. **Queue Position**: Users see their position and estimated wait time
3. **Session Warnings**: Users get warnings when their time is running out
4. **Grace Period**: A 10-second countdown starts when session time expires; if the queue empties during it, users can continue
5. **Queue Updates**: Real-time updates on queue position every 5 seconds

### Worker Management

- Workers automatically register with the dispatcher on startup
- Workers send periodic pings (every 10 seconds) to maintain the connection (sketched below)
- Workers handle session cleanup when users disconnect
- Each worker can handle one session at a time
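For illustration, the keep-alive behavior could look like the loop below. This is a sketch, not the actual `worker.py` code: the `/worker_ping` endpoint, dispatcher port, and 10-second interval come from this guide, but the JSON payload fields (`gpu_id`, `port`) are assumptions.

```python
# Sketch of a worker keep-alive loop (not the actual worker.py code).
# The /worker_ping endpoint and 10-second interval are documented above;
# the JSON payload fields are placeholders.
import time

import requests

DISPATCHER_URL = "http://localhost:7860"

def ping_loop(gpu_id: int, port: int, interval: float = 10.0) -> None:
    while True:
        try:
            requests.post(
                f"{DISPATCHER_URL}/worker_ping",
                json={"gpu_id": gpu_id, "port": port},  # placeholder payload
                timeout=5,
            )
        except requests.RequestException:
            pass  # dispatcher may be restarting; keep retrying
        time.sleep(interval)
```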
### Input Queue Optimization

The system implements intelligent input filtering to maintain performance:

- **Queue Management**: Each worker maintains an input queue per session
- **Interesting Input Detection**: The system distinguishes "interesting" inputs (clicks, key presses) from uninteresting ones (mouse movements)
- **Smart Processing**: When multiple inputs are queued (see the sketch after this list):
  - Interesting inputs are processed immediately, skipping queued mouse movements
  - If no interesting inputs are found, only the latest mouse position is processed
  - This prevents the system from getting bogged down processing every mouse movement
- **Performance**: Maintains responsiveness even during rapid mouse movements
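A minimal sketch of that filtering rule is shown below. It is not the exact `worker.py` implementation; the event `type` strings are assumptions.

```python
# Sketch of the input-filtering rule described above (not the exact
# worker.py code; the event "type" strings are assumed).
from collections import deque

INTERESTING_TYPES = {"mousedown", "mouseup", "click", "keydown", "keyup"}

def next_input(queue: deque) -> dict | None:
    """Pick the next input to process from a per-session queue.

    Returns the first interesting event (click or key press), discarding the
    mouse movements queued before it. If the queue holds only mouse movements,
    returns the most recent one and drops the rest.
    """
    latest_move = None
    while queue:
        event = queue.popleft()
        if event.get("type") in INTERESTING_TYPES:
            return event       # clicks and key presses are handled right away
        latest_move = event    # keep only the newest mouse position
    return latest_move
```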
## Configuration

### Dispatcher Settings (in `dispatcher.py`)

```python
self.IDLE_TIMEOUT = 20.0  # When no queue
self.QUEUE_WARNING_TIME = 10.0
self.MAX_SESSION_TIME_WITH_QUEUE = 60.0  # When there's a queue
self.QUEUE_SESSION_WARNING_TIME = 45.0  # 15 seconds before timeout
self.GRACE_PERIOD = 10.0
```

### Worker Settings (in `worker.py`)

```python
self.MODEL_NAME = "yuntian-deng/computer-model-s-newnewd-freezernn-origunet-nospatial-online-x0-joint-onlineonly-222222k7-06k"
self.SCREEN_WIDTH = 512
self.SCREEN_HEIGHT = 384
self.NUM_SAMPLING_STEPS = 32
self.USE_RNN = False
```
## Monitoring

### Health Checks

Check worker health:

```bash
curl http://localhost:8001/health  # GPU 0
curl http://localhost:8002/health  # GPU 1
```
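To check several workers at once, a small loop like this works; it only assumes the port scheme described above (8001 + GPU_ID).

```bash
# Query every worker's /health endpoint (ports are 8001 + GPU_ID).
for PORT in 8001 8002 8003; do
    echo "worker on port ${PORT}:"
    curl -s "http://localhost:${PORT}/health"
    echo
done
```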
### Logs

The system provides detailed logging for debugging and monitoring:

**Dispatcher logs:**

- `dispatcher.log` - All dispatcher activity, session management, and queue operations

**Worker logs:**

- `workers.log` - Summary output from the worker startup script
- `worker_gpu_0.log` - Detailed logs from the GPU 0 worker
- `worker_gpu_1.log` - Detailed logs from the GPU 1 worker
- `worker_gpu_N.log` - Detailed logs from the GPU N worker

**Monitor all worker logs:**

```bash
# Tail all worker logs simultaneously
python tail_workers.py --num-gpus 2

# Or monitor individual workers
tail -f worker_gpu_0.log
tail -f worker_gpu_1.log
```
## Troubleshooting

### Common Issues

1. **Worker not registering**: Check that the dispatcher is running first
2. **GPU memory issues**: Ensure each worker is assigned to a different GPU
3. **Port conflicts**: Make sure ports 7860, 8001, 8002, etc. are available
4. **Model loading errors**: Check that the model files and configurations are present

### Debug Mode

Enable debug logging by setting the log level in both files:

```python
import logging

logging.basicConfig(level=logging.DEBUG)
```
## Scaling

To add more GPUs:

1. Start additional workers with higher GPU IDs
2. Workers automatically register with the dispatcher
3. Queue processing automatically utilizes all available workers

The system scales horizontally: add as many workers as you have GPUs available.
## API Endpoints

### Dispatcher

- `GET /` - Serve the web interface
- `WebSocket /ws` - User connections
- `POST /register_worker` - Worker registration (example below)
- `POST /worker_ping` - Worker health pings

### Worker

- `POST /process_input` - Process user input
- `POST /end_session` - Clean up session
- `GET /health` - Health check
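For illustration only, a manual registration call might look like the following. The endpoint itself is real, but the JSON fields are hypothetical placeholders; check `dispatcher.py` for the actual schema it expects.

```bash
# Hypothetical example: the /register_worker endpoint exists, but the
# payload fields below are placeholders; see dispatcher.py for the real schema.
curl -X POST http://localhost:7860/register_worker \
     -H "Content-Type: application/json" \
     -d '{"gpu_id": 0, "port": 8001}'
```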