---
title: RuSimulBench Arena
emoji: 📊
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.21.0
app_file: app.py
pinned: false
---

# Model Response Evaluator

This application evaluates model responses on both creativity metrics (using Gemini) and stability metrics (using semantic similarity).

## Features

- Evaluate individual model responses for creativity, diversity, relevance, and stability
- Run batch evaluations on multiple models from a CSV file
- Web interface for easy use
- Command-line interface for scripting and automation
- Combined scoring that balances creativity and stability

## Installation

1. Clone this repository
2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Get a Gemini API key from Google AI Studio (https://makersuite.google.com/)

## Usage

### Web Interface

```bash
python app.py --web
```

This starts a Gradio web interface where you can:

- Evaluate single responses
- Upload CSV files for batch evaluation
- View evaluation results

### Command Line

For batch evaluation of models from a CSV file:

```bash
python app.py --gemini_api_key YOUR_API_KEY --input_file your_responses.csv
```

Optional arguments:

- `--models`: Comma-separated list of model names to evaluate (e.g., "gpt-4,claude-3")
- `--prompt_col`: Column name containing prompts (default: "rus_prompt")

## CSV Format

Your CSV file should have these columns:

- A prompt column (default: "rus_prompt")
- One or more response columns with names ending in "_answers" (e.g., "gpt4_answers", "claude_answers")

## Evaluation Metrics

### Creativity Metrics

- **Креативность (Creativity)**: Uniqueness and originality of the response
- **Разнообразие (Diversity)**: Use of varied linguistic features
- **Релевантность (Relevance)**: How well the response addresses the prompt

### Stability Metrics

- **Stability Score**: Semantic similarity between prompts and responses

### Combined Score

- Average of the creativity and stability scores

## Output

The evaluation produces:

- CSV files with detailed per-response evaluations for each model
- A benchmark_results.csv file with aggregated metrics for all models

## Environment Variables

You can set the `GEMINI_API_KEY` environment variable instead of passing the API key as a command-line argument.
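
As a quick end-to-end illustration, here is one way to combine the environment variable with the batch-evaluation flags documented above. This is a minimal sketch assuming a POSIX shell; the key value and CSV path are placeholders, and the model names are the same example values used in the options list.

```bash
# Minimal sketch (POSIX shell): the key, CSV path, and model names are placeholders.
export GEMINI_API_KEY="your-api-key-here"

# Run a batch evaluation without passing the key on the command line;
# prompts are read from the default "rus_prompt" column.
python app.py --input_file your_responses.csv --models "gpt-4,claude-3" --prompt_col "rus_prompt"
```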