File size: 4,147 Bytes
5301c48
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102

## Overview
The `generate_by_topic` template is designed to create diverse synthetic data across multiple topics based on user instructions. It can automatically generate relevant topics if not provided and handles deduplication across generated content.

## Key Features
- Automatic topic generation based on user instructions
- Customizable number of records and records per topic
- Built-in deduplication mechanism
- Flexible output schema configuration
- Parallel data generation with configurable concurrency

## Input Schema
```python
class GenerateByTopicInput(BaseModel):
    user_instruction: Optional[str] = None
    num_records: Optional[int] = 10
    records_per_topic: int = 10
    topics: Optional[List[Union[str, Dict[str, int]]]] = None
    topic_model_name: str = "openai/gpt-4o-mini"
    topic_model_kwargs: Optional[Dict[str, Any]] = None
    generation_model_name: str = "openai/gpt-4o-mini"
    generation_model_kwargs: Optional[Dict[str, Any]] = None
    output_schema: Optional[Union[List[Dict[str, Any]], Dict[str, Any], type]] = [
        {"name": "question", "type": "str"},
        {"name": "answer", "type": "str"}
    ]
    data_factory_config: Optional[Dict[str, Any]] = {}
```

## Parameters
| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `user_instruction` | str | Instruction for data generation | None |
| `num_records` | int | Total number of records to generate | 10 |
| `records_per_topic` | int | Number of records per topic | 10 |
| `topics` | List[Union[str, Dict[str, int]]] | List of topics or topic with specific record count | None |
| `topic_model_name` | str | Model name for topic generation | "openai/gpt-4o-mini" |
| `topic_model_kwargs` | Dict[str, Any] | Additional parameters for topic model | None |
| `generation_model_name` | str | Model name for data generation | "openai/gpt-4o-mini" |
| `generation_model_kwargs` | Dict[str, Any] | Additional parameters for generation model | None |
| `output_schema` | Union[List[Dict[str, Any]], Dict[str, Any], type] | Schema for generated data | [{"name": "question", "type": "str"}, {"name": "answer", "type": "str"}] |
| `data_factory_config` | Dict[str, Any] | Configuration for data generation process | {} |

## Example Usage
```python
{
    "user_instruction": "Generate Q&A pairs about machine learning concepts",
    "num_records": 100,
    "records_per_topic": 5,
    "topics": [
        "supervised learning",
        "unsupervised learning",
        {"reinforcement learning": 3},
        "neural networks",
    ],
    "topic_model_name": "openai/gpt-4",
    "topic_model_kwargs": {"temperature": 0.7},
    "generation_model_name": "openai/gpt-4",
    "generation_model_kwargs": {"temperature": 0.8, "max_tokens": 200},
    "output_schema": [
        {"name": "question", "type": "str"},
        {"name": "answer", "type": "str"},
        {"name": "difficulty", "type": "str"},
    ],
    "data_factory_config": {"max_concurrency": 4, "task_runner_timeout": 60 * 2},
}
```

## Workflow
1. Topic Preparation:
   - If topics are not provided, generates relevant topics based on user instruction
   - Shuffles topics for better distribution and deduplication

2. Data Generation:
   - Generates data for each topic using the specified model
   - Implements deduplication by tracking previously generated examples
   - Adds topic information to each generated record

## Output
The generated data will include:
- Fields specified in the output schema
- An additional `topic` field indicating the topic of each record

## Dependencies
- `starfish` framework
- `pydantic` for input validation


## Sample Run

Check out [`sample_run.ipynb`](./sample_run.ipynb) for a complete example you can run right away.

## Source Implementation

The actual template code is located at:
```
src/starfish/data_gen_template/templates/starfish/generate_by_topic/
```

---

**Try it out!** If you have any questions, let us know - we'd be happy to help. If you like this template, consider starring the repo and building your own! We welcome community contributions and are always happy to chat about new ideas. ⭐