Commit e2b6ad9 · Parent(s): e365a68
[fix] Add Hugging Face Space metadata
README.md
CHANGED
@@ -1,117 +1,9 @@
-
-
-
-
-
-
-
-
-
-<a href="screenshot_2.png"><img src="assets/screenshot_2.png" width="335"></a>
-
-
-## 📖 Overview
-**SynthDataGen** is an AI-powered tool that creates **realistic, fake data** for any project. You don’t need to collect real information; instead, just tell SynthDataGen what kind of data you want, and it will **quickly generate** it. Thanks to its **easy-to-use web interface** built with Gradio, **anyone** can start making custom datasets right away.
-
-### 🔑 **Key Features**
-- The app can generate **various types of datasets**, such as **tables**, **time-series data**, or **text content**.
-- The output can be saved in different **formats**, including **JSON**, **CSV**, **Parquet**, or **Markdown**.
-- **AI models** like **GPT** and **Claude** are used to automatically create the dataset based on the task.
-- A short **description of the desired dataset** is all that's needed to trigger the generation process.
-- A **download link** is provided once the dataset is ready, making it easy to save and use.
-- The **interface updates options automatically** and includes helpful **examples for inspiration**.
-
-### 🎯 **How It Works**
-1️⃣ Describe the dataset to generate by entering a short business problem or topic.
-
-2️⃣ Select the dataset type, output format, AI model, and number of samples.
-
-3️⃣ Download the generated dataset once it's ready: clean, structured, and ready to use.
-
-### 🤔 **Why Choose SynthDataGen?**
-- ⏰ **Time Saver**: Automatically creates tables, time-series, or text data, with no need to gather real data yourself.
-- ⚙️ **Flexible and Accessible**: Supports multiple formats (JSON, CSV, Parquet, Markdown) with a beginner-friendly interface.
-- 🤖 **Powered by GPT & Claude**: Uses two top AI models to produce realistic synthetic data for prototyping or research.
-
-### 🔧 **SynthDataGen Customization**
-SynthDataGen is fully customizable through Python code. You can easily modify:
-- ✏️ **System prompt** to control how the AI models generate code
-- 🤖 Easily add **new frontier** or **open-source models** (e.g., LLaMA, DeepSeek, Qwen), or integrate any model from **Hugging Face libraries** and **inference endpoints**.
-- 📊 **Dataset types**, by adding new categories like image metadata, dialogue transcripts ...
-- 📁 **Output formats**, such as YAML, XML ...
-- 🎨 **Interface styling**, including layout, colors, and themes
-
-### 🏗️ **Architecture**
-
-<a href="func_architecture.png"><img src="assets/func_architecture.png"></a>
-<a href="tech_architecture.png"><img src="assets/tech_architecture.png"></a>
-
-## ⚙️ Setup & Installation
-
-**1. Clone the Repository**
-```bash
-git clone https://github.com/lisek75/synthdatagen_app.git
-cd synthdatagen_app
-```
-
-**2. Install Dependencies**
-
-```bash
-conda env create -f synthdatagen_env.yml
-conda activate synthdatagen
-```
-**3. Configure API Keys & Endpoints**
-
-Create a `.env` file with the following variables:
-```python
-OPENAI_API_KEY = your_openai_api_key
-ANTHROPIC_API_KEY = your_anthropic_api_key
-```
-Ensure that the `.env` file remains **secure** and is not shared publicly.
-
-
-## 🚀 Running the Gradio App
-
-**Run the Application Locally**
-```bash
-python app.py
-```
-
-**Run the Application with Docker**
-
-To run the app using Docker, you can either build the image yourself or use the pre-built image from Docker Hub.
-
-- Build and run the app locally:
-Build the image from the provided Dockerfile using your own Docker Hub username:
-```bash
-docker build -t <user-dockerhub-username>/synthdatagen:v1.0 .
-docker run -d --name synthdatagen-container -p 7860:7860 --env-file .env <user-dockerhub-username>/synthdatagen:v1.0
-```
-This will build the Docker image and run the app in a container.
-
-- Run the app directly from Docker Hub:
-Pull the pre-built image from the Docker Hub repository (⚠️ make sure to use the latest version tag from Docker Hub).
-Check: https://hub.docker.com/r/lizk75/synthdatagen/tags
-
-```bash
-docker pull lizk75/synthdatagen:v1.0
-docker run -d --name synthdatagen-container -p 7860:7860 --env-file .env lizk75/synthdatagen:v1.0
-```
-
-
-## 🧑‍💻 Usage Guide
-- You can launch the app directly from:
-- The **demo link** provided at the top of this README.
-- Or by executing it **locally** using the command `python app.py` from Visual Studio or any other IDE.
-- **Describe your dataset** by entering a clear business problem or topic.
-- Select the **dataset type** and **output format**.
-- Choose an **AI model** (GPT or Claude).
-- Set the desired **number of samples**.
-- Click **Create Dataset** and download the generated file.
-
-
-## 📓 Google Colab
-A **notebook version** is available for users who prefer running the app in a notebook environment. The notebook includes additional **open-source models** that require a **GPU**, which is why it's recommended to run it on Google Colab or a local machine with GPU support.
-
-https://github.com/lisek75/nlp_llms_notebook/blob/main/07_data_generator.ipynb
-
+---
+title: SynthDataGen
+emoji: 🧬
+colorFrom: indigo
+colorTo: pink
+sdk: docker
+app_file: Dockerfile
+pinned: false
+---
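The nine added lines are the YAML front matter block that a Hugging Face Space reads from the top of `README.md`. As a rough way to sanity-check such a block before pushing, here is a minimal sketch in Python. It assumes flat `key: value` pairs only (no nesting, lists, or quoting), and `parse_front_matter` is a hypothetical helper written for this illustration, not part of any Hugging Face tooling.

```python
def parse_front_matter(readme_text: str) -> dict[str, str]:
    """Return the flat key/value pairs between the leading '---' fences.

    Hypothetical helper: handles only simple `key: value` lines and
    returns an empty dict when the block is missing or never closed.
    """
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta  # closing fence reached
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}  # no closing fence found


# The front matter exactly as added in this commit.
README = """---
title: SynthDataGen
emoji: 🧬
colorFrom: indigo
colorTo: pink
sdk: docker
app_file: Dockerfile
pinned: false
---
"""

meta = parse_front_matter(README)
print(meta["sdk"], meta["app_file"])
```

One thing this kind of check makes easy to spot: as far as I know, the Spaces configuration reference describes `app_file` for Gradio, Streamlit, and static Spaces, while Docker Spaces are configured through `app_port` (default 7860, the same port the removed `docker run -p 7860:7860` commands used). If that holds, the `app_file: Dockerfile` entry may have no effect here.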