Loren1214 committed on
Commit 5eeb352 · verified · 1 Parent(s): 90b6539

Update README.md

Files changed (1)
  1. README.md +112 -96
README.md CHANGED
@@ -12,139 +12,155 @@ license: mit
12
  tag: agent-demo-track
13
  ---
14
 
15
- # Scriptura: A Multi-Agent System for Screenplay Creation and Editing
16
 
17
- The explanation **video** is available at: https://www.youtube.com/watch?v=I0201ruB1Uo&ab_channel=3DLabFactory.
18
 
19
- The screenplay used in the video as sample is available at: https://www.studiobinder.com/blog/best-free-movie-scripts-online/
20
 
21
  ## Introduction
22
 
23
  **Scriptura** is a multi-agent AI framework based on HF-SmolAgents that streamlines the creation of screenplays, storyboards, and soundtracks by automating the stages of analysis, summarization, and multimodal enrichment—freeing authors to focus on pure creativity.
24
 
25
  At its heart:
26
- - **Qwen3-32B** serves as the primary orchestrating agent, coordinating workflows and managing high-level reasoning across the system.
27
- - **Gemma-3-27B-IT** acts as a specialized assistant for multimodal tasks, supporting both text and audio inputs to refine narrative elements and prepare them for downstream generation.
 
28
 
29
  For media generation, Scriptura integrates:
30
- - **MusicGen** models (per the AudioCraft MusicGen specification), deployed via Hugging Face Spaces, enabling the agent to produce original soundtracks and sound effects from text prompts or combined text + audio samples.
31
- - **FLUX (black-forest-labs/FLUX.1-dev)** for on-the-fly image creation, ideal for storyboards, concept art, and visual references that seamlessly tie into the narrative flow.
 
32
 
33
  Optionally, Scriptura can query external sources (e.g., via a DuckDuckGo API integration) to pull in reference scripts, sound samples, or research materials, ensuring that every draft is not only creatively rich but also contextually informed.
34
 
 
 
35
  ## Agent Capabilities
36
 
37
- **Input File Parsing**
38
- : - **Formats accepted**: `TXT`, `PDF`, `DOCX`, `JPEG/PNG`, `MP3/WAV`
39
- - **Process**: PDF/DOCX plain text; OCR on images; speech-to-text on audio.
40
- - **Why it matters**: Provides structured input for all downstream modules.
41
-
42
- **Overall Plot Summary**
43
- : - **Model**: `DeepSeek-R1`
44
- - **Output**: 4–6 sentence summary of main narrative threads (timeframe, tone).
45
- - **Mechanics**: API calls to DeepSeek with retry logic for improved coherence.
46
-
47
- **Entity & Theme Extraction**
48
- : - **Technique**: Named Entity Recognition (via **DeepSeek**)
49
- - **Extracts**: Characters, locations, key events, recurring themes, narrative tone.
50
- - **Output**: JSON/CSV + ~5-sentence abstract.
51
-
52
- **Rights & Licensing Verification**
53
- : - **Web Search ON**: Queries DuckDuckGo API → fetch license info if match.
54
- - **Web Search OFF**: May recognize very famous works internally (e.g. “Harry Potter”) but not guaranteed.
55
- - **If no match & search OFF**: No licensing check.
56
-
57
- **Image Generation (Storyboard & Concept Art)**
58
- : - **Model**: `FLUX (black-forest-labs/FLUX.1-dev)`
59
- - **Trigger**: “Generate Image” / storyboard phase.
60
- - **Process**: DeepSeek crafts cinematic prompt → FLUX returns PNG/JPEG + caption.
61
-
62
- **Audio Generation (Music & Sound Effects)**
63
- : - **Model**: `MusicGen (facebook/musicgen-melody)`
64
- - **Trigger**: “Generate Audio.”
65
- - **Process**: Send prompt → receive MP3/WAV (standalone audio, no text/images).
66
-
67
- **In-Depth Analysis of Key Points**
68
- : - **Extracts**:
69
- - Characters (role, gender, description)
70
- - Locations (interior/exterior, period, geography)
71
- - Plot Points (crucial narrative beats via Story Understanding models)
72
- - **Extras**: Semantic toponym extraction → internal scene maps; detect transitions (“Suddenly,” “Meanwhile”).
73
-
74
- **Optional Web Search**
75
- : - **Checkbox** toggles DuckDuckGo API lookups.
76
- - **If Enabled**: search preconfigured sites (free & paid) for scripts, sound effects.
77
- - **Output**: List of links + short summaries.
78
 
79
 
80
  ---
81
 
82
  ## Agent Flow
83
 
84
- ```mermaid
85
- flowchart LR
86
- A[Start Agent] --> B[Load Input (text, image, audio)]
87
- B --> C[Preprocessing: PDF/DOCX → text, OCR, audio transcription]
88
- C --> D[Generate Plot Summary (DeepSeek)]
89
- D --> E[Extract Entities & Themes (DeepSeek)]
90
- E --> F {Web Search Enabled?}
91
- F -->|Yes| G[Web Search via DuckDuckGo API]
92
- F -->|No| H[Continue Offline Analysis]
93
- H --> I[Rights & Licensing Check]
94
- I --> J[Deep Analysis: characters, locations, plot points]
95
- J --> K {Image Generation Requested?}
96
- K -->|Yes| L[API Call to FLUX for storyboard/concept art]
97
- K -->|No| M[Skip Image Generation]
98
- M --> N {Audio Generation Requested?}
99
- N -->|Yes| O[API Call to MusicGen for audio tracks]
100
- N -->|No| P[Skip Audio Generation]
101
- L & O --> Q[Final Output: text, JSON/CSV, images, audio]
102
  ```
103
  ---
104
- ## Deployment & Access and the Code Overview
105
 
106
  ---
107
  ## Use Cases
108
 
109
  **Independent Writer**
110
- : - Upload a screenplay and quickly get a summary, a list of characters, and locations.
111
- - Create visual storyboards of key narrative moments via FLUX (PNG/JPEG outputs).
112
- - Generate brief soundtracks or sound effects to accompany script presentations (MP3/WAV).
113
 
114
  **Film Production Company**
115
- : - Import multiple screenplays (PDF, DOCX) and automatically receive reports on characters, locations, and potential copyright issues.
116
- - Use the web search feature to find reference scripts or specific sound effects from free/paid sources.
117
- - Develop visual storyboards and audio prototypes to share with directors, artists, and investors.
118
 
119
  **Translation and Adaptation Agency**
120
- : - Upload foreign-language scripts and obtain a structured text version with extracted entities (JSON/CSV).
121
- - Generate contextual images for cultural adaptation (e.g., images matching the original setting via FLUX).
122
- - Produce reference audio via MusicGen to test culturally appropriate music for the target audience.
123
 
124
  **Digital Humanities Course**
125
- : - Demonstrate how to build a text-mining tool applied to performing arts, combining NLP, image, and audio pipelines.
126
- - Allow students to analyze real scripts, generate abstracts, scene maps, and visual/audio prototypes in a hands-on environment.
127
- - Explore Transformer models (DeepSeek), OCR, speech-to-text, and AI-driven media generation as part of the curriculum.
128
 
129
  ---
130
- ## Credits
131
-
132
 
133
- ---
134
- ## Acknowledgements
135
 
136
 
137
  ---
138
- ### Contributors:
139
- - Code development and implementation made by **luke9705**;
140
- - Ideas creation, testing and videomaking conducted by **OrianIce**;
141
- - Research and testing by **Loren1214**;
142
- - Code revisions by **DDPM**.
143
-
144
- ---
145
- ### Sources
146
-
147
- - Russell, S., & Norvig, P. (2021). *Artificial Intelligence: A Modern Approach* (3rd ed.). Pearson.
148
- - Cambria, E., & White, B. (2014). *Jumping NLP Curves: A Review of Natural Language Processing Research*. IEEE Computational Intelligence Magazine, 9(2), 48–57.
149
- - Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., … & Sutskever, I. (2022). *Hierarchical Text-Conditional Image Generation with CLIP Latents*. arXiv preprint arXiv:2204.06125.
150
-
12
  tag: agent-demo-track
13
  ---
14
 
15
+ # Scriptura: A Multi-Agent System for Screenplay Creation and Editing
16
 
17
+ The explanation video is available [here](https://www.youtube.com/watch?v=I0201ruB1Uo).
18
 
19
+ The sample screenplay used in the video is available [here](https://www.studiobinder.com/blog/best-free-movie-scripts-online/).
20
 
21
  ## Introduction
22
 
23
  **Scriptura** is a multi-agent AI framework based on HF-SmolAgents that streamlines the creation of screenplays, storyboards, and soundtracks by automating the stages of analysis, summarization, and multimodal enrichment—freeing authors to focus on pure creativity.
24
 
25
  At its heart:
26
+
27
+ * Qwen3-32B serves as the primary orchestrating agent, coordinating workflows and managing high-level reasoning across the system.
28
+ * Gemma-3-27B-IT acts as a specialized assistant for multimodal tasks, supporting both text and audio inputs to refine narrative elements and prepare them for downstream generation.
29
 
30
  For media generation, Scriptura integrates:
31
+
32
+ * MusicGen models (per the AudioCraft MusicGen specification), deployed via Hugging Face Spaces, enabling the agent to produce original soundtracks and sound effects from text prompts or combined text + audio samples.
33
+ * FLUX (black-forest-labs/FLUX.1-dev) for on-the-fly image creation, ideal for storyboards, concept art, and visual references that seamlessly tie into the narrative flow.
34
 
35
  Optionally, Scriptura can query external sources (e.g., via a DuckDuckGo API integration) to pull in reference scripts, sound samples, or research materials, ensuring that every draft is not only creatively rich but also contextually informed.
36
 
37
+ ---
38
+
39
  ## Agent Capabilities
40
 
41
+ Scriptura provides a rich set of agents and tools to cover the full screenplay production and enrichment pipeline:
42
+
43
+ - **Text Analysis & Summarization**
44
+ - Automatically extracts key themes, character arcs, and plot points
45
+ - Segments and summarizes scenes for rapid iteration
46
+
47
+ - **Multimodal Ingestion**
48
+ - Supports PDF, DOCX, ODT, TXT and image uploads
49
+ - Transcribes audio files using OpenAI Whisper
50
+
51
+ - **Image Generation**
52
+ - On-the-fly storyboard and concept art creation via FLUX (black-forest-labs/FLUX.1-dev)
53
+
54
+ - **Audio Generation**
55
+ - Produces original soundtracks and SFX with MusicGen (AudioCraft spec)
56
+ - Allows sample-conditioned audio generation
57
 
58
+ - **Captioning & Metadata**
59
+ - Auto-generates captions and descriptions for images using Gemma-3-27B-IT
60
+
61
+ - **Optional Web Research**
62
+ - Queries DuckDuckGo to fetch example scripts, sound samples, or contextual references
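The ingestion bullets above amount to routing each upload by file type. A minimal sketch of that routing, assuming the file extension alone decides the path. The `ROUTES` table, the route names, and `route_upload` are hypothetical and not taken from `app.py`; the image and audio extensions are also assumptions, since the README does not enumerate them.

```python
from pathlib import Path

# Hypothetical routing table: file extension -> name of the ingestion route.
# Document formats follow the README; image/audio extensions are assumed.
ROUTES = {
    ".pdf": "pdf_text",
    ".docx": "docx_text",
    ".odt": "odt_text",
    ".txt": "plain_text",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".mp3": "audio_transcription",
    ".wav": "audio_transcription",
}


def route_upload(filename: str) -> str:
    """Return the ingestion route for an uploaded file, based on its extension."""
    suffix = Path(filename).suffix.lower()
    try:
        return ROUTES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or filename!r}") from None


print(route_upload("draft.PDF"))  # pdf_text
```

Rejecting unknown extensions up front keeps unsupported files from reaching the text extraction or transcription steps.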
63
 
64
  ---
65
 
66
  ## Agent Flow
67
 
68
+ Here’s an example flow demonstrating how you could use the agent.
69
+
70
+ <img alt="Flowchart" src="https://www.canva.com/design/DAGphLlng2I/MZ2cOAnS520rFtnhTP5H6A/view?utm_content=DAGphLlng2I&utm_campaign=designshare&utm_medium=link2&utm_source=uniquelinks&utlId=hca1222039d" width="600"/>
71
+
72
+ ![img.png](img.png)
73
+ ---
74
+
75
+ ## Code Overview
76
+
77
+ ```bash
78
+ .
79
+ ├── app.py # Entry point: defines Gradio interface and routing logic
80
+ ├── system_prompt.txt # System-level prompt template for the CodeAgent
81
+ ├── requirements.txt # Python dependencies (Gradio, SmolAgents, OpenAI, etc.)
82
+ └── README.md # Project documentation
83
  ```
84
+
85
+ * **app.py**
86
+
87
+ * **Agent** class: loads Qwen3-32B model, registers all tools
88
+ * **respond()**: orchestrates between Gradio inputs and CodeAgent
89
+ * Decorated `@tool` functions for image download, media generation, transcription, captioning
90
+ * Gradio `ChatInterface` setup with text/file support and “Enable web search” toggle
91
+
92
+ * **system\_prompt.txt**
93
+
94
+ * Injects the agent’s “way of thinking,” including reasoning structure and error handling
95
+
96
+ * **requirements.txt**
97
+
98
+ * Lists all required libraries (Gradio, SmolAgents, OpenAI, HuggingFace, PDFPlumber, etc.)
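The "Enable web search" toggle described above can be thought of as choosing which tools the agent receives for a request. A rough sketch under that assumption: the tool names and `select_tools` are hypothetical placeholders, and the actual wiring between `respond()` and the CodeAgent in `app.py` may differ.

```python
from typing import List

# Hypothetical tool names standing in for the @tool functions in app.py.
CORE_TOOLS = ["generate_image", "generate_audio", "transcribe_audio", "caption_image"]
WEB_TOOLS = ["web_search"]


def select_tools(web_search_enabled: bool) -> List[str]:
    """Return the tool names to expose for one request."""
    tools = list(CORE_TOOLS)  # copy so the module-level list is never mutated
    if web_search_enabled:
        tools.extend(WEB_TOOLS)
    return tools


print(select_tools(False))
print(select_tools(True))
```

Building the list per request means the toggle takes effect immediately, without restarting the app.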
99
+
100
  ---
101
+
102
+ ## Deployment & Access
103
+
104
+ ### Hugging Face Spaces
105
+
106
+ 1. Include `app.py`, `system_prompt.txt`, and `requirements.txt` in the root of your Space.
107
+ 2. Configure `OPENAI_API_KEY` and `HF_TOKEN` as Secrets in your Space’s settings.
108
+ 3. Make sure the Space is set to use **Python 3.10 or higher**.
109
+ 4. Select **Gradio** as the SDK (version 5.32.1).
110
+ 5. Pin or share the Space link to collaborate with your team.
111
+
112
+ > **Note:** If you choose to clone this repository and run it locally, make sure to set your own `OPENAI_API_KEY` and `HF_TOKEN` environment variables before launching.
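For local runs, a small pre-launch check can catch a missing secret early. A minimal sketch, assuming the two variable names given in the note above are the only ones required; `missing_secrets` is a hypothetical helper, not part of the repository.

```python
import os

# Variable names come from the deployment notes; the helper itself is hypothetical.
REQUIRED_VARS = ("OPENAI_API_KEY", "HF_TOKEN")


def missing_secrets(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment:
print(missing_secrets({"OPENAI_API_KEY": "sk-test"}))  # ['HF_TOKEN']
```

Calling `missing_secrets()` with no argument checks the real environment; launching could then be aborted if the returned list is non-empty.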
113
 
114
  ---
115
  ## Use Cases
116
 
117
  **Independent Writer**
118
+ * Upload a screenplay and quickly get a summary, a list of characters, and locations.
119
+ * Create visual storyboards of key narrative moments via FLUX (PNG/JPEG outputs).
120
+ * Generate brief soundtracks or sound effects to accompany script presentations (MP3/WAV).
121
 
122
  **Film Production Company**
123
+ * Import multiple screenplays (PDF, DOCX) and automatically receive reports on characters, locations, and potential copyright issues.
124
+ * Use the web search feature to find reference scripts or specific sound effects from free/paid sources.
125
+ * Develop visual storyboards and audio prototypes to share with directors, artists, and investors.
126
 
127
  **Translation and Adaptation Agency**
128
+ * Upload foreign-language scripts and obtain a structured text version with extracted entities (JSON/CSV).
129
+ * Generate contextual images for cultural adaptation (e.g., images matching the original setting via FLUX).
130
+ * Produce reference audio via MusicGen to test culturally appropriate music for the target audience.
131
 
132
  **Digital Humanities Course**
133
+ * Demonstrate how to build a text-mining tool applied to performing arts, combining NLP, image, and audio pipelines.
134
+ * Allow students to analyze real scripts, generate abstracts, scene maps, and visual/audio prototypes in a hands-on environment.
135
+ * Explore Transformer models (DeepSeek), OCR, speech-to-text, and AI-driven media generation as part of the curriculum.
136
 
137
  ---
 
 
138
 
139
+ ## Contributors
 
140
 
141
+ * Code development and implementation by luke9705;
142
+ * Ideation, testing, and video production by OrianIce;
143
+ * Research and testing by Loren1214;
144
+ * Code revisions by DDPM.
145
 
146
  ---
147
+ ## Sources
148
+ The following libraries, models, and tools power Scriptura’s agents and multimodal capabilities:
149
+
150
+ - **Qwen3-32B** – primary orchestrating LLM for high-level reasoning and workflow management
151
+ - **Gradio** – interactive web UI framework
152
+ - **smolagents** – lightweight multi-agent orchestrator from Hugging Face
153
+ - **huggingface_hub** – model & dataset management
154
+ - **duckduckgo-search** – optional web research integration
155
+ - **openai** – Whisper transcription, GPT-based reasoning
156
+ - **anthropic** – Claude-style LLM support
157
+ - **pdfplumber** – PDF text extraction
158
+ - **docx2txt** – DOCX parsing
159
+ - **odfpy** – ODT parsing
160
+ - **pandas** – data handling
161
+ - **Pillow (PIL)** – image processing
162
+ - **requests** – HTTP client for external APIs
163
+ - **numpy** – numerical operations
164
+ - **MusicGen (AudioCraft)** – soundtrack and SFX generation
165
+ - **FLUX (black-forest-labs/FLUX.1-dev)** – on-the-fly image generation
166
+ - **Gemma-3-27B-IT** – multimodal captioning and metadata