# Google Colab Version: [Open this notebook in Google Colab](https://colab.research.google.com/github/starfishdata/starfish/blob/main/examples/data_factory.ipynb)

#### Dependencies 

In [23]:
#%pip install starfish-core
%pip install --index-url https://test.pypi.org/simple/ \
 --extra-index-url https://pypi.org/simple \
 starfish-core

Looking in indexes: https://test.pypi.org/simple/, https://pypi.org/simple

[notice] A new release of pip is available: 25.0.1 -> 25.1.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [24]:
## Fix for Jupyter Notebook only; do NOT use in production
## Enables async code execution in notebooks, but may cause sync/async conflicts
## For production, run in standard .py files without this workaround
## See: https://github.com/erdewit/nest_asyncio for more details
import nest_asyncio
nest_asyncio.apply()

from starfish import StructuredLLM, data_factory
from starfish.llm.utils import merge_structured_outputs

from starfish.common.env_loader import load_env_file ## Load environment variables from .env file
load_env_file()



In [25]:
## Helper function: mock LLM call
# When developing data pipelines with LLMs, making thousands of real API calls
# can be expensive. Using mock LLM calls lets you test your pipeline's reliability,
# failure handling, and recovery without spending money on API calls.
from starfish.data_factory.utils.mock import mock_llm_call

#### 3. Working with Different Input Formats


Data Factory is flexible with how you provide inputs. Let's demonstrate different ways to pass parameters to data_factory functions.

`data` is a reserved keyword that expects a list or tuple of dicts. This design makes it easy to pass large datasets and to support HuggingFace datasets and Pandas DataFrames.
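One way to picture how the input styles relate is to expand every call into a list of per-task dicts, which is the shape the `data` keyword expects. The sketch below is only an illustration under that assumption, not starfish's internal code; `to_records` is a hypothetical helper introduced here.

```python
def to_records(**kwargs):
    """Hypothetical helper: expand keyword inputs into one dict per task."""
    # Lists and tuples hold per-task values; anything else is treated as a scalar.
    lengths = {len(v) for v in kwargs.values() if isinstance(v, (list, tuple))}
    if len(lengths) > 1:
        raise ValueError("all list inputs must have the same length")
    n = lengths.pop() if lengths else 1
    # Zip the lists together and broadcast scalars to every record.
    return [
        {k: (v[i] if isinstance(v, (list, tuple)) else v) for k, v in kwargs.items()}
        for i in range(n)
    ]


cities = ["New York", "London", "Tokyo"]
zipped = to_records(city_name=cities, num_records_per_city=[2, 1, 1])
broadcast = to_records(city_name=cities, num_records_per_city=1)
print(zipped[0])
print(broadcast[2])
```

A Pandas DataFrame can be turned into the same list-of-dicts shape with `df.to_dict(orient="records")` before it is passed as `data=`.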

In [26]:
## We will use the mock LLM call for the rest of the examples to save on tokens
## The mock LLM call is a function that simulates an LLM API call with random delays (controlled by sleep_time) and occasional failures (controlled by fail_rate)
await mock_llm_call(city_name="New York", num_records_per_city=3)

[{'answer': 'New York_3'}, {'answer': 'New York_1'}, {'answer': 'New York_5'}]
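For readers who want to see the idea without installing anything, here is a minimal stand-in with the same shape of behavior: a random delay, an occasional failure, and a list of small records. It is a sketch under assumptions, not the starfish implementation; `fake_llm_call`, its default values, and the `ValueError` exception type are all hypothetical.

```python
import asyncio
import random


async def fake_llm_call(city_name, num_records_per_city, fail_rate=0.05, sleep_time=0.01):
    """Hypothetical stand-in for mock_llm_call; not the starfish implementation."""
    # Simulate network latency with a random delay of up to `sleep_time` seconds.
    await asyncio.sleep(random.uniform(0, sleep_time))
    # Fail occasionally, with probability `fail_rate`.
    if random.random() < fail_rate:
        raise ValueError(f"Mock LLM failed to process city: {city_name}")
    # Return one record per requested item, each tagged with a random suffix.
    return [
        {"answer": f"{city_name}_{random.randint(1, 5)}"}
        for _ in range(num_records_per_city)
    ]


records = asyncio.run(fake_llm_call("New York", 3, fail_rate=0.0))
print(records)
```

Inside a notebook cell you would `await fake_llm_call(...)` directly instead of calling `asyncio.run`.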

In [27]:
@data_factory(max_concurrency=100)
async def input_format_mock_llm(city_name: str, num_records_per_city: int):
    return await mock_llm_call(city_name=city_name, num_records_per_city=num_records_per_city, fail_rate=0.01)

In [28]:
# Format 1: Multiple lists that get zipped together
input_format_data1 = input_format_mock_llm.run(
    city_name=["New York", "London", "Tokyo", "Paris", "Sydney"],
    num_records_per_city=[2, 1, 1, 1, 1],
)

2025-05-23 22:50:10 | INFO | [JOB START] Master Job ID: 4da82fc7-4112-4e05-b58c-53cf470747ad | Logging progress every 3 seconds
2025-05-23 22:50:10 | INFO | [JOB PROGRESS] Completed: 0/5 | Running: 5 | Attempted: 0 (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0, InDeadQueue: 0)
2025-05-23 22:50:11 | INFO | [JOB FINISHED] Final Status: Completed: 5/5 | Attempted: 5 (Failed: 0, Filtered: 0, Duplicate: 0, InDeadQueue: 0)


In [29]:
# Format 2: List + single value (the single value is broadcast to every item)
input_format_data2 = input_format_mock_llm.run(
    city_name=["New York", "London", "Tokyo", "Paris", "Sydney"],
    num_records_per_city=1,
)

2025-05-23 22:50:11 | INFO | [JOB START] Master Job ID: 73973449-6069-485e-ac8c-b1b3a6b3f1a4 | Logging progress every 3 seconds
2025-05-23 22:50:11 | INFO | [JOB PROGRESS] Completed: 0/5 | Running: 5 | Attempted: 0 (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0, InDeadQueue: 0)
2025-05-23 22:50:12 | INFO | [JOB FINISHED] Final Status: Completed: 5/5 | Attempted: 5 (Failed: 0, Filtered: 0, Duplicate: 0, InDeadQueue: 0)


In [30]:
# Format 3: Special 'data' parameter
# 'data' is a reserved keyword expecting a list or tuple of dicts
# Makes integration with various data sources easier
input_format_data3 = input_format_mock_llm.run(
    data=[
        {"city_name": "New York", "num_records_per_city": 2},
        {"city_name": "London", "num_records_per_city": 1},
        {"city_name": "Tokyo", "num_records_per_city": 1},
        {"city_name": "Paris", "num_records_per_city": 1},
        {"city_name": "Sydney", "num_records_per_city": 1},
    ]
)

2025-05-23 22:50:12 | INFO | [JOB START] Master Job ID: aa9954f9-fc18-4b42-959e-fb2a897987c7 | Logging progress every 3 seconds
2025-05-23 22:50:12 | INFO | [JOB PROGRESS] Completed: 0/5 | Running: 5 | Attempted: 0 (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0, InDeadQueue: 0)
2025-05-23 22:50:13 | INFO | [JOB FINISHED] Final Status: Completed: 5/5 | Attempted: 5 (Failed: 0, Filtered: 0, Duplicate: 0, InDeadQueue: 0)


#### 4. Resilient error retry
Data Factory automatically handles errors and retries, making your pipelines robust.
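To make the idea concrete, the following is a minimal sketch of retrying flaky tasks under a concurrency cap with plain `asyncio`. It is not how starfish implements retries; `flaky_task`, `run_with_retries`, the `max_attempts` parameter, and the choice to drop items that exhaust their attempts are all assumptions made for illustration.

```python
import asyncio
import random


async def flaky_task(item, fail_rate):
    """Hypothetical task that fails with probability `fail_rate`."""
    await asyncio.sleep(0)
    if random.random() < fail_rate:
        raise RuntimeError(f"failed: {item}")
    return {"answer": item}


async def run_with_retries(items, fail_rate=0.3, max_concurrency=10, max_attempts=3):
    """Run every item concurrently, retrying failures; return successes and failures."""
    semaphore = asyncio.Semaphore(max_concurrency)  # cap the number of in-flight tasks

    async def worker(item):
        async with semaphore:
            for _ in range(max_attempts):
                try:
                    return await flaky_task(item, fail_rate)
                except RuntimeError:
                    continue  # try again until the attempts run out
        return None  # give up on this item; the rest of the batch keeps going

    results = await asyncio.gather(*(worker(item) for item in items))
    succeeded = [r for r in results if r is not None]
    failed = [item for item, r in zip(items, results) if r is None]
    return succeeded, failed


cities = ["New York", "London", "Tokyo", "Paris", "Sydney"] * 5
succeeded, failed = asyncio.run(run_with_retries(cities))
print(f"{len(succeeded)} succeeded, {len(failed)} failed out of {len(cities)}")
```

The point of the design is that one failing item never aborts the batch: failures are retried, and anything that still fails is reported separately from the successful results.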

Let's demonstrate with a high failure rate example.

In [31]:
@data_factory(max_concurrency=100)
async def high_error_rate_mock_llm(city_name: str, num_records_per_city: int):
    return await mock_llm_call(city_name=city_name, num_records_per_city=num_records_per_city, fail_rate=0.3)  # Hardcoded 30% chance of failure

# Process all cities - some will fail, but data_factory keeps going
cities = ["New York", "London", "Tokyo", "Paris", "Sydney"] * 5  # 25 tasks (5 cities x 5)
high_error_rate_mock_llm_data = high_error_rate_mock_llm.run(city_name=cities, num_records_per_city=1)

print(f"\nSuccessfully completed {len(high_error_rate_mock_llm_data)} out of {len(cities)} tasks")
print("Data Factory automatically handled the failures and continued processing")
print("The results only include successful tasks")

2025-05-23 22:50:13 | INFO | [JOB START] Master Job ID: 730b766d-3c23-419a-a3dd-271d683818b1 | Logging progress every 3 seconds
2025-05-23 22:50:13 | INFO | [JOB PROGRESS] Completed: 0/25 | Running: 25 | Attempted: 0 (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0, InDeadQueue: 0)
2025-05-23 22:50:15 | ERROR | Error running task: Mock LLM failed to process city: Tokyo
2025-05-23 22:50:15 | ERROR | Error running task: Mock LLM failed to process city: New York
2025-05-23 22:50:16 | INFO | [JOB PROGRESS] Completed: 23/25 | Running: 0 | Attempted: 25 (Completed: 23, Failed: 2, Filtered: 0, Duplicate: 0, InDeadQueue: 0)
2025-05-23 22:50:19 | INFO | [

#### 5. Resume

This is essential for long-running jobs with thousands of tasks.

If a job is interrupted, you can pick up where you left off using one of two resume methods:


1. **Same Session Resume**: If you're still in the same session where the job was interrupted, simply call `.resume()` on the same instance.

2. **Cross-Session Resume**: If you've closed your notebook or lost your session, you can resume using the job ID:
 ```python
 from starfish import DataFactory
 # Resume using the master job ID from a previous run
 data_factory = DataFactory.resume_from_checkpoint(job_id="your_job_id")
 ```

The key difference:
- `resume()` uses the same DataFactory instance you defined
- `resume_from_checkpoint()` reconstructs your DataFactory from persistent storage where tasks and progress are saved
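
The idea behind checkpoint-based resume can be sketched in a few lines: persist each finished task, and on the next run skip anything already recorded. This is a conceptual illustration only; starfish's actual storage layer is not shown in this notebook, and `run_job`, the JSON file format, and the `stop_after` parameter (used to simulate an interruption) are hypothetical.

```python
import json
import os
import tempfile


def run_job(items, process, checkpoint_path, stop_after=None):
    """Process items, saving progress so a later call can skip completed work."""
    # Load previously completed results, keyed by the item's index.
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    processed_now = 0
    for index, item in enumerate(items):
        key = str(index)  # JSON object keys are strings
        if key in done:
            continue  # finished in an earlier run
        if stop_after is not None and processed_now >= stop_after:
            break  # simulate an interruption
        done[key] = process(item)
        processed_now += 1
        # Persist after every task so an interruption loses at most one result.
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)
    return [done[k] for k in sorted(done, key=int)]


checkpoint = os.path.join(tempfile.mkdtemp(), "job.json")
cities = ["New York", "London", "Tokyo", "Paris", "Sydney"]
first = run_job(cities, str.upper, checkpoint, stop_after=2)  # interrupted after 2 tasks
second = run_job(cities, str.upper, checkpoint)  # picks up the remaining 3 tasks
print(len(first), len(second))
```

Because progress lives on disk rather than in the Python object, the second call works even from a fresh session, which is the property that distinguishes a checkpoint resume from a same-session resume.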

> **Note**: Google Colab users may experience issues with `resume_from_checkpoint()` due to how Colab works

We're simulating an interruption here. In a real scenario, this might happen if your notebook errors out, is manually interrupted with a keyboard command, encounters API rate limits, or experiences any other issues that halt execution.

In [32]:
@data_factory(max_concurrency=10)
async def re_run_mock_llm(city_name: str, num_records_per_city: int):
    return await mock_llm_call(city_name=city_name, num_records_per_city=num_records_per_city, fail_rate=0.3)

cities = ["New York", "London", "Tokyo", "Paris", "Sydney"] * 20  # 100 tasks (5 cities x 20)
re_run_mock_llm_data_1 = re_run_mock_llm.run(city_name=cities, num_records_per_city=1)

2025-05-23 22:50:19 | INFO | [JOB START] Master Job ID: 6829de29-0b83-4a64-835b-cc79cbad5e3a | Logging progress every 3 seconds
2025-05-23 22:50:19 | INFO | [JOB PROGRESS] Completed: 0/100 | Running: 10 | Attempted: 0 (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0, InDeadQueue: 0)
2025-05-23 22:50:21 | ERROR | Error running task: Mock LLM failed to process city: Paris
2025-05-23 22:50:21 | ERROR | Error running task: Mock LLM failed to process city: Sydney
2025-05-23 22:50:21 | ERROR | Error running task: Mock LLM failed to process city: New York
2025-05-23 22:50:21 | ERROR | consecutive_not_completed: in 3 times, stopping this job; please adjust factory config and input data then resume_from_c

In [33]:
print("When a job is interrupted, you'll see a message like:")
print("[RESUME INFO] 🚨 Job stopped unexpectedly. You can resume the job by calling .resume()")

print("\nTo resume an interrupted job, simply call:")
print("re_run_mock_llm.resume()")
print('')
print(f"For this example, {len(re_run_mock_llm_data_1)}/{len(cities)} records have been generated and the job is not finished yet!")

When a job is interrupted, you'll see a message like:
[RESUME INFO] 🚨 Job stopped unexpectedly. You can resume the job by calling .resume()

To resume an interrupted job, simply call:
re_run_mock_llm.resume()

For this example, 17/100 records have been generated and the job is not finished yet!


In [34]:
## Let's continue the rest of the run with resume()
re_run_mock_llm_data_2 = re_run_mock_llm.resume()

2025-05-23 22:50:22 | INFO | [JOB RESUME START] PICKING UP FROM WHERE THE JOB WAS LEFT OFF...
2025-05-23 22:50:22 | INFO | [RESUME PROGRESS] STATUS AT THE TIME OF RESUME: Completed: 17 / 100 | Failed: 3 | Duplicate: 0 | Filtered: 0
2025-05-23 22:50:22 | INFO | [JOB PROGRESS] Completed: 17/100 | Running: 10 | Attempted: 20 (Completed: 17, Failed: 3, Filtered: 0, Duplicate: 0, InDeadQueue: 0)
2025-05-23 22:50:24 | ERROR | Error running task: Mock LLM failed to process city: Paris
2025-05-23 22:50:24 | ERROR | consecutive_not_completed: in 3 times, stopping this job; please adjust factory config and input data then resume_from_checkpoint(6829de29-0b83-4a64-835b-cc79cbad5e3a)
2025-05-23 22:50:24 | [

In [35]:
print(f"We were able to continue with what was left: {len(re_run_mock_llm_data_2)} records generated!")

We were able to continue with what was left: 30 records generated!


#### 6. Dry run
Before running a large job, you can do a "dry run" to test your pipeline. This only processes a single item and doesn't save state to the database.
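
Conceptually, a dry run reduces the full input to one task and runs it without persisting anything. The sketch below shows only the input-reduction step and assumes the single item is the first one, which the notebook does not state; `first_record` is a hypothetical helper, not part of starfish.

```python
def first_record(**kwargs):
    """Hypothetical helper: keep only the first task's inputs for a dry run."""
    # Take element 0 of each list or tuple; pass scalars through unchanged.
    return {
        k: (v[0] if isinstance(v, (list, tuple)) else v)
        for k, v in kwargs.items()
    }


sample = first_record(
    city_name=["New York", "London", "Tokyo", "Paris", "Sydney"] * 20,
    num_records_per_city=1,
)
print(sample)
```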

In [36]:
@data_factory(max_concurrency=10)
async def dry_run_mock_llm(city_name: str, num_records_per_city: int):
    return await mock_llm_call(city_name=city_name, num_records_per_city=num_records_per_city, fail_rate=0.3)

dry_run_mock_llm_data = dry_run_mock_llm.dry_run(city_name=["New York", "London", "Tokyo", "Paris", "Sydney"] * 20, num_records_per_city=1)

2025-05-23 22:50:24 | INFO | [JOB START] Master Job ID: None | Logging progress every 3 seconds
2025-05-23 22:50:24 | INFO | [JOB PROGRESS] Completed: 0/1 | Running: 1 | Attempted: 0 (Completed: 0, Failed: 0, Filtered: 0, Duplicate: 0, InDeadQueue: 0)
2025-05-23 22:50:25 | INFO | [JOB FINISHED] Final Status: Completed: 1/0 | Attempted: 1 (Failed: 0, Filtered: 0, Duplicate: 0, InDeadQueue: 0)


#### 8. Advanced Usage
Data Factory offers more advanced capabilities for complete pipeline customization, including hooks that execute at key stages and shareable state to coordinate between tasks. These powerful features enable complex workflows and fine-grained control. Our dedicated examples for advanced data_factory usage will be coming soon!