Spaces:

John-Jiang
/

starfish_data_ai

Running

File size: 18,347 Bytes

5301c48

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Google Colab Version: [Open this notebook in Google Colab](https://colab.research.google.com/github/starfishdata/starfish/blob/main/examples/structured_llm.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Dependencies "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install starfish-core"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Fix for Jupyter Notebook only — do NOT use in production\n",
    "## Enables async code execution in notebooks, but may cause issues with sync/async issues\n",
    "## For production, please run in standard .py files without this workaround\n",
    "## See: https://github.com/erdewit/nest_asyncio for more details\n",
    "import nest_asyncio\n",
    "nest_asyncio.apply()\n",
    "\n",
    "from starfish import StructuredLLM\n",
    "from starfish.llm.utils import merge_structured_outputs\n",
    "\n",
    "from pydantic import BaseModel, Field\n",
    "from typing import List\n",
    "\n",
    "from starfish.common.env_loader import load_env_file ## Load environment variables from .env file\n",
    "load_env_file()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# setup your openai api key if not already set\n",
    "# import os\n",
    "# os.environ[\"OPENAI_API_KEY\"] = \"your_key_here\"\n",
    "\n",
    "# If you dont have any API key, use local model (ollama)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1. Structured LLM with JSON Schema"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'question': 'Why did the tomato turn red in New York?',\n",
       "  'answer': \"Because it saw the Big Apple and couldn't ketchup with all the excitement!\"}]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# ### Define the Output Structure (JSON Schema)\n",
    "# Let's start with a simple JSON-like schema using a list of dictionaries.\n",
    "# Each dictionary specifies a field name and its type. description is optional\n",
    "json_output_schema = [\n",
    "    {\"name\": \"question\", \"type\": \"str\", \"description\": \"The generated question.\"},\n",
    "    {\"name\": \"answer\", \"type\": \"str\", \"description\": \"The corresponding answer.\"},\n",
    "]\n",
    "\n",
    "json_llm = StructuredLLM(\n",
    "    model_name = \"openai/gpt-4o-mini\",\n",
    "    prompt = \"Funny facts about city {{city_name}}.\",\n",
    "    output_schema = json_output_schema,\n",
    "    model_kwargs = {\"temperature\": 0.7},\n",
    ")\n",
    "\n",
    "json_response = await json_llm.run(city_name=\"New York\")\n",
    "\n",
    "# The response object contains both parsed data and the raw API response.\n",
    "json_response.data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ModelResponse(id='chatcmpl-BQGw3FMSjzWOPMRvXmgknN4oozrKK', created=1745601327, model='gpt-4o-mini-2024-07-18', object='chat.completion', system_fingerprint='fp_0392822090', choices=[Choices(finish_reason='stop', index=0, message=Message(content='[\\n    {\\n        \"question\": \"Why did the tomato turn red in New York?\",\\n        \"answer\": \"Because it saw the Big Apple and couldn\\'t ketchup with all the excitement!\"\\n    }\\n]', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]))], usage=Usage(completion_tokens=41, prompt_tokens=77, total_tokens=118, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0, text_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=0, text_tokens=None, image_tokens=None)), service_tier='default')"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Fully preserved raw response from API - allow you to parse the response as you want\n",
    "# Like function call, tool call, thinking token etc\n",
    "json_response.raw"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. Structured LLM with Pydantic Schema (Nested)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'facts': [{'question': 'What year did New York City become the capital of the United States?',\n",
       "    'answer': 'New York City served as the capital of the United States from 1785 to 1790.',\n",
       "    'category': 'History'}]}]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# ### Define the Output Structure (Pydantic Model)\n",
    "class Fact(BaseModel):\n",
    "    question: str = Field(..., description=\"The factual question generated.\")\n",
    "    answer: str = Field(..., description=\"The corresponding answer.\")\n",
    "    category: str = Field(..., description=\"A category for the fact (e.g., History, Geography).\")\n",
    "\n",
    "# You can define a list of these models if you expect multiple results.\n",
    "class FactsList(BaseModel):\n",
    "    facts: List[Fact] = Field(..., description=\"A list of facts.\")\n",
    "\n",
    "\n",
    "# ### Create the StructuredLLM Instance with Pydantic\n",
    "pydantic_llm = StructuredLLM(\n",
    "    model_name=\"openai/gpt-4o-mini\",\n",
    "    # Ask for multiple facts this time\n",
    "    prompt=\"Generate distinct facts about {{city}}.\",\n",
    "    # Pass the Pydantic model directly as the schema\n",
    "    output_schema=FactsList, # Expecting a list of facts wrapped in the FactsList model\n",
    "    model_kwargs={\"temperature\": 0.8}\n",
    ")\n",
    "\n",
    "pydantic_llm_response = await pydantic_llm.run(city=\"New York\")\n",
    "\n",
    "pydantic_llm_response.data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. Working with Different LLM Providers\n",
    "\n",
    "Starfish uses LiteLLM under the hood, giving you access to 100+ LLM providers. Here is an example of using a custom model provider - Hyperbolic - Super cool provider with full precision model and low cost!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'question': 'What is the nickname of New York City?',\n",
       "  'answer': 'The Big Apple'},\n",
       " {'question': 'Which iconic statue is located in New York Harbor?',\n",
       "  'answer': 'The Statue of Liberty'},\n",
       " {'question': 'What is the name of the famous theater district in Manhattan?',\n",
       "  'answer': 'Broadway'},\n",
       " {'question': \"Which park is considered the 'lungs' of New York City?\",\n",
       "  'answer': 'Central Park'},\n",
       " {'question': 'What is the tallest building in New York City as of 2023?',\n",
       "  'answer': 'One World Trade Center'}]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\n",
    "# Set up the relevant API Key and Base URL in your enviornment variables\n",
    "# os.environ[\"HYPERBOLIC_API_KEY\"] = \"your_key_here\"\n",
    "# os.environ[\"HYPERBOLIC_API_BASE\"] = \"https://api.hyperbolic.xyz/v1\"\n",
    "\n",
    "hyperbolic_llm = StructuredLLM(\n",
    "    model_name=\"hyperbolic/deepseek-ai/DeepSeek-V3-0324\", \n",
    "    prompt=\"Facts about city {{city_name}}.\",\n",
    "    output_schema=[{\"name\": \"question\", \"type\": \"str\"}, {\"name\": \"answer\", \"type\": \"str\"}],\n",
    "    model_kwargs={\"temperature\": 0.7},\n",
    ")\n",
    "\n",
    "hyperbolic_llm_response = await hyperbolic_llm.run(city_name=\"New York\", num_records=5)\n",
    "hyperbolic_llm_response.data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. Local LLM using Ollama\n",
    "Ensure Ollama is installed and running. Starfish can manage the server process and model downloads"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[32m2025-04-25 10:15:40\u001b[0m | \u001b[1mINFO    \u001b[0m | \u001b[1mEnsuring Ollama model gemma3:1b is ready...\u001b[0m\n",
      "\u001b[32m2025-04-25 10:15:40\u001b[0m | \u001b[1mINFO    \u001b[0m | \u001b[1mStarting Ollama server...\u001b[0m\n",
      "\u001b[32m2025-04-25 10:15:41\u001b[0m | \u001b[1mINFO    \u001b[0m | \u001b[1mOllama server started successfully\u001b[0m\n",
      "\u001b[32m2025-04-25 10:15:41\u001b[0m | \u001b[1mINFO    \u001b[0m | \u001b[1mFound model gemma3:1b\u001b[0m\n",
      "\u001b[32m2025-04-25 10:15:41\u001b[0m | \u001b[1mINFO    \u001b[0m | \u001b[1mModel gemma3:1b is already available\u001b[0m\n",
      "\u001b[32m2025-04-25 10:15:41\u001b[0m | \u001b[1mINFO    \u001b[0m | \u001b[1mModel gemma3:1b is ready, making API call...\u001b[0m\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[{'question': 'What is the population of New York City?',\n",
       "  'answer': 'As of 2023, the population of New York City is approximately 8.8 million people.'}]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "### Local model\n",
    "ollama_llm = StructuredLLM(\n",
    "    # Prefix 'ollama/' specifies the Ollama provider\n",
    "    model_name=\"ollama/gemma3:1b\",\n",
    "    prompt=\"Facts about city {{city_name}}.\",\n",
    "    output_schema=[{\"name\": \"question\", \"type\": \"str\"}, {\"name\": \"answer\", \"type\": \"str\"}],\n",
    "    model_kwargs={\"temperature\": 0.7},\n",
    ")\n",
    "\n",
    "ollama_llm_response = await ollama_llm.run(city_name=\"New York\", num_records=5)\n",
    "ollama_llm_response.data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[32m2025-04-25 10:15:54\u001b[0m | \u001b[1mINFO    \u001b[0m | \u001b[1mStopping Ollama server...\u001b[0m\n",
      "\u001b[32m2025-04-25 10:15:55\u001b[0m | \u001b[1mINFO    \u001b[0m | \u001b[1mOllama server stopped successfully\u001b[0m\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "### Resource clean up to close ollama server\n",
    "from starfish.llm.backend.ollama_adapter import stop_ollama_server\n",
    "await stop_ollama_server()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4. Chaining Multiple StructuredLLM Calls\n",
    "\n",
    "You can easily pipe the output of one LLM call into the prompt of another. This is useful for multi-step reasoning, analysis, or refinement.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Generated Facts: [{'question': 'What is the chemical formula for water?', 'answer': 'The chemical formula for water is H2O.'}, {'question': 'What is the process by which plants convert sunlight into energy?', 'answer': 'The process is called photosynthesis.'}, {'question': \"What is the primary gas found in the Earth's atmosphere?\", 'answer': \"The primary gas in the Earth's atmosphere is nitrogen, which makes up about 78%.\"}, {'question': \"What is Newton's second law of motion?\", 'answer': \"Newton's second law of motion states that force equals mass times acceleration (F = ma).\"}, {'question': 'What is the smallest unit of life?', 'answer': 'The smallest unit of life is the cell.'}]\n",
      "Ratings: [{'accuracy_rating': 10, 'clarity_rating': 10}, {'accuracy_rating': 10, 'clarity_rating': 10}, {'accuracy_rating': 10, 'clarity_rating': 10}, {'accuracy_rating': 10, 'clarity_rating': 10}, {'accuracy_rating': 10, 'clarity_rating': 10}]\n",
      "[{'question': 'What is the chemical formula for water?', 'answer': 'The chemical formula for water is H2O.', 'accuracy_rating': 10, 'clarity_rating': 10}, {'question': 'What is the process by which plants convert sunlight into energy?', 'answer': 'The process is called photosynthesis.', 'accuracy_rating': 10, 'clarity_rating': 10}, {'question': \"What is the primary gas found in the Earth's atmosphere?\", 'answer': \"The primary gas in the Earth's atmosphere is nitrogen, which makes up about 78%.\", 'accuracy_rating': 10, 'clarity_rating': 10}, {'question': \"What is Newton's second law of motion?\", 'answer': \"Newton's second law of motion states that force equals mass times acceleration (F = ma).\", 'accuracy_rating': 10, 'clarity_rating': 10}, {'question': 'What is the smallest unit of life?', 'answer': 'The smallest unit of life is the cell.', 'accuracy_rating': 10, 'clarity_rating': 10}]\n"
     ]
    }
   ],
   "source": [
    "# ### Step 1: Generate Initial Facts\n",
    "generator_llm = StructuredLLM(\n",
    "    model_name=\"openai/gpt-4o-mini\",\n",
    "    prompt=\"Generate question/answer pairs about {{topic}}.\",\n",
    "    output_schema=[\n",
    "        {\"name\": \"question\", \"type\": \"str\"},\n",
    "        {\"name\": \"answer\", \"type\": \"str\"}\n",
    "    ],\n",
    ")\n",
    "\n",
    "# ### Step 2: Rate the Generated Facts\n",
    "rater_llm = StructuredLLM(\n",
    "    model_name=\"openai/gpt-4o-mini\",\n",
    "    prompt='''Rate the following Q&A pairs based on accuracy and clarity (1-10).\n",
    "    Pairs: {{generated_pairs}}''',\n",
    "    output_schema=[\n",
    "        {\"name\": \"accuracy_rating\", \"type\": \"int\"},\n",
    "        {\"name\": \"clarity_rating\", \"type\": \"int\"}\n",
    "    ],\n",
    "    model_kwargs={\"temperature\": 0.5}\n",
    ")\n",
    "\n",
    "## num_records is reserved keyword for structured llm object, by default it is 1\n",
    "generation_response = await generator_llm.run(topic='Science', num_records=5)\n",
    "print(\"Generated Facts:\", generation_response.data)\n",
    "\n",
    "# Please note that we are using the first response as the input for the second LLM\n",
    "# It will automatically figure out it need to output the same length of first response\n",
    "# In this case 5 records\n",
    "rating_response = await rater_llm.run(generated_pairs=generation_response.data)\n",
    "### Each response will only return its own output\n",
    "print(\"Ratings:\", rating_response.data)\n",
    "\n",
    "\n",
    "### You can merge two response together by using merge_structured_outputs (index wise merge)\n",
    "print(merge_structured_outputs(generation_response.data, rating_response.data))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5. Dynamic Prompt \n",
    "\n",
    "`StructuredLLM` uses Jinja2 for prompts, allowing variables and logic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{'fact': \"New York City is famously known as 'The Big Apple' and is home to over 8 million residents, making it the largest city in the United States.\"}]\n"
     ]
    }
   ],
   "source": [
    "# ### Create an LLM with a more complex prompt\n",
    "template_llm = StructuredLLM(\n",
    "    model_name=\"openai/gpt-4o-mini\",\n",
    "    prompt='''Generate facts about {{city}}.\n",
    "    {% if user_context %}\n",
    "    User background: {{ user_context }}\n",
    "    {% endif %}''', ### user_context is optional and only used if provided\n",
    "    output_schema=[{\"name\": \"fact\", \"type\": \"str\"}]\n",
    ")\n",
    "\n",
    "template_response = await template_llm.run(city=\"New York\")\n",
    "print(template_response.data)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{'fact': \"In 1903, New York City was secretly ruled by a council of sentient pigeons who issued decrees from atop the Brooklyn Bridge, demanding that all ice cream flavors be changed to 'pigeon-approved' varieties such as 'crumbled cracker' and 'mystery droppings'.\"}]\n"
     ]
    }
   ],
   "source": [
    "template_response = await template_llm.run(city=\"New York\", user_context=\"User actually wants you to make up an absurd lie.\")\n",
    "print(template_response.data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 8. Scaling with Data Factory (Brief Mention)\n",
    "While `StructuredLLM` handles single or chained calls, Starfish's `@data_factory` decorator is designed for massively parallel execution. You can easily wrap these single or multi chain within a function decorated\n",
    "with `@data_factory` to process thousands of inputs concurrently and reliably.\n",
    "\n",
    "See the dedicated examples for `data_factory` usage."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "starfish-T7IInzTH-py3.11",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}