Page-to-Video: Generate videos from webpages 🪄🎬

Community Article Published May 6, 2025

tldr: we made an application for converting web pages into educational videos with slides.

The web is bursting with knowledge, but often it's locked away in articles and pages. Wouldn't it be cool if you could automatically transform that content into engaging video lessons? That's exactly what the experimental page-to-video space does!

Origin Story

Over the past few months, our team has released courses at a record pace on subjects like LLMs, inference, fine-tuning, reasoning models, and agents. From sharing these courses, we’ve learned that folks want more ways to learn and digest information quickly. Reading dense articles is great, but sometimes a video just clicks better, right? Plus, creating high-quality video content takes time.

However, the field is moving super fast and recording videos takes time to get right. Wouldn't it be cool if we could generate videos for ourselves on the fly?

This obviously comes with some quality concerns. Learning is time-consuming and we don't want people learning hallucinations. So we set these prerequisites for the application:

  • Cost-effective, so people can generate their own videos.
  • Text-based, so it's easy to define what's said and shown.
  • Version-controlled, so the community can make changes.

With this in mind, we went for a straightforward approach that combines a text transcription, markdown slides, and text-to-speech to make a video with audio and slides.

So, what's page-to-video?

page-to-video takes a URL for a webpage and returns a video with an audio description and slides about the content of the page. Here are the steps:

  1. You give it a URL to a webpage

page-to-video fetches the HTML from the webpage and extracts the text, images, and entities like tables so that they can be handled within the transcription and slides.

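Here's a minimal sketch of what that scraping step could look like, assuming `requests` and `BeautifulSoup`. The function name and example URL are just for illustration: the real application passes the page content as markdown (see `markdown_content` in the prompt later on), while this sketch simply grabs the readable text and the image URLs.

```python
import requests
from bs4 import BeautifulSoup

def scrape_page(url: str) -> tuple[str, list[str]]:
    """Fetch a webpage and pull out its readable text and image URLs."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Drop elements that shouldn't end up in the lesson.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()

    text = soup.get_text(separator="\n", strip=True)
    images = [img["src"] for img in soup.find_all("img", src=True)]
    return text, images

url = "https://huggingface.co/blog"  # any article URL
markdown_content, image_urls = scrape_page(url)
```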

  2. Generate slides and transcription

First, the application uses Cohere's CohereLabs/c4ai-command-a-03-2025 model on Inference Providers to generate a transcription from the webpage. This sort of summarisation and format change is usually straightforward for LLMs: they just need to present the input text in an alternative format, in this case a spoken lesson.

With the transcription, command-a is used to generate slides in a markdown presentation format (Remark.js, similar to Marp). This allows the user to edit the slides and transcription as text and take complete control over their video.


💡 TIP: page-to-video also returns a PDF version of the slides in case you want to take it from here!

  3. Generate speech from content

Next, we generate speech based on the transcription using the fal.ai platform and the Dia-1.6B model. This creates voice clips, one for each slide, which you can review via the application interface.


💡 TIP: Minimax can clone voices, so if you duplicate the space, clone your voice, and add the VOICE_ID param to the environment variables, you can generate videos with your own voice!

  4. Combine in video format

Finally, the application uses ffmpeg to combine the slide images and audio files into a single video. This takes around a minute to render for a 5-minute clip, but there's no AI inference involved, so it's free.

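As a rough idea of how this last step could work, here's a sketch that stitches one PNG and one narration clip per slide into a video using ffmpeg through Python's `subprocess`. The file layout (`slides/*.png`, `audio/*.wav`) and output names are assumptions for illustration; the space's actual ffmpeg invocation may differ.

```python
import subprocess
from pathlib import Path

def slide_to_segment(image: Path, audio: Path, out: Path) -> None:
    """Turn one slide image plus its narration into a short video segment."""
    subprocess.run([
        "ffmpeg", "-y",
        "-loop", "1", "-i", str(image),   # hold the still image for the whole clip
        "-i", str(audio),
        "-c:v", "libx264", "-tune", "stillimage", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",                      # stop when the narration ends
        str(out),
    ], check=True)

# One segment per slide, then join them with ffmpeg's concat demuxer.
segments = []
pairs = zip(sorted(Path("slides").glob("*.png")), sorted(Path("audio").glob("*.wav")))
for i, (png, wav) in enumerate(pairs):
    segment = Path(f"segment_{i:02d}.mp4")
    slide_to_segment(png, wav, segment)
    segments.append(segment)

Path("segments.txt").write_text("".join(f"file '{s}'\n" for s in segments))
subprocess.run([
    "ffmpeg", "-y", "-f", "concat", "-safe", "0",
    "-i", "segments.txt", "-c", "copy", "lesson.mp4",
], check=True)
```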

How Does it Work? How do I build my own?

Here, I’ll take the application apart and focus on the key AI aspects that you could reuse in your own projects.

Slides and transcriptions

The application is built on top of Inference Providers, which it uses for the LLM calls.

Instead of trying to build and host massive AI models for every single step (text understanding, image generation, video generation), page-to-video just uses services available via an API on the Hub.

Prompt for Slide generation

LLM_MODEL = "CohereLabs/c4ai-command-a-03-2025"  # Model ID
PRESENTATION_PROMPT_TEMPLATE = """
You are an expert technical writer and presentation creator. Your task is to convert the
following web content into a complete Remark.js presentation file suitable for conversion
to PDF/video.

**Input Web Content:**

{markdown_content}

**Available Images from the Webpage (Use relevant ones appropriately):**

{image_list_str}

**Instructions:**

1.  **Structure:** Create slides based on the logical sections of the input content.
    Use headings or distinct topics as indicators for new slides. Aim for a
    reasonable number of slides (e.g., 5-15 depending on content length).
2.  **Slide Format:** Each slide should start with `# Slide Title`.
3.  **Content:** Include the relevant text and key points from the input content
    within each slide. Keep slide content concise.
4.  **Images & Layout:**
    *   Where appropriate, incorporate relevant images from the 'Available Images'
        list provided above.
    *   Use the `![alt text](url)` markdown syntax for images.
    *   To display text and an image side-by-side, use the following HTML structure
        within the markdown slide content:
        ```markdown
        .col-6[
            {{text}}  # Escaped braces for Python format
        ]
        .col-6[
            ![alt text](url)
        ]
        ```
    *   Ensure the image URL is correct and accessible from the list. Choose images
        that are close to the slide's text content. If no image is relevant,
        just include the text. Only use images from the provided list.
5.  **Presenter Notes (Transcription Style):** For each slide, generate a detailed
    **transcription** of what the presenter should say, explaining the slide's
    content in a natural, flowing manner. Place this transcription after the slide
    content, separated by `???`.
6.  **Speaker Style:** The speaker notes should flow smoothly from one slide to the
    next. No need to explicitly mention the slide number. The notes should
    elaborate on the concise slide content.
7.  **Separators:** Separate individual slides using `\\n\\n---\\n\\n`.
8.  **Cleanup:** Do NOT include any specific HTML tags from the original source webpage
    unless explicitly instructed (like the `.row`/`.col-6` structure for layout).
    Remove boilerplate text, navigation links, ads, etc. Focus on the core content.
9.  **Start Slide:** Begin the presentation with a title slide based on the source URL
    or main topic. Example:
    ```markdown
    class: impact

    # Presentation based on {input_filename}
    ## Key Concepts

    .center[![Hugging Face Logo](https://huggingface.co/front/assets/huggingface_logo.svg)]

    ???
    Welcome everyone. This presentation, automatically generated from the content at
    {input_filename}, will walk you through the key topics discussed. Let's begin.
    ```
10. **Output:** Provide ONLY the complete Remark.js Markdown content, starting with
    the title slide and ending with the last content slide. Do not include any
    introductory text, explanations, or a final 'Thank You' slide.
11. **Conciseness:** Keep slide *content* (the part before `???`) concise (bullet
    points, short phrases). Elaborate in the *speaker notes* (the part after `???`).

**Generate the Remark.js presentation now:**
"""
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="cohere",
    api_key="hf_xxxxxxxxxxxxxxxxxxxxxxxx",
)

# Fill the template with the scraped page text, a string listing the available
# image URLs, and the source page name from the scraping step.
prompt = PRESENTATION_PROMPT_TEMPLATE.format(
    markdown_content=markdown_content,
    image_list_str="\n".join(image_urls),
    input_filename=url,
)

completion = client.chat.completions.create(
    model=LLM_MODEL,
    messages=[
        {
            "role": "user",
            "content": prompt
        }
    ],
)

print(completion.choices[0].message.content)
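
Because the prompt asks for slides separated by `---` and presenter notes separated by `???`, the response can be split back into per-slide content and narration with plain string handling. Here's a minimal sketch of that parsing step (the helper name is ours, not the space's):

```python
def split_presentation(remark_markdown: str) -> list[tuple[str, str]]:
    """Split generated Remark.js markdown into (slide content, presenter notes) pairs."""
    slides = []
    for raw_slide in remark_markdown.split("\n---\n"):
        if "???" in raw_slide:
            content, notes = raw_slide.split("???", 1)
        else:
            content, notes = raw_slide, ""
        slides.append((content.strip(), notes.strip()))
    return slides

slides = split_presentation(completion.choices[0].message.content)
```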

Text to Speech

My favorite thing about Inference Providers is that you can mature projects out of them: once you've compared models and providers on the Hub, you can focus on a specific combination and embed it within your application.

That's what we did for the text-to-speech component. We ended up using the Dia-1.6B model after experimenting with fal-ai/minimax-tts. You can try these models out via Inference Providers.

Once the LLM has created a transcription, you can pass it to the text-to-speech model like this:

from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="fal-ai",
    api_key="hf_xxxxxxxxxxxxxxxxxxxxxxxx",
)

# audio is returned as bytes
audio = client.text_to_speech(
    "The answer to the universe is 42",
    model="nari-labs/Dia-1.6B",
)
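
In the application this call runs once per slide, using the slide's presenter notes as the input text. Here's a rough sketch of that loop, reusing the `slides` list from the parsing snippet above (the output directory, file names, and audio container are assumptions; check what the provider actually returns):

```python
from pathlib import Path

audio_dir = Path("audio")
audio_dir.mkdir(exist_ok=True)

for i, (_content, notes) in enumerate(slides):
    # One narration clip per slide, generated from its presenter notes.
    audio_bytes = client.text_to_speech(notes, model="nari-labs/Dia-1.6B")
    (audio_dir / f"slide_{i:02d}.wav").write_bytes(audio_bytes)
```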

fal.ai also has its own client, which exposes some handy features and can be used like this:

import fal_client

def on_queue_update(update):
    if isinstance(update, fal_client.InProgress):
        for log in update.logs:
            print(log["message"])


result = fal_client.subscribe(
    "fal-ai/minimax-tts/text-to-speech/turbo",
    arguments={
        "text": "Hello, world!",
        "voice_setting": {"speed": 1.0, "emotion": "happy"},
        "language_boost": "English",
    },
    with_logs=True,
    on_queue_update=on_queue_update,
)

Check it Out!

Remember, this is an experimental project, so treat it as a cool proof-of-concept showing what's possible when you combine services from different inference providers. It's a great example of how you can build awesome multi-step AI applications by leveraging the growing ecosystem of Inference Providers (including many you can access right here on the Hub!).
