Create a Video Intro Editor with Python on Mac with Silero VAD

Learn how to build an AI-powered video intro generator using Python, MoviePy, and Silero VAD that automatically detects key speaking moments and creates 3D-styled intros.

Creating engaging video intros can be time-consuming, but what if you could automate the entire process? In this guide, we’ll build a Python script that automatically analyzes your videos, detects key speaking moments using AI-powered Voice Activity Detection (VAD), and generates a professional-looking 3D-styled intro montage.

This script is perfect for content creators, YouTubers, and video editors who want to quickly generate dynamic intros from their existing footage without manual editing.

What You'll Learn

  • How to use Silero VAD for intelligent speech detection in videos
  • Working with MoviePy for video composition and effects
  • Creating 3D perspective effects with FFmpeg
  • Running Python scripts with uv for easy dependency management

Prerequisites


Before we dive into the script, you’ll need to have the following installed on your Mac:

FFmpeg

FFmpeg is essential for video processing. Install it using Homebrew:

brew install ffmpeg

uv Package Manager

We’ll use uv to run our script with all dependencies automatically managed. If you don’t have uv installed:

curl -LsSf https://astral.sh/uv/install.sh | sh
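
Once the installer finishes, you can confirm that uv is on your PATH (open a new terminal session first if the command is not found):

uv --version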

New to uv?

If you’re new to uv, check out our comprehensive guide Getting Started with uv: Setting Up Your Python Project to understand how it simplifies Python project management.

How the Video Intro Editor Works


The script performs several intelligent operations to create your intro:

  1. Audio Extraction: Extracts audio from your video using FFmpeg
  2. Speech Detection: Uses Silero VAD (Voice Activity Detection) to identify segments where someone is speaking
  3. Segment Selection: Picks the best speaking moments for the intro
  4. 3D Effects: Applies perspective warping and blur effects to create depth
  5. Compositing: Combines background blur with foreground 3D-warped clips
  6. Final Output: Renders the intro with crossfade transitions

The Complete Script


Here’s the full Python script that creates video intros. Save this as intro_generator.py:

# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "openai-whisper",
#     "torch",
#     "torchaudio<2.6",
#     "soundfile",
#     "numpy<2.0.0",
#     "moviepy==1.0.3",
#     "packaging",
#     "Pillow<10.0.0",
# ]
# ///

import os
import sys
import warnings

# 1. Suppress warnings BEFORE importing moviepy
# This hides the "invalid escape sequence" text
warnings.filterwarnings("ignore")

import random
import subprocess
from pathlib import Path

import torch
import whisper
from moviepy.editor import (
    CompositeVideoClip,
    VideoFileClip,
    concatenate_videoclips,
    vfx,
)

# --- Configuration ---
MIN_SILENCE = 0.5
MIN_SPEECH = 0.25
PADDING = 0.1

# Intro Style Settings
INTRO_CLIP_COUNT = 6  # Number of fast cuts
INTRO_SPEED = 3.0  # Speed multiplier (3x fast)
CLIP_DURATION = 1.5  # Duration of each clip in seconds
OUTPUT_FILENAME = "intro_only.mp4"


def check_ffmpeg():
    try:
        subprocess.run(["ffmpeg", "-version"], capture_output=True, check=True)
    except (subprocess.CalledProcessError, FileNotFoundError):
        print("❌ Error: FFmpeg not found. Run: brew install ffmpeg")
        sys.exit(1)


def extract_audio(video_path, audio_path):
    subprocess.run(
        [
            "ffmpeg",
            "-y",
            "-i",
            video_path,
            "-vn",
            "-acodec",
            "pcm_s16le",
            "-ar",
            "16000",
            "-ac",
            "1",
            "-loglevel",
            "error",
            audio_path,
        ],
        check=True,
    )


def get_good_segments(audio_path):
    """Finds segments where people are actually speaking (Key Moments)."""
    print("🧠 Scanning video for key moments...")

    # Load Silero VAD
    # trust_repo=True fixes the "cache" warning/error
    model, utils = torch.hub.load(
        repo_or_dir="snakers4/silero-vad", model="silero_vad", trust_repo=True
    )
    (get_speech_timestamps, _, read_audio, _, _) = utils

    wav = read_audio(audio_path)
    vad_stamps = get_speech_timestamps(
        wav,
        model,
        threshold=0.5,
        min_speech_duration_ms=int(MIN_SPEECH * 1000),
        min_silence_duration_ms=int(MIN_SILENCE * 1000),
    )

    segments = []
    for v in vad_stamps:
        segments.append((v["start"] / 16000, v["end"] / 16000))

    return segments


def apply_blur_background(input_path, output_path):
    subprocess.run(
        [
            "ffmpeg",
            "-y",
            "-i",
            input_path,
            "-vf",
            "boxblur=40:5,eq=brightness=-0.4",
            "-c:v",
            "libx264",
            "-preset",
            "ultrafast",
            "-an",
            "-loglevel",
            "error",
            output_path,
        ],
        check=True,
    )


def apply_3d_warp(input_path, output_path, direction="left"):
    if direction == "left":
        vf = "perspective=x0=0:y0=0:x1=W:y1=H/5:x2=0:y2=H:x3=W:y3=4*H/5:sense=destination"
    else:
        vf = "perspective=x0=0:y0=H/5:x1=W:y1=0:x2=0:y2=4*H/5:x3=W:y3=H:sense=destination"

    vf += ",pad=w=iw+100:h=ih+100:x=50:y=50:color=black@0"

    subprocess.run(
        [
            "ffmpeg",
            "-y",
            "-i",
            input_path,
            "-vf",
            vf,
            "-c:v",
            "libx264",
            "-preset",
            "ultrafast",
            "-an",
            "-loglevel",
            "error",
            output_path,
        ],
        check=True,
    )


def generate_intro(video_path, segments):
    print("✨ Rendering 3D Intro (No Text)...")

    long_segments = [s for s in segments if (s[1] - s[0]) > 2.0]

    if len(long_segments) < INTRO_CLIP_COUNT:
        print(f"⚠️ Not enough footage found. Need {INTRO_CLIP_COUNT} distinct moments.")
        if not long_segments:
            return
        picks = random.choices(long_segments, k=INTRO_CLIP_COUNT)
    else:
        picks = sorted(random.sample(long_segments, INTRO_CLIP_COUNT))

    intro_clips = []
    temp_files = []

    for i, (start, end) in enumerate(picks):
        raw_dur = CLIP_DURATION * INTRO_SPEED
        # Clamp so the cut never starts before 0 (a segment can be shorter than raw_dur)
        mid = max(0, start + (end - start) / 2 - (raw_dur / 2))

        raw_clip = f"temp_raw_{i}.mp4"
        bg_clip = f"temp_bg_{i}.mp4"
        fg_clip = f"temp_fg_{i}.mp4"

        # 1. Extract
        subprocess.run(
            [
                "ffmpeg",
                "-y",
                "-ss",
                str(mid),
                "-t",
                str(raw_dur),
                "-i",
                video_path,
                "-c:v",
                "libx264",
                "-an",
                "-loglevel",
                "error",
                raw_clip,
            ],
            check=True,
        )

        # 2. Process
        apply_blur_background(raw_clip, bg_clip)

        direction = "left" if i % 2 == 0 else "right"
        apply_3d_warp(raw_clip, fg_clip, direction)

        # 3. Composite
        try:
            bg = VideoFileClip(bg_clip).fx(vfx.speedx, INTRO_SPEED)
            fg = VideoFileClip(fg_clip).fx(vfx.speedx, INTRO_SPEED)

            if direction == "left":
                fg = fg.set_position(
                    lambda t: (int(-50 + 50 * (t / CLIP_DURATION)), "center")
                )
            else:
                fg = fg.set_position(
                    lambda t: (int(50 - 50 * (t / CLIP_DURATION)), "center")
                )

            comp = CompositeVideoClip([bg, fg]).set_duration(CLIP_DURATION)
            if i > 0:
                comp = comp.crossfadein(0.2)

            intro_clips.append(comp)
            temp_files.extend([raw_clip, bg_clip, fg_clip])

        except Exception as e:
            print(f"   ⚠️ Error processing clip {i}: {e}")

    if not intro_clips:
        print("❌ Failed to generate intro clips.")
        return

    # Concatenate
    full_montage = concatenate_videoclips(intro_clips, method="compose")

    print("   💾 Saving video file...")
    full_montage.write_videofile(
        "temp_visual_intro.mp4", fps=24, codec="libx264", logger=None
    )

    # Add silent audio
    print("   🔊 Adding silent audio track...")
    subprocess.run(
        [
            "ffmpeg",
            "-y",
            "-i",
            "temp_visual_intro.mp4",
            "-f",
            "lavfi",
            "-i",
            "anullsrc=channel_layout=mono:sample_rate=44100",
            "-c:v",
            "copy",
            "-c:a",
            "aac",
            "-shortest",
            "-loglevel",
            "error",
            OUTPUT_FILENAME,
        ],
        check=True,
    )

    # Cleanup
    for f in temp_files:
        if os.path.exists(f):
            os.remove(f)
    if os.path.exists("temp_visual_intro.mp4"):
        os.remove("temp_visual_intro.mp4")

    print(f"✅ Success! Intro saved as: {OUTPUT_FILENAME}")


def main():
    if len(sys.argv) < 2:
        print("Usage: uv run intro_generator.py <video.mp4>")
        sys.exit(1)

    input_video = sys.argv[1]
    check_ffmpeg()

    temp_wav = "temp_analysis.wav"

    try:
        extract_audio(input_video, temp_wav)

        good_parts = get_good_segments(temp_wav)

        if not good_parts:
            print("⚠️ No speech detected. Picking random segments...")
            duration = float(
                subprocess.check_output(
                    [
                        "ffprobe",
                        "-v",
                        "error",
                        "-show_entries",
                        "format=duration",
                        "-of",
                        "default=noprint_wrappers=1:nokey=1",
                        input_video,
                    ]
                )
            )
            good_parts = [(t, t + 5) for t in range(0, int(duration), 10)]

        generate_intro(input_video, good_parts)

    finally:
        if os.path.exists(temp_wav):
            os.remove(temp_wav)


if __name__ == "__main__":
    main()

Understanding the Script


Let’s break down the key components of this video intro editor:

PEP 723 Inline Metadata

The script starts with inline metadata that tells uv what dependencies to install:

# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "openai-whisper",
#     "torch",
#     "torchaudio<2.6",
#     "soundfile",
#     "numpy<2.0.0",
#     "moviepy==1.0.3",
#     "packaging",
#     "Pillow<10.0.0",
# ]
# ///

This PEP 723 format allows uv to automatically install all required packages without any manual setup. For more details on this pattern, check out our guide on Running Test Scripts with uv.
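
If your uv version supports it, you can also let uv edit this block for you instead of changing it by hand. As a purely illustrative example, adding rich (the intro script does not actually need it) would look like:

uv add --script intro_generator.py rich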

Configuration Variables

The script includes several customizable settings:

Variable            Default            Description
MIN_SILENCE         0.5                Minimum silence duration in seconds
MIN_SPEECH          0.25               Minimum speech duration in seconds
INTRO_CLIP_COUNT    6                  Number of clips in the intro
INTRO_SPEED         3.0                Speed multiplier for clips
CLIP_DURATION       1.5                Duration of each clip in seconds
OUTPUT_FILENAME     "intro_only.mp4"   Output file name
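
These settings interact: in generate_intro(), each selected moment is extracted at CLIP_DURATION * INTRO_SPEED seconds of raw footage and then sped up, so it fills exactly CLIP_DURATION seconds in the finished intro. With the defaults:

raw_dur = CLIP_DURATION * INTRO_SPEED  # 1.5 * 3.0 = 4.5 s of source footage per clip
# After the 3x speed-up each clip lasts 1.5 s, so the intro runs
# roughly INTRO_CLIP_COUNT * CLIP_DURATION = 6 * 1.5 = 9 seconds.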

Voice Activity Detection (VAD)

The get_good_segments() function uses Silero VAD to detect speech:

def get_good_segments(audio_path):
    """Finds segments where people are actually speaking (Key Moments)."""
    model, utils = torch.hub.load(
        repo_or_dir="snakers4/silero-vad", model="silero_vad", trust_repo=True
    )
    (get_speech_timestamps, _, read_audio, _, _) = utils

    wav = read_audio(audio_path)
    vad_stamps = get_speech_timestamps(
        wav,
        model,
        threshold=0.5,
        min_speech_duration_ms=int(MIN_SPEECH * 1000),
        min_silence_duration_ms=int(MIN_SILENCE * 1000),
    )
    
    return [(v["start"] / 16000, v["end"] / 16000) for v in vad_stamps]

Silero VAD is a lightweight, efficient model that runs locally and doesn’t require an API key or internet connection once downloaded.
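
If you want to experiment with the detector outside of this script, a minimal sketch looks like this (sample.wav is a placeholder for any WAV file you have on disk):

import torch

# Load the Silero VAD model from torch.hub (cached locally after the first download)
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad", trust_repo=True)
(get_speech_timestamps, _, read_audio, _, _) = utils

wav = read_audio("sample.wav")  # placeholder file; read_audio resamples to 16 kHz mono
stamps = get_speech_timestamps(wav, model, threshold=0.5)

# Timestamps are in samples at 16 kHz, so divide to get seconds
for s in stamps:
    print(f"speech from {s['start'] / 16000:.2f}s to {s['end'] / 16000:.2f}s")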

3D Perspective Effects

The script creates a unique 3D look using FFmpeg’s perspective filter:

def apply_3d_warp(input_path, output_path, direction="left"):
    if direction == "left":
        vf = "perspective=x0=0:y0=0:x1=W:y1=H/5:x2=0:y2=H:x3=W:y3=4*H/5:sense=destination"
    else:
        vf = "perspective=x0=0:y0=H/5:x1=W:y1=0:x2=0:y2=4*H/5:x3=W:y3=H:sense=destination"

This creates alternating left and right perspective warps that give the intro a dynamic, professional appearance.
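
To see what the warp does on its own, you can run the same filter from the terminal (input.mp4 and warped_preview.mp4 are placeholder file names):

ffmpeg -y -i input.mp4 \
  -vf "perspective=x0=0:y0=0:x1=W:y1=H/5:x2=0:y2=H:x3=W:y3=4*H/5:sense=destination" \
  -c:v libx264 -preset ultrafast -an warped_preview.mp4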

Video Compositing

The script layers a blurred background with a 3D-warped foreground using MoviePy:

bg = VideoFileClip(bg_clip).fx(vfx.speedx, INTRO_SPEED)
fg = VideoFileClip(fg_clip).fx(vfx.speedx, INTRO_SPEED)

if direction == "left":
    fg = fg.set_position(
        lambda t: (int(-50 + 50 * (t / CLIP_DURATION)), "center")
    )

comp = CompositeVideoClip([bg, fg]).set_duration(CLIP_DURATION)
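
Later in generate_intro(), every composite after the first gets a short crossfade, and the finished clips are stitched into a single montage:

if i > 0:
    comp = comp.crossfadein(0.2)  # fade each subsequent clip in over 0.2 s

intro_clips.append(comp)

# Once all clips are processed, join them into the final montage
full_montage = concatenate_videoclips(intro_clips, method="compose")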

Running the Script


With uv installed, running the script is simple:

uv run intro_generator.py your_video.mp4

The first run will take a bit longer as uv downloads and installs the dependencies. Subsequent runs will be much faster thanks to caching.

Expected Output

When you run the script, you’ll see output like:

🧠 Scanning video for key moments...
✨ Rendering 3D Intro (No Text)...
   💾 Saving video file...
   🔊 Adding silent audio track...
✅ Success! Intro saved as: intro_only.mp4

Customizing the Output


You can easily modify the script to change the intro style:

Change Number of Clips

Edit the INTRO_CLIP_COUNT variable:

INTRO_CLIP_COUNT = 8  # More clips for a longer intro

Adjust Speed

Modify INTRO_SPEED for faster or slower playback:

INTRO_SPEED = 2.0  # Slower, more dramatic
INTRO_SPEED = 4.0  # Faster, more energetic

Change Clip Duration

Adjust how long each clip appears:

CLIP_DURATION = 2.0  # Longer clips
CLIP_DURATION = 1.0  # Shorter, snappier cuts

Modify Blur Intensity

Edit the apply_blur_background() function:

# Stronger blur
"-vf", "boxblur=60:10,eq=brightness=-0.5"

# Lighter blur
"-vf", "boxblur=20:3,eq=brightness=-0.2"

Dependencies Explained


Package          Purpose
openai-whisper   Speech recognition (used for loading audio utilities)
torch            PyTorch for running the VAD model
torchaudio       Audio processing with PyTorch
soundfile        Reading audio files
numpy            Numerical operations
moviepy          Video editing and compositing
Pillow           Image processing (required by MoviePy)
packaging        Version handling utilities

Version Constraints

Note the specific version constraints in the dependency list. They keep the packages compatible with one another: torchaudio<2.6 and numpy<2.0.0 avoid breaking changes introduced in newer releases, and Pillow<10.0.0 keeps MoviePy 1.0.3 working, since it still relies on APIs that Pillow 10 removed.

Troubleshooting


FFmpeg not found error

If you see “FFmpeg not found”, install it using Homebrew:

brew install ffmpeg

Make sure your terminal has access to the ffmpeg command by running:

ffmpeg -version

Not enough footage found warning

This happens when the video doesn’t have enough distinct speaking segments longer than 2 seconds. The script will still work but may reuse segments. Try:

  • Using a longer source video
  • Reducing INTRO_CLIP_COUNT
  • Lowering the minimum segment duration in the code, as shown below
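
The 2-second floor is set in generate_intro(); lowering it lets shorter speaking moments qualify:

# In generate_intro(): only segments longer than 2.0 s are kept
long_segments = [s for s in segments if (s[1] - s[0]) > 2.0]  # try 1.0 for shorter moments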

Memory errors with large videos

For very large videos, you might run into memory issues. Try:

  • Processing shorter clips
  • Reducing the resolution of the source video first
  • Closing other applications to free up RAM

First run is slow

The first run downloads the Silero VAD model and installs all Python dependencies. Subsequent runs will be much faster as everything is cached by uv and PyTorch.

Conclusion


This Python video intro editor combines several powerful technologies - Silero VAD for speech detection, FFmpeg for video processing, and MoviePy for compositing - into a single, easy-to-run script. Thanks to uv’s inline dependency management, you don’t need to worry about virtual environments or package installation.

The script is highly customizable, allowing you to adjust the number of clips, speed, duration, and visual effects to match your content style. Whether you’re creating YouTube intros, social media content, or presentation openers, this tool can save you hours of manual video editing.

For more Python scripting tutorials with uv, check out: