MP4 Is the Wrong Primitive for AI

Video files were designed for playback. AI agents need something different.

MP4 grew out of the MPEG-4 work of the late 1990s and builds on Apple’s QuickTime file format. Its job is simple: pack compressed frames and audio into a file that a player can decode in order and seek through by time. That’s perfect for Netflix. It’s terrible for AI.

What MP4 Gives You

An MP4 file is a container. Inside:

Compressed video frames (H.264, H.265, etc.)
Compressed audio tracks (AAC, MP3, etc.)
Timing information for synchronization
Metadata (duration, resolution, codec info)

To access any content, you:

Decode the video stream
Extract individual frames
Process each frame through your model
Repeat for every second of footage

This works for short clips. It falls apart at scale.

The Problem with Frames

Say you have a 1-hour video at 30fps. That’s 108,000 frames.

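A player can seek to 23:45, but the file cannot tell you what is there. To answer “what happened at 23:45?”, or the harder “when was pricing discussed?”, your options are: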

Decode and process all 108,000 frames (expensive, slow)
Sample frames and hope you don’t miss anything (lossy, unreliable)
Process in real time as the video plays (1 hour of waiting for 1 hour of footage)

None of these let you instantly query the content.
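The trade-off between the three options is mostly arithmetic. The sketch below works through it for the 1-hour, 30fps example; the 5-second sampling interval is an illustrative assumption, not a recommendation.

```python
# Back-of-the-envelope numbers for the 1-hour, 30 fps example.
DURATION_S = 60 * 60   # one hour of footage, in seconds
FPS = 30

# Option 1: decode and process every frame.
total_frames = DURATION_S * FPS

# Option 2: sample one frame every SAMPLE_EVERY_S seconds.
# The interval is an assumption chosen for illustration.
SAMPLE_EVERY_S = 5
sampled_frames = DURATION_S // SAMPLE_EVERY_S
frames_skipped_per_gap = SAMPLE_EVERY_S * FPS - 1  # frames never inspected between two samples

# Option 3: processing at playback speed takes as long as the video itself.
realtime_wait_s = DURATION_S

print(f"all frames:     {total_frames:,} model calls")
print(f"sampled frames: {sampled_frames:,} model calls, "
      f"{frames_skipped_per_gap} frames skipped between samples")
print(f"real time:      {realtime_wait_s // 60} minutes before the full hour is covered")
```

Any event shorter than the sampling interval can fall entirely between two samples, which is what makes sampling lossy.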

Compare this to a database:

SELECT * FROM meetings WHERE topic = 'pricing' AND timestamp > '23:40'

Instant. Indexed. Queryable. MP4 doesn’t give you this. It gives you a blob.
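To make the comparison concrete, here is a minimal runnable sketch of that kind of query using Python's built-in sqlite3 module. The table layout, column names, and rows are hypothetical, chosen only to mirror the query above.

```python
import sqlite3

# In-memory database standing in for a store of indexed meeting moments.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meetings (topic TEXT, start_s INTEGER, summary TEXT)")
conn.execute("CREATE INDEX idx_topic_time ON meetings (topic, start_s)")

rows = [
    ("intro",   0,    "Welcome and agenda"),
    ("pricing", 1380, "First mention of the new tiers"),  # 23:00
    ("pricing", 1425, "Discount policy discussed"),       # 23:45
    ("roadmap", 1500, "Q3 milestones"),                   # 25:00
]
conn.executemany("INSERT INTO meetings VALUES (?, ?, ?)", rows)


def to_seconds(mmss: str) -> int:
    """Convert an 'MM:SS' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


hits = conn.execute(
    "SELECT start_s, summary FROM meetings "
    "WHERE topic = ? AND start_s > ? ORDER BY start_s",
    ("pricing", to_seconds("23:40")),
).fetchall()

for start_s, summary in hits:
    print(f"{start_s // 60:02d}:{start_s % 60:02d}  {summary}")
```

Timestamps are stored as integer seconds here so the comparison is numeric; comparing 'MM:SS' strings only behaves correctly when every value is zero-padded to the same width.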

What AI Actually Needs

AI agents don’t watch videos. They query them.

The questions agents ask:

“What was said about the budget?”
“Show me the moment the error appeared on screen”
“When did the person enter the frame?”
“What happened between 10:30 and 10:45?”

These are queries, not playback commands. They need:

Capability	MP4	What AI Needs
Random access by content	No	Yes
Semantic search	No	Yes
Timestamped results	Limited	Precise
Multi-index queries	No	Yes
Instant answers	No	Yes

The Transcoding Trap

The common workaround is to “transcode” the video into text up front:

Extract all frames
Run each through a vision model
Store the descriptions in a vector database
Query the database

This “works” but:

Cost: Processing every frame is expensive
Latency: Hours of processing before you can query
Storage: Frame embeddings multiply storage costs
Staleness: Live content can’t be pre-processed
Loss: Descriptions lose visual fidelity

You’re converting video into text, then querying text. The video itself becomes a liability — something you keep around for playback but can’t actually use.
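The workaround described above can be sketched in a few lines. Everything here is a toy stand-in: `describe_frame` is a hypothetical placeholder for a vision-model call, and the bag-of-words `embed` function replaces a real embedding model and vector database.

```python
from collections import Counter
from math import sqrt

# Stand-in for step 1 (extract all frames): (timestamp_s, frame) pairs.
# Real frames would be pixel arrays; strings keep the sketch runnable.
frames = [(0, "frame-a"), (1, "frame-b"), (2, "frame-c")]


def describe_frame(frame: str) -> str:
    """Hypothetical vision-model call; returns canned text here."""
    canned = {
        "frame-a": "presenter stands next to a whiteboard",
        "frame-b": "slide showing the quarterly budget chart",
        "frame-c": "error dialog appears on the laptop screen",
    }
    return canned[frame]


def embed(text: str) -> Counter:
    """Toy bag-of-words vector standing in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Steps 2-3: every frame is described and stored before any query can run.
store = []
for ts, frame in frames:
    text = describe_frame(frame)
    store.append((ts, text, embed(text)))


# Step 4: the query only ever sees the stored text, never the pixels.
def query(question: str):
    qv = embed(question)
    best = max(store, key=lambda row: cosine(qv, row[2]))
    return best[0], best[1]


print(query("error on screen"))
```

Note where the cost sits: the loop that fills `store` must finish for the whole video before the first call to `query`, and anything the description left out is gone.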

Indexes as the Right Primitive

What if the primitive wasn’t a file, but an index?

An index is:

Prompt-defined — You specify what to extract
Timestamped — Every result maps to exact moments
Searchable — Natural language queries, instant results
Composable — Multiple indexes on the same media
Playable — Results link back to verifiable video

# `video` is assumed to be a video object already uploaded through the SDK
# Create an index with a prompt
index = video.index_scenes(prompt="Identify product demonstrations")

# Query it with natural language
results = video.search("demo of the new feature")

# Get timestamped, playable results
for shot in results.shots:
    print(f"{shot.start}s: {shot.text}")
    shot.play()  # Verify by watching

The video file still exists. But you don’t interact with it directly. You interact with indexes — semantic layers that make the content queryable.
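One way to picture an index as a data structure is a list of timestamped entries produced under a single prompt. The sketch below is a hypothetical model for illustration, not the SDK's implementation; `SceneIndex`, `Entry`, and the keyword matching that stands in for semantic search are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    start: float  # seconds from the start of the video
    end: float
    text: str     # what the extraction prompt produced for this span


@dataclass
class SceneIndex:
    prompt: str
    entries: list[Entry] = field(default_factory=list)

    def add(self, start: float, end: float, text: str) -> None:
        self.entries.append(Entry(start, end, text))

    def search(self, query: str) -> list[Entry]:
        """Keyword match standing in for semantic search."""
        terms = query.lower().split()
        return [e for e in self.entries if all(t in e.text.lower() for t in terms)]


demos = SceneIndex(prompt="Identify product demonstrations")
demos.add(0, 42, "Speaker introduces the team")
demos.add(42, 95, "Live demo of the new export feature")
demos.add(95, 130, "Audience questions")

hits = demos.search("demo feature")
for e in hits:
    print(f"{e.start}s-{e.end}s: {e.text}")
```

Every hit carries its own start and end time, so a result can always be traced back to the stretch of video it came from.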

Multiple Perspectives

The power of indexes: you can build several on the same video, each answering a different question.

# Same video, different questions
safety_index = video.index_scenes(prompt="Identify safety violations")
activity_index = video.index_scenes(prompt="Track person movements")
text_index = video.index_scenes(prompt="Extract on-screen text")

Each index is a different lens on the same content. Query them separately or together. Try doing that with an MP4.
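Querying indexes "together" can be as simple as intersecting the time spans each one returns. The sketch below assumes each index yields a sorted list of non-overlapping (start, end) spans in seconds; the sample spans are made up for illustration.

```python
def intersect(a, b):
    """Return the overlaps between two sorted lists of (start, end) spans."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever span finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical hits from two indexes built on the same video.
safety_hits = [(10, 25), (60, 90)]
activity_hits = [(20, 40), (70, 75), (85, 120)]

both = intersect(safety_hits, activity_hits)
print(both)
```

The result is the set of moments where both lenses agree something relevant is happening, found in a single pass over the two lists.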

Beyond Files

The same model works for live streams:

rtstream.index_visuals(prompt="Describe what user is doing")
rtstream.start_transcript()

No files, no pre-processing, no waiting. Indexes build in real time as the media flows.
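The idea of an index that is searchable while it is still being built can be shown with a toy example. `LiveIndex` and `fake_stream` are hypothetical stand-ins; a real stream would deliver model-generated descriptions rather than canned strings.

```python
class LiveIndex:
    """Toy index that becomes searchable as soon as each segment arrives."""

    def __init__(self):
        self.entries = []  # (start_s, end_s, text)

    def ingest(self, start_s, end_s, text):
        self.entries.append((start_s, end_s, text))

    def search(self, term):
        return [e for e in self.entries if term.lower() in e[2].lower()]


def fake_stream():
    """Hypothetical stand-in for descriptions produced from a live feed."""
    yield (0, 5, "User opens the settings page")
    yield (5, 10, "User types in the search box")
    yield (10, 15, "Error banner appears at the top")


index = LiveIndex()
seen_after = {}
for n, segment in enumerate(fake_stream(), start=1):
    index.ingest(*segment)
    # Query mid-stream: results reflect everything ingested so far.
    seen_after[n] = len(index.search("error"))

print(seen_after)
```

The query runs after every segment rather than after the stream ends, so the match becomes visible as soon as the segment containing it has been ingested.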

The Shift

Moving from file-based to index-based video understanding changes what you build on:

Old Model	New Model
File is the primitive	Index is the primitive
Process everything, then query	Index what you ask for, then query
Static, batch	Dynamic, real-time
One representation	Multiple perspectives
Playback-oriented	Query-oriented

MP4 isn’t going away. But for AI, it’s the wrong level of abstraction.