MP4 Is the Wrong Primitive for AI
Video files were designed for playback. AI agents need something different.
MP4 was standardized in the early 2000s, building on Apple’s QuickTime file format. Its job is simple: pack compressed video and audio into a file that a player can decode in order and seek through by time. That’s perfect for Netflix. It’s terrible for AI.
What MP4 Gives You
An MP4 file is a container. Inside:
- Compressed video frames (H.264, H.265, etc.)
- Compressed audio tracks (AAC, MP3, etc.)
- Timing information for synchronization
- Metadata (duration, resolution, codec info)
To access any content, you:
- Decode the video stream
- Extract individual frames
- Process each frame through your model
- Repeat for every second of footage
This works for short clips. It falls apart at scale.
The Problem with Frames
Say you have a 1-hour video at 30fps. That’s 108,000 frames.
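MP4 can seek by time, but it has no notion of content. To answer a question like “what happened at 23:45?” or “when was pricing discussed?”, a model still has to decode and interpret pixels. Your options are: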
- Decode and process all 108,000 frames (expensive, slow)
- Sample frames and hope you don’t miss anything (lossy, unreliable)
- Process in real time as the video plays (1 hour to process 1 hour)
None of these let you instantly query the content.
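To put rough numbers on these options, here is a back-of-the-envelope sketch. The per-frame model latency and the one-frame-per-second sampling interval are illustrative assumptions, not measurements.

```python
# Rough cost of processing a 1-hour, 30 fps video frame by frame.
# SECONDS_PER_FRAME and SAMPLE_EVERY_S are assumed values for illustration only.
DURATION_S = 60 * 60       # 1 hour of footage
FPS = 30
SECONDS_PER_FRAME = 0.2    # assumed model latency per frame
SAMPLE_EVERY_S = 1         # assumed sampling interval for the "sample frames" option

total_frames = DURATION_S * FPS
sampled_frames = DURATION_S // SAMPLE_EVERY_S


def hours(frame_count: int) -> float:
    """Sequential processing time in hours for a given number of frames."""
    return frame_count * SECONDS_PER_FRAME / 3600


print(f"all frames:     {total_frames:>7,} frames, {hours(total_frames):.1f} h")
print(f"sampled frames: {sampled_frames:>7,} frames, {hours(sampled_frames):.1f} h")
print(f"skipped by sampling: {1 - sampled_frames / total_frames:.1%}")
```

Under these assumptions, processing every frame costs about 6 hours of model time. Sampling once per second cuts that to about 12 minutes, but it never looks at roughly 97% of the frames.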
Compare this to a database:
SELECT * FROM meetings WHERE topic = 'pricing' AND timestamp > '23:40'
Instant. Indexed. Queryable. MP4 doesn’t give you this. It gives you a blob.
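For contrast, here is a minimal sketch of the database side using Python's built-in `sqlite3`. The `meetings` schema, the sample rows, and the `to_seconds` helper are hypothetical. Timestamps are stored as seconds rather than `'MM:SS'` strings so they compare numerically.

```python
import sqlite3

# Hypothetical schema: one row per timestamped segment of a meeting recording.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE meetings (start_s INTEGER, end_s INTEGER, topic TEXT, summary TEXT)"
)
# The index is what lets the lookup avoid scanning every row.
conn.execute("CREATE INDEX idx_topic_start ON meetings (topic, start_s)")
conn.executemany(
    "INSERT INTO meetings VALUES (?, ?, ?, ?)",
    [
        (600, 780, "roadmap", "Q3 roadmap review"),
        (1380, 1425, "budget", "Headcount budget"),
        (1425, 1500, "pricing", "Discussion of the new pricing tiers"),
    ],
)


def to_seconds(mm_ss: str) -> int:
    """Convert 'MM:SS' to seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


rows = conn.execute(
    "SELECT start_s, end_s, summary FROM meetings WHERE topic = ? AND start_s > ?",
    ("pricing", to_seconds("23:40")),
).fetchall()
print(rows)
```

The query touches only rows that match the topic and time window. Nothing comparable exists inside an MP4, because the container stores no description of what its frames contain.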
What AI Actually Needs
AI agents don’t watch videos. They query them.
The questions agents ask:
- “What was said about the budget?”
- “Show me the moment the error appeared on screen”
- “When did the person enter the frame?”
- “What happened between 10:30 and 10:45?”
These are queries, not playback commands. They need:
| Capability | MP4 | What AI Needs |
|---|---|---|
| Random access by content | No | Yes |
| Semantic search | No | Yes |
| Timestamped results | Limited | Precise |
| Multi-index queries | No | Yes |
| Instant answers | No | Yes |
The Transcoding Trap
The common workaround: transcode everything.
- Extract all frames
- Run each through a vision model
- Store the descriptions in a vector database
- Query the database
This “works” but:
- Cost: Processing every frame is expensive
- Latency: Hours of processing before you can query
- Storage: Frame embeddings multiply storage costs
- Staleness: Live content can’t be pre-processed
- Loss: Descriptions lose visual fidelity
You’re converting video into text, then querying text. The video itself becomes a liability — something you keep around for playback but can’t actually use.
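The four steps above can be sketched in a few lines. This is a toy version under loud assumptions: `describe_frame` is a hypothetical stand-in for a vision model that returns canned captions, and `embed` is a bag-of-words counter rather than a learned embedding.

```python
import math
from collections import Counter

# Stand-in for a vision model: maps a sampled frame's timestamp to a caption.
# A real pipeline would decode pixels and call a model here (hypothetical stub).
CANNED_CAPTIONS = {
    0: "title slide with company logo",
    30: "presenter opens the pricing spreadsheet",
    60: "error dialog appears on screen",
}


def describe_frame(timestamp_s: int) -> str:
    return CANNED_CAPTIONS[timestamp_s]


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; real systems use learned vectors."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Steps 1-3: extract frames, describe each one, store the vectors.
store = [(t, describe_frame(t), embed(describe_frame(t))) for t in CANNED_CAPTIONS]


# Step 4: query the store. The video itself is never consulted again.
def query(text: str) -> tuple[int, str]:
    q = embed(text)
    t, caption, _ = max(store, key=lambda row: cosine(q, row[2]))
    return t, caption


print(query("error dialog on screen"))
```

Notice that the query can only find what the captions happened to mention. Anything the description step left out is invisible, which is the loss described above.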
Indexes as the Right Primitive
What if the primitive wasn’t a file, but an index?
An index is:
- Prompt-defined — You specify what to extract
- Timestamped — Every result maps to exact moments
- Searchable — Natural language queries, instant results
- Composable — Multiple indexes on the same media
- Playable — Results link back to verifiable video
# Create an index with a prompt
index = video.index_scenes(prompt="Identify product demonstrations")

# Query it with natural language
results = video.search("demo of the new feature")

# Get timestamped, playable results
for shot in results.shots:
    print(f"{shot.start}s: {shot.text}")
    shot.play()  # Verify by watching
The video file still exists. But you don’t interact with it directly. You interact with indexes — semantic layers that make the content queryable.
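To make "semantic layer" concrete, here is one way such an index could be modeled in memory. This is a sketch, not the SDK's implementation: `PromptIndex`, `Entry`, and their methods are hypothetical names, and a keyword match stands in for real semantic search.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    start: float  # seconds from the start of the media
    end: float
    text: str     # what the model extracted for this span


@dataclass
class PromptIndex:
    """Hypothetical in-memory index: a prompt plus timestamped entries."""
    prompt: str
    entries: list[Entry] = field(default_factory=list)

    def add(self, start: float, end: float, text: str) -> None:
        self.entries.append(Entry(start, end, text))

    def search(self, query: str) -> list[Entry]:
        """Keyword match standing in for semantic search."""
        terms = query.lower().split()
        return [e for e in self.entries if any(t in e.text.lower() for t in terms)]

    def between(self, start: float, end: float) -> list[Entry]:
        """Entries that overlap the time window from start to end."""
        return [e for e in self.entries if e.start < end and e.end > start]


demos = PromptIndex(prompt="Identify product demonstrations")
demos.add(95.0, 140.0, "Presenter demos the new export feature")
demos.add(610.0, 655.0, "Live demo of dashboard filters")

for hit in demos.search("export"):
    print(f"{hit.start}s-{hit.end}s: {hit.text}")
```

Every result carries a start and end time, so it can always be traced back to the exact span of video it came from.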
Multiple Perspectives
The power of indexes: you can create several on the same video.
# Same video, different questions
safety_index = video.index_scenes(prompt="Identify safety violations")
activity_index = video.index_scenes(prompt="Track person movements")
text_index = video.index_scenes(prompt="Extract on-screen text")
Each index is a different lens on the same content. Query them separately or together. Try doing that with an MP4.
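Querying indexes "together" can be as simple as intersecting their time spans. The sketch below assumes each index reduces to a list of `(start_s, end_s, text)` tuples. The data and the `overlaps` helper are made up for illustration.

```python
# Each hypothetical index is a list of (start_s, end_s, text) spans.
safety = [(120, 150, "worker without helmet"), (400, 420, "blocked fire exit")]
activity = [(100, 140, "person enters loading bay"), (300, 330, "forklift moves pallet")]


def overlaps(a, b):
    """Pairs of spans from two indexes that share some stretch of time."""
    return [
        (max(s1, s2), min(e1, e2), t1, t2)
        for s1, e1, t1 in a
        for s2, e2, t2 in b
        if s1 < e2 and s2 < e1
    ]


for start, end, violation, movement in overlaps(safety, activity):
    print(f"{start}-{end}s: {violation} while {movement}")
```

The nested loop is quadratic in the number of spans, which is fine for a sketch. A production system would more likely sort the spans or use an interval tree.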
Beyond Files
The same model works for live streams:
rtstream.index_visuals(prompt="Describe what user is doing")
rtstream.start_transcript()
No files, no pre-processing, no waiting. Indexes build in real time as media flows.
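The key property is that the index is queryable while it is still growing. Here is a minimal sketch of that idea, with a hypothetical `live_chunks` generator standing in for the stream and a substring match standing in for search.

```python
from typing import Iterator


def live_chunks() -> Iterator[tuple[int, str]]:
    """Stand-in for a live stream: yields (timestamp_s, description) as it 'arrives'."""
    yield 0, "user opens the settings page"
    yield 5, "user types in the search box"
    yield 10, "error banner appears"


index: list[tuple[int, str]] = []


def search(term: str) -> list[tuple[int, str]]:
    return [(t, text) for t, text in index if term in text]


for timestamp, description in live_chunks():
    index.append((timestamp, description))  # the index grows as media flows
    hits = search("error")                  # and is queryable at every step
    print(f"t={timestamp}s indexed={len(index)} error_hits={len(hits)}")
```

The query returns nothing until the matching moment arrives, then finds it on the very next check, with no batch job in between.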
The Shift
Moving from file-based to index-based video understanding changes what you build on:
| Old Model | New Model |
|---|---|
| File is the primitive | Index is the primitive |
| Process every frame, then query | Index what matters, query immediately |
| Static, batch | Dynamic, real-time |
| One representation | Multiple perspectives |
| Playback-oriented | Query-oriented |
MP4 isn’t going away. But for AI, it’s the wrong level of abstraction.