# MP4 Is the Wrong Primitive for AI

> Video files were designed for playback. AI agents need something different.

Category: Philosophy

---

MP4 dates to the early 2000s, when MPEG-4 Part 14 was standardized on top of Apple's QuickTime file format. Its job is simple: pack frames and audio into a file that plays sequentially from start to finish.

That's perfect for Netflix. It's terrible for AI.

## What MP4 Gives You

An MP4 file is a container. Inside:

* Compressed video frames (H.264, H.265, etc.)
* Compressed audio tracks (AAC, MP3, etc.)
* Timing information for synchronization
* Metadata (duration, resolution, codec info)

To access any content, you:

1. Decode the video stream
2. Extract individual frames
3. Process each frame through your model
4. Repeat for every second of footage

This works for short clips. It falls apart at scale.

## The Problem with Frames

Say you have a 1-hour video at 30fps. That's 108,000 frames.

Seeking to a timestamp is easy. Knowing what is there is not. To answer a content question like "when did pricing come up?", your options are:

1. Decode and process all 108,000 frames (expensive, slow)
2. Sample frames and hope you don't miss anything (lossy, unreliable)
3. Process in real time as the video plays (1 hour to process 1 hour)

None of these let you instantly query the content.
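The arithmetic behind those options is worth spelling out. This is a small sketch assuming a constant frame rate and a 1 fps sampling rate picked purely for illustration:

```python
def total_frames(duration_s: int, fps: int) -> int:
    """Number of frames in a constant-frame-rate video."""
    return duration_s * fps


FPS = 30
ONE_HOUR_S = 60 * 60

frames = total_frames(ONE_HOUR_S, FPS)  # option 1: process everything
sampled = total_frames(ONE_HOUR_S, 1)   # option 2: keep one frame per second
skipped_per_sample = FPS - 1            # frames ignored between two samples

print(f"full pass: {frames} frames")
print(f"sampled:   {sampled} frames, {skipped_per_sample} skipped after each one")
```

Sampling cuts the work by 30x, but anything that appears and disappears inside a skipped window is invisible to the model.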

Compare this to a database:

```sql
SELECT * FROM meetings WHERE topic = 'pricing' AND timestamp > '23:40'
```

Instant. Indexed. Queryable.

MP4 doesn't give you this. It gives you a blob.
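To make the contrast concrete, here is a runnable version of that query using Python's built-in `sqlite3`. The table layout, the `start_s` column (seconds instead of a `'23:40'` string, so comparisons are numeric), and the sample rows are all made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meetings (topic TEXT, start_s INTEGER, summary TEXT)")
# A composite index lets the database jump straight to matching rows.
conn.execute("CREATE INDEX idx_topic_start ON meetings (topic, start_s)")
conn.executemany(
    "INSERT INTO meetings VALUES (?, ?, ?)",
    [
        ("intro", 0, "Welcome and agenda"),
        ("pricing", 1380, "First mention of pricing"),
        ("pricing", 1425, "Pricing tiers compared"),
        ("roadmap", 2100, "Next quarter plans"),
    ],
)

after_s = 23 * 60 + 40  # 23:40 expressed in seconds
rows = conn.execute(
    "SELECT start_s, summary FROM meetings "
    "WHERE topic = ? AND start_s > ? ORDER BY start_s",
    ("pricing", after_s),
).fetchall()

print(rows)  # [(1425, 'Pricing tiers compared')]
```

Nothing here scans the whole table. That is the property a raw video file lacks.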

## What AI Actually Needs

AI agents don't watch videos. They query them.

The questions agents ask:

* "What was said about the budget?"
* "Show me the moment the error appeared on screen"
* "When did the person enter the frame?"
* "What happened between 10:30 and 10:45?"

These are queries, not playback commands. They need:

| Capability               | MP4     | What AI Needs |
| :----------------------- | :------ | :------------ |
| Random access by content | No      | Yes           |
| Semantic search          | No      | Yes           |
| Timestamped results      | Limited | Precise       |
| Multi-index queries      | No      | Yes           |
| Instant answers          | No      | Yes           |

## The Transcoding Trap

The common workaround: "transcode" the whole video into text up front.

1. Extract all frames
2. Run each through a vision model
3. Store the descriptions in a vector database
4. Query the database

This "works" but:

* **Cost**: Processing every frame is expensive
* **Latency**: Hours of processing before you can query
* **Storage**: Frame embeddings multiply storage costs
* **Staleness**: Live content can't be pre-processed
* **Loss**: Descriptions lose visual fidelity

You're converting video into text, then querying text. The video itself becomes a liability — something you keep around for playback but can't actually use.
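A back-of-the-envelope estimate shows how the latency and storage costs add up. The per-frame model time and the embedding size below are hypothetical placeholders, not measurements. Swap in your own numbers:

```python
FRAMES = 60 * 60 * 30        # one hour at 30 fps, as in the earlier example

# Hypothetical assumptions, chosen only to illustrate the scale.
SECONDS_PER_FRAME = 0.5      # vision model time per frame
EMBEDDING_DIMS = 768         # embedding vector length
BYTES_PER_FLOAT = 4          # float32

processing_hours = FRAMES * SECONDS_PER_FRAME / 3600
embedding_bytes = FRAMES * EMBEDDING_DIMS * BYTES_PER_FLOAT

print(f"{processing_hours:.1f} hours of model time per hour of video")
print(f"{embedding_bytes / 1e6:.1f} MB of embeddings per hour of video")
```

Under these assumptions, one hour of footage needs 15 hours of sequential model time and roughly 332 MB of vectors before the first query can run.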

## Indexes as the Right Primitive

What if the primitive wasn't a file, but an index?

An index is:

* **Prompt-defined** — You specify what to extract
* **Timestamped** — Every result maps to exact moments
* **Searchable** — Natural language queries, instant results
* **Composable** — Multiple indexes on the same media
* **Playable** — Results link back to verifiable video

```python
# Assumes `video` is a video object already loaded through the SDK

# Create an index with a prompt
index = video.index_scenes(prompt="Identify product demonstrations")

# Query it with natural language
results = video.search("demo of the new feature")

# Get timestamped, playable results
for shot in results.shots:
    print(f"{shot.start}s: {shot.text}")
    shot.play()  # Verify by watching
```

The video file still exists. But you don't interact with it directly. You interact with indexes — semantic layers that make the content queryable.
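Conceptually, an index is just a list of timestamped entries you can search. Here is a toy, in-memory sketch of that idea. The `Entry` class and `search` function are hypothetical and not part of any SDK, and plain keyword matching stands in for real semantic search:

```python
from dataclasses import dataclass


@dataclass
class Entry:
    start: float  # seconds from the start of the video
    end: float
    text: str     # what the extraction prompt produced for this span


def search(index: list[Entry], query: str) -> list[Entry]:
    """Return entries containing every query word (a stand-in for semantic search)."""
    words = query.lower().split()
    return [e for e in index if all(w in e.text.lower() for w in words)]


demo_index = [
    Entry(0.0, 42.0, "Speaker introduces the agenda"),
    Entry(42.0, 95.5, "Live demo of the new export feature"),
    Entry(95.5, 130.0, "Audience questions about pricing"),
]

hits = search(demo_index, "demo feature")
for hit in hits:
    print(f"{hit.start}s-{hit.end}s: {hit.text}")
```

Every hit carries its own time range, so a result is never just text: it points back at a span of video you can play.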

## Multiple Perspectives

The power of indexes: you can build several on the same video.

```python
# Same video, different questions
safety_index = video.index_scenes(prompt="Identify safety violations")
activity_index = video.index_scenes(prompt="Track person movements")
text_index = video.index_scenes(prompt="Extract on-screen text")
```

Each index is a different lens on the same content. Query them separately or together.
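Querying "together" can be as simple as intersecting time ranges. The sketch below is one way to do it, with hypothetical hit data and helper names. It pairs up results from two indexes whenever their spans overlap:

```python
Span = tuple[float, float, str]  # (start_s, end_s, description)


def overlaps(a: Span, b: Span) -> bool:
    """True when two time spans share at least one instant."""
    return a[0] < b[1] and b[0] < a[1]


def co_occurring(left: list[Span], right: list[Span]) -> list[tuple[Span, Span]]:
    """Pair every hit from one index with the overlapping hits from another."""
    return [(a, b) for a in left for b in right if overlaps(a, b)]


# Hypothetical results from two indexes over the same video.
safety_hits = [(120.0, 135.0, "Worker without helmet"), (400.0, 410.0, "Blocked exit")]
activity_hits = [(100.0, 130.0, "Person enters loading bay"), (300.0, 320.0, "Forklift moves")]

pairs = co_occurring(safety_hits, activity_hits)
for violation, movement in pairs:
    print(f"{violation[2]} while: {movement[2]}")
```

Because both indexes share the video's timeline, timestamps act as the join key between them.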

Try doing that with an MP4.

## Beyond Files

The same model works for live streams:

```python
# Assumes `rtstream` is a live-stream object already connected through the SDK
rtstream.index_visuals(prompt="Describe what the user is doing")
rtstream.start_transcript()
```

No files, no pre-processing, no waiting. Indexes build in real time as media flows.
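To see what "builds as media flows" means in practice, here is a toy sketch with hypothetical names and data. A generator stands in for the stream, and the index is queryable after every chunk rather than only at the end:

```python
from typing import Iterable, Iterator


def live_index(chunks: Iterable[tuple[float, str]], index: list) -> Iterator[int]:
    """Append each (timestamp, description) as it arrives; yield the index size."""
    for ts, desc in chunks:
        index.append((ts, desc))
        yield len(index)


def find(index: list, word: str) -> list[float]:
    """Timestamps of entries whose description mentions the word."""
    return [ts for ts, desc in index if word in desc.lower()]


# Hypothetical stream of already-described chunks.
stream = iter([
    (0.0, "User opens the editor"),
    (5.0, "User types a query"),
    (10.0, "Error dialog appears"),
])

index: list = []
feed = live_index(stream, index)

next(feed)
next(feed)
early = find(index, "error")  # queried mid-stream: nothing yet
next(feed)
late = find(index, "error")   # queried again after the next chunk

print(early, late)  # [] [10.0]
```

The same query returns different answers as the stream advances, which a pre-processed batch pipeline cannot offer.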

## The Shift

| Old Model                       | New Model                       |
| :------------------------------ | :------------------------------ |
| File is the primitive           | Index is the primitive          |
| Process every frame, then query | Index what you need, then query |
| Static, batch                   | Dynamic, real-time              |
| One representation              | Multiple perspectives           |
| Playback-oriented               | Query-oriented                  |

MP4 isn't going away. But for AI, it's the wrong level of abstraction.

---

## It's time to move beyond the file. Indexes are the right primitive for AI.

[Read: Why Video Was Built for Playback, Not Perception](/blogs/playback-vs-perception)