# Agents that watch the screen with you

> Desktop agents — background runtimes that continuously watch pixels and listen to audio, building memory of what you've done and a model of what you need next. VideoDB ships the OS bridge, ingest pipeline, and recall memory layer for them.

---

## Hero

The desktop is the highest-leverage surface AI has ever had access to. The next decade of productivity software gets built on top of it — starting with an agent that can see what's on your screen.

For forty years, every productivity tool has been blind. Word processors don't know what you're writing about. IDEs don't watch you debug. Slack has no idea what you just said on the call. The most expensive surface on your computer — the screen itself — has been a complete blind spot.

That's changing. The first generation of desktop agents is here, and they're nothing like the chatbots that came before them. They don't sit in a sidebar waiting to be summoned. They run in the background, continuously, watching pixels and listening to mic input, building a memory of what you've done and a model of what you need next.

If LLMs were the moment text got cheap, desktop agents are the moment *context* got cheap. And context is what work actually runs on.

---

## Why now

Three things had to happen in parallel for desktop agents to be possible:

- Models had to get fast and multimodal enough that watching a screen at 24fps wasn't a fantasy.
- OS-level capture APIs had to mature on every major platform.
- Someone had to build the runtime: the part that handles streams, indexes frames, manages memory, and exposes the whole stack as a single tool any agent loop can call.

VideoDB is that runtime. One install gives an agent a native bridge into the operating system, an ingest pipeline that runs at the speed of the OS, and a memory layer that lets it recall anything it watched as a searchable, replayable clip.

> "The desktop agent doesn't ask what you want. It already saw."

---

## Three things builders are shipping today

### 1. The pair programmer that actually pairs

Copilot in your editor is a good autocomplete. A pair programmer is something else: it watches the whole environment. It sees the architecture diagram open in another window. It sees the YouTube deep-dive queued up about the auth flow being rewritten. It hears the question a colleague asked on the last call. When you ask *"how should I rebuild this?"* it answers from the same shared context a human would.

VideoDB ships a reference implementation at `video-db/pair-programmer`. It uses the screen capture stream as the primary input and threads a recall API into the assistant so any moment from the last hour, day, or week is one query away.
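To make "one query away" concrete, here is a minimal sketch of a time-windowed recall lookup. It is a toy in-memory stand-in, not the VideoDB recall API: `Moment`, `RecallIndex`, `remember`, and `recall` are hypothetical names, and the matching is plain keyword search rather than real indexing.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Moment:
    """One remembered moment: when it happened and what was seen or heard."""
    at: datetime
    text: str


class RecallIndex:
    """Hypothetical in-memory recall layer, for illustration only."""

    def __init__(self) -> None:
        self._moments = []

    def remember(self, at: datetime, text: str) -> None:
        self._moments.append(Moment(at, text))

    def recall(self, query: str, within: timedelta, now: datetime) -> list:
        """Return moments in the time window that mention every query word, newest first."""
        words = query.lower().split()
        cutoff = now - within
        hits = [
            m for m in self._moments
            if cutoff <= m.at <= now and all(w in m.text.lower() for w in words)
        ]
        return sorted(hits, key=lambda m: m.at, reverse=True)


now = datetime(2024, 1, 8, 12, 0)
index = RecallIndex()
index.remember(now - timedelta(minutes=20), "Auth flow diagram open in browser")
index.remember(now - timedelta(days=3), "Colleague asked about auth token refresh")
index.remember(now - timedelta(days=30), "Old auth notes")

last_hour = index.recall("auth", within=timedelta(hours=1), now=now)
last_week = index.recall("auth", within=timedelta(weeks=1), now=now)
```

The same query widens from one hit to two as the window grows from an hour to a week, which is the shape of call an assistant would make when it needs "that thing from earlier".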

### 2. The meeting copilot without the bot

The standard meeting assistant joins your call as a third participant. Everyone watches a robot blink in the corner. That model is dying. The new one is local: capture the audio and screen-share on the device the meeting is already happening on, run perception locally, never invite a stranger to your call.

`call.md` is the open-source reference. Every meeting becomes a markdown document. Every decision has a playable clip attached. When someone asks *"what did we agree about pricing?"* the agent answers with the moment, not a paraphrase.
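As an illustration of "every decision has a playable clip attached", the sketch below renders a meeting note where each decision links to a clip by start and end time. It is not the actual `call.md` output format: the `Decision` type, the `render_meeting` helper, the example decision, and the `example.com` clip URL scheme are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    summary: str
    start_s: int  # clip start, in seconds from the beginning of the recording
    end_s: int    # clip end, in seconds


def timestamp(seconds: int) -> str:
    """Format a second count as MM:SS."""
    return f"{seconds // 60:02d}:{seconds % 60:02d}"


def render_meeting(title: str, decisions: list, clip_base: str) -> str:
    """Build a markdown note with one bullet per decision, each linking to its clip."""
    lines = [f"# {title}", "", "## Decisions", ""]
    for d in decisions:
        link = f"{clip_base}?start={d.start_s}&end={d.end_s}"
        lines.append(f"- {d.summary} ([{timestamp(d.start_s)}]({link}))")
    return "\n".join(lines)


doc = render_meeting(
    "Pricing sync",
    [Decision("Keep the free tier at 10 seats", 754, 791)],
    clip_base="https://example.com/clips/abc",  # placeholder URL
)
print(doc)
```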

### 3. The second brain that finally works

People have tried to build "second brains" for two decades. Every attempt failed for the same reason: capture is too hard. Nobody wants to take notes, tag emails, transcribe meetings, save the right Slack threads. Desktop agents solve this without asking anyone to change a behavior. The capture happens whether you notice it or not. Cognitive load drops to zero.

Memory becomes a recall API. The agent watching your screen turns into the most reliable note-taker you've ever had — one that remembers everything and forgets only what you tell it to.

---

## What you actually get

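Install the native SDK on Mac, Windows, or Linux. One package, and the core loop is three lines of code:

```python
import asyncio
from videodb import VideoDB  # import path assumed; check the quickstart

async def run(agent):  # agent: your own object with a handle(event) method
    # Stream screen + mic continuously
    vdb = VideoDB()
    async with vdb.desktop(screen=True, mic=True):
        async for event in vdb.stream():
            # transcripts, screen events, intents - typed
            agent.handle(event)

# Start it with: asyncio.run(run(my_agent))
```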

What comes back is not raw video. It's a stream of typed events: transcripts, screen changes, application focus, recognized intents. The agent subscribes to exactly the signals it cares about and ignores the rest. Frames flow through the process without ever touching disk — unless you decide a moment is worth remembering.
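One way to picture "subscribes to exactly the signals it cares about" is a small dispatcher keyed on event type. This is a sketch under assumptions: the event classes (`Transcript`, `AppFocus`, `ScreenChange`) and the `Dispatcher` are hypothetical stand-ins, not the SDK's real event types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    text: str


@dataclass(frozen=True)
class AppFocus:
    app: str


@dataclass(frozen=True)
class ScreenChange:
    region: str


class Dispatcher:
    """Routes each event to the handlers registered for its exact type."""

    def __init__(self) -> None:
        self._handlers = {}

    def subscribe(self, event_type: type, handler) -> None:
        self._handlers.setdefault(event_type, []).append(handler)

    def handle(self, event) -> bool:
        """Call every subscriber for this event; return False if nobody subscribed."""
        handlers = self._handlers.get(type(event), [])
        for h in handlers:
            h(event)
        return bool(handlers)


heard = []
agent = Dispatcher()
agent.subscribe(Transcript, lambda e: heard.append(e.text))  # only care about speech

for event in [AppFocus("Terminal"), Transcript("ship it Friday"), ScreenChange("editor")]:
    agent.handle(event)
```

Events with no subscriber are simply dropped, so an agent that only cares about speech never pays for screen changes.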

### Privacy, on by default

Every desktop deployment is ephemeral by default. Frames are processed in memory and discarded. Persistence is opt-in, per-stream, and can be locked to your own cloud. SOC 2 and HIPAA-ready out of the box.
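The ephemeral-by-default pattern can be sketched with a bounded in-memory buffer and a per-stream opt-in flag. This illustrates the idea only, not VideoDB's implementation: `EphemeralStream`, `push`, `remember`, and the `saved` list (a stand-in for durable storage) are hypothetical.

```python
from collections import deque


class EphemeralStream:
    """Keeps only the most recent frames in memory; persists nothing unless opted in."""

    def __init__(self, max_frames: int, persist: bool = False) -> None:
        self._recent = deque(maxlen=max_frames)  # older frames fall off automatically
        self._persist = persist                  # opt-in, set per stream
        self.saved = []                          # stand-in for durable storage

    def push(self, frame: bytes) -> None:
        self._recent.append(frame)

    def remember(self) -> int:
        """Copy the frames currently in memory to storage, if this stream opted in."""
        if not self._persist:
            return 0
        self.saved.extend(self._recent)
        return len(self._recent)


screen = EphemeralStream(max_frames=3)                 # default: nothing is kept
meeting = EphemeralStream(max_frames=3, persist=True)  # explicit opt-in

for i in range(5):
    frame = f"frame-{i}".encode()
    screen.push(frame)
    meeting.push(frame)

kept_default = screen.remember()
kept_opt_in = meeting.remember()
```

The default stream saves nothing even when asked to remember, while the opted-in stream keeps only the three frames still in its buffer.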

---

## The category that didn't exist three years ago

Desktop agents will be the most consequential product category of the next five years. Not because they're a flashier chatbot, but because they finally close the loop between what software *sees* a user doing and what it can *help* them do. That loop has been open since the GUI was invented. VideoDB is the cheapest, fastest way to close it.

The builders shipping on this stack today are building the next Slack, the next Notion, the next Figma. Not because they have better models — everyone has the same models. They have something nobody else has: a backend that actually understands what's happening on the screen.

---

## Build a desktop agent

One SDK. Mac, Windows, Linux. Stream screen and mic in three lines.

CTAs: [Open the quickstart](/developers#desktop-quickstart) · [View pair-programmer on GitHub](https://github.com/video-db/pair-programmer)
