Computer use

When the agent interacts with the GUI, it does so through the accessibility tree, not screenshots. This is one of the biggest differences between Kiki and screenshot-driven "computer use" agents.

The accessibility tree, not pixels

The compositor keeps the accessibility tree in memory and publishes it over a typed channel (FlatBuffers on a Unix socket) every time it changes. agentd subscribes and keeps a local copy to feed into the model when needed.
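To make the publish/subscribe relationship concrete, here is a minimal in-process sketch. It makes several assumptions: the real channel is FlatBuffers over a Unix socket, which this does not reproduce; the `Node` fields, the `TreeMirror` and `Publisher` names, and the idea of sending a full versioned snapshot on every change are all hypothetical stand-ins, not Kiki's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """One accessibility node. Field names are illustrative, not Kiki's schema."""
    id: int
    role: str
    name: str = ""
    value: Optional[str] = None
    children: List[int] = field(default_factory=list)


class TreeMirror:
    """Subscriber side: holds the most recently published tree, keyed by node id."""

    def __init__(self) -> None:
        self.nodes: Dict[int, Node] = {}
        self.version = 0

    def on_snapshot(self, version: int, nodes: List[Node]) -> bool:
        """Replace the local copy; ignore stale or duplicate versions."""
        if version <= self.version:
            return False
        self.nodes = {n.id: n for n in nodes}
        self.version = version
        return True


class Publisher:
    """Stand-in for the publishing side: notifies every subscriber on each change."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[int, List[Node]], bool]] = []
        self._version = 0

    def subscribe(self, callback: Callable[[int, List[Node]], bool]) -> None:
        self._subscribers.append(callback)

    def publish(self, nodes: List[Node]) -> None:
        self._version += 1
        for callback in self._subscribers:
            callback(self._version, nodes)


publisher = Publisher()
mirror = TreeMirror()
publisher.subscribe(mirror.on_snapshot)

# Two changes to the same (hypothetical) settings window.
publisher.publish([Node(1, "window", "Settings", children=[2]),
                   Node(2, "slider", "Volume", value="40")])
publisher.publish([Node(1, "window", "Settings", children=[2]),
                   Node(2, "slider", "Volume", value="55")])
```

Because the subscriber always holds the latest tree, a model prompt can be built from `mirror.nodes` at any moment without asking the publisher for a fresh capture.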

So the agent:

Knows the exact state of every element — which button is pressed, a slider's value, an input's text.
Acts precisely — it can click, type, and navigate without inferring from pixels.
Is fast — no screen capture, compression, or vision in the loop.
Is predictable — it knows what an action will do before doing it.
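The properties above can be illustrated with a small sketch of reading exact state and addressing an action to an element rather than to screen coordinates. Everything named here (`Element`, `Action`, `find`, `set_slider`, and the `"set_value"` action kind) is a hypothetical illustration, not the Kiki API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Element:
    """A simplified accessibility element (illustrative fields only)."""
    id: int
    role: str
    name: str
    value: Optional[str] = None
    pressed: bool = False


@dataclass(frozen=True)
class Action:
    """An action addressed to an element id, not to pixel coordinates."""
    kind: str
    target_id: int
    argument: Optional[str] = None


def find(tree: Dict[int, Element], role: str, name: str) -> Optional[Element]:
    """Return the first element matching role and name, or None."""
    for element in tree.values():
        if element.role == role and element.name == name:
            return element
    return None


def set_slider(tree: Dict[int, Element], name: str, new_value: str) -> Optional[Action]:
    """Plan a slider change; return None if the slider is missing or already set."""
    slider = find(tree, "slider", name)
    if slider is None or slider.value == new_value:
        return None
    return Action("set_value", slider.id, new_value)


tree = {
    1: Element(1, "button", "Mute", pressed=True),
    2: Element(2, "slider", "Volume", value="40"),
}

action = set_slider(tree, "Volume", "55")   # targets element 2 directly
noop = set_slider(tree, "Volume", "40")     # already at 40, so nothing to do
```

Since the current value is known exactly, the agent can skip redundant actions and state in advance which element an action will affect.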

Graceful fallback

Not every app is Kiki-native, so the agent degrades gracefully through lower tiers of access.

The agent always prefers the highest tier available. Screenshots are a genuine last resort, only for apps that expose no contract.
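The "prefer the highest tier available" rule amounts to walking an ordered preference list and taking the first tier the app offers. The sketch below assumes hypothetical tier names; only the two endpoints (a native contract at the top, screenshots at the bottom) come from the text, and any intermediate tiers are deliberately left out rather than guessed.

```python
from typing import Iterable, Sequence

# Ordered from most to least preferred. Names are hypothetical placeholders;
# intermediate tiers, if any, are not listed here.
PREFERENCE: Sequence[str] = ("native_contract", "screenshot")


def pick_tier(available: Iterable[str],
              preference: Sequence[str] = PREFERENCE) -> str:
    """Return the most preferred tier that the app actually offers."""
    offered = set(available)
    for tier in preference:
        if tier in offered:
            return tier
    raise LookupError("no supported interaction tier available")


native_app = pick_tier(["screenshot", "native_contract"])
legacy_app = pick_tier(["screenshot"])
```

Keeping the ordering in data rather than in branching logic means a new tier can be slotted into the list without touching the selection code.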

Why it matters for apps

If you build a Kiki app with the SDK, your app's state and UI are exposed natively through this contract — so the agent operates your app with full fidelity and zero guesswork. You get reliable agent control for free, without designing around a screenshot model. See Your first app.
