Computer use
When the agent interacts with the GUI, it does so through the accessibility tree, not screenshots. This is one of the biggest differences between Kiki and screenshot-driven "computer use" agents.
The accessibility tree, not pixels
The compositor keeps the accessibility tree in memory and publishes it over a typed channel (FlatBuffers on a Unix socket) every time it changes. agentd subscribes and keeps a local copy to feed into the model when needed.
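The subscribe-and-mirror flow described above can be sketched in a few lines. This is a minimal illustration, not Kiki's actual wire protocol: the real channel is FlatBuffers over a Unix socket, whereas here an in-process callback stands in for the socket, and the names `Node`, `TreeMirror`, `FakeCompositor`, and the per-snapshot `version` counter are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """One accessibility node. Field names are illustrative, not Kiki's schema."""
    node_id: int
    role: str
    name: str = ""
    value: Optional[str] = None
    children: List[int] = field(default_factory=list)


class TreeMirror:
    """Local copy of the tree, replaced wholesale on every published snapshot."""

    def __init__(self) -> None:
        self.version = 0
        self.nodes: Dict[int, Node] = {}

    def on_snapshot(self, version: int, nodes: List[Node]) -> bool:
        # Ignore stale or duplicate snapshots so the mirror never moves backwards.
        if version <= self.version:
            return False
        self.nodes = {n.node_id: n for n in nodes}
        self.version = version
        return True


class FakeCompositor:
    """Stand-in publisher: notifies every subscriber each time the tree changes."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[int, List[Node]], bool]] = []
        self._version = 0

    def subscribe(self, callback: Callable[[int, List[Node]], bool]) -> None:
        self._subscribers.append(callback)

    def publish(self, nodes: List[Node]) -> None:
        self._version += 1
        for callback in self._subscribers:
            callback(self._version, nodes)


compositor = FakeCompositor()
mirror = TreeMirror()
compositor.subscribe(mirror.on_snapshot)
compositor.publish([
    Node(1, "window", "Settings", children=[2]),
    Node(2, "slider", "Volume", value="40"),
])
```

After the `publish` call, `mirror.nodes[2].value` holds the slider's current value, so the consumer can read state from its own copy instead of asking the compositor each time.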
So the agent:
- Knows the exact state of every element — which button is pressed, a slider's value, an input's text.
- Acts precisely — it can click, type, and navigate without inferring from pixels.
- Is fast — no screen capture, compression, or vision in the loop.
- Is predictable — it knows what an action will do before doing it.
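To make the points above concrete, here is a small sketch of what reading exact state and acting on a node (rather than on a screen coordinate) could look like. The `Element` and `Action` types, the role strings, and the helper names are hypothetical and chosen only for illustration; they are not taken from the Kiki SDK.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Element:
    """Hypothetical view of one tree node with the state the agent can read."""
    node_id: int
    role: str
    name: str
    value: Optional[str] = None
    pressed: bool = False


@dataclass(frozen=True)
class Action:
    """An action addressed to a node id, not to an (x, y) pixel position."""
    kind: str
    node_id: int
    argument: Optional[str] = None


def find(tree: Dict[int, Element], role: str, name: str) -> Optional[Element]:
    """Return the first element matching role and name, or None."""
    for element in tree.values():
        if element.role == role and element.name == name:
            return element
    return None


def ensure_pressed(tree: Dict[int, Element], name: str) -> Optional[Action]:
    """Return a click only if the toggle is not already pressed."""
    button = find(tree, "toggle_button", name)
    if button is None:
        raise LookupError(f"no toggle button named {name!r}")
    # The pressed state is known up front, so a redundant click is never issued.
    return None if button.pressed else Action("click", button.node_id)
```

Because the pressed state is read directly from the tree, the function can decide whether any action is needed before touching the UI, which is the predictability the list describes.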
Graceful fallback
Not every app is Kiki-native, so the agent degrades gracefully through a series of fallback tiers.
The agent always prefers the highest tier available. Screenshots are a genuine last resort, only for apps that expose no contract.
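The "prefer the highest tier available" rule can be sketched as a simple maximum over an ordered enum. Only the two endpoints named in the text are modelled here (a native contract at the top, screenshots at the bottom); the `Tier` enum, its numeric values, and `pick_tier` are assumptions for illustration, and any intermediate tiers would slot in between with their own values.

```python
from enum import IntEnum
from typing import Iterable


class Tier(IntEnum):
    """Hypothetical ordering: a higher value means higher fidelity."""
    SCREENSHOT = 0          # last resort, for apps that expose no contract
    NATIVE_CONTRACT = 100   # Kiki-native app exposing state and UI directly


def pick_tier(available: Iterable[Tier]) -> Tier:
    """Choose the highest tier an app supports, falling back to screenshots."""
    return max(available, default=Tier.SCREENSHOT)
```

Calling `pick_tier([Tier.SCREENSHOT, Tier.NATIVE_CONTRACT])` selects the native contract, while an app that advertises nothing at all falls through to `Tier.SCREENSHOT`.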
Why it matters for apps
If you build a Kiki app with the SDK, your app's state and UI are exposed natively through this contract — so the agent operates your app with full fidelity and zero guesswork. You get reliable agent control for free, without designing around a screenshot model. See Your first app.