If you’re poking around with OpenAI Operator on Apple Silicon (or just want to build AI agents that can actually use a computer like a human), this is for you. I’ve written a guide to walk you through getting started with cua-agent, show you how to pick the right model/loop for your use case, and share some code patterns that’ll get you up and running fast.
Here is the full guide: https://www.trycua.com/blog/build-your-own-operator-on-macos-2
What is cua-agent, really?
Think of cua-agent as the toolkit that lets you skip the gnarly boilerplate of screenshotting, sending context to an LLM, parsing its output, and safely running actions in a VM. It gives you a clean Python API for building “Computer-Use Agents” (CUAs) that can click, type, and see what’s on the screen. You can swap between OpenAI, Anthropic, UI-TARS, or local open-source models (Ollama, LM Studio, vLLM, etc.) with almost zero code changes.
Setup: Get Rolling in 5 Minutes
Prereqs:
- Python 3.10+ (Conda or venv is fine)
- macOS CUA image already set up (see Part 1 if you haven't)
- API keys for OpenAI/Anthropic (optional if you're using local models)
- Ollama installed if you want to run local models
Install everything:
```bash
pip install "cua-agent[all]"
```
Or cherry-pick what you need:
```bash
pip install "cua-agent[openai]"     # OpenAI
pip install "cua-agent[anthropic]"  # Anthropic
pip install "cua-agent[uitars]"     # UI-TARS
pip install "cua-agent[omni]"       # Local VLMs
pip install "cua-agent[ui]"         # Gradio UI
```
Set up your Python environment:
```bash
conda create -n cua-agent python=3.10
conda activate cua-agent
# or
python -m venv cua-env
source cua-env/bin/activate
```
Export your API keys:
```bash
export OPENAI_API_KEY=sk-…
export ANTHROPIC_API_KEY=sk-ant-…
```
Agent Loops: Which Should You Use?
Here’s the quick-and-dirty rundown:
| Loop | Models it Runs | When to Use It |
|---|---|---|
| OPENAI | OpenAI CUA Preview | Browser tasks, best web automation, Tier 3 only |
| ANTHROPIC | Claude 3.5/3.7 | Reasoning-heavy, multi-step, robust workflows |
| UITARS | UI-TARS-1.5 (ByteDance) | OS/desktop automation, low latency, local |
| OMNI | Any VLM (Ollama, etc.) | Local, open-source, privacy/cost-sensitive |
TL;DR:
- Use OPENAI for browser stuff if you have access.
- Use UITARS for desktop/OS automation.
- Use OMNI if you want to run everything locally or avoid API costs.
Your First Agent in ~15 Lines
```python
import asyncio

from computer import Computer
from agent import ComputerAgent, LLMProvider, LLM, AgentLoop

async def main():
    async with Computer() as macos:
        agent = ComputerAgent(
            computer=macos,
            loop=AgentLoop.OPENAI,
            model=LLM(provider=LLMProvider.OPENAI)
        )

        task = "Open Safari and search for 'Python tutorials'"
        async for result in agent.run(task):
            print(result.get('text'))

if __name__ == "__main__":
    asyncio.run(main())
```
Just drop that in a file and run it. The agent will spin up a VM, open Safari, and run your task. No need to handle screenshots, parsing, or retries yourself.
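Swapping providers really is just a constructor change. Here's a minimal sketch of the same setup on the ANTHROPIC loop — note that LLMProvider.ANTHROPIC and the model name string are my assumptions based on the pattern above, so check the cua-agent docs for the exact identifiers:

```python
# Sketch: same agent, ANTHROPIC loop instead of OPENAI.
# Reuses the imports and the `macos` Computer from the example above.
# LLMProvider.ANTHROPIC and the model name are assumed — verify against the docs.
agent = ComputerAgent(
    computer=macos,
    loop=AgentLoop.ANTHROPIC,
    model=LLM(
        provider=LLMProvider.ANTHROPIC,
        name="claude-3-7-sonnet-20250219"  # assumed model identifier
    )
)
```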
Chaining Tasks: Multi-Step Workflows
You can feed the agent a list of tasks, and it’ll keep context between them:
```python
tasks = [
    "Open Safari and go to github.com",
    "Search for 'trycua/cua'",
    "Open the repository page",
    "Click on the 'Issues' tab",
    "Read the first open issue"
]

for i, task in enumerate(tasks):
    print(f"\nTask {i+1}/{len(tasks)}: {task}")
    async for result in agent.run(task):
        print(f"  → {result.get('text')}")
    print(f"✅ Task {i+1} done")
```
Great for automating actual workflows, not just single clicks.
Local Models: Save Money, Run Everything On-Device
Want to avoid OpenAI/Anthropic API costs? You can run agents with open-source models locally using Ollama, LM Studio, vLLM, etc.
Example:
```bash
ollama pull gemma3:4b-it-q4_K_M
```

```python
agent = ComputerAgent(
    computer=macos_computer,
    loop=AgentLoop.OMNI,
    model=LLM(
        provider=LLMProvider.OLLAMA,
        name="gemma3:4b-it-q4_K_M"
    )
)
```
You can also point to any OpenAI-compatible endpoint (LM Studio, vLLM, LocalAI, etc.).
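As a rough sketch, pointing the OMNI loop at a local OpenAI-compatible server (LM Studio here) might look like the following — the OAICOMPAT provider is mentioned in the tips below, but the provider_base_url parameter name, the model name, and the port are assumptions on my part:

```python
# Sketch: OMNI loop against a local OpenAI-compatible server (e.g. LM Studio).
# provider_base_url, the model name, and the endpoint are assumptions —
# check your server and the cua-agent docs for the real arguments.
agent = ComputerAgent(
    computer=macos_computer,
    loop=AgentLoop.OMNI,
    model=LLM(
        provider=LLMProvider.OAICOMPAT,
        name="qwen2.5-vl-7b-instruct",                # hypothetical model name
        provider_base_url="http://localhost:1234/v1"  # assumed LM Studio endpoint
    )
)
```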
Debugging & Structured Responses
Every action from the agent gives you a rich, structured response:
- Action text
- Token usage
- Reasoning trace
- Computer action details (type, coordinates, text, etc.)
This makes debugging and logging a breeze. Just print the result dict or log it to a file for later inspection.
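A quick sketch of what that logging could look like — only the 'text' key is confirmed by the examples above, so the JSONL approach and the helper name are just my own convention:

```python
import json

# Sketch: append every structured result to a JSONL file for later inspection.
# Only result.get("text") appears in the examples above; the rest of the dict
# (token usage, reasoning trace, action details) is logged wholesale.
async def run_and_log(agent, task, path="agent_trace.jsonl"):
    with open(path, "a") as f:
        async for result in agent.run(task):
            print(result.get("text"))
            f.write(json.dumps(result, default=str) + "\n")  # keep the full payload
```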
Visual UI (Optional): Gradio
If you want a UI for demos or quick testing:
```python
from agent.ui.gradio.app import create_gradio_ui

if __name__ == "__main__":
    app = create_gradio_ui()
    app.launch(share=False)  # Local only
```
Supports model/loop selection, task input, live screenshots, and action history.
Set share=True for a public link (with optional password).
Tips & Gotchas
- You can swap loops/models with almost no code changes.
- Local models are great for dev, testing, or privacy.
- .gradio_settings.json saves your UI config; add it to .gitignore.
- For UI-TARS, deploy it locally or on Hugging Face and use the OAICOMPAT provider (see the sketch after this list).
- Check the structured response for debugging, not just the action text.
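For that last UI-TARS point, here's a minimal sketch of what wiring it up might look like — the endpoint URL is a placeholder and provider_base_url is an assumed parameter name, so check the deployment guide for the exact arguments:

```python
# Sketch: UITARS loop talking to your own UI-TARS deployment via the OAICOMPAT provider.
# The URL is a placeholder and provider_base_url is an assumed parameter name.
agent = ComputerAgent(
    computer=macos_computer,
    loop=AgentLoop.UITARS,
    model=LLM(
        provider=LLMProvider.OAICOMPAT,
        name="tgi",                                           # hypothetical model name on the endpoint
        provider_base_url="https://your-uitars-endpoint/v1"   # placeholder
    )
)
```

That's really it — the same ComputerAgent pattern throughout, just different loops and providers.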