What changed in 2025
- Agents can finally do things in messy environments. OpenAI's Computer-Using Agent set new SOTA on OSWorld/Web tasks-evidence that UI-level action is viable beyond brittle RPA.
- Enterprise plumbing matured. The Model Context Protocol (MCP) is becoming a lingua franca to plug agents into data/tools. Microsoft added MCP support in Copilot Studio to extend agents with enterprise sources.
- Clouds ship agent runtimes. Azure AI Foundry's Agent Service focuses on observability, safety, and identity-letting teams deploy multi-tool agents with guardrails.
- Regulation clarified the guardrails. EU AI Act timelines set obligations for GPAI and risk classes-so governance is moving from slides to checklists.
A simple blueprint
- Goal & constraints: express the business intent plus limits (e.g., "never email external domains," "HITL for payments").
- Planner: break the goal into steps and decide when to browse, call APIs, or operate the UI (e.g., Gemini/OpenAI computer-use).
- Tools via MCP: standardize connections to CRMs, warehouses, and ticketing so agents discover and invoke capabilities consistently.
- Memory & context: store working notes, partial results, retrieved facts (vector DB, CRM, warehouse).
- Observability & guardrails: trace every step; evaluate quality/cost/safety; enforce policies (least-privilege tools, rate limits, approvals).
Your 30-day pilot plan
Pick a low-risk, high-annoyance workflow with clear success metrics.
- Week 1 - Select the moment: choose a process with UI clicks + copy/paste; define outcomes and HITL points.
- Week 2 - Wire the basics: expose one golden-source dataset via MCP; enable tracing from day one.
- Week 3 - Ship a narrow agent: implement computer-use for one browser-only step; log actions/outcomes; measure time saved and error rate.
- Week 4 - Prove & expand: A/B the workflow; add a second tool; extend evaluations including adversarial cases.
Safety & governance
- Prompt-injection is #1 in OWASP's Top-10 for LLM apps-treat every page/document as untrusted instructions and isolate content.
- Least privilege every tool; fail closed on ambiguity; require approvals for sensitive actions (payments, credentials, external comms).
- Map to frameworks: NIST AI RMF / Generative AI Profile + EU AI Act timelines keep controls auditable and future-proof.
FAQs
How is this different from classic RPA?
Agents operate across unfamiliar UIs with reasoning + perception and can decide among tools (APIs vs. UI), whereas RPA scripts often break outside narrow templates.
Do we need a data lake first?
No-start with one trusted system and expose it via MCP. Expand once you've proven value and have observability in place.
How do we evaluate agents?
Use execution traces, regression suites, and task success rates. Convert real traces into test datasets; add adversarial cases (prompt-injection, misleading UI states).
Authentic resources & references
Hand-picked sources underpinning claims in this article.
