Names and a few details are anonymized. The numbers are real.
## The setup
A direct-to-consumer brand we’ll call Northwind sells home goods at \$50–\$300 price points across a Shopify storefront, Amazon, and a wholesale channel. Forty thousand orders a month, growing.
Their support org: 14 people across two time zones, handling roughly 3,200 tickets per week in the busy season. Median first-response time was 6h 40m. Agent burnout was real. NPS was good but slipping. The director of CX had budget approved to hire 4 more reps, but the talent pipeline was dry and onboarding was eating her own time.
She didn’t want a chatbot. She’d been burned by chatbots. What she wanted was a system that could actually resolve tickets end-to-end — not just route them to humans with a slightly nicer summary.
## The triage: where the volume actually came from
Before writing a single line of code, we spent two weeks reading tickets. Real tickets, in their actual messy form. We tagged 1,200 of them by intent and resolution path.
The shape of the volume:
| Bucket | % of tickets | Median resolution time |
|---|---|---|
| Order status / WISMO ("where is my order") | 31% | 4 min |
| Address change requests | 12% | 7 min |
| Returns initiation | 14% | 9 min |
| Refund status | 9% | 5 min |
| Product questions (pre-purchase) | 11% | 12 min |
| Damaged / wrong item | 8% | 25 min |
| Discount requests | 6% | 8 min |
| Everything else (true edge cases) | 9% | 40+ min |
Ninety-one percent of tickets fell into seven repeatable patterns. In the typical case, none of them required real judgement. All of them required tool access — checking ShipStation, mutating Shopify orders, querying Klaviyo, sometimes hitting the 3PL's API.
This is exactly the shape of work that modern agentic systems do well — and exactly where chatbots historically fail. Chatbots can answer; agents can act.
## The architecture
We built a single agent with a constrained tool set, deployed behind their existing Gorgias front end. Seven tools, no more:
- `lookup_order` — pulls order, tracking, and customer history
- `update_shipping_address` — gated by carrier rules
- `initiate_return` — opens a return record and emails the label
- `issue_refund` — bounded by amount and reason code
- `apply_discount` — single-use, bounded percentage
- `search_kb` — semantic search over their help docs
- `escalate_to_human` — with structured handoff context
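For concreteness, here is roughly what that tool surface looks like when you wire it up. This is a minimal sketch in the JSON-schema style most LLM tool-calling APIs accept; the tool names match the list above, but the parameter shapes and reason codes are my illustration, not Northwind's actual schemas.

```python
# Illustrative tool definitions in the JSON-schema style most LLM
# tool-calling APIs accept. Parameter shapes and enum values are
# assumptions for this sketch, not Northwind's production schemas.
TOOLS = [
    {
        "name": "lookup_order",
        "description": "Fetch an order with tracking events and customer history.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "email": {"type": "string", "description": "fallback lookup key"},
            },
            "required": ["order_id"],
        },
    },
    {
        "name": "issue_refund",
        "description": "Refund an order. Hard-capped server-side, never by prompt.",
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "amount_usd": {"type": "number", "maximum": 200},
                "reason_code": {
                    "type": "string",
                    "enum": ["damaged", "wrong_item", "late_delivery", "goodwill"],
                },
            },
            "required": ["order_id", "amount_usd", "reason_code"],
        },
    },
    # update_shipping_address, initiate_return, apply_discount, search_kb,
    # and escalate_to_human follow the same pattern.
]
```

The schema-level `maximum` is defense in depth: the model sees the bound up front, and the server enforces it anyway.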
Every tool call is logged. Every response is reviewed in batches. The agent operates with three explicit confidence tiers:
- High confidence (resolve autonomously): standard cases that match a known pattern. About 65% of incoming tickets.
- Medium confidence (draft a response, route to human for one-click send): cases with one ambiguous detail. About 22%.
- Low confidence (full human handoff with context): anything else. About 13%.
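The routing logic itself is almost boring, which is the point. A sketch of the tiering, with names and thresholds that are illustrative rather than Northwind's tuned values:

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    AUTONOMOUS = "resolve"    # agent acts and replies on its own
    DRAFT = "one_click_send"  # human approves before anything goes out
    HANDOFF = "human"         # full escalation with structured context

@dataclass
class Draft:
    ticket_id: str
    response: str
    confidence: float      # agent's self-reported score in [0, 1]
    matched_pattern: bool  # did the ticket match a known intent pattern?

def route(draft: Draft) -> Tier:
    # Thresholds here are illustrative; real ones have to be tuned
    # against labeled historical tickets.
    if draft.matched_pattern and draft.confidence >= 0.9:
        return Tier.AUTONOMOUS
    if draft.confidence >= 0.6:
        return Tier.DRAFT
    return Tier.HANDOFF
```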
We did not let the agent improvise. It runs on a tight system prompt with explicit do-not-do rules: no commitments about delivery dates, no policy exceptions, no apologies for things outside its scope.
## The guardrails
This is where most agentic deployments fail. We baked in five non-negotiables:
- Refund cap per ticket: \$200, hard-coded. Anything bigger goes to a human, no exceptions.
- Discount cap: 15%, single-use, only on specific approved SKUs.
- No refunds within 90 days of a previous refund without human review.
- Tone matches their existing voice guide, validated by a smaller scoring model on every response.
- Every action is reversible for 48 hours by support leads with a single click.
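Crucially, the caps live in the tool layer, not the prompt. A prompt can be argued with; a function can't. A minimal sketch of the refund gate, using the rules from the list above (function and field names are mine):

```python
from datetime import datetime, timedelta, timezone

REFUND_CAP_USD = 200  # hard-coded, per the rules above
REFUND_COOLDOWN = timedelta(days=90)

class Escalate(Exception):
    """Raised to abort the tool call and route the ticket to a human."""

def guard_refund(amount_usd: float, last_refund_at: datetime | None) -> None:
    # Anything over the cap goes to a human, no exceptions.
    if amount_usd > REFUND_CAP_USD:
        raise Escalate(f"refund ${amount_usd:.2f} exceeds ${REFUND_CAP_USD} cap")
    # No refunds within 90 days of a previous refund without review.
    if last_refund_at is not None:
        if datetime.now(timezone.utc) - last_refund_at < REFUND_COOLDOWN:
            raise Escalate("customer already refunded within the last 90 days")
```

In this sketch, a raised `Escalate` would feed the same structured-handoff path as the low-confidence tier.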
We deployed in shadow mode first — agent generates responses, humans send them — for two weeks. We caught 17 nontrivial issues that way. None of them made it to a real customer.
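Shadow mode is cheap to build if the pipeline is structured for it: run everything end to end, but swap the final send for a write to a review queue. Roughly (the queue and callbacks here are illustrative):

```python
import queue

review_queue: queue.Queue = queue.Queue()
SHADOW_MODE = True  # flipped off at launch

def dispatch(ticket_id: str, draft: str, actions: list[dict],
             execute, send_reply) -> None:
    """Route agent output to production, or park it for human review."""
    if SHADOW_MODE:
        # Nothing reaches the customer and no side effects commit; humans
        # compare the draft and proposed actions against what they'd do.
        review_queue.put({
            "ticket_id": ticket_id,
            "draft": draft,
            "proposed_actions": actions,
        })
        return
    execute(actions)              # refunds, address changes, return labels
    send_reply(ticket_id, draft)  # goes out through the Gorgias front end
```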
## The first six weeks
After going live:
- Tickets autonomously resolved: 71% (vs. 0% baseline)
- Median first-response time: 47 seconds (vs. 6h 40m baseline)
- Median full resolution time: 4m 12s for autonomous tickets, 38m for human-touched
- CSAT on AI-resolved tickets: 4.6/5, vs. 4.4/5 historic team average
- Escalation rate from agent to human: 13%, exactly matching our pre-launch model
- Hallucinated facts caught in production: 0 (this matters; the guardrails earned their keep)
The team didn’t shrink. The ticket volume the team can absorb went up — and the work shifted from repetitive lookups to the gnarly 13% that actually needs human judgement. Three of the reps switched to a hybrid CX-ops role, building new agent tools and reviewing logs.
The director’s comment six weeks in: “We went from drowning to building. I’m not hiring those four reps.”
## What we got wrong
Two surprises worth sharing.
One: the agent was too polite. Northwind’s brand voice is warm but direct. Early agent responses skewed apologetic in a way that felt off-brand and, paradoxically, lowered trust. We re-tuned the system prompt and the tone scoring model, and CSAT recovered within a week. Lesson: brand voice is a first-class engineering concern, not a polish step at the end.
Two: the long tail of the long tail. The 13% “escalate to human” bucket included a sub-bucket of about 1.5% of total tickets that were genuinely novel — situations the agent had never seen. Those needed an entirely different routing path because they were the cases where the agent confidently chose the wrong tool. We added an explicit “novelty detection” check that pushes anything semantically far from prior tickets directly to a human, regardless of the agent’s self-reported confidence.
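The check itself is simple: embed the incoming ticket, measure how close it sits to the historical corpus, and hard-route anything too far away, overriding whatever confidence the agent reports. A sketch with numpy and an assumed cutoff (the real threshold has to be tuned on tickets labeled as novel):

```python
import numpy as np

NOVELTY_CUTOFF = 0.55  # assumed value; tune on labeled "novel" tickets

def is_novel(ticket_emb: np.ndarray, corpus: np.ndarray) -> bool:
    """True if the ticket is semantically far from every prior ticket.

    ticket_emb: L2-normalized embedding of the incoming ticket, shape (dim,)
    corpus:     L2-normalized historical embeddings, shape (n_tickets, dim)
    """
    nearest_sim = float((corpus @ ticket_emb).max())  # cosine similarity
    return nearest_sim < NOVELTY_CUTOFF

def route_with_novelty(draft, ticket_emb, corpus):
    # Novelty overrides the agent's self-reported confidence entirely.
    if is_novel(ticket_emb, corpus):
        return Tier.HANDOFF  # Tier and route from the routing sketch above
    return route(draft)
```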
## The takeaway
Agentic AI doesn’t replace your support team. It dissolves the bottom 70% of the work so the team can do the top 30% well — and absorb growth without breaking. The architecture is genuinely different from the chatbot wave: real tool access, hard guardrails, structured escalation, and a humility about what the model should and shouldn’t decide.
Done right, it pays for itself in weeks and compounds for years. Done wrong, it burns customer trust faster than any cost saving recovers it. The engineering decisions in the middle are everything.