AI-Driven Customer Experience Scoring – Insights, Benefits, and Challenges
Participants & Roles
- Angelique – Moderator; frames discussion around AI-driven measurement, cost-benefit, and future expectations.
- Eric (Intercom Support Manager) – Internal user of Intercom’s CX Score; shares first-hand data, frames conceptual differences, and summarizes take-aways.
- Louis (Early adopter) – Uses CX Score for agent QA; reports quantitative gains and accuracy concerns.
- Sarita – Building an in-house AI scoring system; raises conflict between CSAT and CX Score, asks about refinement.
- Adam, Peter, Keith – Contribute perspectives on expectation shifts, QA approaches, and feature requests.
- Numerous unnamed attendees – Provide background chatter, quick polls (hands raised), and noise that highlights meeting-management issues.
Key Concepts & Definitions
CSAT (Customer Satisfaction Score)
- Traditional post-interaction survey sent to a subset of customers.
- Measures the customer’s sentiment about the last human agent or final reply.
- Typical response rates are low; only a small subset of customers ever reply.
CX Score (Customer Experience Score)
- AI-generated quality score for every conversation.
- Holistic: evaluates the entire journey – chatbot (Finn), routing, human hand-offs, resolution, tone.
- Formerly branded "AI CSAT" inside Intercom; currently in open beta, multi-language support arriving at General Release (GR).
Finn – Intercom’s AI agent that handles first-line queries; its performance is folded into overall CX Score.
Benefits Reported by Early Users
- Coverage jumped overnight from a sampled subset of CSAT responses to every conversation being scored.
- Removed need to purchase separate QA suites such as MaestroQA or ScoreBuddy – direct cost avoidance.
- Managers triage work by filtering only the low-score tickets, cutting manual QA workload dramatically.
- Facilitates agent coaching; “bad” tickets surfaced automatically.
Observed Limitations & Pain Points
Accuracy Gaps
- AI sometimes flags "multiple unanswered inquiries" when just one question existed.
- Blanket low scores when a customer can't get a feature the product doesn't support (e.g. order cancellation), even if the tone is polite.
- Multi-point deltas between CSAT and CX Score for the same agent set.
Lack of Training Loop
- No "thumbs-up / thumbs-down" or direct feedback button to retrain the model on org-specific workflows.
Explainability
- Managers want to understand why a score was assigned; need root-cause tags.
Customization Needs
- Desire to tweak prompts so “good experience” reflects company context (e.g. inability to cancel orders shouldn’t always equal “bad support”).
CSAT vs CX Score – Why They Diverge
- Confidently Incorrect Agent scenario:
- Human provides wrong solution ➔ Customer unaware, gives 5★ CSAT ➔ AI flags poor CX because underlying issue persists.
- Consumer Bias
- Positive bias toward humans, negative bias toward AI.
- Scope Difference
- CSAT = last reply; CX Score = entire thread + bots + routing.
Impact on Operations & Workforce Strategy
- AI now clears "easy" tickets, so human agents handle more complex cases.
- Raises quality bar; customers expect complete fixes.
- Some teams see fewer 5★ scores but more 4★ – expectation shift noted.
- Possibility to repurpose agents toward:
- Conversation design / AI training.
- Proactive outreach & technical sales assistance.
- Persistent balancing act: quality ⇄ quantity ⇄ budget.
Measurement Blind Spots & Mitigation Techniques (for non-CX-Score teams)
- Random QA audits on a sample of tickets.
- DSAT deep-dives & calibrations to discover systemic failure patterns.
- Tag CSAT responses by cause (Agent / AI / Product) – newly added Intercom option.
- In-house NLP or LLM scripts to derive Customer Effort Score or sentiment if a vendor tool is absent (see the sketch below).
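For teams going the in-house route, here is a minimal sketch of such an LLM scoring script. It assumes the OpenAI Python SDK with an OPENAI_API_KEY in the environment; the model name, prompt wording, and thresholds are illustrative assumptions, not a prescribed implementation.

```python
# Minimal sketch: derive sentiment and Customer Effort Score for a support
# transcript with an LLM. Assumes the OpenAI Python SDK (pip install openai)
# and OPENAI_API_KEY set; model name and prompt are illustrative only.
import json
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

PROMPT = (
    "You are a support QA reviewer. Read the conversation below and return "
    "JSON with two fields: sentiment (a float from -1.0 to 1.0) and effort "
    "(an integer from 1 = effortless to 5 = high customer effort).\n\n{transcript}"
)

def score_conversation(transcript: str) -> dict:
    """Return {'sentiment': float, 'effort': int} for one transcript."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": PROMPT.format(transcript=transcript)}],
    )
    return json.loads(resp.choices[0].message.content)

if __name__ == "__main__":
    scores = score_conversation(
        "Customer: I still can't cancel my order!\nAgent: Cancellation isn't "
        "supported yet, but I can request a refund for you."
    )
    # Flag high-effort or negative conversations for manual QA review
    if scores["sentiment"] < 0 or scores["effort"] >= 4:
        print("Queue for QA deep-dive:", scores)
```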
Feature Requests & Road-Map Hints
- Prompt-level custom tuning ("tell the model what our good looks like").
- Self-service feedback buttons for live retraining.
- Categorical root-cause labels (Process, Tone, Product Gap, Knowledge, etc.).
- Bulk analytics: export CX Score with metadata for BI dashboards.
- Multi-language analysis due in General Release.
Technological Context & Metaphors
- Eric: “We’re in the Nokia 3310 era of AI.”
- Early but functional; expect rapid capability curve.
- Intercom ethos: “Drink our own champagne” (use the product internally and funnel feedback straight to Product/Eng).
Numerical & Statistical References
- Partial CSAT engagement ➔ 100% CX Score coverage (every conversation scored).
- Multi-point gap observed between CSAT and CX Score in some orgs.
- CSAT response rates reach only a fraction of customers.
- Random QA sampling typically covers only a small share of tickets.
Ethical / Philosophical Implications
- Over-reliance on imperfect AI may misclassify good work; need human oversight.
- Must balance cost of exhaustive measurement against risk of missing systemic failures.
- Transparency & explainability critical for agent trust and fair performance reviews.
Practical Recommendations & Action Items
- Treat CX Score as one data point; triangulate with CSAT, QA audits, operational metrics.
- Use low-score filters to prioritize coaching sessions and speed QA (see the triage sketch after this list).
- Document recurring false negatives/positives; pass examples to vendor for model tuning.
- When deploying Finn or similar, communicate its limitations to customers to manage expectation gap.
- Begin building internal taxonomy (Process, Policy, Tone…) so future CX-Score categories align with org needs.
- If still on CSAT-only:
- Increase sampling or auto-trigger QA for DSATs.
- Invest in a lightweight LLM sentiment script as an interim step (see the sketch under Measurement Blind Spots).
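To make the low-score triage concrete, a small filtering sketch over an exported score file. The file name and column names (cx_score, conversation_id, agent) are placeholders for whatever your actual export contains, and the 1-5 scale is an assumption.

```python
# Triage sketch: surface only the lowest-scoring conversations so managers
# review a fraction of tickets instead of the full queue. Column names and
# the CSV file are hypothetical placeholders for your own export.
import pandas as pd

THRESHOLD = 3  # assumption: scores run 1 (worst) to 5 (best); tune to your scale

df = pd.read_csv("conversation_scores.csv")
low = df[df["cx_score"] <= THRESHOLD].sort_values("cx_score")

# Build a per-agent coaching queue, worst conversations first
for agent, tickets in low.groupby("agent"):
    print(f"{agent}: {len(tickets)} low-score tickets to review")
    print(tickets[["conversation_id", "cx_score"]].head(5).to_string(index=False))
```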
Connections to Broader Principles
- Mirrors evolution from sample-based QC (CSAT) to population-level monitoring (AI analytics) seen in manufacturing and DevOps (log aggregation ➔ anomaly detection).
- Highlights shift from output metrics (ticket count) to outcome metrics (experience, resolution accuracy).
Closing & Networking
- Eric invites attendees to connect on LinkedIn for deeper, 1-to-1 tactical conversations around AI implementation, QA process design, and change management.
- Session ends with reminder to return to main webinar room for closing remarks.