A complete Claude Code guide to A/B/C message testing and iterative refinement. Seven-question study design, two-round iteration workflow, cross-segment testing, language harvest extraction, and six deliverables from a single study. From three untested variants to validated messaging in 70 minutes.
Positioning is internal and strategic. Messaging is external and is what actually touches customers. Yet most product marketing teams validate positioning rigorously and test messaging by gut feel.
Traditional message testing options and their limitations:
| Method | Time Per Round | Cost | Limitation |
|---|---|---|---|
| Customer interviews | 2-4 weeks | $5-15K | Slow to arrange, small sample, biased by relationship |
| Wynter (B2B panels) | 24-48 hours | $300-600/test | Days per round, costly to iterate |
| Qualtrics surveys | 1-2 weeks | $2-10K | Quantitative, misses qualitative depth |
| Live A/B testing | 1-4 weeks | Opportunity cost | Tests after launch with real traffic, no qualitative insight |
| Ditto + Claude Code | 30 min per round | API usage only | Synthetic (validated at 95% correlation with real research) |
Before running a message test, you need variants worth testing. The same positioning can produce radically different messaging depending on which framing you lead with. The three canonical approaches:
| Framing | Opens With | Works When | Fails When | Example |
|---|---|---|---|---|
| Problem-led | The pain the customer feels | Problem is universal and emotionally resonant | Audience doesn't recognise the problem or it feels abstract | "Tired of spending $50K on research that takes three months?" |
| Outcome-led | The result the customer achieves | Outcome is specific, measurable, and desirable | Claim sounds too good to be true without proof | "Get validated customer insights in 30 minutes." |
| Capability-led | What the product does | The capability itself is the differentiator and the audience is sophisticated | Audience cares about outcomes, not mechanisms | "AI-powered synthetic research with 300,000 personas across 15 countries." |
Claude Code should produce three variants of 1-3 sentences each. Keep them similar in length and level of detail so the test measures framing preference, not information asymmetry.
# Example: Three variants for a project management SaaS
MESSAGE_A = """Tired of status meetings that could have been a dashboard?
Your team wastes 5 hours a week on alignment that should be automatic.
FlowBoard replaces the meeting with a living project view."""

MESSAGE_B = """Ship projects 40% faster with zero status meetings.
FlowBoard gives every stakeholder real-time visibility into progress,
blockers, and deadlines without a single sync call."""

MESSAGE_C = """AI-powered project tracking that learns your team's workflow.
FlowBoard auto-generates status updates, predicts delays before they happen,
and routes blockers to the right person instantly."""
Each question targets a specific dimension of message performance. Together, they produce the data needed for all six deliverables.
| Q# | Question Template | What It Measures | Maps To Deliverable |
|---|---|---|---|
| Q1 | "Read this message: '[Message A]'. In your own words, what is this company offering? Who is it for? Would you want to learn more?" | Comprehension, relevance, intent | Clarity Scorecard |
| Q2 | "Now read this: '[Message B]'. How does this compare to the first? Which feels more relevant to your situation?" | Comparative preference, framing impact | Performance Ranking |
| Q3 | "One more: '[Message C]'. Of the three, which would make you most likely to click, sign up, or reach out? Why?" | Action intent, decision drivers | Performance Ranking, Audience-Message Fit |
| Q4 | "What is unclear or confusing about any of these messages? What questions do they leave unanswered?" | Clarity gaps, information needs | Clarity Scorecard, Clarity Checklist |
| Q5 | "If you saw the winning message on a website, what would you expect to find when you clicked through?" | Message-to-experience alignment | Clarity Checklist |
| Q6 | "What one word or phrase from these messages stuck with you most? What fell completely flat?" | Language resonance, memorability | Language Harvest |
| Q7 | "Thinking about your actual work or life, which of these problems feels most urgent to you right now? Why?" | Problem urgency, messaging-market fit | Performance Ranking, Audience-Message Fit |
Replace the bracketed placeholders with your actual messaging variants. The surrounding question text should remain unchanged to maintain the measurement structure. For example:
# Q1: Replace [Message A] with your problem-led variant
question_1 = f"""Read this message: '{MESSAGE_A}'
In your own words, what is this company offering? Who is it for?
Would you want to learn more?"""
Create a group matching your target buyer profile. Use demographic filters to ensure relevance.
curl -s -X POST "https://app.askditto.io/v1/research-groups/recruit" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Message Test: PM SaaS Target Buyers",
"group_size": 10,
"filters": {
"country": "US",
"age_min": 28,
"age_max": 50,
"employment_status": "Employed"
}
}'
Response includes group_id. Save it for Step 2.
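If you are scripting the workflow rather than running curl by hand, a minimal Python sketch can capture the ID directly. It reuses the recruit endpoint and filters shown above; the response shape beyond the group_id field is an assumption.

import requests

API_KEY = "YOUR_API_KEY"  # same bearer token as the curl examples
BASE = "https://app.askditto.io/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

resp = requests.post(f"{BASE}/research-groups/recruit", headers=HEADERS, json={
    "name": "Message Test: PM SaaS Target Buyers",
    "group_size": 10,
    "filters": {"country": "US", "age_min": 28, "age_max": 50,
                "employment_status": "Employed"},
})
resp.raise_for_status()
group_id = resp.json()["group_id"]  # save for Step 2

The same pattern applies to the study_id returned when you create the study in Step 2.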
curl -s -X POST "https://app.askditto.io/v1/research-studies" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"title": "A/B/C Message Test: FlowBoard Project Management",
"objective": "Test three messaging variants (problem-led, outcome-led, capability-led) to identify which framing resonates most with target buyers and produces the strongest action intent.",
"research_group_id": GROUP_ID
}'
Response includes study_id. Save it for Step 3.
# Ask Question 1
curl -s -X POST "https://app.askditto.io/v1/research-studies/STUDY_ID/questions" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"question": "Read this message: '\''Tired of status meetings that could have been a dashboard? Your team wastes 5 hours a week on alignment that should be automatic. FlowBoard replaces the meeting with a living project view.'\'' In your own words, what is this company offering? Who is it for? Would you want to learn more?"
}'
Response includes job_ids (one per persona). Poll these jobs until all complete.
# Poll each job until status is "completed"
curl -s "https://app.askditto.io/v1/jobs/JOB_ID" \
  -H "Authorization: Bearer YOUR_API_KEY"
When status returns "completed" for all jobs in a question, proceed to the next question. Poll every 3-5 seconds. Typical completion time: 15-45 seconds per question.
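A minimal polling loop in Python might look like the sketch below. It assumes each job object returned by the /v1/jobs endpoint carries a status field, as described above.

import time
import requests

def wait_for_jobs(job_ids, headers, interval=4):
    """Poll every few seconds until every job for a question reports 'completed'."""
    pending = set(job_ids)
    while pending:
        for job_id in list(pending):
            job = requests.get(f"https://app.askditto.io/v1/jobs/{job_id}",
                               headers=headers).json()
            if job.get("status") == "completed":
                pending.discard(job_id)
        if pending:
            time.sleep(interval)  # 3-5 seconds is typical, per the guidance above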
After Q1 completes, ask Q2. After Q2 completes, ask Q3. Continue through Q7. Total time for 7 questions across 10 personas: approximately 5-8 minutes.
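A sketch of that sequential loop, reusing wait_for_jobs from the polling step above. The questions argument is the list of seven fully substituted question strings; the job_ids key is taken from the response description above.

import requests

def ask_question(study_id, question_text, headers):
    """POST one question, returning the job IDs (one per persona)."""
    r = requests.post(
        f"https://app.askditto.io/v1/research-studies/{study_id}/questions",
        headers=headers,
        json={"question": question_text},
    )
    r.raise_for_status()
    return r.json()["job_ids"]

def run_study_questions(study_id, questions, headers):
    # Ask Q1-Q7 one at a time so personas build context across questions.
    for question_text in questions:
        job_ids = ask_question(study_id, question_text, headers)
        wait_for_jobs(job_ids, headers)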
# Example: Question 3 (the three-way comparison)
curl -s -X POST "https://app.askditto.io/v1/research-studies/STUDY_ID/questions" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"question": "One more: '\''AI-powered project tracking that learns your team'\''s workflow. FlowBoard auto-generates status updates, predicts delays before they happen, and routes blockers to the right person instantly.'\'' Of the three messages, which would make you most likely to click, sign up, or reach out? Why?"
}'
curl -s -X POST "https://app.askditto.io/v1/research-studies/STUDY_ID/complete" \ -H "Authorization: Bearer YOUR_API_KEY"
Ditto generates an automated analysis: key segments, divergences, shared mindsets, and suggested follow-up questions. This analysis often surfaces message testing insights not obvious from individual responses.
curl -s "https://app.askditto.io/v1/research-studies/STUDY_ID" \ -H "Authorization: Bearer YOUR_API_KEY"
The share_url field provides a public URL to the full study results.
curl -s "https://app.askditto.io/v1/research-studies/STUDY_ID/questions" \ -H "Authorization: Bearer YOUR_API_KEY"
Returns all 7 questions with all 10 persona responses per question (70 total responses). This is the raw data for generating deliverables.
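A small helper to pull that raw data into memory for analysis might look like this. The exact response shape is an assumption: the sketch assumes the endpoint returns a list of question objects, each carrying persona responses with a response_text field, as referenced in the extraction pseudocode later in this guide.

import requests

def fetch_study_responses(study_id, headers):
    """Fetch all questions and persona responses for a completed study."""
    r = requests.get(
        f"https://app.askditto.io/v1/research-studies/{study_id}/questions",
        headers=headers,
    )
    r.raise_for_status()
    questions = r.json()  # assumption: list of question objects with nested responses
    # Flatten to {question_text: [response_text, ...]} for the deliverables below.
    return {
        q["question"]: [resp["response_text"] for resp in q.get("responses", [])]
        for q in questions
    }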
A single round of message testing is useful. Two rounds produce messaging you can be genuinely confident in. This is the core workflow that makes Ditto + Claude Code uniquely powerful for messaging.
| Phase | Duration | What Happens |
|---|---|---|
| Round 1: Test | ~30 minutes | Test 3 messaging variants against 10 personas using the 7-question study. Analyse responses. Identify: which variant won, why the others lost, which phrases resonated, which fell flat, what gaps remain. |
| Refinement | ~10 minutes | Claude Code rewrites the two losing variants, incorporating winning language from Round 1, addressing clarity gaps, adjusting framing based on urgency data from Q7. |
| Round 2: Re-Test | ~30 minutes | Test the 3 refined variants against a fresh group of 10 personas (new recruitment). Fresh personas prevent priming bias. If the same variant wins both rounds, you have convergence. |
After Round 1, Claude Code should analyse the 70 responses and refine the two losing variants: fold in the winning language from the harvest, close the clarity gaps surfaced in Q4, and adjust framing based on the Q7 urgency data. Then recruit a fresh group and re-test:
# Recruit a FRESH group with the same filters
curl -s -X POST "https://app.askditto.io/v1/research-groups/recruit" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Message Test Round 2: PM SaaS Target Buyers",
"group_size": 10,
"filters": {
"country": "US",
"age_min": 28,
"age_max": 50,
"employment_status": "Employed"
}
}'
# Create a new study with the refined messages
curl -s -X POST "https://app.askditto.io/v1/research-studies" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"title": "A/B/C Message Test Round 2: FlowBoard (Refined)",
"objective": "Re-test three refined messaging variants. Round 1 winner was outcome-led. Problem-led and capability-led variants have been rewritten based on Round 1 persona feedback.",
"research_group_id": NEW_GROUP_ID
}'
# Ask the same 7 questions with the REFINED message variants
# (same question structure, updated message text)
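Once Round 2 completes, the convergence check is just a comparison of the two Q3 tallies. A sketch, assuming tallies like those produced by the extraction code later in this guide:

def check_convergence(round1_tally, round2_tally):
    """Return the winning variant if the same message wins both rounds, else None."""
    winner_r1 = max(round1_tally, key=round1_tally.get)
    winner_r2 = max(round2_tally, key=round2_tally.get)
    return winner_r1 if winner_r1 == winner_r2 else None

# Example using the FlowBoard numbers cited in the hierarchy below (6/10, then 7/10):
# check_convergence({"A": 3, "B": 6, "C": 1}, {"A": 2, "B": 7, "C": 1}) -> "B"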
A completed message testing study (70 qualitative responses: 10 personas x 7 questions) produces six structured deliverables. Claude Code generates these by analysing the response data from the /questions endpoint.
| Deliverable | Source Questions | What It Contains | Primary User |
|---|---|---|---|
| Message Performance Ranking | Q2, Q3, Q7 | Which variant won, by how much, and the specific reasons personas cited. Not just "B won" but "B won because the outcome framing resolved scepticism that the problem framing triggered." | PMM, Marketing |
| Clarity Scorecard | Q1, Q4 | For each variant: was it understood correctly, was it misinterpreted, what questions did it leave unanswered? Misinterpretation is worse than confusion. | PMM, Copywriting |
| Language Harvest | Q6 (primary), all Q's | Words and phrases that stuck (keep these) vs fell flat (kill these). The language customers naturally use to describe your value proposition. | Copywriting, Content |
| Audience-Message Fit Matrix | Q3, Q7, demographics | Which message works for which persona type. Maps demographic/psychographic profiles to message preference. | PMM, Demand Gen |
| Messaging Hierarchy | All Q's synthesised | Primary message, 3-4 supporting pillars, and proof points, populated with tested, validated language. | PMM, Marketing, Sales |
| Clarity Checklist | Q4, Q5 | Specific questions personas needed answered: pricing, free trial, social proof, implementation time. These become mandatory elements for any asset carrying this messaging. | Web, Content, Sales |
Parse Q3 responses (the three-way comparison) to tally explicit preferences. Then cross-reference with Q2 (pairwise comparison) and Q7 (urgency alignment).
# Performance ranking extraction (sketch: the keyword markers are a simple
# heuristic for detecting the chosen variant; adapt them to your own copy)
# fetch_responses: assumed helper wrapping the /questions endpoint and returning
# the persona response objects for one question
responses_q3 = fetch_responses(study_id, question_3_id)

tally = {"A": 0, "B": 0, "C": 0}
reasons = {"A": [], "B": [], "C": []}

# A distinctive phrase from each variant, used to identify which one the persona chose
markers = {"A": "could have been a dashboard", "B": "40% faster", "C": "ai-powered"}

for response in responses_q3:
    text = response["response_text"]
    for label, marker in markers.items():
        if marker in text.lower():
            tally[label] += 1            # tally the stated preference
            reasons[label].append(text)  # keep the full reasoning for the write-up
            break

# Output:
# "Message B (outcome-led) won 6/10 preferences.
#  Primary reason: the specific outcome ('40% faster') felt credible
#  and actionable, while the problem framing ('tired of meetings')
#  was seen as generic by 3 personas."
The messaging hierarchy is the most important output. It follows the standard structure:
MESSAGING HIERARCHY
===================
PRIMARY MESSAGE:
"Ship projects 40% faster with zero status meetings."
[Tested: 6/10 preference in Round 1, 7/10 in Round 2]
SUPPORTING PILLARS:
Pillar 1: Real-time visibility
"Every stakeholder sees progress, blockers, and deadlines
without a single sync call."
Evidence: 8/10 personas cited "no more check-in meetings"
as the most compelling benefit.
Pillar 2: Predictive intelligence
"Know about delays before they happen, not after."
Evidence: Q6 language harvest - "predicts delays" was the
#1 phrase that "stuck" across both rounds.
Pillar 3: Zero-effort status updates
"Auto-generated updates from your team's actual work."
Evidence: Q4 clarity gap - personas needed to understand
HOW updates are generated without manual input.
PROOF POINTS:
- "40% faster" needs supporting data (case study, benchmark)
- "AI-powered" needs specificity (what model, what data)
- Social proof needed: "Who else uses this?"
[From Q5 expectation alignment data]
The language harvest is the most immediately actionable output. It tells you exactly which words and phrases to use (and avoid) in all customer-facing copy.
Q6 asks directly: "What one word or phrase stuck with you most? What fell completely flat?" But valuable language data is embedded across all seven questions. Claude Code should scan all 70 responses for:
| Category | What to Look For | How to Use It |
|---|---|---|
| Keep (high resonance) | Phrases cited in Q6 as "stuck with me", language personas use when paraphrasing your message positively in Q1, words that appear in multiple personas' Q3 action reasons | Use in headlines, email subject lines, ad copy, sales scripts |
| Kill (negative resonance) | Phrases cited in Q6 as "fell flat", language personas flag as "confusing" or "jargon" in Q4, words associated with scepticism in Q1 | Remove from all messaging, replace with tested alternatives |
| Adopt (customer language) | Natural language personas use to describe the value in Q1, problem descriptions in Q7 that differ from your framing, paraphrases that are clearer than your original | Replace your internal language with customer language throughout |
LANGUAGE HARVEST
================
KEEP (use more):
- "zero status meetings" [cited by 7/10 as memorable]
- "predicts delays" [cited by 5/10, appeared in Q3 reasons]
- "living project view" [cited by 4/10, natural paraphrase]
- "40% faster" [cited by 6/10, BUT 3 wanted proof]

KILL (remove immediately):
- "alignment that should be automatic" [4/10 said "corporate jargon"]
- "routes blockers" [3/10 didn't understand what this means]
- "learns your workflow" [3/10 found this "creepy" or "vague"]

ADOPT (customer language, replace yours):
- Personas say "no more Monday syncs" → use instead of "eliminates meetings"
- Personas say "see what's stuck" → use instead of "identifies blockers"
- Personas say "keeps everyone on the same page" → use instead of "real-time visibility"
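To surface candidates for this harvest programmatically, a simple phrase-frequency pass over all 70 responses is a reasonable first cut before the qualitative read. A sketch; the phrase list is illustrative and should come from your own variants.

from collections import Counter

def harvest_candidates(all_responses, phrases):
    """Count how many responses mention each candidate phrase.

    all_responses: flat list of response_text strings across all 7 questions.
    phrases: candidate words/phrases pulled from your message variants.
    """
    counts = Counter()
    for text in all_responses:
        lowered = text.lower()
        for phrase in phrases:
            if phrase.lower() in lowered:
                counts[phrase] += 1
    return counts.most_common()

# Example: check which of your own phrases personas echo back
# harvest_candidates(responses, ["zero status meetings", "predicts delays",
#                                "living project view", "routes blockers"])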
The basic workflow tests one set of messages against one audience. The advanced version tests the same messages against multiple audiences simultaneously, revealing which framing works for which buyer.
Claude Code orchestrates three studies concurrently:
# Group 1: SMB decision-makers
curl -s -X POST "https://app.askditto.io/v1/research-groups/recruit" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Message Test: SMB Buyers (28-40)",
"group_size": 10,
"filters": {
"country": "US",
"age_min": 28,
"age_max": 40,
"employment_status": "Employed"
}
}'
# Group 2: Enterprise evaluators
curl -s -X POST "https://app.askditto.io/v1/research-groups/recruit" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Message Test: Enterprise Buyers (35-55)",
"group_size": 10,
"filters": {
"country": "US",
"age_min": 35,
"age_max": 55,
"employment_status": "Employed"
}
}'
# Group 3: Technical buyers
curl -s -X POST "https://app.askditto.io/v1/research-groups/recruit" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Message Test: Technical Buyers (Bachelor+)",
"group_size": 10,
"filters": {
"country": "US",
"age_min": 25,
"age_max": 50,
"employment_status": "Employed",
"education_level": "Bachelors"
}
}'
Create three separate studies (one per group) and run the same 7 questions with the same 3 message variants across all three. Claude Code can interleave the API calls, asking Q1 to all three studies, then Q2, and so on.
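A sketch of that interleaving, reusing ask_question and wait_for_jobs from the earlier sketches; the study IDs are placeholders.

# Ask each question to all three segment studies before moving on, so the
# segments stay in lockstep and can be compared question by question.
def run_cross_segment_test(study_ids, questions, headers):
    for question_text in questions:
        jobs_by_study = {
            study_id: ask_question(study_id, question_text, headers)
            for study_id in study_ids
        }
        for study_id, job_ids in jobs_by_study.items():
            wait_for_jobs(job_ids, headers)

# study_ids = [smb_study_id, enterprise_study_id, technical_study_id]
# run_cross_segment_test(study_ids, questions, HEADERS)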
AUDIENCE-MESSAGE FIT MATRIX
============================
                         SMB Buyers   Enterprise   Technical
Message A (Problem)         6/10         3/10        2/10
Message B (Outcome)         3/10         7/10        4/10
Message C (Capability)      1/10         0/10        4/10
KEY INSIGHT:
- SMB buyers respond to problem framing (pain is personal and immediate)
- Enterprise buyers respond to outcome framing (need to justify ROI)
- Technical buyers split between outcome and capability
(want to understand the mechanism before trusting the claim)
RECOMMENDATION:
- Website homepage: Outcome-led (broadest appeal)
- SMB email sequences: Problem-led
- Enterprise sales deck: Outcome-led with ROI proof
- Technical documentation: Capability-led
Context: FlowBoard is a project management tool launching a new AI-powered status tracking feature. The PMM team has validated positioning (competitive alternative: Monday.com + manual updates; unique attribute: AI-generated status; value: eliminates status meetings). Now they need to determine which messaging framing will drive the most trial sign-ups.
Primary: "Ship projects faster with zero status meetings."
Pillar 1: Real-time visibility without sync calls
Pillar 2: AI-predicted delays before they become problems
Pillar 3: Auto-generated updates from actual work
Proof: Social proof ("2,000+ teams"), speed claim needs case study support, pricing on the landing page (not behind a form)
Message testing sits between positioning and execution in the PMM stack. It connects to other Ditto + Claude Code workflows:
| Workflow | Relationship to Message Testing | Sequence |
|---|---|---|
| Positioning Validation | Positioning determines what to say. Messaging determines how to say it. Always validate positioning first. | Before message testing |
| Competitive Intelligence | Competitive battlecards provide "quick dismisses" and "landmine questions" that should be reflected in competitive messaging variants. | Before or parallel |
| Sales Enablement | The messaging hierarchy and language harvest feed directly into pitch decks, email templates, and demo scripts. | After message testing |
| Content Marketing | Tested messaging informs blog headlines, social copy, ad creative, and landing page copy. The language harvest provides exact words to use. | After message testing |
| Pricing Research | Q4 clarity gaps often surface pricing as the #1 unanswered question. If so, run a pricing study next. | After message testing |
Positioning Validation → 30 minutes → What to say
↓
Message Testing (2 rounds) → 70 minutes → How to say it
↓
Competitive Intelligence → 45 minutes → How to say it about the competition
↓
Total: ~2.5 hours for the strategic foundation most teams spend a quarter building
| Practice | Why It Matters |
|---|---|
| Keep variants similar in length | If Message A is 2 sentences and Message C is a paragraph, you're testing length, not framing |
| Test framing, not content | All three variants should convey the same value proposition with different emphasis |
| Use fresh personas for Round 2 | Same personas are primed by Round 1 context, biasing results |
| Ask questions sequentially | Personas build context across questions, producing richer qualitative data |
| Include the message text in the question | Don't reference "Message A" abstractly. Paste the actual message so personas respond to the words |
| Run the study through completion | Ditto's automated analysis often surfaces insights not obvious from individual responses |
| Trust the language harvest over intuition | If 7/10 personas remember a phrase, use it. If your favourite phrase fell flat with 4/10, kill it. |
| Mistake | What Goes Wrong | How to Avoid |
|---|---|---|
| Testing before positioning is validated | You might be testing the right framing for the wrong value proposition | Run positioning validation first (see Positioning Validation guide) |
| Only one round of testing | No way to verify the winner or test whether refined losers improve | Always run two rounds. The second round costs 30 minutes and provides convergence evidence |
| Reusing the same group for Round 2 | Personas are primed by Round 1, biasing Round 2 results | Recruit a fresh group with identical filters |
| Testing more than 3 variants | Comparison fatigue. Personas lose the ability to differentiate after 3 options | Test 3 at a time. If you have 5 variants, run a first round to narrow to 3, then test those |
| Ignoring the clarity checklist | Messaging wins the preference test but fails in production because it leaves critical questions unanswered | Treat Q4 and Q5 outputs as mandatory design requirements for landing pages and assets |
| Asking all 7 questions simultaneously | Loses sequential context. Q2 and Q3 are specifically designed to build on Q1 | Ask one question, wait for completion, then ask the next |
Not in a single study. Three is the maximum for meaningful comparison without cognitive overload. If you have five variants, run a screening round (Q3 only, rapid preference check) with all five, then take the top three into the full 7-question study.
When the same variant wins in two consecutive rounds with different persona groups, and the language harvest shows consistent patterns, messaging has converged. You can always re-test after launch with real-world data, but two rounds of synthetic testing provides a strong pre-launch foundation.
Full messages (2-3 sentences) work best. Headlines alone lack enough context for personas to evaluate comprehension and relevance. If you need to test headlines specifically, include a sentence of supporting context with each one.
This is informative. It means either: (a) the messaging variants are too similar (differentiate the framings more), (b) the value proposition itself doesn't resonate strongly with this audience (a positioning issue, not a messaging issue), or (c) the audience is genuinely split (consider segment-specific messaging). Check the Q7 urgency data to diagnose which.
Yes. Ditto has personas across 15+ countries. Recruit a group filtered by country (e.g., Germany, France, Japan) and present messages in the target language. Claude Code can orchestrate parallel studies across markets to compare messaging resonance cross-culturally.
Ditto message testing is a pre-launch qualifier. It eliminates weak variants before you spend real traffic testing them. Use Ditto to narrow from 3 to 1, then use live A/B testing (Optimizely, VWO, etc.) to fine-tune the winner against minor variations with real conversion data.
This means the refinement overcorrected, or the two groups had meaningfully different preferences. Compare the Q3 reasoning from both rounds. If the reasons are consistent but the winner flipped, the variants are close in performance and either could work. If the reasons differ, the groups may represent different segments, which is itself a valuable finding.
10 per group is the sweet spot for message testing. Fewer than 6 produces unreliable patterns. More than 15 adds data volume without proportionally increasing insight quality. For cross-segment testing, use 10 per segment (30 total for 3 segments).
Yes, with a modification. For subject lines, use Q1 to present each subject line and ask "Would you open an email with this subject? Why or why not?" Adapt Q6 to focus on which subject line creates the most curiosity. The rest of the study structure applies.
Related guides: