A complete Claude Code guide to A/B/C message testing and iterative refinement. Seven-question study design, two-round iteration workflow, cross-segment testing, language harvest extraction, and six deliverables from a single study. From three untested variants to validated messaging in 70 minutes.
Positioning is internal and strategic. Messaging is external and is what actually touches customers. Yet most product marketing teams validate positioning rigorously and test messaging by gut feel.
Traditional message testing options and their limitations:
| Method | Time Per Round | Cost | Limitation |
|---|---|---|---|
| Customer interviews | 2-4 weeks | $5-15K | Slow to arrange, small sample, biased by relationship |
| Wynter (B2B panels) | 24-48 hours | $300-600/test | Days per round, costly to iterate |
| Qualtrics surveys | 1-2 weeks | $2-10K | Quantitative, misses qualitative depth |
| Live A/B testing | 1-4 weeks | Opportunity cost | Tests after launch with real traffic, no qualitative insight |
| Ditto + Claude Code | 30 min per round | API usage only | Synthetic (validated at 95% correlation with real research) |
Before running a message test, you need variants worth testing. The same positioning can produce radically different messaging depending on which framing you lead with. The three canonical approaches:
| Framing | Opens With | Works When | Fails When | Example |
|---|---|---|---|---|
| Problem-led | The pain the customer feels | Problem is universal and emotionally resonant | Audience doesn't recognise the problem or it feels abstract | "Tired of spending $50K on research that takes three months?" |
| Outcome-led | The result the customer achieves | Outcome is specific, measurable, and desirable | Claim sounds too good to be true without proof | "Get validated customer insights in 30 minutes." |
| Capability-led | What the product does | The capability itself is the differentiator and the audience is sophisticated | Audience cares about outcomes, not mechanisms | "AI-powered synthetic research with 300,000 personas across 15 countries." |
Claude Code should produce three variants of 1-3 sentences each. Keep them similar in length and level of detail so the test measures framing preference, not information asymmetry.
# Example: Three variants for a project management SaaS
MESSAGE_A = """Tired of status meetings that could have been a dashboard?
Your team wastes 5 hours a week on alignment that should be automatic.
FlowBoard replaces the meeting with a living project view."""

MESSAGE_B = """Ship projects 40% faster with zero status meetings.
FlowBoard gives every stakeholder real-time visibility into progress,
blockers, and deadlines without a single sync call."""

MESSAGE_C = """AI-powered project tracking that learns your team's workflow.
FlowBoard auto-generates status updates, predicts delays before they happen,
and routes blockers to the right person instantly."""
Each question targets a specific dimension of message performance. Together, they produce the data needed for all six deliverables.
| Q# | Question Template | What It Measures | Maps To Deliverable |
|---|---|---|---|
| Q1 | "Read this message: '[Message A]'. In your own words, what is this company offering? Who is it for? Would you want to learn more?" | Comprehension, relevance, intent | Clarity Scorecard |
| Q2 | "Now read this: '[Message B]'. How does this compare to the first? Which feels more relevant to your situation?" | Comparative preference, framing impact | Performance Ranking |
| Q3 | "One more: '[Message C]'. Of the three, which would make you most likely to click, sign up, or reach out? Why?" | Action intent, decision drivers | Performance Ranking, Audience-Message Fit |
| Q4 | "What is unclear or confusing about any of these messages? What questions do they leave unanswered?" | Clarity gaps, information needs | Clarity Scorecard, Clarity Checklist |
| Q5 | "If you saw the winning message on a website, what would you expect to find when you clicked through?" | Message-to-experience alignment | Clarity Checklist |
| Q6 | "What one word or phrase from these messages stuck with you most? What fell completely flat?" | Language resonance, memorability | Language Harvest |
| Q7 | "Thinking about your actual work or life, which of these problems feels most urgent to you right now? Why?" | Problem urgency, messaging-market fit | Performance Ranking, Audience-Message Fit |
Replace the bracketed placeholders with your actual messaging variants. The surrounding question text should remain unchanged to maintain the measurement structure. For example:
# Q1: Replace [Message A] with your problem-led variant
question_1 = f"""Read this message: '{MESSAGE_A}'
In your own words, what is this company offering? Who is it for?
Would you want to learn more?"""
Create a group matching your target buyer profile. Use demographic filters to ensure relevance.
curl -s -X POST "https://app.askditto.io/v1/research-groups/recruit" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Message Test: PM SaaS Target Buyers",
"group_size": 10,
"filters": {
"country": "US",
"age_min": 28,
"age_max": 50,
"employment_status": "Employed"
}
}'
Response includes group_id. Save it for Step 2.
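If you are scripting the workflow rather than running curl by hand, a minimal Python sketch can capture the ID directly. It reuses the recruit endpoint and filters shown above; the response shape beyond the group_id field is an assumption.

import requests

API_KEY = "YOUR_API_KEY"  # same bearer token as the curl examples
BASE = "https://app.askditto.io/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

resp = requests.post(f"{BASE}/research-groups/recruit", headers=HEADERS, json={
    "name": "Message Test: PM SaaS Target Buyers",
    "group_size": 10,
    "filters": {"country": "US", "age_min": 28, "age_max": 50,
                "employment_status": "Employed"},
})
resp.raise_for_status()
group_id = resp.json()["group_id"]  # save for Step 2

The same pattern applies to the study_id returned when you create the study in Step 2.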
curl -s -X POST "https://app.askditto.io/v1/research-studies" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"title": "A/B/C Message Test: FlowBoard Project Management",
"objective": "Test three messaging variants (problem-led, outcome-led, capability-led) to identify which framing resonates most with target buyers and produces the strongest action intent.",
"research_group_id": GROUP_ID
}'
Response includes study_id. Save it for Step 3.
# Ask Question 1
curl -s -X POST "https://app.askditto.io/v1/research-studies/STUDY_ID/questions" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"question": "Read this message: '\''Tired of status meetings that could have been a dashboard? Your team wastes 5 hours a week on alignment that should be automatic. FlowBoard replaces the meeting with a living project view.'\'' In your own words, what is this company offering? Who is it for? Would you want to learn more?"
}'
Response includes job_ids (one per persona). Poll these jobs until all complete.
# Poll each job until status is "completed"
curl -s "https://app.askditto.io/v1/jobs/JOB_ID" \
  -H "Authorization: Bearer YOUR_API_KEY"
When status returns "completed" for all jobs in a question, proceed to the next question. Poll every 3-5 seconds. Typical completion time: 15-45 seconds per question.
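A minimal polling loop in Python might look like the sketch below. It assumes each job object returned by the /v1/jobs endpoint carries a status field, as described above.

import time
import requests

def wait_for_jobs(job_ids, headers, interval=4):
    """Poll every few seconds until every job for a question reports 'completed'."""
    pending = set(job_ids)
    while pending:
        for job_id in list(pending):
            job = requests.get(f"https://app.askditto.io/v1/jobs/{job_id}",
                               headers=headers).json()
            if job.get("status") == "completed":
                pending.discard(job_id)
        if pending:
            time.sleep(interval)  # 3-5 seconds is typical, per the guidance above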
After Q1 completes, ask Q2. After Q2 completes, ask Q3. Continue through Q7. Total time for 7 questions across 10 personas: approximately 5-8 minutes.
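A sketch of that sequential loop, reusing wait_for_jobs from the polling step above. The questions argument is the list of seven fully substituted question strings; the job_ids key is taken from the response description above.

import requests

def ask_question(study_id, question_text, headers):
    """POST one question, returning the job IDs (one per persona)."""
    r = requests.post(
        f"https://app.askditto.io/v1/research-studies/{study_id}/questions",
        headers=headers,
        json={"question": question_text},
    )
    r.raise_for_status()
    return r.json()["job_ids"]

def run_study_questions(study_id, questions, headers):
    # Ask Q1-Q7 one at a time so personas build context across questions.
    for question_text in questions:
        job_ids = ask_question(study_id, question_text, headers)
        wait_for_jobs(job_ids, headers)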
# Example: Question 3 (the three-way comparison)
curl -s -X POST "https://app.askditto.io/v1/research-studies/STUDY_ID/questions" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"question": "One more: '\''AI-powered project tracking that learns your team'\''s workflow. FlowBoard auto-generates status updates, predicts delays before they happen, and routes blockers to the right person instantly.'\'' Of the three messages, which would make you most likely to click, sign up, or reach out? Why?"
}'
curl -s -X POST "https://app.askditto.io/v1/research-studies/STUDY_ID/complete" \ -H "Authorization: Bearer YOUR_API_KEY"
Ditto generates an automated analysis: key segments, divergences, shared mindsets, and suggested follow-up questions. This analysis often surfaces message testing insights not obvious from individual responses.
curl -s "https://app.askditto.io/v1/research-studies/STUDY_ID" \ -H "Authorization: Bearer YOUR_API_KEY"
The share_url field provides a public URL to the full study results.
curl -s "https://app.askditto.io/v1/research-studies/STUDY_ID/questions" \ -H "Authorization: Bearer YOUR_API_KEY"
Returns all 7 questions with all 10 persona responses per question (70 total responses). This is the raw data for generating deliverables.
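A small helper to pull that raw data into memory for analysis might look like this. The exact response shape is an assumption: the sketch assumes the endpoint returns a list of question objects, each carrying persona responses with a response_text field, as referenced in the extraction pseudocode later in this guide.

import requests

def fetch_study_responses(study_id, headers):
    """Fetch all questions and persona responses for a completed study."""
    r = requests.get(
        f"https://app.askditto.io/v1/research-studies/{study_id}/questions",
        headers=headers,
    )
    r.raise_for_status()
    questions = r.json()  # assumption: list of question objects with nested responses
    # Flatten to {question_text: [response_text, ...]} for the deliverables below.
    return {
        q["question"]: [resp["response_text"] for resp in q.get("responses", [])]
        for q in questions
    }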
A single round of message testing is useful. Two rounds produce messaging you can be genuinely confident in. This is the core workflow that makes Ditto + Claude Code uniquely powerful for messaging.
| Phase | Duration | What Happens |
|---|---|---|
| Round 1: Test | ~30 minutes | Test 3 messaging variants against 10 personas using the 7-question study. Analyse responses. Identify: which variant won, why the others lost, which phrases resonated, which fell flat, what gaps remain. |
| Refinement | ~10 minutes | Claude Code rewrites the two losing variants, incorporating winning language from Round 1, addressing clarity gaps, adjusting framing based on urgency data from Q7. |
| Round 2: Re-Test | ~30 minutes | Test the 3 refined variants against a fresh group of 10 personas (new recruitment). Fresh personas prevent priming bias. If the same variant wins both rounds, you have convergence. |
After Round 1, Claude Code should analyse the 70 responses and refine the two losing variants: fold in the winning language from the harvest, close the clarity gaps surfaced in Q4, and adjust framing based on the Q7 urgency data. Then recruit a fresh group and re-test:
# Recruit a FRESH group with the same filters
curl -s -X POST "https://app.askditto.io/v1/research-groups/recruit" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Message Test Round 2: PM SaaS Target Buyers",
"group_size": 10,
"filters": {
"country": "US",
"age_min": 28,
"age_max": 50,
"employment_status": "Employed"
}
}'
# Create a new study with the refined messages
curl -s -X POST "https://app.askditto.io/v1/research-studies" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"title": "A/B/C Message Test Round 2: FlowBoard (Refined)",
"objective": "Re-test three refined messaging variants. Round 1 winner was outcome-led. Problem-led and capability-led variants have been rewritten based on Round 1 persona feedback.",
"research_group_id": NEW_GROUP_ID
}'
# Ask the same 7 questions with the REFINED message variants
# (same question structure, updated message text)
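Once Round 2 completes, the convergence check is just a comparison of the two Q3 tallies. A sketch, assuming tallies like those produced by the extraction code later in this guide:

def check_convergence(round1_tally, round2_tally):
    """Return the winning variant if the same message wins both rounds, else None."""
    winner_r1 = max(round1_tally, key=round1_tally.get)
    winner_r2 = max(round2_tally, key=round2_tally.get)
    return winner_r1 if winner_r1 == winner_r2 else None

# Example using the FlowBoard numbers cited in the hierarchy below (6/10, then 7/10):
# check_convergence({"A": 3, "B": 6, "C": 1}, {"A": 2, "B": 7, "C": 1}) -> "B"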
A completed message testing study (70 qualitative responses: 10 personas x 7 questions) produces six structured deliverables. Claude Code generates these by analysing the response data from the /questions endpoint.
| Deliverable | Source Questions | What It Contains | Primary User |
|---|---|---|---|
| Message Performance Ranking | Q2, Q3, Q7 | Which variant won, by how much, and the specific reasons personas cited. Not just "B won" but "B won because the outcome framing resolved scepticism that the problem framing triggered." | PMM, Marketing |
| Clarity Scorecard | Q1, Q4 | For each variant: was it understood correctly, was it misinterpreted, what questions did it leave unanswered? Misinterpretation is worse than confusion. | PMM, Copywriting |
| Language Harvest | Q6 (primary), all Q's | Words and phrases that stuck (keep these) vs fell flat (kill these). The language customers naturally use to describe your value proposition. | Copywriting, Content |
| Audience-Message Fit Matrix | Q3, Q7, demographics | Which message works for which persona type. Maps demographic/psychographic profiles to message preference. | PMM, Demand Gen |
| Messaging Hierarchy | All Q's synthesised | Primary message, 3-4 supporting pillars, and proof points, populated with tested, validated language. | PMM, Marketing, Sales |
| Clarity Checklist | Q4, Q5 | Specific questions personas needed answered: pricing, free trial, social proof, implementation time. These become mandatory elements for any asset carrying this messaging. | Web, Content, Sales |
Parse Q3 responses (the three-way comparison) to tally explicit preferences. Then cross-reference with Q2 (pairwise comparison) and Q7 (urgency alignment).
# Performance ranking extraction (sketch: the keyword markers are a simple
# heuristic for detecting the chosen variant; adapt them to your own copy)
# fetch_responses: assumed helper wrapping the /questions endpoint and returning
# the persona response objects for one question
responses_q3 = fetch_responses(study_id, question_3_id)

tally = {"A": 0, "B": 0, "C": 0}
reasons = {"A": [], "B": [], "C": []}

# A distinctive phrase from each variant, used to identify which one the persona chose
markers = {"A": "could have been a dashboard", "B": "40% faster", "C": "ai-powered"}

for response in responses_q3:
    text = response["response_text"]
    for label, marker in markers.items():
        if marker in text.lower():
            tally[label] += 1            # tally the stated preference
            reasons[label].append(text)  # keep the full reasoning for the write-up
            break

# Output:
# "Message B (outcome-led) won 6/10 preferences.
#  Primary reason: the specific outcome ('40% faster') felt credible
#  and actionable, while the problem framing ('tired of meetings')
#  was seen as generic by 3 personas."
The messaging hierarchy is the most important output. It follows the standard structure:
MESSAGING HIERARCHY
===================
PRIMARY MESSAGE:
"Ship projects 40% faster with zero status meetings."
[Tested: 6/10 preference in Round 1, 7/10 in Round 2]
SUPPORTING PILLARS:
Pillar 1: Real-time visibility
"Every stakeholder sees progress, blockers, and deadlines
without a single sync call."
Evidence: 8/10 personas cited "no more check-in meetings"
as the most compelling benefit.
Pillar 2: Predictive intelligence
"Know about delays before they happen, not after."
Evidence: Q6 language harvest - "predicts delays" was the
#1 phrase that "stuck" across both rounds.
Pillar 3: Zero-effort status updates
"Auto-generated updates from your team's actual work."
Evidence: Q4 clarity gap - personas needed to understand
HOW updates are generated without manual input.
PROOF POINTS:
- "40% faster" needs supporting data (case study, benchmark)
- "AI-powered" needs specificity (what model, what data)
- Social proof needed: "Who else uses this?"
[From Q5 expectation alignment data]
The language harvest is the most immediately actionable output. It tells you exactly which words and phrases to use (and avoid) in all customer-facing copy.
Q6 asks directly: "What one word or phrase stuck with you most? What fell completely flat?" But valuable language data is embedded across all seven questions. Claude Code should scan all 70 responses for:
| Category | What to Look For | How to Use It |
|---|---|---|
| Keep (high resonance) | Phrases cited in Q6 as "stuck with me", language personas use when paraphrasing your message positively in Q1, words that appear in multiple personas' Q3 action reasons | Use in headlines, email subject lines, ad copy, sales scripts |
| Kill (negative resonance) | Phrases cited in Q6 as "fell flat", language personas flag as "confusing" or "jargon" in Q4, words associated with scepticism in Q1 | Remove from all messaging, replace with tested alternatives |
| Adopt (customer language) | Natural language personas use to describe the value in Q1, problem descriptions in Q7 that differ from your framing, paraphrases that are clearer than your original | Replace your internal language with customer language throughout |
LANGUAGE HARVEST
================
KEEP (use more):
- "zero status meetings" [cited by 7/10 as memorable]
- "predicts delays" [cited by 5/10, appeared in Q3 reasons]
- "living project view" [cited by 4/10, natural paraphrase]
- "40% faster" [cited by 6/10, BUT 3 wanted proof]

KILL (remove immediately):
- "alignment that should be automatic" [4/10 said "corporate jargon"]
- "routes blockers" [3/10 didn't understand what this means]
- "learns your workflow" [3/10 found this "creepy" or "vague"]

ADOPT (customer language, replace yours):
- Personas say "no more Monday syncs" → use instead of "eliminates meetings"
- Personas say "see what's stuck" → use instead of "identifies blockers"
- Personas say "keeps everyone on the same page" → use instead of "real-time visibility"
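To surface candidates for this harvest programmatically, a simple phrase-frequency pass over all 70 responses is a reasonable first cut before the qualitative read. A sketch; the phrase list is illustrative and should come from your own variants.

from collections import Counter

def harvest_candidates(all_responses, phrases):
    """Count how many responses mention each candidate phrase.

    all_responses: flat list of response_text strings across all 7 questions.
    phrases: candidate words/phrases pulled from your message variants.
    """
    counts = Counter()
    for text in all_responses:
        lowered = text.lower()
        for phrase in phrases:
            if phrase.lower() in lowered:
                counts[phrase] += 1
    return counts.most_common()

# Example: check which of your own phrases personas echo back
# harvest_candidates(responses, ["zero status meetings", "predicts delays",
#                                "living project view", "routes blockers"])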
The basic workflow tests one set of messages against one audience. The advanced version tests the same messages against multiple audiences simultaneously, revealing which framing works for which buyer.
Claude Code orchestrates three studies concurrently:
# Group 1: SMB decision-makers
curl -s -X POST "https://app.askditto.io/v1/research-groups/recruit" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Message Test: SMB Buyers (28-40)",
"group_size": 10,
"filters": {
"country": "US",
"age_min": 28,
"age_max": 40,
"employment_status": "Employed"
}
}'
# Group 2: Enterprise evaluators
curl -s -X POST "https://app.askditto.io/v1/research-groups/recruit" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Message Test: Enterprise Buyers (35-55)",
"group_size": 10,
"filters": {
"country": "US",
"age_min": 35,
"age_max": 55,
"employment_status": "Employed"
}
}'
# Group 3: Technical buyers
curl -s -X POST "https://app.askditto.io/v1/research-groups/recruit" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Message Test: Technical Buyers (Bachelor+)",
"group_size": 10,
"filters": {
"country": "US",
"age_min": 25,
"age_max": 50,
"employment_status": "Employed",
"education_level": "Bachelors"
}
}'
Create three separate studies (one per group) and run the same 7 questions with the same 3 message variants across all three. Claude Code can interleave the API calls, asking Q1 to all three studies, then Q2, and so on.
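A sketch of that interleaving, reusing ask_question and wait_for_jobs from the earlier sketches; the study IDs are placeholders.

# Ask each question to all three segment studies before moving on, so the
# segments stay in lockstep and can be compared question by question.
def run_cross_segment_test(study_ids, questions, headers):
    for question_text in questions:
        jobs_by_study = {
            study_id: ask_question(study_id, question_text, headers)
            for study_id in study_ids
        }
        for study_id, job_ids in jobs_by_study.items():
            wait_for_jobs(job_ids, headers)

# study_ids = [smb_study_id, enterprise_study_id, technical_study_id]
# run_cross_segment_test(study_ids, questions, HEADERS)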
AUDIENCE-MESSAGE FIT MATRIX
============================
                         SMB Buyers   Enterprise   Technical
Message A (Problem)         6/10         3/10        2/10
Message B (Outcome)         3/10         7/10        4/10
Message C (Capability)      1/10         0/10        4/10
KEY INSIGHT:
- SMB buyers respond to problem framing (pain is personal and immediate)
- Enterprise buyers respond to outcome framing (need to justify ROI)
- Technical buyers split between outcome and capability
(want to understand the mechanism before trusting the claim)
RECOMMENDATION:
- Website homepage: Outcome-led (broadest appeal)
- SMB email sequences: Problem-led
- Enterprise sales deck: Outcome-led with ROI proof
- Technical documentation: Capability-led
Context: FlowBoard is a project management tool launching a new AI-powered status tracking feature. The PMM team has validated positioning (competitive alternative: Monday.com + manual updates; unique attribute: AI-generated status; value: eliminates status meetings). Now they need to determine which messaging framing will drive the most trial sign-ups.
Primary: "Ship projects faster with zero status meetings."
Pillar 1: Real-time visibility without sync calls
Pillar 2: AI-predicted delays before they become problems
Pillar 3: Auto-generated updates from actual work
Proof: Social proof ("2,000+ teams"), speed claim needs case study support, pricing on the landing page (not behind a form)
Message testing sits between positioning and execution in the PMM stack. It connects to other Ditto + Claude Code workflows:
| Workflow | Relationship to Message Testing | Sequence |
|---|---|---|
| Positioning Validation | Positioning determines what to say. Messaging determines how to say it. Always validate positioning first. | Before message testing |
| Competitive Intelligence | Competitive battlecards provide "quick dismisses" and "landmine questions" that should be reflected in competitive messaging variants. | Before or parallel |
| Sales Enablement | The messaging hierarchy and language harvest feed directly into pitch decks, email templates, and demo scripts. | After message testing |
| Content Marketing | Tested messaging informs blog headlines, social copy, ad creative, and landing page copy. The language harvest provides exact words to use. | After message testing |
| Pricing Research | Q4 clarity gaps often surface pricing as the #1 unanswered question. If so, run a pricing study next. | After message testing |
Positioning Validation → 30 minutes → What to say
↓
Message Testing (2 rounds) → 70 minutes → How to say it
↓
Competitive Intelligence → 45 minutes → How to say it about the competition
↓
Total: ~2.5 hours for the strategic foundation most teams spend a quarter building
| Practice | Why It Matters |
|---|---|
| Keep variants similar in length | If Message A is 2 sentences and Message C is a paragraph, you're testing length, not framing |
| Test framing, not content | All three variants should convey the same value proposition with different emphasis |
| Use fresh personas for Round 2 | Same personas are primed by Round 1 context, biasing results |
| Ask questions sequentially | Personas build context across questions, producing richer qualitative data |
| Include the message text in the question | Don't reference "Message A" abstractly. Paste the actual message so personas respond to the words |
| Run the study through completion | Ditto's automated analysis often surfaces insights not obvious from individual responses |
| Trust the language harvest over intuition | If 7/10 personas remember a phrase, use it. If your favourite phrase fell flat with 4/10, kill it. |
| Mistake | What Goes Wrong | How to Avoid |
|---|---|---|
| Testing before positioning is validated | You might be testing the right framing for the wrong value proposition | Run positioning validation first (see Positioning Validation guide) |
| Only one round of testing | No way to verify the winner or test whether refined losers improve | Always run two rounds. The second round costs 30 minutes and provides convergence evidence |
| Reusing the same group for Round 2 | Personas are primed by Round 1, biasing Round 2 results | Recruit a fresh group with identical filters |
| Testing more than 3 variants | Comparison fatigue. Personas lose the ability to differentiate after 3 options | Test 3 at a time. If you have 5 variants, run a first round to narrow to 3, then test those |
| Ignoring the clarity checklist | Messaging wins the preference test but fails in production because it leaves critical questions unanswered | Treat Q4 and Q5 outputs as mandatory design requirements for landing pages and assets |
| Asking all 7 questions simultaneously | Loses sequential context. Q2 and Q3 are specifically designed to build on Q1 | Ask one question, wait for completion, then ask the next |
Not in a single study. Three is the maximum for meaningful comparison without cognitive overload. If you have five variants, run a screening round (Q3 only, rapid preference check) with all five, then take the top three into the full 7-question study.
When the same variant wins in two consecutive rounds with different persona groups, and the language harvest shows consistent patterns, messaging has converged. You can always re-test after launch with real-world data, but two rounds of synthetic testing provides a strong pre-launch foundation.
Full messages (2-3 sentences) work best. Headlines alone lack enough context for personas to evaluate comprehension and relevance. If you need to test headlines specifically, include a sentence of supporting context with each one.
This is informative. It means either: (a) the messaging variants are too similar (differentiate the framings more), (b) the value proposition itself doesn't resonate strongly with this audience (a positioning issue, not a messaging issue), or (c) the audience is genuinely split (consider segment-specific messaging). Check the Q7 urgency data to diagnose which.
Yes. Ditto has personas across 15+ countries. Recruit a group filtered by country (e.g., Germany, France, Japan) and present messages in the target language. Claude Code can orchestrate parallel studies across markets to compare messaging resonance cross-culturally.
Ditto message testing is a pre-launch qualifier. It eliminates weak variants before you spend real traffic testing them. Use Ditto to narrow from 3 to 1, then use live A/B testing (Optimizely, VWO, etc.) to fine-tune the winner against minor variations with real conversion data.
This means the refinement overcorrected, or the two groups had meaningfully different preferences. Compare the Q3 reasoning from both rounds. If the reasons are consistent but the winner flipped, the variants are close in performance and either could work. If the reasons differ, the groups may represent different segments, which is itself a valuable finding.
10 per group is the sweet spot for message testing. Fewer than 6 produces unreliable patterns. More than 15 adds data volume without proportionally increasing insight quality. For cross-segment testing, use 10 per segment (30 total for 3 segments).
Yes, with a modification. For subject lines, use Q1 to present each subject line and ask "Would you open an email with this subject? Why or why not?" Adapt Q6 to focus on which subject line creates the most curiosity. The rest of the study structure applies.
Related guides: