Why Your Synthetic Consumers Built Simply with ChatGPT Keep Lying to You

There's a dirty secret in cheap synthetic research: when you ask an LLM to rate purchase intent on a 1-5 scale, it lies.

Not maliciously. Just predictably. It defaults to "3" like a nervous test-taker hedging every answer. It avoids extremes. It produces distributions that look nothing like real human responses, where people actually have opinions and skew positive or negative.

For anyone building synthetic market research, this has been one of the central technical challenges: how do you get realistic response distributions without training on proprietary human survey data?

A new paper from PyMC Labs and Colgate-Palmolive offers a solution.

The Problem: Direct Ratings Don't Work

The researchers tested 57 personal care product concepts against 9,300 real human survey responses. When they asked GPT-4o and Gemini to rate purchase intent directly (1-5 scale), the results were useless:

  • Narrow distributions clustered around "3"—safe, hedged, middle-of-the-road

  • Almost no 1s or 5s—where real humans concentrate their responses

  • Poor distributional similarity (KS similarity: 0.26-0.39)

The LLMs could rank products relatively well (80% correlation with human rankings), but the distributions were so artificial you couldn't use them for forecasting, segmentation, or understanding intensity of preference.

This matters because distribution shape is signal. A 3.8 average from a bimodal "love it or hate it" distribution means something completely different from a 3.8 where everyone rates it "pretty good, I guess."
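
To make that concrete, here are two made-up 5-point distributions with the same 3.8 mean but very different shapes (an illustration, not data from the paper):

```python
import numpy as np

scale = np.arange(1, 6)
bimodal  = np.array([0.25, 0.00, 0.05, 0.10, 0.60])  # love-it-or-hate-it
unimodal = np.array([0.00, 0.05, 0.25, 0.55, 0.15])  # "pretty good, I guess"

# Both distributions have a 3.8 mean; only the shape tells them apart.
print((scale * bimodal).sum(), (scale * unimodal).sum())
```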

The Fix: Stop Asking for Numbers

Instead of eliciting numeric ratings directly, the researchers used Semantic Similarity Rating (SSR):

  1. Elicit free-text responses about purchase intent

    • "I'm somewhat interested. If it works well and isn't too expensive, I might give it a try."

    • "Seems kinda bougie for body wash. I'll stick with what I know."

  2. Create anchor statements for each Likert point (1-5)

    • 1: "It's very unlikely I'd buy it"

    • 3: "I'm unsure whether I'd purchase this"

    • 5: "It's very likely I'd buy it"

  3. Use embeddings to map responses to probability distributions

    • Calculate cosine similarity between the response and each anchor

    • Convert similarities to probabilities across the 5-point scale

    • Aggregate to get realistic response distributions
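
Here is a minimal sketch of that pipeline in Python. The embedding model, the wording of the 2 and 4 anchors, and the similarity-to-probability mapping (a temperature-scaled softmax) are all stand-ins; the paper's exact choices may differ.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

ANCHORS = [
    "It's very unlikely I'd buy it",         # 1
    "It's unlikely I'd buy it",              # 2 (assumed wording)
    "I'm unsure whether I'd purchase this",  # 3
    "It's likely I'd buy it",                # 4 (assumed wording)
    "It's very likely I'd buy it",           # 5
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

def ssr_distribution(response: str, anchors=ANCHORS, temperature: float = 0.05) -> np.ndarray:
    """Map one free-text answer to a probability distribution over the 1-5 scale."""
    vecs = model.encode([response] + list(anchors))
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize
    sims = vecs[1:] @ vecs[0]              # cosine similarity to each anchor
    logits = sims / temperature            # sharpen before softmax (free parameter)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

responses = [
    "I'm somewhat interested. If it works well and isn't too expensive, I might try it.",
    "Seems kinda bougie for body wash. I'll stick with what I know.",
]
# Averaging per-response distributions gives the population-level distribution.
population = np.mean([ssr_distribution(r) for r in responses], axis=0)
print(population.round(3))
```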

The Results: 90% of Human Reliability

With SSR, synthetic consumers achieved:

  • 90% correlation attainment vs. human test-retest reliability (the ceiling for this type of task)

  • KS similarity > 0.85 for response distributions (vs. the 0.26-0.39 direct-rating baseline)

  • Zero training data required: pure zero-shot elicitation

  • Bonus: rich qualitative feedback explaining every rating
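
For readers wondering what the KS numbers measure: assuming KS similarity here means 1 minus the Kolmogorov-Smirnov statistic between the two 5-point response histograms, a minimal sketch with made-up numbers looks like this:

```python
import numpy as np

def ks_similarity(p: np.ndarray, q: np.ndarray) -> float:
    """1 minus the largest gap between the two cumulative distributions."""
    return 1.0 - np.abs(np.cumsum(p) - np.cumsum(q)).max()

human     = np.array([0.10, 0.15, 0.20, 0.35, 0.20])  # made-up human histogram
synthetic = np.array([0.05, 0.20, 0.45, 0.25, 0.05])  # clustered around "3"
print(ks_similarity(human, synthetic))  # 0.75 for these made-up numbers
```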

The breakthrough wasn't the LLM. It was respecting that purchase intent is fuzzy and multidimensional, and using embeddings to capture that ambiguity rather than forcing premature quantification.

Why Demographics Still Matter (A Lot)

One surprising finding: prompting with demographics dramatically improved accuracy.

Synthetic consumers conditioned on age and income level replicated real human patterns:

  • Younger and older consumers rated purchase intent lower than middle-aged cohorts

  • Budget-conscious consumers were systematically more cautious

  • Price tier sensitivity matched real behavior
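
For illustration, demographic conditioning can be as simple as injecting persona attributes into the elicitation prompt. This template and its Persona fields are hypothetical stand-ins, not the paper's actual prompts:

```python
from dataclasses import dataclass

@dataclass
class Persona:
    age: int
    income_bracket: str  # e.g. "under $35k", "$35k-$75k", "over $75k"

def purchase_intent_prompt(persona: Persona, concept: str) -> str:
    # Free-text elicitation: the persona answers in words, never a number.
    return (
        f"You are a {persona.age}-year-old consumer with a household income "
        f"of {persona.income_bracket}.\n"
        f"Product concept: {concept}\n"
        "In one or two sentences, describe how likely you would be to buy "
        "this product and why. Answer in your own words; do not give a number."
    )

print(purchase_intent_prompt(Persona(29, "under $35k"), "A premium exfoliating body wash"))
```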

But when they ran the same tests without demographic conditioning, distributional similarity stayed high (0.91) while correlation dropped to 50%.

Translation: generic synthetic consumers will tell you what "everyone" thinks (high ratings across the board) but miss the signal about which concepts resonate with whom.

This is exactly why Ditto builds population-true digital twins with demographic, psychographic, and behavioral grounding. The LLM needs real constraints—census data, market structure, income realities—to simulate real people, not "average consumers."

What This Means for Real Research Decisions

Three implications for anyone running concept tests:

1. Distribution shape > mean scores

If your synthetic consumers all rate a concept 3.5-4.0, you haven't learned much. The question is: are your 4s "confident yes with minor concerns" or "fine but forgettable"? SSR preserves that distinction.

2. Qualitative + quantitative, automatically

Traditional surveys give you Likert scores or open-text comments (if respondents bother). SSR gives you both by default—every rating comes with reasoning. This is especially valuable for early-stage concepts where you need to understand why something resonates.

3. Zero-shot is viable (with caveats)

You don't need to fine-tune on proprietary human surveys to get reliable results—but you do need rigorous elicitation methodology, demographic grounding, and validation against real-world signals.

The Limits (Because Everything Has Limits)

The paper is honest about boundaries:

  • Domain dependence: This worked for personal care products because LLMs have been exposed to abundant consumer discussions in training data. For truly novel categories, reliability will drop.

  • Demographic gaps: Age and income patterns transferred well; gender and ethnicity did not.

  • Reference statement sensitivity: Different anchor sets produced slightly different mappings (they averaged across 6 sets to stabilize).
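
That stabilization step is just an average of the per-set SSR distributions. A sketch, reusing the ssr_distribution function from the earlier snippet (which already takes the anchor list as a parameter); the alternate anchor set below is a hypothetical paraphrase, not one of the paper's six:

```python
import numpy as np

# Hypothetical alternate wording; ANCHORS and ssr_distribution come from
# the earlier SSR sketch.
anchor_set_b = [
    "I definitely would not purchase this",
    "I probably would not purchase this",
    "I might or might not purchase this",
    "I probably would purchase this",
    "I definitely would purchase this",
]

def stabilized_ssr(response: str, anchor_sets) -> np.ndarray:
    # One SSR distribution per anchor set, then an elementwise mean.
    return np.mean([ssr_distribution(response, anchors=a) for a in anchor_sets], axis=0)

print(stabilized_ssr("I might give it a try.", [ANCHORS, anchor_set_b]).round(3))
```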

The researchers also note that synthetic consumers showed less positivity bias than humans, producing a wider spread in ratings. That's actually useful for early concept screening (you get a more discriminating signal), but it means absolute scores won't match human surveys 1:1.

Why This Validates Ditto's Approach

We've been building population-true digital twins using similar principles:

  • Context-rich grounding (census data, market structure, cultural signals, news)

  • Free-form reasoning before quantification (our twins "think out loud")

  • Distributional validation (checking that response patterns match known market segments)

The SSR paper provides academic validation that:

  1. This approach works at scale (57 products, 9,300+ synthetic responses)

  2. Zero-shot elicitation can reach 90% reliability (vs human test-retest)

  3. Demographic conditioning is non-negotiable for realistic results

The Real Unlock: Speed × Fidelity

Traditional research made you choose: fast or accurate. Rapid online panels sacrificed representativeness. Recruiting representative samples took months.

LLM-based synthetic research promises both—but only if you solve the elicitation problem. Naive prompting gives you speed without fidelity. SSR (and methods like it) give you both.

This is why synthetic research is finally ready for high-stakes decisions: we can now generate responses that are:

  • Population-true (grounded in real demographic distributions)

  • Context-aware (informed by category behavior and current events)

  • Distributionally valid (matching human response patterns, not caricatures)

  • Instant (minutes, not months)

The era of "we need to validate this with humans first" is ending. Not because synthetic research replaces humans, but because it's finally reliable enough to trust for iterating concepts, narrowing options, and identifying which ideas deserve human validation.


Ready to Experience Digital Twin Intelligence?

See how population-true digital twins can transform your market research and strategic decision-making.

Book a Demo