Analyzing Experiment Results

When the experiment ends, the analysis begins. Your job is to extract truth from data while avoiding common interpretation mistakes.

The Analysis Framework

Follow this systematic approach:

1. Validity checks → Is the experiment trustworthy?
2. Primary metric → What does it show?
3. Statistical significance → Is it real?
4. Practical significance → Does it matter?
5. Segment analysis → Who benefits?
6. Guardrails → Any red flags?
7. Decision → Launch, iterate, or kill?

Validity Checks First

Before looking at results, verify the experiment ran correctly:

Check	What to Look For	Red Flag
Sample ratio	50/50 split achieved?	>1% deviation
Pre-experiment metrics	Groups balanced?	Different baselines
Implementation	Feature deployed correctly?	Engineering bugs
Duration	Full weeks completed?	Partial weeks

Interview insight: "I always check sample ratio mismatch (SRM) first. If my 50/50 split ended up 52/48, something went wrong with randomization and the results aren't trustworthy."
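The SRM check described above can be sketched as a chi-square goodness-of-fit test. This is a minimal illustration, not a prescribed method: the function names and the 0.001 alert threshold are assumptions. Whether a given split is alarming depends on sample size: 52/48 across 100,000 users is far outside chance, while 52/48 across 100 users is not.

```python
import math

def srm_p_value(n_control: int, n_treatment: int, expected_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the observed split.

    expected_share is the planned fraction of users in control (0.5 for a 50/50 test).
    """
    total = n_control + n_treatment
    exp_control = total * expected_share
    exp_treatment = total * (1 - expected_share)
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)), so no external library is needed.
    return math.erfc(math.sqrt(chi2 / 2))

def has_srm(n_control: int, n_treatment: int,
            expected_share: float = 0.5, alpha: float = 0.001) -> bool:
    """Flag a sample ratio mismatch. alpha=0.001 is an assumed, deliberately strict threshold."""
    return srm_p_value(n_control, n_treatment, expected_share) < alpha

# Same 52/48 ratio, very different conclusions depending on sample size.
print(has_srm(52_000, 48_000))  # large experiment
print(has_srm(52, 48))          # tiny experiment
```

A strict threshold is used here because a false SRM alarm discards an entire experiment; teams may reasonably pick a different cutoff.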

Statistical vs Practical Significance

Two separate questions:

Statistical significance: Is the observed effect unlikely to be explained by random noise alone?

Answer: p < 0.05 (typically)

Practical significance: Is the effect large enough to matter?

Answer: Depends on business context

Example:
- p = 0.01 (highly significant)
- Effect: +0.01 percentage points of conversion (5.00% → 5.01%)
- 95% CI: [0.002, 0.018] percentage points

Statistically significant, but is +0.01 percentage points worth the engineering maintenance cost?
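To make both questions concrete, here is a minimal sketch of a two-proportion z-test that returns the p-value (statistical significance) alongside the effect size and its confidence interval (practical significance). The function name and the example counts are hypothetical, and the normal approximation assumes reasonably large samples.

```python
import math
from statistics import NormalDist

def two_proportion_test(conv_c: int, n_c: int, conv_t: int, n_t: int,
                        confidence: float = 0.95):
    """Return (absolute difference, two-sided p-value, confidence interval).

    conv_* are conversion counts, n_* are users per group.
    """
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    diff = p_t - p_c

    # Pooled standard error for the hypothesis test (assumes no true difference).
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval around the difference.
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, p_value, (diff - z_crit * se, diff + z_crit * se)

# Hypothetical counts: 5.0% vs 6.0% conversion with 10,000 users per group.
diff, p_value, (ci_low, ci_high) = two_proportion_test(500, 10_000, 600, 10_000)
print(f"diff={diff:.4f}, p={p_value:.4f}, CI=[{ci_low:.4f}, {ci_high:.4f}]")
```

The p-value answers "is it real?"; the difference and interval answer "how big is it?". Both are needed before deciding.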

Interview question: "We found a significant result with p=0.02, but the lift is only 0.5%. Should we launch?"

Good answer: "I'd calculate the business impact. If 0.5% lift means $1M annual revenue, probably yes. If it means $10K but requires ongoing maintenance, maybe not. I'd also check if the confidence interval includes effects large enough to be clearly worthwhile."
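The arithmetic behind that answer can be sketched as follows. The baseline revenue and maintenance cost figures are hypothetical, chosen only to reproduce the $1M and $10K cases in the answer, and the sketch assumes revenue scales linearly with the lifted metric.

```python
def annual_impact(baseline_annual_revenue: float, relative_lift: float) -> float:
    """Extra annual revenue, assuming revenue scales linearly with the metric."""
    return baseline_annual_revenue * relative_lift

def net_annual_value(baseline_annual_revenue: float, relative_lift: float,
                     annual_maintenance_cost: float) -> float:
    """Impact minus the (hypothetical) yearly cost of maintaining the feature."""
    return annual_impact(baseline_annual_revenue, relative_lift) - annual_maintenance_cost

# Same 0.5% lift, two hypothetical businesses, same hypothetical $50K upkeep.
for baseline in (200_000_000, 2_000_000):
    net = net_annual_value(baseline, 0.005, 50_000)
    print(f"baseline ${baseline:,}: net annual value ${net:,.0f}")
```

Running the same calculation at the lower and upper bounds of the confidence interval shows whether the decision holds across the plausible range of effects.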

Confidence Intervals Over p-Values

Confidence intervals provide more information:

Scenario	p-value	95% CI	Interpretation
A	0.02	[0.5%, 3.0%]	Significant, effect likely 0.5-3%
B	0.02	[0.01%, 0.1%]	Significant, but tiny effect
C	0.15	[-0.5%, 2.5%]	Not significant, but could be meaningful

Pro tip: If the 95% CI includes zero, the result is not significant at the 0.05 level. The width of the CI shows your precision: a narrower interval means a more precise estimate.
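One way to read the three scenarios in the table is to compare each interval against a minimum worthwhile effect. The sketch below does that; the function name, the labels, and the 0.5% threshold are illustrative assumptions, not part of any standard.

```python
def interpret_ci(ci_low: float, ci_high: float, min_effect: float) -> str:
    """Classify a confidence interval for a lift (all values in the same units).

    min_effect is the smallest lift the business would consider worthwhile.
    """
    if ci_high < 0:
        return "significant harm"
    if ci_low > 0:  # interval excludes zero on the positive side
        if ci_low >= min_effect:
            return "significant and clearly meaningful"
        if ci_high < min_effect:
            return "significant but too small to matter"
        return "significant, size uncertain"
    # Interval includes zero: not significant.
    if ci_high >= min_effect:
        return "not significant, meaningful effect still plausible"
    return "not significant, any gain is below the threshold"

# Scenarios A, B, C from the table, with an assumed 0.5% threshold.
for name, (low, high) in {"A": (0.5, 3.0), "B": (0.01, 0.1), "C": (-0.5, 2.5)}.items():
    print(name, "->", interpret_ci(low, high, min_effect=0.5))
```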

Segment Analysis

Look beyond the aggregate:

Key segments to always check:

Device (mobile vs desktop)
New vs returning users
Geography (if relevant)
User tenure/maturity

Example finding:

Overall: +2% conversion (significant)

By device:
- Mobile: +5% conversion (significant)
- Desktop: -1% conversion (not significant)

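Insight: The feature works well on mobile; the desktop result is inconclusive, so a small negative effect can't be ruled out.
Consider a mobile-only launch, and treat segment findings as hypotheses to confirm, since checking many segments raises the odds of a false positive.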
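A per-segment breakdown like the one above might be computed as in this sketch. The counts are hypothetical, chosen to mirror the +5% mobile and -1% desktop example, and the Bonferroni adjustment (dividing the significance threshold by the number of segments) is one cautious choice among several, added here as an assumption.

```python
import math
from statistics import NormalDist

def z_test_p_value(conv_c: int, n_c: int, conv_t: int, n_t: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (conv_t / n_t - conv_c / n_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def analyze_segments(segments: dict, alpha: float = 0.05) -> dict:
    """segments maps name -> (control conversions, control users,
    treatment conversions, treatment users)."""
    # Bonferroni: split alpha across segments to limit false positives.
    adjusted_alpha = alpha / len(segments)
    results = {}
    for name, (conv_c, n_c, conv_t, n_t) in segments.items():
        p_value = z_test_p_value(conv_c, n_c, conv_t, n_t)
        results[name] = {
            "relative_lift": (conv_t / n_t) / (conv_c / n_c) - 1,
            "p_value": p_value,
            "significant": p_value < adjusted_alpha,
        }
    return results

# Hypothetical counts for illustration only.
segments = {
    "mobile": (5_000, 100_000, 5_250, 100_000),
    "desktop": (5_000, 100_000, 4_950, 100_000),
}
for name, r in analyze_segments(segments).items():
    print(f"{name}: lift={r['relative_lift']:+.1%}, p={r['p_value']:.3f}, "
          f"significant={r['significant']}")
```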

Interpreting Null Results

"Not significant" doesn't mean "no effect":

Possible interpretations:

No true effect exists
Effect exists but is smaller than the experiment was designed to detect
Effect exists but the test was underpowered (too few users or too much variance)
Effect exists in segments we didn't analyze

How to report:

Good: "We observed a +0.8% lift, but this was not statistically
significant (p=0.23, 95% CI: [-0.5%, 2.1%]). With our sample size,
we could only reliably detect effects ≥2%. We cannot conclude
whether the feature has a small positive effect or no effect."
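The "could only reliably detect" figure in a report like this can be approximated from the confidence interval itself. The sketch below recovers the standard error from the interval width and applies the usual normal-approximation formula for the minimum detectable effect; 80% power is an assumed convention, and the function names are illustrative.

```python
from statistics import NormalDist

def se_from_ci(ci_low: float, ci_high: float, confidence: float = 0.95) -> float:
    """Recover the standard error from a symmetric normal-based interval."""
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return (ci_high - ci_low) / (2 * z_crit)

def minimum_detectable_effect(se: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate smallest true effect detectable with the given power.

    Uses the same standard error under the null and the alternative,
    which is a simplification.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) * se

def two_sided_p_value(effect: float, se: float) -> float:
    return 2 * (1 - NormalDist().cdf(abs(effect / se)))

# Numbers from the report above: +0.8% lift, 95% CI [-0.5%, 2.1%].
se = se_from_ci(-0.5, 2.1)
print(f"SE = {se:.2f}%")
print(f"p  = {two_sided_p_value(0.8, se):.2f}")
print(f"MDE at 80% power = {minimum_detectable_effect(se):.1f}%")
```

Under these assumptions the interval implies a p-value of about 0.23 and a minimum detectable effect a little under 2%, which is in line with the wording of the report.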

Bad: "The feature doesn't work."

Making the Decision

Combine all evidence:

Signal	Launch	Don't Launch
Primary metric	Significant lift	Not significant or negative
Practical size	Business-meaningful	Too small to matter
Guardrails	All healthy	Any red flags
Segments	Consistent or positive	Harms key segments
Confidence	Narrow CI, clear result	Wide CI, uncertain

When it's ambiguous:

Run longer if more data would help
Consider limited launch (one segment)
Iterate on the feature and retest
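The decision table and the fallback options can be encoded as a simple rule set. This is only one possible encoding, sketched under assumptions: the `Evidence` fields, the thresholds, and the rule order are all hypothetical, and a real decision would weigh context that a function cannot capture.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    lift: float                   # point estimate of the primary metric lift
    ci_low: float                 # lower bound of the 95% CI
    ci_high: float                # upper bound of the 95% CI
    min_worthwhile_effect: float  # smallest lift that matters to the business
    guardrails_healthy: bool
    harms_key_segment: bool

def recommend(e: Evidence) -> str:
    if not e.guardrails_healthy:
        return "don't launch"
    # Even the most optimistic plausible effect is too small (or negative).
    if e.ci_high < e.min_worthwhile_effect:
        return "don't launch"
    significant_lift = e.ci_low > 0
    meaningful = e.lift >= e.min_worthwhile_effect
    if significant_lift and meaningful:
        # A harmed segment suggests launching only where the feature helps.
        return "limited launch" if e.harms_key_segment else "launch"
    # Promising but uncertain: more data or a revised feature is needed.
    return "run longer or iterate"

print(recommend(Evidence(1.75, 0.5, 3.0, 0.5, True, False)))
print(recommend(Evidence(1.0, -0.5, 2.5, 0.5, True, False)))
```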

Interview framework: "For this decision, I'd summarize: The treatment showed a [X%] lift in [primary metric] (p=[value], 95% CI: [range]). Guardrail metrics [were/weren't] impacted. Segment analysis revealed [findings]. My recommendation is [launch/don't launch/iterate] because [reasoning]."

Always tie statistical findings back to business impact. Numbers alone don't make decisions - context does.