Analyzing Experiment Results
When the experiment ends, the analysis begins. Your job is to extract truth from data while avoiding common interpretation mistakes.
The Analysis Framework
Follow this systematic approach:
1. Validity checks → Is the experiment trustworthy?
2. Primary metric → What does it show?
3. Statistical significance → Is it real?
4. Practical significance → Does it matter?
5. Segment analysis → Who benefits?
6. Guardrails → Any red flags?
7. Decision → Launch, iterate, or kill?
Validity Checks First
Before looking at results, verify the experiment ran correctly:
| Check | What to Look For | Red Flag |
|---|---|---|
| Sample ratio | 50/50 split achieved? | Statistically significant mismatch (e.g., >1% deviation at large sample sizes) |
| Pre-experiment metrics | Groups balanced? | Different baselines |
| Implementation | Feature deployed correctly? | Engineering bugs |
| Duration | Full weeks completed? | Partial weeks |
Interview insight: "I always check sample ratio mismatch (SRM) first. If my 50/50 split ended up 52/48 on a large sample, something went wrong with randomization or logging, and I wouldn't trust the results until I found the cause."
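The SRM check above can be sketched as a chi-square goodness-of-fit test. This is a minimal illustration, not a prescribed method: the function name `srm_check` and the strict `alpha=0.001` threshold are assumptions (a strict threshold is a common convention for SRM alerts, but pick what fits your platform).

```python
import math

def srm_check(n_control: int, n_treatment: int,
              expected_ratio: float = 0.5, alpha: float = 0.001):
    """Chi-square goodness-of-fit test (1 degree of freedom) for sample ratio mismatch.

    expected_ratio is the share of traffic the control group was supposed to get.
    Returns (chi2 statistic, p-value, flagged).
    """
    total = n_control + n_treatment
    exp_control = total * expected_ratio
    exp_treatment = total * (1 - expected_ratio)
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < alpha

# The same 52/48 split at two hypothetical sample sizes.
print(srm_check(5200, 4800))  # large sample: flagged
print(srm_check(52, 48))      # tiny sample: not flagged
```

A 52/48 split is a clear mismatch with 10,000 users but entirely plausible with 100, which is why a test is more reliable than a fixed percentage cutoff.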
Statistical vs Practical Significance
Two separate questions:
Statistical significance: Is the observed effect distinguishable from random noise?
- Answer: p < 0.05 (the typical threshold)
Practical significance: Is the effect large enough to matter?
- Answer: Depends on business context
Example:
- p = 0.01 (highly significant)
- Effect: +0.01 percentage points of conversion (5.00% → 5.01%)
- 95% CI: [0.002, 0.018] percentage points
Statistically significant, but is +0.01 percentage points worth the engineering maintenance cost?
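To see where numbers like these come from, here is a sketch of a two-proportion z-test using only the standard library. The function name and the user counts are hypothetical; the counts are chosen only to reproduce a 5.00% → 5.01% change, and a real analysis may use a different test (for example, one that accounts for repeated looks at the data).

```python
import math
from statistics import NormalDist

def two_proportion_test(conv_c: int, n_c: int, conv_t: int, n_t: int,
                        confidence: float = 0.95):
    """Two-sided z-test for a difference in conversion rates.

    Returns (absolute difference, p-value, confidence interval for the difference).
    """
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    diff = p_t - p_c

    # Pooled standard error under the null hypothesis of no difference.
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval.
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, p_value, (diff - z_crit * se, diff + z_crit * se)

# Hypothetical counts: 63 million users per arm, 5.00% vs 5.01% conversion.
diff, p_value, ci = two_proportion_test(3_150_000, 63_000_000,
                                        3_156_300, 63_000_000)
print(f"diff={diff:.4%}  p={p_value:.3f}  CI=({ci[0]:.4%}, {ci[1]:.4%})")
```

Note how much traffic it takes to reach significance on an effect this small. Enormous samples make almost any difference statistically significant, which is exactly why practical significance needs its own check.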
Interview question: "We found a significant result with p=0.02, but the lift is only 0.5%. Should we launch?"
Good answer: "I'd calculate the business impact. If 0.5% lift means $1M annual revenue, probably yes. If it means $10K but requires ongoing maintenance, maybe not. I'd also check if the confidence interval includes effects large enough to be clearly worthwhile."
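The arithmetic behind that answer is simple enough to sketch. Everything here is an assumption for illustration: the function name, the two baseline revenues (picked so that a 0.5% lift equals $1M and $10K), the $50K maintenance cost, and the simplification that a relative lift in the metric translates one-to-one into revenue.

```python
def net_annual_value(baseline_annual_revenue: float, relative_lift: float,
                     annual_maintenance_cost: float = 0.0) -> float:
    """Extra annual revenue from the lift minus the yearly cost of keeping the feature."""
    return baseline_annual_revenue * relative_lift - annual_maintenance_cost

# Hypothetical businesses matching the two cases in the answer above.
large = net_annual_value(200_000_000, 0.005, annual_maintenance_cost=50_000)
small = net_annual_value(2_000_000, 0.005, annual_maintenance_cost=50_000)
print(f"large business: ${large:,.0f}   small business: ${small:,.0f}")
```

Passing the lower and upper bounds of the confidence interval as `relative_lift` gives a pessimistic and an optimistic estimate to put next to the point estimate.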
Confidence Intervals Over p-Values
Confidence intervals provide more information:
| Scenario | p-value | 95% CI | Interpretation |
|---|---|---|---|
| A | 0.02 | [0.5%, 3.0%] | Significant, effect likely 0.5-3% |
| B | 0.02 | [0.01%, 0.1%] | Significant, but tiny effect |
| C | 0.15 | [-0.5%, 2.5%] | Not significant, but could be meaningful |
Pro tip: If the 95% CI includes zero, the result is not significant at the 5% level. The width of the CI shows your precision: a narrower interval means a more precise estimate.
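One way to make this reading of intervals mechanical is to compare the CI against a minimum worthwhile effect. The sketch below is an assumption-laden illustration: `interpret_ci`, its labels, and the 0.5% threshold are all hypothetical, and the threshold in practice comes from your business context.

```python
def interpret_ci(ci_low: float, ci_high: float, min_worthwhile: float) -> str:
    """Classify a CI for the lift against a minimum worthwhile effect (same units)."""
    if ci_low > 0:
        # The whole interval is above zero: a significant positive effect.
        if ci_low >= min_worthwhile:
            return "significant and clearly worthwhile"
        if ci_high < min_worthwhile:
            return "significant but too small to matter"
        return "significant, size uncertain"
    if ci_high < 0:
        return "significant harm"
    # The interval includes zero: not significant.
    if ci_high >= min_worthwhile:
        return "not significant, but could be meaningful"
    return "not significant, and any gain is below the threshold"

# The three scenarios from the table, with an assumed 0.5% threshold.
for name, (low, high) in {"A": (0.5, 3.0), "B": (0.01, 0.1), "C": (-0.5, 2.5)}.items():
    print(name, interpret_ci(low, high, min_worthwhile=0.5))
```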
Segment Analysis
Look beyond the aggregate:
Key segments to always check:
- Device (mobile vs desktop)
- New vs returning users
- Geography (if relevant)
- User tenure/maturity
Example finding:
Overall: +2% conversion (significant)
By device:
- Mobile: +5% conversion (significant)
- Desktop: -1% conversion (not significant)
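Insight: The feature works well on mobile. The desktop effect is not distinguishable from zero, so a small harm cannot be ruled out.
Consider a mobile-only launch, and treat segment findings as hypotheses to confirm, since checking many segments inflates false positives.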
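A per-segment breakdown like this one can be sketched with a z-test per segment and a Bonferroni-adjusted threshold to account for testing several segments. The function names and all counts below are hypothetical, chosen to mirror a +5% mobile and -1% desktop relative lift; Bonferroni is one simple correction among several, not the only reasonable choice.

```python
import math
from statistics import NormalDist

def z_test_p_value(conv_c: int, n_c: int, conv_t: int, n_t: int) -> float:
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (conv_t / n_t - conv_c / n_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def segment_report(segments: dict, alpha: float = 0.05) -> dict:
    """Return {segment: (relative lift, p-value, significant after Bonferroni)}."""
    threshold = alpha / len(segments)  # Bonferroni: split alpha across segments
    report = {}
    for name, (conv_c, n_c, conv_t, n_t) in segments.items():
        rel_lift = (conv_t / n_t) / (conv_c / n_c) - 1
        p_value = z_test_p_value(conv_c, n_c, conv_t, n_t)
        report[name] = (rel_lift, p_value, p_value < threshold)
    return report

# Hypothetical counts: (control conversions, control users,
#                       treatment conversions, treatment users)
report = segment_report({
    "mobile": (10_000, 200_000, 10_500, 200_000),
    "desktop": (5_000, 100_000, 4_950, 100_000),
})
for name, (lift, p, significant) in report.items():
    print(f"{name}: lift={lift:+.1%}  p={p:.4f}  significant={significant}")
```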
Interpreting Null Results
"Not significant" doesn't mean "no effect":
Possible interpretations:
- No true effect exists
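- Effect exists but is too small to detect at any realistic sample size
- Effect exists and is meaningful, but the experiment lacked the power (sample size or duration) to detect it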
- Effect exists in segments we didn't analyze
How to report:
Good: "We observed a +0.8% lift, but this was not statistically
significant (p=0.23, 95% CI: [-0.5%, 2.1%]). With our sample size,
we could only reliably detect effects ≥2%. We cannot conclude
whether the feature has a small positive effect or no effect."
Bad: "The feature doesn't work."
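The "could only reliably detect effects ≥2%" statement in the good report can be backed by a quick minimum detectable effect (MDE) calculation. This sketch assumes a normal approximation, 80% power, and a two-sided 5% test; the function names are hypothetical, and the standard error is recovered from the reported CI rather than from raw data.

```python
from statistics import NormalDist

def se_from_ci(ci_low: float, ci_high: float, confidence: float = 0.95) -> float:
    """Recover the standard error from a symmetric normal-approximation CI."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return (ci_high - ci_low) / (2 * z)

def minimum_detectable_effect(standard_error: float, alpha: float = 0.05,
                              power: float = 0.80) -> float:
    """Smallest true effect a two-sided test would detect with the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * standard_error

# Using the CI from the report above: [-0.5%, 2.1%].
se = se_from_ci(-0.5, 2.1)
mde = minimum_detectable_effect(se)
print(f"standard error = {se:.2f}%, MDE at 80% power = {mde:.2f}%")
```

With these inputs the MDE comes out a little under 2%, consistent with the report. A true lift of 0.8% would have been missed most of the time, which is why the null result is inconclusive rather than negative.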
Making the Decision
Combine all evidence:
| Signal | Launch | Don't Launch |
|---|---|---|
| Primary metric | Significant lift | Not significant or negative |
| Practical size | Business-meaningful | Too small to matter |
| Guardrails | All healthy | Any red flags |
| Segments | Consistent or positive | Harms key segments |
| Confidence | Narrow CI, clear result | Wide CI, uncertain |
When it's ambiguous:
- Run longer if more data would help
- Consider limited launch (one segment)
- Iterate on the feature and retest
Interview framework: "For this decision, I'd summarize: The treatment showed a [X%] lift in [primary metric] (p=[value], 95% CI: [range]). Guardrail metrics [were/weren't] impacted. Segment analysis revealed [findings]. My recommendation is [launch/don't launch/iterate] because [reasoning]."
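The decision table and the summary template can be combined into a small sketch. The rules below are one possible codification, not a standard: the class, field names, thresholds, and the order of the checks are all assumptions, and real decisions will weigh context that a function cannot.

```python
from dataclasses import dataclass

@dataclass
class ExperimentSummary:
    metric: str
    lift_pct: float
    p_value: float
    ci_low_pct: float
    ci_high_pct: float
    min_worthwhile_pct: float   # assumed business threshold
    guardrails_healthy: bool
    harms_key_segment: bool

def recommend(s: ExperimentSummary, alpha: float = 0.05) -> str:
    """Map the signals from the decision table to a recommendation."""
    if not s.guardrails_healthy or s.harms_key_segment:
        return "don't launch"
    significant_gain = s.p_value < alpha and s.ci_low_pct > 0
    if significant_gain and s.ci_low_pct >= s.min_worthwhile_pct:
        return "launch"
    if s.ci_high_pct < s.min_worthwhile_pct:
        return "don't launch"   # even the optimistic case is too small
    return "iterate"            # ambiguous: run longer, limit scope, or retest

def format_summary(s: ExperimentSummary) -> str:
    """Fill in the interview framework sentence."""
    guardrails = "weren't" if s.guardrails_healthy else "were"
    return (f"The treatment showed a {s.lift_pct:+.1f}% lift in {s.metric} "
            f"(p={s.p_value:.2f}, 95% CI: [{s.ci_low_pct:.1f}%, {s.ci_high_pct:.1f}%]). "
            f"Guardrail metrics {guardrails} impacted. "
            f"My recommendation is {recommend(s)}.")

# Hypothetical experiment.
example = ExperimentSummary("conversion", 2.0, 0.01, 0.8, 3.2, 0.5, True, False)
print(format_summary(example))
```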
Always tie statistical findings back to business impact. Numbers alone don't make decisions - context does.