Here's a number that should bother you: the average net worth of people in a room jumps by about a billion dollars the moment Jeff Bezos walks in. The typical person in that room got no richer. That gap — between the average and the typical — is the first thing statistics teaches you, and the AP exam tests it relentlessly.
Psychology lives and dies by numbers. A researcher can't just say a therapy works; she has to show the difference between groups is bigger than what random luck would cough up. She can't just say sleep and grades are linked; she has to put a number on how tightly. And she can't run the study at all unless a committee of strangers first agrees she won't harm anyone.
This lesson covers two skills the exam treats as one: reading the numbers a study produces, and judging whether the study was ethical to run in the first place. Master both and you've got the spine of every AAQ (Article Analysis Question) you'll ever write. Let's start by making averages honest.
Descriptive statistics organize and summarize data — they describe what you collected, nothing more. (Their counterpart, inferential statistics, lets you generalize beyond your sample; we'll get there.) The first job is finding the center.
Measures of central tendency locate the middle of a distribution three different ways. The mean is the arithmetic average — add the scores, divide by how many. The median is the literal middle score when you line them up in order. The mode is the most frequent score.
When do they disagree? Whenever the data are lopsided. Picture five quiz scores: 70, 72, 74, 76, and 8 (someone bombed it). Lined up in order (8, 70, 72, 74, 76), the median is 72, a fair "typical" score. But the mean gets yanked down to 60 by that single 8. That's the headline rule: the mean is sensitive to extreme scores (outliers); the median is resistant to them. Bezos in the room moves the mean, not the median. This is why income and home prices are reported as medians: a few billionaires would make the mean meaningless.
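If you'd like to check that arithmetic yourself, here's a minimal Python sketch using the standard library's statistics module on the same five quiz scores. The variable names are just illustrative.

```python
from statistics import mean, median

# The five quiz scores from the example, including the one very low score.
scores = [70, 72, 74, 76, 8]

print(mean(scores))    # 60
print(median(scores))  # 72

# Drop the outlier to see how far each measure moves.
without_outlier = [s for s in scores if s != 8]
print(mean(without_outlier))    # 73
print(median(without_outlier))  # 73.0
```

Removing the single 8 shifts the mean by 13 points but the median by only 1, which is the "sensitive vs. resistant" contrast in miniature.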
Try This. Write down the ages of everyone in your household, then add a 95-year-old visiting relative. Recalculate the mean and median. Watch the mean lurch toward 95 while the median barely twitches. You just felt the difference between a measure that outliers control and one they don't.
Two classes can have the same average test score of 80 and be wildly different — one where everyone scored 78–82, another where half scored 60 and half scored 100. Central tendency can't see that difference. Variability can.
The crudest measure is the range — the highest score minus the lowest. It's quick but fragile: one outlier blows it up, since it only looks at the two extremes.
The measure the exam cares about is the standard deviation (SD) — roughly, the average distance of scores from the mean. A small SD means scores cluster tightly around the mean; a large SD means they're scattered widely. If two classes both average 80 but one has SD = 2 and the other SD = 18, you instantly know the first class is homogeneous and the second is all over the map. You do not need to compute SD by hand for the AP exam — you need to interpret it. Bigger SD = more spread; smaller SD = more consistency. That's the whole test-relevant idea.
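Here's a quick sketch of the same idea, assuming two made-up classes of ten students that both average 80. It uses statistics.pstdev, the population standard deviation; the sample version (statistics.stdev) would come out slightly larger but tell the same story.

```python
from statistics import mean, pstdev

# Hypothetical classes (invented for illustration), both with a mean of 80.
tight_class = [78, 79, 79, 80, 80, 80, 80, 81, 81, 82]
split_class = [60] * 5 + [100] * 5

for name, scores in [("tight", tight_class), ("split", split_class)]:
    # Same center, very different spread.
    print(name, mean(scores), round(pstdev(scores), 1))
# tight 80 1.1
# split 80 20.0
```

You'd never compute this on the exam; the point is only that identical means can hide an SD of about 1 in one class and 20 in the other.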
Many measured traits (height, IQ, reaction time) pile up in a symmetric, bell-shaped normal distribution (normal curve) when you graph enough of them. In a normal distribution, the mean, median, and mode all sit at the same central point, and a fixed, predictable percentage of scores falls within each band of standard deviations: about 68% within 1 SD of the mean, about 95% within 2 SD, and about 99.7% within 3 SD. The 95% figure is why an IQ of 130 (two SDs above the mean of 100, with SD = 15) is so rare: if about 95% of scores fall within 2 SD, the remaining 5% splits evenly between the two tails, so only about 2.5% of people score that high or higher.
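You can sanity-check those percentages with Python's statistics.NormalDist (available in Python 3.8+), assuming IQ follows an exactly normal curve with mean 100 and SD 15. Treat it as an illustration of the rule of thumb, not something the exam asks you to compute.

```python
from statistics import NormalDist

# Assumption: IQ is exactly normal with mean 100 and SD 15.
iq = NormalDist(mu=100, sigma=15)

# Share of scores within 1, 2, and 3 SDs of the mean, as percentages.
for k in (1, 2, 3):
    within = iq.cdf(100 + k * 15) - iq.cdf(100 - k * 15)
    print(k, round(within * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7

# Share of people scoring 130 or higher (the upper tail beyond 2 SD).
at_or_above_130 = 1 - iq.cdf(130)
print(round(at_or_above_130 * 100, 1))  # 2.3
```

The exact tail is closer to 2.3%; the 2.5% figure comes from rounding the 2-SD band to an even 95%.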
When a distribution isn't symmetric, it's skewed — and the direction names itself by where the tail points, which trips up nearly everyone. A positively skewed distribution has a long tail stretching to the right (the high end); think household income, where a few enormous earners drag the tail out. A negatively skewed distribution has a long tail to the left; think an easy exam where most people score high and a few stragglers pull the tail down. The rule for the mean and median in a skew: the mean chases the tail. In a positive skew the mean lands to the right of the median; in a negative skew, to the left.
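To see "the mean chases the tail" with numbers, here's a small sketch using two invented data sets. The skew_direction helper is hypothetical and only compares mean to median, which is a rough classroom heuristic rather than a formal skewness statistic.

```python
from statistics import mean, median

# Hypothetical household incomes in thousands of dollars: long tail to the right.
incomes = [30, 35, 40, 45, 50, 55, 400]
# Hypothetical scores on an easy exam: long tail to the left.
easy_exam = [40, 85, 88, 90, 92, 95, 98]

def skew_direction(data):
    """Rough heuristic: the mean sits on the tail side of the median."""
    m, med = mean(data), median(data)
    if m > med:
        return "positive (mean right of median)"
    if m < med:
        return "negative (mean left of median)"
    return "no skew suggested"

print(median(incomes), skew_direction(incomes))
print(median(easy_exam), skew_direction(easy_exam))
```

In the income list the one 400 pulls the mean well above the median of 45; in the exam list the one 40 pulls the mean (84) below the median of 90.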
Percentile rank tells you the percentage of scores at or below a given score. If your SAT is in the 90th percentile, you scored as well as or better than 90% of test-takers. The median is always exactly the 50th percentile.
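The "at or below" definition translates directly into a few lines of Python. This is a sketch on ten made-up scores; the percentile_rank function is a hypothetical helper, not a library call.

```python
def percentile_rank(scores, score):
    """Percentage of scores that are at or below the given score."""
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)

# Hypothetical test scores, invented for illustration.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

print(percentile_rank(scores, 90))   # 80.0
print(percentile_rank(scores, 100))  # 100.0
```

Eight of the ten scores are at or below 90, so a 90 sits at the 80th percentile in this tiny sample.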
A correlation coefficient (r) is a single number, from −1.00 to +1.00, that captures two things about the relationship between two variables: its direction and its strength.
Direction is the sign. A positive correlation (+) means the variables move together — as one goes up, the other tends to go up (hours studied and grades). A negative correlation (−) means they move in opposite directions — as one goes up, the other tends to go down (hours of TV and grades, perhaps). A correlation near zero means no linear relationship.
Strength is the absolute value — how close r is to 1, regardless of sign. Here's the trap that sinks students: a correlation of −.80 is STRONGER than a correlation of +.30. The minus sign is not a weakness; it's a direction. An r of −.90 describes a tighter, more predictable relationship than an r of +.40. Strength lives in the distance from zero, not in the sign.
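Put as code, "strength is the distance from zero" is just a sort by absolute value. A minimal sketch using the four r values from the paragraph above; the direction helper is illustrative.

```python
# The correlation values mentioned in the passage.
correlations = [0.30, -0.80, 0.40, -0.90]

# Strength is the distance from zero, so rank by absolute value.
strongest_first = sorted(correlations, key=abs, reverse=True)
print(strongest_first)  # [-0.9, -0.8, 0.4, 0.3]

def direction(r):
    """The sign only says which way the variables move together."""
    if r > 0:
        return "positive"
    if r < 0:
        return "negative"
    return "none"

print(direction(-0.90), abs(-0.90))  # negative 0.9
```

Both negative values outrank both positive ones, because the sign never enters the strength comparison.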
On a scatterplot, each dot is one participant plotted on two variables. The tighter the dots hug an imaginary straight line, the stronger the correlation (closer to ±1). A cloud of dots with no slope is r ≈ 0. An upward slope is positive; a downward slope is negative.
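For the curious, here's one way r could be computed from the dots on a scatterplot, using the standard Pearson formula. The lesson doesn't require this calculation, and the six participants below are invented for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: shared variation divided by total variation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(spread_x * spread_y)

# Hypothetical data: one entry per participant.
hours_studied = [1, 2, 3, 4, 5, 6]
hours_tv = [6, 5, 4, 3, 2, 1]
grades = [62, 70, 68, 80, 85, 90]

print(round(pearson_r(hours_studied, grades), 2))  # upward slope, positive r
print(round(pearson_r(hours_tv, grades), 2))       # downward slope, negative r
```

Because the made-up TV hours are an exact mirror of the study hours, the two r values come out equal in strength and opposite in sign: same tightness, different slope.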
And the cardinal rule, repeated until it's reflex: correlation does not prove causation. If ice-cream sales correlate with drownings, ice cream isn't drowning anyone — a third variable (summer heat) drives both. Only a controlled experiment can establish cause.
Suppose a new study method raises a treatment group's mean score above a control group's. How do you know that gap is real and not just the random luck of which students landed in which group? Inferential statistics let you decide whether a result likely generalizes to the larger population or is probably just chance.
The key idea is statistical significance. A result is statistically significant when it is unlikely to have occurred by chance alone. By long-standing convention, psychologists use the cutoff p < .05, meaning: if there were truly no effect, you'd see a result this large (or larger) less than 5% of the time by chance. Cross that threshold and the result is "significant"; you reject the idea that it's pure luck.
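One way to make "how often would chance alone produce a gap this big?" concrete is an exact permutation test: reshuffle the scores into every possible pair of groups and count how often the gap is at least as large as the one observed, in either direction. That technique is swapped in here for illustration; it isn't one the lesson or the exam names, and the scores are invented.

```python
from itertools import combinations

# Hypothetical test scores, invented for illustration.
treatment = [84, 88, 91, 79, 86]  # mean 85.6
control = [78, 82, 75, 80, 77]    # mean 78.4

pooled = treatment + control
total = sum(pooled)
group_size = len(treatment)

def gap_in_sums(group_sum):
    """One group's sum minus the other's. With equal group sizes this
    ranks splits exactly as the gap in means would."""
    return group_sum - (total - group_sum)

observed = gap_in_sums(sum(treatment))

# Try every way of dealing the ten scores into two groups of five.
splits = 0
as_extreme = 0
for group in combinations(pooled, group_size):
    splits += 1
    if abs(gap_in_sums(sum(group))) >= abs(observed):
        as_extreme += 1

p_value = as_extreme / splits
print(splits, as_extreme, round(p_value, 3))  # 252 6 0.024
```

Only 6 of the 252 possible splits produce a gap this large, so p is about .024, under the .05 cutoff. With these made-up numbers, the result would count as statistically significant.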
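Now the part the exam loves to test: what significance does not mean. It does not mean the effect is large or important. It does not mean the hypothesis is proven, since there is still a small chance the result is a fluke. And the p-value is not the probability that the hypothesis is true.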
Because significance ignores size, researchers also report effect size — a measure of how big the difference or relationship actually is, independent of sample size. A result can be statistically significant (reliable) yet have a tiny effect size (trivial in the real world). The sophisticated reading of any study holds both in mind: Is it real (significance) AND does it matter (effect size)?
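One common effect-size measure is Cohen's d: the gap between two group means divided by their pooled standard deviation. The lesson doesn't name a specific measure, so take this as one illustrative choice; the cohens_d helper and the scores are assumptions made up for the sketch.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Hypothetical scores: one comparison group and two shifted copies of it.
baseline = [70, 75, 80, 85, 90]
tiny_shift = [s + 0.3 for s in baseline]  # everyone gains 0.3 points
big_shift = [s + 8 for s in baseline]     # everyone gains 8 points

print(round(cohens_d(tiny_shift, baseline), 2))  # 0.04
print(round(cohens_d(big_shift, baseline), 2))   # 1.01
```

Unlike a p-value, d does not grow just because you recruit more participants, which is why a 0.3-point bump stays tiny on this scale no matter how "significant" a huge sample makes it.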
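Before any of those numbers get collected, the study has to clear ethics. In the United States, every institution running human research has an Institutional Review Board (IRB), a committee that reviews proposed studies to ensure participants are protected before the research begins. The APA (American Psychological Association) publishes the ethical guidelines the IRB enforces. The core requirements: informed consent (participants agree to take part knowing what is involved), protection from harm, confidentiality of participants' data, and, when deception is used, a full debriefing afterward that reveals the study's true purpose.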
For animal research, separate guidelines apply: studies must have a clear scientific purpose, animals must be cared for humanely, and discomfort must be minimized. Animal research is permitted but strictly regulated. Whenever an AAQ asks you to name an ethical guideline a study followed, these terms are your answer bank.
Stanley Milgram's obedience study — as an ethics flashpoint (Yale, 1963).
Who & when: Stanley Milgram, at Yale, beginning 1961 with results published in 1963. (You'll meet this study again in Unit 4 for what it revealed about obedience; here it's the field's most-cited ethics cautionary tale.)
What he did: Participants were told they were in a "learning" experiment and ordered to deliver what they believed were escalating, painful electric shocks to another person each time he answered wrong. The shocks were fake and the "learner" was an actor — but the participants didn't know that. About 65% obeyed all the way to the maximum 450-volt level.
The ethical problem: Participants were deceived about the study's true purpose, and many showed genuine, severe distress — sweating, trembling, nervous laughter — raising the question of whether they were adequately protected from harm. Some argued they could not have given truly informed consent to an experience that traumatic.
Why it matters: The outcry over Milgram's methods (alongside studies like Zimbardo's prison experiment) directly shaped the modern ethics regime — IRBs, mandatory informed consent, and required debriefing. Milgram did debrief his participants and followed up on their wellbeing, but the case crystallized why the rules exist. For the exam: Milgram = obedience finding and the textbook example of the deception-vs-protection-from-harm tension.
Scenario 1. A real-estate report announces that the median home price in a town is \$310,000, while the mean is \$540,000. A student concludes the report contradicts itself.
What's actually going on, and which statistic is more informative here? There's no contradiction — the gap reveals a positively skewed distribution. A handful of multimillion-dollar mansions (high-end outliers) pull the mean sharply upward while leaving the median near the typical home. Because the mean is sensitive to outliers and the median is resistant, the median (\$310,000) better represents what a typical buyer faces. The very size of the mean-median gap is your evidence of skew.
Scenario 2. A researcher finds that a new tutoring program raised test scores with p = .03, and a news headline blares "Tutoring Dramatically Boosts Scores!"
What can you legitimately conclude, and where does the headline overreach? Because p = .03 is below .05, the result is statistically significant — the score increase is unlikely to be due to chance. That's the legitimate claim. But "dramatically" overreaches: significance is not effect size. With a large enough sample, even a tiny, real-world-trivial bump clears p < .05. Without knowing the effect size, you cannot say the boost was large. The headline confuses "real" with "big."
Scenario 3. A study reports that the correlation between daily screen time and reported loneliness is r = +.45, and between hours of in-person socializing and loneliness is r = −.62. A blog claims the first relationship is the stronger one "because it's positive."
Which relationship is actually stronger, and what does each sign mean? The blog has it backward. Strength is the distance from zero, so r = −.62 is the stronger relationship — more in-person socializing reliably tracks with less loneliness. The signs only give direction: screen time and loneliness rise together (positive); socializing and loneliness move opposite (negative). A negative correlation is not a weak correlation. And note: even r = −.62 cannot prove socializing causes lower loneliness — a third variable (say, depression) could drive both.
Mean vs. median under skew. Students memorize "average" and stop. The trap is forgetting that the mean follows the tail while the median holds steady. Mnemonic: the mean is a pushover — outliers shove it around; the median has a backbone. Whenever mean and median differ sharply, suspect a skewed distribution and trust the median for "typical."
Correlation strength vs. sign. The single most-missed stats point on the exam. The sign (+/−) is direction; the absolute value is strength. −.80 beats +.30 every time. Mnemonic: strip the sign to size it up. Cover the +/− with your thumb, read the number, that's the strength.
Statistical significance vs. importance. "Significant" in statistics means "probably not chance," not "big" or "meaningful." A significant result with a tiny effect size can be practically worthless. Keep two questions separate: Is it real? (significance) and Does it matter? (effect size).
Positive vs. negative skew direction. Counterintuitive because you name the skew by the tail, not the hump. Positive skew = tail points right (toward high values, like income); negative skew = tail points left (like an easy test). The mound of scores sits on the opposite side from the skew's name. Remember: the skew is named for its tail, and the mean is dragged into that tail.
Four-choice MCQs in current AP format. Answers and explanations in section (h).
1. (C) The mean. The mean is calculated from every value, so a single extreme score drags it toward that extreme. (A) the mode is just the most frequent value and (B) the median is the middle position — both ignore how extreme an outlier is. (D) the range is a measure of variability, not central tendency.
2. (B) Median. The \$2,000k income is a massive outlier that inflates the mean far above what anyone typical earns; the median (\$36k) resists it and better represents "typical." (A) precisely because the mean uses the outlier, it's misleading here. (C) range measures spread, not the typical value. (D) there is no repeated value, so the mode is useless here.
3. (C). A smaller standard deviation (Class A, SD = 3) means scores cluster tightly around the mean — more consistency. (A) reverses it (the larger SD, Class B, is more spread out). (B) reverses it again. (D) is false — equal means with different SDs are different distributions.
4. (C) 68%. The empirical rule: ~68% within 1 SD, ~95% within 2 SD, ~99.7% within 3 SD. (A) 34% is half of one SD band (one side only). (B) 50% describes the median split, not an SD band. (D) 95% is the 2-SD band.
5. (B). A tail toward the low end is a negative skew, and because the mean chases the tail, the mean is pulled below the median. (A) describes a positive skew. (D) gets the skew direction right but the mean/median relationship backward. (C) a normal distribution has no tail.
6. (B) −.79. Strength is the absolute value; |−.79| = .79 is the closest to 1, so it's strongest. The negative sign is direction, not weakness. (C) +.60 is weaker than .79; (A) and (D) are weaker still. This is the classic "sign isn't strength" item.
7. (B). A strong positive correlation means the variables rise together, but correlation never establishes causation. (A) and (D) both illegitimately claim causation (and in opposite directions); (C) contradicts a strong r.
8. (C). p = .02 means a result this extreme would occur by chance alone only about 2% of the time if there were no real effect. That is below the .05 cutoff, making it statistically significant. (A) confuses significance with effect size. (B) misreads the p-value: it's not the probability the hypothesis is true. (D) overstates: there's still a small chance it's a fluke.
9. (B). With 50,000 participants, even a trivially small 0.3-point difference can reach p < .001: statistically significant but almost certainly a tiny effect size. (A) confuses significance with magnitude. (C) is false, since p < .001 is significant. (D) misrepresents a focus-test change as an intelligence change and implies a causal claim the data don't support.
10. (B) Institutional Review Board (IRB). The IRB reviews and approves human-subjects research before it begins. (A) the APA writes guidelines but doesn't approve individual campus studies; (C) and (D) play no such role.
11. (B). Ethical deception requires a full debriefing afterward that reveals the true purpose and resolves the deception. (A) and (C) are not ethics requirements; (D) is backward — deception should be minimized, not prolonged.
12. (B) r = −.20. A loose, widely scattered cloud means a weak correlation (small absolute value), and a downward drift means negative. So a small negative value fits. (A) and (D) describe tight, strong relationships (dots hugging a line); (C) is positive and essentially zero (no slope).
13. (B). Mean (88) above median (82) signals a positive skew — a few high scores pull the mean up while the median holds near typical. (C) reverses the direction. (A) a normal distribution has mean ≈ median. (D) a single repeated value isn't implied by the mean-median gap.
14. (B). Percentile rank is the percentage of test-takers scoring at or below you, so the 75th percentile means you scored as well as or better than 75% of them. (A) confuses percentile with percent-correct. (C) percentile isn't a raw point distance. (D) misframes it as chance.
15. (A). A lower mean (240 vs. 295) means the caffeine group was faster, and a smaller SD (12 vs. 45) means their times were more consistent. (B) reverses both. (C) the SDs differ markedly (12 vs. 45). (D) is false — SD is precisely a measure of consistency/spread.
---
PsyIQ · Lesson 3 of 30 · Unit 1: Biological Bases of Behavior. Q1-style practice modeled on the redesigned (2025+) AP Psychology exam. Not affiliated with the College Board. AP is a registered trademark of the College Board. Content pending external psychology QC.