PsyIQ · AP Psychology

Lesson 3: Statistics & Ethics in Research

Unit 1 · Biological Bases of Behavior (15–25%) · Science Practices: 3 (Data Interpretation, primary); 2 (Research Methods and Design, supporting)

Objectives:

Read a dataset the way the exam wants you to — naming central tendency and variability, and knowing when each measure misleads.
Interpret a correlation coefficient and a scatterplot for *strength* and *direction* without sliding into the "correlation proves causation" trap.
Say precisely what statistical significance (*p* < .05) does and does *not* mean, and name the ethical guidelines that govern every study you'll ever read on this exam.

(a) Hook

Here's a number that should bother you: the average net worth of people in a room jumps by about a billion dollars the moment Jeff Bezos walks in. The typical person in that room got no richer. That gap — between the average and the typical — is the first thing statistics teaches you, and the AP exam tests it relentlessly.

Psychology lives and dies by numbers. A researcher can't just say a therapy works; she has to show the difference between groups is bigger than what random luck would cough up. She can't just say sleep and grades are linked; she has to put a number on how tightly. And she can't run the study at all unless a committee of strangers first agreed she won't harm anyone.

This lesson covers two skills the exam treats as one: reading the numbers a study produces, and judging whether the study should have produced them ethically in the first place. Master both and you've got the spine of every Article Analysis Question (AAQ) you'll ever write. Let's start by making averages honest.

(b) Core Concepts

Descriptive statistics: summarizing a pile of numbers

Descriptive statistics organize and summarize data — they describe what you collected, nothing more. (Their counterpart, inferential statistics, lets you generalize beyond your sample; we'll get there.) The first job is finding the center.

Measures of central tendency locate the middle of a distribution three different ways. The mean is the arithmetic average — add the scores, divide by how many. The median is the literal middle score when you line them up in order. The mode is the most frequent score.

When do they disagree? Whenever the data are lopsided. Picture five quiz scores: 70, 72, 74, 76, and 8 (someone bombed it). Lined up in order (8, 70, 72, 74, 76), the median is 72, a fair "typical" score. But the mean gets yanked down to 60 by that single 8. That's the headline rule: the mean is sensitive to extreme scores (outliers); the median is resistant to them. Bezos in the room moves the mean, not the median. This is why income and home prices are usually reported as medians: a few billionaires would make the mean misleading.
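If you'd like to check the quiz-score example yourself, here is a minimal sketch using Python's built-in `statistics` module. The variable names are just illustrative; the scores are the five from the example.

```python
from statistics import mean, median

# The five quiz scores from the example: four similar scores and one outlier.
scores = [70, 72, 74, 76, 8]

# Sorted order is 8, 70, 72, 74, 76, so the middle score is the third one.
print("mean:", mean(scores))
print("median:", median(scores))

# Remove the outlier and compare: the mean moves a lot, the median barely moves.
without_outlier = [s for s in scores if s != 8]
print("mean without the 8:", mean(without_outlier))
print("median without the 8:", median(without_outlier))
```

Dropping one score shifts the mean by 13 points and the median by only 1, which is the "sensitive vs. resistant" rule in action.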

Try This. Write down the ages of everyone in your household, then add a 95-year-old visiting relative. Recalculate the mean and median. Watch the mean lurch toward 95 while the median barely twitches. You just felt the difference between a measure that outliers control and one they don't.

Variability: how spread out is the pile?

Two classes can have the same average test score of 80 and be wildly different — one where everyone scored 78–82, another where half scored 60 and half scored 100. Central tendency can't see that difference. Variability can.

The crudest measure is the range — the highest score minus the lowest. It's quick but fragile: one outlier blows it up, since it only looks at the two extremes.

The measure the exam cares about is the standard deviation (SD) — roughly, the average distance of scores from the mean. A small SD means scores cluster tightly around the mean; a large SD means they're scattered widely. If two classes both average 80 but one has SD = 2 and the other SD = 18, you instantly know the first class is homogeneous and the second is all over the map. You do not need to compute SD by hand for the AP exam — you need to interpret it. Bigger SD = more spread; smaller SD = more consistency. That's the whole test-relevant idea.
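You won't compute SD on the exam, but watching the number respond to spread can make the idea stick. The sketch below uses made-up scores (hypothetical data) shaped like the two classes described earlier: one tightly clustered, one split between 60s and 100s. It assumes the population standard deviation (`pstdev`) is the version you want.

```python
from statistics import mean, pstdev

# Hypothetical scores: both classes average 80, but the spread differs.
class_tight = [78, 79, 80, 81, 82]   # everyone near the mean
class_split = [60, 60, 100, 100]     # half low, half high

for name, scores in [("tight", class_tight), ("split", class_split)]:
    # pstdev treats the list as the whole population of scores.
    print(name, "mean =", mean(scores), "SD =", round(pstdev(scores), 2))
```

Same mean, very different SD: the tight class comes out around 1.4 and the split class at 20.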

The normal distribution and skew

Many measured traits — height, IQ, reaction time — pile up in a symmetric, bell-shaped normal distribution (normal curve) when you graph enough of them. In a normal distribution, the mean, median, and mode all sit at the same central point, and a fixed, predictable percentage of scores falls within each band of standard deviations: about 68% within 1 SD of the mean, about 95% within 2 SD, and about 99.7% within 3 SD. That last fact is why an IQ of 130 (two SDs above the mean of 100, with SD = 15) is so rare — only about 2.5% of people score that high or higher.
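The 68–95–99.7 percentages can be checked with a short sketch, assuming an idealized normal curve with mean 100 and SD 15 (the IQ example above). `share_within` is a helper name made up for this illustration.

```python
from statistics import NormalDist

# Idealized IQ distribution: mean 100, standard deviation 15.
iq = NormalDist(mu=100, sigma=15)

def share_within(k):
    """Proportion of scores within k standard deviations of the mean."""
    return iq.cdf(iq.mean + k * iq.stdev) - iq.cdf(iq.mean - k * iq.stdev)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k) * 100:.1f}%")

# Share of people scoring 130 or higher (two SDs above the mean).
share_130_plus = 1 - iq.cdf(130)
print(f"130 or higher: {share_130_plus * 100:.1f}%")
```

The exact curve puts a little over 2% of scores at 130 or above; the "about 2.5%" figure comes from rounding the 2-SD band to 95% and splitting the leftover 5% between the two tails.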

When a distribution isn't symmetric, it's skewed — and the direction names itself by where the tail points, which trips up nearly everyone. A positively skewed distribution has a long tail stretching to the right (the high end); think household income, where a few enormous earners drag the tail out. A negatively skewed distribution has a long tail to the left; think an easy exam where most people score high and a few stragglers pull the tail down. The rule for the mean and median in a skew: the mean chases the tail. In a positive skew the mean lands to the right of the median; in a negative skew, to the left.
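To see the mean chase the tail, here is a small sketch with two invented datasets (hypothetical numbers, chosen only to have one long tail each): incomes with one very high earner, and easy-exam scores with one very low straggler.

```python
from statistics import mean, median

# Hypothetical incomes in thousands of dollars: one huge earner (tail to the right).
incomes = [20, 25, 30, 35, 40, 45, 400]

# Hypothetical easy-exam scores: one straggler (tail to the left).
exam_scores = [40, 85, 88, 90, 92, 95, 98]

for label, data in [("positive skew", incomes), ("negative skew", exam_scores)]:
    m, md = mean(data), median(data)
    side = "right of" if m > md else "left of"
    print(f"{label}: mean = {m}, median = {md} -> mean sits {side} the median")
```

In the income list the mean lands above the median; in the exam list it lands below, matching the direction of each tail.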

Percentile rank tells you the percentage of scores at or below a given score. If your SAT score is in the 90th percentile, you scored as well as or better than 90% of test-takers. The median is, by definition, the 50th percentile.
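The "at or below" definition translates into a few lines of code. This is a sketch under that definition only; `percentile_rank` is a hypothetical helper, the ten scores are invented, and other tools may define percentiles slightly differently.

```python
def percentile_rank(scores, value):
    """Percentage of scores that are at or below the given value."""
    at_or_below = sum(1 for s in scores if s <= value)
    return 100 * at_or_below / len(scores)

# Hypothetical test scores for ten students.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

print(percentile_rank(scores, 90))   # eight of the ten scores are at or below 90
print(percentile_rank(scores, 100))  # every score is at or below the top score
```

Notice that a percentile rank of 80 here says nothing about how many questions the student got right; it only locates the score relative to everyone else's.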

Correlation: putting a number on a relationship

A correlation coefficient (r) is a single number, from −1.00 to +1.00, that captures two things about the relationship between two variables: its direction and its strength.

Direction is the sign. A positive correlation (+) means the variables move together — as one goes up, the other tends to go up (hours studied and grades). A negative correlation (−) means they move in opposite directions — as one goes up, the other tends to go down (hours of TV and grades, perhaps). A correlation near zero means no linear relationship.

Strength is the absolute value — how close r is to 1, regardless of sign. Here's the trap that sinks students: a correlation of −.80 is STRONGER than a correlation of +.30. The minus sign is not a weakness; it's a direction. An r of −.90 describes a tighter, more predictable relationship than an r of +.40. Strength lives in the distance from zero, not in the sign.

On a scatterplot, each dot is one participant plotted on two variables. The tighter the dots hug an imaginary straight line, the stronger the correlation (closer to ±1). A cloud of dots with no slope is r ≈ 0. An upward slope is positive; a downward slope is negative.
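Here is a rough sketch of how r is computed (the standard Pearson formula) and why the sign tells you nothing about strength. All of the data are invented for illustration, and `pearson_r` is a hand-written helper, not something the exam expects you to know.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: how tightly paired scores follow a straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(spread_x * spread_y)

# Hypothetical data for five students.
tv_hours = [1, 2, 3, 4, 5]
grades = [90, 84, 80, 71, 66]          # falls steadily as TV hours rise
shoe_size_rank = [1, 2, 3, 4, 5]
quiz_scores = [70, 80, 65, 85, 72]     # bounces around with a faint upward drift

r_negative = pearson_r(tv_hours, grades)
r_weak_positive = pearson_r(shoe_size_rank, quiz_scores)

print(round(r_negative, 2), round(r_weak_positive, 2))
# Compare strength by absolute value, ignoring the sign.
print(abs(r_negative) > abs(r_weak_positive))
```

The negative correlation is the stronger one by a wide margin, even though the other coefficient is the one with the plus sign.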

And the cardinal rule, repeated until it's reflex: correlation does not prove causation. If ice-cream sales correlate with drownings, ice cream isn't drowning anyone — a third variable (summer heat) drives both. Only a controlled experiment can establish cause.

Inferential statistics: is this difference real?

Suppose a new study method raises a treatment group's mean score above a control group's. How do you know that gap is real and not just the random luck of which students landed in which group? Inferential statistics let you decide whether a result likely generalizes to the larger population or is probably just chance.

The key idea is statistical significance. A result is statistically significant when it is unlikely to have occurred by chance alone. By long-standing convention, psychologists use the cutoff p < .05, meaning: if there were truly no effect, you'd see a result this large (or larger) less than 5% of the time by chance. Cross that threshold and the result is "significant"; you reject the idea that it's pure luck.

Now the part the exam loves to test — what significance does not mean:

It does not mean the effect is large or important. With a huge sample, a trivially small difference can be statistically significant. Significance is about reliability, not size.
It does not mean the result is certain. Even when there is truly no effect, chance alone will still produce a "significant" result up to 5% of the time.
It does not prove your hypothesis is "true." It only says chance is an unlikely explanation.

Because significance ignores size, researchers also report effect size — a measure of how big the difference or relationship actually is, independent of sample size. A result can be statistically significant (reliable) yet have a tiny effect size (trivial in the real world). The sophisticated reading of any study holds both in mind: Is it real (significance) AND does it matter (effect size)?
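A quick sketch can show significance and effect size pulling apart. It makes several assumptions that are not from any real study: two equal-sized groups, a shared standard deviation of 10, a simple z-test approximation for the p-value, and Cohen's d (mean difference divided by SD) as the effect-size measure. `two_group_summary` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist

def two_group_summary(mean_diff, sd, n_per_group):
    """Return (Cohen's d, two-tailed p) for two equal groups, z-test approximation."""
    cohens_d = mean_diff / sd                      # effect size: ignores sample size
    standard_error = sd * sqrt(2 / n_per_group)    # shrinks as the sample grows
    z = mean_diff / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # chance of a gap this big if no effect
    return cohens_d, p_value

# Hypothetical study A: tiny difference, enormous sample.
d_big_sample, p_big_sample = two_group_summary(mean_diff=0.3, sd=10, n_per_group=25_000)

# Hypothetical study B: large difference, very small sample.
d_small_sample, p_small_sample = two_group_summary(mean_diff=8, sd=10, n_per_group=10)

print(f"A: d = {d_big_sample:.2f}, p = {p_big_sample:.4f}")
print(f"B: d = {d_small_sample:.2f}, p = {p_small_sample:.4f}")
```

Under these assumptions, study A clears p < .05 with a trivial effect size, while study B has a large effect size but misses the cutoff. That is why you ask both questions: is it real, and does it matter?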

Research ethics: the rules behind every study

Before any of those numbers get collected, the study has to clear ethics. In the United States, institutions that run human research use an Institutional Review Board (IRB), a committee that reviews proposed studies to ensure participants are protected before the research begins. The APA (American Psychological Association) publishes the ethical guidelines psychologists are expected to follow, and IRB review reflects the same principles. The core requirements:

Informed consent: participants must know enough about the study to agree to take part voluntarily, and must be told they can withdraw at any time without penalty.
Deception + debriefing: deception (misleading participants about the study's true purpose) is permitted only when necessary and harmless — and it requires a full debriefing afterward, in which researchers reveal the true purpose and clear up any deception.
Protection from harm: participants must not be exposed to significant physical or psychological harm; risks must be minimized.
Confidentiality: participants' data and identities are kept private (often via anonymous codes rather than names).

For animal research, separate guidelines apply: studies must have a clear scientific purpose, animals must be cared for humanely, and discomfort must be minimized. Animal research is permitted but strictly regulated. Whenever an AAQ asks you to name an ethical guideline a study followed, these terms are your answer bank.

(c) Classic Studies Spotlight

Stanley Milgram's obedience study — as an ethics flashpoint (Yale, 1963).

Who & when: Stanley Milgram, at Yale, beginning in 1961, with results published in 1963. (You'll meet this study again in Unit 4 for what it revealed about obedience; here it's the field's most-cited ethics cautionary tale.)

What he did: Participants were told they were in a "learning" experiment and ordered to deliver what they believed were escalating, painful electric shocks to another person each time he answered wrong. The shocks were fake and the "learner" was an actor — but the participants didn't know that. About 65% obeyed all the way to the maximum 450-volt level.

The ethical problem: Participants were deceived about the study's true purpose, and many showed genuine, severe distress — sweating, trembling, nervous laughter — raising the question of whether they were adequately protected from harm. Some argued they could not have given truly informed consent to an experience that traumatic.

Why it matters: The outcry over Milgram's methods (alongside studies like Zimbardo's prison experiment) directly shaped the modern ethics regime — IRBs, mandatory informed consent, and required debriefing. Milgram did debrief his participants and followed up on their wellbeing, but the case crystallized why the rules exist. For the exam: Milgram = obedience finding and the textbook example of the deception-vs-protection-from-harm tension.

(d) Application Practice

Scenario 1. A real-estate report announces that the median home price in a town is \$310,000, while the mean is \$540,000. A student concludes the report contradicts itself.

What's actually going on, and which statistic is more informative here? There's no contradiction — the gap reveals a positively skewed distribution. A handful of multimillion-dollar mansions (high-end outliers) pull the mean sharply upward while leaving the median near the typical home. Because the mean is sensitive to outliers and the median is resistant, the median (\$310,000) better represents what a typical buyer faces. The very size of the mean-median gap is your evidence of skew.

Scenario 2. A researcher finds that a new tutoring program raised test scores with p = .03, and a news headline blares "Tutoring Dramatically Boosts Scores!"

What can you legitimately conclude, and where does the headline overreach? Because p = .03 is below .05, the result is statistically significant — the score increase is unlikely to be due to chance. That's the legitimate claim. But "dramatically" overreaches: significance is not effect size. With a large enough sample, even a tiny, real-world-trivial bump clears p < .05. Without knowing the effect size, you cannot say the boost was large. The headline confuses "real" with "big."

Scenario 3. A study reports that the correlation between daily screen time and reported loneliness is r = +.45, and between hours of in-person socializing and loneliness is r = −.62. A blog claims the first relationship is the stronger one "because it's positive."

Which relationship is actually stronger, and what does each sign mean? The blog has it backward. Strength is the distance from zero, so r = −.62 is the stronger relationship — more in-person socializing reliably tracks with less loneliness. The signs only give direction: screen time and loneliness rise together (positive); socializing and loneliness move opposite (negative). A negative correlation is not a weak correlation. And note: even r = −.62 cannot prove socializing causes lower loneliness — a third variable (say, depression) could drive both.

(e) Traps & Confusions

Mean vs. median under skew. Students memorize "average" and stop. The trap is forgetting that the mean follows the tail while the median holds steady. Mnemonic: the mean is a pushover — outliers shove it around; the median has a backbone. Whenever mean and median differ sharply, suspect a skewed distribution and trust the median for "typical."

Correlation strength vs. sign. The single most-missed stats point on the exam. The sign (+/−) is direction; the absolute value is strength. −.80 beats +.30 every time. Mnemonic: strip the sign to size it up. Cover the +/− with your thumb, read the number, that's the strength.

Statistical significance vs. importance. "Significant" in statistics means "probably not chance," not "big" or "meaningful." A significant result with a tiny effect size can be practically worthless. Keep two questions separate: Is it real? (significance) and Does it matter? (effect size).

Positive vs. negative skew direction. Counterintuitive because you name the skew by the tail, not the hump. Positive skew = tail points right (toward high values, like income); negative skew = tail points left (like an easy test). The mound of scores sits on the opposite side from the skew's name. Remember: the skew is named for its tail, and the mean is dragged into that tail.

(f) Practice Problems

Four-choice MCQs in current AP format. Answers and explanations are in the Answer Key after Question 15.

Question 1

Which measure of central tendency is most affected by a single extreme outlier?

(A) The mode
(B) The median
(C) The mean
(D) The range

Question 2

A set of seven incomes is: \$30k, \$32k, \$35k, \$36k, \$38k, \$40k, and \$2,000k. The best single measure of the typical income in this set is the

(A) mean, because it uses every value
(B) median, because it resists the outlier
(C) range, because it shows the spread
(D) mode, because it is most frequent

Question 3

Two classes both have a mean test score of 75. Class A has a standard deviation of 3; Class B has a standard deviation of 14. Which statement is best supported?

(A) Class A's scores are more spread out than Class B's.
(B) Class B's scores cluster more tightly around the mean.
(C) Class A's scores are more consistent and closer to the mean.
(D) The two classes have identical distributions.

Question 4

In a normal distribution, approximately what percentage of scores falls within one standard deviation of the mean?

(A) 34%
(B) 50%
(C) 68%
(D) 95%

Question 5

A distribution of exam scores has a long tail extending toward the low end, with most students scoring high. This distribution is

(A) positively skewed, and the mean is higher than the median
(B) negatively skewed, and the mean is lower than the median
(C) normally distributed
(D) negatively skewed, and the mean is higher than the median

Question 6

Which correlation coefficient indicates the strongest relationship between two variables?

(A) +.25
(B) −.79
(C) +.60
(D) −.40

Question 7

A researcher finds r = +.85 between hours of sleep and exam performance. The most accurate interpretation is that

(A) getting more sleep causes higher exam scores
(B) more sleep is strongly associated with higher exam scores, but causation isn't established
(C) sleep and exam scores are unrelated
(D) higher exam scores cause students to sleep more

Question 8

A study reports a result with p = .02. This means

(A) the effect is large and important
(B) there is a 2% probability the hypothesis is true
(C) a result this extreme would occur by chance alone less than 5% of the time, so it is statistically significant
(D) the result is certainly not due to chance

Question 9

Scenario. A pharmaceutical company runs a trial with 50,000 participants and finds that its new supplement raises average daily focus-test scores by 0.3 points on a 100-point scale, with p < .001. Which conclusion is most defensible?

(A) The supplement has a large, meaningful effect on focus.
(B) The result is statistically significant but likely has a very small effect size.
(C) The result is not statistically significant.
(D) The supplement causes a 0.3% increase in intelligence.

Question 10

Before a psychology experiment can begin at a university, it must typically be approved by the

(A) APA national headquarters
(B) Institutional Review Board (IRB)
(C) participants' families
(D) Department of Education

Question 11

A researcher misleads participants about the true purpose of a study to prevent them from altering their behavior. To do this ethically, the researcher must, at minimum,

(A) obtain a court order
(B) fully debrief participants afterward, revealing the true purpose
(C) pay each participant
(D) ensure the deception lasts the entire study

Question 12

Data interpretation. A scatterplot shows data points scattered widely in a loose cloud with a slight downward drift from upper-left to lower-right. The correlation this scatterplot most likely represents is

(A) r = +.90
(B) r = −.20
(C) r = +.05
(D) r = −.95

Question 13

Scenario. A teacher reports that her students' quiz scores have a mean of 88 and a median of 82. What does this difference most strongly suggest about the distribution?

(A) The scores are normally distributed.
(B) The scores are positively skewed, with a few high outliers pulling the mean up.
(C) The scores are negatively skewed, with a few low outliers pulling the mean down.
(D) Most students scored exactly 85.

Question 14

A student scores in the 75th percentile on a standardized test. This means the student

(A) answered 75% of questions correctly
(B) scored as well as or better than 75% of test-takers
(C) is 75 points above the mean
(D) is in the top 75% only by chance

Question 15

Data interpretation. The table below shows reaction-time data (in milliseconds) for two groups:

| Group | Mean | Standard Deviation |
|---|---|---|
| Caffeine | 240 | 12 |
| Placebo | 295 | 45 |

Which conclusion is best supported by the table?

(A) The caffeine group was both faster on average and more consistent in their times.
(B) The placebo group was faster and more consistent.
(C) The two groups had identical variability.
(D) Standard deviation tells us nothing about consistency.

🔑 Answer Key

1. (C) The mean. The mean is calculated from every value, so a single extreme score drags it toward that extreme. (A) the mode is just the most frequent value and (B) the median is the middle position — both ignore how extreme an outlier is. (D) the range is a measure of variability, not central tendency.

2. (B) Median. The \$2,000k income is a massive outlier that inflates the mean far above what anyone typical earns; the median (\$36k) resists it and better represents "typical." (A) precisely because the mean uses the outlier, it's misleading here. (C) range measures spread, not the typical value. (D) there is no repeated value, so the mode is useless here.

3. (C). A smaller standard deviation (Class A, SD = 3) means scores cluster tightly around the mean — more consistency. (A) reverses it (the larger SD, Class B, is more spread out). (B) reverses it again. (D) is false — equal means with different SDs are different distributions.

4. (C) 68%. The empirical rule: ~68% within 1 SD, ~95% within 2 SD, ~99.7% within 3 SD. (A) 34% is half of one SD band (one side only). (B) 50% describes the median split, not an SD band. (D) 95% is the 2-SD band.

5. (B). A tail toward the low end is a negative skew, and because the mean chases the tail, the mean is pulled below the median. (A) describes a positive skew. (D) gets the skew direction right but the mean/median relationship backward. (C) a normal distribution has no tail.

6. (B) −.79. Strength is the absolute value; |−.79| = .79 is the closest to 1, so it's strongest. The negative sign is direction, not weakness. (C) +.60 is weaker than .79; (A) and (D) are weaker still. This is the classic "sign isn't strength" item.

7. (B). A strong positive correlation means the variables rise together, but correlation never establishes causation. (A) and (D) both illegitimately claim causation (and in opposite directions); (C) contradicts a strong r.

8. (C). p = .02 means a result this extreme would occur by chance alone less than 5% of the time, making it statistically significant. (A) confuses significance with effect size. (B) misreads the p-value — it's not the probability the hypothesis is true. (D) overstates — there's still a small chance it's a fluke.

9. (B). With 50,000 participants, even a trivially small 0.3-point difference can reach p < .001: statistically significant, but almost certainly a tiny effect size. (A) confuses significance with magnitude. (C) is false because p < .001 is significant. (D) misrepresents a focus-test change as a change in intelligence and turns 0.3 points into a percentage the data never reported.

10. (B) Institutional Review Board (IRB). The IRB reviews and approves human-subjects research before it begins. (A) the APA writes guidelines but doesn't approve individual campus studies; (C) and (D) play no such role.

11. (B). Ethical deception requires a full debriefing afterward that reveals the true purpose and resolves the deception. (A) and (C) are not ethics requirements; (D) is backward — deception should be minimized, not prolonged.

12. (B) r = −.20. A loose, widely scattered cloud means a weak correlation (small absolute value), and a downward drift means negative. So a small negative value fits. (A) and (D) describe tight, strong relationships (dots hugging a line); (C) is positive and essentially zero (no slope).

13. (B). Mean (88) above median (82) signals a positive skew — a few high scores pull the mean up while the median holds near typical. (C) reverses the direction. (A) a normal distribution has mean ≈ median. (D) a single repeated value isn't implied by the mean-median gap.

14. (B). Percentile rank is the percentage of test-takers scoring at or below you, so the 75th percentile means you scored as well as or better than 75% of them. (A) confuses percentile with percent-correct. (C) percentile isn't a raw point distance. (D) misframes it as chance.

15. (A). A lower mean (240 vs. 295) means the caffeine group was faster, and a smaller SD (12 vs. 45) means their times were more consistent. (B) reverses both. (C) the SDs differ markedly (12 vs. 45). (D) is false — SD is precisely a measure of consistency/spread.

---

PsyIQ · Lesson 3 of 30 · Unit 1: Biological Bases of Behavior. Q1-style practice modeled on the redesigned (2025+) AP Psychology exam. Not affiliated with the College Board. AP is a registered trademark of the College Board. Content pending external psychology QC.
