A city's data team notices that in months when ice-cream sales rise, drowning deaths also rise. The correlation is strong, consistent, year after year. Should the city restrict ice-cream sales to save lives?
Obviously not — hot weather drives both: more ice cream, more swimming, more drownings. Ice cream doesn't cause drowning; a hidden third factor moves them together.
You laughed, but swap in less-obvious variables and this exact error drives real headlines, real product decisions, and at least one question on every AP CSP exam. "The data shows A and B rise together" never, by itself, proves A causes B. Let's build the full toolkit for what data can and can't tell you.
Data are the raw values: numbers, text, measurements, clicks. Information is what you extract from data: patterns, conclusions, insights, knowledge. [98, 87, 92, 55, 91] is data; "one student is struggling while the class average is about 85, a B" is information.
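To make the distinction concrete, here is a minimal Python sketch that turns the five scores above into information. The cutoff of 60 for "struggling" is an assumption chosen for illustration, not something the CED defines.

```python
# Data: raw scores, which say little until a program summarizes them.
scores = [98, 87, 92, 55, 91]

# Information: facts extracted from the data.
average = sum(scores) / len(scores)

# Assumed cutoff, for illustration only.
STRUGGLING_BELOW = 60
struggling = [s for s in scores if s < STRUGGLING_BELOW]

print(f"Class average: {average:.1f}")
print(f"Scores below {STRUGGLING_BELOW}: {struggling}")
```

The list never changes; only the questions asked of it do. A different question (highest score, spread, trend over time) would extract different information from the same data.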
Extracting information from data is why organizations collect data at all. Programs do the extracting — filtering, sorting, aggregating (Lesson 6 covers the how). Two CED claims worth knowing verbatim:
Correlation: two things vary together. Causation: one thing makes the other happen.
Data alone establishes correlation. Causation requires more — controlled experiments, mechanisms, ruling out confounders. The three classic explanations for a correlation between A and B:
Exam behavior: when a question asks what can be concluded from observational data, the safe answers say "are associated," "is correlated with," "tend to occur together." The trap answers say "causes," "leads to," "will improve." Choose the weakest claim the data supports.
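The ice-cream scenario can be sketched in a few lines of Python. Every number below is invented for illustration: `temps`, the two noise lists, and both formulas are assumptions, not real sales or safety data. Each series is computed from temperature alone and neither formula mentions the other, yet the two still correlate strongly.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up monthly average temperatures (degrees C), January to December.
temps = [2, 4, 9, 14, 19, 24, 27, 26, 21, 15, 8, 3]

# Made-up month-to-month wobble, unrelated to temperature.
ice_noise = [3, -2, 1, -4, 2, 0, -1, 4, -3, 2, -1, 1]
drown_noise = [0, 1, -1, 0, 1, -1, 0, 1, 0, -1, 1, 0]

# Each series depends on temperature only; neither one reads the other.
ice_cream_sales = [10 * t + 50 + n for t, n in zip(temps, ice_noise)]
drownings = [t // 3 + n for t, n in zip(temps, drown_noise)]

r = pearson(ice_cream_sales, drownings)
print(f"Correlation between ice-cream sales and drownings: {r:.2f}")
```

The printed correlation is close to 1 even though the code contains no path from ice cream to drowning. A strong correlation in a dataset is equally silent about which variable, if any, is doing the causing.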
Metadata is data that describes other data. It is not the content itself — it's the label on the box.
| Primary data | Its metadata |
|---|---|
| A photo's pixels | Time taken, GPS location, camera model, resolution |
| A document's text | Author, creation date, file size, last-modified time |
| An email's message | Sender, recipient, timestamp, subject line |
| A web page's content | Page title, language, tags, date published |
CED claims: metadata helps in finding, organizing, and managing data (search by date! sort by size! group by location!), and changing metadata does not change the primary data — retitling a photo doesn't alter a pixel. Also worth a beat of caution: metadata can reveal sensitive information (a photo's GPS tag exposes where you were) — this returns in Lesson 23.
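Both CED claims can be checked with a short Python sketch. The photo record, its field names, and all values are hypothetical; the six bytes stand in for real image data. The first part edits metadata and confirms the primary data is untouched; the second part organizes a small library using metadata alone.

```python
import hashlib

# Hypothetical photo: primary data (pixel bytes) plus metadata describing it.
photo = {
    "pixels": bytes([12, 200, 47, 89, 255, 0]),  # stand-in for real image data
    "metadata": {"title": "IMG_0042", "taken": "2024-07-04"},
}

def fingerprint(data: bytes) -> str:
    """SHA-256 digest of the bytes; any change to the bytes changes the digest."""
    return hashlib.sha256(data).hexdigest()

before = fingerprint(photo["pixels"])
photo["metadata"]["title"] = "Fireworks at the lake"  # edit metadata only
after = fingerprint(photo["pixels"])
print("Pixels unchanged after retitling:", before == after)

# Metadata also drives organizing: sort a (made-up) library by date taken.
library = [
    {"title": "Beach", "taken": "2024-08-11"},
    {"title": "Snow day", "taken": "2024-01-15"},
    {"title": "Graduation", "taken": "2024-06-02"},
]
by_date = sorted(library, key=lambda p: p["taken"])  # ISO dates sort correctly as text
titles = [p["title"] for p in by_date]
print(titles)
```

Notice that the sort never opens a single pixel. That is the practical payoff of metadata: programs can find and arrange large collections without reading the content itself.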
When datasets grow from a spreadsheet to billions of records, new problems appear. The CED names these:
The CED's picture of data work: collect (with known limitations) → clean → combine/organize → filter and aggregate → visualize → interpret (carefully). Programs do the middle; humans supply the questions and the skepticism at both ends. On the exam, respect for that pipeline shows up as answers acknowledging limits: "the data supports X for the population sampled," "additional data would be needed to conclude Y."
Problem: For a podcast episode file, classify each as primary data or metadata: (i) the audio recording, (ii) the episode's duration, (iii) the upload date, (iv) the host's spoken words.
Solution: (i) primary data, (ii) metadata, (iii) metadata, (iv) primary data (it's in the audio content). Duration and upload date describe the file; the recording and its contents are the file.
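Interpretation: The litmus test is whether changing the item would change the content itself. Editing the listed duration or upload date in the file's tags leaves the audio identical, so those are metadata; re-recording or rewording what the host says changes the audio, so those are primary data.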
Problem: A fitness app's data shows users who log breakfast lose more weight than users who don't. The company drafts four marketing claims. Which is defensible? (A) "Logging breakfast causes weight loss." (B) "Eat breakfast and you will lose weight." (C) "Breakfast-loggers in our data lost more weight on average than non-loggers." (D) "Weight loss is impossible without logging breakfast."
Strategy: Data is observational → correlation only. Find the weakest claim.
Solution: (C). It restates the observed association without asserting cause. (A) and (B) claim causation (maybe motivated users both log meals and exercise more — hidden factor). (D) is absurd overreach.
Interpretation: Notice (C) also scopes the claim: "in our data." Defensible conclusions are hedged and scoped. That style — claim exactly what the data shows — is what the exam rewards.
Problem: A streaming service surveys viewers about content preferences using a pop-up shown only in its smart-TV app. Results show overwhelming preference for family movies, so the service cuts back late-night documentary production. What is the most significant flaw?
Strategy: Ask who could even see the survey.
Solution: The sample is biased at collection: smart-TV users skew toward living-room, family-context viewing. Phone and laptop viewers — plausibly the documentary audience — never saw the pop-up. The data faithfully records the wrong population, so the conclusion doesn't transfer to all viewers. The fix is collecting across all platforms (and even then, pop-up responders aren't all viewers).
Interpretation: Collection bias questions hinge on a mismatch between the population sampled and the population the conclusion targets. Find who's missing.
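A small Python sketch shows how far a biased sample can drift from the truth. The audience counts below are entirely hypothetical, chosen only to make the arithmetic easy; the scenario gives no real numbers.

```python
# Hypothetical audience: (platform, prefers_family_movies, number_of_viewers).
population = [
    ("smart_tv", True, 320),
    ("smart_tv", False, 80),
    ("phone_or_laptop", True, 180),
    ("phone_or_laptop", False, 420),
]

def family_share(groups):
    """Fraction of viewers in `groups` who prefer family movies."""
    total = sum(count for _, _, count in groups)
    family = sum(count for _, prefers, count in groups if prefers)
    return family / total

# The pop-up appeared only in the smart-TV app, so only those viewers are sampled.
surveyed = [g for g in population if g[0] == "smart_tv"]

print(f"Survey result (smart TV only): {family_share(surveyed):.0%}")
print(f"Whole audience:                {family_share(population):.0%}")
```

Under these assumed numbers the survey reports 80% while the whole audience sits at 50%. Nothing in the code is buggy; the gap comes entirely from who was allowed into the sample.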
Problem: A program counts customer visits per city from this data: ["Chicago", "chicago", "CHICAGO", "Chicgo", "Boston"]. It reports Chicago = 1, chicago = 1, CHICAGO = 1, Chicgo = 1, Boston = 1. What happened, and what's the remedy?
Solution: The program is correct; the data is dirty. Case variations and a typo split one real city across four separate counts. Remedy: clean the data before analysis by normalizing case ("chicago") and correcting or flagging misspellings. After cleaning: Chicago = 4, Boston = 1, a completely different (and true) picture.
Interpretation: "The program ran without errors" tells you nothing about the conclusion's validity when the input is inconsistent. Garbage in, confident garbage out.
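The cleaning step from this example can be sketched in Python as follows. The `KNOWN_TYPOS` lookup and the `clean` helper are hypothetical names introduced for illustration; a real project would build its corrections by inspecting the data, and might flag unknown spellings for review rather than fix them automatically.

```python
from collections import Counter

visits = ["Chicago", "chicago", "CHICAGO", "Chicgo", "Boston"]

# Counting the raw strings reproduces the misleading report: five keys, one visit each.
raw_counts = Counter(visits)

# Hypothetical lookup of misspellings already identified by a human.
KNOWN_TYPOS = {"chicgo": "chicago"}

def clean(city: str) -> str:
    """Normalize whitespace and case, then apply any known spelling correction."""
    key = city.strip().lower()
    return KNOWN_TYPOS.get(key, key)

clean_counts = Counter(clean(city) for city in visits)

print("Before cleaning:", dict(raw_counts))
print("After cleaning: ", dict(clean_counts))
```

The counting logic is identical in both passes; only the input changed. That is why cleaning sits before analysis in the pipeline rather than being patched in afterward.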
1. (C). Observational data → correlation only. (A), (B), (D) all assert causal or predictive claims the data can't support. (Plausible hidden factors: wealthier areas afford both more lighting and more security.)
2. (B). GPS location describes the photo — data about data. (A), (C), (D) are the photo's content, i.e., primary data.
3. (D). CED states it directly: changing metadata leaves primary data untouched. This is asked near-verbatim on real forms.
4. (B). Equivalent values in different formats = classic cleaning task; convert to one representation, keep all the reviews. (A) throws away good data; (C) is false — programs count "5" and "five stars" as different.
5. (B). Population-sample mismatch at collection: the offline population can't respond to an online poll — precisely the group an internet-habits study must include. Others are nonsense.
6. (B). Date + location tags = finding and organizing, metadata's core uses. (A)/(C)/(D) aren't metadata operations.
7. (C). Scale challenge → parallel/distributed processing (fully developed in Lesson 19). (A) destroys information; (B) is backwards; (D) is a joke answer — real forms include one occasionally, don't overthink it.
8. (B). The association is real and actionable (recommendations) without any causal claim. (A)/(C) assert causation; (D) turns a tendency into a certainty.
9. (A) and (C). Cleaning and scale — the CED's named challenges. (B) is false; (D) reverses metadata's purpose.
10. (D). This is the model sentence: state the correlation, scope it to the sample, name what causation would need. It's what your own conclusions should sound like.
11. (B). Who gets sampled depends on smartphone access → collection bias skewing the dataset before any code runs. (D) is dirtiness, not bias; (A)/(C) are unrelated.
12. (A). Data = raw values; information = extracted insight. (C) is a different distinction entirely; (B)/(D) invert.
If your Create PT processes data — scores, prices, votes, sensor readings — this lesson is your design conscience:
Mini practice (passage-skill warm-up): In one sentence each, state (i) a piece of data a bike-share app collects, (ii) information extractable from lots of it, (iii) a bias risk in that data. Model: (i) start/end station of each ride; (ii) which stations run empty at rush hour; (iii) rides by tourists with the app overrepresent downtown, so residential stations look less used than they are.
Exam tip: On any "what can be concluded" question, rank the answer choices from weakest claim to strongest. The defensible answer is almost always the weakest one that still says something — correlations reported as correlations, scoped to the data collected. If an answer contains "causes," "will," or "proves," it needs experimental evidence the scenario almost never provides.