CSPIQ · AP Computer Science Principles · Lesson 5 of 25

Lesson 05: Extracting Information from Data

Big Idea 2 (DAT) · Phase 2

Objectives

Warm-Up

A city's data team notices that in months when ice-cream sales rise, drowning deaths also rise. The correlation is strong, consistent, year after year. Should the city restrict ice-cream sales to save lives?

Obviously not — hot weather drives both: more ice cream, more swimming, more drownings. Ice cream doesn't cause drowning; a hidden third factor moves them together.

You laughed, but swap in less-obvious variables and this exact error drives real headlines, real product decisions, and at least one question on every AP CSP exam. "The data shows A and B rise together" never, by itself, proves A causes B. Let's build the full toolkit for what data can and can't tell you.


Core Concept

Data vs. information

Data are the raw values: numbers, text, measurements, clicks. Information is what you extract from data: patterns, conclusions, insights, knowledge. [98, 87, 92, 55, 91] is data; "one student is struggling while the class average is about 85, a solid B" is information.
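As a minimal sketch of that extraction step, the short Python program below turns the five raw scores into two pieces of information. The cutoff of 70 for "struggling" is an assumption made for illustration, not something the CED or this lesson defines.

```python
# The data: raw values with no meaning attached yet.
scores = [98, 87, 92, 55, 91]

# Extracting information, step 1: aggregate the values into one summary.
average = sum(scores) / len(scores)

# Step 2: filter for values that stand out.
# Hypothetical cutoff; a real class would choose its own threshold.
STRUGGLING_BELOW = 70
struggling = [s for s in scores if s < STRUGGLING_BELOW]

print(f"Class average: {average:.1f}")
print(f"Scores below {STRUGGLING_BELOW}: {struggling}")
```

The list never changes; only what we know about it does. That gap between stored values and extracted insight is the data/information distinction.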

Extracting information from data is why organizations collect data at all. Programs do the extracting — filtering, sorting, aggregating (Lesson 6 covers the how). Two CED claims worth knowing verbatim:

Correlation vs. causation — the star of this lesson

Correlation: two things vary together. Causation: one thing makes the other happen.

Data alone establishes correlation. Causation requires more — controlled experiments, mechanisms, ruling out confounders. The three classic explanations for a correlation between A and B:

  1. A causes B (maybe!)
  2. B causes A (reverse direction — "students with tutors score lower" may mean low scores cause tutor-hiring)
  3. A hidden factor C causes both (heat → ice cream AND drowning)
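Explanation 3 can be sketched with a small simulation. Every number below is invented for illustration: temperature is generated first, and ice-cream sales and drownings are each computed from temperature plus random noise. Neither series is ever computed from the other, yet they still come out strongly correlated. The `pearson` helper is a hand-written correlation function so the sketch needs nothing beyond the standard library.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hidden factor C: made-up average temperatures for 120 months.
temps = [rng.uniform(0, 35) for _ in range(120)]

# A and B each depend on C (plus noise), never on each other.
ice_cream = [10 * t + rng.gauss(0, 20) for t in temps]
drownings = [0.5 * t + rng.gauss(0, 2) for t in temps]

r = pearson(ice_cream, drownings)
print(f"Correlation between ice cream and drownings: {r:.2f}")
```

A strong correlation appears even though the code contains no causal link between the two series. Looking only at `ice_cream` and `drownings`, you could not tell this situation apart from explanation 1 or 2.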

Exam behavior: when a question asks what can be concluded from observational data, the safe answers say "are associated," "is correlated with," "tend to occur together." The trap answers say "causes," "leads to," "will improve." Choose the weakest claim the data supports.

Metadata: data about data

Metadata is data that describes other data. It is not the content itself — it's the label on the box.

Primary data              Its metadata
A photo's pixels          Time taken, GPS location, camera model, resolution
A document's text         Author, creation date, file size, last-modified time
An email's message        Sender, recipient, timestamp, subject line
A web page's content      Page title, language, tags, date published

CED claims: metadata helps in finding, organizing, and managing data (search by date! sort by size! group by location!), and changing metadata does not change the primary data — retitling a photo doesn't alter a pixel. Also worth a beat of caution: metadata can reveal sensitive information (a photo's GPS tag exposes where you were) — this returns in Lesson 23.
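Both claims can be illustrated with a toy model. The sketch below assumes a photo is stored as a dictionary holding its pixel bytes plus a separate metadata dictionary; the field names and values are hypothetical, and real image formats organize this differently.

```python
# Each photo keeps its primary data (pixels) apart from its metadata.
photos = [
    {"pixels": b"\x10\x20\x30",
     "metadata": {"title": "IMG_0042", "taken": "2024-07-04", "city": "Chicago"}},
    {"pixels": b"\x99\x88\x77",
     "metadata": {"title": "IMG_0007", "taken": "2023-12-25", "city": "Boston"}},
]

# Claim 1: changing metadata does not change the primary data.
pixels_before = photos[0]["pixels"]
photos[0]["metadata"]["title"] = "Fireworks"
pixels_unchanged = photos[0]["pixels"] == pixels_before

# Claim 2: metadata helps organize data. Sorting by date never reads a pixel.
by_date = sorted(photos, key=lambda p: p["metadata"]["taken"])
titles_by_date = [p["metadata"]["title"] for p in by_date]

print("Pixels unchanged after retitling:", pixels_unchanged)
print("Titles in date order:", titles_by_date)
```

The sort works because the dates are written year-first, so alphabetical order matches chronological order.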

The challenges of big data

When datasets grow from a spreadsheet to billions of records, new problems appear. The CED names these:

Where insight actually comes from

The CED's picture of data work: collect (with known limitations) → clean → combine/organize → filter and aggregate → visualize → interpret (carefully). Programs do the middle; humans supply the questions and the skepticism at both ends. On the exam, respect for that pipeline shows up as answers acknowledging limits: "the data supports X for the population sampled," "additional data would be needed to conclude Y."


Worked Examples

Example 1 (easy): Metadata or data?

Problem: For a podcast episode file, classify each as primary data or metadata: (i) the audio recording, (ii) the episode's duration, (iii) the upload date, (iv) the host's spoken words.

Solution: (i) primary data, (ii) metadata, (iii) metadata, (iv) primary data (it's in the audio content). Duration and upload date describe the file; the recording and its contents are the file.

Interpretation: Litmus test: if you changed it, would the content itself change? Renaming or re-dating the file leaves the audio identical → those are metadata.

Example 2 (medium): The overreach detector

Problem: A fitness app's data shows users who log breakfast lose more weight than users who don't. The company drafts four marketing claims. Which is defensible? (A) "Logging breakfast causes weight loss." (B) "Eat breakfast and you will lose weight." (C) "Breakfast-loggers in our data lost more weight on average than non-loggers." (D) "Weight loss is impossible without logging breakfast."

Strategy: Data is observational → correlation only. Find the weakest claim.

Solution: (C). It restates the observed association without asserting cause. (A) and (B) claim causation (maybe motivated users both log meals and exercise more — hidden factor). (D) is absurd overreach.

Interpretation: Notice (C) also scopes the claim: "in our data." Defensible conclusions are hedged and scoped. That style — claim exactly what the data shows — is what the exam rewards.

Example 3 (AP-style): Bias at collection

Problem: A streaming service surveys viewers about content preferences using a pop-up shown only in its smart-TV app. Results show overwhelming preference for family movies, so the service cuts back late-night documentary production. What is the most significant flaw?

Strategy: Ask who could even see the survey.

Solution: The sample is biased at collection: smart-TV users skew toward living-room, family-context viewing. Phone and laptop viewers — plausibly the documentary audience — never saw the pop-up. The data faithfully records the wrong population, so the conclusion doesn't transfer to all viewers. The fix is collecting across all platforms (and even then, pop-up responders aren't all viewers).

Interpretation: Collection bias questions hinge on a mismatch between the population sampled and the population the conclusion targets. Find who's missing.
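To see how large the mismatch can be, here is a sketch with an entirely made-up audience of 300 viewers. The platform split and the preferences are assumptions chosen to make the effect visible, not figures from any real service.

```python
# Invented viewers as (platform, favorite genre) pairs.
viewers = (
    [("smart_tv", "family")] * 80
    + [("smart_tv", "documentary")] * 20
    + [("phone", "family")] * 30
    + [("phone", "documentary")] * 70
    + [("laptop", "family")] * 40
    + [("laptop", "documentary")] * 60
)

def family_share(group):
    """Fraction of a group whose favorite genre is family movies."""
    return sum(1 for _, genre in group if genre == "family") / len(group)

# Only smart-TV viewers ever saw the pop-up survey.
surveyed = [v for v in viewers if v[0] == "smart_tv"]

print(f"Surveyed (smart TV only): {family_share(surveyed):.0%} prefer family movies")
print(f"All viewers:              {family_share(viewers):.0%} prefer family movies")
```

In this invented audience the survey reports 80% while the true figure is 50%. The code and the arithmetic are both correct; the error was made when deciding who got asked.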

Example 4 (medium): Why cleaning matters

Problem: A program counts customer visits per city from this data: ["Chicago", "chicago", "CHICAGO", "Chicgo", "Boston"]. It reports Chicago = 1, chicago = 1, CHICAGO = 1, Chicgo = 1, Boston = 1. What happened, and what's the remedy?

Solution: The program is correct; the data is dirty. Case variations and a typo split one real city across four different labels, each counted separately. Remedy: clean the data before analysis by normalizing case ("chicago") and correcting or flagging misspellings. After cleaning: Chicago = 4, Boston = 1, a completely different (and true) picture.

Interpretation: "The program ran without errors" tells you nothing about the conclusion's validity when the input is inconsistent. Garbage in, confident garbage out.
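The before-and-after counts can be reproduced in a few lines of Python. This is one possible cleaning approach, not the only one: the `KNOWN_TYPOS` table is a hypothetical, hand-maintained list of corrections, and a real project might flag unknown spellings for review instead.

```python
from collections import Counter

visits = ["Chicago", "chicago", "CHICAGO", "Chicgo", "Boston"]

# Counting the raw data: every distinct spelling becomes its own city.
raw_counts = Counter(visits)

# Hypothetical correction table for misspellings someone has already spotted.
KNOWN_TYPOS = {"chicgo": "chicago"}

def clean(city):
    """Normalize case and whitespace, then apply known typo corrections."""
    normalized = city.strip().lower()
    return KNOWN_TYPOS.get(normalized, normalized)

# Counting the cleaned data: spellings of the same city now agree.
clean_counts = Counter(clean(city) for city in visits)

print("Raw:    ", dict(raw_counts))
print("Cleaned:", dict(clean_counts))
```

The counting code is identical in both passes. Only the input changed, which is why a clean run says nothing about whether the result is true.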


Common Mistakes

  1. Concluding causation from correlation. The most-tested error in Big Idea 2. Observational data supports "associated with," never "causes" — no matter how strong the correlation.
  2. Thinking metadata changes the data. Editing a filename, tag, or date stamp leaves the primary content untouched. The CED states this explicitly, and the exam asks it directly.
  3. Treating cleaning as cheating. Correcting inconsistent formats and removing corrupt records is required good practice. (Deleting data because you dislike its conclusion — that's the cheating.)
  4. Ignoring who's missing from the data. A dataset can be huge and still unrepresentative. Size doesn't fix collection bias; a billion smart-TV responses still exclude phone viewers.
  5. Assuming one computer can handle anything. Scale questions expect you to say that past a certain size, storing and processing a dataset requires distributing the work across machines, using more efficient processing methods, or both. Dataset size is a real constraint.
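The idea behind mistake 5 is split, process, combine. The sketch below shows it on a tiny scale: threads stand in for separate machines, and the hospital-stay numbers are invented. It is a picture of the pattern under those assumptions, not a real distributed system.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def count_long_stays(chunk):
    """Count stays longer than 7 days within one piece of the data."""
    return sum(1 for days in chunk if days > 7)

# Made-up dataset: lengths of 1,000 hospital stays, in days.
stays = [d % 15 for d in range(1000)]

# Split the data, let each worker handle one piece, then combine the results.
chunks = chunked(stays, 250)
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(count_long_stays, chunks))
total = sum(partial_counts)

print("Count from each worker:", partial_counts)
print("Combined total:", total)
```

This works because counting can be done piece by piece and the partial answers simply added. The combined total matches what a single pass over the whole list would give.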

Practice Problems

Question 1
Data shows that neighborhoods with more streetlights have less crime. Which conclusion does this data support?
Question 2
Which is metadata for a digital photograph?
Question 3
Changing a file's metadata (for example, renaming it):
Question 4
A dataset of restaurant reviews contains "5", "five stars", "5/5", and "★★★★★" as ratings. Before analysis, the team should:
Question 5
A national poll about internet habits is conducted entirely through an online form. The most significant limitation is:
Question 6
Which task is metadata MOST useful for?
Question 7
A hospital's dataset is too large for one computer to process in reasonable time. Which approach does the CED suggest?
Question 8
Users who buy hiking boots at an online store also frequently buy water bottles. The store can defensibly conclude:
Question 9
(Select two answers.) Which are challenges the CED associates with extracting information from very large datasets?
Question 10
A scientist finds that students who sleep more have higher grades, and honestly reports: "In our sample, sleep duration and GPA were positively correlated; experiments would be needed to establish whether more sleep improves grades." This statement is:
Question 11
Which scenario best illustrates bias introduced at data collection?
Question 12
"Data" is to "information" as:

Create PT Connection

If your Create PT processes data — scores, prices, votes, sensor readings — this lesson is your design conscience:

Mini practice (passage-skill warm-up): In one sentence each, state (i) a piece of data a bike-share app collects, (ii) information extractable from lots of it, (iii) a bias risk in that data. Model: (i) start/end station of each ride; (ii) which stations run empty at rush hour; (iii) rides by tourists with the app overrepresent downtown, so residential stations look less used than they are.



Answer Key

1. (C). Observational data → correlation only. (A), (B), (D) all assert causal or predictive claims the data can't support. (Plausible hidden factors: wealthier areas afford both more lighting and more security.)

2. (B). GPS location describes the photo — data about data. (A), (C), (D) are the photo's content, i.e., primary data.

3. (D). CED states it directly: changing metadata leaves primary data untouched. This is asked near-verbatim on real forms.

4. (B). Equivalent values in different formats = classic cleaning task; convert to one representation, keep all the reviews. (A) throws away good data; (C) is false — programs count "5" and "five stars" as different.

5. (B). Population-sample mismatch at collection: the offline population can't respond to an online poll — precisely the group an internet-habits study must include. Others are nonsense.

6. (B). Date + location tags = finding and organizing, metadata's core uses. (A)/(C)/(D) aren't metadata operations.

7. (C). Scale challenge → parallel/distributed processing (fully developed in Lesson 19). (A) destroys information; (B) is backwards; (D) is a joke answer — real forms include one occasionally, don't overthink it.

8. (B). The association is real and actionable (recommendations) without any causal claim. (A)/(C) assert causation; (D) turns a tendency into a certainty.

9. (A) and (C). Cleaning and scale — the CED's named challenges. (B) is false; (D) reverses metadata's purpose.

10. (D). This is the model sentence: state the correlation, scope it to the sample, name what causation would need. It's what your own conclusions should sound like.

11. (B). Who gets sampled depends on smartphone access → collection bias skewing the dataset before any code runs. (D) is dirtiness, not bias; (A)/(C) are unrelated.

12. (A). Data = raw values; information = extracted insight. (C) is a different distinction entirely; (B)/(D) invert.



Exam tip: On any "what can be concluded" question, rank the answer choices from weakest claim to strongest. The defensible answer is almost always the weakest one that still says something — correlations reported as correlations, scoped to the data collected. If an answer contains "causes," "will," or "proves," it needs experimental evidence the scenario almost never provides.
