CSPIQ · AP Computer Science Principles

Lesson 05: Extracting Information from Data

Big Idea 2 (DAT) · Phase 2

Objectives

State the difference between data and information, and between correlation and causation — and spot conclusion-overreach in a scenario
Define metadata and give examples for a photo, a file, and a web page
Name the challenges of working with large datasets: cleaning, bias in collection, storage/processing scalability

Warm-Up

A city's data team notices that in months when ice-cream sales rise, drowning deaths also rise. The correlation is strong, consistent, year after year. Should the city restrict ice-cream sales to save lives?

Obviously not — hot weather drives both: more ice cream, more swimming, more drownings. Ice cream doesn't cause drowning; a hidden third factor moves them together.

You laughed, but swap in less-obvious variables and this exact error drives real headlines, real product decisions, and at least one question on every AP CSP exam. "The data shows A and B rise together" never, by itself, proves A causes B. Let's build the full toolkit for what data can and can't tell you.

Core Concept

Data vs. information

Data are the raw values: numbers, text, measurements, clicks. Information is what you extract from data: patterns, conclusions, insights, knowledge. [98, 87, 92, 55, 91] is data; "one student is struggling while the class average is about 85, a solid B" is information.
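
To make the distinction concrete, here is a minimal Python sketch that turns the five scores above into information. The variable names and the 60-point cutoff for "struggling" are assumptions chosen for illustration; they do not come from the CED.

```python
from statistics import mean

scores = [98, 87, 92, 55, 91]  # data: raw values

# Information: summaries and patterns extracted from the data.
class_mean = mean(scores)
STRUGGLING_CUTOFF = 60  # hypothetical threshold chosen for this example
struggling = [s for s in scores if s < STRUGGLING_CUTOFF]

print(f"Class mean: {class_mean:.1f}")
print(f"Scores below {STRUGGLING_CUTOFF}: {struggling}")
```

The list never changes; what changes is what you know about it after the program runs.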

Extracting information from data is why organizations collect data at all. Programs do the extracting: filtering, sorting, aggregating (Lesson 6 covers the how). Two CED claims worth knowing closely:

Combining data sources, clustering data, and classifying data are ways to extract information.
Search trends, interests, and patterns of a population can be found from large datasets even when no single record says them outright — insight emerges from scale.

Correlation vs. causation — the star of this lesson

Correlation: two things vary together. Causation: one thing makes the other happen.

Observational data alone can establish correlation, not causation. Causation requires more: controlled experiments, mechanisms, ruling out confounders. The three classic explanations for a correlation between A and B:

A causes B (maybe!)
B causes A (reverse direction — "students with tutors score lower" may mean low scores cause tutor-hiring)
A hidden factor C causes both (heat → ice cream AND drowning)
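
The three explanations above look identical in the numbers. As a rough sketch, the Python below computes a correlation coefficient for invented monthly figures (every value is made up for illustration, and `pearson_r` is a helper written for this example, not a library call). The result is a strong correlation between ice cream and drownings, and nothing in the calculation says which explanation is the true one.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up monthly figures, invented for illustration only.
temperature_c = [10, 15, 20, 25, 30, 35]      # hidden factor C
ice_cream_sales = [52, 63, 81, 94, 112, 123]  # A
drownings = [2, 4, 5, 7, 8, 10]               # B

r_ab = pearson_r(ice_cream_sales, drownings)
r_ca = pearson_r(temperature_c, ice_cream_sales)
r_cb = pearson_r(temperature_c, drownings)

print(f"ice cream vs. drownings:   r = {r_ab:.2f}")
print(f"temperature vs. ice cream: r = {r_ca:.2f}")
print(f"temperature vs. drownings: r = {r_cb:.2f}")
```

All three values come out close to 1. The program measures how tightly the lists move together; deciding why they move together takes knowledge from outside the dataset.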

Exam behavior: when a question asks what can be concluded from observational data, the safe answers say "are associated," "is correlated with," "tend to occur together." The trap answers say "causes," "leads to," "will improve." Choose the weakest claim the data supports.

Metadata: data about data

Metadata is data that describes other data. It is not the content itself — it's the label on the box.

Primary data	Its metadata
A photo's pixels	Time taken, GPS location, camera model, resolution
A document's text	Author, creation date, file size, last-modified time
An email's message	Sender, recipient, timestamp, subject line
A web page's content	Page title, language, tags, date published

CED claims: metadata helps in finding, organizing, and managing data (search by date! sort by size! group by location!), and changing metadata does not change the primary data — retitling a photo doesn't alter a pixel. Also worth a beat of caution: metadata can reveal sensitive information (a photo's GPS tag exposes where you were) — this returns in Lesson 23.
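
A small Python sketch of both claims, assuming a hypothetical photo record that keeps pixels and metadata in separate fields (the field names and values are invented for this example):

```python
# Hypothetical photo records: primary data (pixels) kept separate from metadata.
photos = [
    {"pixels": [[255, 0], [0, 255]],
     "meta": {"title": "IMG_001", "month": "March", "place": "beach"}},
    {"pixels": [[10, 20], [30, 40]],
     "meta": {"title": "IMG_002", "month": "July", "place": "park"}},
]

# Metadata helps find data: select photos by month and place
# without ever looking at a pixel.
march_beach = [p for p in photos
               if p["meta"]["month"] == "March" and p["meta"]["place"] == "beach"]

# Changing metadata leaves the primary data untouched.
pixels_before = [row[:] for row in photos[0]["pixels"]]  # copy for comparison
photos[0]["meta"]["title"] = "Sunset at the beach"
pixels_unchanged = photos[0]["pixels"] == pixels_before

print(len(march_beach), pixels_unchanged)
```

The search reads only the `meta` fields, and the retitling touches only the `meta` fields, which is the point of both CED claims.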

The challenges of big data

When datasets grow from a spreadsheet to billions of records, new problems appear. The CED names these:

Storage and processing scale. A single computer may be unable to store or process the dataset in reasonable time; solutions include distributing storage and computation across many machines (Lesson 19) — the dataset's size becomes a design constraint.
Data cleaning. Real-world data is messy: inconsistent entries ("New York," "NY," and "new york" counted as three cities), typos, missing values, inconsistent units, duplicates. Cleaning data (identifying and correcting or removing corrupt, incomplete, or irrelevant records) often takes more effort than the analysis itself. Cleaning is legitimate and necessary; it is not the same as cherry-picking results.
Bias in data. Data reflects how it was collected. A city's pothole-reporting app collects more reports from neighborhoods where more residents have smartphones and time to report, so the worst roads may generate the fewest reports. Analyzing biased data produces biased conclusions no matter how good the code is. (Computing bias gets a full lesson, Lesson 21; here the claim is that bias can enter at collection.)
Invalid or overreaching conclusions. Beyond correlation/causation: small or unrepresentative samples, survivorship effects, and combining incompatible sources all produce confident nonsense.
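
The pothole scenario can be sketched in a few lines of Python. The neighborhood names, pothole counts, and reporting rates below are all invented for illustration; the only point is that the dataset records reports, not potholes.

```python
# Hypothetical neighborhoods: true pothole counts and the fraction of
# potholes that residents actually report through the app (made-up numbers).
neighborhoods = {
    "Northside": {"potholes": 40, "report_rate": 0.9},
    "Midtown":   {"potholes": 60, "report_rate": 0.5},
    "Eastside":  {"potholes": 90, "report_rate": 0.2},
}

# What the city's dataset contains: reports, not potholes.
reports = {name: round(n["potholes"] * n["report_rate"])
           for name, n in neighborhoods.items()}

worst_by_reports = max(reports, key=reports.get)
worst_in_reality = max(neighborhoods,
                       key=lambda name: neighborhoods[name]["potholes"])

print(reports)
print("Most reports:", worst_by_reports)
print("Most potholes:", worst_in_reality)
```

With these numbers the neighborhood with the most reports is the one with the fewest potholes. The code is correct; the collection process is what skews the answer.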

Where insight actually comes from

The CED's picture of data work: collect (with known limitations) → clean → combine/organize → filter and aggregate → visualize → interpret (carefully). Programs do the middle; humans supply the questions and the skepticism at both ends. On the exam, respect for that pipeline shows up as answers acknowledging limits: "the data supports X for the population sampled," "additional data would be needed to conclude Y."
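
As a rough sketch of the middle of that pipeline, the Python below cleans, aggregates, and then reports a result together with its sample size. The survey records and field names are hypothetical, and dropping incomplete records is only one possible cleaning choice.

```python
# Hypothetical survey records; None marks a missing value.
raw = [
    {"city": " Boston ", "hours_online": 3.5},
    {"city": "boston",   "hours_online": None},  # incomplete record
    {"city": "Chicago",  "hours_online": 5.0},
    {"city": "CHICAGO",  "hours_online": 4.0},
    {"city": "Boston",   "hours_online": 2.5},
]

# Clean: normalize city names and drop records with missing measurements.
cleaned = [
    {"city": r["city"].strip().title(), "hours_online": r["hours_online"]}
    for r in raw
    if r["hours_online"] is not None
]

# Aggregate: collect the values for each city, then average them.
totals = {}
for r in cleaned:
    totals.setdefault(r["city"], []).append(r["hours_online"])
averages = {city: sum(v) / len(v) for city, v in totals.items()}

# Interpret carefully: report each result together with its sample size.
for city, avg in averages.items():
    print(f"{city}: {avg:.1f} hours on average (n = {len(totals[city])} responses)")
```

Printing the sample size next to each average is the code-level version of "the data supports X for the population sampled."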

Worked Examples

Example 1 (easy): Metadata or data?

Problem: For a podcast episode file, classify each as primary data or metadata: (i) the audio recording, (ii) the episode's duration, (iii) the upload date, (iv) the host's spoken words.

Solution: (i) primary data, (ii) metadata, (iii) metadata, (iv) primary data (it's in the audio content). Duration and upload date describe the file; the recording and its contents are the file.

Interpretation: Litmus test: if you changed it, would the content itself change? Renaming or re-dating the file leaves the audio identical → those are metadata.

Example 2 (medium): The overreach detector

Problem: A fitness app's data shows users who log breakfast lose more weight than users who don't. The company drafts four marketing claims. Which is defensible? (A) "Logging breakfast causes weight loss." (B) "Eat breakfast and you will lose weight." (C) "Breakfast-loggers in our data lost more weight on average than non-loggers." (D) "Weight loss is impossible without logging breakfast."

Strategy: Data is observational → correlation only. Find the weakest claim.

Solution: (C). It restates the observed association without asserting cause. (A) and (B) claim causation (maybe motivated users both log meals and exercise more — hidden factor). (D) is absurd overreach.

Interpretation: Notice (C) also scopes the claim: "in our data." Defensible conclusions are hedged and scoped. That style — claim exactly what the data shows — is what the exam rewards.

Example 3 (AP-style): Bias at collection

Problem: A streaming service surveys viewers about content preferences using a pop-up shown only in its smart-TV app. Results show overwhelming preference for family movies, so the service cuts back late-night documentary production. What is the most significant flaw?

Strategy: Ask who could even see the survey.

Solution: The sample is biased at collection: smart-TV users skew toward living-room, family-context viewing. Phone and laptop viewers — plausibly the documentary audience — never saw the pop-up. The data faithfully records the wrong population, so the conclusion doesn't transfer to all viewers. The fix is collecting across all platforms (and even then, pop-up responders aren't all viewers).

Interpretation: Collection bias questions hinge on a mismatch between the population sampled and the population the conclusion targets. Find who's missing.

Example 4 (medium): Why cleaning matters

Problem: A program counts customer visits per city from this data: ["Chicago", "chicago", "CHICAGO", "Chicgo", "Boston"]. It reports Chicago = 1, chicago = 1, CHICAGO = 1, Chicgo = 1, Boston = 1. What happened, and what's the remedy?

Solution: The program is correct; the data is dirty. Case variations and a typo split one real city across four separate counts. Remedy: clean the data (normalize case to "chicago", correct or flag misspellings) before analysis. After cleaning: Chicago = 4, Boston = 1, a completely different (and true) picture.

Interpretation: "The program ran without errors" tells you nothing about the conclusion's validity when the input is inconsistent. Garbage in, confident garbage out.
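
One way the remedy might look in Python, as a minimal sketch. The `CORRECTIONS` lookup and the `clean_city` helper are hypothetical names introduced for this example; a real project would build the correction list by inspecting the data.

```python
from collections import Counter

visits = ["Chicago", "chicago", "CHICAGO", "Chicgo", "Boston"]

# Counting the raw strings reproduces the misleading report.
raw_counts = Counter(visits)

# Hypothetical lookup of known misspellings, written by hand for this example.
CORRECTIONS = {"chicgo": "chicago"}

def clean_city(name):
    """Normalize case and whitespace, then apply known spelling fixes."""
    key = name.strip().lower()
    return CORRECTIONS.get(key, key)

clean_counts = Counter(clean_city(v) for v in visits)

print(dict(raw_counts))    # five labels, one visit each
print(dict(clean_counts))  # chicago: 4, boston: 1
```

The counting logic is identical in both passes. Only the input changed, and with it the conclusion.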

Common Mistakes

Concluding causation from correlation. The most-tested error in Big Idea 2. Observational data supports "associated with," never "causes" — no matter how strong the correlation.
Thinking metadata changes the data. Editing a filename, tag, or date stamp leaves the primary content untouched. The CED states this explicitly, and the exam asks it directly.
Treating cleaning as cheating. Correcting inconsistent formats and removing corrupt records is required good practice. (Deleting data because you dislike its conclusion — that's the cheating.)
Ignoring who's missing from the data. A dataset can be huge and still unrepresentative. Size doesn't fix collection bias; a billion smart-TV responses still exclude phone viewers.
Assuming one computer can handle anything. Scale questions expect you to say: past a certain size, storing/processing requires distributing work across machines and/or more efficient processing — dataset size is a real constraint.

Practice Problems

Question 1

Data shows that neighborhoods with more streetlights have less crime. Which conclusion does this data support?

(A) Installing streetlights causes crime to fall
(B) Criminals cause streetlight removal
(C) Streetlight count and crime rate are correlated in this data
(D) Doubling streetlights would halve crime

Question 2

Which is metadata for a digital photograph?

(A) The color of each pixel
(B) The GPS location where the photo was taken
(C) The person's face in the photo
(D) The photo's subject matter

Question 3

Changing a file's metadata (for example, renaming it):

(A) Changes the primary data it describes
(B) Compresses the file
(C) Deletes the file's contents
(D) Does not change the primary data it describes

Question 4

A dataset of restaurant reviews contains "5", "five stars", "5/5", and "★★★★★" as ratings. Before analysis, the team should:

(A) Delete all reviews with nonstandard ratings
(B) Clean the data by converting equivalent ratings to one consistent format
(C) Analyze the data as-is, since programs handle inconsistency automatically
(D) Collect metadata instead

Question 5

A national poll about internet habits is conducted entirely through an online form. The most significant limitation is:

(A) Online forms cannot store responses
(B) The sample excludes people without internet access, a group whose habits the poll is supposed to describe
(C) The results will contain syntax errors
(D) Online data cannot be cleaned

Question 6

Which task is metadata MOST useful for?

(A) Changing the contents of a document
(B) Finding all photos taken in March at a specific beach
(C) Compressing a video file
(D) Correcting a logic error

Question 7

A hospital's dataset is too large for one computer to process in reasonable time. Which approach does the CED suggest?

(A) Delete half the data
(B) Convert the data to analog form
(C) Process the data in parallel across multiple computers
(D) Print the data and analyze it manually

Question 8

Users who buy hiking boots at an online store also frequently buy water bottles. The store can defensibly conclude:

(A) Hiking boots cause water-bottle purchases
(B) These two purchases are associated, so recommending bottles to boot-buyers may be effective
(C) Water bottles cause hiking-boot purchases
(D) Everyone who buys boots will buy a bottle

Question 9

(Select two answers.) Which are challenges the CED associates with extracting information from very large datasets?

(A) The need to clean inconsistent or incomplete records
(B) The impossibility of storing data in binary
(C) Storage and processing demands that can exceed a single machine
(D) Metadata making files impossible to search

Question 10

A scientist finds that students who sleep more have higher grades, and honestly reports: "In our sample, sleep duration and GPA were positively correlated; experiments would be needed to establish whether more sleep improves grades." This statement is:

(A) Overreaching, because it mentions experiments
(B) Invalid, because correlations cannot be reported
(C) Wrong, because the causal direction is already obvious
(D) Appropriately scoped: it claims the association and flags what causation would require

Question 11

Which scenario best illustrates bias introduced at data collection?

(A) A program crashes while sorting a dataset
(B) A pothole app's reports cluster in wealthy areas because smartphone ownership varies by neighborhood
(C) A dataset is compressed with a lossless algorithm
(D) A survey stores answers with inconsistent capitalization

Question 12

"Data" is to "information" as:

(A) Raw measurements are to the patterns and conclusions extracted from them
(B) Conclusions are to measurements
(C) Metadata is to primary data
(D) Output is to input

Create PT Connection

If your Create PT processes data — scores, prices, votes, sensor readings — this lesson is your design conscience:

Displaying a conclusion? Scope it. A program that reports "Most users prefer option A" from 12 votes is overreaching; "7 of 12 votes chose A" is exact and defensible. Written Response 1 rewards precise description of what your program actually determines.
Handle dirty input. What does your program do with an empty entry, a negative score, text where a number belongs? A line or two of input-checking is both good engineering and a ready-made Written Response 2(b) story (errors and testing).
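
Both habits fit in a few lines. The sketch below is one possible shape, not a required Create PT structure; `parse_vote`, `summarize`, and the sample entries are hypothetical names and data invented for illustration.

```python
def parse_vote(entry, valid_options=("A", "B")):
    """Return a cleaned vote, or None if the entry is empty or invalid."""
    vote = entry.strip().upper()
    return vote if vote in valid_options else None

def summarize(entries):
    """Count valid votes and describe exactly what was observed."""
    votes = [v for v in (parse_vote(e) for e in entries) if v is not None]
    count_a = votes.count("A")
    return f"{count_a} of {len(votes)} valid votes chose A"

# Hypothetical input, including an empty entry and stray text.
entries = ["A", "b", " a ", "", "A", "B", "maybe",
           "A", "B", "A", "a", "B", "A", "B"]
print(summarize(entries))
```

The output states a count rather than a claim about "most users," and the empty and stray entries are skipped instead of crashing the program or being silently counted.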

Mini practice (passage-skill warm-up): In one sentence each, state (i) a piece of data a bike-share app collects, (ii) information extractable from lots of it, (iii) a bias risk in that data. Model: (i) start/end station of each ride; (ii) which stations run empty at rush hour; (iii) rides by tourists with the app overrepresent downtown, so residential stations look less used than they are.

Answer Key

1. (C). Observational data → correlation only. (A), (B), (D) all assert causal or predictive claims the data can't support. (Plausible hidden factors: wealthier areas afford both more lighting and more security.)

2. (B). GPS location describes the photo — data about data. (A), (C), (D) are the photo's content, i.e., primary data.

3. (D). CED states it directly: changing metadata leaves primary data untouched. This is asked near-verbatim on real forms.

4. (B). Equivalent values in different formats = classic cleaning task; convert to one representation, keep all the reviews. (A) throws away good data; (C) is false — programs count "5" and "five stars" as different.

5. (B). Population-sample mismatch at collection: the offline population can't respond to an online poll — precisely the group an internet-habits study must include. Others are nonsense.

6. (B). Date + location tags = finding and organizing, metadata's core uses. (A)/(C)/(D) aren't metadata operations.

7. (C). Scale challenge → parallel/distributed processing (fully developed in Lesson 19). (A) destroys information; (B) is backwards; (D) is a joke answer — real forms include one occasionally, don't overthink it.

8. (B). The association is real and actionable (recommendations) without any causal claim. (A)/(C) assert causation; (D) turns a tendency into a certainty.

9. (A) and (C). Cleaning and scale — the CED's named challenges. (B) is false; (D) reverses metadata's purpose.

10. (D). This is the model sentence: state the correlation, scope it to the sample, name what causation would need. It's what your own conclusions should sound like.

11. (B). Who gets sampled depends on smartphone access → collection bias skewing the dataset before any code runs. (D) is dirtiness, not bias; (A)/(C) are unrelated.

12. (A). Data = raw values; information = extracted insight. (C) is a different distinction entirely; (B)/(D) invert.

Exam tip: On any "what can be concluded" question, rank the answer choices from weakest claim to strongest. The defensible answer is almost always the weakest one that still says something — correlations reported as correlations, scoped to the data collected. If an answer contains "causes," "will," or "proves," it needs experimental evidence the scenario almost never provides.
