CSPIQ · AP Computer Science Principles

Lesson 06: Using Programs with Data

Big Idea 2 (DAT) · Phase 2

Objectives

Predict the exact output of a filter (keep rows meeting a condition) and a sort (reorder rows by a column) on a given table
Chain operations: filter-then-sort, filter-then-count, and explain why order can matter
Explain when data from two sources can be combined (a shared field) and what new questions that unlocks

Warm-Up

Your school's activities office has a spreadsheet of 2,400 club-membership records: student name, grade, club, hours volunteered. The principal asks: "Which seniors volunteered more than 20 hours, sorted by hours?"

Doing this by hand means scanning 2,400 rows, copying qualifying ones, then ordering them — twenty error-prone minutes. A program (or spreadsheet feature) does it in milliseconds: filter grade = 12, filter hours > 20, sort by hours descending. Three operations, composable like LEGO bricks, each simple alone and powerful chained.

This lesson is about those bricks. The exam will hand you a small table and a described chain of operations; your job is to execute the chain exactly — no skipped steps, no assumptions.

Core Concept

The dataset as a table

Data-processing questions present data as a table: each row is a record (one student, one song, one purchase); each column is a field (name, grade, price). Programs process such tables with a small set of operations:

Filtering keeps only the rows that satisfy a condition (grade = 12, price < 10). Output: a smaller table, same columns, original order preserved among survivors.
Sorting reorders rows by a column's values, ascending or descending. Output: same rows, new order.
Searching finds specific records (Lesson 12 builds search algorithms; here it's the spreadsheet-level idea).
Aggregating / summarizing computes a single value from many rows: count, sum, mean, minimum, maximum.
Visualizing turns the table into charts to expose patterns.

These operations are the "extracting information" machinery from Lesson 5 made concrete.

Executing a filter — precisely

Table: clubRecords
| Name    | Grade | Club     | Hours |
|---------|-------|----------|-------|
| Aisha   | 12    | Robotics | 34    |
| Ben     | 11    | Chess    | 22    |
| Carmen  | 12    | Robotics | 18    |
| Diego   | 12    | Art      | 25    |
| Elena   | 10    | Chess    | 40    |

Filter Grade = 12: keep Aisha, Carmen, Diego. Filter that result by Hours > 20: keep Aisha (34) and Diego (25) — Carmen's 18 fails. Two rows survive.

Discipline points: check every row against the condition; respect strict vs. inclusive comparisons (> 20 excludes a row with exactly 20; ≥ 20 includes it — Lesson 2's boundary obsession again); multiple conditions joined by AND require all to hold, OR requires at least one.
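
The filter steps above can be sketched in Python. This is a minimal illustration, not exam syntax: the list-of-dictionaries layout and the helper name `keep_rows` are assumptions made for this example. Only the table values come from clubRecords.

```python
# clubRecords from the table above, one dictionary per row (record).
club_records = [
    {"Name": "Aisha",  "Grade": 12, "Club": "Robotics", "Hours": 34},
    {"Name": "Ben",    "Grade": 11, "Club": "Chess",    "Hours": 22},
    {"Name": "Carmen", "Grade": 12, "Club": "Robotics", "Hours": 18},
    {"Name": "Diego",  "Grade": 12, "Club": "Art",      "Hours": 25},
    {"Name": "Elena",  "Grade": 10, "Club": "Chess",    "Hours": 40},
]


def keep_rows(table, condition):
    """Hypothetical helper: keep the rows where condition(row) is True.

    Every row is checked, and survivors stay in their original order.
    """
    return [row for row in table if condition(row)]


# Two chained filters, exactly as in the walkthrough.
seniors = keep_rows(club_records, lambda r: r["Grade"] == 12)
seniors_over_20 = keep_rows(seniors, lambda r: r["Hours"] > 20)

# One filter with AND: both conditions must hold.
both = keep_rows(club_records, lambda r: r["Grade"] == 12 and r["Hours"] > 20)

# One filter with OR: at least one condition must hold.
either = keep_rows(club_records, lambda r: r["Grade"] == 12 or r["Hours"] > 20)

print([r["Name"] for r in seniors_over_20])  # ['Aisha', 'Diego']
print([r["Name"] for r in either])           # all five names
```

Chaining two filters and joining the conditions with AND keep the same rows. With OR, every row in this table survives, because each student is either a senior or has more than 20 hours.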

Executing a sort — precisely

Sort the filtered result by Hours descending: Aisha (34), then Diego (25). Sorting doesn't add or remove rows — if your sorted table has a different row count than before, you dropped something.

Ties: if two rows share the sort value, the question will either not care or specify a tiebreaker. Don't invent one.

Filter-then-sort vs. sort-then-filter: the set of rows that survives is the same either way. But questions asking "which record is FIRST after these operations" depend on the exact sequence, so execute in the stated order, literally.
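
A short Python sketch of the sort step, under the same assumptions as any code in this lesson: the helper names `keep_rows` and `sort_rows` are made up for illustration, and only the table values come from clubRecords.

```python
club_records = [
    {"Name": "Aisha",  "Grade": 12, "Club": "Robotics", "Hours": 34},
    {"Name": "Ben",    "Grade": 11, "Club": "Chess",    "Hours": 22},
    {"Name": "Carmen", "Grade": 12, "Club": "Robotics", "Hours": 18},
    {"Name": "Diego",  "Grade": 12, "Club": "Art",      "Hours": 25},
    {"Name": "Elena",  "Grade": 10, "Club": "Chess",    "Hours": 40},
]


def keep_rows(table, condition):
    """Hypothetical helper: filter, preserving original order."""
    return [row for row in table if condition(row)]


def sort_rows(table, column, descending=False):
    """Hypothetical helper: same rows, reordered by one column."""
    return sorted(table, key=lambda row: row[column], reverse=descending)


def senior_over_20(row):
    return row["Grade"] == 12 and row["Hours"] > 20


filter_then_sort = sort_rows(
    keep_rows(club_records, senior_over_20), "Hours", descending=True
)
sort_then_filter = keep_rows(
    sort_rows(club_records, "Hours", descending=True), senior_over_20
)
sorted_only = sort_rows(club_records, "Hours", descending=True)

print([r["Name"] for r in filter_then_sort])  # ['Aisha', 'Diego']
print(sorted_only[0]["Name"])                 # Elena
print(len(sorted_only) == len(club_records))  # True: sorting keeps every row
```

The first row is Elena if only the sort has run, and Aisha once the filter has also run, which is why "which record is first" questions must be executed step by step. One Python-specific detail: `sorted` leaves tied rows in their original relative order. That is a property of this language, not a tiebreaker the exam promises.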

Aggregation

"How many juniors?" = filter + count. "Average hours of Robotics members?" = filter Club = Robotics → mean of Hours → (34 + 18) / 2 = 26. The classic error: computing the aggregate over the whole table instead of the filtered rows. Filter first, then aggregate what remains.

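The two aggregation questions above, plus the classic error, can be sketched in a few lines of Python. The variable names are assumptions for this example. The numbers come from clubRecords.

```python
club_records = [
    {"Name": "Aisha",  "Grade": 12, "Club": "Robotics", "Hours": 34},
    {"Name": "Ben",    "Grade": 11, "Club": "Chess",    "Hours": 22},
    {"Name": "Carmen", "Grade": 12, "Club": "Robotics", "Hours": 18},
    {"Name": "Diego",  "Grade": 12, "Club": "Art",      "Hours": 25},
    {"Name": "Elena",  "Grade": 10, "Club": "Chess",    "Hours": 40},
]

# "How many juniors?" = filter Grade = 11, then count.
junior_count = len([r for r in club_records if r["Grade"] == 11])

# "Average hours of Robotics members?" = filter first, then take the mean.
robotics_hours = [r["Hours"] for r in club_records if r["Club"] == "Robotics"]
robotics_mean = sum(robotics_hours) / len(robotics_hours)

# The classic error: aggregating over the whole table instead.
all_hours = [r["Hours"] for r in club_records]
whole_table_mean = sum(all_hours) / len(all_hours)

print(junior_count)      # 1
print(robotics_mean)     # 26.0
print(whole_table_mean)  # 27.8, the wrong population
```
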
Combining tables

Two datasets can be combined when they share a common field. The activities table has student names and clubs; the main-office table has names and email addresses. Shared field: name. Combining unlocks questions neither table answers alone ("email every Robotics member").

CED-level claims: combining data sources is a named way to gain insight; the shared identifier is what makes the join possible; and combined datasets can reveal more about individuals than either source alone (a privacy concern that returns in Lesson 23 — innocuous datasets can combine into invasive ones).
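
Here is a minimal sketch of the "email every Robotics member" question in Python. The email addresses are placeholders invented for this example (the lesson gives none), and the dictionary lookup is just one simple way to match rows on a shared field.

```python
# Activities table: names and clubs (from clubRecords).
club_records = [
    {"Name": "Aisha",  "Club": "Robotics"},
    {"Name": "Ben",    "Club": "Chess"},
    {"Name": "Carmen", "Club": "Robotics"},
    {"Name": "Diego",  "Club": "Art"},
    {"Name": "Elena",  "Club": "Chess"},
]

# Main-office table: names and emails. HYPOTHETICAL placeholder addresses.
office_records = [
    {"Name": "Aisha",  "Email": "aisha@school.example"},
    {"Name": "Ben",    "Email": "ben@school.example"},
    {"Name": "Carmen", "Email": "carmen@school.example"},
    {"Name": "Diego",  "Email": "diego@school.example"},
    {"Name": "Elena",  "Email": "elena@school.example"},
]

# Index the office table by the shared field so rows can be matched.
email_by_name = {row["Name"]: row["Email"] for row in office_records}

# Filter one table (Club = Robotics), then pull the matching field from the other.
robotics_emails = [
    email_by_name[row["Name"]]
    for row in club_records
    if row["Club"] == "Robotics" and row["Name"] in email_by_name
]

print(robotics_emails)  # ['aisha@school.example', 'carmen@school.example']
```

Neither table alone can produce this list: one has the clubs, the other has the emails, and the shared Name field is the only bridge between them.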

Why programs, not eyeballs

The CED's point isn't that filtering is clever — it's that the size of real datasets makes programs the only viable tool. A human can filter 5 rows; only a program filters 5 billion. Bonus claims: programs apply the condition identically to every row (no fatigue errors), and the same program re-runs instantly when data updates. When a question asks why an organization processes data with software, the answer is scale + consistency + repeatability.

Worked Examples

Example 1 (easy): Single filter

Problem: Using clubRecords above, how many rows does the filter Club = "Chess" keep?

Solution: Check all five rows: Ben ✓, Elena ✓, others ✗. 2 rows.

Interpretation: Yes, it's this mechanical. Full credit goes to those who check every row rather than stopping at the first match.

Example 2 (medium): Chained operations

Problem: Starting from clubRecords: filter Hours ≥ 22, then sort by Name ascending. List the resulting Name column, in order.

Strategy: Execute in order. Filter first — carefully at the boundary — then alphabetize survivors.

Solution: Filter Hours ≥ 22: Aisha (34) ✓, Ben (22) ✓ (inclusive — 22 stays), Carmen (18) ✗, Diego (25) ✓, Elena (40) ✓. Sort by Name ascending: Aisha, Ben, Diego, Elena.

Interpretation: The planted trap was Ben: ≥ 22 keeps exactly-22. Had the filter been > 22, Ben drops. One symbol, different answer — read the comparison operator like it's radioactive.
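
To see the one-symbol difference concretely, here is a small Python sketch that runs Example 2 twice, once with the inclusive comparison and once with the strict one. The dictionary layout is an assumption made for brevity. The values come from clubRecords.

```python
# Name -> Hours, taken from clubRecords.
hours_by_name = {"Aisha": 34, "Ben": 22, "Carmen": 18, "Diego": 25, "Elena": 40}

# Filter Hours >= 22, then sort by Name ascending.
inclusive = sorted(name for name, hours in hours_by_name.items() if hours >= 22)

# Same chain with the strict comparison Hours > 22.
strict = sorted(name for name, hours in hours_by_name.items() if hours > 22)

print(inclusive)  # ['Aisha', 'Ben', 'Diego', 'Elena']
print(strict)     # ['Aisha', 'Diego', 'Elena']

# The only row that changes sides is the boundary row.
print(set(inclusive) - set(strict))  # {'Ben'}
```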

Example 3 (medium): Aggregate after filter

Problem: What is the average Hours among students in grade 12?

Solution: Filter Grade = 12 → Aisha 34, Carmen 18, Diego 25. Mean = (34 + 18 + 25) / 3 = 77 / 3 ≈ 25.67.

Interpretation: The tempting error: averaging all five rows ((34+22+18+25+40)/5 = 27.8) — a listed distractor whenever this question appears. Filter first. Aggregate second.

Example 4 (AP-style): Combining tables

Problem: A library has two tables. Table 1: cardNumber, name, gradeLevel. Table 2: cardNumber, bookTitle, dueDate. Which question can be answered ONLY by combining both tables?

(A) How many books are due this week? (B) How many students are in grade 10? (C) Which grade level has the most overdue books? (D) What is the most-borrowed book title?

Solution: (C). Grade level lives only in Table 1; overdue status (dueDate) lives only in Table 2; connecting them requires matching rows via the shared cardNumber. (A) and (D) need only Table 2; (B) needs only Table 1.

Interpretation: The combine-tables question always works this way: find the answer choice whose two required fields live in different tables. Name the shared field to confirm the join is possible.
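
Choice (C) can be sketched in Python as join, then filter, then aggregate. Everything below except the column names is hypothetical: the rows, the dates, and the value of `today` are invented so the example can run.

```python
from collections import Counter
from datetime import date

# Table 1 (HYPOTHETICAL rows): cardNumber, name, gradeLevel.
patrons = [
    {"cardNumber": 101, "name": "Patron A", "gradeLevel": 9},
    {"cardNumber": 102, "name": "Patron B", "gradeLevel": 10},
    {"cardNumber": 103, "name": "Patron C", "gradeLevel": 10},
]

# Table 2 (HYPOTHETICAL rows): cardNumber, bookTitle, dueDate.
loans = [
    {"cardNumber": 101, "bookTitle": "Title 1", "dueDate": date(2024, 3, 1)},
    {"cardNumber": 102, "bookTitle": "Title 2", "dueDate": date(2024, 3, 5)},
    {"cardNumber": 103, "bookTitle": "Title 3", "dueDate": date(2024, 3, 8)},
    {"cardNumber": 103, "bookTitle": "Title 4", "dueDate": date(2024, 4, 2)},
]

today = date(2024, 3, 15)  # assumed "current" date for the overdue check

# Join key: cardNumber. Look up each loan's grade level in Table 1.
grade_by_card = {p["cardNumber"]: p["gradeLevel"] for p in patrons}

# Filter (overdue loans only), then aggregate (count per grade level).
overdue_by_grade = Counter(
    grade_by_card[loan["cardNumber"]]
    for loan in loans
    if loan["dueDate"] < today
)

top_grade, top_count = overdue_by_grade.most_common(1)[0]
print(top_grade, top_count)  # 10 2
```

Delete either table and the computation cannot run: `gradeLevel` exists only in Table 1, `dueDate` exists only in Table 2, and `cardNumber` is what connects them.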

Common Mistakes

Boundary blindness in filters. > 20 vs ≥ 20 on a row with exactly 20. The exam plants a boundary row nearly every time. Circle the operator, check that row explicitly.
Aggregating before filtering. "Average hours of seniors" over all grades. The condition defines the population first.
Sorting the wrong direction. Ascending = smallest/A first; descending = largest/Z first. Questions say which; students skim past it.
Losing rows in a sort. Sorting reorders; it never removes. Recount after sorting.
Assuming any two tables can combine. They need a shared field with matching values. No common identifier, no join — and an answer choice claiming otherwise is wrong.

Practice Problems

Problems 1–7 use this table of a music app's playHistory:

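| Song   | Artist  | Genre | Plays | Minutes |
|--------|---------|-------|-------|---------|
| Vega   | Lumen   | Pop   | 41    | 3       |
| Slate  | Korrid  | Rock  | 18    | 4       |
| Ember  | Lumen   | Pop   | 30    | 3       |
| Quill  | Fenwick | Jazz  | 18    | 5       |
| Aurora | Korrid  | Rock  | 55    | 4       |
| Drift  | Fenwick | Jazz  | 12    | 6       |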

Question 1

How many rows does the filter Genre = "Rock" keep?

(A) 1
(B) 2
(C) 3
(D) 4

Question 2

After filtering Plays > 18, which songs remain?

(A) Vega, Ember, Aurora
(B) Vega, Slate, Ember, Quill, Aurora
(C) Vega, Aurora
(D) Vega, Ember, Aurora, Drift

Question 3

Sort the full table by Plays descending. Which song is second?

(A) Vega
(B) Aurora
(C) Ember
(D) Slate

Question 4

Filter Artist = "Lumen", then compute total Plays of the remaining rows:

(A) 41
(B) 174
(C) 71
(D) 96

Question 5

Which chained operation answers: "List the Jazz songs from most-played to least-played"?

(A) Filter Genre = "Jazz", then sort by Plays descending
(B) Sort by Plays ascending, then filter Artist = "Fenwick"
(C) Filter Plays > 12, then sort by Genre
(D) Sort by Song ascending, then filter Genre = "Jazz"

Question 6

What is the average of Minutes across songs with Plays ≥ 30?

(A) 4.17
(B) 3.33
(C) 3.5
(D) 10

Question 7

A second table has columns Artist, Country, DebutYear. Which question requires combining it with playHistory?

(A) Which artist debuted first?
(B) What is the total number of plays of songs by artists from Iceland?
(C) Which genre has the most plays?
(D) How many artists are in the table?

Question 8

Why are programs preferred over manual processing for large datasets? (Select two answers.)

(A) Programs can process at a scale (millions of records) impossible by hand
(B) Programs apply conditions identically to every record without fatigue errors
(C) Programs eliminate the need to clean data
(D) Programs guarantee the data was collected without bias

Question 9

A filter keeps rows where Grade ≥ 11 AND Hours > 30. A student in grade 11 with exactly 30 hours is:

(A) Kept, because both conditions are satisfied
(B) Excluded, because Hours > 30 is false
(C) Kept, because at least one condition is satisfied
(D) Excluded, because Grade ≥ 11 is false

Question 10

Sorting a 500-row table by price, ascending, produces a table with:

(A) Fewer than 500 rows, since duplicates merge
(B) The cheapest item in the first row and 500 rows total
(C) The most expensive item in the first row and 500 rows total
(D) Rows in their original order

Question 11

A store's loyalty data and its online-order data both include a customer ID. Combining them could reveal a customer's in-store and online behavior together. This illustrates:

(A) That combining data sources can reveal more about individuals than either source alone
(B) That data cannot be combined without violating syntax rules
(C) Lossless compression
(D) That customer IDs are metadata about the store

Question 12 (short response)

Using playHistory: describe a chain of two operations that produces the Pop songs ordered alphabetically, and state the resulting Song column.

Create PT Connection

Filtering, aggregating, and processing a list of data is exactly the shape of a strong Create PT program — and of the list-traversal algorithms coming in Lesson 12. The PT rubric wants a list that manages complexity plus a procedure implementing an algorithm with selection and iteration. "Loop through my list, keep/count/total the entries matching a condition" is the canonical way to satisfy all of it at once.

Start keeping a PT idea list now, in this shape: what data (list) → what question (filter/aggregate) → what output. Examples: quiz scores → how many above the class average → display the count; pantry items → which expire this week → display the list. By Lesson 14 you'll wrap one of these in a procedure, and your PT skeleton will exist.
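
As one possible shape for the first idea (quiz scores, how many above the class average, display the count), here is a minimal Python sketch. The procedure name and the scores are hypothetical, and a real Create PT program would be written in whatever language your class uses.

```python
def count_above_average(scores):
    """Return how many entries in scores are strictly above the list's mean."""
    if not scores:  # an empty list has no average; report zero matches
        return 0
    average = sum(scores) / len(scores)
    count = 0
    for score in scores:     # iteration: visit every entry in the list
        if score > average:  # selection: keep only entries meeting the condition
            count += 1
    return count


quiz_scores = [70, 85, 90, 60, 95]  # HYPOTHETICAL data; the average is 80
print(count_above_average(quiz_scores))  # 3
```

The list holds the data, the loop is the iteration, the `if` is the selection, and the procedure wraps all of it behind one name: the same filter-then-count pattern from this lesson.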


Answer Key

1. (B). Slate and Aurora. Mechanical scan of all six rows.

2. (A). Plays > 18 (strict): Vega 41 ✓, Slate 18 ✗ (boundary row — strict excludes it), Ember 30 ✓, Quill 18 ✗, Aurora 55 ✓, Drift 12 ✗. (B) treats > as ≥ — the planted trap.

3. (A). Descending Plays: Aurora 55, Vega 41, Ember 30, Slate 18 / Quill 18, Drift 12. Second = Vega. (B) is first, not second — read the position asked.

4. (C). Lumen rows: Vega 41 + Ember 30 = 71. (B) totals the whole table (aggregate-before-filter error); (A) takes one row only.

5. (A). Goal = Jazz only (filter) in play order (sort descending). (B) filters by artist, not genre — Fenwick happens to be Jazz in this table, but the operation doesn't express the goal and would break if Fenwick released a pop song. Express the condition you mean. (C) keeps non-jazz; (D) sorts by the wrong column.

6. (B). Plays ≥ 30: Vega (3 min), Ember (3), Aurora (4). Mean = (3+3+4)/3 = 10/3 ≈ 3.33. (A) averages Minutes over all six rows (≈ 4.17) — filter first! (D) is the sum, not the mean.

7. (B). "Plays" lives in playHistory; "Country" lives in the artist table; the join key is Artist. (A) and (D) need only the second table; (C) needs only the first.

8. (A) and (B). Scale and consistency — the CED's stated reasons. (C) false: cleaning is still on you. (D) false: processing can't repair collection bias (Lesson 5).

9. (B). AND requires both. Grade 11 ✓ (≥ 11 inclusive), Hours exactly 30 fails the strict > 30 → excluded. This question is Lessons 2 + 6 shaking hands: boundaries and logic.

10. (B). Ascending = smallest first; sorting preserves row count. (A) invents merging; (C) is descending; (D) denies the sort.

11. (A). Shared ID joins the tables; the combined profile exceeds either source — the CED's stated insight and privacy warning (returns in Lesson 23).

12. (Model answer.) Filter Genre = "Pop", then sort by Song ascending. Result: Ember, Vega. (Sort-then-filter also earns credit if the final list is correct — but say the operations explicitly.)


Exam tip: For table questions, work with your pencil ON the table: strike out filtered rows, number the sort order in the margin. Never execute two operations in your head at once — the questions are engineered so that shortcut-takers pick the distractor that skips a step.
