CSPIQ · AP Computer Science Principles

Lesson 04: Data Compression

Big Idea 2 (DAT) · Phase 2

Objectives

Define lossless and lossy compression and state the exact trade-off between them
Encode and decode simple run-length encoding (RLE) examples and compute the savings
Pick the right compression type from a scenario's purpose — the word the correct answer always hinges on

Warm-Up

You take a 12-megabyte photo of your dog. You text it to a friend — and the messaging app sends a 400-kilobyte version that still looks, to your eye, exactly like your dog. That's a 30× shrink. Where did 96% of the photo go?

Gone. Permanently. The app threw away detail your eye probably wouldn't notice — subtle color gradations, fine texture. Your friend can never recover the original from what you sent.

Now suppose instead you email your friend a 12-megabyte spreadsheet, zipped down to 2 megabytes. When she unzips it, she'd better get every cell back exactly — a spreadsheet with "approximately your data" is garbage.

Two compressions, two totally different promises. That difference is this whole lesson.

Core Concept

Why compression exists

Data compression reduces the number of bits needed to store or transmit data. Fewer bits = less storage, faster transmission, lower cost. Since everything is bits (Lesson 3), everything can potentially be compressed.

Compression works because real data has redundancy — repeated patterns, predictable structure. "AAAAAABBBB" doesn't need 10 stored letters; "6 A's, then 4 B's" carries the same information in less space.

Lossless compression: the perfect-restoration promise

Lossless compression reduces bits in a way that allows complete reconstruction of the original data. Nothing is lost — the compressed version is a perfectly reversible re-encoding.

Guarantee: decompress → get back exactly the original, bit for bit.
Cost: usually a smaller reduction in size than lossy (you can only squeeze out redundancy, not information).
When it's required: any data where every bit matters — text documents, spreadsheets, program code, bank records, medical data, scientific measurements.

Lossy compression: the good-enough bargain

Lossy compression reduces bits by permanently discarding some information — typically detail humans barely perceive.

Guarantee: decompress → get an approximation of the original. The discarded data is unrecoverable.
Benefit: usually a much greater reduction in size than lossless.
When it's acceptable: media where approximation is fine — streaming video, music, photos for sharing, video calls.

The trade-off, stated the way the exam states it

Choose lossless when complete reconstruction matters more than file size. Choose lossy when reducing size/transmission time matters more than perfect fidelity.

The deciding factor is always the purpose of the data's use. Same photo, different answers: archiving the master copy of a professional shoot → lossless; posting to social media → lossy is fine. Scenario questions bury the purpose in one phrase ("the analyst must recover the exact original readings") — find that phrase and the answer falls out.

Run-length encoding: lossless compression you can do by hand

Run-length encoding (RLE) replaces runs of repeated values with a count + the value. It's the exam's standard concrete example of lossless compression.

Consider one row of a black-and-white image (B = black pixel, W = white):

Original:  W W W W W W W W B B B W W W W B
RLE:       8W 3B 4W 1B

The original row takes 16 symbols; the encoded version takes 4 count-value pairs. Decoding is mechanical: write 8 W's, 3 B's, 4 W's, 1 B — the exact original returns, which is what makes RLE lossless.

Computing the savings (a real exam task): Suppose each pixel takes 1 bit uncompressed (16 bits total) and each RLE pair takes 4 bits for the count + 1 bit for the color = 5 bits (20 bits total for 4 pairs). Compressed is bigger — 20 > 16! RLE only wins when runs are long:

Original:  64 white pixels = 64 bits
RLE:       one pair "64 W" — but wait: a 4-bit count only reaches 15!
           Need: 15W 15W 15W 15W 4W = 5 pairs × 5 bits = 25 bits ✓ (61% smaller)

Two morals the exam tests: (1) compression effectiveness depends on the data — long runs compress well, noisy data can even grow; (2) implementation details (like how many bits the count gets) matter when computing savings.

One more distinction: compression is not conversion

Compression changes how many bits encode the data. Sampling (Lesson 3) converts analog → digital. A question about "measuring a sound wave 44,100 times per second" is sampling; a question about "reducing the MP3's file size" is compression. Don't let the sound-context blur them.

Worked Examples

Example 1 (easy): Which compression?

Problem: A hospital stores MRI scans that radiologists will later examine for subtle tissue anomalies. Which compression is appropriate, and why?

Strategy: Find the purpose: "examine for subtle anomalies" → every detail may matter → perfect reconstruction required.

Solution: Lossless. Discarded detail in a lossy scheme could be exactly the anomaly a radiologist needs to see; medical use requires complete reconstruction.

Interpretation: The setting (medical/legal/financial/scientific) almost always signals lossless. The signal-word is a stakes word: diagnose, audit, evidence, exact.

Example 2 (easy): Which compression, part 2

Problem: A video-streaming service wants to serve movies to phones on limited data plans. Which compression, and what's the trade-off?

Solution: Lossy. Purpose = minimize transmission size for smooth playback; viewers tolerate imperceptible (or even mildly perceptible) quality loss. The trade-off: the full original quality can never be recovered from the streamed version — which is fine, because the service keeps the master and the viewer only needs to watch.

Example 3 (medium): RLE by hand

Problem: Encode this pixel row with run-length encoding, then verify decoding restores it: B B B B B B W W B B B B W W W W W W W W

Strategy: Walk left to right, counting runs.

Solution:

Runs: 6 B's → 2 W's → 4 B's → 8 W's
RLE:  6B 2W 4B 8W
Decode check: BBBBBB WW BBBB WWWWWWWW = 6+2+4+8 = 20 pixels ✓ exact original

Interpretation: Count carefully — off-by-one in a run is the whole error mode here. Verify totals: 6+2+4+8 = 20 = original length.

Example 4 (AP-style): When compression backfires

Problem: A student applies run-length encoding to two 24-pixel rows. Row 1: WWWWWWWWWWWWBBBBBBBBBBBB (12 W's, 12 B's) Row 2: WBWBWBWBWBWBWBWBWBWBWBWB (alternating) Each RLE pair costs 6 bits; each raw pixel costs 1 bit. What happens?

Strategy: Count pairs for each row; compare to 24 bits raw.

Solution: - Row 1: 2 runs → 2 pairs × 6 bits = 12 bits vs. 24 raw → compressed to half. ✓ - Row 2: 24 runs (every pixel is a run of 1) → 24 pairs × 6 bits = 144 bits vs. 24 raw → six times larger. ✗

Interpretation: Compression exploits redundancy; data without redundancy can expand under the wrong scheme. If an exam question shows alternating or noisy data, watch for the "compressed version may be larger" answer — it's often correct.

Common Mistakes

"Lossy" ≠ "bad" and "lossless" ≠ "always better." Lossy is the right choice when the purpose is transmission/streaming/sharing and fidelity loss is acceptable. The exam grades your reading of the purpose, not a quality preference.
Thinking lossless means "no compression." Lossless compresses plenty (zip files work!) — it just guarantees perfect reconstruction. The trade is less size reduction, not none.
Believing lossy data can be restored. Discarded is discarded. Decompressing a lossy file gives an approximation; no algorithm can regenerate the lost bits. If an answer choice implies "convert back to the original quality," it's wrong.
RLE miscounts. Runs of length 1 still need a pair (1B). Decode-verify by summing run lengths against the original count.
Confusing sampling with compression. Sampling digitizes analog signals; compression shrinks digital data. They co-star in audio questions but are different operations.

Practice Problems

Question 1

Which statement correctly describes lossless compression?

(A) It reduces file size by permanently discarding the least important data
(B) It typically reduces file size more than lossy compression
(C) It allows the original data to be completely reconstructed
(D) It can only be used on image files

Question 2

Which statement correctly describes lossy compression?

(A) The original data can be perfectly restored by decompression
(B) It generally achieves greater size reduction than lossless compression by discarding information
(C) It is required for text documents and spreadsheets
(D) It stores data without using binary

Question 3

A journalist is choosing how to store interview transcripts (text files) that may later be quoted verbatim in court. She should use:

(A) Lossy compression, to save the most space
(B) No storage, to protect sources
(C) Lossless compression, because every character must be exactly recoverable
(D) Lossy compression, because text has no redundancy

Question 4

Using run-length encoding, WWWWBBBWWWWWW encodes to:

(A) 4W 3B 6W
(B) 4W 3B 5W
(C) W4 B3 W7
(D) 3W 4B 6W

Question 5

The RLE code 2B 5W 1B 4W decodes to a row of how many pixels?

(A) 4
(B) 10
(C) 12
(D) 22

Question 6

A phone's camera app offers "High Quality" (larger files) and "Space Saver" (smaller files, slightly reduced detail) modes. Space Saver most likely uses:

(A) Lossy compression
(B) Lossless compression
(C) No compression
(D) Run-length encoding of text

Question 7

Which data would compress LEAST effectively with run-length encoding?

(A) A pixel row of 60 identical white pixels
(B) A black-and-white logo with large solid regions
(C) A pixel row that alternates colors at every pixel
(D) A row with three long runs of color

Question 8

A music-archiving nonprofit keeps master recordings for future remastering, and also runs a streaming site for casual listeners. The best strategy is:

(A) Lossless for the archive; lossy for streaming
(B) Lossy for the archive; lossless for streaming
(C) Lossless for both, since quality always matters
(D) Lossy for both, since storage is expensive

Question 9

After a file is compressed with a lossy algorithm and then decompressed, the result is:

(A) Identical to the original file
(B) An approximation of the original; the discarded information cannot be recovered
(C) Larger than the original file
(D) Encrypted

Question 10

(Select two answers.) Which factors should drive the choice between lossy and lossless compression?

(A) Whether the data's purpose requires complete reconstruction of the original
(B) Whether minimizing file size or transmission time outweighs perfect fidelity
(C) Whether the data is stored in binary
(D) Whether the file will be stored on a solid-state or magnetic drive

Question 11

A 30-pixel row compresses via RLE to 15W 15W using pairs that cost 6 bits each; raw pixels cost 1 bit each. The compression achieves:

(A) 12 bits total vs. 30 raw — a reduction of more than half
(B) 30 bits total vs. 30 raw — no change
(C) 60 bits total vs. 30 raw — the file doubles
(D) 15 bits total vs. 30 raw — exactly half

Question 12

Why can compressed data sometimes be LARGER than the original?

(A) Compression always increases file size
(B) If the data has little redundancy, the encoding overhead (counts, markers) can exceed the savings
(C) Binary files cannot be compressed
(D) Lossless compression discards too much information

Create PT Connection

Compression won't appear in your PT program — but the trade-off reasoning pattern will, and it's worth internalizing now. The exam's written responses and the PT's design decisions reward the same sentence shape:

"I chose X over Y because the purpose required , and I accepted the cost of ."

That's the lossless-vs-lossy sentence. It's also the sentence for choosing a list over separate variables (Lesson 8) and choosing a procedure over repeated code (Lesson 14). Trade-off framing — benefit, cost, purpose — is the single most reusable answer skeleton in this course.

Mini practice: Write the trade-off sentence for a photo-sharing app choosing lossy compression for uploads. Model: "I chose lossy compression because the purpose — fast uploads on mobile data — makes transmission size more important than perfect fidelity, and I accepted the cost that original image detail cannot be recovered from the uploaded copy."

Show answer key & explanations

(g) Answer Key

1. (C). Complete reconstruction is the defining property. (A) describes lossy. (B) reverses the size trade-off. (D) is a fabricated restriction.

2. (B). Greater reduction by discarding information — both halves of the definition. (A) describes lossless. (C) reverses appropriateness. (D) is nonsense.

3. (C). "Quoted verbatim in court" = exactness is the purpose → lossless. (A)/(D) sacrifice recoverability that the purpose demands.

4. (A). Runs: 4 W's, 3 B's, then count the tail: W W W W W W = 6 W's → 4W 3B 6W. (B) miscounts the last run (off-by-one — the classic). (C) reverses pair order and miscounts. (D) swaps the first two counts.

5. (C). 2 + 5 + 1 + 4 = 12 pixels. (D) is a digit-mash; (B) drops a run; (A) counts pairs, not pixels.

6. (A). Smaller files with reduced detail = information discarded = lossy. (B) wouldn't reduce detail. (D) is a lossless technique and irrelevant to photos-with-reduced-detail.

7. (C). Alternating pixels = 60 runs of length 1 → maximal overhead, worst case (may even expand — see Example 4). (A), (B), (D) all have long runs, RLE's best case.

8. (A). The archive's purpose (future remastering) needs perfect masters → lossless; streaming's purpose (casual listening on limited bandwidth) tolerates loss → lossy. Purpose decides per use, even for the same organization. (C)/(D) apply one answer to two different purposes.

9. (B). Lossy's defining consequence: approximation out, originals unrecoverable. (A) describes lossless.

10. (A) and (B). These are the CED's two decision factors — reconstruction need vs. size/time need. (C) is universal (everything is binary); (D) is irrelevant to the algorithm choice.

11. (A). 2 pairs × 6 bits = 12 bits vs. 30 raw bits → 60% reduction. (D) forgets each pair costs 6 bits, not 7.5; (C) multiplies wrongly.

12. (B). Encoding overhead vs. redundancy — when runs are short, count-markers cost more than they save (Example 4's alternating row). (A) is false in general; (D) confuses lossless with lossy and misstates why size grows.

Answer letter distribution check: C, B, C, A, C, A, C, A, B, A+B, A, B — singles: A×4, B×3, C×4, D×0 + multi (A,B). D underused this lesson (D×1 across L3–4) — Lesson 5 key will include D answers deliberately.

Exam tip: In any compression scenario, underline the purpose phrase before reading the choices. "Must recover the exact original" → lossless. "Wants faster transmission / smaller files and can tolerate quality loss" → lossy. The scenario always contains exactly one such phrase — it is the question.

← All lessons

‹ Lesson 3

Lesson 5 ›