Chapter 4 APCSP

Storing Data: Spreadsheets and Databases

  • Objective of Data: Transform data into information and then into insights.

  • Evolution of Data: From early human records to complex data collection (e.g., targeted advertising).

  • Need for Storage Solutions: Growth in data requires automated storage (spreadsheets for personal finance; databases for complex organizational data).

Gathering Data

  • Crowdsourcing: Collecting information from large groups for specific goals.

    • Citizen Science: Public contributions to scientific research; individuals can participate actively or passively.

Visualizing Data

  • Data Visualization: Makes trends and patterns easier to identify; large data sets typically call for software tools.

    • Historical examples: Minard’s map of the Russian Campaign; John Snow’s cholera mapping.

  • Potential Misrepresentation: Data may be misleading when improperly visualized.

    • Example: Correlation vs. causation. Two variables that trend together are correlated, but that alone does not show that one causes the other.

    • Manipulating axis ranges (e.g., starting the y-axis above zero) can exaggerate or hide differences in the data.
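
The correlation vs. causation point can be illustrated with a small sketch. The monthly figures below are invented for illustration and are not from the chapter; the `pearson` helper is a hypothetical name for a hand-written Pearson correlation coefficient.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical monthly figures (assumed values, for illustration only).
ice_cream_sales = [20, 35, 50, 80, 95, 110]
sunburn_cases = [2, 4, 5, 9, 10, 12]

r = pearson(ice_cream_sales, sunburn_cases)
print(f"correlation = {r:.2f}")
```

A coefficient close to 1 only says the two series rise together. In this made-up scenario a third factor (warm weather) would plausibly drive both, so neither one causes the other.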

Spreadsheet Basics

  • Definition: A grid of rows and columns to store data (numbers and text).

  • Components:

    • Cells: Intersection of rows and columns (e.g., A1).

    • Cell content types: Labels (descriptive text), constants (fixed values), and formulas (calculations, usually referring to other cells).

  • Excel Functions: Built-in functions such as AVERAGE, MIN, MAX, COUNT, and IF simplify calculations on cell ranges.
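
For comparison, here is a rough Python sketch of what those spreadsheet functions compute. The scores and the cell range A1:A5 are assumptions made up for the example.

```python
# Hypothetical quiz scores, as if stored in cells A1:A5.
scores = [88, 92, 75, 64, 100]

average = sum(scores) / len(scores)  # like =AVERAGE(A1:A5)
lowest = min(scores)                 # like =MIN(A1:A5)
highest = max(scores)                # like =MAX(A1:A5)
count = len(scores)                  # like =COUNT(A1:A5) when every cell is numeric

# like =IF(A1>=70, "Pass", "Fail") filled down the column
results = ["Pass" if s >= 70 else "Fail" for s in scores]

print(average, lowest, highest, count)
print(results)
```

One difference worth noting: COUNT in a spreadsheet counts only numeric cells, while `len` counts every item in the list.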

Database Fundamentals

  • Databases: Organized collections of data stored in tables to maintain data consistency.

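    • Inconsistencies: Can arise when a transaction fails partway through or is applied more than once; rollback (undoing a partial transaction) and idempotency (repeating an operation has the same effect as doing it once) guard against these errors.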

  • Relational Databases: Connect multiple tables via unique keys to reduce redundancy.

  • Structured Query Language (SQL): Language for managing databases; includes commands for data retrieval and manipulation.

    • Common SQL Keywords: SELECT, FROM, WHERE, JOIN, plus aggregate functions such as COUNT, SUM, AVG, MIN, and MAX.
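
A minimal sketch of these keywords in use, run through Python's built-in `sqlite3` module against an in-memory database. The `students` and `grades` tables and their rows are hypothetical. GROUP BY and ORDER BY are not in the keyword list above; they are added here because an aggregate per student needs grouping and a predictable row order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary database, nothing is saved to disk
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE grades (
        student_id INTEGER REFERENCES students(id),  -- key linking the two tables
        course TEXT,
        score INTEGER
    );
    INSERT INTO students VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO grades VALUES
        (1, 'CSP', 95), (1, 'Math', 85),
        (2, 'CSP', 70), (2, 'Math', 88);
""")

# Average of each student's scores that are 80 or higher.
rows = conn.execute("""
    SELECT students.name, AVG(grades.score) AS avg_score
    FROM students
    JOIN grades ON grades.student_id = students.id
    WHERE grades.score >= 80
    GROUP BY students.name
    ORDER BY students.name
""").fetchall()
conn.close()

print(rows)
```

Student names are stored once in `students` and referenced by `student_id` in `grades`, which is the redundancy reduction that relational tables are meant to provide.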

Big Data Characteristics

  • Definition: Data sets larger than typical consumer software can manage.

    • Features: Volume (amount of data), velocity (speed at which it arrives), variety (range of formats), and the ability of machines to learn from the data.

Important Vocabulary

  • Atomic Transaction: A transaction whose steps either all succeed or, if any step fails, are all undone.

  • Deadlock: Situation where two or more transactions each hold a resource the other needs, so each waits on the other and none can proceed.

  • Simpson’s Paradox: A trend that appears within each individual group can disappear or reverse when the groups are combined.

  • Write-ahead Logging: Method to ensure data consistency by logging changes before applying them.

  • Understanding the significance and application of these principles is crucial for effective data management and utilization.
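
The atomic transaction and rollback ideas from the vocabulary list can be sketched with Python's built-in `sqlite3` module. The `accounts` table, the balances, and the `transfer` helper are hypothetical names and values chosen for illustration; a real database system may handle this differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # balances may never go negative
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, sender, receiver, amount):
    """Move amount between accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit was rolled back too.
        return False

ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # fails and leaves balances untouched
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The receiver is credited first on purpose: when the later debit fails, the earlier credit has to be undone, which is the all-or-nothing behavior the term describes.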