Chapter 4 APCSP
Storing Data: Spreadsheets and Databases
Objective of Data: Transform data into information and then into insights.
Evolution of Data: From early human records to complex data collection (e.g., targeted advertising).
Need for Storage Solutions: Growth in data requires automated storage (spreadsheets for personal finance; databases for complex organizational data).
Gathering Data
Crowdsourcing: Collecting information from large groups for specific goals.
Citizen Science: Public contributions to scientific research; individuals can participate actively or passively.
Visualizing Data
Data Visualization: Essential for identifying trends; requires software tools.
Historical examples: Minard’s map of Napoleon’s 1812 Russian campaign; John Snow’s map of the 1854 London cholera outbreak.
Potential Misrepresentation: Data may be misleading when improperly visualized.
Example: Correlation is not causation; two data sets that follow similar trends do not show that one causes the other.
Manipulation of axis ranges can distort perceptions of data relationships.
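The correlation point above can be checked with a few lines of Python. This is a minimal sketch: the two data series are made-up numbers for illustration only, and the helper name pearson is hypothetical. It shows that two series can move together almost perfectly even though neither causes the other (both could depend on a third factor, such as hot weather).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up monthly figures, for illustration only.
ice_cream_sales = [20, 35, 50, 70, 90, 110]
sunburn_cases = [2, 4, 5, 8, 9, 12]

r = pearson(ice_cream_sales, sunburn_cases)
print(f"correlation = {r:.2f}")  # close to 1: strongly correlated
```

A value near 1 only says the two series rise and fall together. Deciding whether one causes the other takes evidence beyond the numbers themselves.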
Spreadsheet Basics
Definition: A grid of rows and columns to store data (numbers and text).
Components:
Cells: Intersection of rows and columns (e.g., A1).
Types: Labels (descriptive text), constants (fixed values), and formulas (calculations).
Excel Functions: Built-in functions such as AVERAGE, MIN, MAX, COUNT, and IF summarize data and make decisions based on cell values.
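To see what these spreadsheet functions compute, the sketch below reproduces them in Python. The scores and the cell range A1:A5 are hypothetical, and the passing mark of 70 is an assumption chosen for the example. Note that Excel's function for the mean is named AVERAGE.

```python
# Hypothetical quiz scores, as they might appear in cells A1:A5.
scores = [88, 92, 75, 60, 95]

average = sum(scores) / len(scores)  # like =AVERAGE(A1:A5)
lowest = min(scores)                 # like =MIN(A1:A5)
highest = max(scores)                # like =MAX(A1:A5)
count = len(scores)                  # like =COUNT(A1:A5); every cell here is numeric

# like =IF(A1>=70, "Pass", "Fail") filled down the column
results = ["Pass" if s >= 70 else "Fail" for s in scores]

print(average, lowest, highest, count)
print(results)
```

In a spreadsheet the formulas recalculate automatically when a cell changes; in this sketch the lines would have to be run again.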
Database Fundamentals
Databases: Organized collections of data stored in tables to maintain data consistency.
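Inconsistencies: Can arise when a transaction fails or is interrupted partway through; idempotent operations (safe to repeat without changing the result) and rollback (undoing a partial transaction) help prevent errors.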
Relational Databases: Connect multiple tables via unique keys to reduce redundancy.
Structured Query Language (SQL): Language for managing databases; includes commands for data retrieval and manipulation.
Common SQL Keywords: SELECT, FROM, WHERE, JOIN, and aggregate functions such as COUNT, SUM, and AVG.
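The ideas above can be tried with SQLite, which ships with Python's standard library. This is a minimal sketch: the students and grades tables, their columns, and the rows are all hypothetical. The two tables are linked by a key (grades.student_id refers to students.id), so each student's name is stored only once. GROUP BY and ORDER BY are not in the keyword list above; they are added here so the aggregate AVG is computed per student and the output order is predictable.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE grades (
        student_id INTEGER REFERENCES students(id),
        course TEXT,
        score INTEGER
    );
    INSERT INTO students VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO grades VALUES (1, 'Math', 90), (1, 'Art', 80), (2, 'Math', 70);
""")

# SELECT chooses columns, FROM and JOIN combine the tables through the key,
# WHERE filters rows, and AVG aggregates the scores for each student.
rows = conn.execute("""
    SELECT students.name, AVG(grades.score) AS avg_score
    FROM students
    JOIN grades ON grades.student_id = students.id
    WHERE grades.score >= 60
    GROUP BY students.name
    ORDER BY students.name
""").fetchall()

print(rows)
conn.close()
```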
Big Data Characteristics
Definition: Data sets larger than typical consumer software can manage.
Features: Volume, velocity, and variety; data sets of this scale also make it possible for machines to learn from data.
Important Vocabulary
Atomic Transaction: All components must succeed for the transaction to be completed; if any component fails, none of its changes are applied.
Deadlock: Situation where two or more transactions each wait for a resource that another one holds, so none of them can proceed.
Simpson’s Paradox: A trend that appears within each individual group can weaken, disappear, or reverse when the groups are combined.
Write-ahead Logging: Method to ensure data consistency by logging changes before applying them.
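An atomic transaction can be sketched with Python's built-in sqlite3 module. Everything named here is an assumption for illustration: the accounts table, the balances, and the transfer helper are hypothetical. The CHECK constraint makes the second update fail when the sender lacks funds, and the rollback then undoes the first update as well, so the transfer happens completely or not at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
)
conn.commit()

def transfer(conn, sender, receiver, amount):
    """Move money between accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance.
        return False

# alice has only 100, so this transfer fails and bob's credit is rolled back.
ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

Crediting the receiver first is a deliberate choice in this sketch: it shows that a change already made inside the transaction is undone when a later step fails.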
Understanding the significance and application of these principles is crucial for effective data management and utilization.