Data Sources & Data Handling – Comprehensive Lecture Notes

Data Sources: Definition & Importance
  • Definition: A data source is the origin or location where data is created, stored, processed, and retrieved for analytical, reporting, or operational purposes. It can range from simple files to complex database systems or real-time streams.

  • Role in Data Science - Data sources serve as the foundational bedrock, supplying the essential raw material for every analytical workflow, machine learning model, and data-driven decision. The integrity and richness of these sources directly impact the output.

    • Quality of insights ⇒ quality of the underlying data: High-quality data from reliable sources leads to accurate, actionable insights, while poor-quality data can lead to erroneous conclusions and flawed strategies.

  • Why Understanding Sources Matters - Critical for ensuring the authenticity, reliability, and validity of data before use. This understanding enables:

    • Reliability & validity checks: Identifying the provenance of data helps assess its trustworthiness, completeness, and accuracy.

    • Awareness of potential biases/limitations: Different collection methods or storage systems may introduce specific biases (e.g., sampling bias in surveys, data entry errors in legacy systems) or limitations (e.g., lack of granularity).

    • Guides correct access method (e.g., direct database connection, RESTful API call, web-scraping script, file system read) and appropriate file-format handling (e.g., parsing CSV, deserializing JSON, transforming XML for structured use). This includes understanding security protocols and authentication mechanisms.

Relational (SQL) Databases
  • Core Idea: Data is organized into predefined, structured tables (rows representing records, columns representing attributes) that are interconnected through relationships. Data is managed and queried using Structured Query Language (SQL).

  • Key Properties: Relational databases are designed for transactional integrity and consistency.

    • ACID transactions (Atomicity, Consistency, Isolation, Durability):

      • Atomicity: Ensures that all operations within a transaction are completed successfully; if any part fails, the entire transaction is rolled back.

      • Consistency: Guarantees that a transaction brings the database from one valid state to another, maintaining all predefined rules and constraints.

      • Isolation: Ensures that concurrent transactions execute independently without interfering with each other, producing the same result as if they were executed sequentially.

      • Durability: Guarantees that once a transaction has been committed, it will remain permanent, even in the event of system failures.

    • Supports complex joins & analytics: The structured nature allows for efficient querying across multiple tables using SQL JOIN operations, enabling sophisticated data analysis and aggregation.

    • High integrity: Widely used in mission-critical systems such as Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), and financial systems, where data accuracy and transactional consistency are paramount.
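The atomicity property above can be demonstrated directly with Python's built-in sqlite3 module. This is a minimal sketch: the Accounts table and its balances are hypothetical, and the deliberate exception stands in for any mid-transaction failure.

```python
import sqlite3

# In-memory database for illustration; table and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Accounts (AccountID INT PRIMARY KEY, Balance INT NOT NULL)")
conn.execute("INSERT INTO Accounts VALUES (1, 100), (2, 50)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE Accounts SET Balance = Balance - 80 WHERE AccountID = 1")
        # Simulate a failure mid-transfer: the first UPDATE must not persist alone.
        raise RuntimeError("transfer interrupted")
except RuntimeError:
    pass

balance = conn.execute("SELECT Balance FROM Accounts WHERE AccountID = 1").fetchone()[0]
print(balance)  # still 100: the partial update was rolled back atomically
```

Because the exception aborts the transaction, the debit never becomes visible: the database returns to its last valid state, which is exactly what Atomicity and Consistency promise.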

  • Schema Concepts & Keys - The schema defines the structure of the database, including tables, columns, data types, and relationships.

    • Primary Key (PK): A column or set of columns that uniquely identifies each row (record) in a table. It cannot contain NULL values and must be unique.

    • Candidate Key: A column or set of columns that could potentially serve as a Primary Key because it uniquely identifies rows.

    • Alternate Key: Any candidate key that is not chosen to be the Primary Key.

    • Foreign Key (FK): A column or set of columns in one table that refers to the Primary Key in another table. It establishes a link between the two tables and enforces referential integrity, ensuring relationships between related tables remain consistent.

    • Super Key: A set of one or more attributes that, taken collectively, can uniquely identify a record in a table. A candidate key is a minimal super key, and the Primary Key is the candidate key chosen for the table.

  • Illustrative Table Snippets (Employees, Customers, Orders) show PK/FK links and multiple keys. These demonstrate how relational integrity is maintained by linking related entities.

  • Fundamental SQL Commands - These are the backbone of interacting with relational databases, often referred to as CRUD operations.

    • SELECT – Used to retrieve data records from one or more tables based on specified criteria.

    • INSERT INTO – Used to add new data rows into a table.

    • UPDATE – Used to modify existing data records in a table.

    • DELETE – Used to remove existing data records from a table.

    • Other important commands include JOIN (combining rows from two or more tables), GROUP BY (aggregating data), and ORDER BY (sorting results).

  • Code Examples (slide code reproduced):

CREATE TABLE Employees (
  EmployeeID INT PRIMARY KEY,
  FirstName  VARCHAR(255) NOT NULL,
  LastName   VARCHAR(255) NOT NULL,
  BirthDate  DATE
);
INSERT INTO Employees (EmployeeID, FirstName, LastName, BirthDate)
VALUES (1,'John','Doe','1980-01-22');
SELECT * FROM Employees WHERE BirthDate < '1990-01-01';
UPDATE Employees SET LastName='Smith' WHERE EmployeeID=1;
DELETE FROM Employees WHERE EmployeeID=1;
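The slide SQL above can also be run end-to-end from Python via the built-in sqlite3 module. The sketch below uses a made-up two-table schema to show a FOREIGN KEY link plus the JOIN, GROUP BY, and ORDER BY commands mentioned earlier.

```python
import sqlite3

# Hypothetical two-table schema illustrating a PK/FK link and a JOIN.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE Customers (
  CustomerID INT PRIMARY KEY,
  Name       VARCHAR(255) NOT NULL
);
CREATE TABLE Orders (
  OrderID    INT PRIMARY KEY,
  CustomerID INT NOT NULL,
  Amount     INT NOT NULL,
  FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);
INSERT INTO Customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO Orders VALUES (10, 1, 250), (11, 1, 120), (12, 2, 90);
""")

# JOIN + GROUP BY + ORDER BY: total order amount per customer.
rows = conn.execute("""
SELECT c.Name, SUM(o.Amount) AS Total
FROM Customers c
JOIN Orders o ON o.CustomerID = c.CustomerID
GROUP BY c.Name
ORDER BY Total DESC
""").fetchall()
print(rows)  # [('Ada', 370), ('Grace', 90)]
```

With the FK enabled, inserting an Order for a nonexistent CustomerID would raise an integrity error, which is the referential-integrity guarantee described under Foreign Key above.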

Non-Relational (NoSQL) Databases
  • Key Variants: NoSQL databases offer diverse data models, each suited for different use cases.

    • Document Databases: Store data in flexible, semi-structured document formats (e.g., JSON, BSON), often used for content management and catalogs (e.g., MongoDB).

    • Key–Value Stores: The simplest NoSQL models, storing data as a collection of key-value pairs, ideal for caching and session management (e.g., Redis, DynamoDB).

    • Wide-Column Stores: Organize data into tables, rows, and dynamic columns, optimized for large datasets and time-series data (e.g., Cassandra, HBase).

    • Graph Databases: Use nodes, edges, and properties to represent and store relationships between data points, excellent for social networks, recommendation engines, and fraud detection (e.g., Neo4j).

  • Advantages - Designed for modern application demands requiring flexibility and massive scale.

    • Flexible schema: Unlike rigid relational databases, NoSQL databases can handle evolving, unstructured, or semi-structured data without requiring a predefined schema, making them adaptable to rapid changes in data models.

    • Horizontal scalability: Can scale out by adding more commodity servers to a distributed cluster, making them highly suitable for managing big data volumes and accommodating real-time web and mobile applications with high traffic.

    • Ideal for content management, social media, IoT, analytics: Their adaptability and scalability make them perfect for applications generating vast amounts of varied data, such as user profiles, sensor data, or online content.
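The flexible-schema idea can be sketched with plain Python dictionaries standing in for documents in a collection (the field names below are made up, and a real document store such as MongoDB would add indexing and a query language on top):

```python
import json

# Toy illustration of the document model: records in one "collection" need not
# share a schema, unlike rows in a relational table.
users = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"_id": 2, "name": "Grace", "interests": ["compilers", "COBOL"],
     "address": {"city": "Arlington"}},  # nested document; no schema migration needed
]

# Query by a field that only some documents have.
with_interests = [u["name"] for u in users if "interests" in u]
print(with_interests)  # ['Grace']

# Documents serialize directly to JSON, the wire format most document stores use.
print(json.dumps(users[0]))
```

Note that adding the nested address field required no ALTER TABLE equivalent; that is the schema flexibility described above.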

Common Data File Formats
  • CSV (Comma-Separated Values) - A plaintext file format storing tabular data where each line is a data record, and each field within the record is separated by a comma (or another delimiter).

    • Simple, flat text table: Extremely straightforward to create, read, and understand.

    • Easy import/export: Universally supported by almost all data processing software (e.g., Excel, Google Sheets, databases, programming languages), making it a common interchange format.

    • Limited to non-hierarchical data; no metadata: Cannot inherently represent complex, nested data structures or include metadata (like data types, relationships) within the file itself, relying on external interpretation.
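A minimal round trip with Python's standard csv module illustrates both the simplicity and the "no metadata" limitation (the table content below is made up):

```python
import csv
import io

# A small in-memory CSV file; every line is one record, fields comma-separated.
text = "name,dept,salary\nAda,Eng,120000\nGrace,Research,135000\n"
rows = list(csv.DictReader(io.StringIO(text)))

print(rows[0]["name"])    # Ada
print(rows[1]["salary"])  # '135000' -- CSV carries no types; every field is a string
```

The salary arrives as a string, not a number: because the format has no embedded metadata, type interpretation is left entirely to the consumer.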

  • JSON (JavaScript Object Notation) - A lightweight, human-readable, and machine-parsable data interchange format that represents data as nested key–value pairs and ordered lists of values.

    • Human-readable, nested key–value pairs: Its structure closely mimics programming language objects, making it intuitive for developers.

    • Ubiquitous in web APIs, config files, and storing complex objects: The standard format for data exchange on the web due to its flexibility in representing complex, hierarchical data structures and objects.
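The nested key–value structure maps directly onto native objects, which is what makes JSON so convenient in code. A small sketch with Python's standard json module (the payload contents are illustrative):

```python
import json

# A nested JSON document, as a web API might return it.
payload = '{"user": {"id": 7, "name": "Ada"}, "tags": ["api", "demo"], "active": true}'

obj = json.loads(payload)      # deserialize: JSON text -> dicts, lists, scalars
print(obj["user"]["name"])     # Ada -- nested objects become nested dicts
print(obj["tags"][1])          # demo -- ordered lists become Python lists

round_trip = json.dumps(obj)   # serialize back to a JSON string
```

Deserializing and re-serializing is lossless for standard JSON types, which is why the format works equally well for API responses, configuration files, and document storage.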

  • XML (eXtensible Markup Language) - A markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. It emphasizes structured data and content rather than presentation.

    • Markup language supporting hierarchical structure, metadata, namespaces: Uses tags to define elements and attributes, allowing for highly complex, nested, and self-describing data with extensive metadata capabilities through DTDs or XML Schemas.

    • Historically used in enterprise (SOAP, config): Once a dominant format in enterprise application integration (e.g., SOAP web services) and for configuration files, though JSON has largely surpassed it for simpler web-based interactions.
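XML's hierarchy and element-level metadata (attributes) can be seen in a short sketch using Python's standard xml.etree.ElementTree module; the document below is invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hierarchical markup: elements nest, and attributes carry per-element metadata.
doc = """
<employees>
  <employee id="1"><name>Ada</name><dept>Eng</dept></employee>
  <employee id="2"><name>Grace</name><dept>Research</dept></employee>
</employees>
"""

root = ET.fromstring(doc)
names = [e.findtext("name") for e in root.findall("employee")]
print(names)                            # ['Ada', 'Grace']
print(root.find("employee").get("id"))  # attribute access -> '1'
```

Unlike CSV, the structure and the per-record metadata travel inside the file itself; a DTD or XML Schema could additionally validate that every employee element has exactly these children.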

  • Format Selection Summary - Choosing the right format depends on the data's complexity and its intended use.

    • CSV → Simple tabular datasets, flat data, universal compatibility for basic data.

    • JSON → Nested/web data interchange, APIs, configuration files, and flexible document storage.

    • XML → Complex documents demanding rich metadata, highly structured content validation, and enterprise system integration (legacy).

Structured vs Unstructured Data
  • Structured - Data organized into a fixed format, typically stored in tabular form within relational databases (SQL) where each field has a defined type and length. It adheres to a predefined schema.

    • Rigid schema; SQL tables: Data must fit into a pre-established data model, making it highly organized and easy to query with SQL.

    • Supports complex queries/transactions: Ideal for analytical processing, reporting, and transactional systems that require high data integrity and consistency.

    • Examples: Customer names and addresses in a CRM system, bank transaction records (date, amount, account number), inventory levels in an ERP system.

  • Unstructured - Data that does not have a predefined data model or schema. It does not fit neatly into traditional rows and columns and often contains free-form text or multimedia.

    • No predefined model; stored in NoSQL/object storage: Stored typically in NoSQL databases, data lakes, or object storage systems, offering flexibility but requiring more advanced techniques for analysis.

    • Examples: Images and videos (e.g., security footage, medical scans), social media posts (text, emojis, hashtags), emails, audio recordings, text documents (e.g., legal contracts, research papers), entire web pages.

Primary vs Secondary Data
  • Primary Data - Original data collected directly by the researcher or organization specifically for the current research objective or project. It is fresh, precise, and tailored to specific needs.

    • Methods: Surveys (online, phone, in-person), controlled experiments (lab, field), interviews (structured, semi-structured, unstructured), focus groups, observations, sensor data collection.

    • Advantages: High relevance to the specific research question, greater control over data quality and collection methods, up-to-date and specific insights.

    • Disadvantages: Can be time-consuming, expensive, and require significant effort and resources to collect.

  • Secondary Data - Data that has been previously collected, compiled, and published by someone else for purposes other than the current research. It is readily available.

    • Sources: Government census data, published academic papers, historical records, organizational reports, publicly available datasets (e.g., from Kaggle), market research reports by third-party firms, news articles.

    • Advantages: Cost-effective, readily available, saves time and effort, can provide broader context or historical trends.

    • Disadvantages: May not be specifically tailored to the current research question, potential for outdated or irrelevant information, lack of control over data quality or collection methodology, potential biases from original collectors.

APIs in Data Science
  • Definition: An Application Programming Interface (API) is a set of defined rules, protocols, and tools for building software applications. It exposes specific functionalities or data from one software system, allowing other software components to communicate and interact with it without needing direct database access or understanding internal complexities.

  • Benefits - APIs streamline data access and integration, making them indispensable for data science workflows.

    • Structured data retrieval: Enables programmatic access to vast amounts of structured data from external services like social media platforms, weather forecasting services, financial markets, or public health organizations. Data is typically returned in predictable formats like JSON or XML.

    • Automation & scheduling: Facilitates the automation of data collection, updates, and triggering of tasks, significantly reducing manual effort and ensuring data freshness.

    • Real-time feeds for dashboards, streaming analytics: Many APIs provide real-time data streams (e.g., financial tickers, IoT sensor data), crucial for dynamic dashboards, fraud detection, or immediate operational monitoring.

    • Modularity: Allows data scientists to treat external services as