Study Notes on Data Acquisition and Introduction to Data Science

Introduction to Data Science

  • Context: The course focuses on Data Acquisition and working with Structured Data.

  • Instructor Details: Presentation by Katie Baynes, NASA Earth Data Officer, to the National Academies Committee on Earth Science and Applications from Space on January 12, 2026.

Course Overview

  • Course Title: Data Acquisition I (DSCI 403/503)

  • Objectives: To develop skills in acquiring and working with various types of data.

Housekeeping Items

  • Questions from the previous session addressed.

  • Ensure JupyterHub accounts are active for usage during the course.

  • Course textbook is available online via the Mines library.

  • Open to questions regarding IPython, Jupyter, and the Python programming language.

Data Acquisition Overview

  • Primary and Secondary Sources of Data:

    • Primary Data:

    • Definition: First-hand data collected directly by the researcher.

    • Characteristics:

      • Real-time data collection.

      • Involves detailed processes for data gathering (e.g., surveys, observations).

      • Sources include personal interviews, experiments, questionnaires.

      • Cost: Generally expensive and time-consuming.

    • Secondary Data:

    • Definition: Data collected previously by others.

    • Characteristics:

      • Historical data collection.

      • Quick and easy access to data.

      • Sources include government publications, websites, research articles.

      • Cost: Generally economical, with a shorter collection time.

Data Types Covered

  • Structured Data

  • Unstructured Data

  • Data Ethics: Understanding the ethical implications involved in data acquisition and handling.

Data Formats

  • Text Files:

    • Formats:

    • Plain Text Files (.txt)

    • Comma-Separated Values (.csv)

    • JavaScript Object Notation (.json)

  • Database Formats:

    • Structured Query Language (SQL) Databases.

    • More complex, self-describing file formats for multidimensional data (NetCDF, HDF5).

  • Sources For Structured Data:

    • Public datasets from scikit-learn, Kaggle, etc.
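
To make the difference between the text formats listed above concrete, here is a minimal sketch that writes the same records as CSV and as JSON using only the Python standard library. The records and the field names (flavor, scoops_sold) are made up for illustration and do not come from the course materials.

```python
import csv
import io
import json

# Hypothetical records, used only for illustration.
records = [
    {"flavor": "vanilla", "scoops_sold": 120},
    {"flavor": "chocolate", "scoops_sold": 95},
]

# CSV: one header row, then one line per record.
csv_buffer = io.StringIO()
writer = csv.DictWriter(
    csv_buffer, fieldnames=["flavor", "scoops_sold"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# JSON: field names are repeated in every record, and numbers keep their type.
json_text = json.dumps(records, indent=2)

# Reading both back shows the difference in how values are typed.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
json_rows = json.loads(json_text)

print(csv_text)
print(json_text)
print(type(csv_rows[0]["scoops_sold"]), type(json_rows[0]["scoops_sold"]))
```

With the csv module every value comes back as a string, while JSON preserves the integer. This is one reason a loader such as Pandas has to infer column types when it reads CSV.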

Tools & Libraries Used in the Course

  • Jupyter Notebook: Interactive coding environment.

  • Python: Programming language used for data acquisition and analysis.

  • Pandas: Library for data manipulation and analysis.

  • SQL: Language for database management and queries.

Pandas Library Overview

  • Pandas Series:

    • One-dimensional labeled array capable of holding any data type.

    • Can be indexed, filtered, and manipulated.

  • DataFrame:

    • Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).

    • Can be thought of as a dictionary of Series objects.
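
The two structures described above can be sketched as follows. The day labels and the numbers are invented for illustration; only the Series and DataFrame behavior is the point.

```python
import pandas as pd

# A Series: one-dimensional and labeled. The index labels here are made up.
sales = pd.Series([120, 95, 143], index=["Mon", "Tue", "Wed"], name="sales")

# Label-based indexing and boolean filtering.
tuesday_sales = sales["Tue"]
busy_days = sales[sales > 100]

# A DataFrame built from a dictionary of Series that share the same index.
temps = pd.Series([78.5, 71.0, 84.2], index=["Mon", "Tue", "Wed"], name="temp_f")
df = pd.DataFrame({"sales": sales, "temp_f": temps})

# Selecting a single column gives back a Series.
sales_column = df["sales"]

print(busy_days)
print(df)
```

Each column keeps its own data type (integers for sales, floats for temp_f), which is what "potentially heterogeneous" means for a DataFrame.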

Examples of Data Loading Using Pandas

  • Data from Text Files:

    • Importing libraries:

    import pandas as pd
    from pandas import DataFrame
    
    • Reading a text file:

    fileName = "ice_cream.txt"
    iceCreamSales = pd.read_table(fileName, delimiter=",")
    
  • Data from CSV Files:

    • Example code:

    fileName = "ice_cream.csv"
    iceCreamSales = pd.read_csv(fileName)
    
  • Data from JSON Files:

    • Example code:
    fileName = "ice_cream.json"
    iceCreamSales = pd.read_json(fileName)
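
The loaders above expect files on disk. The following sketch shows the same calls running against in-memory text so it can be tried without any files. The contents (flavor, scoops_sold) are assumed for illustration and are not the actual ice_cream files from the course.

```python
import io

import pandas as pd

# Hypothetical file contents, held in strings instead of files on disk.
csv_text = "flavor,scoops_sold\nvanilla,120\nchocolate,95\n"
json_text = (
    '[{"flavor": "vanilla", "scoops_sold": 120}, '
    '{"flavor": "chocolate", "scoops_sold": 95}]'
)

# read_csv defaults to a comma delimiter.
from_csv = pd.read_csv(io.StringIO(csv_text))

# read_table defaults to a tab delimiter, so the comma is given explicitly.
from_table = pd.read_table(io.StringIO(csv_text), delimiter=",")

# read_json accepts a list of records and builds one row per record.
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

Wrapping a string in io.StringIO gives Pandas a file-like object, which is a convenient way to experiment with small samples before pointing the same call at a real file name.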

Advanced Data Acquisition Techniques

  • Relational Databases:

    • The course introduces SQL to gather data stored in tables, where each row represents an individual record and each column represents an attribute of that record.

    • Example SQL query:

    SELECT * FROM universities WHERE city = 'Golden';
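
One way to run a query like the one above from Python is sketched here, assuming an in-memory SQLite database. The table layout (name, city) and the two rows are hypothetical stand-ins; the course database may look different.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database with a hypothetical universities table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE universities (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO universities (name, city) VALUES (?, ?)",
    [("Example University A", "Golden"), ("Example University B", "Boulder")],
)
conn.commit()

# Run the query and load the result into a DataFrame.
# The ? placeholder passes the city as a parameter instead of pasting it
# into the SQL string.
golden = pd.read_sql_query(
    "SELECT * FROM universities WHERE city = ?", conn, params=("Golden",)
)
conn.close()

print(golden)
```

Each row of the result is one record that matched the WHERE clause, and each column of the table becomes a DataFrame column.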
    
  • scikit-learn:

    • A machine learning library that also ships with sample datasets, which can be loaded via:

    from sklearn.datasets import load_diabetes
    diabetesRaw = load_diabetes()
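
The object returned by load_diabetes is a dictionary-like container, not a DataFrame. A minimal sketch of moving it into Pandas follows. The variable name diabetes and the column name target are choices made here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

diabetesRaw = load_diabetes()

# The feature matrix and the feature names are stored as separate attributes.
diabetes = pd.DataFrame(diabetesRaw.data, columns=diabetesRaw.feature_names)

# Add the response variable as one more column ("target" is our own label).
diabetes["target"] = diabetesRaw.target

print(diabetes.shape)
print(diabetes.head())
```

Keeping features and target in one DataFrame makes it easy to inspect the data with the same Pandas tools used for CSV and JSON files.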

Course Applications

  • Kaggle Competitions:

    • Platform for data science exploration, featuring datasets and collaboration.

    • Example using the Predict Pet Adoption Status dataset for hands-on practice:

    fileName = "pet_adoption_data.csv"
    petData = pd.read_csv(fileName)
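
After loading a dataset like this, a quick first look usually comes next. The sketch below uses a small in-memory stand-in, because the real file's columns are not listed in these notes. The column names (pet_type, age_months, adopted) and all values are hypothetical.

```python
import io

import pandas as pd

# Stand-in for pet_adoption_data.csv; columns and values are hypothetical.
sample_csv = (
    "pet_type,age_months,adopted\n"
    "Dog,24,1\n"
    "Cat,6,0\n"
    "Dog,,1\n"
)
petData = pd.read_csv(io.StringIO(sample_csv))

# How many rows and columns were loaded?
n_rows, n_cols = petData.shape

# How many values are missing in each column?
missing_per_column = petData.isna().sum()

# How are the outcome labels distributed?
adopted_counts = petData["adopted"].value_counts()

print(petData.head())
print(missing_per_column)
print(adopted_counts)
```

With the real file, replacing io.StringIO(sample_csv) with the file name and adjusting the column names should give the same kind of summary.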

Conclusion

  • Students should be familiar with various data acquisition methods, tools, and ethical considerations related to data handling and analysis in the context of data science.