Study Notes on Data Acquisition and Introduction to Data Science

Context: The course focuses on Data Acquisition and working with Structured Data.
Instructor Details: Presentation by Katie Baynes, NASA Earth Data Officer, to the National Academies Committee on Earth Science and Applications from Space on January 12, 2026.

Course Title: Data Acquisition I (DSCI 403/503)
Objectives: To develop skills in acquiring and working with various types of data.

Questions from the previous session addressed.
Ensure JupyterHub accounts are active for usage during the course.
Course textbook is available online via the Mines library.
Open to questions regarding IPython, Jupyter, and the Python programming language.

Primary and Secondary Sources of Data:
- Primary Data:
  - Definition: First-hand data collected directly by the researcher.
  - Characteristics:
    - Real-time data collection.
    - Involves detailed processes for data gathering (e.g., surveys, observations).
    - Sources include personal interviews, experiments, questionnaires.
    - Cost: Generally expensive and time-consuming.
- Secondary Data:
  - Definition: Data collected previously by others.
  - Characteristics:
    - Historical data collection.
    - Quick and easy access to data.
    - Sources include government publications, websites, research articles.
    - Cost: Economical and collection time is shorter.

Structured Data
Unstructured Data
Data Ethics: Understanding the ethical implications involved in data acquisition and handling.

Text Files:
- Formats:
  - Plain Text Files (.txt)
  - Comma-Separated Values (.csv)
  - JavaScript Object Notation (.json)
Database Formats:
- Structured Query Language (SQL) Databases.
- More complex structures (NetCDF, HDF5).
Sources For Structured Data:
- Public datasets from scikit-learn, Kaggle, etc.
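
To make the difference between the text formats listed above concrete, the following sketch writes the same two records as CSV and as JSON using only the Python standard library. The records and field names (`flavor`, `scoops`) are invented for illustration and do not come from any course dataset.

```python
import csv
import io
import json

# Hypothetical records, invented for illustration only.
records = [
    {"flavor": "vanilla", "scoops": 3},
    {"flavor": "mint", "scoops": 5},
]

# CSV: one header row, then one line per record. Values are stored as text.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["flavor", "scoops"])
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# JSON: field names repeat in every record, but numbers stay numbers.
json_text = json.dumps(records)

# Reading both back shows the type difference.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
json_rows = json.loads(json_text)

print(csv_rows[0]["scoops"], type(csv_rows[0]["scoops"]).__name__)
print(json_rows[0]["scoops"], type(json_rows[0]["scoops"]).__name__)
```

Note that the CSV reader hands back every value as a string, so numeric columns must be converted after loading, whereas JSON preserves the distinction between numbers and text.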

Pandas Series:
- One-dimensional labeled array capable of holding any data type.
- Can be indexed, filtered, and manipulated.
DataFrame:
- Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
- Can be thought of as a dictionary of Series objects.
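
A minimal sketch of the two structures described above. The values, labels, and column names are hypothetical and chosen only to show labeled indexing, filtering, and the "dictionary of Series" view of a DataFrame.

```python
import pandas as pd

# Hypothetical daily temperatures, labeled by day (values invented for illustration).
temps = pd.Series([61, 75, 88], index=["Mon", "Tue", "Wed"], name="temperature")

# A Series can be indexed by label and filtered with a boolean condition.
tuesday_temp = temps["Tue"]
hot_days = temps[temps > 70]

# A DataFrame built from a dictionary of Series that share the same index.
cones = pd.Series([20, 35, 52], index=temps.index, name="cones_sold")
sales = pd.DataFrame({"temperature": temps, "cones_sold": cones})

# Selecting one column returns a Series again.
cones_column = sales["cones_sold"]

print(sales)
```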

Data from Text Files:

import pandas as pd
from pandas import DataFrame

# read_table expects tab-separated text by default, so the comma delimiter is set explicitly.
fileName = "ice_cream.txt"
iceCreamSales = pd.read_table(fileName, delimiter=",")

Data from CSV Files:

fileName = "ice_cream.csv"
iceCreamSales = pd.read_csv(fileName)
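
The two loaders above differ mainly in their default separator. The sketch below, which reads from an in-memory string instead of a file so it runs without `ice_cream.csv`, suggests that `read_table` with a comma separator and `read_csv` produce the same DataFrame. The column names and values are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical file contents; the real ice_cream files may have different columns.
raw_text = "temperature,cones_sold\n61,20\n75,35\n88,52\n"

# read_csv assumes commas; read_table assumes tabs unless sep is given.
from_csv = pd.read_csv(io.StringIO(raw_text))
from_table = pd.read_table(io.StringIO(raw_text), sep=",")

same = from_csv.equals(from_table)
total_cones = int(from_csv["cones_sold"].sum())

print(same, total_cones)
```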

Data from JSON Files:
- Example code:

fileName = "ice_cream.json"
iceCreamSales = pd.read_json(fileName)
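
How `read_json` shapes its result depends on the layout of the JSON. As a sketch, assuming the file holds a list of records (one object per row), each object becomes a row and each key becomes a column. The JSON text here is invented and read from memory, not from `ice_cream.json`.

```python
import io

import pandas as pd

# Hypothetical JSON content laid out as a list of records.
json_text = (
    '[{"temperature": 61, "cones_sold": 20},'
    ' {"temperature": 75, "cones_sold": 35}]'
)

# Wrapping the text in StringIO lets read_json treat it like an open file.
ice_cream_sales = pd.read_json(io.StringIO(json_text))

print(ice_cream_sales)
```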

Relational Databases:
- The course introduces SQL for retrieving data stored in tables, where each row is an individual record and each column is an attribute of that record.
- Example SQL query:
```
SELECT * FROM universities WHERE city = 'Golden';
```
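
One way to try the query above from Python is the standard-library `sqlite3` module with an in-memory database. This is only a sketch: the table layout (`name`, `city`) and the two placeholder rows are assumptions made for illustration, not the course's actual database.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Assumed schema and placeholder rows (hypothetical, for illustration only).
conn.execute("CREATE TABLE universities (name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO universities VALUES (?, ?)",
    [("Example University A", "Golden"), ("Example University B", "Boulder")],
)

# Same query as above, with the city passed as a parameter instead of inlined.
rows = conn.execute(
    "SELECT * FROM universities WHERE city = ?", ("Golden",)
).fetchall()
conn.close()

print(rows)
```

Passing the city through a `?` placeholder rather than building the SQL string by hand avoids quoting mistakes and SQL injection.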
scikit-learn:
- A machine learning library that also ships small built-in datasets, loaded through functions in sklearn.datasets:

from sklearn.datasets import load_diabetes
diabetesRaw = load_diabetes()
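
By default `load_diabetes()` returns a dictionary-like object holding NumPy arrays, not a DataFrame. A minimal sketch of getting a pandas table instead, using the loader's `as_frame=True` option; the variable names are illustrative.

```python
from sklearn.datasets import load_diabetes

# as_frame=True asks the loader to return pandas objects instead of NumPy arrays.
diabetes = load_diabetes(as_frame=True)

# .frame holds the feature columns plus a final "target" column.
diabetes_df = diabetes.frame
feature_names = list(diabetes.feature_names)

print(diabetes_df.shape)
print(feature_names)
```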

Kaggle Competitions:
- Platform for data science exploration, featuring datasets and collaboration.
- Example: loading the Predict Pet Adoption Status dataset for hands-on practice:

fileName = "pet_adoption_data.csv"
petData = pd.read_csv(fileName)

Students should be familiar with various data acquisition methods, tools, and ethical considerations related to data handling and analysis in the context of data science.