Study Notes on Data Acquisition and Introduction to Data Science
Introduction to Data Science
Context: The course focuses on Data Acquisition and working with Structured Data.
Instructor Details: Presentation by Katie Baynes, NASA Earth Data Officer, to the National Academies Committee on Earth Science and Applications from Space on January 12, 2026.
Course Overview
Course Title: Data Acquisition I (DSCI 403/503)
Objectives: To develop skills in acquiring and working with various types of data.
Housekeeping Items
Questions from the previous session addressed.
Ensure JupyterHub accounts are active for usage during the course.
Course textbook is available online via the Mines library.
Open to questions regarding IPython, Jupyter, and the Python programming language.
Data Acquisition Overview
Primary and Secondary Sources of Data:
Primary Data:
Definition: First-hand data collected directly by the researcher.
Characteristics:
Real-time data collection.
Involves detailed processes for data gathering (e.g., surveys, observations).
Sources include personal interviews, experiments, questionnaires.
Cost: Generally expensive and time-consuming.
Secondary Data:
Definition: Data collected previously by others.
Characteristics:
Historical data collection.
Quick and easy access to data.
Sources include government publications, websites, research articles.
Cost: Generally economical, with a shorter collection time.
Data Types Covered
Structured Data
Unstructured Data
Data Ethics: Understanding the ethical implications involved in data acquisition and handling.
Data Formats
Text Files:
Formats:
Plain Text Files (.txt)
Comma-Separated Values (.csv)
JavaScript Object Notation (.json)
Database Formats:
Structured Query Language (SQL) Databases.
More complex structures: self-describing binary formats for multidimensional scientific data (NetCDF, HDF5).
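To make the text formats listed above concrete, here is a minimal sketch that writes the same two records as CSV and as JSON using only the Python standard library. The records and field names (flavor, scoops) are invented for illustration and do not come from the course materials.

```python
import csv
import io
import json

# Hypothetical records, invented for illustration only.
records = [
    {"flavor": "vanilla", "scoops": 12},
    {"flavor": "mint", "scoops": 7},
]

# CSV: one header line, then one line per record. Every value is stored as text.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["flavor", "scoops"], lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# JSON: a list of objects whose values keep their types (numbers stay numbers).
json_text = json.dumps(records)

# Reading both back shows the difference in how values are typed.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
json_rows = json.loads(json_text)

print(csv_text)
print(json_text)
print(type(csv_rows[0]["scoops"]).__name__)   # CSV value comes back as a string
print(type(json_rows[0]["scoops"]).__name__)  # JSON value comes back as an integer
```

The practical consequence is that a CSV reader has to infer or be told the column types, while JSON carries basic types in the file itself.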
Sources For Structured Data:
Public datasets from scikit-learn, Kaggle, etc.
Tools & Libraries Used in the Course
Jupyter Notebook: Interactive coding environment.
Python: Programming language used for data acquisition and analysis.
Pandas: Library for data manipulation and analysis.
SQL: Language for database management and queries.
Pandas Library Overview
Pandas Series:
One-dimensional labeled array capable of holding any data type.
Can be indexed, filtered, and manipulated.
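A short sketch of the Series operations named above (indexing, filtering, manipulation). The values and labels are made up for illustration.

```python
import pandas as pd

# Hypothetical daily sales figures, invented for illustration.
sales = pd.Series([120, 95, 143], index=["Mon", "Tue", "Wed"], name="cones_sold")

tuesday = sales["Tue"]          # label-based indexing returns a single value
busy_days = sales[sales > 100]  # boolean filtering keeps the matching labels
doubled = sales * 2             # arithmetic is applied element by element

print(tuesday)
print(busy_days)
print(doubled)
```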
DataFrame:
Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
Can be thought of as a dictionary of Series objects.
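The "dictionary of Series" view can be sketched directly: build a DataFrame from a dict whose values are Series, then pull one column back out. Column names and values here are hypothetical.

```python
import pandas as pd

# Two hypothetical Series that share the same default integer index.
flavors = pd.Series(["vanilla", "mint", "mango"])
scoops = pd.Series([12, 7, 9])

# Each dict key becomes a column label; each Series becomes a column.
frame = pd.DataFrame({"flavor": flavors, "scoops": scoops})

# Selecting a single column returns a Series again.
column = frame["scoops"]
print(type(column).__name__)

# Size-mutable: a new column can be added after construction.
frame["price"] = [3.5, 4.0, 4.25]

# Heterogeneous: each column keeps its own data type.
print(frame.dtypes)
print(frame.shape)
```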
Examples of Data Loading Using Pandas
Data from Text Files:
Importing libraries:
import pandas as pd
from pandas import DataFrame
Reading a text file:
fileName = "ice_cream.txt"
iceCreamSales = pd.read_table(fileName, delimiter=",")
Data from CSV Files:
Example code:
fileName = "ice_cream.csv"
iceCreamSales = pd.read_csv(fileName)
Data from JSON Files:
Example code:
fileName = "ice_cream.json"
iceCreamSales = pd.read_json(fileName)
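The three loaders above can be tried without the course files. The sketch below first writes a small, made-up table to a temporary folder under the same file names, then reads it back with each function. The column names and values are assumptions for illustration; the real ice cream files may look different.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data standing in for the course's ice cream files.
sample = pd.DataFrame({"flavor": ["vanilla", "mint", "mango"], "sales": [310, 420, 385]})

with tempfile.TemporaryDirectory() as folder:
    folder = Path(folder)

    # Write the same table in each of the three formats.
    sample.to_csv(folder / "ice_cream.txt", index=False)
    sample.to_csv(folder / "ice_cream.csv", index=False)
    sample.to_json(folder / "ice_cream.json", orient="records")

    # read_table defaults to tab-separated input, so the comma must be given explicitly.
    fromText = pd.read_table(folder / "ice_cream.txt", delimiter=",")
    # read_csv assumes a comma separator by default.
    fromCsv = pd.read_csv(folder / "ice_cream.csv")
    # read_json parses a list of JSON objects into rows.
    fromJson = pd.read_json(folder / "ice_cream.json")

print(fromCsv)
print(fromText.equals(fromCsv))
print(list(fromJson.columns))
```

Note that read_table with delimiter="," and read_csv parse comma-separated text the same way; the two calls differ mainly in their default separator.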
Advanced Data Acquisition Techniques
Relational Databases:
The course introduces SQL to gather data stored in tables, where each row represents an individual record and each column represents an attribute of that record.
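To make the table idea concrete, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The universities table layout (name, city, state) and its two rows are assumptions for illustration, not the course's actual database. The sketch filters on city in the same way as the example query in these notes.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
connection = sqlite3.connect(":memory:")

# Hypothetical table layout: the columns are the attributes of each record.
connection.execute("CREATE TABLE universities (name TEXT, city TEXT, state TEXT)")
connection.executemany(
    "INSERT INTO universities VALUES (?, ?, ?)",
    [
        ("Colorado School of Mines", "Golden", "CO"),
        ("University of Colorado Boulder", "Boulder", "CO"),
    ],
)

# Select every column of the rows whose city matches; "?" is a bound parameter.
cursor = connection.execute("SELECT * FROM universities WHERE city = ?", ("Golden",))
columnNames = [description[0] for description in cursor.description]
rows = cursor.fetchall()
connection.close()

print(columnNames)
for row in rows:
    # Pair each value with its column name to show one record.
    print(dict(zip(columnNames, row)))
```

Passing the city as a bound parameter rather than pasting it into the query string is a common habit that avoids quoting mistakes.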
Example SQL query:
SELECT * FROM universities WHERE city = 'Golden';
scikit-learn:
Focuses on machine learning functionalities, including dataset handling via:
from sklearn.datasets import load_diabetes
diabetesRaw = load_diabetes()
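load_diabetes returns a dictionary-like Bunch object rather than a DataFrame. The sketch below assumes scikit-learn 0.23 or later with pandas installed, and uses the as_frame=True option so that the features and target arrive as pandas objects.

```python
from sklearn.datasets import load_diabetes

# as_frame=True asks for pandas objects instead of NumPy arrays.
diabetesRaw = load_diabetes(as_frame=True)

features = diabetesRaw.data    # DataFrame with one column per predictor
target = diabetesRaw.target    # Series holding the value to be predicted
combined = diabetesRaw.frame   # features and target together in one DataFrame

print(features.shape)
print(list(features.columns))
print(combined.columns[-1])

# The Bunch also carries a plain-text description of the dataset.
print(diabetesRaw.DESCR[:200])
```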
Course Applications
Kaggle Competitions:
Platform for data science exploration, featuring datasets and collaboration.
The course uses the Predict Pet Adoption Status dataset for practical learning and skill application:
fileName = "pet_adoption_data.csv"
petData = pd.read_csv(fileName)
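Once a CSV like this is loaded, a first look usually covers size, column types, and missing values. The sketch below uses a tiny in-memory stand-in so it runs without the Kaggle file. The columns pet_type, age_months, and adopted are hypothetical and are not claimed to match the real dataset's schema.

```python
import io

import pandas as pd

# Stand-in for the downloaded file; hypothetical columns and values.
csvText = """pet_type,age_months,adopted
Dog,14,1
Cat,,0
Dog,30,1
"""
petData = pd.read_csv(io.StringIO(csvText))

print(petData.head())   # first few rows
print(petData.shape)    # (number of rows, number of columns)
print(petData.dtypes)   # inferred type of each column

# Count missing values per column; the empty age field is read as NaN.
missingCounts = petData.isna().sum()
print(missingCounts)

# Share of adopted pets within each pet type.
adoptionRate = petData.groupby("pet_type")["adopted"].mean()
print(adoptionRate)
```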
Conclusion
Students should be familiar with various data acquisition methods, tools, and ethical considerations related to data handling and analysis in the context of data science.