Data Science

Basic Workflow

  1. ASK a question

  2. PLAN an analysis or model to answer the question

  3. COLLECT the data

  4. CLEAN and organize data

  5. INTERPRET and communicate the results

Four important aspects of data science:

  1. Exploration - identifying patterns in data (visualization)

  2. Inference - using data to draw reliable conclusions about the world (statistics)

  3. Prediction - making informed guesses about unobserved data (machine learning)

  4. Communication - how you present the results and what story you craft from them. You would communicate differently depending on who you’re talking to.

Cause and Effect

Experiments and Observational studies:

Y = Response variable: The primary variable of interest

X = Explanatory variable(s) or factors: Other information about each subject that may explain why the subjects differ with respect to the response.

Data: Known cases where (X,Y) are observed

The objective is to use the data to predict new observations of Y from X. To do this, the data may be used to fit a model relating X and Y.
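
As a minimal sketch of "fit a model relating X and Y, then predict a new Y", the snippet below fits a straight line by ordinary least squares. The straight-line form, the function names fit_line and predict, and the data values are all assumptions made up for illustration; the notes do not say which model to use.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x (one explanatory variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of (x, y) divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x_new):
    """Predict Y for a new X using a fitted (intercept, slope) pair."""
    intercept, slope = model
    return intercept + slope * x_new


# Made-up known cases where (X, Y) are both observed.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

model = fit_line(xs, ys)
intercept, slope = model
y_hat = predict(model, 6.0)  # prediction for an X value not in the data
print(intercept, slope, y_hat)
```

The fitted line is only as trustworthy as the data behind it, and predicting far outside the observed range of X is extrapolation.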

This data can come from either an experiment or an observational study:

  • Observational: the researcher does not control the X’s and can only observe the values that occur

  • Experiment: the researcher can set the X’s (or the number of X’s) to whatever values are wanted

Confounding variable: A variable that is related to both X and Y. We say that the effect X has on Y cannot be separated from the effect of the confounding variable.

The presence of possible confounding variables makes drawing a conclusion of causality very difficult. Randomly assigning X in an experiment gives much greater control, because random assignment tends to balance confounding variables across the groups being compared.
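
A small simulation can make the confounding problem concrete. In this sketch (all numbers and names are hypothetical), a confounder Z raises Y and also makes X = 1 more likely, while X itself has no effect on Y at all. Comparing mean Y between the X = 1 and X = 0 groups gives a large difference in the observational setting, and roughly zero once X is assigned at random.

```python
import random


def mean(values):
    return sum(values) / len(values)


def group_difference(rows):
    """Mean of Y when X = 1 minus mean of Y when X = 0."""
    treated = [y for x, y in rows if x == 1]
    control = [y for x, y in rows if x == 0]
    return mean(treated) - mean(control)


def simulate(n, randomized, rng):
    """Generate n (x, y) pairs; X never affects Y, only the confounder Z does."""
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0  # confounder
        if randomized:
            # Experiment: X is set by a coin flip, independent of Z.
            x = 1 if rng.random() < 0.5 else 0
        else:
            # Observational: Z makes X = 1 much more likely.
            x = 1 if rng.random() < (0.9 if z == 1 else 0.1) else 0
        y = 5.0 * z + rng.gauss(0.0, 1.0)  # Y depends on Z only
        rows.append((x, y))
    return rows


rng = random.Random(0)  # fixed seed so the run is repeatable
observational_diff = group_difference(simulate(20000, False, rng))
randomized_diff = group_difference(simulate(20000, True, rng))
print(observational_diff, randomized_diff)
```

The observational difference reflects Z, not X, which is exactly why it cannot be read as a causal effect.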

Interacting variable: A variable that changes the way the explanatory variable affects the response.
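
One way to see an interaction is to check whether the slope of Y on X changes across levels of a second variable. The sketch below uses a made-up, noise-free response function (the name response, the variable W, and all coefficients are assumptions for illustration) in which each unit of X adds 2 to Y when W = 0 and 5 when W = 1.

```python
def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def response(x, w):
    """Hypothetical response: the x * w term is the interaction."""
    return 1.0 + 2.0 * x + 3.0 * x * w


xs = [0.0, 1.0, 2.0, 3.0]
slope_w0 = slope(xs, [response(x, 0) for x in xs])  # effect of X when W = 0
slope_w1 = slope(xs, [response(x, 1) for x in xs])  # effect of X when W = 1
print(slope_w0, slope_w1)
```

If W did not interact with X, the two slopes would be equal; here W changes how strongly X affects Y.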

Meta-analysis combines results from multiple separate studies.
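
The notes do not say how the results are combined. One common approach, assumed here purely for illustration, is a fixed-effect inverse-variance weighted average: each study's estimate is weighted by 1 / SE^2, so more precise studies count for more. The study estimates and standard errors below are made up.

```python
import math


def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance weighted average of study estimates and its standard error."""
    weights = [1.0 / (se ** 2) for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Made-up effect estimates and standard errors from three separate studies.
estimates = [0.30, 0.10, 0.20]
std_errors = [0.10, 0.20, 0.10]

pooled, pooled_se = pool_fixed_effect(estimates, std_errors)
print(pooled, pooled_se)
```

This sketch assumes every study estimates the same underlying effect; when studies differ in populations or methods, other pooling models may be more appropriate.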