Data Science Review

38 Terms

1

Data Science

A concept that unifies statistics, data analysis, machine learning, and their related methods to understand and analyze actual phenomena with data. Data science is the discipline of making data useful.

2

Statistics

The science of changing your mind under uncertainty.

3

Machine Learning

Making labels using examples instead of explicit instructions.

4

Data Mining/Analytics

The process of finding data to make decisions, including descriptive analytics and exploratory data analysis.

5

Data Scientist

A person skilled in statistics and programming, possessing knowledge in data collection and application.

6

Categorical Variables

Variables that represent groups or categories, e.g., color.

7

Quantitative Variables

Variables that represent amounts or quantities, e.g., age, height.

8

Data Preparation

The process of converting raw data into a usable format for analysis.

9

Labeling

The process of ascribing meaning/categories to data, essential for model training.

10

Annotation

Adding explanatory notes to data, typically done by external or internal teams.

11

Human in the Loop

A system supervised by humans to improve data collection and application.

12

Brute Force Collection

A method of data acquisition focused solely on collecting extensive amounts of data.

13

Datafiction

The belief that data collection started with the internet and computers; suggests data science involves more than just statistics.

14

Confounding Variables

Variables that can affect the outcome of an analysis but are not directly measured, though they may be measured indirectly.

15

Bias

Systematic errors that can skew the results of data analysis, including selection and sampling ___.

16

Histogram

A graphical representation showing the distribution of one quantitative variable.

17

Bar Chart

A chart representing the frequency of discrete categories.

18

Scatter Plot

A plot used to show the relationship between two quantitative variables.

19

Pie Chart

A circular chart that shows the relative frequency of discrete categories.

20

Data Ethics (Five C's)

Consent, Clarity, Consistency (and Trust), Control, and Consequences.

21

The formula for the total number of hospitals that list a hip/knee cost:

=COUNT(N:N)
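
A spreadsheet COUNT only counts cells that contain numbers, which is why it works as a tally of hospitals that actually list a cost. The sketch below mimics that behavior in Python. The sample values and the names `hip_knee_cost` and `count_numeric` are made up for illustration and are not from the course data.

```python
# Hypothetical column N: one hip/knee cost per hospital.
# Blank cells (None) and text entries mean the hospital lists no cost.
hip_knee_cost = [21500.0, None, 18900.0, "Not Available", 24300.0]


def count_numeric(cells):
    """Mimic spreadsheet COUNT: count only the cells that hold numbers."""
    return sum(
        isinstance(cell, (int, float)) and not isinstance(cell, bool)
        for cell in cells
    )


hospitals_listing_cost = count_numeric(hip_knee_cost)
print(hospitals_listing_cost)  # 3
```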

22

If your histogram looks like this, your problem is

You have too many bins/inputs.
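
The picture for this card is not available here. Assuming it showed a spiky histogram with many short or empty bars, the sketch below illustrates the effect with made-up data: the same 50 values are counted into 5 bins and then into 200 bins. The function name `histogram_counts` and the sample are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
# Hypothetical sample: 50 made-up values between 0 and 100.
values = [random.uniform(0, 100) for _ in range(50)]


def histogram_counts(data, n_bins, low=0.0, high=100.0):
    """Count how many points fall into each of n_bins equal-width bins."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in data:
        # Clamp the index so a value equal to `high` lands in the last bin.
        index = min(int((x - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


few_bins = histogram_counts(values, 5)
many_bins = histogram_counts(values, 200)

print("5 bins:", few_bins)
print("200 bins, tallest bar:", max(many_bins))
print("200 bins, empty bins:", many_bins.count(0))
```

With 200 bins and only 50 points, at least 150 bins must be empty, so the shape of the distribution is lost in the noise.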

23

Fill in the blank: ___ does not imply ___

Correlation and causation

24

Known as the father of epidemiology, founder of the cholera experiment

John Snow

25

A study in which neither the experiment designers nor the participants know who receives the treatment

Double blind.

26

Three domains

Computer science, domain experience and statistics.

27

The formula for the average cost of better heart failure treatment:

=AVERAGEIF(K:K,"better",J:J)
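
As a rough Python equivalent of this conditional average, the sketch below keeps only the rows whose quality label is "better" and averages their costs. The rows are invented for illustration, and `average_if` is a hypothetical helper, not a library function.

```python
# Hypothetical rows: (heart failure quality from column K, cost from column J).
rows = [
    ("better", 10000.0),
    ("average", 12000.0),
    ("better", 14000.0),
    ("worse", 9000.0),
]


def average_if(pairs, wanted):
    """Mimic AVERAGEIF: average the costs whose label matches `wanted`."""
    # Spreadsheet text criteria ignore case, so compare in lowercase.
    matching = [cost for label, cost in pairs if label.lower() == wanted.lower()]
    if not matching:
        raise ValueError("no rows match the criterion")
    return sum(matching) / len(matching)


average_better_cost = average_if(rows, "better")
print(average_better_cost)  # 12000.0
```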

28

A type of data collection where the sole purpose of an activity is data acquisition

Brute force

29

The formula for the number of hospitals with better heart attack and better pneumonia quality:

=COUNTIFS(I:I,"better",M:M,"better")
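
The key point of a multi-condition count is that both criteria must hold on the same row. A minimal Python sketch of that idea, using invented ratings and the hypothetical name `both_better`:

```python
# Hypothetical rows: (heart attack quality, pneumonia quality) per hospital.
ratings = [
    ("better", "better"),
    ("better", "average"),
    ("average", "better"),
    ("better", "better"),
]

# Count a hospital only when BOTH ratings on its row are "better".
both_better = sum(
    1
    for heart_attack, pneumonia in ratings
    if heart_attack == "better" and pneumonia == "better"
)
print(both_better)  # 2
```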

30

This technique lets us isolate one variable in order to establish causality and avoid interference from other variables

Randomization
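
One common way to randomize is to shuffle the participants and split them into a treatment group and a control group, so that other variables are spread across both groups by chance rather than by choice. This is a minimal sketch under that assumption. The participant names and the `randomize` helper are hypothetical.

```python
import random


def randomize(participants, seed=None):
    """Shuffle participants and split them into treatment and control groups."""
    rng = random.Random(seed)      # local generator, optionally seeded
    shuffled = list(participants)  # copy so the input list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


people = [f"participant_{i}" for i in range(10)]
treatment, control = randomize(people, seed=42)
print("treatment:", treatment)
print("control:", control)
```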

31

What is plotted on the vertical axis of a histogram?

Frequency of points, or counts

32

The formula for the average cost of a heart attack:

=AVERAGE(H:H)

33

The method of data acquisition used by the robotics company Boston Dynamics to train its robots:

Human in the loop

34

Public Data

Data available on the internet; the largest source of data.

Ex. Using Google Street View to classify cars, then infer voting habits.

35

Data Preparation

Filtering out impurities, labeling, annotating data, getting users to generate labels, and using tools to speed up annotation.

36

Filtering out impurities

Sorted manually, or with automated tools that separate but do not discard.

37

Labeling

The most time-consuming part. Allows you to ascribe meaning/categories to the data: what you want your algorithm to support in the wild.

38

Annotating Data

Done by external annotation service providers or an internal annotation team.