Data Science

24 Terms

1. [T1] Why must a scientific hypothesis be falsifiable?

Because science requires hypotheses that could be refuted by evidence; otherwise they can’t be tested meaningfully.

2. [T1] What’s the difference between descriptive and inferential statistics?

Descriptive statistics summarize the observed data; inferential statistics draw conclusions beyond the sample, generalizing to a population and quantifying uncertainty.

3. [T1] In plain language, what does a p-value tell you?

How likely data at least as extreme as yours would be if the null hypothesis were true, i.e., how surprising the result is under the null.
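
A minimal simulation sketch of that idea (the coin-flip scenario and all numbers are invented for illustration, not from the deck): estimate how often a fair coin would produce a result at least as extreme as 60 heads out of 100 flips.

```python
import random

# Hypothetical example: observed 60 heads in 100 flips; null hypothesis = fair coin.
# Estimate the one-sided p-value by simulating many experiments under the null.
observed_heads = 60
n_flips, n_sims = 100, 100_000

at_least_as_extreme = sum(
    sum(random.random() < 0.5 for _ in range(n_flips)) >= observed_heads
    for _ in range(n_sims)
)
print("estimated p-value:", at_least_as_extreme / n_sims)
```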

4. [T1] What role does α = 0.05 play in hypothesis testing?

It’s the significance threshold: if p ≤ α, the result is declared “statistically significant”; choosing α fixes the false-positive rate you’re willing to accept.

5. [T2] If a network has N nodes and is fully connected, how many links does it have, and why?

N(N−1)/2 because every pair of distinct nodes forms one undirected link.
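
As a quick check of the formula (the value N = 5 is just an example):

```latex
L_{\max} = \binom{N}{2} = \frac{N(N-1)}{2},
\qquad \text{e.g. } N = 5 \;\Rightarrow\; L_{\max} = \frac{5 \cdot 4}{2} = 10.
```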

6. [T2] Degree distribution vs frequency distribution: what’s the difference?

Degree distribution is a probability (chance a node has degree k); frequency distribution is counts (# nodes with degree k).
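
A small sketch of the distinction, using a made-up edge list (not from the slides): the frequency distribution counts nodes per degree, and dividing by N turns it into a probability (degree) distribution.

```python
from collections import Counter

# Hypothetical undirected edge list, for illustration only.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]
nodes = {n for e in edges for n in e}
N = len(nodes)

# Degree of each node = number of links touching it.
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

freq = Counter(degree.values())              # frequency: # of nodes with degree k
p_k = {k: c / N for k, c in freq.items()}    # probability: chance a node has degree k
print(freq, p_k)
```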

7. [T2] What does a high clustering coefficient imply about a node’s neighborhood?

The node’s neighbors tend to also be connected to each other (more “clique-like”).
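
A minimal sketch of a node’s local clustering coefficient, assuming the usual definition C_i = 2·L_i / (k_i·(k_i − 1)), where L_i is the number of links among node i’s k_i neighbors (the example graph is made up):

```python
from itertools import combinations

# Hypothetical undirected adjacency sets, for illustration only.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}

def clustering(node):
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    # Count the links that actually exist between pairs of neighbors.
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))

print(clustering(0))  # 1 link (1-2) among 3 neighbors -> 2*1/(3*2) ≈ 0.33
```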

8. [T2] How do you generate a random network in the slides’ model?

Use N nodes and connect each pair with probability p.
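
A minimal sketch of that recipe (an Erdős–Rényi-style G(N, p) graph); the values of N, p, and the seed are arbitrary examples:

```python
import random
from itertools import combinations

def random_network(N, p, seed=None):
    """Connect each pair of N nodes independently with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i, j in combinations(range(N), 2) if rng.random() < p]

edges = random_network(N=10, p=0.3, seed=42)
print(len(edges), "links out of a possible", 10 * 9 // 2)
```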

9. [T2] What is homophily and why does it matter in social networks?

Similar people connect more often; it shapes communities and patterns like shared behaviors/opinions.

10. [T2] What is the main idea behind “six degrees of separation”?

Social networks can have short path lengths, so most people are connected by only a few steps.

11. [T3] What is culturomics trying to measure, and what dataset enables it?

Cultural change/usage patterns at scale using digitized text; Google Books corpus is central.

12. [T3] Why does the dataset impose a “min 40 occurrences” rule for n-grams?

To filter out extremely rare sequences and focus on more stable/meaningful patterns.

13. [T3] If the dataset supports up to 5-grams, what’s an example of a 5-gram?

Any sequence of 5 consecutive tokens (words), e.g., “to be or not to”.
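
A small sketch of extracting 5-grams from text (the sentence is only an illustration):

```python
# Slide a window of 5 consecutive tokens over the text.
text = "to be or not to be that is the question"
tokens = text.split()
five_grams = [" ".join(tokens[i:i + 5]) for i in range(len(tokens) - 4)]
print(five_grams[0])  # "to be or not to"
```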

14. [T4] Why is testing a model on its training data misleading?

It rewards memorization: an overfit model can score well on the examples it was trained on without generalizing to new data.

15. [T4] In holdout evaluation, why separate training/validation/test?

Train builds the model; validation tunes settings; test estimates real-world performance.
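
A minimal holdout-split sketch (the 60/20/20 proportions are arbitrary example values, not from the slides):

```python
import random

def train_val_test_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then carve off validation and test portions."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]                 # final performance estimate
    val = shuffled[n_test:n_test + n_val]    # tune settings here
    train = shuffled[n_test + n_val:]        # fit the model here
    return train, val, test

train, val, test = train_val_test_split(list(range(100)))
print(len(train), len(val), len(test))  # 60 20 20
```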

16. [T4] How does k-fold CV reduce the risk of a “lucky/unlucky” split?

Every data point gets to be in a test fold; performance is averaged across k runs.
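
A minimal k-fold sketch, assuming a generic `evaluate(train, held_out)` function that returns a score (hypothetical, not from the slides):

```python
def k_fold_scores(data, k, evaluate):
    """Each point lands in the held-out fold exactly once; scores are averaged."""
    folds = [data[i::k] for i in range(k)]   # simple round-robin folds
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(evaluate(train, held_out))
    return sum(scores) / k

# Toy usage: "evaluate" here just returns the held-out fold's mean.
data = list(range(10))
print(k_fold_scores(data, k=5, evaluate=lambda tr, te: sum(te) / len(te)))
```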

17. [T4] Give an example of when accuracy is a bad metric and explain why.

Imbalanced classes: predicting the majority class can yield high accuracy but fail to detect the rare (important) class.
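
A tiny worked example of that failure mode (the numbers are invented for illustration):

```python
# 990 healthy (0), 10 sick (1); a lazy model predicts "healthy" for everyone.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)
print(accuracy, recall)  # 0.99 accuracy, but 0.0 recall on the rare class
```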

18. [T4] How do precision and recall differ in what they “care about”?

Precision: how reliable positive predictions are; Recall: how many actual positives are caught.
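
In terms of the standard confusion-matrix counts (TP, FP, FN):

```latex
\text{precision} = \frac{TP}{TP + FP}
\qquad
\text{recall} = \frac{TP}{TP + FN}
```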

19. [T4] When would you prioritize recall over precision?

When missing positives is costly (e.g., disease detection), so you want to catch as many positives as possible.

20. [T4] Why use F1 instead of accuracy?

F1 balances precision and recall, which is useful especially with class imbalance.
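
F1 is the harmonic mean of precision (P) and recall (R), so it stays low unless both are reasonably high (the numbers below are an illustrative example):

```latex
F_1 = 2 \cdot \frac{P \cdot R}{P + R},
\qquad \text{e.g. } P = 0.9,\; R = 0.1 \;\Rightarrow\; F_1 = 2 \cdot \frac{0.09}{1.0} = 0.18.
```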

21. [T4] Walk through how k-NN classifies a new point.

Choose k, find k nearest labeled points, take majority vote, assign that class.
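
A minimal k-NN sketch following exactly those steps (the toy points and the choice of k are invented for illustration):

```python
from collections import Counter
from math import dist

def knn_predict(train, new_point, k=3):
    """train = list of (features, label); classify new_point by majority vote."""
    # Steps 1-2: rank labeled points by distance and keep the k nearest.
    nearest = sorted(train, key=lambda xy: dist(xy[0], new_point))[:k]
    # Steps 3-4: majority vote among their labels decides the class.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_predict(train, (1.5, 1.5), k=3))  # "A"
```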

22. [T4] What’s a major weakness of k-NN mentioned in the slides?

It’s memory-heavy (it stores all training data), prediction can be slow because each query is compared against the stored points, and it’s sensitive to irrelevant features.

23. [T4] How does a decision tree decide splits, at a high level?

It repeatedly picks the split (often after discretizing continuous values) that most improves classification accuracy, and keeps splitting until it reaches leaf nodes.
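
A sketch of the split-choosing idea on a single numeric feature, scoring candidate thresholds by how cleanly they separate the classes (a simple misclassification count is used here as a stand-in for whatever criterion the slides use; the toy data is invented):

```python
def best_threshold(values, labels):
    """Try midpoints between sorted values; keep the split with fewest errors."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= thr]
        right = [lab for v, lab in pairs if v > thr]
        # Errors if each side predicts its own majority class.
        errors = (len(left) - max(map(left.count, set(left)))) + \
                 (len(right) - max(map(right.count, set(right))))
        if best is None or errors < best[1]:
            best = (thr, errors)
    return best

# Toy feature: small values are mostly "no", large values are "yes".
print(best_threshold([1, 2, 3, 10, 11, 12], ["no", "no", "yes", "yes", "yes", "yes"]))
# -> (2.5, 0): splitting at 2.5 classifies every toy point correctly
```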

24. [T4] Why can decision trees overfit?

Too many branches (especially with outliers) can fit noise rather than general patterns.