Data Science


24 Terms

1

[T1] Why must a scientific hypothesis be falsifiable?

Because science requires hypotheses that could be refuted by evidence; otherwise they can’t be tested meaningfully.

2

[T1] What’s the difference between descriptive and inferential statistics?

Descriptive statistics summarize the observed data; inferential statistics draw conclusions beyond the sample, generalizing to a population and quantifying uncertainty.

3

[T1] In plain language, what does a p-value tell you?

The probability of seeing data at least as extreme as yours if the null hypothesis were true — i.e., how surprising your result would be under the null.

4

[T1] What role does α = 0.05 play in hypothesis testing?

It’s the cutoff used to decide if a result is “statistically significant” under the chosen standard.
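These two cards can be made concrete with a small sketch. The test below (an exact two-sided binomial test on coin flips) and its numbers are illustrative assumptions, not taken from the slides:

```python
from math import comb

# Hedged sketch: exact two-sided binomial test.
# Null hypothesis: the coin is fair (p = 0.5).
def binom_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n flips."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def two_sided_p_value(heads, n):
    """Sum the probabilities of every outcome at least as unlikely
    as the observed one, under the null hypothesis."""
    observed = binom_pmf(heads, n)
    return sum(binom_pmf(k, n) for k in range(n + 1)
               if binom_pmf(k, n) <= observed + 1e-12)

alpha = 0.05                     # the conventional significance cutoff
p = two_sided_p_value(60, 100)   # 60 heads in 100 flips
significant = p < alpha          # compare the p-value against alpha
```

Here 60 heads in 100 flips gives a p-value just above 0.05, so under the α = 0.05 standard the result would not be called statistically significant.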

5

[T2] If a network has N nodes and is fully connected, how many links does it have—and why?

N(N−1)/2 because every pair of distinct nodes forms one undirected link.
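A quick way to convince yourself of the formula is to enumerate all unordered pairs of nodes (the N = 6 here is just an example):

```python
from itertools import combinations

# In a complete undirected network, each unordered pair of distinct
# nodes forms exactly one link, so counting pairs counts links.
N = 6
links = list(combinations(range(N), 2))  # all unordered node pairs
count = len(links)                       # equals N * (N - 1) // 2 = 15
```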

6

[T2] Degree distribution vs frequency distribution: what’s the difference?

Degree distribution is a probability (chance a node has degree k); frequency distribution is counts (# nodes with degree k).
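A sketch of both quantities on a made-up 4-node edge list (the network is illustrative): the frequency distribution is raw counts, and dividing by N turns it into the degree (probability) distribution.

```python
from collections import Counter

# Made-up undirected network: node 0 links to 1, 2, 3; nodes 1-2 link.
links = [(0, 1), (0, 2), (0, 3), (1, 2)]
degree = Counter()
for a, b in links:
    degree[a] += 1   # each link adds one to both endpoints' degrees
    degree[b] += 1

N = 4
freq = Counter(degree[node] for node in range(N))     # counts of nodes with degree k
dist = {k: count / N for k, count in freq.items()}    # probability a node has degree k
```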

7

[T2] What does a high clustering coefficient imply about a node’s neighborhood?

The node’s neighbors tend to also be connected to each other (more “clique-like”).
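A sketch of the local clustering coefficient on the same kind of toy edge list (illustrative, not from the slides): it is the fraction of a node's neighbor pairs that are themselves linked.

```python
from itertools import combinations

# Illustrative edge list: node 0 links to 1, 2, 3; only pair (1, 2) is linked.
links = {(0, 1), (0, 2), (0, 3), (1, 2)}

def clustering(node):
    """Fraction of the node's neighbor pairs that are linked to each other."""
    neighbors = ({b for a, b in links if a == node}
                 | {a for a, b in links if b == node})
    pairs = list(combinations(sorted(neighbors), 2))
    if not pairs:
        return 0.0
    closed = sum(1 for a, b in pairs if (a, b) in links or (b, a) in links)
    return closed / len(pairs)

# Node 0 has neighbors {1, 2, 3}; of the 3 possible neighbor pairs,
# only (1, 2) is linked, so its clustering coefficient is 1/3.
```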

8

[T2] How do you generate a random network in the slides’ model?

Start with N nodes and connect each pair independently with probability p.
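A minimal sketch of that model (commonly known as the Erdős–Rényi G(N, p) model; the N, p, and seed values are arbitrary):

```python
import random
from itertools import combinations

# N nodes; each of the N*(N-1)/2 possible pairs becomes a link
# independently with probability p.
def random_network(N, p, seed=None):
    rng = random.Random(seed)
    return [pair for pair in combinations(range(N), 2) if rng.random() < p]

links = random_network(N=10, p=0.3, seed=42)
# Since each possible link appears with probability p, the expected
# number of links is p * N * (N - 1) / 2.
```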

9

[T2] What is homophily and why does it matter in social networks?

Similar people connect more often; it shapes communities and patterns like shared behaviors/opinions.

10

[T2] What is the main idea behind “six degrees of separation”?

Social networks can have short path lengths, so most people are connected by only a few steps.

11

[T3] What is culturomics trying to measure, and what dataset enables it?

Cultural change/usage patterns at scale using digitized text; Google Books corpus is central.

12

[T3] Why does the dataset impose a “min 40 occurrences” rule for n-grams?

To filter out extremely rare sequences and focus on more stable/meaningful patterns.

13

[T3] If the dataset supports up to 5-grams, what’s an example of a 5-gram?

Any sequence of 5 consecutive words/tokens, e.g., “to be or not to”.
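The idea can be sketched as a sliding window over tokens (whitespace tokenization here is a simplification of how the corpus is actually processed):

```python
# Slide a window of n tokens across the text; each window is one n-gram.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
fivegrams = ngrams(tokens, 5)
# -> [('to', 'be', 'or', 'not', 'to'), ('be', 'or', 'not', 'to', 'be')]
```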

14

[T4] Why is testing a model on its training data misleading?

It can overfit—performing well by memorizing training examples rather than generalizing.

15

[T4] In holdout evaluation, why separate training/validation/test?

Training data builds the model; validation data tunes settings (hyperparameters); test data estimates real-world performance on unseen examples.
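A sketch of a holdout split; the 60/20/20 ratios are a common convention, assumed here rather than prescribed by the slides:

```python
import random

# Shuffle once, then carve the data into three disjoint parts.
def holdout_split(data, seed=0):
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    i, j = int(0.6 * n), int(0.8 * n)
    return shuffled[:i], shuffled[i:j], shuffled[j:]  # train, validation, test

train, val, test = holdout_split(list(range(100)))
```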

16

[T4] How does k-fold CV reduce the risk of a “lucky/unlucky” split?

Every data point gets to be in a test fold; performance is averaged across k runs.
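A pure-Python sketch of the index splitting (real projects would typically use `sklearn.model_selection.KFold`):

```python
# Contiguous k-fold splits: every index lands in exactly one test fold,
# so averaging performance over the k runs uses all the data for testing.
def k_fold_splits(n, k):
    """Yield (train_idx, test_idx) pairs for n data points and k folds."""
    fold_size, extra = divmod(n, k)
    start = 0
    for i in range(k):
        stop = start + fold_size + (1 if i < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n))
        yield train, test
        start = stop
```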

17

[T4] Give an example of when accuracy is a bad metric and explain why.

Imbalanced classes: predicting the majority class can yield high accuracy but fail to detect the rare (important) class.

18

[T4] How do precision and recall differ in what they “care about”?

Precision: how reliable positive predictions are; Recall: how many actual positives are caught.

19

[T4] When would you prioritize recall over precision?

When missing positives is costly (e.g., disease detection), so you want to catch as many positives as possible.

20

[T4] Why use F1 instead of accuracy?

F1 balances precision and recall, which is useful especially with class imbalance.
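A sketch tying these metrics together: a dummy "always negative" classifier on a made-up imbalanced dataset gets high accuracy while precision, recall, and F1 all collapse to zero.

```python
# Made-up labels: 5% rare positive class, and a model that never
# predicts the positive class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.95
precision = tp / (tp + fp) if tp + fp else 0.0  # 0.0: no positives predicted
recall = tp / (tp + fn) if tp + fn else 0.0     # 0.0: all 5 positives missed
f1 = (2 * precision * recall / (precision + recall)
      if precision + recall else 0.0)           # 0.0 despite 95% accuracy
```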

21

[T4] Walk through how k-NN classifies a new point.

Choose k, find k nearest labeled points, take majority vote, assign that class.
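Those steps can be sketched on made-up 2-D points (a real project would use something like scikit-learn's `KNeighborsClassifier`):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs. Find the k points nearest
    to the query (squared Euclidean distance) and take a majority vote."""
    nearest = sorted(
        train,
        key=lambda pl: sum((a - b) ** 2 for a, b in zip(pl[0], query))
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Illustrative labeled points: an "A" cluster near the origin,
# a "B" cluster near (5, 5).
train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((6, 5), "B")]
label = knn_predict(train, (1, 1), k=3)  # nearest three are all "A"
```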

22

[T4] What’s a major weakness of k-NN mentioned in the slides?

It’s memory-heavy, prediction can be slower, and it’s sensitive to irrelevant features.

23

[T4] How does a decision tree decide splits, at a high level?

It repeatedly picks the split that best separates the classes (often after discretizing continuous values), growing branches until it reaches leaf nodes.

24

[T4] Why can decision trees overfit?

Too many branches (especially with outliers) can fit noise rather than general patterns.