Ch. 1 R for Data Science

1
New cards

Name the steps of a data science project.

  1. Import

  2. Tidy

  3. Understand

    -Transform

    -Visualize

    -Model

  4. Communicate

    Note: The steps contained in “Understand” form a cycle, which can be repeated as many times as desired.

2
New cards

API

Application Programming Interface. Similar to a user interface, but instead of allowing a user to interface with a program, an API allows a program to interface with another program.

3
New cards

In a “tidy” dataset, columns represent what?

Variables

4
New cards

In a “tidy” dataset, rows represent what?

Observations about the variables.

5
New cards

Give 3 examples of “Transforming” data.

  1. Narrowing focus to certain variables

  2. Calculating new variables from existing variables

    ex: speed = distance/time

  3. Calculating summary statistics

    ex: mean, standard deviation, etc.
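
The three kinds of transformation above can be sketched with dplyr. This is only an illustration: the `trips` table and its column names are made up, and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical data: three trips with a distance (km) and a duration (h).
trips <- tibble(
  id = 1:3,
  distance = c(100, 150, 30),
  time = c(2, 3, 0.5),
  note = c("a", "b", "c")
)

# 1. Narrow the focus to certain variables.
narrowed <- select(trips, distance, time)

# 2. Calculate a new variable from existing variables.
with_speed <- mutate(narrowed, speed = distance / time)

# 3. Calculate summary statistics.
speed_summary <- summarize(
  with_speed,
  mean_speed = mean(speed),
  sd_speed = sd(speed)
)
```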

6
New cards

IDE

Integrated Development Environment. A program that combines a code editor with tools for running, debugging, and managing code.

7
New cards

In RStudio, R code is typed into the … pane

console

8
New cards

In RStudio, how do you run code from the console pane?

Press Enter

9
New cards

RStudio: clear console

CTRL+L

10
New cards

Function for installing packages.

install.packages("packagename")

11
New cards

Statement to install Tidyverse

install.packages("tidyverse")

12
New cards

CRAN

Comprehensive R Archive Network. The repository from which R and most of its packages are installed.

13
New cards

To use the functions in an R package, you must first …1… and …2…

  1. Install the package onto your computer from CRAN

  2. Load the package into your R session

14
New cards

Statement to load a package into an R session

library(package_name)

15
New cards

What does this mean:

Some functions in tidyverse packages have the same names as functions in base R. Code will still run; the tidyverse versions mask (take precedence over) the base R ones.

16
New cards

Function that returns whether or not there is an update for tidyverse.

tidyverse_update()

17
New cards

Invoke a function by specifying its package name

package::function()

ex:

dplyr::mutate()
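
A short sketch of the `package::function()` form in use. The data frame `df` is made up for illustration, and dplyr is assumed to be installed but is deliberately not attached with `library()`.

```r
# Hypothetical data frame used only for this example.
df <- data.frame(distance = c(100, 150), time = c(2, 3))

# package::function() calls mutate() from dplyr explicitly,
# without loading the whole package into the session.
df_speed <- dplyr::mutate(df, speed = distance / time)

# The same syntax chooses between functions that share a name,
# for example dplyr's filter() rather than the filter() in stats.
fast <- dplyr::filter(df_speed, distance > 120)
```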

18
New cards

Data frame

A rectangular data set, with variables in columns and observations in rows

19
New cards

Variable

A quantity, quality, or property you can measure.

20
New cards

Value

The state of a variable when you measure it.

21
New cards

Observation

A set of measurements made under similar conditions. An observation will contain several values, each associated with a different variable.

22
New cards

Observations are also referred to as…

Data points

23
New cards

Tabular Data

A set of values, each associated with a variable and an observation.

24
New cards

Tabular data is considered “tidy” if…

…each value is in its own cell, each variable is in its own column, and each observation is in its own row.

25
New cards

How do you preview a data frame in the console?

Enter its name in the console

26
New cards

Function that prints a data frame transposed, so that each variable appears as a row showing its first few values

glimpse(dataframe)
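
The two console previews can be compared side by side. This sketch assumes the dplyr and palmerpenguins packages are installed.

```r
library(dplyr)           # provides glimpse()
library(palmerpenguins)  # provides the penguins data frame

# Typing the name prints a preview of the data frame.
penguins

# glimpse() lists one variable per line with its type and first values.
glimpse(penguins)
```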

27
New cards

Function that opens up an interactive viewer for a dataframe. (RStudio only)

View(dataframe)

28
New cards

Open the help page for a dataframe

?dataframe

29
New cards

Standard function for creating plots.

ggplot()

30
New cards

ggplot argument for which dataframe to use

data

31
New cards

ggplot argument that defines how variables are mapped to visual properties (aesthetics) of the plot

mapping

32
New cards

Set of functions for choosing the geometric object (points, bars, lines, etc.) used to represent data in a plot

geom_point(), geom_bar(), geom_line(), etc.

33
New cards

Function used as the value for the “mapping” argument in ggplot()

aes(x = x_axis_data, y = y_axis_data)

34
New cards

Create a basic scatterplot for the “palmerpenguins” data set

library(ggplot2)
library(palmerpenguins)

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()

35
New cards

Function for displaying data in bar graph form

geom_bar()

36
New cards

Function for displaying data in line graph form

geom_line()

37
New cards

Function for displaying data in boxplot form

geom_boxplot()

38
New cards

Function for displaying data in scatterplot form

geom_point()

39
New cards

Add a newline in the console, without executing code

SHIFT + ENTER

40
New cards

Argument that tells aes() to automatically display data points in different colors based on the value of a specific variable.

color = variable
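
As an illustration, the penguin scatterplot can be colored by species by adding `color = species` inside `aes()`. The sketch assumes ggplot2 and palmerpenguins are installed.

```r
library(ggplot2)
library(palmerpenguins)

# color = species sits inside aes(), so ggplot2 picks one color
# per species and adds a legend automatically.
p <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point()

p
```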

41
New cards

Scaling

Scaling is when you assign a variable as the value of an argument in aes(), and each value of that variable is automatically assigned a unique aesthetic (like a unique color, for example).

42
New cards

ggplot() layer that draws a line of best fit based on a linear model.

+geom_smooth(method="lm")
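
A minimal sketch of this layer added on top of a scatterplot, assuming ggplot2 and palmerpenguins are installed.

```r
library(ggplot2)
library(palmerpenguins)

# Layer 1 draws the points; layer 2 fits a linear model (lm)
# to the same x and y and draws the fitted line.
p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  geom_smooth(method = "lm")

p
```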

43
New cards

Global level

Applies aesthetic mapping to the entire plot. Setting aesthetics at this level will affect points, bars, lines, etc. all at the same time.

44
New cards

Local level

Applies aesthetic mapping only to a specific layer of a plot (only the points, only the lines, etc.)

45
New cards

Where are global aesthetics defined?

In the ggplot() call, through its mapping argument.

46
New cards

Where are local aesthetics defined?

In an individual geom function, through that layer's mapping argument.
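
The difference between the two levels can be sketched with the penguin scatterplot. This is an illustration, not necessarily the code on the original cards, and it assumes ggplot2 and palmerpenguins are installed.

```r
library(ggplot2)
library(palmerpenguins)

# Global: color = species is defined in ggplot(), so the points AND
# the fitted lines are split by species.
global_plot <- ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  geom_smooth(method = "lm")

# Local: color = species is defined only in geom_point(), so the points
# are colored by species but a single line is fitted to all penguins.
local_plot <- ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")
```
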
47
New cards

Write a layer for ggplot() that changes the shape of points only based on a specific variable.

+geom_point(mapping=aes(shape=variable))

48
New cards

Add a new layer to a plot

ggplot() + new_layer() + new_layer() + new_layer()

49
New cards

ggplot() layer for customizing the labels on a plot.

labs()

50
New cards

labs() argument for specifying the title of a plot

title = "title"

51
New cards

labs() argument for specifying the subtitle of a plot

subtitle = "subtitle"

52
New cards

labs() argument for specifying the x and y axis labels of a plot

x = "label"

y = "label"

53
New cards

labs() argument for specifying the legend title of a plot

shape = "label" or color = "label"
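
The labs() arguments from the cards above can be combined in one call. The label text here is made up for illustration, and ggplot2 and palmerpenguins are assumed to be installed.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = species, shape = species)) +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Palmer penguins, by species",
    x = "Flipper length (mm)",
    y = "Body mass (g)",
    # Giving color and shape the same title merges them into one legend.
    color = "Species",
    shape = "Species"
  )

p
```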

54
New cards

ggplot() layer that adds a colorblind-safe color palette (from the ggthemes package)

+scale_color_colorblind()

55
New cards

Recreate this plot, using the palmerpenguins data set:

[plot image not preserved]

56
New cards

geom_() argument that removes missing values from the plot without a warning

na.rm = TRUE, e.g. geom_point(na.rm = TRUE)

57
New cards

labs() argument to add a caption to a plot

caption="label"

58
New cards

Can a dataframe be set locally with ggplot2?

Yes, dataframes may be set locally in a specific layer, or globally in ggplot().

59
New cards

What are the first two arguments in ggplot()?

  1. data

  2. mapping

60
New cards

Do you have to write the names of arguments to assign them?

No. ggplot(data=penguins) is the same as ggplot(penguins).
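
To illustrate, the two calls below build the same plot, once with named arguments and once relying on argument position. The sketch assumes ggplot2 and palmerpenguins are installed.

```r
library(ggplot2)
library(palmerpenguins)

# Every argument named explicitly.
explicit <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point()

# Same plot: data and mapping are matched by position in ggplot(),
# and x and y are matched by position in aes().
concise <- ggplot(penguins, aes(flipper_length_mm, body_mass_g)) +
  geom_point()
```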

61
New cards

Categorical variable

A variable representing a category, which can only contain a limited number of possible values.

ex: kingdom = plantae/animalia/fungi/etc

62
New cards

Which plot is best for examining the distribution of categorical variables?

Bar chart

63
New cards

Function that allows you to transform a variable into a factor which is ordered by frequency.

fct_infreq(variable)

64
New cards

Plot a bar chart for species of penguin, ordered by frequency.

[answer image not preserved]
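
One way to answer this card, sketched with ggplot2 and forcats. It is not necessarily the code shown on the original card, and it assumes the ggplot2, forcats, and palmerpenguins packages are installed.

```r
library(ggplot2)
library(forcats)
library(palmerpenguins)

# fct_infreq() reorders the species levels from most to least frequent,
# so the bars appear in descending order of count.
p <- ggplot(penguins, aes(x = fct_infreq(species))) +
  geom_bar()

p
```
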
65
New cards

Numerical variables

Variables with numerical values, for which it would make sense to add/subtract/take averages of those values

66
New cards

Numerical variables are also called…

Quantitative variables

67
New cards

Numerical variables can be… or…

Continuous or discrete

68
New cards

Histogram

Plot that divides the x-axis into equal “bins”, with the y-axis representing the number of observations per bin.

69
New cards

ggplot() layer that displays data as a histogram

+geom_histogram()

70
New cards

geom_histogram() argument that specifies the width of the bins

geom_histogram(binwidth=numericalvalue)

71
New cards

Make a histogram of penguin body mass

[answer image not preserved]
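
One possible answer, given as a sketch rather than the code from the original card. The bin width of 200 grams is an arbitrary choice for illustration, and ggplot2 and palmerpenguins are assumed to be installed.

```r
library(ggplot2)
library(palmerpenguins)

# Each bin covers 200 g of body mass; the bar height is the number
# of penguins whose body mass falls in that bin.
p <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)

p
```
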
72
New cards

Density plot

A “smoothed-out” version of a histogram, which is more useful for continuous variables.

73
New cards

ggplot() layer for density plots

+geom_density()

74
New cards

Boxplot

A plot that uses boxes, “whiskers”, and points to categorize observations by percentiles.

75
New cards

Interquartile Range

The distance between the 25th and 75th percentiles of a variable, which spans the “middle half” of the observations.

76
New cards

What does the line inside the box on a boxplot represent?

The median (50th percentile)

77
New cards

What do the “whiskers” on a boxplot represent?

The range of values outside the box (the IQR) that are not outliers; each whisker extends from the edge of the box to the farthest non-outlier point.

78
New cards

What do the points on a boxplot represent?

Outliers.

79
New cards

On a boxplot, a datapoint is considered an outlier if…

…it lies more than 1.5 * IQR below the 25th percentile or above the 75th percentile.
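
The 1.5 * IQR rule can be worked through by hand in base R. The vector `x` below is made up for illustration, and the quartiles use R's default quantile() method.

```r
# Hypothetical measurements with one unusually large value.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

q1 <- quantile(x, 0.25)  # 25th percentile (lower edge of the box)
q3 <- quantile(x, 0.75)  # 75th percentile (upper edge of the box)
iqr <- q3 - q1           # interquartile range

# Points beyond these limits count as outliers under the rule.
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

is_outlier <- x < lower_limit | x > upper_limit
x[is_outlier]
```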

80
New cards

Make a boxplot of penguin body mass by species.

[answer image not preserved]
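
One possible answer, sketched under the assumption that ggplot2 and palmerpenguins are installed. It is not necessarily the code from the original card.

```r
library(ggplot2)
library(palmerpenguins)

# A categorical variable on x and a numerical variable on y
# gives one box per category.
p <- ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot()

p
```
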
81
New cards

Draw density curves of body mass for different species of penguin.

[answer image not preserved]
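
One possible answer, as a sketch. The line width of 0.75 is an arbitrary choice, the `linewidth` argument assumes ggplot2 3.4.0 or later, and palmerpenguins is assumed to be installed.

```r
library(ggplot2)
library(palmerpenguins)

# Mapping color to species draws a separate density curve per species.
p <- ggplot(penguins, aes(x = body_mass_g, color = species)) +
  geom_density(linewidth = 0.75)

p
```
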
82
New cards

geom_() argument for customizing the width of a line.

linewidth=numericalvalue

83
New cards

Aesthetic that fills in the space underneath a variable’s curve.

fill=variable

84
New cards

Aesthetic that makes the fill underneath a curve transparent.

alpha=value_between_0_and_1

85
New cards

Draw filled, transparent density curves for the body mass of different penguin species.

[answer image not preserved]
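
One possible answer, as a sketch. The alpha value of 0.5 is an arbitrary choice, and ggplot2 and palmerpenguins are assumed to be installed.

```r
library(ggplot2)
library(palmerpenguins)

# color and fill are mapped to species; alpha is set to a fixed value
# so that overlapping curves remain visible.
p <- ggplot(
  penguins,
  aes(x = body_mass_g, color = species, fill = species)
) +
  geom_density(alpha = 0.5)

p
```
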
86
New cards

Mapping an aesthetic means…

That aesthetic will be determined by the values of a variable.

87
New cards

Setting an aesthetic means…

The aesthetic will be determined by a user-defined value.
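
The contrast between mapping and setting can be sketched as follows. The color "blue" is an arbitrary choice, and ggplot2 and palmerpenguins are assumed to be installed.

```r
library(ggplot2)
library(palmerpenguins)

# Mapping: color goes inside aes() and is driven by a variable.
mapped_plot <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(mapping = aes(color = species))

# Setting: color goes outside aes() and is a fixed, user-chosen value.
set_plot <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(color = "blue")
```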

88
New cards

Using the Palmer Penguins data, create a bar graph of islands, with the bars stacked by species.

[answer image not preserved]
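
One possible answer, sketched under the assumption that ggplot2 and palmerpenguins are installed. It is not necessarily the code from the original card.

```r
library(ggplot2)
library(palmerpenguins)

# One bar per island; fill = species splits each bar into stacked
# segments whose heights are the counts of each species.
p <- ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar()

p
```
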
89
New cards

Using Palmer Penguin data, create a stacked bar graph of the relative frequency of species per island.

[answer image not preserved]
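
One possible answer, as a sketch that assumes ggplot2 and palmerpenguins are installed. The y-axis label is an illustrative choice.

```r
library(ggplot2)
library(palmerpenguins)

# position = "fill" rescales every bar to the same height (1),
# so the segments show proportions instead of counts.
p <- ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(position = "fill") +
  labs(y = "proportion")

p
```
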
90
New cards

Run an entire R script in RStudio

CTRL+SHIFT+ENTER
