Google Data Analytics Course 3 - Prepare Data for Exploration


106 Terms

1
New cards

Learning Objectives

1. Understanding the different types of data and data structures

2. What type of data is right for the question you're answering

3. Practical skills about how to extract, use, organize, and protect your data

- Explain how Kaggle can benefit a data analyst

- Explain how data is generated as a part of our daily activities with reference to the types of data generated

- Explain factors that should be considered when making decisions about data collection

- Explain the difference between structured and unstructured data

- Discuss the difference between data and data types

- Explain the relationship between data types, fields, and values

- Discuss wide and long data formats with references to organization and purpose

2
New cards

How data is collected

- interviews

- observations

- forms

- questionnaires

- cookies

- surveys

3
New cards

Data collection considerations

- How the data will be collected

- Choose data sources

- Decide what data to use

- How much data to collect

- Select the right data type

- Determine the time frame

4
New cards

First-party data

Data collected by an individual or group using their own resources

5
New cards

Second-party data

Data collected by a group directly from its audience and then sold

6
New cards

Third-party data

Data collected from outside sources who did not collect it directly

7
New cards

Population

All possible data values in a certain dataset

8
New cards

Sample

A part of a population that is representative of the population

9
New cards

Discrete data

Data that is counted and has a limited number of values

Partial measurements aren't allowed

10
New cards

Continuous data

Data that is measured and can have almost any numerical value

11
New cards

Nominal data

A type of qualitative data that is categorized without a set order

12
New cards

Ordinal data

A type of qualitative data with a set order or scale

13
New cards

Internal data

Data that lives within a company's own systems

14
New cards

External data

Data that lives and is generated outside of an organization

15
New cards

Structured data

Data organized in a certain format such as rows and columns

- defined data types

- most often quantitative data

- easy to organize

- easy to search

- easy to analyze

- stored in relational databases & data warehouses

- contained in rows and columns

Examples: Excel, Google Sheets, SQL, customer data, phone records, transaction history

16
New cards

Unstructured data

Data that is not organized in any easily identifiable manner

- Varied data types

- Most often qualitative data

- Difficult to search

- Provides more freedom for analysis

- Stored in data lakes, data warehouses, and NoSQL databases

- Can't be put in rows and columns

- Examples: text messages, social media comments, phone call transcriptions, various log files, images, audio, video

17
New cards

Data model

A model that is used for organizing data elements and how they relate to one another

18
New cards

Sources of structured data

- spreadsheets

- databases that store datasets

19
New cards

Data modeling

the process of creating diagrams that visually represent how data is organized and structured

20
New cards

Conceptual data modeling

high-level view of the data structure

- how data interacts across an organization

21
New cards

Logical data modeling

focuses on the technical details of a database such as relationships, attributes, and entities

22
New cards

Physical data modeling

depicts how a database operates

23
New cards

Entity Relationship Diagram (ERD)

A visual way to understand the relationship between entities in the data model

24
New cards

Unified Modeling Language (UML) diagram

Very detailed diagram that describes the structure of a system by showing the system's entities, attributes, operations, and their relationships

25
New cards

Data type

A specific kind of data attribute that tells what kind of value the data is

Tells you what type of data you're working with

26
New cards

Data types in spreadsheets

- Number

- Text or string

- Boolean

27
New cards

Text or string data type

A sequence of characters and punctuation that contains textual information

28
New cards

Boolean data type

A data type with only two possible values, such as TRUE or FALSE

29
New cards

Operator

a symbol that names the operation or calculation to be performed
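
The cards above describe data types and operators in spreadsheets; the same ideas can be sketched in Python. The variable names and values here are made up for illustration.

```python
# Number data type (hypothetical values).
price = 12.5
quantity = 3

# Text or string data type.
product = "notebook"

# The * operator names the calculation to perform (multiplication).
total = price * quantity

# The > operator compares two values and produces a Boolean (True or False).
is_large_order = total > 30

print(product, total, is_large_order)
```

In a spreadsheet the equivalent would be formulas such as `=A2*B2` and `=C2>30`, with the cell references being assumptions about the sheet layout.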

30
New cards

Wide data

Data in which every data subject has a single row with multiple columns to hold the values of various attributes of the subject

31
New cards

Long data

Data in which each row is one time point per subject, so each subject will have data in multiple rows

each row contains a single data point
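
As a minimal sketch of the two layouts, the same data can be held in wide format and reshaped into long format with plain Python. The countries, years, and values are invented for illustration, and `wide_to_long` is a hypothetical helper, not a library function.

```python
# Wide format: one row per subject, one column per year (made-up values).
wide_rows = [
    {"country": "A", "2019": 10, "2020": 12},
    {"country": "B", "2019": 7, "2020": 9},
]


def wide_to_long(rows, id_field, key_name="year", value_name="value"):
    """Turn every non-id column of a wide row into its own long row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue  # the id column is repeated on each long row instead
            long_rows.append(
                {id_field: row[id_field], key_name: column, value_name: value}
            )
    return long_rows


# Long format: one row per subject per time point.
long_rows = wide_to_long(wide_rows, "country")
for row in long_rows:
    print(row)
```

Each subject now appears on two rows, one per year, which is the long layout described above.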

32
New cards

Data transformation

the process of changing the data's format, structure, or values

33
New cards

Data transformation usually involves

- Adding, copying, or replicating data

- Deleting fields or records

- Standardizing the names of variables

- Renaming, moving, or combining columns in a database

- Joining one set of data with another

- Saving a file in a different format

34
New cards

Goals for data transformation

Data organization

Data compatibility

Data migration

Data merging

Data enhancement

Data comparison

35
New cards

Wide data is preferred when

- Creating tables and charts with a few variables about each subject

- Comparing straightforward line graphs

36
New cards

Long data is preferred when

- Storing a lot of variables about each subject

- Performing advanced statistical analysis or graphing

37
New cards

Bias

A preference in favor of or against a person, group of people, or thing

38
New cards

Data bias

A type of error that systematically skews results in a certain direction

39
New cards

Sampling bias

When a sample isn't representative of the population as a whole

40
New cards

Unbiased sampling

When a sample is representative of the population being measured

41
New cards

Observer bias (experimenter bias, research bias)

The tendency for different people to observe things differently

42
New cards

Interpretation bias

The tendency to always interpret ambiguous situations in a positive or negative way

43
New cards

Confirmation bias

The tendency to search for or interpret information in a way that confirms pre-existing beliefs

44
New cards

Is a data source 'good'? - ROCCC

Reliable

Original

Comprehensive

Current

Cited

45
New cards

Every good solution is found by ________

Avoiding bad data

46
New cards

Ethics

Well-founded standards of right and wrong that prescribe what humans ought to do, usually in terms of rights, obligations, benefits to society, fairness, or specific values

47
New cards

Data ethics

Well-founded standards of right and wrong that dictate how data is collected, shared, and used

48
New cards

General Data Protection Regulation of the European Union (GDPR)

49
New cards

Aspects of data ethics / Data Ethics Concerns

- Ownership

- Transaction transparency

- Consent

- Currency

- Privacy

- Openness

50
New cards

Consent

An individual's right to know explicit details about how and why their data will be used before agreeing to provide it

51
New cards

Privacy

Preserving a data subject's information and activity any time a data transaction occurs

52
New cards

Openness

Free access, usage, and sharing of data

53
New cards

Personally identifiable information (PII)

information that can be used by itself or with other data to track down a person's identity

54
New cards

Data anonymization

the process of protecting people's private or sensitive data by eliminating that kind of information

typically involves blanking, hashing, or masking personal information, often by using fixed-length codes to represent data columns, or hiding data with altered values
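
The three techniques named above (blanking, hashing, masking) can be sketched with Python's standard library. The record, the salt, and the helper names `hash_value` and `mask_email` are all hypothetical, and a real project would follow its own privacy requirements.

```python
import hashlib


def hash_value(value: str, salt: str) -> str:
    """Replace a value with a fixed-length SHA-256 hex digest (64 characters)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def mask_email(email: str) -> str:
    """Keep the first character and the domain; mask the rest of the local part.

    Assumes the address has a non-empty part before the "@".
    """
    local, _, domain = email.partition("@")
    return local[0] + "*" * (len(local) - 1) + "@" + domain


# Made-up record containing personally identifiable information.
record = {"name": "Jane Doe", "email": "jane.doe@example.com", "purchase_total": 42.5}

anonymized = {
    "name": "",                                                       # blanking
    "email": mask_email(record["email"]),                             # masking
    "customer_code": hash_value(record["email"], salt="demo-salt"),  # hashing
    "purchase_total": record["purchase_total"],                       # kept as-is
}
print(anonymized)
```

The hashed code has the same length for every input, which matches the fixed-length codes mentioned on the card.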

55
New cards

De-identification

a process used to wipe data clean of all personally identifying information

56
New cards

Openness (Open data)

Free access, usage, and sharing of data

57
New cards

Data interoperability

The ability of data systems and services to openly connect and share data

58
New cards

For data to be open, it must:

- Be available and accessible to the public as a complete dataset

- Be provided under terms that allow it to be reused and redistributed

- Allow universal participation so that anyone can use, reuse, and redistribute the data

59
New cards

Primary key

An identifier that references a column in which each value is unique

- used to ensure data in a specific column is unique

- uniquely identifies a record in a relational database table

- only one primary key is allowed in a table

- cannot contain null or blank values

60
New cards

Foreign key

A field within a table that is a primary key in another table

- a column or group of columns in a relational database table that provides a link between the data in two tables

- refers to the field in a table that's the primary key of another table

- more than one foreign key is allowed to exist in a table
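
A small sketch of how a primary key and a foreign key link two tables, using Python's built-in `sqlite3` module with an in-memory database. The table and column names are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE Customers (
        customer_id INTEGER PRIMARY KEY,  -- primary key: uniquely identifies each row
        name        TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE Orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES Customers(customer_id),  -- foreign key
        total       REAL
    )
""")
conn.execute("INSERT INTO Customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO Orders VALUES (100, 1, 25.0)")

# An order that points at a customer that does not exist is rejected.
try:
    conn.execute("INSERT INTO Orders VALUES (101, 999, 10.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# The foreign key lets the two tables be joined on what they have in common.
rows = conn.execute("""
    SELECT Customers.name, Orders.total
    FROM Orders
    JOIN Customers ON Orders.customer_id = Customers.customer_id
""").fetchall()
print(rejected, rows)
```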

61
New cards

relational database

a database that contains a series of tables that can be connected to form relationships

- allows data analysts to organize and link data based on what the data has in common

62
New cards

normalization

a process of organizing data in a relational database

- applied to eliminate data redundancy, increase data integrity, and reduce complexity in a database

ex: creating tables and establishing relationships between those tables
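
To illustrate the idea of reducing redundancy, here is a rough sketch that splits one flat, repetitive set of records into two related tables. The data and field names are invented, and real normalization is usually done in the database itself rather than in Python.

```python
# Flat, redundant records: the customer's city repeats on every order (made-up data).
flat_orders = [
    {"order_id": 1, "customer": "Ada", "city": "London", "total": 25.0},
    {"order_id": 2, "customer": "Ada", "city": "London", "total": 40.0},
    {"order_id": 3, "customer": "Lin", "city": "Taipei", "total": 15.0},
]

customers = {}  # customer name -> row of the customers table
orders = []     # orders table, linked to customers through customer_id

for row in flat_orders:
    # Store each customer's details once, the first time the customer appears.
    if row["customer"] not in customers:
        customers[row["customer"]] = {
            "customer_id": len(customers) + 1,
            "name": row["customer"],
            "city": row["city"],
        }
    # Each order keeps only a reference to the customer, not a copy of the details.
    orders.append({
        "order_id": row["order_id"],
        "customer_id": customers[row["customer"]]["customer_id"],
        "total": row["total"],
    })

customer_table = list(customers.values())
print(customer_table)
print(orders)
```

After the split, a change to a customer's city only has to be made in one place.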

63
New cards

composite key

a primary key constructed using multiple columns of a table
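
As a sketch, a composite key can be declared in SQLite by listing more than one column in the `PRIMARY KEY` clause. The `Enrollments` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Enrollments (
        student_id INTEGER NOT NULL,
        course_id  INTEGER NOT NULL,
        grade      TEXT,
        PRIMARY KEY (student_id, course_id)  -- the pair of columns must be unique
    )
""")
conn.execute("INSERT INTO Enrollments VALUES (1, 10, 'A')")
# Same student, different course: the pair is new, so this row is allowed.
conn.execute("INSERT INTO Enrollments VALUES (1, 20, 'B')")

# Repeating an existing (student_id, course_id) pair violates the composite key.
try:
    conn.execute("INSERT INTO Enrollments VALUES (1, 10, 'C')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM Enrollments").fetchone()[0]
print(duplicate_rejected, row_count)
```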

64
New cards

metadata

Data about data

- stored in a single, central location

- gives the company standardized information about all of its data

- used in database management to help data analysts interpret the contents of the data within the database

- three common types: descriptive, structural, administrative

65
New cards

Descriptive metadata

Metadata that describes a piece of data and can be used to identify it at a later point in time

66
New cards

Structural metadata

Metadata that indicates how a piece of data is organized and whether it is part of one, or more than one, data collection

67
New cards

Administrative metadata

Metadata that indicates the technical source of a digital asset

68
New cards

Elements of metadata

- file or document type

- date, time, and creator

- title and description

- geolocation

- tags and categories

- who last modified it and when

- who can access or update it

69
New cards

photo metadata

filename, date, time, geolocation, type of device

70
New cards

email metadata

subject line, sender, recipient, date sent, time sent

hidden metadata: server names, IP addresses, HTML format, software details

71
New cards

spreadsheet metadata

title, author, creation date, number of pages, user comments, tab names, tables, columns

72
New cards

website metadata

tags, categories, site creator's name, web page title and description, time of creation

73
New cards

books and audiobooks metadata

title, author name, table of contents, publisher information, copyright description, index, brief description of the book's contents

narrator, recording length

74
New cards

Benefits of metadata

- Reliability: data is accurate, precise, relevant, and timely

- Consistency: data is

- organized: easily findable

- classified: follows a consistent format

- stored: efficiently stored in various data repositories

- accessed: users, applications, and systems can locate and use data

75
New cards

Metadata repositories

- Specialized databases specifically created to store and manage metadata

- Can be kept in a physical location or a virtual environment (cloud)

- Allows for quick and easy access to metadata

Helps ensure data is reliable and consistent

76
New cards

Data governance

A process to ensure the formal management of a company's data assets

77
New cards

Internal/primary data

Data that lives within a company's own systems

78
New cards

External data

Data that lives and is generated outside an organization

79
New cards

CSV

Comma-separated values

- A CSV file saves data in a table format, with commas separating the values in each row
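
A minimal sketch of reading CSV data with Python's built-in `csv` module. The text below is made up; in practice it would usually come from a file opened with `open(path, newline="")`.

```python
import csv
import io

# Hypothetical CSV text: a header line followed by one line per record.
csv_text = "name,city,total\nAda,London,25.0\nLin,Taipei,15.0\n"

# DictReader uses the header line as field names for every following row.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

for row in rows:
    print(row["name"], row["city"], row["total"])
```

Note that every value is read as text, so numbers such as `total` need converting (for example with `float`) before any calculation.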

80
New cards

IMPORTRANGE

Google Sheets function that imports a range of cells from another spreadsheet

81
New cards

IMPORTHTML

Google Sheets function that imports data from a table or list within an HTML page

82
New cards

IMPORTDATA

Google Sheets function that imports data from a URL in .csv or .tsv format

83
New cards

Sorting data

Arranging data into a meaningful order to make it easier to understand, analyze, and visualize

84
New cards

Filtering

Showing only the data that meets specific criteria while hiding the rest
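
Sorting and filtering can be sketched on a small list of records in Python. The sales figures and field names are invented for illustration.

```python
# Made-up sales records.
sales = [
    {"region": "West", "amount": 120},
    {"region": "East", "amount": 300},
    {"region": "West", "amount": 80},
]

# Sorting: arrange the rows from largest to smallest amount.
by_amount = sorted(sales, key=lambda row: row["amount"], reverse=True)

# Filtering: show only the rows that meet the criterion; the original list is unchanged.
west_only = [row for row in sales if row["region"] == "West"]

print(by_amount)
print(west_only)
```

As with a spreadsheet filter, the rows that do not match are hidden from the result rather than deleted from the source data.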

85
New cards

BigQuery Sandbox

- up to 12 projects at a time

- cannot insert new records into a database

86
New cards

Schema

A way of describing how something, such as data, is organized

Displays the column names in the dataset

87
New cards

Details

contains additional metadata, such as the creation date of the dataset

88
New cards

Preview

shows the first rows from the dataset

89
New cards

SQL

Structured Query Language

90
New cards

Naming tables (SQL)

camelCase or PascalCase capitalization

(PascalCase capitalizes the first letter of each word; camelCase leaves the first word lowercase)

**Never use spaces in names

91
New cards

*

Include all columns (entire dataset)
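
A short sketch of what `*` does in a SQL query, run through Python's built-in `sqlite3` module. The `Customers` table and its contents are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (customer_id INTEGER, name TEXT, city TEXT)")
conn.execute("INSERT INTO Customers VALUES (1, 'Ada', 'London')")

# SELECT * returns every column of the table.
all_columns = conn.execute("SELECT * FROM Customers").fetchall()

# Naming a column instead returns only that column.
one_column = conn.execute("SELECT name FROM Customers").fetchall()

print(all_columns)
print(one_column)
```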

92
New cards

API

Application programming interface

93
New cards

Data governance

A process for ensuring the formal management of a company's data assets

94
New cards

Metadata repository

A database created to store metadata

95
New cards

Normalized database

A database in which only related data is stored in each table

96
New cards

Notebook

An interactive, editable programming environment for creating data reports and showcasing data skills

97
New cards

Benefits of organizing data

- makes it easier to find and use

- helps you avoid making mistakes during your analysis

- helps to protect your data

98
New cards

Naming conventions

Consistent guidelines that describe the content, date, or version of a file in its name

Use logical and descriptive names for your files to make them easier to find and use

99
New cards

Best practices when organizing data

- Naming conventions

- Folders

- Archiving older files

- Align your naming and storage practices with your team

- Develop metadata practices

100
New cards

Data security

Protecting data from unauthorized access or corruption by adopting safety measures