Learning Objectives
1. Understanding the different types of data and data structures
2. What type of data is right for the question you're answering
3. Practical skills about how to extract, use, organize, and protect your data
- Explain how Kaggle can benefit a data analyst
- Explain how data is generated as a part of our daily activities with reference to the types of data generated
- Explain factors that should be considered when making decisions about data collection
- Explain the difference between structured and unstructured data
- Discuss the difference between data and data types
- Explain the relationship between data types, fields, and values
- Discuss wide and long data formats with references to organization and purpose
How data is collected
- interviews
- observations
- forms
- questionnaires
- cookies
- surveys
Data collection considerations
- How the data will be collected
- Choose data sources
- Decide what data to use
- How much data to collect
- Select the right data type
- Determine the time frame
First-party data
Data collected by an individual or group using their own resources
Second-party data
Data collected by a group directly from its audience and then sold
Third-party data
Data collected from outside sources who did not collect it directly
Population
All possible data values in a certain dataset
Sample
A part of a population that is representative of the population
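A minimal sketch of drawing a sample from a population, assuming a made-up population of 1,000 ages. The variable names and the fixed seed are illustrative only.

```python
import random

# Hypothetical population: the ages of every member of a 1,000-person club.
rng = random.Random(42)  # fixed seed so the draw is repeatable
population = [rng.randint(18, 80) for _ in range(1000)]

# Simple random sample: every member has the same chance of being picked,
# which is one common way to aim for a representative sample.
sample = rng.sample(population, k=100)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(f"population mean: {population_mean:.1f}")
print(f"sample mean:     {sample_mean:.1f}")
```

With a random draw like this, the sample mean is usually close to the population mean, though not identical.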
Discrete data
Data that is counted and has a limited number of values
Partial measurements aren't allowed
Continuous data
Data that is measured and can have almost any numerical value
Nominal data
A type of qualitative data that is categorized without a set order
Ordinal data
A type of qualitative data with a set order or scale
Internal data
Data that lives within a company's own systems
External data
Data that lives and is generated outside of an organization
Structured data
Data organized in a certain format such as rows and columns
- defined data types
- most often quantitative data
- easy to organize
- easy to search
- easy to analyze
- stored in relational databases & data warehouses
- contained in rows and columns
Examples: Excel, Google Sheets, SQL, customer data, phone records, transaction history
Unstructured data
Data that is not organized in any easily identifiable manner
- Varied data types
- Most often qualitative data
- Difficult to search
- Provides more freedom for analysis
- Stored in data lakes, data warehouses, and NoSQL databases
- Can't be put in rows and columns
- Examples: text messages, social media comments, phone call transcriptions, various log files, images, audio, video
Data model
A model that is used for organizing data elements and how they relate to one another
Sources of structured data
- spreadsheets
- databases that store datasets
Data modeling
the process of creating diagrams that visually represent how data is organized and structured
Conceptual data modeling
high-level view of the data structure
- how data interacts across an organization
Logical data modeling
focuses on the technical details of a database such as relationships, attributes, and entities
Physical data modeling
depicts how a database operates
Entity Relationship Diagram (ERD)
A visual way to understand the relationship between entities in the data model
Unified Modeling Language (UML) diagram
Very detailed diagram that describes the structure of a system by showing the system's entities, attributes, operations, and their relationships
Data type
A specific kind of data attribute that tells what kind of value the data is
Tells you what type of data you're working with
Data types in spreadsheets
- Number
- Text or string
- Boolean
Text or string data type
A sequence of characters and punctuation that contains textual information
Boolean data type
A data type with only two possible values, such as TRUE or FALSE
Operator
a symbol that names the operation or calculation to be performed
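A small illustration of the three data types and of an operator producing a Boolean. The values are hypothetical, and Python stands in for a spreadsheet here.

```python
# Hypothetical values, one per data type.
units_sold = 120   # number
region = "West"    # text (string)
target = 100       # number

# ">=" is a comparison operator; its result is a Boolean (True or False).
met_target = units_sold >= target

# "-" is an arithmetic operator; its result is a number.
surplus = units_sold - target

print(type(units_sold).__name__, type(region).__name__, type(met_target).__name__)
print(met_target, surplus)
```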
Wide data
Data in which every data subject has a single row with multiple columns to hold the values of various attributes of the subject
Long data
Data in which each row is one time point per subject, so each subject will have data in multiple rows
each row contains a single data point
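The difference between the two formats can be sketched by reshaping a tiny wide table into long form. The data and the helper `wide_to_long` are hypothetical and use plain Python rather than a spreadsheet or a dataframe library.

```python
# Hypothetical wide data: one row per country, one column per year.
wide_rows = [
    {"country": "A", "2019": 10, "2020": 12},
    {"country": "B", "2019": 7, "2020": 9},
]

def wide_to_long(rows, id_field, variable_name, value_name):
    """Turn every non-id column into its own row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue  # the id is repeated on each long row, not unpivoted
            long_rows.append(
                {id_field: row[id_field], variable_name: column, value_name: value}
            )
    return long_rows

long_rows = wide_to_long(wide_rows, "country", "year", "population")
for row in long_rows:
    print(row)
```

Each country had one row in the wide table and has one row per year in the long table.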
Data transformation
the process of changing the data's format, structure, or values
Data transformation usually involves
- Adding, copying, or replicating data
- Deleting fields or records
- Standardizing the names of variables
- Renaming, moving, or combining columns in a database
- Joining one set of data with another
- Saving a file in a different format
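Two of the steps above, standardizing variable names and changing a value's format, might look like this. The record, the field names, and the target date format are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical raw record with inconsistent names and a US-style date.
raw_record = {"First Name": " Ana ", "Signup Date": "03/15/2023"}

def standardize_name(name):
    """Lowercase the field name and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")

def transform(record):
    # Standardize the names of variables and trim stray whitespace in values.
    clean = {standardize_name(k): v.strip() for k, v in record.items()}
    # Change the value's format: MM/DD/YYYY becomes ISO 8601 (YYYY-MM-DD).
    parsed = datetime.strptime(clean["signup_date"], "%m/%d/%Y")
    clean["signup_date"] = parsed.date().isoformat()
    return clean

transformed = transform(raw_record)
print(transformed)
```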
Goals for data transformation
Data organization
Data compatibility
Data migration
Data merging
Data enhancement
Data comparison
Wide data is preferred when
- Creating tables and charts with a few variables about each subject
- Comparing straightforward line graphs
Long data is preferred when
- Storing a lot of variables about each subject
- Performing advanced statistical analysis or graphing
Bias
A preference in favor of or against a person, group of people, or thing
Data bias
A type of error that systematically skews results in a certain direction
Sampling bias
When a sample isn't representative of the population as a whole
Unbiased sampling
When a sample is representative of the population being measured
Observer bias (experimenter bias, research bias)
The tendency for different people to observe things differently
Interpretation bias
The tendency to always interpret ambiguous situations in a positive or negative way
Confirmation bias
The tendency to search for or interpret information in a way that confirms pre-existing beliefs
Is a data source 'good'? - ROCCC
Reliable
Original
Comprehensive
Current
Cited
Every good solution is found by ________
Avoiding bad data
Ethics
Well-founded standards of right and wrong that prescribe what humans ought to do, usually in terms of rights, obligations, benefits to society, fairness, or specific values
Data ethics
Well-founded standards of right and wrong that dictate how data is collected, shared, and used
General Data Protection Regulation of the European Union (GDPR)
Aspects of data ethics / Data Ethics Concerns
- Ownership
- Transaction transparency
- Consent
- Currency
- Privacy
- Openness
Consent
An individual's right to know explicit details about how and why their data will be used before agreeing to provide it
Privacy
Preserving a data subject's information and activity any time a data transaction occurs
Openness
Free access, usage, and sharing of data
Personally identifiable information (PII)
information that can be used by itself or with other data to track down a person's identity
Data anonymization
the process of protecting people's private or sensitive data by eliminating that kind of information
typically involves blanking, hashing, or masking personal information, often by using fixed-length codes to represent data columns, or hiding data with altered values
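A rough sketch of hashing and masking, assuming a made-up customer record. The helper names and the salt are hypothetical, and real anonymization usually needs more than this (for example, protection against guessing common values).

```python
import hashlib

def hash_value(value, salt):
    """Replace a value with a fixed-length SHA-256 hex code (64 characters)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

def mask_phone(phone):
    """Hide everything except the last four characters."""
    return "*" * (len(phone) - 4) + phone[-4:]

# Hypothetical record containing personal information.
record = {"email": "ana@example.com", "phone": "555-010-1234", "purchase_total": 42.5}

anonymized = {
    "email": hash_value(record["email"], salt="demo-salt"),  # hashing
    "phone": mask_phone(record["phone"]),                     # masking
    "purchase_total": record["purchase_total"],               # not personal, kept as is
}
print(anonymized)
```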
De-identification
a process used to wipe data clean of all personally identifying information
Openness (Open data)
Free access, usage, and sharing of data
Data interoperability
The ability of data systems and services to openly connect and share data
For data to be open, it must:
- Be available and accessible to the public as a complete dataset
- Be provided under terms that allow it to be reused and redistributed
- Allow universal participation so that anyone can use, reuse, and redistribute the data
Primary key
An identifier that references a column in which each value is unique
- used to ensure data in a specific column is unique
- uniquely identifies a record in a relational database table
- only one primary key is allowed in a table
- cannot contain null or blank values
Foreign key
A field within a table that is a primary key in another table
- a column or group of columns in a relational database table that provides a link between the data in two tables
- refers to the field in a table that's the primary key of another table
- more than one foreign key is allowed to exist in a table
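A short sketch of a primary key and a foreign key, using Python's built-in `sqlite3` module with an in-memory database. The `customers` and `orders` tables are hypothetical. SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`, and other databases behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: turn on enforcement

# customer_id is the primary key of customers.
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
# orders.customer_id is a foreign key pointing at customers.customer_id.
conn.execute(
    """
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        total REAL
    )
    """
)
conn.execute("INSERT INTO customers VALUES (1, 'Ana')")
conn.execute("INSERT INTO orders VALUES (100, 1, 25.0)")

# An order for a customer that does not exist breaks the link and is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (101, 99, 10.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# The foreign key is what lets the two tables be joined.
rows = conn.execute(
    "SELECT o.order_id, c.name FROM orders o "
    "JOIN customers c ON c.customer_id = o.customer_id"
).fetchall()
print(rejected, rows)
```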
relational database
a database that contains a series of tables that can be connected to form relationships
- allow data analysts to organize and link data based on what the data has in common
normalization
a process of organizing data in a relational database
- applied to eliminate data redundancy, increase data integrity, and reduce complexity in a database
ex: creating tables and establishing relationships between those tables
composite key
a primary key constructed using multiple columns of a table
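As an illustration, a hypothetical `enrollments` table can use the pair (`student_id`, `course_id`) as its primary key. This sketch uses Python's built-in `sqlite3` module; neither column is unique on its own, but the combination is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE enrollments (
        student_id INTEGER NOT NULL,
        course_id INTEGER NOT NULL,
        grade TEXT,
        PRIMARY KEY (student_id, course_id)  -- composite key
    )
    """
)
conn.execute("INSERT INTO enrollments VALUES (1, 10, 'A')")
conn.execute("INSERT INTO enrollments VALUES (1, 20, 'B')")  # same student, new course: allowed

# Repeating the same (student_id, course_id) pair violates the key.
try:
    conn.execute("INSERT INTO enrollments VALUES (1, 10, 'C')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM enrollments").fetchone()[0]
print(duplicate_rejected, row_count)
```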
metadata
Data about data
- stored in a single, central location
- gives the company standardized information about all of its data
- used in database management to help data analysts interpret the contents of the data within the database
- three common types: descriptive, structural, administrative
Descriptive metadata
Metadata that describes a piece of data and can be used to identify it at a later point in time
Structural metadata
Metadata that indicates how a piece of data is organized and whether it is part of one, or more than one, data collection
Administrative metadata
Metadata that indicates the technical source of a digital asset
Elements of metadata
- file or document type
- date, time, and creator
- title and description
- geolocation
- tags and categories
- who last modified it and when
- who can access or update it
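A few of these elements can be read straight from the file system. The sketch below creates a throwaway file and inspects it with `os.stat`; the dictionary keys are my own labels, and details such as creator or geolocation are not available this way.

```python
import os
import tempfile
from datetime import datetime, timezone

# Create a small temporary file to inspect (written as bytes).
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as f:
    f.write(b"id,name\n1,Ana\n")
    path = f.name

info = os.stat(path)
metadata = {
    "file_type": os.path.splitext(path)[1],  # file or document type
    "size_bytes": info.st_size,
    "last_modified": datetime.fromtimestamp(  # when it was last modified
        info.st_mtime, tz=timezone.utc
    ).isoformat(),
}
os.remove(path)  # clean up the temporary file
print(metadata)
```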
photo metadata
filename, date, time, geolocation, type of device
email metadata
subject line, sender, recipient, date sent, time sent
hidden metadata: server names, IP addresses, HTML format, software details
spreadsheet metadata
title, author, creation date, number of pages, user comments, tab names, tables, columns
website metadata
tags, categories, site creator's name, web page title and description, time of creation
books and audiobooks metadata
title, author name, table of contents, publisher information, copyright description, index, brief description of the book's contents
narrator, recording length
Benefits of metadata
- Reliability
  - accurate
  - precise
  - relevant
  - timely
- Consistency
  - organized: easily findable
  - classified: follows a consistent format
  - stored: efficiently stored in various data repositories
  - accessed: users, applications, and systems can locate and use data
Metadata repositories
- Specialized databases specifically created to store and manage metadata
- Can be kept in a physical location or a virtual environment (cloud)
- Allows for quick and easy access to metadata
Helps ensure data is reliable and consistent
Data governance
A process to ensure the formal management of a company's data assets
Internal/primary data
Data that lives within a company's own systems
CSV
Comma-separated values
- a csv file saves data in a table format
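A minimal sketch of reading CSV text with Python's built-in `csv` module. The sample data is invented and read from a string instead of a file to keep the example self-contained.

```python
import csv
import io

# Hypothetical CSV content: the first line holds the column names.
csv_text = "name,city,age\nAna,Lisbon,34\nBen,Oslo,29\n"

# DictReader maps each row to the header names.
rows = list(csv.DictReader(io.StringIO(csv_text)))

for row in rows:
    print(row["name"], row["city"], row["age"])
```

Every value comes back as text, so a number such as `age` needs converting (for example with `int`) before calculations.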
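IMPORTRANGE
Google Sheets function that imports a range of cells from another spreadsheet
IMPORTHTML
Google Sheets function that imports data from a table or list within an HTML page
IMPORTDATA
Google Sheets function that imports data from a URL in .csv or .tsv format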
Sorting data
Arranging data into a meaningful order to make it easier to understand, analyze, and visualize
Filtering
Showing only the data that meets specific criteria while hiding the rest
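Sorting and filtering can be sketched on a small list of records. The `orders` data is hypothetical, and plain Python is used in place of a spreadsheet or SQL.

```python
# Hypothetical order records.
orders = [
    {"order_id": 1, "region": "West", "total": 40.0},
    {"order_id": 2, "region": "East", "total": 95.5},
    {"order_id": 3, "region": "West", "total": 12.25},
]

# Sorting: arrange every record by total, largest first.
sorted_orders = sorted(orders, key=lambda o: o["total"], reverse=True)

# Filtering: keep only the records that meet the criteria (region is West).
west_orders = [o for o in orders if o["region"] == "West"]

print([o["order_id"] for o in sorted_orders])
print([o["order_id"] for o in west_orders])
```

Sorting keeps all the rows and changes their order; filtering keeps the order and hides the rows that do not match.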
BigQuery Sandbox
- 12 projects at a time
- cannot insert new records into a database
Schema
A way of describing how something, such as data, is organized
Displays the column names in the dataset
Details
contains additional metadata, such as the creation date of the dataset
Preview
shows the first rows from the dataset
SQL
Structured Query Language
Naming tables (SQL)
camelCase or PascalCase capitalization
(camelCase capitalizes the first letter of each word except the first; PascalCase capitalizes the first letter of every word)
Never use spaces in names
*
Include all columns (entire dataset)
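To see what `*` does, here is a sketch using Python's built-in `sqlite3` module rather than BigQuery. The `customerOrders` table and its columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customerOrders (orderId INTEGER, customerName TEXT, total REAL)"
)
conn.execute("INSERT INTO customerOrders VALUES (1, 'Ana', 25.0)")

# SELECT * returns every column in the table.
cursor = conn.execute("SELECT * FROM customerOrders")
all_columns = [col[0] for col in cursor.description]
all_rows = cursor.fetchall()

# Naming a column returns only that column.
name_rows = conn.execute("SELECT customerName FROM customerOrders").fetchall()

print(all_columns, all_rows)
print(name_rows)
```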
API
Application programming interface
Metadata repository
A database created to store metadata
Normalized database
A database in which only related data is stored in each table
Notebook
An interactive, editable programming environment for creating data reports and showcasing data skills
Benefits of organizing data
- makes it easier to find and use
- helps you avoid making mistakes during your analysis
- helps to protect your data
Naming conventions
Consistent guidelines that describe the content, date, or version of a file in its name
Use logical and descriptive names for your files to make them easier to find and use
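One possible naming convention, sketched as a helper that builds file names from a project name, a short description, a date, and a version. The pattern and the function `build_filename` are assumptions for illustration, not a standard.

```python
from datetime import date

def build_filename(project, description, on, version, ext="csv"):
    """Build a name like Project_Description_YYYYMMDD_vNN.ext (no spaces)."""
    return f"{project}_{description}_{on:%Y%m%d}_v{version:02d}.{ext}"

filename = build_filename("SalesReport", "Q1", date(2023, 4, 3), 2)
print(filename)
```

Putting the date in year-month-day order means the files also sort chronologically when listed by name.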
Best practices when organizing data
- Naming conventions
- Folders
- Archiving older files
- Align your naming and storage practices with your team
- Develop metadata practices
Data security
Protecting data from unauthorized access or corruption by adopting safety measures