Data schema
used to describe both the organization of data & the relationships between tables in a given database
Database Engineers
plan the database schema before they begin to build the system
Relational Database
has multiple tables linked together with different relationships
Spreadsheet
uses tables to store the data that's being captured
Tabular Schema
uses rows and columns in a table format to store all of the data
Relational Database Mgmt. System (RDBMS)
Used with many different database software packages, such as MySQL, MariaDB, and even Amazon's serverless database system, Aurora
SQL
a programming language for data that works across all relational databases. Structured Query Language is a domain-specific language used to manage data, especially in a relational database management system. It is particularly useful for handling structured data, i.e., data incorporating relations among entities and variables
Non-relational Databases
Databases that are not based on table relationships and do not use SQL (Structured Query Language); able to handle large amounts of traffic and data; it is often faster to get info from a non-relational DB than a relational DB. Remember: SQL = relational databases; NoSQL or GraphQL = non-relational DBs
Document oriented databases
store the data inside of XML documents or in JSON format, which is the JavaScript Object Notation
Key value stores
store data as key-value pairs, where each unique key points directly to its value
Column oriented databases
store data in columns instead of rows like we do in a traditional relational database
Graph Stores
x is used to store individual elements as nodes inside of the database, with the relationships between nodes stored as edges
Relational DB
Tables are used to store fields inside of the different columns |
Each row is going to have the record that holds the data for that relational DB |
Normalization is the tool used to optimally design the tables so they hold the least amount of redundant info possible |
Uses SQL, to write data into the DB |
the amount of data that can be held in each field and the data type is limited |
Each particular field in your db is going to have a specific purpose and field type |
Non relational DB
Alternative to relational db |
Stores as much info as you want in a key-value pair, as opposed to being limited to 255 characters |
Easier to scale and build out for web based or cloud based applications |
Uses any programming language |
Stores both structured and unstructured data within those DBs because you have more flexibility than relational DB |
Data normalization
Goal of x is to establish the relationships between the different data in the forms, so that we have the data we need when we run our reports; optimizes the storage and use of data within a given DB
First normal form (1NF)
eliminates repeating groups and redundant info in individual tables; each field holds only a single value
Second normal form (2NF)
moves related info that applies to multiple rows into separate tables and links them using a foreign key
Third normal form (3NF)
eliminates fields that do not depend on a given key
Fourth normal form (4NF)
the data must be in Boyce-Codd normal form (BCNF) and have no multi-valued dependencies
Fifth Normal form (5NF)
has all the same characteristics as 4NF and ensures there are no join dependencies
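The first steps of normalization can be sketched in plain Python: start from a table that repeats customer details on every order row, then split the customer attributes into their own table linked by a foreign key. All table, field, and sample values here are made up for illustration.

```python
# Hypothetical denormalized table: customer details repeat on every order row.
denormalized = [
    {"order_id": 1, "customer": "Ada",   "city": "London", "item": "Widget"},
    {"order_id": 2, "customer": "Ada",   "city": "London", "item": "Gadget"},
    {"order_id": 3, "customer": "Grace", "city": "Boston", "item": "Widget"},
]

customers = {}       # customer name -> customer_id
customer_table = []  # one row per unique customer
order_table = []     # orders now reference customers by key only

for row in denormalized:
    name = row["customer"]
    if name not in customers:
        # First time we see this customer: give them an id and one row.
        customers[name] = len(customers) + 1
        customer_table.append(
            {"customer_id": customers[name], "customer": name, "city": row["city"]}
        )
    # The order keeps only a foreign key to the customer, not the details.
    order_table.append(
        {"order_id": row["order_id"], "customer_id": customers[name], "item": row["item"]}
    )
```

After the split, Ada's city is stored once instead of twice, which is the redundancy that normalization is meant to remove.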
Primary key-
unique identifier for a record that cannot contain duplicates, every table needs to have a primary key
Foreign key-
a primary key from one table that is referenced by a different table
DB Relationships- One to one-
one record in the table will be associated with only one record in another table
DB Relationships: One to many
one record in the table with a primary key is associated with multiple records in other tables
Referential integrity
Remember: make sure that all of the changes we make to any part of our db cascade throughout that db, whether through an update or a deletion |
Used to establish and maintain record relationships in tables |
Ensures and guarantees that a primary key being used as a foreign key in a table actually exists in the first table before records are created in the second table |
Prevents the occurrence of bad or missing data in any of the tables |
Referential integrity comes under attack when people start modifying or deleting parts of the db |
Cascade delete/update
makes sure that no data is left orphaned as you make updates or deletions to the primary key and the records it controls
Data denormalization
Occurs when the data is not structured in tables using normalization |
Will have many redundancies and repetitive data |
Happens when dealing with big data, data warehousing, data mining, data analysis and data visualization |
Data systems
any info tech system that captures, processes, stores, queries or reports on the data contained within them
Data processing types
It's important to determine whether you should use OLTP or OLAP to prevent performance issues
DB transaction
an insertion, deletion, or simply query of the db
Online Transactional Processing (OLTP)
systems designed to handle very large volumes of transactions; built for large numbers of real-time DB transactions
Online Analytical Processing (OLAP)
systems designed to handle longer-running queries for more complex data processing; built for big queries that might involve many joins and large data sets
Data Warehouse
Ingests data from the source systems of record and combines them together; considered the single source of truth; combines all data into one place to be accessed efficiently; an example of an OLAP system, built to handle really big, heavy queries
Source System
system of record for any particular kind of data
Clickstream data
data where individual users are clicking on different pages of the website
Purchase data
involves people who used credit cards to buy something that then had to be fulfilled and shipped to them
Catalog information
another source system that contains the categorization and the descriptions about individual items
OLAP Cubes
3D structure that provides data grouped together by different dimensions; optimizes the expected queries from each data mart's customers
Data warehouse schemas
Fact Table- contains all of the main keys associated with the queries |
Dimension table- info associated with a fact table that is tied to it using a shared key |
Star schema- individual dimension tables branching off from the fact table so that it looks like a star; only one layer of dimensions |
Snowflake schema- multiple dimension tables associated with each dimension, branching out like a snowflake |
Data Lakes
Centralized repository that can hold both structured and unstructured data, does not require data to be structured, but giving it structure makes it easier to query
Data Lakehouse
queries data in place on the data lake; to run those queries it needs structured data, with schema data built up off to the side; you get the flexibility and cost effectiveness of a data lake but can still conduct queries across the entire data set; a lakehouse or a true data warehouse can be used to analyse large amounts of data
Dimensional table-
contains metadata about the things in your fact table; how do you manage changes to that dimensional data, and how do you retain a history of what it used to be?
Type 1 dimensional table
slowly changing dimension where new info simply overwrites the old info; you can no longer query for previous names that existed in the past
Type 2 dimensional table
slowly changing dimension; retains a complete history of the info and of all previous data changes
Type 3 dimensional table
approach is to maintain the current and previous data
Discrete data
data that can be counted with a certain # of values
Continuous data
data that is measured rather than counted and can take any value. Height, weight, temperature and length are all examples of continuous data. Some continuous data will change over time; the weight of a baby in its first year or the temperature in a room throughout the day.
Quantitative
data that is defined through #s
Qualitative
data arranged into groups or categories based on some kind of quality
Nominal data
Nominal data is data that can be labelled or classified into mutually exclusive categories within a variable. These categories cannot be ordered in a meaningful way. For example, for the nominal variable of preferred mode of transportation, you may have the categories of car, bus, train, tram or bicycle.
Ordinal data
type of data that has a natural ordering or ranking. It is categorical data that can be ranked or ordered in accordance with a specific attribute or characteristic. Examples of ordinal data are the level of education, the range of income, or the grades.
Data fields
contains the different pieces of info in different databases, data types are a thing we can control
Texts and alphanumeric field data types-
the most common type, and you'll hear this called either a character, a text or a string
Character type
a single character, either a letter or a # that's being stored in a field
Text or string type
grouping of characters that contains letters or #s in this alphanumeric data field type; can be uppercase or lowercase; you cannot do mathematical operations on a # stored in a string
Date type
stores a calendar date in a month/day/year or day/month/year format; can also store time
Number data type
will not allow any text, as it only stores numbers (as you might expect). Use of the number data type allows for calculations. It is very important to use the number data type if you need to do any basic arithmetic in your reporting.
Currency data type
special type of number that represents money; allows for two decimal places
Boolean data type
used for things that have only two values, either a yes or a no, a true or a false, on off, one or zero
Structured Data
follows an existing convention, fits into tables and dbs nicely, specific format in a specific data field type for each particular data
Unstructured Data
not organized in a pre-defined manner that meets standards for structured data, can be text, images or video
Semi-structured data
mix of both; a webpage is a great example, as are XML files, zip files, emails and JSON files
Delimited file
fields in which some form of character is used to separate each field of data from the other data fields; the most common type is the CSV, or comma-separated values file
Tab delimited file-
uses a tab to separate each of the fields in the file; other characters, like a pipe |, can also be used to separate fields; extensions include .CSV, .TAB, .TSV, .TXT
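The cards above can be sketched with Python's csv module, which parses both comma-delimited and tab-delimited text; the delimiter is just a parameter. The sample records are invented.

```python
import csv
import io

# The same two records, once comma-delimited and once tab-delimited.
csv_text = "name,age\nAda,36\nGrace,45\n"
tsv_text = "name\tage\nAda\t36\nGrace\t45\n"

# csv.reader splits each line on the delimiter into a list of strings.
csv_rows = list(csv.reader(io.StringIO(csv_text)))
tsv_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
```

Both readers produce identical rows, which is why a pipe-delimited file only needs `delimiter="|"` rather than a different parser.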
Flat file
any delimited file that is exported out of a db system and can then be sent to someone else; because it has been exported from the db, it is not real-time data but rather backup data or a point-in-time snapshot of the db
SQL
x is most commonly used when working with a db and its data; uses a series of statements to provide instructions to the db
Select statement-
how you query info from the db and select fields
Where keyword-
allows you to select something where a certain condition happens
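The SELECT and WHERE cards can be demonstrated against a throwaway SQLite table; the employee data here is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [("Ada", "IT"), ("Grace", "IT"), ("Alan", "HR")])

# SELECT picks which fields come back; WHERE keeps only the rows
# matching the condition.
names = [row[0] for row in conn.execute(
    "SELECT name FROM employees WHERE dept = 'IT' ORDER BY name")]
```

Dropping the WHERE clause would return all three names; the condition is what narrows the result to the IT department.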
HTML
hypertext markup language; used to write webpages in a semi-structured environment; uses tags to dictate which parts of the info are displayed
XML
extensible markup language; a text-based markup language like HTML but with a different purpose; interacts really well with JavaScript; its goal is to transfer data, not display it on screen
JavaScript Object Notation (JSON)
language used to get data to and from different websites; X is the de facto file format when sending info; uses arrays, which makes it easier to use in modern networks
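A quick JSON round-trip with Python's json module, showing the array syntax the card mentions; the payload contents are invented.

```python
import json

# JSON text with an object holding a string and an array of numbers.
payload = '{"user": "ada", "scores": [10, 20, 30]}'

record = json.loads(payload)   # JSON text -> Python dict
total = sum(record["scores"])  # the JSON array arrives as a Python list
text = json.dumps(record)      # dict -> JSON text, ready to send back
```

Because arrays map straight onto native lists, lists of records can be serialized and parsed without any custom format handling.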
Data System
any info tech system that captures, processes, stores, queries or reports on the data contained within it
Extract Transform Load (ETL)
process that occurs when moving data from a source system to a data warehouse by extracting data from the sources, transforming the data and then loading it to the data warehouse
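The extract, transform, and load steps can be sketched as three tiny Python functions; the source rows, field names, and cleanup rules are all illustrative, not a real pipeline.

```python
# Toy "source system": amounts are strings and names are messy.
source_rows = [
    {"name": " Ada ", "amount": "9.50"},
    {"name": "grace", "amount": "20.00"},
]

def extract():
    # In real ETL this would read from a source system (DB, API, file).
    return list(source_rows)

def transform(rows):
    # Clean up names and convert amounts from strings to numbers.
    return [{"name": r["name"].strip().title(), "amount": float(r["amount"])}
            for r in rows]

warehouse = []  # stand-in for the data warehouse

def load(rows):
    warehouse.extend(rows)

load(transform(extract()))
```

Swapping the order of the last line's calls, loading raw rows first and transforming inside the destination, is essentially the ELT variant described on the next card.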
Extract Load and Transform (ELT)
a modern method used when preparing data for data lakes by holding data in preparation for future transformation
Extracting data
Process of extracting source data and importing it into the system; the objective of extraction is to connect to the data source; SQL, Power BI and Power Query are tools for extracting data from external databases
Transforming data
Process of transforming data to another table format, timestamps are very useful for end users
API-
connection between computers or other programs; designed to present a set of questions and define answers in the system. Application Programming Interface: "application" refers to any software with a distinct function, and the interface can be thought of as a contract of service between two applications, defining how the two communicate with each other using requests and responses
Pull model-
continuously pulls data into the system
Push model
only sends notifications when data changes
Web service
communication between or among electronic devices; JSON and XML both encode structured data; has a specific function to provide different kinds of info
Synchronous
requests from a web service and waits for the response
Asynchronous
allows you to do other tasks while waiting for the response
Web/Data/Screen Scraping
act of extracting data from a website
Machine data
data generated by webservers, machine data can be used as a predictive maintenance tool
Observation
the act of collecting data by observing and then analyzing afterwards, observation data should not be manipulated
Sampling
creating a smaller data set from a larger data set, random sampling, systematic sampling, stratified sampling
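The three sampling methods on this card can be sketched with the standard library's random module over a made-up population of 100 records.

```python
import random

population = list(range(1, 101))  # 100 records, purely illustrative

# Random sampling: every record has an equal chance of selection.
random.seed(0)  # seeded only so the sketch is repeatable
random_sample = random.sample(population, 10)

# Systematic sampling: take every k-th record (here k = 10).
systematic_sample = population[::10]

# Stratified sampling: split into groups (strata), then sample each
# group separately so both are represented.
evens = [n for n in population if n % 2 == 0]
odds = [n for n in population if n % 2 == 1]
stratified_sample = random.sample(evens, 5) + random.sample(odds, 5)
```

All three produce a smaller data set of 10 records; they differ only in how the 10 are chosen from the 100.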
Data Profiling
the process of working with data to begin to discern information and trends present in that data
Steps of Data Profiling
1. ID and document data source |
2. ID the field names and data types |
3. Determine the fields to be ID'd for reporting |
4. Check for the primary, natural or foreign keys |
5. Recognize all the data in the dataset |
Redundant data
ID data stored in multiple places, determine the redundant data and work on how to minimize it
Duplicated data
data repeated within the same dataset, to find duplicated data, use the built-in tools in a data analytics software
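Outside of the built-in tools in analytics software, duplicated rows can also be found with a few lines of Python; the sample rows are invented.

```python
from collections import Counter

# Three rows, one of which is an exact repeat.
rows = [("Ada", "London"), ("Grace", "Boston"), ("Ada", "London")]

counts = Counter(rows)  # how many times each row appears
duplicates = [row for row, n in counts.items() if n > 1]

# Deduplicate while keeping the first occurrence and the original order.
deduplicated = list(dict.fromkeys(rows))
```

The same Counter approach works on any hashable row representation, such as tuples built from the fields that define a "duplicate" in your dataset.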
Unnecessary data
Its important to understand what data you need and what data you can ignore, extra data slows down your system Tools: Excel, Power BI, Tableau
Missing values
Missing data is referred to as Null, represented as blank fields, "Null" or "n/a". A value is Null when it is not applicable to the field, when the dataset doesn't have the information, when the datasets do not match the expected information, or when the survey data is incomplete. What can you do about it? Filter out the NULL values, or replace the missing values
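Both remedies from this card, filtering out nulls and replacing them, can be sketched in a few lines; `None` plays the role of NULL and `"n/a"` stands in for a placeholder string from an export.

```python
# A column of ages with two kinds of missing value.
ages = [34, None, 29, "n/a", 41]

def is_missing(value):
    # Treat both the real null and the "n/a" placeholder as missing.
    return value is None or value == "n/a"

filtered = [v for v in ages if not is_missing(v)]     # drop missing values
replaced = [0 if is_missing(v) else v for v in ages]  # or replace them
```

Which remedy is right depends on the report: filtering shrinks the dataset, while replacing keeps the row count but injects a substitute value (here 0, chosen arbitrarily).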
Invalid Data
Invalid data = incorrect data. Different reasons data can be invalid: hard-coded data, invalid data questions, extreme values, incorrect data, and invisible characters (ASCII codes inside a computer system that are invisible or non-printable). Look for leading and/or trailing spaces; remove or replace invalid data
Meeting specification
Specifications: certain types or quality standards set by DB engineers when designing systems; the most common reason data doesn't meet specifications is the wrong data type, followed by improper storage of numeric characters
Data Outliers
Any piece of data that is outside the normal distance from the other values in a sample. Nonparametric statistics: identifies data not assumed to come from a prescribed model predetermined by a small number of parameters
Parametric
normal baseline
Nonparametric
distribution independent
SCD type 3 relies on creating _____ to reflect historical data
new columns
Schema that will improve performance and allow for faster queries:
star schema
Common raster graphic format
GIF