Entity Relationship Diagram (ERD)
A diagram that depicts an entity relationship model's entities, attributes, and relationships.
Normalisation
The process of minimising redundancy in a relation or set of relations. Redundancy in a relation may cause insertion, deletion and update anomalies.
Relation
A table in a relational database.
What are the three stages in normalisation?
First Normal Form (1NF), Second Normal Form (2NF) and Third Normal Form (3NF). The stages are sequential: a relation cannot reach 2NF without first being in 1NF.
First Normal Form (1NF)
A table is in first normal form if it contains no repeating attributes or groups of attributes. All attributes must be atomic - a single attribute cannot consist of two data items such as firstname and surname. This would make it difficult or impossible to sort on surname.
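A minimal SQL sketch of the idea (table and column names are hypothetical, not from the source): the first design breaks 1NF with a non-atomic name and a repeating phone group; the second makes every attribute atomic.

    -- Hypothetical unnormalised design: 'name' holds two data items and
    -- phone1/phone2 form a repeating group.
    CREATE TABLE student_unf (
        student_id INT PRIMARY KEY,
        name       VARCHAR(100),  -- e.g. 'Ada Lovelace': cannot sort on surname
        phone1     VARCHAR(20),
        phone2     VARCHAR(20)
    );

    -- 1NF: atomic attributes; the repeating group moves to its own relation.
    CREATE TABLE student (
        student_id INT PRIMARY KEY,
        firstname  VARCHAR(50),
        surname    VARCHAR(50)
    );

    CREATE TABLE student_phone (
        student_id INT REFERENCES student(student_id),
        phone      VARCHAR(20),
        PRIMARY KEY (student_id, phone)
    );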
Second Normal Form (2NF)
A table is in second normal form (2NF) if it is in first normal form and contains no partial dependencies. It is a table that is in First Normal Form with every non-primary-key attribute fully functionally dependent on the primary key. A relation with a single-attribute (column) primary key is automatically in at least 2NF.
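A hedged SQL sketch of removing a partial dependency (hypothetical names): course_title depends only on course_id, one part of the composite key, so it is moved to its own relation.

    -- Not in 2NF: course_title is partially dependent on the key
    -- (student_id, course_id) because it depends on course_id alone.
    CREATE TABLE enrolment_1nf (
        student_id   INT,
        course_id    INT,
        course_title VARCHAR(100),
        grade        CHAR(2),
        PRIMARY KEY (student_id, course_id)
    );

    -- 2NF decomposition: every non-key attribute now depends on the whole key.
    CREATE TABLE course (
        course_id    INT PRIMARY KEY,
        course_title VARCHAR(100)
    );

    CREATE TABLE enrolment (
        student_id INT,
        course_id  INT REFERENCES course(course_id),
        grade      CHAR(2),
        PRIMARY KEY (student_id, course_id)
    );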
Third Normal Form (3NF)
A relation that is in Second Normal Form and in which no non-primary-key attribute is transitively dependent on the primary key.
Transitive Dependency
A transitive dependency occurs when one attribute depends on a second attribute, which depends on a third attribute. Deletions in a table with such a dependency can cause unwanted information loss.
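A short SQL illustration (hypothetical names): dept_name depends on dept_id, which in turn depends on emp_id, so 3NF moves it into its own relation; deleting the last employee of a department then no longer loses the department's name.

    -- Transitive dependency: emp_id -> dept_id -> dept_name.
    CREATE TABLE employee_2nf (
        emp_id    INT PRIMARY KEY,
        emp_name  VARCHAR(100),
        dept_id   INT,
        dept_name VARCHAR(100)
    );

    -- 3NF decomposition removes the transitive dependency.
    CREATE TABLE department (
        dept_id   INT PRIMARY KEY,
        dept_name VARCHAR(100)
    );

    CREATE TABLE employee (
        emp_id   INT PRIMARY KEY,
        emp_name VARCHAR(100),
        dept_id  INT REFERENCES department(dept_id)
    );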
Structured Query Language (SQL)
A declarative language used for querying and updating tables in a relational database.
SELECT
The SELECT statement is used to extract fields from one or more tables.
Syntax of SELECT
SELECT fields FROM table WHERE condition ORDER BY field
ORDER BY
The results of an SQL query can be sorted in ascending or descending sequence. ASC: ascending; DESC: descending. If neither is specified, ascending order is assumed.
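For example, against a hypothetical customer table:

    -- Return London customers, newest first; drop DESC (or write ASC)
    -- for ascending order, which is the default.
    SELECT firstname, surname, joined_date
    FROM customer
    WHERE city = 'London'
    ORDER BY joined_date DESC;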
Wild Card
The asterisk (*) selects all fields in a table.
Other SQL Statements
BETWEEN: selects values within an inclusive range. IN: specifies multiple possible values for a column. LIKE: searches for values with matching characteristics (patterns). AND/OR/NOT: combine or negate conditions.
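For example, on a hypothetical product table:

    -- The * wildcard selects every field.
    SELECT * FROM product;

    -- BETWEEN (inclusive range), IN (set of values) and LIKE (pattern),
    -- combined with AND / OR / NOT.
    SELECT *
    FROM product
    WHERE price BETWEEN 10 AND 50
      AND category IN ('Books', 'Games')
      AND NOT (name LIKE 'Used%');  -- % matches any sequence of characters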
Semicolon
Some database systems require a semicolon at the end of each SQL statement.
What is the role of database administrators?
Responsible for accurately and securely using, maintaining and developing computerised databases within a wide range of public and private sector organisations.
Why are database recovery techniques important?
Databases are prone to failures due to inconsistency, network failure, errors or accidental damage, so database recovery techniques are highly important for bringing a database back into a working state after a failure.
What are four different database recovery techniques?
Mirroring; Recovery using backups; Recovery using transaction logs; Shadow paging
Mirroring
Two complete copies of the database are maintained online on different stable storage devices.
When is mirroring used?
Mostly used in environments that require non-stop fault-tolerant operations.
What are the two types of backups?
Immediate backup; Archival backup
Immediate Backup
Kept on a floppy disk, hard disk or magnetic tape. These come in handy when a technical fault occurs in the primary database, such as a system failure, disk crash or network failure. Damage due to virus attacks is repaired using the immediate backup.
Archival Backup
Kept on mass storage devices such as magnetic tape, CD-ROMs, Internet servers, etc. They are very useful for recovering data after a disaster such as a fire, earthquake or flood.
Where should archival backups be kept and why?
Archival backups should be kept at a different site from where the system is functioning. An archival backup kept at a separate place remains safe from theft and intentional destruction by users or staff.
Recovering data using transaction logs
Step 1: The log is searched for all transactions that have recorded a [start_transaction, T] entry but no corresponding [commit, T] entry. Step 2: These transactions are rolled back, undoing their effects on the database. Step 3: Changes made by transactions that have recorded a [commit, T] entry in the log are redone, so their effects are restored in the database.
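A sketch of how an explicit SQL transaction lines up with these log entries (table names hypothetical; the exact log format is DBMS-specific):

    BEGIN TRANSACTION;  -- the DBMS logs [start_transaction, T] here

    UPDATE account SET balance = balance - 100 WHERE account_id = 1;
    UPDATE account SET balance = balance + 100 WHERE account_id = 2;

    COMMIT;  -- [commit, T] is logged; after a crash this transaction is redone
    -- Had the system failed before COMMIT, recovery would find no
    -- [commit, T] entry and roll both UPDATEs back.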
Recovering data using shadow paging
In shadow paging, the database is divided into a number of fixed-size disk pages, say n, and a 'current directory' with n entries is created, each entry pointing to one disk page of the database. The current directory is transferred to main memory. When a transaction begins executing, the current directory is copied into a 'shadow directory', which is then saved on disk. During transaction execution, all modifications are made via the current directory; the shadow directory is never modified.
Integrated database
The term integrated database is used to describe two different database structures: connection of multiple databases, or a database built into another application tool.
Connecting multiple databases into an integrated database
Connections between the different databases must be made and tested from a variety of perspectives to ensure that the logic is sound. The different databases can also exist in different formats or programs. This adds to the complexity but may be achieved using advanced programming logic and powerful resources.
Databases in stock control
The database holds details and quantities of the products you stock, together with information on product suppliers.
Using databases in real life
Stock control; Police records; Health records; Employee data
Data Discovery
Involves discovering which data sets exist in the organisation, which of them are business-critical, and which contain sensitive data that might be subject to compliance regulations.
Data Loss Prevention (DLP)
A set of strategies and tools that can be used to prevent data from being stolen, lost, or accidentally deleted.
Snapshots
Preserves the entire state and data of the system at the time it is taken. Similar to a backup, but is a complete image of a protected system, including data and system files. Can be used to restore an entire system to a specific point in time.
Replication
A technique for copying data on an ongoing basis from a protected system to another location. This provides a living up-to-date copy of the data, allowing not only recovery but also immediate failover to the copy if the primary system goes down.
Firewall
Utilities that enable you to monitor and filter network traffic. Can be used to ensure that only authorised users are allowed to access or transfer data.
Authentication and Authorisation
Controls that help verify credentials and assure that user privileges are applied correctly. These measures are typically used as part of an identity and access management (IAM) solution and in combination with role-based access controls (RBAC).
Encryption
Process of converting readable data into unreadable characters to prevent unauthorised access.
Endpoint Protection
Protects gateways to your network, including ports, routers, and connected devices. Endpoint protection software typically enables you to monitor your network perimeter and filter traffic as needed.
Data Erasure
Limits liability by deleting data that is no longer needed. This can be done after data is processed and analysed or periodically when data is no longer relevant.
Disaster Recovery
A set of practices and technologies that determine how an organisation deals with a disaster, such as a cyber attack, natural disaster, or large-scale equipment failure. Typically involves setting up a remote disaster recovery site with copies of protected systems, and switching operations to those systems in case of disaster.
Data Protection Act 2018
Controls how your personal information is used by organisations, businesses or the government. It is the UK's implementation of the General Data Protection Regulation (GDPR).
The Data Protection Act 2018 states that information must be...
Used fairly, lawfully and transparently; used for specified, explicit purposes; used in a way that is adequate, relevant and limited to only what is necessary; accurate and, where necessary, kept up to date; kept for no longer than necessary; handled in a way that ensures appropriate security, including protection against unlawful or unauthorised processing, access, loss, destruction or damage.
The Computer Misuse Act (1990) makes the following illegal:
Unauthorised access to computer material, e.g. hacking; unauthorised access to computer materials with intent to commit a further crime, e.g. spreading viruses; unauthorised modification of data, e.g. electronic vandalism; making, supplying or obtaining anything which can be used in computer misuse offences.
Data Matching
The process of comparing two different sets of data and matching them against each other.
What is the purpose of data matching?
The purpose of the process is to find the data that refers to the same entity. Often the data comes from two or more different data sets that have no common identifiers, but data matching is also useful for detecting duplicate data within a single database.
What are the two ways of linking data?
Deterministic record linkage, based on several matching identifiers. Probabilistic record linkage, based on the probability of several identifiers matching.
What is the most common way of linking data?
The most common is probabilistic record linkage, as deterministic linking tends to be too inflexible.
The process of probabilistic data matching
Standardise the data; pick attributes unlikely to change; sort the data into blocks; match via probabilities; assign a value to matches; summarise to get a total weight.
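A simplified SQL sketch of the blocking-and-weighting idea over two hypothetical person tables; the attribute weights and acceptance threshold are illustrative, not standard values:

    SELECT *
    FROM (
        SELECT a.person_id AS id_a,
               b.person_id AS id_b,
               CASE WHEN a.surname    = b.surname    THEN 4 ELSE 0 END
             + CASE WHEN a.birth_date = b.birth_date THEN 5 ELSE 0 END
             + CASE WHEN a.firstname  = b.firstname  THEN 2 ELSE 0 END
               AS match_weight
        FROM source_a a
        JOIN source_b b ON a.postcode = b.postcode  -- blocking step
    ) scored
    WHERE match_weight >= 7;  -- summarised weight above the match threshold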
Data Mining
The process of analysing data to extract information not offered by the raw data alone. By using software to look for patterns in large batches of data, businesses can learn more about their customers to develop more effective marketing strategies, increase sales and decrease costs.
What does data mining depend on?
Effective data collection, warehousing, and computer processing.
What is an ethical concern of data mining?
Users are often unaware of the data mining happening with their personal information, especially when it is used to influence preferences.
Object-oriented Database
A database that operates by storing data in objects rather than in tables.
Network Database
A type of database model wherein multiple member records or files can be linked to multiple owner files and vice versa.
Spatial Database
A database that is enhanced to store and access spatial data, or data that defines a geometric space. These data are often associated with geographic locations and features, or constructed features like cities. Data in spatial databases are stored as coordinates, points, lines, polygons and topology.
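For example, assuming a PostGIS-style spatial extension and hypothetical tables:

    -- Parks whose point geometry falls inside the city's boundary polygon.
    SELECT p.name
    FROM park p
    JOIN city c ON c.name = 'Cardiff'
    WHERE ST_Contains(c.boundary, p.location);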
Multi-dimensional Database
Created from multiple relational databases. While relational databases allow users to access data in the form of queries, multidimensional databases allow users to ask analytical questions related to business or market trends.
Where are object-oriented databases commonly used?
Applications that require high performance, complex calculations and faster results. Some common applications that use object databases are real-time systems, architectural and engineering 3D modelling, telecommunications, and scientific products.
Advantages of object-oriented databases
Provide persistent storage for objects; faster data access and better performance than an RDBMS.
Disadvantages of object-oriented databases
Object databases are not as popular as RDBMS. It is difficult to find object DB developers. Not many programming languages support object databases. RDBMS have SQL as a standard query language. Object databases do not have a standard. Object databases are difficult to learn for non-programmers.
Data Warehouse
An enterprise system used for the analysis and reporting of structured and semi-structured data from multiple sources, such as point-of-sale transactions, marketing automation, customer relationship management, and more. A data warehouse can store both current and historical data in one place and is designed to give a long-range view of data over time, making it a primary component of business intelligence.
Advantages of a data warehouse
Consolidated data from many sources; historical data analysis; data quality, consistency and accuracy; separation of analytics processing from transactional databases, which improves the performance of both systems.
Data Lake
A storage repository that holds a vast amount of raw data in its original format until the business needs it.
Schema
An abstract design that represents the storage of data in a database.
Data Mart
A data collection, smaller than the data warehouse, that addresses the needs of a particular department or functional area of the business.
Why is data warehousing time dependent?
The data stored in the data warehouse is to support decision making - usually to make predictions and projections. The availability of new data can invalidate previous decisions. To combat this it is important that the data in a data warehouse is constantly updated to reflect the most recent data values.
How is data in a warehouse updated in real time?
Data is refreshed from data in operational systems
ETL
A process in data warehousing; it stands for Extract, Transform and Load. An ETL tool extracts the data from various data source systems, transforms it in the staging area, and then finally loads it into the data warehouse system.
Extract in ETL
Data from various source systems is extracted, and can be in various formats. It is important to extract the data into the staging area first, not directly into the data warehouse, because the extracted data is in various formats and can be corrupted. Loading it directly into the data warehouse may damage it, and rollback would be much more difficult.
Transformation in ETL
A set of rules or functions is applied to the extracted data to convert it into a single standard format. It may involve the following processes/tasks: Filtering - loading only certain attributes into the data warehouse. Cleaning - filling NULL values with default values, mapping U.S.A, United States and America to USA, etc. Joining - joining multiple attributes into one. Splitting - splitting a single attribute into multiple attributes. Sorting - sorting tuples on the basis of some attribute (generally a key attribute).
Loading in ETL
The transformed data is loaded into the data warehouse.
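A compact SQL sketch of the three ETL steps, with hypothetical source, staging and warehouse tables:

    -- Extract: copy raw rows from a source system into the staging area.
    INSERT INTO staging_customer (customer_id, full_name, country)
    SELECT id, name, country FROM source_customer;

    -- Transform: clean in the staging area (map country spellings, fill NULLs).
    UPDATE staging_customer
    SET country = 'USA'
    WHERE country IN ('U.S.A', 'United States', 'America');

    UPDATE staging_customer
    SET country = 'Unknown'
    WHERE country IS NULL;

    -- Load: move the standardised rows into the warehouse table.
    INSERT INTO dim_customer (customer_id, full_name, country)
    SELECT customer_id, full_name, country FROM staging_customer;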
Different types of patterns in data
Periodic patterns; Associative patterns; Abnormal patterns; Structural patterns; Chaotic patterns
Periodic Patterns
Patterns that are seen repeating themselves after a certain lapse of time.
Associative Patterns
These are like Bread and Butter/Knife and Fork, i.e. co-occurring groups of things that make more sense with each other. The members of this pattern are complementary to each other.
Abnormal Patterns
Occur when the data shows a clear deviation from normal behaviour: an unexpected pattern appears among expected patterns, and its appearance is not periodic.
Structural Patterns
Patterns such as pathfinding in graphs or cluster identification. An example would be that low-cost residences tend to occur in the suburbs, whereas downtown has more expensive apartments.
Chaotic Patterns
Where the patterns appear but have no definitive characteristic related to time/space/frequency.
Advantages of data mining
It helps companies gather reliable information; it is an efficient, cost-effective solution compared to other data applications; it uses both new and legacy systems; it helps businesses make informed decisions; it helps data scientists analyse enormous amounts of data quickly; the information can be used to detect fraud, build risk models and improve product safety; it helps data scientists quickly initiate automated predictions of behaviours and trends and discover hidden patterns.
Types of predictive modelling
Classification model; Clustering model; Forecast model; Outliers model; Time series model
Classification model
Considered the simplest model, it categorizes data for simple and direct query response. An example use case would be to answer the question "Is this a fraudulent transaction?"
Clustering model
This model nests data together by common attributes. It works by grouping things or people with shared characteristics or behaviors and plans strategies for each group at a larger scale. An example is in determining credit risk for a loan applicant based on what other people in the same or a similar situation did in the past.
Forecast model
Works on anything with a numerical value based on learning from historical data. For example, in answering how much lettuce a restaurant should order next week or how many calls a customer support agent should be able to handle per day or week, the system looks back to historical data.
Outliers model
This model works by analyzing abnormal or outlying data points. For example, a bank might use an outlier model to identify fraud by asking whether a transaction is outside of the customer's normal buying habits or whether an expense in a given category is normal or not.
Time series model
This model evaluates a sequence of data points based on time. For example, the number of stroke patients admitted to the hospital in the last four months is used to predict how many patients the hospital might expect to admit next week, next month or the rest of the year.
Data Segmentation
The process of taking the data you hold and dividing it up and grouping similar data together based on the chosen parameters so that you can use it more efficiently within marketing and operations.
Advantages of data segmentation
Through customer segmentation you will be able to create messaging that is tailored and sophisticated to suit your target market, appealing to their needs better. It allows you to more easily conduct an analysis of the data stored in your database, helping to identify potential opportunities and challenges within it. It enables you to mass-personalise your marketing communications, reducing costs.
Link Analysis
Link analysis is a data analysis technique from network theory that evaluates the relationships or connections between network nodes. These relationships can be between various types of objects (nodes), including people, organisations and even transactions. Link analysis is often used in search engine optimisation as well as in intelligence, security analysis, and market and medical research.
What are the three primary purposes of link analysis?
Find matches for known patterns of interests between linked objects. Find anomalies by detecting violated known patterns. Find new patterns of interest (for example, in social networking and marketing and business intelligence).
Deviation Detection
Data mining technique that identifies outliers, which express deviation from some previously known expectation and norm.
Logic
The system of operations performed by a computer that underlies the machine's representation of logical operations
Cascading Delete
Occurs when associated child rows are deleted along with the deletion of a parent row.
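For example (hypothetical tables):

    CREATE TABLE orders (
        order_id INT PRIMARY KEY
    );

    CREATE TABLE order_line (
        order_id INT REFERENCES orders(order_id) ON DELETE CASCADE,
        line_no  INT,
        product  VARCHAR(100),
        PRIMARY KEY (order_id, line_no)
    );

    -- Deleting the parent row also deletes its child rows in order_line.
    DELETE FROM orders WHERE order_id = 42;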
Database Transaction
A database transaction is a process carried out on a database, which may change its state, for example: moving money between bank accounts
The importance of durability in a database transaction
Durability in databases is important because it ensures transactions are saved permanently, and do not accidentally disappear or get erased
Different types of relationships within databases
One-to-one; one-to-many; many-to-one; many-to-many
Data dictionary
A database that provides metadata about the database itself. A data dictionary contains information such as how much space has been allocated for and is currently used by the database, and default values for columns/attributes. It does not contain any actual data from the database, only information for managing it.
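Many relational systems expose the data dictionary through queryable views, for example the SQL-standard INFORMATION_SCHEMA (vendor-specific names vary):

    -- Metadata only: column names, types and defaults, never row data.
    SELECT column_name, data_type, column_default
    FROM information_schema.columns
    WHERE table_name = 'customer';  -- 'customer' is a hypothetical table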
Responsibilities of a database administrator
Installing and configuring software; creating new databases; ensuring database security is implemented to safeguard the data; backing up and recovering the database.
Difference between an information system and a database
An information system is the collection of software, hardware, networking infrastructure, human resources and databases that provide the storage, processing and communication of information. A database is a subset of the information system.
Atomicity (ACID)
Ensures that either all changes to data are performed as if they are a single operation or none of them are performed.
Consistency (ACID)
Ensures that data is in a consistent state when a transaction begins and when it ends. Ensures that only valid data is written in the database.
Isolation (ACID)
Ensures that the intermediate state of a transaction is invisible to other transactions
Durability (ACID)
Ensures that after a transaction successfully completes, changes to data persist and are not undone even in the event of a system failure
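A short SQL sketch tying atomicity and durability together (hypothetical tables):

    BEGIN TRANSACTION;

    UPDATE stock SET quantity = quantity - 5 WHERE product_id = 7;
    INSERT INTO order_item (order_id, product_id, qty) VALUES (42, 7, 5);

    -- If either statement fails, ROLLBACK undoes the whole unit (atomicity);
    -- on success, COMMIT makes the changes permanent even through a
    -- subsequent crash (durability).
    COMMIT;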