VOLUME
the amount of data generated
VERACITY
accuracy and trustworthiness of the data (if the data input is not correct or accurate, the output will not be either: garbage in, garbage out)
VARIETY
the type of data files generated
structured → Excel files
semi-structured → log files
unstructured → images
VELOCITY
the speed at which data is generated
VALUE
the community benefit of the information collected
What is a database?
collection of related files containing records on people, places, or things.
(T/F) a database is where all data is stored, but the data is not necessarily clean
True
data warehouse is
where the data stored is cleaned and usable
characteristics of data warehouses
-Large
-multiple sources
-historical
-cross organizational access and analysis
-supports various types of analyses and reporting
(T/F) data mart is a subset of data warehouse
True
data lake is a
storage for raw data kept in its native format, including unstructured data
data mining
finds hidden patterns in large data sets
regression analysis
common technique for data mining
text mining
extract insights from textual data
video analysis is the process of
obtaining information or insights from video footage
data governance encompasses
policies and procedures through which data can be managed as an organizational resource
sources of data for big data
internal
documents
emails
external
social media
public datasets
entity is
Generalized category representing person, place, thing
Attributes are
Specific characteristics of each entity
ex
supplier name, address
Entity-relationship diagram is
Used to clarify table relationships in a relational database
Relational database tables may have:
One-to-one relationship
One-to-many relationship
Many-to-many relationship
A many-to-many relationship requires a “join table” (intersection relation) that links the two tables so their information can be joined
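A minimal sketch of a many-to-many relationship resolved through a join table, using Python's built-in `sqlite3` module. The `supplier`, `part`, and `supplier_part` tables and their rows are hypothetical, chosen only for illustration.

```python
import sqlite3

# In-memory database; all table names and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE supplier (supplier_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE part (part_id INTEGER PRIMARY KEY, description TEXT);
-- Join table: one row per (supplier, part) pairing.
CREATE TABLE supplier_part (
    supplier_id INTEGER REFERENCES supplier(supplier_id),
    part_id INTEGER REFERENCES part(part_id),
    PRIMARY KEY (supplier_id, part_id)
);
INSERT INTO supplier VALUES (1, 'Acme'), (2, 'Bolt Co');
INSERT INTO part VALUES (10, 'Hinge'), (11, 'Latch');
INSERT INTO supplier_part VALUES (1, 10), (1, 11), (2, 10);
""")

# Walk supplier -> supplier_part -> part to list who supplies what.
rows = conn.execute("""
SELECT s.name, p.description
FROM supplier s
JOIN supplier_part sp ON sp.supplier_id = s.supplier_id
JOIN part p ON p.part_id = sp.part_id
ORDER BY s.name, p.description
""").fetchall()

for name, description in rows:
    print(name, "supplies", description)
```

Here one supplier can supply many parts and one part can come from many suppliers; neither base table stores the pairing itself.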
operations of a relational DBMS
select
join
project
select
Creates a subset of all records meeting stated criteria
join
Combines relational tables to present the user with more information than is available from individual tables
project
Creates a subset consisting of columns in a table
Permits user to create new tables containing only desired information
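The three operations can be sketched in SQL through Python's built-in `sqlite3` module. The `supplier` and `part` tables, their columns, and the price threshold are hypothetical examples, not part of the original notes.

```python
import sqlite3

# In-memory database; table contents are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE supplier (supplier_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE part (
    part_id INTEGER PRIMARY KEY,
    description TEXT,
    unit_price REAL,
    supplier_id INTEGER
);
INSERT INTO supplier VALUES (1, 'Acme', 'Dayton'), (2, 'Bolt Co', 'Austin');
INSERT INTO part VALUES
    (10, 'Hinge', 4.5, 1), (11, 'Latch', 12.0, 2), (12, 'Knob', 30.0, 1);
""")

# select: subset of records (rows) meeting a stated criterion.
selected = conn.execute(
    "SELECT * FROM part WHERE unit_price > 10 ORDER BY part_id"
).fetchall()

# join: combine two tables to show more than either holds alone.
joined = conn.execute("""
SELECT p.part_id, p.description, s.name
FROM part p JOIN supplier s ON s.supplier_id = p.supplier_id
ORDER BY p.part_id
""").fetchall()

# project: subset of columns only.
projected = conn.execute(
    "SELECT description, unit_price FROM part ORDER BY part_id"
).fetchall()

print(selected)
print(joined)
print(projected)
```

Note that in SQL all three operations are written with the `SELECT` statement: the `WHERE` clause does the select, `JOIN` does the join, and the column list does the project.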
Data definition: Specifies the structure of the database content
Data dictionary: Stores definitions of data elements and their characteristics
Querying and reporting
Data manipulation language
▪ Structured Query Language (SQL)
▪ Microsoft Access query-building tools
– Report generation:
▪ Examples: SQL Server Reporting Services
Hadoop
Breaks data task into sub-problems and distributes the processing to many inexpensive computer processing nodes
Key services of Hadoop
Hadoop Distributed File System (HDFS)
MapReduce
HBase: NoSQL database
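The MapReduce idea of splitting a task into sub-problems can be sketched in plain Python as a word count. This is a single-process illustration only, not the Hadoop API: the function names and the sample chunks are hypothetical, and real Hadoop would run the map and reduce steps on separate nodes.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the grouped values for one key."""
    return key, sum(values)

# Each chunk stands in for a block of data handled by a different node.
chunks = ["big data big value", "data lake data mart"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```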
alternative to Hadoop
Apache Spark
Faster than Hadoop for small workloads
Online Analytical Processing (OLAP) supports
multidimensional data analysis, enabling users to view the same data in different ways using multiple dimensions
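A small sketch of viewing the same data along different dimensions, in plain Python. The `sales` records and the `roll_up` helper are hypothetical; an actual OLAP tool would do this over a much larger cube.

```python
from collections import defaultdict

# Illustrative fact records with three dimensions and one measure.
sales = [
    {"product": "bolts", "region": "East", "quarter": "Q1", "units": 100},
    {"product": "bolts", "region": "West", "quarter": "Q1", "units": 80},
    {"product": "nuts", "region": "East", "quarter": "Q2", "units": 50},
    {"product": "nuts", "region": "West", "quarter": "Q1", "units": 70},
]

def roll_up(records, *dimensions):
    """Total the units for every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for record in records:
        key = tuple(record[d] for d in dimensions)
        totals[key] += record["units"]
    return dict(totals)

# Same data, two different views.
by_product = roll_up(sales, "product")
by_region_quarter = roll_up(sales, "region", "quarter")
print(by_product)
print(by_region_quarter)
```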
data mining examples
Detect fraud
Improve forecasting
Increase sales
Regression analysis determines
the relationship between a dependent variable and one or more independent variables
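For the simplest case, one independent variable, the relationship can be estimated with ordinary least squares. The sketch below is plain Python; the `fit_line` helper and the advertising and sales figures are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: independent variable (ad spend), dependent (sales).
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)
print(f"sales = {intercept:.2f} + {slope:.2f} * ad_spend")
```

The slope estimates how much the dependent variable changes for each one-unit change in the independent variable.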
text mining examples
Sentiment analysis
customer feedback analysis
social media monitoring
competitor analysis
market research and trend analysis
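Sentiment analysis, the first example above, can be sketched with a tiny word-list approach. The word lists and the `sentiment_score` helper are hypothetical and far simpler than what production text-mining tools use.

```python
# Illustrative word lists; a real lexicon would be much larger.
POSITIVE = {"great", "love", "fast"}
NEGATIVE = {"slow", "broken", "hate"}

def sentiment_score(text):
    """Count positive words minus negative words in the text."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    positives = sum(w in POSITIVE for w in words)
    negatives = sum(w in NEGATIVE for w in words)
    return positives - negatives

happy = sentiment_score("Love the fast shipping!")
unhappy = sentiment_score("The app is slow and broken.")
print(happy, unhappy)
```

A score above zero suggests positive feedback and a score below zero suggests negative feedback.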
video analysis can be done using
Computer vision, machine learning, and deep learning
under web mining
Content mining mines content of websites
Structure mining mines website structural elements, such as links
Usage mining mines user interaction data gathered by web servers
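As a rough illustration of structure mining, the sketch below collects the links from a page using Python's built-in `html.parser` module. The `LinkCollector` class and the sample page are hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Illustrative page fragment; a real crawler would fetch many pages.
page = (
    '<p>See <a href="/pricing">pricing</a> and '
    '<a href="https://example.com/docs">docs</a>.</p>'
)

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

Content mining would instead look at the text between the tags, and usage mining at server logs of who visited which page.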