Understand the concept of "big data" and its origins.
Define big data by its characteristics – the 4Vs.
Explain challenges posed by Big Data.
Understand how data integration addresses the issue of being data-rich yet information-poor.
Differentiate between data warehouses, data marts, and data lakes.
Definition of Big Data
Big Data is defined as a collection of data sets so large and complex that they are difficult to process using traditional database management tools or applications.
Key Challenges:
Capture
Storage
Search
Sharing
Transfer
Analysis
Visualization
Data is created at an astronomical rate: an estimated 3.3 quintillion bytes per day, with the total volume of data created per year projected to reach 181 zettabytes by 2025.
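To make these magnitudes easier to compare, the units can be converted with a few lines of arithmetic. The following is a minimal sketch, assuming decimal (SI) units and the short-scale meaning of "quintillion" (10^18); the helper names to_bytes and from_bytes are illustrative only.

```python
# Decimal (SI) byte units; each step up is a factor of 1,000.
UNITS = {
    "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "PB": 10**15, "EB": 10**18, "ZB": 10**21,
}

def to_bytes(value, unit):
    """Convert a value expressed in the given unit to bytes."""
    return value * UNITS[unit]

def from_bytes(n_bytes, unit):
    """Convert a byte count to the given unit."""
    return n_bytes / UNITS[unit]

# 3.3 quintillion bytes (short scale: 1 quintillion = 10**18).
daily_bytes = 33 * 10**17
daily_exabytes = from_bytes(daily_bytes, "EB")

# 181 zettabytes expressed in bytes.
projected_bytes = to_bytes(181, "ZB")

print(f"{daily_exabytes} EB per day")
print(f"{projected_bytes:,} bytes")
```

Under these assumptions, one quintillion bytes is the same as one exabyte, and one zettabyte is 1,000 exabytes.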
Perspectives on Big Data
The definition varies with the capabilities of an organization and its tools.
Example: For some, hundreds of gigabytes may require new management options, while others may not consider data too large until it reaches hundreds of terabytes.
Sources of Big Data
Archives: Historical records of communications and transactions.
Documents: Emails, presentations, spreadsheets, etc.
Business Apps: Data from ERP, CRM, and HR systems.
Public Data: Government websites providing local, state, and federal data.
Social Media: Data from platforms like Twitter, Facebook, and LinkedIn.
Machine Logs: Call detail records and logs from business processes.
Media: Images, audio, and video content.
Sensor Data: From IoT devices and process control devices.
Big Data Characteristics - The 4Vs
Volume: Refers to the amount of data – can be measured in terabytes, petabytes, and exabytes.
Velocity: The speed at which data is generated and must be stored, which can overwhelm traditional systems.
Variety: Refers to different forms of data - roughly 80% of big data is unstructured.
Veracity: The quality and trustworthiness of data, determining reliability for insights.
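Veracity is the least tangible of the four characteristics, so a small example may help. The sketch below counts two simple quality indicators, incomplete records and duplicate records, in a list of dictionaries. The function name quality_report, the field names, and the sample records are all hypothetical; real data-quality checks are usually far more extensive.

```python
def quality_report(records, required_fields):
    """Count records with missing required fields and repeated records."""
    incomplete = 0
    duplicates = 0
    seen = set()
    for record in records:
        values = tuple(record.get(field) for field in required_fields)
        # A record is incomplete if any required field is absent or empty.
        if any(value in (None, "") for value in values):
            incomplete += 1
        # A record is a duplicate if the same values were seen before.
        if values in seen:
            duplicates += 1
        else:
            seen.add(values)
    return {"total": len(records), "incomplete": incomplete, "duplicates": duplicates}

# Hypothetical sample data.
sample = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 1, "email": "a@example.com"},
]
report = quality_report(sample, ["id", "email"])
print(report)
```

A high share of incomplete or duplicated records is one signal that insights drawn from the data may not be reliable.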
Challenges of Big Data
Determining which data subsets to store.
Deciding where and how to store data.
Identifying relevant data for decision-making.
Extracting value from very large datasets.
Protecting sensitive data from unauthorized access.
Data Integration
Key Problem: Organizations may have abundant data but lack the processes to turn it into meaningful information.
Solution: Data integration makes data reliable, consistent, and understandable, which improves the quality of business decisions and, in turn, affects costs and revenue.
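As an illustration of what "consistent" means in practice, the sketch below combines records from two hypothetical systems (a CRM and an ERP) that identify the same customer with different field names and formats. All field names, values, and the integrate function are assumptions made for this example.

```python
# Hypothetical source data: the same customer, described inconsistently.
crm_rows = [{"CustomerID": "007", "Name": " Ada Lovelace "}]
erp_rows = [{"cust_id": 7, "total_spend": "1200.50"}]

def integrate(crm_rows, erp_rows):
    """Merge both sources into one record per customer with uniform fields."""
    customers = {}
    for row in crm_rows:
        cid = int(row["CustomerID"])  # "007" and 7 become the same key
        customers[cid] = {
            "customer_id": cid,
            "name": row["Name"].strip(),  # remove stray whitespace
            "total_spend": 0.0,
        }
    for row in erp_rows:
        cid = int(row["cust_id"])
        record = customers.setdefault(
            cid, {"customer_id": cid, "name": None, "total_spend": 0.0}
        )
        record["total_spend"] += float(row["total_spend"])  # text to number
    return customers

unified = integrate(crm_rows, erp_rows)
print(unified[7])
```

Once both sources share one key and one set of field types, a question such as "how much has this customer spent?" can be answered from a single record.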
Data Warehousing
Definition: A data warehouse is a large database that collates business information from various sources.
Function: Supports management decision-making. Data is brought into the warehouse through extraction, transformation, and loading (ETL).
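The three ETL steps can be sketched in a few lines. This is a simplified illustration, not a production pipeline: it assumes a small CSV string as the source and an in-memory SQLite database as a stand-in for the warehouse, and the table and column names are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from an operational system.
RAW_CSV = """order_id,region,amount
1,north,100.50
2,South,20
3,north,
"""

def extract(text):
    """Extract: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows without an amount, normalize region, fix types."""
    clean = []
    for row in rows:
        if not row["amount"]:
            continue
        clean.append(
            (int(row["order_id"]), row["region"].strip().lower(), float(row["amount"]))
        )
    return clean

def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)
```

The transformation happens before loading, so every query against the warehouse sees data that has already been cleaned.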
Data Sources: Internal operations, external data, social networks, and clickstream data.
Data Marts and Data Lakes
Data Mart: A subset of data from a warehouse tailored for small- to medium-sized businesses or specific departments.
Data Lake: A vast repository holding all types of data in raw format, allowing users to extract and transform data as needed when conducting analyses.
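The data lake idea of storing raw data first and transforming it only when needed can be sketched as follows. A plain Python list stands in for the lake's storage, and the record contents and the function names ingest and pages_clicked are hypothetical.

```python
import json

lake = []  # stand-in for raw storage; records are kept exactly as received

def ingest(raw_text):
    """Store the raw text without parsing or validating it."""
    lake.append(raw_text)

# Different kinds of data land in the same repository.
ingest('{"type": "click", "user": "u1", "page": "/home"}')
ingest('{"type": "sensor", "device": "d9", "temp_c": 21.5}')
ingest('{"type": "click", "user": "u2", "page": "/pricing"}')

def pages_clicked():
    """Parse and filter at analysis time, extracting only what is needed."""
    pages = []
    for raw in lake:
        record = json.loads(raw)
        if record.get("type") == "click":
            pages.append(record["page"])
    return pages

print(pages_clicked())
```

In contrast to the warehouse approach, nothing is cleaned or restructured on the way in; each analysis decides how to interpret the raw records.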
Data Warehouses vs. Data Marts
Data warehouses contain comprehensive data suitable for large-scale decision support, while data marts offer specialized data for specific departments or functions.
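The relationship between the two can be illustrated with a small example in which a data mart is modeled as a filtered view over a warehouse table. This is a simplification: in practice a data mart is often a separate physical database. The table name, view name, and sample rows are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The warehouse holds data for every department.
conn.execute(
    "CREATE TABLE warehouse_facts (department TEXT, metric TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_facts VALUES (?, ?, ?)",
    [
        ("sales", "revenue", 5000.0),
        ("hr", "headcount", 42.0),
        ("sales", "orders", 120.0),
    ],
)

# The data mart exposes only the subset relevant to one department.
conn.execute(
    "CREATE VIEW sales_mart AS "
    "SELECT metric, value FROM warehouse_facts WHERE department = 'sales'"
)

mart_rows = conn.execute(
    "SELECT metric, value FROM sales_mart ORDER BY metric"
).fetchall()
print(mart_rows)
```

Users of the sales mart see only sales metrics, while the warehouse remains the comprehensive source behind it.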