Introduction to Data Management Tools
Competency Statement and the Value of Data Management
Competency Statement: The graduate applies data management tools and processes for business tasks.
The Evolution of Data Significance: Markets and platforms can shift drastically; as seen with AOL and Myspace, companies must adapt or disappear. Organizations and individuals capture massive volumes of data daily from diverse sources.
Data Collection Examples: When a customer makes a purchase, companies collect specific data points, including:
What was purchased.
When it was purchased.
How much was paid.
Where the customer lives.
Whether the purchase was bundled with other products.
Data as a Strategic Asset:
Data is a corporate asset that grows in value as its volume increases.
Data accumulated over years makes it possible to identify trends and generate predictions.
Data-backed decisions lead to superior marketing and company development.
Investment in data management is comparable to securing funds in a bank; it is vital to an enterprise's success.
Defining Data Management (DM)
Gartner Group Definition: Data management consists of the practices, architectural techniques, and tools for achieving consistent access to and delivery of data across the spectrum of data subject areas and data structure types in the enterprise.
Purpose: The objective is to access data meaningfully to enable thoughtful, strategic decision-making regarding new investments, marketing, and product development.
Proficiency Statement: The student describes key data management tools and their capabilities.
Data versus Information
Fundamental Distinction: In computing, the terms "data" and "information" are not synonymous, though they are often used interchangeably.
Raw Data: Data by itself is raw, unorganized, and may lack significance until it is processed, organized, and structured.
The Role of Context: Data requires context to become useful. Once context is applied, data becomes information, which carries meaning.
Example: An individual employee's training score is raw data. The average score of an entire department or division, providing comparison and insight, is information.
Transformation Tools: Tools like relational databases and query languages such as Structured Query Language (SQL) are used to transform vast amounts of data into information.
Scale Example: A single person may use a debit card several times a day. Multiplied by millions of cardholders, this results in millions of transactions daily, requiring complex relational databases to manage.
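Illustration: The training score example above can be sketched in a few lines of Python. This is only a minimal sketch; the employee names, departments, and scores are made up for illustration and do not come from the course material. Each tuple is raw data, and the per-department average is the information derived from it.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw data: (employee, department, training score).
raw_scores = [
    ("Avery", "Sales", 82),
    ("Blake", "Sales", 91),
    ("Casey", "Support", 74),
    ("Devon", "Support", 88),
    ("Emery", "Support", 90),
]


def average_by_department(rows):
    """Group raw scores by department and return each department's mean."""
    grouped = defaultdict(list)
    for _employee, department, score in rows:
        grouped[department].append(score)
    return {dept: mean(scores) for dept, scores in grouped.items()}


# A single score says little; the department averages allow comparison.
department_averages = average_by_department(raw_scores)
for dept, avg in sorted(department_averages.items()):
    print(f"{dept}: {avg:.1f}")
```

The same grouping step is what a relational database performs with a GROUP BY query, only at a much larger scale.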
Data Governance and Security
Definition of Data Governance: This field is concerned with data policies, procedures, access control, backup and recovery, and data classification standards to govern business-critical data.
Security Objective: The primary goal is to keep data confidential and safe.
Consequences of Failure: Data breaches occur weekly and can leave businesses or government entities facing:
Hefty financial fines.
Loss of customer trust.
Damage to brand reputation.
Functional Management: Data management handles crucial business entities such as customers, accounts, products, and supply chains.
Operational Example: Walmart’s distribution centers require accurate and timely data to efficiently supply grocery stores nationwide.
The Data Management Toolbox and Emerging Technologies
Data Discovery (Data Mining): This process involves searching for patterns, anomalies, and meaningful relationships within large datasets.
Business Intelligence (BI): BI encompasses a broad range of tools and practices designed to provide better strategic decision-making and predictive capabilities. It extracts, analyzes, and reports information to assist in critical decisions.
Predicting Trends: Organizations use data discovery on social media platforms like Pinterest and Instagram to predict future consumer trends.
Data Lakes: These are storage repositories that hold large amounts of unstructured data in its raw form, allowing for flexible analysis later.
Machine Learning (ML): ML algorithms identify patterns and anomalies in datasets to generate predictions based on that data.
Continuous Evolution: Existing tools, such as data visualization and BI software, are constantly being updated with new features.
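Illustration: The idea of generating predictions from patterns in data can be shown with one of the simplest techniques, a least-squares trend line. This is a sketch under stated assumptions, not how any particular ML product works: the yearly sales figures are invented, and the helper name fit_line is hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: units sold in years 1 through 5.
years = [1, 2, 3, 4, 5]
units_sold = [110, 130, 150, 170, 190]

slope, intercept = fit_line(years, units_sold)

# Use the learned pattern to predict year 6.
forecast = slope * 6 + intercept
print(f"Trend: +{slope:.0f} units per year, year 6 forecast: {forecast:.0f}")
```

Real ML tools fit far more flexible models, but the workflow is the same: learn a pattern from accumulated data, then apply it to a case that has not happened yet.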
Database Systems and Structure
Filing Cabinet Analogy:
Traditional cabinets use drawers (marked alphabetically), tabs, and folders to organize information about clients, property, or accounts.
Database systems function similarly, organizing information based on set rules.
Databases: Well-thought-out collections of computer files that serve as storehouses for data used by managers for decision-making.
Database Management Systems (DBMS): Software systems used to create and manage databases. Data is stored in files called tables.
Tables: The core component of a database where data is held.
Records (Rows): Individual entries in a table.
Fields (Columns): Specific categories of information within a record.
Queries: Questions asked of the data to produce specific subsets of information.
Relational Database: A system where tables are connected to other tables containing related information.
Key Database Terms:
Primary Key: A field that uniquely identifies a record in a table (e.g., a unique Student ID number).
Foreign Key: A field in one table that provides a link to another table in a relational database.
Schema: The blueprint, organization, or layout of a database. It defines tables, fields, constraints, keys, and integrity.
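Illustration: The terms above (table, record, field, primary key, foreign key, schema, query) can be seen together in a small sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are assumptions made for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Schema: the blueprint defining tables, fields, keys, and constraints.
conn.executescript("""
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,   -- primary key: unique per record
        name       TEXT NOT NULL
    );
    CREATE TABLE enrollments (
        enrollment_id INTEGER PRIMARY KEY,
        student_id    INTEGER NOT NULL
                      REFERENCES students(student_id),  -- foreign key
        course        TEXT NOT NULL
    );
""")

# Records (rows) with values for each field (column).
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [(1001, "Ana"), (1002, "Ben")])
conn.executemany("INSERT INTO enrollments (student_id, course) VALUES (?, ?)",
                 [(1001, "Databases"), (1001, "Statistics"), (1002, "Databases")])

# Query: a question asked of the data, following the foreign key link.
rows = conn.execute("""
    SELECT s.name, e.course
    FROM students AS s
    JOIN enrollments AS e ON e.student_id = s.student_id
    WHERE e.course = ?
    ORDER BY s.name
""", ("Databases",)).fetchall()
print(rows)

# The foreign key rejects an enrollment for a student who does not exist.
try:
    conn.execute("INSERT INTO enrollments (student_id, course) VALUES (?, ?)",
                 (9999, "Databases"))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("Orphan enrollment rejected:", rejected)
```

The last step shows why keys matter: the database itself refuses data that would break the relationship between the two tables.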
Big Data and the 4 Vs
Definition: Big data refers to large, expansive, and disparate datasets collected from sources like smartphone metadata, internet usage records, social media, and computer usage.
The 4 Vs of Big Data:
Volume: The sheer scale of data requires significant resources for management. Volume is growing exponentially.
Variety: Data comes in structured and unstructured forms from fragmented sources.
Veracity: Concerns the quality and trustworthiness of data. Data often requires "scrubbing" to remove discrepancies.
Velocity: The accelerating speed at which data is produced.
Examples: Streaming services like Netflix, cloud platforms like Amazon Web Services, or the constant data generated by a cell phone.
Data Categorization
Structured Data: Data residing in fixed formats, typically well-labeled with traditional fields and records (e.g., product specification tables).
Unstructured Data: Unorganized data that cannot be easily read by computers because it lacks rows and columns.
Statistics: Most data is unstructured.
Examples: Social media posts, video, audio, satellite imagery, and weather sensor data.
Semi-structured Data: Contains both structured and unstructured elements.
Email Example: Structured elements include the sender, recipient, date, and subject; the message body and attachments are unstructured.
HTML Example: Structured elements include layout tags; the displayed content is unstructured.
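Illustration: The email example above can be made concrete with Python's standard email module. The message below is invented for illustration. The header fields can be looked up by name because they are structured, while the body comes back as one block of free text.

```python
from email import message_from_string

# Hypothetical message; addresses and content are made up.
raw_email = """\
From: ana@example.com
To: ben@example.com
Date: Mon, 06 Jan 2025 09:30:00 -0700
Subject: Quarterly numbers

Hi Ben, the numbers look strong this quarter. Can we talk Thursday?
"""

message = message_from_string(raw_email)

# Structured part: named fields that map directly onto table columns.
structured = {field: message[field] for field in ("From", "To", "Date", "Subject")}

# Unstructured part: free text with no fields, rows, or columns.
body = message.get_payload()

for field, value in structured.items():
    print(f"{field:8}: {value}")
print("Body    :", body.strip())
```

Storing the headers in a database is straightforward; extracting meaning from the body (for example, that a meeting is being requested) requires text analysis.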
Data Mining Tools and Software
Purpose: Software that allows businesses to gather large amounts of data to protect accounts, monitor usage, or drive marketing.
Security Use Case: Receiving an instant message from a credit card company about a major purchase is a result of data mining surveillance for fraud protection.
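Illustration: The fraud alert described above can be approximated with a very simple rule: flag a purchase that is far larger than the cardholder's usual spending. This is a deliberately minimal sketch, not how any card issuer actually works; the purchase amounts, the function name flag_unusual, and the threshold of three standard deviations are all assumptions.

```python
from statistics import mean, stdev


def flag_unusual(history, new_amount, threshold=3.0):
    """Return True when new_amount is more than `threshold` standard
    deviations above the average of the past purchase amounts."""
    avg = mean(history)
    spread = stdev(history)
    if spread == 0:
        # All past purchases were identical; anything larger stands out.
        return new_amount > avg
    return (new_amount - avg) / spread > threshold


# Hypothetical purchase history for one cardholder, in dollars.
past_purchases = [42.10, 18.75, 63.00, 25.40, 51.20, 37.95]

print(flag_unusual(past_purchases, 48.00))    # an ordinary purchase
print(flag_unusual(past_purchases, 2400.00))  # a major purchase
```

Production data mining tools weigh many more signals, such as location, merchant type, and time of day, but the principle of comparing new activity against a learned baseline is the same.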
Common Software Applications:
Oracle Data Mining (ODM) (utilizes SQL).
RapidMiner.
IBM SPSS Modeler.
KNIME, SAS Enterprise Miner, Weka, Orange, Alteryx.
Microsoft Azure Machine Learning, TensorFlow, H2O.ai, BigML.
Dataiku, Databricks, Talend, and TIBCO Spotfire.
Data Storage: Warehouses and Marts
Data Warehouse: Used to consolidate disparate data in a central location for enterprise-wide use.
Capacity: Can hold yottabytes of data.
Measurement: One yottabyte is 10^24 bytes, or one trillion terabytes.
Data Mart: A smaller version of a data warehouse designed for the specific needs of a single department, such as Sales or Human Resources.
Big Data Toolsets: ETL and Hadoop
ETL (Extract, Transform, Load): A process to standardize and centralize data for querying.
Extract: Gathering data from sources like Customer Relationship Management (CRM) or Enterprise Resource Planning (ERP) systems.
Transform: Cleaning data to fit table structures (e.g., removing decimals or dollar signs).
Load: Transferring data into the warehouse or mart. Frequency is critical for up-to-date analytics.
Requirement: ETL must happen in the specific order: Extract, then Transform, then Load.
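Illustration: The three ETL steps above can be sketched with Python's standard library. Everything specific here is an assumption made for the example: the CSV text stands in for a CRM export, an in-memory SQLite database stands in for the warehouse, and the function and column names are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical CRM export with inconsistent currency formatting.
CRM_EXPORT = """customer,state,amount
Ana,UT,"$1,250.50"
Ben,AZ,$89.99
Cruz,TX,$400
"""


def extract(text):
    """Extract: read raw rows from the source system's export."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: strip dollar signs and commas, and store whole cents
    so the amount fits an INTEGER column with no decimals."""
    cleaned = []
    for row in rows:
        amount = row["amount"].replace("$", "").replace(",", "")
        cents = round(float(amount) * 100)
        cleaned.append((row["customer"].strip(), row["state"].strip().upper(), cents))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(customer TEXT, state TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


# The steps run strictly in order: Extract, then Transform, then Load.
warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(CRM_EXPORT)))

total_cents = warehouse.execute("SELECT SUM(amount_cents) FROM sales").fetchone()[0]
print(f"Total sales loaded: ${total_cents / 100:,.2f}")
```

The order matters because each step consumes the previous step's output: the raw text "$1,250.50" would not fit the integer column if it were loaded before being transformed.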
Hadoop: An infrastructure for storing and processing large datasets across multiple servers using a distributed file system.
Difference from Data Warehouses: It handles unstructured/semi-structured data and does not centralize files; it identifies and searches files across multiple servers.
Scalability: Highly cost-effective and scalable; used by major companies like Uber, Airbnb, and Spotify.
Personnel: Typically requires a qualified data scientist to operate.
Apache Spark: A more recent alternative to Hadoop that is gaining wider adoption.
Data Output and Visualization
SQL (Structured Query Language): The standard language for manipulating and querying relational databases.
Query Example: Finding customers in Utah, Arizona, and Texas who made a purchase with financing in recent months.
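Illustration: The query example above might look like the following in SQL, run here through Python's built-in sqlite3 module. This is a sketch under several assumptions: the purchases table, its columns, and the sample rows are invented, and because the text does not give a specific length for the lookback window, it is left as a hypothetical months_back parameter (approximated as 30 days per month).

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE purchases (
        customer      TEXT,
        state         TEXT,
        purchase_date TEXT,     -- ISO format, YYYY-MM-DD
        financed      INTEGER   -- 1 = financed, 0 = paid in full
    )
""")
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", [
    ("Ana",  "UT", "2024-05-20", 1),
    ("Ben",  "AZ", "2023-01-15", 1),  # financed, but too long ago
    ("Cruz", "TX", "2024-04-02", 0),  # recent, but not financed
    ("Dee",  "CO", "2024-05-01", 1),  # recent and financed, other state
    ("Eli",  "TX", "2024-03-11", 1),
])


def recent_financed_customers(conn, today, months_back):
    """Customers in UT, AZ, or TX with a financed purchase in the window."""
    cutoff = (today - timedelta(days=30 * months_back)).isoformat()
    rows = conn.execute("""
        SELECT customer
        FROM purchases
        WHERE state IN ('UT', 'AZ', 'TX')
          AND financed = 1
          AND purchase_date >= ?
        ORDER BY customer
    """, (cutoff,)).fetchall()
    return [customer for (customer,) in rows]


matches = recent_financed_customers(conn, today=date(2024, 6, 1), months_back=6)
print(matches)
```

Each condition in the WHERE clause narrows the result: state, financing, and date each remove one of the non-matching sample rows.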
Tableau: A business intelligence platform that simplifies raw data into interactive visualizations (graphs, charts, numerical analysis).
Dashboards: Tools that present an overall view of business health (e.g., hotel room availability or supply chain efficiency) for non-database experts.
Other Visualization Tools: Power BI, Looker, and Domo.
Cloud Computing Platforms: Solutions like Amazon Web Services (AWS), Google Cloud Platform, and Microsoft Azure offer scalable, cost-effective storage and analysis.
Questions & Discussion
Question regarding USB sticks: "These cheap USB memory sticks can carry terabytes of data. Are they a threat to data security?"
Context: The transcript notes that large amounts of data can now be carried easily, implying a risk to data governance and confidentiality.
Sample Queries for Professionals:
Nonprofit: "How many volunteers in my national nonprofit are within city limits?"
E-commerce: "Which types of transactions are not completed within online purchases?"
Healthcare: "How often do patients respond to pre-screening questions prior to medical appointments?"
Marketing: "What are the key defining demographics of our most active social media followers?"