p6
Introduction to Hive
- What is Hive?
- Hive is a Data Warehousing tool used for querying structured data built on top of Hadoop.
- Developed by Facebook to manage growing data volumes.
- Key components utilized are:
- HDFS (Hadoop Distributed File System) for storage.
- MapReduce for execution of queries.
- Metadata stored in an RDBMS (Relational Database Management System).
Hive Overview
- Apache Hive: Popular SQL interface for batch processing on Hadoop.
- Built to organize and store massive data.
- Provides an easy access method to data within the cluster.
Query Language and Features
- HiveQL: A query language similar to SQL.
- Allows analysts and data scientists to access data without needing Java expertise.
- Structures data on HDFS, making it a data warehousing platform.
- Accelerates response time for data analysis and expands the user base of Hadoop and MapReduce.
- Developed by Facebook, contributed to the Apache Software Foundation, and is available as open-source software.
Limitations of Hive
- What Hive is not?
- Not a Relational Database; data processes are stored in HDFS.
- Not designed for OLTP (Online Transaction Processing).
- Unsuitable for real-time queries and row-level updates; ideal for batch jobs on large, immutable data.
Use-Cases of Hive
- Processes large amounts of unstructured data into a structured view.
- Common use cases include ad-hoc queries, summarization, data analysis.
- Supports custom scalar functions, aggregations, and table functions via HIVEQL.
- Converts SQL queries into MapReduce jobs for execution.
Features of Hive
- Similar to SQL, making it user-friendly (HQL).
- Supports rich data types like structs, lists, and maps.
- Includes standard SQL features, such as filters, group-by, and order-by clauses.
Prerequisites for Using Hive
- Stable build of Hadoop.
- Java 1.6 installed.
- Basic Java programming skills.
- Basic SQL knowledge.
- Start Hadoop services with
$ start-all.sh and access Hive with $ hive command.
Hive Integration and Workflow
- Hourly log data is stored directly into HDFS.
- Data cleaning processes are performed on the log files.
- Hive tables can be created to query the cleaned files of log data.
Hive Architecture
- Components:
- Metastore: Stores Hive table definitions and data mappings.
- Driver: Manages query compilation, optimization, and execution.
- Command-Line Interface (CLI), Web Interface, and Server for remote job submissions.
- Embedded Metastore: Default metastore using Apache Derby Database; single process connection.
- Local Metastore: Stores metadata in RDBMS (e.g., MySQL); allows multiple connections.
- Remote Metastore: Driver and interface run separately, enhancing security by isolating credentials.
Data Organization in Hive
- Data structured as:
- Databases: Separate namespaces to avoid naming conflicts.
- Tables: Organized records with the same schema (e.g., page_views).
- Partitions: Logical data segregation based on specific attributes.
- Buckets: Further division of partitions using hash functions for improved performance.
Table Concepts in Hive
- Tables hold similar records defined by schema:
- Example:
page_views with columns like timestamp, userid, and page_url.
- Tables are analogous to relational tables with corresponding directories in HDFS.
- Data can be serialized and stored in various formats, with support for custom serialization schemes.
Working with Tables
- Loading Data:
- Command:
LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' OVERWRITE INTO TABLE EXT_STUDENT; - Retrieval:
SELECT * from EXT_STUDENT;
- Table Alterations: Commands to modify tables include renaming, adding columns, or changing data types.
Working with Partitions
- Partitioning: Logically separates data into subdirectories based on attributes.
- Example of a partitioned table:
CREATE TABLE Sales (sale_id INT, amount FLOAT) PARTITIONED BY (country STRING, year INT, month INT);
- Benefits include processing only the relevant data for queries, optimizing performance.
Buckets in Hive
- Buckets further segment data within partitions.
- Each bucket corresponds to a file in a partitioned directory, calculated via a hash function on a defined column.
Hive Data Types
- Primitive Types: Includes numeric types (INT, FLOAT), boolean, and string types.
- Complex Types: Includes structures, arrays, and maps to handle collections and complex data.
- Supported varying data formats such as text files, sequential files, and columnar formats (RCFile).
Hive Query Language (HQL)
- HQL is used to create, manage databases/tables/partitions/buckets and supports various relational & logical operations.
- Creating a Database:
CREATE DATABASE IF NOT EXISTS STUDENTS COMMENT 'STUDENT Details' WITH DBPROPERTIES ('creator' = 'JOHN'); - DDL statements for managing schema and data include create, drop, alter, and describe commands.
Database Operations in Hive
- Commands to manage databases include describing, showing, and dropping databases as necessary.
- To list databases:
SHOW DATABASES; - To use a specific database:
USE <data_base_name>;
Types of Tables in Hive
- Managed Tables: Hive deletes both data and metadata on drop.
- External Tables: Hive deletes only metadata, maintaining data in its original location.
- Use the keyword
EXTERNAL to create external tables.