p6

Introduction to Hive

What is Hive?
- Hive is a Data Warehousing tool used for querying structured data built on top of Hadoop.
- Developed by Facebook to manage growing data volumes.
- Key components utilized are:
- HDFS (Hadoop Distributed File System) for storage.
- MapReduce for execution of queries.
- Metadata stored in an RDBMS (Relational Database Management System).

Hive Overview

Apache Hive: Popular SQL interface for batch processing on Hadoop.
- Built to organize and store massive data.
- Provides an easy access method to data within the cluster.

Query Language and Features

HiveQL: A query language similar to SQL.
- Allows analysts and data scientists to access data without needing Java expertise.
- Structures data on HDFS, making it a data warehousing platform.
Accelerates response time for data analysis and expands the user base of Hadoop and MapReduce.
Developed by Facebook, contributed to the Apache Software Foundation, and is available as open-source software.

Limitations of Hive

What Hive is not?
- Not a Relational Database; data processes are stored in HDFS.
- Not designed for OLTP (Online Transaction Processing).
- Unsuitable for real-time queries and row-level updates; ideal for batch jobs on large, immutable data.

Use-Cases of Hive

Processes large amounts of unstructured data into a structured view.
Common use cases include ad-hoc queries, summarization, data analysis.
Supports custom scalar functions, aggregations, and table functions via HIVEQL.
Converts SQL queries into MapReduce jobs for execution.

Features of Hive

Similar to SQL, making it user-friendly (HQL).
Supports rich data types like structs, lists, and maps.
Includes standard SQL features, such as filters, group-by, and order-by clauses.

Prerequisites for Using Hive

Stable build of Hadoop.
Java 1.6 installed.
Basic Java programming skills.
Basic SQL knowledge.
Start Hadoop services with $ start-all.sh and access Hive with $ hive command.

Hive Integration and Workflow

Hourly log data is stored directly into HDFS.
Data cleaning processes are performed on the log files.
Hive tables can be created to query the cleaned files of log data.

Hive Architecture

Components:
- Metastore: Stores Hive table definitions and data mappings.
- Driver: Manages query compilation, optimization, and execution.
- Command-Line Interface (CLI), Web Interface, and Server for remote job submissions.

Metastore Types

Embedded Metastore: Default metastore using Apache Derby Database; single process connection.
Local Metastore: Stores metadata in RDBMS (e.g., MySQL); allows multiple connections.
Remote Metastore: Driver and interface run separately, enhancing security by isolating credentials.

Data Organization in Hive

Data structured as:
- Databases: Separate namespaces to avoid naming conflicts.
- Tables: Organized records with the same schema (e.g., page_views).
- Partitions: Logical data segregation based on specific attributes.
- Buckets: Further division of partitions using hash functions for improved performance.

Table Concepts in Hive

Tables hold similar records defined by schema:
- Example: page_views with columns like timestamp, userid, and page_url.
Tables are analogous to relational tables with corresponding directories in HDFS.
Data can be serialized and stored in various formats, with support for custom serialization schemes.

Working with Tables

Loading Data:
- Command: LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' OVERWRITE INTO TABLE EXT_STUDENT;
- Retrieval: SELECT * from EXT_STUDENT;
Table Alterations: Commands to modify tables include renaming, adding columns, or changing data types.

Working with Partitions

Partitioning: Logically separates data into subdirectories based on attributes.
Example of a partitioned table:
- CREATE TABLE Sales (sale_id INT, amount FLOAT) PARTITIONED BY (country STRING, year INT, month INT);
Benefits include processing only the relevant data for queries, optimizing performance.

Buckets in Hive

Buckets further segment data within partitions.
Each bucket corresponds to a file in a partitioned directory, calculated via a hash function on a defined column.

Hive Data Types

Primitive Types: Includes numeric types (INT, FLOAT), boolean, and string types.
Complex Types: Includes structures, arrays, and maps to handle collections and complex data.
Supported varying data formats such as text files, sequential files, and columnar formats (RCFile).

Hive Query Language (HQL)

HQL is used to create, manage databases/tables/partitions/buckets and supports various relational & logical operations.
Creating a Database: CREATE DATABASE IF NOT EXISTS STUDENTS COMMENT 'STUDENT Details' WITH DBPROPERTIES ('creator' = 'JOHN');
DDL statements for managing schema and data include create, drop, alter, and describe commands.

Database Operations in Hive

Commands to manage databases include describing, showing, and dropping databases as necessary.
To list databases: SHOW DATABASES;
To use a specific database: USE <data_base_name>;

Types of Tables in Hive

Managed Tables: Hive deletes both data and metadata on drop.
External Tables: Hive deletes only metadata, maintaining data in its original location.
Use the keyword EXTERNAL to create external tables.