p6

Introduction to Hive

  • What is Hive?
    • Hive is a Data Warehousing tool used for querying structured data built on top of Hadoop.
    • Developed by Facebook to manage growing data volumes.
    • Key components utilized are:
    • HDFS (Hadoop Distributed File System) for storage.
    • MapReduce for execution of queries.
    • Metadata stored in an RDBMS (Relational Database Management System).

Hive Overview

  • Apache Hive: Popular SQL interface for batch processing on Hadoop.
    • Built to organize and store massive data.
    • Provides an easy access method to data within the cluster.

Query Language and Features

  • HiveQL: A query language similar to SQL.
    • Allows analysts and data scientists to access data without needing Java expertise.
    • Structures data on HDFS, making it a data warehousing platform.
  • Accelerates response time for data analysis and expands the user base of Hadoop and MapReduce.
  • Developed by Facebook, contributed to the Apache Software Foundation, and is available as open-source software.

Limitations of Hive

  • What Hive is not?
    • Not a Relational Database; data processes are stored in HDFS.
    • Not designed for OLTP (Online Transaction Processing).
    • Unsuitable for real-time queries and row-level updates; ideal for batch jobs on large, immutable data.

Use-Cases of Hive

  • Processes large amounts of unstructured data into a structured view.
  • Common use cases include ad-hoc queries, summarization, data analysis.
  • Supports custom scalar functions, aggregations, and table functions via HIVEQL.
  • Converts SQL queries into MapReduce jobs for execution.

Features of Hive

  • Similar to SQL, making it user-friendly (HQL).
  • Supports rich data types like structs, lists, and maps.
  • Includes standard SQL features, such as filters, group-by, and order-by clauses.

Prerequisites for Using Hive

  1. Stable build of Hadoop.
  2. Java 1.6 installed.
  3. Basic Java programming skills.
  4. Basic SQL knowledge.
  5. Start Hadoop services with $ start-all.sh and access Hive with $ hive command.

Hive Integration and Workflow

  • Hourly log data is stored directly into HDFS.
  • Data cleaning processes are performed on the log files.
  • Hive tables can be created to query the cleaned files of log data.

Hive Architecture

  • Components:
    • Metastore: Stores Hive table definitions and data mappings.
    • Driver: Manages query compilation, optimization, and execution.
    • Command-Line Interface (CLI), Web Interface, and Server for remote job submissions.

Metastore Types

  1. Embedded Metastore: Default metastore using Apache Derby Database; single process connection.
  2. Local Metastore: Stores metadata in RDBMS (e.g., MySQL); allows multiple connections.
  3. Remote Metastore: Driver and interface run separately, enhancing security by isolating credentials.

Data Organization in Hive

  • Data structured as:
    • Databases: Separate namespaces to avoid naming conflicts.
    • Tables: Organized records with the same schema (e.g., page_views).
    • Partitions: Logical data segregation based on specific attributes.
    • Buckets: Further division of partitions using hash functions for improved performance.

Table Concepts in Hive

  • Tables hold similar records defined by schema:
    • Example: page_views with columns like timestamp, userid, and page_url.
  • Tables are analogous to relational tables with corresponding directories in HDFS.
  • Data can be serialized and stored in various formats, with support for custom serialization schemes.

Working with Tables

  • Loading Data:
    • Command: LOAD DATA LOCAL INPATH '/root/hivedemos/student.tsv' OVERWRITE INTO TABLE EXT_STUDENT;
    • Retrieval: SELECT * from EXT_STUDENT;
  • Table Alterations: Commands to modify tables include renaming, adding columns, or changing data types.

Working with Partitions

  • Partitioning: Logically separates data into subdirectories based on attributes.
  • Example of a partitioned table:
    • CREATE TABLE Sales (sale_id INT, amount FLOAT) PARTITIONED BY (country STRING, year INT, month INT);
  • Benefits include processing only the relevant data for queries, optimizing performance.

Buckets in Hive

  • Buckets further segment data within partitions.
  • Each bucket corresponds to a file in a partitioned directory, calculated via a hash function on a defined column.

Hive Data Types

  • Primitive Types: Includes numeric types (INT, FLOAT), boolean, and string types.
  • Complex Types: Includes structures, arrays, and maps to handle collections and complex data.
  • Supported varying data formats such as text files, sequential files, and columnar formats (RCFile).

Hive Query Language (HQL)

  • HQL is used to create, manage databases/tables/partitions/buckets and supports various relational & logical operations.
  • Creating a Database: CREATE DATABASE IF NOT EXISTS STUDENTS COMMENT 'STUDENT Details' WITH DBPROPERTIES ('creator' = 'JOHN');
  • DDL statements for managing schema and data include create, drop, alter, and describe commands.

Database Operations in Hive

  • Commands to manage databases include describing, showing, and dropping databases as necessary.
  • To list databases: SHOW DATABASES;
  • To use a specific database: USE <data_base_name>;

Types of Tables in Hive

  • Managed Tables: Hive deletes both data and metadata on drop.
  • External Tables: Hive deletes only metadata, maintaining data in its original location.
  • Use the keyword EXTERNAL to create external tables.