User Manual
Capture Lineage Information In Hive Hooks
Jul 29, 2025
Background In Hive, lineage information is captured in the form of a LineageInfo object. This object is created in the SemanticAnalyzer and is passed to the HookContext object. Users can use the following existing hooks, or implement their own custom hooks, to capture this information and utilize it.
Existing Hooks: org.apache.hadoop.hive.ql.hooks.PostExecutePrinter, org.apache.hadoop.hive.ql.hooks.LineageLogger, and org.apache.atlas.hive.hook.HiveHook. A custom hook can be written to capture lineage information in cases where these existing hooks are not configured in Hive.
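To see lineage output without writing any code, the built-in LineageLogger can be registered as a post-execution hook. A minimal sketch (the table name and query are illustrative only):
SET hive.exec.post.hooks=org.apache.hadoop.hive.ql.hooks.LineageLogger;
CREATE TABLE lineage_demo AS SELECT * FROM src;  -- lineage for this statement is logged as a JSON lineage graph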
Apache Hive : AccumuloIntegration
Dec 12, 2024
Apache Hive : Accumulo Integration Apache Hive : Accumulo Integration Overview Implementation Accumulo Configuration Usage Column Mapping Indexing Other options Examples Override the Accumulo table name Store a Hive map with binary serialization Register an external table Create an indexed table Acknowledgements Overview Apache Accumulo is a sorted, distributed key-value store based on the Google BigTable paper. The API methods that Accumulo provides are in terms of Keys and Values which present the highest level of flexibility in reading and writing data; however, higher-level query abstractions are typically an exercise left to the user.
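As a sketch of how the integration is typically used (the table, column family, and column names here are illustrative), a Hive table can be mapped onto an Accumulo table through the Accumulo storage handler and a column mapping:
CREATE TABLE accumulo_demo (rowid STRING, name STRING, age INT)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES ('accumulo.columns.mapping' = ':rowid,person:name,person:age');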
Apache Hive : AuthDev
Dec 12, 2024
Apache Hive : AuthDev This is the design document for the original Hive authorization mode. See Authorization for an overview of authorization modes, which include storage based authorization and SQL standards based authorization.
Apache Hive : AuthDev 1. Privilege 1.1 Access Privilege 2. Hive Operations 3. Metadata 3.1 user, group, and roles 3.1.1 Role management 3.1.2 role metadata 3.1.3 hive role user membership table 3.
Apache Hive : AvroSerDe
Dec 12, 2024
Apache Hive : AvroSerDe Apache Hive : AvroSerDe Availability Overview – Working with Avro from Hive Requirements Avro to Hive type conversion Creating Avro-backed Hive tables Writing tables to Avro files Specifying the Avro schema for a table HBase Integration If something goes wrong FAQ Availability Earliest version AvroSerde is available
The AvroSerde is available in Hive 0.9.1 and greater.
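In Hive 0.14.0 and later an Avro-backed table can also be declared directly with STORED AS AVRO; the column list below is only an illustrative sketch:
CREATE TABLE avro_demo (id INT, name STRING)
STORED AS AVRO;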
Apache Hive : CompressedStorage
Dec 12, 2024
Apache Hive : CompressedStorage Compressed Data Storage Keeping data compressed in Hive tables has, in some cases, been known to give better performance than uncompressed storage, both in terms of disk usage and query performance.
You can import text files compressed with Gzip or Bzip2 directly into a table stored as TextFile. The compression will be detected automatically and the file will be decompressed on-the-fly during query execution. For example:
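(A sketch along the lines of the original wiki example; the file path and table definition are hypothetical.)
CREATE TABLE raw (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw;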
Apache Hive : Configuration Properties
Dec 12, 2024
Apache Hive : Configuration Properties This document describes the Hive user configuration properties (sometimes called parameters, variables, or options), and notes which releases introduced new properties.
The canonical list of configuration properties is managed in the HiveConf Java class, so refer to the HiveConf.java file for a complete list of configuration properties available in your Hive release.
For information about how to use these configuration properties, see Configuring Hive. That document also describes administrative configuration properties for setting up Hive in the Configuration Variables section.
Apache Hive : Cost-based optimization in Hive
Dec 12, 2024
Apache Hive : Cost-based optimization in Hive Apache Hive : Cost-based optimization in Hive Abstract 1. INTRODUCTION 2. RELATED WORK STATS PAPERS 3. BACKGROUND Hive Query optimization issues TEZ Join algorithms in Hive Multi way Join Common Join Map Join Bucket Map Join SMB Join Skew Join 4. Implementation details Phase 1 Phase 2 Phase 3 Configuration Proposed Cost Model Table Scan Common Join Map Join Bucket Map Join SMB Join Skew Join Distinct/Group By Union All Filter/Having Select Filter Selectivity Join Cardinality (without Histogram) Distinct Estimation 5.
Apache Hive : CSV Serde
Dec 12, 2024
Apache Hive : CSV Serde Apache Hive : CSV Serde Availability Background Usage Versions Availability Earliest version CSVSerde is available
The CSVSerde is available in Hive 0.14 and greater.
Background The CSV SerDe is based on https://github.com/ogrodnek/csv-serde, and was added to the Hive distribution in HIVE-7777.
Limitation
This SerDe treats all columns as type String. Even if you create a table with non-string column types using this SerDe, the DESCRIBE TABLE output will show the columns as string.
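As a minimal usage sketch (the table, columns, and delimiter characters are illustrative), a CSV-backed table is declared through the OpenCSVSerde class, with optional separator, quote, and escape characters:
CREATE TABLE csv_demo (a STRING, b STRING, c STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"", "escapeChar" = "\\")
STORED AS TEXTFILE;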
Apache Hive : Data Connector for Hive and Hive-like engines What is a Data connector? Data connectors (referred to as “connector” in Hive Query Language) are top level objects in Hive where users can define a set of properties required to be able to connect to an external datasource from Hive. This document illustrates how the data connector framework can be used to do SQL query federation between two distinct Hive clusters/installations, or between Hive and other Hive-like compute engines (e.g. EMR).
Apache Hive : Data Connectors in Hive
Dec 12, 2024
Apache Hive : Data Connectors in Hive What is a Data connector? Data connectors (referred to as “connector” in Hive Query Language) are top level objects in Hive where users can define a set of properties required to be able to connect to a datasource from Hive. A connector has a type (a closed, enumerated set) that allows Hive to determine the driver class (for JDBC) and other URL parameters, a URL, and a set of properties that could include the default credentials for the remote datasource.
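The DDL below is a hedged sketch of defining and using a connector; the connector name, URL, credentials, and remote database name are placeholders, and the exact property keys should be verified against the connector documentation for your release.
CREATE CONNECTOR mysql_conn
TYPE 'mysql'
URL 'jdbc:mysql://localhost:3306'
WITH DCPROPERTIES ("hive.sql.dbcp.username"="hiveuser", "hive.sql.dbcp.password"="hivepass");

CREATE REMOTE DATABASE mysql_db USING mysql_conn WITH DBPROPERTIES ("connector.remoteDbName"="userdb");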
Apache Hive : Druid Integration
Dec 12, 2024
Apache Hive : Druid Integration This page documents the work done for the integration between Druid and Hive, introduced in Hive 2.2.0 (HIVE-14217). Initially it was compatible with Druid 0.9.1.1, the latest stable release of Druid at that time.
Apache Hive : Druid Integration Objectives Preliminaries Druid Storage Handlers Usage Discovery and management of Druid datasources from Hive Create tables linked to existing Druid datasources Create Druid datasources from Hive Druid kafka ingestion from Hive INSERT, INSERT OVERWRITE and DROP statements Queries completely executed in Druid Queries across Druid and Hive Open Issues (JIRA) Objectives Our main goal is to be able to index data from Hive into Druid, and to be able to query Druid datasources from Hive.
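For instance, an existing Druid datasource can be registered as a Hive external table through the Druid storage handler; this sketch assumes a datasource named 'wikipedia' already exists in Druid and that the Druid broker address has been configured for Hive.
CREATE EXTERNAL TABLE druid_wikipedia
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wikipedia");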
Apache Hive : FileFormats
Dec 12, 2024
Apache Hive : FileFormats File Formats and Compression File Formats Hive supports several file formats:
Text File SequenceFile RCFile Avro Files ORC Files Parquet Custom INPUTFORMAT and OUTPUTFORMAT The hive.default.fileformat configuration parameter determines the format to use if it is not specified in a CREATE TABLE or ALTER TABLE statement. Text file is the parameter’s default value.
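For example (a sketch; the table definition is illustrative), changing the default makes subsequent CREATE TABLE statements without an explicit STORED AS clause use the new format:
SET hive.default.fileformat=ORC;
CREATE TABLE events (id INT, payload STRING);  -- stored as ORC because of the setting above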
For more information, see the sections Storage Formats and Row Formats & SerDe on the DDL page.
Apache Hive : HBaseIntegration
Dec 12, 2024
Apache Hive : HBase Integration This page documents the Hive/HBase integration support originally introduced in HIVE-705. This feature allows Hive QL statements to access HBase tables for both read (SELECT) and write (INSERT). It is even possible to combine access to HBase tables with native Hive tables via joins and unions.
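The canonical example from the integration documentation maps a simple Hive table onto an HBase table (the table names and column family are taken from that example and are illustrative):
CREATE TABLE hbase_table_1 (key INT, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");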
A presentation is available from the HBase HUG10 Meetup
This feature is a work in progress, and suggestions for its improvement are very welcome.
Apache Hive : Hive Aws EMR
Dec 12, 2024
Apache Hive : Hive Aws EMR Amazon Elastic MapReduce and Hive Amazon Elastic MapReduce is a web service that makes it easy to launch managed, resizable Hadoop clusters on the web-scale infrastructure of Amazon Web Services (AWS). Elastic MapReduce makes it easy for you to launch a Hive and Hadoop cluster, provides you with the flexibility to choose different cluster sizes, and allows you to tear them down automatically when processing has completed.
Apache Hive : Hive Configurations
Dec 12, 2024
Apache Hive : Hive Configurations Hive has more than 1600 configuration properties across the service. The hive-site.xml file contains the default configurations for the service, and you can change configuration values in this file. Every configuration change requires a restart of the affected service(s).
Here you can find the most important configurations and default values.
hive.metastore.client.cache.v2.enabled (default value: true, defined in MetastoreConf) - enables a Caffeine cache for the Metastore client. More configs are in MetastoreConf.
Apache Hive : Hive deprecated authorization mode / Legacy Mode This document describes Hive security using the basic authorization scheme, which regulates access to Hive metadata on the client side. This was the default authorization mode used when authorization was enabled. The default was changed to SQL Standard authorization in Hive 2.0 (HIVE-12429).
Apache Hive : Hive deprecated authorization mode / Legacy Mode Disclaimer Prerequisites Users, Groups, and Roles Creating/Dropping/Using Roles Privileges Hive Operations and Required Privileges Disclaimer Hive authorization is not completely secure.
Apache Hive : Hive HPL/SQL
Dec 12, 2024
Apache Hive : Hive HPL/SQL Hive Hybrid Procedural SQL On Hadoop (HPL/SQL) is a tool that implements procedural SQL for Hive. It is available in Hive 2.0.0 (HIVE-11055).
HPL/SQL is an open source tool (Apache License 2.0) that implements procedural SQL language for Apache Hive, SparkSQL, Impala as well as any other SQL-on-Hadoop implementation, any NoSQL and any RDBMS.
HPL/SQL is a hybrid and heterogeneous language that understands the syntax and semantics of almost any existing procedural SQL dialect, and you can use it with any database, for example, running existing Oracle PL/SQL code on Apache Hive and Microsoft SQL Server, or running Transact-SQL on Oracle, Cloudera Impala or Amazon Redshift.
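As a rough, hedged illustration of the flavor of the language (the procedure name and body are made up; consult the HPL/SQL reference for the exact syntax supported by your version):
CREATE PROCEDURE greet(name STRING)
BEGIN
  PRINT 'Hello, ' || name;
END;

CALL greet('Hive');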
Apache Hive : Hive Metrics
Dec 12, 2024
Apache Hive : Hive Metrics The metrics that Hive collects can be viewed in the HiveServer2 Web UI by using the “Metrics Dump” tab.
The metrics dump displays every metric available over JMX, encoded as JSON. Alternatively, the metrics can be written directly into HDFS, to a JSON file on the local file system where the HS2 instance is running, or to the console by enabling the corresponding metric reporters. By default only the JMX and the JSON file reporters are enabled.
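The reporters are selected through hive-site.xml properties along these lines (the values shown are the commonly cited defaults; verify the property names against HiveConf for your release):
hive.service.metrics.reporter = JSON_FILE, JMX
hive.service.metrics.file.location = /tmp/report.json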
Apache Hive : Hive on Spark
Dec 12, 2024
Apache Hive : Hive on Spark Apache Hive : Hive on Spark 1. Introduction 1.1 Motivation 1.2 Design Principle 1.3 Comparison with Shark and Spark SQL 1.4 Other Considerations 2. High-Level Functionality 2.1 A New Execution Engine 2.2 Spark Configuration 2.3 Miscellaneous Functionality 3. Hive-Level Design 3.1 Query Planning 3.2 Job Execution 3.3 Design Considerations Table as RDD SparkWork SparkTask Shuffle, Group, and Sort Join Number of Tasks Local MapReduce Tasks Semantic Analysis and Logical Optimizations Job Diagnostics Counters and Metrics Explain Statements Hive Variables Union Concurrency and Thread Safety Build Infrastructure Mini Spark Cluster Testing 3.
Apache Hive : Hive Transactions
Dec 12, 2024
Apache Hive : ACID Transactions Apache Hive : ACID Transactions Upgrade to Hive 3+ What is ACID and why should you use it? Limitations Streaming APIs Grammar Changes Basic Design Base and Delta Directories Compactor Transaction/Lock Manager Configuration New Configuration Parameters for Transactions Configuration Values to Set for INSERT, UPDATE, DELETE Configuration Values to Set for Compaction Compaction pooling Table Properties Talks and Presentations Upgrade to Hive 3+ Any transactional tables created by a Hive version prior to Hive 3 require Major Compaction to be run on every partition before upgrading to 3.
Apache Hive : Hive Transactions (Hive ACID)
Dec 12, 2024
Apache Hive : Hive Transactions (Hive ACID) Apache Hive : Hive Transactions (Hive ACID) What is ACID and why should you use it? Limitations Streaming APIs Grammar Changes Basic Design Base and Delta Directories Compactor Transaction/Lock Manager Configuration New Configuration Parameters for Transactions Configuration Values to Set for Hive ACID (INSERT, UPDATE, DELETE) Configuration Values to Set for Compaction Compaction pooling Table Properties Talks and Presentations What is ACID and why should you use it?
Apache Hive : Hive-Iceberg Integration
Dec 12, 2024
Apache Hive : Hive-Iceberg Integration Starting from version 4.0, Apache Hive supports the Iceberg table format out of the box. Iceberg tables can be created like regular Hive external or ACID tables, without adding any extra jars.
Creating an Iceberg Table
An Iceberg table can be created by using the STORED BY ICEBERG clause when creating a table.
Creating an Iceberg table using the normal CREATE command:
CREATE TABLE TBL_ICE (ID INT) STORED BY ICEBERG;
The above creates an Iceberg table named 'TBL_ICE'.
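Partitioning and an explicit file format can be combined with the same clause; the sketch below is illustrative (table, column, and partition names are placeholders):
CREATE EXTERNAL TABLE TBL_ICE_PART (ID INT, NAME STRING)
PARTITIONED BY (DEPT STRING)
STORED BY ICEBERG
STORED AS ORC;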
Apache Hive : HiveAws HivingS3nRemotely
Dec 12, 2024
Apache Hive : HiveAws HivingS3nRemotely Querying S3 files from your PC (using EC2, Hive and Hadoop)
Usage Scenario The scenario being covered here goes as follows:
A user has data stored in S3 - for example, Apache log files archived in the cloud, or databases backed up into S3. The user would like to declare tables over these data sets and issue SQL queries against them. These SQL queries should be executed using compute resources provisioned from EC2.
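Declaring such a table typically amounts to an external table whose LOCATION points at the S3 path; the bucket, path, and column below are placeholders (the page predates s3a, hence the s3n scheme):
CREATE EXTERNAL TABLE s3_logs (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://my-bucket/logs/';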
Apache Hive : HiveClient
Dec 12, 2024
Apache Hive : HiveClient This page describes the different clients supported by Hive. The command line client currently only supports an embedded server. The JDBC and Thrift-Java clients support both embedded and standalone servers. Clients in other languages only support standalone servers.
For details about the standalone server see Hive Server or HiveServer2.
Apache Hive : HiveClient Command Line JDBC JDBC Client Sample Code Running the JDBC Sample Code JDBC Client Setup for a Secure Cluster Python PHP ODBC Thrift Thrift Java Client Thrift C++ Client Thrift Node Clients Thrift Ruby Client Command Line Operates in embedded mode only, that is, it needs to have access to the Hive libraries.
Apache Hive : HiveCounters
Dec 12, 2024
Apache Hive : HiveCounters Task counters created by Hive during query execution
For Tez execution, %context is set to the mapper/reducer name. For other execution engines it is not included in the counter name.
Counter Name - Description
RECORDS_IN[_%context] - Input records read
RECORDS_OUT[_%context] - Output records written
RECORDS_OUT_INTERMEDIATE[_%context] - Records written as intermediate records to ReduceSink (which become input records to other tasks)
CREATED_FILES - Number of files created
DESERIALIZE_ERRORS - Deserialization errors encountered while reading data
Apache Hive : HiveServer2 Clients
Dec 12, 2024
Apache Hive : HiveServer2 Clients This page describes the different clients supported by HiveServer2.
Apache Hive : HiveServer2 Clients Version information Beeline – Command Line Shell Beeline Example Beeline Commands Beeline Properties Beeline Hive Commands Beeline Command Options Output Formats HiveServer2 Logging Cancelling the Query Background Query in Terminal Script JDBC Connection URLs Connection URL Format Connection URL for Remote or Embedded Mode Connection URL When HiveServer2 Is Running in HTTP Mode Connection URL When SSL Is Enabled in HiveServer2 Connection URL When ZooKeeper Service Discovery Is Enabled Named Connection URLs Reconnecting Using hive-site.
Apache Hive : HiveServer2 Overview
Dec 12, 2024
Apache Hive : HiveServer2 Overview Apache Hive : HiveServer2 Overview Introduction HS2 Architecture Server Transport Protocol Processor Dependencies of HS2 JDBC Client Source Code Description Server Side Client Side Interaction between Client and Server Resources Introduction HiveServer2 (HS2) is a service that enables clients to execute queries against Hive. HiveServer2 is the successor to HiveServer1 which has been deprecated. HS2 supports multi-client concurrency and authentication.
Apache Hive : JDBC Storage Handler
Dec 12, 2024
Apache Hive : JDBC Storage Handler Apache Hive : JDBC Storage Handler Syntax Table Properties Supported Data Type Column/Type Mapping Auto Shipping Securing Password Partitioning Computation Pushdown Using a Non-default Schema MariaDB MS SQL Oracle PostgreSQL Syntax JdbcStorageHandler supports reading from a JDBC data source in Hive. Writing to a JDBC data source is currently not supported. To use JdbcStorageHandler, you need to create an external table using JdbcStorageHandler.
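A representative table definition looks roughly like the following; the driver class, URL, credentials, and remote table name are placeholders, and the property keys follow the hive.sql.* naming used by the storage handler:
CREATE EXTERNAL TABLE student_jdbc (name STRING, age INT, gpa DOUBLE)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
  "hive.sql.database.type" = "MYSQL",
  "hive.sql.jdbc.driver" = "com.mysql.jdbc.Driver",
  "hive.sql.jdbc.url" = "jdbc:mysql://localhost/sample",
  "hive.sql.dbcp.username" = "hive",
  "hive.sql.dbcp.password" = "hive",
  "hive.sql.table" = "STUDENT"
);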
Apache Hive : Kudu Integration
Dec 12, 2024
Apache Hive : Kudu Integration Apache Hive : Kudu Integration Overview Implementation Hive Configuration Table Creation Impala Tables Data Ingest Examples Overview Apache Kudu is an open source data storage engine that makes fast analytics on fast and changing data easy. Implementation The initial implementation was added to Hive 4.0 in HIVE-12971 and is designed to work with Kudu 1.
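A hedged sketch of registering an existing Kudu table in Hive follows; the Kudu table name and master address are placeholders, and the kudu.* property keys should be confirmed against the storage handler documentation for your release:
CREATE EXTERNAL TABLE kudu_demo
STORED BY 'org.apache.hadoop.hive.kudu.KuduStorageHandler'
TBLPROPERTIES (
  "kudu.table_name" = "default.kudu_demo",
  "kudu.master_addresses" = "localhost:7051"
);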
Apache Hive : Materialized views in Hive
Dec 12, 2024
Apache Hive : Materialized views in Hive Objectives Traditionally, one of the most powerful techniques used to accelerate query processing in data warehouses is the pre-computation of relevant summaries or materialized views.
The initial implementation focuses on introducing materialized views and automatic query rewriting based on those materializations in the project. In particular, materialized views can be stored natively in Hive or in other systems such as Druid using custom storage handlers, and they can seamlessly exploit new exciting Hive features such as LLAP acceleration.
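For example, a summary can be materialized once and then picked up by the automatic rewriting algorithm; the table, columns, and view name here are illustrative:
CREATE MATERIALIZED VIEW mv_sales_by_dept AS
SELECT dept, SUM(amount) AS total
FROM sales
GROUP BY dept;

-- queries such as SELECT dept, SUM(amount) FROM sales GROUP BY dept
-- can now be answered from the materialized view when rewriting is enabled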
Apache Hive : MultiDelimitSerDe
Dec 12, 2024
Apache Hive : MultiDelimitSerDe Introduction Introduced in HIVE-5871, MultiDelimitSerDe allows users to specify a multiple-character string as the field delimiter when creating a table.
Version Hive 0.14.0 and later.
Hive QL Syntax You can use MultiDelimitSerDe in a create table statement like this:
CREATE TABLE test (id string, hivearray array<binary>, hivemap map<string,int>)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim"="[,]","collection.delim"=":","mapkey.delim"="@");
where field.delim is the field delimiter, and collection.delim and mapkey.delim are the delimiters for collection items and key-value pairs, respectively.
Apache Hive : Parquet
Dec 12, 2024
Apache Hive : Parquet Parquet is supported by a plugin in Hive 0.10, 0.11, and 0.12 and natively in Hive 0.13 and later.
Apache Hive : Parquet Introduction Native Parquet Support Hive 0.10, 0.11, and 0.12 Hive 0.13 HiveQL Syntax Hive 0.10 - 0.12 Hive 0.13 and later Versions and Limitations Hive 0.13.0 Hive 0.14.0 Hive 1.1.0 Hive 1.2.0 Resources Introduction Parquet (http://parquet.
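In Hive 0.13 and later, a Parquet-backed table needs nothing beyond the STORED AS clause (the column list is illustrative):
CREATE TABLE parquet_demo (id INT, name STRING)
STORED AS PARQUET;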
Apache Hive : Permission Inheritance in Hive
Dec 12, 2024
Apache Hive : Permission Inheritance in Hive This document describes how attributes (permission, group, extended ACL’s) of files representing Hive data are determined.
HDFS Background When a file or directory is created, its owner is the user identity of the client process, and its group is inherited from the parent (the BSD rule). Permissions are taken from the default umask. Extended ACLs are taken from the parent unless they are set explicitly. Goals To reduce the need to set fine-grained file security properties after every operation, users may want the following Hive warehouse files/directories to auto-inherit security properties from their parent directories:
Apache Hive : Query ReExecution
Dec 12, 2024
Apache Hive : Query ReExecution Query re-execution provides a facility to re-run a query multiple times in case an unfortunate event (such as a failure) happens. Introduced in Hive 3.0 (HIVE-17626)
Apache Hive : Query ReExecution ReExecution strategies Overlay Reoptimize Operator Matching Configuration ReExecution strategies Overlay Enables changing the Hive settings for all re-executions that will happen. It works by adding a configuration subtree as an overlay to the actual Hive settings (reexec.
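Re-execution is governed by configuration properties along these lines (names as introduced with HIVE-17626; the values shown are illustrative):
SET hive.query.reexecution.enabled=true;
SET hive.query.reexecution.strategies=overlay,reoptimize;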
Apache Hive : RCFile
Dec 12, 2024
Apache Hive : RCFile RCFile (Record Columnar File) is a data placement structure designed for MapReduce-based data warehouse systems. Hive added the RCFile format in version 0.6.0.
RCFile stores table data in a flat file consisting of binary key/value pairs. It first partitions rows horizontally into row splits, and then it vertically partitions each row split in a columnar way. RCFile stores the metadata of a row split as the key part of a record, and all the data of a row split as the value part.
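Using the format only requires the STORED AS clause (the column list is illustrative):
CREATE TABLE rcfile_demo (id INT, name STRING)
STORED AS RCFILE;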
Apache Hive : RCFileCat
Dec 12, 2024
Apache Hive : RCFileCat $HIVE_HOME/bin/hive --rcfilecat is a shell utility which can be used to print data or metadata from RC files.
Apache Hive : RCFileCat Data Metadata Data Prints out the rows stored in an RCFile, columns are tab separated and rows are newline separated.
Usage:
hive --rcfilecat [--start=start_offset] [--length=len] [--verbose] fileName
--start=start_offset  Start offset to begin reading in the file
--length=len  Length of data to read from the file
--verbose  Prints periodic stats about the data read: how many records, how many bytes, scan rate
Metadata New in 0.
Apache Hive : Rebalance compaction
Dec 12, 2024
Apache Hive : Rebalance compaction In order to improve performance, Hive under the hood creates bucket files even for non-explicitly bucketed tables. Depending on the usage, the data loaded into these non-explicitly bucketed full-acid ORC tables may lead to an unbalanced distribution, where some of the buckets are much larger (> 100 times) than the others. Unbalanced tables have a performance penalty, as larger buckets take more time to read. Rebalance compaction addresses this issue by redistributing the data equally among the implicit bucket files.
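A rebalance compaction is requested with the usual compaction DDL; this is a hedged sketch in which the table name is a placeholder and the compaction-type keyword should be checked against the documentation for your release:
ALTER TABLE acid_demo COMPACT 'rebalance';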
Apache Hive : Replacing the Implementation of Hive CLI Using Beeline Apache Hive : Replacing the Implementation of Hive CLI Using Beeline Why Replace the Existing Hive CLI? Hive CLI Functionality Support Hive CLI Options Support Examples Hive CLI Interactive Shell Commands Support Hive CLI Configuration Support Performance Impacts Why Replace the Existing Hive CLI? Hive CLI is a legacy tool which had two main use cases.
Apache Hive : SerDe
Dec 12, 2024
Apache Hive : SerDe Apache Hive : SerDe SerDe Overview Built-in and Custom SerDes Built-in SerDes Custom SerDes HiveQL for SerDes Input Processing Output Processing Additional Notes Comments: SerDe Overview SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for IO. The interface handles both serialization and deserialization and also interpreting the results of serialization as individual fields for processing.
Apache Hive : StarRocks Integration
Dec 12, 2024
Apache Hive : StarRocks Integration StarRocks has the ability to setup a Hive catalog which enables you to query data from Hive without loading data into StarRocks or creating external tables. See here for more information.
Apache Hive : Storage Based Authorization in the Metastore Server The metastore server security feature with storage based authorization was added to Hive in release 0.10. This feature was introduced previously in HCatalog.
HIVE-3705 added metastore server security to Hive in release 0.10.0.
For additional information about storage based authorization in the metastore server, see the HCatalog document Storage Based Authorization. For an overview of Hive authorization models and other security options, see the Authorization document.
Apache Hive : Streaming Data Ingest
Dec 12, 2024
Apache Hive : Streaming Data Ingest Apache Hive : Streaming Data Ingest Hive 3 Streaming API Hive HCatalog Streaming API Streaming Mutation API Streaming Requirements Limitations API Usage Transaction and Connection Management HiveEndPoint StreamingConnection TransactionBatch Notes about the HiveConf Object I/O – Writing Data RecordWriter DelimitedInputWriter StrictJsonWriter StrictRegexWriter AbstractRecordWriter Error Handling Example – Non-secure Mode Example – Secure Streaming Knowledge Base Hive 3 Streaming API Hive 3 Streaming API Documentation - new API available in Hive 3
Apache Hive : Streaming Data Ingest V2
Dec 12, 2024
Apache Hive : Streaming Data Ingest V2 Starting in release Hive 3.0.0, Streaming Data Ingest is deprecated and is replaced by the newer V2 API (HIVE-19205). Apache Hive : Streaming Data Ingest V2 Hive Streaming API Streaming Mutation API Deprecation and Removal Streaming Requirements Limitations API Usage Transaction and Connection Management HiveStreamingConnection Notes about the HiveConf Object I/O – Writing Data RecordWriter StrictDelimitedInputWriter StrictJsonWriter StrictRegexWriter AbstractRecordWriter Error Handling Example Hive Streaming API Traditionally, adding new data into Hive requires gathering a large amount of data onto HDFS and then periodically adding a new partition.
Apache Hive : TeradataBinarySerde
Dec 12, 2024
Apache Hive : TeradataBinarySerde Apache Hive : TeradataBinarySerde Availability Overview How to export How to import Usage Availability Earliest version TeradataBinarySerDe is available
The TeradataBinarySerDe is available in Hive 2.4 or greater.
Overview Teradata can use TPT (Teradata Parallel Transporter) or BTEQ (Basic Teradata Query) to export and import data files compressed with gzip at very high speed. However, such binary files are encoded in Teradata's proprietary format and can't be directly consumed by Hive without a customized SerDe.
Apache Hive : Transitivity on predicate pushdown
Dec 12, 2024
Apache Hive : Transitivity on predicate pushdown Before Hive 0.8.0, the query
set hive.mapred.mode=strict;
create table invites (foo int, bar string) partitioned by (ds string);
create table invites2 (foo int, bar string) partitioned by (ds string);
select count(*) from invites join invites2 on invites.ds=invites2.ds where invites.ds='2011-01-01';
would give the error
Error in semantic analysis: No Partition Predicate Found for Alias "invites2" Table "invites2"
Here, the filter is applied to the table invites as invites.
Apache Hive : Tutorial
Dec 12, 2024
Apache Hive : Tutorial Apache Hive : Tutorial Concepts What Is Hive What Hive Is NOT Getting Started Data Units Type System Primitive Types Complex Types Timestamp Built In Operators and Functions Built In Operators Built In Functions Language Capabilities Usage and Examples Creating, Showing, Altering, and Dropping Tables Creating Tables Browsing Tables and Partitions Altering Tables Dropping Tables and Partitions Loading Data HIVE-5999 HIVE-11996 Querying and Inserting Data Simple Query Partition Based Query Joins Aggregations Multi Table/File Inserts Dynamic-Partition Insert Inserting into Local Files Sampling Union All Array Operations Map (Associative Arrays) Operations Custom Map/Reduce Scripts Co-Groups Concepts What Is Hive Hive is a data warehousing infrastructure based on Apache Hadoop.
Apache Hive : Union Optimization
Dec 12, 2024
Apache Hive : Union Optimization Consider the query
select * from
  (subq1
   UNION ALL
   subq2) u;
If the parents of the union are map-reduce jobs, they will write their output to temporary files. The union will then read the rows from these temporary files and write them to a final directory. In effect, the results are read and written twice unnecessarily. We can avoid this by writing directly to the final directory.
Apache Hive : User FAQ
Dec 12, 2024
Apache Hive : User FAQ Apache Hive : User FAQ General I see errors like: Server access Error: Connection timed out url=http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz How to change the warehouse.dir location for older tables? When running a JOIN query, I see out-of-memory errors. I am using MySQL as metastore and I see errors: “com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure” Does Hive support Unicode? Hive SQL Are Hive SQL identifiers (e.
Apache Hive : Using TiDB as the Hive Metastore database Apache Hive : Using TiDB as the Hive Metastore database Why use TiDB in Hive as the Metastore database? How to create a Hive cluster with TiDB Components required Install a Hive cluster Step 1: Deploy a TiDB cluster Step 2: Configure Hive Step 3: Initialize metadata Step 4: Launch Metastore and test Conclusion FAQ Why use TiDB in Hive as the Metastore database?