Hive MIGRATE

MENU – Apache Iceberg Cookbook

Apache Iceberg Cookbook

Introduction

Getting Started

Basics

Data Engineering

Pyiceberg

Data Operations

Migrating to Iceberg

MIGRATING TO ICEBERG

The MIGRATE procedure in Apache Iceberg is used to convert an Apache Hive table to an Iceberg Table and replace the catalog reference so the Iceberg table will be resolved going forward. Similar to the SNAPSHOT procedure, MIGRATE scans the location of the original Hive table for data files to create Iceberg metadata.

The main difference between the two procedures is that MIGRATE will replace the Hive table reference for the Iceberg table and will be the source of truth for the dataset going forward. Because the Iceberg table is considered the new “owner” of the underlying data files, it should be maintained using maintenance procedures like expire_snapshots.

A reference to the original Hive data will be saved as table_backup_, but should not be used as a source going forward. If the dataset is modified after the migration is complete, the data referenced by the backup cannot be relied upon because it is stale.

It is best to validate using the SNAPSHOT procedure and then prepare for this step as a one-way migration.

Note: The Hive table must be an external table for this procedure to work. A managed table can be converted to an external table using ALTER TABLE <table> SET TBLPROPERTIES ('EXTERNAL'='TRUE')in Hive.

This procedure requires connecting to a Hive Metastore. For the Spark configuration and instructions for running a metastore locally, see this chapter’s background section on Connecting to a Hive metastore.

Creating an example table

To exercise this procedure, first create a Hive table using Parquet as the storage file format. Remember to add the EXTERNAL keyword so that the table is correctly created as an external table.

CREATE DATABASE IF NOT EXISTS cookbook;
USE cookbook;

CREATE EXTERNAL TABLE m1 (s string) USING PARQUET
    LOCATION 'file:/tmp/hive/data/warehouse/';
-- Time taken: 0.091 seconds

-- Insert a row to create a data file

INSERT INTO m1 values ('hive data');
-- Time taken: 0.621 seconds

There is now one parquet data file with the value "hive data", which we will use to migrate to an Iceberg table.

Running the migrate procedure

CALL system.migrate('cookbook.m1');
1  -- The number of files imported during the migration process

SHOW TABLES;
m1
m1_backup_

Now that the migration is complete, cookbook.m1 references the Iceberg table and the Hive table has been renamed as cookbook.m1_backup_. Both table references share the same table location, but all data is now owned by the Iceberg table and will be affected by maintenance procedures.

Rolling back is possible by dropping the Iceberg table and renaming the backup table, as long as there are no new changes to the Iceberg table that will not be reflected in the backup table.