MIGRATING TO ICEBERG
MIGRATE procedure in Apache Iceberg is used to convert an Apache Hive table to an Iceberg Table and replace the catalog reference so the Iceberg table will be resolved going forward. Similar to the
MIGRATE scans the location of the original Hive table for data files to create Iceberg metadata.
The main difference between the two procedures is that
MIGRATE will replace the Hive table reference for the Iceberg table and will be the source of truth for the dataset going forward. Because the Iceberg table is considered the new “owner” of the underlying data files, it should be maintained using maintenance procedures like
A reference to the original Hive data will be saved as
table_backup_, but should not be used as a source going forward. If the dataset is modified after the migration is complete, the data referenced by the backup cannot be relied upon because it is stale.
It is best to validate using the
SNAPSHOT procedure and then prepare for this step as a one-way migration.
Note: The Hive table must be an external table for this procedure to work. A managed table can be converted to an external table using
ALTER TABLE <table> SET TBLPROPERTIES ('EXTERNAL'='TRUE')in Hive.
This procedure requires connecting to a Hive Metastore. For the Spark configuration and instructions for running a metastore locally, see this chapter’s background section on Connecting to a Hive metastore.
Creating an example table
To exercise this procedure, first create a Hive table using Parquet as the storage file format. Remember to add the
EXTERNAL keyword so that the table is correctly created as an external table.
CREATE DATABASE IF NOT EXISTS cookbook;
CREATE EXTERNAL TABLE m1 (s string) USING PARQUET
-- Time taken: 0.091 seconds
-- Insert a row to create a data file
INSERT INTO m1 values ('hive data');
-- Time taken: 0.621 seconds
There is now one parquet data file with the value
"hive data", which we will use to migrate to an Iceberg table.
Running the migrate procedure
1 -- The number of files imported during the migration process
Now that the migration is complete,
cookbook.m1 references the Iceberg table and the Hive table has been renamed as
cookbook.m1_backup_. Both table references share the same table location, but all data is now owned by the Iceberg table and will be affected by maintenance procedures.
Rolling back is possible by dropping the Iceberg table and renaming the backup table, as long as there are no new changes to the Iceberg table that will not be reflected in the backup table.