Analytics/Cluster/Hive/Avro

From Wikitech
< Analytics‎ | Cluster‎ | Hive

This describes how to do evolution of avro schemas in hive.

Intro

Despite of what you might read here: https://cwiki.apache.org/confluence/display/Hive/AvroSerDe we have found that avro in hive has several gotchas. Thus we are documenting what has worked for us.


Create tables with avro schema

We have found that easies is to use the avro.schema.literal. property.

Example of how to create a table (if schema is too large you might need to remove all whitespace)

CREATE TABLE nuria_avro
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED as INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
'avro.schema.literal'='
{
    "type": "record",
     "namespace": "org.wikimedia.analytics.schemas",
    "name": "testTable",
    "fields": [
        {"name": "userName",        "type": "string", "default":""},
        {"name": "favouriteNumber", "type": "int", "default":1},
        {"name": "interests",       "type": {"type": "array", "items": "string"}, "default":[]}
    ]
}');


Update schema

Updating schema requires updating the TBLPROPERTY. Please note that optional fields in avro have to be defined a certain way using unions, see specifications for details: [1]


ALTER TABLE nuria_avro SET TBLPROPERTIES ('avro.schema.literal'='
{
    "type": "record",
    "namespace": "org.wikimedia.analytics.schemas",
    "name": "testTable",
    "fields": [
        {"name": "userName",        "type": "string", "default":""},
        {"name":"lastName",         "type":["string", "null"], "default":""},
        {"name": "favouriteNumber", "type": "int", "default":1},
        {"name": "interests",       "type": {"type": "array", "items": "string"}, "default":[]}
    ]
}');

External Tables

When interacting with data that is not created within hive, for example data written out by Camus, an external table needs to be created to point at that data.

CREATE EXTERNAL TABLE `avro_test`
PARTITIONED BY ( 
  `year` string, 
  `month` string, 
  `day` string, 
  `hour` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  'hdfs://analytics-hadoop/wmf/data/raw/mediawiki/mediawiki_testTable/hourly'
TBLPROPERTIES (
  'avro.schema.literal'='{
    "type": "record",
    "namespace": "org.wikimedia.analytics.schemas",
    "name": "testTable",
    "fields": [
        {"name": "userName",        "type": "string", "default":""},
        {"name":"lastName",         "type":["string", "null"], "default":""},
        {"name": "favouriteNumber", "type": "int", "default":1}
    ]
  }'
)