Analytics/Cluster/Deploy a fix to incorrect camus partitioning
It's possible that you deployed and started fetching data from a Kafka topic, but your Camus job created incorrect partitions. This can happen if you set an incorrect value for the camus.message.timestamp.format or camus.message.timestamp.field property.

In such a case Camus may fall back to the import time instead of the actual log timestamp. This leads to incorrect partitioning like this (in this example ts is the field that contains the log timestamp in Unix seconds):
 select from_unixtime(min(ts)), from_unixtime(max(ts))
 from cirrussearchrequestset
 where year=2015 and month=11 and day=5 and hour=2
 limit 1;

 _c0                  _c1
 2015-11-05 01:15:10  2015-11-05 02:15:10
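To avoid this fallback, the two timestamp properties have to match the message payload. Below is a minimal sketch, assuming a JSON payload whose ts field holds Unix seconds; the accepted values for camus.message.timestamp.format depend on the decoder shipped with your Camus build, so verify them against your refinery-camus version:

```properties
# Name of the payload field holding the event timestamp (assumed field "ts")
camus.message.timestamp.field=ts
# How to parse that field; "unix_seconds" is an assumption here -- check which
# format strings your decoder actually supports before relying on it
camus.message.timestamp.format=unix_seconds
```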
In this scenario you will have to deploy a fix (either in the camus.properties file or in the refinery-camus artifact), and you will also have to follow these deployment steps:
- Deploy the fix
- Comment out your Camus job's crontab line
- Re-create your table or delete the existing partitions
- Archive the HDFS paths used by Camus (etl.destination.path, etl.execution.base.path and etl.execution.history.path)
- If you use the --check option to flag your partitions with _IMPORTED:
  - Launch a first manual run with kafka.max.pull.minutes.per.task=1
    - The job will fail on the check phase because it is unable to handle the initial run correctly
- Re-enable your camus job in cron
- Wait for the first automatic run to finish
- If you used the --check flag, you will have to manually flag the partitions created by the first run with _IMPORTED. How many partitions need manual flagging depends on the amount of data Camus fetched during that one-minute run.
- Backfill with Oozie
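The archive, manual-run and flagging steps above can be sketched as shell helpers. All paths, the jar name and the _IMPORTED flag-file layout are hypothetical examples; substitute the values from your own camus.properties. The functions only print the commands so you can review them before running anything against the cluster:

```shell
#!/bin/bash
# Example values -- replace with the ones from your camus.properties.
CAMUS_DEST=/wmf/data/raw/cirrussearchrequestset   # etl.destination.path (example)
CAMUS_EXEC=/wmf/camus/cirrussearch                # etl.execution.base.path (example)
CAMUS_HIST=/wmf/camus/cirrussearch/history        # etl.execution.history.path (example)
ARCHIVE=/wmf/data/archive/camus-bad-import        # archive target (example)

# Archive the three Camus HDFS paths, then launch the first manual run
# limited to one minute of data per task.
emit_recovery_commands() {
  echo "hdfs dfs -mkdir -p $ARCHIVE"
  for p in "$CAMUS_DEST" "$CAMUS_EXEC" "$CAMUS_HIST"; do
    echo "hdfs dfs -mv $p $ARCHIVE/"
  done
  echo "hadoop jar camus-wmf.jar com.linkedin.camus.etl.kafka.CamusJob -P camus.properties -Dkafka.max.pull.minutes.per.task=1"
}

# Manually flag one hourly partition as imported by creating an empty
# _IMPORTED file (the hourly/<year>/<month>/<day>/<hour> layout is assumed).
flag_partition() {  # usage: flag_partition <base_path> <year> <month> <day> <hour>
  echo "hdfs dfs -touchz $1/hourly/$2/$3/$4/$5/_IMPORTED"
}

emit_recovery_commands
flag_partition "$CAMUS_DEST" 2015 11 05 02
```

Reviewing the printed commands first (rather than executing them directly) is deliberate: a wrong hdfs dfs -mv against the production destination path is hard to undo.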