Analytics/Cluster/Deploy a fix to incorrect camus partitioning

From Wikitech

After you deploy a Camus job and start fetching data from a Kafka topic, you may find that the job created incorrect partitions.

This can happen if camus.message.timestamp.format or camus.message.timestamp.field is set to an incorrect value.

In that case Camus may fall back to the import time instead of the actual log timestamp, which leads to incorrect partitioning. In the example below, ts is the field that contains the log timestamp in Unix seconds, and the hour=2 partition holds records from 01:15:10 to 02:15:10 rather than only records from the 02:00 hour:

select from_unixtime(min(ts)), from_unixtime(max(ts)) from cirrussearchrequestset where year=2015 and month=11 and day=5 and hour=2 limit 1;
_c0	                _c1
2015-11-05 01:15:10	2015-11-05 02:15:10
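
The check behind the query above can be sketched in Python. This is a minimal illustration, not part of Camus or refinery: the function names are hypothetical, and it assumes the partition values and the timestamps are both expressed in UTC (Hive's from_unixtime uses the session time zone, so adjust if yours differs).

```python
from datetime import datetime, timedelta, timezone


def partition_bounds(year, month, day, hour):
    """Return the [start, end) interval covered by an hourly partition (UTC assumed)."""
    start = datetime(year, month, day, hour, tzinfo=timezone.utc)
    return start, start + timedelta(hours=1)


def is_correctly_partitioned(min_ts, max_ts, year, month, day, hour):
    """True if every timestamp between min_ts and max_ts belongs to the partition."""
    start, end = partition_bounds(year, month, day, hour)
    lowest = datetime.fromtimestamp(min_ts, tz=timezone.utc)
    highest = datetime.fromtimestamp(max_ts, tz=timezone.utc)
    return start <= lowest and highest < end


# Values from the query output above, interpreted as UTC (an assumption).
min_ts = int(datetime(2015, 11, 5, 1, 15, 10, tzinfo=timezone.utc).timestamp())
max_ts = int(datetime(2015, 11, 5, 2, 15, 10, tzinfo=timezone.utc).timestamp())

# The hour=2 partition starts at 02:00:00, so a minimum of 01:15:10 is out of range.
print(is_correctly_partitioned(min_ts, max_ts, 2015, 11, 5, 2))  # False
```

A partition that passes this check has a minimum at or after the start of its hour and a maximum before the start of the next hour.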

In this situation you have to deploy a fix (either in the camus.properties file or in the refinery-camus artifact), and you also have to follow these deployment steps:

  1. Deploy the fix.
  2. Comment out your Camus job's crontab line.
  3. Re-create your table or delete the existing partitions.
  4. Archive the HDFS paths used by Camus (etl.destination.path, etl.execution.base.path and etl.execution.history.path).
  5. If you use the --check option to flag your partitions with _IMPORTED:
    1. Launch a first manual run with kafka.max.pull.minutes.per.task=1.
    2. Expect this run to fail in the check phase, because the check is unable to handle the initial run correctly.
  6. Re-enable your Camus job in cron.
  7. Wait for the first automatic run to finish.
  8. If you used the --check option, manually flag the partitions created by the first run. The number of partitions to flag manually depends on the number of lines Camus fetched in 1 minute.
  9. Backfill with Oozie.
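
Steps 4 and 8 above can be prepared with a small dry-run helper, sketched here in Python. It only prints the commands so that you can review them before running anything. Everything specific in it is an assumption: the /example/... paths and the bad_partitions suffix are hypothetical placeholders, the helper names are invented for illustration, and the sketch assumes the _IMPORTED flag is an empty file inside each partition directory. Only the hdfs dfs -mv and hdfs dfs -touchz subcommands are standard Hadoop shell commands.

```python
def archive_commands(paths, suffix):
    """Build one `hdfs dfs -mv` command per path, renaming it with a suffix."""
    commands = []
    for path in paths:
        src = path.rstrip("/")
        commands.append(["hdfs", "dfs", "-mv", src, f"{src}.{suffix}"])
    return commands


def flag_commands(partition_dirs, flag_name="_IMPORTED"):
    """Build one `hdfs dfs -touchz` command per partition directory."""
    return [
        ["hdfs", "dfs", "-touchz", f"{d.rstrip('/')}/{flag_name}"]
        for d in partition_dirs
    ]


# Hypothetical values: read the real ones from your camus.properties file.
camus_paths = [
    "/example/camus/data",      # etl.destination.path
    "/example/camus/exec",      # etl.execution.base.path
    "/example/camus/history",   # etl.execution.history.path
]

# Hypothetical partition directory created by the first automatic run.
first_run_partitions = ["/example/camus/data/my_topic/hourly/2015/11/05/02"]

# Dry run: print the commands instead of executing them.
for command in archive_commands(camus_paths, "bad_partitions"):
    print(" ".join(command))
for command in flag_commands(first_run_partitions):
    print(" ".join(command))
```

Printing rather than executing keeps the sketch safe to run anywhere; once the output looks right, the same command lists could be passed to subprocess.run on a host with HDFS access.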