Analytics/EventLogging/Architecture


This page explains the topology of WMF's EventLogging system and how its parts interact, using the following diagram as a reference:

EventLogging architecture

  • varnishkafka sends raw client-side events from Varnish to the eventlogging-client-side Kafka topic.
  • MediaWiki servers send raw server-side events to eventlog1001 via UDP. These are consumed by eventlogging-forwarder and fed into the eventlogging-server-side Kafka topic.
  • A client-side and a server-side eventlogging-processor process these raw events and send them back to Kafka. There are two processors because the raw server-side and raw client-side event formats differ. Once processed and validated, events are produced to Kafka in the topics eventlogging-valid-mixed and eventlogging_<schemaName>. eventlogging-valid-mixed contains the valid events from all schemas except blacklisted high-volume schemas. eventlogging_<schemaName> holds all events for a given schema.
  • The processors use the BalancedConsumer implementation in PyKafka internally, so throughput can be increased simply by launching more processor instances. These instances are Kafka consumers that belong to the same consumer group and divide the partitions of the Kafka topic among themselves by communicating with the Kafka brokers. The number of parallel processors is bounded above by the number of partitions in the Kafka topic.
  • eventlogging-valid-mixed is consumed by eventlogging-consumer processes and stored in MySQL and in the EventLogging log files. The eventlogging_<schemaName> topics are consumed by Camus and stored in HDFS, partitioned by <schemaName>/<year>/<month>/<day>/<hour>.
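
The topic routing, HDFS layout and parallelism bound described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the EventLogging code: the function names, the blacklist contents and the example schema names are hypothetical, and zero-padding of the date fields is assumed because the page gives only the field order.

```python
from datetime import datetime

# Hypothetical blacklist of high-volume schemas; the real list is not given here.
HIGH_VOLUME_BLACKLIST = {"ExampleHighVolumeSchema"}


def topics_for(schema, blacklist=HIGH_VOLUME_BLACKLIST):
    """Return the Kafka topics a valid event of this schema is produced to."""
    # Every valid event goes to its per-schema topic.
    topics = ["eventlogging_" + schema]
    # Blacklisted high-volume schemas are kept out of the mixed topic.
    if schema not in blacklist:
        topics.append("eventlogging-valid-mixed")
    return topics


def hdfs_partition(schema, ts):
    """Build the relative <schemaName>/<year>/<month>/<day>/<hour> path."""
    # Zero-padded fields are an assumption, not something the page states.
    return "{}/{:04d}/{:02d}/{:02d}/{:02d}".format(
        schema, ts.year, ts.month, ts.day, ts.hour
    )


def active_processors(instances, partitions):
    """Number of processors in one consumer group that can consume in parallel."""
    # The partition count is the upper bound on useful parallelism.
    return min(instances, partitions)


if __name__ == "__main__":
    print(topics_for("ExampleSchema"))
    print(hdfs_partition("ExampleSchema", datetime(2016, 3, 7, 9)))
    print(active_processors(instances=16, partitions=12))
```

With these assumptions, launching 16 processors against a 12-partition topic gives no more parallelism than launching 12.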

The EventLogging back end therefore consists of several pieces that consume from and produce to Kafka. The /etc/eventlogging.d file hierarchy contains the instance definitions for these pieces, with a subfolder for each service type. An Upstart task, 'eventlogging/init', walks this file hierarchy and provisions a job for each instance definition. Each instance definition file contains the command-line arguments for the service program, one argument per line.
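
To make the instance-definition layout concrete, here is a minimal Python sketch of walking such a hierarchy and turning each file into an argument list. It is not the Upstart task itself: the function names are hypothetical, and skipping blank lines is an assumption that this page does not confirm.

```python
from pathlib import Path


def read_instance_args(path):
    """Read one instance definition: one command-line argument per line."""
    lines = Path(path).read_text().splitlines()
    # Ignoring blank lines is an assumption, not documented behaviour.
    return [line.strip() for line in lines if line.strip()]


def discover_instances(root):
    """Map (service_type, instance_name) to that instance's argument list."""
    instances = {}
    # One subfolder per service type, one file per instance.
    for service_dir in sorted(Path(root).iterdir()):
        if not service_dir.is_dir():
            continue
        for definition in sorted(service_dir.iterdir()):
            if definition.is_file():
                key = (service_dir.name, definition.name)
                instances[key] = read_instance_args(definition)
    return instances


if __name__ == "__main__":
    # On a host that has the hierarchy, list every instance and its arguments.
    root = Path("/etc/eventlogging.d")
    if root.is_dir():
        for (service, name), args in discover_instances(root).items():
            print(service, name, args)
```

Each returned list could be appended to the service program's name to form the command line for that instance.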

An 'eventloggingctl' shell script provides a convenient wrapper around Upstart's initctl that is specifically tailored for managing EventLogging tasks.