Analytics/EventLogging/Data representations

This page gives an overview of the various representations of EventLogging data available on the WMF production cluster, and of what to expect from each representation.

In a nutshell: when consuming EventLogging data, rely only on the log database available from m2 replicas, such as analytics-store.eqiad.wmnet. Other representations might not get updated, might not receive fix-ups, or may (on purpose) give you unvalidated data.


MySQL / MariaDB database on m2

This database is the best place to consume EventLogging data from.

Available as the log database on m2 replicas, such as analytics-store.eqiad.wmnet.

Only validated events enter the database.

In case of bugs, this database is the only place that receives fixes, such as clean-up of historic data or live hot-fixes.
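
For illustration, here is a minimal Python sketch of querying one schema's table in the log database. The table name NavigationTiming_15485142 and the credentials file are hypothetical; substitute a real schema table and your own access configuration.

  import os
  import pymysql

  # Connect to the log database on an m2 replica; the credentials file
  # path is hypothetical -- use whatever access configuration you have.
  conn = pymysql.connect(
      host='analytics-store.eqiad.wmnet',
      db='log',
      read_default_file=os.path.expanduser('~/.my.cnf'),
  )
  try:
      with conn.cursor() as cur:
          # Hypothetical table (SchemaName_revisionId) and event_ column.
          cur.execute('SELECT timestamp, event_action '
                      'FROM NavigationTiming_15485142 LIMIT 10')
          for row in cur.fetchall():
              print(row)
  finally:
      conn.close()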


'all-events' JSON log files

Use this data source only to debug issues around ingestion into the m2 database.

Entries are JSON objects.

Only validated events get written.

In case of bugs, historic data does not get fixed.

These files are available at:

  • stats1002:/a/eventlogging/archive/all-events.log-$DATE.gz
  • stats1003:/srv/eventlogging/archive/all-events.log-$DATE.gz
  • eventlog1001:/var/log/eventlogging/...
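
A minimal sketch of scanning one day's archive in Python, assuming each line holds a single JSON object (the date in the path is a placeholder):

  import gzip
  import json

  path = '/srv/eventlogging/archive/all-events.log-20150101.gz'  # placeholder date

  with gzip.open(path, 'rt') as f:
      for line in f:
          try:
              event = json.loads(line)
          except ValueError:
              continue  # skip any line that is not a plain JSON object
          # 'schema' and 'timestamp' are assumed capsule fields.
          print(event.get('schema'), event.get('timestamp'))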

Raw client- and server-side log files

Use this data source only to debug issues around ingestion into the m2 database.

Entries are the raw parameters of the request to event.gif; they are not decoded at all.

In case of bugs, historic data does not get fixed, nor do hot-fixes reach these files.

These files are available at:

  • stats1002:/a/eventlogging/archive/client-side-events.log-$DATE.gz
  • stats1002:/a/eventlogging/archive/server-side-events.log-$DATE.gz
  • stats1003:/srv/eventlogging/archive/client-side-events.log-$DATE.gz
  • stats1003:/srv/eventlogging/archive/server-side-events.log-$DATE.gz
  • eventlog1001:/var/log/eventlogging/...
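
A minimal sketch of percent-decoding one raw entry in Python; the sample line is made up, and it assumes an entry is the encoded query string of the event.gif request (adjust the parsing to the actual line layout):

  from urllib.parse import unquote

  line = '?%7B%22schema%22%3A%22Example%22%7D;'  # made-up raw entry
  payload = line.lstrip('?').rstrip(';\n')
  print(unquote(payload))  # -> {"schema":"Example"}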

Kafka

EventLogging now feeds the following topics in Kafka (see the consumer sketch after the list):

  • eventlogging_valid_mixed: All valid events, from all schemas.
  • eventlogging_<schemaName>: All events from the given schema; there is one such topic per schema.
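
A minimal sketch of consuming the mixed topic with the kafka-python client; the broker address is hypothetical, so substitute the real Kafka brokers:

  import json
  from kafka import KafkaConsumer

  consumer = KafkaConsumer(
      'eventlogging_valid_mixed',
      bootstrap_servers='kafka1001.eqiad.wmnet:9092',  # hypothetical broker
      value_deserializer=lambda v: json.loads(v.decode('utf-8')),
  )
  for message in consumer:
      event = message.value
      print(event.get('schema'), event.get('timestamp'))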

MongoDB

EventLogging data has not been fed into MongoDB since 2014-02-13.

The EventLogging data in MongoDB did not appear to be used.


ZMQ

ZMQ is available from eventlog1001.

In case of bugs, historic data cannot get fixed :-)

Data coming from the forwarders (ports 8421 and 8422) is not validated and does not receive hot-fixes.

Data coming from the processors (ports 8521 and 8522) and the multiplexer (port 8600) is validated.

These streams will cease working soon; Analytics is working to move all streams to Kafka.
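
While the streams still run, subscribing with pyzmq looks roughly like this; the sketch assumes the multiplexer endpoint is a ZMQ PUB socket publishing one JSON event per message:

  import json
  import zmq

  ctx = zmq.Context()
  sock = ctx.socket(zmq.SUB)
  sock.connect('tcp://eventlog1001:8600')   # multiplexer: validated events
  sock.setsockopt_string(zmq.SUBSCRIBE, '')  # no topic filter, take everything

  while True:
      event = json.loads(sock.recv_string())
      print(event.get('schema'))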

Nginx pipeline

Since EventLogging data typically comes in over HTTPS, and the EventLogging payload is encoded in the URL, EventLogging data is available in all log targets of the SSL terminators.

In case of bugs, historic data does not get fixed, nor do hot-fixes reach this pipeline.


Varnish pipeline

Since EventLogging data is extracted at the bits caches, and the EventLogging payload is encoded in the URL, EventLogging data is available in all log targets of the bits caches.

In case of bugs, historic data does not get fixed, nor do hot-fixes reach this pipeline.
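
For either pipeline, pulling the payload out of a logged request URL looks roughly like the Python sketch below; the sample URL is made up, and it assumes the payload is the percent-encoded JSON query string of the beacon request:

  import json
  from urllib.parse import unquote, urlsplit

  url = 'http://bits.wikimedia.org/event.gif?%7B%22schema%22%3A%22Example%22%7D;'  # made up
  query = urlsplit(url).query.rstrip(';')
  event = json.loads(unquote(query))
  print(event['schema'])  # -> Example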