Analytics/Cluster/Overview

From Wikitech

Diagram

File:Analytics Cluster Diagram.png

Architecture

The Analytics Cluster is used for many things besides processing the webrequest logs, but since those logs are our primary source of big data and this is an overview document, the focus here is on the architecture used to import, store, and analyze that data stream.

Glossary

Hadoop
A collection of services for batch processing of large data. Core concepts are a distributed filesystem (HDFS) and MapReduce.
HDFS
Hadoop Distributed File System (HDFS). HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework. A Hadoop cluster typically has a single active NameNode; a cluster of DataNodes stores the HDFS block data.
NameNode (Hadoop)
Manages HDFS file metadata.
DataNode (Hadoop)
Responsible for storage of HDFS block data.
YARN (Hadoop)
Yet Another Resource Negotiator. Compute resource manager and API for distributed applications. MapReduce (v2) is a YARN application. Although not technically correct, YARN is sometimes referred to as MRv2.
ResourceManager (Hadoop)
Handles YARN application (job) submission from clients.
NodeManager (Hadoop)
Runs on each compute node (and/or HDFS DataNode) and accepts jobs from the ResourceManager.
MapReduce
Programming model for parallelizing batch processing of large data sets; good for counting things. Hadoop ships with a Java implementation of MapReduce. (A word-count sketch follows this glossary.)
Hue
Hadoop User Experience. Web GUI for various Hadoop services. (HDFS, Oozie, Pig, Hive, etc.) Written in Python Django.
Pig
High-level language abstraction for implementing common MapReduce programs without thinking in the MapReduce model (it feels like a mix of SQL and awk). Generates MapReduce programs that are run in Hadoop.
Hive
Projects structure onto flat data (text or binary) in HDFS and allows that data to be queried in an SQL-like manner. Analytics is not currently using Hive heavily, although we believe it will be useful for certain types of analysis as more clients have a need for it.
Oozie
Scheduler for Hadoop jobs. Used for automated reporting based on data availability.
Kafka
Distributed pub/sub message queue. Useful as a big ol' reliable log buffer.
Kafka Consumer
Any Kafka client that subscribes to and reads messages from Kafka Brokers.
Kafka Producer
Any Kafka client that publishes (writes) messages to Kafka Brokers.
Kafka Topic
A named, logically separate stream of messages within Kafka; producers write to and consumers read from specific topics.
Camus
LinkedIn's Kafka -> HDFS pipeline. Runs as a MapReduce job in Hadoop and consumes from Kafka into HDFS.
Zookeeper
Highly reliable dynamic configuration store. Instead of keeping configuration and simple state in flat files or a database, Zookeeper allows applications to be notified of configuration changes on the fly.
Cloudera
An organization that provides Hadoop ecosystem packages and documentation. Kraken uses Cloudera's CDH5 distribution.
webrequest log stream (aka firehose)
The Analytics team refers to the web access logs generated by all WMF frontend cache servers as 'webrequest logs'. The existing UDP stream is often referred to as the 'firehose', or the webrequest log stream.
JMX
JMX (Java Management Extensions) is a widely adopted JVM mechanism for exposing VM and application metrics, state, and management features. Nearly all of the technologies listed above provide JMX metrics; for example, Kafka brokers expose BytesIn, BytesOut, FailedFetchRequests, FailedProduceRequests, and MessagesIn, aggregated across all topics as well as per topic. Kraken uses jmxtrans to send relevant metrics to Ganglia and elsewhere.
Refinery
Refinery contains scripts, artifacts, and configuration for WMF's analytics cluster.
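
To make the MapReduce entry above concrete, here is a minimal word-count sketch using Hadoop Streaming, which lets the mapper and reducer be ordinary Python scripts that read stdin and write stdout. The script names and the submission command below are illustrative, not part of the cluster's actual jobs.

  #!/usr/bin/env python
  # mapper.py -- the map phase: emit "word<TAB>1" for every word on stdin.
  import sys

  for line in sys.stdin:
      for word in line.strip().split():
          print("%s\t%d" % (word, 1))

  #!/usr/bin/env python
  # reducer.py -- the reduce phase: sum the counts for each word. Hadoop sorts
  # the map output by key, so identical words arrive on consecutive lines.
  import sys

  current_word, current_count = None, 0
  for line in sys.stdin:
      word, count = line.rstrip("\n").split("\t")
      if word == current_word:
          current_count += int(count)
      else:
          if current_word is not None:
              print("%s\t%d" % (current_word, current_count))
          current_word, current_count = word, int(count)
  if current_word is not None:
      print("%s\t%d" % (current_word, current_count))

Such a job is typically submitted with the hadoop-streaming jar, roughly: hadoop jar <path to hadoop-streaming.jar> -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <input dir> -output <output dir> (the jar's location varies by Hadoop distribution).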

Log Producers - varnishkafka

Frontend Varnish cache servers use varnishkafka to send all webrequest logs to the Kafka brokers in eqiad.
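
varnishkafka itself is a C daemon built on librdkafka, so the following is only a conceptual sketch, written in Python with the kafka-python client, of what producing one webrequest log line to a broker looks like. The broker address, topic name, and record fields are illustrative assumptions.

  import json

  from kafka import KafkaProducer  # assumes the kafka-python client library

  # Hypothetical broker address and topic name, for illustration only.
  producer = KafkaProducer(bootstrap_servers=["kafka-broker.example:9092"])

  # A webrequest log line, reduced to a few illustrative fields.
  log_line = {
      "hostname": "cp1001.example",    # the cache server that handled the request
      "dt": "2014-04-01T00:00:00",
      "uri_host": "en.wikipedia.org",
      "uri_path": "/wiki/Main_Page",
      "http_status": "200",
  }

  # Each log line becomes one message on the topic; varnishkafka batches these.
  producer.send("webrequest", json.dumps(log_line).encode("utf-8"))
  producer.flush()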

Log Buffering - Kafka

Kafka brokers store a 7-day buffer of data. This gives us time to consume the data into HDFS and to recover if there are any problems. Check out the Kafka Ganglia view.
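
The consumer that actually drains this buffer is Camus (see below), a Java MapReduce job. Purely to make the buffer concrete, here is a minimal Kafka consumer sketch in Python using the kafka-python client; the broker address, topic, and consumer group are illustrative assumptions.

  from kafka import KafkaConsumer  # assumes the kafka-python client library

  # Hypothetical broker address, topic, and consumer group, for illustration only.
  consumer = KafkaConsumer(
      "webrequest",
      bootstrap_servers=["kafka-broker.example:9092"],
      group_id="example-webrequest-reader",
      auto_offset_reset="earliest",  # start at the oldest data still in the 7-day buffer
  )

  for message in consumer:
      # message.value holds the raw log line bytes produced by varnishkafka.
      print(message.topic, message.partition, message.offset, message.value[:80])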


Batch Analysis - Hadoop

Raw webrequest logs are imported into HDFS using LinkedIn's Camus. (Raw log data is not world-readable.) Analysis of these logs is done with Hive, Pig, and other MapReduce jobs.
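
As a rough illustration of what a Hive analysis of these logs can look like, the sketch below runs a HiveQL aggregation from Python via the PyHive client. The Hive server host, database, table, and column names are assumptions for illustration, not the cluster's actual schema.

  from pyhive import hive  # assumes the PyHive client library

  # Hypothetical Hive server, database, and webrequest table layout.
  conn = hive.Connection(host="hive-server.example", port=10000, database="wmf_raw")
  cursor = conn.cursor()

  # Count requests per host for one (hypothetical) hourly partition.
  cursor.execute("""
      SELECT uri_host, COUNT(*) AS requests
      FROM webrequest
      WHERE year = 2014 AND month = 4 AND day = 1 AND hour = 0
      GROUP BY uri_host
      ORDER BY requests DESC
      LIMIT 10
  """)

  for uri_host, requests in cursor.fetchall():
      print(uri_host, requests)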

Oozie is used to schedule and chain runs of Hive or other scripts based on the availability of data in HDFS. Workflows are triggered once the data is available.
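
Oozie coordinators express this "run once the data is there" logic declaratively. The sketch below is not the Oozie API; it is only a conceptual Python illustration of the same pattern, polling HDFS for a hypothetical partition directory with "hdfs dfs -test -e" and then launching a hypothetical Hive script.

  import subprocess
  import time

  # Hypothetical HDFS path whose existence signals that an hour of raw data has landed.
  PARTITION = "/wmf/data/raw/webrequest/2014/04/01/00"

  def partition_exists(path):
      # "hdfs dfs -test -e" exits 0 if the path exists, non-zero otherwise.
      return subprocess.run(["hdfs", "dfs", "-test", "-e", path]).returncode == 0

  # Poll until the data is available, then run the downstream aggregation.
  while not partition_exists(PARTITION):
      time.sleep(60)

  subprocess.run(["hive", "-f", "aggregate_webrequest.hql"])  # hypothetical script name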

See Also