Graphite

From Wikitech

Graphite is a real-time graphing system. It is similar to RRDtool, but much more scalable and with faster data access, which lets it handle a huge number of metrics while remaining fast.

A big advantage is that metric identifiers do not need to be predefined on the server side, which saves a lot of configuration overhead; metric names are determined by the client. Submission of data is not publicly exposed, but deferred to other deployed applications. See #Data sources for more about that.

Front-ends

Wikimedia deploys various web applications that provide convenient ways to access the data and generate graphs.

  • graphite.wikimedia.org (restricted), the default graphite-web frontend. Provides a visual interface to all raw metrics, a way to discover functions for transforming data, and an API with PNG and JSON output formats.
  • grafana.wikimedia.org, a frontend for flexibly querying metrics and creating new graphs. Unlike other front-ends, this queries the raw data and renders interactive graphs client-side.
  • gdash.wikimedia.org, various predefined dashboards showing a selection of graphs.

Service

The graphite receiver is hosted on graphite1001. (Previously on tungsten.)

For the beta cluster, the receiver is labmon1001.

Data sources

Graphite is one of the primary aggregators for metrics at Wikimedia. It provides a powerful API to query, transform and aggregate the data.

Data is rarely recorded with Graphite directly. Most commonly, data goes through statsd.

statsd

The statsd server acts as an intermediary between Graphite and other applications.
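
For illustration, a minimal sketch of what a statsd client does, assuming the standard statsd line protocol over UDP; the host, port, and metric names are placeholders, not the production configuration:

  import socket

  # Placeholder statsd address; the real address is managed in
  # puppet, not hard-coded like this.
  STATSD_ADDR = ('statsd.example.org', 8125)

  sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

  # statsd line protocol: "<metric>:<value>|<type>", where type
  # "c" is a counter increment and "ms" a timer in milliseconds.
  sock.sendto(b'frontend.requests:1|c', STATSD_ADDR)
  sock.sendto(b'frontend.render_time:245|ms', STATSD_ADDR)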

EventLogging

To aggregate data from EventLogging events from client-side JavaScript, we usually create a Python script that subscribes to relevant topics from the EventLogging ZMQ stream and reacts by sending packets to statsd. Such a script is then deployed on hafnium through the role::webperf role in puppet.

For example, puppet:///webperf/navtiming.py.
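
The general shape of such a script, as a sketch loosely modelled on navtiming.py (the ZMQ endpoint, statsd address, and metric names here are placeholders):

  import json
  import socket

  import zmq  # pyzmq

  ZMQ_ENDPOINT = 'tcp://eventlogging.example.org:8600'  # placeholder
  STATSD_ADDR = ('statsd.example.org', 8125)            # placeholder

  ctx = zmq.Context()
  sub = ctx.socket(zmq.SUB)
  sub.connect(ZMQ_ENDPOINT)
  sub.setsockopt_string(zmq.SUBSCRIBE, '')  # subscribe to all events

  statsd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

  while True:
      event = json.loads(sub.recv_string())
      # Forward an aggregate-friendly statsd packet for events
      # of the schema we care about.
      if event.get('schema') == 'NavigationTiming':
          value = event['event'].get('responseStart')
          if value is not None:
              packet = 'frontend.navtiming.responseStart:%d|ms' % value
              statsd.sendto(packet.encode('utf-8'), STATSD_ADDR)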

See Webperf for more information about how this works. See EventLogging for how to create new schemas and start sending events from your application.

statsv

statsv is an HTTP beacon endpoint (/beacon/statsv) for sending data to statsd.

It's a lightweight way of sending data from clients. This is useful when you only need one or more values to be aggregated, without the overhead of an EventLogging schema or of storing each entry in a database.

See statsv.py and kafka::statsv.

You can hit this endpoint directly with an HTTP request.
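
For example, a sketch in Python, assuming hypothetical metric names and that statsv accepts query parameters of the form metric=value with a statsd type suffix (see statsv.py for the authoritative format; the host shown is illustrative):

  import urllib.request

  # Hypothetical metrics: one counter increment ("c") and one
  # timing in milliseconds ("ms"), sent in a single beacon.
  url = ('https://en.wikipedia.org/beacon/statsv'
         '?frontend.pageviews=1c&frontend.loadtime=1250ms')
  urllib.request.urlopen(url)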

Within MediaWiki, you should use the abstraction layer provided by the WikimediaEvents extension: use the "timing" and "counter" topic namespaces of mw.track. E.g. mw.track( 'timing.foo', 1234.56 ) or mw.track( 'counter.bar', 5 ). (Source)

MediaWiki

Use MediaWiki's RequestContext::getStats() interface. This buffers data within the process and sends it to statsd at the end of the request. For example: RequestContext::getMain()->getStats()->increment( 'foo_hits' );

Note that metrics recorded from MediaWiki automatically get the MediaWiki. prefix added to the metric name.

TCP

To record data in Graphite directly, the client sends a simple plaintext message over TCP port 2003 containing three space-separated fields:

  1. Metric name.
  2. Numeric value.
  3. Unix timestamp.

Example:

$ echo "my.metric 1911 $(date +%s)" | nc -q0 graphite1001 2003

The metric my.metric does not need to be preconfigured in Graphite; it will happily be recorded as-is. Any missing hierarchy is created automatically.
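
The same message can also be sent programmatically; a minimal Python sketch (the host follows the Service section above; adjust as needed):

  import socket
  import time

  # Plaintext protocol: "<metric> <value> <timestamp>\n" over TCP 2003.
  message = 'my.metric 1911 %d\n' % time.time()
  with socket.create_connection(('graphite1001', 2003)) as conn:
      conn.sendall(message.encode('ascii'))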

Everything stored in Graphite has a path with components delimited by dots. In a path such as "foo.bar.baz", each dot-delimited segment is called a path component: "foo" is a path component, as is "bar", and so on. When coming up with metric names, adhere to the following guidelines:

  • Each path component should have a clear and well-defined purpose.
  • Volatile path components should be kept as deep into the hierarchy as possible. For example, keep a volatile server name at the end, as in requests.byserver.server1, rather than at the front.

Terminology

  • Metric (also known as Bucket). Each metric has a name and a bucket with one or more values over time.
  • Flush interval. At a configured interval, the statsd server will aggregate all buckets and send the representative values for each property to Graphite. At Wikimedia the interval is currently one minute.
  • Aggregation. Each minute, statsd takes each bucket and summarises its values into a single value representing that minute. It also creates the derivative properties at this point (e.g. lower, upper, p95, etc.). At later stages, once in Graphite, more aggregation happens. For example, data older than 7 days is represented in intervals of 5 minutes, and after 30 days the interval is 15 minutes. [1]

Extended properties

This is the missing manual about aggregation by statsd and Graphite at Wikimedia. It describes the primary metric types we use: counters and timers. For other metric types, see Statsd Metric Types.

Counters

A simple counter per flush interval. Aggregation layers will add values up. Note that code is not restricted to incrementing by one; a single push can increment the counter by a larger amount as well.

Properties:

  • count: How many values there were in this interval. Before aggregation this is always 1 (because the counter received one value). If you only ever increase a metric by 1, this can sometimes match sum. However, there may be aggregations between your code and statsd that add up counters and report them as a single value to statsd (e.g. varnishrls).
  • sum: The total sum of all values in this interval. This is not subject to averaging in later stages of aggregation; as such, recent data reflects the rate per minute, while queries further back report higher numbers over longer intervals (per 5 minutes, or per hour, depending on how far back the query goes). Use this in conjunction with integral() to produce a running total. Use rate instead if you need a total per fixed interval (e.g. always per second or always per minute).
  • rate: The average total per second. This is not subject to summing in later stages of aggregation and will remain an average. Use scale() to inflate this back to a higher interval. E.g. to draw a rate per minute use mycounter.rate and scale(60).
  • lower: The lowest single increment in this interval.
  • mean: The average of all increments in this interval.
  • upper: The highest single increment in this interval.
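
As a worked example of these properties: suppose a counter received two increments, 3 and 5, within a single 60-second flush interval. A quick Python sketch of the resulting values:

  increments = [3, 5]
  interval = 60  # statsd flush interval in seconds

  count = len(increments)   # 2
  total = sum(increments)   # 8 (the "sum" property)
  rate = total / interval   # ~0.133 per second
  lower = min(increments)   # 3
  mean = total / count      # 4.0
  upper = max(increments)   # 5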

Timers

Track the duration of a particular event.

Properties:

  • count: How many values there were in this interval.
  • sum: The total sum of durations in this interval (e.g. pushing 250ms and 300ms produces 550ms). Use this to compute total time spent in all events.
  • rate: Unknown. It is easy to mistake this for a hit rate (like a counter's rate), but that is incorrect. Upstream documentation suggests it is the same as the timer's sum (except that it is later aggregated as an average instead of a sum). To draw a counter from a timing metric, use sample_rate instead.
  • sample_rate: The average number of values observed per second. Use this to produce a counter from something primarily tracked as a timer. Behaves the same as a Counter's "rate".
  • lower: The lowest value in this interval.
  • mean: The average of all values in this interval.
  • median: The middle value of all values in this interval. See also Comparison of mean and median on Wikipedia.
  • upper: The highest single value in this interval.
  • p75, p95, p99, etc.: The highest value within the bottom percentage of values. This helps filter out outliers (which influence the mean), while still considering more than just the middle experience. The 50th percentile is the same as "median". The 100th percentile is the same as "upper". See also Percentile on Wikipedia.
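
As a worked example: suppose durations of 100 ms, 200 ms and 300 ms were recorded within a single 60-second flush interval:

  durations = [100, 200, 300]  # milliseconds
  interval = 60                # statsd flush interval in seconds

  count = len(durations)          # 3
  total = sum(durations)          # 600 (the "sum" property)
  sample_rate = count / interval  # 0.05 values per second
  lower = min(durations)          # 100
  mean = total / count            # 200.0
  median = sorted(durations)[count // 2]  # 200
  upper = max(durations)          # 300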

Functions

Here is a short list of common functions you should know about.

Moving average

Adding a moving average to your metric can turn an oscillating line, which obscures any trend, into a line that more accurately reflects how values are changing over time. Expand your query to at least 24 hours and start with a value of 5, increasing as needed up to 100. Higher values generally make the data too influenced by old data. To produce an average per day or week, use summarize() instead.
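
For example, applying movingAverage() through the render API; the metric name here is hypothetical, and graphite.wikimedia.org is restricted, so this requires access:

  import urllib.request

  # Last 24 hours of a hypothetical metric, smoothed with a
  # moving average over 10 data points.
  url = ('https://graphite.wikimedia.org/render'
         '?target=movingAverage(MediaWiki.foo.sample.mean,10)'
         '&from=-24h&format=png')
  png = urllib.request.urlopen(url).read()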

Time shift

Events influenced by user input (e.g. how long it takes to parse an article), or events that happen on the user's device, often have a daily and weekly pattern to them. Looking at the last 12 hours of data (even with a moving average) might not tell you much as it will always be going up or down depending on the time of day.

A time shift gives you context about how this metric behaved in the past and helps decide whether it is higher or lower than usual. Typically we add a time shift to show the metric at the same time yesterday and last week.
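
For example, plotting a hypothetical metric alongside shifted copies of itself:

  # Three targets on one graph: the metric now, the same metric
  # yesterday, and the same metric a week ago.
  targets = [
      'MediaWiki.foo.sample.median',
      'timeShift(MediaWiki.foo.sample.median, "1d")',
      'timeShift(MediaWiki.foo.sample.median, "7d")',
  ]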

See the Navigation Timing dashboard for an example.

Summarize

For graphs showing the history of a metric over the course of several weeks or months it can be helpful to summarise data points to a higher interval to help hide normal variation. For example, it's much easier to see a regression from ~ 10 to ~ 20 on a straight line than a line that continuously wiggles between 1 and 30. Even after aggregation into a median and application of moving average, data can still exhibit a wide variation over longer periods of time.

summarize() lets you plot fewer, widely spaced data points on a graph: for example, one value per hour, day, or week. To emphasise changes in the metric more prominently, the "staircase" line mode can be used.
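
For example (metric name hypothetical):

  # One data point per day, averaging the underlying values.
  target = 'summarize(MediaWiki.foo.sample.median, "1d", "avg")'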

See the "History" panel on the Save Timing dashboard for an example.

FAQ

How do I render a counter metric as a running total?

Start with the sum property, which stores the total per interval (e.g. minute or hour). Then apply integral() to produce a running total. (Original discussion at T108480)
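
For example (metric name hypothetical):

  # Running total of a counter, built from its per-interval sums.
  target = 'integral(MediaWiki.foo.hits.sum)'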

What queries are being sent to Graphite?

The graphite-web application does a fair amount of logging, specifically to /var/log/graphite-web/metricaccess.log. All requested queries are logged, together with how much time was spent serving them.

References

  1. Graphite configuration, Wikimedia operations puppet