Analytics/Reportupdater

From Wikitech

ATTENTION: This codebase was developed with the intention to replace generate.py script from analytics-limn-mobile-data project while the analytics team would find a reliable and generic solution for report scheduling. It was built temporarily inside the same project in a separate directory.

What is reportupdater?

It's a python program that creates and updates TSV files in a periodic manner, given the inputs:

  • SQL queries
  • Scripts that return SQL-like responses

Whenever a report needs to be updated, reportupdater queries the given databases or the given scripts for the missing data and appends it to the TSV report files.


Dependencies and dev environment

As reportupdater is built inside analytics-limn-mobile-data project, its dependencies and dev environment are the same, you can see them here.

How to execute

There are two options:

Run directly from the command line

Inside reportupdater folder, there lives this script:

usage: update_reports.py [-h] [--config-path CONFIG_PATH]
                         [--wikis_path WIKIS_PATH] [-l LOG_LEVEL]
                         query_folder output_folder

Periodically execute SQL queries or scripts and write/update the results into
TSV files.

positional arguments:
  query_folder          Folder with *.sql files and scripts.
  output_folder         Folder to write the TSV files to.

optional arguments:
  -h, --help            show this help message and exit
  --config-path CONFIG_PATH
                        Yaml configuration file. Default:
                        <query_folder>/config.yaml.
  --wikis_path WIKIS_PATH
                        All wikis list file path. Default: wikis.txt.
  -l LOG_LEVEL, --log-level LOG_LEVEL
                        (debug|info|warning|error|critical)

Run through generate.py

Some unused parameters were left out of the description.

usage: generate.py [-h] [-c CONFIG_OVERRIDE]
                   folder

Generate data for the mobile dashboard.

positional arguments:
  folder                folder with config.yaml and *.sql files

optional arguments:
  -h, --help            show this help message and exit
  -c CONFIG_OVERRIDE, --config-override CONFIG_OVERRIDE
                        config.yaml override
  -g GRAPH, --graph GRAPH
                        the name of a single graph you want to generate for

Config file

As reportupdater will run alongside generate.py, the general expected contents of the config file are the same as in analytics-limn-mobile-data. The only difference is the reportupdater-reports section. Reportupdater will assume the config file describes the following yaml block:

reportupdater-reports:

    report_key_1:

        # Type: 'sql' or 'script'. Tells reportupdater how to execute the
        # report. Optional. Default: 'sql'.
        type: sql

        # Frequency: hours, days or weeks. The frequency in which the report
        # will be updated. Mandatory.
        frequency: hours

        # Granularity: days, weeks or months. Reportupdater will assume the data
        # to have this granularity. Mandatory.
        granularity: days

        # Lag: Positive integer. Time to wait (in seconds) to start the report's
        # execution after the execution is triggered. Useful to wait until source
        # database is updated. Optional. Default: 0.
        lag: 300

        # Timeboxed: If true, the program will expect the SQL query to have
        # {from_timestamp} and {to_timestamp} placeholders and instantiate
        # them with the corresponding values for the missing date intervals.
        # Optional. Default: false.
        timeboxed: true

        # Starts: First date to be computed. Inclusive.
        # Only and mandatory for timeboxed = true.
        starts: 2015-01-01

        # Funnel: If true, the program will assume funnel data.
        # Meaning that for a single date, there may be various data rows.
        # Optional. Default: false.
        funnel: true

        # Explode by: (dict) The initial report will be exploded,
        # and a different report file will be generated for each
        # value. The output path will be:
        #   <report_name>/value1/value2/.../valueN.tsv
        # Where the order of the folders will be dictated by the
        # names of the placeholders (alphabetically).
        # In the sql query use {placeholder} to indicate where to
        # instantiate the values.
        # In the scripts, the values will be passed as arguments
        # in alphabetical order.
        explode_by:
            placeholder1: value1, value2, value3
            placeholder2: value4, value5

        # By wiki: If true, this will emulate explode_by with
        # placeholder = 'wiki' and values = a list of all wikis.
        # The value 'all' will also be part of the list. Default: false.
        by_wiki: true

    report_key_2:
        ...

When executing generate.py, reportupdater will only run if the reportupdater-reports section exists in the config file. Otherwise, it will never be called. If reportupdater runs, the report keys that exist inside the reportupdater-reports section, will not be executed by generate.py, even if they are also in generate.py's graphs section.

If you're running reportupdater through generate.py, and you want the reports generated by reportupdater to be outputed to a different folder, you can specify it by adding the following to the root of the config file:

reportupdater-output: "/your/desired/output/folder"

Limitations

Reportupdater works only for SQL queries and scripts whose result's first column represents a date in the following format: YYYY-MM-DD.

Testing

To run unit tests, use:

nosetests

To actually test reportupdater against the database, create an ssh tunnel to the database as explained in analytics-limn-mobile-data readme. And run reportupdater through generate.py overriding the config file, like:

python generate.py -c scripts/config.yaml <folder> -g wrong_graph_id

Where <folder> is the folder that contains all SQL files and the config.yaml file. The graph id wrong_graph_id will make generate.py crash with a KeyError, but that will be after executing reportupdater, and will prevent all generate.py reports to execute, if this makes sense for testing.