Nova Resource:Catgraph

From Wikitech


Resource Type Resource Type::project
Project Name [[Project::catgraph]]
Monitoring nagf
Admins {{#arraymap:User:jkroll,User:novaadmin,User:wmde-fisch|,|x|* Member::x|\n}}
Members {{#arraymap:User:jkroll,User:novaadmin,User:wmde-fisch|,|x|* Member::x|\n}}

Documentation

{{#formlink:form=Nova Project Documentation |target=Nova Resource:Catgraph/Documentation |link text=Edit documentation }}

Catgraph

Description

Description::In-memory graph database for Wikipedia category structure.


CatGraph is a custom graph database. It provides developers of tools and other software fast access to the Wikipedia category structure, for example:

  • Finding all pages in a category with a user-defined search depth inside subcategories.
  • Finding "root nodes", that is, categories not contained in any other category; for example, !Hauptkategorie in dewiki.
  • Finding cycles in the category structure, i.e. categories containing one of their parent categories (beta).
  • Set operations on search results: intersection ("In A and in B"), difference ("In A, but not in B").

The complete category graph of a Wikipedia language version is held in RAM. The provided data consists solely of page_ids that can be used as parameters for SQL queries.

CatGraph is being developed by Johannes Kroll for Wikimedia Deutschland e.V. The original specification was written by Daniel Kinzler.

Example Query

The following query displays a list of the subset of pages in enwiki that are contained in both the Biology and People categories.

Both queries are run with search depth 6, i.e. the subcategories and their subcategories, ... up to the sixth subcategory are searched. The resulting 'Biology' subgraph contains (as I'm writing this) 381187 pages, the 'People' subgraph contains 1852741 pages. The intersection, that is, the pages contained in both subgraphs, consists of 86042 pages.

$ time curl "http://sylvester.wmflabs.org:8090/enwiki/traverse-successors+692675+6+&&+traverse-successors+691008+6" >/dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  705k    0  705k    0     0  67236      0 --:--:--  0:00:10 --:--:--  170k

real   0m10.759s
user   0m0.012s
sys    0m0.020s

Currently Running Instances

A list of the Wiki graphs currently running on Wikimedia Labs is here (alternative: single json file mapping graphs to hostnames). The name of the graph indicates the wiki instance, e.g. dewiki for the German Wikipedia. The suffix indicates the namespace, e.g. "enwiki_ns14" contains the category namespace of the English Wikipedia. Graphs without a suffix contain all namespaces.

Instances are reimported completely by a cron job in regular intervals for the time being. You can see the timestamp of the last update of a graph here. A script that continuously transfers the running changes is also in the works.

It is planned to make to make all existing language versions available.

Tools using CatGraph

  • The Render Article List Generator uses CatGraph for searching the category tree.
  • CatCycle finds cycles in the category structure, that is, categories containing one of their supercategories. It can also display root nodes and the shortest path between categories.
  • Merlissimo started working on a CatGraph interface for Merlbot.

Resources for Developers

  • gpClient: Libraries for talking to CatGraph in in PHP & Python.
  • This could be used to execute search engine-style query strings with CatGraph.
  • This simple script shows the usage of gpClient with PHP (code listing).

Core Components

The core components are implemented in C++. Some optimizations have been done to increase performance. A custom mmap-based block allocator keeps allocations fast and memory usage near the theoretical minimum for an adjacency list.

Graphcore implements the graph database. A graphcore process contains a Wikipedia language instance or a subset of its namespaces. The basic node type is an unsigned integer, which maps to the MediaWiki page_id table field. Graphcore communicates using a simple command language over the stdin/stdout streams.

Graphserv handles TCP connections and multiplexes data between clients and Graphcore instances. Basic access protection (read/write/admin levels, password authentication) is provided.

Documentation

Server admin log

June 4

  • 17:01 Ryan_Lane: rebooting sylvester to apply nfs homedirs


Instances for this project

{{#ask:Resource Type::instance[[Project::catgraph]] |?Instance Type |?Image Id |?Public IP |?Number of CPUs |?RAM Size |?Amount of Storage |?Modification date |mainlabel=FQDN |format=broadtable |limit=20 |order=DESC |sort=Modification date |offset=0 }}