Parsoid

Parsoid is a service that converts between wikitext and HTML. The HTML contains additional metadata that allows it to be converted back ("round-tripped") to wikitext.

  • VisualEditor fetches the HTML for a given page from Parsoid, edits it, then delivers the modified HTML to Parsoid, which converts it back to wikitext. Parsoid is a stateless HTTP server running on port 8000.
  • Flow (as configured on WMF wikis with $wgFlowContentFormat = 'html') works the other way around. When a user creates a post, Flow uses Parsoid to convert the wikitext to HTML and stores the HTML in ExternalStore. If someone later edits a post, Flow uses Parsoid to convert the HTML back to wikitext for editing.

Monitoring

  • The Parsoid eqiad cluster view in Ganglia lists only the worker machines; the Varnish hosts are cp1045 and cp1058.
  • Icinga has service checks for HTTP on port 8000 on both the individual backends and the LVS service IP, and on port 80 on cp1045, cp1058 and their service IP (a rough manual equivalent is sketched after this list).
  • pybal does health checks on all backends every second, and depools boxes that are down as long as the % of depooled boxes does not exceed 50%. To see these health checks and depools/repools happen in real time, run ssh parsoid.svc.eqiad.wmnet (this will drop you into either lvs1003 or lvs1006, depending on which is active), then tail -f /var/log/pybal.log | grep parsoid
    • pybal also manages the Varnish hosts in the same way; they're at parsoidcache.svc.eqiad.wmnet
  • Logging happens in /var/log/parsoid/parsoid.log. There is a log rotation setup in /etc/logrotate.d/parsoid.
  • Global job queue length -- if this approaches a million jobs or so, it is a reason for concern. If there is no other explanation for slowness (such as high Parsoid cluster load or high general API load), check that the Parsoid job runners are all working.
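
A rough manual approximation of the Icinga HTTP checks above (the hostnames and ports are the ones already mentioned on this page; run from a host that can reach the internal network):

curl -s -o /dev/null -w '%{http_code}\n' http://wtp1001.eqiad.wmnet:8000/_version      # an individual backend
curl -s -o /dev/null -w '%{http_code}\n' http://parsoid.svc.eqiad.wmnet:8000/_version   # LVS service IP for the backends
curl -s -o /dev/null -w '%{http_code}\n' http://parsoidcache.svc.eqiad.wmnet/           # Varnish service IP on port 80

Each command should print 200 while the corresponding service is healthy.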

Deploying changes

Parsoid is deployed using git-deploy. Deployments with git-deploy are straightforward: you run git deploy start, make whichever changes you need to make to the git clone (such as pulling, changing branches, committing live hacks, etc.), then run git deploy sync. The sync command pushes the new state to all backends; note that the Parsoid service itself is not restarted automatically (see below). You should have deploy access and be a member of the parsoid-admins puppet group.

Pre-deploy checks

Prepare the deploy patch

  • Check http://parsoid-tests.wikimedia.org/regressions/between/{from}/{to} where {from} is the last deployed hash from mw:Parsoid/Deployments and {to} is the latest tested commit (which we're about to deploy)
    • http://parsoid-tests.wikimedia.org/commits gives you a nice radio-button interface to create this URL
    • BEWARE: if you get the output "total regressions between selected revisions: 0", it is extremely likely that you mistyped the hash or that we didn't actually run round-trip tests for that particular hash. (This is a bug; we should probably give a better message in this case.)
  • Create a short deployment summary on mw:Parsoid/Deployments from git log {from}..{to}. Don't include all commits, but only notable fixes and changes (ignore rt-test fixes, code cleanup updates, parser test updates, etc).
  • Prepare a deploy repo commit and push for +2
    • Roughly: cd deploy ; git checkout master ; git pull origin master ; git submodule update ; cd src ; git checkout {to} ; cd .. ; git add -u ; git commit -m "Bump src to {to} for deploy" ; git review
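
The same sequence written out line by line, with {to} being the commit hash you are about to deploy:

cd deploy
git checkout master
git pull origin master
git submodule update
cd src
git checkout {to}
cd ..
git add -u
git commit -m "Bump src to {to} for deploy"
git review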

Verify deployment version on beta after the deploy patch is merged

  • If beta cluster is down or visual editor is down in beta cluster, do not continue with routine deployments.
  • On beta cluster, perform manual VisualEditor editing tests. This requires you to have an account on the beta cluster wiki. Test with non-ASCII content too to catch encoding issues. Check parsoid logs, if necessary.

Be around on IRC

  • Add yourself to the "deployer" field of Deployments if you're not already there
  • Be online in freenode #wikimedia-operations (and stay online through the deployment window)

Deploying the latest version of Parsoid

Before you begin, note that Parsoid caches its git version string. So you may wish to do:

ssh -A tin
tin$ for wtp in `cat /etc/dsh/group/parsoid`; do echo -n "Querying $wtp: "; \
   curl "http://$wtp:8000/_version"; echo; done;

to ensure that the "old" version string is cached, so that you will be able to tell when parsoid restarts with its "new" version below.

Now to do the deploy:

ssh -A tin
tin$ cd /srv/deployment/parsoid/deploy
tin$ git deploy start
tin$ git pull
tin$ git submodule update --init
tin$ git deploy sync

You will then get status updates. If any minions are not OK, retry the deploy until all are. Proceed with 'y' at each step once all minions are OK.

Nodes are not automatically restarted. First restart Parsoid on a canary:

ssh wtp1001.eqiad.wmnet
wtp1001:~$ sudo service parsoid restart
parsoid stop/waiting
parsoid start/running, process 25849
wtp1001:~$

Monitor wtp1001 in Ganglia for a while to make sure things seem okay. (FIXME: describe how to do so.) You can curl http://localhost:8000/_version to verify wtp1001 is running correctly; it should output the expected sha.
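
For example, on the canary (the exact output format may vary between Parsoid versions; what matters is that the reported sha matches the commit you just deployed):

wtp1001:~$ curl http://localhost:8000/_version
wtp1001:~$ # compare the sha in the output against the {to} hash from the deploy patch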

Then to restart the rest of the nodes:

tin$ git deploy service restart

This is broken (https://phabricator.wikimedia.org/T102039), so you might need to do this instead (on your localhost):

for wtp in `ssh <your-user-id>@bast1001.wikimedia.org cat /etc/dsh/group/parsoid`; do
    echo $wtp
    ssh <your-user-id>@$wtp sudo service parsoid restart
done

Unfortunately, Parsoid often does not restart cleanly. (FIXME: describe how to tell if nodes are hanging during version check, and what to do about it, if anything.)

Once everything is done, log the deploy in #wikimedia-operations with something like

!log updated Parsoid to version <new hash>

This creates a timestamped entry in the Server Admin Log.

Post-deploy checks

  • Test VE editing on enwiki and non-latin wikis
    • For example, open it:Luna (or another complex page), start the visual editor, make some random vandalism, click save -> review changes, then verify that the wikitext reflects your changes and was not corrupted. Hit cancel to abort the edit.
    • Reading through the recent edits (frwiki, enwiki) can also be a good check.
  • Verify all Parsoid servers are running the same version with:
tin$ for wtp in `cat /etc/dsh/group/parsoid`; do echo -n "Querying $wtp: "; \
   curl "http://$wtp:8000/_version"; echo; done;

(Note that dsh doesn't work directly any more.)
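
If you just want a quick yes/no answer, a small variant of the loop above (a convenience sketch, not part of the documented procedure) collapses the responses, so a single output line means every host reports the same version:

tin$ for wtp in `cat /etc/dsh/group/parsoid`; do curl -s "http://$wtp:8000/_version"; echo; done | sort | uniq -c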

Deploying a cherry-picked patch

One way to do this is to create a new branch in the Parsoid repo and cherry-pick your patches onto it. For example:

git checkout 497da30e # this is the commit on the master branch that you want to cherry pick on top of
git checkout -b deploy-20150528 # give it a name (go ahead and use the date of your deploy)
git cherry-pick f274c3f54f385a6ac159a47209d279b9040a161c # patch number 1
git cherry-pick de087b106be48fc6e97f2ebc4644f9d297ecdfed # patch number 2
git push gerrit deploy-20150528:deploy-20150528 # create the branch in gerrit (DON'T USE SLASHES HERE)

Now do the usual steps to prepare a deploy repo commit (shown below), using the hash of your branch commit (73445bfd in this example):

cd deploy
git checkout master ; git pull origin master ; git submodule update
cd src ; git checkout 73445bfddded9f0baa6afe548c98880f4401fb7b # your branch commit
cd .. ; git add -u
git commit -m "Bump src to 73445bfd (deploy-20150528 branch) for deploy"
git review -u

Note that the automated push to beta will fail if your gerrit branch name contains a slash. This is probably just because some ancient version of git is being used, and will eventually be fixed. But in the meantime, use dashes instead of slashes.

Cherry-picking directly from tin and deploying it

In many situations, a hotfix might need to be pushed quickly. One way to do that is to cherry-pick the patch on tin and sync it.

### Verify that you have the most recently deployed code that you want to cherry-pick on top of
tin$ cd /srv/deployment/parsoid/deploy (verify via git log)
tin$ cd src (verify via git log)

### Create a hotfix branch
tin$ git checkout -b hotfix_<some_unique_tag>

### Get latest code from master you want to cherry-pick from
tin$ git checkout master; git pull

### Check out the hotfix branch and cherry-pick
tin$ git checkout hotfix_<some_unique_tag>
tin$ git cherry-pick <commit-from-master>

### Create a deploy-repo patch
tin$ cd ..; git commit -a -m "Bump src to whatever-git-sha-it-is for hotfix"

### The usual deployment steps
tin$ git deploy start
tin$ git deploy sync
... restart and verify deployment ...

When something goes wrong

Roan and Gabriel know the most about the Parsoid infrastructure. Send them an email or (if urgent) call them if there are issues you can't solve.

Reverting a Parsoid deployment

Code

ssh tin
cd /srv/deployment/parsoid/deploy
git deploy start
git checkout <tag> (all deployed versions are tagged as parsoid/deploy-sync-<date>-<some-id>)
git submodule update --init
git deploy sync
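
To find a suitable tag, you can list the existing deploy tags first (the naming scheme is the one noted above):

git tag -l | grep deploy-sync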

If you want to revert to a specific changeset (this should never ever be necessary):

git deploy start
git reset --hard <desired changeset>
git deploy --force sync

You still need to restart the Parsoid service after deploying reverted code. Follow the dsh restart directions below.

Misc stuff

  • Clear the Varnish caches: run varnishadm "ban.url ." on cp1045 and cp1058
  • Rolling restart via dsh: dsh -g parsoid service parsoid restart
  • Restart parsoid hosts via salt from /srv/deployment/parsoid/deploy: service-restart
  • To abort a deployment after running git deploy start but before git deploy sync, run git deploy abort.
  • There is a lock file preventing multiple deployments on the same code base from being active at the same time. If git deploy start complains about this lock, you can run git deploy abort to make it go away (if you know this isn't a legitimate warning due to someone else actively deploying).
  • If the sync step complains you didn't change anything, you can run git deploy --force sync (note order of arguments!) to make it sync anyway.
  • To change which hosts are pooled or change their weights, edit /home/wikipedia/common/docroot/noc/pybal/eqiad/parsoid as root on fenari
  • Get the top clients on the varnishes:
varnishncsa -n frontend | cut -d ' ' -f 5 | head -10000 \
| sort | uniq -c | sort -n | tail -20 \
| while read i; do echo -n "$i  "; host `echo "$i" | cut -d ' ' -f 2`; done

Data flow

Parsoid runs entirely on an internal subnet, so requests to it are proxied through the ve-parsoid API module. This module is implemented in extensions/VisualEditor/ApiVisualEditor.php and is invoked with a POST request to /w/api.php?action=ve-parsoid. The API module then sends a request to Parsoid, either GET /$prefix/$pagename to get the HTML for a page, or POST /$prefix/$pagename to submit HTML and get wikitext back. Parsoid itself also issues requests to /w/api.php to get the wikitext of the requested page and to do template expansion.

Once the ve-parsoid API module receives a response from Parsoid, it either relays it back to the client (when requesting HTML), or saves the returned wikitext to the page (when submitting HTML).
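
For debugging, you can also talk to Parsoid directly from a host inside the cluster. A minimal sketch, using the cache service name from the "Caching and load balancing" section below and the example URL from the diagrams below:

curl 'http://parsoidcache.svc.eqiad.wmnet/en/Barack_Obama?oldid=1234'   # fetch Parsoid HTML for a specific revision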

                (POST /w/api.php?action=ve-parsoid)          (GET /en/Barack_Obama?oldid=1234)           (requests for page content and template expansions)
Client browser ------------------------------------------> API ---------------------------->  Parsoid -----------------------------------------------------> API
    ^                                                      | ^                                 |   ^                                                          |
    |                  (response)                          | |      (HTML)                     |   |                   (responses)                            |
    +------------------------------------------------------+ +---------------------------------+   +----------------------------------------------------------+


                (POST /w/api.php?action=ve-parsoid)          (POST /en/Barack_Obama; oldid=1234)
Client browser ------------------------------------------> API ---------------------------->  Parsoid
                                                           | ^                                 |
                                               (save page) | |      (wikitext)                 |
                                                           | +---------------------------------+
                                                           |
                                                        Database

Caching and load balancing

Parsoid is load balanced using LVS. The assigned service IPs are:

  • parsoid.svc.eqiad.wmnet (10.2.2.28:8000) for the Parsoid backends
  • parsoidcache.svc.eqiad.wmnet (10.2.2.29:80) for the front-end Varnishes

The parsoidcache LVS service balances two front-end Varnishes running on cp1045 / cp1058 (see parsoid-frontend.inc.vcl.erb). These only hash requests across the backend caches (see parsoid-backend.inc.vcl.erb). Cache misses are then forwarded to the LVS service in front of the Parsoid backends.

       10.2.2.29:80  {cp1045,cp1058}:80      10.2.2.28:8000          wtp10NN:8000
MW API  -> LVS -----> Varnish ---------------> LVS  ---------------------> Parsoid

All request URLs include the oldid as a query parameter. The Parsoid PHP extension sends update requests to the front-end LVS IP on edits, template updates and visibility changes. The Parsoid backends perform additional requests with 'Cache-Control: only-if-cached' against the caches and reuse cached HTML to speed up serialization and re-rendering of pages. For example, expansions of templates, extensions and images are reused after an edit without performing API requests for them. See this document for more detail.
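
For illustration, an only-if-cached request of the kind described above can be simulated by hand (a sketch; the exact headers and URLs Parsoid uses internally may differ). Per standard HTTP caching semantics, the cache answers from cache if it can and otherwise returns an error (typically 504) instead of contacting the backend:

curl -i -H 'Cache-Control: only-if-cached' 'http://parsoidcache.svc.eqiad.wmnet/en/Barack_Obama?oldid=1234'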

Job queue length

The ParsoidCacheUpdateJobOnDependencyChange queue occasionally grows large, which shows up in the global job queue stats. Is this a reason to worry?

First of all, these are low-priority background jobs that need to run eventually to make sure that our HTML is updated to reflect template or image changes. They do not, however, affect the performance or correctness of editing. Parsoid emits a second class of OnEdit jobs in a separate high-priority queue, which normally has no backlog at all.

The reason why a lot of background jobs can be enqueued quickly is that, unlike the PHP refreshLinks jobs (see below), we actually process all dependency updates rather than ignoring the bulk of them for very popular templates used in up to 8 million pages. Our jobs wrap 10 updates each, so a single edit to a template used in 8 million pages will enqueue 800k jobs.

Why aren't there similarly big jumps with PHP refreshLinks jobs? The PHP version only processes at most 200k dependent pages, and ignores the remaining pages. This arguably contributes to the perception of the job queue being unreliable, which in turn motivates some users to perform millions of null edits with bots to force re-parses.

So why can't those parsoid dependency jobs be processed more quickly? We have limited the rate at which we are dequeuing Parsoid jobs to a rate that can be sustained by the PHP API cluster. This means that the Parsoid cluster only runs at about 25% CPU, but the API cluster is closer to 50%. We could increase the rate at which we are dequeuing parsoid dependency update jobs once more API capacity becomes available.

Bottom line: ParsoidCacheUpdateJobOnDependencyChange numbers temporarily going up to a few million jobs is fine as long as the queue length starts shrinking from there. You can check the number of jobs per wiki with mwscript showJobs.php --wiki=enwiki --group on tin. You can list the jobs with mwscript showJobs.php --wiki=enwiki --type=ParsoidCacheUpdateJobOnDependencyChange --list.

See also