Analytics/Cluster/Hive/Queries


Prerequisites

  1. You have a Wikitech/Labs account – see Analytics/Onboarding#Labs
  2. You have requested access to stat1002
    1. You created a task like phab:T96053
    2. Your manager has approved your access request in the task you created above
    3. Ops has fulfilled the request
  3. You have ssh configured to tunnel through Bastion – see Analytics/Onboarding#Sample ssh config (a minimal sketch also appears below)
  4. You can start a shell on stat1002 by entering ssh stat1002.eqiad.wmnet in a command-line terminal.
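
If you have not set this up yet, here is a minimal ~/.ssh/config sketch; the username and bastion hostname below are placeholders, so use the real values from Analytics/Onboarding#Sample ssh config:

 Host stat1002.eqiad.wmnet
     User your-shell-username
     ProxyCommand ssh -W %h:%p your-bastion-host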

Running queries

Once you can ssh to stat1002, you can access the Hive command-line interface by entering hive. Here's the start of a sample interactive session:

 nuria@stat1002:~$ hive
 hive (default)> show databases;
 OK
 database_name
 wmf
 ...
 hive (default)> use wmf;
 OK 
 Time taken: 0.047 seconds
 hive (wmf)> show tables;
 OK
 tab_name
 mediacounts
 mobile_apps_session_metrics
 mobile_apps_uniques_daily
 mobile_apps_uniques_monthly
 pagecounts_all_sites
 pageview_hourly
 projectview_hourly
 webrequest
 webrequest_empty
 Time taken: 0.108 seconds, Fetched: 9 row(s)
 hive (wmf)>

You can also write your query in a text file on stat1002 (e.g. one named some.hql) and execute it. This example redirects the output of the query (by default in tab-separated values format) to results.txt:

 $ hive -f some.hql > results.txt
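
For reference, some.hql is just an ordinary Hive script; the query below is a hypothetical example, and any valid HiveQL works:

 -- some.hql: hypothetical example query file
 USE wmf;
 SELECT agent_type, COUNT(*) AS requests
 FROM webrequest
 WHERE year = 2015 AND month = 9 AND day = 30 AND hour = 0
 GROUP BY agent_type;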

Use wmf

You should use the wmf database, not wmf_raw, or your queries will be slow. The wmf database includes filtered and preprocessed data.

Always restrict queries to a date range (partitioning)

There is a lot of data in Hive, so you want to make sure your queries have time bounds and complete in a reasonable amount of time. For most tables you will be copying and pasting

WHERE year=2014 AND month=5 AND day=5

to all your queries, and maybe appending AND hour=NN as well.

In fact, you must restrict your query to some Hadoop "partition", otherwise Hive will fail with the error "No partition predicate found". The partitioning of most tables is by year/month/day/hour. For example, see this simple query fail:

 hive (wmf)> SELECT agent_type FROM webrequest
   LIMIT 5;
 FAILED: SemanticException [Error 10041]: No partition predicate found for Alias "webrequest" Table "webrequest"

You must restrict it, for example:

 hive (wmf)> SELECT agent_type FROM webrequest
   WHERE year=2015 AND month=9 AND day=30
   LIMIT 5;
 user
 spider
 user
 user
 spider

Run long queries in a screen session

If your queries take a long time to run, you can execute them in a screen session. If you disconnect from the screen session, whether accidentally or intentionally, your remote shell and queries continue to run on stat1002.
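
A minimal sketch of the workflow (the session name and query file are placeholders):

 $ screen -S hive-query              # start a named screen session
 $ hive -f some.hql > results.txt    # run your long query inside it
 # detach with Ctrl-a d; later, reattach with:
 $ screen -r hive-query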

User-defined functions

You can create or reuse UDFs (user-defined functions); see Analytics/Cluster/Hive/QueryUsingUDF. WMF has developed several UDFs, such as is_wikimedia_bot, geocoded_data, etc. Some of them are used to build the wmf database's derived tables from raw Hadoop data.
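
Registering a UDF follows the standard Hive pattern of ADD JAR plus CREATE TEMPORARY FUNCTION. The jar path and class name below are assumptions for illustration only; see Analytics/Cluster/Hive/QueryUsingUDF for the actual values:

ADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar;  -- assumed jar path
CREATE TEMPORARY FUNCTION is_spider AS 'org.wikimedia.analytics.refinery.hive.IsSpiderUDF';  -- assumed class name
SELECT is_spider(user_agent) FROM webrequest
WHERE year = 2015 AND month = 9 AND day = 30 AND hour = 0
LIMIT 5;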

Sample queries

Here are sample queries from the tables in the wmf database. Run hive and enter use wmf; before running these.

Pageviews

Using the pre-aggregated table pageview_hourly

SELECT
  concat(month,'/',day,'/',year), access_method, sum(view_count)
FROM 
  wmf.pageview_hourly
WHERE
  year = 2015
  AND month = 8
  AND agent_type = "user"
  AND country = "Canada"
  AND project = "en.wikipedia"
GROUP BY
  year, month, day, access_method;

Distinct IPs

SELECT
  COUNT(DISTINCT ip) AS hits
FROM
  webrequest
WHERE
  year = 2014
  AND month = 5
  AND day = 5
  AND hour = 13
  AND uri_host = "en.wikipedia.org"
  AND http_status = 200;

Top raw referrals for an article

SELECT
  referer,
  COUNT(DISTINCT ip) AS hits
FROM
  webrequest
WHERE
  year = 2014
  AND month = 5
  AND day = 5
  AND hour = 13
  AND uri_path = "/wiki/London"
  AND uri_host = "en.wikipedia.org"
  AND http_status = 200
GROUP BY referer
ORDER BY hits DESC
LIMIT 50;

Top internal referrals for an article

SELECT
  SUBSTR(referer,30) AS source,
  COUNT(DISTINCT ip) AS hits
FROM
  webrequest
WHERE
  year = 2014
  AND month = 5
  AND day = 5
  AND hour = 13
  AND uri_path = "/wiki/London"
  AND uri_host = "en.wikipedia.org"
  AND referer LIKE "http://en.wikipedia.org/wiki/%"
  AND http_status = 200
GROUP BY
  SUBSTR(referer,30)
ORDER BY hits DESC
LIMIT 50;

Top outbound clicks from an article

SELECT
  SUBSTR(uri_path,7) AS target,
  COUNT(DISTINCT ip) AS hits
FROM
  webrequest
WHERE
  year = 2014
  AND month = 5
  AND day = 5
  AND hour = 13
  AND uri_host = "en.wikipedia.org"
  AND referer LIKE "http://en.wikipedia.org/wiki/London%"
  AND http_status = 200
GROUP BY SUBSTR(uri_path,7)
ORDER BY hits DESC
LIMIT 50;

FAQ

JsonSerDe Errors

If you get errors like:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.ClassNotFoundException Class org.apache.hive.hcatalog.data.JsonSerDe not found

then enter the following:

ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;

Killing a running query

Once you submit a query, it is handed off to Hadoop. Hadoop runs the query as a YARN application. The Hive CLI is then detached from the actual application. If you Ctrl-C your Hive CLI, you will quit the interface you used to submit the query, but will not actually kill the application. To kill the application, you have to tell YARN you want it dead.

Note the application ID from when your query started. You should see something like:

 Starting Job = job_1387838787660_12241, Tracking URL = http://analytics1010.eqiad.wmnet:8088/proxy/application_1387838787660_12241/

The application ID in this case is application_1387838787660_12241. To kill this application, run:

 yarn application -kill application_1387838787660_12241

Slow queries

A simple SELECT ... LIMIT like the following will prepare all data matching the WHERE clause (here, a whole year) before applying the LIMIT, so it's always best to specify the lowest-level relevant partition for the query.

SELECT http_status,uri_path FROM webrequest WHERE year = 2014 LIMIT 10;
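
For instance, restricting the same query to a single hour touches far fewer partitions and returns much faster:

SELECT http_status, uri_path FROM webrequest
WHERE year = 2014 AND month = 5 AND day = 5 AND hour = 13
LIMIT 10;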

Scratch space

You can create your own database using regular SQL:

CREATE DATABASE dartar;
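
You can then store query results there, for example with CREATE TABLE ... AS SELECT; the database and table names below are hypothetical:

USE dartar;
-- store an hour of aggregated results in your own database
CREATE TABLE status_counts AS
SELECT http_status, COUNT(*) AS requests
FROM wmf.webrequest
WHERE year = 2014 AND month = 5 AND day = 5 AND hour = 13
GROUP BY http_status;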

Batch mode

You can save the output of a Hive query into a TSV file by using Hive in batch mode, e.g.

hive -e "USE wmf; SELECT foo;" > ./bar.tsv

Dates

Dates are expressed as integers due to how we have configured this data. You may encounter the following error if you treat dates as strings by quoting them:

 Diagnostic Messages for this Task:
 Error: java.io.IOException: java.lang.reflect.InvocationTargetException
       [...]
 Caused by: java.lang.reflect.InvocationTargetException
       [...]
 Caused by: java.io.EOFException
       [...]
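
A minimal sketch of the difference:

-- correct: partition columns are integers
SELECT COUNT(*) FROM webrequest
WHERE year = 2014 AND month = 5 AND day = 5 AND hour = 13;

-- incorrect: quoted values are strings and can trigger the error above
SELECT COUNT(*) FROM webrequest
WHERE year = "2014" AND month = "5" AND day = "5" AND hour = "13";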

Out of Memory Errors on Client

When performing large queries, the hive client may run out of memory. If this happens, look for an out-of-memory error at the top of the stack trace.

Invoke Hive via the following to allocate more memory:

 export HADOOP_HEAPSIZE=1024 && hive

Killing Jobs

Standard Hadoop practice involves launching a long-running job and then realizing you forgot a parameter. Here is how you kill a job from the command line (the same procedure as in "Killing a running query" above):

yarn application -kill [applicationID]

Your application ID can be found in the Hive messages: it's the application_XXX string at the end of the tracking URL:

Starting Job = job_1409078537822_2754, Tracking URL = http://analytics1010.eqiad.wmnet:8088/proxy/application_1409078537822_2754/

INSERT OVERWRITE ... IF NOT EXISTS

When using INSERT OVERWRITE with IF NOT EXISTS, be sure to use unqualified table names. Otherwise, the IF NOT EXISTS gets silently ignored.
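
A minimal sketch, assuming a partitioned scratch table mytable that you own (all names are hypothetical):

USE dartar;
-- honored: unqualified table name, so an existing partition is left alone
INSERT OVERWRITE TABLE mytable PARTITION (year=2015) IF NOT EXISTS
SELECT http_status FROM wmf.webrequest
WHERE year = 2015 AND month = 1 AND day = 1 AND hour = 0;

-- silently ignored: qualified table name, so the partition is overwritten anyway
INSERT OVERWRITE TABLE dartar.mytable PARTITION (year=2015) IF NOT EXISTS
SELECT http_status FROM wmf.webrequest
WHERE year = 2015 AND month = 1 AND day = 1 AND hour = 0;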

Earliest data available in hadoop

What's the most efficient way to obtain the earliest, non-truncated hour or date of data still available in Hadoop?

show partitions webrequest;

This shows all of the available partitions. Request log data in Hadoop is currently pruned after 31 days.
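
The output lists one line per partition, so the earliest entry tells you the oldest data still present. A hypothetical excerpt (the exact partition layout may differ):

 webrequest_source=text/year=2015/month=9/day=1/hour=0
 webrequest_source=text/year=2015/month=9/day=1/hour=1
 ...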

Avoiding overgreedy scans / Operator precedence

In HiveQL, AND takes precedence over OR.

So when trying to select all “mobile” and “text” partitions for 2015-01-01,

  ... WHERE year=2015 AND month=1 AND day=1 AND webrequest_source='mobile' OR webrequest_source='text'

would not do the trick, as Hive parses it as

  ... WHERE (year=2015 AND month=1 AND day=1 AND webrequest_source='mobile') OR webrequest_source='text'

and would select all “text” partitions. This unintentionally skews the data you want to obtain, makes the query take longer, and keeps all “text” partitions locked.

You can use parentheses to override operator precedence. So for example

  ... WHERE year=2015 AND month=1 AND day=1 AND (webrequest_source='mobile' OR webrequest_source='text')

would select the “mobile” and “text” partitions for 2015-01-01.

My query does not start

Check if the cluster is overloaded. If it is, let the Analytics team know.

My query does not progress / finish

Querying through Hive is not as fast as querying a plain SQL table, so expect to wait an hour or so for your result to show up.

But if the progress counter in your query does not increase (say, at least 1% per minute), you are either unintentionally querying a lot of data, or the cluster is stalled.

If you are querying “a lot” of data (a week's worth of webrequest logs is definitely already “a lot”!), consider splitting up your query. That will increase performance for you, cut down your query's runtime, and leave cluster resources available to other users. If you run into blockers, let the Analytics team know.

If your query is small, nice and neat, and does not progress nonetheless, please check if the cluster is overloaded. If it is, let the Analytics team know.