Analytics/Cluster/Hive/QueryUsingUDF
Intro
UDF are user defined functions. They can make complex calculations over every record of data retrieved from hive.
How to write a UDF in hive: [1]
You will usually want to use the latest version of the refinery-hive UDFs. These can be found on stat1002 at
/srv/deployment/analytics/refinery/artifacts/refinery-hive.jar
or in HDFS at
/wmf/refinery/current/artifacts/refinery-hive.jar
Using an existing UDF: User Agent UDF
The hive cluster includes a user agent UDF, the UDF can be used to parse user agents from request logs.
In order to use it you need to define a function in the following fashion:
hive> ADD JAR /srv/deployment/analytics/refinery/artifacts/refinery-hive.jar; hive> CREATE TEMPORARY FUNCTION ua as 'org.wikimedia.analytics.refinery.hive.UAParserUDF'; OK Time taken: 0.046 seconds
Execute select using function:
hive> select ua(user_agent) from webrequest where year=2014 and month=9 and day=1 and hour=12
Output will look like:
{"browser_major":"3","os_family":"Android","os_major":"4","device_family":"Lenovo A3500-H","browser_family":"UC Browser","os_minor":"4"} 9
{"browser_major":"30","os_family":"Android","os_major":"4","device_family":"LG-D631/D63110b","browser_family":"Chrome Mobile","os_minor":"4"} 260
{"browser_major":"30","os_family":"Android","os_major":"4","device_family":"SM-N9008","browser_family":"Chrome Mobile","os_minor":"4"} 18
{"browser_major":"31","os_family":"Android","os_major":"4","device_family":"SM-N9005","browser_family":"Chrome Mobile","os_minor":"4"} 708
Group by device_family:
select a.device_family, count(*) as cnt from (
select ua(user_agent)['device_family'] as device_family
from webrequest
where webrequest_source='mobile' and year=2014 and month=10 and day=30 and hour=0
) a
group by a.device_family order by cnt desc limit 10;
iPhone 7773691 Other 2940052 iPad 2911523 Spider 770622 iPod 299841 Samsung GT-I9300 173951 Samsung GT-I9505 170575 Samsung SCH-I545 161654 Samsung SM-G900V 150833 HTC One 132639
Testing a UDF you just wrote
If the definition of the UDF is not yet merged you would need to build a jar that contains the udf. You can checkout code on 1002 and build the jar there
Build the jar using mvn:
mvn package
Once you have compiled your UDF, you will need to register it with Hive before using it.
hive> ADD JAR /some/path/refinery-hive-0.0.1.jar;
Added /some/path/refinery-hive-0.0.1.jar to class path Added resource: /some/path/refinery-hive-0.0.1.jar
hive> CREATE TEMPORARY FUNCTION blah as 'org.wikimedia.analytics.refinery.hive.BlahUDF'; OK Time taken: 0.046 seconds
Execute select using function
hive> select blah(some_colum) from webrequest where year=2014 and month=9 and day=1 and hour=12 limit
Testing changes to existing udf
You will need to build the jar just like you would in the case of creating a new udf but when testing the udf you need to override the path that loads some jars by default. Otherwise hive will be existing code rather than new code.
Leaving hive.aux.jars.path empty will do the trick.
hive --hiveconf hive.aux.jars.path= -f test-udf.hql
Sampling Data: Get a user agent report for the past month
Hadoop holds about a month of data, which means that there is a LOT of data. You do not need to a access it all in order to get a sufficiently precise user agent report. We get about 10.000 request per second for mobile so sampling 1 in 1000 gives you about 18 million records that should be sufficient to get a monthly report.
A hive query like the following does the sampling and the grouping using the UDF:
ADD JAR /home/nuria/refinery-hive-0.0.1.jar;
CREATE TEMPORARY FUNCTION ua as 'org.wikimedia.analytics.refinery.hive.UAParserUDF';
use wmf_raw;
SELECT a.useragent, Count(*)
FROM
(select ua(user_agent) as useragent
from webrequest TABLESAMPLE(BUCKET 1 OUT OF 1000 ON rand())
where year=2014 and webrequest_source="mobile") a
GROUP BY a.useragent
To execute (timing the output):
time hive -f select.sql > output.txt
Output will look like:
{"browser_major":"3","os_family":"Android","os_major":"4","device_family":"Lenovo A3500-H","browser_family":"UC Browser","os_minor":"4"} 9
{"browser_major":"30","os_family":"Android","os_major":"4","device_family":"LG-D631/D63110b","browser_family":"Chrome Mobile","os_minor":"4"} 260
{"browser_major":"30","os_family":"Android","os_major":"4","device_family":"SM-N9008","browser_family":"Chrome Mobile","os_minor":"4"} 18
{"browser_major":"31","os_family":"Android","os_major":"4","device_family":"SM-N9005","browser_family":"Chrome Mobile","os_minor":"4"} 708