たごもりすメモ

Code and other topics.

Started Project Woothee, a UserAgent classifier


UserAgentCPAN()Hadoop使JavaJavaUserAgentJava*1UserAgent

v0.1.0: Java and Perl implementations.*2

https://github.com/tagomoris/woothee
(: User-Agent woothee  - tagomoris) 

(v0.2.0: Ruby/Python as well; see: http://d.hatena.ne.jp/tagomoris/20120613/1339583174)

*3

API livedoor blog  livedoor news 使

Features


UserAgent
// import is.tagomor.woothee.Classifier;
// import is.tagomor.woothee.DataSet;
Map<String,String> r = Classifier.parse("user agent string");
r.get("name");        // => name of browser (or a string that names the user-agent)
r.get("category");    // => "pc", "smartphone", "mobilephone", "appliance", "crawler", "misc", "unknown"
r.get("os");          // => os from user-agent, or carrier name of mobile phones
r.get("version");     // => version of browser, or terminal type name of mobile phones

JavaPerl
use Woothee;

Woothee::parse("Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)");
# => {'name'=>"Internet Explorer", 'category'=>"pc", 'os'=>"Windows 7", 'version'=>"8.0"}

Woothee::parse("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)");
# => {'name'=>"Googlebot", 'category'=>"crawler", 'os'=>"UNKNOWN", 'version'=>"UNKNOWN"}

Woothee::parse("Opera/9.80 (Android; Opera Mini/6.5.27452/26.1305; U; ja) Presto/2.8.119 Version/10.54");
# => {'name'=>"Opera", 'category'=>"smartphone", 'os'=>"Android", 'version'=>"9.80"}



API
use Woothee;
Woothee::is_crawler("Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)"); # => 0
Woothee::is_crawler("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"); # => 1
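The four keys in the result (name, category, os, version) are enough for simple aggregation on the caller's side. What follows is only a rough sketch and not part of Woothee: the maps are hand-written stand-ins that mirror the example results above, and `CategoryTally`, `tally`, `isCrawler` and `sample` are hypothetical names. In particular, `isCrawler` here is just a check on the category value, which is an assumption about how such a predicate could be expressed, not a description of how `Woothee::is_crawler` is implemented.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CategoryTally {
    // Hypothetical helper: treats a parsed result as a crawler when its category says so.
    public static boolean isCrawler(Map<String, String> parsed) {
        return "crawler".equals(parsed.get("category"));
    }

    // Hypothetical helper: counts parsed results per "category" value, in first-seen order.
    public static Map<String, Integer> tally(List<Map<String, String>> results) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map<String, String> r : results) {
            counts.merge(r.getOrDefault("category", "unknown"), 1, Integer::sum);
        }
        return counts;
    }

    // Hand-written stand-ins for parse results (copied from the examples above, not produced by Woothee).
    public static List<Map<String, String>> sample() {
        return List.of(
            Map.of("name", "Internet Explorer", "category", "pc", "os", "Windows 7", "version", "8.0"),
            Map.of("name", "Googlebot", "category", "crawler", "os", "UNKNOWN", "version", "UNKNOWN"),
            Map.of("name", "Opera", "category", "smartphone", "os", "Android", "version", "9.80"));
    }

    public static void main(String[] args) {
        // Prints one count per category found in the sample data.
        System.out.println(tally(sample()));
    }
}
```

With a real classifier in place, `sample()` would be replaced by the maps returned from parsing each access-log line.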

2012LinuxBSD(Android)ThunderbirdIceweaselUNKNOWN
hive udf

使
add jar woothee.jar;

create temporary function parse_agent as 'is.tagomor.woothee.hive.ParseAgent';
create temporary function is_pc as 'is.tagomor.woothee.hive.IsPC';
create temporary function is_smartphone as 'is.tagomor.woothee.hive.IsSmartPhone';
create temporary function is_mobilephone as 'is.tagomor.woothee.hive.IsMobilePhone';
create temporary function is_appliance as 'is.tagomor.woothee.hive.IsAppliance';
create temporary function is_crawler as 'is.tagomor.woothee.hive.IsCrawler';
create temporary function is_misc as 'is.tagomor.woothee.hive.IsMisc';
create temporary function is_unknown as 'is.tagomor.woothee.hive.IsUnknown';
create temporary function is_in as 'is.tagomor.woothee.hive.IsIn';

SELECT
  count(is_pc(parsed_agent)) as pc_pageviews,
  count(is_in(parsed_agent, array('pc', 'mobilephone', 'smartphone', 'appliance'))) as total_pageviews
FROM (
  SELECT parse_agent(useragent) as parsed_agent FROM access_log WHERE date='today'
) x;

*4便
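Why TRUE-or-NULL matters for the query above: SQL's count(expr) counts every non-NULL value, so a predicate that returned FALSE for non-matching rows would still be counted. The sketch below imitates that behaviour in plain Java. It is an illustration under assumptions, not the actual UDF code: `TrueOrNullCount`, `isIn`, `count`, `countMatches` and `sample` are hypothetical names, and the rows are hand-written maps rather than real parse_agent output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class TrueOrNullCount {
    // Hypothetical stand-in for a predicate like is_in: TRUE on a match, null otherwise (never FALSE).
    public static Boolean isIn(Map<String, String> parsed, List<String> categories) {
        String category = parsed.get("category");
        return (category != null && categories.contains(category)) ? Boolean.TRUE : null;
    }

    // Imitates SQL count(expr): only non-null values are counted.
    public static long count(List<Boolean> column) {
        return column.stream().filter(Objects::nonNull).count();
    }

    // Applies the predicate to every row, then counts the resulting column.
    public static long countMatches(List<Map<String, String>> rows, List<String> categories) {
        List<Boolean> column = new ArrayList<>();
        for (Map<String, String> row : rows) {
            column.add(isIn(row, categories));
        }
        return count(column);
    }

    // Hand-written rows standing in for parsed agents (assumed data, not real output).
    public static List<Map<String, String>> sample() {
        return List.of(
            Map.of("category", "pc"),
            Map.of("category", "crawler"),
            Map.of("category", "smartphone"));
    }

    public static void main(String[] args) {
        List<String> humans = List.of("pc", "mobilephone", "smartphone", "appliance");
        System.out.println("pc_pageviews=" + countMatches(sample(), List.of("pc")));
        System.out.println("total_pageviews=" + countMatches(sample(), humans));
    }
}
```

If `isIn` returned FALSE instead of null, `count` would include those rows as well, and the query would need sum(if(...)) or a WHERE clause instead of a plain count.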

woothee.Classifier.parse() I23wrapperpig

Why I made it


JavaPerlUserAgent
使(yaml)UserAgent(yaml)
Perl/Javayaml
DSL*5

docomo/au/SoftBankon/off使configure

 Windows/OSX/MSIE/Chrome/Safari/Firefox/iOS/Android/Googlebot 6
patches welcome!

 UA UA
Perl/JavaJavaPerl7 

*1: and I don't particularly want to write it in Java, either

*2: plus a Hive UDF implementation

*3: Supporting new browsers, keeping up with UserAgent pattern changes when existing browsers are upgraded, supporting new OSes and new crawlers, and so on and so on

*4: All of the is_xxx functions return TRUE or NULL so that they can be used with count

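*5: For Perl I briefly considered doing something clever with its powerful extended regular expressions, but a quick benchmark showed it would be too slow to be usable.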