Notes from my trip to Lucene Revolution

Overview

agenda

-> http://www.lucenerevolution.com/agenda

key

news of Lucene/Solr4.0 (SolrCloud, NRT, Join…)
big data analysis
collaboration with other OSS (Hadoop family, CMS, admin…)
vendor solutions (LucidWorks, MapR, Windows…)

slide & video

-> http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012

Day 1

8:30 AM – 9:00 AM	Opening Remarks – What’s New at Lucid Imagination Paul Doscher, CEO Lucid Imagination
9:00 AM – 9:45 AM	Lucene/Solr on Windows: Azure: Microsoft and OSS Working Together Gianugo Rabellino, Microsoft
	LUCENE/SOLR IN ACTION TRACK	TECHNICAL DEEP DIVE TRACK	BIG DATA TRACK
10:15 AM – 11:00 AM	Building Query Auto-Completion Systems with Lucene 4.0 Sudarshan Gaikaiwari, Software Engineer, Yelp	Search is Not Enough: Using Solr for Analytics Steve Kearns, Basis Technology	Big Search with Big Data Principles Eric Pugh, Principal, OpenSource Connections
11:10 AM – 11:50 AM	NetDocuments – Journey from FAST to Solr David Hamson & Mou Nandi, NetDocuments	Integrating Lucene into a Transactional XML Database Petr Pleshachkov, EMC	Indexing Wikipedia as a Benchmark of Single Machine Performance Limits Paddy Mullen, Independent Contractor
1:15 PM – 1:55 PM	Japanese Linguistics in Lucene and Solr Christian Moen, Founder and CEO Atilika Inc.	Grouping and Joining in Lucene/Solr Martijn van Groningen, SearchWorkings	Using Lucene/Solr to Surface the Big Data of Social Media Glenn Engstrand, Sr. Software Engineer, Zoosk, Inc
2:05 PM – 2:45 PM	Using Lucene/Solr to Build CiteSeerX and Friends C. Lee Giles, Professor, Pennsylvania State University	Automata Invasion Michael McCandless, IBM & Robert Muir, Lucid Imagination	Using the LucidWorks REST API to Support User-Configurable Big Data Search Experiences Mark Davis, CTO, Kitenga
2:55 PM – 3:35 PM	Introducing Hydra – An Open Source Document Processing Framework Joel Westberg, Findwise	How SolrCloud Changes the User Experience In a Sharded Environment Erick Erickson, Committer, Lucid Imagination	Delivering on the Promise of Big Data at the “Tactical Edge” Wes Caldwell, Chief Architect, ISS, Inc.
4:00 PM – 5:15 PM	Stump The Chump
6:30 PM – 9:30 PM	Conference Party at Museum of Science

Opening Remarks – What’s New at Lucid Imagination

Opening talk by Paul Doscher, CEO of Lucid Imagination.
(He has been CEO or vice president at a number of companies.)
Pretty cool.

Lucene/Solr on Windows: Azure: Microsoft and OSS Working Together

A talk about how Microsoft is becoming more open.

Windows Live, Office, Bing, and Skype have already been opened up
Windows gets along with the following -> Linux, Solr, Hadoop, Firefox, Drupal, Java, PHP, node.js…

The message was roughly: Microsoft products are this friendly with open source, so please use them.

Search is Not Enough: Using Solr for Analytics

Steve Kearns of Basis Technology gave this talk.
Basis helps enterprises with information retrieval,
and there is an introductory video here. -> http://www.basistech.com

The idea is to turn unstructured content into structured content.
Analysis comes at the following levels:

document level analysis – language identification – summarization – categorization
sub-document level analysis – segmentation (the example shown was splitting 東京ルパン上映時間 into 東京, ルパン, 上映, 時間) – lemmatization – stemming – entity extraction – facet relationships – sentiment (sentence, paragraph, entity, emotion, etc.)
cross-document level analysis – near-duplicate documents – document clustering – co-reference resolution

There are perhaps three patterns for combining Solr with analytics:

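Make Analyzer: build an Analyzer made up of a CharFilter, a Tokenizer, and TokenFilter(s), and set it on a FieldType in schema.xml. Separate analyzers can be set for indexing and for querying. In this case, however, you cannot access the document itself.
Update Request Processor: modify the RequestHandler and set it in solrconfig.xml. It is called before the Analyzer, so you can send the document somewhere for processing and then index the output. In this case you have full access to the document.
preprocessor: do all the analysis up front and index as the final step. In this case you may end up having to do yourself whatever Solr would otherwise do for you.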
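
To make the difference between the first two patterns concrete, here is a minimal Python sketch. It is not Solr code: the function names and the `title_length` field are made up for illustration. The point is only that an analyzer chain sees the text of a single field, while a processor that runs before analysis sees the whole document.

```python
import re
import unicodedata

# Toy stand-ins for the Solr concepts; none of these are Solr APIs.

def char_filter(text):
    # CharFilter role: normalize raw characters before tokenizing.
    return unicodedata.normalize("NFKC", text)

def tokenizer(text):
    # Tokenizer role: split the text into tokens.
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    # TokenFilter role: transform the token stream.
    return [t.lower() for t in tokens]

def analyze(field_text):
    # Pattern 1: the analyzer only ever receives one field's text.
    return lowercase_filter(tokenizer(char_filter(field_text)))

def update_processor(doc):
    # Pattern 2: runs before analysis and receives the whole document,
    # so it can derive a new (hypothetical) field from existing ones.
    enriched = dict(doc)
    enriched["title_length"] = len(analyze(doc["title"]))
    return enriched

doc = {"id": "1", "title": "Search Is Not Enough", "body": "Using Solr for analytics"}
print(analyze(doc["title"]))
print(update_processor(doc))
```

In real Solr the first pattern lives in schema.xml and the second in solrconfig.xml, as noted above; the sketch only mirrors who gets to see what.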

Indexing Wikipedia as a Benchmark of Single Machine Performance Limits

Exactly what the title says.

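Parsed the MediaWiki data with JWPL (Java Wikipedia Library)
Used DataImportHandler, so the import was defined in data-config.xml
On a MacBook, a Linux machine, and some cloud service, loaded 33 GB (12 million documents) into each and ran searches
In every case loading took roughly 15 hours and searches took 100-1000 ms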
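
As a rough sanity check on those figures, the indexing throughput works out as follows. This is just arithmetic on the approximate numbers from the talk (12 million documents, about 15 hours), not a measurement of my own.

```python
# Approximate figures from the talk: 12 million documents in about 15 hours.
docs = 12_000_000
hours = 15

docs_per_second = docs / (hours * 3600)
print(f"{docs_per_second:.0f} docs/sec")
```

That is a couple of hundred documents per second on a single machine, which gives a feel for the baseline being compared.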

The benchmark results themselves don't matter much.
My impression was that, because the method is simple, it can serve as a baseline for comparing whatever environments you have access to.
Everything is available on GitHub.
-> https://github.com/paddymul/wikipedia_solr

Japanese Linguistics in Lucene and Solr

A talk on Japanese search by Christian, who recently became a Lucene/Solr committer.

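What kind of language Japanese is (it is not written with spaces between words)
There are two main approaches to indexing Japanese -> n-gramming (splitting every n characters): meaning is not preserved and the semantics can even change, search noise is high, and the index gets large -> morphological analysis
Then on to kuromoji: it picks the shortest path through a huge lattice
For a quick way to try Japanese analysis: Lucene/Solr 3.6 and later ship with sensible default settings
Use Lucene's JapaneseAnalyzer or Solr's field type text_ja. The text_ja chain is JapaneseTokenizer, JapaneseBaseFormFilter, JapanesePartOfSpeechStopFilter, CJKWidthFilter, StopFilter…
Then the usual examples: 関西国際空港, マネージャー -> マネージャ, 買う / 買わない / 買います…
Japanese-related Solr 4 topics: improvements to JapaneseTokenizer and the addition of a spell checker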
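
The two approaches can be sketched in a few lines of Python. This is a toy under stated assumptions: the word costs in `toy_costs` are invented, and the path search ignores the costs of transitions between words that a real morphological analyzer would presumably also take into account. It is not kuromoji's implementation.

```python
def bigrams(text):
    # n-gramming with n = 2: every overlapping pair of characters becomes a term.
    return [text[i:i + 2] for i in range(len(text) - 1)]

def best_segmentation(text, costs):
    # Lattice idea: every dictionary word found in the text is an edge,
    # and we keep the cheapest path from position 0 to each end position.
    # `costs` is a made-up word -> cost dictionary for illustration only.
    best = {0: (0, [])}  # end position -> (total cost, words so far)
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if start in best and word in costs:
                total = best[start][0] + costs[word]
                if end not in best or total < best[end][0]:
                    best[end] = (total, best[start][1] + [word])
    return best[len(text)][1] if len(text) in best else None

print(bigrams("東京都"))

toy_costs = {"東京": 2, "都": 3, "東": 4, "京都": 2, "京": 4}
print(best_segmentation("東京都", toy_costs))
```

The bigram output contains 京都 (Kyoto) even though the text is about 東京都 (Tokyo Metropolis), which is the kind of search noise mentioned above. The lattice search instead settles on 東京 + 都 because that path has the lowest total cost under the toy dictionary.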

Using the LucidWorks REST API to Support User-Configurable Big Data Search Experiences

The LucidWorks REST API, and Kitenga's ZettaVox and ZettaSearch
Intro video -> http://www.youtube.com/watch?v=HM0SUuaHYqc&feature=youtu.be

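LucidWorks REST API: a JSON-based REST API for manipulating system settings, schemas, and so on (admin tasks)
ZettaVox: instead of writing MapReduce, you define workflow pipelines, which makes it easy to mine and analyze large volumes of data
ZettaSearch: automatically suggests metadata from analyses along various dimensions, and lets you pick the ones you like to assemble a business solution (?)
Looks like a rich interface for getting friendly with Hadoop and Solr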

Introducing Hydra – An Open Source Document Processing Framework

An introduction to Hydra, a document-processing framework built by Findwise

The goal is to enrich unstructured data by attaching metadata to it

language detection
sentiment analysis
headline extraction
regular expression matching & extraction

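You could use Solr's DataImportHandler for this kind of thing, but that requires a Solr expert.
Also, traditional pipelines are sequential (doc -> □□□□□ -> solr),
so if any part breaks, the whole thing fails.
For example, Tika can eat up the JVM and die, which means a restart...

This is where Hydra comes in.
The picture is that the individual processing steps (called stages) are scattered around inside a cloud.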

scalable: the central repository & worker nodes can scale horizontally
distributed: any processing node can work on any document
fail-safe: a node going down will not affect the documents in the pipeline
robust: all stages run in separate JVMs
easy to use/configure: we can debug stages from an IDE against actual data

The architecture consists of the core, an admin interface, stages, and MongoDB
Hadoop's scalability is impressive, but... Hydra is more real-time
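
The contrast with a sequential pipeline can be sketched like this. Everything here is hypothetical: `repository`, `run_stage`, and the stage functions are illustrative names, not Hydra's API, and the real thing runs each stage in its own JVM against a shared store rather than in one Python process.

```python
# Hypothetical sketch of "stages around a central repository"; not Hydra's API.
repository = {
    "doc1": {"text": "Hello world", "done": set()},
    "doc2": {"text": "Bonjour le monde", "done": set()},
}

def detect_language(doc):
    # Crude stand-in for a language detection stage.
    doc["language"] = "fr" if "Bonjour" in doc["text"] else "en"

def extract_headline(doc):
    # Crude stand-in for a headline extraction stage.
    doc["headline"] = doc["text"].split()[0]

def flaky_stage(doc):
    # Simulates a stage that dies on every document.
    raise RuntimeError("stage crashed")

def run_stage(name, func):
    # A stage picks up any document it has not processed yet.
    for doc in repository.values():
        if name in doc["done"]:
            continue
        try:
            func(doc)
            doc["done"].add(name)
        except RuntimeError:
            # The document stays in the repository untouched,
            # so the other stages can still work on it.
            pass

# Stage order does not matter, and one failing stage does not block the rest.
run_stage("headline", extract_headline)
run_stage("flaky", flaky_stage)
run_stage("language", detect_language)
print(repository["doc2"]["language"], repository["doc2"]["headline"])
```

In a sequential doc -> stage -> stage -> solr chain, the crashing stage would have stopped everything behind it. Here the documents simply remain in the repository without the failed stage's mark.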

Stump The Chump

A session where Chris Hostetter and other committers field questions from the audience one after another.
Ask a good question and you get a copy of Lucene in Action (http://www.amazon.com/Lucene-Action-Erik-Hatcher/dp/1932394281) (^o^)/
But... I already have one at home and didn't want to carry something that heavy back on the plane...

Conference Party at Museum of Science

The museum was rented out for the party.
After wandering around for a while, I ate, drank, and said hello to Yonik Seeley.

Day 2

9:00 AM – 9:40 AM	Enabling Scalable Search, Discovery and Analytics with Solr,Mahout and Hadoop Grant Ingersoll, Chief Scientist, Lucid Imagination
9:40 AM – 10:15 AM	Apache Hadoop:Now, Next and Beyond Ari Zilka, Chief Product Officer, Hortonworks
	LUCENE/SOLR IN ACTION TRACK	TECHNICAL DEEP DIVE TRACK	BIG DATA TRACK
10:45 AM – 11:25 AM	Building a Real-time Solr-powered Recommendation Engine Trey Grainger, Search Tech Development Mgr, CareerBuilder	Is Your Index Reader Really Atomic or Maybe Slow? Uwe Schindler, SD DataSolutions GmbH	Solr 4: The SolrCloud Architecture Mark Miller, Lucid Imagination
1:15 PM – 1:55 PM	Television News Search and Analysis with Lucene/Solr Kai Chan, Instructional Technology and Database Developer, Social Sciences Computing UCLA	Challenges in Maintaining a High Performance Search Engine Written in Java Simon Willnauer,Apache Lucene	Indexing Big Data on Amazon AWS Scott Stults, Solutions Architect, Open Source Connections
2:05 PM – 2:45 PM	Solr, Lucene and Hadoop @ Etsy David Giffin, Software Engineer, Etsy	Updateable Fields in Lucene and other Codec Applications Andrzej Bialecki, Lucid Imagination	The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics M.C. Srivas, CTO & Founder, MapR
2:55 PM – 3:35 PM	How to Access Your Library Book Collections Using Solr Engy Ali, Software Project Manager, The Library of Alexandria	Things Made Easy: One Click CMS Integration with Solr & Drupal Peter Wolanin, Acquia, Inc	How is the Government Spending Your Money? How GCE is Using Lucene and the GCE Big Data Cloud Seshu Simhadri, CTO, Global Computer Enterprises
4:00 PM – 4:40 PM	How to Gain Greater Business Intelligence from Lucene/Solr Patrick Beaucamp, Bpm-Conseil	Search with Polygons: Another Approach to Solr Geospatial Search Andrew Urquhart, Principal Systems Engineer, Raytheon	Big Data Meets Metadata – Analyzing Large Data Sets Jeremy Bentley, CEO, Smartlogic

Enabling Scalable Search, Discovery and Analytics with Solr,Mahout and Hadoop

SDA : Search, Discovery, and Analytics
-> users need real-time, ad hoc access to content, so batch processing isn't enough

what do you need for SDA?

fast, efficient, scalable search
large scale, cost effective storage
large scale processing power
NLP and machine learning tools that scale to enhance discovery and analysis

A handy platform for doing all of this is LucidWorks Big Data
-> http://www.lucidimagination.com/products/lucidworks-search-platform/lucidworks-big-data
Through a JSON-based RESTful API, it lets you do the following in real time and ad hoc:
search & indexing, workflows, admin, analytics, machine learning, proxy?...

Components

Lucene/Solr 4.0-dev: sharded with SolrCloud; 1 second soft commits for NRT updates; 1 minute hard commits (without searcher reopen)
Restlet 2.1 (SDAEngine)
Hadoop: map-reduce jobs for ETL and bulk indexing into the SolrCloud sharded system; leverages Pig and custom MR jobs
Mahout: K-means clustering
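
The split between frequent soft commits and less frequent hard commits without a searcher reopen can be pictured with a toy model. This is my own simplified illustration of the idea, assuming soft commits control visibility and hard commits control durability. The `ToyIndex` class is hypothetical and is not how Lucene/Solr is implemented.

```python
class ToyIndex:
    # Illustrative model only; not Lucene/Solr internals.
    def __init__(self):
        self.pending = []      # added but not yet searchable
        self.searchable = []   # visible to searches
        self.durable = []      # flushed to stable storage

    def add(self, doc):
        self.pending.append(doc)

    def soft_commit(self):
        # Makes new documents searchable quickly (the NRT part).
        self.searchable.extend(self.pending)
        self.pending.clear()

    def hard_commit(self):
        # Persists everything added so far. Because the searcher is not
        # reopened, the searchable view is left exactly as it was.
        self.durable = self.searchable + self.pending

    def search(self, term):
        return [d for d in self.searchable if term in d]

index = ToyIndex()
index.add("solr cloud")
index.hard_commit()
print(index.search("solr"), index.durable)
index.soft_commit()
print(index.search("solr"))
```

With a 1 second soft commit and a 1 minute hard commit, new documents would become searchable within about a second while the more expensive flush happens only once a minute.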

Apache Hadoop:Now, Next and Beyond

Hortonworks on Hadoop's present, its future, and beyond...

Hadoop Now: stable
Hadoop Next: Hadoop 2; high availability; next-gen MapReduce & HDFS; the YARN layer is new and APIs are still evolving
Hadoop Beyond: integration with the ecosystem; next-gen data architecture

Hortonworks' vision & role

make hadoop easy to use and consume
make hadoop an enterprise viable data platform
provide open APIs and data services
enable ecosystem at each layer of the data stack

Apparently, by the end of 2015 half of the world's data will be processed by Hadoop

Building a Real-time Solr-powered Recommendation Engine

By Trey Grainger, co-author of Solr in Action (coming soon)
Lucene isn't text search, but a token matching engine

various recommendations

content based: attribute based – set fields (job title, salary, place…) – use a boost for each field in the Solr select query; hierarchical – healthcare//nursing//transplant… – educator//postsecondary//nursing…; textual similarity – MoreLikeThisRequestHandler / SearchHandler are good examples; concept based – create a taxonomy/dictionary to define concepts – then manually tag documents or use a classification system; geography and recommendations – Solr's geodist()
behavioral based (the way this made clever use of existing Solr fields and queries was impressive... I hope the slides get published): collaborative filtering – find similar users who like the same documents – search for docs liked by those similar users – comparison with Mahout: easier & real time
hybrid approaches
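
The two collaborative filtering steps listed above can be sketched in plain Python. The talk did this with Solr fields and queries, which I did not capture in detail, so this sketch swaps in ordinary sets instead. The `likes` data and the `recommend` function are made up for illustration.

```python
# Made-up data: which documents (job postings) each user liked.
likes = {
    "alice": {"job1", "job2", "job3"},
    "bob": {"job2", "job3", "job4"},
    "carol": {"job5"},
}

def recommend(user, likes, min_overlap=1):
    mine = likes[user]
    # Step 1: find similar users who like the same documents.
    similar = [u for u, docs in likes.items()
               if u != user and len(docs & mine) >= min_overlap]
    # Step 2: collect documents liked by those users that this user
    # has not seen, ranked by how many similar users liked each one.
    counts = {}
    for u in similar:
        for d in likes[u] - mine:
            counts[d] = counts.get(d, 0) + 1
    return sorted(counts, key=lambda d: (-counts[d], d))

print(recommend("alice", likes))
```

Here bob shares two liked documents with alice, so his remaining document is recommended to her, while carol shares none and contributes nothing.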

important

Custom Scoring with Payloads
Measuring Results Quality
Understanding users (there is no right recommendation algorithm)

Challenges in Maintaining a High Performance Search Engine Written in Java

A talk about what it is like to develop Lucene

Solr, Lucene and Hadoop @ Etsy

history of Etsy (e-commerce site) -> http://www.etsy.com/

2007: 1 million listings; PostgreSQL single master
2008: 2 million listings; PostgreSQL single master; Solr master & 4 slaves
2009: 4 million listings; PostgreSQL single master; Solr master & 6 slaves
2010: 7 million listings; PostgreSQL single master; Solr master & 10 slaves, custom import handler
2011: 10 million listings; sharded PostgreSQL; Solr master & 24 slaves
2012: add HBase indexing …?

Also in use: Jenkins, Memcached, BitTornado (for replication), Oozie (workflow engine for Hadoop). They build their own tools and publish them -> https://github.com/etsy

How is the Government Spending Your Money?
How GCE is Using Lucene and the GCE Big Data Cloud

Nothing flashy, but the architecture decisions seemed thoroughly sound... I'd like to look at the slides again

Big Data Meets Metadata – Analyzing Large Data Sets

So much innovation has occurred on the net,
but from 2001 to 2011 user satisfaction with search stayed flat!!! (about 50%)
Then...

file management
index management
automation of 1 & 2
new: content intelligence -> metadata, identifying, classifying, taxonomy/ontology
