Notes from my visit to Lucene Revolution
Overview
agenda
key
- news of Lucene/Solr4.0 (SolrCloud, NRT, Join…)
- big data analysis
- collaboration with other OSS (Hadoop family, CMS, admin…)
- vendors’ solutions (LucidWorks, MapR, Windows…)
slide & video
Day 1
8:30 AM – 9:00 AM | Opening Remarks – What’s New at Lucid Imagination (Paul Doscher, CEO, Lucid Imagination)
9:00 AM – 9:45 AM | Lucene/Solr on Windows: Azure: Microsoft and OSS Working Together (Gianugo Rabellino, Microsoft)
Time | LUCENE/SOLR IN ACTION TRACK | TECHNICAL DEEP DIVE TRACK | BIG DATA TRACK
10:15 AM – 11:00 AM | Building Query Auto-Completion Systems with Lucene 4.0 (Sudarshan Gaikaiwari, Software Engineer, Yelp) | Search is Not Enough: Using Solr for Analytics (Steve Kearns, Basis Technology) | Big Search with Big Data Principles (Eric Pugh, Principal, OpenSource Connections)
11:10 AM – 11:50 AM | NetDocuments – Journey from FAST to Solr (David Hamson & Mou Nandi, NetDocuments) | Integrating Lucene into a Transactional XML Database (Petr Pleshachkov, EMC) | Indexing Wikipedia as a Benchmark of Single Machine Performance Limits (Paddy Mullen, Independent Contractor)
1:15 PM – 1:55 PM | Japanese Linguistics in Lucene and Solr (Christian Moen, Founder and CEO, Atilika Inc.) | Grouping and Joining in Lucene/Solr (Martijn van Groningen, SearchWorkings) | Using Lucene/Solr to Surface the Big Data of Social Media (Glenn Engstrand, Sr. Software Engineer, Zoosk, Inc)
2:05 PM – 2:45 PM | Using Lucene/Solr to Build CiteSeerX and Friends (C. Lee Giles, Professor, Pennsylvania State University) | Automata Invasion (Michael McCandless, IBM & Robert Muir, Lucid Imagination) | Using the LucidWorks REST API to Support User-Configurable Big Data Search Experiences (Mark Davis, CTO, Kitenga)
2:55 PM – 3:35 PM | Introducing Hydra – An Open Source Document Processing Framework (Joel Westberg, Findwise) | How SolrCloud Changes the User Experience In a Sharded Environment (Erick Erickson, Committer, Lucid Imagination) | Delivering on the Promise of Big Data at the “Tactical Edge” (Wes Caldwell, Chief Architect, ISS, Inc.)
4:00 PM – 5:15 PM | Stump The Chump
6:30 PM – 9:30 PM | Conference Party at Museum of Science
Opening Remarks – What’s New at Lucid Imagination
Opening by Lucid's CEO, Paul Doscher
(he has served as CEO, vice president, and so on at various companies)
Very cool
Lucene/Solr on Windows: Azure: Microsoft and OSS Working Together
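A talk about how Microsoft is going to become more open
- Windows Live, Office, Bing, Skype and the like are open
- Windows gets along with the following -> Linux, Solr, Hadoop, Firefox, Drupal, Java, PHP, node.js…
The message was roughly: Microsoft products get along this well with open source, so please use them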
Search is Not Enough: Using Solr for Analytics
Steve Kearns of Basis Technology gave this talk.
Basis helps enterprises with information retrieval;
there is an intro video here -> http://www.basistech.com
Turn non-structured (or unstructured) content into structured content.
Analysis comes at the following levels:
- document level analysis
  - language identification
  - summarization
  - categorization
- sub-document level analysis
  - segmentation (the example shown was splitting 東京ルパン上映時間 into 東京, ルパン, 上映, 時間...)
  - lemmatization
  - stemming
  - entity extraction
  - facet relationship
  - sentiment (sentence, paragraph, entity, emotion, etc.)
- cross document level analysis
  - near duplicated documents
  - document clustering
  - co-reference resolution
When combining Solr with analytics, there seem to be the following three patterns
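- Make Analyzer: build an Analyzer that has a CharFilter, a Tokenizer, and TokenFilter(s), and set it on a FieldType in schema.xml. You can set separate ones for indexing and for querying. In this case, however, you cannot access the document itself.
- Update Request Processor: customize the RequestHandler and set it in solrconfig.xml. It is called before the Analyzer, so you can send the document off somewhere for processing and index the output. In this case you have full access to the document.
- preprocessor: do all of your analysis up front and index as the final step. In this case you may end up having to do yourself some of the things Solr would otherwise do for you.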
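To picture the first pattern, here is a toy model of an analysis chain. It is plain Python, not the Lucene API: every function name below (`strip_html`, `whitespace_tokenize`, `lowercase`, `drop_stopwords`, `make_analyzer`) is made up for illustration. It only shows the shape of the chain (char filter -> tokenizer -> token filters), that index-time and query-time chains can differ, and that the chain only ever sees one field's text rather than the whole document.

```python
import re

def strip_html(text):
    """Char filter: rewrites the raw character stream before tokenizing."""
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenize(text):
    """Tokenizer: turns the character stream into tokens."""
    return text.split()

def lowercase(tokens):
    """Token filter: normalizes each token."""
    return [t.lower() for t in tokens]

def drop_stopwords(tokens, stopwords=frozenset({"the", "a", "of"})):
    """Token filter: removes tokens that carry little meaning."""
    return [t for t in tokens if t not in stopwords]

def make_analyzer(char_filters, tokenizer, token_filters):
    """Compose the three kinds of components into one analyze() function."""
    def analyze(text):
        for char_filter in char_filters:
            text = char_filter(text)
        tokens = tokenizer(text)
        for token_filter in token_filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze

# Separate chains for index time and query time, as with a FieldType.
index_analyzer = make_analyzer([strip_html], whitespace_tokenize, [lowercase, drop_stopwords])
query_analyzer = make_analyzer([], whitespace_tokenize, [lowercase])

print(index_analyzer("<p>The Art of Search</p>"))  # ['art', 'search']
print(query_analyzer("Art Search"))                # ['art', 'search']
```

Both chains end up with the same tokens, which is what lets the query match the indexed text.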
Indexing Wikipedia as a Benchmark of Single Machine Performance Limits
Exactly what the title says
- Parsed MediaWiki using JWPL (Java Wikipedia Library)
- Used DataImportHandler, so the setup is defined in data-config.xml
- On a MacBook, on Linux, and on some cloud, loaded 33 GB / 12 million documents each and ran searches
- On all of them, loading took roughly 15 h and searches took 100-1000 ms
The benchmark results themselves don't matter much.
As I understood it, the point was that the method is simple, so you can use it as a baseline for comparing the various environments you have.
Everything can be pulled from GitHub.
-> https://github.com/paddymul/wikipedia_solr
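If you want to turn this into your own baseline, the measuring side can be very small. The sketch below is my own assumption of what such a harness might look like, not code from the talk or from the repository above: `fake_search` is a hypothetical stand-in for a real Solr request, and the query list is invented.

```python
import statistics
import time

def time_queries(search_fn, queries, repeats=3):
    """Run each query `repeats` times and return per-call latencies in ms."""
    latencies_ms = []
    for query in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            search_fn(query)
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms

def summarize(latencies_ms):
    """Reduce raw latencies to a few numbers that are easy to compare."""
    ordered = sorted(latencies_ms)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "count": len(ordered),
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
        "max_ms": ordered[-1],
    }

def fake_search(query):
    """Hypothetical stand-in; replace with a call to your search server."""
    return [word for word in query.split() if word]

if __name__ == "__main__":
    queries = ["lucene", "solr cloud", "wikipedia dump"]
    print(summarize(time_queries(fake_search, queries)))
```

Running the same query list against each environment and comparing the summaries gives the kind of rough baseline the talk was after.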
Japanese Linguistics in Lucene and Solr
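A talk on Japanese search by Christian, who recently became a Lucene/Solr committer.
- What kind of language Japanese is (words are not separated by spaces)
- There are two main approaches to indexing Japanese
  - n-gramming (the one that cuts text every n characters): meaning is not preserved and the semantics can even change; lots of search noise and a large index
  - morphological analysis
- Then on to kuromoji: it picks the shortest path out of a huge lattice
- The quick way to try Japanese analysis: Lucene/Solr 3.6 and later ship with a decent default configuration
- Lucene's JapaneseAnalyzer, or Solr's field type text_ja. The text_ja chain is JapaneseTokenizer, JapaneseBaseFormFilter, JapanesePartOfSpeechStopFilter, CJKWidthFilter, StopFilter…
- Then the usual examples: 関西国際空港 (Kansai International Airport), マネージャー -> マネージャ, 買う / 買わない / 買います...
- Japanese-related Solr 4 topics: improvements to JapaneseTokenizer and an added spellchecker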
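To make the two approaches concrete, here is a toy comparison. This is not kuromoji: the dictionary and all costs below are invented, and the sketch uses word costs only, with no part-of-speech connection costs. It just shows bigram splitting next to picking the cheapest path through a lattice of dictionary matches.

```python
def bigrams(text):
    """n-gram approach with n=2: every overlapping pair of characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

# Toy dictionary: surface form -> cost (lower is better). All values invented.
TOY_COSTS = {
    "関西": 3,
    "国際": 3,
    "空港": 3,
    "関西国際空港": 12,
}
UNKNOWN_CHAR_COST = 10  # single-character fallback keeps every offset reachable

def segment(text, costs=TOY_COSTS):
    """Return the token sequence on the cheapest path through the lattice."""
    n = len(text)
    best = [float("inf")] * (n + 1)  # best[i]: cheapest cost to reach offset i
    back = [0] * (n + 1)             # back[i]: start offset of the last token
    best[0] = 0
    for start in range(n):
        for end in range(start + 1, n + 1):
            word = text[start:end]
            if word in costs:
                cost = costs[word]
            elif end == start + 1:
                cost = UNKNOWN_CHAR_COST
            else:
                continue  # not a lattice edge
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    tokens = []
    pos = n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    return tokens[::-1]

print(bigrams("関西国際空港"))  # ['関西', '西国', '国際', '際空', '空港']
print(segment("関西国際空港"))  # ['関西', '国際', '空港']
```

The bigram output contains tokens such as 西国 and 際空 that straddle word boundaries, which is where the search noise comes from. The lattice version only emits dictionary words.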
Using the LucidWorks REST API to Support User-Configurable Big Data Search Experiences
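The LucidWorks REST API, and Kitenga's ZettaVox and ZettaSearch
Intro video -> http://www.youtube.com/watch?v=HM0SUuaHYqc&feature=youtu.be
- LucidWorks REST API: a JSON-based REST API for managing system settings, schemas, and so on (admin tasks)
- ZettaVox: instead of writing MapReduce, you define workflow pipelines, which makes it easy to mine and analyze large amounts of data
- ZettaSearch: automatically suggests metadata based on analysis from various angles, and lets you pick the pieces you like to assemble business solutions (?)
- Looks like a rich interface for getting along with Hadoop and Solr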
Introducing Hydra – An Open Source Document Processing Framework
An introduction to Hydra, a document-processing framework built by Findwise
Attach metadata to unstructured data to make it richer
- language detection
- sentiment analysis
- headline extraction
- regular expression matching & extraction
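You could use Solr's DataImportHandler for this sort of thing, but then you need a Solr expert.
Also, pipelines so far have been sequential (doc -> □□□□□ -> solr),
so if one part goes wrong the whole thing fails.
For example, Tika can eat up the JVM and die, so it needs a restart...
That is where Hydra comes in.
The picture is that each processing step (called a stage) is scattered around inside a cloud.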
- main design objectives
- scalable:the central repository & worker nodes can scale horizontally
- distributed:any processing node can work on any document
- fail-safe:node down will not affect the documents in the pipeline
- robust:all stages run in separate JVMs
- easy to use/configure:we can debug stages from IDE against actual data
The architecture consists of the core, an admin interface, stages, and MongoDB.
Hadoop's scalability is impressive, but... Hydra is more real time.
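A toy model helped me picture the difference from a sequential pipeline. To be clear, this is not Hydra's API: `Repository`, `run_stage`, and the stage functions are names I made up, and everything runs in one Python process, whereas the talk described stages in separate JVMs around a MongoDB store. The only idea it keeps is that each stage independently pulls any document it has not processed yet from a central repository, so one failing stage does not block the others.

```python
class Repository:
    """Stand-in for the central document store."""

    def __init__(self, docs):
        # Track, per document, which stages have already handled it.
        self.docs = [dict(doc, _done=set()) for doc in docs]

    def fetch_for(self, stage_name):
        """Return any document this stage has not touched yet, or None."""
        for doc in self.docs:
            if stage_name not in doc["_done"]:
                return doc
        return None

def detect_language(doc):
    """Crude language guess: any CJK-range character means Japanese."""
    doc["lang"] = "ja" if any(ord(c) > 0x3000 for c in doc["text"]) else "en"

def extract_headline(doc):
    """Use the text before the first period as the headline."""
    doc["headline"] = doc["text"].split(".")[0]

def flaky_stage(doc):
    """Simulates a stage that always crashes."""
    raise RuntimeError("simulated crash")

def run_stage(repo, name, func):
    """Drain the repository for one stage and return its failure count."""
    failures = 0
    while True:
        doc = repo.fetch_for(name)
        if doc is None:
            return failures
        try:
            func(doc)
        except Exception:
            failures += 1  # the document stays available to other stages
        doc["_done"].add(name)

repo = Repository([{"text": "Hydra talk. Notes follow."}, {"text": "日本語のメモ"}])
# Stage order does not matter; each stage only looks at its own bookkeeping.
for stage_name, stage_func in [("headline", extract_headline),
                               ("crashy", flaky_stage),
                               ("lang", detect_language)]:
    run_stage(repo, stage_name, stage_func)

print([(doc["lang"], doc["headline"]) for doc in repo.docs])
# [('en', 'Hydra talk'), ('ja', '日本語のメモ')]
```

Even with the crashing stage in the middle, both documents still get their language and headline.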
Stump The Chump
An event where committers, led by Chris Hostetter, answer question after question from the audience.
Ask a good question and you get a copy of Lucene in Action (http://www.amazon.com/Lucene-Action-Erik-Hatcher/dp/1932394281) (^o^)/
But... I already have it at home and didn't want to carry something that heavy back on the plane...
Conference Party at Museum of Science
They rented out the museum for the party.
After wandering around for a while, I ate and drank and said hello to Yonik Seeley.
Day 2
9:00 AM – 9:40 AM | Enabling Scalable Search, Discovery and Analytics with Solr,Mahout and Hadoop (Grant Ingersoll, Chief Scientist, Lucid Imagination)
9:40 AM – 10:15 AM | Apache Hadoop:Now, Next and Beyond (Ari Zilka, Chief Product Officer, Hortonworks)
Time | LUCENE/SOLR IN ACTION TRACK | TECHNICAL DEEP DIVE TRACK | BIG DATA TRACK
10:45 AM – 11:25 AM | Building a Real-time Solr-powered Recommendation Engine (Trey Grainger, Search Tech Development Mgr, CareerBuilder) | Is Your Index Reader Really Atomic or Maybe Slow? (Uwe Schindler, SD DataSolutions GmbH) | Solr 4: The SolrCloud Architecture (Mark Miller, Lucid Imagination)
1:15 PM – 1:55 PM | Television News Search and Analysis with Lucene/Solr (Kai Chan, Instructional Technology and Database Developer, Social Sciences Computing UCLA) | Challenges in Maintaining a High Performance Search Engine Written in Java (Simon Willnauer, Apache Lucene) | Indexing Big Data on Amazon AWS (Scott Stults, Solutions Architect, Open Source Connections)
2:05 PM – 2:45 PM | Solr, Lucene and Hadoop @ Etsy (David Giffin, Software Engineer, Etsy) | Updateable Fields in Lucene and other Codec Applications (Andrzej Bialecki, Lucid Imagination) | The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics (M.C. Srivas, CTO & Founder, MapR)
2:55 PM – 3:35 PM | How to Access Your Library Book Collections Using Solr (Engy Ali, Software Project Manager, The Library of Alexandria) | Things Made Easy: One Click CMS Integration with Solr & Drupal (Peter Wolanin, Acquia, Inc) | How is the Government Spending Your Money? How GCE is Using Lucene and the GCE Big Data Cloud (Seshu Simhadri, CTO, Global Computer Enterprises)
4:00 PM – 4:40 PM | How to Gain Greater Business Intelligence from Lucene/Solr (Patrick Beaucamp, Bpm-Conseil) | Search with Polygons: Another Approach to Solr Geospatial Search (Andrew Urquhart, Principal Systems Engineer, Raytheon) | Big Data Meets Metadata – Analyzing Large Data Sets (Jeremy Bentley, CEO, Smartlogic)
Enabling Scalable Search, Discovery and Analytics with Solr,Mahout and Hadoop
SDA: Search, Discovery, and Analytics
-> users need real-time, ad hoc access to content, so batch processing isn't enough
What do you need for SDA?
- fast, efficient, scalable search
- large scale, cost effective storage
- large scale processing power
- NLP and machine learning tools that scale to enhance discovery and analysis
A handy platform for doing all of this is LucidWorks Big Data
-> http://www.lucidimagination.com/products/lucidworks-search-platform/lucidworks-big-data
Through a JSON-based RESTful API, you can do the following in real time and ad hoc:
search & indexing, workflows, admin, analytics, machine learning, proxy?...
Components:
- Lucene/Solr 4.0-dev
  - Sharded with SolrCloud
  - 1 second soft commits for NRT updates
  - 1 minute hard commits (without searcher reopen)
- Restlet2.1 (SDAEngine)
- Hadoop
  - map-reduce jobs for ETL and bulk indexing into SolrCloud sharded system
  - leverage pig and custom MR jobs
- Mahout
  - K-means clustering
Apache Hadoop:Now, Next and Beyond
Hortonworks on Hadoop's present, its future, and beyond...
- Hadoop Now
  - stable
- Hadoop Next: Hadoop2
  - high availability
  - next-gen MapReduce & HDFS
  - YARN layer is new and APIs are still evolving
- Hadoop Beyond
  - integrate w/ecosystem
  - next-gen data architecture
Hortonworks' vision & role |
- make hadoop easy to use and consume
- make hadoop an enterprise viable data platform
- provide open APIs and data services
- enable ecosystem at each layer of the data stack
Apparently half of the world's data will be processed by Hadoop by the end of 2015
Building a Real-time Solr-powered Recommendation Engine
By Trey Grainger, co-author of Solr in Action (coming soon)
Lucene isn’t a text search engine, but a token matching engine
various recommendations |
- content based
  - attribute based
    - set fields (jobtitle, salary, place…)
    - using boost for each field in solr select query
  - hierarchical
    - healthcare//nursing//transplant…
    - educator//postsecondary//nursing…
  - textual similarity
    - MoreLikeThisRequestHandler / SearchHandler are a good example
  - concept based
    - create a taxonomy/dictionary to define concepts
    - then manually tag documents or use classification system
  - geography and recommendations
    - Solr’s geodist()
- behavioral based (they did clever things with existing Solr fields and queries, which was impressive... I hope the slides get published)
  - collaborative filtering
    - find similar users who like the same documents
    - search for docs liked by those similar users
  - comparison with mahout – easier & real time
- hybrid approaches
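The two collaborative filtering steps above are easy to try on a tiny data set. The talk did this with Solr fields and queries; the sketch below is not that implementation, just the same two steps in plain Python over an in-memory dictionary. The users, job IDs, and the overlap-count weighting are all assumptions of mine.

```python
from collections import Counter

# Hypothetical data: user -> set of job listings they liked.
LIKES = {
    "alice": {"job1", "job2", "job3"},
    "bob": {"job2", "job3", "job4"},
    "carol": {"job3", "job5"},
    "dave": {"job9"},
}

def similar_users(user, likes):
    """Step 1: users who like some of the same documents, with overlap counts."""
    mine = likes[user]
    overlap = Counter()
    for other, docs in likes.items():
        if other == user:
            continue
        shared = len(mine & docs)
        if shared:
            overlap[other] = shared
    return overlap

def recommend(user, likes, top_n=3):
    """Step 2: docs liked by those similar users, weighted by overlap."""
    scores = Counter()
    for other, weight in similar_users(user, likes).items():
        for doc in likes[other] - likes[user]:
            scores[doc] += weight
    # Highest score first; ties broken by document id for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc for doc, _ in ranked[:top_n]]

print(recommend("alice", LIKES))  # ['job4', 'job5']
```

bob shares two liked jobs with alice and carol shares one, so bob's job4 ranks above carol's job5. dave shares nothing and contributes no recommendations.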
important |
- Custom Scoring with Payloads
- Measuring Results Quality
- Understanding users (there is no right recommendation algorithm)
Challenges in Maintaining a High Performance Search Engine Written in Java
A talk about what it is like to develop Lucene
Solr, Lucene and Hadoop @ Etsy
History of Etsy (an e-commerce site) -> http://www.etsy.com/
- 2007: 1 million listings; single master PostgreSQL
- 2008: 2 million listings; single master PostgreSQL; master & 4 slaves Solr
- 2009: 4 million listings; single master PostgreSQL; master & 6 slaves Solr
- 2010: 7 million listings; single master PostgreSQL; master & 10 slaves Solr, custom import handler
- 2011: 10 million listings; sharded PostgreSQL; master & 24 slaves Solr
- 2012: add HBase indexing …?
Also Jenkins, Memcached, bitTornado (for replication), oozie (workflow engine for hadoop). They build their own tools and publish them ->
How is the Government Spending Your Money?
How GCE is Using Lucene and the GCE Big Data Cloud
Nothing flashy, but it felt like they made thoroughly sound architecture decisions. I'd like to see the slides again.
Big Data Meets Metadata – Analyzing Large Data Sets
So much innovation has occurred on the net,
but from 2001 to 2011 user satisfaction with search stayed flat!!! (about 50%)
Then...
- file management
- index management
- automation of 1 & 2
- new: content intelligence -> metadata, identifying, classifying, taxonomy/ontology