Lucene Revolution 2012


Notes from my visit to Lucene Revolution.

Overview

agenda
key
  • news of Lucene/Solr4.0 (SolrCloud, NRT, Join…)
  • big data analysis
  • collaboration with other OSS (Hadoop family, CMS, admin…)
  • vendor solutions (LucidWorks, MapR, Windows…)
slide & video

 

Day 1

8:30 AM – 9:00 AM Opening Remarks – What’s New at Lucid Imagination
Paul Doscher, CEO Lucid Imagination
9:00 AM – 9:45 AM Lucene/Solr on Windows: Azure: Microsoft and OSS Working Together
Gianugo Rabellino, Microsoft
Parallel tracks: Lucene/Solr in Action, Technical Deep Dive, Big Data
10:15 AM – 11:00 AM
  • [Lucene/Solr in Action] Building Query Auto-Completion Systems with Lucene 4.0 (Sudarshan Gaikaiwari, Software Engineer, Yelp)
  • [Technical Deep Dive] Search is Not Enough: Using Solr for Analytics (Steve Kearns, Basis Technology)
  • [Big Data] Big Search with Big Data Principles (Eric Pugh, Principal, OpenSource Connections)
11:10 AM – 11:50 AM
  • [Lucene/Solr in Action] NetDocuments- Journey from FAST to Solr (David Hamson & Mou Nandi, NetDocuments)
  • [Technical Deep Dive] Integrating Lucene into a Transactional XML Database (Petr Pleshachkov, EMC)
  • [Big Data] Indexing Wikipedia as a Benchmark of Single Machine Performance Limits (Paddy Mullen, Independent Contractor)
1:15 PM – 1:55 PM
  • [Lucene/Solr in Action] Japanese Linguistics in Lucene and Solr (Christian Moen, Founder and CEO, Atilika Inc.)
  • [Technical Deep Dive] Grouping and Joining in Lucene/Solr (Martijn van Groningen, SearchWorkings)
  • [Big Data] Using Lucene/Solr to Surface the Big Data of Social Media (Glenn Engstrand, Sr. Software Engineer, Zoosk, Inc)
2:05 PM – 2:45 PM
  • [Lucene/Solr in Action] Using Lucene/Solr to Build CiteSeerX and Friends (C. Lee Giles, Professor, Pennsylvania State University)
  • [Technical Deep Dive] Automata Invasion (Michael McCandless, IBM & Robert Muir, Lucid Imagination)
  • [Big Data] Using the LucidWorks REST API to Support User-Configurable Big Data Search Experiences (Mark Davis, CTO, Kitenga)
2:55 PM – 3:35 PM
  • [Lucene/Solr in Action] Introducing Hydra – An Open Source Document Processing Framework (Joel Westberg, Findwise)
  • [Technical Deep Dive] How SolrCloud Changes the User Experience In a Sharded Environment (Erick Erickson, Committer, Lucid Imagination)
  • [Big Data] Delivering on the Promise of Big Data at the “Tactical Edge” (Wes Caldwell, Chief Architect, ISS, Inc.)
4:00 PM – 5:15 PM Stump The Chump
6:30 PM – 9:30 PM Conference Party at Museum of Science

 

Opening Remarks – What’s New at Lucid Imagination

 

Lucene/Solr on Windows: Azure: Microsoft and OSS Working Together

  • Products such as Windows Live, Office, Bing, and Skype have been opened up
  • Windows gets along well with the following -> Linux, Solr, Hadoop, Firefox, Drupal, Java, PHP, node.js…

 

Search is Not Enough: Using Solr for Analytics

Steve Kearns of Basis Technology gave this talk.
Basis helps enterprises with information retrieval;
there is an intro video here -> http://www.basistech.com

  • document level analysis
      – language identification
      – summarization
      – categorization
  • sub-document level analysis
      – segmentation (the example shown was splitting 東京ルパン上映時間 into 東京, ルパン, 上映, 時間)
      – lemmatization
      – stemming
      – entity extraction
      – facet relationship
      – sentiment (sentence, paragraph, entity, emotion, etc.)
  • cross document level analysis
      – near-duplicate documents
      – document clustering
      – co-reference resolution

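  • Make Analyzer: build an Analyzer made of a CharFilter, a Tokenizer, and TokenFilter(s), and set it on a FieldType in schema.xml. You can set separate ones for indexing and for querying. Note that with this approach you cannot access the document itself.
  • Update Request Processor: customize the RequestHandler and register it in solrconfig.xml. It is called before the Analyzer, so you can send the document somewhere for processing and index the output. With this approach you have full access to the document.
  • preprocessor: run all your analysis up front and do the indexing as the final step. With this approach you may end up having to do yourself some of the things Solr would otherwise do for you.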
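
To make the first option more concrete, here is a minimal sketch of the CharFilter -> Tokenizer -> TokenFilter(s) flow, written in plain Python rather than against the real Lucene classes. All function names (`strip_html`, `whitespace_tokenizer`, `lowercase_filter`, `stop_filter`, `analyze`) and the stop-word list are made up for illustration; in Solr the equivalent chain would be declared on a FieldType in schema.xml.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical stand-ins for the three kinds of analysis components.
CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[Iterable[str]], List[str]]


def strip_html(text: str) -> str:
    """Char filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer: split the (already char-filtered) text on whitespace."""
    return text.split()


def lowercase_filter(tokens: Iterable[str]) -> List[str]:
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def stop_filter(tokens: Iterable[str],
                stop_words: frozenset = frozenset({"the", "a", "of"})) -> List[str]:
    """Token filter: drop tokens found in an illustrative stop-word list."""
    return [t for t in tokens if t not in stop_words]


def analyze(text: str, char_filters: List[CharFilter], tokenizer: Tokenizer,
            token_filters: List[TokenFilter]) -> List[str]:
    """Run char filters, then the tokenizer, then token filters, in that order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


# Separate chains for index time and query time, as the talk mentioned.
index_chain = dict(char_filters=[strip_html], tokenizer=whitespace_tokenizer,
                   token_filters=[lowercase_filter, stop_filter])
query_chain = dict(char_filters=[], tokenizer=whitespace_tokenizer,
                   token_filters=[lowercase_filter])

print(analyze("<b>The</b> Art of Search", **index_chain))  # ['art', 'search']
print(analyze("The Search", **query_chain))                # ['the', 'search']
```

Note that `analyze` only ever sees one field's text, which mirrors the point above that an Analyzer has no access to the document as a whole.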

 

Indexing Wikipedia as a Benchmark of Single Machine Performance Limits

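  • Parsed MediaWiki using JWPL (Java Wikipedia Library)
  • Used DataImportHandler, so the import was defined in data-config.xml
  • Loaded 33 GB / 12 million documents into each of a MacBook, a Linux machine, and some cloud service, then ran searches
  • On all of them, loading took roughly 15 hours and searches took 100-1000 ms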
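
As a rough back-of-the-envelope check of what those numbers mean per second, here is a small calculation. It assumes the "roughly 15 hours" is exactly 15 hours and that 1 GB = 1000 MB; `indexing_throughput` is just a helper name I made up.

```python
def indexing_throughput(docs: int, size_gb: float, hours: float):
    """Return (documents per second, megabytes per second) for a bulk load."""
    seconds = hours * 3600
    docs_per_sec = docs / seconds
    mb_per_sec = size_gb * 1000 / seconds  # assumes decimal gigabytes
    return docs_per_sec, mb_per_sec


docs_per_sec, mb_per_sec = indexing_throughput(12_000_000, 33, 15)
print(f"{docs_per_sec:.0f} docs/s, {mb_per_sec:.2f} MB/s")  # 222 docs/s, 0.61 MB/s
```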

 

Japanese Linguistics in Lucene and Solr

A talk on Japanese search by Christian, who recently became a Lucene/Solr committer.

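  • What kind of language Japanese is (words are not separated by spaces)
  • There are two main approaches to indexing Japanese
      – n-gramming (splitting the text every n characters): meaning is not preserved and the semantics can even change; search noise is high and the index gets large
      – morphological analysis
  • Then on to kuromoji: it picks the shortest path out of a huge lattice
  • The quickest way to try Japanese analysis: Lucene/Solr 3.6 and later ship with sensible settings by default
  • Use Lucene's JapaneseAnalyzer, or Solr's field type text_ja. The text_ja chain is JapaneseTokenizer, JapaneseBaseFormFilter, JapanesePartOfSpeechStopFilter, CJKWidthFilter, StopFilter…
  • Plus the usual examples: 関西国際空港 (Kansai International Airport), マネージャー -> マネージャ, 買う・買わない・買います (inflections of "to buy")…
  • Solr 4 topics around Japanese: improvements to JapaneseTokenizer, and a spell checker was added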
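
To illustrate the "search noise" problem with n-gramming mentioned above, here is a small sketch using character bigrams. It is plain Python, not Lucene code; `char_ngrams` and `ngram_match` are names I made up, and the matching is a simplified set lookup that ignores token positions.

```python
from typing import List


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Return overlapping character n-grams (the whole text if it is no longer than n)."""
    if len(text) <= n:
        return [text]
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_match(query: str, document: str, n: int = 2) -> bool:
    """True when every n-gram of the query also occurs among the document's n-grams."""
    doc_grams = set(char_ngrams(document, n))
    return all(gram in doc_grams for gram in char_ngrams(query, n))


print(char_ngrams("東京都"))          # ['東京', '京都']
print(ngram_match("京都", "東京都"))  # True
```

A query for 京都 (Kyoto) matches a document that only mentions 東京都 (Tokyo Metropolis), because the bigram 京都 happens to straddle the word boundary. A morphological analyzer would typically segment 東京都 into 東京 and 都, so this false hit would not occur.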

 

Using the LucidWorks REST API to Support User-Configurable Big Data Search Experiences

The LucidWorks REST API, and Kitenga's ZettaVox / ZettaSearch.
Intro video -> http://www.youtube.com/watch?v=HM0SUuaHYqc&feature=youtu.be

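  • LucidWorks REST API: a RESTful API using JSON that lets you manipulate system settings, schemas, and so on (admin tasks)
  • ZettaVox: instead of writing MapReduce, you define workflow pipelines, which makes it easy to mine and analyze large volumes of data
  • ZettaSearch: automatically presents metadata from analyses along various dimensions, and lets you pick the pieces you like and assemble a business solution (?)
  • The rich interface looks like a good way to get comfortable with Hadoop and Solr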

 

Introducing Hydra – An Open Source Document Processing Framework

An introduction to Hydra, a document-processing framework built by Findwise.

  • language detection
  • sentiment analysis
  • headline extraction
  • regular expression matching & extraction

    main design objectives

  • scalable: the central repository & worker nodes can scale horizontally
  • distributed: any processing node can work on any document
  • fail-safe: a node going down will not affect the documents in the pipeline
  • robust: all stages run in separate JVMs
  • easy to use/configure: we can debug stages from an IDE against actual data
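
To picture the general idea of stages working on documents from a shared repository, here is a toy in-process sketch. It is not Hydra's API: Hydra is a Java framework whose stages run in separate JVMs, whereas here a plain Python list stands in for the central repository and `regex_extraction_stage`, `headline_stage`, and `run_pipeline` are hypothetical names.

```python
import re
from typing import Callable, Dict, List

Document = Dict[str, object]
Stage = Callable[[Document], Document]


def regex_extraction_stage(field: str, pattern: str, out_field: str) -> Stage:
    """Build a stage that stores all regex matches from `field` in `out_field`."""
    compiled = re.compile(pattern)

    def stage(doc: Document) -> Document:
        doc[out_field] = compiled.findall(str(doc.get(field, "")))
        return doc

    return stage


def headline_stage(doc: Document) -> Document:
    """Naive headline extraction: use the first non-empty line of the body."""
    lines = [line.strip() for line in str(doc.get("body", "")).splitlines() if line.strip()]
    doc["headline"] = lines[0] if lines else ""
    return doc


def run_pipeline(repository: List[Document], stages: List[Stage]) -> List[Document]:
    """Apply every stage to every document; a failing stage is recorded, not fatal."""
    for doc in repository:
        for stage in stages:
            try:
                stage(doc)
            except Exception as exc:  # keep the document moving through the pipeline
                doc.setdefault("errors", []).append(repr(exc))
    return repository


docs = [{"body": "Lucene Revolution\nContact: a@example.com"}]
stages = [headline_stage, regex_extraction_stage("body", r"[\w.]+@[\w.]+", "emails")]
result = run_pipeline(docs, stages)
print(result[0]["headline"], result[0]["emails"])  # Lucene Revolution ['a@example.com']
```

Because each stage only reads and writes fields on a document, any worker could in principle pick up any document, which is the property the "distributed" objective describes.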

 

Stump The Chump

 

Conference Party at Museum of Science

 

Day 2

9:00 AM – 9:40 AM Enabling Scalable Search, Discovery and Analytics with Solr,Mahout and Hadoop
Grant Ingersoll, Chief Scientist, Lucid Imagination
9:40 AM – 10:15 AM Apache Hadoop:Now, Next and Beyond
Ari Zilka, Chief Product Officer, Hortonworks
Parallel tracks: Lucene/Solr in Action, Technical Deep Dive, Big Data
10:45 AM – 11:25 AM
  • [Lucene/Solr in Action] Building a Real-time Solr-powered Recommendation Engine (Trey Grainger, Search Tech Development Mgr, CareerBuilder)
  • [Technical Deep Dive] Is Your Index Reader Really Atomic or Maybe Slow? (Uwe Schindler, SD DataSolutions GmbH)
  • [Big Data] Solr 4: The SolrCloud Architecture (Mark Miller, Lucid Imagination)
1:15 PM – 1:55 PM
  • [Lucene/Solr in Action] Television News Search and Analysis with Lucene/Solr (Kai Chan, Instructional Technology and Database Developer, Social Sciences Computing UCLA)
  • [Technical Deep Dive] Challenges in Maintaining a High Performance Search Engine Written in Java (Simon Willnauer, Apache Lucene)
  • [Big Data] Indexing Big Data on Amazon AWS (Scott Stults, Solutions Architect, Open Source Connections)
2:05 PM – 2:45 PM
  • [Lucene/Solr in Action] Solr, Lucene and Hadoop @ Etsy (David Giffin, Software Engineer, Etsy)
  • [Technical Deep Dive] Updateable Fields in Lucene and other Codec Applications (Andrzej Bialecki, Lucid Imagination)
  • [Big Data] The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics (M.C. Srivas, CTO & Founder, MapR)
2:55 PM – 3:35 PM
  • [Lucene/Solr in Action] How to Access Your Library Book Collections Using Solr (Engy Ali, Software Project Manager, The Library of Alexandria)
  • [Technical Deep Dive] Things Made Easy: One Click CMS Integration with Solr & Drupal (Peter Wolanin, Acquia, Inc)
  • [Big Data] How is the Government Spending Your Money? How GCE is Using Lucene and the GCE Big Data Cloud (Seshu Simhadri, CTO, Global Computer Enterprises)
4:00 PM – 4:40 PM
  • [Lucene/Solr in Action] How to Gain Greater Business Intelligence from Lucene/Solr (Patrick Beaucamp, Bpm-Conseil)
  • [Technical Deep Dive] Search with Polygons: Another Approach to Solr Geospatial Search (Andrew Urquhart, Principal Systems Engineer, Raytheon)
  • [Big Data] Big Data Meets Metadata – Analyzing Large Data Sets (Jeremy Bentley, CEO, Smartlogic)

 

Enabling Scalable Search, Discovery and Analytics with Solr,Mahout and Hadoop

  • fast, efficient, scalable search
  • large scale, cost effective storage
  • large scale processing power
  • NLP and machine learning tools that scale to enhance discovery and analysis

  • Lucene/Solr 4.0-dev
      – sharded with SolrCloud
      – 1 second soft commits for NRT updates
      – 1 minute hard commits (without searcher reopen)
  • Restlet 2.1 (SDAEngine)
  • Hadoop
      – map-reduce jobs for ETL and bulk indexing into the SolrCloud sharded system
      – leverage Pig and custom MR jobs
  • Mahout K-means clustering

 

Apache Hadoop:Now, Next and Beyond

Hortonworks on Hadoop's present, its future, and beyond…

  • Hadoop Now: stable
  • Hadoop Next: Hadoop 2
      – high availability
      – next-gen MapReduce & HDFS
      – YARN layer is new and APIs are still evolving
  • Hadoop Beyond
      – integrate w/ecosystem
      – next-gen data architecture

  • make hadoop easy to use and consume
  • make hadoop an enterprise viable data platform
  • provide open APIs and data services
  • enable ecosystem at each layer of the data stack

 

Building a Real-time Solr-powered Recommendation Engine

By Trey Grainger, co-author of Solr in Action (coming soon).
Lucene isn’t a text search engine, but a token matching engine.

  • content based
      – attribute based: set fields (jobtitle, salary, place…) and use a boost for each field in the Solr select query
      – hierarchical: healthcare//nursing//transplant…, educator//postsecondary//nursing…
      – textual similarity: MoreLikeThisRequestHandler / SearchHandler are a good example
      – concept based: create a taxonomy/dictionary to define concepts, then manually tag documents or use a classification system
      – geography and recommendations: Solr’s geodist()
  • behavioral based (it was impressive how cleverly this was done with ordinary Solr fields and queries… I hope the slides get published)
      – collaborative filtering: find similar users who like the same documents, then search for docs liked by those similar users
      – comparison with Mahout: easier & real time
  • hybrid approaches
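
The two collaborative filtering steps (find similar users, then collect what they liked) can be sketched as follows. This is only an illustration with an in-memory dictionary; the talk did this with Solr fields and queries, which I am not reproducing here. The `recommend` function, the overlap-count weighting, and the sample data are all my own assumptions.

```python
from collections import Counter
from typing import Dict, List, Set


def recommend(user: str, likes: Dict[str, Set[str]], top_n: int = 3) -> List[str]:
    """Recommend documents liked by users who share liked documents with `user`."""
    own = likes.get(user, set())

    # Step 1: similar users are those sharing at least one liked document,
    # weighted by how many liked documents they have in common.
    similarity = {
        other: len(own & docs)
        for other, docs in likes.items()
        if other != user and own & docs
    }

    # Step 2: score documents those users liked that `user` has not liked yet.
    scores: Counter = Counter()
    for other, weight in similarity.items():
        for doc in likes[other] - own:
            scores[doc] += weight

    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc for doc, _ in ranked[:top_n]]


likes = {
    "alice": {"job1", "job2"},
    "bob": {"job1", "job2", "job3"},
    "carol": {"job2", "job4"},
    "dave": {"job5"},
}
print(recommend("alice", likes))  # ['job3', 'job4']
```

Here bob shares two liked documents with alice and carol shares one, so bob's job3 ranks above carol's job4, and dave (no overlap) contributes nothing.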

  • Custom Scoring with Payloads
  • Measuring Results Quality
  • Understanding users (there is no right recommendation algorithm)

 

Challenges in Maintaining a High Performance Search Engine Written in Java

 

Solr, Lucene and Hadoop @ Etsy

      • 2007: 1 million listings; PostgreSQL: single master
      • 2008: 2 million listings; PostgreSQL: single master; Solr: master & 4 slaves
      • 2009: 4 million listings; PostgreSQL: single master; Solr: master & 6 slaves
      • 2010: 7 million listings; PostgreSQL: single master; Solr: master & 10 slaves, custom import handler
      • 2011: 10 million listings; PostgreSQL: sharded; Solr: master & 24 slaves
      • 2012: add HBase indexing …?

They also use Jenkins, Memcached, bitTornado (for replication), and oozie (workflow engine for Hadoop). They build their own tools and publish them ->

https://github.com/etsy

 

How is the Government Spending Your Money?
How GCE is Using Lucene and the GCE Big Data Cloud

 

Big Data Meets Metadata – Analyzing Large Data Sets

  1. file management
  2. index management
  3. automation of 1 & 2
  4. new: content intelligence -> metadata (identifying, classifying, taxonomy/ontology)

 
