Lucene

Conceptual Overview

Lucene is one of the most popular open source search engine libraries. Its core APIs support indexing and searching text. Given a set of text files, Lucene can build an index over them and answer complex queries against that index, such as +title:Lucene -content:Search, search AND Lucene, or +search +code. In this query syntax, + marks a term that must occur, - marks a term that must not occur, and field:term restricts the term to a named field.
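To make the meaning of a query such as +title:Lucene -content:Search concrete, here is a minimal Python sketch. It is not Lucene's query parser: the function names parse and matches and the sample documents are hypothetical, and it only handles whitespace-separated field:term clauses with an optional + (required) or - (prohibited) prefix.

```python
def parse(query):
    """Split a query like '+title:Lucene -content:Search' into clauses."""
    clauses = []
    for part in query.split():
        required = part.startswith("+")
        prohibited = part.startswith("-")
        # Drop the prefix, then split "field:term" at the first colon.
        field, _, term = part.lstrip("+-").partition(":")
        clauses.append((required, prohibited, field, term.lower()))
    return clauses


def matches(doc, clauses):
    """Return True if doc satisfies every required and prohibited clause."""
    for required, prohibited, field, term in clauses:
        present = term in doc.get(field, "").lower().split()
        if required and not present:
            return False
        if prohibited and present:
            return False
    return True


# Hypothetical documents, each a mapping from field name to text.
sample_docs = [
    {"title": "Lucene in Action", "content": "indexing text"},
    {"title": "Lucene basics", "content": "search and indexing"},
    {"title": "Other tools", "content": "search"},
]

clauses = parse("+title:Lucene -content:Search")
hits = [d["title"] for d in sample_docs if matches(d, clauses)]
print(hits)  # ['Lucene in Action']
```

Only the first document has "lucene" in its title without "search" in its content. A real engine would answer this from an index rather than by scanning every document.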

Indexing Text in Lucene

Search engines scan all of the data that needs to be searched and store it in a structure that allows efficient retrieval. The best-known such structure is the inverted index, which maps each term to the documents that contain it.
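The idea of an inverted index can be sketched in a few lines of Python. This is only an illustration of the data structure, not Lucene's on-disk format; the function name build_inverted_index and the sample documents are assumptions made for the example, and tokenization is a plain whitespace split.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical documents keyed by id.
docs = {
    1: "Lucene is a search library",
    2: "Search engines build an index",
    3: "Lucene builds an inverted index",
}

index = build_inverted_index(docs)
print(index["lucene"])  # [1, 3]
print(index["search"])  # [1, 2]
print(index["index"])   # [2, 3]
```

Looking up a term is now a single dictionary access instead of a scan over every document, which is what makes retrieval efficient.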

Analyzing Text to Be Indexed

Lucene processes the text to be indexed with analyzers. An analyzer can tokenize the text, extract the relevant words, discard common words, stem the words (reduce them to a root form, so that bowling, bowler and bowls are all reduced to bowl), and perform any other desired processing before the result is stored in the index. The common analyzers provided by Lucene are:

- SimpleAnalyzer: Splits the string into words at non-letter characters and converts them to lower case.

- StandardAnalyzer: Splits the string into words while recognizing acronyms, email addresses, host names, etc., converts them to lower case, and discards basic English stop words (a, an, the, to, and so on). StandardAnalyzer does not perform stemming.
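The analysis steps described above (tokenizing, lower-casing, dropping stop words, stemming) can be sketched as a small pipeline. This is a toy illustration, not one of Lucene's analyzers: the names tokenize, strip_suffix and analyze are hypothetical, the stop-word set holds only the four words named above, and the "stemmer" is a deliberately crude suffix stripper rather than a real stemming algorithm.

```python
import re

STOP_WORDS = {"a", "an", "the", "to"}  # only the stop words named in the text
SUFFIXES = ("ing", "er", "s")          # assumption: crude suffix list for the demo


def tokenize(text):
    """Lower-case the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def strip_suffix(word):
    """Remove the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Tokenize, drop stop words, then strip suffixes."""
    return [strip_suffix(t) for t in tokenize(text) if t not in STOP_WORDS]


print(analyze("The bowler bowls to a bowling pin"))
# ['bowl', 'bowl', 'bowl', 'pin']
```

Because the same analysis is normally applied to both the indexed text and the query, a search for bowling would match a document that only mentions bowls.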

Searching Indexes

Once created, an index can be searched with complex queries that specify the field and the term to search for. Documents that match the query are ranked by how often the query terms occur in the document and by how many documents in the index contain those terms. Lucene implements this ranking mechanism and gives us the flexibility to modify it if required.
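The two ranking signals mentioned above, how often a term occurs in a document and how many documents contain it, can be combined as in the following sketch. This is a generic TF-IDF style score chosen for illustration, not Lucene's actual scoring formula; the function name rank, the weighting log(1 + N/df) and the sample documents are all assumptions.

```python
import math
from collections import Counter


def rank(query_terms, docs):
    """Return (doc_id, score) pairs, best match first; non-matching docs are omitted."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)  # term frequency within this document
        score = sum(
            tf[t] * math.log(1 + n_docs / df[t])  # rarer terms weigh more
            for t in query_terms
            if tf[t]
        )
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical documents keyed by id.
ranked_docs = {
    1: "lucene search lucene index",
    2: "search engines and search code",
    3: "cooking recipes",
}

ranking = rank(["lucene", "search"], ranked_docs)
print([doc_id for doc_id, _ in ranking])  # [1, 2]
```

Document 1 ranks first because it contains the rarer term lucene twice, document 2 matches only the more common term search, and document 3 matches nothing and is left out.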

For more information on the Lucene index format, see Apache Lucene - Index File Formats.

How accessed/used

To use Lucene, you need lucene-core-2.0.0.jar on your classpath. The jar is available from the Lucene download page.

- Lucene Home Page

- Lucene API