Lucene
From CSWiki
Conceptual Overview
Lucene is one of the most popular open-source search engine libraries. It consists of core APIs for indexing and searching text. Given a set of text files, Lucene can create indexes and let you search them with complex queries such as +title:Lucene -content:Search, search AND Lucene, and +search +code.
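To make the query examples above more concrete, here is a minimal sketch of how the + (required) and - (prohibited) prefixes decide whether a document matches. It is plain Java and does not use Lucene's QueryParser; the class name ClauseMatchSketch, its methods, and the sample document are hypothetical, and operators such as AND are not handled.

```java
import java.util.*;

// Hypothetical illustration class; it mimics only the + and - prefixes, not Lucene's QueryParser.
class ClauseMatchSketch {
    // True if the named field contains the term as a whole word (case-insensitive).
    static boolean contains(Map<String, String> doc, String field, String term) {
        String value = doc.getOrDefault(field, "").toLowerCase(Locale.ROOT);
        return Arrays.asList(value.split("[^a-z0-9]+")).contains(term.toLowerCase(Locale.ROOT));
    }

    // Evaluate whitespace-separated clauses of the form [+|-][field:]term.
    static boolean matches(String query, Map<String, String> doc, String defaultField) {
        boolean hasRequired = false;
        boolean anyOptionalHit = false;
        for (String clause : query.trim().split("\\s+")) {
            if (clause.isEmpty()) continue;
            char prefix = clause.charAt(0);
            String body = (prefix == '+' || prefix == '-') ? clause.substring(1) : clause;
            int colon = body.indexOf(':');
            String field = colon < 0 ? defaultField : body.substring(0, colon);
            String term = colon < 0 ? body : body.substring(colon + 1);
            boolean hit = contains(doc, field, term);
            if (prefix == '+') {
                hasRequired = true;
                if (!hit) return false;      // a required clause must match
            } else if (prefix == '-') {
                if (hit) return false;       // a prohibited clause must not match
            } else if (hit) {
                anyOptionalHit = true;       // an optional clause matched
            }
        }
        // With no required clause, at least one optional clause has to match.
        return hasRequired || anyOptionalHit;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("title", "Lucene in Action");
        doc.put("content", "Indexing text");
        System.out.println(matches("+title:Lucene -content:Search", doc, "content")); // true
        System.out.println(matches("+search +code", doc, "content"));                 // false
    }
}
```

The sketch treats a query made only of prohibited clauses as matching nothing, which is a simplifying assumption of this example.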
Indexing Text in Lucene
Search engines scan all of the data that needs to be searched and store it in a structure that allows efficient retrieval. The best-known such structure is the inverted index, which maps each term to the documents that contain it.
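The idea of an inverted index can be sketched in a few lines of plain Java. This is only an in-memory illustration under simple assumptions (tokens are runs of ASCII letters, documents are identified by integers); the class name InvertedIndexSketch and its methods are hypothetical and say nothing about how Lucene actually stores its index.

```java
import java.util.*;

// Hypothetical illustration class; not part of Lucene.
class InvertedIndexSketch {
    // term -> sorted set of ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Split the text on non-letters, lower-case it, and record the document id under each token.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Look up a single term; an unknown term yields an empty set.
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Lucene is a search library");
        index.add(2, "Search engines build an index");
        index.add(3, "Lucene builds an inverted index");
        System.out.println(index.lookup("lucene")); // [1, 3]
        System.out.println(index.lookup("index"));  // [2, 3]
    }
}
```

Because the lookup goes from term to documents, answering a query does not require rescanning the original text.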
Analyzing Text to Be Indexed
Lucene processes the text to be indexed with analyzers. Analyzers are used to tokenize text, extract relevant words, discard common words, stem the words (reduce them to their root form, so that bowling, bowler and bowls are all reduced to bowl), and perform any other desired processing before the text is stored in the index. Not every analyzer performs every step.
The common analyzers provided by Lucene are:
SimpleAnalyzer: Splits the string into words at non-letter characters and converts them to lower case.
StandardAnalyzer: Splits the string into words while recognizing acronyms, email addresses, host names, etc., converts them to lower case, and discards basic English stop words (such as a, an, the, to). It does not stem the words; stemming is added with a separate filter such as PorterStemFilter.
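The analysis steps described above (tokenize, lower-case, drop stop words, stem) can be sketched as a small pipeline in plain Java. This does not use or reproduce Lucene's Analyzer classes: the class name AnalyzerSketch is hypothetical, the stop-word list is just the four examples from the text, and the stemmer is a toy suffix stripper assumed for illustration, far simpler than a real stemming algorithm such as Porter's.

```java
import java.util.*;

// Hypothetical illustration class; it does not use or reproduce Lucene's Analyzer API.
class AnalyzerSketch {
    // Assumed stop-word list, taken from the examples in the text.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "to"));
    // Toy suffix list; a real stemmer has many more rules and exceptions.
    private static final String[] SUFFIXES = {"ing", "er", "s"};

    // Strip the first matching suffix, keeping at least three characters of the word.
    static String stem(String word) {
        for (String suffix : SUFFIXES) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Tokenize on non-alphanumerics, lower-case, drop stop words, then stem.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            terms.add(stem(token));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The bowler bowls to a bowling Team")); // [bowl, bowl, bowl, team]
    }
}
```

The same analysis has to be applied to query text at search time, otherwise a query for bowling would not find documents indexed under bowl.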
Searching Indexes
Once created, indexes can be searched with complex queries that specify the field and the term to search for. Documents that match the query are ranked by the number of times the terms occur in the document and by the number of documents that contain the terms: a term counts for more when it is frequent in a document and rare across the index. Lucene implements a ranking mechanism and gives us the flexibility to modify it if required.
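The two ranking signals mentioned above, term frequency in a document and the number of documents containing the term, can be combined as in the following sketch. It uses a textbook TF-IDF weight, tf * ln(1 + N/df), chosen here as an assumption for illustration; it is not Lucene's actual scoring formula, and the class name TfIdfSketch and its methods are hypothetical.

```java
import java.util.*;

// Hypothetical illustration class; a textbook TF-IDF score, not Lucene's scoring formula.
class TfIdfSketch {
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Score one document for a query: sum over query terms of tf * ln(1 + N / df).
    static double score(List<String> queryTerms, List<String> doc, List<List<String>> corpus) {
        double score = 0.0;
        for (String term : queryTerms) {
            int tf = Collections.frequency(doc, term);   // occurrences in this document
            int df = 0;                                   // documents containing the term
            for (List<String> d : corpus) {
                if (d.contains(term)) df++;
            }
            if (tf == 0 || df == 0) continue;
            score += tf * Math.log(1.0 + (double) corpus.size() / df);
        }
        return score;
    }

    // Return the positions of matching documents, best score first.
    static List<Integer> rank(String query, List<String> texts) {
        List<List<String>> corpus = new ArrayList<>();
        for (String text : texts) corpus.add(tokens(text));
        List<String> queryTerms = tokens(query);
        double[] scores = new double[texts.size()];
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < texts.size(); i++) {
            scores[i] = score(queryTerms, corpus.get(i), corpus);
            if (scores[i] > 0) hits.add(i);
        }
        hits.sort((a, b) -> Double.compare(scores[b], scores[a]));
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Lucene is a search library",
                "search engines index text",
                "cooking recipes");
        System.out.println(rank("lucene search", docs)); // [0, 1]
    }
}
```

In this example the first document ranks highest because it matches both query terms, and the rarer term lucene contributes more than search, which appears in two of the three documents.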
For more information on the Lucene index format, see Apache Lucene - Index File Formats.
How accessed/used
You need lucene-core-2.0.0.jar on your classpath to use Lucene. To get it, download Lucene from the Apache Lucene project site.

