A Simple File Search with Lucene

1. Overview

Apache Lucene is a full-text search engine library that can be used from various programming languages. To get started with Lucene, please refer to our introductory article here.

In this quick article, we’ll index a text file and search sample Strings and text snippets within that file.

2. Maven Setup

Let’s add necessary dependencies first:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>7.1.0</version>
</dependency>

The latest version can be found here.

Also, for parsing our search queries, we’ll need:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.1.0</version>
</dependency>

Remember to check the latest version here.

3. File System Directory

In order to index files, we’ll first need to create a file-system index.

Lucene provides the FSDirectory class to create a file system index:

Directory directory = FSDirectory.open(Paths.get(indexPath));

Here, indexPath is the location of the index directory on disk. If the directory doesn’t exist, Lucene will create it.

Lucene provides three concrete implementations of the abstract FSDirectory class: SimpleFSDirectory, NIOFSDirectory, and MMapDirectory. Each of them might have special issues with a given environment.

For example, SimpleFSDirectory has poor concurrent performance as it blocks when multiple threads read from the same file.

Similarly, the NIOFSDirectory and MMapDirectory implementations face file-channel issues on Windows and memory release problems, respectively.

To overcome such environment peculiarities, Lucene provides the FSDirectory.open() method. When invoked, it tries to choose the best implementation depending on the environment.
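To see which implementation FSDirectory.open() picks on a given machine, we can print the class of the returned instance. The following is a minimal sketch, assuming the lucene-core dependency from the Maven setup; the class name DirectoryChoice and the temporary index location are hypothetical and only serve the illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class DirectoryChoice {

    // Opens a file-system index at the given path and returns the simple
    // class name of the concrete implementation Lucene selected.
    public static String chosenImplementation(Path indexPath) throws IOException {
        try (Directory directory = FSDirectory.open(indexPath)) {
            return directory.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical index location: a fresh temporary directory
        Path indexPath = Files.createTempDirectory("lucene-index");
        System.out.println(chosenImplementation(indexPath));
    }
}
```

The printed name depends on the operating system and JVM, so it may differ from one environment to another.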

4. Index Text File

Once we’ve created the index directory, let’s go ahead and add a file to the index:

// analyzer and indexPath are assumed to be fields of the enclosing class
public void addFileToIndex(String filepath) throws IOException {
    Path path = Paths.get(filepath);
    File file = path.toFile();
    IndexWriterConfig indexWriterConfig
      = new IndexWriterConfig(analyzer);
    Directory indexDirectory = FSDirectory
      .open(Paths.get(indexPath));
    IndexWriter indexWriter = new IndexWriter(
      indexDirectory, indexWriterConfig);
    Document document = new Document();

    // A Reader-based TextField is tokenized and indexed, but not stored
    FileReader fileReader = new FileReader(file);
    document.add(
      new TextField("contents", fileReader));
    // Field.Store.YES keeps the original value so it can be read from search hits
    document.add(
      new StringField("path", file.getPath(), Field.Store.YES));
    document.add(
      new StringField("filename", file.getName(), Field.Store.YES));

    indexWriter.addDocument(document);
    // Closing the writer commits the document and releases the index lock
    indexWriter.close();
}

Here, we create a document with two StringFields named “path” and “filename” and a TextField called “contents”.

Note that we pass the fileReader instance as the second parameter to the TextField. The document is added to the index using the IndexWriter.

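The third argument of the StringField constructor indicates whether the value of the field will also be stored in the index. The Reader-based TextField constructor takes no such argument, so the file contents are indexed for searching but not stored.

Finally, we invoke the close() method of the IndexWriter to commit the changes and release the lock on the index files.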
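The difference between stored and unstored fields becomes visible when we read a field back from a search hit. The following is a minimal sketch written against the Lucene 7.1.0 API used in this article (later releases may have changed some of these calls); the class name StoredFieldDemo, the sample text, and the temporary index location are hypothetical:

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Files;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class StoredFieldDemo {

    // Indexes a single document, then returns the first hit for the
    // term "consectetur" in the "contents" field.
    public static Document indexAndFetch() throws IOException {
        try (Directory directory = FSDirectory.open(
          Files.createTempDirectory("lucene-demo"))) {

            // Closing the writer commits the document to the index
            try (IndexWriter writer = new IndexWriter(
              directory, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document document = new Document();
                // Indexed but not stored: the Reader-based constructor
                document.add(new TextField("contents",
                  new StringReader("Lorem ipsum dolor sit amet, consectetur")));
                // Indexed and stored
                document.add(
                  new StringField("filename", "file1.txt", Field.Store.YES));
                writer.addDocument(document);
            }

            try (IndexReader reader = DirectoryReader.open(directory)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs topDocs = searcher.search(
                  new TermQuery(new Term("contents", "consectetur")), 1);
                return searcher.doc(topDocs.scoreDocs[0].doc);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Document hit = indexAndFetch();
        System.out.println(hit.get("filename")); // stored, so the value is returned
        System.out.println(hit.get("contents")); // not stored, so this is null
    }
}
```

The document is found through its “contents” field, yet only the stored “filename” value can be retrieved from the hit.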

5. Search Indexed Files

Now let’s search the files we have indexed:

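public List<Document> searchFiles(String inField, String queryString)
  throws IOException, ParseException {
    Query query = new QueryParser(inField, analyzer)
      .parse(queryString);
    Directory indexDirectory = FSDirectory
      .open(Paths.get(indexPath));
    // Close the reader once the matching documents have been loaded
    try (IndexReader indexReader = DirectoryReader.open(indexDirectory)) {
        IndexSearcher searcher = new IndexSearcher(indexReader);
        TopDocs topDocs = searcher.search(query, 10);

        // scoreDocs is an array and searcher.doc() throws IOException,
        // so a plain loop is simpler than a stream here
        List<Document> documents = new ArrayList<>();
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            documents.add(searcher.doc(scoreDoc.doc));
        }
        return documents;
    }
}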

Let’s now test the functionality:

@Test
public void givenSearchQueryWhenFetchedFileNameThenCorrect() throws Exception {
    String indexPath = "/tmp/index";
    String dataPath = "/tmp/data/file1.txt";
    
    Directory directory = FSDirectory
      .open(Paths.get(indexPath));
    LuceneFileSearch luceneFileSearch 
      = new LuceneFileSearch(directory, new StandardAnalyzer());
    
    luceneFileSearch.addFileToIndex(dataPath);
    
    List<Document> docs = luceneFileSearch
      .searchFiles("contents", "consectetur");
    
    assertEquals("file1.txt", docs.get(0).get("filename"));
}

Notice how we’re creating a file-system index at the location indexPath and indexing file1.txt.

Then, we simply search for the String “consectetur” in the “contents” field.
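The test assumes that /tmp/data/file1.txt already exists and contains the word “consectetur”. A minimal sketch for creating such a file with plain JDK calls could look like this; the class name SampleDataFile and the sample text are hypothetical, and any text containing the search term would work:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SampleDataFile {

    // Hypothetical sample text; it only needs to contain "consectetur"
    static final String SAMPLE_TEXT =
      "Lorem ipsum dolor sit amet, consectetur adipiscing elit.";

    // Writes the sample text to the given path, creating any missing
    // parent directories first, and returns the path that was written.
    public static Path create(String filepath) throws IOException {
        Path path = Paths.get(filepath);
        Path parent = path.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.write(path, SAMPLE_TEXT.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path created = create("/tmp/data/file1.txt");
        System.out.println("Wrote " + created);
    }
}
```

Running this once before the test puts the data file in place; an existing file at the same path is overwritten.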

6. Conclusion

This article was a quick demonstration of indexing and searching text with Apache Lucene. To learn more about indexing, searching, and queries in Lucene, please refer to our introduction to Lucene article.

As always, the code for the examples can be found over on GitHub.