1. Introduction
In a previous article, we demonstrated how to configure and use Spring Data Elasticsearch for a project. In this article we will examine several query types offered by Elasticsearch and we’ll also talk about field analyzers and their impact on search results.
2. Analyzers
All stored string fields are, by default, processed by an analyzer. An analyzer consists of exactly one tokenizer, optionally preceded by character filters and followed by token filters.
The default analyzer splits the string by common word separators (such as spaces or punctuation) and puts every token in lowercase. Depending on the Elasticsearch version and the analyzer configuration, it may also drop common English stop words.
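To make this behaviour concrete, here is a minimal plain-Java sketch that roughly imitates the splitting and lowercasing steps. The class SimpleAnalyzerSketch and its analyze method are hypothetical names used only for illustration; they are not part of Elasticsearch or Spring Data, and the real analyzer follows more elaborate Unicode segmentation rules. The sketch does not remove stop words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not an Elasticsearch API.
class SimpleAnalyzerSketch {

    // Splits on runs of characters that are neither letters nor digits,
    // then lowercases every resulting token.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) { // a leading separator yields an empty first part
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [spring, data, elasticsearch]
        System.out.println(analyze("Spring Data Elasticsearch"));
    }
}
```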
Elasticsearch can also be configured to regard a field as analyzed and not-analyzed at the same time.
For example, in an Article class, suppose we store the title field as a standard analyzed field. The same field with the suffix verbatim will be stored as a not-analyzed field:
@MultiField(
  mainField = @Field(type = String),
  otherFields = {
    @NestedField(index = not_analyzed, dotSuffix = "verbatim", type = String)
  }
)
private String title;
Here, we apply the @MultiField annotation to tell Spring Data that we would like this field to be indexed in several ways. The main field will use the name title and will be analyzed according to the rules described above.
But we also provide a second annotation, @NestedField, which describes an additional indexing of the title field. We use FieldIndex.not_analyzed to indicate that we do not want to use an analyzer when performing the additional indexing of the field, and that this value should be stored using a nested field with the suffix verbatim.
2.1. Analyzed Fields
Let’s look at an example. Suppose an article with the title “Spring Data Elasticsearch” is added to our index. The default analyzer will break up the string at the space characters and produce lowercase tokens: “spring”, “data”, and “elasticsearch”.
Now we may use any combination of these terms to match a document:
SearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(matchQuery("title", "elasticsearch data"))
  .build();
2.2. Non-analyzed Fields
A non-analyzed field is not tokenized, so it can only be matched as a whole when using match or term queries:
SearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(matchQuery("title.verbatim", "Second Article About Elasticsearch"))
  .build();
With a match query on this field, we can only search by the full title, and the comparison is case-sensitive.
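The following plain-Java sketch shows why only the full, identically cased title matches: a non-analyzed value is kept as a single, unmodified token. VerbatimFieldSketch and its methods are hypothetical names for illustration, not Elasticsearch code.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical illustration of a not-analyzed field; not an Elasticsearch API.
class VerbatimFieldSketch {

    // The whole value becomes exactly one token: no splitting, no lowercasing.
    static Set<String> indexVerbatim(String value) {
        return Collections.singleton(value);
    }

    // The query text is compared as-is against that single token.
    static boolean matches(String storedValue, String query) {
        return indexVerbatim(storedValue).contains(query);
    }

    public static void main(String[] args) {
        String title = "Second Article About Elasticsearch";
        System.out.println(matches(title, "Second Article About Elasticsearch")); // prints true
        System.out.println(matches(title, "second article about elasticsearch")); // prints false
        System.out.println(matches(title, "Elasticsearch"));                      // prints false
    }
}
```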
3. Match Query
A match query accepts text, numbers and dates.
There are three types of “match” query:
- boolean
- phrase
- phrase_prefix
In this section we will explore the boolean match query.
3.1. Matching with Boolean Operators
boolean is the default type of a match query; you can specify which boolean operator to use (or is the default):
SearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(matchQuery("title", "Search engines").operator(AND))
  .build();

List<Article> articles = getElasticsearchTemplate()
  .queryForList(searchQuery, Article.class);
This query would return an article with the title “Search engines” by specifying two terms from the title with the and operator. But what happens if we search with the default (or) operator when only one of the terms matches?
SearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(matchQuery("title", "Engines Solutions"))
  .build();

List<Article> articles = getElasticsearchTemplate()
  .queryForList(searchQuery, Article.class);

assertEquals(1, articles.size());
assertEquals("Search engines", articles.get(0).getTitle());
The “Search engines” article is still matched, but it will have a lower score because not all of the terms matched.
The total score of each resulting document is the sum of the scores of its matching terms.
As a result, a document containing a single rare query term may rank higher than a document that contains several common terms.
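The effect of the operator and of rare terms can be sketched in plain Java. This is a toy model under stated assumptions: MatchOperatorSketch is a hypothetical class, tokens are produced by a simple whitespace split, and the weight log(1 + N/df) is a simplified stand-in for inverse document frequency, not the formula Elasticsearch actually uses.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of a boolean match query; not Elasticsearch code.
class MatchOperatorSketch {

    enum Operator { OR, AND }

    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+")));
    }

    // AND requires every query term in the title, OR requires at least one.
    static boolean matches(String title, String query, Operator operator) {
        Set<String> titleTokens = tokens(title);
        Set<String> queryTokens = tokens(query);
        if (operator == Operator.AND) {
            return titleTokens.containsAll(queryTokens);
        }
        for (String term : queryTokens) {
            if (titleTokens.contains(term)) {
                return true;
            }
        }
        return false;
    }

    // Sums a simplified rarity weight over the query terms found in the title.
    // Terms that occur in few documents of the corpus contribute more.
    static double score(String title, String query, List<String> corpus) {
        Set<String> titleTokens = tokens(title);
        double total = 0.0;
        for (String term : tokens(query)) {
            if (!titleTokens.contains(term)) {
                continue;
            }
            int docsWithTerm = 0;
            for (String other : corpus) {
                if (tokens(other).contains(term)) {
                    docsWithTerm++;
                }
            }
            total += Math.log(1.0 + (double) corpus.size() / Math.max(1, docsWithTerm));
        }
        return total;
    }
}
```

In a corpus where five titles contain “spring data” and only one contains “elasticsearch”, this model scores the single rare-term match above a title that matches both common terms.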
3.2. Fuzziness
When the user makes a typo in a word, it is still possible to match it with a search by specifying a fuzziness parameter, which allows inexact matching.
For string fields fuzziness means the edit distance: the number of one-character changes that need to be made to one string to make it the same as another string.
SearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(matchQuery("title", "spring date elasticsearch")
    .operator(AND)
    .fuzziness(Fuzziness.ONE)
    .prefixLength(3))
  .build();
The prefix_length parameter is used to improve performance. In this case, we require that the first three characters should match exactly, which reduces the number of possible combinations.
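As a rough illustration of how edit distance and the prefix length interact, consider this plain-Java sketch. FuzzinessSketch is a hypothetical class; it uses the classic Levenshtein distance (insertions, deletions, substitutions), which is an assumption made for simplicity and may differ in detail from the distance Elasticsearch computes.

```java
// Hypothetical illustration of fuzzy term matching; not an Elasticsearch API.
class FuzzinessSketch {

    // Classic Levenshtein distance computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // The first prefixLength characters must be identical; only then is the
    // (more expensive) distance computed and compared with maxEdits.
    static boolean fuzzyMatches(String queryTerm, String indexedTerm, int maxEdits, int prefixLength) {
        if (queryTerm.length() < prefixLength || indexedTerm.length() < prefixLength) {
            return queryTerm.equals(indexedTerm);
        }
        if (!queryTerm.regionMatches(0, indexedTerm, 0, prefixLength)) {
            return false;
        }
        return editDistance(queryTerm, indexedTerm) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(editDistance("date", "data"));       // prints 1
        System.out.println(fuzzyMatches("date", "data", 1, 3)); // prints true
        System.out.println(fuzzyMatches("bata", "data", 1, 3)); // prints false
    }
}
```

Here “date” still matches “data” because they share the prefix “dat” and differ by one character, while “bata” is rejected by the prefix check even though it is also one edit away.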
5. Phrase Search
Phrase search is stricter, although you can control it with the slop parameter. This parameter tells the phrase query how far apart terms are allowed to be while still considering the document a match.
In other words, it represents the number of times you need to move a term in order to make the query and document match:
SearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(matchPhraseQuery("title", "spring elasticsearch").slop(1))
  .build();
Here the query will match the document with the title “Spring Data Elasticsearch” because we set the slop to one.
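One way to picture the slop calculation is the simplified plain-Java model below. It assumes each query term occurs at most once in the title (only the first occurrence is considered) and measures how far the terms have shifted relative to their positions in the phrase. PhraseSlopSketch is a hypothetical name, and the real phrase matching in Elasticsearch handles repeated terms and other cases that this sketch ignores.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Simplified model of phrase matching with slop; not Elasticsearch code.
class PhraseSlopSketch {

    // Returns the smallest slop under which the phrase matches the title,
    // or -1 if some phrase term does not occur in the title at all.
    static int requiredSlop(String title, String phrase) {
        List<String> titleTokens = Arrays.asList(title.toLowerCase(Locale.ROOT).split("\\s+"));
        String[] phraseTokens = phrase.toLowerCase(Locale.ROOT).split("\\s+");
        int minShift = Integer.MAX_VALUE;
        int maxShift = Integer.MIN_VALUE;
        for (int i = 0; i < phraseTokens.length; i++) {
            int position = titleTokens.indexOf(phraseTokens[i]); // first occurrence only
            if (position < 0) {
                return -1;
            }
            int shift = position - i; // offset between title position and phrase position
            minShift = Math.min(minShift, shift);
            maxShift = Math.max(maxShift, shift);
        }
        return maxShift - minShift;
    }

    static boolean matches(String title, String phrase, int slop) {
        int required = requiredSlop(title, phrase);
        return required >= 0 && required <= slop;
    }

    public static void main(String[] args) {
        System.out.println(requiredSlop("Spring Data Elasticsearch", "spring elasticsearch")); // prints 1
        System.out.println(matches("Spring Data Elasticsearch", "spring elasticsearch", 0));   // prints false
    }
}
```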
6. Multi Match Query
To search across multiple fields, you can use QueryBuilders#multiMatchQuery() and specify all the fields to match:
SearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(multiMatchQuery("tutorial")
    .field("title")
    .field("tags")
    .type(MultiMatchQueryBuilder.Type.BEST_FIELDS))
  .build();
Here we search the title and tags fields for a match.
Notice that here we use the “best fields” scoring strategy. It will take the maximum score among the fields as a document score.
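The “take the maximum” idea can be sketched in a few lines of plain Java. BestFieldsSketch is a hypothetical class, and the per-field score used here (the number of distinct query terms found in the field) is a deliberately crude assumption; Elasticsearch computes a relevance score per field instead. For simplicity the sketch also tokenizes every field on whitespace.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model of the best-fields strategy; not Elasticsearch code.
class BestFieldsSketch {

    // Crude per-field score: how many distinct query terms the field contains.
    static int fieldScore(String fieldValue, String query) {
        Set<String> fieldTokens =
            new HashSet<>(Arrays.asList(fieldValue.toLowerCase(Locale.ROOT).split("\\s+")));
        Set<String> queryTokens =
            new HashSet<>(Arrays.asList(query.toLowerCase(Locale.ROOT).split("\\s+")));
        int score = 0;
        for (String term : queryTokens) {
            if (fieldTokens.contains(term)) {
                score++;
            }
        }
        return score;
    }

    // The document score is the highest score among its fields, not the sum.
    static int bestFieldsScore(Map<String, String> fields, String query) {
        int best = 0;
        for (String value : fields.values()) {
            best = Math.max(best, fieldScore(value, query));
        }
        return best;
    }
}
```

With a title of “Search engines” and a tags value of “tutorial”, the query “search engines tutorial” scores 2 in this model (the title’s score) rather than 3.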
7. Aggregations
In our Article class we have also defined a tags field, which is non-analyzed. We could easily create a tag cloud by using an aggregation.
Remember that, because the field is non-analyzed, the tags will not be tokenized:
TermsBuilder aggregation = AggregationBuilders.terms("top_tags")
  .field("tags")
  .order(Terms.Order.aggregation("_count", false));

SearchResponse response = client.prepareSearch("blog")
  .setTypes("article")
  .addAggregation(aggregation)
  .execute().actionGet();

Map<String, Aggregation> results = response.getAggregations().asMap();
StringTerms topTags = (StringTerms) results.get("top_tags");

List<String> keys = topTags.getBuckets()
  .stream()
  .map(b -> b.getKey())
  .collect(toList());

assertEquals(asList("elasticsearch", "spring data", "search engines", "tutorial"), keys);
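Conceptually, this aggregation counts how many documents carry each tag and orders the buckets by that count, descending. The plain-Java sketch below reproduces the idea in memory. TagCloudSketch and the sample tag lists are hypothetical, and breaking ties alphabetically is a choice made here to keep the output deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory imitation of a terms aggregation ordered by count; not Elasticsearch code.
class TagCloudSketch {

    static List<String> topTags(List<List<String>> tagsPerArticle) {
        // Each tag is kept whole (not tokenized), mirroring a non-analyzed field.
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> tags : tagsPerArticle) {
            for (String tag : tags) {
                counts.merge(tag, 1, Integer::sum);
            }
        }
        List<String> keys = new ArrayList<>(counts.keySet());
        // Highest count first; equal counts are ordered alphabetically.
        keys.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return keys;
    }

    public static void main(String[] args) {
        List<List<String>> articles = Arrays.asList(
            Arrays.asList("elasticsearch", "spring data"),
            Arrays.asList("elasticsearch", "search engines"),
            Arrays.asList("elasticsearch"),
            Arrays.asList("spring data", "tutorial"));
        // prints [elasticsearch, spring data, search engines, tutorial]
        System.out.println(topTags(articles));
    }
}
```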
8. Summary
In this article we discussed the difference between analyzed and non-analyzed fields, and how this distinction affects search.
We also learned about several types of queries provided by Elasticsearch, such as the match query (with boolean operators and fuzziness), the phrase match query, and the multi-match query, and we used a terms aggregation to build a tag cloud.
Elasticsearch provides many other types of queries, such as geo queries, script queries and compound queries. You can read about them in the Elasticsearch documentation and explore the Spring Data Elasticsearch API in order to use these queries in your code.
You can find a project containing the examples used in this article in the GitHub repository.