In my previous blog, What is Elasticsearch, I introduced Elasticsearch, talked about its advantages, and walked through the installation on Windows. I also discussed the basic concepts and the different API conventions present in Elasticsearch. But let me tell you something interesting: everything I discussed in the previous blog is just the tip of the iceberg. In this Elasticsearch tutorial blog, I will introduce the features that make Elasticsearch the fastest and most popular among its competitors. I will also introduce you to the different APIs present in Elasticsearch and show how you can perform different searches using them.
Below are the topics that I will be discussing in this Elasticsearch tutorial blog:
So, let’s get started with the very first topic of this Elasticsearch tutorial blog.
Elasticsearch APIs – Elasticsearch Tutorial
This section of the Elasticsearch tutorial blog talks about the various kinds of APIs supported by Elasticsearch. Let’s understand each of them in detail.
Document API
Elasticsearch provides both single document APIs and multi-document APIs.
- SINGLE DOCUMENT API
- Index API
- Get API
- Update API
- Delete API
- MULTI-DOCUMENT API
- Multi Get API
- Bulk API
- Delete By Query API
- Update By Query API
- Reindex API
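As a quick illustration of the multi-document APIs, the Bulk API accepts a newline-delimited JSON (NDJSON) body in which each action line is followed by its optional source document. The sketch below builds such a body in Python; the index, type, and documents are hypothetical, matching the “playlist” examples used later in this blog:

```python
import json

def build_bulk_body(index, doc_type, docs):
    """Build an NDJSON body for the Elasticsearch Bulk API.

    Each document produces two lines: an action line ({"index": ...})
    followed by the document source itself. The body must end with a newline.
    """
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"

body = build_bulk_body("playlist", "kpop", [
    (1, {"title": "Beautiful Life", "artist": "Crush"}),
    (2, {"title": "MAMACITA", "artist": "SuJu"}),
])
print(body)
```

The resulting body would be sent as `POST /_bulk` with the `Content-Type: application/x-ndjson` header.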
Now that you know about the different types of Document APIs, let’s try to perform CRUD operations using them.
Index API
The index API is responsible for adding and updating a typed JSON document in a specific index and then making it searchable. The following example inserts the JSON document into the “playlist” index, under a type called “kpop” with an id of 1:
PUT /playlist/kpop/1 { "title" : "Beautiful Life", "artist" : "Crush", "album" : "Goblin", "year" : 2017 }
GET API
The get API is responsible for fetching a typed JSON document from the index based on its unique id. The following example gets a JSON document from the “playlist” index, under a type called “kpop”, with id 2:
GET /playlist/kpop/2
UPDATE API
The update API is responsible for updating a document based on a script or partial document provided. The operation fetches the document from the index, runs the script, and then indexes the result back. To make sure no updates happen between the “get” and the “reindex”, it uses versioning. The following example updates a JSON document in the “playlist” index, under a type called “kpop”, by adding a new field called “time”:
POST /playlist/kpop/1/_update { "doc" : { "time" : 5 } }
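The update can also be driven by a script, as described above. The sketch below builds a scripted `_update` request body in Python; the field name `time` follows the example above, and the increment value is arbitrary:

```python
def scripted_increment(field, amount):
    """Build an Elasticsearch _update request body that increments
    a numeric field using a Painless script."""
    return {
        "script": {
            "source": f"ctx._source.{field} += params.amount",
            "lang": "painless",
            "params": {"amount": amount},
        }
    }

body = scripted_increment("time", 1)
# This body would be sent as: POST /playlist/kpop/1/_update
print(body)
```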
DELETE API
The delete API is responsible for deleting a typed JSON document from a specific index based on its unique id. The following example deletes a JSON document from the “playlist” index, under a type called “kpop”, with id 3:
DELETE /playlist/kpop/3
Search API
The search API is responsible for searching content within Elasticsearch. You can search either by sending a GET request with a query string parameter, or by sending a query in the message body of a POST request. Generally, the search APIs are multi-index and multi-type.
The following parameters can be passed in a search operation using a Uniform Resource Identifier (URI):
Parameter | Description |
q | This parameter specifies query string |
lenient | By setting this parameter’s value to true, format based errors can be ignored |
fields | This parameter fetches response from selective fields |
sort | This parameter sorts the result |
timeout | This parameter helps in restricting the search time |
terminate_after | This parameter restricts the response to a specific number of documents in each shard |
from | This parameter specifies the start index |
size | This parameter specifies the number of hits to return |
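To see how these parameters combine, the sketch below assembles a URI-search request path in Python; the index name and parameter values are hypothetical, and a real request would send this path to an Elasticsearch node:

```python
from urllib.parse import urlencode

def build_search_uri(index, **params):
    """Build the path and query string for a URI search,
    e.g. /playlist/_search?q=year%3A2017&size=5&sort=year%3Adesc"""
    query = urlencode(params)
    return f"/{index}/_search?{query}" if query else f"/{index}/_search"

uri = build_search_uri("playlist", q="year:2017", size=5, sort="year:desc")
print(uri)
```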
Now that you are familiar with the search parameters, let’s see how you can perform the search through multiple indices and types.
Multi-Index
In Elasticsearch, you can search for documents across all indices or in particular indices only. The following example searches the “playlist” and “my_playlist” indices for JSON documents where the year is 2014:
GET playlist,my_playlist/_search?q=2014
The matching document returned in the hits looks like this: { "title" : "MAMACITA", "artist" : "SuJu", "album" : "MAMACITA", "year" : 2014, "time" : 4 }
Multi-Type
You can also search all the documents in a particular index across all types or in some specified type. The following example searches for JSON documents from a “playlist” index, under all types, where the year is 2017:
GET playlist/_search?q=2017
The next section of this Elasticsearch tutorial will talk about aggregations and the types of aggregations supported by Elasticsearch.
Aggregations
In Elasticsearch, the aggregations framework is responsible for providing aggregated data based on a search query. Aggregations can be composed together to build complex summaries of the data. For a better understanding, consider an aggregation as a unit-of-work: it builds analytic information over the set of documents that are available in Elasticsearch. Various types of aggregations are available, each with its own purpose and output. For simplification, they are generalized into 4 major families:
Bucketing
Here each bucket is associated with a key and a document criterion. Whenever the aggregation is executed, all the bucket criteria are evaluated on every document. Each time a criterion matches, the document is considered to “fall in” the relevant bucket.
Metric
Metric aggregations are responsible for keeping track of and computing metrics over a set of documents.
Matrix
Matrix aggregations are responsible for operating on multiple fields. They produce a matrix result out of the values extracted from the requested document fields. Matrix aggregations do not support scripting.
Pipeline
Pipeline aggregations are responsible for aggregating the output of other aggregations and their associated metrics.
The following example shows how a basic aggregation is structured:
"aggregations" : { "<aggregation_name>" : { "<aggregation_type>" : { <aggregation_body> } [,"meta" : { [<meta_data_body>] } ]? [,"aggregations" : { [<sub_aggregation>]+ } ]? } [,"<aggregation_name_2>" : { ... } ]* }
Indices APIs
In Elasticsearch, the indices APIs are responsible for managing individual indices, index settings, aliases, mappings, and index templates. Following are some of the operations that we can perform using the indices APIs:
Create Index
The create index API is responsible for instantiating an index. An index can also be created automatically when a user indexes a document into an index that does not exist yet. The following example creates an index called “courses” with some settings:
PUT courses { "settings" : { "index" : { "number_of_shards" : 3, "number_of_replicas" : 2 } } }
Get Index
The get index API is responsible for fetching information about an index. You can call it by sending a GET request to one or more indices. The following example retrieves the index called “courses”:
GET /courses
Delete Index
The delete index API is responsible for deleting an existing index. The following example deletes an index called “courses”:
DELETE /courses
Open/ Close Index API
The open and close index APIs are responsible for closing an index and then opening it. A closed index is blocked for any read/write operations, but it can be reopened, after which it goes through the normal recovery process. The following example closes and then opens an index called “courses”:
POST /courses/_close POST /courses/_open
Index Aliases
Most APIs in Elasticsearch accept an index name when working against a specific index. The index aliases API permits aliasing an index with another name, and all APIs automatically convert the alias name into the actual index name. The following example adds and then removes an index alias:
POST /_aliases { "actions" : [ { "add" : { "index" : "courses", "alias" : "subjects" } } ] } POST /_aliases { "actions" : [ { "remove" : { "index" : "courses", "alias" : "subjects" } } ] }
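Because all actions in a single `_aliases` call are applied atomically, a common pattern is to swap an alias from an old index to a new one in one request. The sketch below builds such a body in Python; the versioned index names are hypothetical:

```python
def swap_alias(alias, old_index, new_index):
    """Build an _aliases request body that atomically moves `alias`
    from `old_index` to `new_index` in a single call."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

# Sent as: POST /_aliases
body = swap_alias("subjects", "courses_v1", "courses_v2")
print(body)
```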
Analyze
The analyze API performs the analysis process on a text and returns the token breakdown of that text. You can perform analysis without specifying any index. The following example performs a simple analysis:
GET _analyze { "analyzer" : "standard", "text" : "this is a demo" }
Index Template
Index templates are responsible for defining the templates that will be automatically applied when new indices are created. The following example shows a template format:
PUT _template/template_1 { "template": "te*", "settings": { "number_of_shards": 1 }, "mappings": { "type1": { "_source": { "enabled": false }, "properties": { "host_name": { "type": "keyword" }, "created_at": { "type": "date", "format": "EEE MMM dd HH:mm:ss Z YYYY" } } } } }
Index Stats
In Elasticsearch, the indices stats API is responsible for providing statistics on the different operations happening on an index. The following example shows index-level stats for all indices, as well as the stats for a specific index:
GET /_stats GET /playlist/_stats
Flush
The flush API is responsible for flushing one or more indices. Basically, it’s the process of freeing memory from the index by pushing the data to the index storage and clearing the internal transaction log. The following example shows an index being flushed:
POST playlist/_flush
Refresh
The refresh API is responsible for explicitly refreshing one or more indices, making all operations performed since the last refresh available for search. The following example shows indices being refreshed:
POST /courses/_refresh POST /playlist,courses/_refresh
Cluster API
The Cluster API in Elasticsearch is responsible for fetching information about a cluster and its nodes and making further changes in them.
Cluster Health
This API is responsible for retrieving the cluster’s health status by appending the ‘health’ keyword to the URL. The following example shows the cluster health:
GET _cluster/health
Cluster State
The Cluster State API is responsible for retrieving state information about a cluster by appending the ‘state’ keyword to the URL. The state contains information such as version, master node, other nodes, routing table, metadata, and blocks. The following example shows the cluster state:
GET /_cluster/state
Cluster Stats
The Cluster Stats API is responsible for retrieving statistics from a cluster-wide perspective. It returns basic index metrics and information about the current nodes that form the cluster. The following example shows the cluster stats:
GET /_cluster/stats
Pending Cluster Tasks
This API is responsible for monitoring pending tasks in a cluster. Tasks may include creating an index, updating a mapping, allocating a shard, failing a shard, etc. The following example shows the pending cluster tasks:
GET /_cluster/pending_tasks
Node Stats
The cluster nodes stats API is responsible for retrieving statistics for one or more of the cluster’s nodes. The following example shows the cluster nodes stats:
GET /_nodes/stats
Nodes hot_thread
This API is responsible for retrieving the current hot threads on each node in the cluster. The following example shows the cluster’s hot threads:
GET /_nodes/hot_threads
Next section of this Elasticsearch Tutorial blog talks about the Query DSL provided by Elasticsearch.
Query DSL – Elasticsearch Tutorial
Elasticsearch provides a full Query DSL based on JSON, which is responsible for defining queries. The Query DSL consists of two types of clauses:
Leaf Query Clauses
In Elasticsearch, the leaf query clauses search for a particular value in a particular field like match, term or range queries. These queries can be used by themselves as well.
Compound Query Clauses
In Elasticsearch, the compound query clauses wrap up other leaf or compound queries. These queries are used for combining multiple queries in a logical fashion or for altering their behavior.
Match All Query
This is the simplest query; it matches all documents and returns a score of 1.0 for every document. The following example shows the match_all query:
GET /_search { "query": { "match_all": {} } }
Full Text Queries
These queries are used for running full-text queries on full-text fields. These are basically high-level queries which understand how the field being queried is analyzed, and which apply each field’s analyzer to the query string before executing. The following example shows a simple full-text query:
POST /playlist*/_search { "query":{ "match" : { "title":"Beautiful Life" } } }
Some of the full-text queries are:
Query | Description |
match | This query is used for performing full-text queries. |
match_phrase | This query is used for matching exact phrases or word-proximity matches. |
match_phrase_prefix | This query is used for wildcard search on the final word. |
multi_match | This query is the multi-field version of the match query. |
common_terms | This query gives more preference to uncommon words. |
query_string | This query supports AND, OR, NOT conditions and multi-field search within a single query string. |
simple_query_string | This query is a more robust version of query_string. |
Term Level Queries
Rather than a full-text field, these types of queries are used for structured data like numbers, dates, and enums. You can also craft low-level queries using them. The following example shows term level query:
POST /playlist/_search { "query":{ "term":{"title":"Silence"} } }
Some of the term level queries are:
Query | Description |
term | This query is used for finding the documents containing the exact term specified. |
terms | This query is used for finding the documents which contain any of the exact terms specified. |
range | This query is used for finding the documents where the specified field’s value lies within the given range. |
exists | This query is used for finding the documents where the specified field contains any non-null value. |
prefix | This query is used for finding the documents containing terms beginning with the exact prefix specified. |
wildcard | This query is used for finding the documents containing terms matching the pattern specified. |
regexp | This query is used for finding the documents containing terms matching the regular expression. |
fuzzy | This query is used for finding the documents containing terms fuzzily similar to the specified term. |
type | This query is used for finding the documents of the specified type. |
ids | This query is used for finding the documents with the specified type and IDs. |
Compound Queries
The compound queries in Elasticsearch are responsible for wrapping up other compound or leaf queries together. This is done either to combine their results and scores, to change their behavior, or to switch from query to filter context. The following example shows a simple compound query, a bool query combining a match clause with a range filter:
POST /playlist/_search { "query": { "bool": { "must": { "match": { "title": "Lucifer" } }, "filter": { "range": { "year": { "gte": 2015 } } } } } }
Some of the compound queries are:
Query | Description |
constant_score | This query wraps another query and executes it in filter context. |
bool | This query combines multiple leaf or compound query clauses, such as must, should, must_not, or filter clauses. |
dis_max | This query accepts multiple queries and returns the documents matching any of the query clauses. |
function_score | This query modifies the scores returned by the main query with functions, to take into account factors like popularity, recency, distance, or custom algorithms implemented with scripting. |
boosting | This query returns documents matching a positive query, while reducing the score of documents matching a negative query. |
indices | This query executes one query for the specified indices and another for the other indices. |
Joining Queries
In a distributed system like Elasticsearch, performing full SQL-style joins is very expensive. Thus, Elasticsearch provides two forms of join which are designed to scale horizontally.
nested query
This query is used for searching documents that contain fields of the nested type. Using this query, you can query each nested object as an independent document.
has_child & has_parent queries
These queries are used to work with the parent-child relationship between two document types within a single index. The has_child query returns the matching parent documents, while the has_parent query returns the matching child documents.
The following example shows a simple join query:
POST /my_playlist/_search { "query": { "has_child" : { "type" : "kpop", "query" : { "match" : { "artist" : "EXO" } } } } }
Geo Queries
In Elasticsearch, two types of geo data are supported:
- geo_point: These are the fields which support lat/ lon pairs
- geo_shape: These are the fields which support points, lines, circles, polygons, multi-polygons etc.
{ "query":{ "filtered":{ "filter":{ "geo_distance":{ "distance":"150km", "location":[42.056098, 86.674299] } } } } }
Next part of this Elasticsearch Tutorial blog talks about different mappings available in Elasticsearch.
Mapping – Elasticsearch Tutorial
In Elasticsearch, mapping is responsible for defining how a document and its fields are stored and indexed. The following example shows a simple mapping query:
PUT /playlist { "mappings": { "report": { "_all": { "enabled": true }, "properties": { "title": { "type": "text" }, "artist": { "type": "text" }, "album": { "type": "text" }, "year": { "type": "integer" } } } } }
Field Types
Elasticsearch supports various data types for the fields in a document like:
Datatypes | Description |
Core | These are the basic data types supported by almost all systems, such as integer, long, short, byte, double, float, string, date, boolean and binary. |
Complex | These are combinations of core data types, for example array, JSON object and nested. |
Geo | These are the data types used for defining geographic properties. |
Specialized | These are the data types used for special purposes. |
Mapping Types
In Elasticsearch, each index has one or more mapping types. These mapping types are used to divide the documents of an index into logical groups/ units. Mapping can be differentiated on the basis of the following parameters:
- Meta-Fields: The meta-fields are responsible for customizing how a document’s associated metadata is treated. Meta-fields in Elasticsearch include the document’s _index, _type, _id and _source fields.
- Fields or Properties: In Elasticsearch, each mapping type has a list of fields or properties which are specific to it only. In an index, fields with the same name but in different mapping types should have the same mapping.
- Dynamic Mapping: Elasticsearch allows the automatic creation of mappings, called dynamic mapping. Using dynamic mapping, a user can index a document without defining the mapping beforehand.
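Dynamic mapping works by inspecting the JSON values of an incoming document and guessing a field type for each. The sketch below mimics that inference in plain Python for a few core types; this is a simplified illustration, not Elasticsearch’s actual detection logic, which also handles date-format and numeric-string detection:

```python
def infer_field_type(value):
    """Roughly mimic dynamic mapping's JSON-value -> field-type guess
    (simplified; real Elasticsearch detection is more elaborate)."""
    if isinstance(value, bool):   # check bool before int: bool subclasses int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text"
    return "object"

doc = {"title": "Beautiful Life", "year": 2017, "explicit": False}
mapping = {field: infer_field_type(v) for field, v in doc.items()}
print(mapping)
```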
The following section of this Elasticsearch Tutorial blog will introduce you to the analysis process in Elasticsearch.
Analysis – Elasticsearch Tutorial
In Elasticsearch, analysis is the process of converting text into tokens or terms. These tokens are then added to the inverted index for searching. This analysis process is performed by an analyzer, which can be one of two types:
- Built-in analyzers
- Custom analyzers, defined per index
If no analyzer is defined, the built-in standard analyzer performs the analysis by default. The following example assigns an analyzer to a field in a mapping:
PUT cities { "mappings": { "metropolitan": { "properties": { "title": { "type": "text", "analyzer": "standard" } } } } }
Analyzers
In Elasticsearch, an analyzer is made up of a tokenizer and optional token filters. These analyzers are registered in the analysis module under logical names, through which they can be referenced in mapping definitions or in some APIs. Following are some of the default analyzers:
Analyzers | Description |
Standard | Using this analyzer, you can set stopwords and max_token_length. |
Simple | This analyzer is composed of the lowercase tokenizer. |
Whitespace | This analyzer is composed of the whitespace tokenizer. |
Stop | Using this analyzer, stopwords and stopwords_path can be configured. |
Keyword | Using this analyzer, an entire stream can be tokenized as a single token. |
Pattern | Using this analyzer, you can configure settings like lowercase, pattern, flags, and stopwords for a regular-expression-based analyzer. |
Language | Using this analyzer, you can analyze text in different languages like Hindi, Arabic, Dutch etc. |
Snowball | This analyzer uses the standard tokenizer, with standard filter, lowercase filter, stop filter, and snowball filter. |
Custom | Using this analyzer, you can create a customized analyzer from a tokenizer with optional token filters and character filters. |
Tokenizer
In Elasticsearch, tokenizers are responsible for generating tokens from a text. The text can be broken down into tokens on whitespace or other punctuation. Elasticsearch provides a list of built-in tokenizers, which can be used in a custom analyzer. Following are some of the tokenizers used in Elasticsearch:
Tokenizer | Description |
Standard | A grammar-based tokenizer for which max_token_length can be configured. |
Edge NGram | Different configurations can be set for this tokenizer, like min_gram, max_gram, and token_chars. |
Keyword | This tokenizer emits the entire input as a single token; buffer_size can be set. |
Letter | This tokenizer captures a whole word until a non-letter character is encountered. |
Lowercase | This tokenizer works like the letter tokenizer, but lowercases the tokens it creates. |
NGram | You can set min_gram, max_gram, and token_chars for this tokenizer. |
Whitespace | This tokenizer divides the text on whitespace. |
Pattern | This tokenizer uses a regular expression as the token separator. |
UAX Email URL | This works like the standard tokenizer, but treats emails and URLs as single tokens. |
Path Hierarchy | This tokenizer generates all the possible paths present inside the input directory path. |
Classic | This tokenizer uses grammar-based tokens for its functioning. |
Thai | This tokenizer is used for the Thai language and uses the built-in Thai segmentation algorithm. |
Token Filters
In Elasticsearch, token filters receive the stream of tokens produced by a tokenizer and can further modify, delete, or add tokens in that stream.
Character Filters
Before the tokenizer, the text is processed by the character filters. Character filters search for special characters, HTML tags, or specified patterns, and then either delete them or change them into appropriate words.
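Putting the three stages together, character filters, a tokenizer, and token filters, the sketch below models a miniature custom analyzer in plain Python. This is a simplified illustration of the pipeline, not Elasticsearch’s implementation, and the stop-word list is hypothetical:

```python
import re

def analyze(text, stopwords=("a", "is", "this")):
    """Miniature analyzer pipeline:
    1. character filter: strip HTML tags
    2. tokenizer: split on whitespace, lowercasing each token
    3. token filter: remove stop words
    """
    text = re.sub(r"<[^>]+>", " ", text)              # character filter
    tokens = [t.lower() for t in text.split()]        # tokenizer (+ lowercase)
    return [t for t in tokens if t not in stopwords]  # token filter

tokens = analyze("<b>This</b> is a Demo")
print(tokens)
```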
Next part of this Elasticsearch Tutorial blog talks about different modules provided by Elasticsearch.
Modules – Elasticsearch Tutorial
Elasticsearch is composed of different modules, which are responsible for various aspects of its functionality. Each of these modules can have any one of the following settings:
- static – These settings must be set at the node level and configured on every relevant node.
- dynamic – These settings can be updated dynamically on a live cluster.
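Dynamic settings are changed through the cluster settings API, where `transient` values last only until the next full cluster restart while `persistent` values survive it. The sketch below builds such a request body in Python; the setting shown is a shard-allocation setting, and the value is just an example:

```python
def cluster_settings(persistent=None, transient=None):
    """Build a PUT /_cluster/settings request body. Persistent settings
    survive a full cluster restart; transient settings do not."""
    body = {}
    if persistent:
        body["persistent"] = persistent
    if transient:
        body["transient"] = transient
    return body

# Temporarily disable shard allocation, e.g. before a rolling restart
body = cluster_settings(transient={"cluster.routing.allocation.enable": "none"})
print(body)
```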
Modules | Description |
Cluster-level routing and shard allocation | Responsible for the settings which control where, when, and how shards are allocated to nodes. |
Discovery | Responsible for discovering a cluster and maintaining the state of all the nodes in it. |
Gateway | Responsible for maintaining the cluster state and the shard data across full cluster restarts. |
HTTP | Responsible for managing the communication between HTTP client and Elasticsearch APIs. |
Indices | Responsible for maintaining the settings that are set globally for every index. |
Network | Responsible for controlling default network settings. |
Node Client | Responsible for starting a node in a cluster. |
Painless | Default scripting language responsible for safe use of inline and stored scripts. |
Plugins | Responsible for enhancing the basic elasticsearch functionality in a custom manner. |
Scripting | Enables user to use scripts to evaluate custom expressions. |
Snapshot/ Restore | Responsible for creating snapshots of individual indices or an entire cluster into a remote repository. |
Thread pools | Responsible for holding several thread pools in order to improve how thread memory consumption is managed within a node. |
Transport | Responsible for configuring the transport networking layer. |
Tribe Nodes | Responsible for joining one or more clusters and acting as a federated client across them. |
Cross-Cluster Search | Responsible for executing search requests across more than one cluster without joining them, acting as a federated client across them. |
This brings us to the end of this Elasticsearch tutorial blog. I hope I was able to clearly explain the different Elasticsearch APIs and how to use them.
If you want to get trained in Elasticsearch and wish to search and analyze large datasets with ease, then check out the ELK Stack Training by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe.
Got a question for us? Please mention it in the comments section and we will get back to you.