In my previous blog on What is Elasticsearch, I introduced Elasticsearch, talked about its advantages, and walked through the installation on Windows. I also discussed the basic concepts and the different API conventions present in Elasticsearch. But what I discussed there is just the tip of the iceberg. In this Elasticsearch tutorial blog, I will introduce the features that make Elasticsearch fast and popular among its competitors. I will also introduce you to the different APIs present in Elasticsearch and show how you can perform different searches using them.
Below are the topics that I will be discussing in this Elasticsearch tutorial blog: Elasticsearch APIs, Query DSL, mapping, analysis, and modules.
So, let’s get started with the very first topic of this Elasticsearch tutorial blog.
This section of the Elasticsearch tutorial blog talks about the various kinds of APIs supported by Elasticsearch. Let’s understand each of them in detail.
Elasticsearch provides both single document APIs and multi-document APIs.
Now that you know about the different types of Document APIs, let’s try to implement CRUD operations using them.
The index API is responsible for adding and updating a typed JSON document in a specific index and then making it searchable. The following example inserts the JSON document into the “playlist” index, under a type called “kpop” with an id of 1:
PUT /playlist/kpop/1
{
  "title" : "Beautiful Life",
  "artist" : "Crush",
  "album" : "Goblin",
  "year" : 2017
}
The get API is responsible for fetching a typed JSON document from an index based on its unique id. The following example gets the JSON document with id 2 from the “playlist” index, under the type called “kpop”:
GET /playlist/kpop/2
The update API is responsible for updating a document based on a script provided. The operation fetches the document from the index, runs the script, and then indexes the result back. To make sure no other updates happen between the “get” and the “reindex”, it uses versioning. The following example adds a new field called “time” to a JSON document in the “playlist” index, under a type called “kpop”, by indexing the complete document again with the new field included:
PUT /playlist/kpop/1
{
  "title" : "Beautiful Life",
  "artist" : "Crush",
  "album" : "Goblin",
  "year" : 2017,
  "time" : 5
}
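The fetch, modify, and index-back cycle with a version check can be sketched with a toy in-memory store. This is only an illustration in Python of the idea behind versioning, not how Elasticsearch implements it; the `VersionedStore` class and the `VersionConflict` exception are hypothetical names introduced for this sketch.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale document version."""


class VersionedStore:
    """Toy in-memory document store that tracks a version per document id."""

    def __init__(self):
        self._docs = {}  # doc id -> (version, source dict)

    def index(self, doc_id, source, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # Reject the write if the caller read a different version than the stored one.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        self._docs[doc_id] = (current_version + 1, dict(source))
        return current_version + 1

    def get(self, doc_id):
        version, source = self._docs[doc_id]
        return version, dict(source)

    def update(self, doc_id, changes):
        # Fetch the document, apply the change, and index the result back,
        # passing along the version that was read.
        version, source = self.get(doc_id)
        source.update(changes)
        return self.index(doc_id, source, expected_version=version)


store = VersionedStore()
store.index("1", {"title": "Beautiful Life", "artist": "Crush", "year": 2017})
new_version = store.update("1", {"time": 5})
```

A writer that still holds version 1 after this update would get a `VersionConflict` and would have to fetch the document again before retrying.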
The delete API is responsible for deleting a typed JSON document from a specific index based on its unique id. The following example deletes the JSON document with id 3 from the “playlist” index, under a type called “kpop”:
DELETE /playlist/kpop/3
The search API is responsible for searching the content within Elasticsearch. You can search either by sending a GET request with the query as a string parameter, or by sending a POST request with the query in the message body. Generally, the search APIs are multi-index and multi-type.
There are various parameters which can be passed in a search operation that uses the Uniform Resource Identifier (URI):
Parameter | Description |
q | This parameter specifies query string |
lenient | By setting this parameter’s value to true, format based errors can be ignored |
fields | This parameter fetches response from selective fields |
sort | This parameter sorts the result |
timeout | This parameter helps in restricting the search time |
terminate_after | This parameter restricts the response to a specific number of documents in each shard |
from | This parameter specifies the start index |
size | This parameter specifies the number of hits to return |
Now that you are familiar with the search parameters, let’s see how you can perform a search across multiple indices and types.
In Elasticsearch, you can search for the documents present in all the indices or in some particular indices. The following example searches the “playlist” and “my_playlist” indices for JSON documents that contain the value 2014:
GET playlist,my_playlist/_search?q=2014
A matching document looks like this:
{
  "title" : "MAMACITA",
  "artist" : "SuJu",
  "album" : "MAMACITA",
  "year" : 2014,
  "time" : 4
}
You can also search all the documents in a particular index across all types or in some specified type. The following example searches for JSON documents from a “playlist” index, under all types, where the year is 2017:
GET playlist/_search?q=2017
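The way the index names and the URI parameters from the table above combine into one request path can be sketched in Python. This is a minimal illustration using only the standard library; `build_search_path` is a hypothetical helper, and it only builds the path string without contacting a cluster.

```python
from urllib.parse import urlencode


def build_search_path(indices=None, **params):
    """Return a URI search path such as /playlist/_search?q=2017."""
    # No index names means the search runs against all indices.
    target = "/" + ",".join(indices) if indices else ""
    query = urlencode(params)
    return f"{target}/_search" + (f"?{query}" if query else "")


all_indices = build_search_path(q="2014")
two_indices = build_search_path(["playlist", "my_playlist"], q="2014", size=5)
sorted_hits = build_search_path(["playlist"], q="2017", sort="year:desc")
```

Because `from` is a reserved word in Python, that parameter would have to be passed as `**{"from": 10}` with this helper.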
The next section of this Elasticsearch tutorial talks about aggregations and the types of aggregation supported by Elasticsearch.
In Elasticsearch, the aggregations framework is responsible for providing aggregated data based on a search query. Aggregations can be composed together in order to build complex summaries of the data. For a better understanding, consider an aggregation as a unit of work that builds analytic information over a set of documents available in Elasticsearch. Various types of aggregations are available, each with its own purpose and output. For simplification, they are grouped into 4 major families: bucketing, metric, matrix, and pipeline.
Bucketing aggregations build buckets, where each bucket is associated with a key and a document criterion. Whenever the aggregation is executed, all the bucket criteria are evaluated on every document. Each time a criterion matches, the document is considered to “fall in” the relevant bucket.
Metric aggregations are responsible for keeping track of and computing metrics over a set of documents.
Matrix aggregations are responsible for operating on multiple fields. They produce a matrix result out of the values extracted from the requested document fields. Matrix aggregations do not support scripting.
Pipeline aggregations are responsible for aggregating the output of other aggregations and their associated metrics.
The following example shows how a basic aggregation is structured:
"aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : {
            <aggregation_body>
        }
        [,"meta" : { [<meta_data_body>] } ]?
        [,"aggregations" : { [<sub_aggregation>]+ } ]?
    }
    [,"<aggregation_name_2>" : { ... } ]*
}
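To make the structure above more concrete, here is a minimal Python sketch that fills it in with a bucket aggregation and a nested metric aggregation. The aggregation names `by_artist` and `average_year` are made up for this example, and the `artist.keyword` field is an assumption: it presumes the playlist documents have a keyword sub-field for the artist, which may not be true for your mapping.

```python
import json


def terms_with_average(bucket_field, metric_field, size=10):
    """Build a terms (bucket) aggregation with an avg (metric) sub-aggregation."""
    return {
        "size": 0,  # return only aggregation results, no search hits
        "aggregations": {
            "by_artist": {  # hypothetical aggregation name
                "terms": {"field": bucket_field, "size": size},
                "aggregations": {
                    "average_year": {"avg": {"field": metric_field}}
                },
            }
        },
    }


# "artist.keyword" is an assumed field name, see the note above.
body = terms_with_average("artist.keyword", "year")
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of a search request; each artist bucket would then carry the average of the `year` field for its documents.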
In Elasticsearch, the index APIs, or indices APIs, are responsible for managing individual indices, index settings, aliases, mappings, and index templates. Following are some of the operations that we can perform with the index APIs:
The create index API is responsible for instantiating an index. An index is also created automatically when a user indexes a JSON document into an index that does not exist yet. The following example creates an index called “courses” with some settings:
PUT courses
{
  "settings" : {
    "index" : {
      "number_of_shards" : 3,
      "number_of_replicas" : 2
    }
  }
}
The get index API is responsible for fetching information about an index. You call it by sending a GET request to one or more indices. The following example retrieves the index called “courses”:
GET /courses
The delete index API is responsible for deleting an existing index. The following example deletes an index called “courses”:
DELETE /courses
The open and close index APIs are responsible for closing an index and then opening it again. A closed index is blocked for any read/write operations. You can still open it later, and it will then go through the normal recovery process. The following example closes and then opens an index called “courses”:
POST /courses/_close
POST /courses/_open
APIs in Elasticsearch accept an index name when they work against a specific index. The index aliases API permits aliasing an index with a name, with all APIs automatically converting the alias name to the actual index name. The following example adds and then removes an index alias:
POST /_aliases
{
  "actions" : [
    { "add" : { "index" : "courses", "alias" : "subjects" } }
  ]
}
POST /_aliases
{
  "actions" : [
    { "remove" : { "index" : "courses", "alias" : "subjects" } }
  ]
}
In Elasticsearch, the analyze API performs the analysis process on a text and returns the token breakdown of that text. You can perform the analysis without specifying any index. The following example performs a simple analysis:
GET _analyze
{
  "analyzer" : "standard",
  "text" : "this is a demo"
}
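The token breakdown for this request can be approximated in a few lines of Python. This is only a rough sketch: the real standard analyzer follows Unicode text segmentation rules and its response carries more detail per token, while the hypothetical `analyze_simple` function below just lowercases runs of word characters and records where each one starts and ends.

```python
import re


def analyze_simple(text):
    """Approximate the token breakdown of plain ASCII text."""
    tokens = []
    # Treat each run of letters, digits, or underscores as one token.
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append(
            {
                "token": match.group().lower(),
                "start_offset": match.start(),
                "end_offset": match.end(),
                "position": position,
            }
        )
    return tokens


tokens = analyze_simple("this is a demo")
```

Each entry holds the token text, its character offsets in the original string, and its position in the token stream.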
Index templates are responsible for defining the templates that will be automatically applied when new indices are created. The following example shows a template format:
PUT _template/template_1
{
  "template": "te*",
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "type1": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "host_name": {
          "type": "keyword"
        },
        "created_at": {
          "type": "date",
          "format": "EEE MMM dd HH:mm:ss Z YYYY"
        }
      }
    }
  }
}
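The “te*” pattern in this template decides which new indices pick up the settings and mappings. As a rough Python sketch of that matching step, the hypothetical `template_applies` helper below uses the standard library’s `fnmatchcase`. Note that `fnmatchcase` also understands `?` and `[...]`, so it is only an approximation of the simple `*` wildcard used in the template.

```python
from fnmatch import fnmatchcase


def template_applies(pattern, index_name):
    """Return True if a wildcard template pattern matches a new index name."""
    return fnmatchcase(index_name, pattern)


# Index names chosen for illustration only.
candidates = ["test", "tempo-2017", "courses", "playlist"]
matched = [name for name in candidates if template_applies("te*", name)]
```

With these names, only the indices whose names start with “te” would receive the template.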
In Elasticsearch, the indices stats API is responsible for providing statistics on the different operations happening on an index. The statistics are generally provided at the index level. The following example shows the stats for all indices, followed by the stats for one specific index:
GET /_stats
GET /playlist/_stats
The flush API is responsible for flushing one or more indices. Basically, it’s a process of releasing memory from the index by pushing the data to the index storage and clearing the internal transaction log. The following example shows an index being flushed:
POST playlist/_flush
The refresh API is responsible for explicitly refreshing one or more indices. This makes all operations performed since the last refresh available for search. The following example refreshes a single index, and then two indices at once:
POST /courses/_refresh
POST /playlist,courses/_refresh
The Cluster API in Elasticsearch is responsible for fetching information about a cluster and its nodes and making further changes in them.
The Cluster Health API is responsible for retrieving the cluster’s health status; you call it by appending the “health” keyword to the URL. The following example shows cluster health:
GET _cluster/health
The Cluster State API is responsible for retrieving the state information about a cluster; you call it by appending the “state” keyword to the URL. The state contains information such as the version, master node, other nodes, routing table, metadata, and blocks. The following example shows cluster state:
GET /_cluster/state
The Cluster Stats API is responsible for retrieving statistics from a cluster-wide perspective. It returns basic index metrics and information about the current nodes that form the cluster. The following example shows cluster stats:
GET /_cluster/stats
The Pending Cluster Tasks API is responsible for monitoring the pending tasks in a cluster. Tasks may include create index, update mapping, allocate shard, fail shard, etc. The following example shows the pending cluster tasks:
GET /_cluster/pending_tasks
The Nodes Stats API is responsible for retrieving the statistics of one or more of the cluster’s nodes. The following example shows the stats of the cluster’s nodes:
GET /_nodes/stats
The Nodes Hot Threads API is responsible for retrieving the current hot threads on each of the nodes in the cluster. The following example shows the cluster’s hot threads:
GET /_nodes/hot_threads
The next section of this Elasticsearch Tutorial blog talks about the Query DSL provided by Elasticsearch.
Elasticsearch provides a full Query DSL which is based on JSON and is responsible for defining queries. The Query DSL consists of two types of clauses:
In Elasticsearch, the leaf query clauses, such as the match, term, or range queries, search for a particular value in a particular field. These queries can be used by themselves as well.
In Elasticsearch, the compound query clauses wrap other leaf or compound queries. These queries are used for combining multiple queries in a logical fashion or for altering their behavior.
The match_all query is the simplest query; it matches all the documents and returns a score of 1.0 for every document. The following example shows the match_all query:
GET /_search
{
  "query": {
    "match_all": {}
  }
}
Full-text queries are used for running full-text searches on full-text fields. They are high-level queries which understand how the field being queried is analyzed, and they apply each field’s analyzer to the query string before executing. The following example shows a simple full-text query:
POST /playlist*/_search
{
  "query": {
    "match" : {
      "title": "Beautiful Life"
    }
  }
}
Some of the full-text queries are:
Query | Description |
match | This query is used for performing full-text queries. |
match_phrase | This query is used for matching exact phrases or word proximity matches. |
match_phrase_prefix | This query is used for wildcard search on the final word. |
multi_match | This query is the multi-field version of the match query. |
common_terms | This query is used for providing more preference to uncommon words. |
query_string | This query is used for specifying AND|OR|NOT conditions and multi-field search within a single query string. |
simple_query_string | This query is a robust version of query_string. |
Rather than full-text fields, term-level queries are used for structured data like numbers, dates, and enums. You can also craft low-level queries using them. The following example shows a term-level query:
POST /playlist/_search
{
  "query": {
    "term": { "title": "Silence" }
  }
}
Some of the term-level queries are:
Query | Description |
term | This query is used for finding the documents containing the exact term specified. |
terms | This query is used for finding the documents which contain any of the exact terms specified. |
range | This query is used for finding the documents where the specified field contains values within the specified range. |
exists | This query is used for finding the documents where the specified field contains any non-null value. |
prefix | This query is used for finding the documents containing the terms beginning with the exact prefix specified. |
wildcard | This query is used for finding the documents containing the terms matching the pattern specified. |
regexp | This query is used for finding the documents containing the terms matching the regular expression. |
fuzzy | This query is used for finding the documents containing the terms fuzzily similar to the specified term. |
type | This query is used for finding the documents of the specified type. |
ids | This query is used for finding the documents with the specified type and IDs. |
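The difference between a full-text query and a term-level query comes down to whether the query text is analyzed before the lookup. The toy inverted index below sketches that idea in Python. It is a simplification under stated assumptions: `TinyIndex` is a hypothetical class, the analyzer only lowercases and splits on non-word characters, and the title field is assumed to be an analyzed text field.

```python
import re
from collections import defaultdict


def analyze(text):
    """Lowercase the text and split it into word tokens."""
    return [token.lower() for token in re.findall(r"\w+", text)]


class TinyIndex:
    """Toy inverted index mapping each analyzed term to document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for token in analyze(text):
            self.postings[token].add(doc_id)

    def term(self, value):
        # Term-level lookup: the value is used exactly as given.
        return set(self.postings.get(value, set()))

    def match(self, text):
        # Full-text lookup: the query text is analyzed first, then
        # documents containing any of the resulting tokens are returned.
        hits = set()
        for token in analyze(text):
            hits |= self.postings.get(token, set())
        return hits


index = TinyIndex()
index.add(1, "Silence")
index.add(2, "Beautiful Life")
```

In this sketch, `index.term("Silence")` finds nothing because the index only holds the lowercased token `silence`, while `index.match("Silence")` finds document 1. On an analyzed field, a term query for a capitalized word can miss in the same way.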
The compound queries in Elasticsearch are responsible for wrapping other compound or leaf queries. This is done to combine their results and scores, to change their behavior, or to switch from query to filter context. The following example shows a simple match query, the kind of leaf query that a compound query wraps:
POST /playlist/_search
{
  "query": {
    "match": {
      "title": "Lucifer"
    }
  }
}
Some of the compound queries are:
Query | Description |
constant_score | This query is used for wrapping up another query and executing it in filter context. |
bool | This is the default query for combining multiple leaf or compound query clauses. |
dis_max | This query accepts multiple queries and then returns the documents matching any of the query clauses. |
function_score | This query is used for modifying the scores returned by the main query with functions to take into account factors like popularity, recency, distance, or custom algorithms implemented with scripting. |
boosting | This query is used for returning documents matching a positive query, but reducing the score of documents matching a negative query. |
indices | This query is used for executing one query for the specified indices and another for other indices. |
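As an illustration of how a compound query wraps leaf queries, the Python sketch below assembles a bool query from a match clause and a range clause. The `bool_query` helper is a hypothetical name introduced here, and the threshold year 2010 is an arbitrary value chosen for the example.

```python
import json


def bool_query(must=None, should=None, must_not=None, filter_clauses=None):
    """Assemble a bool query, leaving out clause lists that are empty."""
    clauses = {
        "must": must,
        "should": should,
        "must_not": must_not,
        "filter": filter_clauses,
    }
    return {"bool": {name: value for name, value in clauses.items() if value}}


query = bool_query(
    must=[{"match": {"title": "Lucifer"}}],           # scored, full-text clause
    filter_clauses=[{"range": {"year": {"gte": 2010}}}],  # filter context, not scored
)
body = {"query": query}
print(json.dumps(body, indent=2))
```

The match clause contributes to the score, while the range clause only includes or excludes documents because it runs in filter context.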
In a distributed system like Elasticsearch, performing full SQL-style joins is very expensive. Thus, Elasticsearch provides two forms of join which are designed to scale horizontally.
The nested query is used for documents containing nested type fields. Using this query, you can query each nested object as an independent document.
The has_child and has_parent queries are used with a parent-child relationship between two document types within a single index. The has_child query returns the parent documents whose child documents match the specified query, while the has_parent query returns the child documents whose parent document matches the specified query.
The following example shows a simple join query:
POST /my_playlist/_search
{
  "query": {
    "has_child" : {
      "type" : "kpop",
      "query" : {
        "match" : {
          "artist" : "EXO"
        }
      }
    }
  }
}
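In Elasticsearch, two types of geo data are supported: geo_point fields, which hold latitude/longitude pairs, and geo_shape fields, which hold shapes such as points, lines, circles, and polygons. The following example uses a geo_distance filter to find the documents whose “location” lies within 150 km of a given point:
{
  "query": {
    "bool": {
      "filter": {
        "geo_distance": {
          "distance": "150km",
          "location": [42.056098, 86.674299]
        }
      }
    }
  }
}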
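What a distance filter like this one has to decide for every document can be sketched with the haversine formula. This is an illustrative Python approximation, not Elasticsearch’s own distance computation; the function names, the Earth radius constant, and the sample points are assumptions made for the example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = radians(lat2 - lat1)
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def within_distance(center, points, max_km):
    """Keep the (lat, lon) points that lie within max_km of the center."""
    return [p for p in points if haversine_km(*center, *p) <= max_km]


# Sample points: one degree apart is roughly 111 km, two degrees roughly 222 km.
center = (0.0, 0.0)
points = [(1.0, 0.0), (0.0, 1.0), (2.0, 0.0)]
nearby = within_distance(center, points, 150)
```

With a 150 km limit, the two points one degree away from the center are kept and the point two degrees away is dropped.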
The next part of this Elasticsearch Tutorial blog talks about the different mappings available in Elasticsearch.
In Elasticsearch, mapping is responsible for defining how a document and its fields are stored and indexed. The following example shows a simple mapping definition:
POST /playlist
{
  "mappings": {
    "report": {
      "_all": { "enabled": true },
      "properties": {
        "title": { "type": "string" },
        "artist": { "type": "string" },
        "album": { "type": "string" },
        "year": { "type": "integer" }
      }
    }
  }
}
Elasticsearch supports various data types for the fields in a document like:
Datatypes | Description |
Core | These are the basic data types that are supported by almost all systems. The basic datatypes are integer, long, short, byte, double, float, string, date, Boolean, and binary. |
Complex | These are the data types that are the combination of core data types. For example array, JSON object and nested data type. |
Geo | These are the data types which are used for defining geographic properties. |
Specialized | These are the data types that are used for special purposes. |
In Elasticsearch, each index has one or more mapping types. These mapping types are used to divide the documents of an index into logical groups or units. Mapping can be differentiated on the basis of the following parameters:
The following section of this Elasticsearch Tutorial blog will introduce you to the analysis process in Elasticsearch.
In Elasticsearch, analysis is the process of converting text into tokens or terms. These tokens are then added to the inverted index for searching. This process of analysis is performed by an analyzer. An analyzer can be of two types: built-in or custom.
Thus, if no custom analyzer is defined, then by default the built-in analyzers will perform the analysis. The following example creates an index called “cities” and assigns the standard analyzer to its “title” field:
PUT cities
{
  "mappings": {
    "metropolitan": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "standard"
        }
      }
    }
  }
}
In Elasticsearch, a tokenizer and optional token filters make up an analyzer. Inside the analysis module, these analyzers are registered with logical names. Using these names, the analyzers can be referenced either in mapping definitions or in some APIs. Following are some of the default analyzers:
Analyzers | Description |
Standard | Using this analyzer you can set stopwords and max_token_length. |
Simple | This analyzer is composed of the lowercase tokenizer. |
Whitespace | This analyzer is composed of the whitespace tokenizer. |
Stop | Using this analyzer, stopwords, and stopwords_path can be configured. |
Keyword | Using this analyzer, an entire stream can be tokenized into a single token. |
Pattern | Using this analyzer you can split text with a regular expression and configure settings like lowercase, pattern, flags, stopwords etc. |
Language | Using this analyzer you can analyze different languages like Hindi, Arabic, Dutch etc. |
Snowball | This analyzer utilizes a standard tokenizer, with standard filter, lowercase filter, stop filter, and snowball filter. |
Custom | Using this analyzer, you can create a customized analyzer from a tokenizer with optional token filters and char filters. |
In Elasticsearch, tokenizers are responsible for generating tokens from a text. The text can be broken down into tokens using whitespace or other punctuation. Elasticsearch provides a list of built-in tokenizers, which can be used in a custom analyzer. Following are some of the tokenizers used in Elasticsearch:
Tokenizer | Description |
Standard | Developed on grammar-based tokenizer for which max_token_length can also be configured. |
Edge NGram | Different configurations can be set for this tokenizer like min_gram, max_gram, token_chars. |
Keyword | This tokenizer outputs the entire input as a single token; its buffer_size can be configured. |
Letter | This tokenizer is responsible for capturing the whole word until a non-letter is encountered. |
Lowercase | This tokenizer works similar to the letter tokenizer. Once the tokens are created, it changes them into lower case. |
NGram | You can set min_gram, max_gram, and token_chars etc., for this tokenizer. |
Whitespace | On the basis of whitespaces, this tokenizer divides the text. |
Pattern | This tokenizer uses the regular expressions as a token separator. |
UAX Email URL | This works similar to the standard tokenizer but treats an email address or a URL as a single token. |
Path Hierarchy | This tokenizer is responsible for generating all the possible paths present inside the input directory path. |
Classic | This tokenizer uses grammar based tokens for its functioning. |
Thai | This is used for the Thai language which uses built-in Thai segmentation algorithm for processing. |
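Two of the tokenizers in the table are easy to picture with a short Python sketch. The functions below are simplified stand-ins written for this tutorial, not the real implementations: `edge_ngrams` works on a single word only, and `path_hierarchy` assumes a plain delimiter-separated path.

```python
def edge_ngrams(word, min_gram=1, max_gram=2):
    """Return the prefixes of word whose length is between min_gram and max_gram."""
    longest = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, longest + 1)]


def path_hierarchy(path, delimiter="/"):
    """Return every ancestor path of a delimited path, shortest first."""
    parts = [part for part in path.split(delimiter) if part]
    prefix = delimiter if path.startswith(delimiter) else ""
    return [prefix + delimiter.join(parts[: i + 1]) for i in range(len(parts))]


prefixes = edge_ngrams("quick", min_gram=1, max_gram=3)
paths = path_hierarchy("/one/two/three")
```

Here `prefixes` holds the leading fragments of the word, which is what makes edge n-grams useful for search-as-you-type, and `paths` holds each level of the directory tree as its own token.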
In Elasticsearch, the tokens produced by a tokenizer are passed to the token filters. These token filters can further modify, delete, or add tokens.
Before the text reaches the tokenizer, it is processed by the character filters. Character filters search for special characters, HTML tags, or specified patterns, and then either delete them or change them to appropriate words.
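The order of these three stages can be sketched as a small Python pipeline. Everything here is a simplified stand-in: the function names, the regular expression used to strip tags, and the tiny stopword list are assumptions made for the illustration, not the built-in Elasticsearch components.

```python
import re

STOPWORDS = {"a", "an", "the", "is"}  # small illustrative list, an assumption


def strip_html(text):
    """Character filter: remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text):
    """Tokenizer: split the text on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: change every token to lower case."""
    return [token.lower() for token in tokens]


def stop_filter(tokens):
    """Token filter: delete tokens that are stopwords."""
    return [token for token in tokens if token not in STOPWORDS]


def analyze(text):
    # Order matters: character filters, then the tokenizer, then token filters.
    tokens = whitespace_tokenizer(strip_html(text))
    return stop_filter(lowercase_filter(tokens))


terms = analyze("<p>This is a <b>Demo</b></p>")
```

In this sketch the stop filter runs after the lowercase filter, so a capitalized stopword such as "Is" would still be removed.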
Next part of this Elasticsearch Tutorial blog talks about different modules provided by Elasticsearch.
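Elasticsearch is composed of different modules, which are responsible for various aspects of its functionality. The settings of each module can be either static, meaning they are configured on each node before it starts, or dynamic, meaning they can be updated on a live cluster. The modules are: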
Modules | Description |
Cluster-level routing and shard allocation | Responsible for the settings which control where, when, and how shards are allocated to nodes. |
Discovery | Responsible for discovering a cluster and maintaining the state of all the nodes in it. |
Gateway | Responsible for maintaining the cluster state and the shard data across full cluster during restarts. |
HTTP | Responsible for managing the communication between HTTP client and Elasticsearch APIs. |
Indices | Responsible for maintaining the settings that are set globally for every index. |
Network | Responsible for controlling default network settings. |
Node Client | Responsible for starting a node in a cluster. |
Painless | Default scripting language responsible for safe use of inline and stored scripts. |
Plugins | Responsible for enhancing the basic elasticsearch functionality in a custom manner. |
Scripting | Enables user to use scripts to evaluate custom expressions. |
Snapshot/ Restore | Responsible for creating snapshots of individual indices or an entire cluster into a remote repository. |
Thread pools | Responsible for holding several thread pools in order to improve how thread memory consumption is managed within a node. |
Transport | Responsible for configuring the transport networking layer. |
Tribe Nodes | Responsible for joining one or more clusters and acting as a federated client across them. |
Cross-Cluster Search | Responsible for executing search requests across more than one cluster without joining them, acting as a federated client across them. |
This brings us to the end of this Elasticsearch tutorial blog. I hope I was able to clearly explain the different Elasticsearch APIs and how to use them.
If you want to get trained in Elasticsearch and wish to search and analyze large datasets with ease, then check out the ELK Stack Training by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe.