In my previous blog on What is Elasticsearch, I introduced Elasticsearch, talked about its advantages, and walked through the installation on Windows. I also discussed the basic concepts and the different API conventions present in Elasticsearch. But let me tell you something interesting: whatever I discussed in the previous blog is just the tip of the iceberg. In this Elasticsearch tutorial blog, I will introduce all the features which make Elasticsearch one of the fastest and most popular tools among its competitors. I will also introduce you to the different APIs present in Elasticsearch and show how you can perform different searches using them.
Below are the topics that I will be discussing in this Elasticsearch tutorial blog:
So, let’s get started with the very first topic of this Elasticsearch tutorial blog.
This section of the Elasticsearch tutorial blog talks about the various kinds of APIs supported by Elasticsearch. Let’s understand each of them in detail.
Elasticsearch provides both single document APIs and multi-document APIs.
Now that you know about the different types of Document APIs, let’s try to implement CRUD operations using them.
The index API is responsible for adding and updating a typed JSON document in a specific index and then making it searchable. The following example inserts the JSON document into the “playlist” index, under a type called “kpop” with an id of 1:
PUT /playlist/kpop/1
{
  "title" : "Beautiful Life",
  "artist" : "Crush",
  "album" : "Goblin",
  "year" : 2017
}
The get API is responsible for fetching a typed JSON document from the index based on its unique id. The following example gets a JSON document from the “playlist” index, under a type called “kpop”, with an id of 2:
GET /playlist/kpop/2
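A successful get returns the document inside a response envelope of roughly the following shape (the values shown here are illustrative, not taken from a real response):

{
  "_index" : "playlist",
  "_type" : "kpop",
  "_id" : "2",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "title" : "MAMACITA",
    "artist" : "SuJu",
    "album" : "MAMACITA",
    "year" : 2014
  }
}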
The update API is responsible for updating a document based on a script or a partial document provided. The operation fetches the document from the index, runs the script (or merges the partial document), and then indexes back the result. To make sure no updates happen between the “get” and the “reindex”, it uses versioning. The following example updates the JSON document in the “playlist” index, under the type “kpop”, by adding a new field called “time”:
POST /playlist/kpop/1/_update
{
  "doc" : {
    "time" : 5
  }
}
The delete API is responsible for deleting a typed JSON document from a specific index based on its unique id. The following example deletes a JSON document from the “playlist” index, under a type called “kpop”, with an id of 3:
DELETE /playlist/kpop/3
The search API is responsible for searching content within Elasticsearch. You can search either by sending a GET request with a query string parameter, or by sending a query in the message body of a POST request. Generally, the search APIs are multi-index and multi-type.
There are various parameters which can be passed in a search operation through the Uniform Resource Identifier (URI); an example combining several of them follows the table:
Parameter | Description |
q | This parameter specifies the query string |
lenient | By setting this parameter’s value to true, format-based errors can be ignored |
fields | This parameter fetches response from selective fields |
sort | This parameter sorts the result |
timeout | This parameter helps in restricting the search time |
terminate_after | This parameter restricts the response to a specific number of documents in each shard |
from | This parameter specifies the start index |
size | This parameter specifies the number of hits to return |
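As an illustration, several of these parameters can be combined in a single URI search; the field and values below are borrowed from the earlier playlist examples:

GET /playlist/_search?q=artist:Crush&sort=year:desc&from=0&size=5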
Now that you are familiar with the search parameters, let’s see how you can perform a search through multiple indices and types.
In Elasticsearch, you can search for documents across all indices or in particular indices. The following example searches the “playlist” and “my_playlist” indices for documents where the year is 2014; a matching document is shown below the request:
GET playlist,my_playlist/_search?q=2014

{
  "title" : "MAMACITA",
  "artist" : "SuJu",
  "album" : "MAMACITA",
  "year" : 2014,
  "time" : 4
}
You can also search all the documents in a particular index across all types or in some specified type. The following example searches for JSON documents from a “playlist” index, under all types, where the year is 2017:
GET playlist/_search?q=2017
The next section of Elasticsearch tutorial will talk about the aggregations and its types supported by Elasticsearch.
In Elasticsearch, the aggregations framework is responsible for providing aggregated data based on a search query. Aggregations can be composed together in order to build complex summaries of the data. For a better understanding, consider an aggregation as a unit-of-work: it builds analytic information over the set of documents that are available in Elasticsearch. Various types of aggregations are available, each of them having its own purpose and output. For simplification, they are generalized to 4 major families:
Bucketing aggregations build buckets, where each bucket is associated with a key and a document criterion. Whenever the aggregation is executed, all the bucket criteria are evaluated on every document. Each time a criterion matches, the document is considered to “fall in” the relevant bucket.
Metric aggregations are responsible for keeping track of and computing metrics over a set of documents.
Matrix aggregations are responsible for operating on multiple fields. They produce a matrix result out of the values extracted from the requested document fields. Matrix aggregations do not support scripting.
Pipeline aggregations are responsible for aggregating the output of other aggregations and their associated metrics.
The following example shows how a basic aggregation is structured:
"aggregations" : { "<aggregation_name>" : { "<aggregation_type>" : { <aggregation_body> } [,"meta" : { [<meta_data_body>] } ]? [,"aggregations" : { [<sub_aggregation>]+ } ]? } [,"<aggregation_name_2>" : { ... } ]* }
In Elasticsearch, the index APIs or the indices APIs are responsible for managing individual indices, index settings, aliases, mappings, and index templates. Following are some of the operations that we can perform on Index APIs:
The create index API is responsible for instantiating an index. An index is created by sending a PUT request with the desired index name and, optionally, its settings and mappings as a JSON object. The following example creates an index called “courses” with some settings:
PUT courses
{
  "settings" : {
    "index" : {
      "number_of_shards" : 3,
      "number_of_replicas" : 2
    }
  }
}
The get index API is responsible for fetching information about an index. You can call it by sending a GET request to one or more indices. The following example retrieves the index called “courses”:
GET /courses
The delete index API is responsible for deleting an existing index. The following example deletes an index called “courses”:
DELETE /courses
The open and close index APIs are responsible for closing an index and then opening it. A closed index is blocked for any read/write operations, but you can still open it, after which it will go through the normal recovery process. The following example closes and then opens an index called “courses”:
POST /courses/_close
POST /courses/_open
APIs in Elasticsearch that work against a specific index can accept an index name. The index aliases API permits aliasing an index with another name, with all APIs automatically converting the alias name to the actual index name. The following example adds and then removes an index alias:
POST /_aliases
{
  "actions" : [
    { "add" : { "index" : "courses", "alias" : "subjects" } }
  ]
}

POST /_aliases
{
  "actions" : [
    { "remove" : { "index" : "courses", "alias" : "subjects" } }
  ]
}
In Elasticsearch, the analyze API performs the analysis process on a text and returns the token breakdown of that text. You can perform analysis without specifying any index. The following example performs a simple analysis:
GET _analyze
{
  "analyzer" : "standard",
  "text" : "this is a demo"
}
Index templates are responsible for defining templates that are automatically applied when new indices are created. The following example shows a template format; here, the template applies to any new index whose name starts with “te”:
PUT _template/template_1
{
  "template": "te*",
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "type1": {
      "_source": { "enabled": false },
      "properties": {
        "host_name": { "type": "keyword" },
        "created_at": { "type": "date", "format": "EEE MMM dd HH:mm:ss Z YYYY" }
      }
    }
  }
}
In Elasticsearch, the indices stats API is responsible for providing statistics on the different operations happening on an index. The API provides statistics at the index level. The following example shows index-level stats for all indices, as well as stats for a specific index:
GET /_stats
GET /playlist/_stats
The flush API is responsible for flushing one or more indices through an API. Basically, it’s the process of releasing memory from the index by pushing the data to the index storage and clearing the internal transaction log. The following example shows an index being flushed:
POST playlist/_flush
The refresh API is responsible for explicitly refreshing one or more indices. This makes all operations performed since the last refresh available for search. The following example shows indices being refreshed:
POST /courses/_refresh
POST /playlist,courses/_refresh
The Cluster API in Elasticsearch is responsible for fetching information about a cluster and its nodes and making further changes in them.
This API is responsible for retrieving the cluster’s health status by appending the ‘health’ keyword to the URL. The following example shows the cluster health:
GET _cluster/health
The Cluster State API is responsible for retrieving state information about a cluster, by appending the ‘state’ keyword to the URL. The state contains information such as the version, master node, other nodes, routing table, metadata, and blocks. The following example shows the cluster state:
GET /_cluster/state
The Cluster Stats API is responsible for retrieving statistics from a cluster-wide perspective. It returns basic index metrics and information about the nodes that form the cluster. The following example shows the cluster stats:
GET /_cluster/stats
This API is responsible for monitoring pending tasks in a cluster. Tasks may include creating an index, updating a mapping, allocating a shard, failing a shard, etc. The following example shows the pending tasks:
GET /_cluster/pending_tasks
The cluster nodes stats API is responsible for retrieving statistics for one or more of the cluster’s nodes. The following example shows the cluster nodes stats:
GET /_nodes/stats
This API is responsible for retrieving the current hot threads on each node in the cluster. The following example shows the cluster’s hot threads:
GET /_nodes/hot_threads
Next section of this Elasticsearch Tutorial blog talks about the Query DSL provided by Elasticsearch.
Elasticsearch provides a full Query DSL, which is based on JSON and is responsible for defining queries. The Query DSL consists of two types of clauses:
In Elasticsearch, the leaf query clauses search for a particular value in a particular field like match, term or range queries. These queries can be used by themselves as well.
In Elasticsearch, the compound query clauses wrap up other leaf or compound queries. These queries are used for combining multiple queries in a logical fashion or for altering their behavior.
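As a minimal sketch of the two clause types working together (field names are borrowed from the earlier playlist examples), the following bool compound query combines a leaf match clause with a leaf range clause:

GET /playlist/_search
{
  "query": {
    "bool": {
      "must":   { "match": { "artist": "Crush" } },
      "filter": { "range": { "year": { "gte": 2015 } } }
    }
  }
}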
This is the simplest query; it matches all the documents and returns a score of 1.0 for every object. The following example shows the match_all query:
GET /_search
{
  "query": {
    "match_all": {}
  }
}
These queries are used for running full-text queries on full-text fields. These are basically high-level queries which understand how the field being queried is analyzed, and apply each field’s analyzer to the query string before executing. The following example shows a simple full-text query:
POST /playlist*/_search
{
  "query" : {
    "match" : {
      "title" : "Beautiful Life"
    }
  }
}
Some of the full-text queries are (a multi_match example follows the table):
Query | Description |
match | This query is used for performing full-text queries. |
match_phrase | This query is used for matching exact phrases or word proximity matches. |
match_phrase_prefix | This query is used for wildcard search on the final word. |
multi_match | This query is the multi-field version of the match query. |
common_terms | This query is used for providing more preference to uncommon words. |
query_string | This query is used for specifying AND|OR|NOT conditions and multi-field search within a single query string. |
simple_query_string | This query is a robust version of query_string. |
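For instance, a multi_match query searching the “title” and “album” fields of the playlist documents used earlier could look like this:

GET /playlist/_search
{
  "query": {
    "multi_match": {
      "query": "Beautiful",
      "fields": ["title", "album"]
    }
  }
}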
Rather than on full-text fields, these types of queries operate on structured data like numbers, dates, and enums. You can also craft low-level queries using them. The following example shows a term-level query:
POST /playlist/_search
{
  "query" : {
    "term" : { "title" : "Silence" }
  }
}
Some of the term-level queries are (a range example follows the table):
Query | Description |
term | This query is used for finding the documents containing the exact term specified. |
terms | This query is used for finding the documents which contain any of the exact terms specified. |
range | This query is used for finding the documents where the values of the specified field fall within the given range |
exists | This query is used for finding the documents where any non-null value is contained by the specified field. |
prefix | This query is used for finding the documents containing the terms beginning with the exact prefix specified. |
wildcard | This query is used for finding the documents containing the terms matching the pattern specified. |
regexp | This query is used for finding the documents containing the terms matching the regular expression. |
fuzzy | This query is used for finding the documents containing the terms fuzzily similar to the specified term. |
type | This query is used for finding the documents of the specified type. |
ids | This query is used for finding the documents with the specified type and IDs. |
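As an illustration, a range query over the “year” field from the earlier playlist examples might look like this:

GET /playlist/_search
{
  "query": {
    "range": {
      "year": { "gte": 2014, "lte": 2017 }
    }
  }
}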
The compound queries in Elasticsearch are responsible for wrapping up other compound or leaf queries together. This is done either to combine their results and scores, to change their behavior, or to switch from query to filter context. The following example shows a leaf match query wrapped inside a bool compound query:
POST /playlist/_search
{
  "query": {
    "bool": {
      "must": { "match": { "title": "Lucifer" } }
    }
  }
}
Some of the compound queries are (a boosting example follows the table):
Query | Description |
constant_score | This query is used for wrapping up another query and executing it in filter context. |
bool | The default query for combining multiple leaf or compound query clauses, as must, should, must_not, or filter clauses. |
dis_max | This query accepts multiple queries and then returns the documents matching any of the query clauses. |
function_score | This query is used for modifying the scores returned by the main query with functions to take into account factors like popularity, recency, distance, or custom algorithms implemented with scripting. |
boosting | This query is used for returning documents matching a positive query, but reducing the score of documents matching a negative query. |
indices | This query is used for executing one query for the specified indices and another for other indices. |
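For example, a boosting query over the playlist data could look roughly like this; it promotes documents matching the positive clause and demotes, rather than excludes, those matching the negative clause:

GET /playlist/_search
{
  "query": {
    "boosting": {
      "positive": { "match": { "title": "Life" } },
      "negative": { "match": { "album": "Goblin" } },
      "negative_boost": 0.5
    }
  }
}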
In a distributed system like Elasticsearch, performing full SQL-style joins is very expensive. Thus, Elasticsearch provides two forms of join which are designed to scale horizontally.
This query is used for the documents containing nested type fields. Using this query, you can query each object as an independent document.
This query is used to retrieve the parent-child relationship between two document types within a single index. The has_child query returns the matching parent documents, while the has_parent query returns the matching child documents.
The following example shows a simple join query:
POST /my_playlist/_search
{
  "query": {
    "has_child" : {
      "type" : "kpop",
      "query" : {
        "match" : { "artist" : "EXO" }
      }
    }
  }
}
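The has_parent counterpart works in the other direction. The following sketch assumes a hypothetical parent type called “artist” (with a “name” field) mapped as the parent of the “kpop” documents; it returns the child documents whose parent matches the query:

POST /my_playlist/_search
{
  "query": {
    "has_parent" : {
      "parent_type" : "artist",
      "query" : {
        "match" : { "name" : "EXO" }
      }
    }
  }
}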
In Elasticsearch, two types of geo data are supported: geo_point fields, which store latitude/longitude pairs, and geo_shape fields, which store complex shapes like polygons. The following example shows a geo_distance filter that finds documents within 150 km of a given point:
{ "query":{ "filtered":{ "filter":{ "geo_distance":{ "distance":"150km", "location":[42.056098, 86.674299] } } } } }
Next part of this Elasticsearch Tutorial blog talks about different mappings available in Elasticsearch.
In Elasticsearch, mapping is responsible for defining how a document and its fields are stored and indexed. The following example shows a simple mapping request:
POST /playlist
{
  "mappings": {
    "report": {
      "_all": { "enabled": true },
      "properties": {
        "title":  { "type": "string" },
        "artist": { "type": "string" },
        "album":  { "type": "string" },
        "year":   { "type": "integer" }
      }
    }
  }
}
Elasticsearch supports various data types for the fields in a document like:
Datatypes | Description |
Core | These are the basic data types that are supported by almost all the systems. The core datatypes are integer, long, short, byte, double, float, string, date, boolean, and binary. |
Complex | These are the data types that are the combination of core data types. For example array, JSON object and nested data type. |
Geo | These are the data types which are used for defining geographic properties. |
Specialized | These are the data types that are used for special purposes. |
In Elasticsearch, each index has one or more mapping types. These mapping types are used to divide the documents of an index into logical groups/units. Mappings can be differentiated on the basis of various mapping parameters.
Following section of this Elasticsearch Tutorial blog will introduce you to the analysis processes in Elasticsearch.
In Elasticsearch, analysis is the process of converting text into tokens or terms. These tokens are then added to the inverted index for searching purposes. This analysis is performed by an analyzer. An analyzer can be of two types: built-in analyzers, provided by Elasticsearch, and custom analyzers, defined per index.
Thus, if no analyzer is defined, then by default the built-in standard analyzer performs the analysis. The following example shows how an analyzer is specified for a field in a mapping:
PUT cities
{
  "mappings": {
    "metropolitan": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "standard"
        }
      }
    }
  }
}
In Elasticsearch, an analyzer is made up of a tokenizer and optional token filters. These analyzers are registered inside the analysis module with logical names, and can then be referenced by name in mapping definitions or in some APIs. Following are some of the default analyzers (a custom analyzer definition follows the table):
Analyzers | Description |
Standard | Using this analyzer you can set stopwords and max_token_length. |
Simple | This analyzer is composed of the lowercase tokenizer. |
Whitespace | This analyzer is composed of the whitespace tokenizer. |
Stop | Using this analyzer, stopwords and stopwords_path can be configured. |
Keyword | Using this analyzer, an entire stream can be tokenized into a single token. |
Pattern | Using this analyzer you can configure settings like lowercase, pattern, flags, and stopwords. |
Language | Using this analyzer you can analyze different languages like Hindi, Arabic, Dutch etc. |
Snowball | This analyzer utilizes a standard tokenizer, with standard filter, lowercase filter, stop filter, and snowball filter. |
Custom | Using this analyzer, a customized analyzer along with a tokenizer with optional token filters and char filters is created. |
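As a sketch of that custom case, a custom analyzer is defined in the index settings by combining a tokenizer with optional token filters (the index and analyzer names below are placeholders):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}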
In Elasticsearch, tokenizers are responsible for generating tokens from a text. The text can be broken down into tokens using whitespace or other punctuation. Elasticsearch provides a list of built-in tokenizers, which can also be used in a custom analyzer. Following are some of the tokenizers used in Elasticsearch (an analyze example follows the table):
Tokenizer | Description |
Standard | A grammar-based tokenizer, for which max_token_length can also be configured. |
Edge NGram | Different configurations can be set for this tokenizer like min_gram, max_gram, token_chars. |
Keyword | This tokenizer is responsible for generating the entire input as an output and setting the buffer_size. |
Letter | This tokenizer is responsible for capturing a whole word until a non-letter character is encountered. |
Lowercase | This tokenizer works similar to the letter tokenizer. Once the tokens are created, it changes them into lower case. |
NGram | You can set min_gram, max_gram, and token_chars etc., for this tokenizer. |
Whitespace | On the basis of whitespaces, this tokenizer divides the text. |
Pattern | This tokenizer uses the regular expressions as a token separator. |
UAX Email URL | This works like the standard tokenizer, but treats emails and URLs as single tokens. |
Path Hierarchy | This tokenizer is responsible for generating all the possible paths present inside the input directory path. |
Classic | This tokenizer uses grammar based tokens for its functioning. |
Thai | This is used for the Thai language which uses built-in Thai segmentation algorithm for processing. |
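A tokenizer can be tried out directly through the analyze API; the following request, for example, splits the text on whitespace only:

GET _analyze
{
  "tokenizer" : "whitespace",
  "text" : "this is a demo"
}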
In Elasticsearch, tokenizers send input to the token filters. These token filters can further modify, delete or add text into that input.
Before the tokenizers, the text is processed by the character filters. Character filters search for special characters, HTML tags, or specified patterns, and then either delete them or change them to appropriate words.
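For instance, the html_strip character filter can be combined with a tokenizer in an analyze request to drop HTML tags before tokenization:

GET _analyze
{
  "tokenizer" : "standard",
  "char_filter" : ["html_strip"],
  "text" : "<b>this is a demo</b>"
}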
Next part of this Elasticsearch Tutorial blog talks about different modules provided by Elasticsearch.
Elasticsearch is composed of different modules, which are responsible for various aspects of its functionality. Each of these modules can have settings that are either static (which must be set before the node starts up) or dynamic (which can be updated on a running cluster):
Modules | Description |
Cluster-level routing and shard allocation | Responsible for the settings which control where, when, and how shards are allocated to nodes. |
Discovery | Responsible for discovering a cluster and maintaining the state of all the nodes in it. |
Gateway | Responsible for maintaining the cluster state and the shard data across full cluster restarts. |
HTTP | Responsible for managing the communication between HTTP client and Elasticsearch APIs. |
Indices | Responsible for maintaining the settings that are set globally for every index. |
Network | Responsible for controlling default network settings. |
Node Client | Responsible for starting a node in a cluster. |
Painless | Default scripting language responsible for safe use of inline and stored scripts. |
Plugins | Responsible for enhancing the basic elasticsearch functionality in a custom manner. |
Scripting | Enables user to use scripts to evaluate custom expressions. |
Snapshot/ Restore | Responsible for creating snapshots of individual indices or an entire cluster into a remote repository. |
Thread pools | Responsible for holding several thread pools in order to improve how thread memory consumption is managed within a node. |
Transport | Responsible for configuring the transport networking layer. |
Tribe Nodes | Responsible for joining one or more clusters and act as a federated client across them. |
Cross-Cluster Search | Responsible for executing search requests across more than one cluster without joining them, acting as a federated client across them. |
This brings us to the end of this Elasticsearch tutorial blog. I hope I was able to clearly explain the different Elasticsearch APIs and how to use them.
If you want to get trained in Elasticsearch and wish to search and analyze large datasets with ease, then check out the ELK Stack Training by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe.
Got a question for us? Please mention it in the comments section and we will get back to you.