Why do we use the --split-by option in Sqoop?

Question

Why do we use the --split-by option in Sqoop?

Gitika · Answer 1 · Apr 11, 2019

The --split-by option specifies the table column that Sqoop uses to generate splits for an import. In other words, it names the column whose values determine how the rows are divided among the mappers when the data is imported into the cluster.

It is used to improve import performance: once the rows are divided into splits, several mappers can import the table in parallel.
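To illustrate how a split column turns into parallel work, here is a minimal Python sketch. It is not Sqoop's actual code. It assumes an integer split column whose minimum and maximum are already known (Sqoop gets them with a SELECT MIN(col), MAX(col) style query) and divides that range evenly among the mappers. The function names split_ranges and where_clauses and the column name id are hypothetical, and the exact boundary conditions Sqoop generates may differ.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive integer range [lo, hi] into contiguous,
    non-overlapping (start, end) ranges of near-equal size."""
    total = hi - lo + 1
    num = min(num_mappers, total)      # never create empty ranges
    base, extra = divmod(total, num)   # the first `extra` ranges get one more value
    ranges = []
    start = lo
    for i in range(num):
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def where_clauses(column, ranges):
    """Build one illustrative WHERE condition per mapper."""
    return [f"{column} >= {a} AND {column} <= {b}" for a, b in ranges]


# Hypothetical table whose id column runs from 1 to 1000, imported with 4 mappers.
for clause in where_clauses("id", split_ranges(1, 1000, 4)):
    print(clause)
# id >= 1 AND id <= 250
# id >= 251 AND id <= 500
# id >= 501 AND id <= 750
# id >= 751 AND id <= 1000
```

Each mapper then reads only the rows that match its own condition, which is why the import can run in parallel.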

score +2 · Answer 2 · Feb 6, 2020

A simple explanation:

We specify --split-by when sqoop'ing a whole table [which is equivalent to select * from table]. Since such an import is not condition-based, Sqoop needs some rule for dividing the data so that it can be processed on more than one node.

So we select a column whose values can be divided by range [numeric ranges for integers, alphabetical ranges for characters], so that a huge amount of the data does not end up concentrated on one mapper/node.
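The point about data piling up on one mapper can be seen in a small Python sketch. It is only an illustration under stated assumptions: the value range of the split column is cut into equal-width slices, one per mapper, and the function name rows_per_mapper and both sample datasets are made up for the example.

```python
def rows_per_mapper(values, num_mappers):
    """Count how many rows land in each equal-width slice of [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo + 1) / num_mappers
    counts = [0] * num_mappers
    for v in values:
        # Clamp so the maximum value always falls in the last slice.
        idx = min(int((v - lo) / width), num_mappers - 1)
        counts[idx] += 1
    return counts


# Hypothetical split column with evenly spread values: 1..1000.
evenly_spread = list(range(1, 1001))

# Hypothetical skewed column: 997 small values plus three large outliers.
skewed = list(range(1, 998)) + [5000, 9000, 10000]

print(rows_per_mapper(evenly_spread, 4))  # [250, 250, 250, 250]
print(rows_per_mapper(skewed, 4))         # [997, 1, 0, 2]
```

With the skewed column, one mapper handles almost every row while the others sit nearly idle, so a column with evenly distributed values is the better choice for --split-by.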

Big data tools are built specifically to achieve this kind of parallelism.