Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel.
An RDD can be created in two ways.
First, by referencing an external dataset in HDFS or on the local file system, e.g.:
val patientRecords = sc.textFile("/user/edureka_425640/patient_records.txt")
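If a file actually exists at that path, the returned RDD can immediately be operated on in parallel. A minimal sketch of what that might look like in spark-shell (where sc is already defined; the file path and its contents are assumptions for illustration):

// Count the records and peek at the first few lines; both count() and take() are Spark actions.
val recordCount = patientRecords.count()
val sample = patientRecords.take(5)
println(s"total records = $recordCount")
sample.foreach(println)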
Second, by parallelizing an existing collection with the sc.parallelize API, which creates an RDD from user-defined, in-memory data that does not have to come from a file or directory:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
So when we use sc.parallelize, we are again simply creating an RDD, this time from a collection already held in the driver program rather than from a file.
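Once created this way, distData supports the same parallel operations as any other RDD. A minimal sketch, again assuming spark-shell so that sc is already in scope (the variable names are just for illustration):

// Sum the elements in parallel with a reduce action,
// then double each element with a map transformation and collect the result back to the driver.
val sum = distData.reduce(_ + _)
val doubled = distData.map(_ * 2).collect()
println(s"sum = $sum, doubled = ${doubled.mkString(", ")}")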