Hey @Deepthi, you could do this:
Run the steps below to prepare to run the code in this tutorial.
- Set up your project. If necessary, set up a project with the Cloud Dataproc, Compute Engine, and Cloud Storage APIs enabled and the Cloud SDK installed on your local machine:
  - Select or create a GCP project.
  - Make sure that billing is enabled for your Google Cloud Platform project.
  - Enable the Cloud Dataproc, Compute Engine, and Cloud Storage APIs.
  - Install and initialize the Cloud SDK.
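If you prefer the command line, the SDK setup and API enablement can be done roughly as below. This is a minimal sketch: the project ID is a placeholder, and the exact service names (especially the storage one) are my assumption, so double-check them against your project.

# Initialize the Cloud SDK and authenticate (interactive)
gcloud init

# Point gcloud at the project you want to use (replace the placeholder ID)
gcloud config set project project-id

# Enable the APIs this tutorial needs (service names assumed)
gcloud services enable dataproc.googleapis.com compute.googleapis.com storage-component.googleapis.com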
- Create a Cloud Storage bucket. You need a Cloud Storage bucket to hold the tutorial data. If you do not have one ready to use, create a new bucket in your project:
  - In the GCP Console, go to the Cloud Storage Browser page.
  - Click Create bucket.
  - In the Create bucket dialog, specify the bucket attributes (such as its name and location).
  - Click Create.
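If you would rather create the bucket from the command line, gsutil can do that too. A small sketch, assuming a us-west1 location to match the example zone used later; plug in your real project ID and a globally unique bucket name:

# Create the bucket in your project (bucket names must be globally unique)
gsutil mb -p project-id -l us-west1 gs://bucket-name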
- Set local environment variables. Set the following environment variables on your local machine: your GCP project ID, the name of the Cloud Storage bucket you will use, and the name and zone of an existing or new Cloud Dataproc cluster. You can create the cluster in the next step.
PROJECT=project-id
BUCKET_NAME=bucket-name
CLUSTER=cluster-name
ZONE=cluster-zone    # example: "us-west1-a"
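For example, with purely hypothetical values filled in (these are placeholders, not names from the tutorial):

PROJECT=my-sample-project
BUCKET_NAME=my-dataproc-tutorial-bucket
CLUSTER=tutorial-cluster
ZONE=us-west1-a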
- Create a Cloud Dataproc cluster. Run the command below to create a single-node Cloud Dataproc cluster in the specified Compute Engine zone.
gcloud dataproc clusters create $CLUSTER \
--project=${PROJECT} \
--zone=${ZONE} \
--single-node
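Note that, depending on your gcloud version, the command may also require a --region flag (the region containing your zone). Once it finishes, you can sanity-check that the cluster exists; us-west1 below is an assumption matching the example zone:

# List Dataproc clusters in the region (region assumed)
gcloud dataproc clusters list --project=${PROJECT} --region=us-west1

# Or look at the new cluster directly
gcloud dataproc clusters describe ${CLUSTER} --project=${PROJECT} --region=us-west1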
- Copy your PySpark application from its source Cloud Storage bucket to your own Cloud Storage bucket:
gsutil cp gs://training/root.py gs://${BUCKET_NAME}
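To confirm the application landed in your bucket, a quick listing should show it:

gsutil ls gs://${BUCKET_NAME}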
For more info, refer to https://cloud.google.com/dataproc/docs/tutorials/gcs-connector-spark-tutorial