Hey @Deepthi, you could do this:
Run the steps below to prepare to run the code in this tutorial.
- Set up your project. If necessary, set up a project with the Cloud Dataproc, Compute Engine, and Cloud Storage APIs enabled and the Cloud SDK installed on your local machine:
  - Select or create a GCP project.
  - Make sure that billing is enabled for your Google Cloud Platform project.
  - Enable the Cloud Dataproc, Compute Engine, and Cloud Storage APIs.
  - Install and initialize the Cloud SDK.
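If you prefer the command line, the SDK setup and API enablement can be done roughly as below. This is a minimal sketch: the project ID is a placeholder, and the exact service names (especially the storage one) are my assumption, so double-check them against your project.

# Initialize the Cloud SDK and authenticate (interactive)
gcloud init

# Point gcloud at the project you want to use (replace the placeholder ID)
gcloud config set project project-id

# Enable the APIs this tutorial needs (service names assumed)
gcloud services enable dataproc.googleapis.com compute.googleapis.com storage-component.googleapis.com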
- Create a Cloud Storage bucket. You need a Cloud Storage bucket to hold the tutorial data. If you do not have one ready to use, create a new bucket in your project:
  - In the GCP Console, go to the Cloud Storage Browser page.
  - Click Create bucket.
  - In the Create bucket dialog, specify the bucket attributes (such as its name and location).
  - Click Create.
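If you would rather create the bucket from the command line, gsutil can do that too. A small sketch, assuming a us-west1 location to match the example zone used later; plug in your real project ID and a globally unique bucket name:

# Create the bucket in your project (bucket names must be globally unique)
gsutil mb -p project-id -l us-west1 gs://bucket-name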
- Set local environment variables. Set the following environment variables on your local machine: your GCP project ID, the name of the Cloud Storage bucket you will use, and the name and zone of an existing or new Cloud Dataproc cluster. You can create the cluster in the next step.
PROJECT=project-id
BUCKET_NAME=bucket-name
CLUSTER=cluster-name
ZONE=cluster-zone    # example: "us-west1-a"
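For example, with purely hypothetical values filled in (these are placeholders, not names from the tutorial):

PROJECT=my-sample-project
BUCKET_NAME=my-dataproc-tutorial-bucket
CLUSTER=tutorial-cluster
ZONE=us-west1-a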
- Create a Cloud Dataproc cluster. Run the command below to create a single-node Cloud Dataproc cluster in the specified Compute Engine zone.
gcloud dataproc clusters create $CLUSTER \
--project=${PROJECT} \
--zone=${ZONE} \
--single-node
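Note that, depending on your gcloud version, the command may also require a --region flag (the region containing your zone). Once it finishes, you can sanity-check that the cluster exists; us-west1 below is an assumption matching the example zone:

# List Dataproc clusters in the region (region assumed)
gcloud dataproc clusters list --project=${PROJECT} --region=us-west1

# Or look at the new cluster directly
gcloud dataproc clusters describe ${CLUSTER} --project=${PROJECT} --region=us-west1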
- Copy your PySpark application from its source Cloud Storage bucket to your own Cloud Storage bucket:
gsutil cp gs://training/root.py gs://${BUCKET_NAME}
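To confirm the application landed in your bucket, a quick listing should show it:

gsutil ls gs://${BUCKET_NAME}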
For more info, refer to https://cloud.google.com/dataproc/docs/tutorials/gcs-connector-spark-tutorial