How to create new column with function in Spark Dataframe?

Question

I'm trying to figure out the new DataFrame API in Spark. I have a DataFrame with two columns, "ID" and "Amount". As a generic example, say I want to add a new column called "code" that returns a code based on the value of "Amt". I can write a function something like this:

def coder(myAmt: Integer): String = {
  if (myAmt > 100) "Little"
  else "Big"
}

When I try to use it like this:

val DF = sqlContext.parquetFile("hdfs://temp/file.parquet")
DF.withColumn("Code", coder(DF("Amt")))

I get a type mismatch error:

found   : org.apache.spark.sql.Column
required: Integer

I've tried changing the input type of my function to org.apache.spark.sql.Column, but then the function fails to compile because the if statement wants a Boolean.

Am I doing this wrong? Is there another way to do this than using withColumn?

Thanks in advance.

nitinrawat895 · Answer

Let's say you have an "Amt" column in your schema:

import org.apache.spark.sql.functions._

val DF = sqlContext.parquetFile("hdfs://temp/file.parquet")
val coder: (Int => String) = (arg: Int) => { if (arg < 100) "little" else "big" }
val sqlfunc = udf(coder)
DF.withColumn("Code", sqlfunc(col("Amt")))

I guess withColumn is the right way to add a column.

Rahul · Answer

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

df = sqlContext.createDataFrame(
    [(1, "a", 25.0), (2, "B", -25.0)], ("c1", "c2", "c3"))

def valueToCategory(value):
    if value == 1: return 1
    elif value == 2: return 2
    # ... more cases ...
    else: return 0

# NOTE: it seems that calls to udf() must come after the SparkContext is created
udfValueToCategory = udf(valueToCategory, StringType())
df_with_cat = df.withColumn("category", udfValueToCategory("c1"))
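
Applied to the "Amt" example from the question, the same UDF pattern might look like the sketch below. The threshold of 100 comes from the question; the labels, the names `coder` and `add_code_column`, and the handling of `None` are assumptions made for illustration. The pyspark import sits inside the helper only so that the plain function can be checked without a Spark session.

```python
def coder(amt, threshold=100):
    """Plain Python rule: label an amount relative to a threshold."""
    if amt is None:
        # Nulls reach a Python UDF as None; pass them through rather
        # than failing on the comparison (an illustrative choice).
        return None
    return "Big" if amt > threshold else "Small"


def add_code_column(df, amount_col="Amt", new_col="Code"):
    """Return df with a string column derived from amount_col via a UDF."""
    # Imported lazily so coder() can be tested without Spark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    coder_udf = udf(coder, StringType())
    return df.withColumn(new_col, coder_udf(col(amount_col)))
```

Calling `add_code_column(df)` on a DataFrame that has an "Amt" column returns a new DataFrame with the extra column; the original DataFrame is left unchanged.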

Raj · Answer

df.select('*', (df.column_name + 10).alias('new_column'))

Lakheer · Answer

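from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

# COLUMN_LIST: the column names the result should contain, in order
new_col = []
for column in COLUMN_LIST:
    if column in df.columns:
        new_col.append(column)
    else:
        # Column is missing from df: add it as a null string column
        new_col.append(lit(None).cast(StringType()).alias(column))

df = df.select(new_col)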

Goutam · Answer

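You can do it using a udf (yourstring is a placeholder for the value you want):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType
a = F.udf(lambda: yourstring, StringType())
df.select(a().alias('new_column'))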

Manoj · Answer

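from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Identity UDF: returns the value of old_col, declared as a string column
creator = udf(
    lambda val: val,
    StringType()
)
df.withColumn('new_col_name', creator(df.old_col))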

Vinod · Answer

DF.withColumn("new_col", DF.col("old_col") + 10)

Ashok · Answer

import org.apache.spark.sql.functions.{lit, udf}

val addColumn: (String) => String = (data: String) => { data }
val ColUDF = udf(addColumn)
val output = inputDataFrame.withColumn("Name", ColUDF(lit("abcde")))

Lalit · Answer

Dataset<Row> newDs = ds.withColumn("new_col", functions.lit(1));

Suman · Answer

val df2 = dataFrame
  .withColumn("F", lit("foo"))
  .select("F", "A", "B", "C", "D", "E")

anonymous · Answer

val coder: (Int => String) = v => if (v > 100) "Big" else "Small"
import org.apache.spark.sql.functions.udf
val coder_udf = udf(coder)
DF.withColumn("Code", coder_udf( DF.col("Amt")))

Gopalakrishnan · Answer

You don't even need to create a function; you can use the when method from org.apache.spark.sql.functions:

DF.withColumn("Code", when(DF("Amt") > 100, "Little").otherwise("Big"))
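
For PySpark users, a rough equivalent of the when/otherwise approach is sketched below. Built-in column expressions like this are evaluated by Spark itself, so they avoid the cost of shipping rows to a Python UDF. The names `code_column` and `expected_code` and the "Big"/"Small" labels are assumptions for illustration; `expected_code` is only a plain-Python mirror of the rule, included to make the null behaviour explicit.

```python
def code_column(amount_col="Amt", threshold=100):
    """Build a Column expression: 'Big' when amount > threshold, else 'Small'."""
    # Imported lazily; building the expression needs an active SparkSession.
    from pyspark.sql import functions as F

    return F.when(F.col(amount_col) > threshold, "Big").otherwise("Small")


def expected_code(amt, threshold=100):
    """Plain-Python mirror of code_column(), for reasoning about single rows.

    A null amount makes the comparison null rather than true, so the
    otherwise() branch applies and the row is labelled 'Small'.
    """
    return "Big" if amt is not None and amt > threshold else "Small"
```

With a SparkSession running, `df.withColumn("Code", code_column())` adds the column. Note that a null amount ends up in the otherwise() branch here, which can differ from what a hand-written UDF does with `None`.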

MD · Answer

Hi,

withColumn() is used to add a new column or update an existing column on a DataFrame. Here we will see how to add a new column by using an existing column. The withColumn() function takes two arguments: the first is the name of the new column and the second is the value of the column, as a Column type.

df.withColumn("CopiedColumn", col("salary") * -1).show(false)

bathina · Answer

Spark withColumn() is a DataFrame function used to add a new column to a DataFrame, change the value of an existing column, convert the datatype of a column, or derive a new column from an existing one. In this post, I will walk you through commonly used DataFrame column operations with Scala examples.

First, let's create a simple DataFrame to work with.

import spark.sqlContext.implicits._

val data = Seq(("111",50000),("222",60000),("333",40000))
val df = data.toDF("EmpId","Salary")
df.show(false)

This yields the output below:
+-----+------+
|EmpId|Salary|
+-----+------+
|111  |50000 |
|222  |60000 |
|333  |40000 |
+-----+------+

Using withColumn() to Add a New Column

withColumn() is used to add a new column or update an existing column on a DataFrame. Here, I will just explain how to add a new column by using an existing column. The withColumn() function takes two arguments: the first is the name of the new column and the second is the value of the column, as a Column type.

import org.apache.spark.sql.functions.col

// Derive a new column from an existing one
df.withColumn("CopiedColumn", col("salary") * -1)
  .show(false)

Here, we have added a new column CopiedColumn by multiplying the existing column Salary by -1. This yields the output below:
+-----+------+------------+
|EmpId|Salary|CopiedColumn|
+-----+------+------------+
|111  |50000 |-50000      |
|222  |60000 |-60000      |
|333  |40000 |-40000      |
+-----+------+------------+

You can also add columns based on conditions; please refer to Spark Case When and When Otherwise examples.

Using Select to Add Column

The above statement can also be written using select(), which yields the same output. You can also add multiple columns using select.


14 answers to this question.
