RDD word count with line numbers

Question

Hi,

Could you please share a PySpark snippet that finds the count of each word and the list of line numbers where that word appears?

Example:

The text file contains the following text:

Hello world

Hello world

Hello

Output

Hello 3 [1,2,3]

World 2 [1,2]

Here,

Hello is present in line numbers 1,2,3

World is present in line numbers 1,2

score 0 · Answer 1 · Jul 25, 2019

df = spark.createDataFrame([("A", 2000), ("A", 2002), ("A", 2007), ("B", 1999), ("B", 2015)], ["Group", "Date"])

+-----+----+
|Group|Date|
+-----+----+
|    A|2000|
|    A|2002|
|    A|2007|
|    B|1999|
|    B|2015|
+-----+----+


# accepted solution above



from pyspark.sql.window import Window

from pyspark.sql.functions import row_number


# number the rows within each Group, ordered by Date, and print the result
df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date"))).show()



# accepted solution above output



+-----+----+-------+
|Group|Date|row_num|
+-----+----+-------+
|    B|1999|      1|
|    B|2015|      2|
|    A|2000|      1|
|    A|2002|      2|
|    A|2007|      3|
+-----+----+-------+

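Once every row has a number, you can group by the key and gather the numbers into a list, for example with a UDF or with the built-in collect_list aggregate.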
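Since the question asks about RDDs and a text file, here is a minimal sketch of the same idea without a window function. It is not the answerer's code: the helper names (`words_with_line_no`, `summarize`, `word_line_index`) are made up for illustration. It assumes that words are separated by whitespace and that matching should ignore case, which the question does not state. It counts occurrences, so a word that appears twice on one line is counted twice but that line number is listed once.

```python
def words_with_line_no(indexed_line):
    """Turn (line_text, zero_based_index) into (word, one_based_line_no) pairs."""
    text, idx = indexed_line
    # Assumption: split on whitespace and lower-case so "World" and "world" match.
    return [(word.lower(), idx + 1) for word in text.split()]


def summarize(line_nos):
    """Return (occurrence_count, sorted distinct line numbers) for one word."""
    nums = list(line_nos)
    return len(nums), sorted(set(nums))


def word_line_index(lines_rdd):
    """lines_rdd: an RDD of strings, one element per line of the text file."""
    return (
        lines_rdd.zipWithIndex()       # (line, 0-based index in RDD order)
        .flatMap(words_with_line_no)   # (word, line_no)
        .groupByKey()                  # (word, iterable of line_nos)
        .mapValues(summarize)          # (word, (count, [line_nos]))
    )


# Usage, assuming an existing SparkSession named `spark` as in the answer above:
# rdd = spark.sparkContext.parallelize(["Hello world", "Hello world", "Hello"])
# for word, (count, line_nos) in word_line_index(rdd).collect():
#     print(word, count, line_nos)
```

For a real file, the RDD could come from `spark.sparkContext.textFile(path)` instead of `parallelize`. The order in which `collect()` returns the words is not guaranteed, so sort the result if a stable order matters.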