PySpark: Finding the top three countries by confirmed covid cases

Hi guys,

I am a beginner at PySpark and I'm working with a dataset where I'd like to find the top three countries with the highest number of covid cases. I've searched for a solution and, in theory, the code below should work, but it fails with an error. Please see the attached screenshot.

Note that my dataframe has two columns: Country and total_count (where total_count is the total number of confirmed covid cases).

Here's my code:

from pyspark.sql.functions import desc

Top_by_Country = df_covid_3.groupBy('Country').max().select(['total_count'])


Here's the error I get:

AnalysisException: cannot resolve '`total_count`' given input columns: [Country, max(total_count)];
'Project ['total_count]
+- Aggregate [Country#11261], [Country#11261, max(total_count#9842) AS max(total_count)#11923]
   +- Project [Country#11261, total_count#9842]

Error Screenshot

Apr 22, 2022 in Apache Spark by Saadat



