PySpark aggregate and filtering code error

Hi guys,

I am a beginner at PySpark, and I'm working with a dataset where I'd like to find the top three countries with the highest number of COVID cases. I've googled enough to find what should, in theory, be the solution, but for some reason it doesn't work (the code and the error I get are pasted below).

Note that my DataFrame has two columns, Country and total_count, where total_count is the total number of confirmed COVID cases.

Here's my code:

from pyspark.sql.functions import desc

Top_by_Country = df_covid_3.groupBy('Country').max().select(['total_count'])
Top_by_Country.orderBy(desc("total_count"))

Here's the error I get:

AnalysisException: cannot resolve '`total_count`' given input columns: [Country, max(total_count)];
'Project ['total_count]
+- Aggregate [Country#11261], [Country#11261, max(total_count#9842) AS max(total_count)#11923]
   +- Project [Country#11261, total_count#9842]

Any help would be appreciated.

Apr 22, 2022 in Apache Spark by Saadat
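In case it helps anyone who lands here, a minimal sketch of one way to get this working. The root cause is visible in the error message: groupBy('Country').max() names the aggregated column max(total_count), not total_count, so the following select cannot resolve it. Using agg() with an explicit alias keeps the familiar column name. The sample rows below are made up purely for illustration; only the Country/total_count schema comes from the question.

from pyspark.sql import SparkSession
from pyspark.sql.functions import max as spark_max, desc

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for df_covid_3; the schema matches the question.
df_covid_3 = spark.createDataFrame(
    [("US", 250), ("India", 120), ("UK", 80), ("Brazil", 200)],
    ["Country", "total_count"],
)

top_by_country = (
    df_covid_3.groupBy("Country")
    # Alias the aggregate so it is still called total_count downstream.
    .agg(spark_max("total_count").alias("total_count"))
    .orderBy(desc("total_count"))
    .limit(3)  # keep only the top three countries
)
top_by_country.show()

Two other things worth noting: orderBy returns a new DataFrame rather than sorting in place, so its result has to be reassigned or chained as above; and if you keep the original .max() call, selecting the generated column name 'max(total_count)' would also resolve the error.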
