Data Lab PreTest

Assessment • Quiz • Instructional Technology • Professional Development • Hard

Created by Sinh Hoàng Văn

50 questions

1.

MULTIPLE SELECT QUESTION

1 min • 1 pt

You have the following dataset and need to calculate the average salary per department, but only include departments where the average salary is greater than 4000:
df.groupBy("Department").avg("Salary").filter("avg(Salary) > 4000")

df.createOrReplaceTempView("employees")
spark.sql("SELECT Department, AVG(Salary) AS AvgSalary FROM employees GROUP BY Department HAVING AVG(Salary) > 4000")

from pyspark.sql.functions import col, avg
df.groupBy("Department").agg(avg(col("Salary")).alias("AvgSalary")).filter(col("AvgSalary") > 4000)

df.filter("Salary > 4000").groupBy("Department").avg("Salary")
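The distinction this question tests is SQL's HAVING: aggregate first, then filter on the aggregate, rather than filtering rows before grouping. A plain-Python sketch of that semantics, with hypothetical toy data:

```python
from collections import defaultdict

# Hypothetical toy rows standing in for the dataset.
rows = [
    {"Department": "IT", "Salary": 5000},
    {"Department": "IT", "Salary": 4500},
    {"Department": "HR", "Salary": 3000},
    {"Department": "HR", "Salary": 3500},
]

# GROUP BY Department ...
salaries = defaultdict(list)
for r in rows:
    salaries[r["Department"]].append(r["Salary"])

avg_salary = {dept: sum(s) / len(s) for dept, s in salaries.items()}

# ... HAVING AVG(Salary) > 4000 — the filter applies to the aggregate.
result = {dept: a for dept, a in avg_salary.items() if a > 4000}
print(result)  # {'IT': 4750.0}
```

Filtering `Salary > 4000` on the raw rows instead would both drop rows from the averages and keep departments whose overall average is below the threshold.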

2.

MULTIPLE SELECT QUESTION

1 min • 1 pt

You need to find all employees in departments with a budget greater than 8000. Which code snippet achieves this?
employees_df.join(departments_df, "Department").filter("Budget > 8000").select("Name", "Department")

employees_df.createOrReplaceTempView("employees")
departments_df.createOrReplaceTempView("departments")
spark.sql("SELECT e.Name, e.Department FROM employees e JOIN departments d ON e.Department = d.Department WHERE d.Budget > 8000")

from pyspark.sql.functions import col
employees_df.join(departments_df, "Department").where(col("Budget") > 8000).select("Name", "Department")

employees_df.join(departments_df, employees_df.Department == departments_df.Department).filter("Budget > 8000")
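All of these express the same join-then-filter logic: match each employee to their department row, then keep only matches whose Budget exceeds the threshold. A plain-Python sketch of that semantics, with hypothetical toy data:

```python
# Hypothetical toy tables.
employees = [
    {"Name": "Ann", "Department": "IT"},
    {"Name": "Bob", "Department": "HR"},
]
departments = [
    {"Department": "IT", "Budget": 10000},
    {"Department": "HR", "Budget": 5000},
]

# JOIN ON Department, then WHERE Budget > 8000.
budgets = {d["Department"]: d["Budget"] for d in departments}
result = [
    {"Name": e["Name"], "Department": e["Department"]}
    for e in employees
    if budgets.get(e["Department"], 0) > 8000
]
print(result)  # [{'Name': 'Ann', 'Department': 'IT'}]
```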

3.

MULTIPLE SELECT QUESTION

1 min • 1 pt

From the dataset below, find the top-earning employee in each department, including their department name and salary. Use Spark SQL:
df.createOrReplaceTempView("employees")
spark.sql("SELECT Department, Name, Salary FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY Department ORDER BY Salary DESC) AS rank FROM employees) t WHERE rank = 1")

df.createOrReplaceTempView("employees")
spark.sql("SELECT Department, Name, MAX(Salary) AS Salary FROM employees GROUP BY Department, Name")

df.createOrReplaceTempView("employees")
spark.sql("SELECT Department, Name, MAX(Salary) AS MaxSalary FROM employees GROUP BY Department")

from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
window_spec = Window.partitionBy("Department").orderBy(col("Salary").desc())
df.withColumn("rank", row_number().over(window_spec)).filter("rank == 1").select("Department", "Name", "Salary")
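The window-function pattern being tested — number the rows within each partition by descending salary and keep row 1 — can be sketched in plain Python with hypothetical toy data:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical toy rows.
rows = [
    {"Department": "IT", "Name": "Ann", "Salary": 5000},
    {"Department": "IT", "Name": "Cat", "Salary": 4500},
    {"Department": "HR", "Name": "Bob", "Salary": 3500},
]

# PARTITION BY Department ORDER BY Salary DESC:
# sort by department, then by salary descending within it.
rows.sort(key=lambda r: (r["Department"], -r["Salary"]))

# ROW_NUMBER() ... WHERE rank = 1: take the first row of each partition.
top = [next(group) for _, group in groupby(rows, key=itemgetter("Department"))]
# one row per department: Bob (HR) and Ann (IT)
```

A plain GROUP BY can return the maximum salary per department, but not which employee earns it — that is why the window function is needed.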

4.

MULTIPLE SELECT QUESTION

1 min • 1 pt

Which of the following code snippets computes the total sales per region efficiently in Spark SQL?
df.createOrReplaceTempView("sales")
spark.sql("SELECT Region, SUM(Sales) AS TotalSales FROM sales GROUP BY Region")

df.groupBy("Region").sum("Sales")

from pyspark.sql.functions import col, sum
df.groupBy(col("Region")).agg(sum(col("Sales")).alias("TotalSales"))

df.filter("Sales IS NOT NULL").groupBy("Region").sum("Sales")

5.

MULTIPLE SELECT QUESTION

1 min • 1 pt

How can you rank products within each category by their sales in descending order?
from pyspark.sql.window import Window
from pyspark.sql.functions import col, rank
window_spec = Window.partitionBy("Category").orderBy(col("Sales").desc())
df.withColumn("rank", rank().over(window_spec))

df.createOrReplaceTempView("products")
spark.sql("SELECT Category, Product, RANK() OVER (PARTITION BY Category ORDER BY Sales DESC) AS Rank FROM products")

df.groupBy("Category", "Product").agg(rank().alias("Rank"))

from pyspark.sql.functions import row_number
window_spec = Window.partitionBy("Category").orderBy("Sales DESC")
df.withColumn("row_number", row_number().over(window_spec))

6.

MULTIPLE SELECT QUESTION

1 min • 1 pt

Which function would you use to replace all NULL values in a Spark DataFrame with default values?
df.na.fill({"Column1": 0, "Column2": "N/A"})

df.na.replace(NULL, {"Column1": 0, "Column2": "N/A"})

df.fillna({"Column1": 0, "Column2": "N/A"})

df.replace(NULL, {"Column1": 0, "Column2": "N/A"})
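The dict argument accepted by df.na.fill / df.fillna maps each column name to its own default. A plain-Python sketch of that per-column replacement, with hypothetical toy data:

```python
# Hypothetical toy rows; None stands in for SQL NULL.
rows = [
    {"Column1": None, "Column2": "x"},
    {"Column1": 7, "Column2": None},
]

defaults = {"Column1": 0, "Column2": "N/A"}

# Each None is replaced by the default registered for its column.
filled = [
    {k: (defaults[k] if v is None else v) for k, v in row.items()}
    for row in rows
]
print(filled)  # [{'Column1': 0, 'Column2': 'x'}, {'Column1': 7, 'Column2': 'N/A'}]
```

Note also that the NULL in the two distractor options is not a Python name, so those snippets would raise a NameError before doing anything.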

7.

MULTIPLE SELECT QUESTION

1 min • 1 pt

How do you efficiently join two DataFrames on multiple keys in PySpark?
df1.join(df2, (df1.Key1 == df2.Key1) & (df1.Key2 == df2.Key2))

df1.join(df2, "Key1", "inner")

df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")
spark.sql("SELECT * FROM df1 JOIN df2 ON df1.Key1 = df2.Key1 AND df1.Key2 = df2.Key2")

df1.merge(df2, "Key1")
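A multi-key join matches rows only when every key agrees, which is what the composite condition (Key1 AND Key2) expresses. A plain-Python sketch of an inner join on a composite key, with hypothetical toy data:

```python
# Hypothetical toy tables.
df1 = [
    {"Key1": "a", "Key2": 1, "v": 10},
    {"Key1": "a", "Key2": 2, "v": 20},
]
df2 = [
    {"Key1": "a", "Key2": 1, "w": 100},
]

# Index the right side by the composite key (Key1, Key2).
index = {(r["Key1"], r["Key2"]): r for r in df2}

# Inner join: keep left rows whose composite key has a match, merge columns.
joined = [
    {**left, **index[(left["Key1"], left["Key2"])]}
    for left in df1
    if (left["Key1"], left["Key2"]) in index
]
print(joined)  # [{'Key1': 'a', 'Key2': 1, 'v': 10, 'w': 100}]
```

Joining on Key1 alone would wrongly match the ("a", 2) row as well; and merge is a pandas method, not a PySpark DataFrame one.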
