Pyspark sort array of structs

Sorting an Array of Structs in PySpark

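In PySpark, you can sort an array of structs with the sort_array function, which compares structs field by field, starting with the first field. The orderBy function, by contrast, sorts the rows of a DataFrame, not the elements inside an array column. First, let's look at the structure of the DataFrame and some sample data.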

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sort_array

# Create a Spark session
spark = SparkSession.builder.getOrCreate()

# Sample DataFrame with an array of structs.
# The struct fields get the default names _1 and _2 because no schema is given.
data = [
    (1, [("John", 25), ("Alice", 30), ("Bob", 22)]),
    (2, [("David", 35), ("Emily", 28), ("Charles", 31)]),
    (3, [("Mary", 20), ("Henry", 26)])
]

df = spark.createDataFrame(data, ["id", "people"])
df.show(truncate=False)

# +---+----------------------------------------------------+
# |id |people                                              |
# +---+----------------------------------------------------+
# |1  |[[John,25], [Alice,30], [Bob,22]]                   |
# |2  |[[David,35], [Emily,28], [Charles,31]]              |
# |3  |[[Mary,20], [Henry,26]]                             |
# +---+----------------------------------------------------+

1. Sorting by Struct Field

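The sort_array function compares structs field by field, so it orders the array by the first field of the struct (_1) and uses later fields only to break ties. If that leading field is the one you want to sort by, you can apply sort_array directly, and then use indexing to pick a single element from the sorted array.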

df.withColumn("people_sorted", sort_array(col("people"), asc=False)[0]) \
  .show(truncate=False)

# +---+----------------------------------------------------+------------------------+
# |id |people                                              |people_sorted           |
# +---+----------------------------------------------------+------------------------+
# |1  |[[John,25], [Alice,30], [Bob,22]]                   |[John, 25]              |
# |2  |[[David,35], [Emily,28], [Charles,31]]              |[Emily, 28]             |
# |3  |[[Mary,20], [Henry,26]]                             |[Mary, 20]              |
# +---+----------------------------------------------------+------------------------+

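In the above example, we use the sort_array function to sort each array of structs in descending order of the first struct field (_1), falling back to _2 on ties. The [0] indexing then selects the first element of the sorted array, that is, the struct with the greatest _1 value in each row. Despite its name, the people_sorted column therefore holds a single struct rather than the whole sorted array.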

2. Sorting by Struct Field with a Secondary Sort

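If you also want to order the rows of the DataFrame by a field of the selected struct, with a second column as a tie-breaker, you can combine sort_array and indexing with the orderBy function. Note that orderBy sorts rows; it does not change the order of the elements inside the array.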

df.withColumn("people_sorted", sort_array(col("people"), asc=False)) \
  .withColumn("people_top", col("people_sorted")[0]) \
  .orderBy(col("people_top._2").asc(), col("id").asc()) \
  .show(truncate=False)

# +---+----------------------------------------------------+--------------------------------------+------------------+
# |id |people                                              |people_sorted                         |people_top        |
# +---+----------------------------------------------------+--------------------------------------+------------------+
# |3  |[[Mary,20], [Henry,26]]                             |[[Mary,20], [Henry,26]]               |[Mary, 20]        |
# |1  |[[John,25], [Alice,30], [Bob,22]]                   |[[John,25], [Bob,22], [Alice,30]]     |[John, 25]        |
# |2  |[[David,35], [Emily,28], [Charles,31]]              |[[Emily,28], [David,35], [Charles,31]]|[Emily, 28]       |
# +---+----------------------------------------------------+--------------------------------------+------------------+

In the above example, we first sort each array of structs in descending order using sort_array and keep the full sorted array in people_sorted. Next, we use indexing [0] to select the first element of that array into people_top.

We then use the orderBy function to sort the rows of the DataFrame by the second field of that struct (people_top._2) in ascending order, with the id column in ascending order as the tie-breaker.
