Sorting an Array of Structs in PySpark
In PySpark, you can sort an array of structs with the sort_array function, which orders the elements by the natural ordering of the struct (comparing fields left to right, starting with the first field). First, let's look at the structure of the DataFrame and some sample data for clarity.
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
# Create a Spark session
spark = SparkSession.builder.getOrCreate()
# Sample DataFrame with array of structs
data = [
(1, [("John", 25), ("Alice", 30), ("Bob", 22)]),
(2, [("David", 35), ("Emily", 28), ("Charles", 31)]),
(3, [("Mary", 20), ("Henry", 26)])
]
df = spark.createDataFrame(data, ["id", "people"])
df.show(truncate=False)
# +---+----------------------------------------------------+
# |id |people |
# +---+----------------------------------------------------+
# |1 |[[John,25], [Alice,30], [Bob,22]] |
# |2 |[[David,35], [Emily,28], [Charles,31]] |
# |3 |[[Mary,20], [Henry,26]] |
# +---+----------------------------------------------------+
1. Sorting by Struct Field
To sort the array of structs and pick out its leading element, you can use the sort_array function along with indexing. Note that sort_array compares structs field by field from left to right, so the sort here is driven by the first field, the name.
df.withColumn("people_sorted", sort_array(col("people"), asc=False)[0]) \
.show(truncate=False)
# +---+----------------------------------------------------+------------------------+
# |id |people                                              |people_sorted           |
# +---+----------------------------------------------------+------------------------+
# |1  |[[John,25], [Alice,30], [Bob,22]]                   |[John, 25]              |
# |2  |[[David,35], [Emily,28], [Charles,31]]              |[Emily, 28]             |
# |3  |[[Mary,20], [Henry,26]]                             |[Mary, 20]              |
# +---+----------------------------------------------------+------------------------+
In the above example, sort_array sorts the array of structs in descending order; since structs compare field by field, this is a descending sort on the name. The [0] index then selects the first element of the sorted array, i.e. the struct whose name sorts highest alphabetically.
2. Sorting Rows by a Struct Field with a Secondary Sort
If you want to order the rows of the DataFrame by a specific field of the top struct, with a secondary sort on another column, you can combine sort_array and indexing with the orderBy function.
df.withColumn("people_sorted", sort_array(col("people"), asc=False)) \
.withColumn("people_top", col("people_sorted")[0]) \
.orderBy(col("people_top._2").asc(), col("id").asc()) \
.show(truncate=False)
# +---+----------------------------------------------------+---------------------------------------+-----------+
# |id |people                                              |people_sorted                          |people_top |
# +---+----------------------------------------------------+---------------------------------------+-----------+
# |3  |[[Mary,20], [Henry,26]]                             |[[Mary,20], [Henry,26]]                |[Mary, 20] |
# |1  |[[John,25], [Alice,30], [Bob,22]]                   |[[John,25], [Bob,22], [Alice,30]]      |[John, 25] |
# |2  |[[David,35], [Emily,28], [Charles,31]]              |[[Emily,28], [David,35], [Charles,31]] |[Emily, 28]|
# +---+----------------------------------------------------+---------------------------------------+-----------+
In the above example, we first sort the array of structs in descending order using sort_array. The [0] index then selects the first struct of each sorted array into people_top. Finally, orderBy sorts the DataFrame by the second field of that struct (_2, the age) and by the id column, both in ascending order.