PySpark explode multiple columns
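Exploding multiple columns in PySpark turns the elements of array (or map) columns into separate rows. The explode function does this for a single column. Because Spark allows only one generator such as explode per select clause, exploding several array columns together takes an extra step, such as zipping the arrays with arrays_zip before exploding.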
For example, let’s say you have a DataFrame with two array columns: ‘col1’ and ‘col2’. Each cell of ‘col1’ contains an array of values, and each cell of ‘col2’ contains an array of corresponding values. You want to explode both columns into separate rows.
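from pyspark.sql import SparkSession
from pyspark.sql.functions import arrays_zip, col, explode
# Create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Create a DataFrame with two array columns of matching length in each row
data = [(["a", "b", "c"], [1, 2, 3]),
        (["d", "e"], [4, 5])]
df = spark.createDataFrame(data, ["col1", "col2"])
# Two explode calls in one select raise an AnalysisException, because Spark
# allows only one generator per select clause. Instead, zip the arrays
# element-wise into an array of structs and explode that single column.
df_zipped = df.select(explode(arrays_zip("col1", "col2")).alias("zipped"))
# Pull the struct fields back out into separate columns
df_exploded = df_zipped.select(col("zipped.col1").alias("col1_exploded"),
                               col("zipped.col2").alias("col2_exploded"))
# Show the result
df_exploded.show()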
The output will be:
+-------------+-------------+
|col1_exploded|col2_exploded|
+-------------+-------------+
|            a|            1|
|            b|            2|
|            c|            3|
|            d|            4|
|            e|            5|
+-------------+-------------+
In the above example, the elements of ‘col1’ and ‘col2’ are paired by position and expanded into separate rows: the first input row yields three rows and the second yields two, for five rows in the resulting DataFrame.