PySpark explode multiple columns

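Exploding columns in PySpark turns the elements of complex-typed columns, such as arrays and maps, into separate rows. The explode function from pyspark.sql.functions does this for a single column. Spark allows only one such generator per select clause, so exploding several arrays in step requires combining them first, for example with arrays_zip.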

For example, suppose you have a DataFrame with two array columns, ‘col1’ and ‘col2’, where the element at each position of ‘col1’ corresponds to the element at the same position of ‘col2’. You want to explode both columns together so that each pair of corresponding elements becomes one row.

    
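      from pyspark.sql import SparkSession
      from pyspark.sql.functions import arrays_zip, explode
      
      # Create a SparkSession
      spark = SparkSession.builder.getOrCreate()
      
      # Create a DataFrame with two parallel array columns
      data = [(["a", "b", "c"], [1, 2, 3]),
              (["d", "e"], [4, 5])]
      df = spark.createDataFrame(data, ["col1", "col2"])
      
      # Spark allows only one generator (such as explode) per select clause,
      # so two explode() calls in the same select raise an AnalysisException.
      # Instead, zip the arrays element by element into an array of structs,
      # explode that single array, then expand the struct into two columns.
      df_exploded = (
          df.select(explode(arrays_zip("col1", "col2")).alias("zipped"))
            .select("zipped.*")
            .toDF("col1_exploded", "col2_exploded")
      )
      
      # Show the result
      df_exploded.show()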
    
  

The output will be:

    
      +-------------+-------------+
      |col1_exploded|col2_exploded|
      +-------------+-------------+
      |            a|            1|
      |            b|            2|
      |            c|            3|
      |            d|            4|
      |            e|            5|
      +-------------+-------------+
    
  

In the above example, the elements of ‘col1’ and ‘col2’ are paired by position and each pair becomes its own row: three rows from the first input row and two from the second, for five rows in the resulting DataFrame.
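
The positional pairing behind this result can be modeled in plain Python, without a Spark session. This is only an illustrative sketch: explode_zipped is a hypothetical helper name, not a PySpark function, and it uses itertools.zip_longest to mimic how arrays_zip fills in null when one array is shorter than the other.

```python
from itertools import zip_longest


def explode_zipped(rows):
    """Pair the elements of two lists by position; emit one output row per pair."""
    out = []
    for col1, col2 in rows:
        # zip_longest pads the shorter list with None,
        # mirroring the null padding that arrays_zip applies.
        out.extend(zip_longest(col1, col2))
    return out


rows = [(["a", "b", "c"], [1, 2, 3]),
        (["d", "e"], [4, 5])]
paired = explode_zipped(rows)
print(paired)
# [('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]

# Arrays of unequal length: the missing element becomes None
print(explode_zipped([(["x", "y"], [9])]))
# [('x', 9), ('y', None)]
```

The second call shows why it is worth checking that the arrays in each row have the same length before exploding them together: otherwise some rows will carry a null in one of the columns.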
