PySpark startswith multiple values

In PySpark, the Column.startswith() method checks whether a string column starts with a given prefix and returns a boolean column. It accepts only one prefix per call, so to test several values you build one startswith() condition per value and combine the conditions with the | (logical OR) operator.

Here is an example of using startswith() with multiple values in PySpark:

    
      from functools import reduce
      from operator import or_

      from pyspark.sql import SparkSession
      from pyspark.sql.functions import col

      # Create a SparkSession
      spark = SparkSession.builder.getOrCreate()

      # Create a DataFrame with a string column
      data = [("apple",), ("banana",), ("orange",), ("grape",)]
      df = spark.createDataFrame(data, ["fruit"])

      # Define the multiple values to check
      values = ["app", "ban"]

      # startswith() takes a single prefix, so build one condition per value
      # and combine them with a logical OR
      condition = reduce(or_, [col("fruit").startswith(value) for value in values])
      result = df.withColumn("starts_with_values", condition)

      # Show the result
      result.show()
    
  

The output of the above code will be:

    
      +------+------------------+
      | fruit|starts_with_values|
      +------+------------------+
      | apple|              true|
      |banana|              true|
      |orange|             false|
      | grape|             false|
      +------+------------------+
    
  

As you can see, the starts_with_values column is true for fruits whose names start with any of the specified values (“app” or “ban”) and false otherwise.
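
Another way to test several prefixes at once is a single regular expression passed to rlike(). The sketch below is only an illustration: the helper names build_prefix_pattern and flag_prefixes are hypothetical, and it assumes the prefixes are plain strings that Python's re.escape() escapes in a way Spark's Java regex engine also accepts.

```python
import re


def build_prefix_pattern(prefixes):
    """Return a regex matching strings that start with any of the prefixes."""
    if not prefixes:
        raise ValueError("at least one prefix is required")
    # Escape each prefix so characters such as "." or "+" are matched literally.
    escaped = [re.escape(prefix) for prefix in prefixes]
    # "^" anchors the match at the start; "|" separates the alternatives.
    return "^(" + "|".join(escaped) + ")"


def flag_prefixes(df, column, prefixes, new_column="starts_with_values"):
    """Add a boolean column that is true when `column` starts with any prefix.

    Hypothetical helper: `df` is expected to be a PySpark DataFrame.
    """
    # Imported here so the pattern helper works without PySpark installed.
    from pyspark.sql.functions import col

    pattern = build_prefix_pattern(prefixes)
    return df.withColumn(new_column, col(column).rlike(pattern))
```

With the fruit DataFrame from the example, flag_prefixes(df, "fruit", ["app", "ban"]) should mark apple and banana as true and leave orange and grape false. A single pattern can be easier to pass around when the list of prefixes is long, at the cost of having to escape special characters.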
