PySpark DataFrame size in bytes

PySpark's DataFrame API does not expose a direct method for reporting a DataFrame's size in bytes (there is no memoryUsage() on pyspark.sql.DataFrame). A common way to get an estimate is to read the size statistic that the Catalyst optimizer attaches to the DataFrame's optimized logical plan, which is reachable through the DataFrame's underlying Java object. Other options are to cache the DataFrame and check its in-memory size in the Spark UI's Storage tab, or, for small DataFrames, to convert to pandas and use pandas' memory_usage() method.

Here is an example:


from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("DataFrameSize").getOrCreate()

# Create a DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["name", "age"])

# Read the size estimate that the Catalyst optimizer attaches to the
# optimized logical plan. _jdf is an internal handle to the underlying
# Java DataFrame, so this may change between Spark versions
# (it works on Spark 2.4+ / 3.x).
stats = df._jdf.queryExecution().optimizedPlan().stats()
size_in_bytes = stats.sizeInBytes().longValue()

print(f"Estimated size: {size_in_bytes} bytes")

The print statement shows the optimizer's estimate of the DataFrame's size in bytes. Keep in mind that this is an estimate derived from plan statistics, not a measurement of actual memory, and the exact number depends on the data, the data source, and the Spark version. On Spark 3.x you can see the same statistic without touching internal APIs by running df.explain(mode="cost"), which prints the optimized logical plan together with its sizeInBytes. If the DataFrame is cached (df.cache() followed by an action such as df.count()), the Storage tab of the Spark UI reports its actual in-memory size.
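
For a small DataFrame, another option is to collect it to the driver and measure it with pandas, which gives an exact per-column byte count. This is only a sketch, continuing from the df created above, and it should not be used on large DataFrames because toPandas() loads every row into driver memory:

# Continuing from the df created above. toPandas() collects all rows to the
# driver, so this is only appropriate for small DataFrames.
pdf = df.toPandas()

# pandas reports per-column memory usage in bytes; deep=True also counts
# the contents of string (object) columns.
per_column = pdf.memory_usage(deep=True)
print(per_column)

print(f"Total: {per_column.sum()} bytes")

Note that this measures the pandas representation on the driver, which will generally differ from the size Spark itself uses for the distributed DataFrame.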
