A PySpark DataFrame has no built-in method that reports its size in bytes; in particular, memoryUsage() does not exist on pyspark.sql.DataFrame (memory_usage() is a pandas API). For a DataFrame small enough to collect to the driver, a practical approach is to convert it with toPandas() and call memory_usage(deep=True), which returns the memory usage of each column in bytes (plus the index). For larger DataFrames, an estimate can be read from Spark's query plan statistics, shown after the example below.
Here is an example:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("DataFrameSize").getOrCreate()
# Create a DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["name", "age"])
# Collect the small DataFrame to pandas and measure each column in bytes
# (deep=True also counts the contents of the string column)
memory_usage = df.toPandas().memory_usage(deep=True)
# Display the per-column memory usage (a pandas Series indexed by column name)
print(memory_usage)
The print(memory_usage) call outputs one row per column (plus the index) with its size in bytes, along these lines (the exact numbers depend on the Python and pandas versions):
Index    128
name     186
age       24
dtype: int64
Keep in mind that this measures the driver-side pandas copy of the data and is only practical for DataFrames small enough to collect. The figures are approximate, and the actual memory usage inside Spark depends on the data and the Spark configuration.
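For DataFrames that are too large to bring to the driver, a rough cluster-side estimate can be taken from the statistics of Spark's optimized logical plan. The sketch below is one common approach; it relies on the internal _jdf / queryExecution() objects, which are not part of the public PySpark API and may need adjusting between Spark versions:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DataFrameSizeEstimate").getOrCreate()
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["name", "age"])
# Cache and materialize the DataFrame so Spark records its in-memory size,
# which then feeds the optimizer's statistics
df.cache()
df.count()
# Read the estimated size in bytes from the optimized plan's statistics
# (internal API; the exact call chain can vary across Spark versions)
size_in_bytes = df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes()
print(f"Estimated size: {size_in_bytes} bytes")
Because this reads the optimizer's estimate rather than measuring objects directly, treat the result as an order-of-magnitude figure; its advantage is that no data has to be moved to the driver.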