List of PySpark functions with short descriptions:
- SparkSession.builder.getOrCreate()
Initializes or retrieves a Spark session.
- spark.createDataFrame()
Creates a DataFrame from a data source (e.g., RDD, list, Pandas DataFrame).
- spark.createDataFrame(pandas_DataFrame)
Converts a Pandas DataFrame to a PySpark DataFrame.
- DataFrame.show()
Displays the first 20 rows of the DataFrame in tabular format.
- DataFrame.printSchema()
Prints the schema of the DataFrame, including column names and data types.
- DataFrame.show(1)
Displays the first row of the DataFrame.
- DataFrame.columns
Returns a list of column names in the DataFrame.
- DataFrame.select("a", "b", "c").describe().show()
Selects specific columns, computes summary statistics, and shows the result.
- DataFrame.take(1)
Returns a list containing the first row of the DataFrame.
- DataFrame.toPandas()
Converts a PySpark DataFrame to a Pandas DataFrame.
- DataFrame.filter()
Filters rows in the DataFrame based on a condition.
Examples:
1. SparkSession.builder.getOrCreate()
Description: Initializes or retrieves a Spark session.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
Output: (No output for session creation)
A Spark session is started, which will be used to create and manipulate DataFrames.
2. spark.createDataFrame()
Description: Creates a DataFrame from a data source (e.g., RDD, list, Pandas DataFrame).
data = [("John", 25), ("Anna", 30), ("Mike", 22)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()
Output:
+----+---+
|Name|Age|
+----+---+
|John| 25|
|Anna| 30|
|Mike| 22|
+----+---+
3. spark.createDataFrame(pandas_df)
Description: Converts a Pandas DataFrame to a PySpark DataFrame.
import pandas as pd
pandas_df = pd.DataFrame({"Name": ["John", "Anna", "Mike"], "Age": [25, 30, 22]})
df = spark.createDataFrame(pandas_df)
df.show()
Output:
+----+---+
|Name|Age|
+----+---+
|John| 25|
|Anna| 30|
|Mike| 22|
+----+---+
4. df.show()
Description: Displays the first 20 rows of the DataFrame in tabular format.
df.show()
Output:
+----+---+
|Name|Age|
+----+---+
|John| 25|
|Anna| 30|
|Mike| 22|
+----+---+
5. df.printSchema()
Description: Prints the schema of the DataFrame, including column names and data types.
df.printSchema()
Output:
root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)
6. df.show(1)
Description: Displays the first row of the DataFrame.
df.show(1)
Output:
+----+---+
|Name|Age|
+----+---+
|John| 25|
+----+---+
7. df.columns
Description: Returns a list of column names in the DataFrame.
df.columns
Output:
['Name', 'Age']
8. df.select("a", "b", "c").describe().show()
Description: Selects specific columns, computes summary statistics, and shows the result.
df.select("Age").describe().show()
Output:
+-------+------------------+
|summary|               Age|
+-------+------------------+
|  count|                 3|
|   mean|25.666666666666668|
| stddev| 4.041451884327381|
|    min|                22|
|    max|                30|
+-------+------------------+
9. df.take(1)
Description: Returns a list containing the first row of the DataFrame as a Row object.
df.take(1)
Output:
[Row(Name='John', Age=25)]
10. df.toPandas()
Description: Converts a PySpark DataFrame to a Pandas DataFrame.
pandas_df = df.toPandas()
pandas_df
Output:
   Name  Age
0  John   25
1  Anna   30
2  Mike   22
11. df.filter()
Description: Filters rows in the DataFrame based on a condition.
filtered_df = df.filter(df.Age > 25)
filtered_df.show()
Output:
+----+---+
|Name|Age|
+----+---+
|Anna| 30|
+----+---+