PySpark data frame creation song

List of PySpark functions with short descriptions:

  1. SparkSession.builder.getOrCreate()
    Initializes or retrieves a Spark session.
  2. spark.createDataFrame()
    Creates a DataFrame from a data source (e.g., RDD, list, Pandas DataFrame).
  3. spark.createDataFrame(pandas_DataFrame)
    Converts a Pandas DataFrame to a PySpark DataFrame.
  4. DataFrame.show()
    Displays up to the first 20 rows (the default) of the DataFrame in tabular format.
  5. DataFrame.printSchema()
    Prints the schema of the DataFrame, including column names and data types.
  6. DataFrame.show(1)
    Displays the first row of the DataFrame.
  7. DataFrame.columns
    Returns a list of column names in the DataFrame.
  8. DataFrame.select("a", "b", "c").describe().show()
    Selects specific columns, computes summary statistics, and shows the result.
  9. DataFrame.take(1)
    Returns a list containing the first row of the DataFrame as a Row object.
  10. DataFrame.toPandas()
    Converts a PySpark DataFrame to a Pandas DataFrame.
  11. DataFrame.filter()
    Filters rows in the DataFrame based on a condition.

Examples:

1. SparkSession.builder.getOrCreate()

Description: Initializes or retrieves a Spark session.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

Output: (No output for session creation)
A Spark session is started, which will be used to create and manipulate DataFrames.


2. spark.createDataFrame()

Description: Creates a DataFrame from a data source (e.g., RDD, list, Pandas DataFrame).

data = [("John", 25), ("Anna", 30), ("Mike", 22)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()

Output:

+----+---+
|Name|Age|
+----+---+
|John| 25|
|Anna| 30|
|Mike| 22|
+----+---+

3. spark.createDataFrame(pandas_df)

Description: Converts a Pandas DataFrame to a PySpark DataFrame.

import pandas as pd

pandas_df = pd.DataFrame({"Name": ["John", "Anna", "Mike"], "Age": [25, 30, 22]})
df = spark.createDataFrame(pandas_df)
df.show()

Output:

+----+---+
|Name|Age|
+----+---+
|John| 25|
|Anna| 30|
|Mike| 22|
+----+---+

4. df.show()

Description: Displays up to the first 20 rows (the default) of the DataFrame in tabular format.

df.show()

Output:

+----+---+
|Name|Age|
+----+---+
|John| 25|
|Anna| 30|
|Mike| 22|
+----+---+

5. df.printSchema()

Description: Prints the schema of the DataFrame, including column names and data types.

df.printSchema()

Output:

root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)

6. df.show(1)

Description: Displays the first row of the DataFrame.

df.show(1)

Output:

+----+---+
|Name|Age|
+----+---+
|John| 25|
+----+---+

7. df.columns

Description: Returns a list of column names in the DataFrame.

df.columns

Output:

['Name', 'Age']

8. df.select("a", "b", "c").describe().show()

Description: Selects specific columns, computes summary statistics, and shows the result.

df.select("Age").describe().show()

Output:

+-------+------------------+
|summary|               Age|
+-------+------------------+
|  count|                 3|
|   mean|25.666666666666668|
| stddev| 4.041451884327381|
|    min|                22|
|    max|                30|
+-------+------------------+
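
The stddev row reported by describe() is the sample standard deviation (it divides by n - 1). The summary statistics for the three Age values can be cross-checked in plain Python, without Spark; this is only a sanity check using the standard library, and the variable names are illustrative.

```python
import statistics

ages = [25, 30, 22]

mean_age = statistics.mean(ages)          # arithmetic mean: 77 / 3
sample_std = statistics.stdev(ages)       # sample standard deviation (n - 1)
population_std = statistics.pstdev(ages)  # population standard deviation (n), for comparison

print(round(mean_age, 2))        # 25.67
print(round(sample_std, 2))      # 4.04
print(round(population_std, 2))  # 3.3
print(min(ages), max(ages))      # 22 30
```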

9. df.take(1)

Description: Returns a list containing the first row of the DataFrame as a Row object.

df.take(1)

Output:

[Row(Name='John', Age=25)]

10. df.toPandas()

Description: Converts a PySpark DataFrame to a Pandas DataFrame.

pandas_df = df.toPandas()
pandas_df

Output:

   Name  Age
0  John   25
1  Anna   30
2  Mike   22

11. df.filter()

Description: Filters rows in the DataFrame based on a condition.

filtered_df = df.filter(df.Age > 25)
filtered_df.show()

Output:

+----+---+
|Name|Age|
+----+---+
|Anna| 30|
+----+---+

