PySpark data frame creation song

September 23, 2024 by Kurious Fox

List of PySpark functions with short descriptions:

SparkSession.builder.getOrCreate()
Initializes or retrieves a Spark session.
spark.createDataFrame()
Creates a DataFrame from a data source (e.g., RDD, list, Pandas DataFrame).
spark.createDataFrame(pandas_DataFrame)
Converts a Pandas DataFrame to a PySpark DataFrame.
DataFrame.show()
Displays the first 20 rows of the DataFrame in tabular format.
DataFrame.printSchema()
Prints the schema of the DataFrame, including column names and data types.
DataFrame.show(1)
Displays the first row of the DataFrame.
DataFrame.columns
Returns a list of column names in the DataFrame.
DataFrame.select("a", "b", "c").describe().show()
Selects specific columns, computes summary statistics, and shows the result.
DataFrame.take(1)
Returns a list containing the first row of the DataFrame as a Row object.
DataFrame.toPandas()
Converts a PySpark DataFrame to a Pandas DataFrame.
DataFrame.filter()
Filters rows in the DataFrame based on a condition.

Examples:

1. `SparkSession.builder.getOrCreate()`

Description: Initializes or retrieves a Spark session.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

Output: (No output for session creation)
A Spark session is started, which will be used to create and manipulate DataFrames.

2. `spark.createDataFrame()`

Description: Creates a DataFrame from a data source (e.g., RDD, list, Pandas DataFrame).

data = [("John", 25), ("Anna", 30), ("Mike", 22)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()

Output:

+----+---+
|Name|Age|
+----+---+
|John| 25|
|Anna| 30|
|Mike| 22|
+----+---+

3. `spark.createDataFrame(pandas_df)`

Description: Converts a Pandas DataFrame to a PySpark DataFrame.

import pandas as pd

pandas_df = pd.DataFrame({"Name": ["John", "Anna", "Mike"], "Age": [25, 30, 22]})
df = spark.createDataFrame(pandas_df)
df.show()

Output:

+----+---+
|Name|Age|
+----+---+
|John| 25|
|Anna| 30|
|Mike| 22|
+----+---+

4. `df.show()`

Description: Displays the first 20 rows of the DataFrame in tabular format.

df.show()

Output:

+----+---+
|Name|Age|
+----+---+
|John| 25|
|Anna| 30|
|Mike| 22|
+----+---+

5. `df.printSchema()`

Description: Prints the schema of the DataFrame, including column names and data types.

df.printSchema()

Output:

root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)

6. `df.show(1)`

Description: Displays the first row of the DataFrame.

df.show(1)

Output:

+----+---+
|Name|Age|
+----+---+
|John| 25|
+----+---+

7. `df.columns`

Description: Returns a list of column names in the DataFrame.

df.columns

Output:

['Name', 'Age']

8. `df.select("a", "b", "c").describe().show()`

Description: Selects specific columns, computes summary statistics, and shows the result.

df.select("Age").describe().show()

Output:

+-------+------------------+
|summary|               Age|
+-------+------------------+
|  count|                 3|
|   mean|25.666666666666668|
| stddev| 4.041451884327381|
|    min|                22|
|    max|                30|
+-------+------------------+
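The `stddev` row reported by `describe()` is the sample standard deviation (denominator n - 1). One way to sanity-check the summary for a small column is to recompute it in plain Python; this sketch uses the standard `statistics` module and the three ages from the example, with variable names chosen only for illustration.

```python
import statistics

ages = [25, 30, 22]  # the Age values from the example DataFrame

mean_age = statistics.mean(ages)     # arithmetic mean
stddev_age = statistics.stdev(ages)  # sample standard deviation (n - 1 denominator)

print(round(mean_age, 4), round(stddev_age, 4))  # 25.6667 4.0415
```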

9. `df.take(1)`

Description: Returns a list containing the first row of the DataFrame as a Row object.

df.take(1)

Output:

[Row(Name='John', Age=25)]

10. `df.toPandas()`

Description: Converts a PySpark DataFrame to a Pandas DataFrame.

pandas_df = df.toPandas()
pandas_df

Output:

   Name  Age
0  John   25
1  Anna   30
2  Mike   22

11. `DataFrame.filter()`

Description: Filters rows in the DataFrame based on a condition.

filtered_df = df.filter(df.Age > 25)
filtered_df.show()

Output:

+----+---+
|Name|Age|
+----+---+
|Anna| 30|
+----+---+
