where(condition) – Alias for filter(); filters rows based on a condition.
limit(num) – Limits the number of rows in the DataFrame.
distinct() – Returns only distinct rows in the DataFrame.
drop(*cols) – Drops one or more columns from a DataFrame.
withColumn(colName, col) – Adds or replaces a column in the DataFrame.
withColumnRenamed(existing, new) – Renames an existing column.
groupBy(*cols) – Groups rows by specified columns.
orderBy(*cols) – Sorts the DataFrame by specified columns.
sort(*cols) – Alias for orderBy(); sorts the DataFrame.
show(n=20, truncate=True) – Displays the top n rows of the DataFrame.
head(n=None) – Returns the first n rows of the DataFrame as a list; with no argument, returns a single Row.
first() – Returns the first row of the DataFrame.
collect() – Collects all rows of the DataFrame into a list.
count() – Returns the number of rows in the DataFrame.
dropDuplicates(*cols) – Drops duplicate rows based on selected columns.
sample(withReplacement, fraction) – Returns a random sample of the DataFrame.
join(other, on, how='inner') – Joins two DataFrames on a column with a specified join type.
alias(alias_name) – Gives an alias to the DataFrame for use in joins or subqueries.
union(other) – Appends rows from another DataFrame with the same schema.
coalesce(numPartitions) – Reduces the number of partitions in the DataFrame.
repartition(numPartitions) – Increases or changes the number of partitions.
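Here is the same list with an example and its output for each function. The examples assume a DataFrame df with columns id and age and the rows (1, 25), (2, 35), (3, 40), plus a second DataFrame df2 with the rows (1, 20), (2, 30).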
where(condition)
– Alias for filter(); filters rows based on a condition. Example:
df.where(df['age'] > 30).show()
Output:
+---+---+
| id|age|
+---+---+
|  2| 35|
|  3| 40|
+---+---+
limit(num)
– Limits the number of rows in the DataFrame. Example:
df.limit(2).show()
Output:
+---+---+
| id|age|
+---+---+
|  1| 25|
|  2| 35|
+---+---+
distinct()
– Returns only distinct rows in the DataFrame. Example:
df.distinct().show()
Output:
+---+---+
| id|age|
+---+---+
|  1| 25|
|  2| 35|
|  3| 40|
+---+---+
drop(*cols)
– Drops one or more columns from a DataFrame. Example:
df.drop('age').show()
Output:
+---+
| id|
+---+
|  1|
|  2|
|  3|
+---+
withColumn(colName, col)
– Adds or replaces a column in the DataFrame. Example:
df.withColumn('new_age', df['age'] + 10).show()
Output:
+---+---+-------+
| id|age|new_age|
+---+---+-------+
|  1| 25|     35|
|  2| 35|     45|
|  3| 40|     50|
+---+---+-------+
withColumnRenamed(existing, new)
– Renames an existing column. Example:
df.withColumnRenamed('age', 'years').show()
Output:
+---+-----+
| id|years|
+---+-----+
|  1|   25|
|  2|   35|
|  3|   40|
+---+-----+
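groupBy(*cols)
– Groups rows by specified columns so that an aggregation such as count() can be applied per group. The order of the result rows is not guaranteed. Example:
df.groupBy('age').count().show()
Output:
+---+-----+
|age|count|
+---+-----+
| 25|    1|
| 35|    1|
| 40|    1|
+---+-----+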
orderBy(*cols)
– Sorts the DataFrame by specified columns. Example:
df.orderBy(df['age'].desc()).show()
Output:
+---+---+
| id|age|
+---+---+
|  3| 40|
|  2| 35|
|  1| 25|
+---+---+
sort(*cols)
– Alias for orderBy(); sorts the DataFrame. Example:
df.sort('age').show()
Output:
+---+---+
| id|age|
+---+---+
|  1| 25|
|  2| 35|
|  3| 40|
+---+---+
show(n=20, truncate=True)
– Displays the top n rows of the DataFrame. Example:
df.show(2)
Output:
+---+---+
| id|age|
+---+---+
|  1| 25|
|  2| 35|
+---+---+
head(n=None)
– Returns the first n rows of the DataFrame as a list. Called with no argument, it returns a single Row instead of a list. Example:
df.head(1)
Output:
[Row(id=1, age=25)]
first()
– Returns the first row of the DataFrame. Example:
df.first()
Output:
Row(id=1, age=25)
collect()
– Collects all rows of the DataFrame into a list on the driver, so use it only when the result fits in driver memory. Example:
df.collect()
Output:
[Row(id=1, age=25), Row(id=2, age=35), Row(id=3, age=40)]
count()
– Returns the number of rows in the DataFrame. Example:
df.count()
Output:
3
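dropDuplicates(*cols)
– Drops duplicate rows, comparing only the selected columns (all columns if none are given). The column names are passed as a list here. Which of the duplicate rows is kept is not guaranteed. Example:
df.dropDuplicates(['age']).show()
Output:
+---+---+
| id|age|
+---+---+
|  1| 25|
|  2| 35|
|  3| 40|
+---+---+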
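sample(withReplacement, fraction)
– Returns a random sample of the DataFrame. The fraction is the expected share of rows, not an exact count, so the number of rows returned can vary. Example:
df.sample(False, 0.5).show()
Output (one possible result; rows differ between runs unless a seed is passed):
+---+---+
| id|age|
+---+---+
|  1| 25|
+---+---+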
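join(other, on, how='inner')
– Joins two DataFrames on a column name, a list of names, or a join expression, with a specified join type. When the condition is an expression, both id and age columns from each side appear in the result. Example:
df.join(df2, df['id'] == df2['id'], 'inner').show()
Output:
+---+---+---+---+
| id|age| id|age|
+---+---+---+---+
|  1| 25|  1| 20|
|  2| 35|  2| 30|
+---+---+---+---+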
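alias(alias_name)
– Gives an alias to the DataFrame for use in joins or subqueries. In a self-join, refer to each side through its alias; a condition such as df['id'] == df['id'] compares a column with itself and does not use the aliases. Example (col is imported from pyspark.sql.functions):
df.alias('a').join(df.alias('b'), col('a.id') == col('b.id')).show()
Output:
+---+---+---+---+
| id|age| id|age|
+---+---+---+---+
|  1| 25|  1| 25|
|  2| 35|  2| 35|
|  3| 40|  3| 40|
+---+---+---+---+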
union(other)
– Appends rows from another DataFrame with the same schema. Columns are matched by position, not by name, and duplicate rows are kept. Example:
df.union(df2).show()
Output:
+---+---+
| id|age|
+---+---+
|  1| 25|
|  2| 35|
|  3| 40|
|  1| 20|
|  2| 30|
+---+---+
coalesce(numPartitions)
– Reduces the number of partitions in the DataFrame without a full shuffle; it cannot increase the count. Example:
df.coalesce(1).rdd.getNumPartitions()
Output:
1
repartition(numPartitions)
– Redistributes the data into numPartitions partitions with a full shuffle; the count can go up or down. Example:
df.repartition(3).rdd.getNumPartitions()
Output:
3
Each of these examples demonstrates how to select, access, and manipulate data in PySpark DataFrames.