PySpark: selecting and accessing data

  1. where(condition) – Alias for filter(); filters rows based on a condition.
  2. limit(num) – Limits the number of rows in the DataFrame.
  3. distinct() – Returns only distinct rows in the DataFrame.
  4. drop(*cols) – Drops one or more columns from a DataFrame.
  5. withColumn(colName, col) – Adds or replaces a column in the DataFrame.
  6. withColumnRenamed(existing, new) – Renames an existing column.
  7. groupBy(*cols) – Groups rows by specified columns.
  8. orderBy(*cols) – Sorts the DataFrame by specified columns.
  9. sort(*cols) – Alias for orderBy(); sorts the DataFrame.
  10. show(n=20, truncate=True) – Displays the top n rows of the DataFrame.
  11. head(n=None) – Returns the first n rows as a list of Rows, or a single Row when n is omitted.
  12. first() – Returns the first row of the DataFrame.
  13. collect() – Collects all rows of the DataFrame into a list.
  14. count() – Returns the number of rows in the DataFrame.
  15. dropDuplicates(*cols) – Drops duplicate rows based on selected columns.
  16. sample(withReplacement, fraction) – Returns a random sample of the DataFrame.
  17. join(other, on, how='inner') – Joins two DataFrames on a column with a specified join type.
  18. alias(alias_name) – Gives an alias to the DataFrame for use in joins or subqueries.
  19. union(other) – Appends rows from another DataFrame with the same schema.
  20. coalesce(numPartitions) – Reduces the number of partitions in the DataFrame.
  21. repartition(numPartitions) – Increases or changes the number of partitions.

Here is the same list again, this time with examples and outputs. All of the examples run against a small sample DataFrame; a sketch of that setup follows.
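A minimal setup sketch, reconstructed from the outputs shown below (the source does not show the original data, so these exact values are an assumption):

   from pyspark.sql import SparkSession

   # Create (or reuse) a SparkSession
   spark = SparkSession.builder.appName('selecting-data').getOrCreate()

   # Sample DataFrame used by most of the examples below
   df = spark.createDataFrame([(1, 25), (2, 35), (3, 40)], ['id', 'age'])

   # Second DataFrame used by the join and union examples
   df2 = spark.createDataFrame([(1, 20), (2, 30)], ['id', 'age'])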

  1. where(condition) – Alias for filter(); filters rows based on a condition. Example:
   df.where(df['age'] > 30).show()

Output:

   +---+---+
   | id|age|
   +---+---+
   |  2| 35|
   |  3| 40|
   +---+---+
  2. limit(num) – Limits the number of rows in the DataFrame. Example:
   df.limit(2).show()

Output:

   +---+---+
   | id|age|
   +---+---+
   |  1| 25|
   |  2| 35|
   +---+---+
  3. distinct() – Returns only distinct rows in the DataFrame. Example:
   df.distinct().show()

Output:

   +---+---+
   | id|age|
   +---+---+
   |  1| 25|
   |  2| 35|
   |  3| 40|
   +---+---+
  4. drop(*cols) – Drops one or more columns from a DataFrame. Example:
   df.drop('age').show()

Output:

   +---+
   | id|
   +---+
   |  1|
   |  2|
   |  3|
   +---+
  5. withColumn(colName, col) – Adds or replaces a column in the DataFrame. Example:
   df.withColumn('new_age', df['age'] + 10).show()

Output:

   +---+---+-------+
   | id|age|new_age|
   +---+---+-------+
   |  1| 25|     35|
   |  2| 35|     45|
   |  3| 40|     50|
   +---+---+-------+
  6. withColumnRenamed(existing, new) – Renames an existing column. Example:
   df.withColumnRenamed('age', 'years').show()

Output:

   +---+-----+
   | id|years|
   +---+-----+
   |  1|   25|
   |  2|   35|
   |  3|   40|
   +---+-----+
  7. groupBy(*cols) – Groups rows by specified columns for aggregation (another aggregation is sketched after the output). Example:
   df.groupBy('age').count().show()

Output:

   +---+-----+
   |age|count|
   +---+-----+
   | 25|    1|
   | 35|    1|
   | 40|    1|
   +---+-----+
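count() is just one possible aggregation. A minimal sketch of another, assuming the same df, chains agg() with a function from pyspark.sql.functions:

   from pyspark.sql import functions as F

   # Average age per group instead of a row count
   df.groupBy('age').agg(F.avg('age').alias('avg_age')).show()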
  8. orderBy(*cols) – Sorts the DataFrame by specified columns. Example:
   df.orderBy(df['age'].desc()).show()

Output:

   +---+---+
   | id|age|
   +---+---+
   |  3| 40|
   |  2| 35|
   |  1| 25|
   +---+---+
  9. sort(*cols) – Alias for orderBy(); sorts the DataFrame. Example:
   df.sort('age').show()

Output:

   +---+---+
   | id|age|
   +---+---+
   |  1| 25|
   |  2| 35|
   |  3| 40|
   +---+---+
  10. show(n=20, truncate=True) – Displays the top n rows of the DataFrame (the truncate flag is sketched after the output). Example:
   df.show(2)

Output:

   +---+---+
   | id|age|
   +---+---+
   |  1| 25|
   |  2| 35|
   +---+---+
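The truncate flag controls whether long cell values are cut off when printed. A minimal sketch (with these short values it makes no visible difference):

   # Print full cell contents instead of truncating them at 20 characters
   df.show(2, truncate=False)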
  11. head(n=None) – Returns the first n rows as a list of Rows, or a single Row when n is omitted. Example:
   df.head(1)

Output:

   [Row(id=1, age=25)]
  12. first() – Returns the first row of the DataFrame. Example:
   df.first()

Output:

   Row(id=1, age=25)
  13. collect() – Collects all rows of the DataFrame into a list. Example:
   df.collect()

Output:

   [Row(id=1, age=25), Row(id=2, age=35), Row(id=3, age=40)]
  14. count() – Returns the number of rows in the DataFrame. Example:
   df.count()

Output:

   3
  15. dropDuplicates(*cols) – Drops duplicate rows based on selected columns. Example:
   df.dropDuplicates(['age']).show()

Output:

   +---+---+
   | id|age|
   +---+---+
   |  1| 25|
   |  2| 35|
   |  3| 40|
   +---+---+
  16. sample(withReplacement, fraction) – Returns a random sample of the DataFrame; results vary from run to run (a seed sketch follows the output). Example:
   df.sample(False, 0.5).show()

Output:

   +---+---+
   | id|age|
   +---+---+
   |  1| 25|
   +---+---+
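Because sampling is random, the optional seed argument makes the result repeatable across runs; the value 42 here is only an illustration:

   # A fixed seed returns the same sample on every run
   df.sample(False, 0.5, seed=42).show()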
  17. join(other, on, how='inner') – Joins two DataFrames on a column with a specified join type (see the note after the output on joining by column name). Example:
   df.join(df2, df['id'] == df2['id'], 'inner').show()

Output:

   +---+---+---+---+
   | id|age| id|age|
   +---+---+---+---+
   |  1| 25|  1| 20|
   |  2| 35|  2| 30|
   +---+---+---+---+
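The column expression above keeps both id columns in the result. One way to avoid the duplicate, assuming the same df and df2, is to join on the column name, which PySpark then emits only once:

   # Joining on the column name keeps a single id column
   df.join(df2, 'id', 'inner').show()

The result then has one id column followed by the age columns from both sides.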
  18. alias(alias_name) – Gives an alias to the DataFrame for use in self-joins or subqueries; reference the aliased columns with col(). Example:
   from pyspark.sql.functions import col
   df.alias('a').join(df.alias('b'), col('a.id') == col('b.id')).show()

Output:

   +---+---+---+---+
   | id|age| id|age|
   +---+---+---+---+
   |  1| 25|  1| 25|
   |  2| 35|  2| 35|
   |  3| 40|  3| 40|
   +---+---+---+---+
  19. union(other) – Appends rows from another DataFrame with the same schema. Example:
   df.union(df2).show()

Output:

   +---+---+
   | id|age|
   +---+---+
   |  1| 25|
   |  2| 35|
   |  3| 40|
   |  1| 20|
   |  2| 30|
   +---+---+
  20. coalesce(numPartitions) – Reduces the number of partitions in the DataFrame without a full shuffle. Example:
   df.coalesce(1).rdd.getNumPartitions()

Output:

   1
  21. repartition(numPartitions) – Increases or changes the number of partitions, triggering a full shuffle. Example:
   df.repartition(3).rdd.getNumPartitions()

Output:

   3

Each of these examples demonstrates how to select, access, and manipulate data in PySpark DataFrames.
