PySpark: selecting and accessing data

  1. where(condition) – Alias for filter(); filters rows based on a condition.
  2. limit(num) – Limits the number of rows in the DataFrame.
  3. distinct() – Returns only distinct rows in the DataFrame.
  4. drop(*cols) – Drops one or more columns from a DataFrame.
  5. withColumn(colName, col) – Adds or replaces a column in the DataFrame.
  6. withColumnRenamed(existing, new) – Renames an existing column.
  7. groupBy(*cols) – Groups rows by specified columns.
  8. orderBy(*cols) – Sorts the DataFrame by specified columns.
  9. sort(*cols) – Alias for orderBy(); sorts the DataFrame.
  10. show(n=20, truncate=True) – Displays the top n rows of the DataFrame.
  11. head(n=None) – Returns the first n rows as a list of Rows, or a single Row when n is omitted.
  12. first() – Returns the first row of the DataFrame.
  13. collect() – Collects all rows of the DataFrame into a list.
  14. count() – Returns the number of rows in the DataFrame.
  15. dropDuplicates(*cols) – Drops duplicate rows based on selected columns.
  16. sample(withReplacement, fraction) – Returns a random sample of the DataFrame.
  17. join(other, on, how='inner') – Joins two DataFrames on a column with a specified join type.
  18. alias(alias_name) – Gives an alias to the DataFrame for use in joins or subqueries.
  19. union(other) – Appends rows from another DataFrame with the same schema.
  20. coalesce(numPartitions) – Reduces the number of partitions in the DataFrame.
  21. repartition(numPartitions) – Increases or changes the number of partitions.

Here is the same list again, this time with examples and outputs. All of the examples run against a small sample DataFrame; a sketch of that setup follows.
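A minimal setup sketch, reconstructed from the outputs shown below (the source does not show the original data, so these exact values are an assumption):

   from pyspark.sql import SparkSession

   # Create (or reuse) a SparkSession
   spark = SparkSession.builder.appName('selecting-data').getOrCreate()

   # Sample DataFrame used by most of the examples below
   df = spark.createDataFrame([(1, 25), (2, 35), (3, 40)], ['id', 'age'])

   # Second DataFrame used by the join and union examples
   df2 = spark.createDataFrame([(1, 20), (2, 30)], ['id', 'age'])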

  1. where(condition) – Alias for filter(); filters rows based on a condition. Example:
   df.where(df['age'] > 30).show()

Output:

   +---+---+
   | id|age|
   +---+---+
   |  2| 35|
   |  3| 40|
   +---+---+
  2. limit(num) – Limits the number of rows in the DataFrame. Example:
   df.limit(2).show()

Output:

   +---+---+
   | id|age|
   +---+---+
   |  1| 25|
   |  2| 35|
   +---+---+
  3. distinct() – Returns only distinct rows in the DataFrame. Example:
   df.distinct().show()

Output:

   +---+---+
   | id|age|
   +---+---+
   |  1| 25|
   |  2| 35|
   |  3| 40|
   +---+---+
  4. drop(*cols) – Drops one or more columns from a DataFrame. Example:
   df.drop('age').show()

Output:

   +---+
   | id|
   +---+
   |  1|
   |  2|
   |  3|
   +---+
  5. withColumn(colName, col) – Adds or replaces a column in the DataFrame. Example:
   df.withColumn('new_age', df['age'] + 10).show()

Output:

   +---+---+-------+
   | id|age|new_age|
   +---+---+-------+
   |  1| 25|     35|
   |  2| 35|     45|
   |  3| 40|     50|
   +---+---+-------+
  6. withColumnRenamed(existing, new) – Renames an existing column. Example:
   df.withColumnRenamed('age', 'years').show()

Output:

   +---+-----+
   | id|years|
   +---+-----+
   |  1|   25|
   |  2|   35|
   |  3|   40|
   +---+-----+
  7. groupBy(*cols) – Groups rows by specified columns for aggregation (another aggregation is sketched after the output). Example:
   df.groupBy('age').count().show()

Output:

   +---+-----+
   |age|count|
   +---+-----+
   | 25|    1|
   | 35|    1|
   | 40|    1|
   +---+-----+
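count() is just one possible aggregation. A minimal sketch of another, assuming the same df, chains agg() with a function from pyspark.sql.functions:

   from pyspark.sql import functions as F

   # Average age per group instead of a row count
   df.groupBy('age').agg(F.avg('age').alias('avg_age')).show()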
  8. orderBy(*cols) – Sorts the DataFrame by specified columns. Example:
   df.orderBy(df['age'].desc()).show()

Output:

   +---+---+
   | id|age|
   +---+---+
   |  3| 40|
   |  2| 35|
   |  1| 25|
   +---+---+
  9. sort(*cols) – Alias for orderBy(); sorts the DataFrame. Example:
   df.sort('age').show()

Output:

   +---+---+
   | id|age|
   +---+---+
   |  1| 25|
   |  2| 35|
   |  3| 40|
   +---+---+
  10. show(n=20, truncate=True) – Displays the top n rows of the DataFrame (the truncate flag is sketched after the output). Example:
   df.show(2)

Output:

   +---+---+
   | id|age|
   +---+---+
   |  1| 25|
   |  2| 35|
   +---+---+
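The truncate flag controls whether long cell values are cut off when printed. A minimal sketch (with these short values it makes no visible difference):

   # Print full cell contents instead of truncating them at 20 characters
   df.show(2, truncate=False)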
  11. head(n=None) – Returns the first n rows as a list of Rows, or a single Row when n is omitted. Example:
   df.head(1)

Output:

   [Row(id=1, age=25)]
  12. first() – Returns the first row of the DataFrame. Example:
   df.first()

Output:

   Row(id=1, age=25)
  13. collect() – Collects all rows of the DataFrame into a list. Example:
   df.collect()

Output:

   [Row(id=1, age=25), Row(id=2, age=35), Row(id=3, age=40)]
  14. count() – Returns the number of rows in the DataFrame. Example:
   df.count()

Output:

   3
  15. dropDuplicates(*cols) – Drops duplicate rows based on selected columns. Example:
   df.dropDuplicates(['age']).show()

Output:

   +---+---+
   | id|age|
   +---+---+
   |  1| 25|
   |  2| 35|
   |  3| 40|
   +---+---+
  16. sample(withReplacement, fraction) – Returns a random sample of the DataFrame; results vary from run to run (a seed sketch follows the output). Example:
   df.sample(False, 0.5).show()

Output:

   +---+---+
   | id|age|
   +---+---+
   |  1| 25|
   +---+---+
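Because sampling is random, the optional seed argument makes the result repeatable across runs; the value 42 here is only an illustration:

   # A fixed seed returns the same sample on every run
   df.sample(False, 0.5, seed=42).show()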
  17. join(other, on, how='inner') – Joins two DataFrames on a column with a specified join type (see the note after the output on joining by column name). Example:
   df.join(df2, df['id'] == df2['id'], 'inner').show()

Output:

   +---+---+---+---+
   | id|age| id|age|
   +---+---+---+---+
   |  1| 25|  1| 20|
   |  2| 35|  2| 30|
   +---+---+---+---+
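The column expression above keeps both id columns in the result. One way to avoid the duplicate, assuming the same df and df2, is to join on the column name, which PySpark then emits only once:

   # Joining on the column name keeps a single id column
   df.join(df2, 'id', 'inner').show()

The result then has one id column followed by the age columns from both sides.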
  18. alias(alias_name) – Gives an alias to the DataFrame for use in self-joins or subqueries; reference the aliased columns with col(). Example:
   from pyspark.sql.functions import col
   df.alias('a').join(df.alias('b'), col('a.id') == col('b.id')).show()

Output:

   +---+---+---+---+
   | id|age| id|age|
   +---+---+---+---+
   |  1| 25|  1| 25|
   |  2| 35|  2| 35|
   |  3| 40|  3| 40|
   +---+---+---+---+
  19. union(other) – Appends rows from another DataFrame with the same schema. Example:
   df.union(df2).show()

Output:

   +---+---+
   | id|age|
   +---+---+
   |  1| 25|
   |  2| 35|
   |  3| 40|
   |  1| 20|
   |  2| 30|
   +---+---+
  20. coalesce(numPartitions) – Reduces the number of partitions in the DataFrame without a full shuffle. Example:
   df.coalesce(1).rdd.getNumPartitions()

Output:

   1
  21. repartition(numPartitions) – Increases or changes the number of partitions, triggering a full shuffle. Example:
   df.repartition(3).rdd.getNumPartitions()

Output:

   3

Each of these examples demonstrates how to select, access, and manipulate data in PySpark DataFrames.
