PySpark: selecting and accessing data

  1. where(condition) – Alias for filter(); filters rows based on a condition.
  2. limit(num) – Limits the number of rows in the DataFrame.
  3. distinct() – Returns only distinct rows in the DataFrame.
  4. drop(*cols) – Drops one or more columns from a DataFrame.
  5. withColumn(colName, col) – Adds or replaces a column in the DataFrame.
  6. withColumnRenamed(existing, new) – Renames an existing column.
  7. groupBy(*cols) – Groups rows by specified columns.
  8. orderBy(*cols) – Sorts the DataFrame by specified columns.
  9. sort(*cols) – Alias for orderBy(); sorts the DataFrame.
  10. show(n=20, truncate=True) – Displays the top n rows of the DataFrame.
  11. head(n) – Returns the first n rows of the DataFrame as a list of Row objects.
  12. first() – Returns the first row of the DataFrame.
  13. collect() – Collects all rows of the DataFrame into a list.
  14. count() – Returns the number of rows in the DataFrame.
  15. dropDuplicates(subset) – Drops duplicate rows, optionally considering only the given subset of columns.
  16. sample(withReplacement, fraction) – Returns a random sample of the DataFrame.
  17. join(other, on, how='inner') – Joins two DataFrames on a column with a specified join type.
  18. alias(alias_name) – Gives an alias to the DataFrame for use in joins or subqueries.
  19. union(other) – Appends rows from another DataFrame with the same schema.
  20. coalesce(numPartitions) – Reduces the number of partitions in the DataFrame.
  21. repartition(numPartitions) – Increases or changes the number of partitions.

Here is the same list of functions again, this time with examples and outputs:
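
The DataFrames used in these examples are never defined in the post itself. The outputs below are consistent with a small setup along the following lines (a minimal sketch; the values, the column names id and age, and the second DataFrame df2 used by the join and union examples are all reconstructed from the example outputs):

   from pyspark.sql import SparkSession

   spark = SparkSession.builder.appName("pyspark-select-examples").getOrCreate()

   # Assumed sample data, reconstructed from the example outputs below
   df = spark.createDataFrame([(1, 25), (2, 35), (3, 40)], ["id", "age"])

   # Assumed second DataFrame used by the join and union examples
   df2 = spark.createDataFrame([(1, 20), (2, 30)], ["id", "age"])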

  1. where(condition) – Alias for filter(); filters rows based on a condition. Example:
   df.where(df['age'] > 30).show()

Output:

   +---+-----+
   | id|  age|
   +---+-----+
   |  2|   35|
   |  3|   40|
   +---+-----+
  2. limit(num) – Limits the number of rows in the DataFrame. Example:
   df.limit(2).show()

Output:

   +---+-----+
   | id|  age|
   +---+-----+
   |  1|   25|
   |  2|   35|
   +---+-----+
  3. distinct() – Returns only distinct rows in the DataFrame. Example:
   df.distinct().show()

Output:

   +---+-----+
   | id|  age|
   +---+-----+
   |  1|   25|
   |  2|   35|
   |  3|   40|
   +---+-----+
  4. drop(*cols) – Drops one or more columns from a DataFrame. Example:
   df.drop('age').show()

Output:

   +---+
   | id|
   +---+
   |  1|
   |  2|
   |  3|
   +---+
  5. withColumn(colName, col) – Adds or replaces a column in the DataFrame. Example:
   df.withColumn('new_age', df['age'] + 10).show()

Output:

   +---+---+-------+
   | id|age|new_age|
   +---+---+-------+
   |  1| 25|     35|
   |  2| 35|     45|
   |  3| 40|     50|
   +---+---+-------+
  6. withColumnRenamed(existing, new) – Renames an existing column. Example:
   df.withColumnRenamed('age', 'years').show()

Output:

   +---+-----+
   | id|years|
   +---+-----+
   |  1|   25|
   |  2|   35|
   |  3|   40|
   +---+-----+
  7. groupBy(*cols) – Groups rows by specified columns. Example:
   df.groupBy('age').count().show()

Output:

   +---+-----+
   |age|count|
   +---+-----+
   | 25|    1|
   | 35|    1|
   | 40|    1|
   +---+-----+
  8. orderBy(*cols) – Sorts the DataFrame by specified columns. Example:
   df.orderBy(df['age'].desc()).show()

Output:

   +---+---+
   | id|age|
   +---+---+
   |  3| 40|
   |  2| 35|
   |  1| 25|
   +---+---+
  9. sort(*cols) – Alias for orderBy(); sorts the DataFrame. Example:
   df.sort('age').show()

Output:

   +---+---+
   | id|age|
   +---+---+
   |  1| 25|
   |  2| 35|
   |  3| 40|
   +---+---+
  10. show(n=20, truncate=True) – Displays the top n rows of the DataFrame. Example:
   df.show(2)

Output:

   +---+---+
   | id|age|
   +---+---+
   |  1| 25|
   |  2| 35|
   +---+---+
  11. head(n) – Returns the first n rows of the DataFrame as a list. Example: df.head(1) Output: [Row(id=1, age=25)]
  12. first() – Returns the first row of the DataFrame. Example: df.first() Output: Row(id=1, age=25)
  13. collect() – Collects all rows of the DataFrame into a list. Example: df.collect() Output: [Row(id=1, age=25), Row(id=2, age=35), Row(id=3, age=40)]
  14. count() – Returns the number of rows in the DataFrame. Example: df.count() Output: 3
  15. dropDuplicates(subset) – Drops duplicate rows, optionally considering only the given subset of columns. Example:
   df.dropDuplicates(['age']).show()

Output:

   +---+---+
   | id|age|
   +---+---+
   |  1| 25|
   |  2| 35|
   |  3| 40|
   +---+---+
  16. sample(withReplacement, fraction) – Returns a random sample of the DataFrame; the rows returned vary between runs unless a seed is supplied. Example:
   df.sample(False, 0.5).show()

Output (one possible result):

   +---+---+
   | id|age|
   +---+---+
   |  1| 25|
   +---+---+
  17. join(other, on, how='inner') – Joins two DataFrames on a column with a specified join type. Joining on a Column expression keeps the id column from both sides; see the sketch after this list for joining on a column name instead. Example:
   df.join(df2, df['id'] == df2['id'], 'inner').show()

Output:

   +---+---+---+---+
   | id|age| id|age|
   +---+---+---+---+
   |  1| 25|  1| 20|
   |  2| 35|  2| 30|
   +---+---+---+---+
  18. alias(alias_name) – Gives an alias to the DataFrame for use in joins or subqueries, so the two sides of a self-join can be distinguished as col('a.id') and col('b.id') (col is imported from pyspark.sql.functions). Example:
   df.alias('a').join(df.alias('b'), col('a.id') == col('b.id')).show()

Output:

   +---+---+---+---+
   | id|age| id|age|
   +---+---+---+---+
   |  1| 25|  1| 25|
   |  2| 35|  2| 35|
   |  3| 40|  3| 40|
   +---+---+---+---+
  19. union(other) – Appends rows from another DataFrame with the same schema. Example:
   df.union(df2).show()

Output:

   +---+---+
   | id|age|
   +---+---+
   |  1| 25|
   |  2| 35|
   |  3| 40|
   |  1| 20|
   |  2| 30|
   +---+---+
  20. coalesce(numPartitions) – Reduces the number of partitions in the DataFrame. Example: df.coalesce(1).rdd.getNumPartitions() Output: 1
  21. repartition(numPartitions) – Increases or changes the number of partitions. Example: df.repartition(3).rdd.getNumPartitions() Output: 3
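
As mentioned in the join example, joining on a Column expression keeps the id column from both sides. If only one copy of the join key is needed, you can pass the column name (or a list of names) as the on argument instead. A minimal sketch with the assumed df and df2 from above:

   df.join(df2, on='id', how='inner').show()

Output (with the assumed data):

   +---+---+---+
   | id|age|age|
   +---+---+---+
   |  1| 25| 20|
   |  2| 35| 30|
   +---+---+---+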

Each of these examples demonstrates how to select, access, and manipulate data in PySpark DataFrames.
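
Since each of these methods returns a new DataFrame, they can also be chained together. A minimal sketch, again using the assumed df from above, that combines a few of them (truncate=False tells show() to print cell values in full rather than cutting them off at 20 characters):

   (df.where(df['age'] > 20)                        # filter rows
      .withColumn('age_next_year', df['age'] + 1)   # add a derived column
      .orderBy(df['age'].desc())                    # sort by age, descending
      .show(truncate=False))                        # display without truncating values

With the assumed data this prints all three rows, oldest first, each with an extra age_next_year column.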

