DataFrameGroupBy.aggregate
Aggregate using one or more operations over the specified axis.
Parameters
func_or_funcs : dict, str or list
    A dict mapping from column name (string) to aggregate functions (string or list of strings).
Returns
Series or DataFrame
    The return can be:
    * Series : when DataFrame.agg is called with a single function
    * DataFrame : when DataFrame.agg is called with several functions
See also
pyspark.pandas.Series.groupby
pyspark.pandas.DataFrame.groupby
Notes
agg is an alias for aggregate. Use the alias.
Examples
>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'A': [1, 1, 2, 2],
...                    'B': [1, 2, 3, 4],
...                    'C': [0.362, 0.227, 1.267, -0.562]},
...                   columns=['A', 'B', 'C'])

>>> df
   A  B      C
0  1  1  0.362
1  1  2  0.227
2  2  3  1.267
3  2  4 -0.562
Different aggregations per column
>>> aggregated = df.groupby('A').agg({'B': 'min', 'C': 'sum'})
>>> aggregated[['B', 'C']].sort_index()
   B      C
A
1  1  0.589
2  3  0.705

>>> aggregated = df.groupby('A').agg({'B': ['min', 'max']})
>>> aggregated.sort_index()
    B
  min  max
A
1   1    2
2   3    4

>>> aggregated = df.groupby('A').agg('min')
>>> aggregated.sort_index()
   B      C
A
1  1  0.227
2  3 -0.562

>>> aggregated = df.groupby('A').agg(['min', 'max'])
>>> aggregated.sort_index()
    B           C
  min  max    min    max
A
1   1    2  0.227  0.362
2   3    4 -0.562  1.267
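
The dict form of the argument (column name mapped to one function name or a list of names) can be illustrated with a small pure-Python sketch. This is only a model of the semantics, not how pandas-on-Spark implements agg: the helper group_aggregate and the AGG_FUNCS lookup are hypothetical names, rows are plain dicts rather than a DataFrame, and results are always keyed by (column, function) pairs, whereas the real API returns flat column names when every value in the dict is a single string.

```python
from collections import defaultdict

# Hypothetical lookup from aggregate-function names to Python callables.
# pandas-on-Spark resolves such names to Spark aggregations instead.
AGG_FUNCS = {"min": min, "max": max, "sum": sum}


def group_aggregate(rows, key, spec):
    """Aggregate `rows` (a list of dicts) grouped by the `key` column.

    `spec` maps a column name to one function name or a list of names,
    mirroring the dict form accepted by `agg`.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    result = {}
    for group_key in sorted(groups):
        out = {}
        for column, funcs in spec.items():
            values = [row[column] for row in groups[group_key]]
            # A single function name is treated like a one-element list.
            names = [funcs] if isinstance(funcs, str) else funcs
            for name in names:
                out[(column, name)] = AGG_FUNCS[name](values)
        result[group_key] = out
    return result


# Same data as the examples above, as plain Python rows.
rows = [
    {"A": 1, "B": 1, "C": 0.362},
    {"A": 1, "B": 2, "C": 0.227},
    {"A": 2, "B": 3, "C": 1.267},
    {"A": 2, "B": 4, "C": -0.562},
]
aggregated = group_aggregate(rows, "A", {"B": ["min", "max"], "C": "sum"})
```

Each group key ends up with one value per (column, function) pair, which corresponds to the two-level column header shown in the list-valued examples above.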
To control the output names with different aggregations per column, pandas-on-Spark also supports ‘named aggregation’ or nested renaming in .agg. It can also be used when applying multiple aggregation functions to specific columns.
>>> aggregated = df.groupby('A').agg(b_max=ps.NamedAgg(column='B', aggfunc='max'))
>>> aggregated.sort_index()
   b_max
A
1      2
2      4

>>> aggregated = df.groupby('A').agg(b_max=('B', 'max'), b_min=('B', 'min'))
>>> aggregated.sort_index()
   b_max  b_min
A
1      2      1
2      4      3

>>> aggregated = df.groupby('A').agg(b_max=('B', 'max'), c_min=('C', 'min'))
>>> aggregated.sort_index()
   b_max  c_min
A
1      2  0.227
2      4 -0.562
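
Named aggregation can be read as a mapping from an output column name to a (column, aggfunc) pair. The pure-Python sketch below models that reading under stated assumptions: named_aggregate and AGG_FUNCS are hypothetical names used only for illustration, rows are plain dicts, and nothing here reflects how pandas-on-Spark evaluates the call on Spark.

```python
from collections import defaultdict

# Hypothetical lookup from aggregate-function names to Python callables.
AGG_FUNCS = {"min": min, "max": max}


def named_aggregate(rows, key, **named):
    """Group `rows` by `key` and compute one output per keyword argument.

    Each keyword is an output name; its value is a (column, aggfunc) pair,
    mirroring the tuple form of named aggregation.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    result = {}
    for group_key in sorted(groups):
        result[group_key] = {
            out_name: AGG_FUNCS[func]([row[column] for row in groups[group_key]])
            for out_name, (column, func) in named.items()
        }
    return result


# Same data as the examples above, as plain Python rows.
rows = [
    {"A": 1, "B": 1, "C": 0.362},
    {"A": 1, "B": 2, "C": 0.227},
    {"A": 2, "B": 3, "C": 1.267},
    {"A": 2, "B": 4, "C": -0.562},
]
named = named_aggregate(rows, "A", b_max=("B", "max"), c_min=("C", "min"))
```

Because the output name is the keyword itself, the result has flat column names even when several functions are applied to the same source column.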