Group By in Pandas
Group data based on one or more columns in Pandas
The groupby()
function in Pandas is used to group data based on one or more columns and then apply an aggregation
function to summarize or analyze the grouped data.
Grouping by a Column and Aggregating
You can group by a column and apply an aggregation function (e.g., sum, mean, count).
import pandas as pd
# Sample Data
data = {
"Department": ["HR", "IT", "IT", "HR", "Finance", "Finance", "HR"],
"Salary": [50000, 60000, 70000, 55000, 65000, 72000, 53000],
"Experience": [5, 7, 10, 3, 8, 12, 2]
}
df = pd.DataFrame(data)
print(df)
# Group by "Department" and Sum Salaries
df.groupby("Department")["Salary"].sum()
Output
Department
Finance 137000
HR 158000
IT 130000
Name: Salary, dtype: int64
Using Multiple Aggregation Functions
You can apply multiple aggregation functions to grouped data.
df.groupby("Department")["Salary"].agg(["sum", "mean", "max"])
Output
sum mean max
Department
Finance 137000 68500.0 72000
HR 158000 52666.7 55000
IT 130000 65000.0 70000
Grouping by Multiple Columns
You can group by multiple columns.
df.groupby(["Department", "Experience"])["Salary"].sum()
Output
Department Experience
Finance 8 65000
12 72000
HR 2 53000
3 55000
5 50000
IT 7 60000
10 70000
Name: Salary, dtype: int64
Applying Multiple Aggregations to Multiple Columns
You can use .agg()
to apply different aggregation functions to multiple columns.
df.groupby("Department").agg({
"Salary": ["sum", "mean"],
"Experience": ["max", "min"]
})
Output
Salary Experience
sum mean max min
Department
Finance 137000 68500.0 12 8
HR 158000 52666.7 5 2
IT 130000 65000.0 10 7
apply()
for Custom Functions
Using If you need custom aggregation, you can use .apply()
.
df.groupby("Department")["Salary"].apply(lambda x: x.max() - x.min())
Output
Department
Finance 7000
HR 5000
IT 10000
Name: Salary, dtype: int64
- This calculates the salary range for each department.
Resetting Index
By default, groupby()
creates a multi-indexed DataFrame. You can use .reset_index()
to convert it back to a normal DataFrame.
df.groupby("Department")["Salary"].sum().reset_index()
Output
Department Salary
0 Finance 137000
1 HR 158000
2 IT 130000
transform()
for Group-wise Operations
Using The .transform()
method returns values of the same size as the original DataFrame.
df["Average_Salary"] = df.groupby("Department")["Salary"].transform("mean")
print(df)
Output
Department Salary Experience Average_Salary
0 HR 50000 5 52666.67
1 IT 60000 7 65000.00
2 IT 70000 10 65000.00
3 HR 55000 3 52666.67
4 Finance 65000 8 68500.00
5 Finance 72000 12 68500.00
6 HR 53000 2 52666.67
- This adds a column with the average salary per department.
filter()
to Filter Groups
Using The .filter()
method keeps only groups that satisfy a condition.
df_filtered = df.groupby("Department").filter(lambda x: x["Salary"].mean() > 60000)
print(df_filtered)
Output
Department Salary Experience
4 Finance 65000 8
5 Finance 72000 12
1 IT 60000 7
2 IT 70000 10
- This keeps only Finance and IT departments where the average salary is greater than 60,000.
ngroup()
to Assign Group Numbers
Using ngroup()
assigns a unique number to each group.
df["Group_Number"] = df.groupby("Department").ngroup()
print(df)
Output
Department Salary Experience Group_Number
0 HR 50000 5 1
1 IT 60000 7 2
2 IT 70000 10 2
3 HR 55000 3 1
4 Finance 65000 8 0
5 Finance 72000 12 0
6 HR 53000 2 1
- Each department is assigned a unique group number.
Summary Table
Method | Description |
---|---|
df.groupby("col")["col2"].sum() | Aggregate values grouped by a column |
df.groupby(["col1", "col2"])["col3"].mean() | Group by multiple columns |
df.groupby("col").agg({"col2": ["sum", "mean"]}) | Multiple aggregations per group |
df.groupby("col")["col2"].apply(lambda x: x.max() - x.min()) | Custom aggregation |
df.groupby("col").transform("mean") | Group-wise operation returning original size |
df.groupby("col").filter(lambda x: condition) | Keep only specific groups |
df.groupby("col").ngroup() | Assign group numbers |