Aggregations such as averages, minimums, maximums, counts and sums are common in SQL queries. We can return values such as the number of rows or the average of a column, or group data to find aggregate values per subset. SQL aggregations AVG(...)
, SUM(...)
, COUNT(...)
, MAX(...)
, and MIN(...)
are translated to pandas using Numpy's mean
, sum
, size
, max
, and min
. The GROUP BY
function in SQL has its equivalent in pandas in the .groupby()
method, which works very similar as we'll see below. We use the tips dataset to demonstrate.
Aggregations in pandas
Whatever aggregation we want to perform in SQL, it works syntactically the same: We select an aggregation, pass it a column as an argument, and specify our table and any optional filters. To illustrate:
SELECT
AVG(total_bill)
FROM
tips;
This code would return a single value. In pandas, we do just the same. For the above SQL query, we translate to pandas as follows:
tips['tip'].mean()
If we want to perform multiple aggregations in a single query, we simply add these to our SQL query:
SELECT
AVG(total_bill),
AVG(tip)
FROM
tips;
In pandas, we need to apply the .agg()
method if we want to expand beyond a single aggregation. The following code is an example of how to apply .agg()
to achieve the same results as the SQL query above.
tips.agg({'total_bill': ['mean'], 'tip': ['mean']})
The .agg()
method accepts (among other things) a dictionary with column names as keys, and a list of functions as values. This means that we can add other aggregations we want to perform to the lists. For example, if we want to have the minimum, average, and maximum tip in our dataset:
tips.agg({'tip': ['min', 'mean', 'max']})
Using GROUP BY
in pandas
We can drill down in our analysis by looking at aggregations per group, such as what is the average tip per day? Or what is the average total bill and tip of a dinner compared to a lunch? To find an answer to the latter, we can query our tips dataset as follows:
SELECT
time,
AVG(total_bill),
AVG(tip)
FROM
tips
GROUP BY
time;
We can translate this to pandas by applying the .groupby()
method to our dataset, and then perform our aggregations using .agg()
tips
.groupby('time')
.agg({'total_bill': ['mean'], 'tip': ['mean']})
.groupby()
works the same as in SQL, as it requires every non aggregated column as an argument. Dealing with more than one column to group data on, is done using a list. For example, if we want to add a grouping based on day:
tips
.groupby(['time, day'])
.agg({'total_bill': ['mean'], 'tip': ['mean']})
It often makes sense to include a row count into your grouping, to make sure whether that we're dealing with a large enough amount of data to draw some insights from. We do this by including COUNT(*)
into the mix, represented in pandas by Numpy's 'size'
. However, since we're passing *
as an argument (and in pandas *
is translated to not specifying column names), we need to include 'size'
into the aggregation list of one of the other columns to make it work. This looks as follows:
tips
.groupby(['time', 'day'])
.agg({'total_bill': ['size', 'mean'], 'tip': ['mean']})
Aliases for aggregations in pandas
In SQL, not using aliasing on aggregations causes the column name in the result set to be something like 'AVG(tip)'. Using aliasing, we can give these columns something more helpfull, such as 'Average_Tip':
SELECT
time,
day,
AVG(tip) AS average_tip,
AVG(total_bill) AS average_total_bill
FROM
tips
GROUP BY
time,
day;
In pandas, we use the same methods as before, but we are going to pass our arguments differently to make sure our aggregations are named as we want:
tips
.groupby(['time', 'day'])
.agg(average_tip=('tip', 'mean'), average_total_bill=('total_bill', 'mean'))
In the .agg()
method, we specify for each aggregation a column alias, and assign it a tuple containing the column we want to aggregate, and the aggregation function to use. Be aware that if we provide an alias to one aggregation, we need to name each aggregation, or this method will not return the results you might expect.
As you have seen, using aggregations in pandas, either directly or using the .agg()
method works slightly different compared to SQL, but everything is there. Be sure to try out some of these examples in the converter, or check out the SQL2pandas Cheat Sheet and other great resources.