In my previous post on SQL to pandas, we went through some basic operations, such as selecting columns, performing some string transformations, and renaming columns. In this post we'll look at conditional columns (SQL's CASE WHEN
) and calculated columns. All information is available in summarized form as a SQL to pandas cheat sheet, available on the resources page, as well as at the end of this post.
Conditional columns in pandas
Selecting a conditional column in SQL follows the CASE WHEN ... THEN ... ELSE ...
syntax, where we can add as many conditions as we like. For example:
SELECT
tip,
CASE WHEN tip > 10 THEN "Big tip" ELSE "Regular tip" END AS comment
FROM
tips;
Effectively, we are creating a new column that depends on an existing column in the dataset. In pandas, we would use the .assign()
method and .apply
method to achieve this:
tips
.assign(comment = tips['tip']
.apply(lambda tip: 'Big tip' if tip > 10 else 'Regular tip'))[['tip', 'comment']]
Let’s break this statement down:
- We start by specifying the DataFrame:
tips
- To this DataFrame, we assign a new column using the
.assign
method. - In the
.assign
method, we start with the declaring the column name:comment
. - The new column
'comment'
gets its values by applying a comparison lambda function to the existing column:'tip'
. - For each row, the function evaluates whether the
'tip'
satisfies the condition of being greater than 10, and assigns a value to'comment'
accordingly. - Select
'tip'
and the newly created'comment'
column.
Now, let’s say we want to add more conditions than just one. For example:
SELECT
tip,
CASE WHEN tip > 10 THEN "Big tip" WHEN tip < 1 THEN "Small tip" ELSE "Regular tip" END AS comment
FROM
tips;
Normally, in Python you would approach this using if ... elif ... else...
. However, if we want to apply this to our lambda function, the elif
part needs to be written as a nested if-statement like so:
tips
.assign(comment = tips['tip']
.apply(lambda tip: 'big tip' if tip > 10 else ('small tip' if tip < 1 else 'regular tip')))[['tip', 'comment']]
It works similarly to the previous situation with one condition:
- As always, we start by specifying the DataFrame:
tips
- To this DataFrame, you assign a new column using the
.assign
method. - In the
.assign
method, we start with the declaring the column name:comment
. - The new column
'comment'
gets its values by applying a function to the existing column:'tip'
. - For each row, the function evaluates whether
'tip'
satisfies the condition of being greater than 10. If not, the function evaluates the nested condition of'tip'
being smaller than 1. After this is done, it assigns a value to'comment'
accordingly. - Select
'tip'
and the newly created'comment'
column.
Calculated columns in pandas
SQL allows us to perform all sorts of mathematical operations on columns, and pandas is no different. For example, if we want to calculate the tip per person at the table in SQL:
SELECT
*,
tip/size AS tip_per_person
FROM
tips;
In pandas we would again take the '.assign'
route. The code is structured in a similar manner as the previous examples with the .assign
method.
tips
.assign(tips_per_person = tips['tip']/tips['size'])
We can easily modify the code to perform different kinds of mathematical operations, such as divisions, additions or subtractions.
I hope you got some insights in how pandas operates compared to what you’re maybe already familiar to in SQL You can always play around with the SQL to pandas converter on this site to discover and learn more about pandas. Also, there are plenty of great resources out there to further advance your Python and pandas skills. Below, you’ll find some useful links on the topic of this post. Thanks for reading and happy Pythoning!