Without filtering, we would never be able to pick up the signal from the noise. In SQL, the WHERE
clause takes care of filtering for us, allowing us to specify what we want to see, and don't want to see. Some standard ways of filtering include direct comparisons of values using =
, >
, <
, <>
, >=
or <=
, using ranges with BETWEEN
, checking for presence using IN
or NOT IN
, or seeing whether a value starts with, ends with, or contains a given string. In this post we'll look at how to apply these powerful filtering methods in pandas, using various examples from the tips dataset.
SQL's WHERE
in pandas: direct comparisons
Let's start by filtering using simple comparison. If we want to compare two values in SQL or pandas, we use mostly the same syntax except for 'is' and 'is not': In SQL, we use =
and <>
, whereas in pandas, we use ==
and !=
, respectively. In SQL, we could have the following query:
SELECT
*
FROM
tips
WHERE
tip > 5;
SQL's syntax is very straightforward; even reading out loud the statement still makes sense to most people --just make sure you say 'all' instead of 'star'. For pandas, the code is a little less intuitive at first:
tips[tips['tip'] > 5]
Breaking it down:
- We start with specifying the table:
tips
- We add brackets, and within the brackets specify our filter:
- Again, starting with the table:
tips
- We select the column we want to filter:
'tip'
- Finally, we add an comparison with a value:
> 5
What we apply here is what's called 'Boolean Indexing': We are looking for rows in tips
where the condition tips['tip'] > 5
returns True
. If you query just tips['tip'] > 5
, you would get a pandas Series with a bunch of True
and False
values on 'tip'
. If we pass this 'mask' to the __getitem__
(the brackets after the table name), we'll see only the rows where the 'mask' value for 'tip'
is True
. This way we can filter using any comparison or function that returns True
or False
, as we will see later in other examples further down.
SQL's WHERE
in pandas: BETWEEN
If we want rows from our tips dataset that have a tip between two defined values, we use SQL's BETWEEN
:
SELECT
*
FROM
tips
WHERE
tip BETWEEN 2.50 AND 5;
The easiest way to translate this to pandas, is by using two different conditions that both need to apply:
tips[(tips['tip']>=2.50) & (tips['tip']<=5)]
Again, we are usign Boolean Indexing, but now we are coupling the conditions using &
, pandas equivalent for AND
. If both conditions return True
, the row is returned; if either or both are False
, the row is not returned.
SQL's WHERE
in pandas: LIKE
In SQL, rows can be filtered on conditions such as LIKE '%ample'
, LIKE 'examp%'
, or '%xampl%'
. To perform the same types of filtering in pandas, we have to apply different string functions: .endswith()
, .startswith()
and .contains()
. Let's do some filtering using SQL's LIKE
:
SELECT
*
FROM
tips
WHERE time LIKE '%nner';
SELECT
*
FROM
tips
WHERE time LIKE 'Din%';
SELECT
*
FROM
tips
WHERE time LIKE '%inn%';
In pandas, we'll have to apply some string methods to achieve the same results. The pandas equivalents for the SQL above goes as follows:
tips[tips['time']
.str
.endswith('inner')]
tips[tips['time']
.str
.startswith('Din')]
tips[tips['time']
.str
.contains('inn')]
Again, we are using Boolean Indexing, where we return a Series of True
and False
values to pass as a filter mask. Note the use of .str
, as we can apply string functions to strings only. The functions are fairly self explanatory, with names that quite literally describe what you expect them to do: return a bool, based on the argument passed.
SQL's WHERE
in pandas: IN
Instead of endlessly chaining conditions using OR
, we can use IN (...)
in SQL. Pandas has the very similar .isin()
method, which takes a list of values to include in our DataFrame. Let's say we only want weekend days to appear in our dataset, in SQL:
SELECT
*
FROM
tips
WHERE day IN ('Fri', 'Sat', 'Sun');
In pandas, this would translate to:
tips[tips['day']
.isin(['Fri', 'Sat', 'Sun'])]
The .isin()
method returns, just like the other methods we used for filtering, a pandas Series bools to indicate whether a row from our tips DataFrame is going to show up or not.
SQL's AND
and OR
in pandas: &
and |
If you want to combine filters or add different conditions, SQL's AND and OR have their equivalents in pandas in &
and |
. Luckily, this works pretty much the same as combining works in SQL. I would advise to use brackets to be certain that filters are interacting in the right way, but if you know how to apply brackets properly, you won't need to do anything different than before.
What about NOT
?
SQL's NOT is very straightforward to use, but unfortunately won't work in pandas. In pandas, we would use the ~
(pronounce: tilde) operator. This operator inverts whatever bool any of the filtering functions returns (True
becomes False
, and vice versa). So how does this work in practice? In SQL:
SELECT
*
FROM
tips
WHERE day NOT LIKE '%u%';
In pandas, we use the same syntax as the non-negative query, but make sure we inverse its results using ~
.
tips[~tips['day']
.str
.contains('u')]
Filtering data to what matters is an essential skill for every data user. In this post, we've looked at some useful ways to filter DataFrames effectively, using Boolean Indexing. However, there are more methods in our pandas toolbox that we can use to slice and dice our data in every way we want. Make sure to check for new posts in which these tools are sure to appear. For now, you can experiment with what you learned above using the SQL2pandas converter on the home page of this site, or check out any other articles or resources if you are serious about learning pandas.