Oftentimes in SQL, we append multiple tables, especially along a time dimension, to present a single result set, run analysis on, or train models with. We use unions to append datasets underneath one another, and joins to merge columns stored in different tables to enrich rows with more data. In the previous post, we looked at SQL's joins and their equivalents in pandas. In this post, we'll look at SQL's
UNION ALL (all rows, regardless of duplication) and
UNION (only the unique rows). The equivalent of
UNION ALL is pandas'
.concat method, or previously the now-deprecated .append method.
SQL's UNION ALL and UNION in pandas
When adding new data using rows from a different dataset, we use
UNION ALL to select the extra rows from the other dataset. We need to have the same columns in both datasets for
UNION ALL to work, although pandas has a way to make it work in case the columns don't match up completely. Let's first start with the happy path: two datasets with the same columns and no duplicate rows that we want to union. We'll use the tips dataset as an example. The dataset only contains data for Thursday, Friday, Saturday, and Sunday. Imagine we have a different tips dataset containing data for Monday, Tuesday, and Wednesday, and we want to add these rows to our original
tips DataFrame. So, we add rows from tips_weekdays to tips.
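As a minimal sketch: the real tips dataset ships with seaborn (sns.load_dataset("tips")), but here both frames are small hand-made stand-ins so the example is self-contained; tips_weekdays is a hypothetical DataFrame with the same columns, and its values are assumptions.

```python
import pandas as pd

# Toy stand-in for the seaborn tips dataset (weekend/Thursday rows).
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
    "day": ["Sun", "Sun", "Thur"],
})

# Hypothetical weekday data; note it repeats the Thursday row from tips,
# which matters once we start talking about duplicates.
tips_weekdays = pd.DataFrame({
    "total_bill": [12.50, 21.01, 9.75],
    "tip": [2.00, 3.50, 1.25],
    "day": ["Mon", "Thur", "Wed"],
})
```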
In SQL, we'd union these tables like this:
SELECT * FROM tips UNION ALL SELECT * FROM tips_weekdays;
In this query, we are selecting all columns for both tables: the
tips table and the
tips_weekdays table. Using a
UNION ALL, all rows from both tables end up in the result set. In pandas, we can use the
.concat method to return the same dataset:
pd.concat([tips, tips_weekdays], ignore_index=True)
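As a runnable sketch with toy frames (the column names and values are assumptions, not the real tips data), concatenating two frames with identical columns keeps every row from both, just like UNION ALL:

```python
import pandas as pd

# Two frames with identical columns (toy data).
a = pd.DataFrame({"total_bill": [16.99, 10.34], "tip": [1.01, 1.66]})
b = pd.DataFrame({"total_bill": [12.50, 9.75], "tip": [2.00, 1.25]})

# Like UNION ALL: every row from both frames is kept;
# ignore_index=True assigns a fresh 0..n-1 index to the result.
union_all = pd.concat([a, b], ignore_index=True)
print(len(union_all))  # 4
```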
The .concat method takes a list of DataFrames to union; this list can contain any number of DataFrames you'd want to concatenate. Consider, however, that our
tips_weekdays DataFrame contains data for Thursday as well. We'll have overlap in our rows, and end up with an unreliable dataset that is bloated with duplicate rows. In SQL,
UNION deals with this by only returning unique rows. In pandas, we need to chain the .drop_duplicates method to get rid of duplicate rows. First, the SQL code:
SELECT * FROM tips UNION SELECT * FROM tips_weekdays;
Then, our pandas equivalent:
pd.concat([tips, tips_weekdays], ignore_index=True).drop_duplicates()
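To see the deduplication in isolation, here is a sketch with toy frames (assumed data) where the second frame repeats one row of the first:

```python
import pandas as pd

# Toy frames: the second frame repeats one row of the first.
a = pd.DataFrame({"total_bill": [16.99, 21.01], "tip": [1.01, 3.50]})
b = pd.DataFrame({"total_bill": [21.01, 9.75], "tip": [3.50, 1.25]})

# Like UNION: concatenate, then drop the duplicated row.
union = pd.concat([a, b], ignore_index=True).drop_duplicates()
print(len(union))  # 3
```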
By chaining .drop_duplicates, we collect the data without the duplicate rows from Thursday, exactly as
UNION would do. Of course, we can union only a subset of columns instead of the entire DataFrame, for example only the
total_bill and tip columns. In SQL:
SELECT total_bill, tip FROM tips UNION SELECT total_bill, tip FROM tips_weekdays;
The pandas equivalent looks like the code below. Learn more about pandas column selection in one of my previous posts.
pd.concat([tips[['total_bill', 'tip']], tips_weekdays[['total_bill', 'tip']]], ignore_index=True).drop_duplicates()
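The same pattern with self-contained toy frames (data and column names are assumptions); each frame is trimmed to the column subset before concatenating:

```python
import pandas as pd

# Toy frames with an extra column we won't keep.
a = pd.DataFrame({"total_bill": [16.99, 21.01], "tip": [1.01, 3.50],
                  "day": ["Sun", "Thur"]})
b = pd.DataFrame({"total_bill": [21.01, 9.75], "tip": [3.50, 1.25],
                  "day": ["Thur", "Wed"]})

# Select the column subset from each frame, then union.
subset = pd.concat(
    [a[["total_bill", "tip"]], b[["total_bill", "tip"]]],
    ignore_index=True,
).drop_duplicates()
print(list(subset.columns))  # ['total_bill', 'tip']
```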
Should tips_weekdays miss a column compared to tips, the
UNION won't run and SQL will complain. In pandas, you can set the
join parameter to
'inner'. This way, the following pandas statement will return only the columns present in both datasets.
pd.concat([tips, tips_weekdays], join='inner', ignore_index=True)
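A sketch of the mismatched-columns case (toy data, assumed names): the second frame lacks a column, so the default join='outer' keeps all columns and fills the gap with NaN, while join='inner' keeps only the shared columns.

```python
import pandas as pd

# 'b' lacks the 'day' column that 'a' has.
a = pd.DataFrame({"total_bill": [16.99], "tip": [1.01], "day": ["Sun"]})
b = pd.DataFrame({"total_bill": [12.50], "tip": [2.00]})

# Default join='outer': all columns survive, missing values become NaN.
outer = pd.concat([a, b], ignore_index=True)

# join='inner': only the columns present in both frames survive.
inner = pd.concat([a, b], join="inner", ignore_index=True)
print(list(inner.columns))  # ['total_bill', 'tip']
```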
Concatenating DataFrames is one of the staples in the data science toolbox. Pandas'
.concat method offers a lot of flexibility for appending data, for example concatenating along both axes and checking for duplicates. Note that there is an interesting difference between SQL and pandas when it comes to unions. In SQL, we need to explicitly include
ALL in the query, otherwise SQL 'defaults' to removing duplicates, which is computationally more expensive. In pandas, we need to explicitly call the
.drop_duplicates() method to remove duplicates, forcing us to be more thoughtful on what we want to do with our data. Be sure to check out the resources for great ways to learn more about pandas, or read more ways to apply SQL functions using pandas methods.