Oftentimes in SQL, we append multiple tables, especially ones sharing a time dimension, to present them in a single result set, run analysis on them, or train models with them. We use unions to append datasets underneath one another, and joins to merge columns stored in different tables to enrich rows with more data. In the previous post, we looked at SQL's joins and their equivalents in pandas. In this post, we'll look at SQL's UNION ALL (all rows, regardless of duplication) and UNION (only the unique rows). The pandas equivalent of UNION and UNION ALL is the .concat method, or previously the .append method.
SQL's UNION ALL and UNION in pandas
When adding new rows from a different dataset, we use UNION or UNION ALL to select from the dataset that holds the extra rows. Both datasets need the same columns for UNION and UNION ALL to work, although pandas has a way to make it work when the columns don't match up completely. Let's start with the happy path: two datasets with the same columns and no duplicate rows that we want to union. We'll use the tips dataset as an example. It only contains data for Thursday, Friday, Saturday, and Sunday. Imagine we have a different tips dataset containing data for Monday, Tuesday, and Wednesday, and we want to add these rows to our original tips DataFrame. So, we add the rows from tips_weekdays to tips.
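If you want to follow along, the sketch below is one way to set up both DataFrames. The tips data ships with seaborn; tips_weekdays doesn't exist anywhere, so we invent a few weekday rows with the same columns, plus one overlapping Thursday row that the UNION example further down relies on.

import pandas as pd
import seaborn as sns

# The familiar tips dataset: only Thursday through Sunday.
tips = sns.load_dataset('tips')

# A made-up weekday dataset with the same columns as tips.
tips_weekdays = pd.DataFrame({
    'total_bill': [12.50, 20.10, 8.75],
    'tip': [2.00, 3.50, 1.25],
    'sex': ['Female', 'Male', 'Male'],
    'smoker': ['No', 'No', 'Yes'],
    'day': ['Mon', 'Tue', 'Wed'],
    'time': ['Lunch', 'Dinner', 'Lunch'],
    'size': [2, 3, 1],
})

# Copy one Thursday row from tips into tips_weekdays so the two
# DataFrames overlap, which matters once we get to UNION.
tips_weekdays = pd.concat(
    [tips_weekdays, tips[tips['day'] == 'Thur'].head(1)],
    ignore_index=True,
)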
In SQL, we'd union these tables like this:
SELECT
  *
FROM
  tips
UNION ALL
SELECT
  *
FROM
  tips_weekdays;
In this query, we are selecting all columns from both tables: the tips table and the tips_weekdays table. Using UNION ALL, we'll find all rows from both tables present in the result. In pandas, we can use the .concat method to return the same dataset:
pd.concat([tips, tips_weekdays], ignore_index=True)
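As a quick sanity check, using the hypothetical tips_weekdays from the setup above, the row counts simply add up, duplicates included, just like UNION ALL:

tips_all = pd.concat([tips, tips_weekdays], ignore_index=True)

# Every row from both inputs is present, duplicates and all.
print(len(tips), len(tips_weekdays), len(tips_all))

Passing ignore_index=True rebuilds the row index of the result, so you don't carry over repeated index labels from the two inputs.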
The .concat method takes a list of DataFrames to union, and that list can hold any number of DataFrames you want to concatenate. Consider, however, that our tips_weekdays DataFrame contains data for Thursday as well. We'd have overlapping rows and end up with an unreliable dataset bloated with duplicates. In SQL, UNION deals with this by only keeping unique rows. In pandas, we need to apply the .drop_duplicates method to get rid of the duplicate rows. First, the SQL code:
SELECT
  *
FROM
  tips
UNION
SELECT
  *
FROM
  tips_weekdays;
Then, our pandas equivalent:
pd.concat([tips, tips_weekdays], ignore_index=True).drop_duplicates()
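Under the same hypothetical setup, you can verify the difference by comparing row counts; the UNION-style result is smaller whenever the two inputs share identical rows:

union_all = pd.concat([tips, tips_weekdays], ignore_index=True)
union = union_all.drop_duplicates()

# The overlapping Thursday row appears twice in union_all but only once in union.
print(len(union_all), len(union))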
After applying .drop_duplicates, we collect the data without the duplicate rows from Thursday, exactly as UNION would do. Of course, we can union only a subset of the columns instead of the entire DataFrame. For example, only the total_bill and tip columns. In SQL:
SELECT
  total_bill,
  tip
FROM
  tips
UNION
SELECT
  total_bill,
  tip
FROM
  tips_weekdays;
The pandas equivalent looks like the code below. Learn more about pandas column selection in one of my previous posts.
pd.concat(
    [tips[['total_bill', 'tip']], tips_weekdays[['total_bill', 'tip']]],
    ignore_index=True
).drop_duplicates()
Should tips_weekdays be missing a column that tips has, UNION won't run and SQL will complain. In pandas, you can set the join parameter to 'inner'. This way, the following pandas statement returns only the columns present in both datasets.
pd.concat([tips, tips_weekdays], join='inner', ignore_index=True)
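To see the difference, the sketch below (again using the hypothetical setup) drops a column from tips_weekdays. With join='inner' only the shared columns survive; the default join='outer' keeps every column and fills the gaps with NaN, something a plain SQL UNION can't do.

# Suppose tips_weekdays were missing the 'size' column.
tips_weekdays_partial = tips_weekdays.drop(columns=['size'])

# Only the columns present in both DataFrames are kept.
inner = pd.concat([tips, tips_weekdays_partial], join='inner', ignore_index=True)

# The default join='outer' keeps 'size' and fills it with NaN for the weekday rows.
outer = pd.concat([tips, tips_weekdays_partial], ignore_index=True)

print(inner.columns.tolist())
print(outer['size'].isna().sum())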
Concatenating DataFrames is one of the staples of the data science toolbox. Pandas' .concat method offers a lot of flexibility for appending data, for example concatenating along both axes and checking the result for duplicate index labels (see the short sketch at the end of this post). Note that there is an interesting difference between SQL and pandas when it comes to unions. In SQL, we need to explicitly include ALL in the query; otherwise SQL 'defaults' to removing duplicates, which is computationally more expensive. In pandas, we need to explicitly call the .drop_duplicates() method to remove duplicates, forcing us to be more thoughtful about what we want to do with our data. Be sure to check out the resources for great ways to learn more about pandas, or read more ways to apply SQL functions using pandas methods.
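As a closing sketch of that extra flexibility: axis=1 concatenates column-wise instead of row-wise, and verify_integrity=True makes .concat raise an error when the concatenated index contains duplicate labels. Both parameters are part of pd.concat's documented signature; the data is still the hypothetical setup from above.

# Column-wise concatenation: place two column subsets side by side.
side_by_side = pd.concat([tips[['total_bill']], tips[['tip']]], axis=1)

# verify_integrity=True raises a ValueError because the default indexes
# of tips and tips_weekdays overlap.
try:
    pd.concat([tips, tips_weekdays], verify_integrity=True)
except ValueError as err:
    print(err)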