Oftentimes in SQL, we append multiple tables, especially ones sharing a time dimension, to present them in a single result set, run analysis on them, or train models with them. We use unions to append datasets underneath one another, and joins to merge columns stored in different tables to enrich rows with more data. In the previous post, we looked at SQL's joins and their equivalents in pandas. In this post, we'll look at SQL's UNION ALL (all rows, regardless of duplication) and UNION (only the unique rows). The pandas equivalent of UNION and UNION ALL is the .concat method, or previously the .append method.
SQL's UNION ALL and UNION in pandas
When adding new rows from a different dataset, we use UNION or UNION ALL to select from the dataset that holds the extra rows. Both datasets need the same columns for UNION and UNION ALL to work, although pandas has a way to make it work when the columns don't match up completely. Let's start with the happy path: two datasets with the same columns and no duplicate rows that we want to union. We'll use the tips dataset as an example. It only contains data for Thursday, Friday, Saturday, and Sunday. Imagine we have a different tips dataset containing data for Monday, Tuesday, and Wednesday, and we want to add these rows to our original tips DataFrame. So, we add the rows from tips_weekdays to tips.
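If you want to follow along, the sketch below is one way to set up both DataFrames. The tips data ships with seaborn; tips_weekdays doesn't exist anywhere, so we invent a few weekday rows with the same columns, plus one overlapping Thursday row that the UNION example further down relies on.

import pandas as pd
import seaborn as sns

# The familiar tips dataset: only Thursday through Sunday.
tips = sns.load_dataset('tips')

# A made-up weekday dataset with the same columns as tips.
tips_weekdays = pd.DataFrame({
    'total_bill': [12.50, 20.10, 8.75],
    'tip': [2.00, 3.50, 1.25],
    'sex': ['Female', 'Male', 'Male'],
    'smoker': ['No', 'No', 'Yes'],
    'day': ['Mon', 'Tue', 'Wed'],
    'time': ['Lunch', 'Dinner', 'Lunch'],
    'size': [2, 3, 1],
})

# Copy one Thursday row from tips into tips_weekdays so the two
# DataFrames overlap, which matters once we get to UNION.
tips_weekdays = pd.concat(
    [tips_weekdays, tips[tips['day'] == 'Thur'].head(1)],
    ignore_index=True,
)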
In SQL, we'd union these tables like this:
SELECT
  *
FROM
  tips
UNION ALL
SELECT
  *
FROM
  tips_weekdays;
In this query, we are selecting all columns from both tables: the tips table and the tips_weekdays table. Using UNION ALL, we'll find all rows from both tables present in the result. In pandas, we can use the .concat method to return the same dataset:
pd.concat([tips, tips_weekdays], ignore_index=True)
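As a quick sanity check, using the hypothetical tips_weekdays from the setup above, the row counts simply add up, duplicates included, just like UNION ALL:

tips_all = pd.concat([tips, tips_weekdays], ignore_index=True)

# Every row from both inputs is present, duplicates and all.
print(len(tips), len(tips_weekdays), len(tips_all))

Passing ignore_index=True rebuilds the row index of the result, so you don't carry over repeated index labels from the two inputs.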
The .concat method takes a list of DataFrames to union, and that list can hold any number of DataFrames you want to concatenate. Consider, however, that our tips_weekdays DataFrame contains data for Thursday as well. We'd have overlapping rows and end up with an unreliable dataset bloated with duplicates. In SQL, UNION deals with this by only keeping unique rows. In pandas, we need to apply the .drop_duplicates method to get rid of the duplicate rows. First, the SQL code:
SELECT
  *
FROM
  tips
UNION
SELECT
  *
FROM
  tips_weekdays;
Then, our pandas equivalent:
pd.concat([tips, tips_weekdays], ignore_index=True).drop_duplicates()
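Under the same hypothetical setup, you can verify the difference by comparing row counts; the UNION-style result is smaller whenever the two inputs share identical rows:

union_all = pd.concat([tips, tips_weekdays], ignore_index=True)
union = union_all.drop_duplicates()

# The overlapping Thursday row appears twice in union_all but only once in union.
print(len(union_all), len(union))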
After applying .drop_duplicates, we collect the data without the duplicate rows from Thursday, exactly as UNION would do. Of course, we can union only a subset of the columns instead of the entire DataFrame. For example, only the total_bill and tip columns. In SQL:
SELECT
  total_bill,
  tip
FROM
  tips
UNION
SELECT
  total_bill,
  tip
FROM
  tips_weekdays;
The pandas equivalent looks like the code below. Learn more about pandas column selection in one of my previous posts.
pd.concat(
    [tips[['total_bill', 'tip']], tips_weekdays[['total_bill', 'tip']]],
    ignore_index=True
).drop_duplicates()
Should tips_weekdays be missing a column that tips has, UNION won't run and SQL will complain. In pandas, you can set the join parameter to 'inner'. This way, the following pandas statement returns only the columns present in both datasets.
pd.concat([tips, tips_weekdays], join='inner', ignore_index=True)
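To see the difference, the sketch below (again using the hypothetical setup) drops a column from tips_weekdays. With join='inner' only the shared columns survive; the default join='outer' keeps every column and fills the gaps with NaN, something a plain SQL UNION can't do.

# Suppose tips_weekdays were missing the 'size' column.
tips_weekdays_partial = tips_weekdays.drop(columns=['size'])

# Only the columns present in both DataFrames are kept.
inner = pd.concat([tips, tips_weekdays_partial], join='inner', ignore_index=True)

# The default join='outer' keeps 'size' and fills it with NaN for the weekday rows.
outer = pd.concat([tips, tips_weekdays_partial], ignore_index=True)

print(inner.columns.tolist())
print(outer['size'].isna().sum())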
Concatenating DataFrames is one of the staples of the data science toolbox. Pandas' .concat method offers a lot of flexibility for appending data, for example concatenating along both axes and checking the result for duplicate index labels (see the short sketch at the end of this post). Note that there is an interesting difference between SQL and pandas when it comes to unions. In SQL, we need to explicitly include ALL in the query; otherwise SQL 'defaults' to removing duplicates, which is computationally more expensive. In pandas, we need to explicitly call the .drop_duplicates() method to remove duplicates, forcing us to be more thoughtful about what we want to do with our data. Be sure to check out the resources for great ways to learn more about pandas, or read more ways to apply SQL functions using pandas methods.
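As a closing sketch of that extra flexibility: axis=1 concatenates column-wise instead of row-wise, and verify_integrity=True makes .concat raise an error when the concatenated index contains duplicate labels. Both parameters are part of pd.concat's documented signature; the data is still the hypothetical setup from above.

# Column-wise concatenation: place two column subsets side by side.
side_by_side = pd.concat([tips[['total_bill']], tips[['tip']]], axis=1)

# verify_integrity=True raises a ValueError because the default indexes
# of tips and tips_weekdays overlap.
try:
    pd.concat([tips, tips_weekdays], verify_integrity=True)
except ValueError as err:
    print(err)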