SQL2pandas | From SQL to pandas: SQL's JOIN in pandas

In the every day use of SQL we find ourselves gathering information from different sources to present in a single result set. We use unions to append data sets underneath one another, and joins to merge columns stored in different tables to enrich rows with more data. In this post, we'll look at SQL's joins and their equivalents in pandas, where we use the pandas' .merge method.

SQL's JOIN in pandas

When enriching your data with columns from a different dataset, we use the JOIN statement to define the other dataset we are looking to get our extra columns from. In the ON part of the statement, we define in what columns from each dataset we want values to match. What to do with values that don't have a representation in the other dataset, is defined by the preceding LEFT, RIGHT, INNER, or OUTER definition.

  • LEFT: Keep all values on the left side, non-matching rows on the right side are left out;
  • RIGHT: Vice versa the above;
  • INNER: Only keep rows that have a match, leave out all non-matching rows from either side;
  • OUTER: Keep all rows from left and right, regardless of matching values in columns

In SQL, a query involving multiple tables could look something like this:

  orders AS o 
  LEFT JOIN customers AS c ON o.customer_id = c.customer_id;

In this query, we are selecting all columns for both tables: the 'orders' table and the 'customers' table. Using a LEFT JOIN, we'll find all our orders present in the dataset, and either some customer info when a customer_id is associated with the order, or NULLs if there's no customer info available. In pandas, we can use the .merge method to return the same dataset:

orders.merge(customers, left_on='customer_id', right_on='customer_id', how='left')

The .merge method can be applied to either one of the DataFrames, or both (in that case: pd.merge(orders, customers, ...)). To keep things consistent, we are looking at this from the 'orders' perspective and go from there. In the .merge method used above, we set a number of parameters.

  • The first argument is the table we want to include in our result set, 'customers'.
  • Secondly, we need to specify the columns to match on: 'customer_id'. We do this for both the left and the right side of the join. If we want the joins to consider multiple columns, we pass a list of column names. We can also use one or both of the index columns to join, but we specify those using 'left_index=True' and/or 'right_index=True'.
  • Lastly, we specify which type of merge we want: 'left'. 'left' in pandas translates to LEFT OUTER in SQL. Using RIGHT OUTER JOIN, INNER JOIN or OUTER JOIN would require us to pass 'right', 'inner', or 'outer' respectively.

Using the method above, we are greedy with columns, querying more than we might need. Say, we are just interested in what ZIP codes our customers are having their orders delivered to. In SQL, we could query the following:

  orders AS o 
  LEFT JOIN customers AS c ON o.customer_id = c.customer_id;

In the query above, we'll return all orders and the ZIP code of the customers who placed them. Should an order not be associated with a customer_id, we would still see the order, but the customer_id would be empty. In pandas, we now need to specify what columns we need from each DataFrame:

orders[['order_id', 'date_ordered', 'customer_id']]
    .merge(customers[['zip', 'customer_id']], left_on='customer_id', right_on='customer_id', how='left')

Using this code, we collect only the data that we need for our analysis purposes, and the column we join on, saving precious memory resources. However, we can slim this code down further. As you probably noticed, we are joining on columns that have the same name. In pandas, we can pass the column name a single time using the on argument:

orders[['order_id', 'date_ordered', 'customer_id']]
    .merge(customers[['zip', 'customer_id']], on='customer_id', how='left')

Pandas, however, is smart enough to recognize this similarity, and will join on columns with equal names if we don't provide them explicitly. So we can leave out the specification for the columns in this case. The results will be the same.

orders[['order_id', 'date_ordered', 'customer_id']]
    .merge(customers[['zip', 'customer_id']], how='left')

Right now, for joining purposes, we need to include the 'customer_id' column in both DataFrames. This column is not necessary for our analysis, so we could leave it out if we wanted. One way this could work is the following:

orders.merge(customers, how='left')[['order_id', 'order_date', 'zip']]

As you see, we can join DataFrames quite effectively with a relatively small amount of code. My advice, however, would be to be carefull with leaving out arguments from your code. If you express what columns you use to join two DataFrames, you will thank yourself later, when the underlying datasets change and your code needs reviewing because it's throwing errors. Good luck, and be sure to check out the resources for great ways to learn more pandas!