SQL2pandas | From SQL to pandas: column selection, transformation, and renaming with pandas

SQL requires two mandatory clauses to be specified if we want our query to return data: SELECT and FROM. One of the first things to think about when writing a query is which columns we need, and in what shape or form we would like to see them. We might want to return some columns exactly as they are stored in our database, or we could derive a conditional or calculated column from them. Perhaps we would like to assign custom names to columns, or would like to apply a different casing to its string values.


In this post we will look what pandas equivalents you can use to perform some of the basic SQL actions of selecting columns from a DataFrame. We will be using the tips dataset also used in the converter.

Selecting a single column in pandas

That one is easy enough, and there is more than one right way to do it. Let’s first look at the statement in SQL:


SELECT 
  tip 
FROM 
  tips;

There are two ways to perform the same operation in pandas:


  1. tips['tip']
  2. tips.tip

The first method requires us to pass the desired column as a string between brackets (the __getitem__ syntax), whereas the second method requires us to chain the name of the desired column to the table name.


Both methods work equally well, however, I like the first method better. This is because some datasets have columns wit names that contain spaces. This won’t happen if you’re querying a SQL dataset where spaces in column names are not allowed, but you might run into data from different sources (such as Excel files or csv’s) where this rule does not apply. The first method is more resilient to such variations in column names.

Selecting multiple columns in pandas

If we are looking to select more than one column, the methods change slightly. First the SQL statement:


SELECT 
  day, 
  tip, 
  smoker 
FROM 
  tips;

In pandas, the code would like this:


tips[['day', 'tip', 'smoker']]

In its essence, the syntax follows the same structure as the code for selecting a single column: We first specify a table, then specify in one item what columns we want to see between the brackets, and pass their names as strings. Take note: The __getitem__ syntax (what you pass between the brackets) only accepts one argument. This means we can either pass a single column name, or a list of column names. Above, we passed a list of columns. Passing the names of the columns instead would cause an error.


If we want to select all columns from a table, in SQL we would end up with the following:


SELECT 
  * 
FROM 
  tips;

In pandas, all columns are selected by default. So, selecting all columns in pandas would simply be:


tips

Applying string functions to columns in pandas

There is a plethora of string functions available in SQL, such as UPPER(), LOWER(), TRIM(), REPLACE(), INITCAP() and many more. Python has a fair number of string functions as well, and we can use them to format data in a certain way when selecting from our pandas DataFrame. Say, you would like to select a column but want to return all its values in uppercase:


SELECT 
  UPPER(day) 
FROM 
  tips;

To achieve the same results in pandas, we need to apply (using .apply) the .upper() string function to each element of the column 'time', and assign the resulting value to the same column with the .assign method. Here’s how that looks:


tips
    .assign(time = tips['time']
    .apply(lambda time: time.upper()))['time']

Let’s break this statement down:

  • We start by specifying the dataframe: tips
  • To this DataFrame, we assign a new column using the .assign-method.
  • In the .assign method, we start with the declaring the column name: time.
  • The new column 'time' --replacing the current 'time' column-- gets its values by applying a lambda function to the values in the current column 'time'.
  • For each row, the lambda function applies the .upper() function to the value in 'time'.
  • Select the newly created 'time' column.

This time we are applying the .upper() function to the values in the column ‘time’. But using the same code structure, we can perform any other string operation, such as .lower(), .strip(), .replace(), or .title() to perform the SQL functions LOWER(), TRIM(), REPLACE(), or INITCAP(), respectively.

Renaming columns in pandas

It is not unusual to come across columns with ambiguous names, or names that are simply hard for humans to read. Luckily, in SQL we can alias columns so our returned dataset has column names that better suit our needs. In a query, it would simply look like this:


SELECT 
  tip, 
  time AS Shift 
FROM 
  tips;

In pandas, renaming columns is done using the .rename method. To mimic the SQL query above, the code would look like this:


tips
    .rename(columns={'time': 'Shift'})[['tip', 'Shift']]

The first thing we do is specify the column, and apply the rename method to the columns that are specified in its argument. The dictionary contains all column names to be changed as keys, and their new names as values. After renaming the columns, we select the columns we are looking for.


This is the first in a serie of posts in which I will discuss how to select columns in pandas from a SQL perspective. Check the articles section for any other posts on this topic, and good luck Pythoning!


Resources: