Stop Building Your Models One Step at a Time. Automate the Process with Pipelines!

Posted by Brian Roepke on Fri 01 October 2021 Updated on Fri 01 October 2021

Why Pipelines?

When I started building models in Sklearn, I would break each pre-processing step into its own cell or chunk of code. This was a good way to get started because the steps were easy to read in small pieces. However, while it was easier to read, it lacked repeatability. Each time you presented your model with new data, you had to run through all the steps again to transform the data before running the model on it, which presented many problems. For example, the dimensionality of your data can change if new columns are created during One Hot Encoding.
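To make the dimensionality problem concrete, here is a minimal sketch. The `color` column and its values are made up for illustration; the point is only that encoding each data set separately, as a cell-by-cell workflow tends to do, can produce a different number of columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: the new batch contains a category ("green")
# that never appeared in the training data.
train = pd.DataFrame({"color": ["red", "blue", "red"]})
new = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Encoding each data set on its own, as you might in separate cells.
train_width = OneHotEncoder().fit_transform(train).shape[1]
new_width = OneHotEncoder().fit_transform(new).shape[1]

print(train_width, new_width)  # 2 3
```

A model trained on two columns cannot score data with three, which is the kind of mismatch a pipeline is meant to prevent.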

The Answer? Pipelines

Anatomy of a Pipeline

First, as usual, the imports needed to run this script.

from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_selector as selector
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

Column Transformers

Next is a function for making a column transformer. I prefer to wrap this in a function to make the code more reusable.

The column transformer allows you to combine any number of pre-processing steps into a single transformer. In the example below, we have a MinMaxScaler for numeric columns and a OneHotEncoder for categorical values. You could include any transformer from Sklearn in these steps.

A good example of this from the Sklearn documentation:
Column Transformer with Mixed Types

In addition to handling both numeric and categorical columns, this example shows the column selector, which lets you select groups of columns based on different criteria. The numeric columns are selected via selector(dtype_exclude="object"), while the categorical columns are selected by name with a simple Python list. You can combine any of these selection styles with each of the transformers you supply. Additionally, you name your transformers, such as num and cat below, so you can identify them later in your fitted model.

def make_coltrans():
    column_trans = ColumnTransformer(transformers=
            [('num', MinMaxScaler(), selector(dtype_exclude="object")),
             ('cat', OneHotEncoder(dtype='int', handle_unknown='ignore'), ['CAT_FIELD_ONE', 'CAT_FIELD_TWO'])],
            remainder='drop')

    return column_trans
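To see what this transformer produces, here is a small sketch on a toy DataFrame. The `NUM_FIELD` column and all of the values are hypothetical; only the two categorical column names come from the function above. The string columns are created with an explicit object dtype so that the `dtype_exclude="object"` selector skips them.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data (hypothetical values, for illustration only).
df = pd.DataFrame({
    "NUM_FIELD": [10.0, 20.0, 30.0],
    "CAT_FIELD_ONE": pd.Series(["a", "b", "a"], dtype="object"),
    "CAT_FIELD_TWO": pd.Series(["x", "x", "y"], dtype="object"),
})

# Same configuration as make_coltrans(), repeated so this runs on its own.
column_trans = ColumnTransformer(
    transformers=[
        ("num", MinMaxScaler(), selector(dtype_exclude="object")),
        ("cat", OneHotEncoder(dtype="int", handle_unknown="ignore"),
         ["CAT_FIELD_ONE", "CAT_FIELD_TWO"]),
    ],
    remainder="drop",
)

transformed = column_trans.fit_transform(df)
print(transformed.shape)  # (3, 5): one scaled column plus 2 + 2 one-hot columns

# The names given to each transformer let you look it up after fitting.
cats = column_trans.named_transformers_["cat"].categories_
print(cats)
```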

The Pipeline

With the column transformer in place, creating the pipeline is a very simple step. All you need to do is order the pipeline sequence to match the logical order of the steps you would normally take. Here we have two steps: the column transformer and the classifier. As with the column transformer, we name the steps, here prep and clf.

def create_pipe(clf):
    '''Create a pipeline for a given classifier.  The classifier needs to be an instance
    of the classifier with all needed parameters specified.'''

    # Each pipeline uses the same column transformer.  
    column_trans = make_coltrans()

    pipeline = Pipeline([('prep',column_trans),
                         ('clf', clf)])

    return pipeline
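One practical reason to name the steps is that the names become handles for inspecting or adjusting the pipeline later. The sketch below uses a simplified stand-in pipeline (a plain MinMaxScaler instead of the column transformer) so that it runs on its own; the value 50 is arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Simplified stand-in for the pipeline built by create_pipe().
pipeline = Pipeline([("prep", MinMaxScaler()),
                     ("clf", RandomForestClassifier(random_state=42))])

# Steps can be looked up by the names they were given.
print(list(pipeline.named_steps))  # ['prep', 'clf']

# Parameters of a step are addressed as <step name>__<parameter>.
pipeline.set_params(clf__n_estimators=50)
print(pipeline.named_steps["clf"].n_estimators)  # 50
```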

Creating and Fitting the Model

Finally, we can create an instance of the classifier and pass it to the function above that creates the pipeline.

# Create the classifier instance and build the pipeline.
clf = RandomForestClassifier(random_state=42, class_weight='balanced')
pipeline = create_pipe(clf)

# Fit the model to the training data
pipeline.fit(X_train, y_train)
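The snippet above assumes X_train and y_train already exist. As a rough end-to-end sketch, the following builds a tiny made-up training set (the `NUM_FIELD` column, the values, and the labels are all hypothetical), fits the same kind of pipeline, and then scores new rows that include a category the model never saw.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical training data, for illustration only.
X_train = pd.DataFrame({
    "NUM_FIELD": [1.0, 2.0, 8.0, 9.0],
    "CAT_FIELD_ONE": pd.Series(["a", "a", "b", "b"], dtype="object"),
    "CAT_FIELD_TWO": pd.Series(["x", "y", "x", "y"], dtype="object"),
})
y_train = [0, 0, 1, 1]

column_trans = ColumnTransformer(
    transformers=[
        ("num", MinMaxScaler(), selector(dtype_exclude="object")),
        ("cat", OneHotEncoder(dtype="int", handle_unknown="ignore"),
         ["CAT_FIELD_ONE", "CAT_FIELD_TWO"]),
    ],
    remainder="drop",
)
pipeline = Pipeline([
    ("prep", column_trans),
    ("clf", RandomForestClassifier(random_state=42, class_weight="balanced")),
])
pipeline.fit(X_train, y_train)

# New data with an unseen category ("c") in CAT_FIELD_ONE.
X_new = pd.DataFrame({
    "NUM_FIELD": [1.5, 8.5],
    "CAT_FIELD_ONE": pd.Series(["a", "c"], dtype="object"),
    "CAT_FIELD_TWO": pd.Series(["x", "y"], dtype="object"),
})

# The fitted pipeline applies the same transformations before predicting.
preds = pipeline.predict(X_new)
new_width = pipeline.named_steps["prep"].transform(X_new).shape[1]
print(preds, new_width)
```

Because the encoder was fitted once on the training data with handle_unknown='ignore', the unseen category is encoded as all zeros and the transformed width stays the same as it was during training.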

Summary

The above demonstrates the simplicity of setting up a pipeline. The first time you walk through this, it can seem a little confusing compared with doing each step independently. The benefit is that each time you want to apply this to a new model, or even better, run new data against your fitted model, all of the transformations to the data happen automatically.



