When I started building models in Sklearn, I would break each pre-processing step into its own cell or chunk of code. This was a good way to get started because it splits the work into readable chunks. However, while easier to read, this approach lacked repeatability. Each time I presented the model with new data, I had to rerun every transformation step before running the model, which caused many problems. For example, the dimensionality of your data can change if new columns are created during One Hot Encoding.
The Answer? Pipelines
Anatomy of a Pipeline
First, as usual, the imports needed to run this script.
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_selector as selector
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
```
Next is a function for making a column transformer. I prefer to call this as a function to make the code more reusable.
The column transformer allows you to combine any number of pre-processing steps into a single transformer. In the example below, we have a `MinMaxScaler` for numeric columns and a `OneHotEncoder` for categorical columns. You could include any transformer from Sklearn in these steps.
A good example of this from their documentation:
Column Transformer with Mixed Types
In addition to demonstrating both numeric and categorical columns, that example shows the column selector, which allows you to select groups of columns based on different criteria. There, the numeric columns are selected via `selector(dtype_exclude="object")`. The code below demonstrates selecting columns by name with a plain Python list. You can combine any of these selection styles with each of the transformers you supply. Additionally, you name your transformers, such as `cat` below, for later identification in your fit model.
```python
def make_coltrans():
    column_trans = ColumnTransformer(
        transformers=[
            ('num', MinMaxScaler(), selector(dtype_exclude="object")),
            ('cat', OneHotEncoder(dtype='int', handle_unknown='ignore'),
             ['CAT_FIELD_ONE', 'CAT_FIELD_TWO'])],
        remainder='drop')
    return column_trans
```
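To see the transformer in action, here is a minimal sketch on a toy frame. The numeric column name `NUM_FIELD` and the values are hypothetical; only `CAT_FIELD_ONE` and `CAT_FIELD_TWO` come from the function above, which is repeated here so the snippet runs standalone:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

def make_coltrans():
    column_trans = ColumnTransformer(
        transformers=[
            ('num', MinMaxScaler(), selector(dtype_exclude="object")),
            ('cat', OneHotEncoder(dtype='int', handle_unknown='ignore'),
             ['CAT_FIELD_ONE', 'CAT_FIELD_TWO'])],
        remainder='drop')
    return column_trans

# Toy frame (hypothetical data): one numeric and the two categorical columns.
df = pd.DataFrame({
    'NUM_FIELD': [1.0, 2.0, 3.0],
    'CAT_FIELD_ONE': ['a', 'b', 'a'],
    'CAT_FIELD_TWO': ['x', 'x', 'y'],
})

# One scaled numeric column + 2 + 2 one-hot columns = 5 output columns.
result = make_coltrans().fit_transform(df)
print(result.shape)  # (3, 5)
```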
Creating the pipeline is now a very simple step after building the column transformer. All you need to do is order the pipeline sequence to match the logical order of the steps you would normally take. Here we have two steps: the column transformer and the classifier. As with the column transformer, we name each step, `prep` and `clf`, for later reference.
```python
def create_pipe(clf):
    '''Create a pipeline for a given classifier.
    The classifier needs to be an instance of the classifier
    with all parameters needed specified.'''
    # Each pipeline uses the same column transformer.
    column_trans = make_coltrans()
    pipeline = Pipeline([('prep', column_trans),
                         ('clf', clf)])
    return pipeline
```
Creating and Fitting the Model
Finally, we can create an instance of the classifier and pass it to the function above that creates the pipeline.
```python
# Create the classifier instance and build the pipeline.
clf = RandomForestClassifier(random_state=42, class_weight='balanced')
pipeline = create_pipe(clf)

# Fit the model to the training data.
pipeline.fit(X_train, y_train)
```
The above demonstrates the simplicity of setting up a pipeline. The first time you walk through this, it can seem more confusing than doing each step independently. The benefit is that each time you apply the pipeline to a new model, or even better, run new data against your fit model, all of the transformations to the data happen automatically.
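To make that repeatability concrete, here is an end-to-end sketch on a tiny, made-up training set (the column names beyond `CAT_FIELD_ONE`/`CAT_FIELD_TWO`, the values, and the labels are all hypothetical). The new row even contains a category never seen during training; because the encoder was built with `handle_unknown='ignore'`, prediction still works without any manual re-encoding:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

def make_coltrans():
    column_trans = ColumnTransformer(
        transformers=[
            ('num', MinMaxScaler(), selector(dtype_exclude="object")),
            ('cat', OneHotEncoder(dtype='int', handle_unknown='ignore'),
             ['CAT_FIELD_ONE', 'CAT_FIELD_TWO'])],
        remainder='drop')
    return column_trans

def create_pipe(clf):
    column_trans = make_coltrans()
    return Pipeline([('prep', column_trans), ('clf', clf)])

# Toy training data (hypothetical).
X_train = pd.DataFrame({
    'NUM_FIELD': [1.0, 2.0, 3.0, 4.0],
    'CAT_FIELD_ONE': ['a', 'b', 'a', 'b'],
    'CAT_FIELD_TWO': ['x', 'y', 'x', 'y'],
})
y_train = [0, 1, 0, 1]

pipeline = create_pipe(
    RandomForestClassifier(random_state=42, class_weight='balanced'))
pipeline.fit(X_train, y_train)

# New data with an unseen category ('c' in CAT_FIELD_ONE). The fitted
# pipeline scales, encodes, and predicts in one call.
X_new = pd.DataFrame({
    'NUM_FIELD': [2.5],
    'CAT_FIELD_ONE': ['c'],
    'CAT_FIELD_TWO': ['x'],
})
preds = pipeline.predict(X_new)
print(preds)
```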
If you liked what you read, subscribe to my newsletter and you will get my cheat sheet on Python, Machine Learning (ML), Natural Language Processing (NLP), SQL, and more. You will receive an email each time a new article is posted.