First Experiences with Scikit-Learn

I recently made a submission to one of Kaggle's introductory machine learning competitions. The Python code I submitted was built on code I had written in GNU Octave for Stanford's Machine Learning course on Coursera. For that course we put together implementations of common machine learning models, one of them being the logistic regression model I wanted to use for the Kaggle competition. I hadn't written Python in a while and felt that porting those models from GNU Octave to Python/Pandas/NumPy would be a great way to get familiar with the language again.

Eventually I wanted to extend the code I had written with additional functionality, starting with one-hot encoding and automated parameter selection. I found both in scikit-learn modules, but soon realized that to fully take advantage of the vast amount of functionality scikit-learn provides I would have to re-architect my code with scikit-learn in mind from the start.

With the above in mind I made a new branch in my GitHub repository and set out to port my original functionality into a script built from the ground up to use scikit-learn. The team behind scikit-learn provides detailed documentation of its modules and classes, and because of that I was able to put together a basic implementation in no time flat. For one-hot encoding I used scikit-learn's DictVectorizer (a minimal sketch appears after the timing results below). For automated parameter selection I went with GridSearchCV and RandomizedSearchCV. Both search over parameters for a machine learning model and return the combination that produces the best results under cross-validation. Each class goes about this in a different way, however:

GridSearchCV – Exhaustive search over every combination of the parameters provided. For example, given the dictionary below, GridSearchCV would fit the model for all 5 × 2 × 100 = 1,000 combinations of parameters (each fitted once per cross-validation fold) before reporting the best-performing combination. A usage sketch for both search classes follows the two examples.

param_grid = dict(pandaspoly__degree=[2, 3, 4, 5, 6],
                  pandaspoly__interaction_only=[True, False],
                  logisticregression__C=np.linspace(0.01, 1, 100))

RandomizedSearchCV – Samples a fixed, user-specified number of parameter combinations (n_iter) from the lists or distributions provided. Given the dictionary below, RandomizedSearchCV would draw that many combinations from the values shown. Discrete parameters can be provided as lists, which are sampled uniformly; when at least one parameter is given as a distribution, sampling is done with replacement. The scikit-learn documentation recommends passing continuous distributions (such as scipy.stats.uniform) for continuous parameters rather than pre-sampled arrays like the ones below.

# Note: np.random.randint's high bound is exclusive, so this samples degrees 2-5.
rand_param_grid = dict(pandaspoly__degree=np.random.randint(low=2, high=6, size=1000),
                       pandaspoly__interaction_only=[True, False],
                       logisticregression__C=np.random.uniform(low=0.01, high=1, size=1000))
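
For context, here is a minimal, hedged sketch of how the two searches can be wired together. It assumes a Pipeline whose step names match the parameter prefixes above; scikit-learn's built-in PolynomialFeatures stands in for my custom transformer, and synthetic data stands in for the competition set:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the competition data.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Explicit step names match the "pandaspoly__" and "logisticregression__"
# prefixes used in the parameter dictionaries above.
pipe = Pipeline([
    ("pandaspoly", PolynomialFeatures()),
    ("logisticregression", LogisticRegression(max_iter=1000)),
])

# A smaller grid than the one above, to keep the example quick.
param_grid = dict(pandaspoly__degree=[2, 3, 4],
                  pandaspoly__interaction_only=[True, False],
                  logisticregression__C=np.linspace(0.01, 1, 10))

# Exhaustive search: fits every combination under cross-validation.
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# Randomized search: samples only n_iter combinations from the same space.
rand = RandomizedSearchCV(pipe, param_distributions=param_grid,
                          n_iter=20, cv=5, random_state=0)
rand.fit(X, y)
print(rand.best_params_, rand.best_score_)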

In my testing, RandomizedSearchCV finished much faster and produced results nearly equivalent in performance to GridSearchCV's:


Optimal parameters as found by RandomizedSearchCV in 4.76 seconds:
pandaspoly__interaction_only: 1.0000
pandaspoly__degree: 5.0000
logisticregression__C: 0.3400

Optimal parameters as found by GridSearchCV in 63.97 seconds:
logisticregression__C: 0.3300
pandaspoly__degree: 2.0000
pandaspoly__interaction_only: 1.0000

Accuracy score for logistic regression using default parameter: 0.83
Accuracy score for logistic regression using RandomizedSearchCV optimal parameters: 0.84
Accuracy score for logistic regression using GridSearchCV optimal parameters: 0.84

Process finished with exit code 0
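
As promised above, here is a minimal sketch of the one-hot encoding side. DictVectorizer turns a list of feature dictionaries into a numeric matrix, expanding each string-valued feature into one indicator column per observed value (the records below are illustrative, not from the actual competition data):

from sklearn.feature_extraction import DictVectorizer

# Illustrative records; string values are one-hot encoded,
# numeric values pass through unchanged.
records = [
    {"sex": "male", "age": 22.0},
    {"sex": "female", "age": 38.0},
]

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)

print(vec.get_feature_names_out())  # ['age' 'sex=female' 'sex=male']
print(X)
# [[22.  0.  1.]
#  [38.  1.  0.]]

A Pandas DataFrame can be fed to it via df.to_dict(orient='records'). (On older scikit-learn versions the feature-name accessor is get_feature_names() rather than get_feature_names_out().)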

One thing that did come up was the need to create my own estimator subclasses of the classes scikit-learn provides (I wanted to exclude certain Pandas columns from certain operations). Daniel Hnyk's blog has a post that proved very useful in accomplishing this. In addition to his posted tips, I also came up with the following:

  • In your __init__ method, make sure to set instance attributes for the parameters of the class you are subclassing; scikit-learn will complain if you do not. In the example below I do that for the "copy", "with_mean", and "with_std" parameters of scikit-learn's StandardScaler (a fuller sketch follows the snippet).

 def __init__(self, columns, copy=True, with_mean=True, with_std=True):
     self.columns = columns
     self.copy = copy
     self.with_mean = with_mean
     self.with_std = with_std
     self.scaler = StandardScaler(copy=self.copy, with_mean=self.with_mean,
                                  with_std=self.with_std)
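
To round out the pattern, here is a hedged sketch (my reconstruction, not the code from my repository) of what the full wrapper can look like, with fit and transform applying the scaler only to the selected columns. The class name and column handling are assumptions:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class SelectiveStandardScaler(BaseEstimator, TransformerMixin):
    """Scale only the given DataFrame columns; pass the rest through."""

    def __init__(self, columns, copy=True, with_mean=True, with_std=True):
        self.columns = columns
        self.copy = copy
        self.with_mean = with_mean
        self.with_std = with_std
        self.scaler = StandardScaler(copy=self.copy, with_mean=self.with_mean,
                                     with_std=self.with_std)

    def fit(self, X, y=None):
        # Learn scaling statistics from the selected columns only.
        self.scaler.fit(X[self.columns])
        return self

    def transform(self, X):
        # Copy so the caller's DataFrame is left untouched.
        X = X.copy()
        X[self.columns] = self.scaler.transform(X[self.columns])
        return X

A transformer like this drops into a pipeline just like any built-in one, e.g. make_pipeline(SelectiveStandardScaler(columns=['age', 'fare']), LogisticRegression()); the column names here are hypothetical.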

Overall, scikit-learn has been a huge help for quickly putting together functional, easy-to-write, and easily extensible machine learning code. I will be using it for all of my projects in the foreseeable future.
