Machine learning made easier with the datto package

Kristie Wirth / October 2, 2020

Machine learning isn't easy

Machine learning can get complex rather quickly. There are so many steps to the process, and keeping track of everything you've done is tricky at times.

Here at Zapier, we're experimenting with all types of machine learning algorithms. From recommending Zaps to automating support ticket responses to classifying spam, we're continuously developing new approaches and iterating on existing ones.

According to the No Free Lunch theorem, there's no single approach that is the best for all of these types of machine learning problems. Each project is unique and requires different approaches to find the best algorithm for that use case. However, if you are involved in machine learning projects, you will likely find that there are repetitive tasks as part of this process.

I looked around for Python libraries to help me simplify some of these common machine learning tasks, but while I found some built-in methods that were helpful, the vast majority of steps I commonly take were nowhere to be found. So I decided to take matters into my own hands and build a Python package that simplifies and streamlines much of the machine learning process, leaving you free to work on the intricacies of data cleaning, feature generation, or model tweaks relevant to your specific data.

Introducing… datto! Otherwise known as DATa TOols. I'm going to give you a tour of some of the best methods to use while creating a machine learning model, as well as share some additional open-source packages I've found helpful in my work.

All screenshots come from this example Jupyter notebook. Note that datto is not owned or maintained by Zapier; it is a personal project created by the author.

Exploring your data

The first task in any machine learning project is getting to know your data. datto has several methods for this exploratory data analysis (EDA).

Sampling unique values

One part of getting to know your data is understanding what values it contains. This method prints out several unique values for each column in your data frame, showcasing what you might expect from each column.
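To make the idea concrete, here is a minimal plain-pandas sketch of sampling unique values per column. It is not datto's implementation; the helper name `sample_unique_values` and its `n` parameter are assumptions made up for illustration.

```python
import pandas as pd


def sample_unique_values(df, n=5):
    """Return up to n unique non-null values per column (hypothetical helper)."""
    return {col: df[col].dropna().unique()[:n].tolist() for col in df.columns}


# Small made-up data frame for illustration only.
df = pd.DataFrame({
    "plan": ["free", "pro", "free", "team"],
    "seats": [1, 5, 1, 20],
})

samples = sample_unique_values(df, n=3)
for col, values in samples.items():
    print(f"{col}: {values}")
```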

Finding columns to exclude

Some columns in your dataset may be better suited for feature development than others. This method prints out columns you may consider excluding from your model based on a series of exclusion criteria, including aspects such as having a large proportion of null values, having only one value in the whole column, or having a particularly large number of categorical values.
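The exclusion criteria described above could be sketched in plain pandas roughly as follows. The function name, thresholds, and reason labels are hypothetical and are not taken from datto itself.

```python
import pandas as pd


def flag_columns_to_exclude(df, max_null_frac=0.5, max_categories=50):
    """Map column name -> reason it might be excluded (hypothetical helper)."""
    flagged = {}
    for col in df.columns:
        series = df[col]
        if series.isna().mean() > max_null_frac:
            # Large proportion of null values.
            flagged[col] = "mostly null"
        elif series.nunique(dropna=True) <= 1:
            # Only one value in the whole column.
            flagged[col] = "single value"
        elif (not pd.api.types.is_numeric_dtype(series)
              and series.nunique() > max_categories):
            # Non-numeric column with a large number of distinct categories.
            flagged[col] = "too many categories"
    return flagged


# Made-up data for illustration only.
df = pd.DataFrame({
    "mostly_missing": [None, None, None, 1.0],
    "constant": ["a", "a", "a", "a"],
    "user_id": ["u1", "u2", "u3", "u4"],
    "score": [0.1, 0.4, 0.3, 0.9],
})

flagged = flag_columns_to_exclude(df, max_null_frac=0.5, max_categories=3)
print(flagged)
```

The thresholds are judgment calls; a flagged column is a candidate for review, not something to drop automatically.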

Finding correlated features

Depending on your model type, you may want to remove correlated features ahead of time to avoid violating model assumptions. This method prints out any pairs of correlated features along with the associated Pearson correlation coefficient.
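As a rough illustration of the underlying check, the sketch below lists pairs of numeric columns whose absolute Pearson correlation meets a cutoff. The helper name and the 0.9 threshold are assumptions, not datto's API or defaults.

```python
import pandas as pd


def find_correlated_pairs(df, threshold=0.9):
    """List (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    # DataFrame.corr() computes Pearson correlation by default.
    corr = df.select_dtypes(include="number").corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip self-pairs
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs


# Made-up data: "b" is exactly twice "a", while "c" is only weakly related.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})

pairs = find_correlated_pairs(df, threshold=0.9)
print(pairs)
```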

Additional helpful packages:

  • Altair - creates HTML interactive visualizations

Preparing your data

Now that you have a better understanding of your data, you'll want to start doing the actual transformations needed before feeding that data into a machine learning model.

Fixing column data types

Sometimes Pandas doesn't pick the correct data type for a given column. Perhaps you have a column with 1s, 2s, and 3s, but these represent particular categories, not a continuous quantity. You can use this method to easily recode columns as a given feature type.
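In plain pandas, the 1s, 2s, and 3s example might be recoded like this. The `recode_columns` helper and its mapping argument are hypothetical; datto's own method may look different.

```python
import pandas as pd


def recode_columns(df, dtype_by_column):
    """Return a copy of df with the given columns cast to new dtypes (hypothetical helper)."""
    # DataFrame.astype accepts a {column: dtype} mapping and returns a new frame.
    return df.astype(dtype_by_column)


# Made-up data: "tier" holds category codes, not a measured quantity.
df = pd.DataFrame({"tier": [1, 2, 3, 1], "spend": [10.0, 25.5, 40.0, 12.0]})

recoded = recode_columns(df, {"tier": "category"})
print(recoded.dtypes)
```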

Removing duplicate columns

Sometimes your original query will give you duplicate columns. Quickly clean them up with this method.
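One common way to do this by hand, assuming "duplicate" means a repeated column name (as often happens after a SQL join), is shown below. This is a generic pandas idiom rather than datto's code.

```python
import pandas as pd

# Made-up query result where "id" came back twice.
df = pd.DataFrame(
    [[1, "a", 1], [2, "b", 2]],
    columns=["id", "name", "id"],
)

# columns.duplicated() marks every repeat after the first occurrence,
# so negating it keeps only the first copy of each column name.
deduped = df.loc[:, ~df.columns.duplicated()]
print(list(deduped.columns))
```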

Compressing your data frame

The original data types that Pandas uses take up lots of space. Many integers can be recoded as data types with fewer possible values and categorical data can also be transformed to take up less space. This method automatically checks each column and can compress your data frame dramatically, making your future computing times much faster.
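A simplified sketch of this kind of compression is below. The `compress` helper and its `max_category_frac` cutoff are assumptions for illustration; note that downcasting floats to 32 bits trades away some precision, which may or may not be acceptable for your data.

```python
import numpy as np
import pandas as pd


def compress(df, max_category_frac=0.5):
    """Return a copy of df using smaller dtypes where possible (hypothetical helper)."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if pd.api.types.is_integer_dtype(s):
            # Pick the smallest integer type that can hold the observed values.
            out[col] = pd.to_numeric(s, downcast="integer")
        elif pd.api.types.is_float_dtype(s):
            # May become float32, which loses some precision.
            out[col] = pd.to_numeric(s, downcast="float")
        elif (pd.api.types.is_string_dtype(s)
              and s.nunique() / max(len(s), 1) <= max_category_frac):
            # Text columns with few distinct values are stored once per category.
            out[col] = s.astype("category")
    return out


# Made-up data for illustration only.
df = pd.DataFrame({
    "n": np.arange(1000),
    "price": np.linspace(0.0, 1.0, 1000),
    "country": ["US", "DE", "JP", "BR"] * 250,
})

compressed = compress(df)
before = df.memory_usage(deep=True).sum()
after = compressed.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```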

Additional helpful packages:

  • Featuretools - generates additional features for use in your model
  • snorkel - assists in creating labels for unlabeled data
  • fancyimpute - many methods for better filling null values

Training your model

Before choosing your final model, you'll want to test out several types to determine the best fit.

Model testing

This complex method automates much of the model-choosing process for you. Sometimes you'll want to choose a specific model given data or implementation constraints. But many times, you'll simply want to identify the best model with the best parameters. This method grid searches across common models with common parameters and outputs a detailed summary of the highest-performing models. There is an option to specify which scoring parameter matters most to you, as well as an option to append your top results to a CSV file, making it much easier to track various models over many tests.
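The general pattern can be sketched with scikit-learn's `GridSearchCV`. The two candidate models, their parameter grids, the `f1` scoring choice, and the `append_results` helper are all assumptions chosen for illustration; they are not datto's actual defaults or API.

```python
import os

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data for illustration only.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Each candidate is a model paired with a small parameter grid.
candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (RandomForestClassifier(random_state=0),
     {"n_estimators": [25, 50], "max_depth": [3, None]}),
]

results = []
for model, grid in candidates:
    # scoring selects the metric that matters most for this project.
    search = GridSearchCV(model, grid, scoring="f1", cv=3)
    search.fit(X, y)
    results.append({
        "model": type(model).__name__,
        "best_params": search.best_params_,
        "best_score": search.best_score_,
    })

# Highest-scoring model first.
results.sort(key=lambda row: row["best_score"], reverse=True)
for row in results:
    print(row)


def append_results(rows, path):
    """Append rows to a CSV, writing the header only if the file is new."""
    pd.DataFrame(rows).to_csv(
        path, mode="a", header=not os.path.exists(path), index=False
    )
```

Calling `append_results(results, "model_runs.csv")` after each experiment (the file name is arbitrary) keeps a running log that can be compared across tests.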

Additional helpful packages:

  • Modin - faster computing in Pandas

Getting results

This section has methods for various types of models, including text clustering, classification, and regression models.

Coefficients graph

Using the incredibly helpful SHAP package, this method provides a wrapper that simplifies the process of creating SHAP's graph of feature contributions for any model type. This is a great way to understand how your model is working under the hood. Having this understanding can help you identify any bias in your model, ways to improve it, and which features are actually necessary.

Most similar texts

This method takes in a data frame with a column of text and chooses the best number of categories to create using non-negative matrix factorization. It then outputs the categories chosen, the top words/phrases per category, as well as a listing of the best category for each row of text.

Scoring the final model

Once you've decided on the model to deploy, you can use this method to share its expected performance with your team. This method returns a summary of how the model performs on several common scoring methods. It also creates a confusion matrix for classification models.
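A summary like this can be assembled from scikit-learn's metric functions. The sketch below uses a synthetic dataset and a logistic regression as stand-ins, and the particular metrics chosen are assumptions rather than the exact set datto reports.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split

# Synthetic binary classification data for illustration only.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Stand-in for whichever model you decided to deploy.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Score on held-out data so the numbers reflect expected performance.
summary = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_test, y_pred)

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
print(matrix)
```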

Share your feedback

I hope you find this package useful throughout your machine learning projects, and I'm excited to hear from you on how to make this even better. Please submit feature requests, bug reports, and PRs to add features on the datto repository. I wish you the best of luck on your future machine learning projects!

