Early in my time at Metis’ 12-week immersive Data Science bootcamp, I learned what I found to be one of the core lessons of machine learning: your model can only be as good as the features you feed it. You can use the fanciest packages out there for regression or classification, but if the features don’t tell the model much, they won’t get you very far.
While I was able to use domain knowledge about various subjects to develop features on my own, I knew there were features I was missing. This is likely true in any scenario, where there is always some telling feature that can be added.
So, while searching on Google, I came across Featuretools, a tool for automatically engineering features. My ears perked up immediately: a way to automatically create the features I had been missing seemed almost too good to be true.
I decided to investigate further and learn what it was all about.
Here, I’ll walk you through some of my main findings and thoughts after learning more about Featuretools.
What is Featuretools?
As I mentioned, Featuretools is a Python package to automatically develop features from a dataset. Their website puts it best:
Featuretools is a framework to perform automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning.
It does this through a few different techniques, but three core principles really drive the operation. They are:
- Representing Data as EntitySets
- Using “primitives” on the data
- Deep Feature Synthesis
These may not sound familiar right now, so I’ll take some time to walk through them.
Entities and EntitySets
When I first started investigating Featuretools, Entities and EntitySets immediately connected with SQL schema structures in my head. In that analogy, an entity is an individual table that connects to other tables via certain “key” columns, and those connections are what Featuretools refers to as relationships.
An EntitySet, intuitively, is then a collection of these Entities (tables), connected by these relationships. For a clear example, here is a setup they provide in their intro walkthrough of the framework:
As you can see, the Entities here are transactions, sessions, products, and customers. Various relationships connect the entities, such as product_id connecting transactions and products, session_id connecting transactions and sessions, and customer_id connecting sessions and customers.
One important thing to note is that each relationship is one-to-many: a key value appears only once in the “parent” entity (e.g. each session_id appears once in sessions), whereas it can appear many times in the “child” (e.g. each session can have multiple transactions associated with it). This must be specified when setting up the relationships.
All in all, this is extremely similar to working with relational databases in SQL, which is what helped solidify these concepts to me.
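To make the parent/child idea concrete, here is a rough sketch in plain Python. It does not use Featuretools itself: the `tables` dictionary, the `relationships` list, the `is_one_to_many` helper, and all the sample values are made up for illustration, loosely mirroring the customers, sessions, and transactions setup described above.

```python
# Toy stand-in for an EntitySet: plain dicts and lists, NOT Featuretools objects.
# All values below are invented for illustration.
tables = {
    "customers": [{"customer_id": 1}, {"customer_id": 2}],
    "sessions": [
        {"session_id": 10, "customer_id": 1},
        {"session_id": 11, "customer_id": 1},
        {"session_id": 12, "customer_id": 2},
    ],
    "transactions": [
        {"transaction_id": 100, "session_id": 10, "amount": 20.0},
        {"transaction_id": 101, "session_id": 10, "amount": 5.0},
        {"transaction_id": 102, "session_id": 11, "amount": 15.0},
        {"transaction_id": 103, "session_id": 12, "amount": 8.0},
    ],
}

# Each relationship: (parent table, child table, shared key column).
relationships = [
    ("customers", "sessions", "customer_id"),
    ("sessions", "transactions", "session_id"),
]


def is_one_to_many(parent, child, key):
    """True if `key` is unique in the parent and every child value exists there."""
    parent_keys = [row[key] for row in tables[parent]]
    unique_in_parent = len(parent_keys) == len(set(parent_keys))
    children_resolve = all(row[key] in set(parent_keys) for row in tables[child])
    return unique_in_parent and children_resolve


for parent, child, key in relationships:
    print(f"{parent} -> {child} via {key}: {is_one_to_many(parent, child, key)}")
```

Swapping parent and child (for example, treating sessions as the parent of customers) fails the check, because customer_id repeats in sessions. That is the same constraint a foreign key expresses in SQL.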
Primitives are the basic building blocks behind the more involved mechanisms of Featuretools. Simply put, they’re relatively basic functions that can be combined or chained together to unleash their full power in Deep Feature Synthesis.
Within the broader topic of primitives, there are two main types: aggregation and transform. Aggregation primitives work across related entities, summarizing many child rows into one value per parent row. They tend to be simple statistics: think standard deviation, sum, max, etc.
Transform primitives, on the other hand, operate within a single entity, on one or more of its columns. These are things like less than or equal to, equal, cumulative mean, etc.
Again, these seem simple now, but a bit later their full power will make more sense.
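As a hand-rolled illustration of the difference (plain Python, not the Featuretools implementations), the sketch below writes one aggregation and one transform from scratch. The function names `sum_by_parent` and `cumulative_mean` and the sample rows are assumptions made for this example.

```python
from collections import defaultdict

# Invented sample rows from a child table.
transactions = [
    {"session_id": 10, "amount": 20.0},
    {"session_id": 10, "amount": 5.0},
    {"session_id": 11, "amount": 12.5},
]


def sum_by_parent(rows, key, column):
    """Aggregation: collapse many child rows into one value per parent key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[column]
    return dict(totals)


def cumulative_mean(values):
    """Transform: one output per input row, computed within a single table."""
    out, running = [], 0.0
    for count, value in enumerate(values, start=1):
        running += value
        out.append(running / count)
    return out


session_totals = sum_by_parent(transactions, "session_id", "amount")
running_avg = cumulative_mean([t["amount"] for t in transactions])

print(session_totals)  # {10: 25.0, 11: 12.5}
print(running_avg)     # [20.0, 12.5, 12.5]
```

Note the shapes: the aggregation returns one value per session, while the transform returns one value per transaction.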
Variable Assignment is Crucial
As a bit of an aside before getting into Deep Feature Synthesis, one of my main takeaways about Featuretools is the importance of correct variable type assignment. While the program will try to detect variable types (and in my experience it does quite well), it is prudent to double-check them before running Deep Feature Synthesis.
The algorithm uses these data types to determine what type of operations can be done in Deep Feature Synthesis, so an incorrect data type could easily lead to inappropriate calculations.
While there are plenty of variable types to choose from, including categorical/ordinal, numerical, and boolean, custom variable types can also be defined by a user to fit specific use cases.
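A small sketch of why this matters, again in plain Python rather than Featuretools: the `ALLOWED` table, the type tags, and the column names are all hypothetical, chosen only to show how a type label gates which operations get applied.

```python
# Hypothetical mapping from a variable type to the primitives that make sense for it.
ALLOWED = {
    "numeric": ["sum", "mean", "max"],
    "categorical": ["num_unique", "mode"],
    "boolean": ["percent_true"],
}


def applicable_primitives(column_types):
    """Look up which primitives would be applied to each column, given its type."""
    return {column: ALLOWED[kind] for column, kind in column_types.items()}


# A zip code is stored as digits, so automatic detection could plausibly call it numeric.
inferred = {"amount": "numeric", "zip_code": "numeric"}
# After a manual check, zip_code is re-tagged as categorical.
corrected = {"amount": "numeric", "zip_code": "categorical"}

print(applicable_primitives(inferred)["zip_code"])   # ['sum', 'mean', 'max']
print(applicable_primitives(corrected)["zip_code"])  # ['num_unique', 'mode']
```

With the wrong tag, the pipeline would happily compute the mean zip code, which is a valid number but a meaningless feature.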
Deep Feature Synthesis
While the above topics are important to understand, Deep Feature Synthesis is the core engine of Featuretools. By combining primitives with the relationships between Entities, it can generate several levels of complex features.
While this sounds complicated, Featuretools makes it as easy as these steps:
- The user specifies the target entity to perform the operations on
- The user chooses which primitives to use: aggregation, transform, or a mix of both
- The user gives the max depth of feature synthesis
And just like that, Featuretools takes the wheel and does the rest. When it’s all said and done, a matrix of new features is created, ready to go for use. Additionally, in one of the cooler features (in my mind) of this product, the user can call up graphical and plain English explanations of the new features, so they can be sure of what actually happened.
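To give a feel for how those three inputs interact, here is a deliberately simplified toy in plain Python. It is not the Featuretools algorithm: the `CHILDREN` and `NUMERIC` dictionaries and the `synthesize` function are assumptions for illustration, it only handles aggregation primitives, and the generated names merely imitate the style of feature names the tool produces.

```python
# Toy schema: child tables and numeric columns per table (invented for illustration).
CHILDREN = {"customers": ["sessions"], "sessions": ["transactions"], "transactions": []}
NUMERIC = {"customers": [], "sessions": [], "transactions": ["amount"]}


def synthesize(target, agg_primitives, max_depth):
    """Return feature names for `target`, stacking aggregations up to max_depth."""
    features = list(NUMERIC[target])  # depth 0: the table's own numeric columns
    if max_depth == 0:
        return features
    for child in CHILDREN[target]:
        # Features of the child (built with one less level of depth) get aggregated up.
        for child_feature in synthesize(child, agg_primitives, max_depth - 1):
            for primitive in agg_primitives:
                features.append(f"{primitive.upper()}({child}.{child_feature})")
    return features


features = synthesize("customers", ["sum", "mean"], max_depth=2)
for name in features:
    print(name)
```

Even in this toy, two primitives and a depth of two already yield four stacked features for customers. The real tool considers many more combinations (transform primitives, more paths between tables, and so on), so the count grows much faster as depth and primitives increase.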
As an example, I’ll use the following feature
Looking at this, it may be tough to tell exactly what is going on, especially if you are browsing a long list of newly created features. Fortunately, the flow chart Featuretools can generate for a feature helps you see visually what happened:
For this feature, the English “translation” was as follows:
The average of the sum of the “amount” of all instances of “transactions” for each “session_id” in “sessions” of all instances of “sessions” for each “customer_id” in “customers”.
Between those two helper functions, it becomes much clearer what feature was actually created, and how it was made.
These explanations become more useful the deeper you go with Deep Feature Synthesis. You can increase the depth and add more primitives to see what comes out, or take a more focused approach with a handful of chosen primitives.
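The plain-English description quoted above can also be reproduced by hand, which is a useful sanity check on what a stacked feature means. The sketch below uses plain Python and invented sample data. It is not Featuretools output, just the same two steps (sum per session, then average per customer) written out explicitly.

```python
from collections import defaultdict
from statistics import mean

# Invented sample data.
sessions = [
    {"session_id": 10, "customer_id": 1},
    {"session_id": 11, "customer_id": 1},
    {"session_id": 12, "customer_id": 2},
]
transactions = [
    {"session_id": 10, "amount": 20.0},
    {"session_id": 10, "amount": 5.0},
    {"session_id": 11, "amount": 15.0},
    {"session_id": 12, "amount": 8.0},
]

# Step 1: the sum of "amount" over all transactions for each session_id.
session_sum = defaultdict(float)
for t in transactions:
    session_sum[t["session_id"]] += t["amount"]

# Step 2: the average of those per-session sums for each customer_id.
sums_by_customer = defaultdict(list)
for s in sessions:
    sums_by_customer[s["customer_id"]].append(session_sum[s["session_id"]])

mean_of_sums = {cid: mean(values) for cid, values in sums_by_customer.items()}
print(mean_of_sums)  # {1: 20.0, 2: 8.0}
```

Customer 1 has two sessions worth 25.0 and 15.0, so the feature value is 20.0. Customer 2 has a single session worth 8.0.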
Another interesting way to utilize Featuretools is on a single table (entity) of data, utilizing basic primitives to create interaction terms with your existing features. Not the most high-tech way to use the framework, but for me this was an easy spot to test it out to build interactions in a project.
Not too long after learning about this tool, I used it in my classification project at Metis, where I applied the transform primitives “add_numeric” (i.e. adding two values) and “multiply_numeric” (i.e. multiplying two values) to add some interaction terms near the end of my modeling/feature engineering process.
The end result was a model with only a slight boost in binary classification precision (.603 pre-Featuretools vs. .607 post-Featuretools), but it does show that the tool can help even in the smallest of applications.
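For a sense of what those two primitives amount to on a single table, here is a minimal sketch in plain Python. It is not the code from the project and not the Featuretools implementation: the `add_interactions` helper, the column names, the values, and the naming of the new columns are all assumptions for illustration.

```python
from itertools import combinations

# Invented single-table data with three numeric columns.
rows = [
    {"age": 30, "income": 50, "tenure": 2},
    {"age": 45, "income": 80, "tenure": 10},
]
numeric_columns = ["age", "income", "tenure"]


def add_interactions(row, columns):
    """Add the pairwise sum and product of every pair of numeric columns."""
    out = dict(row)
    for a, b in combinations(columns, 2):
        out[f"{a} + {b}"] = row[a] + row[b]
        out[f"{a} * {b}"] = row[a] * row[b]
    return out


expanded = [add_interactions(row, numeric_columns) for row in rows]
print(list(expanded[0].keys()))
```

Three columns give three pairs, so each row gains six new columns. With n numeric columns the count is n(n-1) for the two primitives combined, which is worth keeping in mind before running this on a wide table.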
In my opinion, one of the best parts of Featuretools is how informative and easy to follow the site is. From a quick 5 minute “how to” guide to more detailed breakdowns on the core concepts, the website makes learning the tool’s basics very easy and straightforward.
While, like anything, there is more advanced functionality to learn and master, it really doesn’t take too much practice to at least know how to get started on the standard EntitySet setup and Deep Feature Synthesis tools.
There’s also an excellent collection of Demos, where different use cases are provided for Featuretools. Covering everything from Predicting Olympic Medals to Predicting Taxi Trip Duration, these examples give you a good walkthrough on how to use the tool in the real world. Additionally, each example has a GitHub repo with example code, so you can see it in action yourself.
Overall, there is a lot to love about Featuretools. It provides a quick and easy way to engineer many features you may not have known how to develop on your own, with a gentle learning curve for nailing down the basic functionality. The way information is presented, and how much help and how many resources are available, is truly a plus for this framework.
Going beyond the surface, Featuretools gives you plenty of room for customization, with custom variable types and primitives that can be written. This type of flexibility is a really cool feature that can help make the tool work for your specific use case.
Like anything else, though, there are potential complications. For one thing, if you don’t fully understand how your data is structured, this tool could easily lead to data leakage if used incorrectly.
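Here is a small illustration of how leakage can sneak into an aggregated feature, using plain Python and invented data. The `total_spend` helper, the dates, and the idea of a per-prediction cutoff date are assumptions made for this sketch, not a description of how any particular project handled it.

```python
from datetime import date

# Invented transactions for one customer. The last one happens AFTER the prediction date.
transactions = [
    {"customer_id": 1, "amount": 20.0, "date": date(2019, 1, 5)},
    {"customer_id": 1, "amount": 30.0, "date": date(2019, 2, 1)},
    {"customer_id": 1, "amount": 500.0, "date": date(2019, 3, 20)},
]


def total_spend(rows, customer_id, cutoff=None):
    """Sum a customer's spend, optionally using only rows strictly before `cutoff`."""
    return sum(
        row["amount"]
        for row in rows
        if row["customer_id"] == customer_id
        and (cutoff is None or row["date"] < cutoff)
    )


prediction_date = date(2019, 3, 1)
leaky = total_spend(transactions, 1)                         # uses future data
safe = total_spend(transactions, 1, cutoff=prediction_date)  # only the past

print(leaky, safe)  # 550.0 50.0
```

The leaky version looks like a great feature in training precisely because it contains information that would not exist yet at prediction time.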
Additionally, I can certainly see how using it as a “shoot in the dark and ask questions later” tool could lead you into a rabbit hole of features. On top of that, blindly adding features can certainly introduce multicollinearity or overfitting into a model, depending on the amount of data.
At some point, the complexity may begin to outweigh the slight benefits in prediction — for some applications this may be a worthwhile trade, but in others maybe not.
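One rough way to keep the feature count in check is to drop generated features that are highly correlated with ones you already have. The sketch below is plain Python with invented values. The `pearson` and `drop_correlated` helpers and the 0.85 threshold are assumptions for illustration, and pairwise correlation is only a crude proxy for multicollinearity.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def drop_correlated(features, threshold=0.85):
    """Keep a feature only if it is not too correlated with any feature kept so far."""
    kept = {}
    for name, values in features.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


# Two original columns plus one generated interaction term (invented values).
features = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 1.0, 4.0, 3.0],
    "a + b": [3.0, 3.0, 7.0, 7.0],
}

kept = drop_correlated(features)
print(list(kept))  # ['a', 'b']
```

Here the generated "a + b" column correlates at roughly 0.89 with "a", so it is dropped under the chosen threshold, while "a" and "b" (correlation 0.6) both stay.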
All in all, my major thoughts can be summarized as such:
- If your main concern is predictive performance and you have a good grasp of your data structure and how to use Featuretools, it can only help. It will provide an easy, quick way to manufacture new features, without having to keep asking yourself “How can I write up a SQL query that gets me this very specific feature?”
- If you want to build a model that is primarily interpretable, you don’t have a ton of domain knowledge on the data, or you are dealing with time series that could easily have data leakage if combined with the wrong columns, proceed with caution. Not to say you shouldn’t use Featuretools in that case, but just make sure the data you are pulling makes sense, and that you have a firm grasp of how you are using the tool.
Thanks for taking the time to read through this! If you’d like to discuss this or any other topics further, please feel free to connect with me on LinkedIn.