Permutation importance is a simple yet powerful tool in the hands of a machine learning enthusiast. It has many applications: you can use it to validate your model and dataset, to find important but non-obvious dependencies between features and a label, and to drop redundant features from the dataset.
In this article, I want to show how permutation importance helped me analyze the Manhattan real estate market. The full research is available via Google Colab. The dataset contains a bit more than 20k entries describing Lower and Central Manhattan real estate properties. When I looked at the data, one question came to mind: which features of this dataset actually influence the market price of a property?
If you want to know which single feature influences the market price the most, I have to disappoint you: you will not find anything new here, because (spoiler alert) it is the size of the property. On the other hand, I did reach a few fascinating conclusions in this research, and (spoiler alert) there are features besides size that influence a property's market price.
As previously stated, the dataset contains ~20k entries (22,314, to be exact). Each entry holds descriptive information about a real estate property. First, here is a table of feature name/description pairs. From here on, I use only the feature names, so you can come back to this table for reference.
The dataset was split into three subsets by the prop_type feature:
- Residential (condominium, apartment, etc.) — 17,859 entries;
- Commercial (office, common area, store, etc.) — 3,970 entries;
- Other (school, religious, vacant land, etc.) — only 482 entries.
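As a quick sketch, splitting the dataset by prop_type in pandas might look like the following. Only the prop_type column name comes from the article; the toy data and the exact category labels in the sets below are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataset; only the prop_type column name
# comes from the article, the rows and labels here are made up.
df = pd.DataFrame({
    "prop_type": ["condominium", "office", "school", "apartment", "store"],
    "market_value": [850_000, 2_100_000, 400_000, 650_000, 1_200_000],
})

# Hypothetical category groupings matching the article's three subsets.
residential_types = {"condominium", "apartment"}
commercial_types = {"office", "common area", "store"}

residential = df[df["prop_type"].isin(residential_types)]
commercial = df[df["prop_type"].isin(commercial_types)]
other = df[~df["prop_type"].isin(residential_types | commercial_types)]

print(len(residential), len(commercial), len(other))
```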
My main interest is in Residential and Commercial properties. Image 1 below shows the distribution of the properties, colored by market value range.
The first thing I want to do is calculate the correlations between the features of the dataset. Correlation measures the 'dependency' of one feature on another: with positive correlation, when values in the first vector increase, values in the second vector tend to increase as well; with negative correlation, when values in the first vector increase, values in the second tend to decrease. Image 2 shows the correlation heatmap of the dataset.
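With pandas, the whole correlation matrix is a single call. Here is a small sketch on synthetic stand-ins for a few of the dataset's features (the real notebook runs df.corr() on all columns and renders the result as a heatmap):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins: lot_depth is built to correlate strongly
# with lot_width, year_built is unrelated to both.
lot_width = rng.normal(30, 5, 500)
lot_depth = lot_width * 2 + rng.normal(0, 2, 500)
year_built = rng.integers(1900, 2020, 500)

df = pd.DataFrame({"lot_width": lot_width,
                   "lot_depth": lot_depth,
                   "year_built": year_built})

corr = df.corr()  # Pearson correlation matrix, values in [-1, 1]
print(corr.round(2))
```

To get a heatmap like image 2, seaborn's `sns.heatmap(corr, annot=True)` is the usual one-liner.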
About permutation importance
At least an intuitive understanding of permutation importance is needed to read further. If you already have one, you can skip this section.
To comprehend permutation importance, you just need to figure out what the algorithm does, and there is nothing complicated about it. Permutation importance takes a trained model, a feature set, and a label vector as input. In our case, the feature set is the descriptive information about the properties, and the labels are their market values.
In simple words, the permutation importance process is as follows:
- Randomly shuffle the data in one feature (for example, year_built);
- Run the model on the corrupted data;
- Calculate the score: the difference between the model's accuracy on the true data and its accuracy on the corrupted data;
- Select the next feature and go back to step 1.
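The steps above can be sketched in a few lines of Python. This is a hand-rolled version for clarity, not the actual notebook code (scikit-learn ships the same idea as sklearn.inspection.permutation_importance); the toy model and data below are made up for the demonstration.

```python
import numpy as np
import pandas as pd

def permutation_importance_scores(model_score, X, y, seed=0):
    """Shuffle one feature at a time and record how much the
    model's score drops. model_score(X, y) returns the score."""
    rng = np.random.default_rng(seed)
    baseline = model_score(X, y)          # accuracy on the true data
    scores = {}
    for name in X.columns:
        X_corrupted = X.copy()
        # Step 1: randomly shuffle the values of a single feature
        X_corrupted[name] = rng.permutation(X_corrupted[name].values)
        # Steps 2-3: re-score on corrupted data, record the drop
        scores[name] = baseline - model_score(X_corrupted, y)
    return scores

# Toy setup: y depends on x1 only, so x1 should get the high score.
rng = np.random.default_rng(42)
X = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
y = 3 * X["x1"].values + rng.normal(scale=0.1, size=300)

# "Trained model": ordinary least squares via numpy; score is R^2.
coef, *_ = np.linalg.lstsq(X.values, y, rcond=None)

def r2_score_fn(X_eval, y_eval):
    pred = X_eval.values @ coef
    ss_res = np.sum((y_eval - pred) ** 2)
    ss_tot = np.sum((y_eval - y_eval.mean()) ** 2)
    return 1 - ss_res / ss_tot

scores = permutation_importance_scores(r2_score_fn, X, y)
print(scores)  # x1's score is large, x2's is near zero
```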
So, obviously, when a feature is important to the model, corrupting it leads to a high permutation importance score. When a feature is not important, corrupting it barely affects the model's performance, and the score is low. A negative score means the model performs even better without that feature.
There is one pitfall with permutation importance: it assigns low scores to features in a multicollinear dataset. For instance, the lot_depth and lot_width features are highly correlated, and therefore both get low permutation importance scores. This happens because, when we randomly shuffle lot_depth, the model still receives the same information from the lot_width feature.
This problem is described in detail in the Google Colab research. This brief article only outlines the process of decorrelating the dataset.
One way to decorrelate a dataset is to cluster similar features into groups by their correlations and then select one feature from each group. That's exactly what was done. Since I had no idea how many clusters to expect, I applied a hierarchical clustering algorithm to the dataset. Image 3 shows the dendrograms of the clustered features for Residential (left) and Commercial (right) properties.
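A sketch of that clustering with SciPy, on synthetic stand-ins for the features (the real notebook does this on the full correlation matrix): turn correlations into distances, build the hierarchy, cut the tree, and keep one feature per cluster. The 0.5 cut threshold below is an assumed value for illustration.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
lot_width = rng.normal(30, 5, 400)
df = pd.DataFrame({
    "lot_width": lot_width,
    "lot_depth": 2 * lot_width + rng.normal(0, 1, 400),  # near-duplicate
    "year_built": rng.integers(1900, 2020, 400),         # independent
})

# Turn correlations into distances: highly correlated -> close together.
corr = df.corr().values
dist = 1 - np.abs(corr)
np.fill_diagonal(dist, 0.0)
linkage = hierarchy.linkage(squareform(dist, checks=False),
                            method="average")
# hierarchy.dendrogram(linkage, labels=df.columns.tolist())
# would draw a tree like image 3.

# Cut the tree at an assumed distance of 0.5, keep one feature
# per resulting cluster (the first one encountered).
cluster_ids = hierarchy.fcluster(linkage, t=0.5, criterion="distance")
kept = {}
for cid, name in zip(cluster_ids, df.columns):
    kept.setdefault(cid, name)
keep = list(kept.values())
print(keep)  # one of lot_width/lot_depth, plus year_built
```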
After selecting one feature from each cluster, the correlations between the remaining features look much better (image 4).
After these steps, the model can finally be built and the permutation scores calculated. Again, all the training steps are shown in detail in Google Colab; here, I only post the results of the permutation importance algorithm.
The features that influence the price the most are, of course, measures of the property's size: the number of units for residential properties, and the lot depth (along with the lot width, since they belong to the same cluster) for commercial ones. Interestingly, the tax amount and the number of stories have high importance scores for commercial properties, and the property indicator feature is of medium importance for them. For residential properties, the tax amount is of medium importance, while the property indicator barely influences the market price.
The building_construction_type feature provides no value to the model at all; it can safely be deleted from the dataset.
As a bonus, I also wanted to build a model for the Other properties. Unfortunately, that's not achievable 🙈: there are only 482 entries in the dataset. Even the training accuracy of the model is no higher than 50%, to say nothing of the test accuracy, which is sometimes even negative.
Still, here are the permutation importance scores for the best model run, without any analysis of the results.
Permutation importance is a must-have tool for a data scientist. It helps you understand both the dataset and the model better; this simple tool really can tell a lot about the data.
From a business perspective, what we've learned is how the Manhattan real estate market differs between residential and commercial properties. If you want to buy or rent a property to live in, you mostly have to rely on the property's size, the prestige of the year it was built, and, a little, on the taxes. If you plan to buy or rent a property for commercial purposes, then along with the property's size you should pay more attention to the taxes and the number of stories in the building.