Hello everyone,
As part of my Master's thesis, I am developing a new estimator based on Isolation Forest that operates on residuals. Without delving into the theoretical background, which isn't relevant here, I'm currently facing a technical issue.
My repository is available at: (Rif estimator)
The repository includes two modules:

- `RIF`
- `_residual_gen`
The estimator is implemented within the scikit-learn ecosystem and therefore inherits its methods. In particular, here is what happens:
When I call the `fit` method on the `RIF` estimator, it internally invokes `fit_transform` from `_residual_gen`, which is responsible for computing residuals and using them to fit the Isolation Forest.

These residuals are computed using a Random Forest model. To avoid data leakage, they are calculated either with out-of-bag (OOB) predictions or k-fold cross-validation. (There’s also a “vanilla” version without leakage control, but that’s not relevant for this issue.)
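To make that concrete, here is a minimal sketch of what the k-fold variant of the residual generation amounts to. Function names and defaults are only illustrative; the actual implementation lives in `_residual_gen`:

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestRegressor
from sklearn.model_selection import cross_val_predict

def fit_residual_iforest(X, y, n_splits=5, random_state=0):
    """Leakage-free residuals via k-fold CV, then an Isolation Forest on them."""
    y = np.asarray(y, dtype=float)
    rf = RandomForestRegressor(random_state=random_state)

    # Out-of-fold predictions: each sample is predicted by a model that never
    # saw it during training, so the residuals are free of leakage.
    y_oof = cross_val_predict(rf, X, y, cv=n_splits)
    residuals = (y - y_oof).reshape(-1, 1)

    # Keep a Random Forest fitted on the full data for prediction-time residuals,
    # and fit the Isolation Forest on the cached training residuals.
    rf.fit(X, y)
    iforest = IsolationForest(random_state=random_state).fit(residuals)
    return rf, iforest, residuals
```

(The OOB variant is the same idea, except the out-of-fold predictions are replaced by `rf.oob_prediction_` from a forest fitted with `oob_score=True`.)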
Once computed, the residuals are cached. Why?
Because when `RIF.predict(X)` is called:

- If the input `X` is the same as the one used in `RIF.fit(X)`, the cached residuals are reused.
- If the input `X` is different, the previously fitted Random Forest is used to compute new residuals, and anomalies are detected on these.

Currently, this distinction between training and prediction data is handled using `id(X)`, which checks whether the memory reference of the two datasets is the same. I also tried using a hash of the dataset content, but both approaches seem fragile and not robust in practice.

I’m looking for a better solution: either one that improves the logic of comparing the two datasets, or a new approach that achieves the same goal in a more reliable way.
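For reference, this is roughly the kind of check I currently rely on; the helper below is a standalone sketch, not the code from the repository:

```python
import hashlib
import numpy as np

def _content_digest(X):
    """Hash of the array contents; sensitive to dtype and memory layout."""
    return hashlib.sha256(np.ascontiguousarray(np.asarray(X)).tobytes()).hexdigest()

def is_training_data(X_fit, X_new):
    """Sketch of the current train-vs-new-data check: object identity first
    (what comparing id(X) amounts to), then a content hash as a fallback."""
    if X_new is X_fit:
        # Same object in memory; fails as soon as the caller passes a copy.
        return True
    # Recomputed on every predict call, which is wasteful for large X.
    return _content_digest(X_new) == _content_digest(X_fit)
```

Both branches break in the situations described above, which is why I’m asking for a more reliable alternative.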
Any help or suggestions would be greatly appreciated.
Best regards,
Giulio