-
Notifications
You must be signed in to change notification settings - Fork 396
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Keep raw feature distributions calculated in raw feature filter #76
Conversation
/** | ||
* Keeps the distribution information for features | ||
*/ | ||
trait FeatureDistributionBase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@leahmcguire why do we need the trait? if we do, let's name it FeatureDistributionLike
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok - It seemed nicer to make the trait than to move all of the classes in raw feature filter here. though I am willing to hear counter arguments :-P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I missed the part that it's located in a separate project. Makes sense then.
Let's rename it then to match our convention and also lets replace vals in the trait with defs.
Codecov Report
@@ Coverage Diff @@
## master #76 +/- ##
==========================================
- Coverage 85.97% 85.95% -0.02%
==========================================
Files 294 294
Lines 9530 9547 +17
Branches 320 337 +17
==========================================
+ Hits 8193 8206 +13
- Misses 1337 1341 +4
Continue to review full report at Codecov.
|
@@ -108,7 +107,7 @@ class OpWorkflow(val uid: String = UID[OpWorkflow]) extends OpWorkflowCore { | |||
* Will set the blacklisted features variable and if list is non-empty it will | |||
* @param features list of features to blacklist | |||
*/ | |||
private[op] def setBlacklist(features: Array[OPFeature]): Unit = { | |||
private[op] def setBlacklist(features: Array[OPFeature], distributions: Seq[FeatureDistribution]): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update doc.
val name: String | ||
|
||
/** | ||
* key map key associated with distribution (when the feature is a map) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
key map key
-> map key
val nulls: Long | ||
|
||
/** | ||
* binned counts of feature values (hashed for strings, evently spaced bins for numerics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
evently
-> evenly
"with distributions" should { | ||
"make a copy of the feature containing the specified distributions" in { | ||
val distrib = new FeatureDistributionBase { | ||
override val count: Long = 1L |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
override
is redundant.
@leahmcguire I think we should also (de)serialize feature distributions with the model. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Thanks for the contribution! Unfortunately we can't verify the commit author(s): leahmcguire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request. |
Related issues
Keeping feature distribution information calculated in raw feature filter for use in metrics logging and feature engineering
Describe the proposed solution
When raw feature filter is called it now returns the feature distribution information calculated on the training data. These distributions are kept in the workflow and added to the raw features so that they can be accessed later.
Describe alternatives you've considered
Could just write the information out directly from raw feature filter.