Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataCutter-related fixes for multiclass #263

Merged
merged 41 commits into from
Apr 19, 2019

Conversation

gerashegalov
Copy link
Contributor

@gerashegalov gerashegalov commented Apr 3, 2019

Related issues
Fixes #245

  • label sort by count is nondeterministic for labels with the same count
  • data cutter label trimming in validation is not called prior to bestEstimator
  • column metadata is not updated after trimming
  • in DataCutter.estimate whereas filter before sort is good for performance it may not drop labels non-deterministically unlike filtering over a stable order.

Describe the proposed solution
Remove non-determinism

@codecov
Copy link

codecov bot commented Apr 3, 2019

Codecov Report

Merging #263 into master will decrease coverage by 1.85%.
The diff coverage is 91.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #263      +/-   ##
==========================================
- Coverage   86.61%   84.75%   -1.86%     
==========================================
  Files         315      315              
  Lines       10345    10369      +24     
  Branches      346      563     +217     
==========================================
- Hits         8960     8788     -172     
- Misses       1385     1581     +196
Impacted Files Coverage Δ
.../stages/impl/feature/OpStringIndexerNoFilter.scala 100% <100%> (ø) ⬆️
...sforce/op/stages/impl/selector/ModelSelector.scala 98.18% <100%> (+0.03%) ⬆️
.../salesforce/op/stages/impl/tuning/DataCutter.scala 91.54% <90%> (-4.11%) ⬇️
...sforce/op/stages/base/binary/BinaryEstimator.scala 0% <0%> (-100%) ⬇️
...stages/base/sequence/BinarySequenceEstimator.scala 0% <0%> (-100%) ⬇️
...orce/op/stages/base/ternary/TernaryEstimator.scala 0% <0%> (-100%) ⬇️
...m/salesforce/op/utils/spark/OpVectorMetadata.scala 0% <0%> (-85.46%) ⬇️
...sforce/op/utils/spark/OpVectorColumnMetadata.scala 0% <0%> (-76.09%) ⬇️
...m/salesforce/op/aggregators/ExtendedMultiset.scala 0% <0%> (-75%) ⬇️
...ages/impl/classification/OpXGBoostClassifier.scala 6.89% <0%> (-25.87%) ⬇️
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6bb63ba...8333743. Read the comment docs.

@codecov
Copy link

codecov bot commented Apr 3, 2019

Codecov Report

Merging #263 into master will decrease coverage by 3.77%.
The diff coverage is 99.02%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #263      +/-   ##
==========================================
- Coverage   86.24%   82.47%   -3.78%     
==========================================
  Files         320      321       +1     
  Lines       10471    10539      +68     
  Branches      352      558     +206     
==========================================
- Hits         9031     8692     -339     
- Misses       1440     1847     +407
Impacted Files Coverage Δ
...sforce/op/stages/impl/selector/ModelSelector.scala 98.24% <100%> (+0.09%) ⬆️
...org/apache/spark/ml/attribute/MetadataHelper.scala 100% <100%> (ø)
.../stages/impl/feature/OpStringIndexerNoFilter.scala 100% <100%> (ø) ⬆️
...alesforce/op/stages/impl/tuning/DataBalancer.scala 96.29% <100%> (+0.1%) ⬆️
...om/salesforce/op/stages/impl/tuning/Splitter.scala 97.72% <100%> (+0.58%) ⬆️
...e/op/stages/impl/selector/ModelSelectorNames.scala 100% <100%> (ø) ⬆️
...alesforce/op/stages/impl/tuning/DataSplitter.scala 66.66% <100%> (ø) ⬆️
.../salesforce/op/stages/impl/tuning/DataCutter.scala 96.84% <98.59%> (+1.18%) ⬆️
...alesforce/op/cli/gen/templates/SimpleProject.scala 0% <0%> (-100%) ⬇️
.../scala/com/salesforce/op/cli/gen/ProblemKind.scala 0% <0%> (-100%) ⬇️
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9357a89...3d4198b. Read the comment docs.

@tovbinm
Copy link
Collaborator

tovbinm commented Apr 3, 2019

@gerashegalov can you elaborate more on the proposed fixes? thanks.

@gerashegalov gerashegalov force-pushed the gera/factorlabeltrim branch from 8333743 to 618bd92 Compare April 3, 2019 19:38
@gerashegalov
Copy link
Contributor Author

@tovbinm there are multiple issues.

One is the sort stability when there are multiple top labels with identical cardinality. We can solve this by adding the label as a secondary sort key. Similar issue with filtering labels for the minFraction prior to sorting. And #251 caused the data cutter omission to begin with

@tovbinm
Copy link
Collaborator

tovbinm commented Apr 4, 2019

@gerashegalov please let us know when you want the PR reviewed.

@gerashegalov
Copy link
Contributor Author

@tovbinm it's ready

Copy link
Collaborator

@tovbinm tovbinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a new test case explicitly for label trimming issue?

@gerashegalov
Copy link
Contributor Author

not ready for review yet, checkpointing WIP, sorting out test issues.

@gerashegalov gerashegalov force-pushed the gera/factorlabeltrim branch 2 times, most recently from 843f8cd to a43409a Compare April 9, 2019 12:16
@gerashegalov gerashegalov force-pushed the gera/factorlabeltrim branch 3 times, most recently from b7c72de to 5a5b19d Compare April 15, 2019 07:13
@gerashegalov
Copy link
Contributor Author

marking as ready for review since all unit testing are passing

Copy link
Collaborator

@leahmcguire leahmcguire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just remove the parts that were in there for testing and LGTM

@gerashegalov gerashegalov force-pushed the gera/factorlabeltrim branch from 27992db to 17c53fb Compare April 18, 2019 23:59
Copy link
Collaborator

@leahmcguire leahmcguire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!!!!

@gerashegalov gerashegalov merged commit 213962e into salesforce:master Apr 19, 2019
@gerashegalov gerashegalov deleted the gera/factorlabeltrim branch April 19, 2019 18:49
@tovbinm
Copy link
Collaborator

tovbinm commented Apr 19, 2019

🎆 🍾

@Jauntbox Jauntbox mentioned this pull request May 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Labels may not be dropped in MultiClassClassification
3 participants