Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save model locally and convert to zip before moving to final path and do the reverse for loading #516

Merged
merged 20 commits into from
Oct 13, 2020

Conversation

leahmcguire
Copy link
Collaborator

Related issues
The mleap bundle save works only with the local file system so workflow saves directly to hdfs fail after merging #475
#514

Describe the proposed solution
Save models first to a local tmp file and then zip and move to final location. move zipped file to local and unzip for loading

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context about the changes here.

def save(path: String, overwrite: Boolean = true): Unit = {
OpWorkflowModelWriter.save(this, path = path, overwrite = overwrite)
def save(path: String, overwrite: Boolean = true,
localDir: String = "tmp/model"): Unit = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. please add docs for localDir
  2. better use something random here, e.g. localDir: String = s"tmp/model-${System.currentTimeInMillis}"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider calling it modelStagingDir. We don't need an explicit default, we can "borrow"/reflect into https://github.com/apache/spark/blob/065f17386d1851d732b4c1badf1ce2e14d0de338/core/src/main/scala/org/apache/spark/util/Utils.scala createTempDir and other goodies

* @param inLoc location of directory to zip
* @param zipLoc output zip file name
*/
def zip(inLoc: String, zipLoc: String): Unit = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be beneficial to use this library instead. E.g

// Compress folder
ZipUtil.pack(new File("/tmp/demo"), new File("/tmp/demo.zip"))

// Decompress zip into a folder
ZipUtil.unpack(new File("/tmp/demo.zip"), new File("/tmp/demo"))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downsides are

  • it's yet another dependency to be vetted at Salesforce
  • and we lost the API that would allow to write the zip directly to the destination file system by providing an appropriate output stream here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets chat about this @gerashegalov

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

java.util.zip could be an alternative. zt-zip seems pretty light-weight, though.

@codecov
Copy link

codecov bot commented Oct 4, 2020

Codecov Report

Merging #516 into master will decrease coverage by 0.02%.
The diff coverage is 72.72%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #516      +/-   ##
==========================================
- Coverage   86.72%   86.70%   -0.03%     
==========================================
  Files         347      347              
  Lines       11897    11883      -14     
  Branches      381      379       -2     
==========================================
- Hits        10318    10303      -15     
- Misses       1579     1580       +1     
Impacted Files Coverage Δ
...c/main/scala/com/salesforce/op/ModelInsights.scala 90.12% <0.00%> (-1.76%) ⬇�
...op/evaluators/OpMultiClassificationEvaluator.scala 94.84% <ø> (ø)
.../src/main/scala/com/salesforce/op/OpWorkflow.scala 88.19% <100.00%> (ø)
...main/scala/com/salesforce/op/OpWorkflowModel.scala 93.90% <100.00%> (ø)
...cala/com/salesforce/op/OpWorkflowModelReader.scala 92.52% <100.00%> (+1.31%) ⬆�
...cala/com/salesforce/op/OpWorkflowModelWriter.scala 100.00% <100.00%> (ø)
...m/salesforce/op/utils/stages/NameDetectUtils.scala 88.05% <0.00%> (-1.40%) ⬇�
...e/op/stages/impl/feature/SmartTextVectorizer.scala 95.20% <0.00%> (-0.13%) ⬇�
...s/impl/feature/OPCollectionHashingVectorizer.scala 96.50% <0.00%> (-0.05%) ⬇�
...p/stages/impl/feature/SmartTextMapVectorizer.scala 100.00% <0.00%> (ø)
... and 54 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 09200c8...00f3e8d. Read the comment docs.

@leahmcguire
Copy link
Collaborator Author

combust/mleap#721

@leahmcguire
Copy link
Collaborator Author

You want the link in the code?

@tovbinm
Copy link
Collaborator

tovbinm commented Oct 9, 2020

@leahmcguire It's fine to have it here.

Copy link
Collaborator

@TuanNguyen27 TuanNguyen27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@nicodv nicodv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but may want to avoid additional dependency

Copy link
Contributor

@gerashegalov gerashegalov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, minor comment

def loadModel(
path: String,
asSpark: Boolean = true,
localDir: String = WorkflowFileReader.localDir
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just maybe call the parameter modelStagingDir too

@leahmcguire leahmcguire merged commit 9ab7d10 into master Oct 13, 2020
@leahmcguire leahmcguire deleted the lm/localSaveAndLoad branch October 13, 2020 18:27
implicit val conf = new org.apache.hadoop.conf.Configuration()
Try(WorkflowFileReader.loadFile(OpWorkflowModelReadWriteShared.jsonPath(path)))
.flatMap(loadJson(_, path = path)) match {
val localPath = new Path(new File(modelStagingDir).getAbsolutePath)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not mix Hadoop and JDK api, let us get rid of new File

to avoid ambiguity let us request the local filesystem explicitly

val localFileSystem = FileSystem.getLocal(spark.sparkContext.hadoopConfiguration)

.flatMap(loadJson(_, path = path)) match {
val localPath = new Path(new File(modelStagingDir).getAbsolutePath)
val savePath = new Path(path)
val fs = savePath.getFileSystem(conf)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let us call it destinationFileSystem

val localPath = new Path(new File(modelStagingDir).getAbsolutePath)
val savePath = new Path(path)
val fs = savePath.getFileSystem(conf)
fs.delete(localPath, true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the bug where we try to delete the local path using the remote (s3,hdfs) filesystem

sanmitra added a commit that referenced this pull request Nov 2, 2020
…path and do the reverse for loading (#516)"

This reverts commit 9ab7d10.
@salesforce-cla
Copy link

Thanks for the contribution! It looks like @leahmcguire is an internal user so signing the CLA is not required. However, we need to confirm this.

@salesforce-cla
Copy link

Thanks for the contribution! Unfortunately we can't verify the commit author(s): leahmcguire <l***@s***.com> Leah McGuire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants