Save model locally and convert to zip before moving to final path and do the reverse for loading #516
Conversation
def save(path: String, overwrite: Boolean = true): Unit = {
  OpWorkflowModelWriter.save(this, path = path, overwrite = overwrite)
def save(path: String, overwrite: Boolean = true,
  localDir: String = "tmp/model"): Unit = {
- Please add docs for localDir
- Better to use something random here, e.g. localDir: String = s"tmp/model-${System.currentTimeMillis}"
Consider calling it modelStagingDir. We don't need an explicit default; we can "borrow"/reflect into createTempDir and other goodies in https://github.com/apache/spark/blob/065f17386d1851d732b4c1badf1ce2e14d0de338/core/src/main/scala/org/apache/spark/util/Utils.scala
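For illustration, a collision-free default could also be built on the JDK alone, without reflecting into Spark internals. This is a minimal sketch under that assumption; ModelStaging and newStagingDir are hypothetical names, not part of the codebase:

```scala
import java.nio.file.Files

// Hypothetical helper: derive a unique local staging directory instead of a
// fixed "tmp/model" default, so concurrent saves cannot collide.
// Files.createTempDirectory creates a fresh directory under java.io.tmpdir,
// similar in spirit to Spark's internal Utils.createTempDir.
object ModelStaging {
  def newStagingDir(prefix: String = "op-model-"): String =
    Files.createTempDirectory(prefix).toString
}
```

Each call returns a distinct directory that already exists on the local filesystem.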
 * @param inLoc location of directory to zip
 * @param zipLoc output zip file name
 */
def zip(inLoc: String, zipLoc: String): Unit = {
It could be beneficial to use the zt-zip library instead, e.g.:
// Compress folder
ZipUtil.pack(new File("/tmp/demo"), new File("/tmp/demo.zip"))
// Decompress zip into a folder
ZipUtil.unpack(new File("/tmp/demo.zip"), new File("/tmp/demo"))
The downsides are:
- it's yet another dependency to be vetted at Salesforce
- we lose the API that would allow writing the zip directly to the destination file system by providing an appropriate output stream here
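To illustrate the second point with JDK classes only: an API that accepts an OutputStream keeps the caller free to supply any sink, e.g. a stream opened directly on the destination filesystem. This is a hedged sketch; StreamingZip and writeEntries are hypothetical names:

```scala
import java.io.OutputStream
import java.util.zip.{ZipEntry, ZipOutputStream}

// Hypothetical sketch: by taking an OutputStream rather than a file path,
// the caller can pass in any sink, e.g. a Hadoop FSDataOutputStream from
// fileSystem.create(savePath), so the zip streams straight to the destination.
object StreamingZip {
  def writeEntries(out: OutputStream, entries: Map[String, Array[Byte]]): Unit = {
    val zos = new ZipOutputStream(out)
    try {
      for ((name, bytes) <- entries) {
        zos.putNextEntry(new ZipEntry(name))
        zos.write(bytes)
        zos.closeEntry()
      }
    } finally zos.close()
  }
}
```

With a file-path-only API (as in zt-zip's pack), this flexibility is lost.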
Let's chat about this @gerashegalov
java.util.zip could be an alternative. zt-zip seems pretty light-weight, though.
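A minimal sketch of the java.util.zip alternative, mirroring zt-zip's pack/unpack for the flat (single-level directory) case only. JdkZip and its signatures are hypothetical, not part of the codebase:

```scala
import java.io.{File, FileInputStream, FileOutputStream}
import java.util.zip.{ZipEntry, ZipInputStream, ZipOutputStream}

// Hypothetical sketch: zip/unzip with only JDK classes, no extra dependency.
// Recursion into subdirectories is omitted for brevity.
object JdkZip {
  private def copy(read: Array[Byte] => Int, write: (Array[Byte], Int) => Unit): Unit = {
    val buf = new Array[Byte](8192)
    var n = read(buf)
    while (n >= 0) { if (n > 0) write(buf, n); n = read(buf) }
  }

  // Pack every regular file directly under inDir into zipFile
  def pack(inDir: File, zipFile: File): Unit = {
    val zos = new ZipOutputStream(new FileOutputStream(zipFile))
    try for (f <- inDir.listFiles() if f.isFile) {
      zos.putNextEntry(new ZipEntry(f.getName))
      val in = new FileInputStream(f)
      try copy(b => in.read(b), (b, n) => zos.write(b, 0, n)) finally in.close()
      zos.closeEntry()
    } finally zos.close()
  }

  // Unpack zipFile into outDir, creating it if needed
  def unpack(zipFile: File, outDir: File): Unit = {
    outDir.mkdirs()
    val zis = new ZipInputStream(new FileInputStream(zipFile))
    try {
      var entry = zis.getNextEntry
      while (entry != null) {
        val out = new FileOutputStream(new File(outDir, entry.getName))
        try copy(b => zis.read(b), (b, n) => out.write(b, 0, n)) finally out.close()
        entry = zis.getNextEntry
      }
    } finally zis.close()
  }
}
```

The trade-off is exactly the one discussed above: more code to maintain in exchange for one fewer dependency to vet.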
Codecov Report
@@ Coverage Diff @@
## master #516 +/- ##
==========================================
- Coverage 86.72% 86.70% -0.03%
==========================================
Files 347 347
Lines 11897 11883 -14
Branches 381 379 -2
==========================================
- Hits 10318 10303 -15
- Misses 1579 1580 +1
Continue to review full report at Codecov.
You want the link in the code?
@leahmcguire It's fine to have it here.
LGTM
LGTM, but may want to avoid additional dependency
LGTM, minor comment
def loadModel(
  path: String,
  asSpark: Boolean = true,
  localDir: String = WorkflowFileReader.localDir
Maybe call the parameter modelStagingDir too.
implicit val conf = new org.apache.hadoop.conf.Configuration()
Try(WorkflowFileReader.loadFile(OpWorkflowModelReadWriteShared.jsonPath(path)))
  .flatMap(loadJson(_, path = path)) match {
val localPath = new Path(new File(modelStagingDir).getAbsolutePath)
We should not mix Hadoop and JDK APIs; let's get rid of new File. To avoid ambiguity, let's request the local filesystem explicitly:
val localFileSystem = FileSystem.getLocal(spark.sparkContext.hadoopConfiguration)
  .flatMap(loadJson(_, path = path)) match {
val localPath = new Path(new File(modelStagingDir).getAbsolutePath)
val savePath = new Path(path)
val fs = savePath.getFileSystem(conf)
Let's call it destinationFileSystem.
val localPath = new Path(new File(modelStagingDir).getAbsolutePath)
val savePath = new Path(path)
val fs = savePath.getFileSystem(conf)
fs.delete(localPath, true)
This is the bug: we try to delete the local staging path using the remote (S3/HDFS) filesystem resolved from the save path.
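A sketch of the kind of fix being suggested: resolve each path against the filesystem it actually lives on, rather than using one filesystem for both. The helper name cleanupStaging is hypothetical, and the parameter names follow the surrounding diff; this is not the actual patch:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sketch: the staging dir lives on the local filesystem, while
// the save path may be on S3/HDFS, so each needs its own FileSystem instance.
def cleanupStaging(modelStagingDir: String, path: String, conf: Configuration): Unit = {
  val localFileSystem = FileSystem.getLocal(conf)          // local FS for the staging dir
  val localPath = new Path(modelStagingDir)
  val savePath = new Path(path)
  val destinationFileSystem = savePath.getFileSystem(conf) // possibly remote FS for the output

  // Delete the staging dir through the LOCAL filesystem; the buggy version
  // issued this delete through the remote filesystem resolved from savePath.
  localFileSystem.delete(localPath, true)
}
```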
Thanks for the contribution! It looks like @leahmcguire is an internal user so signing the CLA is not required. However, we need to confirm this. |
Thanks for the contribution! Unfortunately we can't verify the commit author(s): leahmcguire <l***@s***.com> Leah McGuire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request. |
Related issues
The mleap bundle save works only with the local file system so workflow saves directly to hdfs fail after merging #475
#514
Describe the proposed solution
Save models first to a local tmp file, then zip and move to the final location. For loading, move the zipped file to local and unzip it.