Contributing a deep-learning, BERT-based analyzer #13065
For the analyzer, do you mean something that tokenizes into an embedding? Or something that just creates the tokens (wordpiece + dictionary)? |
I mean, create just the tokens - the lemmas / wordpieces |
@lmessinger I don't see why text tokenization would need any native code. Word piece is pretty simple and just a dictionary look up. Do y'all not have a Java one? Or does this model actually need inference to do the lemmatization? (e.g. https://huggingface.co/dicta-il/dictabert-joint) ? |
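For illustration, a minimal sketch of the greedy longest-match WordPiece lookup described above, in plain Java with no native code; the tiny vocabulary in `main` is made up for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public final class WordPieceDemo {
  /** Greedy longest-match-first WordPiece split against a plain vocabulary. */
  static List<String> wordPiece(String word, Set<String> vocab) {
    List<String> pieces = new ArrayList<>();
    int start = 0;
    while (start < word.length()) {
      int end = word.length();
      String piece = null;
      while (start < end) { // try the longest remaining substring first
        String candidate = (start == 0 ? "" : "##") + word.substring(start, end);
        if (vocab.contains(candidate)) {
          piece = candidate;
          break;
        }
        end--;
      }
      if (piece == null) return List.of("[UNK]"); // no match at all: unknown token
      pieces.add(piece);
      start = end;
    }
    return pieces;
  }

  public static void main(String[] args) {
    Set<String> vocab = Set.of("un", "##aff", "##able");
    System.out.println(wordPiece("unaffable", vocab)); // [un, ##aff, ##able]
  }
}
```
|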
Hi,
In Hebrew and other Semitic languages, lemmas are context-dependent. E.g., שמן could be interpreted as "fat", "oil", "their name", or "from", all depending on the context.
So yes, we do need inference, and for inference, Python is the language. Either we compile the Python into native code (not so easy, but possible) or use it in a container, as a web server.
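(For the web-server option, a rough sketch of what the wiring could look like on the Lucene side; this is not our actual code, and the endpoint URL and plain-text response format are assumptions. A real filter would also send a whole sentence per request, since the lemma depends on context.)

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class RemoteLemmaFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final HttpClient client = HttpClient.newHttpClient();
  private final String endpoint; // e.g. "http://localhost:8000/lemma?w=" (hypothetical)

  public RemoteLemmaFilter(TokenStream input, String endpoint) {
    super(input);
    this.endpoint = endpoint;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // NOTE: one HTTP call per token is only for illustration; a real filter
    // would batch a sentence per request, since the lemma depends on context.
    String term = URLEncoder.encode(termAtt.toString(), StandardCharsets.UTF_8);
    HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint + term)).build();
    try {
      String lemma = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
      termAtt.setEmpty().append(lemma); // replace the surface form with the lemma
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("interrupted while calling lemmatizer", e);
    }
    return true;
  }
}
```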
|
It will be a major headache to maintain native bindings for all major platforms. I think such an analyzer should be a downstream project (then you can restrict the platforms on which it's available to whatever you wish to maintain). We can point at such a project from Lucene documentation, for example. |
Hi,
Got it. Pointing to the project from the documentation would actually be very valuable to the Hebrew community. How can that be done? Is the documentation also on GitHub, so that we can add it there as a PR for approval?
Thanks!
Lior
|
This is a question that is much harder to answer than I thought... Lucene doesn't have a tutorial/user guide. The only place I could think of was here, in the javadocs: https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/overview.html An alternative would be to include an empty package for Hebrew and only add the package-info.java file, telling folks where they can find downstream Hebrew analyzers. I really don't have any better ideas. |
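As a concrete illustration of that second idea, the placeholder package-info.java could look something like the sketch below; the package name and wording are assumptions:

```java
/**
 * Analyzer stubs for Hebrew.
 *
 * <p>Lucene itself does not ship a Hebrew analyzer. Deep-learning based
 * Hebrew/Arabic analysis components are maintained as downstream projects;
 * see the downstream project's documentation for details.
 */
package org.apache.lucene.analysis.he;
```
|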
How about something with the source maintained in the sandbox dir (along
with instructions to build), but no corresponding official release artifact?
|
It would be better located in the module "analysis" (which is just the parent of all analyzers). Unfortunately that module does not create javadocs, so analysis-common is the only location. I think it would be a good idea to add a list of external analysis components there. Lucene is a flexible library with extension points through SPI, so we can list all external contributions there. This page is also missing an overview of the analysis submodules.
An alternative (and, in my opinion, better) idea is to put the list in a Markdown file in the documentation package: https://github.com/apache/lucene/tree/main/lucene/documentation/src/markdown
All md files there are compiled to HTML and can be linked from the template file for index.html, too. |
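To illustrate the SPI extension point mentioned above: a downstream project can ship its own factory and register it by listing the fully qualified class name in a META-INF/services file, after which Lucene can discover it by name. The factory and filter below are hypothetical:

```java
import java.util.Map;
import org.apache.lucene.analysis.TokenFilterFactory;
import org.apache.lucene.analysis.TokenStream;

/**
 * Hypothetical downstream factory; registered by listing its class name in
 * META-INF/services/org.apache.lucene.analysis.TokenFilterFactory.
 */
public class HebrewLemmaFilterFactory extends TokenFilterFactory {
  /** SPI name, resolvable via TokenFilterFactory.forName("hebrewLemma", args). */
  public static final String NAME = "hebrewLemma";

  public HebrewLemmaFilterFactory(Map<String, String> args) {
    super(args);
    if (!args.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + args);
    }
  }

  @Override
  public TokenStream create(TokenStream input) {
    return new HebrewLemmaFilter(input); // hypothetical filter implementation
  }
}
```
|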
I don't think this is a good idea. It won't be tested (as we can't run the build), and it is also inconsistent. We had that in the past for DirectIODirectory and WindowsDirectory. All of those were not maintained -- and did not build anymore, although there were build scripts. The Java parts were building, but the JNI parts no longer matched the Java implementations. I may be wrong, but when we looked into this, it was almost impossible to make it work again. Luckily they were rewritten using Java 11+ APIs and are now part of the official distribution. |
Description
Hi,
We are building an open-source custom Hebrew/Arabic analyzer (lemmatizer and stopwords), based on a BERT model. We'd like to contribute it to this repository. How can we do that and have it accepted? Can we compile it to native code and use JNI or Panama? If not, what is the best approach?
#12502 (comment)
@uschindler, we would be very happy to hear what you think.
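For context on the JNI/Panama part of the question, this is roughly what a Panama (java.lang.foreign, finalized in Java 22) downcall into a hypothetical native lemmatizer library could look like; the library name, function name, and signature are made up for the sketch:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SymbolLookup;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

public final class NativeLemmatizerDemo {
  public static void main(String[] args) throws Throwable {
    try (Arena arena = Arena.ofConfined()) {
      // Load the (hypothetical) compiled model library and bind one function:
      //   const char* lemmatize(const char* word);
      SymbolLookup lib = SymbolLookup.libraryLookup("liblemmatizer.so", arena);
      MethodHandle lemmatize = Linker.nativeLinker().downcallHandle(
          lib.find("lemmatize").orElseThrow(),
          FunctionDescriptor.of(ValueLayout.ADDRESS, ValueLayout.ADDRESS));

      MemorySegment word = arena.allocateFrom("שמן"); // native UTF-8 copy
      MemorySegment result = (MemorySegment) lemmatize.invokeExact(word);
      System.out.println(result.reinterpret(Long.MAX_VALUE).getString(0));
    }
  }
}
```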