-
Notifications
You must be signed in to change notification settings - Fork 458
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OPENNLP-912: Rule based sentence detector #390
base: main
Are you sure you want to change the base?
Conversation
Somehow the Travis CI always timed out here, but succeeded in my fork repo: https://travis-ci.org/github/Alanscut/opennlp/builds/757336242 |
Thanks a lot for this contribution! This is something OpenNLP has needed. I will take a closer look. Built and tested successfully.
|
/** | ||
* The interface for rule based sentence detector | ||
*/ | ||
public interface SentenceTokenizer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an opennlp.tools.sentdetect.SentenceDetector
interface that resembles this interface. Since the purpose of the two interfaces seem the same (to break text into sentences), is it possible to reuse the other interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 (and the method would require a proper description). It is totally unclear what the provided method does/shall do from an implementor perspective.
import opennlp.tools.util.StringUtil; | ||
|
||
|
||
public class SentenceTokenizerME implements SentenceTokenizer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ME
name in the file names denotes "maximum entropy" as the method for the implementation. Since this implementation doesn't use a trained model, could it be named something like RulesBasedSentenceDetector
? (Open to other names, too.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to @jzonthemtn comment
@@ -0,0 +1,131 @@ | |||
<?xml version="1.0" encoding="UTF-8"?> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the origin of these rules?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you generate this xml file? It does not seem to originate from [the](https://github.com/diasks2/pragmatic_segmenter
I would like to better understand the origins of the rules used. Does there need to be license attribution? |
@jzonthemtn Looks these "golden-rules.txt" is from https://github.com/diasks2/pragmatic_segmenter#the-golden-rules (at least, if we trust the textual description). Also in some other languages: https://s3.amazonaws.com/tm-town-nlp-resources/golden_rules.txt - the library itself with the content is MIT, so no compliance issue but we would need to attribute it accordingly. |
|
||
package opennlp.tools.sentdetect.segment; | ||
|
||
public class Clean { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be a Record
?
import opennlp.tools.util.StringUtil; | ||
|
||
|
||
public class SentenceTokenizerME implements SentenceTokenizer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to @jzonthemtn comment
|
||
private Matcher afterMatcher; | ||
|
||
boolean found; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be private.
@@ -0,0 +1,131 @@ | |||
<?xml version="1.0" encoding="UTF-8"?> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you generate this xml file? It does not seem to originate from [the](https://github.com/diasks2/pragmatic_segmenter
|
||
public class Clean { | ||
|
||
String regex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth to use a Pattern
here to avoid compiling the regex in every replaceAll(...)
call.
|
||
public class Section { | ||
|
||
int left; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these should be private
/** | ||
* The interface for rule based sentence detector | ||
*/ | ||
public interface SentenceTokenizer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 (and the method would require a proper description). It is totally unclear what the provided method does/shall do from an implementor perspective.
|
||
private List<Section> noBreakSections; | ||
|
||
public SentenceTokenizerME(LanguageTool languageTool, CharSequence text) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can rebuild this to avoid creating a Tokenizer for every piece of text? Wouldn't it be of more value to provide the text as a method parameter and compute the stuff on the fly? It would also allow us to make it threadsafe in the future.
start += count; | ||
} | ||
} catch (IOException e) { | ||
e.printStackTrace(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn*t just print the stack trace: Either rethrow as runtime exception or at least log it.
text = cleaner.clean(text); | ||
} | ||
|
||
InputStream inputStream = getClass().getResourceAsStream( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should close the stream + read it once and consume the cached result for every test run.
Thank you for contributing to Apache OpenNLP.
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
Is there a JIRA ticket associated with this PR? Is it referenced
in the commit message?
Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
Has your PR been rebased against the latest commit within the target branch (typically master)?
Is your initial contribution a single, squashed commit?
For code changes:
For documentation related changes:
Note:
Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.