Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OPENNLP-912: Rule based sentence detector #390

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

Alanscut
Copy link
Contributor

@Alanscut Alanscut commented Jan 28, 2021

Thank you for contributing to Apache OpenNLP.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes:

  • Is there a JIRA ticket associated with this PR? Is it referenced
    in the commit message?

  • Does your PR title start with OPENNLP-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.

  • Has your PR been rebased against the latest commit within the target branch (typically master)?

  • Is your initial contribution a single, squashed commit?

For code changes:

  • Have you ensured that the full suite of tests is executed via mvn clean install at the root opennlp folder?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file in opennlp folder?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file found in opennlp folder?

For documentation related changes:

  • Have you ensured that format looks appropriate for the output in which it is rendered?

Note:

Please ensure that once the PR is submitted, you check travis-ci for build issues and submit an update to your PR as soon as possible.

@Alanscut
Copy link
Contributor Author

Alanscut commented Feb 4, 2021

Somehow the Travis CI always timed out here, but succeeded in my fork repo: https://travis-ci.org/github/Alanscut/opennlp/builds/757336242

@jzonthemtn
Copy link
Contributor

Thanks a lot for this contribution! This is something OpenNLP has needed. I will take a closer look.

Built and tested successfully.

Apache Maven 3.6.3 (cecedd343002696d0abb50b32b541b8a6ba2883f)
Maven home: /opt/apache-maven
Java version: 11.0.9.1, vendor: Ubuntu, runtime: /usr/lib/jvm/java-11-openjdk-amd64
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "5.8.0-40-generic", arch: "amd64", family: "unix"

/**
* The interface for rule based sentence detector
*/
public interface SentenceTokenizer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an opennlp.tools.sentdetect.SentenceDetector interface that resembles this interface. Since the purpose of the two interfaces seem the same (to break text into sentences), is it possible to reuse the other interface?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 (and the method would require a proper description). It is totally unclear what the provided method does/shall do from an implementor perspective.

import opennlp.tools.util.StringUtil;


public class SentenceTokenizerME implements SentenceTokenizer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ME name in the file names denotes "maximum entropy" as the method for the implementation. Since this implementation doesn't use a trained model, could it be named something like RulesBasedSentenceDetector? (Open to other names, too.)

Copy link
Contributor

@rzo1 rzo1 Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to @jzonthemtn comment

@@ -0,0 +1,131 @@
<?xml version="1.0" encoding="UTF-8"?>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the origin of these rules?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you generate this xml file? It does not seem to originate from [the](https://github.com/diasks2/pragmatic_segmenter

@jzonthemtn
Copy link
Contributor

I would like to better understand the origins of the rules used. Does there need to be license attribution?

@rzo1
Copy link
Contributor

rzo1 commented Nov 29, 2023

I would like to better understand the origins of the rules used. Does there need to be license attribution?

@jzonthemtn Looks these "golden-rules.txt" is from https://github.com/diasks2/pragmatic_segmenter#the-golden-rules (at least, if we trust the textual description). Also in some other languages: https://s3.amazonaws.com/tm-town-nlp-resources/golden_rules.txt - the library itself with the content is MIT, so no compliance issue but we would need to attribute it accordingly.


package opennlp.tools.sentdetect.segment;

public class Clean {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be a Record ?

import opennlp.tools.util.StringUtil;


public class SentenceTokenizerME implements SentenceTokenizer {
Copy link
Contributor

@rzo1 rzo1 Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to @jzonthemtn comment


private Matcher afterMatcher;

boolean found;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be private.

@@ -0,0 +1,131 @@
<?xml version="1.0" encoding="UTF-8"?>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you generate this xml file? It does not seem to originate from [the](https://github.com/diasks2/pragmatic_segmenter


public class Clean {

String regex;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth to use a Pattern here to avoid compiling the regex in every replaceAll(...) call.


public class Section {

int left;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should be private

/**
* The interface for rule based sentence detector
*/
public interface SentenceTokenizer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 (and the method would require a proper description). It is totally unclear what the provided method does/shall do from an implementor perspective.


private List<Section> noBreakSections;

public SentenceTokenizerME(LanguageTool languageTool, CharSequence text) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can rebuild this to avoid creating a Tokenizer for every piece of text? Wouldn't it be of more value to provide the text as a method parameter and compute the stuff on the fly? It would also allow us to make it threadsafe in the future.

start += count;
}
} catch (IOException e) {
e.printStackTrace();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn*t just print the stack trace: Either rethrow as runtime exception or at least log it.

text = cleaner.clean(text);
}

InputStream inputStream = getClass().getResourceAsStream(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should close the stream + read it once and consume the cached result for every test run.

@mawiesne mawiesne marked this pull request as draft January 14, 2025 10:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants