-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature Request: Ignore n levels of "leafs" #51
Comments
Thanks for the suggestion -- I'd like to see if there's an easy way with argparse to allow PRs are always welcome! I will get around to this myself eventually if you don't have the time to do a PR but I have unfortunately been very slow with this project recently. |
I think not (a following positional argument could not be distinguished from the parameter), but I would maybe suggest something like ignore_group = parser.add_mutually_exclusive_group()
ignore_group.add_argument("--ignore_leaf", action="store_const", dest="ignore_depth", const=1)
ignore_group.add_argument("--ignore_depth", dest="ignore_depth", type=int) this should work. I can PR this soon enough. The biggest question is how properly name this other parameter :) However, there also is the question whether equivalence shouldn't also be possible "top down". In the sense that two files are not compared if the first n components of their (relative) paths are the same. This is the same thing if all submissions are on the same level, but suppose that students submit several java projects, which could have varying depth. |
I think that's a good solution. I think You bring up a good point about whether it should look top down or bottom up, though -- it actually makes more sense to me to go from the top down (in this case to the depth of the student folders). How many subfolders there are within each of the student folders shouldn't matter. Is there a scenario where a bottom-up search would work but a top-down one would not? I guess there could always be some kind of |
I can't really imagine one. But that doesn't say that much :)
in case both sides should be supported, I think the best variant would be to have three mutually exclusive options like I'd go ahead and draft a PR. The options could still be removed / hidden if needed. EDIT: Ok, I looked at this a bit, and top-down needs more restructuring. The problem is that this "ignore the first n" should probably be relative to the base directory from which this file has been picked up. This means that the file list needs to track where the file came from (and what happens if one file effectively appears in multiple directories). I guess that this path-based reasoning should be with logical paths only, but fingerprinting etc. is cached by real path ( I can take a look at this if you want, but we should agree beforehand whether these more involved changes are ok. |
Hmm -- I think this would also run into trouble with the file adding API. I'm not sure how many people are using the API instead of the command-line interface but files added using, e.g., The bottom-up search should be much easier to implement and should handle most use cases. |
One more elegant way would be to separate testing from defining pairs. So that if you add thing with testing and reference directories etc., a set of pairs to compare is created, and the actual file comparison then is performed on this set. But I agree that bottom up probably is good enough for now. I think we can leave this issue open. In case someone has a need for this feature, they could comment here. Compromise suggestion: What about adding this separation into "obtain the comparison pairs" and "perform comparison"? Then one could add an API like |
I agree that generating a set of comparison pairs would be a nice solution for the current design, though it may conflict with the possibility of having a "one to many" comparison where similarity could be computed between a single test file and the reference files in aggregate (this has been requested in multiple issues so it would probably be useful to implement eventually). At some point I'd like to rethink the architecture and do a more complete refactor -- I wrote the original version of this package in a week or two for a pretty simple use-case. I just don't personally use the package anymore (I haven't been in academia for a few years now) so it's hard to have the motivation 🙂 |
Hi!
I am using copydetect together with DOMjudge submissions. Effectively, the situation is that students can submit attempts at a problem multiple times. I want to check for plagiarism across students, but not within their own attempts. Here,
ignore_leaf
comes into play, since I can organize submissions byproblem / student / <student attempts>
. But if one attempt could contain multiple files, I'd like to organize submissions byproblem / student / submission / <submission files>
. Then,ignore_leaf
doesn't help much. (Of course, I could still flatten all the student submissions into one directory, but I think there is an "easier" way)Suggestion: Instead of making
ignore_leaf
a boolean flag, make it an integer, and two files are not compared to each other when the n-th parent is the same.This should be easily doable in
copydetect/copydetect/detector.py
Lines 422 to 429 in d3667ff
The text was updated successfully, but these errors were encountered: