Releases: blingenf/copydetect

0.5.0

09 Mar 20:28
  • Feature: the file list and execution parameters are now displayed on the output report (#46 by ankostis).
  • Feature: improved handling of duplicate hashes. Duplicates are now correctly highlighted on the report, and the similarity metric is now simply (overlapping tokens) / (total token count) rather than (overlapping tokens after removing duplicate fingerprints) / (number of tokens in unique fingerprints) (#48).
  • Feature: the default report styling can be overridden with a custom CSS file supplied through the --css argument (#49 by mikeperalta1).
  • Fix: some internal cleanup of how the CopyDetector object is configured. The publicly documented API is unaffected, but code that references parameters passed to this object (e.g., CopyDetector.noise_t) may break (#47).
  • Fix: replaced pkg_resources with importlib.resources (allowing support for Python 3.12). Support for Python 3.6 is dropped (#52).
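
As a rough illustration of the simplified similarity metric described for #48, the sketch below computes overlapping tokens divided by total token count. The function name and the per-token boolean mask are assumptions made for this example; they are not taken from copydetect's internals.

```python
def token_similarity(overlap_mask):
    """Return the fraction of tokens marked as overlapping.

    overlap_mask is a hypothetical per-token list of booleans: True where
    the token falls inside a region matched against another file.
    """
    total = len(overlap_mask)
    if total == 0:
        # An empty file has no tokens to compare; report zero similarity.
        return 0.0
    return sum(overlap_mask) / total


# 3 of the 8 tokens overlap with the other file.
mask = [True, True, False, False, True, False, False, False]
similarity = token_similarity(mask)  # 0.375
```

Because every token is counted once, repeated fingerprints no longer need to be removed before the ratio is taken.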

0.4.6

15 Jul 16:27
  • Fix: the "file not ASCII text" warning has been changed to "file not UTF-8 text" to reflect the encoding which is actually used.
  • Improvement: added an --encoding parameter which allows specifying an encoding. If the chardet library is installed, --encoding DETECT can now be used to automatically detect the encoding of all files.
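
The behavior of an explicit encoding with an optional detection mode might look roughly like the sketch below. The helper name `decode_source`, the fallback to UTF-8, and the use of `errors="replace"` are assumptions for illustration; the notes above do not say how copydetect handles these cases internally. `chardet.detect` is the only third-party call used, and it is imported lazily so the sketch still runs without chardet installed.

```python
def decode_source(raw, encoding="utf-8"):
    """Decode raw bytes using a named encoding or a detected one.

    Passing encoding="DETECT" asks chardet for a guess when the library
    is available; otherwise the function falls back to UTF-8.
    """
    if encoding == "DETECT":
        try:
            import chardet  # optional dependency
        except ImportError:
            encoding = "utf-8"
        else:
            # detect() returns a dict; "encoding" may be None for odd input.
            guess = chardet.detect(raw)["encoding"]
            encoding = guess or "utf-8"
    # errors="replace" is an assumption: keep going on undecodable bytes.
    return raw.decode(encoding, errors="replace")


text = decode_source("héllo".encode("utf-8"))
```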

0.4.5

26 Apr 03:36
  • Fix: corrected an issue introduced by 0.4.4 causing incorrect indexing when there is partial overlap between test and reference files (this could result in crashes or incorrect slice selection on the output report).
  • Fix: a "no files found" warning is now displayed only if none of the provided extensions are found in a folder, rather than printing an individual warning for each missing extension.

0.4.4

04 Feb 17:04
  • Fix: UTF-8 is explicitly specified when loading the HTML template and saving the output report.
  • Improvement: the slice matrix is now implemented as a dictionary instead of an actual matrix and consumes less memory as a result.
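
To show why a dictionary can use less memory than a dense matrix here, the sketch below stores slice data only for file pairs that actually have matches. The class name, the `(row, col)` key layout, and the slice tuples are hypothetical and chosen for this example only.

```python
class SparseSliceStore:
    """Dictionary-backed stand-in for a mostly empty 2-D matrix."""

    def __init__(self):
        # Only pairs with at least one matched slice get an entry.
        self._cells = {}

    def set(self, row, col, slices):
        self._cells[(row, col)] = slices

    def get(self, row, col):
        # Missing pairs behave like empty cells of a dense matrix.
        return self._cells.get((row, col), [])

    def __len__(self):
        return len(self._cells)


store = SparseSliceStore()
store.set(0, 2, [(10, 25)])   # one matched region between files 0 and 2
store.set(3, 1, [(4, 9), (30, 42)])
```

With a thousand files, a dense matrix would allocate a million cells, while this store holds only the pairs that were set.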

CI has also been migrated from Travis to GitHub Actions.

0.4.3

05 Nov 17:08
2866bd6
  • Fix: corrected a crash which occurred when comparing empty slices using the copydetect API.
  • Fix: corrected an issue causing failures on certain operating systems for files with non-ASCII characters. UTF-8 is now explicitly specified as the encoding when reading files.

0.4.2

16 Jul 17:45
d63a1e3

Fix: corrects an issue introduced by version 0.4.0 which caused similarity scores to be lower than they should be (see #19 for more information).

0.4.1

29 May 22:25
d5fafd9
  • Fix: the program now behaves identically for config files and command-line arguments -- in particular, parameters which have defaults (reference directories, noise threshold, extensions) are no longer required to be filled if a config file is being used. They will fall back to defaults just like command-line parameters.

  • API Update: the config parameter to CopyDetector has been deprecated and will be removed in a future version. The parameter could cause ambiguity when values were provided both in the config dictionary and as optional arguments to the detector. To initialize a CopyDetector object using a config dictionary, use the new CopyDetector.from_config() function. The change described above also applies to this function -- missing values in the provided dictionary will be filled with defaults where applicable.

  • Update: the default guarantee_threshold has been updated to be equal to the noise_threshold (30 --> 25). This comes with a small performance drop but it is fairly minor and the gap seemed to be causing some confusion.
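
The "fall back to defaults" behavior can be sketched as a dictionary merge. This is not copydetect's implementation: the key names and the placeholder values for directories and extensions are assumptions, and only the two threshold values of 25 come from the notes above.

```python
import copy

# Hypothetical defaults. Only the two thresholds (25) are stated in the
# release notes; the other keys and values are placeholders.
DEFAULTS = {
    "reference_directories": [],
    "extensions": ["*"],
    "noise_threshold": 25,
    "guarantee_threshold": 25,
}


def fill_defaults(config):
    """Return a new dict where keys missing from config use DEFAULTS."""
    merged = copy.deepcopy(DEFAULTS)  # avoid sharing mutable defaults
    merged.update(config)
    return merged


cfg = fill_defaults({"test_directories": ["submissions"], "noise_threshold": 10})
```

Values supplied by the user win, and anything left out is filled in, which is the same rule whether the values came from a config file or from command-line arguments.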

0.4.0

15 May 20:22
a53eca5

Fix/feature: the similarity matrix is no longer necessarily square. There will no longer be large gaps when test files != reference files.
Bug fix: similarity is now based on number of fingerprints rather than number of tokens. This improves detection for files with large amounts of duplication (e.g., XML files).
Feature: fp argument for CodeFingerprint: fingerprints can now be initialized with file pointers rather than just a file path.
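
Accepting either a path or an open file pointer is a common pattern, sketched below with a hypothetical helper. Only the argument name `fp` comes from the note above; the function name, the `path` parameter, and the error handling are assumptions and do not reflect the real CodeFingerprint signature.

```python
import io


def read_source(path=None, fp=None, encoding="utf-8"):
    """Return source text from an open file object or from a path."""
    if fp is not None:
        # The caller owns the file object, so it is not closed here.
        return fp.read()
    if path is None:
        raise ValueError("either path or fp must be provided")
    with open(path, encoding=encoding) as handle:
        return handle.read()


# An in-memory buffer works wherever a file pointer is accepted.
text = read_source(fp=io.StringIO("print('hello')\n"))
```

This makes it possible to process code that never touches the disk, such as content pulled from an archive or an upload.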

0.3.0

11 Oct 19:12

Improvement: both images and the style sheet have been merged into the output HTML file. Instead of saving an output directory, copydetect outputs a single HTML file with a name/path controlled using the -o parameter (default: report.html).
Improvement: output report now uses Bootstrap 5.
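
One common way to merge images and a style sheet into a single HTML file is to inline the CSS in a style element and embed images as base64 data URIs. The sketch below shows that general technique; the function names and page layout are hypothetical, and the notes do not state that copydetect does it this way.

```python
import base64


def inline_png(png_bytes):
    """Return an img tag with the PNG embedded as a base64 data URI."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f'<img src="data:image/png;base64,{encoded}">'


def build_report(css_text, png_bytes):
    """Assemble a self-contained HTML page with no external assets."""
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        f"<style>{css_text}</style></head>"
        f"<body>{inline_png(png_bytes)}</body></html>"
    )


# Placeholder bytes stand in for real image data.
html = build_report("body{margin:0}", b"\x89PNG")
```

A page built this way can be moved or emailed as one file, since nothing it displays lives outside the document.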

Bug Fix: changed deprecated jinja2.escape import to markupsafe.escape
Bug Fix: preprocessor directives are now correctly tokenized for languages which use them (#6)
Bug Fix: token.Name.Variable, token.Name.Attribute tokens are now treated as variables in addition to token.Name tokens. This improves tokenization for certain languages (primarily Java).

0.2.1

20 Nov 16:23

Improvement: the style sheet is now copied to the output folder rather than directly linking to its location in the package data to improve output portability.