Releases: blingenf/copydetect
0.5.0
- Feature: the file list and execution parameters are now displayed on the output report (#46 by ankostis).
- Feature: improved handling of duplicate hashes. Duplicates are now correctly highlighted on the report and the similarity metric is now simply "overlapping tokens / total token count" instead of "overlapping tokens after removing duplicate fingerprints / number of tokens in unique fingerprints" (#48).
- Feature: the default report styling can be overwritten using a custom CSS file provided using the --css argument (#49 by mikeperalta1).
- Fix: some internal cleanup to how the CopyDetector object is configured. There is no impact to the publicly-documented API but code which was referencing parameters passed to this object (e.g., CopyDetector.noise_t) may break (#47).
- Fix: replaced pkg_resources with importlib.resources (allowing support for Python 3.12). Support for Python 3.6 is dropped (#52).
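The simplified similarity metric from #48 can be illustrated with a short sketch. This is not copydetect's implementation: the function name and the representation of the overlap as a set of token positions are assumptions made purely for illustration.

```python
def token_similarity(overlap_indices, total_tokens):
    """Return the fraction of a file's tokens that lie in matched regions.

    overlap_indices: hypothetical set of token positions flagged as overlapping.
    total_tokens: total number of tokens in the file.
    """
    if total_tokens == 0:
        return 0.0  # an empty file has nothing to compare
    return len(overlap_indices) / total_tokens


# 4 of 10 tokens overlap with another file
print(token_similarity({2, 3, 4, 5}, 10))
```

Because positions are stored in a set, a token covered by several duplicate fingerprints is counted only once in this sketch.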
0.4.6
- Fix: the "file not ASCII text" warning has been changed to "file not UTF-8 text" to reflect the encoding which is actually used.
- Improvement: added a --encoding parameter which allows specifying an encoding. If the chardet library is installed, --encoding DETECT can now be used to automatically detect the encoding of all files.
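As a rough sketch of how an explicit encoding and an optional chardet-based detection mode might coexist, consider the helper below. The function name decode_source and its fallback behavior are assumptions for illustration, not copydetect's actual code; only chardet.detect() is a real library call.

```python
def decode_source(raw, encoding="utf-8"):
    """Decode raw file bytes, optionally guessing the encoding with chardet.

    Hypothetical helper: passing "DETECT" mirrors the --encoding DETECT option.
    """
    if encoding == "DETECT":
        try:
            import chardet  # optional dependency
        except ImportError as exc:
            raise RuntimeError("encoding detection requires chardet") from exc
        # chardet.detect returns a dict; "encoding" may be None for
        # input it cannot classify, so fall back to UTF-8 in that case.
        encoding = chardet.detect(raw)["encoding"] or "utf-8"
    return raw.decode(encoding)


print(decode_source("héllo".encode("latin-1"), "latin-1"))
```

Importing chardet lazily keeps it optional: the import is attempted only when detection is requested.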
0.4.5
- Fix: corrected an issue introduced by 0.4.4 causing incorrect indexing when there is partial overlap between test and reference files (this could result in crashes or incorrect slice selection on the output report).
- Fix: a "no files found" warning is only displayed if none of the provided extensions are found in a folder rather than printing an individual warning for each missing extension.
0.4.4
- Fix: UTF-8 is explicitly specified when loading the HTML template and saving the output report.
- Improvement: the slice matrix is now implemented as a dictionary instead of an actual matrix and consumes less memory as a result.
CI has also been migrated from Travis to GitHub Actions.
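The memory saving from a dictionary-backed slice store can be sketched as follows. The class and method names here are hypothetical and are not taken from copydetect; the point is only that pairs without matches take up no storage, unlike cells in a dense matrix.

```python
class SliceStore:
    """Hypothetical sparse store keyed by (test_index, ref_index)."""

    def __init__(self):
        self._slices = {}  # only pairs with at least one match get an entry

    def add(self, test_idx, ref_idx, span):
        # span is a (start, end) tuple marking one matched region
        self._slices.setdefault((test_idx, ref_idx), []).append(span)

    def get(self, test_idx, ref_idx):
        # pairs that were never stored behave like empty matrix cells
        return self._slices.get((test_idx, ref_idx), [])

    def __len__(self):
        return len(self._slices)


store = SliceStore()
store.add(0, 3, (10, 42))
print(store.get(0, 3), store.get(1, 1), len(store))
```

With N test files and M reference files, a dense matrix always allocates N x M cells, while this structure grows only with the number of file pairs that actually share content.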
0.4.3
- Fix: corrected a crash which occurred when comparing empty slices using the copydetect API.
- Fix: corrected an issue causing certain operating systems to fail for files with non-ASCII characters. UTF-8 is now explicitly specified as the encoding when reading files.
0.4.2
0.4.1
- Fix: the program now behaves identically for config files and command-line arguments -- in particular, parameters which have defaults (reference directories, noise threshold, extensions) are no longer required to be filled if a config file is being used. They will fall back to defaults just like command-line parameters.
- API Update: the config parameter to CopyDetector has been deprecated and will be removed in a future version. This could result in ambiguity when parameters were provided both in the config dictionary and as optional arguments to the detector. To initialize a CopyDetector object using a config dictionary, use the new CopyDetector.from_config() function. The change described above also applies to this function -- missing values in the provided dictionary will be filled with defaults where applicable.
- Update: the default guarantee_threshold has been updated to be equal to the noise_threshold (30 --> 25). This comes with a small performance drop but it is fairly minor and the gap seemed to be causing some confusion.
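The from_config() pattern described above, in which missing dictionary values fall back to defaults, can be sketched with a stand-in class. The Detector class below is hypothetical and is not copydetect's CopyDetector; the test_dirs parameter and the dictionary keys are assumptions, while the default of 25 for both thresholds follows the notes above.

```python
class Detector:
    """Hypothetical stand-in used only to illustrate the pattern."""

    # Defaults applied when a key is missing from the config dictionary
    DEFAULTS = {"noise_threshold": 25, "guarantee_threshold": 25}

    def __init__(self, test_dirs, noise_threshold=25, guarantee_threshold=25):
        self.test_dirs = test_dirs
        self.noise_threshold = noise_threshold
        self.guarantee_threshold = guarantee_threshold

    @classmethod
    def from_config(cls, config):
        # Values in the config override the defaults; missing keys are filled in
        params = {**cls.DEFAULTS, **config}
        return cls(**params)


detector = Detector.from_config({"test_dirs": ["submissions"], "noise_threshold": 10})
print(detector.noise_threshold, detector.guarantee_threshold)
```

Routing the dictionary through a separate constructor means there is a single source for each parameter, which avoids the ambiguity of passing both a config dictionary and keyword arguments.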
0.4.0
- Fix/feature: the similarity matrix is no longer necessarily square. There will no longer be large gaps when test files != reference files.
- Bug fix: similarity is now based on number of fingerprints rather than number of tokens. This improves detection for files with large amounts of duplication (e.g., XML files).
- Feature: fp argument for CodeFingerprint: fingerprints can now be initialized with file pointers rather than just a file path.
0.3.0
- Improvement: both images and the style sheet have been merged into the output HTML file. Instead of saving an output directory, copydetect outputs a single html file with a name/path controlled using the -o parameter (default: report.html).
- Improvement: output report now uses Bootstrap 5.
- Bug Fix: changed deprecated jinja2.escape import to markupsafe.escape.
- Bug Fix: preprocessor directives are now correctly tokenized for languages which use them (#6).
- Bug Fix: token.Name.Variable, token.Name.Attribute tokens are now treated as variables in addition to token.Name tokens. This improves tokenization for certain languages (primarily Java).