Diff DVC files, optionally piping through other commands first.
pip install dvc-utils
dvc-utils --help
# Usage: dvc-utils [OPTIONS] COMMAND [ARGS]...
#
# Options:
# --help Show this message and exit.
#
# Commands:
# diff Diff a DVC-tracked file at two commits (or one commit vs. current
# worktree), optionally passing both through another command first
The single subcommand, dvc-utils diff
, is also exposed directly as dvc-dff
:
dvc-diff --help
# Usage: dvc-diff [OPTIONS] [exec_cmd...] <path>
#
# Diff a file at two commits (or one commit vs. current worktree), optionally
# passing both through `cmd` first
#
# Examples:
#
# dvc-utils diff -r HEAD^..HEAD wc -l foo.dvc # Compare the number of lines
# (`wc -l`) in `foo` (the file referenced by `foo.dvc`) at the previous vs.
# current commit (`HEAD^..HEAD`).
#
# dvc-utils diff md5sum foo # Diff the `md5sum` of `foo` (".dvc" extension is
# optional) at HEAD (last committed value) vs. the current worktree content.
#
# Options:
# -c, --color / -C, --no-color Force or prevent colorized output
# -r, --refspec TEXT <commit 1>..<commit 2> (compare two commits)
# or <commit> (compare <commit> to the worktree)
# -R, --ref TEXT Shorthand for `-r <ref>^..<ref>`, i.e. inspect
# a specific commit (vs. its parent)
# -s, --shell-executable TEXT Shell to use for executing commands; defaults
# to $SHELL
# -S, --no-shell Don't pass `shell=True` to Python
# `subprocess`es
# -U, --unified INTEGER Number of lines of context to show (passes
# through to `diff`)
# -v, --verbose Log intermediate commands to stderr
# -w, --ignore-whitespace Ignore whitespace differences (pass `-w` to
# `diff`)
# -x, --exec-cmd TEXT Command(s) to execute before diffing;
# alternate syntax to passing commands as
# positional arguments
# --help Show this message and exit.
These examples are verified with mdcmd
and $BMDF_WORKDIR=test/data
(test/data
is a clone of ryan-williams/dvc-helpers@test, which contains simple DVC-tracked files used for testing git-diff-dvc.sh
)
8ec2060
added a DVC-tracked text file, test.txt
:
dvc-diff -R 8ec2060 test.txt
# 0a1,10
# > 1
# > 2
# > 3
# > 4
# > 5
# > 6
# > 7
# > 8
# > 9
# > 10
0455b50
appended some lines to test.txt
:
dvc-diff -R 0455b50 test.txt
# 10a11,15
# > 11
# > 12
# > 13
# > 14
# > 15
f92c1d2
added test.parquet
:
dvc-diff -R f92c1d2 pqa test.parquet
# 0a1,27
# > MD5: 4379600b26647a50dfcd0daa824e8219
# > 1635 bytes
# > 5 rows
# > message schema {
# > OPTIONAL INT64 num;
# > OPTIONAL BYTE_ARRAY str (STRING);
# > }
# > {
# > "num": 111,
# > "str": "aaa"
# > }
# > {
# > "num": 222,
# > "str": "bbb"
# > }
# > {
# > "num": 333,
# > "str": "ccc"
# > }
# > {
# > "num": 444,
# > "str": "ddd"
# > }
# > {
# > "num": 555,
# > "str": "eee"
# > }
f29e52a
updated test.parquet
:
dvc-diff -R f29e52a pqa test.parquet
# 1,3c1,3
# < MD5: 4379600b26647a50dfcd0daa824e8219
# < 1635 bytes
# < 5 rows
# ---
# > MD5: be082c87786f3364ca9efec061a3cc21
# > 1622 bytes
# > 8 rows
# 5c5
# < OPTIONAL INT64 num;
# ---
# > OPTIONAL INT32 num;
# 26a27,38
# > }
# > {
# > "num": 666,
# > "str": "fff"
# > }
# > {
# > "num": 777,
# > "str": "ggg"
# > }
# > {
# > "num": 888,
# > "str": "hhh"
3257258
added a DVC-tracked directory data/
, including test.{txt,parquet}
), and removed the top-level test.{txt,parquet}
.
dvc-diff -R 3257258 data
# test.parquet: None -> c07bba3fae2b64207aa92f422506e4a2
# test.txt: None -> e20b902b49a98b1a05ed62804c757f94
ae8638a
changed values in data/test.parquet
, and added rows to data/test.txt
:
dvc-diff -R ae8638a data
# test.parquet: c07bba3fae2b64207aa92f422506e4a2 -> f46dd86f608b1dc00993056c9fc55e6e
# test.txt: e20b902b49a98b1a05ed62804c757f94 -> 9306ec0709cc72558045559ada26573b
See sample commands and output below for inspecting changes to a DVC-tracked Parquet file in a given commit.
Setup:
git clone https://github.com/hudcostreets/nj-crashes && cd nj-crashes # Clone + enter example repo
commit=c8ae28e # Example commit that changed some DVC-tracked Parquet files
path=njdot/data/2001/NewJersey2001Accidents.pqt.dvc # One of the changed files
Use parquet2json
to observe schema changes to a Parquet file:
parquet_schema() {
parquet2json "$1" schema
}
export -f parquet_schema
dvc-diff -r $commit^..$commit parquet_schema $path
Output
2d1
< OPTIONAL BYTE_ARRAY Year (STRING);
8,10d6
< OPTIONAL BYTE_ARRAY Crash Date (STRING);
< OPTIONAL BYTE_ARRAY Crash Day Of Week (STRING);
< OPTIONAL BYTE_ARRAY Crash Time (STRING);
14,17c10,13
< OPTIONAL BYTE_ARRAY Total Killed (STRING);
< OPTIONAL BYTE_ARRAY Total Injured (STRING);
< OPTIONAL BYTE_ARRAY Pedestrians Killed (STRING);
< OPTIONAL BYTE_ARRAY Pedestrians Injured (STRING);
---
> OPTIONAL INT64 Total Killed;
> OPTIONAL INT64 Total Injured;
> OPTIONAL INT64 Pedestrians Killed;
> OPTIONAL INT64 Pedestrians Injured;
20,21c16,17
< OPTIONAL BYTE_ARRAY Alcohol Involved (STRING);
< OPTIONAL BYTE_ARRAY HazMat Involved (STRING);
---
> OPTIONAL BOOLEAN Alcohol Involved;
> OPTIONAL BOOLEAN HazMat Involved;
23c19
< OPTIONAL BYTE_ARRAY Total Vehicles Involved (STRING);
---
> OPTIONAL INT64 Total Vehicles Involved;
29c25
< OPTIONAL BYTE_ARRAY Mile Post (STRING);
---
> OPTIONAL DOUBLE Mile Post;
47,48c43,44
< OPTIONAL BYTE_ARRAY Latitude (STRING);
< OPTIONAL BYTE_ARRAY Longitude (STRING);
---
> OPTIONAL DOUBLE Latitude;
> OPTIONAL DOUBLE Longitude;
51a48
> OPTIONAL INT64 Date (TIMESTAMP(MICROS,false));
Here we can see that various date/time columns were consolidated, and several stringly-typed columns were converted to ints, floats, and booleans.
Diff the first row of the Parquet file above (pretty-printed as JSON using jq
), before and after the given commit:
pretty_print_first_row() {
# Print first row of Parquet file as JSON, pretty-print with jq
parquet2json "$1" cat -l 1 | jq .
}
export -f pretty_print_first_row
dvc-diff -r $commit^..$commit pretty_print_first_row $path
Output
2d1
< "Year": "2001",
8,10d6
< "Crash Date": "12/21/2001",
< "Crash Day Of Week": "F",
< "Crash Time": "1834",
14,17c10,13
< "Total Killed": "0",
< "Total Injured": "0",
< "Pedestrians Killed": "0",
< "Pedestrians Injured": "0",
---
> "Total Killed": 0,
> "Total Injured": 0,
> "Pedestrians Killed": 0,
> "Pedestrians Injured": 0,
20,21c16,17
< "Alcohol Involved": "N",
< "HazMat Involved": "N",
---
> "Alcohol Involved": false,
> "HazMat Involved": false,
23c19
< "Total Vehicles Involved": "2",
---
> "Total Vehicles Involved": 2,
29c25
< "Mile Post": "",
---
> "Mile Post": null,
47,48c43,44
< "Latitude": "",
< "Longitude": "",
---
> "Latitude": null,
> "Longitude": null,
51c47,48
< "Reporting Badge No.": "830"
---
> "Reporting Badge No.": "830",
> "Date": "2001-12-21 18:34:00 +00:00"
This reflects the schema changes above.
parquet_row_count() {
parquet2json "$1" rowcount
}
export -f parquet_row_count
dvc-diff -r $commit^..$commit parquet_row_count $path
This time we get no output; the given $commit
didn't change the row count in the DVC-tracked Parquet file $path
.
Here's a "one-liner" I used in ctbk.dev, to normalize and compare headers of .csv.gz.dvc
files between two commits:
# Save some `sed` substitution commands to file `seds`:
cat <<EOF >seds
s/station_//
s/latitude/lat/
s/longitude/lng/
s/starttime/started_at/
s/stoptime/ended_at/
s/usertype/member_casual/
EOF
# Commit range to diff; branch `c0` is an initial commit of some `.csv.gz` files, branch `c1` is a later commit after some updates
r=c0..c1
# List files changed in commit range `$r`, in the `s3/ctbk/csvs/` dir, piping through several post-processing commands:
gdno $r s3/ctbk/csvs/ | \
pel "ddcr $r guc h1 spc kq kcr snc 'sdf seds' sort"
Explanation of aliases
gdno
(git diff --name-only
): list files changed in the given commit range and directorypel
:parallel
alias that prepends anecho {}
to the commandddcr
(dvc-diff -cr
): colorizeddiff
output, revision range$r
guc
(gunzip -c
): uncompress the.csv.gz
filesh1
(head -n1
): only examine each file's header linespc
(tr , $'\n'
): split the header line by commas (so each column name will be on one line, for easierdiff
ing below)kq
(tr -d '"'
): kill quote characters (in this case, header-column name quoting changed, but I don't care about that)kcr
(tr -d '\r'
): kill carriage returns (line endings also changed)snc
(sed -f 'snake_case.sed'
): snake-case column namessdf
(sed -f
): execute thesed
substitution commands defined in theseds
file abovesort
: sort the column names alphabetically (to identify missing or added columns, ignore rearrangements)
Note:
- Most of these are exported Bash functions, allowing them to be used inside the
parallel
command. - I was able to build this pipeline iteratively, adding steps to normalize out the bits I didn't care about (and accumulating the
seds
commands).
Example output:
…
s3/ctbk/csvs/201910-citibike-tripdata.csv.gz.dvc:
s3/ctbk/csvs/201911-citibike-tripdata.csv.gz.dvc:
s3/ctbk/csvs/201912-citibike-tripdata.csv.gz.dvc:
s3/ctbk/csvs/202001-citibike-tripdata.csv.gz.dvc:
1,2d0
< bikeid
< birth_year
8d5
< gender
9a7,8
> ride_id
> rideable_type
15d13
< tripduration
s3/ctbk/csvs/202002-citibike-tripdata.csv.gz.dvc:
1,2d0
< bikeid
< birth_year
8d5
< gender
9a7,8
> ride_id
> rideable_type
15d13
< tripduration
s3/ctbk/csvs/202003-citibike-tripdata.csv.gz.dvc:
1,2d0
< bikeid
< birth_year
8d5
< gender
9a7,8
> ride_id
> rideable_type
15d13
< tripduration
…
This helped me see that the data update in question (c0..c1
) dropped some fields (bikeid, birth_year
, gender
, tripduration
) and added others (ride_id
, rideable_type
), for 202001
and later.