
Consistently muffled sound for some multiple speaker sessions #446

Open
fbanados opened this issue Aug 15, 2024 · 7 comments
Assignees
Labels
bug Something isn't working question Further information is requested requires-programmer-work

Comments

@fbanados
Member

fbanados commented Aug 15, 2024

Symptom: In some sessions, the recordings for only one speaker are heard clearly. The others are noisy: sometimes speakers can barely be heard, sometimes there is background noise or bleed from another speaker's microphone.

Hypothesis: Although the time annotations in the ELAN files used for import are correct, they are not always being matched to the appropriate sound-file track. This would mean that for some speakers we are hearing them through the incorrect microphone/track, which muddies the sound. Likely an extraction issue.

Diagnostic: The hypothesis is borne out by the code. The import scripts try to handle naming inconsistencies in the folders and between .eaf and .wav files themselves, as seen in extract_phrases.py:find_audio_oddities. The bug manifests in line 406 (introduced Jan 2021). If the underscore in the .eaf filename happens to come before the number, the regexp does not filter out any wav files in the session folder (they are only filtered for containing a variation of "Track", not a track number), and the script always takes the first file from the filtered non-empty collection, without checking whether there is more than one option. If the track token appears verbatim in the wav filename this is not an issue, as that case is detected earlier, but it becomes a problem when, for example, the number in the wav filename has a leading 0 that is absent from the .eaf name, or vice versa.
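The failure mode can be illustrated with a small sketch. The function, filenames, and regexp below are illustrative stand-ins, not the actual code in extract_phrases.py:

```python
import re

def find_track_wav(eaf_name, wav_files):
    """Sketch of the bug: when the exact track token from the .eaf name is
    absent from every wav name, the fallback filter keeps any wav containing
    the word "Track" and blindly returns the first one."""
    m = re.search(r"Track[ _]?\d+", eaf_name)
    track_token = m.group(0) if m else ""
    # Exact-substring case (detected earlier, works fine): the eaf's track
    # token appears verbatim in a wav filename.
    exact = [w for w in wav_files if track_token and track_token in w]
    if exact:
        return exact[0]
    # Fallback: filter only on "Track", not the track *number*, so every
    # track survives and the first is taken -- the wrong mic for most speakers.
    loose = [w for w in wav_files if "Track" in w]
    return loose[0] if loose else None

wavs = ["S-Track 1_001.wav", "S-Track 2_001.wav", "S-Track 3_001.wav"]
# The leading zero in the eaf token ("Track_02") never matches "Track 2_001",
# so this speaker's phrases get extracted from the first track:
print(find_track_wav("2017-05-11pm-US-Track_02.eaf", wavs))  # S-Track 1_001.wav
```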

Possible Solution: The simplest solution would be to re-import the sound file for every entry whenever it differs from the one already stored. However, this comparison is relatively expensive: for each candidate that was previously skipped because an entry already existed in the database, we would need to re-compress the wav data in order to bytewise-compare it with the compressed entry stored in the database.
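The cost of that check can be made concrete with a sketch. Here zlib stands in for whatever codec the import pipeline actually uses; that choice, and the function name, are assumptions for illustration:

```python
import zlib

def audio_needs_replacement(stored_compressed, candidate_wav, compress=zlib.compress):
    """Sketch of the expensive check: to decide whether a stored entry matches
    the source track, the candidate wav must be re-compressed before a bytewise
    comparison -- work the importer previously skipped whenever an entry was
    already present in the database."""
    return compress(candidate_wav) != stored_compressed

wav_track1 = b"RIFF...samples-from-track-one"
wav_track2 = b"RIFF...samples-from-track-two"
stored = zlib.compress(wav_track1)  # what the database currently holds

print(audio_needs_replacement(stored, wav_track1))  # False: same track
print(audio_needs_replacement(stored, wav_track2))  # True: wrong track was imported
```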

However, this might invalidate previous validations. We could assume that the new entries are "strictly better", so entries already marked as good should remain so, but it is unclear whether the bad ones should remain bad or be revalidated.

An alternative solution would be to re-import the sound files only for the entries yet-to-be-validated, and generate a new recording for those that have already been validated. I will generate some stats about the scope of the problem and its impact.

Current Impact: TBD. A script must be written to check the exact number of entries impacted. Likely all entries where the underscore comes before the number, that is, for example, 2017-05-11pm-US-Track_01.eaf rather than, say, 2016-01-13am-Track 3_001.eaf. This represents 123 of the 426 folders in /data/maskwacis-recordings.
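A first approximation of that script might look like this. The suspect pattern is a guess from the examples above, and the folder layout is assumed to be one session per directory:

```python
import re
from pathlib import Path

# A session is suspect when an .eaf track token puts the underscore *before*
# the number (e.g. "Track_01.eaf") rather than after it (e.g. "Track 3_001.eaf").
SUSPECT = re.compile(r"Track_\d+\.eaf$", re.IGNORECASE)

def impacted_sessions(root):
    """Return the names of session folders containing at least one suspect .eaf file."""
    return sorted(
        session.name
        for session in Path(root).iterdir()
        if session.is_dir() and any(SUSPECT.search(f.name) for f in session.glob("*.eaf"))
    )

# e.g. impacted_sessions("/data/maskwacis-recordings")
```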

@fbanados fbanados added bug Something isn't working requires-programmer-work labels Aug 15, 2024
@fbanados fbanados self-assigned this Aug 15, 2024
@fbanados
Member Author

fbanados commented Aug 15, 2024

List of possibly impacted sessions:

./2014-12-09 (not: spotchecked)
./2014-12-10 (not: spotchecked)
./2015-02-12
./2015-03-03
./2015-03-18
./2015-03-19
./2015-03-23
./2015-03-25 (not: spotchecked)
./2015-04-15am
./2015-04-15pm
./2015-04-29am
./2015-04-29pm
./2015-05-04am
./2015-05-04pm
./2015-05-29pm
./2015-07-10am
./2015-09-21am1 (not: spotchecked)
./2015-09-21am2 (MISSING)
./2015-09-21am2_data (MISSING)
./2015-09-30am
./2015-09-30pm
./2015-10-19pm (NO, but sounds like BET is picking ATT's mic)
./2015-11-02am (YES: SPOTCHECKED. likely picking from ATT's mic)
./2015-11-02pm (YES: SPOTCHECKED. likely picking from ATT's mic)
./2015-11-16am
./2015-11-16pm
./2015-12-02am
./2015-12-02pm (YES: SPOTCHECKED)
./2015-12-07pm (YES : SPOTCHECKED)
./2016-01-20am (MISSING)
./2016-01-25am (MISSING)
./2016-02-01am
./2016-03-09am
./2016-03-14am
./2016-05-30pm
./2016-06-01pm (YES: SPOTCHECKED)
./2016-06-10pm_US
./2016-06-13pm_DS
./2016-06-14am_US
./2016-06-14pm-DS
./2016-06-14pm_US
./2016-06-16pm-ds
./2016-06-17am-DS
./2016-10-03pm
./2016-10-24pmC-US
./2016-10-24pm-US
./2016-10-31amDS
./2016-11-21pm-US
./2016-11-28-pm-DS
./2016-12-05pmDS
./2016-12-12amDS
./2016-12-12am-US
./2016-12-12pmDS
./2016-12-12pm-US
./2017-01-12am-US
./2017-01-12pmDOWNSTAIRS
./2017-01-12pm-US
./2017-01-19am-US
./2017-01-19-DS-am
./2017-01-19-DS-pm
./2017-04-05amUS
./2017-04-06pmUS
./2017-04-20pm-ds
./2017-04-20pm-DS
./2017-05-04USPM
./2017-05-11pm-US
./2017-05-11pmUS
./2017-05-18pm-US
./2017-06-15am-DS
./2017-07-15am-US
./2017-07-15pm-US
./2017-10-25am-KCH
./2017-10-25am-kit
./2017-10-25am-off
./2017-10-25-pm-KCH
./2017-10-25-pm-kit
./2017-10-25pm-off
./2017-11-08pm-off
./2017-11-29am-KCH
./2017-11-29am-off
./2017-11-29pm-KCH
./2017-11-29pm-off
./2017-12-06am-kch
./2017-12-06am-off
./2017-12-06pm-kch
./2017-12-06pm-off
./2017-12-13am-KCH
./2018-01-17am-KCH
./2018-01-17pm-kch
./2018-01-24am2-kch
./2018-01-24am-off
./2018-01-24pm-kch
./2018-01-24pm-off
./2018-01-31am-kch
./2018-01-31am-off
./2018-01-31pm-KCH
./2018-01-31pm-off
./2018-02-28am-kch
./2018-02-28am-off
./2018-02-28pm-kch
./2018-02-28pm-off
./2018-03-07am-kch
./2018-03-07am-off
./2018-03-07pm-kch
./2018-03-07pm-off
./2018-03-14am_KCH (MISSING)
./2018-03-14am-off
./2018-03-14pm_kch
./2018-03-14pm-off
./2018-04-04am_kch
./2018-04-04pm_kch
./2018-04-11am_kch
./2018-04-11pm_kch
./2018-04-18am_kch
./2018-04-18pm_kch
./2018-04-25am-kch
./2018-04-25am-OFF
./2018-04-25pm_kch
./2018-04-25pm-OFF
./2018-05-02am_kch (YES: SPOTCHECKED)
./2018-05-02am-OFF
./2018-05-02pm-kch
./2018-05-02pm_off

Immediate recommendation: Avoid these sessions in validation until we re-process them. This includes all sessions from 2018. Sessions from 2017 without the issue are likely 2017-01-19PM-US, 2017-01-26, all of 2017-02, all of 2017-03, 2017-04-06AM, 2017-04-13, 2017-04-20AM, 2017-04-26, 2017-05-04AM, 2017-05-11AM, 2017-05-25, and all of 2017-06.

fbanados added a commit that referenced this issue Aug 16, 2024
We need a way to re-process the audio recordings when they have **not**
been addressed.  Currently we are not doing anything else but replacing
them, although adding a note might be nice as well.
@fbanados
Member Author

We could add a field to every recording marking when its audio has been reset, so that one could filter to show only the entries that need to be revalidated.
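One way to sketch that flag and filter. The field and class names here are hypothetical, chosen for illustration; the validation app's actual model will differ:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Recording:
    # Hypothetical fields for illustration only.
    session: str
    speaker: str
    judgment: Optional[str] = None   # "good", "bad", or None (not yet validated)
    audio_reset: bool = False        # set when the audio has been replaced

def needs_revalidation(recordings):
    """Entries whose audio was replaced after a judgment had already been made."""
    return [r for r in recordings if r.audio_reset and r.judgment is not None]

recs = [
    Recording("2018-04-18am_kch", "SPK1", judgment="good", audio_reset=True),
    Recording("2018-04-18am_kch", "SPK2", judgment=None, audio_reset=True),
    Recording("2017-05-25", "SPK1", judgment="bad", audio_reset=False),
]
print([r.speaker for r in needs_revalidation(recs)])  # ['SPK1']
```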

@fbanados
Member Author

I have updated the scripts to replace sounds. Locally, running the 2018-04-18 sessions with the new scripts produces a very noticeable difference: the entries I spot-checked are all equally good (to my untrained ears), not just one speaker's.

I have not tried the scripts in production yet, as we should first decide on whether we want to have some mark that the recording should be revisited (or maybe resetting the annotations for it). Once we decide and that's implemented, I could re-import a not-yet-validated test session to check how things are working (and to ensure there's usable 2018 content for upcoming validation sessions), but I think it's important to decide first what the process to replace recordings will be.

@fbanados fbanados added the question Further information is requested label Aug 16, 2024
@aarppe
Collaborator

aarppe commented Aug 19, 2024

@fbanados When I have reviewed the recordings, and then followed this up with Rose, as long as what is being said can be identified (perhaps with some noise or slightly less loudly pronounced) and is judged to be spoken correctly (as judged by Cree speaker, and matching the transcription), then we have judged those audio snippets as good. It's when the recording is clipped at either end, or the speaker doesn't say the entire word, or pronounces sloppily (adding or removing an -h-) or faintly, then that has been marked as bad. We have also marked as bad audio where there is some significant noise resulting from the primary speaker coughing or whispering out loud on top of the secondary speaker. Thus, what has been judged as good, probably would remain judged as so, even if we'd replace the less-optimal current audio with revised improved snippets. What this would have some impact on is that the crappier audio, even if pronounced properly, has rarely been starred as an exemplary pronunciation, which judgment might change with the improved snippets.

I probably wouldn't have the speaker revalidate the improved snippets; that is something we would take on ourselves, using the original best snippets as the reference point for what is good. It would probably be good to have some indicator showing where recordings already judged as bad or good have been replaced by the improved snippets, as you suggest. I'm not sure we'd want to keep the crappier audio when there is a better snippet; how to rule them out is another matter (e.g. adding a new button like duplicate). And rerunning this on a session that is coming up for validation would be a worthwhile trial.

@fbanados
Member Author

This is a duplicate of #156

@fbanados
Member Author

I have run the script on production for session 2018-05-02PM-KCH-_; it should be ready for trying out on Tuesday.

@fbanados
Member Author

fbanados commented Sep 3, 2024

I've added four extra sessions to work with Rose. See the extra field in the Google spreadsheet.
