
Consistently muffled sound for some multiple speaker sessions #446

Open
fbanados opened this issue Aug 15, 2024 · 7 comments
Assignees
Labels
bug Something isn't working question Further information is requested requires-programmer-work

Comments

@fbanados
Member

fbanados commented Aug 15, 2024

Symptom: In some sessions, the recordings for only one speaker are heard clearly. The others are noisy: sometimes speakers can barely be heard, sometimes there is background noise or bleed from another speaker's microphone.

Hypothesis: Although the time annotations in the ELAN files used for import are correct, they are not always being matched to the appropriate sound-file track. This would mean that for some speakers we are hearing them through the incorrect microphone/track, which muddies the sound. Likely an extraction issue.

Diagnostic: The hypothesis is borne out by the code. The import scripts try to handle naming inconsistencies in the folders and between .eaf and .wav files themselves, as seen in extract_phrases.py:find_audio_oddities. The bug manifests in line 406 (introduced Jan 2021). If the underscore in the .eaf filename happens to come before the number, the regexp does not filter out any wav files in the session folder (they are only filtered for containing a variation of "Track", not a track number), and the script always takes the first file from the filtered non-empty collection, without checking whether there is more than one option. If the track token appears verbatim in the wav filename this is not an issue, as that case is detected earlier, but it becomes a problem when, for example, the number in the wav filename has a leading 0 that is absent from the .eaf name, or vice versa.
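The failure mode can be illustrated with a small sketch. The function, filenames, and regexp below are illustrative stand-ins, not the actual code in extract_phrases.py:

```python
import re

def find_track_wav(eaf_name, wav_files):
    """Sketch of the bug: when the exact track token from the .eaf name is
    absent from every wav name, the fallback filter keeps any wav containing
    the word "Track" and blindly returns the first one."""
    m = re.search(r"Track[ _]?\d+", eaf_name)
    track_token = m.group(0) if m else ""
    # Exact-substring case (detected earlier, works fine): the eaf's track
    # token appears verbatim in a wav filename.
    exact = [w for w in wav_files if track_token and track_token in w]
    if exact:
        return exact[0]
    # Fallback: filter only on "Track", not the track *number*, so every
    # track survives and the first is taken -- the wrong mic for most speakers.
    loose = [w for w in wav_files if "Track" in w]
    return loose[0] if loose else None

wavs = ["S-Track 1_001.wav", "S-Track 2_001.wav", "S-Track 3_001.wav"]
# The leading zero in the eaf token ("Track_02") never matches "Track 2_001",
# so this speaker's phrases get extracted from the first track:
print(find_track_wav("2017-05-11pm-US-Track_02.eaf", wavs))  # S-Track 1_001.wav
```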

Possible Solution: The simplest solution would be to re-import the sound file for every entry whenever it differs from the one already stored. However, this comparison is relatively expensive: for each candidate that was previously skipped because an entry already existed in the database, we would need to re-compress the wav data in order to bytewise-compare it with the compressed entry stored in the database.
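The cost of that check can be made concrete with a sketch. Here zlib stands in for whatever codec the import pipeline actually uses; that choice, and the function name, are assumptions for illustration:

```python
import zlib

def audio_needs_replacement(stored_compressed, candidate_wav, compress=zlib.compress):
    """Sketch of the expensive check: to decide whether a stored entry matches
    the source track, the candidate wav must be re-compressed before a bytewise
    comparison -- work the importer previously skipped whenever an entry was
    already present in the database."""
    return compress(candidate_wav) != stored_compressed

wav_track1 = b"RIFF...samples-from-track-one"
wav_track2 = b"RIFF...samples-from-track-two"
stored = zlib.compress(wav_track1)  # what the database currently holds

print(audio_needs_replacement(stored, wav_track1))  # False: same track
print(audio_needs_replacement(stored, wav_track2))  # True: wrong track was imported
```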

However, this might invalidate previous validations. We could assume that the new entries are "strictly better", so entries already marked as good should remain so, but it is unclear whether the bad ones should remain bad or be revalidated.

An alternative solution would be to re-import the sound files only for the entries yet-to-be-validated, and generate a new recording for those that have already been validated. I will generate some stats about the scope of the problem and its impact.

Current Impact: TBD. A script must be written to check the exact number of entries impacted. Likely all entries where the underscore comes before the number, that is, for example, 2017-05-11pm-US-Track_01.eaf rather than, say, 2016-01-13am-Track 3_001.eaf. This represents 123 of the 426 folders in /data/maskwacis-recordings.
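A first approximation of that script might look like this. The suspect pattern is a guess from the examples above, and the folder layout is assumed to be one session per directory:

```python
import re
from pathlib import Path

# A session is suspect when an .eaf track token puts the underscore *before*
# the number (e.g. "Track_01.eaf") rather than after it (e.g. "Track 3_001.eaf").
SUSPECT = re.compile(r"Track_\d+\.eaf$", re.IGNORECASE)

def impacted_sessions(root):
    """Return the names of session folders containing at least one suspect .eaf file."""
    return sorted(
        session.name
        for session in Path(root).iterdir()
        if session.is_dir() and any(SUSPECT.search(f.name) for f in session.glob("*.eaf"))
    )

# e.g. impacted_sessions("/data/maskwacis-recordings")
```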

@fbanados fbanados added bug Something isn't working requires-programmer-work labels Aug 15, 2024
@fbanados fbanados self-assigned this Aug 15, 2024
@fbanados
Member Author

fbanados commented Aug 15, 2024

List of possibly impacted sessions:

./2014-12-09 (not: spotchecked)
./2014-12-10 (not: spotchecked)
./2015-02-12
./2015-03-03
./2015-03-18
./2015-03-19
./2015-03-23
./2015-03-25 (not: spotchecked)
./2015-04-15am
./2015-04-15pm
./2015-04-29am
./2015-04-29pm
./2015-05-04am
./2015-05-04pm
./2015-05-29pm
./2015-07-10am
./2015-09-21am1 (not: spotchecked)
./2015-09-21am2 (MISSING)
./2015-09-21am2_data (MISSING)
./2015-09-30am
./2015-09-30pm
./2015-10-19pm (NO, but sounds like BET is picking ATT's mic)
./2015-11-02am (YES: SPOTCHECKED. likely picking from ATT's mic)
./2015-11-02pm (YES: SPOTCHECKED. likely picking from ATT's mic)
./2015-11-16am
./2015-11-16pm
./2015-12-02am
./2015-12-02pm (YES: SPOTCHECKED)
./2015-12-07pm (YES : SPOTCHECKED)
./2016-01-20am (MISSING)
./2016-01-25am (MISSING)
./2016-02-01am
./2016-03-09am
./2016-03-14am
./2016-05-30pm
./2016-06-01pm (YES: SPOTCHECKED)
./2016-06-10pm_US
./2016-06-13pm_DS
./2016-06-14am_US
./2016-06-14pm-DS
./2016-06-14pm_US
./2016-06-16pm-ds
./2016-06-17am-DS
./2016-10-03pm
./2016-10-24pmC-US
./2016-10-24pm-US
./2016-10-31amDS
./2016-11-21pm-US
./2016-11-28-pm-DS
./2016-12-05pmDS
./2016-12-12amDS
./2016-12-12am-US
./2016-12-12pmDS
./2016-12-12pm-US
./2017-01-12am-US
./2017-01-12pmDOWNSTAIRS
./2017-01-12pm-US
./2017-01-19am-US
./2017-01-19-DS-am
./2017-01-19-DS-pm
./2017-04-05amUS
./2017-04-06pmUS
./2017-04-20pm-ds
./2017-04-20pm-DS
./2017-05-04USPM
./2017-05-11pm-US
./2017-05-11pmUS
./2017-05-18pm-US
./2017-06-15am-DS
./2017-07-15am-US
./2017-07-15pm-US
./2017-10-25am-KCH
./2017-10-25am-kit
./2017-10-25am-off
./2017-10-25-pm-KCH
./2017-10-25-pm-kit
./2017-10-25pm-off
./2017-11-08pm-off
./2017-11-29am-KCH
./2017-11-29am-off
./2017-11-29pm-KCH
./2017-11-29pm-off
./2017-12-06am-kch
./2017-12-06am-off
./2017-12-06pm-kch
./2017-12-06pm-off
./2017-12-13am-KCH
./2018-01-17am-KCH
./2018-01-17pm-kch
./2018-01-24am2-kch
./2018-01-24am-off
./2018-01-24pm-kch
./2018-01-24pm-off
./2018-01-31am-kch
./2018-01-31am-off
./2018-01-31pm-KCH
./2018-01-31pm-off
./2018-02-28am-kch
./2018-02-28am-off
./2018-02-28pm-kch
./2018-02-28pm-off
./2018-03-07am-kch
./2018-03-07am-off
./2018-03-07pm-kch
./2018-03-07pm-off
./2018-03-14am_KCH (MISSING)
./2018-03-14am-off
./2018-03-14pm_kch
./2018-03-14pm-off
./2018-04-04am_kch
./2018-04-04pm_kch
./2018-04-11am_kch
./2018-04-11pm_kch
./2018-04-18am_kch
./2018-04-18pm_kch
./2018-04-25am-kch
./2018-04-25am-OFF
./2018-04-25pm_kch
./2018-04-25pm-OFF
./2018-05-02am_kch (YES: SPOTCHECKED)
./2018-05-02am-OFF
./2018-05-02pm-kch
./2018-05-02pm_off

Immediate recommendation: Avoid these sessions in validation until we re-process them. This includes all sessions from 2018. Sessions from 2017 without the issue are likely 2017-01-19PM-US, 2017-01-26, all of 2017-02, all of 2017-03, 2017-04-06AM, 2017-04-13, 2017-04-20AM, 2017-04-26, 2017-05-04AM, 2017-05-11AM, 2017-05-25, and all of 2017-06.

fbanados added a commit that referenced this issue Aug 16, 2024
We need a way to re-process the audio recordings when they have **not**
been addressed.  Currently we are not doing anything else but replacing
them, although adding a note might be nice as well.
@fbanados
Member Author

We could add a field to every recording marking when its audio has been reset, so that one could filter to show only the entries that need to be revalidated.
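One way to sketch that flag and filter. The field and class names here are hypothetical, chosen for illustration; the validation app's actual model will differ:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Recording:
    # Hypothetical fields for illustration only.
    session: str
    speaker: str
    judgment: Optional[str] = None   # "good", "bad", or None (not yet validated)
    audio_reset: bool = False        # set when the audio has been replaced

def needs_revalidation(recordings):
    """Entries whose audio was replaced after a judgment had already been made."""
    return [r for r in recordings if r.audio_reset and r.judgment is not None]

recs = [
    Recording("2018-04-18am_kch", "SPK1", judgment="good", audio_reset=True),
    Recording("2018-04-18am_kch", "SPK2", judgment=None, audio_reset=True),
    Recording("2017-05-25", "SPK1", judgment="bad", audio_reset=False),
]
print([r.speaker for r in needs_revalidation(recs)])  # ['SPK1']
```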

@fbanados
Member Author

I have updated the scripts to replace sounds. Locally, running the 2018-04-18 sessions with the new scripts produces a very noticeable difference: the entries I spot-checked are all equally good (to my untrained ears), not just one speaker's.

I have not tried the scripts in production yet, as we should first decide on whether we want to have some mark that the recording should be revisited (or maybe resetting the annotations for it). Once we decide and that's implemented, I could re-import a not-yet-validated test session to check how things are working (and to ensure there's usable 2018 content for upcoming validation sessions), but I think it's important to decide first what the process to replace recordings will be.

@fbanados fbanados added the question Further information is requested label Aug 16, 2024
@aarppe
Collaborator

aarppe commented Aug 19, 2024

@fbanados When I have reviewed the recordings, and then followed this up with Rose, as long as what is being said can be identified (perhaps with some noise or slightly less loudly pronounced) and is judged to be spoken correctly (as judged by Cree speaker, and matching the transcription), then we have judged those audio snippets as good. It's when the recording is clipped at either end, or the speaker doesn't say the entire word, or pronounces sloppily (adding or removing an -h-) or faintly, then that has been marked as bad. We have also marked as bad audio where there is some significant noise resulting from the primary speaker coughing or whispering out loud on top of the secondary speaker. Thus, what has been judged as good, probably would remain judged as so, even if we'd replace the less-optimal current audio with revised improved snippets. What this would have some impact on is that the crappier audio, even if pronounced properly, has rarely been starred as an exemplary pronunciation, which judgment might change with the improved snippets.

I probably wouldn't have the speaker revalidate the improved snippets; that is something we would take on ourselves, using the original best snippets as the reference point for what is good. It would probably be good to have some indicator showing where recordings already judged as bad or good have been replaced by the improved snippets, as you suggest. I'm not sure we'd want to keep the crappier audio when there is a better snippet; how to rule them out is another matter (e.g. adding a new button like duplicate). And rerunning this on a session that is coming up for validation would be a worthwhile trial.

@fbanados
Member Author

This is a duplicate of #156

@fbanados
Member Author

I have run the script on production for session 2018-05-02PM-KCH-_; it should be ready for trying out on Tuesday.

@fbanados
Member Author

fbanados commented Sep 3, 2024

I've added four extra sessions to work with Rose. See the extra field in the Google spreadsheet.
