
Bug & Progress Logs

Mabel edited this page Sep 12, 2022 · 2 revisions
  • Bug: Conversion Rate Caveats
    • Status: Fixed (08/12/2022)

    • Description of bug:

    • Scenario: Calculating Conversion Rate as (# of users who are at level D2 at month n+1) / (# of users who are at level D1 but not D2 at month n).

      • For example, the D1 cutoff can be at least 2 contributions up to month x, and the D2 cutoff can be at least 10 contributions up to month x. Contributions are counted cumulatively per month, so the number of contributions per person only increases monotonically over time. Levels are assessed at monthly dates, for example 2017-04-01, 2017-05-01, etc.
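
A minimal sketch of the formula in this scenario may help when reading the issues that follow. It assumes each monthly snapshot is a plain dict of cumulative contribution counts per user; the names `naive_conversion_rate`, `users_at_or_above` and the sample users are hypothetical and are not taken from aggregate.py or conversion_rate.py.

```python
D1_CUTOFF = 2   # assumed from the example: at least 2 contributions
D2_CUTOFF = 10  # assumed from the example: at least 10 contributions


def users_at_or_above(cumulative, cutoff):
    """Return the users whose cumulative contribution count meets the cutoff."""
    return {user for user, count in cumulative.items() if count >= cutoff}


def naive_conversion_rate(month_n, month_n1):
    """(# users at D2 at month n+1) / (# users at D1 but not D2 at month n)."""
    d1_only = users_at_or_above(month_n, D1_CUTOFF) - users_at_or_above(month_n, D2_CUTOFF)
    d2_next = users_at_or_above(month_n1, D2_CUTOFF)
    if not d1_only:
        return None  # no one eligible to convert, so the rate is undefined
    return len(d2_next) / len(d1_only)


# Cumulative counts assessed at 2017-04-01 and 2017-05-01 (made-up data).
april = {"alice": 3, "bob": 5, "carol": 1}
may = {"alice": 12, "bob": 6, "carol": 2}
rate = naive_conversion_rate(april, may)  # 1 user at D2 / 2 users at D1 only = 0.5
```

Note that the numerator here is not restricted to users from the denominator, which is what makes the issues below possible.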

Issue #1: For example, if person B is at level D1 and then jumps to level D2 in April 2017, he has converted from D1 to D2. He remains at level D2 from then on, but because he is simultaneously at both levels every month, he completes the D1-D2 conversion multiple times.

Solution: 
- We assume that the levels are disjoint, meaning someone cannot simultaneously be at level D1 and level D2. This fixes the issue that people can "convert" multiple times, which does not make sense.
- There are 2 ways to implement this; we are trying out combinations of both:
    - Method A: Pass cutoffs to filter method
    - Method B: Add filtering to numerator so once someone has joined numerator (converted) you can't count them again
        - The problem is with people like Person A (see Issue #2): they will never be counted
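
The two methods can be sketched together as follows. This is only an illustration under the assumption that snapshots are dicts of cumulative counts in month order; `level` and `monthly_conversions` are hypothetical names, and the cutoffs are the example values from the scenario.

```python
def level(count, d1_cutoff=2, d2_cutoff=10):
    """Method A: assign exactly one level from the cutoffs, so D1 and D2 are disjoint."""
    if count >= d2_cutoff:
        return "D2"
    if count >= d1_cutoff:
        return "D1"
    return None


def monthly_conversions(snapshots):
    """Return one (numerator, denominator) pair of user sets per consecutive month pair."""
    converted = set()  # Method B: users who already joined a numerator
    results = []
    for prev, curr in zip(snapshots, snapshots[1:]):
        denominator = {u for u, c in prev.items() if level(c) == "D1"}
        numerator = {
            u for u, c in curr.items()
            if level(c) == "D2" and u in denominator and u not in converted
        }
        converted |= numerator
        results.append((numerator, denominator))
    return results


# Person "b" converts once; person "a" starts at D2 and is never counted.
snapshots = [{"b": 3}, {"b": 11, "a": 15}, {"b": 12, "a": 16}]
results = monthly_conversions(snapshots)
```

With disjoint levels and monotonically increasing counts, the `converted` set is largely a safeguard, since a user at D2 never re-enters a D1 denominator.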

Issue #2: Suppose that person A makes > 10 contributions in their first month. They are immediately at level D2 when using a monthly interval, so they are never counted in the denominator. Should they be counted in the conversion rate? Counting them means the conversion rate could be > 1. Previously, to constrain the CR to <= 1, we had to make sure the uuids in the numerator were a subset of the uuids in the denominator.

Possible solutions? 
- Allow conversion rate to be > 1 when conceptually, it should not be? (current implementation)
- Extend the denominator endpoint to 1 day before the numerator endpoint, meaning assess the denominator at 2017-03-31 and the numerator up to 2017-04-01 INSTEAD OF the denominator at 2017-03-01 and the numerator at 2017-04-01. This can be achieved by specifying an aggs offset ("offset": "-1d"). I think this person will already have > 10 contributions by then, so that might not work. What about an offset of half a month combined with the next solution for those who are not in it?
- Ignore people such as person A    
- Add a "first time contributor in this month" aspect to the D1 criteria (denominator)
    - If it's their first time AND they already are in numerator, it also counts
    - The problem is that they are not in the denominator a month before; they enter it at the SAME time as the numerator, so the conversion does not register, and because they were already in the numerator they are not counted again via the logic above. A solution may be to put them in the denominator of the month before on a 2nd loop. This works so far.
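
The "2nd loop" idea from the last bullet might look roughly like this. It is a sketch under stated assumptions, not the code in aggregate.py: snapshots are dicts of cumulative counts, a "first time contributor" is someone with no contributions in the previous snapshot, and `conversion_sets` is a hypothetical name.

```python
def conversion_sets(snapshots, d1_cutoff=2, d2_cutoff=10):
    """Return (numerator, denominator) user sets per consecutive month pair."""
    pairs = []
    # First loop: regular D1 -> D2 conversions with disjoint levels.
    for prev, curr in zip(snapshots, snapshots[1:]):
        denominator = {u for u, c in prev.items() if d1_cutoff <= c < d2_cutoff}
        numerator = {u for u in denominator if curr.get(u, 0) >= d2_cutoff}
        pairs.append((numerator, denominator))
    # Second loop: first-time contributors who land directly at D2 are added to
    # the numerator and to the denominator of the month before, so CR stays <= 1.
    for (numerator, denominator), prev, curr in zip(pairs, snapshots, snapshots[1:]):
        direct = {u for u, c in curr.items() if c >= d2_cutoff and prev.get(u, 0) == 0}
        numerator |= direct
        denominator |= direct
    return pairs


# "b" converts normally; "a" makes 15 contributions in their first month.
snapshots = [{"b": 3}, {"b": 11, "a": 15}]
numerator, denominator = conversion_sets(snapshots)[0]
rate = len(numerator) / len(denominator)
```

In this toy run both users end up in the numerator and the denominator, so the rate is 1.0 rather than 2.0.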

Issue #3:
- Since we count D levels cumulatively, more and more people accumulate at lower D levels over time who may have become inactive or just never leveled up. Thus, the denominator may grow faster than the numerator, making it harder to achieve a high conversion rate over time (more and more people need to level up to achieve the same conversion rate as before, e.g. 0.75).
- It follows that, in terms of ABSOLUTE conversions, a conversion rate of 0.75 is very good for an old community and not very good for a very new one.
- This may not be what we want. I had planned a "lag time" solution to account for this.
- We can also show ABSOLUTE conversions over time so that communities can compare the rate vs. absolute conversions.
- Discuss?

- Suggestion: 
    - Also include a graph of the Absolute count of conversions
    - Instead of calculating from the beginning, maybe consider past 6 months (lag time, and people can adjust what lag time they want, can use this to measure impact of an event, such as a summit)
    - Filter to define role of D1, D2 (roles could mean different things in different communities - they care about different things), currently not required for GSOC but can be good for future to make definition of the role flexible.
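
To illustrate the lag time suggestion, here is a small sketch that counts contributions only inside a sliding window and reports the absolute number of conversions alongside the rate. Months are simplified to integer indices, and `windowed_counts` and `conversions_with_lag` are hypothetical names; the real implementation may define the window differently.

```python
from collections import Counter


def windowed_counts(events, end_month, lag_months=6):
    """Count contributions per user in the window [end_month - lag_months, end_month)."""
    start = end_month - lag_months
    return Counter(user for user, month in events if start <= month < end_month)


def conversions_with_lag(events, month, lag_months=6, d1_cutoff=2, d2_cutoff=10):
    """Return (absolute conversions, conversion rate) for month -> month + 1."""
    prev = windowed_counts(events, month, lag_months)
    curr = windowed_counts(events, month + 1, lag_months)
    denominator = {u for u, c in prev.items() if d1_cutoff <= c < d2_cutoff}
    numerator = {u for u in denominator if curr[u] >= d2_cutoff}
    rate = len(numerator) / len(denominator) if denominator else None
    return len(numerator), rate


# "old" contributed 3 times in month 0 and then went inactive;
# "b" contributed 3 times in month 5 and 8 times in month 6.
events = [("old", 0)] * 3 + [("b", 5)] * 3 + [("b", 6)] * 8
six_month = conversions_with_lag(events, 6, lag_months=6)    # (1, 0.5)
three_month = conversions_with_lag(events, 6, lag_months=3)  # (1, 1.0)
```

The absolute count is the same in both runs, while the shorter lag drops the inactive user from the denominator and raises the rate.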
  • Bug: SortingHat UUID Problem

    • Status: Workaround placed (07/10/2022)
    • Description of bug: Each of the Perceval backends (and enrichment) can produce distinct SortingHat UUIDs for the same person (for example the github vs githubql backends). This becomes a problem when we have to aggregate contributions, because the same person must not be treated as several distinct contributors. Currently, the workaround is to combine the user's uuids in their SortingHat profile by dynamically querying the API using their usernames.
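
The effect of the workaround can be pictured with a purely local sketch. It does not call the SortingHat API; it only assumes that (username, uuid) pairs have already been collected from the different backends, and `unify_uuids` is a hypothetical helper name.

```python
def unify_uuids(identities):
    """Map every uuid to one canonical uuid per username (the first one seen)."""
    canonical_by_user = {}
    uuid_map = {}
    for username, uuid in identities:
        canonical = canonical_by_user.setdefault(username, uuid)
        uuid_map[uuid] = canonical
    return uuid_map


# Made-up identities: the same username seen with two different uuids.
identities = [("mabel", "aaa111"), ("mabel", "bbb222"), ("someone", "ccc333")]
uuid_map = unify_uuids(identities)
```

Contributions keyed by any of a person's uuids can then be remapped through `uuid_map` before aggregation.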
  • Bug: Users Not Found in SortingHat During Processing of github2 data (aka people who ONLY made issuecomment contributions are not included in SortingHat)

    • Status: Investigating (08/25/2022)
    • Description of bug: github2 contains data from users' comments on Github issues. It seems that when the github2:issues backend is run, no SortingHat entry is created for users whose ONLY contribution is a comment from github2. When the metrics model enrichment code runs, it cannot find these users in SortingHat through its api (by term only). Thus, uuids cannot be known for these people, and they cannot be aggregated. Currently, their usernames are standing in for their uuids. As a result, the actor_id field in the final combined index (the one which is input for the metric model) now contains a mix of SortingHat uuids and usernames. There is a secondary bug where some of these people are not showing up in the final index. One example is ElizabethN (90 out of 273 are making it into the final index).
  • Bug: '_source.issue_title_analyzed' not found in github2 schema

  • Bug: Not all items uploading to Elasticsearch for github2 PR comments during MM enrichment

    • Status: Fixed (08/26/2022)
    • Description of bug: While running the index combine for github2 PR comments, there are 1208 hits (for the Augur test set) but only 473 make it into the final index. I suspect this is due to non-unique Perceval UUIDs being used, which results in only 1 comment per PR being kept. These comments will have a uuid generated with an md5 hash (prefixed by the original Perceval ID) before they are added to the final index.
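
A minimal sketch of such an id scheme is shown below. What exactly gets hashed and how the two parts are joined are assumptions made here for illustration (the comment id and an underscore); `comment_uuid` is a hypothetical name.

```python
import hashlib


def comment_uuid(perceval_uuid, comment_id):
    """Build a per-comment id: the original Perceval uuid plus an md5 digest."""
    digest = hashlib.md5(str(comment_id).encode("utf-8")).hexdigest()
    return f"{perceval_uuid}_{digest}"


# Two comments on the same PR share a Perceval uuid but get distinct ids.
first = comment_uuid("abc123", 499272394)
second = comment_uuid("abc123", 499272056)
```

Because every comment now has its own document id, later comments on a PR no longer overwrite earlier ones when indexed.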
  • Bug: Missed PR Comments

    • Status: Resolved (08/26/2022)
    • Description of bug: While running the index combine for github2 PR comments, only some of the discussion is captured for a given PR. For example, only these are showing up for 940, but there are more. This may be because the initial PR also contains an issue (https://github.com/chaoss/augur/pull/940#issue-713099296) and "issuecomment" type comments (example: https://github.com/chaoss/augur/pull/940#issuecomment-703302800). So there are not really any missed PR comments, but we need to make sure the other backends were able to retrieve the rest for completeness. The way the metric model counts certain types of contributions is therefore complete, but it does NOT match what you would expect from eyeballing the pull request page on Github.
  • Bug: Missing time fields

    • Status: Investigating (08/28/2022)
    • Description of bug: elasticsearch.exceptions.RequestError: RequestError(400, 'search_phase_execution_exception', 'No mapping found for [grimoire_creation_date] in order to sort on') while running search in aggregate.py
https://github.com/chaoss/augur/pull/940#discussion_r499272394
https://github.com/chaoss/augur/pull/940#pullrequestreview-501645443
https://github.com/chaoss/augur/pull/940#discussion_r499272056
https://github.com/chaoss/augur/pull/940#pullrequestreview-501645568
https://github.com/chaoss/augur/pull/940#pullrequestreview-501645269
https://github.com/chaoss/augur/pull/940#discussion_r499279636
https://github.com/chaoss/augur/pull/940#pullrequestreview-501651563
https://github.com/chaoss/augur/pull/940#discussion_r499272241

TODO

  • Add in community level
  • Add in more catches for required parameters in conf.yaml to make them optional
  • Add in ability to set an upper cutoff on d2 other than None
  • Add in option for allowing repeat conversions AFTER a time cutoff
  • Allow different repositories for each perceval backend?
  • Add in 2-way contribution attribution (for Github assignees)
  • Add more options for which date to unify contributions on (created_at, updated_at, or grimoire_creation_date)
  • Unify uuid in combined index
  • Differentiate editing a comment vs. posting a comment as a contribution? (Only able to distinguish the last edit made, not all of them)
  • Option to remove bots
  • Add multithreading option?
  • Fix sorting-hat bug

Progress Log

GSOC

Week 1 + 2 Jun 13 - Jun 24: Meetings with mentors to scope project and coordinate with Taiwei's project. Understanding code flow from data collection through enrichment (micro.py and github enricher)

Week 3 Jun 24 - Jul 1: Set up the PyCharm sirmordred virtual environment properly for development, traced the enricher code with the debugger, corrected missing __init__.py files (https://github.com/chaoss/grimoirelab-perceval/issues/791)

Week 4 + 5 Jul 1 - Jul 15: Completing Custom Enricher Code for Github Issues, both as a separate file (githubcr.py) and on top of githubql.py; discussing whether to take the approach of a cloned enricher or a standalone module.

Week 6 Jul 15 - Jul 22: Investigate how SortingHat handles unique identities from different backends data collected at different times and across different runs. Design method to match identities up post-backend enrichment.

Week 7 Jul 22 - Jul 29: Implement sorting hat utils for combination of disparate Sorting Hat users from different perceval backends.

Week 8 Jul 29 - Aug 5: Implement conversion rate calculation with only Github and Githubql data.

Week 9 Aug 5 - Aug 12: Add option to conversion rate by implementing a "lag time", debugging conversion rate aggregation code.

Week 10 Aug 12 - Aug 19: Separate 'CreatedEvent' issue vs pull request, debug conversion rate parameters, introduce support for issue comments, optimize SortingHat combination work to reduce redundant work, refactor conversion_rate.py (init / kwargs handling). Begin integrating github2 support for issues.

Week 11 Aug 19 - Aug 26: Implemented github2 support for PR comments (the schemas for github2:issues and github2:pull were not exactly as described in the Grimoirelab repo and also did not match each other, so some time was spent making the two consistent), performed testing to make sure Github events are all included in the enriched index, fixed the UUID discrepancy leading to omitted results, wrote documentation, generated a figure for the Wiki. Worked on debugging the SortingHat bug where users do not end up in the SortingHat database after github2 and githubql enrich (the add_sh_identity method may have something to do with this).

Week 12 Aug 26 - Sep 2: Write documentation, refactor aggregate.py plus adding the following features: support for 2 types of cutoff mode, allow/disallow multiple conversions, separate event types tracked for numerator/denominator, write tests. Add multilevel support (which will allow overlapping graphs), add project tag to output. Familiarize with CHAOSS compass. Work on preparing work for final submission and pre-answering questions https://developers.google.com/open-source/gsoc/help/evaluations#final_1

Week 13 Sep 2 - Sep 9: Write documentation (including documentation for conf.yaml), refactor and clean methods in conversion_rate.py, continue to add multilevel support, add assignee-assigner support. Add in ability for Conversion Rate to exclude bot users.

Post-GSOC
