Improve code authorship attribution #944

Closed
myteo opened this issue Nov 12, 2019 · 12 comments · Fixed by #2140

Comments

@myteo
Contributor

myteo commented Nov 12, 2019

Currently in the code panel, an author is credited for the lines of code that were last modified by them. However, another author might have written the line initially, and the current author (the author who last modified the line) might have made only a slight modification to it. In such a case, the current author gets credited for work that is not entirely theirs.

Proposed solution:
Assign partial or full credit for a line to the current author, depending on whether they introduced the line themselves or only made a slight modification to a line that someone else introduced. This is done by comparing the line with its ancestor line (the previous version of the line) to see how similar the two are. Users turn this feature on with a command-line flag; it is off by default.

For a given line l and author a who last modified the line, we can assign credit to author a using the following steps:

  1. Check the commit in which author a modified line l
  2. Get the diffs of the commit and find the hunk which line l belongs to
  3. Go through all the deleted lines in the hunk and compare each of them with line l to see how similar they are (using the Levenshtein string distance metric)
  4. If no deleted line is highly similar to line l, we give author a full credit. If we find such a deleted line, we look up the author of that deleted line.
  5. If the author of the deleted line is not author a, we give partial credit to author a. If the author of the deleted line is author a, we set line l as the deleted line and repeat from step 1.
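Steps 2 and 3 above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `deleted_lines_for` is hypothetical, the input is assumed to be the unified diff text of a single file for an ordinary (non-merge) commit, and `line_number` is assumed to be the position of line l in the new version of that file.

```python
import re
from typing import List, Optional

# Matches a unified diff hunk header such as "@@ -10,2 +10,3 @@".
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def deleted_lines_for(diff_text: str, line_number: int) -> Optional[List[str]]:
    """Return the deleted lines of the hunk that covers `line_number`.

    `line_number` refers to the new version of the file. Returns None
    when no hunk in the diff covers that line.
    """
    lines = diff_text.splitlines()
    i = 0
    while i < len(lines):
        match = HUNK_HEADER.match(lines[i])
        if match is None:
            i += 1
            continue
        new_start = int(match.group(3))
        # A missing count in the header means the range spans one line.
        new_count = int(match.group(4)) if match.group(4) is not None else 1
        deleted: List[str] = []
        i += 1
        # Collect the hunk body until the next hunk or file header.
        while i < len(lines) and not lines[i].startswith(("@@", "diff --git")):
            if lines[i].startswith("-"):
                deleted.append(lines[i][1:])
            i += 1
        if new_start <= line_number < new_start + new_count:
            return deleted
    return None
```

The returned lines are the candidates that step 3 compares against line l. Combined diffs produced for merge commits use a different header format and are not handled by this sketch.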

After we assign partial or full credit to the author, we can display the results in the code panel by using a darker shade of green to highlight code with full credit.

[Image: Picture1]

myteo self-assigned this Nov 12, 2019
@myteo
Contributor Author

myteo commented Nov 12, 2019

Some additional implementation details:

  1. We can get the author who last modified a line using git blame
  2. We can get the diffs of a commit using git diff
  3. In the case of a merge commit, we need to get the diff w.r.t each parent of the commit
  4. How to calculate the similarity of line l2 w.r.t. line l1:
     • Get the Levenshtein string distance d between line l1 and line l2 (the minimum number of single-character insertions, deletions, or substitutions needed to turn one string into the other).
     • The similarity score is 1 - (d ÷ length of line l1). It is at most 1, and it lies between 0 and 1 whenever d does not exceed the length of l1.
     • We need to set a threshold similarity value. Based on some testing, I propose using a value of 80% similarity for now.
     • Out of all the deleted lines that meet the threshold value, we choose the one with the highest score.
  5. We should ignore empty lines.
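A minimal Python sketch of the similarity calculation described above. The function names are hypothetical, and treating the current line as l1 is an assumption, since the comment does not say which of the two lines is which. Note that the raw score drops below 0 when the candidate is much longer than l1; the sketch leaves it unclamped.

```python
from typing import Iterable, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(l1: str, l2: str) -> float:
    """Similarity of l2 w.r.t. l1: 1 - d / len(l1). l1 must be non-empty."""
    return 1 - levenshtein(l1, l2) / len(l1)


def best_ancestor(line: str, deleted_lines: Iterable[str],
                  threshold: float = 0.8) -> Optional[Tuple[float, str]]:
    """Pick the deleted line with the highest score that meets the threshold.

    Empty (whitespace-only) lines are ignored on both sides.
    Returns (score, deleted_line), or None if nothing qualifies.
    """
    if not line.strip():
        return None
    scored = [(similarity(line, cand), cand)
              for cand in deleted_lines if cand.strip()]
    qualifying = [pair for pair in scored if pair[0] >= threshold]
    return max(qualifying, key=lambda pair: pair[0], default=None)
```

For example, `best_ancestor("int count = 0;", ["", "int count = 1;", "return x;"])` selects `"int count = 1;"`, because one substitution over 14 characters gives a score of about 0.93, while the empty line is skipped and `"return x;"` falls below the threshold.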

@eugenepeh
Member

So in this algorithm, given the flow:

  1. A authored a new line
  2. B modified the line (minor edit)
  3. A modified the line (minor edit)

Will the full credit in this case go to A or B?

@myteo
Contributor Author

myteo commented Dec 3, 2019

For this case, A will be given partial credit.

After this change, the author who last modified the line will still get credit as before, but this algorithm will determine if he/she gets partial or full credit. That means for the line described above, B will still not receive any credit.

When tracing the "ancestry" of a line, the algorithm terminates and gives partial credit once a different author is found. So for the line described, once we find out that B authored the previous version, we will give A partial credit and analysis will stop there. This is less rigorous, but saves the time of having to trace the ancestry of the line all the way back to the start.

Hope this clarifies. Do let me know if you think there are areas that could be improved. Thanks!

@jinyao-lee
Contributor

Just to ask and clarify:

  • Given authors a and b and a line l:

    • If a initially authored l, and b made a major modification to that line (i.e.: similarity score < 0.8), b gets the full credit?
    • If a initially authored l, and b only makes a minor modification to that line (i.e.: similarity score > 0.8), b gets the partial credit?
  • The intended solution will still be crediting to the author who last modified that line, with the addition of giving that author full/partial credit (based on the different shades of green color) based on the calculated min edit distance?

@myteo
Contributor Author

myteo commented Dec 12, 2019

Thanks for clarifying! To answer your questions:

> If a initially authored l, and b made a major modification to that line (i.e.: similarity score < 0.8), b gets the full credit?

Yup correct.

> If a initially authored l, and b only makes a minor modification to that line (i.e.: similarity score > 0.8), b gets the partial credit?

Yup correct, strictly speaking >= 0.8.

> The intended solution will still be crediting to the author who last modified that line, with the addition of giving that author full/partial credit (based on the different shades of green color) based on the calculated min edit distance?

Yup correct.

@eugenepeh
Member

One more question, how far do we go in terms of tracing the ancestry of a line?

@myteo
Contributor Author

myteo commented Dec 21, 2019

> One more question, how far do we go in terms of tracing the ancestry of a line?

Currently there is no limit on how far back we trace the ancestry. Tracing stops when either of the following holds:

  1. Ancestor line has similarity value < 80% [give full credit]
  2. Ancestor line has similarity value >= 80% and author of ancestor line is someone else [give partial credit]

We continue to trace if the ancestor line has similarity value >= 80% and was written by the same author (the author who last modified the line). In the worst case, the tracing stops when there are no more ancestor lines, i.e. at the commit where the file was first added.
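The stopping rules above can be sketched as a loop in Python. This is a simplified model, not the code from the actual PR: `LineVersion`, `assign_credit`, and the `find_ancestor` callback are hypothetical names, and `find_ancestor` is assumed to wrap the blame, diff, and similarity work, returning the best-matching deleted line together with its similarity score, or None when the commit has no candidate.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class LineVersion:
    author: str
    text: str


# Hypothetical lookup: the most similar ancestor of a line version and
# its similarity score, or None if there is no ancestor candidate.
AncestorLookup = Callable[[LineVersion], Optional[Tuple[LineVersion, float]]]


def assign_credit(line: LineVersion, find_ancestor: AncestorLookup,
                  threshold: float = 0.8) -> str:
    """Return "full" or "partial" credit for the last author of `line`."""
    author = line.author
    current = line
    while True:
        found = find_ancestor(current)
        if found is None:
            return "full"      # no ancestor left, e.g. the file was just added
        ancestor, score = found
        if score < threshold:
            return "full"      # rule 1: ancestor is not similar enough
        if ancestor.author != author:
            return "partial"   # rule 2: similar ancestor by someone else
        current = ancestor     # same author: keep tracing further back


# The A -> B -> A flow discussed earlier: A's last edit is a minor one
# on top of B's version, so A receives partial credit.
v1 = LineVersion("A", "int total = sum(itm);")
v2 = LineVersion("B", "int total = sum(item);")
v3 = LineVersion("A", "int total = sum(items);")
history = {v3: (v2, 0.95), v2: (v1, 0.95)}
print(assign_credit(v3, history.get))
```

The loop never looks past the first ancestor written by a different author, which is the early termination described in this thread.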

@eugenepeh
Member

> We continue to trace if the ancestor line has similarity value >= 80% and was written by the same author (the author who last modified the line). In the worst case, the tracing stops when there are no more ancestor lines, i.e. at the commit where the file was first added.

Would it be faster if we just find the original author and the last modifier of the line?

@myteo
Contributor Author

myteo commented Dec 21, 2019

If we just find the original author, we would still need to trace the ancestry of the line but we save the time needed to check the author of each of the ancestor lines during tracing. On the other hand, the current method checks the author of each of the ancestor lines but is able to terminate early once a different author is found.

The author of a line is obtained from the results of the git blame command, which is needed anyway to find the commit in which the ancestor line was modified, so the time saved would mainly come from not having to extract the author from the git blame results. Thus I think the current method might still be faster, since the time saved by skipping the author check might not outweigh the time saved by terminating early.

@SkyBlaise99
Contributor

Hi @dcshzj, I would like to continue working on this issue. Is there anything I need to take note of? I noticed that #949, left by @myteo, is nearly complete. Apart from merge conflicts, what issues are preventing it from being merged?

@damithc
Collaborator

damithc commented Jul 20, 2023

@SkyBlaise99 it's been a while since someone looked at this PR. Perhaps you can send a cleaned up PR and we can get the current active developers to take a fresh look?

@SkyBlaise99
Contributor

Sure prof @damithc, I'm planning to open a new PR since the old one has too many merge conflicts to resolve.

chan-j-d pushed a commit that referenced this issue Nov 11, 2023
Currently frontend tests are failing.
Let's sync with master branch to fix the test cases.
chan-j-d pushed a commit that referenced this issue Nov 12, 2023

AnnotatorAnalyzer only overwrites the author but not the credit
information, when an author tag is found. If the analyze authorship
flag is enabled, credit information based on the blame author will be
wrongly inherited by the annotated author.

Let's assign partial credit if the annotated author is not the same as
the blame author, and keep the analyzed credit information if the two
are the same.
ckcherry23 pushed a commit that referenced this issue Jan 8, 2024
* [#2027] Fix date range bug (#2034)

Currently, users are unable to select a zoom range that includes 
the until date.

This results in misleading data being presented to users.

* [#2039] Update cypress minimum requirement to 12.15.0 (#2041)

A Chrome bug is causing Cypress to fail to open a browser on GitHub
Actions, causing frontend tests and CI to fail. Upgrading Cypress
to a version greater than 12.15.0 will fix this issue.

Let's upgrade cypress to fix the failing CI.

* [#1936] Migrate c-segment.vue to typescript (#2035)

Currently, there is still some JavaScript code which remains 
unmigrated. This allows for type unsafe code to be written, 
potentially resulting in unintended behavior.

Let's migrate the rest of the JavaScript code to TypeScript 
code to facilitate future changes to the code.

* [#1936] Migrate load-font-awesome-icons.js to typescript (#2040)

Currently, there is still some JavaScript code which remains 
unmigrated. This allows for type unsafe code to be written, 
potentially resulting in unintended behavior.

Let's migrate the rest of the JavaScript code to TypeScript 
code to facilitate future changes to the code.

* [#2045] Fix cypress zoom feature test (#2047)

Currently, Cypress zoom feature tests are failing due to a recent change
in behavior caused by a bug fix. With the tests failing, we are unable
to detect any future regressions.

Let's update the Cypress tests to test for the new intended behavior.

* [#1936] Migrate random-color-gen.js to typescript (#2043)

Currently, there is still some JavaScript code which remains unmigrated.
This allows for type unsafe code to be written, potentially resulting in
unintended behavior.

Let's migrate random-color-generator.js JavaScript code to TypeScript
code to facilitate future changes to the code.

* [#1936] Migrate c-segment-collection.vue to typescript (#2036)

Currently, there is still some JavaScript code which remains unmigrated.
This allows for type unsafe code to be written, potentially resulting in
unintended behavior.

Let's migrate the rest of the JavaScript code to TypeScript code to
facilitate future changes to the code.

* [#1936] Migrate c-resizer.vue to typescript (#2038)

Currently, there is still some JavaScript code which remains unmigrated.
This allows for type unsafe code to be written, potentially resulting in
unintended behavior.

Let's migrate the rest of the JavaScript code to TypeScript code to
facilitate future changes to the code.

* Bump zod from 3.20.6 to 3.22.3 in /frontend (#2048)

Bumps [zod](https://github.com/colinhacks/zod) from 3.20.6 to 3.22.3.
- [Release notes](https://github.com/colinhacks/zod/releases)
- [Changelog](https://github.com/colinhacks/zod/blob/master/CHANGELOG.md)
- [Commits](colinhacks/zod@v3.20.6...v3.22.3)

---
updated-dependencies:
- dependency-name: zod
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump @cypress/request and cypress in /frontend/cypress (#2042)

Bumps [@cypress/request](https://github.com/cypress-io/request) to 3.0.1 and updates ancestor dependency [cypress](https://github.com/cypress-io/cypress). These dependencies need to be updated together.


Updates `@cypress/request` from 2.88.12 to 3.0.1
- [Release notes](https://github.com/cypress-io/request/releases)
- [Changelog](https://github.com/cypress-io/request/blob/master/CHANGELOG.md)
- [Commits](cypress-io/request@v2.88.12...v3.0.1)

Updates `cypress` from 12.17.4 to 13.3.0
- [Release notes](https://github.com/cypress-io/cypress/releases)
- [Changelog](https://github.com/cypress-io/cypress/blob/develop/CHANGELOG.md)
- [Commits](cypress-io/cypress@v12.17.4...v13.3.0)

---
updated-dependencies:
- dependency-name: "@cypress/request"
  dependency-type: indirect
- dependency-name: cypress
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [#1936] Migrate c-ramp.vue to typescript (#2037)

Currently, there is still some JavaScript code which remains unmigrated.
This allows for type unsafe code to be written, potentially resulting in
unintended behavior.

Let's migrate the rest of the JavaScript code to TypeScript code to
facilitate future changes to the code.

* Give partial credit if annotated author is not the same as the blame
author

* [#2054] Fix zoom view bug (#2055)

Currently, when granularity is set to day or week, clicking on a ramp
will open up a zoom view where commit messages are not being displayed
and sorting by insertions does not result in any sorting. 

Let's fix the unintended behaviour of the zoom view.

* [#1936] Migrate repo-sorter.js to typescript (#2052)

Currently, there is still some JavaScript code which remains unmigrated.
This allows for type unsafe code to be written, potentially resulting in
unintended behavior.

Let's migrate repo-sorter.js to TypeScript code to facilitate future
changes to the code.

* [#1936] Migrate safari_date.js to typescript (#2053)

Currently, there is still some JavaScript code which remains unmigrated.
This allows for type unsafe code to be written, potentially resulting in
unintended behavior.

Let's migrate safari_date.js to TypeScript code to facilitate future
changes to the code.

* Remove frontend JS lint (#2063)

Currently, frontend linter is failing due to lint scripts 
checking javascript files, the last of which has been 
removed in PR #2053.

Let's update the lint command to exclude JavaScript
files from the check.

* use full and partial credit color

* [#1929] Add dynamic positioning support for tooltips (#2056)

Currently, most tooltips are shown above buttons and text. 
When these tooltips appear at the top of the viewport, 
part of the tooltips will not be rendered.

Let's implement changes such that these tooltips appear below the
text or button, when appearing at the top of the viewport.

* Add test cases for annotated author overriding last author's credit

* revert merge from master

* revert merge from master 58b7002

* [#1928] Fix tooltip zIndex such that it doesn't occlude next file title (#2057)

Currently, if one hovers over a tooltip of the pinned title of
a file whose content is scrolled almost completely, such that 
the title of the next file is just below the pinned title, the 
tooltip is not displayed appropriately, as the title of the next 
file obstructs it.

Let's fix this issue.

* [#1726] Update GitHub-specific references in codebase and docs (#2050)

There are still leftover references specific to GitHub on parts of 
the codebase and docs that have been generalized to accept 
other remote git hosts. 

Let's update these GitHub references to use more general language.

* Trigger workflow

* Revert "Merge branch 'master' into 944-analyze-authorship"

This reverts commit 950c912, reversing
changes made to 4bd05a7.

* fix frontend test failing

* switch to originality score and threshold

* update originality threshold

* revert frontend code changes

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: jq1836 <95712150+jq1836@users.noreply.github.com>
Co-authored-by: Chan Jun Da <65345505+chan-j-d@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Pratham Jain <71367149+pratham31012002@users.noreply.github.com>
ckcherry23 pushed a commit that referenced this issue Jan 27, 2024

Currently, when viewing individual contributions, full and partial
credit are differentiated with dark and light green colors. However,
this distinction is not applied when the group is merged, potentially
causing confusion for users.

Let's introduce a clear differentiation between full and partial credit
when viewing a merged group.
gok99 pushed a commit that referenced this issue Mar 3, 2024
The existing setup uses a static originality threshold of 0.51.
However, this threshold was tuned for code such as Java or Markdown
and might not be suitable for other languages. Additionally, it offers
no flexibility for users who want a stricter threshold and are willing
to accept longer processing times, or for those who prefer a more
lenient threshold in exchange for faster analysis.

Let's enable users to input their preferred originality threshold.
gok99 pushed a commit that referenced this issue Mar 3, 2024
Currently, the authorship credit analysis process exceeds an hour in
duration, which is significantly longer than the mere 5 minutes required
when the feature is deactivated.

Let's speed up the performance by implementing a caching mechanism and
refining the dynamic programming algorithm utilized for computing the
Levenshtein distance.
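The commit message does not say how the caching or the refined dynamic programming works, so the following Python sketch only illustrates one common way to combine the two ideas. It memoizes results with `functools.lru_cache` and abandons the computation as soon as the distance is guaranteed to exceed a caller-supplied `limit`. The function name and the `limit` parameter are assumptions, not taken from the actual change.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def bounded_levenshtein(a: str, b: str, limit: int) -> int:
    """Return the Levenshtein distance between a and b if it is <= limit.

    Otherwise return limit + 1 without finishing the computation.
    Results are cached, so repeated comparisons of the same pair are free.
    """
    # The distance is at least the difference in lengths.
    if abs(len(a) - len(b)) > limit:
        return limit + 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        # The final distance can never be smaller than the smallest
        # entry of any row, so stop once the whole row exceeds the limit.
        if min(curr) > limit:
            return limit + 1
        prev = curr
    return prev[-1] if prev[-1] <= limit else limit + 1
```

A caller that only needs to know whether a pair of lines clears a fixed threshold could derive `limit` from that threshold and the line length, so that clearly dissimilar lines are rejected after a few rows.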
gok99 pushed a commit that referenced this issue Mar 7, 2024
Currently the branch is ready but out of sync with the master branch,
which had some major revamp in both the frontend and backend.

Let's merge the master branch into this branch to keep it up to date.
ckcherry23 pushed a commit that referenced this issue Apr 28, 2024
A line is credited to the author who last modified it.

Another author might have written the line initially and the current
author only modified it slightly. In such a case, the current author
gets credited for work that is not entirely done by him/her.

Let's analyze how similar a line is to its ancestor lines (previous
versions of the line) and give full or partial credit to the last
author based on the analysis.