Open letter to Veracode concerning duplicate flaws

UPDATE: This has been mostly resolved for me. I've seen an 80-90% reduction in duplicates. Thank you, Veracode Engineering.

I see duplicate flaws in every Veracode scan. CWE-73 External Control of File Name or Path is a particularly severe offender: in a given scan, the same flaw appears up to three times.

It goes further than that. Veracode has detected this exact same flaw 12 separate times. The same duplication currently affects roughly ten flaws. As a result, my team complains that Veracode is inaccurate or useless, and I have to fight a morale problem caused by an easily remedied bug in Veracode.

I strongly suggest that you implement two complementary fixes to the Veracode static scanning algorithm:

  1. compute a content hash of the line a flaw is detected on
  2. disallow more than one flaw of the same type to be reported on the same content hash and the same line number (content hash alone is insufficient because a copy-pasted line legitimately needs to be reported each time it appears); a brief sketch of this check follows the list
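
To make that concrete, here is a minimal sketch of what such a check could look like. The class and method names are mine, not Veracode's internals, and SHA-256 is just one reasonable choice of content hash:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical sketch: report a finding only the first time a given
    // (CWE id, line content hash, line number) triple is seen in a scan.
    final class FindingDeduplicator {
        private final Set<String> seen = new HashSet<>();

        boolean isNovel(int cweId, String lineContent, int lineNumber) {
            String key = cweId + ":" + sha256(lineContent.strip()) + ":" + lineNumber;
            return seen.add(key); // false means "duplicate, suppress it"
        }

        private static String sha256(String s) {
            try {
                MessageDigest md = MessageDigest.getInstance("SHA-256");
                StringBuilder hex = new StringBuilder();
                for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
                    hex.append(String.format("%02x", b));
                }
                return hex.toString();
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException(e);
            }
        }
    }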

After those two fixes are in place, all trivial duplicate flaws will disappear from all scans. Additional work will be needed to contextually de-duplicate flaws on the same semantic line when the file content above it changes between scans. I'm trying to develop my own de-duplication using the Git commit SHAs of the detections, which would solve all duplicates reliably.
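
The Git-based idea, roughly, is to key each finding on the commit that last touched the flagged line, so the key stays stable even when unrelated edits above it shift the line number. Here is a sketch of what I'm experimenting with, assuming a finding gives me a file path and a line number (none of this is Veracode's API, and shelling out to git blame --porcelain is just one way to recover the commit):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    // Hypothetical sketch: build a de-duplication key for a finding from the
    // commit that last modified the flagged line, via `git blame --porcelain`.
    final class GitFindingKey {
        static String key(int cweId, String filePath, int lineNumber)
                throws IOException, InterruptedException {
            Process blame = new ProcessBuilder(
                    "git", "blame", "--porcelain",
                    "-L", lineNumber + "," + lineNumber, "--", filePath)
                    .redirectErrorStream(true)
                    .start();
            try (BufferedReader out = new BufferedReader(
                    new InputStreamReader(blame.getInputStream(), StandardCharsets.UTF_8))) {
                // Porcelain output starts with "<commit-sha> <original-line> <final-line> ..."
                String[] header = out.readLine().split(" ");
                blame.waitFor();
                return cweId + ":" + header[0] + ":" + header[1];
            }
        }
    }

Keyed this way, a finding that merely shifts down the file between scans should still collapse to the same key.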

I realize that the same flaw can be found multiple times on the same line, but that's not useful to the human process of flaw triage or remediation. When Veracode reports a flaw, I fix the entire line. If I fail to fully correct the extant flaws, the content hash changes, the following scan detects the residual flaws, and the feedback loop continues virtuously, reinforcing my efforts with results.

On its own, the content hash could also enable significant optimizations in scan times by indexing the results of all customers' scans by content hash. The first pass on any file could be a content-addressable lookup of existing flaw findings by content hash. You could content hash all lines, group them by hash, score everything covered by the indexed findings, scan the remaining unindexed lines, and finally add their novel findings to the index. I suspect this would significantly accelerate all customer scans, since many lines are identical across codebases without being plagiarism (import statements, class/field/variable declarations with common names, open and close braces at various depths). Maybe you folks already do this; your scans are fairly quick already. Line normalization prior to indexing (treating qualified and unqualified references to the same class or method as identical, renaming all variables to arg0/arg1, and so on) would greatly shrink the required index size and increase the likelihood of an index hit.
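
Here is a sketch of that content-addressable first pass. The names and the trivial whitespace-collapsing normalization are mine; the richer normalization described above would need a real parser:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Function;

    // Hypothetical sketch: consult an index of prior findings keyed by the
    // normalized line, and only run the full scanner on lines that miss it.
    final class FindingsIndex {
        private final Map<String, List<Integer>> cweIdsByNormalizedLine = new HashMap<>();

        List<Integer> findOrScan(String line, Function<String, List<Integer>> scanner) {
            String key = normalize(line); // a real index would key on a hash of this
            return cweIdsByNormalizedLine.computeIfAbsent(key, k -> scanner.apply(line));
        }

        // Stand-in for the richer normalization described above (qualified vs.
        // unqualified names, arg0/arg1 renaming); here we only collapse whitespace.
        private static String normalize(String line) {
            return line.strip().replaceAll("\\s+", " ");
        }
    }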

I realize you scan bytecode for Java, but content hashes still apply there.

Sincerely,
Alain O'Dea
Concerned Veracode Customer