The Wayback Machine - http://web.archive.org/web/20211229061339/https://github.com/github/codeql/pull/7386
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

C++: split cpp/overrunning-write into two #7386

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

@redsun82
Copy link
Contributor

@redsun82 redsun82 commented Dec 14, 2021

This splits the cpp/overruning-write into two separate queries based
off on the reason for the estimation. If the overrun is detected based
on non-trivial range analysis, the results are now marked by the new
cpp/very-likely-overruning-write high precision query. If it is based
on less precise, usually type based bounds, then it will still be marked
by cpp/overruning-write which remains at medium precision.

Closes: github/codeql-c-team#751

@redsun82 redsun82 requested review from MathiasVP and geoffw0 Dec 14, 2021
@redsun82 redsun82 requested a review from as a code owner Dec 14, 2021
@github-actions github-actions bot added the C++ label Dec 14, 2021
@redsun82
Copy link
Contributor Author

@redsun82 redsun82 commented Dec 14, 2021

I'll now create the internal request and start a DCA experiment on this, and then maybe play around with some LGTM databases.

@redsun82 redsun82 force-pushed the redsun82/cpp-overrunning-write-precision-split branch from 3e8ac4d to 3c01605 Dec 14, 2021
Copy link
Contributor

@MathiasVP MathiasVP left a comment

The code LGTM, I think (see my one comment). A few things need to happen before we can merge this:

  1. We need to run a DCA on the code-scanning suite to make sure the performance on that suite is good with this new query.
  2. We need to add a .qhelp file, which is probably just a copy/paste of the one we have for cpp/overrunning-write.
  3. We need to run the new query on the ~140 projects we usually test on (or even better: on all of LGTM) and make sure that the new query doesn't have too many FPs.
  4. Once that's done, we need to add the ready-for-doc-review label on this PR so that the docs team can review the .qhelp file, the metadata, and stuff like that.

cpp/ql/src/Security/CWE/CWE-120/OverrunWrite.ql Outdated Show resolved Hide resolved
@MathiasVP
Copy link
Contributor

@MathiasVP MathiasVP commented Dec 14, 2021

I'll now create the internal request and start a DCA experiment on this, and then maybe play around with some LGTM databases.

Great!

@redsun82 redsun82 self-assigned this Dec 14, 2021
@redsun82 redsun82 force-pushed the redsun82/cpp-overrunning-write-precision-split branch from 125385b to a7ab402 Dec 14, 2021
Copy link
Contributor

@MathiasVP MathiasVP left a comment

Oops. Forgot to add the rest of the comments. Sorry!

cpp/ql/src/Security/CWE/CWE-120/VeryLikelyOverrunWrite.ql Outdated Show resolved Hide resolved
cpp/change-notes/2021-12-14-overruning-write-split.md Outdated Show resolved Hide resolved
@redsun82 redsun82 force-pushed the redsun82/cpp-overrunning-write-precision-split branch from 7bc26a6 to fa63b76 Dec 14, 2021
@redsun82
Copy link
Contributor Author

@redsun82 redsun82 commented Dec 14, 2021

On vim/vim, I'm getting a couple of false positives on the new query that boils down to to the trailing array idiom:

v = &fc->fixvar[fixvar_idx++].var;
name = v->di_key;
STRCPY(name, "self");

where v is of type

struct dictitem_S
{
    typval_T	di_tv;		// type and value of the variable
    char_u	di_flags;	// DI_FLAGS_ flags (only used for variable)
    char_u	di_key[1];	// key (actually longer!)
};
typedef struct dictitem_S dictitem_T;

and the items are malloced with a bigger size than the size of the item.

Any thoughts? Is this ok, or should we try to detect such a case somehow?

Copy link
Contributor

@geoffw0 geoffw0 left a comment

A few comments, otherwise LGTM.

cpp/config/suites/security/cwe-120 Outdated Show resolved Hide resolved
@MathiasVP
Copy link
Contributor

@MathiasVP MathiasVP commented Dec 14, 2021

Any thoughts? Is this ok, or should we try to detect such a case somehow?

It would be great if we can filter those out. We should have code for this already. Take a look at https://github.com/github/codeql/blob/main/cpp/ql/lib/semmle/code/cpp/commons/Buffer.qll#L15. That predicate is actually already used to filter out some results in getBufferSize (which this new query is actually using). So I'm surprised we're seeing these FPs.

@redsun82
Copy link
Contributor Author

@redsun82 redsun82 commented Dec 14, 2021

on kraflab/dsda-doom I found a bunch of quite interesting FPs. Here's an example: j is bound between characters '0' and '9', so that j - 48 is actually bounded between 0 and 9, so that the formatting "STTNUM%.1d",j-48 actually fits in 8 bytes, but the query estimates the maximum length to 10. There are two other similar FP around that, but interestingly a bunch of other sprintf's around there on the same buffer are correctly not flagged.

@redsun82
Copy link
Contributor Author

@redsun82 redsun82 commented Dec 14, 2021

Any thoughts? Is this ok, or should we try to detect such a case somehow?

It would be great if we can filter those out. We should have code for this already. Take a look at https://github.com/github/codeql/blob/main/cpp/ql/lib/semmle/code/cpp/commons/Buffer.qll#L15. That predicate is actually already used to filter out some results in getBufferSize (which this new query is actually using). So I'm surprised we're seeing these FPs.

Hmm, actually the full code snippet is

// Set l:self to "selfdict".  Use "name" to avoid a warning from
// some compiler that checks the destination size.
v = &fc->fixvar[fixvar_idx++].var;
name = v->di_key;
STRCPY(name, "self");

Might it be that the aliasing they use to avoid compiler warnings is actually stumping our own code detecting this case?

Or maybe it is because that field is not malloced in the same scope?

@redsun82
Copy link
Contributor Author

@redsun82 redsun82 commented Dec 14, 2021

on kraflab/dsda-doom I found a bunch of quite interesting FPs. Here's an example: j is bound between characters '0' and '9', so that j - 48 is actually bounded between 0 and 9, so that the formatting "STTNUM%.1d",j-48 actually fits in 8 bytes, but the query estimates the maximum length to 10. There are two other similar FP around that, but interestingly a bunch of other sprintf's around there on the same buffer are correctly not flagged.

This indeed seems like a bug with range analysis, I've managed to reduce the false positive case to #7396 . Can you have a look @geoffw0 ?

@redsun82 redsun82 closed this Dec 16, 2021
@redsun82 redsun82 reopened this Dec 16, 2021
@redsun82
Copy link
Contributor Author

@redsun82 redsun82 commented Dec 16, 2021

Any thoughts? Is this ok, or should we try to detect such a case somehow?

It would be great if we can filter those out. We should have code for this already. Take a look at https://github.com/github/codeql/blob/main/cpp/ql/lib/semmle/code/cpp/commons/Buffer.qll#L15. That predicate is actually already used to filter out some results in getBufferSize (which this new query is actually using). So I'm surprised we're seeing these FPs.

Hmm, actually the full code snippet is

// Set l:self to "selfdict".  Use "name" to avoid a warning from
// some compiler that checks the destination size.
v = &fc->fixvar[fixvar_idx++].var;
name = v->di_key;
STRCPY(name, "self");

Might it be that the aliasing they use to avoid compiler warnings is actually stumping our own code detecting this case?

Or maybe it is because that field is not malloced in the same scope?

yes, I think that's it. As the copy is not done directly on v->di_key but on the alias name, then memberMayBeVarSize(_, name) does not hold.

Do you think we should enhance memberMayBeVarSize to follow aliases, or should we just accept this kind of FP?

@MathiasVP
Copy link
Contributor

@MathiasVP MathiasVP commented Dec 16, 2021

Do you think we should enhance memberMayBeVarSize to follow aliases, or should we just accept this kind of FP?

I think it's okay to accept this kind of FP for now (since it didn't stop increasing the precision of another query in #6760). But let's collect these FPs in an issue.

@@ -359,6 +376,17 @@ private int lengthInBase10(float f) {
result = f.log10().floor() + 1
}

private BufferWriteEstimationReason getEstimationReasonForIntegralExpression(Expr expr) {
Copy link
Contributor

@MathiasVP MathiasVP Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure expr is sufficiently bound in this predicate, but let's test this with DCA before we make any changes. I'd be much less nervous about this if we added a bindingset[expr] to it.

Copy link
Contributor Author

@redsun82 redsun82 Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is a problem we might see it on performance? Should I add the tuple counting option?

Copy link
Contributor

@MathiasVP MathiasVP Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you can try adding tuple counting. Be sure to update DCA to get the latest updates to that feature.

cpp/ql/lib/semmle/code/cpp/commons/Printf.qll Show resolved Hide resolved
@redsun82 redsun82 force-pushed the redsun82/cpp-overrunning-write-precision-split branch from 9ac2177 to cc8a646 Dec 17, 2021
@intrigus-lgtm
Copy link
Contributor

@intrigus-lgtm intrigus-lgtm commented Dec 20, 2021

I assume that it is not possible to do some sharing of the query help using query help inclusion?

redsun82 added 2 commits Dec 21, 2021
This splits the `cpp/overruning-write` into two separate queries based
off on the reason for the estimation. If the overrun is detected based
on non-trivial range analysis, the results are now marked by the new
`cpp/very-likely-overruning-write` high precision query. If it is based
on less precise, usually type based bounds, then it will still be marked
by `cpp/overruning-write` which remains at medium precision.
redsun82 and others added 12 commits Dec 21, 2021
Rather than testing for `TypeBoundsAnalysis`, we test that the reason is
not `ValueFlowAnalysis` (which is reported by the new
`cpp/very-likely-overruning-write` query), so that if a client has
overridden `BufferWrite::getMaxData` the `NoSpecifiedEstimateReason` is
taken into account.
Also update the help of `cpp/overruning-write`, as the case shown there
will actually not be flagged by that query any more.
Co-authored-by: Mathias Vorreiter Pedersen <mathiasvp@github.com>
This also restrict what we consider "non-trivial" range analysis, as we
now require both ends to be non-trivially bounded for signed integers.
This avoids false positives stemming from a non trivial upper bound but
no meaningful lower bound, for example.
@redsun82 redsun82 force-pushed the redsun82/cpp-overrunning-write-precision-split branch from 1535efc to 4beaf2e Dec 21, 2021
@redsun82
Copy link
Contributor Author

@redsun82 redsun82 commented Dec 21, 2021

I assume that it is not possible to do some sharing of the query help using query help inclusion?

I did not know about this mechanism, I'll have a look, thanks!

redsun82 added 2 commits Dec 21, 2021
There were some false positives where something like

    int x;
    // ...
    sprintf(buff, "%ld", (long)x);

was considered as if the parameter had a non-trivial range analysis only
because the range of `int` is smaller than the range for `long`, without
any non-trivial range analysis actually done on `x`.

These will now be reported by `OverrunWrite` instead.
Also added the relevant CERT C _and_ C++ standard references where they
were missing, and did some minor stylistic tweaks to
`OverrunWriteFloat.qhelp`.
@MathiasVP
Copy link
Contributor

@MathiasVP MathiasVP commented Dec 21, 2021

The DCA experiment failed due to a memory issue on the join-order badness analysis (that's my fault, sorry!)

It's still not fixed, but you can still generate the data locally by doing:

./dca reports --name redsun82/pr-7386-cc8a646__Differences__code-scanning --redo-report readme

when I do that, I get the following table (I've removed a few of the columns for simplicity):

| source                  | predicate                                                                           | b\_result\_size | b\_max\_tuple\_count | b\_metric
| nlohmann\_\_json        | Printf::FormatLiteral::getMaxConvertedLengthAfterLimited\#ffff \(recursive delta\)  | 158             | 225                  | 1.424
| microsoft\_\_calculator | Printf::FormatLiteral::getMaxConvertedLengthAfterLimited\#ffff \(recursive delta\)  | 4               | 6                    | 1.5  
| vim\_\_vim              | m\#Declaration::Declaration::hasGlobalOrStdName\_dispred\#bf                        | 8901            | 14635                | 1.644
| vim\_\_vim              | Printf::FormatLiteral::getMaxConvertedLengthAfterLimited\#ffff \(recursive delta\)  | 1978            | 3711                 | 1.876
| microsoft\_\_ChakraCore | Printf::FormatLiteral::getMaxConvertedLengthAfterLimited\#ffff \(recursive delta\)  | 3323            | 6561                 | 1.974
| dsp-testing\_\_putty    | Printf::FormatLiteral::getMaxConvertedLengthAfterLimited\#ffff \(recursive delta\)  | 1958            | 4341                 | 2.217
| git\_\_git              | Printf::FormatLiteral::getMaxConvertedLengthAfterLimited\#ffff \(recursive delta\)  | 5984            | 14886                | 2.488
| microsoft\_\_calculator | Printf::FormatLiteral::getIntegralDisplayType\_dispred\#bff                         | 6               | 54                   | 3    
| git\_\_git              | Scanf::ScanfFormatLiteral::parseConvSpec\_dispred\#ffffff                           | 2               | 8                    | 4    
| vim\_\_vim              | Scanf::ScanfFormatLiteral::parseConvSpec\_dispred\#ffffff                           | 16              | 64                   | 4    
| dsp-testing\_\_putty    | Scanf::ScanfFormatLiteral::parseConvSpec\_dispred\#ffffff                           | 22              | 88                   | 4    
| vim\_\_vim              | StringAnalysis::AnalysedString::getMaxLength\#ff \(recursive delta\)                | 16058           | 74460                | 4.637
| git\_\_git              | StringAnalysis::AnalysedString::getMaxLength\#ff \(recursive delta\)                | 27983           | 134538               | 4.808
| facebook\_\_yoga        | StringAnalysis::AnalysedString::getMaxLength\#ff \(recursive delta\)                | 241             | 1302                 | 5.402
| abseil\_\_abseil-cpp    | StringAnalysis::AnalysedString::getMaxLength\#ff \(recursive delta\)                | 1855            | 10062                | 5.424
| dsp-testing\_\_putty    | StringAnalysis::AnalysedString::getMaxLength\#ff \(recursive delta\)                | 9202            | 50688                | 5.508
| microsoft\_\_ChakraCore | StringAnalysis::AnalysedString::getMaxLength\#ff \(recursive delta\)                | 23079           | 134778               | 5.84 
| microsoft\_\_calculator | StringAnalysis::AnalysedString::getMaxLength\#ff \(recursive delta\)                | 1849            | 10944                | 5.919
| nlohmann\_\_json        | StringAnalysis::AnalysedString::getMaxLength\#ff \(recursive delta\)                | 31419           | 186444               | 5.934
| an-tao\_\_drogon        | StringAnalysis::AnalysedString::getMaxLength\#ff \(recursive delta\)                | 8410            | 50064                | 5.953
| nlohmann\_\_json        | Printf::FormatLiteral::getConvSpec\_dispred\#fff\#shared\#1                         | 83              | 581                  | 7    
| dsp-testing\_\_putty    | Printf::FormatLiteral::getConvSpec\_dispred\#fff\#shared\#1                         | 1455            | 10185                | 7    
| microsoft\_\_ChakraCore | Printf::FormatLiteral::getConvSpec\_dispred\#fff\#shared\#1                         | 2335            | 16345                | 7    
| git\_\_git              | Printf::FormatLiteral::getConvSpec\_dispred\#fff\#shared\#1                         | 5035            | 35245                | 7    

I think this shows that there might be a potential performance problem that we need to look into before we merge this.

@skedwards88
Copy link
Contributor

@skedwards88 skedwards88 commented Dec 27, 2021

Docs first responder here 👋 I've added this to our review board for a writer to review.

Copy link
Contributor

@skedwards88 skedwards88 left a comment

Looks good from a writer perspective!

<li>Control the size of the buffer using a preprocessor define.</li>
<li>Replace the call to <code>strcpy</code> with <code>strncpy</code>, specifying the define as the maximum length to copy. This will prevent the buffer overflow.</li>
<li>Consider increasing the buffer size, say to 20 characters, so that the message is displayed correctly.</li>
<li>Control the size of the buffer by declaring it with a compile time constant</li>
Copy link
Contributor

@skedwards88 skedwards88 Dec 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit to be consistent with the other list items

Suggested change
<li>Control the size of the buffer by declaring it with a compile time constant</li>
<li>Control the size of the buffer by declaring it with a compile time constant.</li>

"qhelp.dtd">
<qhelp>
<overview>
<p>The program performs a buffer copy or write operation with no upper limit on the size of the copy, and by analysing the bounds of the expressions involved it appears that certain inputs will cause a buffer overflow to occur in this case. In addition to causing program instability, techniques exist which may allow an attacker to use this vulnerability to execute arbitrary code.</p>
Copy link
Contributor

@skedwards88 skedwards88 Dec 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
<p>The program performs a buffer copy or write operation with no upper limit on the size of the copy, and by analysing the bounds of the expressions involved it appears that certain inputs will cause a buffer overflow to occur in this case. In addition to causing program instability, techniques exist which may allow an attacker to use this vulnerability to execute arbitrary code.</p>
<p>The program performs a buffer copy or write operation with no upper limit on the size of the copy. By analyzing the bounds of the expressions involved, it appears that certain inputs will cause a buffer overflow to occur in this case. In addition to causing program instability, techniques exist which may allow an attacker to use this vulnerability to execute arbitrary code.</p>


<p>To fix this issue these changes should be made:</p>
<ul>
<li>Control the size of the buffer by declaring it with a compile time constant</li>
Copy link
Contributor

@skedwards88 skedwards88 Dec 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<li>Control the size of the buffer by declaring it with a compile time constant</li>
<li>Control the size of the buffer by declaring it with a compile time constant.</li>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants