The problem with it is that the changes I've made so far don't add up to a measurable difference to performance in my testing. And I'm not aware of a specific concern in the query log to focus in on there. So I'm left a little unconvinced that making these changes more widely will be worthwhile.
Does it also not show any difference in the tuple counts? On Slack you mentioned having to do an unbind when joining with hasQualifiedName to get good performance (which leaves me with the impression that the current tuple counts when using hasQualifiedName is bad). Is it any better with this = any(StdBasicString s).getAnInstMemberNamed("push_back")orthis.getClassForMember("push_back") instanceof StdBasicString?
@jbj has previously expressed a preference where I've written this = any(StdBasicString s).getAnInstMemberNamed("push_back"), for this.getClassForMember("push_back") instanceof StdBasicString instead. I think that's for performance reasons but I don't currently understand why - and I find it significantly more difficult to read. getClassForMember really means getClassAndAlsoIAmAMemberCalled(name).
I think the two options are equivalent in performance, and I think I prefer the getClassForMember option.
It looks like the class models have the same hasQualifiedName issue that we saw with function models previously, but the function models are now using the class models correctly.
The optimizer is using magic sets and sharing with getAnInstMemberNamed to generate a common predicate for all classes, and then filtering it for each individual class:
EVALUATE NONRECURSIVE RELATION:
SYNTHETIC StdString::StdBasicIStream::getAnInstMemberNamed_dispred#ffb#shared(int arg0, int arg1, cached string arg2) :-
{3} r1 = JOIN Memcpy::MemcpyFunction#class#b#shared AS L WITH Class::Class::getAMember_dispred#fb_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, L.<0>, L.<1>
{3} r2 = JOIN r1 WITH Class::TemplateClass::getAnInstantiation_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, r1.<1>, r1.<2>
return r2
I'd advocate defining it centrally rather than adding a separate rootdef in each class model.
Uses seem to be optimized correctly:
[2021-01-13 21:42:36] (55s) Tuple counts for project#StdString::StdBasicIStream::getAnInstMemberNamed_dispred#ffb#4:
1 ~0% {1} r1 = CONSTANT(unique string)["operator>>"]
0 ~0% {1} r2 = JOIN r1 WITH project#StdString::StdBasicIStream::getAnInstMemberNamed_dispred#ffb AS R ON FIRST 1 OUTPUT R.<1>
return r2
With this definition of getClassForMemberinDeclaration, it seems to have the same linear behavior we have currently, but that might be an issue with my definition:
[2021-01-13 22:30:54] (1s) Tuple counts for StdString::StdIStreamIn#class#b:
6735 ~6% {2} r1 = SCAN project#Instruction::FunctionInstruction#3#ff AS I OUTPUT I.<0>, "operator>>"
0 ~0% {2} r2 = JOIN r1 WITH Declaration::Declaration::getClassForMember_dispred#bff AS R ON FIRST 2 OUTPUT R.<2>, r1.<0>
0 ~0% {5} r3 = JOIN r2 WITH Class::TemplateClass#f AS R ON FIRST 1 OUTPUT "basic_string", "", "std", r2.<0>, r2.<1>
0 ~0% {1} r4 = JOIN r3 WITH QualifiedName::declarationHasQualifiedName#ffff@staged_ext AS R ON FIRST 4 OUTPUT r3.<4>
return r4
I've just pushed a significant change, so that it's now the instantiations of library classes which are modelled, rather than the templates. This gives us much more natural code in the function models, without any weird custom predicates (getAnInstMemberNamed) or excessive use of any. I haven't looked into the impact on performance yet, though it sounds like we hadn't solved that problem yet anyway.
There are a few predicates that are still doing a full scan of the function table. All of them are joining single strings against the function name, but not all such predicates do a full scan. I haven't noticed any other common factors.
[2021-01-14 14:46:39] (0s) Tuple counts for StdString::StdStreamFunction#class#b:
0 ~0% {2} r1 = SELECT Memcpy::MemcpyFunction#class#b#shared AS I ON I.<1> = "ignore" OR I.<1> = "unget" OR I.<1> = "seekg"
0 ~0% {1} r2 = SCAN r1 OUTPUT r1.<0>
0 ~0% {2} r3 = JOIN r2 WITH Declaration::Declaration::getDeclaringType_dispred#bf AS R ON FIRST 1 OUTPUT R.<1>, r2.<0>
0 ~0% {1} r4 = JOIN r3 WITH project#StdString::StdBasicIStream#class#ff AS R ON FIRST 1 OUTPUT r3.<1>
2 ~0% {2} r5 = SELECT Memcpy::MemcpyFunction#class#b#shared AS I ON I.<1> = "seekp" OR I.<1> = "flush"
2 ~0% {1} r6 = SCAN r5 OUTPUT r5.<0>
1 ~0% {2} r7 = JOIN r6 WITH Declaration::Declaration::getDeclaringType_dispred#bf AS R ON FIRST 1 OUTPUT R.<1>, r6.<0>
1 ~0% {1} r8 = JOIN r7 WITH project#StdString::StdBasicOStream#class#ff AS R ON FIRST 1 OUTPUT r7.<1>
1 ~0% {1} r9 = r4 \/ r8
6735 ~4% {2} r10 = SCAN project#Instruction::FunctionInstruction#3#ff AS I OUTPUT I.<0>, "copyfmt"
0 ~0% {1} r11 = JOIN r10 WITH Declaration::Declaration::hasName_dispred#bf AS R ON FIRST 2 OUTPUT r10.<0>
0 ~0% {2} r12 = JOIN r11 WITH Declaration::Declaration::getDeclaringType_dispred#bf AS R ON FIRST 1 OUTPUT R.<1>, r11.<0>
0 ~0% {2} r13 = JOIN r12 WITH project#Class::TemplateClass::getAnInstantiation_dispred#ff AS R ON FIRST 1 OUTPUT r12.<0>, r12.<1>
0 ~0% {5} r14 = JOIN r13 WITH Class::TemplateClass::getAnInstantiation_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT "basic_ios", "", "std", R.<1>, r13.<1>
0 ~0% {1} r15 = JOIN r14 WITH QualifiedName::declarationHasQualifiedName#ffff@staged_ext AS R ON FIRST 4 OUTPUT r14.<4>
1 ~0% {1} r16 = r9 \/ r15
return r16
There's a few cases in hasTaintFlow and hasDataFlow that have similar issues (Including a regression of the operator>> example from the previous commit):
6735 ~0% {4} r56 = JOIN r16 WITH project#Instruction::FunctionInstruction#3#ff AS R CARTESIAN PRODUCT OUTPUT R.<0>, "operator>>", r16.<0>, r16.<1>
2 ~0% {3} r57 = JOIN r56 WITH Declaration::Declaration::hasName_dispred#bf AS R ON FIRST 2 OUTPUT r56.<0>, r56.<2>, r56.<3>
0 ~0% {4} r58 = JOIN r57 WITH Declaration::Declaration::getDeclaringType_dispred#bf AS R ON FIRST 1 OUTPUT R.<1>, r57.<1>, r57.<2>, r57.<0>
0 ~0% {3} r59 = JOIN r58 WITH project#StdString::StdBasicIStream#class#ff AS R ON FIRST 1 OUTPUT r58.<3>, r58.<1>, r58.<2>
And some cases that look scary, but are just using the Cartesian product to join FunctionInput and FunctionOutput with the specific function or class:
77 ~0% {3} r261 = JOIN DataFlow::DataFlowFunction::hasDataFlow_dispred#fff#shared AS L WITH project#StdString::StdBasicString#class#ff AS R CARTESIAN PRODUCT OUTPUT R.<0>, L.<0>, L.<1>
263 ~3% {4} r262 = JOIN r261 WITH Declaration::Declaration::getDeclaringType_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1>, "push_back", r261.<1>, r261.<2>
1 ~0% {3} r263 = JOIN r262 WITH Declaration::Declaration::hasName_dispred#bb AS R ON FIRST 2 OUTPUT r262.<0>, r262.<2>, r262.<3>
Both "copyfmt" and "operator>>" are cases where the model only models one function, so the [] syntax is not used for these strings. I remember previously finding that [] causes the optimizer to do things differently and avoids the problem. I don't remember a lot more than that and I don't think I found a good solution (short of ["copyfmt", "something-that-is-not-a-legal-function-name"]).
I've just pushed five commits containing two performance improvements, designed by myself and @jbj:
(1) pragma[nomagic]onhasTaintFlow and hasDataFlow. This is a barrier to prevent the implementation details inside taint models from being affected by the context in which hasTaintFlow / hasDataFlow is used. In particular, in context we have information about FunctionCalls as well as Functions that are being modelled, allowing evaluation to start from the Call rather than the Function - which turns out to be a poor strategy. This is now prevented.
(2) even with the first change we still get a mediocre evaluation strategy, where we narrow down to all of the functions in the modelled class before joining on name. This is the 'wrong' order, producing a fair number of rows we don't need. The getClassAndName predicate (a concept previously referred to as getClassForMember) blocks this strategy, though it does create a bit of work up front (@jbj we didn't discuss the up front cost of computing the new predicate; it seems like it will be bigger than the cost of doing the 'wrong' ordering on one model, but it isn't per-model, which means we don't need to worry about worst cases and future expansion so much).
Notes:
the first change (in two places) applies to everything, whereas the second change only works for models where the change has been made in their code. We may forget some models now or in the future, which is less than ideal, but with the first change in place performance should be acceptable.
the pragma[nomagic]s appear to be necessary for things to work as intended.
Looking at StdStringCStr::StdStringCStr ("c_str") Before the latest batch of changes:
[2021-01-18 18:01:24] (8s) Tuple counts for StdString::StdStringCStr#class#b/1@52e292:
1953 ~1% {2} r1 = SCAN m#FormattingFunction::FormattingFunction::getFirstFormatArgumentIndex_dispred#bf AS I OUTPUT I.<0> 'this', "c_str"
4 ~0% {1} r2 = JOIN r1 WITH Declaration::Declaration::hasName_dispred#bb AS R ON FIRST 2 OUTPUT r1.<0> 'this'
4 ~0% {1} r3 = STREAM DEDUP r2
4 ~0% {2} r4 = JOIN r3 WITH Declaration::Declaration::getDeclaringType_dispred#ff AS R ON FIRST 1 OUTPUT R.<1>, r3.<0> 'this'
2 ~0% {1} r5 = JOIN r4 WITH project#StdString::StdBasicString#class#ff AS R ON FIRST 1 OUTPUT r4.<1> 'this'
return r5
[2021-01-18 18:01:24] (8s) Tuple counts for Taint::TaintFunction::hasTaintFlow_dispred#bbf/3@e4424b:
...
1 ~0% {2} r44 = JOIN construct<TFunctionInput,2>@dom#FunctionInputsAndOutputs::TInQualifierObject#0#f AS L WITH construct<TFunctionOutput,3>@dom#FunctionInputsAndOutputs::TOutReturnValueDeref#0#f AS R CARTESIAN PRODUCT OUTPUT L.<0> 'input', R.<0> 'output'
1953 ~3% {3} r45 = JOIN r44 WITH project#Call::Call::getTarget_dispred#ff AS R CARTESIAN PRODUCT OUTPUT R.<0> 'this', r44.<0> 'input', r44.<1> 'output'
2 ~0% {3} r46 = JOIN r45 WITH StdString::StdStringCStr#class#b AS R ON FIRST 1 OUTPUT r45.<0> 'this', r45.<1> 'input', r45.<2> 'output'
With the first commit only (pragma[nomagic]onhasTaintFlow / hasDataFlow):
StdString::StdStringCStr
[not explicitly evaluated]
[2021-01-18 18:11:11] (8s) Tuple counts for Taint::TaintFunction::hasTaintFlow_dispred#fff/3@b9de32:
...
1 ~0% {2} r72 = JOIN construct<TFunctionInput,2>@dom#FunctionInputsAndOutputs::TInQualifierObject#0#f AS L WITH construct<TFunctionOutput,3>@dom#FunctionInputsAndOutputs::TOutReturnValueDeref#0#f AS R CARTESIAN PRODUCT OUTPUT L.<0> 'input', R.<0> 'output'
...
51 ~0% {3} r317 = JOIN r72 WITH project#StdString::StdBasicString#class#ff AS R CARTESIAN PRODUCT OUTPUT R.<0>, r72.<0> 'input', r72.<1> 'output'
216 ~0% {4} r318 = JOIN r317 WITH Declaration::Declaration::getDeclaringType_dispred#ff_10#join_rhs AS R ON FIRST 1 OUTPUT R.<1> 'this', "c_str", r317.<1> 'input', r317.<2> 'output'
2 ~0% {3} r319 = JOIN r318 WITH Declaration::Declaration::hasName_dispred#bb AS R ON FIRST 2 OUTPUT r318.<0> 'this', r318.<2> 'input', r318.<3> 'output'
and after:
StdString::StdStringCStr
[not explicitly evaluated]
[2021-01-18 17:59:01] (8s) Tuple counts for Taint::TaintFunction::hasTaintFlow_dispred#fff/3@896546:
...
1 ~0% {2} r72 = JOIN construct<TFunctionInput,2>@dom#FunctionInputsAndOutputs::TInQualifierObject#0#f AS L WITH construct<TFunctionOutput,3>@dom#FunctionInputsAndOutputs::TOutReturnValueDeref#0#f AS R CARTESIAN PRODUCT OUTPUT L.<0> 'input', R.<0> 'output'
...
51 ~4% {4} r228 = JOIN r72 WITH project#StdString::StdBasicString#class#ff AS R CARTESIAN PRODUCT OUTPUT "c_str", R.<0>, r72.<0> 'input', r72.<1> 'output'
2 ~0% {3} r229 = JOIN r228 WITH Function::Function::getClassAndName#fff_120#join_rhs AS R ON FIRST 2 OUTPUT R.<2> 'this', r228.<2> 'input', r228.<3> 'output'
There are similar results in DataFlow, e.g. for StdIStreamIn::StdIStreamIn ("operator>>").