Description (likan_999.student)
2012-07-23 23:06:57 UTC
The following piece of code shows a 3x performance regression from 4.6.2 to 4.7.1. The cause appears to be the change in libstdc++. The code is compiled with -O2 -std=c++0x. -O3 doesn't make much difference; with or without the call to reserve doesn't make much difference either. The attachment contains profiling done with google-perftools. It looks like the major cost comes from the rehashing. Is anyone aware of this issue, or am I the first to report it?
#include <unordered_map>
using namespace std;

int main() {
    const int N = 10000000;
    unordered_map<int, int> m;
    m.reserve(2 * N);
    for (int i = 0; i < N; i++) {
        m[i] = i;
    }
}
Timing:
[hidden]$ time ./a-4.6.2.out
real 0m1.029s
user 0m0.787s
sys 0m0.239s
[hidden]$ time ./a-4.7.1.out
real 0m3.364s
user 0m2.596s
sys 0m0.761s
Comment 1 (Paolo Carlini)
2012-07-23 23:08:07 UTC
I wonder, anyway, if the apparent slowdown is just an artifact of the different handling of the load factor in the reworked unordered containers we have been delivering since 4.7.0. I would suggest the submitter experiment a bit with max_load_factor and see whether, when 4.6.x seems faster at insertion time, the actual load factor is higher (in which case searching would also be slower).
Comment 5 (likan_999.student)
2012-07-24 00:17:10 UTC
@Paolo Carlini: can you say more about how to experiment with max_load_factor? As long as I use the same max_load_factor for 4.6 and 4.7, I still see the performance difference (3x slower is the best result):
max_load_factor   gcc-4.6.2   gcc-4.7.1
0.2                  1.600s      7.650s
0.5                  1.125s      4.251s
1.0                  1.004s      3.378s
2.0                  0.914s     20.471s
5.0                  0.917s     29.132s
In some cases 4.6.x handled max_load_factor incorrectly. Thus, the idea isn't to compare 4.6.x to 4.7.x with the same max_load_factor (I don't think that's useful); instead, actually measure load_factor() and see whether, for some values of max_load_factor, the effective load factors differ between 4.6.x and 4.7.x. In any case, measure the actual load factor while the map grows.
Comment 7 (likan_999.student)
2012-07-24 00:42:57 UTC
@Paolo Carlini: the problem is, with max_load_factor anywhere in the range [0.2, 5], the *best* result for 4.7.1 is still 2x slower than the *worst* for 4.6.2.
I printed the average load factor during insertion. It looks like 4.7.1's load factor stays very close to max_load_factor, while 4.6.2's average is about 1/4 of max_load_factor. But what conclusion can we draw from this result?
I just stumbled over this bug while searching for something related to the max load factor (it seems that since 4.7 the hashtable also shrinks when you set the load factor higher, which is unfortunate, at least for me).
I changed the testcase to count the number of rehashes (by checking the bucket count before and after each insert) and found:
gcc4.6 without reserve: 21
gcc4.6 with reserve: 1
gcc4.7 without reserve: 155
gcc4.7 with reserve: 160
Then I had a look in callgrind, and most of the time seems to be spent in rehashing. So I would assume that once the 4.7 version gets the number of rehashes back down to where it was, the running time will come back down too.
I would say the reserve behaviour, though, is definitely broken, as it even increases the number of rehashes. Personally, I would just never have the hashtable shrink on insert and/or when setting the load factor, only on explicit rehash() calls...
I confirm that the reserve method is broken. To correctly handle the size hint that can be given through the hashtable constructor, I set _M_rehash_policy._M_prev_resize to 0 just to avoid the container shrinking before it reaches the targeted size. The same thing should be done in the rehash method, which is called from reserve. I will submit a patch soon.
Regarding shrinking on insert calls: it can easily be avoided by calling reserve normally, so I considered it better to offer a container with symmetric behavior.
I also think I will reconsider the growth factor. In one of my earlier hashtable redesign proposals, the number of empty buckets was a performance problem. With the current implementation it is no longer a problem, so we could grow a little faster.
I can confirm that reserve now works as I would expect (causing no further rehashes). However, the number of rehashes done in the testcase is still 155 (needing 4.5 s), while gcc 4.6 only does 21 (needing 1.5 s).
Monitoring the bucket counts on resizes, it seems that gcc 4.8 is now much more fine-grained than gcc 4.6. I am unsure whether this is expected, deliberate, and/or a good idea.
In my tests, with this patch, 4.7.1 is about 10% slower than 4.6 ... a vast improvement but certainly not parity.
./bench46 1.75s user 0.82s system 99% cpu 2.577 total
./bench47 8.01s user 2.78s system 99% cpu 10.800 total
./bench47+patch 1.95s user 0.80s system 99% cpu 2.764 total
I haven't touched the growth speed for the moment. I prefer to fix reserve's Standard conformance first.
Now I can restore the 4.6 growth speed, as it seems to be about right. Note that being only 3x slower with so many more rehash operations doesn't seem so bad! Of course, growing faster will also increase memory consumption, which is not shown by the simple benchmark proposed here.
I am on vacation, so I don't have the testcase at hand, but it is the same as likan posted in the original bug report, minus the reserve. The main difference is that without reserve I see 21 rehashes in gcc 4.6 and 155 in gcc 4.7. I added code that checks after each insert whether the bucket count has changed; that should be trivial to reproduce without me providing it.
Author: fdumont
Date: Sun Jul 29 16:44:18 2012
New Revision: 189938
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=189938
Log:
2012-07-29 François Dumont <fdumont@gcc.gnu.org>
PR libstdc++/54075
* include/bits/hashtable_policy.h
(_Prime_rehash_policy::_M_next_bkt): Add a growth factor set to 2
to boost growth in the number of buckets.
* testsuite/performance/23_containers/insert/unordered_set.cc: New.
Added:
trunk/libstdc++-v3/testsuite/performance/23_containers/insert/unordered_set.cc
Modified:
trunk/libstdc++-v3/ChangeLog
trunk/libstdc++-v3/include/bits/hashtable_policy.h
Author: fdumont
Date: Sun Jul 29 17:06:21 2012
New Revision: 189941
URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=189941
Log:
2012-07-29 François Dumont <fdumont@gcc.gnu.org>
PR libstdc++/54075
* include/bits/hashtable_policy.h
(_Prime_rehash_policy::_M_next_bkt): Add a growth factor set to 2
to boost growth in the number of buckets.
* testsuite/performance/23_containers/insert/unordered_set.cc: New.
Added:
branches/gcc-4_7-branch/libstdc++-v3/testsuite/performance/23_containers/insert/unordered_set.cc
Modified:
branches/gcc-4_7-branch/libstdc++-v3/ChangeLog
branches/gcc-4_7-branch/libstdc++-v3/include/bits/hashtable_policy.h
I'm afraid this doesn't quite do it.
I still observe a > 60% slowdown going from gcc-4.4 to gcc-4.7, with this fix already in, specifically for insert(). It's a 320,000 member table I am dealing with here.
I can make 4.7 be as fast as 4.4 by preemptively setting the reserve to what I know (for this test) to be the maximum size I need, but measured resident memory shoots up (not unexpected). And resident memory of the 4.7 build was already higher than 4.4 so I don't think this can be the answer here.
Playing with the load factor resulted in a minor speedup with 0.4 (from 1.0), but not reaching 4.4 performance. Other load factors (lower than 0.4 and higher than 1.0) are even slower.
Is there more specific information available about the tradeoff numbers that made GNU pick this new implementation?
I should clarify that I was pointed to this problem (with insert) by profiling and for us nothing pops up as faster (or smaller for that matter). Hence the question.
Indeed, if we have to do something about this, we need those profiles. That's my point. Otherwise, blindly, it could be anything, i.e., not necessarily the rehashes which are the topic of this specific PR.
I am going to take a look at the 4.4 implementation to see if I can explain the difference. But while waiting for that, can you confirm that without the reserve the number of rehashes is similar to what you had with version 4.4?
Regarding resident memory, I can't remember any overhead with the new implementation, but once again, checking the 4.4 implementation will point me to it. Do you see it with the code originally posted, that is to say an unordered_map<int, int>?
For information, the major reason to rework the implementation was Standard conformance: the erase method had to return the iterator following the erased one.
The case I reported is in a large system. I made an isolated test case in which the time shoots up from 32.54 seconds with gcc-4.4 to 42.86 s with gcc-4.7, and back down to 32.27 s when using gcc-4.7 with the tr1 hashtables.
As you can see there also is a reasonably catastrophic increase in minor page faults.
gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1)
extra flags:
meh: 1635550270 4081940 288998848
0:32.54 32.54 real 30.50 user 1.97 sys 99% CPU 0/1394855 faults
text data bss dec hex filename
5456 688 3840032 3846176 3ab020 sdriver
gcc version 4.7.2 (Debian 4.7.2-4)
extra flags:
meh: 1635550270 4081940 288998848
0:42.86 42.86 real 40.21 user 2.56 sys 99% CPU 0/2101681 faults
text data bss dec hex filename
6822 760 3840032 3847614 3ab5be sdriver
gcc version 4.7.2 (Debian 4.7.2-4)
extra flags: -DCRA_USE_TR1
meh: 1635550270 4081940 288998848
0:32.27 32.27 real 30.29 user 1.92 sys 99% CPU 0/1394853 faults
text data bss dec hex filename
5426 736 3840032 3846194 3ab032 sdriver
Profiles for the three cases:
http://www.cons.org/gcc-hash-slowdown/
Source code for test case:
http://www.cons.org/gcc-hash-slowdown/reproduce-slow-hash.tar.gz
Please feel free to copy things over if you want to attach anything to this PR.
My original case, with a worse slowdown, is in a complicated system driven from Lisp. For no other reason than Murphy's law, the profiler there does not penetrate the layers of C++ frames for this run, although it normally does, so I'm not able to break insert() into its parts right now. I have no reason to believe it differs from this test case, probably just a slower hash. I have seen gcc-4.7 + tr1 hashtable beat gcc-4.4 both in the large system and in the test case, so I'm fairly confident we are looking at the same thing here. I'm fixing that profiler; more later.
Hope this helps.
Thanks. François, can you please investigate this issue further? In fact, if the slowdown goes away with a preliminary reserve, it must be possible to handle the problem rather easily, by tweaking the growth policy or something. (We should be a bit patient and remember that, compared to 4.4.x, the implementation is almost completely different.)
I fear that this performance issue is a normal drawback of the major enhancement for PR 41975. Before that change, the hashtable data model was like a std::vector<std::forward_list<>>. During an insert operation it was easy to find out whether the inserted element was already in the target bucket, by simply checking for equality with each element of the bucket's forward_list. The code was something like:
for (auto it = bucket.begin(); it != bucket.end(); ++it)
    if (comp(*it, element))
        // The element already exists.
        return false;
// Ok, no equivalent element.
return true;
After the change, the data model became a combination of a std::forward_list containing all elements and a std::vector<std::forward_list<>::iterator> representing the buckets. To check whether an element is already in a bucket, we still need to compare for equality with each element of the bucket, but the elements of the bucket are harder to identify. There is no longer a per-bucket end, so we must also check that the element we are about to compare is still in the bucket. The code became something like:
// Pre: bucket n is not empty.
auto it = buckets[n];
for (;;)
{
    if (comp(*it, element))
        // The element already exists.
        return false;
    ++it;
    if (it == end() || bucket(it) != n)
        // We are not in bucket n anymore.
        break;
}
// Ok, no equivalent element.
return true;
The new design helps when iterating through the container: even if it is almost empty, we simply iterate within the forward_list. In the previous implementation we had to iterate within a bucket's forward_list and then jump over all the empty buckets, which could be very expensive. Try running testsuite/performance/23_containers/insert_erase/41975.cc; you can easily tweak it to run with the tr1 implementation, and you will notice the difference. This test also shows the insert overhead, even if, with a perfect hasher like the one used in this test, the impact is very limited.
To make bucket(it) not too expensive, the new implementation caches the hash code most of the time, which explains the additional memory. You can ask for it not to be cached, but you might have to qualify your hasher with noexcept; static assertions will tell you so.
So for the moment I can't see a way to restore tr1 insert performance without impacting erase performance. In my opinion, a Standard implementation needs to behave correctly in all use cases, not perform very well in one use case but very badly in another.
Well, I haven't looked into this issue in detail, but it looks like everyone is about the same speed if you put a foo.{reserve or rehash}(n) before the inserts.
Unfortunately, it doesn't look as simple as the new implementation calling for a rehash more often than the old, because it doesn't. So I don't know whether the slowness is because rehash is now a lot slower, or because insert is slower unless there is a huge number of empty buckets.
I'll also note that libc++'s implementation (seen here: http://llvm.org/svn/llvm-project/libcxx/trunk/) appears to achieve the same speed as the old libstdc++ implementation, while appearing to have approximately the same bucket data structure as the new libstdc++ implementation. That says to me that it should be *possible* somehow to make the new version fast.
It seems that this commit doesn't fully fix the issue. If you call rehash multiple times with the same size, the second call to rehash resets _M_prev_resize to a non-zero value in _M_next_bkt(). Here is a sample program that demonstrates this behavior:
#include <stdio.h>
#include <cstddef>
#include <unordered_map>
int main(void) {
    std::unordered_map<int, int> myMap;
    myMap.rehash(4000000);
    myMap.rehash(4000000);
    std::size_t buckets = myMap.bucket_count();
    int i = 0;
    while (i < 2000000000) {
        myMap.insert(std::make_pair(i, 0));
        ++i;
        if (buckets != myMap.bucket_count()) {
            printf("buckets %zu -> %zu\n", buckets, myMap.bucket_count());
            buckets = myMap.bucket_count();
        }
    }
    return 0;
}
(In reply to comment #13)
> Author: fdumont
> Date: Thu Jul 26 12:31:50 2012
> New Revision: 189889
>
> URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=189889
> Log:
> 2012-07-26 François Dumont <fdumont@gcc.gnu.org>
>
> PR libstdc++/54075
> * include/bits/hashtable.h
> (_Hashtable<>::_Hashtable(_InputIterator, _InputIterator,
> size_type, ...): Remove std::max usage to guarantee that hashtable
> state is consistent with hash policy state.
> (_Hashtable<>::rehash): Likewise. Set _M_prev_resize to 0 to avoid
> the hashtable shrinking on next insertion.
> * testsuite/23_containers/unordered_set/modifiers/reserve.cc: New.
> * testsuite/23_containers/unordered_multiset/modifiers/reserve.cc: New.
> * testsuite/23_containers/unordered_map/modifiers/reserve.cc: New.
> * testsuite/23_containers/unordered_multimap/modifiers/reserve.cc: New.
>
> Added:
>
> branches/gcc-4_7-branch/libstdc++-v3/testsuite/23_containers/unordered_map/modifiers/reserve.cc
>
> branches/gcc-4_7-branch/libstdc++-v3/testsuite/23_containers/unordered_multimap/modifiers/reserve.cc
>
> branches/gcc-4_7-branch/libstdc++-v3/testsuite/23_containers/unordered_multiset/modifiers/reserve.cc
>
> branches/gcc-4_7-branch/libstdc++-v3/testsuite/23_containers/unordered_set/modifiers/reserve.cc
> Modified:
> branches/gcc-4_7-branch/libstdc++-v3/ChangeLog
> branches/gcc-4_7-branch/libstdc++-v3/include/bits/hashtable.h
Ok, thanks. I guess, depending on the complexity of the fixes, we can apply some only to mainline first and reconsider the 4_7 branch later. Please do your best to work on both issues: we just entered Stage 3, thus no new features from now on; we are all concentrating on bug fixes until the release.
Here is the patch to fix the redundant rehash/reserve issue.
2012-11-07 François Dumont <fdumont@gcc.gnu.org>
PR libstdc++/54075
* include/bits/hashtable.h (_Hashtable<>::rehash): Reset hash
policy state if no rehash.
* testsuite/23_containers/unordered_set/modifiers/reserve.cc
(test02): New.
I had prepared and tested it in 4.7 branch but I can apply the same on
trunk.
Ok to commit ? If so, where ?
François
On 11/06/2012 10:33 PM, paolo.carlini at oracle dot com wrote:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54075
>
> --- Comment #39 from Paolo Carlini <paolo.carlini at oracle dot com> 2012-11-06 21:33:57 UTC ---
> Ok thanks. I guess depending on the complexity of the fixes we can apply some
> only to mainline first and reconsider the 4_7 branch later. Please do your best
> to work on both issues: we just entered Stage 3 thus no new features from now
> on, we are all concentrated on bug fixes until the release.
>
On 11/08/2012 01:58 AM, Jonathan Wakely wrote:
> On 7 November 2012 22:02, François Dumont wrote:
>> Ok to commit ? If so, where ?
> That patch is OK for trunk and 4.7, thanks.
... I was having a look at this code - patch per se is straightforward
enough and can in any case go in now - and something is puzzling me a
lot: we always compute things, in _M_next_bkt etc, in terms of
__grown_n, thus __n * 2, until the final _M_rehash call.
On the other hand, the old-old code for rehash didn't use
_M_growth_factor in these computations, it just literally enforced the
post-conditions of the Standard. Are we sure we aren't so to speak
rehashing too much? For example, when the load factor is very low and
doesn't count, it looks like a current rehash(n) accomplishes the same
of an old rehash(n * 2)?!? Something seems wrong, can you double check
that? In any case the comments before _M_next_bkt would need fixing.
Thanks,
Paolo.
On 11/08/2012 02:56 AM, Paolo Carlini wrote:
> On the other hand, the old-old code for rehash didn't use
> _M_growth_factor in these computations, it just literally enforced the
> post-conditions of the Standard. Are we sure we aren't so to speak
> rehashing too much? For example, when the load factor is very low and
> doesn't count, it looks like a current rehash(n) accomplishes the same
> of an old rehash(n * 2)?!? Something seems wrong, can you double check
> that? In any case the comments before _M_next_bkt would need fixing.
... in other terms, I really think that the only uses of
_S_growth_factor should return to be inside _M_need_rehash, because
that's the function called by the inserts, when the container must
automatically grow the number of buckets. Elsewhere, like the
constructors, rehash, should not need to know about _S_growth_factor.
Paolo.
Created attachment 28640: performance.patch
Attached patch applied to trunk and 4.7 branch.
2012-11-08 François Dumont <fdumont@gcc.gnu.org>
PR libstdc++/54075
* include/bits/hashtable.h (_Hashtable<>::rehash): Reset hash
policy state if no rehash.
* testsuite/23_containers/unordered_set/modifiers/reserve.cc
(test02): New.
François
On 11/08/2012 01:58 AM, Jonathan Wakely wrote:
> On 7 November 2012 22:02, François Dumont wrote:
>> Ok to commit ? If so, where ?
> That patch is OK for trunk and 4.7, thanks.
>
On 11/08/2012 03:25 AM, Paolo Carlini wrote:
> On 11/08/2012 02:56 AM, Paolo Carlini wrote:
>> On the other hand, the old-old code for rehash didn't use
>> _M_growth_factor in these computations, it just literally enforced
>> the post-conditions of the Standard. Are we sure we aren't so to
>> speak rehashing too much? For example, when the load factor is very
>> low and doesn't count, it looks like a current rehash(n) accomplishes
>> the same of an old rehash(n * 2)?!? Something seems wrong, can you
>> double check that? In any case the comments before _M_next_bkt would
>> need fixing.
> ... in other terms, I really think that the only uses of
> _S_growth_factor should return to be inside _M_need_rehash, because
> that's the function called by the inserts, when the container must
> automatically grow the number of buckets. Elsewhere, like the
> constructors, rehash, should not need to know about _S_growth_factor.
>
> Paolo.
>
I haven't yet considered the following emails, but based on those good remarks I have made the attached patch. Surprisingly, it seems to have a good impact on performance, even though testsuite/performance/23_containers/insert/unordered_set.cc was previously showing that the new implementation was better.
I have also started adapting tests so that it's possible to check unordered performance with the std or std::tr1 implementations. Can I generalize this to other tests? If so, is it a good approach?
François
This performance issue is a result of fixing:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=41975
That fix resulted in many more modulo operations, and thus expensive float divisions.
I plan to commit an alternative hash policy using a power-of-2 number of buckets so that the modulo is trivial. Benchmarks show that performance is better, but still not at the level of the tr1 implementation for the operations you are interested in.
So it is difficult to close this ticket, because the performance regression is still there, but it might stay this way for a long time.
Created attachment 27861: profiling of gcc-4.7.1 using google-perftools
Created attachment 27862: profiling of gcc-4.6.2 using google-perftools