I had been planning to write about this for quite some time, but never got around to it. Sorry about that. Today I finally found the time and the desire to describe the situation from the title. It was observed on Oracle 9.2.0.8 running on Solaris SPARC, and it manifested itself as severe CPU burning at 100% utilization, with ‘latch free’ in first place among the Top 5 wait events in the Statspack report.
Here is part of a 1-hour report:
Load Profile
~~~~~~~~~~~~                            Per Second       Per Transaction
                                   ---------------       ---------------
                  Redo size:            -72,280.51            -18,601.83
              Logical reads:            411,881.06            106,000.09
              Block changes:              2,515.53                647.39
             Physical reads:                665.17                171.18
            Physical writes:                218.18                 56.15
                 User calls:                394.87                101.62
                     Parses:                 79.46                 20.45
                Hard parses:                  0.94                  0.24
                      Sorts:                885.81                227.97
                     Logons:                  0.05                  0.01
                   Executes:              1,160.92                298.77
               Transactions:                  3.89

  % Blocks changed per Read:    0.61    Recursive Call %:   75.88
 Rollback per transaction %:    5.17       Rows per Sort:    5.95

Instance Efficiency Percentages (Target 100%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            Buffer Nowait %:   99.98       Redo NoWait %:  100.00
            Buffer  Hit   %:   99.84    In-memory Sort %:  100.00
            Library Hit   %:   99.93        Soft Parse %:   98.82
         Execute to Parse %:   93.16         Latch Hit %:   89.32
Parse CPU to Parse Elapsd %:   84.20     % Non-Parse CPU:   99.57

 Shared Pool Statistics        Begin   End
                               ------  ------
             Memory Usage %:   90.41   91.42
    % SQL with executions>1:   31.20   35.28
  % Memory for SQL w/exec>1:   30.16   30.05

Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                     % Total
Event                                               Waits    Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
latch free                                      2,529,903      49,710    48.22
CPU time                                                        31,694    30.75
db file sequential read                         1,228,775      13,164    12.77
buffer busy waits                                 269,229       3,076     2.98
enqueue                                             2,276       2,567     2.49
And this is the Latch Sleep breakdown:
                                      Get                            Spin &
Latch Name                       Requests      Misses      Sleeps Sleeps 1->4
-------------------------- -------------- ----------- ----------- ------------
cache buffers chains        4,190,999,464 ###########   2,524,798 467719813/2075503/172064/31324/0
library cache                  27,510,901      44,028         931 43116/895/15/2/0
row cache enqueue latch        10,692,790      27,738          16 27722/16/0/0/0
cache buffer handles            6,478,263       7,240          66 7175/64/1/0/0
shared pool                     2,657,115       5,706         412 5300/400/6/0/0
...
As you can see, the ‘latch free’ waits are almost entirely attributable to gets on the infamous ‘cache buffers chains’ latch. You can easily find information on what this latch is, why contention on it can happen, and how to diagnose it. Here I’ll just show you what I found by accident (simply by staring at the report). I noticed that two queries – #2 and #3 in the SQL ordered by Gets section – which are nothing more than a unique index access to a table, had ~16K gets per execution:
                                                     CPU      Elapsd
  Buffer Gets    Executions  Gets per Exec  %Total Time (s)  Time (s) Hash Value
--------------- ------------ -------------- ------ -------- --------- ----------
    333,759,439       21,538       15,496.3   22.4  7028.63  24546.46 4091048759
Module: JDBC Thin Client
SELECT 1 FROM T WHERE ID = :B1 FOR UPDATE

    259,164,764       16,137       16,060.3   17.4  5391.01  18934.74  307775861
Module: JDBC Thin Client
select ID from T where ID=:1
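As an aside, Statspack is not the only way to spot such statements; a query against v$sql gives the same picture on a live system. This is just a minimal sketch, and the 1,000 gets-per-execution threshold is an arbitrary value picked for illustration:

-- statements with a suspiciously high number of buffer gets per execution
select hash_value,
       executions,
       buffer_gets,
       round(buffer_gets / executions, 1) gets_per_exec,
       substr(sql_text, 1, 60) sql_text
  from v$sql
 where executions > 0
   and buffer_gets / executions > 1000
 order by buffer_gets desc;

For a single-row lookup through a unique index you would normally expect gets per execution in the low single digits, so ~16K is clearly anomalous.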
How could that possibly happen? At first I thought the index used by the queries was missing, but that was not the case. My second thought was that the index had exploded due to modifications (I found one query that had probably inserted lots of rows into the table, although rows_processed for that statement was zero) and its height had increased, causing other sessions to do much more work when accessing both index and table blocks. That was just an idea which I didn’t know how to confirm or dismiss, so I asked the wise guys, and they were kind enough to point out that an index height increase is not required to produce a spike in CR gets for concurrent SQL access. All that is needed is an uncommitted transaction that inserts rows into the table somewhere near the normally accessed data. That sounded reasonable to me and I was OK with this explanation. I did confirm that the long-running INSERT statement had indeed been trying to add millions of rows to the table due to an application bug (a missing condition), but it failed to do so because the transaction timeout on WebLogic was set to half an hour. So the statement was actually rolled back, and that’s why rows_processed for the query was zero.
But when I tried to reproduce this situation to prepare a test case for this blog entry, I was not able to do so. A simple indexed access to a table with an in-flight transaction inserting a huge amount of data is definitely not enough to reproduce such a situation. So it must have been one of those scenarios in which a (table or index) block has been modified many times, and subsequent queries had to apply undo to reverse the changes made to the block by other transactions. It’s very easy to set up, so I won’t walk through it in detail here (a minimal sketch follows the table below). Also, I think it’s impossible to identify which block caused the issue with just a standard Statspack report in hand. Anyway, here is a statistics comparison (a normal interval versus two problem intervals, R1 and R2) that shows undo access was the cause of the increased LIO and the ‘cache buffers chains’ latch contention:
Statname                                                   | Normal      | R1            | R2            | Comments
-----------------------------------------------------------|-------------|---------------|---------------|------------------------------------------
branch node splits                                         | 29          | 329           | 421           |
consistent changes                                         | 44,192,320  | 779,500,412   | 1,269,485,409 | number of times undo was applied to construct a consistent block version
consistent gets                                            | 209,640,205 | 1,035,154,852 | 1,466,065,189 |
consistent gets - examination                              | 142,031,460 | 889,574,780   | 1,366,583,799 | number of times a block (such as an undo or index root block) was accessed under protection of the 'cache buffers chains' latch
data blocks consistent reads - undo records applied        | 43,973,030  | 779,361,353   | 1,269,187,599 | number of undo records applied while constructing CR block copies
leaf node splits                                           | 6,200       | 90,367        | 113,873       |
transaction tables consistent read - undo records applied  | 1,824       | 80,503        | 4,863         |
user calls                                                 | 6,526,934   | 2,277,792     | 1,429,810     |
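As promised above, here is a minimal sketch of the kind of setup that produces this effect. It is only an illustration with made-up object names, not the actual application schema; run the two parts in separate sessions:

-- session 1: modify the same block many times in a single open transaction
create table t_demo (id number primary key, val number);
insert into t_demo values (1, 0);
commit;

begin
  for i in 1 .. 10000 loop
    update t_demo set val = i where id = 1;
  end loop;
end;
/
-- note: no commit here

-- session 2: every consistent read of that block now has to roll back
-- all 10,000 uncommitted changes, so even a single-row lookup burns
-- thousands of consistent gets and 'cache buffers chains' latch gets
select n.name, s.value
  from v$mystat s, v$statname n
 where s.statistic# = n.statistic#
   and n.name in ('consistent gets',
                  'consistent changes',
                  'data blocks consistent reads - undo records applied');

select val from t_demo where id = 1;

-- re-run the statistics query and compare the deltas

In session 2 the deltas of ‘consistent changes’ and ‘data blocks consistent reads - undo records applied’ should jump by roughly the number of uncommitted changes per execution of the lookup, which is exactly the pattern visible in the table above.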
6 Comments
Nice example, with an explanation from Jonathan Lewis of a similar issue (but due to transaction table rollbacks caused by block cleanout).
Good post, Timur! Thanks for updating your blog 🙂
Timur,
The very large number of leaf node splits is a clue.
I did a little test as follows and, with just 10,000 rows inserted, got a result that seems to be heading in the right direction:
Jonathan,
thank you for the test case. Indeed, I was able to reproduce something very similar to your setup even with a single insert statement, like this (I used a specific value for the START WITH of the sequence to simulate a real-life index):
So it’s almost 500 gets…
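The exact script isn’t reproduced here, but roughly the setup looks like this (names and row counts are illustrative, not the original values):

create sequence s_demo start with 1000000;   -- non-trivial START WITH to mimic a real-life index
create table t_demo2 (id number);
create unique index t_demo2_uq on t_demo2 (id);

-- a single, uncommitted insert driven by the sequence
insert into t_demo2
select s_demo.nextval from all_objects where rownum <= 10000;
-- no commit

-- in another session, a unique-index lookup now has to undo the
-- uncommitted changes in the index blocks it visits, so a "cheap"
-- single-row probe ends up costing hundreds of consistent gets
select id from t_demo2 where id = 1000001;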
Timur,
Here’s a little detail that I haven’t followed up.
I put the “set transaction read only” into the code because I thought this would make it easier to run and see the excessive logical I/O. In fact the number of consistent gets is much higher if I don’t have that call in place.
Timur,
Correction – the stats I’ve quoted came from a predicate “id = 9999”.
For “id = 1” I got 245 session logical reads.
I should have said I was running 10.2.0.3 at the time.