I’ve a level 5 Statspack report from a real production 11.2.0.2 database with the following Top 5 Timed events section:
Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
log buffer space                                    28       1,396  49870   41.0
buffer busy waits                                   51       1,049  20567   30.8
CPU time                                                       320           9.4
log file parallel write                          4,011         159     40    4.7
log file sync                                    3,275         153     47    4.5
The complete report (without the SQL and several other sections) is here. Can you explain what is wrong in this Top 5, and why?
I must admit I don’t have the definitive answer (probably yet), but I have partial explanation of the issue as well as several ideas.
14 Comments
For some reason, the previous reply posted before I was finished.
With a DB_CACHE_SIZE of 69.5 GB, plus the other items in the SGA (74.7 GB total), the granule size in the database instance is likely 256 MB, which probably means that the LOG_BUFFER is also close to 256 MB in size, even though someone tried to set the size to 1,163,264 bytes. The long (average 50 seconds each, with on average more than 1 session waiting per second in the 20 minute time interval) LOG BUFFER SPACE wait event probably points to slow disks where one or more of the online redo logs reside, even though the LOG FILE PARALLEL WRITE wait event does not seem to agree with that analysis. Is the redo being shipped to a remote server? Was the nice OS utility used on the Oracle Database server processes (see the comments attached to my article “Faulty Quotes 6 – CPU Utilization”)? There does not seem to be a lot of competition for the CPU.
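The "average 50 seconds each, more than 1 session waiting per second" claim follows directly from the Top 5 numbers; a quick sketch of the arithmetic (the 20-minute snapshot length is taken from the report):

```python
# Sanity-check the "log buffer space" line from the Top 5 section:
# 28 waits totalling 1,396 s within a ~20-minute Statspack interval.
waits = 28
total_wait_s = 1396
snapshot_s = 20 * 60  # 20-minute snapshot interval

avg_wait_s = total_wait_s / waits                 # matches the 49870 ms avg in the report
avg_waiting_sessions = total_wait_s / snapshot_s  # time-weighted count of waiting sessions

print(f"avg wait: {avg_wait_s:.1f} s")            # -> avg wait: 49.9 s
print(f"avg sessions waiting: {avg_waiting_sessions:.2f}")  # -> avg sessions waiting: 1.16
```

So on average more than one session was stuck in `log buffer space` at any moment of the interval, which is what makes this Top 5 so unusual.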
You might want to take a look at the following article:
http://files.e2sn.com/slides/Tanel_Poder_log_file_sync.pdf (page 14)
A search through the Oracle support site found the following two potentially related, AIX platform specific articles (even though LOG FILE SYNC waits are not your specific problem):
Doc ID 1318709.1: “AIX: Things To Check When Seeing Long Log File Sync Time in 11.2.”
Bug 12834808: “LONG WAITS FOR ‘LOG FILE SYNC’ ON 11.2.0.2”
Is there a reason why the DB_FILE_MULTIBLOCK_READ_COUNT parameter is set to 16? My testing suggests that if the runtime engine selects to use serial direct path read for full scans, the number of blocks read in a single direct path read request is the largest power of 2 that is equal to or smaller than the DB_FILE_MULTIBLOCK_READ_COUNT parameter value. I wonder if manually setting the DB_FILE_MULTIBLOCK_READ_COUNT parameter value is changing some of the execution plans, resulting in greater demand for clean blocks in the buffer cache?
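The "largest power of 2" behaviour described above (an observation from my testing, not documented Oracle internals) can be sketched as:

```python
def direct_path_read_blocks(mbrc: int) -> int:
    """Largest power of 2 <= DB_FILE_MULTIBLOCK_READ_COUNT.

    This models the observed serial direct path read behaviour described
    above; it is an empirical observation, not a documented formula.
    """
    return 1 << (mbrc.bit_length() - 1)

print(direct_path_read_blocks(16))   # -> 16 (a power of 2, so used as-is)
print(direct_path_read_blocks(100))  # -> 64 (largest power of 2 below 100)
```

With the parameter set to 16 the two values coincide, so the direct path read size itself is unaffected; the concern is rather the optimizer-side effect of the manual setting.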
The good news is that your buffer cache hit ratio is 100% (or maybe that is bad news).
Hi Charles
thank you for your reply. I’ve edited it to include the second, successful attempt and deleted the duplicate.
Yes, indeed the granule size of this instance is 256 MB, and the redo log buffer is 167 MB, as reported at the end of the report.
1) No 2) Most likely no 3) It’s not a busy server
Thanks! It looks very close to this hang.
Also thanks for the link to Tanel’s presentation.
Yes there is a reason – it’s a required setting for this app.
🙂 I simply don’t care.
Timur,
A recent comment by Kyle Hailey might apply if NFS is used to communicate with the filer for the redo logs and large memory pages are NOT configured:
http://www.freelists.org/post/oracle-l/OT-Blog-entry-on-hugepages,10
Charles
the storage used by this server is not NFS (I believe). Thanks anyway for the link.
I can reproduce this by suspending LGWR (oradebug suspend). The first DML session waits on log buffer space; the second, which updates the same block, waits on buffer busy waits. And 40 ms for log file parallel write is too slow.
Hi Vyacheslav
thanks for stopping by.
I was also thinking about a possibly suspended LGWR, but I don’t think that’s the case, since this is a production environment.
Agree.
Hi,
my money is on an I/O issue as the root cause for this. E.g. a faulty SAN port.
Best regards,
Nikolay
Hi Nikolay
a faulty port or similar hardware error would explain longer I/O, but 1) I/O-related waits are not significant in the report, and 2) the log buffer is actually larger than the total amount of redo written in 20 minutes.
Hi Timur,
here is my counter-quiz 🙂
Doesn’t look like an I/O issue, now does it?… and yet that’s exactly what it was!
So maybe there’s an epidemic of such issues or something 🙂
Best regards,
Nikolay
Nikolay
there’s not enough data in your example for me to make any conclusion.
What kind of IO issue?
Hi Timur,
it’s true that there’s not enough information in the AWR report to diagnose the issue. I was simply using it to show that appearances can be deceiving, and that what looks nothing like an I/O issue can still be one.
Regarding your question — the issue was a hardware failure (a flapping SAN port).
I’m preparing a detailed post about this case on my blog; it should be ready within a few days.
Hi Timur,
Here are some interesting facts I could pull from the Statspack report.
Based on the above metrics, I can say that the application is doing something nasty with the way it issues commits.
As Charles pointed out first — you have I/O problems, but in addition you should also look into those application sessions to find out how commits are being issued.
Please share the answer.
-Abhishek
Hi Timur,
Just curious about the granule size (256 MB): the default value of the hidden parameter determining the granule size, “_ksmg_granule_size”, is 16 MB on AIX. However, your case wouldn’t use that, and rather starts at 32 MB and up. How did you derive that value from this Statspack report?
Hi
I don’t know about restrictions on the granule size for AIX, but according to MOS Doc ID 947152.1, based on the SGA of ~80 GB and the Oracle version, the default granule size is 256 MB, which is indirectly confirmed by the large redo log buffer (167 MB).
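The SGA-size-to-granule mapping from that MOS note can be sketched as follows. The exact cut-offs below are my assumption of the commonly quoted 11.2 thresholds; verify them against Doc ID 947152.1 for your patch level:

```python
def default_granule_mb(sga_gb: float) -> int:
    """Default granule size (MB) by SGA size, per the thresholds commonly
    quoted from MOS Doc ID 947152.1 for 11.2 -- treat the exact cut-offs
    as an assumption and check the note for your version."""
    if sga_gb <= 1:
        return 4
    if sga_gb <= 8:
        return 16
    if sga_gb <= 16:
        return 32
    if sga_gb <= 32:
        return 64
    if sga_gb <= 64:
        return 128
    return 256

print(default_granule_mb(74.7))  # SGA size from the report -> 256
```

An SGA of ~75–80 GB lands in the top bucket, which matches the 256 MB granule (and the 167 MB log buffer padded out within one granule) seen in the report.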