Statspack quiz

I’ve a level 5 Statspack report from a real production 11.2.0.2 database with the following Top 5 Timed events section:

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
log buffer space                                    28       1,396  49870   41.0
buffer busy waits                                   51       1,049  20567   30.8
CPU time                                                       320           9.4
log file parallel write                          4,011         159     40    4.7
log file sync                                    3,275         153     47    4.5

The complete report (without SQL and several other sections) is here. Can you explain what & why is wrong in this Top?
I must admit I don’t have the definitive answer (probably yet), but I have partial explanation of the issue as well as several ideas.

This entry was written by Timur Akhmadeev, posted on April 3, 2012 at 17:25, filed under Oracle, Performance and tagged quiz, Statspack. Bookmark the permalink. Follow any comments here with the RSS feed for this post. Post a comment or leave a trackback: Trackback URL.

14 Comments

Charles Hooper

Posted April 4, 2012 at 01:10 | Permalink | Reply

For some reason, the previous reply posted before I was finished.

With a DB_CACHE_SIZE of 69.5 GB, plus the other items in the SGA (74.7 GB total), the granule size in the database instance is likely 256 MB, which probably means that the LOG_BUFFER is also close to 256 MB in size, even though someone tried to set the size to 1,163,264 bytes. The long (average 50 seconds each, with on average more than 1 session waiting per second in the 20 minute time interval) LOG BUFFER SPACE wait event probably points to slow disks where one or more of the online redo logs reside, even though the LOG FILE PARALLEL WRITE wait event does not seem to agree with that analysis. Is the redo being shipped to a remote server? Was the nice OS utility used on the Oracle Database server processes (see the comments attached to my article “Faulty Quotes 6 – CPU Utilization”)? There does not seem to be a lot of competition for the CPU.

You might want to take a look at the following article:
http://files.e2sn.com/slides/Tanel_Poder_log_file_sync.pdf (page 14)

A search through the Oracle support site found the following two potentially related, AIX platform specific articles (even though LOG FILE SYNC waits are not your specific problem):
Doc ID 1318709.1: “AIX: Things To Check When Seeing Long Log File Sync Time in 11.2.”
Bug 12834808: “LONG WAITS FOR ‘LOG FILE SYNC’ ON 11.2.0.2″

Is there a reason why the DB_FILE_MULTIBLOCK_READ_COUNT parameter is set to 16? My testing suggests that if the runtime engine selects to use serial direct path read for full scans, the number of blocks read in a single direct path read request is the largest power of 2 that is equal to or smaller than the DB_FILE_MULTIBLOCK_READ_COUNT parameter value. I wonder if manually setting the DB_FILE_MULTIBLOCK_READ_COUNT parameter value is changing some of the execution plans, resulting in greater demand for clean blocks in the buffer cache?

The good news is that your buffer cache hit ratio is 100% (or maybe that is bad news).
- Timur Akhmadeev
  
  Posted April 4, 2012 at 11:52 | Permalink | Reply
  
  Hi Charles
  
  thank you for your reply. I’ve edited it to include second successful attempt and deleted.
  Yes indeed granule size of this instance is 256M and redo log buffer is 167M as reported in the end:
```
                                                        End Size (Bytes)
SGA regions                      Begin Size (Bytes)       (if different)
------------------------------ -------------------- --------------------
Database Buffers                     75,161,927,680
Fixed Size                                2,228,232
Redo Buffers                            175,116,288
Variable Size                         4,831,842,296
                               -------------------- --------------------
sum                                  80,171,114,496
```
  Is the redo being shipped to a remote server? Was the nice OS utility used on the Oracle Database server processes (see the comments attached to my article “Faulty Quotes 6 – CPU Utilization”)? There does not seem to be a lot of competition for the CPU.
  
  1) No 2) Most likely no 3) It’s not a busy server
  
  Bug 12834808: “LONG WAITS FOR ‘LOG FILE SYNC’ ON 11.2.0.2″
  
  Thanks! it looks very-very close to this hang.
  Also thanks for the link to Tanel’s presentation.
  
  Is there a reason why the DB_FILE_MULTIBLOCK_READ_COUNT parameter is set to 16? My testing suggests that if the runtime engine selects to use serial direct path read for full scans, the number of blocks read in a single direct path read request is the largest power of 2 that is equal to or smaller than the DB_FILE_MULTIBLOCK_READ_COUNT parameter value. I wonder if manually setting the DB_FILE_MULTIBLOCK_READ_COUNT parameter value is changing some of the execution plans, resulting in greater demand for clean blocks in the buffer cache?
  
  Yes there is a reason – it’s a required setting for this app.
  
  The good news is that your buffer cache hit ratio is 100% (or maybe that is bad news).
  
  🙂 I simply don’t care.
  - Charles Hooper
    
    Posted April 5, 2012 at 00:42 | Permalink | Reply
    
    Timur,
    
    I recent comment by Kyle Hailey might apply if NFS is used to communicate with the filer for the redo logs and large memory pages are NOT configured:
    http://www.freelists.org/post/oracle-l/OT-Blog-entry-on-hugepages,10
    - Timur Akhmadeev
      
      Posted April 5, 2012 at 21:32 | Permalink | Reply
      
      Charles
      
      the storage used by this server is not NFS (I believe). Thanks anyway for the link.
Vyacheslav Rasskazov

Posted April 4, 2012 at 09:13 | Permalink | Reply

I can reproduce this by suspending LGWR (oradebug suspend). First dml session waits for log buffer space, second, which updates the same block, waiting for buffer busy waits. And 40 ms for log file parallel write is too slow.
- Timur Akhmadeev
  
  Posted April 4, 2012 at 11:56 | Permalink | Reply
  
  Hi Vyacheslav
  
  thanks for stopping by.
  I was also thinking about a possibly suspended LGWR, but I don’t think it’s the case since this is a production environment.
  
  And 40 ms for log file parallel write is too slow.
  
  Agree.
savvinov

Posted April 4, 2012 at 15:46 | Permalink | Reply

Hi,

my money is on an I/O issue as the root cause for this. E.g. a faulty SAN port.

Best regards,
Nikolay
- Timur Akhmadeev
  
  Posted April 4, 2012 at 17:50 | Permalink | Reply
  
  Hi Nikolay
  
  faulty port or similar hardware error would explain longer IO, but 1) IO-related waits are not significant in the report 2) log buffer is actually larger than total amount of redo written in 20 minutes.
  - savvinov
    
    Posted April 4, 2012 at 19:44 | Permalink | Reply
    
    Hi Timur,
    
    here is my counter-quiz 🙂
```
log buffer space                     43,008      28,392    660   20.6 Configurat
log file sync                       274,137      27,164     99   19.7     Commit
write complete waits                 19,744      19,044    965   13.8 Configurat
free buffer waits                   439,225      17,711     40   12.9 Configurat
buffer busy waits                    77,540      15,623    201   11.3 Concurrenc
```
    Doesn’t look like an I/O issue, now does it?… and yet that’s exactly what it was!
    
    So maybe there’s an epidemic of such issues or something 🙂
    
    Best regards,
    Nikolay
    - Timur Akhmadeev
      
      Posted April 5, 2012 at 21:38 | Permalink | Reply
      
      Nikolay
      
      there’s not enough data in your example for me to make any conclusion.
      What kind of IO issue?
      - savvinov
        
        Posted April 6, 2012 at 16:51 | Permalink | Reply
        
        Hi Timur,
        
        it’s true that there’s not enough information in the AWR report to diagnose the issue. I was simply using it to show that appearances can be deceiving, and what looks nothing like an I/O issue still can be one.
        
        Regarding your question — the issue was a hardware failure (flopping SAN port).
        
        I’m preparing a detailed post this case it in my blog, should be ready within a few days.

abhishekchaturvedi

Posted April 11, 2012 at 00:06 | Permalink | Reply

Hi Timur,
Here are some interesting facts I could pull from the statstpack report.

1. 'log file sync' wait happened 3272 times - avg wait time of 47ms
2. 'log file parallel write' happened 4011 times - avg wait time of 40ms
3. The value for 'user commits' statistic is 3642. With 3 commits happening every 1 second.
4. There were 0 new logons in the 20 min snapshot
5. There were 0 rollbacks issued in the 20 min snapshot
6. Your DB had just 2.6 average active sessions
7. Most important - there was NO log switch in the above snapshot
8. You generated over or around 109Mbytes of redo in those 20 minutes - but no log switch.

Depending on the above metrics, I can say that the application configured to run is doing something nasty with the way it is issuing commits.
As Charles pointed out first – you have I/O problems but in addition you should also look into those application sessions to find out the way commits are issued.

Please share the answer.

-Abhishek

bunditj

Posted April 12, 2012 at 11:15 | Permalink | Reply

Hi Timur,

Just to be curious ongranule size (256 MB), the default value of hidden parameter determining the granule size “_ksmg_granule_size” is 16 MB in AIX. However your case won’t be able to use that, and rather starts with 32 MB onward. How did you get that value based upon this statspack ?
- Timur Akhmadeev
  
  Posted April 12, 2012 at 11:33 | Permalink | Reply
  
  Hi
  
  I don’t know about restrictions on the granule size for AIX, and according to the MOS DOC 947152.1 based on the SGA of ~80G and Oracle version the default granule size is 256MB, which is indirectly confirmed by large Redo Log buffer – 167M.

Timur Akhmadeev's blog

Statspack quiz

14 Comments

Leave a reply to savvinov Cancel reply

Categories

Tags

Top Posts

Blogroll

Email subscription

Twitter Updates

Timur Akhmadeev's blog

Statspack quiz

Related

14 Comments

Leave a reply to savvinov Cancel reply

Categories

Tags

Top Posts

Blogroll

Email subscription

Twitter Updates