Statspack quiz

I’ve a level 5 Statspack report from a real production 11.2.0.2 database with the following Top 5 Timed events section:

Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
log buffer space                                    28       1,396  49870   41.0
buffer busy waits                                   51       1,049  20567   30.8
CPU time                                                       320           9.4
log file parallel write                          4,011         159     40    4.7
log file sync                                    3,275         153     47    4.5

The complete report (without SQL and several other sections) is here. Can you explain what & why is wrong in this Top?
I must admit I don’t have the definitive answer (probably yet), but I have partial explanation of the issue as well as several ideas.

14 Comments

  1. Posted April 4, 2012 at 01:10 | Permalink | Reply

    For some reason, the previous reply posted before I was finished.

    With a DB_CACHE_SIZE of 69.5 GB, plus the other items in the SGA (74.7 GB total), the granule size in the database instance is likely 256 MB, which probably means that the LOG_BUFFER is also close to 256 MB in size, even though someone tried to set the size to 1,163,264 bytes. The long (average 50 seconds each, with on average more than 1 session waiting per second in the 20 minute time interval) LOG BUFFER SPACE wait event probably points to slow disks where one or more of the online redo logs reside, even though the LOG FILE PARALLEL WRITE wait event does not seem to agree with that analysis. Is the redo being shipped to a remote server? Was the nice OS utility used on the Oracle Database server processes (see the comments attached to my article “Faulty Quotes 6 – CPU Utilization”)? There does not seem to be a lot of competition for the CPU.

    You might want to take a look at the following article:
    http://files.e2sn.com/slides/Tanel_Poder_log_file_sync.pdf (page 14)

    A search through the Oracle support site found the following two potentially related, AIX platform specific articles (even though LOG FILE SYNC waits are not your specific problem):
    Doc ID 1318709.1: “AIX: Things To Check When Seeing Long Log File Sync Time in 11.2.”
    Bug 12834808: “LONG WAITS FOR ‘LOG FILE SYNC’ ON 11.2.0.2″

    Is there a reason why the DB_FILE_MULTIBLOCK_READ_COUNT parameter is set to 16? My testing suggests that if the runtime engine selects to use serial direct path read for full scans, the number of blocks read in a single direct path read request is the largest power of 2 that is equal to or smaller than the DB_FILE_MULTIBLOCK_READ_COUNT parameter value. I wonder if manually setting the DB_FILE_MULTIBLOCK_READ_COUNT parameter value is changing some of the execution plans, resulting in greater demand for clean blocks in the buffer cache?

    The good news is that your buffer cache hit ratio is 100% (or maybe that is bad news).

    • Posted April 4, 2012 at 11:52 | Permalink | Reply

      Hi Charles

      thank you for your reply. I’ve edited it to include second successful attempt and deleted.
      Yes indeed granule size of this instance is 256M and redo log buffer is 167M as reported in the end:

                                                              End Size (Bytes)
      SGA regions                      Begin Size (Bytes)       (if different)
      ------------------------------ -------------------- --------------------
      Database Buffers                     75,161,927,680
      Fixed Size                                2,228,232
      Redo Buffers                            175,116,288
      Variable Size                         4,831,842,296
                                     -------------------- --------------------
      sum                                  80,171,114,496

      Is the redo being shipped to a remote server? Was the nice OS utility used on the Oracle Database server processes (see the comments attached to my article “Faulty Quotes 6 – CPU Utilization”)? There does not seem to be a lot of competition for the CPU.

      1) No 2) Most likely no 3) It’s not a busy server

      Bug 12834808: “LONG WAITS FOR ‘LOG FILE SYNC’ ON 11.2.0.2″

      Thanks! it looks very-very close to this hang.
      Also thanks for the link to Tanel’s presentation.

      Is there a reason why the DB_FILE_MULTIBLOCK_READ_COUNT parameter is set to 16? My testing suggests that if the runtime engine selects to use serial direct path read for full scans, the number of blocks read in a single direct path read request is the largest power of 2 that is equal to or smaller than the DB_FILE_MULTIBLOCK_READ_COUNT parameter value. I wonder if manually setting the DB_FILE_MULTIBLOCK_READ_COUNT parameter value is changing some of the execution plans, resulting in greater demand for clean blocks in the buffer cache?

      Yes there is a reason – it’s a required setting for this app.

      The good news is that your buffer cache hit ratio is 100% (or maybe that is bad news).

      🙂 I simply don’t care.

  2. Vyacheslav Rasskazov
    Posted April 4, 2012 at 09:13 | Permalink | Reply

    I can reproduce this by suspending LGWR (oradebug suspend). First dml session waits for log buffer space, second, which updates the same block, waiting for buffer busy waits. And 40 ms for log file parallel write is too slow.

    • Posted April 4, 2012 at 11:56 | Permalink | Reply

      Hi Vyacheslav

      thanks for stopping by.
      I was also thinking about a possibly suspended LGWR, but I don’t think it’s the case since this is a production environment.

      And 40 ms for log file parallel write is too slow.

      Agree.

  3. Posted April 4, 2012 at 15:46 | Permalink | Reply

    Hi,

    my money is on an I/O issue as the root cause for this. E.g. a faulty SAN port.

    Best regards,
    Nikolay

    • Posted April 4, 2012 at 17:50 | Permalink | Reply

      Hi Nikolay

      faulty port or similar hardware error would explain longer IO, but 1) IO-related waits are not significant in the report 2) log buffer is actually larger than total amount of redo written in 20 minutes.

      • Posted April 4, 2012 at 19:44 | Permalink | Reply

        Hi Timur,

        here is my counter-quiz 🙂

        log buffer space                     43,008      28,392    660   20.6 Configurat
        log file sync                       274,137      27,164     99   19.7     Commit
        write complete waits                 19,744      19,044    965   13.8 Configurat
        free buffer waits                   439,225      17,711     40   12.9 Configurat
        buffer busy waits                    77,540      15,623    201   11.3 Concurrenc
        

        Doesn’t look like an I/O issue, now does it?… and yet that’s exactly what it was!

        So maybe there’s an epidemic of such issues or something 🙂

        Best regards,
        Nikolay

        • Posted April 5, 2012 at 21:38 | Permalink | Reply

          Nikolay

          there’s not enough data in your example for me to make any conclusion.
          What kind of IO issue?

          • Posted April 6, 2012 at 16:51 | Permalink | Reply

            Hi Timur,

            it’s true that there’s not enough information in the AWR report to diagnose the issue. I was simply using it to show that appearances can be deceiving, and what looks nothing like an I/O issue still can be one.

            Regarding your question — the issue was a hardware failure (flopping SAN port).

            I’m preparing a detailed post this case it in my blog, should be ready within a few days.

  4. abhishekchaturvedi
    Posted April 11, 2012 at 00:06 | Permalink | Reply

    Hi Timur,
    Here are some interesting facts I could pull from the statstpack report.

    1. 'log file sync' wait happened 3272 times - avg wait time of 47ms
    2. 'log file parallel write' happened 4011 times - avg wait time of 40ms
    3. The value for 'user commits' statistic is 3642. With 3 commits happening every 1 second.
    4. There were 0 new logons in the 20 min snapshot
    5. There were 0 rollbacks issued in the 20 min snapshot
    6. Your DB had just 2.6 average active sessions
    7. Most important - there was NO log switch in the above snapshot
    8. You generated over or around 109Mbytes of redo in those 20 minutes - but no log switch.
    

    Depending on the above metrics, I can say that the application configured to run is doing something nasty with the way it is issuing commits.
    As Charles pointed out first – you have I/O problems but in addition you should also look into those application sessions to find out the way commits are issued.

    Please share the answer.

    -Abhishek

  5. bunditj
    Posted April 12, 2012 at 11:15 | Permalink | Reply

    Hi Timur,

    Just to be curious ongranule size (256 MB), the default value of hidden parameter determining the granule size “_ksmg_granule_size” is 16 MB in AIX. However your case won’t be able to use that, and rather starts with 32 MB onward. How did you get that value based upon this statspack ?

    • Posted April 12, 2012 at 11:33 | Permalink | Reply

      Hi

      I don’t know about restrictions on the granule size for AIX, and according to the MOS DOC 947152.1 based on the SGA of ~80G and Oracle version the default granule size is 256MB, which is indirectly confirmed by large Redo Log buffer – 167M.

Leave a reply to savvinov Cancel reply