UNDO sizing

90% databases that I see for the first time have the same issue with UNDO tablespace: it’s over sized, yet still causing infamous ORA-1555 errors at times. Here is why.

UNDO is over-sized because of a dumb monitoring which treats UNDO tablespace in the same way as normal datafiles, i.e. the monitoring suggests to have some free space in UNDO. So it notifies DBAs and tells them to increase UNDO datafiles. And most DBAs are doing it straight away without thinking. This is the most common reason why databases end up with ridiculously large UNDO tablespaces.
It shouldn’t be this way. UNDO is a rollback history: there’s some really old history which can be ignored, there’s some history that might be necessary for a long running transaction, delayed block cleanout of flashback query, and finally there’s active transactions rollback history. Unless you have a special test environment, at any single point in time of a database instance nobody knows when and which UNDO data might be required. Oracle provides 3 options to handle this unknown:

  • let Oracle to decide how much undo to keep based on undo_retention and autoextensible undo data files
  • let Oracle to decide how much undo to keep based on the fixed size of UNDO tablespace
  • ask Oracle to guarantee undo_retention setting

Since datafiles are usually autoextensible, option 1 is the most common scenario. There are two possibilities with this option:
1) undo_retention is left default at 900 seconds. This effectively means Oracle doesn’t need to keep a lot of undo history since you didn’t ask for it. And no matter how big your UNDO is, this setup may easily lead to “snapshot too old” errors – Oracle doesn’t need to keep much UNDO, and is happily re-using undo space quickly enough
2) undo_retention is set to relatively high value, like few hours. This should set the minimum undo retention time, and, in theory, should guarantee you enough transactions history in the UNDO to run queries of at least undo_retention duration. However I’ve seen it more than once that setting undo_retention with autoextensible UNDO doesn’t help to avoid ORA-1555 even for queries that are much shorter than undo_retention. I don’t know how to explain this, and I don’t have a test case and I hate Oracle Support

With fixed size UNDO Oracle behaves rather differently and tries to utilize all the allocated UNDO datafiles as much as possible. It ignores undo_retention and just shows you tuned_undoretention that represents how much history you keep in your fixed size UNDO. I think this is the only reliable way to run Oracle DB in production. In fact documentation suggests to use this approach. It’s just not default.

Recently I’ve seen a database with multi terabyte UNDO. Its size was about 20% of all the datafiles in the system. This prompted me to try to write a query which would compute approximate UNDO size required to handle the load.
Here’re the ideas:

  • the minimum UNDO size should be derived from the maximum number of active undo blocks over the run time of the database. It’s most likely not enough, but gives a good starting point to understanding how much space active transactions require in extreme situations
  • the maximum UNDO size should be equal to the maximum of the sum of ‘undo change vector size’ statistic over the sliding window representing target undo_retention with retention guarantee. This value most likely is way more than it is required for a normal database operation.
  • sufficient UNDO size should be somewhere between those two numbers. I think it should be derived empirically starting from some intermediate best guess between min and max; then adjusted if ORA-1555 or inefficient undo space error appear

Here’s the script I’ve come up with:

prompt
prompt UNDO sizing report
prompt ==================
prompt 
prompt Reports current undo set up, and minimum/maximum required undo tablespace size based on historical AWR data
prompt
prompt Note:
prompt   a) if your undo datafiles are not fixed size (all of them), you are doing it wrong most likely
prompt   b) awr_snapshots_count is used as a period to calculate maximum required undo size with guaranteed undo retention
prompt   c) there is no way to calculate sufficient undo size with a good precision and this report is an approximation
prompt 
prompt For feedback mailto:timur.akhmadeev@gmail.com
prompt

col inst_id                         format 99 head 'In|st'
col current_size_mb                 format 999,999,999 head 'Current|undo, MB'
col is_autoextensible               format a4 head 'Auto|ext?'
col undo_retention                  format 999.9 head 'Retention|hours'
col undo_size_min_mb                format 999,999,999 head 'Minimumal|req UNDO, MB'
col undo_size_guarantee_mb          format 999,999,999 head 'Max req|UNDO, MB'
col longest_sql                     format 999,999.9 head 'Longest|SQL, h'
col longest_sql_id                  format a13 head 'Longest|sql_id'
col max_ora1555_cnt                 format 999,999 head 'Max ORA-|1555 cnt'
col max_no_space_cnt                format 999,999 head 'Max no|space cnt'

select
  ua.inst_id,
  ua.current_size_mb,
  ua.is_autoextensible,
  ua.undo_retention/3600 undo_retention,
  um.undo_size_min_mb,
  u.undo_size_guarantee_mb,
  um.longest_sql/3600 longest_sql,
  um.longest_sql_id,
  um.max_ora1555_cnt,
  um.max_no_space_cnt
from
  gv$parameter p,
  ( -- how much undo is required to guarantee undo retention for awr_snapshots_count period
    select 
      inst_id,
      max(required_undo_mb) undo_size_guarantee_mb
    from
    (
      select
        inst_id,
        sum(undo_size) over (partition by inst_id order by begin_interval_time rows &awr_snapshots_count preceding) required_undo_mb
      from
      (     
        select
          s.instance_number inst_id,
          s.begin_interval_time,
          round((ss.value - lag(ss.value) over (partition by s.instance_number order by s.begin_interval_time))/1024/1024) undo_size
        from 
          dba_hist_snapshot s,
          v$database d,
          dba_hist_sysstat ss,
          v$statname n
        where
          s.dbid = d.dbid and
          s.dbid = ss.dbid and
          s.instance_number = ss.instance_number and
          s.snap_id = ss.snap_id and
          ss.stat_id = n.stat_id and
          n.name = 'undo change vector size'
      )
    )
    group by inst_id
  ) u,
  ( -- minimally required undo as max active blocks
    select
      uh.instance_number inst_id,
      max(activeblks * p.value/1024/1024) undo_size_min_mb,
      max(maxquerylen) longest_sql,
      max(maxquerysqlid) keep (dense_rank first order by maxquerylen desc) longest_sql_id,
      max(ssolderrcnt) max_ora1555_cnt,
      max(nospaceerrcnt) max_no_space_cnt
    from
      dba_hist_snapshot s,
      v$database d,
      dba_hist_undostat uh,
      v$parameter p
    where
      s.dbid = d.dbid and
      s.dbid = uh.dbid and
      s.instance_number = uh.instance_number and
      s.snap_id = uh.snap_id and
      p.name = 'db_block_size'
    group by
      uh.instance_number
  ) um,
  ( -- current undo setup
    select
      p.inst_id,
      round(sum(bytes)/1024/1024) current_size_mb,
      max(t.autoextensible) is_autoextensible,
      max(ur.value) undo_retention
    from 
      dba_data_files t, 
      gv$parameter p,
      (select inst_id, value from gv$parameter where name = 'undo_retention') ur
    where
      t.tablespace_name = p.value and
      p.name = 'undo_tablespace' and
      p.inst_id = ur.inst_id
    group by
      p.inst_id
  ) ua
where
  p.inst_id = u.inst_id and
  p.name = 'undo_retention' and
  u.inst_id = ua.inst_id and
  ua.inst_id = um.inst_id
;

Example output:

Enter value for awr_snapshots_count: 5

 In      Current Auto Retention    Minimumal      Max req    Longest Longest       Max ORA-    Max no
 st     undo, MB ext?     hours req UNDO, MB     UNDO, MB     SQL, h sql_id        1555 cnt space cnt
--- ------------ ---- --------- ------------ ------------ ---------- ------------- -------- ---------
  1       65,508 YES         .3        3,678        9,213        1.7 6xxxxxxxxxxxg        0         0
  2       55,508 YES         .3        5,325       26,797        1.7 6xxxxxxxxxxxg        0         0

What does the output mean?

  • this is a 2 node RAC with autoextensible UNDO tablespaces of more than 50G each
  • undo_retention is default 900 seconds (non important rounding error from 0.25)
  • maximum active undo blocks as reported in dba_hist_undostat were few GBs per instance and these numbers are around minimally required UNDO tablespaces
  • maximum UNDO vector size that was generated for 5 consecutive AWR snapshots is 9 and 26 G respectively. Those are required UNDO tablespaces to guarantee undo_retention of 6 hours (hourly snapshots in this DB).
  • few FYI columns: the longest SQL, its sql_id and max number of ORA-1555 and space errors (should be sum perhaps)
  • based on these numbers I think I’d try to set UNDO fixed size about 20G for both nodes. This is smaller than currently allocated UNDO, and it should still be enough to run the DB without errors. This DB didn’t have any undo errors though, so technically, no need to touch it unless it’s broken or there’s not enough space for such UNDO

I tried to use Undo Advisor for this DB and the only result I got was … “Undo tablespace is OK.” 🙂 True! but very limited value.

Another example:

Enter value for awr_snapshots_count: 50

 In      Current Auto Retention    Minimumal      Max req    Longest Longest       Max ORA-    Max no
 st     undo, MB ext?     hours req UNDO, MB     UNDO, MB     SQL, h sql_id        1555 cnt space cnt
--- ------------ ---- --------- ------------ ------------ ---------- ------------- -------- ---------
  1       76,800 NO          .3       20,345      409,742      164.1 fxxxxxxxxxxxq        1         1
  2       76,800 NO          .3       12,027    2,038,377      234.0 axxxxxxxxxxx5        8         0
  3       76,800 NO          .3       18,336    3,829,793       62.8 8xxxxxxxxxxx4        2         0
  4       76,800 NO          .3       12,299      929,484       99.3 gxxxxxxxxxxx7        2         0

In this case, the DB is already using fixed size UNDO. Its size is much lower than maximum required UNDO; still it works fine. The errors were experienced before UNDO was made fixed size and the script doesn’t recognize it.
This example shows how dangerous retention guarantee could be: with relatively short undo_retention estimations for UNDO are in the range of a multiple TBs – and that’s too much I think.

Feedback, comments, corrections – welcome.

3 Comments

  1. Posted February 8, 2018 at 14:31 | Permalink | Reply

    Hello,

    One thing I find is that Oracle auto-tunes the undo so I use a script like

    col undo_minutes for 999,999.00
    col undo_hours for 999,999.00
    col undo_days for 999,999.00
    select begin_time, end_time,
    round(tuned_undoretention/60,2) as undo_minutes,
    round(tuned_undoretention/(60*60),2) as undo_hours,
    round(tuned_undoretention/(60*60*24),2) as undo_days
    from v$undostat
    where begin_time > trunc(sysdate) -1
    order by begin_time
    /

    If these numbers are way above what you need, leave things alone.

    Be very careful when your database is very quiet as you get very full undo tablespaces with 100+ days retention.

    Regards,
    Frank

  2. Posted February 8, 2018 at 14:34 | Permalink | Reply

    Also understand what’s using space and what’s expired

    break on tablespace_name skip 1
    col mb for 999,999,999.00
    compute sum of mb on tablespace_name

    show parameter undo

    select tablespace_name,
    status,
    count(extent_id) “Extent Count”,
    sum(blocks) “Total Blocks”,
    sum(bytes) / 1048576 as mb
    from dba_undo_extents
    group by tablespace_name, status
    order by tablespace_name, status;

    What you should worry about for space purposes is UNEXPIRED undo.

  3. Evgeni
    Posted June 9, 2018 at 00:29 | Permalink | Reply

    Hi,
    Why not to set autoextend on and maxsize=size? When you choice your size depend on the load/behavior and undo_retantion.
    It would give you #4 and in my opinion most successful way to manage undo.

    Regards
    Evgeni

    P.S. Check it out (if not yet): doc.id: 1579779.1?

Leave a comment