ODA Upgrade

Impressions from the recent upgrade of a Virtualized ODA version 12.2 to 19.9.

This upgrade was expected to be a complex task. There are two possible paths for such an upgrade: re-image, or go through 12.2 -> 18.3 -> 18.8 -> 19.8 -> 19.9.
While re-imaging the ODA is the preferred & recommended way, it requires you to re-create the whole structure of your environment.
If you've never done that, and your setup is complicated & not automated, it's not an easy thing to do either.
We've never done re-imaging here, and have always followed the hard path of the upgrade. This was not an exception.
At each and every step of the upgrade we faced different issues. Some of them were relatively simple & well known; some were as hard as trying to break a concrete wall with your forehead.

Here's a non-exhaustive list of issues we've experienced (some of them we had already faced in prior upgrades).
Space issues. Yes, this is the most common issue during on-calls, and it is absolutely the #1 thing during an ODA upgrade.
You must have free space before trying to patch anything. The exact requirements are in MOS Doc ID 2446194.1. The patching itself will also complain if something is not right.
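This kind of check is worth scripting up front rather than letting the patching fail halfway. A minimal sketch, assuming illustrative mount points and thresholds (the authoritative numbers are in MOS Doc ID 2446194.1):

```shell
# Hedged sketch: fail early if a filesystem has less free space than
# needed before patching. The threshold and mount points are illustrative
# only; see MOS Doc ID 2446194.1 for the real requirements.
check_free_gb() {
  mnt="$1"; need_gb="$2"
  # df -Pk: POSIX output in 1K blocks; column 4 is available space
  avail_kb=$(df -Pk "$mnt" | awk 'NR==2 {print $4}')
  if [ "$((avail_kb / 1024 / 1024))" -lt "$need_gb" ]; then
    echo "WARN: $mnt has less than ${need_gb}G free"
    return 1
  fi
  echo "OK: $mnt"
}

# on a real ODA you'd loop over /, /u01, /opt and the ACFS mounts
check_free_gb / 1
```

Running this against every relevant mount point before each patch step would have saved us at least one failed attempt.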

Backups. You have to have backups of the ODA Base before every version upgrade; obviously you'll need at least one backup of your custom VMs too.

GI patch 29520544 is required before the ODA 18.3 patching. In our case we also needed to apply merge patch 29597701.
If you follow the README of that patch, you will most likely fail.
I found a thread (now inaccessible for some reason) suggesting to apply the patch in non-rolling mode. It helped indeed.

18.3 server patching gave us “ERROR: OAKD on local node is not available.” and restarting oakd was not helping.
Attempts to restart ODA Base via oakcli from Dom0 didn’t go as expected; there were errors like “ERROR: Exception encountered while stopping oda_base [Errno 104] Connection reset by peer” on the stop command; and “Error: Device 51712 (vbd) could not be connected. Hotplug scripts not working.” from the start command.
Rebooting Dom0 and ODA Base from iLOM helped to get going, but patching failed asking to stop all the repos manually.
Stopping repos manually helped to pass this step, but patching failed with free space issues this time.
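For reference, shared repos on a virtualized ODA are stopped per node with oakcli. A sketch that only prints the commands rather than running them; the repo names here are hypothetical, and on a real system you'd take the list from "oakcli show repo" first:

```shell
# Hedged sketch: print the per-node "oakcli stop repo" commands for a
# list of shared repos instead of executing them. Repo names "srepo1"
# and "srepo2" are hypothetical; list the real ones with "oakcli show repo".
repos="srepo1 srepo2"
for repo in $repos; do
  for node in 0 1; do
    echo "oakcli stop repo $repo -node $node"
  done
done
```

Printing first and reviewing the list is safer than stopping repos blind, since stopping a repo takes down every VM it hosts.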
Fixing the free space issues & re-trying again: this time it failed due to the presence of a custom rpm installed on the ODA Base node.
After uninstalling the custom rpm, fixing the yum repo certificates & restarting server patching, it failed with "ERROR: OAKD on local node is not available." again, and "WARNING: Failed to get the good disks to read the partition sizes" in the logs.
We worked around this with the help of MOS Doc ID 2454218.1 and resumed server patching.
Aaaand you guessed it right, patching failed one more time, due to the issue from MOS Doc ID 2502972.1; after some tweaking it finally completed.
The storage patching part of the 18.3 went with no issues.
At the end of this process both ODA Base and Dom0 should have been restarted, but only the first ODA Base node was restarted, so we had to restart everything manually.

18.8 patching initially failed due to running TFA – it should be stopped manually.
Then patching failed with an odd error while trying to rollback some GI patch: “Failed to rollback 28887509 on home …” and “Failed to patch server (grid) component”.
The thing is, the patch had actually been removed; it's just that the GI startup was giving a few errors due to the HAVIP 3/4 config.
Those HAVIPs have been an ongoing issue in the past too. They are kind of "local" and run on nodes 1/2 respectively, but they are set up with an ONLINE target status on both nodes, so it's impossible to start them up without an error.
After many unsuccessful attempts to fix those HAVIPs' target node config, the 18.8 server patching was re-tried.
I suspect that since the GI patch was actually not present, this rollback step was skipped, and patching proceeded to the next breaking point.
After some unusual waiting & checking what was going on, it turned out the opatch calls were hanging due to the presence of a locks directory in the oraInventory (see MOS Doc ID 2385473.1).
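The fix itself is simple: back up and remove the stale locks directory while no opatch sessions are running. A runnable sketch, using a throwaway directory as a stand-in for the real oraInventory path:

```shell
# Hedged sketch: back up & remove a stale "locks" directory from the
# inventory (per MOS Doc ID 2385473.1). A temp directory stands in for
# the real oraInventory on the ODA.
INV=$(mktemp -d)/oraInventory
mkdir -p "$INV/locks"                    # simulate the leftover lock dir

if [ -d "$INV/locks" ]; then
  cp -r "$INV/locks" "$INV/locks.bak"    # keep a backup, just in case
  rm -rf "$INV/locks"
fi
ls "$INV"
```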
After the oraInventory fix it failed with “Unable to update the device OVM to: 3.4.4”, i.e. it was updating Dom0 to a newer OVM version but couldn’t complete.
In the logs there were odd messages like "not all opensm processes are running". opensm is the "InfiniBand subnet manager and administration" daemon, and the patching was trying to stop/start opensmd and then count the number of opensm processes, expecting it to be non-zero.
But that count was zero. If we tried the restart manually, the opensm processes started up correctly. It was only during the patching that something went bonkers.
We spent hours trying to trick the upgrade process into marking the check as passed and proceeding. No luck.
More irritating was the fact that OVM version reported by oakcli was 3.4.4. WTF?
After some trial and error, my colleague noticed that Dom0 was running a different kernel, so the OVM patch was probably half-baked.
Fixing the Dom0 kernel & also tweaking /etc/init.d/opensmd with the proper commands to start up opensm helped to get past the dead point.
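The check the patching appeared to run boils down to "restart the daemon, then count its processes". A runnable approximation, with a plain sleep standing in for the opensm daemon so the sketch runs anywhere; on the ODA it would be a "service opensmd restart" followed by counting opensm processes:

```shell
# Hedged sketch of the patching health check: start the daemon, then
# verify the process count is non-zero. "sleep 300" is a stand-in for
# the real opensm daemon.
sleep 300 &
daemon_pid=$!
# count processes matching the daemon; on the ODA this would count opensm
count=$(ps -p "$daemon_pid" -o pid= | wc -l)
if [ "$count" -gt 0 ]; then
  echo "check passed: $count process(es) running"
else
  echo "ERROR: not all opensm processes are running"
fi
kill "$daemon_pid"
```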
Server patching kind of finished, but the second node still had the old kernel after the reboot, so we had to switch to the new kernel manually & do another reboot.
The repos didn't start up correctly though. Again due to issues with the HAVIP startup, this time because of missing exportfs resources. We added those manually & started the HAVIPs. Interestingly, MOS Doc ID 2593843.1 mentions that HAVIP_3 and HAVIP_4 should be gone as of ODA 18.7.
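Re-creating the missing exportfs resources is done with srvctl. A sketch that only prints the commands; the export names, paths and HAVIP ids here are hypothetical, so verify against "srvctl config havip" and "srvctl config exportfs" before running anything like this:

```shell
# Hedged sketch: print the srvctl commands to re-create missing exportfs
# resources and start the HAVIPs. Names, ids and paths are hypothetical.
cmds=$(cat <<'EOF'
srvctl config exportfs
srvctl add exportfs -name export3 -id havip3 -path /odabase/export3
srvctl add exportfs -name export4 -id havip4 -path /odabase/export4
srvctl start havip -id havip3
srvctl start havip -id havip4
EOF
)
echo "$cmds"
```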
The storage part of the 18.8 patch finished without issues.

For 19.8 the major change is ODA Base upgrade from OEL6 to OEL7. This requires some more pre-checks & fixes.
In our case the blocking issue was that one of the ASM disks was offline, so the prepatch report was complaining: "ERROR: One or more ASM disk are not online. Rolling storage update cannot proceed."
Adding the disk back & the rebalance took a few hours.
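For context, bringing a disk back online is a couple of SQL statements against the ASM instance. A sketch that just prints them; the diskgroup and disk names are hypothetical:

```shell
# Hedged sketch: the SQL (run as the grid user against the ASM instance)
# to find an offline disk, online it, and watch the rebalance. The
# diskgroup "data" and disk "data_11" are hypothetical names.
sql=$(cat <<'EOF'
-- disks that are not online
SELECT group_number, name, mode_status FROM v$asm_disk
 WHERE mode_status <> 'ONLINE';
-- bring the disk back; ASM starts a rebalance automatically
ALTER DISKGROUP data ONLINE DISK data_11;
-- watch the rebalance until this returns no rows
SELECT operation, state, est_minutes FROM v$asm_operation;
EOF
)
echo "$sql"
```

The "few hours" we lost were mostly the rebalance itself, which you can only wait out.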
After that the OS preupgrade checks complained about “ERROR: Unknown loop device *” (snipped part of the name is part of the mount point).
Loop devices are used by the Dom0 for accessing ODA Base disks. We had a custom disk added to the ODA Base for hosting DBOH.
Apparently this is not the best thing to do; however previous ODA upgrades completed fine even with this disk present.
Now to avoid this failure the “-force” option was tried, and the patching continued.
It reached the point saying “Modifying /boot/grub/grub.conf to specify console to ttyS0”, then “Done running redhat-upgrade-tool.”, and ODA base went to reboot.
But it didn’t come up. Checking its console from Dom0 with “xm console oakDom1” gave a very sad message:

Booting 'Oracle Linux Server Unbreakable Enterprise Kernel (4.1.12-124.18.6.e

root (hd0,0)
Filesystem type is ext2fs, partition type 0x83
kernel /vmlinuz-4.1.12-124.18.6.el6uek.x86_64 ro root=LABEL=rootfs tsc=reliable
nohpet nopmtimer hda=noprobe hdb=noprobe ide0=noprobe numa=off console=tty0 co
nsole=ttyS0,115200n8 selinux=0 nohz=off crashkernel=256M@64M loglevel=3 panic=6
0 ipv6.disable=1 pci=noaer,nocrs  transparent_hugepage=never PRODUCT=ORACLE_SER

Error 15: File not found

Press any key to continue...

OEL6 to OEL7 upgrade did something wrong along the way of the grub/kernel updates. ODA Base was dead.
Attempts to get in touch with Support were not very successful. The Support analyst was requesting an iLOM snapshot, and taking it kept failing with a timeout.
Meanwhile I decided to try to take the System disk image of the ODA Base, and mount it on another ODA Base node.
This is possible to do with “xm block-attach oakDom1 file:path /dev/xvdX w” from Dom0, and then mounting appropriate partition inside ODA Base.
Upon /boot partition review we noticed grub.conf is pointing to a non-existing kernel.
At the same time, the new OEL7 kernel is there, with the appropriate initrd image file.
So we manually updated grub.conf to point to the new OEL7 kernel, unmounted the /boot partition, copied the System disk image back, and tried to start up the broken ODA Base.
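The grub.conf surgery can be sketched like this: a runnable example on a throwaway copy, where both kernel version strings are illustrative (the OEL7 one being whatever "ls /boot" on the image actually shows):

```shell
# Hedged, runnable sketch of the grub.conf fix: rewrite the boot entry
# to point at the kernel that actually exists in /boot. We operate on a
# throwaway copy; both kernel versions are illustrative.
tmp=$(mktemp -d)
cat > "$tmp/grub.conf" <<'EOF'
default=0
title Oracle Linux Server Unbreakable Enterprise Kernel
        root (hd0,0)
        kernel /vmlinuz-4.1.12-124.18.6.el6uek.x86_64 ro root=LABEL=rootfs
        initrd /initramfs-4.1.12-124.18.6.el6uek.x86_64.img
EOF
old='4.1.12-124.18.6.el6uek.x86_64'   # kernel grub points at (missing)
new='4.14.35-1902.6.6.el7uek.x86_64'  # kernel actually present in /boot
sed "s/$old/$new/g" "$tmp/grub.conf" > "$tmp/grub.conf.new"
mv "$tmp/grub.conf.new" "$tmp/grub.conf"
grep el7uek "$tmp/grub.conf"
```

On the real mounted /boot you'd of course keep a copy of the original grub.conf before touching it.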
To our surprise startup succeeded on the first attempt.
The bad thing was that everything seemed incompletely patched: the oakcli utility was failing due to missing perl modules.
Apparently the OEL7 upgrade didn't install the packages. They were present in /root/oda-upgrade/iso/oda_bm_19., so we mounted the disk and installed the packages from there.
One ODA Base node upgrade was done; we assumed it had failed because of that custom vdisk.
For the second node's OS upgrade, the custom disk was removed from the ODA Base config. Guess what? It didn't help.
Same issue with the broken boot. Luckily, we already knew how to fix it.
We did the same thing: copying the disk, mounting the boot partition, editing grub.conf, unmounting & copying the System disk back, installing some rpms manually.
OK all of this was just a pre-patching step of the 19.8 patching 🙂
The 19.8 server patching …you guessed it right… failed.
First with a missing permission on oraInventory/locks1: it was owned by root instead of the grid user.
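The ownership fix is trivial. A runnable sketch against a throwaway directory, with the current user standing in for grid:oinstall and a temp path standing in for the real oraInventory:

```shell
# Hedged sketch: fix ownership of the locks1 directory. A temp directory
# and the current user are stand-ins; on the ODA this would be
# "chown grid:oinstall" on the real oraInventory/locks1.
inv=$(mktemp -d)
mkdir -p "$inv/locks1"
chown "$(id -un)" "$inv/locks1"   # on the ODA: chown grid:oinstall locks1
ls -ld "$inv/locks1"
```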
Then it failed due to the issue from MOS Doc ID 2648271.1.
And another failure due to a missing olsnodes script; it was copied from the 18c home (and edited to point to the 19c home) to go on.
And finally server patching was done. There were HAVIP issues of course, and those were fixed again.
The storage part of the 19.8 patch completed, as usual, normally.

We were out of time to proceed with the 19.9 upgrade at this point, so the only thing we did was start everything up and make sure all services worked as expected. Apart from a few errors unrelated to the upgrade, everything was fine.
This was a very long upgrade that took almost 4 days, most of which was troubleshooting work that shouldn't have been necessary.
