Friday, March 4, 2011

Block Corruption after Hot backup

In Oracle E-Business Suite 11i/R12, we occasionally get block corruption errors after cloning from a hot backup or after activating a standby. Re-running the backup or re-cloning may solve the issue, but when we cannot afford to redo the activity we have to find the root cause and fix the cloned instance itself. I came across exactly this issue and found the root cause of the block corruption; in this post we will see how to resolve it in the cloned instance.

After cloning, we got this error while accessing the login page:

500 Internal Server Error
java.lang.NoClassDefFoundError
at oracle.apps.fnd.sso.AppsLoginRedirect.doGet(AppsLoginRedirect.java:623)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:743)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:856)
at com.evermind[Oracle Containers for J2EE 10g (10.1.3.0.0) ].server.http.ResourceFilterChain.doFilter(ResourceFilterChain.java:64)
at oracle.apps.jtf.base.session.ReleaseResFilter.doFilter(ReleaseResFilter.java:26)


To our surprise, we noticed the error below in the database alert log:


Errors in file <oracle_home>/admin/<context_name>/diag/rdbms/<db_name>/<sid>/trace/<sid>_ora_30460.trc (incident=193):
ORA-01578: ORACLE data block corrupted (file # <file id>, block # <block id>)
ORA-01110: data file 401: '<data_top>/a_txn_data02.dbf'
ORA-26040: Data block was loaded using the NOLOGGING option

Incident details in: <oracle_home>/admin/<context_name>/diag/rdbms/<db_name>/<sid>/incident/incdir_193/<sid>_ora_30460_i193.trc


The incident trace file shows that a SQL statement executed from the package APPS.FND_SECURITY_PKG hit this block corruption.

Running dbv against the affected datafile confirmed that a few blocks were corrupted. We then mapped the corrupted block to its segment using the query below and found that the object with the corrupted block after the hot-backup clone was APPLSYS.WF_LOCAL_USER_ROLES.

select owner, segment_name, segment_type
from dba_extents
where file_id = <file id>
and <block id> between block_id and block_id + blocks - 1;

<file id>  : file id from the ORA-01578 error message
<block id> : block id from the ORA-01578 error message
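
For reference, the dbv run itself is a one-liner against the affected datafile; the file name below is the one from the ORA-01110 message above and the 8K block size is an assumption, so adjust both to your environment:

$ dbv file=<data_top>/a_txn_data02.dbf blocksize=8192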

NOLOGGING may be used in a concurrent program to improve performance. In most cases the objects affected by NOLOGGING are dropped by the end of the Concurrent Manager job, and no further action is needed - dropping the objects cleans up the invalidated blocks. But if the hot backup was taken before the job dropped those temporary objects, we hit the error above in the clone.

The concurrent job that caused the problem in our case was "Synchronize WF LOCAL tables". It performed NOLOGGING operations on the APPLSYS.WF_LOCAL_USER_ROLES object and invalidated those blocks in the restored database. To fix this we have to re-run "Synchronize WF LOCAL tables" in the restored database using the PL/SQL block below:

sqlplus apps/<apps pwd>

set serveroutput on size 100000;
declare
begin
WF_LOCAL_SYNCH.BULKSYNCHRONIZATION(
P_ORIG_SYSTEM => 'ALL',
P_PARALLEL_PROCESSES => null,
P_LOGGING => null,
P_RAISEERRORS => TRUE);
exception
when others then
dbms_output.put_line('sqlerrm = ' || sqlerrm);
end;
/

We also found that even if we run "Synchronize WF LOCAL tables" with logging=YES, partitions of the following tables are switched to logging=NO:
- WF_LOCAL_ROLES
- WF_LOCAL_USER_ROLES
- WF_USER_ROLE_ASSIGNMENTS

The cause of this issue is the stage table. For example, if the WF_LOCAL_USER_ROLES_STAGE table is set to logging=NO and the concurrent request "Synchronize WF LOCAL tables" is run with logging=YES, the WF_LOCAL_USER_ROLES object will still end up with logging=NO. To fix this we have to manually set logging to YES on all the stage tables by following these steps.

1) For any partition that is set to NOLOGGING, manually alter it to LOGGING:

1.1) SQL to run to retrieve partitions that are set to NOLOGGING:

select table_name, partition_name
from dba_tab_partitions
where table_name in
('WF_LOCAL_ROLES' ,
'WF_LOCAL_USER_ROLES' ,
'WF_USER_ROLE_ASSIGNMENTS')
and logging = 'NO' ;


1.2) switch all partitions retrieved in step 1.1 to LOGGING = 'YES' using
the following syntax:

alter table applsys.<table_name> modify partition <partition_name> logging;

example:
alter table applsys.WF_LOCAL_USER_ROLES modify partition ENG_LIST logging;


2) For any stage table that is set to NOLOGGING, manually alter them to LOGGING.

2.1) SQL to run to retrieve stage tables set to NOLOGGING:

select table_name
from dba_tables
where table_name in
('WF_LOCAL_ROLES_STAGE' ,
'WF_LOCAL_USER_ROLES_STAGE' ,
'WF_UR_ASSIGNMENTS_STAGE')
and logging = 'NO' ;

2.2) Switch all the tables retrieved by statement above to LOGGING = 'YES' using the
following syntax:

alter table applsys.<stage_table_name> logging;

example:
alter table applsys.WF_LOCAL_ROLES_STAGE logging;

3) Then run the "Synchronize WF LOCAL tables" concurrent program with logging mode = 'LOGGING'.

4) After completion, monitor the tables and partitions: they should remain at LOGGING = 'YES'.

The best practice is to enable force logging at the database level. Test the performance impact of force logging first before implementing it in production.
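
If force logging passes the performance test, enabling and verifying it is a one-time change run as SYSDBA; the statements below are standard syntax and shown only as a sketch:

SQL> alter database force logging;
SQL> select force_logging from v$database;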

Reference
*********
1. Http 500 : Internal Server Error When Login To A New Cloned Environment [ID 781413.1]
2. ORA-1578 ORA-26040 in a LOB segment - Script to solve the errors [ID 293515.1]
3. Partitions in Workflow Local Tables are Automatically Switched to NOLOGGING [ID 433280.1]
4. Nologging in the E-Business Suite [ID 216211.1]

Tuesday, March 1, 2011

Internal 500 error accessing direct forms in R12

We often hit issues after cloning, and most of them are login issues. As APPS DBAs we work by elimination to find the root cause of the problem. One such check is direct access to Forms, which verifies whether the forms services are configured correctly. Since the introduction of Self-Service in 11.5.x, direct access to forms is not allowed by default.

We have to change the Administer Server Security setting to enable this. By default the value is "SECURE"; it has to be "OFF" to access forms directly.

To change the security setting, run the command below as the applmgr user on the admin application node:

java oracle.apps.fnd.security.AdminAppServer apps/<apps_pwd> AUTHENTICATION OFF DBC=$FND_SECURE/<SID>.dbc

Check the status using,
java oracle.apps.fnd.security.AdminAppServer apps/<apps_pwd> STATUS DBC=$FND_SECURE/<SID>.dbc

Sample output:

Database Server
---------------
DATABASE_ID:<SID>
AUTHENTICATION: OFF

Application Server
------------------
APPL_SERVER_STATUS: VALID
APPL_SERVER_ID:<Server ID>

After turning off Administer Server Security bounce all application services.

Then access the below links for direct forms,

11i: http://<hostname>:<port>/dev60cgi/f60cgi
R12: http://hyptestdb.alrajhisteel.com:8010/forms/frmservlet

When your middle-tier forms version is 10.1.2.3, you will hit the error below while accessing the above URLs:

500 Internal Server Error
Servlet error: An exception occurred. The current application deployment descriptors do not allow for including it in this response. Please consult the application log for details

This is a known issue and is fixed in the higher version 10.1.2.3.1.
The workaround is to append "?" to the end of the URL, like:
11i: http://<hostname>:<port>/dev60cgi/f60cgi?
R12: http://hyptestdb.alrajhisteel.com:8010/forms/frmservlet?
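
Once the forms check is complete, remember to switch server security back on and bounce the application services again. Using the same AdminAppServer utility as above, the command should be along these lines (SECURE being the usual production value; verify the value expected in your environment):

java oracle.apps.fnd.security.AdminAppServer apps/<apps_pwd> AUTHENTICATION SECURE DBC=$FND_SECURE/<SID>.dbc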

References
**********
1. Log on as a Different User Option Greyed Out in File Drop Down Menu [ID 165485.1]
2. Preventing access via "/forms/frmservlet/" in Release 12 [ID 821037.1]
3. How to Administer Server Security in Release 12 [ID 443978.1]


Sunday, February 27, 2011

How to handle lost undo datafile

As DBAs we are not supposed to drop any datafile without proper analysis and a backup, even an offline datafile. But as humans we make mistakes, and sometimes they cost us dearly. In this post I am going to share how to recover from a lost undo datafile.

There are two scenarios: losing the undo datafile with active transactions and without active transactions. When there are active transactions, the preferred option is to restore the datafile from backup and perform media recovery. If there are no active transactions, follow these steps to recover the database.

1. Query to check for active transactions,

SELECT KTUXEUSN, KTUXESLT, KTUXESQN, /* Transaction ID */ KTUXESTA Status, KTUXECFL Flags
FROM x$ktuxe WHERE ktuxesta != 'INACTIVE';

select local_tran_id, state from dba_2pc_pending;

Both of the above queries should return "no rows".

2. Make clean shutdown of database,

shutdown immediate;

3. create a parameter file from spfile

create pfile from spfile;

4. Update the pfile with "undo_management=manual"

5. Start the database

startup pfile=$ORACLE_HOME/dbs/init<SID>.ora

6. Drop undo tablespace with datafile

drop tablespace <undo_tablespace_name> including contents and datafiles;

7. Create a new undo tablespace,

CREATE UNDO TABLESPACE <new_undo_tablespace_name> DATAFILE '<datafile_path>' SIZE 2M AUTOEXTEND ON NEXT 1M MAXSIZE UNLIMITED;

8. Shutdown the database

shutdown immediate;

9. Update the pfile with "undo_management=AUTO"

10. Create spfile from pfile

create spfile from pfile;

11. Start the database

startup;
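
As a quick sanity check after the final startup, queries like the following (illustrative, run as SYSDBA) confirm that automatic undo management is back in place and the new undo tablespace is online:

SQL> show parameter undo_management
SQL> show parameter undo_tablespace
SQL> select tablespace_name, status from dba_tablespaces where contents = 'UNDO';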

Reference
*********

How to Recover from a Lost Datafile with Different Scenarios [ID 198640.1]

Monday, February 14, 2011

rconfig does not recognize cluster environment

rconfig is a command-line tool, introduced in 10gR2, to convert a single-instance database to RAC. We were trying to convert a single-instance 11gR1 database to a RAC database on 11gR2 grid infrastructure. After installing the grid, we verified the cluster status using "crs_stat"; all the cluster services were running fine. But when we invoked rconfig from the 11gR1 Oracle Home, we got the error below:

Operation Failed
Clusterware is not configured

On further analysis of the rconfig log file in $ORACLE_HOME/cfgtoollogs/rconfig, we noticed that the 11gR1 Oracle home could not communicate with the CRS services.

...
oracle.ops.mgmt.cluster.ClusterException: PRKH-1010 : Unable to communicate with CRS services.
....

Then we came across MetaLink note 946332.1, which covers being unable to create a lower-version database on 11gR2 grid infrastructure.

Cause:
RAC nodes were not pinned:

$11gR2_ORACLE_HOME/grid/bin/olsnodes -t
oranode1 Unpinned
oranode2 Unpinned
oranode3 Unpinned

This is a known issue when configuring a lower version of the database on a fresh 11gR2 grid infrastructure. The fix is to manually pin the RAC nodes. Run the command below as root on any one node:

# crsctl pin css -n oranode1 oranode2 oranode3
CRS-4664: Node oranode1 successfully pinned.
CRS-4664: Node oranode2 successfully pinned.
CRS-4664: Node oranode3 successfully pinned.

Check the status; the nodes should now show as "Pinned":
$11gR2_ORACLE_HOME/grid/bin/olsnodes -t
oranode1 Pinned
oranode2 Pinned
oranode3 Pinned

Note: This issue also occurs when trying to create a database using dbca from a lower-version Oracle home on 11gR2 grid infrastructure.

Reference
=======
1. Unable To Create 10.1 or 10.2 or 11.1 ASM RAC Databases (ORA-29702) Using Brand New 11gR2 Grid Infrastructure Installation. [946332.1]

Saturday, February 12, 2011

SYSTEM09.dbf file corrupted in R12.1.1 fresh install on Linux x86_64

A fresh R12.1.1 install with Rapid Install wizard (StartCD) version 12.1.1.9 or below on Linux x86_64 contains a corrupted SYSTEM09.dbf file. Thankfully the corruption is not data corruption (the affected blocks do not belong to any extents/segments), so it will not affect your installation or database startup. Perhaps that is why Oracle did not catch this issue before releasing it to customers.

Users typically identify this corruption only when they start backing up or validating the datafile with RMAN. RMAN fails with the error message below:

ORA-19566: exceeded limit of 0 corrupt blocks for file /DBF_LOC/system09.dbf

Verifying the datafile with dbv reports 20000+ corrupted blocks in system09.dbf:

...
Page 192127 is marked corrupt
Corrupt block relative dba: 0x5802ee7f (file 352, block 192127)
Completely zero block found during dbv:

Page 192128 is marked corrupt
Corrupt block relative dba: 0x5802ee80 (file 352, block 192128)
Completely zero block found during dbv:



DBVERIFY - Verification complete

Total Pages Examined : 192128
Total Pages Processed (Data) : 54272
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 26750
Total Pages Failing (Index): 0
Total Pages Processed (Other): 2063
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 82570
Total Pages Marked Corrupt : 26473
Total Pages Influx : 0
Total Pages Encrypted : 0
Highest block SCN : 3755938521 (1388.3755938521)
$

Since the affected blocks are empty, we can resize the datafile to fix the corruption. Run this command as SYSDBA:

alter database datafile '/DBF_LOC/system09.dbf' resize 1000M;
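
The resize target only has to stay above the highest used block in the file; after the resize it is worth re-checking the datafile, for example with dbv or an RMAN validate. The commands below are illustrative - adjust the block size and the file number (352 in the dbv output above) to your environment:

$ dbv file=/DBF_LOC/system09.dbf blocksize=8192

RMAN> backup validate check logical datafile 352;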


Reference
=======
1. Oracle E-Business Suite Installation and Upgrade Notes Release 12 (12.1.1) for Linux x86-64 [ID 761566.1]
2. More than 20000 Blocks Corrupted in system09.dbf After Fresh Installation of E-Business Suite R12.1.1 on Linux x86_64 Platform [ID 840963.1]

Wednesday, February 9, 2011

Device Mapper Multipathing



Multipath
========

The connection from the server through a Host Bus Adapter (HBA) to a storage controller is referred to as a path. When multiple paths exist to a storage device (LUN) on a storage subsystem, this is referred to as multipath connectivity. It is an enterprise-level storage capability. The main purposes of multipath connectivity are:

* To provide redundant access to the storage devices, i.e. to retain access to the storage device when one or more of the components in a path fail.
* To increase throughput by way of load balancing.


A simple example of multipath could be: 2 HBAs connected to a switch to which the storage controllers are connected. In this case the storage controller can be accessed from either of the HBAs and hence we have multipath connectivity.

In Linux, a SCSI device is configured for the LUN seen on each path, i.e. if a LUN has 4 paths, you will see four SCSI devices configured for the same storage LUN. Doing I/O to a LUN in such an environment is unmanageable on the raw SCSI devices:

* applications/administrators do not know which SCSI device to use
* there is no way for all applications to consistently use the same device across a path failure
* there is no way to always use the storage device's preferred path
* there is no way to spread I/O between the multiple valid paths

Device mapper Multipathing
=====================

Device mapper multipathing (DM-Multipath) allows you to configure multiple I/O paths between server nodes and storage arrays into a single device. Multipathing aggregates the I/O paths, creating a new device that consists of the aggregated paths.

Storage devices with two controllers can be configured as:

Active/Active: Means that both the controllers can process I/Os.
Active/Passive: Means that one of the controllers(active) can process I/Os, and the other one(passive) is in a standby mode. I/Os to the passive controller will fail.

Steps to configure DM-Multipath
=======================

First we need to check that all the hardware connections are in place, i.e. the connection from the host to the SAN switch through the Fibre Channel HBA and the connection from the SAN switch to the storage. The steps listed here are for SAN storage configured as Active/Passive.

1. Check that the multipath driver packages are installed:

# rpm -qa | grep device-mapper
device-mapper-1.02.28-2.el5
device-mapper-event-1.02.28-2.el5
device-mapper-multipath-0.4.7-23.el5
device-mapper-1.02.28-2.el5

If the above packages are not installed, install them using "rpm -ivh". All of them are available on the Enterprise Linux OS install CD, or they can be downloaded from rpm.pbone.net.

2. Check the device list in /proc/partitions:

# cat /proc/partitions
major minor #blocks name

8 0 291991552 sda
8 1 200781 sda1
8 2 51199155 sda2
8 3 8193150 sda3
8 4 1 sda4
8 5 232396258 sda5
8 16 104857600 sdb
8 32 104857600 sdc

3. Get the unique SCSI id (WWID) for the available devices:

# scsi_id -g -u -s /block/sda
3600605b00293c32014c002630a328feb
# scsi_id -g -u -s /block/sdb
3600605b00293e20014bf1ca41c462c45
# scsi_id -g -u -s /block/sdc
3600605b00293e20014bf1ca41c462c45
#

Devices sdb and sdc have the same SCSI id, which means they both point to the same LUN in the SAN storage.

Note: when obtaining the unique SCSI id, use the sysfs path /block/sdb, not /dev/sdb.

4. Check the devices listed by the OS using fdisk:

# fdisk -l

Disk /dev/sda: 298.9 GB, 298999349248 bytes
255 heads, 63 sectors/track, 36351 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System

/dev/sda1 * 1 25 200781 83 Linux
/dev/sda2 26 6399 51199155 83 Linux
/dev/sda3 6400 7419 8193150 82 Linux swap / Solaris
/dev/sda4 7420 36351 232396290 5 Extended
/dev/sda5 7420 36351 232396258+ 83 Linux

Disk /dev/sdb: 107.3 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders

Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdb doesn’t contain a valid partition table

Ignore the primary device /dev/sda; it is the local hard disk that holds the OS and other applications.

Note that device /dev/sdb is listed by fdisk but /dev/sdc is not; that is because I/O to the passive device (/dev/sdc here) fails.

5. Check that the multipath modules are loaded by issuing the 'lsmod' command:

#lsmod | grep dm[\-_]round
dm_round_robin 36801 1
dm_multipath 52433 2 dm_round_robin
#

If they are not loaded, load them using the 'modprobe' command:

# modprobe dm-multipath
# modprobe dm-round-robin
#

6. Update /etc/multipath.conf

* Set user_friendly_names yes, which gives the device a user-friendly name; otherwise DM names the device with the WWID by default, like /dev/mapper/3600605b00293e20014bf1ca41c462c45.
* The blacklist is the list of devices for which DM will not configure multipath, for example local hard disks. To include only specific devices (here the SAN LUNs) in multipathing, blacklist all devices and then add the devices required for multipathing to blacklist_exceptions.

blacklist {
wwid "*"
}
blacklist_exceptions {
wwid 3600605b00293e20014bf1ca41c462c45
}

* Include the device in a multipaths group and provide an alias name. If no alias is provided, DM uses the default user-friendly naming convention /dev/mapper/mpathn, where n is a number starting from 0. In this example we have provided the alias "oracle".

multipaths {
multipath {
wwid 3600605b00293e20014bf1ca41c462c45 #<— for sdb and sdc
alias oracle
}

}
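
Putting those pieces together, a minimal /etc/multipath.conf for this example would look roughly like the snippet below; the WWID and alias are the ones used above, and it is meant as a sketch rather than a drop-in configuration:

defaults {
user_friendly_names yes
}

blacklist {
wwid "*"
}

blacklist_exceptions {
wwid 3600605b00293e20014bf1ca41c462c45
}

multipaths {
multipath {
wwid 3600605b00293e20014bf1ca41c462c45
alias oracle
}
}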

7. Use chkconfig to start the multipath daemon automatically on reboot.

# chkconfig --list multipathd
multipathd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
#

If the service is disabled as above, enable it:

# chkconfig multipathd on
# chkconfig --list multipathd
multipathd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
#

8. Start the multipath daemon service:

# service multipathd start
Starting multipathd daemon: [ OK ]
#

9. Check the multipath configuration:

# multipath -ll
sdc: checker msg is "readsector0 checker reports path is down"
oracle (3600605b00293e20014bf1ca41c462c45) dm-0 SUN,SUN_6180
[size=100G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=1][active]
\_ 7:0:1:0 sdb 8:48 [active][ready]
\_ round-robin 0 [prio=0][enabled]
\_ 7:0:0:0 sdc 8:32 [active][ghost]
#

Note that sdb is [active][ready] and sdc is [active][ghost]; the ghost status indicates that it is the passive path.

10. Verify the DM-Multipath device in fdisk; DM devices are listed as /dev/dm-n, where n is a number starting from 0.

# fdisk -l

Disk /dev/sda: 298.9 GB, 298999349248 bytes
255 heads, 63 sectors/track, 36351 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sda1 * 1 25 200781 83 Linux
/dev/sda2 26 6399 51199155 83 Linux
/dev/sda3 6400 7419 8193150 82 Linux swap / Solaris
/dev/sda4 7420 36351 232396290 5 Extended
/dev/sda5 7420 36351 232396258+ 83 Linux

Disk /dev/sdb: 107.3 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdb doesn’t contain a valid partition table

Disk /dev/dm-0: 107.3 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/dm-0 doesn’t contain a valid partition table

11. Check the user-friendly named DM devices under /dev/mapper:

# ls -lrt /dev/mapper/*
crw------- 1 root root 10, 62 Feb 5 01:46 /dev/mapper/control
brw-rw---- 1 root disk 253, 0 Feb 5 01:46 /dev/mapper/oracle

12. Use /dev/mapper/<alias> for any file system activity, such as partitioning with fdisk, creating a file system and mounting the device. Also make sure this user-friendly named device is the one used in /etc/fstab for auto mount (a sample line follows the example below).

Example:
1. To partition the device:
#fdisk /dev/mapper/oracle

2. To create a file system (after partitioning, the device name for partition n will be /dev/mapper/<alias>pn, e.g. /dev/mapper/oraclep5):
#mkfs -t ext3 /dev/mapper/oraclep5

3. To mount the device:
#mount -t ext3 /dev/mapper/oraclep5 /data
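
For the /etc/fstab entry mentioned in step 12, a line along the following lines (mount options are illustrative) keeps the mount on the user-friendly device name across reboots:

/dev/mapper/oraclep5 /data ext3 defaults 0 0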

References
=========

1. Red Hat Enterprise Linux 6 DM Multipath - DM Multipath Configuration and Administration
2. http://sourceware.org/lvm2/wiki/MultipathUsageGuide#active_passive
3. Configuration and Use of Device Mapper Multipathing on Oracle Enterprise Linux (OEL) [ID 555603.1]

Monday, February 7, 2011

OCFS2 Fencing

I am very excited to start writing about my experience with Oracle products and solutions. In this post I am going to detail the OCFS2 fencing behaviour.

We are setting up a new data center for Oracle products, for which we have Sun X4170, X4270 and X4470 servers, Sun 6180 SAN storage, SAN switches and Fibre Channel HBAs (Host Bus Adapters). We configured the SAN storage and allocated LUNs to the servers for Oracle EBS R12.1, OBI and Hyperion applications.

We created the required partitions and OCFS2 file systems on the new devices for our Oracle EBS R12.1 RAC database and shared application file system (OCFS2 is certified for a shared application file system; check ID 384248.1). After configuring and mounting the file systems, we started to test various scenarios to verify the stability of the setup. To our surprise, the server rebooted when we pulled all the HBAs connected to the server or powered off the SAN switch.

On analysis we found that this is an expected behavior of OCFS2, called "fencing".

Fencing is the act of forcefully removing a node from a cluster. A node with an OCFS2 file system mounted will fence itself when it realizes that it does not have quorum in a degraded cluster. It does this so that other nodes won't be stuck trying to access its resources. In earlier versions of OCFS2, users reported that nodes hung during fencing. From OCFS2 1.2.5 onwards, Oracle no longer uses the "panic" state for fencing; instead it uses "machine restart".

Let us now see, how exactly OCFS2 forces kernel to restart on fencing.

  • After configuring OCFS2 and started cluster service O2CB, there will be a heartbeat system file in which every node writes its presence every 2 seconds to its block in the file.
  • Block offset is equal to its global node number, that is node 0 will write to the first block in the file, node 1 to the second block and so on.
  • All nodes will read the heartbeat system file every 2 seconds.
  • As long as the timestamp is changing, that node is considered alive.
  • A node self-fences if it fails to update its timestamp for ((O2CB_HEARTBEAT_THRESHOLD - 1) * 2) seconds.
  • The [o2hb-xx] kernel thread, after every timestamp write, sets a timer to panic the system after that duration.
  • If the next timestamp is written within that duration, as it should, it first cancels old timer before setting up a new one.
  • If for some reason the [o2hb-x] kernel thread is unable to update the timestamp for O2CB_HEARTBEAT_THRESHOLD (default = 7 or 31, depending on version) loops, the node is deemed dead by the other nodes in the cluster and OCFS2 forces the kernel to restart.
  • Once a node is deemed dead, the surviving node that manages the cluster locks the dead node's journal and recovers it by replaying the journal.

From the above steps it is evident that the parameter O2CB_HEARTBEAT_THRESHOLD = ((timeout in seconds) / 2) + 1 is very important in defining the timeline for restarting the server.
The default value is 31, which means that if the node does not update its timestamp in the heartbeat system file within 60 seconds, the node restarts. This value is quite low for a RAC environment.

Assume we have SAN storage with two controllers configured active/passive, i.e. at any particular time only one controller path is active and the other is passive, used for failover. OCFS2 forces the kernel to restart (after the timeout) when a cable to the SAN device is cut, even if the SAN configuration was going to perform a failover, because the failover can take longer than the heartbeat threshold allows.

To overcome this we have to increase the value of O2CB_HEARTBEAT_THRESHOLD. For example, to allow 120 seconds the value should be (120 / 2) + 1 = 61.

Steps to change O2CB_HEARTBEAT_THRESHOLD
=====================================
1. Stop the O2CB service on the server:
# service o2cb stop

2. Update the O2CB configuration with the required value for O2CB_HEARTBEAT_THRESHOLD:

# service o2cb configure
Configuring the O2CB driver.

This will configure the on-boot properties of the O2CB driver.
The following questions will determine whether the driver is loaded on
boot. The current values will be shown in brackets ('[]'). Hitting
<ENTER> without typing an answer will keep that current value. Ctrl-C
will abort.

Load O2CB driver on boot (y/n) [y]:
Cluster stack backing O2CB [o2cb]:
Cluster to start on boot (Enter "none" to clear) [ocfs2]:
Specify heartbeat dead threshold (>=7) [31]: 61
Specify network idle timeout in ms (>=5000) [30000]:
Specify network keepalive delay in ms (>=1000) [2000]:
Specify network reconnect delay in ms (>=2000) [2000]:
Writing O2CB configuration: OK
Loading filesystem "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading filesystem "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting O2CB cluster ocfs2: OK
#

An alternate way to update the value is to edit the file /etc/sysconfig/o2cb directly and restart the O2CB service.
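
In that file the relevant entry is the threshold variable; after editing, it should read something like:

O2CB_HEARTBEAT_THRESHOLD=61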

3. Check the status of O2CB
# /etc/init.d/o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Heartbeat dead threshold = 61
Network idle timeout: 30000
Network keepalive delay: 2000
Network reconnect delay: 2000
Checking O2CB heartbeat: Active
#

References
========
1. OCFS2: A Cluster File System for Linux - User's Guide for Release 1.4
2. OCFS2 Kernel Panics on SAN Failover [ID 377616.1]
3. OCFS2 1.2 - FREQUENTLY ASKED QUESTIONS [ID 391771.1]
4. Heartbeat/Voting/Quorum Related Timeout Configuration for Linux, OCFS2, RAC Stack to Avoid Unnecessary Node Fencing, Panic and Reboot [ID 395878.1]

In my next post, I will explain the detailed steps for configuring multipath using the native device mapper (DM) in Linux.