
Wednesday, February 9, 2011

Device Mapper Multipathing



Multipath
========

The connection from the server through a Host Bus Adapter (HBA) to the storage controller is referred to as a path. When multiple paths exist to a storage device (LUN) on a storage subsystem, this is referred to as multipath connectivity. It is an enterprise-level storage capability. The main purposes of multipath connectivity are:

* To provide redundant access to the storage devices, i.e. to retain access to a storage device when one or more components in a path fail.
* To increase throughput by way of load balancing.


A simple example of multipath is two HBAs connected to a switch to which the storage controllers are connected. In this case the storage controller can be accessed from either of the HBAs, and hence we have multipath connectivity.

In Linux, a SCSI device is configured for the LUN seen on each path, i.e. if a LUN has 4 paths, then four SCSI devices get configured for the same storage LUN. Doing I/O to a LUN in such an environment is unmanageable:

* applications/administrators do not know which SCSI device to use
* there is no way to ensure that all applications consistently use the same device, even on path failure
* there is no way to always use the storage device's preferred path
* there is no way to spread I/O between the multiple valid paths

Device mapper Multipathing
=====================

Device mapper multipathing (DM-Multipath) allows you to configure multiple I/O paths between server nodes and storage arrays into a single device. Multipathing aggregates the I/O paths, creating a new device that consists of the aggregated paths.

Storage devices with two controllers can be configured as:

Active/Active: both controllers can process I/Os.
Active/Passive: one controller (active) processes I/Os while the other (passive) is in standby mode; I/Os sent to the passive controller will fail.
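
This distinction shows up in /etc/multipath.conf (configured in the steps below) through the path grouping policy. The stanza below is only an illustrative sketch with placeholder vendor/product strings; real values for a given array should come from the vendor documentation or from multipath's built-in defaults.

devices {
    device {
        vendor  "VENDOR"              # placeholder, not a real array
        product "PRODUCT"             # placeholder, not a real array
        path_grouping_policy failover # one path per priority group: one path carries I/O, the rest stand by
        failback immediate            # return to the preferred path as soon as it is restored
    }
}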

Steps to configure DM-Multipath
=======================

First we need to check that all the hardware connections are in place, i.e. the connection from the host to the SAN switch through the Fibre Channel HBA and the connection from the SAN switch to the storage. The steps listed here are for an Active/Passive configured SAN storage.

1. Check that the following rpms for the multipath drivers are installed.

# rpm -qa | grep device-mapper
device-mapper-1.02.28-2.el5
device-mapper-event-1.02.28-2.el5
device-mapper-multipath-0.4.7-23.el5
device-mapper-1.02.28-2.el5

If the above packages are not installed, install them using "rpm -ivh". All the above packages are available on the Enterprise Linux OS install CD; you can also download them from rpm.pbone.net.
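
For example (the rpm file name below matches the version shown above; alternatively, assuming a configured yum repository, yum can pull in the package and its dependencies):

# rpm -ivh device-mapper-multipath-0.4.7-23.el5.rpm
# yum install device-mapper-multipath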

2. Check the devices listed in /proc/partitions

# cat /proc/partitions
major minor  #blocks    name

   8     0  291991552  sda
   8     1     200781  sda1
   8     2   51199155  sda2
   8     3    8193150  sda3
   8     4          1  sda4
   8     5  232396258  sda5
   8    16  104857600  sdb
   8    32  104857600  sdc

3. Get the unique SCSI id (WWID) of the available devices

# scsi_id -g -u -s /block/sda
3600605b00293c32014c002630a328feb
# scsi_id -g -u -s /block/sdb
3600605b00293e20014bf1ca41c462c45
# scsi_id -g -u -s /block/sdc
3600605b00293e20014bf1ca41c462c45
#

Devices sdb and sdc have the same SCSI id, which means they both point to the same LUN in the SAN storage.

Note: when obtaining the unique SCSI id, use /block/sdb, not /dev/sdb.
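
When there are many devices, a small shell loop (using the same RHEL 5 scsi_id syntax as above) makes it easy to spot which sd devices share a WWID; devices that print the same id are paths to the same LUN:

# for d in /sys/block/sd*; do dev=$(basename $d); echo "$dev: $(scsi_id -g -u -s /block/$dev)"; done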

4. Check the devices visible to the OS using fdisk

# fdisk -l

Disk /dev/sda: 298.9 GB, 298999349248 bytes
255 heads, 63 sectors/track, 36351 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot      Start       End     Blocks     Id  System

/dev/sda1   *        1         25     200781     83  Linux
/dev/sda2           26       6399   51199155     83  Linux
/dev/sda3         6400       7419    8193150     82  Linux swap / Solaris
/dev/sda4         7420      36351  232396290      5  Extended
/dev/sda5         7420      36351  232396258+    83  Linux

Disk /dev/sdb: 107.3 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders

Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdb doesn't contain a valid partition table

Ignore the primary device /dev/sda; it is the local hard disk that holds the OS and other applications.

Note that device /dev/sdb is listed by fdisk but /dev/sdc is not, because I/O to the passive device (here /dev/sdc) fails.

5. Check whether the multipath modules are loaded by issuing the 'lsmod' command

#lsmod | grep dm[\-_]round
dm_round_robin 36801 1
dm_multipath 52433 2 dm_round_robin
#

If they are not loaded, load them using the 'modprobe' command:

# modprobe dm-multipath
# modprobe dm-round-robin
#

6. Update /etc/multipath.conf

* Set user_friendly_names yes, which gives the device a user-friendly name; otherwise DM names the device with its WWID by default, e.g. /dev/mapper/3600605b00293e20014bf1ca41c462c45.
* The blacklist is the list of devices for which DM will not configure multipath, for example local hard disks. To include only specific devices (here the SAN LUNs) for multipathing, blacklist all devices and add the WWIDs required for multipathing to blacklist_exceptions.

blacklist {
wwid "*"
}
blacklist_exceptions {
wwid 3600605b00293e20014bf1ca41c462c45
}

* Include the devices in the multipaths group and provide an alias name. If an alias is not provided, DM uses the default user-friendly naming convention /dev/mapper/mpathn, where n is a number starting from 0. In this example we have provided the alias "oracle". A consolidated /etc/multipath.conf combining these settings is sketched after the multipaths section below.

multipaths {
multipath {
wwid 3600605b00293e20014bf1ca41c462c45 # <-- for sdb and sdc
alias oracle
}

}
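
Putting the pieces together, a minimal /etc/multipath.conf for this example might look like the sketch below (the WWID is the one obtained in step 3; adjust it for your own LUNs):

defaults {
    user_friendly_names yes
}
blacklist {
    wwid "*"
}
blacklist_exceptions {
    wwid 3600605b00293e20014bf1ca41c462c45
}
multipaths {
    multipath {
        wwid  3600605b00293e20014bf1ca41c462c45  # the LUN seen as sdb and sdc
        alias oracle
    }
}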

7. Set chkconfig so that the multipath daemon starts automatically on reboot.

# chkconfig --list multipathd
multipathd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
#

If the service is disabled as above, enable it:

# chkconfig multipathd on
# chkconfig --list multipathd
multipathd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
#

8. Start the multipath daemon service:

# service multipathd start
Starting multipathd daemon: [ OK ]
#

9. Check the multipath configuration:

# multipath -ll
sdc: checker msg is "readsector0 checker reports path is down"
oracle (3600605b00293e20014bf1ca41c462c45) dm-0 SUN,SUN_6180
[size=100G][features=0][hwhandler=0][rw]
\_ round-robin 0 [prio=1][active]
\_ 7:0:1:0 sdb 8:48 [active][ready]
\_ round-robin 0 [prio=0][enabled]
\_ 7:0:0:0 sdc 8:32 [active][ghost]
#

Note that sdb is active/ready and sdc is active/ghost. The active/ghost status indicates that it is the passive connection.
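
Optionally, failover can be verified before the device is put to use. One way (a sketch, to be run only on a test system) is to take the active path offline through sysfs, re-run multipath -ll to confirm I/O moves to the other path, and then bring the path back:

# echo offline > /sys/block/sdb/device/state   # simulate failure of the active path sdb
# multipath -ll
# echo running > /sys/block/sdb/device/state   # restore the path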

10. Verify the DM-Multipath device in fdisk; the DM devices are listed as /dev/dm-n, where n is a number starting from 0.

# fdisk -l

Disk /dev/sda: 298.9 GB, 298999349248 bytes
255 heads, 63 sectors/track, 36351 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot      Start       End     Blocks     Id  System
/dev/sda1   *        1         25     200781     83  Linux
/dev/sda2           26       6399   51199155     83  Linux
/dev/sda3         6400       7419    8193150     82  Linux swap / Solaris
/dev/sda4         7420      36351  232396290      5  Extended
/dev/sda5         7420      36351  232396258+    83  Linux

Disk /dev/sdb: 107.3 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdb doesn't contain a valid partition table

Disk /dev/dm-0: 107.3 GB, 107374182400 bytes
255 heads, 63 sectors/track, 13054 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/dm-0 doesn't contain a valid partition table

11. Check the user-friendly named devices created by DM in /dev/mapper

# ls -lrt /dev/mapper/*
crw------- 1 root root 10, 62 Feb 5 01:46 /dev/mapper/control
brw-rw---- 1 root disk 253, 0 Feb 5 01:46 /dev/mapper/oracle

12. Use the /dev/mapper/<alias> device for any file system activity, such as partitioning with fdisk, creating a file system and mounting the device. Also make sure this user-friendly named device is used in /etc/fstab for auto mount (an example entry is sketched after the commands below).

Example:
1. To partition:
# fdisk /dev/mapper/oracle

2. To create a file system (after creating a partition, the device name for the partition will be /dev/mapper/<alias>pn, where n is a number starting from 1):
# mkfs -t ext3 /dev/mapper/oraclep5

3. To mount the device:
# mount -t ext3 /dev/mapper/oraclep5 /data
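
For auto mount, the /etc/fstab entry should reference the same mapper device; the line below is a sketch for this example. Note that after partitioning a multipath device you may need to run kpartx -a /dev/mapper/oracle (or reboot) so that the partition mapping /dev/mapper/oraclep5 appears.

/dev/mapper/oraclep5    /data    ext3    defaults    0 0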

References
=========

1. Red Hat Enterprise Linux 6 DM Multipath - DM Multipath Configuration and Administration
2. http://sourceware.org/lvm2/wiki/MultipathUsageGuide#active_passive
3. Configuration and Use of Device Mapper Multipathing on Oracle Enterprise Linux (OEL) [ID 555603.1]

Monday, February 7, 2011

OCFS2 Fencing

I am very excited to start writing about my experience with Oracle products
and solutions. In this blog I am going to detail the OCFS2 fencing technology.

We are setting up a new data center for Oracle products,
for which we have got SUN X4170, X4270 and X4470 servers,
SUN 6180 SAN storage, SAN switches and Fibre Channel
HBAs (Host Bus Adapters). We have configured the SAN storage and allocated
LUNs to the servers for Oracle EBS R12.1, OBI and Hyperion applications.

We created the required partitions and OCFS2 file systems on the new devices
for our Oracle EBS R12.1 RAC database and shared application file system.
OCFS2 is certified for a shared application file system; check ID 384248.1.
After configuring and mounting our file systems, we started testing
various scenarios to verify the stability of the setup. To our surprise, we
noticed that the server gets rebooted when we remove all HBAs connected to
the server or power off the SAN switch.

On analysis we found that it is an expected behavior of OCFS2, which is called "Fencing".

Fencing is the act of forcefully removing a node from a cluster.
A node with an OCFS2 file system mounted will fence itself when it
realizes that it does not have quorum in a degraded cluster.
It does this so that other nodes won't be stuck trying to access its resources.
In earlier versions of OCFS2, users reported that nodes were hanging
during fencing. From OCFS2 1.2.5 onwards, Oracle no longer uses the
"panic" state for fencing; instead it uses "machine restart".

Let us now see, how exactly OCFS2 forces kernel to restart on fencing.

  • After configuring OCFS2 and starting the cluster service O2CB, there is a heartbeat system file in which every node writes its presence to its own block every 2 seconds.
  • The block offset is equal to the node's global node number, that is, node 0 writes to the first block in the file, node 1 to the second block, and so on.
  • All nodes read the heartbeat system file every 2 seconds.
  • As long as the timestamp is changing, that node is considered alive.
  • A node self-fences if it fails to update its timestamp for ((O2CB_HEARTBEAT_THRESHOLD - 1) * 2) seconds.
  • The [o2hb-xx] kernel thread, after every timestamp write, sets a timer to panic the system after that duration.
  • If the next timestamp is written within that duration, as it should be, it first cancels the old timer before setting up a new one.
  • If for some reason the [o2hb-xx] kernel thread is unable to update the timestamp for O2CB_HEARTBEAT_THRESHOLD loops (default 31, formerly 7), the node is deemed dead by the other nodes in the cluster and OCFS2 forces the kernel to restart.
  • Once a node is deemed dead, the surviving node that manages the cluster locks the dead node's journal and recovers it by replaying the journal.

From the above steps it is evident that the parameter O2CB_HEARTBEAT_THRESHOLD = ((timeout in secs) / 2) + 1 is very important in defining the timeline for restarting the server.
The default value is 31, which means that if a node does not update its timestamp in the heartbeat system file within 60 seconds, that node restarts. This value is quite low for a RAC environment.

Assume that we have a SAN storage with two controllers configured active/passive, that is, at any particular time only one controller path is active and the other is passive and used for failover. OCFS2 forces the kernel to restart (after the timeout) when a cable to a SAN device is cut, even if the SAN configuration was going to perform a failover.

To overcome this we have to increase the value of O2CB_HEARTBEAT_THRESHOLD. If you want to increase the timeout to 120 seconds, then the value should be 61.
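
As a quick check of the formula and of the fencing window it produces:

O2CB_HEARTBEAT_THRESHOLD = (120 / 2) + 1 = 61
fence time = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 = (61 - 1) * 2 = 120 seconds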

Steps to change O2CB_HEARTBEAT_THRESHOLD
=====================================
1. Stop the O2CB service on the server
# service o2cb stop

2. Update the O2CB configuration with the required value for O2CB_HEARTBEAT_THRESHOLD

# service o2cb configure
Configuring the O2CB driver.

This will configure the on-boot properties of the O2CB driver.
The following questions will determine whether the driver is loaded on
boot. The current values will be shown in brackets ('[]'). Hitting
<ENTER> without typing an answer will keep that current value. Ctrl-C
will abort.

Load O2CB driver on boot (y/n) [y]:
Cluster stack backing O2CB [o2cb]:
Cluster to start on boot (Enter "none" to clear) [ocfs2]:
Specify heartbeat dead threshold (>=7) [31]: 61
Specify network idle timeout in ms (>=5000) [30000]:
Specify network keepalive delay in ms (>=1000) [2000]:
Specify network reconnect delay in ms (>=2000) [2000]:
Writing O2CB configuration: OK
Loading filesystem "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading filesystem "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting O2CB cluster ocfs2: OK
#

An alternate way to update the value is to edit the file /etc/sysconfig/o2cb and restart the O2CB service.
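
The relevant lines in that file would then look like the sketch below (these are the variable names written by the o2cb init script in this OCFS2 release; verify them against your own /etc/sysconfig/o2cb before editing):

# /etc/sysconfig/o2cb (excerpt)
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
O2CB_HEARTBEAT_THRESHOLD=61
O2CB_IDLE_TIMEOUT_MS=30000
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000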

3. Check the status of O2CB
# /etc/init.d/o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Heartbeat dead threshold = 61
Network idle timeout: 30000
Network keepalive delay: 2000
Network reconnect delay: 2000
Checking O2CB heartbeat: Active
#

References
========
1. OCFS2: A Cluster File System for Linux - User's Guide for Release 1.4
2. OCFS2 Kernel Panics on SAN Failover [ID 377616.1]
3. OCFS2 1.2 - FREQUENTLY ASKED QUESTIONS [ID 391771.1]
4. Heartbeat/Voting/Quorum Related Timeout Configuration for Linux, OCFS2, RAC Stack to Avoid Unnecessary Node Fencing, Panic and Reboot [ID 395878.1]

In my next blog, I will explain the detailed steps for configuring multipath using the native device mapper (DM) in Linux.