Linux: Using multipath to improve performance for your iSCSI array

As part of my self-learning activities at work, I’m playing with a multi-homed EM64T based server, Linux (Fedora 13), the iscsi-initiator-utils and an EqualLogic PS series RAID array. In the previous installment, I gave an overview on how to make the Volume(s)/LUN(s) exported from the array appear on and be usable from the Linux host. In this installment I’ll discuss how to configure the dm-multipath sub-system to aggregate multiple network paths for availability and load-balancing purposes.

But before we get into the nitty-gritty of that, I just wanted to touch on a problem I’ve been having in setting up multipath over IPv4 on my Fedora 13 host. Now, keep in mind that the Fedora host I’m using has been “tweaked” and “messed with” (a lot), so it ain’t no “fresh install” here… That said, I’ve been having problems getting both of my array-facing NICs to successfully ping the EqualLogic Group IP address.

So, if you’re able to ping the EqualLogic array’s group IP address from both Ethernet NICs, setting up the iSCSI initiator as well as the DM multipath module(s) is very much a straightforward exercise.
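
A quick way to run that reachability check is to force each ping out of a specific NIC with ping’s -I option. What follows is only a sketch: it assumes the array-facing NICs are eth2 and eth3 (the ones used later in this post) and uses 10.10.10.10 as a placeholder for the group IP. The GROUP_IP and PING variables and the check_paths function are names I made up for illustration; PING exists only so the loop can be dry-run.

```shell
#!/bin/sh
# Check that the group IP answers on every array-facing NIC.
# GROUP_IP and the NIC names are placeholders; substitute your own.
GROUP_IP="${GROUP_IP:-10.10.10.10}"
PING="${PING:-ping}"   # hypothetical hook: override to dry-run, e.g. PING=true

check_paths() {
    # Usage: check_paths NIC...
    failed=0
    for nic in "$@"; do
        # -I sends from the named interface, -c 3 sends three probes
        if $PING -I "$nic" -c 3 "$GROUP_IP" >/dev/null 2>&1; then
            echo "$nic: reachable"
        else
            echo "$nic: NOT reachable"
            failed=1
        fi
    done
    return "$failed"
}

# Example: check_paths eth2 eth3
```

If any NIC reports NOT reachable, sort that out before touching the initiator or multipath configuration.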

Create two new open-iscsi initiator interfaces

# iscsiadm --mode iface --op=new --interface iscsi-0
# iscsiadm --mode iface --op=new --interface iscsi-1

Validate that the new interfaces are stored

# iscsiadm --mode iface
default tcp,<empty>,<empty>,<empty>,<empty>
iser iser,<empty>,<empty>,<empty>,<empty>
iscsi-0 tcp,<empty>,<empty>,<empty>,<empty>
iscsi-1 tcp,<empty>,<empty>,<empty>,<empty>

Bind the physical NICs to the newly created initiator interfaces

# iscsiadm --mode iface --op=update --interface iscsi-0 --name=iface.net_ifacename --value=eth2
# iscsiadm --mode iface --op=update --interface iscsi-1 --name=iface.net_ifacename --value=eth3
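
With more than two NICs, the create-and-bind steps can be looped. A minimal sketch, assuming the iscsi-N naming used above; the bind_ifaces function and the RUN variable are hypothetical helpers of mine, and RUN defaults to echo so that nothing is changed until you deliberately set it to an empty string.

```shell
#!/bin/sh
# Create one open-iscsi iface record per NIC and bind it to that NIC.
# RUN defaults to "echo" (dry run); set RUN="" to execute for real.
RUN="${RUN-echo}"

bind_ifaces() {
    # Usage: bind_ifaces NIC...   (records are named iscsi-0, iscsi-1, ...)
    n=0
    for nic in "$@"; do
        $RUN iscsiadm --mode iface --op=new --interface "iscsi-$n"
        $RUN iscsiadm --mode iface --op=update --interface "iscsi-$n" \
            --name=iface.net_ifacename --value="$nic"
        n=$((n + 1))
    done
}

# Dry run: prints the four commands shown above without running them
bind_ifaces eth2 eth3
```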

Validate the bindings

# iscsiadm --mode iface --interface iscsi-0
# BEGIN RECORD 2.0-872
iface.iscsi_ifacename = iscsi-0
iface.net_ifacename = eth2 
iface.ipaddress = <empty>
iface.hwaddress = <empty>
iface.transport_name = tcp
iface.initiatorname = <empty>
# END RECORD
# iscsiadm --mode iface --interface iscsi-1
# BEGIN RECORD 2.0-872
iface.iscsi_ifacename = iscsi-1
iface.net_ifacename = eth3 
iface.ipaddress = <empty>
iface.hwaddress = <empty>
iface.transport_name = tcp
iface.initiatorname = <empty>
# END RECORD
# iscsiadm --mode iface
default tcp,<empty>,<empty>,<empty>,<empty>
iser iser,<empty>,<empty>,<empty>,<empty>
iscsi-0 tcp,<empty>,<empty>,eth2,<empty>
iscsi-1 tcp,<empty>,<empty>,eth3,<empty>

Now that we’ve got the initiator interfaces configured and bound to the physical 1 or 10 GigE NICs, we need to make sure every initiator NIC is logged into the available targets. By the way, you can also configure your Fedora host environment to use bridged interfaces if you’ve got a limited number of NICs in the system and want to use KVM/libvirt to import LUNs directly to a virtualized guest instance.

The easiest way to log in all of the interfaces you’ve defined above to the available targets (the LUNs/Volumes are “available” and “seen” because you’ve set up the ACL on the LUN/Volume to permit access from this host, right?) is to simply list[1] the target(s) you’ve discovered previously and re-issue the # iscsiadm --mode node --login --target iqn.2001-05.com.equallogic:<ID of Volume/LUN> command.
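
If you have more than a couple of targets, the list-then-login step can be scripted. The sketch below assumes the node listing prints one “portal,tpgt target-name” pair per line, and it spells the option out as --targetname. The unique_targets and login_all functions and the RUN variable are hypothetical helpers; RUN defaults to echo, so this is a dry run unless you set RUN to an empty string.

```shell
#!/bin/sh
# Log in to every target that has a node record.
# RUN defaults to "echo" (dry run); set RUN="" to execute for real.
RUN="${RUN-echo}"

unique_targets() {
    # Reads "iscsiadm --mode node" style lines ("portal,tpgt iqn") on stdin
    # and prints each target name once.
    awk 'NF >= 2 { print $2 }' | sort -u
}

login_all() {
    # Reads the same listing on stdin and issues one login per target.
    unique_targets | while read -r iqn; do
        $RUN iscsiadm --mode node --targetname "$iqn" --login
    done
}

# Example: iscsiadm --mode node | login_all
```

Remove the node records for any LUNs/Volumes you do not want on this host before running it for real, since it logs in to everything that is listed.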

The next step is to edit /etc/multipath.conf (or, if you’re so inclined, /etc/multipath/multipath.conf, with /etc/multipath.conf as a symlink pointing to it), assign an alias so that a given multipathed device can be accessed as /dev/mapper/<alias>, and then “do something” with it (use it as an LVM2 volume, a raw device, host a file system on it, or whatever).

So, before we “do the work” to configure the multipath environment, we’ll need to make sure the following information is present in the /etc/multipath.conf file (by the way, the multipath utilities expect their configuration file to be readable at the /etc/multipath.conf location!):

## Use user friendly names, instead of using WWIDs as names.
defaults {
     user_friendly_names yes
     find_multipaths yes
}

## Blacklist local devices (no need to have them accidentally "multipathed")
blacklist {
     wwid scsi-SATA_WDC_WD800AAJS-1_WD-WMAM9ANC0895
     devnode "^sd[a]$"
     devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
     devnode "^hd[a-z]"
     devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
}

## For EqualLogic PS series RAID arrays
devices {
     device {
          vendor                  "EQLOGIC"
          product                 "100E-00"
          path_grouping_policy    multibus
          getuid_callout          "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
          features                "1 queue_if_no_path"
          path_checker            readsector0
          path_selector           "round-robin 0"
          failback                immediate
          rr_min_io               10
          rr_weight               priorities
     }
}

## Assign human-readable names (aliases) to the World-Wide IDs (names) of the devices we want to manage
## Identify the WWID by issuing # scsi_id --whitelisted --page=0x83 /dev/<path to sdX for one of the device paths>

multipaths {
     multipath {
          wwid                    36090a01850e08bd883c4a45bfa0330ba
          alias                   benchmark2
     }
     multipath {
          wwid                    36090a01850e04bd683c4745bfa03b07c
          alias                   benchmark1
     }
}

And now we’re ready to let the Device Mapper tools locate and configure the LUNs/Volume(s) we’ve got multiple paths to access, as well as create device nodes in /dev/mapper/ based on the alias configuration we’ve set up above. Simply starting the multipathd service should configure the multipath devices.

# service multipathd start
# chkconfig multipathd on

Verify that the device mapper multipath configuration is ‘complete’ with the following commands:

# multipath -v2
# multipath -ll
benchmark2 (36090a01850e08bd883c4a45bfa0330ba) dm-0 EQLOGIC,100E-00
size=100G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=2 status=active
   |- 6:0:0:0 sdd 8:48 active ready running
   `- 5:0:0:0 sdb 8:16 active ready running
benchmark1 (36090a01850e04bd683c4745bfa03b07c) dm-1 EQLOGIC,100E-00
size=100G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=2 status=active
   |- 4:0:0:0 sde 8:64 active ready running
   `- 3:0:0:0 sdc 8:32 active ready running
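
To avoid eyeballing that listing on every host, the number of healthy paths per map can be counted with a little awk. This is a sketch under the assumption that the output looks like the listing above (a header line containing “dm-N” for each map, and one “active ready” line per working path); count_paths is a hypothetical helper name.

```shell
#!/bin/sh
# Summarize "multipath -ll" output as "<map name> <number of active paths>".
count_paths() {
    awk '
        # A map header looks like: alias (wwid) dm-N vendor,product
        / dm-[0-9]+ / { if (name != "") print name, n; name = $1; n = 0 }
        # Each usable path line contains "active ready"
        /active ready/ { n++ }
        END { if (name != "") print name, n }
    '
}

# Example: multipath -ll | count_paths
```

Fed the listing above, this prints “benchmark2 2” and “benchmark1 2”. A map reporting fewer paths than you have initiator interfaces means one of the sessions is missing.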

[1] = # iscsiadm --mode node

9 Responses to Linux: Using multipath to improve performance for your iSCSI array
  1. MD
    July 29, 2010 | 10:26 am

    I already have an existing RHEL5 connection not using multipath (but actually bonded + MLT) to a PS-6000, and I’d like to migrate to the multipath setup.

    Have you tested what happens if you add virtual interfaces (IP addresses) to the bonded interface, and then run multipath over that? I’m thinking that the bridge could be replaced by the bond/MLT instead.

    • Thomas
      July 29, 2010 | 10:46 am

      I’ve not tested using IP aliases and I’m not sure it’d be supported (not that the array can really tell, but it’s a question of what’s been tested by Dell’s Q/A group).

      For a supported configuration, Dell would tell you to split the bond, configure the NICs presently bonded as physical devices with IP addresses (i.e. 10.10.10.1, 10.10.10.2, etc) on the same subnet as the group IP (if possible) and then configure multipath.conf/multipathd.

      The following would also work, but I’m not clear on whether or not it’s an officially supported configuration. Since you’ve already got a bonded interface up, have you considered simply using the iSCSI initiator utilities to configure two (or more, depending on the size of your bond) iSCSI session interfaces, re-discover the LUN(s), re-log in to them and start the multipathd service?

      # iscsiadm -m iface -I bond0_1 --op=new
      # iscsiadm -m iface -I bond0_2 --op=new
      # iscsiadm -m iface -I bond0_1 --op=update -n iface.net_ifacename -v bond0
      # iscsiadm -m iface -I bond0_2 --op=update -n iface.net_ifacename -v bond0
      # iscsiadm -m discovery -I bond0_1 -t st -p
      # iscsiadm -m discovery -I bond0_2 -t st -p

      I tend to remove the LUNs/Volumes I’m not using on the host (like the vss-control LUN, etc.) with # iscsiadm -m node --op=delete -T

      # iscsiadm -m node --login -T [iqn for LUN/Volume]

      Edit /etc/multipath.conf and start the multipathd service as well as check that the /dev/mapper device is present ( # multipath -v2 ; multipath -ll )

      Again: I’ve not confirmed that we’ll actually support this setup (meaning, I know it works in my own sandbox environment, but YMMV and if you call support they could be telling you it’s not officially supported)

  2. MD
    August 6, 2010 | 10:54 am

    So the issue that I encountered with this setup (over the bonded interfaces) is that all the traffic will go via one of the interfaces, for example:

    Destination Gateway Genmask Flags MSS Window irtt Iface
    10.100.110.0 0.0.0.0 255.255.255.0 U 0 0 0 eth3
    10.100.110.0 0.0.0.0 255.255.255.0 U 0 0 0 eth4
    10.100.110.0 0.0.0.0 255.255.255.0 U 0 0 0 eth5
    10.100.110.0 0.0.0.0 255.255.255.0 U 0 0 0 eth2

    In this case, eth3 takes all the load, despite there being 4 iSCSI sessions open, one from each of the interfaces (one session per interface).

    In effect, while there might be load-balancing via multipath, everything is bottlenecked through one NIC?

    • Thomas
      August 6, 2010 | 12:21 pm

      I’ve seen similar behavior in my personal iSCSI network when using “mode=802.3ad” for the “BONDING_OPTS=” line in ifcfg-bond0 (and the switch configured with Link Aggregation/Trunking for the ports). After switching to “mode=balance-alb” and turning off trunking on the switch ports (LACP is off), my throughput increased and I’m seeing traffic on both of my bonded NICs (this is a personal iSCSI SAN in my basement, no EqualLogic equipment included).

      Also, how come your eth{2,3,4,5} all have IP addresses? If you’re using the bonding driver, I would have expected them all to be IP-less slaves (see “SLAVE” in the output from # ifconfig eth{2,3,4,5}), with “bond0” (or whatever you decide to call it) as the only entity having an IP address.

      The output you included reminds me more of a solution w/no bonding enabled (i.e. the standard EqualLogic recommended configuration) and the dm-multipath driver loaded & configured?

      • MD
        August 6, 2010 | 8:44 pm

        Sorry, maybe my use of words was ambiguous – by “over the bonded interfaces” i really meant “compared to the bonded interfaces”

        I forgot to mention: I switched this morning from the bonded method to the “recommended” method with multipath and individual NICs. I’ve had the bonded setup working great for 11.5 months, but have always been disappointed by the performance. I used the bonded/MLT (LACP) setup because I wanted the redundancy at the network end of things.

        Since I thought this was the bottleneck, I’ve now tried the multipath/4 single NICs to see if there’s much difference.

        In my initial tests, I mounted a share using NFS on a separate system via the IP address I have assigned to eth5. Interestingly, all of this traffic actually goes via the eth3 interface, which is listed first in the routing table (it’s on the same subnet as eth2-5). I’d have expected eth5, since that’s where the IP is “located”.

        I then saw the routing table, and that seems to be the reason (i’d read your previous post on the rp_filter, and that was already defaulted to 0 – it’s a RHEL5 system).

        Reading the kernel docs for “arp_filter” (not rp_filter), it almost implies that i need to setup source-based routing:
        “arp_filter – BOOLEAN
        1 – Allows you to have multiple network interfaces on the same
        subnet, and have the ARPs for each interface be answered
        based on whether or not the kernel would route a packet from
        the ARP’d IP out that interface (therefore you must use source
        based routing for this to work). In other words it allows control
        of which cards (usually 1) will respond to an arp request.

        0 – (default) The kernel can respond to arp requests with addresses
        from other interfaces. This may seem wrong but it usually makes
        sense, because it increases the chance of successful communication.
        IP addresses are owned by the complete host on Linux, not by
        particular interfaces. Only for more complex setups like load-
        balancing, does this behaviour cause problems.

        arp_filter for the interface will be enabled if at least one of
        conf/{all,interface}/arp_filter is set to TRUE,
        it will be disabled otherwise”

        Nevertheless, I tested it somewhat (it’s actually a production system, so limited scope here). My test was dd if=/dev/zero of=testfile count=50000000, which results in a 26 GB file. Whilst looking at iostat, I see traffic over all 4 “paths”. This worked out speed-wise at ~896 Mbit/sec, which seems awfully close to 1 Gbit (when taking into account the overhead of network protocols). And the strange thing: I see traffic on all 4 NICs. It’s just an awful coincidence that the write speed seems to be almost exactly 1 Gbit … (it’s a PE2950 w/ 16 GB RAM). The PS6000XV unit had negligible other load at that time.

        Re-running my dd test:

        [user@host]# time dd if=/dev/zero of=testfile count=50000000
        50000000+0 records in
        50000000+0 records out
        25600000000 bytes (26 GB) copied, 191.673 seconds, 134 MB/s

        real 3m11.733s
        user 0m19.957s
        sys 2m51.032s

        again, very close to 1Gbit…

        So I’m either stuck with the fact that I’ve got a misconfiguration somewhere, or that the performance of the unit really is ~1 Gbit in terms of writes, or that the way to get a performing system still eludes me.

        • Thomas
          August 15, 2010 | 7:00 pm

          Depending on your switch layout: do you have Spanning Tree or Rapid Spanning Tree enabled on the ports used by the host(s) and array(s)? That could also be the reason you’re only seeing traffic on one interface.

        • Andy
          November 4, 2010 | 10:54 am

          I had the same problem as you, so firstly thanks for the heads-up on the arp_filter, that solved the reachability problems I had.

          Now, onto the throughput. The reason I moved from etherchannel (just like you did) is because Dell told me I should be using multipath as it’s the ‘supported’ method (same as you).

          They said if I use multipath then connections to the group IP will get distributed among the eth0 and eth1 interfaces and I’ll be able to realise more than the ~1gbps limit which I had with the etherchannel.

          Obviously you’re not seeing these benefits, and although I’m yet to test my changes, I can see connections are still only being made to eth0 on the EL SAN.

          I feel we’re both possibly being spun out a bit by Dell. I’m going to try and get to the bottom of it, but I wondered if you had anything to add to your experience, since it was a few months ago now.

          Thanks.

          • David Velasquez
            July 12, 2011 | 11:32 pm

            Hi,

            When using multipath with CentOS 5.6 I can see traffic split across different target IPs on the same SAN, but it’s only using 1 interface to do that; it just doesn’t split the traffic across both interfaces. It does fail over, however.

            Can you update your posts? Does it need static routes to spread the bandwidth across both NICs, or does multipath need something special to be configured?

            Best regards.

          • Thomas
            July 13, 2011 | 1:31 pm

            Not sure what I need to update?

            That said, no, AFAIK you do not need to use static routes (assuming the dynamic ones are correct). The key is to ensure that the MPIO engine is configured for Round Robin (or Shortest Queue First, if available) and that, if the target ports are on the same subnet (i.e. 192.168.1.0/255.255.255.0), the sysctl.conf tunables mentioned elsewhere on this site are set correctly.
