As part of my self-learning activities at work, I’m playing with a multi-homed EM64T based server, Linux (Fedora 13), the iscsi-initiator-utils and an EqualLogic PS series RAID array. In the previous installment, I gave an overview on how to make the Volume(s)/LUN(s) exported from the array appear on and be usable from the Linux host. In this installment I’ll discuss how to configure the dm-multipath sub-system to aggregate multiple network paths for availability and load-balancing purposes.
But before we get into the nitty-gritty of that, I just wanted to touch on a problem I’ve been having in setting up multipath over IPv4 on my Fedora 13 host. Now, keep in mind that the Fedora host I’m using has been “tweaked” and “messed with” (a lot), so it ain’t no “fresh install” here… That said, I’ve been having problems getting both of my array-facing NICs to successfully ping the EqualLogic Group IP address.
So, if you’re able to ping the EqualLogic array’s group IP address from both Ethernet NICs, setting up the iSCSI initiator as well as the DM multipath module(s) is very much a “straight forward” exercise.
Create two new open-iscsi initiator interfaces
# iscsiadm --mode iface --op=new --interface iscsi-0
# iscsiadm --mode iface --op=new --interface iscsi-1
Validate that the new interfaces are stored
# iscsiadm --mode iface
default tcp,<empty>,<empty>,<empty>,<empty>
iser iser,<empty>,<empty>,<empty>,<empty>
iscsi-0 tcp,<empty>,<empty>,<empty>,<empty>
iscsi-1 tcp,<empty>,<empty>,<empty>,<empty>
Bind the physical NICs to the newly created initiator interfaces
# iscsiadm --mode iface --op=update --interface iscsi-0 --name=iface.net_ifacename --value=eth2
# iscsiadm --mode iface --op=update --interface iscsi-1 --name=iface.net_ifacename --value=eth3
Validate the bindings
# iscsiadm --mode iface --interface iscsi-0
# BEGIN RECORD 2.0-872
iface.iscsi_ifacename = iscsi-0
iface.net_ifacename = eth2
iface.ipaddress = <empty>
iface.hwaddress = <empty>
iface.transport_name = tcp
iface.initiatorname = <empty>
# END RECORD
# iscsiadm --mode iface --interface iscsi-1
# BEGIN RECORD 2.0-872
iface.iscsi_ifacename = iscsi-1
iface.net_ifacename = eth3
iface.ipaddress = <empty>
iface.hwaddress = <empty>
iface.transport_name = tcp
iface.initiatorname = <empty>
# END RECORD
# iscsiadm --mode iface
default tcp,<empty>,<empty>,<empty>,<empty>
iser iser,<empty>,<empty>,<empty>,<empty>
iscsi-0 tcp,<empty>,<empty>,eth2,<empty>
iscsi-1 tcp,<empty>,<empty>,eth3,<empty>
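If you'd rather not eyeball that listing, here is a minimal sketch of a check you could script. It assumes the NIC name sits in the fourth comma-separated field, which is where eth2 and eth3 appear in the output above. The function name nic_for_iface is my own invention, not part of iscsi-initiator-utils.

```shell
# nic_for_iface IFACE
# Reads `iscsiadm --mode iface` output on stdin and prints the NIC bound to
# IFACE. Assumes the layout shown in the listing:
#   <iface> <transport>,<hwaddress>,<ipaddress>,<net_ifacename>,<initiatorname>
# Exits non-zero if IFACE is not listed at all.
nic_for_iface() {
    awk -v want="$1" '
        $1 == want {
            split($2, f, ",")   # f[4] is the net_ifacename column
            print f[4]
            found = 1
        }
        END { exit found ? 0 : 1 }
    '
}

# Usage on a live host (not run here):
#   iscsiadm --mode iface | nic_for_iface iscsi-0
```

An interface that has not been bound yet would print <empty> rather than a NIC name, so a wrapper script can treat that string as "not configured".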
Now that we’ve got the initiator interfaces configured and bound to the physical 1 or 10 GigE NICs, we need to make sure every initiator NIC is logged into the available targets. By the way, you can also configure your Fedora host environment to use bridged interfaces if you’ve got a limited number of NICs in the system and want to use KVM/libvirt to import LUNs directly to a virtualized guest instance.
The easiest way to log in all of the interfaces you’ve defined above to the available targets is to list[1] the target(s) you discovered previously and re-issue the login command for each of them (the LUNs/Volumes are “available” and “seen” because you’ve set up the ACL on the LUN/Volume to permit access from this host, right?):
# iscsiadm --mode node --login --targetname iqn.2001-05.com.equallogic:<ID of Volume/LUN>
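With more than a couple of volumes, typing the login command per target gets tedious. The sketch below only prints the commands, so you can review them before running anything. It assumes `iscsiadm --mode node` prints one "portal,tpgt iqn" pair per line; the function name login_cmds is hypothetical.

```shell
# login_cmds IFACE...
# Reads `iscsiadm --mode node` output on stdin and prints one login command
# per discovered target per initiator interface. Nothing is executed here.
login_cmds() {
    # Second whitespace-separated field is the target IQN; de-duplicate it.
    targets=$(awk 'NF >= 2 { print $2 }' | sort -u)
    for iqn in $targets; do
        for ifc in "$@"; do
            printf 'iscsiadm --mode node --targetname %s --interface %s --login\n' \
                "$iqn" "$ifc"
        done
    done
}

# Usage on a live host (not run here):
#   iscsiadm --mode node | login_cmds iscsi-0 iscsi-1        # review
#   iscsiadm --mode node | login_cmds iscsi-0 iscsi-1 | sh   # execute
```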
The next step is to edit /etc/multipath.conf (or, if you’re so inclined, /etc/multipath/multipath.conf, which can be the symlink target for /etc/multipath.conf), assign an alias that can be accessed as /dev/mapper/<alias> for a given (multipathed) device, and then “do something” with it (use it as an LVM2 volume, a raw device, host a file system on it, or whatever).
So, before we “do the work” to configure the multipath environment, we need to add the following information to the /etc/multipath.conf file, or ensure it is already present (by the way, the multipath utilities expect their configuration file to be readable at /etc/multipath.conf!):
## Use user friendly names, instead of using WWIDs as names.
defaults {
    user_friendly_names yes
    find_multipaths yes
}
## Blacklist local devices (no need to have them accidentally "multipathed")
blacklist {
    wwid scsi-SATA_WDC_WD800AAJS-1_WD-WMAM9ANC0895
    devnode "^sd[a]$"
    devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
    devnode "^hd[a-z]"
    devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
}
## For EqualLogic PS series RAID arrays
devices {
    device {
        vendor "EQLOGIC"
        product "100E-00"
        path_grouping_policy multibus
        getuid_callout "/lib/udev/scsi_id --whitelisted --device=/dev/%n"
        features "1 queue_if_no_path"
        path_checker readsector0
        path_selector "round-robin 0"
        failback immediate
        rr_min_io 10
        rr_weight priorities
    }
}
## Assign human-readable name (aliases) to the World-Wide ID's (names) of the devices we're wanting to manage
## Identify the WWID by issuing # scsi_id --whitelisted --page=0x83 /dev/<path to sdX for one of the device paths>
multipaths {
    multipath {
        wwid 36090a01850e08bd883c4a45bfa0330ba
        alias benchmark2
    }
    multipath {
        wwid 36090a01850e04bd683c4745bfa03b07c
        alias benchmark1
    }
}
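Each volume needs its own multipath entry pairing a WWID with an alias, as in the multipaths section above. If you add volumes often, a tiny helper can print the entry for you to paste in. This is only a sketch: the function name multipath_stanza is made up, and it does no validation of the WWID it is given.

```shell
# multipath_stanza WWID ALIAS
# Prints one multipath { } entry suitable for pasting inside the
# multipaths { } section of /etc/multipath.conf.
multipath_stanza() {
    printf '    multipath {\n        wwid %s\n        alias %s\n    }\n' "$1" "$2"
}

# Example, using a WWID from the configuration above:
#   multipath_stanza 36090a01850e04bd683c4745bfa03b07c benchmark1
```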
# service multipathd start
# chkconfig multipathd on
Verify that the device mapper multipath configuration is ‘complete’ with the following commands:
# multipath -v2
# multipath -ll
benchmark2 (36090a01850e08bd883c4a45bfa0330ba) dm-0 EQLOGIC,100E-00
size=100G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=2 status=active
  |- 6:0:0:0 sdd 8:48 active ready running
  `- 5:0:0:0 sdb 8:16 active ready running
benchmark1 (36090a01850e04bd683c4745bfa03b07c) dm-1 EQLOGIC,100E-00
size=100G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=2 status=active
  |- 4:0:0:0 sde 8:64 active ready running
  `- 3:0:0:0 sdc 8:32 active ready running
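What you want to see is two "active ready" paths under each alias, one per initiator interface. Here is a rough sketch that counts them from the `multipath -ll` output. It assumes the layout shown above (a header line containing dm-N, then path lines starting with an H:C:T:L address); path_counts is my own name for it.

```shell
# path_counts
# Reads `multipath -ll` output on stdin and prints "<map> <healthy paths>"
# for every multipath map, in the order the maps appear.
path_counts() {
    awk '
        # Map header, e.g. "benchmark2 (3609...) dm-0 EQLOGIC,100E-00"
        / dm-[0-9]+ / { map = $1; order[++n] = map; count[map] = 0; next }
        # Path line, e.g. "|- 6:0:0:0 sdd 8:48 active ready running"
        /[0-9]+:[0-9]+:[0-9]+:[0-9]+ / && /active ready/ { count[map]++ }
        END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }
    '
}

# Usage on a live host (not run here):
#   multipath -ll | path_counts
```

A map reporting fewer paths than you have initiator interfaces usually means one of the interfaces never logged in to that target.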
[1] = # iscsiadm --mode node
I already have an existing RHEL5 connection to a PS-6000 that is not using multipath (it’s actually bonded + MLT), and I’d like to migrate to the multipath setup.
Have you tested what happens if you add virtual interfaces (IP addresses) to the bonded interface, and then run multipath over that? I’m thinking that the bridge could be replaced by the bond/MLT instead.
I’ve not tested using IP aliases and I’m not sure it’d be supported (not that the array can really tell, but it’s a question of what’s been tested by Dell’s Q/A group).
For a supported configuration, Dell would tell you to split the bond, configure the NIC’s presently bonded as physical devices with IP addresses (i.e. 10.10.10.1, 10.10.10.2, etc) on the same subnet as the group IP (if possible) and then configure multipath.conf/multipathd.
The following would also work, but I’m not clear on whether it’s an officially supported configuration. Since you’ve already got a bonded interface up, have you considered simply using the iSCSI initiator utilities to configure two (or more, depending on the size of your bond) iSCSI session interfaces, re-discovering the LUN(s), logging back in to them and starting the multipathd service?
# iscsiadm -m iface -I bond0_1 --op=new
# iscsiadm -m iface -I bond0_2 --op=new
# iscsiadm -m iface -I bond0_1 --op=update -n iface.net_ifacename -v bond0
# iscsiadm -m iface -I bond0_2 --op=update -n iface.net_ifacename -v bond0
# iscsiadm -m discovery -I bond0_1 -t st -p
# iscsiadm -m discovery -I bond0_2 -t st -p
I tend to remove the LUNs/Volumes I’m not using on the host (like the vss-control LUN, etc.) with # iscsiadm -m node --op=delete -T
# iscsiadm -m node --login -T [iqn for LUN/Volume]
Edit /etc/multipath.conf and start the multipathd service as well as check that the /dev/mapper device is present ( # multipath -v2 ; multipath -ll )
Again: I’ve not confirmed that we’ll actually support this setup (meaning, I know it works in my own sandbox environment, but YMMV and if you call support they could be telling you it’s not officially supported)
So the issue that I encountered with this setup (over the bonded interfaces) is that all the traffic goes via one of the interfaces, for example:
Destination Gateway Genmask Flags MSS Window irtt Iface
10.100.110.0 0.0.0.0 255.255.255.0 U 0 0 0 eth3
10.100.110.0 0.0.0.0 255.255.255.0 U 0 0 0 eth4
10.100.110.0 0.0.0.0 255.255.255.0 U 0 0 0 eth5
10.100.110.0 0.0.0.0 255.255.255.0 U 0 0 0 eth2
In this case, eth3 takes all the load, despite there being 4 iSCSI sessions open (one session per interface).
In effect, while there might be load-balancing via multipath, everything is bottlenecked through one NIC?
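One way to confirm which NIC is really carrying the load is to compare the per-interface byte counters in /proc/net/dev before and after a test run. A minimal sketch follows; it assumes the usual /proc/net/dev layout (two header lines, then receive bytes as the first counter and transmit bytes as the ninth), and nic_bytes is a name made up for this example.

```shell
# nic_bytes
# Reads /proc/net/dev-style text on stdin and prints
# "<iface> <rx_bytes> <tx_bytes>" for every interface.
nic_bytes() {
    awk 'NR > 2 {
        sub(/:/, " ")        # "eth3:500000" and "eth3: 500000" both split cleanly
        print $1, $2, $10    # name, receive bytes, transmit bytes
    }'
}

# Usage on a live host (not run here):
#   nic_bytes < /proc/net/dev > before.txt
#   ... run the I/O test ...
#   nic_bytes < /proc/net/dev > after.txt
```

If only one interface's transmit counter moves between the two snapshots, the sessions may be spread across NICs on paper while the packets all leave through a single one.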
I’ve seen similar behavior in my personal iSCSI network when using “mode=802.3ad” for the “BONDING_OPTS=” line in ifcfg-bond0 (and the switch configured with Link Aggregation/Trunking for the ports). After switching to “mode=balance-alb” and turning off trunking on the switch ports (LACP is off), my throughput increased and I’m seeing traffic on both of my bonded NICs (this is a personal iSCSI SAN in my basement, no EqualLogic equipment included).
Also, how come your eth{2,3,4,5} all have IP addresses? If you’re using the bonding driver, I would have expected them all to be IP-less slaves (see “SLAVE” in the output from # ifconfig eth{2,3,4,5}), with “bond0” (or whatever you decide to call it) as the only entity having an IP address.
The output you included reminds me more of a solution w/no bonding enabled (i.e. the standard EqualLogic recommended configuration) and the dm-multipath driver loaded & configured?
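If you're unsure whether bonding is actually in play on a host, the bonding driver's status file tells you the mode and the enslaved NICs. Here is a small sketch that pulls out just those lines. It assumes the status text uses "Bonding Mode:" and "Slave Interface:" labels, and bond_summary is a hypothetical helper name.

```shell
# bond_summary
# Reads bonding status text (as found in /proc/net/bonding/<bond>) on stdin
# and prints the bonding mode followed by each slave interface.
bond_summary() {
    awk -F': ' '
        /^Bonding Mode:/    { print "mode=" $2 }
        /^Slave Interface:/ { print "slave=" $2 }
    '
}

# Usage on a live host (not run here):
#   bond_summary < /proc/net/bonding/bond0
```

On a host with no bond configured that file simply won't exist, which is itself the answer.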
Sorry, maybe my use of words was ambiguous – by “over the bonded interfaces” i really meant “compared to the bonded interfaces”
I forgot to mention: I switched this morning from the bonded method to the “recommended” method with multipath and individual NICs. I’ve had the bonded setup working great for 11.5 months, but have always been disappointed by the performance. I used bonded/MLT (LACP) because I wanted the redundancy at the network end of things.
Since I thought this was the bottleneck, I’ve now tried multipath with 4 single NICs to see if there’s much difference.
In my initial tests, I mounted an NFS share from a separate system using the IP address I have assigned to eth5. Interestingly, all of this traffic actually goes via the eth3 interface, which is listed first in the routing table (it’s on the same subnet as eth2-5). I’d have expected eth5, since that’s where the IP is “located”.
I then saw the routing table, and that seems to be the reason (i’d read your previous post on the rp_filter, and that was already defaulted to 0 – it’s a RHEL5 system).
Reading the kernel docs for “arp_filter” (not rp_filter), it almost implies that i need to setup source-based routing:
“arp_filter – BOOLEAN
1 – Allows you to have multiple network interfaces on the same
subnet, and have the ARPs for each interface be answered
based on whether or not the kernel would route a packet from
the ARP’d IP out that interface (therefore you must use source
based routing for this to work). In other words it allows control
of which cards (usually 1) will respond to an arp request.
0 – (default) The kernel can respond to arp requests with addresses
from other interfaces. This may seem wrong but it usually makes
sense, because it increases the chance of successful communication.
IP addresses are owned by the complete host on Linux, not by
particular interfaces. Only for more complex setups like load-
balancing, does this behaviour cause problems.
arp_filter for the interface will be enabled if at least one of
conf/{all,interface}/arp_filter is set to TRUE,
it will be disabled otherwise”
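The last sentence of that excerpt says the setting takes effect when either the "all" value or the per-interface value is set. A minimal sketch of checking that rule per NIC follows. The function name arp_filter_effective is invented here, and the optional second argument only exists so the logic can be tried against a scratch directory instead of the live /proc tree.

```shell
# arp_filter_effective IFACE [CONF_DIR]
# Prints 1 if arp_filter is in effect for IFACE, i.e. if either
# conf/all/arp_filter or conf/IFACE/arp_filter is 1; prints 0 otherwise.
# Missing files are treated as 0.
arp_filter_effective() {
    dir=${2:-/proc/sys/net/ipv4/conf}
    all=$(cat "$dir/all/arp_filter" 2>/dev/null || echo 0)
    own=$(cat "$dir/$1/arp_filter" 2>/dev/null || echo 0)
    if [ "$all" = 1 ] || [ "$own" = 1 ]; then
        echo 1
    else
        echo 0
    fi
}

# Usage on a live host (not run here):
#   for nic in eth2 eth3 eth4 eth5; do
#       echo "$nic: $(arp_filter_effective "$nic")"
#   done
```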
Nevertheless, I tested it somewhat (it’s actually a production system, so limited scope here). My test was dd if=/dev/zero of=testfile count=50000000, which results in a 26 GB file. Whilst looking at iostat, I see traffic over all 4 “paths”. This worked out speed-wise at ~896 Mbit/sec, which seems awfully close to 1 Gbit (when taking into account the overhead of network protocols). And the strange thing: I see traffic on all 4 NICs. It’s just an awful coincidence that the write speed seems to be almost exactly 1 Gbit … (it’s a PE2950 w/ 16 GB RAM). The PS6000XV unit had negligible other load at that time.
Re-running my DD test:
[user@host]# time dd if=/dev/zero of=testfile count=50000000
50000000+0 records in
50000000+0 records out
25600000000 bytes (26 GB) copied, 191.673 seconds, 134 MB/s
real 3m11.733s
user 0m19.957s
sys 2m51.032s
again, very close to 1Gbit…
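To put the dd figures on the same scale as the link speed, the byte count and elapsed time can be converted to Gbit/s. This is just arithmetic on the numbers dd printed above, with decimal units assumed (1 MB = 1,000,000 bytes, which matches how dd reports "26 GB"); the function name throughput is my own.

```shell
# throughput BYTES SECONDS
# Prints the transfer rate in MB/s and Gbit/s, using decimal units.
throughput() {
    awk -v b="$1" -v s="$2" 'BEGIN {
        printf "%.1f MB/s, %.2f Gbit/s\n", b / s / 1000000, b * 8 / s / 1000000000
    }'
}

throughput 25600000000 191.673   # prints: 133.6 MB/s, 1.07 Gbit/s
```

Keep in mind that dd writing from /dev/zero to a file also goes through the host's page cache, so a figure like this is a rough indicator rather than a clean measurement of the iSCSI paths.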
So I’m either stuck with the fact that I’ve got a misconfiguration somewhere, or that the performance of the unit really is ~1 Gbit in terms of writes, or that the way to get a performing system still eludes me.
Depending on your switch layout: do you have Spanning Tree or Rapid Spanning Tree enabled on the ports used by the host(s) and array(s)? That could also be the reason you’re only seeing traffic on one interface.
I had the same problem as you, so firstly thanks for the heads-up on the arp_filter, that solved the reachability problems I had.
Now, onto the throughput. The reason I moved from etherchannel (just like you did) is because Dell told me I should be using multipath as it’s the ‘supported’ method (same as you).
They said if I use multipath then connections to the group IP will get distributed among the eth0 and eth1 interfaces and I’ll be able to realise more than the ~1gbps limit which I had with the etherchannel.
Obviously you’re not seeing these benefits, and although I’m yet to test my changes, I can see connections are still only being made to eth0 on the EL SAN.
I feel we’re both possibly being spun out a bit by Dell. I’m going to try and get to the bottom of it, but I wondered if you had anything to add to your experience, since it was a few months ago now.
Thanks.
Hi,
When using multipath with CentOS 5.6 I can see traffic split across different target IPs on the same SAN, but it’s only using 1 interface to do that; it just doesn’t split the traffic across both interfaces. It does fail over, however.
Can you update your posts? Does it need static routes to spread the bandwidth across both NICs, or does multipath need something special to be configured?
Best regards.
Not sure what I need to update?
That said, no, AFAIK you do not need to use static routes (assuming the dynamic ones are correct). The key is to ensure that the MPIO engine is configured for Round Robin (or Shortest Queue First, if available) and that, if the target ports are on the same subnet (e.g. 192.168.1.0/255.255.255.0), the sysctl.conf tunables mentioned elsewhere on this site are set correctly.