Dear #followerpower,
does #bonding in #Linux with balance-alb work with directly attached cables between two servers, or is a switch needed (even though the switch wouldn't need to support balance-alb, unlike #lacp)?
3 servers, each server has 2 directly connected ports to each other server in a bond, and bond0 and bond1 on each server are connected to a vmbr1 interface.
Basically this is working, but with iperf I didn't see any increase in bandwidth.
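For context, the bond/bridge part of /etc/network/interfaces on each node looks roughly like this (NIC names and the address are placeholders):

```
auto bond0
iface bond0 inet manual
    # placeholder NIC names: the pair going to server B
    bond-slaves enp1s0f0 enp1s0f1
    bond-mode balance-alb
    bond-miimon 100

auto bond1
iface bond1 inet manual
    # placeholder NIC names: the pair going to server C
    bond-slaves enp2s0f0 enp2s0f1
    bond-mode balance-alb
    bond-miimon 100

auto vmbr1
iface vmbr1 inet static
    address 10.10.10.1/24   # placeholder address
    bridge-ports bond0 bond1
    bridge-stp on
    bridge-fd 0
```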
@ij balance-alb works by selectively reporting different MAC values for the two NICs to different clients, so it only load-balances across hosts.
You've presumably also got some form of spanning tree running, which will selectively disable some links; otherwise I think you'd have a loop and the links would all be saturated?
Testing-wise, make sure to run iperf with multiple TCP streams, as most bonding schemes cause any given TCP connection to only use one of the available links.
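For example, with iperf3 you can compare a single stream against several parallel ones (the address is just a placeholder):

```
# single TCP stream - will typically stay on one link of the bond
iperf3 -c 10.10.10.2 -t 30

# 8 parallel TCP streams - more likely to exercise both links
iperf3 -c 10.10.10.2 -t 30 -P 8
```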
@dwm Yeah, STP might be a thing. I've enabled `bridge-stp on`.
The alternative would be to use OVS bonds, but there I didn't see anything like balance-alb; on the other hand, I'd be able to keep the existing RSTP settings:
ovs_options other_config:rstp-path-cost=2000 other_config:rstp-port-mcheck=true vlan_mode=native-untagged other_config:rstp-enable=true other_config:rstp-port-auto-edge=false other_config:rstp-port-admin-edge=false
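For illustration, an OVS bond stanza in /etc/network/interfaces would look roughly like this; bond_mode=balance-slb is just an assumed example of a switch-independent mode (source-MAC based, no LACP needed), and the NIC names are placeholders:

```
auto bond0
iface bond0 inet manual
    ovs_type OVSBond
    ovs_bridge vmbr1
    # placeholder NIC names
    ovs_bonds enp1s0f0 enp1s0f1
    # the existing rstp other_config options would go here as well
    ovs_options bond_mode=balance-slb
```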
@ij I've not played with openvswitch, but from reading the documentation from that page, it appears to provide similar capabilities to Linux bridging and channel bonding.
RSTP is the Rapid Spanning Tree Protocol, so it would still result in links being disabled to prevent loops at the Ethernet level.
Given your use case, I suspect that either using IP routing directly, or perhaps layering VXLAN on top of it, may work better for you; but I have no experience with that.
@dwm RSTP is already in place for the current mesh setup with a single 10 GbE link between the hosts.
Having bonds where one link of a bond is disabled because of RSTP doesn't make much sense to me. I would expect the bond driver to handle the links within a specific bond on its own, and RSTP to handle things between the bonds, preventing loops between hosts rather than between interfaces.
@ij Ahhh, I think I may have misunderstood.
I thought from your original description that you had 3 servers, each with two ports — with one link between each server in a triangle, for a total of three connections.
What it sounds like you have is 4 ports on each server, with a pair of links between each pair, for a total of six connections.
In which case, yes, both links in each bond will run, but only on the bonded pairs that (R)STP doesn't administratively disable to prevent loops.
@dwm Yeah... actually I have the setup depicted in the screenshot, though some links are missing from it...
each server has:
- one link to the switch for Internet connectivity
- 2 links for internal Proxmox network
- 2 links for internal Ceph network
Now I wanted to move the dedicated Ceph network into a bond together with the Proxmox network and later use VLANs to separate them again.
@ij Cool, makes sense!
Okay, so if the problem is that you're not seeing speedups over the bonded pair, then I'd try switching to round-robin mode at both ends. This will cause packets to always be split across all active links regardless of source/destination/port/TCP session.
If that doesn't solve that problem, something odd is happening.
Getting multipath working between the three nodes directly will require cleverness I've not played with.
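A minimal sketch of the bond stanza for that, in ifupdown syntax (NIC names are placeholders):

```
auto bond0
iface bond0 inet manual
    # placeholder NIC names
    bond-slaves enp1s0f0 enp1s0f1
    # round-robin: stripe packets across all active slaves
    bond-mode balance-rr
    bond-miimon 100
```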
@ij … though something based on https://pve.proxmox.com/wiki/Software-Defined_Network#pvesdn_setup_examples might work quite nicely?
@dwm @ij
Bond mode 0 (round-robin) may cause out-of-order packets. You need to make sure that all your applications / TCP stacks can deal with that!
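If you do test RR, one knob that may help the TCP side tolerate reordering is net.ipv4.tcp_reordering; just a suggestion, I haven't tuned this for your kind of setup:

```
# allow TCP to tolerate more reordering before treating it as loss (kernel default is 3)
sysctl -w net.ipv4.tcp_reordering=10
```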
In my homelab - a 3-node Proxmox cluster - I configured separate physical links for migration/storage (1x 10GbE), heartbeat (1x 1GbE) and Proxmox/Internet (2x GbE LACP + switch with multiple VLANs) rather than combining all kinds of traffic into a single physical bond.
Q: What is the link speed of your NICs? Does your Ceph storage saturate 10GbE?
@dwm @ij
I had problems with RR causing performance degradation, probably due to packet reordering. My network config at the time was different, though, since all traffic went over L2 switches from different manufacturers and different NICs.
@ij's setup with direct physical links might be OK - needs testing.
I'd be cautious about saturating links in RR. I would recommend increasing the link speed instead.
@markus @dwm All links are 10 GbE between the nodes. Only the uplink to Internet is 1 GbE.
When moving VMs around I can nearly saturate 10 GbE. With Ceph alone, I don't think I've ever seen anything close to that. But maybe this could improve if packets can travel simultaneously across two different links?
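I could measure what Ceph itself can push with something like rados bench (the pool name is just an example):

```
# 30-second write benchmark against a test pool; keep the objects for the read test
rados bench -p testpool 30 write --no-cleanup
# sequential read benchmark, then remove the benchmark objects
rados bench -p testpool 30 seq
rados -p testpool cleanup
```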
@ij @dwm
I have no experience with Ceph (I've deployed GlusterFS, for reasons).
To my knowledge Ceph is a replicated clustered filesystem. I do not understand why you get saturated 10GbE links when live-migrating VMs.
Isn't there a replica of all the chunks belonging to a VM image file available on both the src and dst node, making it unnecessary to copy image files around?
Or is the amount of RAM of your VMs that big? RAM migration could explain saturated links.
@markus @dwm Ah, not all VMs are using Ceph as storage. Most have local storage on ZFS and need cloning during migration. And the memory state needs to be migrated as well, of course.
Besides that, yes, with 3 nodes and replicas = 3 on a per-host basis, every node has a full copy of the data. With more nodes, e.g. 5, you could still have 3 copies, so not every host would have a full copy, but all data would still be replicated across 3 nodes.
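That replica count is the pool's size setting (the pool name below is just an example):

```
# show / set the replication factor of a pool
ceph osd pool get vm-pool size
ceph osd pool set vm-pool size 3
```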
The largest VMs have 48 GB RAM.
@ij @dwm
ZFS: Do you replicate VM images on a regular interval to lower the amount of storage data that needs to be copied during live-migration? (See the sketch below.) Is there enough disk space?
Network: Would it make sense to separate the storage and migration networks, as there are unused NICs available?
Migration: The large amount of RAM that needs to be migrated might still saturate your NICs, but it won't cause bandwidth conflicts with your Ceph storage.
Storage: A separate network will keep your Ceph storage stable.
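Regarding the ZFS point above: if I remember the CLI correctly, a Proxmox replication job can be set up with pvesr, roughly like this (VM ID, job number, node name and schedule are just examples):

```
# replicate VM 100 to node pve2 every 15 minutes (IDs/names are examples)
pvesr create-local-job 100-0 pve2 --schedule "*/15"
```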
@markus @dwm Yes, separate networks for Proxmox and Ceph are what I'm currently using.
Here's a screenshot of 10 GbE being saturated, although that's from migrating VMs from the two other servers to one server, and Proxmox has dedicated graphs for the network interfaces.
ZFS: all 3 nodes have 2x 18 TB HDD, 1x 2 TB SSD (both Ceph), 3x 4 TB SSD for ZFS and 2x 2 TB NVMe. So, basically enough storage, I'd say... 88 TB usable storage in the cluster in total, according to Proxmox...
@ij @dwm
Ah... Sorry I misunderstood the naming of your networks:
- "Proxmox"=migration network (1x 10GbE)
- "Ceph"=storage network (1x 10GbE)
Q: Why is there only 1 link between Baldur and Pepper? Shouldn't there be at least 3 Ceph nodes?
HA: I have in principle the same setup
- No HA config in the case of an active-active service configuration (ZFS).
- HA config in the case of an active-backup service. Here I use ZFS replication with a 15-minute interval, or clustered storage.
Hm... what to do about the bandwidth?
@ij @dwm
So we are back to the initial question of bonding modes: balance-alb vs. round-robin.
My 2 cents:
- I know too little about the hash algorithm (if any) of balance-alb and whether your NICs support it. Nevertheless this would be my preference, even though I have never tested that bond mode.
- Round-robin would be my second choice, as my experience with it hasn't been that good.
Sounds like this is a production system. Therefore I would run in-depth testing on separate machines with high link saturation.
@ij @dwm
Replication dependent on load? I'm not sure whether that helps, since you never know what load will occur in the next second/minute. Maybe limit the replication bandwidth? Shorter intervals to limit the duration?
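If I remember correctly, Proxmox can also cap migration traffic cluster-wide in /etc/pve/datacenter.cfg, roughly like this (values are in KiB/s and just an example):

```
bwlimit: migration=300000,default=500000
```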
I think running into resource limits/conflicts is always a problem and best solved by avoiding them. In your case, consider upgrading to 40GbE. Honestly, I would keep the storage and migration networks separate and accept a longer migration time rather than risking an unstable system.
@markus @dwm 20 years ago I started writing a render server for Maya in Python. The idea was to have a self-optimizing render server that distributes the load in such a way that render times are minimized. Without going into detail, I think Proxmox could do better in terms of resource usage, like migrating VMs in less busy hours during the day so that a host can run idle if the load on the VMs is small. Distribute the VMs so that the load is more or less equal in regards of....
@markus @dwm RAM, Disk and CPU usage over time. For this Proxmox could rely on monitoring data over the past weeks and try to estimate the load during the day. For example consolidating the VMs during night on one host and distributing the load across all hosts during working hours.
Maybe this would require things like using Ceph as the storage backend or a monitoring system to collect data. However, I think Proxmox would benefit from such features.