Issue and Background
This post highlights some of the caveats involved in setting up Citrix ADC HA in a Cisco ACI environment. It covers the issues we encountered when performing a force failover on Citrix ADC, and the intermittent synchronization issues in an HA pair of Citrix ADC appliances attached to a Cisco ACI fabric.
In my role as a principal consultant for Ferroque, I have deployed Citrix ADC in a high availability (HA) pair dozens of times. It is something we take for granted. Recently, I was engaged on a project to deploy a brand-new HA pair of Citrix ADC VPX-1000 instances for a customer and to integrate Citrix Gateway with their Citrix infrastructure. It sounds like a straightforward deployment, but it was not! We ran into multiple issues setting up Citrix ADC HA with the Cisco ACI fabric.
The symptoms of the issue were as follows:
- Initiating a force failover on the primary Citrix ADC VPX instance resulted in a successful failover to the secondary instance, and the old secondary instance became the new primary node in the HA pair. However, the old primary instance, now the new secondary in the HA setup, crashed and would not come back up on the network.
According to the Cisco ACI Fabric Endpoint Learning Whitepaper, although Cisco ACI can detect MAC and IP address movement between leaf switch ports, leaf switches, bridge domains, and EPGs, it does not detect the movement of an IP address to a new MAC address if the new MAC address is on the same interface and in the same EPG as the old MAC address. The default behavior of the Cisco ACI fabric is to learn via a UDP unicast lookup in the endpoint database, so there is no need to broadcast or flood ARP. For an HA setup on the Citrix ADC, however, the Cisco ACI fabric needs to be able to learn endpoints from Gratuitous ARP (GARP). GARP is a special type of ARP broadcast: an unsolicited ARP request or reply sent to all hosts on the local network, which ensures that the ARP tables of network devices are populated with the most up-to-date information without delay.
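To make the GARP description concrete, here is a minimal Python sketch that builds (but does not send) the bytes of a gratuitous ARP request. The MAC and IP values are made-up examples, and the function names are mine for illustration; this is not how Citrix ADC generates its GARPs, only the standard ARP layout with the sender and target IP set to the same address.

```python
import socket
import struct

BROADCAST_MAC = bytes.fromhex("ffffffffffff")


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
    return bytes.fromhex(mac.replace(":", ""))


def build_garp(sender_mac: str, sender_ip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request."""
    mac = mac_to_bytes(sender_mac)
    ip = socket.inet_aton(sender_ip)
    # Ethernet header: broadcast destination, our MAC as source, EtherType 0x0806 (ARP).
    eth = BROADCAST_MAC + mac + struct.pack("!H", 0x0806)
    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # hardware length 6, protocol length 4, operation 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # What makes it gratuitous: sender IP and target IP are the same address.
    # The target MAC is left as zeros in the request form.
    arp += mac + ip + bytes(6) + ip
    return eth + arp


# Hypothetical values: a locally administered MAC and a documentation-range IP.
frame = build_garp("02:00:00:00:00:0a", "192.0.2.10")
print(len(frame), frame.hex())
```

Any device that processes this broadcast can update its mapping for 192.0.2.10 to the sender's MAC without having asked for it, which is the behavior an HA failover relies on.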
As per CTX208384, Citrix ADC issues GARP for all ADC-owned IP addresses (for example, the NetScaler IP address (NSIP) and Subnet IP address (SNIP)) after an HA failover, including event state changes on the secondary Citrix ADC appliance. When the new primary Citrix ADC appliance takes over, it sends out GARPs for the NSIP/SNIP/VIPs, and the Cisco ACI leaf switch updates its endpoint table with the new MAC/IP information. Cisco ACI, however, does not behave like a traditional router that relies only on ARP/GARP to register endpoints; it also learns endpoints from data-plane traffic. When the old primary Citrix ADC appliance resets old connections or lets them time out, the IP addresses of the NSIP/SNIP/VIPs are unintentionally relearned against the wrong MAC address from that data-plane traffic. Cisco ACI then forwards packets destined for the VIP to the old primary Citrix ADC appliance, and this causes an outage. In short, the new primary appliance sends out GARPs, but as soon as the old primary appliance sends a TCP reset or ACK, the endpoint table entry is replaced with the MAC address of the old primary appliance.
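The relearning problem can be illustrated with a toy model. The sketch below is not a representation of real ACI internals; the class, method names, and addresses are hypothetical. It assumes that a late packet arrives with the VIP as its source IP and the old primary's MAC as its source MAC, and shows how data-plane learning lets that packet overwrite what the GARP had just corrected.

```python
class EndpointTable:
    """Toy IP-to-MAC table; a simplification, not real ACI behavior."""

    def __init__(self, dataplane_learning: bool = True):
        self.dataplane_learning = dataplane_learning
        self.ip_to_mac = {}

    def on_garp(self, ip: str, mac: str) -> None:
        # Control-plane update: a GARP always refreshes the entry in this model.
        self.ip_to_mac[ip] = mac

    def on_data_packet(self, src_ip: str, src_mac: str) -> None:
        # Data-plane learning: any routed packet can rewrite the IP-to-MAC binding.
        if self.dataplane_learning:
            self.ip_to_mac[src_ip] = src_mac


def simulate_failover(dataplane_learning: bool) -> str:
    """Return the MAC the table holds for the VIP after a failover."""
    vip = "192.0.2.100"                 # hypothetical VIP
    old_primary = "02:00:00:00:00:01"   # hypothetical MAC of the old primary
    new_primary = "02:00:00:00:00:02"   # hypothetical MAC of the new primary

    table = EndpointTable(dataplane_learning)
    table.on_garp(vip, old_primary)         # state before the failover
    table.on_garp(vip, new_primary)         # new primary announces itself
    table.on_data_packet(vip, old_primary)  # stale TCP reset from the old primary
    return table.ip_to_mac[vip]


print(simulate_failover(dataplane_learning=True))   # 02:00:00:00:00:01
print(simulate_failover(dataplane_learning=False))  # 02:00:00:00:00:02
```

With data-plane learning on, the table ends up pointing at the old primary again; with it off, the GARP from the new primary is the last word.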
A quick workaround to bring the secondary appliance back up on the network was to send GARPs from the new primary Citrix ADC appliance by issuing the following command on it:
send arp -all
The new primary Citrix ADC appliance was now accessible via CLI but not through the GUI. Upon further investigation, we found that the /var/core directory was full of core dumps generated from the crash. We transferred all the core dumps on the Citrix ADC to an external location. Access to the Citrix ADC appliance was restored; however, we noticed that the HA sync was still failing, and a failover event still resulted in the same issue.
As per the Cisco ACI Fabric Endpoint Learning Whitepaper, for Cisco ACI to learn GARP-based endpoint moves, EP move detection mode and ARP flooding need to be enabled on the Bridge Domain (BD). We set EP move detection mode to GARP-based detection (Tenants > Networking > Bridge Domains > BD NAME > Policy > L3 Configuration) and enabled ARP flooding for the bridge domain (Tenants > Networking > Bridge Domains > BD NAME > Policy > General).
CTX238900 also recommended disabling endpoint data-plane learning on the Bridge Domain (BD). In our case, the customer did not think it was feasible to disable data-plane learning on the existing Bridge Domain, since doing so would affect all the endpoints on that VRF. Hence, we created a new VRF and Bridge Domain for Citrix ADC HA, added a dedicated interface on the Citrix ADC for HA sync, and associated the new interface with the newly created Bridge Domain under the new VRF. Endpoint data-plane learning was disabled on this new Bridge Domain. This setting was configured under the “General” tab on the bridge domain (Tenants > Networking > Bridge Domains > BD NAME > Policy > General).
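For anyone who prefers to script these bridge domain settings rather than click through the GUI, the sketch below builds the kind of JSON body the APIC REST API accepts for a bridge domain. Treat it as an assumption to verify rather than a tested recipe: the attribute names (arpFlood, epMoveDetectMode, ipLearning) are my understanding of the fvBD object, the function name is hypothetical, and nothing is sent to a controller here. Confirm the names against the APIC API Inspector for your ACI release before using them.

```python
import json


def bridge_domain_payload(
    bd_name: str,
    garp_detection: bool = True,
    arp_flood: bool = True,
    ip_dataplane_learning: bool = False,
) -> dict:
    """Build a JSON-serializable body for a bridge domain (fvBD) object.

    Attribute names are assumptions; verify them for your ACI release.
    """
    attributes = {
        "name": bd_name,
        # ARP flooding on the bridge domain (General tab).
        "arpFlood": "yes" if arp_flood else "no",
        # EP move detection mode: GARP-based detection (L3 Configuration tab).
        "epMoveDetectMode": "garp" if garp_detection else "",
        # Endpoint data-plane learning on the bridge domain (General tab).
        "ipLearning": "yes" if ip_dataplane_learning else "no",
    }
    return {"fvBD": {"attributes": attributes}}


# "BD_NAME" is a placeholder, matching the GUI paths in the text.
payload = bridge_domain_payload("BD_NAME")
print(json.dumps(payload, indent=2))
```

Keeping the payload builder separate from any HTTP call makes it easy to review the exact settings with the network team before anything is pushed to the fabric.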
After disabling endpoint data-plane learning and enabling GARP-based detection plus ARP flooding on the Bridge Domain in Cisco ACI, we tested failover again and were finally able to fail over the Citrix ADC instances successfully. Problem solved!
Chetan graduated from the University of Texas with a master's degree in EE and holds certifications in Azure, Google Cloud, VMware, and Citrix. Chetan has years of experience in consulting and managed services and has specialized in end-user computing with organizations across North America. At Ferroque, Chetan is a Principal Technical Consultant specializing in Citrix and VMware technologies.