NCD – Customer-to-NCD Zerto Replication Troubleshooting
Navisite Cloud Director® (NCD) integrates with Zerto Virtual Replication (ZVR) disaster recovery protection, enabling hypervisor (or virtual machine monitor)-based replication between your on-premise virtual environment (VE) and your NCD environment. The replication service works with both VMware® vCenter™ and vCloud Director® (VCD) virtual environments.
Zerto Virtual Replication allows you to automatically and continuously replicate application data and virtual machine (VM) images, as well as system configurations and dependencies, in order to facilitate disaster recovery.
Customer-to-NCD ZVR replicates an on-premise vCenter's VMs contained within defined Virtual Protection Groups (VPGs) to dedicated disaster recovery virtual applications (vApps) within Navisite Cloud Director. ZVR can also be configured to replicate or move VMs from a Navisite Cloud Director environment to an on-premise customer site, if desired.
This article details some commonly encountered customer-to-NCD ZVR problems and provides detailed instructions for diagnosing and correcting them where applicable. A glossary of commonly used Zerto Acronyms is also included in this article.
Troubleshooting Zerto congestion problems may involve:
- Analysis of the replication traffic load that a VPG is expected to create
- Determining whether the WAN connection between the protection and recovery sites is sufficient to support the expected replication traffic
- Analysis of VRA memory and how it relates to congestion (including burst write considerations)
Customer-to-NCD ZVR Problems
Below is a list of commonly encountered customer-to-NCD ZVR problems, with links to detailed instructions for diagnosing and correcting them.
- VPG is not meeting service level agreement (SLA) after a VM is added
- VPG is not meeting SLA after a VM is removed
- VPG is exhibiting VRA congestion after a VM is added
- VPG is in a constant state of bitmap synchronization
VPG Not Meeting Service Level Agreement (SLA) After a VM is Added
Assumption: The VPG showed no errors or warnings prior to adding the VM to it.
When Zerto saves new VPG settings (such as when a VM is added to it), it begins performing an initial synchronization to the recovery site.
Note: The new VM must be powered on for initial synchronization to work as expected.
When Zerto completes the initial synchronization, it displays a warning about not meeting its Service Level Agreement (SLA). This is expected, because the initial synchronization results in a purge of the journal files at the recovery site.
For details about why adding a VM to a VPG results in SLA warnings, see VPG Updates Affecting SLA.
Back to top
VPG Not Meeting SLA After a VM is Removed
Assumption: The VPG showed no errors or warnings prior to removing the VM from it.
When Zerto saves new VPG settings (such as when a VM is removed from it), it begins performing an initial synchronization to the recovery site.
When Zerto completes the initial synchronization, it displays a warning about not meeting its SLA. This is expected, because the initial synchronization results in a purge of the journal files at the recovery site.
For details on this and other VPG changes that can result in SLA warnings, see VPG Updates Affecting SLA.
Back to top
VPG Exhibiting VRA Congestion After a VM is Added
Assumptions:
- The VPG showed no errors or warnings prior to adding the VM to it.
- The customer site is the protection site, and NCD is the recovery site.
- There is a functioning network connection between the protection and recovery sites.
- Zerto displays a warning message about VRA congestion after adding the VM to the VPG.
VRA Congestion at the Zerto Protected Site
This section addresses troubleshooting VRA congestion issues that are present at the protected site. For VRA congestion issues present at the recovery site, see VRA Congestion at the Zerto Recovery Site.
A Zerto VRA protection site congestion issue indicates that Zerto has detected a VRA buffer overflow. A buffer overflow is caused by attempting to write to a buffer that is already full.
VRA congestion issues at the protected site are generally caused by:
- VRA buffer overflow due to insufficient WAN bandwidth
- VRA buffer overflow due to insufficient VRA sizing
- VRA buffer overflow due to burst writes
- See Estimating VPG Bandwidth to determine how much bandwidth a VPG should need during normal operations.
- See Measuring WAN Performance Between Zerto Sites to measure how much bandwidth you can expect from your Zerto site WAN network connection.
- If you determine that your WAN bandwidth is not sufficient to support your replication needs, contact your network administrator and discuss methods for increasing the bandwidth of your WAN connection to NCD.
- If you determine that your WAN bandwidth should be sufficient to support the replication needs:
  - See Managing ZVM WAN Bandwidth Throttling and verify that you are not limiting WAN usage at your onsite ZVM.
  - See VRA Sizing and Burst Write Considerations to determine whether VRA sizing is sufficient for your specific application.
VRA Congestion at the Zerto Recovery Site
This section addresses troubleshooting VRA congestion issues that are present at the recovery site. For VRA congestion issues present at the protected site, see VRA Congestion at the Zerto Protected Site.
A Zerto VRA recovery site congestion issue indicates that Zerto has detected a VRA buffer overflow. A buffer overflow is caused by attempting to write to a buffer that is already full.
VRA congestion issues at the recovery site are generally caused by:
- VRA buffer overflow due to insufficient VRA sizing
Back to top
VPG in Constant State of Bitmap Synchronization
In response to VRA congestion or a WAN disconnection, Zerto locally tracks protected VM disk updates by performing a bitmap synchronization operation. Once the issue causing the bitmap synchronization is resolved, Zerto uses the bitmap created during the outage to synchronize changes to the recovery site.
Normally, bitmap synchronizations triggered by congestion or network interruption clear automatically within a few minutes (usually 15 minutes or less). It is possible, however, for bitmap synchronizations to last for more significant lengths of time and, in rare cases, indefinitely.
Bitmap synchronizations lasting more than 15 minutes are often the result of insufficient WAN bandwidth between sites. To determine whether the WAN connection is sufficient to handle your replication needs, measure the expected VPG replication load and WAN performance.
- See Estimating VPG Bandwidth to determine how much bandwidth a VPG should need during normal operations.
- See Measuring WAN Performance Between Zerto Sites to measure how much bandwidth you can expect from your Zerto site WAN network connection.
- If you determine that your WAN bandwidth is not sufficient to support your replication needs, contact your network administrator and discuss methods for increasing the bandwidth of your WAN connection to NCD.
- If you determine that your WAN bandwidth should be sufficient to support the replication needs:
  - See Managing ZVM WAN Bandwidth Throttling and verify that you are not limiting WAN usage at your onsite ZVM.
  - See VRA Sizing and Burst Write Considerations to determine whether VRA sizing is sufficient for your specific application.
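If a bitmap synchronization appears stuck, a rough back-of-the-envelope estimate of the expected catch-up time can help distinguish a slow-but-progressing synchronization from a genuinely stalled one. The Python sketch below is purely illustrative; the change volume, measured WAN bandwidth, and steady-state load figures are hypothetical assumptions, not values reported by Zerto.

```python
# Rough, illustrative estimate of bitmap synchronization catch-up time.
# All input values are hypothetical; substitute your own measurements.
changed_data_gb = 10.0       # data written to protected VMs during the outage
wan_bandwidth_mbit = 34.6    # measured WAN bandwidth (e.g., from an iperf test)
steady_state_mbit = 20.0     # estimated steady-state replication load

# Bandwidth left over for catching up on the bitmap backlog
spare_mbit = wan_bandwidth_mbit - steady_state_mbit

if spare_mbit <= 0:
    print("No spare bandwidth: the bitmap synchronization may never complete.")
else:
    catch_up_seconds = (changed_data_gb * 8 * 1024) / spare_mbit
    print(f"Approximate catch-up time: {catch_up_seconds / 60:.0f} minutes")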
Back to top
Support Reference Material
VPG Updates Affecting SLA
Certain VPG updates cause Zerto to perform a synchronization between the protected and recovery sites. As part of the synchronization operation, Zerto also purges the recovery site journal files, causing Zerto to display a "HISTORY NOT MEETING SLA" warning message. Zerto continues to display the warning until the journal files again contain enough history to meet 75% of the configured SLA. For example, with a configured journal history of four hours, the warning clears once roughly three hours of history has accumulated. Over time, entries are added to the recovery journals until they contain enough historical data for Zerto to stop displaying the warning message.
Below are VPG update actions that require the recovery site journal history to be rebuilt, and result in warning messages from Zerto:
Action | Reason for Not Meeting SLA |
---|---|
Adding VM(s) to a VPG | Adding a new VM to a VPG results in Zerto resynchronizing the protected site VMs to the recovery site, and a Zerto status showing the initial synchronization of the new VM taking place. Each journal at the recovery site is purged as part of the synchronization process. Each recovery site journal begins to refill after synchronization between sites is complete – this can take a few minutes. Zerto displays a "HISTORY NOT MEETING SLA" warning status until the journals contain at least 75% of the data in the configured SLA. |
Removing VM(s) from a VPG | Removing a VM from a VPG results in Zerto resynchronizing the protected site VMs to the recovery site. Each journal at the recovery site is purged as part of the synchronization process. Each recovery site journal begins to refill after synchronization between sites is complete – this can take a few moments. Zerto displays a "HISTORY NOT MEETING SLA" warning status until the journals contain at least 75% of the data in the configured SLA. |
Updating a VM's storage profile (Advanced VM Replication Setting) | Updating a VM's storage profile results in Zerto resynchronizing the protected site VMs to the recovery site. Each journal at the recovery site is purged as part of the synchronization process. Each recovery site journal begins to refill after synchronization between sites is complete – this can take a few moments. Zerto displays a "HISTORY NOT MEETING SLA" warning status until the journals contain at least 75% of the data in the configured SLA. |
Back to top
Estimating VPG Bandwidth
When troubleshooting Zerto congestion issues, it can be helpful to estimate how much bandwidth a VPG will use. This requires you to measure the amount of disk write I/O traffic that will be generated by each VM in a VPG. Zerto recommends using a 24-hour minimum average for estimating VM disk write bandwidth. However, you may want to consider other time spans when measuring your VM disk I/O traffic, based on an accurate estimate of your system usage and/or worst-case traffic load.
Once bandwidth has been measured for each VM in your VPG, add the individual values together to determine a total bandwidth value for the VPG.
Using VMware vCenter to Estimate Disk Write I/O Bandwidth
Note: You should work with your on-site vCenter administrator to understand any potential impacts to your vCenter infrastructure prior to making the recommended logging changes noted below.
Start by setting up data collection in your vCenter:
- Log into the vCenter server using the vSphere Web Client.
- From "Home," navigate to vCenter.
- From "vCenter," navigate to vCenterServers.
- From "vCenterServers," navigate to vcenter > Manage > Settings > Edit.
- From the vCenter Edit page, navigate to Statistics.
- Set the statistics level to Level 2 for all settings up to and including one day.
Next, measure the average disk write rate for each VM in the VPG:
- Log into the vCenter server using the vSphere Web Client.
- Navigate to VMs and Templates.
- Locate the VM of interest.
- From the "VM" display, navigate to Monitor > Performance.
- Select Advanced.
- Select Chart Options.
- From "Chart Options," navigate to Disk.
- From "Disk," select a timespan of Last Day.
- From "Chart Options," select Write Rate only.
- Click OK.
- Note the write rate average displayed below the graph.
- Repeat Steps 2 – 11 for each VM in the VPG.
Use the following formula to estimate the total WAN bandwidth required for the VPG:
WAN Mb/sec = SUM(VMBW KB/sec * 8 / 1024 / CF)
Where:
- WAN Mb/sec = total estimated Wide Area Network bandwidth for all VMs, in megabits per second
- VMBW = measured bandwidth value for a given VM, in KB/sec (kilobytes per second)
- CF = Compression Factor; set to 2 if WAN data is being compressed, or set to 1 for the worst case of no compression
In cases where a VRA host is heavily loaded, Zerto may not compress WAN data at all.
If the VRA host server is expected to be heavily loaded, consider using a CF of 1 in the above WAN bandwidth calculation to account for your possible worst-case WAN loading scenario.
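As a minimal sketch of the calculation above, the following Python snippet sums per-VM write rates and converts them to an estimated WAN figure. The VM names and write rates are hypothetical example values taken from the vCenter performance charts described earlier.

```python
# Minimal sketch of the WAN bandwidth estimate described above.
# Per-VM write rates (KB/sec) are hypothetical example values.
vm_write_rates_kb_per_sec = {
    "app-vm-01": 850.0,
    "db-vm-01": 2300.0,
    "web-vm-01": 410.0,
}

# Compression Factor (CF): 2 if Zerto WAN compression is effective,
# 1 for the worst case (no compression, e.g. a heavily loaded VRA host).
CF = 1

# WAN Mb/sec = SUM(VMBW KB/sec * 8 / 1024 / CF)
wan_mbit_per_sec = sum(
    rate * 8 / 1024 / CF for rate in vm_write_rates_kb_per_sec.values()
)

print(f"Estimated VPG WAN bandwidth: {wan_mbit_per_sec:.1f} Mb/sec")
```

Comparing this figure with the bandwidth actually available between sites (see Measuring WAN Performance Between Zerto Sites) indicates whether the link can sustain steady-state replication.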
Using a VM's Host OS to Estimate Disk Write I/O Bandwidth
Instead of using vCenter, it is possible to gather disk write I/O metrics for the protected VMs using performance monitoring applications included with the VM's host OS.
- Microsoft Windows disk write I/O statistics can be obtained using Performance Monitor.
- Linux/UNIX disk write I/O statistics can be obtained using iostat.
Back to top
Measuring WAN Performance Between Zerto Sites
The Zerto ZVM can be used to observe the detected WAN usage between the protected and recovery sites, as follows:
- Log into the local site ZVM, or access the ZVM within NCD.
- Within the ZVM, select the "VPGs" tab and click the name of the VPG of interest.
- At the VPG page, observe the WAN TRAFFIC graph and verify that the displayed values are consistent with the previously calculated VPG bandwidth.
- Specific data point values can be obtained by hovering the mouse cursor over data points on the graph.
- When viewing the graph, remember that it shows discrete samples and not averages.
- See Managing ZVM WAN Bandwidth Throttling to verify that you are not limiting WAN usage at your local site ZVM.
- See Measuring WAN Performance Using Iperf for details on measuring WAN network performance using iperf.
Back to top
Managing ZVM WAN Bandwidth Throttling
To ensure that Zerto does not use excessive network bandwidth during steady state production operations, Zerto allows you to configure the ZVM to limit or throttle the amount of WAN bandwidth used during replication activities. Depending on your specific Zerto configuration, limiting the amount of WAN bandwidth available to your Zerto ZVM may lead to Zerto replication congestion issues.
To configure a ZVM to allow unlimited access to the WAN connection:
- Log into the local site ZVM.
- At the ZVM dashboard, click the menu icon in the upper right corner, and select Site Settings.
- At the Site Settings page, select Performance and Throttling.
- Verify that the Unlimited checkbox is selected in the "Bandwidth Throttling" section.
- Click SAVE.
Back to top
Measuring WAN Performance Using Iperf
This section details evaluating Zerto performance by measuring the effectiveness of the WAN connection between the protected and recovery sites using the Iperf tool. Iperf is a network testing tool that can create Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) data streams and measure the throughput of a network that carries them.
Configuration
The Iperf tool requires that two VMs (Windows or Linux) be placed into a customer’s Zerto infrastructure – one VM at the protected site and the other at the recovery site. TCP port 5001 (the Iperf default, as shown in the example output below) must be open between the customer’s protected and recovery sites.
To measure network performance from the protected site to the recovery site, an Iperf client must be installed on the protected site network, and an Iperf server must be installed on the recovery site network.
Windows Iperf Installation
To install Iperf on a Windows VM, download the Iperf executable (iperf.exe) and save it on the VM's hard disk. To obtain a copy of the Iperf executable, perform a web search for "Iperf for Windows" and download a copy from a trusted source.
Linux Iperf Installation
Execute the following command to install the Iperf application from a Linux VM terminal prompt:
sudo apt-get install iperf
Running the Iperf Test
To run the Iperf server (at your Zerto recovery site), execute the following command:
iperf -s -f m
Where:
- The iperf command is run from either a Windows cmd window (within the same directory that contains the iperf.exe executable file) or a Linux terminal window.
- -s indicates the server side of the Iperf connection.
- -f m sets the format of the output so that bandwidth and file transfer values are displayed in megabits.
To run the Iperf client (at your Zerto protected site), execute the following command:
iperf -c x.x.x.x -f m -t 30
Where:
- The iperf command is run from either a Windows cmd window (within the same directory that contains the iperf.exe executable file) or a Linux terminal window.
- -c indicates the client side of the Iperf connection.
- x.x.x.x represents the IP address of the Iperf server node (substitute the IP address of your own server).
- -f m sets the format of the output so that bandwidth and file transfer values are displayed in megabits.
- -t 30 configures the test to run over a 30-second time period.
Evaluating Iperf Results
After completion of the Iperf test, results are produced from both the server and client side of the connection. Below is an example of client-side test results:
oper@ubuntu:~/dw$ iperf -c 192.168.30.3 -f m -t 30
------------------------------------------------------------
Client connecting to 192.168.30.3, TCP port 5001
TCP window size: 0.02 MByte (default)
------------------------------------------------------------
[ 3] local 10.10.30.2 port 58338 connected with 192.168.30.3 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-30.2 sec 124 MBytes 34.6 Mbits/sec
Test result details include:
- Interval 0.0-30.2 – indicates that the test ran for a total of 30.2 seconds
- Transfer 124 MBytes – indicates that 124 MBytes of data was sent from the client to the server
- Bandwidth 34.6 Mbits/sec – indicates that the average bandwidth during the test was 34.6 Mbits/sec
Note: Zerto requires a minimum WAN bandwidth of 5 Mbits per second.
Back to top
VRA Sizing and Burst Write Considerations
The overall memory allocated to your VRAs affects the size of the VRA buffer pools in your Zerto configuration. When a VRA is created, it is assigned a fixed amount of RAM (either 1GB, 2GB, or 3GB). After a VRA is installed, the amount of RAM allocated to the VRA can then be increased in 1GB increments (up to 16GB) by manually altering the VRA VM RAM settings.
When more RAM is allocated to the VRA, more memory is available for the VRA buffer pool, as indicated below:
Total Amount of VRA RAM (GBytes) | VRA Buffer Pool Size (Mbytes) |
---|---|
1 | 450 |
2 | 1450 |
3 | 2300 |
4 | 3300 |
5 | 4300 |
6 | 5300 |
7 | 6300 |
8 | 7300 |
9 | 8300 |
10 | 9300 |
11 | 10300 |
12 | 11300 |
13 | 12300 |
14 | 13300 |
15 | 14300 |
16 | 15300 |
If you continue to detect congestion issues that you believe are related to VRA sizing after allocating 3GBytes of RAM, you should evaluate your Zerto configuration while considering burst write situations.
A burst write occurs when one or more VMs write a large amount of data to disk in a short amount of time (e.g., when performing a large file copy operation or performing a database backup). In these situations, a large amount of data is quickly placed into the VRA buffer, which could potentially fill the VRA buffer before the VRA has a chance to process the data and empty the buffer.
In cases where burst writes are expected, you should appropriately size your VRA memory so that enough VRA buffer space is available to handle the amount of expected data during the burst.
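The sketch below illustrates one way to reason about buffer headroom during a burst. The simple fill model and all input figures are assumptions for illustration only, not Zerto sizing rules: it assumes the buffer fills at the burst write rate minus the rate at which the VRA can drain data to the recovery site, and compares the result with the buffer pool sizes from the table above.

```python
# Illustrative (not official) check of VRA buffer headroom during a burst write.
# All input values are hypothetical; substitute your own measurements.
burst_write_mb_per_sec = 120.0   # peak disk write rate during the burst (MB/sec)
wan_drain_mb_per_sec = 4.3       # rate the VRA can send to the recovery site (MB/sec)
burst_duration_sec = 60          # expected length of the burst

# Data accumulating in the VRA buffer while the burst lasts
buffer_needed_mb = (burst_write_mb_per_sec - wan_drain_mb_per_sec) * burst_duration_sec

# Buffer pool sizes from the table above (VRA RAM in GB -> buffer pool in MB)
buffer_pool_mb = {1: 450, 2: 1450, 3: 2300, 4: 3300, 8: 7300, 16: 15300}

for ram_gb, pool_mb in sorted(buffer_pool_mb.items()):
    if pool_mb >= buffer_needed_mb:
        print(f"Approximately {ram_gb} GB of VRA RAM ({pool_mb} MB buffer) "
              f"could absorb a {buffer_needed_mb:.0f} MB burst.")
        break
else:
    print(f"A {buffer_needed_mb:.0f} MB burst exceeds the largest buffer pool; "
          "consider reducing the burst load or increasing WAN bandwidth.")
```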
Back to top
Zerto Acronyms
Acronym | Term | Definition |
---|---|---|
VPG | Virtual Protection Group | A logical grouping of Virtual Machines (VMs) that is used to organize Zerto replication. |
SLA | Service Level Agreement | Agreement between customer and service provider which defines the service that is to be delivered to the customer – a Zerto SLA is an agreement of how much history will be made available for disaster recovery (DR) within the protected site journal files. |
DR | Disaster Recovery | DR is the process of using a replication service to recover computing resources in the case of a disaster. DR requires human intervention to start, and does not provide instantaneous recovery of lost computing resources. DR is a way of preparing for disaster situations that will allow for "quick" recovery of computing resources. |
ZVM | Zerto Virtual Manager | A ZVM is a service that runs on a Windows VM within a cloud service-provided data center. The ZVM manages all activities related to DR within that data center, including interactions with the vCenter itself. There is one ZVM per data center involved in DR activities. |
VRA | Virtual Replication Appliance | A VRA is a service that runs on each ESX/ESXi host within a data center. The VRA is responsible for replication activities within a given ESX/ESXi host. |
Back to top
Contacting Support
For instructions on contacting Navisite Cloud Director Support, see Contacting Support. When contacting Support, please provide as much detail as possible relating to the issue being experienced, including information on all NaviSite-provided troubleshooting procedures attempted prior to contacting Support.
Back to top