NCD – Customer-to-NCD Zerto Replication Troubleshooting


Navisite Cloud Director® (NCD) integrates with Zerto Virtual Replication (ZVR) disaster recovery protection, enabling hypervisor-based (virtual machine monitor-level) replication between your on-premises virtual environment (VE) and your NCD environment. The replication service works with both VMware® vCenter and vCloud Director® (VCD) virtual environments.

Zerto Virtual Replication allows you to automatically and continuously replicate application data and virtual machine (VM) images, as well as system configurations and dependencies, in order to facilitate disaster recovery.

Customer-to-NCD ZVR replicates an on-premises vCenter's VMs contained within defined Virtual Protection Groups (VPGs) to dedicated disaster recovery virtual applications (vApps) within Navisite Cloud Director. ZVR can also be configured to replicate or move VMs from a Navisite Cloud Director environment to an on-premises customer site, if desired.

This article details some commonly encountered customer-to-NCD ZVR problems and provides detailed instructions for diagnosing and correcting them where applicable. A glossary of commonly used Zerto Acronyms is also included in this article.

Troubleshooting Zerto congestion problems may involve:
  • Analysis of the replication traffic load that a VPG is expected to create

  • Determining whether the WAN connection between the protection and recovery sites is sufficient to support the expected replication traffic

  • Analysis of VRA memory and how it relates to congestion (including burst write considerations)

To troubleshoot Zerto congestion issues, it is helpful to understand the high-level components involved in moving replication data between the protected and recovery sites. The following diagram illustrates these components, and highlights those associated with congestion issues (within the red box):


Customer-to-NCD ZVR Problems

Below is a list of commonly encountered customer-to-NCD ZVR problems, providing links to detailed instructions for diagnosing and correcting them.

VPG Not Meeting Service Level Agreement (SLA) After a VM is Added


Assumption: The VPG showed no errors or warnings prior to adding the VM to it.

When Zerto saves new VPG settings (such as when a VM is added to it), it begins an initial synchronization to the recovery site.

Note: The new VM must be powered on for the initial synchronization to work as expected.

When Zerto completes the initial synchronization, it displays a warning about not meeting its Service Level Agreement (SLA). This is expected, because the initial synchronization purges the journal files at the recovery site.



For details about why adding a VM to a VPG results in SLA warnings, see VPG Updates Affecting SLA.

Back to top

VPG Not Meeting SLA After a VM is Removed


Assumption: The VPG showed no errors or warnings prior to removing the VM from it.

When Zerto saves new VPG settings (such as when a VM is removed from it), it begins an initial synchronization to the recovery site.

When Zerto completes the initial synchronization, it displays a warning about not meeting its SLA. This is expected, because the initial synchronization purges the journal files at the recovery site.



For details on this and other VPG changes that can result in SLA warnings, see VPG Updates Affecting SLA.

Back to top

VPG Exhibiting VRA Congestion After a VM is Added


Assumptions:
  • The VPG showed no errors or warnings prior to adding the VM to it.

  • The customer site is the protection site, and NCD is the recovery site.

  • There is a functioning network connection between the protection and recovery sites.

  • Zerto displays a warning message about VRA congestion after adding the VM to the VPG.

VRA Congestion at the Zerto Protected Site

This section addresses troubleshooting VRA congestion issues that are present at the protected site.

For VRA congestion issues present at the recovery site, see VRA Congestion at the Zerto Recovery Site.

A Zerto VRA protection site congestion issue indicates that Zerto has detected a VRA buffer overflow. A buffer overflow is caused by attempting to write to a buffer that is already full.

VRA congestion issues at the protected site are generally caused by:
  • VRA buffer overflow due to insufficient WAN bandwidth

  • VRA buffer overflow due to insufficient VRA sizing

  • VRA buffer overflow due to burst writes

The most common cause of congestion issues at the protected site is insufficient WAN bandwidth between the sites. To determine whether the WAN connection is sufficient to handle your replication needs, measure the expected VPG replication load (see Estimating VPG Bandwidth) and WAN performance (see Measuring WAN Performance Between Zerto Sites).

If, after reviewing both estimated VPG bandwidth and WAN performance, you determine that:
  • your WAN bandwidth is not sufficient to support your replication needs, contact your network administrator and discuss methods for increasing the bandwidth of your WAN connection to NCD.

  • your WAN bandwidth should be sufficient to support your replication needs, review VRA Sizing and Burst Write Considerations to verify that your VRA has enough buffer memory to absorb the replication load.

If you continue to experience protection site VRA congestion issues, see Contacting Support for details on engaging the NCD support team.

VRA Congestion at the Zerto Recovery Site

This section addresses troubleshooting VRA congestion issues that are present at the recovery site.

For VRA congestion issues present at the protected site, see VRA Congestion at the Zerto Protected Site.

A Zerto VRA recovery site congestion issue indicates that Zerto has detected a VRA buffer overflow. A buffer overflow is caused by attempting to write to a buffer that is already full.

VRA congestion issues at the recovery site are generally caused by:
  • VRA buffer overflow due to insufficient VRA sizing

You are unlikely to experience recovery site Zerto congestion because most WAN connections do not have enough bandwidth to overwhelm the recovery site VRA. If you encounter recovery site congestion, see Contacting Support for details on engaging the NCD support team. Recovery site resources (including VRA sizing) are contained within the NCD infrastructure and can only be managed by NCD administrators.

Back to top

VPG in Constant State of Bitmap Synchronization

In response to VRA congestion or a WAN disconnection, Zerto locally tracks protected VM disk updates in a bitmap.

Once the issue that triggered the bitmap tracking is resolved, Zerto performs a bitmap synchronization, using the bitmap created during the outage to synchronize the accumulated changes to the recovery site.

Normally, bitmap synchronizations triggered by congestion or network interruption clear automatically within a few minutes (usually 15 minutes or less). It is possible, however, for bitmap synchronizations to last for more significant lengths of time and, in rare cases, indefinitely.

Bitmap synchronizations lasting more than 15 minutes are often the result of insufficient WAN bandwidth between sites. To determine whether the WAN connection is sufficient to handle your replication needs, measure the expected VPG replication load (see Estimating VPG Bandwidth) and WAN performance (see Measuring WAN Performance Between Zerto Sites).

If, after reviewing both estimated VPG bandwidth and WAN performance, you determine that:
  • your WAN bandwidth is not sufficient to support your replication needs, contact your network administrator and discuss methods for increasing the bandwidth of your WAN connection to NCD.

  • your WAN bandwidth should be sufficient to support your replication needs, review VRA Sizing and Burst Write Considerations to verify that your VRA has enough buffer memory to absorb the replication load.

If you continue to experience prolonged bitmap synchronizations, see Contacting Support for details on engaging the NCD support team.

Back to top

Support Reference Material

VPG Updates Affecting SLA

Certain VPG updates cause Zerto to perform a synchronization between the protected and recovery sites. As part of the synchronization operation, Zerto also purges the recovery site journal files, causing Zerto to display a "HISTORY NOT MEETING SLA" warning message. Zerto continues to display the warning until the journal files again contain enough history to meet 75% of the configured SLA.

In time, entries are added to the recovery journals until they contain enough historical data for Zerto to stop displaying the warning message.

Below are the VPG update actions that require the recovery site journal history to be rebuilt and result in warning messages from Zerto:

Action: Adding VM(s) to a VPG
Reason for Not Meeting SLA: Adding a new VM to a VPG results in Zerto resynchronizing the protected site VMs to the recovery site, and a Zerto status showing the initial synchronization of the new VM taking place. Each journal at the recovery site is purged as part of the synchronization process. Each recovery site journal begins to refill after synchronization between sites is complete – this can take a few minutes. Zerto displays a "HISTORY NOT MEETING SLA" warning status until the journals contain at least 75% of the history required by the configured SLA.

Action: Removing VM(s) from a VPG
Reason for Not Meeting SLA: Removing a VM from a VPG results in Zerto resynchronizing the protected site VMs to the recovery site. The recovery site journals are purged and refilled as described above, and the same "HISTORY NOT MEETING SLA" warning is displayed until they again contain at least 75% of the configured SLA history.

Action: Updating a VM's storage profile (Advanced VM Replication Setting)
Reason for Not Meeting SLA: Updating a VM's storage profile results in Zerto resynchronizing the protected site VMs to the recovery site. The recovery site journals are purged and refilled as described above, and the same "HISTORY NOT MEETING SLA" warning is displayed until they again contain at least 75% of the configured SLA history.

Back to top

Estimating VPG Bandwidth

When troubleshooting Zerto congestion issues, it can be helpful to estimate how much bandwidth a VPG will use. This requires you to measure the amount of disk write I/O traffic that will be generated by each VM in a VPG.

Zerto recommends using a 24-hour minimum average for estimating VM disk write bandwidth. However, you may want to consider other time spans when measuring your VM disk I/O traffic, based on an accurate estimate of your system usage and/or worst-case traffic load.

Once bandwidth has been measured for each VM in your VPG, add the per-VM bandwidth values together to determine the total bandwidth value for the VPG.

Using VMware vCenter to Estimate Disk Write I/O Bandwidth


Note:
You should work with your on-site vCenter administrator to understand any potential impacts to your vCenter infrastructure prior to making the recommended logging changes noted below.

Start by setting up data collection in your vCenter:
  1. Log into the vCenter server using the vSphere Web Client.

  2. From "Home," navigate to vCenter.

  3. From "vCenter," navigate to vCenterServers.

  4. From "vCenterServers," navigate to vcenter > Manage > Settings > Edit.

  5. From the vCenter Edit page, navigate to Statistics.

  6. Set the statistics level to Level 2 for all settings up to and including one day.

When data collection is configured in vCenter and sufficient time has elapsed for VM statistics to collect, obtain the disk write I/O information for each VM in the VPG as follows:
  1. Log into the vCenter server using the vSphere Web Client.

  2. Navigate to VMs and Templates.

  3. Locate the VM of interest.

  4. From the "VM" display, navigate to Monitor > Performance.

  5. Select Advanced.

  6. Select Chart Options.

  7. From "Chart Options," navigate to Disk.

  8. From "Disk," select a timespan of Last Day.

  9. From "Chart Options," select Write Rate only.

  10. Click OK.

  11. Note the write rate average displayed below the graph.

  12. Repeat Steps 2 – 11 for each VM in the VPG.

When bandwidth is measured for each VM contained in the VPG, calculate the total VPG bandwidth using the following formula:

WAN Mb/sec = SUM(VMBW KB/sec * (8/1024/CF) )

Where:
  • WAN Mb/sec = total estimated Wide Area Network Megabits per second bandwidth for all VMs

  • VMBW = measured bandwidth value for a given VM in KB/sec (kilobytes per second)

  • CF = Compression Factor, set to 2 if WAN data is being compressed, set to 1 for worst-case non-compression

Zerto normally compresses WAN data prior to sending it across the network, except when the server that is hosting the VRA is experiencing heavy CPU loading. When the server hosting the VRA is heavily loaded, the VRA scales back WAN data compression to avoid contributing to overall CPU loading.

In cases where a VRA host is heavily loaded, Zerto may not compress WAN data at all.

If the VRA host server is expected to be heavily loaded, consider using a CF of 1 in the above WAN bandwidth calculation to account for your possible worst-case WAN loading scenario.
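
As an illustration, the following Linux shell command (a minimal sketch – the file name vm_write_rates.txt is hypothetical, and is assumed to contain one measured per-VM write rate, in KB/sec, per line) applies the above formula for a compressed WAN link:

awk -v CF=2 '{ sum += $1 * 8 / 1024 / CF } END { printf "Estimated VPG WAN bandwidth: %.2f Mb/sec\n", sum }' vm_write_rates.txt

Where:
  • -v CF=2 sets the Compression Factor; substitute CF=1 to model the worst-case non-compression scenario

  • each line's KB/sec value is converted to Mb/sec and accumulated, with the total printed after the final line is read

For example, measured write rates of 400, 650, and 1200 KB/sec total 2250 KB/sec, yielding 2250 × 8 ÷ 1024 ÷ 2 ≈ 8.79 Mb/sec with compression.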

Using a VM's Guest OS to Estimate Disk Write I/O Bandwidth

Instead of using vCenter, it is possible to gather disk write I/O metrics for the protected VMs using performance monitoring applications included with the VM's guest operating system (OS).
  • Microsoft Windows disk write I/O statistics can be obtained using Performance Monitor.

  • Linux/UNIX disk write I/O statistics can be obtained using iostat, as shown in the example below.
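
For example, the following iostat invocation (a sketch – adjust the interval and sample count to suit your chosen measurement window) reports per-device write throughput on a Linux VM:

iostat -dk 60 5

Where:
  • -d limits the report to device utilization statistics

  • -k displays throughput values in kilobytes per second

  • 60 5 collects five reports at 60-second intervals – disregard the first report (which shows averages since boot) and read the VM's write rate from the kB_wrtn/s column of the remaining reports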

Back to top

Measuring WAN Performance Between Zerto Sites

The Zerto ZVM can be used to observe the detected WAN usage between the protected and recovery sites, as follows:
  1. Log into the local site ZVM, or access the ZVM within NCD.

  2. Within the ZVM, select the "VPGs" tab and click the name of the VPG of interest.



  3. At the VPG page, observe the WAN TRAFFIC graph and verify that the displayed values are consistent with the previously calculated VPG bandwidth.

    • Specific data point values can be obtained by hovering the mouse cursor over data points on the graph.

    • When viewing the graph, remember that it shows discrete samples and not averages.


If the observed WAN traffic values are not consistent with the calculated VPG bandwidth, verify that bandwidth throttling is not limiting replication (see Managing ZVM WAN Bandwidth Throttling) and measure the WAN connection directly (see Measuring WAN Performance Using Iperf).

Back to top

Managing ZVM WAN Bandwidth Throttling

To ensure that Zerto does not use excessive network bandwidth during steady state production operations, Zerto allows you to configure the ZVM to limit or throttle the amount of WAN bandwidth used during replication activities.

Depending on your specific Zerto configuration, limiting the amount of WAN bandwidth available to your Zerto ZVM may lead to Zerto replication congestion issues.

To configure a ZVM to allow unlimited access to the WAN connection:
  1. Log into the local site ZVM.

  2. At the ZVM dashboard, click the menu icon in the upper right corner, and select Site Settings.



  3. At the Site Settings page, select Performance and Throttling.

  4. Verify that the Unlimited checkbox is selected in the "Bandwidth Throttling" section.



  5. Click SAVE.

Back to top

Measuring WAN Performance Using Iperf

This section details evaluating Zerto performance by measuring the effectiveness of the WAN connection between the protected and recovery sites using the Iperf tool.

Iperf is a network testing tool that can create Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) data streams and measure the throughput of a network that carries them.

Configuration

The Iperf tool requires that two VMs (Windows or Linux) be placed into a customer’s Zerto infrastructure – one VM at the protected site and the other at the recovery site.

Port 5001 (the default Iperf TCP port, as shown in the sample output below) must be open between a customer's protected and recovery sites.

To measure network performance from the protected site to the recovery site, an Iperf client must be installed on the protected site network, and an Iperf server must be installed on the recovery site network.
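
To confirm that the port is reachable before running a test, you can attempt a connection from the protected site. A minimal check, assuming the netcat utility is available on a Linux VM and the Iperf server is already running at the recovery site (substitute your Iperf server's IP address for x.x.x.x):

nc -zv x.x.x.x 5001

Where:
  • -z performs a connection test without sending any data

  • -v enables verbose output, reporting whether the connection succeeded or was refused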

Windows Iperf Installation

To install Iperf on a Windows VM, download the Iperf executable (iperf.exe) and save it on the VM's hard disk.

To obtain a copy of the Iperf executable, perform a web search for "Iperf for Windows" and download a copy from a trusted source.

Linux Iperf Installation

Execute the following command to install the Iperf application from a terminal prompt on a Debian- or Ubuntu-based Linux VM (other distributions provide Iperf through their own package managers):

sudo apt-get install iperf

Running the Iperf Test

To run the Iperf server (at your Zerto recovery site), execute the following command:

iperf -s -f m

Where:
  • the iperf command is run from either a Windows cmd window (within the same directory that contains the iperf.exe executable file) or a Linux terminal window

  • -s indicates the server side of the Iperf connection

  • -f m sets the format of the output so that bandwidth and file transfer values are displayed in megabits

To run the Iperf client (at your Zerto protected site), execute the following command:

iperf -c x.x.x.x -f m -t 30

Where:
  • the iperf command is run from either a Windows cmd window (within the same directory that contains the iperf.exe executable file) or a Linux terminal window

  • -c indicates the client side of the Iperf connection

  • x.x.x.x represents the IP address of the Iperf server node (substitute the IP address of your own server)

  • -f m sets the format of the output so that bandwidth and file transfer values are displayed in megabits

  • -t 30 configures the test to run over a 30-second time period

Evaluating Iperf Results

After completion of the Iperf test, results are produced from both the server and client side of the connection.

Below is an example of client-side test results:

oper@ubuntu:~/dw$ iperf -c 192.168.30.3 -f m -t 30
------------------------------------------------------------
Client connecting to 192.168.30.3, TCP port 5001
TCP window size: 0.02 MByte (default)
------------------------------------------------------------
[  3] local 10.10.30.2 port 58338 connected with 192.168.30.3 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-30.2 sec  124 MBytes   34.6 Mbits/sec

Test result details include:
  • Interval 0.0–30.2 sec – indicates that the test ran for a total of 30.2 seconds

  • Transfer 124 MBytes – indicates that 124 MBytes of data was sent from the client to the server

  • Bandwidth 34.6 Mbits/sec – indicates that the average bandwidth during the test was 34.6 Mbits/sec

Note: Zerto requires a minimum WAN bandwidth of 5 Mbits per second.
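
As a point of comparison, the 34.6 Mbits/sec measured in this example would comfortably support the hypothetical 8.79 Mb/sec VPG load calculated in Estimating VPG Bandwidth; in such a configuration, persistent congestion would more likely point to VRA sizing or burst writes than to raw WAN bandwidth.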


Back to top

VRA Sizing and Burst Write Considerations

The overall memory allocated to your VRAs affects the size of the VRA buffer pools in your Zerto configuration.

When a VRA is created, it is assigned a fixed amount of RAM (either 1 GB, 2 GB, or 3 GB). After a VRA is installed, the amount of RAM allocated to the VRA can be increased in 1 GB increments (up to 16 GB) by manually altering the VRA VM RAM settings.

When more RAM is allocated to the VRA, more memory is available for the VRA buffer pool, as indicated below:
Total VRA RAM (GB)    VRA Buffer Pool Size (MB)
1                     450
2                     1450
3                     2300
4                     3300
5                     4300
6                     5300
7                     6300
8                     7300
9                     8300
10                    9300
11                    10300
12                    11300
13                    12300
14                    13300
15                    14300
16                    15300

Allocating 3 GB of VRA RAM at the time of VRA installation should be sufficient for most applications to avoid congestion issues related to VRA buffer overflow. If you detect congestion issues that you believe are related to VRA sizing, set the amount of VRA RAM to 3 GB. This can be done during VRA installation, or manually using your hypervisor interface.

If you continue to detect congestion issues that you believe are related to VRA sizing after allocating 3 GB of RAM, evaluate your Zerto configuration while considering burst write situations.

A burst write occurs when one or more VMs write a large amount of data to disk in a short amount of time (e.g., when performing a large file copy operation or performing a database backup). In these situations, a large amount of data is quickly placed into the VRA buffer, which could potentially fill the VRA buffer before the VRA has a chance to process the data and empty the buffer.

In cases where burst writes are expected, you should appropriately size your VRA memory so that enough VRA buffer space is available to handle the amount of expected data during the burst.
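
As a rough, illustrative sizing sketch (hypothetical numbers that ignore compression and VRA processing overhead), the buffer space a burst requires can be approximated as:

Required buffer (MB) ≈ burst data (MB) – (WAN Mb/sec ÷ 8 × burst duration (sec))

For example, a backup job that writes 4096 MB over 60 seconds across a WAN draining roughly 35 Mb/sec offloads only about 262 MB during the burst (35 ÷ 8 × 60 ≈ 262), leaving roughly 3834 MB to be buffered. Per the table above, that calls for at least 5 GB of VRA RAM (4300 MB buffer pool).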

Back to top

Zerto Acronyms

VPG – Virtual Protection Group: A logical grouping of virtual machines (VMs) that is used to organize Zerto replication.

SLA – Service Level Agreement: An agreement between a customer and a service provider that defines the service to be delivered – a Zerto SLA specifies how much history will be made available for disaster recovery (DR) within the recovery site journal files.

DR – Disaster Recovery: The process of using a replication service to recover computing resources in the case of a disaster. DR requires human intervention to start and does not provide instantaneous recovery of lost computing resources; it is a way of preparing for disaster situations that allows for "quick" recovery of computing resources.

ZVM – Zerto Virtual Manager: A service that runs on a Windows VM within each data center involved in DR activities. The ZVM manages all activities related to DR within that data center, including interactions with the vCenter itself. There is one ZVM per data center involved in DR activities.

VRA – Virtual Replication Appliance: A service that runs on each ESX/ESXi host within a data center. The VRA is responsible for replication activities within a given ESX/ESXi host.

Back to top

Contacting Support

For instructions on contacting Navisite Cloud Director Support, see Contacting Support.

When contacting Support, please provide as much detail as possible relating to the issue being experienced, including information on all NaviSite-provided troubleshooting procedures attempted prior to contacting Support.

Back to top





