Published on May 13th, 2014 | by Brian Suhr
Nutanix NX-3050 Series Review
Summary: Nutanix is building a real contender, causing the leaders to take notice.
Starting in 2013, hyper-converged solutions were getting a lot of attention, and in 2014 they are as hot as Justin Bieber was with nine-year-old girls a couple of years ago. You might say customers are getting the message and are becoming hyper-believers; sorry for that bad analogy. Both customers and vendors are getting behind this new infrastructure movement and are buying into the performance and simplicity that it can offer. This new style of infrastructure is very disruptive to the classic vendors. They are struggling to compete in some accounts and are resorting to underhanded methods of spreading FUD to deflect from the real issues.
What is Hyper-Converged?
There are a number of vendors that have built Hyper-Converged offerings. What this means is the solution is combining the compute, storage and sometimes the network into a single offering. This reduces the management complexity and possibly the cost. The storage features are typically a combination of local storage within the servers and software that runs on each of the nodes that offers a unified storage layer.
A converged storage offering like this reduces the complexity of the storage layer. These are typically NFS offerings, which removes the overhead of managing disk groups and LUNs. A central management console is provided, and little to no tuning is required for the storage layer. This allows the entire solution to be managed by the storage or virtualization team and does not require multiple resources from separate teams within an organization.
These are acronyms or names that will be used within the article. I’ve created a short list with descriptions to help in understanding the review.
- Prism – This is the name of the Nutanix management user interface
- NOS – Nutanix Operating System
- NDFS – Nutanix Distributed Filesystem
- CVM – Nutanix Controller VM, this is the virtual appliance that runs on each node
- Block – The 2U chassis that contains 1 to 4 nodes
- Node – A dual socket server with dedicated memory, network and disks
The Virtual Computing Platform from Nutanix is built on the concept of nodes and blocks. A block is a 2U rack-mount chassis that holds up to four nodes. The compute platform is built on OEM hardware from Supermicro, a common platform that many vendors use as the basis for storage products. Each node in the NX-3050 model has two SSD flash drives and four HDDs, along with two 10GbE and two 1GbE network connections and an IPMI network connection. This review is based on the 3.5.x release of NOS and the features and processes related to it.
On each node is the Nutanix Controller VM (CVM). The CVM is the storage appliance, or controller, that uses VMware DirectPath I/O to attach directly to the LSI disk controller in the node. This allows the CVM to control the disks without the hypervisor in between. Nutanix does not apply any RAID to the disks within each node. Instead, NDFS, a proprietary distributed file system, handles the replication of data between nodes. NDFS is based on technology similar to the Google File System and other web-scale solutions. You can learn more about how Nutanix uses technology like Cassandra, MapReduce and Zookeeper in NOS on this podcast.
A Nutanix storage cluster is made up of the blocks and nodes that are members of the cluster. All of the CVMs in the cluster unify their storage into a storage pool, creating shared storage that every node in the cluster can use. This is presented to the nodes as NFS storage.
The hypervisor is installed on a Disk on Motherboard (DOM) module. This is a small flash device on the node's motherboard that holds the vSphere installation and serves as the boot drive for the CVM.
Each node has a single CVM running on it. The CVM is a VM configured with 8 vCPUs and 16-24GB of memory, depending on whether you have deduplication enabled. The CVM has two vNICs: one that connects to your production network and one that connects to a private vSwitch on the host. This private vSwitch has a VMkernel adapter and no physical network adapters. It's a private network that the host uses to communicate with the CVM so that it can mount the NFS datastore from the Nutanix storage cluster. The CVMs use the public vNIC connection for inter-CVM communication, which allows them to monitor cluster health and replicate data.
Should a local CVM fail to respond or be shut down by a user, NOS will automatically detect the outage and redirect its I/Os to another CVM in the cluster over 10GbE. The re-routing is transparent to the hypervisor and the VMs running on the host, so even if a CVM is powered down the VMs can still perform I/O against NDFS. NDFS is also self-healing: it will detect that the CVM has been powered off and will automatically reboot or power on the local CVM. Once the local CVM is back up and available, traffic is seamlessly transferred back and served by the local CVM.
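The autopathing behavior described above can be sketched in a few lines. This is purely an illustrative model, not Nutanix code; the class and function names are made up for the example.

```python
# Hypothetical sketch of autopathing: prefer the local CVM, transparently
# fail over to a healthy peer, and fail back once the local CVM returns.

class CVM:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def serve_io(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"

def route_io(local_cvm, peer_cvms, request):
    """Prefer the local CVM; on failure, redirect to a peer in the cluster."""
    try:
        return local_cvm.serve_io(request)
    except ConnectionError:
        for peer in peer_cvms:
            if peer.healthy:
                # The I/O travels over 10GbE to a remote CVM; the VM never notices.
                return peer.serve_io(request)
        raise RuntimeError("no healthy CVM available")
```

The key point the sketch captures is that the routing decision happens below the VM: the guest keeps issuing the same NFS I/O and only the serving CVM changes.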
There are three types of failures that could affect the Nutanix NOS or storage layer. I just discussed the failure of a CVM and how autopathing redirects I/O to another CVM in the cluster, allowing VMs to continue. The next type of failure is a disk failure. A Nutanix storage cluster can sustain a single disk failure at a time. Upon a failure, the system redistributes data through the cluster; once the rebuild or redistribution is complete, the cluster is healthy again. This reportedly happens quite fast, though I was not able to test it myself, and the larger the Nutanix cluster, the faster the process completes.
The next failure scenario is losing a node in the storage cluster. This triggers a data redistribution process similar to the disk failure scenario, just with a larger amount of data. Nutanix uses MapReduce to distribute the rebuild work across the cluster, speeding up the process. In NOS 3.5.x a cluster can sustain a single node failure at a time. Once the cluster is rebalanced, if there is available free space, it may be possible to withstand another node failure.
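The rebuild logic amounts to restoring the replication factor for any data that lost a copy. A minimal sketch, assuming a replication factor of two (in real NDFS this re-replication is parallelized via MapReduce; here a plain loop stands in for it, and all names are hypothetical):

```python
# Hypothetical sketch of post-failure re-replication: any extent that
# dropped below two copies gets a new copy on a surviving node.

def rebuild(extent_replicas, failed_node, all_nodes):
    """extent_replicas maps extent-id -> set of node names holding a copy."""
    survivors = [n for n in all_nodes if n != failed_node]
    for extent, holders in extent_replicas.items():
        holders.discard(failed_node)          # drop the lost copy
        while len(holders) < 2:               # restore replication factor 2
            target = next(n for n in survivors if n not in holders)
            holders.add(target)               # copy the extent to the target node
    return extent_replicas
```

This also hints at why larger clusters rebuild faster: more surviving nodes means more sources and targets sharing the copy work.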
To improve performance, Nutanix primarily uses the local CVM on the host where a VM is running. This allows it to use local storage resources to increase performance and reduce latency. If a VM is migrated from node 1 to node 2, initially the VM will be running on node 2 while the CVM on node 1 still contains the VM's data. As reads and writes are requested, the CVM on node 2 begins to cache this data locally as it is accessed remotely. The system will not automatically move any of the data to the new local CVM when a VM moves to a different host, as this would put additional load on the network. New write I/O is always written to the local CVM for every VM.
Now that the major parts of the Nutanix platform have been covered, it's time to look at how they work together with the hypervisor layer. The great news here is that Nutanix supports VMware vSphere, Microsoft Hyper-V and KVM for virtualization. Having the option to choose your hypervisor is great, since this space is becoming very competitive. For this review, Nutanix was only tested with VMware vSphere.
In theory, Nutanix could take a whole bunch of nodes and scale them out as a large storage cluster. The current recommendation is to architect your Nutanix storage clusters to a soft limit of 60 nodes. A single storage cluster can contain multiple hypervisor clusters. For example, if I'm running VMware and build three vSphere clusters of eight nodes each on Nutanix hardware, this could translate to one Nutanix storage cluster under the covers. The soft limit is only a recommendation to help customers manage their failure domain; there is no maximum limit on the number of nodes in a single Nutanix cluster. As with any platform and design, there may be architectural reasons or customer requirements that are exceptions to this.
The benefits of building a larger Nutanix storage cluster are in performance gains and reducing rebuild times. Customers can achieve data separation if needed through storage containers. The architecture allows for flexible design options for architects to leverage.
The Good and the Bad
The following descriptions cover the items that our rating was based on. These are important values to the review team in the evaluation of products and their ongoing usage. For each item, we describe what affected the rating in a positive or negative fashion.
Installation: We gave a 4.5 star rating for the installation process. Nutanix has built some cool automation around the deployment and re-imaging of their nodes. The deployment tool is a virtual machine that uses Bonjour-like technology to discover Nutanix blocks and nodes via the IPMI interface in the servers. It allows you to select the version of Nutanix NOS you want to build with and the hypervisor of your choice. You supply the ISO files, IP addresses and names, and the appliance does the rest for you. The process deploys the hypervisor and places a Nutanix CVM on each host. You just need to start the storage cluster and create a container if it's a new install.
Performance: We thought that the Nutanix platform performed very well. The use of data localization along with SSD and memory allows Nutanix to provide an impressive level of performance out of a single node, and the ability to utilize cluster resources increases performance further by accessing the greater pool of flash. The performance rating is 4.25 stars.
Other: The rating of 4.25 stars was awarded based on the following criteria, weighted heavily toward the upgrade process for Nutanix NOS. At the time of this writing (April 2014), Nutanix has support for vSphere 5.5, and they are keeping a good pace in supporting new vSphere releases. To upgrade ESXi to a newer version you can use the standard vSphere builds from VMware; the only thing required is a single .vib file from Nutanix. The good news is that customers can perform the NOS and vSphere upgrades themselves. The upgrade of the Nutanix NOS is still a CLI-based process run from a CVM, but it is an automated rolling update of all CVMs. Once they integrate the upgrade process into Prism and it's a few clicks, this rating will increase.
Interface: The interface received a rating of 4.5 stars. This rating is based on the quality of the Prism interface for Nutanix. Prism is a good-looking, web-based management portal. They have created a design that is well laid out and easy to understand. There are details on the storage, hardware and virtual machines, which are the important parts of a solution like this. I was easily able to find out how the storage was performing and also gain insight into how each VM was performing.
Scalability: The rating of 4 stars was based on the following factors. In theory there is no limit on the number of nodes that you could scale a Nutanix cluster to. They have customers running 50-60 nodes in a single cluster today and 100+ in a cluster in their lab. These are good numbers for their age, and I do not suspect it's a limiting factor for customers at this time. Another plus in my eyes is the ability to mix node types within the same cluster. This means that you could start with 3000 series nodes and later add 6000 series nodes if you need additional capacity. NOS is smart enough to adjust to the node size differences and take them into account when replicating data in the cluster.
In the future they will be able to support two simultaneous node failures, which will make customers feel better about scaling out to larger clusters. I also think it would be a nice option to be able to upgrade the size of the SSDs in nodes if you need additional flash capacity. This is possible today, but there did not seem to be a SKU or process for it yet.
Documentation: This might be the thing that I was most surprised and impressed with. The documentation received a rating of 4.5 stars. There was a full set of documents including install, upgrade and administration guides, as well as multiple hardware-related guides. Nutanix is working hard on testing and releasing reference architectures for the most common and popular solutions. There are probably a few still missing, along with some standard best-practice recommendations for new customers. I could see Nutanix reaching a perfect 5 stars on this item in the future as they build out their documentation.
The current docs were very complete. For example, I used the upgrade guide for my test upgrade; it walked me through the steps and left me with very few questions. There were plenty of examples for the commands, and the guide provided default logins for the different products. Warnings were provided at key steps to make admins aware of what not to do so as not to cause any issues. These might be the best-written documents from any technology company that I have used in my long IT career.
Nutanix NX-3050 Cost
Nutanix offers four series of their Virtual Computing appliances. Each series is focused on different sized environments or specific use cases. The test block that I was using was an NX-3050 configured with 4 nodes. Each node was configured with dual Intel E5 processors with 8 cores each and 128GB of memory. The storage in each node was 2x 400GB SSDs and 4x 1TB HDDs. I was not able to get permission to publish a list price for the tested configuration yet, but hope to be able to soon.
The management console for Nutanix is Prism, a web-based interface that is accessed by the IP or name of the storage cluster. This allows the storage to be managed from a browser even if vCenter is down for any reason. I thought the interface was well designed and laid out, and it was easy to learn. Once logged in, there is a drop-down menu in the upper left of the web page that allows you to navigate between the major sections of the Nutanix platform.
When first logging in you will see the home page; an example is shown below. The home page shows a view of what's going on with this Nutanix cluster. On the left there are details about the count of nodes, hosts and VMs within this storage cluster, along with the hypervisor version. The next column shows a view of performance for the storage cluster. The third column shows any alerts and warnings from the system, and the last column shows the list of recent events. At the top, next to the home drop-down, there are colored circles representing the number of alerts and warnings the system currently has. This is helpful because wherever you are in the management tool, you will always know if there is something you need to be aware of.
In the next image I'm showing the Storage overview page, which gives information specifically focused on storage. We can easily see the capacity information and deduplication details. The second box down in the first column shows what percentage of each type of storage is being used. This is helpful in knowing the amount of SSD capacity being consumed within the storage cluster. There are storage charts similar to those shown in the home view. On this page we are also shown alerts, warnings and events, but only if they are storage related.
In the next image I’ve moved to the Storage Diagram view. This goes deeper into the storage details for the storage cluster. We are getting more capacity related information and larger performance charts.
Now the walkthrough moves on to the VM view, which gives us details focused on the virtual machines running on the storage cluster. This layout shows VM counts with power states and a summary of the CPU and memory resources being consumed on the hosts that are part of the storage cluster. Also shown is a Top VMs view for popular performance metrics. This brings a per-VM view of the storage, which is welcome; only a few vendors can offer this level of detail. And much like the other views, this page shows any alerts, warnings and events for VMs.
The next page shows the VM Table view; when selecting a VM you are provided with greater detail on its performance and configuration. I like this level of detail, as it really helps showcase the performance or amount of resources individual VMs are using without lumping everything on the datastore into one big blob.
By now you must be seeing the theme that Nutanix is following for the management pages. This takes us to the Hardware Overview. On this page we get a high-level view of the hardware within this storage cluster. We see how many blocks and hosts we have, along with disk counts. The second row shows storage performance and resource consumption on a per-host basis. And again, any warnings, alerts and events for hardware items are shown here.
Similar to other views, the hardware view offers a Diagram look into the test block. Here we see a graphical representation at the top; if there is an alert or warning, it is shown on the affected host(s) in the picture. In the example below, the red block in the middle is showing an alert because I removed one of the power cords.
The DR overview page shows similar information focused on this topic. Here you can create remote sites and protection domains for setting up DR protection. I'll talk a bit more about this in a later section.
On this page you are able to dive deeper into the DR details. I really dig this page, as it allows me to see how much bandwidth the replication tasks are using. This is something that customers always ask about with any DR solution. I give credit to Nutanix here for not just doing the bare minimum on their initial DR offering.
The Analysis page is where you can dig deeper into what is going on with your Nutanix storage cluster. Here you can pull up performance charts on a number of different metrics, with full-cluster or per-host views, which is helpful. Some of the charts can be enlarged, and you can adjust the time range at the top to control how large a window of data you want to look at. I thought this was pretty good, but there is room for improvement here, and I'm sure that will happen in future releases.
Last up is the Alerts view, which shows everything that is happening on the storage cluster. You can see every alert, warning and event, which is very helpful when tracking down any issues. This is also where you go to acknowledge alerts so that you can clean them up after the issue has been resolved.
Management walkthrough video
In case you did not want to read the previous section, I created a video walkthrough of the management interface. I go through each section and explain the features and functions.
Upgrading NOS and vSphere
I thought it would be important to cover upgrades, since I've heard some FUD around this topic in the past, and I always want to know what the process is like for any product. At the time of writing this review, the Nutanix block that I was testing was loaded with version 3.5.x, the latest version at the time. Nutanix has announced a set of features coming in version 4.0, which is expected by summer.
Now on to the upgrade. I wanted to experience the process of upgrading a Nutanix block from one version to another, and also what it would be like to upgrade the vSphere version running on the block. I was able to work with the Nutanix team and request code versions to allow for this.
The upgrade of Nutanix NOS was straightforward and easy for me to follow. I was able to accomplish it in a little over an hour on my first attempt by following the upgrade guide. There are a few steps to prepare for the upgrade, accomplished by SSHing to one of the CVMs: check the health of the cluster and make sure all services are running, turn off email alerts, and pause any replication to other clusters/sites.
Then copy the upgrade file to the CVM you are working from and unzip it into a directory. Start the upgrade process from the CVM command line, and it will do a rolling upgrade of all CVMs in the storage cluster, taking one down and upgrading it at a time while the VMs on that host continue running against an alternate CVM. This allows for a non-disruptive upgrade of the Nutanix NOS.
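The rolling-upgrade shape described above is easy to sketch: one CVM at a time is taken down and upgraded while autopathing keeps that host's VMs served by the remaining CVMs. This is an illustrative model only, not Nutanix's actual upgrade code; all names are hypothetical.

```python
# Hypothetical sketch of a rolling CVM upgrade across a storage cluster.

def rolling_upgrade(cvms, new_version, upgrade_one):
    """cvms: list of dicts with 'version' and 'healthy' keys.
    upgrade_one: callable performing the actual upgrade on one CVM."""
    for cvm in cvms:
        others = [c for c in cvms if c is not cvm]
        # Never take a CVM down unless a healthy peer can absorb its I/O.
        assert any(c["healthy"] for c in others), "need a healthy peer first"
        cvm["healthy"] = False            # CVM goes down; I/O reroutes to peers
        upgrade_one(cvm, new_version)
        cvm["version"] = new_version
        cvm["healthy"] = True             # traffic fails back to the local CVM
    return cvms
```

The invariant in the sketch (only one CVM down at a time, with a healthy peer available) is what makes the upgrade non-disruptive to running VMs.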
This process was easy to grasp even on the first attempt. I was happy to find out that it would not take three days and I did not have to give up my first born. I'm told that in a future version, not sure which one yet, the process will become even easier.
The vSphere upgrade process was also a good experience. There are several methods by which you could accomplish an upgrade in this type of solution; the two mentioned in the documentation are a Nutanix process that uploads the offline bundle to a CVM, and VMware Update Manager (VUM). I elected to use VUM for my upgrade. This is supported by Nutanix and seemed easiest, as I was used to the process. There are a few files that Nutanix recommends you back up when doing the upgrade, so I did that with a quick SCP copy. Then I upgraded my hosts one at a time until they were all done. For each host you shut down the CVM running on it and proceed with the upgrade. Once the host is upgraded, you load the Nutanix NFS .vib file on it and the upgrade is complete. Again, nothing scary here, so I'm not sure what people are talking about.
VMware View Planner Performance
I was very interested in testing how Nutanix would perform for virtual desktops. This is a challenging workload, and I've heard the platform talked about as a great solution for VDI. To test this I needed something more than a bunch of desktop VMs that I could boot up and let idle, so I used View Planner from VMware. This is a load generator that simulates desktop workload patterns by running scripts on every desktop that mimic a high-end desktop user running applications.
For the test I created a single Linked Clone pool with 200 desktops. I knew this would not break the Nutanix block, but it should make it sweat a bit. For my testing I was limited by only having 4 nodes. The other limiting factor was that my lab only has 1GbE networking at the moment, not the 10GbE that is recommended for these types of solutions. I was careful in my testing to make sure that I did not reach a point where the network became my bottleneck, which required me to dial back the number of desktops a bit.
I was pleased to see that even under these constraints the block still did extremely well in the test. On initial provisioning of the 200 desktops, the pool went from creation, through customization, to the point of all desktops booted up and ready for use in under 30 minutes. I did not use the View Storage Accelerator feature; this was just plain Linked Clone functionality. There is also the option to turn on shadow copies on Nutanix, which caches a copy of the replica desktop on each CVM to provide even better performance. I wanted to test the low-water mark, and the platform still shined in the testing.
The image below shows the performance charts during the provisioning period. You can see the latency was always below 1ms, which is very impressive. I've seen this type of activity make even the best arrays climb in latency, and there have been only a few that can hang with these types of numbers. The IOPS seemed a bit low to me; the desktops booted very quickly, so I think the peak was lower because there was never a large number of VMs trying to boot at the same time. By the time vCenter was ready to boot and customize the next group, the previous group was already booted and waiting.
The actual workload portion of the View Planner test performed just as well as the provisioning numbers. I am hesitant to publish the numbers, since the solution was limited by my slower networking. I don't want to give anyone fuel for saying the product did not perform when I was the limiting factor.
Nutanix DR Protection
The Nutanix platform offers native storage replication, which can be configured on a per-VM basis. Within the Prism interface you must create protection domains; these domains are configured to protect a single VM or multiple VMs. Within a protection domain you are able to configure your replication settings and target site. Protection domains can be used for several purposes, such as sending groups of VMs to different sites or creating groups of VMs for application consistency. This allows for flexibility in setting up a replication strategy.
As part of the replication, planned failover and failback are built into the product, along with the expected failover-on-outage capability. In short, this unregisters the VMs from the originating site, registers them on the target site, and then starts the VMs there. It will then mark the target site as holding the active copy within Nutanix DR. This is a solid process for customers that do not want to purchase SRM.
The replication interface has a lot of options for configuring replication between sites. You have the option to do typical replication to a secondary site or bi-directional replication between two sites. More complex topologies such as many-to-one, many-to-many and one-to-many are all possible; this is accomplished through the use of multiple protection domains. These features are all possible with the native Nutanix replication.
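It is worth seeing why multiple protection domains give you those topologies for free: because replication is configured per domain, many-to-one, one-to-many and many-to-many layouts fall out of simply pointing each domain at its own remote site(s). A small sketch (illustrative only; the data shapes and site names are made up):

```python
# Hypothetical sketch: derive the site-to-site replication flows implied
# by a set of protection domains, each with its own target site(s).

def replication_topology(protection_domains):
    """protection_domains: list of (domain_name, source_site, target_sites).
    Returns {(source, target): [domain names replicating along that path]}."""
    flows = {}
    for name, source, targets in protection_domains:
        for target in targets:
            flows.setdefault((source, target), []).append(name)
    return flows
```

A domain with two targets yields one-to-many; two domains at different sites pointing at the same target yield many-to-one, with no special cluster-level configuration.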
Nutanix has also created an SRA (Storage Replication Adapter) for use with VMware Site Recovery Manager (SRM). This allows SRM to control the failover of the replicated VMs to the secondary site. If using the SRA, customers must use SRM and cannot use Nutanix DR for failovers, as SRM will control the failover of replicated VMs. This is a limit of SRM rather than of Nutanix, as SRM is a Site A to Site B type of protection product.
One thing I really liked about the replication was the ability to see statistics for bandwidth usage. You are able to view the bandwidth sent versus received for all replication on a per-protection-domain view. The image below is a view for a single protection domain; we are able to see the bandwidth stats, but also ongoing, pending and successful replications. These are all very helpful in monitoring and troubleshooting your environment.
When configuring a remote site you are able to set the maximum amount of bandwidth that the replication traffic can consume. The system also compresses the VM snapshots taken and then does a comparison between sites so that it only transfers data that does not already exist at the target site.
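Those two behaviors, sending only missing data and honoring a bandwidth cap, combine into a simple model of one replication run. A sketch under stated assumptions (illustrative only, not Nutanix code; extent sizes, units and names are invented, and compression is ignored to keep the arithmetic clear):

```python
# Hypothetical sketch: one replication run transfers only the extents the
# target site lacks, throttled to a configured bandwidth cap.

def replicate_snapshot(snapshot, target_extents, cap_mbps):
    """snapshot: {extent-id: size in MB}; target_extents: dict already at target.
    Returns (extents_to_send, estimated_seconds) for this run."""
    to_send = {k: v for k, v in snapshot.items() if k not in target_extents}
    total_mb = sum(to_send.values())
    seconds = (total_mb * 8) / cap_mbps if to_send else 0   # MB -> megabits
    target_extents.update(to_send)        # target now holds these extents
    return to_send, seconds
```

The estimate shows why the site comparison matters: once a baseline exists at the target, each subsequent run only pays for the changed extents, not the full snapshot.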
Data Center Zombie was provided a loaner Nutanix block for a month to allow for the testing required for this review. The block was returned at the end of the review period. We would like to thank the greater Nutanix team for their help in providing the hardware and willingness to answer questions during our process.