Published on May 13th, 2015 | by Brian Suhr
DataGravity storage review
Summary: DataGravity has built something that is going to make others ask, "Why can't we do this?"
I’m fascinated when I see entrepreneurs create something, grow it to the point that it goes public or is purchased, and then leave to start something new again. The passion to create something from nothing must burn strong within them. This is exactly the path that the co-founders of DataGravity have taken. Paula Long was a co-founder and John Joseph was an early executive at EqualLogic, a storage startup that was purchased by Dell in 2008. They grew it from nothing into a successful company that flourished and gained momentum with Dell’s resources.
This time around they were focused on solving more than just basic storage problems. They had a vision of storage having an understanding of both virtual machines and data. This combination had never previously been available in a single product. There are now several storage products that are VM-aware, and there are storage analytics products that are good at discovering data. In the past, organizations had to deploy multiple products to achieve these two goals, with little to no integration between them.
I was lucky enough to have access to a pre-GA test array, working with the beta product and watching DataGravity quickly bring an impressive offering to market. They have created a storage product that offers hybrid storage performance, management simplicity, and a reporting engine like no other.
What is Data-Aware?
The DataGravity platform understands both virtual machines running on its volumes and unstructured data. Currently they support over 400 data formats, including MS Office, Adobe PDF, XML, and metadata for most multimedia and DLL formats. A short list of the formats is shown in the table below; full details can be found in the product documentation.
Along with understanding all of these data formats, DataGravity catalogs all user activity on file shares and inside of virtual machines to track content creation, reads, updates and deletes. This functionality is where the real magic happens, because this allows admins and compliance teams to understand who is doing what with the data on DataGravity arrays.
DataGravity is initially offering two capacity options for their storage array, a 48TB and a 96TB option. The storage array starts out as a 6U chassis that is built on commodity technology. They are using 2U for the storage controllers, which run on SuperMicro commodity hardware. This is a common approach that many vendors use today. Using commodity server hardware for their storage controllers allows DataGravity to separate the processing power from the capacity in their arrays. This flexibility will allow them to adjust CPU and capacity resources independently over time. There is a 4U disk enclosure that holds all of the spinning disk for the array. The SSD flash drives are located within the controller server chassis. The real magic is in the storage software that DataGravity has created, which I will go deeper into during this review.
The storage arrays are all dual-controller devices that are Active/Passive for storage connectivity. Connectivity to the array is accomplished by 10GbE connections using several IP storage protocols. Each controller has a pair of 10GbE storage connections and a pair of 1GbE management connections. This allows for a highly available design by providing redundant controllers and network connectivity. The rear-view diagram below shows the two storage controllers at the top of the array. The lower two-thirds of the array is the disk enclosure. The disk enclosure has redundant SAS controllers that are connected to each storage controller for redundancy.
The most impressive features of a DataGravity array are the data-aware features. Their ability to look into and track data on the array as files, virtual machines, and files within virtual machines is crazy cool. I’ve long been a fan of VM-aware storage arrays and have reviewed some of these already. DataGravity has taken this to a whole new level by providing an impressive set of details on what data is on my storage array. I will go through this and show examples of what they can do in the walkthrough of the interface later in this review.
You may be wondering how all of this data gathering affects performance. Well, the good news is it does not affect performance at all. In the past, to get data analytics like this, organizations had to purchase a separate software suite and point it at their storage array, and the process of indexing and discovering data would crush the array. To solve this issue, DataGravity has intelligently married these two products. They use the passive storage controller to remove the risk of affecting performance at the controller level. All I/O traffic is serviced by the active controller, which normally leaves the passive controller doing nothing. DataGravity elected to harness the power of the passive controller to perform their indexing and analytic functions rather than let it sit idle.
To solve the issue of affecting performance at the disk (spindle) level of the storage pool, they have designed two separate storage pools into their architecture. The primary pool is used for the actual running of virtual machines and the data that is stored on the array. The Intelligence Pool is a secondary pool that is used for the analytic functions and data protection. By having a second storage pool, DataGravity uses a dedicated set of spindles to perform the intensive indexing and analytic functions, isolating that work from the primary pool and preventing it from affecting performance. The secondary pool also provides data protection by keeping a second copy of the data that can be restored from as part of the data protection features.
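The division of labor described above can be sketched in a few lines of Python. This is purely illustrative; the function and pool names are my own, not DataGravity's internals:

```python
# Hypothetical sketch of the dual-controller, dual-pool division of labor.
# All names here are illustrative, not DataGravity's actual implementation.

def route(task):
    """Decide which controller and storage pool service a given task."""
    if task in ("read", "write"):
        # Client I/O: active controller, primary-pool spindles.
        return ("active-controller", "primary-pool")
    if task in ("index", "analyze", "discovery-point"):
        # Indexing, analytics, and data protection: passive controller,
        # working against a dedicated set of intelligence-pool spindles.
        return ("passive-controller", "intelligence-pool")
    raise ValueError(f"unknown task: {task}")

# Heavy indexing never shares CPU or spindles with client I/O:
assert route("write") == ("active-controller", "primary-pool")
assert route("index") == ("passive-controller", "intelligence-pool")
```

The point of the design is visible in the two return values: the analytic workload and the client workload never touch the same controller or the same disks.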
The two storage pools are created upon the initial deployment of the array. The entire capacity of the array is not split at the beginning. A good-sized amount of capacity is held in reserve in what they call the free pool; these disks wait to be assigned to one of the other pools. The expansion of pools can be done automatically by the array as demands increase, or manually by an admin. The pools can also be manually shrunk by an admin if needed. The image below shows the storage pool view from the management interface.
The storage array supports a number of different storage protocols. These allow the array to be presented and consumed in a number of different ways, providing flexibility in how organizations consume the capacity and letting the array meet a large number of use cases. The following is a list of all supported protocols.
- CIFS/SMB (v1.0, v2.0, v2.1)
- NFS (v3, v4)
- iSCSI
Today the array only supports VMware vSphere for the virtual machine aware features. I would guess that as DataGravity matures and demand rises they will eventually support other hypervisors. DataGravity offers inline compression that is always turned on for both the primary and intelligence pools of storage. The array also offers inline deduplication that is always on for the DiscoveryPoints on the intelligence pool. Deduplication for the primary pool is also on, but takes a more targeted approach, deduplicating content where they feel they can add value, especially around virtual machines.
As part of the dual storage pool design explained earlier, DataGravity is able to use the secondary copy of the data as a protection copy. Data protection is enabled by creating and assigning a DiscoveryPoint Policy to an object, which can be a volume, datastore, or virtual machine. You can either recover an entire virtual machine or perform a file-level restore.
A sample of the screen to configure a DiscoveryPoint Policy is shown below. They have the typical schedules that you can configure for hourly, daily, weekly, and monthly copies of the object being protected. Something unique is that a backup can also be taken based on data change: for example, if a VM's data changes up or down by more than 5%, a backup is taken automatically. This could be used to protect against someone ingesting a lot of new data or trying to purge data.
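The change-based trigger boils down to a simple threshold check. A minimal sketch, assuming a 5% default; the function name and signature are hypothetical, not DataGravity's API:

```python
# Hypothetical sketch of the change-based DiscoveryPoint trigger: take a
# protection copy when used capacity drifts more than `threshold` from the
# last protected baseline. Names are illustrative, not the product's API.

def needs_discovery_point(baseline_bytes, current_bytes, threshold=0.05):
    """True when data grew or shrank past the threshold since the last copy."""
    if baseline_bytes == 0:
        return current_bytes > 0
    change = abs(current_bytes - baseline_bytes) / baseline_bytes
    return change > threshold

# A VM ingesting lots of new data trips the trigger...
assert needs_discovery_point(100_000, 106_000)      # grew 6%
# ...as does a large purge...
assert needs_discovery_point(100_000, 90_000)       # shrank 10%
# ...while normal churn does not.
assert not needs_discovery_point(100_000, 103_000)  # grew 3%
```

Because the check uses the absolute change, it catches both the bulk-ingest and the mass-delete cases the review mentions.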
DataGravity arrays offer a dual-controller architecture. Each controller on the storage array offers dual 10GbE connections for IP storage traffic and dual 1GbE management connectivity. These redundant connections allow a link failure to occur without requiring a controller failover. The redundant storage controllers allow for a complete controller failover without impacting performance.
DataGravity has an active/passive storage controller design, allowing the array to provide maximum performance from the single active controller. DataGravity uses the passive controller to perform the indexing and data analytics. Should a controller failure occur, the passive controller becomes active, and the indexing and data analytics functions are paused while running in a single-controller state. I/O and data protection take priority; once the failed controller is available again, the analytic functions resume.
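That failover behavior can be sketched as a small state machine. Again, this is my own illustration of the described behavior, not DataGravity's code:

```python
# Hypothetical sketch of the failover behavior described above: in a
# single-controller state, I/O keeps flowing while analytics pause, then
# resume when the peer returns. Illustrative names only.

class FailoverState:
    def __init__(self):
        self.controllers_up = 2
        self.analytics_running = True

    def serving_io(self):
        # I/O is served as long as at least one controller survives.
        return self.controllers_up >= 1

    def controller_failed(self):
        self.controllers_up = 1
        self.analytics_running = False  # I/O and data protection take priority

    def controller_restored(self):
        self.controllers_up = 2
        self.analytics_running = True   # indexing/analytics resume

state = FailoverState()
state.controller_failed()
assert state.serving_io() and not state.analytics_running
state.controller_restored()
assert state.analytics_running
```

The key trade-off is visible in `controller_failed`: the surviving controller sheds the analytics workload rather than degrading client I/O.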
DataGravity currently produces two different storage arrays. This review is based on the DG2200, the smaller-capacity offering in their lineup. The two arrays offer the same features but with different amounts of capacity and flash.
- DG2200 – 48TB raw capacity and 2.4TB of Flash
- DG2400 – 96TB raw capacity and 4.8TB of Flash
The Good and the Bad
The following items are what our rating was based on. These are qualities the review team considers important when evaluating products and their ongoing usage. For each item, we describe what affected the rating in a positive or negative fashion.
Innovation: We gave a 4.5 star rating for the innovation that DataGravity has built and is delivering. Bringing a whole new level of detail to data on an array and in your VMs is nothing to be bashful about. I was intrigued when I heard about the idea and was blown away when I first saw it. Also the fact that it performs very well and does not affect the performance of the array is equally impressive.
Performance: I found my DataGravity test array performed well, earning a 3.75 rating. But since I was using pre-GA hardware with GA software, I was not able to fully test the performance of the array. I did run smaller amounts of server workloads and different amounts of virtual desktops and was happy with the results from my limited performance testing.
Other: The rating of 3.2 stars was awarded based on the following criteria. The upgrade process today is not as simple as other arrays in the hybrid storage space, something that is very common for younger products. I feel confident that this will be fixed in future revisions. I can say that the DataGravity staff was always very helpful in explaining the upgrade process and even performed a few upgrades over the testing period. Something that negatively affected the rating for this section is that DataGravity has not yet released some base storage functions that many customers consider when evaluating storage. The biggest missing feature seems to be replication, both for data and virtual machines. I expect DataGravity to add replication in an upcoming release at some point.
Interface: The interface received a rating of 4.25 stars. This rating is based on the quality of the DataGravity web interface. The array management interface is a simple-looking web-based management portal. The design is well laid out and easy to understand. It was easy to find out how the storage was performing, and the data analytic searches were clear and simple.
Scalability: The rating of 3.5 stars was based on the following factors. Today the DataGravity arrays are only capable of supporting the single disk shelf that is included with the array. No further expansion is allowed at this time. They have communicated that over the next 12-18 months they expect the arrays to support up to 4 total shelves of disks; this would allow an array to grow by 4x. It sounds like customers should expect additional shelf expansion in multiple stages, with the first step opening up sometime in 2015. As support for this additional capacity becomes available, this rating will increase.
Documentation: The documentation received a rating of 4.0 stars. DataGravity has the standard set of documents that you would expect, covering install, upgrade, and administration tasks. The documentation was well written and easy to consume.
The management console for the DataGravity storage arrays is a clean, easy-to-understand web-based portal built into each array. The management interface supports both local accounts and Active Directory authentication.
Upon logging into the storage array you are greeted with the following screen: three large tiles that offer options for system-related items, storage-related items, and Discover, which is for analytics. Clicking on any of the tiles presents you with several options.
Before exploring the main parts of the array, I clicked on the window-like icon at the top. This provides high-level stats about the system, mount points, and files. It's a great way to see summarized data about what is going on with the array.
If you look under the system tile you will see four options. The VMware credentials option is a page to enter credentials for your vCenter(s) so that the system can communicate with them for its VM-aware features. The notification policies option is where you can create policies to alert you based on different alert levels. User access is for setting up accounts, and the System Mgmt choice is covered next.
System management gives us details about the health and configuration of the storage array. You can see many details and drill down deeper on some. In the upper left, the System Information block shows hardware details and configuration information. In the lower left, the System Health block shows a physical view of the back of the storage array; color coding indicates if there is an issue, and a list of checks covers the major items. If you expand the block, it opens a large page where you can click on and identify all major items.
In the upper right we see Storage Utilization, which shows capacity details about the storage pools. You see how much each of the two storage pools is using and what is left in unallocated capacity. Below that is a summary of each of the different mount points on the array. A small performance block provides a view of the IOPS and throughput for the entire array. The last block, in the lower right, shows any alerts for the system.
The storage tile opens to provide a number of options that can be acted on. From this menu you will create new storage in the form of an NFS export, CIFS/SMB share, iSCSI LUN, or VMware datastore. You can also manage policies and VMs.
First I am taking a look at the Manage VMs option. It shows a listing of all VMs running on the array that it is aware of. We see details about which mount point each is located on, the client OS, capacity details, the protection policy, and its status. If the list is long, you can use the search field to narrow things down. At the bottom of each column is a field that can also be used to filter the data. You can click on a VM name to get further details about that virtual machine.
The next image shows the data about the VM that was clicked on. This view is where some of the VM- and data-level reporting really starts to shine. We can see how much content this person is creating on this virtual machine, following the timeline. Standard performance graphs are presented as well. The lower left shows which users are touching data on this VM; the larger and bolder the account name, the more that user ID is doing. The user IDs are populated if you have the DataGravity agent installed in the VM. The DiscoveryPoints section shows details about the data protection for this VM.
The next image shows how you can manage the mount points available on the array. From this single tab you can see all of the different types of storage that you are presenting from the array, along with details about them. Click on any one of them to get further details.
Following up on the previous section, the following screen shows details about the selected mount point. From this view there are performance and capacity details provided, along with a list of any virtual machines running on the datastore.
Moving back to the main menu of the management console, the Discover option is the next area to explore. The Discover section is where the data-aware details can be explored in depth.
The first thing to look at is the Trending option within the Discover section of the console. As a sample, I have selected my virtual desktop and entered "vmware" as the term to search for. The chart shows high-level detail on the counts of files in my VM that match the search term. The lower section shows that there were over 9,700 matches found and displays file names and some data about the files.
The next look is at the Search option under the Discover section of the console. Again I have selected my virtual desktop and entered "vmware" as the search term. The results show the list of files, although in this view I am using the filters on the left side to adjust the results. I have chosen to show only spreadsheets as the file type, which cut the results down from over 9,000 to just 5. There are several filter options that you can use to help isolate the exact data you are looking for. If I click on a file, it takes me to a file-level detail view showing details similar to those shown earlier for the VM, and allows me to download or preview the file.
DataGravity does offer a non-disruptive upgrade process, but it must be performed via the command line at this point. They have not yet built an upload-a-package-and-click-a-button feature in the management console. This is on the roadmap for them, and I expect to see it in a reasonable amount of time. This is very typical for new storage offerings as they mature: focus is on features, and the supporting items follow afterward.
I was not able to do extensive performance testing on the array that I was provided. The array was late in the beta process and was not the same hardware configuration as the GA hardware, so I agreed that any performance tests I performed would account for this constraint or I would not publish them. Honestly, I was not that interested in how DataGravity was going to perform; I was most interested in the data-aware details.
I will say that I mostly used the array for testing virtual desktops and some server workloads. I was happy with the performance but did not try to push it to the breaking point. In talks with DataGravity, they assured me I would be happy with the performance, but said they would be improving it as they add features and mature the platform. DataGravity collects writes in NVRAM until they reach a certain level and then performs a sequential full-stripe write; this is done to improve the write performance of the array.
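The write path described above can be sketched roughly as follows. This is a generic illustration of NVRAM coalescing into full-stripe writes, with made-up names and sizes, not DataGravity's actual implementation:

```python
# Hypothetical sketch of write coalescing: acknowledge writes from NVRAM,
# then flush them as one sequential full-stripe write once a stripe's worth
# has accumulated. Names and sizes are illustrative.

class WriteCoalescer:
    def __init__(self, stripe_size=4):
        self.stripe_size = stripe_size
        self.nvram = []            # buffered, already-acknowledged writes
        self.stripes_written = []  # each entry is one sequential stripe write

    def write(self, block):
        self.nvram.append(block)   # ack immediately from NVRAM
        if len(self.nvram) >= self.stripe_size:
            # One large sequential write replaces many small random ones.
            self.stripes_written.append(tuple(self.nvram))
            self.nvram = []

wc = WriteCoalescer()
for b in ["w1", "w2", "w3", "w4", "w5"]:
    wc.write(b)
assert wc.stripes_written == [("w1", "w2", "w3", "w4")]
assert wc.nvram == ["w5"]  # waits in NVRAM for the next stripe to fill
```

Turning many small random writes into one sequential stripe write is a common technique for getting good write performance out of spinning disk, which matches what DataGravity described.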
Working in consulting with many different types of organizations, I have encountered projects that could have benefited from this type of detail. I also think that if you do not require this level of visibility, or are not under heavy regulatory constraints, you may not be as interested in DataGravity as your general storage platform.
If you are a NetApp customer, or are considering becoming one, and have a lot of unstructured data, I think DataGravity is worth taking a look at. You get the features that you need plus a bunch of cool stuff that NetApp only wishes it could provide.
Testing was performed on a local DataGravity array that I was provided access to.