The Case for Vertica
What is Vertica?
Vertica is an analytical database that grew out of the C-Store research project led by Michael Stonebraker at MIT. It uses columnar storage to handle very large data sets, runs as a shared-nothing system that scales linearly, and employs K-safety to survive node failures. In our view it is the best engine for enterprise-scale analytics: many users, large data sets, complex calculations, and multiple data feeds.
Michael Stonebraker’s contributions to the refinement and spread of database management technology are hard to overstate. He began work in this area as a young assistant professor at the University of California, Berkeley, where he led the development of both Ingres and Postgres, and he is a Turing Award winner. He is, in effect, the godfather of modern database science. He argues, and Full 360 agrees, that one size does not fit all.
To appreciate columnar technology, you first have to understand the shortcomings of relational row-based technology. Imagine you are counting all of the people in the world, about 8 billion, recorded in a table with thirty columns of data, the first three being Last Name, Country, and Sex. The Country column holds only about 196 distinct values (195 countries plus Unknown) and the Sex column only three (two sexes plus Unknown), yet every one of the 8 billion rows stores a value for each. To speed up lookups you would build two separate index tables, each 8 billion entries long, with pointers back to the main data table. Even so, counting the total number of females requires scanning and matching 8 billion records in your index table.
Columnar technology essentially creates a distinct histogram of all field entries and then creates pointers to virtual rows that are instantiated at query time. This eliminates the need for traditional indexing and table scans. Since these virtual rows are instantiated at query time, fields not explicitly named in the query are never fetched from storage into memory, which drastically reduces I/O. Columnar storage also allows for very wide tables, up to 4000 columns, which enables denormalization and thereby drastically decreases the overhead of multiple joins. These queried instantiations can also be materialized if the designer so chooses.
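The I/O argument above can be made concrete with a toy sketch in Python. This is an illustration of the storage layouts, not of Vertica's internals: counting females in a row store touches every field of every record, while the columnar layout reads only the one column the query names.

```python
# Toy illustration (not Vertica's implementation) of why columnar
# storage reduces I/O for analytic queries.

# Row-oriented storage: each record is a tuple of all fields.
rows = [
    ("Smith", "US", "F"),
    ("Garcia", "MX", "M"),
    ("Chen", "CN", "F"),
]

# Counting females in a row store reads every field of every record.
row_count = sum(1 for last_name, country, sex in rows if sex == "F")

# Column-oriented storage: one array per column.
columns = {
    "last_name": ["Smith", "Garcia", "Chen"],
    "country":   ["US", "MX", "CN"],
    "sex":       ["F", "M", "F"],
}

# The same query reads only the 'sex' column; the other twenty-nine
# columns of a wide table are never fetched from storage.
col_count = columns["sex"].count("F")

assert row_count == col_count == 2
```

At 8 billion rows and thirty columns, the row store moves roughly thirty times the data through memory for this query; that ratio, not the toy counts, is the point.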
The Vertica data model allows you to add any number of nodes, splitting all of the processing across those n nodes. There is no central controlling master/slave arrangement, and data is not shared across nodes except where specifically designated for redundancy (K-safety).
Vertica can continue operating with failed nodes through smart data replication. This redundant partitioning of data is done automatically and is governed by a K-safety parameter: with K=1, no data is lost if one node fails; with K=2, no data is lost if two nodes fail. A typical basic setup is three nodes with K=1; five nodes with K=2 is also common.
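The K-safety rule can be sketched as a small predicate. This is a simplified model for illustration, not Vertica's actual cluster-management logic; besides the K-safety bound, Vertica also requires a quorum of more than half the nodes to remain up, which the sketch includes.

```python
def cluster_survives(total_nodes: int, failed_nodes: int, k: int) -> bool:
    """Simplified model of Vertica K-safety.

    Data survives as long as no more than k nodes have failed, and the
    cluster also needs a quorum: more than half its nodes must be up.
    """
    nodes_up = total_nodes - failed_nodes
    return failed_nodes <= k and nodes_up > total_nodes / 2

# The typical setups described above:
assert cluster_survives(3, 1, k=1)       # 3 nodes, K=1: one failure is fine
assert not cluster_survives(3, 2, k=1)   # a second failure exceeds K=1
assert cluster_survives(5, 2, k=2)       # 5 nodes, K=2: two failures are fine
```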
The Market for Vertica
Vertica is aimed at the Data Warehousing and Business Intelligence market. Anything that falls under the header of ‘Big Data’ is a potential candidate for Vertica. Vertica should be deployed as the thinking, performant part of any business intelligence system - it will do on the back end more than will ever be possible on the front end. It allows for rapid development of business and process intelligence data and is an ideal store for interactive query processing. Its key strengths and weaknesses:
- Tunability (projections, partitions)
- Scalability (data size, users)
- ISO SQL Based
- Analytic Functions
- Developer community + CE
- Extended data integration toolsets
- Handles unstructured data (Dragline)
- No stored procedures
- No conditional logic
- Managed Service/Self Service
Full 360’s ElasticDW (a managed service) and SneaQL (procedural logic) solve for all of those weaknesses except cost, and even then, considering the technical prowess, you are getting excellent value. In fact, Vertica’s free Community Edition is generous: it is fully functional for non-commercial use.
Major Competitive Environment
The major direct competitors for Vertica are Redshift and Snowflake, both columnar, cloud-native databases designed for data warehousing. The broader set of competitors is as follows:
- Hive / Hadoop
- Sybase ASE / Hana
- Essbase ASO
- Teradata / Aster
The Case for Full360 and ElasticDW
Full 360 knows AWS deployments of Vertica better than anyone, with 8 years of experience doing them. Years ago we successfully implemented a Vertica cluster in the cloud with more than 100 nodes for Zynga; at the time it was the world’s largest. We invented SneaQL to speed up development on Vertica and to add functionality it lacked (logic, variables, and stored procedures). We have also worked with Amazon Redshift and know how it behaves in practice, and we have built our Vertica service for our customers on the same general architectural principles as Snowflake. Prior to our AWS work, we accumulated many years of data warehousing experience in the enterprise space.
Full 360 knows how Vertica fails and what to monitor in the AWS cloud. We know how to optimize loading and build for a near-realtime environment. We understand Eon mode. We have converted Oracle customers. We have converted Netezza customers.
The Hybrid Case
Together with Vertica, Full 360 can develop specific tools and utilities that allow current Vertica customers to maintain an on-premise installation and migrate smoothly to the cloud. We expect that customers who don’t plan to migrate fully to the cloud may want to establish a hybrid environment or select a few applications for migration. We have the expertise to build a product that enables multiple Vertica clusters, some in the cloud and some on premise, to share data. This will be a unique capability.
The Multi-Tier Case
At Full 360 we have built several systems that use different sorts of databases and data storage methods to optimize performance. So there is ample precedent for including Vertica in a ‘data sandwich’.
Our philosophy is that purpose-built databases are best and that enterprises (especially) should not waste their money trying to make one product or technology fit all. This is not necessarily a recipe for disaster, but certainly one for inefficiency and lost opportunity.
Full 360, having worked extensively with Vertica, understands how to work around its specialist nature, and has done so by judiciously building multi-tier systems, using other specialized databases and data management tools to fill in the gaps. In our experience the following data technologies are best of breed; they have overlapping capabilities and work well when engineered into a total solution.
VoltDB is another database engine created by Michael Stonebraker. Its design is specialized for the use case of realtime processing of millions of transactions. When we need to calculate something at the speed of a transaction, such as fraud detection, VoltDB is what we use.
Kafka is the leading message broker. It is capable of handling massive numbers of transactions at high speed with excellent reliability. It can handle XML or JSON documents and perform transformations on these and then pass them to multiple subscribers and data stores. It can be sharded on an application by application basis and tuned for throughput.
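The fan-out behavior described above can be sketched with a tiny in-memory broker. This is an illustration of the publish/subscribe pattern Kafka provides at scale, not Kafka's API; the `TinyBroker` class and topic name are ours, invented for the example.

```python
from collections import defaultdict

class TinyBroker:
    """In-memory sketch of publish/subscribe fan-out.

    Kafka does this durably, at massive scale, with per-application
    sharding; this toy only shows the delivery pattern.
    """

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Every subscriber of the topic receives its own copy
        # of the message.
        for handler in self.subscribers[topic]:
            handler(message)

broker = TinyBroker()
warehouse, audit = [], []               # two downstream data stores
broker.subscribe("orders", warehouse.append)
broker.subscribe("orders", audit.append)
broker.publish("orders", {"id": 1, "total": 9.99})

assert warehouse == audit == [{"id": 1, "total": 9.99}]
```

In production the subscribers would be independent consumers (a warehouse loader, an audit feed), each reading the same stream without interfering with the other.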
DynamoDB is a mid-capacity document database that integrates well with websites. It can be scaled up, but we will most likely use it where it already exists in a customer’s environment.
S3 is the building block for data lakes. Our implementation of Vertica takes advantage of structured data lakes, which provide a low-cost staging area for data whose value is not yet clear and for historical data. We also use S3 as a buffer and as redundant storage for data destined for multiple destinations, and we leverage it for Vertica’s Eon mode.
Redis is primarily used by Full 360 as an in-memory buffer database to speed up transactional processing that might traditionally take place in a more expensive location (like DB2). It offers full control of standardized, structured data, which we might process or transform for downstream usage. It is also a low-cost alternative to VoltDB.
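The buffering role described above can be sketched as a small write-behind buffer: writes accumulate in fast memory and are flushed to the slower, more expensive backend in batches. This is an illustrative sketch of the pattern, not of Redis itself; the `WriteBuffer` class, batch size, and flush policy are all assumptions made for the example.

```python
class WriteBuffer:
    """Sketch of an in-memory buffer in front of a slower backend.

    In our architecture Redis plays this role in front of a system
    like DB2; here the backend is just a callback, for illustration.
    """

    def __init__(self, backend_write, batch_size=2):
        self.pending = []
        self.backend_write = backend_write
        self.batch_size = batch_size

    def write(self, record):
        # Fast in-memory append; the expensive backend is only
        # touched once a full batch has accumulated.
        self.pending.append(record)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.backend_write(list(self.pending))
            self.pending.clear()

backend_batches = []                      # stand-in for the expensive store
buf = WriteBuffer(backend_batches.append, batch_size=2)
buf.write({"txn": 1})
buf.write({"txn": 2})                     # second write triggers one batched flush

assert backend_batches == [[{"txn": 1}, {"txn": 2}]]
```

The win is the same as with Redis in practice: many cheap in-memory operations per expensive backend round trip.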
Hadoop is a legacy distributed storage and processing framework capable of managing very large data sets. It has primarily been adopted by web developers with no budget for enterprise databases. It has evolved to process data with SQL through the Hive facility, and it is best used for one-off transformations on large data sets.
Apache Spark is a data processing engine that provides very popular libraries for manipulating datasets, whether they fit in workstation memory or are distributed across a cluster. It is a de facto standard data processing engine for Python programmers and also supports Java and Scala.