Friday, 29 January 2016

The potential of in-memory NoSQL databases, by Rebaca Analytics


In-Memory NoSQL

What is NoSQL?
A NoSQL (often interpreted as "Not Only SQL") database provides a mechanism for storing and retrieving data that is modelled by means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability.

What is an In-Memory Database?
An in-memory database (IMDB; also main memory database system, MMDB, or memory-resident database) is a database management system that primarily relies on main memory for data storage, in contrast with systems that employ a disk storage mechanism. Main memory databases are faster than disk-optimized databases because the internal optimization algorithms are simpler and execute fewer CPU instructions. Accessing data in memory also eliminates seek time when querying, which provides faster and more predictable performance than disk.
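To make the contrast concrete, here is a minimal sketch in Scala, with hypothetical names, of the kind of key-value interface an in-memory store exposes. Every operation is a hash lookup in RAM, so no disk seek ever sits on the query path, which is exactly the property that makes IMDB latency predictable:

```scala
import scala.collection.mutable

// Minimal in-memory key-value store (illustrative only): every read
// and write is a hash lookup in RAM, so there is no disk seek on the
// query path.
class InMemoryStore[K, V] {
  private val data = mutable.Map.empty[K, V]
  def put(key: K, value: V): Unit = data(key) = value
  def get(key: K): Option[V] = data.get(key)
  def delete(key: K): Option[V] = data.remove(key)
}

val store = new InMemoryStore[String, Long]()
store.put("page-views", 42L)
```

The trade-off, of course, is that capacity is bounded by RAM and durability must be layered on separately (snapshots, write-ahead logs, or replication).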

In-Memory Features: Confusion and Big Data
•    In dynamically scalable partitioned storage systems, whether a NoSQL database, a file system, or an in-memory data grid, changes in the cluster (adding or removing a node) can lead to large data movements across the network to re-balance the cluster.
•    It is important to note that there is a new crop of traditional databases with serious In-Memory "options". That includes MS SQL 2014, Oracle's Exalytics and Exadata, and IBM DB2 with BLU offerings. The line between these and the new pure In-Memory Databases is blurry, and for simplicity I'll continue to call them all In-Memory Databases.
•    It is also important to nail down what we mean by "In-Memory". Surprisingly, there's a lot of confusion here as well, since some vendors refer to SSDs, Flash-on-PCI, Memory Channel Storage and, of course, DRAM as "In-Memory".
•    In reality, most vendors support a tiered storage model where some portion of the data is stored in DRAM (the fastest storage, but with limited capacity) and the rest overflows to a variety of flash or disk devices (slower, but with more capacity), so it is rarely a DRAM-only or flash-only product. However, most products in both categories are biased towards either mostly-DRAM or mostly-flash/disk storage in their architecture.
•    The bottom line is that products vary greatly in what they mean by "In-Memory", but in the end they all have a significant in-memory component.
•    Most In-Memory Databases are your father's RDBMS that stores data "in memory" instead of on disk. That's practically all there is to it. They provide good SQL support with only a modest list of unsupported SQL features, ship with ODBC/JDBC drivers, and can often be used in place of an existing RDBMS without significant changes.
•    It's one of the dirty secrets of In-Memory Databases: one of their most useful features, SQL joins, is also their Achilles' heel when it comes to scalability. This is the fundamental reason why most existing SQL databases (disk- or memory-based) rely on a vertically scalable SMP (symmetric multiprocessing) architecture, unlike In-Memory Data Grids, which use the much more horizontally scalable MPP (massively parallel processing) approach.
•    In-Memory Databases present almost a mirror-opposite picture: they often require replacing your existing database (unless you use one of those In-Memory "options" to temporarily boost its performance), but they demand significantly fewer changes to the application itself, since it continues to rely on SQL (albeit a modified dialect of it).
You will want to use an In-Memory Database if the following applies to you:
•    You can replace or upgrade your existing disk-based RDBMS
•    You cannot make changes to your applications
•    You care about speed, but don’t care as much about scalability
In other words, you boost your application's speed by replacing or upgrading the RDBMS without significantly touching the application itself.
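The tiered storage model described above (hot data in DRAM, overflow to flash or disk) can be sketched roughly like this in Scala. This is a deliberate simplification with hypothetical names; real products use far more sophisticated eviction policies than "evict the oldest entry":

```scala
import scala.collection.mutable

// Two-tier store: a bounded "DRAM" tier that overflows its oldest
// entry to a (simulated) disk tier when capacity is exceeded.
class TieredStore[K, V](dramCapacity: Int) {
  private val dram = mutable.LinkedHashMap.empty[K, V] // fast, limited
  private val disk = mutable.Map.empty[K, V]           // slow, large

  def put(key: K, value: V): Unit = {
    dram(key) = value
    if (dram.size > dramCapacity) {
      val (oldKey, oldVal) = dram.head // oldest inserted entry
      dram.remove(oldKey)
      disk(oldKey) = oldVal            // overflow to the slower tier
    }
  }

  // Check DRAM first; fall back to the disk tier on a miss.
  def get(key: K): Option[V] = dram.get(key).orElse(disk.get(key))
}
```

The point of the sketch is the read path: a DRAM hit is cheap, a miss falls through to the slower tier, which is why products biased towards mostly-DRAM behave so differently from mostly-flash ones under the same API.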

Why In-Memory NoSQL?
Application developers have long been frustrated with the impedance mismatch between relational data structures and the in-memory data structures of their applications. Using NoSQL databases lets developers work without having to convert in-memory structures to relational structures.
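A small Scala illustration of that mismatch, with made-up types: the same aggregate either has to be flattened into two relational tables and re-joined on read, or can be stored and read back whole as a document:

```scala
// An order with line items: a natural in-memory structure.
case class LineItem(sku: String, qty: Int)
case class Order(id: Int, customer: String, items: List[LineItem])

val order = Order(1, "alice", List(LineItem("A-1", 2), LineItem("B-7", 1)))

// Relational mapping: the nested list must be flattened into a second
// table and re-joined on read -- the "impedance mismatch".
val orderRow = (order.id, order.customer)
val itemRows = order.items.map(i => (order.id, i.sku, i.qty))

// Document mapping: the aggregate is stored (and read back) whole,
// e.g. as one JSON-like document keyed by order id.
val documentStore = Map(order.id -> order)
```

With the document mapping there is no object-relational translation layer to write or maintain, which is the convenience the paragraph above is describing.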



Case Study:
DataStax Brings In-Memory To NoSQL
Web and mobile applications are getting bigger, and people are as impatient as ever. These are two factors hastening the use of in-memory technology, and DataStax has introduced the latest database management system (DBMS) to add in-memory processing capabilities.
DataStax Enterprise is a highly scalable DBMS based on open source Apache Cassandra. Its strengths are flexible NoSQL data modeling, multi-data-center support, and linear scalability on clustered commodity hardware. Customers like eBay, Netflix, and others typically run globally distributed deployments at massive scale.
Use cases for the new feature include scenarios in which semi-static data experiences frequent overwrites. Examples include sites or apps with constantly updated top-10 or top-20 lists, online games with active leaderboards, online gambling sites, and online shopping sites with active "like," "want," and "own" listings.
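A leaderboard is a good fit for in-memory processing precisely because the data is small, hot, and overwritten constantly. A toy Scala sketch (hypothetical API) of the access pattern:

```scala
import scala.collection.mutable

// In-memory leaderboard: scores are overwritten frequently, and the
// top-N list is recomputed from RAM on every read.
class Leaderboard {
  private val scores = mutable.Map.empty[String, Long]

  // Overwrite-heavy write path: replaces any previous score.
  def record(player: String, score: Long): Unit = scores(player) = score

  // Read path: sort descending by score and take the top n.
  def topN(n: Int): List[(String, Long)] =
    scores.toList.sortBy(-_._2).take(n)
}
```

A disk-based store would pay a write amplification cost for every overwrite; keeping this structure in RAM makes both the overwrites and the top-N reads cheap.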
DataStax is following in familiar footsteps, as many DBMS vendors are adding in-memory features. Microsoft, for example, has extensively previewed an In-Memory OLTP option (formerly project Hekaton) that will be included in the soon-to-be-launched Microsoft SQL Server 2014. And Oracle has announced that it, too, will add an in-memory option to its flagship 12c database; general release of that option isn't expected until early next year.
The NoSQL realm already has in-memory DBMS options such as Aerospike, which is heavily used in online advertising. But DataStax's Schumacher said the company tends to show up in much higher-scale deployments than Aerospike.

In-memory DBMS vendors MemSQL and VoltDB are taking the trend in the other direction, recently adding flash- and disk-based storage options to products that previously did all their processing entirely in memory. The goal here is to add capacity for historical data for long-term analysis. As in the DataStax case, the idea is to cover a broader range of needs with one product.

Wednesday, 13 January 2016

Measuring computation: Flink vs Spark, by Rebaca Analytics

The quest to replace Hadoop’s aging MapReduce is a bit like waiting for buses in Britain. You watch a really long time, then a bunch come along at once. We already have Tez and Spark in the mix, but there’s a new contender for the heart of Hadoop, and it comes from Europe: Apache Flink (German for "quick" or "nimble").

Overview

Apache Flink is an open source platform for distributed stream and batch data processing. Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. Flink also builds batch processing on top of the streaming engine, overlaying native iteration support, managed memory, and program optimization.

Apache Flink Architecture and Process Model
•    The Processes
•    Component Stack
•    Projects and Dependencies

Flink sprang from the Technical University of Berlin, and it was known as Stratosphere before it was added to Apache's incubator program. It's a replacement for Hadoop MapReduce that works in both batch and streaming modes, eliminating map and reduce jobs in favor of a directed-graph approach that leverages in-memory storage for massive performance gains.

Similarities:

Spark and Flink have a lot in common. Here's some Scala that shows a simple word count operation in Flink:
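The snippet looks roughly like the canonical Flink batch word count in Scala, sketched here from the Flink DataSet API. It needs the flink-scala dependency on the classpath and a Flink execution environment, so treat it as illustrative rather than copy-paste runnable:

```scala
import org.apache.flink.api.scala._

// Word count with a case class rather than tuples; groupBy and sum
// address the case-class field by name.
case class WordCount(word: String, count: Int)

object WordCountJob {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val text = env.fromElements("to be or not to be")

    val counts = text
      .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
      .map(WordCount(_, 1))
      .groupBy("word")
      .sum("count")

    counts.print()
  }
}
```

The Spark equivalent of the same job would typically split, map to `(word, 1)` tuples, and call `reduceByKey(_ + _)`, which is the tuple-based PairRDD construction discussed below.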

As you can see, while there are some differences in syntactic sugar, the APIs are rather similar. I'm a fan of Flink's use of case classes over Spark's tuple-based PairRDD construction, but there's not much in it. Given that Apache Spark is now a stable technology used in many enterprises across the world, another data processing engine seems superfluous. Why should we care about Flink?
The reason Flink may be important lies in the dirty little secret at the heart of Spark Streaming, one you may have come across in a production setting: Instead of being a pure stream-processing engine, it is in fact a fast-batch operation working on a small part of incoming data during a unit of time (known in Spark documentation as "micro-batching"). For many applications, this is not an issue, but where low latency is required (such as financial systems and real-time ad auctions) every millisecond lost can lead to monetary consequences.
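To see why micro-batching puts a floor under latency, consider this toy model in Scala (all numbers are made up): an event that arrives mid-window must wait for the window to close before the batch containing it is processed.

```scala
// Toy contrast between per-event processing (Flink-style) and
// micro-batching (Spark Streaming-style), ignoring network and
// scheduling overheads.
val batchWindowMs = 500L

// Per-event: processing can start as soon as the event arrives.
def streamingLatency(arrivalMs: Long, processingMs: Long): Long =
  processingMs

// Micro-batch: the event is held until the current window closes,
// so its latency includes the remainder of the window.
def microBatchLatency(arrivalMs: Long, processingMs: Long): Long = {
  val waitForWindow = batchWindowMs - (arrivalMs % batchWindowMs)
  waitForWindow + processingMs
}
```

An event arriving 120 ms into a 500 ms window waits 380 ms before its batch even starts, which is exactly the kind of latency floor that matters in financial systems and real-time ad auctions.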

Flink flips this on its head. Whereas Spark is a batch processing framework that can approximate stream processing, Flink is primarily a stream processing framework that can look like a batch processor. Immediately you get the benefit of being able to use the same algorithms in both streaming and batch modes (exactly as you do in Spark), but you no longer have to turn to a technology like Apache Storm if you require low-latency responsiveness. You get all you need in one framework, without the overhead of programming and maintaining a separate cluster with a different API.

Also, Flink borrows from the crusty-but-still-has-a-lot-to-teach-us RDBMS to bring us an aggressive optimization engine. Similar to a SQL database's query planner, the Flink optimizer analyzes the code submitted to the cluster and produces what it thinks is the best pipeline for running on that particular setup (which may be different if the cluster is larger or smaller).
For extra speed, it allows iterative processing to take place on the same nodes rather than having the cluster run each iteration independently. With a bit of reworking of your code to give the optimizer some hints, it can increase performance even further by performing delta iterations only on parts of your data set that are changing (in some cases offering a five-fold speed increase over Flink’s standard iterative process).
Flink has a few more tricks up its sleeve. It is built to be a good YARN citizen (which Spark has not quite achieved yet), and it can run existing MapReduce jobs directly on its execution engine, providing an incremental upgrade path that will be attractive to organizations already heavily invested in MapReduce and loath to start from scratch on a new platform. Flink even works on Hortonworks’ Tez runtime, where it sacrifices some performance for the scalability that Tez can provide.
In addition, Flink takes the approach that a cluster should manage itself rather than require a heavy dose of user tuning. To this end, it has its own memory management system, separate from Java’s garbage collector. While this is normally Something You Shouldn’t Do, high-performance clustered computing changes the rules somewhat. By managing memory explicitly, Flink almost eliminates the memory spikes you often see on Spark clusters. To aid in debugging, Flink supplies its equivalent of a SQL EXPLAIN command. You can easily get the cluster to dump a JSON representation of the pipelines it has constructed for your job, and you can get a quick overview of the optimizations Flink has performed through a built-in HTML viewer, providing better transparency than in Spark at times.
But let’s not count out Spark yet. Flink is still an incubating Apache project. It has only been tested in smaller installations of up to 200 nodes and has limited production deployment at this time (although it’s said to be in testing at Spotify). Spark has a large lead when it comes to mature machine learning and graph processing libraries, although Flink’s maintainers are working on their own versions of MLlib and GraphX. Flink currently lacks a Python API, and most important, it does not have a REPL (read-eval-print-loop), so it's less attractive to data scientists -- though again, these deficiencies have been recognized and are being remedied. I’d bet on both a REPL and Python support arriving before the end of 2015.
Flink seems to be a project with definite promise. If you're currently using Spark, it might be worthwhile standing up a Flink cluster for evaluation purposes (especially if you're using Spark Streaming). However, I wonder whether all of the "next-generation MapReduce" communities (including Tez and Impala along with Spark and Flink) might be better served if there were less duplication of effort and more cooperation among the groups. Can't we all just get along?

Differences:


In contrast to Apache Flink, before version 1.5.x Apache Spark could not handle data sets larger than available RAM (cached RDDs).

Apache Flink is optimized for cyclic or iterative processing, using iterative transformations on collections. This is achieved by optimizing join algorithms, chaining operators, and reusing partitioning and sorting. However, Flink is also a strong tool for batch processing. Flink streaming processes data streams as true streams: data elements are immediately "moved" through a streaming program as soon as they arrive. This makes it possible to perform flexible window operations on streams.
Apache Spark, on the other hand, is based on resilient distributed datasets (RDDs). This (mostly) in-memory data structure underpins Spark's functional programming paradigm, and Spark can handle large batch computations by pinning data sets in memory. Spark Streaming wraps data streams into mini-batches: it collects all the data that arrives within a certain period of time and runs a regular batch program on the collected data. While that batch program is running, the data for the next mini-batch is collected in parallel.



Yahoo! Benchmarks Apache Flink, Spark and Storm


Yahoo! has benchmarked three of the main stream processing frameworks: Apache Flink, Spark and Storm.
For stream processing, Yahoo! previously used S4, a platform developed internally, but decided in 2012 to replace it with Apache Storm. They currently use Storm extensively for various data processing needs, running it on roughly 2,300 nodes. But new data processing frameworks have taken center stage lately, and Yahoo! wanted to compare their performance against Storm's, so they devised a benchmark, published on GitHub.
In this benchmark, Yahoo! compared Apache Flink, Spark and Storm. The application tested is related to advertising, with 100 campaigns and 10 ads per campaign. Five Kafka nodes generate JSON events, which are deserialized, passed through a filter, joined with their associated campaigns, and stored in a Redis node. Kafka varied the number of events generated in 10 steps, from 50K/s to 170K/s. The entire setup of the benchmark, including the hardware configuration used along with various observations on what was left out and what could be improved, is presented in a Yahoo! Engineering post. To compare the three frameworks, Yahoo! measured the percentile latency needed for a tuple to be completely processed, at each event emission rate.
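Stripped of Kafka and Redis, the per-event logic of the benchmark reduces to a few stages. This Scala sketch, with invented field names, mimics them with plain collections:

```scala
// Per-event pipeline sketch: deserialize -> filter -> join the ad with
// its campaign -> update a (simulated) Redis counter.
case class AdEvent(adId: String, eventType: String)

// Static ad-to-campaign mapping used for the join step.
val adToCampaign = Map("ad-1" -> "campaign-A", "ad-2" -> "campaign-B")

// Stand-in for the Redis node that stores per-campaign counts.
val campaignCounts = scala.collection.mutable.Map.empty[String, Long]

def process(event: AdEvent): Unit =
  if (event.eventType == "view")                 // filter step
    adToCampaign.get(event.adId).foreach { c =>  // join step
      campaignCounts(c) = campaignCounts.getOrElse(c, 0L) + 1 // store step
    }
```

The benchmark then measures how long each event takes to travel through these stages end to end, at increasing emission rates.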
According to Yahoo!, both Flink and Storm showed similar behavior: the percentile latency varies linearly up to the 99th percentile, after which latency grows exponentially. Storm 0.10.0 could not process data beyond an event rate of 135K/s. Storm 0.11.0 with acking had serious trouble processing data at 150K/s. With acking disabled, Storm 0.11.0 performed much better, beating Flink, but Yahoo! acknowledges that "with acking disabled, the ability to report and handle tuple failures is disabled also."
In Yahoo!'s tests, Spark's results were considerably worse than Flink's and Storm's, with latencies going up to 70 seconds without back-pressure and 120 seconds with back-pressure, compared with less than 1 second for Flink and Storm (percentile latency measured at the 150K/s event rate).
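For reference, a percentile latency of the kind reported above can be computed with the nearest-rank method. This is one common convention, sketched in Scala; Yahoo!'s exact method isn't specified here:

```scala
// Nearest-rank percentile: sort the samples and pick the value at
// rank ceil(p/100 * n), using 1-based ranks.
def percentile(samples: Seq[Double], p: Double): Double = {
  val sorted = samples.sorted
  val rank = math.ceil(p / 100.0 * sorted.size).toInt
  sorted(math.max(rank - 1, 0))
}
```

So "99th percentile latency" is simply the latency that 99% of tuples beat, which is why the tail (the last percentile) is where overload shows up first.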




Thursday, 7 January 2016

Big Data Top Trends

Quantum Computing To Grow
The concept of quantum computing has been around for a long time, but it has always been seen as something that will only become a real possibility at some undefined point in the future. However, 2016 may be the year its use becomes more commonplace.
After recent work by Australian researchers at the University of NSW, it has become possible to code the machines in a more cohesive and understandable way. They have managed to entangle a pair of qubits for the first time, allowing more complex coding to be created and therefore making the use of quantum computers potentially more widespread.
2016 will not see the use of quantum computing become common, but its presence in the data world will become far more pronounced, and some of the more experimental, forward-thinking tech giants may begin to use it more frequently.

Improved Security Scrutiny
Data was in the media spotlight in 2015, but not in the ways many would want. Unfortunately, data hacks have become more common than many would have predicted; from the Ashley Madison hack to the TalkTalk hack, it has become clear that companies could do more to protect their data.
2016 will therefore see increased scrutiny of how data is handled and protected. This comes at a time when many countries around the world are looking at implementing new data protection and data access laws, meaning that the waters are going to become increasingly muddied.
Within this, companies will need to increase their security spending, improve database safety and prepare for seismic changes in the way that hackers work. It is going to be a difficult year for data security, but it will build the foundation on which future stable and robust data security is created.

Analytics To Be Simplified & Outsourced
We have seen new data visualization and automation software breaking down the barriers between the data initiated and uninitiated. As this trend continues, conducting analysis on datasets will become considerably simpler; we have already seen software with a drag-and-drop analysis option on tablets that is usable by almost anybody.
This comes not only from the needs of the untrained, but also because we are still in the midst of a skills gap in the data scientist market, meaning that companies need to look at how they can leverage their data without necessarily having the skills in house to do so. So we have software that can do relatively simple analysis for companies, while the more complex analysis needed is likely to be outsourced to companies who have the expertise. This is likely to be a growth area in 2016, and a number of companies are already leading the way in this regard.

Data In The Hands Of The Masses
Data is no longer just something discussed in boardrooms and laboratories at the highest levels. Every day, people get out of bed and look at the data collected on their sleep patterns, investigate what they are spending money on through apps, or simply check the possession and running stats of their favourite sports teams. Data is now everywhere in our society, which means the general population is becoming increasingly clued up on using it.
This is not to say that the general population is suddenly going to become data scientists, but it means that the kind of data shared can become more complex as understanding of it increases across the population. When discussing important matters, informed discussions can be had with data rather than conjecture. There will still be many who throw themselves at things with blind faith and gut instinct, but 2016 will see a growing segment of the population who can engage with matters through data in a way they never could before, through both increased access to it and better understanding of it.

Hadoop For Mission-Critical Workloads
In 2016, Hadoop will be used to deliver more mission-critical workloads, beyond the "web scale" companies. While companies like Yahoo!, Spotify and TrueCar have all built businesses that significantly leverage Hadoop, we will see Hadoop used by more traditional enterprises to extract valuable insights from the vast quantities of data under management and to deliver net-new mission-critical analytic applications that simply weren't possible without Hadoop.

Big Data Made Easy
There is a market need to simplify big data technologies, and opportunities for this exist at all levels: technical, consumption, and so on. Next year there will be significant progress towards simplification. It doesn't matter who you are (cluster operator, security administrator, data analyst): everyone wants Hadoop and related big data technologies to be straightforward. Things like a single integrated developer experience or a reduced number of settings and profiles will start to appear across the board.