Data Mastery: A Hitchhiker’s Guide to the Data Ecosystem Landscape



Shivnath Babu, CTO and co-founder at Unravel

Although it is a relatively new industry, the big data ecosystem has seen extensive development since its inception in the 1990s. For those who remember when data in an Excel spreadsheet was the cutting edge, today’s landscape is a far cry from those modest origins.

For those veterans, this has meant constantly adapting to the latest technologies and practices, which seem to arrive faster every year.

While constantly keeping up with industry trends can feel like a burden for data scientists, doing so has given them a much broader understanding of the market’s needs.

Excel Spreadsheets

Big data began on Excel spreadsheets. While the processing power of the average computer of that era pales in comparison to the average phone of today, these spreadsheets were the cutting edge in their day.

They first introduced people to proper data organisation, visualisation and processing.

They set the expectations for how data can be used and consumed, and many of the ways of working that Excel established have persisted to today.

Now that nearly anyone can use Excel (and millions do), we take it as a given that data can be viewed in an accessible format and that users can easily apply functions like SUM or AVG in real time.

Flashier users could even write functions to identify certain rule conditions and signal them via cell colour. Excel also set the expectation that we can easily view data in graphs and charts rather than as numbers on a screen.

Excel set the standard from which developments in big data would progress, and all subsequent advancements have been in these same areas.

Databases

Once organisations recognised how much value data could provide, they were quick to find new ways of managing it for greater profit. This gave rise to consumer technologies like Microsoft Access, as well as larger corporate ones like Oracle, SQL Server, and Db2, and later disruptors like MySQL and PostgreSQL.

These technologies built on the functionality offered by Excel while also adding Structured Query Language (SQL). SQL gives users and applications the ability to identify patterns in the data, and transactions to handle concurrency issues.
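A minimal sketch of both ideas, using a hypothetical schema and standard SQL: an aggregate query surfaces a spending pattern across customers, and a transaction keeps a concurrent transfer of funds consistent.

```sql
-- Hypothetical tables: orders(order_id, customer_id, amount, order_date)
-- and accounts(account_id, balance).

-- Pattern discovery: total and average spend per customer,
-- highest spenders first.
SELECT customer_id,
       SUM(amount) AS total_spend,
       AVG(amount) AS avg_order_value
FROM   orders
GROUP  BY customer_id
ORDER  BY total_spend DESC;

-- Concurrency handling: move funds between two accounts atomically,
-- so a concurrent reader never sees a half-finished transfer.
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
COMMIT;
```

Declarative queries and transactions of this kind are what lifted data work beyond anything a spreadsheet could offer.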

The broad capabilities provided by databases were a significant factor in the dot-com revolution.

However, this simplicity was soon bogged down by complications and overuse, the most problematic of these being “third normal form”, under which each distinct entity is stored in its own table.

While there are some use cases where third normal form provides value, it forces designers to join tables back together in order to recover any higher level of meaning, which hinders both performance and design. SQL can hide some of this joining complexity, but that convenience led to a host of problems of its own.
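A minimal sketch of the issue, assuming a hypothetical third-normal-form schema in which customers, orders, order line items and products each sit in their own table: even a simple question about what each customer bought requires several joins.

```sql
-- Hypothetical 3NF schema: the entities live in separate tables,
-- so reassembling a customer's purchases needs three joins.
SELECT c.customer_name,
       p.product_name,
       oi.quantity * oi.unit_price AS line_total
FROM   customers   c
JOIN   orders      o  ON o.customer_id = c.customer_id
JOIN   order_items oi ON oi.order_id   = o.order_id
JOIN   products    p  ON p.product_id  = oi.product_id
WHERE  o.order_date >= DATE '2024-01-01';
```

Every additional entity adds another join, which is how queries balloon as schemas grow.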

SQL made complex distributed functionality accessible to people, even those who didn’t understand how a query would be executed. The code this generated did ultimately work, but it was inefficient and performed poorly.

As the amount of data and the number of tables involved grew, we saw SQL queries running to many thousands of lines. These performed terribly and ultimately led to the development of the appliance…

Appliances

The central proposition of an appliance was to take a database and distribute it across numerous nodes on many racks.

However, this immediately ran into some stumbling blocks due to a lack of distribution experience. At this point, the industry was still nascent and unfamiliar with how to go about building a distributed system, so a number of the implementations were underwhelming.

Another factor was that the hardware was nowhere near the quality of today’s. Unless they were expensive, processing systems lacked backups and redundancy components, so frequent node failures were highly disruptive.

Even for organisations using top-end hardware, the ‘abuse’ of SQL was too much for the technology of the time. Organisations thought that by throwing more money at hardware they could keep pace with the increasing data size, but this simply wasn’t the case.

Eventually all these problems came to a head and people became disillusioned with appliances.

While people were still reeling from the disappointment of appliances, this left room for the next big player in the space: cloud.



Cloud, On-Premises, and Hybrid Environments

While we saw many innovations to deal with these issues – extract, transform and load, and the processing pipeline, for instance – the cloud revolutionised everything.

While organisations had traditionally been reluctant to host their data online, this changed as a number of big names moved to the cloud, the CIA, Amazon, AT&T, IBM and Oracle among them.

While the core technology wasn’t a drastic innovation for the businesses in the data world, the cost model and the deployment model were revolutionary.

These deployments reduced costs drastically, and that is before even counting the advancements we’ve seen in machine learning, artificial intelligence and IoT.

Today, these technologies are enabling specialised systems that allow for more pipelines and more data processing.

With this history in mind, it’s clear how much the solutions at hand have changed the ecosystem; we are now living in a data-everything world. Yet organisations still have difficulty deriving the value they want from their data.

Although we live in an age where real-time information covering most dimensions of a business is readily available, not all companies manage to make use of this data.

What we have seen historically is that the solution to these problems, and to data mastery, does not come from theory or technology alone, but from finding ways to implement it practically. Solutions that improve the speed of development and execution, and reduce cost, always triumph.

Of course, high-quality engineers are needed to build these solutions, but their work does not exist in a vacuum. Awareness of the past is the first step to building a better future.
