The Characteristics Of Big Data Platforms And Their Importance For Data Exploration

  • Feb 13, 2017
  • 5 min read

Data is everywhere. ‘Data is gold’, ‘Data is the new oil’ and similar quotes describe the importance attached to data gleaned from different industry sources. From IoT-enabled sensors and traffic patterns to web history and medical records, data is recorded, stored and analyzed to enable the technologies and services the world relies on today.

The term ‘Big Data’ reflects the sheer scale of the data involved, and it is commonly characterized along four major dimensions, the 4Vs: Volume, Velocity, Variety and Veracity. Data is ingested in volumes reaching petabyte and exabyte scale, from a variety of sources, in structured, semi-structured and unstructured forms, and its truthfulness is therefore always under scrutiny. That is why it is important for organizations to collect this ‘Big Data’ using dynamic information-discovery models and analyze it to drive important business decisions.

Underlying Architecture

An essential component of a Big Data platform is the process that enables the ingestion, storage and management of data, and Hadoop is a major open-source framework that helps achieve this. It supports the processing and storage of extremely large datasets in a distributed computing environment. Hadoop’s architecture starts with cluster planning, i.e. dedicating multi-core CPUs with generous RAM and HDD configurations, and facilitates ingestion via two main approaches: batch and event-driven. The former suits files and structured data, while the latter suits most near-real-time events such as logs, transactional or sensor data. Hadoop’s MapReduce does an excellent job of processing batch workloads, but its efficiency is reduced when processing real-time streams. To compensate, we can use Apache Spark or Storm alongside Hadoop, since they are designed for (near-)real-time processing. The processed data is then stored in either HDFS, Hadoop’s distributed file system, or HBase, a column-oriented database built on top of it, both of which offer fast read and write performance at scale.
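To make the batch model concrete, here is a minimal, single-process sketch of the MapReduce pattern that Hadoop parallelizes across a cluster; the word-count task and input strings are illustrative, not from any real dataset:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input record."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by key and sum their counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data is big", "data is everywhere"]
print(reduce_phase(map_phase(docs)))
# → {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real Hadoop job the map and reduce phases run on different nodes over HDFS blocks, with the framework handling the shuffle between them; the logic each phase performs is the same.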

Since Hadoop processes data in a parallel, distributed manner, a central service is required for cross-node coordination. A ZooKeeper ensemble does that job efficiently, keeping a copy of the state of the entire system and persisting this information in local log files. Hadoop also provides access control in the form of an information architecture, i.e. a concise access schema that tightly controls who has access to what data, which is very helpful where a cluster is shared across departments. But at an enterprise level, continuously generated data can far exceed our ability to store, process, analyze and transmit it, and this puts stress on all of the underlying infrastructure.
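ZooKeeper organizes the shared state it coordinates as a hierarchy of "znodes", a tree of paths that clients use for locks, configuration and leader election. The toy class below is an in-memory sketch of that namespace idea only (the class name and paths are invented for illustration); real clients would talk to a ZooKeeper ensemble through a library such as kazoo:

```python
class ZNodeTree:
    """In-memory sketch of ZooKeeper's hierarchical namespace of znodes."""
    def __init__(self):
        self.nodes = {"/": b""}  # path -> data payload

    def create(self, path, data=b""):
        # ZooKeeper rejects creation if the parent znode is missing.
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent znode {parent!r} does not exist")
        self.nodes[path] = data

    def get(self, path):
        return self.nodes[path]

tree = ZNodeTree()
tree.create("/locks")                      # container for lock nodes
tree.create("/locks/job-1", b"worker-a")   # worker-a claims job-1
print(tree.get("/locks/job-1"))            # → b'worker-a'
```

The real service adds the properties that make this usable for coordination: replication across the ensemble, ordered atomic updates, ephemeral nodes that vanish when a client disconnects, and watches that notify clients of changes.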

Cloud comes to the rescue

This shortcoming can be addressed by employing cloud-based, large-scale distributed compute and storage infrastructure. The cloud enables either the manual setup of Hadoop and other computing engines like Storm and Spark inside VMs, or provides these capabilities as out-of-the-box services that scale automatically with usage. These managed services have been heavily adopted by SMEs and startups since they provide efficient resource utilization. Azure’s HDInsight and Amazon’s EMR are the leading solutions in this domain, offering easy deployment of these technologies as managed clusters with enterprise-level security and monitoring. Given the current trend toward cloud-based services, it can be argued that we are going to see the rise of information-service organizations, much as the banking industry arose centuries ago to manage and handle our financial assets.

Encompassing it all into One

Since the overall focus of employing a big-data strategy is on gaining business insights, companies are looking to develop a comprehensive information-management strategy that involves more than simply ingesting big data. Specifically, they want to integrate their existing data systems, including relational DBMSs, enterprise content management systems, data warehouses, etc. This is where the concept of data exploration comes into the picture: describing the data by means of statistical and visualization techniques that bring its important aspects into focus for further analysis. To achieve comprehensive data exploration, companies need to move beyond traditional analytic techniques and shift from hindsight to foresight analytics. If this varied data is the oil, data analysis has to be the engine that drives its exploration, and the tools used for this task should therefore be able to harness data from all of the given data systems.
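The statistical side of data exploration often starts with nothing fancier than summary statistics that flag where to look closer. A minimal sketch, assuming a single numeric column (the sensor readings below are invented for illustration):

```python
import statistics

def explore(values):
    """Summarize a numeric column to bring its key aspects into focus."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

readings = [21.5, 22.0, 21.8, 35.0, 22.1]   # one value sits far from the rest
summary = explore(readings)
print(summary["max"] - summary["mean"])      # a large gap hints at an outlier
```

A dedicated exploration tool does the same thing across every column at once and pairs it with plots, but the goal is identical: surface the outliers, gaps and skews that deserve further analysis.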

Data Exploration Tools

The first step in analyzing this data is data cleaning, since data visualization tools only understand nicely structured, clean data. This can be achieved using tools like OpenRefine and DataCleaner, after which the data can be passed on to the data-mining phase. Here, tools like RapidMiner and IBM SPSS Modeler help discover insights within a database, and help make predictions and decisions based on the data we already have at hand.
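What a cleaning pass actually does can be sketched in a few lines; tools like OpenRefine apply the same kinds of rules interactively and at scale. The record layout and values here are invented for illustration:

```python
def clean_rows(rows):
    """Minimal cleaning pass: trim whitespace, drop rows with an empty
    name, coerce the numeric field, and skip rows that fail to parse."""
    cleaned = []
    for row in rows:
        name, age = (field.strip() for field in row)
        if not name:
            continue            # drop records with no identifier
        try:
            cleaned.append({"name": name, "age": int(age)})
        except ValueError:
            continue            # drop records with a malformed number
    return cleaned

raw = [("  Alice ", " 34"), ("", "19"), ("Bob", "not-a-number"), ("Carol", "28")]
print(clean_rows(raw))
# → [{'name': 'Alice', 'age': 34}, {'name': 'Carol', 'age': 28}]
```

Dropping bad rows is only one policy; depending on the analysis, it can be better to impute or flag them instead, which is exactly the kind of decision dedicated cleaning tools let you make per column.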

While data mining is all about sifting through your data in search of previously unrecognized patterns, data analysis is about breaking that data down and assessing the impact of those patterns over time. Analytics is about asking specific questions and finding the answers in the data, whether about the past, present or future. This can be facilitated using tools like Qubole and BigML; the latter is a powerful machine learning service with an easy-to-use interface for importing data and getting predictions out of it. The last and most important step in this process is the visualization of the analyzed data, which helps a data scientist convey the insights from that data to the rest of the company using powerful tools like Tableau, Silk and Plot.ly.
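To give a feel for the predictive side of analytics, here is the simplest possible forecasting model: a least-squares trend line fitted to a historical series and extended one step ahead. This is a toy sketch of the idea, not how a service like BigML actually models data, and the sales figures are invented:

```python
def fit_trend(ys):
    """Least-squares line through (0, y0), (1, y1), ...; returns (slope, intercept)."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    return slope, mean_y - slope * mean_x

def forecast(ys, steps_ahead):
    """Extrapolate the fitted trend `steps_ahead` points past the series."""
    slope, intercept = fit_trend(ys)
    return intercept + slope * (len(ys) - 1 + steps_ahead)

monthly_sales = [100, 110, 120, 130]   # perfectly linear toy series
print(forecast(monthly_sales, 1))      # → 140.0
```

Real predictive services replace the straight line with models that handle seasonality, noise and many input variables, but the workflow is the same: fit on the past, project into the future, then act on the projection.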

Tapping the potential of this beast

But the predictive analytics we get from the above tools is not enough on its own. It has to be followed by prescriptive analytics, which combines descriptive analytics, to elucidate what happened in the past, with the forecasting capabilities of predictive analytics, to prescribe the next best steps to take and help enterprises reach an action stage. Price optimization, inventory management, supply chain optimization, resource allocation and transportation planning are business processes to which prescriptive analytics can be applied with great results. The biggest example of an old-school organization that has fully embraced Big Data is GE (General Electric). They have put sensors in gas turbines, jet engines, MRI machines, etc. to determine when the machines will need to be serviced. GE has been able to reduce expenditure on these services by 50%, and it is now transforming its entire services business with this data. The sensors on one of their gas turbines alone create more data per day than Twitter does in a week.
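The step from predictive to prescriptive is the step from "the reading will cross the limit" to "schedule maintenance now". The sketch below illustrates that hand-off in miniature; the function name, vibration readings and thresholds are all hypothetical, and GE's actual predictive-maintenance systems are vastly more sophisticated:

```python
def prescribe_service(vibration_readings, threshold=0.8, horizon=5):
    """Prescriptive step on top of a prediction: if the recent trend
    pushes vibration past `threshold` within `horizon` readings,
    recommend scheduling maintenance now."""
    recent = vibration_readings[-3:]
    trend = (recent[-1] - recent[0]) / (len(recent) - 1)   # simple slope
    projected = recent[-1] + trend * horizon               # predictive part
    if projected >= threshold:                             # prescriptive part
        return "schedule maintenance"
    return "no action needed"

print(prescribe_service([0.40, 0.50, 0.60]))   # rising fast → schedule maintenance
print(prescribe_service([0.40, 0.41, 0.40]))   # stable      → no action needed
```

The pattern generalizes: a forecast alone is just information, while a rule (or an optimizer) that maps the forecast to a concrete action is what turns analytics into a business decision.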

In all, we can say that Big Data technologies have the potential to foster great results for organizations when combined with efficient, result-oriented analytics. But for that to happen, organizations have to evolve their existing data-ingestion architectures. With more data and more potential relationships between data points, businesses will need experts to sift through it and pinpoint the signal in the noise, and this is where the role of the data scientist comes into the picture. IT departments also need to continue building a data-driven mindset, which includes investing in the back end of data by improving governance policies and data quality.




©2017 by Kamal Chaturvedi. Proudly created with Wix.com
