
Saturday, 21 February 2015

Emergence of Big Data Systems and Hadoop

To better understand the market drivers related to Big Data, it is helpful to first understand the history of data stores and the kinds of repositories and tools that were used to manage them. Most organizations analyzed structured data in rows and columns and used relational databases and data warehouses to manage large stores of enterprise information. The preceding decade saw a proliferation of new kinds of data sources, mainly productivity and publishing tools such as content management repositories and network-attached storage systems, and data volumes grew to the point where they began to be measured at petabyte scale.


In the 2010s, the information that organizations must handle has broadened to include many other kinds of data. In this era, everyone and everything leaves a digital footprint. Organizations and data collectors are realizing that the data they can gather from individuals holds intrinsic value, and as a result a new data economy is emerging.

As this new digital economy continues to develop, the market has seen the introduction of data vendors and data cleaners that use crowdsourcing to validate the outcomes of machine learning techniques. Other vendors add value by packaging open source tools in a simpler, more consumable way and bringing them to market. Vendors such as Cloudera, Hortonworks, and Pivotal have provided this value-add for the open source framework Hadoop, another example of Big Data innovation in IT infrastructure.

Apache Hadoop is an open source framework that allows companies to process vast amounts of information in a highly parallelized way. It is an ideal technical framework for many Big Data projects, which rely on large or unwieldy datasets with unconventional data structures. One of Hadoop's main benefits is its distributed file system, which lets it spread data across a cluster of commodity servers and process it in parallel, close to where it is stored.
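To make the programming model concrete, here is a minimal sketch of the classic word-count job written against the standard Hadoop MapReduce Java API. The input and output paths are placeholders, and a real job would add cluster-specific configuration.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: each node runs this over its local block of input,
      // emitting a (word, 1) pair for every token it sees.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: the framework groups the pairs by word; this sums the counts.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Illustrative placeholders: HDFS input and output paths.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The mappers run in parallel on each node's local block of input, and the framework shuffles the intermediate (word, 1) pairs so the reducers can sum counts per word. This split between mappers and reducers is what lets Hadoop scale out across commodity hardware.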

Some of the most common examples of Hadoop implementations are in the social media space, where Hadoop can manage transactions, handle textual updates, and develop social graphs among millions of users. Twitter and Facebook generate enormous amounts of unstructured data and use Hadoop and its ecosystem of tools to manage it.

Big Data comes from myriad sources, including social media, sensors, the Internet of Things, video surveillance, and many sources of data that may not have been considered data even a few years ago. As businesses struggle to keep up with changing market requirements, some companies are finding creative ways to apply Big Data to their growing business needs and increasingly complex problems. As organizations evolve their processes and see the opportunities that Big Data can provide, they try to move beyond traditional BI activities, such as using data to populate reports and dashboards, and move toward Data Science-driven projects that attempt to answer more open-ended and complex questions.

We at Analytix Labs offer Business Analytics training and a variety of other programs, such as SAS+ Business Analytics, SAS Edge, Advanced SPSS, and Big Data Hadoop training for individuals, corporates, colleges, and universities. Visit our website for details.

What exactly is Big Data?

Data is created constantly, and at an ever-increasing rate. Mobile phones, social media, medical imaging technologies: all these and more create new data that must be stored somewhere for some purpose. Devices and sensors automatically generate diagnostic information that needs to be stored and processed in real time. Merely keeping up with this huge influx of data is difficult, but substantially more challenging is analyzing vast amounts of it, especially when it does not conform to traditional notions of data structure, to identify meaningful patterns and extract useful information.

Although the volume of Big Data tends to attract the most attention, the variety and velocity of the data generally provide a more apt definition of Big Data. Big Data is sometimes described as having three Vs: volume, variety, and velocity. Due to its quantity and structure, Big Data cannot be efficiently examined using only traditional methods. Big Data problems require new tools and technologies to store, manage, and exploit the data in ways that actually benefit the business. These new tools and technologies need to enable creation, manipulation, and management of large datasets and the storage environments that house them.

However, the challenges of this data flood also present the opportunity to transform business, government, science, and everyday life. For example, in 2012 Facebook users posted 700 status updates per second worldwide, which can be leveraged to deduce users' latent interests or political views and show them relevant ads. Facebook can also construct social graphs to analyze which users are connected to each other as an interconnected network. In March 2013, Facebook released a new feature called “Graph Search,” enabling users and developers to search social graphs for people with shared interests, connections, and locations.

Big Data is data whose scale, distribution, diversity, and timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value. Social media and genetic sequencing are among the fastest-growing sources of Big Data and examples of untraditional sources of data being used for analysis.

Big Data can come in multiple forms, including structured and non-structured formats such as financial data, text files, multimedia files, and genetic mappings. In contrast to much of the traditional data analysis performed by organizations, the most common varieties of Big Data are either semi-structured or unstructured in nature, which requires considerable engineering effort and specialized tools to process and analyze. Distributed computing and parallel processing architectures, which enable parallelized data ingest and analysis, are the preferred approach for processing such complex data.
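As a small illustration of the extra engineering that semi-structured data demands, here is a hypothetical sketch that turns free-form key=value records (a made-up log format) into structured fields before any analysis can begin. In a Hadoop pipeline, parsing like this would typically live inside a mapper.

    import java.util.HashMap;
    import java.util.Map;

    public class SemiStructuredParser {

      // Parses a record such as:
      //   "user=alice action=like target=post:42 note=great read"
      // Tokens without '=' are folded into the previous value, since
      // semi-structured data rarely splits cleanly on whitespace.
      public static Map<String, String> parse(String record) {
        Map<String, String> fields = new HashMap<>();
        String lastKey = null;
        for (String token : record.trim().split("\\s+")) {
          int eq = token.indexOf('=');
          if (eq > 0) {
            lastKey = token.substring(0, eq);
            fields.put(lastKey, token.substring(eq + 1));
          } else if (lastKey != null) {
            // Continuation of a free-text value.
            fields.put(lastKey, fields.get(lastKey) + " " + token);
          }
        }
        return fields;
      }

      public static void main(String[] args) {
        String record = "user=alice action=like target=post:42 note=great read";
        System.out.println(parse(record));
        // e.g. {user=alice, action=like, target=post:42, note=great read}
        // (map ordering is unspecified)
      }
    }

Even this simplified example has to make judgment calls about malformed tokens, which is exactly the kind of per-source engineering that structured, relational data rarely requires.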

Exploiting the opportunities that Big Data presents requires new data architectures, including analytic sandboxes, new ways of working, and people with new skill sets. These drivers are causing organizations to set up analytic sandboxes and build Data Science teams. Although some organizations are fortunate enough to have skilled data scientists on staff, most are not: a growing talent gap makes finding and hiring data scientists in a timely manner difficult. Still, organizations in areas such as web retail, health care, genomics, new IT infrastructures, and social media are beginning to take advantage of Big Data and apply it in creative and novel ways.

If you want to get a Big Data certification, you can visit AnalytixLabs, a premier training institute for analytics, Big Data, Hadoop training, and more.