The more information technology (IT) and data contribute to a company’s success and profits, the more worthwhile it becomes to examine current and past operations and business development with appropriate tools such as big data analytics. Big data analytics can generate weekly, daily or even hourly overviews of key performance indicators (KPIs) for management, drawing on information that previously surfaced only in annual and quarterly reports. This enables management to make well-founded decisions that can be quickly converted into actions.
Business intelligence vs. big data
There are several factors driving companies to use big data analyses such as reports, data mining and predictive analytics to gain a competitive edge in the global marketplace. For example, customers demand punctuality and reliability when it comes to delivery and product quality, but this requires planning, control and monitoring systems within the company. The data generated by these systems (“big data”) needs to be evaluated and analyzed automatically on an ongoing basis.
If an incident occurs in any of the production facilities or in the logistics department, the company should be able to switch to plan B quickly, as was necessary when a volcano in Iceland spread clouds of ash over northern Europe in 2010 and air traffic was halted. To switch processes flexibly, it is no longer enough to evaluate historical data from yesterday or last week; enormous quantities of the most current data must be evaluated to keep scheduling, planning and logistics on course.
One way of achieving this is through ad-hoc big data analyses, which are performed by means of in-memory computing. These analyses are generally based on structured data that can be evaluated quickly and efficiently using SQL (Structured Query Language).
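As a minimal sketch of an ad-hoc SQL query over structured data held in memory, the following uses Python's built-in sqlite3 module with an in-memory database. It only illustrates the idea on a small scale and is not an in-memory computing platform; the table, column names and delivery records are hypothetical.

```python
import sqlite3

# Hypothetical delivery records; table and column names are illustrative only.
deliveries = [
    ("Hamburg", "day-1", 1),
    ("Hamburg", "day-1", 0),
    ("Munich", "day-1", 1),
    ("Munich", "day-2", 1),
]

# ":memory:" keeps the whole database in RAM instead of on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deliveries (site TEXT, day TEXT, on_time INTEGER)")
conn.executemany("INSERT INTO deliveries VALUES (?, ?, ?)", deliveries)

# Ad-hoc KPI: share of on-time deliveries per site.
on_time_rate = dict(conn.execute(
    "SELECT site, AVG(on_time) FROM deliveries GROUP BY site ORDER BY site"
).fetchall())
conn.close()
```

Because the data is already structured, a new question (for example, on-time rate per day instead of per site) only requires changing the query, not the data preparation.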
Big data requirements
The vast amount of information that flows into companies’ IT infrastructures from log files, mobile devices and social media posts continues to swell every day like a data lake, creating big data in the truest sense of the term. The question as to whether this data lake really has to be operated in-house (i.e. by the IT department) is becoming increasingly urgent. Can the existing systems still be expanded at a reasonable cost, or would it be more cost-efficient to rent the necessary big data processing and analysis capacities from a cloud service provider?
After all, operating an IT infrastructure for big data is not exactly part of a company’s core business. The Open Telekom Cloud offers Infrastructure as a Service (IaaS), providing storage, computing and network resources from the cloud that are ideally suited for big data analysis.
Today, no more than ten percent of this unstructured and unpredictable flood of data, generally referred to as big data, is evaluated, even when the best business intelligence systems are used. Evaluating the rest requires dedicated big data systems.
Such systems typically include a distribution of Apache Hadoop, an optional SQL extension, a development and reporting environment and various analysis tools. Alternatively, it is also possible to use Apache Spark, which utilizes an in-memory architecture to accelerate data processing.
Because market studies have shown that a high percentage of big data projects fail, it is worthwhile to bring in experienced big data analytics experts from the outset and to rent computing and storage capacity for big data instead of buying it at considerable risk. For this reason, a significant proportion of big data projects are run on cloud platforms from the start and operated by specialists including data scientists and enterprise architects.
Hadoop and MapReduce
It all started with Google. What Google needed for the efficient and high-performance evaluation of big data was a way to divide up large amounts of data, distribute it on a cluster of computer nodes and let these servers process the data. This method is called MapReduce. Google built it alongside its own distributed file system and BigTable, a large database that could likewise be distributed over a cluster of computing nodes.
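The divide, distribute and process idea can be sketched in a few lines of Python. This is a single-process illustration of the programming model, not Google's or Hadoop's implementation; the function names and the sample chunks are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group all emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

# Each chunk stands in for the share of data held by one cluster node.
chunks = ["big data big cluster", "data node data"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster, each call to map_phase would run on the node that holds the chunk, and the reduce calls for different keys would also run in parallel.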
The open source project Hadoop grew out of Google’s published descriptions of MapReduce and its distributed file system; the Hadoop database HBase is modeled on BigTable. The widely used Apache Hadoop is primarily a file system (HDFS) that can be expanded as required and distributed across several computing nodes: a computer cluster that can store several hundred petabytes. Additional modules such as MapReduce, HBase and Hive add parallel processing, a column-oriented database and SQL-style queries, so that Hadoop can both store and analyze large amounts of data.
Big data technology: the MapReduce programming model
Google’s MapReduce technology was groundbreaking at the outset. Not only did it allow computing tasks to be split up, distributed and executed in parallel; it also represented big data in a completely different way than traditional databases.
Instead of fields defined by columns and rows, each MapReduce record consists of only a key and a value. If the key is “customer name”, the value could be “John Smith”, for example. Transactions such as purchases can be reduced to three key/value pairs: customer name, item and price. The metadata also contains the time stamp (date, day, time), the IP address and other connection data.
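The following sketch shows how purchases expressed as key/value pairs might be re-keyed and aggregated. All customer names, items and prices are hypothetical, and the helper to_pair is an illustrative name rather than part of any framework.

```python
from collections import defaultdict

# Hypothetical purchases, each reduced to three key/value pairs.
purchases = [
    {"customer name": "John Smith", "item": "keyboard", "price": 49},
    {"customer name": "Jane Doe", "item": "monitor", "price": 199},
    {"customer name": "John Smith", "item": "mouse", "price": 20},
]

def to_pair(purchase):
    """Pick one field as the key and another as the value."""
    return purchase["customer name"], purchase["price"]

# Sum the values that share a key: total spend per customer.
spend_per_customer = defaultdict(int)
for customer, price in map(to_pair, purchases):
    spend_per_customer[customer] += price
```

Choosing a different key, such as the item, would answer a different question (revenue per product) from the same records.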
Such quantities of data are stored in a cluster of networked nodes. One node (NameNode) takes control by managing the file system metadata, while the others (DataNodes) store the data blocks and are where the calculations in the cluster run. The file system of this cluster is called “Hadoop Distributed File System” (HDFS). HDFS can also be replaced by GPFS (“General Parallel File System”), which is distributed by IBM under the name Spectrum Scale, to gain various advantages in terms of big data analytics.
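The split of responsibilities between the controlling node and the DataNodes can be illustrated with a toy model. It assumes a simple round-robin placement and made-up node names; real HDFS uses a rack-aware placement policy and much larger blocks, so this is only a sketch of the bookkeeping involved.

```python
def place_blocks(file_size, block_size, datanodes, replication):
    """Return NameNode-style metadata: block index -> DataNodes holding a copy."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(num_blocks):
        # Round-robin placement is an assumption made for this sketch.
        placement[block] = [
            datanodes[(block + r) % len(datanodes)] for r in range(replication)
        ]
    return placement

# Hypothetical cluster of four DataNodes; sizes are in arbitrary units.
nodes = ["datanode-1", "datanode-2", "datanode-3", "datanode-4"]
metadata = place_blocks(file_size=500, block_size=128, datanodes=nodes, replication=3)
```

Because every block is held by several DataNodes, the loss of a single node does not make the file unreadable; the controlling node only needs to point readers to another copy.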
The MapReduce Service in Open Telekom Cloud
The MapReduce Service (MRS) offers users a suite of tools for big data analytics, including storage capacities and methods as well as analysis functions.
More specifically, MRS includes the following components, which are critical to big data analytics: Hadoop Common as a file system and platform; HBase as a distributed NoSQL database; Hive as an SQL query tool; the computing engine Spark as an alternative to Hadoop for faster queries; and the Loader tool for uploading mass data. While Spark allows in-memory queries, the standard MRS suite does not include this feature, although it has other ways of supporting in-memory computing.
The computer clusters commonly used for Hadoop and Spark provide the necessary capacities for the duration of a big data analysis, including a monitoring tool. The critical evaluation services are operated in an active standby mode, so that a replacement server can take over within a matter of minutes in case an active server fails.
Another component of the suite is the configuration and synchronization tool Apache ZooKeeper. Data can be stored in the Open Telekom Cloud’s Object Storage Service (OBS), which enables users to initiate analysis jobs. Clusters can be created, configured, extended and searched using a REST API or console access.
A stream of data – usually log and sensor data – supplied from the Internet of Things, such as the smart grid, sensors, wind turbines or machines of all kinds, is called streaming data. The process of efficiently evaluating this form of big data is called streaming analytics. Streaming analytics services include the streaming service Apache Storm and the messaging service Apache Kafka.
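The core idea of streaming analytics, evaluating each reading as it arrives rather than after the fact, can be sketched with a rolling mean. This does not use the Apache Storm or Apache Kafka APIs; the function name, window size and sensor values are assumptions for illustration.

```python
from collections import deque

def rolling_mean(readings, window):
    """Yield the mean of the latest `window` readings as each one arrives."""
    recent = deque(maxlen=window)  # older readings drop out automatically
    for value in readings:
        recent.append(value)
        yield sum(recent) / len(recent)

# Hypothetical sensor values, e.g. turbine output arriving one at a time.
sensor_stream = iter([10, 20, 30, 40])
means = list(rolling_mean(sensor_stream, window=2))
```

The generator never holds more than the current window in memory, which is what allows this style of evaluation to keep pace with an unbounded stream.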
CarbonData is an indexed format for storing data in column form and is particularly suitable for writing large amounts of data quickly, for example to Hadoop, Spark or Storm. This tool is supplemented by the log file management tool Apache Flume. Such log files are not only constantly entered into the system, as mentioned above, but are also generated by it to make it easier for the system administrator to manage and monitor big data analyses.
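To illustrate what storing data in column form means, the sketch below converts a few row-oriented records into a column-oriented layout. It does not reproduce the CarbonData file format or its indexes; the sensor names and values are hypothetical.

```python
# Hypothetical sensor records in the usual row-oriented form.
rows = [
    {"sensor": "a", "temp": 20, "rpm": 900},
    {"sensor": "b", "temp": 22, "rpm": 950},
    {"sensor": "c", "temp": 24, "rpm": 1000},
]

# Column-oriented form: one contiguous list of values per field.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An aggregate over one field only has to read that single column.
mean_temp = sum(columns["temp"]) / len(columns["temp"])
```

Analytical queries typically touch a few fields across many records, which is why a column layout tends to suit them.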
For the user, the MRS offers a wide range of technologies and tools to professionally and successfully set up a big data project and carry out the necessary big data analytics.