Big data has a big reputation
Hadoop systems
Hadoop is an open-source framework used to store and manage big data. Its cluster design allows for distributed storage and processing across a number of computers, each acting as a node in the cluster. Because it is built on ordinary commodity hardware, the cluster can keep expanding as the needs of the business change, and Hadoop relies on this horizontal scaling to process the large volumes of data it stores. The Hadoop ecosystem is made up of four main components, each providing different capabilities.
Hadoop Distributed File System (HDFS) – Hadoop's primary component, it manages the file system that stores data across the computer cluster. The data is stored in blocks, with individual Hadoop nodes operating on the data contained in those blocks. There are two types of nodes: NameNodes, which hold all the metadata, and DataNodes, which handle data requests such as creation, deletion, and replication. HDFS provides a global view of the stored files and doesn't require users to define schemas beforehand.
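To make this concrete, here is a minimal Java sketch of writing and then reading a small file through the HDFS FileSystem API. The NameNode address (hdfs://namenode:9000) and the file path are assumptions for illustration; real settings would normally come from the cluster's configuration files rather than from code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/example.txt");

        // Write a small file; HDFS splits larger files into blocks
        // and replicates them across DataNodes automatically.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello from hdfs\n");
        }

        // Read the file back through the same FileSystem handle.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[64];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n));
        }
    }
}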
MapReduce – a distributed programming model that assists with data processing. A map function is applied to the input key-value pairs, producing intermediate key-value pairs; these outputs are grouped by key, and a reduce function is applied to each group. The final output of this process combines the individual subsets into smaller datasets that are easier to understand and analyze. It's important to point out that MapReduce is not a programming language, but a model, or paradigm.
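As a rough illustration of the model (a sketch, not a definitive implementation), the classic word-count example in Java defines a map function that emits a (word, 1) pair for every word in a line and a reduce function that sums the counts for each word:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: one (word, 1) pair is emitted for every word in the input line.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: all values for the same key (word) are summed into one count.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}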
Yet Another Resource Negotiator (YARN) – the coordination and scheduling component of Hadoop. YARN distributes MapReduce tasks across the cluster, manages the resources needed to schedule users' applications, and keeps a log of all jobs.
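As a brief, hedged sketch of how a job reaches YARN, the driver below submits the word-count classes from the previous example and points the job at the YARN framework through the mapreduce.framework.name property (usually set in mapred-site.xml rather than in code); the input and output paths are placeholders passed on the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Run on the YARN resource manager rather than locally.
        conf.set("mapreduce.framework.name", "yarn");

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output are HDFS paths supplied as command-line arguments.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // YARN schedules the map and reduce tasks across the cluster;
        // waitForCompletion blocks until the job finishes and reports status.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}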
Hadoop Common – this component includes the libraries and utilities shared across the Hadoop modules. It provides the services and processes that support the operating and file systems. Its library stores documentation and procedures and includes a contribution section maintained with the Hadoop community.
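For instance, the Writable types used in the snippets above, such as Text and IntWritable, along with Configuration and Path, live in the shared hadoop-common libraries. A small sketch of their serialization behavior, which both HDFS and MapReduce build on:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) throws Exception {
        // Text and IntWritable are Writable types from hadoop-common;
        // the other Hadoop modules rely on this shared serialization layer.
        Text key = new Text("hadoop");
        IntWritable value = new IntWritable(42);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        key.write(out);    // serialize key and value to a byte stream
        value.write(out);

        DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        Text keyCopy = new Text();
        IntWritable valueCopy = new IntWritable();
        keyCopy.readFields(in);    // deserialize them back
        valueCopy.readFields(in);

        System.out.println(keyCopy + " = " + valueCopy.get());
    }
}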
Hadoop is just one of many systems and technologies that have surfaced in response to the big data phenomenon. Spark is another framework that is widely used by companies. It is important for businesses to establish their needs before investing in systems like these; sometimes the data can be managed just as easily with data warehouses or data lakes instead of a framework like Hadoop or Spark.
Sources
Stedman, C., M. L. (2021). Big data management. TechTarget. Retrieved from https://www.techtarget.com/searchdatamanagement/definition/big-data-management
Google Cloud. (n.d.). What is Apache Hadoop? Retrieved from https://cloud.google.com/learn/what-is-hadoop
Rouse, M. (2014). Hadoop Common. Techopedia. Retrieved from https://www.techopedia.com/definition/30427/hadoop-common