Big data has a big reputation
Hadoop systems
Hadoop is an open-source framework used to store and manage big data. Its cluster design allows for distributed storage and processing across a number of computers, each acting as a node in the cluster. Because it is built on ordinary commodity hardware, the cluster can keep expanding as the needs of the business change, and Hadoop relies on this horizontal scaling to process the large volumes of data it stores. The Hadoop ecosystem is made up of four main components, each providing different capabilities.
Hadoop Distributed File System (HDFS) – Hadoop's primary component, it manages the file system that stores data across the computer cluster. The data is stored in blocks, with individual Hadoop nodes operating on the data contained in those blocks. There are two types of nodes: NameNodes, which hold all the metadata, and DataNodes, which handle data requests such as creation, deletion, and replication. HDFS provides a global view of the stored files and doesn't require users to define schemas beforehand.
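To make this concrete, here is a minimal Java sketch of writing and then reading a small file through the HDFS FileSystem API. The NameNode address (hdfs://namenode:9000) and the file path are assumptions for illustration; real settings would normally come from the cluster's configuration files rather than from code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/example.txt");

        // Write a small file; HDFS splits larger files into blocks
        // and replicates them across DataNodes automatically.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello from hdfs\n");
        }

        // Read the file back through the same FileSystem handle.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[64];
            int n = in.read(buf);
            System.out.println(new String(buf, 0, n));
        }
    }
}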
MapReduce – a distributed programming model that assists with data processing. A map function is applied to the input key-value pairs, producing intermediate key-value pairs; these outputs are grouped by key, and a reduce function is applied to each group. The final output of this process combines the individual subsets into smaller datasets that are easier to understand and analyze. It's important to point out that MapReduce is not a programming language, but a model, or paradigm.
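As a rough illustration of the model (a sketch, not a definitive implementation), the classic word-count example in Java defines a map function that emits a (word, 1) pair for every word in a line and a reduce function that sums the counts for each word:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: one (word, 1) pair is emitted for every word in the input line.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reduce phase: all values for the same key (word) are summed into one count.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}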
Yet Another Resource Negotiator (YARN) – the coordination and scheduling component of Hadoop. YARN distributes MapReduce tasks across the cluster, manages the resources needed to schedule users' applications, and keeps a log of all jobs.
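As a brief, hedged sketch of how a job reaches YARN, the driver below submits the word-count classes from the previous example and points the job at the YARN framework through the mapreduce.framework.name property (usually set in mapred-site.xml rather than in code); the input and output paths are placeholders passed on the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Run on the YARN resource manager rather than locally.
        conf.set("mapreduce.framework.name", "yarn");

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output are HDFS paths supplied as command-line arguments.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // YARN schedules the map and reduce tasks across the cluster;
        // waitForCompletion blocks until the job finishes and reports status.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}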
Hadoop Common – this component includes the libraries and utilities shared across the Hadoop modules. It provides the services and processes that support the operating and file systems. Its library stores documentation and procedures and includes a contribution section maintained with the Hadoop community.
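For instance, the Writable types used in the snippets above, such as Text and IntWritable, along with Configuration and Path, live in the shared hadoop-common libraries. A small sketch of their serialization behavior, which both HDFS and MapReduce build on:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) throws Exception {
        // Text and IntWritable are Writable types from hadoop-common;
        // the other Hadoop modules rely on this shared serialization layer.
        Text key = new Text("hadoop");
        IntWritable value = new IntWritable(42);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        key.write(out);    // serialize key and value to a byte stream
        value.write(out);

        DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        Text keyCopy = new Text();
        IntWritable valueCopy = new IntWritable();
        keyCopy.readFields(in);    // deserialize them back
        valueCopy.readFields(in);

        System.out.println(keyCopy + " = " + valueCopy.get());
    }
}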
Hadoop is just one of many systems and technologies that have surfaced in response to the big data phenomenon. Spark is another framework that is widely used by companies. It is important for businesses to establish their needs before investing in systems like these; sometimes the data can be managed just as easily with data warehouses or data lakes instead of a framework like Hadoop or Spark.
Sources
Stedman, C., M. L. (2021). Big data management. TechTarget. Retrieved from https://www.techtarget.com/searchdatamanagement/definition/big-data-management
Google Cloud. (n.d.). What is Apache Hadoop? Retrieved from https://cloud.google.com/learn/what-is-hadoop
Rouse, M. (2014). Hadoop Common. Techopedia. Retrieved from https://www.techopedia.com/definition/30427/hadoop-common