Installing and Configuring Hadoop and HDFS on Linux

Published on April 3, 2025

Running a single-node cluster on Kali Linux for handling large data

Hadoop has become a foundational tool in the big data ecosystem, widely used for distributed storage and processing of large datasets. This article walks through the installation and configuration of Hadoop 3.4.1 and HDFS on a Kali Linux system. Designed for students and professionals alike, the steps below outline a hands-on, local setup using a non-root user—ideal for experimentation, learning, and lightweight development.

## Understanding the Basics

Hadoop is an open-source framework that allows distributed processing of large data sets across clusters of computers using simple programming models. Its core modules include:

  • HDFS (Hadoop Distributed File System): For scalable, fault-tolerant storage
  • MapReduce: A computation model for parallel processing
  • YARN: Yet Another Resource Negotiator, used for cluster resource management

Before diving into setup, it's essential to understand the significance of each component and why configuration matters. Setting up Hadoop locally on Kali Linux (or any Unix-based OS) provides valuable insights into its core mechanisms and how real-world clusters are managed.

## Prerequisites

Before diving into Hadoop, ensure your system is ready:

### 1. Update System & Install Java

Hadoop requires Java. We'll use OpenJDK 11, which Hadoop 3.4 supports (alongside Java 8) and which matches the `JAVA_HOME` path configured later in this guide.

```bash
sudo apt update
sudo apt install openjdk-11-jdk -y
java -version; javac -version
```

### 2. Install SSH

SSH is used by Hadoop daemons for communication.

```bash
sudo apt install openssh-server openssh-client -y
```

### 3. Create a Hadoop User

Avoid using root. Create a dedicated user:

```bash
sudo adduser hdoop
sudo adduser hdoop sudo
su - hdoop
```

### 4. Set Up Passwordless SSH

This enables internal Hadoop communication.

```bash
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost
```

## Download and Extract Hadoop

```bash
wget https://downloads.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
tar xzf hadoop-3.4.1.tar.gz
```
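Before extracting, you can optionally verify the archive against the checksum Apache publishes alongside it. This sketch assumes the `.sha512` file follows the standard download-site layout and is in `sha512sum -c` format, as recent Hadoop releases are:

```shell
# Fetch the published checksum and verify the downloaded archive against it.
wget https://downloads.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz.sha512
sha512sum -c hadoop-3.4.1.tar.gz.sha512   # should report: hadoop-3.4.1.tar.gz: OK
```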

## Configure Environment Variables

### Edit .bashrc

No `sudo` is needed here, since the file belongs to the `hdoop` user:

```bash
nano ~/.bashrc
```

Add these lines at the end of the file:

```bash
export HADOOP_HOME=/home/hdoop/hadoop-3.4.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
```

[Figure: Hadoop architecture diagram]

Then reload the file:

```bash
source ~/.bashrc
```
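After reloading, it's worth confirming the variables took effect before moving on. A quick check (assuming the paths above match your install):

```shell
# Confirm the environment is set and the hadoop binary resolves via PATH:
echo "$HADOOP_HOME"      # expect /home/hdoop/hadoop-3.4.1
which hadoop             # expect /home/hdoop/hadoop-3.4.1/bin/hadoop
hadoop version           # should report Hadoop 3.4.1
```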

## Configure Hadoop Core Files

### Edit hadoop-env.sh

```bash
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
```

Add this line:

```bash
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
```
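The JDK install path varies by distribution and architecture, so rather than guessing, you can derive the correct value for `JAVA_HOME` from the `javac` already on your PATH:

```shell
# Resolve the real javac location, then strip the trailing /bin/javac
# to get the JDK root suitable for JAVA_HOME:
dirname "$(dirname "$(readlink -f "$(which javac)")")"
```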

### Edit core-site.xml

```bash
nano $HADOOP_HOME/etc/hadoop/core-site.xml
```

```xml
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hdoop/tmpdata</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

Note that `fs.defaultFS` replaces the deprecated `fs.default.name` property from older guides.
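The directory referenced by `hadoop.tmp.dir` must exist and be writable by the `hdoop` user, so create it before starting any daemons:

```shell
# Create the temp directory referenced by hadoop.tmp.dir in core-site.xml:
mkdir -p /home/hdoop/tmpdata
```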

### Edit hdfs-site.xml

```bash
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
```

```xml
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hdoop/dfsdata/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hdoop/dfsdata/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

A replication factor of 1 is appropriate here because there is only one DataNode; `dfs.namenode.name.dir` and `dfs.datanode.data.dir` are the current names for the deprecated `dfs.name.dir` and `dfs.data.dir`.
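As with the temp directory, the NameNode and DataNode storage paths should exist and be owned by `hdoop` before the first format:

```shell
# Create the storage directories referenced in hdfs-site.xml:
mkdir -p /home/hdoop/dfsdata/namenode /home/hdoop/dfsdata/datanode
```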

### Edit mapred-site.xml

Hadoop 3.x ships `mapred-site.xml` directly (the `.template` file from Hadoop 2.x is gone), so just edit it:

```bash
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
```

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```
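One caveat worth knowing: on Hadoop 3.x, MapReduce jobs submitted to YARN sometimes fail with a "Could not find or load main class ...MRAppMaster" error unless the MapReduce environment is spelled out. If you hit that, adding these properties inside the `<configuration>` element of mapred-site.xml (following the Apache single-node setup guide) usually resolves it:

```xml
<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
```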

### Edit yarn-site.xml

```bash
nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
```

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>127.0.0.1</value>
  </property>
</configuration>
```

## Start Hadoop Services

### Format the NameNode

```bash
hdfs namenode -format
```

### Start HDFS

```bash
start-dfs.sh
```

### Start YARN

```bash
start-yarn.sh
```

### Check Running Processes

```bash
jps
```

You should see NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager among the listed processes.
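With the daemons up, a quick HDFS round-trip confirms the cluster actually works. This sketch assumes the `hdoop` user and the default Hadoop 3 web UI ports:

```shell
# Smoke test: write a file into HDFS and read it back.
hdfs dfs -mkdir -p /user/hdoop          # create a home directory in HDFS
echo "hello hadoop" > hello.txt
hdfs dfs -put hello.txt /user/hdoop/    # copy the local file into HDFS
hdfs dfs -cat /user/hdoop/hello.txt     # should print: hello hadoop
# Web UIs (Hadoop 3 defaults): NameNode at http://localhost:9870,
# ResourceManager at http://localhost:8088
```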

By following the steps in this guide, you've installed and configured Hadoop and HDFS in a single-node, pseudo-distributed environment. This local setup is ideal for learning Hadoop's core components (HDFS, MapReduce, and YARN) and for running small test jobs.

Whether you're preparing for a data engineering role, experimenting with distributed computing, or just curious about big data infrastructure, this Kali Linux-based Hadoop setup offers a controlled, educational sandbox.

From here, you can:

  • Write MapReduce programs in Java or Python
  • Explore data ingestion tools like Apache Sqoop or Flume
  • Connect Hadoop to Hive or Spark for advanced analytics
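Before writing a full MapReduce job, it can help to see the pattern in miniature with shell pipes: "map" each word to its own record, "shuffle" by sorting, and "reduce" by counting groups — the same three phases a Hadoop job runs at scale:

```shell
# Word count as map (tr) -> shuffle (sort) -> reduce (uniq -c):
echo "big data big ideas" | tr ' ' '\n' | sort | uniq -c | sort -rn
```

The most frequent word ("big", count 2) sorts to the top, just as a word-count reducer would emit per-key totals.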

This practical foundation prepares you to take on larger-scale deployments in the cloud or on actual clusters with confidence.