Running Hadoop in Pseudo-distributed Mode on Mac OSX
Instead of the default non-distributed (standalone) mode, where Hadoop runs as a single Java process in a single JVM instance with no daemons and without HDFS, we can run it in pseudo-distributed mode to bring it closer to the production environment.
In this mode Hadoop simulates a cluster on a small scale: each Hadoop daemon runs in its own JVM instance, but all on a single machine, and HDFS is used instead of the local filesystem.
For development it is useful to have a setup on the local machine that closely simulates the production environment. These are the steps I followed to create this setup on my Mac OSX.
Install Hadoop with Homebrew; this installs Hadoop at /usr/local/Cellar/hadoop/2.7.3.
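A minimal sketch of that step, assuming Homebrew is already set up (the version Homebrew pulls today may be newer than the 2.7.3 used here):
$ brew install hadoop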
SSH on the Mac: enable Remote Login in System Preferences -> Sharing.
Check that you can ssh to localhost without a passphrase:
$ ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following commands:
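These are the standard key-setup commands from the Hadoop single-node guide; if you already have a key, skip the first command and just append your public key:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys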
Configuration: Edit the following configuration files in your Hadoop directory.
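For the Homebrew install these files should be under /usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop (a path assumed from the usual Homebrew layout; adjust if yours differs). The minimal pseudo-distributed settings from the Hadoop single-node setup guide are:

core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>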
Execution: Format and start HDFS and YARN.
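A sketch of the HDFS side, assuming Hadoop's bin and sbin directories are on your PATH (otherwise call the scripts from the install's libexec/bin and libexec/sbin directories):
$ hdfs namenode -format
$ start-dfs.sh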
Now you can browse the web interface for the NameNode at http://localhost:50070/
Make the HDFS directories required to execute MapReduce jobs:
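Following the single-node guide, with <username> standing in for your login name:
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/<username>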
Start the ResourceManager and NodeManager daemons:
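With the same PATH assumption as above:
$ start-yarn.sh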
Browse the web interface for the ResourceManager at http://localhost:8088/
Test the example code that ships with this Hadoop version:
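One way to do this, borrowed from the single-node guide, is the grep example; the paths below assume the Homebrew libexec layout and version 2.7.3, so adjust them to your install:
$ hdfs dfs -put /usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop input
$ hadoop jar /usr/local/Cellar/hadoop/2.7.3/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'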
Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:
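For example (paths are relative to your HDFS home directory):
$ hdfs dfs -get output output
$ cat output/*
Or view them directly on HDFS:
$ hdfs dfs -cat output/*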
Submit a YARN job:
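As an illustration, the pi estimator from the same examples jar can be submitted through YARN; the jar path and the 8 maps / 1000 samples arguments here are just placeholders for whatever job you want to run:
$ yarn jar /usr/local/Cellar/hadoop/2.7.3/libexec/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 8 1000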
When you’re done, stop the daemons with:
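Again assuming the sbin scripts are on your PATH:
$ stop-yarn.sh
$ stop-dfs.sh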
Reference: http://zhongyaonan.com/hadoop-tutorial/setting-up-hadoop-2-6-on-mac-osx-yosemite.html
The latest version of this setup can be found at: https://gist.github.com/eellpp/fcdcb03ca02fbd495b67ce7e488422f5