Distributed Cache in Hadoop

Introduction:
When you write MapReduce applications, you sometimes want a file to be shared across all nodes in the Hadoop cluster. It can be a simple properties file or an executable JAR file.

The Hadoop MapReduce project provides this facility through something called the DistributedCache.
The DistributedCache is set up through the job configuration, and it makes read-only data available to every machine in the cluster.

Step 1: Put the file into HDFS

# hdfs dfs -put /tmp/file1 /cachefile1

Step 2: Add the cache file to the job configuration

Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
// Register the HDFS file so it is shipped to every task node.
// On Hadoop 2.x+ the equivalent call is job.addCacheFile(new URI("/cachefile1")).
DistributedCache.addCacheFile(new URI("/cachefile1"), job.getConfiguration());
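
For context, here is a minimal driver sketch showing where those lines typically live. The class name and input/output paths are illustrative, and the mapper/reducer setup is omitted; it is a sketch, not a full word-count implementation.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCountDriver.class);

        // Register the HDFS file so it is copied to every task node before tasks run.
        DistributedCache.addCacheFile(new URI("/cachefile1"), job.getConfiguration());

        // Mapper, reducer and key/value classes would be set here as in any other job.

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}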

Step 3: Access the cached file

// Local paths to the cached files on the task node (typically read in setup()).
Path[] cacheFiles = context.getLocalCacheFiles();
FileInputStream fileStream = new FileInputStream(cacheFiles[0].toString());
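
A common pattern is to load the cached file once in the mapper's setup() method. The sketch below assumes the cached file is a plain text list of stop words, one per line; the class name and file format are illustrative, not part of the DistributedCache API.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheAwareMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Set<String> stopWords = new HashSet<String>();
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Paths to the local copies of the cached files on this task node.
        Path[] cacheFiles = context.getLocalCacheFiles();
        if (cacheFiles != null && cacheFiles.length > 0) {
            BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    stopWords.add(line.trim());
                }
            } finally {
                reader.close();
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every word that is not in the cached stop-word list.
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty() && !stopWords.contains(word)) {
                context.write(new Text(word), ONE);
            }
        }
    }
}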