org.apache.hadoop.mapreduce.filecache
Class DistributedCache

java.lang.Object
  extended by org.apache.hadoop.mapreduce.filecache.DistributedCache
Direct Known Subclasses:
DistributedCache

Deprecated.

@Deprecated
@InterfaceAudience.Private
public class DistributedCache
extends Object

Distribute application-specific large, read-only files efficiently.

DistributedCache is a facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.) needed by applications.

Applications specify the files to be cached via URLs (hdfs:// or http://) in the JobConf. The DistributedCache assumes that the files specified via URLs are already present on the FileSystem at the path specified by the URL and are accessible by every machine in the cluster.

The framework copies the necessary files to the slave node before any tasks for the job are executed on that node. Its efficiency stems from two facts: the files are copied only once per job, and archives can be cached and are un-archived on the slaves.

DistributedCache can be used to distribute simple, read-only data/text files and/or more complex types such as archives, jars etc. Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes. Jars may optionally be added to the classpath of the tasks, which provides a rudimentary software distribution mechanism. Files have execution permissions. In older versions of Hadoop Map/Reduce, users could optionally ask for symlinks to be created in the working directory of the child task. In the current version, symlinks are always created. If the URL does not have a fragment, the name of the file or directory will be used. If multiple files or directories map to the same link name, the last one added will be used. All others will not even be downloaded.
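The link-name rule described above can be illustrated with a small sketch that uses only java.net.URI. It is not the framework's implementation: the class name LinkNameSketch and its methods are hypothetical, and the sketch assumes hierarchical URIs (ones that have a path component).

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the documented link-name rule; not Hadoop code.
class LinkNameSketch {

    /** Returns the URI fragment if present, otherwise the last path segment. */
    static String linkName(URI uri) {
        String fragment = uri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        String path = uri.getPath();
        // Drop a trailing slash so that a directory URI still yields a name.
        if (path.endsWith("/") && path.length() > 1) {
            path = path.substring(0, path.length() - 1);
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Maps each link name to the last URI that was added under that name. */
    static Map<String, URI> resolve(URI... uris) {
        Map<String, URI> links = new LinkedHashMap<>();
        for (URI uri : uris) {
            links.put(linkName(uri), uri); // a later entry replaces an earlier one
        }
        return links;
    }

    public static void main(String[] args) {
        Map<String, URI> links = resolve(
            URI.create("/myapp/lookup.dat#lookup.dat"),
            URI.create("/myapp/map.zip"),
            URI.create("/other/map.zip"));
        links.forEach((name, uri) -> System.out.println(name + " -> " + uri));
    }
}
```

In this sketch the two map.zip entries collide on the link name map.zip, so only /other/map.zip survives, which mirrors the "last one added" behaviour stated above.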

DistributedCache tracks modification timestamps of the cache files. Clearly the cache files should not be modified by the application or externally while the job is executing.
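To show why a modified cache file is a problem, the following sketch records a file's modification time and later checks whether it still matches. It is only an analogy built on java.nio.file (note that Path here is java.nio.file.Path, not org.apache.hadoop.fs.Path); the class TimestampCheckSketch and its methods are hypothetical and do not reflect how the framework performs its own check.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

// Hypothetical illustration of timestamp-based change detection; not Hadoop code.
class TimestampCheckSketch {

    /** Records the modification time of a file, in milliseconds since the epoch. */
    static long recordTimestamp(Path file) throws IOException {
        return Files.getLastModifiedTime(file).toMillis();
    }

    /** Returns true if the file still has the timestamp that was recorded earlier. */
    static boolean isUnchanged(Path file, long recordedMillis) throws IOException {
        return Files.getLastModifiedTime(file).toMillis() == recordedMillis;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("lookup", ".dat");
        try {
            long recorded = recordTimestamp(file);
            System.out.println("unchanged: " + isUnchanged(file, recorded));
            // Simulate an external modification by moving the timestamp forward.
            Files.setLastModifiedTime(file, FileTime.fromMillis(recorded + 60_000));
            System.out.println("unchanged: " + isUnchanged(file, recorded));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```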

Here is an illustrative example of how to use the DistributedCache:

     // Setting up the cache for the application
     
     1. Copy the requisite files to the FileSystem:
     
     $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat  
     $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip  
     $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
     $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
     $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
     $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
     
     2. Setup the application's JobConf:
     
     JobConf job = new JobConf();
     DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), 
                                   job);
     DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
     DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
     DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
     DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
     DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);
     
     3. Use the cached files in the Mapper
     or Reducer:
     
     public static class MapClass extends MapReduceBase  
     implements Mapper<K, V, K, V> {
     
       private Path[] localArchives;
       private Path[] localFiles;
       
       public void configure(JobConf job) {
         // Get the cached archives/files
         File f = new File("./map.zip/some/file/in/zip.txt");
       }
       
       public void map(K key, V value, 
                       OutputCollector<K, V> output, Reporter reporter) 
       throws IOException {
         // Use data from the cached archives/files here
         // ...
         // ...
         output.collect(key, value);
       }
     }
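
Because a localized file appears as an ordinary local file under its link name (for example lookup.dat from step 2), a task can read it with plain Java I/O. The sketch below loads such a file into a map. The tab-separated key/value layout, the class LookupTableSketch and its load method are assumptions made for illustration; the original example does not specify the contents of lookup.dat.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the tab-separated file format is an assumption.
class LookupTableSketch {

    /** Loads key<TAB>value lines from a localized cache file into a map. */
    static Map<String, String> load(String linkName) throws IOException {
        Map<String, String> table = new HashMap<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(linkName), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines that are not key<TAB>value
                }
                table.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return table;
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the "./lookup.dat" link a task would see.
        Path demo = Files.createTempFile("lookup", ".dat");
        try {
            Files.write(demo, Arrays.asList("us\tUnited States", "fr\tFrance"),
                        StandardCharsets.UTF_8);
            System.out.println(load(demo.toString()).get("fr"));
        } finally {
            Files.deleteIfExists(demo);
        }
    }
}
```

Inside a Mapper, a call such as load("lookup.dat") would typically go in configure(JobConf) so that the table is built once per task rather than once per record.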
     
 

It is also very common to use the DistributedCache through GenericOptionsParser. The DistributedCache class includes methods intended for users (specifically those mentioned in the example above, as well as addArchiveToClassPath(Path, Configuration)) and methods intended for use by the MapReduce framework (e.g., JobClient).

See Also:
JobConf, JobClient

Constructor Summary
DistributedCache()
          Deprecated.  
 
Method Summary
static void addArchiveToClassPath(org.apache.hadoop.fs.Path archive, org.apache.hadoop.conf.Configuration conf)
          Deprecated. Use Job.addArchiveToClassPath(Path) instead
static void addArchiveToClassPath(org.apache.hadoop.fs.Path archive, org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.FileSystem fs)
          Deprecated. Add an archive path to the current set of classpath entries.
static void addCacheArchive(URI uri, org.apache.hadoop.conf.Configuration conf)
          Deprecated. Use Job.addCacheArchive(URI) instead
static void addCacheFile(URI uri, org.apache.hadoop.conf.Configuration conf)
          Deprecated. Use Job.addCacheFile(URI) instead
static void addFileToClassPath(org.apache.hadoop.fs.Path file, org.apache.hadoop.conf.Configuration conf)
          Deprecated. Use Job.addFileToClassPath(Path) instead
static void addFileToClassPath(org.apache.hadoop.fs.Path file, org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.FileSystem fs)
          Deprecated. Add a file path to the current set of classpath entries.
static boolean checkURIs(URI[] uriFiles, URI[] uriArchives)
          Deprecated. This method checks if there is a conflict in the fragment names of the uris.
static void createSymlink(org.apache.hadoop.conf.Configuration conf)
          Deprecated. This is a NO-OP.
static org.apache.hadoop.fs.Path[] getArchiveClassPaths(org.apache.hadoop.conf.Configuration conf)
          Deprecated. Use JobContext.getArchiveClassPaths() instead
static long[] getArchiveTimestamps(org.apache.hadoop.conf.Configuration conf)
          Deprecated. Use JobContext.getArchiveTimestamps() instead
static boolean[] getArchiveVisibilities(org.apache.hadoop.conf.Configuration conf)
          Deprecated. Get the booleans on whether the archives are public or not.
static URI[] getCacheArchives(org.apache.hadoop.conf.Configuration conf)
          Deprecated. Use JobContext.getCacheArchives() instead
static URI[] getCacheFiles(org.apache.hadoop.conf.Configuration conf)
          Deprecated. Use JobContext.getCacheFiles() instead
static org.apache.hadoop.fs.Path[] getFileClassPaths(org.apache.hadoop.conf.Configuration conf)
          Deprecated. Use JobContext.getFileClassPaths() instead
static long[] getFileTimestamps(org.apache.hadoop.conf.Configuration conf)
          Deprecated. Use JobContext.getFileTimestamps() instead
static boolean[] getFileVisibilities(org.apache.hadoop.conf.Configuration conf)
          Deprecated. Get the booleans on whether the files are public or not.
static org.apache.hadoop.fs.Path[] getLocalCacheArchives(org.apache.hadoop.conf.Configuration conf)
          Deprecated. Use JobContext.getLocalCacheArchives() instead
static org.apache.hadoop.fs.Path[] getLocalCacheFiles(org.apache.hadoop.conf.Configuration conf)
          Deprecated. Use JobContext.getLocalCacheFiles() instead
static boolean getSymlink(org.apache.hadoop.conf.Configuration conf)
          Deprecated. Symlinks are always created.
static void setCacheArchives(URI[] archives, org.apache.hadoop.conf.Configuration conf)
          Deprecated. Use Job.setCacheArchives(URI[]) instead
static void setCacheFiles(URI[] files, org.apache.hadoop.conf.Configuration conf)
          Deprecated. Use Job.setCacheFiles(URI[]) instead
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DistributedCache

public DistributedCache()
Deprecated. 
Method Detail

setCacheArchives

@Deprecated
public static void setCacheArchives(URI[] archives,
                                               org.apache.hadoop.conf.Configuration conf)
Deprecated. Use Job.setCacheArchives(URI[]) instead

Set the configuration with the given set of archives. Intended to be used by user code.

Parameters:
archives - The list of archives that need to be localized
conf - Configuration which will be changed

setCacheFiles

@Deprecated
public static void setCacheFiles(URI[] files,
                                            org.apache.hadoop.conf.Configuration conf)
Deprecated. Use Job.setCacheFiles(URI[]) instead

Set the configuration with the given set of files. Intended to be used by user code.

Parameters:
files - The list of files that need to be localized
conf - Configuration which will be changed

getCacheArchives

@Deprecated
public static URI[] getCacheArchives(org.apache.hadoop.conf.Configuration conf)
                              throws IOException
Deprecated. Use JobContext.getCacheArchives() instead

Get cache archives set in the Configuration. Used by internal DistributedCache and MapReduce code.

Parameters:
conf - The configuration which contains the archives
Returns:
A URI array of the caches set in the Configuration
Throws:
IOException

getCacheFiles

@Deprecated
public static URI[] getCacheFiles(org.apache.hadoop.conf.Configuration conf)
                           throws IOException
Deprecated. Use JobContext.getCacheFiles() instead

Get cache files set in the Configuration. Used by internal DistributedCache and MapReduce code.

Parameters:
conf - The configuration which contains the files
Returns:
A URI array of the files set in the Configuration
Throws:
IOException

getLocalCacheArchives

@Deprecated
public static org.apache.hadoop.fs.Path[] getLocalCacheArchives(org.apache.hadoop.conf.Configuration conf)
                                                         throws IOException
Deprecated. Use JobContext.getLocalCacheArchives() instead

Return the path array of the localized caches. Intended to be used by user code.

Parameters:
conf - Configuration that contains the localized archives
Returns:
A path array of localized caches
Throws:
IOException

getLocalCacheFiles

@Deprecated
public static org.apache.hadoop.fs.Path[] getLocalCacheFiles(org.apache.hadoop.conf.Configuration conf)
                                                      throws IOException
Deprecated. Use JobContext.getLocalCacheFiles() instead

Return the path array of the localized files. Intended to be used by user code.

Parameters:
conf - Configuration that contains the localized files
Returns:
A path array of localized files
Throws:
IOException

getArchiveTimestamps

@Deprecated
public static long[] getArchiveTimestamps(org.apache.hadoop.conf.Configuration conf)
Deprecated. Use JobContext.getArchiveTimestamps() instead

Get the timestamps of the archives. Used by internal DistributedCache and MapReduce code.

Parameters:
conf - The configuration which stored the timestamps
Returns:
a long array of timestamps
Throws:
IOException

getFileTimestamps

@Deprecated
public static long[] getFileTimestamps(org.apache.hadoop.conf.Configuration conf)
Deprecated. Use JobContext.getFileTimestamps() instead

Get the timestamps of the files. Used by internal DistributedCache and MapReduce code.

Parameters:
conf - The configuration which stored the timestamps
Returns:
a long array of timestamps
Throws:
IOException

addCacheArchive

@Deprecated
public static void addCacheArchive(URI uri,
                                              org.apache.hadoop.conf.Configuration conf)
Deprecated. Use Job.addCacheArchive(URI) instead

Add an archive to be localized to the conf. Intended to be used by user code.

Parameters:
uri - The uri of the cache to be localized
conf - Configuration to add the cache to

addCacheFile

@Deprecated
public static void addCacheFile(URI uri,
                                           org.apache.hadoop.conf.Configuration conf)
Deprecated. Use Job.addCacheFile(URI) instead

Add a file to be localized to the conf. Intended to be used by user code.

Parameters:
uri - The uri of the cache to be localized
conf - Configuration to add the cache to

addFileToClassPath

@Deprecated
public static void addFileToClassPath(org.apache.hadoop.fs.Path file,
                                                 org.apache.hadoop.conf.Configuration conf)
                               throws IOException
Deprecated. Use Job.addFileToClassPath(Path) instead

Add a file path to the current set of classpath entries. It adds the file to cache as well. Intended to be used by user code.

Parameters:
file - Path of the file to be added
conf - Configuration that contains the classpath setting
Throws:
IOException

addFileToClassPath

public static void addFileToClassPath(org.apache.hadoop.fs.Path file,
                                      org.apache.hadoop.conf.Configuration conf,
                                      org.apache.hadoop.fs.FileSystem fs)
                               throws IOException
Deprecated. 
Add a file path to the current set of classpath entries. It adds the file to cache as well. Intended to be used by user code.

Parameters:
file - Path of the file to be added
conf - Configuration that contains the classpath setting
fs - FileSystem with respect to which file should be interpreted.
Throws:
IOException

getFileClassPaths

@Deprecated
public static org.apache.hadoop.fs.Path[] getFileClassPaths(org.apache.hadoop.conf.Configuration conf)
Deprecated. Use JobContext.getFileClassPaths() instead

Get the file entries in classpath as an array of Path. Used by internal DistributedCache code.

Parameters:
conf - Configuration that contains the classpath setting

addArchiveToClassPath

@Deprecated
public static void addArchiveToClassPath(org.apache.hadoop.fs.Path archive,
                                                    org.apache.hadoop.conf.Configuration conf)
                                  throws IOException
Deprecated. Use Job.addArchiveToClassPath(Path) instead

Add an archive path to the current set of classpath entries. It adds the archive to cache as well. Intended to be used by user code.

Parameters:
archive - Path of the archive to be added
conf - Configuration that contains the classpath setting
Throws:
IOException

addArchiveToClassPath

public static void addArchiveToClassPath(org.apache.hadoop.fs.Path archive,
                                         org.apache.hadoop.conf.Configuration conf,
                                         org.apache.hadoop.fs.FileSystem fs)
                                  throws IOException
Deprecated. 
Add an archive path to the current set of classpath entries. It adds the archive to cache as well. Intended to be used by user code.

Parameters:
archive - Path of the archive to be added
conf - Configuration that contains the classpath setting
fs - FileSystem with respect to which archive should be interpreted.
Throws:
IOException

getArchiveClassPaths

@Deprecated
public static org.apache.hadoop.fs.Path[] getArchiveClassPaths(org.apache.hadoop.conf.Configuration conf)
Deprecated. Use JobContext.getArchiveClassPaths() instead

Get the archive entries in classpath as an array of Path. Used by internal DistributedCache code.

Parameters:
conf - Configuration that contains the classpath setting

createSymlink

@Deprecated
public static void createSymlink(org.apache.hadoop.conf.Configuration conf)
Deprecated. This is a NO-OP.

Originally intended to enable symlinks, but currently symlinks cannot be disabled. This is a NO-OP.

Parameters:
conf - the jobconf

getSymlink

@Deprecated
public static boolean getSymlink(org.apache.hadoop.conf.Configuration conf)
Deprecated. Symlinks are always created.

Originally intended to check if symlinks should be used, but currently symlinks cannot be disabled.

Parameters:
conf - the jobconf
Returns:
true

getFileVisibilities

public static boolean[] getFileVisibilities(org.apache.hadoop.conf.Configuration conf)
Deprecated. 
Get the booleans on whether the files are public or not. Used by internal DistributedCache and MapReduce code.

Parameters:
conf - The configuration which stored the visibilities
Returns:
an array of booleans
Throws:
IOException

getArchiveVisibilities

public static boolean[] getArchiveVisibilities(org.apache.hadoop.conf.Configuration conf)
Deprecated. 
Get the booleans on whether the archives are public or not. Used by internal DistributedCache and MapReduce code.

Parameters:
conf - The configuration which stored the visibilities
Returns:
an array of booleans

checkURIs

public static boolean checkURIs(URI[] uriFiles,
                                URI[] uriArchives)
Deprecated. 
This method checks if there is a conflict in the fragment names of the uris. Also makes sure that each uri has a fragment. It is only to be called if you want to create symlinks for the various archives and files. May be used by user code.

Parameters:
uriFiles - The URI array of files
uriArchives - The URI array of archives
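
The check described for checkURIs can be sketched with the JDK alone: every URI must carry a fragment, and no fragment may appear twice across files and archives. The class FragmentCheckSketch and its method names are hypothetical, and the sketch compares fragments exactly as written, which is an assumption; the real method may normalise them differently.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of a fragment-conflict check; not Hadoop code.
class FragmentCheckSketch {

    /** Returns true only if every URI has a fragment and no fragment repeats. */
    static boolean fragmentsAreValid(URI[] files, URI[] archives) {
        Set<String> seen = new HashSet<>();
        return check(files, seen) && check(archives, seen);
    }

    private static boolean check(URI[] uris, Set<String> seen) {
        if (uris == null) {
            return true; // nothing to check
        }
        for (URI uri : uris) {
            String fragment = uri.getFragment();
            // Set.add returns false when the fragment was already present.
            if (fragment == null || !seen.add(fragment)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        URI[] files = { URI.create("/myapp/lookup.dat#lookup.dat") };
        URI[] archives = { URI.create("/myapp/map.zip#map") };
        System.out.println(fragmentsAreValid(files, archives));
    }
}
```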


Copyright © 2014 Apache Software Foundation. All Rights Reserved.