org.apache.hadoop.mapreduce.lib.input
Class CompressedSplitLineReader

java.lang.Object
  extended by org.apache.hadoop.util.LineReader
      extended by org.apache.hadoop.mapreduce.lib.input.SplitLineReader
          extended by org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader
All Implemented Interfaces:
Closeable

@InterfaceAudience.Private
@InterfaceStability.Unstable
public class CompressedSplitLineReader
extends SplitLineReader

Line reader for compressed splits Reading records from a compressed split is tricky, as the LineRecordReader is using the reported compressed input stream position directly to determine when a split has ended. In addition the compressed input stream is usually faking the actual byte position, often updating it only after the first compressed block after the split is accessed. Depending upon where the last compressed block of the split ends relative to the record delimiters it can be easy to accidentally drop the last record or duplicate the last record between this split and the next. Split end scenarios: 1) Last block of split ends in the middle of a record Nothing special that needs to be done here, since the compressed input stream will report a position after the split end once the record is fully read. The consumer of the next split will discard the partial record at the start of the split normally, and no data is lost or duplicated between the splits. 2) Last block of split ends in the middle of a delimiter The line reader will continue to consume bytes into the next block to locate the end of the delimiter. If a custom delimiter is being used then the next record must be read by this split or it will be dropped. The consumer of the next split will not recognize the partial delimiter at the beginning of its split and will discard it along with the next record. However for the default delimiter processing there is a special case because CR, LF, and CRLF are all valid record delimiters. If the block ends with a CR then the reader must peek at the next byte to see if it is an LF and therefore part of the same record delimiter. Peeking at the next byte is an access to the next block and triggers the stream to report the end of the split. There are two cases based on the next byte: A) The next byte is LF The split needs to end after the current record is returned. The consumer of the next split will discard the first record, which is degenerate since LF is itself a delimiter, and start consuming records after that byte. If the current split tries to read another record then the record will be duplicated between splits. B) The next byte is not LF The current record will be returned but the stream will report the split has ended due to the peek into the next block. If the next record is not read then it will be lost, as the consumer of the next split will discard it before processing subsequent records. Therefore the next record beyond the reported split end must be consumed by this split to avoid data loss. 3) Last block of split ends at the beginning of a delimiter This is equivalent to case 1, as the reader will consume bytes into the next block and trigger the end of the split. No further records should be read as the consumer of the next split will discard the (degenerate) record at the beginning of its split. 4) Last block of split ends at the end of a delimiter Nothing special needs to be done here. The reader will not start examining the bytes into the next block until the next record is read, so the stream will not report the end of the split just yet. Once the next record is read then the next block will be accessed and the stream will indicate the end of the split. The consumer of the next split will correctly discard the first record of its split, and no data is lost or duplicated. If the default delimiter is used and the block ends at a CR then this is treated as case 2 since the reader does not yet know without looking at subsequent bytes whether the delimiter has ended. NOTE: It is assumed that compressed input streams *never* return bytes from multiple compressed blocks from a single read. Failure to do so will violate the buffering performed by this class, as it will access bytes into the next block after the split before returning all of the records from the previous block.


Constructor Summary
CompressedSplitLineReader(org.apache.hadoop.io.compress.SplitCompressionInputStream in, org.apache.hadoop.conf.Configuration conf, byte[] recordDelimiterBytes)
           
 
Method Summary
protected  int fillBuffer(InputStream in, byte[] buffer, boolean inDelimiter)
           
 boolean needAdditionalRecordAfterSplit()
           
 int readLine(org.apache.hadoop.io.Text str, int maxLineLength, int maxBytesToConsume)
           
 
Methods inherited from class org.apache.hadoop.util.LineReader
close, readLine, readLine
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CompressedSplitLineReader

public CompressedSplitLineReader(org.apache.hadoop.io.compress.SplitCompressionInputStream in,
                                 org.apache.hadoop.conf.Configuration conf,
                                 byte[] recordDelimiterBytes)
                          throws IOException
Throws:
IOException
Method Detail

fillBuffer

protected int fillBuffer(InputStream in,
                         byte[] buffer,
                         boolean inDelimiter)
                  throws IOException
Overrides:
fillBuffer in class org.apache.hadoop.util.LineReader
Throws:
IOException

readLine

public int readLine(org.apache.hadoop.io.Text str,
                    int maxLineLength,
                    int maxBytesToConsume)
             throws IOException
Overrides:
readLine in class org.apache.hadoop.util.LineReader
Throws:
IOException

needAdditionalRecordAfterSplit

public boolean needAdditionalRecordAfterSplit()
Overrides:
needAdditionalRecordAfterSplit in class SplitLineReader


Copyright © 2014 Apache Software Foundation. All Rights Reserved.