
Reading Chunks from a Buffer

What follows is a brain dump of something I need to do from time to time, so I thought it would be a good idea to write down the steps involved in reading and parsing chunks from a source, for example a file.

When do you want to read chunks anyway? Well, let's take an example where I often need to read chunks: reading NALs from an h264 file (annex-b). When parsing raw h264 data you don't know the byte offsets in the file where the video frames start and stop, so you need to parse the data and detect the annex-b headers. Because the file can be very large, I don't want to read the complete file into a buffer at once. A more sensible approach is to read and parse the file in chunks of, let's say, 128 kilobytes. When you combine reading chunks with parsing them, there are several things you need to keep in mind:

  • How many bytes are still available in the source file?
  • How many bytes did the parser parse?
  • How many h264 bytes were read but didn't get parsed yet?
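To make "detect the annex-b headers" a bit more concrete, here is a minimal sketch of a scanner for the 3-byte start code (00 00 01) that separates NALs in an annex-b stream. The name find_start_code is hypothetical and this is not the parser used later in this post; it only shows why unparsed bytes matter: a start code can straddle the boundary between two chunks, so the tail of one chunk has to be kept until the next read.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, for illustration only.
// Returns the offset of the first 3-byte start code (00 00 01) in
// data[0..size), or `size` when no start code is found.
// A 4-byte start code (00 00 00 01) is found too: the reported offset
// then points one byte past its leading zero.
inline std::size_t find_start_code(const std::uint8_t* data, std::size_t size) {
  for (std::size_t i = 0; i + 3 <= size; ++i) {
    if (data[i] == 0x00 && data[i + 1] == 0x00 && data[i + 2] == 0x01) {
      return i;
    }
  }
  return size;
}
```

When the function returns `size`, the last bytes of the buffer may still be the beginning of a start code, which is exactly the "read but not parsed" case from the list above.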

Determining the number of bytes to read

A first approach could be something like this: we keep track of how many bytes are still left in the source (e.g. the file). Let's call the variable that holds this number bytes_left_to_read. Every time we read from the source, we try to read the maximum number of bytes, which is normally the size of the chunk. But be aware that the last read from the source is special. For example, if the file size is 101 bytes and we're reading in chunks of 10 bytes, the last read is only 1 byte. Using std::min<size_t>(chunk_size, bytes_left_to_read) we can determine how many bytes we can still read.

// Be aware that this is an incomplete example.
while (bytes_left_to_read > 0) {
  bytes_to_read = std::min<size_t>(chunk_size, bytes_left_to_read);
  bytes_left_to_read -= bytes_to_read;
}
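As a quick check of the 101-byte example, the loop above can be wrapped in a small function that records the size of every read. This is only a sketch; plan_reads is a hypothetical name and nothing is actually read from a file.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the size of every read that is needed to
// consume `file_size` bytes in chunks of at most `chunk_size` bytes.
inline std::vector<std::size_t> plan_reads(std::size_t file_size, std::size_t chunk_size) {
  std::vector<std::size_t> sizes;
  if (chunk_size == 0) {
    return sizes;  // A zero-sized chunk would never make progress.
  }
  std::size_t bytes_left_to_read = file_size;
  while (bytes_left_to_read > 0) {
    const std::size_t bytes_to_read = std::min<std::size_t>(chunk_size, bytes_left_to_read);
    sizes.push_back(bytes_to_read);
    bytes_left_to_read -= bytes_to_read;
  }
  return sizes;
}
```

For plan_reads(101, 10) this gives eleven reads: ten of 10 bytes followed by one of 1 byte.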

This is all good and simple, and it makes sure that we never read more bytes than fit in our chunk or than are left in our source.

But the approach above won't work, because you can't always read the full chunk_size. When the parser didn't parse the complete previous chunk, there are still some bytes left that need to be parsed after the next read. bytes_available_in_chunk holds the number of source bytes that have been read but still need to be parsed. We need to reduce the chunk_size by this number, so a better approach is to use std::min<size_t>(chunk_size - bytes_available_in_chunk, bytes_left_to_read):

while (bytes_left_to_read > 0) {
  bytes_to_read = std::min<size_t>(chunk_size - bytes_available_in_chunk, bytes_left_to_read);
  bytes_left_to_read -= bytes_to_read;
}
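The calculation above can be isolated into a small function, which makes the edge cases easier to see. This is a sketch; next_read_size is a hypothetical name. Because size_t is unsigned, chunk_size - bytes_available_in_chunk would wrap around to a huge value if more bytes were pending than the chunk can hold, so the sketch asserts that this never happens.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: how many bytes may the next read request?
inline std::size_t next_read_size(std::size_t chunk_size,
                                  std::size_t bytes_available_in_chunk,
                                  std::size_t bytes_left_to_read) {
  // Guard against unsigned wrap-around in the subtraction below.
  assert(bytes_available_in_chunk <= chunk_size);
  const std::size_t free_space = chunk_size - bytes_available_in_chunk;
  return std::min(free_space, bytes_left_to_read);
}
```

With a chunk of 10 bytes and 2 unparsed bytes pending, the next read is 8 bytes. A result of 0 while bytes_left_to_read is still positive means the chunk is completely filled with unparsed bytes; the loop cannot make progress in that state, so it is worth treating as an error or as a sign that the chunk is too small.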

Reading data into the buffer and parsing it

Once we've determined how many bytes we can read, we need to read them into our chunk. Because the chunk can still hold valid bytes that need to be parsed, we cannot simply copy the new bytes to the start of the buffer. New bytes need to be stored after the bytes that are still available from our previous read.

But let's take one step back. Let's say we've just read a complete chunk of 10 bytes but only parsed 8 bytes. Before we read fresh bytes, we need to move the last 2 bytes, which are still waiting to be parsed, to the beginning of our chunk. For this we use memmove, which is safe to use when the source and destination overlap. The number of bytes that we need to move is the new value of bytes_available_in_chunk: the number of valid bytes in the chunk minus bytes_parsed. For a full chunk that is chunk_size - bytes_parsed, but the last read from a file usually doesn't fill the chunk, so we subtract from bytes_available_in_chunk instead of from chunk_size.

bytes_available_in_chunk -= bytes_parsed;
memmove(chunk, chunk + bytes_parsed, bytes_available_in_chunk);
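The move can be tried out on the 10-byte example in isolation. The sketch below uses a hypothetical helper, discard_parsed, and takes the number of valid bytes as a parameter so that it also works for a chunk that is only partly filled (an assumption that goes beyond the full-chunk example).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: drops the first `bytes_parsed` bytes of the
// `bytes_available_in_chunk` valid bytes in `chunk`, moves the rest to
// the start of the chunk and returns how many valid bytes remain.
inline std::size_t discard_parsed(std::uint8_t* chunk,
                                  std::size_t bytes_available_in_chunk,
                                  std::size_t bytes_parsed) {
  assert(bytes_parsed <= bytes_available_in_chunk);
  const std::size_t remaining = bytes_available_in_chunk - bytes_parsed;
  // memmove, unlike memcpy, is defined for overlapping ranges.
  std::memmove(chunk, chunk + bytes_parsed, remaining);
  return remaining;
}
```

For a chunk holding the bytes 0 to 9 with 8 bytes parsed, the call returns 2 and the chunk then starts with the bytes 8 and 9. The next read stores its bytes at chunk + 2.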

Once we've moved the remaining bytes to the start of the buffer we repeat the steps described above which leads to something like this:

int VideoH264Creator::create(const std::string inpath) {
 
    const size_t chunk_size = 1024 * 28;
    size_t bytes_available_in_chunk = 0;
    size_t file_size = 0;
    size_t bytes_left_to_read = 0;
    size_t bytes_to_read = 0;
    size_t bytes_parsed = 0;
    int parse_result = H264_PARSE_OK;
    uint8_t buffer[chunk_size];
 
    if (0 == inpath.size()) {
      SX_ERROR("Given input path is empty.");
      return -1;
    }
 
    /* Open the input file (h264, annex-b) */
    std::ifstream ifs(inpath.c_str(), std::ios::in | std::ios::binary);
    if (false == ifs.is_open()) {
      SX_ERROR("Failed to open: %s", inpath.c_str());
      return -2;
    }
 
    /* Check the file size. */
    ifs.seekg(0, std::ifstream::end);
    file_size = ifs.tellg();
    bytes_left_to_read = file_size;
    ifs.seekg(0, std::ifstream::beg); 
 
    if (0 == file_size) {
      SX_ERROR("Input file is empty.");
      return -3;
    }
 
    while (bytes_left_to_read > 0) {
 
      /* We can only read the remaining free space in the chunk, or what's still remaining in the file. */
      bytes_to_read = std::min<size_t>(chunk_size - bytes_available_in_chunk, bytes_left_to_read);
 
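      /* Without free space in the chunk we would loop forever. */
      if (0 == bytes_to_read) {
        SX_ERROR("Chunk is full of unparsed bytes; cannot read more.");
        return -4;
      }
 
      /* We read new bytes, after the bytes which are still available. */
      ifs.read((char*)buffer + bytes_available_in_chunk, bytes_to_read);
      if ((size_t)ifs.gcount() != bytes_to_read) {
        SX_ERROR("Failed to read from: %s", inpath.c_str());
        return -5;
      }
 
      /* Increment the number of valid bytes using the number of bytes we just read. */
      bytes_available_in_chunk += bytes_to_read;
 
      parse_result = parser.parse(buffer, bytes_available_in_chunk, bytes_parsed);
 
      /* The parser can never parse more than the valid bytes we gave it. */
      if (bytes_parsed > bytes_available_in_chunk) {
        SX_ERROR("Number of bytes parsed bigger than the given buffer. Not supposed to happen.");
        return -6;
      }
 
      /* Move the bytes that were read but not parsed yet to the start of the buffer. */
      bytes_available_in_chunk -= bytes_parsed;
      memmove(buffer, buffer + bytes_parsed, bytes_available_in_chunk);
 
      /* Reduce the number of bytes still left to read from the file. */
      bytes_left_to_read -= bytes_to_read;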
    }
 
    return 0;
}
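To experiment with the loop without an h264 file or the parser from this post, here is a self-contained sketch under a few loudly stated assumptions. LineParser is a hypothetical stand-in that consumes complete '\n'-terminated records instead of NALs, and read_in_chunks is a hypothetical name. It is also a variation on the listing above: it stops when a read returns no bytes instead of tracking bytes_left_to_read, so it works on any std::istream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the h264 parser: consumes complete
// '\n'-terminated records and returns how many bytes it used.
struct LineParser {
  std::vector<std::string> records;

  std::size_t parse(const char* data, std::size_t size) {
    std::size_t used = 0;
    for (std::size_t i = 0; i < size; ++i) {
      if (data[i] == '\n') {
        records.emplace_back(data + used, i - used);
        used = i + 1;
      }
    }
    return used;
  }
};

// Reads `in` in chunks of `chunk_size` bytes and feeds them to `parser`.
// Returns true when every byte of the input was consumed by the parser.
inline bool read_in_chunks(std::istream& in, std::size_t chunk_size, LineParser& parser) {
  std::vector<char> chunk(chunk_size);
  std::size_t bytes_available_in_chunk = 0;

  for (;;) {
    const std::size_t free_space = chunk_size - bytes_available_in_chunk;
    if (free_space == 0) {
      return false;  // Chunk is full of unparsed bytes: no progress possible.
    }

    // New bytes are stored after the bytes that are still waiting to be parsed.
    in.read(chunk.data() + bytes_available_in_chunk,
            static_cast<std::streamsize>(free_space));
    const std::size_t bytes_read = static_cast<std::size_t>(in.gcount());
    if (bytes_read == 0) {
      break;  // End of input (or a read error).
    }
    bytes_available_in_chunk += bytes_read;

    const std::size_t bytes_parsed = parser.parse(chunk.data(), bytes_available_in_chunk);
    assert(bytes_parsed <= bytes_available_in_chunk);

    // Move the unparsed tail to the start of the chunk.
    bytes_available_in_chunk -= bytes_parsed;
    std::memmove(chunk.data(), chunk.data() + bytes_parsed, bytes_available_in_chunk);
  }

  return bytes_available_in_chunk == 0;
}
```

Feeding it a std::istringstream with "alpha\nbeta\ngamma\n" and a chunk size of 8 yields the three records even though "beta" and "gamma" are split across reads. A record that is longer than the chunk makes the function return false instead of looping forever.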