Chapter 74. Parsing MIME documents

Index

Delimiting MIME newline sequences
Parsing MIME documents that use CRLF newline sequences
Summary
Detecting start of a MIME document's body
Header parser iterator
x::mime::header_collector
MIME section information
MIME entity parser
Creating MIME entity parsers
Creating compound MIME entity parsers
Custom MIME section information classes

The templates and classes described in this chapter implement an output iterator-based approach to parsing MIME documents. A MIME parser gets constructed by instantiating a sequence of template classes. Each one is an output iterator over a stream of tokens; it modifies the stream and passes the modified stream to the next output iterator.
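The chaining idea can be illustrated without the library itself. The following is a minimal sketch, not the library's implementation: insert_after_iter, its trigger and marker parameters, and run_two_stages() are hypothetical names invented for this illustration. Each stage forwards every int it receives to the next output iterator, and inserts one extra value after a chosen trigger value.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical stage (not part of the library): forwards each int to the
// next output iterator, and inserts `marker` after each `trigger` value.
template<typename next_iter_t, int trigger, int marker>
class insert_after_iter {
	next_iter_t next_;
public:
	using iterator_category = std::output_iterator_tag;
	using value_type = void;
	using difference_type = std::ptrdiff_t;
	using pointer = void;
	using reference = void;

	explicit insert_after_iter(next_iter_t next) : next_(next) {}

	insert_after_iter &operator=(int v)
	{
		*next_ = v;
		++next_;
		if (v == trigger)
		{
			*next_ = marker;
			++next_;
		}
		return *this;
	}

	// Usual output iterator boilerplate: dereference and increment
	// do nothing, assignment does the work.
	insert_after_iter &operator*() { return *this; }
	insert_after_iter &operator++() { return *this; }
	insert_after_iter &operator++(int) { return *this; }
};

// Chain two stages in front of a std::back_insert_iterator.
std::vector<int> run_two_stages(const std::vector<int> &input)
{
	std::vector<int> out;

	using sink_t = std::back_insert_iterator<std::vector<int>>;
	using second_t = insert_after_iter<sink_t, 1000, 2000>;
	using first_t = insert_after_iter<second_t, 0, 1000>;

	std::copy(input.begin(), input.end(),
		  first_t(second_t(sink_t(out))));
	return out;
}
```

Feeding the values 0 and 7 into this chain yields 0, 1000, 2000, 7: the second stage reacts to the 1000 that the first stage inserted, which is the sense in which each stage passes a modified stream to the next one.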

Delimiting MIME newline sequences

The initial output iterator, x::mime::newline_iter, instantiates an output iterator over char values. Its template parameter is an output iterator class which iterates over int values. x::mime::newline_iter takes the char sequence it iterates over, and promotes each character to an int between 0 and 255. Additionally, x::mime::newline_iter inserts an x::mime::newline_start into the output sequence before each newline sequence, and x::mime::newline_end after the newline sequence.

Note

All other output iterators described in this chapter iterate over a sequence of int values. That sequence consists of the char values from the original output sequence that comprises the MIME document, plus additional values inserted by these output iterators.

It's important to note that the original char output sequence does not get modified. It gets supplemented by int values that the output iterators insert, like x::mime::newline_start and x::mime::newline_end, which appear before and after the LF (or the CRLF) value. The LF (or the CRLF) values remain in the output sequence where they were, but get bracketed by x::mime::newline_start and x::mime::newline_end.

#include <x/mime/newlineiter.H>
#include <algorithm>
#include <vector>
#include <iterator>
#include <iostream>

int main()
{
	std::vector<int> seq;

	typedef std::back_insert_iterator<std::vector<int>> ins_iter_t;

	// Copy std::cin into a newline_iter that appends to seq.
	auto iter=std::copy(std::istreambuf_iterator<char>(std::cin),
			    std::istreambuf_iterator<char>(),
			    x::mime::newline_iter<ins_iter_t>
			    ::create(ins_iter_t(seq)));

	// Signal the end of the output sequence.
	iter.get()->eof();

	// Current value of the iterator that was passed to create().
	ins_iter_t value=iter.get()->iter;

	for (int c:seq)
	{
		// Print characters as is, and inserted tokens as <value>.
		if (x::mime::nontoken(c))
			std::cout << (char)c;
		else
			std::cout << '<' << c << '>';
	}
	std::cout << std::endl;
	return 0;
}

Instantiating x::mime::newline_iter results in an output iterator, but x::mime::newline_iter gets instantiated by create() like a reference-counted object (because, internally, it is). The template parameter is an output iterator class over ints, and create() takes an instance of the template parameter class.

This example copies chars from std::cin into the instantiated x::mime::newline_iter, which outputs to a std::back_insert_iterator<std::vector<int>>.

No formal means exist to notify an output iterator of the end of its output sequence, other than its destruction, so the MIME parsing iterators use the following convention. The output iterator's get() method returns a reference to the underlying reference-counted object, whose eof() method must be invoked to signal the end of the output sequence. All MIME parsing templates and classes require that x::mime::newline_iter's eof() gets invoked.
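The convention can be modeled in isolation. The following is a rough sketch that uses std::shared_ptr in place of the library's reference-counted objects: sink_state, sink_iter, and feed() are hypothetical names made up for this illustration, and get() returns a plain pointer here rather than the library's reference type. The -1 end-of-sequence value is an assumption taken from the sample output shown in this section.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the reference-counted object behind the iterator.
struct sink_state {
	std::vector<int> out;

	void put(char c) { out.push_back(static_cast<unsigned char>(c)); }

	// Explicit end-of-sequence signal (-1 assumed from the sample output).
	void eof() { out.push_back(-1); }
};

// Hypothetical output iterator. All copies share one sink_state, so the
// copy returned by std::copy() reaches the same state as the original.
class sink_iter {
	std::shared_ptr<sink_state> state_;

	sink_iter() : state_(std::make_shared<sink_state>()) {}
public:
	using iterator_category = std::output_iterator_tag;
	using value_type = void;
	using difference_type = std::ptrdiff_t;
	using pointer = void;
	using reference = void;

	static sink_iter create() { return sink_iter(); }

	sink_iter &operator=(char c) { state_->put(c); return *this; }
	sink_iter &operator*() { return *this; }
	sink_iter &operator++() { return *this; }
	sink_iter operator++(int) { return *this; }

	sink_state *get() const { return state_.get(); }
};

std::vector<int> feed(const std::string &text)
{
	sink_iter iter = std::copy(text.begin(), text.end(),
				   sink_iter::create());

	// Without this call, nothing marks the end of the sequence.
	iter.get()->eof();
	return iter.get()->out;
}
```

In this sketch, feed("ab") produces the two character values followed by -1. The point of the shared state is that std::copy() takes and returns the iterator by value, so the end-of-sequence signal has to reach an object that outlives those copies.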

In addition to eof(), the iter class member gives the current value of the output iterator that x::mime::newline_iter's constructor received, via create(). Sample output from the above newlineparser.C example:


$ cat newlineparser.txt
Subject: test

test
$ ./newlineparser <newlineparser.txt
Subject: test<256>
<257><256>
<257>test<256>
<257><-1>
      

x::mime::newline_iter promotes the char sequence it iterates over to int between 0 and 255. x::mime::nontoken() returns true if the given value is in this range, and false for additional tokens. As the sample output shows, 256 and 257 (corresponding to x::mime::newline_start and x::mime::newline_end) wrap each newline character. -1 is x::mime::eof that gets inserted by x::mime::newline_iter's eof().
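The printing loop of the example can be reproduced on its own, which makes the distinction between character values and tokens easy to check. This is a sketch only: is_char_value() is a hypothetical stand-in for x::mime::nontoken(), based on the 0 to 255 range described above, and render() is an illustrative helper that does not exist in the library.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for x::mime::nontoken(): true for values in the
// promoted char range, false for inserted tokens.
bool is_char_value(int c)
{
	return c >= 0 && c <= 255;
}

// Render a token sequence the way the example program prints it:
// characters as is, everything else as <value>.
std::string render(const std::vector<int> &seq)
{
	std::string text;

	for (int c : seq)
	{
		if (is_char_value(c))
			text += static_cast<char>(c);
		else
			text += '<' + std::to_string(c) + '>';
	}
	return text;
}
```

Rendering the sequence 'h', 'i', 256, LF, 257, -1 gives "hi<256>", a line break, and then "<257><-1>", matching the shape of the sample output.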

One important characteristic of x::mime::newline_iter is that when the output sequence does not end with a newline, x::mime::newline_iter inserts x::mime::newline_start immediately followed by x::mime::newline_end, without a newline in between (this gets triggered by eof()). In all cases x::mime::newline_iter does not modify the character part of the output sequence that gets forwarded to its output iterator, but the output sequence always ends with a newline sequence:


$ cat newlineparser.txt
Subject: test

test
$ ./newlineparser <newlineparser.txt
Subject: test<256>
<257><256>
<257>test<256><257><-1>
      

This is the same as the previous example, except that the original MIME-formatted message did not end with a newline. x::mime::newline_iter adds x::mime::newline_start (256) immediately followed by x::mime::newline_end (257), before the trailing x::mime::eof.
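The two sample runs can be summarized by a small standalone model of the LF mode. This is not the library's code: bracket_lf() and the three constants are names local to this sketch, with values taken from the sample output above. Treating empty input as "not ending with a newline" is an assumption of the sketch, not something the text above states.

```cpp
#include <string>
#include <vector>

// Values as they appear in the sample output; names are local to this sketch.
constexpr int newline_start = 256;
constexpr int newline_end = 257;
constexpr int eof_token = -1;

// Model of the LF mode: bracket each LF, add an implied newline at the
// end if needed, then append the end-of-sequence token.
std::vector<int> bracket_lf(const std::string &text)
{
	std::vector<int> out;

	// Assumption: empty input counts as not ending with a newline.
	bool ends_with_newline = false;

	for (char ch : text)
	{
		int c = static_cast<unsigned char>(ch);

		if (c == '\n')
		{
			out.push_back(newline_start);
			out.push_back(c);	// The LF itself stays in place
			out.push_back(newline_end);
			ends_with_newline = true;
		}
		else
		{
			out.push_back(c);
			ends_with_newline = false;
		}
	}

	if (!ends_with_newline)
	{
		// Implied newline: start and end tokens with no LF in between.
		out.push_back(newline_start);
		out.push_back(newline_end);
	}
	out.push_back(eof_token);
	return out;
}
```

With this model, "test" followed by LF becomes the four characters, 256, LF, 257, -1, while "test" alone becomes the four characters, 256, 257, -1. In both cases the character values are unchanged and the token stream ends with a bracketed newline.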

Parsing MIME documents that use CRLF newline sequences

x::mime::newline_iter<ins_iter_t>::create(ins_iter_t(seq), true);

Setting the optional second parameter of x::mime::newline_iter's create() to true instantiates an output iterator that recognizes the CRLF sequence as the newline sequence, instead of LF.

x::mime::newline_iter inserts x::mime::newline_start and x::mime::newline_end before and after each CRLF sequence. CR and LF by themselves are left alone.
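The difference from the LF mode can be shown with a standalone model that looks one character ahead. As before, this is an illustrative sketch rather than the library's implementation: bracket_crlf() and the constant names are hypothetical, the numeric values come from the LF sample output earlier in this chapter, and end-of-sequence handling is left out to keep the focus on newline recognition.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Values as they appear in the earlier sample output; names are local
// to this sketch.
constexpr int crlf_newline_start = 256;
constexpr int crlf_newline_end = 257;

// Model of the CRLF mode: only a CR immediately followed by LF is
// bracketed. End-of-sequence handling is intentionally omitted.
std::vector<int> bracket_crlf(const std::string &text)
{
	std::vector<int> out;

	for (std::size_t i = 0; i < text.size(); ++i)
	{
		int c = static_cast<unsigned char>(text[i]);

		if (c == '\r' && i + 1 < text.size() && text[i + 1] == '\n')
		{
			out.push_back(crlf_newline_start);
			out.push_back('\r');
			out.push_back('\n');
			out.push_back(crlf_newline_end);
			++i;	// The LF was consumed with the CR
		}
		else
		{
			// A lone CR or a lone LF passes through unbracketed.
			out.push_back(c);
		}
	}
	return out;
}
```

For the input "a", CR, LF, "b", LF, "c", CR, "d", only the CR LF pair after "a" gets bracketed; the later lone LF and lone CR pass through as ordinary character values.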

Summary

x::mime::newline_iter iterates over a char-valued output sequence that contains a MIME document. Its template parameter is an output iterator class that iterates over int values, and create() takes an instance of the template parameter class.

The iterator passed to create() iterates over a sequence of int values that consists of the char values that x::mime::newline_iter iterates over. Additionally, each recognized newline sequence gets preceded by an x::mime::newline_start and followed by an x::mime::newline_end. This includes the implied newline at the end of an output sequence that does not end with an explicit newline sequence. Invoking eof() on x::mime::newline_iter's output iterator object inserts x::mime::eof into the output sequence.