Creating compound MIME entity parsers

bodydecoder.C and bodydecoder2.C lack the logic for handling compound MIME entities. x::mime::make_multipart_parser() and x::mime::make_message_rfc822_parser() are section processor factories that wrap other section processor factories and return output iterators for parsing compound MIME sections. They get invoked from either the section processor factory that's passed to x::mime::make_document_entity_parser, or from another section processor factory that was previously wrapped by one of these functions. This results in an open-ended framework for recursively parsing compound MIME documents:

#include <x/mime/newlineiter.H>
#include <x/mime/headeriter.H>
#include <x/mime/bodystartiter.H>
#include <x/mime/headercollector.H>
#include <x/mime/sectiondecoder.H>
#include <x/mime/entityparser.H>
#include <x/mime/structured_content_header.H>
#include <x/mime/contentheadercollector.H>
#include <x/chrcasecmp.H>
#include <algorithm>
#include <iostream>
#include <iterator>
#include <string>

x::outputrefiterator<int> parse_section(const x::headersbase &,
					const x::mime::sectioninfo &);

x::outputrefiterator<int> create_parser(const x::mime::sectioninfo &info,
					bool is_message_rfc822)
{
	auto header_iter=
		x::mime::contentheader_collector::create(is_message_rfc822);

	auto headers=header_iter.get();

	return x::mime::make_entity_parser
		(header_iter,
		 [headers, info]
		 {
			 return parse_section(headers->content_headers,
					      info);
		 }, info);
}

x::outputrefiterator<int>
parse_section(const x::headersbase &headers,
	      const x::mime::sectioninfo &info)
{
	x::mime::structured_content_header
		content_type(headers,
			     x::mime::structured_content_header::content_type);

	if (content_type.is_message())
		return x::mime::make_message_rfc822_parser
			(create_parser, info);

	if (content_type.is_multipart())
		return x::mime::make_multipart_parser
			(content_type.boundary(), create_parser, info);

	typedef std::ostreambuf_iterator<char> dump_iter_t;

	dump_iter_t dump_to_stdout(std::cout);

	std::string content_transfer_encoding=
		x::mime::structured_content_header
		(headers,
		 x::mime::structured_content_header::content_transfer_encoding)
		.value;

	return content_type.mime_content_type() == "text"
		? x::mime::section_decoder::create(content_transfer_encoding,
						   dump_to_stdout,
						   content_type
						   .charset("iso-8859-1"),
						   "UTF-8")
		: x::mime::section_decoder::create(content_transfer_encoding,
						   dump_to_stdout);
}

void dump(const x::mime::const_sectioninfo &info)
{
	std::cout << "MIME section " << info->index_name()
		  << " starts at character offset "
		  << info->starting_pos << std::endl
		  << "  " << info->header_char_cnt << " bytes in the header, "
		  << info->body_char_cnt << " bytes in the body." << std::endl
		  << "  " << info->header_line_cnt << " lines in the header, "
		  << info->body_line_cnt << " lines in the body." << std::endl;
	if (info->no_trailing_newline)
		std::cout << "  No trailing newline" << std::endl;
	for (const auto &child:info->children)
	{
		std::cout << std::endl;
		dump(child);
	}
}

int main()
{
	x::mime::sectioninfoptr top_level_info;

	std::copy(std::istreambuf_iterator<char>(std::cin),
		  std::istreambuf_iterator<char>(),
		  x::mime::make_document_entity_parser
		  ([&top_level_info]
		   (const x::mime::sectioninfo &info,
		    bool is_message_rfc822)
		   {
			   top_level_info=info;
			   return create_parser(info, is_message_rfc822);
		   }))
		.get()->eof();

	if (top_level_info.null())
	{
		std::cerr << "How did we get here?" << std::endl;
		return 0;
	}

	dump(top_level_info);
	return 0;
}

x::mime::contentheader_collector is an output iterator that's similar to x::mimeheadercollector, except that it collects all headers whose names start with Content- into an x::headersbase.

The output iterator's get method returns a reference to a reference-counted object with a content_headers member, an x::headersbase container that holds the Content- headers. x::mime::contentheader_collector's constructor also takes a bool flag. If the flag is true, the Mime-Version: 1.0 header must be present; without it, no Content- headers get collected (content_headers will be empty).

Mime-Version: 1.0 can appear after the Content- headers. x::mime::contentheader_collector collects all Content- headers as it iterates over the header portion of a MIME entity. At the end of the output sequence, the accumulated headers in content_headers get cleared if the bool flag is true but Mime-Version: 1.0 was absent:


$ cat bodydecoder.txt
Subject: test
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable

Hello=A0world!
$ ./bodydecoder3 <bodydecoder.txt
Hello=A0world!
MIME section 1 starts at character offset 0
  104 bytes in the header, 15 bytes in the body.
  4 lines in the header, 1 lines in the body.

In the absence of the Mime-Version: 1.0 header, this is parsed as a non-MIME message, so the quoted-printable transfer encoding is not decoded, and the output is the literal Hello=A0world!.


$ cat bodydecoder2.txt
Subject: test
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
Mime-Version: 1.0

Hello=A0world!
$ ./bodydecoder3 <bodydecoder2.txt
Hello world!
MIME section 1 starts at character offset 0
  122 bytes in the header, 15 bytes in the body.
  5 lines in the header, 1 lines in the body.

Now that Mime-Version: 1.0 is present, the MIME headers are in effect and the body gets decoded.
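
The rule that the two examples above demonstrate can be sketched in plain C++, without the library. This is only an illustration of the described behavior, not x::mime::contentheader_collector's implementation: collect_content_headers(), lower(), and the use of a std::multimap in place of x::headersbase are all hypothetical stand-ins.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Lowercase copy, for case-insensitive header name comparisons.
std::string lower(std::string s)
{
	std::transform(s.begin(), s.end(), s.begin(),
		       [](unsigned char c)
		       { return static_cast<char>(std::tolower(c)); });
	return s;
}

// Hypothetical stand-in for the collector: keeps "Content-" headers, and
// discards them at the end if Mime-Version: 1.0 was required but missing.
std::multimap<std::string, std::string>
collect_content_headers(const std::vector<std::pair<std::string,
			std::string>> &headers,
			bool mime_version_required)
{
	std::multimap<std::string, std::string> content_headers;
	bool seen_mime_version = false;

	for (const auto &h : headers)
	{
		std::string name = lower(h.first);

		if (name == "mime-version" && h.second == "1.0")
			seen_mime_version = true;
		else if (name.compare(0, 8, "content-") == 0)
			content_headers.emplace(name, h.second);
	}

	// Mime-Version may follow the Content- headers, so the decision
	// is made only after every header has been seen.
	if (mime_version_required && !seen_mime_version)
		content_headers.clear();

	return content_headers;
}
```

With the headers from bodydecoder.txt and the flag set, this sketch returns an empty container; adding Mime-Version: 1.0 anywhere in the list makes it return both Content- headers.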

In bodydecoder3.C, create_parser is the section processor factory functor/lambda that's passed to x::mime::make_document_entity_parser, as in bodydecoder.C and bodydecoder2.C (with a small wrapper that captures the top level x::mime::sectioninfo object).

create_parser constructs a new x::mime::contentheader_collector object, and passes it to x::mime::make_entity_parser() as the header iterator.

The body iterator factory captures, by value, the reference to the reference-counted object that holds content_headers. This keeps the object alive after create_parser() returns, until the header portion has been iterated over and the body iterator factory finally gets invoked. At that point, the body iterator factory given to x::mime::make_entity_parser() looks at the collected headers.

x::mime::make_message_rfc822_parser() takes a section processor functor/lambda and the x::mime::sectioninfo of a message/rfc822 MIME entity. It returns an output iterator that invokes the section processor functor/lambda with an x::mime::sectioninfo for the body of the message/rfc822 MIME entity. bodydecoder3.C passes the same create_parser() functor, resulting in recursive parsing of these MIME entities.
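
The lifetime argument for capturing the collected headers by value can be illustrated with std::shared_ptr in place of the library's reference-counted object. This is a minimal sketch under that assumption; collected_headers, parser_parts, and make_parts() are hypothetical names that do not exist in the library.

```cpp
#include <functional>
#include <memory>
#include <string>

// Hypothetical stand-in for the reference-counted collected headers.
struct collected_headers {
	std::string content_type;
};

// What the factory hands back: the object that the header phase fills in,
// and the callback that runs later, once the header phase has finished.
struct parser_parts {
	std::shared_ptr<collected_headers> headers;
	std::function<std::string ()> body_factory;
};

parser_parts make_parts()
{
	auto headers = std::make_shared<collected_headers>();

	// Capturing the shared_ptr by value gives the lambda its own
	// reference, so the object outlives this function's scope.
	return { headers, [headers] { return headers->content_type; } };
}
```

The callback sees whatever was stored in the object after make_parts() returned, and it keeps working even after the caller drops its own reference. Capturing a local variable by reference instead would leave the lambda with a dangling reference.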

x::mime::make_multipart_parser() takes three parameters: the boundary delimiter of a multipart compound MIME entity, a section processor factory functor/lambda, and the multipart entity's x::mime::sectioninfo. It returns an output iterator that invokes the functor/lambda for every entity that the multipart entity contains.
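
How a boundary delimiter divides a multipart body into its parts can be sketched without the library. The function below is an illustration only, not how x::mime::make_multipart_parser() is implemented: for_each_part() is a hypothetical name, and it works on a complete std::string rather than as an output iterator. It assumes "\n" line endings.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Calls on_part for each part of a multipart body. Assumes "\n" line
// endings. A final part that is not closed by a delimiter is dropped.
void for_each_part(const std::string &body, const std::string &boundary,
		   const std::function<void (const std::string &)> &on_part)
{
	const std::string delim = "--" + boundary;
	bool in_part = false;	// false while skipping the preamble
	std::size_t part_start = 0;
	std::size_t pos = 0;

	while (pos < body.size())
	{
		std::size_t eol = body.find('\n', pos);

		if (eol == std::string::npos)
			eol = body.size();

		std::string line = body.substr(pos, eol - pos);
		bool closing = line == delim + "--";

		if (line == delim || closing)
		{
			if (in_part)
			{
				// The newline before the delimiter belongs
				// to the delimiter, not to the part.
				std::size_t end =
					pos > part_start ? pos - 1 : pos;

				on_part(body.substr(part_start,
						    end - part_start));
			}
			if (closing)
				return;	// the rest is the epilogue

			in_part = true;
			part_start = eol + 1;
		}
		pos = eol + 1;
	}
}
```

Because the newline that precedes a delimiter line is treated as part of the delimiter, a part whose last line is immediately followed by the delimiter is reported without a trailing newline.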

For non-compound MIME entities, the body iterator factory returns an x::mime::section_decoder to decode the non-compound entity, converting text MIME entities to the UTF-8 character set:


$ cat bodydecoder3.txt
Subject: test
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="xxx"

--xxx
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable

Hello=A0
--xxx
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable

world!

--xxx--
$ ./bodydecoder3 <bodydecoder3.txt
Hello world!
MIME section 1 starts at character offset 0
  79 bytes in the header, 217 bytes in the body.
  4 lines in the header, 12 lines in the body.

MIME section 1.1 starts at character offset 85
  90 bytes in the header, 8 bytes in the body.
  3 lines in the header, 1 lines in the body.
  No trailing newline

MIME section 1.2 starts at character offset 190
  90 bytes in the header, 7 bytes in the body.
  3 lines in the header, 1 lines in the body.

bodydecoder3.C decodes each MIME entity in the document, one at a time, concatenating their contents. The first part of the multipart entity does not end with a trailing newline, so the result of the concatenation is a single line of text.
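
The two transformations applied to each text part in this example, quoted-printable decoding followed by conversion from iso-8859-1 to UTF-8, can be sketched in plain C++. This is a simplified illustration, not x::mime::section_decoder's implementation: decode_quoted_printable(), latin1_to_utf8(), and hex_value() are hypothetical helpers, and only "\n" line endings are handled.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Value of one hexadecimal digit; the caller checks std::isxdigit first.
int hex_value(unsigned char c)
{
	return std::isdigit(c) ? c - '0' : std::toupper(c) - 'A' + 10;
}

// Decodes "=XX" escapes and drops "=\n" soft line breaks. Malformed
// escapes are copied through unchanged.
std::string decode_quoted_printable(const std::string &in)
{
	std::string out;

	for (std::size_t i = 0; i < in.size(); ++i)
	{
		if (in[i] != '=')
		{
			out += in[i];
			continue;
		}
		if (i + 1 < in.size() && in[i + 1] == '\n')
		{
			++i;	// soft line break
			continue;
		}
		if (i + 2 < in.size()
		    && std::isxdigit(static_cast<unsigned char>(in[i + 1]))
		    && std::isxdigit(static_cast<unsigned char>(in[i + 2])))
		{
			out += static_cast<char>(hex_value(in[i + 1]) * 16
						 + hex_value(in[i + 2]));
			i += 2;
			continue;
		}
		out += in[i];
	}
	return out;
}

// ISO-8859-1 code points equal the byte values, so bytes >= 0x80
// become two-byte UTF-8 sequences.
std::string latin1_to_utf8(const std::string &in)
{
	std::string out;

	for (unsigned char c : in)
	{
		if (c < 0x80)
			out += static_cast<char>(c);
		else
		{
			out += static_cast<char>(0xC0 | (c >> 6));
			out += static_cast<char>(0x80 | (c & 0x3F));
		}
	}
	return out;
}
```

In this sketch, =A0 decodes to the byte 0xA0, the non-breaking space in iso-8859-1, which becomes the two bytes 0xC2 0xA0 in UTF-8. Concatenating the decoded first part, which has no trailing newline, with the second part gives a single line.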

examples/mime/bodydecoder4.C is an alternative version of bodydecoder3.C that uses x::mime::make_parser() to replace the logic in the first half of parse_section(). The first parameter to x::mime::make_parser() is an x::mime::structured_content_header with the value of the Content-Type header. The second parameter is the x::mime::sectioninfo for the MIME section where this Content-Type header came from. If it's a compound MIME section, x::mime::make_parser() uses x::mime::make_multipart_parser() or x::mime::make_message_rfc822_parser() to take care of it, with the section processor factory passed as the third parameter. The fourth parameter is a functor or a lambda that gets invoked if the MIME section is not a compound section. It receives one argument, the x::mime::sectioninfo.