Laurent Dami

Posted on Mar 28 • Edited on May 8

Beautiful Perl feature: "heredocs", multi-line strings embedded in source code

#perl #programming #beautifulperl

Beautiful Perl series

This post is part of the beautiful Perl features series. See the introduction post for general explanations about the series.

Previous posts covered random topics ranging from fundamental concepts like blocks or list context and scalar context to sharp details like reusable subregexes. Today's topic is neither very fundamental nor very sharp: it is just a handy convenience for managing multi-line strings in source code, namely the heredoc feature. This is not essential, because multi-line strings can be expressed by other means; but it addresses a need that is quite common in programming, so it is interesting to compare it with other programming languages.

"Heredocs": multi-line data embedded in source code

A "here document", abbreviated as "heredoc", is a piece of multi-line text embedded in the source code. Perl borrowed the idea from Unix shells, and was later imitated by other languages like PHP or Ruby (discussed below). So instead of concatenating single lines, like this:

my $email_template
  = "Dear %s,\n"
  . "\n"
  . "You just won %d at our lottery!\n"
  . "To claim your prize, just click on the link below.\n";

one can directly write the whole chunk of text, with a freely-chosen final delimiter, like this:

my $email_template = <<~_END_OF_MAIL_;
  Dear %s,

  You just won %d at our lottery!
  To claim your prize, just click on the link below.
  _END_OF_MAIL_

The second version is easier to write and to maintain, and also easier to read.

Perl heredoc syntax

General rules

A piece of "heredoc" data can appear anywhere in a Perl expression. It starts with an initial operator written either << or <<~. The second variant with an added tilde ~, available since Perl v5.26, introduces an indented heredoc, where initial spaces on the left of each line are automatically removed by the interpreter; thanks to this feature the inserted text can be properly indented within the Perl code, as illustrated in the example above. The amount of indent removed from each data line is determined by the indent of the final delimiter.
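As a minimal sketch of the indent rule just described (the `usage_text` subroutine and the `mytool` name are hypothetical, chosen only for illustration), the following shows that only the indentation of the closing delimiter is stripped, so deeper relative indentation inside the text survives:

```perl
use v5.26;   # the <<~ operator needs Perl 5.26 or later

# Hypothetical helper returning a small help text.
sub usage_text {
    # The closing delimiter is indented by 8 spaces, so exactly
    # 8 leading spaces are removed from every content line.
    return <<~END_USAGE;
        Usage: mytool [options] FILE
          -v    verbose output
          -h    show this help
        END_USAGE
}

print usage_text();
```

The option lines keep their two extra leading spaces, because only the delimiter's own indentation is removed.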

The heredoc operator must be followed by a delimiter string freely chosen by the programmer; an unquoted delimiter must follow immediately, whereas a quoted one may be preceded by spaces. Lines below the current line will be part of the heredoc text, until the second appearance of that delimiter string, which closes the heredoc sequence. The delimiter string can be any string enclosed in double quotes or in single quotes¹; the double quotes can be omitted if the string would be valid as an identifier (i.e. if it only contains letters, digits and underscores). Indeed, the most common practice is to use unquoted strings in capital letters, often surrounded by underscores - but there is no technical obligation to do so.

When explicit or implicit double quotes are used for the delimiter string, the content of the heredoc text is subject to variable interpolation, like for a usual double-quoted string, whereas with single quotes, no interpolation is performed.
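A small sketch of the difference, using a hypothetical variable `$name` introduced only for this illustration:

```perl
use strict;
use warnings;

my $name = "Alice";   # hypothetical value, for illustration only

# Double-quoted (or unquoted) delimiter: variables are interpolated.
my $interpolated = <<"END";
Hello $name
END

# Single-quoted delimiter: the text is taken literally.
my $literal = <<'END';
Hello $name
END

print $interpolated;   # Hello Alice
print $literal;        # Hello $name
```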

The heredoc content only starts at the line after the << symbol; the rest of the current line is parsed as ordinary Perl code, so it can (and usually should) complete the current expression, i.e. close any open parentheses and terminate the current statement with a final ;. For example, it is perfectly legal (and quite common) to have heredoc data within a subroutine call:

my $html = htmlize_markdown(<<~_END_OF_MARKDOWN_);
  # Breaking news

  ## Perl remontada

  After many years of Perl bashing, the software industry 
  slowly rediscovers the beauty of Perl design. Many indicators ... 
  _END_OF_MARKDOWN_

Arbitrary string as delimiter

As stated above, the delimiter string can be any string, even if it includes special characters or spaces, provided that the string is explicitly quoted:

my $ordinary_liturgy = <<~ "Ite, missa est";
  Kyrie
  Gloria
  Credo
  Sanctus
  Agnus dei
  Ite, missa est

say $ordinary_liturgy;

Empty string as delimiter

The delimiter string can also be ... an empty string! In that case the heredoc content ends at the next empty line; this is an elegant way to minimize noise around the data. I used it for example for embedding Template Toolkit fragments in the test suite for my Array::PseudoScalar module:

  my $tmpl = Template->new();
  my $result = "";
  $tmpl->process(\<<"", \%data, \$result); # 1st arg: reference to heredoc string
     [% obj.replace(";", " / ") ; %]

  like($result, qr[^\s+FOO / BAR / BUZ$], "scalar .replace()");

  $result = "";
  $tmpl->process(\<<"", \%data, \$result);
     size is [% obj.size %]
     last is [% obj.last %]
     [% FOREACH member IN obj %]member is [% member %] [% END; # FOREACH %]

  like($result, qr/size is 3/,     "array .size");

or for embedding SQL fragments in my Lingua::Thesaurus module:

  $dbh->do(<<"");
    CREATE $term_table;

  $dbh->do(<<"");
    CREATE TABLE rel_type (
      rel_id      CHAR PRIMARY KEY,
      description CHAR,
      is_external BOOL
    );

  # foreign key control : can't be used with fulltext, because 'docid'
  # is not a regular column that can be referenced
  my $ref_docid = $params->{use_fulltext} ? '' : 'REFERENCES term(docid)';

  $dbh->do(<<"");
    CREATE TABLE relation (
      lead_term_id  INTEGER NOT NULL $ref_docid,
      rel_id        CHAR    NOT NULL REFERENCES rel_type(rel_id),
      rel_order     INTEGER          DEFAULT 1,
      other_term_id INTEGER          $ref_docid,
      external_info CHAR
    );

  $dbh->do(<<"");
    CREATE INDEX ix_lead_term  ON relation(lead_term_id);

...

While it is technically possible to write my $content = <<~ ""; for an indented heredoc ending with an empty string (notice the ~), this requires that the blank line at the end be indented accordingly. In that case the fact that the blank line is composed of initial indenting spaces followed by a newline character is not visible when reading the source code, so this is definitely not something to recommend!
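For reference, the plain (non-indented) empty-delimiter form can be reduced to a minimal self-contained sketch; the variable name `$text` is only illustrative:

```perl
use strict;
use warnings;

# The heredoc runs until the first completely empty line.
my $text = <<"";
line one
line two

print $text;   # prints the two lines; the empty terminator line is not included
```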

Several heredocs on the same line

Several heredocs can start on the same line, as in this example:

my @blogs = $dbh->selectall_array(<<~END_OF_SQL, {}, split(/\n/, <<~END_OF_BIND_VALUES));
  select d_publish, title, content
  from blog_entries
  where pseudo=? and d_publish between ? and ?
  END_OF_SQL
  chatterbox
  01.01.2023
  30.06.2024
  END_OF_BIND_VALUES

The first heredoc is the SQL query, and the second heredoc is a piece of text containing the bind values; the split operation on the first line transforms this text into a list of values.
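The same mechanics can be tried without a database. In the sketch below, `show_args` is a hypothetical stand-in for the DBI call that simply returns its arguments, and the SQL text is shortened for illustration:

```perl
use v5.26;

# Hypothetical stand-in for a database call: it just returns its arguments.
sub show_args { return @_ }

my @args = show_args(<<~END_OF_SQL, {}, split(/\n/, <<~END_OF_BIND_VALUES));
  select title from blog_entries where pseudo=? and d_publish >= ?
  END_OF_SQL
  chatterbox
  01.01.2023
  END_OF_BIND_VALUES

say scalar(@args);   # 4: the SQL string, the hashref, and two bind values
say $args[2];        # chatterbox
```

The heredoc bodies are read in the order in which their operators appear on the first line.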

Other Perl mechanisms for multi-line strings

Heredocs are not the only way to express multi-line strings in Perl. String concatenation in source code, as shown in the initial example of this article, is of course always possible, albeit not very practical nor elegant. Yet another way is to simply insert newline characters inside an ordinary quoted string, like this:

my $str1 = "this is
            a multi-line string";
my $str2 = qq{and this
              is another multi-line string};

but in that case the indenting spaces at the beginning of each line are always part of the data. So Perl offers more than one way to do it; it is up to the programmer to decide what is most appropriate according to the situation.
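A quick way to observe those retained spaces, as a minimal sketch with illustrative variable names:

```perl
use strict;
use warnings;

my $str = "first line
    second line";

# The four spaces before "second line" are part of the string value.
my @lines = split /\n/, $str;
print "[$_]\n" for @lines;   # the brackets make the leading spaces visible
```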

Perl's acceptance of literal newline characters inside ordinary quoted strings is sometimes very handy, and therefore was also adopted in PHP and Ruby; but the majority of other languages, like Java, JavaScript, Python and C++, do not allow it - this is why they needed to introduce other mechanisms, as we will see later.

Other programming languages

As mentioned in the introduction, the idea of heredocs originated in Unix shells and later percolated into general-purpose programming languages. Some of them, inspired by Perl, adopted the shell-style heredoc syntax; other languages preferred the more familiar syntax of quoted strings, but with variants for supporting multi-line strings. This section will highlight the main differences.

Languages with heredoc syntax

PHP

The syntax for heredocs in PHP is very close to Perl, except that it uses three less-than signs (<<<) instead of two (<<). Like in Perl, the heredoc content is usually subject to variable interpolation, unless the delimiter string is enclosed in single quotes; in that latter case, PHP uses the name "nowdoc" instead of "heredoc".

There are a couple of technical differences with Perl's heredocs, however:

the ending delimiter, even if enclosed in double or in single quotes, must be a proper identifier, not an arbitrary string - so it cannot contain spaces or special characters ... and obviously it cannot be an empty string!
the PHP expression is interrupted at the line where the heredoc starts, and must be properly terminated after the final heredoc delimiter. Here is an example from the official documentation:

<?php
$values = [<<<END
a
  b
    c
END, 'd e f'];
var_dump($values);

So the reader must mentally connect the lines before and after the heredoc to understand the structure of the complete expression, which might be difficult if the heredoc content spans many lines.

Ruby

There is little to say about heredocs in Ruby: they work almost exactly like in Perl, with the same syntax, the same variants regarding interpolation of variables or regarding indented content, and the same possibility to use arbitrary quoted strings (including the empty string) as delimiters. Multiple heredocs starting on the same line are also supported.

There is however a minor difference in the interpretation of indented content: in the presence of an indented heredoc, called "squiggly heredoc" in Ruby, the interpreter considers the least indented line to be the basis for indentation; the number of spaces before the delimiter string is irrelevant. So in

text = <<~END
    foo
  bar
       END

the value is "  foo\nbar\n" (two spaces before "foo", no spaces before "bar"). In Perl this would be a compile-time error, because the indentation level is determined by the number of spaces before the terminating delimiter string, and it is illegal to have content lines with fewer spaces.
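The Perl side of this comparison can be checked with a small sketch. The heredoc is compiled inside a string eval (an arrangement chosen here only so that the failure can be caught and reported), and the snippet is built line by line so that its indentation is explicit:

```perl
use v5.26;

# Same layout as the Ruby example: "bar" is indented less than
# the closing END delimiter, which Perl does not accept.
my $code = join "\n",
    'my $text = <<~END;',
    '    foo',
    '  bar',
    '       END',
    '1;',
    '';

my $ok = eval $code;   # undef if the snippet fails to compile
say $ok ? "compiled" : "rejected: $@";
```

On a Perl that supports <<~, this should report a rejection, with the interpreter's own message about the mismatched indentation stored in $@.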

Languages with quoted multi-line strings

Expressing multi-line strings in source code is quite a common need, so programming languages that support neither heredocs nor literal newline characters inside ordinary quoted strings had to offer something. A commonly adopted solution is triple-quoted strings as special syntax for multi-line string literals.

Python

Python was probably the inventor of triple-quoted strings. These solved two problems at once:

embedded double-quote or single-quote characters need not be escaped
embedded newline characters are accepted and retained in the string value

as shown in this example:

text = """this is a triple-quoted string
          with embedded "double quotes"
          and embedded 'single quotes'
          and also embedded newlines"""

Triple-quoted literals have no syntactic support for handling indented content, so in the example above, initial spaces at lines 2, 3 and 4 are part of the data. Removing indenting spaces must be explicitly performed at runtime, usually through the textwrap.dedent() function.

Like for regular string literals, interpolation of expressions within triple-quoted strings can be performed through an f prefix, a feature introduced in Python 3.6 (released in December 2016).

JavaScript

Originally JavaScript supported neither multi-line strings nor variable interpolation within string literals. Both features were introduced simultaneously in 2015 (ECMAScript 2015) through the new construct of template literals: instead of single or double quotes, the string is enclosed in backticks, and it may contain newline characters and interpolated expressions of the form ${...}. Those two behavioural aspects always come together; there are no syntactic variants for using them independently.

Java

Java introduced triple-quoted strings as late as 2020 (in Java 15), under the name text blocks. The compiler automatically removes "incidental spaces", i.e. indentation spaces at the beginning of lines, and trailing spaces at the end of lines. This mechanism has no basic support for variable interpolation; if needed, programmers have to use other classes like java.text.MessageFormat or StringSubstitutor (from Apache Commons Text).

Conclusion

Today many programming tasks need to handle some polyglot aspects: besides the main source code, fragments of other languages must be included, like HTML, CSS, XML, SQL, templates, etc. Therefore the need to embed multi-line strings in the main source code is quite frequent.

Surprisingly, some popular languages took a very long time before proposing that kind of feature, and they do not always let the programmer freely choose options like removing indentation or applying variable interpolation. By contrast, Perl always had a rich spectrum of mechanisms, including but not limited to heredocs. Various options can be chosen with great flexibility, allowing the programmer to be creative in crafting legible and elegant source code. Yet another example of Perl's beautiful features!

About the cover picture

This score excerpt is taken from Luciano Berio's Sinfonia, written in 1968. In the third movement, Berio makes ample citations of Mahler's Symphony N° 2, a kind of "musical heredoc" inside the new composition.

¹ A third possibility is to enclose the string in backticks, but this will not be covered here.

Top comments (4)

Bernhard Schmalhofer • Mar 31

Thanks, I wasn't aware of the empty string as a heredoc terminator, so I played around with it. I found that the terminating empty line can also be the hypothetical line after the last line of the file.

use v5.24;

say <<'' =~ s/\n$//r, ' = ', <<'' =~ s/\n$//r;
1 + 1

2

This prints, quite sensibly, 1 + 1 = 2. Removing the trailing newline of the file with truncate -s -1 t.pl produced, also quite sensibly:

Can't find string terminator "" anywhere before EOF at t.pl line 3.

🌌 Sébastien Feugère ☔ • Mar 30

The dev.to syntax highlighter got defeated by the embedding SQL fragments example.

Laurent Dami • Mar 31

Thanks, you are right, but I'm afraid there is nothing I can do to improve it. I explicitly removed the "perl" label from this code snippet, so in principle it should be language-agnostic, but apparently some highlighting rules were applied, I don't know on which basis.

🌌 Sébastien Feugère ☔ • Mar 31

That's a pretty complicated one for the syntax highlighter. Not tested, but I'm not even sure GitLab's would handle those heredocs. I am sure Moose syntax breaks it, so...

Very interesting post, thank you.