DEV Community: pwsmith

Fully Formatted Citations in Reveal.js

pwsmith — Sun, 01 Nov 2020 14:59:59 +0000

In this post I will show how to make a fully formatted citation list if you have made slides in reveal.js. This is useful if you are giving an academic presentation with the slides, and want to follow the convention of giving a short citation note in the slide, with the full information at the end.

Why?

One of the drawbacks of reveal.js that I noted in an overview is that there is no native way of making a citation list in html. That is, there is no equivalent of bibtex/biblatex that will automatically scan your file for citation keys and output a formatted list of references at the end of your document based upon what it finds. In the same post, I outlined why I don't think that this is a problem in and of itself: a reference list is effectively a way of pointing someone viewing your slides to a source, and there are html-based ways of doing that, such as making a hyperlink using the <a>...</a> tag.

Yet, despite some positives, there are a couple of drawbacks: unpublished work that is not associated with a url is difficult to incorporate, and, whilst it is easy to click on links if you are viewing the slides at a later point, if you are in the audience at the talk and don't have online access to the slides, then you can't check which exact source is being used.¹ It is also a convention of academic work that one provides references as well, so it is understandable to want a formatted list at the end.

Prerequisites

This method requires use of the command line and the ability to run a python script. So, you should have a basic knowledge of the command line on your computer (nothing complex, but how to open the terminal, how to navigate to a directory, and how to put in basic commands), as well as having Python installed. Furthermore, we will be using Pandoc to generate the citation list.

Install Python

You can check if Python is already installed on your computer by typing into the command line:

python --version

If the output looks like the following (with the value of X depending on what version is installed on your system), then you're good:

Python 3.X.X

Note that if you are using a Mac, it's likely that you'll have both Python 2 and Python 3 installed.² You can check if Python 3 is available on your system by typing:

python3 --version

If this is the case, then wherever I use the command python in this guide, you should replace it with python3. If you receive a notice command not found: python, then you need to install Python.

Install Pandoc

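Pandoc is a command line tool useful for converting documents from one file type to another. Instructions for installing it on a variety of systems can be found here. We will use the filter pandoc-citeproc, which is sometimes not included in the installation (e.g. if you're installing on Linux), in which case you need to install it separately. Note that from Pandoc 2.11 onwards citation processing is built in, and the --citeproc option takes the place of --filter pandoc-citeproc.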

Bibliography

Finally, you'll need a LaTeX style .bib file. If you don't have one already, you can make one using Bibdesk, Jabref or Zotero (amongst others).

Step 1: Incorporating citation keys

The first step is to incorporate citation keys within the document. In my overview, I noted that given that reveal.js makes slide decks that are web-based, it makes sense to make use of this and cite other works as a hyperlink where possible, to take the reader right to the selected work. I will take this approach here. Given that the slides are written in html, we will be using the <a>...</a> tag, where the href value is set to the webpage of the citation, and the text between the tags is what you want displayed on the slide. Below is a sample document before we've put in the citation keys, which I'll be using as a toy example for the remainder of the tutorial:

<!-- aba.html -->
<html>
    <head>
        <title>Generating html references</title>
        <link rel="stylesheet" href="linguistic-examples.css">
    </head>
    <body>
        <p><a href="https://www.glossa-journal.org/article/10.5334/gjgl.362/">Moskal (2018)</a> shows that suppletion for the exclusive happens only when the inclusive pronoun is also suppletive.
        This follows in work on *ABA patterns in suppletion in <a href="https://mitpress.mit.edu/books/universals-comparative-morphology">Bobaljik (2012)</a> and <a href="https://link.springer.com/article/10.1007%2Fs11049-018-9425-0">Smith et al (2019)</a></p>
    </body>
</html>

In order to eventually access the bibliography entry for the citation, one should also incorporate a citation key using data-citation-key, which is a custom attribute added to <a> elements in order to specify a citation key.³ The value of data-citation-key should correspond to the citation key in your .bib file. As we will eventually be converting a markdown document to generate the bibliography, we need to include @ immediately before the citation key value, as this is the citation identifier in markdown. A basic citation would then look like the following:

<a href="https://www.glossa-journal.org/article/10.5334/gjgl.362/" data-citation-key="@moskal2017">Moskal (2018)</a>

Important: the value after @ in the field for data-citation-key must match the citation key in your .bib file. That is, the value of data-citation-key should be of the form @+cite_key.
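Because a mismatched key only shows up later as a missing reference, it can help to check the keys against the .bib file up front. The following is a minimal sketch, assuming a conventional .bib layout in which each entry starts with `@type{key,`. The function names `bib_keys` and `missing_keys` are hypothetical, the sample bibliography is illustrative only, and the regular expression is a rough approximation rather than a full BibTeX parser.

```python
import re

def bib_keys(bib_text):
    """Collect entry keys from BibTeX source (approximate: expects '@type{key,')."""
    return set(re.findall(r"@\w+\s*\{\s*([^,\s]+)\s*,", bib_text))

def missing_keys(citation_keys, bib_text):
    """Return the data-citation-key values that have no matching .bib entry."""
    known = bib_keys(bib_text)
    return [key for key in citation_keys if key.lstrip("@") not in known]

# Illustrative data only, not the real bibliography used in this post.
sample_bib = """
@article{moskal2017,
  author = {Moskal, Beata},
  year = {2018}
}
@book{bobaljik2012,
  author = {Bobaljik, Jonathan D.},
  year = {2012}
}
"""

# smithetal2016 is deliberately absent from sample_bib, so it is reported.
print(missing_keys(["@moskal2017", "@bobaljik2012", "@smithetal2016"], sample_bib))
```

An empty list means that every key used in the slides has an entry in the bibliography.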

Our toy source file would then look like the following, once the data-citation-key values are added:⁴

<html>
    <head>
        <title>Generating html references</title>
        <link rel="stylesheet" href="linguistic-examples.css">
    </head>
    <body>
        <p><a href="https://www.glossa-journal.org/article/10.5334/gjgl.362/" data-citation-key="@moskal2017">Moskal (2018)</a> shows that suppletion for the exclusive happens only when the inclusive pronoun is also suppletive.
        This follows in work on *ABA patterns in suppletion in <a href="https://mitpress.mit.edu/books/universals-comparative-morphology" data-citation-key="@bobaljik2012">Bobaljik (2012)</a> and <a href="https://link.springer.com/article/10.1007%2Fs11049-018-9425-0" data-citation-key="@smithetal2016">Smith et al (2019)</a></p>
    </body>
</html>

Step 2: Extracting the citation keys

Once all of the citation keys are in the html source file, they need to be extracted so that there is a list containing all the citation keys that you want to appear as references. We will use Python for this step. Overall, we want to be able to call the .py script from the command line, and in the same command specify which .html source file to look for the citations in. So, we will aim for the following:

python SCRIPT.py SOURCE.html

To read the name of the source file from the command line, we will use the sys library, which is part of Python's standard library and so is already present in your Python installation.
In order to extract the citation keys we will use Beautiful Soup, which is a Python library for processing the html code underlying web pages. First, install Beautiful Soup on your system using the following at the command line:⁵

pip install --user beautifulsoup4

Make a file — we'll call it refgrab.py here, but the name doesn't matter too much — and insert the following:

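# refgrab.py
import sys
from bs4 import BeautifulSoup

# The first command-line argument is the html source file.
file = sys.argv[1]

# Name the parser explicitly so Beautiful Soup does not have to guess one.
with open(file, "r") as fp:
    soup = BeautifulSoup(fp, "html.parser")

# Print the citation key of every <a> element (None if it has none).
for link in soup.find_all('a'):
    print(link.get('data-citation-key'))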

What this does:

Imports the sys library.
Imports the Beautiful Soup library.
Scans the file picked out by sys.argv[1] (the first argument after the script name on the command line, in our case, aba.html), looks for all <a> elements in that file and prints the value of data-citation-key for each <a>, one per line.⁶ This list is what we will use to build our reference list.
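If your slides contain ordinary links as well as citations, or cite the same work several times, a variant that skips `<a>` elements without a data-citation-key and drops repeated keys could look like the sketch below. It is an alternative under stated assumptions, not the script used in this post: it swaps Beautiful Soup for Python's built-in html.parser module so that it runs without extra installs, and the names `CitationKeyParser` and `extract_citation_keys` are made up for illustration.

```python
from html.parser import HTMLParser

class CitationKeyParser(HTMLParser):
    """Collects data-citation-key values from <a> elements, in document order."""

    def __init__(self):
        super().__init__()
        self.keys = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        key = dict(attrs).get("data-citation-key")
        # Skip links that are not citations, and keys we have already seen.
        if key and key not in self.keys:
            self.keys.append(key)

def extract_citation_keys(html_text):
    parser = CitationKeyParser()
    parser.feed(html_text)
    return parser.keys

# Illustrative input: one plain link and one repeated citation.
sample = (
    '<p><a href="#" data-citation-key="@moskal2017">Moskal (2018)</a>, '
    '<a href="https://revealjs.com">reveal.js</a>, '
    '<a href="#" data-citation-key="@bobaljik2012">Bobaljik (2012)</a>, '
    '<a href="#" data-citation-key="@moskal2017">Moskal (2018)</a></p>'
)
print("\n".join(extract_citation_keys(sample)))
```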

We can run this script on our toy example using the following:

python refgrab.py aba.html

It will return the following to the terminal window:

@moskal2017
@bobaljik2012
@smithetal2016

Step 3: Making the reference list

Now that we have a way to extract all the citation keys from our html document, we need a way to allow Pandoc to access them. One way to do this is to print the output of refgrab.py to its own file. To do this, redirect the output of refgrab.py to a temporary markdown file refs_temp.md that we'll delete later on:

python refgrab.py aba.html > refs_temp.md

This will produce a document like the following, from our sample file above:

<!--- refs_temp.md --->
@moskal2017
@bobaljik2012
@smithetal2016

Alternatively, if you don't want to generate an intermediate file, you can pipe the output of the python script directly to Pandoc, as described later on.

Step 4: Generating the citation list

With the file refs_temp.md, we can now build a citation list of what is contained within. To do this, run the following command, substituting the bibliography path with the relevant one for your bibliography. If you chose a name other than refs_temp for the markdown file in the previous step, use that name instead of refs_temp.⁷

pandoc --bibliography /home/pwsmith/Dropbox/Ducks/biblio.bib --filter pandoc-citeproc refs_temp.md -o references.html

Step 5: Incorporating the citations into your document

Finally, with the references generated, you should end up with a file called references.html with the following as its content:

<!-- references.html -->
<p><span class="citation" data-cites="moskal2017">Moskal (2018)</span> <span class="citation" data-cites="bobaljik2012">Bobaljik (2012)</span> <span class="citation" data-cites="smithetal2016">Smith et al. (2019)</span></p>
<div id="refs" class="references hanging-indent" role="doc-bibliography">
    <div id="ref-bobaljik2012">
        <p>Bobaljik, Jonathan D. 2012. <em>Universals in Comparative Morphology</em>. Cambridge, MA: MIT Press.</p>
    </div>
    <div id="ref-moskal2017">
        <p>Moskal, Beata. 2018. “Exclusively excluding the Exclusive: Suppletion Patterns in Clusivity.” <em>Glossa</em> 2018.</p>
    </div>
    <div id="ref-smithetal2016">
        <p>Smith, Peter W., Beata Moskal, Ting Xu, Jungmin Kang, and Jonathan D. Bobaljik. 2019. “Case and Number Suppletion in Pronouns.” <em>Natural Language and Linguistic Theory</em> 37 (3): 1029–1101.</p>
    </div>
</div>

Pandoc generates both the reference list, which is what we are after, and the in-text citations, which we don't need. What remains then is to select all the lines contained within the <div id="refs"> and copy-paste them into the appropriate place in your html document. Note:

If you're making a webpage, then you can most likely copy the entire "refs" div.
If you're making slides in reveal.js, then you'll most likely need to split the references over multiple slides, so put in <section>...</section> elements where needed (and don't copy the entire "refs" div).
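Splitting the references over slides by hand gets tedious for a long list, so here is a minimal sketch of how it could be automated. It assumes you have already pulled the individual `<div id="ref-...">` entries out of references.html as strings. The function name `references_to_slides`, the `per_slide` parameter and the "References" heading are all hypothetical choices, and the entry text is abbreviated for illustration.

```python
def references_to_slides(entries, per_slide=4):
    """Wrap reference entries in <section> elements, per_slide entries on each."""
    slides = []
    for start in range(0, len(entries), per_slide):
        chunk = "\n".join(entries[start:start + per_slide])
        slides.append(f"<section>\n<h2>References</h2>\n{chunk}\n</section>")
    return "\n".join(slides)

# Abbreviated entries, standing in for the divs that Pandoc generates.
entries = [
    '<div id="ref-bobaljik2012"><p>Bobaljik, Jonathan D. 2012.</p></div>',
    '<div id="ref-moskal2017"><p>Moskal, Beata. 2018.</p></div>',
    '<div id="ref-smithetal2016"><p>Smith, Peter W., et al. 2019.</p></div>',
]
slides_html = references_to_slides(entries, per_slide=2)
print(slides_html)
```

With three entries and two per slide, this produces two `<section>` elements that can be pasted into the slide deck.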

Step 6: Deleting the temp files (optional)

The last step is to remove the file refs_temp.md and, if you want, references.html, as they are no longer needed and can easily be regenerated by following the steps above if you need them again. To remove them, simply run the command:⁸

rm refs_temp.md references.html

Saving time

If you're comfortable with all of the above steps, you can simply run one command (and not build refs_temp.md) by piping the output of the python script directly into pandoc. In which case, steps 3 and 4 are conflated into one:

python refgrab.py SOURCE.html | pandoc --bibliography /home/pwsmith/Dropbox/Ducks/biblio.bib --filter pandoc-citeproc -o references.html
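The same pipe can also be driven from Python, which is handy if you would rather keep everything in one script. This is a sketch under assumptions, not a tested part of the workflow above: the function names are hypothetical, the bibliography path is whatever you pass in, and the Pandoc options are simply the ones used in the command above.

```python
import subprocess

def build_pandoc_command(bib_path, output_path="references.html"):
    """Assemble the Pandoc call used in this post as an argument list."""
    return [
        "pandoc",
        "--bibliography", bib_path,
        "--filter", "pandoc-citeproc",
        "-o", output_path,
    ]

def make_reference_list(keys, bib_path, output_path="references.html"):
    """Send the citation keys to Pandoc on standard input, one per line."""
    markdown = "\n".join(keys) + "\n"
    subprocess.run(
        build_pandoc_command(bib_path, output_path),
        input=markdown,
        text=True,
        check=True,  # raise an error if Pandoc exits with a failure code
    )

# Only shows the command here; calling make_reference_list needs Pandoc installed.
print(" ".join(build_pandoc_command("biblio.bib")))
```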

Again, be sure to put in the relevant names and paths for the python script, source file and bibliography.

Notes

¹ That said, if you use reveal.js to make slides, then it makes sense to host the slides online, say for instance on Github Pages, and give the audience a link at the beginning of the talk. ↩
² This may also be the case for Windows, but I don't know. ↩
³ The only thing that is important here is the prefix data-; citation-key is used as it's semantically easy to follow. However, you can use whatever you want, e.g. data-key, data-citation, data-ref. ↩
⁴ In my bibliography, the years in the first and third citation keys don't match the publication years. This doesn't matter; I'm just noting it in case you're wondering why that is the case. ↩
⁵ If you are using a Mac with both Python 2 and Python 3 installed, you should use pip3 instead of pip. ↩
⁶ If you use <a> elements that are not citations, then you will end up with some empty references in the markdown file that is generated. Don't worry about this: Pandoc will run fine anyway. ↩
⁷ You can pick whatever names you want for this file, and references.html: they are temporary ones and we'll delete them in Step 6. ↩
⁸ Again, if you used different names for the intermediate files, be sure to use those instead. ↩

A bash function to make a book index

pwsmith — Fri, 30 Oct 2020 15:47:47 +0000

Scope of the issue

Non-fiction books commonly have an index at the end: a collated list of topics with the page numbers where they are discussed, so that the reader can go directly there without having to scan through the whole book each time. Books written in LaTeX can make use of the imakeidx package to generate the index. The package is well documented, and I won't discuss it in any real detail here, as there are tutorials online for how to use it.¹ What's important is that each term that you want to index is followed by the command \index{TERM}. So, say that you want to index second in the text below, you'd add in \index{second} like so:

A second\index{second} is one of the fundamental units of time.

In an ideal world, you would create the index as you write, so that each time you put in a term, you already put the index marker in the source text. However, writing a book doesn't really work that way and, take it from me, what you end up writing will rarely match exactly what you set out to write. It's better then to do the index at the end of the project, when you can see the book in context and know the contents.

The problem is, then you have to go through your source files and put the \index{TERM} command after everything you wish to index. That's a daunting prospect for a long book. So, what is the best way to do this?

The obvious

Suppose you have gone through your work, noted which are the important terms you want in the index, conceptualised how they all group together (i.e. what stands alone, what categories can be nested under others etc.) and you're ready to do it. The obvious thing to do is to go to the beginning of the pdf, start reading, and wherever you see something you want to index, run synctex, find the item in the source, and input the relevant index marker. Easy, right? This is certainly one way to do it, and not an inherently bad one, but there are some downsides. It is repetitive, in many many cases redundant, as you will enter the same text for the same item in various different places, and most likely, you are going to miss some. Why? Well, if you have 50 items you want to index, you need to keep the entire group of 50 in your mind whilst you read through. You may not miss some, but chances are you will. So, whilst this is an option, it seems there is a better way. Let's call this Plan B for now.

Replace All with a GUI?

As I noted before, indexing in LaTeX is pretty easy.
Suppose that you want to index second, then all you need to do is add \index{second} after every instance of second. Simple. It effectively boils down to second -> second\index{second}. There are two ways of looking at this.
Firstly, one can append \index{second} to every instance of second. Or you can replace the string second with second\index{second}. The effect is the same, in the sense that the result is one of appending the index command. However, the subtle difference is that you replace the original string, but the replacement string contains the original string.

The next obvious solution is that most text editors have a function that allows you to find and replace text. Usually by hitting something like control/command+f you can get the editor to scan through the document and find the next instance of the string you're searching for. Then, depending on your editor, there is often an option to replace that with a different one. We can then make a simpler version of the procedure above, and for a specific term, systematically go through the document hitting find and then replace. Again however, there is a problem. This is easier than earlier, but still redundant in that you are repeating the same step over and over.

So, how about the replace all option, that will at a click replace every instance of the chosen string with the alternative? That's what we're striving for, right? The procedure now is simple: hit control+f, enter second in the search field and second\index{second} in the replacement field, then click replace all. Easy. Repeat that 50 times for your terms, and you have yourself an index. Right? Well, yes, but probably not the one you want.

Two problems. Firstly, language is, frankly, annoying for this task.² Homophony, where the same string means different things, means that not every instance of the word second is going to be right. If you want to index the unit of time, then with replace all you may also end up indexing second in the sense of "second place", which is not ideal. You may even use the verb seconded, which is going to get conflated with the unit of time. Again, not ideal. The second (get it?) issue is that books are long, and best practice with LaTeX is to split long documents up into smaller parts and call them from a main document with the \include{} or \input{} command. You don't have to write this way, but if you compile a 300 page document each time, it's going to take a while and this can be annoying when debugging. Replace all often only works on a single document, and not for all documents in an entire directory. So, if you have 6 chapters of a book, all split into different files, then you have to run the same command six times. We're starting to zero in on the solution, but we still need a better way, one that allows us to replace all in one go but, importantly, (i) quickly check what we're replacing; and (ii) do all of our source files at once.

Enter bash: find and replace with `sed`

Doing the same task on a batch of files in one go is ideally suited to the shell of your computer. There are a number of different shells available, with bash and zsh being the most common on Unix-based systems (Linux, Mac and BSD, for instance) and PowerShell on Windows. I'll use bash in what follows. Find-and-replace can be done on the command line using sed, the 'stream editor'. It is a very powerful tool that can also be used to append, as we'll see later. Changes can be written to the standard output, in which case you'll see them on your terminal screen, written into the source document, or written to a new document. We'll use the first two.

Let's see a simple example. In our directory with the source files, suppose there is a file chapter_1.tex that contains the following text.

A \textbf{second} is one of the fundamental units of time.
There are 60 seconds for each \textbf{minute}.
Beneath the level of the second is the \textbf{millisecond}, the \textbf{nanosecond} and the \textbf{microsecond}.            
Second, the term, was borrowed into English from Old French \cite{wiktionary}.
Second to none, the second is the easiest unit to count with.

If we want to use sed to replace all instances of second, with, say, hour, then we run the following. The s prefix indicates the substitution operation, and g suffix indicates that the change should happen globally throughout the file, rather than just the first instance. Between the slashes, the first string is the original, and the second is the replacement.

sed -e 's/second/hour/g' chapter_1.tex

And this prints the following to the terminal screen:

[Figure: Find and replace with `sed`.]
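For readers more at home in Python, the same substitution can be sketched with re.sub, which replaces every match by default (the counterpart of sed's g suffix). This is only an analogue for illustration, not part of the workflow in this post, and the variable names are made up.

```python
import re

# The contents of chapter_1.tex from above, as a raw string so that the
# LaTeX backslashes are kept literally.
chapter_1 = r"""A \textbf{second} is one of the fundamental units of time.
There are 60 seconds for each \textbf{minute}.
Beneath the level of the second is the \textbf{millisecond}, the \textbf{nanosecond} and the \textbf{microsecond}.
Second, the term, was borrowed into English from Old French \cite{wiktionary}.
Second to none, the second is the easiest unit to count with."""

# Python counterpart of: sed -e 's/second/hour/g' chapter_1.tex
replaced = re.sub("second", "hour", chapter_1)
print(replaced)
```

Every lowercase second is replaced, including the ones inside seconds and millisecond, while the capitalised Second at the start of the last two lines is left alone because the match is case-sensitive.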

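Let's say we run that to check and we're happy. By default sed prints its result to standard output and leaves the file untouched (the -e flag simply marks what follows as the editing command). To make the change in the file itself, we replace the -e flag with -i (in place). Note that on macOS the bundled BSD version of sed expects a backup suffix after -i, so there you would write -i '' instead.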

sed -i 's/second/hour/g' chapter_1.tex

Compare the output of running cat on chapter_1.tex before and after the command:

[Figure: Note the differing outputs of running `cat chapter_1.tex` before and after `sed`.]

But suppose you also want to change the instances that begin with a capital, such as the one on the last line. These weren't picked up by sed the first time because the search is case-sensitive. Fortunately, sed can be combined with regular expressions, so you can make the query [sS]econd, and the search will pick up both second and Second.

[Figure: Using a basic regular expression with `sed`.]

Making a bash function for indexing

Now we have the basics in place of what we need to define a function that we will use for indexing. Firstly, make a file to store the function. The name doesn't matter, so we'll just call it index.sh.

## to be revised...
function index() {
    sed -e 's/'"$1"'/'"$2"'/g' "$3"
}

Then type source index.sh and the function will be added to your environment. We'll revise this function throughout, and any time you make a change to the index.sh file, you need to run source index.sh on the command line again to reimport the function into your environment. The function is used as follows: type index at your terminal to call it, followed by the string you wish to replace (the query), the string you wish to be the replacement, and then the file you want to do it for. Bash will match the first word after index to $1, the second to $2 and the third to $3. Going back to what we've been doing so far, suppose again that we wish to replace second with hour in chapter_1.tex. At the command line, run the following:

index second hour chapter_1.tex

Nice!

But, we need some refinements. All we've done so far is replicate the GUI replace all function. But, we wanted to make it do two more things. We wanted it to allow us to check, to make sure it's all good, so we don't blindly replace everything. We'll soon add this functionality to allow us to note wrong instances, and go back and correct them later. We also wanted to be able to do it for all files in the directory.

Looping over files

Let's deal with the latter first. This, we achieve with a simple bash for-loop. After making this change in index.sh, don't forget to run source index.sh in your terminal!

## to be revised..
function index() {
    for FILE in *.tex
    do
        sed -e 's/'"$1"'/'"$2"'/g' "$FILE"
    done
}

We now no longer need to specify a file to run sed on, as the bash loop will run it on all files that have the .tex suffix. Let's suppose that in addition to chapter_1.tex we have a second chapter, chapter_2.tex, with the following contents:

A \textbf{second} after he left, the phone rang.
He would have only had time to talk for a \textbf{minute} anyway.
But obviously he could talk for more than a \textbf{millisecond}, a \textbf{nanosecond} and a \textbf{microsecond}.            
Second to none, the second he left was the biggest regret of his life.

Now, if we run our modified function, we get the following to standard output:

[Figure: Now we're looping.]

Making an actual index function

Before moving on though, let's revise the function to make a proper index. I've been using replacement to demonstrate the use of sed, but recall that we don't want to replace, but rather append \index{TERM}. It's tempting to simply add in $1 in the replacement slot:

## to be revised..
function index() {
    for FILE in *.tex
    do
        sed -e 's/'"$1"'/'"$1"'\\index\{'"$2"'\}/g' "$FILE"
    done
}

But, the query will not always match the replacement, especially if the query involves a regular expression:

[Figure: This is clearly not what we want.]

The solution is to use a group in the replacement string. The escaped parentheses in the query create a group, which is then referred to with \1 in the replacement string.

## to be revised..
function index() {
    for FILE in *.tex
    do
        sed -e 's/\('"$1"'\)/\1\\index\{'"$2"'\}/g' "$FILE"
    done
}
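The difference between pasting the query into the replacement and reusing the matched text is easy to see side by side. Here is a small Python analogue using re.sub (not the sed command itself; the variable names are illustrative). A function is used as the replacement so that the backslash in \index needs no special escaping.

```python
import re

line = "Second to none, the second is the easiest unit to count with."
query = "[sS]econd"
term = "second"

# Naive version: pastes the text of the query itself into the replacement.
naive = re.sub(query, lambda m: query + "\\index{" + term + "}", line)

# Group version: re-inserts whatever the query actually matched (group 1).
grouped = re.sub(
    "(" + query + ")",
    lambda m: m.group(1) + "\\index{" + term + "}",
    line,
)

print(naive)
print(grouped)
```

The first line printed contains the literal string [sS]econd in place of each match, whereas the second keeps Second and second as they were and only appends the index command.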

This is better. However, it's still not quite right as we're now indexing a couple of things we shouldn't. Firstly, nanosecond, millisecond and microsecond shouldn't get indexed. Though they contain the string second, they refer to different things. Secondly, Second to none has an indexed item in it, but it doesn't refer to a unit of time here, but second as in 'second place'. We can fix the former by refining our query to add in a word boundary at the beginning of the string with \b. Because there is no word boundary before second in, e.g., microsecond, it doesn't match the query:

[Figure: We're no longer indexing, e.g., 'microsecond' with 'second'.]
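The effect of the word boundary can be checked with a short Python analogue. Python's \b behaves like the \b used above for these examples, but this is a sketch for illustration rather than the sed command itself, and the helper name `add_index` is hypothetical.

```python
import re

def add_index(text, query, term):
    """Append \\index{term} after every match of query in text."""
    return re.sub(
        "(" + query + ")",
        lambda m: m.group(1) + "\\index{" + term + "}",
        text,
    )

line = (
    r"Beneath the level of the second is the \textbf{millisecond}, "
    r"the \textbf{nanosecond} and the \textbf{microsecond}."
)

without_boundary = add_index(line, r"[sS]econd", "second")
with_boundary = add_index(line, r"\b[sS]econd", "second")

print(without_boundary)
print(with_boundary)
```

Without \b all four words are indexed; with \b only the free-standing second is, because the s in millisecond, nanosecond and microsecond is preceded by another letter rather than a word boundary.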

For the latter case, there is not much we can do. The query matches, as it should do. The problem is, as noted earlier, there are two homophonous uses of 'second'. These tend to be fairly rare, so it's best to make a note of where they happen, and fix them by hand later.

Adding in checking

Now we can loop over files, let's move to the other functionality we want to add, the ability to check before replacing. Up to now we've been printing the output of sed to standard output. This provides a convenient way to check. But, we've also just been looking at the entire output. In our toy example, with two files of around five lines each, it's not a problem to look through and manually check all instances of the query, but this gets unwieldy really fast. So, we'll first run the query through grep, add in colour highlighting of the query with --color=always, and also use the -H and -n flags to show file name and line numbers respectively. This will allow us to quickly scan all and only the instances of the query, and make a note if there are any we will need to fix manually later (such as with the two instances of Second, which don't fit the 'unit of time' meaning).

## to be revised..
function index() {
    for FILE in *.tex
    do
        grep -nH --color=always "$1" "$FILE" | sed -e 's/\('"$1"'\)/\1\\index\{'"$2"'\}/g'
    done
}

[Figure: Note the use of the single quotes in the query when we run the command. That's important for `grep` to correctly identify `\b`.]
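To make the idea of the preview concrete, here is a rough Python counterpart of what grep -nH contributes: the file name and line number of every matching line. It is a sketch only. The function name `preview` is hypothetical, and the files are passed in as a dictionary of name to contents, an in-memory stand-in for reading the .tex files from disk.

```python
import re

def preview(files, query):
    """Return (file name, line number, line) for every line that matches query."""
    hits = []
    for name, text in files.items():
        for number, line in enumerate(text.splitlines(), start=1):
            if re.search(query, line):
                hits.append((name, number, line))
    return hits

# The contents of chapter_2.tex from above.
chapter_2 = r"""A \textbf{second} after he left, the phone rang.
He would have only had time to talk for a \textbf{minute} anyway.
But obviously he could talk for more than a \textbf{millisecond}, a \textbf{nanosecond} and a \textbf{microsecond}.
Second to none, the second he left was the biggest regret of his life."""

hits = preview({"chapter_2.tex": chapter_2}, r"\b[sS]econd")
for name, number, line in hits:
    print(f"{name}:{number}:{line}")
```

Only lines 1 and 4 are listed, and line 4 is the one to note down for fixing by hand later.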

Up until now, we've been running sed with the -e flag and previewing the results in standard output on the terminal, which serves as a useful check, but we also want to make the changes in the file. So, as a final step, we'll add in a second run of sed, this time with the -i flag, if the preview looks good. What we need for this is to add an if statement into our bash function after the first run of sed. If we're happy with the result then we continue to run sed -i; if not, the function should abort to give us a chance to refine our query. We'll also map the command line input to variables to allow them to be used throughout the function, including within the if statement:

## Final version!
function index() {
    ARG1=$1
    ARG2=$2
    # Preview: show each matching line with the index command appended.
    for FILE in *.tex
    do
        grep -nH --color=always "$ARG1" "$FILE" | sed -e 's/\('"$ARG1"'\)/\1\\index\{'"$ARG2"'\}/g'
    done
    read -p 'Is this correct (y/n): ' VALIDATION
    # Accept any answer that starts with y or Y.
    if [ "${VALIDATION::1}" == Y ] || [ "${VALIDATION::1}" == y ]
    then
        echo "Applying changes to file."
        for SOURCE in *.tex
        do
            sed -i 's/\('"$ARG1"'\)/\1\\index\{'"$ARG2"'\}/g' "$SOURCE"
        done
    else
        echo "Function aborted, go and refine the query."
    fi
}

And that's it! Now, when we want to index a term like second, all we need to do is run the following:

index '\b[sS]econd' second

First grep will find the relevant instances, feed that into sed, which will then print to the terminal the suggested changes. A prompt will appear asking if that is correct, and if you then press y or Y (or type yes or Yes), then the command will run again and sed will this time make the changes in the file. If not, it will abort and prompt you to refine your query.
The result:

[Figure: Running the full function. Note again the second `cat` output, where the changes have been made in-file.]
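For comparison, the whole preview, confirm and apply cycle can be sketched in Python as well. This is an analogue under assumptions, not a port of the bash function: the name `index_term` and the `confirm` parameter are hypothetical, the preview has no colour highlighting, and the demonstration runs in a temporary directory with an automatic "y" answer so that it does not touch real files.

```python
import re
import tempfile
from pathlib import Path

def index_term(directory, query, term, confirm=input):
    """Preview matches in every .tex file, then append the index command if confirmed."""
    pattern = re.compile("(" + query + ")")

    def append_index(match):
        return match.group(1) + "\\index{" + term + "}"

    paths = sorted(Path(directory).glob("*.tex"))
    # Preview: print file name, line number and the line as it would look.
    for path in paths:
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if pattern.search(line):
                print(f"{path.name}:{number}:{pattern.sub(append_index, line)}")

    # Accept any answer that starts with y or Y, as in the bash function.
    answer = confirm("Is this correct (y/n): ")
    if answer[:1].lower() != "y":
        print("Aborted, go and refine the query.")
        return False

    for path in paths:
        text = path.read_text(encoding="utf-8")
        path.write_text(pattern.sub(append_index, text), encoding="utf-8")
    return True

# Demonstration on a throwaway file, answering "y" automatically.
with tempfile.TemporaryDirectory() as demo_dir:
    chapter = Path(demo_dir) / "chapter_1.tex"
    chapter.write_text("A second is a unit of time.\nSo is a microsecond.\n", encoding="utf-8")
    index_term(demo_dir, r"\b[sS]econd", "second", confirm=lambda prompt: "y")
    print(chapter.read_text(encoding="utf-8"))
```

Passing the confirmation step in as a parameter keeps the function usable interactively (the default is input) while still allowing it to be run unattended.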

It's important to note that this will not do everything for us. As I said before, language doesn't allow for a perfect 1:1 mapping between terms and meanings. In our example files, some instances of Second have been picked up which should not be indexed, as in Second to none. However, if you refine the query enough, and carefully check before you implement the changes, you can write down the files and locations of the 'wrongly indexed' instances and go and manually change them. This should be easy, given that the grep query, with the -nH flag will tell you which file, and line number the example came from. Because of the first step of the function, grep tells us that we need to go and manually change line 5 in chapter_1.tex and line 4 in chapter_2.tex. And that's that! Our bash function saves us going through the entire file by hand making the same changes time and time again. We still need to do some manual editing, but if you take the time to craft the right regexes and combine them with this function, you'll save yourself a lot of time and energy!

¹ Here and here, for instance. ↩
² Take it from me, I have a PhD in linguistics and everything. ↩