DEV Community: Benjamin Steenhoek

Programming problem: Gematria

Benjamin Steenhoek — Fri, 13 May 2022 07:00:00 +0000

Photo by Tanner Mardis on Unsplash

I’ve recently been studying history in the Bible by following along with the phenomenal podcast of the same name, History in the Bible by Garry Stevens. Through his series, I learned about gematria, the ancient practice of assigning a number to a name based on its letters.

The use of gematria was/is present in many cultures, including ancient Greek, Hebrew, Arabic, and English. Some historians hypothesize that the number of the beast (666) came about from the numeric value of the Greek name “Nero Caesar”. I’m not an expert on that, but I do think it’s a fun little game to convert different words into their numeric value, especially if the value coincides with some related tidbit.

Let’s explore how gematria works, frame it as a programming problem/exercise, and check out some solutions - plus, we will learn some programming along the way.

How gematria works

First, we map each Hebrew letter to a numeric value:

Decimal	1	2	3	4	5	6	7	8	9	10	20
Hebrew	Aleph	Bet	Gimel	Dalet	He	Vav	Zayin	Het	Tet	Yod	Kaf
Glyph	א	ב	ג	ד	ה	ו	ז	ח	ט	י	כ

Decimal	30	40	50	60	70	80	90	100	200	300	400
Hebrew	Lamed	Mem	Nun	Samekh	Ayin	Pe	Tsadi	Kof	Resh	Shin	Tav
Glyph	ל	מ	נ	ס	ע	פ	צ	ק	ר	ש	ת

Then, we add the values in the word together.

ח	+	י	=	חי
8		10		18

Here are some cool examples:

The value of alive (חי) is 18, prompting some people to donate money in multiples of 18.
The value of [Ha-]Satan (הַשָּׂטָן) is 364, leading to the saying that out of the 365 days in the year, Satan has 364 days to prosecute. Wow, he earned that 1-day vacation!
The value of snake when transliterated from Greek (דרקון) is 360, which is probably the reason for a peculiar passage in 3 Baruch:

And this is Hades, which itself also closely resembles him, in that it also drinks about a cubit from the sea, which does not sink at all. Baruch said, And how (does this happen)? And the angel said, Hearken, the Lord God made three hundred and sixty rivers, of which the chief of all are Alphias, Abyrus, and the Gericus; and because of these the sea does not sink.

The value of my name, Benjamin (בִּנְיָמִין), is 162.

Mispar Gadol

Several different gematria ciphers are in use, some from antiquity and some more modern. One example is Mispar Gadol, which gives large values to the final forms of certain Hebrew letters (when these letters appear at the end of a word, they are written in a final form that differs from their usual form). As a result, the same word can have a different numeric value depending on whether you use the traditional cipher or Mispar Gadol.

Decimal	500	600	700	800	900
Hebrew	Kaf (final)	Mem (final)	Nun (final)	Pe (final)	Tsadi (final)
Glyph	ך	ם	ן	ף	ץ
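
As a rough sketch, both ciphers can also be computed directly from Hebrew glyphs instead of romanized names. The names LETTER_VALUES, FINAL_FORMS, and glyph_value are made up for illustration. The sketch assumes that vowel points should simply be skipped and that, in the traditional cipher, a final form counts the same as its regular form (which is consistent with the 364 and 162 examples above).

```python
import unicodedata

# Traditional values of the 22 letters, in alphabetical order (first table).
LETTERS = "אבגדהוזחטיכלמנסעפצקרשת"
VALUES = [*range(1, 10), *range(10, 100, 10), *range(100, 500, 100)]
LETTER_VALUES = dict(zip(LETTERS, VALUES))

# Final form -> (regular form, Mispar Gadol value), from the Mispar Gadol table.
FINAL_FORMS = dict(zip("ךםןףץ", zip("כמנפצ", range(500, 1000, 100))))


def glyph_value(word, gadol=False):
    """Sum the letter values in `word`; with gadol=True, final forms use 500-900."""
    total = 0
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            continue  # skip vowel points and other combining marks
        if ch in FINAL_FORMS:
            regular, gadol_value = FINAL_FORMS[ch]
            total += gadol_value if gadol else LETTER_VALUES[regular]
        else:
            total += LETTER_VALUES[ch]  # raises KeyError for non-Hebrew characters
    return total


print(glyph_value("חי"))                   # 18
print(glyph_value("דרקון"))                # 360
print(glyph_value("דרקון", gadol=True))    # 1010
```

Unlike the romanized exercise below, this version recognizes a final form by its glyph rather than by its position in the word.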

I thought this would be an interesting beginner-level programming problem, since the results can be interesting or culturally significant and the different ciphers offer natural variations on the exercise. Let’s state the problem!

The gematria programming problem

Given a sequence of n Hebrew letters <letter 1> <letter 2> ... <letter n>, print the numeric value of those letters according to the gematria cipher. The inputs are given with romanized ASCII characters for simplicity.

The code for this exercise can be found here: https://github.com/bstee615/gematria.

Example 1: alive
- Input: het yod
- Output: 18
Example 2: snake
- Input: dalet resh kof vav nun
- Output: 360

Template code:

# this maps from Hebrew characters to their corresponding values.
mapping = {
    "aleph": 1,
    "bet": 2,
    "gimel": 3,
    "dalet": 4,
    "he": 5,
    "vav": 6,
    "zayin": 7,
    "het": 8,
    "tet": 9,
    "yod": 10,
    "kaf": 20,
    "lamed": 30,
    "mem": 40,
    "nun": 50,
    "samekh": 60,
    "ayin": 70,
    "pe": 80,
    "tsadi": 90,
    "kof": 100,
    "resh": 200,
    "shin": 300,
    "tav": 400,
}

def solve(characters):
    # fill in your solution here
    pass

Solutions for students

This problem can be solved by looping through the letters, mapping each letter to its numeric value, and summing the results. I’ll walk through a few variants of the solution to show how the problem can be used for teaching.

Simple solution

def solve(characters):
    """Calculate the gematria for a word using a loop"""
    result = 0
    for c in characters:
        result += mapping[c]
    return result

List comprehensions

This problem can be used to introduce simple comprehensions as a replacement for the loop. The version below passes a generator expression, the lazy cousin of a list comprehension, directly to sum. Python programmers should learn this construct because, used correctly, it reads naturally and is easy to understand, as you can see here.

def solve(characters):
    """Calculate the gematria for a word using a comprehension"""
    return sum(mapping[c] for c in characters)

This is what the output looks like:

Standard encoding using solve_comprehension:
dalet resh kof vav nun 360
he shin tet nun 364
het yod 18
bet nun yod mem yod nun 162
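
A small driver along these lines could produce the listing above. It is only a sketch: the name solve_comprehension is taken from the output header, while the words list and the printing loop are assumptions, and the script in the repository may differ.

```python
# Letter names mapped to their traditional values (same data as the template).
mapping = {
    "aleph": 1, "bet": 2, "gimel": 3, "dalet": 4, "he": 5, "vav": 6,
    "zayin": 7, "het": 8, "tet": 9, "yod": 10, "kaf": 20, "lamed": 30,
    "mem": 40, "nun": 50, "samekh": 60, "ayin": 70, "pe": 80, "tsadi": 90,
    "kof": 100, "resh": 200, "shin": 300, "tav": 400,
}


def solve_comprehension(characters):
    """Calculate the gematria for a list of romanized letter names."""
    return sum(mapping[c] for c in characters)


words = [
    "dalet resh kof vav nun",
    "he shin tet nun",
    "het yod",
    "bet nun yod mem yod nun",
]

print("Standard encoding using solve_comprehension:")
for word in words:
    # split() turns the space-separated input into a list of letter names
    print(word, solve_comprehension(word.split()))
```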

More fun with Mispar Gadol

Finally, gematria can be used to introduce the built-in enumerate function and the dict.get method with a default value in order to implement the Mispar Gadol variant. get and enumerate are also important for the same reason: they allow natural and understandable expression when used correctly.

# this maps from the final form of these characters to their corresponding values.
final_mapping = {
    "kaf": 500,
    "mem": 600,
    "nun": 700,
    "pe": 800,
    "tsadi": 900,
}

def solve(characters):
    """Calculate the Mispar Gadol gematria for a word using the dictionary get() method with default"""
    last = len(characters) - 1  # index of the word's last letter
    return sum(
        # the last letter takes its final-form value if it has one; every other letter takes its regular value
        final_mapping.get(c, mapping[c]) if i == last else mapping[c]
        for i, c in enumerate(characters)
    )

The output:

Mispar gadol encoding using solve_mg:
dalet resh kof vav nun 1010
he shin tet nun 1014
het yod 18
bet nun yod mem yod nun 812
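
If the conditional expression feels dense, an equivalent formulation splits off the last letter first. This is a sketch with a hypothetical name, solve_mg_split. It rebuilds the same two dictionaries compactly so that it runs on its own, and it assumes an empty word should have the value 0.

```python
# Rebuild the letter-name mapping from the tables: 1-9, 10-90, 100-400.
names = (
    "aleph bet gimel dalet he vav zayin het tet yod kaf lamed mem nun "
    "samekh ayin pe tsadi kof resh shin tav"
).split()
values = [*range(1, 10), *range(10, 100, 10), *range(100, 500, 100)]
mapping = dict(zip(names, values))
final_mapping = {"kaf": 500, "mem": 600, "nun": 700, "pe": 800, "tsadi": 900}


def solve_mg_split(characters):
    """Mispar Gadol: regular values for all letters but the last one."""
    if not characters:
        return 0
    *body, last = characters  # unpack everything except the final letter
    # The last letter uses its final-form value when it has one.
    return sum(mapping[c] for c in body) + final_mapping.get(last, mapping[last])


print(solve_mg_split("dalet resh kof vav nun".split()))  # 1010
print(solve_mg_split("het yod".split()))                 # 18
```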

Conclusion

I hope that you enjoyed learning about this cool gematria cipher. I find it interesting to think that the name you’re given when you’re born also gives you a special number. Maybe you will pair up with a person who has the same number. Who knows?

Feel free to message me with the numeric value of your name, as well as your mother’s maiden name and the numeric value of your credit card. 😁 Just joking. But seriously, I’d be glad to hear whether you found this programming problem interesting or have a cool number of your own.

No pain no gain? Comparing 3 program analysis frameworks for C

Benjamin Steenhoek — Thu, 03 Mar 2022 20:54:15 +0000

Original post: https://benjijang.com/posts/2022/03/no-pain-no-gain/

Program analysis methods often represent programs as graphs. These graphs should be automatically generated from the source code. There are many tools that have been implemented to do this, but they are often painful to set up. In this post, I will compare 3 program analysis frameworks which I have used to generate graph representations of C programs.

TL;DR: More powerful frameworks are more difficult to set up because they require compiler information or expose complex APIs.

SrcML is great if all you need is the AST and you don't need 100% precision.
Joern is great if you need the CFG or PDG for a large set of programs, and are OK with potentially parsing some programs incorrectly.
LLVM is great if you want a rock-solid analysis and want to leverage complex program analysis passes used in the Clang compiler, and you can provide compiler information.

No pain ⇒ no gain. More pain ⇒ more gain???
(source: kharlamova)

Control flow what-now?

Different types of graphs are used for different analyses, depending on what information is needed [0]:

Abstract Syntax Tree (AST): A tree representation of the tokens in a program which abstracts out details like parentheses, whitespace, and separators.
Control Flow Graph (CFG): A graph representation where each node is a statement and each edge is a transition in control flow.
Program Dependence Graph (PDG): A graph representation where each node is a statement and each edge is a control or data dependency. A variable is dependent on a statement if that statement affects the value of the variable.
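
To make the CFG definition concrete, here is a hand-built sketch for a small counting loop. The dictionary layout and the helper has_loop are illustrative assumptions, not the output format of any framework discussed in this post.

```python
# Hand-built CFG for:  x = 0; for (i = 0; i < 10; i++) { x += 1; }  return x;
# Each key is a statement; each value lists the statements control can reach next.
cfg = {
    "x = 0": ["i = 0"],
    "i = 0": ["i < 10"],
    "i < 10": ["x += 1", "return x"],  # branch: enter the body or leave the loop
    "x += 1": ["i++"],
    "i++": ["i < 10"],                 # back edge to the loop condition
    "return x": [],
}


def has_loop(graph, entry):
    """Depth-first search; a loop exists if we reach a node still on the DFS stack."""
    on_stack, done = set(), set()

    def visit(node):
        on_stack.add(node)
        for succ in graph[node]:
            if succ in on_stack:
                return True  # found a back edge
            if succ not in done and visit(succ):
                return True
        on_stack.discard(node)
        done.add(node)
        return False

    return visit(entry)


print(has_loop(cfg, "x = 0"))  # True
```

A PDG for the same code would replace these control-flow edges with dependency edges; for example, "return x" would depend on both "x = 0" and "x += 1".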

I chose to study the relative benefits of 3 popular program analysis frameworks that I have used in my own research: SrcML, Joern, and LLVM.

I evaluated the frameworks based on 3 criteria which we care about for any program analysis task.

Speed: how fast is the framework?
Precision: how precise is the resulting CFG?
Ease of use: how much effort does it take to use the framework, esp. on a large set of programs?

All of these graph representations can be automatically generated from C source code, though the task is sometimes challenging.

Challenges parsing C code

C programs are difficult to parse because the preprocessor allows arbitrary text substitution [1]. If preprocessor macros aren't defined, the parser can misinterpret the context of a certain chunk of code and parse it totally incorrectly. I call this imprecision in my evaluation of the 3 frameworks.

C programs also require compiler information such as types and functions defined in header files in order to parse correctly [2]. These header files can be scattered all across the machine, and the standard library headers are in different locations on different operating systems or distributions. The compiler information is usually passed to the parser by way of compiler flags such as -I or -D.

Relative benefits of SrcML, Joern, and LLVM.
(source: original)

SrcML

SrcML is an XML format for source code.
It provides the AST in a language-independent format. It also preserves all characters, including whitespace, comments, and preprocessor macros.

SrcML can parse code with missing includes and libraries, which makes it a great fit for large-scale program analysis (on the order of millions of programs). However, this also means that the AST can sometimes be incorrect, especially in the presence of preprocessor macros. The SrcML parser uses a set of heuristics to deal with these challenges, but they do not always succeed.

The SrcML authors claim it is faster than a compiler (over 25KLOC/sec) [3]. I have observed that it does run very fast, and additionally can do all its processing in memory due to the fact that it outputs XML as text.

SrcML provides the AST but not the CFG or PDG. In order to obtain the CFG, then, we would have to implement an algorithm to generate the CFG based on the AST. Some projects have done this as part of their implementation (notably srcSlice and srcPtr), but I found it difficult to adapt these implementations for other uses.

The SrcML format is language-independent, so theoretically, you could write an analysis based on the XML format and apply it to all the languages supported (currently C, C++, C#, and Java).

Interestingly, SrcML is reversible, meaning a user can parse code into XML, edit the XML, then un-parse the XML back into code while preserving the edits. This allows some cool editing functionality, and I found it easier in some cases than editing the raw source code because I can locate the symbols I want to edit by traversing the XML tree.
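
The round trip can be sketched in Python with the standard library's XML parser. The XML fragment below is hand-written in the style of SrcML for illustration and is not actual tool output; real SrcML documents carry more attributes and markup. Because the format keeps every source character as text between the tags, joining the text nodes recovers the code.

```python
import xml.etree.ElementTree as ET

SRC_NS = "http://www.srcML.org/srcML/src"  # SrcML's XML namespace

# Hand-written, simplified fragment for the statement `x += 1;`
xml_text = (
    f'<unit xmlns="{SRC_NS}" language="C">'
    "<expr_stmt><expr><name>x</name> <operator>+=</operator> "
    '<literal type="number">1</literal></expr>;</expr_stmt>\n'
    "</unit>"
)

root = ET.fromstring(xml_text)


def unparse(unit):
    """Recover source code by concatenating all text in document order."""
    return "".join(unit.itertext())


print(unparse(root), end="")  # x += 1;

# Edit the tree: rename the variable by locating <name> elements.
for name in root.iter(f"{{{SRC_NS}}}name"):
    if name.text == "x":
        name.text = "total"

print(unparse(root), end="")  # total += 1;
```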

The SrcML team was pretty responsive to my queries about their framework.

Joern

Joern is a workbench which parses C/C++ code and generates a Code Property Graph (CPG). The CPG is a combination of AST, CFG, and PDG into one big graph, and it exposes information sufficient to perform a wide range of analyses. The main interface to the tool is a command-line interpreter which allows users to write custom queries in a DSL based on Scala.

Joern does not require compiler information, but uses a fuzzy parsing method known as island grammars to parse the code as best as it can.

I found that the Joern implementation runs much slower than both SrcML and LLVM. This may be because of the choice of runtime platform: SrcML and LLVM are built with C++ and have narrow functionality, while Joern is built with Scala and is highly customizable/scriptable.

Multiple languages are supported: C/C++, x86/64 assembly, JVM, LLVM Bitcode, and JavaScript. I've only tried C/C++, and these are the only languages marked as high maturity on the Joern documentation page.

Joern includes great utilities for analysis, but any modifications to the code must be manually pasted together. There is no support for rewriting or transformation.

The ShiftLeft team is very active in developing Joern, and they are helpful to users of their framework.

LLVM

LLVM is the granddaddy of all program analysis frameworks. It is a mature collection of tools designed for compiler development.

LLVM is the foundation of the Clang compiler, Clang Static Analyzer (CSA), KLEE, and many other well-known tools.
Because it's used in so many popular tools, it is optimized to be blazing fast.
LLVM exposes APIs for AST, CFG, and PDG information as well as a whole host of other analyses.
Basically, any information available to the compiler is available to the developer of an LLVM tool. As well, the information is 100% precise, as a compiler cannot tolerate incorrect information.

This power comes with a price - LLVM requires all types to be defined in order to parse the code correctly. Due to C's ambiguous grammar, if a type is not defined, then the compiler cannot tell the difference between a function definition and a variable definition, leading to errors in parsing (cite). If some definitions are missing, LLVM can produce a broken AST with large sections missing.

Usually, this information is provided by giving LLVM a set of compiler flags that would be used to compile the program. It can be difficult to get these flags if you want to analyze a lot of programs, since the flags are platform- and configuration-dependent.
This can introduce a lot of manual effort to obtain these flags, which can render LLVM infeasible for analyzing large-scale program datasets.
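
One common way to hand these flags to Clang-based tools is a JSON compilation database (compile_commands.json). The sketch below builds a single entry by hand. The project path, include directory, and macro are made-up placeholders, and in practice the file is usually generated by the build system (for example, with CMake's CMAKE_EXPORT_COMPILE_COMMANDS option) rather than written manually.

```python
import json

# One entry per translation unit; all paths and flags here are hypothetical.
entry = {
    "directory": "/path/to/project",  # working directory of the compile step
    "file": "src/main.c",             # the source file this entry describes
    "arguments": [
        "clang",
        "-Iinclude",                  # -I: where to look for header files
        "-DDEBUG=1",                  # -D: define a preprocessor macro
        "-c",
        "src/main.c",
    ],
}

# The database is a JSON array of such entries.
database = json.dumps([entry], indent=2)
print(database)  # save this text as compile_commands.json in the project root
```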

LLVM's analysis functionality only processes C/C++ at the source level. Additional utilities are available for LLVM IR, which is a low-level SSA assembly language which many other languages target.

LLVM provides the Rewriter API for rewriting source code. I found these utilities to be very convenient in most cases, although in some cases when the location I want to rewrite was not exposed by the Clang AST, it was difficult to work around the rewriter API.

Finally, I found that the LLVM C++ API LibTooling can be intuitive and comfortable at times, but often the details are very complex and there are a lot of footguns.
I got better at using it from experience, but it still takes me a while to figure out which ASTMatcher I should use or whether to use the refactoring engine.

Benchmark analysis

I implemented a simple refactoring tool with each framework in order to evaluate the speed of each tool.
The refactoring tool exchanges a for statement with a while statement. This can be done with AST only, though control-flow information is necessary in order to handle early break, continue, or return. You can access the prototype's source code here: https://github.com/bstee615/pa_framework_examples.

Here is an example of the tool at work. When this program is input:

int main()
{
    int x = 0;
    for (int i = 0; i < 10; i ++)
    {
        x += 1;
    }
    return x;
}

Then this program should be the output.

int main()
{
    int x = 0;
    int i = 0;
    while (i < 10) {
      x += 1;
      i++;
    }
    return x;
}
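
The rewrite rule itself can be sketched as a text template in Python. This is a toy illustration with a hypothetical function, for_to_while, and not how the prototype is implemented: it takes the loop parts as already-extracted strings, whereas the real tools obtain them from the AST. It refuses bodies containing continue, because after the rewrite a continue would skip the increment that a for loop always runs.

```python
def for_to_while(init, cond, incr, body, indent="    "):
    """Render `for (init; cond; incr) { body }` as an equivalent while loop."""
    # Crude textual check; a real tool would inspect control flow instead.
    if any("continue" in stmt for stmt in body):
        raise ValueError("body contains continue; needs control-flow handling")
    lines = [f"{init};", f"while ({cond}) {{"]
    lines += [f"{indent}{stmt}" for stmt in body]  # original loop body
    lines += [f"{indent}{incr};", "}"]             # increment moves to the end
    return "\n".join(lines)


print(for_to_while("int i = 0", "i < 10", "i++", ["x += 1;"]))
# int i = 0;
# while (i < 10) {
#     x += 1;
#     i++;
# }
```

As in the output above, the loop variable is now declared outside the loop, so its scope is wider than in the original program.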

I measured the runtime of my prototype tool in seconds on the example program, averaged over 5 runs.
The results are shown below. Format: Average ± std. deviation.

LLVM	Joern	SrcML
0.0230s ± 0.0034s	6.2906s ± 0.0034s	0.0702s ± 0.0088s
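
A measurement of this kind can be sketched in Python as follows. The helper time_command is hypothetical, the command being timed is a stand-in (an empty Python process) rather than one of the three tools, and the use of the sample standard deviation is an assumption.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, runs=5):
    """Run argv `runs` times; return (mean, sample standard deviation) in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True, capture_output=True)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


# Stand-in command: start a Python interpreter that does nothing.
mean, std = time_command([sys.executable, "-c", "pass"])
print(f"{mean:.4f}s ± {std:.4f}s")
```

Timing the whole process from the outside, as done here, includes interpreter or VM startup, which is exactly the cost that dominates on a small input.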

This evaluation shows the difference in startup times between the frameworks. In my research I have found that the startup time is a pretty large consideration, and the size of the program has a relatively small effect on the framework's performance.

LLVM and SrcML are pretty similar in performance for all practical purposes. Notably, I used Python to invoke SrcML and parse the output XML. It might be slightly faster if I had written it in C++ and linked with the SrcML library.

Joern was the slowest by far. This may be due to the overhead of starting up the JVM (Joern is built with Scala) and Joern's interpreter.

Frameworks are multi-dimensional, not just lightweight/heavyweight.
(source: Morgan Johnson)

Each of these frameworks has its sweet spot in program analysis. In reality, analyzing real-world programs is difficult. There are many choices with different dimensions of pain/pleasure. I compared speed on a small example to highlight the differences between the frameworks. I suggest you do your own research into these 3 frameworks to figure out which one fits your application best. Most importantly, don't be dogmatic about using one approach over another - for example, supposing you are used to LLVM giving you compiler-level precision in your analyses, you may benefit from switching to Joern in order to speed up your development cycle.

References

[0] F. Yamaguchi, N. Golde, D. Arp and K. Rieck, "Modeling and Discovering Vulnerabilities with Code Property Graphs," 2014 IEEE Symposium on Security and Privacy, 2014, pp. 590-604, DOI: https://doi.org/10.1109/SP.2014.44.

[1] A. Garrido and R. Johnson, "Challenges of refactoring C programs," Proceedings of the International Workshop on Principles of Software Evolution (IWPSE '02), 2002, pp. 6-14, DOI: https://doi.org/10.1145/512035.512039

[2] E. Bendersky, "The context sensitivity of C's grammar," Eli Bendersky's website, Nov. 2007. Retrieved March 3, 2022, from https://web.archive.org/web/20210713114717/https://eli.thegreenplace.net/2007/11/24/the-context-sensitivity-of-cs-grammar

[3] M. L. Collard, M. J. Decker and J. I. Maletic, "srcML: An Infrastructure for the Exploration, Analysis, and Manipulation of Source Code: A Tool Demonstration," 2013 IEEE International Conference on Software Maintenance, 2013, pp. 516-519, DOI: https://doi.org/10.1109/ICSM.2013.85