<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: James McDermott</title>
    <description>The latest articles on DEV Community by James McDermott (@theycallmemac).</description>
    <link>https://dev.to/theycallmemac</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F178481%2F8b9363b6-02cd-4b48-9bcd-fffd9b2d7888.jpeg</url>
      <title>DEV Community: James McDermott</title>
      <link>https://dev.to/theycallmemac</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/theycallmemac"/>
    <language>en</language>
    <item>
      <title>Odin Development Server!</title>
      <dc:creator>James McDermott</dc:creator>
      <pubDate>Tue, 07 Jul 2020 12:40:26 +0000</pubDate>
      <link>https://dev.to/theycallmemac/odin-development-server-4gie</link>
      <guid>https://dev.to/theycallmemac/odin-development-server-4gie</guid>
      <description>&lt;p&gt;About a month ago I wrote &lt;a href="https://dev.to/theycallmemac/odin-the-programmable-observable-and-distributed-job-scheduler-3mn9"&gt;this article&lt;/a&gt; about the Odin Job Scheduler - a programmable, observable and distributed job orchestration system. I built Odin to change the way in which we run and manage scheduled jobs, and now I want your help to manifest its potential!&lt;/p&gt;

&lt;p&gt;I've set up &lt;a href="https://discord.gg/gFr2Yq"&gt;this Discord Server&lt;/a&gt; to coordinate the development process as we get closer to v2.0.0! There's something for everyone in the Odin project. Here is a list of the technologies currently used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go (Used for the CLI tool, the central engine and a library)&lt;/li&gt;
&lt;li&gt;Typescript and Node.js (Used for the Observability Dashboard and a library)&lt;/li&gt;
&lt;li&gt;Python3 (Used for a library)&lt;/li&gt;
&lt;li&gt;MongoDB (Used for job data storage)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I want to build a lot more libraries, so if you're interested in Rust, Ruby or any other language, you can join too!&lt;/p&gt;

&lt;p&gt;If you're interested, feel free to pop into the Discord and say hi! You can view the Odin project &lt;a href="https://github.com/theycallmemac/odin"&gt;here&lt;/a&gt; (feel free to give it a star!)&lt;/p&gt;

</description>
      <category>go</category>
      <category>node</category>
      <category>contributorswanted</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Trying to build a Dashboard with AngularJS</title>
      <dc:creator>James McDermott</dc:creator>
      <pubDate>Sat, 20 Jun 2020 12:29:25 +0000</pubDate>
      <link>https://dev.to/theycallmemac/trying-to-building-a-dashboard-with-angularjs-3knl</link>
      <guid>https://dev.to/theycallmemac/trying-to-building-a-dashboard-with-angularjs-3knl</guid>
      <description>&lt;p&gt;I have a project called &lt;a href="https://www.github.com/theycallmemac/odin"&gt;Odin&lt;/a&gt; - it's a programmable, observable and distributed job scheduler. &lt;/p&gt;

&lt;p&gt;For a while now, Odin has had a user interface, a dashboard written with Typescript and AngularJS. Metrics from jobs are stored in MongoDB and then displayed on this web UI. &lt;/p&gt;

&lt;p&gt;I didn't build the dashboard myself, and I think it leaves a lot to be desired compared to the dashboards of other projects (such as Grafana). &lt;/p&gt;

&lt;p&gt;Here's what the dashboard looks like at the moment:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://camo.githubusercontent.com/10beb1c55b3dda890641b6bd08f676ab11d5a274/68747470733a2f2f6c68362e676f6f676c6575736572636f6e74656e742e636f6d2f7647456e5a7956386d657479536d6e6a5046315750476a304f536c6a385a47654768717056665559656e3555616b464d4232575471704e466641566433587057487a7a6a414e716c6c62685f566472676f44516a67556249757556586f54674751376938326e796c466341467a64344c50707849334d4c7442413263792d6932396e5a494b6f655f" class="article-body-image-wrapper"&gt;&lt;img src="https://camo.githubusercontent.com/10beb1c55b3dda890641b6bd08f676ab11d5a274/68747470733a2f2f6c68362e676f6f676c6575736572636f6e74656e742e636f6d2f7647456e5a7956386d657479536d6e6a5046315750476a304f536c6a385a47654768717056665559656e3555616b464d4232575471704e466641566433587057487a7a6a414e716c6c62685f566472676f44516a67556249757556586f54674751376938326e796c466341467a64344c50707849334d4c7442413263792d6932396e5a494b6f655f" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'd really appreciate any resources folks have which could help me out! Anything related to building dashboard interfaces like the image above would be great; I'm kind of at a dead end right now.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Alternatively, if you'd like to sink your teeth into a new project, let me know.)&lt;/em&gt;&lt;/p&gt;

</description>
      <category>node</category>
      <category>angular</category>
      <category>typescript</category>
      <category>help</category>
    </item>
    <item>
      <title>Published to an International Conference!</title>
      <dc:creator>James McDermott</dc:creator>
      <pubDate>Wed, 17 Jun 2020 15:51:40 +0000</pubDate>
      <link>https://dev.to/theycallmemac/published-to-an-international-conference-5423</link>
      <guid>https://dev.to/theycallmemac/published-to-an-international-conference-5423</guid>
      <description>&lt;p&gt;A few months back, I co-wrote a research paper titled &lt;em&gt;An Analysis of Function-as-a-Service Infrastructures&lt;/em&gt;. Today I found out that the research paper has been accepted for publication in a leading international conference (&lt;a href="https://2020.eurospi.net/"&gt;EuroSPI 2020&lt;/a&gt; in Dusseldorf, Germany), the current citation for which is as follows:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;br&gt;
Grogan, J., Mulready, C., McDermott, J., Urbanavicius, M., Yilmaz, M., Abgaz, Y., McCarren, A., MacMahon, S.T., Garousi, V., Elger, P., Clarke, P.M.: A Multivocal Literature Review of Function-as-a-Service (FaaS) Infrastructures &amp;amp; Implications for Software Developers. To Appear In: Proceedings of the 27th European and Asian Conference on Systems, Software and Services Process Improvement (EuroSPI 2020), Springer CCIS Vol. 1251, 9-11 September 2020, Dusseldorf, Germany.&lt;/em&gt;&lt;/p&gt;



&lt;p&gt;This is the first time I've had such academic success, so it's really exciting. The research was conducted during the final year of my undergraduate studies at &lt;a href="https://www.dcu.ie/"&gt;Dublin City University&lt;/a&gt;. Broadly speaking, the aim of this paper was to provide an extensive analysis of Function-as-a-Service (FaaS) infrastructures and how they compare to each other. I have for some time been genuinely enthralled by the world of serverless, so this was a great opportunity to learn more.&lt;/p&gt;

&lt;h2&gt;
  
  
  FaaS Infrastructures
&lt;/h2&gt;

&lt;p&gt;FaaS infrastructures are important platforms that make up a category of cloud computing. This paradigm is representative of how writing and deploying applications has changed with time, moving from the heavyweight three-tier model of monolithic applications to the composable, microservice-oriented landscape of today. FaaS infrastructures are associated with on-demand functionality, allowing clients to build applications without the overhead complexity of running and maintaining server infrastructure. &lt;/p&gt;

&lt;p&gt;In this way, applications built alongside FaaS infrastructures are said to follow the “serverless” model. This is an execution model where a provider dynamically manages and allocates machine resources, simplifying the process of deploying code into a production environment. &lt;/p&gt;

&lt;p&gt;This study provided an analysis of scalability, cost, execution times, integration support, and the constraints associated with FaaS services provided by AWS Lambda, Google Cloud Functions, and Azure Functions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Findings
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Hot and cold starts for differing runtimes are linked to the underlying architecture of the provider.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Memory configuration has a greater impact on Google Cloud Functions than on AWS Lambda, which suggests that Google allocates CPU proportionally to memory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The biggest constraint of FaaS infrastructures is the number of supported language runtimes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Double billing breaks a core principle of the serverless model, and without intrinsic support from the FaaS vendor it is impossible to avoid.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Azure and AWS offer the cheapest access to a FaaS solution based on request invocations alone, and also offer the best flat rate charge for the duration of a function's execution - at $0.000016 per GB-second.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;With respect to compute time, both AWS and Google Cloud round up to the nearest 100ms, whilst Azure only rounds to the nearest 1ms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In regards to integration capabilities, Google Cloud provides a more streamlined approach because there is no need to integrate the function with an API, whereas both AWS and Azure require the user to either deploy or have a pre-existing API.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Final Publication
&lt;/h3&gt;

&lt;p&gt;Unfortunately the final publication is not yet available as it is undergoing peer review and additional academic authors are working on the paper. Once I can make it freely available I will make sure to share it here on Dev!&lt;/p&gt;

&lt;p&gt;What is your experience with FaaS and serverless computing? Let me know in the comments!&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>aws</category>
      <category>azure</category>
      <category>googlecloud</category>
    </item>
    <item>
      <title>📚 A Larder.io Command Line Interface!</title>
      <dc:creator>James McDermott</dc:creator>
      <pubDate>Fri, 29 May 2020 17:22:17 +0000</pubDate>
      <link>https://dev.to/theycallmemac/a-larder-io-command-line-interface-1lop</link>
      <guid>https://dev.to/theycallmemac/a-larder-io-command-line-interface-1lop</guid>
      <description>&lt;h1&gt;
  
  
  Preface
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://larder.io/"&gt;Larder&lt;/a&gt; is a bookmarking tool for developers. It supports categorizing and tagging bookmarks. Categorizing is done by folders. You can make a folder that has child. It will make you easier to organize bookmarks. Beside that, you can search your bookmarked links by tag, title, and URL.&lt;/p&gt;

&lt;p&gt;The tool also supports synchronizing your GitHub stars, so you can track updates from projects you've starred on GitHub. If you want to bookmark something automatically, a Larder extension is available for browsers and Android, and there's also an API.&lt;/p&gt;

&lt;p&gt;I decided to make a robust, easy-to-use command line interface for Larder. I did this mainly because I have a 2012 MacBook Pro with 4GB of RAM that &lt;em&gt;really&lt;/em&gt; cannot handle any more browser extensions. I also did it because I know there are a lot of folks who live in their terminals and detest browser extensions.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Larder CLI
&lt;/h1&gt;

&lt;p&gt;Okay, so first off, I am aware a few CLIs for Larder exist, but most of these are Ruby or Node.js based. Go is much more performant, and because of this the Larder CLI, which you can find &lt;a href="https://github.com/theycallmemac/larder"&gt;here&lt;/a&gt;, is written in Go.&lt;/p&gt;

&lt;p&gt;The tool leverages the &lt;a href="https://github.com/spf13/cobra/"&gt;Cobra&lt;/a&gt; library. Cobra has been used to build some of the best CLIs around, being utilized in projects such as Kubernetes, Hugo, etcd, Docker and rkt, to name but a few.&lt;/p&gt;

&lt;p&gt;The Larder CLI allows for viewing, adding, and removing a user's folders and/or bookmarks. &lt;/p&gt;

&lt;p&gt;Along with this, the tool offers robust search functionality. The user provides a string of search terms delimited by commas, for example: "texas,bbq". This will search through all folders for names or tags containing those terms.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Unique Selling Point
&lt;/h1&gt;

&lt;p&gt;What really makes this new Larder CLI stand out is the ability to refresh your access token. Tokens expire in a month and can be refreshed for a new access token at any time, invalidating the original access and refresh tokens. &lt;/p&gt;

&lt;p&gt;This built in functionality does not exist in alternatives to this Larder CLI. This process is automated so the user can continuously use the Larder CLI without having to manually remove and add new access tokens.&lt;/p&gt;




&lt;p&gt;You can check out Larder &lt;a href="https://github.com/theycallmemac/larder"&gt;on Github&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>go</category>
      <category>showdev</category>
      <category>productivity</category>
      <category>cli</category>
    </item>
    <item>
      <title>🐍 Python4: What should it do differently?</title>
      <dc:creator>James McDermott</dc:creator>
      <pubDate>Fri, 29 May 2020 11:44:23 +0000</pubDate>
      <link>https://dev.to/theycallmemac/python4-what-should-it-do-differently-1a1o</link>
      <guid>https://dev.to/theycallmemac/python4-what-should-it-do-differently-1a1o</guid>
      <description>&lt;p&gt;Dev discussion on what Python4 should do differently from its predecessor. &lt;/p&gt;

&lt;p&gt;I'll start - I'm hoping that migrating from 3to4 is simpler than 2to3 was 🤞&lt;/p&gt;

</description>
      <category>python</category>
      <category>discuss</category>
      <category>healthydebate</category>
      <category>design</category>
    </item>
    <item>
      <title>Odin - The Programmable, Observable and Distributed Job Scheduler</title>
      <dc:creator>James McDermott</dc:creator>
      <pubDate>Thu, 21 May 2020 11:25:28 +0000</pubDate>
      <link>https://dev.to/theycallmemac/odin-the-programmable-observable-and-distributed-job-scheduler-3mn9</link>
      <guid>https://dev.to/theycallmemac/odin-the-programmable-observable-and-distributed-job-scheduler-3mn9</guid>
      <description>&lt;p&gt;&lt;a href="https://camo.githubusercontent.com/9920dcab047c61c2d0ef8e0b24eec2bab40f2a79/68747470733a2f2f692e696d6775722e636f6d2f654d52687252622e706e67" class="article-body-image-wrapper"&gt;&lt;img src="https://camo.githubusercontent.com/9920dcab047c61c2d0ef8e0b24eec2bab40f2a79/68747470733a2f2f692e696d6775722e636f6d2f654d52687252622e706e67" alt="odin logo" width="1614" height="1555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview of my Final Year Project
&lt;/h2&gt;

&lt;p&gt;Odin is a programmable, observable and distributed job orchestration system which allows for the scheduling, management and unattended background execution of individual user-created tasks on Linux-based systems. &lt;/p&gt;

&lt;p&gt;The primary objective of such a system is to provide users/teams a shared platform for jobs that allows individual members to package their code for periodic execution, providing a set of metrics and variable insights which will in turn lend transparency and understanding into the execution of all system run jobs. Odin aims to do this by changing the way in which we approach scheduling and managing jobs.&lt;/p&gt;

&lt;p&gt;Job schedulers by definition are supposed to eliminate toil, a kind of work tied to running a service which is manual, repetitive and, most importantly, automatable. Classically, job schedulers are ad-hoc systems that treat their jobs as code to be executed, specifying the parameters of what is to be executed and when. This presents a problem for those implementing the best practices of DevOps. DevOps is something to be practiced in the tools a team uses, and when traditional job schedulers fail they introduce a new level of toil in debugging what went wrong.&lt;/p&gt;

&lt;p&gt;Odin treats its jobs as code to be managed before and after execution. While caring about what is to be executed and when, Odin is equally concerned with the expected behavior of your job, which is described entirely by the user’s code. This observability is achieved through a web-facing user interface which displays job logs and metrics. All of this is gathered through the use of the Odin libraries (written in Go, Python and Node.js) and helps infer the internal state of jobs. For teams, this means Odin can directly help diagnose where the problems are and get to the root cause of any interruptions. Debugging, but faster!&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo Link
&lt;/h2&gt;

&lt;p&gt;If you'd like to check out a brief demo of the Odin system, I recorded a short five-minute video, as it was a criterion for submitting Odin as my final year project to Dublin City University.&lt;/p&gt;

&lt;p&gt;You can check out that demo &lt;a href="https://drive.google.com/file/d/1wH9sJXni9DQf3kMQAmhhBTS2lh9DOOSO/view?usp=sharing"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Link to Code
&lt;/h2&gt;

&lt;p&gt;You can check out the source code for Odin &lt;a href="https://github.com/theycallmemac/odin"&gt;here on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There is also a &lt;code&gt;final-year-project&lt;/code&gt; branch which you may view. This branch contains all materials I submitted alongside Odin as my final year project in Dublin City University.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Odin
&lt;/h2&gt;

&lt;p&gt;Let's get down to brass tacks. The development cycle for the project was pretty standard, so in this section I'm going to touch on some of the research conducted going into the project, some interesting issues/challenges posed during development and finally I take a look at the ultimate design and implementation of the Odin Job Scheduler.&lt;/p&gt;

&lt;h3&gt;
  
  
  Research
&lt;/h3&gt;

&lt;p&gt;When you're aiming to build a distributed scheduling tool for teams practicing DevOps, it's only right you consult &lt;a href="https://landing.google.com/sre/sre-book/chapters/distributed-periodic-scheduling/"&gt;this chapter&lt;/a&gt; in Google SRE Handbook.&lt;/p&gt;

&lt;p&gt;Obviously SRE and DevOps are not one and the same, but given the close association of the two, we deemed the consultation of this material entirely relevant. The chapter linked above specifically details two approaches to scheduling managers, those that store state and those that use remote data stores to hold information. The chapter details the advantages and disadvantages of both, with specific emphasis on the dangers of relying on a single source dependency such as a remote data store. From our reading of this chapter we set sights to ensure that Odin, which would utilize a remote data store, would be able to manage its own execution state in the event the data store went offline.&lt;/p&gt;

&lt;p&gt;When considering the role of observability in the Odin ecosystem, I felt the following two resources to be integral to building an understanding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://charity.wtf/tag/observability/"&gt;Charity Majors' Blog on Observability and Teams&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://bravenewgeek.com/the-observability-pipeline/"&gt;bravenewgeek.com's Observability Pipeline Blog&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The former of these writings, by observability thought leader Charity Majors, focused on the cultural aspect of observability within teams, and on how observability is a prerequisite for any major effort to have saner systems. In Odin's case, observability gives the user built-in transparency, so these writings allowed us to broaden our scope in regards to what observability in a system could achieve.&lt;/p&gt;

&lt;p&gt;The latter of these writings really gives a direct insight into how observability systems work, with direct data collectors/loggers which run information through concurrent pipelines into the data stores. This confirmed that the correct approach in designing our system was for the language specific software development kits to extract live information about jobs, and then plugging this into a back-end of a visualization unit.&lt;/p&gt;

&lt;p&gt;Finally we consulted the following two academic papers in relation to the algorithmic side of job scheduling, along with performance techniques to enhance the rate of execution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.iis.sinica.edu.tw/papers/wuj/9213-F.pdf"&gt;Job Scheduling Techniques for Distributed Systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.cse.huji.ac.il/~feit/parsched/jsspp14/p2-schwiegelshohn.pdf"&gt;How to Design a Job Scheduling Algorithm&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I consulted much, much more material beyond the links above; these are just the resources I felt helped me the most during the research stage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Issues During Development
&lt;/h3&gt;

&lt;p&gt;Every project has its hurdles to overcome. Some big, some small. The following three scenarios are problems I confronted during the development stage.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Original Job Queue
&lt;/h4&gt;

&lt;p&gt;Originally the jobs queue was a simple array of nodes, with each node representing a job. One attribute of each node was that node’s time until execution. When the time left was found to be zero, the job would be sent to the executor and the ticker would check the next job in the queue.&lt;/p&gt;

&lt;p&gt;The issue with this design was that at any one time, there could be several nodes that had “zero” seconds until execution. This would mean that by the time one job had been sent to the executor, the next job had been rescheduled as it had strayed into a negative range. &lt;/p&gt;

&lt;p&gt;The diagram below demonstrates the example of such a queue:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_lRH8sQF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lh3.googleusercontent.com/0sYS1vV_ZqMSlJ32el9U2sSnSNGOoSTw0B8bajfrbqO7in_OYgzbzyog4bMRxbKLvnzTpEtjoOaflMGzWMujfKNJj8C7OBo5a0coVJY" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_lRH8sQF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lh3.googleusercontent.com/0sYS1vV_ZqMSlJ32el9U2sSnSNGOoSTw0B8bajfrbqO7in_OYgzbzyog4bMRxbKLvnzTpEtjoOaflMGzWMujfKNJj8C7OBo5a0coVJY" alt="queue example" width="464" height="99"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Job 4, Job 2 and Job 1 would all be set to execute at the same time, but only Job 1 would execute successfully, as it was the first of these three jobs added. The same issue would rear its head again when both Job 3 and Job 5 reach a Seconds Left value of 0.&lt;/p&gt;

&lt;p&gt;The solution to this problem was simple - groups. Instead of one job per node, we instead tried one Seconds Left value per node. This means that nodes in the queue were not representative of just one job, but rather of all jobs which were set to execute at the same time. This approach meant that this example of an old Odin Priority Queue:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v8ze56ps--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lh5.googleusercontent.com/pEm15yDFGFT0-_jXjfrzRN01rsyqcy76qEazT7bZQlAFkTLOEqaWhxYgbx_6vFNZl0Zk34tg4cEwog8D6495kRQLIbss1Z0KpxU3D3o" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v8ze56ps--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lh5.googleusercontent.com/pEm15yDFGFT0-_jXjfrzRN01rsyqcy76qEazT7bZQlAFkTLOEqaWhxYgbx_6vFNZl0Zk34tg4cEwog8D6495kRQLIbss1Z0KpxU3D3o" alt="old queue" width="470" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Would be transformed into the new queue design:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EIkZQtSA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lh4.googleusercontent.com/hjs3h-P2NFWY6dxOoC7b9-4T5WiDqUIpbD8kE9PQ4v1uG4TdA4_wY_9QB5beCfgR1-J-HK8kGU7tNPqOIVfp-uKvEks7Q7ZG8HWVr5PshG8RTW-bs8P_4QvO9Wtel1r9tWCKtIqM" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EIkZQtSA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lh4.googleusercontent.com/hjs3h-P2NFWY6dxOoC7b9-4T5WiDqUIpbD8kE9PQ4v1uG4TdA4_wY_9QB5beCfgR1-J-HK8kGU7tNPqOIVfp-uKvEks7Q7ZG8HWVr5PshG8RTW-bs8P_4QvO9Wtel1r9tWCKtIqM" alt="new queue" width="234" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This now means that the problem of skipping over nodes which are ready to execute is an issue of the past, as execution is now run as a batch operation across the jobs array of the node at the head of the queue.&lt;/p&gt;
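&lt;p&gt;The grouping idea can be sketched in a few lines of Go (a minimal sketch with illustrative names, not Odin's actual types): jobs sharing a Seconds Left value collapse into one queue node, and the head node's jobs then execute as a single batch.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
)

// Job is a minimal stand-in for a scheduled job.
type Job struct {
	ID          string
	SecondsLeft int
}

// group collects jobs sharing the same SecondsLeft value into one queue
// node, so every job due at the same tick is handed to the executor together.
func group(jobs []Job) [][]Job {
	byTime := map[int][]Job{}
	for _, j := range jobs {
		byTime[j.SecondsLeft] = append(byTime[j.SecondsLeft], j)
	}
	times := []int{}
	for t := range byTime {
		times = append(times, t)
	}
	sort.Ints(times) // soonest group first
	queue := [][]Job{}
	for _, t := range times {
		queue = append(queue, byTime[t])
	}
	return queue
}

func main() {
	jobs := []Job{{"job1", 0}, {"job4", 0}, {"job2", 0}, {"job3", 5}, {"job5", 5}}
	for _, node := range group(jobs) {
		fmt.Println(node) // each node is one batch of simultaneous jobs
	}
}
```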

&lt;h4&gt;
  
  
  Improving Concurrency
&lt;/h4&gt;

&lt;p&gt;During design, the process of sorting the Priority Queue was still quite slow. This needs to be a fast operation because the queue can only be grouped once it has been sorted. The grouping/mapping of jobs per node in the queue is a more computationally complex operation by default, but we aimed to improve the speed of that operation too. &lt;/p&gt;

&lt;p&gt;While using a custom implementation of quicksort, we knew we could leverage Go’s built-in goroutines to speed up our sorting algorithm. Goroutines are lightweight threads in the Go programming language which allow functions to run concurrently.&lt;/p&gt;

&lt;p&gt;Our solution to the sorting problem can be seen in the code below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Md3f23JB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lh6.googleusercontent.com/1AbLpLfYOq2eCktirZG6QnJN2GN6XSo5I5Ytpok8FL1BlKcPqCOMj7ywrUxgasp14vBjOx3c6F4MdEXpmlabTY6WXJ6-DRhr5-ULpdOUe6Sv_AP9_72tzeHOSg4Tb9tMZvrvLVvv" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Md3f23JB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lh6.googleusercontent.com/1AbLpLfYOq2eCktirZG6QnJN2GN6XSo5I5Ytpok8FL1BlKcPqCOMj7ywrUxgasp14vBjOx3c6F4MdEXpmlabTY6WXJ6-DRhr5-ULpdOUe6Sv_AP9_72tzeHOSg4Tb9tMZvrvLVvv" alt="concurrent recursive quicksort" width="800" height="760"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Making Odin a Distributed System
&lt;/h4&gt;

&lt;p&gt;Scaling Odin out to be a distributed system was a significant challenge set out in the Odin Functional Spec. Distributing Odin would offer increased reliability to the system in case of a failure.&lt;/p&gt;

&lt;p&gt;Transforming Odin into a distributed system required a significant amount of terraforming. We utilized the Hashicorp implementation of the Raft consensus protocol. The Raft protocol requires the election of a Leader node from the pool of candidates. A finite state machine was designed to allow all started nodes to be added as voters in this election and in turn cast their votes to reach consensus.&lt;/p&gt;

&lt;p&gt;New API endpoints for &lt;code&gt;join&lt;/code&gt; and &lt;code&gt;leave&lt;/code&gt; were created to handle both expanding the cluster and nodes failing. This did not prove to be a significant challenge, but it was necessary for managing the Raft cluster.&lt;/p&gt;

&lt;p&gt;Another significant hurdle was deciding the way in which jobs were distributed to worker nodes. In designing a method of distributing work, we also needed to take into account how the work is distributed in the event of a node failing. In an example of 12 jobs and 4 nodes, ideally each node should execute 3 jobs each. If a node fails, then each node should take on some extra responsibility. These 3 remaining nodes will then share the 12 jobs by executing 4 jobs each. &lt;/p&gt;

&lt;p&gt;Thus we reached a consensus of our own in how work was shared between nodes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each node is assigned an integer, in the example of 4 nodes above this would be an integer between 0 and 3. &lt;/li&gt;
&lt;li&gt;Each job is then assigned an integer, so in the case of the twelve jobs an integer between 0 and 11. &lt;/li&gt;
&lt;li&gt;We then get the modulus of the current job integer (0-11) and the number of nodes in the cluster (currently 4). We call this mod.&lt;/li&gt;
&lt;li&gt;We then compare the node integer (0-3) to mod and if they are the same then that node is selected to execute the current job.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In pseudocode, the operation looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fXnVkcqe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lh4.googleusercontent.com/yqiEggODF1WoCoaV0eAwRhzrKl_-CcFbZyAMpBTS3O66UWWBLs4K8J632oDk9pEy5xyHqOwG1vRUEsouRaLUM1H0XiWzQnjJwiIDVc8fTGnwPRqX1My22Dto-72tfju_IoLTbloh" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fXnVkcqe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lh4.googleusercontent.com/yqiEggODF1WoCoaV0eAwRhzrKl_-CcFbZyAMpBTS3O66UWWBLs4K8J632oDk9pEy5xyHqOwG1vRUEsouRaLUM1H0XiWzQnjJwiIDVc8fTGnwPRqX1My22Dto-72tfju_IoLTbloh" alt="pseudocode" width="746" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Final Design
&lt;/h3&gt;

&lt;p&gt;The diagram below represents Odin as it currently operates, making reference to the various components of the system architecture. This diagram serves as a demonstration of how these components link up to facilitate the flow of data. This architecture diagram goes to great lengths to show the internals of each component in a more structured way, something which early design diagrams did not include.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dLV6t6qG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lh3.googleusercontent.com/YKoUSoplAp4DYI2ZQ0NbD6YK3kiJOnM6iolhQ2ZIBvkU4rMaz8GObtG0GZ9IdiAUI7H6pMh0h1rmwxygKMWrzb_YtrIRwR8BHExA8m98gfkK_gY7Sg2oE_nT9Y9MZe2ul1BfhjS9" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dLV6t6qG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lh3.googleusercontent.com/YKoUSoplAp4DYI2ZQ0NbD6YK3kiJOnM6iolhQ2ZIBvkU4rMaz8GObtG0GZ9IdiAUI7H6pMh0h1rmwxygKMWrzb_YtrIRwR8BHExA8m98gfkK_gY7Sg2oE_nT9Y9MZe2ul1BfhjS9" alt="odin final architecture diagram" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Beginning at the Odin CLI, we can see how each of its three command categories links to a different place: the job commands category links to the jobs endpoint of the API, the Join Cluster category links to the join endpoint of the API, and finally the recover and log commands in the /etc/odin/ category link to their respective sections on the filesystem.&lt;/p&gt;

&lt;p&gt;We also see how the jobs endpoint itself is integral to the Odin Engine component. It interacts directly with the MongoDB instance, the Ticker, and the execute and links endpoints. The latter of these, the links endpoint, is something in the current design which was not present in the initial system design. The concept of linking jobs together was discussed in our Functional Specification document but never truly explored; we can see that it has now been manifested in the Odin Engine through the links endpoint. &lt;/p&gt;

&lt;p&gt;The join endpoint simply points back to the Odin Engine component itself as a way to signify the replication of the service in a Raft cluster. The Odin CLI directly links to this endpoint to facilitate the ease of expanding this cluster.&lt;/p&gt;

&lt;p&gt;The user code which is actually executed is always stored in /etc/odin/jobs under a directory of the job’s ID value. This was a design detail which was never explored in the original design - how orchestration would be achieved. Making a copy of the files used to run a job ensures redundancy, as users may accidentally remove files scattered across their home directory.&lt;/p&gt;

&lt;p&gt;We can see how the Odin SDK (of any language) is used to log values and report them to the MongoDB instance. These values can be extracted by the Odin Dashboard back-end server and sent to be displayed on an Angular front-end. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In all, Odin solves the problem of periodic job scheduling in a way not currently available to the open source community. This is in specific reference to the Odin Software Development Kits, which centralize all job information/metrics. We feel the observability features are a significant achievement, as they reduce the toil associated with debugging a broken job, helping infer the internal state of jobs at any given time.&lt;/p&gt;

</description>
      <category>devgrad2020</category>
      <category>showdev</category>
      <category>octograd2020</category>
      <category>go</category>
    </item>
  </channel>
</rss>
