DEV Community

Sam Gould

Introduction to Web Development for Data Scientists

You are a data scientist used to helping organisations answer their strategic questions using data and technology. When you are in engineering mode, you spend your time:

  • building data pipelines in Python and SQL

  • modelling and analysing data in Python or R

  • using REST APIs and Linux containers for lightweight app deployments

You now want to build a website. What are the right tools and approaches for getting started with web development?

This is the situation in which I recently found myself while building samgould.net. There are lots of reasons to build a website:

  • Startups/SaaS businesses need websites and web apps

  • It gives you a self-owned platform for creativity and expression

  • Web dev is a new topic to learn

If you are working as a data scientist then you already have a lot of the foundational technical skillset required for web development. But the modern proliferation of languages, frameworks and platforms can make getting started seem daunting. Here's how I did it.

Let's tackle our Hello World foe

How is a website structured?

My experience in deploying machine learning MVPs gave me a rough mental model of a website's internals as a starting point. We will need a backend handling application/business logic, a frontend providing pages for a user to view, and some kind of API connecting the two. Incoming internet requests are pointed at our server host via a DNS lookup (mapping a domain name to an IP address). We will probably also need some layer handling incoming traffic (API routing and load balancing).
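
As a rough illustration of that split, here is a minimal sketch using only Python's standard library. The function names, the `/api/greeting` path and the page content are all hypothetical, chosen purely for illustration, and this is a toy for local experimentation rather than a production setup.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def greeting() -> dict:
    """Backend: the application/business logic (hypothetical example)."""
    return {"message": "Hello, World!"}


# Frontend: a page for the user to view (hypothetical content).
INDEX_PAGE = "<html><body><h1>My site</h1></body></html>"


def route(path: str) -> tuple[int, str, bytes]:
    """API/routing layer: map a request path to (status, content type, body)."""
    if path == "/":
        return 200, "text/html", INDEX_PAGE.encode("utf-8")
    if path == "/api/greeting":
        return 200, "application/json", json.dumps(greeting()).encode("utf-8")
    return 404, "text/plain", b"Not found"


class Handler(BaseHTTPRequestHandler):
    """Translates incoming HTTP GET requests into calls to route()."""

    def do_GET(self) -> None:
        status, content_type, body = route(self.path)
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 8000) -> HTTPServer:
    """Build (but do not start) a development server bound to localhost."""
    return HTTPServer(("127.0.0.1", port), Handler)
```

Calling `make_server().serve_forever()` should then serve the page at http://127.0.0.1:8000/ and the JSON at http://127.0.0.1:8000/api/greeting. Keeping the routing in a plain function means it can be tested without opening a socket.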

A rough mental model of a website architecture

What are the main types of website?

We can refine this picture by considering the use case. In a simple blog, the frontend user experience is the same for everyone: I request some content, like an article, and the website serves it to me; but in a more complex application, the experience might depend on the user and some data exchange with the server. The key distinction here is that of static vs. dynamic websites. A dynamic site typically uses static templates which it dynamically populates based on client requests. An idea of our desired website functionality in these terms will be important for technology selection and architecture.
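
To make the static/dynamic distinction concrete, here is a small sketch using the standard library's `string.Template`. The template, page text and function names are assumptions made up for this example; real sites would use a proper templating engine.

```python
from html import escape
from string import Template

# One static template, shared by both cases.
PAGE = Template("<h1>$title</h1><p>$body</p>")

# Static: rendered once, ahead of time; every visitor gets the same HTML.
STATIC_ARTICLE = PAGE.substitute(title="My first post", body="Same for everyone.")


def serve_static() -> str:
    """Return the pre-rendered page unchanged."""
    return STATIC_ARTICLE


def serve_dynamic(username: str) -> str:
    """Populate the same template per request, based on who is asking."""
    # Escape user-supplied text so it cannot inject markup into the page.
    return PAGE.substitute(
        title="Dashboard",
        body=f"Welcome back, {escape(username)}.",
    )
```

The static page is identical on every call, while the dynamic page depends on its input, which is the property that drives the technology choices discussed in the rest of this post.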

What are the main web dev technologies?

There are various ways to research this question. I looked at posts on popular developer sites: Stack Overflow, Indie Hackers, DEV.to and Reddit.

A minimal static site is, at its core, HTML and CSS, with bits of JavaScript sprinkled in for dynamic functionality. However, as a Python-using data scientist, I am used to high-level frameworks doing the heavy lifting for me, and I don't fancy the sound of learning three new languages. But which framework to pick? There is clearly a massive array of web dev frameworks out there and, as in any form of programming, there are many ways to skin a cat. The most important thing is to pick something which works well enough for us to build out our use case. With a Python background, the frameworks which jump out to me are Django and Flask.

Python web frameworks

Django

Django is "a high-level Python web framework that encourages rapid development and clean, pragmatic design". It is appealing as a "batteries-included" framework which handles a lot of boilerplate and functionality which is important but unfamiliar to a developer from a non-web background. It is considered "somewhat opinionated" and utilises a 'model-view-template' design pattern. A model defines the data layer (typically mapped to database tables) and is queried by a view, which is an HTTP request handler. Views use templates to format the data for display to the client.
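
The division of labour in model-view-template can be sketched without any framework at all. To be clear, this is not Django's API: it is a plain-Python illustration of the pattern, and every name in it (`Article`, `ARTICLES`, `article_view`) is hypothetical.

```python
from dataclasses import dataclass
from html import escape
from string import Template


@dataclass
class Article:
    """Model: the shape of the data (a framework would map this to a database table)."""
    slug: str
    title: str
    body: str


# Stand-in for a database; a real framework would run a query here.
ARTICLES = {"hello": Article("hello", "Hello World", "My first post.")}

# Template: presentation only, with placeholders for the data.
ARTICLE_TEMPLATE = Template("<h1>$title</h1><p>$body</p>")


def article_view(slug: str) -> tuple[int, str]:
    """View: handle a request, fetch data via the model, render the template."""
    article = ARTICLES.get(slug)
    if article is None:
        return 404, "<h1>Not found</h1>"
    page = ARTICLE_TEMPLATE.substitute(
        title=escape(article.title),
        body=escape(article.body),
    )
    return 200, page
```

In Django itself, each of these three pieces has a dedicated home and far more machinery (an ORM, a template language, URL routing), which is what "batteries-included" refers to.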

The official docs are the best place to begin with Django development (project setup etc.), but there are alternative tutorials too. The docs explain the concept of WSGI (the Web Server Gateway Interface, the standard interface through which a web server passes requests to your Python code); Django's startproject command sets up a minimal default WSGI configuration.
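
To demystify WSGI a little, here is a minimal sketch of a WSGI application using the standard library's `wsgiref` reference implementation. The `application` callable follows the standard interface; the `call_app` and `serve` helpers are hypothetical conveniences added for this example, and none of this is the configuration Django generates.

```python
from wsgiref.simple_server import make_server
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A WSGI application: a callable taking (environ, start_response)."""
    body = f"Hello from {environ['PATH_INFO']}".encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]  # an iterable of bytes


def call_app(path: str) -> tuple[str, bytes]:
    """Call the app directly, as a WSGI server would, without any network."""
    environ = {}
    setup_testing_defaults(environ)  # fill in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status

    body = b"".join(application(environ, start_response))
    return captured["status"], body


def serve(port: int = 8000) -> None:
    """Run the app on a local development server (blocks until interrupted)."""
    with make_server("127.0.0.1", port, application) as httpd:
        httpd.serve_forever()
```

The point is that the server and the application only agree on this one calling convention, which is why the same Python app can sit behind different WSGI-capable servers.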

Looking at our diagram, the other missing puzzle piece is hosting. Django applications are typically deployed behind a web server (Nginx or Apache), which listens on public ports and accepts incoming connections (i.e. HTTP requests), and run on cloud VMs, for example, in order to have a public IP address. It is possible, but not recommended, to use your own hardware as a server.

With this new knowledge of Django, we can update our mental model:

High level architecture of a website using Django

If you do choose to develop with Django, then consider exploring the popular frameworks and toolkits built around it. To keep your finger on the pulse of the Django ecosystem, see Filip Němeček's DjangoFeeds.

Flask

Flask is an alternative Python library, often compared with Django: "a lightweight WSGI web application framework. It is designed to make getting started quick and easy, with the ability to scale up to complex applications". As can be seen from the source code, it can spin up a Hello World API routing example in just a few lines of extremely simple, Pythonic code. I personally wanted to explore heavier features such as front-end admin panels without spending too much time on a learning curve, so did not delve deeper into Flask - one to revisit in future.

Alternative microframeworks include FastAPI and Starlite (since renamed Litestar).

Static Site Generators

As we have seen, developing a website in Python is an extremely viable option. But suppose that you have a laser focus on a simple use case: a static blog site. In this scenario, a Static Site Generator (SSG) might be the more efficient option.

Let's slightly redefine our mental model. A common pattern for developing a site which serves static content involves less two-way communication than in Django's MVT: content is created when the site is developed, not when the user requests it, so why not generate all of our site's HTML at build time too? We can push our content (typically a collection of Markdown files) to a repository with a CI/CD pipeline (e.g. on GitHub), run an SSG to inject it into templates and spit out a bunch of static HTML pages, and then serve these to the user. This typically makes for an extremely fast website with minimal development overhead.
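
The build-time idea can be sketched in a few lines of standard-library Python. This is a toy, not how any particular SSG works internally: the template, the function names and the tiny Markdown subset (only '# ' headings and plain paragraphs) are all assumptions made for illustration.

```python
import html
from pathlib import Path
from string import Template

# Hypothetical page template; real SSGs use much richer theming systems.
PAGE_TEMPLATE = Template(
    "<html><head><title>$title</title></head><body>\n$content\n</body></html>\n"
)


def render_markdown_subset(text: str) -> str:
    """Convert a tiny subset of Markdown: '# ' headings and plain paragraphs."""
    blocks = []
    for block in text.strip().split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("# "):
            blocks.append(f"<h1>{html.escape(block[2:])}</h1>")
        else:
            # Join wrapped lines so each block becomes a single paragraph.
            blocks.append(f"<p>{html.escape(' '.join(block.splitlines()))}</p>")
    return "\n".join(blocks)


def build_site(content_dir: Path, output_dir: Path) -> list[Path]:
    """Render every .md file in content_dir to an .html file in output_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(content_dir.glob("*.md")):
        page = PAGE_TEMPLATE.substitute(
            title=html.escape(source.stem),
            content=render_markdown_subset(source.read_text(encoding="utf-8")),
        )
        target = output_dir / f"{source.stem}.html"
        target.write_text(page, encoding="utf-8")
        written.append(target)
    return written
```

Running `build_site(Path("content"), Path("public"))` in a CI job and publishing the output directory is, in miniature, the workflow described above: all rendering happens once at build time, so the server only ever hands out finished files.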

Hugo

A nice SSG is Hugo, which can be used to quickly set up a blog. It is trivial to then host these static files on a CDN hosting service like Netlify or GitHub Pages, although deploying through an Nginx server is also very doable.

High level architecture of a website using Hugo

Hugo is not the only SSG; there are many others.

Content Management Systems

With the previous two methods, we can successfully host our content on the internet. So what happens when our scope expands and we need to add new site functionalities, like collaborative editing of front-end content?

A Content Management System (CMS) can be used to create and edit a site's content. As with an SSG, this content is then injected into HTML templates to be served on the front-end. The difference, architecturally speaking, is that the CMS can be dynamically coupled with the front-end. From a development perspective, popular CMSes are feature-rich and have mature plugin ecosystems, meaning it is quicker to implement advanced site functionality such as e-commerce integration and user access roles.

Note: Jamstack is the name of a web development architecture based on core principles around decoupling the front end web experience from the data and business layers, with a focus on delivery as static sites. Both SSGs and CMSes can be used in this way: a CMS can be used in 'headless' mode - i.e. backend only, requiring a separate presentation layer to handle design, site structure and templates. This CMS usage mode is technically more aligned to Jamstack, but the distinction appears to make little practical difference during early stage web dev.

WordPress (the open source WordPress.org, not the managed service WordPress.com) is the most popular CMS in use today. It is straightforward to deploy a WordPress instance into a VM running Nginx or Apache.

High level architecture of a website running WordPress

From my perspective, this is a great option. It gives me fine-grained control over my server, because I am paying for IaaS (a VM server) rather than PaaS (website hosting). I am (still) using open source tools which are, on the whole, very approachable from a data science background. To me, WordPress itself is still a bit of a black box of PHP code, which is why I stopped representing the distinction between back- and front-ends in the diagram, but the trade-off is that I have immediate access to plugins and integrations.

This architecture is one implementation of what is known as the LAMP/LEMP stack: Linux (my Ubuntu VM), Apache/Nginx (pronounced Engine-X, hence "E"), MySQL, PHP. For reference, a typical JS web app stack would be something like MEAN or MERN (but there are lots of other possibilities).

Conclusion

There are multiple ways to build a website, and the choice should balance use case requirements against ease of development. Each approach fits the basic server hosting mental model that a Python/ML/DS background provides, with key distinctions depending on the website archetype (static or dynamic, web page or app). Essentially, though, we are putting a bunch of HTML files on a server and exposing them to the internet via a web server. Django lets Python developers focus on implementing business logic, Hugo can easily spin up static blogs, and WordPress provides instant access to mature plugins. There is no single right way to do web development - but it is important to remember the end goal and not fixate on the technology choice.

I have skimmed over some details in this article, notably DNS setup (which is really quite straightforward) and best practices for server maintenance. You can refer to DigitalOcean's 1-click WordPress installer to see what additional configurations it performs on its Apache server.

For a live production site, you should go on to explore:

  • Security - fail2ban and DDoS prevention (DigitalOcean disables XML-RPC)

  • Backups - for example via git or using the Cloud hosting provider's services

  • Staging and deployment - commonly using the blue/green pattern via Apache's virtual hosts file

  • SEO

  • Sustainable content development - e.g. using content frameworks

but these are topics for future posts! I hope this was helpful for anyone looking to develop their site.

This content was originally posted on samgould.net
