<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shane Lee</title>
    <description>The latest articles on DEV Community by Shane Lee (@shanelee).</description>
    <link>https://dev.to/shanelee</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F302160%2F0a80d325-4106-442c-adf3-cb91e82b3660.jpeg</url>
      <title>DEV Community: Shane Lee</title>
      <link>https://dev.to/shanelee</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shanelee"/>
    <language>en</language>
    <item>
      <title>A Brief 10-Point Guide to GraphQL</title>
      <dc:creator>Shane Lee</dc:creator>
      <pubDate>Sat, 12 Sep 2020 14:49:50 +0000</pubDate>
      <link>https://dev.to/shanelee/a-brief-10-point-guide-to-graphql-o1a</link>
      <guid>https://dev.to/shanelee/a-brief-10-point-guide-to-graphql-o1a</guid>
<description>&lt;p&gt;When I’m learning about new technologies, one of the first things I’ll do is write a numbered list, 1–10, and then fill that list in with points about that particular technology. I realised that these lists might be useful to others, so I’m going to start posting them. This week I looked at GraphQL.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;GraphQL is best thought of as middleware between your clients and your servers. The idea behind GraphQL is to have a system which sits between your clients and your API and handles the requesting of data. It has a built-in query language which allows you to create requests, and GraphQL will handle the fetching and composing of the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;One of the primary reasons to use GraphQL is that it simplifies dealing with your backend APIs. A common pattern in applications is the need to request data from two or more distinct sources and compose it to be processed together. For example, you might have images stored in one database and text stored in another. GraphQL enables you to write a single query to get the data from both sources and return a single response, which greatly simplifies your code, as you don’t have to deal with the complexities of handling multiple requests and waiting for all of them before processing the data together.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;One of the oft-quoted benefits of GraphQL is that it decouples you from your existing API implementation. This means that with GraphQL you can easily swap out the implementation of your existing API for another one and everything will continue hunky dory. However, the seldom-mentioned caveat is that you are necessarily coupling yourself to GraphQL itself.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data savings are another benefit of using GraphQL. With GraphQL, you don’t need to make multiple calls to your API as you would with REST. You can query for only the data you want: GraphQL can aggregate data from multiple sources for you and it will only return the data that you need. The result of this is data savings for each API call, which leads to improved performance of your application overall.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GraphQL enables increased speed of development. Because GraphQL decouples your client from your API, different teams also become decoupled, meaning there is a reduced need for coordination between teams to get development done. For example, instead of the frontend team needing to coordinate with the backend team to update or add a new REST endpoint, they can simply develop their new frontend functionality using GraphQL to get the exact data they need.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Another benefit of using GraphQL is that there is reduced need for filtering data. Often when developing a new feature of an application, a developer will have to filter and aggregate existing data to generate the new data required for the new feature. All of this leads to additional code, which incurs additional costs both in terms of development time and maintenance time. With GraphQL this is unnecessary as GraphQL handles the aggregation and filtering of data for free.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;One of the drawbacks of GraphQL is that it requires a fair amount of boilerplate code. For example, you will often need to write a resolver, a query, a mutation, and a schema.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Strong typing is another benefit: GraphQL’s query language is strongly typed. This is a benefit because it provides a common contract for communication between the client of an application and the backend. This helps facilitate independent development of both the frontend and backend, because the strong typing of GraphQL provides a solid and predictable foundation on which to build future system states. In other words, developers can write new code knowing that GraphQL will be able to provide the data they require, and they will know the format of that data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Another disadvantage of GraphQL is that error handling is somewhat more difficult and cumbersome than it is with REST. If your GraphQL query errors, it doesn’t return a 5xx error code or even a 4xx error code; instead it always returns a 200 success code, with the error in the JSON response itself. This is weird. It’s something you should be aware of, because if you are considering migrating your system to GraphQL, this quirk will likely increase the time required to do so: you will have to change your error handling and any code that specifically checks for standard error codes. This will likely mean changing much of your test code as well.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A final consideration of GraphQL is its complexity. This isn’t really a disadvantage or an advantage: GraphQL isn’t particularly complex, but it is more complex than a standard REST API. If you know that your application is going to have a relatively simple and stable API over time, then you might be better off sticking with REST. If you are unsure about this, it would be better to use GraphQL from the start, as migrating from REST to GraphQL once an application is midway through development wouldn’t be the quickest process: you’d have to define all your schemas, mutations, queries, and resolvers. Plus you’d potentially have to write new tests and learn a whole new testing API.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
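The error-handling quirk in point 9 is easy to see with a small sketch. The response dictionaries below are hypothetical (no real API is involved): both would arrive with HTTP status 200, so the client has to inspect the JSON body itself.

```python
class GraphQLError(Exception):
    """Raised when a 200 response carries GraphQL errors in its body."""

def check_graphql_response(body):
    # Per the GraphQL spec, failures are reported under the 'errors' key
    # of the response map, not via the HTTP status code.
    if body.get("errors"):
        messages = [e.get("message", "unknown error") for e in body["errors"]]
        raise GraphQLError("; ".join(messages))
    return body["data"]

# Both of these would arrive with HTTP status 200:
ok = {"data": {"user": {"name": "Ada"}}}
failed = {"data": None, "errors": [{"message": "Cannot query field 'nam'"}]}

print(check_graphql_response(ok)["user"]["name"])  # Ada
```

Test code migrated from REST therefore needs to assert on the response body rather than on status codes.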

</description>
      <category>graphql</category>
    </item>
    <item>
      <title>A Brief 10-Point Guide to AWS Elastic Beanstalk</title>
      <dc:creator>Shane Lee</dc:creator>
      <pubDate>Sat, 05 Sep 2020 10:45:28 +0000</pubDate>
      <link>https://dev.to/shanelee/a-brief-10-point-guide-to-aws-elastic-beanstalk-gm4</link>
      <guid>https://dev.to/shanelee/a-brief-10-point-guide-to-aws-elastic-beanstalk-gm4</guid>
<description>&lt;p&gt;When I’m learning about new technologies, one of the first things I’ll do is write a numbered list, 1-10, and then fill that list in with points about that particular technology. I realised that these lists might be useful to others, so I’m going to start posting them. This week I looked at AWS Elastic Beanstalk.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Amazon's Elastic Beanstalk is a service which allows you to deploy applications on AWS using a number of different AWS services, such as S3 and ECR. With Elastic Beanstalk you don't have to worry about the management and orchestration of these services; you can simply focus on your application. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;One of the main benefits of using AWS Elastic Beanstalk is that it abstracts away the need to deal with infrastructure. You don't have to worry about configuring load-balancing or scaling your application. You can simply upload your code and go. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Another benefit of using AWS Elastic Beanstalk is that, assuming you were planning to host your application on AWS to begin with, there's no additional cost to using Elastic Beanstalk. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;While you have the ability to sit back and let Elastic Beanstalk handle everything for you, AWS does give you the option to configure and control the AWS resources you use for your application. And what's more, it's not an all-or-nothing deal. You could, for example, decide to manually manage your AWS ECR instances, but leave your S3 instances to be managed by Elastic Beanstalk. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Setting up monitoring and logging tools for your application is often a full-time job in and of itself. With Elastic Beanstalk, you don't have to bother, because Elastic Beanstalk comes with a monitoring interface which is integrated with Amazon CloudWatch and AWS X-Ray.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;One of the drawbacks of a system that abstracts away the need for management is that understanding when things have gone wrong with Elastic Beanstalk can be a difficult task, because it can be difficult to surface the underlying error in order to diagnose the problem. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An additional drawback of using Elastic Beanstalk is third-party integration. Some of the common culprits like Docker, Jenkins and GitHub are supported, but don't expect the range of third-party integrations to be extensive. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;One of the pros of AWS Elastic Beanstalk is that you can easily integrate code pipelines into it, which can enable you to check whether the code you've just uploaded is working correctly. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Another benefit of the auto-management of AWS Elastic Beanstalk is that unavoidable things like updating versions and operating systems can be done without any downtime to the application. Furthermore, should something go wrong with one of those updates, it is quite easy to roll back the application to an earlier state, again without any downtime (unless of course the update itself caused downtime).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A final disadvantage of using AWS Elastic Beanstalk is that if you require technical support from Amazon, there is a charge for that. While this is normally to be expected from modern SaaS and PaaS offerings, in this case it is something to consider carefully because of the challenges, mentioned above, of not being easily able to diagnose problems with the system.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>elasticbeanstalk</category>
      <category>devops</category>
      <category>deployment</category>
    </item>
    <item>
      <title>A Brief 10-Point Guide to AWS DynamoDB</title>
      <dc:creator>Shane Lee</dc:creator>
      <pubDate>Sat, 29 Aug 2020 12:02:40 +0000</pubDate>
      <link>https://dev.to/shanelee/a-brief-10-point-guide-to-aws-dyanmodb-43o8</link>
      <guid>https://dev.to/shanelee/a-brief-10-point-guide-to-aws-dyanmodb-43o8</guid>
      <description>&lt;ol&gt;
&lt;li&gt;Amazon's DynamoDB is a fully managed key-value and document database. &lt;/li&gt;
&lt;li&gt;As DynamoDB is fully managed, one of the benefits of using it is that there is no server to manage and, more importantly, DynamoDB auto-scales to the demand on the database; as a result, the performance of DynamoDB is consistent at scale. &lt;/li&gt;
&lt;li&gt;Performance is another reason to use DynamoDB. Amazon boasts 'single-digit millisecond response times at any scale'.&lt;/li&gt;
&lt;li&gt;Data safety is another benefit of using DynamoDB. Amazon enables instant backups of 'hundreds of terabytes of data' with no performance impact. Further, with DynamoDB it is possible to restore the database to any state that existed within the last 35 days. More impressive is that this can be done with no downtime. &lt;/li&gt;
&lt;li&gt;A seldom mentioned, but very important benefit of using DynamoDB is that Amazon provides the ability to &lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBLocal.DownloadingAndRunning.html"&gt;deploy instances of DynamoDB locally&lt;/a&gt;, which is very useful for testing your application. Being able to deploy a local instance of DynamoDB means that you can run your integration tests against that instance, which provides a more valid test context, which means you are more likely to catch bugs that would occur in production therefore enabling you to improve the quality of your code and speed up the development cycle.&lt;/li&gt;
&lt;li&gt;Another benefit of using DynamoDB is that it accommodates cases where some data is accessed more than other data while still providing autoscaling. To understand this benefit, it is important to understand that this wasn't always the case. It used to be that simply sticking data in autoscaling DynamoDB could lead to degraded performance or failing queries, because when autoscaling, DynamoDB would shard the data on the assumption that all keys would be accessed with roughly uniform frequency. In reality, not all data is created equal: some data is accessed far more often than other data, and a frequently accessed key is called a 'hot key'. However, DynamoDB now has a feature called adaptive capacity, which, &lt;a href="https://aws.amazon.com/blogs/database/how-amazon-dynamodb-adaptive-capacity-accommodates-uneven-data-access-patterns-or-why-what-you-know-about-dynamodb-might-be-outdated/"&gt;as of 2019&lt;/a&gt;, will instantly adapt to your access patterns. In other words, the sharding process which occurs during autoscaling distributes your data according to the demands on different keys. This is remarkable. What's more, it is done at no additional cost and does not have to be configured.&lt;/li&gt;
&lt;li&gt;Perhaps one of the drawbacks of using DynamoDB is that the costs of using it can balloon if you experience spikes in demand or if you haven't done a good job of predicting what the demand on your database will be. While Amazon does provide a &lt;a href="https://calculator.aws/"&gt;pricing calculator&lt;/a&gt; to help you estimate your costs, it is still dependent on your assumptions and estimates. This drawback is really a tradeoff for the autoscaling capacity that DynamoDB provides. One of the things you should therefore ask yourself is: are you expecting fluctuations in demand that would benefit from DynamoDB's autoscaling?&lt;/li&gt;
&lt;li&gt;Another drawback of using DynamoDB is that it has a limit on &lt;a href="https://aws.amazon.com/blogs/database/how-to-determine-if-amazon-dynamodb-is-appropriate-for-your-needs-and-then-plan-your-migration/"&gt;individual binary items of 400KB&lt;/a&gt;. This means that DynamoDB is not well-suited to storing images or large documents. &lt;/li&gt;
&lt;li&gt;In DynamoDB, &lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-query-scan.html"&gt;scanning a table is not an efficient operation&lt;/a&gt;, which can of course be a problem depending on the structure of your data and your use case. Furthermore, this may lead you to incur additional costs over what was initially planned, as scanning a whole table in DynamoDB is expensive. It may also lead you to want additional indexes, particularly global secondary indexes, which can be provisioned at an &lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html"&gt;additional cost, particularly because of the additional cost of writing to DynamoDB.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;A final drawback to consider is that write operations to DynamoDB are expensive. Writes to DynamoDB are priced via 'Write Capacity Units' per month. When you have a use case with a lot of writes, this cost can balloon.&lt;/li&gt;
&lt;/ol&gt;
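The 400KB item size limit from point 8 is worth guarding against before a write. As a minimal sketch (a rough approximation for illustration, not DynamoDB's official size-accounting formula), you can estimate an item's size from the UTF-8 byte lengths of its attribute names and values:

```python
# Rough pre-flight check against DynamoDB's 400KB item size limit.
# Approximation only: DynamoDB counts attribute names plus attribute values.

MAX_ITEM_BYTES = 400 * 1024

def approx_item_size(item):
    """Estimate the stored size of an item (a dict of attribute name to value)."""
    size = 0
    for name, value in item.items():
        size += len(name.encode("utf-8"))
        if isinstance(value, bytes):
            size += len(value)
        else:
            size += len(str(value).encode("utf-8"))
    return size

def fits_in_dynamodb(item):
    # True when the estimated size is within the 400KB limit.
    return MAX_ITEM_BYTES >= approx_item_size(item)

print(fits_in_dynamodb({"id": "doc-1", "body": "x" * 500_000}))  # False
```

Large blobs such as images are usually better placed in S3, with only a pointer stored in the DynamoDB item.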

</description>
      <category>aws</category>
      <category>dynamodb</category>
      <category>documentdatabases</category>
      <category>database</category>
    </item>
    <item>
      <title>A Brief 10-Point Guide to AWS Aurora</title>
      <dc:creator>Shane Lee</dc:creator>
      <pubDate>Sat, 22 Aug 2020 10:28:16 +0000</pubDate>
      <link>https://dev.to/shanelee/a-brief-10-point-guide-to-aws-aurora-4m25</link>
      <guid>https://dev.to/shanelee/a-brief-10-point-guide-to-aws-aurora-4m25</guid>
<description>&lt;p&gt;When I’m learning about new technologies, one of the first things I’ll do is write a numbered list, 1-10, and then fill that list in with points about that particular technology. I realised that these lists might be useful to others, so I’m going to start posting them. This week I looked at AWS Aurora.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Amazon Aurora is a cloud-based database solution (database-as-a-service, DBaaS) that is compatible with MySQL and PostgreSQL.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using Amazon Aurora has the benefit that users don't have to deal with the headaches involved in managing and maintaining physical hardware for their databases. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Another benefit of using Aurora over MySQL and PostgreSQL is performance. Aurora is around five times as fast as MySQL, and around three times as fast as PostgreSQL, when compared on the same hardware. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;With Aurora, concerns about database capacity are eradicated, because storage on Aurora will automatically scale all the way from the 10GB minimum to the 64TB maximum. This means the maximum table size (64TB) using Aurora is four times the maximum table size (16TB) of InnoDB. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A further benefit of Aurora over MySQL is replication: in Aurora, you can create up to 15 replicas. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;One of the most prominent drawbacks of AWS Aurora is that it is only compatible with MySQL 5.6 and 5.7. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A somewhat minor drawback is that the port number for connections to the database cannot be configured; it is locked at 3306.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Another benefit is that, as you would expect, Aurora has great integration with other AWS products. For example, you can invoke an AWS Lambda function from within an AWS Aurora database cluster. You can also load data from S3 buckets. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;With AWS Aurora, the minimum RDS instance size you can use is r3.large, which will impact the cost of using AWS Aurora. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS Aurora is also not available on the AWS free tier.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>devops</category>
      <category>database</category>
    </item>
    <item>
      <title>A Brief 10-Point Guide to AWS Lambda</title>
      <dc:creator>Shane Lee</dc:creator>
      <pubDate>Wed, 12 Aug 2020 14:09:35 +0000</pubDate>
      <link>https://dev.to/shanelee/a-brief-10-point-guide-to-aws-lambda-4mhe</link>
      <guid>https://dev.to/shanelee/a-brief-10-point-guide-to-aws-lambda-4mhe</guid>
<description>&lt;p&gt;When I'm learning about new technologies, one of the first things I'll do is write a numbered list, 1-10, and then fill that list in with points about that particular technology. I realised that these lists might be useful to others, so I'm going to start posting them. This week I looked at AWS Lambda. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AWS Lambda lets you run an application without provisioning or managing servers. It is an event-driven computing platform: your code is executed in Lambda when a configured event triggers it.&lt;/li&gt;
&lt;li&gt;AWS Lambda uses a Function as a Service (FaaS) model of cloud computing.&lt;/li&gt;
&lt;li&gt;With AWS Lambda you don't need an application that is running 24/7; as a consequence, you only pay for the time your functions are executing, which can lead to a significant cost reduction compared with traditional server-based architectures.&lt;/li&gt;
&lt;li&gt;Due to the nature of AWS Lambda's FaaS model of cloud computing, development times for applications can be greatly reduced, because the problems of managing an application stack within a server-based context are eliminated.&lt;/li&gt;
&lt;li&gt;Planning and management of resources are efforts which are nearly completely removed with AWS Lambda because of its auto-scaling: when more computing power is needed it will scale up resources seamlessly, and conversely, when fewer resources are required it will scale down seamlessly too.&lt;/li&gt;
&lt;li&gt;A greater proportion of developer time is available for working on the problems and challenges of the business logic with AWS Lambda.&lt;/li&gt;
&lt;li&gt;One of the drawbacks of using AWS Lambda is that it is not necessarily faster than traditional architectures. This is because when a new instance of a function is invoked, it needs to start up the process with the code to be executed (a 'cold start'). This start-up time is not present in traditional server-based architectures, where the process or processes are running all the time.&lt;/li&gt;
&lt;li&gt;A further drawback of using AWS Lambda is the issue of concurrent executions of functions. The default &lt;em&gt;account&lt;/em&gt; limit for concurrent execution of functions is 1000, although depending on region this can go up to 3000, but this has to be specifically requested. Few applications will have to worry about this problem, however. One additional thing of note about concurrent executions is that AWS Lambda does allow you to specifically limit the concurrent executions of individual functions.&lt;/li&gt;
&lt;li&gt;The current maximum execution time for an AWS Lambda function is 15 minutes, which can be a problem if the function is running a task which will take longer, although this would be a sign that the function should be decomposed where possible.&lt;/li&gt;
&lt;li&gt;A further drawback of AWS Lambda is that individual functions are somewhat constrained in their access to computational resources. For example, the current RAM limit for an AWS Lambda function is 3GB.&lt;/li&gt;
&lt;/ol&gt;
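The FaaS model in point 2 is easiest to see in code. Here is a minimal sketch of a Python Lambda handler (the function name, event shape, and message are illustrative assumptions, not from any specific application); Lambda calls a function like this once per triggering event:

```python
import json

def handler(event, context):
    # Lambda invokes this once per triggering event. 'event' carries the
    # trigger payload; 'context' carries runtime metadata (remaining time,
    # memory limit) and can be None when invoking locally in a test.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "Hello, " + name}),
    }

# Invoking locally, the way a unit test would:
result = handler({"name": "Lambda"}, None)
print(result["statusCode"])  # 200
```

Deployed behind a trigger such as API Gateway, this same function only runs, and is only billed, while handling an event.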

</description>
      <category>aws</category>
      <category>lambda</category>
      <category>serverless</category>
      <category>devops</category>
    </item>
    <item>
      <title>Keepgrabbing.py Analysed Line-by-Line - Aaron Swartz JSTOR Script</title>
      <dc:creator>Shane Lee</dc:creator>
      <pubDate>Tue, 11 Aug 2020 18:05:06 +0000</pubDate>
      <link>https://dev.to/shanelee/keepgrabbing-py-analysed-line-by-line-aaron-swartz-jstor-script-fk4</link>
      <guid>https://dev.to/shanelee/keepgrabbing-py-analysed-line-by-line-aaron-swartz-jstor-script-fk4</guid>
<description>&lt;p&gt;Aaron Swartz was a programmer and political activist who infamously downloaded an estimated 4.8 million articles from the JSTOR database of academic articles. This led to his prosecution by the United States. &lt;/p&gt;

&lt;p&gt;In the documentary "The Internet's Own Boy" about Aaron Swartz, there is a script that is referenced called KeepGrabbing.py. This script is what Aaron used to bulk download PDFs from the JSTOR database from within the MIT network. &lt;/p&gt;

&lt;p&gt;When I first saw the documentary, I was fascinated by the idea that a simple computer program could cause such a furore. Naturally, I wanted to know what was in the program. &lt;br&gt;
It turns out this program is just a simple Python 2 script that is only 21 lines long (17 if you don't include blank lines!)&lt;/p&gt;

&lt;p&gt;Here is the excerpt from the documentary in which the script is briefly discussed. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=j0DLmgmh2N8"&gt;https://www.youtube.com/watch?v=j0DLmgmh2N8&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I managed to find the script online. But I didn't immediately understand it. I got the general gist, but there were a few things I had not seen before, like the one-line class which Aaron defines, which is just an exception that appears to do nothing. There's also a URL which has been redacted by the courts, which further makes this program difficult to understand. &lt;/p&gt;

&lt;p&gt;So let's look through the program and try and understand it line-by-line.&lt;/p&gt;

&lt;p&gt;Here is the file keepgrabbing.py:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import subprocess, urllib, random
class NoBlocks(Exception): pass
def getblocks():
    r = urllib.urlopen("http://{?REDACTED?}/grab").read()
    if '&amp;lt;html' in r.lower(): raise NoBlocks
    return r.split()

import sys
if len(sys.argv) &amp;gt; 1:
    prefix = ['--socks5', sys.argv[1]]
else:
    prefix = []#'-interface','eth0:1']
line = lambda x: ['curl'] + prefix + ['-H', "Cookie: TENACIOUS=" + str(random.random())[3:], '-o', 'pdfs/' + str(x) + '.pdf', "http://www.jstor.org/stable/pdfplus/" + str(x) + ".pdf?acceptTC=true"]

while 1:
    blocks = getblocks()
    for block in blocks:
        print block
        subprocess.Popen(line(block)).wait()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is written in Python 2.&lt;br&gt;
Because we don't have the URL, which is redacted, some of the notes here are my best interpretation.&lt;/p&gt;

&lt;p&gt;The imports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;subprocess - allows you to create new processes from the Python script;
  subprocesses are child processes created from another process.&lt;/li&gt;
&lt;li&gt;urllib - allows you to retrieve resources from a URL.&lt;/li&gt;
&lt;li&gt;random - generates pseudo-random numbers (i.e. not truly random numbers)&lt;/li&gt;
&lt;li&gt;sys - imported later. Enables access to variables and functions used or
 maintained by the Python interpreter, such as the command-line arguments in sys.argv.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;import subprocess, urllib, random&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;After a little bit of investigation, I realised the one-line class NoBlocks that Aaron had included in this script was a way of terminating the script (more on this later).&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;class NoBlocks(Exception): pass&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;The next bit of code is the function getblocks().&lt;br&gt;
This function reads from a URL which has been redacted by the courts.&lt;/p&gt;

&lt;p&gt; It looks like this function called a custom URL, likely set up by Swartz himself, which served a list of URLs linking to PDFs to download. It is likely that Swartz would control and modify this grab page from outside the MIT campus in order to direct the script to download the PDFs he wanted. The first line in this function calls urllib.urlopen().read() and saves the response to a variable called 'r'. The urlopen function reads from a URL and returns a file-like object for the contents of the external resource that the URL points to. The read function that is called on this simply reads the byte contents and returns them.&lt;/p&gt;

&lt;p&gt;The second line of this function checks to see if there is HTML in the retrieved page. If there is, it raises a NoBlocks exception and exits the script. It is likely that the redacted URL simply pointed to a text file listing the PDFs Swartz wanted to download. When he wanted to stop the script, he could simply swap this text file for an HTML file and the script would exit.&lt;/p&gt;

&lt;p&gt;The split function simply takes a string and splits it into a list; by default it splits the string at whitespace, which is what Aaron is relying on here.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;def getblocks():&lt;br&gt;
    r = urllib.urlopen("http://{?REDACTED?}/grab").read()&lt;br&gt;
    if '&amp;lt;html' in r.lower(): raise NoBlocks&lt;br&gt;
    return r.split()&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;The next five lines of code are concerned with taking the arguments to the script from sys.argv and, if one is present, adding it to a variable as a list with the string '--socks5' as the first string in the list.&lt;/p&gt;

&lt;p&gt;Note that it takes the second element in the sys.argv list, as the first element in the sys.argv list is the name of the script.&lt;/p&gt;

&lt;p&gt;This prefix variable will be used in a lambda expression below.&lt;br&gt;
Basically, this prefix is used to make the script connect to JSTOR either via a proxy or just through the computer's internet connection (which Aaron left a comment about, suggesting this was an ethernet connection, which makes sense as the computer that Aaron used to run this script was in a store cupboard connected to the MIT network).&lt;/p&gt;

&lt;p&gt;This line may mean that Aaron could run this script from outside the MIT network, but that is just speculation.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;import sys&lt;br&gt;
if len(sys.argv) &amp;gt; 1:&lt;br&gt;
    prefix = ['--socks5', sys.argv[1]]&lt;br&gt;
else:&lt;br&gt;
    prefix = []#'-interface','eth0:1']&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;The next line declares a lambda function which is saved to the variable called line.&lt;/p&gt;

&lt;p&gt;This lambda expression takes a single argument, which is the name of the PDF that the script is going to download.&lt;/p&gt;

&lt;p&gt;This lambda expression will later be used as part of a subprocess call. It defines a curl request. The curl command allows you&lt;br&gt;
to transfer data to or from a URL, i.e. upload to or download from a URL. The curl request is made via a proxy, or not, depending on the conditional above as mentioned. Next, it defines a cookie, which is simply the string TENACIOUS= followed by a string of random digits (str(random.random())[3:] strips the leading '0.' and the first decimal digit). This cookie will make the server responding to this curl request think that it is coming from a real user as opposed to a script. The next thing this function does is define the output of this curl request: a PDF file, named after the block, in a directory called pdfs. The rest of this lambda creates the URL of the PDF to download with curl.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;line = lambda x: ['curl'] + prefix + ['-H', "Cookie: TENACIOUS=" + str(random.random())[3:], '-o', 'pdfs/' + str(x) + '.pdf', "http://www.jstor.org/stable/pdfplus/" + str(x) + ".pdf?acceptTC=true"]&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
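For readers on Python 3, the same argv construction can be sketched as an ordinary function (a behaviour-equivalent rendering of the lambda; the function name is mine, not Aaron's):

```python
import random

def build_curl_command(block_id, prefix=()):
    # Build the argv list handed to curl, mirroring Aaron's 'line' lambda.
    # 'prefix' optionally carries the ['--socks5', host] proxy arguments.
    cookie = "Cookie: TENACIOUS=" + str(random.random())[3:]
    url = "http://www.jstor.org/stable/pdfplus/" + str(block_id) + ".pdf?acceptTC=true"
    output = "pdfs/" + str(block_id) + ".pdf"
    return ["curl"] + list(prefix) + ["-H", cookie, "-o", output, url]

cmd = build_curl_command("12345")
print(cmd[0], cmd[-1])  # curl http://www.jstor.org/stable/pdfplus/12345.pdf?acceptTC=true
```

Passing the list to subprocess, as the script does, avoids shell quoting entirely: each element becomes one argument to curl.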

&lt;p&gt;This next section of code defines an infinite loop, which is the part of the code that composes everything else.&lt;/p&gt;

&lt;p&gt;First it calls the getblocks function from earlier and saves the resulting list of PDFs to a variable called blocks.&lt;br&gt;
It then iterates over these, printing each to the console and then calling the line lambda from earlier in a subprocess.Popen call. subprocess.Popen will create a new process, in this case the curl request that will download the current PDF. The script then blocks until this subprocess finishes, i.e. it waits until the PDF has finished downloading before moving on to the next one.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;while 1:&lt;br&gt;
    blocks = getblocks()&lt;br&gt;
    for block in blocks:&lt;br&gt;
        print block&lt;br&gt;
        subprocess.Popen(line(block)).wait()&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;And that's it!&lt;/p&gt;

&lt;p&gt;I have also made a video where I go through this code, if that sort of thing floats your goat:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=0sJ7CVos5bQ"&gt;https://www.youtube.com/watch?v=0sJ7CVos5bQ&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>webscraping</category>
      <category>aaronswartz</category>
      <category>mit</category>
    </item>
    <item>
      <title>How to Plot a Correlation with Python | Python for Statistics</title>
      <dc:creator>Shane Lee</dc:creator>
      <pubDate>Fri, 08 May 2020 07:52:58 +0000</pubDate>
      <link>https://dev.to/shanelee/how-to-plot-a-correlation-with-python-python-for-statistics-5ef</link>
      <guid>https://dev.to/shanelee/how-to-plot-a-correlation-with-python-python-for-statistics-5ef</guid>
      <description>&lt;p&gt;Plotting correlations with Python is a relatively straight-forward affair. For this example, I have provided a basic correlation dataset which is in a CSV file. If you have your own dataset, you can obviously use that, although if you have it in a different format, you will likely have to import it into your Python code differently.&lt;/p&gt;

&lt;p&gt;In order to follow along with this, first open your terminal and install the following Python modules, if you haven't already.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install numpy&lt;br&gt;
pip install pandas&lt;br&gt;
pip install matplotlib&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Then create and open a new .py file and add those modules as imports like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The ‘as …’ allows us to alias the module to a more succinct series of characters and allows for more idiomatic Python code. For example, it would be an absolute ballache to type out matplotlib.pyplot every time we wanted to access a function from that module, so instead we alias it to ‘plt’ and then we can simply call plt.whatever whenever we want to use a function from that module.&lt;/p&gt;

&lt;p&gt;Next we need to get the data into the programme. For this I’m going to assume you have the data saved in the same directory as your .py file. My example dataset, which can be downloaded from the link below, is called ‘memes.csv’. In this dataset we have two columns which we want to correlate: ‘Memes’ and ‘Dankness’. Note that this is case sensitive. Ideally you wouldn’t mix cases in your column names, but I have because I’m a buffoon and it’s too late to change it now.&lt;/p&gt;

&lt;p&gt;We can use the read_csv function from the pandas module to import the dataset. This will take the CSV and turn it into a lovely pandas DataFrame, which makes it nice and easy to manipulate the data. In order to access the individual columns, we can simply pass the column names as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = pd.read_csv('memes.csv')
x = data['Memes']
y = data['Dankness']
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
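&lt;p&gt;If you don't have the CSV to hand, the same column-access step can be tried with an in-memory stand-in (made-up numbers):&lt;br&gt;
&lt;/p&gt;

```python
import pandas as pd

# Stand-in for memes.csv: the same two columns built in memory, so the
# column-access step can be seen without the file. read_csv would give
# a DataFrame of the same shape.
data = pd.DataFrame({'Memes': [1, 2, 3], 'Dankness': [2, 4, 6]})
x = data['Memes']      # each column comes back as a pandas Series
y = data['Dankness']
```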



&lt;p&gt;Now we have two variables, x and y, which we can correlate.&lt;/p&gt;

&lt;p&gt;To do this, we can simply call the plt.scatter function, passing in our data. If we add the plt.show() function and run the programme we will see this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jX5LYd7a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/vsmzxgyk5kinudvtfl3n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jX5LYd7a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/vsmzxgyk5kinudvtfl3n.png" alt="Python generated correlation with Matplotlib and pandas"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Python generated correlation with Matplotlib and pandas&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.scatter(x, y) 
plt.show()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;But we’re not finished there. As any mathematics teacher will say, we need to add titles to the plot and axes, and we need to add a line of best fit.&lt;/p&gt;

&lt;p&gt;Adding the titles is the simplest part thanks to matplotlib, so let’s start with that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.title('A plot to show the correlation between memes and dankness')
plt.xlabel('Memes')
plt.ylabel('Dankness')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In order to add the line of best fit we need to do the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)), color='yellow')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
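&lt;p&gt;To see what that one-liner is doing, here it is unpacked with made-up, perfectly linear data:&lt;br&gt;
&lt;/p&gt;

```python
import numpy as np

# polyfit(x, y, 1) returns the slope and intercept of the least-squares
# line; poly1d wraps them in a callable that evaluates the line. With
# perfectly linear data the fit is exact.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
slope, intercept = np.polyfit(x, y, 1)
line = np.poly1d(np.polyfit(x, y, 1))  # line(x) gives the fitted y values
```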



&lt;p&gt;Finally, if we wanted to print the correlation coefficient, we could use the numpy function corrcoef like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print(np.corrcoef(x, y))```



Here is the full code from this tutorial:



```import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('memes.csv')
x = data['Memes']
y = data['Dankness']

print(np.corrcoef(x, y))

plt.scatter(x, y) 
plt.title('A plot to show the correlation between memes and dankness')
plt.xlabel('Memes')
plt.ylabel('Dankness')
plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)), color='yellow')
plt.show()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
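&lt;p&gt;One thing worth knowing: corrcoef returns a 2x2 matrix rather than a single number. A small illustration with made-up, perfectly correlated data:&lt;br&gt;
&lt;/p&gt;

```python
import numpy as np

# The diagonal of the corrcoef matrix is each series' correlation with
# itself (always 1); the off-diagonal entry is the coefficient r
# between x and y.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])  # y is exactly 2x, so r should be 1
matrix = np.corrcoef(x, y)
r = matrix[0, 1]
```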



&lt;p&gt;This post was originally published &lt;a href="https://codeyogi.co.uk/2020/03/17/how-to-plot-a-correlation-with-python-python-for-statistics/"&gt;here&lt;/a&gt;, where you can also download the example dataset we used for this tutorial.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How to Visualise Multiple Stocks with Matplotlib</title>
      <dc:creator>Shane Lee</dc:creator>
      <pubDate>Fri, 31 Jan 2020 19:07:41 +0000</pubDate>
      <link>https://dev.to/shanelee/how-to-visualise-multiple-stocks-with-matplotlib-4050</link>
      <guid>https://dev.to/shanelee/how-to-visualise-multiple-stocks-with-matplotlib-4050</guid>
      <description>&lt;p&gt;In this quick tutorial, we are going to use python to get data about a collection of stocks, and then plot then on a single graph. &lt;/p&gt;

&lt;p&gt;The very first thing we need to do is find the tickers for the stocks that we want to plot. &lt;/p&gt;

&lt;p&gt;So go to Yahoo Finance and search for the companies you are interested in. &lt;/p&gt;

&lt;p&gt;When you get to the company's listing you will see the ticker in parentheses next to the company's name. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F0n2wpd168ppa2dvgfhmp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F0n2wpd168ppa2dvgfhmp.png" alt="An example of United Utilities from the Yahoo Finance page."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An example of United Utilities from the Yahoo Finance page.&lt;/p&gt;

&lt;p&gt;Once you have the list of companies you want to plot, it's time to start coding. &lt;/p&gt;

&lt;p&gt;First we need a few imports:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd
from pandas_datareader import data as wb
import matplotlib.pyplot as plt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you don't have any of these installed, go to your terminal and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pandas
pip install numpy
pip install pandas-datareader
pip install matplotlib
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's create a list of stocks, each a dictionary with a ticker and a name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stocks = [
    {
        'ticker': 'UU.L',
        'name': 'United Utilities'
    },
    {
        'ticker': 'VOD.L',
        'name': 'Vodafone Group'
    },
    {
        'ticker': 'BP.L',
        'name': 'BP Group'
    }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next let's create a function to take this data and plot it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def create_plot(stocks):
    # Create an empty dataframe
    data = pd.DataFrame()
    for stock in stocks: 
        # Create a column for the adjusted close of each stock
        # Here we use the DataReader library to get the data.
        data[stock['ticker']] = wb.DataReader(stock['ticker'], data_source='yahoo', start='2007-1-1')['Adj Close']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
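&lt;p&gt;If you can't reach Yahoo (the DataReader call needs a network connection and a working data source), the same column-filling step can be sketched offline with made-up prices:&lt;br&gt;
&lt;/p&gt;

```python
import pandas as pd

# Made-up prices standing in for the DataReader download: each ticker's
# series becomes one column of the frame, aligned on the date index.
dates = pd.to_datetime(['2007-01-01', '2007-01-02'])
data = pd.DataFrame()
data['UU.L'] = pd.Series([700.0, 705.5], index=dates)
data['VOD.L'] = pd.Series([140.0, 139.2], index=dates)
```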



&lt;p&gt;Here's what the contents of the data frame currently looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F8vw558nhnvi94fyj4wt3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F8vw558nhnvi94fyj4wt3.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see that the adjusted close prices for each stock are, at present, in their own columns. &lt;/p&gt;

&lt;p&gt;The problem with the current state of the data is that we can't easily plot it: if we did, all of the stocks would be displayed on different scales. &lt;/p&gt;

&lt;p&gt;So what we need to do is rescale this data so that each price is expressed as a percentage of the first price in the series. &lt;/p&gt;

&lt;p&gt;To do this we can use a lovely bit of pythonic code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Calculate the returns for all the days
returns = data.apply(lambda x: (x / x[0] * 100))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we are using a lambda expression to create a new data frame which contains, for each stock, each day's price as a percentage of the first day's.&lt;/p&gt;
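&lt;p&gt;Here's the rebasing step in isolation with made-up prices (using .iloc[0] to be explicit about positional access):&lt;br&gt;
&lt;/p&gt;

```python
import pandas as pd

# Dividing each column by its first value and multiplying by 100
# starts every stock at 100, so they can share one axis.
data = pd.DataFrame({'UU.L': [100.0, 110.0, 121.0],
                     'VOD.L': [50.0, 45.0, 55.0]})
returns = data.apply(lambda x: (x / x.iloc[0] * 100))
```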

&lt;p&gt;That's all we have to do to get and adjust the data. &lt;/p&gt;

&lt;p&gt;Now all we have to do is plot the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.figure(figsize=(10,6))

# Plot the returns
for stock in stocks: 
    plt.plot(returns[stock['ticker']], label=stock['name'])

# We need to call .legend() to show the legend.
plt.legend()
# Give the axes labels
plt.ylabel('Cumulative Returns %')
plt.xlabel('Time')
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And we're done. &lt;/p&gt;

&lt;p&gt;You can find the full code here: &lt;a href="https://codeyogi.co.uk/2019/12/30/visualising-multiple-stocks-with-matplotlib/" rel="noopener noreferrer"&gt;Full Code&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I've also made a video tutorial to go alongside this tutorial.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://www.youtube.com/watch?v=vKWf5skmGAI" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fimg.youtube.com%2Fvi%2FvKWf5skmGAI%2F0.jpg" alt="Python For Finance | Visualising Multiple Stocks with Matplotlib"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have any questions or have any feedback (definitely welcome!), then please leave a comment below, or you can find me at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://youtube.com/shaneLeeCoding" rel="noopener noreferrer"&gt;YouTube&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ShaneLee" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://codeyogi.co.uk/" rel="noopener noreferrer"&gt;CodeYogi&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Article originally posted &lt;a href="https://codeyogi.co.uk/2019/12/30/visualising-multiple-stocks-with-matplotlib/" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>matplotlib</category>
      <category>finance</category>
      <category>pandas</category>
    </item>
  </channel>
</rss>
