The problem
In some cases (some "shadier" than others) you may need multiple "clean" IP addresses.
One obvious use case is a web scraper.
When you make many GET requests to the same host, you may be served a captcha to prove you're human, or your IP may be rate limited or banned.
Ideally you want something that:
- is simple to use
- doesn't need additional software
- can be used with traditional CLI commands like cURL
- is cheap
You can deal with this problem in several ways: use a VPN, use a Tor proxy (please, please, please don't do this) or use some kind of throwaway IP.
You could use an AWS EC2 instance as your proxy; when its IP is banned, you take a snapshot of the instance and create a new one from it.
...or you can be really creative and use AWS Lambda Functions...
The pattern
NOTE: I have never seen this pattern applied to Lambda functions, and I checked several times. If you know someone who has already done this, please let me know in the comments!
The pattern is nothing new: you have a manager that is always running and one or more workers doing the heavy lifting.
To keep going with the scraper example, the "heavy lifting" here is making the GET requests.
Why can't we use a single Lambda?
Lambda functions are by definition "serverless": they aren't tied to the underlying hardware, and you can't make assumptions about it.
On AWS, Lambda functions have an "execution context": on the first run (cold start) AWS creates an execution context for the function.
This context is kept alive for an unspecified period of time (usually between 5 and 7 minutes). During this time your Lambda function responds faster (warm start) but keeps the same IP address.
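To make the cold/warm distinction concrete, here is a minimal sketch of a handler that keeps state at module scope. The `contextId` and `invocations` names are made up for illustration; the point is only that module-level values survive for as long as the same execution context is reused, which is also the window in which the outbound IP stays the same.

```javascript
// Module scope runs once per execution context (cold start). The values
// below then survive across warm invocations of that same context.
const contextId = Math.random().toString(36).slice(2, 10); // hypothetical ID, illustration only
let invocations = 0;

// In a real Lambda this function would be exported as the handler.
const handler = async () => {
  invocations += 1;
  return {
    contextId,                    // unchanged while the context is reused
    coldStart: invocations === 1, // true only for the first call in this context
    invocations,
  };
};
```

Calling such a function twice in quick succession would normally report the same `contextId` with `coldStart` set to `false` on the second call; a freshly created function always starts with a new context.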
So we have a manager (let's call it "lambda-inception").
The manager is a Lambda function exposed through a Function URL (a dedicated URL you can call).
We call the manager URL, passing a payload that contains the URL to scrape.
The manager creates a new worker Lambda function (without a URL, since we never need to reach it directly), passes it the URL to scrape, awaits the response, destroys the worker and returns the response to the client.
We destroy the worker after each call because we want a clean IP every time.
So, without further ado, this is the configuration on AWS:
Roles
We need 2 roles (LambdaInceptionWorker and LambdaInceptionManager) with some custom policies (replace AWS_REGION and AWS_ACCOUNT_ID with your account information):
LambdaInceptionWorker
(no permission policies) We leave the worker role without any policies (not even AWSLambdaBasicExecutionRole), so it doesn't log anything to CloudWatch.
LambdaInceptionManager
LambdaInceptionPassRole (custom created):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::AWS_ACCOUNT_ID:role/LambdaInceptionWorker"
        }
    ]
}
LambdaInceptionCreateFunction (custom created):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "lambda:CreateFunction",
            "Resource": "arn:aws:lambda:AWS_REGION:AWS_ACCOUNT_ID:function:*"
        }
    ]
}
LambdaInceptionDeleteFunction (custom created):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "lambda:DeleteFunction",
            "Resource": "arn:aws:lambda:AWS_REGION:AWS_ACCOUNT_ID:function:*"
        }
    ]
}
LambdaInvokeFunction (custom created):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "lambda:InvokeFunction",
            "Resource": "arn:aws:lambda:AWS_REGION:AWS_ACCOUNT_ID:function:*"
        }
    ]
}
AWSLambdaBasicExecutionRole (AWS managed policy)
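If you prefer not to edit the placeholders by hand, a small script can fill them in. This is only a sketch: `fillPolicy` is a hypothetical helper (it is not part of the repository), and the region and account ID below are example values.

```javascript
// Hypothetical helper: replaces the AWS_REGION and AWS_ACCOUNT_ID
// placeholders in a policy document with real values.
function fillPolicy(policy, { region, accountId }) {
  const filled = JSON.stringify(policy)
    .replaceAll('AWS_REGION', region)
    .replaceAll('AWS_ACCOUNT_ID', accountId);
  return JSON.parse(filled);
}

// Same shape as the LambdaInvokeFunction policy above.
const invokePolicyTemplate = {
  Version: '2012-10-17',
  Statement: [
    {
      Sid: 'VisualEditor0',
      Effect: 'Allow',
      Action: 'lambda:InvokeFunction',
      Resource: 'arn:aws:lambda:AWS_REGION:AWS_ACCOUNT_ID:function:*',
    },
  ],
};

// Example values only; use your own region and 12-digit account ID.
const policy = fillPolicy(invokePolicyTemplate, {
  region: 'eu-west-1',
  accountId: '123456789012',
});
console.log(JSON.stringify(policy, null, 2));
```

The printed JSON can then be pasted into the IAM policy editor.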
Users
If you want to keep your functions protected (why wouldn't you?) you need to create an IAM user (call it lambda-inception-invoker) with the LambdaInceptionInvoke custom policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "lambda:InvokeFunctionUrl",
            "Resource": "arn:aws:lambda:AWS_REGION:AWS_ACCOUNT_ID:function:lambda-inception"
        }
    ]
}
We need programmatic access for this user, so after creating the user go to the "Security credentials" tab and create an access key.
You may also need a lambda-deployer user with the AWSLambda_FullAccess permission policy if you want to use the script provided in the GitHub repository to deploy the infrastructure to your account.
The code
The code is available here: https://github.com/gfabrizi/lambda-inception
The code uses a mono-lambda approach to manage the routing.
src/app.mjs and src/res.mjs are responsible for the basic routing and for creating an envelope for the response.
The code for the manager lives in index.js:
// Excerpt from the request handler: create a worker, invoke it, delete it.
const worker = await createWorker();
const lambdaClient = new LambdaClient({region: awsRegion, apiVersion: '2015-03-31'});
const invokeCommand = new InvokeCommand({
    FunctionName: worker.FunctionName,
    Payload: JSON.stringify({
        'url': req.body.url,
        'accept': req.body.accept ?? 'text/html,application/xhtml+xml,application/xml;q=0.9',
        'contentType': req.body.contentType ?? 'application/octet-stream',
        'method': req.body.method ?? 'GET',
    }),
    LogType: LogType.Tail,
});
// Payload holds the worker's return value as bytes
const { Payload, LogResult } = await lambdaClient.send(invokeCommand);
const result = Buffer.from(Payload).toString();
// Delete the worker so the next request gets a fresh one
const deleteCommand = new DeleteFunctionCommand({
    FunctionName: worker.FunctionName
});
await lambdaClient.send(deleteCommand);
res.json(result, 200);
Here we create a new worker and invoke it (passing the URL, the method and a few headers).
Then we await the worker's response, destroy the worker and return the scraped HTML to the client.
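One thing to keep in mind with the create/invoke/destroy sequence: if the invocation throws, the delete step is never reached and the worker stays in your account. A possible way to guard against that is a `try`/`finally` wrapper. The sketch below is not the repository's code; `withWorker` and the three injected functions are hypothetical names, passed in as parameters so the flow can be followed without AWS access.

```javascript
// Hypothetical wrapper around the create -> invoke -> delete sequence.
// createWorker, invokeWorker and deleteWorker are injected by the caller.
async function withWorker({ createWorker, invokeWorker, deleteWorker }, payload) {
  const worker = await createWorker();
  try {
    // Forward the payload to the freshly created worker.
    return await invokeWorker(worker.FunctionName, payload);
  } finally {
    // Runs on success and on failure, so no worker is left behind.
    await deleteWorker(worker.FunctionName);
  }
}
```

In the manager, the injected functions would wrap the `createWorker()` call and the `InvokeCommand` and `DeleteFunctionCommand` calls shown above.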
src/create-worker.mjs wraps the functionality to create a new worker:
const lambda = new LambdaClient({region: awsRegion, apiVersion: '2015-03-31'});
const functionName = 'worker-' + randomUUID();
const functionCommand = new CreateFunctionCommand({
    Code: {
        ZipFile: fs.readFileSync('worker.zip')
    },
    Architectures: [Architecture.x86_64],
    FunctionName: functionName,
    Handler: 'worker.handler',
    PackageType: PackageType.Zip,
    Role: lambdaInceptionWorkerRole,
    Runtime: 'nodejs18.x',
    Description: 'Crawler Worker ' + functionName
});
return lambda.send(functionCommand);
Nothing too fancy here: we specify an archive (worker.zip) that contains the code for the worker.
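A newly created Lambda function reports a `State` of `Pending` before it becomes `Active`. If you ever see the first invocation rejected because the worker is not ready yet, one option is to poll the state before invoking. The helper below is a sketch under assumptions: `waitUntilActive` is a hypothetical name, and `getState` is injected by the caller (with the AWS SDK it could wrap `GetFunctionConfigurationCommand` and return the `State` field, which would also require granting the manager role the `lambda:GetFunctionConfiguration` action).

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: polls getState() until it returns 'Active'.
// Resolves with the number of polls it took, or throws on failure/timeout.
async function waitUntilActive(getState, { attempts = 20, delayMs = 250 } = {}) {
  for (let i = 0; i < attempts; i += 1) {
    const state = await getState();
    if (state === 'Active') return i + 1;
    if (state === 'Failed') throw new Error('worker creation failed');
    await sleep(delayMs); // still Pending: wait and try again
  }
  throw new Error('worker did not become active in time');
}
```

The default of 20 attempts at 250 ms is an arbitrary choice for illustration; tune it to what you observe in your account.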
The worker code (worker/worker.js) is just a call to got.js:
let response = await got(event.url, {
    headers: {
        'user-agent': getUserAgent(),
        'Content-Type': event.contentType,
        'Accept': event.accept,
    },
    method: event.method
});
return {
    statusCode: 200,
    body: JSON.stringify(response.body),
};
We use a simple user-agent rotation, just in case...
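For readers wondering what such a rotation can look like, here is a minimal sketch. It is not the repository's implementation: the `USER_AGENTS` list is a made-up example pool, and this `getUserAgent` simply picks one entry at random on every call.

```javascript
// Hypothetical pool; these strings are examples, not the list used in the repository.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
  'Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0',
];

// Returns a random entry from the pool. The random source is a parameter
// only so the function can be exercised deterministically.
function getUserAgent(random = Math.random) {
  const index = Math.floor(random() * USER_AGENTS.length);
  return USER_AGENTS[index];
}
```

Since every request already runs in a brand-new worker, picking a fresh value per call is enough; there is no state to keep between requests.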
Make a request
So if everything is correctly configured (refer also to the README.md file in the repository) you can run ./deploy.sh to deploy the manager as a Lambda function.
Then you can call your function with a simple cURL:
curl -X POST https://XXXXXXXXXXXXXXXXXXXXXX.lambda-url.AWS_REGION.on.aws/crawl \
    -H 'Content-Type: application/json' \
    -d '{"url":"https://ifconfig.me/", "accept":"application/json", "method":"GET"}' \
    --user XXXXXXXXXXXXXXXXXXXX:XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX \
    --aws-sigv4 "aws:amz:AWS_REGION:lambda"
(replace all the Xs with your function URL and, for --user, the access key ID and secret access key of the invoker user. Also replace AWS_REGION with the region you have used)
Links
The code is available here: https://github.com/gfabrizi/lambda-inception