I recently made some changes to my website that resulted in an outage. The outage went unnoticed for over 12 hours, and was resolved shortly after I spotted it.
As with many things DevOps, it was a small, preventable error which I hope I will not make again.
What I have learnt, though, is that I need an active monitoring and notification system set up to prevent this in the future.
Today I’ll be building a heartbeat monitor for my website in CDK using a CloudWatch Synthetics canary.
CloudWatch Synthetics canaries are lightweight scripts that run on a schedule to simulate user interactions with your application. Canaries can be used to monitor the performance and availability of web applications, APIs, and other services. Canaries are easy to set up, and you can use them to test critical paths in your application, such as login, registration, and checkout.
Under the hood, each canary runs as a Lambda function on a schedule you choose, and you specify the frequency, duration, and endpoints to test. The scripts drive a headless Chromium browser through Puppeteer (Node.js) or Selenium (Python), so a canary loads your site much as a real browser would.
When a canary runs, it reports metrics such as success percentage, run duration, and HTTP status code counts to CloudWatch. If a canary fails, a CloudWatch alarm on those metrics can notify you through Amazon SNS via email, SMS, or other methods. The same notification can also invoke AWS Lambda functions, which can perform additional actions, such as restarting a service or notifying an on-call engineer.
Every 5 minutes (the rate set in the CDK code further down) the CloudWatch Synthetics canary will trigger and attempt to load several endpoints of my website.
If there is a problem accessing the website, the alarm will trigger and I will get an email notification of the error.
This testing could be expanded to cover the API as well; however, the static site content is all we will be testing today.
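To make the pass/fail rule concrete, here is a minimal Python sketch of what the heartbeat boils down to: request each URL and treat anything other than a 2xx response as a failure. The names `is_success`, `fetch_status` and `check_endpoints` are my own illustrative assumptions and are not part of CloudWatch Synthetics; the real canary performs the equivalent check from a headless browser.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, Optional


def is_success(status: Optional[int]) -> bool:
    """A check passes only when a response arrived with a 2xx status."""
    return status is not None and 200 <= status <= 299


def fetch_status(url: str, timeout: float = 30.0) -> Optional[int]:
    """Hypothetical fetcher: return the HTTP status, or None if no response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx/5xx responses still carry a status code
    except OSError:
        return None  # DNS failure, refused connection, timeout, ...


def check_endpoints(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[int]] = fetch_status,
) -> Dict[str, Optional[int]]:
    """Return the URLs that failed, mapped to the status that was seen."""
    failures: Dict[str, Optional[int]] = {}
    for url in urls:
        status = fetch(url)
        if not is_success(status):
            failures[url] = status
    return failures
```

Passing a stub for `fetch` keeps the logic testable without touching the network: `check_endpoints(urls, fetch=lambda url: 200)` returns an empty dict, meaning every endpoint passed.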
First things first, let’s simply create our artifact bucket.
I want to keep my costs to a minimum, so let’s also add a lifecycle rule to delete any objects after 7 days.
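from aws_cdk import Duration, aws_s3 as s3

# Private, encrypted bucket that stores the canary's run artifacts.
# (This code lives inside a Stack's __init__, hence `self`.)
canary_bucket = s3.Bucket(
    self,
    "canaryBucket",
    bucket_name="jeremyritchie.com-canary",
    access_control=s3.BucketAccessControl.PRIVATE,
    encryption=s3.BucketEncryption.S3_MANAGED,
    versioned=False,
    block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
)
# Delete artifacts after 7 days to keep storage costs down.
canary_bucket.add_lifecycle_rule(
    expiration=Duration.days(7),
)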
The CDK L2 constructs for CloudWatch Synthetics are currently in alpha, so the module has to be installed as a separate package.
Here’s my requirements.txt:
aws-cdk-lib==2.66.0
constructs>=10.0.0,<11.0.0
aws-cdk.aws-synthetics-alpha==2.66.0a0
Installing aws-cdk.aws-synthetics-alpha enables the use of the module aws_synthetics_alpha, which is the developer preview version of CDK Synthetics.
from aws_cdk import Duration
import aws_cdk.aws_synthetics_alpha as synthetics

canary = synthetics.Canary(
    self,
    "HeartbeatCanary",
    canary_name="website-heartbeat",
    # How often the canary runs.
    schedule=synthetics.Schedule.rate(Duration.minutes(5)),
    artifacts_bucket_location=synthetics.ArtifactsBucketLocation(
        bucket=canary_bucket,
        prefix="artifacts",
    ),
    test=synthetics.Test.custom(
        # Asset root; the script itself sits in nodejs/node_modules/.
        code=synthetics.Code.from_asset("./lambda"),
        handler="heartbeat.handler",
    ),
    runtime=synthetics.Runtime.SYNTHETICS_NODEJS_PUPPETEER_3_9,
)
No CDK Lambda function needs to be created because that’s already handled by the Canary resource. We do, however, need to provide the Lambda code.
This code was taken directly from the AWS console for CloudWatch Synthetics. The only notable input from me is the URLs that the canary will test for a response.
NB: The Lambda code must be within the directory <file_root>/nodejs/node_modules/. I.e. my chosen root is ./lambda, so from the CDK workspace directory root, the Lambda code is located at lambda/nodejs/node_modules/heartbeat.js
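As a quick sanity check of that layout, the small sketch below derives where the canary expects the script file, given the asset root and the handler string. `canary_script_path` is a hypothetical helper of my own naming, not a CDK function, and it assumes the Node.js layout described above.

```python
from pathlib import Path


def canary_script_path(asset_root: str, handler: str) -> Path:
    """Map a handler such as 'heartbeat.handler' to its expected file.

    Assumes the Node.js layout:
    <asset_root>/nodejs/node_modules/<module>.js
    """
    module, dot, function = handler.rpartition(".")
    if not dot or not module or not function:
        raise ValueError(f"handler must look like 'module.function': {handler!r}")
    return Path(asset_root) / "nodejs" / "node_modules" / f"{module}.js"
```

Calling `canary_script_path("./lambda", "heartbeat.handler").is_file()` from the CDK workspace root before deploying is a cheap way to catch a misplaced script.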
const { URL } = require('url');
const synthetics = require('Synthetics');
const log = require('SyntheticsLogger');
const syntheticsConfiguration = synthetics.getConfiguration();
const syntheticsLogHelper = require('SyntheticsLogHelper');

const loadBlueprint = async function () {
    const urls = [
        'https://jeremyritchie.com',
        'https://jeremyritchie.com/contact',
        'https://jeremyritchie.com/author/jeremy-ritchie',
        'https://jeremyritchie.com/posts/1',
        'https://jeremyritchie.com/blog'
    ];

    // Set screenshot option
    const takeScreenshot = false;

    /* Disabling default step screen shots taken during Synthetics.executeStep() calls
     * Step will be used to publish metrics on time taken to load dom content but
     * Screenshots will be taken outside the executeStep to allow for page to completely load with domcontentloaded
     * You can change it to load, networkidle0, networkidle2 depending on what works best for you.
     */
    syntheticsConfiguration.disableStepScreenshots();
    syntheticsConfiguration.setConfig({
        continueOnStepFailure: true,
        includeRequestHeaders: true, // Enable if headers should be displayed in HAR
        includeResponseHeaders: true, // Enable if headers should be displayed in HAR
        restrictedHeaders: [], // Value of these headers will be redacted from logs and reports
        restrictedUrlParameters: [] // Values of these url parameters will be redacted from logs and reports
    });

    let page = await synthetics.getPage();
    for (const url of urls) {
        await loadUrl(page, url, takeScreenshot);
    }
};

// Reset the page in-between
const resetPage = async function (page) {
    try {
        await page.goto('about:blank', { waitUntil: ['load', 'networkidle0'], timeout: 30000 });
    } catch (e) {
        synthetics.addExecutionError('Unable to open a blank page. ', e);
    }
};

const loadUrl = async function (page, url, takeScreenshot) {
    let stepName = null;
    let domcontentloaded = false;

    try {
        stepName = new URL(url).hostname;
    } catch (e) {
        const errorString = `Error parsing url: ${url}. ${e}`;
        log.error(errorString);
        /* If we fail to parse the URL, don't emit a metric with a stepName based on it.
           It may not be a legal CloudWatch metric dimension name and we may not have an alarm
           set up on the malformed URL stepName. Instead, fail this step which will
           show up in the logs and will fail the overall canary and alarm on the overall canary
           success rate.
        */
        throw e;
    }

    await synthetics.executeStep(stepName, async function () {
        const sanitizedUrl = syntheticsLogHelper.getSanitizedUrl(url);

        /* You can customize the wait condition here. For instance, using 'networkidle2' or 'networkidle0' to load page completely.
           networkidle0: Navigation is successful when the page has had no network requests for half a second. This might never happen if page is constantly loading multiple resources.
           networkidle2: Navigation is successful when the page has no more than 2 network requests for half a second.
           domcontentloaded: It's fired as soon as the page DOM has been loaded, without waiting for resources to finish loading. Can be used and then add explicit await page.waitFor(timeInMs)
        */
        const response = await page.goto(url, { waitUntil: ['domcontentloaded'], timeout: 30000 });
        if (response) {
            domcontentloaded = true;
            const status = response.status();
            const statusText = response.statusText();

            // Declare the variable (it was an implicit global) and actually log it.
            const logResponseString = `Response from url: ${sanitizedUrl} Status: ${status} Status Text: ${statusText}`;
            log.info(logResponseString);

            // If the response status code is not a 2xx success code
            if (status < 200 || status > 299) {
                throw new Error(`Failed to load url: ${sanitizedUrl} ${status} ${statusText}`);
            }
        } else {
            const logNoResponseString = `No response returned for url: ${sanitizedUrl}`;
            log.error(logNoResponseString);
            throw new Error(logNoResponseString);
        }
    });

    // Wait for 15 seconds to let page load fully before taking screenshot.
    if (domcontentloaded && takeScreenshot) {
        await page.waitFor(15000);
        await synthetics.takeScreenshot(stepName, 'loaded');
        await resetPage(page);
    }
};

exports.handler = async () => {
    return await loadBlueprint();
};
We want to be notified via email when the CloudWatch alarm triggers, so let’s quickly create that resource.
This is something we’ve done several times on this blog before, so there is nothing new here.
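from aws_cdk import aws_sns as sns

# Topic the alarm publishes to.
topic = sns.Topic(
    self,
    "heartbeatTopic",
    topic_name="website-heartbeat",
)
# Email subscription; the address must confirm the subscription once.
sns.Subscription(
    self,
    "admin",
    topic=topic,
    protocol=sns.SubscriptionProtocol.EMAIL,
    endpoint="jeremyritchie1996@hotmail.com",
)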
The CloudWatch alarm is a funny one here: I’ve chosen to use the Cfn (CloudFormation, L1) construct because I could not get the control I wanted over the period with the L2 construct.
The alarm period needs to line up with how often the canary reports. The canary currently runs every 5 minutes, so a 5 minute period would be suitable; however, if I reduce the canary frequency to once an hour, the alarm period must increase as well. The alarm below uses a period of 3600 seconds, which covers either schedule.
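The relationship between the canary schedule and the alarm period can be written down as a small calculation. This is a sketch under my own assumptions: `alarm_period_seconds` is a hypothetical helper, and it rounds up to a whole number of minutes because CloudWatch alarm periods longer than 30 seconds must be a multiple of 60.

```python
import math
from datetime import timedelta


def alarm_period_seconds(canary_interval: timedelta) -> int:
    """Smallest alarm period (in seconds) that spans one canary run.

    Rounds up to a whole minute, since alarm periods longer than
    30 seconds have to be a multiple of 60.
    """
    seconds = canary_interval.total_seconds()
    if seconds <= 0:
        raise ValueError("canary interval must be positive")
    return math.ceil(seconds / 60) * 60
```

`alarm_period_seconds(timedelta(minutes=5))` gives 300, and an hourly schedule gives 3600, which is the `period` value in the alarm definition.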
from aws_cdk import aws_cloudwatch as cloudwatch

alarm = cloudwatch.CfnAlarm(
    self,
    "HeartbeatAlarm",
    alarm_name="heartbeat-alarm",
    # Alarm when the average success rate over the period is <= 99%.
    comparison_operator="LessThanOrEqualToThreshold",
    evaluation_periods=1,
    metric_name="SuccessPercent",
    namespace="CloudWatchSynthetics",
    period=3600,  # seconds
    statistic="Average",
    threshold=99,
    # Publish to the SNS topic when the alarm fires.
    alarm_actions=[topic.topic_arn],
    dimensions=[
        cloudwatch.CfnAlarm.DimensionProperty(
            name="CanaryName",
            value=canary.canary_name,
        )
    ],
)
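To see how this alarm behaves, the arithmetic can be sketched in a few lines of Python. The sketch assumes each canary run contributes one SuccessPercent datapoint of either 100 or 0, and `alarm_state` is a hypothetical helper that mirrors the `Average` statistic and the `LessThanOrEqualToThreshold` comparison configured above; it is not how CloudWatch itself is implemented.

```python
from statistics import mean
from typing import Sequence


def alarm_state(success_percent: Sequence[float], threshold: float = 99.0) -> str:
    """Evaluate one period: ALARM when the average is <= threshold."""
    if not success_percent:
        # No datapoints in the period; the alarm leaves the missing-data
        # behaviour at its default, so report that case separately.
        return "INSUFFICIENT_DATA"
    return "ALARM" if mean(success_percent) <= threshold else "OK"


# Twelve 5-minute runs fit in one 3600-second period.
all_passed = [100.0] * 12
one_failed = [100.0] * 11 + [0.0]
```

With one failed run out of twelve the average drops to about 91.7, which is below 99, so `alarm_state(one_failed)` returns "ALARM", while `alarm_state(all_passed)` returns "OK". In other words, a single failed run within the hour is enough to trigger the email.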
Right, with all this infrastructure deployed, let’s see if it works!
And only moments later, I get an email.
Hey presto, that’ll do the trick!
AWS CloudWatch Synthetics canaries provide a simple, yet powerful way to monitor the performance and availability of your applications. By simulating user interactions with your application, canaries can help identify issues before they affect customers, which can improve application reliability and reduce downtime.
This was an extremely simple implementation of a CloudWatch Synthetics canary; however, it demonstrates how easy it can be to set up a monitoring solution on AWS.
As my heartbeat monitoring needs grow, I can expand the canary to perform increasingly complex checks.