Chapter 9

AWS Systems Manager

Introduction

Observing and tracking the correct functioning of workloads running both in the cloud and on-premises can be a challenge. The scale, distribution and diversity of systems can add complexity to day-to-day operations. Common tasks such as logging, compliance checking, troubleshooting, patching and upgrades can become time-consuming and tedious, particularly when conducted manually.

AWS Systems Manager (formerly Simple Systems Manager, or SSM) addresses this common hybrid infrastructure problem. Systems Manager is an AWS service that allows for the automated monitoring and control of a wide variety of supported AWS and on-premises infrastructure. Accessible via a central console, it provides a variety of tools for operations, application, change and node management.

This article is a beginner’s guide to AWS Systems Manager, where we will explore its capabilities, provide guidance on getting started and give a high-level SSM process flow. Then we will describe the installation and management of the SSM agent, discuss automation and provide a practical troubleshooting scenario, highlighting the power and utility of AWS Systems Manager. To conclude, we will look at AWS Systems Manager’s limitations and alternatives.

Capabilities

AWS Systems Manager contains several tools, organized into the five ‘Capability Categories’ shown below. These tools allow us to perform operational tasks swiftly against various resources. Resources such as EC2 instances, Amazon S3 buckets and even on-premises servers can be associated with resource tags. These tags define membership of a Resource Group, against which we can then view operational and troubleshooting data.

AWS Systems Manager Features – Figure 1.1

The following table highlights the elements and features of AWS Systems Manager, placed into their Capability Categories.

Operations Management:
  • OpsCenter to view and resolve issues related to AWS resources
  • Explorer, CloudWatch, and PHD (Personal Health Dashboard) to visualize reports

Application Management:
  • Application Manager to manage multiple applications from a single console
  • AppConfig to quickly deploy application configuration
  • Parameter Store to store secrets and configuration data

Change Management:
  • Automation to simplify maintenance and deployment tasks
  • Maintenance Windows to schedule maintenance time
  • Change Calendar to set the date and time for actions or events

Node Management:
  • Compliance to detect non-compliant resources
  • Hybrid Activations to set up VMs in hybrid environments
  • Patch Manager to automate patching
  • Session Manager to connect to and manage EC2 instances
  • Run Command to configure managed instances remotely
  • Distributor to package and distribute software
  • Fleet Manager to give a global view of the health and performance of an entire fleet of servers

Shared Resources:
  • Pre-configured documents to define actions that Systems Manager can perform

Features of AWS Systems Manager – Table 1.1

Getting Started

At a high level, using Systems Manager breaks down into three stages: grouping resources, viewing insights, and acting on those insights.

Three steps to using AWS Systems Manager – Figure 1.2

1. Group: We can create logical groupings of AWS resources. Grouping is a foundational precursor to performing operations (such as compliance management, patching, and automation) on AWS resources.

Create Resource Group (AWS Management Console) – Figure 1.3 (source)

2. Insights: AWS Systems Manager automatically displays aggregated operational data for each resource group via a dashboard. You can also integrate CloudWatch Dashboards, AWS Config rules, AWS CloudTrail, and AWS Personal Health Dashboard (PHD).

Inventory Insights (AWS Management Console) – Figure 1.4 (source)

3. Actions on Insights: You can act upon insights, or perform administrative actions on the resource groups that were defined in step 1, via the central console. 
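The grouping step can also be scripted. The following is a minimal sketch, assuming the boto3 SDK and a tag-based group; the function names, the group name `production-servers` and the `Environment=Production` tag are illustrative and do not come from the console walkthrough above.

```python
import json


def build_tag_query(resource_types, tag_key, tag_values):
    """Build a tag-based ResourceQuery for AWS Resource Groups."""
    query = {
        "ResourceTypeFilters": list(resource_types),
        "TagFilters": [{"Key": tag_key, "Values": list(tag_values)}],
    }
    # The API expects the query itself to be passed as a JSON string.
    return {"Type": "TAG_FILTERS_1_0", "Query": json.dumps(query)}


def create_tagged_group(client, name, resource_query):
    """Create the group; `client` is a boto3 "resource-groups" client."""
    response = client.create_group(Name=name, ResourceQuery=resource_query)
    return response["Group"]["GroupArn"]


# Usage (requires boto3 and AWS credentials; names are hypothetical):
#   import boto3
#   rg = boto3.client("resource-groups")
#   query = build_tag_query(["AWS::EC2::Instance"], "Environment", ["Production"])
#   create_tagged_group(rg, "production-servers", query)
```

Passing the client in as an argument keeps the query-building logic separate from the AWS call, so it can be checked without credentials.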

AWS Systems Manager Process Flow

Now that we have covered the basics of AWS Systems Manager, let’s take a deeper look at the AWS Systems Manager process flow: the general set of steps taken to access and use Systems Manager to perform actions on Amazon EC2 instances, edge devices, and virtual machines. Each Systems Manager capability conveniently follows a similar process.

AWS Systems Manager Process Flow – Figure 1.5

1. Access AWS Systems Manager: Systems Manager can be accessed in three ways: via the AWS Management Console, the AWS Command Line Interface (CLI) or an AWS SDK. For example, open the AWS Management Console and type “AWS Systems Manager” into the search bar, as shown in the figure below.

Access AWS Systems Manager (AWS Management Console) – Figure 1.6

2. Select Capabilities: Systems Manager provides a variety of capabilities (see Figure 1.1), each serving a different purpose. For example, you could select Patch Manager to apply patches against a fleet of nodes.

3. Processing: This consists of two stages. AWS Systems Manager verifies user permissions and then the SSM agent (discussed later) performs relevant actions on the selected resource.

4. Reporting: After making any configuration changes, the SSM agent reports the status of the resource to Systems Manager or other configured AWS services.
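The four steps above can be sketched in code, taking Run Command as the selected capability. This is an illustrative sketch that assumes the boto3 SDK; the function names and the tag used for targeting are hypothetical, and pagination of the results is omitted for brevity.

```python
def run_shell_on_tagged_nodes(ssm, tag_key, tag_value, commands):
    """Send shell commands to every managed node carrying the given tag.

    `ssm` is a boto3 "ssm" client (step 1: access). Systems Manager checks
    the caller's permissions before the agent runs anything (step 3).
    """
    response = ssm.send_command(
        Targets=[{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        DocumentName="AWS-RunShellScript",  # step 2: the chosen capability
        Parameters={"commands": list(commands)},
    )
    return response["Command"]["CommandId"]


def summarize_invocations(ssm, command_id):
    """Map each instance ID to the status its agent reported (step 4)."""
    response = ssm.list_command_invocations(CommandId=command_id)
    return {
        invocation["InstanceId"]: invocation["Status"]
        for invocation in response["CommandInvocations"]
    }


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   ssm = boto3.client("ssm")
#   command_id = run_shell_on_tagged_nodes(ssm, "Environment", "Production", ["uptime"])
#   print(summarize_invocations(ssm, command_id))
```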

SSM Agent

SSM Agent (AWS Systems Manager Agent) is lightweight software that allows AWS Systems Manager to update, configure and manage the resource that it is installed on. The concept is similar to the OpsRamp Agent, which can deliver analytics for hybrid asset inventory, incident remediation and OS patching.

Many AMIs (Amazon Machine Images), such as Amazon Linux and Windows Server 2019, have the agent preinstalled. Manual installation is possible for images that do not. In the following section we will see how to do this.

Installing SSM Agent into Ubuntu Server

To manually install the SSM Agent on Ubuntu Server, you can use either Debian (deb) packages or Snap packages.

Installation with Snap Package:

1. SSH into your EC2 instance with the associated .pem file

2. Run the following command.

sudo snap install amazon-ssm-agent --classic

SSM Agent installation Command – Figure 1.7

The output will look like this:

The Output of SSM Agent installation command – Figure 1.8

3. Check the status of the SSM agent with the below command:

sudo snap services amazon-ssm-agent

Check Status command – Figure 1.9

Active Status of the SSM Agent – Figure 1.10
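Once the agent is running, the node should also appear as reachable from the Systems Manager side. The sketch below is one way to check this, assuming the boto3 SDK; the function name is illustrative, and the loop simply follows `NextToken` until every page has been read.

```python
def find_unreachable_nodes(ssm):
    """Return (instance_id, ping_status) for nodes whose agent is not Online.

    `ssm` is a boto3 "ssm" client.
    """
    unreachable = []
    kwargs = {}
    while True:
        page = ssm.describe_instance_information(**kwargs)
        for node in page["InstanceInformationList"]:
            status = node.get("PingStatus")
            if status != "Online":
                unreachable.append((node["InstanceId"], status))
        token = page.get("NextToken")
        if not token:
            return unreachable
        # Ask for the next page of managed nodes.
        kwargs = {"NextToken": token}


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   for instance_id, status in find_unreachable_nodes(boto3.client("ssm")):
#       print(instance_id, status)
```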

SSM Agent Logs

The SSM agent writes detailed information about state, execution and error status to local log files. These can be examined directly on the resource. We can also send log files to Amazon CloudWatch Logs to aggregate and monitor them in greater detail.

Sending SSM Agent logs to CloudWatch Logs

You can follow this step-by-step procedure to configure SSM Agent log-forwarding.

AWS Systems Manager also publishes metrics to CloudWatch about the status of Run Command executions, including whether commands succeeded, failed, or timed out on delivery. Additionally, you can configure alarms if a status of ‘success’ is not reported for any specified SSM Command document.

Run the following command to view these metrics using the AWS CLI:

aws cloudwatch list-metrics --namespace "AWS/SSM-RunCommand"

Metrics using AWS CLI – Figure 1.11
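An alarm on these metrics might be defined along the following lines. This is a sketch only: it assumes the boto3 SDK and assumes that the failure metric in the `AWS/SSM-RunCommand` namespace is named `CommandsFailed`, which should be confirmed against the output of the `list-metrics` command above. The alarm name and the five-minute period are arbitrary choices.

```python
def build_failed_command_alarm(alarm_name, sns_topic_arn=None):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    params = {
        "AlarmName": alarm_name,
        "Namespace": "AWS/SSM-RunCommand",
        # Assumed metric name; verify it with `aws cloudwatch list-metrics`.
        "MetricName": "CommandsFailed",
        "Statistic": "Sum",
        "Period": 300,  # seconds per evaluation window
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # No data points simply means no commands ran in that window.
        "TreatMissingData": "notBreaching",
    }
    if sns_topic_arn:
        params["AlarmActions"] = [sns_topic_arn]
    return params


# Usage (requires boto3 and AWS credentials; the alarm name is hypothetical):
#   import boto3
#   cloudwatch = boto3.client("cloudwatch")
#   cloudwatch.put_metric_alarm(**build_failed_command_alarm("ssm-run-command-failures"))
```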

Further information about Run Command metrics can be found in the AWS documentation. Now that we have covered the basics of AWS Systems Manager and the SSM agent, it is time to look at a more practical example.

Automation with Systems Manager 

There are several remediation, maintenance, and deployment tasks common to AWS services, including Amazon Elastic Compute Cloud (Amazon EC2), Amazon Relational Database Service (Amazon RDS), and others. Using Automation (a capability of AWS Systems Manager), we can simplify the deployment and management of AWS resources to achieve operational efficiency and minimize errors often associated with manual intervention.

AWS offers an “automation toolkit” in the form of runbooks: documents that define routine maintenance activities or tasks that respond to events. AWS provides pre-defined runbooks for immediate use, and we can also define our own to meet any specific need. The following sample runbook, written in YAML, creates an AMI and copies it to another Region using three automation actions:

  • aws:executeAwsApi automation action used to create an AMI.
  • aws:waitForAwsResourceProperty automation action to confirm the availability of the AMI.
  • aws:executeScript automation action to copy the AMI to the destination Region.
---
description: Custom Automation Backup and Recovery Sample
schemaVersion: '0.3'
assumeRole: "{{ AutomationAssumeRole }}"
parameters:
  AutomationAssumeRole:
    type: String
    description: "(Optional) The ARN of the role that allows Automation to perform
      the actions on your behalf. If no role is specified, Systems Manager Automation
      uses your IAM permissions to use this runbook."
    default: ''
  InstanceId:
    type: String
    description: "(Required) The ID of the EC2 instance."
    default: ''
mainSteps:
- name: createImage
  action: aws:executeAwsApi
  onFailure: Abort
  inputs:
    Service: ec2
    Api: CreateImage
    InstanceId: "{{ InstanceId }}"
    Name: "Automation Image for {{ InstanceId }}"
    NoReboot: false
  outputs:
    - Name: newImageId
      Selector: "$.ImageId"
      Type: String
  nextStep: verifyImageAvailability
- name: verifyImageAvailability
  action: aws:waitForAwsResourceProperty
  timeoutSeconds: 600
  inputs:
    Service: ec2
    Api: DescribeImages
    ImageIds:
    - "{{ createImage.newImageId }}"
    PropertySelector: "$.Images[0].State"
    DesiredValues:
    - available
  nextStep: copyImage
- name: copyImage
  action: aws:executeScript
  timeoutSeconds: 45
  onFailure: Abort
  inputs:
    Runtime: python3.6
    Handler: crossRegionImageCopy
    InputPayload:
      newImageId : "{{ createImage.newImageId }}"
    Script: |-
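      def crossRegionImageCopy(events,context):
        import boto3

        # Create the EC2 client in the destination Region (hard-coded here)
        ec2 = boto3.client('ec2', region_name='us-east-1')
        # AMI ID passed in from the createImage step via InputPayload
        newImageId = events['newImageId']

        # Copy the AMI from the source Region (also hard-coded) to the destination
        ec2.copy_image(
          Name='DR Copy for ' + newImageId,
          SourceImageId=newImageId,
          SourceRegion='us-west-2'
        )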

Sample Runbook to create Amazon Machine Image (AMI) of an instance – Figure 1.12 (source)
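A custom runbook such as the sample above has to be registered as an Automation document before it can be run. The following sketch shows one way to do that and to start an execution, assuming the boto3 SDK; the function names, the document name and the file name are hypothetical.

```python
def register_runbook(ssm, name, yaml_text):
    """Register YAML runbook content as an Automation document."""
    response = ssm.create_document(
        Content=yaml_text,
        Name=name,
        DocumentType="Automation",
        DocumentFormat="YAML",
    )
    return response["DocumentDescription"]["Name"]


def start_runbook(ssm, name, **parameters):
    """Start an execution and return its execution ID."""
    response = ssm.start_automation_execution(
        DocumentName=name,
        # Automation expects every parameter value as a list of strings.
        Parameters={key: [str(value)] for key, value in parameters.items()},
    )
    return response["AutomationExecutionId"]


# Usage (requires boto3 and AWS credentials; names and IDs are hypothetical):
#   import boto3
#   ssm = boto3.client("ssm")
#   with open("backup-runbook.yaml") as handle:
#       register_runbook(ssm, "Custom-BackupAndCopyImage", handle.read())
#   start_runbook(ssm, "Custom-BackupAndCopyImage", InstanceId="i-0123456789abcdef0")
```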

Troubleshooting Scenario

Occasionally, we may not be able to connect to a Windows (or other) EC2 instance. The underlying problem could be a system misconfiguration. But how do we check and change the configuration if we cannot connect to the instance? This is the basis of our scenario.

Problem with a Windows Instance – Figure 1.13 (source)

From the above image, we can see that the Windows EC2 instance is not passing all of its health checks. As a result, we cannot access and investigate the Windows machine directly. Potential causes of this connectivity issue could be:

  • RDP service issue
  • Firewall blocking RDP traffic
  • Network adapter misconfiguration

To aid the troubleshooting process, we could perform a number of proactive and reactive tasks, such as:

  • Taking screenshots during Windows updates
  • Manually using the Systems Manager Run Command
  • Investigating Root volumes of the instance, using Windows Registry Analysis

However, these investigative steps are manual and require a good understanding of the underlying operating system. They may even carry a degree of risk. For example, the Windows Registry contains many important system parameters and a wrong action, such as the accidental deletion of a key, could significantly impact the system. Instead, we will make use of AWS Systems Manager.

High-level Solution

AWSSupport-ExecuteEC2Rescue can be used to detect and resolve issues with a few mouse clicks. It is an AWS Systems Manager Automation document composed of sequential steps that remediate common Windows issues. These steps are numbered and described below.

Automated solution with Systems Manager – Figure 1.14 (source)

  1. Issue: The Windows instance is unreachable and is not passing health checks.
  2. Run AWSSupport-ExecuteEC2Rescue: Pass the instance ID to AWSSupport-ExecuteEC2Rescue.
  3. VPC Creation: An EC2Rescue VPC is created.
  4. Instance Creation: Once AWSSupport-ExecuteEC2Rescue is run, it creates an EC2Rescue instance in an isolated VPC (Virtual Private Cloud) in the same Availability Zone as the unreachable instance.
  5. Stopping the Instance: AWSSupport-ExecuteEC2Rescue stops the unreachable instance.
  6. Backup AMI: AWSSupport-ExecuteEC2Rescue creates a backup AMI (Amazon Machine Image) of the unreachable instance.
  7. Fixing the Issue: EC2Rescue and Run Command attempt to troubleshoot the problem with the unreachable instance.
  8. Terminate Instance: AWSSupport-ExecuteEC2Rescue terminates the EC2Rescue instance created in step 4.
  9. Starting your Instance: The automation document (AWSSupport-ExecuteEC2Rescue) then restarts the original instance.
  10. End: The automation ends here and all EC2Rescue infrastructure is deleted. If the remediation succeeded, the instance is reachable again.

Low-level Step-by-Step Solution

If you are new to the AWS CLI (Command Line Interface), the AWS CLI user guide is an invaluable reference.

1. As the Windows server is failing its health checks and is not accessible, we will run AWSSupport-ExecuteEC2Rescue via the CLI. To run this document, you will need to pass the following:

  • Instance ID
  • IAM role with the required permissions
aws ssm start-automation-execution --document-name "AWSSupport-ExecuteEC2Rescue" --parameters "UnreachableInstanceId=YOURINSTANCEID,AssumeRole=arn:aws:iam::YOURACCOUNTID:role/YOURSSMAUTOMATIONROLE"

{
    "AutomationExecutionId": "ae6c3617-843e-11e7-8f65-57a040263d53"
}

Executing EC2Rescue Command – Figure 1.15

2. To monitor the progress of the execution, use the command below with the Automation execution ID returned in step 1.

aws ssm get-automation-execution --automation-execution-id "ae6c3617-843e-11e7-8f65-57a040263d53"

{
    "AutomationExecution": {
        "AutomationExecutionStatus": "InProgress",
        "Parameters": {
            (..)
        },
        "Outputs": {
            (..)
        },
        "DocumentName": "AWSSupport-ExecuteEC2Rescue",
        "AutomationExecutionId": "ae6c3617-843e-11e7-8f65-57a040263d53",
        "DocumentVersion": "1",
        "ExecutionStartTime": 1503079041.084,
        "StepExecutions": [
            {
                (..)
            }
        ]
    }
}

Monitoring the progress of EC2Rescue – Figure 1.16

3. Once the automation has completed, we can see from the below AWS Console screenshot that the Windows instance is back online and passing its health checks.

Status of Windows instance – Figure 1.17 (source)

4. To check where the original problem occurred, run the following command:

aws ssm get-automation-execution --automation-execution-id "ae6c3617-843e-11e7-8f65-57a040263d53" --query 'AutomationExecution.Outputs."runEC2Rescue.Output"' --output text

Output the execution information – Figure 1.18
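The monitoring and output steps can also be combined in a short script that polls until the automation finishes and then reads the EC2Rescue output. This is a sketch assuming the boto3 SDK; the function names, polling interval and the set of statuses treated as final are assumptions and may need extending for other execution modes.

```python
import time

# Statuses treated as final in this sketch (an assumption; others may exist).
TERMINAL_STATUSES = {"Success", "Failed", "TimedOut", "Cancelled"}


def wait_for_automation(ssm, execution_id, poll_seconds=15, max_polls=120,
                        sleep=time.sleep):
    """Poll get_automation_execution until the execution reaches a final status."""
    for _ in range(max_polls):
        execution = ssm.get_automation_execution(
            AutomationExecutionId=execution_id
        )["AutomationExecution"]
        if execution["AutomationExecutionStatus"] in TERMINAL_STATUSES:
            return execution
        sleep(poll_seconds)
    raise TimeoutError(f"Automation {execution_id} did not finish in time")


def read_output(execution, key="runEC2Rescue.Output"):
    """Join the list of output strings stored under `key` into one text block."""
    return "\n".join(execution.get("Outputs", {}).get(key, []))


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   ssm = boto3.client("ssm")
#   execution = wait_for_automation(ssm, "ae6c3617-843e-11e7-8f65-57a040263d53")
#   print(execution["AutomationExecutionStatus"])
#   print(read_output(execution))
```

The `sleep` argument exists only so the wait can be replaced during testing; in normal use the default is sufficient.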

Final Result

EC2Rescue has discovered that:

  • Windows Firewall was enabled on the domain, private, and guest or public network profiles, which may have caused the connectivity issue. EC2Rescue made changes to resolve this.
  • Before making any changes, the automation created a backup AMI of the instance (step 6 of the high-level solution), so the original state can be restored if needed. Refer to the following figure for a detailed analysis.
===== System Information =====
Operating System: Windows Server 2008 R2 Datacenter
Service Pack: Service Pack 1
Version: 6.1.7601
Computer Name: WIN-0KEEGO57HHS
Time Zone: UTC
.NET Framework: v4.7 (4.7.02053)
EC2Config Version: 4.9.1981

===== Analysis =====
System Time
OK – RealTimeIsUniversal (Enabled): This registry value should be enabled when timezone is not UTC.

Windows Firewall
Warning – Domain networks (Enabled): Windows Firewall will be disabled.
Warning – Private networks (Enabled): Windows Firewall will be disabled.
Warning – Guest or public networks (Enabled): Windows Firewall will be disabled.

Remote Desktop
OK – Service Start (Manual): Sets Remote Desktop service start to automatic.
OK – Remote Desktop Connections (Enabled): The RDP listening port will be changed to TCP/3389.
OK – TCP Port (3389): The RDP listening port will be changed to TCP/3389.

Suggestions by EC2Rescue – Figure 1.19
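When this report has to be reviewed for many instances, it can help to extract only the warnings. The parser below is an illustrative sketch based solely on the layout shown in Figure 1.19: a section name on its own line, followed by lines of the form "Status – Setting (Current value): Note". The `Error` status and the `Finding` type are assumptions, as only `OK` and `Warning` appear in the sample.

```python
import re
from typing import NamedTuple


class Finding(NamedTuple):
    section: str
    status: str
    setting: str
    current: str
    note: str


# Matches e.g. "Warning – Domain networks (Enabled): Windows Firewall will be disabled."
# Accepts either an en dash or a plain hyphen after the status.
_FINDING = re.compile(r"^(OK|Warning|Error)\s+[–-]\s+(.+?)\s+\((.+?)\):\s*(.*)$")


def parse_analysis(report):
    """Return the findings listed under the '===== Analysis =====' header."""
    findings = []
    section = ""
    in_analysis = False
    for raw in report.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("====="):
            in_analysis = "Analysis" in line
            continue
        if not in_analysis:
            continue
        match = _FINDING.match(line)
        if match:
            findings.append(Finding(section, *match.groups()))
        else:
            # Any other non-empty line names the section that follows.
            section = line
    return findings


# Usage: list only the settings that EC2Rescue flagged.
#   for finding in parse_analysis(report_text):
#       if finding.status == "Warning":
#           print(finding.section, finding.setting, finding.note)
```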

Limitations

AWS Systems Manager is a useful and powerful tool that allows organizations to operate complex infrastructure at scale, both safely and securely. Despite this, it has a number of limitations.

  1. Hybrid Cloud: AWS Systems Manager facilitates the management of resources hosted in hybrid environments. However, its overall compatibility, from an integration point of view, is limited. Not everything in the modern digital business integrates well with AWS Systems Manager. For example, some servers, especially VMware ESXi, do not appear in the SSM inventory, and others can appear with the status “Connection Lost” even after re-installing the SSM agent. As an alternative, OpsRamp’s out-of-the-box integrations provide a better digital experience.
  2. Absence of a Drag-and-Drop Feature: AWS Systems Manager allows for the definition of custom actions which can be performed on managed instances via SSM documents. However, creating an SSM document requires writing JavaScript Object Notation (JSON) or YAML. A drag-and-drop workflow builder that allows workflows to be triggered by alerts is not currently included.
  3. Lack of Machine Learning and Event Correlation: AWS Systems Manager does not leverage machine learning techniques for intelligent alerting, correlation, or root cause analysis. These features can be pivotal during high-consequence situations. As an example of a product that does ship with these tools, OpsRamp’s artificial intelligence for IT operations (AIOps) is worth examining. It is the world’s first service-centric AIOps platform that comes with intelligent event management, alert correlation and rapid remediation.

Final Thoughts

We have discussed the significance of AWS Systems Manager and have used one of its key capabilities: Automation. Although AWS Systems Manager is an excellent tool, offering strong tracking and remediation functions, it also comes with several notable limitations. For a one-stop solution with remediative AI and greater cloud integration, OpsRamp is a viable alternative. It also comes with a free demo.
