AWS CloudWatch Metrics Integration

Configure AWS Kinesis Firehose

Use separate Firehose delivery streams for logs and metrics.

In the Kinesis Firehose AWS console, create a new delivery stream in the account that emits the metrics.

Specify the following attribute values:

Source

Direct PUT

Destination

HTTP Endpoint

Destination settings

Provide the external-facing endpoint of the Kloudfuse cluster as part of the following URL address format:

https://<external facing endpoint of Kfuse cluster>/ingester/kinesis/metrics

Access token key

Provide when required

Content encoding

GZIP

  1. Provide an existing S3 bucket, or create a new one for storing Kinesis records as a backup.

    Backing up only failed data should be sufficient.

  2. Change the name of the stream, as necessary.
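The same settings can also be assembled in code. The following is a minimal Python sketch, not an official Kloudfuse tool: it builds the endpoint URL in the format shown above and a dictionary shaped like the HttpEndpointDestinationConfiguration argument of the boto3 Firehose create_delivery_stream call. The function names, the endpoint name kloudfuse-metrics, and the example host and ARNs are illustrative assumptions.

```python
def kfuse_metrics_endpoint(external_host):
    """Return the metrics ingest URL for a Kloudfuse external-facing host name."""
    host = external_host.strip().rstrip("/")
    if not host or "/" in host:
        raise ValueError("pass a bare host name, for example kfuse.example.com")
    return f"https://{host}/ingester/kinesis/metrics"


def http_destination_config(external_host, backup_bucket_arn, delivery_role_arn,
                            access_key=None):
    """Build the HTTP endpoint destination settings described in this section."""
    endpoint = {
        "Name": "kloudfuse-metrics",  # hypothetical display name
        "Url": kfuse_metrics_endpoint(external_host),
    }
    if access_key:  # the access token key is only provided when required
        endpoint["AccessKey"] = access_key
    return {
        "EndpointConfiguration": endpoint,
        "RequestConfiguration": {"ContentEncoding": "GZIP"},
        # Back up only the records that fail delivery.
        "S3BackupMode": "FailedDataOnly",
        "S3Configuration": {
            "RoleARN": delivery_role_arn,
            "BucketARN": backup_bucket_arn,
        },
        "RoleARN": delivery_role_arn,
    }


# With boto3 installed and AWS credentials configured, the result could be used as:
#   boto3.client("firehose").create_delivery_stream(
#       DeliveryStreamName="kfuse-metrics",
#       DeliveryStreamType="DirectPut",
#       HttpEndpointDestinationConfiguration=config,
#   )
config = http_destination_config(
    "kfuse.example.com",
    "arn:aws:s3:::example-backup-bucket",
    "arn:aws:iam::123456789012:role/example-firehose-role",
)
```

Keeping the URL builder separate makes it easy to validate the host name before any AWS call is made.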

Configure AWS CloudWatch Metrics Stream

In the account that emits the metrics, in the CloudWatch AWS console, navigate to the Metrics section on the left side of the console, select Streams, and create a new metric stream.

  1. Select the metric namespaces to send to the stream; the default is all metrics.

  2. In the configuration section, select an existing Firehose owned by your account, and then select the Kinesis Firehose you created earlier.

  3. Under Change Output Format, select JSON as the output format.

  4. Change the name of the stream if necessary.
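For scripted setups, the steps above map onto the parameters of the boto3 CloudWatch put_metric_stream call. The sketch below only builds the parameter dictionary. The function name and the example ARNs are illustrative assumptions, and the IAM role that lets CloudWatch write to the Firehose is assumed to exist already.

```python
def metric_stream_params(name, firehose_arn, role_arn, namespaces=()):
    """Build keyword arguments for the CloudWatch put_metric_stream call."""
    params = {
        "Name": name,
        "FirehoseArn": firehose_arn,
        "RoleArn": role_arn,
        "OutputFormat": "json",  # this guide requires JSON output
    }
    # Without include filters, the stream carries all namespaces (the default).
    if namespaces:
        params["IncludeFilters"] = [{"Namespace": ns} for ns in namespaces]
    return params


# With boto3 installed and AWS credentials configured:
#   boto3.client("cloudwatch").put_metric_stream(**params)
params = metric_stream_params(
    "kfuse-metric-stream",
    "arn:aws:firehose:us-west-2:123456789012:deliverystream/kfuse-metrics",
    "arn:aws:iam::123456789012:role/example-metric-stream-role",
    namespaces=["AWS/EC2", "AWS/RDS"],
)
```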

Enable AutoScaling Group Metrics

Perform these steps in the account that emits the metrics.

  1. Open the Amazon EC2 console.

  2. Choose Auto Scaling Groups from the navigation pane.

  3. Enable the checkbox next to your Auto Scaling group.

    A split pane opens up at the bottom of the page.

  4. On the Monitoring tab, select the Auto Scaling group metrics collection Enable checkbox, located under Auto Scaling at the top of the page.
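If you manage many Auto Scaling groups, the console steps above can be scripted. This is a minimal sketch that assumes a boto3 Auto Scaling client is passed in; the helper name enable_group_metrics and the group names are hypothetical. Omitting the Metrics argument enables all group metrics.

```python
def enable_group_metrics(autoscaling_client, group_names):
    """Enable group metrics collection for each named Auto Scaling group."""
    enabled = []
    for name in group_names:
        # Granularity "1Minute" is the only value the API accepts.
        autoscaling_client.enable_metrics_collection(
            AutoScalingGroupName=name,
            Granularity="1Minute",
        )
        enabled.append(name)
    return enabled


# With boto3 installed and AWS credentials configured:
#   import boto3
#   enable_group_metrics(boto3.client("autoscaling"), ["my-asg"])
```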

Enable Collection of Request Metrics in S3

In the account that emits the metrics, follow the instructions in AWS documentation for Creating a CloudWatch metrics configuration for all the objects in your bucket.
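As an alternative to the console, a metrics configuration without a filter applies to every object in the bucket. The sketch below assumes a boto3 S3 client is passed in; the helper name and the configuration ID EntireBucket are illustrative choices.

```python
def enable_request_metrics(s3_client, bucket, config_id="EntireBucket"):
    """Create a request-metrics configuration that covers all objects in a bucket."""
    # A MetricsConfiguration with only an Id (no Filter) applies to the whole bucket.
    s3_client.put_bucket_metrics_configuration(
        Bucket=bucket,
        Id=config_id,
        MetricsConfiguration={"Id": config_id},
    )
    return config_id


# With boto3 installed and AWS credentials configured:
#   import boto3
#   enable_request_metrics(boto3.client("s3"), "my-bucket")
```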

Enable Enrichment of AWS Metrics

The metrics sent by AWS CloudWatch to the Kinesis Firehose include minimal labels. Kloudfuse can attach more labels, including user-defined custom tags set in the AWS console, to the ingested metrics by scraping AWS.

To enable this enrichment of AWS metrics, follow these steps:

  1. In the global section of the custom-values.yaml file, add the following:

    global:
      enrichmentEnabled:
        - aws

  2. Create an IAM scraper role in the AWS account where the services that emit the metrics run.

    Attach the following policy so that Kloudfuse can scrape the additional labels from AWS. See the AWS documentation topic Define custom IAM permissions with customer managed policies.

    Create a scraper role with custom policies
    			"Action": [
    				"acm:ListCertificates",
    				"acm:ListTagsForCertificate",
    				"apigateway:GET",
    				"athena:ListWorkGroups",
    				"athena:ListTagsForResource",
    				"autoscaling:DescribeAutoScalingGroups",
    				"bedrock:ListFoundationModels",
    				"bedrock:ListTagsForResource",
    				"cloudwatch:ListMetrics",
    				"cloudwatch:GetMetricStatistics",
    				"dynamodb:ListTables",
    				"dynamodb:DescribeTable",
    				"dynamodb:ListTagsOfResource",
    				"ec2:DescribeInstances",
    				"ec2:DescribeInstanceStatus",
    				"ec2:DescribeSecurityGroups",
    				"ec2:DescribeNatGateways",
    				"ec2:DescribeVolumes",
    				"ecs:ListClusters",
    				"ecs:ListContainerInstances",
    				"ecs:ListServices",
    				"ecs:DescribeContainerInstances",
    				"ecs:DescribeServices",
    				"ecs:ListTagsForResource",
    				"elasticache:DescribeCacheClusters",
    				"elasticache:DescribeServerlessCaches",
    				"elasticache:ListTagsForResource",
    				"elasticfilesystem:DescribeFileSystems",
    				"elasticfilesystem:DescribeBackupPolicy",
    				"elasticloadbalancing:DescribeTags",
    				"elasticloadbalancing:DescribeLoadBalancers",
    				"es:ListDomainNames",
    				"es:DescribeDomains",
    				"es:ListTags",
    				"events:ListRules",
    				"events:ListTagsForResource",
    				"events:ListEventBuses",
    				"firehose:DescribeDeliveryStream",
    				"firehose:ListDeliveryStreams",
    				"firehose:ListTagsForDeliveryStream",
    				"fsx:DescribeFileSystems",
    				"fsx:ListTagsForResource",
    				"glue:ListJobs",
    				"glue:GetTags",
    				"kafka:ListTagsForResource",
    				"kafka:ListClustersV2",
    				"kinesis:ListStreams",
    				"kinesis:ListTagsForStream",
    				"kinesis:DescribeStream",
    				"lambda:GetPolicy",
    				"lambda:List*",
    				"lambda:ListTags",
    				"logs:DescribeLogGroups",
    				"logs:ListTagsForResource",
    				"logs:ListTagsLogGroup",
    				"mq:ListBrokers",
    				"mq:DescribeBroker",
    				"mediaconvert:ListQueues",
    				"mediaconvert:ListTagsForResource",
    				"qbusiness:ListApplications",
    				"qbusiness:GetApplication",
    				"qbusiness:ListTagsForResource",
    				"rds:DescribeDBInstances",
    				"rds:DescribeDBClusters",
    				"rds:ListTagsForResource",
    				"rds:DescribeEvents",
    				"redshift:DescribeClusters",
    				"redshift:DescribeTags",
    				"route53:ListHealthChecks",
    				"route53:ListTagsForResource",
    				"s3:ListAllMyBuckets",
    				"s3:GetBucketTagging",
    				"ses:ListConfigurationSets",
    				"ses:GetConfigurationSet",
    				"ses:ListTagsForResource",
    				"sns:ListTagsForResource",
    				"sns:ListTopics",
    				"sqs:ListQueues",
    				"sqs:ListQueueTags",
    				"states:ListStateMachines",
    				"states:ListActivities",
    				"states:ListTagsForResource",
    				"timestream:ListDatabases",
    				"timestream:ListTables",
    				"timestream:DescribeDatabase",
    				"timestream:DescribeTable",
    				"timestream:ListTagsForResource",
    				"wafv2:ListWebACLs",
    				"wafv2:ListRuleGroups",
    				"wafv2:ListTagsForResource",
    				"cloudfront:ListDistributions",
    				"cloudfront:GetDistribution",
    				"cloudfront:ListTagsForResource"
    			]

  3. Modify the trust relationship of the scraper role to add the IAM role of the node group on which Kloudfuse runs (Node IAM Role ARN) as the Principal.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Statement1",
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::ACCOUNT-NUMBER:role/eksctl-XXXXX-nodegroup-ng-XXXXXX-NodeInstanceRole-XXXXXXXXXX"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

  4. Ensure that the permissions map to the node pool that hosts the EKS cluster for Kloudfuse.

  5. Specify the AWS namespaces to scrape. Starting in 4.0.0, the awsNamespaces list defaults to an empty list, so you must explicitly enumerate the namespaces you want to scrape.

    Namespace values use the format AWS/<ServiceName> as defined by AWS. For example, a typical production deployment monitoring compute, database, storage, and serverless workloads would include:

    ingester:
      config:
        awsNamespaces:
          - "AWS/EC2"           # EC2 instances
          - "AWS/AutoScaling"   # Auto Scaling groups
          - "AWS/EBS"           # EBS volumes
          - "AWS/RDS"           # RDS databases
          - "AWS/ElastiCache"   # ElastiCache (Redis and Memcached)
          - "AWS/Lambda"        # Lambda functions
          - "AWS/ApplicationELB" # Application Load Balancers
          - "AWS/NetworkELB"    # Network Load Balancers
          - "AWS/S3"            # S3 buckets
          - "AWS/SQS"           # SQS queues
          - "AWS/ECS"           # ECS clusters and services

    For the full list of supported namespaces, see AWS Services.

  6. Enable Kloudfuse to consume the new role. There are two approaches, described below: through AWS credentials, or through Role ARNs.

AWS credentials

Add your AWS credentials as a secret, and use the secret in the ingester config.

  1. Retrieve your AWS credentials; see Configure tool authentication with AWS.

  2. In the Kloudfuse namespace, create a Kubernetes secret named aws-access-key, with keys accessKey and secretKey.

    kubectl create secret generic aws-access-key \
      --from-literal=accessKey=<AWS_ACCESS_KEY_ID> \
      --from-literal=secretKey=<AWS_SECRET_ACCESS_KEY>

  3. Specify the secretName in the custom-values.yaml file.

    ingester:
      config:
        awsScraper:
          secretName: aws-access-key

  4. To restrict scraping to specific namespaces or regions, add the following to custom-values.yaml:

    ingester:
      config:
        awsScraper:
          secretName: aws-access-key
          namespaces:
            - <add namespace>
          regions:
            - <add region>

Role ARNs

Add Role ARNs to the ingester config. This option enables you to scrape multiple AWS accounts.

  1. Add the scraper Role ARNs that you created with the new permissions to the awsRoleArns list in your custom-values.yaml file.

    ingester:
      config:
        awsRoleArns:
          - role: <ADD ROLE ARN HERE>

  2. To restrict scraping to specific namespaces or regions, add the following to custom-values.yaml:

    ingester:
      config:
        awsRoleArns:
          - role: <ADD ROLE ARN HERE>
            namespaces:
              - <add namespace>
            regions:
              - <add region>

For global services such as AWS CloudFront, a scraper role in the us-east-1 region is required. Create the scraper role in us-east-1, and also configure a Firehose delivery stream and a CloudWatch metric stream in us-east-1 for global services.

ingester:
  config:
    awsRoleArns:
      - role: <ADD US-EAST-1 ROLE ARN HERE>
        regions:
      - us-east-1

  1. Modify the IAM role of the node group on which the Kloudfuse platform runs: add the following permissions policy to the node-group role (Node IAM Role ARN) so that it can assume the scraper role.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Resource": "<REPLACE SCRAPER ROLE ARN HERE>"
            }
        ]
    }

  2. Run a helm upgrade to apply the changes.

    helm upgrade --create-namespace --install kfuse . -f <custom_values.yaml>
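
The three IAM documents used in this procedure (the scraper permissions policy, the scraper role trust relationship, and the node-group assume-role policy) can be generated rather than edited by hand. The following Python sketch is one possible way to do that. The function names and example ARNs are hypothetical, and the "Resource": "*" element in the permissions policy is an assumption, because the page shows only the Action list.

```python
import json


def scraper_permissions_policy(actions):
    """Wrap the Action list from step 2 in a complete policy document."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": sorted(set(actions)),
            "Resource": "*",  # assumption: not shown in this guide
        }],
    }


def scraper_trust_policy(node_role_arn):
    """Trust relationship that lets the node-group role assume the scraper role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "Statement1",
            "Effect": "Allow",
            "Principal": {"AWS": node_role_arn},
            "Action": "sts:AssumeRole",
        }],
    }


def node_assume_role_policy(scraper_role_arns):
    """Permissions policy for the node-group role, one entry per scraper role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": list(scraper_role_arns),
        }],
    }


if __name__ == "__main__":
    # Example ARNs are placeholders, not real resources.
    node_arn = "arn:aws:iam::123456789012:role/example-node-instance-role"
    scraper_arn = "arn:aws:iam::123456789012:role/example-kfuse-scraper"
    print(json.dumps(scraper_trust_policy(node_arn), indent=4))
    print(json.dumps(node_assume_role_policy([scraper_arn]), indent=4))
```

Passing a list of scraper role ARNs to node_assume_role_policy covers the multi-account case described under Role ARNs.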

AWS Namespace Enrichment

Kloudfuse enriches metadata for metrics from the following AWS services. Each service includes all AWS tags plus the specific metadata fields listed below.

The awsNamespaces Value column lists the value to add under the awsNamespaces variable in the ingester section of your custom-values.yaml cluster setup file. Note that some service display names differ from the AWS CloudWatch namespace; for example, OpenSearch uses AWS/ES.

AWS Service | awsNamespaces Value | Enriched Metadata Fields

AutoScaling

AWS/AutoScaling

All AWS tags on the AutoScaling Group

Firehose

AWS/Firehose

All AWS tags on the Delivery Stream

RDS

AWS/RDS

For DB Instances:

  • allocatedstoragegb, availability_zone, backupretentionperioddays

  • dbinstancearn, dbinstanceidentifier, dbinstanceclass, dbiresourceid, dbname

  • engine, engineversion, multiaz, networktype, publicly_accessible

  • secondary_availability_zone, storagetype

  • host (DbiResourceId), hostname (Endpoint Address)

  • All AWS tags

For DB Clusters:

  • allocatedstoragegb, availability_zones, backupretentionperioddays

  • dbclusterarn, dbclusterresourceid, databasename

  • engine, enginemode, engineversion, global_write_forwarding_status

  • multiaz, networktype, storagetype

  • All AWS tags

EKS

AWS/EKS

  • arn, cluster_name, endpoint, platform_version, role_arn, status, kube_server_version

  • All AWS tags

EBS

AWS/EBS

  • availability_zone, multiattachenabled, outpostarn, size, snapshotid, state

  • throughput, volumeid, volume_type, volume_name

  • device (if attached)

  • All AWS tags

EC2

AWS/EC2

For Instances:

  • availability_zone, image_id, instance_id, instance_type, kernel

  • iam_profile (ARN), host (instance ID), autoscaling_group, service

  • All AWS tags

For NAT Gateways:

  • natgatewayid

  • All AWS tags

ELB

AWS/ELB

  • canonicalhostedzonename, canonicalhostedzonenameid, dnsname, loadbalancername, scheme, vpcid

  • host (CanonicalHostedZoneName), hostname (CanonicalHostedZoneName), name (LoadBalancerName)

  • All AWS tags

MQ

AWS/AmazonMQ

  • brokerarn, brokerid, brokername, brokerstate, deploymentmode

  • enginetype, engineversion, hostinstancetype, storagetype

  • All AWS tags

S3

AWS/S3

All AWS tags on the S3 Bucket

EFS

AWS/EFS

  • filesystemarn, name

  • aws_elasticfilesystem_default_backup (enabled/disabled)

  • All AWS tags

ELBv2

AWS/ApplicationELB, AWS/NetworkELB, AWS/GatewayELB

  • loadbalancerarn, name (LoadBalancerName), host (DNSName)

  • All AWS tags

    ELBv2 covers Application (ALB), Network (NLB), and Gateway (GWLB) load balancers, each under its own CloudWatch namespace. Include only the namespaces for the load balancer types you run.

ACM

AWS/CertificateManager

All AWS tags on the Certificate

ElastiCache

AWS/ElastiCache

For ElastiCache Clusters:

  • cache_node_type, name (CacheClusterId), engine, engine_version

  • preferred_availability_zone, replication_group

  • All AWS tags

For Serverless Caches:

  • name (ServerlessCacheName), engine, status, create_time

CloudFront

AWS/CloudFront

All AWS tags on Distributions

Route53

AWS/Route53

All AWS tags on Health Checks

SNS

AWS/SNS

All AWS tags on Topics

Redshift

AWS/Redshift

All AWS tags on Clusters

OpenSearch

AWS/ES

  • elasticsearch_version (EngineVersion), name (DomainName), dedicated_master_enabled

  • instance_type, zone_awareness_enabled, ebs_enabled

  • All AWS tags

SQS

AWS/SQS

All AWS tags on Queues

Lambda

AWS/Lambda

  • function_arn, functionname, memory_size, runtime

  • architecture (first in list), storage_size (EphemeralStorage Size)

  • All AWS tags

DynamoDB

AWS/DynamoDB

All AWS tags on Tables

ApiGateway

AWS/ApiGateway

  • apiid (Id)

  • All AWS tags

ApiGatewayV2

AWS/ApiGateway

  • apiname (Name)

  • All AWS tags

Glue

AWS/Glue

All AWS tags on Jobs

Athena

AWS/Athena

All AWS tags on WorkGroups

ECS

AWS/ECS

For Clusters and Services:

  • All AWS tags

EventBridge

AWS/Events

For Rules and Event Buses:

  • All AWS tags

Kafka

AWS/Kafka

All AWS tags on Clusters

Kinesis

AWS/Kinesis

All AWS tags on Streams

Logs

AWS/Logs

All AWS tags on Log Groups

WAF

AWS/WAFV2

For Web ACLs and Rule Groups:

  • All AWS tags

FSx

AWS/FSx

  • generation (1 or 2, for ONTAP only), file_system_type

  • All AWS tags

Bedrock

AWS/Bedrock

For Foundation Models:

  • model_name, provider_name

  • All AWS tags

QBusiness

AWS/QBusiness

For Applications:

  • display_name

  • All AWS tags

MediaConvert

AWS/MediaConvert

For Queues:

  • name, status, type

  • All AWS tags on the Queue

States (Step Functions)

AWS/States

For State Machines:

  • name, type

  • All AWS tags on the State Machine

For Activities:

  • name

  • All AWS tags on the Activity

Timestream

AWS/Timestream

For Databases:

  • database_name, arn

  • All AWS tags on the Database

For Tables:

  • table_name, database_name, arn

  • All AWS tags on the Table

SES

AWS/SES

For Configuration Sets:

  • name (ConfigurationSetName)

  • tls_policy (DeliveryOptions.TlsPolicy)

  • sending_enabled (SendingOptions.SendingEnabled)

  • All AWS tags on the Configuration Set
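
Because awsNamespaces values are case-sensitive strings, a small check before running helm upgrade can catch typos. The sketch below is illustrative and not part of Kloudfuse: the set is transcribed from the awsNamespaces Value column above, and the function name check_namespaces is hypothetical.

```python
# Values transcribed from the awsNamespaces Value column of the table above.
SUPPORTED_NAMESPACES = frozenset({
    "AWS/AutoScaling", "AWS/Firehose", "AWS/RDS", "AWS/EKS", "AWS/EBS",
    "AWS/EC2", "AWS/ELB", "AWS/AmazonMQ", "AWS/S3", "AWS/EFS",
    "AWS/ApplicationELB", "AWS/NetworkELB", "AWS/GatewayELB",
    "AWS/CertificateManager", "AWS/ElastiCache", "AWS/CloudFront",
    "AWS/Route53", "AWS/SNS", "AWS/Redshift", "AWS/ES", "AWS/SQS",
    "AWS/Lambda", "AWS/DynamoDB", "AWS/ApiGateway", "AWS/Glue", "AWS/Athena",
    "AWS/ECS", "AWS/Events", "AWS/Kafka", "AWS/Kinesis", "AWS/Logs",
    "AWS/WAFV2", "AWS/FSx", "AWS/Bedrock", "AWS/QBusiness",
    "AWS/MediaConvert", "AWS/States", "AWS/Timestream", "AWS/SES",
})


def check_namespaces(namespaces):
    """Return a list of problems; an empty list means every entry is in the table."""
    by_lower = {ns.lower(): ns for ns in SUPPORTED_NAMESPACES}
    problems = []
    for ns in namespaces:
        if ns in SUPPORTED_NAMESPACES:
            continue
        suggestion = by_lower.get(ns.lower())
        if suggestion:
            # Same name with different capitalization.
            problems.append(f"{ns}: did you mean {suggestion}?")
        elif not ns.startswith("AWS/"):
            problems.append(f"{ns}: expected the AWS/<ServiceName> format")
        else:
            problems.append(f"{ns}: not in the enrichment table")
    return problems


problems = check_namespaces(["AWS/EC2", "AWS/Elasticache", "Lambda", "AWS/Foo"])
```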

Reduce Cost of Metrics Ingestion

AWS CloudWatch metrics ingestion can be a high-cost operation. The driving factor here is the AWS CW:MetricsStreamUsage attribute, especially the MetricsUpdate statistical aggregate.

To reduce the cost of operating CloudWatch metrics ingestion, consider these factors:

Volume of Ingested Metrics

Control this by sending only the necessary Namespaces and metrics to the stream.

In other words, avoid selecting All Namespaces and All Metrics when configuring ingestion.

Some namespaces are especially costly. These include AWS NLB and AWS Lambda, because they produce both a high volume of metrics and multiple dimensions.
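
To compare the effect of different namespace selections, a rough estimate of monthly metric updates can help. The sketch below assumes each streamed metric produces one update per minute, which is an assumption and not something this page states. The price per 1,000 updates is deliberately a required parameter: take it from current AWS CloudWatch pricing. All function names are hypothetical.

```python
def monthly_metric_updates(metric_count, updates_per_hour=60, days=30):
    """Estimate metric updates per month for a number of streamed metrics."""
    # Assumption: one update per metric per minute (60 per hour).
    return metric_count * updates_per_hour * 24 * days


def monthly_stream_cost(metric_count, price_per_1000_updates,
                        updates_per_hour=60, days=30):
    """Estimate the monthly metric stream cost for a caller-supplied price."""
    updates = monthly_metric_updates(metric_count, updates_per_hour, days)
    return updates / 1000 * price_per_1000_updates


# Compare streaming all metrics with streaming a reduced selection.
all_metrics = monthly_metric_updates(1000)
selected_metrics = monthly_metric_updates(250)
savings_ratio = 1 - selected_metrics / all_metrics
```

Under these assumptions, the estimate scales linearly with the number of metrics, so every namespace removed from the stream reduces cost in proportion to the metrics it carries.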

Data Retention

Our research indicates that you should adjust the retention period of the CloudWatch metrics data by changing the retention setting for the log group of the Firehose stream.

Sampling Frequency

The frequency of data sampling by CloudWatch is controlled internally by the AWS CloudWatch implementation.