Glossary

A

ACM: ACM is the abbreviation for the AWS Certificate Manager, which creates, stores, and renews public and private SSL/TLS X.509 certificates and keys that protect AWS websites and applications.

See AWS documentation What is AWS Certificate Manager?role=external.
agent: An entity that collects logs or other metrics and sends them to the Kloudfuse Platform.
aggregation: Aggregation simplifies a large dataset into key values such as average, sum, or count. It summarizes data to reduce processing and storage costs.
alert: An alert signals a change in system health that may indicate a problem. It generates a notification so operators can investigate and take corrective action.
aggregation: Metadata attached to Kubernetes objects. Use annotations to store non-identifying data that tools and systems can read.
anomaly detection: Anomaly detection finds unexpected patterns in telemetry data. It highlights issues that need investigation.
apdex: The Application Performance Index, a single metric that represents the service quality on the scale of 0 to 1. Higher values indicate better performance, while lower values report low quality. APDEX combines the metrics of RED into a single measurement.
APM: An acronym for Application Performance Monitoring, APM tracks and monitors application performance metrics like response times, error rates, and transaction traces. APM identifies and addresses issues that impact the application’s performance and user experience.
api: A set of public methods and properties that enables users to retrieve and modify objects through REST endpoints.
APM alert: APM metric alerts are similar to metric alerts; with controls tailored specifically to APM. Use alerts at the service level on hits, errors, and a variety of latency measures.
APDEX: The Application Performance Index, a single metric that represents the service quality on the scale of 0 to 1. Higher values indicate better performance, while lower values report low quality. APDEX combines the metrics of RED into a single measurement.
ASM: An acronym for Advanced Service Monitoring, ASM consists of technologies that help manage and monitor distributed applications within a cloud environment, and provide centralized visibility into their health and performance. ASM enhances observability by providing a unified view of complex microservice interactions through centralized management, traffic management, and observability data gathering.
audit trail: An audit trail records all actions and changes in a system. It provides traceability for accountability and debugging.
autoscaling: Autoscaling automatically increases or decreases application resources in response to demand. It ensures performance and cost-efficiency.
availability: Availability is the system’s ability to consistently respond to user requests, even if some individual nodes or components are experiencing failures.

Availability ensures that users can access services without interruption regardless of potential disruptions; essentially, it means that a working node in the system will always return a response to a request, even if the data is not entirely up-to-date because of inconsistencies with other nodes that are down.

It is one of the key aspects of performance.
availability zone: A distinct and isolated data center in a cloud provider’s infrastructure, designed to provide fault tolerance and high availability for cloud-based services.

Part of the cloud filter specification.

B

blob storage

Blob storage stores large volumes of unstructured data, such as images or logs. It supports long-term storage and archiving. burn rate:: The rate at which the service consumes the error budget; refers to SLO. See Burn rate of error budget.

C

capacity planning

Capacity planning forecasts and manages the resources an application needs to handle current and future loads.

cardinality

Cardinality of a data attribute is the number of possible distinct values, or uniqueness, it can have. See also High cardinality and Low cardinality

Cardinality is important for the following reasons:

Data integrity: It defines clear relationships between tables and ensures that data is accurately linked and accessible.
Efficient queries: Understanding cardinality optimizes query execution plans. Well-defined relationships and data distribution leads to faster information retrieval and improved performance.
Database design and normalization: Cardinality is crucial in designing databases and normalizing data. It helps reduce redundancy and optimize storage for effective organization.
Data analysis and reporting: For businesses, cardinality enhances data analysis and reporting. It establishes meaningful relationships that provide insights into customer behavior and operational efficiency.

chart

A chart represents data or information in a graphical format. Most charts in Kloudfuse are line graphs, stacked bar graphs, stacked area graphs, or point graphs. Terms “chart” and “graph” are interchangeable.

cloud

In Kloudfuse, a group of filters that identify the attributes of cloud-based services.

Includes availability zone, cloud account id, instance type, project, and region.

cloud account id

Unique identifier for a user or organization’s account in a public cloud provider’s system.

Part of the cloud filter specification.

cold hot storage

The process of restoring archived logs into queryable storage. Use it for investigating older issues.

consistency

Consistency is the implementation of consistent user interfaces and interaction patterns across different parts of the distributed system, regardless of which component is currently handling the request.

It is one of the key aspects of usability.

container

In Kloudfuse, a group of filters that identify the attributes of cloud-based container-based virtualizations. Includes container id.

container id

Unique identifier for a container.

Part of the container filter specification.

counter

A cumulative metric that represents a single monotonically increasing counter, where the value can only increase, or be reset to zero on restart.

custom resource

An extension of the Kubernetes API. Used to define and manage custom configurations like ClusterLogging and ClusterLogForwarder.

D

Data lake

A data lake stores logs, metrics, traces, and events in one place so you can easily correlate and analyze them.

dimensionality

Dimensionality refers to the number of attributes in a dataset. Higher dimensionality increases data complexity.

Disaster Recovery Plan

A documented strategy, abbreviated as DRP, that outlines how an organization can quickly resume its critical IT operations using backup systems and procedures to minimize downtime and data loss after a disruptive event — a natural disaster, power outage, cyber attack, or hardware failure.

The key points of an effective DRP are:

Focus on critical systems
Risk assessment
Data backup strategy
Recovery point objective (RPO)
Recovery time objective (RTO)

The last two points are the parameters that define how long a business can afford to be offline, and how much data loss it can tolerate.

display container name

The name of a container within a system or application; used for display purposes or identification.

Part of the Kubernetes filter specification.

distributed tracing

Distributed tracing follows a request across multiple services to pinpoint where issues occur.

duration

Duration of a span is the difference between the span end time and the span start time. Contrast with execution time.

E

eBPF

Extended Berkeley Packet Filter (eBPF) is a Linux kernel technology that enables users to run programs in a protected environment within the kernel. eBPF programs loaded into the kernel using the bpf(2) syscall. The user provides them as a binary blobs of eBPF machine instructions. eBPF allows you to:

Reprogram the Linux kernel without rebooting the system
Run user-supplied programs to extend kernel functionality
Collect kernel statistics, monitor, and debug
Detect malware
Inspect traffic
Perform continuous profiling of applications
Run tracing and profiling scripts on a Kubernetes cluster
Filter data packets from networks and embed them into the kernel
Monitor system calls, network traffic, and system behavior at both the kernel and socket levels

env

Represents the environment or deployment stage ( development, staging, production) of a system or application.

environment

Similar to env, represents the environment or deployment context.

events

An event is something that has happened in a system at a point in time. Events are discrete occurrences with precise temporal and numerical values. Through events, we can track crucial indidences, and detect potential problems associated with user requests. Because events are very time-sensitive, they include timestamps.

execution time

Total time that a span is active , not waiting for a child span to complete, scaled according to the number of concurrent active spans.

Contrast with duration.

exporter

An exporter collects metrics from a system and converts them into a format that observability platforms like Kloudfuse can process.

F

facets: Attributes that the ingester extracts from a log line.

See log event, log line, log labels, and fingerprint.
fingerprint: An automatically-detercted structure of a log line.

Can be used to effectively filter logs

See log event, log line, log labels, and facet.
flame graph: A flame graph is a visual representation of profiling data that analyzes call stacks and identifies performance hotspots.

Each horizontal bar represents a function call and its width is a measure of much time is spent in that function. This representation enables you to easily see which parts of the code are taking the most time to execute.
fluentd: A log collector that runs on every node. Collects logs from containers and system services and forwards them to a defined output like Elasticsearch or S3.
FuseQL: Kloudfuse developed FuseQL as a query language for searching across logs data for a range of applications. It has flexible parameters for answering highly complex questions. See FuseQL.

G

gauge: A metric that represents a single numerical value that can increase or decrease.
GraphQL: This query language fetches data from multiple data sources with a single API call. See GraphQL.
gRPC: gRPC is a high-performance, open-source framework for remote procedure calls (RPCs). gRPC enables client applications to call methods on a server application on a different machine as if it were local, making it easier to create distributed applications and services.

H

high cardinality: High cardinality is a condition when individual attributes within a dataset have a large number of unique values, compared to the total number of entries. We also refer to high cardinality when combinations of dimensions lead to a large number of unique permutations.

For example, an Email address column in a typical employee table naturally has high cardinality because each person has a unique email address.
HIPPA: HIPAA is the accepted abbreviation for the Health Insurance Portability and Accountability Act of 1996. It is the United States federal law that aims to protect patient health information, and also ensure that the insurance coverage is portable. It sets national standards for the protection of Protected Health Information (PHI).
histogram: A metric that tracks the distribution of observed values. A histogram samples observations (request durations, response sizes, and so on) and counts them in configurable buckets. It also provides a sum of all observed values.
http: An accronym for Hypertext Transfer Protocol, http is a set of rules that govern how to exchange information between a client and a web server.

HTTP is a request-response model, where a client sends a request to a server and the server responds with the requested content and status information.

While designed for communication between web browsers and web servers, HTTP can be used for machine-to-machine communication, programmatic access to APIs, and many more.

I

indexing: Organizes log data into searchable structures. Helps speed up queries by mapping data to indexed fields.
infrastructure: Infrastructure is the foundational layer, or backbone, of hardware, software, and resources that support the operation, deployment, and scalability of software applications and services. It includes a range of components and technologies that deliver computing, networking, storage, and other services for running and maintaining software systems.
instance type: A predefined configuration of virtual hardware resources in a cloud provider’s infrastructure.

Part of the cloud filter specification.
instrumentation: Instrumentation adds code that captures performance and telemetry data from your app.
instrumentation-quality: A feature in Cloud Observability that reports how complete and consistent telemetry data is for a given service.

J

just in time provisioning: Automatically creates a user account in Cloud Observability the first time someone logs in with Identity Provider (IdP) credentials. Assigns a default role during provisioning.

K

key value store: A key-value store is a NoSQL database that stores data as pairs for quick access and updates.
kube cronjob: A Kubernetes resource that creates jobs that run at specified intervals, defined by a cronological schedule. Performs regular scheduled actions such as backups, report generation, and so on.

Part of the Kubernetes filter specification.
kube container name: The name of a container that runs within a Kubernetes pod, a basic building block of Kubernetes applications.

Part of the Kubernetes filter specification.
kube cluster name: The name of a Kubernetes cluster, a group of nodes that run containerized applications managed by Kubernetes.

Part of the Kubernetes filter specification.
kube daemon set: A Kubernetes resource that ensures a specific pod runs on all or selected nodes within a cluster.

Part of the Kubernetes filter specification.
kube deployment: A Kubernetes resource that represents a set of identical pods, ensuring application scalability and fault tolerance.

The filter includes a search option.

Part of the Kubernetes filter specification.
kube job: A Kubernetes resource that manages batch tasks or processes that run to completion.

Part of the Kubernetes filter specification.
kube-namespace: A logical partition or virtual cluster that isolates and organizes resources within a Kubernetes cluster.

Part of the Kubernetes filter specification.
kube node: A worker machine within a Kubernetes cluster, responsible for running containers and managing their lifecycle.

Part of the Kubernetes filter specification.
kube replica set: A Kubernetes resource that ensures a specified number of replicas (identical pods) are running at all times.

Part of the Kubernetes filter specification.
kube replica service: A method for exposing a network application that runs as one or more pods in your cluster.

Part of the Kubernetes filter specification.
kube stateful set: Runs a group of pods and maintains a "sticky" identity for each; useful for managing applications that need persistent storage or a stable, unique network identity.

Part of the Kubernetes filter specification.
Kubernetes: In Kloudfuse, a group of filters that identify the attributes of services running in Kubernetes.

These include display container name, image name, image tag, kube cluster name, kube container name, kube cronjob, kube daemon set, kube deployment, kube job, kube namespace, kube-node, kube-replica-set, kube service kube stateful set, and pod-name.
kubernetes api server: The core service that validates and manages Kubernetes API requests. Acts as the gateway for controlling cluster state.

L

language: In Kloudfuse, a group of filters that identify the programming language of the integrated telemetry software development kit (SDK), including go, nodejs, cpp, java, dotnet, rust, python, php, and ruby.
latency: Latency is the time taken for a single operation to complete, often measured as response time.

It is one of the key aspects of performance.
Lead time: Lead time for changes tracks how long it takes code to go from commit to production.
load-balancing: Load balancing spreads traffic across servers to improve reliability and performance.
log aggregation: Log aggregation collects logs from many sources and stores them in one place for analysis.
log event: A discrete log entry, with associated information: log line, log labels, facets, and fingerprint.
log labels: The key/value pairs associated with the log line.

These can include Kubernetes attributes, Cloud infrastructure attributes, and so on. These typically originate outside a log line.

See log event, log line, facets, and fingerprint.
log level: The severity of a log event - info, error, warning, trace, debug, or notice.
log line: This is the emitted log message. In practical terms, the maximum size of a log line is 1 MB.
See log event, log labels, facets, and fingerprint.
log management: Log management organizes, indexes, and analyzes logs to help you troubleshoot and monitor systems.
log rehydration: Moves logs from cold storage to hot storage for querying and analysis. Useful for investigating historical issues without always paying for high-cost storage.
LogQL: LogQL is Grafana Loki’s PromQL-inspired query language for searching across logs. See LogQL.
logs: Logs record all activities that occur within your system; they are a history of the system’s behavior during a specific time interval. Logs are essential for successful debugging. When we parse log data, we can develop an insight into application performance that we cannot determine through APIs or application databases.

Logs can use various formats, such as plain text or JSON objects, accessed through a range of querying techniques, making them the most useful inputs when monitoring application performance, investigating security threats, and addressing performance issues.
low cardinality: This conditions describes situations of low counts of unique values in the dataset, compared to the total number of entries.

For example, a Gender column in a typical employee table naturally has low cardinality; it may have a large number of entries, yet a low number of options: female, male, non-binary, and declined to answer.

M

MELT: An acronym for Metrics, Events, Logs, and Traces, MELT is a framework that provides insights into system performance, health, and behavior. MELT can help teams quickly identify, diagnose, and resolve issues, while optimizing system performance.
metric cardinality: The number of unique metric time series from the combination of metric name and dimensions.
metrics: Metrics are numerical measurements that provide an insight into a system’s performance. Use metrics, such as error rate and CPU% use, in mathematical modeling and forecasting to represent specific data structures.

Monitoring metrics as part of your observability strategy helps when constructing dashboards that display past trends across multiple services, facilitating extended data retention and simplifying queries.

N

node: A node is a single machine or virtual instance in a network or cluster that performs work and runs services or containers.
noise: Noise is the set of irrelevant or low-priority alerts that distract from real issues. Reducing noise helps teams focus on what actually needs attention.

O

O11y: O11y is shorthand for “observability.” It helps you understand your system using external signals like logs, metrics, and traces.
observability: An observability platform aggregates and visualizes telemetric data collected from application and infrastructure components in a distributed environment. It monitors and analyzes application behavior, transactional data, and the various types of infrastructure that support application delivery, making it possible to proactively address issues before they become serious concerns.

Beyond monitoring capabilities, observability provides deeper insights into the data and optimizes performance, ensurses availability, and improves customer experience.

At its best, observability enables users to ask arbitrary questions about their environment, without knowing in advance what you may want to know.

It is one of the key aspects of usability.
OTLP: OTLP(OpenTelemetry Protocol) defines how telemetry data moves between systems, collectors, and observability platforms.

P

performance: Performance is the efficiency and effectiveness of the system in handling workloads by distributing tasks across multiple nodes. It is one of the key aspects of usability.

We measure performance through factors like throughput, latency, scalability, and availability, ensuring the system can respond quickly and reliably under high demand, while minimizing resource usage across all nodes.

The key aspects of performance are throughput, latency, scalability, and availability.
pg: The PostgreSQL network protocol that governs how clients and servers interact in a PostgreSQL environment.
pod: The smallest deployable unit of computing that you can create and manage in Kubernetes.
pod name: The name of a specific pod, representing a single instance of a running process in a Kubernetes cluster.

Includes a search option.

Part of the Kubernetes filter specification.
project: A named, isolated environment within a cloud account for organizing and managing resources.

Part of the cloud filter specification.
PromQL: Prometheus Query Language enables you to select and aggregate metric time series data in real time. See PromQL.
protocol: Protocol refers to the methodology for network communication, such as http or pg.
protobuf: Shortened form of Protocol Buffers, protobuf is an efficient, language-agnostic data serialization mechanism for defining structured data in a . proto file. You can use it to generate source code that writes and reads data from different data streams.

Q

query: A query is a request or command that retrieves, manipulates, or manages data stored in a database or information system.

We author queries in a specific query language, such as SQL for relational databases or other query languages designed for different types of databases or data sources.

For example, the Kloudfuse proprietary FuseQL is particularly powerful in a wide range of applications, with flexible parameters for answering highly complex questions.
query type: Represents the type or category of a query performed on a database or data store, such as SELECT, INSERT, UPDATE, or DELETE.

R

raw data

Raw data is source data that has not been transformed, organized, or analysed; it is in its original form.

RED

Refers to RED Metrics: and acronym for Requests, Errors, and Duration (Latency).

redundancy

Redundancy means having backup components in place to ensure service continues even if a part fails. This improves reliability and fault tolerance.

region

A geographical location that hosts a cloud provider’s data centers.

Part of the cloud filter specification.

reliability

Reliability is the ability of a system, application, or service to consistently perform its intended functions under specific conditions for a specified period. Along with availability and usability, reliability is a key attribute of quality in software and services.

Reliability requires careful design, engineering, and ongoing monitoring and maintenance, and ensures that users and organizations get the fundamental system performance that they expect.

resilience

Resilience is the ability of a system or application to gracefully and effectively handle unexpected failures, disruptions, errors, or adverse conditions while maintaining core functionality and minimizing the impact on users and operations. It is one of the key aspects of usability.

Resilience ensures that software systems can continue to operate under adverse circumstances, recover quickly from failures, and maintain a high level of availability and performance.

resource

Resources represent a particular domain of a customer application - they are typically an instrumented web endpoint, database query, or background job.

rollup

An aggregation of data points over time, such as a weekly average or percentile.

RPO (Recovery Point Objective)

The maximum amount of data the organization can tolerate losing; the goal an organization sets for the maximum length of time it can take to restore normal operations following an outage or data loss.

This parameter is measured in time: from the moment a failure occurs to your last valid data backup.

Key points about RPO:

Data Loss Tolerance: how much data loss is acceptable to an organization during a recovery process.
Backup Frequency: To achieve a specific RPO, organizations must implement backup strategies with appropriate frequencies.
Business Impact Analysis: Considering the criticality of data and potential consequences of data loss.

RTO (Recovery Time Objective)

The maximum amount of time that it is acceptable to wait to restore a system or service after a disruption.

This ensures that normal operations can be restored as quickly as possible after a disaster, to avoid significant disruption to a business.

The factors that affect RTO are:

Business impact: How much revenue is lost, and how much the disruption affects business continuity.
System criticality: How important the system is to the business.
Application dependencies: Whether the application depends on other applications.
Compliance requirements: Whether the application is subject to any external compliance or regulatory requirements. How is RTO measured? RTO is usually expressed in hours or days.

RUM

RUM is an acronym for Real User Monitoring, a technology that measures and tracks the end-user experience of an application or website. It is a core feature of APM, and a key component for gaining real-world observability into web performance and service availability.

S

sampling: In sampling, we programmatically select a subset or representative group of data points or items from a larger dataset for analysis, testing, or inspection.

Sampling is a common when it is impractical or resource-intensive to work with an entire dataset, especially when dealing with large volumes of data.

In observability and performance monitoring, we sample logs, traces, and events to summarize information about a service or activity. When sampling your information, there is a constant tradeoff between granularity, system-representative accuracy, cost, performance, and relevancy.
scalability: Scalability is the ability to handle increasing workloads by adding more nodes to the system without significant performance degradation.

It is one of the key aspects of performance.
scheduled view: Pre-aggregated datasets that Kloudfuse generates at scheduled intervals to improve query performance and efficiency. Instead of expensive real-time queries on raw data, scheduled views store advance results, enabling faster access to summarized information.
service: Services are the building blocks of modern microservice architectures - broadly, a service groups together endpoints, queries, or jobs to build your application.
[service-map]]service-map: The Service Map is a graphical workflow representation of all services in the deployment.
service entry span: A service entry span records the entry point method for a request to a service. Kloudfuse APM interface shows service entry spans when the immediate parent on a flame graph is a different color than the span.
[shards]]shards: Subdivisions of an Elasticsearch index. Improve performance by parallelizing data storage and query execution.
Service Level Agreement (SLA): SLAs define the level of service and performance standards that the service provider is expected to meet. SLAs serve as a means of establishing clear expectations, responsibilities, and consequences for service quality and delivery.

A Service Level Agreement is a formal and legally-binding contract or agreement between a service provider and a customer or client. Most IT contracts include an SLA for the protection of both the vendor and the customer.
Service Level Indicator (SLI): Acronym for Service Level Indicator, this metric measures how well a service meets customer expectations. SLIs gauge the reliability of a service, and report the percentage of the service effectiveness. To calculate the SLI, divide the number of good events by the total number of valid events, and then multiply by 100.

SLIs are part of SLOs, which are part of SLAs. They measure compliance with SLOs. If an SLA states that a system should be available 99.95% of the time, the SLI is the actual uptime measurement.

SLIs help organizations identify issues, improve performance, and meet customer expectations.
Service Level Objective (SLO): SLOs are critical components of SLAs. They establish clear, measurable expectations for service providers and stakeholders, help to ensure that the service meets the needs of its users, and align with business or operational requirements. The best SLOs set a minimum standard for performance, are chosen thoughtfully with business objectives in mind, and focus only on measurable metrics.

A Service Level Objective is a specific and quantifiable target or goal that defines the level of performance, reliability, or quality that a service or system must achieve.
service map: The Service Map is a graphical workflow representation of all services in the deployment.
span: Spans are integral parts of a distributed system, and the basic element in distributed tracing. A span represents a logical unit of work in a distributed system for a specified period. It is a single operation within a trace. Multiple spans construct a trace.
span tags: You can tag spans in APM, as key-value pairs, to correlate requests and filters.
span type: In Kloudfuse, a group of filters that identify the environment where the span runs. Includes http, grpc, db, queue, and custom.
source: The source application that emits logs or other metrics.

This is a mandatory field for each log line. Users can configure the system to attach an appropriate source name to their logs.
synthetic-monitoring: Synthetic monitoring uses scripted tests that simulate user interactions with an app to measure performance and availability before users encounter issues.

T

table-name

Refers to the name of a specific table within a database or data store where data is organized and stored.

taint

A label on a node that restricts which pods can run on it. Prevents unwanted scheduling.

telemetry

Telemetry data is automatically collected and transmitted from remote or inaccessible sources to a centralized location for monitoring and analysis. Metrics, events, logs, and traces each provide crucial insights into the application’s performance, latency, throughput, and resource utilization.

We can use telemetry data to observe system performance, recognize potential problems, detect irregularities, and investigate issues.

We can extract telemetry data from:

Application logs
System logs
Network traffic
Third-party services
APIs

telemetry type

The telemetry type is the specific category or classification of data collected through telemetry, such as logs, events, distributed tracing, and so on.

threshold

A threshold is a defined limit (such as CPU usage > 80%) that, when exceeded, triggers an alert or action.

throughput

Throughput is the number of operations or requests a system can process per unit of time.

It is one of the key aspects of performance.

time series

Time series is a set of data points indexed in chronological time order. Commonly, a time series is a sequence of successive equally-spaced points in time, or a sequence of discrete-time data.

tolerance

In DBSCAN, the tolerance level, or eps, determines the clustering radius of the neighborhood around each point. The eps controls the sensitivity of outlier detection. A lower tolerance detects more subtle outliers, while a higher tolerance detects only the most significant deviations.

trace

A trace is a collection of operations that represents a unique transaction handled by an application and its constituent services. It is the path of a request or workflow as it progresses from one component of the system to another, capturing the entire request flow through a distributed system.

Traces expose the directionality and relationships between two data points, service interactions, and the effects of asynchrony. When analyzing trace data, we better understand the performance and behavior of a distributed system.

Some examples of traces include:

A SQL query execution
A function call during a user authentication request

trace metrics

Trace metrics identify and alert on hits, errors, or latency. Kloudfuse automatically collects trace metrics and retains them for 15 months, similar to other Kloudfuse metrics.

trace root span

A root span tracks the entry point method for the trace. Its start marks the beginning of the trace.

TraceQL

TraceQL is Grafana’s query language that selects traces. See TraceQL.

traces

Some examples of traces include:

A SQL query execution
A function call during a user authentication request

transparency

The user should not have to be aware of the system’s distributed nature, and should be able to interact with it as if it were a single, centralized system.

It is one of the key aspects of usability.

U

usability

Usability is a measure of how easy and intuitive it is for users to interact with a distributed system, without needing to understand the complex underlying architecture or the individual components.

Good usability ensures a seamless and efficient user experience despite the distributed nature of the system, making a complex system feel like a single, unified entity to the user.

The key aspects of usability are transparency, consistency, resilience, performance, and observability.

USE

The USE method for monitoring system performance focuses on three key metrics: Utilization, Saturation, and Errors.

Utilization: The average time that the resource was busy servicing work, including CPU utilization and Memory utilization.

Saturation: How much workload your system handles beyond its capacity, reflecting the amount of work that is waiting to be processed. It includes Queue Lengths and Input/Output Wait Times.

Errors: Count the number of problems or failures that occur in the system. These include both system and application errors.

V

vertical scaling: Vertical scaling means increasing a system’s capacity by adding more power (CPU, RAM) to a single server or node.
visualization: Visualization is the graphical representation of data or information. It means that we creatie visual representations (charts, graphs, diagrams, maps, and dashboards) to make complex data more understandable, accessible, and interpretable.

Visualization is a powerful tool for gaining insights from data, presenting information, and conveying patterns or trends.

W

workload: A workload is the amount and type of processing done by an application or service. It includes user requests, background jobs, and system tasks.

X

xml logs: XML logs are structured log files formatted using XML, often used in legacy systems. They’re machine-readable, but harder to query than JSON.

Z

zone awareness: Zone awareness is a deployment strategy that distributes workloads across multiple availability zones to improve fault tolerance and availability.