Glossary

A

ACM

ACM is the abbreviation for the AWS Certificate Manager, which creates, stores, and renews public and private SSL/TLS X.509 certificates and keys that protect AWS websites and applications.

See the AWS documentation: What is AWS Certificate Manager?

agent

An entity that collects logs or other metrics and sends them to the Kloudfuse Platform.

APM

An acronym for Application Performance Monitoring, APM tracks and monitors application performance metrics like response times, error rates, and transaction traces. APM identifies and addresses issues that impact the application’s performance and user experience.

APM alert

APM metric alerts are similar to metric alerts, with controls tailored specifically to APM. Use them at the service level to alert on hits, errors, and a variety of latency measures.

APDEX

The Application Performance Index, a single metric that represents service quality on a scale of 0 to 1. Higher values indicate better performance, while lower values indicate poorer quality. APDEX combines the metrics of RED into a single measurement.
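As an illustration, the standard Apdex formula is computed from a response-time threshold t: requests at or below t are "satisfied", those up to 4t are "tolerating", and the rest are "frustrated". A sketch in Python, with invented sample data:

```python
def apdex(response_times, t):
    """Apdex = (satisfied + tolerating / 2) / total, on a 0-to-1 scale."""
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)

# Threshold t = 0.5 s: two satisfied, one tolerating, one frustrated.
score = apdex([0.2, 0.4, 1.1, 3.0], t=0.5)  # (2 + 0.5) / 4 = 0.625
```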

ASM

An acronym for Advanced Service Monitoring, ASM consists of technologies that help manage and monitor distributed applications within a cloud environment, and provide centralized visibility into their health and performance. ASM enhances observability by providing a unified view of complex microservice interactions through centralized management, traffic management, and observability data gathering.

availability

Availability is the system’s ability to consistently respond to user requests, even if some individual nodes or components are experiencing failures.

Availability ensures that users can access services without interruption regardless of potential disruptions; essentially, it means that a working node in the system will always return a response to a request, even if the data is not entirely up-to-date because of inconsistencies with other nodes that are down.

It is one of the key aspects of performance.

availability zone

A distinct and isolated data center in a cloud provider’s infrastructure, designed to provide fault tolerance and high availability for cloud-based services.

Part of the cloud filter specification.

C

chart

A chart represents data or information in a graphical format. Most charts in Kloudfuse are line graphs, stacked bar graphs, stacked area graphs, or point graphs. The terms “chart” and “graph” are interchangeable.

cloud

In Kloudfuse, a group of filters that identify the attributes of cloud-based services.

cloud account id

Unique identifier for a user or organization’s account in a public cloud provider’s system.

Part of the cloud filter specification.

consistency

Consistency is the implementation of consistent user interfaces and interaction patterns across different parts of the distributed system, regardless of which component is currently handling the request.

It is one of the key aspects of usability.

container

In Kloudfuse, a group of filters that identify the attributes of cloud-based container-based virtualizations. Includes container id.

container id

Unique identifier for a container.

Part of the container filter specification.

counter

A cumulative metric that represents a single monotonically increasing counter, where the value can only increase, or be reset to zero on restart.
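A minimal sketch of counter semantics in Python (illustrative only, not the Kloudfuse implementation): the value can only increase, or reset to zero.

```python
class Counter:
    """A monotonically increasing value; it only goes up,
    or resets to zero (for example, on process restart)."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount

    def reset(self):
        self.value = 0.0

c = Counter()
c.inc()
c.inc(5)
# c.value is now 6.0
```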

cardinality

Cardinality of a data attribute is the number of possible distinct values, or uniqueness, it can have.

Cardinality is important for the following reasons:

Data integrity

It defines clear relationships between tables and ensures that data is accurately linked and accessible.

Efficient queries

Understanding cardinality optimizes query execution plans. Well-defined relationships and data distribution leads to faster information retrieval and improved performance.

Database design and normalization

Cardinality is crucial in designing databases and normalizing data. It helps reduce redundancy and optimize storage for effective organization.

Data analysis and reporting

For businesses, cardinality enhances data analysis and reporting. It establishes meaningful relationships that provide insights into customer behavior and operational efficiency.
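Concretely, the cardinality of a column is the count of its distinct values. An illustrative Python sketch with invented table data:

```python
# Cardinality of a column = number of distinct values it holds.
rows = [
    {"email": "ana@example.com", "dept": "eng"},
    {"email": "bob@example.com", "dept": "eng"},
    {"email": "cai@example.com", "dept": "sales"},
]

def cardinality(rows, column):
    return len({row[column] for row in rows})

# "email" has cardinality 3 (unique per row: high cardinality);
# "dept" has cardinality 2 (few options: low cardinality).
```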

D

Disaster Recovery Plan

A documented strategy, abbreviated as DRP, that outlines how an organization can quickly resume its critical IT operations using backup systems and procedures to minimize downtime and data loss after a disruptive event such as a natural disaster, power outage, cyberattack, or hardware failure.

display container name

The name of a container within a system or application; used for display purposes or identification.

Part of the Kubernetes filter specification.

duration

Duration of a span is the difference between the span end time and the span start time. Contrast with execution time.

E

eBPF

Extended Berkeley Packet Filter (eBPF) is a Linux kernel technology that enables users to run programs in a protected environment within the kernel. eBPF programs are loaded into the kernel using the bpf(2) syscall; the user provides them as binary blobs of eBPF machine instructions. eBPF allows you to:

  • Reprogram the Linux kernel without rebooting the system

  • Run user-supplied programs to extend kernel functionality

  • Collect kernel statistics, monitor, and debug

  • Detect malware

  • Inspect traffic

  • Perform continuous profiling of applications

  • Run tracing and profiling scripts on a Kubernetes cluster

  • Filter data packets from networks and embed them into the kernel

  • Monitor system calls, network traffic, and system behavior at both the kernel and socket levels

env

Represents the environment or deployment stage (development, staging, production) of a system or application.

environment

Similar to env, represents the environment or deployment context.

events

An event is something that has happened in a system at a point in time. Events are discrete occurrences with precise temporal and numerical values. Through events, we can track crucial incidents and detect potential problems associated with user requests. Because events are very time-sensitive, they include timestamps.

execution time

Total time that a span is active, not waiting for a child span to complete, scaled according to the number of concurrent active spans.

Contrast with duration.
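A simplified sketch of the distinction in Python: duration is simply end minus start, while execution time here subtracts time spent waiting on child spans. This sketch ignores the concurrency scaling mentioned above, and the span fields are invented.

```python
def duration(span):
    # Duration = span end time minus span start time.
    return span["end"] - span["start"]

def execution_time(span, children):
    # Simplified: active time = duration minus time spent
    # waiting for child spans (no concurrency scaling).
    waiting = sum(duration(c) for c in children)
    return duration(span) - waiting

parent = {"start": 0.0, "end": 10.0}
children = [{"start": 1.0, "end": 4.0}, {"start": 5.0, "end": 8.0}]
# duration(parent) == 10.0; execution_time(parent, children) == 4.0
```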

F

facets

Attributes that the ingester extracts from a log line.

fingerprint

An automatically detected structure of a log line.

Fingerprints can be used to filter logs effectively.

flame graph

A flame graph is a visual representation of profiling data that analyzes call stacks and identifies performance hotspots.

Each horizontal bar represents a function call, and its width is a measure of how much time is spent in that function. This representation enables you to easily see which parts of the code are taking the most time to execute.

FuseQL

Kloudfuse developed FuseQL as a query language for searching across logs data for a range of applications. It has flexible parameters for answering highly complex questions. See FuseQL.

G

gauge

A metric that represents a single numerical value that can increase or decrease.
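A minimal sketch of gauge semantics in Python (illustrative only): unlike a counter, the value can move in either direction or be set directly.

```python
class Gauge:
    """A value that can go up and down, such as current memory use."""
    def __init__(self):
        self.value = 0.0

    def set(self, v):
        self.value = v

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount

g = Gauge()
g.set(10)
g.dec(3)
# g.value is now 7.0
```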

GraphQL

This query language fetches data from multiple data sources with a single API call. See GraphQL.

gRPC

gRPC is a high-performance, open-source framework for remote procedure calls (RPCs). gRPC enables client applications to call methods on a server application on a different machine as if it were local, making it easier to create distributed applications and services.

H

high cardinality

High cardinality is a condition in which individual attributes within a dataset have a large number of unique values compared to the total number of entries. We also refer to high cardinality when combinations of dimensions lead to a large number of unique permutations.

For example, an Email address column in a typical employee table naturally has high cardinality because each person has a unique email address.

histogram

A metric that tracks the distribution of observed values. A histogram samples observations (request durations, response sizes, and so on) and counts them in configurable buckets. It also provides a sum of all observed values.
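A minimal sketch of histogram semantics in Python (illustrative only; the bucket bounds are invented): each observation is counted in the bucket whose upper bound it falls under, with a final slot for values above all bounds, and a running sum is kept.

```python
import bisect

class Histogram:
    """Counts observations into buckets by upper bound and tracks their sum."""
    def __init__(self, buckets):
        self.buckets = sorted(buckets)          # upper bounds, e.g., seconds
        self.counts = [0] * (len(buckets) + 1)  # last slot catches overflow
        self.sum = 0.0

    def observe(self, value):
        # bisect_left finds the first bucket whose bound is >= value.
        self.counts[bisect.bisect_left(self.buckets, value)] += 1
        self.sum += value

h = Histogram([0.1, 0.5, 1.0])
for v in [0.05, 0.3, 0.7, 2.0]:
    h.observe(v)
# h.counts == [1, 1, 1, 1]; h.sum is the total of all observed values
```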

http

An acronym for Hypertext Transfer Protocol, HTTP is a set of rules that govern how to exchange information between a client and a web server.

HTTP is a request-response model, where a client sends a request to a server and the server responds with the requested content and status information.

While designed for communication between web browsers and web servers, HTTP can be used for machine-to-machine communication, programmatic access to APIs, and many more.

I

image name

The name of a container image used to create instances of containers with specific functionalities.

Part of the Kubernetes filter specification.

image tag

A label or version number associated with a container image, specifying its version or configuration.

Part of the Kubernetes filter specification.

infrastructure

Infrastructure is the foundational layer, or backbone, of hardware, software, and resources that support the operation, deployment, and scalability of software applications and services. It includes a range of components and technologies that deliver computing, networking, storage, and other services for running and maintaining software systems.

instance type

A predefined configuration of virtual hardware resources in a cloud provider’s infrastructure.

Part of the cloud filter specification.

K

kube cronjob

A Kubernetes resource that creates jobs that run at specified intervals, defined by a cron schedule. It performs regularly scheduled actions such as backups, report generation, and so on.

Part of the Kubernetes filter specification.

kube container name

The name of a container that runs within a Kubernetes pod, a basic building block of Kubernetes applications.

Part of the Kubernetes filter specification.

kube cluster name

The name of a Kubernetes cluster, a group of nodes that run containerized applications managed by Kubernetes.

Part of the Kubernetes filter specification.

kube daemon set

A Kubernetes resource that ensures a specific pod runs on all or selected nodes within a cluster.

Part of the Kubernetes filter specification.

kube deployment

A Kubernetes resource that represents a set of identical pods, ensuring application scalability and fault tolerance.

The filter includes a search option.

Part of the Kubernetes filter specification.

kube job

A Kubernetes resource that manages batch tasks or processes that run to completion.

Part of the Kubernetes filter specification.

kube-namespace

A logical partition or virtual cluster that isolates and organizes resources within a Kubernetes cluster.

Part of the Kubernetes filter specification.

kube node

A worker machine within a Kubernetes cluster, responsible for running containers and managing their lifecycle.

Part of the Kubernetes filter specification.

kube replica set

A Kubernetes resource that ensures a specified number of replicas (identical pods) are running at all times.

Part of the Kubernetes filter specification.

kube stateful set

Runs a group of pods and maintains a "sticky" identity for each; useful for managing applications that need persistent storage or a stable, unique network identity.

Part of the Kubernetes filter specification.

Kubernetes

In Kloudfuse, a group of filters that identify the attributes of services running in Kubernetes.

L

language

In Kloudfuse, a group of filters that identify the programming language of the integrated telemetry software development kit (SDK), including go, nodejs, cpp, java, dotnet, rust, python, php, and ruby.

latency

Latency is the time taken for a single operation to complete, often measured as response time.

It is one of the key aspects of performance.

log event

A discrete log entry, with associated information: log line, log labels, facets, and fingerprint.

log labels

The key/value pairs associated with the log line.

These can include Kubernetes attributes, Cloud infrastructure attributes, and so on. These typically originate outside a log line.

log level

The severity of a log event - info, error, warning, trace, debug, or notice.

log line

This is the emitted log message. See log event, log labels, facets, and fingerprint.

LogQL

LogQL is Grafana Loki’s PromQL-inspired query language for searching across logs. See LogQL.

logs

Logs record all activities that occur within your system; they are a history of the system’s behavior during a specific time interval. Logs are essential for successful debugging. When we parse log data, we can develop an insight into application performance that we cannot determine through APIs or application databases.

Logs can use various formats, such as plain text or JSON objects, accessed through a range of querying techniques, making them the most useful inputs when monitoring application performance, investigating security threats, and addressing performance issues.

low cardinality

This condition describes situations with low counts of unique values in a dataset, compared to the total number of entries.

For example, a Gender column in a typical employee table naturally has low cardinality; it may have a large number of entries, yet a low number of options: female, male, non-binary, and declined to answer.

M

MELT

An acronym for Metrics, Events, Logs, and Traces, MELT is a framework that provides insights into system performance, health, and behavior. MELT can help teams quickly identify, diagnose, and resolve issues, while optimizing system performance.

metrics

Metrics are numerical measurements that provide an insight into a system’s performance. Use metrics, such as error rate and CPU utilization, in mathematical modeling and forecasting to represent specific data structures.

Monitoring metrics as part of your observability strategy helps when constructing dashboards that display past trends across multiple services, facilitating extended data retention and simplifying queries.

O

observability

An observability platform aggregates and visualizes telemetric data collected from application and infrastructure components in a distributed environment. It monitors and analyzes application behavior, transactional data, and the various types of infrastructure that support application delivery, making it possible to proactively address issues before they become serious concerns.

Beyond monitoring capabilities, observability provides deeper insights into the data and optimizes performance, ensures availability, and improves customer experience.

At its best, observability enables users to ask arbitrary questions about their environment, without knowing in advance what they may want to know.

It is one of the key aspects of usability.

P

performance

Performance is the efficiency and effectiveness of the system in handling workloads by distributing tasks across multiple nodes. It is one of the key aspects of usability.

We measure performance through factors like throughput, latency, scalability, and availability, ensuring the system can respond quickly and reliably under high demand, while minimizing resource usage across all nodes.

The key aspects of performance are throughput, latency, scalability, and availability.

pg

The PostgreSQL network protocol that governs how clients and servers interact in a PostgreSQL environment.

pod

The smallest deployable unit of computing that you can create and manage in Kubernetes.

pod name

The name of a specific pod, representing a single instance of a running process in a Kubernetes cluster.

Includes a search option.

Part of the Kubernetes filter specification.

project

A named, isolated environment within a cloud account for organizing and managing resources.

Part of the cloud filter specification.

PromQL

Prometheus Query Language enables you to select and aggregate metric time series data in real time. See PromQL.

protocol

Protocol refers to the methodology for network communication, such as http or pg.

protobuf

Shortened form of Protocol Buffers, protobuf is an efficient, language-agnostic data serialization mechanism for defining structured data in a .proto file. You can use it to generate source code that writes and reads data from different data streams.

Q

query

A query is a request or command that retrieves, manipulates, or manages data stored in a database or information system.

We author queries in a specific query language, such as SQL for relational databases or other query languages designed for different types of databases or data sources.

For example, the Kloudfuse proprietary FuseQL is particularly powerful in a wide range of applications, with flexible parameters for answering highly complex questions.

query type

Represents the type or category of a query performed on a database or data store, such as SELECT, INSERT, UPDATE, or DELETE.

R

RED

Refers to RED metrics: an acronym for Requests, Errors, and Duration (latency).
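For example, RED metrics can be derived from a window of request records. An illustrative Python sketch; the record fields are invented:

```python
# A window of request records (invented data).
requests = [
    {"ok": True, "duration_ms": 12},
    {"ok": False, "duration_ms": 250},
    {"ok": True, "duration_ms": 30},
    {"ok": True, "duration_ms": 18},
]

rate = len(requests)                              # Requests in the window
errors = sum(1 for r in requests if not r["ok"])  # Errors
avg_latency = sum(r["duration_ms"] for r in requests) / len(requests)  # Duration
# rate == 4, errors == 1, avg_latency == 77.5 ms
```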

region

A geographical location that hosts a cloud provider’s data centers.

Part of the cloud filter specification.

reliability

Reliability is the ability of a system, application, or service to consistently perform its intended functions under specific conditions for a specified period. Along with availability and usability, reliability is a key attribute of quality in software and services.

Reliability requires careful design, engineering, and ongoing monitoring and maintenance, and ensures that users and organizations get the fundamental system performance that they expect.

resilience

Resilience is the ability of a system or application to gracefully and effectively handle unexpected failures, disruptions, errors, or adverse conditions while maintaining core functionality and minimizing the impact on users and operations. It is one of the key aspects of usability.

Resilience ensures that software systems can continue to operate under adverse circumstances, recover quickly from failures, and maintain a high level of availability and performance.

resource

Resources represent a particular domain of a customer application - they are typically an instrumented web endpoint, database query, or background job.

RPO (Recovery Point Objective)

The maximum amount of data the organization can tolerate losing after an outage or other data-loss event; the goal an organization sets for how much recent data may be lost before normal operations resume.

This parameter is measured in time: from the moment a failure occurs to your last valid data backup.

Key points about RPO:

  • Data Loss Tolerance: how much data loss is acceptable to an organization during a recovery process.

  • Backup Frequency: To achieve a specific RPO, organizations must implement backup strategies with appropriate frequencies.

  • Business Impact Analysis: Considering the criticality of data and potential consequences of data loss.

RTO (Recovery Time Objective)

The maximum amount of time that it is acceptable to wait to restore a system or service after a disruption.

This ensures that normal operations can be restored as quickly as possible after a disaster, to avoid significant disruption to a business.

The factors that affect RTO are:

  • Business impact: How much revenue is lost, and how much the disruption affects business continuity.

  • System criticality: How important the system is to the business.

  • Application dependencies: Whether the application depends on other applications.

  • Compliance requirements: Whether the application is subject to any external compliance or regulatory requirements.

RTO is usually expressed in hours or days.

RUM

RUM is an acronym for Real User Monitoring, a technology that measures and tracks the end-user experience of an application or website. It is a core feature of APM, and a key component for gaining real-world observability into web performance and service availability.

S

sampling

In sampling, we programmatically select a subset or representative group of data points or items from a larger dataset for analysis, testing, or inspection.

Sampling is common when it is impractical or resource-intensive to work with an entire dataset, especially when dealing with large volumes of data.

In observability and performance monitoring, we sample logs, traces, and events to summarize information about a service or activity. When sampling your information, there is a constant tradeoff between granularity, system-representative accuracy, cost, performance, and relevancy.

scalability

Scalability is the ability to handle increasing workloads by adding more nodes to the system without significant performance degradation.

It is one of the key aspects of performance.

service

Services are the building blocks of modern microservice architectures - broadly, a service groups together endpoints, queries, or jobs to build your application.

service entry span

A service entry span records the entry point method for a request to a service. The Kloudfuse APM interface shows service entry spans when the immediate parent on a flame graph is a different color than the span.

Service Level Agreement (SLA)

SLAs define the level of service and performance standards that the service provider is expected to meet. SLAs serve as a means of establishing clear expectations, responsibilities, and consequences for service quality and delivery.

A Service Level Agreement is a formal and legally-binding contract or agreement between a service provider and a customer or client. Most IT contracts include an SLA for the protection of both the vendor and the customer.

Service Level Objective (SLO)

SLOs are critical components of SLAs. They establish clear, measurable expectations for service providers and stakeholders, help to ensure that the service meets the needs of its users, and align with business or operational requirements. The best SLOs set a minimum standard for performance, are chosen thoughtfully with business objectives in mind, and focus only on measurable metrics.

A Service Level Objective is a specific and quantifiable target or goal that defines the level of performance, reliability, or quality that a service or system must achieve.
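For example, an availability SLO implies an error budget: the fraction of time (or requests) allowed to fail while still meeting the target. An illustrative Python sketch:

```python
def error_budget(slo, total):
    # The budget is whatever the SLO leaves over: (1 - slo) of the total.
    return (1 - slo) * total

minutes_per_30_days = 30 * 24 * 60  # 43,200 minutes
allowed_downtime = error_budget(0.999, minutes_per_30_days)
# A 99.9% availability SLO over 30 days permits about 43.2 minutes
# of downtime before the objective is missed.
```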

service map

The Service Map is a graphical workflow representation of all services in the deployment.

span

Spans are integral parts of a distributed system, and the basic element in distributed tracing. A span represents a logical unit of work in a distributed system for a specified period. It is a single operation within a trace. Multiple spans construct a trace.

span tags

You can tag spans in APM, as key-value pairs, to correlate and filter requests.

span type

In Kloudfuse, a group of filters that identify the environment where the span runs. Includes http, grpc, db, queue, and custom.

source

The source application that emits logs or other metrics.

This is a mandatory field for each log line. Users can configure the system to attach an appropriate source name to their logs.

T

telemetry

Telemetry data is automatically collected and transmitted from remote or inaccessible sources to a centralized location for monitoring and analysis. Metrics, events, logs, and traces each provide crucial insights into the application’s performance, latency, throughput, and resource utilization.

We can use telemetry data to observe system performance, recognize potential problems, detect irregularities, and investigate issues.

We can extract telemetry data from:

  • Application logs

  • System logs

  • Network traffic

  • Third-party services

  • APIs

telemetry type

The telemetry type is the specific category or classification of data collected through telemetry, such as logs, events, distributed tracing, and so on.

throughput

Throughput is the number of operations or requests a system can process per unit of time.

It is one of the key aspects of performance.

time series

Time series is a set of data points indexed in chronological time order. Commonly, a time series is a sequence of successive equally-spaced points in time, or a sequence of discrete-time data.
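A minimal example of an equally spaced time series in Python: one (timestamp, value) pair per minute, with invented values.

```python
from datetime import datetime, timedelta, timezone

# An equally spaced time series: one sample per minute.
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
values = [0.42, 0.44, 0.41, 0.47]
series = [(start + timedelta(minutes=i), v) for i, v in enumerate(values)]
# Consecutive points are exactly 60 seconds apart.
```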

tolerance

In DBSCAN, the tolerance level, or eps, determines the clustering radius of the neighborhood around each point. The eps controls the sensitivity of outlier detection. A lower tolerance detects more subtle outliers, while a higher tolerance detects only the most significant deviations.
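A sketch of the eps-neighborhood test at the heart of DBSCAN, on one-dimensional points (illustrative Python; real DBSCAN also uses a minimum-samples parameter and expands clusters from core points):

```python
def neighbors(points, p, eps):
    # All other points within distance eps of p.
    return [q for q in points if q != p and abs(q - p) <= eps]

points = [1.0, 1.1, 1.2, 5.0]

# A point with no neighbors inside eps is treated as an outlier here.
# A lower eps flags the subtle outlier 5.0; a higher eps does not.
outliers_small_eps = [p for p in points if len(neighbors(points, p, 0.5)) < 1]
outliers_large_eps = [p for p in points if len(neighbors(points, p, 5.0)) < 1]
# outliers_small_eps == [5.0]; outliers_large_eps == []
```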

trace

A trace is a collection of operations that represents a unique transaction handled by an application and its constituent services. It is the path of a request or workflow as it progresses from one component of the system to another, capturing the entire request flow through a distributed system.

Traces expose the directionality and relationships between two data points, service interactions, and the effects of asynchrony. When analyzing trace data, we better understand the performance and behavior of a distributed system.

Some examples of traces include:

  • A SQL query execution

  • A function call during a user authentication request

trace metrics

Trace metrics identify and alert on hits, errors, or latency. Kloudfuse automatically collects trace metrics and retains them for 15 months, similar to other Kloudfuse metrics.

trace root span

A root span tracks the entry point method for the trace. Its start marks the beginning of the trace.

TraceQL

TraceQL is Grafana’s query language that selects traces. See TraceQL.

transparency

The user should not have to be aware of the system’s distributed nature, and should be able to interact with it as if it were a single, centralized system.

It is one of the key aspects of usability.

U

usability

Usability is a measure of how easy and intuitive it is for users to interact with a distributed system, without needing to understand the complex underlying architecture or the individual components.

Good usability ensures a seamless and efficient user experience despite the distributed nature of the system, making a complex system feel like a single, unified entity to the user.

The key aspects of usability are transparency, consistency, resilience, performance, and observability.

USE

The USE method for monitoring system performance focuses on three key metrics: Utilization, Saturation, and Errors.

Utilization

The average time that the resource was busy servicing work, including CPU utilization and Memory utilization.

Saturation

How much workload your system handles beyond its capacity, reflecting the amount of work that is waiting to be processed. It includes Queue Lengths and Input/Output Wait Times.

Errors

Count the number of problems or failures that occur in the system. These include both system and application errors.

V

visualization

Visualization is the graphical representation of data or information. It means that we create visual representations (charts, graphs, diagrams, maps, and dashboards) to make complex data more understandable, accessible, and interpretable.

Visualization is a powerful tool for gaining insights from data, presenting information, and conveying patterns or trends.