Glossary
A
- ACM
-
ACM is the abbreviation for AWS Certificate Manager, which creates, stores, and renews public and private SSL/TLS X.509 certificates and keys that protect AWS websites and applications.
See the AWS documentation: What is AWS Certificate Manager?
- agent
-
An entity that collects logs or other metrics and sends them to the Kloudfuse Platform.
- APM
-
An acronym for Application Performance Monitoring, APM tracks and monitors application performance metrics like response times, error rates, and transaction traces. APM identifies and addresses issues that impact the application’s performance and user experience.
- APM alert
-
APM metric alerts are similar to metric alerts, with controls tailored specifically to APM. Use alerts at the service level on hits, errors, and a variety of latency measures.
- APDEX
-
The Application Performance Index, a single metric that represents service quality on a scale of 0 to 1. Higher values indicate better performance, while lower values indicate poorer quality. APDEX combines the RED metrics into a single measurement.
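As an illustration, Apdex is commonly computed as (satisfied + tolerating / 2) / total, where requests at or under a threshold T count as satisfied and those under 4T as tolerating. The threshold value and latencies below are assumptions for the sketch, not Kloudfuse defaults:

```python
# Apdex sketch: (satisfied + tolerating / 2) / total samples.
# The threshold t (in seconds) is an assumed value; 0.5 s is a common choice.
def apdex(latencies_s, t=0.5):
    satisfied = sum(1 for x in latencies_s if x <= t)
    tolerating = sum(1 for x in latencies_s if t < x <= 4 * t)
    return (satisfied + tolerating / 2) / len(latencies_s)

# 2 satisfied, 2 tolerating, 1 frustrated -> (2 + 1) / 5 = 0.6
print(apdex([0.2, 0.4, 1.0, 1.5, 5.0]))
```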
- ASM
-
An acronym for Advanced Service Monitoring, ASM consists of technologies that help manage and monitor distributed applications within a cloud environment, and provide centralized visibility into their health and performance. ASM enhances observability by providing a unified view of complex microservice interactions through centralized management, traffic management, and observability data gathering.
- availability
-
Availability is the system’s ability to consistently respond to user requests, even if some individual nodes or components are experiencing failures.
Availability ensures that users can access services without interruption regardless of potential disruptions; essentially, it means that a working node in the system will always return a response to a request, even if the data is not entirely up-to-date because of inconsistencies with other nodes that are down.
It is one of the key aspects of performance.
- availability zone
-
A distinct and isolated data center in a cloud provider’s infrastructure, designed to provide fault tolerance and high availability for cloud-based services.
Part of the cloud filter specification.
C
- chart
-
A chart represents data or information in a graphical format. Most charts in Kloudfuse are line graphs, stacked bar graphs, stacked area graphs, or point graphs. The terms “chart” and “graph” are interchangeable.
- cloud
-
In Kloudfuse, a group of filters that identify the attributes of cloud-based services.
Includes availability zone, cloud account id, instance type, project, and region.
- cloud account id
-
Unique identifier for a user or organization’s account in a public cloud provider’s system.
Part of the cloud filter specification.
- consistency
-
Consistency is the implementation of consistent user interfaces and interaction patterns across different parts of the distributed system, regardless of which component is currently handling the request.
It is one of the key aspects of usability.
- container
-
In Kloudfuse, a group of filters that identify the attributes of cloud-based container-based virtualizations. Includes container id.
- container id
-
Unique identifier for a container.
Part of the container filter specification.
- counter
-
A cumulative metric that represents a single monotonically increasing counter, where the value can only increase, or be reset to zero on restart.
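Counter semantics can be sketched in a few lines; this is illustrative only, not the Kloudfuse or Prometheus client API:

```python
class Counter:
    """Minimal sketch of a monotonic counter metric (illustrative)."""
    def __init__(self):
        self._value = 0.0

    def inc(self, amount=1.0):
        # Counters only ever increase.
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self._value += amount

    def reset(self):
        # Only permitted transition downward, e.g. on process restart.
        self._value = 0.0

    @property
    def value(self):
        return self._value

c = Counter()
c.inc()
c.inc(5)
print(c.value)  # 6.0
```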
- cardinality
-
Cardinality of a data attribute is the number of possible distinct values, or uniqueness, it can have.
Cardinality is important for the following reasons:
- Data integrity
-
It defines clear relationships between tables and ensures that data is accurately linked and accessible.
- Efficient queries
-
Understanding cardinality optimizes query execution plans. Well-defined relationships and data distribution leads to faster information retrieval and improved performance.
- Database design and normalization
-
Cardinality is crucial in designing databases and normalizing data. It helps reduce redundancy and optimize storage for effective organization.
- Data analysis and reporting
-
For businesses, cardinality enhances data analysis and reporting. It establishes meaningful relationships that provide insights into customer behavior and operational efficiency.
See high cardinality and low cardinality.
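A quick way to see cardinality is to count the distinct values of each attribute; the sample rows below are made up for illustration:

```python
# Cardinality = number of distinct values an attribute can take.
rows = [
    {"email": "a@example.com", "gender": "f"},
    {"email": "b@example.com", "gender": "m"},
    {"email": "c@example.com", "gender": "f"},
    {"email": "d@example.com", "gender": "m"},
]

def cardinality(rows, attr):
    return len({r[attr] for r in rows})

print(cardinality(rows, "email"))   # 4 -> high relative to the row count
print(cardinality(rows, "gender"))  # 2 -> low
```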
D
- Disaster Recovery Plan
-
A documented strategy, abbreviated as DRP, that outlines how an organization can quickly resume its critical IT operations using backup systems and procedures to minimize downtime and data loss after a disruptive event, such as a natural disaster, power outage, cyber attack, or hardware failure.
The key points of an effective DRP are:
-
Focus on critical systems
-
Risk assessment
-
Data backup strategy
The last two points are the parameters that define how long a business can afford to be offline, and how much data loss it can tolerate.
- display container name
-
The name of a container within a system or application; used for display purposes or identification.
Part of the Kubernetes filter specification.
- duration
-
Duration of a span is the difference between the span end time and the span start time. Contrast with execution time.
E
- eBPF
-
Extended Berkeley Packet Filter (eBPF) is a Linux kernel technology that enables users to run programs in a protected environment within the kernel. eBPF programs are loaded into the kernel using the bpf(2) syscall; the user provides them as binary blobs of eBPF machine instructions. eBPF allows you to:
-
Reprogram the Linux kernel without rebooting the system
-
Run user-supplied programs to extend kernel functionality
-
Collect kernel statistics, monitor, and debug
-
Detect malware
-
Inspect traffic
-
Perform continuous profiling of applications
-
Run tracing and profiling scripts on a Kubernetes cluster
-
Filter data packets from networks and embed them into the kernel
-
Monitor system calls, network traffic, and system behavior at both the kernel and socket levels
- env
-
Represents the environment or deployment stage (development, staging, production) of a system or application.
- environment
-
Similar to env, represents the environment or deployment context.
- events
-
An event is something that has happened in a system at a point in time. Events are discrete occurrences with precise temporal and numerical values. Through events, we can track crucial incidents and detect potential problems associated with user requests. Because events are very time-sensitive, they include timestamps.
- execution time
-
Total time that a span is active, not waiting for a child span to complete, scaled according to the number of concurrent active spans.
Contrast with duration.
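The distinction can be sketched with a parent span and a single child. The timings are invented, and the simple subtraction assumes the parent is idle while its only child runs (no concurrency):

```python
# Duration vs. execution time for a parent span with one child (sketch).
parent = {"start": 0.0, "end": 10.0}
child = {"start": 2.0, "end": 6.0}

duration = parent["end"] - parent["start"]   # full wall-clock span: 10.0
child_time = child["end"] - child["start"]   # time spent waiting on the child: 4.0
execution_time = duration - child_time       # time the parent was itself active: 6.0

print(duration, execution_time)
```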
F
- facets
-
Attributes that the ingester extracts from a log line.
See log event, log line, log labels, and fingerprint.
- fingerprint
-
An automatically detected structure of a log line.
Can be used to filter logs effectively.
See log event, log line, log labels, and facet.
- flame graph
-
A flame graph is a visual representation of profiling data that analyzes call stacks and identifies performance hotspots.
Each horizontal bar represents a function call, and its width is a measure of how much time is spent in that function. This representation enables you to easily see which parts of the code take the most time to execute.
- FuseQL
-
Kloudfuse developed FuseQL as a query language for searching across logs data for a range of applications. It has flexible parameters for answering highly complex questions. See FuseQL.
G
- gauge
-
A metric that represents a single numerical value that can increase or decrease.
- GraphQL
-
This query language fetches data from multiple data sources with a single API call. See GraphQL.
- gRPC
-
gRPC is a high-performance, open-source framework for remote procedure calls (RPCs). gRPC enables client applications to call methods on a server application on a different machine as if it were local, making it easier to create distributed applications and services.
H
- high cardinality
-
High cardinality is a condition when individual attributes within a dataset have a large number of unique values, compared to the total number of entries. We also refer to high cardinality when combinations of dimensions lead to a large number of unique permutations.
For example, an Email address column in a typical employee table naturally has high cardinality because each person has a unique email address.
- histogram
-
A metric that tracks the distribution of observed values. A histogram samples observations (request durations, response sizes, and so on) and counts them in configurable buckets. It also provides a sum of all observed values.
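Bucketed observation counting can be sketched as follows, in the style of Prometheus-like histograms with configurable upper bounds; this is illustrative, not the Kloudfuse implementation:

```python
import bisect

class Histogram:
    """Minimal sketch of a bucketed histogram metric (illustrative)."""
    def __init__(self, buckets):
        self.uppers = sorted(buckets)               # bucket upper bounds
        self.counts = [0] * (len(self.uppers) + 1)  # last slot catches values above all bounds
        self.total = 0.0                            # sum of all observed values
        self.n = 0                                  # count of observations

    def observe(self, value):
        # bisect_left places a value equal to a bound in that bound's bucket (<= semantics).
        self.counts[bisect.bisect_left(self.uppers, value)] += 1
        self.total += value
        self.n += 1

h = Histogram([0.1, 0.5, 1.0])
for v in [0.05, 0.2, 0.3, 0.7, 2.0]:
    h.observe(v)
print(h.counts, h.n, h.total)  # [1, 2, 1, 1] 5 3.25
```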
- http
-
An acronym for Hypertext Transfer Protocol, HTTP is a set of rules that govern how to exchange information between a client and a web server.
HTTP is a request-response model, where a client sends a request to a server and the server responds with the requested content and status information.
While designed for communication between web browsers and web servers, HTTP can also be used for machine-to-machine communication, programmatic access to APIs, and more.
I
- image name
-
The name of a container image used to create instances of containers with specific functionalities.
Part of the Kubernetes filter specification.
- image tag
-
A label or version number associated with a container image, specifying its version or configuration.
Part of the Kubernetes filter specification.
- infrastructure
-
Infrastructure is the foundational layer, or backbone, of hardware, software, and resources that support the operation, deployment, and scalability of software applications and services. It includes a range of components and technologies that deliver computing, networking, storage, and other services for running and maintaining software systems.
- instance type
-
A predefined configuration of virtual hardware resources in a cloud provider’s infrastructure.
Part of the cloud filter specification.
K
- kube cronjob
-
A Kubernetes resource that creates jobs that run at specified intervals, defined by a cron schedule. Performs regular scheduled actions such as backups, report generation, and so on.
Part of the Kubernetes filter specification.
- kube container name
-
The name of a container that runs within a Kubernetes pod, a basic building block of Kubernetes applications.
Part of the Kubernetes filter specification.
- kube cluster name
-
The name of a Kubernetes cluster, a group of nodes that run containerized applications managed by Kubernetes.
Part of the Kubernetes filter specification.
- kube daemon set
-
A Kubernetes resource that ensures a specific pod runs on all or selected nodes within a cluster.
Part of the Kubernetes filter specification.
- kube deployment
-
A Kubernetes resource that represents a set of identical pods, ensuring application scalability and fault tolerance.
The filter includes a search option.
Part of the Kubernetes filter specification.
- kube job
-
A Kubernetes resource that manages batch tasks or processes that run to completion.
Part of the Kubernetes filter specification.
- kube-namespace
-
A logical partition or virtual cluster that isolates and organizes resources within a Kubernetes cluster.
Part of the Kubernetes filter specification.
- kube node
-
A worker machine within a Kubernetes cluster, responsible for running containers and managing their lifecycle.
Part of the Kubernetes filter specification.
- kube replica set
-
A Kubernetes resource that ensures a specified number of replicas (identical pods) are running at all times.
Part of the Kubernetes filter specification.
- kube stateful set
-
Runs a group of pods and maintains a "sticky" identity for each; useful for managing applications that need persistent storage or a stable, unique network identity.
Part of the Kubernetes filter specification.
- Kubernetes
-
In Kloudfuse, a group of filters that identify the attributes of services running in Kubernetes.
L
- language
-
In Kloudfuse, a group of filters that identify the programming language of the integrated telemetry software development kit (SDK), including go, nodejs, cpp, java, dotnet, rust, python, php, and ruby.
- latency
-
Latency is the time taken for a single operation to complete, often measured as response time.
It is one of the key aspects of performance.
- log event
-
A discrete log entry, with associated information: log line, log labels, facets, and fingerprint.
- log labels
-
The key/value pairs associated with the log line.
These can include Kubernetes attributes, Cloud infrastructure attributes, and so on. These typically originate outside a log line.
See log event, log line, facets, and fingerprint.
- log level
-
The severity of a log event: info, error, warning, trace, debug, or notice.
- log line
-
This is the emitted log message. See log event, log labels, facets, and fingerprint.
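To illustrate how labels and facets relate to a log line, here is a hypothetical log record; the field names, label, and format are invented for the sketch, not a Kloudfuse schema:

```python
import json
import re

# Hypothetical log record: the label comes from outside the message,
# while facets are extracted from the log line itself.
raw = '{"kube_namespace": "checkout", "msg": "payment failed code=502 user=alice"}'

record = json.loads(raw)
labels = {"kube_namespace": record["kube_namespace"]}     # context attached to the line
facets = dict(re.findall(r"(\w+)=(\w+)", record["msg"]))  # attributes parsed out of the line

print(labels)  # {'kube_namespace': 'checkout'}
print(facets)  # {'code': '502', 'user': 'alice'}
```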
- LogQL
-
LogQL is Grafana Loki’s PromQL-inspired query language for searching across logs. See LogQL.
- logs
-
Logs record all activities that occur within your system; they are a history of the system’s behavior during a specific time interval. Logs are essential for successful debugging. When we parse log data, we can develop an insight into application performance that we cannot determine through APIs or application databases.
Logs can use various formats, such as plain text or JSON objects, accessed through a range of querying techniques, making them the most useful inputs when monitoring application performance, investigating security threats, and addressing performance issues.
- low cardinality
-
This condition describes a low count of unique values in a dataset, compared to the total number of entries.
For example, a Gender column in a typical employee table naturally has low cardinality; it may have a large number of entries, yet a low number of options: female, male, non-binary, and declined to answer.
M
- MELT
-
An acronym for Metrics, Events, Logs, and Traces, MELT is a framework that provides insights into system performance, health, and behavior. MELT can help teams quickly identify, diagnose, and resolve issues, while optimizing system performance.
- metrics
-
Metrics are numerical measurements that provide an insight into a system’s performance. Use metrics, such as error rate and CPU% use, in mathematical modeling and forecasting to represent specific data structures.
Monitoring metrics as part of your observability strategy helps when constructing dashboards that display past trends across multiple services, facilitating extended data retention and simplifying queries.
O
- observability
-
An observability platform aggregates and visualizes telemetric data collected from application and infrastructure components in a distributed environment. It monitors and analyzes application behavior, transactional data, and the various types of infrastructure that support application delivery, making it possible to proactively address issues before they become serious concerns.
Beyond monitoring capabilities, observability provides deeper insights into the data and optimizes performance, ensures availability, and improves customer experience.
At its best, observability enables users to ask arbitrary questions about their environment, without knowing in advance what they may want to know.
It is one of the key aspects of usability.
P
- performance
-
Performance is the efficiency and effectiveness of the system in handling workloads by distributing tasks across multiple nodes. It is one of the key aspects of usability.
We measure performance through factors like throughput, latency, scalability, and availability, ensuring the system can respond quickly and reliably under high demand, while minimizing resource usage across all nodes.
The key aspects of performance are throughput, latency, scalability, and availability.
- pg
-
The PostgreSQL network protocol that governs how clients and servers interact in a PostgreSQL environment.
- pod
-
The smallest deployable unit of computing that you can create and manage in Kubernetes.
- pod name
-
The name of a specific pod, representing a single instance of a running process in a Kubernetes cluster.
Includes a search option.
Part of the Kubernetes filter specification.
- project
-
A named, isolated environment within a cloud account for organizing and managing resources.
Part of the cloud filter specification.
- PromQL
-
Prometheus Query Language enables you to select and aggregate metric time series data in real time. See PromQL.
- protobuf
-
Shortened form of Protocol Buffers, protobuf is an efficient, language-agnostic data serialization mechanism for defining structured data in a .proto file. You can use it to generate source code that writes and reads data from different data streams.
Q
- query
-
A query is a request or command that retrieves, manipulates, or manages data stored in a database or information system.
We author queries in a specific query language, such as SQL for relational databases or other query languages designed for different types of databases or data sources.
For example, the Kloudfuse proprietary FuseQL is particularly powerful in a wide range of applications, with flexible parameters for answering highly complex questions.
- query type
-
Represents the type or category of a query performed on a database or data store, such as SELECT, INSERT, UPDATE, or DELETE.
R
- RED
-
Refers to RED metrics: an acronym for Requests, Errors, and Duration (latency).
- region
-
A geographical location that hosts a cloud provider’s data centers.
Part of the cloud filter specification.
- reliability
-
Reliability is the ability of a system, application, or service to consistently perform its intended functions under specific conditions for a specified period. Along with availability and usability, reliability is a key attribute of quality in software and services.
Reliability requires careful design, engineering, and ongoing monitoring and maintenance, and ensures that users and organizations get the fundamental system performance that they expect.
- resilience
-
Resilience is the ability of a system or application to gracefully and effectively handle unexpected failures, disruptions, errors, or adverse conditions while maintaining core functionality and minimizing the impact on users and operations. It is one of the key aspects of usability.
Resilience ensures that software systems can continue to operate under adverse circumstances, recover quickly from failures, and maintain a high level of availability and performance.
- resource
-
Resources represent a particular domain of a customer application - they are typically an instrumented web endpoint, database query, or background job.
- RPO (Recovery Point Objective)
-
The maximum amount of data, measured in time, that the organization can tolerate losing after a disruption.
This parameter is measured as the interval from the moment a failure occurs back to your last valid data backup.
Key points about RPO:
-
Data Loss Tolerance: how much data loss is acceptable to an organization during a recovery process.
-
Backup Frequency: To achieve a specific RPO, organizations must implement backup strategies with appropriate frequencies.
-
Business Impact Analysis: Considering the criticality of data and potential consequences of data loss.
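The relationship between RPO and backup frequency can be sketched with simple arithmetic; the target and interval below are assumed values:

```python
# If the RPO is 1 hour, backups must run at least that often (sketch).
rpo_minutes = 60                    # assumed business target for tolerable data loss
backup_interval_minutes = 15        # assumed time between backups

# Worst case, everything written since the last backup is lost.
worst_case_loss_minutes = backup_interval_minutes

print(worst_case_loss_minutes <= rpo_minutes)  # True: this schedule meets the RPO
```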
- RTO (Recovery Time Objective)
-
The maximum amount of time that it is acceptable to wait to restore a system or service after a disruption.
This ensures that normal operations can be restored as quickly as possible after a disaster, to avoid significant disruption to a business.
The factors that affect RTO are:
-
Business impact: How much revenue is lost, and how much the disruption affects business continuity.
-
System criticality: How important the system is to the business.
-
Application dependencies: Whether the application depends on other applications.
-
Compliance requirements: Whether the application is subject to any external compliance or regulatory requirements.
RTO is usually measured in hours or days.
- RUM
-
RUM is an acronym for Real User Monitoring, a technology that measures and tracks the end-user experience of an application or website. It is a core feature of APM, and a key component for gaining real-world observability into web performance and service availability.
S
- sampling
-
In sampling, we programmatically select a subset or representative group of data points or items from a larger dataset for analysis, testing, or inspection.
Sampling is common practice when it is impractical or resource-intensive to work with an entire dataset, especially when dealing with large volumes of data.
In observability and performance monitoring, we sample logs, traces, and events to summarize information about a service or activity. When sampling your information, there is a constant tradeoff between granularity, system-representative accuracy, cost, performance, and relevancy.
- scalability
-
Scalability is the ability to handle increasing workloads by adding more nodes to the system without significant performance degradation.
It is one of the key aspects of performance.
- service
-
Services are the building blocks of modern microservice architectures - broadly, a service groups together endpoints, queries, or jobs to build your application.
- service entry span
-
A service entry span records the entry point method for a request to a service. The Kloudfuse APM interface shows service entry spans when the immediate parent on a flame graph is a different color than the span.
- Service Level Agreement (SLA)
-
SLAs define the level of service and performance standards that the service provider is expected to meet. SLAs serve as a means of establishing clear expectations, responsibilities, and consequences for service quality and delivery.
A Service Level Agreement is a formal and legally-binding contract or agreement between a service provider and a customer or client. Most IT contracts include an SLA for the protection of both the vendor and the customer.
- Service Level Objective (SLO)
-
SLOs are critical components of SLAs. They establish clear, measurable expectations for service providers and stakeholders, help to ensure that the service meets the needs of its users, and align with business or operational requirements. The best SLOs set a minimum standard for performance, are chosen thoughtfully with business objectives in mind, and focus only on measurable metrics.
A Service Level Objective is a specific and quantifiable target or goal that defines the level of performance, reliability, or quality that a service or system must achieve.
- service map
-
The Service Map is a graphical workflow representation of all services in the deployment.
- source
-
The source application that emits logs or other metrics.
This is a mandatory field for each log line. Users can configure the system to attach an appropriate source name to their logs.
T
- telemetry
-
Telemetry data is automatically collected and transmitted from remote or inaccessible sources to a centralized location for monitoring and analysis. Metrics, events, logs, and traces each provide crucial insights into the application’s performance, latency, throughput, and resource utilization.
We can use telemetry data to observe system performance, recognize potential problems, detect irregularities, and investigate issues.
We can extract telemetry data from:
-
Application logs
-
System logs
-
Network traffic
-
Third-party services
-
APIs
- telemetry type
-
The telemetry type is the specific category or classification of data collected through telemetry, such as logs, events, distributed tracing, and so on.
- throughput
-
Throughput is the number of operations or requests a system can process per unit of time.
It is one of the key aspects of performance.
- time series
-
Time series is a set of data points indexed in chronological time order. Commonly, a time series is a sequence of successive equally-spaced points in time, or a sequence of discrete-time data.
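An equally spaced time series can be sketched as timestamp/value pairs; the start time, interval, and sample values below are made up:

```python
from datetime import datetime, timedelta

# One sample every 30 seconds, indexed in chronological order (values invented).
start = datetime(2024, 1, 1, 12, 0, 0)
samples = [0.42, 0.47, 0.51, 0.40]
points = [(start + timedelta(seconds=30 * i), v) for i, v in enumerate(samples)]

for ts, v in points:
    print(ts.isoformat(), v)
```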
- tolerance
-
In DBSCAN, the tolerance level, or eps, determines the clustering radius of the neighborhood around each point. The eps controls the sensitivity of outlier detection. A lower tolerance detects more subtle outliers, while a higher tolerance detects only the most significant deviations.
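The effect of eps can be sketched with the neighborhood test at the core of DBSCAN's density check; the one-dimensional data points are made up for illustration:

```python
# Neighborhood test at the core of DBSCAN (sketch, 1-D points).
def neighbors(points, p, eps):
    return [q for q in points if q != p and abs(q - p) <= eps]

data = [1.0, 1.1, 1.2, 5.0]  # 5.0 sits far from the cluster

# A tight eps leaves 5.0 with no neighbors, so it is flagged as noise;
# a loose eps absorbs it into the cluster.
print(len(neighbors(data, 5.0, eps=0.5)))  # 0
print(len(neighbors(data, 5.0, eps=4.0)))  # 3
```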
- trace
-
A trace is a collection of operations that represents a unique transaction handled by an application and its constituent services. It is the path of a request or workflow as it progresses from one component of the system to another, capturing the entire request flow through a distributed system.
Traces expose the directionality and relationships between two data points, service interactions, and the effects of asynchrony. When analyzing trace data, we better understand the performance and behavior of a distributed system.
Some examples of traces include:
-
A SQL query execution
-
A function call during a user authentication request
- trace metrics
-
Trace metrics identify and alert on hits, errors, or latency. Kloudfuse automatically collects trace metrics and retains them for 15 months, similar to other Kloudfuse metrics.
- trace root span
-
A root span tracks the entry point method for the trace. Its start marks the beginning of the trace.
- TraceQL
-
TraceQL is Grafana’s query language that selects traces. See TraceQL.
- transparency
-
The user should not have to be aware of the system’s distributed nature, and should be able to interact with it as if it were a single, centralized system.
It is one of the key aspects of usability.
U
- usability
-
Usability is a measure of how easy and intuitive it is for users to interact with a distributed system, without needing to understand the complex underlying architecture or the individual components.
Good usability ensures a seamless and efficient user experience despite the distributed nature of the system, making a complex system feel like a single, unified entity to the user.
The key aspects of usability are transparency, consistency, resilience, performance, and observability.
- USE
-
The USE method for monitoring system performance focuses on three key metrics: Utilization, Saturation, and Errors.
- Utilization
-
The average time that the resource was busy servicing work, such as CPU utilization and memory utilization.
- Saturation
-
How much workload your system handles beyond its capacity, reflecting the amount of work that is waiting to be processed. It includes Queue Lengths and Input/Output Wait Times.
- Errors
-
The count of problems or failures that occur in the system. These include both system and application errors.
V
- visualization
-
Visualization is the graphical representation of data or information. It means that we create visual representations (charts, graphs, diagrams, maps, and dashboards) to make complex data more understandable, accessible, and interpretable.
Visualization is a powerful tool for gaining insights from data, presenting information, and conveying patterns or trends.