Known Issues
This page describes known issues that you may encounter in Kloudfuse, their causes, and how to resolve them.
Installation
helm registry login failure
The helm registry login command may fail.
Resolution
Replace helm registry login with docker login:
cat token.json | docker login -u _json_key --password-stdin https://us-east1-docker.pkg.dev
Review helm values
The default Kloudfuse helm chart is configured for a single-node cluster install, without a deepstore. It may not be obvious which values this default configuration sets.
Resolution
Run the following command:
helm show values oci://us-east1-docker.pkg.dev/mvp-demo-301906/kfuse-helm/kfuse --version <VERSION.NUM.BER> (1)
(1) version: Use the version number of the most current Kloudfuse release; the pattern looks like 3.0.0, 3.1.3, and so on. See the Version documentation.
As your configuration changes, you can add customizations by editing the custom_values.yaml file.
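For example, a minimal custom_values.yaml overlay contains only the settings that differ from the chart defaults; the ingress-nginx override shown here is borrowed from the Ingress-NGINX section later on this page, so treat it as an illustration only:

ingress-nginx:
  controller:
    config:
      proxy-body-size: 8m   # example override of the default 1M request body limit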
To view your helm values at any time, run the command on your cluster:
helm show values <Your Kloudfuse Installation IP or address>/kfuse-helm/kfuse --version <VERSION.NUM.BER> (1)
(1) version: Use the version number of the most current Kloudfuse release; the pattern looks like 3.0.0, 3.1.3, and so on. See the Version documentation.
Networking
Kloudfuse cannot be reached from an external host
An external IP, host, or DNS name cannot reach Kfuse:
curl http://EXTERNAL_IP
curl: (28) Failed to connect to XX.XX.XX.XX port 80 after 129551 ms: Connection timed out
curl https://EXTERNAL_IP --insecure
curl: (28) Failed to connect to XX.XX.XX.XX port 443 after 129551 ms: Connection timed out
Resolution
Ensure that the security group or firewall policy for the Kubernetes cluster, node, and VPC endpoint allows external incoming traffic.
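As a quick check, confirm that the ingress service has an external address and that the port is reachable from outside the cluster; the exact service name depends on your release, so treat it as an assumption:

kubectl get svc -n kfuse           # look for the ingress controller service and its EXTERNAL-IP
nc -vz <EXTERNAL_IP> 443           # probe the HTTPS port from an external host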
Ingress-NGINX drops packet
Ingress-NGINX logs the error client intended to send too large body:
2023/03/06 05:38:22 [error] 43#43: *128072996 client intended to send too large body: 1097442 bytes, client: XXXX, server: _, request: "POST /ingester/v1/fluent_bit HTTP/1.1", host: "XXXX"
Resolution
The default request body size is 1M. Configure Ingress-NGINX to accept a larger request body size by including the following specification in the custom_values.yaml file:
ingress-nginx:
  controller:
    config:
      proxy-body-size: <REPLACE THE BODY SIZE HERE, e.g., 8m. Setting to 0 will disable all limits.>
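To confirm that the new limit is active, you can inspect the rendered NGINX configuration inside the controller pod; the deployment name below is an assumption, so adjust it to your install:

kubectl -n kfuse exec deploy/ingress-nginx-controller -- grep client_max_body_size /etc/nginx/nginx.conf   # proxy-body-size renders as client_max_body_size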
Kafka
Increase Kafka partition replication factor
In some scenarios, you have to increase the Kafka partition replication factor.
Resolution
1. In the custom_values.yaml file, update the global.kafkaTopics section with the new replicationFactor (see the sketch after this procedure). This does not change the configuration of a deployed cluster; it is necessary for tracking, so that a fresh installation re-deployed from the yaml uses the new value.
2. If necessary, resize the Kafka persistent disk. By default, Kloudfuse uses a 10GB retention size per partition, set in the kafka.logRetentionBytes variable.
3. Ensure that the persistent disk size has enough capacity. Use this formula:
\(\frac{(NP)(RF)(LRB)}{NKB}\)
where NP is the number of partitions, RF is the replication factor, LRB is the per-partition retention size (kafka.logRetentionBytes), and NKB is the number of Kafka brokers. For example, with 12 partitions, a replication factor of 2, 10GB retention per partition, and 3 brokers, each broker needs roughly (12 × 2 × 10GB) / 3 = 80GB.
4. If you must increase the size of the persistent disk, see Increase existing PVC size and Resize PVC on Azure.
5. To start increasing the ReplicationFactor, log in to the Kafka pod:
kubectl exec -ti -n kfuse kafka-broker-0 -- bash
6. Unset the JMX_PORT:
unset JMX_PORT
7. Get the list of configured topics; you can also get the list from global.kafkaTopics:
/opt/bitnami/kafka/bin/kafka-topics.sh --bootstrap-server :9092 --list
8. Create the topics.json file, and save it in the /bitnami/kafka directory:
cat > /bitnami/kafka/topics.json
{
  "version": 1,
  "topics": [
    { "topic": "kf_events_topic" },
    { "topic": "kf_logs_metric_topic" },
    { "topic": "kf_logs_topic" },
    { "topic": "kf_metrics_topic" },
    { "topic": "kf_traces_errors_topic" },
    { "topic": "kf_traces_metric_topic" },
    { "topic": "kf_traces_topic" },
    { "topic": "logs_ingest_topic" }
  ]
}
9. Get into the Kafka Zookeeper pod:
kubectl exec -ti -n kfuse kafka-zookeeper-0 -- bash
10. Get the broker IDs (Get List of Active Brokers):
/opt/bitnami/zookeeper/bin/zkCli.sh -server localhost:2181 (1)
ls /brokers/ids (2)
(1) Enter the Zookeeper CLI.
(2) Get the list of active brokers.
11. Get the current partition assignment. The --broker-list must match the currently configured brokers; the following example assumes 3 Kafka brokers:
/opt/bitnami/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server :9092 --generate --topics-to-move-json-file /bitnami/kafka/topics.json --broker-list <Broker IDs> (1)
(1) List of broker IDs from Get List of Active Brokers.
This command prints the Current partition replica assignment and the Proposed partition reassignment configuration. Ignore the proposed output; only the Current partition replica assignment is relevant. Example output:
{"version":1,"partitions":[{"topic":"kf_events_topic","partition":0,"replicas":[2],"log_dirs":["any"]},{"topic":"kf_logs_metric_topic","partition":0,"replicas":[0],"log_dirs":["any"]},{"topic":"kf_logs_topic","partition":0,"replicas":[2],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":0,"replicas":[2],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":1,"replicas":[1],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":2,"replicas":[0],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":3,"replicas":[2],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":4,"replicas":[1],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":5,"replicas":[0],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":6,"replicas":[2],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":7,"replicas":[1],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":8,"replicas":[0],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":9,"replicas":[2],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":10,"replicas":[1],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":11,"replicas":[0],"log_dirs":["any"]},{"topic":"kf_traces_errors_topic","partition":0,"replicas":[0],"log_dirs":["any"]},{"topic":"kf_traces_metric_topic","partition":0,"replicas":[0],"log_dirs":["any"]},{"topic":"kf_traces_topic","partition":0,"replicas":[2],"log_dirs":["any"]},{"topic":"logs_ingest_topic","partition":0,"replicas":[0],"log_dirs":["any"]}]}
12. Copy the Current partition replica assignment to a file.
13. Create a copy of the partition replica assignment file, and modify the replicas and log_dirs fields.
14. For each replicas field, add N brokers, depending on the desired replicationFactor. The log_dirs field must match: for each new broker that you add in the replicas field, add an "any" item in the log_dirs field. Balance the replicas across all brokers.
For example, for one partition of kf_metrics_topic:
... {"topic":"kf_metrics_topic","partition":3,"replicas":[2,0],"log_dirs":["any", "any"]} ... snipped ...
15. Save the new assignment file in the /bitnami/kafka directory:
cat > /bitnami/kafka/topics.assignment.json
<PASTE THE new assignments here>
16. Run the reassignment:
/opt/bitnami/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server :9092 --execute --reassignment-json-file /bitnami/kafka/topics.assignment.json
17. Verify the reassignment:
/opt/bitnami/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server :9092 --verify --reassignment-json-file /bitnami/kafka/topics.assignment.json
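For step 1, this is a sketch of the custom_values.yaml change; the exact field names under global.kafkaTopics depend on your chart version, so treat everything except replicationFactor as an assumption and mirror the entries already in your file:

global:
  kafkaTopics:
    # field names other than replicationFactor are assumptions; match your existing entries
    - name: kf_metrics_topic
      partitions: 12
      replicationFactor: 3
    - name: kf_logs_topic
      partitions: 1
      replicationFactor: 3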
Pinot
Pinot server realtime pods in crash loop back off
- Container logs show the following JFR initialization errors:
jdk.jfr.internal.dcmd.DCmdException: Could not use /var/pinot/server/data/jfr as repository. Unable to create JFR repository directory using base location (/var/pinot/server/data/jfr)
Error occurred during initialization of VM
Failure when starting JFR on_create_vm_2
- Pinot server realtime disk usage is at 100%.
Resolution
1. Restart the Pinot server offline:
kubectl rollout restart -n kfuse statefulset pinot-server-offline
2. Edit the pinot-server-realtime statefulset to set the BALLOON_DISK environment variable to false (see the sketch after this procedure).
3. Wait for the Pinot server realtime to start, and to complete moving segments to the offline servers.
4. Edit the pinot-server-realtime statefulset to set the BALLOON_DISK environment variable back to true.
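For steps 2 and 4, one way to flip the BALLOON_DISK variable without editing the manifest by hand is kubectl set env; this sketch assumes the default statefulset name used above:

kubectl -n kfuse set env statefulset/pinot-server-realtime BALLOON_DISK=false   # step 2
# wait for segment movement to complete (step 3), then restore the variable
kubectl -n kfuse set env statefulset/pinot-server-realtime BALLOON_DISK=true    # step 4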
Pinot Deepstore access issues
- Pinot-related jobs, for example kfuse-set-tag-hook and pinot-metrics-table-creation, are stuck in a crash loop back-off.
- The Pinot controller logs a Deepstore access-related exception. On AWS S3, the exception has the following format:
Caused by: software.amazon.awssdk.services.s3.model.S3Exception: Access Denied (Service: S3, Status Code: 403, Request ID: MAYE68P6SYZMTTMP, Extended Request ID: L7mSpEzHz9gdxZQ8iNM00jKtoXYhkNrUzYntbbGkpFmUF+tQ8zL+fTpjJRlp2MDLNvhaVYCie/Q=)
Resolution
1. Configure Deepstore on Pinot; see the instructions for:
- GCP
- AWS S3
Ensure that the secret has correct access to the cloud storage bucket. If the node does not have permissions to the S3 bucket, ensure that the access key and secret access key are populated:
pinot:
  deepStore:
    enabled: true
    type: "s3"
    useSecret: true
    createSecret: true
    dataDir: "s3://[REPLACE BUCKET HERE]/kfuse/controller/data"
    s3:
      region: "YOUR REGION"
      accessKey: "YOUR AWS ACCESS KEY"
      secretKey: "YOUR AWS SECRET KEY"
2. If Pinot has the correct access credentials to the Deepstore, a directory that matches the dataDir appears in the configured bucket (a quick access check is sketched after this procedure).
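As a quick access check on AWS, you can list the configured dataDir with the same credentials that the Pinot secret uses; this assumes the AWS CLI is configured with that access key and secret key:

aws s3 ls s3://[REPLACE BUCKET HERE]/kfuse/controller/data/   # should list contents without an Access Denied error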
Rehydration of segments from Deepstore
When you decommission an older Kloudfuse installation and deploy a new one, segments from the old installation can be loaded into the new installation if:
- The Deepstore location for the new installation has a different path from the old installation.
- Pinot servers on the new installation have permissions to read from the old Deepstore location.
Resolution
1. Use this script to monitor rehydration.
2. Forward the Pinot controller port to your local machine:
kubectl port-forward pinot-controller-0 -n kfuse 9000:9000
3. For each table (kf_metrics, kf_logs, kf_traces, kf_traces_errors, kf_events), run this command (a filled-in example for kf_logs appears at the end of this procedure):
curl -X POST --fail -H "Content-Type: application/json" -H "TABLE_TYPE:REALTIME" -H "UPLOAD_TYPE:BATCH" -H "DOWNLOAD_URI:<OLD DEEPSTORE PATH>/controller/data/<TABLE NAME>" -v "http://localhost:9000/v2/segments?tableName=<TABLE NAME>&tableType=REALTIME&enableParallelPushProtection=false&allowRefresh=false"
- To prevent data loss, do not delete the older Deepstore folder. The new Kloudfuse installation downloads the segments from the older Deepstore location, but still keeps a reference to it.
- Retention on the new cluster does not reset. Instead, Kloudfuse computes it from the time when the data was initially ingested into the older installation. For example:
  - If the retention period on the new cluster is set to 1 month,
  - and a log line was first ingested into the old cluster on April 7, 2024,
  - and the segment is subsequently rehydrated into the new cluster installation on May 6, 2024,
  - then Kloudfuse deletes the log line on May 7, 2024, which is 1 month from the initial ingestion date of April 7, 2024.
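As an illustration, a filled-in version of the command for the kf_logs table might look like this; the old Deepstore path shown is hypothetical:

curl -X POST --fail -H "Content-Type: application/json" -H "TABLE_TYPE:REALTIME" -H "UPLOAD_TYPE:BATCH" -H "DOWNLOAD_URI:s3://old-kfuse-bucket/kfuse/controller/data/kf_logs" -v "http://localhost:9000/v2/segments?tableName=kf_logs&tableType=REALTIME&enableParallelPushProtection=false&allowRefresh=false"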
Getting ideal state and external view for segments from Pinot controller
To see these, follow these steps:
1. Ensure that the pinot-controller-0 pod is running and fully up:
kubectl get pods
2. Enable port-forward for pinot-controller:
kubectl port-forward pinot-controller-0 9000:9000
3. Dump the ideal state and external view for segments (a quick way to compare the two files is sketched after this list):
curl "http://localhost:9000/tables/<tableName>/idealstate" | jq > ideal_state.json 2>&1
curl "http://localhost:9000/tables/<tableName>/externalview" | jq > external_state.json 2>&1
4. Replace <tableName> with one of the following, depending on the stream type:
- Metrics: kf_metrics_REALTIME
- Events: kf_events_REALTIME
- Logs: kf_logs_REALTIME
- Traces: kf_traces_REALTIME
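One way to spot segments whose external view has not caught up with the ideal state is to normalize and diff the two dumps; a sketch:

diff <(jq -S . ideal_state.json) <(jq -S . external_state.json)   # no output means the two views match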
Realtime usage continuously increasing
The pinot-server-realtime persistent volume usage keeps increasing when there is a disconnect in segment movement.
This has been partially fixed in Kloudfuse Release 2.6.5.
There are two ways to verify the behavior:
- In Release 2.6.5 and later versions, Kloudfuse issues an automatic alert when PVC usage exceeds 40%.
- Navigate to Kloudfuse Overview → System dashboards. In the PV Used Space panel, check the graph for pinot-server-realtime (a direct check on the pod is sketched after this list).
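You can also check usage directly on a realtime pod; the pod name assumes the default statefulset naming, and the data path matches the one in the JFR error shown earlier on this page:

kubectl -n kfuse exec pinot-server-realtime-0 -- df -h /var/pinot/server/data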
Resolution
1. Restart the pinot-realtime and pinot-offline servers:
kubectl rollout restart sts pinot-server-offline pinot-server-realtime
2. If PV usage has already reached 100% and the servers cannot be restarted gracefully, increase the size of the pinot-realtime PVCs by approximately 10% to accommodate the increased requirements, and then restart pinot-server-offline and pinot-server-realtime.
Storage
Increase existing PVC size
In some scenarios, you have to increase the size of a PVC.
For Azure, see Resize PVC on Azure.
Resolution
1. Run the resize_pvc.sh script from our customer/scripts/ directory.
If you cannot resize the storageclass, add this instruction to the top level of the script to force the resizing (alternatively, you can patch the StorageClass directly; see the sketch after this procedure):
allowVolumeExpansion: true
2. Ensure that the Helm values.yaml file reflects the updated disk size.
For example, to increase the size of the Kafka stateful PVCs to 100GB in the kfuse namespace, run the script:
sh resize_pvc.sh kafka 100Gi kfuse
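If you prefer to enable expansion on the StorageClass directly rather than through the script, a one-line patch is a possible alternative; the StorageClass name is yours:

kubectl patch storageclass <STORAGECLASS> --patch '{"allowVolumeExpansion": true}'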
Resize PVC on Azure
In some scenarios, you have to increase the size of a PVC. On Azure, you must detach the PremiumV2_LRS disk before resizing.
Resolution
1. Cordon all nodes:
kubectl cordon <NODE>
2. Delete the statefulset:
kubectl delete sts <STATEFULSET>
3. In the Azure Portal, verify that the disk is in the unattached state.
4. Patch all PVCs to the desired size (a verification command is sketched after this procedure):
kubectl patch pvc <PVC> --patch '{"spec": {"resources": {"requests": {"storage": "'<SIZE>'" }}}}'
5. Remove the cordon from the nodes:
kubectl uncordon <NODE>
6. Update custom_values.yaml with the disk size for the statefulset disk.
7. [Optional] Run helm upgrade on kfuse using the updated custom_values.yaml file:
helm upgrade --install -n kfuse kfuse <source_location> --version <VERSION.NUM.BER> -f custom_values.yaml (1)
(1) version: Use the version number of the most current Kloudfuse release; the pattern looks like 3.0.0, 3.1.3, and so on. See the Version documentation.
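After the upgrade, you can confirm that the PVCs reflect the new size; a sketch:

kubectl get pvc -n kfuse -o custom-columns=NAME:.metadata.name,REQUESTED:.spec.resources.requests.storage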
Fluent-Bit Agent
Duplicate logs in Kloudfuse stack
When using the Fluent-Bit agent, the Kloudfuse stack may show duplicate logs with the same timestamp and log event. However, if you check the application logs, either on the host or in the container, there is no evidence of duplication.
Examine the Fluent-Bit logs and search for the following known Fluent-Bit errors (Issue #7166 and Issue #6886):
[error] [in_tail] file=<path_to_filename> requires a larger buffer size, lines are too long. Skipping file
This occurs because of the default buffer size of the tail plugin.
Resolution
1. To help diagnose this issue, add a randomly-generated number or string as part of the Fluent-Bit record. This appears as a log facet in the Kloudfuse stack. If the duplicate log lines have different numbers or strings, this confirms that the duplication occurred in the Fluent-Bit agent.
To add a randomly-generated number or string, add this filter to your Fluent-Bit configuration:
[FILTER]
    Name lua
    Match *
    Call append_rand_number
    Code function append_rand_number(tag, timestamp, record) math.randomseed(os.clock()*100000000000); new_record = record; new_record["rand_id"] = tostring(math.random(1, 1000000000)); return 1, timestamp, new_record end
2. Increase the buffer size by adding Buffer_Chunk_Size and Buffer_Max_Size to the configuration of each tail plugin:
[INPUT]
    Name tail
    Path <file_path_to_tail>
    Tag <tag>
    Buffer_Chunk_Size 1M
    Buffer_Max_Size 8M
Datadog Agent
Kube_cluster_name label does not appear in Kloudfuse stack
The kube_cluster_name label does not show in the Kloudfuse stack when MELT data ingested from the Datadog agent is missing the kube_cluster_name label.
A known issue in the Datadog agent cluster name detection requires that the cluster agent be up. If the dd-agent starts before the cluster agent, it fails to detect the cluster name. See Datadog Issue #24406.
Resolution
Perform a rollout restart of the Datadog agent daemonset:
kubectl rollout restart daemonset datadog-agent
Access denied when creating an alert or contact point
A non-admin (SSO) user may get one of these permission errors when creating an alert or a contact point:
{"accessErrorId":"ACE0947587429","message":"You'll need additional permissions to perform this action. Permissions needed: any of alert.notifications:write","title":"Access denied"}
{"accessErrorId":"ACE3104889351","message":"You'll need additional permissions to perform this action. Permissions needed: any of alert.provisioning:read, alert.provisioning.secrets:read","title":"Access denied"}
This is because the user may not have the permissions to create contact points or alerts manually.
Resolution
1. Log in as an admin user.
2. Create a contact point or an alert.
UI
Step size differences in charts between Kloudfuse and Grafana UIs
When rendering charts in the Kloudfuse UI, Kloudfuse determines the rollup time intervals based on the overall timeframe that the chart renders. The Grafana UI uses both the time interval and the width of the chart that it renders. This leads to detectable differences in how charts appear in Kloudfuse compared to Grafana. This is not an error in Kloudfuse.
Resolution
None
AWS
InvalidClientTokenId Error
You receive the following error:
InvalidClientTokenId: The security token included in the request is invalid.
Resolution
Several AWS regions are not enabled by default, which causes this error. To enable your region and fix the issue, follow the recommendations and steps in the AWS documentation to Enable or disable AWS Regions in your account.
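A quick way to test whether a specific region is enabled for your account is to call STS against it; depending on your CLI's STS endpoint configuration, a disabled region typically fails with the same InvalidClientTokenId error:

aws sts get-caller-identity --region <YOUR REGION>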
RBAC
Inconsistent RBAC filters on telemetry streams and signals
The APM Services interface combines data from trace streams (service name information) with RED data from metric streams (requests, latency, errors, and Apdex). When the RBAC policies conflict at the stream level, the APM Services interface does not display some of the expected information.
Resolution: We are working to provide a programmatic solution to this issue in an upcoming release. At this time, ensure that the RBAC policies are consistent across the streams.
User unable to suppress (mute) an alert they created
A user may not be able to suppress (mute) an alert that they previously created because they have an insufficient permission level for the folder where they saved the alert.
Resolution: We are working to provide a programmatic solution to this issue in an upcoming release.