Known Issues
This page describes known issues that you may encounter in Kloudfuse, their causes, and how to resolve them.
Installation
helm registry login failure
The helm registry login command may fail.
Resolution
Replace helm registry login with docker login:
cat token.json | docker login -u _json_key --password-stdin https://us-east1-docker.pkg.dev
Review helm values
The default Kloudfuse helm chart is configured for a single-node cluster install, without a deepstore. It may not be obvious which values this default configuration sets.
Resolution
Run the following command:
helm show values oci://us-east1-docker.pkg.dev/mvp-demo-301906/kfuse-helm/kfuse --version <VERSION.NUM.BER> (1)
(1) version: Use the version number of the most current Kloudfuse release; the pattern looks like 3.0.0, 3.1.3, and so on. See the Version documentation.
As your configuration changes, you can add customizations by editing the custom_values.yaml file.
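For example, a minimal custom_values.yaml overlay contains only the settings that differ from the chart defaults; the ingress-nginx override shown here is borrowed from the Ingress-NGINX section later on this page, so treat it as an illustration only:

ingress-nginx:
  controller:
    config:
      proxy-body-size: 8m   # example override of the default 1M request body limit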
To view your helm values at any time, run the command on your cluster:
helm show values <Your Kloudfuse Installation IP or address>/kfuse-helm/kfuse --version <VERSION.NUM.BER> (1)
(1) version: Use the version number of the most current Kloudfuse release; the pattern looks like 3.0.0, 3.1.3, and so on. See the Version documentation.
Networking
Kloudfuse cannot be reached from an external host
An external IP, host, or DNS name cannot reach Kfuse:
curl http://EXTERNAL_IP
curl: (28) Failed to connect to XX.XX.XX.XX port 80 after 129551 ms: Connection timed out
curl https://EXTERNAL_IP --insecure
curl: (28) Failed to connect to XX.XX.XX.XX port 443 after 129551 ms: Connection timed out
Resolution
Ensure that the security group or firewall policy for the Kubernetes cluster, node, and VPC endpoint allows external incoming traffic.
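As a quick check, confirm that the ingress service has an external address and that the port is reachable from outside the cluster; the exact service name depends on your release, so treat it as an assumption:

kubectl get svc -n kfuse           # look for the ingress controller service and its EXTERNAL-IP
nc -vz <EXTERNAL_IP> 443           # probe the HTTPS port from an external host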
Ingress-NGINX drops packet
Ingress-NGINX logs the error client intended to send too large body:
2023/03/06 05:38:22 [error] 43#43: *128072996 client intended to send too large body: 1097442 bytes, client: XXXX, server: _, request: "POST /ingester/v1/fluent_bit HTTP/1.1", host: "XXXX"
Resolution
The default request body size is 1M. Configure Ingress-NGINX to accept a larger request body size by including the following specification in the custom_values.yaml file:
ingress-nginx:
  controller:
    config:
      proxy-body-size: <REPLACE THE BODY SIZE HERE, e.g., 8m. Setting to 0 will disable all limits.>
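To confirm that the new limit is active, you can inspect the rendered NGINX configuration inside the controller pod; the deployment name below is an assumption, so adjust it to your install:

kubectl -n kfuse exec deploy/ingress-nginx-controller -- grep client_max_body_size /etc/nginx/nginx.conf   # proxy-body-size renders as client_max_body_size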
Kafka
Increase Kafka partition replication factor
In some scenarios, you have to increase the Kafka partition replication factor.
Resolution
1. In the custom_values.yaml file, update the global.kafkaTopics section with the new replicationFactor (see the sketch after this procedure). This does not change the configuration of a deployed cluster; it is necessary for tracking, so that a fresh installation re-deployed from the yaml uses the new value.
2. If necessary, resize the Kafka persistent disk. By default, Kloudfuse uses a 10GB retention size per partition, set in the kafka.logRetentionBytes variable.
3. Ensure that the persistent disk size has enough capacity. Use this formula:
\(\frac{(NP)(RF)(LRB)}{NKB}\)
where NP is the number of partitions, RF is the replication factor, LRB is the per-partition retention size (kafka.logRetentionBytes), and NKB is the number of Kafka brokers. For example, with 12 partitions, a replication factor of 2, 10GB retention per partition, and 3 brokers, each broker needs roughly (12 × 2 × 10GB) / 3 = 80GB.
4. If you must increase the size of the persistent disk, see Increase existing PVC size and Resize PVC on Azure.
5. To start increasing the ReplicationFactor, log in to the Kafka pod:
kubectl exec -ti -n kfuse kafka-broker-0 -- bash
6. Unset the JMX_PORT:
unset JMX_PORT
7. Get the list of configured topics; you can also get the list from global.kafkaTopics:
/opt/bitnami/kafka/bin/kafka-topics.sh --bootstrap-server :9092 --list
8. Create the topics.json file, and save it in the /bitnami/kafka directory:
cat > /bitnami/kafka/topics.json
{
  "version": 1,
  "topics": [
    { "topic": "kf_events_topic" },
    { "topic": "kf_logs_metric_topic" },
    { "topic": "kf_logs_topic" },
    { "topic": "kf_metrics_topic" },
    { "topic": "kf_traces_errors_topic" },
    { "topic": "kf_traces_metric_topic" },
    { "topic": "kf_traces_topic" },
    { "topic": "logs_ingest_topic" }
  ]
}
9. Get into the Kafka Zookeeper pod:
kubectl exec -ti -n kfuse kafka-zookeeper-0 -- bash
10. Get the broker IDs (Get List of Active Brokers):
/opt/bitnami/zookeeper/bin/zkCli.sh -server localhost:2181 (1)
ls /brokers/ids (2)
(1) Enter the Zookeeper CLI.
(2) Get the list of active brokers.
11. Get the current partition assignment. The --broker-list must match the currently configured brokers; the following example assumes 3 Kafka brokers:
/opt/bitnami/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server :9092 --generate --topics-to-move-json-file /bitnami/kafka/topics.json --broker-list <Broker IDs> (1)
(1) List of broker IDs from Get List of Active Brokers.
This command prints the Current partition replica assignment and the Proposed partition reassignment configuration. Ignore the proposed output; only the Current partition replica assignment is relevant. Example output:
{"version":1,"partitions":[{"topic":"kf_events_topic","partition":0,"replicas":[2],"log_dirs":["any"]},{"topic":"kf_logs_metric_topic","partition":0,"replicas":[0],"log_dirs":["any"]},{"topic":"kf_logs_topic","partition":0,"replicas":[2],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":0,"replicas":[2],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":1,"replicas":[1],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":2,"replicas":[0],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":3,"replicas":[2],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":4,"replicas":[1],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":5,"replicas":[0],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":6,"replicas":[2],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":7,"replicas":[1],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":8,"replicas":[0],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":9,"replicas":[2],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":10,"replicas":[1],"log_dirs":["any"]},{"topic":"kf_metrics_topic","partition":11,"replicas":[0],"log_dirs":["any"]},{"topic":"kf_traces_errors_topic","partition":0,"replicas":[0],"log_dirs":["any"]},{"topic":"kf_traces_metric_topic","partition":0,"replicas":[0],"log_dirs":["any"]},{"topic":"kf_traces_topic","partition":0,"replicas":[2],"log_dirs":["any"]},{"topic":"logs_ingest_topic","partition":0,"replicas":[0],"log_dirs":["any"]}]}
12. Copy the Current partition replica assignment to a file.
13. Create a copy of the partition replica assignment file, and modify the replicas and log_dirs fields.
14. For each replicas field, add N brokers, depending on the desired replicationFactor. The log_dirs field must match: for each new broker that you add in the replicas field, add an "any" item in the log_dirs field. Balance the replicas across all brokers.
For example, for one partition of kf_metrics_topic:
... {"topic":"kf_metrics_topic","partition":3,"replicas":[2,0],"log_dirs":["any", "any"]} ... snipped ...
15. Save the new assignment file in the /bitnami/kafka directory:
cat > /bitnami/kafka/topics.assignment.json
<PASTE THE new assignments here>
16. Run the reassignment:
/opt/bitnami/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server :9092 --execute --reassignment-json-file /bitnami/kafka/topics.assignment.json
17. Verify the reassignment:
/opt/bitnami/kafka/bin/kafka-reassign-partitions.sh --bootstrap-server :9092 --verify --reassignment-json-file /bitnami/kafka/topics.assignment.json
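For step 1, this is a sketch of the custom_values.yaml change; the exact field names under global.kafkaTopics depend on your chart version, so treat everything except replicationFactor as an assumption and mirror the entries already in your file:

global:
  kafkaTopics:
    # field names other than replicationFactor are assumptions; match your existing entries
    - name: kf_metrics_topic
      partitions: 12
      replicationFactor: 3
    - name: kf_logs_topic
      partitions: 1
      replicationFactor: 3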
Pinot
Pinot server realtime pods in crash loop back off
- Container logs show the following JFR initialization errors:
jdk.jfr.internal.dcmd.DCmdException: Could not use /var/pinot/server/data/jfr as repository. Unable to create JFR repository directory using base location (/var/pinot/server/data/jfr)
Error occurred during initialization of VM
Failure when starting JFR on_create_vm_2
- Pinot server realtime disk usage is at 100%.
Resolution
1. Restart the Pinot server offline:
kubectl rollout restart -n kfuse statefulset pinot-server-offline
2. Edit the pinot-server-realtime statefulset to set the BALLOON_DISK environment variable to false (see the sketch after this procedure).
3. Wait for the Pinot server realtime to start, and to complete moving segments to the offline servers.
4. Edit the pinot-server-realtime statefulset to set the BALLOON_DISK environment variable back to true.
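For steps 2 and 4, one way to flip the BALLOON_DISK variable without editing the manifest by hand is kubectl set env; this sketch assumes the default statefulset name used above:

kubectl -n kfuse set env statefulset/pinot-server-realtime BALLOON_DISK=false   # step 2
# wait for segment movement to complete (step 3), then restore the variable
kubectl -n kfuse set env statefulset/pinot-server-realtime BALLOON_DISK=true    # step 4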
Pinot Deepstore access issues
- Pinot-related jobs, for example kfuse-set-tag-hook and pinot-metrics-table-creation, are stuck in a crash loop back-off.
- The Pinot controller logs a Deepstore access-related exception. On AWS S3, the exception has the following format:
Caused by: software.amazon.awssdk.services.s3.model.S3Exception: Access Denied (Service: S3, Status Code: 403, Request ID: MAYE68P6SYZMTTMP, Extended Request ID: L7mSpEzHz9gdxZQ8iNM00jKtoXYhkNrUzYntbbGkpFmUF+tQ8zL+fTpjJRlp2MDLNvhaVYCie/Q=)
Resolution
1. Configure Deepstore on Pinot; see the instructions for:
- GCP
- AWS S3
Ensure that the secret has correct access to the cloud storage bucket. If the node does not have permissions to the S3 bucket, ensure that the access key and secret access key are populated:
pinot:
  deepStore:
    enabled: true
    type: "s3"
    useSecret: true
    createSecret: true
    dataDir: "s3://[REPLACE BUCKET HERE]/kfuse/controller/data"
    s3:
      region: "YOUR REGION"
      accessKey: "YOUR AWS ACCESS KEY"
      secretKey: "YOUR AWS SECRET KEY"
2. If Pinot has the correct access credentials to the Deepstore, a directory that matches the dataDir appears in the configured bucket (a quick access check is sketched after this procedure).
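As a quick access check on AWS, you can list the configured dataDir with the same credentials that the Pinot secret uses; this assumes the AWS CLI is configured with that access key and secret key:

aws s3 ls s3://[REPLACE BUCKET HERE]/kfuse/controller/data/   # should list contents without an Access Denied error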
Rehydration of segments from Deepstore
When you decommission an older Kloudfuse installation and deploy a new one, segments from the old installation can be loaded into the new installation if:
- The Deepstore location for the new installation has a different path from the old installation.
- Pinot servers on the new installation have permissions to read from the old Deepstore location.
Resolution
1. Use this script to monitor rehydration.
2. Forward the Pinot controller port to your local machine:
kubectl port-forward pinot-controller-0 -n kfuse 9000:9000
3. For each table (kf_metrics, kf_logs, kf_traces, kf_traces_errors, kf_events), run this command (a filled-in example for kf_logs appears at the end of this procedure):
curl -X POST --fail -H "Content-Type: application/json" -H "TABLE_TYPE:REALTIME" -H "UPLOAD_TYPE:BATCH" -H "DOWNLOAD_URI:<OLD DEEPSTORE PATH>/controller/data/<TABLE NAME>" -v "http://localhost:9000/v2/segments?tableName=<TABLE NAME>&tableType=REALTIME&enableParallelPushProtection=false&allowRefresh=false"
- To prevent data loss, do not delete the older Deepstore folder. The new Kloudfuse installation downloads the segments from the older Deepstore location, but still keeps a reference to it.
- Retention on the new cluster does not reset. Instead, Kloudfuse computes it from the time when the data was initially ingested into the older installation. For example:
  - If the retention period on the new cluster is set to 1 month,
  - and a log line was first ingested into the old cluster on April 7, 2024,
  - and the segment is subsequently rehydrated into the new cluster installation on May 6, 2024,
  - then Kloudfuse deletes the log line on May 7, 2024, which is 1 month from the initial ingestion date of April 7, 2024.
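As an illustration, a filled-in version of the command for the kf_logs table might look like this; the old Deepstore path shown is hypothetical:

curl -X POST --fail -H "Content-Type: application/json" -H "TABLE_TYPE:REALTIME" -H "UPLOAD_TYPE:BATCH" -H "DOWNLOAD_URI:s3://old-kfuse-bucket/kfuse/controller/data/kf_logs" -v "http://localhost:9000/v2/segments?tableName=kf_logs&tableType=REALTIME&enableParallelPushProtection=false&allowRefresh=false"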
Getting ideal state and external view for segments from Pinot controller
To see these, follow these steps:
1. Ensure that the pinot-controller-0 pod is running and fully up:
kubectl get pods
2. Enable port-forward for pinot-controller:
kubectl port-forward pinot-controller-0 9000:9000
3. Dump the ideal state and external view for segments (a quick way to compare the two files is sketched after this list):
curl "http://localhost:9000/tables/<tableName>/idealstate" | jq > ideal_state.json 2>&1
curl "http://localhost:9000/tables/<tableName>/externalview" | jq > external_state.json 2>&1
4. Replace <tableName> with one of the following, depending on the stream type:
- Metrics: kf_metrics_REALTIME
- Events: kf_events_REALTIME
- Logs: kf_logs_REALTIME
- Traces: kf_traces_REALTIME
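One way to spot segments whose external view has not caught up with the ideal state is to normalize and diff the two dumps; a sketch:

diff <(jq -S . ideal_state.json) <(jq -S . external_state.json)   # no output means the two views match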
Realtime usage continuously increasing
The pinot-server-realtime persistent volume usage keeps increasing when there is a disconnect in segment movement.
This has been partially fixed in Kloudfuse Release 2.6.5.
There are two ways to verify the behavior:
- In Release 2.6.5 and later versions, Kloudfuse issues an automatic alert when PVC usage exceeds 40%.
- Navigate to Kloudfuse Overview → System dashboards. In the PV Used Space panel, check the graph for pinot-server-realtime (a direct check on the pod is sketched after this list).
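You can also check usage directly on a realtime pod; the pod name assumes the default statefulset naming, and the data path matches the one in the JFR error shown earlier on this page:

kubectl -n kfuse exec pinot-server-realtime-0 -- df -h /var/pinot/server/data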
Resolution
1. Restart the pinot-realtime and pinot-offline servers:
kubectl rollout restart sts pinot-server-offline pinot-server-realtime
2. If PV usage has already reached 100% and the servers cannot be restarted gracefully, increase the size of the pinot-realtime PVCs by approximately 10% to accommodate the increased requirements, and then restart pinot-server-offline and pinot-server-realtime.
Storage
Increase existing PVC size
In some scenarios, you have to increase the size of a PVC.
For Azure, see Resize PVC on Azure.
Resolution
1. Run the resize_pvc.sh script from our customer/scripts/ directory.
If you cannot resize the storageclass, add this instruction to the top level of the script to force the resizing (alternatively, you can patch the StorageClass directly; see the sketch after this procedure):
allowVolumeExpansion: true
2. Ensure that the Helm values.yaml file reflects the updated disk size.
For example, to increase the size of the Kafka stateful PVCs to 100GB in the kfuse namespace, run the script:
sh resize_pvc.sh kafka 100Gi kfuse
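If you prefer to enable expansion on the StorageClass directly rather than through the script, a one-line patch is a possible alternative; the StorageClass name is yours:

kubectl patch storageclass <STORAGECLASS> --patch '{"allowVolumeExpansion": true}'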
Resize PVC on Azure
In some scenarios, you have to increase the size of a PVC. On Azure, you must detach the PremiumV2_LRS disk before resizing.
Resolution
1. Cordon all nodes:
kubectl cordon <NODE>
2. Delete the statefulset:
kubectl delete sts <STATEFULSET>
3. In the Azure Portal, verify that the disk is in the unattached state.
4. Patch all PVCs to the desired size (a verification command is sketched after this procedure):
kubectl patch pvc <PVC> --patch '{"spec": {"resources": {"requests": {"storage": "'<SIZE>'" }}}}'
5. Remove the cordon from the nodes:
kubectl uncordon <NODE>
6. Update custom_values.yaml with the disk size for the statefulset disk.
7. [Optional] Run helm upgrade on kfuse using the updated custom_values.yaml file:
helm upgrade --install -n kfuse kfuse <source_location> --version <VERSION.NUM.BER> -f custom_values.yaml (1)
(1) version: Use the version number of the most current Kloudfuse release; the pattern looks like 3.0.0, 3.1.3, and so on. See the Version documentation.
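After the upgrade, you can confirm that the PVCs reflect the new size; a sketch:

kubectl get pvc -n kfuse -o custom-columns=NAME:.metadata.name,REQUESTED:.spec.resources.requests.storage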
Fluent-Bit Agent
Duplicate logs in Kloudfuse stack
When using the Fluent-Bit agent, the Kloudfuse stack may show duplicate logs with the same timestamp and log event. However, if you check the application logs, either on the host or in the container, there is no evidence of duplication.
Examine the Fluent-Bit logs and search for the following known Fluent-Bit errors (Issue #7166 and Issue #6886):
[error] [in_tail] file=<path_to_filename> requires a larger buffer size, lines are too long. Skipping file
This occurs because of the default buffer size of the tail plugin.
Resolution
1. To help diagnose this issue, add a randomly-generated number or string as part of the Fluent-Bit record. This appears as a log facet in the Kloudfuse stack. If the duplicate log lines have different numbers or strings, this confirms that the duplication occurred in the Fluent-Bit agent.
To add a randomly-generated number or string, add this filter to your Fluent-Bit configuration:
[FILTER]
    Name lua
    Match *
    Call append_rand_number
    Code function append_rand_number(tag, timestamp, record) math.randomseed(os.clock()*100000000000); new_record = record; new_record["rand_id"] = tostring(math.random(1, 1000000000)); return 1, timestamp, new_record end
2. Increase the buffer size by adding Buffer_Chunk_Size and Buffer_Max_Size to the configuration of each tail plugin:
[INPUT]
    Name tail
    Path <file_path_to_tail>
    Tag <tag>
    Buffer_Chunk_Size 1M
    Buffer_Max_Size 8M
Datadog Agent
Kube_cluster_name label does not appear in Kloudfuse stack
The kube_cluster_name label does not show in the Kloudfuse stack when MELT data ingested from the Datadog agent is missing the kube_cluster_name label.
A known issue in the Datadog agent cluster name detection requires that the cluster agent be up. If the dd-agent starts before the cluster agent, it fails to detect the cluster name. See Datadog Issue #24406.
Resolution
Perform a rollout restart of the Datadog agent daemonset:
kubectl rollout restart daemonset datadog-agent
Access denied when creating an alert or contact point
A non-admin (SSO) user may get one of these permission errors when creating an alert or a contact point:
{"accessErrorId":"ACE0947587429","message":"You'll need additional permissions to perform this action. Permissions needed: any of alert.notifications:write","title":"Access denied"}
{"accessErrorId":"ACE3104889351","message":"You'll need additional permissions to perform this action. Permissions needed: any of alert.provisioning:read, alert.provisioning.secrets:read","title":"Access denied"}
This is because the user may not have the permissions to create contact points or alerts manually.
Resolution
1. Log in as an admin user.
2. Create a contact point or an alert.
UI
Step size differences in charts between Kloudfuse and Grafana UIs
When rendering charts in the Kloudfuse UI, Kloudfuse determines the rollup time intervals based on the overall timeframe that the chart renders. The Grafana UI uses both the time interval and the width of the chart that it renders. This leads to detectable differences in how charts appear in Kloudfuse compared to Grafana. This is not an error in Kloudfuse.
Resolution
None
AWS
InvalidClientTokenId Error
You receive the following error:
InvalidClientTokenId: The security token included in the request is invalid.
Resolution
Several AWS regions are not enabled by default, which causes this error. To enable your region and fix the issue, follow the recommendations and steps in the AWS documentation to Enable or disable AWS Regions in your account.
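A quick way to test whether a specific region is enabled for your account is to call STS against it; depending on your CLI's STS endpoint configuration, a disabled region typically fails with the same InvalidClientTokenId error:

aws sts get-caller-identity --region <YOUR REGION>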
RBAC
Inconsistent RBAC filters on telemetry streams and signals
The APM Services interface combines data from trace streams (service name information) with RED data from metric streams (requests, latency, errors, and Apdex). When the RBAC policies conflict at the stream level, the APM Services interface does not display some of the expected information.
Resolution: We are working to provide a programmatic solution to this issue in an upcoming release. At this time, ensure that the RBAC policies are consistent across the streams.
User unable to suppress (mute) an alert they created
A user may not be able to suppress (mute) an alert that they previously created because they have an insufficient permission level for the folder where they saved the alert.
Resolution: We are working to provide a programmatic solution to this issue in an upcoming release.