diff --git a/docs/en/solutions/ecosystem/kafka/How_to_Access_Kafka_with_Username_and_Password.md b/docs/en/solutions/ecosystem/kafka/How_to_Access_Kafka_with_Username_and_Password.md new file mode 100644 index 00000000..705acf6f --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/How_to_Access_Kafka_with_Username_and_Password.md @@ -0,0 +1,218 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Access a Kafka Cluster with Username and Password + +:::info Applicable Versions +ACP 3.x Kafka instances created from the management view. +::: + +## Introduction + +This guide shows how to create a Kafka cluster with SCRAM-SHA-512 authentication, create a topic and user, retrieve the generated password, and test producer and consumer access with Kafka command-line tools. + +## 1. Create a Kafka Cluster + +Enable SCRAM-SHA-512 authentication on the listener used by clients and enable simple authorization: + +```yaml +apiVersion: kafka.strimzi.io/v1beta1 +kind: Kafka +metadata: + name: demo + namespace: demo-dba +spec: + kafka: + version: 2.5.0 + replicas: 3 + listeners: + plain: + authentication: + type: scram-sha-512 + external: + type: nodeport + tls: true + authentication: + type: scram-sha-512 + tls: + authentication: + type: tls + authorization: + type: simple + config: + log.message.format.version: "2.5" + offsets.topic.replication.factor: 3 + transaction.state.log.min.isr: 2 + transaction.state.log.replication.factor: 3 + storage: + type: persistent-claim + size: 10Gi + class: topolvm + zookeeper: + replicas: 3 + storage: + type: persistent-claim + size: 10Gi + class: topolvm +``` + +## 2. Create a Topic + +```yaml +apiVersion: kafka.strimzi.io/v1beta1 +kind: KafkaTopic +metadata: + name: demo-topic + namespace: demo-dba + labels: + strimzi.io/cluster: demo +spec: + topicName: demo-topic + partitions: 10 + replicas: 3 + config: + retention.ms: 604800000 + segment.bytes: 1073741824 +``` + +## 3. Create a User + +```yaml +apiVersion: kafka.strimzi.io/v1beta1 +kind: KafkaUser +metadata: + name: demo-user + namespace: demo-dba + labels: + strimzi.io/cluster: demo +spec: + authentication: + type: scram-sha-512 + authorization: + type: simple + acls: + - host: "*" + operation: Read + resource: + type: topic + name: demo-topic + patternType: literal + - host: "*" + operation: Write + resource: + type: topic + name: demo-topic + patternType: literal + - host: "*" + operation: Describe + resource: + type: topic + name: demo-topic + patternType: literal + - host: "*" + operation: Create + resource: + type: topic + name: demo-topic + patternType: literal + - host: "*" + operation: Read + resource: + type: group + name: demo-group + patternType: literal +``` + +## 4. Get the Bootstrap Service + +Internal access: + +```bash +kubectl -n demo-dba get svc demo-kafka-bootstrap +``` + +External access: + +```bash +kubectl -n demo-dba get svc demo-kafka-external-bootstrap +``` + +## 5. Retrieve the Generated Password + +```bash +kubectl -n demo-dba get secret demo-user -o jsonpath='{.data.password}' | base64 -d +``` + +## 6. Create the Client Properties File + +For the internal plain listener with SCRAM-SHA-512: + +```properties +sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="demo-user" password=""; +security.protocol=SASL_PLAINTEXT +sasl.mechanism=SCRAM-SHA-512 +``` + +Save it as `client.properties`. + +## 7. Run Test Pods + +Use the same Kafka image as the broker when possible: + +```bash +kubectl -n demo-dba get pod demo-kafka-0 -o yaml | grep 'image:' + +kubectl -n demo-dba run kafka-test0 -it \ + --image= \ + --rm=true \ + --restart=Never \ + -- bash + +kubectl -n demo-dba run kafka-test1 -it \ + --image= \ + --rm=true \ + --restart=Never \ + -- bash +``` + +Copy the properties file into both pods: + +```bash +kubectl -n demo-dba cp ./client.properties kafka-test0:/home/kafka/client.properties +kubectl -n demo-dba cp ./client.properties kafka-test1:/home/kafka/client.properties +``` + +## 8. Produce and Consume Messages + +Producer: + +```bash +/opt/kafka/bin/kafka-console-producer.sh \ + --bootstrap-server demo-kafka-bootstrap:9092 \ + --topic demo-topic \ + --producer.config /home/kafka/client.properties +``` + +Consumer: + +```bash +/opt/kafka/bin/kafka-console-consumer.sh \ + --bootstrap-server demo-kafka-bootstrap:9092 \ + --topic demo-topic \ + --consumer.config /home/kafka/client.properties \ + --from-beginning \ + --group demo-group +``` + +## Important Considerations + +- The user secret is generated by the Strimzi user operator after the `KafkaUser` becomes ready. +- Use `SASL_PLAINTEXT` only for non-TLS listeners. For TLS listeners, configure truststore settings and use `SASL_SSL`. +- Grant both topic ACLs and group ACLs for consumers. +- External access requires the broker endpoints returned in Kafka metadata to be reachable from the client network. diff --git a/docs/en/solutions/ecosystem/kafka/How_to_Import_Kafka_Resources_From_Management_View.md b/docs/en/solutions/ecosystem/kafka/How_to_Import_Kafka_Resources_From_Management_View.md new file mode 100644 index 00000000..a0bcaf89 --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/How_to_Import_Kafka_Resources_From_Management_View.md @@ -0,0 +1,136 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Import Kafka Resources Created From the Management View + +:::info Applicable Versions +ACP 3.12.x. +::: + +## Introduction + +Older Kafka instances may have been created directly from the Strimzi management view. In ACP 3.12, the business view expects RDS-layer custom resources. Use the `rdskafka-sync` tool to generate RDS resources from existing Strimzi resources and import Kafka clusters, topics, and users into the business view. + +The import updates the managed Kafka resources and can restart the Kafka instance. Run the check phase first and review the generated YAML before accepting the sync. + +## Prerequisites + +- Cluster administrator access to the target Kubernetes cluster. +- Existing Kafka resources created from the management view. +- Access to the `rdskafka-sync` image. +- A backup or rollback plan for the Kafka instance. + +## Quick Upgrade Workflow + +### 1. Check Import Readiness + +For Docker-based environments: + +```bash +docker run -it --rm \ + -v ~/.kube/config:/root/.kube/config \ + build-harbor.alauda.cn/middleware/rdskafka-sync:1.0 \ + ./bin/check.sh +``` + +For containerd-based environments: + +```bash +ctr run --rm \ + --mount type=bind,src=/root/.kube,dst=/root/.kube,options=rbind:rw \ + --net-host \ + build-harbor.alauda.cn/middleware/rdskafka-sync:1.0 \ + sh ./bin/check.sh +``` + +`Ready` means the resources can be imported. `Not Ready` means at least one resource failed validation; review the output and fix the reported cause before continuing. + +### 2. Run the Import + +For Docker: + +```bash +docker run -it --rm \ + -v ~/.kube/config:/root/.kube/config \ + build-harbor.alauda.cn/middleware/rdskafka-sync:1.0 \ + ./bin/sync.sh +``` + +For containerd: + +```bash +ctr run --rm \ + --mount type=bind,src=/root/.kube,dst=/root/.kube,options=rbind:rw \ + --net-host \ + build-harbor.alauda.cn/middleware/rdskafka-sync:1.0 \ + sh ./bin/sync.sh +``` + +If the command completes without errors, the imported resource names are printed. Contact operations if any resource fails to import. + +## Using the CLI Directly + +### Check Resources + +```bash +./rdskafka-sync check cluster +./rdskafka-sync check cluster -n +./rdskafka-sync check topic -n +./rdskafka-sync check user -n +``` + +The check output includes these fields: + +| Field | Meaning | +| --- | --- | +| `NAMESPACE` | Namespace of the resource. | +| `RDSNAME` | RDS resource name. Empty means only the management-view resource exists and needs import. | +| `CLUSTERNAME` | Management-view Kafka resource name. | +| `VALIDATE` | Whether the resource passed import validation. Only `true` can be imported. | +| `REASON` | Validation failure reason. Empty when validation succeeds. | + +### Sync Resources + +```bash +./rdskafka-sync sync cluster -n +./rdskafka-sync sync topic -n +./rdskafka-sync sync user -n +./rdskafka-sync sync cluster -n +./rdskafka-sync sync topic -n +./rdskafka-sync sync user -n +``` + +Force sync skips confirmation and must be used carefully: + +```bash +./rdskafka-sync sync cluster -n -f +``` + +## Validation Rules + +The tool validates whether resources can be safely imported. Common validation failures include: + +- The Kafka instance does not use PVC-based storage. +- The resource is being deleted. +- The resource is not in a ready state. +- Kafka topic or cluster config values use non-string values such as booleans or integers. The RDS operator expects string config values. + +Imported topics always include the required RDS config keys. If the management-view topic did not define them, default values are added: + +```properties +retention.ms=604800000 +max.message.bytes=1048576 +``` + +## Important Considerations + +- Importing a Kafka cluster can restart the Kafka instance. +- Review the generated RDS YAML and the resulting Strimzi YAML before confirming the operation. +- Convert config values to strings before import. +- Run the check command immediately before sync so the validation output matches the current cluster state. diff --git a/docs/en/solutions/ecosystem/kafka/How_to_Run_Kafka_as_Root_User.md b/docs/en/solutions/ecosystem/kafka/How_to_Run_Kafka_as_Root_User.md new file mode 100644 index 00000000..b25cf3fb --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/How_to_Run_Kafka_as_Root_User.md @@ -0,0 +1,54 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Run Kafka Pods as the Root User + +:::info Applicable Versions +ACP 3.12 and later. +::: + +## Problem + +Kafka components normally run with a non-root user for security. Some storage integrations or legacy environments require Kafka and ZooKeeper pods to run as UID 0 and use root-owned filesystems. + +Use this only when the storage integration cannot work with the default non-root security context. + +## Procedure + +When creating the Kafka instance, switch to the YAML view and add pod security context settings under both `spec.kafka.template.pod` and `spec.zookeeper.template.pod`: + +```yaml +spec: + kafka: + template: + pod: + securityContext: + runAsUser: 0 + fsGroup: 0 + zookeeper: + template: + pod: + securityContext: + runAsUser: 0 + fsGroup: 0 +``` + +Create or update the instance, then enter a pod and verify the effective user: + +```bash +kubectl -n exec -it -- id +kubectl -n exec -it -- id +``` + +## Important Considerations + +- Running as root weakens the default security posture. Use it only when storage requirements make it necessary. +- Confirm the namespace security policy, admission policy, or Pod Security Admission level allows UID 0. +- Re-test volume permissions after storage class or CSI driver changes. +- Document the exception for security review. diff --git a/docs/en/solutions/ecosystem/kafka/How_to_Use_Kafka_Built_in_Test_Scripts.md b/docs/en/solutions/ecosystem/kafka/How_to_Use_Kafka_Built_in_Test_Scripts.md new file mode 100644 index 00000000..92894e13 --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/How_to_Use_Kafka_Built_in_Test_Scripts.md @@ -0,0 +1,135 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Use Kafka Built-in Performance Test Scripts + +:::info Applicable Versions +General Kafka guidance for ACP 3.x Kafka deployments. +::: + +## Introduction + +Kafka images include command-line tools that can create topics, run producer throughput tests, run consumer throughput tests, and inspect basic performance characteristics. Use these scripts for initial validation and comparative tests, not as a replacement for workload-specific benchmarking. + +## Prerequisites + +- A Kafka cluster is deployed and reachable. +- The test environment has the Kafka scripts available. +- Test topics can be created and deleted safely. +- Authentication configuration is prepared if the cluster requires SASL or TLS. + +Set common connection variables: + +```bash +kafka_link="localhost:9092" +``` + +## Topic Operations + +```bash +./kafka-topics.sh --create \ + --bootstrap-server ${kafka_link} \ + --topic test_producer_perf \ + --partitions 6 \ + --replication-factor 1 + +./kafka-topics.sh --list --bootstrap-server ${kafka_link} +./kafka-topics.sh --describe --bootstrap-server ${kafka_link} +./kafka-topics.sh --delete --bootstrap-server ${kafka_link} --topic test_producer_perf +``` + +## Producer Tests + +### Test Different Partition Counts + +```bash +./kafka-topics.sh --create --bootstrap-server ${kafka_link} --topic test_producer_perf6 --partitions 6 --replication-factor 1 +./kafka-topics.sh --create --bootstrap-server ${kafka_link} --topic test_producer_perf12 --partitions 12 --replication-factor 1 + +./kafka-producer-perf-test.sh --num-records 5000000 --topic test_producer_perf6 --throughput -1 --record-size 1000 --producer-props bootstrap.servers=${kafka_link} +./kafka-producer-perf-test.sh --num-records 5000000 --topic test_producer_perf12 --throughput -1 --record-size 1000 --producer-props bootstrap.servers=${kafka_link} +``` + +### Test Different Replica Counts + +```bash +./kafka-topics.sh --create --bootstrap-server ${kafka_link} --topic test_replication3 --partitions 3 --replication-factor 3 +./kafka-topics.sh --create --bootstrap-server ${kafka_link} --topic test_replication5 --partitions 3 --replication-factor 5 + +./kafka-producer-perf-test.sh --num-records 5000000 --topic test_replication3 --throughput -1 --record-size 1000 --producer-props bootstrap.servers=${kafka_link} +./kafka-producer-perf-test.sh --num-records 5000000 --topic test_replication5 --throughput -1 --record-size 1000 --producer-props bootstrap.servers=${kafka_link} +``` + +### Test Batch Size + +```bash +./kafka-producer-perf-test.sh --num-records 5000000 --topic test_producer_perf --throughput -1 --record-size 1000 --producer-props bootstrap.servers=${kafka_link} batch.size=200 +./kafka-producer-perf-test.sh --num-records 5000000 --topic test_producer_perf --throughput -1 --record-size 1000 --producer-props bootstrap.servers=${kafka_link} batch.size=400 +``` + +### Test Message Size + +```bash +./kafka-producer-perf-test.sh --num-records 5000000 --topic test_producer_perf --throughput -1 --record-size 1000 --producer-props bootstrap.servers=${kafka_link} +./kafka-producer-perf-test.sh --num-records 5000000 --topic test_producer_perf --throughput -1 --record-size 2000 --producer-props bootstrap.servers=${kafka_link} +``` + +Important producer options: + +| Option | Meaning | +| --- | --- | +| `--topic` | Topic to produce to. | +| `--num-records` | Number of records to produce. | +| `--throughput` | Approximate message rate. `-1` disables throttling. | +| `--record-size` | Message size in bytes. | +| `--producer-props` | Producer config overrides such as `bootstrap.servers`. | +| `--producer.config` | Producer properties file. | +| `--print-metrics` | Print detailed metrics after the run. | + +## Consumer Tests + +### Test Different Thread Counts + +```bash +./kafka-consumer-perf-test.sh --bootstrap-server ${kafka_link} --messages 5000000 --topic test_producer_perf --timeout 100000 --threads 2 +./kafka-consumer-perf-test.sh --bootstrap-server ${kafka_link} --messages 5000000 --topic test_producer_perf --timeout 100000 --threads 3 +./kafka-consumer-perf-test.sh --bootstrap-server ${kafka_link} --messages 5000000 --topic test_producer_perf --timeout 100000 --threads 4 +``` + +### Test Different Partition Counts + +```bash +./kafka-consumer-perf-test.sh --bootstrap-server ${kafka_link} --messages 5000000 --topic test_producer_perf6 --timeout 100000 +./kafka-consumer-perf-test.sh --bootstrap-server ${kafka_link} --messages 5000000 --topic test_producer_perf12 --timeout 100000 +``` + +Important consumer options: + +| Option | Meaning | +| --- | --- | +| `--bootstrap-server` | Kafka bootstrap server. | +| `--topic` | Topic to consume from. | +| `--messages` | Number of messages to consume. | +| `--threads` | Number of processing threads. | +| `--num-fetch-threads` | Number of fetcher threads. | +| `--consumer.config` | Consumer properties file. | +| `--timeout` | Maximum interval between returned records. | + +## Interpreting Results + +Producer output includes records per second, MiB per second, average latency, maximum latency, and latency percentiles. Consumer output includes data consumed, throughput in MiB/s, message throughput, and fetch timing. + +To estimate network bandwidth from producer MiB/s, multiply by 8 to convert bytes to bits. + +## Important Considerations + +- Kafka's built-in producer performance script does not model every application pattern and does not replace real client testing. +- Test with the same authentication, TLS, compression, batch, and acknowledgement settings used by production clients. +- Benchmark on isolated topics and clean up test topics after use. +- Watch broker CPU, disk, network, and consumer lag during tests. diff --git a/docs/en/solutions/ecosystem/kafka/Kafka_24x_Not_Supported_After_ACP_312_Upgrade.md b/docs/en/solutions/ecosystem/kafka/Kafka_24x_Not_Supported_After_ACP_312_Upgrade.md new file mode 100644 index 00000000..503abc12 --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/Kafka_24x_Not_Supported_After_ACP_312_Upgrade.md @@ -0,0 +1,58 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Kafka 2.4.x Is Not Supported After Upgrading to ACP 3.12 + +:::info Applicable Versions +ACP 3.8 to 3.12 upgrade path. +::: + +## Problem + +The Kafka operator version used by ACP 3.12 does not support Kafka 2.4.x. ACP 3.12 supports newer Kafka versions such as 2.5.x, 2.6.x, and 2.7.x. Kafka instances that remain on 2.4.x before the platform upgrade can become abnormal after the operator is upgraded. + +Management-view Kafka resources also need to be imported into the RDS business view before upgrading to ACP 3.12. Otherwise, fields can be lost because the newer CRD schema differs from the older 3.8-era schema. + +## Resolution + +### 1. Upgrade Kafka Before the Platform Upgrade + +Before upgrading ACP to 3.12, upgrade each Kafka instance from 2.4.x to 2.5.0 or later. + +You can use the product UI to perform a step-by-step version upgrade, or edit the resource YAML and update `spec.version` directly: + +```yaml +spec: + kafka: + version: 2.5.0 +``` + +The Kafka instance restarts brokers one by one during the version upgrade. With healthy replicas and ISR, the rolling upgrade should not interrupt service or lose data. + +### 2. Import Management-View Resources + +If the Kafka instance was created directly from the management view in ACP 3.8, import it into the business view before upgrading to ACP 3.12. + +Use the `rdskafka-sync` tool described in the import guide: + +```bash +./rdskafka-sync check cluster -n +./rdskafka-sync sync cluster -n +``` + +## Impact + +After a successful Kafka version upgrade and resource import, clients can continue to use the Kafka cluster normally. The operation is designed as a rolling update and should not cause data loss when the cluster is healthy. + +## Important Considerations + +- Do not upgrade the platform while Kafka instances are still on 2.4.x. +- Import management-view resources before the ACP 3.12 upgrade. +- Confirm all Kafka brokers and ZooKeeper pods are ready before and after the rolling update. +- Validate topic availability and consumer lag after the upgrade. diff --git a/docs/en/solutions/ecosystem/kafka/Kafka_AWS_NLB_Access.md b/docs/en/solutions/ecosystem/kafka/Kafka_AWS_NLB_Access.md new file mode 100644 index 00000000..b465197b --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/Kafka_AWS_NLB_Access.md @@ -0,0 +1,106 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Expose Kafka on AWS EKS with Network Load Balancer + +:::info Applicable Versions +ACP 3.x Kafka on AWS EKS with Strimzi load balancer listeners. +::: + +## Introduction + +On AWS EKS, Kafka can be exposed externally through AWS Network Load Balancers (NLBs). NLB provides Layer 4 TCP forwarding and static external endpoints managed by the AWS Load Balancer Controller. + +## Prerequisites + +- An EKS cluster imported into ACP. +- Kafka operator deployed in the target namespace. +- AWS Load Balancer Controller installed and configured. +- Subnets, security groups, and IAM permissions prepared for NLB creation. + +AWS controller setup reference: + +```text +https://docs.aws.amazon.com/eks/latest/userguide/network-load-balancing.html +``` + +## Kafka Listener Configuration + +Set the external listener type to `loadbalancer` and add AWS NLB annotations to the bootstrap service and each broker service. + +```yaml +apiVersion: kafka.strimzi.io/v1beta2 +kind: Kafka +metadata: + name: my-cluster-nlb +spec: + kafka: + replicas: 3 + listeners: + - name: plain + port: 9092 + tls: false + type: internal + - name: external + port: 9094 + tls: false + type: loadbalancer + configuration: + bootstrap: + annotations: + service.beta.kubernetes.io/aws-load-balancer-type: external + service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip + service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing + brokers: + - broker: 0 + annotations: + service.beta.kubernetes.io/aws-load-balancer-type: external + service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip + service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing + - broker: 1 + annotations: + service.beta.kubernetes.io/aws-load-balancer-type: external + service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip + service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing + - broker: 2 + annotations: + service.beta.kubernetes.io/aws-load-balancer-type: external + service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip + service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing +``` + +The number of broker entries must match the broker replica count. + +## Verify Services + +```bash +kubectl -n get svc | grep +``` + +Use the `EXTERNAL-IP` or DNS name of the bootstrap load balancer with port `9094`. Broker load balancers are also created because Kafka clients need broker-specific addresses after metadata discovery. + +## Test Producer and Consumer + +```bash +./bin/kafka-console-producer.sh \ + --bootstrap-server :9094 \ + --topic my-topic + +./bin/kafka-console-consumer.sh \ + --bootstrap-server :9094 \ + --topic my-topic \ + --from-beginning +``` + +## Important Considerations + +- Kafka clients must be able to reach every advertised broker endpoint, not only the bootstrap endpoint. +- Use security groups to restrict external Kafka access. +- For private clusters, use an internal NLB scheme instead of `internet-facing`. +- DNS propagation and load balancer provisioning can take several minutes. diff --git a/docs/en/solutions/ecosystem/kafka/Kafka_Best_Practices.md b/docs/en/solutions/ecosystem/kafka/Kafka_Best_Practices.md new file mode 100644 index 00000000..0b24237b --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/Kafka_Best_Practices.md @@ -0,0 +1,159 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Kafka Best Practices + +:::info Applicable Versions +ACP 3.14.x and 3.15.x. Most architectural guidance also applies to later Kafka operator releases, but verify exact resource defaults against the operator version installed in your cluster. +::: + +## Introduction + +Kafka is a distributed streaming platform used for high-throughput event ingestion, message buffering, data pipelines, and stream processing. On Kubernetes, Alauda Application Services manages Kafka through the Kafka operator stack: an RDS-facing operator, the Strimzi cluster operator, the entity operator, and Kafka exporter. + +Use this guide when planning production Kafka instances or reviewing an existing deployment. + +## Core Terms + +| Term | Description | +| --- | --- | +| Broker | A Kafka server process. A cluster contains multiple brokers. | +| Topic | A logical stream of records. Producers write to topics and consumers read from topics. | +| Partition | The storage and parallelism unit of a topic. Each topic has one or more partitions. | +| Producer | A client that writes records to topics. | +| Consumer | A client that reads records from topics. | +| Consumer group | A group of consumers that share topic partitions for load balancing. | +| Offset | The monotonically increasing record position in a partition. | +| Lag | The distance between the latest partition offset and the offset consumed by a consumer group. | +| Leader | The broker replica that handles reads and writes for a partition. | +| Follower | A replica that copies data from the leader and can take over after failure. | + +## Operator Components + +| Component | Responsibility | +| --- | --- | +| `rds-operator` | Handles product-layer configuration, UI integration, and RDS custom resources. | +| `cluster-operator` | Creates, updates, and deletes Kafka clusters from the generated Kafka resources. | +| `entity-operator` | Contains topic and user operators for managing Kafka topics and users. Each Kafka instance has its own entity operator. | +| `kafka-exporter` | Connects to brokers and exposes Kafka metrics for monitoring. | + +## Resource Planning + +### CPU + +Kafka is usually I/O-intensive rather than CPU-intensive. CPU is mainly consumed by compression, decompression, TLS, request handling, and high fan-out between producers, consumers, and partitions. Prefer more cores over higher single-core frequency when brokers serve many topics and clients. + +Tune these broker parameters based on CPU sizing and benchmark results: + +```properties +num.network.threads= +num.io.threads= +``` + +### Memory + +Kafka relies heavily on the operating system page cache. If consumers hit page cache, reads avoid disk I/O and throughput improves. Avoid co-locating Kafka brokers with memory-heavy workloads unless node capacity is reserved and validated. + +For JVM heap, a common starting point is 6-8 GiB for large brokers. Keep enough memory outside the heap for page cache: + +```yaml +spec: + kafka: + jvmOptions: + -Xms: 6g + -Xmx: 8g +``` + +### Disk + +Use dedicated disks for Kafka data. Do not share the same disk path with the node system disk or ZooKeeper storage for production brokers. + +Prefer SSD-backed storage for better latency and IOPS. Size disk capacity by message volume, average message size, replica count, retention period, and compression ratio. For example, 1 billion messages per day, 1 KiB average message size, 2 replicas, and 7-day retention requires roughly 14 TiB before adding operational headroom. + +### Network + +Network bandwidth is often a throughput bottleneck. Plan for peak producer and consumer traffic, inter-broker replication, MirrorMaker 2 replication, and client fan-out. + +Useful broker-level parameters include: + +```properties +socket.send.buffer.bytes= +socket.receive.buffer.bytes= +socket.request.max.bytes= +``` + +Enable producer compression when network bandwidth is the limiting factor. Kafka supports codecs such as `gzip`, `snappy`, `lz4`, and `zstd`. Compression saves bandwidth but increases CPU usage. + +## Operator Deployment Modes + +| Mode | Description | Recommended Use | Constraint | +| --- | --- | --- | --- | +| Cluster mode | One operator manages instances across all namespaces. | Resource-constrained clusters or centralized operation. Keep the managed instance count moderate. | Operator must run in the platform default namespace. | +| Multi-namespace mode | One operator manages a selected set of namespaces. | Moderate isolation with lower operator overhead. | Do not deploy another operator into the same namespace. | +| Single-namespace mode | One operator manages only its own namespace. | Strong isolation between tenants or workloads. | Higher operator overhead. | + +## Creating Instances + +In the Data Services view, select **Kafka**, choose the project and namespace, then create a Kafka instance. For 3.x deployments, use the latest Kafka version supported by your operator unless your application requires a specific version. + +### Reference Resource Sizes + +| Component | Small Production Starting Point | +| --- | --- | +| Kafka broker | 2 vCPU / 4 GiB, 3 replicas | +| ZooKeeper | 1 vCPU / 2 GiB, 3 replicas | +| Kafka exporter | 300m CPU / 128 MiB | +| Topic operator | 500m CPU / 500 MiB or higher for many topics | +| User operator | 500m CPU / 500 MiB or higher for many users | + +For heavier workloads, benchmark with production-like producers and consumers, then scale brokers, partitions, and disks together. + +### Important Parameters + +| Parameter | Recommended Default | Reason | +| --- | --- | --- | +| `auto.create.topics.enable` | `false` | Create topics explicitly so partition count, replica count, and retention are controlled. | +| `auto.leader.rebalance.enable` | `false` | Avoid unexpected leader movement in production. Rebalance leaders manually after planned maintenance when needed. | +| `log.message.format.version` | Match the Kafka version used by clients during upgrades. | Prevents wire-format compatibility surprises. | +| `offsets.topic.replication.factor` | `3` | Keeps internal consumer offsets highly available. | +| `transaction.state.log.replication.factor` | `3` | Required for reliable transactional workloads. | +| `transaction.state.log.min.isr` | `2` | Prevents acknowledged transactional writes when too few replicas are in sync. | + +## Scheduling + +Enable pod anti-affinity for Kafka and ZooKeeper pods so replicas are spread across nodes. A three-broker Kafka cluster and a three-node ZooKeeper ensemble require at least three schedulable nodes for hard anti-affinity. + +Hard anti-affinity improves availability but can block scheduling when nodes are scarce. Use soft anti-affinity only when the cluster does not have enough dedicated nodes and the availability tradeoff is acceptable. + +## Application Access + +| Scenario | Recommended Access Mode | +| --- | --- | +| Application runs in the same Kubernetes cluster | Use the internal bootstrap service, for example `-kafka-bootstrap:9092`. | +| Application runs outside the Kubernetes cluster | Use `NodePort` or `LoadBalancer` external listener, depending on the environment. | +| Applications require authenticated access | Enable SCRAM-SHA-512 on the listener and create `KafkaUser` or `RdsKafkaUser` resources. | + +For Kafka clusters with external access, each broker must remain individually reachable by clients after metadata discovery. Do not expose only one broker endpoint unless a Kafka-aware proxy or supported load-balancer configuration is used. + +## Operations + +- Monitor broker availability, under-replicated partitions, ISR changes, disk usage, request latency, consumer lag, and controller count. +- Keep broker data disks below operational thresholds. Alert before retention or disk pressure affects availability. +- Re-evaluate partition count when scaling brokers. Adding brokers alone does not automatically move existing partition replicas. +- Use explicit topic configuration for retention, segment size, partition count, and replica count. +- Avoid underscores in custom resource names. Kubernetes resource names must satisfy RFC 1123 naming rules. + +## Important Considerations + +- Kafka throughput is constrained by the slowest of disk, network, CPU, and client configuration. Benchmark before committing sizing numbers. +- Keep ZooKeeper storage and Kafka broker storage separated. +- Plan memory for page cache, not only JVM heap. +- Use hard anti-affinity for production clusters when enough nodes are available. +- Explicitly create topics instead of relying on auto-creation. +- Review operator defaults after every platform or operator upgrade. diff --git a/docs/en/solutions/ecosystem/kafka/Kafka_Certificate_Expiry_Restart_on_ACP_3102.md b/docs/en/solutions/ecosystem/kafka/Kafka_Certificate_Expiry_Restart_on_ACP_3102.md new file mode 100644 index 00000000..c92d8801 --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/Kafka_Certificate_Expiry_Restart_on_ACP_3102.md @@ -0,0 +1,68 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Kafka Restarts After Client CA Certificate Expiry on ACP 3.10.2 + +:::info Applicable Versions +ACP 3.10.2. +::: + +## Problem + +A Kafka cluster is recreated or restarted unexpectedly around one year after creation. Operator logs indicate certificate renewal activity, and the Kafka client CA certificate shows a one-year validity period. + +In the affected setup, the operator creates Kafka clusters with a one-year `clientsCa` validity. Near expiry, certificate renewal can trigger Kafka pod restarts. + +## Diagnosis + +Export and inspect the client CA certificate: + +```bash +kubectl -n get secret -clients-ca-cert -o jsonpath='{.data.ca\.crt}' | base64 -d > client-ca.crt +openssl x509 -in client-ca.crt -noout -dates +``` + +Review Kafka and operator events around the restart time: + +```bash +kubectl -n get events --sort-by=.lastTimestamp | grep -i kafka +kubectl -n logs deploy/ +``` + +## Resolution + +Add `clientsCa.validityDays` to the Kafka resource so both cluster and client CA certificates use a longer validity period: + +```yaml +spec: + clusterCa: + validityDays: 3650 + clientsCa: + validityDays: 3650 +``` + +Then manually trigger client CA renewal: + +```bash +kubectl -n annotate secret -clients-ca-cert \ + strimzi.io/force-renew=true --overwrite +``` + +Verify the new certificate dates: + +```bash +kubectl -n get secret -clients-ca-cert -o jsonpath='{.data.ca\.crt}' | base64 -d > client-ca.crt +openssl x509 -in client-ca.crt -noout -dates +``` + +## Important Considerations + +- Updating the product-level Kafka resource can cause the generated community Kafka YAML to revert. Confirm the final generated YAML still contains the desired CA settings. +- Certificate renewal can restart clients or brokers depending on the configuration. Schedule the change in a maintenance window when possible. +- Keep application truststores aligned with the renewed CA material. diff --git a/docs/en/solutions/ecosystem/kafka/Kafka_Cluster_Optimization_on_ACP_38.md b/docs/en/solutions/ecosystem/kafka/Kafka_Cluster_Optimization_on_ACP_38.md new file mode 100644 index 00000000..ea550107 --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/Kafka_Cluster_Optimization_on_ACP_38.md @@ -0,0 +1,136 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Kafka Cluster Optimization on ACP 3.8 + +:::info Applicable Versions +ACP 3.8.x. +::: + +## Introduction + +This guide summarizes resource and configuration adjustments for Kafka clusters on ACP 3.8.x. Use it as a review checklist when creating or tuning Kafka instances. + +## Reference Resource Plan + +For a small 2 vCPU / 4 GiB Kafka broker profile, plan the surrounding components as well: + +| Component | Resource | Storage | +| --- | --- | --- | +| Kafka broker | 2 vCPU / 4 GiB | 100 GiB or workload-specific | +| ZooKeeper | 2 vCPU / 4 GiB | 10-100 GiB depending on policy | +| TLS sidecar | 200m CPU / 128 MiB | N/A | +| Topic operator | 1 vCPU / 2 GiB | N/A | +| User operator | 1 vCPU / 2 GiB | N/A | +| Kafka exporter | 1 vCPU / 2 GiB | N/A | + +## JVM Sizing + +Set Kafka and ZooKeeper JVM heap to roughly one quarter of the container memory limit as a starting point: + +```yaml +spec: + kafka: + jvmOptions: + -Xms: 1024m + -Xmx: 1024m + resources: + limits: + cpu: 2 + memory: 4Gi + requests: + cpu: 2 + memory: 4Gi + zookeeper: + jvmOptions: + -Xms: 1024m + -Xmx: 1024m + resources: + limits: + cpu: 2 + memory: 4Gi + requests: + cpu: 2 + memory: 4Gi +``` + +## Entity Operator Resources + +```yaml +spec: + entityOperator: + tlsSidecar: + resources: + limits: + cpu: 200m + memory: 128Mi + requests: + cpu: 200m + memory: 128Mi + topicOperator: + jvmOptions: + -Xms: 500m + -Xmx: 500m + resources: + limits: + cpu: "1" + memory: 2Gi + requests: + cpu: "1" + memory: 2Gi + userOperator: + jvmOptions: + -Xms: 500m + -Xmx: 500m + resources: + limits: + cpu: "1" + memory: 2Gi + requests: + cpu: "1" + memory: 2Gi +``` + +## Kafka Exporter Resources + +```yaml +spec: + kafkaExporter: + groupRegex: ".*" + topicRegex: ".*" + resources: + limits: + cpu: 1 + memory: 2Gi + requests: + cpu: 1 + memory: 2Gi +``` + +## Naming Rules + +Do not use underscores in Kafka-related custom resource names. Kubernetes resource names must follow RFC 1123. Use lowercase letters, numbers, hyphens, and dots, and start and end with an alphanumeric character. + +## Topic Defaults + +In ACP 3.8.2 and later, the default topic segment size was corrected. For older resources, verify topic config and set the intended value explicitly when needed: + +```yaml +spec: + config: + retention.ms: "604800000" + message.max.bytes: "1073741824" +``` + +## Important Considerations + +- Size the topic operator and user operator according to the number of topics and users, not only the broker size. +- Use explicit topic configuration for production workloads. +- Keep resource requests equal to limits for predictable Kafka scheduling when using dedicated nodes. +- Validate generated YAML after UI changes because product defaults can vary by patch version. diff --git a/docs/en/solutions/ecosystem/kafka/Kafka_DR_Data_Migration_with_SCRAM_SHA_512.md b/docs/en/solutions/ecosystem/kafka/Kafka_DR_Data_Migration_with_SCRAM_SHA_512.md new file mode 100644 index 00000000..f5084707 --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/Kafka_DR_Data_Migration_with_SCRAM_SHA_512.md @@ -0,0 +1,201 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Kafka Disaster Recovery and Data Migration with SCRAM-SHA-512 + +:::info Applicable Versions +ACP 3.x Kafka clusters using SCRAM-SHA-512 authentication. +::: + +## Introduction + +This guide describes how to use MirrorMaker 2 to replicate data from a source Kafka cluster to a target Kafka cluster when SCRAM-SHA-512 authentication is enabled. The same pattern can be used for hot standby disaster recovery or migration. + +## Architecture + +- Source cluster: the Kafka cluster currently used by applications. +- Target cluster: the Kafka cluster that receives replicated topics and data. +- MirrorMaker 2: runs in the target namespace, consumes from the source cluster, and produces to the target cluster. + +## Procedure + +### 1. Create the Target Cluster + +Create the target Kafka cluster and enable SCRAM-SHA-512 authentication on the listener used by MirrorMaker 2. + +### 2. Create a Source User + +MirrorMaker 2 needs a source-side user with permission to read topics and consumer groups. + +For an RDS-managed Kafka source cluster: + +```yaml +apiVersion: middleware.alauda.io/v1 +kind: RdsKafkaUser +metadata: + name: sync-user + namespace: + labels: + middleware.alauda.io/cluster: +spec: + authentication: + type: scram-sha-512 + authorization: + type: simple + acls: + - host: "*" + operation: All + resource: + type: topic + name: "*" + patternType: literal + - host: "*" + operation: All + resource: + type: group + name: "*" + patternType: literal +``` + +For a native Kafka source cluster, create SCRAM credentials and ACLs with Kafka scripts: + +```bash +bin/kafka-configs.sh --zookeeper 127.0.0.1:2181 --alter \ + --add-config 'SCRAM-SHA-256=[password=],SCRAM-SHA-512=[password=]' \ + --entity-type users \ + --entity-name sync-user + +bin/kafka-acls.sh --authorizer kafka.security.auth.SimpleAclAuthorizer \ + --authorizer-properties zookeeper.connect=127.0.0.1:2181 \ + --add --allow-principal User:sync-user --operation All --topic "*" + +bin/kafka-acls.sh --authorizer kafka.security.auth.SimpleAclAuthorizer \ + --authorizer-properties zookeeper.connect=127.0.0.1:2181 \ + --add --allow-principal User:sync-user --operation All --group "*" +``` + +### 3. Create the Source Password Secret in the Target Namespace + +MirrorMaker 2 reads the source password from a Kubernetes secret in the namespace where MirrorMaker 2 runs: + +```bash +echo -n '' > MY-PASSWORD.txt +kubectl -n create secret generic sync-user-secret \ + --from-file=password=./MY-PASSWORD.txt +``` + +### 4. Create a Target User + +```yaml +apiVersion: middleware.alauda.io/v1 +kind: RdsKafkaUser +metadata: + name: target-cluster-user + namespace: + labels: + middleware.alauda.io/cluster: +spec: + authentication: + type: scram-sha-512 + authorization: + type: simple + acls: + - host: "*" + operation: All + resource: + type: topic + name: "*" + patternType: literal + - host: "*" + operation: All + resource: + type: group + name: "*" + patternType: literal +``` + +The target user's generated password secret usually has the same name as the user. + +### 5. Create MirrorMaker 2 + +```yaml +apiVersion: kafka.strimzi.io/v1alpha1 +kind: KafkaMirrorMaker2 +metadata: + name: my-mm2-cluster + namespace: +spec: + clusters: + - alias: my-cluster-source + bootstrapServers: :9092 + authentication: + type: scram-sha-512 + username: sync-user + passwordSecret: + secretName: sync-user-secret + password: password + - alias: my-cluster-target + bootstrapServers: target-cluster-kafka-bootstrap:9092 + authentication: + type: scram-sha-512 + username: target-cluster-user + passwordSecret: + secretName: target-cluster-user + password: password + config: + config.storage.replication.factor: 1 + offset.storage.replication.factor: 1 + status.storage.replication.factor: 1 + connectCluster: my-cluster-target + mirrors: + - sourceCluster: my-cluster-source + targetCluster: my-cluster-target + topicsPattern: ".*" + groupsPattern: ".*" + checkpointConnector: + config: + emit.checkpoints.interval.seconds: 60 + checkpoints.topic.replication.factor: 1 + sync.group.offsets.enabled: "true" + sync.group.offsets.interval.seconds: 60 + refresh.groups.interval.seconds: 600 + replication.policy.class: "io.strimzi.kafka.connect.mirror.IdentityReplicationPolicy" + heartbeatConnector: + config: + heartbeats.topic.replication.factor: 1 + sourceConnector: + tasksMax: 3 + config: + offset-syncs.topic.replication.factor: 1 + refresh.topics.interval.seconds: 600 + replication.factor: 2 + sync.topic.acls.enabled: "false" + replication.policy.class: "io.strimzi.kafka.connect.mirror.IdentityReplicationPolicy" + replicas: 1 + version: 2.7.0 +``` + +## Verify Synchronization + +Check topic synchronization: + +```bash +kafka-topics.sh --bootstrap-server :9092 --list +``` + +Check data synchronization by producing to the source and consuming from the target. + +Monitor lag. Synchronization is complete for a cutover only when topic data is present and lag reaches zero or an acceptable business threshold. + +## Important Considerations + +- The source and target users must have `All` permissions for topics and groups used by MirrorMaker 2. +- MirrorMaker 2 runs in the target namespace, so required source credentials must be copied there. +- `IdentityReplicationPolicy` keeps topic names unchanged, which is normally required for active-passive DR. +- Replication is asynchronous. Freeze writes during final migration cutover when exact consistency is required. diff --git a/docs/en/solutions/ecosystem/kafka/Kafka_DR_Deployment_Guide_for_ACP_315.md b/docs/en/solutions/ecosystem/kafka/Kafka_DR_Deployment_Guide_for_ACP_315.md new file mode 100644 index 00000000..6e0017aa --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/Kafka_DR_Deployment_Guide_for_ACP_315.md @@ -0,0 +1,254 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Kafka Disaster Recovery Deployment Guide for ACP 3.15 + +:::info Applicable Versions +ACP 3.12 and later; validated for ACP 3.15. +::: + +## Introduction + +This guide describes the deployment side of a Kafka hot standby disaster recovery solution based on MirrorMaker 2. The source cluster is used for normal production and consumption. The target cluster receives replicated data and can be used after failover or by separate read-only workloads. + +The solution targets same-city dual-site deployments where bandwidth and latency can be controlled. + +## DR Characteristics + +| Item | Value | +| --- | --- | +| DR level | Level 5 DR pattern | +| Replication engine | MirrorMaker 2 | +| Synchronization mode | Near-real-time asynchronous replication | +| RTO | Minutes, depending on manual failover speed | +| RPO | Seconds when replication is sized correctly | + +## Risks and Limitations + +- Data loss can occur because replication is asynchronous. +- Duplicate consumption can occur because consumer group offsets are synchronized periodically. +- Insufficient bandwidth or high latency increases replication lag. +- Failover requires manual judgment and connection switching. +- MirrorMaker 2 consumes CPU, memory, and network bandwidth. Size it separately from application consumers. + +## Key MirrorMaker 2 Parameters + +| Parameter | Description | +| --- | --- | +| `checkpointConnector.config.sync.group.offsets.enabled` | Enables consumer group offset synchronization. | +| `checkpointConnector.config.sync.group.offsets.interval.seconds` | Consumer group offset sync interval. Default is 60 seconds. | +| `checkpointConnector.config.refresh.groups.interval.seconds` | New consumer group discovery interval. Default is 10 minutes. | +| `sourceConnector.config.refresh.topics.interval.seconds` | New topic discovery interval. Default is 10 minutes. | +| `sourceConnector.config.replication.factor` | Replica count for replicated topics on the target cluster. | +| `topicsPattern` | Topic name pattern to replicate. | + +## Prerequisites + +- Source and target business clusters are upgraded to ACP 3.12 or later. +- Source and target Kafka instances use similar resource, storage, and parameter profiles. +- Network bandwidth between the two sites is higher than the source write traffic. +- Target storage capacity and performance match the source cluster. +- Prometheus is deployed if MirrorMaker 2 monitoring is required. + +## Deploy Without Authentication + +Create the MirrorMaker 2 metrics `ConfigMap` and the `KafkaMirrorMaker2` resource in the target Kafka namespace. + +```yaml +apiVersion: kafka.strimzi.io/v1beta2 +kind: KafkaMirrorMaker2 +metadata: + name: my-mm2-cluster + namespace: +spec: + resources: + limits: + cpu: "1" + memory: 2Gi + requests: + cpu: "1" + memory: 2Gi + jvmOptions: + -Xms: 1g + -Xmx: 1g + clusters: + - alias: my-cluster-source + bootstrapServers: :9092 + - alias: my-cluster-target + bootstrapServers: target-cluster-kafka-bootstrap:9092 + config: + config.storage.replication.factor: 1 + offset.storage.replication.factor: 1 + status.storage.replication.factor: 1 + connectCluster: my-cluster-target + mirrors: + - sourceCluster: my-cluster-source + targetCluster: my-cluster-target + topicsPattern: ".*" + groupsPattern: ".*" + checkpointConnector: + config: + emit.checkpoints.interval.seconds: 60 + checkpoints.topic.replication.factor: 1 + sync.group.offsets.enabled: "true" + sync.group.offsets.interval.seconds: 60 + refresh.groups.interval.seconds: 600 + replication.policy.class: "io.strimzi.kafka.connect.mirror.IdentityReplicationPolicy" + heartbeatConnector: + config: + heartbeats.topic.replication.factor: 1 + sourceConnector: + tasksMax: 1 + config: + offset-syncs.topic.replication.factor: 1 + refresh.topics.interval.seconds: 600 + replication.factor: 2 + sync.topic.acls.enabled: "false" + replication.policy.class: "io.strimzi.kafka.connect.mirror.IdentityReplicationPolicy" + replicas: 1 + version: 2.7.0 +``` + +## Deploy With SCRAM-SHA-512 Authentication + +### 1. Create a Source User + +Create an `RdsKafkaUser` on the source cluster with `All` permissions for topics and groups: + +```yaml +apiVersion: middleware.alauda.io/v1 +kind: RdsKafkaUser +metadata: + name: sync-user + namespace: + labels: + middleware.alauda.io/cluster: +spec: + authentication: + type: scram-sha-512 + authorization: + type: simple + acls: + - host: "*" + operation: All + resource: + type: topic + name: "*" + patternType: literal + - host: "*" + operation: All + resource: + type: group + name: "*" + patternType: literal +``` + +### 2. Create a Password Secret in the Target Namespace + +MirrorMaker 2 runs in the target namespace, so copy the source user's password there: + +```bash +echo -n '' > MY-PASSWORD.txt +kubectl -n create secret generic sync-user-secret \ + --from-file=password=./MY-PASSWORD.txt +``` + +### 3. Create a Target User + +```yaml +apiVersion: middleware.alauda.io/v1 +kind: RdsKafkaUser +metadata: + name: target-cluster-user + namespace: + labels: + middleware.alauda.io/cluster: +spec: + authentication: + type: scram-sha-512 + authorization: + type: simple + acls: + - host: "*" + operation: All + resource: + type: topic + name: "*" + patternType: literal + - host: "*" + operation: All + resource: + type: group + name: "*" + patternType: literal +``` + +### 4. Add Authentication to MirrorMaker 2 + +```yaml +spec: + clusters: + - alias: my-cluster-source + bootstrapServers: :9092 + authentication: + type: scram-sha-512 + username: sync-user + passwordSecret: + secretName: sync-user-secret + password: password + - alias: my-cluster-target + bootstrapServers: target-cluster-kafka-bootstrap:9092 + authentication: + type: scram-sha-512 + username: target-cluster-user + passwordSecret: + secretName: target-cluster-user + password: password +``` + +## Verify Deployment + +```bash +kubectl -n get kmm2 +kubectl -n get pod | grep mirrormaker2 +``` + +The MirrorMaker 2 custom resource should become ready. + +## Monitoring + +Create a `PodMonitor` for MirrorMaker 2 metrics: + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: PodMonitor +metadata: + name: kafka-mirrormaker2 + namespace: operators + labels: + prometheus: kube-prometheus +spec: + jobLabel: strimzi.io/cluster + namespaceSelector: + any: true + podMetricsEndpoints: + - interval: 15s + path: /metrics + port: tcp-prometheus + selector: + matchLabels: + strimzi.io/kind: KafkaMirrorMaker2 +``` + +## Important Considerations + +- Set `tasksMax` based on topic count, partition count, and replication lag. +- Keep `replication.policy.class` as `IdentityReplicationPolicy` for active-passive DR when topic names must remain unchanged. +- Use separate credentials for source reads and target writes. +- Validate topic sync, data sync, consumer group sync, and failover before using the DR environment for production. diff --git a/docs/en/solutions/ecosystem/kafka/Kafka_DR_Monitoring_and_Alerts.md b/docs/en/solutions/ecosystem/kafka/Kafka_DR_Monitoring_and_Alerts.md new file mode 100644 index 00000000..3cf9ab1b --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/Kafka_DR_Monitoring_and_Alerts.md @@ -0,0 +1,150 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Kafka Disaster Recovery Monitoring and Alerts + +:::info Applicable Versions +ACP 3.x with Kafka MirrorMaker 2 based disaster recovery. +::: + +## Introduction + +MirrorMaker 2 DR monitoring should answer three questions: + +- Is MirrorMaker 2 running? +- Is data being replicated at the required rate? +- Can the target cluster accept failover traffic? + +This guide summarizes the dashboard panels and alert expressions used for Kafka DR. + +## Deploy Monitoring + +Create a `PodMonitor` so Prometheus scrapes MirrorMaker 2 metrics: + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: PodMonitor +metadata: + name: kafka-mirrormaker2 + namespace: operators + labels: + prometheus: kube-prometheus +spec: + jobLabel: strimzi.io/cluster + namespaceSelector: + any: true + podMetricsEndpoints: + - interval: 15s + path: /metrics + port: tcp-prometheus + selector: + matchLabels: + strimzi.io/kind: KafkaMirrorMaker2 +``` + +Import the MirrorMaker 2 dashboard JSON into the embedded Grafana after Prometheus starts scraping the pod. + +## Dashboard Panels + +| Panel | Description | +| --- | --- | +| Number of Connectors | Number of active MirrorMaker 2 connectors. | +| Number of Tasks | Number of connector tasks. | +| Total record rate | Records replicated per second. | +| Total byte rate | Bytes replicated per second. | +| Incoming bytes | Bytes read from the source cluster. | +| Outgoing bytes | Bytes written to the target cluster. | +| CPU Usage | MirrorMaker 2 CPU usage. | +| JVM Memory | JVM memory usage. | +| Time spent in GC | JVM garbage collection pressure. | +| Record Age | Age of records waiting in MirrorSourceConnector. | +| Replication Latency | Replication delay from source to target. | +| Consumer Lag | Lag for replicated topics and partitions. | + +## Alert Variables + +Use these variables in alert templates: + +| Variable | Meaning | +| --- | --- | +| `$namespace` | Namespace of the target Kafka and MirrorMaker 2 resources. | +| `$targetClusterName` | Target Kafka cluster name. | +| `$mm2ClusterName` | MirrorMaker 2 custom resource name. | +| `$topicName` | Optional topic filter. | + +## Recommended Alerts + +```text +count(kafka_server_replicamanager_leadercount{namespace="$namespace",job="$targetClusterName"}) +``` + +Target broker count. + +```text +avg(kubelet_volume_stats_used_bytes{namespace="$namespace",persistentvolumeclaim=~"data-$targetClusterName-.*"}) by (persistentvolumeclaim) +/ +avg(kubelet_volume_stats_capacity_bytes{namespace="$namespace",persistentvolumeclaim=~"data-$targetClusterName-.*"}) by (persistentvolumeclaim) +``` + +Target Kafka PVC usage ratio. + +```text +sum(kafka_controller_kafkacontroller_activecontrollercount{job="$targetClusterName",namespace="$namespace"}) +``` + +Active controller count. Normally exactly one controller should be active. + +```text +sum(kube_pod_container_status_ready{namespace="$namespace",pod=~"$mm2ClusterName-.*"}) +``` + +MirrorMaker 2 ready pod count. + +```text +max(sum(kafka_consumer_fetch_manager_records_lag{namespace="$namespace",job="$mm2ClusterName",clientid!~"consumer-mirrormaker2-.*"}) by (topic, partition)) +``` + +Maximum lag across replicated topics. + +```text +max(sum(kafka_connect_mirror_mirrorsourceconnector_replication_latency_ms{namespace="$namespace",job="$mm2ClusterName"}) by (partition, topic)) +``` + +Maximum replication latency. + +```text +max(sum(kafka_connect_mirror_mirrorsourceconnector_record_age_ms{namespace="$namespace",job="$mm2ClusterName"}) by (partition, topic)) +``` + +Maximum record age in MirrorSourceConnector. + +```text +sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod=~"$mm2ClusterName-mirrormaker2-.+",container="$mm2ClusterName-mirrormaker2"}[5m])) by (pod) +``` + +MirrorMaker 2 CPU usage. + +```text +sum without(area)(jvm_memory_bytes_used{namespace="$namespace",job="$mm2ClusterName"}) +``` + +MirrorMaker 2 JVM memory usage. + +## RTO and RPO + +RTO is the maximum time required to restore service after a failure. In this solution, RTO is mostly determined by manual failover decision time and application connection switching. + +RPO is the maximum data loss after a failure. In MirrorMaker 2 DR, RPO depends on replication throughput, replication lag, and consumer group offset checkpoint timing. With sufficient capacity, RPO can be seconds, but it must be measured in drills. + +## Important Considerations + +- A healthy MirrorMaker 2 pod does not guarantee the target Kafka cluster is ready for failover. Monitor both sides. +- Alert on lag trends, not only absolute values, when traffic is bursty. +- Disk usage on the target cluster is part of DR readiness. +- Re-test alert expressions after Prometheus, VictoriaMetrics, or label conventions change. diff --git a/docs/en/solutions/ecosystem/kafka/Kafka_DR_Operations_Manual_for_ACP_315.md b/docs/en/solutions/ecosystem/kafka/Kafka_DR_Operations_Manual_for_ACP_315.md new file mode 100644 index 00000000..cadd74c9 --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/Kafka_DR_Operations_Manual_for_ACP_315.md @@ -0,0 +1,110 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Kafka Disaster Recovery Operations Manual for ACP 3.15 + +:::info Applicable Versions +ACP 3.15. +::: + +## Introduction + +This operations guide covers routine monitoring, alerting, troubleshooting, and drill practices for a Kafka hot standby DR deployment based on MirrorMaker 2. + +## Routine Monitoring + +MirrorMaker 2 monitoring should show replication health and resource usage. Import the MirrorMaker 2 dashboard into the embedded Grafana after Prometheus is available on the target ACP cluster. + +Common dashboard filters: + +| Filter | Description | +| --- | --- | +| Namespace | Namespace where the MirrorMaker 2 instance runs. | +| Cluster Name | Name of the MirrorMaker 2 custom resource. | +| Topic Name | Topic to inspect. | + +Important panels: + +| Panel | Meaning | +| --- | --- | +| Number of Connectors | Number of MirrorMaker 2 connectors. | +| Number of Tasks | Number of running connector tasks. | +| Total record rate | Overall record replication rate. | +| Total byte rate | Overall byte replication rate. | +| Incoming bytes | Bytes read from the source cluster. | +| Outgoing bytes | Bytes written to the target cluster. | +| CPU Usage | MirrorMaker 2 CPU usage. | +| JVM Memory | MirrorMaker 2 JVM memory usage. | +| Time spent in GC | JVM garbage collection time. | +| Record Age | How long records stay in the MirrorSourceConnector before replication. High values can indicate slow replication. | +| Replication Latency | Source-to-target replication latency in milliseconds. | +| Consumer Lag | Replication lag by topic and partition. | + +## Alert Expressions + +Replace `$namespace`, `$targetClusterName`, `$mm2ClusterName`, and `$topicName` with your environment values. If the environment uses VictoriaMetrics with a cluster label, add the appropriate `vmcluster` filter. + +| Alert | Example Expression | +| --- | --- | +| Target broker count | `count(kafka_server_replicamanager_leadercount{namespace="$namespace",job="$targetClusterName"})` | +| Target PVC usage | `(avg(kubelet_volume_stats_used_bytes{namespace="$namespace",persistentvolumeclaim=~"data-$targetClusterName-.*"}) by (persistentvolumeclaim) / avg(kubelet_volume_stats_capacity_bytes{namespace="$namespace",persistentvolumeclaim=~"data-$targetClusterName-.*"}) by (persistentvolumeclaim))` | +| Active controller count | `sum(kafka_controller_kafkacontroller_activecontrollercount{job="$targetClusterName",namespace="$namespace"})` | +| MirrorMaker 2 ready pods | `sum(kube_pod_container_status_ready{namespace="$namespace",pod=~"$mm2ClusterName-.*"})` | +| Max lag across all topics | `max(sum(kafka_consumer_fetch_manager_records_lag{namespace="$namespace",job="$mm2ClusterName",clientid!~"consumer-mirrormaker2-.*"}) by (topic, partition))` | +| Max lag for one topic | `max(sum(kafka_consumer_fetch_manager_records_lag{topic=~"$topicName",namespace="$namespace",job="$mm2ClusterName",clientid!~"consumer-mirrormaker2-.*"}) by (topic, partition))` | +| Lag growth | `sum(delta(kafka_consumer_fetch_manager_records_lag{namespace="$namespace",job="$mm2ClusterName",clientid!~"consumer-mirrormaker2-.*"}[5m])) by (topic, partition)` | +| Max replication latency | `max(sum(kafka_connect_mirror_mirrorsourceconnector_replication_latency_ms{namespace="$namespace",job="$mm2ClusterName"}) by (partition, topic))` | +| Record age | `max(sum(kafka_connect_mirror_mirrorsourceconnector_record_age_ms{namespace="$namespace",job="$mm2ClusterName"}) by (partition, topic))` | +| CPU usage | `sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod=~"$mm2ClusterName-mirrormaker2-.+",container="$mm2ClusterName-mirrormaker2"}[5m])) by (pod)` | +| JVM memory | `sum without(area)(jvm_memory_bytes_used{namespace="$namespace",job="$mm2ClusterName"})` | + +## Troubleshooting + +### Kafka Startup Is Abnormal + +If the target Kafka cluster is not ready, MirrorMaker 2 may be unable to write replicated data. Check broker status, controller count, PVC capacity, and under-replicated partitions. + +```bash +kubectl -n get pod +kubectl -n get kafka -o yaml +``` + +### MirrorMaker 2 Lag Keeps Increasing + +Check these items: + +- Source write rate exceeds MirrorMaker 2 throughput. +- `tasksMax` is too low for the topic and partition count. +- CPU or memory limits are too small. +- Network bandwidth or latency between sites is insufficient. +- Target Kafka brokers are slow or under disk pressure. + +### Consumer Group Offset Is Behind After Failover + +Consumer group offsets are checkpointed periodically. Reduce `sync.group.offsets.interval.seconds` if the business requires a lower RPO for consumer offsets, and ensure applications can tolerate duplicates. + +## DR Drill Checklist + +- Confirm source and target Kafka clusters are ready. +- Confirm MirrorMaker 2 status is ready. +- Confirm selected topics are replicated. +- Produce and consume test data. +- Stop the source consumer and continue producing. +- Simulate source failure in a controlled environment. +- Switch clients to the target bootstrap endpoint. +- Measure RTO and RPO. +- Record any duplicate records observed by the consumer. +- Restore the source cluster and document switch-back steps. + +## Important Considerations + +- Alerts should be tuned to business traffic patterns. Low traffic periods can make some rate-based alerts noisy. +- Use both replication lag and record age to judge replication health. +- High target PVC usage can break DR even when MirrorMaker 2 itself is healthy. +- Treat DR drills as part of release validation after Kafka operator upgrades. diff --git a/docs/en/solutions/ecosystem/kafka/Kafka_DR_Validation_for_ACP_315.md b/docs/en/solutions/ecosystem/kafka/Kafka_DR_Validation_for_ACP_315.md new file mode 100644 index 00000000..b6b650b9 --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/Kafka_DR_Validation_for_ACP_315.md @@ -0,0 +1,87 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Kafka Disaster Recovery Validation for ACP 3.15 + +:::info Applicable Versions +ACP 3.15. +::: + +## Introduction + +After deploying Kafka hot standby disaster recovery with MirrorMaker 2, validate topic replication, data replication, consumer group synchronization, monitoring, and failover behavior. + +## Prerequisites + +- The source Kafka cluster, target Kafka cluster, and MirrorMaker 2 instance are deployed. +- The MirrorMaker 2 custom resource is ready. +- Monitoring is configured if alert and dashboard validation is required. +- You have a test topic and test producer/consumer tools available. + +## Validate Topic Synchronization + +1. Create a topic on the source cluster. +2. Check whether the topic appears on the target cluster. + +New topics are not synchronized immediately by default. MirrorMaker 2 discovers new topics based on `refresh.topics.interval.seconds`, which is commonly set to 600 seconds. Lower this value for testing if faster discovery is required. + +## Validate Data Synchronization + +1. Produce test records to the source topic. +2. Consume from the corresponding topic on the target cluster. +3. Confirm the target topic receives the expected records. + +Example producer and consumer commands: + +```bash +kafka-console-producer.sh \ + --bootstrap-server :9092 \ + --topic my-topic + +kafka-console-consumer.sh \ + --bootstrap-server :9092 \ + --topic my-topic \ + --from-beginning +``` + +## Validate Consumer Group Synchronization + +1. Produce records to a source topic. +2. Consume part of the records on the source cluster with a fixed group, for example `group1`. +3. Stop the source consumer. +4. Continue producing records to the source topic. +5. Check the target cluster for the synchronized consumer group. +6. Start a consumer with the same group on the target cluster and confirm it continues from the synchronized offset rather than always starting from the beginning. + +Consumer group offsets are synchronized periodically, not in real time. Some duplicate consumption can occur after failover if the source cluster fails before the latest offset checkpoint reaches the target cluster. + +## Validate Monitoring + +1. Produce traffic to the source topic. +2. Open the MirrorMaker 2 dashboard. +3. Confirm that replicated topics and traffic appear in the panels. +4. Check lag, replication latency, record age, CPU, JVM memory, and task count. + +## Validate Failover + +1. Produce and consume records on the source cluster with a fixed consumer group. +2. Stop the source consumer after it has consumed part of the records. +3. Continue producing several more records. +4. Simulate source cluster failure by stopping the Kafka operator or broker workloads in a controlled test environment. +5. Switch application bootstrap addresses to the target cluster. +6. Produce and consume on the target cluster. +7. Confirm that the consumer group can continue from the synchronized offset. +8. Restore the source cluster and test the planned switch-back procedure separately. + +## Important Considerations + +- Run failover validation only in a non-production or approved DR drill window. +- Duplicate consumption is expected in some failure timings. Applications must be idempotent or otherwise tolerate duplicate messages. +- Topic creation and consumer group synchronization use separate refresh intervals. +- Record the observed RTO and RPO from the drill instead of relying only on theoretical values. diff --git a/docs/en/solutions/ecosystem/kafka/Kafka_Entity_Operator_Startup_Failure_Due_to_Container_Limits.md b/docs/en/solutions/ecosystem/kafka/Kafka_Entity_Operator_Startup_Failure_Due_to_Container_Limits.md new file mode 100644 index 00000000..c6ede733 --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/Kafka_Entity_Operator_Startup_Failure_Due_to_Container_Limits.md @@ -0,0 +1,51 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Kafka Entity Operator Fails to Start Because Container Limits Are Too Small + +:::info Applicable Versions +ACP 3.6 and 3.6.1. +::: + +## Problem + +When creating a Kafka instance, the `kafka-entity-operator` pod fails to start. The pod receives only the namespace default container limit, for example a very small CPU limit, because the deployment template in the affected version does not set explicit resources for this component. + +## Diagnosis + +Check the entity operator pod and deployment: + +```bash +kubectl -n get pod | grep kafka-entity-operator +kubectl -n describe pod +kubectl -n get deploy -o yaml +``` + +If the container resources are inherited from the namespace default limit and are too small for startup, the pod can repeatedly restart or stay unhealthy. + +## Resolution + +Increase the default container limit for the project or namespace, then recreate the affected pod so it is scheduled with the larger resource limit: + +```bash +kubectl -n delete pod +``` + +After the pod is recreated, verify that it receives the updated resource limits and becomes ready: + +```bash +kubectl -n describe pod +kubectl -n get pod +``` + +## Important Considerations + +- This is a template issue in the affected 3.6 releases. Later versions define standard resource parameters for the entity operator. +- Increase namespace defaults only to a value appropriate for the workload. Avoid setting broadly excessive defaults. +- If the pod still fails after resource adjustment, inspect its container logs for a second failure cause. diff --git a/docs/en/solutions/ecosystem/kafka/Kafka_Hot_Standby_Disaster_Recovery_with_MirrorMaker2.md b/docs/en/solutions/ecosystem/kafka/Kafka_Hot_Standby_Disaster_Recovery_with_MirrorMaker2.md new file mode 100644 index 00000000..e72beab9 --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/Kafka_Hot_Standby_Disaster_Recovery_with_MirrorMaker2.md @@ -0,0 +1,103 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Kafka Hot Standby Disaster Recovery with MirrorMaker 2 + +:::info Applicable Versions +ACP 3.8 and later. ACP 3.12+ deployments should prefer the newer KafkaMirrorMaker2 API examples from the ACP 3.15 deployment guide. +::: + +## Introduction + +A hot standby Kafka disaster recovery architecture uses two Kafka clusters in different sites or availability zones. The source cluster handles normal production and consumption. The target cluster receives replicated data and can be used after a failure or for read-only analytics workloads with separate consumer groups. + +Replication is performed by Kafka MirrorMaker 2, which is deployed near the target cluster. + +## Architecture + +MirrorMaker 2 creates consumers connected to the source cluster and producers connected to the target cluster. It reads records from selected source topics and writes them to the target cluster. + +MirrorMaker 2 also uses checkpoint topics to synchronize consumer group offsets. A checkpoint record contains the consumer group, topic, partition, upstream offset, downstream offset, metadata, and timestamp. + +## Limitations + +Kafka 2.5 and earlier do not automatically translate checkpoint offsets into the target cluster's `__consumer_offsets` topic. After failover, applications may need to translate offsets manually before continuing consumption on the target cluster. + +For Java clients, the Kafka source code includes `RemoteClusterUtils.translateOffsets()` for this purpose. Non-Java clients need equivalent logic. + +## Basic Deployment Flow + +1. Create the source Kafka cluster. +2. Create the target Kafka cluster. +3. Deploy `KafkaMirrorMaker2` in the target cluster namespace. +4. Verify the MirrorMaker 2 pod and custom resource are ready. +5. Create a topic in the source cluster. +6. Verify that the topic appears in the target cluster. +7. Produce data to the source topic. +8. Verify that data appears in the target topic. +9. Test consumer group failover before declaring the DR procedure ready. + +## Example MirrorMaker 2 Resource + +```yaml +apiVersion: kafka.strimzi.io/v1alpha1 +kind: KafkaMirrorMaker2 +metadata: + name: my-mm2-cluster + namespace: +spec: + clusters: + - alias: my-cluster-source + bootstrapServers: :9092 + - alias: my-cluster-target + bootstrapServers: target-cluster-kafka-bootstrap:9092 + config: + config.storage.replication.factor: 1 + offset.storage.replication.factor: 1 + status.storage.replication.factor: 1 + connectCluster: my-cluster-target + mirrors: + - sourceCluster: my-cluster-source + targetCluster: my-cluster-target + topicsPattern: ".*" + groupsPattern: ".*" + checkpointConnector: + config: + checkpoints.topic.replication.factor: 1 + emit.checkpoints.interval.seconds: 60 + sync.group.offsets.enabled: "true" + sync.group.offsets.interval.seconds: 60 + refresh.groups.interval.seconds: 600 + replication.policy.class: "io.strimzi.kafka.connect.mirror.IdentityReplicationPolicy" + heartbeatConnector: + config: + heartbeats.topic.replication.factor: 1 + sourceConnector: + config: + offset-syncs.topic.replication.factor: 1 + refresh.topics.interval.seconds: 600 + replication.factor: 2 + sync.topic.acls.enabled: "false" + replication.policy.class: "io.strimzi.kafka.connect.mirror.IdentityReplicationPolicy" + replicas: 1 + version: 2.7.0 +``` + +## Failover Notes + +- New topics are not always synchronized immediately. The default refresh interval is 10 minutes and can be changed with `refresh.topics.interval.seconds`. +- Consumer group offsets are synchronized periodically. If the source cluster fails before the latest offset checkpoint is replicated, consumers may reprocess some records after failover. +- Applications should tolerate duplicate consumption during DR failover. + +## Important Considerations + +- MirrorMaker 2 replication is asynchronous. RPO depends on replication throughput and lag. +- Network bandwidth between sites must exceed the normal write traffic, plus replication overhead. +- Keep source and target cluster sizing, storage capacity, and topic retention aligned. +- Failover requires manual decision-making and application connection changes unless the application has its own routing layer. diff --git a/docs/en/solutions/ecosystem/kafka/Kafka_ISR_Synchronization_Issues.md b/docs/en/solutions/ecosystem/kafka/Kafka_ISR_Synchronization_Issues.md new file mode 100644 index 00000000..27e7ec33 --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/Kafka_ISR_Synchronization_Issues.md @@ -0,0 +1,102 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Resolve Kafka ISR Synchronization Issues That Block Broker Restart + +:::info Applicable Versions +Kafka 2.7.0 and 2.8.2 on ACP 3.x. +::: + +## Problem + +When the operator tries to restart a broker pod, it detects that restarting the pod would reduce the ISR count for a partition below `min.insync.replicas`. The restart is blocked. Topic details show that some partitions have the expected replica count, but only one replica remains in ISR. + +## Recovery Options + +### Option 1: Force Delete the Stuck Pod + +This option is risky and should be used only when data consistency is not important. + +```bash +kubectl -n delete pod --force --grace-period=0 +``` + +If ISR has only one replica, forcibly deleting the pod can lose the latest records for affected partitions or cause consumers to reprocess data. Avoid this in production unless the business accepts the data risk. + +### Option 2: Reassign Affected Topic Partitions + +Reassign the affected topic so replicas can recover and ISR can become healthy again. + +Create a topic list file. Replace `__consumer_offsets` with the affected topic if needed: + +```bash +cat > /tmp/topic-generate.json <<'EOF' +{ + "topics": [ + {"topic": "__consumer_offsets"} + ], + "version": 1 +} +EOF +``` + +Generate a reassignment plan. Update `--broker-list` to match your brokers: + +```bash +./bin/kafka-reassign-partitions.sh \ + --bootstrap-server localhost:9092 \ + --topics-to-move-json-file /tmp/topic-generate.json \ + --broker-list "0,1,2" \ + --generate +``` + +Copy the proposed partition assignment from the command output and save it: + +```bash +cat > /tmp/partition-replica-reassignment.json <<'EOF' +{ + "version": 1, + "partitions": [ + { + "topic": "__consumer_offsets", + "partition": 0, + "replicas": [0, 1, 2], + "log_dirs": ["any", "any", "any"] + } + ] +} +EOF +``` + +Execute reassignment: + +```bash +./bin/kafka-reassign-partitions.sh \ + --bootstrap-server localhost:9092 \ + --reassignment-json-file /tmp/partition-replica-reassignment.json \ + --execute +``` + +Check the result: + +```bash +./bin/kafka-topics.sh \ + --bootstrap-server localhost:9092 \ + --describe \ + --topic __consumer_offsets +``` + +After ISR recovers, the blocked pod restart or rolling update can continue. + +## Important Considerations + +- Reassign only the affected topics and partitions. +- Keep a copy of the current assignment from the `--generate` output for rollback planning. +- Do not force-delete broker pods in production unless the data consistency risk is explicitly accepted. +- Monitor under-replicated partitions and ISR after reassignment. diff --git a/docs/en/solutions/ecosystem/kafka/Kafka_MetalLB_Access.md b/docs/en/solutions/ecosystem/kafka/Kafka_MetalLB_Access.md new file mode 100644 index 00000000..49b99a7c --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/Kafka_MetalLB_Access.md @@ -0,0 +1,73 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Expose Kafka with MetalLB + +:::info Applicable Versions +ACP 3.16.x with MetalLB deployed. +::: + +## Introduction + +When MetalLB is available in the Kubernetes cluster, Kafka can expose broker services through `LoadBalancer` services and receive stable external IP addresses from the MetalLB address pool. + +## Prerequisites + +- MetalLB is installed and configured. +- Kafka operator is deployed. +- The MetalLB address pool has enough IPs for the bootstrap service and broker services. + +## Configure the Kafka Listener + +Switch the external Kafka listener to `loadbalancer`: + +```yaml +spec: + kafka: + listeners: + plain: {} + external: + type: loadbalancer + tls: false +``` + +Create or update the Kafka instance and wait until it becomes running. + +## Get External Addresses + +List the services created for the Kafka instance: + +```bash +kubectl -n get svc -l middleware.instance/name= +``` + +Find the broker services and their `EXTERNAL-IP` values. Clients can connect to broker endpoints such as: + +```text +:9094 +``` + +## Test Access + +From a Kafka client environment: + +```bash +./bin/kafka-console-producer.sh \ + --broker-list :9094 \ + --topic my-cluster-topic +``` + +Then consume from the same topic to confirm end-to-end access. + +## Important Considerations + +- Keep the LoadBalancer services. MetalLB IPs remain stable as long as the services are not deleted. +- Kafka external access requires every advertised broker endpoint to be reachable from the client network. +- Confirm firewalls and routing allow TCP access to the MetalLB IPs and Kafka port. +- For TLS listeners, use the appropriate client truststore and protocol settings. diff --git a/docs/en/solutions/ecosystem/kafka/Kafka_Node_Placement_Affinity_Taints_Guide.md b/docs/en/solutions/ecosystem/kafka/Kafka_Node_Placement_Affinity_Taints_Guide.md new file mode 100644 index 00000000..1d04e1e9 --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/Kafka_Node_Placement_Affinity_Taints_Guide.md @@ -0,0 +1,121 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Schedule Kafka on Dedicated Middleware Nodes with Affinity, Taints, and Tolerations + +:::info Applicable Versions +ACP 3.8 and later. YAML examples are based on Strimzi Kafka resources. +::: + +## Scenario + +A business cluster already runs customer applications such as Harbor and other workloads. New nodes are added for middleware products, and Kafka must run only on those dedicated nodes. Other applications should not be scheduled onto the middleware nodes. + +Use taints to repel general workloads, labels to identify middleware nodes, tolerations so Kafka can use those nodes, and node affinity so Kafka is scheduled only there. + +## Implementation Plan + +1. Add a taint to each middleware node. +2. Add a label to each middleware node. +3. Configure Kafka and ZooKeeper pod tolerations for the taint. +4. Configure Kafka and ZooKeeper node affinity for the label. +5. Keep pod anti-affinity enabled so highly available replicas do not land on the same node. + +## Label and Taint Nodes + +Example label: + +```bash +kubectl label node middleware.alauda.io/dedicated=true +``` + +Example taint: + +```bash +kubectl taint node middleware.alauda.io/dedicated=true:NoSchedule +``` + +## Kafka YAML Example + +Add affinity and tolerations under both Kafka and ZooKeeper pod templates: + +```yaml +apiVersion: kafka.strimzi.io/v1beta1 +kind: Kafka +metadata: + name: my-cluster + namespace: operators +spec: + kafka: + replicas: 3 + template: + pod: + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: middleware.alauda.io/dedicated + operator: In + values: + - "true" + podAntiAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + - labelSelector: + matchLabels: + strimzi.io/cluster: my-cluster + strimzi.io/kind: Kafka + topologyKey: kubernetes.io/hostname + tolerations: + - key: middleware.alauda.io/dedicated + operator: Equal + value: "true" + effect: NoSchedule + zookeeper: + replicas: 3 + template: + pod: + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: middleware.alauda.io/dedicated + operator: In + values: + - "true" + podAntiAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + - labelSelector: + matchLabels: + strimzi.io/cluster: my-cluster + strimzi.io/kind: Kafka + topologyKey: kubernetes.io/hostname + tolerations: + - key: middleware.alauda.io/dedicated + operator: Equal + value: "true" + effect: NoSchedule +``` + +## Verify Scheduling + +```bash +kubectl -n get pod -o wide | grep +kubectl describe node | grep -E 'Taints|middleware.alauda.io/dedicated' +``` + +Confirm Kafka brokers and ZooKeeper pods are placed only on the dedicated middleware nodes and are spread across different hosts. + +## Important Considerations + +- A three-broker Kafka cluster and three-node ZooKeeper ensemble require at least three dedicated nodes when hard anti-affinity is used. +- If there are not enough dedicated nodes, pods remain pending. Decide whether to add nodes or relax anti-affinity. +- Apply the same node placement policy to Kafka, ZooKeeper, entity operator, and exporter if the whole instance must stay on dedicated nodes. +- Keep labels and taints stable. Removing them can cause unexpected scheduling during pod recreation. diff --git a/docs/en/solutions/ecosystem/kafka/Kafka_Operator_Blocked_By_Abnormal_Instance.md b/docs/en/solutions/ecosystem/kafka/Kafka_Operator_Blocked_By_Abnormal_Instance.md new file mode 100644 index 00000000..4b8f8a9b --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/Kafka_Operator_Blocked_By_Abnormal_Instance.md @@ -0,0 +1,43 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Kafka Operator Is Blocked by an Abnormal Instance + +:::info Applicable Versions +ACP 3.x Kafka operator versions before the fix included in ACP 3.15. +::: + +## Problem + +During an operator upgrade or reconciliation, all Kafka instances may become abnormal and the upgrade can be blocked. Operator logs show previous API server connection errors, followed by repeated messages similar to `Reconciliation is in progress`. After that, instances no longer reconcile normally. + +## Resolution + +Restart the Strimzi cluster operator and wait for reconciliation to resume: + +```bash +kubectl delete pods --all-namespaces -l strimzi.io/kind=cluster-operator +``` + +After the operator pod is recreated, monitor Kafka instance status: + +```bash +kubectl get kafka --all-namespaces +kubectl get pod --all-namespaces | grep cluster-operator +``` + +## Root Cause + +This matches a known upstream Strimzi issue where reconciliation can remain blocked after API server connectivity problems. The issue is fixed in the ACP 3.15 Kafka operator line. + +## Important Considerations + +- Restart only the operator pod; do not delete Kafka broker pods unless a separate recovery step requires it. +- Check operator logs after restart to confirm reconciliation is progressing. +- If reconciliation remains blocked, collect operator logs and Kafka custom resource status before further changes. diff --git a/docs/en/solutions/ecosystem/kafka/Kafka_Operator_Startup_Exception_After_314_Upgrade.md b/docs/en/solutions/ecosystem/kafka/Kafka_Operator_Startup_Exception_After_314_Upgrade.md new file mode 100644 index 00000000..b5eecedf --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/Kafka_Operator_Startup_Exception_After_314_Upgrade.md @@ -0,0 +1,42 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Kafka Operator Startup Exception After ACP 3.14 Upgrade + +:::info Applicable Versions +ACP 3.14 operator upgrade paths affected by the referenced issue. +::: + +## Problem + +After upgrading the Kafka operator, the operator can start abnormally when an `RdsTopic` custom resource exists but the corresponding `KafkaTopic` custom resource does not exist. The operator may hit a null pointer panic during startup. + +## Resolution + +Identify the `RdsTopic` resource named in the operator error logs and remove its finalizers, then delete the stale resource. + +```bash +kubectl -n get rdstopic +kubectl -n patch rdstopic \ + --type=merge \ + -p '{"metadata":{"finalizers":[]}}' +kubectl -n delete rdstopic +``` + +Restart the operator if it does not recover automatically: + +```bash +kubectl delete pods --all-namespaces -l strimzi.io/kind=cluster-operator +``` + +## Important Considerations + +- Patch only the stale `RdsTopic` reported by the logs. +- Confirm whether a real Kafka topic still exists before deleting product-layer resources. +- Collect operator logs before remediation if this needs escalation. diff --git a/docs/en/solutions/ecosystem/kafka/Kafka_Topic_Delete_Data_Loss_Issue.md b/docs/en/solutions/ecosystem/kafka/Kafka_Topic_Delete_Data_Loss_Issue.md new file mode 100644 index 00000000..3cddd0d3 --- /dev/null +++ b/docs/en/solutions/ecosystem/kafka/Kafka_Topic_Delete_Data_Loss_Issue.md @@ -0,0 +1,57 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# Avoid Kafka Topic Data Loss When Deleting an Abnormal Topic CR + +:::info Applicable Versions +Affected versions: ACP 3.10.x, <= 3.12.3, <= 3.14.2, <= 3.16.1. Fixed in ACP 3.12.4, 3.14.3, 3.16.2, and 3.18. +::: + +## Problem + +If a Kafka topic named `test` already exists, and a user creates a second topic custom resource whose `spec.topicName` is also `test`, the second topic custom resource enters an abnormal state. In affected versions, deleting that abnormal custom resource can delete or clear data from the existing topic. + +The issue is triggered by a community deletion logic bug that does not correctly verify the relationship between the private topic and the custom resource being deleted. + +## Trigger Conditions + +The issue requires both of these actions: + +1. A user creates a topic custom resource and manually sets `spec.topicName` to a topic name already used by another topic custom resource. +2. The user deletes the abnormal topic custom resource. + +If `spec.topicName` is not set manually, the topic name defaults to the custom resource name and the issue is less likely to be triggered. + +## Workaround + +Before deleting the abnormal topic custom resource, change its `spec.topicName` to a new unused topic name. Then delete the abnormal custom resource. + +```bash +kubectl -n edit rdstopic +``` + +Update the topic name to a disposable unused name: + +```yaml +spec: + topicName: unused-topic-name-for-cleanup +``` + +Then delete the custom resource: + +```bash +kubectl -n delete rdstopic +``` + +## Important Considerations + +- Do not delete an abnormal topic CR if its `spec.topicName` points to an existing valid topic. +- Prefer leaving `spec.topicName` empty so the operator uses the CR name as the topic name. +- Upgrade to a fixed version where available. +- Before deleting topic resources in affected versions, confirm both the CR name and `spec.topicName` relationship. diff --git a/docs/en/solutions/ecosystem/rabbitmq/How_to_Run_RabbitMQ_as_Root_User.md b/docs/en/solutions/ecosystem/rabbitmq/How_to_Run_RabbitMQ_as_Root_User.md new file mode 100644 index 00000000..e1a0572b --- /dev/null +++ b/docs/en/solutions/ecosystem/rabbitmq/How_to_Run_RabbitMQ_as_Root_User.md @@ -0,0 +1,56 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# How to Run RabbitMQ as Root User + +:::info Applicable Versions +ACP 3.12 and later. +::: + +## Background + +RabbitMQ normally runs internal processes as a non-root user. Some environments need root-level access to integrate with existing storage or platform constraints. + +## Configuration + +Add the following override to the `RabbitmqCluster`: + +```yaml +apiVersion: rabbitmq.com/v1beta1 +kind: RabbitmqCluster +metadata: + name: my-rabbitmq +spec: + override: + statefulSet: + spec: + template: + spec: + securityContext: + runAsUser: 0 +``` + +## Verification + +Check the running user in a RabbitMQ pod: + +```bash +kubectl -n exec -ti my-rabbitmq-server-0 -- id +``` + +Expected result includes: + +```text +uid=0(root) +``` + +## Notes + +- Use root only when the environment requires it. +- Review storage permissions, init container behavior, and security policy impact before applying this change. diff --git a/docs/en/solutions/ecosystem/rabbitmq/How_to_Use_RabbitMQ_Plugins.md b/docs/en/solutions/ecosystem/rabbitmq/How_to_Use_RabbitMQ_Plugins.md new file mode 100644 index 00000000..9b72ef3c --- /dev/null +++ b/docs/en/solutions/ecosystem/rabbitmq/How_to_Use_RabbitMQ_Plugins.md @@ -0,0 +1,116 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# How to Use RabbitMQ Plugins + +## Introduction + +RabbitMQ plugins extend the broker with additional protocols, monitoring, management, routing behavior, and integration features. + +Operator-created RabbitMQ instances typically enable these plugins by default: + +- `rabbitmq_peer_discovery_k8s` +- `rabbitmq_prometheus` +- `rabbitmq_management` + +## Check Plugin Status + +Run the following command in a RabbitMQ pod: + +```bash +rabbitmq-plugins list +``` + +An enabled plugin is shown with an enabled marker in the command output. + +## Enable Built-In Plugins + +Add the plugin names to `spec.rabbitmq.additionalPlugins`: + +```yaml +spec: + rabbitmq: + additionalPlugins: + - rabbitmq_top + - rabbitmq_shovel +``` + +Verify after the pod is ready: + +```bash +rabbitmq-plugins list +``` + +## Common Plugin Categories + +| Category | Examples | +| --- | --- | +| Management | `rabbitmq_management`, `rabbitmq_management_agent` | +| Monitoring | `rabbitmq_prometheus` | +| Discovery | `rabbitmq_peer_discovery_k8s`, `rabbitmq_peer_discovery_aws` | +| Replication | `rabbitmq_shovel`, `rabbitmq_federation` | +| Protocols | `rabbitmq_mqtt`, `rabbitmq_amqp1_0`, `rabbitmq_web_stomp` | +| Exchange extensions | `rabbitmq_consistent_hash_exchange`, `rabbitmq_delayed_message_exchange` | + +## Enable Community or Custom Plugins + +If the plugin is not already packaged in the RabbitMQ image, placing the name in `additionalPlugins` is not enough. The plugin file must exist in the container before RabbitMQ starts. + +### Method 1: Download in an Init Container + +Use an init container to download the `.ez` plugin file into a shared volume and extend `RABBITMQ_PLUGINS_DIR`. + +```yaml +spec: + rabbitmq: + additionalPlugins: + - rabbitmq_management_exchange + envConfig: | + RABBITMQ_PLUGINS_DIR=/opt/rabbitmq/plugins:/opt/rabbitmq/community-plugins + override: + statefulSet: + spec: + template: + spec: + volumes: + - name: community-plugins + emptyDir: {} + initContainers: + - name: copy-community-plugins + image: curlimages/curl + command: + - sh + - -c + - curl -L https:// --output /community-plugins/.ez + volumeMounts: + - name: community-plugins + mountPath: /community-plugins + containers: + - name: rabbitmq + volumeMounts: + - name: community-plugins + mountPath: /opt/rabbitmq/community-plugins +``` + +### Method 2: Mount Plugin Files from the Node + +If the environment cannot download from the internet, mount the plugin directory from the node and copy it into a writable shared volume before RabbitMQ starts. + +This method requires: + +- plugin files already present on the selected nodes +- node-level directory management +- stricter scheduling control + +## Recommendations + +- Keep plugin sets minimal. +- Validate plugin compatibility with the RabbitMQ version in use. +- Apply the same required plugin set to target clusters used for migration or DR. +- Treat community plugins as application dependencies and test them before production rollout. diff --git a/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Best_Practices.md b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Best_Practices.md new file mode 100644 index 00000000..730ffa9d --- /dev/null +++ b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Best_Practices.md @@ -0,0 +1,135 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# RabbitMQ Best Practices + +:::info Applicable Versions +ACP 3.14 and later. Most sizing and deployment guidance also applies to earlier operator-based RabbitMQ deployments. +::: + +## Introduction + +RabbitMQ is widely used for application decoupling, asynchronous processing, event distribution, and request-reply workloads. In Kubernetes environments, production stability depends on correct sizing, queue design, storage selection, and deployment isolation. + +Use this guide as a baseline when planning a new RabbitMQ cluster or reviewing an existing one. + +## Core Concepts + +| Term | Description | +| --- | --- | +| Producer | Client that publishes messages. | +| Exchange | Receives messages and routes them according to bindings. | +| Queue | Stores messages for consumers. | +| Binding | Routing rule between an exchange and a queue. | +| Consumer | Client that reads messages from a queue. | +| Broker | A RabbitMQ node. A cluster contains one or more brokers. | + +## Exchange Types + +| Type | Use Case | +| --- | --- | +| `direct` | Exact routing key matching. Good for point-to-point routing. | +| `fanout` | Broadcast to all bound queues. | +| `topic` | Pattern-based routing. Good for selective pub/sub. | +| `headers` | Routing by message headers instead of routing keys. | + +Prefer `topic` exchanges when the business requires flexible routing growth over time. Use `fanout` only when broad broadcast semantics are intentional. + +## Capacity Planning + +### Memory + +RabbitMQ memory usage is affected by message size, backlog size, persistence, consumer speed, and queue type. Large backlogs increase both broker memory pressure and disk usage. + +Recommendations: + +- Keep queue backlog low whenever possible. +- Monitor memory alarms and queue growth together. +- Avoid co-locating memory-heavy business workloads on the same nodes unless resource isolation is validated. + +### CPU + +CPU usage mainly comes from routing, protocol handling, queue management, TLS, and plugin overhead. + +Recommendations: + +- Size CPU for peak publish and consume rates, not average load. +- Benchmark quorum queues and mirrored workloads separately from simple classic queues. +- Review Erlang scheduler binding behavior when many large RabbitMQ instances share the same node. + +### Disk + +Persistent queues require storage that can absorb both normal write throughput and burst backlogs. + +Recommendations: + +- Use stable persistent volumes for production clusters. +- Prefer storage classes with predictable latency. +- Size capacity from message rate, retention, durability mode, and DR duplication overhead. +- Keep operational headroom for failover, requeue, and temporary backlog spikes. + +## Deployment Recommendations + +### Operator and Instance Layout + +- Use multiple replicas for production clusters. +- Distribute replicas across nodes with pod anti-affinity. +- Keep plugin sets aligned across clusters when DR or migration is required. +- Use the same service exposure model across environments unless there is a clear reason to differ. + +### Service Type + +Choose service exposure according to the access model: + +- `ClusterIP` for in-cluster applications. +- `NodePort` when external access is required and platform networking is managed at the node level. +- `LoadBalancer` when MetalLB or a cloud load balancer is available. + +### Persistence + +Do not run production message workloads without persistence unless data loss is acceptable by design. + +### Additional Configuration + +Review these items during creation: + +- replica count +- storage class and storage size +- CPU and memory requests and limits +- service type +- additional plugins +- affinity, anti-affinity, and tolerations +- environment configuration such as scheduler binding or plugin paths + +## Integration Guidance + +- Prefer access through a stable Service endpoint instead of individual pod IPs. +- Keep exchange, queue, and binding creation under version-controlled application or platform automation. +- Align client retry behavior with broker failover behavior. +- Design consumers to tolerate duplicate delivery. + +## Operations Guidance + +- Track queue depth, consumer count, message rates, memory alarms, and disk alarms. +- Investigate increasing backlog early. Slow consumers usually become visible before resource exhaustion. +- Validate plugin changes, DR settings, and migration procedures in a non-production environment first. + +## Common Risks + +- Too many messages accumulating in queues +- Under-sized storage for persistent workloads +- Missing anti-affinity for multi-replica clusters +- Uncontrolled plugin growth +- Treating RabbitMQ as durable long-term storage instead of a messaging system + +## Reference Suggestions + +- Use dedicated nodes when the cluster must meet strict latency or throughput targets. +- Prefer explicit routing and queue lifecycle management over ad hoc console-driven changes. +- Re-run capacity tests after changing queue types, plugins, or service exposure mode. diff --git a/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_CPU_Imbalance_Across_Multiple_Instances.md b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_CPU_Imbalance_Across_Multiple_Instances.md new file mode 100644 index 00000000..3a182bc1 --- /dev/null +++ b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_CPU_Imbalance_Across_Multiple_Instances.md @@ -0,0 +1,57 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# RabbitMQ CPU Imbalance Across Multiple Instances + +## Problem + +In performance tests with multiple large RabbitMQ instances on the same node, CPU and memory were not exhausted, but throughput still plateaued early and instances interfered with each other. + +## Cause + +In containerized environments, RabbitMQ and Erlang scheduler behavior can bind work unevenly across CPU cores. With the default RabbitMQ scheduler bind type, some cores are used heavily while others remain underused. When several RabbitMQ instances share a node, this can create unnecessary CPU contention and lower throughput. + +## Erlang Scheduler Bind Types + +| Value | Meaning | +| --- | --- | +| `u` | Unbound. The operating system decides scheduler placement. | +| `ns` | No spread. Keep schedulers close together. | +| `ts` | Thread spread. Spread across hardware threads. | +| `ps` | Processor spread. Spread across processor packages. | +| `s` | Spread as much as possible. | +| `db` | Default bind behavior. | + +## Recommendation + +Set the scheduler bind type to `u` so the operating system can distribute RabbitMQ scheduler threads more evenly. + +## Configuration + +Add the following under `spec.rabbitmq.envConfig`: + +```yaml +spec: + rabbitmq: + envConfig: | + RABBITMQ_SCHEDULER_BIND_TYPE="u" +``` + +## Expected Result + +After the change: + +- CPU allocation is usually more balanced across cores +- throughput improves when multiple RabbitMQ instances share the same node +- the workload can approach the true network or storage limit instead of an artificial CPU scheduling bottleneck + +## Notes + +- Re-run performance tests after the change. +- Combine this fix with proper node placement and anti-affinity if several large clusters share the same hardware. diff --git a/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_DR_Deployment_Guide.md b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_DR_Deployment_Guide.md new file mode 100644 index 00000000..101e0590 --- /dev/null +++ b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_DR_Deployment_Guide.md @@ -0,0 +1,105 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# RabbitMQ Disaster Recovery Deployment Guide + +:::info Applicable Versions +ACP 3.14, 3.15, and 3.16. +::: + +## Introduction + +This guide covers deployment of a hot standby RabbitMQ DR solution based on Shovel. The source cluster handles production traffic. The target cluster keeps replicated messages and becomes active only after failover. + +## DR Characteristics + +| Item | Value | +| --- | --- | +| Replication engine | RabbitMQ Shovel | +| Mode | Near-real-time asynchronous replication | +| Recommended topology | Source exchange to target exchange | +| RTO | Minutes, depending on manual failover | +| RPO | Seconds to minutes, depending on backlog and network | + +## Risks and Limitations + +- Shovel is single-process replication and needs operational monitoring. +- Replication lag grows when network or target performance is insufficient. +- Source and target message state can diverge during instability. +- Consumers must tolerate duplicate messages after failover. + +## Prerequisites + +- Source and target RabbitMQ clusters are created in separate failure domains. +- Network connectivity between the sites is stable. +- Target cluster storage is sized for full replicated backlog. +- Source and target exchanges, queues, and bindings are planned in advance. +- The shovel plugins are enabled on one side, preferably the target cluster. + +## Enable Plugins + +```yaml +spec: + rabbitmq: + additionalPlugins: + - rabbitmq_shovel + - rabbitmq_shovel_management +``` + +## Recommended Deployment Pattern + +### Source Exchange to Target Exchange + +Use this mode when the target cluster should preserve exchange semantics. Shovel creates an internal source-side queue and republishes messages to the target exchange. + +When the source exchange is a `topic` exchange, routing key `#` can protect all routing keys for that exchange. + +### Source Exchange to Target Queue + +Use this mode only when the target side intentionally terminates into a specific queue. RabbitMQ uses the default exchange on the target side to route the message to that queue. + +## Create Source and Target Objects + +Before configuring the shovel: + +1. Create the required exchanges on both clusters. +2. Create the target queues. +3. Create the target bindings. +4. Confirm that source routing keys and target routing rules are consistent. + +The auto-generated `amq.*` internal queue used by Shovel does not need to be created manually. + +## Configure the Shovel + +In RabbitMQ Management, configure: + +1. the source URI and source object +2. the destination URI and destination object +3. reconnect delay +4. acknowledgement mode + +Recommended settings: + +- `Reconnect delay`: a non-zero value such as `5` +- `Acknowledgement mode`: `on confirm` + +## Deployment Verification + +1. Confirm that the shovel status becomes `running`. +2. Confirm that the internal queue is created and bound correctly on the source side. +3. Publish test messages to the source exchange. +4. Confirm that target queues receive the expected messages. +5. Leave the test running long enough to observe backlog behavior under sustained publish load. + +## Capacity Considerations + +- The target cluster keeps replicated messages without normal consumer drain. +- Queue retention and message expiration should be planned explicitly. +- Bandwidth between the sites must exceed peak replication demand. +- Target cluster disk, CPU, and memory sizing should match the source workload profile. diff --git a/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_DR_Operations_Manual.md b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_DR_Operations_Manual.md new file mode 100644 index 00000000..625065cc --- /dev/null +++ b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_DR_Operations_Manual.md @@ -0,0 +1,90 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# RabbitMQ Disaster Recovery Operations Manual + +:::info Applicable Versions +ACP 3.14, 3.15, and 3.16. +::: + +## Routine Monitoring + +The DR solution should be monitored from both RabbitMQ Management and queue behavior. + +Key checks: + +- shovel state +- source-side `amq.gen-*` internal queue +- queued message growth +- consumer acknowledgement rate +- target cluster health and storage consumption + +## Shovel States + +| State | Meaning | +| --- | --- | +| `starting` | Shovel is trying to connect to one or both sides | +| `running` | Shovel is actively consuming and forwarding messages | + +If the state does not reach `running`, check source URI, target URI, credentials, queue or exchange existence, and network connectivity. + +## Queue-Based Health Checks + +When exchange-to-exchange replication is used, Shovel creates an internal queue under the source exchange. + +Use these signals: + +- If total queued messages keep increasing, target-side replication is slower than source-side publishing. +- If queued messages stay above zero without movement, replication may be stalled. +- If consumer acknowledgement rate is much lower than publish rate, check target cluster performance and cross-site latency. + +## Failover Procedure + +1. Confirm that the source cluster is unavailable or no longer safe for writes. +2. Stop or redirect producers so the switch is controlled. +3. Point producers and consumers to the target cluster. +4. Validate exchange, queue, and consumer behavior on the target side. +5. Watch for duplicate consumption during the first recovery window. + +## Switch-Back Principle + +Do not automatically replicate traffic back to the original source cluster. After the source cluster is repaired, decide whether to rebuild it from scratch, resynchronize business state, or create a new DR direction. + +## Shovel Management Commands + +List shovel status on a specific node: + +```bash +rabbitmqctl shovel_status -n rabbit@.. +``` + +Restart a shovel: + +```bash +rabbitmqctl -n rabbit@.. restart_shovel +``` + +Delete a shovel: + +```bash +rabbitmqctl -n rabbit@.. delete_shovel +``` + +## Common Risks + +- source-target latency spike +- target cluster storage exhaustion +- forgotten queue or binding changes that were never duplicated to the target side +- duplicate consumption after client failover + +## Recommendations + +- Run DR drills after operator upgrades or major queue topology changes. +- Keep an operations record of all shovel names, URIs, protected exchanges, and owning applications. +- Test application reconnection behavior before relying on the DR design in production. diff --git a/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Data_Migration.md b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Data_Migration.md new file mode 100644 index 00000000..36e4729e --- /dev/null +++ b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Data_Migration.md @@ -0,0 +1,94 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# RabbitMQ Data Migration + +## Introduction + +Common RabbitMQ migration scenarios include: + +1. replicating one queue to other nodes in the same cluster +2. migrating data from one RabbitMQ cluster to another +3. exporting queue data to a file and importing it later +4. loading data into RabbitMQ from a database-driven workflow + +## Queue Replication Inside One Cluster + +Classic HA-style queue replication can be configured with a policy such as: + +```json +{ + "policies": [ + { + "vhost": "/", + "name": "test-ha", + "pattern": "test-ha", + "apply-to": "queues", + "definition": { + "ha-mode": "all" + }, + "priority": 0 + } + ] +} +``` + +Use this only when the queue model and RabbitMQ version still support the intended HA behavior. + +## Cluster-to-Cluster Migration with Shovel + +Shovel can move data from a source cluster to a destination cluster. + +Basic flow: + +1. create source and destination clusters +2. enable shovel plugins on one side +3. create source exchange and backup queue +4. create matching target exchange and queue +5. configure a shovel from the source queue to the target exchange or queue +6. verify message arrival on the destination side + +This model is suitable for migration windows and also forms the base of the hot standby DR solution. + +## Export Queue Data to a File + +One tool previously used for file export and import is `node-amqp-tool`: + +```bash +amqp-tool --host --port --user --password \ + --queue --export > dump.json + +amqp-tool --host --port --user --password \ + --queue --import dump.json +``` + +Limitations: + +- the tool is not actively maintained +- export consumes the queue like a normal consumer +- format compatibility is limited +- exported files include extra metadata and can become much larger than the raw message payloads + +## Database-to-RabbitMQ Import + +Possible patterns include: + +- database extensions that publish directly to RabbitMQ +- RabbitMQ plugins that react to database notifications +- database triggers that publish or notify on insert and update + +Risks: + +- database extensions may not be supported by the platform operator +- triggers can affect database performance +- custom RabbitMQ plugins usually require custom images and additional validation + +## Recommendation + +If the business can drain the source queue backlog before cutover, prefer direct cutover to a new cluster. Use message migration only when backlog must be preserved. diff --git a/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Disaster_Recovery_Solution_Before_ACP_312.md b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Disaster_Recovery_Solution_Before_ACP_312.md new file mode 100644 index 00000000..d04cd6ac --- /dev/null +++ b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Disaster_Recovery_Solution_Before_ACP_312.md @@ -0,0 +1,85 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# RabbitMQ Disaster Recovery Solution Before ACP 3.12 + +:::info Applicable Versions +Validated for ACP 3.8 era RabbitMQ deployments. +::: + +## Introduction + +Earlier RabbitMQ DR deployments also used the Shovel plugin, but the operating model was more manual. A primary cluster handled normal traffic while a standby cluster held copied data for failover. + +## Supported Shovel Modes + +Older deployments commonly described four source-destination combinations: + +| Source | Destination | Notes | +| --- | --- | --- | +| Exchange | Exchange | Recommended for routing-preserving replication | +| Exchange | Queue | Publishes into the default exchange on the target side | +| Queue | Exchange | Reads directly from a source queue | +| Queue | Queue | Direct queue-to-queue replication | + +The exchange-to-exchange model is still the safest default because it keeps routing behavior closer to the original design. + +## Enable Plugins + +```yaml +apiVersion: rabbitmq.com/v1beta1 +kind: RabbitmqCluster +metadata: + name: rabbitmq-source +spec: + rabbitmq: + additionalPlugins: + - rabbitmq_shovel + - rabbitmq_shovel_management +``` + +## Important Parameters + +### Common Parameters + +- `Name` +- `Source` +- `Destination` +- `Reconnect delay` +- `Acknowledgement mode` + +### Source-Side Parameters + +- source URI +- source queue or exchange +- routing key when the source is an exchange +- prefetch count +- auto-delete policy + +### Target-Side Parameters + +- target URI +- target queue or exchange +- routing key when the target is an exchange +- forwarding headers + +## Validation Approach + +For each configured mode: + +1. Create the expected exchanges and queues on both clusters. +2. Publish test messages that hit both matched and unmatched routing paths. +3. Verify that Shovel creates its internal queue when exchange-based source mode is used. +4. Confirm the exact messages visible in the target queue or target exchange bindings. + +## Operational Notes + +- Queue-based source replication consumes directly from the source queue and can interfere with normal consumers. +- The management UI in older versions is enough for configuration and state observation, but repeated manual validation is still required. +- This solution is appropriate only when the team can tolerate manual setup, manual verification, and manual failover. diff --git a/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Exporter_Metrics_Collection_Solution.md b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Exporter_Metrics_Collection_Solution.md new file mode 100644 index 00000000..13d1ea92 --- /dev/null +++ b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Exporter_Metrics_Collection_Solution.md @@ -0,0 +1,302 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# RabbitMQ Exporter Metrics Collection Solution + +## Background + +`rabbitmq_exporter` collects metrics from the RabbitMQ Management API and exposes them in Prometheus format through `/metrics`. + +This solution deploys the exporter as an external component: + +- do not modify RabbitMQ operator logic +- deploy one exporter per RabbitMQ instance +- expose metrics through a Kubernetes `Service` +- integrate with Prometheus Operator through a `ServiceMonitor` + +## Architecture + +```text +Prometheus + | + v +ServiceMonitor + | + v +Service/:9419 + | + v +Deployment/ + | + v +RabbitMQ Management API +``` + +One exporter supports only one `RABBIT_URL`, so one RabbitMQ instance usually needs one exporter deployment. + +## Prerequisites + +- the RabbitMQ cluster already exists +- the `rabbitmq_management` plugin is enabled +- the management API is reachable on port `15672` +- a RabbitMQ account can access the management API +- Prometheus Operator is installed if `ServiceMonitor` will be used + +RabbitMQ Cluster Operator usually creates a Secret named: + +```text +-default-user +``` + +## Deployment + +### Deployment Resource + +```yaml +apiVersion: apps/v1 +kind: Deployment +metadata: + name: + namespace: + labels: + app.kubernetes.io/name: + app.kubernetes.io/part-of: +spec: + replicas: 1 + selector: + matchLabels: + app.kubernetes.io/name: + template: + metadata: + labels: + app.kubernetes.io/name: + app.kubernetes.io/part-of: + spec: + containers: + - name: exporter + image: registry.alauda.cn:60070/middleware/rabbitmq-exporter:v4.1.1 + imagePullPolicy: IfNotPresent + ports: + - name: metrics + containerPort: 9419 + env: + - name: RABBIT_URL + value: http://..svc:15672 + - name: RABBIT_USER + valueFrom: + secretKeyRef: + name: + key: username + - name: RABBIT_PASSWORD + valueFrom: + secretKeyRef: + name: + key: password + - name: RABBIT_CONNECTION + value: loadbalancer + - name: RABBIT_EXPORTERS + value: exchange,node,queue,aliveness + - name: PUBLISH_PORT + value: "9419" + - name: LOG_LEVEL + value: info + - name: RABBIT_TIMEOUT + value: "30" + readinessProbe: + httpGet: + path: /health + port: metrics + livenessProbe: + httpGet: + path: / + port: metrics + resources: + requests: + cpu: 50m + memory: 64Mi + limits: + cpu: 200m + memory: 256Mi + securityContext: + allowPrivilegeEscalation: false + readOnlyRootFilesystem: true + runAsNonRoot: true +``` + +### Important Environment Variables + +| Variable | Description | +| --- | --- | +| `RABBIT_URL` | RabbitMQ Management API URL | +| `RABBIT_USER` | Management username | +| `RABBIT_PASSWORD` | Management password | +| `RABBIT_CONNECTION` | Use `loadbalancer` when accessing through a Service | +| `RABBIT_EXPORTERS` | Enabled metric modules | +| `PUBLISH_PORT` | Exporter metrics port | +| `RABBIT_TIMEOUT` | Management API timeout in seconds | + +Common modules: + +- `exchange` +- `node` +- `queue` +- `aliveness` +- `connections` +- `shovel` +- `federation` +- `memory` + +Recommended default modules: + +```text +exchange,node,queue,aliveness +``` + +## Service and ServiceMonitor + +### Service + +```yaml +apiVersion: v1 +kind: Service +metadata: + name: + namespace: + labels: + app.kubernetes.io/name: +spec: + type: ClusterIP + ports: + - name: metrics + port: 9419 + targetPort: metrics + selector: + app.kubernetes.io/name: +``` + +### ServiceMonitor + +```yaml +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: + namespace: + labels: + app.kubernetes.io/name: + prometheus: kube-prometheus +spec: + selector: + matchLabels: + app.kubernetes.io/name: + endpoints: + - port: metrics + path: /metrics + interval: 60s + scrapeTimeout: 30s +``` + +Notes: + +- `ServiceMonitor.metadata.labels` must match the Prometheus instance `serviceMonitorSelector`. +- If your platform Prometheus uses a different selector, adjust the labels accordingly, for example `release: kube-prometheus`. +- `endpoints[].port` must match the Service port name, which is `metrics`. + +## Verification + +Check pod status and logs: + +```bash +kubectl -n get pod -l app.kubernetes.io/name= +kubectl -n logs deploy/ --tail=100 +``` + +Create a test queue and publish messages: + +```bash +RABBIT_USER=$(kubectl -n get secret -o go-template='{{index .data "username" | base64decode}}') +RABBIT_PASSWORD=$(kubectl -n get secret -o go-template='{{index .data "password" | base64decode}}') + +kubectl -n exec -server-0 -- \ + rabbitmqadmin --host localhost --port 15672 \ + --username "$RABBIT_USER" --password "$RABBIT_PASSWORD" \ + declare queue name= durable=true +``` + +Check metrics: + +```bash +kubectl -n exec deploy/ -- sh -c \ + "wget -qO- http://localhost:9419/metrics | grep '^rabbitmq_' | head" +``` + +## Useful Metrics + +| Metric | Meaning | +| --- | --- | +| `rabbitmq_up` | Exporter can reach RabbitMQ | +| `rabbitmq_module_up{module="queue"}` | Queue module scrape health | +| `rabbitmq_queue_messages_ready` | Ready messages in a queue | +| `rabbitmq_queue_messages_unacknowledged` | Unacknowledged messages | +| `rabbitmq_queue_consumers` | Consumer count | +| `rabbitmq_queue_state` | Queue state such as `running`, `idle`, or `flow` | +| `rabbitmq_node_mem_used` | Node memory usage | +| `rabbitmq_node_disk_free` | Free disk bytes | +| `rabbitmq_shovel_state` | Shovel state when the shovel module is enabled | + +## Recommended Alerts + +```text +rabbitmq_up == 0 +``` + +```text +rabbitmq_module_up{module="queue"} == 0 +``` + +```text +rabbitmq_queue_state{state="flow"} == 1 +``` + +```text +rabbitmq_queue_messages_ready > 0 +``` + +```text +rabbitmq_queue_consumers == 0 +``` + +## Risks and Limitations + +- one exporter can scrape only one RabbitMQ management endpoint +- queue metrics through a Service can depend on the management view of the selected backend node +- the `connections` module can create high-cardinality metrics +- `/health` reflects exporter scrape state, not full RabbitMQ business health + +## Resource Recommendation + +Recommended production resources: + +```yaml +resources: + requests: + cpu: 50m + memory: 64Mi + limits: + cpu: 200m + memory: 256Mi +``` + +## Uninstall + +```bash +kubectl -n delete servicemonitor +kubectl -n delete service +kubectl -n delete deployment +``` diff --git a/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Hot_Standby_Disaster_Recovery_with_Shovel.md b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Hot_Standby_Disaster_Recovery_with_Shovel.md new file mode 100644 index 00000000..bc7d7a8c --- /dev/null +++ b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Hot_Standby_Disaster_Recovery_with_Shovel.md @@ -0,0 +1,99 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# RabbitMQ Hot Standby Disaster Recovery with Shovel + +:::info Applicable Versions +ACP 3.14, 3.15, and 3.16. +::: + +## Introduction + +This solution builds two RabbitMQ clusters in different sites. The source cluster serves production traffic. The target cluster receives replicated messages and is used only after a failover decision. + +Replication is implemented with RabbitMQ Shovel. Shovel acts as a client that reads messages from the source side and publishes them to the target side. + +## Architecture + +RabbitMQ routes messages through exchanges and queues. Shovel can consume from either side of that model and publish to either side of the target model. For hot standby, the practical choices are: + +- source exchange to target exchange +- source exchange to target queue + +The recommended mode is source exchange to target exchange. With a `topic` exchange and routing key `#`, one shovel can cover all matching traffic for that exchange. + +## Limitations + +- Shovel is not a full active-active replication engine. +- Replication is asynchronous, so some data loss is still possible during failure. +- The target cluster stores replicated messages without normal consumption, so storage must be sized for the retained backlog. +- Duplicate consumption can occur after failover because already consumed source messages may still exist on the target side. +- Client producer and consumer endpoints are not switched automatically. +- During unstable network conditions, Shovel can remove its auto-created internal queue and may lose in-flight data. + +## Recommended Topology + +| Item | Recommendation | +| --- | --- | +| Shovel location | Enable shovel plugins on one side only, usually the target cluster | +| Source mode | Exchange | +| Target mode | Exchange | +| Exchange type | `topic` when broad exchange-level protection is required | +| Routing key | `#` for full exchange coverage | + +## Enable Shovel Plugins + +Add the following plugins to the `RabbitmqCluster`: + +```yaml +spec: + rabbitmq: + additionalPlugins: + - rabbitmq_shovel + - rabbitmq_shovel_management +``` + +## Key Shovel Parameters + +| Parameter | Description | +| --- | --- | +| `Name` | Shovel name shown in the management UI | +| `Source` | Source protocol, URI, and exchange or queue settings | +| `Destination` | Target protocol, URI, and exchange or queue settings | +| `Reconnect delay` | Delay before reconnect after link failure | +| `Acknowledgement mode` | `no ack`, `on publish`, or `on confirm` | + +Prefer `on confirm` when message durability matters more than throughput. + +## Deployment Flow + +1. Create the source and target clusters. +2. Enable shovel plugins on the chosen cluster. +3. Create the required exchanges, queues, and bindings on both sides. +4. Configure the shovel in RabbitMQ Management. +5. Verify that the auto-created internal queue appears on the source side. +6. Publish test data to the source exchange. +7. Confirm that the target queue receives the replicated messages. + +## Monitoring + +Shovel health can be judged from both shovel state and queue behavior: + +- Shovel state should move from `starting` to `running`. +- The source-side internal queue usually appears as `amq.gen-*`. +- If queued messages keep growing, target-side delivery is too slow. +- If queued messages stay above zero without change, replication may be stalled. +- If consumer acknowledgement rate is much lower than publish rate, replication throughput is insufficient. + +## Important Considerations + +- Size target storage for full standby retention. +- Test failover with business clients before calling the solution ready. +- Build idempotency or duplicate-handling into consumers. +- Use this pattern for hot standby, not for multi-primary messaging. diff --git a/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_MetalLB_Access.md b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_MetalLB_Access.md new file mode 100644 index 00000000..78938fcf --- /dev/null +++ b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_MetalLB_Access.md @@ -0,0 +1,62 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# RabbitMQ MetalLB Access + +:::info Applicable Versions +Validated for ACP 3.16. +::: + +## Problem + +Expose RabbitMQ through MetalLB when the environment already provides a MetalLB-based `LoadBalancer` implementation. + +## Prerequisites + +- MetalLB is already deployed and working in the cluster. + +## Configuration + +Set the RabbitMQ service type to `LoadBalancer` and disable node port allocation for the Service override: + +```yaml +spec: + service: + type: LoadBalancer + override: + service: + spec: + allocateLoadBalancerNodePorts: false +``` + +## Verification + +1. Create or update the instance. +2. Confirm that the cluster status becomes ready. +3. Check the Service: + +```bash +kubectl get svc -n | grep +``` + +Use the Service external IP with the standard RabbitMQ ports: + +| Access Type | Address | +| --- | --- | +| Client connection | `:5672` | +| Management UI | `:15672` | + +## Validation + +- Open the management UI through the external IP. +- Publish and consume a test message through the exposed endpoint. + +## Notes + +The assigned MetalLB IP stays stable as long as the Service is not deleted and recreated. diff --git a/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Migrate_From_38x_to_3124.md b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Migrate_From_38x_to_3124.md new file mode 100644 index 00000000..100c2112 --- /dev/null +++ b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Migrate_From_38x_to_3124.md @@ -0,0 +1,98 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# RabbitMQ Migration from 3.8.x to 3.12.4 + +:::info Applicable Versions +Validated for ACP 3.14 and later environments that need to replace older 3.8.x RabbitMQ clusters. +::: + +## Problem + +Migrate data and metadata from RabbitMQ 3.8.x to RabbitMQ 3.12.4. + +## Constraints + +- Direct rolling upgrade across all intermediate versions is not handled by the operator workflow. +- The RabbitMQ upgrade path in upstream guidance requires intermediate major and minor versions. +- The safest production path is usually to create a new 3.12.4 cluster and switch applications to it. + +## Recommended Option + +If possible, let the old cluster drain completely and cut over to the new cluster without migrating messages. + +## Option 1: Cut Over After Backlog Is Drained + +### Suitable For + +Queues can be fully consumed, or unconsumed messages do not need to be preserved. + +### Steps + +1. Create a new 3.12.x cluster with comparable sizing. +2. Enable the same required plugins. +3. Export definitions from the old cluster. +4. Import definitions into the new cluster. +5. Stop producers on the old cluster. +6. Wait until source queues are drained. +7. Update client connection settings and start using the new cluster. + +## Option 2: Migrate Remaining Data with Shovel + +### Suitable For + +Some queues cannot be drained completely and their backlog must be preserved. + +### Additional Preparation + +Enable Shovel on the new cluster: + +```yaml +spec: + rabbitmq: + additionalPlugins: + - rabbitmq_shovel + - rabbitmq_shovel_management +``` + +### Steps + +1. Create the new 3.12.x cluster. +2. Import metadata from the old cluster. +3. Stop source-side producers where possible to reduce the number of queues that need migration. +4. Identify the queues that still contain required backlog. +5. Configure one shovel per queue. +6. Verify that messages arrive in the new cluster. +7. Remove the shovel after migration completes. +8. Switch clients to the new cluster. + +### Example Shovel Command + +```bash +rabbitmqctl set_parameter shovel --vhost / \ + '{"src-protocol":"amqp091","src-uri":"amqp://:@:/","src-queue":"","dest-protocol":"amqp091","dest-uri":"amqp://:@:/","dest-queue":""}' +``` + +Check shovel state: + +```bash +rabbitmqctl shovel_status --formatter=pretty_table +``` + +Remove the shovel after cutover: + +```bash +rabbitmqctl clear_parameter shovel "" +``` + +## Important Notes + +- For the default vhost, do not keep a trailing `/` value of `/` in the URI if your environment rejects it. +- Queue-by-queue shovel configuration is time-consuming. Reducing backlog first saves significant effort. +- Re-validate plugin compatibility on the new cluster before cutover. diff --git a/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Node_Placement_Affinity_Taints_Guide.md b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Node_Placement_Affinity_Taints_Guide.md new file mode 100644 index 00000000..18d71b5d --- /dev/null +++ b/docs/en/solutions/ecosystem/rabbitmq/RabbitMQ_Node_Placement_Affinity_Taints_Guide.md @@ -0,0 +1,71 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# RabbitMQ Node Placement Affinity and Taints Guide + +## Scenario + +Some customer clusters run both business applications and middleware. In those cases, RabbitMQ should run only on dedicated middleware nodes, and unrelated workloads should be kept away from those nodes. + +## Goals + +1. Reserve selected nodes for middleware workloads. +2. Ensure RabbitMQ pods schedule only onto those reserved nodes. +3. Spread RabbitMQ replicas across different nodes for high availability. + +## Recommended Approach + +1. Add taints to the dedicated middleware nodes. +2. Add labels to the same nodes. +3. Configure RabbitMQ node affinity to match the labels. +4. Configure RabbitMQ tolerations to accept the taints. +5. Configure pod anti-affinity so replicas do not land on the same node. + +## Example Configuration + +```yaml +apiVersion: rabbitmq.com/v1beta1 +kind: RabbitmqCluster +metadata: + name: rabbitmq3816 + namespace: operators +spec: + replicas: 3 + affinity: + podAntiAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + - labelSelector: + matchLabels: + app.kubernetes.io/name: rabbitmq3816 + topologyKey: kubernetes.io/hostname + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: middleware + operator: In + values: + - "enable" + tolerations: + - key: middleware + operator: Equal + value: enable + effect: NoSchedule +``` + +## Expected Result + +- Only RabbitMQ workloads that tolerate the middleware taint can use the dedicated nodes. +- RabbitMQ replicas are scheduled only on nodes labeled for middleware. +- Anti-affinity prevents multiple replicas of the same cluster from landing on one node. + +## Notes + +- Keep label keys and taint keys aligned across teams. +- Verify node capacity before pinning multiple middleware clusters to the same node pool. diff --git a/docs/en/solutions/ecosystem/rocketmq/How_to_Configure_RocketMQ_Console.md b/docs/en/solutions/ecosystem/rocketmq/How_to_Configure_RocketMQ_Console.md new file mode 100644 index 00000000..64ddd599 --- /dev/null +++ b/docs/en/solutions/ecosystem/rocketmq/How_to_Configure_RocketMQ_Console.md @@ -0,0 +1,108 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# How to Configure RocketMQ Console + +## Problem + +Deploy and access `rocketmq-console` for an existing RocketMQ cluster. + +## Prerequisites + +- A RocketMQ cluster is already deployed. +- You know the NameServer Service addresses for the target cluster. + +## Create the Console Resource + +Example `Console` custom resource: + +```yaml +apiVersion: rocketmq.apache.org/v1alpha1 +kind: Console +metadata: + name: console + namespace: +spec: + nameServers: my-nameserver-nameserver-server-0.my-nameserver-nameserver-nodes..svc.cluster.local:9876;my-nameserver-nameserver-server-1.my-nameserver-nameserver-nodes..svc.cluster.local:9876;my-nameserver-nameserver-server-2.my-nameserver-nameserver-nodes..svc.cluster.local:9876 + numberOfInstances: 1 + resources: + limits: + cpu: "1" + memory: 1Gi + requests: + cpu: "1" + memory: 1Gi + version: 1.0.0 +``` + +Parameter notes: + +- `namespace`: use the same namespace as the RocketMQ instance you want to manage +- `nameServers`: point to the target RocketMQ NameServer Service addresses + +## Example Deployment + +```yaml +apiVersion: rocketmq.apache.org/v1alpha1 +kind: Console +metadata: + name: console + namespace: dba-demo +spec: + nameServers: demo-nameserver-server-0.demo-nameserver-nodes.dba-demo.svc.cluster.local:9876;demo-nameserver-server-1.demo-nameserver-nodes.dba-demo.svc.cluster.local:9876 + numberOfInstances: 1 + resources: + limits: + cpu: "1" + memory: 1Gi + requests: + cpu: "1" + memory: 1Gi + version: 1.0.0 +``` + +Create it with: + +```bash +kubectl create -f /tmp/rocketmq-console.yaml +``` + +## Verify + +Check the console pod: + +```bash +kubectl get pod -n -owide | grep console +``` + +Check the console Service: + +```bash +kubectl get svc -n | grep console-service +``` + +Example result: + +```text +console-console-service NodePort ... 8080:/TCP +``` + +## Access + +Open the console through the Service address: + +```text +http://: +``` + +## Notes + +- Make sure the console namespace and the target RocketMQ namespace are aligned unless your environment explicitly supports cross-namespace access. +- If multiple NameServers are used, separate them with semicolons. +- When using pod-level NameServer addresses, use the full StatefulSet DNS name, which typically includes the headless service segment: `...svc.cluster.local`. diff --git a/docs/en/solutions/ecosystem/rocketmq/How_to_Enable_External_Access_for_RocketMQ.md b/docs/en/solutions/ecosystem/rocketmq/How_to_Enable_External_Access_for_RocketMQ.md new file mode 100644 index 00000000..f1a69570 --- /dev/null +++ b/docs/en/solutions/ecosystem/rocketmq/How_to_Enable_External_Access_for_RocketMQ.md @@ -0,0 +1,258 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# How to Enable External Access for RocketMQ + +## Background + +This guide describes how to expose a RocketMQ cluster created by RocketMQ Operator to clients outside the Kubernetes cluster. + +The approach uses: + +- NodePort Services for each NameServer +- NodePort Services for each broker pod +- broker listener port changes so external clients can reach the correct broker endpoints +- broker-side external address advertisement so NameServer returns reachable broker endpoints to outside clients + +## Create Shared Broker Configuration + +Create a `ConfigMap` with common broker settings: + +```yaml +apiVersion: v1 +kind: ConfigMap +metadata: + name: broker-config + namespace: +data: + BROKER_MEM: " -Xms2g -Xmx2g -Xmn1g " + broker-common.conf: | + # brokerClusterName, brokerName, and brokerId are generated by the operator. + deleteWhen=04 + fileReservedTime=48 + flushDiskType=ASYNC_FLUSH + brokerRole=ASYNC_MASTER + listenPort=30911 +``` + +The namespace must match the namespace used by the RocketMQ broker resources. + +## Create NameServers + +Deploy at least two independent NameServers for availability. + +Example: + +```yaml +apiVersion: rocketmq.apache.org/v1alpha1 +kind: NameService +metadata: + name: name-service1 + namespace: +spec: + dnsPolicy: ClusterFirstWithHostNet + hostNetwork: false + imagePullPolicy: Always + nameServiceImage: build-harbor.alauda.cn/middleware/rocketmq-namesrv:v3.7.1 + resources: + limits: + cpu: 500m + memory: 1024Mi + requests: + cpu: 250m + memory: 512Mi + size: 1 + storageMode: StorageClass + volume: + size: 1Gi +``` + +Expose each NameServer with a NodePort Service: + +```yaml +apiVersion: v1 +kind: Service +metadata: + name: namesrv1 + namespace: +spec: + type: NodePort + ports: + - name: namesrv + nodePort: 39876 + port: 9876 + targetPort: 9876 + protocol: TCP + selector: + app: rocketmq_name_service + rocketmq_name_service_cr: name-service1 +``` + +Create a second NameServer and Service in the same way. + +In-cluster `nameServers` can then use: + +```text +namesrv1:9876;namesrv2:9876 +``` + +External clients use: + +```text +:;: +``` + +## Create the Broker Cluster + +Example broker resource: + +```yaml +apiVersion: rocketmq.apache.org/v1alpha1 +kind: Broker +metadata: + name: broker + namespace: +spec: + allowRestart: true + brokerImage: build-harbor.alauda.cn/middleware/rocketmq-broker:v3.7.0 + env: + - name: BROKER_MEM + valueFrom: + configMapKeyRef: + name: broker-config + key: BROKER_MEM + imagePullPolicy: Always + nameServers: namesrv1:9876;namesrv2:9876 + replicaPerGroup: 1 + size: 2 + resources: + limits: + cpu: 500m + memory: 12288Mi + requests: + cpu: 250m + memory: 2048Mi + scalePodName: broker-0-master-0 + storageMode: StorageClass + volume: + size: 2Gi + volumes: + - name: broker-config + configMap: + name: broker-config + items: + - key: broker-common.conf + path: broker-common.conf +``` + +This example uses `size: 2` and `replicaPerGroup: 1`, which creates a 2 master + 2 replica layout. + +## Expose Each Broker Pod + +Create one NodePort Service for each broker pod. Example for `broker-0-master`: + +```yaml +apiVersion: v1 +kind: Service +metadata: + name: broker-0-master + namespace: +spec: + type: NodePort + ports: + - name: broker + nodePort: 30911 + port: 30911 + targetPort: 30911 + protocol: TCP + selector: + app: rocketmq_broker + broker_cr: broker + brokerGroup: "0" + replicaIndex: "0" +``` + +Repeat the pattern for the remaining pods, for example: + +- `broker-0-replica-1` -> `30912` +- `broker-1-master` -> `30913` +- `broker-1-replica-1` -> `30914` + +## Update Broker Listener Ports and Advertised Address + +After creating the Services, update each broker StatefulSet so the broker both: + +- listens on the same port as the Service `NodePort` +- advertises a reachable external node IP or hostname instead of an in-cluster-only address + +Changing the port alone is not enough. RocketMQ clients first contact NameServer and then connect to the broker addresses returned by NameServer metadata. If brokers still advertise pod IPs or other cluster-internal addresses, external clients will still fail. + +For example: + +- `broker-0-master` -> `LISTEN_PORT=30911` +- `broker-0-replica-1` -> `LISTEN_PORT=30912` +- `broker-1-master` -> `LISTEN_PORT=30913` +- `broker-1-replica-1` -> `LISTEN_PORT=30914` + +Also set the broker's externally advertised IP or hostname to the node address that clients can actually reach. The exact field or environment variable name depends on the RocketMQ operator and image version in use, so verify it against the workload generated in your cluster before rollout. + +If a port is already in use, choose a different port, but keep `port`, `targetPort`, and `nodePort` aligned for that broker. + +## Optional: Create a RocketMQ Console + +Example `Console` resource: + +```yaml +apiVersion: rocketmq.apache.org/v1alpha1 +kind: Console +metadata: + name: console + namespace: +spec: + dockerImage: build-harbor.alauda.cn/middleware/rocketmq-dashboard:v3.7.0 + nameServers: namesrv1:9876;namesrv2:9876 + numberOfInstances: 1 + resources: + limits: + cpu: "2" + memory: 1000Mi + requests: + cpu: 500m + memory: 500Mi +``` + +Expose it with a NodePort Service: + +```yaml +apiVersion: v1 +kind: Service +metadata: + name: console-service + namespace: +spec: + type: NodePort + ports: + - protocol: TCP + port: 8080 + targetPort: 8080 + nodePort: 30000 + selector: + app: rocketmq-console +``` + +## Access + +After deployment: + +- RocketMQ clients outside the cluster use the external NameServer endpoints. +- The console is accessible through: + +```text +http://:30000 +``` diff --git a/docs/en/solutions/ecosystem/rocketmq/RocketMQ_Deployment_Fails_With_Unvalidated_Storage.md b/docs/en/solutions/ecosystem/rocketmq/RocketMQ_Deployment_Fails_With_Unvalidated_Storage.md new file mode 100644 index 00000000..c96e7a73 --- /dev/null +++ b/docs/en/solutions/ecosystem/rocketmq/RocketMQ_Deployment_Fails_With_Unvalidated_Storage.md @@ -0,0 +1,46 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# RocketMQ Deployment Fails with Unvalidated Storage + +:::info Applicable Versions +All currently affected versions mentioned in the source page. +::: + +## Problem + +RocketMQ deployment can fail when it uses a storage backend that requires root-owned permissions or otherwise does not work with the default non-root runtime model. + +## Root Cause + +For security reasons, RocketMQ containers do not run as `root` by default. Some customer-provided storage backends require ownership or write permissions that are incompatible with the default container user, which causes cluster creation or startup to fail. + +## Solution + +Set an explicit security context through the custom resource and use a matching `fsGroup`. + +Example: + +```yaml +spec: + override: + statefulSet: + spec: + template: + spec: + securityContext: + runAsUser: 1001 + runAsGroup: 1001 + fsGroup: 1001 +``` + +## Notes + +- Apply this only after confirming that the storage class or backend really needs it. +- Re-verify volume mount permissions after the change. diff --git a/docs/en/solutions/ecosystem/rocketmq/RocketMQ_Exporter_Frequent_Restart_in_3121.md b/docs/en/solutions/ecosystem/rocketmq/RocketMQ_Exporter_Frequent_Restart_in_3121.md new file mode 100644 index 00000000..094df525 --- /dev/null +++ b/docs/en/solutions/ecosystem/rocketmq/RocketMQ_Exporter_Frequent_Restart_in_3121.md @@ -0,0 +1,40 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# RocketMQ Exporter Frequent Restart in 3.12.1 + +:::info Applicable Versions +RocketMQ 3.12.1. +::: + +## Problem + +In some RocketMQ 3.12.1 deployments, the exporter container restarts repeatedly. + +## Diagnosis + +Check the pod and container events first. In the reported case, the exporter was being killed because of insufficient memory and showed `OOMKilled`. + +Useful commands: + +```bash +kubectl -n describe pod +kubectl -n get events --sort-by=.lastTimestamp +``` + +## Solution + +Increase the resource specification for the affected exporter container from the Data Services YAML view or the underlying workload resource. + +Adjust the exporter memory limit according to observed usage and restart behavior. + +## Notes + +- This issue was observed on the exporter side rather than on the RocketMQ brokers themselves. +- After increasing memory, re-check pod stability and exporter scrape continuity. diff --git a/docs/en/solutions/ecosystem/rocketmq/RocketMQ_Exporter_OOM_Temporary_Mitigation.md b/docs/en/solutions/ecosystem/rocketmq/RocketMQ_Exporter_OOM_Temporary_Mitigation.md new file mode 100644 index 00000000..bc481ecc --- /dev/null +++ b/docs/en/solutions/ecosystem/rocketmq/RocketMQ_Exporter_OOM_Temporary_Mitigation.md @@ -0,0 +1,40 @@ +--- +products: + - Alauda Application Services +kind: + - Solution +ProductsVersion: + - 3.x +--- + +# RocketMQ Exporter OOM Temporary Mitigation + +:::info Applicable Versions +RocketMQ 3.12.x. +::: + +## Problem + +RocketMQ exporter can run out of memory in some environments. A reported case stabilized only after increasing the exporter memory to `2Gi`. + +The default exporter resource profile was: + +- CPU: `500m` +- Memory: `512Mi` +- scrape interval: `15s` + +The exporter is implemented in Java, and the custom resource could not directly update exporter resource settings in the reported environment. + +## Temporary Workaround + +Manually patch or edit the exporter `Deployment` and raise its memory request and limit to `2Gi`. + +## Important Limitation + +This is only a temporary workaround. If the RocketMQ instance is updated or reconciled again, the manual change can be overwritten. + +## Recommendation + +- Apply the manual adjustment only as a short-term mitigation. +- Monitor actual steady-state memory usage after the change. +- Follow up with a product-side fix that exposes exporter sizing through the CR or improves exporter memory behavior.