## 1. ClickHouse Keeper Cluster Deployment Overview

ClickHouse Keeper is the official distributed coordination service from the ClickHouse project, designed as a replacement for ZooKeeper. Having deployed it several times in production, I find it noticeably lighter and easier to maintain than ZooKeeper. For small and mid-sized ClickHouse clusters, a two-node high-availability setup is a good starting point: it meets availability requirements without adding much operational complexity.

Its core advantage is that it is natively built into ClickHouse, so no extra component needs to be deployed. I have seen ZooKeeper GC pauses cause ClickHouse query timeouts; Keeper, being written in C++ and optimized for ClickHouse, performs far more consistently. The table below compares the two options:

| Feature | ClickHouse Keeper | ZooKeeper |
| --- | --- | --- |
| Deployment complexity | Low (built in) | High (separate deployment) |
| Resource footprint | Lower | Higher |
| ClickHouse compatibility | Fully compatible | Needs extra configuration |
| Operational cost | Low | Medium to high |

## 2. Environment Preparation and Installation

### 2.1 System Requirements

I recommend CentOS 7 or Ubuntu 18.04; in my testing these releases were the most stable. Each server needs:

- At least 4 CPU cores
- 8 GB or more of RAM
- 100 GB or more of disk space (SSD preferred)
- Open ports: 9000 (ClickHouse native TCP), 9181 (Keeper client TCP), 9234 (Raft communication)

**Important:** make sure all nodes are time-synchronized via NTP. Clock drift between nodes can break the Raft protocol.

### 2.2 Installing ClickHouse

Run the following on both nodes (call them node1 and node2):

```bash
# Ubuntu/Debian
sudo apt-get install -y apt-transport-https ca-certificates dirmngr
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv E0C56BD4
echo "deb https://repo.clickhouse.com/deb/stable/ main/" | sudo tee /etc/apt/sources.list.d/clickhouse.list
sudo apt-get update
sudo apt-get install -y clickhouse-server clickhouse-client
```

```bash
# CentOS/RHEL
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://repo.clickhouse.com/rpm/stable/x86_64
sudo yum install -y clickhouse-server clickhouse-client
```

Once installation finishes, do not start the service yet; Keeper has to be configured first.

## 3. Keeper Cluster Configuration in Detail

### 3.1 Editing config.xml

Add the following to /etc/clickhouse-server/config.xml on both nodes:

```xml
<keeper_server>
    <tcp_port>9181</tcp_port>
    <server_id>1</server_id> <!-- 1 on node1, 2 on node2 -->
    <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
    <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>

    <coordination_settings>
        <operation_timeout_ms>10000</operation_timeout_ms>
        <session_timeout_ms>30000</session_timeout_ms>
        <raft_logs_level>warning</raft_logs_level>
    </coordination_settings>

    <raft_configuration>
        <server>
            <id>1</id>
            <hostname>node1</hostname>
            <port>9234</port>
        </server>
        <server>
            <id>2</id>
            <hostname>node2</hostname>
            <port>9234</port>
        </server>
    </raft_configuration>
</keeper_server>

<zookeeper>
    <node>
        <host>node1</host>
        <port>9181</port>
    </node>
    <node>
        <host>node2</host>
        <port>9181</port>
    </node>
</zookeeper>
```

Key parameters:

- `server_id`: must be unique on each node
- `raft_logs_level`: use `warning` in production; switch to `trace` when debugging
- `operation_timeout_ms`: tune to your network conditions; increase it for cross-datacenter deployments

### 3.2 Configuring Cluster Replication

Still in config.xml, add the distributed table configuration:

```xml
<remote_servers>
    <cluster_2s_1r>
        <shard>
            <internal_replication>true</internal_replication>
            <replica>
                <host>node1</host>
                <port>9000</port>
            </replica>
        </shard>
        <shard>
            <internal_replication>true</internal_replication>
            <replica>
                <host>node2</host>
                <port>9000</port>
            </replica>
        </shard>
    </cluster_2s_1r>
</remote_servers>

<macros>
    <shard>01</shard>     <!-- 01 on node1, 02 on node2 -->
    <replica>01</replica> <!-- 01 for a single replica -->
</macros>
```

## 4. Startup and Verification

### 4.1 Starting the Services

Start the nodes in order, beginning with the one that has the smaller server_id:

```bash
# run on both nodes
sudo systemctl start clickhouse-server
```

Check the service status:

```bash
sudo systemctl status clickhouse-server
```

### 4.2 Verifying the Keeper Cluster

Check Keeper's status with the four-letter command:

```bash
echo ruok | nc localhost 9181   # should return imok
```

Query the cluster state through SQL:

```sql
SELECT * FROM system.zookeeper WHERE path = '/';
```

### 4.3 Creating a Test Table

```sql
CREATE DATABASE test ON CLUSTER cluster_2s_1r;

CREATE TABLE test.dist_table ON CLUSTER cluster_2s_1r
(
    id UInt32,
    data String
)
ENGINE = ReplicatedMergeTree()
ORDER BY id;
```

## 5. Common Problems

**Problem 1: a node cannot join the cluster**

- Check the firewall settings
- Confirm the hostnames in `raft_configuration` are resolvable
- Inspect /var/log/clickhouse-server/clickhouse-server.log

**Problem 2: "Server still not initialized" on startup**

- Delete the /var/lib/clickhouse/coordination directory and restart
- Check that there is enough free disk space

**Problem 3: client connection timeouts**

- Increase `operation_timeout_ms`
- Check network latency

In my deployments, all sorts of odd problems appear once inter-node latency exceeds the `operation_timeout_ms` setting. I recommend keeping latency between nodes on the same LAN under 1 ms, and raising the timeout parameters appropriately for cross-datacenter deployments.
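To confirm that both nodes actually see the topology defined in section 3.2, it is worth querying the system tables right after startup. A minimal check, using only the cluster name and macros from the configuration above:

```sql
-- Both nodes should report the same two shards with the hosts from remote_servers
SELECT cluster, shard_num, replica_num, host_name, port
FROM system.clusters
WHERE cluster = 'cluster_2s_1r';

-- Each node should show its own values for the {shard} and {replica} macros
SELECT * FROM system.macros;
```

If either node shows a different shard list, or the macros do not differ between node1 and node2, fix config.xml before creating any replicated tables.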
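Note that because cluster_2s_1r is a two-shard, one-replica layout, the test table from section 4.3 stores a different slice of the data on each node; reading across both requires a Distributed table. A hedged sketch of an end-to-end write/read test — the table name `dist_table_all` and the `rand()` sharding key are my illustrative choices, not part of the original setup:

```sql
-- Distributed table routing over the two local shards (name is illustrative)
CREATE TABLE test.dist_table_all ON CLUSTER cluster_2s_1r
AS test.dist_table
ENGINE = Distributed(cluster_2s_1r, test, dist_table, rand());

-- Insert through the Distributed table; rows are spread across both shards
INSERT INTO test.dist_table_all
SELECT number, toString(number) FROM numbers(1000);

-- Distributed inserts are asynchronous by default, so allow a moment;
-- run on either node, this should then return 1000
SELECT count() FROM test.dist_table_all;

-- Run on each node: the local shard holds roughly half of the rows
SELECT count() FROM test.dist_table;
```

Seeing the full count from either node, with each local table holding only part of the data, confirms that both the Keeper coordination and the shard routing are working.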
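Finally, the `ruok` check from section 4.2 is easy to extend into a small script that monitoring can run, so Keeper problems surface before clients hit timeouts. A minimal sketch, assuming nc is installed and that the `mntr` four-letter command is allowed (it is in Keeper's default four_letter_word_white_list):

```bash
#!/usr/bin/env bash
# keeper_check.sh -- poll each Keeper node for liveness and Raft role.
# Hostnames and port follow the keeper_server/raft_configuration settings above.
NODES="node1 node2"
PORT=9181

for node in $NODES; do
    # 'ruok' returns 'imok' when the instance is serving requests
    reply=$(echo ruok | nc -w 2 "$node" "$PORT")
    if [ "$reply" != "imok" ]; then
        echo "$node: DOWN (reply: '$reply')"
        continue
    fi
    # 'mntr' dumps metrics; zk_server_state reports leader or follower
    role=$(echo mntr | nc -w 2 "$node" "$PORT" | awk '$1 == "zk_server_state" {print $2}')
    echo "$node: up, role=${role:-unknown}"
done
```

In a healthy cluster one node reports `leader` and the other `follower`; any `DOWN` line is worth an immediate alert.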