Cassandra Operations in Practice (Cassandra运维之道)
version 0.1
Jiang Feng (江枫), Taobao
http://www.NinGoo.net
http://twitter.com/NinGoo
Agenda
• Basic concepts
• Architecture
• Configuration parameters
• Backup and recovery
• Limitations
• Monitoring
• References
Basic concepts
• Gossip
• Memtable/SSTable
• Compaction
• Commitlog
• Consistency level
• Hinted Handoff
• Anti Entropy
• Read Repair
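Gossip
• Decentralized, consistent hashing, P2P protocol.
• The Gossip protocol synchronizes node state by exchanging digests of the endPointStateMap. A node's own state can only be modified by that node; the state it holds for other nodes is updated only through synchronization.
• Each EndPointState in the map contains:
– HeartBeatState: Generation (increases each time the node restarts) / Version Number
– ApplicationState: application state (one object per kind of state) / Version Number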
Gossip
• endPointStateMap
EndPointState 10.0.0.1
  HeartBeatState: generation 1259909635, version 325
  ApplicationState "load-information": 5.2, generation 1259909635, version 45
  ApplicationState "bootstrapping": bxLpassF3XD8Kyks, generation 1259909635, version 56
EndPointState 10.0.0.2
  HeartBeatState: generation 1259911052, version 61
  ApplicationState "load-information": 2.7, generation 1259911052, version 2
  ApplicationState "bootstrapping": AujDMftpyUvebtnn, generation 1259911052, version 31
Gossip Digest for endpoint 10.0.0.2:
  10.0.0.2:1259911052:61 (IP:Generation:Max Version)
In general, the version in HeartBeatState is the largest version (Max Version) within the EndPointState, but this is not a hard rule.
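The digest format above can be reproduced with a few lines of Python. This is only an illustrative sketch: the `EndpointState` class and `gossip_digest` function are hypothetical names, not Cassandra's Java classes, and the max version is taken over the heartbeat and all application states rather than assuming the heartbeat is always the largest.

```python
from dataclasses import dataclass, field


@dataclass
class EndpointState:
    """Hypothetical in-memory model of one entry in the endPointStateMap."""
    generation: int
    heartbeat_version: int
    # application state name -> (value, version)
    app_states: dict = field(default_factory=dict)

    def max_version(self):
        # Largest version among the heartbeat and every application state.
        versions = [self.heartbeat_version]
        versions += [version for _, version in self.app_states.values()]
        return max(versions)


def gossip_digest(ip, state):
    # Digest layout from the slide: IP:Generation:Max Version
    return f"{ip}:{state.generation}:{state.max_version()}"


state = EndpointState(
    generation=1259911052,
    heartbeat_version=61,
    app_states={
        "load-information": ("2.7", 2),
        "bootstrapping": ("AujDMftpyUvebtnn", 31),
    },
)
print(gossip_digest("10.0.0.2", state))  # 10.0.0.2:1259911052:61
```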
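Gossip
The Gossiper runs once per second (see the start method in Gossiper.java) and sends sync messages to other nodes according to these rules:
• Pick a random live node and send it a sync request (doGossipToLiveMember).
• Send a sync request to a random unreachable node (doGossipToUnreachableMember).
• If the node chosen in the first step is not a seed, or the number of live nodes is smaller than the number of seeds, send a sync request to a random seed to avoid isolated islands of information (doGossipToSeed).
In other words, one round of Gossip started by a node contacts at most three nodes. State converges across the whole cluster in roughly O(log N) rounds, where N is the number of nodes.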
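The three rules can be sketched as a target-selection function. This is a simplified model of the rules as stated on the slide, not the code in Gossiper.java; `pick_gossip_targets` and its parameters are hypothetical, and details such as selection probabilities are left out.

```python
import random


def pick_gossip_targets(live, unreachable, seeds, rng=random):
    """Return the nodes contacted in one gossip round (at most three)."""
    targets = []
    first = None
    if live:
        # Rule 1: one random live node.
        first = rng.choice(sorted(live))
        targets.append(first)
    if unreachable:
        # Rule 2: one random unreachable node.
        targets.append(rng.choice(sorted(unreachable)))
    # Rule 3: also gossip to a seed if rule 1 did not hit a seed,
    # or if fewer nodes are live than there are seeds.
    if seeds and (first not in seeds or len(live) < len(seeds)):
        targets.append(rng.choice(sorted(seeds)))
    return targets


print(pick_gossip_targets({"10.0.0.2", "10.0.0.3"}, {"10.0.0.4"}, {"10.0.0.1"}))
```

Passing a seeded `random.Random` instance as `rng` makes a round reproducible when experimenting.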
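Memtable/SSTable
• Storage model derived from the Google Bigtable design.
• Data is first written to an in-memory Memtable.
• No locks need to be held on the critical write path.
• When a Memtable reaches a threshold (size, number of keys, elapsed time, etc.), it is flushed to disk and saved as an SSTable.
• SSTables are immutable.
• Multiple SSTables of the same CF (column family) can be merged (Compaction) to optimize reads.
• Bloom filters reduce reads of SSTables that cannot contain the requested key.
• Random writes are turned into sequential writes, which improves write performance.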
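A toy model of this write path may help. It is a sketch under simplifying assumptions: the class name `ColumnFamilyStore`, the key-count threshold `max_keys`, and the use of in-memory tuples in place of on-disk files are all invented for illustration.

```python
class ColumnFamilyStore:
    """Toy write path: mutable memtable, immutable sorted SSTables."""

    def __init__(self, max_keys=2):
        self.max_keys = max_keys  # flush threshold (number of keys only)
        self.memtable = {}
        self.sstables = []        # oldest first; each is a sorted tuple

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.max_keys:
            self.flush()

    def flush(self):
        if self.memtable:
            # Written once, in key order, and never modified afterwards.
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newest SSTable wins, so scan from the most recent flush backwards.
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None


store = ColumnFamilyStore(max_keys=2)
store.write("a", 1)
store.write("b", 2)   # reaches the threshold and triggers a flush
store.write("a", 3)   # newer value stays in the memtable
print(store.read("a"), len(store.sstables))  # 3 1
```

The read path shows why many SSTables hurt reads: every extra SSTable is one more place to look, which is what Compaction and Bloom filters address.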
Memtable/SSTable
• Each SSTable consists of three files:
– Datafile: the data, stored in key-sorted order
– Indexfile: the offset of each key within the Datafile
– Filterfile: the Bloom filter built over the keys
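To illustrate what the Filterfile holds, here is a minimal Bloom filter. It is a generic textbook sketch, not Cassandra's implementation: the bit-array size, the number of hash functions, and the use of salted SHA-256 are arbitrary choices made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means the key is definitely absent, so the SSTable can be skipped.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


bf = BloomFilter()
bf.add("row-42")
print(bf.might_contain("row-42"))  # True
```

A read only needs to open the Datafile of SSTables whose filter answers "might contain".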
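Compaction
• A CF may have many SSTables. The system merge-sorts several SSTables and saves the result as a new SSTable; this is called Compaction.
• Compaction may be triggered once there are more than 4 SSTables.
• Major Compaction: merges all SSTables of a CF into one new SSTable and also purges garbage data (tombstones, that is, data marked as deleted).
• Minor Compaction: merges only SSTables of similar size.
• Can be triggered manually with the nodetool compact command.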
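The merge step can be sketched as follows. This is a simplified model: each SSTable is a dict of key to (timestamp, value), a tombstone is represented by a `None` value, and the `compact` function is a hypothetical name. It ignores GCGraceSeconds, which in practice delays tombstone removal.

```python
TOMBSTONE = None  # assumed marker for a deleted value in this sketch


def compact(sstables, drop_tombstones=False):
    """Merge SSTables, keeping the newest version of each key."""
    merged = {}
    for sstable in sstables:
        for key, (timestamp, value) in sstable.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    if drop_tombstones:
        # Only safe when every SSTable of the CF takes part (major compaction).
        merged = {k: tv for k, tv in merged.items() if tv[1] is not TOMBSTONE}
    # The new SSTable is again sorted by key.
    return dict(sorted(merged.items()))


old = {"a": (1, "x"), "b": (1, "y")}
new = {"a": (2, "z"), "b": (3, TOMBSTONE)}
print(compact([old, new], drop_tombstones=True))  # {'a': (2, 'z')}
```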
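Commitlog
• Before data is written to the Memtable, the CommitLogExecutorService thread must first write it to the Commitlog.
• The CommitlogHeader records a dirty flag for each CF and the offset at which recovery for that CF starts.
• A CommitlogSegment records the RowMutation changes.
• The Commitlog can be synced in two modes:
– Batch: the Memtable insert can proceed only after the CommitlogSegment has been flushed to disk. The flush waits up to CommitLogSyncBatchWindowInMS milliseconds so that other writes arriving in that window are flushed to disk together in one batch. Comparable to Oracle's batch/wait mode.
– Periodic: the CommitlogSegment is flushed every CommitLogSyncPeriodInMS milliseconds without blocking writes. Comparable to Oracle's batch/nowait mode.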
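Commitlog
• SSTables never change once persisted, so the Commitlog is only used to recover Memtables, similar to Oracle's Instance Recovery. Cassandra does not need Media Recovery.
• After an abnormal restart, the node performs instance recovery from the SSTables and the Commitlog, rebuilding in memory the Memtables that existed before the crash.
• Once the Memtables of every CF referenced by a Commitlog file have been flushed to disk, that Commitlog file is no longer needed and the system deletes it automatically.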
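The recovery behaviour described above can be modelled in a few lines. This is a toy sketch, not Cassandra's recovery code: the `Node` class is hypothetical, Python lists stand in for durable files, and a single Memtable stands in for all CFs.

```python
class Node:
    """Toy node: durable commitlog and SSTables, volatile memtable."""

    def __init__(self):
        self.commitlog = []  # durable, append-only
        self.memtable = {}   # lost on crash
        self.sstables = []   # durable, immutable

    def write(self, key, value):
        self.commitlog.append((key, value))  # log first
        self.memtable[key] = value           # then apply in memory

    def flush(self):
        self.sstables.append(dict(self.memtable))
        self.memtable = {}
        self.commitlog = []  # everything logged is now in an SSTable

    def crash_and_restart(self):
        self.memtable = {}   # memory contents are gone
        for key, value in self.commitlog:
            self.memtable[key] = value  # replay rebuilds the memtable


node = Node()
node.write("a", 1)
node.flush()
node.write("b", 2)
node.crash_and_restart()
print(node.memtable, node.sstables)  # {'b': 2} [{'a': 1}]
```

Only the unflushed write to "b" needs replaying, which is why the log can be discarded after a flush.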
ConsistencyLevel
• Write
Level: Behavior
– ZERO: Ensure nothing. A write happens asynchronously in background. Until CASSANDRA-685 is fixed: If too many of these queue up, buffers will explode and bad things will happen.
– ANY: (Requires 0.6) Ensure that the write has been written to at least 1 node, including hinted recipients.
– ONE: Ensure that the write has been written to at least 1 node's commit log and memory table before responding to the client.
– QUORUM: Ensure that the write has been written to <ReplicationFactor> / 2 + 1 nodes before responding to the client.
– ALL: Ensure that the write is written to all <ReplicationFactor> nodes before responding to the client. Any unresponsive nodes will fail the operation.
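The node counts in the write table can be worked out with integer arithmetic. The helper names below (`quorum`, `required_acks`) are made up for this example, and treating ANY as one acknowledgement is a simplification, since that acknowledgement may come from a hinted recipient.

```python
def quorum(replication_factor):
    # <ReplicationFactor> / 2 + 1, using integer division
    return replication_factor // 2 + 1


def required_acks(level, replication_factor):
    """Acknowledgements awaited before responding to the client."""
    table = {
        "ZERO": 0,                           # fire and forget
        "ANY": 1,                            # a hinted recipient also counts
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return table[level]


for rf in (1, 3, 5):
    print(rf, quorum(rf))
# 1 1
# 3 2
# 5 3
```

With ReplicationFactor 3, a QUORUM write therefore tolerates one unresponsive replica, while ALL tolerates none.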
ConsistencyLevel
• Read
Level: Behavior
– ONE: Will return the record returned by the first node to respond. A consistency check is always done in a background thread to fix any consistency issues when ConsistencyLevel.ONE is used. This means subsequent calls will have correct data even if the initial read gets an older value. (This is called read repair.)
– QUORUM: Will query all nodes and return the record with the most recent timestamp once it has at least a majority of replicas reported. Again, the remaining replicas will be checked in the background.
– ALL: Will query all nodes and return the record with the most recent timestamp once all nodes have replied. Any unresponsive nodes will fail the operation.
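Hinted Handoff
• By the placement rules, Key A is written primarily to node N1 and replicated to N2.
• Suppose N1 is down. If writing to N2 alone satisfies the ConsistencyLevel, the RowMutation for Key A is wrapped with a header carrying hint information (indicating that the target is N1) and written to a randomly chosen node N3; this copy is not readable. At the same time the data is replicated to N2 as usual, and that copy can serve reads. If writing to N2 does not satisfy the write consistency requirement, the write fails.
• When N1 recovers, the hinted data that was meant for N1 is written back to N1.
• Hinted Handoff is an optimization for eventual consistency: it shortens the window during which replicas are inconsistent.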
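The N1/N2/N3 scenario can be played through with a small simulation. This is a sketch only: the `Cluster` class is hypothetical, and it picks the first available non-replica node to hold the hint (the slide says the node is chosen at random) so that the output is deterministic.

```python
class Cluster:
    """Toy cluster tracking data, stored hints, and which nodes are up."""

    def __init__(self, nodes):
        self.data = {n: {} for n in nodes}
        self.hints = {n: [] for n in nodes}  # holder -> [(target, key, value)]
        self.up = set(nodes)

    def write(self, key, value, replicas, required_acks=1):
        live = [n for n in replicas if n in self.up]
        if len(live) < required_acks:
            raise RuntimeError("consistency level cannot be met")
        for node in live:
            self.data[node][key] = value
        for target in replicas:
            if target not in self.up:
                # Park a hint on a live node that is not itself a replica.
                holders = sorted(self.up - set(replicas))
                if holders:
                    self.hints[holders[0]].append((target, key, value))

    def recover(self, node):
        self.up.add(node)
        for holder, hints in self.hints.items():
            remaining = []
            for target, key, value in hints:
                if target == node:
                    self.data[node][key] = value  # hand the write back
                else:
                    remaining.append((target, key, value))
            self.hints[holder] = remaining


cluster = Cluster(["N1", "N2", "N3"])
cluster.up.discard("N1")                      # N1 goes down
cluster.write("A", "v1", replicas=["N1", "N2"])
print(cluster.hints["N3"])                    # [('N1', 'A', 'v1')]
cluster.recover("N1")
print(cluster.data["N1"])                     # {'A': 'v1'}
```

Note that the hint is kept apart from N3's own data, matching the point that the hinted copy is not readable.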
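Anti Entropy
• Eventual consistency of the data is ensured by AntiEntropy, which generates Merkle trees and compares them to detect inconsistencies between replicas, then uses org.apache.cassandra.streaming to carry out a full repair. This can be triggered through nodetool or automatically by the system.
• A Merkle tree is a hash tree: each leaf is the hash of a key, and each parent is the hash of its children's values. Comparing two parent nodes tells you whether any of their descendants differ.
• Comparing the roots is therefore a fast way to tell whether any leaf data differs.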
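A compact way to see the root comparison at work is the sketch below. It is a generic Merkle tree over a list of strings, not Cassandra's AntiEntropy code: the function names are invented, SHA-256 is an arbitrary choice, and both replicas are assumed to hold the same number of items in the same order.

```python
import hashlib


def _hash(data):
    return hashlib.sha256(data).digest()


def build_merkle_tree(items):
    """Return all levels of the tree, leaves first and root last."""
    level = [_hash(item.encode("utf-8")) for item in items] or [_hash(b"")]
    levels = [level]
    while len(level) > 1:
        # Each parent hashes the concatenation of up to two children.
        level = [_hash(b"".join(level[i:i + 2]))
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def merkle_root(items):
    return build_merkle_tree(items)[-1][0]


def differing_leaves(items_a, items_b):
    tree_a, tree_b = build_merkle_tree(items_a), build_merkle_tree(items_b)
    if tree_a[-1] == tree_b[-1]:
        return []  # equal roots: no leaf needs to be examined
    # A full implementation would descend only into differing subtrees;
    # this sketch compares the leaf level directly.
    return [i for i, (a, b) in enumerate(zip(tree_a[0], tree_b[0])) if a != b]


replica_1 = ["k1=v1", "k2=v2", "k3=v3"]
replica_2 = ["k1=v1", "k2=v2", "k3=stale"]
print(differing_leaves(replica_1, replica_2))  # [2]
```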
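Read Repair
• When Key A is read, the system reads all replicas of Key A and repairs any inconsistency it finds.
• With read consistency ONE, the replica closest to the client is returned immediately and Read Repair then runs in the background. This means the first read may not return the latest data.
• With read consistency QUORUM, one copy is returned to the client once more than half of the replicas have been read and agree; the consistency check and repair of the remaining nodes run in the background.
• With the strictest read consistency (ALL), a consistent copy is returned to the client only after Read Repair has completed.
• This mechanism helps shorten the window during which replicas are inconsistent.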
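The repair step itself amounts to "newest timestamp wins, then push it to stale replicas". The sketch below shows that rule synchronously, roughly what a read at ALL would observe; `read_with_repair` is a hypothetical function, and each replica is modelled as a dict of key to (timestamp, value).

```python
def read_with_repair(replicas, key):
    """Read key from every replica, repair stale copies, return the value."""
    responses = {name: store.get(key) for name, store in replicas.items()}
    present = [tv for tv in responses.values() if tv is not None]
    if not present:
        return None
    newest = max(present, key=lambda tv: tv[0])  # highest timestamp wins
    for name, seen in responses.items():
        if seen != newest:
            replicas[name][key] = newest         # repair stale or missing copy
    return newest[1]


replicas = {
    "N1": {"A": (1, "old")},
    "N2": {"A": (2, "new")},
    "N3": {},
}
print(read_with_repair(replicas, "A"))  # new
print(replicas["N1"]["A"], replicas["N3"]["A"])  # (2, 'new') (2, 'new')
```

At ONE or QUORUM the same comparison would run in the background after the client has already received its answer.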
Architecture
• Data distribution
• Data replication
• Interfaces
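Data distribution
• RandomPartitioner
Random hash distribution based on MD5. The MD5 hash space used for tokens is 2^127-1, and InitialToken values can be spread evenly over N nodes; for example, the i-th node can be set to i*(2^127-1)/N.
• OrderPreservingPartitioner
Range distribution based on key value (UTF-8 ordering).
• CollatingOrderPreservingPartitioner
Range distribution based on key value (locale-specific collation order).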
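The InitialToken formula for RandomPartitioner is easy to evaluate with Python's arbitrary-precision integers. In this sketch the function name `initial_tokens` is made up, and numbering nodes from 0 is an assumption (the slide does not say whether i starts at 0 or 1).

```python
def initial_tokens(node_count):
    """Evenly spaced tokens: i * (2^127 - 1) / N for i = 0 .. N-1."""
    token_space = 2**127 - 1
    return [i * token_space // node_count for i in range(node_count)]


for i, token in enumerate(initial_tokens(4)):
    print(f"node {i}: InitialToken = {token}")
```

Each printed value could then be placed in that node's InitialToken setting so that every node owns an equal share of the ring.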
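Data replication
• DatacenterShardStategy
With a replication factor of N, (N-1)%2 of the replicas are placed in a different data center. All replicas are distributed evenly across the two data centers.
• RackAwareStrategy
One replica is placed in a different data center, and the other replicas are placed on different racks in the same data center. The remote data center holds only one replica, mainly for disaster recovery.
• RackUnawareStrategy
Ignores the physical location of replica nodes; replicas generally go to the next N-1 nodes to the right on the hash ring.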
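The ring walk for the rack-unaware case can be sketched as below. This is an illustration under assumptions, not Cassandra's strategy class: `replicas_for` is a hypothetical function, the ring is a sorted list of (token, node) pairs, and a node is taken to own the keys whose token is less than or equal to its own token.

```python
import bisect


def replicas_for(key_token, ring, replication_factor):
    """First replica is the owner of key_token; the rest follow clockwise."""
    tokens = [token for token, _ in ring]
    # First node whose token is >= the key's token, wrapping around the ring.
    start = bisect.bisect_left(tokens, key_token) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


ring = [(0, "A"), (100, "B"), (200, "C")]
print(replicas_for(50, ring, 2))   # ['B', 'C']
print(replicas_for(250, ring, 2))  # ['A', 'B']
```

The second call shows the wrap-around: a token past the last node falls to the first node on the ring.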
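Interfaces
• Two programming interfaces
– Thrift
Open-sourced by Facebook in 2007 and contributed to Apache; it is currently evolving slowly. Interface code must be generated for each language.
– Avro
A Hadoop subproject that Cassandra is migrating toward. It is a dynamic serialization library, so no static interface code needs to be generated.
Google's Protocol Buffers is a similar interface.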
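Configuration parameters
• Main configuration file: storage-conf.xml
– ClusterName: cluster name; must be the same on all nodes
– AutoBootstrap: set to true when a new node joins the cluster so that it bootstraps
– HintedHandoffEnabled: enables the Hinted Handoff feature
– Keyspaces: keyspace and column family settings for the data model
– ReplicaPlacementStrategy: replica placement strategy (data-center-based or rack-based)
– ReplicationFactor: number of replicas; 3 is generally recommended
– EndPointSnitch: describes how cluster nodes map to physical machines, which is used to route the different replicas
– Partitioner: data distribution strategy, random or order-preserving
– InitialToken: the node's initial token, which determines which node holds the first replica of a given key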
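Configuration parameters
• Main configuration file: storage-conf.xml
– CommitLogDirectory: path where Commitlog files are stored
– DataFileDirectory: path where data files are stored; multiple paths can be specified
– Seeds: list of seed nodes. A node can be configured as a seed once it has finished bootstrapping; a new node joining the cluster obtains the information it needs from the seeds.
– RpcTimeoutInMillis: timeout when waiting for a reply from a remote node
– CommitLogRotationThresholdInMB: Commitlog file size; a new file is started once this size is exceeded
– ListenAddress/StoragePort: IP address and port for intra-cluster communication
– ThriftAddress/ThriftPort: IP address and port on which Thrift listens for client requests
– DiskAccessMode: disk access mode. On 64-bit systems, mmap is recommended, or auto (equivalent to mmap on 64-bit)
– RowWarningThresholdInMB: warns about oversized rows during compaction. If a row being compacted cannot fit entirely in memory, Cassandra will crash, so set the warning threshold according to available memory.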
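Configuration parameters
• Main configuration file: conf/storage-conf.xml
– SlicedBufferSizeInKB: buffer size for reading contiguous columns
– FlushDataBufferSizeInMB: buffer size for flushing a Memtable to the on-disk data file
– FlushIndexBufferSizeInMB: buffer size for flushing a Memtable to the on-disk index file
– ColumnIndexSizeInKB: when a row grows beyond this size, a column offset index is added
– MemtableThroughputInMB: Memtable size
– MemtableFlushAfterMinutes: force a Memtable flush to disk after N minutes
– ConcurrentReads: number of concurrent read requests; twice the number of CPU cores is recommended
– ConcurrentWrites: Cassandra performs better on writes, so concurrent writes can be set higher, for example 8 times the number of CPU cores
– CommitLogSync: how the Commitlog is synced to disk, batch or periodic
– GCGraceSeconds: how long to wait before purging data marked with tombstones. If a node stays down longer than this interval, it is permanently invalid and must be re-bootstrapped before it can rejoin the cluster. The default is 10 days.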
Configuration parameters
• Logging configuration file: conf/log4j.properties
– log4j.appender.R.File=/var/log/cassandra/system.log (log file location)
– log4j.appender.file.maxFileSize=20MB (log file size)
Configuration parameters
• JVM settings: bin/cassandra.in.sh
JVM_OPTS=" \
-ea \
-Xms256M \
-Xmx1G \
-XX:+UseParNewGC \
-XX:+UseConcMarkSweepGC \
-XX:+CMSParallelRemarkEnabled \
-XX:SurvivorRatio=8 \
-XX:MaxTenuringThreshold=1 \
-XX:+HeapDumpOnOutOfMemoryError \
-Dcom.sun.management.jmxremote.port=8080 \
-Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.authenticate=false"
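Backup and recovery
• Snapshot
– The nodetool snapshot command creates a snapshot of the SSTables.
– Before the snapshot is taken, a Memtable switch is performed so that the latest data is saved as SSTables.
– Copying the snapshot gives a physical backup of the node's data.
– A snapshot is actually a set of hard links to the SSTable files.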
Backup and recovery
• Export/Import
sstable2json exports data to a JSON file, which amounts to a logical backup.
json2sstable imports a JSON file back into an SSTable.
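Limitations
• Keyspaces/CFs cannot be added or removed dynamically; support for this is planned for 0.7 and later versions.
• Because Compaction deserializes entire rows, a row must fit entirely in memory.
https://issues.apache.org/jira/browse/CASSANDRA-16
• A row cannot exceed 2^31-1 bytes, because its length is stored as an integer that is serialized to disk together with the row data.
• Sub columns in super column families are not indexed, so deserializing one sub column requires deserializing all sub columns of its super column. Avoid designs that use large numbers of sub columns.
https://issues.apache.org/jira/browse/CASSANDRA-598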
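Limitations
• Thrift does not support streaming; the data of each read or write request must be held in memory, so large objects may need to be split before they are stored or fetched.
http://issues.apache.org/jira/browse/CASSANDRA-265
• Random data that does not follow the protocol, received on the Thrift port, can crash Cassandra. Probing the Thrift port with tools such as telnet can therefore bring a node down.
http://issues.apache.org/jira/browse/CASSANDRA-475
http://issues.apache.org/jira/browse/THRIFT-601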
Monitoring
• Nodetool
nodetool -h localhost -p 8080 tpstats
nodetool -h localhost -p 8080 cfstats
• jconsole
JMX address: service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi
• Nagios
http://www.mahalo.com/how-to-monitor-cassandra-with-nagios
• Cassandra web console
http://github.com/suguru/cassandra-webconsole/downloads
References
• http://wiki.apache.org/cassandra
• http://io.typepad.com/glossary.html
• http://spyced.blogspot.com/
• http://perspectives.mvdirona.com/2009/02/07/FacebookCassandraArchitectureAndDesign.aspx
• http://nosql.mypopescu.com/tagged/cassandra
• http://www.cs.cornell.edu/home/rvr/papers/flowgossip.pdf
• http://www.ruohai.org/?p=13
• http://www.ningoo.net/html/2010/cassandra_token.html
• http://www.dbthink.com/?tag=cassandra
• http://java.sun.com/developer/technicalArticles/J2SE/jconsole.html
• http://cassandra.apache.org/
* Some of these links can only be reached through a proxy that bypasses the firewall.