| Causes: Category 1 | Causes: Category 2 | Symptoms | Countermeasures |
| The cluster configuration has not changed, but the data service stops for some partitions, or runs in partial operation, because there are too few replicas. | The cluster judged that a backup with a large replication delay relative to the current owner had failed, and excluded it from the backups. | [Server]: 10011 [Description] (Backup error detected) [Format] (*Master node only) Partition no., owner address, owner LSN, backup address, backup LSN | Replication delays tend to occur especially in asynchronous replication mode. If delays occur frequently, either switch to semi-synchronous replication mode (set /transaction/replicationMode in the gs_cluster.json file to 1), or add /cluster/ownerBackupLsnGap to the gs_cluster.json file and set it to a value larger than the default (50000) (*1). Example: "cluster":{ "ownerBackupLsnGap":"100000", : } |
| | Synchronization did not complete before the processing timeout, so some data services started (partial) operation with no more than the specified number of replicas. | [Server]: 10016 [Description] (Synchronization timeout detected) [Format] (*Owner node only) Partition no., owner address when partial operation starts | This is a trade-off with application downtime, but if availability is the priority, increase the synchronization timeout (/sync/timeoutInterval in the gs_cluster.json file). This value serves as a guide for the maximum downtime. |
| | The data service of the relevant partition was stopped because of a data inconsistency that arises when service continues even though a node holding the latest data of that partition is down. | [Server]: 10007 [Description] (Data service stopped due to detection of data loss) [Format] (*Master node only) Partition no., LSN of the latest data in the partition including the down node, largest LSN in the current cluster, address of the node presumed to hold the latest data | See “Problems related to client failover” as well. In addition, the probability of occurrence can be lowered by increasing the number of replicas or by switching to semi-synchronous replication (set /transaction/replicationMode to 1 in the gs_cluster.json file). |
| A failure requiring a change in the cluster configuration has occurred. | The master node detected that a follower node is down, that a node's status changed to ABNORMAL, or that gs_leavecluster was executed. | [Server]: 10010 [Description] (Start failover) [Format] (*Master node only) List of nodes with detected errors, failover no. | Check the event log and error code of each affected node for the details of the error. |
| | A heartbeat error caused by a network failure or a large network delay was detected. | [Server]: 10008, 10009 [Description] (Heartbeat timeout) [Format] (*Master node only) Node with detected error, heartbeat limit time, final heartbeat arrival time [External tool] Check whether traffic is within the network bandwidth | Lengthen the heartbeat interval if the isolation appears to occur regularly rather than intermittently. However, because failure detection and recovery are delayed if the heartbeat interval is too long, weigh the trade-off with availability when setting it. If the delay is caused by network bandwidth, the probability of occurrence can also be lowered by configuring each serviceAddress/servicePort in gs_cluster.json on a network separate from the others. |
| | A heartbeat error caused by a high load on resources other than the network was detected. | [Server]: 10008, 10009 [Description] (Heartbeat timeout) [Format] (*Master node only) Node with detected error, heartbeat limit time, final heartbeat arrival time [External tools] Resource investigation tool | Lengthen the heartbeat interval if the isolation appears to occur regularly rather than intermittently. Unlike the network case, there are many possible causes, e.g. swapping due to insufficient memory, waiting for disk I/O, a server kept busy by a high-load application, another application starting on the same machine, and so on. In addition to gs_stat, check the resource status of the whole machine concerned. |
| A failure to maintain the cluster configuration has occurred. | The cluster was reset because a majority of the nodes left the cluster or errors were detected in a majority of the nodes. | [Server]: 10014 [Description] (Cluster breakup as a majority of the nodes cannot be secured) [Format] (*Master node only) Number of nodes required to maintain a cluster, number of nodes already participating in a cluster. A heartbeat error detection is also recorded just before this event: [Server]: 10008, 10009 [Description] (Heartbeat timeout) [Format] (*Master node only) Node with detected error, heartbeat limit time, final heartbeat arrival time | Recover the nodes that are down because of the failure. If recovery is not possible, start a new node separately and let the cluster recover so that the number of nodes constituting the cluster is reached. However, since a down node may hold the latest data, check with “gs_partition --loss”, etc. when adding a node. |
| | A network disruption occurred in a running cluster and a cluster reconfiguration was then attempted, but no group of nodes could secure a majority. | [Server]: 10014 [Description] (Cluster breakup as a majority of the nodes cannot be secured) [Format] (*Master node only) Number of nodes required to maintain a cluster, number of active nodes. A heartbeat error detection is also recorded just before this event: [Server]: 10008, 10009 [Description] (Heartbeat timeout) [Format] (*Master node only) Node with detected error, heartbeat limit time, final heartbeat arrival time | When a network disruption occurs, the cluster restarts automatically once the disruption is resolved. If the disruption is unlikely to be resolved, manually reduce the number of constituting nodes and re-constitute the cluster. However, since a node cut off by the disruption may hold the latest data, check with “gs_partition --loss”, etc. when adding a node. |
| Failure to rebalance (replica creation for nodes with insufficient replicas and uniform distribution of replicas among nodes) | A conflict with a checkpoint or similar occurred, and the rebalancing process in progress could not continue. | [Server]: 10012 [Description] (Rebalance failure) [Format] Failed partition no., partition group no., checkpoint no., failure cause | If a checkpoint runs during rebalancing, the data file may be updated and log files may be deleted, in which case the rebalancing process in progress is cut off midway. Even then, checking and retrying are carried out regularly. Because this costs time, the probability of occurrence can be reduced by increasing the checkpoint interval and the number of logs retained beforehand. |
| | Rebalancing could not be completed within the rebalance timeout period. | [Server]: 10014 [Description] (Rebalance timeout) [Format] (*Master node only) Partition no. for which the timeout was detected, timeout time | Increase the rebalance timeout. |
| A failure that prevents a cluster node from continuing has occurred. | The node stopped due to a platform error (the node status is ABNORMAL). | [Server]: Each platform error no. [Description] Platform error | If the node status is ABNORMAL, a platform error may have prevented the node from continuing. See the following rows for details. Collect the necessary information, remove the cause, stop the node forcibly (gs_stopnode with the --force option), restart the node (gs_startnode), and join the node to the cluster (gs_joincluster). |
| | The node stopped because the disk is full. | [Server]: Each platform error no. [Description] Platform error (disk full) [External tools] Check with df, du | Free disk space by deleting unnecessary files, add disks, or prepare new nodes that provide additional disk space. |
| | The node stopped due to a disk I/O error. | [Server]: Each platform error no. [Description] Platform error (disk I/O) | Several scenarios are possible, e.g. a physical disk failure, a file write failure due to resource exhaustion, accidental manual deletion of a required file, and so on. See “Problems related to recovery process” for the last case. |
| | The node stopped due to a memory error. | [Server]: Each platform error no. [Description] Platform error (memory allocation) [External tools] vmstat, top [Command check] gs_stat : /performance/processMemory | First, check whether the memory upper limit (storeMemoryLimit) has been set too high relative to the physical memory. In addition, since request processing may be stalled if processMemory has grown large, check the total amount of memory held for communication messages (/performance/memoryDetail/work.transactionMessageTotal), which is output by running gs_stat with the --memoryDetail option. Use gs_stat to check storeMemoryLimit and processMemory. storeMemoryLimit can be changed by editing gs_node.json and using gs_paramconf. |
(*1) /cluster/ownerBackupLsnGap in the gs_cluster.json file: LSN threshold for judging a backup error on a partition and for promotion to owner (master of the partition). This parameter may be removed or renamed in the future.
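
The two countermeasures for backup errors (event 10011) both amount to editing a value in gs_cluster.json at a slash-separated parameter path. The following is a minimal sketch of that edit on an in-memory copy of the configuration. The helper name `set_by_path` and the sample starting configuration are hypothetical and only for illustration; the parameter paths and the value "100000" come from the table above. Reading and writing the actual file, and deploying it to the nodes, are outside this sketch.

```python
import copy
import json


def set_by_path(config, path, value):
    """Return a copy of config with the value at a slash-separated path set.

    Hypothetical helper for illustration; it is not part of any GridDB tool.
    """
    updated = copy.deepcopy(config)
    keys = [k for k in path.split("/") if k]
    node = updated
    # Walk down to the parent object, creating missing levels on the way.
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return updated


# Assumed sample content of gs_cluster.json; the values here are illustrative.
cluster_config = {
    "cluster": {"clusterName": "myCluster"},
    "transaction": {"replicationMode": 0},
}

# Option 1 from the table: switch to semi-synchronous replication.
semi_sync = set_by_path(cluster_config, "/transaction/replicationMode", 1)

# Option 2: raise ownerBackupLsnGap above the default (50000), as in the example.
wider_gap = set_by_path(cluster_config, "/cluster/ownerBackupLsnGap", "100000")

print(json.dumps(wider_gap, indent=2))
```

The helper returns a modified copy rather than changing the input, so the two options can be compared side by side before either one is written back to the file.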