LiveSP Installation Operating Guide
Check 1 – LiveSP services are all up and running
Process
Bash command: livesp-status
API endpoint: /api/v1/service/status (see the example query after this list)
Status: equals “OK” if all services are running, “KO” otherwise
Severity: equals “NONE” if all services are running, otherwise the highest criticality among the services that are not running
Messages: one detail line per service that is not running (the criticality shown in each message is the per-service criticality used to compute Severity above)
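The same JSON can be fetched directly from the API endpoint above, which is convenient for an external monitoring system. A minimal sketch, assuming the API is reachable at a placeholder host <livesp-host> over HTTPS and that jq is installed (scheme, port and any required authentication depend on your deployment and are not confirmed by this guide):
$ curl -s https://<livesp-host>/api/v1/service/status | jq -r '.status'
OK
Any value other than “OK” (i.e. “KO”) means at least one service is down; the messages field then lists the affected services.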
Support action
1. Does the service keep failing at startup? What’s the message when it fails?
Run command dksps
2. Check inter-services connectivity
Run command livesp-service-connectivity
If not OK, force a restart of all services whose names failed to be resolved through DNS and check whether this solves the issue
Run command dkkill
3. If still not OK, check the last logs of the service to look for application-level details (a sketch combining these diagnostic steps follows this list).
Run command dklogs --tail 50 -f
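The steps above can be chained for every failing service. A minimal sketch, assuming livesp-status prints the JSON shown in the example below, that the message fields are separated by “ - ”, that dklogs accepts --tail without -f, and that jq is installed (run livesp-service-connectivity separately for step 2); none of these assumptions is confirmed by this guide:
# Hypothetical helper: loop over the services reported as KO
for svc in $(livesp-status | jq -r '.messages[] | split(" - ")[1]'); do
  echo "=== $svc ==="
  dksps "$svc"               # step 1: task history, exit codes, restart loop?
  dklogs --tail 50 "$svc"    # step 3: last application-level log lines
done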
Example
$ livesp-status
{
  "name": "/api/v1/service/status",
  "timestamp": "2020-08-05T14:04:27Z",
  "status": "KO",
  "severity": "MEDIUM",
  "messages": [
    "KO - livesp_dataoperator - Replicas 0/1 - Criticity: MEDIUM"
  ]
}
The livesp_dataoperator service is not healthy.
$ dksps livesp_dataoperator
ID                          NAME                         NODE  CURRENT STATE                   ERROR
impf4uxqmgrdxo75b3d9y28hk   livesp_dataoperator.1        mgt   Running less than a second ago
g92rsmr0ju6dyevog3lgw4gz0    \_ livesp_dataoperator.1    mgt   Failed 17 seconds ago           "task: non-zero exit (2)"
j63l5awgomjkrum7fxzlc6iz5    \_ livesp_dataoperator.1    mgt   Failed 52 seconds ago           "task: non-zero exit (2)"
0a5ffymddo88ib38drm2yf3xx    \_ livesp_dataoperator.1    mgt   Failed about a minute ago       "task: non-zero exit (2)"
r6ms1owkw4rarafysy6gtuufe    \_ livesp_dataoperator.1    mgt   Failed 2 minutes ago            "task: non-zero exit (2)"
The backup and restore management server restarts every few seconds (it fails to start up correctly).
$ dklogs --tail 10 -f livesp_dataoperator
INFO[2020-03-30T19:32:07Z] Server will start on port 8000
INFO[2020-03-30T19:32:07Z] Task purge configured to 30 days.
INFO[2020-03-30T19:32:07Z] Starting server...
ERRO[2020-03-30T19:32:37Z] UpdatePendingTasks: MongoDB find error: connection() : auth error: sasl conversation error: unable to authenticate using mechanism "SCRAM-SHA-256": (AuthenticationFailed) Authentication failed.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x8bf583]

goroutine 8 [running]:
go.mongodb.org/mongo-driver/mongo.(*Cursor).Next(0x0, 0xb51fa0, 0xc000072800, 0x2a)
        /root/go/pkg/mod/go.mongodb.org/mongo-driver@v1.1.3/mongo/cursor.go:93 +0x43
liveaction.com/livesp/dataoperator/internal.UpdatePendingTasks()
        /root/go/src/liveaction.com/livesp/dataoperator/internal/agentlib.go:87 +0x5c4
main.main.func1()
        /root/go/src/liveaction.com/livesp/dataoperator/cmd/data-operator-agent/data_operator_agent.go:85 +0x20
github.com/robfig/cron/v3.FuncJob.Run(0xa15988)
        /root/go/pkg/mod/github.com/robfig/cron/v3@v3.0.0/cron.go:131 +0x25
github.com/robfig/cron/v3.(*Cron).startJob.func1(0xc0000af180, 0xb48e40, 0xa15988)
        /root/go/pkg/mod/github.com/robfig/cron/v3@v3.0.0/cron.go:307 +0x69
created by github.com/robfig/cron/v3.(*Cron).startJob
        /root/go/pkg/mod/github.com/robfig/cron/v3@v3.0.0/cron.go:305 +0x73
The backup and restore management server fails to start because it cannot authenticate when connecting to its database (the livesp_dataoperatordb service).
Kill the database service with dkkill livesp_dataoperatordb and, if the problem isn't solved, call L3 support (this service being of medium importance, there is no real emergency, but it should be investigated within the next few days).
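Put together, a minimal sketch of that recovery sequence, assuming a 30-second wait is long enough for the database to come back up (adjust to your environment):
$ dkkill livesp_dataoperatordb
$ sleep 30
$ livesp-status
If the status does not return to “OK”, collect the dklogs output shown above and escalate to L3 support.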