
Posts

Showing posts from 2018

Custom monitoring graphs for xrp.ninja validator

As you may know, my rippled validator runs on GCE, so I use Stackdriver to monitor it. I created two custom metrics for it.

One metric is uptime, with the rippled version as a label. Here's a live graph for this metric. From this graph, you can easily tell what the current version is and when the validator was upgraded to it. You can also correlate this with other graphs to see whether a new version may have caused certain changes in rippled behavior.
The other metric is the percentage of current uptime that rippled has spent in each state. Normally the curve for the "full" state should asymptotically approach 100. If it does not, then rippled is in trouble.

xrp.ninja Ripple validator has been back to normal for two days

There have been some problems with the validator ever since it crashed about a week ago. It no longer crashes since I added much more disk space. But it easily falls behind and needs to play catch-up for several hours after each restart. I learned this from the state_accounting field in the output of "rippled server_info". I created a metric and a chart from it. This is what the chart looks like for the last week.
The "Full" state means the validator is fully synced and participating in the consensus process. You can see it is now asymptotically approaching 100. The value is the percentage of uptime the validator has spent in each state.
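As a rough illustration of how such a percentage could be derived, here is a minimal sketch. It assumes the state_accounting field maps each state name to an object with a duration_us value in microseconds (given as a string), and it divides by the sum of all state durations rather than by a separate uptime field. The function name and the sample numbers are hypothetical, not taken from my actual script.

```python
def state_percentages(state_accounting):
    """Return the share of total time spent in each state, in percent."""
    # duration_us is assumed to arrive as a string of microseconds.
    durations = {
        state: int(info["duration_us"])
        for state, info in state_accounting.items()
    }
    total = sum(durations.values())
    if total == 0:
        return {state: 0.0 for state in durations}
    return {state: 100.0 * d / total for state, d in durations.items()}


# Hypothetical sample shaped like the state_accounting field.
sample = {
    "connected": {"duration_us": "1000000", "transitions": 1},
    "disconnected": {"duration_us": "1000000", "transitions": 1},
    "full": {"duration_us": "6000000", "transitions": 1},
    "syncing": {"duration_us": "1000000", "transitions": 1},
    "tracking": {"duration_us": "1000000", "transitions": 1},
}
percentages = state_percentages(sample)
```

Each resulting value can then be written as one point per state, which is what lets the chart draw a separate curve for "full", "syncing" and so on.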
I figured out why the disk was out of space. It was because the validator fell behind. When that happens, online delete is disabled. See this code:
https://github.com/ripple/rippled/blob/fc0d64f5eec4386db7146251ab1a7fe880bec17c/src/ripple/app/misc/SHAMapStoreImp.cpp#L751
I saw some "Not deleting" messages in the log which led me to t…

xrp.ninja Ripple validator crashed last night due to low free disk space

The above graph shows what happened.

I did get an alert from GCE that disk usage was high. My alert policy fires when disk usage stays above 80% for more than 5 minutes. However, it was late at night, so I didn't get up, thinking the problem might resolve on its own. It didn't. GCE also didn't keep alerting me, which surprised me.
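To make the "over 80% for more than 5 minutes" condition concrete, here is a minimal sketch of how a threshold-plus-duration check could be evaluated over a series of samples. This is only an illustration of the idea, not how Stackdriver implements alert policies; the function name and the sample readings are hypothetical.

```python
def breached_for(samples, threshold, duration_s):
    """Check whether the value has stayed above threshold for duration_s.

    samples is a list of (timestamp_seconds, value) pairs sorted by time.
    The check looks at the unbroken run of breaching samples that ends
    at the most recent sample.
    """
    start = None
    for ts, value in samples:
        if value > threshold:
            if start is None:
                start = ts  # first sample of the current breaching run
        else:
            start = None  # run broken, start over
    if start is None:
        return False
    return samples[-1][0] - start >= duration_s


# Hypothetical disk usage readings, one per minute.
readings = [(0, 70), (60, 82), (120, 85), (180, 90),
            (240, 91), (300, 93), (360, 95)]
alerting = breached_for(readings, threshold=80, duration_s=300)
```

A check like this only says when the condition first becomes true. Whether the alert repeats while the condition stays true is a separate notification setting.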

Rippled logged these two lines before it died:

2018-Jan-10 11:33:27 Application:FTL Remaining free disk space is less than 512MB
2018-Jan-10 11:33:27 Application:FTL Application::onStop took 23ms
So rippled killed itself: https://github.com/ripple/rippled/search?utf8=%E2%9C%93&q=%22Remaining+free+disk+space+is+less+than+%22&type=
Before that, the log was flooded with the following messages for 5 hours:
2018-Jan-10 10:55:08 LoadMonitor:WRN Job: recvGetLedger run: 1390ms wait: 0ms
2018-Jan-10 10:55:32 LoadMonitor:WRN Job: recvGetLedger run: 1250ms wait: 0ms
2018-Jan-10 10:55:32 LoadMonitor:WRN Job: recvGetLedger run: 1533ms wait: 0m…
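If you want to count or chart these warnings, a small parser is enough. The sketch below assumes the line format is exactly what appears in the log lines above; the regular expression and the function name are my own, not something rippled provides.

```python
import re

# Pattern inferred from the LoadMonitor warning lines shown above.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-[A-Za-z]{3}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"LoadMonitor:WRN Job: (?P<job>\S+) "
    r"run: (?P<run>\d+)ms wait: (?P<wait>\d+)ms$"
)


def parse_load_warning(line):
    """Return the fields of a LoadMonitor warning, or None if no match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "timestamp": m["ts"],
        "job": m["job"],
        "run_ms": int(m["run"]),
        "wait_ms": int(m["wait"]),
    }


parsed = parse_load_warning(
    "2018-Jan-10 10:55:08 LoadMonitor:WRN Job: recvGetLedger run: 1390ms wait: 0ms"
)
```

Lines that do not match the pattern, such as the two FTL lines, return None and can simply be skipped.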

Ripple's Decentralization Strategy

Copied from the following link: https://www.xrpchat.com/topic/16362-rippled-0810-released/?do=findComment&comment=191441
mDuo13 wrote this. Kudos to him.
To recap the Decentralization Strategy, here's a summary:
Switch to using a validator list site (vl.ripple.com). This is where we are now.
All rippled instances configured to use the site can automatically follow Ripple's updates to the recommended set of validators, in lockstep.
In case you're curious, the validator list site publishes cryptographically signed recommendations of validators, so it's not easy to impersonate. And rippled caches the data it gets from the site, so the XRP Ledger won't go down even if vl.ripple.com is down for a while. (It might be tough to bring new rippled servers online while vl.ripple.com is down, but I think there are some protections against that, too.)
Update the site and the existing validators to use validator tokens instead of master validator secret keys.
This adds security to…

xrp.ninja Ripple validator upgraded to 0.81.0

Information about this version can be found here https://ripple.com/dev-blog/rippled-version-0-81-0/.

This happened at about 1:45pm local time.

There were some behavior changes after the upgrade:

Monitoring ripple validator running in GCE

GCE provides various kinds of metrics from which one can create dashboards, alerting policies, etc. However, there is no way to monitor the performance of rippled unless we write something ourselves.

Fortunately, GCE allows creating custom metrics. So as a starting point, I decided to create a metric for the rippled build_version. This information is very useful. For example, you will be able to tell whether the server's behavior changes after a version change.

However, I later learned that a custom metric can't have "STRING" as its value type. So I created an uptime metric with the build version as its label. It works just as well.
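Here is a minimal sketch of what one data point for such a metric could look like. The dictionary follows the JSON shape of a time series as I understand it from the Monitoring API v3. The metric type name, the label name, the function name and the sample values are hypothetical, and actually sending the request is left out.

```python
def uptime_time_series(build_version, uptime_seconds, instance_id, zone, end_time):
    """Build one time series entry: uptime as the value, version as a label."""
    return {
        "metric": {
            # Hypothetical custom metric name.
            "type": "custom.googleapis.com/rippled/uptime",
            # The version string goes in a label, since the value
            # itself cannot be a string.
            "labels": {"build_version": build_version},
        },
        "resource": {
            "type": "gce_instance",
            "labels": {"instance_id": instance_id, "zone": zone},
        },
        "points": [
            {
                "interval": {"endTime": end_time},
                # 64-bit integers are encoded as strings in JSON.
                "value": {"int64Value": str(uptime_seconds)},
            }
        ],
    }


# Hypothetical values for illustration only.
series = uptime_time_series(
    build_version="0.81.0",
    uptime_seconds=3600,
    instance_id="1234567890",
    zone="us-central1-a",
    end_time="2018-01-15T12:00:00Z",
)
```

Because the version is a label, each version becomes its own line on the chart, and the point where one line ends and the next begins marks the upgrade.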

Here is a screenshot of the chart created from this metric:

Unfortunately, it seems I can't share this chart publicly, unlike charts created from built-in metrics which can be shared publicly.