Ivory tower metrics don't help when you're in the middle of an outage
Metrics are great, but you simply don't have them all of the time, especially for clusters you don't own.
A lot of the time, etcd will be up but your kube cluster is slow and unresponsive, and you need a simple way to quantify how bad things are.
So here's a simple Python script that reads from an etcd log file and gives you some really easy numbers to interpret.
Background
So, the other day I got a log file of etcd crashes, and it's never clear: is it due to the network, the VM, or the disk?
1) If it's the disk, it's usually obvious: you'll see "fsync" warning messages in the logs.
2) If it's not the disk, look at metrics: specifically, leader election events and "help, I'm leaderless" events.
3) If you don't have metrics, and it's not the fsync message (which is easy to eye-grep for), it's worth looking at which nodes are unstable, and how often (see the quick pre-filter sketch below).
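For that quick eye-grep, a few lines of Python are enough. Here's a minimal sketch; the file name etcd.txt and the needle strings are assumptions you'll want to adjust for your own logs and etcd version:
import collections

# Minimal pre-filter: count tell-tale etcd warning strings across the file.
# "etcd.txt" and the needle strings below are assumptions; adjust to taste.
needles = ["fsync", "lost leader", "leader election"]
counts = collections.Counter()
with open('etcd.txt') as f:
    for line in f:
        for n in needles:
            if n in line.lower():
                counts[n] += 1
print(dict(counts))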
However, at large scale you might get tens of log messages in a given second, which makes it tricky to draw conclusions by eye.
The output
This script quickly aggregates etcd logs into hourly bins and gives you easy-to-read output, e.g.
('Jan 11 hour=14', {'connection refused': 501, 'connection reset': 100, 'lost leader': 3})
('Jan 11 hour=17', {'connection refused': 1020, 'overloaded': 120, 'failed to send out heartbeat': 99, 'connection reset': 983, 'lost leader': 2})
('Jan 11 hour=16', {'connection refused': 139, ...
And so on. It will also output the IP addresses and how often each was associated with a log complaint:
TOTAL FAILURES PER IP ADDRESS
{'192.10.0.1': 10000, '192.10.0.2': 60000}
TOTAL RECORDS
75000
So from the output above, you can quickly ascertain that the large majority of the error logs came from the second node, and that leader losses occurred several times in some hours, with hundreds of refused and reset connections in those same hours.
That's enough to ascertain that one of your VMs is less stable than the others, that all of the VMs were in trouble, and, most importantly, to conclude that if you monitor leader elections in this cluster, alerting on a leader election that happens more than once in an hour is probably a good, not-too-noisy metric to indicate that something is severely wrong.
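As a rough sketch of that alert logic, run offline over the hourly failure counts the script below produces (the hourly dict here is hypothetical sample data, and the threshold is the one argued for above):
# Hypothetical offline check: flag any hour with more than one leader loss.
# `hourly` stands in for the per-hour failure counts the script below builds.
hourly = {
    'Jan 11 hour=14': {'connection refused': 501, 'lost leader': 3},
    'Jan 11 hour=15': {'connection refused': 12},
}
THRESHOLD = 1  # more than one leader loss per hour = severely wrong
for bucket, failures in hourly.items():
    if failures.get('lost leader', 0) > THRESHOLD:
        print("ALERT: %s saw %d leader losses" % (bucket, failures['lost leader']))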
At large scale, Kubernetes events (mass deletions, for example, when you're rolling out new versions of an application on a large cluster) can trigger etcd failures or instability.
These can persist long after the triggering event, bringing down metrics or other services, in which case you have to look at the logs.
Also, if you run more than one etcd cluster, it's common to accidentally bridge them to the wrong masters, resulting in other sorts of confusing behavior.
The code
You should be able to just copy+paste this and run it anywhere with Python 3.
import csv
import re

# Add strings to this array to capture/aggregate.
failure_types = ["overloaded", "connection reset", "connection refused",
                 "failed to send out heartbeat", "lost leader"]

# entries is a map: time bucket -> [log1, log2, ...]
entries = {}
total = 0

def add_entry(time_record, contents):
    global total
    # tr_key can be hacked if you want more granular keys to aggregate on.
    tr_key = time_record
    entries.setdefault(tr_key, []).append(contents)
    total += 1

# Parse the log file: lines are split on "|", and the first field is assumed
# to start with a syslog-style timestamp, e.g. "Jan 11 14:32:01".
with open('etcd.txt') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter="|")
    for row in csv_reader:
        if len(row) > 1:
            fields = row[0].split(" ")
            # Build an hourly bucket key, e.g. "Jan 11 hour=14".
            datekey = fields[0] + " " + fields[1] + " hour=" + fields[2].split(":")[0]
            add_entry(datekey, str(row))

# Count failure types per hourly bucket, and IP-address mentions overall
# (IP mentions correlate highly with node failures).
ip = {}
ip_address = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
for e, logs in entries.items():
    failures = {}
    for log in logs:
        for f in failure_types:
            if f in log:
                failures[f] = failures.get(f, 0) + 1
        for found in re.findall(ip_address, log):
            ip[found] = ip.get(found, 0) + 1
    if len(failures) > 0:
        print((e, failures))

print("TOTAL FAILURES PER IP ADDRESS")
print(ip)
print("TOTAL RECORDS")
print(total)
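One caveat: entries is a plain dict, so the hourly lines come out in dict order rather than time order (notice hour=17 printing before hour=16 in the sample output above). If you want the buckets in order, a small tweak sorts them first; this sketch assumes the "Jan 11 hour=14" key format built above, which is good enough within a single incident window:
# Tweak: print the hourly buckets in (date, hour) order instead of dict order.
# Assumes the "Jan 11 hour=14" key format built by the script above.
def bucket_key(key):
    date_part, hour_part = key.rsplit(" hour=", 1)
    return (date_part, int(hour_part))  # sort by date string, then numeric hour

for e in sorted(entries, key=bucket_key):
    print((e, len(entries[e])))  # bucket name and how many log lines it holds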