SQL Server Database Mail failures: sysmail_faileditems.description is truncated, how can I see the full message?

I’m configuring SQL Server Database Mail – always a fun task – and am getting an error message from the mail server. I can see it in the Database Mail Log in Log File Viewer and by querying msdb.dbo.sysmail_faileditems. However, the server response is clearly longer than the msdb.dbo.sysmail_faileditems.description column allows. Is there any way to see the full server response?

Here’s an example description (I’ve edited the IP address but it shows my server’s external IP address):

The mail could not be sent to the recipients because of the mail server failure. (Sending Mail using Account 3 (2021-10-13T11:40:56). Exception Message: Cannot send mails to mail server. (Mailbox unavailable. The server response was: 5.7.1 Invalid credentials for relay []. The IP address you’ve). )

In this case I’m using Gmail’s SMTP relay with allowed IP addresses. I’d like to know what the rest of the message is. I’ve configured a different server to successfully use the same account and SMTP relay server, so it’s probably an IP config problem.

Are oozes immune to critical failures?

In a previous session, one of my players used Sudden Bolt against a Living Sap. Living Saps have immunity to critical hits. The Living Sap rolled low on its Reflex save and got 11 less than the spellcaster’s DC. I ruled critical failure and gave Sudden Bolt double damage. One of my players had assumed that its immunity to critical hits would also make it immune to critical failures (they didn’t fight too hard since it was a fairly beneficial ruling 🙂). Was I right to rule that way? (I’ve included my reasoning as an answer on the off chance that I’m right.)

Cortex rules: critical failures

The Cortex rules say (and I’m paraphrasing from memory) that critical failures happen when all the dice are 1’s.

Is this all the dice, or just the stat and skill? If I have an asset that adds a die, does that have to be a 1 for a critical failure? What if I throw in a plot point? I’d say a strict interpretation of the rules is that it is all the dice, but that seems like it’s going to really decrease the critical failure rate.
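To put a rough number on how much an extra die lowers the rate, here is a minimal sketch. It assumes fair, independent dice, and the die sizes are hypothetical examples of mine rather than anything taken from the rulebook.

```python
from fractions import Fraction


def all_ones_probability(die_sizes):
    """Probability that every die in the pool shows a 1 (fair, independent dice)."""
    prob = Fraction(1)
    for sides in die_sizes:
        prob *= Fraction(1, sides)
    return prob


# Hypothetical pools: these die sizes are illustrative only.
stat_and_skill = [8, 6]   # d8 stat + d6 skill
with_asset = [8, 6, 6]    # same pool plus a d6 asset

print(all_ones_probability(stat_and_skill))  # 1/48
print(all_ones_probability(with_asset))      # 1/288
```

Under the strict reading, each added die with n sides divides the critical failure chance by n, so even one small extra die makes it several times rarer.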

Do all RPGs have critical failures, and which was the first?

In all the games I remember playing, critical failures have always, in every case, triggered an event. This critical failure is due to exceptionally bad rolls on dice, either a ‘natural one’ in DnD or a ‘Dramatic Failure’ in nWoD. This is committing a failure so catastrophic it makes something bad happen (some special enemy appearing, or some item breaking), sometimes so bad it’s borderline nonsensical.

Does this necessarily have to be like this in every system or game? When we are playing, everyone assumes something is going to happen if they roll a 1 (or the equivalent in other systems), but I have been thinking about it and it doesn’t make much sense. While I agree it’s the worst possible roll and therefore it indicates the least successful or desirable outcome, I’ve never agreed on critical failures necessarily having to generate a special event.

Opinions aside, I’ve only been playing for a few years and I want to know: have roleplaying games always been like this? When did this critical failure trend start? I have casually asked some of the people I play with but no one has given it any thought and I’m really curious about it.

What is the difference between masking and tolerating failures?

Distributed Systems 5ed by Coulouris says on p21-22

1.5.5 Failure handling

Detecting failures: Some failures can be detected. For example, checksums can be used to detect corrupted data in a message or a file. Chapter 2 explains that it is difficult or even impossible to detect some other failures, such as a remote crashed server in the Internet. The challenge is to manage in the presence of failures that cannot be detected but may be suspected.

Masking failures: Some failures that have been detected can be hidden or made less severe. Two examples of hiding failures:

  1. Messages can be retransmitted when they fail to arrive.

  2. File data can be written to a pair of disks so that if one is corrupted, the other may still be correct.

Just dropping a message that is corrupted is an example of making a fault less severe – it could be retransmitted. The reader will probably realize that the techniques described for hiding failures are not guaranteed to work in the worst cases; for example, the data on the second disk may be corrupted too, or the message may not get through in a reasonable time however often it is retransmitted.

Tolerating failures: Most of the services in the Internet do exhibit failures – it would not be practical for them to attempt to detect and hide all of the failures that might occur in such a large network with so many components. Their clients can be designed to tolerate failures, which generally involves the users tolerating them as well. For example, when a web browser cannot contact a web server, it does not make the user wait for ever while it keeps on trying – it informs the user about the problem, leaving them free to try again later. Services that tolerate failures are discussed in the paragraph on redundancy below.

Recovery from failures: Recovery involves the design of software so that the state of permanent data can be recovered or ‘rolled back’ after a server has crashed. In general, the computations performed by some programs will be incomplete when a fault occurs, and the permanent data that they update (files and other material stored in permanent storage) may not be in a consistent state. Recovery is described in Chapter 17.

Redundancy: Services can be made to tolerate failures by the use of redundant components. Consider the following examples:

  1. There should always be at least two different routes between any two routers in the Internet.

  2. In the Domain Name System, every name table is replicated in at least two different servers.

  3. A database may be replicated in several servers to ensure that the data remains accessible after the failure of any single server; the servers can be designed to detect faults in their peers; when a fault is detected in one server, clients are redirected to the remaining servers.
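To make my reading of the quoted "Masking failures" and "Tolerating failures" paragraphs concrete, here is a minimal sketch. Every name in it (`Unreachable`, `send_with_masking`, `fetch_with_tolerance`, `make_flaky_send`) is hypothetical and mine, not from the book; it only mirrors the retransmission example and the web browser example.

```python
class Unreachable(Exception):
    """Raised by the hypothetical transport when a message does not arrive."""


def send_with_masking(send, message, max_attempts=3):
    """Mask lost messages by retransmitting.

    If a later attempt succeeds, the caller never learns that earlier ones failed.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return send(message)
        except Unreachable as err:
            last_error = err
    # Masking is not guaranteed to work in the worst case.
    raise last_error


def fetch_with_tolerance(send, message):
    """Tolerate the failure: try once, then report it instead of retrying forever."""
    try:
        return ("ok", send(message))
    except Unreachable:
        return ("unavailable", None)


def make_flaky_send(failures_before_success):
    """Build a fake transport that fails a fixed number of times, then succeeds."""
    state = {"remaining": failures_before_success}

    def send(message):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise Unreachable("no reply")
        return f"ack:{message}"

    return send


print(send_with_masking(make_flaky_send(2), "hello"))     # ack:hello
print(fetch_with_tolerance(make_flaky_send(1), "hello"))  # ('unavailable', None)
```

In this sketch the masking caller sees only the final success, while the tolerating caller is told about the failure and decides what to do next.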

What is the difference between masking and tolerating failures?

Can they both be done by redundancy?

Do they both need to perform recovery from failures?


Do arbitrary/Byzantine failures include omission failures and timing failures?

Distributed Systems 5ed by Coulouris says on p68

2.4.2 Failure Models

Omission Failures

Arbitrary Failures The term arbitrary or Byzantine failure is used to describe the worst possible failure semantics, in which any type of error may occur. For example, a process may set wrong values in its data items, or it may return a wrong value in response to an invocation.

Timing Failures

Are arbitrary/Byzantine failures arbitrary? (It sounds like yes to me.)

Do arbitrary/Byzantine failures include omission failures and timing failures? (I guess not. Otherwise, why does it describe omission failures and timing failures separately?)


How should failures by a single user on a simulated phishing email be measured?

I work in the IT Security function of my company as a team lead. We periodically send out simulated phishing emails to all users on the company network as a form of continuous education on how to spot malicious phishing emails. Our company operates in the regulated financial industry and has a diverse user base with various levels of technical ability, from IT to customer service roles. We work frequently with sensitive customer data and personally identifiable information (PII).

My team reports metrics on user performance on these simulated emails. Sometimes end users take multiple ill-advised actions on a single simulated phishing email we sent, such as both clicking a link and opening an attachment in the email.

My thinking is that, given each bad action potentially represents a different attack vector that can be exploited by a threat agent, each bad action should be counted as a separate failure. After all, clicking on a malicious link in a true phishing email can result in compromise just as easily as opening an infected attachment in such an email. The fact that a single user can take multiple bad actions on a single, albeit fake, phishing email seems to highlight how such end users are not really conscious of their actions or skeptical enough, which only emphasizes the value of this reporting methodology in my opinion.


To most accurately measure end user behavior and where weaknesses may be, should multiple bad actions on a single email be counted as one failure, or should each action be counted as a failure on its own?
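To show how the two counting methods diverge, here is a minimal sketch. The event log, user names, email ID, and action labels are all made up for illustration and do not reflect our real reporting data.

```python
# Hypothetical event log: (user, simulated_email_id, bad_action). All values are made up.
events = [
    ("alice", "sim-01", "clicked_link"),
    ("alice", "sim-01", "opened_attachment"),
    ("bob", "sim-01", "clicked_link"),
]


def failures_per_action(events):
    """Count every bad action as its own failure."""
    return len(events)


def failures_per_email(events):
    """Count at most one failure per (user, email) pair, however many actions were taken."""
    return len({(user, email) for user, email, _action in events})


print(failures_per_action(events))  # 3
print(failures_per_email(events))   # 2
```

Keeping the raw per-action events means either metric can be derived later, so the choice of headline number does not have to limit the analysis.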

SharePoint 2016 Request Manager SPING Failures over TLS

This issue involves SharePoint 2016 web application request failure (no access) due to failure of the SharePoint 2016 Request Management Service’s SPING over TLS.

The farm is currently operational without the Request Management service running. Since this is a small high availability farm (4 nodes = 2 WFE-DistCache + 2 App-Search), we seem to be operating fine without Request Manager, however we could potentially use it in the future, and would like to know:

1) Why is the service failing over TLS, and how can we resolve the issue?

2) Secondarily, what are the implications of operating without it? For example, is there still internal load balancing of service application requests?

Here’s some background description of the environment and issue.

1) The server farm resides within a restricted DMZ. There are no port blocks within the single VLAN where the SharePoint servers reside; however, traffic between VLANs, such as to the database tier and the Internet, is highly restricted. There are also highly restrictive group policies and McAfee HIPS software.

2) The servers are all Windows Server 2016, IIS 10, SP2016. Schannel settings restrict protocols to TLS 1.1 and above, with cipher restrictions as well. All RC4 ciphers are blocked.

3) Request Manager works AOK when the site bindings are HTTP:80. Request Manager fails only when site bindings require HTTPS:443. The certificates used are all valid with valid certificate chain and trust between servers and within the farm.

4) Basis for determining Request Manager failure is:

a. Sites don’t work when Request manager is enabled with HTTPS bindings.

b. Numerous Event 8317, SharePoint Foundation errors …. General error description: ‘ServerHostName (Web App(IIS Site root))’ failed ping validation and has been unavailable since ‘Time’.

c. Also several Event 8311, SharePoint Foundation … General error description: ‘An operation failed because the following certificate has validation error’ (Certificate identity with thumbprint) ‘Errors: SSL policy errors have been encountered. Error code 0x2’.

We’d really appreciate any thoughts on this. We’re having trouble finding good documentation on Request Manager and see some blog evidence that maybe Request Manager simply does not work over TLS. It seems more likely that the restrictions in our environment somehow cause the SPINGs to fail server-to-server, but it’s not clear why, since we do not know enough about SPING or what certificate checks and cipher handshake may be taking place. Also, aside from some of the obvious features for controlling requests, we don’t understand why we should really care about Request Manager.

If I am hit after I am reduced to 0 HP during this duel in the Hoard of the Dragon Queen adventure, do I take 1 or 2 death save failures?

We’re playing Hoard of the Dragon Queen. During the brutal first episode, there is a duel that can take place (p. 12; spoilers in the link).

When the player loses the duel, the other participant…

strikes one more time [and] inflicts one death roll failure on a character.

When this happened, the player marked off two death failures on their sheet. I asked them why. Here are the relevant parts of the PHB.

The section titled “Dropping to 0 Hit Points” states:

If damage reduces you to 0 hit points and fails to kill you, you fall unconscious. This unconsciousness ends if you regain any hit points.

And later:

If you take any damage while you have 0 hit points, you suffer a death saving throw failure. If the damage is from a critical hit, you suffer two failures instead. If the damage equals or exceeds your hit point maximum, you suffer instant death.

Lastly, the Unconscious condition description says:

Any attack that hits the creature is a critical hit if the attacker is within 5 feet of the creature.

In conclusion, any melee attack made within 5 feet of me is an automatic crit. So does that mean if a bandit is being rude, and stabs me again with his puny dagger – I have a 50% chance of dying come my turn (save for help)?

Am I right in assuming that this is hand-waved in this adventure as it specifically states “one death roll failure”?

Calculate Mean Time Between Failures (SRE Metrics) in Java Stack

We are currently trying to capture and calculate Mean Time Between Failures (MTBF) for all our services running in production. We use K8s and Prometheus to monitor them.

MTBF = Total Uptime / Number of Failures

How can we get total uptime in Prometheus? Is it possible to use JVM metrics, and is it a good idea to use ps -eo? Also, how do we get the number of failures?
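To make the formula concrete, here is a minimal sketch of the arithmetic only. It assumes we can export a list of evenly spaced availability samples for a service (1 = up, 0 = down), for example from a gauge like Prometheus's `up` series, and that a "failure" means a transition from up to down. The function name and sample data are hypothetical.

```python
def mtbf_seconds(samples, step_seconds):
    """Return total uptime divided by the number of failures.

    samples: evenly spaced availability samples (1 = up, 0 = down).
    A failure is counted on each up -> down transition (an assumption).
    Returns None when no failure was observed, since MTBF is then undefined.
    """
    uptime = sum(samples) * step_seconds
    failures = sum(
        1 for prev, cur in zip(samples, samples[1:]) if prev == 1 and cur == 0
    )
    if failures == 0:
        return None
    return uptime / failures


# Made-up samples taken every 60 seconds: two outages in the window.
samples = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
print(mtbf_seconds(samples, 60))  # 210.0
```

Here seven "up" samples give 420 seconds of uptime and there are two up-to-down transitions, so the result is 210 seconds. The sample interval bounds the precision: outages shorter than one step can be missed entirely.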