Faults, alerts, and actions in DCT

Introduction

DCT 21.0.0 introduces the ability to ingest, list, and search Continuous Data Engine faults, alerts, and actions (audit logs). These objects are gathered from registered engines via telemetry updates and can be examined through either the DCT API or the UI.

Faults, alerts, and actions have the following meanings in Data Control Tower (DCT):

Fault: A fault in DCT represents a persistent issue that affects Delphix's behavior and may require manual resolution by the user. For example, a connectivity problem would be considered a fault. Delphix attempts to resolve faults automatically once the underlying issue is fixed, but in some situations user intervention is necessary to clear the fault. Faults can also be ignored, which prevents Delphix from posting the same fault for the same object in the future.
Alert: An alert is a notification generated by the engine to inform users of specific events or conditions that have occurred. Unlike faults, alerts do not necessarily indicate a persistent problem that needs immediate attention. Instead, they serve as informational messages to make users aware of certain situations or changes within the system that might need attention but do not necessarily impact the engine's overall behavior.
Action (Audit Log): Actions refer to recorded events or operations within the engine, typically logged for auditing purposes. These actions are tracked in audit logs and provide a historical record of changes, activities, or operations performed in the engine. Actions are generally used for monitoring compliance and reviewing past events.

Prerequisites

To use these new APIs, you need a running DCT instance with at least one registered Continuous Data Engine.

API Changes

This feature introduces two new APIs each for Continuous Data alerts and actions, and five new APIs for Continuous Data faults.

Alerts

GET /virtualization-alerts/history – Fetch a list of all Continuous Data alerts.
POST /virtualization-alerts/history/search – Search Continuous Data alerts.

Example response:

CODE

{
  "items": [
    {
      "id": "1-ALERT-1",
      "engine_id": "1",
      "alert_timestamp": "2024-08-06T07:09:45.74Z",
      "event": "alert.system.shutdown.management.initialconfig",
      "event_severity": "INFORMATIONAL",
      "event_title": "Initial configuration",
      "event_response": "Initial configuration",
      "event_description": "The management service is going down after initial configuration.",
      "target_name": "system"
    }
  ],
  "response_metadata": {
    "total": 1
  }
}
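As a rough illustration of calling the list endpoint, the sketch below builds a GET request for the alert history and parses a response shaped like the example above. The host name, the `/v3` path prefix, and the `apk` API-key Authorization scheme are assumptions that do not come from this page, so adjust them to match your DCT deployment.

```python
import json
import urllib.request

# Assumptions (not from this page): the base URL, the "/v3" prefix,
# and the "apk <key>" Authorization scheme are placeholders.
BASE_URL = "https://dct.example.com/v3"
API_KEY = "apk 1.REPLACE_ME"


def build_list_alerts_request(base_url=BASE_URL, api_key=API_KEY):
    """Build (but do not send) a GET request for the alert history."""
    return urllib.request.Request(
        url=f"{base_url}/virtualization-alerts/history",
        method="GET",
        headers={"Authorization": api_key, "Accept": "application/json"},
    )


def parse_alerts(body):
    """Return (alerts, total) from a JSON response body."""
    payload = json.loads(body)
    return payload["items"], payload["response_metadata"]["total"]


def fetch_alerts():
    """Send the request; this needs a reachable DCT instance."""
    with urllib.request.urlopen(build_list_alerts_request()) as resp:
        return parse_alerts(resp.read())
```

Separating request construction from sending keeps the URL and header logic easy to check without a live DCT instance.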

Actions

GET /virtualization-actions/history – Fetch a list of all Continuous Data actions.
POST /virtualization-actions/history/search – Search Continuous Data actions.

Example response:

CODE

{
  "items": [
    {
      "id": "1-ACTION-10210",
      "engine_id": "2",
      "action_type": "USER_LOGIN",
      "title": "USER_LOGIN",
      "details": "Log in as user \"admin\" from IP \"127.0.0.1\".",
      "start_time": "2024-08-16T19:56:59.463Z",
      "end_time": "2024-08-16T19:56:59.463Z",
      "user": "USER-2",
      "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
      "origin_ip": "172.16.124.16",
      "state": "COMPLETED",
      "work_source": "WEBSERVICE",
      "work_source_name": "admin",
      "work_source_principal": "admin"
    }
  ],
  "response_metadata": {
    "total": 1
  }
}
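Once an action list has been retrieved, it can be summarized on the client side. The following is a minimal sketch that relies only on the `action_type` and `state` fields shown in the example response; the helper names are hypothetical, and `COMPLETED` is the only state value this page confirms.

```python
from collections import Counter


def count_actions_by_type(response):
    """Count actions per action_type in a decoded list/search response."""
    return Counter(item["action_type"] for item in response.get("items", []))


def actions_not_completed(response):
    """Return actions whose state is anything other than COMPLETED."""
    return [
        item for item in response.get("items", [])
        if item.get("state") != "COMPLETED"
    ]


# Usage with a response trimmed down from the example above.
sample = {
    "items": [
        {"id": "1-ACTION-10210", "action_type": "USER_LOGIN", "state": "COMPLETED"},
    ],
    "response_metadata": {"total": 1},
}
print(count_actions_by_type(sample))
print(actions_not_completed(sample))
```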

Faults

GET /virtualization-faults/history – Fetch a list of all Continuous Data faults.
POST /virtualization-faults/history/search – Search Continuous Data faults.
POST /virtualization-faults/resolveOrIgnore – Mark a list of faults as resolved or ignored.
POST /virtualization-faults/{engineId}/resolveAll – Resolve all active faults on an engine that the user has permission to operate on.
POST /virtualization-faults/{faultId}/resolve – Resolve or ignore an individual fault.

Example response:

CODE

{
  "items": [
    {
      "id": "1-FAULT-1",
      "engine_id": "1",
      "bundle_id": "fault.email.smtprequired",
      "target_name": "system",
      "title": "Password reset requires SMTP to be enabled",
      "description": "Password reset emails cannot be sent if SMTP is disabled or not configured.",
      "fault_action": "Configure and enable SMTP, or disable the password reset function.",
      "severity": "WARNING",
      "status": "RESOLVED",
      "date_diagnosed": "2024-08-06T07:10:45.836Z",
      "date_resolved": "2024-08-28T04:39:40.938Z"
    }
  ],
  "response_metadata": {
    "total": 1
  }
}
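A common follow-up to listing faults is collecting the IDs of those that still need attention. The sketch below groups fault IDs by engine using the `status`, `engine_id`, and `id` fields from the example response. The status string `ACTIVE` is an assumption (the example only shows `RESOLVED`), and the function name is hypothetical.

```python
from collections import defaultdict


def fault_ids_by_engine(response, status="ACTIVE"):
    """Group the IDs of faults with the given status by engine_id.

    The default status value "ACTIVE" is an assumption; the example
    response on this page only shows "RESOLVED".
    """
    grouped = defaultdict(list)
    for fault in response.get("items", []):
        if fault.get("status") == status:
            grouped[fault["engine_id"]].append(fault["id"])
    return dict(grouped)


# Usage with a response trimmed down from the example above.
sample = {
    "items": [
        {"id": "1-FAULT-1", "engine_id": "1", "status": "RESOLVED"},
    ],
    "response_metadata": {"total": 1},
}
print(fault_ids_by_engine(sample, status="RESOLVED"))
```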

GUI changes

Faults, Events (Alerts), and Audit (Actions) tabs are available in the UI on the details page of a Continuous Data engine. By default, the faults table shows only Active faults. You can show faults with other statuses by configuring the column filter for the Status column.

Implementation

These APIs were implemented to mimic the behavior of the corresponding Delphix engine APIs. In the case of actions and alerts, this is a straightforward list, search, and filter functionality. For faults, it is worth understanding the difference between the three APIs specific to resolving or ignoring a given fault:

ResolveOrIgnore – This endpoint accepts a list of fault IDs and a boolean indicating whether to ignore them. If false, all faults in the list will be marked as resolved in DCT and on the Continuous Data Engine. If true, those faults will instead be marked as ignored.
ResolveAll – This endpoint accepts an engine ID and resolves every active fault on that engine (both in DCT and on the engine itself). Permission checks are performed to confirm the active user can operate on the engine before the resolve operation is permitted.
Resolve – This endpoint accepts a single fault ID and a boolean indicating whether to ignore that fault. If false, the fault will be marked resolved in DCT and on the engine. If true, it will instead be marked as ignored.
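The difference between the three operations can be pictured with a small in-memory model. This is only a local sketch of the semantics described above, not DCT code and not the request format: the function names, the `user_can_operate` flag, and the `ACTIVE` and `IGNORED` status strings are all assumptions made for illustration.

```python
def resolve_or_ignore(faults, fault_ids, ignore):
    """Model of ResolveOrIgnore: update only the listed faults."""
    new_status = "IGNORED" if ignore else "RESOLVED"  # "IGNORED" is assumed
    wanted = set(fault_ids)
    return [
        dict(f, status=new_status) if f["id"] in wanted else f
        for f in faults
    ]


def resolve(faults, fault_id, ignore):
    """Model of Resolve: the single-fault form of the same operation."""
    return resolve_or_ignore(faults, [fault_id], ignore)


def resolve_all(faults, engine_id, user_can_operate):
    """Model of ResolveAll: resolve every active fault on one engine."""
    if not user_can_operate:
        # Stands in for the permission check described above.
        raise PermissionError(f"no permission on engine {engine_id}")
    return [
        dict(f, status="RESOLVED")
        if f["engine_id"] == engine_id and f["status"] == "ACTIVE"
        else f
        for f in faults
    ]
```

Note that only ResolveOrIgnore and Resolve take the ignore flag; in this model, ResolveAll always resolves and never ignores, matching the description above.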

Feature limitations

DCT will only persist the most recent ten thousand (10,000) alerts/faults per registered engine and the most recent one hundred thousand (100,000) actions.
- More specifically, DCT removes the oldest alerts and faults from the database once they exceed 10,000 for a given Delphix engine, and the oldest actions once they exceed 100,000. This was an intentional design choice for scalability but may be subject to change in the future.
Logins to a Continuous Data Engine by the DCT agent (e.g. to fetch a telemetry update) are not present in the Actions drop-down list.
If a registered engine has many pre-existing faults, alerts, or actions, it may take some time for them to fully hydrate in the DCT database (e.g. if multiples of 100,000 actions are present).
- With many actions (hundreds of thousands or millions), full hydration of the DCT database will likely take several hours, though once completed, subsequent updates will happen every 30 seconds.
Some targets of faults/alerts/actions are linkable in the GUI (e.g. clicking through to view the details page for those objects), but linking requires an explicit object type mapping in the backend code, and that mapping has not yet been implemented for every object type. This means that only the following object types are linkable at present:
- dSources
- VDBs
- CDBs
- vCDBs

Diagnostic data

Any relevant logging specific to the new APIs will be in the virtualization-app container.