
Stabilize structured logging #3359

Closed
jtcohen6 opened this issue May 16, 2021 · 16 comments · Fixed by #4055
Labels
1.0.0 (Issues related to the 1.0.0 release of dbt) · enhancement (New feature or request) · logging

Comments

@jtcohen6
Contributor

dbt can output structured JSON logs via:

  • dbt --log-output json (CLI)
  • logs parameter (RPC)

These logs include a lot of information similar to dbt's JSON artifacts, with the crucial differences being that logs are written/returned in real time, whereas JSON artifacts are only written at the end of invocations.

As of v0.19, JSON artifacts are contracted and documented, but JSON logs are not contracted and barely documented. We should change this ahead of v1.0. There are other technologies and community members who want to build on top of stable logs (e.g. dbt Cloud IDE, #2915).

As I see it, the scope of this issue is to:

  • Document structured (JSON) logs as they work today
  • Communicate which components are stable (statuses, log-levels, node-level information) and which components are liable to change in future versions (exact language of info-level messages)
  • Identify discrepancies between the information available in structure logs and the information included in JSON artifacts, and (if very slight) bring them into consistency
jtcohen6 added the enhancement (New feature or request) and 1.0.0 (Issues related to the 1.0.0 release of dbt) labels on May 16, 2021
@kwigley
Contributor

kwigley commented Jun 4, 2021

today

we don't really have structured logging today

  • stdout logging -> just log message
  • debug logging -> time, thread, message
  • json logging -> time, channel, level, levelname, thread, process, extra
    • extra is just arbitrary data at the moment, added by using a context manager (see the sketch after this list)
      • node index
      • node count
      • unique id
      • is status message
      • timing info
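
A minimal sketch of that context-manager pattern, assuming the legacy logbook-based logger (the payload keys here are only illustrative):

import logbook

logger = logbook.Logger("dbt")

def inject_node_info(record):
    # hypothetical payload -- dbt attaches similar keys (unique id, node index/count, timing)
    record.extra.update({
        "unique_id": "model.my_project.my_model",
        "node_index": 1,
        "node_count": 10,
    })

# while the processor is bound, every record emitted picks up the extra fields
with logbook.Processor(inject_node_info).applicationbound():
    logger.info("Running model")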

future

questions

  • do we want to recommend using structured logs for anything other than observability?
  • when is this issue considered complete?

@leahwicz
Contributor

leahwicz commented Jun 4, 2021

  • Logbook isn't great to keep using (it seems abandoned)

  • We need to decide what we want to do for events (pub/sub) because people use our logs as a way to observe events, which is not what logs should be used for (not needed for v1)

Open questions:

  • What are the big pain points in the logs for Cloud/consumers?
  • Need to see where we are doing logging

@jtcohen6
Contributor Author

  • Stabilizing, rationalizing, documenting the content of structured logs: this issue!
  • How we push the contents of those logs for real observability/integration: pubsub issue for the future

@jtcohen6
Contributor Author

To make this even more concrete, I see a bug like #3603 as sitting squarely within the scope of this issue.

@drewbanin
Contributor

This is a great thread! structlog looks fantastic, and it would be a really good idea to move away from logbook -- do we think we can do that pre-1.0?

Here's my braindump on structured logging:

Format of json log lines
Right now, all "structured" info in our json logs gets shoved into an extra key in the log line. Here's an example:

{"timestamp": "2021-10-05T13:17:28.018538Z", "message": "Finished running node model.my_new_project.my_model", "channel": "dbt", "level": 10, "levelname": "DEBUG", "thread_name": "Thread-2", "process": 19949, "extra": {"node_status": "passed", "node_finished_at": "2021-10-05T13:17:28.018543", "unique_id": "model.my_new_project.my_model", "run_state": "running"}}

This log line contains a fragment of information related to the model.my_new_project.my_model model. The big idea with our current logging is that if you capture all of the log lines emitted for a model then you can "know" its state at any point in time. Further, you can collect all of the logs for a model together by selecting all of the log lines with a matching log.extra.unique_id value.
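
For illustration, a consumer of today's format might group lines by that key roughly like this (the log file path is hypothetical; assume one JSON record per line):

import json
from collections import defaultdict

lines_by_node = defaultdict(list)

with open("logs/dbt.log") as f:  # hypothetical path to captured JSON log output
    for line in f:
        record = json.loads(line)
        unique_id = record.get("extra", {}).get("unique_id")
        if unique_id:
            lines_by_node[unique_id].append(record)

# everything dbt logged about one model, in order of emission
for record in lines_by_node["model.my_new_project.my_model"]:
    print(record["timestamp"], record["message"])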

I think that a better approach to structured logging would have contracted fields in each log line. These fields would represent the state of a specific entity. This would be simpler for clients to consume than a kitchen-sink style "extra" field.

Log/event names

The "message" in each log line is not semantically useful. The example above has a message like Finished running node model.my_new_project.my_model, which isn't really suitable for a client to depend on. Instead, each structured log line should have a contracted name/message/type that clients can use to process certain logs/events. Today, you'd either need to:

  1. Do something pretty jank to inspect and respond to the message field of the log line OR
  2. Depend solely on the existence of fields in the log line (eg. if there is node_finished_at in the extra property of the log, then a node has finished running).

Instead, it would be a lot better if we decoupled these "structured" log lines from typical debug/info-level logs that are intended for humans to read. Each structured log could have a semantically useful name (or message) like node_status_update that communicated the type of information in its payload.
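
For example (the shape here is entirely hypothetical), a named event line could look like:

{"timestamp": "2021-10-05T13:17:28.018538Z", "name": "node_status_update", "level": "info", "data": {"unique_id": "model.my_new_project.my_model", "node_status": "passed"}}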

Node information completeness

Today, structured logs contain partial updates about a given node. In the example above, the log line shows when the node finished, but the information about when that node started is in a different log message. Instead, I think it would be better if log lines contained all of the information that we had about a node at the point in time that the log occurs. An example could look like:

{[other fields here]...., "node_info": {"node_path": "models/slow.sql", "node_name": "slow", "resource_type": "model", "node_materialized": "table", "node_started_at": "2021-10-05T13:17:28.018613", "unique_id": "model.my_new_project.slow", "run_state": "running"}}
{[other fields here]...., "node_info": {"node_path": "models/slow.sql", "node_name": "slow", "resource_type": "model", "node_materialized": "table", "node_started_at": "2021-10-05T13:17:28.018613", "unique_id": "model.my_new_project.slow", "node_finished_at": "2021-10-05T13:17:29.134126", "node_status": "passed", "run_state": "running"}}

This would make it possible for a consumer of the structured logs to know the state of a node at any point in time without needing to consume the entirety of a potentially very long log file.

Note: I am kind of on the fence about this one. I think it's probably a bad idea to assume that a client could "know" everything about a dbt run without consuming all of the structured logs for that run. Instead, I think that more complete structured logging would be good ergonomics for clients -- it would reduce the amount of "state building" that would need to happen on the client, and ultimately would give us more flexibility around how & when we log node information inside of dbt Core.
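
A sketch of how a consumer might exploit that completeness, assuming newline-delimited records shaped like the examples above (the path and field names follow the example, not a confirmed contract):

import json

node_state = {}

with open("logs/dbt.log") as f:  # hypothetical path
    for line in f:
        node_info = json.loads(line).get("node_info")
        if node_info:
            # each line carries the full picture so far, so the latest record
            # seen for a unique_id is the most complete snapshot of that node
            node_state[node_info["unique_id"]] = node_info

print(node_state["model.my_new_project.slow"].get("node_status"))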

Completeness

There's a lot happening inside of dbt that is not emitted in structured logs today. There are some potentially slow actions like building the adapter cache that are not logged in such a way that clients can render progress information to a user. We should consider new types of structured logging around project parsing/run startup/running hooks (both on-run-start/end, and pre/post-hooks), etc, that communicate the exact status of dbt Core as it operates on projects.

Materialization-based logging

It would be really good if logs could be emitted inside of materializations. This could help us communicate what's happening when a given model is running. Quick example of state transitions:

  1. Running {model} with 4 steps
  2. Running pre-hook 1 of 1 (1/4, 25% complete)
  3. Creating a temp table {temp_table} (2/4, 50% complete)
  4. Merging from {temp table} into {destination table} (3/4, 75% complete)
  5. Running post-hook 1/1 (4/4, 100% complete)

This would totally be a stretch goal, but I think the idea of materializations being less imperative and more "planned out" is probably well-aligned with our goals for model building into the future :)

That's it! Those are all of the thoughts in my head around structured logging. Lmk if there's anything else I can shine a light on!

leahwicz assigned nathaniel-may and iknox-fa and unassigned kwigley on Oct 6, 2021
@nathaniel-may
Contributor

nathaniel-may commented Oct 7, 2021

I spoke with @iknox-fa about a plan of action to implement this, and the way we understand it, structured logging in dbt will be a formal contract between dbt and several downstream stakeholders who will rely on it programmatically. Because it is a user-facing contract, we want to code for the correctness concerns it raises and make sure that all the use cases for it are actually good candidates to be solved by logging.

In general we believe we can reasonably implement a future-ready structured logging solution in dbt that satisfies our needs today with the ability to move some use-cases to a more scalable approach when needed.

Requirements

  • There must be enough information to trace a run as closely as possible (à la @drewbanin's "node information completeness").
  • Logs must be output in an easy-to-parse format (likely JSON) with semantically useful names, messages, or codes so that client code can efficiently and correctly consume log events.
  • The ability to quickly make changes to the log structure without breaking downstream consumers.
  • Log obfuscation (for secrets/etc) should be easy to implement and update.

Questions

  • What are all the known use cases for structured logging today? Which ones should we also plan for?
  • Is "enough information to trace a run" just what was outlined in "node completeness" or is there more that we need to include? Is there a better way to describe this requirement?
  • There has been some talk about pub/sub with structured logging. Is this a known future requirement?
  • What log messages are not being logged today that must be logged in the first version of this feature? What about in future versions?

Concerns

  • On-disk IO contention in a single-threaded Python app, combined with a significant increase in the number of log lines per second, could significantly slow down dbt.
  • IO contention for consumers competing for the same on-disk resource could significantly increase the latency of log reads.
  • Being able to identify all consumers' use cases and meet them quickly.

Because of the scalability concerns with logging, specifically regarding IO and concurrent file access, we want to distinguish use cases that are well suited for logging, where the intent is to store information and maybe read it later (such as support), from use cases that need to read information as it is written for immediate use, such as displaying progress bars. These immediate-use cases are better suited to a server-based API and will likely need to be implemented as such to scale appropriately.

Implementation Proposal

Structured logging is a user-facing contract, and as such it should be versioned. Parsers can use this version to determine whether they understand the format being written. It is on dbt developers to ensure that the version changes whenever the structured logging format changes.

structlog looks like a grand improvement over our current logger and is flexible enough for us to reuse a lot of the logic even if we change the destination of some of these messages in the future. However, calling structlog directly at the current call sites creates a distributed definition of our user-facing contract. For that reason we are proposing centralizing the structured logging interface in its own file or module that leverages mypy to enforce conformity across call sites. This will be done by creating a base class for a LogEvent that can be inherited from, in the same file, for each individual type of log event dbt could produce. Each of these classes would contain only the information necessary to produce the full message, such that message generation can be centralized in the logging module. Log leveling can be done on the LogEvent subclasses so that modifying log levels for certain messages also becomes centralized. These classes can then be passed as a dictionary to structlog to handle the output.

This approach has the advantage that mypy will be able to do the vast majority of the correctness work for us without relying on sufficient test coverage. Test coverage will still be required to make sure that the final json output matches what is expected for each event type, but that is easy to write unit tests for. A drawback is that this approach locks down call sites so that it might be less obvious how to trigger a log event as a developer.
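
A minimal sketch of the centralized-event idea, assuming dataclass-based event types handed to structlog (the class names, fields, and fire_event helper are hypothetical, not an existing dbt API):

from dataclasses import dataclass, asdict

import structlog

structlog.configure(processors=[structlog.processors.JSONRenderer()])
log = structlog.get_logger()

@dataclass
class LogEvent:
    # base class: every user-facing event lives in this one module,
    # so the contract is defined (and versioned) in a single place
    level: str = "info"

    def message(self) -> str:
        raise NotImplementedError

@dataclass
class NodeFinished(LogEvent):
    unique_id: str = ""
    node_status: str = ""

    def message(self) -> str:
        return f"Finished running node {self.unique_id}"

def fire_event(event: LogEvent) -> None:
    # centralized emission: the human-readable message and the structured
    # payload are both derived from the event instance
    getattr(log, event.level)(event.message(), **asdict(event))

fire_event(NodeFinished(unique_id="model.my_new_project.my_model", node_status="passed"))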

Possibilities for the Future

  • Log ordering via vector clocks to protect against threads and clock skew
  • Moving some structured logs to a server-based api
  • Having logs written via storage adapters, enabling writes to different destinations

@drewbanin
Contributor

@nathaniel-may love the thinking here!

What are all the known use cases for structured logging today? Which ones should we also plan for?

I think current use-cases fall into the following buckets:

  • Humans looking at logs (eg. in DataDog) to understand what dbt did in an invocation
    • Example: Alice got an email from the CEO because the revenue dashboard was out of date. Alice wants to know the last time the fct_orders model was run so she does a faceted search on model_name = fct_orders in DataDog
    • Note: piping structured logs to DataDog is left as an exercise to the reader and is not in scope for dbt Core!
  • Programmatic log processing
    • Example: A dbt client streams logs from a running dbt Server. It uses these logs to report on the state of a run (eg. which models are running, completed, failed, skipped) as the run progresses
    • Example: Alice adds an alert in DataDog for any log line where status = error and model_name = fct_orders. Alice now gets paged when the fct_orders model fails to build

Tell me if I'm missing the mark in terms of "use cases" -- happy to follow up with different types of info if that would be more helpful!

Is "enough information to trace a run" just what was outlined in "node completeness" or is there more that we need to include? Is there a better way to describe this requirement?

If I were to break it down further, I'd say:

  • Every state transition for every node or operation should be logged.
    • State transitions include:
      • queued --> started
      • queued --> skipped
      • started --> success
      • started --> failed
    • nodes include:
      • models
      • seeds
      • sources (when snapshotting freshness)
      • snapshots
      • tests
    • operations include:
      • pre- and post- hooks for nodes (where applicable)
      • on-run-start and on-run-end hooks
      • macros invoked via run-operations

There has been some talk about pub/sub with structured logging. Is this a known future requirement?

This is not in scope for dbt Core, but may be implemented by processes that wrap dbt Core in the future. Ideally, one of these processes could tail Core structured logs and disseminate these logs via pub/sub if desired

What log messages are not being logged today that must be logged in the first version of this feature? What about in future versions?

  • We're missing structured logs for on-run-start and on-run-end hooks
  • We're missing structured logs for model pre- and post- hooks
  • We're missing structured logs for run-operations

Additionally, structured log output is inconsistent. It is challenging for consumers of structured logs to disambiguate end states for node executions.

Examples:

  • models that error have "status = error"
  • tests that fail because of syntax/database errors have "status = error"
  • tests that fail because assumptions do not hold have "status = fail"

Ideally, there would be consistent flags/enums for success/skip/error conditions. This would simplify logic for consumers of these structured logs.
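
One way those consolidated end states could look (values hypothetical; today's behavior is the split described above):

from enum import Enum

class NodeStatus(str, Enum):
    # hypothetical consolidated end states for any node execution
    SUCCESS = "success"
    ERROR = "error"      # syntax/database errors, whether from a model or a test
    FAIL = "fail"        # a test ran but its assertion did not hold
    SKIPPED = "skipped"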


Think I hit on everything, but very happy to expand on any and all of this! Let me know what you think :)

@nathaniel-may
Contributor

Thanks for the feedback @drewbanin! This helps a lot.

Regarding the "programmatic log processing" use case, it seems like the requirements are for core to support a dbt client in reporting the state of a run to a user in real time. I don't think writing these events to a log file then reading them again is actually the right way to achieve this goal, so I want to make sure that even if our initial solution is to stream logs, we have built the necessary plumbing to do it the right way when we need it. The type modeling for events on top of the logger in this proposal would allow us to send events to any number of destinations (potentially an event history in memory for clients to stream via a server layer) independent of the configured log level. If you think all of these events will always need to be logged, that's something I'll need some clarification around, since I'll want to guard against IO contention as best I can.

@drewbanin
Contributor

I don't think writing these events to a log file then reading them again is actually the right way to achieve this goal, so I want to make sure that even if our initial solution is to stream logs, we have built the necessary plumbing to do it the right way when we need it.

What do you think the right way to achieve this goal is?

@nathaniel-may
Contributor

Generally, when streaming status updates to a client I'd reach for a server endpoint: either polling or HTTP/2 streams. It handles concurrency better, doesn't rely on synchronous reads and writes to a file on disk, and can be flexible about the information returned to minimize network load. Some off-the-shelf frameworks like GraphQL provide some of this for you because they're meant for this exact kind of pattern.

@drewbanin
Contributor

drewbanin commented Oct 13, 2021

ok - i buy that as an overall strategy! I think my point of contention is that I don't think dbt Core should actually do anything like that -- it should instead provide an API for letting a wrapping process tap into the logs and then implement polling / http2 streams / websockets / etc.

This is actually a big challenge today that @kwigley and I have been discussing recently -- dbt Core's logger framework makes it very hard for a process that imports dbt to tap into the log stream. You can do it, but jankness abounds! My preference would be to:

  1. Support writing logs to file
  2. Support tapping into the log stream in a first-class way so that file IO can be bypassed entirely if the wrapping process imports dbt and uses it as a library

Here's an example of the kind of jankness that's necessary today (more or less):

import dbt.logger as dbt_logger
import logbook
import logging

GLOBAL_LOGGER = logging.getLogger(__name__)
GLOBAL_LOGGER.setLevel(logging.DEBUG)
stdout = logging.StreamHandler()
stdout.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
stdout.setFormatter(formatter)
GLOBAL_LOGGER.addHandler(stdout)
logger = GLOBAL_LOGGER

json_formatter = dbt_logger.JsonFormatter(
    format_string=dbt_logger.STDOUT_LOG_FORMAT
)

# This could be any kind of log handler -- file, memory, or otherwise
# (self.log_path and self.log_context here belong to the wrapping process's own class)
logs_redirect_handler = logbook.FileHandler(
    filename=self.log_path,
    level=logbook.DEBUG,
    bubble=True,
    filter=lambda x: x.channel.split(".")[0] == "dbt"
)


# If you don't do this, then Core logs to a file itself
dbt_logger.log_manager.set_path(None)

# Reach into the dbt lib and mutate the log manager's handler list!! This is bad
dbt_logger.log_manager.objects = [
    handler
    for handler in dbt_logger.log_manager.objects
    if type(handler) is not logbook.NullHandler
]

# Create a new nested logging setup
handlers = [logbook.NullHandler(), logs_redirect_handler]
self.log_context = logbook.NestedSetup(handlers)
self.log_context.push_application()

### This is where you can actually invoke dbt and capture logs
dbt.do_the_thing()

### Pop the log handlers
self.log_context.pop_application()

A better approach IMO looks like:

import dbt.logger as dbt_logger
import logbook

# create whatever kind of log handler you want
logs_redirect_handler = logbook.FileHandler(
    filename=self.log_path,
    level=logbook.DEBUG,
    bubble=True,
    filter=lambda x: x.channel.split(".")[0] == "dbt"
)

# tell dbt to exclusively write logs to our handler, not its own internal handlers
dbt_logger.override_log_handler(logs_redirect_handler)

Again - the FileHandler is just an example - this could alternatively be a StreamHandler that buffers logs in memory and could be connected up to a polling service or http2 streams :)
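
For instance, a wrapping process could buffer logs in memory (a sketch built on the hypothetical override_log_handler API above):

import io

import logbook
import dbt.logger as dbt_logger

# buffer records in memory instead of writing them to disk; a polling endpoint
# or HTTP/2 stream could then drain this buffer on demand
log_buffer = io.StringIO()
memory_handler = logbook.StreamHandler(log_buffer, level=logbook.DEBUG, bubble=True)

# hypothetical API proposed above -- not something dbt Core exposes today
dbt_logger.override_log_handler(memory_handler)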

@nathaniel-may
Contributor

Ok great! It sounds like we agree on all the big points, and I'm glad we double checked. Item 2 is what I was getting at because I also agree that core should not be doing server things, and the way you described it is exactly what I would like to do to provide this functionality.

The confusion I had was around the terminology "logging" which to me implies that logs are going to some persistent storage to be read at an unknown future time. I would call this something like "event streaming" but I don't care to get pedantic about what we should call things here.

The way I'm thinking about this, we should be able to mix and match log events and progress events when it comes to what destination(s) and level of detail we want. If you have any thoughts on what kinds of flexibility should be built in from the start, I'd like to hear them. But otherwise I think we're on the same page now. Thanks for the detailed feedback, @drewbanin!

@drewbanin
Contributor

woohoo! ok, fantastic. I think my only thoughts re: flexibility are that it will be really hard to change / remove log events in the future. We might need to think hard about the contracts we're committing to up-front. Once people start triggering alerts in DataDog, or building clients that consume these logs... we sort of lose the ability to mess with those contracts too much :)

That said, we can always add things in the future, so my preference would be to start with the existing set of semantic logs that we publish. Those are (AFAIK) the places where we subclass logbook.Processor in dbt Core today. If we want to add more things into the mix in the future, we can do that in minor releases. If we want to change or delete existing logs... that's a major version bump I think 😄 . No pressure!

@nathaniel-may
Contributor

I totally agree with this! That's why I want to version the events separately from dbt, even if it's not something we want users to interact with. For example, MongoDB versions their replication protocol, but you never really have to know that.

@mobuchowski

Hey all, this looks like a wonderful feature that we'd love to use in the OpenLineage dbt integration - especially getting live updates on node state transitions.

One question: will we get (some) standardized adapter responses the same way? We'd like to report information like row count and byte size written, which we currently get from the run_results.json file.

@jtcohen6
Contributor Author

jtcohen6 commented Nov 3, 2021

@mobuchowski Glad to hear this will be of help!

Adapter responses will continue to live on the result object for each node. Those results are included in run_results.json, and they're also used to construct several events / log messages while a node is running. I don't know if we already have a task event that includes the full result contents. I do think it would be possible, as a slight extension on top of the foundation we're working hard to build right now.
