Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial structured logging work with fire_event #4137

Merged

Conversation

nathaniel-may
Copy link
Contributor

@nathaniel-may nathaniel-may commented Oct 26, 2021

Description

This PR into the feature branch adds the first real bit of structured logging. The description of the module layout is in the README file. This is the best PR to ask for large structural changes to the general approach if there are any concerns with this way of doing things.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change

Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really appreciate the readme and annotations. I'd be curious to hear if others have structural comments or suggestions. For my part, I think I have a decent sense of how we plan to use this, and how we can extend it beyond stdout/file logging.

Am I right to think that our next steps are:

  • going through all the places in dbt where logger is called
  • adding typed events to events/types
  • replacing those logger calls with fire_event for the new typed event

At the same time, we need to replicate the functionality currently available in logger (log levels, JSON formatting, secret scrubbing, etc) with our new event-based system.

I'm especially interested in finding ways to safely split up and parallelize all this work, knowing that we've got precious time left ahead of planned release for v1.0.0-rc1

core/dbt/events/types.py Outdated Show resolved Hide resolved
core/dbt/events/functions.py Outdated Show resolved Hide resolved
# (i.e. - mutating the event history, printing to stdout, logging
# to files, etc.)
def fire_event(e: Event) -> None:
EVENT_HISTORY.append(e)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dumb question, because I don't really understand how computers work: Do we need to flush/cycle EVENT_HISTORY at some point? A dbt invocations can have hundreds of thousands of debug-level log-lines, is there a risk that this will grow to use a substantial amount of memory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe! This is just a horribly naive approach to in-memory history. I figure we'll cross that bridge when we need to. There are a few tactics we could use to make this more robust, but I'm not totally sure which one to go with yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's another reason not to keep methods (i.e. - cli_msg) on these datatypes. I'm pretty sure methods make the memory footprint bigger for objects in Python land. (Would need to double check this)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did some rudimentary testing here, and I really can't get methods to increase the memory footprint of objects. So I'm going to try swapping this around to the OO way of things and see if I can get a similar level of safety there too.

@@ -0,0 +1,61 @@

import dbt.logger as logger # type: ignore # TODO eventually remove dependency on this logger
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal is to eventually replace with structlog, when the output is CLI/file logging, right?

Copy link
Contributor Author

@nathaniel-may nathaniel-may Oct 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. This is just to show that the new structure works when you run dbt without introducing too much nonsense all at once. It should be a relatively easy swap out in a future PR.

@nathaniel-may
Copy link
Contributor Author

nathaniel-may commented Oct 27, 2021

@jtcohen6

Am I right to think that our next steps are:

  • going through all the places in dbt where logger is called
  • adding typed events to events/types
  • replacing those logger calls with fire_event for the new typed event

At the same time, we need to replicate the functionality currently available in logger (log levels, JSON formatting, secret scrubbing, etc) with our new event-based system.

That's exactly correct.

@nathaniel-may
Copy link
Contributor Author

nathaniel-may commented Oct 27, 2021

Did some exploration on whether we should take the fp or oo approach to computing messages from event types. The goal here is to know at development time whether we have log messages defined for each event type or not without the diligence to write a test every time we add a log statement. The scenarios we want to avoid are: a release where some messages are not written, or a release that raises exceptions when an obscure log line is triggered.

The FP Way

The way this PR presents a solution to this is to use Union types and type-level branching (i.e. isinstance which is both runtime and type-level) to guarantee that every branch of the union is caught. assert_never is the trick that forces mypy to fail when we miss a case when we're writing a function from SomeUnionType -> SomeOtherType. This is guaranteed to work every time, however these functions will absolutely balloon into huge switch-like statements which is a little ugly and potentially difficult to tell what you're missing because mypy isn't advanced enough to name the branch you're missing, only tell you that you have one missing at all with the cryptic message "assert_never" has incompatible type.

The OO Way

We could instead use an abstract base class (ABC) to define the methods we wish to inherit with code like this:

class CliEventABC(metaclass=ABCMeta):
    @abstractmethod
    def cli_msg(self) -> str:
        raise Exception("cli_msg not implemented for event")

@dataclass(frozen=True)
class ParsingStart():
    pass

CliEventABC.register(ParsingStart)

However, this would require we drop support for v 3.6 because custom NamedTuple classes disallow also inheriting from a metaclass. Unfortunately, I can't seem to get mypy to fail in the event we don't define cli_msg here. It also doesn't fail when the file is run either. If someone who knows OO Python a little better can show me how to do this, we could consider the OO option which would definitely make the code a little neater and simpler to maintain.

Additionally, I did some tinkering to figure out if adding methods to these classes increases their memory footprint (see this thread with @jtcohen6 about in-memory event history), but I keep coming up with local results that show that they don't. I don't know enough about Python's runtime model to know how this is possible, but I can't seem to blow out the memory footprint with methods alone.

]

# top-level event type for all events that go to the CLI
CliEvent = Union[
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do you see CliEvent and Event differing?

Copy link
Contributor Author

@nathaniel-may nathaniel-may Oct 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect there to be different destinations so Event would encompass all events, but CliEvent is only the events that go to the cli as opposed to an event stream, log file, or any other destination we want. This way we can keep the cli super crisp, while putting more details in the log file even at the same log level. That's because each event can have more than one destination, and would just have one message computed for each destination.

core/dbt/events/types.py Outdated Show resolved Hide resolved
@nathaniel-may
Copy link
Contributor Author

I just pushed the change to OO-style python. After pairing with @iknox-fa, it's very likely that in order to match the correctness guarantees of the FP-style code we will have to manually maintain a list of dummy class instantiations until mypy is run on every file. This is because mypy does not check that the concrete classes implement abstract methods unless the class itself is instantiated. If it's not, mypy considers the concrete class definition dead code and does not check it.

That being said, this code is way nicer so I think we should go with this option since we plan on turning mypy on everywhere ala #3203 and #4089

@nathaniel-may
Copy link
Contributor Author

So the tests pass even though I'm using dataclasses from python 3.7. I would expect 3.6 tests to fail, but they don't.

Because I'm planning on making many many event types, I would really like to use dataclasses so we don't have to live in boilerplate city, but if we really need to I can deal with the bloat. Thoughts? (especially @jtcohen6 re: dropping support for 3.6 for ease of development here)

@nathaniel-may
Copy link
Contributor Author

I just learned we can use dataclasses with 3.6 because dbt-core installs a backport module when a user installs dbt-core witih python3.6. Thanks, @emmyoop for pointing out @kwigley's wisdom! (sorry for the false alarm @jtcohen6)

core/dbt/events/README.md Outdated Show resolved Hide resolved
@nathaniel-may nathaniel-may mentioned this pull request Oct 28, 2021
4 tasks
@nathaniel-may nathaniel-may mentioned this pull request Oct 28, 2021
21 tasks
# types to represent log levels

# in preparation for #3977
class TestLevel():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idea for future iterations on this: Implement log level number as well since it's a defined thing in python. This would let us more easily tie in with python tools that utilize log levels (I imagine structlog has some support for it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know! Yeah I imagine we can work these in once we hook structlog up.

raise Exception("cli_msg not implemented for cli event")


class ParsingStart(InfoLevel, CliEventABC):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor detail: can we ditch the periods at the end of these returned strings? They aren't sentences.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I copied the messages exactly the way they are printed today. I completely agree, however I think I want to do user-facing message improvements in their own PR so product can give them all a go. I might be being a bit too structured for something as silly as these periods though.

# we need to skirt around that by computing something it doesn't check statically.
#
# TODO remove these lines once we run mypy everywhere.
if 1 == 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to admit, I kinda love this hack...

Copy link
Contributor

@iknox-fa iknox-fa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one minor thing that's not going to bother anyone but me. :)

@nathaniel-may nathaniel-may merged commit 1015b89 into feature/structured-logging Oct 29, 2021
@nathaniel-may nathaniel-may deleted the first-pass-structured-logging branch October 29, 2021 13:16
emmyoop pushed a commit that referenced this pull request Nov 8, 2021
add event type modeling and fire_event calls
emmyoop pushed a commit that referenced this pull request Nov 8, 2021
add event type modeling and fire_event calls
kwigley pushed a commit that referenced this pull request Nov 9, 2021
add event type modeling and fire_event calls
nathaniel-may pushed a commit that referenced this pull request Nov 9, 2021
add event type modeling and fire_event calls
iknox-fa pushed a commit that referenced this pull request Feb 8, 2022
add event type modeling and fire_event calls

automatic commit by git-black, original commits:
  f9ef9da
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants