Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace github.com/coreos/go-systemd/v22/sdjournal by journalctl #40061

Merged
merged 61 commits into from
Aug 9, 2024

Conversation

belimawr
Copy link
Contributor

@belimawr belimawr commented Jun 28, 2024

Proposed commit message

github.com/coreos/go-systemd/v22/sdjournal is removed and Filebeat now calls journalctl directly to read journald entries.

sdjournal relies on libsystemd to read journal files and the active system journal, however due to a bug (systemd/systemd#29456) in systemd, it crashes during journal rotation. Filebeat is affected by it, if the host has a libsystemd affected. During a journal rotation (usually only on high load) Filebeat will crash with a SIGBUS. There is no way to prevent or recover from this crash, it happens outside of our codebase, the SIGBUS is turned into a panic by the Go runtime and we cannot recover from it.

The bug has been fixed in Systemd v255, which is not widely used yet. So in most systems out there Filebeat might crash when reading journal logs.

Because there is no way for Filebeat to avoid the crash, we decided to replace github.com/coreos/go-systemd/v22/sdjournal by calling journalctl directly and reading it stdout.

On hosts where Filebeat crashes when reading from journald, journalctl can successfully read all journal files. OpenTelemetry collector also calls journalctl and has no issues reading the journal during rotation.

Because the reading backend has changed, some configuration options have been removed and behaviours adapted to match journalctl.

Breaking changes:
Changes that will prevent the journald input from starting:

  • include_matches.match does not accept the and and or keys any more.

Changes in the journald input behaviour:

  • backoff, max_backoff, cursor_seek_fallback have been removed
  • seek now has only 3 modes: since, head and tail.
  • If there is a cursor in the registry, it will always be used and the seek option will be ignored.

The following syscalls are added to the default seccomp policy: dup3, faccessat2, prctl and setrlimit. This affects all Beats because the policy is global and shared by all Beats.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

Even though the journald input is not GA yet, which makes breaking changes acceptable, this PR introduces breaking changes that will make certain configurations not work as expected or not to work at all.

Changes that will prevent the journald input from starting:

  • include_matches.match does not accept the and and or keys any more.

Changes in the journald input behaviour:

  • backoff, max_backoff, cursor_seek_fallback have been removed
  • seek now has only 3 modes: since, head and tail.
  • If there is a cursor in the registry, it will always be used and the seek option will be ignored.

Author's Checklist

  • Ensure an error logged right away if the journald input crashes
  • Stress test the new input
  • Manual test to ensure all related issues are actually closed by this PR
  • Ensure shutdown does not block unnecessarily

How to test this PR locally

Using the following input configuration:

filebeat.inputs:
  - type: journald
    id: PR-testing

Start Filebeat and assert the journald messages are sent to the configured output.

To manually see the journald messages and compare with what you see in Filebeat's output, you can use:

journalctl --follow -o json | jq -c --sort-keys

This will print out all fields Filebeat can read.

Identifying the missing syscalls

This PR adds some syscalls to our seccomp policy, to find the syscalls names required by the seccomp policy, I ran Auditbeat configured to collect seccomp violations and convert numbers to names, then I started Filebeat setting the default behaviour of our seccomp policy to log and allowing the default list of syscalls.

Here are the configuration files used:

filebeat.yml

filebeat.inputs:
  - type: journald
    id: PR-testing

output:
  elasticsearch:
    hosts:
      - https://localhost:9200
    username: elastic
    password: changeme
    ssl.verification_mode: none

seccomp:
  default_action: log
  syscalls:
  - action: allow
    names: 
      - accept
      - accept4
      - access
      - arch_prctl
      - bind
      - brk
      - capget
      - chmod
      - chown
      - clock_gettime
      - clock_nanosleep
      - clone
      - clone3
      - close
      - connect
      - dup
      - dup2
      - epoll_create
      - epoll_create1
      - epoll_ctl
      - epoll_pwait
      - epoll_wait
      - execve
      - exit
      - exit_group
      - fchdir
      - fchmod
      - fchmodat
      - fchown
      - fchownat
      - fcntl
      - fdatasync
      - flock
      - fstat
      - fstatfs
      - fsync
      - ftruncate
      - futex
      - getcwd
      - getdents
      - getdents64
      - geteuid
      - getgid
      - getpeername
      - getpid
      - getppid
      - getrandom
      - getrlimit
      - getrusage
      - getsockname
      - getsockopt
      - gettid
      - gettimeofday
      - getuid
      - inotify_add_watch
      - inotify_init1
      - inotify_rm_watch
      - ioctl
      - kill
      - listen
      - lseek
      - lstat
      - madvise
      - mincore
      - mkdirat
      - mmap
      - mprotect
      - munmap
      - nanosleep
      - newfstatat
      - open
      - openat
      - pipe
      - pipe2
      - poll
      - ppoll
      - pread64
      - pselect6
      - pwrite64
      - read
      - readlink
      - readlinkat
      - recvfrom
      - recvmmsg
      - recvmsg
      - rename
      - renameat
      - rseq
      - rt_sigaction
      - rt_sigprocmask
      - rt_sigreturn
      - sched_getaffinity
      - sched_yield
      - sendfile
      - sendmmsg
      - sendmsg
      - sendto
      - set_robust_list
      - setitimer
      - setsockopt
      - shutdown
      - sigaltstack
      - socket
      - splice
      - stat
      - statfs
      - sysinfo
      - tgkill
      - time
      - tkill
      - unlink
      - unlinkat
      - wait4
      - waitid
      - write
      - writev

logging:
  level: info
  to_stderr: true

auditbeat.yml

auditbeat.modules:

- module: auditd
  enabled: true
  resolve_ids: true
  include_warnings: true
  failure_mode: log
  include_raw_message: true

output:
  elasticsearch:
    hosts:
      - https://localhost:9200
    username: elastic
    password: changeme
    ssl.verification_mode: none

processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - add_docker_metadata: ~

First start Auditbeat as root, then run Filebeat. Ensure Filebeat is correct collecting the journal logs. To find out the syscalls names you can use Discover on Kibana or run the following query:

curl -s -u elastic:changeme  --insecure -XPOST "https://elasticsearch:9200/auditbeat*/_search" -H "kbn-xsrf: reporting" -H "Content-Type: application/json" -d'
{
  "fields": [
    "process.name",
    "auditd.data.syscall"
  ],
  "_source": false,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "auditd.message_type": "seccomp"
          }
        },
        {
          "match": {
            "process.name": "filebeat"
          }
        }
      ]
    }
  }
}'|jq '.hits.hits | .[] | .fields."auditd.data.syscall"'

Related issues

## Use cases
## Screenshots
## Logs

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jun 28, 2024
@belimawr belimawr self-assigned this Jun 28, 2024
@belimawr belimawr added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Jun 28, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jun 28, 2024
Copy link
Contributor

mergify bot commented Jun 28, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@belimawr belimawr added backport-skip Skip notification from the automated backport with mergify skip-ci Skip the build in the CI but linting skip docs-build Skips docs build CI labels Jun 28, 2024
Copy link
Contributor

mergify bot commented Jul 2, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b journalctl-for-journald-input upstream/journalctl-for-journald-input
git merge upstream/main
git push upstream journalctl-for-journald-input

@belimawr belimawr force-pushed the journalctl-for-journald-input branch from 3460771 to 71db39c Compare July 2, 2024 14:47
@belimawr belimawr removed skip-ci Skip the build in the CI but linting skip docs-build Skips docs build CI labels Jul 2, 2024
@belimawr belimawr changed the title Journalctl for journald input Replace github.com/coreos/go-systemd/v22/sdjournal by journalctl Jul 2, 2024
@belimawr belimawr marked this pull request as ready for review July 2, 2024 14:49
@belimawr belimawr requested a review from a team as a code owner July 2, 2024 14:49
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@fearful-symmetry
Copy link
Contributor

fearful-symmetry commented Jul 2, 2024

So, more of a general comment, since I don't want to spam the same review comments: we're missing a lot of godocs, particularly on exported functions. Also, it would be good to add an extended explanation to the journald implementation as to why we're exec'ing out to journalctl instead of using a library. Also, an explanation of the args passed to journalctl and the timestamp field in Next() would also be useful to people reading this code in the future.

@belimawr
Copy link
Contributor Author

belimawr commented Jul 2, 2024

So, more of a general comment, since I don't want to spam the same review comments: we're missing a lot of godocs, particularly on exported functions. Also, it would be good to add an extended explanation to the journald implementation as to why we're exec'ing out to journalctl instead of using a library. Also, an explanation of the args passed to journalctl and the timestamp field in Next() would also be useful to people reading this code in the future.

Thanks @fearful-symmetry ! I'll start working on it.

@belimawr
Copy link
Contributor Author

belimawr commented Jul 2, 2024

@fearful-symmetry I added the reasoning for using journalctl to the PR description, let me know if it is clear enough.

@fearful-symmetry
Copy link
Contributor

fearful-symmetry commented Jul 2, 2024

@belimawr did you want to add a description to the code itself? My concern is that someone is going to look at the code
in the future and wonder why we aren't using the APIs or systemd libraries.

@belimawr
Copy link
Contributor Author

belimawr commented Jul 2, 2024

@belimawr did you want to add a description to the code itself? My concern is that someone is going to look at the code in the future and wonder why we aren't using the APIs or systemd libraries.

I was gonna add to the commit message, but I can add to the code itself if you prefer.

Where in the code do you believe is the best place to add it? in a doc.go file?

@fearful-symmetry
Copy link
Contributor

@belimawr in reader.go should suffice, since that's where the implementation is.

@belimawr belimawr added the skip-ci Skip the build in the CI but linting label Jul 3, 2024
On TestInputSeek set the since option to a time between the first and
second entries to avoid issues with different hosts.
The since tests was broken because newer versions of Jouranlctl
accepts time stamps on RFC3339, but older versions do not.

The Reader now formats the since parameter in a format accepted by
older versions of systemd. The since tests ensures the timestamp is in
the local timezone to avoid conflicts with jouranlct.
This commit fixes tests that were failing due to an incompatible
journal version. The new journal file was generated with journald
249 (249.11-0ubuntu3.12).
This commit adds the missing syscalls to our seccomp policy so
Filebeat can start the journalctl process.

The syscalls were acquired by running Filebeat with the seccomp
default action set to "log", and running Auditbeat to collect those
logs and convert the syscall number into a name accepted by our
seccomp policy.
@belimawr belimawr force-pushed the journalctl-for-journald-input branch from b350bf2 to 284f7b1 Compare August 7, 2024 16:16
Copy link
Contributor

mergify bot commented Aug 7, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b journalctl-for-journald-input upstream/journalctl-for-journald-input
git merge upstream/main
git push upstream journalctl-for-journald-input

Copy link
Contributor

mergify bot commented Aug 8, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b journalctl-for-journald-input upstream/journalctl-for-journald-input
git merge upstream/main
git push upstream journalctl-for-journald-input

pierrehilbert and others added 7 commits August 8, 2024 15:44
Add the correct syscalls to the changelog and move the entry under
breaking changes.
Fix merge conflicts in the changelog and improve working of one entry.
Re-add a syscall that was mistakenly removes from the default seccomp
policy.
Fixes a log entry that was mixing keys with data
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip Skip notification from the automated backport with mergify Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
6 participants