
Logparser/tail input on Windows stops sending logs due to file being locked #6539

Open
roelandvh opened this issue Oct 17, 2019 · 8 comments
Labels
area/tail · bug (unexpected problem or unintended behavior) · help wanted (Request for community participation, code, contribution) · platform/windows · size/l (1 week or more effort) · upstream (bug or issues that rely on dependency fixes)

Comments

@roelandvh

Relevant telegraf.conf:

[global_tags]

[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = false

[[outputs.influxdb]]
  urls = ["http://192.168.50.10:8086"]
  database = "telegraf"

[[inputs.logparser]]
  files = ["C:/Program Files/Apache/logs/All-AccessLog.*"]
  from_beginning = false

  [inputs.logparser.grok]
    patterns = ["%{COMBINED_LOG_FORMAT} %{RESPONSE_TIME}"]
    measurement = "apache_access_log"

System info:

  • OS: Windows 10 (version 1903)
  • Telegraf: telegraf-1.12.3_windows_amd64

Steps to reproduce:

  1. Start a process that repeatedly requests pages from the Apache server so that log lines keep being written to the access logs (I used watch -n 10 wget -q -r -P wget-out https://<url>; a Windows-friendly Go equivalent is sketched after this list)
  2. Start telegraf:
telegraf.exe --console --config <path to above config>
  3. In the influx CLI, select the telegraf database and keep executing and watching the results of this query:
select * from apache_access_log order by time desc limit 10;
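
If watch and wget are not available on the Windows host, a rough Go equivalent of that request loop is sketched below; the URL and interval are placeholders, not part of the original report.

package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// Placeholder URL of the Apache server whose access log is being tailed.
	const url = "http://localhost/"

	for {
		resp, err := http.Get(url)
		if err != nil {
			log.Println("request failed:", err)
		} else {
			resp.Body.Close() // each successful request adds a line to the access log
			log.Println("requested", url, "->", resp.Status)
		}
		time.Sleep(10 * time.Second) // same cadence as `watch -n 10`
	}
}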

Expected behavior:

New data points should keep appearing in apache_access_log, shortly after each logline appears in the Apache log.

Actual behavior:

Immediately after starting, fresh data appears in apache_access_log, but new points stop arriving after a variable period of time (usually within a few minutes at most), even though new log lines keep appearing in the Apache log files.

Additional info:

I think I've reproduced the cause of this issue using the following simple Go program that uses the same code for tailing the log files as telegraf does:

package main

import (
	"log"
	"os"
	"github.com/influxdata/tail"
)

func main() {
	t, err := tail.TailFile("c:\\Program Files\\Apache\\logs\\All-AccessLog.2019-10-17", tail.Config{
		Follow:   true,                                              // keep reading as the file grows
		ReOpen:   true,                                              // reopen the file if it is rotated
		Location: &tail.SeekInfo{Offset: -200, Whence: os.SEEK_END}, // start 200 bytes from the end
		Logger:   log.New(os.Stdout, "[TailLog] ", log.LstdFlags),
		Poll:     true,                                              // poll for changes instead of using file notifications
	})
	if err != nil {
		log.Fatal(err)
	}

	for line := range t.Lines {
		log.Println(line.Text)
	}
}

After some experimenting (printf-debugging, really) in the influxdata/tail code, I noticed that the tail would sooner or later die, as a result of a "non-EOF error". The output of the above program was:

...
[lines from the file that is being tailed]
...
[TailLog] 2019/10/17 16:01:26 non-EOF error: %!(EXTRA *os.PathError=read c:\Program Files\Apache\logs\All-AccessLog.2019-10-17: The process cannot access the file because another process has locked a portion of the file.)

This would happen at this place in the tail.go file in the tailFileSync func:

...
		line, err := tail.readLine()

		// Process `line` even if err is EOF.
		if err == nil {
...
		} else if err == io.EOF {
...
		} else {
			tail.Logger.Printf("non-EOF error: ", err) // Line added by me
			// non-EOF error
			tail.Killf("Error reading %s: %s", tail.Filename, err)
			return
		}

So it appears that Apache writing additional lines to the file at the same moment the telegraf logparser tries to read new lines is what causes this issue.
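
For what it's worth, the same class of error can be provoked directly with a Windows byte-range lock. The sketch below is only illustrative (the test file path is made up, and it needs golang.org/x/sys/windows): it locks part of a file through one handle, then reads the locked region through a second handle, which fails with ERROR_LOCK_VIOLATION, i.e. the exact message shown above.

package main

import (
	"fmt"
	"log"
	"os"

	"golang.org/x/sys/windows"
)

func main() {
	const path = `C:\temp\locktest.log` // made-up test file

	if err := os.WriteFile(path, []byte("0123456789abcdef\r\n"), 0644); err != nil {
		log.Fatal(err)
	}

	// Handle 1: lock the first 8 bytes exclusively, as a log writer might while appending.
	writer, err := os.OpenFile(path, os.O_RDWR, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer writer.Close()

	ol := new(windows.Overlapped) // lock starts at offset 0
	if err := windows.LockFileEx(windows.Handle(writer.Fd()),
		windows.LOCKFILE_EXCLUSIVE_LOCK|windows.LOCKFILE_FAIL_IMMEDIATELY,
		0, 8, 0, ol); err != nil {
		log.Fatal(err)
	}
	defer windows.UnlockFileEx(windows.Handle(writer.Fd()), 0, 8, 0, ol)

	// Handle 2: a plain reader, like the tailer uses.
	reader, err := os.Open(path)
	if err != nil {
		log.Fatal(err)
	}
	defer reader.Close()

	buf := make([]byte, 16)
	_, err = reader.Read(buf)
	fmt.Println("read error:", err)
	// Expected: "... The process cannot access the file because another
	// process has locked a portion of the file." (ERROR_LOCK_VIOLATION)
}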

(Initially I wanted to report this at https://github.com/influxdata/tail but that repository doesn't accept issues.)

@danielnelson
Contributor

Thanks for the report, it seems that an appropriate fix might be to have a list of temporary errors here that are okay and should be ignored.

If you can reproduce this with https://github.com/hpcloud/tail, and I suspect you will be able to, then it would be best to report there. The influxdata/tail repo has just a few patches that have not been accepted upstream yet. We may want to switch to the https://github.com/nxadm/tail fork when we have time, as hpcloud doesn't seem to have any active maintainers, but I would still report to hpcloud/tail since it has the most watchers.
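
To illustrate the idea, here is a hypothetical sketch of such a check; isTemporaryReadError is not part of the tail library, and 33 is the Windows ERROR_LOCK_VIOLATION code.

// Hypothetical helper, not part of the tail library: decide whether a read
// error is transient so the tailer can back off and retry instead of dying.
// Needs "os" and "syscall" in the imports of tail.go.
func isTemporaryReadError(err error) bool {
	if pathErr, ok := err.(*os.PathError); ok {
		// 33 is Windows ERROR_LOCK_VIOLATION: "The process cannot access the
		// file because another process has locked a portion of the file."
		if errno, ok := pathErr.Err.(syscall.Errno); ok && errno == 33 {
			return true
		}
	}
	return false
}

The non-EOF branch quoted earlier could then retry instead of killing the tailer, roughly like this ("time" would also need to be imported):

		} else if isTemporaryReadError(err) {
			// transient lock held by the writer: wait briefly and let the
			// loop call readLine again on the next iteration
			time.Sleep(100 * time.Millisecond)
		} else {
			// non-EOF error
			tail.Killf("Error reading %s: %s", tail.Filename, err)
			return
		}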

@danielnelson danielnelson added the area/tail, bug (unexpected problem or unintended behavior) and upstream (bug or issues that rely on dependency fixes) labels Oct 17, 2019
@roelandvh
Author

@danielnelson Thanks for your reply. On your suggestion, I created this issue in hpcloud/tail: hpcloud/tail#157.

@danielnelson danielnelson changed the title from "logparser on apache logs on Windows stops sending logs due to file being locked" to "Logparser/tail input on Windows stops sending logs due to file being locked" Oct 24, 2019
@fedeori

fedeori commented Nov 22, 2019

Hi, I didn't test this with roelandvh's config, but I had a similar problem a few days ago, and as a workaround I used:
watch_method = "poll"
under [[inputs.logparser]]. That solved my problem and could be used as a workaround until the real problem is solved.
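
For reference, applied to the config from the original report that workaround would look roughly like this (workaround only; the underlying read error can still occur):

[[inputs.logparser]]
  files = ["C:/Program Files/Apache/logs/All-AccessLog.*"]
  from_beginning = false
  watch_method = "poll"   # workaround: poll the files instead of using file-change notifications

  [inputs.logparser.grok]
    patterns = ["%{COMBINED_LOG_FORMAT} %{RESPONSE_TIME}"]
    measurement = "apache_access_log"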

@nxadm

nxadm commented Feb 6, 2021

@roelandvh If you have the time, can you try to reproduce the error with https://github.com/nxadm/tail? I added quite a few updates that fixed many of the MS Windows problems and made all Windows tests pass, which allowed me to remove the Windows support warning from the inherited README.md.

@whyjp

whyjp commented Sep 9, 2021

I got the same issue, so I tried nxadm/tail, but it also raised a permission issue.

I'm trying to find another metrics integration solution.

@lavalex

lavalex commented May 31, 2023

Hello,

I'm observing the same issue with Telegraf 1.25.0 (git: HEAD@4d17ec79). Both methods "inotify" and "poll" behave the same.

Is there any fix expected for this?

@srebhan srebhan self-assigned this May 31, 2023
@lavalex

lavalex commented Apr 3, 2024

Hello,

Any updates on this issue?

@srebhan
Member

srebhan commented Apr 3, 2024

@lavalex unfortunately, the issue is buried deep in the tail library we are using. Whenever we see the described error, the tailer stops processing the file and we do not have a chance to reopen it in a sensible manner. So solving this issue would involve either fixing the tail library or basing the plugin on a different library; both are a huge chunk of work. Help is appreciated!

@srebhan srebhan added the help wanted (Request for community participation, code, contribution) and size/l (1 week or more effort) labels Apr 3, 2024
@srebhan srebhan removed their assignment Apr 3, 2024