Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cleardtc on aw50 fails successfully #84

Closed
aaeegg opened this issue Aug 6, 2023 · 9 comments
Closed

cleardtc on aw50 fails successfully #84

aaeegg opened this issue Aug 6, 2023 · 9 comments
Labels
need more info waiting for submitter

Comments

@aaeegg
Copy link
Contributor

aaeegg commented Aug 6, 2023

When clearing DTCs on AW50-42 transmission using D2 protocol, DTCs are cleared, but scantool reports "Failed" and loses connection to the TCM. Maybe a timeout needs to be increased.

@fenugrec
Copy link
Owner

fenugrec commented Aug 6, 2023

hey @aaeegg , long time no see. Do you remember what the reported failure was ?

@aaeegg
Copy link
Contributor Author

aaeegg commented Aug 6, 2023

Here's the log:

scantool/850> connect aw50
diag_os_unix.c:438:  WARNING: Not running as superuser; things may not work correctly
diag_os_unix.c:451:  sched_setscheduler failed: Operation not permitted.
clock_gettime(): using CLOCK_MONOTONIC_RAW
clock_nanosleep(): using CLOCK_MONOTONIC
diag_os_gethrt() resolution <= 2us, avg ~1us
diag_os_getms() resolution <= ~1ms, avg ~1ms
testing diag_os_millisleep(), this will take a moment...
diag_os_chronoms() : initial time 3358872304ms; resolution: ~1ms
diag_l0_elm.c:410:  Sending ATI to ELM32x at 38400...
diag_l0_elm.c:410:  Sending ATI to ELM32x at 9600...
diag_l0_elm.c:410:  Sending ATI to ELM32x at 115200...
Official ELM found, v1.3a
Connected to AW50-42 transmission.
scantool/850> dtc
Stored DTCs:
AT-??? (08) 
scantool/850> cleardtc
Are you sure you wish to clear the Diagnostic Trouble Codes (y/n) ? y
Failed
scantool/850> dtc
Couldn't retrieve DTCs.
scantool/850> disconnect
diag_l0_elm.c:335:  ELM returned error : ?
diag_l0_elm.c:336: Unspecified Error.
Disconnected from AW50-42 transmission.
scantool/850> connect aw50
diag_l0_elm.c:410:  Sending ATI to ELM32x at 38400...
diag_l0_elm.c:410:  Sending ATI to ELM32x at 9600...
diag_l0_elm.c:410:  Sending ATI to ELM32x at 115200...
Official ELM found, v1.3a
Connected to AW50-42 transmission.
scantool/850> dtc
No stored DTCs.

@fenugrec
Copy link
Owner

fenugrec commented Aug 6, 2023

you should know that code better than me since you wrote it P)
Looks like the error is coming from here : https://github.com/fenugrec/freediag/blob/master/scantool/scantool_850.c#L1420

Suggest adding some DIAG_DBGM(... lines around that area and the last function it calls.
Agreed timeouts may need to be higher since this is an ELM, but without more debug info or a bus trace it will be hard to guess.

@fenugrec fenugrec added the need more info waiting for submitter label Sep 29, 2023
@aaeegg
Copy link
Contributor Author

aaeegg commented Oct 8, 2023

In response to the clearDiagnosticInformation request (AF 01), the TCM sends a series of routineNotCompleteOrServiceInProgress responses (7E xx 23) at ~52ms intervals, then after a total of nearly a full second, a success response (EF 01).

Raw communication as seen when connecting to the interface (OBDLink SX) with a terminal program:

>AF 01
84 13 6E 7E AF 23 55 
84 13 6E 7E AF 23 55 
84 13 6E 7E AF 23 55 
84 13 6E 7E AF 23 55 
84 13 6E 7E AF 23 55 
84 13 6E 7E AF 23 55 
84 13 6E 7E AF 23 55 
84 13 6E 7E AF 23 55 
84 13 6E 7E AF 23 55 
84 13 6E 7E AF 23 55 
84 13 6E 7E AF 23 55 
84 13 6E 7E AF 23 55 
84 13 6E 7E AF 23 55 
84 13 6E 7E AF 23 55 
84 13 6E 7E AF 23 55 
84 13 6E 7E AF 23 55 
84 13 6E 7E AF 23 55 
84 13 6E 7E AF 23 55 
83 13 6E EF 01 F4 

>

Scope shot of overall timing (readDiagnosticTroubleCodes request at G1, response at G2, immediately followed by clearDiagnosticInformation request at x, final response at o):
cleardtc18

Scope shot of routineNotCompleteOrServiceInProgress repetition rate:
7erate2

The behavior of the TCM is in contrast to the fuel management ECU, which immediately sends a success response. scantool handles the latter correctly. But when faced with the TCM's series of responses, scantool only looks at the first routineNotCompleteOrServiceInProgress response and doesn't wait for the interface to send the rest of the responses and the > prompt. It interprets the routineNotCompleteOrServiceInProgress response as a failure and ignores the rest of the responses.

scantool output with debugging turned on:

freediag/850> cleardtc
Are you sure you wish to clear the Diagnostic Trouble Codes (y/n) ? y
You haven't read the DTCs yet. Are you sure you wish to clear them (y/n) ? y
diag_l2.c:523:  _request dl2c=0x55ea1f51e220 msg=0x7ffe4dac6e10 called
diag_l2.c:499:  diag_l2_send 0x55ea1f51e220 msg 0x7ffe4dac6e10 msglen 2 called
diag_l0_elm.c:1192:  getflags link 0x55ea1f51d9f0 proto 1 flags 0x146A1
diag_l1.c:151:  _send: len=5 P4=5 l0flags=0x146A1; 0x83 0x6E 0x13 0xAE 0x01 
diag_l0_elm.c:943:  ELM: sending 5 bytes
diag_l0_elm.c:288:  elm_sendcmd: ATSH 83 6E 13
diag_l0_elm.c:310:  received 5 bytes (OK

>
); hex: 0x4F 0x4B 0x0A 0x0A 0x3E 
diag_l0_elm.c:983:  ELM: (sending string 836E13AE01^M)
diag_l1.c:247:  _recv request len=66, timeout=1100;diag_l0_elm.c:1040:  Expecting 3*66 bytes from ELM, 1100 ms timeout(+400)...got 7 bytes; 0x84 0x13 0x6E 0xEE 0x01 0x39 0x2D 
diag_l2.c:530:  _request returns 0x55ea1f51ef80, err 0
diag_l2.c:523:  _request dl2c=0x55ea1f51e220 msg=0x7ffe4dac6e90 called
diag_l2.c:499:  diag_l2_send 0x55ea1f51e220 msg 0x7ffe4dac6e90 msglen 2 called
diag_l0_elm.c:1192:  getflags link 0x55ea1f51d9f0 proto 1 flags 0x146A1
diag_l1.c:151:  _send: len=5 P4=5 l0flags=0x146A1; 0x83 0x6E 0x13 0xAF 0x01 
diag_l0_elm.c:943:  ELM: sending 5 bytes
diag_l0_elm.c:983:  ELM: (sending string 836E13AF01^M)
diag_l1.c:247:  _recv request len=66, timeout=1100;diag_l0_elm.c:1040:  Expecting 3*66 bytes from ELM, 1100 ms timeout(+400)...got 7 bytes; 0x84 0x13 0x6E 0x7E 0xAF 0x23 0x55 
diag_l2.c:530:  _request returns 0x55ea1f51ef20, err 0
Failed
freediag/850>

Contrary to my original comment, scantool is still communicating with the TCM at this point. However, a subsequent disconnect command fails. scantool sends a stopDiagnosticSession request and gets a success response, then it tries to send an ATPC command to disconnect the interface. Normally, by the time scantool sends the ATPC command, the interface has already returned to command mode. But in this case, the previous series of routineNotCompleteOrServiceInProgress responses caused the interface's adaptive timing algorithm to increase the response timeout. The interface is still listening for more responses and hasn't yet sent the > prompt, but scantool considered the stopDiagnosticSession request/response to be done and didn't wait for the > prompt before sending the ATPC command. This caused the interface to abort the still-in-progress receive and send a STOPPED error.

scantool output with debugging turned on:

freediag/850> disconnect
diag_l2.c:499:  diag_l2_send 0x55f54e1fc220 msg 0x7ffdcd422400 msglen 1 called
diag_l0_elm.c:1192:  getflags link 0x55f54e1fb9f0 proto 1 flags 0x146A1
diag_l1.c:151:  _send: len=4 P4=5 l0flags=0x146A1; 0x82 0x6E 0x13 0xA0 
diag_l0_elm.c:943:  ELM: sending 4 bytes
diag_l0_elm.c:288:  elm_sendcmd: ATSH 82 6E 13
diag_l0_elm.c:310:  received 5 bytes (OK

>
); hex: 0x4F 0x4B 0x0A 0x0A 0x3E 
diag_l0_elm.c:983:  ELM: (sending string 826E13A0^M)
diag_l1.c:247:  _recv request len=66, timeout=1100;diag_l0_elm.c:1040:  Expecting 3*66 bytes from ELM, 1100 ms timeout(+400)...got 5 bytes; 0x82 0x13 0x6E 0xE0 0xE3 
diag_l2.c:291:  Entered diag_l2_close for dl0d=0x55f54e1fb9f0;
        closing dl2link 0x55f54e1fc1f0.
diag_l2.c:199:  l2_closelink 0x55f54e1fc1f0 called
diag_l1.c:112:  entering diag_l1_close: dl0d=0x55f54e1fb9f0
diag_l0_elm.c:288:  elm_sendcmd: ATPC
diag_l0_elm.c:310:  received 14 bytes (STOPPED

>?

>
); hex: 0x53 0x54 0x4F 0x50 0x50 0x45 0x44 0x0A 0x0A 0x3E 0x3F 0x0A 0x0A 0x3E 
diag_l0_elm.c:334:  ELM returned error : ?
diag_l0_elm.c:335: Unspecified Error.
diag_l0_elm.c:205:  link 0x55f54e1fb9f0 closing
Disconnected from AW50-42 transmission.
freediag/850>

According to Richard H. Jones's document on error codes, in addition to the AW50-42 TCM sending routineNotCompleteOrServiceInProgress while clearing DTCs, the airbag module does the same thing. Additionally, when a readDataByLocalIdentifier request is sent with a final byte of 02 rather than the usual 01 (scantool never does this), the ECU will send a continuous data stream alternating between routineNotCompleteOrServiceInProgress and readDataByLocalIdentifier responses.

So there are two problems here:

  1. The D2 code doesn't properly handle a routineNotCompleteOrServiceInProgress response.
  2. If the interface is still listening for responses, elm_sendcmd doesn't know about it and won't abort the previous command.

A possible side issue is that the diag_l1 timeouts are totally uncoordinated with the interface's timeouts. So the upper layer code might know how long a response should take and use a specific timeout, and that's the timeout used by scantool to communicate with the interface, but the timeout used by the interface to communicate with the ECU is totally different. Also, if we know what the protocol's timing constraints are, it would probably be better to set the interface to fixed timeout (ATAT 0, ATST hh) instead of letting it use adaptive timing.

@aaeegg aaeegg changed the title cleardrc on aw50 fails successfully cleardtc on aw50 fails successfully Oct 8, 2023
@fenugrec
Copy link
Owner

fenugrec commented Oct 8, 2023

wow, excellent analysis.
ELM "user-friendliness" strikes again. I wonder if the cheap elm327 clones support ATAT / ATST at all ? I get the feeling that they don't, but I don't have an ELM anymore and not at all interested in spending time on them.

Do I understand that for this issue, the only clean way to deal with it is to disable adaptive timing ? I can think of a few approaches :

  • send ATAT + ATST systematically on anything that reports as "elm327". Some shitty clones may start throwing errors on every init if they do not support those.
  • send ATAT + STST only on "non-clones" (although that is meaningless now that ELM doesn't exist anymore)
  • add an IOCTL that is only used in D2 code to disable adaptive timing. Then if clones start breaking at least it's localized to D2 users, easier to contain the error and inform the user appropriately. Disadvantage is a single-use IOCTL.

@aaeegg
Copy link
Contributor Author

aaeegg commented Oct 8, 2023

I think this could break on a fast enough computer even if adaptive timing is disabled. We should be setting a flag when an operation is in progress, and clear it when we see a > character. On entry to elm_send and elm_sendcmd, if the flag is set and there's not already a > in the input buffer, we should send a space character to abort the operation, then wait for the >. Note that we should only wait for the > and not explicitly check for the STOPPED response, because there's a race condition if the operation completes just as we're trying to abort it. Also note that we send a space rather than a CR, because if a CR arrives after the operation completes, the interface will interpret this as a request to repeat the last command. I tested leading space characters on the OBDLink SX and they're safe. They should also be OK on original ELM327 according to the datasheet. I don't know what other clones will do, but even if they don't handle it correctly, we won't be hosed any worse than if we had done nothing.

@fenugrec
Copy link
Owner

fenugrec commented Oct 9, 2023

I think this could break on a fast enough computer even if adaptive timing is disabled.

Why would it break ? When you request that sendcmd from D2 code, I understand there's a good notion of what kind of timeout is appropriate, so it should be a simple matter of passing the request with that timeout ? I don't know if I want to change the ELM layer into some kind of partially async code.

@aaeegg
Copy link
Contributor Author

aaeegg commented Oct 9, 2023

The problem is, you send your request, elm_send returns immediately. The car responds, you call elm_recv with whatever timeout, elm_recv gets the response and returns. But the interface is still waiting for more potential responses from the car. If you actually expect multiple responses, you call elm_recv again and get the next response (or a timeout), etc. If you've finished with this transaction and proceed to send another request or command - that's when it breaks.

@aaeegg
Copy link
Contributor Author

aaeegg commented Oct 9, 2023

I split off the ELM interface problem into #87 because it's a separate issue from the misinterpretation of the TCM's response to clearDiagnosticInformation.

aaeegg added a commit to aaeegg/freediag that referenced this issue Oct 10, 2023
…ard it

and keep listening until we get the final response. This is necessary for
requests that return a few routineNotCompleteOrServiceInProgresses before
they complete, such as clearDiagnosticInformation on AW50-42.

Fixes fenugrec#84
aaeegg added a commit to aaeegg/freediag that referenced this issue Oct 10, 2023
…ard it

and keep listening until we get the final response. This is necessary for
requests that return a few routineNotCompleteOrServiceInProgresses before
they complete, such as clearDiagnosticInformation on AW50-42.

Fixes fenugrec#84
aaeegg added a commit to aaeegg/freediag that referenced this issue Oct 10, 2023
If we get a routineNotCompleteOrServiceInProgress response, discard it and
keep listening until we get the final response. This is necessary for
requests that return a few routineNotCompleteOrServiceInProgresses before
they complete, such as clearDiagnosticInformation on AW50-42.

Fixes fenugrec#84
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need more info waiting for submitter
Projects
None yet
Development

No branches or pull requests

2 participants