
How to recognize "In state connection received CLIENT ALERT" #853

Open
buhln opened this issue Jun 24, 2024 · 14 comments
Labels
Status: Available (No one has claimed responsibility for resolving this issue.)

Comments

@buhln

buhln commented Jun 24, 2024

I observed the following issue: the communication between the client and the server stopped without a disconnect on the client side. The broker log shows the following line:

[MQTT] socket_error: {tls_alert,{internal_error,"TLS server: In state connection received CLIENT ALERT: Fatal - Internal Error\n"}}

Since we are using a quite outdated EMQX broker I do not want to call it a bug, but I cannot easily update the broker myself because it is part of a bigger infrastructure. So I have to look for a workaround, like forcing my client to reconnect.

Do you have an idea how to recognize this situation from the client side?

Thanks in advance!

Reproduction

Not reproducible

Environment

  • Python version:
  • Library version: 2.1.0
  • MQTT server: emqx 4.3.8
@github-actions github-actions bot added the Status: Available (No one has claimed responsibility for resolving this issue.) label Jun 24, 2024
@MattBrittan
Contributor

If the broker is dropping the connection then I would expect the on_disconnect() callback to be called (depending on what happens with the connection this may be triggered by the keepalive process) or, perhaps, on_connect_fail() if the connection was never established. Unfortunately, without more info (your code, logs etc.) it's difficult to say much more.
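
For example, hooking these up with the VERSION2 callback API looks roughly like this (an untested sketch; the logger setup and messages are purely illustrative):

import logging
import paho.mqtt.client as mqtt

logging.basicConfig(level=logging.INFO)

def on_disconnect(client, userdata, flags, reason_code, properties):
    # Called once an established connection is lost; if the socket dies silently
    # this may only fire after the keepalive interval expires.
    logging.warning("Disconnected from broker: %s", reason_code)

def on_connect_fail(client, userdata):
    # Called when a (re)connection attempt never completes.
    logging.error("Connection attempt failed")

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.on_disconnect = on_disconnect
client.on_connect_fail = on_connect_fail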

@MattBrittan MattBrittan added the Status: More info needed (More information needed from issue author) label Jul 17, 2024
@buhln
Author

buhln commented Jul 29, 2024

We have not observed this issue for some time now. Perhaps it is/was a problem with the EMQX broker. I am closing the issue because I cannot provide more information at the moment. If we observe it again I will check the on_disconnect() callback (which for sure is implemented ;-) ) and perhaps reopen the issue. Thanks for the support!

@buhln buhln closed this as completed Jul 29, 2024
@buhln
Author

buhln commented Jul 30, 2024

As usual, something happens right after you close an issue. Today we got a new error message with tls_alert:

2024-07-30T07:54:12.991315+02:00 [warning] M1496_PES_-350@10.0.0.2:57095 [MQTT] socket_error: {tls_alert,{bad_record_mac,"TLS server: In state connection at tls_record_1_3.erl:280 generated SERVER ALERT: Fatal - Bad Record MAC\n decryption_failed"}}
2024-07-30T07:54:12.991725+02:00 [error] supervisor: 'esockd_connection_sup - <0.1899.0>', errorContext: connection_shutdown, reason: {tls_alert,{bad_record_mac,"TLS server: In state connection at tls_record_1_3.erl:280 generated SERVER ALERT: Fatal - Bad Record MAC\n decryption_failed"}}, offender: [{pid,<0.31731.566>},{name,connection},{mfargs,{emqx_connection,start_link,[[{deflate_options,[]},{max_conn_rate,500},{active_n,100},{zone,external},{proxy_address_header,<<>>},{proxy_port_header,<<>>},{supported_subprotocols,[]}]]}}]

Slightly different from what you mention, @MattBrittan: the on_connect() callback was called and worked, but the connection did not work after the connect. No additional error messages were created. What kind of information, besides the server logs, can I provide to look into this in more detail?

@buhln buhln reopened this Jul 30, 2024
@MattBrittan
Contributor

Thanks @buhln,

But the connection did not work after the connect.

With the info provided to date I'm unsure whether there is an issue in this library (it's not attempting to reconnect), whether an exception is being thrown (caught by your code but the loop not restarted), or whether it's an issue on the EMQX side (and the client is attempting to reconnect but the server is not responding/functioning).

Are you able to provide some code showing how you initialise the library (connection options etc) and which callbacks are in place? Full logs would also be very helpful (but I understand they can be difficult to produce in a production environment with an issue that occurs so infrequently). Even just enabling error logging in the client might produce some useful info.
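
For example, the client's built-in logging can be enabled with something like this (a rough sketch; the logger name is arbitrary):

import logging
from paho.mqtt import client as mqtt_client

logging.basicConfig(level=logging.DEBUG)
paho_log = logging.getLogger("paho-debug")

client = mqtt_client.Client(mqtt_client.CallbackAPIVersion.VERSION2)
client.enable_logger(paho_log)  # paho emits its DEBUG/ERROR records through this logger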

Unfortunately issues like this can be difficult to trace. The more info you can provide, the more likely it is that someone will be able to find the cause.

@buhln
Author

buhln commented Jul 31, 2024

@MattBrittan thank you for your fast reply!

Here is the code of my mqtt class which controls the MQTT connection. The callbacks are created dynamically using the subscribe() function, due to the fact that the whole application is configured by configuration files and a "state machine" is running in the background.

from paho.mqtt import client as mqtt_client
import random
import os
import logging

# Application logger used by the class below (configured elsewhere in the real application)
logPes = logging.getLogger("PES")

class mqttClient:
    def __init__(self):
        self.BROKER = os.getenv('MQTT_Server')
        self.PORT = int(os.getenv('MQTT_Port'))
        self.USERNAME = os.getenv('MQTT_User')
        self.PASSWORD = os.getenv('MQTT_Password')
        self.CLIENT_ID = os.getenv('MachineNo') + f'-{random.randint(0, 1000)}'
        self.client = self.connect_mqtt()

        self._reconnect_flag = False
        self._topics = []

    def on_connect(self, client, userdata, flags, reason_code, properties):
        if reason_code == 0 and self.client.is_connected():
            # Resubscribe to all topics
            if self._reconnect_flag:
                logPes.info("Reconnected to MQTT Broker")
                logPes.debug("Resubscribe on all topics")
                for topic in self._topics:
                    self.subscribe(topic[0], topic[1], topic[2])
            else:
                logPes.info("Connected to MQTT Broker")

            if not self._reconnect_flag: self._reconnect_flag = True
        else:
            logPes.error(f'Failed to connect, reason code {reason_code}')

    def on_disconnect(self, client, userdata, flags, reason_code, properties):
        logPes.warning(f'MQTT Broker disconnected, reason code {reason_code}')
        # This is legacy. Reason_code 16 does not occur in the current version of the Paho MQTT library
        if reason_code == 16:
            logPes.error(f'Fatal disconnect, kill application')
            exit('Fatal MQTT disconnect: reason_code=16') # Docker will restart application

    def connect_mqtt(self):
        logPes.info('Connecting to MQTT broker ' + self.BROKER + ':' + str(self.PORT))
        client = mqtt_client.Client(mqtt_client.CallbackAPIVersion.VERSION2, self.CLIENT_ID)
        client.tls_set(ca_certs='./ca.pem')
        client.username_pw_set(self.USERNAME, self.PASSWORD)
        client.connect(self.BROKER, self.PORT, keepalive=45)
        client.on_connect = self.on_connect
        client.on_disconnect = self.on_disconnect
        return client
    
    def subscribe(self, topic, qos, callback):
        """Subscribe on topics"""
        self.client.subscribe(topic, qos)
        self.client.message_callback_add(topic, callback)
        if [topic, qos, callback] not in self._topics:
            self._topics.append([topic, qos, callback])
        logPes.debug(f"Subscribed on `{topic}` topic")

##-> END MQTT code

def main():
    # Connect to MQTT broker
    try:
        mqtt = mqttClient()
    except Exception as error:
        logging.error("An error in MQTT connection occurred: {}".format(error))
        exit("An error in MQTT connection occurred: {}".format(error))

    #
    # Some instances of classes were created here which are using the mqtt client...
    # [...]
    #

    try:
        # Start infinite loop to run the mqtt client
        mqtt.client.loop_forever()
    except Exception as error:
        logging.error("An error in MQTT connection occurred: {}".format(error))
        mqtt.client.disconnect()
    
    except KeyboardInterrupt:
        logging.debug("Programm stopped by user")
        mqtt.client.disconnect()

if __name__ == "__main__":
    main()

After reconnecting (self._reconnect_flag == True), the last log messages are the Subscribed on `{topic}` topic messages. So the subscriptions were carried out, but no messages were received by the application from the broker.

Since the tag tls_alert occurs in all error messages, I tend to believe that the problem is connected to the client.tls_set() function only?

But if you have some ideas and tips to improve the robustness of my code, I am looking forward to reading them! :-)

Thanks!

@MattBrittan
Contributor

This is a tricky one, I can't see anything obviously wrong with your code and am still not clear if the issue is with this library or EMQX.

Since the tag tls_alert occurs in all error messages, I tend to believe that the problem is connected to the client.tls_set() function only?

That would be correct insofar as client.tls_set() enables TLS. Basically it appears that EMQX is erroring while processing the incoming data (i.e. decrypting the TLS packets). Unfortunately "Internal Error" does not really provide an indication of the cause (and it may just be corruption of some kind). What should happen at that point is that the connection drops and the client detects this (either directly or via the keepalive process) and reconnects. Obviously this is not happening, but we don't really have enough info to determine why.

In fact I'm not even sure if the above is fully correct; it's possible that the error logged by EMQX relates to a different connection, and whatever happens when the error is processed leads to some EMQX subsystems failing, which results in the loss of subscriptions on the server side (this is very much guesswork!).

In order to determine if this is an issue with the paho client I think we are going to need debug logs (unless someone else raises a similar issue); I understand that these would be difficult to get given how infrequently this occurs!

Another option would be to add some form of watchdog in your app (i.e. if no message is received within a minute then either exit so Docker will restart the app, or call SUBSCRIBE again). This is not ideal but might be the pragmatic solution.
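
Something along these lines (an untested sketch of the first variant; the timeout value and broker address are placeholders, and it runs the network loop via loop_start() rather than loop_forever()):

import sys
import time
import paho.mqtt.client as mqtt

WATCHDOG_TIMEOUT = 60  # seconds without any received message before giving up
last_message = time.monotonic()

def on_message(client, userdata, msg):
    global last_message
    last_message = time.monotonic()  # reset the watchdog on every received message
    # ... normal message handling ...

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.on_message = on_message
client.connect("broker.example.com", 1883, keepalive=45)  # placeholder broker
client.loop_start()  # network loop runs in a background thread

while True:
    time.sleep(5)
    if time.monotonic() - last_message > WATCHDOG_TIMEOUT:
        sys.exit("MQTT watchdog expired")  # Docker restarts the application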

@buhln
Author

buhln commented Aug 21, 2024

After updating EMQX to v5.7.2 this issue looks like it has gone away. I will monitor it with log level DEBUG on the paho client.

@buhln
Author

buhln commented Aug 21, 2024

It looks like the client disconnects now but is able to reconnect again:

2024-08-21 14:21:40,369 - paho.mqtt.client - DEBUG - Sending PUBLISH (d0, q0, r1, m88), 'b'xxx'', ... (476 bytes)
2024-08-21 14:21:40,371 - paho.mqtt.client - ERROR - failed to receive on socket: [SSL: BAD_LENGTH] bad length (_ssl.c:2406)
2024-08-21 14:21:40,371 - paho.mqtt.client - DEBUG - Sending PUBLISH (d0, q0, r0, m89), 'b'xxx'', ... (738 bytes)
2024-08-21 14:21:40,372 - PES - WARNING - MQTT Broker disconnected, reason code Unspecified error
2024-08-21 14:21:41,415 - paho.mqtt.client - DEBUG - Sending CONNECT (u1, p1, wr0, wq0, wf0, c1, k45) client_id=b'xxx-636'
2024-08-21 14:21:41,458 - paho.mqtt.client - DEBUG - Received CONNACK (0, 0)
2024-08-21 14:21:41,459 - paho.mqtt.client - DEBUG - Sending PUBLISH (d0, q0, r1, m93), 'b'xxx'', ... (88 bytes)

I think the interesting line is:

2024-08-21 14:21:40,371 - paho.mqtt.client - ERROR - failed to receive on socket: [SSL: BAD_LENGTH] bad length (_ssl.c:2406)

Here the corresponding line in the log of the broker (EMQX v5.7.2):

2024-08-21T14:21:39.794411+02:00 [info] clientid: xxx_yyy_-636, msg: terminate, peername: 10.0.x.x:46556, username: xxx, reason: {shutdown,ssl_closed}

The problem looks like it is located on the client side. What's quite weird is that we run the application for other machines in the same Docker Swarm (same image) without observing this issue. Any ideas?

@MattBrittan
Contributor

Issue #691 looks similar, but this is quite old and minimal information was provided. Had a quick look and found this issue that looks similar but was only replicable when using gevent. There are also a few references to the error being produced when multiple operations on the socket are run concurrently (for example socket.IO issues) - I have had a look at the client code and the only way I could see that happening is if multiple network loops are running.

I'll remove the more_info_needed tag as I believe there is now enough info for this to be investigated further; however, I suspect it's going to be difficult to resolve without a reproducible example (as you say, this is only happening on one of your nodes). We will need someone with more knowledge of Python networking code to step in here!

@MattBrittan MattBrittan added the Status: Accepted (It's clear what the subject of the issue is about, and what the resolution should be.) label and removed the Status: More info needed (More information needed from issue author) label Aug 21, 2024
@buhln
Author

buhln commented Aug 22, 2024

I checked the same logs again and perhaps the following info can be helpful.

I noticed that just a moment before the issue occurred, a big message was sent (message id 84). We call this kind of file an Operation Instruction and it is part of the functionality of the system. The message is 1456152 bytes in size and contains a JSON object with metadata and a PDF encoded as a base64 string. The broker accepts message sizes up to 30 MB.

This message is sent with QoS 1, but the PUBACK for this big message is never received. Perhaps this incoming PUBACK message causes the issue? The debug log of the broker shows no problems here.

2024-08-21 14:21:40,290 - paho.mqtt.client - DEBUG - Received PUBLISH (d0, q2, r0, m34), 'xxx/sendopinstruction', ...  (62 bytes)
2024-08-21 14:21:40,290 - paho.mqtt.client - DEBUG - Sending PUBREC (Mid: 34)
2024-08-21 14:21:40,295 - paho.mqtt.client - DEBUG - Received PUBREL (Mid: 34)
2024-08-21 14:21:40,295 - paho.mqtt.client - DEBUG - Sending PUBLISH (d0, q0, r0, m83), 'b'xxx/messages'', ... (120 bytes)
2024-08-21 14:21:40,295 - PES - DEBUG - Incoming request for OperationInstruction for order on station 1
2024-08-21 14:21:40,306 - paho.mqtt.client - DEBUG - Sending PUBLISH (d0, q1, r0, m84), 'b'xxx/opinstruction'', ... (1456152 bytes)
2024-08-21 14:21:40,307 - paho.mqtt.client - DEBUG - Sending PUBLISH (d0, q0, r1, m85), 'b'xxx/messages'', ... (129 bytes)
2024-08-21 14:21:40,307 - PES - INFO - OperationInstruction xxx for order on station 1 sent
2024-08-21 14:21:40,310 - paho.mqtt.client - DEBUG - Sending PUBCOMP (Mid: 34)
2024-08-21 14:21:40,362 - paho.mqtt.client - DEBUG - Sending PUBLISH (d0, q0, r0, m86), 'b'xxx/messages'', ... (101 bytes)
2024-08-21 14:21:40,362 - PES - DEBUG - Next task in StateMachine is filetransfer-out
2024-08-21 14:21:40,366 - paho.mqtt.client - DEBUG - Sending PUBLISH (d0, q0, r0, m87), 'b'xxx/messages'', ... (546 bytes)
2024-08-21 14:21:40,366 - PES - DEBUG - Statemachine: {xxx}
2024-08-21 14:21:40,369 - paho.mqtt.client - DEBUG - Sending PUBLISH (d0, q0, r1, m88), 'b'xxx/statemachine'', ... (476 bytes)
2024-08-21 14:21:40,371 - paho.mqtt.client - ERROR - failed to receive on socket: [SSL: BAD_LENGTH] bad length (_ssl.c:2406)
2024-08-21 14:21:40,371 - paho.mqtt.client - DEBUG - Sending PUBLISH (d0, q0, r0, m89), 'b'xxx/messages'', ... (738 bytes)
2024-08-21 14:21:40,371 - PES - DEBUG - Order to upload: {xxx}
2024-08-21 14:21:40,372 - PES - WARNING - MQTT Broker disconnected, reason code Unspecified error
2024-08-21 14:21:40,375 - PES - INFO - Transfer MachineProgram on Pallet : xxx
2024-08-21 14:21:40,380 - PES - DEBUG - Uploading file xxx
2024-08-21 14:21:41,415 - paho.mqtt.client - DEBUG - Sending CONNECT (u1, p1, wr0, wq0, wf0, c1, k45) client_id=b'xxx_xxx_-636'
2024-08-21 14:21:41,458 - paho.mqtt.client - DEBUG - Received CONNACK (0, 0)
2024-08-21 14:21:41,459 - paho.mqtt.client - DEBUG - Sending PUBLISH (d0, q0, r1, m93), 'b'xxx/messages'', ... (88 bytes)
2024-08-21 14:21:41,459 - PES - WARNING - PES reconnected to MQTT Broker
2024-08-21 14:21:41,462 - paho.mqtt.client - DEBUG - Sending PUBLISH (d0, q0, r0, m94), 'b'xxx/messages'', ... (81 bytes)
2024-08-21 14:21:41,462 - PES - DEBUG - Resubscribe on all topics

Theoretically I could change the outgoing big message to QoS 0 and wait for the issue again, but perhaps we are able to fix the issue in a cleaner way.

@MattBrittan
Contributor

Are you able to test sending the large message over the same link with another tool (e.g. mosquitto_pub or MQTTX; if using mosquitto_pub then use --repeat to keep the connection open)? I just want to eliminate as many possibilities as I can (I've seen similar issues where there was a problem with the MTU; I doubt that's the problem here, but it's worth a try!).

@buhln
Author

buhln commented Aug 22, 2024

Now, the real tricky part starts...

As discussed, the issue happens only on one node and not all the time. On other nodes we are transmitting even bigger PDFs, up to 15 MB, without problems.

The PDF is sent to a frontend in React. I assume that the operator who is using the frontend on this node keeps that function of the frontend open all the time, and it automatically requests the PDF file from the backend whenever another kind of message is received from the backend by the frontend (Received PUBLISH (d0, q2, r0, m34), 'xxx/sendopinstruction', ...).

So the only difference I see at the moment between the affected node and all the other nodes without the issue is the load on the mqtt loop right at that moment. The load in number of messages is not really higher than on the other nodes at this point in the backend's functionality; the difference is just that one additional big message comes on top.

My understanding is that the client hands messages over to the loop and the loop schedules them. So load on the loop should not be the problem in general?

@buhln
Author

buhln commented Aug 23, 2024

This morning we had a variation of the error which led to a disconnect from the broker. The paho lib didn't recognize the disconnect and therefore it didn't reconnect. The loop didn't crash.

2024-08-23 06:52:52,533 - PES - DEBUG - Incoming request for OperationInstruction for order on station 1
2024-08-23 06:52:52,545 - paho.mqtt.client - DEBUG - Sending PUBLISH (d0, q1, r0, m223), 'b'xxx/opinstruction'', ... (1698928 bytes)
2024-08-23 06:52:52,547 - paho.mqtt.client - DEBUG - Sending PUBLISH (d0, q0, r1, m224), 'b'xxx/messages'', ... (129 bytes)
2024-08-23 06:52:52,547 - PES - INFO - OperationInstruction xxx.pdf for order on station 1 sent
2024-08-23 06:52:52,549 - paho.mqtt.client - DEBUG - Sending PUBCOMP (Mid: 57)
2024-08-23 06:52:52,549 - paho.mqtt.client - DEBUG - Sending PUBLISH (d0, q0, r1, m225), 'b'xxx/messages'', ... (83 bytes)
2024-08-23 06:52:52,549 - PES - INFO - Send Order sequence complete
2024-08-23 06:52:52,553 - paho.mqtt.client - DEBUG - Sending PUBLISH (d0, q0, r1, m226), 'b'xxx/messages'', ... (88 bytes)
2024-08-23 06:52:52,554 - paho.mqtt.client - ERROR - failed to receive on socket: EOF occurred in violation of protocol (_ssl.c:2406)
2024-08-23 06:52:52,555 - paho.mqtt.client - ERROR - failed to receive on socket: [SSL: BAD_LENGTH] bad length (_ssl.c:2406)

Again, it was in the context of sending the big message 223 (1698928 bytes). I have now reduced the QoS for this message from 1 to 0. Again it looks like the PUBACK msg for this big message was the first one not received.

In issue #637 (comment) it is mentioned that it could be a kind of race condition on the socket, especially because the timing and order in which the mqtt messages are sent could be influenced by the operator's behaviour on the frontend side. This could be the reason why we have this issue on one node only.

Perhaps I have to mention that from time to time our application also publishes single messages with QoS 2 (but not TLS encrypted) to a different broker using the publish.single() function. I will now try to investigate whether this happens in the same period of time in which the big message is sent.
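
For reference, these calls look roughly like this (hostname and topic are placeholders); as far as I understand, publish.single() opens its own temporary connection for every call, so it should not reuse the socket of the long-running client:

from paho.mqtt import publish

# publish.single() creates its own temporary client, connects, publishes and disconnects,
# so it does not share the socket of the long-running mqtt_client.Client instance.
publish.single(
    "other/topic",                         # placeholder topic
    payload="status update",               # placeholder payload
    qos=2,
    hostname="other-broker.example.com",   # placeholder host
    port=1883,
)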

@buhln
Author

buhln commented Sep 20, 2024

Today we had the following error message and a disconnect from the broker. The client did not reconnect to the broker.

2024-09-19 19:17:13,629 - paho.mqtt.client - ERROR - failed to receive on socket: [SSL: BAD_LENGTH] bad length (_ssl.c:2406)
2024-09-19 19:17:13,631 - paho.mqtt.client - ERROR - failed to receive on socket: self._sock is None

Any ideas on how to catch this error and trigger a reconnect? My code is available in one of the earlier posts.
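
What I am considering is something along these lines, building on my mqttClient class from above (an untested sketch combining the watchdog idea suggested earlier with is_connected() and reconnect(); I am not sure whether calling reconnect() like this is safe while loop_start() is running):

import time

mqtt = mqttClient()
mqtt.client.loop_start()  # run the network loop in a background thread instead of loop_forever()

while True:
    time.sleep(10)
    if not mqtt.client.is_connected():
        logPes.warning("Client reports not connected, forcing reconnect")
        try:
            mqtt.client.reconnect()
        except Exception as error:
            logPes.error(f"Manual reconnect failed: {error}")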
