
major redesign, reprising connection as DuplexConnection #102

Merged
merged 24 commits into master from the connection branch on May 16, 2021

Conversation

robertream (Collaborator)

@russelltg
It appears that we have started to step on each other's toes, so I'm submitting this mostly finished draft pull request, where I have been exploring how to clean up the tokio I/O loop. That led me down the path of reintroducing a connection struct, one that this time encapsulates the Sender and Receiver. I know this is a very large change set, but please consider the benefits of the design direction I am exploring here. I am contradicting my prior advocacy for keeping the sender and receiver algorithms free of communication and capable of running concurrently; since we now run the whole I/O loop from one thread, concurrency is less important anyhow. I hope you agree that this has significantly simplified the code. I have two different versions of the I/O loop, which are currently alternated somewhat randomly at runtime. One approach uses an Action/Input enum pair and a single function to drive the loop, similar to the prior design of the Sender/Receiver enums. The other approach uses a set of command and query methods to drive the loop. I prefer the ergonomics of the enum approach, and it made debug tracing the code trivial too.
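
To give a feel for the enum approach, here is a minimal, self-contained sketch; the Input/Action/Connection names and variants below are hypothetical stand-ins, not the actual types in this branch:

```rust
use std::time::Instant;

// Hypothetical Input/Action pair for a single-threaded I/O loop; the real
// types in this branch may differ.
enum Input {
    Packet(Vec<u8>),     // a packet arrived from the socket
    DataToSend(Vec<u8>), // the application wants to send data
    Timer(Instant),      // a scheduled timer fired
}

#[derive(Debug)]
enum Action {
    SendPacket(Vec<u8>),  // write this packet to the socket
    ReleaseData(Vec<u8>), // hand received data to the application
    WaitUntil(Instant),   // nothing to do until this deadline
}

struct Connection;

impl Connection {
    // A single function drives the whole loop: feed an Input in, get an Action
    // out. Tracing is trivial because every step is an (Input, Action) pair.
    fn handle(&mut self, input: Input, now: Instant) -> Action {
        match input {
            Input::Packet(p) => Action::ReleaseData(p),
            Input::DataToSend(d) => Action::SendPacket(d),
            Input::Timer(_) => Action::WaitUntil(now),
        }
    }
}

fn main() {
    let mut conn = Connection;
    let now = Instant::now();
    for input in [Input::DataToSend(vec![1, 2, 3]), Input::Packet(vec![4, 5])] {
        println!("{:?}", conn.handle(input, now));
    }
}
```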

One of the larger tests you added recently doesn't compile with all the changes I made, so I commented it out. I don't have the time to go back and fix it right now, but I can later; I just felt it was important that you take a look at the changes I have made before our paths diverge too far.

I also fixed several bugs that I ran into while testing. For instance, we weren't respecting the SND timer in the sender algorithm, which broke congestion control. The timestamp_rollover test ran much faster than real time with this bug, but with the fix it no longer does, so I have set it to ignored for now. We should replace this long-running test with a few more focused unit tests; in fact, it might already be mostly covered by the existing TimeStamp unit tests.

I'd like to finish this up and move onto a first pass at metrics, so quick feedback would be appreciated.

russelltg (Owner) commented May 10, 2021

I'll take a deeper look at this in a few days. In the meantime--

All the srt-protocol integration tests are meant to run much faster than real time; they never check the wall clock, they just simulate time passing. So you're saying that the algorithm now takes a full CPU core to run?

Also, I've been using this test to see how well it can deal with higher-bandwidth connections. I should check it in and just set it to be ignored, but for now you can grab it with `git restore -s high_bandwidth -- ./srt-tokio/tests/high_bandwidth.rs`. It seems to freeze up now.

I suggest you run it with `cargo test --test=high_bandwidth -- --nocapture`, as it reports statistics. It's not a proper test with pass/fail behavior; it just tries to push a few MB/s through the connection.

russelltg (Owner) commented May 10, 2021

Also, nice: apparently proptest ICEs the nightly compiler! 🥳 rust-lang/rust#85128

robertream (Collaborator, Author)

> I'll take a deeper look at this in a few days. In the meantime--
>
> All the srt-protocol integration tests are meant to run much faster than real time; they never check the wall clock, they just simulate time passing. So you're saying that the algorithm now takes a full CPU core to run?

The problem is that sending data should be limited by the SND timer; this is what enables congestion control to function. The current implementation can potentially run the sender algorithm every time the next "action" is pulled, regardless of the SND timer. Sending should only happen when the clock ticks forward far enough to trigger an SND event, but enforcing this seems to have added some computational cost when running the faster-than-real-time simulation tests. The algorithm shouldn't take a full CPU core to run; the tests just run noticeably slower. This could also be due to inefficiencies in the test harness/simulation code. I haven't spent much time optimizing yet.
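
As a rough illustration of what I mean by gating sends on the SND timer (the field and method names here are hypothetical, not the actual srt-protocol API):

```rust
use std::time::{Duration, Instant};

// Illustrative only: a sender that refuses to emit the next data packet until
// the SND period (set by congestion control) has elapsed.
struct Sender {
    snd_period: Duration, // inter-packet gap chosen by congestion control
    next_snd: Instant,    // earliest time the next packet may be sent
    queue: Vec<Vec<u8>>,  // packets waiting to go out
}

impl Sender {
    fn next_action(&mut self, now: Instant) -> Option<Vec<u8>> {
        // Sending is driven by the SND timer, not by how often the loop polls:
        // if the timer hasn't fired yet, do nothing even if packets are queued.
        if now < self.next_snd || self.queue.is_empty() {
            return None;
        }
        self.next_snd = now + self.snd_period;
        Some(self.queue.remove(0))
    }
}

fn main() {
    let start = Instant::now();
    let mut sender = Sender {
        snd_period: Duration::from_millis(10),
        next_snd: start,
        queue: vec![vec![1], vec![2]],
    };
    assert!(sender.next_action(start).is_some()); // timer due: send
    assert!(sender.next_action(start).is_none()); // too early: wait
    assert!(sender.next_action(start + Duration::from_millis(10)).is_some()); // next tick
}
```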

> Also, I've been using this test to see how well it can deal with higher-bandwidth connections. I should check it in and just set it to be ignored, but for now you can grab it with `git restore -s high_bandwidth -- ./srt-tokio/tests/high_bandwidth.rs`. It seems to freeze up now.
>
> I suggest you run it with `cargo test --test=high_bandwidth -- --nocapture`, as it reports statistics. It's not a proper test with pass/fail behavior; it just tries to push a few MB/s through the connection.

Thanks, that is helpful.

russelltg (Owner) left a comment

Overall this is much cleaner, and I would be happy to merge it with some more work.

srt-protocol/src/protocol/handshake/mod.rs (resolved)
```diff
@@ -118,7 +108,7 @@ pub struct Receiver {
     receive_buffer: RecvBuffer,

     /// Shutdown flag. This is set so when the buffer is flushed, it returns Async::Ready(None)
-    shutdown_flag: bool,
+    shutdown_flag: Option<Instant>,
```
russelltg (Owner)

Update comment here--it's no longer a flag--and I'm not sure what particular instant this refers to

robertream (Collaborator, Author)

Right, this is a shutdown with a timeout. There were many tests that would fail intermittently due to injected packet loss or race conditions; this was particularly true of the tokio tests. The connect and rendezvous algorithm tests were failing intermittently for the same reason, so we need to implement timeouts for them as well.
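
Roughly, the idea behind the Option<Instant> is a drain-with-deadline; this toy sketch uses hypothetical names and is not the actual Receiver code:

```rust
use std::time::{Duration, Instant};

// Hypothetical illustration of the Option<Instant> shutdown deadline: None means
// "not shutting down"; Some(deadline) means "drain, but give up at the deadline".
struct Receiver {
    shutdown_deadline: Option<Instant>,
    buffered_packets: usize,
}

impl Receiver {
    fn begin_shutdown(&mut self, now: Instant, timeout: Duration) {
        self.shutdown_deadline = Some(now + timeout);
    }

    // The connection is finished once the buffer drains or the deadline passes,
    // so a lost packet can no longer stall shutdown forever.
    fn is_finished(&self, now: Instant) -> bool {
        match self.shutdown_deadline {
            Some(deadline) => self.buffered_packets == 0 || now >= deadline,
            None => false,
        }
    }
}

fn main() {
    let now = Instant::now();
    let mut recv = Receiver { shutdown_deadline: None, buffered_packets: 3 };
    recv.begin_shutdown(now, Duration::from_secs(1));
    assert!(!recv.is_finished(now));                         // still draining
    assert!(recv.is_finished(now + Duration::from_secs(2))); // timed out
}
```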

```rust
// list before progressing further through the sender algorithm. This
// appears to be inconsistent with the UDT spec. Is it consistent
// with the reference implementation?
return;
```
russelltg (Owner)

I don't think this is right, as it doesn't set the SND timer period; I think it is more correct not to return here. Not sure, though.

robertream (Collaborator, Author)

I agree, and this is something we ought to cover with a focused unit test. I also think the "every 16th packet rapid send" part of the algorithm should only run when we send a new data packet, which is how the reference implementation works; I believe that's why I had it returning here. I intended to go back and fix it.
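
As a toy model of what I mean (illustrative names only, not the crate's API): the 16-packet probe would be triggered only when a new data packet goes out, and it simply skips the SND wait before the next send:

```rust
use std::time::{Duration, Instant};

struct Sender {
    snd_period: Duration,
    next_snd: Instant,
    new_packets_sent: u64,
}

impl Sender {
    fn on_new_data_sent(&mut self, now: Instant) {
        self.new_packets_sent += 1;
        // Every 16th new data packet is sent back-to-back with the next one to
        // probe link capacity, so the SND wait is skipped; retransmissions and
        // control packets never reach this branch.
        self.next_snd = if self.new_packets_sent % 16 == 0 {
            now
        } else {
            now + self.snd_period
        };
    }
}

fn main() {
    let now = Instant::now();
    let mut s = Sender {
        snd_period: Duration::from_millis(10),
        next_snd: now,
        new_packets_sent: 15,
    };
    s.on_new_data_sent(now);
    assert_eq!(s.next_snd, now); // 16th packet: no SND delay before the next send
}
```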

```diff
-    fn handle_handshake_packet(&mut self, handshake: HandshakeControlInfo, now: Instant) {
-        if let Some(control_type) = self.handshake.handle_handshake(&handshake) {
+    pub fn handle_handshake_packet(&mut self, handshake: HandshakeControlInfo, now: Instant) {
+        if let Some(control_type) = self.handshake.handle_handshake(handshake) {
```
russelltg (Owner)

Consider moving this logic into DuplexConnection? Thoughts?

robertream (Collaborator, Author)

Agreed. I was refactoring incrementally and just hadn't gotten to it yet. I also intend to move all the connection Drain/Shutdown/Close logic into DuplexConnection, along with the keepalive timer and a few other things.

robertream (Collaborator, Author)

OK, I want to leave this on the Sender for now, because DuplexConnection doesn't have the means to buffer outgoing packets yet, and maybe it never will. For this PR I'm going to leave it the way it is. Eventually I will probably explore managing buffers externally using some form of context or dependency injection, perhaps via a "send" Fn, an arena allocator, etc.
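
For what it's worth, the "send" Fn idea would look something like this sketch; everything here is hypothetical and only shows the dependency-injection shape:

```rust
// Illustrative sketch: the sender borrows a closure to emit packets instead of
// buffering them itself, so the caller decides how output is buffered.
struct Sender;

impl Sender {
    fn on_snd(&mut self, send: &mut dyn FnMut(&[u8])) {
        // Instead of pushing into an internal output queue, hand the packet
        // straight to the injected sink. The bytes here are just placeholders.
        send(&[0x80, 0x00, 0x00, 0x01]);
    }
}

fn main() {
    let mut out: Vec<Vec<u8>> = Vec::new();
    let mut sender = Sender;
    sender.on_snd(&mut |pkt| out.push(pkt.to_vec()));
    assert_eq!(out.len(), 1);
}
```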

```rust
    fn on_snd(&mut self, now: Instant) {
        let now_ts = self.transmit_buffer.timestamp_from(now);
        let window_length = self.metrics.rtt + self.settings.send_tsbpd_latency;
        self.send_buffer.release_late_packets(now_ts, window_length);
```
russelltg (Owner)

Does releasing late packets like this require sending a DropRequest packet?


robertream (Collaborator, Author)

Yes, I believe so. I implemented this to fix a stall condition while draining on close/shutdown, but that case should be handled by the send_buffer.pop() logic further down in on_snd, so I should either properly implement DropRequest or remove this for now.
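
A rough sketch of what pairing the late-packet release with a drop request could look like; DropRequest and the buffer model here are hypothetical, not the actual implementation:

```rust
struct DropRequest {
    first_seq: u32,
    last_seq: u32,
}

struct SendBuffer {
    packets: Vec<u32>, // sequence numbers, oldest first (toy model)
}

impl SendBuffer {
    // Drop everything older than `oldest_live_seq` and report the dropped range
    // so the receiver can stop waiting for (and NAKing) those packets.
    fn release_late_packets(&mut self, oldest_live_seq: u32) -> Option<DropRequest> {
        let late: Vec<u32> = self
            .packets
            .iter()
            .copied()
            .filter(|&s| s < oldest_live_seq)
            .collect();
        self.packets.retain(|&s| s >= oldest_live_seq);
        match (late.first(), late.last()) {
            (Some(&first), Some(&last)) => Some(DropRequest { first_seq: first, last_seq: last }),
            _ => None,
        }
    }
}

fn main() {
    let mut buf = SendBuffer { packets: vec![10, 11, 12, 13] };
    let req = buf.release_late_packets(12).unwrap();
    assert_eq!((req.first_seq, req.last_seq), (10, 11)); // range to report upstream
    assert_eq!(buf.packets, vec![12, 13]);               // only live packets remain
}
```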

russelltg (Owner)

I think we should not have that as part of this PR. Either leave it with a TODO comment and a linked issue, or remove it and leave a TODO with a linked issue.

srt-protocol/src/protocol/sender/mod.rs (outdated, resolved)
```rust
        } else if self.status.should_drain() && self.send_buffer.len() == 1 {
            if let Some(packet) = self.send_buffer.pop() {
                self.send_data(packet, now);
                self.lr_acked_packet = self.transmit_buffer.next_sequence_number;
```
russelltg (Owner)

Again here

robertream (Collaborator, Author)

I intended to at least remove this redundancy.

srt-protocol/tests/helpers/simulation.rs (outdated, resolved)
srt-protocol/tests/timestamp_rollover.rs (outdated, resolved)
srt-protocol/tests/timestamp_rollover.rs (outdated, resolved)
russelltg (Owner) commented May 12, 2021

So things that need to happen before I merge this:

  • Rename and recomment shutdown_flag
  • Comment the missing DropRequest implementation
  • Fix ensure_open to either be less confusing or remove it
  • Fix not_enough_latency test--requires seeing if we want to keep the connection initialization as part of that
  • Uncomment the DuplexConnection unit tests

As well as any other cleanup you want to do.

Things that don't need to be part of this PR but could be, or could become new issues:

codecov bot commented May 14, 2021

Codecov Report

Merging #102 (f5cf152) into master (b5fefd0) will increase coverage by 0.66%.
The diff coverage is 78.81%.

❗ Current head f5cf152 differs from pull request most recent head 7601b25. Consider uploading reports for the commit 7601b25 to get more accurate results

```diff
@@            Coverage Diff             @@
##           master     #102      +/-   ##
==========================================
+ Coverage   81.37%   82.03%   +0.66%
==========================================
  Files          42       42
  Lines        6373     6764     +391
==========================================
+ Hits         5186     5549     +363
- Misses       1187     1215      +28
```
| Impacted Files | Coverage Δ |
|---|---|
| srt-protocol/src/pending_connection.rs | 32.87% <0.00%> (-0.46%) ⬇️ |
| srt-protocol/src/pending_connection/connect.rs | 87.35% <ø> (-2.49%) ⬇️ |
| srt-tokio/src/lib.rs | 100.00% <ø> (ø) |
| srt-protocol/src/pending_connection/rendezvous.rs | 81.17% <50.00%> (-0.36%) ⬇️ |
| srt-tokio/src/pending_connection.rs | 80.50% <50.00%> (-0.23%) ⬇️ |
| srt-tokio/src/tokio/socket.rs | 62.17% <57.72%> (-20.34%) ⬇️ |
| srt-protocol/src/protocol/receiver/mod.rs | 91.59% <80.00%> (+3.27%) ⬆️ |
| srt-protocol/src/protocol/sender/mod.rs | 88.84% <82.14%> (+0.47%) ⬆️ |
| srt-protocol/src/connection.rs | 88.14% <88.04%> (-11.86%) ⬇️ |
| srt-protocol/src/lib.rs | 100.00% <100.00%> (ø) |
... and 31 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b5fefd0...7601b25.

russelltg (Owner)

I'll fix up the not_enough_latency test

russelltg marked this pull request as ready for review May 16, 2021 01:42
russelltg (Owner)

Waiting for CI to run; apparently there's an issue with GitHub Actions right now. Will merge if everything looks good.

russelltg merged commit 8d922ea into russelltg:master May 16, 2021
robertream deleted the connection branch June 17, 2021 14:58