cs144Lab1 | LiamY Blog

type

status

slug

date

summary

get start

最开始branch都没有，原因是fork的时候没有能把所有分支fork（选成了Copy the DEFAULT branch only），不过找到了解决方法：

Fork a repository - GitHub Docs

A fork is a new repository that shares code and visibility settings with the original “upstream” repository.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo#forking-a-repository

Creating and deleting branches within your repository - GitHub Docs

You can create or delete branches directly on GitHub.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-and-deleting-branches-within-your-repository

看看任务呢：

As part of the lab assignment, you will implement a TCP receiver: the module that receives datagrams and turns them into a reliable byte stream to be read from the socket by the application—just as your webget program read the byte stream from the webserver in Checkpoint 0.

The TCP sender is dividing its byte stream up into short segments (substrings no more than about 1,460 bytes apiece) so that they each fit inside a datagram. But the network might reorder these datagrams, or drop them, or deliver them more than once. The receiver must reassemble the segments into the contiguous stream of bytes that they started out as.

In this lab you’ll write the data structure that will be responsible for this reassembly: a Reassembler. It will receive substrings, consisting of a string of bytes, and the index of the first byte of that string within the larger stream（回忆top-down书里面第三章讲TCP部分说编号是按字节编号，每个segment的序号选择应用层的字节的首位编号）. Each byte of the stream has its own unique index, starting from zero and counting upwards. As soon as the Reassembler knows the next byte of the stream, it will write it to the Writer side of a ByteStream— the same ByteStream you implemented in checkpoint 0. The Reassembler’s “customer” can read from the Reader side of the same ByteStream.

Here’s what the interface looks like:

⋆Why am I doing this? TCP robustness(健壮性) against reordering and duplication comes from its ability to stitch（缝合） arbitrary（任意的） excerpts（节选） of the byte stream back into the original stream. Implementing this in a discrete（分离的） testable module will make handling incoming segments easier.

The full (public) interface of the reassembler is described by the Reassembler class in the reassembler.hh header. Your task is to implement this class. You may add any private members and member functions you desire to the Reassembler class, but you cannot change its public interface.

2.1 What should the Reassembler store internally?

The insert method informs the Reassembler about a new excerpt(节选) of the ByteStream, and where it fits in the overall stream (the index of the beginning of the substring).

In principle（原则上）, then, the Reassembler will have to handle three categories of knowledge:

Bytes that are the next bytes in the stream. The Reassembler should push these to the stream (output .writer()) as soon as they are known.

Bytes that fit within the stream’s available capacity but can’t yet be written, because earlier bytes remain unknown. These should be stored internally in the Reassembler.

Bytes that lie beyond the stream’s available capacity. These should be discarded. The Reassembler’s will not store any bytes that can’t be pushed to the ByteStream either immediately, or as soon as earlier bytes become known.

The goal of this behavior is to limit the amount of memory used by the Reassembler and ByteStream, no matter how the incoming substrings arrive. We’ve illustrated this in the picture below. The “capacity” is an upper bound on both:

The number of bytes buffered in the reassembled ByteStream (shown in green), and

The number of bytes that can be used by “unassembled” substrings (shown in red)

You may find this picture useful as you implement the Reassembler and work through the tests—it’s not always natural what the “right” behavior is.

积累一些有用的F&Q和Tips

When should bytes be written to the stream? As soon as possible. The only situation in which a byte should not be in the stream is that when there is a byte before it that has not been “pushed” yet.

May substrings provided to the insert() function overlap? Yes.

Will I need to add private members to the Reassembler? Yes. Substrings may arrive in any order, so your data structure will have to “remember” substrings until they’re ready to be put into the stream—that is, until all indices before them have been written.

Is it okay for our re-assembly data structure to store overlapping substrings? No. It is possible to implement an “interface-correct” reassembler that stores overlapping substrings. But allowing the re-assembler to do this undermines the notion of “capacity” as a memory limit. If the caller provides redundant knowledge about the same index, the Reassembler should only store one copy of this information.

Will the Reassembler ever use the Reader side of the ByteStream? No—that’s for the external customer. The Reassembler uses the Writer side only

So a reasonable implementation of the Reassembler might be about 50–60 lines of code for the Reassembler (on top of the starter code).

Is it okay for our re-assembly data structure to store overlapping substrings? No. It is possible to implement an “interface-correct” reassembler that stores overlapping substrings. But allowing the re-assembler to do this undermines the notion of “capacity” as a memory limit. If the caller provides redundant knowledge about the same index, the Reassembler should only store one copy of this information.

Use comments to explain complex or subtle pieces of code. Use “defensive programming”—explicitly check preconditions of functions or invariants, and throw an exception if anything is ever wrong. Use modularity in your design—identify common abstractions and behaviors and factor them out when possible. Blocks of repeated code and enormous functions will make it hard to follow your code.

You can test your code (after compiling it) with cmake --build build --target check1

至此，第一遍的check1.pdf就算读完了

开始做吧

从熟悉接口开始

private:
  ByteStream output_; // the Reassembler writes to this ByteStream

void insert( uint64_t first_index, std::string data, bool is_last_substring );

* Insert a new substring to be reassembled into a ByteStream. * `first_index`: the index of the first byte of the substring * `data`: the substring itself * `is_last_substring`: this substring represents the end of the stream * `output`: a mutable reference to the Writer * * The Reassembler's job is to reassemble(重新组装) the indexed * substrings (possibly out-of-order * and possibly overlapping) back into the original ByteStream. * As soon as the Reassembler * learns the next byte in the stream, it should write it to the output. * * If the Reassembler learns about bytes that * fit within the stream's available capacity * but can't yet be written (because earlier bytes remain unknown), * it should store them * internally until the gaps are filled in. * * The Reassembler should discard（丢弃） any * bytes that lie beyond the stream's available capacity * (i.e., bytes that couldn't be written even if earlier gaps get filled in). * * The Reassembler should close the stream after writing the last byte.

这里有些地方描述目前有疑惑：

As soon as the Reassembler learns the next byte in the stream, it should write it to the output.这里的write to the output 是什么意思？难道是使用output_私有变量使用push方法(把这个byteStream转成writer使用)？因为我之前理解的byteStream是让我们模拟TCP提供可靠数据传输的过程。的确应该是这样

第一步是先完成接口

我的思路是先完成后再一次次迭代优化，前面不用考虑太多优化步骤，不过数据结构的选择需要多考虑。

对于bytes in the Reassembler’s internal storage，~~决定放在一个unordered_map/map（这里我也不确定哪个更好，先用map吧，因为感觉增删次数比较多）的数据结构里面，key就是index，value就是string。~~

这里有个问题，一直很困扰我，因为overlap的问题，这个情况到底是什么？

考虑这个，我更换数据结构为array，主要是很困难的地方是给的会不连续，比如给了一个string三个byte，bcd（index 1），再来mbc(index 5)，再来一个c（index 2），虽然index上前进了，但是实际上没有,所以没用，但是来个n(index 4)就会被加入，这种很头疼。得想一个高效的算法解决这个问题。

第一个想法，就是换成array，把传入的string变成c_str成为c的一个数组，按照下标+index-next_bytes_index把它加入进array,这样就能在有重复的时候也能解决问题，array的大小在capacity就行因为要求传入的不会超过capacity，只需要维护。但是这样不仅资源上消耗大，需要string拆开移动，并且还有一个问题是如何实现，就是下面这个函数👇

这个函数需要我记录有多少bytes是缓存着的，上面的方式的话实现这个的开销更大了。

再想想，这样就需要对于每次insert进来的string，看看它的index是否acceptable，然后再判断它是否overlap，这个思路可以继续下去，我就在这里往下想了，结合前面的图来

通过available_capacity()可以知道红色部分可以容纳的最大大小，只要我们记录一个nextBytesIndex变量来表示下一个需要的字节，这样我们就能得到first unacceptable index=nextBytesIndex+available_capacity.

这里就要做一些工作

来的index+string_len ≤ first unacceptable index,否则超过了capacity就直接discard

C++ bitset 用法 - 长岛冰茶、 - 博客园

C++的 bitset 在 bitset 头文件中，它是一种类似数组的结构，它的每一个元素只能是０或１，每个元素仅用１bit空间。下面是具体用法构造函数 bitset常用构造函数有四种，如下注意：用字符串构造时，字符串只能包含 '0' 或 '1' ，否则会抛出

https://www.cnblogs.com/magisk/p/8809922.html

想想乱序和overlap，你不能把overlap的数据存储，要能主动鉴别，还要记录缓存的量，那么就是每次的len=string_len-overlapping_len,如何寻找overlapping可以包装为一个函数，返回值是overlapping的长度.我有一个思路，对于overlapping，维护一个bitset的数组，这样空间开销直接/8，通过检测为0表示此处没有数据，为1表示有。并且移动由位运算完成。然后乱序的问题，~~我采用priority queue，通过堆排的方式，让最小index出现，每次string加入堆的时候，进行裁剪，去掉有交集的部分进入堆（裁剪后可能变多）。~~ 我想了，换成set，使用pair来组合index和string完成排序。

C++ bitset 用法 - 长岛冰茶、 - 博客园

https://www.cnblogs.com/magisk/p/8809922.html

C++ STL bitset 容器详解 - Seaway-Fu - 博客园

C++ STL bitset 容器详解本篇随笔讲解$C++STL$中$bitset$容器的用法及常见使用技巧。 $bitset$容器概论 $bitset$容器其实就是个$01$串。可以被看作是一个$bool$数组。它比$bool$数组更优秀的优点是：节约空间，节约时间，支持基本的位运算。在$b

https://www.cnblogs.com/fusiwei/p/11720118.html

后面考虑到空间的利用，取消了bitset，而是用区间分割的方式对于新来的字符串进行裁剪，去掉overlapping进入堆，最终的实现效果很好，但是中间经历了不少debug。因为去重的时候需要考虑边界情况，最开始的思路是直接使用许多if判断来确定下标去重，后来发现可以直接提供max和min函数来规范每次左右字串的上下界下标，从而能够去重，并且包含边界情况的完整考虑。还有一个很坑的点，之前一直卡在win这个测试，是因为有可能传入的字符串已经被完整的assemble过，这种需要跳过不去处理。

附上github链接可以参考，最后优化的版本比较精简，可以倒回去看前面的commit来参考思路。

minnow

KMSorSMS • Updated Jun 30, 2024

c++积累篇（过程中积累的）

在C++中，explicit关键字用于防止类构造函数的不希望的隐式类型转换。默认情况下，C++允许使用一个参数的构造函数进行隐式类型转换。

例如，考虑以下类：

在这个例子中，MyFunction(42)是合法的，因为42可以被隐式地转换为MyClass类型。

如果你不希望这种隐式类型转换发生，你可以使用explicit关键字：

在这个例子中，MyFunction(42)会导致编译错误，因为42不能被隐式地转换为MyClass类型。你必须显式地进行类型转换，如MyFunction(MyClass(42))。