Optimization of importer execution logic and output information. #285

Open
GangLiCN opened this issue Sep 5, 2023 · 1 comment
Labels
type/feature req Type: feature request

Comments

GangLiCN commented Sep 5, 2023

Is your feature request related to a problem? Please describe.

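  1. The importer is slow, and users cannot tell in advance how long an import will take.
  2. The importer exits abnormally mid-run when the disk runs out of free space.
  3. The importer does not currently support resuming an interrupted import.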

Describe the solution you'd like

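  1. The output printed during an import should include more meaningful performance metrics.
    The console output currently shows only how many records have been imported so far and the network latency.
    Users would probably rather see how many records are imported per second, a performance
    metric similar to tpmC.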
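
A records-per-second figure of the kind requested above can be derived from a record counter and a monotonic clock. The following is only a minimal sketch: `ThroughputMeter` and its methods are hypothetical names made up for illustration and are not part of nebula-importer.

```python
import time


class ThroughputMeter:
    """Counts imported records and reports the average rate in records/s.

    Hypothetical helper for illustration; not part of nebula-importer.
    """

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the meter can be driven by a fake clock.
        self._clock = clock
        self._start = clock()
        self._records = 0

    def add(self, count):
        """Record that `count` more records were imported."""
        self._records += count

    def rate(self):
        """Average records per second since the meter was created."""
        elapsed = self._clock() - self._start
        return self._records / elapsed if elapsed > 0 else 0.0

    def summary(self):
        return f"{self._records} records, {self.rate():.2f} records/s"
```

A caller would invoke `add(len(batch))` after each successful batch and print `summary()` periodically.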

Suggested improvements:

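  1. While the importer runs, its output should include a progress bar
    or a progress description, for example:
    how many CSV files there are in total and which one is being processed,
    how many records the current CSV file contains,
    how many records have been imported so far,
    and the estimated time remaining.

  2. Check disk capacity before importing a dataset. If the free disk space is less than the estimated requirement,
    fail with an error saying the import cannot proceed, and print the specific details.

  3. The capacity estimate should take the underlying storage into account. If the storage is RocksDB, for example,
    write amplification can consume extra disk space, so the estimate should
    lean toward the upper bound.

  4. Resumable imports: suppose there are 20 CSV files and 10 have finished when the import of the 11th
    is interrupted because the disk is full. Could the next run of the importer start from the 11th file
    instead of importing the completed files again?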
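
The disk check and file-level resume requested above could be prototyped in a wrapper script around the importer. The sketch below rests on loud assumptions: the storage data directory is on the same host as the script (not true for a remote cluster), `WRITE_AMPLIFICATION = 3.0` is an arbitrary placeholder rather than a measured RocksDB figure, and `import_one` is a hypothetical callback that imports a single CSV file. None of these names come from nebula-importer.

```python
import json
import os
import shutil

# Assumed upper-bound multiplier for on-disk size versus CSV size.
# This is a placeholder, not a measured value for RocksDB.
WRITE_AMPLIFICATION = 3.0


def estimated_bytes(csv_path, factor=WRITE_AMPLIFICATION):
    """Pessimistic estimate of the disk space one CSV file will need."""
    return int(os.path.getsize(csv_path) * factor)


def has_enough_space(csv_path, data_dir, factor=WRITE_AMPLIFICATION):
    """True if the filesystem holding data_dir has room for the estimate."""
    return shutil.disk_usage(data_dir).free >= estimated_bytes(csv_path, factor)


def load_done(checkpoint_path):
    """Return the set of file paths recorded as fully imported."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path, encoding="utf-8") as f:
        return set(json.load(f))


def save_done(checkpoint_path, done):
    """Write the checkpoint via a temp file so a crash cannot truncate it."""
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(sorted(done), f)
    os.replace(tmp_path, checkpoint_path)


def import_all(csv_paths, data_dir, checkpoint_path, import_one,
               factor=WRITE_AMPLIFICATION):
    """Import files in order, skipping any already in the checkpoint."""
    done = load_done(checkpoint_path)
    total = len(csv_paths)
    for index, path in enumerate(csv_paths, start=1):
        if path in done:
            continue  # finished in an earlier run
        if not has_enough_space(path, data_dir, factor):
            raise RuntimeError(
                f"not enough free space in {data_dir} for {path}: "
                f"need about {estimated_bytes(path, factor)} bytes"
            )
        print(f"[{index}/{total}] importing {path}")
        import_one(path)  # hypothetical: imports a single CSV file
        done.add(path)
        save_done(checkpoint_path, done)
    return done
```

Because the checkpoint is updated only after a file finishes, a run interrupted during the 11th of 20 files would restart at the 11th. Resuming at file granularity is only safe if re-importing a partially imported file is harmless, which this sketch simply assumes.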

Describe alternatives you've considered

Additional context

Contributor

veezhang commented Sep 7, 2023

@GangLiCN Thanks for your advice.

  1. Version 4.x already reports more information, although it does not print per-file information. A sample log record:
    {"level":"info","ts":"2023-03-02T13:47:38+08:00","caller":"manager/manager.go:394","msg":"40s 14s 74.00%(48 MiB/65 MiB) Records{Finished: 1352009, Failed: 0, Rate: 33592.35/s}, Requests{Finished: 10573, Failed: 0, Latency: 9.897295ms/37.723963ms, Rate: 262.70/s}, Processed{Finished: 1352009, Failed: 0, Rate: 33592.35/s}","app":"nebula-importer"}
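
As a worked example of how a remaining-time figure can be derived, the sketch below extrapolates linearly from the byte counts in the record above. It assumes, from reading the record rather than from the importer's source, that `40s` is the elapsed time and `14s` the estimated time remaining. `estimate_remaining_seconds` is a hypothetical helper, and the importer's actual calculation may differ.

```python
def estimate_remaining_seconds(elapsed_seconds, done_bytes, total_bytes):
    """Linear extrapolation: assume the remaining bytes are processed at the
    same average speed as the bytes finished so far.

    Returns None when nothing has been processed yet, because no speed
    can be inferred in that case.
    """
    if done_bytes <= 0:
        return None
    remaining_bytes = max(total_bytes - done_bytes, 0)
    return elapsed_seconds * remaining_bytes / done_bytes


MIB = 1024 * 1024

# Figures taken from the sample record: 40s elapsed, 48 MiB of 65 MiB done.
remaining = estimate_remaining_seconds(40, 48 * MIB, 65 * MIB)
print(f"about {remaining:.0f}s remaining")  # prints "about 14s remaining"
```

With those inputs the extrapolation gives 40 × 17 / 48 ≈ 14.2 seconds, which agrees with the `14s` in the record.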

2 & 3. I think maintainers need to plan disk capacity themselves and expand it when necessary. In addition, the importer cannot know the disk usage, and there is no suitable algorithm for estimating how much disk space is needed.

  4. The importer is currently designed to be a stateless tool.

@MuYiYong What do you think?

@QingZ11 changed the title from "Importer执行逻辑和输出信息优化" to "Optimization of importer execution logic and output information." on Sep 18, 2023
@QingZ11 added the "type/feature req" (Type: feature request) label on Oct 20, 2023