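Beginner question: I want an LLM to fully memorize the code of a single repository. Should that happen in the pretraining stage, the SFT stage, or some other stage? #117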

Open
kill136 opened this issue Jan 18, 2025 · 2 comments

Comments

@kill136

kill136 commented Jan 18, 2025

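As the title says. Does the dataset format for the repository's code differ between stages? Does it need to be preprocessed into a specific format? For example, this repository: https://github.com/nocobase/nocobase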

@jingyaogong
Owner

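Tasks that need strong memorization belong in the pretrain stage.

For post-training, that is, continuing a post-pretrain with a small learning rate on top of a general pretrained model, simply overfit on this repository.

Instruction fine-tuning needs data in a chat template. It is a highly supervised process with very high data-quality requirements, so it is not a good fit.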
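The thread does not specify a data format, so the following is only a minimal sketch of how a repository could be turned into raw-text records for continued pretraining. Everything in it is an assumption for illustration: the one-JSON-object-per-line layout with a `text` field, the `# file:` path header, the extension and skip lists, and the `max_chars` chunk size. The helper names `iter_source_files`, `build_pretrain_records`, and `write_jsonl` are hypothetical and are not taken from any particular training framework.

```python
import json
from pathlib import Path

# Assumption: which extensions count as source; adjust for the target repository.
SOURCE_SUFFIXES = {".ts", ".tsx", ".js", ".json", ".md"}
# Assumption: directories that hold dependencies or build output, not authored code.
SKIP_DIRS = {".git", "node_modules", "dist", "build"}


def iter_source_files(repo_root):
    """Yield source files under repo_root in sorted order, skipping SKIP_DIRS."""
    root = Path(repo_root)
    for path in sorted(root.rglob("*")):
        rel_parts = path.relative_to(root).parts
        if any(part in SKIP_DIRS for part in rel_parts):
            continue
        if path.is_file() and path.suffix in SOURCE_SUFFIXES:
            yield path


def build_pretrain_records(repo_root, max_chars=4000):
    """Turn each non-empty file into one or more raw-text records."""
    root = Path(repo_root)
    records = []
    for path in iter_source_files(root):
        rel = path.relative_to(root).as_posix()
        text = path.read_text(encoding="utf-8", errors="ignore")
        if not text.strip():
            continue
        # Split long files into chunks. Every chunk repeats the path header so
        # each record states which file its code came from.
        for start in range(0, len(text), max_chars):
            chunk = text[start:start + max_chars]
            records.append({"text": f"# file: {rel}\n{chunk}"})
    return records


def write_jsonl(records, out_path):
    """Write one JSON object per line (hypothetical pretrain corpus layout)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

Under these assumptions, `write_jsonl(build_pretrain_records("path/to/nocobase"), "repo_pretrain.jsonl")` would produce a plain-text corpus with no question/answer structure. An SFT dataset would instead need curated conversation pairs in the model's chat template, which is the part the reply above describes as unsuitable here.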

@kill136
Author

kill136 commented Jan 20, 2025

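> Tasks that need strong memorization belong in the pretrain stage.
>
> For post-training, that is, continuing a post-pretrain with a small learning rate on top of a general pretrained model, simply overfit on this repository.
>
> Instruction fine-tuning needs data in a chat template. It is a highly supervised process with very high data-quality requirements, so it is not a good fit.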

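Thanks for the guidance. I want to build a Cursor-like IDE, which first requires a local code model that strongly memorizes the entire repository in real time. A few follow-up questions: how should the post-pretrain dataset be built, does it need any particular format, and how can the repository be kept coherent as a whole?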
