The entities extracted from Chinese manual documents are very messy #596
Comments
Have you tried Qwen or Doubao to extract the entities? |
To be honest, I used these models to extract entities. I also tried Baidu's Wenxin 4, and the extraction results were relatively poor. I am now going to change the prompt. |
How well does OpenAI's GPT-4 work for Chinese? Which models have you already tried? |
Previously I used llama3 and gemma2; they do not perform well on Chinese documents such as web novels. Yesterday I tried DeepSeek, and it looks good to me. I can now also load the result into Neo4j to visualize it. To try it, see my WeChat official account (微信公众号) article 喂饭教程!全网首发Neo4J可视化GraphRAG索引. |
@Nuclear6 if you are trying prompt tuning, use a large language model that is optimized for Chinese, such as Qwen or Moonshot. I previously used gemma2 9b and the results were very bad: the generated prompt was poor and the generated examples were wrong. Besides, it is hard to complete the indexing procedure with a tuned prompt; you will hit lots of errors. I tried for an entire afternoon and gave up, but you can try: python -m graphrag.prompt_tune --root . --domain "Chinese web novels" --language Chinese --chunk-size 300 --output prompt_zh Remember to update the entity types in settings.yaml when you are done with prompt tuning. |
Hi, do you mean prompts that are written in Chinese? Could you share them if possible? Thanks! |
I changed the entity extraction prompt to Chinese and got a graph with fewer entities, but they look somewhat better than the result from the English prompt. Is there any test data, such as a raw file paired with a high-quality generated graph, for comparing Chinese and English prompts for graph generation? |
I have 8 manual documents, about 130 KB in total, and the results only became somewhat better after some optimizations. For the index-building phase I used the Doubao 128k model, which costs about 10 yuan per run.
In the query phase, I found that the retrieved entities were very different from the query. The cause was that I use a custom embedding service, so the operations related to cl100k_base had to be removed. After that change, the results improved. This is my experience optimizing Chinese electronic manuals, shared for your reference. |
Thank you for sharing. I used DeepSeek to build the index; registration comes with a free quota of 5 million, which should be enough for a run.
Additionally, I have a question to ask: |
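1. The official chunking first tokenizes the document and then splits by token count, which easily produces garbled characters for Chinese. The Langchain-ChatChat open-source project splits by Chinese character count instead, which effectively avoids garbled chunks. Official chunking: https://github.com/microsoft/graphrag/blob/main/graphrag/index/verbs/text/chunk/strategies/tokens.py 2. I feel that chunking has little to do with the model. Choosing the Chinese-style chunking logic keeps sentences intact, so the model may understand the text better. 3. I did not use the official prompt tuning. Since you said it is error-prone, I used 4o to translate the original prompts and generate the corresponding templates. 4. As I understand it, whether you have one document or several makes little difference. Entities are extracted per chunk, and embeddings are then built from the entities and their descriptions; I did not see much connection to the document name. |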
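A rough illustration of point 1, using only the standard library. It assumes that a byte-level tokenizer can place a token boundary inside a multi-byte UTF-8 character; fixed-size byte windows stand in for token windows here, so this is an analogy and not the graphrag or cl100k_base code path. The function names are made up for the example.

```python
# Analogy only: fixed-size UTF-8 byte windows stand in for token windows.
# Assumption: a byte-level tokenizer may cut inside a multi-byte character.

def split_by_bytes(text: str, size: int) -> list[str]:
    """Split on raw byte offsets, then decode each window separately."""
    data = text.encode("utf-8")
    return [
        data[i:i + size].decode("utf-8", errors="replace")
        for i in range(0, len(data), size)
    ]


def split_by_chars(text: str, size: int) -> list[str]:
    """Split on character offsets, so no character is ever cut in half."""
    return [text[i:i + size] for i in range(0, len(text), size)]


text = "请先阅读说明书,再接通电源。"
byte_chunks = split_by_bytes(text, 10)
char_chunks = split_by_chars(text, 5)

print(byte_chunks)  # some windows contain U+FFFD replacement characters
print(char_chunks)  # every window is clean text
```

Each Chinese character here takes three bytes in UTF-8, so a 10-byte window necessarily ends in the middle of a character, while character-based windows never do.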
Thank you very much! I did see garbled text in the chunks; thanks for clearing that up. 👍👍👍 |
Hi, will auto prompt tuning define all the entities in my domain for me, or do I have to define them manually in the settings? |
@dinhngoc267 I recommend using the gpt-4o model to define the entity types, giving it samples of your input documents. |
In my experience, auto prompt tuning did not generate all of the domain entities. It references your input documents and generates some examples, and I feel it does not perform well. As @Nuclear6 said, it may be better to have GPT-4 generate the prompt, giving it your input documents as reference examples. |
Here is a code change that avoids the garbled-text problem caused by splitting tokens with cl100k_base. Thanks to Nuclear6 for the idea.
|
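A minimal sketch of the character-based chunking idea discussed in this thread. It is not the commenter's code and not part of graphrag or Langchain-ChatChat; the function name, the punctuation set, and the default of 300 characters are assumptions chosen for illustration.

```python
import re

# Hypothetical helper: split after common Chinese sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[。!?;\n])")


def chunk_by_chars(text: str, max_chars: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    sentences = [s for s in _SENTENCE_END.split(text) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A sentence longer than the limit is hard-split on character boundaries.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        # Start a new chunk when the next sentence would not fit.
        if len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


example = "请先阅读说明书。再接通电源!如有故障,请联系售后。"
chunks = chunk_by_chars(example, max_chars=8)
print(chunks)
```

Because every cut falls on a character boundary, and on a sentence boundary where possible, no chunk can contain a half-decoded character.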
Actually no. I tried it and the entity types were random. You can first run prompt tune and then update the entity extraction prompt with your custom entities. |
@KylinMountain Hey, I've tried this
|
@dinhngoc267 this feature is not released yet; it is still only on the main branch. You can try pulling the code. |
What is the function of the ... Sorry, I did not read the documentation carefully. I have now found it in the docs: https://microsoft.github.io/graphrag/posts/prompt_tuning/auto_prompt_tuning/ |
If you are using an open-source model that does not support JSON mode, the generated prompt does not work well, and you may hit errors such as EmptyNetworkError. You can see that the prompt in entity_extraction.txt is very poor. I have made a fix, #661, which works well for me. |
Hi @KylinMountain, did you notice that some records in the community records are in English? When a question uses those records, the answer comes out in English too. Where can I customize the prompt for the final answer? Or should I change the question in python -m graphrag.query --root ./ragtest --method local {question} to [question] + [a description of the required language]? If I do that, though, I think it will affect the node-ranking process in the retrieval step, since it does not expect noise in the question. |
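The workaround proposed in this comment could be sketched as follows. The command-line arguments are the ones quoted in the comment; the helper names and the wording of the language hint are hypothetical. As the commenter points out, the hint becomes part of the query text, so it may also influence retrieval, and this sketch does nothing to prevent that.

```python
import shlex


def with_language_hint(question: str, language: str = "Chinese") -> str:
    """Append a language instruction to the question (hypothetical helper)."""
    return f"{question}\n\nPlease answer in {language}."


def build_query_command(
    question: str, root: str = "./ragtest", method: str = "local"
) -> list[str]:
    """Build the argument list for the query command quoted in the thread."""
    return ["python", "-m", "graphrag.query", "--root", root, "--method", method, question]


cmd = build_query_command(with_language_hint("如何重置设备?"))
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

Passing the list to subprocess.run(cmd) would run the query without any manual shell quoting.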
Could you explain how exactly to use this code in graphrag? |
I got an error. |
@dinhngoc267
|
|
This error comes from the arguments you passed in. Have you made any changes? If not, please provide the detailed error log. |
Consolidating language support issues here: #696 |
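A question: your code runs as-is and handles Chinese chunks. In the code below, why is it [source_doc_idx] * len(chunk) rather than just a single [source_doc_idx]? In the create_base_text_units.csv table generated by this code, the document_ids column of every chunk has n_tokens entries, all of them duplicates. What is the purpose of this? |
That code has no particular meaning; it is only there to match the input format that graphrag expects. |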
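For readers unfamiliar with the expression in question: it is ordinary Python list repetition, which is why the same document index appears once per token. A small illustration with made-up values follows (the token ids are placeholders, not real tokenizer output).

```python
source_doc_idx = 0
chunk = [101, 102, 103, 104]  # placeholder for the token ids of one chunk

repeated = [source_doc_idx] * len(chunk)  # one copy of the index per token
single = [source_doc_idx]                 # the index exactly once

print(repeated)
print(single)
print(sorted(set(repeated)) == single)  # both forms name the same document
```

The repeated form carries no more information than the single-element list; only its length differs.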
I built the index from Chinese manual documents and found that the extracted entities were very messy. Is there a good way to improve this?