Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XML Escaping #282

Open
atollk opened this issue Jan 10, 2025 · 5 comments
Open

XML Escaping #282

atollk opened this issue Jan 10, 2025 · 5 comments
Labels
enhancement New feature or request

Comments

@atollk
Copy link
Contributor

atollk commented Jan 10, 2025

I like the feature of having a structured output via XML. However, the string concatenation as it is done in xmlStyle.ts doesn't really work. XML requires escaping of certain characters: https://stackoverflow.com/questions/1091945/what-characters-do-i-need-to-escape-in-xml-documents

Basically, if the repo you are reading with repomix contains an XML file itself, it will be parsed as part of the repo tree, not as file contents. The "&&" symbol that is used for boolean conjunction in many languages is disallowed in XML files entirely and will make the output invalid.

The proper solution would be to use an XML serializer that takes care of any escaping that is needed.

@atollk
Copy link
Contributor Author

atollk commented Jan 10, 2025

I'd be fine with submitting a PR for this but it probably requires some refactoring of src/core/output, so I'd rather check with your first.

@yamadashy
Copy link
Owner

yamadashy commented Jan 11, 2025

Hi, @atollk !
Thank you for your suggestion regarding XML character escaping!

Regarding XML character escaping, I experimented with it in the past but ultimately chose not to implement it for a few reasons:

  1. Escaping would increase token count, which is a concern when working with LLM context limits
  2. In my use cases (mainly ts, js, md, vue files), LLMs have no issues parsing the unescaped content correctly

However, I agree that proper XML escaping might be useful for some users.
We could consider adding this as an optional feature.

{
  "output": {
    "xmlStyleEscape": true,   // Enable XML escaping
    // ... other output options ...
  }
}

@yamadashy yamadashy added the enhancement New feature or request label Jan 11, 2025
@atollk
Copy link
Contributor Author

atollk commented Jan 11, 2025

Hey. Makes sense to me. Just FYI, I encountered this issue when trying to implement RAG.

If it's behind a new flag anyway, I'd propose to alter the output slightly as well to make it parseable as a single XML document. As it currently stands, you always have to manually remove the first line ("This file is a merged...") and add surrounding tags around the rest.

@atollk atollk mentioned this issue Jan 12, 2025
2 tasks
@yamadashy
Copy link
Owner

@atollk
Thank you for creating this PR!

I hadn't tried RAG myself, so this was a blind spot for me.
The direction looks really good - I particularly like the parsableStyle naming, as it's flexible enough to support other output styles like markdown in the future, not just XML.

Would you mind also adding documentation about the new parsableStyle option to the README?

@yamadashy
Copy link
Owner

@atollk
I've just released v0.2.21 with your XML escaping improvements:
https://github.com/yamadashy/repomix/releases/tag/v0.2.21

I'll handle the website updates tomorrow. Thank you for your contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants