• Is there any existing way to 'clean' an RTF document, stripping out unnecessary tags? For example, WPTools creates a fairly clean RTF file that clocks in at 7KB. If I open and then save that file in Word, it bloats up to 36KB. Word just adds a lot of extra RTF junk code.

    Our system can bounce RTF files around through several external systems, and users can paste in text from outside sources, like Word. We'd like a way to strip out RTF control words that have no meaningful effect on the documents.

    Thanks in advance.

    • Official post

    Hi,

    Word will always add about 32KB of theme, style, font and color data. But it is really bad when images are used. Then images are saved in two formats, the compressed and the uncompressed (!), and both are also hex encoded. You can easily create a 32MB file from a 200KB RTF file.

    I think there was a registry hack to configure Word not to save the image twice.

    Julian

  • Oh, believe me, I know the problem here is entirely within Word!

    I was just hoping there was some existing way to load an RTF (ideally in WPT) that had been mucked up with all the extra garbage in Word, then parse it to determine which RTF control words and such actually were relevant, and output a minimalist version.

    For example, a document uses one or two fonts, but when Word touches it, suddenly the font table has a dozen or more entries in it. Ideally, there would be a way to strip the unused ones out. And the 6KB "themedata" section that Word adds. And the list goes on and on.
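To make the idea concrete, here is a rough Python sketch of the kind of post-processing I mean. It is not anything WPTools provides; the function names (`strip_destinations`, `prune_font_table`) are made up for illustration. It assumes the RTF is available as a plain string, contains no `\bin` binary data, and keeps its whole font table in a single `{\fonttbl ...}` group. Fonts count as "used" if their number is referenced anywhere outside the font table by `\f`, `\af`, `\deff`, `\adeff` or one of the `\stshf...` words, which is a conservative guess rather than a complete reading of the RTF spec.

```python
import re
from typing import Iterable

def find_group_end(rtf: str, start: int) -> int:
    """Return the index just past the '}' that closes the '{' at rtf[start]."""
    depth = 0
    i = start
    while i < len(rtf):
        ch = rtf[i]
        if ch == "\\":
            i += 2  # skip the backslash and the next char (covers \{ \} \\)
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unbalanced RTF group")

def strip_destinations(rtf: str, names: Iterable[str]) -> str:
    """Drop every group that opens with one of the named destination words."""
    alternatives = "|".join(re.escape(n) for n in names)
    opener = re.compile(r"\{(?:\\\*)?\\(?:" + alternatives + r")(?![A-Za-z])")
    out = []
    pos = 0
    while True:
        m = opener.search(rtf, pos)
        if m is None:
            out.append(rtf[pos:])
            return "".join(out)
        out.append(rtf[pos:m.start()])
        pos = find_group_end(rtf, m.start())  # resume after the dropped group

# Control words assumed to carry a font number (a conservative guess).
FONT_REF = re.compile(
    r"\\(?:f|af|deff|adeff|stshfdbch|stshfloch|stshfhich|stshfbi)(\d+)"
)

def prune_font_table(rtf: str) -> str:
    """Remove braced font table entries that nothing else refers to."""
    marker = "{\\fonttbl"
    start = rtf.find(marker)
    if start == -1:
        return rtf
    end = find_group_end(rtf, start)
    used = set(FONT_REF.findall(rtf[:start] + " " + rtf[end:]))
    table = rtf[start:end]
    kept = []
    i = len(marker)
    while i < len(table) - 1:  # stop before the table's closing brace
        if table[i] == "{":
            j = find_group_end(table, i)
            entry = table[i:j]
            m = re.search(r"\\f(\d+)", entry)
            # Keep entries we cannot identify, and entries still in use.
            if m is None or m.group(1) in used:
                kept.append(entry)
            i = j
        else:
            kept.append(table[i])  # preserve anything outside braced entries
            i += 1
    return rtf[:start] + marker + "".join(kept) + "}" + rtf[end:]

sample = (
    "{\\rtf1\\ansi\\deff0"
    "{\\fonttbl{\\f0\\fswiss Arial;}{\\f1\\froman Times New Roman;}"
    "{\\f2\\fmodern Courier New;}}"
    "{\\*\\themedata 504b0304}"
    "\\f1 Hello\\par}"
)
cleaned = prune_font_table(strip_destinations(sample, ["themedata"]))
print(cleaned)
```

On the sample, the `themedata` group and the unused `\f2` entry disappear while `\f0` (the default font) and `\f1` stay. The same `strip_destinations` call could presumably be pointed at other Word-specific groups such as `colorschememapping`, `datastore`, `latentstyles` or `rsidtbl`, and possibly at `nonshppict` for the duplicated images, but whether dropping those is safe for every reader in the chain is something I would test rather than assume.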

    Aside: Long ago in the BBS days, a lot of us sysops used ANSI sequences to give our boards color menus and such. There was a tool we used to draw the ANSI screens, but like Word it added a lot of stuff that wasn't strictly necessary, and we invariably ended up hand-tweaking the code to make it as small as possible for transfer across slow modems. The more things change, the more they stay the same.