Optimizing RTF code

EricSchreiber

Is there any existing way to 'clean' an RTF document, stripping out unnecessary control words? For example, WPTools creates a fairly clean RTF file that clocks in at 7KB. If I open and then save that file in Word, it bloats up to 36KB. Word just adds a lot of extra RTF junk code.

Our system can bounce RTF files around through several external systems, and users can paste in text from outside sources, like Word. We'd like a way to strip out RTF control words that have no meaningful effect on the documents.

Thanks in advance.

support

Hi,

Word will always add about 32KB of theme, style, font and color data. It gets really bad when images are used: each image is saved in two formats, compressed and uncompressed(!), and both are hex encoded. You can easily end up with a 32MB file from a 200KB RTF file.

I think there was a registry hack to configure Word to not save the double image.

Julian

EricSchreiber

So, no way to scrub the extra junk out of an RTF file and get it back down to a more reasonable size?

support

Hi,

I think there was a registry hack to make Word not save a duplicate of bitmap data which is already saved as JPEG.

The styles could probably be reduced by using a different standard template.

However, this is not under WPTools' control.

Julian

EricSchreiber

Oh, believe me, I know the problem here is entirely within Word!

I was just hoping there was some existing way to load an RTF file (ideally in WPT) that had been mucked up with all the extra garbage from Word, parse it to determine which RTF control words were actually relevant, and output a minimalist version.

For example, a document uses one or two fonts, but when Word touches it, suddenly the font table has a dozen or more entries in it. Ideally, there would be a way to strip the unused ones out. And the 6KB "themedata" section that Word adds. And the list goes on and on.
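As a rough illustration of the kind of scrubbing described above, here is a minimal Python sketch that works outside WPTools and simply drops whole RTF groups by the name of their first control word. The function names (strip_groups, _group_end) and the DROP list are assumptions made for this example, not part of any library. Whether a given group is really safe to remove depends on which programs will read the file afterwards, and the sketch does not handle raw \bin data.

```python
import re

# Word-specific destinations assumed safe to drop in this sketch.
# This list is an assumption; adjust it for the readers you need to support.
DROP = {"themedata", "colorschememapping", "latentstyles", "datastore"}

# Start of a group: "{", an optional "\*" marker, then a control word.
_GROUP_HEAD = re.compile(r"\{\s*(?:\\\*\s*)?\\([a-zA-Z]+)")


def _group_end(rtf, start):
    """Return the index just past the '}' that closes the group at rtf[start]."""
    depth = 0
    i = start
    while i < len(rtf):
        ch = rtf[i]
        if ch == "\\":
            i += 2  # skip the escaped character, e.g. \{ or \} or \\
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    raise ValueError("unbalanced braces in RTF input")


def strip_groups(rtf, drop=DROP):
    """Copy rtf, omitting every group whose first control word is in drop."""
    out = []
    i = 0
    while i < len(rtf):
        ch = rtf[i]
        if ch == "\\":
            out.append(rtf[i:i + 2])  # keep escapes intact
            i += 2
            continue
        if ch == "{":
            m = _GROUP_HEAD.match(rtf, i)
            if m and m.group(1) in drop:
                i = _group_end(rtf, i)  # jump over the whole group
                continue
        out.append(ch)
        i += 1
    return "".join(out)
```

Read the file as bytes and decode it as latin-1 before calling strip_groups, then encode the result the same way, so that no byte is altered. Passing DROP | {"nonshppict"} also removes the fallback copy of each picture that Word writes next to its \shppict group, but only do that if every reader in the chain understands \shppict. Pruning unused font table entries is harder, since the style sheet and default-font control words also refer to font numbers.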

Aside: Long ago in the BBS days, a lot of us sysops used ANSI sequences to give our boards color menus and such. There was a tool we used to draw the ANSI screens, but like Word it added a lot of stuff that wasn't strictly necessary, and we invariably ended up hand-tweaking the code to make it as small as possible for transfer across slow modems. The more things change, the more they stay the same.