Removing Index Markers from Writer: a How-To

CVAlkan

In previous posts, I described how Writer adds extra index markers when updating an Alphabetical Index. One side effect of this behavior is that, even if an item is later removed from the concordance file, the marker remains in the text, and therefore in the index.

So, here's how to remove all of the index markers from a Writer document so you can start with a clean slate. To do this, you will need to be running LibreOffice on some flavor of Linux/Unix, or at least on a system that has a command line or some text editor with "sed" or "grep" capabilities.

1: Make a backup of your Writer document. You know the consequences if something goes amiss.
2: Open the document in Writer, and choose Save As "OpenDocument Text (Flat XML) (fodt)"
   This creates an uncompressed XML version of the document.
   On my system (Ubuntu), I was unable to decompress the odt version, as the OS complained it was malformed, but using the native capability is always a better idea.
3: Close the document and exit Writer.
4: Open a command line shell, preferably in the directory containing the fodt file.
5: Run the following command (shown on three lines; the trailing backslashes let you paste it as is):
   sed 's/<text:alphabetical-index-mark text:string-value="\([A-Za-z]*\)"\/>//g' \
   < Old_File_Name_and_Path.fodt \
   > New_File_Name_and_Path.fodt
   Depending on the file size and processor speed, this may take a bit.
   If this gives errors, you're on your own.
6: Close the command line shell.
7: Open the new "cleansed" fodt file with Writer.
8: The file should look the same but without any alphabetical index markers. (Your index formatting is still there, though.)
9: Go to where your alphabetical index is located, right click on it and select "Update Index/Table"
A: All of the index entries should disappear; if any remain, go find them on the referenced pages and manually delete them. Apparently, some of the indexes are embedded in others and aren't found by the sed command above.
   I didn't bother to try figuring out how or why that happened. I had several hundred markers, of which only five weren't removed.
B: Now, go back to the index and select Edit Index/Table, then File | Open.
C: Select the original concordance file (assuming you have it set up how you want it), and let Writer go do its thing.
D: You now have a "clean" document with no duplicate index entries.
E: LOOK AT IT CAREFULLY, of course, before replacing your original. The document I tried this on was over four hundred pages with lots of tables, graphics and so forth, and I found no problems, but it's up to you to determine if everything is ok.
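
As a small illustration of step 5 and of the leftovers mentioned in step A, here is a sketch run against a made-up fragment rather than a real document. The file names, index terms and the text:key1 value are invented for the example. It assumes the leftovers are markers whose text contains something other than plain letters (a space, a digit, a hyphen) or that carry extra attributes, which the [A-Za-z]* pattern cannot match; that is a guess, not something verified in this thread. The looser pattern at the end is likewise only a suggestion.

```shell
#!/bin/sh
# Sketch only: a made-up .fodt fragment, not a real Writer document.
workdir=$(mktemp -d)
cat > "$workdir/sample.fodt" <<'EOF'
<text:p>The <text:alphabetical-index-mark text:string-value="sed"/>sed tool.</text:p>
<text:p>Some <text:alphabetical-index-mark text:string-value="flat XML"/>flat XML here.</text:p>
<text:p>And <text:alphabetical-index-mark text:string-value="grep" text:key1="tools"/>grep too.</text:p>
EOF

# The command from step 5, unchanged apart from the file names.
sed 's/<text:alphabetical-index-mark text:string-value="\([A-Za-z]*\)"\/>//g' \
    < "$workdir/sample.fodt" > "$workdir/cleaned.fodt"

# Count the lines that still contain a marker (step A). The entry with a
# space and the entry with an extra attribute are both left behind.
remaining=$(grep -c '<text:alphabetical-index-mark ' "$workdir/cleaned.fodt" || true)
echo "lines with markers after step 5: $remaining"

# A looser pattern (an assumption, not from the original post): accept any
# attributes up to the closing "/>" of the marker.
sed 's/<text:alphabetical-index-mark [^>]*\/>//g' \
    < "$workdir/sample.fodt" > "$workdir/cleaned_loose.fodt"
remaining_loose=$(grep -c '<text:alphabetical-index-mark ' "$workdir/cleaned_loose.fodt" || true)
echo "lines with markers after the looser pattern: $remaining_loose"
```

Running the same grep on a real cleaned file before reopening it in Writer gives a quick idea of how many markers will need manual deletion.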

I hope this helps any others who might be using alphabetic indexes.
Mareich

Re: Removing Index Markers from Writer: a How-To

Thanks for this workaround. I think it will also work on Mac OS X, since sed is available in Terminal.
I will try it as soon as I get a chance.
pbw

Re: Removing Index Markers from Writer: a How-To

In reply to this post by CVAlkan
I've asked a question on this list about soffice, and this topic is the
reason I was asking.

Let's say we can use the command line to 1) convert an .odt to an .fodt
file, and 2) to convert it back again.

If so, I can write a script that uses soffice to do
1) as above
perl -e .... # perform the text substitutions
2) as above to re-establish the modified .odt file.
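
A rough sketch of that plan follows. It assumes soffice is on the PATH and that its --headless, --convert-to and --outdir options accept "fodt" and "odt" as targets; I have not checked this on every version. The helper names strip_marks and roundtrip and the "stripped" output directory are made up for the example, and the substitution step is a stand-in that reuses the sed pattern from the first post rather than the perl one-liner, which is not spelled out above.

```shell
#!/bin/sh
# Sketch only. strip_marks and roundtrip are hypothetical helper names.

# Remove index markers from the XML read on stdin.
# Stand-in for the substitution step; uses the sed pattern from the first post.
strip_marks() {
    sed 's/<text:alphabetical-index-mark text:string-value="\([A-Za-z]*\)"\/>//g'
}

# roundtrip FILE.odt: 1) .odt -> .fodt, filter, 2) .fodt -> .odt in ./stripped
roundtrip() {
    src=$1
    base=$(basename "$src" .odt)
    tmp=$(mktemp -d) || return 1
    out="$(dirname "$src")/stripped"

    # 1) Let LibreOffice write the flat XML version.
    soffice --headless --convert-to fodt --outdir "$tmp" "$src" || return 1
    # Perform the text substitution on a copy.
    mkdir -p "$tmp/filtered" "$out" || return 1
    strip_marks < "$tmp/$base.fodt" > "$tmp/filtered/$base.fodt" || return 1
    # 2) Convert the filtered flat XML back to .odt; the original is untouched.
    soffice --headless --convert-to odt --outdir "$out" "$tmp/filtered/$base.fodt"
}

# Only attempt the conversion when soffice is installed and a file was named.
if command -v soffice >/dev/null 2>&1 && [ -n "${1:-}" ]; then
    roundtrip "$1"
fi
```

Saved as a script and called with the path to an .odt file, it should leave the filtered copy in a "stripped" directory next to the original. It is probably safest to close Writer first, as in step 3 of the how-to.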

Peter West
"Other seed fell among thorns, and the thorns grew up and choked it..."

On 29/01/2014 1:45 am, CVAlkan wrote:

> In previous posts, I described how Writer adds extra index markers
> when updating an Alphabetical Index. One side effect of this behavior
> is that, even if an item is later removed from the concordance file,
> the marker remains in the text, and therefore in the index.
>
> --
> View this message in context: http://nabble.documentfoundation.org/Removing-Index-Markers-from-Writer-a-How-To-tp4094327.html
> Sent from the Users mailing list archive at Nabble.com.

--
To unsubscribe e-mail to: [hidden email]
Problems? http://www.libreoffice.org/get-help/mailing-lists/how-to-unsubscribe/
Posting guidelines + more: http://wiki.documentfoundation.org/Netiquette
List archive: http://listarchives.libreoffice.org/global/users/
All messages sent to this list will be publicly archived and cannot be deleted
CVAlkan

Re: Removing Index Markers from Writer: a How-To

Peter:

I mentioned sed and grep, but don't see any reason why perl couldn't be used as well. If you test this and it works, please post back to give others another option.

BUT: my sed command only removed the markers from the fodt. As I mentioned, I was unable to convert the odt to an fodt (essentially uncompressing the odt to readable XML) using the unzip capability of my OS (as I'm pretty sure could be done with earlier OpenOffice documents).

Since LO can easily write and read fodt files, though, it really wasn't necessary to do any file format conversion, and I didn't bother spending the time to figure out how to do everything in one shot.

Frank
pbw

Re: Removing Index Markers from Writer: a How-To

Hi Frank,

Finally got back to this. Can you check this for me against an .fodt
file, please?

  perl -pi'orig_*' -e 's/<text:alphabetical-index-mark text:string-value="[[:alpha:]]*"//g'

On 29/01/2014 11:23 pm, CVAlkan wrote:

> Peter:
>
> I mentioned sed and grep, but don't see any reason why perl couldn't be used
> as well. If you test this and it works, please post back to give others
> another option.

--
Peter West
"...and a sword will pierce through your own soul also..."

pbw

Re: Removing Index Markers from Writer: a How-To

The file name goes at the end of the command, of course.

On 4/02/2014 3:44 pm, Peter West wrote:
> Hi Frank,
>
> Finally got back to this. Can you check this for me against an .fodt
> file, please?
>
>   perl -pi'orig_*' -e 's/<text:alphabetical-index-mark text:string-value="[[:alpha:]]*"//g'

pbw

Re: Removing Index Markers from Writer: a How-To

In reply to this post by CVAlkan
Hi Frank,

This script works on OS X, to the extent that it does the conversions and
creates new filtered odt files.  The only questions are whether it works
in Linux, and whether the filter does the right thing with the index
markers.

I haven't created any files with such index-markers, so maybe you can
run it against your files to see how it goes.

http://pbw.id.au/src/sh/strip-odt-index-markers

It's a shell script, using soffice and perl.


CVAlkan

Re: Removing Index Markers from Writer: a How-To

In reply to this post by pbw
Peter:

The actual perl command should be changed slightly to:
perl -pi'orig_*' -e 's/<text:alphabetical-index-mark text:string-value="[[:alpha:]]*"\/>//g' Index_Experiment.fodt

After [[:alpha:]]*" the \/> needs to be added to remove the "/>" ending of the XML tag - otherwise it seems to work fine.

The full-blown shell script you sent me (I don't see it here on the forum for some reason) needs to be modified in the same way, of course.

I used [A-Za-z] instead of the [:alpha:] that you used because some systems don't respect that substitution syntax (I can't remember what it's called), which limits things just a bit, but a comment might be added to take care of that - the [:alpha:] syntax, again, is probably a little easier to understand for those not familiar with grep, sed and their relatives.

You should post the shell script, as it is probably easier to use (?) for some folks than my simple sed command, since it takes care of hand-holding, locating the right directories and so forth.

Of course, I hope some of the LibreOffice developers will incorporate the option and capability to remove old markers when an index is regenerated, and "fix" the generator so that it doesn't add additional markers to the same word when updating takes place. (It doesn't always do that, but I haven't figured out the exact conditions when it does).

So - good work.
Peter West

Re: Removing Index Markers from Writer: a How-To

D'uh!

Thanks Frank.  I've updated the script with that change.

Attachments aren't accepted by this list, are they? What did you mean by
"post the shell script"?

Peter

On 5/02/2014 1:54 am, CVAlkan wrote:

> Peter:
>
> The actual perl command should be changed slightly to:
> perl -pi'orig_*' -e 's/<text:alphabetical-index-mark text:string-value="[[:alpha:]]*"\/>//g' Index_Experiment.fodt
CVAlkan

Re: Removing Index Markers from Writer: a How-To

Hi Peter:

Hmmm. I hadn't noticed that there was no "attachment" button on this forum, as I haven't ever uploaded anything longer than a few lines.

Under the "More" button on the top of the message box, there is an option to "upload a file" but I don't know if other users will have easy access to that.

Maybe someone who is more familiar with this forum can advise. But even if it only goes to the LibreOffice folks themselves, I think that would be useful, since perhaps they can use it as a guide for further development of the indexing feature.

Frank
TomD

Re: Removing Index Markers from Writer: a How-To

Hi :)
CVAlkan (=Frank) is using Nabble to view things from the mailing-list
and for posting.  The 2 other ways are GMane and as just normal
emails.  Of those 3 ways it's only Nabble that has a system for
uploading 'attachments'.

Follow the links in Frank's email to get to the right place in Nabble
or go through the official LibreOffice website thru "Get Help" and
find the correct thread by looking at the subject-lines and/or
date&time.


With scripts people have often just copy&pasted the code directly into
an email rather than try to upload a file.  Either way around is fine.
Regards from
Tom :)

On 5 February 2014 12:12, CVAlkan <[hidden email]> wrote:

> Hi Peter:
>
> Hmmm. I hadn't noticed that there was no "attachment" button on this forum,
> as I haven't ever uploaded anything longer than a few lines.