Ridiculous xml?

doubi

Ridiculous xml?

Hi all,

I recently noticed something very odd about the contents of content.xml.
It looks like a text span has been placed around every individual word,
with a different one around every individual space:

</text:span><text:span text:style-name="T1">Web</text:span><text:span
text:style-name="T2">. </text:span><text:span
text:style-name="T1">But</text:span><text:span text:style-name="T2">
</text:span><text:span text:style-name="T1">it</text:span><text:span
text:style-name="T2"> </text:span><text:span
text:style-name="T1">is</text:span><text:span text:style-name="T2">
</text:span><text:span text:style-name="T1">not</text:span><text:span
text:style-name="T2">

...etc.

I imagine this makes the files somewhat bigger than they need to be.

More of a problem for me is that I was using a shell script to inflate
content.xml and grep it for a certain string within the text. I was
accounting for odd whitespace, but obviously this mad tagging thwarted
such a simple approach. I have now adapted with a Perl script that strips
all XML tags before grepping, but I'm still curious about why content.xml
appears this way.
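
For anyone hitting the same grep problem, one possible alternative to stripping tags is to let an XML parser join the text across span boundaries. The following is only a minimal Python sketch, not the shell or Perl script described above. The function names `paragraph_texts` and `grep_odt` are made up for illustration, and it assumes the file is a zipped .odt with a content.xml member. It does not expand `text:s`, `text:tab` or `text:line-break` elements, so runs of multiple spaces are not reproduced.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Namespace URI that ODF uses for the "text:" prefix.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
PARAGRAPH_TAGS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def paragraph_texts(content_xml: bytes):
    """Yield the plain text of each paragraph or heading, ignoring span boundaries."""
    root = ET.fromstring(content_xml)
    for elem in root.iter():
        if elem.tag in PARAGRAPH_TAGS:
            # itertext() walks nested text:span children in document order.
            yield "".join(elem.itertext())


def grep_odt(path: str, pattern: str):
    """Return the paragraphs of an .odt file whose text matches a regex (hypothetical helper)."""
    with zipfile.ZipFile(path) as odt:
        content = odt.read("content.xml")
    regex = re.compile(pattern)
    return [text for text in paragraph_texts(content) if regex.search(text)]
```

Called as `grep_odt("document.odt", r"But\s+it\s+is")` (the file name is a placeholder), this would match the phrase regardless of how many spans the words and spaces are split into.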

Might it be because the file was imported from .doc format? Is it a
transformation "bug" of some kind?

Bests,

--Ryan
_______________________________________________
LibreOffice mailing list
[hidden email]
http://lists.freedesktop.org/mailman/listinfo/libreoffice
Eike Rathke

Re: Ridiculous xml?

Hi Ryan,

On Friday, 2011-07-29 11:30:43 +0100, Ryan Jendoubi wrote:

> Recently noticed something very odd about the contents of
> content.xml. It looks like a text span has been placed around every
> individual word, with a different one around every individual space:
>
> </text:span><text:span
> text:style-name="T1">Web</text:span><text:span
> text:style-name="T2">. </text:span><text:span
> text:style-name="T1">But</text:span><text:span text:style-name="T2">
> </text:span><text:span text:style-name="T1">it</text:span><text:span
> text:style-name="T2"> </text:span><text:span
> text:style-name="T1">is</text:span><text:span text:style-name="T2">
> </text:span><text:span
> text:style-name="T1">not</text:span><text:span text:style-name="T2">

Looks like all spaces have a style assigned that is different from the
text, maybe another font or height or ... color? ;-)

Check where the styles T1 and T2 actually differ.
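
One way to do that check without reading the XML by eye is to compare the `style:text-properties` attributes of the two styles. This is a rough Python sketch under a few assumptions: the helper names `text_properties` and `diff_styles` are invented here, the input is the raw bytes of content.xml, and only `style:text-properties` is compared (other property elements are ignored). Attribute names come back in ElementTree's `{namespace}local-name` form.

```python
import xml.etree.ElementTree as ET

# Namespace URI that ODF uses for the "style:" prefix.
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"


def text_properties(content_xml: bytes, style_name: str) -> dict:
    """Return the text-properties attributes of one named style as a dict."""
    root = ET.fromstring(content_xml)
    for style in root.iter(f"{{{STYLE_NS}}}style"):
        if style.get(f"{{{STYLE_NS}}}name") == style_name:
            props = style.find(f"{{{STYLE_NS}}}text-properties")
            # A style with no text-properties child has no text attributes.
            return dict(props.attrib) if props is not None else {}
    raise KeyError(style_name)


def diff_styles(content_xml: bytes, first: str, second: str) -> dict:
    """Map each differing attribute to its (first, second) values; None means absent."""
    a = text_properties(content_xml, first)
    b = text_properties(content_xml, second)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }
```

An empty result from `diff_styles(content, "T1", "T2")` would mean the two styles carry identical text properties.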

  Eike

--
 PGP/OpenPGP/GnuPG encrypted mail preferred in all private communication.
 Key ID: 0x293C05FD - 997A 4C60 CE41 0149 0DB3  9E96 2F1A D073 293C 05FD

doubi

Re: Ridiculous xml?

Hi Eike

On 29/07/11 22:03, Eike Rathke wrote:
> Looks like all spaces have a style assigned that is different from the
> text, maybe another font or height or ... color? ;-)
>
> Check where the styles T1 and T2 actually differ.

It seems T1 and T2 have one small difference:

<style:style style:name="T1" style:family="text">
<style:text-properties fo:language="en" fo:country="GB"/>
</style:style>
<style:style style:name="T2" style:family="text">
<style:text-properties fo:language="en" fo:country="GB"
style:font-name-asian="Arial"/>
</style:style>

What I don't understand though is how these different styles could have
been applied.

I certainly didn't go through the document selecting each space and
fiddling with the Asian font settings!

So my question is how they could have been applied in such a crazy way
which is not at all noticeable in the UI?

-r

doubi

Re: Ridiculous xml?

On 30/07/11 11:50, Ryan Jendoubi wrote:
> It seems T1 and T2 have one small difference:
>
> <style:style style:name="T1" style:family="text">
> <style:text-properties fo:language="en" fo:country="GB"/>
> </style:style>
> <style:style style:name="T2" style:family="text">
> <style:text-properties fo:language="en" fo:country="GB"
> style:font-name-asian="Arial"/>
> </style:style>

That was from content.xml. Interestingly, their definition is different
when I save the document out as .fodt:

<style:style style:name="T1" style:family="text"/>
<style:style style:name="T2" style:family="text">
<style:text-properties fo:language="en" fo:country="GB"/>
</style:style>
<style:style style:name="T3" style:family="text">
<style:text-properties fo:language="en" fo:country="GB"
style:font-name-asian="Arial"/>
</style:style>

In this XML document the words and spaces alternate between T2 and T3...

So there appears to be some degree of inconsistency and arbitrariness at
least in the way these styles are stored as XML.

Of course, when I navigate through the .odt in LibreOffice the only text
style which appears to be present at those parts is "Default". Even when
I apply other text styles to a block of text, the saved XML alternates
styles between the words and spaces.

Why could this be happening?

Cheers,

-r

Michael Meeks

Re: Ridiculous xml?

Hi Ryan,

On Sat, 2011-07-30 at 12:04 +0100, Ryan Jendoubi wrote:
> That was from content.xml. Interestingly, their definition is different
> when I save the document out as .fodt:

        Hmm; that seems quite odd :-) fodt should intercept the same ODF export
filter at a very low level, certainly below the semantic understanding
of styles.

        I suspect (as you suggested) the problem is produced by an import
filter - so a good approach would be to find a very small test case,
perhaps just a "hello world" document, and chase through the style
creation process in the debugger to see where that new style is created.
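
When hunting for a small test case, a quick check of whether a candidate document shows the pattern at all may save some time. The sketch below is an assumption-laden illustration in Python, not part of LibreOffice: `span_summary` and `odt_span_summary` are hypothetical names, and "blank" here simply means a `text:span` whose text is empty or whitespace only.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace URI that ODF uses for the "text:" prefix.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
SPAN = f"{{{TEXT_NS}}}span"
STYLE_NAME = f"{{{TEXT_NS}}}style-name"


def span_summary(content_xml: bytes):
    """Count spans per style name, split into whitespace-only spans and all others."""
    blank, other = Counter(), Counter()
    for span in ET.fromstring(content_xml).iter(SPAN):
        text = "".join(span.itertext())
        style = span.get(STYLE_NAME)
        # Empty or whitespace-only text goes into "blank", anything else into "other".
        (other if text.strip() else blank)[style] += 1
    return blank, other


def odt_span_summary(path: str):
    """Run span_summary on the content.xml inside a zipped .odt file."""
    with zipfile.ZipFile(path) as odt:
        return span_summary(odt.read("content.xml"))
```

A document where nearly every space sits in its own span under a single style would show up as a large count for that style in the first counter.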

        Then again, if you can produce it just by typing "hello world" we may
have something more interesting - can you? And we should probably have a
bug filed to collect all your great research.

        All the best,

                Michael.

--
 [hidden email]  <><, Pseudo Engineer, itinerant idiot

