Multiple newlines

John Kaufmann

Documents archived in Project Gutenberg are typically simple text, with each line ending in <CR><LF> (Hex:0D0A), so that paragraphs are separated by an empty line <CR><LF><CR><LF>. I thought it would be simple to convert one such (5657.txt) to format in Writer, but stumbled on elementary problems in Find-&-Replace [Ctrl-H] using regular expressions:

        (1) "\n" is not found. Should not "\n" match one of the codes in <CR><LF>? [If not, what code(s) should "\n" match?]

        (2) Although "$" is found (matches to <CR><LF>), "$$" (for successive occurrences of <CR><LF>) is not found. Why?

        (3) Doing Find "$" & Replace with " " (single space), <CR><LF> is replaced by " " (single space). However,
            doing Find "$" & Replace with "@" (single @char), <CR><LF> is replaced by "@@" (double @char).  Why?

John

--
To unsubscribe e-mail to: [hidden email]
Problems? https://www.libreoffice.org/get-help/mailing-lists/how-to-unsubscribe/
Posting guidelines + more: https://wiki.documentfoundation.org/Netiquette
List archive: https://listarchives.libreoffice.org/global/users/
Privacy Policy: https://www.documentfoundation.org/privacy
Brian Barker

Re: Multiple newlines

At 16:38 20/07/2020 -0400, John Kaufmann wrote:
>Documents archived in Project Gutenberg are typically simple text,
>with each line ending in <CR><LF> (Hex:0D0A), so that paragraphs are
>separated by an empty line <CR><LF><CR><LF>. I thought it would be
>simple to convert one such (5657.txt) to format in Writer, ...

It is.

>... but stumbled on elementary problems in Find-&-Replace [Ctrl-H]
>using regular expressions:
>(1) "\n" is not found. Should not "\n" match one of the codes in
><CR><LF>? [If not, what code(s) should "\n" match?]

First, once you have your text in a word processor, you do not have
<CR> or <LF> or <CR><LF> or anything else like that in your text;
instead you have *paragraph breaks*. There is no character there,
despite the pilcrow that you can get Writer to display. And what you
are calling "empty lines" are actually empty paragraphs. "\n" in the
"Search for" field matches line breaks, not paragraph breaks. (And
line breaks are line breaks - also no "codes".)

>(2) Although "$" is found (matches to <CR><LF>), ...

No, "$" does not match anything; instead, it anchors the expression
before it to the end of a paragraph. So an expression ending with "$"
will match text only if it comes at the end of its paragraph.

>... "$$" (for successive occurrences of <CR><LF>) is not found. Why?

"$$" has no sense. If anything it means "this pattern needs to match
something that is *really, really* at the end of a paragraph"!
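
To make the idea of an anchor concrete, here is a minimal sketch using Python's re module. It is only an analogy: Writer's Find & Replace has its own regex engine and its own handling of paragraph ends, so this does not reproduce Writer's results. The sample string is made up and stands in for a single paragraph.

```python
import re

paragraph = "A line of text"  # hypothetical stand-in for one paragraph

# "$" is a zero-width assertion: it matches a position, not a character.
end = re.search(r"$", paragraph)
assert end.group() == ""
assert end.span() == (len(paragraph), len(paragraph))

# "$$" asserts the same position twice, so it adds nothing.
twice = re.search(r"$$", paragraph)
assert twice.span() == end.span()

# An expression followed by "$" matches only at the end of the paragraph.
assert re.search(r"text$", paragraph) is not None
assert re.search(r"line$", paragraph) is None

# In Python, replacing "$" inserts the replacement once, at the end.
result = re.sub(r"$", "@", paragraph)
print(result)
```

The point of the sketch is only that "$" names a place in the text rather than a stored character, which is why doubling it changes nothing.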

>(3) Doing Find "$" & Replace with " " (single space), <CR><LF> is
>replaced by " " (single space). However, doing Find "$" & Replace
>with "@" (single @char), <CR><LF> is replaced by "@@" (double @char). Why?

I don't think that's true. In any case, there are no <CR><LF>s present.

To achieve what you want:

First combine single-line paragraphs:
o Apply Default paragraph style to all the text.
o Select all the text.
o Apply AutoCorrect.
(You may need to adjust the minimum length of such paragraphs in
AutoCorrect Options - possibly to 0%.)

Then remove empty paragraphs:
o Search for "^$" (no quotes) and replace with nothing.
("^" anchors your pattern to the start of a paragraph and "$" to the
end. So "^$" matches a paragraph with nothing in it.)
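
For comparison, the same two steps (join hard-wrapped lines, then drop empty paragraphs) can be sketched outside Writer on the raw text file. This is an assumption-laden illustration, not what AutoCorrect does internally: the function name unwrap_paragraphs and the sample text are hypothetical, there is no length threshold, and no paragraph styles are assigned.

```python
import re

def unwrap_paragraphs(raw: str) -> list[str]:
    """Join hard-wrapped lines into paragraphs; blank lines separate paragraphs."""
    # Normalise <CR><LF> (and bare <CR>) line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # One or more blank (or whitespace-only) lines end a paragraph.
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = []
    for block in blocks:
        # Join the lines of one paragraph with single spaces.
        joined = " ".join(line.strip() for line in block.split("\n") if line.strip())
        if joined:  # skip empty paragraphs, much like replacing "^$" with nothing
            paragraphs.append(joined)
    return paragraphs

sample = "First line of one\r\nparagraph.\r\n\r\nSecond paragraph,\r\nalso wrapped.\r\n"
for p in unwrap_paragraphs(sample):
    print(p)
```

Here the <CR><LF> codes really are characters, because the input is a plain text file and not a document loaded into a word processor.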

I trust this helps.

Brian Barker


John Kaufmann

Re: Multiple newlines

Hi Brian,

Thanks for introducing a fundamental concept that my brain had not yet grasped. As for the details:

On 2020-07-20 18:49, Brian Barker wrote:
> At 16:38 20/07/2020 -0400, John Kaufmann wrote:
>> Documents archived in Project Gutenberg are typically simple text, with each line ending in <CR><LF> (Hex:0D0A), so that paragraphs are separated by an empty line <CR><LF><CR><LF>. I thought it would be simple to convert one such (5657.txt) to format in Writer, ...
>> ... but stumbled on elementary problems in Find-&-Replace [Ctrl-H] using regular expressions:
>> (1) "\n" is not found. Should not "\n" match one of the codes in <CR><LF>? [If not, what code(s) should "\n" match?]
>
> First, once you have your text in a word processor, you do not have <CR> or <LF> or <CR><LF> or anything else like that in your text; instead you have *paragraph breaks*. There is no character there, despite the pilcrow that you can get Writer to display. And what you are calling "empty lines" are actually empty paragraphs. "\n" in the "Search for" field matches line breaks, not paragraph breaks. (And line breaks are line breaks - also no "codes".)

You sent me to do something I should have done before asking the question: examine a hex dump of an ODT content.xml file.  I see what you mean about "no codes": A paragraph is just a text string between XML bounds <text:p ...> and </text:p>, and a line break (inside the paragraph bounds) is just <text:line-break/>. Sorry; I should have done that sooner, rather than rely on a different paradigm that once worked for me.
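
The structure described above can be inspected programmatically. The following is a minimal sketch with Python's standard xml.etree.ElementTree; the XML fragment, the wrapping "body" element and the style name are hypothetical and heavily trimmed, since a real content.xml carries far more markup around its paragraphs. The namespace URI is the one ODF uses for the "text" prefix.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Hypothetical, trimmed fragment; not a complete content.xml.
fragment = f"""
<body xmlns:text="{TEXT_NS}">
  <text:p text:style-name="Standard">First line<text:line-break/>second line</text:p>
  <text:p text:style-name="Standard"/>
  <text:p text:style-name="Standard">Another paragraph</text:p>
</body>
""".strip()

root = ET.fromstring(fragment)

# Paragraphs are elements; there is no paragraph-break character to find.
paragraphs = root.findall(f"{{{TEXT_NS}}}p")
# A line break is an empty element inside a paragraph.
line_breaks = root.findall(f".//{{{TEXT_NS}}}line-break")
# An "empty line" is simply a paragraph element with no text in it.
empty = [p for p in paragraphs if not "".join(p.itertext()).strip()]

print(len(paragraphs), "paragraphs,", len(line_breaks), "line break,", len(empty), "empty")
```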


>> (2) Although "$" is found (matches to <CR><LF>), ...
>
> No, "$" does not match anything; instead, it anchors the expression before it to the end of a paragraph. So an expression ending with "$" will match text only if it comes at the end of its paragraph.
>
>> ... "$$" (for successive occurrences of <CR><LF>) is not found. Why?
>
> "$$" has no sense. If anything it means "this pattern needs to match something that is *really, really* at the end of a paragraph"!

:-)!  Given what I learned after your answer on (1), this (2) is obvious.


>> (3) Doing Find "$" & Replace with " " (single space), <CR><LF> is replaced by " " (single space). However, doing Find "$" & Replace with "@" (single @char), <CR><LF> is replaced by "@@" (double @char). Why?
>
> I don't think that's true. In any case, there are no <CR><LF>s present.

You're right. Examining again, I note that the "@@" was a remnant of a prior attempt to replace the <CR><LF><CR><LF> of the original text file with "@@" (prior to replacing all other <CR><LF> with " "). A faulty mental model led to a bad approach and a misreading of the result.


> To achieve what you want:
>
> First combine single-line paragraphs:
> o Apply Default paragraph style to all the text.
> o Select all the text.
> o Apply AutoCorrect.
> (You may need to adjust the minimum length of such paragraphs in AutoCorrect Options - possibly to 0%.)

Thank you! - that is not obvious (and I had not found it in the Help or the Writer Guide), but was worth the price of having this problem arise. AutoCorrect had the effect of changing most non-empty "Default Style" paragraphs to "Text Body" style, with the rest chosen [by spacing hints?] to be "Hanging Indent", "Heading", "Heading 1", "List", "List 1", "Numbering 2" or "Text Body Indent". (Empty paragraphs remained "Default Style".)  That was a MUCH more elaborate and sophisticated AutoCorrect than I ever would have imagined.

Now I understand the point of AutoCorrect Option "Combine single line paragraphs if length greater than 50%". But how do you "adjust the minimum length of such paragraphs in AutoCorrect Options - possibly to 0%"? (The fact that I don't find the setting suggests that I may have also missed something basic in your explanation.)

Note: Even after having this excellent explanation on a use of AutoCorrect, I went back to the Writer Guide and still don't find it. So it may take quite some time to learn enough of such subtleties to feel that I have a solid grasp.


> Then remove empty paragraphs:
> o Search for "^$" (no quotes) and replace with nothing.
> ("^" anchors your pattern to the start of a paragraph and "$" to the end. So "^$" matches a paragraph with nothing in it.)

Again I like your pedagogical approach, matching the action with the reasoning. You should be a teacher.


> I trust this helps.

Helps?! It was an education. (If you could just answer the follow-up question on AutoCorrect Options...)

Thanks again,
John

Philip Jackson-2

Re: Multiple newlines

On 21/07/2020 07:16, John Kaufmann wrote:
>
> Now I understand the point of AutoCorrect Option "Combine single line paragraphs if length greater than 50%". But how do you "adjust the minimum length of such paragraphs in AutoCorrect Options - possibly to 0%"? (The fact that I don't find the setting suggests that I may have also missed something basic in your explanation.)

In the autocorrect options, highlight the option to combine single line paragraphs greater than 50% and then tap the edit button just underneath. There you can adjust your 50%.

Philip



mikey

Re: Multiple newlines

In reply to this post by John Kaufmann
Much simpler than using a hex editor:

1. Rename the .odt to .zip and open it with a zip archive manager. You want to
look in content.xml.

2. Save the file as an .fodt (it's the same file in plain text) and view that
with a plain text editor. This works great unless you have a lot of images
in it; textified images are too much of a good thing.
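
Since an .odt is a zip archive, the first option can also be scripted without renaming anything. The sketch below uses Python's standard zipfile module. To stay self-contained it builds a stand-in archive in memory, which is not a valid .odt; the helper name read_content_xml and the sample content are hypothetical.

```python
import io
import zipfile

# Build a stand-in archive in memory. This is NOT a valid .odt; it only mimics
# the one property that matters here: a zip with a member named content.xml.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.xml", "<text:p>Hello</text:p>")  # hypothetical content

def read_content_xml(odt_file) -> str:
    """Return content.xml from an .odt path or file-like object as text."""
    with zipfile.ZipFile(odt_file) as archive:
        return archive.read("content.xml").decode("utf-8")

buffer.seek(0)
xml_text = read_content_xml(buffer)
print(xml_text)
```

With a real document you would pass the path of the .odt file to read_content_xml instead of the in-memory buffer.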


John Kaufmann

Re: Multiple newlines

In reply to this post by Philip Jackson-2
On 2020-07-21 08:57, Philip Jackson wrote:
> On 21/07/2020 07:16, John Kaufmann wrote:
>>
>> Now I understand the point of AutoCorrect Option "Combine single line paragraphs if length greater than 50%". But how do you "adjust the minimum length of such paragraphs in AutoCorrect Options - possibly to 0%"? (The fact that I don't find the setting suggests that I may have also missed something basic in your explanation.)
>
> In the autocorrect options, highlight the option to combine single line paragraphs greater than 50% and then tap the edit button just underneath. There you can adjust your 50%.

Ah! - so many little Easter eggs for the observant.
Thanks, Philip!

-john

John Kaufmann

Re: Multiple newlines

In reply to this post by mikey
On 2020-07-21 09:09, Michael H wrote:

> On Tue, Jul 21, 2020 at 12:18 AM John Kaufmann wrote:
> ...
>> You sent me to do something I should have done before asking the question:
>> examine a hex dump of an ODT content.xml file.  I see what you mean about
>> "no codes": A paragraph is just a text string between XML bounds <text:p
>> ...> and </text:p>, and a line break (inside the paragraph bounds) is just
>> <text:line-break/>. Sorry; I should have done that sooner, rather than rely
>> on a different paradigm that once worked for me.
>>
> Much simpler than using a hex editor:
>
> 1. rename the .odt to .zip and use a zip archive manager.  you want to look
> in content.xml
>
> 2. save the file as an .fodt (it's the same file in open text.) view that
> with a plain text editor. This works great unless you have a lot of images
> in it. textified images are too much of a good thing.

That is in essence what I did: look at a presentation of the file codes, using a text editor with hex sidebar. Thanks, Michael.

-john

Brian Barker

Re: Multiple newlines

In reply to this post by John Kaufmann
At 01:16 21/07/2020 -0400, John Kaufmann wrote:
>You sent me to do something I should have done before asking the question: ...

No criticism intended, of course.

>>First combine single-line paragraphs: [...]
>
>AutoCorrect had the effect of changing most non-empty "Default
>Style" paragraphs to "Text Body" style, with the rest chosen [by
>spacing hints?] to be "Hanging Indent", "Heading", "Heading 1",
>"List", "List 1", "Numbering 2" or "Text Body Indent". (Empty
>paragraphs remained "Default Style".)  That was a MUCH more
>elaborate and sophisticated AutoCorrect than I ever would have imagined.

You may or may not want all that to happen. If not, it is wise to use
this technique first, before applying any other formatting. You can
then choose to select all the text and apply a paragraph style of
choice - Default, Text Body, or whatever.

>Now I understand the point of AutoCorrect Option "Combine single
>line paragraphs if length greater than 50%". But how do you "adjust
>the minimum length of such paragraphs in AutoCorrect Options -
>possibly to 0%"? (The fact that I don't find the setting suggests
>that I may have also missed something basic in your explanation.)

As has already been said, you use the Edit... button - which will be
greyed out until you highlight the relevant option.

>Note: Even after having this excellent explanation on a use of
>AutoCorrect, I went back to the Writer Guide and still don't find it.

I suspect you are right.

>>Then remove empty paragraphs: [...]
>
>Again I like your pedagogical approach, matching the action with the
>reasoning.

Good-oh!

>You should be a teacher.

(Er, I was one.)

Brian Barker  


John Kaufmann

Re: Multiple newlines

Hi Brian,

On 2020-07-21 14:09, Brian Barker wrote:
> At 01:16 21/07/2020 -0400, John Kaufmann wrote:
> ...
>> AutoCorrect had the effect of changing most non-empty "Default Style" paragraphs to "Text Body" style, with the rest chosen [by spacing hints?] to be "Hanging Indent", "Heading", "Heading 1", "List", "List 1", "Numbering 2" or "Text Body Indent". (Empty paragraphs remained "Default Style".)  That was a MUCH more elaborate and sophisticated AutoCorrect than I ever would have imagined.
>
> You may or may not want all that to happen. If not, it is wise to use this technique first, before applying any other formatting. You can then choose to select all the text and apply a paragraph style of choice - Default, Text Body, or whatever.

As I said, I was surprised that AutoCorrect had the effect of sorting different plain text paragraphs into different paragraph styles. I would never have expected "AutoCorrect" to include a function so ambitious, and am still thinking about the implications. [What is the basis for assigning paragraphs to styles? Does this imply something reserved about certain styles, or are styles profiled (if so, how? - on what basis?) for this function?] The least one can say is that it's inarguably clever - and as a practical matter, cleaning up from that was not burdensome.

But, in time, I would love to know.  After all, every other AutoCorrect function is a product of explicit rules. I'm sure this is also, but have not had time to think about what those rules would be, and how they are built. [Just a couple of days ago I was thinking in terms of editors parsing on specific (low-ASCII) codes rather than XML statements. I'm still adjusting. ;-) ]

John

Brian Barker

Re: Multiple newlines

In reply to this post by John Kaufmann
At 01:16 21/07/2020 -0400, John Kaufmann wrote:
>You sent me to do something I should have done before asking the
>question: examine a hex dump of an ODT content.xml file. I see what
>you mean about "no codes": A paragraph is just a text string between
>XML bounds <text:p ...> and </text:p>, and a line break (inside the
>paragraph bounds) is just <text:line-break/>.

Actually, I don't think they are even that: those are just how they
are represented in Open Document Format. Remember that document
files can be saved from LibreOffice in other formats too. I think
what you have in the editing window at the end of a paragraph is
defined not by how it will be represented in any saved file but by
its properties in the window - in other words, what you can do with
it and how you do it. And the answer to that is simply a "paragraph
break" and a "line break".
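
One way to picture this distinction is a small sketch in which the document is held as an in-memory structure and each file format is just one way of writing it out. Everything here is hypothetical: the list-of-lists model, the function names, and both output conventions are made up for illustration and are not how LibreOffice stores or exports text.

```python
from xml.sax.saxutils import escape

# Hypothetical in-memory model: a document is a list of paragraphs,
# and each paragraph is a list of lines separated by line breaks.
document = [["Dear reader,", "welcome."], ["Second paragraph."]]

def to_plain_text(doc) -> str:
    """One possible plain-text form: CRLF per line, blank line between paragraphs."""
    return "\r\n\r\n".join("\r\n".join(lines) for lines in doc) + "\r\n"

def to_odf_like_xml(doc) -> str:
    """One possible XML form, loosely modelled on text:p and text:line-break."""
    return "".join(
        "<text:p>" + "<text:line-break/>".join(escape(line) for line in lines) + "</text:p>"
        for lines in doc
    )

print(repr(to_plain_text(document)))
print(to_odf_like_xml(document))
```

The model itself contains neither <CR><LF> nor XML tags; those appear only when a particular representation is chosen.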

Brian Barker  


John Kaufmann

Re: Multiple newlines

Brian,

On 2020-07-21 22:21, Brian Barker wrote:
> At 01:16 21/07/2020 -0400, John Kaufmann wrote:
>> You sent me to do something I should have done before asking the question: examine a hex dump of an ODT content.xml file. I see what you mean about "no codes": A paragraph is just a text string between XML bounds <text:p ...> and </text:p>, and a line break (inside the paragraph bounds) is just <text:line-break/>.
>
> Actually, I don't think they are even that: those are just how they are represented in Open Document Format. Remember that document files can be saved from LibreOffice in other formats too. I think what you have in the editing window at the end of a paragraph is defined not by how it will be represented in any saved file but by its properties in the window - in other words, what you can do with it and how you do it. And the answer to that is simply a "paragraph break" and a "line break".

You are right, of course: one of those "other formats" is plain text, which gets us back to the <CR><LF> ASCII codes, which is where we came into this discussion. I agree with your paradigm -- that a paragraph (or tab, or other construct) is what you see in the word processor, not the representation of that construct in any particular file format -- but I'm just stuck expressing myself in a former paradigm.
This may take time. :-)

Thanks,
John
