I'm assuming here that WPFTS only returns a single result for a document containing the search term, even though the search term might appear several times in various parts of the document?
The Smart Excerpt algorithm tries to show every matching sentence, as long as they all fit within the excerpt length limit (e.g. 500 characters). If it finds only one sentence containing the queried word(s), it shows only that sentence. It avoids padding the excerpt with "dummy" sentences around it, in order to save screen space for other search results.
How does it decide which excerpt to display,
First, it finds the shortest sentences that contain all of the queried words. Then it finds the shortest sentence that contains the remaining words. After that, it adds more sentences if they still fit within the length limit.
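To make the order of those steps easier to follow, here is a rough Python sketch of one possible reading of them. It is not the plugin's actual code: the function names, the naive sentence splitter, and the way the length is counted (sentence lengths only, ignoring joining spaces) are all assumptions made for illustration.

```python
import re

def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words_in(sentence, query_words):
    # Queried words that occur in this sentence (case-insensitive, whole words).
    tokens = set(re.findall(r"\w+", sentence.lower()))
    return {w for w in query_words if w in tokens}

def pick_sentences(text, query, limit=500):
    query_words = {w.lower() for w in query.split()}
    hits = []
    for index, sentence in enumerate(split_sentences(text)):
        found = words_in(sentence, query_words)
        if found:
            hits.append((index, sentence, found))

    chosen = {}  # sentence index -> sentence, so document order can be restored
    used = 0
    missing = set(query_words)

    def try_add(index, sentence):
        nonlocal used
        if index in chosen or used + len(sentence) > limit:
            return False
        chosen[index] = sentence
        used += len(sentence)
        return True

    by_length = sorted(hits, key=lambda h: len(h[1]))

    # Pass 1: the shortest sentence that contains every queried word.
    full = [h for h in by_length if h[2] == query_words]
    if full and try_add(full[0][0], full[0][1]):
        missing.clear()

    # Pass 2: the shortest sentences that cover words still missing.
    for index, sentence, found in by_length:
        if found & missing and try_add(index, sentence):
            missing -= found

    # Pass 3: add any other matching sentences while they still fit.
    for index, sentence, _ in by_length:
        try_add(index, sentence)

    return [chosen[i] for i in sorted(chosen)]
```

For example, `pick_sentences("Article about cats. They are just beautiful. Dogs are fine too.", "beautiful cats")` keeps the first two sentences and leaves out the third, because no single sentence contains both words.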
and would it be possible to add a flag in the search results to state something like "Search term appears a further x times in the document"?
It depends on the query mode you're using. Some people prefer "OR" logic; in that case, any sentence that contains at least one of the words can be considered "good". With "AND" logic, only sentences that contain ALL of the words should count. In real life, however, the search algorithm does not distinguish between sentences while searching. That's intentional, because in complex texts the words can be close to each other but sit in different sentences. For example, for the phrase "beautiful cats", the following text should be recognized as good:
"Article about cats. They are just beautiful." In this case, Smart Excerpt will show both sentences. So it's simply not possible to count the exact number of occurrences.
What we actually can do is place something like "(...there are more occurrences)" at the end of the excerpt when we were unable to show all the "good" sentences because of the length limit. Good idea. I definitely need to implement this.
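Roughly, that marker idea could look like the Python sketch below. This is only an illustration, not something the plugin does today: the function name, the marker wording, and the choice not to count the marker against the limit are all assumptions.

```python
MORE_MARKER = " (...there are more occurrences)"  # hypothetical wording

def build_excerpt(good_sentences, limit=500):
    shown = []
    used = 0
    for sentence in good_sentences:
        # Count one extra character for the space that joins two sentences.
        extra = len(sentence) + (1 if shown else 0)
        if used + extra > limit:
            break
        shown.append(sentence)
        used += extra
    excerpt = " ".join(shown)
    # Append the marker only if some "good" sentences had to be left out.
    if len(shown) < len(good_sentences):
        excerpt += MORE_MARKER
    return excerpt
```

With a generous limit the excerpt comes back unchanged; with a tight one, such as `build_excerpt(["Cats purr.", "Cats sleep."], limit=12)`, only the first sentence is shown and the marker is appended.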
my current view is that the excerpt should always be a continuous copy of the original
It's usually impossible to keep all the text between two sentences, because two "good" sentences can be far apart (and they often are), so the whole span won't fit within the 500-character limit.
Yes, sometimes there isn't enough text to understand the context, since sentences can be pretty short, as in your #1 case. But people can always click on the search result item and find the context in the original post.
paragraph breaks might be good indicators, better still a double paragraph break (i.e. a blank line in the text)
It's often not simple to detect the start or the end of a paragraph, because it depends on the text type. In plain text, you can use either one or two linefeed characters to form paragraphs. In HTML, you may use <p> tags instead, or two <br> tags, etc. That's why we still stick with sentence boundaries.
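To show why the text type matters, here is a small Python sketch with one splitter per format. Both functions are hypothetical and cover only the simplest conventions: the plain-text one treats a blank line as a break (so it misses texts that use a single linefeed per paragraph), and the HTML one uses regular expressions rather than a real HTML parser.

```python
import re

def split_plain_paragraphs(text):
    # Plain text: treat one or more blank lines as a paragraph break.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_html_paragraphs(html):
    # HTML: treat <p> / </p> tags and two or more consecutive <br> tags as breaks.
    pattern = r"</?p\b[^>]*>|(?:<br\s*/?>\s*){2,}"
    parts = re.split(pattern, html, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]
```

Neither function works on the other format's input, and neither can tell whether a single linefeed in plain text was meant as a paragraph break, which is the kind of ambiguity that sentence boundaries avoid.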
One more good idea I have is to add a delimiter between sentences that are not adjacent in the original text. For example, if we removed some text between two "good" sentences, we can put "..." there to indicate that part of the text was hidden. That should clear up the mess.
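As a rough Python sketch of that delimiter idea (the function name and the input shape, a list of sentence positions paired with sentences, are assumptions for illustration only):

```python
def join_with_gaps(picked, gap=" ... "):
    # picked: (sentence_index, sentence) pairs in document order.
    parts = []
    previous = None
    for index, sentence in picked:
        if previous is not None:
            # Adjacent sentences get a plain space; skipped text gets the gap marker.
            parts.append(gap if index > previous + 1 else " ")
        parts.append(sentence)
        previous = index
    return "".join(parts)
```

Here sentences 0 and 1 would be joined with a space, while a jump from sentence 1 to sentence 4 would be joined with " ... ", so the reader can see where text was left out.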
What do you think?