Having loaded our 1,800 pdf documents (magazines etc) on to our website, and successfully implemented WPFTS to search within them, we are now turning our attention to the 40,000 historic photographs in our archive. Around half have been digitised, mainly as jpegs but with some tifs, and we are working on the remainder.
Each photo will, in due course, be annotated with a description and I would like to know if other users have experience of using WPFTS to search this type of information. Currently, the digital images are stored on hard drives, and the information is on spreadsheets, but we are researching software to combine the two, to enable the images to be published on our website, and for the scanning and researching process to be suitable for collaborative teamworking.
Posts
-
Searching data attached to image files
-
Search Results - BOOLEAN Operators and Relevance
Our implementation of WPFTS is configured with the Default Search Logic set to "AND".
I did a search of our website for "2015 Committee", both with and without the quotation marks (the results were the same), and the second and third most relevant results did indeed contain the phrase "2015 Committee"; the second twice and the third once.
However, the top result did not. It was a Bibliography for the year 2015, and contained "Committee" nine times, and "2015" 962 times, but in completely different parts of the document.
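For what it is worth, the difference between "AND", "OR" and exact-phrase matching can be sketched in a few lines of Python. This is a generic illustration only: the function names and the two sample texts are made up, and nothing here is taken from how WPFTS itself is implemented.

```python
import re

def tokens(text):
    """Lower-case word tokens; punctuation is ignored."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches_and(query, text):
    """True if every query word occurs somewhere in the text."""
    words = set(tokens(text))
    return all(q in words for q in tokens(query))

def matches_or(query, text):
    """True if at least one query word occurs somewhere in the text."""
    words = set(tokens(text))
    return any(q in words for q in tokens(query))

def matches_phrase(query, text):
    """True only if the query words occur next to each other, in order."""
    q, t = tokens(query), tokens(text)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))

# Hypothetical sample texts, loosely modelled on the two kinds of result.
bibliography = "Bibliography 2015. Items published in 2015 and reported by the Committee."
minutes = "Minutes of the 2015 Committee meeting."

print(matches_and("2015 Committee", bibliography))     # both words present
print(matches_phrase("2015 Committee", bibliography))  # but never adjacent
print(matches_phrase("2015 Committee", minutes))
```

Under this reading, a document with "2015" and "Committee" in completely different places still satisfies "AND"; only a phrase match would exclude it.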
This feels more like an "OR" result than an "AND" one, but maybe I misunderstand how the "AND" and "OR" operators work? -
RE: Excerpt Text Peculiarities
Your explanation helps a lot, and I think adding a summary to your documentation would help other people too.
Originally, I had assumed that each occurrence of the search term in the document would produce a separate result with a 500 word excerpt "wrapped around" the search term. However, if I understand correctly, each document containing the search term only returns one result, and the excerpt might include a number of sentences from different parts of the document, these generally being the shortest sentences found (as you've described), up to the 500 word limit?
When these excerpt sentences are not continuous in the document, perhaps they could be numbered and placed in new paragraphs, to make clear that they are not a continuous section of text from the document?
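To make the suggestion concrete, here is a minimal Python sketch of the behaviour as I understand it, with the numbering added. It is an assumption-laden illustration, not the actual WPFTS algorithm: the function name `build_excerpt`, the simple sentence splitting and the character-based `limit` are all hypothetical.

```python
import re

def build_excerpt(text, term, limit=500):
    """Pick the shortest sentences containing `term`, up to `limit` characters.

    If the chosen sentences are not adjacent in the document, number them
    and put each in its own paragraph so the gaps are obvious.
    """
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    hits = [(i, s) for i, s in enumerate(sentences) if term.lower() in s.lower()]

    chosen, used = [], 0
    for i, s in sorted(hits, key=lambda h: len(h[1])):  # shortest first
        if used + len(s) > limit:
            break
        chosen.append((i, s))
        used += len(s)

    chosen.sort()  # restore document order
    contiguous = all(b[0] - a[0] == 1 for a, b in zip(chosen, chosen[1:]))
    if contiguous:
        return " ".join(s for _, s in chosen)
    return "\n\n".join(f"({n}) {s}" for n, (_, s) in enumerate(chosen, 1))

sample = ("Alpha canal opened. Beta is unrelated text here. "
          "The canal was later closed by the company.")
print(build_excerpt(sample, "canal"))
```

With the sample text, the two matching sentences are separated by an unrelated one, so they come back numbered and in separate paragraphs rather than run together.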
Your idea of adding text at the end of the excerpt to signify when there are further "good" sentences, would also help. Maybe "there are "X" further appearances in other parts of this document". (You could leave the number "X" out if the software can't provide the number). -
RE: Excerpt Text Peculiarities
@Nick said in Excerpt Text Peculiarities:
BRISTOL AND SOUTH WALES UNION RAILWAY
To me, the main problem is that the excerpt text is not a continuous copy of the text in the original document, because it has thrown out a sentence in the middle of the relevant paragraph. For someone reading the excerpt, this missing sentence might be vital in order to understand the context of the search term within the document.
I'm sure there will always be situations where any algorithm creates anomalies, but my current view is that the excerpt should always be a continuous copy of the original.
Where to start and end the excerpt is more tricky, but paragraph breaks might be good indicators, better still a double paragraph break (i.e. a blank line in the text). In the example above, the text above the blank line (containing "Tuborg") belongs to a completely different topic, and is irrelevant to the search term.
It might also help if the specified character limit was more fully used. We have ours set to 500 (I assume this is characters), but in some cases we are getting excerpts of well under 100 characters.
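As a rough illustration of what I mean, the following Python sketch returns one unbroken slice of the text, bounded by blank lines where possible and otherwise using the full character limit around the search term. It is only a sketch of the idea, not a description of WPFTS: the function name `continuous_excerpt`, the `limit` parameter and the shortened sample text are all hypothetical.

```python
import re

def continuous_excerpt(text, term, limit=500):
    """Return one continuous slice of `text` around the first match of `term`."""
    match = re.search(re.escape(term), text, re.IGNORECASE)
    if match is None:
        return ""
    pos = match.start()

    # Paragraph boundaries: the nearest blank lines either side of the match.
    start = text.rfind("\n\n", 0, pos)
    start = 0 if start == -1 else start + 2
    end = text.find("\n\n", pos)
    end = len(text) if end == -1 else end

    if end - start <= limit:
        return text[start:end]  # the whole paragraph fits

    # Paragraph too long: take a window of `limit` characters around the term.
    half = max(0, (limit - len(term)) // 2)
    win_start = max(start, pos - half)
    win_end = min(end, win_start + limit)
    win_start = max(start, win_end - limit)
    return text[win_start:win_end]

sample = ("Tuborg Brewery with red and green straw hats.\n\n"
          "The rail journey was shortened by the Severn Bridge in 1879. "
          "For that purpose the Company was incorporated in 1857.")
print(continuous_excerpt(sample, "incorporated in 1857"))
```

Because the slice is always taken directly from the original text, no sentence can go missing from the middle, and the unrelated paragraph above the blank line is left out.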
I'm assuming here that WPFTS only returns a single result for a document containing the search term, even though the search term might appear several times in various parts of the document? How does it decide which excerpt to display, and would it be possible to add a flag in the search results to state something like "Search term appears a further x times in the document"?
Nick -
Excerpt Text Peculiarities
I have a problem with the excerpt text WPFTS is displaying in the search results. It seems to be selecting some, but not all, of the text surrounding the search term, almost as though some text in the paragraph did not belong there. By way of example, here is some text that WPFTS has Indexed correctly from one of our documents:
"Tuborg Brewery with red and green straw hats, so familiar a sight on the streets of Copenhagen. JEOFFRY SPENCETHE BRISTOL AND SOUTH WALES UNION RAILWAY, John Norris, 32 pp, 5 photo illus, 2 maps, soft covers. RCHS 1985, ISBN 0-901461-38-5 £2.40 + p&p.
The rail journey between Bristol and South Wales was shortened by the Severn Bridge in 1879 and again by the Severn Tunnel in 1886, but an earlier scheme to avoid the detour via Gloucester utilised a combination of ferry and rail travel. For that purpose the Bristol and South Wales Union Railway Company was incorporated in 1857. An existing ferry had to be improved and various difficulties overcome before the new link could be formally opened on 1 January 1864."
When I searched for "ISBN 0-901461-38-5" the excerpt was "RCHS 1985, ISBN 0-901461-38–5 £2.40 + p&p.".
When I searched for "incorporated in 1857" the excerpt was "For that purpose the Bristol and South Wales Union Railway Company was incorporated in 1857."
When I searched for "via Gloucester" the excerpt was "The rail journey between Bristol and South Wales was shortened by the Severn Bridge in 1879 and again by the Severn Tunnel in 1886, but an earlier scheme to avoid the detour via Gloucester utilised a combination of ferry and rail travel."
The text in these three examples was continuous from the index, but much shorter than the 500 characters I had specified in the WPFTS settings.
However, when I searched for "BRISTOL AND SOUTH WALES UNION RAILWAY", the excerpt was "JEOFFRY SPENCE THE BRISTOL AND SOUTH WALES UNION RAILWAY, John Norris, 32 pp, 5 photo illus, 2 maps, soft covers. For that purpose the Bristol and South Wales Union Railway Company was incorporated in 1857." So here there is a whole sentence and more missing out of the middle of the excerpt.
Could you investigate, please?
-
RE: "Prevent Direct Access" Plugin
Thank you so much. We'll try this, and I'll let you know the results.
-
RE: "Prevent Direct Access" Plugin
Good morning, we are using the Members Plugin (by Memberpress) version 3.1.3.
I'm very grateful that you are taking an interest in this topic. One of the, perhaps, unintended consequences of WPFTS is, obviously, that it makes it a lot easier to find documents that were previously difficult to find because they were hiding in full sight amongst, in our case, hundreds of other documents. As I've said before, often it is just inconvenient and might lead to a minor loss of new members, but in other cases it could have very adverse effects on their businesses. I think this is one of the topics that is worth discussion, even though the solution will sometimes lie with a different plugin.
By the way, we did try having the "attachment" page visible for a while, but it displayed an enormous image of the document thumbnail, with a much smaller copy that contained the link below it. The link was, therefore, effectively hidden by the large image. Hence, using it as we have discussed but not being able to see it (at least, not for more than a second or two) would be very good. Nick -
RE: "Prevent Direct Access" Plugin
I know that some users will want to make accessing media files impossible, which I believe involves a plugin like "Prevent Direct Access".
On our website, our pdfs are copies of our magazines, and we would be content with just making it difficult for non-members to access them.
Is it possible to use the internal "attachment" page of the file to do this?
We could make this page members only, but could we also set up an automatic redirection command from this page to the URL of the document? Hence, non-members would not see this page but would be taken to the "Join Us" page (instead of a 404 page). Members would not see it either, because they would be automatically redirected to the document.
Might this work? -
"Prevent Direct Access" Plugin
Has anyone had experience of using this plugin with WPFTS? We thought it might be the answer to preventing non-members of our Society from seeing protected pdf files - i.e. the ones that our members pay subscriptions in order to see.
We installed the free version as a trial but, in addition to a number of other shortcomings, we discovered that non-members could find references/excerpts to the protected files in the search results using WPFTS search (which is good), but that clicking on the search result opened the protected file (which is bad). There is an option to buy a "Gold" upgrade to the plugin, but I'd like some confidence it will work before paying, and a query to the developer hasn't yet been responded to.
I did exchange some emails on protecting files with Epsilon a couple of months ago, but I thought this plugin might offer a straightforward way to protect our pdfs. -
RE: WPFTS Not Recognising Columns in a PDF Document
@EpsilonAdmin I've sent the requested information by email.
-
WPFTS Not Recognising Columns in a PDF Document
I noticed a peculiarity in the excerpt that appeared as a result of a search I did yesterday.
When I searched the pdf concerned using my PDF Editor Program, it found four instances of the search term. Dragging my pointer over the text confirmed that the OCR has correctly picked up the two-column layout of the document.
The excerpt in the WPFTS search results contained two sentences, from the second and third occurrences of the search term, so it hadn't generated an excerpt from the first occurrence in particular. Is there a reason for this?
More importantly, the text in the first sentence of the excerpt didn't make sense until I realised that WPFTS had ignored the two-column layout and was reading straight across the page, picking up text from the left and right columns alternately. I think this is a significant bug, maybe in WPFTS or maybe elsewhere, so I'd appreciate a fix. -
RE: PDF Search Results: Titles and Excerpts
@EpsilonAdmin said in PDF Search Results: Titles and Excerpts:
#main .post h2.fusion-post-title a {
font-size: 21px;
}
#main .post h2.fusion-post-title {
margin-bottom: 5px;
}
Thank you for your further advice. I added your code to the end of the code in the Custom CSS Styling dialog, but decided that a font-size of 20px, and a margin-bottom of 0px, worked best for our website. The results can be seen by inserting some text (try "Worcester") in the search box that is top-right here: https://rchs.org.uk/
-
RE: PDF Search Results: Titles and Excerpts
Many thanks for the advice. We've implemented it, and our search results now show in a single column across the page, which is exactly what we wanted.
-
RE: Altering the way Search Results are displayed.
Thanks. I'll raise the issue with our webmaster. We're using Avada.
I only asked because it did seem possible, in WPFTS, to alter the format of the items displayed under the Title in the search results, but not the Title itself. -
Altering the way Search Results are displayed.
Is it possible to reduce the size of the thumbnail, and the Title, in the search results? I find that they take up too much room on the page, and this limits the number of results that can be displayed on one page.
Also, could the page navigation buttons at the bottom of each page of results be amended to show the total number of pages, and preferably also the total number of search results? If this could then be repeated at the top of each page, it would make navigating around the results easier.
Ideally I'd prefer the search results to be displayed in a horizontal format across the width of the page, with the thumbnail on the left-hand side, and then the Title with the excerpt and other information below it. -
RE: Sorting of Search Results
We have over 1,000 pdf documents on our website, and the most important is our Journal, with 238 editions at present. As an example, a couple of issues of our Journal are here, but they are all similar in structure:
https://rchs.org.uk/wp-content/uploads/2020/02/Journal-100-Nov-1975.pdf
https://rchs.org.uk/wp-content/uploads/2020/02/Journal-001-Jan-1955.pdf
You will see that at the bottom of the second page of Journal 100 is a table of contents for the issue and, if the pdf is saved and then opened in Acrobat Reader, there is an equivalent set of bookmarks. It would help if a search term appearing in the article title (as listed in the table of contents) was given a higher weighting than one in the text of the document, but quite often the term will only occur in the text, and not in the title at all.
Ideally, we would like the weighting to be based upon articles, but a weighting based upon the whole Journal is acceptable. This is because the articles tend to be on unrelated topics within a Journal, so there is probably little difference between the number of instances of a specific search term within an article, and within its parent document.
I think it would help me if you could explain, in non-technical terms, how the four weightings operate with WPFTS. I've looked at the TFIDF article on Wikipedia, and understand the basics, but the majority of the article is too technical for me. Perhaps this information could also be added to the WPFTS documentation?
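For anyone else in the same position, the basic idea from the Wikipedia article can be shown with a small worked example in Python. This is the plain textbook form of TF-IDF, assumed here purely for illustration; whether WPFTS calculates relevance this way, and how its four weightings feed into it, is not something I know. The function name and the three tiny "documents" are made up.

```python
import math

def tf_idf(term, document, corpus):
    """Textbook TF-IDF: how often the term appears in this document,
    scaled down if the term appears in most documents in the corpus."""
    term = term.lower()
    words = document.lower().split()
    if not words:
        return 0.0
    tf = words.count(term) / len(words)  # term frequency
    containing = sum(1 for d in corpus if term in d.lower().split())
    idf = math.log(len(corpus) / containing) if containing else 0.0
    return tf * idf

# Three hypothetical one-line documents.
corpus = [
    "canal canal lock railway",
    "railway station canal",
    "railway timetable",
]

for doc in corpus:
    print(round(tf_idf("canal", doc, corpus), 3), round(tf_idf("railway", doc, corpus), 3))
```

In this example "railway" scores zero everywhere because it appears in every document, whereas "canal" scores highest in the first document, where it makes up half of the words.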
As a related issue, could WPFTS open the document listed in the search results at the first article page where the search term is found? I realise it might go to the article title instead, and if so that wouldn't really help much. -
RE: Sorting of Search Results
Most of the documents we have on our website are pdfs of our Journals, that were originally paper copies going back to 1955. Each Journal typically contains 6 - 8 substantial articles of about 10 pages each, plus many shorter articles and including comments on earlier articles.
So, the more times a search term is mentioned in an article and, particularly, if the term is included in the title of the article, the more likely it is that the article is specifically about the search term or something closely related to it.
On the assumption that WPFTS can't tell where one article ends and another begins, the number of instances within the document would be a good proxy. -
Sorting of Search Results
Is there a way to modify the relevance sort weightings when applied to attachments, particularly pdfs, based upon the number of times the search term appears within the document? (i.e. the more often the term appears, the higher the relevance).
-
RE: PDF Search Results: Titles and Excerpts
Thanks. I've now had the Avada fix applied, and the 300 character text is displaying. Great!
I'm attaching the two files mentioned above. These are the corrected versions, so are slightly different to those on the website. In particular, in the Document Properties of the website file, the filename is just repeated (is this done automatically?), whereas on the corrected version I have amended the title to something more meaningful. In both cases, it is the Title that is displayed in the search results.
Journal-001 Jan 1955.pdf Journal-002 Apr 1955.pdf