B4J Question Bubble Search or Search Anywhere

Peter Lewis

Active Member
Licensed User
Longtime User
Hi All
A while ago I got some help with entering some text so that the table gets filtered by finding that text anywhere in a database field. This worked very well. However, I am now looking for an upgraded version of the idea.

For example, if I had to search songs and entered "black peas intro clean", it would filter the table down to rows whose database field contains all of those words, and show the results.

At the moment I can only put in one word; as soon as I use spaces, the original approach stops working.

This was from the original post:

search anywhere contiguous letters

Thank you
 

emexes

Expert
Licensed User
It'd be great if there was an SQL operator that handled this (LIKES?) but... I think you'll have to hand-spin it, using AND to add each extra word to the WHERE clause.

As a first quick test, you could .Replace the spaces with the wildcard "%" in the search string; that'll get you halfway there. It will find rows where all the words exist, but only when they appear in the same order, eg "peas black" (SQL "%peas%black%") would NOT find "black eyed peas".
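Something along these lines, as a minimal B4J sketch; the table name Songs, the column Title, the Sub name and the process-global sql object are all illustrative rather than from your project:

B4X:
' Minimal sketch only - one LIKE per word, ANDed together, so the words can
' appear in any order anywhere in the field.
Sub BuildSearchQuery (SearchText As String) As ResultSet
    Dim Words() As String = Regex.Split(" +", SearchText.Trim)
    Dim WhereClause As StringBuilder
    WhereClause.Initialize
    Dim Args As List
    Args.Initialize
    For Each w As String In Words
        If w.Length = 0 Then Continue
        If WhereClause.Length > 0 Then WhereClause.Append(" AND ")
        WhereClause.Append("Title LIKE ?")
        Args.Add("%" & w & "%")
    Next
    If Args.Size = 0 Then Return sql.ExecQuery("SELECT * FROM Songs")
    ' jSQL's ExecQuery2 expects a String array, so copy the List across
    Dim ArgArr(Args.Size) As String
    For i = 0 To Args.Size - 1
        ArgArr(i) = Args.Get(i)
    Next
    Return sql.ExecQuery2("SELECT * FROM Songs WHERE " & WhereClause.ToString, ArgArr)
End Sub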
 

Peter Lewis

Active Member
Licensed User
Longtime User
What I am trying to replicate, while making some improvements, is a free program called hddb:

hddb download

This program is great and does 80% of the things I want, but there are no new releases and it seems the author no longer updates it.

One approach I thought of was to have three input search boxes, so the code can look at all three and search on each; any box that is blank would simply be ignored.
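If you do go with multiple boxes, a rough sketch (txtSearch1..txtSearch3 are hypothetical TextFields, and BuildSearchQuery is the word-by-word builder sketched in the previous post) could simply concatenate whichever boxes are not blank:

B4X:
' Rough sketch only - blank boxes contribute nothing to the combined search text.
Dim Combined As String = ""
For Each t As TextField In Array(txtSearch1, txtSearch2, txtSearch3)
    If t.Text.Trim.Length > 0 Then Combined = Combined & " " & t.Text.Trim
Next
Dim rs As ResultSet = BuildSearchQuery(Combined.Trim)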
 

Peter Lewis

Active Member
Licensed User
Longtime User
Jokes and wordplay aside: how'd you go with this? Is it quick enough to keep up with keystrokes in the search box?

I do not know if it will. I just tried to load the SQLite file, which has 3.7 million records, into a table and got an error:

Error Message:
java.lang.OutOfMemoryError: Java heap space
 

emexes

Expert
Licensed User
If it is an SQLite database on a PC, does this work:

https://www.sqlitetutorial.net/sqlite-full-text-search/

eg use "black peas something".Trim.Replace(" ", " ").Replace(" ", CRLF).Replace(CRLF, " AND ") to generate MATCH string "black AND peas AND something", and then do this:

[screenshot of the full-text-search MATCH query]
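Roughly, as a hedged B4J sketch (the FTS table name SongsFTS, the source table Songs and its columns are illustrative, and it assumes the bundled sqlite-jdbc build was compiled with FTS5):

B4X:
' Sketch only - build the full-text index once, then MATCH against it.
sql.ExecNonQuery("CREATE VIRTUAL TABLE IF NOT EXISTS SongsFTS USING fts5(filename, folder)")
' copy the searchable columns across from the existing table (names illustrative)
sql.ExecNonQuery("INSERT INTO SongsFTS SELECT filename, folder FROM Songs")

Dim MatchText As String = "black peas intro clean".Trim.Replace("  ", " ").Replace(" ", " AND ")
Dim rs As ResultSet = sql.ExecQuery2("SELECT filename, folder FROM SongsFTS WHERE SongsFTS MATCH ?", _
    Array As String(MatchText))
Do While rs.NextRow
    Log(rs.GetString("folder") & "\" & rs.GetString("filename"))
Loop
rs.Close

FTS5 also supports prefix tokens, eg MATCH 'bla* AND peas', which helps if you want the search to keep up with keystrokes.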
 

Peter Lewis

Active Member
Licensed User
Longtime User
I suppose I might be going about this all wrong. Instead of loading the whole table of 3.7m records, which will grow every month, maybe I should just run a query and put only the RESULTS in a table. That would be faster and less memory intensive. I really liked the way hddb worked, and I have been using it for over 5 years now.
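A hedged sketch of that idea, assuming a TableView named tblResults, the search text in SearchText, and the TFileInfo table from the scanner code further down; the LIMIT keeps the UI from ever holding millions of rows:

B4X:
' Sketch only - load just the matching rows into the table, capped by LIMIT.
Dim rs As ResultSet = sql.ExecQuery2( _
    "SELECT filename, folder, filesize FROM TFileInfo WHERE filename LIKE ? LIMIT 500", _
    Array As String("%" & SearchText & "%"))
tblResults.Items.Clear
Do While rs.NextRow
    tblResults.Items.Add(Array As Object(rs.GetString("filename"), rs.GetString("folder"), rs.GetLong("filesize")))
Loop
rs.Close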
 

emexes

Expert
Licensed User
I really liked the way hddb worked, and I have been using it for over 5 years now.
How does it work? I think I've been doing a poor man's version of it by periodically doing a whole-disk DIR /S /O /A into a text file, and writing a program to scan that file for keywords, same as you are implementing. It takes perhaps 3 seconds to read the 65 MB file and scan 1.2M entries.
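A rough sketch of that scan, assuming the DIR output is in dirlist.txt and the keywords are just examples:

B4X:
' Rough sketch only - keep the lines that contain every keyword, case-insensitive.
Dim Keywords() As String = Regex.Split(" +", "black peas intro clean".ToLowerCase.Trim)
Dim Matches As List
Matches.Initialize
For Each line As String In File.ReadList(File.DirApp, "dirlist.txt")
    Dim lc As String = line.ToLowerCase
    Dim ok As Boolean = True
    For Each k As String In Keywords
        If lc.Contains(k) = False Then
            ok = False
            Exit
        End If
    Next
    If ok Then Matches.Add(line)
Next
Log("Found " & Matches.Size & " matching entries")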
 

Peter Lewis

Active Member
Licensed User
Longtime User
I did leave the link for it above, and it is freeware with no viruses. I got some help from people on this board with code to scan the disks; it took 35 minutes to scan 40 TB (4 x 10 TB) of disks and produced a 500 MB SQLite file. The code is below. I commented out the MD5 generation as it was very slow and also did not do what I wanted.

Non-UI code:
'Non-UI application (console / server application)
#Region Project Attributes
    #CommandLineArgs:
    #MergeLibraries: True
    #AdditionalJar: sqlite-jdbc-3.27.2.1
    #AdditionalJar: bcprov-jdk15on-1.64
#End Region

Sub Process_Globals
    Dim nbFiles As Long
    Dim nbDir As Long
    
    Dim FileDB As String = "fileinfo.db"
    Dim sql As SQL
End Sub

Sub AppStart (Args() As String)
    nbFiles = 0
    nbDir = 0
    
    CreateDB

    ' scan one or several folders
    sql.BeginTransaction
    ReadDir("i:\",True)
    ReadDir("f:\",True)
    ReadDir("g:\",True)
    ReadDir("j:\",True)
    sql.TransactionSuccessful
    Log (nbFiles)
    Log (nbDir)
    
    'SearchDuplicate
    
    sql.Close
End Sub

'Return true to allow the default exceptions handler to handle the uncaught exception.
Sub Application_Error (Error As Exception, StackTrace As String) As Boolean
    Return True
End Sub

Sub ReadDir(folder As String, recursive As Boolean)
    Dim lst As List = File.ListFiles(folder)
    If lst.IsInitialized Then
        For i = 0 To lst.Size - 1
            If File.IsDirectory(folder,lst.Get(i)) Then
                Dim v As String
                v = folder&"\"&lst.Get(i)
                nbDir = nbDir+1
                If recursive Then
                    ReadDir(v,recursive)
                End If
            Else
                ' note: the 5th column (md5) is seeded with the running file counter here;
                ' SearchDuplicate overwrites it with a real hash later
                Dim req As String = "INSERT INTO TFileInfo VALUES (?,?,?,?,?)"
                sql.ExecNonQuery2(req, Array As Object (lst.Get(i), GetFileExt(lst.Get(i)).ToLowerCase, folder, File.Size(folder,lst.Get(i)), nbFiles))
                nbFiles = nbFiles+1
                If nbFiles Mod 1000 = 0 Then Log (nbFiles)
            End If
        Next
    Else
        Log (folder & " : non accessible")
    End If
End Sub


Sub CreateDB
    If File.Exists(File.DirApp,FileDB) Then     File.Delete(File.DirApp, FileDB)
    sql.InitializeSQLite(File.DirApp,FileDB, True)

    sql.ExecNonQuery($"CREATE TABLE "TFileInfo"( "filename" TEXT,"extension" TEXT,"folder" TEXT,"filesize" INTEGER,"md5"  TEXT)"$)   
End Sub

Sub SearchDuplicate
    Dim req As String = $"select rowid, * from  tfileinfo
                        where filesize in (
                                SELECT  filesize
                                FROM     tfileinfo
                                 where (filesize<> 0)
                                GROUP BY filesize
                                HAVING   COUNT(filesize) > 1 )"$
    Dim count As Int =0
    
    Dim Cursor As ResultSet
    Cursor = sql.ExecQuery(req)
    Do While Cursor.NextRow
            Dim h As String = Cursor.GetString("md5")
            If h.Length <> 32 Then h = Hash(Cursor.GetString("folder"),Cursor.GetString("filename") )
            sql.ExecNonQuery2("Update TFileInfo set md5=? Where rowid=?",Array As Object (h, Cursor.GetInt("rowid")) )
            Log(Cursor.GetInt("rowid"))
            count = count +1
    Loop
    Cursor.Close
    Log ("total duplicate " & count)

    
    ' output a summary of duplicate files
    req = $"Select nuple, count(nuple) as cn from (
        Select  count(md5) As nuple
        FROM     tfileinfo
        GROUP BY md5
        HAVING   count(md5) > 1 )
        Group By nuple     "$
    Cursor = sql.ExecQuery(req)
    Do While Cursor.NextRow
        Log(Cursor.GetString("nuple") & " " &  Cursor.GetString("cn"))
    Loop


End Sub


Sub Hash (dir As String, filename As String) As String
    Dim in As InputStream
    If File.Size(dir, filename) > 1000000000 Then
        Log ("****** file ignore ****")
        Log (filename & " " & File.Size(dir, filename) )
        Return "TOO BIG"
    End If

    in = File.OpenInput(dir,filename)
    Dim buffer(File.Size(dir, filename)) As Byte
    in.ReadBytes(buffer, 0, buffer.length)
    Dim Bconv As ByteConverter
    Dim data(buffer.Length) As Byte
    Dim md As MessageDigest
    data = md.GetMessageDigest(buffer, "MD5")
    Return  Bconv.HexFromBytes(data)
End Sub

Sub GetFileExt(FullPath As String) As String
    Dim dot As Int = FullPath.LastIndexOf(".")
    If dot < 0 Then Return ""    ' file has no extension
    Return FullPath.SubString(dot + 1)
End Sub
 

emexes

Expert
Licensed User
I did leave the link for it above, and it is freeware with no viruses.
Sorry, my question was badly phrased... but you've answered it anyway, and now I can see the size issue. Since you're on a PC, the SQLite implementation should include free-text searching, and using that existing facility will probably work better than rolling your own version of it.

Re: MD5, bear in mind that, for deduplicating, a hash only says either "these files are definitely different" or "these files might be the same"; it does not say "these files are the same". Thus you can use a weaker but faster hash algorithm, eg just a 32-bit XOR, perhaps with some rotations thrown in for good measure. If two files have the same hash, then they might be the same, and should be compared directly for a definitive yea or nay.

Another timesaver is to hash only, say, the first and last 8192 bytes of each file, since if two files are different there is usually a difference in those areas. That saves hashing the gigabytes of probably-identical data in the middle. On the rare occasion that the ends are identical but the middles differ: we're going to do a full comparison anyway, so we'd find out then.
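A hedged sketch of that idea in B4J (the Sub name is illustrative, and it needs the jRandomAccessFile library plus the MessageDigest/ByteConverter objects already used above):

B4X:
' Sketch only - hash just the first and last 8192 bytes plus the length.
' Equal results mean "maybe identical, compare properly"; different results
' mean "definitely different".
Sub QuickHash (dir As String, filename As String) As String
    Dim size As Long = File.Size(dir, filename)
    Dim chunk As Int = Min(8192, size)
    Dim head(chunk) As Byte
    Dim tail(chunk) As Byte
    Dim raf As RandomAccessFile
    raf.Initialize(dir, filename, True)   ' True = read-only
    raf.ReadBytes(head, 0, chunk, 0)
    raf.ReadBytes(tail, 0, chunk, size - chunk)
    raf.Close
    Dim both(chunk * 2) As Byte
    For i = 0 To chunk - 1
        both(i) = head(i)
        both(chunk + i) = tail(i)
    Next
    Dim md As MessageDigest
    Dim bc As ByteConverter
    Return bc.HexFromBytes(md.GetMessageDigest(both, "MD5")) & "-" & size
End Sub

Appending the file length to the result means two files of different sizes can never share a hash, which is the same filesize pre-filter the SearchDuplicate query already uses.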
 

Daestrum

Expert
Licensed User
Longtime User
Just as a side note, if you are searching mp3 files, file size is not a good way to check if they are the same.
If one file has ID3 tags and the other doesn't, the file size will be different, even though the music part is the same.
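If that ever matters, one hedged way around it is to find where the audio actually starts and hash or compare from that offset. The sketch below only handles a leading ID3v2 tag (ID3v1 tags at the end of the file and ID3v2 footers are ignored), the Sub name is illustrative, and it needs the jRandomAccessFile library:

B4X:
' Sketch only - returns the offset of the first byte after a leading ID3v2 tag,
' or 0 if the file does not start with one.
Sub AudioStartOffset (dir As String, filename As String) As Long
    Dim header(10) As Byte
    Dim raf As RandomAccessFile
    raf.Initialize(dir, filename, True)
    raf.ReadBytes(header, 0, 10, 0)
    raf.Close
    If header(0) = Asc("I") And header(1) = Asc("D") And header(2) = Asc("3") Then
        ' the ID3v2 tag size is stored as four 7-bit ("syncsafe") bytes
        Dim tagSize As Long = Bit.ShiftLeft(Bit.And(header(6), 0x7F), 21) + _
            Bit.ShiftLeft(Bit.And(header(7), 0x7F), 14) + _
            Bit.ShiftLeft(Bit.And(header(8), 0x7F), 7) + Bit.And(header(9), 0x7F)
        Return 10 + tagSize
    End If
    Return 0
End Sub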
 

Peter Lewis

Active Member
Licensed User
Longtime User
Re: MD5, bear in mind that, for deduplicating, a hash only says either "these files are definitely different" or "these files might be the same"; it does not say "these files are the same".
I found a problem with using the MD5, and that is why I disabled it. I had two files that had the same MD5 and the same file size, but the ID tag was different: one was the clean version and one was the dirty version. So that would not be suitable.
 

Peter Lewis

Active Member
Licensed User
Longtime User
Just as a side note, if you are searching mp3 files, file size is not a good way to check if they are the same.
If one file has ID3 tags and the other doesn't, the file size will be different, even though the music part is the same.

We normally run all the files through an ID3 program to replace tags that are missing. My problem is that I belong to a DJ music pool which releases around 30 GB of music per day. Producers upload their creations to multiple services for distribution, and the record pool gives you the releases from all of those services without checking for duplicates. So I might have 10 copies of the same song that are exactly the same; these are the ones I want to reduce down to one. If some others have different ID tags or filenames, that is not a big issue. Basically the 80/20 rule: the easy 80% will take 20% of the time, and the difficult 20% will take 80% of the time, so I am not interested in finding that 20%.
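For that easy 80%, once the md5 column has been populated, a hedged query along these lines would list every copy beyond the first of each (filesize, md5) group:

B4X:
' Sketch only - lists the surplus copies; the first file of each group is kept.
Dim rs As ResultSet = sql.ExecQuery( _
    "SELECT folder, filename FROM TFileInfo WHERE md5 <> '' AND rowid NOT IN " & _
    "(SELECT MIN(rowid) FROM TFileInfo WHERE md5 <> '' GROUP BY filesize, md5)")
Do While rs.NextRow
    Log("duplicate: " & rs.GetString("folder") & "\" & rs.GetString("filename"))
Loop
rs.Close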
 

udg

Expert
Licensed User
Longtime User
A different approach: ElasticSearch
Set up your own ES "server", feed its database and do free-form text searches. If you haven't read about it before, take some time to go through its description and features to decide whether it could be a viable solution for you (beware: it takes time to set everything up properly and then to integrate the engine with your existing code).
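As a very rough sketch of what a B4J query against it could look like (jOkHttpUtils2, a local single-node ES on the default port, and an index named "files" are all assumptions):

B4X:
' Sketch only - free-text "all words must match" query against a hypothetical
' local Elasticsearch index named "files".
Dim j As HttpJob
j.Initialize("", Me)
j.PostString("http://localhost:9200/files/_search", _
    $"{"query": {"match": {"filename": {"query": "black peas intro clean", "operator": "and"}}}}"$)
j.GetRequest.SetContentType("application/json")
Wait For (j) JobDone(j As HttpJob)
If j.Success Then Log(j.GetString)
j.Release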
 