B4J Question Bubble Search or Search Anywhere

Peter Lewis · Jan 3, 2020

Hi All
A while ago I got some help with entering some text and having the table filtered by finding that text anywhere in the database field. This worked very well. However, I am now looking for an upgraded version of this idea.

For example, if I were searching songs and entered "black peas intro clean", this should filter the table on all of those words in the database field and show the results.

At the moment I can only enter one word; as soon as I use spaces, the original approach stops working.
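One possible way to handle several words is to split the input on spaces and require each word to match anywhere in the field. The sketch below is in Python with SQLite rather than B4J, and it assumes that every word must be present (AND rather than OR). The table name `songs`, the column name `title`, and both function names are hypothetical.

```python
import sqlite3

def like_pattern(word):
    # Escape LIKE wildcards so a typed % or _ is matched literally.
    escaped = word.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    return f"%{escaped}%"

def search_all_words(conn, query, table="songs", column="title"):
    # table and column are trusted identifiers here, never user input.
    words = query.split()
    conditions = [f"{column} LIKE ? ESCAPE '\\'" for _ in words]
    sql = f"SELECT {column} FROM {table}"
    if conditions:
        # One LIKE per word, all of which must match somewhere in the field.
        sql += " WHERE " + " AND ".join(conditions)
    params = [like_pattern(w) for w in words]
    return [row[0] for row in conn.execute(sql, params)]
```

With this approach the order of the words in the search box does not matter, and an empty search returns every row. SQLite's LIKE is case-insensitive for ASCII letters by default, so "black" would also match "Black".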

This was from the original post:

search anywhere contiguous letters

Thank you

emexes · Jan 4, 2020

Peter Lewis said:
I had 2 files that had the same MD5 and the same filesize but the ID tag was different.

I know I was bleating on about how same hash value didn't guarantee same file contents, but... the odds of you hitting that problem, especially with MD5, are pretty low.

If I had to hazard a guess, it would be that one of the files was modified after its MD5 was calculated and stored in the database.

If you've still got the two files, which are different but have the same MD5 in the database, can you recalculate the MD5s from the files using e.g. md5sum, and see if they are indeed the same?
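If md5sum is not to hand, the same check could be sketched in a few lines of Python. The function names here are hypothetical, and the files are read in chunks so that large audio files are not loaded into memory at once.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    # Hash the file contents chunk by chunk and return the hex digest.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_md5(path_a, path_b):
    # True when both files currently hash to the same MD5 value.
    return md5_of_file(path_a) == md5_of_file(path_b)
```

If the recalculated values differ from each other, the value stored in the database is probably stale for at least one of the files.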

LucaMs · Jan 4, 2020

Why not look for the wheel already invented?
https://www.google.it/search?ei=t5E...hUKEwjV1dabhermAhXR3KQKHYnnBT0Q4dUDCAo&uact=5

emexes · Jan 4, 2020

Peter Lewis said:
So I might have 10 copies of the same song and they are exactly the same. These are the ones I want to reduce down to 1.

Is the MP3 data the same, ie, byte-for-byte exact?

I had a similar problem in having to match FLAC files that had different metadata but same audio data. The audio data would be at a different offset in each file, sometimes thousands of bytes apart.

The initial solution was to open an audio file, grab a 100 byte block of audio data from the middle, and then search each and every other audio file for that block. This worked 100% but was no speed demon. And in your case, with millions of files, it'd be unviable.
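As a rough illustration of that brute-force idea, here is a Python sketch. It assumes the middle of the whole file falls inside the audio data (that is, the metadata is small and sits near the start or end), which is a simplification. The function names are hypothetical.

```python
def middle_block(data, size=100):
    # Take `size` bytes centred on the middle of the file contents.
    start = max(0, (len(data) - size) // 2)
    return data[start:start + size]

def shares_middle_block(data_a, data_b, size=100):
    # True when the middle block of A occurs at any offset in B.
    block = middle_block(data_a, size)
    return len(block) == size and block in data_b
```

Every file has to be scanned in full for every candidate block, which is why this approach gets slow quickly as the collection grows.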

The faster solution was to pull some fingerprints from each audio file, and then put them in a database (ok, text file in my case) and then search for duplicates much like you are doing with the SQL statement and COUNT clause.
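For the database variant, the duplicate lookup might look something like the Python/SQLite sketch below. The table `fingerprints` with columns `path` and `fingerprint` is a hypothetical layout, not the one used in either project.

```python
import sqlite3

def duplicate_groups(conn):
    # Fingerprints that occur in more than one distinct file.
    shared = conn.execute(
        "SELECT fingerprint FROM fingerprints "
        "GROUP BY fingerprint HAVING COUNT(DISTINCT path) > 1 "
        "ORDER BY fingerprint"
    ).fetchall()
    groups = {}
    for (fingerprint,) in shared:
        # Collect the files that share this fingerprint.
        paths = conn.execute(
            "SELECT DISTINCT path FROM fingerprints "
            "WHERE fingerprint = ? ORDER BY path",
            (fingerprint,),
        )
        groups[fingerprint] = [p for (p,) in paths]
    return groups
```

Each entry in the result maps one shared fingerprint to the list of files containing it, so the files in a list are candidates for being the same audio.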

The fingerprints were "a run of 9 bytes of ascending value", e.g. the middle nine bytes of 200, 10, 35, 37, 58, 121, 143, 157, 193, 227, 181. I think 9 bytes gave on average 4 fingerprints per MB. I checked that all of the fingerprints in file A were matched in file B, and vice versa, but what I found was that one fingerprint match between files was enough to identify those that had identical audio content. If the audio content had been altered in the slightest, then the MP3-encoded output would change entirely.
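A minimal Python sketch of that fingerprint idea follows. It rests on two assumptions that go beyond the description above: "ascending" is taken to mean strictly ascending, and only runs of exactly 9 bytes count (the run is bounded on both sides, as in the example). The function names are hypothetical.

```python
def ascending_run_fingerprints(data, run_length=9):
    # Collect every maximal strictly ascending run of exactly run_length bytes.
    fingerprints = []
    start = 0
    for i in range(1, len(data) + 1):
        # A run ends at the end of the data or when the next byte is not larger.
        if i == len(data) or data[i] <= data[i - 1]:
            if i - start == run_length:
                fingerprints.append(data[start:i])
            start = i
    return fingerprints

def share_fingerprint(data_a, data_b, run_length=9):
    # True when at least one fingerprint appears in both byte strings.
    prints_a = set(ascending_run_fingerprints(data_a, run_length))
    prints_b = set(ascending_run_fingerprints(data_b, run_length))
    return bool(prints_a & prints_b)
```

Because the fingerprint is defined by the byte values alone, it does not depend on where the audio data starts in the file, so differing tag sizes do not affect the match.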

emexes · Jan 4, 2020

LucaMs said:
Why not look for the wheel already invented?

Agreed. That's what Peter's doing here. Also, audio matching is an interesting topic, and pretty much every situation is different.

It even used to be that multiple rips from the same CD would not match byte-for-byte because there was no clear and precise delineation of (song) tracks defined in the CD standard.
