B4J Question file directorty Lists to sql

Peter Lewis

Active Member
Licensed User
Longtime User
Hi All

It has been over 2 years since I used B4j and I have forgotten a lot and there have been so many improvements. I just need a push in the right direction please.

My problem is I have a huge list of files over 10 million. There are quite a few duplicates over 5 drives. I wanted to maybe use the checksum of the files to determine if they are duplicates.

Firstly which I am looking for this "push in the right direction" is to read all the disks recursively and put as much info of each file in a sql database. I think that would be the best place to sort and determine duplicates.

Then I want the ability to do a bubble search to show the results in a window which I can then Drag each one of the results into a program. An example would be to sort pictures and then using a keyword being able to drop them into a design program or MP3 music and being able to drag them into a mp3 player.

What does work similar to what I want to create is Hddb a hard drive database of all files. The problem with this is there has been no dev for the past 6-7 years and the limitation of 4.8 million files is a problem. The same problem exists with EVERYTHING another Hard drive database.

I would maybe also use the ID3 lib to extract more info that can be stored in the database.

Thank you in advance for steering me in the right direction. If you just leave some keywords for me to research, that would be great. If you want to explain a bit more it would also be appreciated

Thank You
 

Erel

B4X founder
Staff member
Licensed User
Longtime User
Recursively collecting all files can be done in two ways, synchronously (ListFiles) and asynchronously (ListFilesAsync). If you are building a non-ui solution then go with the simpler option and do everything synchronously.

Untested code that searches for possible duplicates by just testing the size and the modified time:
B4X:
Sub Process_Globals
    Private Files As List
    Private FilesMap As Map
    Type FileData (FilePath As String, Modified As Long, Size As Long)
End Sub

Sub AppStart (Args() As String)
    Files.Initialize
    FilesMap.Initialize
End Sub

Sub CollectFilesInFolder (Parent As String)
    For Each f As String In File.ListFiles(Parent)
        Dim FullPath As String = File.Combine(Parent, f)
        If File.IsDirectory(Parent, f) Then
            CollectFilesInFolder(FullPath)
        Else
            Dim fd As FileData = CreateFileData(FullPath, File.LastModified(FullPath, ""), File.Size(FullPath, ""))
            Files.Add(fd)
            Dim key As String = FileDataToKey(fd)
            If FilesMap.ContainsKey(key) Then
                Log("Potential duplicate: " & fd & CRLF & FilesMap.Get(key))
            Else
                FilesMap.Put(key, fd)
            End If
        End If
    Next
End Sub

Private Sub FileDataToKey (fd As FileData) As String
    Return fd.Modified & "|" & fd.Size
End Sub

Public Sub CreateFileData (FilePath As String, Modified As Long, Size As Long) As FileData
    Dim t1 As FileData
    t1.Initialize
    t1.FilePath = FilePath
    t1.Modified = Modified
    t1.Size = Size
    Return t1
End Sub

This is not a full solution.
 
Upvote 0

Peter Lewis

Active Member
Licensed User
Longtime User
Recursively collecting all files can be done in two ways, synchronously (ListFiles) and asynchronously (ListFilesAsync). If you are building a non-ui solution then go with the simpler option and do everything synchronously.

Untested code that searches for possible duplicates by just testing the size and the modified time:
B4X:
Sub Process_Globals
    Private Files As List
    Private FilesMap As Map
    Type FileData (FilePath As String, Modified As Long, Size As Long)
End Sub

Sub AppStart (Args() As String)
    Files.Initialize
    FilesMap.Initialize
End Sub

Sub CollectFilesInFolder (Parent As String)
    For Each f As String In File.ListFiles(Parent)
        Dim FullPath As String = File.Combine(Parent, f)
        If File.IsDirectory(Parent, f) Then
            CollectFilesInFolder(FullPath)
        Else
            Dim fd As FileData = CreateFileData(FullPath, File.LastModified(FullPath, ""), File.Size(FullPath, ""))
            Files.Add(fd)
            Dim key As String = FileDataToKey(fd)
            If FilesMap.ContainsKey(key) Then
                Log("Potential duplicate: " & fd & CRLF & FilesMap.Get(key))
            Else
                FilesMap.Put(key, fd)
            End If
        End If
    Next
End Sub

Private Sub FileDataToKey (fd As FileData) As String
    Return fd.Modified & "|" & fd.Size
End Sub

Public Sub CreateFileData (FilePath As String, Modified As Long, Size As Long) As FileData
    Dim t1 As FileData
    t1.Initialize
    t1.FilePath = FilePath
    t1.Modified = Modified
    t1.Size = Size
    Return t1
End Sub

This is not a full solution.

Thank you Erel , I presume that the non UI version would be faster ?
 
Upvote 0

Quandalle

Member
Licensed User
Here is a solution I use where I scan all directories and store the information in a SQLite database.
The next step calculates and stores a hash code (md5) on the files that have the same size.
Then I simply query the database using the md5 field to scan/filter the duplicates and do the desired processing.
The solution is not based on the date and the name of the file because sometimes files are identical even though they have different names or dates.
The complementary libraries used are : JSQL, Byteconverter, Encryption

B4X:
'Non-UI application (console / server application)
#Region Project Attributes
    #CommandLineArgs:
    #MergeLibraries: True
    #AdditionalJar: sqlite-jdbc-3.27.2.1
    #AdditionalJar: bcprov-jdk15on-159
#End Region

Sub Process_Globals
    Dim nbFiles As Long
    Dim nbDir As Long
    
    Dim FileDB As String = "fileinfo.db"
    Dim sql As SQL
End Sub

Sub AppStart (Args() As String)
    nbFiles = 0
    nbDir = 0
    
    CreateDB

    ' scan one or several folders   
    sql.BeginTransaction
    ReadDir("c:\....",True)
    ReadDir("D:\....",True)
    
    sql.TransactionSuccessful
    Log (nbFiles)
    Log (nbDir)
    
    SearchDuplicate
    
    sql.Close   
End Sub

'Return true to allow the default exceptions handler to handle the uncaught exception.
Sub Application_Error (Error As Exception, StackTrace As String) As Boolean
    Return True
End Sub

Sub ReadDir(folder As String, recursive As Boolean)
    Dim lst As List = File.ListFiles(folder)
    If lst.IsInitialized Then
        For i = 0 To lst.Size - 1
            If File.IsDirectory(folder,lst.Get(i)) Then
                Dim v As String
                v = folder&"\"&lst.Get(i)
                nbDir = nbDir+1
                If recursive Then
                    ReadDir(v,recursive)
                End If
            Else
                Dim req As String = "INSERT INTO TFileInfo VALUES (?,?,?, ?, ?)"
                sql.ExecNonQuery2(req, Array As Object (lst.Get(i), GetFileExt(lst.Get(i)).ToLowerCase,  folder, File.Size(folder,lst.Get(i)),nbFiles))
                nbFiles = nbFiles+1
                If nbFiles Mod 1000 = 0 Then Log (nbFiles)
            End If
        Next
    Else
        Log (folder & " : non accessible")
    End If
End Sub


Sub CreateDB
    If File.Exists(File.DirApp,FileDB) Then     File.Delete(File.DirApp, FileDB)
    sql.InitializeSQLite(File.DirApp,FileDB, True)

    sql.ExecNonQuery( $"CREATE TABLE "TFileInfo" (
        "filename" TEXT,
        "extension" TEXT,
        "folder" TEXT,
        "filesize" INTEGER,
        "md5"  TEXT
    "$)   
End Sub

Sub SearchDuplicate   
    Dim req As String = $"select rowid, * from  tfileinfo
                        where filesize in (
                                SELECT  filesize
                                FROM     tfileinfo
                                 where (filesize<> 0)
                                GROUP BY filesize
                                HAVING   COUNT(filesize) > 1 )"$
    Dim count As Int =0
    
    Dim Cursor As ResultSet
    Cursor = sql.ExecQuery(req)
    Do While Cursor.NextRow
        Dim h = Cursor.GetString("md5")
        If h.Length <> 32 Then h = Hash(Cursor.GetString("folder"),Cursor.GetString("filename") )
        sql.ExecNonQuery2("Update TFileInfo set md5=? Where rowid=?",Array As Object (h, Cursor.GetInt("rowid")) )
        Log(Cursor.GetInt("rowid"))   
        count = count +1
    Loop
    Cursor.Close
Log ("total duplicate " & count)

    
    ' output a summary of duplicate files
    req = $"Select nuple, count(nuple) as cn from (
        Select  count(md5) As nuple
        FROM     tfileinfo
        GROUP BY md5
        HAVING   count(md5) > 1 )
        Group By nuple     "$
    Cursor = sql.ExecQuery(req)
    Do While Cursor.NextRow
        Log(Cursor.GetString("nuple") & " " &  Cursor.GetString("cn"))
    Loop


End Sub


Sub Hash (dir As String, filename As String) As String
    Dim in As InputStream
    If File.Size(dir, filename) > 1000000000 Then
        Log ("****** file ignore ****")
        Log (filename & " " & File.Size(dir, filename) )
        Return "TOO BIG"
    End If

    in = File.OpenInput(dir,filename)
    Dim buffer(File.Size(dir, filename)) As Byte
    in.ReadBytes(buffer, 0, buffer.length)
    Dim Bconv As ByteConverter
    Dim data(buffer.Length) As Byte
    Dim md As MessageDigest
    data = md.GetMessageDigest(buffer, "MD5")
    Return  Bconv.HexFromBytes(data)
End Sub

Sub GetFileExt(FullPath As String) As String
    Return FullPath.SubString(FullPath.LastIndexOf(".")+1)
End Sub
 
Upvote 0

Peter Lewis

Active Member
Licensed User
Longtime User
Here is a solution I use where I scan all directories and store the information in a SQLite database.
The next step calculates and stores a hash code (md5) on the files that have the same size.
Then I simply query the database using the md5 field to scan/filter the duplicates and do the desired processing.
The solution is not based on the date and the name of the file because sometimes files are identical even though they have different names or dates.
The complementary libraries used are : JSQL, Byteconverter, Encryption

B4X:
'Non-UI application (console / server application)
#Region Project Attributes
    #CommandLineArgs:
    #MergeLibraries: True
    #AdditionalJar: sqlite-jdbc-3.27.2.1
    #AdditionalJar: bcprov-jdk15on-159
#End Region

Sub Process_Globals
    Dim nbFiles As Long
    Dim nbDir As Long
   
    Dim FileDB As String = "fileinfo.db"
    Dim sql As SQL
End Sub

Sub AppStart (Args() As String)
    nbFiles = 0
    nbDir = 0
   
    CreateDB

    ' scan one or several folders  
    sql.BeginTransaction
    ReadDir("c:\....",True)
    ReadDir("D:\....",True)
   
    sql.TransactionSuccessful
    Log (nbFiles)
    Log (nbDir)
   
    SearchDuplicate
   
    sql.Close  
End Sub

'Return true to allow the default exceptions handler to handle the uncaught exception.
Sub Application_Error (Error As Exception, StackTrace As String) As Boolean
    Return True
End Sub

Sub ReadDir(folder As String, recursive As Boolean)
    Dim lst As List = File.ListFiles(folder)
    If lst.IsInitialized Then
        For i = 0 To lst.Size - 1
            If File.IsDirectory(folder,lst.Get(i)) Then
                Dim v As String
                v = folder&"\"&lst.Get(i)
                nbDir = nbDir+1
                If recursive Then
                    ReadDir(v,recursive)
                End If
            Else
                Dim req As String = "INSERT INTO TFileInfo VALUES (?,?,?, ?, ?)"
                sql.ExecNonQuery2(req, Array As Object (lst.Get(i), GetFileExt(lst.Get(i)).ToLowerCase,  folder, File.Size(folder,lst.Get(i)),nbFiles))
                nbFiles = nbFiles+1
                If nbFiles Mod 1000 = 0 Then Log (nbFiles)
            End If
        Next
    Else
        Log (folder & " : non accessible")
    End If
End Sub


Sub CreateDB
    If File.Exists(File.DirApp,FileDB) Then     File.Delete(File.DirApp, FileDB)
    sql.InitializeSQLite(File.DirApp,FileDB, True)

    sql.ExecNonQuery( $"CREATE TABLE "TFileInfo" (
        "filename" TEXT,
        "extension" TEXT,
        "folder" TEXT,
        "filesize" INTEGER,
        "md5"  TEXT
    "$)  
End Sub

Sub SearchDuplicate  
    Dim req As String = $"select rowid, * from  tfileinfo
                        where filesize in (
                                SELECT  filesize
                                FROM     tfileinfo
                                 where (filesize<> 0)
                                GROUP BY filesize
                                HAVING   COUNT(filesize) > 1 )"$
    Dim count As Int =0
   
    Dim Cursor As ResultSet
    Cursor = sql.ExecQuery(req)
    Do While Cursor.NextRow
        Dim h = Cursor.GetString("md5")
        If h.Length <> 32 Then h = Hash(Cursor.GetString("folder"),Cursor.GetString("filename") )
        sql.ExecNonQuery2("Update TFileInfo set md5=? Where rowid=?",Array As Object (h, Cursor.GetInt("rowid")) )
        Log(Cursor.GetInt("rowid"))  
        count = count +1
    Loop
    Cursor.Close
Log ("total duplicate " & count)

   
    ' output a summary of duplicate files
    req = $"Select nuple, count(nuple) as cn from (
        Select  count(md5) As nuple
        FROM     tfileinfo
        GROUP BY md5
        HAVING   count(md5) > 1 )
        Group By nuple     "$
    Cursor = sql.ExecQuery(req)
    Do While Cursor.NextRow
        Log(Cursor.GetString("nuple") & " " &  Cursor.GetString("cn"))
    Loop


End Sub


Sub Hash (dir As String, filename As String) As String
    Dim in As InputStream
    If File.Size(dir, filename) > 1000000000 Then
        Log ("****** file ignore ****")
        Log (filename & " " & File.Size(dir, filename) )
        Return "TOO BIG"
    End If

    in = File.OpenInput(dir,filename)
    Dim buffer(File.Size(dir, filename)) As Byte
    in.ReadBytes(buffer, 0, buffer.length)
    Dim Bconv As ByteConverter
    Dim data(buffer.Length) As Byte
    Dim md As MessageDigest
    data = md.GetMessageDigest(buffer, "MD5")
    Return  Bconv.HexFromBytes(data)
End Sub

Sub GetFileExt(FullPath As String) As String
    Return FullPath.SubString(FullPath.LastIndexOf(".")+1)
End Sub
Hi, Thank you for your input. Was this initially a Android Program ? B4J does not pick up MessageDigest.
Rgds
 
Upvote 0

Erel

B4X founder
Staff member
Licensed User
Longtime User
I presume that the non UI version would be faster ?
Not really. The difference is that in a non-ui app you don't need to worry about holding the main thread for a long time. In a UI app it will cause the form to freeze. You can solve it by switching to the async methods, however it will be more complicated.
 
Upvote 0

Quandalle

Member
Licensed User
In a UI app it will cause the form to freeze. You can solve it by switching to the async methods, however it will be more complicated.

On a non-UI B4J program you just have to call "log" to output a message.
On a B4J-UI program (outside the development environment), you have to display the progress on an element of the interface such as a label for example. On the other hand, to refresh the label display, the current browsing activity must be interrupted by a "sleep".

The problem occurs in a recursive path : the call to sleep goes up one level in recursive calls but it seems that the display is not refreshed correctly.
To solve this problem, I made a non-recursive directory browse routine by changing recurisivity by a loop, just as efficient. (cf code exemple )

Non recursive:
dim rootdir as string
dim nbFiles as int
dim nbDir as int
dim label1 as label
...


Sub Button2_Click
    Dim dc As DirectoryChooser
  
    nbFiles = 0
    nbDir = 0
    dc.Initialize
    dc.Title = "Select root directory to scan"
    If rootDir <> "" Then dc.InitialDirectory = rootDir
    rootDir  = dc.Show(MainForm)
    Dim rs As ResumableSub = ReadDir2(rootDir, True)
    Wait For(rs) Complete (nb As Int )
    Log("end scan")
    Label1.Text = nbFiles & " Files" & " " & nbDir & " Dir"
End Sub




Sub ReadDir2(startfolder As String, recursive As Boolean)    As ResumableSub
    Dim folders As List
    folders.Initialize
  
    folders.Add(startfolder)
  
    Do While folders.Size > 0
        Dim currentfolder As String = folders.Get(0)
        folders.RemoveAt(0)
              
        Dim lst As List = File.ListFiles(currentfolder)
        If lst.IsInitialized Then
            For i = 0 To lst.Size - 1
                If File.IsDirectory(currentfolder,lst.Get(i)) Then
                    nbDir = nbDir+1
                    If recursive Then folders.Add(currentfolder&"\"&lst.Get(i))
                Else
                    nbFiles = nbFiles+1
                    If nbFiles Mod 10 = 0 Then
                        Label1.Text= nbFiles & " files" & " " & nbDir & " directories"
                        Sleep(0)
                    End If
                End If
            Next
        Else
            Log (currentfolder & " : non accessible")
        End If
    Loop
    Return nbFiles
End Sub
 
Upvote 0
Top