B4J Question Encryption 1.1 library memory leak?

cbal03

Active Member
Licensed User
Longtime User
I'm creating a B4J UI program. Iterating the filesystem and filling a database (9 fields) works fine. However, when I use the MD5 feature of the Encryption v1.1 library (I am calculating a checksum of the files themselves), memory quickly rises out of the heap space (>8 GB within the first few hundred iterations) and crashes the program.

With the MD5 code commented out and the string "na" in the result variable, the program iterates all files in the selected directory (17,187 files), loads them all into a single list and updates an SQLite table in a single batch without exceeding 1.2 GB of memory.

The crash also occurs if I specify 'SHA-1', except it crashes faster (after maybe 150 iterations).
Files greater than about 1.5 GB will crash the program instantly.

I am replacing a media system I wrote years ago that relies on PHP to perform this task (it still runs great).
That media system uses an HTTP server, a database server and PHP to deliver all of my home media to my devices and computer web browsers. The new system is intended to be a single application that handles everything and does the same job.

Is there a better way to get the MD5 or similar using B4J?
Thanks for any comments.


Update:
found this

and applied inline Java to perform the task. It is much faster and does not generate the out-of-memory error. It actually uses a little less memory.

However, larger files (a GB or more) still crash with an 'out of memory' error.

Update:
MEMORY LEAK?
NO.
I was loading an entire file into a variable, which caused the 'out of memory' error.
The answer is to stream the file in, process it in chunks, then return the result.
(for anyone else with this issue)
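For reference, here is a minimal sketch of that streamed approach as inline Java. It uses the JDK's DigestInputStream, and the method name streamedMD5 is just illustrative (the next reply shows a plain read-loop variant):

Streamed MD5 sketch (inline Java):
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;

    public static String streamedMD5(String filePath) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        // The DigestInputStream feeds every byte it reads into md, so only
        // one buffer's worth of the file is in memory at a time.
        try (InputStream in = new DigestInputStream(
                new BufferedInputStream(new FileInputStream(filePath)), md)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // reading is all that is needed; the digest updates as a side effect
            }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }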
 
Last edited:

MicroDrie

Well-Known Member
Licensed User
Longtime User
Maybe you can try the following small inline Java code. It worked for me with a 104 MB MP3 file, JavaObject library 2.06 and B4J version 10.20 (64-bit).

Call inline Java:
    Dim s As String = NativeMe.RunMethod("TestMD5", Array ("MMyTestValue"))
    Log(s)
    Dim s As String = NativeMe.RunMethod("generateMD5", Array ("c:\example\YourFile.mp3"))
    Log(s)

Replace "c:\example\YourFile.mp3" with the drive, path and file name of the file you want to calculate the MD5 hash code for.

Inline Java code:
#If JAVA

import java.io.FileInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
    public static String TestMD5(String originalString) { //<-- static method

        // Initialize the MessageDigest object for MD5 hashing.
        MessageDigest md = null;
        try {
            md = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        // Input the data you want to hash into a byte array.
        byte[] bytesOfMessage = originalString.getBytes();

        // Use the digest method to perform the hashing.
        byte[] digest = md.digest(bytesOfMessage);

        // Finally, convert the byte array to a hexadecimal String.
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        String hashValue = sb.toString();
        return hashValue;
    }

    public static String generateMD5(String filePath) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (FileInputStream fis = new FileInputStream(filePath)) {
            byte[] buffer = new byte[8192];
            int bytesRead;
            while ((bytesRead = fis.read(buffer)) != -1) {
                md.update(buffer, 0, bytesRead);
            }
        }
        byte[] hash = md.digest();
        StringBuilder sb = new StringBuilder();
        for (byte b : hash) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
#End If
 
Upvote 0

cbal03

Active Member
Licensed User
Longtime User
Thanks MicroDrie. I'll give it a shot and report back shortly.
 
Upvote 0

cbal03

Active Member
Licensed User
Longtime User
Your suggestion works very well. I can process large lists of files, each in excess of 5 GB, without any errors and with only a minimal increase in memory usage (~100 MB rise).
The UI does hang for the 10 seconds or so each file takes to process, but I think I can use your example to streamline the process using Wait For, manually feeding smaller chunks to the inline Java and then assembling the MD5 on return.
I can probably also show the progress of each file being processed along with the overall progress with minimal UI interruptions.
Thank you so much for teaching me how to fish. Now I can go after the big ones!
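A rough sketch of keeping a running digest alive across such chunk calls (the inline-Java method names startMD5/updateMD5/finishMD5 are illustrative only, not from any library); the B4X side would feed each chunk to updateMD5 and ask for the final hex string at the end:

Running digest sketch (inline Java):
import java.security.MessageDigest;

    // One MessageDigest kept alive between calls, so B4X code can feed the
    // file chunk by chunk (yielding to the UI in between) and only request
    // the final hex string when the whole file has been read.
    private static MessageDigest runningMd;

    public static void startMD5() throws Exception {
        runningMd = MessageDigest.getInstance("MD5");
    }

    public static void updateMD5(byte[] chunk, int length) {
        runningMd.update(chunk, 0, length);
    }

    public static String finishMD5() {
        StringBuilder sb = new StringBuilder();
        for (byte b : runningMd.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }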
 
Upvote 0

MicroDrie

Well-Known Member
Licensed User
Longtime User
Glad your crash problem is resolved. As for the long time it takes to process the MD5, the question is why you need the MD5 hash at all.
(I am getting a checksum of the files themselves)
The calculation of a CRC32 is much faster than MD5, but it is not a cryptographically secure checksum. CRC32 is widely used for data integrity checks, error detection, and identification of duplicates in databases.
CRC32 and MD5 are both just technical means to achieve your specific goal. Perhaps you can re-evaluate that goal and use CRC32, which is faster.
 
Upvote 0

cbal03

Active Member
Licensed User
Longtime User
Thank you for the info. I'll look into CRC32. To be honest, I didn't really care about the type of checksum back when I built the WAMP-style media server; I just assumed MD5 was good enough and didn't look for a more efficient way of uniquely identifying the files. Now that I think about it, there was a memory issue back then too: I had to force PHP to use more memory than the default.
 
Upvote 0

MicroDrie

Well-Known Member
Licensed User
Longtime User
If MD5 is not a requirement, but speed is important, you might consider using the 64-bit version of 'xxHash - Extremely fast hash algorithm'. You might also consider increasing the buffer size to reduce the amount of disk I/O. Ideally, the buffer size should equal the disk's block size or, failing that, a multiple of it; try a few sizes and see what the time gain is.
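As a sketch of what xxHash could look like as inline Java, assuming the third-party lz4-java library is added to the project (it bundles a Java xxHash implementation; the artifact and exact API should be verified, e.g. added via #AdditionalJar):

xxHash64 sketch (inline Java, assumes the lz4-java jar is available):
import java.io.FileInputStream;
import net.jpountz.xxhash.StreamingXXHash64;
import net.jpountz.xxhash.XXHashFactory;

    public static long generateXXHash64(String filePath) throws Exception {
        // Streaming 64-bit xxHash; the seed (0 here) is arbitrary but must stay
        // the same if values are later compared against stored ones.
        XXHashFactory factory = XXHashFactory.fastestInstance();
        StreamingXXHash64 hash = factory.newStreamingHash64(0);
        try (FileInputStream fis = new FileInputStream(filePath)) {
            byte[] buffer = new byte[65536];
            int bytesRead;
            while ((bytesRead = fis.read(buffer)) != -1) {
                hash.update(buffer, 0, bytesRead);
            }
        }
        return hash.getValue();
    }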
 
Upvote 0

cbal03

Active Member
Licensed User
Longtime User
Although I have zero experience with Java, I have managed to get the CRC32 of each file using your example.
CRC is significantly faster on .mp3 files (5 to 10 MB) but significantly slower on larger TV show and movie files (a few hundred MB to a few GB). Oh, and memory usage is minimal on all files. :)
Now that I think a little more about it, I think that was the reason I chose MD5. I read (long ago) that it was reasonably fast on both small and large files.

CRC32 and inline Java:
import java.io.FileInputStream;
import java.util.zip.CRC32;

    public static Long generateCRC32(String filePath) throws Exception {
        CRC32 crc = new CRC32();
        try (FileInputStream fis = new FileInputStream(filePath)) {
            byte[] buffer = new byte[8192];
            int bytesRead;
            while ((bytesRead = fis.read(buffer)) != -1) {
                crc.update(buffer, 0, bytesRead);
            }
        }
        return crc.getValue();
    }
 
Last edited:
Upvote 0

cbal03

Active Member
Licensed User
Longtime User
If MD5 is not a requirement, but speed is important, you might consider using the 64-bit version of 'xxHash - Extremely fast hash algorithm'. You might also consider increasing the buffer size to reduce the amount of disk I/O. Ideally, the buffer size should equal the disk's block size or, failing that, a multiple of it; try a few sizes and see what the time gain is.
Interesting.
 
Upvote 0

MicroDrie

Well-Known Member
Licensed User
Longtime User
Interesting.
Interesting, especially if you want to squeeze out all possible time and memory gains. There are two possibilities at disk level: either an entire track is read into the buffer at once, or a block of several sectors from a track is read into a buffer. Because disk I/O is relatively slow, you should keep it as limited as possible. A sector is 512 or 4096 bytes, or a multiple of this. If the CRC/MD5 buffer is a multiple of the disk buffer, you need the fewest disk I/O actions for a given amount of memory usage. In addition, how fragmented the file's sectors and tracks are on the hard disk plays a role, as do the disk I/O actions of other activities. Movement of the disk head during reading and writing also takes time. That's why a buffer is only written to disk when it is full or the file is closed.
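A rough way to test this on a given machine is sketched below. It is only an illustration (not from the thread), using the JDK's CRC32 with a few buffer sizes that are multiples of a 4096-byte sector; results will differ per disk:

Buffer size comparison sketch (inline Java):
import java.io.FileInputStream;
import java.util.zip.CRC32;

    public static void compareBufferSizes(String filePath) throws Exception {
        // Hash the same file with buffers that are increasing multiples of a
        // 4096-byte sector and print the elapsed time for each, to see where
        // the gain from fewer read calls levels off.
        int[] bufferSizes = {4096, 4096 * 8, 4096 * 32, 4096 * 128};
        for (int size : bufferSizes) {
            long startTime = System.nanoTime();
            CRC32 crc = new CRC32();
            try (FileInputStream fis = new FileInputStream(filePath)) {
                byte[] buffer = new byte[size];
                int bytesRead;
                while ((bytesRead = fis.read(buffer)) != -1) {
                    crc.update(buffer, 0, bytesRead);
                }
            }
            long elapsedMs = (System.nanoTime() - startTime) / 1_000_000;
            System.out.println(size + " byte buffer: " + elapsedMs + " ms, crc=" + Long.toHexString(crc.getValue()));
        }
    }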
 
Upvote 0

Daestrum

Expert
Licensed User
Longtime User
I have some code for MD5 calculation (in Java). It does a 1.5 GB (video) file in 456 ms. (It needs a recent Java for the virtual threads.)
B4X:
#if java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.security.MessageDigest;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ExecutorService;


    public static String calculateMD5(String filePath, int chunkSize) throws Exception {
        // Initialize the thread-safe map to store chunk MD5 results
        Map<Integer, byte[]> hashResults = new ConcurrentSkipListMap<>();

        // Open the file and map it into memory
        RandomAccessFile file = new RandomAccessFile(filePath, "r");
        FileChannel channel = file.getChannel();
        long fileSize = channel.size();
        int numberOfChunks = (int) Math.ceil((double) fileSize / chunkSize);

        // Create a virtual thread executor
        try (ExecutorService executor = Executors.newThreadPerTaskExecutor(Thread.ofVirtual().factory())) {

            // Process each chunk in parallel
            for (int chunkIndex = 0; chunkIndex < numberOfChunks; chunkIndex++) {
                int finalChunkIndex = chunkIndex;
                executor.submit(() -> {
                    try {
                        long start = (long) finalChunkIndex * chunkSize;
                        long size = Math.min(chunkSize, fileSize - start);
                        MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, start, size);

                        MessageDigest md5Digest = MessageDigest.getInstance("MD5");
                        md5Digest.update(buffer);
                        hashResults.put(finalChunkIndex, md5Digest.digest());
                    } catch (Exception e) {
                        throw new RuntimeException("Error processing chunk " + finalChunkIndex, e);
                    }
                });
            }
        }

        // Combine the chunk hashes sequentially into the final MD5 hash
        MessageDigest finalDigest = MessageDigest.getInstance("MD5");
        for (byte[] chunkHash : hashResults.values()) {
            finalDigest.update(chunkHash);
        }

        // Convert the final hash to a hex string
        StringBuilder result = new StringBuilder();
        for (byte b : finalDigest.digest()) {
            result.append(String.format("%02x", b));
        }
        return result.toString();
    }

    public static void doIt(String filename) throws Exception {
        String filePath = filename;
        int chunkSize = 512*1024; // chunk size
        String md5Hash = calculateMD5(filePath, chunkSize);
        System.out.println("MD5 Hash: " + md5Hash);
    }

#End If
 
Last edited:
Upvote 0

cbal03

Active Member
Licensed User
Longtime User
Interesting, especially if you want to squeeze out all possible time and memory gains. There are two possibilities at disk level: either an entire track is read into the buffer at once, or a block of several sectors from a track is read into a buffer. Because disk I/O is relatively slow, you should keep it as limited as possible. A sector is 512 or 4096 bytes, or a multiple of this. If the CRC/MD5 buffer is a multiple of the disk buffer, you need the fewest disk I/O actions for a given amount of memory usage. In addition, how fragmented the file's sectors and tracks are on the hard disk plays a role, as do the disk I/O actions of other activities. Movement of the disk head during reading and writing also takes time. That's why a buffer is only written to disk when it is full or the file is closed.
I looked around a bit and couldn't find the actual libraries or files to download for xxHash, other than some C files. I did do quite a bit of reading about all of the platforms it supports. Also, I'm unsure how to implement xxHash in B4J.
 
Upvote 0

cbal03

Active Member
Licensed User
Longtime User
I have some code for MD5 calculation (in Java). It does a 1.5 GB (video) file in 456 ms. (It needs a recent Java for the virtual threads.)
B4X:
#if java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.security.MessageDigest;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ExecutorService;


    public static String calculateMD5(String filePath, int chunkSize) throws Exception {
        // Initialize the thread-safe map to store chunk MD5 results
        Map<Integer, byte[]> hashResults = new ConcurrentSkipListMap<>();

        // Open the file and map it into memory
        RandomAccessFile file = new RandomAccessFile(filePath, "r");
        FileChannel channel = file.getChannel();
        long fileSize = channel.size();
        int numberOfChunks = (int) Math.ceil((double) fileSize / chunkSize);

        // Create a virtual thread executor
        try (ExecutorService executor = Executors.newThreadPerTaskExecutor(Thread.ofVirtual().factory())) {

            // Process each chunk in parallel
            for (int chunkIndex = 0; chunkIndex < numberOfChunks; chunkIndex++) {
                int finalChunkIndex = chunkIndex;
                executor.submit(() -> {
                    try {
                        long start = (long) finalChunkIndex * chunkSize;
                        long size = Math.min(chunkSize, fileSize - start);
                        MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, start, size);

                        MessageDigest md5Digest = MessageDigest.getInstance("MD5");
                        md5Digest.update(buffer);
                        hashResults.put(finalChunkIndex, md5Digest.digest());
                    } catch (Exception e) {
                        throw new RuntimeException("Error processing chunk " + finalChunkIndex, e);
                    }
                });
            }
        }

        // Combine the chunk hashes sequentially into the final MD5 hash
        MessageDigest finalDigest = MessageDigest.getInstance("MD5");
        for (byte[] chunkHash : hashResults.values()) {
            finalDigest.update(chunkHash);
        }

        // Convert the final hash to a hex string
        StringBuilder result = new StringBuilder();
        for (byte b : finalDigest.digest()) {
            result.append(String.format("%02x", b));
        }
        return result.toString();
    }

    public static void doIt(String filename) throws Exception {
        String filePath = filename;
        int chunkSize = 512*1024; // chunk size
        String md5Hash = calculateMD5(filePath, chunkSize);
        System.out.println("MD5 Hash: " + md5Hash);
    }

#End If
I'll give this a shot Daestrum. At first glance it appears very powerful. Thanks for sharing!
 
Upvote 0

Daestrum

Expert
Licensed User
Longtime User
The logic behind it:
1, split the file into chunks (in my case 512 KB)
2, calculate the MD5 of the chunks (in virtual threads; >4000 threads for a 2.4 GB file)
3, save the partial MD5s in a thread-safe map keyed by the chunk number (MD5.update needs them in sequence)
4, read the map sequentially and update the hash
5, return the hash as a hex string.
 
Upvote 0

cbal03

Active Member
Licensed User
Longtime User
The logic behind it:
1, split the file into chunks (in my case 512 KB)
2, calculate the MD5 of the chunks (in virtual threads; >4000 threads for a 2.4 GB file)
3, save the partial MD5s in a thread-safe map keyed by the chunk number (MD5.update needs them in sequence)
4, read the map sequentially and update the hash
5, return the hash as a hex string.
I tried to run your inline code, but I ran into compile errors ranging from 'symbol not found: method ofVirtual()' to 'JavaFX application not found' to 'lambda not supported' in every configuration I could think of with respect to B4J and Java version.
At one point a compile error suggested I use -source 8 or higher to enable lambdas, and also use a compile switch to enable preview features...

Do you have a recommended setup to get this going?
All this testing has left me with B4J v10.2 pointing to Java 8, and my application does run properly without virtual threading.

Edit: JDK 11
 
Last edited:
Upvote 0

Daestrum

Expert
Licensed User
Longtime User
I believe virtual threads arrived in Java 21+.

On the later Javas you need to use this (this is my current line):
B4X:
#JavaCompilerPath: 24, D:\jdk-24\bin\javac.exe
 
Upvote 0

cbal03

Active Member
Licensed User
Longtime User
I believe virtual threads arrived in Java 21+.

On the later Javas you need to use this (this is my current line):
B4X:
#JavaCompilerPath: 24, D:\jdk-24\bin\javac.exe
OK, I've got everything going... love the new IDE (I can build executables now), and Java 24 with JavaFX is working smoothly. I have it running on a serving machine, albeit a bit dated (AMD FX 6300, 16 GB RAM, 6 cores - built it loooong ago). The drives are local to the machine (WD 10k RPM Raptors) and not across the network from my dev box. I increased the chunk size (int chunkSize = 1024*1024;) and memory usage shows a sawtooth waveform depending on the file; 3 GB has been the max, mostly hanging around 1.5 to 2.5 GB.
I'm logging times in the debug DOS window (I commented out your System.out and added a return type and value to the 'doIt' method).
Times look much better than I've ever seen: 17 seconds for a 922 MB file, 47 seconds for a 1.932 GB file.
Of course the UI hangs while waiting on each one, but it refreshes as the files are advanced through.

Overall very impressive. :)

Oh, and there are warnings issued in the Logs.

WARNING: A restricted method in java.lang.System has been called
WARNING: java.lang.System::load has been called by com.sun.glass.utils.NativeLibLoader in module javafx.graphics (file:/C:/Java/jdk-24/bin/../javafx/lib/javafx.graphics.jar)
WARNING: Use --enable-native-access=javafx.graphics to avoid a warning for callers in this module
WARNING: Restricted methods will be blocked in a future release unless native access is enabled
WinRT: combase.dll not found <<=== ??? Server library??
WARNING: A restricted method in java.lang.System has been called
WARNING: java.lang.System::load has been called by org.sqlite.SQLiteJDBCLoader in an unnamed module (file:/D:/MY%20Apps/Basic4Java/ctb_media_server/Objects/ctb_media_server_orig.jar)
WARNING: Use --enable-native-access=ALL-UNNAMED to avoid a warning for callers in this module
WARNING: Restricted methods will be blocked in a future release unless native access is enabled
WARNING: A terminally deprecated method in sun.misc.Unsafe has been called
WARNING: sun.misc.Unsafe::allocateMemory has been called by com.sun.marlin.OffHeapArray (file:/C:/Java/jdk-24/bin/../javafx/lib/javafx.graphics.jar)
WARNING: Please consider reporting this to the maintainers of class com.sun.marlin.OffHeapArray
WARNING: sun.misc.Unsafe::allocateMemory will be removed in a future release
 
Upvote 0

cbal03

Active Member
Licensed User
Longtime User
(attached screenshot: Untitled.png)
 
Upvote 0

cbal03

Active Member
Licensed User
Longtime User
File Size: 3,308.31 MB
Begin: 03:35:00
Generated Chksum: 5b20184fbeba519048eb809639f5b63c
DB Update Query: Adding Item (4): C:\temp\Aquaman\Aquaman (2018).mp4
End: 03:35:03
Lapsed: 00:00:02

2 seconds for a 3.3 GB file :)

The slower times are definitely hardware-related.
 
Upvote 0

Daestrum

Expert
Licensed User
Longtime User
The log messages are just a side effect of later Java versions enforcing the rules about using non-public APIs. (One is generated by JavaFX itself, which they are trying to fix.)

Hardware does make a big difference, as you say. (I should have pointed out that my timings were in DEBUG mode, non-UI, on an SSD, with a 12-thread processor and 32 GB RAM.)
Finding the sweet spot for the chunk size is the key; at 512 KB mine was fastest.
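For anyone hunting for that sweet spot, here is a small sketch (illustrative only) that reuses the calculateMD5 method posted above and times a few chunk sizes; the sizes are just examples and results depend on disk and CPU:

Chunk size sweep sketch (inline Java):
    public static void sweepChunkSizes(String filePath) throws Exception {
        // Time the same file with several chunk sizes; the fastest one is the
        // candidate value to hard-code for this machine.
        int[] chunkSizes = {256 * 1024, 512 * 1024, 1024 * 1024, 4 * 1024 * 1024};
        for (int chunkSize : chunkSizes) {
            long startTime = System.currentTimeMillis();
            String hash = calculateMD5(filePath, chunkSize);
            long elapsed = System.currentTimeMillis() - startTime;
            System.out.println((chunkSize / 1024) + " KB chunks: " + elapsed + " ms (" + hash + ")");
        }
    }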
 
Upvote 0