I had a deeper look into your problem; the attached project is a demonstrator for your request.
To test beeps, the program includes three beep mp3 files:
w_500_4.mp3 is the reference beep; it is composed of 4 frequencies (500, 1000, 1500, 2000 Hz).
w_520_4.mp3 is a comparison beep; it is composed of 4 frequencies (520, 1020, 1520, 2020 Hz).
w_530_4.mp3 is a comparison beep; it is composed of 4 frequencies (530, 1030, 1530, 2030 Hz).
Program flow:
1. Recording of a sound signal (the beeps)
The program reads 8192 time samples (can be changed).
2. FFT calculation
3. Peak detection. There is a peak threshold: only peaks with a magnitude higher than the threshold are taken into account.
The threshold level in the program is 15% of the maximum peak level (can be changed).
4. After a click on Beep, the beep is compared to the reference beep by the number of frequency components and by their frequencies.
If the number of frequency components is different, the beeps are considered different.
If the numbers match and every frequency component is within a limit (25 Hz in the program) of the corresponding reference component, the beeps are considered the same.
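Steps 2 to 4 can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of the attached project: the function names, the synthetic beeps (sums of sines standing in for the mp3 files), the Hann window and the simple local-maximum peak search are all assumptions of mine. Only the values 44100 Hz, 8192 samples, 15% and 25 Hz come from the description above.

```python
import cmath
import math

SAMPLE_RATE = 44100      # Hz, sampling frequency from the post
N_SAMPLES = 8192         # number of time samples, must be a power of 2
THRESHOLD_RATIO = 0.15   # peak threshold: 15% of the max peak level
TOLERANCE_HZ = 25.0      # frequency limit used for the comparison


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def synth_beep(freqs, n=N_SAMPLES, rate=SAMPLE_RATE):
    """Hypothetical stand-in for a recorded beep: sum of equal-amplitude sines."""
    return [sum(math.sin(2 * math.pi * f * i / rate) for f in freqs)
            for i in range(n)]


def detect_peaks(samples, rate=SAMPLE_RATE, ratio=THRESHOLD_RATIO):
    """Return the frequencies (Hz) of local maxima above ratio * max magnitude."""
    n = len(samples)
    # Hann window (my assumption) to limit spectral leakage.
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, s in enumerate(samples)]
    mag = [abs(c) for c in fft(windowed)[: n // 2]]
    threshold = ratio * max(mag)
    peaks = []
    for k in range(1, len(mag) - 1):
        if mag[k] > threshold and mag[k] >= mag[k - 1] and mag[k] > mag[k + 1]:
            peaks.append(k * rate / n)  # bin index -> frequency in Hz
    return peaks


def same_beep(ref_peaks, test_peaks, tol=TOLERANCE_HZ):
    """Same number of components and every pair within tol Hz."""
    if len(ref_peaks) != len(test_peaks):
        return False
    return all(abs(r - t) <= tol for r, t in zip(ref_peaks, test_peaks))


reference = detect_peaks(synth_beep([500, 1000, 1500, 2000]))
candidate = detect_peaks(synth_beep([520, 1020, 1520, 2020]))
print(reference, candidate, same_beep(reference, candidate))
```

The detected frequencies are multiples of the bin width (about 5.4 Hz here), so the measured difference between two beeps can deviate by a few Hz from the true difference; this is worth keeping in mind when choosing the limit.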
Test:
1. Click on Sound
This records the reference beep.
The time signal is shown.
2. Click on FFT
Shows the FFT graph.
You see a horizontal red line, which is the peak detector threshold level.
On the right you see the detected peaks with their frequency.
3. Click on Beep
You see a red FFT graph for the generated beep.
A Toast message appears showing whether the beep is considered the same or not.
A click on REC records the mic input, like a spectrum analyser.
Some information about the FFT:
You need to know the relationship between the sampling frequency, the number of time samples, the acquisition time and the frequency resolution.
The table below shows it:
The first line shows 44100, which is the sampling frequency. I put it in the table only for comparison; it cannot be used for FFT calculations, because the number of samples for the FFT must be a power of 2.
I found that 8192 time samples is a good compromise:
an acquisition time of less than 200 ms (about 186 ms at 44100 Hz) and a frequency resolution of about 5 Hz (about 5.4 Hz).
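The relationship mentioned above is simple: acquisition time = number of samples / sampling frequency, and frequency resolution = sampling frequency / number of samples. A small Python sketch to compute the values; the function name and the list of sample counts are my own choice for illustration and are not necessarily the rows of the table.

```python
SAMPLE_RATE = 44100  # Hz, sampling frequency from the post


def fft_parameters(n_samples, rate=SAMPLE_RATE):
    """Return (acquisition time in ms, frequency resolution in Hz)."""
    acquisition_ms = 1000.0 * n_samples / rate
    resolution_hz = rate / n_samples
    return acquisition_ms, resolution_hz


# 44100 is listed only for comparison; the others are powers of 2.
for n in (44100, 1024, 2048, 4096, 8192, 16384, 32768):
    t_ms, df_hz = fft_parameters(n)
    print(f"{n:6d} samples  {t_ms:8.1f} ms  {df_hz:7.2f} Hz")
```

Doubling the number of samples halves the bin width but doubles the acquisition time, which is the trade-off behind the choice of 8192.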