Android Question - SQLite remove duplicate records [SOLVED TWICE]

Roger Daley

Hi All,

Can anyone point me towards an example of removing duplicate records in a db?
I have searched for methods/commands in SQLite and DBUtils but nothing stood out to me.
Trying to match every record against every other record through iteration does not appeal, and I'm sure there is a cleaner way to do this.
NOTE: When I say duplicate records, I mean the data in all columns of both records is the same not just matching the "Names" column.

Regards Roger
 

Roger Daley

Mahares said:
It is great that you solved it using either of the two sets of code. But there is an important piece of the puzzle missing. To prevent future duplicates in the existing table, you have to include a primary key composed of the three columns. For that you have to create a temp table with the PRIMARY KEY:
B4X:
strQuery = "CREATE TABLE IF NOT EXISTS SitesBU (SiteName TEXT, Longitude TEXT, Latitude TEXT, PRIMARY KEY (SiteName, Longitude, Latitude))"
then insert all the records from the Sites table into the SitesBU table, then drop table Sites, and finally rename SitesBU to Sites.
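
Spelled out, that whole sequence would look something like this (an untested sketch using the table and column names from the quote; SQL1 stands for your initialized SQL object, and INSERT OR IGNORE is needed so rows violating the new composite key are skipped rather than aborting the copy):
B4X:
' Remove duplicates from Sites by copying into a keyed backup table, then swapping tables.
SQL1.ExecNonQuery("CREATE TABLE IF NOT EXISTS SitesBU (SiteName TEXT, Longitude TEXT, Latitude TEXT, PRIMARY KEY (SiteName, Longitude, Latitude))")
' OR IGNORE silently skips rows that would violate the composite key, i.e. the duplicates
SQL1.ExecNonQuery("INSERT OR IGNORE INTO SitesBU SELECT SiteName, Longitude, Latitude FROM Sites")
SQL1.ExecNonQuery("DROP TABLE Sites")
SQL1.ExecNonQuery("ALTER TABLE SitesBU RENAME TO Sites")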
Fortunately not an issue for me in this case. I bulk load a CSV file which is appended to the existing table, then run this Sub to clean out any duplicates (the load step is sketched below).
Where entries are made manually, they are checked for duplicates at the time.
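
The bulk load step is roughly this (a sketch assuming StringUtils' LoadCSV and the three illustrative columns from above; the file name is hypothetical):
B4X:
Dim su As StringUtils
Dim rows As List = su.LoadCSV(File.DirDefaultExternal, "sites.csv", ",") ' hypothetical file
SQL1.BeginTransaction ' batch the inserts for speed
Try
    For Each fields() As String In rows
        SQL1.ExecNonQuery2("INSERT INTO Sites VALUES (?, ?, ?)", Array As Object(fields(0), fields(1), fields(2)))
    Next
    SQL1.TransactionSuccessful
Catch
    Log(LastException)
End Try
SQL1.EndTransaction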

Good information for others trying to do something similar.

Regards Roger
 

Roger Daley

Klaus said:
Use a Long variable, BeginTime = DateTime.Now, at the beginning, and at the end subtract it from the final time: Duration = DateTime.Now - BeginTime.
Then Log it in both Debug and Release mode: Log(Duration).
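
In B4X that pattern is just (a minimal sketch; RemoveDuplicates is an illustrative name for whichever Sub is being timed):
B4X:
Dim BeginTime As Long = DateTime.Now
RemoveDuplicates ' the Sub under test (illustrative name)
Dim Duration As Long = DateTime.Now - BeginTime ' DateTime.Now is in milliseconds
Log(Duration)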

Klaus,

Using the three line code I get a count of 34.
Using the single line code I get a count of 26.

The single line is faster.

Regards Roger

CORRECTION:
After running both versions of the Sub many times, I find the time is highly variable and there is no stand-out difference. Sorry about the first statement, which was based on a single run.

Roger
 

RB Smissaert

Roger Daley said:
Klaus,

Using the three line code I get a count of 34.
Using the single line code I get a count of 26.

The single line is faster.

CORRECTION:
After running both versions of the Sub many times, I find the time is highly variable and there is no stand-out difference. Sorry about the first statement, which was based on a single run.

With only 1100 records it is not worth measuring.
In that case it is premature optimization, which, as we all know, is the root of all evil.
You will need something like a few hundred thousand records.

RBS
 

Roger Daley

RBS, Klaus, rraswisak, Et Al

I have done a rerun with a CSV file of over 145K records. Five runs of each version produced the following results.

Single line of code, times (ms): 1096, 1100, 1107, 1194, 1129
Three lines of code, times (ms): 693, 632, 615, 973, 694

Interesting!

Regards Roger
 

rraswisak

I don't come from a DBA background, I'm just a general database user. Based on my experience, this is one of the disadvantages of a table without any constraints (a primary key or an indexed column): with a small number of records it works just fine, but with a huge number of records performance can become an issue.

Using a WHERE clause on a non-indexed column takes more time than on an indexed one, and joining two or more tables on non-indexed columns can cause performance issues too; see the sketch below.
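
For example, once the duplicates are gone, a one-off composite index speeds up those lookups (a sketch using the Sites names from earlier in the thread; making it UNIQUE also rejects future duplicate rows at insert time):
B4X:
' One-off: index the columns used in WHERE clauses and joins.
SQL1.ExecNonQuery("CREATE UNIQUE INDEX IF NOT EXISTS idxSites ON Sites (SiteName, Longitude, Latitude)")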
 

Mahares

Roger Daley said:
Single line of code, times (ms): 1096, 1100, 1107, 1194, 1129
Three lines of code, times (ms): 693, 632, 615, 973, 694
Interesting!
I suspect the GROUP BY over several fields plays a big role in slowing down the single line query. Although a one line statement looks prettier and more elegant, we are not running a pageantry event. We may be better off with the method that uses the multi line statement, and you can still add the PRIMARY KEY to it to make it complete. I would love to get hold of a database that has thousands of records, thousands of duplicates and many columns to test it.
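
For reference, the single line statement itself is not quoted in this part of the thread; the usual shape of a rowid/GROUP BY dedupe is something like this (a sketch using the Sites columns, keeping the lowest rowid of each duplicate group):
B4X:
' Keep one row per distinct (SiteName, Longitude, Latitude); delete the rest.
SQL1.ExecNonQuery("DELETE FROM Sites WHERE rowid NOT IN (SELECT MIN(rowid) FROM Sites GROUP BY SiteName, Longitude, Latitude)")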
 

Mahares

Just append the same two/three times to get duplicate records
The site your link points to is not accessible. If you or Roger have a very large SQLite database with duplicates and many columns to attach, I will be glad to test it, or you can do it yourself and let us know the results.
 

Roger Daley

Mahares said:
If you or Roger have a very large SQLite database with duplicates and many columns to attach, I will be glad to test it.
Sorry Mahares, I only have small needs. 1,000 records with 3 columns is big for me.

Regards Roger
 

Mahares

I created a 100,000-record SQLite table with only 3 columns, containing some duplicates, on a Galaxy Tab A 7 inch tablet with OS 5.1. Here are the benchmark tests:

Using the single line syntax provided by RB Smissaert:
Number of recs created with dupl.: 100000
Secs to delete duplicates: 3.664
Rec count after del dupl: 99668

Using the multi line syntax provided by rraswisak:
Number of recs created with dupl.: 100000
Secs to delete duplicates: 2.717
Rec count after del dupl: 99717

When I used the multi line syntax plus PRIMARY KEY, the time was actually 2.619 seconds. When I created a UNIQUE INDEX on the table stripped of duplicates, it took a whopping 5.7 seconds to delete the duplicates and create the index. You decide.
 

AnandGupta

Mahares said:
When I used the multi line syntax plus PRIMARY KEY, the time was actually 2.619 seconds. When I created a UNIQUE INDEX on the table stripped of duplicates, it took a whopping 5.7 seconds to delete the duplicates and create the index. You decide.
It is becoming difficult to decide :)
Better to show a big continuous progress circle and send the process to the background if the user presses back/cancel, something like the sketch below.
We simply do not know how long it will take on a particular database.
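
A minimal sketch of that idea, assuming B4A's ProgressDialogShow2 and the SQL library's async batch API (the DELETE statement is the illustrative one from earlier):
B4X:
Sub RemoveDuplicatesAsync
    ProgressDialogShow2("Removing duplicates...", True) ' cancelable progress circle
    ' Run the delete on a background thread so the UI stays responsive
    SQL1.AddNonQueryToBatch("DELETE FROM Sites WHERE rowid NOT IN (SELECT MIN(rowid) FROM Sites GROUP BY SiteName, Longitude, Latitude)", Null)
    Wait For (SQL1.ExecNonQueryBatch("SQL")) SQL_NonQueryComplete (Success As Boolean)
    ProgressDialogHide
    Log("Dedupe finished, success = " & Success)
End Sub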

Regards,

Anand
 

RB Smissaert

Mahares said:
Using the single line syntax provided by RB Smissaert:
Rec count after del dupl: 99668

Using the multi line syntax provided by rraswisak:
Rec count after del dupl: 99717

When I used the multi line syntax plus PRIMARY KEY, the time was actually 2.619 seconds. When I created a UNIQUE INDEX on the table stripped of duplicates, it took a whopping 5.7 seconds to delete the duplicates and create the index. You decide.

Causing some confusion there:
You mention different record numbers remaining after deleting the duplicates. So, did one of the two methods do it wrongly?
Of course, the unique index only needs to be created once.

RBS
 

Mahares

RB Smissaert said:
Causing some confusion there: you mention different record numbers remaining after deleting the duplicates.
No confusion. The database table was created on the fly using random numbers, because I do not have a table that has duplicate problems. That is why the number changes slightly each time.
AnandGupta said:
It is becoming difficult to decide
The purpose of the test is only to demonstrate that on occasions, although a one line syntax may do the job and look more elegant, the multi line syntax performs better. You can check it out for yourself or ignore the test. The test on a much faster device showed much lower elapsed times for both methods, but still a higher time for the single line syntax.
 

RB Smissaert

Mahares said:
The database table was created on the fly using random numbers, because I do not have a table that has duplicate problems. That is why the number changes slightly each time.

> using random numbers

OK, confusion cleared.

RBS
 

AnandGupta

Mahares said:
The purpose of the test is only to demonstrate that on occasions, although a one line syntax may do the job and look more elegant, the multi line syntax performs better.
OK. Better to use the multi line method, with a progress circle, depending on the size of the data.

Regards,

Anand
 