Android Code Snippet Full Text Search (FTS) as it Applies to SQLite

Mahares

Well Known Member
Licensed User
NAME: Demonstrating some of the syntax used in creating and using FTS tables.

DESCRIPTION:
SQLite supports Full Text Search (FTS3 and FTS4). The purpose of this snippet is to introduce some of its basic syntax and operators. This is a rudimentary introduction to FTS as I just recently started learning about it. FTS is a fast way to look for specific words or sentences in text columns of a SQLite table. You can also search the entire table with one SELECT statement. For newer applications, FTS4 is recommended.

B4X:
Dim DBTableName As String="fts_table"
To create a table:with one column. Tables can also have several columns.
B4X:
txt="CREATE VIRTUAL TABLE IF NOT EXISTS " & DBTableName & " USING fts4(content TEXT)"
 SQL1.ExecNonQuery(txt)
Actually when creating a table with name: fts_table, three to five real (non-virtual) tables referred to as shadow tables are also created to store the underlying data. Fts_table_content, fts_table_docsize and fts_table_segdir and fts_table_segments. FTS created tables require a lot more memory than regular tables, but their queries execute a lot faster.


To delete a table, the syntax is the same as a normal table:
B4X:
 txt="DROP TABLE IF EXISTS " & DBTableName 
SQL1.ExecNonQuery(txt)

To Insert or update records, the syntax is the same as a normal SQLite table, here to insert:
B4X:
 txt="INSERT INTO fts_table (content) VALUES('Your testing score ranked you highest')" 
  SQL1.ExecNonQuery(txt)

To make a simple SELECT statement in column ‘content’, To search for records where the full word fiduciary (case insensitive search) is found anywhere in any record:
B4X:
txt="SELECT * FROM " & DBTableName & " WHERE content MATCH 'fiduciary'"
Dim RS As ResultSet =SQL1.ExecQuery(txt)
  Do While RS.NextRow
     Log(RS.GetString("content") ) 
  Loop

To select all records which contain both words Europe and Australia
B4X:
txt="SELECT * FROM " & DBTableName & " WHERE content MATCH 'Europe Australia'"

To search for records that have the word sabbat and any other words like Sabbath, sabbatini. The word sabbat can be anywhere in the record.
B4X:
txt="SELECT * FROM " & DBTableName & " WHERE content MATCH 'sabbat*'"

To search for records that contain both cheese and cream, but both words cannot be more than 8 words apart. Cheese can be before cream or after in the record. If the integer 8 is omitted, the default is 10 words apart.
B4X:
 txt="SELECT * FROM " & DBTableName & " WHERE content MATCH 'Cheese NEAR/8 cream'"

To select records which contain Europe but not spain. Note, it is case insensitive. There are more complex queries supported:
B4X:
txt="SELECT * FROM " & DBTableName & " WHERE content MATCH 'Europe -spain'"

To select all records that have a word or sentence starting with europea and that also have a word or sentence starting with bird:
B4X:
txt="SELECT * FROM " & DBTableName & " WHERE content MATCH ‘europea* bird*'"

To select all records which either have flamingo or bird. Notice OR must be uppercase. Otherwise, it is considered part of the search string:
B4X:
txt="SELECT * FROM " & DBTableName & " WHERE content MATCH ‘flamingo OR bird'"

To select all records where the three words: flamingo, or and bird are found. Notice lowercase or. This statement yields a very different result set from the previous one despite the same syntax:
B4X:
txt="SELECT * FROM " & DBTableName & " WHERE content MATCH ‘flamingo or bird'"
BENCHMARK:
I made a comparison between an FTS table and a normal table, where both contain the same 91300 text records, using different SELECT statements on a Samsung Galaxy Tab A tablet. In all cases the query ran a lot faster with FTS. The FTS query took an average of 0.01 second and the ordinary table query ran in about 1 second, in some cases 100 times faster with FTS, but generally a large percentage or few times faster.. For instance, searching for ‘lyterian’ it took 0 second querying the FTS table as opposed to 1.04 seconds querying an ordinary table.
I encourage adding to and improving this thread.

Tags: FTS3, FTS4, MATCH

Dependency: SQL Library
 

rboeck

Well-Known Member
Licensed User
You can use DBUtils sub's like in the 'normal' databases and and all sql statements in a standard way.

In my own apps i prefere to use FTS styled databases, because the google style of search is well known, an specially in mobile apps nobody want to select fields for searching... This together with the enourmos speed
One small negative effect of FTS: you can not use all ODBC drivers for database import, export and integration, but some exists.
 

Mahares

Well Known Member
Licensed User
What about inserting/updating?
Although the main benefits from FTS is the extremely fast search speed, I also conducted a speed test comparison between regular table and FTS table INSERT:
For importing 91000 records of text file 7 MB in size, it took 8.38 seconds for the regular table insert and 18.51 seconds for the FTS insert. The test was conducted on a Samsung Galaxy Tab A tablet. The INSERT is usually just a one time event. I did not compare UPDATE.
 

keirS

Well-Known Member
Licensed User
NAME: Demonstrating some of the syntax used in creating and using FTS tables.
Actually when creating a table with name: fts_table, three to five real (non-virtual) tables referred to as shadow tables are also created to store the underlying data. Fts_table_content, fts_table_docsize and fts_table_segdir and fts_table_segments. FTS created tables require a lot more memory than regular tables, but their queries execute a lot faster.

Think it's important to point out they are far faster for text searches. They are far slower for any other kind of query. This is because the only field supported by FTS is the TEXT field apart from ROWID. An FTS table doesn't support indexes, foreign keys or triggers so integrity and consistency checking has to be performed manually.



BENCHMARK
:
I made a comparison between an FTS table and a normal table, where both contain the same 91300 text records, using different SELECT statements on a Samsung Galaxy Tab A tablet. In all cases the query ran a lot faster with FTS. The FTS query took an average of 0.01 second and the ordinary table query ran in about 1 second, in some cases 100 times faster with FTS, but generally a large percentage or few times faster.. For instance, searching for ‘lyterian’ it took 0 second querying the FTS table as opposed to 1.04 seconds querying an ordinary table.
I encourage adding to and improving this thread.
Try comparing a join on a int column between two indexed normal tables and a join between a normal indexed table and a FTS table.
 

Mahares

Well Known Member
Licensed User
They are far slower for any other kind of query.
It is true that FTS is not the answer to every search scenario. That is why SQLite consider FTS3 and FTS4 extensions. all values inserted into FTS table columns (except the rowid column) are converted to TEXT data type before being saved to enhance the search. FTS uses built-in full-text index. The biggest drawback is the memory consumption because data is stored in 6 tables for every table created.
join between a normal indexed table and a FTS table.
Can you share any speed comparison in one of your tests and contribute an example to this thread? Very little is written in this forum about this subject.
 

keirS

Well-Known Member
Licensed User
Can you share any speed comparison in one of your tests and contribute an example to this thread? Very little is written in this forum about this subject.
This isn't done on Android because I cant free up the 6GB required for the data easily on any of my devices.

My table definitions

B4X:
CREATE TABLE [AUTHORS] (
  [AUTHORID] INTEGER,
  [FIRST_NAME] CHARACTER(30),
  [LAST_NAME] CHARACTER(40));

CREATE INDEX [AUTHORS_AUTHORID] ON [AUTHORS] ([AUTHORID]);

CREATE INDEX [AUTHORS_FIRSTNAME] ON [AUTHORS] ([LAST_NAME]);

CREATE INDEX [AUTHORS_LASTNAME] ON [AUTHORS] ([LAST_NAME]);

CREATE TABLE BOOKS("TITLE" CHARACTER(122),
"AUTHORID" INTEGER,
"BOOKID" INTEGER,
"PUBLISHERID" INTEGER,
"SUMMARY" CLOB);

CREATE INDEX [BOOK_TITLE] ON [BOOKS] ([TITLE]);

CREATE INDEX [BOOKS_AUTHORID] ON [BOOKS] ([AUTHORID]);

CREATE INDEX [BOOKS_PUBLISHERID] ON [BOOKS] ([PUBLISHERID]);

CREATE INDEX [BOOKS_BOOKID] ON [BOOKS] ([BOOKID]);

CREATE VIRTUAL TABLE BooksFTS USING fts4(BookID int PRIMARY KEY,AuthorID Int, PublisherID Int,Title text default "", Summary text default "");

CREATE TABLE PUBLISHERS("PUBLISHERID" INTEGER,
"NAME" CHARACTER(100));

CREATE INDEX [PUBLISHERS_PUBLISHERID] ON [PUBLISHERS] ([PUBLISHERID]);

CREATE INDEX [PUBLISEHERS_NAME] ON [PUBLISHERS] ([NAME]);


CREATE TABLE [WORDS] (
  [WORD] CHARACTER(60),
  [WORDID] INTEGER,
  [HASH] BIGINT);

CREATE INDEX [WORDS_WORDID] ON [WORDS] ([WORDID]);

CREATE TABLE [WORDSINBOOKS] (
  [WORDID] INTEGER,
  [BOOKID] INTEGER,
  [HASH] BIGINT);

CREATE INDEX [WORDSINBOOKS_BOOKID] ON [WORDSINBOOKS] ([BOOKID]);

CREATE INDEX [WORDSINBOOKS_WORDID] ON [WORDSINBOOKS] ([WORDID]);

CREATE INDEX [WORDSINBOOKS_HASH] ON [WORDSINBOOKS] ([HASH] ASC);

So I have tables containing 86,000 books in both normal ("BOOKS") and FTS ("BooksFTS") forms. These are related to 2 tables Authors and Publishers.

The summary column as been generated randomly and is taken from a list of 365,000 English words contained in the WORDS table. The summary is anywhere between 250 and 1000 words.

WORDSINBOOKS is a lookup table to find words in books. The function below was used generate a hash to populate the hash for each word.

B4X:
Sub GetHash(s As String) As Int
    Dim j As JavaObject = s
    Return j.RunMethod("hashCode",Null)
End Sub
So on to the test results using the BooksFTS table to search for the word "diplomats" in summary column and getting the first and last name of the author.

B4X:
SELECT BooksFTS.Bookid,BooksFTS.Title,BooksFTS.Summary,AUTHORS.LAST_NAME, AUTHORS.FIRST_NAME FROM BooksFTS LEFT OUTER JOIN AUTHORS ON BooksFTS.AuthorID = AUTHORS.AUTHORID WHERE Summary MATCH "diplomats"
Average time over 1000 queries was 38 milliseconds (171 Results)

Doing the same query on the BOOKS table using Summary Like "%diplomats%"

B4X:
SELECT BOOKS.BOOKID,BOOKS.TITLE,BOOKS.SUMMARY,AUTHORS.LAST_NAME,AUTHORS.FIRST_NAME FROM BOOKS LEFT OUTER JOIN AUTHORS ON BOOKS.AUTHORID = AUTHORS.AUTHORID WHERE SUMMARY LIKE "%diplomats%"
Average time over 1000 queries was 3884 milliseconds (171 Results)

Getting the hash for the word diplomats (1768639329) and using the WORDSINBOOKS lookup table.

B4X:
SELECT BOOKS.BOOKID,BOOKS.TITLE,BOOKS.SUMMARY,AUTHORS.LAST_NAME,AUTHORS.FIRST_NAME FROM BOOKS LEFT OUTER JOIN AUTHORS ON BOOKS.AUTHORID = AUTHORS.AUTHORID WHERE BOOKID IN (SELECT BOOKID FROM WORDSINBOOKS WHERE HASH = 1768639329)

Average time over 1000 queries was 42 milliseconds (171 Results)


So using a hash lookup table is only slightly slower than using FTS.

So on to the bit where FTS tables fair very poorly. A query to get all books by the author Stephen King.

Using BooksFTS:

B4X:
SELECT AUTHORS.*, BooksFTS.* FROM  AUTHORS INNER JOIN BooksFTS ON AUTHORS.AUTHORID = BOOKSFTS.AuthorID WHERE AUTHORS.LAST_NAME = "King"  AND  AUTHORS.FIRST_NAME = "Stephen"

Average time over 1000 queries was 3733 milliseconds (313 results)

Using Books:

B4X:
SELECT AUTHORS.*, BOOKS.* FROM  AUTHORS INNER JOIN BOOKS ON  AUTHORS.AUTHORID = BOOKS.AUTHORID where AUTHORS.LAST_NAME = "King"  and  AUTHORS.FIRST_NAME = "Stephen"
Average time over 1000 queries was 79 milliseconds (313 results)

So you have to be very careful how you structure your tables and query's using FTS.





 

rboeck

Well-Known Member
Licensed User
Thanks for your work! I have read somewhere, that sqlite has some problems with joins, but i dont know exactly, if it was a timing problem of if some type of joins are not supported. So i think it would be better for benchmarking, to split between the real search time and the time, which is needed for generating the join; maybe the join is the more time consuming part?
 

keirS

Well-Known Member
Licensed User
SQLte only supports 3 types of join. It doesn't support right joins and full outer joins for example. In this example the join is the time consuming part.

B4X:
SELECT AUTHORS.*, BooksFTS.* FROM AUTHORS INNER JOIN BooksFTS ON AUTHORS.AUTHORID = BOOKSFTS.AuthorID WHERE AUTHORS.LAST_NAME = "King"AND AUTHORS.FIRST_NAME = "Stephen"
This is because BOOKSFTS.AuthorID is not (and cannot be) indexed so SQLite has to do a linear search (iterate through every row) on BooksFTS to match BOOKSFTS.AuthorID to AUTHORS.AUTHORID instead doing a lookup using an index. There is also the overhead of having to convert BOOKSFTS.AuthorID from TEXT to INT but I suspect this a negligible factor in the overall time taken.
 
Top