Android Tutorial [B4X] Database modelling

This tutorial is about db modelling, with a small example. Of course this is just an overview and there is much more to it. I assume you already know how to create tables and the like.

Databases

A db is a collection of related data, organized in tables. Usually we speak of a database system (SQLite, MySQL, etc.) because it brings a lot of methods, functions and services with it, such as a query language to read data or to create new tables.

Tables

A table can be seen as an object or class. It contains all data related to this object. Objects are "things" as they exist in reality, like a customer, an item, a car, an order, etc. Once you know this, you are almost there when you create a new db: every object gets its own table. So when you think about an app handling customers, orders and items, you know you need 3 tables. All the data related to a customer (like name and address) will be stored in the customers table. A table has columns and rows, as you know them from Excel.

Relations

This is a very important point. The objects (tables) interact with each other because they are related to each other. A customer places an order, so there's a relation between the customer and his/her order. Easy but important. Each order has items, which is another relation. No order without items, and no order without a customer placing it. Like views in B4X, these relations can be seen as parent/child relations. In this case the customer is the parent and the orders are "children" of the customer object. Every "child" carries a tag (a foreign key) to identify its parent.
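
In SQLite you can make this parent/child tag explicit with a foreign key column. A minimal sketch, assuming SQL1 is an initialized SQL object (table and column names follow the example used later in this tutorial):

B4X:
' Parent: one row per customer
SQL1.ExecNonQuery("CREATE TABLE customers (CID INTEGER PRIMARY KEY, Name TEXT)")
' Child: every order carries the CID of its parent customer
SQL1.ExecNonQuery("CREATE TABLE orders (OID INTEGER PRIMARY KEY, CID INTEGER, FOREIGN KEY (CID) REFERENCES customers(CID))")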

Indexes

In the early days we had phone books on paper. To find someone's phone number there was a register/index based on the first letter of the surname. So we "stored" the person John Smith under "S". In every db system you can add an index to any column you want. The goal is very fast access to the data you need at that moment. If you wanted all birthdays of all persons, you would have to scan every person in your paper phone book, which takes a lot of time. A db solves this problem: just add a new index to your table. This can be done at any time you need it, depending on the needs of your app.
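
In SQLite, adding such an index is a single statement. A sketch, assuming a persons table with a Birthday column:

B4X:
' Without the index, "WHERE Birthday = ?" scans the whole table.
' With it, the db jumps straight to the matching rows.
SQL1.ExecNonQuery("CREATE INDEX IF NOT EXISTS idx_persons_birthday ON persons (Birthday)")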

Unique/Primary Keys/Autoincrement

If you handle customers in a table you need to identify the exact one you mean. The name isn't a good idea because there are probably thousands of John Smiths among your customers. Even combined with the birthday there can be several Johns with the same birthdate. To uniquely identify ONE person/customer there is always a unique key like a customer number, social security number, tax number, etc. You could create this unique ID on your own (if a specific format is needed), but all db systems help you here: add an integer column and define it as autoincrement. When you insert a new row (= a new customer here) the db system automatically inserts the next number. If the last one was 1111, the next will be 1112 and will be unique. I use this method for every table so I can identify each entry by its unique key (like customer 1112 has placed order 4455 containing the items 6665, 9993 and 4444).
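
In SQLite this looks like the sketch below: INTEGER PRIMARY KEY AUTOINCREMENT does the counting, and last_insert_rowid() returns the number that was just assigned (SQL1 is assumed to be an initialized SQL object):

B4X:
SQL1.ExecNonQuery("CREATE TABLE customers (CID INTEGER PRIMARY KEY AUTOINCREMENT, Name TEXT)")
' Insert a new customer - no CID given, the db assigns the next free number
SQL1.ExecNonQuery2("INSERT INTO customers (Name) VALUES (?)", Array As Object("John Smith"))
' Read back the CID the db just created
Dim NewCID As Long = SQL1.ExecQuerySingleResult("SELECT last_insert_rowid()")
Log("New customer id: " & NewCID)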

Normalization/Denormalization

Wikipedia: "Normalization or normalisation refers to a process that makes something more normal or regular"

Think about the customer you want to handle. A customer has an address and some other properties. He/she can have more than one address and more than one payment method. The idea of normalization is to move the addresses and payment methods to separate tables. This is what db modelling is about. You can take this idea to the max and "disassemble" a customer down to the atoms (a customer is human, has children, is male, and so on), like a class with methods and entities. This is sometimes too academic and leads to speed problems. On the other hand it is very powerful when you have to create a CRM (customer relationship management) system.

The db model is one thing. To get a more practical design you can "denormalize" this model. It's like meeting your friends: sometimes it's ok to just have some drinks and food; you don't need to know everything about them at that moment, just the things you need right now. In db terms: you create only one table for the customer, with one address and all properties. Enough for your app, even if in the full model a customer is "made of" maybe 12 objects.
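
As a sketch of the difference (column names are made up for illustration):

B4X:
' Normalized: addresses live in their own child table, a customer can have any number of them
SQL1.ExecNonQuery("CREATE TABLE customers (CID INTEGER PRIMARY KEY AUTOINCREMENT, Name TEXT)")
SQL1.ExecNonQuery("CREATE TABLE addresses (AID INTEGER PRIMARY KEY AUTOINCREMENT, CID INTEGER, Street TEXT, City TEXT)")
' Denormalized: one address stored directly with the customer - often enough for an app
SQL1.ExecNonQuery("CREATE TABLE customers_flat (CID INTEGER PRIMARY KEY AUTOINCREMENT, Name TEXT, Street TEXT, City TEXT)")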

Size doesn't matter & most common mistakes

I sometimes read threads like "My app is slow because the db is slow" or "I want to load 100,000 rows into a ListView". My answer: that is either a design or a programming mistake, and people are surprised.

With a good design, size doesn't matter. If you have a db with 1 billion entries, the selection of data is about as fast as with 100 entries. A good query takes 0.02 secs or less. If it takes 2.4 secs (which is over 100 times slower) there IS an error in your db design, your query, or even a problem in your code flow.

Never retrieve more data than a user can handle at one time! Loading 100,000 rows into a ListView does NOT make sense, as one user can't handle that much data. Good practice is to load 50 or so and load another 50 if needed (see the sketch below). Of course there are exceptions, but then we are talking about a batch system which processes data, not a single user.
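
A sketch of the "load 50 at a time" idea with LIMIT/OFFSET (Offset is a hypothetical variable you increase by 50 whenever the user reaches the end of the list):

B4X:
Dim Offset As Int = 0   ' increase by 50 for every further page
Dim rs As ResultSet = SQL1.ExecQuery2("SELECT CID, Name FROM customers ORDER BY Name LIMIT 50 OFFSET ?", _
    Array As String(Offset))
Do While rs.NextRow
    Log(rs.GetInt("CID") & ": " & rs.GetString("Name"))
Loop
rs.Close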

So let's start with a simple example

I want to build an app where customers can place orders with items. So I have 4 tables:

customers
orders
orderitems
items
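
A sketch of the four tables as SQLite DDL (extra columns like quantity or dates are left out to keep it short):

B4X:
SQL1.ExecNonQuery("CREATE TABLE customers (CID INTEGER PRIMARY KEY AUTOINCREMENT, Name TEXT)")
SQL1.ExecNonQuery("CREATE TABLE items (IID INTEGER PRIMARY KEY AUTOINCREMENT, Itemname TEXT, Price REAL, validfrom TEXT, voidfrom TEXT)")
SQL1.ExecNonQuery("CREATE TABLE orders (OID INTEGER PRIMARY KEY AUTOINCREMENT, CID INTEGER)")
SQL1.ExecNonQuery("CREATE TABLE orderitems (IOID INTEGER PRIMARY KEY AUTOINCREMENT, OID INTEGER, IID INTEGER, Price REAL)")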

Why orderitems AND items?

Items is the catalog of the items you CAN order. Orderitems are the items the customer actually ordered. Imagine you change a price later: this way completed orders are "safe", as they represent the finished order as it was at the time of ordering.

In customers we have 3 customers:

CID Name
1 Bill Miller
2 Sam Smith
3 Caroline James

As you see, every customer has his/her unique ID. If we address "Sam Smith" we will use ID 2 in our app. The name is not important anymore (except for display). Another benefit: we save a lot of space, as an integer usually needs less space than a full name.


In items:

IID Itemname Price
1 Cake 10
2 Teddybear 15
3 Milk 1.5

Same here: every item has a unique ID, too. As with customers, we only use the ID from here on.

And what if I need to change the price from a specific date on?

Two thoughts here:

- the price for previous orders must be kept
- the new price is valid in the future

Just add two columns like validfrom and voidfrom. Insert a new item (same name) and, by setting the dates in both rows, you can easily control when the old price becomes void and the new one becomes valid:

IID Itemname Price validfrom voidfrom
1 Cake 10 2018-01-01 2018-04-04
4 Cake 12 2018-04-04 9999-12-31
2 Teddybear 15 2018-01-01 9999-12-31
3 Milk 1.5 2018-01-01 9999-12-31

To display the item list use

"Select * FROM items WHERE validfrom <= 'today' and voidfrom > 'today'"

where 'today' stands for the current date. This displays only ID 4 for new orders from April 4th on (and ID 1 before that date). Benefit: you can also enter a price change for the future (like from January 1st, 2019).
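
In B4X, 'today' would be the current date formatted like the stored dates. A sketch:

B4X:
DateTime.SetDateFormat("yyyy-MM-dd")   ' match the format stored in the table
Dim Today As String = DateTime.Date(DateTime.Now)
Dim rs As ResultSet = SQL1.ExecQuery2("SELECT * FROM items WHERE validfrom <= ? AND voidfrom > ?", _
    Array As String(Today, Today))
Do While rs.NextRow
    Log(rs.GetString("Itemname") & ": " & rs.GetDouble("Price"))
Loop
rs.Close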


So if Sam orders milk:

orders

OID CID
2233 = New Order 2 = Sam

By inserting a new row into orders, the order ID (OID) is increased and set automatically. The customer is Sam with his ID 2 (see the customers table).
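
A sketch of that insert; last_insert_rowid() gives us the OID we need for the orderitems rows:

B4X:
SQL1.ExecNonQuery2("INSERT INTO orders (CID) VALUES (?)", Array As Object(2))   ' 2 = Sam
Dim NewOID As Long = SQL1.ExecQuerySingleResult("SELECT last_insert_rowid()")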

orderitems

IOID OID IID Price
1 2233 = which order 3 = Milk 1.5

In words: customer 2 has placed the order with ID 2233, and this order has one item with item ID 3 (= Milk) at the frozen price of 1.5. If you add a second item:

IOID OID IID Price
1 2233 = which order 3 = Milk 1.5
2 2233 = same order 4 = Cake 12

Note: All data (like prices) in orderitems is frozen, as it represents the order at the time it was placed. This may not seem important today, but you have to keep the data for years, and the prices here must not change because it is a historic view (a finished order).
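
A sketch of this "freezing": read the current catalog price from items and copy it into orderitems (NewOID comes from the order insert sketched above):

B4X:
' Look up the current catalog price of item 3 (Milk)
Dim CurrentPrice As Double = SQL1.ExecQuerySingleResult2("SELECT Price FROM items WHERE IID = ?", Array As String(3))
' Store the price with the order line - later catalog changes won't touch this row
SQL1.ExecNonQuery2("INSERT INTO orderitems (OID, IID, Price) VALUES (?, ?, ?)", _
    Array As Object(NewOID, 3, CurrentPrice))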


Accessing orders and their items

As you know the customer (here e.g. Sam Smith with ID 2), you can easily access his orders with

"Select * FROM orders WHERE CID=2"

As a customer can have more than one order, use a cursor here to process all of them. Assuming you display the orders in a view and a click on one shows the order details (orderitems), you know the ID of the order and can access its items with

"Select * FROM orderitems WHERE OID=2233"

Add more columns as you need them, like orderdate, orderstate, quantity, etc.
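
Putting both queries together, a sketch of the cursor loop (B4X ResultSet; in B4A a Cursor with RowCount/Position works the same way):

B4X:
' All orders of customer 2 (Sam)
Dim Orders As ResultSet = SQL1.ExecQuery2("SELECT OID FROM orders WHERE CID = ?", Array As String(2))
Do While Orders.NextRow
    Dim OID As Long = Orders.GetLong("OID")
    ' All items of this order
    Dim Items As ResultSet = SQL1.ExecQuery2("SELECT IID, Price FROM orderitems WHERE OID = ?", Array As String(OID))
    Do While Items.NextRow
        Log("Order " & OID & ": item " & Items.GetInt("IID") & " at " & Items.GetDouble("Price"))
    Loop
    Items.Close
Loop
Orders.Close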
 

keirS

Well-Known Member
Licensed User
Longtime User
Size doesn't matter & most common mistakes

With a good design, size doesn't matter. If you have a db with 1 billion entries, the selection of data is about as fast as with 100 entries. A good query takes 0.02 secs or less. If it takes 2.4 secs (which is over 100 times slower) there IS an error in your db design, your query, or even a problem in your code flow.

This isn't correct. Query speed depends on the indexes being able to be held in memory on the DB server. No matter how good your DB design is, you will encounter significant performance bottlenecks if the DB engine has to page the index to disk.

Plenty of good queries take longer than 0.02 seconds. They certainly do if you are performing calculations on 1 billion records.

And what if I need to change the price from a specific date on?

Just add two columns like validfrom and voidfrom. Insert a new item (same name) and, by setting the dates in both rows, you can easily control when the old price becomes void and the new one becomes valid:

IID Itemname Price validfrom voidfrom
1 Cake 10 2018-01-01 2018-04-04
4 Cake 12 2018-04-04 9999-12-31
2 Teddybear 15 2018-01-01 9999-12-31
3 Milk 1.5 2018-01-01 9999-12-31

Note: All data (like prices) in orderitems is frozen, as it represents the order at the time it was placed.

That is an absolutely awful design for what you are trying to achieve. You are using multiple unique IDs in the items table to represent the same item. That's a big no-no in DB schema design.

Instead you should have a separate itemprices table.
B4X:
PID  IID  Price  ValidFrom   VoidFrom
1    1    10     2018-01-01  2018-04-04
10   1    12     2018-04-04  9999-12-31
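
In SQLite terms that could look like the sketch below (items then keeps one row per item, and the join picks the currently valid price; Today is the current date string as in the tutorial above):

B4X:
SQL1.ExecNonQuery("CREATE TABLE itemprices (PID INTEGER PRIMARY KEY AUTOINCREMENT, IID INTEGER, Price REAL, ValidFrom TEXT, VoidFrom TEXT)")
' Current price per item: join the catalog with its price history
Dim rs As ResultSet = SQL1.ExecQuery2("SELECT i.Itemname, p.Price FROM items i JOIN itemprices p ON p.IID = i.IID " & _
    "WHERE p.ValidFrom <= ? AND p.VoidFrom > ?", Array As String(Today, Today))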
 

KMatle

Expert
Licensed User
Longtime User
That is an absolutely awful design

Denormalisation isn't awful. If it's fast and meets the needs, it's ok. Instead of being "rude", it would be more helpful if you expanded this thread with useful design ideas or an example of your own. My example is just an example.

Do not take the 0.02 secs too literally. It just shows that design is important. However...

This isn't correct. Query speed depends on the indexes being able to be held in memory on the DB server. No matter how good your DB design is, you will encounter significant performance bottlenecks if the DB engine has to page the index to disk.

Yes, it is true: size does not really matter if the design is good. At work we have half a billion rows and the speed is like I mentioned. If it's not, it's a design mistake. And hey, everyone is allowed to split data across several tables (= denormalisation) to get to 0.02 secs :)

I have a B4J app using SQLite with 3 million rows in ONE table and about 15 million in another (plus some additional tables). Guess how long a query takes... 12 GB of RAM on my PC is enough to speed it up. SQLite is incredibly fast :)

Plenty of good queries take longer than 0.02 seconds

Sure. If it's 0.2 secs, it's ok; if it's 1 sec, maybe; but longer means:

- the design isn't optimal (no, it isn't)
- the design is ok, but you are doing some kind of batch processing, which is ok, but... I'm talking about a FAST online system
- you are doing "special" queries which are "outside the main design" (which is ok, too)
 

keirS

Well-Known Member
Licensed User
Longtime User
Denormalisation isn't awful. If it's fast and meets the needs, it's ok. Instead of being "rude", it would be more helpful if you expanded this thread with useful design ideas or an example of your own. My example is just an example.

So my design was not useful? It's a far more normal approach to schema design than yours. Ask a DBA which one they would prefer and I think most of them would opt for my solution. If you are going to write a tutorial about database modelling, don't you think it is a good idea to present a normalized design rather than some wacky denormalised solution?
 

Ed Brown

Active Member
Licensed User
Longtime User
Denormalisation isn't awful. If it's fast and meets the needs, it's ok.
Completely agree. Normalisation of a database was the preferred method way back when disk space was expensive. Normalisation is still used today in a lot of databases.

It's not necessary to denormalise the entire database but denormalising the data where speed and performance are required makes a lot of sense. It should be noted that although denormalising has a lot of performance benefits, it comes with the cost of increased storage.
 

keirS

Well-Known Member
Licensed User
Longtime User
Completely agree. Normalisation of a database was the preferred method way back when disk space was expensive. Normalisation is still used today in a lot of databases.

It's not necessary to denormalise the entire database but denormalising the data where speed and performance are required makes a lot of sense. It should be noted that although denormalising has a lot of performance benefits, it comes with the cost of increased storage.

Normalization is used for easier maintenance of data consistency and simpler object-relational mapping. Highly normalized databases better represent an object-oriented architecture.
 

Ed Brown

Active Member
Licensed User
Longtime User
Normalization is used for easier maintenance of data consistency and simpler object-relational mapping. Highly normalized databases better represent an object-oriented architecture.
A normalised database has nothing to do with object-relational mapping or even object-oriented architecture. Normalised databases have been around since long before either of those concepts was even a thing.

Normalisation is the process of reducing data storage and improving data integrity, but it does not perform well with large datasets. Denormalising a database improves the efficiency of data retrieval for both large and small datasets and allows for a higher throughput of queries. If Google, Twitter, Facebook etc. used normalised databases for their searches and storage of user information, they would be terrible sites and services to use, as they would be too slow to respond to the query demands those services get now.

On the topic of integrity, denormalised databases offer just as much integrity provided the design is good; this applies to both normalised and denormalised databases. The cost of disk space today allows companies like Google and Facebook to have huge datasets, and I can guarantee that those won't be normalised databases.
Facebook, as an example, created their own database and later open-sourced it - it's called Cassandra and it's not a normalised database.
 

Erel

B4X founder
Staff member
Licensed User
Longtime User
Like anything related to performance optimizations, you need to start with the design that makes the most sense for your requirements and is easy to maintain. In 99% of cases such a design will also have good enough performance.
In most cases normalized databases are easier to maintain and to keep consistent.

If the same data is stored in multiple tables then any update must carefully update all tables.
 

keirS

Well-Known Member
Licensed User
Longtime User
A normalised database has nothing to do with object-relational mapping or even object-oriented architecture. Normalised databases have been around since long before either of those concepts was even a thing.

A bit of a history lesson:

First fully OOP language: Smalltalk in 1972
SQL developed in 1974 by IBM
Biggest investors in the propagation of Smalltalk as a development environment: IBM

The concepts of OOP and the RDBMS were both formulated at roughly the same time, in the late 1960s and early 1970s.

Normalisation is the process of reducing data storage and improving data integrity, but it does not perform well with large datasets. Denormalising a database improves the efficiency of data retrieval for both large and small datasets and allows for a higher throughput of queries. If Google, Twitter, Facebook etc. used normalised databases for their searches and storage of user information, they would be terrible sites and services to use, as they would be too slow to respond to the query demands those services get now. On the topic of integrity, denormalised databases offer just as much integrity provided the design is good; this applies to both normalised and denormalised databases. The cost of disk space today allows companies like Google and Facebook to have huge datasets, and I can guarantee that those won't be normalised databases.

The cost of memory has also come down. That means I can take a 40 GB database, throw it on a dedicated MySQL/MariaDB server with 128 GB of RAM and SAS SSDs, and configure the InnoDB engine to use 80% of the RAM, which means the whole DB can be cached in memory for not a lot of money. I suspect most B4X developers will not be dealing with multi-terabyte databases, but I could be wrong.
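
For reference, that InnoDB setting is a single line in the server's my.cnf (a sketch; roughly 80% of 128 GB):

[mysqld]
innodb_buffer_pool_size = 100G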

Facebook, as an example, created their own database and later open-sourced it - it's called Cassandra and it's not a normalised database.

Cassandra is not an RDBMS, so I really can't see the relevance of mentioning it here, as the concepts of designing a schema in Cassandra are different from those of designing a schema in an RDBMS.
 