The textbook answer to hierarchies in SQL uses the "adjacency model". In the adjacency model, only the parent_id is stored with each child, as it is really all that is needed to determine the tree structure. However, the tree cannot be displayed without recursively tracing the entire table. The end result, is that INSERT operations are extremely fast. The SELECT operation to pull the entire tree, however, becomes a self-referencing monster that creates tremendous overhead for a large tree.
We will use a message board for this exercise.
Message boards come in two flavors: Single- or Multi- threaded. Single-threaded boards sort messages by date only, with the newest message appearing at the bottom of the page. On a single-threaded board the visitor must ascertain if the last post is part of the general thread, or a response to a particular message. Multi-threaded boards store the messages hierarchically, so that many sub-discussions can break away from a main thread.
A well constructed multi-threaded discussion board should fill these three requirements:
- The messages must be threaded properly
- We must be able to extract an entire thread quickly
- There must be no practical limit on the number of replies in a discussion
Requirement 1: Threading Properly
Multi-threaded boards display messages in a hierarchical tree to distinguish who has responded to whom, and in what order. We are going to model our rules after the traditional newsgroup way of ordering messages with new replies appearing to the right, and down.
This may seem like a no-brainer at first, but it is very easy to overlook a badly sorted thread. There are many boards around, for example, that display discussions like so:
1 : message 1
- 2 : reply to 1
- - 4 : 2nd reply to 2
- - 3 : 1st reply to 2
- - - 5 : reply to 3
At first glance that may seem correct, but a newsgroup would sort it like this:
1 : message 1
- 2 : reply to 1
- - 3 : 1st reply to 2
- - - 5 : reply to 3
- - 4 : 2nd reply to 2
Did you catch the difference right away? We might be able to forgive those that sort the wrong way until we try to read one of their boards, which is an excercise in frustration for anyone used to the newsgroup format. Good examples of properly ordered threads can be seen at any of Google's newsgroups.
Requirement 2: Extract entire thread quickly
By its very nature a message board application will spend more of its lifetime displaying than inserting. Therefore, we are more interested in developing a fast SELECT than we are in avoiding a slow INSERT.
Requirement 3: No practical limits
We would not want our hugely popular board to have either a limit on messages per discussion or a cap on how many messages in the discussion to display at once. These may be features you want to add, but not tolerances for our application to dictate to us.
This exercise wouldn't be much fun without any threads to look at, so we will create some sample data to work with by writing a routine that will randomly post a determinate number of messages to the database. We will set a 70% chance that a new post will be in reply to the last message posted, and a 30% chance that a previous message will be randomly selected and responded to instead. While a real-world board may have a stronger tendency to move in a forward direction, the 70/30 ratio creates many backward folds in the tree which are useful to us for testing accuracy.
Database Setup
So the adjacency model, will not serve us well for multi-threaded board. Instead, we will use a model first introduced to me by Jeff Howden. Jeff calls it the "rank and tree level" model. We will name the table to hold our threads, "Discussions":
CREATE TABLE Discussions (
id INTEGER PRIMARY KEY NOT NULL,
subject VARCHAR(200) NULL,
message LONGVARCHAR NULL,
date_created DATETIME NOT NULL,
created_by INTEGER NOT NULL,
threads_id INTEGER NOT NULL,
parent_id INTEGER NOT NULL,
tree_level INTEGER NOT NULL,
rank INTEGER UNIQUE NOT NULL
);
We'll need a dummy row to perform our experiments, let's insert it now:
INSERT INTO Discussions
(
message
, date_created
, created_by
, threads_id
, parent_id
, tree_level
, rank
)
VALUES
(
'message 1'
, #CreateODBCDateTime(Now())#
, 1
, 1
, 0
, 1
, 1
)
The id, subject, message, and date_created, fields should be self-explanatory. The created_by field would likely reference a key in a members table, or possibly hold an email address for a simpler board. The threads_id column will indicate the starting message in each discussion, which allows us to pull entire discussions quickly, and will make many of our insert operations more manageable. The last three columns: parent_id, tree_level, and rank determine the position of each message in the discussion.
- parent_id functions as it does in the adjacency model, indicating who this message is in reply to.
- tree_level measures the size of the row's left indent, and is also used in calculating the correct point of insertion.
- rank defines the order of the items in the tree.
Here is the SELECT statement to pull an entire discussion in tree order:
SELECT
id
, rank
, subject
, message
, tree_level
, date_created
FROM Discussions
WHERE threads_id = #url.id#
ORDER BY rank
In the interest of portability, we will use a server-side langauge to store variables and control program flow. It is of course possible to perform this entire task in SQL, but it is not currently possible to make a cross-vendor solution using SQL only. This has been tested on SQL Server 2000, other platforms may not support all the syntax used below. the TOP statement, for example, works only in SQL Server and Access, MySQL users will have to convert it to a LIMIT clause. Pre-version 4 MySQL users will need major modifications, as MySQL 3 does not support subqueries.
We will be using ColdFusion as our server-side language. If you are unfamiliar with it, know that variables are surrounded by pound signs. Thus, if the URL for our page was www.foo.bar?id=5 the WHERE statement above would be interpreted as: WHERE threads_id = 5. It should be trivial to port to your favorite language, as we are only using it for IFs and to assign local variables.
Random INSERT Routine
We will make two templates for this exercise: a template to perform random inserts, and the insert routine itself. The random insert template has little value beyond this exercise, but what fun would it be to make a message board with no messages in it? The insert routine is separate, so that we can easily incorporate it into our message board code later on. The random inserter will be named, "generate_discussions.cfm" and the insert template, "insert_thread.cfm".
<!--- TEMPLATE 1 -- generate_discussions.cfm
********************************************
Begin generate_discussions.cfm template
a loop is opened,
a message is chosen to reply to,
and then the INSERT routine "insert_thread.cfm" is included.
********************************************
--->
<cfloop from="1" to="100" index="i">
<!--- the "to" attribute defines
the total number of random threads generated --->
<cfif i EQ 1>
<!--- if first item in loop, start a new discussion --->
<cfquery name="qGetNext"
datasource="#datasource#">
<!--- replace #datasource# with your datasource name to run --->
SELECT MAX(id) + 1 AS next_id
FROM Discussions
</cfquery>
<cfset threads_id = qGetNext.next_id>
<cfset lastpost_id = qGetNext.next_id - 1>
<!--- start a new thread in the Discussions table --->
<cfquery name="qInsert"
datasource="#datasource#">
INSERT INTO Discussions
(
subject
, parent_id
, threads_id
, tree_level
, rank
, date_created
)
VALUES
(
'message 1'
, #qGetNext.next_id#
, #qGetNext.next_id#
, 0
, 1
, #CreateODBCDateTime(Now())#
)
</cfquery>
<cfelse>
<!--- else not the first item in the loop,
set up 70/30 chance of replying to last or prev --->
<cfif RandRange(1,10) GT 3>
<!--- 70% chance of posting a
response to the most recent message --->
<cfset parent_id = lastpost_id>
<cfelse>
<!--- else, randomly pick a previous message to respond to --->
<cfset parent_id = RandRange(threads_id + 1,(lastpost_id - 1))>
<cfif parent_id LT threads_id>
<cfset parent_id = threads_id>
</cfif>
</cfif>
<!--- include insert discussions template --->
<cfinclude template="./insert_thread.cfm">
</cfif>
<!--- end if i eq 1--->
<cfset threads_id = threads_id>
<cfset lastpost_id = lastpost_id + 1>
</cfloop>
<!---
********************************************
END generate_discussions.cfm TEMPLATE
********************************************
--->
The INSERT Routine
Before we examine "insert_thread.cfm", let's take a look at what we are up against. The conditions for adding new messages are modeled after the real world. New messages enter with the parent_id of the message they are responding to taped to their forehead. For our test situation, we will have discussions that use a sequential set of IDs, which makes looking at the threads a little easier. This is of course unlikely in the real world, where many discussions will be occuring at once.
Below, is an example of a 30-thread tree created with our two templates. Examining this, we can study the different scenarios that arise when we want to insert a new message. The primary key has been started at 11, and sequential ID numbers have been used to make the threading easy to follow. Here is the simple code that creates the tree display seen below:
<!---
********************************************
BEGIN simple display code
********************************************
--->
<cfquery name="qGetDiscussion"
datasource="#datasource#">
SELECT
id
, rank
, subject
, message
, tree_level
, date_created
FROM
discussions
WHERE
threads_id = 11
ORDER BY rank
</cfquery>
<pre>
<cfoutput query="qGetDiscussion">
#qGetDiscussion.id# --
<cfloop
from="1"
to="#tree_level#"
index="p">--</cfloop> #subject#
</cfoutput>
</pre>
<!---
********************************************
END simple display code
********************************************
--->
This simple code yields the tree we see below. Let us pretend we are the next record, with an ID of 41, entering this thread at different points.
11 -- message 1
12 ---- reply to 11
13 ------ reply to 12
14 -------- reply to 13
15 ---------- reply to 14
16 ------------ reply to 15
17 -------------- reply to 16
18 ---------------- reply to 17
37 ------------------ reply to 18
38 -------------------- reply to 37
39 ---------------------- reply to 38
40 ------------------------ reply to 39
19 ---------------- reply to 17
20 ------------------ reply to 19
27 -------------------- reply to 20
28 ---------------------- reply to 27
35 -------------------- reply to 20
36 ---------------------- reply to 35
21 ------------ reply to 15
22 -------------- reply to 21
23 ---------------- reply to 22
24 ------------------ reply to 23
25 -------------------- reply to 24
26 ---------------------- reply to 25
33 ---------- reply to 14
34 ---------- reply to 14
29 -------- reply to 13
30 ---------- reply to 29
31 ------------ reply to 30
32 -------------- reply to 31
If we are replying to ID 32, we have a very straight-up operation since 32 is at the end of the tree. We can simply get the parent's rank and tree_level, add 1 to each, and insert the new message. A reply to ID 26 would be simliar, as it is at the end of a branch. In the case of 26 though, we will have to add one to all ranks below first, in order to make room for the new incoming record. This is the routine for any record at the end of its branch such as IDs 40, 28, 36, 26, 34 and 32.
If record 41 is replying to record ID 20, we need to fall after 36 and before 21. In this case we can find the insertion point by starting at the record we are replying to, and drawing a line straight down until we hit a record that is less than or equal to the parent. If we find a record, we will have marked the record we want to overtake. Room will need to be made in the rank column to hold the incoming message, so we will "shuffle" the ranks forward to make room for the new at this point. If we did not find a record drawing a line straight down from the message being replied to, then the new reply belongs at the end of the discussion.
The INSERT Template
Now that we have a clearer picture of the task at hand, here is the "insert_thread.cfm" template:
<!--- TEMPLATE 2 -- insert_thread.cfm
********************************************
Begin insert_thread.cfm template
- the correct insertion point
for new messages is determined,
- rank column is shuffled if necessary,
- the new message is inserted
incoming variables:
#parent_id#
- incoming variable of the
parent_id we are responding to
#threads_id#
- incoming variable holding the
thread or discussion this message is part of
********************************************
--->
<!--- a reply will always have a tree_level
equal to its parent's tree_level + 1 --->
<cfquery name="qGetTreeLevel"
datasource="#datasource#">
SELECT tree_level,rank
FROM Discussions
WHERE id = #parent_id#
</cfquery>
<cfset new_tree_level = qGetTreeLevel.tree_level + 1>
<!--- check for children --->
<cfquery name="qChildCheck"
datasource="#datasource#">
SELECT id,rank
FROM Discussions
WHERE parent_id = #parent_id#
</cfquery>
<cfif qChildCheck.recordcount GTE 1>
<!--- there are children,
draw a line straight down from the parent,
and see if we hit a record --->
<cfquery name="qLocateInsertion"
datasource="#datasource#">
SELECT TOP 1 id, rank
FROM Discussions
WHERE threads_id = #threads_id#
AND (tree_level <=
(SELECT tree_level
FROM Discussions
WHERE id = #parent_id#))
AND (rank >
(SELECT rank
FROM Discussions
WHERE id = #parent_id#))
ORDER BY rank
</cfquery>
<cfif qLocateInsertion.recordcount EQ 1>
<!--- if we just found a record, it's rank
is the rank our new message will take over
we must shuffle all the ranks in this discussion --->
<cfset new_rank = qLocateInsertion.rank>
<cfquery name="qShuffleRanks"
datasource="#datasource#">
UPDATE Discussions
SET rank = rank + 1
WHERE threads_id = #threads_id#
AND rank >= #new_rank#
</cfquery>
<cfelse>
<!--- we didn't find a record drawing
a line straight down from the parent,
the new message belongs at the end of the tree--->
<cfquery name="qGetRank"
datasource="#datasource#">
SELECT MAX(rank) + 1 AS lastrank
FROM Discussions
WHERE threads_id = #threads_id#
</cfquery>
<cfset new_rank = qGetRank.lastrank>
</cfif>
<cfelse>
<!--- else, parent has no existing children --->
<cfset new_rank = qGetTreeLevel.rank + 1>
<!--- query to test if we are at the end of the tree --->
<cfquery name="qCheckEnd"
datasource="#datasource#">
SELECT id
FROM Discussions
WHERE threads_id = #threads_id#
AND rank = (SELECT Max(rank) FROM Discussions)
</cfquery>
<cfif qCheckEnd.id NEQ parent_id>
<!--- if the parent is not at the end,
we need to shuffle the ranks and
create a gap for the new message--->
<cfquery name="qShuffleRanks"
datasource="#datasource#">
UPDATE Discussions
SET rank = rank + 1
WHERE threads_id = #threads_id#
AND rank >= #new_rank#
</cfquery>
</cfif>
</cfif>
<!--- close if qChildCheck.recordcount GTE 1 --->
<!--- we now have the information we
need for insert, and the ranks have
been shuffled to make room if necessary --->
<cfquery name="qInsert"
datasource="#datasource#">
INSERT INTO Discussions
(
subject
, parent_id
, threads_id
, tree_level
, rank
, date_created
)
VALUES
(
'reply to #parent_id#'
, #parent_id#
, #threads_id#
, #new_tree_level#
, #new_rank#
, #CreateODBCDateTime(Now())#
)
</cfquery>
<!---
********************************************
END insert_thread.cfm TEMPLATE
********************************************
--->
While this may look like a complicated routine to run for every new message, the overhead from these operations is trivial. Local tests showed the total time for the insert_thread.cfm template to run remained in the sub 100 MS range, even when shuffling the ranks of 2000 messages.
Conclusion
This solution yields opposite performance results from the adjacency model. We have developed a a slower INSERT, in favor of a faster SELECT. It is also worth noting that messages can be displayed in single-threaded format, by simply ordering the SELECT query by date_created instead of rank.
Once again, the SELECT statement to display a discussion is trivial:
SELECT
id
, rank
, subject
, message
, tree_level
, date_created
FROM Discussions
WHERE threads_id = #url.id#
ORDER BY rank
The performance results of this SELECT statement:
threads query execution time
10 0 MS
200 16 MS
2000 94 MS
Clearly, we have thoroughly fulfilled our three initial requirements of proper threading, select speed, and no practical limitations.
3-24-2008
3-10-2008
3-9-2008
3-7-2008
3-4-2008