Daryl's ColdFusion Primer

[<--Prev] [Next-->]
[Printable Version] [Search: ] [Daryl's TCP/IP Primer] [Daryl's ColdFusion Blog]
ColdFusion Slow?  Find out why with SeeFusion

4. Queries and SQL

Relational Databases (brief intro)

Most popular database systems are "relational" database systems. The simplest way to think of relational databases is to imagine a bunch of spreadsheets linked to each other, each containing one type of data, organized into rows and columns. Each spreadsheet is called a "table." Tables are linked to one another by matching certain data columns. Each link is called a "relationship", hence the term "relational database."

Think of a table of "Customers"

CustomerID CustomerName CustomerPhone CustomerEmail
1 Bob Smith 913-555-1212 bob@reallyslow.net
2 Sally Jones 612-555-1212 sjones@somecompany.com
3 John Public 816-555-1212 jqp@aol.com

Notice that there is no redundancy in the data; each row contains unique data. Each customer, actually, has been assigned a unique number. The column (or combination of columns) that always finds exactly one row, in this case CustomerID, is called the Primary Key. Although not absolutely required by many database systems, all tables should have a unique primary key. Unique real-world data, such as SSN or phone number, can be used as the primary key. Primary keys created using real-world data are known as "natural keys." If you create a number to represent each row, certain problems inherent in using natural keys (which will become apparent later) go away. Keys created using arbitrarily assigned data (usually numbers) are known as "artificial keys."

*

I strongly recommend using artificial keys whenever possible! Updating data that's part of a natural key becomes very problematic after the relationship has been propagated to many other tables. But remember: once a customer knows what "customer ID" they are, it starts to become a natural key. If, for some reason, someone needs to change an ID (e.g., someone is not happy being customer 666), then that change becomes difficult. One recommendation is to assign a customer number that's not part of any foreign key relationship, then use something that's impossible to remember, such as the result of CreateUUID(), as the artificial key.

Now, let's say that we have a table called "orders" to hold orders placed by customers:

OrderID CustomerID OrderDate WidgetName Qty
1 1 3/12/2000 FlidgyWidget 2
2 1 4/14/2000 WidgetFoo 4
3 2 4/15/2000 CagyWidget 17000.5

Hmmm... with this table structure, we have an artificial key (OrderID) to keep things unique, but what if the customer wants to order more than one item in an order? This table won't handle more than one widget per order. So, let's break this out into two tables, so Bob can order WidgetBars to go with his WidgetFoos in order 2:

Table Orders

OrderID CustomerID OrderDate
1 1 3/12/2000
2 1 4/14/2000
3 2 4/15/2000

Table OrderLineItems

OrderLineItemID OrderID WidgetName Qty
1 1 FlidgyWidget 2
2 2 WidgetFoo 4
3 2 WidgetBar 4
4 3 CagyWidget 17000.5

Great! Now we can have as many line items as we want on each order. Notice how the CustomerID exists in Orders to "relate" the Orders table to the Customers table. Since any customer can have many orders, this example shows a "one-to-many" relationship. The CustomerID column in Orders is considered to be a "foreign key" to Customers: the data represented by the Orders.CustomerID can be found in the Customers table by matching each Orders.CustomerID to the corresponding Customers.CustomerID.

The process of splitting up data into tables so that the database can accurately represent, or model, the real-world, is called database design. Breaking up tables so that there is as little redundancy in the data as possible is called "normalizing" the database. There are various stages of normalization known as "normal forms." If each table cell contains exactly one piece of data, the table is said to have achieved "first normal form." Third normal form seems to balance data integrity and select speed well.

* Database normalization is not an end; normalization is a means to an end. The more normalized a database is, the easier it is to update data in the database. The less normalized a database is, the easier it is to retrieve the data (uh, generally speaking. And remember, all generalizations are false.) Your normalization choices are driven by the way users use your database; if it's almost completely read-only, even violations of first normal form may be acceptable. When in doubt, err on the side of normalization.

The design of the database is one of the most important parts of a database driven application. A well-designed database enables flexibility in applications, and makes systems more robust. A poorly designed database can cripple a project-- and bad database design can make project goals completely unattainable. Interviewing people familiar with the data you are modeling into a database is a step that must not be overlooked. If possible, explain the database design to the users that know the data, and have them help you decide if your data structures are adequate to the task at hand. I generally don't start coding until I have a database built and ready. I'm not suggesting that the database design doesn't (or shouldn't) change after the project has started; but coding shouldn't start until you have a good database design to work against.

SQL

There are four basic operations performed on databases:

  1. Putting data into a table (INSERT)
  2. Viewing data in a table (SELECT)
  3. Changing data already in a table (UPDATE)
  4. Removing data from a table (DELETE)

The most common type of query (all database commands in SQL are called queries) is a SELECT statement, where you retrieve data from one or more tables. The syntax for a SELECT statement is usually fairly simple, and reads almost like English. Lets say I want all the information about customers named "Bob Smith":

SELECT *
FROM Customers
WHERE CustomerName = 'Bob Smith'

SQL is not case-sensitive, nor does it require each clause to begin on a new line; I capitalize all SQL keywords, use mixed case for column names, and break the query into several lines, as a matter of style. When frequently retrieving data from certain columns, it's often a good idea to place an index on those columns. A database index, which is like a book index (but much better), allows the database engine to quickly find relevant rows without having to search the whole database for them. If you find yourself searching by CustomerName often, it is a very good idea to index that column. The tradeoff: indexes consume disk space (though that's not usually a problem), and since indexes must be updated whenever an insert is performed, having many indexes on a table can significantly slow inserts. I've seen the simple addition of an index turn a 20 second response time into a 0.2 second response time. When doing performance tuning on applications, start by examining the WHERE clause of all of your queries. There should be an index on a piece of the WHERE clause that eliminates at least 90% of all rows from consideration before other parts of the WHERE clause are processed. Be very careful with OR conjunctions in where clauses. Most database systems will only use one index per table; therefore, an OR may cause the DBMS to have to read all rows from the table (this is called a "table scan") which can be very inefficient. It's often much more efficient to run two different queries than it is to run one query that has WHERE conditions on two different columns, or'ed together.

CFQUERY

Probably the second most used ColdFusion tag after <CFOUTPUT>, <CFQUERY> allows you to send any database query to a DBMS query engine. The main attributes are DataSource and Name. The name is used to name the result set to facilitate use of the results later in the template. As a matter of style, I preface all queries with the letter "q".

To run the query from the last section:

<CFQUERY Name="qCustomers" DataSource="#DataSource#">
  SELECT *
  FROM Customers
  WHERE CustomerName = 'Bob Smith'
</CFQUERY>

Notice the fact that the value for DataSource is #DataSource#. One feature of ColdFusion is that it looks for and runs a template named application.cfm before running each request. I use application.cfm to check security (when necessary), and to set global variables such as DataSource. Then, I use #DataSource# in all my CFQUERYs. If the ODBC datasource name changes for any reason, I only have to fix it in one place.

After a query is executed, the results are stored in a variable scope with the name you specified. You may not name a query using the same name as a built-in variable scope. (When ColdFusion 3.0 was released, I had to do quite a bit of cleaning up of old code. I had named my security check query "Session" and suddenly "Session" was a reserved word..!) You can determine how many rows were returned by the query by examining the "RecordCount" variable that is added to the scope when the query is created; e.g.:

<CFOUTPUT>
There were #qCustomers.RecordCount# customer(s) found.
</CFOUTPUT>

Although you might expect the RecordCount variable to contain the number of records affected by an INSERT, UPDATE, or DELETE, this is not the case. There is no good, cross-platform way to get that information other than simply running a SELECT beforehand and seeing how many records would be returned by that WHERE clause.

Indexing and Query Performance

When you want to find information on a topic from a reference manual, you will typically look for the topic in the book's index. An index is logically separate from the book--an attachment to the end to facilitate faster lookups of the data contained in the rest of the book. Database information is a bit more structured than a reference book--more like a phone book. A phone book is stored in order by last name and first name. However, if you wanted to find a name, given an address or a phone number, you would wind up reading (on average) half of the phone book to find the name. A database engine would read the whole book every time, in case there is more than one name associated with a phone number. Now, imagine the phone book having multiple indexes at the end, that simply map addresses to names, and phone numbers to names. Suddenly, your lookups become much more efficient. A database table will typically have several indexes on chosen columns or groups of columns. "Which columns to index" is a decision made by the database administrator (DBA) based on the queries he/she expects to be run at the database. If you are a developer in a smaller shop, you may take on many of the roles of a DBA for projects, including determining table structures and determining which columns to include in and index or indexes.

Since an index must be updated every time any data in the index in changed (by an insert, update, or delete), indexes will slow those operations somewhat. However, a useful index can have a dramatic effect on the speed and efficiency of a select. DBMSes are usually comprised of two primary components: the query optimizer and the execution engine. The role of the query optimizer is to look at queries presented to the database, and decide which index(es) will get the data with the least cost to the execution engine. The most "expensive" operation for a database to perform is a physical I/O, where a chunk of data is physically retrieved from or written to a hard drive. If the query optimizer cannot find a useful index for a select statement, it will tell the execution engine to read the entire table, which can be very time consuming in terms of physical I/O and record locking. (Record locking is used to prevent someone from changing data while you're using it. Or, to prevent you from using data someone else is changing.) Reading an entire table to service a query is known as a table scan and should be avoided in most cases. Most DBMSes will allow you to ask the query optimizer what indexes it will use (if any) based on a given query. This is referred to as "getting a query plan." Indexes are chosen by comparing the indexes available to the columns referenced in the "where" clause of the query. Therefore, care should be taken to make sure queries reference at least one column in an index in their "where" clauses. If no index is available, strongly consider adding one to help the query perform adequately. When in doubt, get a query plan from the DBMS, and decide from there if you can either:

The Query Object: Floor Wax or Dessert Topping?

With apologies to Saturday Night Live

Now, for a limited time, you too can be the proud owner of a query object! What, you ask, is a query object? A query object is the data structure created when you run a query using <CFQUERY>.

Features:

Creating a query obect

A query object is created every time you run a query, and is conveniently named using the NAME attribute of <CFQUERY>. Then, you can simply use the QUERY attribute of <CFOUTPUT> to display the results of the query! For example, let's assume you want to produce a sorted list all of the users in the Users table. With just seven lines of code, it's as good as done!

<CFQUERY Name="qUsers" Datasource="#request.dsn#">
SELECT LastName, FirstName FROM Users
ORDER BY LastName, FirstName
</CFQUERY>

<CFOUTPUT QUERY="qUsers">
#qUsers.FirstName# #qUsers.LastName#<BR>
</CFOUTPUT>

This query might produce output like this:

Jane Doe
John Doe
Fred Flintstone
Jane Jetson
Marge Simpson

You can also group the output by using the Group attribute and nesting <CFOUTPUT> tags!

<CFOUTPUT Query="qUsers" Group="LastName">
The #LastName# Family: <CFOUTPUT>#FirstName# </CFOUTPUT><BR>
</CFOUTPUT>

The Doe Family: Jane John
The Flintstone Family: Fred
The Jetson Family: Jane
The Simpson Family: Marge

* Every time the value in the column indicated by the Group attribute changes, the outer loop is re-run. You can also create more than two levels of nesting by using <CFOUTPUT Group="foo"> mulitple times. The Query attribute is only used on the outermost CFOUTPUT. Unfortunately, <CFLOOP> does not have a Group attribute.

Works like an array!

You can also get at the contents of rows without looping, simply by specifying the column name and row number! In the example above, the value of #qUsers.LastName[3]# is "Flintstone".

* If you don't specify a row number, you always get the first row of the query.

Works like a structure!

You can even use structure syntax to access columns of a query. In the example above, #qUsers["LastName"][4]# will produce "Jetson"

* However, attempts to use functions that expect structures, such as structKeyList(qUsers) or <CFLOOP Collection="qUsers"> will not work.

Additional properties available if you call now!

That's right, if you call now, we will include, absolutely free, these additional properties with every query created!

* If you're not in a loop, the value of MyQuery.CurrentRow is always 1.

Extra, Bonus Property Available for a Limited Time!

Operators are standing by, call now!

Tastes great! And just look at that shine!

* (No operators are standing by.)


Next: Template Style and Structure (Paranoia 101)
Copyright ©2000-2008 Daryl Banttari. See [Disclaimer]. [About Daryl]