object and db table identity

object and db table identity

Old forum URL: forums.lhotka.net/forums/t/2471.aspx


jhw posted on Friday, March 02, 2007

I have always used a autonumber id integer for every table so that there is a single primary key. This key is used as the id for objects that are loaded out of the db. One problem with this is that objects don't get their id until submitted into the db. Sometimes it would be nice to have the id prior to saving.

My question is in regards to using guid's instead. Is there a lot of cost to using them compared to an interger as an id. And can a person rely on them to be unique.

pelinville replied on Friday, March 02, 2007

It makes things easier if you use GUIDs. Anything that makes things easier has a cost.
 
So the question is, is the cost worth it?
 
I say yes.
 
Others say no.
 
Do a search of the forum(s) and decide for yourself.

RockfordLhotka replied on Saturday, March 03, 2007

If you do use an autonumber, you still do need to give your objects a "temporary id" until they are saved. GetIdValue() must always return a unique value.

The easiest way to do this is to use negative numbers, auto-decrementing them from a static/Shared field in your class.

private static int _lastNewId;

[RunLocal]
protected override void DataPortal_Create(object criteria)
{
  _lastNewId--;
  _id = _lastNewId;
  // ...
}

If you let DP_Create() run on an app server, you'd need to use some locking code for thread-safety:

private static int _lastNewId;

protected override void DataPortal_Create(object criteria)
{
  _id = System.Threading.Interlocked.Decrement(_lastNewId);
  // ...
}

 

Obelix replied on Monday, March 05, 2007

Is that unique value,

A) Unique to every object in the BO Object Model I have created?

or

B) Unique within a given Collection/List of objects?

i.e. does every object in a single hierarchy of object and collections have to have it's own Unique GUID, or do all objects in a given instance of an object model share the same GUID?

ajj3085 replied on Monday, March 05, 2007

It should be unique within the type.  So all Customer objects should have their own unique Ids, but Orders may have their own set of unique ids which may overlap the set of customer ids... unless of course you'll be storing Customer objects in the same list as Orders.. but I don't see a good reason to do that!

Andy

Obelix replied on Monday, March 05, 2007

Thanks Andy. Makes complete and utter sense when explained like that.

ThanX
Obelix

ajj3085 replied on Monday, March 05, 2007

Obelix, glad I could help.  Usually I'm the one that doesn't explain things clearly. Smile [:)]

JonM replied on Monday, March 05, 2007

We are using the CombID, so far it works great.  It is a guid, but instead of being completely random it is based on datetime+random.  This makes it so the guids generated are always incrementing and therefore solve the database indexing problems that guids traditionally have.

dshafer replied on Friday, March 09, 2007

Jon,

This incrementing guid sounds interesting.  I've tried googling it, but can't find any other mention of it.  Do you know of a site that explains the use of this combId?

Thanks,

Dustin

JoeFallon1 replied on Friday, March 09, 2007

I recall reading that article. It is by Jimmy Nilsson.

Here it is:

http://www.informit.com/articles/printerfriendly.asp?p=25862&rl=1

Joe

dshafer replied on Tuesday, March 13, 2007

Joe,

Thanks for the link.  In case anybody is interested, the following C# code will create the COMB or "sequential" guid.  By the way, this code is only meant as a sample, and is by no means the best or most efficient way to do this.

//Store the current date and time
DateTimedtNow = DateTime.Now;

//Number of days since 01/01/1900
DateTime dtBase = new DateTime(1900,01,01);
TimeSpan ts = dtNow - dtBase;
int days = ts.Days;  

//Convert number of days to hex
string hexDays = Convert.ToString(days, 16).PadLeft(4, '0').ToUpper();

//Number of milliseconds since midnight
double milliSeconds = dtNow.TimeOfDay.TotalMilliseconds;

//Convert number of milliseconds to number of sql clock ticks (3.333333333 milliseconds per clock tick)
double milliPerTick = 3.0 + (1.0/3.0);
Int64 clockTicks = Convert.ToInt64(milliSeconds / milliPerTick);

//Convert number of clock ticks to hex>
string hexTicks= Convert.ToString(clockTicks, 16).PadLeft(8, '0').ToUpper();

//Create a "sequential" guid using the first 24 characters of a generated guid, plus the binary datetime that we generated earlier
Guid nextSeqGuid = new Guid(Guid.NewGuid().ToString().Substring(0,24) + hexDays + hexTicks);

Dustin

Clayton replied on Wednesday, March 14, 2007

jhw:

I have always used a autonumber id integer for every table so that there is a single primary key. This key is used as the id for objects that are loaded out of the db. One problem with this is that objects don't get their id until submitted into the db. Sometimes it would be nice to have the id prior to saving.

My question is in regards to using guid's instead. Is there a lot of cost to using them compared to an interger as an id. And can a person rely on them to be unique.



Using a GUID in the database is a bad idea for a number of reasons. I'll give a couple here:

1. A uniqueidentifier is twice the size of a bigint and 4 times the size of an int. In our high storage world, this isn't a big deal. But it isn't just the storage of the value in the column. If this is your primary key, then it is also indexed. This means your index is 2 or 4 times larger than needed, which means it could take up to two or four times as long to traverse the index in lookups to find rows. Since the majority of actions against a database are reads, I think it's obvious why doubling or more the time for performing an index seek is a bad idea.

2. A uniqueidentifier is a non-sequential number. If you cluster on a uniqueidentifier as most primary keys are, then already inserted rows may have to be moved in order to accommodate another row stuck into the table at a random location. Disk I/O is a very slow operation which significantly increases the lock time for your inserting table. For auto-incremented clusters, this is never going to happen as all rows are tacked onto the end of the table.

There are many other less significant reasons to avoid using a GUID, but these two are two of the top ones.

Your best bet is to start with the negative number like Rocky mentioned and work your way up. If you still want to use a GUID, then add it as an additional column on the table but still use an auto-number as the primary key.

The GetIdValue method only has to be unique for the CSLA object. Create a new GUID when an object is created and save that to the database along with the other columns. Then refill that value when you DP_Fetch. That way you can still use an auto-number for the table's primary key and all of your queries and joins.

ajj3085 replied on Thursday, March 15, 2007

jhw:
1. A uniqueidentifier is twice the size of a bigint and 4 times the size of an int. In our high storage world, this isn't a big deal. But it isn't just the storage of the value in the column. If this is your primary key, then it is also indexed. This means your index is 2 or 4 times larger than needed, which means it could take up to two or four times as long to traverse the index in lookups to find rows. Since the majority of actions against a database are reads, I think it's obvious why doubling or more the time for performing an index seek is a bad idea.


This is true, although a number is a number and computers are very good with numbers.  Also consider that as we move to 64 bit computing, the comparision would be two dwords, which should be pretty performant.

jhw:
2. A uniqueidentifier is a non-sequential number. If you cluster on a uniqueidentifier as most primary keys are, then already inserted rows may have to be moved in order to accommodate another row stuck into the table at a random location. Disk I/O is a very slow operation which significantly increases the lock time for your inserting table. For auto-incremented clusters, this is never going to happen as all rows are tacked onto the end of the table.


This is not true; if you look a few posts up, you see a discussion on a COMBID, which takes a guid and modifies it in such a way that part of it is still sequential, and thus doesn't screw up the indexing.

Clayton replied on Thursday, March 15, 2007

ajj3085:
This is true, although a number is a number and computers are very good with numbers.  Also consider that as we move to 64 bit computing, the comparision would be two dwords, which should be pretty performant.


Twice as long is still twice as long. It is common to accept some performance penalty where there is a offsetting benefit. But I don't think a 100% performance penalty on the most prevalent operation should be taken lightly.

ajj3085:
This is not true; if you look a few posts up, you see a discussion on a COMBID, which takes a guid and modifies it in such a way that part of it is still sequential, and thus doesn't screw up the indexing.


I saw the COMBID, but that's only taking a random GUID and replacing that last 6 bytes with 6 sequential bytes. Those last 6 bytes may have some semblance of being sequential, but I can assure you they will not sort sequentially when inserted into the database. The first 10 bytes are still random and will not sort in the same order they are inserted. Which means indexing/clustering on the COMBID will still give you the same penalties as indexing/clustering on a real GUID.

ajj3085 replied on Thursday, March 15, 2007

Clayton:
Twice as long is still twice as long. It is common to accept some performance penalty where there is a offsetting benefit. But I don't think a 100% performance penalty on the most prevalent operation should be taken lightly.


I'm sure most of us here also index nchar / nvarchar columns, and probably set the collation to case insensitive.  Comparatively, those would take much more time than two dword comparisions.  I don't think we need to worry about using a 128 bit number vs. a 32 bit one..

Clayton:
I saw the COMBID, but that's only taking a random GUID and replacing that last 6 bytes with 6 sequential bytes. Those last 6 bytes may have some semblance of being sequential, but I can assure you they will not sort sequentially when inserted into the database. The first 10 bytes are still random and will not sort in the same order they are inserted. Which means indexing/clustering on the COMBID will still give you the same penalties as indexing/clustering on a real GUID.


Funny, because the linked article gives some interesting results that seem to contradict you for both inserting and selecting data.  Also, FWIW, that's not the first article I've seen describing the modification of a Guid in that way... the whole point of the COMBID is to make the Guid quicker in database inserting and selecting.  If the techinique didn't work, no one would have bothered  with it..

Keep in mind, most of us here are not building applications for millions of users querying Google sized databases; COMBIDs might have a slight hit on performance (if you believe the article, its almost just as good as ints), but not so much that it will cause anyone a huge problem.  For the most part, network traffic slows our apps down the most.

Clayton replied on Friday, March 16, 2007

I also index nchar/nvarchar columns. They have their place. But I try to avoid writing joins against them which you would likely do with a GUID primary key.

The linked article is very easy to test to see that the ordering is still random. Simply create a temporary table, load the results into the table in the order in which the article gave them, then return an ordered resultset. It will not be in the same order the values were placed into the table. The results may be somewhat less random than a full GUID, but they are still unordered. It is obvious that it mitigates the problems with a true GUID, but it does not eliminate them.

Taking the 5 generated directly from the article in the order they were listed, you can use the code below to show that they still order differently from the order they are inserted.

declare @table table (
    [guid] uniqueidentifier
)

insert into @table values ('E25AFE33-DB2D-4502-9BF0-919001862D20')
insert into @table values ('83E689D3-8549-4094-B223-919001862D20')
insert into @table values ('CC22A56D-0CD5-43C5-990E-919001862D20')
insert into @table values ('D5149998-1718-468C-B1AD-919001862D20')
insert into @table values ('CBD0182D-4A0E-40AC-9A4C-919001862D20')

select * from @table order by guid


Keep in mind that a clustered index physically orders the data in a table. If you are inserting new rows into a table with a sequentially increasing clustered index, then each insert will physically occur one after the other on the disk. Because of this, page splits won't occur during inserts. Using a GUID (or the COMBID), data will be inserted randomly into physical data pages which will cause page splits.

The databases don't have to be exceptionally large to see a perceptable difference in performance. I say this from experience.

Brian Criswell replied on Friday, March 16, 2007

Um, the article did point out that the ordering of COMBs is slightly random because COMBs generated in the same .003 second timeslice will generate the same last six bytes and the ordering will depend on the rest of the COMB.  If you separate the inserts so that there is .003 seconds between them, you get the order they were generated in.

declare @table table (
    [guid] uniqueidentifier
)

select @aGuid = cast(cast(newid() as binary(10)) + cast(getdate() as binary(6)) as uniqueidentifier)
select @aGuid
insert into @table values (@aGuid)
waitfor delay '00:00:00.003'

select @aGuid = cast(cast(newid() as binary(10)) + cast(getdate() as binary(6)) as uniqueidentifier)
select @aGuid
insert into @table values (@aGuid)
waitfor delay '00:00:00.003'

select @aGuid = cast(cast(newid() as binary(10)) + cast(getdate() as binary(6)) as uniqueidentifier)
select @aGuid
insert into @table values (@aGuid)
waitfor delay '00:00:00.003'

select @aGuid = cast(cast(newid() as binary(10)) + cast(getdate() as binary(6)) as uniqueidentifier)
select @aGuid
insert into @table values (@aGuid)
waitfor delay '00:00:00.003'

select @aGuid = cast(cast(newid() as binary(10)) + cast(getdate() as binary(6)) as uniqueidentifier)
select @aGuid
insert into @table values (@aGuid)
waitfor delay '00:00:00.003'

select * from @table order by guid

Running a simple test on my system showed that I can only fit about 114 inserts into a .003 second time frame.  Your results will vary.

declare @table table (
    [guid] uniqueidentifier
)

declare @int int
select @int = 0

while @int < 10000
begin
    insert into @table values (cast(cast(newid() as binary(10)) + cast(getdate() as binary(6)) as uniqueidentifier))
    select @int = @int + 1
end

select * from @table order by guid

The reason that I like Guids is that I can generate a unique number on the client and that will be the object's id.  COMBs generated on different clients would be slightly more prone to generating out of order ids.  Although the page splits would still not be nearly as much of a problem as with plain guids.

For most applications the programming simplicity added by guids would probably come at a neglible cost, especially if the move is made to COMBs, where inserting half a million rows only takes .1 seconds longer.

Can you not rebuild your pages periodically as part of maintenance?

Justin replied on Friday, March 16, 2007

Clayton:

Twice as long is still twice as long. It is common to accept some performance penalty where there is a offsetting benefit. But I don't think a 100% performance penalty on the most prevalent operation should be taken lightly.

Just becasue it is twice as long in bytes does not equate to twice as long in time to perform an operation especially when dealing with indexes that are implemented using btrees. If you go by the benchmarks in that article it show only a 10% increase in query time with that dataset.

Clayton:

I saw the COMBID, but that's only taking a random GUID and replacing that last 6 bytes with 6 sequential bytes. Those last 6 bytes may have some semblance of being sequential, but I can assure you they will not sort sequentially when inserted into the database. The first 10 bytes are still random and will not sort in the same order they are inserted. Which means indexing/clustering on the COMBID will still give you the same penalties as indexing/clustering on a real GUID.

Thats like saying inserting a the date of  '1/2/2006' and then '1/1/2007' will not be inserted sequencially or sort sequentially because of the alphabetically sort of the date string. Guid's are not strings, the string representation of a guid is hex based and follows PPOONNMM-LLKK-JJII-GGHH-AABBCCDDEEFF the sort however is based in its binary representation which is based on the letter ordering in that string pattern, notice how the last group is sorted first.   

Also in SQL 2005 they introduced the newsequentialid() function to do something similar to COMB for exactly this issue.

Clayton replied on Friday, March 16, 2007

I hadn't heard of the NewSequentialId in Sql2005, so I did some searching. I found an article comparing it against NewId and BigInt. http://sqljunkies.com/Article/4067A1B1-C31C-4EAF-86C3-80513451FC03.scuk

While page splits may be avoided by this new function versus the COMBID's function, the other performance issues still remain. In fact, if you add it up using the author's example, instead of a 1.1GB database, he has a 1.8GB database. Taking just the indexes into account, that's an extra 324MB of index data to search through.

This article also shows that when an insert for a sequential guid and a bigint are done in the same batch, the total effort of the guid insert is 60% and the bigint insert is 40%. If it were only a 10% overhead, wouldn't the numbers would be more along the lines of 52/48?

pelinville replied on Friday, March 16, 2007

See?  Just like I said, some say guids are good and some say they are not.
 
If you are a performance freak then guids are going to twist your nickers in a bunch.  Often for no reason because the user won't even see a difference.
 
But performance people are a wierd bunch so there you go. 

Michael Hildner replied on Friday, March 16, 2007

pelinville:
See?  Just like I said, some say guids are good and some say they are not.
 
If you are a performance freak then guids are going to twist your nickers in a bunch.  Often for no reason because the user won't even see a difference.
 
But performance people are a wierd bunch so there you go. 

Laughing out loud here - nothing like a little entertainment after a long day :) Thanks pelinville!

Justin replied on Saturday, March 17, 2007

Clayton:
I hadn't heard of the NewSequentialId in Sql2005, so I did some searching. I found an article comparing it against NewId and BigInt. http://sqljunkies.com/Article/4067A1B1-C31C-4EAF-86C3-80513451FC03.scuk

While page splits may be avoided by this new function versus the COMBID's function, the other performance issues still remain. In fact, if you add it up using the author's example, instead of a 1.1GB database, he has a 1.8GB database. Taking just the indexes into account, that's an extra 324MB of index data to search through.

This article also shows that when an insert for a sequential guid and a bigint are done in the same batch, the total effort of the guid insert is 60% and the bigint insert is 40%. If it were only a 10% overhead, wouldn't the numbers would be more along the lines of 52/48?

You seem to be glossing over how Comb Guid's work and how SQL stores them, if you insert rows into the system within a small span of time the last 48bits(which is sorted first) will be the same and the rest random, so yes it will act just like a normal guid. However in the real world most systems do not have rows inserted that quickly, therefore the the last 48bits slowy increment minimizing page splits. Comb id is not flawed your test was unless the systems you develop for insert 300 rows a second (in which case Guid's are probably a bad choice, then again any surrogate key would probably be a bad choice). Yes newsequientialid() is a better algorithm as it covers your test case aswell.

Also again you seem to have a simplistic view of how data is retrieved in SQL as though its always a index scan and not a seek and somehow SQL is transversing all that extra 324 MB to find a single value. Also how many databases have 20 million rows where 25% of the data is the surrogate PK? Very synthetic and unrealistic. 

When I said 10% increase in query time I should have been more specific in select/retrieval time, not insert time. Notice how that article did not get into retrieval time instead only focusing on insert time, in most typical OLTP systems the read write ratio is 80:20. The weighted cost of inserts is usually much lower than the select time, yet most of guid's disadvantages are in inserts, less so with sequential guids.

Although bigint and int definetly have thier place, performance is usually a non factor if guid's are used for identifying "root entities" such as Project or Resource, and the development/replication/migration advantages are great.

Justin

Clayton replied on Monday, March 19, 2007

Justin:
You seem to be glossing over how Comb Guid's work and how SQL stores them, if you insert rows into the system within a small span of time the last 48bits(which is sorted first) will be the same and the rest random, so yes it will act just like a normal guid. However in the real world most systems do not have rows inserted that quickly, therefore the the last 48bits slowy increment minimizing page splits. Comb id is not flawed your test was unless the systems you develop for insert 300 rows a second (in which case Guid's are probably a bad choice, then again any surrogate key would probably be a bad choice). Yes newsequientialid() is a better algorithm as it covers your test case aswell.


If you're using CSLA's root/child pattern, you don't think you'd have multiple inserts in that span of time?

Justin:
Also again you seem to have a simplistic view of how data is retrieved in SQL as though its always a index scan and not a seek and somehow SQL is transversing all that extra 324 MB to find a single value. Also how many databases have 20 million rows where 25% of the data is the surrogate PK? Very synthetic and unrealistic.


When I said 10% increase in query time I should have been more specific in select/retrieval time, not insert time. Notice how that article did not get into retrieval time instead only focusing on insert time, in most typical OLTP systems the read write ratio is 80:20. The weighted cost of inserts is usually much lower than the select time, yet most of guid's disadvantages are in inserts, less so with sequential guids.


Although bigint and int definetly have thier place, performance is usually a non factor if guid's are used for identifying "root entities" such as Project or Resource, and the development/replication/migration advantages are great.

Justin



No, my view is not that simplistic. As a member of the performance "weird bunch", to quote a poster above, I'm very much aware of my query plans and how SQL retrieves data. Though even with a seek, how many additional MBs of index information must be sifted through to get to what you're seeking?

My point is not that GUIDs don't have a place. My point is only that GUIDs should not be the default. I've spent over 40 hours on the phone with MS's SQL team fixing performance issues related to using GUIDs as primary keys. I'm intimiately familiar with the ins and outs of how SQL treats indexes and data pages among other things. Yes, this system was large, but not abnormally large. It was a supply chain application that serviced only 11 production plants.

Development/replication/migration can all be solved quite simply with GUIDs as the primary key. But if we're talking about simple applications, then replication and migration are probably not necessary items. As for development, I could argue that its easier to develop using sequential numeric keys than GUIDs for the sheer debugging and ad-hoc query savings. Its much quicker to type in "384" into an ad-hoc query than to copy and paste (or re-type) a GUID. Solving the unique GetIdValue problem with a utility method at the start of development is a one-time thing. Its also very simple to solve the replication/migration issues using a GUID as a non-primary key.

pelinville replied on Monday, March 19, 2007

Clayton:

Development/replication/migration can all be solved quite simply with GUIDs as the primary key. But if we're talking about simple applications, then replication and migration are probably not necessary items.
 
I find that this is quickly not being the case. With the advent of SQL Everywhere and pocket pc and cell phone apps, being able to say "this thing here refers to this, and only this, set of data" is a wonderful thing.
 
Clayton:
As for development, I could argue that its easier to develop using sequential numeric keys than GUIDs for the sheer debugging and ad-hoc query savings. Its much quicker to type in "384" into an ad-hoc query than to copy and paste (or re-type) a GUID.
 
Them be fightin' words mister!  At first I thought the same as you, but work with guids long enough and they simply don't pose that problem. 
 
First if you query on the first 4 or five 'characters' in the guid you will usually find the one you want.  Maybe a couple more will show up but finding the row you need isn't that difficult.
 
Now if these ad-hoc queries are in an application then of course you never have to type in that guid to begin with.
 
And if you do development in Visual Studio use Developer Express Code Templates to basically write the sql for you.
 
And while '384' is easy to remember and type I find '4395053' to be very easy to forget and/or mistype when I am a hurry so I usually use copy/paste anyway to make sure I get it right. 
 
But MAINLY guids make things like simple object caching, object comparison, object creation and maintanence and services much easier to write and maintain. 
 
Clayton:
Solving the unique GetIdValue problem with a utility method at the start of development is a one-time thing.
 
That usually does not work while off line.  Again, pocketpc/cellphone apps are not always connected to the network.  My customers expect offline capability even for their simple apps. And clickOnce apps that are deployed over the internet suffer user-noticeable performance hits when you have to retrieve the next number. 
 
Clayton:
Its also very simple to solve the replication/migration issues using a GUID as a non-primary key.
 
This is where I agree with you 100%.  Needing to use something other than a GUID for the primary key because it is a high transaction system I understand.  But I will still want a GUID to id the object. My apps would probably not even use the PK in that case so the GUID would have to be a Unique Key.
 
So you can convince me that there will be performance hits if a GUID is used as the primary key.  And I don't care.  There are ways to overcome this hit to a certain extent. Whatever performance problems that remain are often not noticeable or can be solved with a bit more hardware. 
 
But to say development is easier when using numbers to identify the objects.  That is just crazy talk. I mean the fact that one piece of information id's that OBJECT among all other OBJECTS in the UNIVERSE!

BTW. Don't take the comments I wrote personally.  Maintenance/time of development freaks like me are also weird.  Just not as weird as you.Big Smile [:D]

Justin replied on Wednesday, March 21, 2007

Clayton:

If you're using CSLA's root/child pattern, you don't think you'd have multiple inserts in that span of time?

You would for the child objects of course, but they would all have identical guid for thier FK to the parent so again minimized page splits.  This is why I said they are ideal for root object PK's. Obviously it would be a bad choice to say have a GUID for the PK of a Order Item child object of an Order object, it should instead be the Order GUID + a sequential int.

Clayton:
No, my view is not that simplistic. As a member of the performance "weird bunch", to quote a poster above, I'm very much aware of my query plans and how SQL retrieves data. Though even with a seek, how many additional MBs of index information must be sifted through to get to what you're seeking?

Obviously with only a 10% performance penalty on joins for 400% the key bytes it's not "shifting" through all those extra bytes. If you are familiar with how SQL server and most relational DB's implement indexes you would know more bytes per unque key value does not have a 1:1 effect on index seek time.

Clayton:

My point is not that GUIDs don't have a place. My point is only that GUIDs should not be the default. I've spent over 40 hours on the phone with MS's SQL team fixing performance issues related to using GUIDs as primary keys. I'm intimiately familiar with the ins and outs of how SQL treats indexes and data pages among other things. Yes, this system was large, but not abnormally large. It was a supply chain application that serviced only 11 production plants.

Again for someone who is intimately familiar with with SQL indexes and storage engine it seems odd you would come here proclaiming 1st that a index that is 2-4 times the size of an int takes 2-4 times as long to transverse to find matches, and 2nd that we can be assurred COMB GUID will not sort sequencially since both are incorrect.

Clayton:

Development/replication/migration can all be solved quite simply with GUIDs as the primary key. But if we're talking about simple applications, then replication and migration are probably not necessary items. As for development, I could argue that its easier to develop using sequential numeric keys than GUIDs for the sheer debugging and ad-hoc query savings. Its much quicker to type in "384" into an ad-hoc query than to copy and paste (or re-type) a GUID. Solving the unique GetIdValue problem with a utility method at the start of development is a one-time thing. Its also very simple to solve the replication/migration issues using a GUID as a non-primary key.

You could argue that point, and on a small development project that might be true. My argument from experience implementing small projects that evolved into large projects all based on 32 bit ints for keys is that yes "384" easy easy to rememeber "2154336873" is not so easy. Even smaller numbers are easy to transpose and anyone who is serious about debugging or retreiving the right data ad-hoc is going to use cut and paste to eliminate typo's. Secondly "384" is too easy to remeber especially out of context and being passed around within variables, as it is easy to store the key of one entity and make a simple error and have it be used to retrevie a different type of entity that just happens to have the same id, not generating error, GUID avoid this simple mistake all together.

Nothing wrong with being a part of the "weird bunch" when it comes to performance, but I have found most reasons used to avoid GUID's are from misinformation, and trying to eek out 10% more PK lookup performance is not the best place to spend optimiztion time, as those queries are usually the best performing in the entire application, vs the more complex analysis query's or user search functions that don't really even deal with a PK lookup but are comparing the other attributes of the entity and taking up most of the resources.

Justin

Clayton replied on Wednesday, March 21, 2007

Justin:

You would for the child objects of course, but they would all have identical guid for thier FK to the parent so again minimized page splits.  This is why I said they are ideal for root object PK's. Obviously it would be a bad choice to say have a GUID for the PK of a Order Item child object of an Order object, it should instead be the Order GUID + a sequential int.

Obviously with only a 10% performance penalty on joins for 400% the key bytes it's not "shifting" through all those extra bytes. If you are familiar with how SQL server and most relational DB's implement indexes you would know more bytes per unque key value does not have a 1:1 effect on index seek time.

Again for someone who is intimately familiar with with SQL indexes and storage engine it seems odd you would come here proclaiming 1st that a index that is 2-4 times the size of an int takes 2-4 times as long to transverse to find matches, and 2nd that we can be assurred COMB GUID will not sort sequencially since both are incorrect.

You could argue that point, and on a small development project that might be true. My argument from experience implementing small projects that evolved into large projects all based on 32 bit ints for keys is that yes "384" easy easy to rememeber "2154336873" is not so easy. Even smaller numbers are easy to transpose and anyone who is serious about debugging or retreiving the right data ad-hoc is going to use cut and paste to eliminate typo's. Secondly "384" is too easy to remeber especially out of context and being passed around within variables, as it is easy to store the key of one entity and make a simple error and have it be used to retrevie a different type of entity that just happens to have the same id, not generating error, GUID avoid this simple mistake all together.

Nothing wrong with being a part of the "weird bunch" when it comes to performance, but I have found most reasons used to avoid GUID's are from misinformation, and trying to eek out 10% more PK lookup performance is not the best place to spend optimiztion time, as those queries are usually the best performing in the entire application, vs the more complex analysis query's or user search functions that don't really even deal with a PK lookup but are comparing the other attributes of the entity and taking up most of the resources.

Justin



My intent from my posts was not to say that all of the data will be sifted through. That's why I used phrases such as "up to two or four times" in my first post instead of giving it as an absolute. Obviously it depends on the fill factor of the index pages, etc. I wish there was a program that could graphically display the B-tree with int vs GUID values as a way to illustrate my point, but I don't know of one.

My whole intent has been to point out that there are obvious performance and space penalties for GUIDs that are not strictly necessary. The benefits of GUIDs can be had using the GUID as a non-PK and still retain the benefits from using a sequential number as the PK. I also think I've shown quite simply that the COMB GUID does not sort sequentially in all circumstances.

My whole reasoning for posting has been to give light to the problems I faced with a GUID as PK system. Yet this seems to have devolved into something akin to an argument, so I'll rest my case here. I will certainly read any response you may have, but it is unlikely I will continue. I believe we may be at a point where we have to agree to disagree.

Justin replied on Wednesday, March 21, 2007

Clayton:

My intent from my posts was not to say that all of the data will be sifted through. That's why I used phrases such as "up to two or four times" in my first post instead of giving it as an absolute. Obviously it depends on the fill factor of the index pages, etc. I wish there was a program that could graphically display the B-tree with int vs GUID values as a way to illustrate my point, but I don't know of one.

My whole intent has been to point out that there are obvious performance and space penalties for GUIDs that are not strictly necessary. The benefits of GUIDs can be had using the GUID as a non-PK and still retain the benefits from using a sequential number as the PK. I also think I've shown quite simply that the COMB GUID does not sort sequentially in all circumstances.

My whole reasoning for posting has been to give light to the problems I faced with a GUID as PK system. Yet this seems to have devolved into something akin to an argument, so I'll rest my case here. I will certainly read any response you may have, but it is unlikely I will continue. I believe we may be at a point where we have to agree to disagree.

Lets not mince words, this is an debate, you arguing against and I arguing for GUID's both based on our experiences, I would hope all here benifit in some way from the debate and would not characterize it as "devolved".

You seemed to be pretty susinct it your proclimations that GUID's will at least *2 the index seek time (when the reality is *1.10)  and that COM GUID's would never sort sequentially (arguing based on thier alphabetic sort). Instead of admitting error with these claims you have only "moved the goalposts", redefining what you meant by "doubling or more the time for performing an index seek" and "COMBID will still give you the same penalties as indexing/clustering on a real GUID".

I simply wanted to quell any misinformation about GUID's and COMB GUID's that may turn people away from them.

Perhaps a summary of known pro's and con's we can agree on? Please correct me if this has any errors (Note most natural keys for data are more than 16bytes and are not sequetial therfore having similar issues as GUID's):

Pros:

1. GUID's are statistically unique allowing the PK of a entity to be assigned disconnected from a central database, allowing a new BO's to be initialized without a roundtrip to the central DB.

2. Since GUID are globally unique no special measures must be taken for replication as there will be no domain conflicts when merging data from separate systems.

2.Random GUID's lead to more secure systems as it is nearly impossible to guess previous or next PK's for an entity or even randomly guess a valid PK.  

Cons:

1. GUID's are 16bytes, int are 4bytes, bigint 8bytes, GUID require more storage than int's or bigint's could be up to 2-4 times slower on an index scan, but only ~1.10 times slower on a seek.

2. Random GUID's can cause excessive index page splitting and fragmentation (once fill factor has been exhausted) affecting tables that have a high volume of inserts. This can be even more pronounced if the GUID is a clustered index as the table becomes fragmented. This requires more database maintence overhead to reorganize indexes. 

COMB GUID's eliminate con #2 in systems that generate GUID's at less then 300 a second(this does not affect batch inserts as the COMB GUID algorithm can be implemented in client code generating the GUID at real time not as it's being inserted into db).

SQL Newsequentialid() function can eliminate con #2 at any generation rate but cannot be implemented in client code.

3. GUID's cannot be used for paging datasets or other comparision's that involve > or < as they are meaningless.

Justin

 

 

hurcane replied on Thursday, March 22, 2007

 This has been an interesting thread to read. We took the compromise route in our office. None of our transactional objects (customer orders, purchase orders, etc.) know their primary key when they are new.

We have two private fields in these objects to handle the identity. One is the primary key and the other is a GUID. DataPortal_Create instanatiates the guid variable. DataPortal_Fetch instantiates the primary key variable.

Our GetIdValue overrides look like this for these objects:
Public Overrides Sub GetIdValue() As Object
    Return IIf(Me.IsNew, mFakeID, mPrimaryKey)
End Sub

The GUID doesn't go in the database, so the performance people on our team are satisfied. It really just gets used for comparison purposes. We don't explicitly use GetIdValue in any of our business or UI code.

Copyright (c) Marimer LLC