I have always used a autonumber id integer for every table so that there is a single primary key. This key is used as the id for objects that are loaded out of the db. One problem with this is that objects don't get their id until submitted into the db. Sometimes it would be nice to have the id prior to saving.
My question is in regards to using guid's instead. Is there a lot of cost to using them compared to an interger as an id. And can a person rely on them to be unique.
If you do use an autonumber, you still do need to give your objects a "temporary id" until they are saved. GetIdValue() must always return a unique value.
The easiest way to do this is to use negative numbers, auto-decrementing them from a static/Shared field in your class.
private static int _lastNewId;
[RunLocal]
protected override void DataPortal_Create(object criteria)
{
_lastNewId--;
_id = _lastNewId;
// ...
}
If you let DP_Create() run on an app server, you'd need to use some locking code for thread-safety:
private static int _lastNewId;
protected override void DataPortal_Create(object criteria)
{
_id = System.Threading.Interlocked.Decrement(_lastNewId);
// ...
}
Is that unique value,
A) Unique to every object in the BO Object Model I have created?
or
B) Unique within a given Collection/List of objects?
i.e. does every object in a single hierarchy of object and collections have to have it's own Unique GUID, or do all objects in a given instance of an object model share the same GUID?
Thanks Andy. Makes complete and utter sense when explained like that.
ThanX
Obelix
Jon,
This incrementing guid sounds interesting. I've tried googling it, but can't find any other mention of it. Do you know of a site that explains the use of this combId?
Thanks,
Dustin
I recall reading that article. It is by Jimmy Nilsson.
Here it is:
http://www.informit.com/articles/printerfriendly.asp?p=25862&rl=1
Joe
Joe,
Thanks for the link. In case anybody is interested, the following C# code will create the COMB or "sequential" guid. By the way, this code is only meant as a sample, and is by no means the best or most efficient way to do this.
//Store the current date and time//Number of days since 01/01/1900
DateTime dtBase = new DateTime(1900,01,01);
TimeSpan ts = dtNow - dtBase;
int days = ts.Days;
//Convert number of days to hex
string hexDays = Convert.ToString(days, 16).PadLeft(4, '0').ToUpper();
//Number of milliseconds since midnight
double milliSeconds = dtNow.TimeOfDay.TotalMilliseconds;
//Convert number of milliseconds to number of sql clock ticks (3.333333333 milliseconds per clock tick)
double milliPerTick = 3.0 + (1.0/3.0);
Int64 clockTicks = Convert.ToInt64(milliSeconds / milliPerTick);
//Convert number of clock ticks to hex>
string hexTicks= Convert.ToString(clockTicks, 16).PadLeft(8, '0').ToUpper();
//Create a "sequential" guid using the first 24 characters of a generated guid, plus the binary datetime that we generated earlier
Guid nextSeqGuid = new Guid(Guid.NewGuid().ToString().Substring(0,24) + hexDays + hexTicks);
Dustin
jhw:I have always used a autonumber id integer for every table so that there is a single primary key. This key is used as the id for objects that are loaded out of the db. One problem with this is that objects don't get their id until submitted into the db. Sometimes it would be nice to have the id prior to saving.
My question is in regards to using guid's instead. Is there a lot of cost to using them compared to an interger as an id. And can a person rely on them to be unique.
jhw:1. A uniqueidentifier is twice the size of a bigint and 4 times the size of an int. In our high storage world, this isn't a big deal. But it isn't just the storage of the value in the column. If this is your primary key, then it is also indexed. This means your index is 2 or 4 times larger than needed, which means it could take up to two or four times as long to traverse the index in lookups to find rows. Since the majority of actions against a database are reads, I think it's obvious why doubling or more the time for performing an index seek is a bad idea.
jhw:2. A uniqueidentifier is a non-sequential number. If you cluster on a uniqueidentifier as most primary keys are, then already inserted rows may have to be moved in order to accommodate another row stuck into the table at a random location. Disk I/O is a very slow operation which significantly increases the lock time for your inserting table. For auto-incremented clusters, this is never going to happen as all rows are tacked onto the end of the table.
ajj3085:This is true, although a number is a number and computers are very good with numbers. Also consider that as we move to 64 bit computing, the comparision would be two dwords, which should be pretty performant.
ajj3085:This is not true; if you look a few posts up, you see a discussion on a COMBID, which takes a guid and modifies it in such a way that part of it is still sequential, and thus doesn't screw up the indexing.
Clayton:Twice as long is still twice as long. It is common to accept some performance penalty where there is a offsetting benefit. But I don't think a 100% performance penalty on the most prevalent operation should be taken lightly.
Clayton:I saw the COMBID, but that's only taking a random GUID and replacing that last 6 bytes with 6 sequential bytes. Those last 6 bytes may have some semblance of being sequential, but I can assure you they will not sort sequentially when inserted into the database. The first 10 bytes are still random and will not sort in the same order they are inserted. Which means indexing/clustering on the COMBID will still give you the same penalties as indexing/clustering on a real GUID.
Clayton:
Twice as long is still twice as long. It is common to accept some performance penalty where there is a offsetting benefit. But I don't think a 100% performance penalty on the most prevalent operation should be taken lightly.
Just becasue it is twice as long in bytes does not equate to twice as long in time to perform an operation especially when dealing with indexes that are implemented using btrees. If you go by the benchmarks in that article it show only a 10% increase in query time with that dataset.
Clayton:
I saw the COMBID, but that's only taking a random GUID and replacing that last 6 bytes with 6 sequential bytes. Those last 6 bytes may have some semblance of being sequential, but I can assure you they will not sort sequentially when inserted into the database. The first 10 bytes are still random and will not sort in the same order they are inserted. Which means indexing/clustering on the COMBID will still give you the same penalties as indexing/clustering on a real GUID.
Thats like saying inserting a the date of '1/2/2006' and then '1/1/2007' will not be inserted sequencially or sort sequentially because of the alphabetically sort of the date string. Guid's are not strings, the string representation of a guid is hex based and follows PPOONNMM-LLKK-JJII-GGHH-AABBCCDDEEFF the sort however is based in its binary representation which is based on the letter ordering in that string pattern, notice how the last group is sorted first.
Also in SQL 2005 they introduced the newsequentialid() function to do something similar to COMB for exactly this issue.
pelinville:See? Just like I said, some say guids are good and some say they are not.If you are a performance freak then guids are going to twist your nickers in a bunch. Often for no reason because the user won't even see a difference.But performance people are a wierd bunch so there you go.
Laughing out loud here - nothing like a little entertainment after a long day :) Thanks pelinville!
Clayton:I hadn't heard of the NewSequentialId in Sql2005, so I did some searching. I found an article comparing it against NewId and BigInt. http://sqljunkies.com/Article/4067A1B1-C31C-4EAF-86C3-80513451FC03.scuk
While page splits may be avoided by this new function versus the COMBID's function, the other performance issues still remain. In fact, if you add it up using the author's example, instead of a 1.1GB database, he has a 1.8GB database. Taking just the indexes into account, that's an extra 324MB of index data to search through.
This article also shows that when an insert for a sequential guid and a bigint are done in the same batch, the total effort of the guid insert is 60% and the bigint insert is 40%. If it were only a 10% overhead, wouldn't the numbers would be more along the lines of 52/48?
You seem to be glossing over how Comb Guid's work and how SQL stores them, if you insert rows into the system within a small span of time the last 48bits(which is sorted first) will be the same and the rest random, so yes it will act just like a normal guid. However in the real world most systems do not have rows inserted that quickly, therefore the the last 48bits slowy increment minimizing page splits. Comb id is not flawed your test was unless the systems you develop for insert 300 rows a second (in which case Guid's are probably a bad choice, then again any surrogate key would probably be a bad choice). Yes newsequientialid() is a better algorithm as it covers your test case aswell.
Also again you seem to have a simplistic view of how data is retrieved in SQL as though its always a index scan and not a seek and somehow SQL is transversing all that extra 324 MB to find a single value. Also how many databases have 20 million rows where 25% of the data is the surrogate PK? Very synthetic and unrealistic.
When I said 10% increase in query time I should have been more specific in select/retrieval time, not insert time. Notice how that article did not get into retrieval time instead only focusing on insert time, in most typical OLTP systems the read write ratio is 80:20. The weighted cost of inserts is usually much lower than the select time, yet most of guid's disadvantages are in inserts, less so with sequential guids.
Although bigint and int definetly have thier place, performance is usually a non factor if guid's are used for identifying "root entities" such as Project or Resource, and the development/replication/migration advantages are great.
Justin
Justin:You seem to be glossing over how Comb Guid's work and how SQL stores them, if you insert rows into the system within a small span of time the last 48bits(which is sorted first) will be the same and the rest random, so yes it will act just like a normal guid. However in the real world most systems do not have rows inserted that quickly, therefore the the last 48bits slowy increment minimizing page splits. Comb id is not flawed your test was unless the systems you develop for insert 300 rows a second (in which case Guid's are probably a bad choice, then again any surrogate key would probably be a bad choice). Yes newsequientialid() is a better algorithm as it covers your test case aswell.
Justin:Also again you seem to have a simplistic view of how data is retrieved in SQL as though its always a index scan and not a seek and somehow SQL is transversing all that extra 324 MB to find a single value. Also how many databases have 20 million rows where 25% of the data is the surrogate PK? Very synthetic and unrealistic.
When I said 10% increase in query time I should have been more specific in select/retrieval time, not insert time. Notice how that article did not get into retrieval time instead only focusing on insert time, in most typical OLTP systems the read write ratio is 80:20. The weighted cost of inserts is usually much lower than the select time, yet most of guid's disadvantages are in inserts, less so with sequential guids.
Although bigint and int definetly have thier place, performance is usually a non factor if guid's are used for identifying "root entities" such as Project or Resource, and the development/replication/migration advantages are great.Justin
Clayton:
Development/replication/migration can all be solved quite simply with GUIDs as the primary key. But if we're talking about simple applications, then replication and migration are probably not necessary items.
Clayton:As for development, I could argue that its easier to develop using sequential numeric keys than GUIDs for the sheer debugging and ad-hoc query savings. Its much quicker to type in "384" into an ad-hoc query than to copy and paste (or re-type) a GUID.
Clayton:Solving the unique GetIdValue problem with a utility method at the start of development is a one-time thing.
Clayton:Its also very simple to solve the replication/migration issues using a GUID as a non-primary key.
Clayton:
If you're using CSLA's root/child pattern, you don't think you'd have multiple inserts in that span of time?
You would for the child objects of course, but they would all have identical guid for thier FK to the parent so again minimized page splits. This is why I said they are ideal for root object PK's. Obviously it would be a bad choice to say have a GUID for the PK of a Order Item child object of an Order object, it should instead be the Order GUID + a sequential int.
Clayton:No, my view is not that simplistic. As a member of the performance "weird bunch", to quote a poster above, I'm very much aware of my query plans and how SQL retrieves data. Though even with a seek, how many additional MBs of index information must be sifted through to get to what you're seeking?
Obviously with only a 10% performance penalty on joins for 400% the key bytes it's not "shifting" through all those extra bytes. If you are familiar with how SQL server and most relational DB's implement indexes you would know more bytes per unque key value does not have a 1:1 effect on index seek time.
Clayton:
My point is not that GUIDs don't have a place. My point is only that GUIDs should not be the default. I've spent over 40 hours on the phone with MS's SQL team fixing performance issues related to using GUIDs as primary keys. I'm intimiately familiar with the ins and outs of how SQL treats indexes and data pages among other things. Yes, this system was large, but not abnormally large. It was a supply chain application that serviced only 11 production plants.
Again for someone who is intimately familiar with with SQL indexes and storage engine it seems odd you would come here proclaiming 1st that a index that is 2-4 times the size of an int takes 2-4 times as long to transverse to find matches, and 2nd that we can be assurred COMB GUID will not sort sequencially since both are incorrect.
Clayton:
Development/replication/migration can all be solved quite simply with GUIDs as the primary key. But if we're talking about simple applications, then replication and migration are probably not necessary items. As for development, I could argue that its easier to develop using sequential numeric keys than GUIDs for the sheer debugging and ad-hoc query savings. Its much quicker to type in "384" into an ad-hoc query than to copy and paste (or re-type) a GUID. Solving the unique GetIdValue problem with a utility method at the start of development is a one-time thing. Its also very simple to solve the replication/migration issues using a GUID as a non-primary key.
You could argue that point, and on a small development project that might be true. My argument from experience implementing small projects that evolved into large projects all based on 32 bit ints for keys is that yes "384" easy easy to rememeber "2154336873" is not so easy. Even smaller numbers are easy to transpose and anyone who is serious about debugging or retreiving the right data ad-hoc is going to use cut and paste to eliminate typo's. Secondly "384" is too easy to remeber especially out of context and being passed around within variables, as it is easy to store the key of one entity and make a simple error and have it be used to retrevie a different type of entity that just happens to have the same id, not generating error, GUID avoid this simple mistake all together.
Nothing wrong with being a part of the "weird bunch" when it comes to performance, but I have found most reasons used to avoid GUID's are from misinformation, and trying to eek out 10% more PK lookup performance is not the best place to spend optimiztion time, as those queries are usually the best performing in the entire application, vs the more complex analysis query's or user search functions that don't really even deal with a PK lookup but are comparing the other attributes of the entity and taking up most of the resources.
Justin
Justin:You would for the child objects of course, but they would all have identical guid for thier FK to the parent so again minimized page splits. This is why I said they are ideal for root object PK's. Obviously it would be a bad choice to say have a GUID for the PK of a Order Item child object of an Order object, it should instead be the Order GUID + a sequential int.
Obviously with only a 10% performance penalty on joins for 400% the key bytes it's not "shifting" through all those extra bytes. If you are familiar with how SQL server and most relational DB's implement indexes you would know more bytes per unque key value does not have a 1:1 effect on index seek time.
Again for someone who is intimately familiar with with SQL indexes and storage engine it seems odd you would come here proclaiming 1st that a index that is 2-4 times the size of an int takes 2-4 times as long to transverse to find matches, and 2nd that we can be assurred COMB GUID will not sort sequencially since both are incorrect.
You could argue that point, and on a small development project that might be true. My argument from experience implementing small projects that evolved into large projects all based on 32 bit ints for keys is that yes "384" easy easy to rememeber "2154336873" is not so easy. Even smaller numbers are easy to transpose and anyone who is serious about debugging or retreiving the right data ad-hoc is going to use cut and paste to eliminate typo's. Secondly "384" is too easy to remeber especially out of context and being passed around within variables, as it is easy to store the key of one entity and make a simple error and have it be used to retrevie a different type of entity that just happens to have the same id, not generating error, GUID avoid this simple mistake all together.
Nothing wrong with being a part of the "weird bunch" when it comes to performance, but I have found most reasons used to avoid GUID's are from misinformation, and trying to eek out 10% more PK lookup performance is not the best place to spend optimiztion time, as those queries are usually the best performing in the entire application, vs the more complex analysis query's or user search functions that don't really even deal with a PK lookup but are comparing the other attributes of the entity and taking up most of the resources.
Justin
Clayton:
My intent from my posts was not to say that all of the data will be sifted through. That's why I used phrases such as "up to two or four times" in my first post instead of giving it as an absolute. Obviously it depends on the fill factor of the index pages, etc. I wish there was a program that could graphically display the B-tree with int vs GUID values as a way to illustrate my point, but I don't know of one.
My whole intent has been to point out that there are obvious performance and space penalties for GUIDs that are not strictly necessary. The benefits of GUIDs can be had using the GUID as a non-PK and still retain the benefits from using a sequential number as the PK. I also think I've shown quite simply that the COMB GUID does not sort sequentially in all circumstances.
My whole reasoning for posting has been to give light to the problems I faced with a GUID as PK system. Yet this seems to have devolved into something akin to an argument, so I'll rest my case here. I will certainly read any response you may have, but it is unlikely I will continue. I believe we may be at a point where we have to agree to disagree.
Lets not mince words, this is an debate, you arguing against and I arguing for GUID's both based on our experiences, I would hope all here benifit in some way from the debate and would not characterize it as "devolved".
You seemed to be pretty susinct it your proclimations that GUID's will at least *2 the index seek time (when the reality is *1.10) and that COM GUID's would never sort sequentially (arguing based on thier alphabetic sort). Instead of admitting error with these claims you have only "moved the goalposts", redefining what you meant by "doubling or more the time for performing an index seek" and "COMBID will still give you the same penalties as indexing/clustering on a real GUID".
I simply wanted to quell any misinformation about GUID's and COMB GUID's that may turn people away from them.
Perhaps a summary of known pro's and con's we can agree on? Please correct me if this has any errors (Note most natural keys for data are more than 16bytes and are not sequetial therfore having similar issues as GUID's):
Pros:
1. GUID's are statistically unique allowing the PK of a entity to be assigned disconnected from a central database, allowing a new BO's to be initialized without a roundtrip to the central DB.
2. Since GUID are globally unique no special measures must be taken for replication as there will be no domain conflicts when merging data from separate systems.
2.Random GUID's lead to more secure systems as it is nearly impossible to guess previous or next PK's for an entity or even randomly guess a valid PK.
Cons:
1. GUID's are 16bytes, int are 4bytes, bigint 8bytes, GUID require more storage than int's or bigint's could be up to 2-4 times slower on an index scan, but only ~1.10 times slower on a seek.
2. Random GUID's can cause excessive index page splitting and fragmentation (once fill factor has been exhausted) affecting tables that have a high volume of inserts. This can be even more pronounced if the GUID is a clustered index as the table becomes fragmented. This requires more database maintence overhead to reorganize indexes.
COMB GUID's eliminate con #2 in systems that generate GUID's at less then 300 a second(this does not affect batch inserts as the COMB GUID algorithm can be implemented in client code generating the GUID at real time not as it's being inserted into db).
SQL Newsequentialid() function can eliminate con #2 at any generation rate but cannot be implemented in client code.
3. GUID's cannot be used for paging datasets or other comparision's that involve > or < as they are meaningless.
Justin
Copyright (c) Marimer LLC