I'd like to get some PROS and CONS on a couple of different strategies.
When importing a file I have to verify that each of 6,000 records has a valid account number. To do this I query a database table that contains about 15,000 records. The process goes pretty fast: about 40 rules are checked for each of the 6,000 records, and over 6,000 hits are made to the database. This all happens in about a minute.
I'd like to tighten up the time a bit, and one strategy is to cache the list of 15,000 account numbers and validate against that list. I was thinking of using a Dictionary. This would eliminate the 6,000 individual hits to the database, though it would take up some memory, probably not a lot.
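The dictionary-cache idea might look roughly like this. This is a sketch in Python (the thread's context is .NET, where a `Dictionary` or `HashSet<string>` would play the same role), with an in-memory SQLite table and a hypothetical `accounts` schema standing in for the real 15,000-row table:

```python
import sqlite3

def build_account_cache(conn):
    # One query pulls every account number into a set for O(1) lookups
    # (the Dictionary/HashSet of the .NET version).
    return {row[0] for row in conn.execute("SELECT account_number FROM accounts")}

def find_invalid(records, cache):
    # Return the records whose account number is not in the cached set.
    return [r for r in records if r["account"] not in cache]

# Tiny stand-in for the real 15,000-row table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_number TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO accounts VALUES (?)", [("A1",), ("A2",), ("A3",)])

cache = build_account_cache(conn)
bad = find_invalid([{"account": "A1"}, {"account": "ZZ"}], cache)
```

One round trip fills the cache; the 6,000 per-record checks then cost only a hash lookup each.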
How would you handle this?
I'd probably just bring back all the account records. 15,000 doesn't seem like it would take more than a trivial amount of memory. You may hit problems down the road if the number of records grows, though.
Could also use an n-tier approach; if your app server is your DB server as well, the app server could talk to SQL Server through shared memory, so your 6,000 database hits should be fairly quick.
Another option, depending on how unique the account numbers are in the 6,000 records, is to keep a running cache as you do the checks. Worst case you make 6,000 DB reads and 0 cached data checks.
Best case all 6,000 records share the same account number, so you do 1 DB read and 5,999 cached data checks.
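The running cache described above is just memoizing each lookup the first time it happens. A minimal sketch in Python (in-memory SQLite and the `make_validator` helper are illustrative, not from the thread):

```python
import sqlite3

def make_validator(conn):
    cache = {}  # account number -> bool, filled lazily as records are checked

    def is_valid(account):
        if account not in cache:  # only the first occurrence costs a DB read
            row = conn.execute(
                "SELECT 1 FROM accounts WHERE account_number = ?", (account,)
            ).fetchone()
            cache[account] = row is not None
        return cache[account]

    return is_valid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_number TEXT PRIMARY KEY)")
conn.execute("INSERT INTO accounts VALUES ('A1')")

is_valid = make_validator(conn)
results = [is_valid(a) for a in ["A1", "A1", "ZZ", "A1"]]  # 2 DB reads, 2 cache hits
```

Repeated account numbers are answered from the dictionary, so the DB read count equals the number of distinct account numbers, not the number of records.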
You could also read in all 6,000 records, use LINQ to get the distinct set of account numbers, and then pass those to the validation routine.
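The dedupe-first idea can be sketched like this (Python for brevity; a set comprehension plays the role of `records.Select(r => r.Account).Distinct()` in LINQ, and the `check` callback is a hypothetical stand-in for whatever validation you run):

```python
def find_invalid_accounts(records, check):
    # Dedupe first, then run the (expensive) check once per distinct number.
    unique = {r["account"] for r in records}
    return {a for a in unique if not check(a)}

known = {"A1", "B7"}  # stand-in for the real account-number lookup
bad = find_invalid_accounts(
    [{"account": "A1"}, {"account": "A1"}, {"account": "ZZ"}],
    check=lambda a: a in known,
)
```

If the file contains heavy repetition, this shrinks the validation work before any caching or batching even comes into play.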
A lot depends on the network traffic and where your time bottleneck is (data transfer, memory, etc).
I think your suggested solution would indeed reduce your 6,000 round-trips to the database to perhaps 1 (read the entire account list in one query). 15,000 records is nothing with today's memory configurations (typically 1 GB or more per client).
I've found that round-trips dominate over almost everything else as you try to distribute your implementation over different tiers. You might be able to do 6,000 round trips in 1 minute in a gigabit/sec network with low latency and lightly loaded server, but in many real-world situations, this will take much longer. When running with a remote data portal, I've found just 100 round-trips can take over a minute.
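A middle ground between one-per-record and one giant query is batching lookups into `IN (...)` queries. A Python/SQLite sketch under assumed names (`accounts` table, `existing_accounts` helper); the chunk size guards against per-statement parameter limits, which real databases impose (SQL Server's is around 2,100 parameters):

```python
import sqlite3

def existing_accounts(conn, accounts, chunk=500):
    # Batch many lookups into a few IN (...) queries instead of one
    # round trip per account number.
    accounts, found = list(accounts), set()
    for i in range(0, len(accounts), chunk):
        batch = accounts[i:i + chunk]
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            "SELECT account_number FROM accounts "
            f"WHERE account_number IN ({placeholders})",
            batch,
        )
        found.update(r[0] for r in rows)
    return found

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_number TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO accounts VALUES (?)", [("A1",), ("A2",), ("A3",)])
hits = existing_accounts(conn, ["A1", "ZZ", "A3"])
```

6,000 lookups become roughly a dozen round trips instead of 6,000, which matters most over a high-latency link.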
Thank you for the ideas, gentlemen. For now I went with caching the 15,000 account numbers. I could improve upon this by creating the cache, doing the validation, and then emptying the cache to free up memory but it works fine the way that it is. The other solutions will have to come into play when we change systems in a year - we'll have upwards of 10x the amount of data that would need to be cached so it may not work out so well using this strategy.
Sounds like a fine solution then. Cache everything, one DB hit, and the rest is cached hits. It's only when the memory the cache takes up becomes too large that you'll have to re-evaluate: either caching per hit or, like I said, eliminating the network hop by having an app server on the DB server.
As others have pointed out, it's the network hops that will kill you; 1,000 separate SQL statements over a network link will be slower than even batching 1,000 queries and sending them as one command. Then the amount of data comes into play; 1 GB of data going over two links is going to take more time than having it go over one link.
Just to give you some more alternatives:
1. Use BulkCopy into temporary tables in the database and run your validation in, e.g., a stored procedure / .NET inside SQL Server. That should be really fast and handle quite large data sets. SqlBulkCopy is supported from .NET 2.0 on.
2. As long as the data required for validation is not too much to load into memory, load it into objects and use an indexing library (like I4O) combined with LINQ to run the validation. Also have a look at the Update operator; it gives very nice, readable code.
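Alternative 1 can be sketched end to end. In .NET this would be `SqlBulkCopy.WriteToServer` into a temp table followed by one set-based query; here it is mocked in Python with SQLite, where `executemany` stands in for the bulk copy and the table names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_number TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO accounts VALUES (?)", [("A1",), ("A2",)])

# Step 1: bulk-load the import file into a temp table
# (SqlBulkCopy's job in .NET; executemany stands in here).
conn.execute("CREATE TEMP TABLE import_rows (id INTEGER, account_number TEXT)")
conn.executemany("INSERT INTO import_rows VALUES (?, ?)",
                 [(1, "A1"), (2, "ZZ"), (3, "A2")])

# Step 2: one set-based query flags every row with an unknown account
# number, replacing thousands of per-record point lookups.
bad_ids = [row[0] for row in conn.execute("""
    SELECT i.id FROM import_rows i
    LEFT JOIN accounts a ON a.account_number = i.account_number
    WHERE a.account_number IS NULL
    ORDER BY i.id
""")]
```

The whole validation becomes two round trips (one bulk load, one query), and the join work happens where the data already lives.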
Copyright (c) Marimer LLC