Pageable/virtual mode CSLA collections for "very" large database tables


Old forum URL: forums.lhotka.net/forums/t/7006.aspx


rsbaker0 posted on Wednesday, May 27, 2009

I'm looking for some input on how best to implement a pageable CSLA collection, perhaps an ERLB-derived class, that lets you scroll through an arbitrarily large database table (potentially millions of rows) without loading all the objects. 

Of course, no one actually scrolls through millions of records -- you apply a filter -- so my intent is to support the DevExpress IListServer interface and use their XtraGrid's rich filtering interface to let the user constrain the results.

I have a prototype working fairly well that loads pages of 100 objects at a time, retains a configurable number of pages in memory, and discards them in FIFO order when the limit is exceeded. However, it's not a true CSLA collection -- actually each "page" is currently an ERLB.
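
A minimal sketch of the kind of FIFO page cache being described (the PageCache name and its members are illustrative, not the actual prototype):

    using System.Collections.Generic;

    // Rough sketch of a FIFO page cache: pages are cached by page index and
    // the oldest page is discarded once a configurable limit is exceeded.
    public class PageCache<T>
    {
        private readonly int _maxPages;
        private readonly Dictionary<int, IList<T>> _pages = new Dictionary<int, IList<T>>();
        private readonly Queue<int> _loadOrder = new Queue<int>();

        public PageCache(int maxPages)
        {
            _maxPages = maxPages;
        }

        // Returns the cached page, or null if it was never loaded or has been discarded.
        public IList<T> GetPage(int pageIndex)
        {
            IList<T> page;
            return _pages.TryGetValue(pageIndex, out page) ? page : null;
        }

        public void AddPage(int pageIndex, IList<T> items)
        {
            if (_pages.ContainsKey(pageIndex))
                return;

            // Discard the oldest page (FIFO order) once the configured limit is exceeded.
            if (_pages.Count >= _maxPages)
                _pages.Remove(_loadOrder.Dequeue());

            _pages[pageIndex] = items;
            _loadOrder.Enqueue(pageIndex);
        }

        public void Clear()
        {
            _pages.Clear();
            _loadOrder.Clear();
        }
    }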

Can I implement a pageable collection by deriving directly from ERLB? At first glance, it seems like you could override the indexer and Count properties as a start, but there are potential trouble spots and I'm not sure where else I would run into difficulty. For example, what would you do with GetEnumerator()?

Any advice appreciated...

RockfordLhotka replied on Wednesday, May 27, 2009

The only paging support built into CSLA is designed around web forms, where you can implement IReportTotalRowCount.

But the implementation in that case is basically the same as what you are doing - the static factory method returns one page of data at a time as the user moves from page to page via the datagrid. The IReportTotalRowCount is necessary so the datagrid knows how many total rows there are, so it can display the appropriate UI glyphs for navigation.
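
For reference, a rough sketch of that shape in CSLA terms. CustomerPage, CustomerInfo, PagedCriteria and GetPage are hypothetical names, and the Csla.Core placement of IReportTotalRowCount is an assumption to check against your CSLA version:

    using System;
    using Csla;

    [Serializable]
    public class PagedCriteria
    {
        private int _pageIndex;
        private int _pageSize;
        public int PageIndex { get { return _pageIndex; } }
        public int PageSize { get { return _pageSize; } }
        public PagedCriteria(int pageIndex, int pageSize)
        {
            _pageIndex = pageIndex;
            _pageSize = pageSize;
        }
    }

    [Serializable]
    public class CustomerInfo { /* hypothetical read-only child object */ }

    [Serializable]
    public class CustomerPage : ReadOnlyListBase<CustomerPage, CustomerInfo>,
                                Csla.Core.IReportTotalRowCount
    {
        private int _totalRowCount;

        // The datagrid reads this to learn how many rows exist in total,
        // so it can draw the right navigation glyphs for paging.
        public int TotalRowCount
        {
            get { return _totalRowCount; }
        }

        // Factory returns one page of data at a time as the user pages.
        public static CustomerPage GetPage(int pageIndex, int pageSize)
        {
            return DataPortal.Fetch<CustomerPage>(new PagedCriteria(pageIndex, pageSize));
        }

        private void DataPortal_Fetch(PagedCriteria criteria)
        {
            RaiseListChangedEvents = false;
            IsReadOnly = false;
            // Load only the requested page from the DAL here, and capture the
            // total row count reported by the same query.
            _totalRowCount = 0; // replace with the DAL's total row count
            IsReadOnly = true;
            RaiseListChangedEvents = true;
        }
    }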

rsbaker0 replied on Thursday, May 28, 2009

OK, thanks.

Do you think it is feasible to derive from EditableRootListBase and implement a pageable collection, or do I need to start from scratch and provide my own implementations of IList, IBindingList, etc.?

RockfordLhotka replied on Thursday, May 28, 2009

If you want to be able to foreach over a virtual list, then you'll almost certainly have to create your own list type without using a pre-existing base class. The .NET base classes tend to assume concrete lists - they actually contain and manage a real list. Your class would not contain a real list - it would just give the illusion of there being a real list as it pages over small list-segments.

This is not unlike what SonOfPirate may need to do to accomplish the multi-threaded scenario he's working on (separate thread) - maybe the two of you can collaborate to some degree - though maybe the requirements are just too different.

SonOfPirate replied on Friday, May 29, 2009

The first question I have is whether you really need to cache all of the data.  If you don't, probably because the data is volatile, then you can simply add PageNumber and PageSize parameters to your factory method and criteria class so they can be used in your stored procedure, allowing the database to handle returning only the requested records.
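
A hedged sketch of letting the database do the windowing, assuming SQL Server and a hypothetical Customers table (the ROW_NUMBER query stands in for the stored procedure mentioned above):

    using System.Collections.Generic;
    using System.Data.SqlClient;

    public static class CustomerDal
    {
        // Returns only the requested page; the database does the windowing.
        public static List<string> FetchPage(string connectionString, int pageNumber, int pageSize)
        {
            const string sql =
                @"SELECT Name FROM
                    (SELECT Name, ROW_NUMBER() OVER (ORDER BY Name) AS RowNum
                     FROM Customers) AS Numbered
                  WHERE RowNum BETWEEN ((@PageNumber - 1) * @PageSize) + 1
                                    AND @PageNumber * @PageSize
                  ORDER BY RowNum";

            var result = new List<string>();
            using (var cn = new SqlConnection(connectionString))
            using (var cm = new SqlCommand(sql, cn))
            {
                cm.Parameters.AddWithValue("@PageNumber", pageNumber);
                cm.Parameters.AddWithValue("@PageSize", pageSize);
                cn.Open();
                using (var dr = cm.ExecuteReader())
                {
                    while (dr.Read())
                        result.Add(dr.GetString(0));
                }
            }
            return result;
        }
    }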

If you do want or need to cache all of the data, then I would agree with Rocky that a "view" on the original data would be best.  I've accomplished something similar in the past, but without the filtering.  My approach also assumes a fixed window - as opposed to a sliding window where you drop the top item and add another at the bottom as you scroll, displaying 1-100, then 2-101, 3-102 and so on.  That can be done, however; I just didn't try it.

The way I did it was to have the UI bound to my custom view class which sat on top of the actual collection class.  The collection class contained a nested class that wrapped each element and added an Index property.  This allowed me to populate the collection with items out of sequence.  The collection was sorted by this Index property.

The trick was implementing logic to determine whether the data was already loaded into memory.  Essentially I'd check the collection to see which items within the range between ((PageNumber - 1) * PageSize) and (PageNumber * PageSize - 1) already exist and retrieve the rest from the database.  As the app was used, the underlying collection would slowly build up cached copies of all the records, one page at a time.

The view used the same PageNumber and PageSize values to create an offset for the actual index in the underlying collection.  Oh, and the Item[index] property on the collection used the Index property to find the item rather than the absolute index/position in the collection.  I created a custom enumerator on the view that used the same offset as the starting position and incremented this value until it reached the page size.
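
A stripped-down sketch of that arrangement; the names are made up, and the real implementation also handled sorting and the custom enumerator described above:

    using System.Collections.Generic;
    using System.Linq;

    // Wrapper that records the item's absolute position in the full result set,
    // so items can be cached out of sequence.
    public class IndexedItem<T>
    {
        public int Index { get; private set; }
        public T Value { get; private set; }
        public IndexedItem(int index, T value)
        {
            Index = index;
            Value = value;
        }
    }

    // Backing cache keyed by absolute index rather than position in the collection.
    public class IndexedCache<T>
    {
        private readonly Dictionary<int, IndexedItem<T>> _items =
            new Dictionary<int, IndexedItem<T>>();

        public void Add(int index, T value)
        {
            _items[index] = new IndexedItem<T>(index, value);
        }

        public bool Contains(int index)
        {
            return _items.ContainsKey(index);
        }

        // Which absolute indexes in the requested page are not yet cached
        // (these are the rows to fetch from the database).
        public IEnumerable<int> MissingIndexes(int pageNumber, int pageSize)
        {
            int first = (pageNumber - 1) * pageSize;
            int last = pageNumber * pageSize - 1;
            return Enumerable.Range(first, last - first + 1)
                             .Where(i => !_items.ContainsKey(i));
        }

        // The view's indexer: page-relative position translated to an absolute index.
        public T GetItem(int pageNumber, int pageSize, int positionInPage)
        {
            int absolute = (pageNumber - 1) * pageSize + positionInPage;
            return _items[absolute].Value;
        }
    }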

Not sure if this completely fits your needs.  I suppose this could be extended to support filtering by simply defining the filter criteria in the factory method so it is passed to the database when retrieving records - if that works with your UI.

Hope that helps...


rsbaker0 replied on Saturday, May 30, 2009

Thanks. I definitely don't want to cache "all" the data, just enough so that if the user scrolls around locally they get reasonable performance.

In my prototype implementation, I have pages of 100 objects, with a maximum of 20 pages at a time. So, this is a cache of 2000 rows at the moment.

I had started this before posting, so currently I have a class derived from EditableRootListBase that has its own implementations of this[] and Count. To make it "work" I actually had to add an additional implementation of ICollection.this[].

It's pretty weird though. In the debugger, the list shows "Count=0", but then when you expand it, it actually shows the full virtual list count (e.g. 10,000,000, whatever) of "null" objects (even though only 1 page of 100 objects has actually been loaded). So, whatever the debugger is using to display the data is not hitting my implementation.

However, at runtime, the DevExpress XtraGrid seems quite happy to use it and their "server" mode (IListServer implementation) also works. Sorting, filtering, grouping operations all properly flush the cache and cause a refetch with the updated criteria. Even auto-complete works when binding to a drop-down type control, although implementing this seems to require fetching the entire set of keys while normal paging does not.

It's just proof of concept at this stage. My concern is that while I've been a developer for over 20 years, I have just 2 years of C# and it's way too easy to get something "working" without realizing the implications of what you are doing and what pitfalls await you... :)

Jack replied on Sunday, May 31, 2009

What about using a LINQ query against your master list to divide it into logical pages (n-2, n-1, n, n+1, n+2)? Whenever your users scroll to n+1, you start an async load of n+3, shift your current view to n+1, and when the retrieve returns, append it to the master list.

Then you just decide how many pages to keep and how many n+x pages you need to keep in order for the UI paging to appear faster than your async load of the next pages.

If you filter the list down, you just filter your master list and your pages all adjust automatically.  If your filter changes your retrieval requirements then you just load and append the new criteria.  You can play with what to cache based on whatever makes sense.
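
The windowing part of that suggestion looks roughly like this (PagedView and its members are placeholder names; the async prefetch is only hinted at in a comment):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    public class PagedView<T>
    {
        private readonly List<T> _masterList = new List<T>();
        private readonly int _pageSize;

        public PagedView(int pageSize)
        {
            _pageSize = pageSize;
        }

        // Called when an async load (of page n+3, for example) returns.
        public void Append(IEnumerable<T> newlyLoaded)
        {
            _masterList.AddRange(newlyLoaded);
        }

        // LINQ slices the master list into the logical page the UI is showing.
        public IEnumerable<T> GetPage(int pageNumber)
        {
            return _masterList.Skip((pageNumber - 1) * _pageSize).Take(_pageSize);
        }

        // Filtering just filters the master list; the pages adjust automatically.
        public IEnumerable<T> GetPage(int pageNumber, Func<T, bool> filter)
        {
            return _masterList.Where(filter)
                              .Skip((pageNumber - 1) * _pageSize)
                              .Take(_pageSize);
        }
    }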

I'm doing some similar things with presenting an MRU list first, plus their active records, and then async loading things I 'think' they might need (the rest of the project records).  For low-usage (inactive) records they can just wait the few seconds to retrieve them.

Jack

rsbaker0 replied on Tuesday, June 02, 2009

RockfordLhotka:

If you want to be able to foreach over a virtual list, then you'll almost certainly have to create your own list type without using a pre-existing base class. The .NET base classes tend to assume concrete lists - they actually contain and manage a real list.

This indeed turned out to be correct.  You can fool the base class some of the time, but, not surprisingly, the non-virtual IndexOf and indexer calls in ERLB don't see my derived class implementations.

I could probably get past my latest hurdle by overriding SaveItem completely, but it looks like I'm just asking for trouble with this approach.

rsbaker0 replied on Wednesday, June 03, 2009

OK, I've reworked what I have now so that it doesn't derive from anything, but instead directly implements IBindingList (at least partially) and Csla.Core.IParent, thus mimicking to the extent possible what EditableRootListBase does. I'm trying to copy what I need out of EditableRootListBase. I have updates working, even changes to the object key (which get complicated when you have internal indexes on the key).

A few questions (some slightly OT):

(1) Do I need the IDataPortalTarget features from ERLB? If so, this interface is currently marked as internal, which is a problem.

(2) For enumeration, it simply isn't always feasible to scroll through "all" the objects in the underlying result set. It would not finish in a reasonable amount of time and would transmit an unreasonable amount of data over the wire. For now, I've implemented the enumerator as simply retrieving the objects that are actually loaded. Is this reasonable, or should I just throw a NotSupportedException()?

(3) To properly support IBindingList versus just IList, it seems like I mainly need to hook the PropertyChanged event of each object as it is loaded and added to the internal cache, and then remove the hook when the objects are flushed from the cache. Is this correct, or are there other potential gotchas I am overlooking?


RockfordLhotka replied on Wednesday, June 03, 2009

1 – I don’t think you do. That interface is used to optimize performance when the data portal interacts with an object. But your collection is never used directly by the data portal so it won’t matter.


2 – Have you looked at ‘yield return’? That language construct (C# only) is designed to help implement enumerators over virtual lists, and similar scenarios.
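
For instance, an enumerator built with yield return over a virtual list might look something like this (VirtualList and FetchPage are hypothetical; the stub only shows the shape):

    using System.Collections.Generic;

    public class VirtualList<T>
    {
        private readonly int _pageSize;
        private readonly int _totalCount;

        public VirtualList(int pageSize, int totalCount)
        {
            _pageSize = pageSize;
            _totalCount = totalCount;
        }

        // Hypothetical DAL call: loads one page of rows on demand.
        private IList<T> FetchPage(int pageIndex)
        {
            // ...load _pageSize rows starting at (pageIndex * _pageSize)...
            return new List<T>();
        }

        // yield return lets the enumerator walk the list page by page without
        // ever materializing the whole result set in memory.
        public IEnumerator<T> GetEnumerator()
        {
            int returned = 0;
            for (int pageIndex = 0; returned < _totalCount; pageIndex++)
            {
                IList<T> page = FetchPage(pageIndex);
                if (page.Count == 0)
                    yield break;
                foreach (T item in page)
                {
                    yield return item;
                    if (++returned >= _totalCount)
                        yield break;
                }
            }
        }
    }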


3 – I think that’s true – the most basic implementation (no sorting, etc) is really not terribly hard.
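
For point 3, the hook/unhook pattern might look roughly like this, assuming the cached items implement INotifyPropertyChanged (VirtualBindingList and its IndexOf are placeholders for the real list class):

    using System.ComponentModel;

    public class VirtualBindingList<T> where T : INotifyPropertyChanged
    {
        public event ListChangedEventHandler ListChanged;

        // Called as each object is loaded into the page cache.
        private void HookItem(T item)
        {
            item.PropertyChanged += Item_PropertyChanged;
        }

        // Called as objects are flushed from the page cache.
        private void UnhookItem(T item)
        {
            item.PropertyChanged -= Item_PropertyChanged;
        }

        private void Item_PropertyChanged(object sender, PropertyChangedEventArgs e)
        {
            int index = IndexOf((T)sender);
            if (ListChanged != null)
                ListChanged(this,
                    new ListChangedEventArgs(ListChangedType.ItemChanged, index));
        }

        // Placeholder: maps a cached item back to its virtual index.
        private int IndexOf(T item)
        {
            return 0;
        }
    }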


Rocky

rsbaker0 replied on Wednesday, June 03, 2009

Thanks for your help as always.

Hmmm. Regarding #1, I did try to implement this as a true "CSLA collection". It implements DataPortal_Fetch() taking a Criteria parameter, etc. The initial fetch returns the first page of objects across the data portal. Does this change anything?

About the enumerator, I did implement a true enumerator versus using yield return, but if the underlying result set has, say, 10 million rows and you are using 100 objects per page and fetching pages on an on-demand basis, that's 100,000 round-trips to traverse the complete collection. Even at an optimistic 1/10 second per page, that's almost 3 hours just for the traversal code, which is why I was considering alternatives. On the other hand, I can't easily stop someone from trying to traverse the entire collection using the indexer either.

Jack replied on Sunday, July 26, 2009

There may be some useful information in this blog if you are still playing around.

Out of curiosity what did or are you doing in your scenario?

http://bea.stollnitz.com/blog/?p=344

jack

rsbaker0 replied on Monday, July 27, 2009

^^^^^^^^

Thanks for the link. There is some good information in there. We're still on WinForms, but I think that should be useful nonetheless.

The current implementation we have is targeted toward supporting the DevExpress "IListServer" interface, which is small but non-trivial, and also not all that well documented.

The general idea is to have an underlying DAL that can read a page of data from anywhere in the possible result set with a given sort sequence, and then have a cache of pages corresponding to ranges of the result set that have been recently visited.  If the filter or sort sequence changes, the entire cache is discarded.
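
In outline, the cache-and-invalidate portion might be sketched like this (the sort/filter strings and LoadPageFromDal are placeholders for the real DAL plumbing):

    using System.Collections.Generic;

    public class ResultSetCache<T>
    {
        private readonly Dictionary<int, IList<T>> _pages = new Dictionary<int, IList<T>>();
        private string _sortExpression = string.Empty;
        private string _filterExpression = string.Empty;

        // If the filter or sort sequence changes, the entire cache is discarded.
        public void SetCriteria(string sortExpression, string filterExpression)
        {
            if (sortExpression != _sortExpression || filterExpression != _filterExpression)
            {
                _pages.Clear();
                _sortExpression = sortExpression;
                _filterExpression = filterExpression;
            }
        }

        public IList<T> GetPage(int pageIndex)
        {
            IList<T> page;
            if (!_pages.TryGetValue(pageIndex, out page))
            {
                // Read one page from anywhere in the result set using the
                // current sort and filter, then cache it by page index.
                page = LoadPageFromDal(pageIndex, _sortExpression, _filterExpression);
                _pages[pageIndex] = page;
            }
            return page;
        }

        private IList<T> LoadPageFromDal(int pageIndex, string sort, string filter)
        {
            return new List<T>();   // hypothetical; real code queries the database
        }
    }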

Implementing IListServer was a very interesting exercise. The DevExpress grid supports a very rich criteria interface for filtering the displayed data, and also does grouping. The user can use a graphical filter builder to construct almost arbitrary filter criteria. The really slick part is that if you are using a NameValueList-like implementation to show human-friendly values for data in the grid, their filter builder will show the "nice" name but provide the underlying value in the actual filter.

When the grid has all the data (e.g. not virtual), it does all the work to filter and group the data in memory. In "server mode", though, any filtering, grouping, or sort change requires an updated SQL query. They have some smart folks working for them, as I found it surprisingly easy to translate the user-built filter for the grid into the corresponding WHERE clause.  

Note that if you have a fully LINQ enabled data layer, you get IListServer for "free", but alas, this was not to be in our case.


RockfordLhotka replied on Monday, July 27, 2009

That's the trick isn't it? The plumbing for paging needs to flow from the UI all the way down to the DAL.

Web Forms does a pretty good job, defining a clear set of plumbing at the UI layer that flows into your business layer or DAL (if you don't have a business layer). CSLA .NET allows you to flow that information through the business layer (factory methods and data portal) to your DAL, and so there's a good end-to-end story for Web Forms.

Windows Forms didn't define any plumbing like that. We're just now seeing some definition around paging in WPF/Silverlight, because there's increasing recognition that even a smart client app sometimes (mostly for important edge cases) needs paging.

It is interesting when third party vendors define interfaces, but what we really need is for Microsoft to define them so they act as a consistent model for the platform.

Copyright (c) Marimer LLC