CSLA for Silverlight: How to improve performance?

Old forum URL: forums.lhotka.net/forums/t/7540.aspx


paupdb posted on Monday, August 31, 2009

I have been doing some investigation into a particularly slow load time being experienced when pulling back a fairly large object graph and populating this into a DataGrid.

Using some simple debug writelns, I've been able to separate the time taken as follows:

Total Time from calling Refresh() on CslaDataProvider, to DataGrid being fully loaded: 7-8 secs
Total object graph size before compression: 16MB
Total object graph size after compression: 760KB

I have a custom WcfProxy and WcfPortal on the SL and server sides respectively which apply zip compression to the ObjectData going between server and SL client.
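
For reference, the compression step itself is cheap on the .NET side (Silverlight's BCL has no System.IO.Compression, so the client side typically uses a third-party library such as SharpZipLib). A minimal sketch of what such a helper might look like - class and method names here are illustrative, not the actual CSLA override points:

```csharp
// Illustrative zip helper of the kind a custom WcfPortal/WcfProxy pair
// might call on the serialized ObjectData byte[]. Names are hypothetical.
using System.IO;
using System.IO.Compression;

public static class DataCompressor
{
  public static byte[] Compress(byte[] data)
  {
    using (var output = new MemoryStream())
    {
      // Closing the GZipStream flushes the compressed bytes into output.
      using (var gzip = new GZipStream(output, CompressionMode.Compress))
        gzip.Write(data, 0, data.Length);
      // MemoryStream.ToArray is still valid after the stream is closed.
      return output.ToArray();
    }
  }

  public static byte[] Decompress(byte[] data)
  {
    using (var input = new MemoryStream(data))
    using (var gzip = new GZipStream(input, CompressionMode.Decompress))
    using (var output = new MemoryStream())
    {
      var buffer = new byte[4096];
      int read;
      while ((read = gzip.Read(buffer, 0, buffer.Length)) > 0)
        output.Write(buffer, 0, read);
      return output.ToArray();
    }
  }
}
```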

My debug output is as follows:

End documents fetch at: 5:37:09 PM  <-- This is at the end of my main DataPortal_Fetch
Start compress at: 5:37:10 PM  <-- This is the WcfPortal (server-side) compression
End compress at: 5:37:10 PM; elapsed time: 00:00:00.2656250
Start decompress at: 5:37:10 PM  <-- This is the WcfProxy (SL-side) decompression
End decompress at: 5:37:10 PM; elapsed time: 00:00:00.4062500
ModelChanged at: 5:37:14 PM  <-- This is when the CslaDataProvider.DataChanged event fires

From here, there's another 2-3 secs being spent loading the DataGrid after the ItemsSource is reassigned.  I don't think this part can be optimised much more, since most of the code executing is internal to the DataGrid control.

My current focus for optimisation is the 4-second delay between when the object graph has been decompressed on the SL client and when the DataChanged event is raised by the CslaDataProvider.

I know there is a fair bit of serialisation that takes place in this area and I am wondering whether there is anything that can be done to improve it.
I am also puzzled as to why the serialisation on the server side is so fast compared to the deserialisation on the SL side - the time between my DataPortal_Fetch completing and the server-side compression beginning is only ~1 sec.

Note that all the code is executing on my local PC, so the server side and SL side are on the same hardware.

Any suggestions welcome.

sergeyb replied on Monday, August 31, 2009

How many rows do you have in that graph?  I think having hundreds or, even worse, thousands of rows anywhere in a Silverlight application is going to be a problem.  In these cases I tend to re-design the user interface and introduce some kind of search screen to reduce the number of rows shown at any given time on the client.

 

Sergey Barskiy

Principal Consultant

office: 678.405.0687 | mobile: 404.388.1899


 



paupdb replied on Monday, August 31, 2009

I'm pulling back 100 rows, but the individual objects are pretty heavy - around 10-20 of the properties are child objects and there is a fair bit of data coming back.  This cannot be worked around though, given our requirements - which are to provide a data-rich interface.

We also already have data paging in place, which defaults to 20 rows per page - however the users can go as high as they want, so we're using 100 rows per page as an upper bound for testing.

The thing that I am wondering is why the deserialisation seems to take ~4 secs on the SL client, yet the serialisation on the server-side is in the order of ~1 sec.
Is this something where the new binary serialisation in Csla 3.8 might show improvement?
Is there anything else I can turn on/off to improve performance - e.g. turn off undo completely?

RockfordLhotka replied on Monday, August 31, 2009

One thing you might try for a baseline is to do a clone on the server - just to get a timing.

Serialization is a lot simpler than deserialization, which could explain some of the difference.

But your server could be more powerful. Or .NET might be better at this than SL for some reason.

So timing the deserialization on the server would be interesting, because it would provide a more meaningful basis for comparison to help identify whether the difference is something in .NET/SL, the computer or CSLA's MobileFormatter.

paupdb replied on Monday, August 31, 2009

OK, I added a clone to my Documents object which occurs right at the end of the DataPortal_Fetch on the server.
Output is as follows:

Fetching 20 Document rows into Documents
End documents fetch at: 8:55:02 AM
End documents clone at: 8:55:02 AM; elapsed time is 00:00:00.2343750
Start compress at: 8:55:03 AM
End compress at: 8:55:03 AM; elapsed time: 00:00:00.0781250
Start decompress at: 8:55:03 AM
End decompress at: 8:55:03 AM; elapsed time: 00:00:00.1093750
ModelChanged at: 8:55:04 AM; elapsed time: 00:00:00.9531250

Fetching 100 Document rows in Documents
End documents fetch at: 8:55:41 AM
End documents clone at: 8:55:42 AM; elapsed time is 00:00:00.7031250
Start compress at: 8:55:43 AM
End compress at: 8:55:43 AM; elapsed time: 00:00:00.3125000
Start decompress at: 8:55:43 AM     <-- (WcfProxy.ConvertResponse)
End decompress at: 8:55:43 AM; elapsed time: 00:00:00.3750000  <-- (WcfProxy.ConvertResponse)
ModelChanged at: 8:55:47 AM; elapsed time: 00:00:03.2812500

The clone on the server-side is around 3x faster than the SL client side deserialisation. 
I am using Csla 3.7 atm and I am running all of this on my local PC, so it can't be hardware since the server is local to my PC too.

RockfordLhotka replied on Monday, August 31, 2009

You know, I misled you by using the term “clone”.

 

If you use Clone() you are using BinaryFormatter – an entirely different technology.

 

You need to write your own ‘clone’ that uses MobileFormatter to serialize/deserialize the object graph to get an accurate picture. You can grab the clone code from the Silverlight base class, but execute it on the .NET server side.
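
For anyone following along, such a 'mobile clone' timing helper might look roughly like the sketch below. The member names assume the Csla.Serialization.Mobile.MobileFormatter API in CSLA 3.x - verify them against the clone code in the Silverlight base classes for your version:

```csharp
// Hypothetical 'mobile clone' helper for the .NET server side: round-trip
// the graph through MobileFormatter and time each half separately.
// Assumes the CSLA 3.x MobileFormatter Serialize/Deserialize members.
using System;
using System.Diagnostics;
using System.IO;
using Csla.Serialization.Mobile;

public static class MobileCloner
{
  public static T CloneWithTiming<T>(T graph)
  {
    var formatter = new MobileFormatter();
    using (var buffer = new MemoryStream())
    {
      var watch = Stopwatch.StartNew();
      formatter.Serialize(buffer, graph);
      Console.WriteLine("MobileFormatter serialize: " + watch.Elapsed);

      buffer.Position = 0;
      watch.Reset();
      watch.Start();
      var clone = (T)formatter.Deserialize(buffer);
      Console.WriteLine("MobileFormatter deserialize: " + watch.Elapsed);
      return clone;
    }
  }
}
```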

 

Sorry about that.

 

Rocky

 

paupdb replied on Monday, August 31, 2009

Ah yes I used Clone() :P

I'll try out the MobileFormatter code asap.  I think this looks like something that 3.8 may fix - I'll try with that too.

paupdb replied on Monday, August 31, 2009

OK, haven't had time to copy over the MobileFormatter code, but I did try to run the same test case and diagnostics using the CSLA 3.8 Alpha release:

20 Rows
End documents fetch at: 12:09:51 PM
Start compress at: 12:09:51 PM
End compress at: 12:09:51 PM; elapsed time: 00:00:00.0625000
Start decompress at: 12:09:51 PM
End decompress at: 12:09:52 PM; elapsed time: 00:00:00.0781250
ModelChanged at: 12:09:52 PM; elapsed time: 00:00:00.8125000

100 Rows
End documents fetch at: 12:10:32 PM
Start compress at: 12:10:33 PM
End compress at: 12:10:33 PM; elapsed time: 00:00:00.2031250
Start decompress at: 12:10:33 PM
End decompress at: 12:10:33 PM; elapsed time: 00:00:00.2343750
ModelChanged at: 12:10:36 PM; elapsed time: 00:00:02.7031250

So the new serialisation code in 3.8 does seem to reduce the SL deserialisation time by around 0.5 secs (it took 3.28 secs on Csla 3.7 and now takes 2.7 secs) on a big object graph.
I think the actual ObjectData byte[] is a little smaller too, hence the minor improvements in the compress/decompress times.

Obviously it would still be good to reduce this further, so later this week I'll probably move into the Csla code and see if I can isolate the time-consuming areas.

paupdb replied on Monday, August 31, 2009

I've been tinkering with the MobileFormatter's Deserialize method to see if I can isolate the time consuming portions.
Basically on a 100 row object graph, the line below takes up 1.4 secs of the total 2.75 secs spent in the Deserialize method:
List<SerializationInfo> deserialized = dc.ReadObject(reader) as List<SerializationInfo>;

Given the size of the object graph (16MB), this isn't unexpected and there doesn't seem to be any kind of optimization possible.

I then looked through what goes on after the deserialized list is loaded and found that I could save 0.3 sec on the 100 row test if I cached the Types that are normally loaded over and over via reflection.

I added a static cache of the type names and Types:

    private static Dictionary<string, Type> _typeCache =
      new Dictionary<string, Type>();

    private static Type GetTypeFromCache(string typeName)
    {
      Type type;
      if (!_typeCache.TryGetValue(typeName, out type)) {
        type = Csla.Reflection.MethodCaller.GetType(typeName);

        if (type == null) {
          throw new SerializationException(string.Format(
            Resources.MobileFormatterUnableToDeserialize,
            typeName));
        }
        _typeCache.Add(typeName, type);
      }

      return type;
    }


And then changed the Deserialize(XmlReader reader) method to use the GetTypeFromCache method:
      _deserializationReferences = new Dictionary<int, IMobileObject>();
      foreach (SerializationInfo info in deserialized)
      {
        Type type = GetTypeFromCache(info.TypeName);

        if (type == typeof(NullPlaceholder))


Having a static Dictionary of types does mean more memory usage, but maybe the performance gain is worth it?

RockfordLhotka replied on Monday, August 31, 2009

Can we do this with an instance cache inside the MobileFormatter object – thus gaining most of the benefit, but without the permanent memory cost?

 

paupdb replied on Tuesday, September 01, 2009

I did originally have an instance cache and the performance gain was pretty similar, so yeah an instance cache is certainly viable too.

I went static because I liked the idea of persisting across multiple deserialization calls, thus avoiding the initial reflection cost after the first time a Type is ever encountered by the MobileFormatter.  A lot of the calls made within my application involve the same classes, with repetitive fetches - e.g. in the case of data paging.

I have also looked at whether or not a similar cache approach could be used to cache the type's ConstructorInfo so that Activator.CreateInstance can be dropped in favour of just invoking the cached ctor delegate. 
So far there has not been a noticeable improvement in performance; however, I believe there is some kind of internal caching in the Activator class for up to 16 unique types.  So it might just be that my tests are not putting enough different types through the deserialisation (and thus Activator) as yet.

RockfordLhotka replied on Tuesday, September 01, 2009

I wouldn’t spend a lot of time on the CreateInstance(), as Justin is working on an enhancement to MethodCaller that will do the work with an Expression – reflect once, compile and then execute for the rest.
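
For anyone curious, the Expression approach looks roughly like this - a sketch only, not Justin's actual MethodCaller change; FastActivator and its members are illustrative names:

```csharp
// Sketch of the "reflect once, compile, then execute" pattern: build a
// 'new T()' expression per type, compile it to a delegate, and cache it.
using System;
using System.Collections.Generic;
using System.Linq.Expressions;

public static class FastActivator
{
  private static readonly Dictionary<Type, Func<object>> _ctorCache =
    new Dictionary<Type, Func<object>>();

  public static object CreateInstance(Type type)
  {
    Func<object> ctor;
    if (!_ctorCache.TryGetValue(type, out ctor))
    {
      // Reflection and compilation cost is paid once per type; the
      // Convert handles value types being boxed to object.
      ctor = Expression.Lambda<Func<object>>(
        Expression.Convert(Expression.New(type), typeof(object))).Compile();
      _ctorCache[type] = ctor;
    }
    return ctor();
  }
}
```

Subsequent calls for the same type skip reflection entirely and just invoke the compiled delegate.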

 

I think in the general case the instance cache is the better answer. I understand what you are saying, but some applications have little-used types that’d end up cached – and in a large app that could be problematic.

 

RockfordLhotka replied on Monday, August 31, 2009

3.8 might help with the binary XML, but I don’t know how big a difference it will make.

 

Copyright (c) Marimer LLC