Use of GDAL and data updates.

Topics: Data Access, SharpMap v2.0
Developer
Aug 25, 2007 at 4:28 PM
The current builds rely on GDAL for processing a lot of the map data formats. I think the writers of GDAL have done a great job, but in some areas it lacks the ability to update the data (e.g. S-57 data updates cannot be installed, so that data becomes out of date very quickly), and also to render some formats; S-57 is a data interchange format, and the rendering of the charts is done with the S-52 specification. In the roadmap you have talked about the ability to apply data updates to the data. Since you are currently bound by the workings of the GDAL team, do you intend to replace GDAL with your own code? This would be a large undertaking, but it would mean that you would have more control over the direction of SharpMap.
Coordinator
Aug 25, 2007 at 5:23 PM
Cairn,

Updating in v2.0 will be data-provider specific. If a provider implements IWritableVectorLayerProvider, then it is saying that it supports updates. If it only supports ILayerProvider (raster) or IVectorLayerProvider (vector), then it is a read-only provider. This way, we are decoupled from the progress (or lack thereof) of the various providers we can use to get data. The ShapeFile provider, which comes with SharpMap, implements IWritableVectorLayerProvider, and so updates are possible. GDAL has some write capability, so we will be able to update in certain circumstances, and not in others, as you note.
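
To sketch the shape of that model (simplified - the real v2.0 interfaces carry more members than are shown here):

// Simplified sketch of the capability model; the real v2.0 interfaces
// declare more members than are shown here.
public interface ILayerProvider { }                     // basic (e.g. raster) data access

public interface IVectorLayerProvider : ILayerProvider
{
    // read-only vector access, e.g. GetExtents(), GetGeometriesInView(...)
}

public interface IWritableVectorLayerProvider : IVectorLayerProvider
{
    // update support, e.g. Insert(...), Update(...), Delete(...)
}

public static class ProviderCapabilities
{
    // A consumer can discover update support at runtime.
    public static bool SupportsUpdates(ILayerProvider provider)
    {
        return provider is IWritableVectorLayerProvider;
    }
}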

Since the provider model is flexible, an updatable provider for the S-57 format could be created without having to stop relying on GDAL for other formats. However, creating a provider can be a significant investment of time, and, with all the other commitments which SharpMap committers have, it is unlikely to come from one of them. This is where your effort could assist us. If you are interested in creating a provider for S-57 data, I'm sure you'd get assistance and guidance on this forum.
Developer
Aug 26, 2007 at 9:54 AM

I will give it a go! Where is the best location to get the most up-to-date source? I have looked at the Google Code store and the CodePlex bridge; from what I have seen, I think it may be the Google store, which seems to be the most up to date but only contains controls and no main applications to run. The CodePlex bridge source runs only after a bit of messing about!

I am also interested in using WPF with SharpMap, but the Google Code repository only seems to have an initial shell of code. Is this as far as it has got, or is there another location for the WPF version? The work I have done with WPF and OS MasterMap does not perform very well with 50,000+ objects!

Please let me know where to get the best version of code so that I can get going.

Thanks
Coordinator
Aug 26, 2007 at 5:29 PM
Great!

Since you are interested in doing updates, v2.0 is the way to go, and that is still located on Google code. Even though the UI components are still being completed, the data provider model in v2.0 is fairly well settled. I just added the SharpMap.Extensions sub-tree to the codebase last night, and most of them are data providers (the PostGIS provider was converted most carefully), so it should give you a good idea on how to begin. Since you are looking to read and write data directly to a file, the ShapeFile provider will take you further on how that is done, since it is a working implementation of IWritableVectorLayerProvider. It also has good unit test coverage which you might be able to copy to a large degree, so you wouldn't need to depend on a functioning UI to proceed.

As for the WPF implementation, you're right, it hasn't progressed very far yet. However, it shouldn't be too much work in the v2.0 architecture, since all the common rendering and presentation logic has been factored out. 50,000 objects is a lot - I'm not surprised it slowed down. I haven't been able to completely test how to resolve rendering for very large data sets, but my initial thoughts were to use System.Windows.Media.PathGeometry instances (since they are not UIElements, but rather inherit from Freezable) until a certain zoom level, and then switch to System.Windows.Shapes.Path to get UI events (which will be easy, since they use the same path). For some post-v2.0 release (something like v2.x), I want to abstract all vertex data - from the data source through the geoprocessing pipeline to the screen - into a buffer, much like DirectX does. Not only will geoprocessing computations be more efficient, but I can also compute adjacency information and simplify paths at certain zoom levels. This should allow handling of very large data sets with WPF's retained graphics model.
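
Roughly, I was imagining something along these lines (just a sketch - the zoom threshold and the drawing details here are made up):

using System.Windows.Controls;
using System.Windows.Media;
using System.Windows.Shapes;

// Illustrative sketch only: the threshold value and drawing details are invented.
public class SketchMapCanvas : Canvas
{
    private const double InteractiveZoomThreshold = 0.5;   // hypothetical value
    public double Zoom { get; set; }

    public void RenderFeature(PathGeometry geometry)
    {
        if (Zoom < InteractiveZoomThreshold)
        {
            // Zoomed far out: freeze the geometry and draw it directly in
            // OnRender - no UIElement overhead, no hit-testing or events.
            if (geometry.CanFreeze)
                geometry.Freeze();
            // e.g. drawingContext.DrawGeometry(null, strokePen, geometry);
        }
        else
        {
            // Zoomed in: wrap the same geometry in a Path so it participates
            // in hit-testing and raises UI events.
            Path path = new Path();
            path.Data = geometry;
            path.Stroke = Brushes.Black;
            Children.Add(path);
        }
    }
}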

Getting back to the provider - if you need any assistance, don't hesitate to post on these forums. Chances are you'll get the help you're looking for.
Developer
Aug 29, 2007 at 1:24 PM

Things are going OK, and I've got the framework in and building. While starting on the detail I have noted what may be a potential problem, with a possible solution to run by you. With S-57 all the data is held in the same file (data source), but ideally this should be split over a number of layers, and sometimes layers are not processed because of the detail involved.
Rather than have a large number of data sources on one file, I propose that in IVectorLayerProvider, methods such as GetExtents() and GetGeometriesInView() (seven methods in total would need additions) have additional overloads taking either a string or an int to specify a layer / geometry layer, and an int to specify the current scale.
e.g.:
GetExtents();
GetExtents(string strColumn);
GetExtents(int nLayerCode);
GetExtents(string strColumn, int scale);
GetExtents(int nLayerCode, int scale);

Although I am only looking at S-57 at the minute, I know this problem will arise again with GML and OSGB GML (MasterMap), where the data set files are 300 MB+; at least with S-57 there is a 5 MB limit on files.
Please let me know if you wish to go with this, or if you wish to limit S-57 to one layer, or to create lots of instances of the data source, setting the layer information when creating the data source (this does not solve the problem of viewing by scale).

Cairn.
Aug 29, 2007 at 9:36 PM
Sorry for interrupting...

The OGR library supports getting one layer out of an S-57 file, depending on environment variables, which I think is bad. The developer should be free to use whichever layer he/she wants.

I was looking at the potential of developing a pure .NET provider for S-57, and I think a better way to get over the problem you are reporting would be to choose the name of the layer (a set of data in S-57, actually) whose data one wants to access in the constructor of the provider instance - that way you wouldn't have to change all the interfaces, and by referring to an instance of the provider one would be referring to a set of data. You can use multiple instances in order to access different sets of data in the same file. That would also help you incorporate the data from the update files into the chosen dataset.
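
As a rough illustration (the class name and constructor here are invented, not an existing SharpMap type):

// Hypothetical sketch - "S57Provider" and its parameters are invented names.
public class S57Provider // would implement IVectorLayerProvider (or the writable variant)
{
    private readonly string _fileName;
    private readonly string _layerName;

    // The layer (set of data / S-57 object class) is fixed when the provider
    // is created, so the existing interface methods need no extra parameters.
    public S57Provider(string fileName, string layerName)
    {
        _fileName = fileName;
        _layerName = layerName;
    }
}

// Usage: one instance per set of data in the same exchange file, e.g.
//   S57Provider depthAreas = new S57Provider("GB4X0000.000", "DEPARE");
//   S57Provider buoys      = new S57Provider("GB4X0000.000", "BOYLAT");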


George J.
Developer
Aug 30, 2007 at 7:41 AM
Edited Aug 30, 2007 at 12:38 PM

George,

What I am doing at the minute is developing an extension provider for SharpMap v2.0 which will be a pure S-57 .NET provider. The reason I was asking the question is that S-57 has over 100 layers within its data set. Creating over 100 instances of the data provider, and then another 100 layers, seems a little over the top to me, as I believe that one instance of the data provider will suffice. The other point is that SharpMap does not take scales into account; S-57 specifies that some objects can only be displayed providing the scale meets certain limits. This proposal, I believe, would cater for both of these problems, and as the v2.0 framework is relatively new it would be easier to look at incorporating this now rather than later, while the UI presentation is still not complete.

Another reason for requesting this type of change is that I know the OSGB GML implementation has the same type of scale view rules; with a small data file being in excess of 300 MB with 80,000+ data objects, it seems a little silly to create a large number of provider instances on the same file.

Coordinators, please can you advise.


Cairn
Aug 31, 2007 at 4:09 AM
Cairn,

I totally agree with you about the richness of the data in S-57 files and the huge number of layers they include. On the other hand, the presentation of the data these files include is not an S-57 data provider problem – the presentation problem of the data as you describe it is an S-52 problem, and perhaps it concerns the ILayer class and the classes inheriting from it more than the data provider classes.

SharpMap has the means to make a layer visible only when the zoom is within a certain range of values. There are SharpMap examples that present labels of countries' capital cities and major cities, which demonstrate this feature. There is no need to implement the S-52 presentation mechanism in order to create layers from the S-57 file data.

I find your concerns about the performance of SharpMap when using a large number of objects logical, and I think the on-the-fly creation of specialized index files (create once / read always) – just as the Shapefile data provider does – is the way to go. Even C-Map's S-57/S-52 kernel creates interim files in order to render S-57 data files.


George J.
Developer
Aug 31, 2007 at 8:02 AM
George,

Just to make sure, are you looking at version 2.0 of SharpMap, where a number of changes have been made to the architecture?

I agree that S-57 is just the exchange format and S-52 is the presentation specification. The solution I am working on at the minute will generate a halfway house: when the file is requested it will check whether it is up to date by versioning against the collection of S-57 files, and if not it will regenerate the interim working file from the collection of S-57 exchange files. This interim file will contain all the data from all the layers.

The concern about performance is not based on using S-57 data, but on other data formats where, instead of a maximum size of 5 MB, even the small multi-layer data files are 300 MB+. So while I am doing an S-57 solution, I am also looking ahead at potential problems that may occur with other file formats which I am working with on another project.

Cairn.
Coordinator
Sep 9, 2007 at 10:07 AM
Hey Cairn and George -

Sorry for not being attentive to this thread sooner...

Thanks for hashing out these issues. Let's see if we can break them down a bit.

Cairn: The first problem you note is that certain data sets (like the S57 set, but I'm more familiar with ESRI personal geodatabases and ArcInfo coverage files / directories) have multiple layers. If I understand correctly, you think it will be a performance hit to have, say, 100 instances of a provider open on this dataset, which would currently be needed with the one-to-one provider instance to layer design. My guess is that it won't be too much of an issue. The provider object typically has a rather small runtime footprint, and having 100 handles to a file isn't too much of an issue, especially if something like a memory-mapped file technique is used. For databases, this could be more of an issue, but with connection pooling, it is probably not too overburdening to have 100 connection objects, either. The benefit is a much simpler model to program with. If you have multiple layers of data in a data set, just create one layer for each one you want, passing the layer info in through the provider connection string. Unless we can produce some convincing usability argument or get some measurable performance data, I'm sort of hesitant to do what appears to be a premature optimization. Rule #1 of optimization: Don't.

The next issue appears to be whether layer information is queried based on a particular zoom scale. Since this is set per layer, the one-layer-object-to-one-data-layer design comes into play again, where you can set the minimum and maximum zoom at which the layer data is visible. Perhaps I'm missing how this can't be used to skip reading data at too far a zoom level, and thus avoid having 300+ MB of data forced into RAM (which is what I think your legitimate concern is).
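
As a sketch of that pattern in the familiar v0.9-style API (the file paths are made up, and v2.0 names may differ; an S-57 provider instance, taking its layer name from the connection string, would slot in where the ShapeFile provider is used here):

// Sketch using the v0.9-style layer API (v2.0 names may differ). One layer
// object per data layer; MinVisible/MaxVisible keep the data from being read
// or rendered outside the layer's visible range.
SharpMap.Map map = new SharpMap.Map(new System.Drawing.Size(640, 480));

SharpMap.Layers.VectorLayer depthAreas = new SharpMap.Layers.VectorLayer("Depth areas");
depthAreas.DataSource = new SharpMap.Data.Providers.ShapeFile(@"C:\data\depare.shp");
depthAreas.MaxVisible = 50000;        // skipped entirely when zoomed out beyond this

SharpMap.Layers.VectorLayer landAreas = new SharpMap.Layers.VectorLayer("Land areas");
landAreas.DataSource = new SharpMap.Data.Providers.ShapeFile(@"C:\data\lndare.shp");
landAreas.MaxVisible = double.MaxValue;

map.Layers.Add(depthAreas);
map.Layers.Add(landAreas);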
Developer
Sep 9, 2007 at 4:47 PM

Thanks for getting back; I see you have been busy on v2, judging by the number of changed files.

Going to the first issue, there are a number of different data sets which contain all the data for multiple layers in one file (yes, S-57, but also GML, with implementations like the UK OS implementation called MasterMap). S-57 is not much of a problem in that files are limited to less than 5 MB, but MasterMap has no size limit. Having to create an instance of the data provider for each layer in MasterMap equates to 40+ data providers, each with an XML DOM reading a 300+ MB file. These figures are small in terms of file size, because as you start to look at the latest GML standards, which also contain 3D data, files are moving towards the GB region. There is a format standard called CityGML, which is a full 3D map of cities; those data sets are enormous compared to the ones you are currently looking at. This is why I raise the concern; and if the application is being used on a PDA, it is going to be a non-starter, as performance on those devices is not the best and they don't have anywhere near the memory footprint of your PC.

For the second point, when the presentation layer requests the geometries for an extent from a data provider, the provider should know, when returning the list of geometries, to include only those that match the scale rule for the type of data given. Some mapping formats have rules or guidance that an object (not necessarily the layer) is only displayed providing the scale is greater than a predefined limit. This means that on some layers only a selection of the layer's objects should be displayed, even though they all belong to the same layer, because some objects are not big enough to warrant display.
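
Something along these lines is what I mean (a purely hypothetical sketch - none of these names exist in SharpMap, and I'm treating scale as the denominator of the representative fraction, e.g. 50000 for 1:50000):

using System.Collections.Generic;

// Hypothetical sketch of per-feature scale filtering inside a provider's
// view query; scale values are representative-fraction denominators.
public class ScaledFeature
{
    public object Geometry;                 // the feature geometry
    public double MinimumScaleDenominator;  // e.g. taken from an attribute like S-57's SCAMIN
}

public static class ScaleFilter
{
    // Keep only the features whose rule allows display at the requested scale:
    // a feature is shown when the display scale is at least as large as its
    // minimum (i.e. the display denominator is no greater than the limit).
    public static IEnumerable<ScaledFeature> ApplyScaleRules(
        IEnumerable<ScaledFeature> featuresInView, double displayScaleDenominator)
    {
        foreach (ScaledFeature feature in featuresInView)
        {
            if (displayScaleDenominator <= feature.MinimumScaleDenominator)
                yield return feature;
        }
    }
}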

I can work around this in the S-57 provider, but experience has taught me to consider these problems with the larger data sets now, as putting it off has resulted in going back and redoing lots of work before. While v2 is relatively new it is easier to resolve design issues, but as more people pick up v2 it will become harder and changes will break more things; if it is found that these changes are needed close to the release of v2, they will break a lot of other people's work. I am currently considering writing a MasterMap data provider to fit into SharpMap v2 and don't want to get stuck due to performance problems, hence trying with S-57 first.


Cairn.


Coordinator
Sep 9, 2007 at 6:06 PM
Hey Cairn -

Thanks for resuming the thread so robustly. I'm glad there are smart people working on this problem.

One of the things that a provider author has to do is to make sure that it can access the data without putting too much of a burden on local resources. I've got a > 1GB shapefile I'm working with here, and my goal is to get the ShapeFile provider to read and write to it quickly and efficiently. This means that I have to keep most of the data out of RAM and on the disk, and only work with the metadata of the dataset, such as the extents and a spatial index. If I have 100 shapefiles, all > 1GB, SharpMap should be able to handle it, I feel. It's really up to the provider author to make sure that data is handled efficiently. Of course, if a user does something like add a bunch of multi-MB datasets and sets the MaxVisible to some large number so it can be completely seen, well... there isn't much that can be done (although I suspect there is more than we are currently doing).

After further reflection, though (I'm glad you got me thinking about it), it would seem that there are some metadata structures, like the spatial index especially, which would benefit from being shared across provider instances. There are two ways to do this, it seems to me. The first is to make it the responsibility of the provider to maintain shared state at the type level. This could be implemented with a static instance of the spatial index, for example. The second way is to implement a provider factory model, much like ADO.Net v2 provides, except also provide a set of metadata services to the provider instances which are created. This could be used to pass in the spatial index and coordinate transform (in case the data needs reprojection) when the instance is constructed. I don't have a good grasp on the trade-offs in these approaches, so your insights are valued.
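
A very rough sketch of the factory idea (none of these types exist in SharpMap - they only show how an index and a transform could be shared by every provider instance opened on the same file):

using System.Collections.Generic;

// Hypothetical sketch of the provider factory approach.
public interface ISpatialIndex { /* query / insert members */ }
public interface ICoordinateTransform { /* reprojection members */ }

public class ProviderMetadata
{
    public ISpatialIndex SpatialIndex;
    public ICoordinateTransform Transform;
}

public class ProviderFactory
{
    // One shared metadata bundle per data file, built on first request.
    private readonly Dictionary<string, ProviderMetadata> _metadataByFile =
        new Dictionary<string, ProviderMetadata>();

    public ProviderMetadata GetSharedMetadata(string fileName)
    {
        ProviderMetadata metadata;
        if (!_metadataByFile.TryGetValue(fileName, out metadata))
        {
            // Build the spatial index and transform once; every provider
            // instance created for this file afterwards reuses them.
            metadata = new ProviderMetadata();
            _metadataByFile[fileName] = metadata;
        }
        return metadata;
    }
}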

On the second point, I see the problem, now. There is a per-feature scale threshold in certain datasets, whereas SharpMap currently supports only per-layer. This is an interesting problem which I want to make sure we consider more carefully (in terms of provider-specific or provider-general filtering)... If you could make a Work Item for it, I'll assign it to v2.0 Beta 2 and we can work out the details for that release in the context of that Work Item. I've promised everyone that the interface will change between Beta 1 and Beta 2, so it shouldn't be a shocker... especially since most folks aren't writing providers.
Developer
Sep 9, 2007 at 6:59 PM

I will think carefully about the wording and raise the work item in the next few days.

The other thing to remember is that not all data formats have a spatial index or the like; some are purely a set of unordered data contained within a file for the program to sort out, so a reliance on indexing systems cannot be assumed when considering all data providers. I think we need to do some more work on considering the use of one data provider being used by many instances of a layer.

Cairn.
Coordinator
Sep 9, 2007 at 7:20 PM
You're right - most datasets don't have a spatial index. Shapefiles, for example, usually don't, and if they do, they are in a proprietary format which ESRI hasn't published. SharpMap uses a spatial index internally, and we can build it for any dataset, given a set of features and each feature's envelope. You could use the SharpMap.Indexing.RTree.DynamicRTree index for the S57 provider, since you are making it writeable, and that index supports updates.
Sep 9, 2007 at 8:17 PM
Edited Sep 9, 2007 at 8:17 PM
A >1GB shapefile??!?? Even ESRI heavily discourages the use of shapefiles on larger datasets - actually they more or less discourage the use of shapefiles (except for data exchange), especially if it's data that changes. I think in your case you would be much better off using a clean database approach like MsSqlSpatial.
Sep 9, 2007 at 8:35 PM

Odegaard wrote:
A >1GB shapefile??!?? Even ESRI heavily discourages the use of shapefiles on larger datasets - actually they more or less discourage the use of shapefiles (except for data exchange), especially if it's data that changes. I think in your case you would be much better off using a clean database approach like MsSqlSpatial.


This is an exaggeration - I think the point codekaizen wants to make is, first you write a provider, then you fine tune it. In order to fine tune the shapefile provider, he has a large set of data. This is not a working dataset, it is a testing dataset... ;)


George J.
Coordinator
Sep 9, 2007 at 8:46 PM
Haha, yea... I'm not saying I'd want to use the shapefile... just test with it. It's amazing how profiling SharpMap with a 1GB shapefile allows performance bottlenecks to clearly show up.

I agree, generally, with ESRI - geodatabases (esp. thanks to MsSqlSpatial and PostGIS) are so accessible and perform so well, that it really is better, overall, to go with one.
Developer
Sep 10, 2007 at 1:57 PM
That sounds fine, but how does the R-tree cope with multiple layers within the same file? From the quick check I have just carried out, the tree system appears to cope solely with a single-layer type of system, so the data provider would then have to filter the response from the tree search down to only the items of the required layer, which could result in zero returns after quite a large search.
The next problem is that if the data is to remain in its proprietary format, how do you place an index on a tree item within a very large XML file and then jump to it without having to parse the whole file first? Or, in the case of S-57, the file is split into three distinct parts: file descriptions, feature data objects, and the spatial objects they reference, which contain the positions of the feature data. The link between the two is a reference ID, but this does not relate to a file position or offset, as one spatial object may serve many feature data objects.

The only way to resolve some of these problems is to convert the input data into an internal format and use that, but even then, if the internal format is still a multi-layer file, how does the indexer work?
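
One purely illustrative idea (not an existing SharpMap structure) would be to keep one small index per layer inside the interim file, keyed by layer code, so a search only ever touches the layer it is meant for:

using System.Collections.Generic;

// Illustrative only - a collection of per-layer indexes keyed by layer code,
// so a spatial search never has to be filtered down after the fact.
public class MultiLayerIndex<TIndex> where TIndex : new()
{
    private readonly Dictionary<int, TIndex> _indexByLayerCode =
        new Dictionary<int, TIndex>();

    // Returns (creating on demand) the index for a given layer code,
    // e.g. the S-57 object class of the features it covers.
    public TIndex ForLayer(int layerCode)
    {
        TIndex index;
        if (!_indexByLayerCode.TryGetValue(layerCode, out index))
        {
            index = new TIndex();
            _indexByLayerCode[layerCode] = index;
        }
        return index;
    }
}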

I think we need to spend a little time thinking about a good way to go forward.

Cairn
Sep 10, 2007 at 2:18 PM
For my two cents, I would really like to see a provider that can be used with multiple layers. We have done things like this in our COM wrappers around another tool: we can create a provider that knows about a related collection of shapefiles - some are lines, some are points, etc. - and then we can extract specific layers from this provider. For instance, we can get a layer of irregularly shaped polygons, and a layer of related pretty label locations stored in a different but related shapefile, and the provider manages all of this. I realize this is not quite the same situation being discussed here, but it is still related.

I have a related question that I'll start in a new thread, in order to avoid hijacking this one.