gridfit: estimates a surface on a 2d grid, based on scattered data
Replicates are allowed. All methods extrapolate to the grid boundaries.

Retrieved by Thijs Damsma from the Matlab File Exchange on 12 Sep 2010:
http://www.mathworks.de/matlabcentral/fileexchange/8998

Gridfit uses a modified ridge estimator to generate the surface, where
the bias is toward smoothness. Gridfit is not an interpolant. Its goal
is a smooth surface that approximates your data, but allows you to
control the amount of smoothing.

usage #1: zgrid = gridfit(x,y,z,xnodes,ynodes);
usage #2: [zgrid,xgrid,ygrid] = gridfit(x,y,z,xnodes,ynodes);
usage #3: zgrid = gridfit(x,y,z,xnodes,ynodes,prop,val,prop,val,...);

Arguments: (input)
 x,y,z - vectors of equal lengths, containing arbitrary scattered data.
   The only constraint on x and y is they cannot ALL fall on a single
   line in the x-y plane. Replicate points will be treated in a least
   squares sense. ANY points containing a NaN are ignored in the
   estimation.

 xnodes - vector defining the nodes in the grid in the independent
   variable (x). xnodes need not be equally spaced. xnodes must
   completely span the data. If they do not, then the 'extend' property
   is applied, adjusting the first and last nodes to be extended as
   necessary. See below for a complete description of the 'extend'
   property. If xnodes is a scalar integer, then it specifies the
   number of equally spaced nodes between the min and max of the data.

 ynodes - vector defining the nodes in the grid in the independent
   variable (y). ynodes need not be equally spaced. If ynodes is a
   scalar integer, then it specifies the number of equally spaced nodes
   between the min and max of the data. Also see the 'extend' property.

Additional arguments follow in the form of property/value pairs.
Valid properties are:
 'smoothness', 'interp', 'regularizer', 'solver', 'maxiter',
 'extend', 'tilesize', 'overlap', 'autoscale'
Any UNAMBIGUOUS shortening (even down to a single letter) is valid for
property names.
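As a quick sketch of usage #2 with scalar node counts (hypothetical data, made up for illustration only):

```matlab
% Sketch: fit a smooth surface to noisy scattered samples.
% Passing scalar integers for xnodes/ynodes asks gridfit to place that
% many equally spaced nodes between the min and max of the data.
x = rand(500,1);
y = rand(500,1);
z = sin(2*pi*x).*cos(pi*y) + 0.05*randn(500,1);  % noisy test data

% 30x40 grid of nodes, spanning the data in each direction
[zgrid,xgrid,ygrid] = gridfit(x,y,z,30,40);

surf(xgrid,ygrid,zgrid)   % visualize the fitted surface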
All properties have default values, chosen (I hope) to give a
reasonable result out of the box.

 'smoothness' - scalar or vector of length 2 - determines the eventual
   smoothness of the estimated surface. A larger value here means the
   surface will be smoother. Smoothness must be a non-negative real
   number. If this parameter is a vector of length 2, then it defines
   the relative smoothing to be associated with the x and y variables.
   This allows the user to apply a different amount of smoothing in the
   x dimension compared to the y dimension.

   Note: the problem is normalized in advance so that a smoothness of 1
   MAY generate reasonable results. If you find the result is too
   smooth, then use a smaller value for this parameter. Likewise, bumpy
   surfaces suggest use of a larger value. (Sometimes, use of an
   iterative solver with too small a limit on the maximum number of
   iterations will result in non-convergence.)

   DEFAULT: 1

 'interp' - character, denotes the interpolation scheme used to
   interpolate the data.

   DEFAULT: 'triangle'

   'bilinear' - use bilinear interpolation within the grid (also known
     as tensor product linear interpolation)

   'triangle' - split each cell in the grid into a triangle, then use
     linear interpolation inside each triangle

   'nearest' - nearest neighbor interpolation. This will rarely be a
     good choice, but I included it as an option for completeness.

 'regularizer' - character flag, denotes the regularization paradigm
   to be used. There are currently three options.

   DEFAULT: 'gradient'

   'diffusion' or 'laplacian' - uses a finite difference approximation
     to the Laplacian operator (i.e, del^2). We can think of the
     surface as a plate, wherein the bending rigidity of the plate is
     specified by the user as a number relative to the importance of
     fidelity to the data. A stiffer plate will result in a smoother
     surface overall, but fit the data less well. I've modeled a simple
     plate using the Laplacian, del^2.
     (A projected enhancement is to do a better job with the plate
     equations.) We can also view the regularizer as a diffusion
     problem, where the relative thermal conductivity is supplied. Here
     interpolation is seen as a problem of finding the steady
     temperature profile in an object, given a set of points held at a
     fixed temperature. Extrapolation will be linear. Both paradigms
     are appropriate for a Laplacian regularizer.

   'gradient' - attempts to ensure the gradient is as smooth as
     possible everywhere. It is subtly different from the 'diffusion'
     option, in that here the directional derivatives are biased to be
     smooth across cell boundaries in the grid. The gradient option
     uncouples the terms in the Laplacian. Think of it as two coupled
     PDEs instead of one PDE. Why are they different at all? The terms
     in the Laplacian can balance each other.

   'springs' - uses a spring model connecting nodes to each other, as
     well as connecting data points to the nodes in the grid. This
     choice will cause any extrapolation to be as constant as possible.
     Here the smoothing parameter is the relative stiffness of the
     springs connecting the nodes to each other compared to the
     stiffness of a spring connecting the lattice to each data point.
     Since all springs have a rest length (length at which the spring
     has zero potential energy) of zero, any extrapolation will be
     minimized.

   Note: The 'springs' regularizer tends to drag the surface towards
     the mean of all the data, so too large a smoothing parameter may
     be a problem.

 'solver' - character flag - denotes the solver used for the resulting
   linear system. Different solvers will have different solution times
   depending upon the specific problem to be solved. Up to a certain
   size grid, the direct \ solver will often be speedy, until memory
   swaps cause problems.

   What solver should you use? Problems with a significant amount of
   extrapolation should avoid lsqr. \ may be best numerically for
   small smoothness parameters and high extents of extrapolation.
   Large numbers of points will slow down the direct \, but when
   applied to the normal equations, \ can be quite fast. Since the
   equations generated by these methods will tend to be well
   conditioned, the normal equations are not a bad choice of method to
   use. Beware when a small smoothing parameter is used, since this
   will make the equations less well conditioned.

   DEFAULT: 'normal'

   '\' - uses matlab's backslash operator to solve the sparse system.
     'backslash' is an alternate name.

   'symmlq' - uses matlab's iterative symmlq solver

   'lsqr' - uses matlab's iterative lsqr solver

   'normal' - uses \ to solve the normal equations.

 'maxiter' - only applies to iterative solvers - defines the maximum
   number of iterations for an iterative solver

   DEFAULT: min(10000,length(xnodes)*length(ynodes))

 'extend' - character flag - controls whether the first and last nodes
   in each dimension are allowed to be adjusted to bound the data, and
   whether the user will be warned if this was deemed necessary.

   DEFAULT: 'warning'

   'warning' - adjust the first and/or last node in x or y if the
     nodes do not FULLY contain the data. Issue a warning message to
     this effect, telling the amount of adjustment applied.

   'never' - issue an error message when the nodes do not absolutely
     contain the data.

   'always' - automatically adjust the first and last nodes in each
     dimension if necessary. No warning is given when this option is
     set.

 'tilesize' - grids which are simply too large to solve for in one
   single estimation step can be built as a set of tiles. For example,
   a 1000x1000 grid will require the estimation of 1e6 unknowns. This
   is likely to require more memory (and time) than you have
   available. But if your data is dense enough, then you can model it
   locally using smaller tiles of the grid.

   My recommendation for a reasonable tilesize is roughly 100 to 200.
   Tiles of this size take only a few seconds to solve normally, so
   the entire grid can be modeled in a finite amount of time.
   The minimum tilesize can never be less than 3, although even this
   size tile is so small as to be ridiculous. If your data is so
   sparse that some tiles contain insufficient data to model, then
   those tiles will be left as NaNs.

   DEFAULT: inf

 'overlap' - Tiles in a grid have some overlap, so they can minimize
   any problems along the edge of a tile. In this overlapped region,
   the grid is built using a bi-linear combination of the overlapping
   tiles. The overlap is specified as a fraction of the tile size, so
   an overlap of 0.20 means there will be a 20% overlap of successive
   tiles. I do allow a zero overlap, but it must be no more than 1/2.

   0 <= overlap <= 0.5

   Overlap is ignored if the tilesize is greater than the number of
   nodes in both directions.

   DEFAULT: 0.20

 'autoscale' - Some data may have widely different scales on the
   respective x and y axes. If this happens, then the regularization
   may experience difficulties.

   autoscale = 'on' will cause gridfit to scale the x and y node
   intervals to a unit length. This should improve the regularization
   procedure. The scaling is purely internal.

   autoscale = 'off' will disable automatic scaling.

   DEFAULT: 'on'

Arguments: (output)
 zgrid - (nx,ny) array containing the fitted surface

 xgrid, ygrid - as returned by meshgrid(xnodes,ynodes)

Speed considerations:
 Remember that gridfit must solve a LARGE system of linear equations.
 There will be as many unknowns as the total number of nodes in the
 final lattice. While these equations may be sparse, solving a system
 of 10000 equations may take a second or so. Very large problems may
 benefit from the iterative solvers or from tiling.

Example usage:

 x = rand(100,1);
 y = rand(100,1);
 z = exp(x+2*y);
 xnodes = 0:.1:1;
 ynodes = 0:.1:1;

 g = gridfit(x,y,z,xnodes,ynodes);

Note: this is equivalent to the following call:

 g = gridfit(x,y,z,xnodes,ynodes, ...
    'smooth',1, ...
    'interp','triangle', ...
    'solver','normal', ...
    'regularizer','gradient', ...
    'extend','warning', ...
    'tilesize',inf);

Author: John D'Errico
e-mail address: woodchips@rochester.rr.com
Release: 2.0
Release date: 5/23/06
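To illustrate several of the properties described above in one call, here is a hedged sketch (the data and every property value are made up for illustration, not recommendations):

```matlab
% Sketch: combining property/value pairs. Smoothing is 10x stronger
% along y than x, the surface is built from overlapping tiles rather
% than one large solve, and extrapolation is kept as flat as possible
% via the 'springs' regularizer.
x = rand(2000,1);
y = rand(2000,1);
z = exp(x + 2*y) + 0.1*randn(2000,1);   % noisy test data

zgrid = gridfit(x,y,z,150,150, ...
   'smoothness',[1 10], ...      % relative smoothing in x and y
   'regularizer','springs', ...  % extrapolation as constant as possible
   'solver','normal', ...        % \ applied to the normal equations
   'tilesize',120, ...           % solve 120x120 tiles, not the full grid
   'overlap',0.25);              % 25% overlap between successive tiles
```

Recall that any unambiguous shortening of a property name is accepted, so 'smooth',[1 10] would work equally well here.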