MapWindow Developer Team : MapWindow Discussion Forum


Jenks natural breaks

Posted by:

**Sergei** | Date: June 15, 2010 04:03PM

Hi,

I'm working on Jenks natural breaks classification and haven't succeeded in implementing it so far.

Here are several references:

[en.wikipedia.org]

[resources.arcgis.com]

[www.terraseer.com]

I use the formula from the last link (at the bottom) and shift values from the class with the largest sum of squared deviations from the average (SSD) to the class with the smallest SSD. As a result I managed to get the SSD almost equal in each class, but this doesn't ensure the global minimum across all classes. Correspondingly, my breaks are noticeably different from ESRI's, so some of my assumptions must be wrong.

**Questions**

1. (easy one) How do I calculate the sum of squared deviations between classes (links 1, 2)? I don't quite understand what it is, perhaps because of difficulties with translation.

2. Maybe someone who is acquainted with the calculation can give advice on criteria for choosing which values to shift?
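The per-class SSD Sergei is balancing can be computed directly. A minimal Python sketch (function and variable names are illustrative, not from MapWindow):

```python
def class_ssd(values):
    """Sum of squared deviations of one class's values about the class mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def partition_ssds(sorted_values, break_indices):
    """SSD of every class in a partition of the sorted array; the classes
    with the largest and smallest SSD are the shift candidates."""
    bounds = [0] + list(break_indices) + [len(sorted_values)]
    return [class_ssd(sorted_values[a:b]) for a, b in zip(bounds, bounds[1:])]

print(partition_ssds([1, 2, 3, 10, 11, 12, 50, 51], [3, 6]))  # → [2.0, 2.0, 0.5]
```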

Thanks,

Sergei


**Re: Jenks natural breaks**

Posted by:

**gngdowid** | Date: June 16, 2010 03:05AM

Sergei,

don't just look at the references you are giving here. Think outside the square.

Ask yourself what creates natural breaks in datasets about natural phenomena (attributes in an attribute table).

Take for example a large area of forested landscape for which average tree heights have been recorded as an attribute for every grid cell (say 100 m by 100 m cell size).

The forested area contains various tree species, each of which may dominate a number of grid cells.

Different species will have their own typical tree height. That means each grid cell will show an attribute value close to the average tree height of the species which dominates the cell. (Of course, you don't know which species is dominant when you start to analyse the data set.)

Looking at each tree species in turn (if you knew the grid cells concerned - which of course you don't), you would find that the heights behave like a normally distributed data set centred on the average species tree height, with a spread on either side (measurable by the standard deviation of this data).

Now return to the entire-area data set, a compound of various species. This data set is effectively a compound of the normal distributions of the tree heights of each species found in the area. If you create a frequency distribution histogram for the heights in the entire area, you will see a curve with rounded peaks separated by rounded valleys (that's the theory; the practice may not be that clear). (See [en.wikipedia.org], 1st graph on the top right.)

The locations of these valleys are the locations (tree heights) of the natural breaks you are after.

So, create a frequency distribution histogram table of occurrence counts of tree heights, ordered by tree height, and search for these valleys.

Your 3rd reference says on the right: "Mean_i..j is the mean of the class bounded by i and j." Here "i" and "j" are the positions of the valleys to the left and the right of a peak in the frequency distribution table.
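The valley search described above can be sketched as follows (the bin width and the strict-minimum test are illustrative choices; noisy real data may need smoothing first):

```python
from collections import Counter

def histogram_valleys(values, bin_width):
    """Bin the attribute values, then flag bins that are strict local
    minima of the frequency curve (the 'valleys')."""
    bins = Counter(int(v // bin_width) for v in values)
    lo, hi = min(bins), max(bins)
    freq = [bins.get(i, 0) for i in range(lo, hi + 1)]
    valleys = []
    for i in range(1, len(freq) - 1):
        if freq[i] < freq[i - 1] and freq[i] < freq[i + 1]:
            valleys.append((lo + i) * bin_width)  # left edge of the valley bin
    return valleys

# Two clusters of heights around 1-3 and 5-6; the valley falls at bin 4.
print(histogram_valleys([1, 1, 1, 2, 2, 3, 5, 5, 6, 6, 6], 1))  # → [4]
```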

Hope this helps.

Cheers, Gerd


**Re: Jenks natural breaks**

Posted by:

**Sergei** | Date: June 16, 2010 10:07AM

Gerd,

thanks for your answer. I tried to think 'outside the square' and I guess the formula for the 'sum of squared deviations between classes' is:

**SDBC = sum (frequency of class * (class mean - array mean)^2)**

I checked it on an easy example and indeed:

SDBC = (sum of squared deviations from the array mean) - (sum of squared deviations of individual observations from their class means, over all classes).
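This identity (total SSD about the array mean = within-class SSD + SDBC) is easy to confirm numerically. A quick sketch with an arbitrary two-class partition:

```python
def ssd(values):
    """Sum of squared deviations about the mean of the values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
classes = [data[:3], data[3:]]
array_mean = sum(data) / len(data)

# SDBC = sum over classes of (class size * (class mean - array mean)^2)
sdbc = sum(len(c) * ((sum(c) / len(c)) - array_mean) ** 2 for c in classes)

total = ssd(data)                  # deviations about the array mean
within = sum(ssd(c) for c in classes)  # deviations about the class means
print(abs(total - (within + sdbc)) < 1e-9)  # → True
```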

Then wikipedia suggests:

'After inspecting each of the SDBC, a decision is made to move one unit from the class with the largest SDBC toward the class with the lowest SDBC.'

But I don't see the sense in it. It seems obvious that the SDBC will be largest for the classes at the tails of the distribution and smallest at its centre, so such a shift criterion will simply move all values into the central class. Am I missing anything?

Thanks,

Sergei

Edited 1 time(s). Last edit at 06/16/2010 05:46PM by Sergei.


**Re: Jenks natural breaks**

Posted by:

**gngdowid** | Date: June 16, 2010 03:27PM

Sergei,

yes, we are on the same wave length here.

Formulae like

**SDBC = sum (frequency of class * (class mean - array mean)^2)**

are all fine in theory. The practice in most cases of combined distributions looks a bit more difficult. In the graph example from [en.wikipedia.org] (1st graph on the top right) the separate normal distributions of the various classes are clearly represented. But when you start with a data set to be classified, you don't know the separate distributions; all you have is the summation of the number of occurrences (imagine you sum up the vertical heights of the curves in the graph and draw a new wiggly curve for the sums).

The orange, red and blue curves have the same mean but different spreads and will show in the summation graph as a single peak - no chance to separate these three classes from each other. The green distribution will create a separate peak in the summation graph - there is a chance to separate that one from the rest, provided you can identify the boundary values (i and j) for the left and the right boundary. The math you are looking at (SDBC, for example) is a statistical attempt to find these using least-squares theory. In most cases this approach is very slow unless you can provide the algorithm with some reasonable start values for i and j and use the math to improve on these.

If you follow my suggestion of creating an ordered frequency distribution table, you can search for locations where the tangent to small sections of the curve (inside a small moving window over the table entries) is horizontal and the curve itself bends upward on either side. That will give you such an approximation for either i or j. If you also search for the horizontal tangents where the curve bends downward on either side, you find an approximation for one of the species distributions' means.

With this you can then calculate things like the SDBCs and the shifts, maximising the SDBCs and minimising the SDICs (squared deviations inside classes) for all classes step by step.

In practice I doubt there will be much improvement needed after you have found the approximations for the boundaries (i, j) using my suggestion. If you cannot see peaks and valleys in the compound distribution curve, then the data will not allow a clear separation (classification) of the separate species distributions in the first place, and you would have to use additional information (e.g. another attribute) to achieve separation. That would lead to multidimensional classification approaches.

Good luck,

Gerd


**Re: Jenks natural breaks**

Posted by:

**Sergei** | Date: June 16, 2010 05:45PM

Gerd,

analyzing the frequency distribution as you suggest is quite an intuitive thing. It's entirely possible to build the frequency function and find its extrema. No doubt it's a good approach for analysing a distribution with unknown properties, when the number of breaks isn't specified beforehand (a mixture of tree species with different mean heights).

But here the task is somewhat different. I work with symbology: a user is going to build a color scheme, so the number of breaks is specified externally (by the user). It's quite fast and easy to get an approximation of the break positions (by making the in-class SSD as close to equal across all classes as possible). The question is how to achieve the global optimum, i.e. the minimum total SSD across all classes. I don't see any criterion for shifting values from one class to another other than checking the actual change of SSD after each move and cancelling wrong moves. I don't like that approach because, first, the implementation isn't easy, and second, there is no guarantee against falling into a local optimum.

Does anybody know how to achieve this?
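For what it's worth, with the number of classes fixed the global optimum can be found without heuristic shifts at all: the Fisher-Jenks dynamic-programming formulation guarantees the minimal total SSD. A minimal Python sketch (not MapWindow code; a naive O(k*n^2) version using prefix sums):

```python
def jenks_breaks(values, k):
    """Globally optimal 1-D classification into k classes
    (Fisher-Jenks dynamic programming). Returns the sorted-array
    indices where each class after the first starts, plus the
    minimal total within-class sum of squared deviations."""
    data = sorted(values)
    n = len(data)

    # Prefix sums give each segment's SSD in O(1).
    s = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(data):
        s[i + 1] = s[i] + v
        s2[i + 1] = s2[i] + v * v

    def seg_ssd(i, j):  # SSD of data[i:j] about its own mean
        return s2[j] - s2[i] - (s[j] - s[i]) ** 2 / (j - i)

    INF = float("inf")
    # cost[c][j]: minimal SSD when data[:j] is split into c classes
    cost = [[INF] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = cost[c - 1][i] + seg_ssd(i, j)
                if cand < cost[c][j]:
                    cost[c][j] = cand
                    back[c][j] = i

    # Walk the back-pointers to recover where each class starts.
    breaks, j = [], n
    for c in range(k, 0, -1):
        j = back[c][j]
        breaks.append(j)
    return sorted(breaks)[1:], cost[k][n]

b, ssd_total = jenks_breaks([1, 2, 3, 10, 11, 12, 50, 51], 3)
print(b, ssd_total)  # → [3, 6] 4.5
```

Because every split of every prefix is considered, no local-optimum trap is possible; the published R and Python implementations are, as far as I know, variants of this scheme.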

Thanks,

Sergei


**Re: Jenks natural breaks**

Posted by:

**pmeems** | Date: June 17, 2010 02:42AM

Sergei,

I'm not very good at statistics, but I found some more info about quantiles and Fisher-Jenks natural breaks: Applied-Spatial-Data-Analysis-With-R, page 77 (88)

Color schemes

Colorbrewer: Color Advice for Maps

A ColorTool for ArcMap is available based on ColorBrewer. I've contacted them and asked if their source code is available so we can convert it to a MW plug-in or use it in the OCX.

More about coloring:

ScaleMaster is a structured diagram for organizing multi-scale mapping using multiple databases and design, selection, and generalization decisions.

Hope it helps.

Paul

--

Don't forget to read the new documentation: www.mapwindow.org/documentation/mapwingis4.8

Join us on Google+: MapWindow GIS Google+ Community

Join the MapWindow Group on LinkedIn! LinkedIn - MapWindow Group

Download the latest beta installer at:

tinyurl.com/mwMonthly 32-Bit

tinyurl.com/mwMonthlyx64 64-Bit

Follow me on Twitter MapWindow_nl to read when a new installer is published.

---

Paul Meems

The Netherlands

[www.bontepaarden.nl]

Release manager, configuration manager and

forum moderator of MapWindow GIS

Owner of MapWindow.nl - Support for

Dutch speaking users: www.mapwindow.nl

*******

Everything I say or write is my personal opinion and

not the opinion of the company I work for.

*******

View my profile on LinkedIn


**Re: Jenks natural breaks**

Posted by:

**Sergei** | Date: June 17, 2010 05:51AM

Paul,

thanks for the links. I've already implemented the code from R. The same implementation is available in Python. But I still can't reproduce ESRI's breaks. The authors mention that the results are somewhat different from ESRI's, but don't say how far off or which algorithm is better.

From my calculations for US county areas (sums of squared deviations):

whole variation: ~5000 * 10^7

In-class variation after classification (sum of squared deviations):

equal count classes: 2206 * 10^7

equal interval classes: 447*10^7

natural breaks with shifts (mine): 107 * 10^7

natural breaks R: 103 * 10^7

natural breaks ESRI: 93 * 10^7.
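These figures are often normalised as the goodness of variance fit, GVF = 1 - (within-class SSD) / (total SSD). A quick sketch using the numbers above:

```python
def gvf(total_ssd, within_ssd):
    """Goodness of variance fit: 1.0 is perfect, 0.0 is no improvement."""
    return 1.0 - within_ssd / total_ssd

total = 5000  # whole variation, in units of 10^7
for name, within in [("equal count", 2206), ("equal interval", 447),
                     ("shifts (mine)", 107), ("R", 103), ("ESRI", 93)]:
    print(f"{name}: GVF = {gvf(total, within):.3f}")
```

On this scale the shift, R and ESRI results all sit around 0.98, which supports the point below that most users would not notice the difference.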

So ESRI's results are better. I don't know whether I translated the R code correctly; the algorithm is complex and it's difficult to assess how precise it is. One thing is clear - it will definitely be slower than the approach with shifts described on the wiki or the ESRI site.

I've already spent a serious span of time on this. So my decision for now is to keep the simple implementation with shifts (averaging SSD across classes), as it's fast and gives a tolerable approximation of natural breaks (the major part of the variance is accounted for, after all). Most of our users will never notice whether the breaks we provide are 'absolutely natural' or 'close to natural', but they will notice that this approach is better than equal-interval/equal-count classifications.

Perhaps someone else will find an error in my translation of the R code (in case there is one), or will improve the approach with shifts.

Thanks,

Sergei


**Re: Jenks natural breaks**

Posted by:

**amesdani** | Date: June 17, 2010 08:03AM

Sergei... my suggestion is to try to match the R results and/or results from one of the other open-source GIS projects - maybe GRASS has an implementation? - rather than try to match ESRI. They have hidden all of their algorithms, so there is no point in trying to "divine" what they are doing in the black box. But if you can match GRASS or R, then we're in excellent shape, I think. - Dan

**Re: Jenks natural breaks**

Posted by:

**bmarch** | Date: June 17, 2010 08:53AM

To add to Dan's comment about ESRI's implementation: there is no point in trying to reproduce their results if it's not clear how they got them in the first place. Open-source projects are about transparency as much as they are about free and modifiable code. Pick one or more comparable methods that are well documented and understood in the scientific community (or the statistical community in this case) and implement them as best you can. Also, looking at GRASS or gvSIG is a good idea, as they may have code snippets and optimizations that you can use (especially GRASS, since it's C or C++).

Brian


**Re: Jenks natural breaks**

Posted by:

**pmeems** | Date: June 18, 2010 12:39AM

I agree with Dan and Brian.

Use a method that is well documented and don't focus too much on how ESRI is doing it.

--

Paul



**Re: Jenks natural breaks**

Posted by:

**gngdowid** | Date: June 18, 2010 03:41AM

Sergei,

I agree with Paul's comment:

**Use a method that is well documented and don't focus too much on how ESRI is doing it.**

I would go even further than that: **let's do it better**.

Letting the user set the number of breaks leads in most cases to unnatural breaks, not what nature provided. Looking at the frequency distribution histogram will provide the real natural breaks, not a pre-determined number.

You can **control the sensitivity** by letting the user **set the size of the moving window** over the histogram table, from narrow to wide.

Creating the frequency distribution histogram will not be too time consuming. It's just a single loop through all shapefile feature entries, counting the occurrences of the individual attribute values between the attribute minimum and maximum.

Next you probably want to find the local maxima of the averages inside the moving window to determine the number of classes (one loop through the table), then find the minima positions to the left and right of these maxima as the class boundaries (a second loop through the table).
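The two table loops described here could be sketched like this (the centred window and the strict peak test are illustrative choices, not MapWindow code):

```python
def moving_average(freq, window):
    """Smooth the frequency table with a centred moving window
    (the window size is the user's sensitivity control)."""
    half = window // 2
    smoothed = []
    for i in range(len(freq)):
        lo, hi = max(0, i - half), min(len(freq), i + half + 1)
        smoothed.append(sum(freq[lo:hi]) / (hi - lo))
    return smoothed

def classes_from_histogram(freq, window=3):
    """Loop 1: local maxima of the smoothed curve are the class centres.
    Loop 2: the minimum between two successive maxima is a boundary."""
    sm = moving_average(freq, window)
    peaks = [i for i in range(1, len(sm) - 1)
             if sm[i] > sm[i - 1] and sm[i] >= sm[i + 1]]
    bounds = [min(range(a, b + 1), key=lambda i: sm[i])
              for a, b in zip(peaks, peaks[1:])]
    return peaks, bounds

print(classes_from_histogram([1, 3, 5, 3, 1, 0, 1, 4, 6, 4, 1]))
# → ([2, 8], [5]): peaks in bins 2 and 8, boundary in bin 5
```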

Allocating colors, symbols and/or patterns to display for the classes then should be a simple task for you.

This way we avoid making unjustified assumptions about the number of classes but still leave sensitivity control with the user.

Cheers, Gerd


**Re: Jenks natural breaks**

Posted by:

**Sergei** | Date: June 19, 2010 04:36AM

Thanks for the suggestions; it's good to have such extensive feedback! I'll do further testing to assess the available algorithms. As for 'well-documented methods', I actually haven't seen any scientific paper on this problem. If anybody has one, I'll be glad to read it.

Thanks,

Sergei

