Algorithm for clustering points into pairs (exactly two) by proximity without re-use

I have a reasonably large list of centroids that I want to cluster into groups of two by proximity (minimizing the distance within each pair).

I've explored k-means, which does cluster them by proximity, but the count of members in each group varies. With k-means you set a number of clusters, not a number of members in each cluster.

The nearest-neighbor problem solves this issue for two items from the set, but not against the entire data set.

K-nearest neighbors seems to break them into groups of N, but it appears to allow for points to be reused. In my scenario there can be no overlap.

Is there a particular algorithm, or suite of algorithms designed to address this? I'm pretty handy when I know what I'm working against, but I don't have a good sense of how to approach the problem.

To add more about the context and what we're trying to solve:

The points represent a number of sites throughout the USA. Each of these sites is a competitor (supply). Independently, we've aggregated demand (from census data, etc). We want to average the nearest pairs so that we can use the aggregated supply when calculating our supply/demand indexes for a given spatial extent (defined by the demand polygons).

We need to use at least two points so that individual data from a given site is obscured. This is a licensing/privacy requirement. We would otherwise analyze every point individually. We don't want to use more than two, because that further obscures the data. By using two, we adhere to licensing requirements, while minimizing the effect of averaging across a cluster.
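One way to frame the requirement above is as a matching problem: every point ends up in exactly one pair and the pairs should be as tight as possible. The exact version (minimum-weight perfect matching) is a standard graph problem, and graph libraries such as NetworkX provide matching routines for it. The sketch below is only a simple greedy approximation, not an answer taken from this thread. It assumes planar coordinates and plain Euclidean distance; for sites spread across the USA, projected coordinates or a great-circle distance would be more appropriate. The names `greedy_pairs` and `sites` are made up for illustration.

```python
from itertools import combinations
from math import dist  # Python 3.8+


def greedy_pairs(points):
    """Pair points greedily: repeatedly take the closest pair of unused points.

    This does NOT guarantee the minimum total distance; it is a heuristic.
    With an odd number of points, one point is left unpaired.
    """
    # Every candidate pair of indices, sorted by Euclidean distance.
    # This is O(n^2 log n), which is acceptable for a sketch.
    candidates = sorted(
        combinations(range(len(points)), 2),
        key=lambda ij: dist(points[ij[0]], points[ij[1]]),
    )
    used = set()
    pairs = []
    for i, j in candidates:
        # Accept a pair only if neither point has been used yet (no re-use).
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
    return pairs


# Hypothetical site coordinates, for illustration only.
sites = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0), (9.0, 0.5)]
pairs = greedy_pairs(sites)
```

Greedy pairing can leave two distant leftovers paired with each other at the end, so if the total (or worst-case) pair distance matters, an exact matching algorithm is the safer choice.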

Why does the k-means clustering algorithm use only the Euclidean distance metric?

Is there a specific reason, in terms of efficiency or functionality, why the k-means algorithm uses only the Euclidean norm and not, for example, cosine (dis)similarity as a distance metric? More generally, does the k-means method remain valid and correct when distances other than Euclidean are used?

[Addition by @ttnphns. The question is two-fold. "(Non)Euclidean distance" may concern distance between two data points or distance between a data point and a cluster centre. The answers so far have attempted to address both.]
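One commonly cited reason, offered here as background rather than as a summary of the answers (which are not reproduced on this page), is that the arithmetic mean used in the k-means update step is exactly the point that minimizes the sum of squared Euclidean distances. For other distances the mean is generally not the minimizer. The one-dimensional sketch below illustrates this with made-up numbers; the function names are hypothetical.

```python
from statistics import mean, median


def cost_squared(center, xs):
    """Sum of squared Euclidean distances from each x to the center."""
    return sum((x - center) ** 2 for x in xs)


def cost_absolute(center, xs):
    """Sum of absolute (Manhattan) distances from each x to the center."""
    return sum(abs(x - center) for x in xs)


# Made-up one-dimensional cluster with one outlying point.
xs = [0.0, 1.0, 2.0, 10.0]
m, med = mean(xs), median(xs)

# The mean wins under squared Euclidean distance ...
mean_is_better_squared = cost_squared(m, xs) < cost_squared(med, xs)
# ... but the median wins under absolute distance.
median_is_better_absolute = cost_absolute(med, xs) < cost_absolute(m, xs)
```

So swapping in another distance for the assignment step while still averaging in the update step no longer guarantees that each iteration lowers the objective.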


Automatic clustering algorithms: a systematic review and bibliometric analysis of relevant literature

Cluster analysis is an essential tool in data mining. Several clustering algorithms have been proposed and implemented, most of which are able to find good quality clustering results. However, the majority of the traditional clustering algorithms, such as K-means, K-medoids, and Chameleon, still depend on being provided a priori with the number of clusters and may struggle to deal with problems where the number of clusters is unknown. This lack of vital information may impose some additional computational burdens or requirements on the relevant clustering algorithms. In real-world data clustering analysis problems, the number of clusters in data objects cannot easily be preidentified, and so determining the optimal number of clusters for a dataset of high density and dimensionality is quite a difficult task. Therefore, sophisticated automatic clustering techniques are indispensable because of their flexibility and effectiveness. This paper presents a systematic taxonomical overview and bibliometric analysis of the trends and progress in nature-inspired metaheuristic clustering approaches from the early attempts in the 1990s until today’s novel solutions. Finally, key issues with the formulation of metaheuristic algorithms as a clustering problem and major application areas are also covered in this paper.


5.1 How are FunGCs Computed?

FunGCs are computed by a two-step process. Given all genes in an organism:

Step 1: Compute a pairwise functional-linkage score between every pair of genes in the organism.

Step 2: Compute FunGCs by searching for highly connected sets of functionally linked genes from Step 1.

In a moment we will consider these steps in more detail. But first we discuss the reliance of these methods on the ortholog data within BioCyc.
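The two steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the BioCyc implementation: the scoring function, the threshold, and the gene names are hypothetical, and "highly connected sets" is approximated here by connected components of the thresholded linkage graph, which is a weaker criterion than the real method may use.

```python
from itertools import combinations


def linked_components(genes, score, threshold):
    """Group genes whose pairwise linkage scores connect them.

    Step 1: score every pair and keep links at or above `threshold`.
    Step 2 (stand-in): return connected components with at least two genes.
    """
    adjacency = {g: set() for g in genes}
    for a, b in combinations(genes, 2):
        if score(a, b) >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, clusters = set(), []
    for g in genes:
        if g in seen:
            continue
        # Depth-first traversal to collect everything reachable from g.
        stack, component = [g], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) > 1:  # singletons are not clusters
            clusters.append(sorted(component))
    return clusters


# Hypothetical linkage scores, for illustration only.
toy_scores = {("g1", "g2"): 0.9, ("g2", "g3"): 0.8, ("g4", "g5"): 0.7, ("g1", "g4"): 0.1}


def toy_score(a, b):
    """Symmetric lookup; unlisted pairs score 0.0."""
    return toy_scores.get((a, b), toy_scores.get((b, a), 0.0))


clusters = linked_components(["g1", "g2", "g3", "g4", "g5", "g6"], toy_score, 0.5)
```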

5. Results Against Analog Cases

[27] The pairwise homogenization algorithm produces a list of breakpoint dates and adjustments for each input series. Although it is possible to evaluate results at the individual station series level, the focus here is on the aggregate, network-wide impacts as reflected in changes to the regional mean value. We present these aggregate results beginning with the simplest analog error structure and moving progressively to the more complex models.

[28] Figure 2 provides a geographic perspective of the trends in the “perfect data” analog both for the raw input data (Figure 2a) and for the data homogenized by the default version of the algorithm (Figure 2b). The trends were calculated by interpolating the annual temperature values to a 0.25 × 0.25 degree grid and then calculating the trend for each grid box as described by Menne et al. [2009]. The default version of the algorithm essentially preserves the pattern of trends although there appears to be some minor smoothing of the spatial pattern. Nevertheless, in the case of “perfect data,” no version of the pairwise algorithm makes unwarranted adjustments sufficient to move the average CONUS trend away from the true trend, and the average series produced by the 100 randomized versions of the algorithm are indistinguishable from those based on the raw input data (see auxiliary material).

[29] In the “Big breaks, perfect metadata” case, the unadjusted input data are characterized by a noisy, heterogeneous field of trends caused by the imposition of random breaks in the network throughout the series. As shown in Figure 3a, the impact is a mix of trends with positive and negative biases. In this case, the default algorithm comes close to reproducing the true spatial pattern and magnitude of trends (Figures 3b and 3c), which is expected given that the timing of all breaks is known. Nevertheless, some randomized versions of the algorithm do not make use of the metadata and treat all breaks as undocumented. Further, the use of a significance test when estimating the magnitude of each break means the recovery of the true climate signal from the input data is not necessarily perfect. However, since there is not an overall bias associated with the imposed errors, the randomized versions of the algorithm all produce CONUS average trends that do not deviate substantially from the true background trend (Figure 4) and there is no sign preference to the potential residual error.

[30] In the “Mixed break sizes, some clustering” analog, errors are clustered in time (between 1915 and 1975 and somewhat more heavily from 1915 to 1945), and a sign preference is present in the errors. In this case, the homogenized trends since 1900 and since 1950 from the ensemble are all greater than the raw input trend (Figure 5), an indication that the algorithm is accounting for the sign bias in the imposed errors during the periods when the errors are concentrated.

[31] In the “Clustering and sign bias” family of analogs, the imposed errors exhibit an even larger sign preference and are more clustered in time, including nearer to the end of the series, which biases average trends for all periods since 1900. The impact of the sign bias on the raw input trends for the full period can be seen in Figure 6. Relative to the true values (Figure 6b) a larger number of trends are too high rather than too low in the unadjusted data (Figure 6a). Nevertheless, the default version of the pairwise homogenization algorithm comes close to reproducing both the magnitude and pattern of the underlying temperature trends (Figure 6c) in spite of the sign preference. As shown in Figure 7, all randomized versions of the algorithm produce homogenized series that bring the CONUS average closer to the true value for all trend periods, with some algorithm configurations, including the default version, yielding results very close to “truth”, moving the trend more than 95% toward the true climate signal. In particular, the impact of the pervasive positive errors seeded in 70% of the analog series after 1980 is reduced by all ensemble members. Notably, the potential residual error is essentially one-tailed in this case; there is a low probability of overcompensating for the bias changes by a small amount.
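A statement such as "moving the trend more than 95% toward the true climate signal" can be read as a simple ratio: the share of the gap between the raw and true trends that homogenization closes. The sketch below shows that arithmetic under this reading, which is an assumption and may differ from the exact metric used in the study. The function name and the trend values are hypothetical.

```python
def fraction_recovered(raw_trend, homogenized_trend, true_trend):
    """Share of the raw-to-true trend gap closed by homogenization.

    1.0 means the true trend is fully recovered, values below 1.0 indicate
    under-correction, and values above 1.0 indicate over-correction.
    """
    gap = true_trend - raw_trend
    if gap == 0:
        raise ValueError("raw and true trends are equal; the ratio is undefined")
    return (homogenized_trend - raw_trend) / gap


# Made-up trends (e.g. degrees C per century), for illustration only.
share = fraction_recovered(raw_trend=0.50, homogenized_trend=0.69, true_trend=0.70)
```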

[32] Figure 8 provides a summary overview of the “Clustering and sign bias” family of analogs (and additional time series are provided as auxiliary material). Because each of these four analogs was seeded with identical errors, any difference in homogenization performance for a particular ensemble member is a function only of the presence or absence of a forced response component and the timing and patterns of natural internal variations simulated by the various underlying models. Results indicate that while the efficiency of individual members is somewhat dependent on the nature of the underlying climate signal and covariance structure, the relative performance of each member measured by the degree to which the true trend is recovered remains largely unchanged from analog to analog within the family. In other words, the performance of any particular version of the algorithm appears to be largely, but not completely, invariant to the underlying climate signal, as shown in Figure 9. Moreover, a comparison of Figures 4, 7, and 8 also suggests that the underlying error structure is a more fundamental consideration in the ability of the algorithm to retrieve the true underlying climate signal than the nature of the climate signal itself. In light of this, it may be possible to choose a number of pairwise algorithm configurations that should be expected to be relatively good performers under a wide variety of error characteristics.

[33] Results for the most challenging analog “Very many small breaks with sign bias” are summarized in Figures 10 and 11. In this case, a large percentage of the breaks are likely below the magnitude that can be efficiently detected by the pairwise (or perhaps any) algorithm. Consequently, the various ensembles produced by the randomized versions of the algorithm do not move the trend far enough toward the true trend value (Figure 10). Likewise, the geographic distribution of trends (Figure 11) indicates that the systematic bias caused by the imposed errors is only partially removed by the homogenization algorithm, the consequence of which is a residual mean bias that underestimates the true CONUS trend and a heterogeneous field of trends.

[34] Finally, we note that a 100-member randomization was considered at the outset to be sufficient to explore the sensitivity of the various parameters, especially since not all of them were expected to have a substantial impact on the results. By way of confirmation, the “clustering and sign bias-C20C1” analog was run through 500 randomizations of the algorithm and the results were compared to the original 100-member ensemble as well as smaller numbers of combinations. As Figures S6–S10 indicate, the median and interquartile ranges are well represented with 100 members, and the worst-case implication from this expanded randomization is that the range of the ensemble trends may be underestimated by about 25%. However, it is worth noting that the only outlier in the expanded 500-member ensemble not captured by the 100-member ensemble resulted from a particularly conservative set of settings that minimized the impact of the homogenization. More generally, it is the conservative tail, which minimizes adjustments, that is poorly quantified with smaller ensemble sizes rather than the more aggressive tail of the distribution that samples solutions closer to the target truth. In the future, the potential exists to massively parallelize such data set creation through citizen scientists and their IT capabilities, akin to, e.g., [Allen, 1999], if the pairwise homogenization code can be made suitably portable and platform independent. This could also open up new opportunities such as derivation of a neural network algorithm tuning approach either explicitly or through, for example, interfacing with the serious gaming community [Krotoski, 2010].

[35] To summarize, based on all analog results we conclude that:

1. In cases where there is no sign bias to the seeded errors, the randomized versions of the algorithm produce results clustered around the true trend.

2. For cases in which there were errors seeded with a sign bias, all randomized versions of the algorithm moved the trend in the correct direction.

3. Rather than overcorrect, the randomized algorithms generally do not correct the trend enough in the presence of errors with a sign bias because of incomplete adjustments that bias the underlying trends. The propensity to under-correct is sensitive to the frequency and magnitude of imparted breaks with more frequent and smaller breaks leading to more incomplete corrections.

4. The algorithm is potentially capable of adjusting data even when pervasive network wide quasi-contemporaneous changes of a similar nature occur.

5. Although algorithm performance is somewhat impacted by natural climate variations and the presence of forced changes, this impact is secondary to that of the error structure imparted on the raw observations. The error structure, which is unknown in the real-world, is the primary limiting factor on algorithm efficiency.

Preamble on infectious diseases

Infectious diseases are caused by pathogenic microorganisms, such as bacteria, viruses, parasites, or fungi. The diseases can be symptomatic or asymptomatic. Certain infectious diseases such as human immunodeficiency virus (HIV) can be fairly asymptomatic but can lead to disastrous consequences after a few years if uncontrolled. The spread of infectious diseases varies from microorganism to microorganism. For instance, certain viruses such as HIV are only transmitted upon close physical contact (sexual transmission or blood contact), while influenza virus infection is transmitted by droplets emitted when sneezing, coughing, or speaking, within a few meters of distance. Zoonotic diseases are infectious diseases of animals that can cause disease when transmitted to humans.

In the 20th century, infectious diseases were responsible for the largest number of premature deaths and disabilities worldwide. The Spanish flu occurred at the beginning of the previous century (Taubenberger and Morens, 2006). It is estimated that one-third of the world’s population (500 million individuals) was infected and had symptoms during the 1918 pandemic (Fig. 1A). The disease was one of the deadliest of all influenza pandemics. It was estimated that at least 50 million individuals died following the infection. The impact of this pandemic was not restricted to the first quarter of the 20th century, since almost all cases of influenza A were caused by mutated versions of the 1918 virus. While we will not cover the virologic or immunological aspects of influenza infection, it is important for the purpose of this chapter to understand why the pandemic occurred. The 1918 flu pandemic happened during World War I, when proximity, bad hygiene, and unusual mass movement (troops and population) helped the spread of the virus. Even the United States reported more than 600,000 deaths despite the distance. Many of the countries involved in the war “failed” to communicate the death toll caused by influenza. This was purposely kept silent in order to sustain public morale. While this could be understood from a military perspective, it had deadly consequences, as the virus would return in further waves. At that time, viruses were not yet known, and diagnosis, prevention and treatment were very limited. As such, people would suffer from the influenza virus itself (flu illness) and its consequences, such as lung infection by bacteria (pneumonia) in susceptible individuals. This shows how poor communication and wrong usage of pandemic data could affect millions of lives. Since then, progress has been made in following influenza A pandemics.
Since 1952, the World Health Organization’s Global Influenza Surveillance and Response System (GISRS) has been monitoring the evolution of influenza viruses. It also serves as a global alert mechanism for emerging viruses with pandemic potential, as observed in 1918. We now better understand the factors that influence transmission (Fig. 1B). Influenza is just one of the various pandemics we have been through. In fact, besides influenza, smallpox, tuberculosis, and cholera are constant threats (Holmes et al., 2017). Improved hygiene conditions and vaccination campaigns have been very effective means of reducing the spread of infections. There are different cases of viral spread; for instance, there is constant follow-up on polio cases, as three countries still report cases while WHO has the mission to eradicate it completely. The 21st century has already seen emerging pandemic infections such as SARS (severe acute respiratory syndrome), MERS (Middle East respiratory syndrome), Ebola, and Zika viruses. By controlling infections, we can reduce premature death as well as infection-driven diseases such as cirrhosis (hepatitis B), liver cancer (hepatitis C), stomach cancer (Helicobacter pylori), or worsening of conditions such as cardiovascular and respiratory diseases (influenza A). Because we cannot always rely on medicine to rapidly develop vaccines or other treatments, the best prevention is to detect possible pandemics early and stop the transmission. By blocking transmission, we could eventually also reduce the mutation of the viruses and thus keep the virus in a stage that vaccines could help fight.

Lessons from the 1918 “Spanish” flu. (A) Graph representing the number of deaths during the peak of the 1918 influenza pandemic. (B) Since the “Spanish” flu, much knowledge has been acquired in the mechanisms of influenza transmission and factors influencing it.

1 Introduction

A central question in phonological typology (and in phonology more generally) is whether there are principles that govern the size, structure and constituent parts of phonological inventories, and if so, what they are. Research in recent decades has proposed numerous factors, often extralinguistic, that predict the composition of phonological inventories. Such proposed factors include demography (Pericliev 2004, Hay & Bauer 2007, Donohue & Nichols 2011, Moran et al. 2012, Greenhill 2014), environment and climate (Everett 2013, Everett et al. 2015, 2016), genetics (Dediu & Ladd 2007, Creanza et al. 2015), geography and population movements (Atkinson 2011), culture (Labov et al. 2013) and anatomy (Dediu et al. 2017).

Structural, i.e. language-internal or systemic, factors include the ‘size predicts’ generalisation: the number of segments in an inventory largely determines its content, such that small systems recruit few (and basic) dimensions, while larger systems entail additional (and secondary) dimensions (Lindblom & Maddieson 1988). In this paper, we focus on another structural factor, namely feature economy. The feature-economy principle is one of the mainstays of contemporary discussions of phonological segment inventories in the languages of the world. Two different, albeit largely congruent, formulations of this principle were proposed by Lindblom & Maddieson (‘small paradigms tend to exhibit ‘unmarked’ phonetics whereas large systems have ‘marked’ phonetics’; 1988: 70) and Clements (‘languages tend to maximise the ratio of sounds over features’; 2003: 287). This idea goes back at least to early work in structuralist phonology, including Trubetzkoy (1939), Martinet (1952) and Hockett (1955), who were interested in the extent to which phonological inventories are symmetrical with respect to features, or, in other words, how much ‘mileage’ phonological inventories get out of individual features; see an overview of early developments of this concept in Clements (2003). Similar conclusions were later reached using different formulations and/or different datasets (Marsico et al. 2004, Coupé et al. 2009, Mackie & Mielke 2011, Moran 2012, Dunbar & Dupoux 2016), and theoretical and experimental investigations of feature economy have become a major line of phonological research: see Pater (2012), Verhoef et al. (2016) and Seinhorst (2017).

The aim of the current paper is not to propose another explanation or interpretation of the feature-economy principle, but to take a step back in order to reassess how well it actually fits the structure of phonological segment inventories of the world's languages, focusing on consonants.

Clements (2003: 288–289) hypothesises that the feature-economy principle can only be constrained by functional factors: ‘avoided feature combinations can be shown to be inefficient from the point of view of speech communication. That is, their articulation is relatively complex, or their auditory attributes are not distinct enough from those of some other sound in the system’. Marsico et al. (2004) and Coupé et al. (2009) attempted to quantify the amount of residual variance left unexplained by the feature-economy principle by computing the redundancy factors and the cohesion of phonological inventories in UPSID (Maddieson & Precoda 1992). Our aim is to provide an exploratory assessment of the structure of this residual variance. Our premise is that if Clements’ assessment of the explanatory power of the feature-economy principle were correct, we would be able to explain the majority of exceptions to the feature-economy principle by invoking perception and/or production factors, and that the variance left unexplained after that would consist of random noise due to the probabilistic nature of sound change. Clements’ hypothesis therefore would be falsified (in a non-statistical, observationist way) if we were to discover that there are principles governing inventory structures that do not stem from the abovementioned types of functional factors.

The core of our approach to testing this hypothesis is the notion of a co-occurrence class. Co-occurrence classes are groups of sounds that tend to be found together in inventories. We provide a fully algorithmic definition of this notion in §2, but at this stage we would like to explore its implications. We regard co-occurrence classes as a particularly powerful method of phonological analysis, since nearly all principles governing the structure of phonological inventories are plausibly reflected in the structure of these classes. We give two examples in (1).

Our primary interest is how the feature-economy principle is reflected in the structure of co-occurrence classes. In order to investigate this, we propose to first reformulate the principle in a more structural fashion. Building on the notion of sound-inventory symmetry explored by Dunbar & Dupoux (2016), we operationalise the feature-economy principle by interpreting it as largely synonymous with the layering principle: new classes of sounds arise by virtue of adding new features to already existing combinations. An empirical confirmation of this is found in Moran (2012: 248) with respect to vowels, such that ‘once languages expand their inventories beyond cardinal vowels, they tend to do so by either nasalization or lengthening, and to a lesser extent by adding diphthongs to the inventory’.

This formulation has the advantage of providing a simple way to articulate a structural prediction: if we investigate empirical co-occurrence classes, we should see that they are progressively defined by a succession of additional features. Thus we should see both large classes dominated by basic distinctions (place, manner and VOT) and smaller classes in which these distinctions are augmented by different additional articulations. We call classes that respect the feature-economy/layering principle conformant classes. Most importantly, the feature-economy/layering principle predicts that certain constellations should not exist. In particular, it prohibits cross-layer connections (the close patterning of segments with different numbers of features turned on) and cross-feature connections (the close patterning of segments with different privative features turned on). That is, if there exists a class of palatalised segments, we do not expect some of the members of this class to pattern with either labialised segments (which would be a cross-feature connection) or plain segments (which would be a cross-layer connection), as this would imply that languages do not exhaust the usefulness of the [+palatalised] feature. We call classes that do not respect the feature-economy/layering principle non-conformant classes. A good example of a conformant class is long voiced stops: /bː dː gː/. They are distinguished by a single distinctive feature value [+long], and exhaust the possible combinations of VOT and manner values for all places of articulation. Short voiced stops /b d g/, however, do not form a conformant class in our data. Instead they are embedded inside a large complex class, the ‘first extension set’. We discuss conformant and non-conformant classes in §4.

It is important to stress that this type of analysis is based on bidirectional dependences (the presence of segment A is probabilistically dependent on the presence of segment B, and vice versa), not on unidirectional implicational universals (languages with segment A tend to also have segment B, but segment B is also frequently found without segment A). For example, languages with /pʲ/ have a very strong tendency to have /p/ as well. Nevertheless, the bidirectional co-occurrence dependence between these segments is very low: the absence of /pʲ/ is a very weak indicator of the absence of /p/. On the other hand, the absence of /pʲ/ is a strong indicator of the absence of /bʲ/, and vice versa.
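The contrast between a unidirectional implication and a bidirectional dependence can be made concrete with a toy calculation. The sketch below is not the statistical method of §2; it only computes two conditional proportions over a handful of invented inventories, and both function names are hypothetical.

```python
def p_present_given_present(inventories, target, given):
    """Share of inventories containing `given` that also contain `target`."""
    with_given = [inv for inv in inventories if given in inv]
    return sum(target in inv for inv in with_given) / len(with_given)


def p_absent_given_absent(inventories, target, given):
    """Share of inventories lacking `given` that also lack `target`."""
    without_given = [inv for inv in inventories if given not in inv]
    return sum(target not in inv for inv in without_given) / len(without_given)


# Six invented consonant inventories, for illustration only.
inventories = [
    {"p", "b"},
    {"p", "b"},
    {"p"},
    {"p", "b", "pʲ", "bʲ"},
    {"p", "pʲ", "bʲ"},
    {"b"},
]

# Unidirectional implication: every inventory with /pʲ/ also has /p/.
implication = p_present_given_present(inventories, "p", "pʲ")
# Weak bidirectional link: lacking /pʲ/ says little about lacking /p/.
weak_link = p_absent_given_absent(inventories, "p", "pʲ")
# Strong bidirectional link: lacking /pʲ/ goes with lacking /bʲ/, and vice versa.
strong_link = p_absent_given_absent(inventories, "bʲ", "pʲ")
strong_link_reverse = p_absent_given_absent(inventories, "pʲ", "bʲ")
```

In this toy data /p/ follows from /pʲ/ in every case, yet the two segments do not stand or fall together, whereas /pʲ/ and /bʲ/ do.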

This paper aims to make the following contributions. First, we propose a statistical method for identifying co-occurrence classes of sounds in the world's languages. Second, using this method, we empirically identify several co-occurrence classes worthy of attention in themselves, one of them being the basic consonant inventory. Third, using the structure of the co-occurrence classes identified by this new method, we show the limits of the applicability of the feature-economy principle in its various formulations.

The paper is organised as follows. In §2, the method used to derive co-occurrence classes is described, together with the dataset it is applied to. In §3, the resulting classification of the major types of segments in the languages of the world is presented, and a brief overview of the classes is given. §4 is devoted to the consequences of the structure of the co-occurrence classes for the feature-economy principle, and §5 presents conclusions.

Major Map

Major Maps help undergraduate students discover academic, co-curricular, and discovery opportunities at UC Berkeley based on intended major or field of interest. Developed by the Division of Undergraduate Education in collaboration with academic departments, these experience maps will help you:

Explore your major and gain a better understanding of your field of study

Connect with people and programs that inspire and sustain your creativity, drive, curiosity and success

Discover opportunities for independent inquiry, enterprise, and creative expression

Engage locally and globally to broaden your perspectives and change the world

Reflect on your academic career and prepare for life after Berkeley

Use the major map below as a guide to planning your undergraduate journey and designing your own unique Berkeley experience.

