Package cornetto :: Module simcornet :: Class SimCornet

Class SimCornet


   object --+    
            |    
cornet.Cornet --+
                |
               SimCornet

An extension of the Cornet class that adds word similarity measures. It assumes that counts have been added to the Cornetto database.
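
A minimal usage sketch (the database file names are hypothetical, and it is assumed that the inherited constructor takes no required arguments; the import path follows the package and module names above):

from cornetto.simcornet import SimCornet

c = SimCornet()                              # inherited Cornet constructor
c.open("cdb_lu_counts.xml", "cdb_syn.xml")   # lexical units with counts, synsets
print(c.get_count("varen"))                  # see the examples below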

Instance Methods
 
open(self, cdb_lu, cdb_syn, verbose=False)
    Open and parse Cornetto database files with counts.
get_count(self, lu_spec, subcount=False, format=None) -> dict
    Get (sub)counts for lexical units satisfying this specification.
get_total_counts(self) -> dict
    Get the total counts per category and overall.
get_probability(self, lu_spec, subcount=False, smooth=False, cat_totals=False, format=None) -> dict
    Get the probability (p) for lexical units satisfying this specification, where the probability is defined as lu_count / total_count.
get_info_content(self, lu_spec, subcount=False, smooth=False, cat_totals=False, format=None) -> dict
    Get the information content (IC) for lexical units satisfying this specification, defined as the negative log of the lexical unit's probability.
resnik_sim(self, lu_spec1, lu_spec2, smooth=False, cat_totals=False, format=None) -> float or None
    Compute the semantic similarity described in Philip Resnik's paper "Using Information Content to Evaluate Semantic Similarity in a Taxonomy" (1995).
jiang_conrath_dist(self, lu_spec1, lu_spec2, smooth=False, cat_totals=False, format=None) -> float or None
    Compute the semantic distance described in Jay Jiang & David Conrath's paper "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy" (1997).
jiang_conrath_sim(self, lu_spec1, lu_spec2, smooth=False, cat_totals=False, format=None) -> float or None
    Return Jiang & Conrath's distance converted to a similarity by means of sim = 1 / (1 + dist).
lin_sim(self, lu_spec1, lu_spec2, smooth=False, cat_totals=False, format=None) -> float or None
    Compute the semantic similarity described in Dekang Lin's paper "An Information-Theoretic Definition of Similarity" (1998).
_get_lu_count(self, lu, subcount=False)
    Get the (sub)count of a lexical unit.
_p(self, lu, subcount=False, smooth=False, cat_totals=False)
    Probability on the basis of MLE using (sub)counts.
_IC(self, lu, subcount=False, smooth=False, cat_totals=False, base=2)
    Information content.

Inherited from cornet.Cornet: __init__, all_common_subsumers, ask, get_lex_unit_by_id, get_lex_units, get_related_lex_units, get_related_synsets, get_synset_by_id, get_synsets, least_common_subsumers, set_max_depth, set_output_format, test_lex_units_relation

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

Properties

Inherited from object: __class__

Method Details

open(self, cdb_lu, cdb_syn, verbose=False)

Open and parse Cornetto database files with counts

Parameters:
  • cdb_lu (file or filename) - XML definition of the lexical units with counts
  • cdb_syn (file or filename) - XML definition of the synsets
  • verbose - verbose output during parsing
Overrides: cornet.Cornet.open

get_count(self, lu_spec, subcount=False, format=None)

Get (sub)counts for lexical units satisfying this specification

>>> pprint(c.get_count("varen"))
{'varen:noun:1': 434,
 'varen:verb:1': 15803,
 'varen:verb:2': 15803,
 'varen:verb:3': 15803}
>>> pprint(c.get_count("varen", subcount=True))
{'varen:noun:1': 434,
 'varen:verb:1': 18977,
 'varen:verb:2': 62086,
 'varen:verb:3': 15803}
Parameters:
  • lu_spec - lexical unit specification
  • subcount (bool) - return subcounts instead of plain counts
  • format ('spec', 'xml', 'raw') - output format
Returns: dict
mapping of lexical units in requested output format to (sub)counts

Note: Since the counts are currently based on lemma plus part-of-speech rather than on sense, they are the same for all senses within the same category.

get_total_counts(self)

Get the total counts per category and overall

The categories are "noun", "verb", "adj", "other"; "all" represents the overall count.

>>> c.get_total_counts()
{'adj': 62156445,
 'all': 518291832,
 'noun': 187143322,
 'other': 199269966,
 'verb': 69722099}
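
As a quick sanity check, the per-category counts should sum to the overall count (a sketch, assuming the category keys shown above):

>>> totals = c.get_total_counts()
>>> sum(v for k, v in totals.items() if k != "all") == totals["all"]
True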
Returns: dict
mapping of categories to counts

get_probability(self, lu_spec, subcount=False, smooth=False, cat_totals=False, format=None)

Get probability (p) for lexical units satisfying this specification, where the probability is defined as lu_count / total_count.

By default, the total count is taken to be the sum of counts over all word forms in the Cornetto database. However, when comparing two words of the same category (nouns, verbs, adjectives), it may be more appropriate to sum over only the word forms of that category. This behaviour is enabled when the keyword "cat_totals" is true.

>>> pprint(c.get_probability("varen"))
{'varen:noun:1': 8.3736608066013281e-07,
 'varen:verb:1': 3.0490544176663777e-05,
 'varen:verb:2': 3.0490544176663777e-05,
 'varen:verb:3': 3.0490544176663777e-05}
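
The relation to the counts can be checked directly (a sketch; no smoothing, overall totals):

>>> count = c.get_count("varen")["varen:noun:1"]
>>> total = c.get_total_counts()["all"]
>>> count / float(total)
8.3736608066013281e-07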
Parameters:
  • lu_spec (string) - lexical unit specification
  • subcount (bool) - use subcounts instead of plain counts
  • smooth (bool) - smooth counts by adding one to lexical units with a zero count
  • cat_totals (bool) - use total count for category of lexical unit instead of overall total count
  • format ('spec', 'xml', 'raw') - output format
Returns: dict
mapping of lexical units in requested output format to probabilities

get_info_content(self, lu_spec, subcount=False, smooth=False, cat_totals=False, format=None)

Get information content (IC) for lexical units satisfying this specification, defined as the negative log of the lexical unit's probability, i.e. -log_2(lu_count / total_count)

If a lexical unit has a count of zero, its probability is zero, the log is undefined, and None is returned, unless the keyword "smooth" is true, in which case the count is smoothed by adding one.

If no lexical unit matches the specification, an empty mapping is returned.

>>> pprint(c.get_info_content("plant"))
{'plant:noun:1': 14.51769181264614}
>>> pprint(c.get_info_content("plant", subcount=True))
{'plant:noun:1': 10.482770362490861}
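
The IC follows directly from the probability; a sketch, which should match get_info_content up to floating-point rounding:

>>> from math import log
>>> p = c.get_probability("plant", subcount=True)["plant:noun:1"]
>>> -log(p, 2)
10.482770362490861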
Parameters:
  • lu_spec (string) - lexical unit specification
  • subcount (bool) - use subcounts instead of plain counts
  • smooth (bool) - smooth counts by adding one to lexical units with a zero count
  • cat_totals (bool) - use total count for category of lexical unit instead of overall total count
  • format ('spec', 'xml', 'raw') - output format
Returns: dict
mapping of lexical units in requested output format to information content

resnik_sim(self, lu_spec1, lu_spec2, smooth=False, cat_totals=False, format=None)

Compute the semantic similarity as described in Philip Resnik's paper "Using Information Content to Evaluate Semantic Similarity in a Taxonomy" (1995). It is defined for a pair of concepts c1 and c2 as:

max IC(lcs) for all lcs in LCS(c1, c2)

In other words, the maximum value of the information content over all least common subsumers of the two concepts. An important point is that the count of an LCS, as used in computing its probability, is the sum of its own count plus the counts of all concepts that it subsumes.

As suggested by Resnik, it can be extended to _word_ similarity by taking the maximum over the scores for all concepts that are senses of the word. This means that if just two words are specified - without a category or sense - two sets of matching lexical units are retrieved. For every combination of lexical units from these two sets, the LCS is computed (if any), and the one with the maximum information content is selected.

If no LCS is found, this can mean two things:

  1. The two words have no LCS because they truly have nothing in common. In this case we assume the IC of the LCS is zero and therefore we return zero.
  2. The two words should have something in common, but the correct LCS is not present in the Cornetto database. However, since there is no way to know this, we consider this the same as (1), and zero is returned.

There are two more marginal cases:

  1. No lexical units in the Cornetto database match the specifications.
  2. All LCS have a subcount of zero, and no smoothing was applied, so their IC is undefined.

In both cases None is returned.

Notice that Resnik's word similarities are difficult to compare, because they depend on the subcounts. With identical words, for instance, resnik_sim("iets", "iets") = 1.3113543459343666, whereas resnik_sim("spotje", "spotje") = 25.141834494846584.
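
The selection logic can be sketched as follows (illustrative only; the ic mapping and lcs_set are hypothetical stand-ins for the Cornetto lookups):

def resnik_from_ic(ic, lcs_set):
    """Maximum IC over all least common subsumers, per the rules above."""
    if not lcs_set:
        return 0.0   # no LCS found: assume IC(lcs) = 0 (cases 1 and 2)
    scores = [ic[lcs] for lcs in lcs_set if ic.get(lcs) is not None]
    if not scores:
        return None  # every LCS has a zero subcount, so IC is undefined
    return max(scores)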

Parameters:
  • lu_spec1 - first lexical unit(s) specification
  • lu_spec2 - second lexical unit(s) specification
  • smooth (bool) - smooth by adding one to lexical units with a zero count
  • cat_totals (bool) - use total count for category of lexical unit instead of overall total count
  • format ('spec', 'xml', 'raw') - output format
Returns: float or None
similarity score greater than or equal to zero

jiang_conrath_dist(self, lu_spec1, lu_spec2, smooth=False, cat_totals=False, format=None)

Compute the semantic distance as described in Jay Jiang & David Conrath's paper "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy" (1997). It is defined for a pair of concepts c1 and c2 as:

min [IC(c1) + IC(c2) - 2 * IC(lcs)] for all lcs in LCS(c1, c2)

The edge and node weighting scheme from the paper is not implemented here. The measure is extended to a _word_ distance measure by taking the minimum over the scores for all concepts (lexical units) that are senses of the word (cf. the documentation of resnik_sim).

If no LCS is found, this can mean two things:

  1. The two words have no LCS because they truly have nothing in common. In this case we assume the IC of the LCS is zero and therefore we return the minimum of IC(c1) + IC(c2).
  2. The two words should have something in common, but the correct LCS is not present in the Cornetto database. However, since there is no way to know this, we consider this the same as (1), and we return the minimum of IC(c1) + IC(c2).

There are three more marginal cases:

  1. No lexical units in the Cornetto database match the specifications.
  2. All matching lexical units have a subcount of zero, and no smoothing was applied, so their IC is undefined.
  3. All LCS have a subcount of zero, and no smoothing was applied, so their IC is undefined. This implies that all subsumed lexical units must have a subcount of zero, and therefore (2) must be the case as well.

In all of these three cases None is returned.
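
For a single pair of concepts the computation can be sketched like this (illustrative; ic1, ic2 and lcs_ics are hypothetical stand-ins for the Cornetto lookups, and the word-level distance takes the minimum over all sense pairs):

def jc_dist_from_ic(ic1, ic2, lcs_ics):
    """Minimum of IC(c1) + IC(c2) - 2 * IC(lcs), per the rules above."""
    if ic1 is None or ic2 is None:
        return None           # marginal case: IC of the senses is undefined
    defined = [ic for ic in lcs_ics if ic is not None]
    if not defined:
        return ic1 + ic2      # no (usable) LCS: assume IC(lcs) = 0
    return min(ic1 + ic2 - 2 * ic for ic in defined)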

Parameters:
  • lu_spec1 - first lexical unit(s) specification
  • lu_spec2 - second lexical unit(s) specification
  • smooth (bool) - smooth by adding one to lexical units with a zero count
  • cat_totals (bool) - use total count for category of lexical unit instead of overall total count
  • format ('spec', 'xml', 'raw') - output format
Returns: float or None
distance greater than or equal to zero

jiang_conrath_sim(self, lu_spec1, lu_spec2, smooth=False, cat_totals=False, format=None)

Returns Jiang & Conrath's distance converted to a similarity by means of sim = 1 / (1 + dist); see jiang_conrath_dist.

If the distance is None, so is the similarity.

The translation from distance to similarity is not uniform. That is, the space between the distances 1 and 2 and between the distances 2 and 3 is the same (i.e. 1), but the space between the corresponding similarities, i.e. between 0.5 and 0.33 and between 0.33 and 0.25, is not.
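
The conversion itself is a one-liner (a sketch; the distances 1, 2 and 3 map to the similarities quoted above):

def jc_sim_from_dist(dist):
    """sim = 1 / (1 + dist); a None distance yields a None similarity."""
    return None if dist is None else 1.0 / (1.0 + dist)

>>> [round(jc_sim_from_dist(d), 2) for d in (1, 2, 3)]
[0.5, 0.33, 0.25]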

Parameters:
  • lu_spec1 - first lexical unit(s) specification
  • lu_spec2 - second lexical unit(s) specification
  • smooth (bool) - smooth by adding one to lexical units with a zero count
  • cat_totals (bool) - use total count for category of lexical unit instead of overall total count
  • format ('spec', 'xml', 'raw') - output format
Returns: float or None
similarity score between zero and one, inclusive

lin_sim(self, lu_spec1, lu_spec2, smooth=False, cat_totals=False, format=None)

Compute the semantic similarity as described in Dekang Lin's paper "An Information-Theoretic Definition of Similarity" (1998). It is defined for a pair of concepts c1 and c2 as:

max [2 * IC(lcs) / (IC(c1) + IC(c2))] for all lcs in LCS(c1, c2)

This measure is extended to a _word_ similarity measure by taking the maximum over the scores for all concepts (lexical units) that are senses of the word (cf. the documentation of resnik_sim).

If no LCS is found, this can mean two things:

  1. The two words have no LCS because they truly have nothing in common. In this case we assume the IC of the LCS is zero and we return zero.
  2. The two words should have something in common, but the correct LCS is not present in the Cornetto database. However, since there is no way to know this, we consider this the same as (1), and we return zero.

There are three more marginal cases:

  1. No lexical units in the Cornetto database match the specifications.
  2. All matching lexical units have a subcount of zero, and no smoothing was applied, so their IC is undefined.
  3. All LCS have a subcount of zero, and no smoothing was applied, so their IC is undefined. This implies that all subsumed lexical units must have a subcount of zero, and therefore (2) must be the case as well.

In all of these three cases None is returned.
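
A sketch for a single pair of concepts (illustrative; ic1, ic2 and lcs_ics are hypothetical stand-ins for the Cornetto lookups, and the word-level similarity takes the maximum over all sense pairs):

def lin_sim_from_ic(ic1, ic2, lcs_ics):
    """Maximum of 2 * IC(lcs) / (IC(c1) + IC(c2)), per the rules above."""
    if ic1 is None or ic2 is None:
        return None   # marginal case: IC of the senses is undefined
    defined = [ic for ic in lcs_ics if ic is not None]
    if not defined:
        return 0.0    # no (usable) LCS: assume IC(lcs) = 0
    return max(2 * ic / (ic1 + ic2) for ic in defined)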

Parameters:
  • lu_spec1 - first lexical unit(s) specification
  • lu_spec2 - second lexical unit(s) specification
  • smooth (bool) - smooth by adding one to lexical units with a zero count
  • cat_totals (bool) - use total count for category of lexical unit instead of overall total count
  • format ('spec', 'xml', 'raw') - output format
Returns: float or None
similarity score between zero and one, inclusive