Dictionary Data --------------- Research notes. There are currently 63 data files in the 'words' directory. Of these, 8 are not distinct (*biolg*, *medical*) and so there are effectively just 55 "clusters" here. There are 1754 semicolons in 4.0.dict and 1772 colons. This implies that there are approx 1650 to 1700 word clusters in 4.0.dict since many of the semi-colons appear in lines that merely define new classes. A better count of the contents of 4.0.dict yeilds 1430 distinct clusters. There seem to be 86863 word forms in the dicts Example cluster from Siva's dataset: cluster469 bets.n -- ../blah/words.n.2.s doubts.n -- ../blah/blah-29 excuses.n -- ../blah/blah-34 foes.n -- ../blah/words.n.2.s warnings.n -- ../blah/blah-29 Actual disjunct usage: select inflected_word, disjunct, count, log_cond_probability from disjuncts where inflected_word='bets.n' order by log_cond_probability; bets.n | Jp- Dmc- | 5.38320328295231 | 2.68897695164809 bets.n | Op- | 6.59906960930676 | 2.79728207561233 bets.n | Op- Dmc- | 4.49985344521703 | 2.94756384236018 bets.n | Jp- A- MXp+ MXp+ | 2.94644784927368 | 3.5584651263364 bets.n | Jp- A- | 2.8032719194889 | 3.63033016407109 bets.n | Op- Mv+ | 2.38083738088607 | 3.86597277463304 doubts.n | Op- | 14.7235374869777 | 2.53482148983126 doubts.n | Op- Dmc- | 12.8798744678498 | 2.75360123030737 doubts.n | Jp- A- | 3.70244218036532 | 4.39933529761974 doubts.n | Op- A- | 4.28538444498555 | 4.52871084843059 doubts.n | Opt- | 2.90120184421541 | 4.75116183218627 doubts.n | Jp- Dmc- | 2.40070396848023 | 5.02435498790713 excuses.n | Op- Dmc- | 5.50880998373031 | 2.32890577902052 excuses.n | Op- | 5.03419046103953 | 2.45888667993668 excuses.n | Jp- Dmc- | 4.23024629056454 | 2.70990481825512 excuses.n | Op- TOn+ | 1.90192013978957 | 3.86318980988967 excuses.n | Op- AN- TOn+ | 1.79344245046377 | 3.9479150280805 excuses.n | Opn- | 1.65557911992073 | 4.06331052106999 foes.n | Op- Dmc- | 7.72758442535996 | 3.08401721340472 foes.n | Jp- | 5.78156289178878 | 3.50257518460873 foes.n | Jp- Dmc- | 8.53048111009413 | 3.55652688759394 foes.n | Op- | 4.24155412614344 | 3.94944175213513 warnings.n | Op- | 13.1191083714365 | 2.73150374115749 warnings.n | Op- Dmc- | 12.4493113420905 | 2.80710747394272 warnings.n | Jp- Dmc- | 8.38247973471882 | 3.37772441764546 Here's another curious one: cluster992 banker.n fisherman.n illustrator.n lyricist.n mechanic.n periodical.n psychiatrist.n sculptor.n all from words.n.1 -- thus does not broaden coverage ... but are very nearly all a profession! mechanic.n | Js- Ds- | 13.7642659600825 | 2.88500665850946 mechanic.n | Os- Ds- AN- | 7.06177791953084 | 3.84783097573959 mechanic.n | AN+ | 6.95599334826693 | 3.86960587427955 mechanic.n | Js- Ds- AN- | 6.24886311846786 | 4.02426868916609 mechanic.n | Ost- Ds- R+ Bs+ | 5.70536887645721 | 4.15554226141072 fisherman.n | Ost- Ds- | 6.96868003904821 | 3.15873229404902 fisherman.n | Js- Ds- | 6.63831343245697 | 3.22880096148911 fisherman.n | Ost- Ds- A- | 5.21447241306305 | 3.57709641838825 fisherman.n | AN+ | 5.15915525704624 | 3.59248284744609 illustrator.n | Js- Ds- | 23.8048364557326 | 2.60514384269322 illustrator.n | Ost- Ds- | 16.1435659294952 | 3.55061043888198 illustrator.n | Ost- Ds- A- | 12.5473636660028 | 3.5717400794719 illustrator.n | Ost- Ds- R+ Bs+ | 6.37835476174951 | 4.43506613246927 illustrator.n | Ost- Ds- AN- | 6.57567423582078 | 4.43628494235792 illustrator.n | AN+ | 5.92789142578842 | 4.54073145072105 periodical.n | AN+ | 13.523933645105 | 2.25884735662492 periodical.n | Ost- Ds- R+ Bs+ | 4.69391736388206 | 3.78549785079099 periodical.n | Os- Ds- | 3.54950597882271 | 4.18867205040734 periodical.n | Os- Ds- Mv+ | 4.4908520579338 | 4.32151671611172 periodical.n | Js- Ds- | 3.46312434598804 | 4.51594109897193 Examined 1165 clusters, recorded 626 Examined 13026 words, and 2218422 disjuncts Average 11.181116 words/cluster; average 3543.805112 dj's/recored-cluster real 3m42.396s user 3m35.157s recorded 628 recorded 622 Examined 1165 clusters, recorded 622 Examined 12952 words, and 2239866 disjuncts Average 11.117597 words/cluster; average 3601.070740 dj's/recored-cluster Got 74 mismatch warnings fixes w/o: 226 w/: 225 bilog w/o: 38 w/: 38 To get the full-length list -- Disjunct *d1 = build_disjuncts_for_dict_node(dn); -- but is obsolete ... free_disjuncts(d1) instead, use build_sentence_disjuncts() which use build_disjuncts_for_X_node() make float pt: in build-disjuncts.c == done todo -- build_disjuncts_for_X_node == done build_clause == done build_disjunct == done build_sentence_disjuncts -- preparation.c but preparation.c ... prepare_to_parse from api.c sentence_parse and retry from link-parser with more null counts. ====================== Historical trends: enwiki/A: grep -- version 4.3.5 num_skipped_words= * | wc 773352 num_skipped_words="0" 388819 or 50.3% num_skipped_words="1" 148214 or 19.2% num_skipped_words="2" 83234 or 10.8% num_skipped_words="3" 43957 or 5.7% num_skipped_words="4" 28998 or 3.8% num_skipped_words="5" 19677 or 2.5% enwiki/E: grep --- version 4.3.5 or so num_skipped_words= * | wc 980218 num_skipped_words="0" 479076 or 48.9% num_skipped_words="1" 190183 or 19.4% num_skipped_words="2" 107265 or 10.9% num_skipped_words="3" 56875 or 5.8% num_skipped_words="4" 39240 or 4.0% num_skipped_words="5" 27431 or 2.8% enwiki/J: grep --- version -4.5.7 or so num_skipped_words= * | wc 1744284 num_skipped_words="0" 914187 or 52.4% num_skipped_words="1" 332653 or 19.1% num_skipped_words="2" 176185 or 10.1% num_skipped_words="3" 87241 or 5.0% num_skipped_words="4" 57509 or 3.3% num_skipped_words="5" 38483 or 2.2%