Module:data consistency check/documentation

This module checks the validity and internal consistency of the language, language family, and script data used on Wiktionary: the modules in Category:Language data modules as well as Module:scripts/data.

Discrepancies detected:

  • Code: EL.. Saw name: Latin. Expected name: ලතින්.
  • Code: LL.. Saw name: Latin. Expected name: ලතින්.
  • Code: ML.. Saw name: Latin. Expected name: ලතින්.
  • Code: VL.. Saw name: Latin. Expected name: ලතින්.
  • Code: abs. Saw name: Ambonese මැලේ. Expected name: Ambonese Malay.
  • Code: acw. Saw name: Hijazi අරාබි. Expected name: Hijazi Arabic.
  • Code: acy. Saw name: Cypriot අරාබි. Expected name: Cypriot Arabic.
  • Code: aeb. Saw name: Tunisian අරාබි. Expected name: Tunisian Arabic.
  • Code: afb. Saw name: Gulf අරාබි. Expected name: Gulf Arabic.
  • Code: ajp. Saw name: South Levantine අරාබි. Expected name: South Levantine Arabic.
  • Code: ang. Saw name: Old ඉංග්‍රීසි. Expected name: Old English.
  • Code: apc. Saw name: North Levantine අරාබි. Expected name: North Levantine Arabic.
  • Code: ary. Saw name: Moroccan අරාබි. Expected name: Moroccan Arabic.
  • Code: arz. Saw name: Egyptian අරාබි. Expected name: Egyptian Arabic.
  • Code: ayl. Saw name: Libyan අරාබි. Expected name: Libyan Arabic.
  • Code: br. Saw name: Breton. Expected name: බ්‍රෙටන්.
  • Code: cmn-ear. Saw name: Mandarin. Expected name: මැන්ඩරීන්.
  • Code: cy. Saw name: Welsh. Expected name: වේල්ස.
  • Code: dra-okn. Saw name: Old කන්නඩ. Expected name: Old Kannada.
  • Code: dum. Saw name: Middle ඕලන්ද. Expected name: Middle Dutch.
  • Code: enm. Saw name: Middle ඉංග්‍රීසි. Expected name: මධ්‍යකාලීන ඉංග්‍රීසි.
  • Code: fr-CA. Saw name: French. Expected name: ප්‍රංශ.
  • Code: frk. Saw name: Proto-West Germanic. Expected name: ප්‍රොටෝ-බටහිර ජර්මානු.
  • Code: frm. Saw name: Middle ප්‍රංශ. Expected name: මධ්‍යකාලීන ප්‍රංශ.
  • Code: fro. Saw name: Old ප්‍රංශ. Expected name: Old French.
  • Code: gd. Saw name: Scottish Gaelic. Expected name: ස්කොට්ස් ගේලික්.
  • Code: gem. Saw name: Germanic. Expected name: ජර්මානු.
  • Code: gkm. Saw name: Ancient Greek. Expected name: පුරාතන ග්‍රීක.
  • Code: gmh. Saw name: Middle High ජර්මානු. Expected name: Middle High German.
  • Code: gml. Saw name: Middle Low ජර්මානු. Expected name: Middle Low German.
  • Code: gmq-mno. Saw name: Middle නෝර්වීජියානු. Expected name: Middle Norwegian.
  • Code: gmq-oda. Saw name: Old ඩෙන්මාර්ක. Expected name: Old Danish.
  • Code: gmq-osw. Saw name: Old ස්වීඩන්. Expected name: Old Swedish.
  • Code: gmw-ecg. Saw name: East Central ජර්මානු. Expected name: East Central German.
  • Code: gmw-jdt. Saw name: Jersey ඕලන්ද. Expected name: Jersey Dutch.
  • Code: gmy. Saw name: Mycenaean ග්‍රීක. Expected name: Mycenaean Greek.
  • Code: goh. Saw name: Old High ජර්මානු. Expected name: Old High German.
  • Code: grk-mar. Saw name: Mariupol ග්‍රීක. Expected name: Mariupol Greek.
  • Code: gsw. Saw name: Alemannic ජර්මානු. Expected name: Alemannic German.
  • Code: gv. Saw name: Manx. Expected name: මැන්ක්ස්.
  • Code: idb. Saw name: Indo-පෘතුගීසි. Expected name: Indo-Portuguese.
  • Code: inc-ash. Saw name: Ashokan ප්‍රාකෘත. Expected name: Ashokan Prakrit.
  • Code: itc-ola. Saw name: Latin. Expected name: ලතින්.
  • Code: kaw. Saw name: Old ජාවා. Expected name: Old Javanese.
  • Code: kw. Saw name: Cornish. Expected name: කෝනිෂ්.
  • Code: kxd. Saw name: Brunei මැලේ. Expected name: Brunei Malay.
  • Code: la-ecc. Saw name: Latin. Expected name: ලතින්.
  • Code: la-lat. Saw name: Latin. Expected name: ලතින්.
  • Code: la-med. Saw name: Latin. Expected name: ලතින්.
  • Code: la-vul. Saw name: Latin. Expected name: ලතින්.
  • Code: ltc. Saw name: Middle චීන. Expected name: Middle Chinese.
  • Code: meo. Saw name: Kedah මැලේ. Expected name: Kedah Malay.
  • Code: mga. Saw name: Middle අයිරිෂ්. Expected name: Middle Irish.
  • Code: ms-cla. Saw name: Malay. Expected name: මැලේ.
  • Code: ms-old. Saw name: Malay. Expected name: මැලේ.
  • Code: nds. Saw name: Low ජර්මානු. Expected name: Low German.
  • Code: nds-de. Saw name: German Low ජර්මානු. Expected name: German Low German.
  • Code: nod. Saw name: Northern තායි. Expected name: Northern Thai.
  • Code: obr. Saw name: Old බුරුම. Expected name: Old Burmese.
  • Code: och. Saw name: Old චීන. Expected name: Old Chinese.
  • Code: odt. Saw name: Old ඕලන්ද. Expected name: Old Dutch.
  • Code: oge. Saw name: Old ජෝර්ජියානු. Expected name: Old Georgian.
  • Code: ohu. Saw name: Old හංගේරියානු. Expected name: Old Hungarian.
  • Code: ojp. Saw name: Old ජපන්. Expected name: Old Japanese.
  • Code: okm. Saw name: Middle කොරියානු. Expected name: Middle Korean.
  • Code: oko. Saw name: Old කොරියානු. Expected name: Old Korean.
  • Code: osp. Saw name: Old ස්පාඤ්ඤ. Expected name: Old Spanish.
  • Code: ota. Saw name: Ottoman තුර්කි. Expected name: Ottoman Turkish.
  • Code: pal. Saw name: Middle පර්සියානු. Expected name: Middle Persian.
  • Code: pdc. Saw name: Pennsylvania ජර්මානු. Expected name: Pennsylvania German.
  • Code: peo. Saw name: Old පර්සියානු. Expected name: Old Persian.
  • Code: rmg. Saw name: Traveller නෝර්වීජියානු. Expected name: Traveller Norwegian.
  • Code: roa-opt. Saw name: Old Galician-පෘතුගීසි. Expected name: Old Galician-Portuguese.
  • Code: ruo. Saw name: Istro-රුමේනියානු. Expected name: Istro-Romanian.
  • Code: ruq. Saw name: Megleno-රුමේනියානු. Expected name: Megleno-Romanian.
  • Code: sa-ved. Saw name: Sanskrit. Expected name: සංස්කෘත.
  • Code: sga. Saw name: Old අයිරිෂ්. Expected name: Old Irish.
  • Code: sit-pro. Saw name: Proto-Sino-ටිබෙට්. Expected name: Proto-Sino-Tibetan.
  • Code: sou. Saw name: Southern තායි. Expected name: Southern Thai.
  • Code: tbq-lob-pro. Saw name: Proto-Lolo-බුරුම. Expected name: Proto-Lolo-Burmese.
  • Code: trk-oat. Saw name: Old Anatolian තුර්කි. Expected name: Old Anatolian Turkish.
  • Code: vec. Saw name: Venetan. Expected name: Venetian.
  • Code: xaa. Saw name: Andalusian අරාබි. Expected name: Andalusian Arabic.
  • Code: xcl. Saw name: Old ආමේනියානු. Expected name: Old Armenian.
  • Code: zlw-ocs. Saw name: Old චෙක්. Expected name: Old Czech.
  • Code: zlw-opl. Saw name: Old පෝලන්ත. Expected name: Old Polish.
  • Literary Chinese, the canonical name for the code lzh-lit, is wrong; it should be Literary Chinese.
  • The code xln and the canonical name Alanic should be removed; they are not found in Module:etymology languages/data.
  • The code oos and the canonical name Old Ossetic should be removed; they are not found in Module:etymology languages/data.
  • Literary Chinese භාෂාව (lzh-lit) has a canonical name that is not unique; it is also used by the code lzh.
  • The data key preprocess_links for ??? (th-new) is invalid.
  • The code ira-mid and the canonical name Middle Iranian should be removed; they are not found in Module:families/data.
  • The code ira-old and the canonical name Old Iranian should be removed; they are not found in Module:families/data.
  • The canonical name Abu' Arapesh (aah) is missing.
  • Abu', the canonical name for the code aah, is wrong; it should be Abu' Arapesh.
  • The code hnm and the canonical name Hainanese should be removed; they are not found in a submodule of Module:languages.
  • The canonical name Itzá (itz) is missing.
  • Itza', the canonical name for the code itz, is wrong; it should be Itzá.
  • The code luh and the canonical name Leizhou Min should be removed; they are not found in a submodule of Module:languages.
  • The code sjc and the canonical name Shaojiang Min should be removed; they are not found in a submodule of Module:languages.
  • The canonical name Venetian (vec) is missing.
  • Venetan, the canonical name for the code vec, is wrong; it should be Venetian.
  • The code xln and the canonical name Alanic should be removed; they are not found in a submodule of Module:languages.
  • Proto-Central Togo භාෂාව (alv-gtm-pro) does not have the expected name "Proto-Ghana-Togo Mountain", even though it is the proto-language of the Ghana-Togo Mountain භාෂා (alv-gtm).
  • Proto-Arawa භාෂාව (auf-pro) does not have the expected name "Proto-Arauan", even though it is the proto-language of the Arauan භාෂා (auf).
  • Proto-Amuesha-Chamicuro භාෂාව (awd-amc-pro) has a proto-language code associated with the invalid code awd-amc.
  • Proto-Kampa භාෂාව (awd-kmp-pro) has a proto-language code associated with the invalid code awd-kmp.
  • Proto-Arawak භාෂාව (awd-pro) does not have the expected name "Proto-Arawakan", even though it is the proto-language of the Arawakan භාෂා (awd).
  • Proto-Paresi-Waura භාෂාව (awd-prw-pro) has a proto-language code associated with the invalid code awd-prw.
  • Proto-Ta-Arawak භාෂාව (awd-taa-pro) does not have the expected name "Proto-Ta-Arawakan", even though it is the proto-language of the Ta-Arawakan භාෂා (awd-taa).
  • Proto-Rukai භාෂාව (dru-pro) has a proto-language code associated with Rukai (dru), which is not a family.
  • Proto-Basque භාෂාව (euq-pro) does not have the expected name "Proto-Vasconic", even though it is the proto-language of the Vasconic භාෂා (euq).
  • Proto-Germanic භාෂාව (gem-pro) does not have the expected name "Proto-ජර්මානු", even though it is the proto-language of the ජර්මානු භාෂා (gem).
  • Proto-Norse භාෂාව (gmq-pro) does not have the expected name "Proto-North Germanic", even though it is the proto-language of the North Germanic භාෂා (gmq).
  • ප්‍රොටෝ-බටහිර ජර්මානු භාෂාව (gmw-pro) does not have the expected name "Proto-බටහිර ජර්මානු", even though it is the proto-language of the බටහිර ජර්මානු භාෂා (gmw).
  • Proto-Kamta භාෂාව (inc-krn-pro) does not have the expected name "Proto-KRNB lects", even though it is the proto-language of the KRNB lects (inc-krn).
  • ප්‍රොටෝ-ඉන්දු-යුරෝපීය භාෂාව (ine-pro) does not have the expected name "Proto-ඉන්දු-යුරෝපීය", even though it is the proto-language of the ඉන්දු-යුරෝපීය භාෂා (ine).
  • Kelantan Peranakan Hokkien, the canonical name for mis-hkl, is repeated in the table of aliases.
  • Proto-Chumash භාෂාව (nai-chu-pro) does not have the expected name "Proto-Chumashan", even though it is the proto-language of the Chumashan භාෂා (nai-chu).
  • Proto-Maidun භාෂාව (nai-mdu-pro) does not have the expected name "Proto-Maiduan", even though it is the proto-language of the Maiduan භාෂා (nai-mdu).
  • Proto-Mixe-Zoque භාෂාව (nai-miz-pro) does not have the expected name "Proto-Mixe-Zoquean", even though it is the proto-language of the Mixe-Zoquean භාෂා (nai-miz).
  • Proto-Pomo භාෂාව (nai-pom-pro) does not have the expected name "Proto-Pomoan", even though it is the proto-language of the Pomoan භාෂා (nai-pom).
  • Proto-Mazatec භාෂාව (omq-maz-pro) does not have the expected name "Proto-Mazatecan", even though it is the proto-language of the Mazatecan භාෂා (omq-maz).
  • Proto-Ossetic භාෂාව (os-pro) has a proto-language code associated with Ossetian (os), which is not a family.
  • Proto-Ossetic භාෂාව (os-pro) lists the invalid language code xln as its ancestor.
  • Proto-North Sarawak භාෂාව (poz-swa-pro) does not have the expected name "Proto-North Sarawakan", even though it is the proto-language of the North Sarawakan භාෂා (poz-swa).
  • Proto-Salish භාෂාව (sal-pro) does not have the expected name "Proto-Salishan", even though it is the proto-language of the Salishan භාෂා (sal).
  • Proto-Samic භාෂාව (smi-pro) does not have the expected name "Proto-Sami", even though it is the proto-language of the Sami භාෂා (smi).
  • Proto-Kuki-Chin භාෂාව (tbq-kuk-pro) does not have the expected name "Proto-Kukish", even though it is the proto-language of the Kukish භාෂා (tbq-kuk).
  • Proto-Saka භාෂාව (xsc-sak-pro) does not have the expected name "Proto-Sakan", even though it is the proto-language of the Sakan භාෂා (xsc-sak).
  • Proto-Sarmatian භාෂාව (xsc-sar-pro) has a proto-language code associated with the invalid code xsc-sar.
  • Blissymbols script (Blis) is not used by any language and has no characters listed for auto-detection.
  • Cypro-Minoan script (Cpmn) is not used by any language.
  • හිරගනා script (Hira) is not used by any language.
  • Kana script (Hrkt) is not used by any language.
  • Image-rendered script (Image) is not used by any language and has no characters listed for auto-detection.
  • International Phonetic Alphabet script (Ipach) is not used by any language and has no characters listed for auto-detection.
  • Moon script (Moon) is not used by any language and has no characters listed for auto-detection.
  • Morse code (Morse) is not used by any language and has no characters listed for auto-detection.
  • Musical notation script (Music) is not used by any language.
  • Unspecified script (None) is not used by any language and has no characters listed for auto-detection.
  • Rongorongo script (Roro) is not used by any language and has no characters listed for auto-detection.
  • Rumi numerals script (Rumin) is not used by any language.
  • flag semaphore (Semap) is not used by any language and has no characters listed for auto-detection.
  • Visible Speech script (Visp) is not used by any language and has no characters listed for auto-detection.
  • mathematical notation script (Zmth) is not used by any language.
  • symbol script (Zsym) is not used by any language.
  • undetermined script (Zyyy) is not used by any language and has no characters listed for auto-detection.
  • uncoded script (Zzzz) is not used by any language and has no characters listed for auto-detection.
  • The codes fa-Arab, ug-Arab, ks-Arab, ps-Arab, ur-Arab, tt-Arab, ota-Arab, ku-Arab, mzn-Arab and sd-Arab are currently alias codes. Only one code should be used in the data.
  • The codes ms-Arab and kk-Arab are currently alias codes. Only one code should be used in the data.
  • The data key sort_by_scraping for ජපන් script (Jpan) is invalid.
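
The proto-language messages in the list above suggest a simple naming rule: a code ending in -pro is expected to belong to a family, and its name is expected to be "Proto-" followed by the family name. The module itself is written in Lua; the following is only a Python sketch of that rule as inferred from the messages. The function name `check_proto_language` and the two lookup dictionaries are hypothetical, and the message wording is loosely modelled on the output rather than copied from the module.

```python
def check_proto_language(code, name, family_names, language_names):
    """Return a message describing a proto-language problem, or None.

    family_names and language_names are hypothetical lookups that map
    codes to canonical names; the real module reads its data tables.
    """
    suffix = "-pro"
    if not code.endswith(suffix):
        return None  # not a proto-language code, nothing to check
    base = code[: -len(suffix)]
    if base in family_names:
        family = family_names[base]
        expected = "Proto-" + family
        if name != expected:
            return (
                f'{name} ({code}) does not have the expected name "{expected}", '
                f"even though it is the proto-language of the {family} languages ({base})."
            )
        return None
    if base in language_names:
        return (
            f"{name} ({code}) has a proto-language code associated with "
            f"{language_names[base]} ({base}), which is not a family."
        )
    return f"{name} ({code}) has a proto-language code associated with the invalid code {base}."
```

Called with the code sal-pro, the name Proto-Salish and a family table containing sal = Salishan, the sketch reports that the expected name is "Proto-Salishan", which mirrors the corresponding entry in the list.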

Checks performed

For multiple data modules:

  • Codes for languages, families and etymology-only languages must be unique and cannot clash with one another.
  • Canonical names for languages, families, and etymology-only languages must not be found in the list of other names.
  • Each name in the list of other names must appear only once.
  • otherNames, if present, must be an array.
  • Each Wikidata item ID must be either a positive integer or a string starting with Q and ending with decimal digits.
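
Two of the checks above lend themselves to a short illustration: the Wikidata ID format and the rules for the list of other names. The module is written in Lua, so the Python below is only a sketch of the logic. The function names are hypothetical, and the sketch assumes that "starting with Q and ending with decimal digits" means the letter Q followed by nothing but digits.

```python
import re


def is_valid_wikidata_id(value):
    """Accept a positive integer or a string such as "Q1860"."""
    if isinstance(value, bool):
        return False  # bool is a subclass of int in Python; reject it explicitly
    if isinstance(value, int):
        return value > 0
    if isinstance(value, str):
        # Assumption: "Q" followed only by decimal digits.
        return re.fullmatch(r"Q[0-9]+", value) is not None
    return False


def check_other_names(canonical_name, other_names):
    """Return a list of problems found in one entry's otherNames."""
    if other_names is None:
        return []  # otherNames is optional
    if not isinstance(other_names, list):
        return ["otherNames is not an array"]
    problems = []
    seen = set()
    for name in other_names:
        if name == canonical_name:
            problems.append(f"canonical name {name!r} is repeated in otherNames")
        if name in seen:
            problems.append(f"{name!r} appears more than once in otherNames")
        seen.add(name)
    return problems
```

Returning a list of problems rather than raising on the first one matches the way the page reports every discrepancy at once.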

The following must be true of the data used by Module:languages:

  • Each code must be defined in the correct submodule according to whether it is two-letter, three-letter or exceptional.
  • The canonical name (field 1) must be present and must not be the same as the canonical name of another language.
  • If field 2 is not nil, it must be a valid Wikidata item ID.
  • If field 3 or family is given and not nil, it must be a valid family code.
  • If field 4 or scripts is given and not nil, it must be an array, and each string in the array must be a valid script code.
  • If ancestors is given, it must be an array, and each string in the array must be a valid language or etymology language code.
  • If family is given, it must be a valid family code.
  • If type is given, it must be one of the recognised values (regular, reconstructed, appendix-constructed).
  • If entry_name is given, it must be a table that contains either two arrays (from and to) or a string (remove_diacritics) or both.
  • If sort_key is given, it must be either a string or a table that in turn contains either two arrays (from and to) or a string (remove_diacritics).
  • If entry_name or sort_key is given, the from array must be at least as long as the to array.
  • If standardChars is given, it must form a valid Lua string pattern when placed between square brackets with ^ before it ("[^...]"). (It should match all characters regularly used in the language, but that cannot be tested.)
  • If override_translit is set, translit must also be set, because there must be a transliteration module that can override manual transliteration.
  • If link_tr is present, it must be true.
  • There must be no data keys besides these: 1, 2, 3, "entry_name", "sort_key", "display", "otherNames", "aliases", "varieties", "type", "scripts", "ancestors", "wikimedia_codes", "wikipedia_article", "standardChars", "translit", "override_translit", "link_tr".
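
Several of the per-language rules above are simple structural tests. The following Python sketch shows how a few of them might be expressed. It is not the module's Lua code: the function names are hypothetical, a language entry is modelled as a Python dict whose integer keys stand for the numbered fields, and the labels returned by `expected_submodule` are placeholders rather than real module names. Checks that need the full data set, or a Lua pattern engine, are left out.

```python
ALLOWED_KEYS = {
    1, 2, 3, "entry_name", "sort_key", "display", "otherNames", "aliases",
    "varieties", "type", "scripts", "ancestors", "wikimedia_codes",
    "wikipedia_article", "standardChars", "translit", "override_translit",
    "link_tr",
}
RECOGNISED_TYPES = {"regular", "reconstructed", "appendix-constructed"}


def expected_submodule(code):
    """Classify a code as two-letter, three-letter or exceptional."""
    if code.isascii() and code.isalpha() and code.islower():
        if len(code) == 2:
            return "two-letter"
        if len(code) == 3:
            return "three-letter"
    return "exceptional"


def check_language(code, data):
    """Return a list of problems for one language entry (a dict)."""
    problems = []
    for key in data:
        if key not in ALLOWED_KEYS:
            problems.append(f"The data key {key} for {code} is invalid.")
    if not data.get(1):
        problems.append(f"{code} has no canonical name.")
    if "type" in data and data["type"] not in RECOGNISED_TYPES:
        problems.append(f"{code} has the unrecognised type {data['type']}.")
    if data.get("override_translit") and not data.get("translit"):
        problems.append(f"{code} sets override_translit without translit.")
    if "link_tr" in data and data["link_tr"] is not True:
        problems.append(f"{code} has link_tr set to something other than true.")
    for key in ("entry_name", "sort_key"):
        value = data.get(key)
        if isinstance(value, dict):
            # The from array must be at least as long as the to array.
            if len(value.get("from", [])) < len(value.get("to", [])):
                problems.append(f"{code}: the from array of {key} is shorter than the to array.")
    return problems
```

An entry carrying a key outside the allowed set, such as preprocess_links, produces a message of the same shape as the "data key ... is invalid" discrepancies reported on this page.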

Checks not performed:

  • If translit is present, it should be the name of a module, and this module should contain a tr function that takes a pagename (and optionally a language code and script code) as arguments.
  • If sort_key is a string, it should be the name of a module, and this module should contain a makeSortKey function that takes a pagename (and optionally a language code and script code) as arguments.
  • If entry_name or sort_key is a table and contains a field remove_diacritics, the value of the field should be a string that forms a valid Lua pattern when it is placed inside negated set notation ([^...]).

These are not checked here because module errors will quickly appear in entries if the conditions are not met, provided that Module:utilities attempts to generate a sort key for a category pertaining to the language in question, or that full_link attempts to use the transliteration module.

Module:languages/code to canonical name and Module:languages/canonical names must contain all the codes and canonical names found in the data submodules of Module:languages, and no more.
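
The "all the codes and canonical names, and no more" requirement amounts to a two-way comparison between the language data and the two lookup tables. The Python sketch below illustrates that comparison under simplifying assumptions: the three inputs are plain dicts, the function name `compare_lookup_tables` is hypothetical, and the messages only approximate the wording of the module's output.

```python
def compare_lookup_tables(language_data, code_to_name, name_to_code):
    """Compare the language data against both lookup tables.

    language_data maps code -> entry, where entry[1] is the canonical name.
    code_to_name and name_to_code stand for the two lookup modules.
    """
    problems = []
    for code, data in language_data.items():
        name = data[1]
        if code not in code_to_name:
            problems.append(f"The code {code} ({name}) is missing.")
        elif code_to_name[code] != name:
            problems.append(
                f"{code_to_name[code]}, the canonical name for the code {code}, "
                f"is wrong; it should be {name}."
            )
        if name not in name_to_code:
            problems.append(f"The canonical name {name} ({code}) is missing.")
        elif name_to_code[name] != code:
            problems.append(
                f"{name_to_code[name]}, the code for the canonical name {name}, "
                f"is wrong; it should be {code}."
            )
    # Entries in the lookup tables that the data no longer backs up.
    for code, name in code_to_name.items():
        if code not in language_data:
            problems.append(f"The code {code} and the canonical name {name} should be removed.")
    canonical_names = {data[1] for data in language_data.values()}
    for name, code in name_to_code.items():
        if name not in canonical_names:
            problems.append(f"The canonical name {name} and the code {code} should be removed.")
    return problems
```

With data that names vec "Venetian" and lookup tables that still say "Venetan", the sketch reports both a wrong name and a missing canonical name, the same pair of messages that appears for vec in the discrepancy list.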

The following must be true of the data used by Module:etymology languages:

  • canonicalName must be given.
  • parent must be given and must be a valid language, family or etymology-only language code.
  • If ancestors is given, it must be an array, and each string in the array must be a valid language or etymology language code. The etymology language should also be listed as the ancestor of a regular language.
  • There must be no data keys besides these: "canonicalName", "otherNames", "parent", "ancestors", "wikipedia_article", "wikidata_item".
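
The parent and ancestor rules for etymology-only languages depend on three data sets at once, which a small sketch may make easier to follow. The Python below is an illustration only, not the module's Lua code. The function name, the dict-based data layout and the message wording are all assumptions made for the example.

```python
def check_etymology_languages(etym_data, language_data, family_codes):
    """Return problems found in the etymology-only language data."""
    problems = []
    # A parent may be a language, a family or another etymology-only language.
    valid_parents = set(language_data) | set(family_codes) | set(etym_data)
    # An ancestor may be a language or an etymology-only language.
    valid_ancestors = set(language_data) | set(etym_data)
    # Codes that at least one regular language names as an ancestor.
    used_as_ancestor = {
        ancestor
        for data in language_data.values()
        for ancestor in data.get("ancestors", [])
    }
    for code, data in etym_data.items():
        if not data.get("canonicalName"):
            problems.append(f"{code} has no canonicalName.")
        parent = data.get("parent")
        if parent is None:
            problems.append(f"{code} has no parent.")
        elif parent not in valid_parents:
            problems.append(f"{code} has the invalid parent code {parent}.")
        ancestors = data.get("ancestors")
        if ancestors is None:
            continue
        if not isinstance(ancestors, list):
            problems.append(f"The ancestors of {code} are not an array.")
            continue
        for ancestor in ancestors:
            if ancestor not in valid_ancestors:
                problems.append(f"{code} lists the invalid language code {ancestor} as its ancestor.")
        if code not in used_as_ancestor:
            problems.append(f"{code} is not listed as the ancestor of any regular language.")
    return problems
```

The last test reflects the final sentence of the ancestors rule: an etymology-only language that declares ancestors is itself expected to appear in the ancestors of some regular language.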

Codes in Module:families data must:

  • Have canonicalName, which must not be the same as the canonical name of another family.
  • Have a valid family code as family, if family is given.
  • Have at least one language or subfamily belonging to it.
  • Have no data keys besides these: "canonicalName", "otherNames", "family", "protoLanguage", "wikidata_item".
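
The requirement that every family has at least one language or subfamily is a reachability test over the language and family data. A minimal Python sketch follows, assuming dict-based data in which a language stores its family in field 3 or under family. The function name and messages are hypothetical, and the sketch assumes that a family naming itself as its own parent does not count as a subfamily.

```python
from collections import Counter


def check_families(family_data, language_data):
    """Return problems found in the family data."""
    problems = []
    name_counts = Counter(data.get("canonicalName") for data in family_data.values())
    # Collect every family code that has at least one member.
    populated = set()
    for data in language_data.values():
        family = data.get(3) or data.get("family")
        if family:
            populated.add(family)
    for code, data in family_data.items():
        parent = data.get("family")
        if parent and parent != code:  # assumption: self-reference is not membership
            populated.add(parent)
    for code, data in family_data.items():
        name = data.get("canonicalName")
        if not name:
            problems.append(f"{code} has no canonicalName.")
        elif name_counts[name] > 1:
            problems.append(f"{code} shares the canonical name {name} with another family.")
        parent = data.get("family")
        if parent is not None and parent not in family_data:
            problems.append(f"{code} has the invalid family code {parent}.")
        if code not in populated:
            problems.append(f"{code} has no languages or subfamilies.")
    return problems
```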

Codes in Module:scripts data must:

  • Have canonicalName.
  • Have at least one language that lists it as one of its scripts.
  • Have a characters pattern for script autodetection, and this must form a valid Lua string pattern when placed between square brackets ("[...]"). (It should match all characters in the script, but that cannot be tested.)
  • Have no data keys besides these: "canonicalName", "otherNames", "parent", "systems", "wikipedia_article", "characters", "direction".
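
The two script checks that produce most of the script messages on this page (usage by a language, and the presence of a characters pattern) can be sketched as below. This is a Python illustration rather than the module's Lua code. The function name is hypothetical, languages are assumed to list their scripts in field 4 or under scripts, and the validity of the characters value as a Lua pattern is not tested because that requires a Lua pattern engine.

```python
def check_scripts(script_data, language_data):
    """Return problems found in the script data."""
    # Every script code listed by at least one language.
    used = set()
    for data in language_data.values():
        used.update(data.get(4) or data.get("scripts") or [])
    problems = []
    for code, data in script_data.items():
        faults = []
        if code not in used:
            faults.append("is not used by any language")
        if not data.get("characters"):
            faults.append("has no characters listed for auto-detection")
        if faults:
            name = data.get("canonicalName", "???")
            problems.append(f"{name} script ({code}) " + " and ".join(faults) + ".")
    return problems
```

Combining both faults into one sentence reproduces the two message shapes seen in the discrepancy list: "is not used by any language." on its own, and the longer form that also mentions the missing characters.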
