Module:data consistency check/documentation
This module checks the validity and internal consistency of the language, language family, and script data used on Wiktionary: the modules in Category:Language data modules as well as Module:scripts/data.
Output
Discrepancies detected:
- Code: EL. Saw name: Latin. Expected name: ලතින්.
- Code: LL. Saw name: Latin. Expected name: ලතින්.
- Code: ML. Saw name: Latin. Expected name: ලතින්.
- Code: VL. Saw name: Latin. Expected name: ලතින්.
- Code: abs. Saw name: Ambonese මැලේ. Expected name: Ambonese Malay.
- Code: acw. Saw name: Hijazi අරාබි. Expected name: Hijazi Arabic.
- Code: acy. Saw name: Cypriot අරාබි. Expected name: Cypriot Arabic.
- Code: aeb. Saw name: Tunisian අරාබි. Expected name: Tunisian Arabic.
- Code: afb. Saw name: Gulf අරාබි. Expected name: Gulf Arabic.
- Code: ajp. Saw name: South Levantine අරාබි. Expected name: South Levantine Arabic.
- Code: ang. Saw name: Old ඉංග්රීසි. Expected name: Old English.
- Code: apc. Saw name: North Levantine අරාබි. Expected name: North Levantine Arabic.
- Code: ary. Saw name: Moroccan අරාබි. Expected name: Moroccan Arabic.
- Code: arz. Saw name: Egyptian අරාබි. Expected name: Egyptian Arabic.
- Code: ayl. Saw name: Libyan අරාබි. Expected name: Libyan Arabic.
- Code: br. Saw name: Breton. Expected name: බ්රෙටන්.
- Code: cmn-ear. Saw name: Mandarin. Expected name: මැන්ඩරීන්.
- Code: cy. Saw name: Welsh. Expected name: වේල්ස.
- Code: dra-okn. Saw name: Old කන්නඩ. Expected name: Old Kannada.
- Code: dum. Saw name: Middle ඕලන්ද. Expected name: Middle Dutch.
- Code: enm. Saw name: Middle ඉංග්රීසි. Expected name: මධ්යකාලීන ඉංග්රීසි.
- Code: fr-CA. Saw name: French. Expected name: ප්රංශ.
- Code: frk. Saw name: Proto-West Germanic. Expected name: ප්රොටෝ-බටහිර ජර්මානු.
- Code: frm. Saw name: Middle ප්රංශ. Expected name: මධ්යකාලීන ප්රංශ.
- Code: fro. Saw name: Old ප්රංශ. Expected name: Old French.
- Code: gd. Saw name: Scottish Gaelic. Expected name: ස්කොට්ස් ගේලික්.
- Code: gem. Saw name: Germanic. Expected name: ජර්මානු.
- Code: gkm. Saw name: Ancient Greek. Expected name: පුරාතන ග්රීක.
- Code: gmh. Saw name: Middle High ජර්මානු. Expected name: Middle High German.
- Code: gml. Saw name: Middle Low ජර්මානු. Expected name: Middle Low German.
- Code: gmq-mno. Saw name: Middle නෝර්වීජියානු. Expected name: Middle Norwegian.
- Code: gmq-oda. Saw name: Old ඩෙන්මාර්ක. Expected name: Old Danish.
- Code: gmq-osw. Saw name: Old ස්වීඩන්. Expected name: Old Swedish.
- Code: gmw-ecg. Saw name: East Central ජර්මානු. Expected name: East Central German.
- Code: gmw-jdt. Saw name: Jersey ඕලන්ද. Expected name: Jersey Dutch.
- Code: gmy. Saw name: Mycenaean ග්රීක. Expected name: Mycenaean Greek.
- Code: goh. Saw name: Old High ජර්මානු. Expected name: Old High German.
- Code: grk-mar. Saw name: Mariupol ග්රීක. Expected name: Mariupol Greek.
- Code: gsw. Saw name: Alemannic ජර්මානු. Expected name: Alemannic German.
- Code: gv. Saw name: Manx. Expected name: මැන්ක්ස්.
- Code: idb. Saw name: Indo-පෘතුගීසි. Expected name: Indo-Portuguese.
- Code: inc-ash. Saw name: Ashokan ප්රාකෘත. Expected name: Ashokan Prakrit.
- Code: itc-ola. Saw name: Latin. Expected name: ලතින්.
- Code: kaw. Saw name: Old ජාවා. Expected name: Old Javanese.
- Code: kw. Saw name: Cornish. Expected name: කෝනිෂ්.
- Code: kxd. Saw name: Brunei මැලේ. Expected name: Brunei Malay.
- Code: la-ecc. Saw name: Latin. Expected name: ලතින්.
- Code: la-lat. Saw name: Latin. Expected name: ලතින්.
- Code: la-med. Saw name: Latin. Expected name: ලතින්.
- Code: la-vul. Saw name: Latin. Expected name: ලතින්.
- Code: ltc. Saw name: Middle චීන. Expected name: Middle Chinese.
- Code: meo. Saw name: Kedah මැලේ. Expected name: Kedah Malay.
- Code: mga. Saw name: Middle අයිරිෂ්. Expected name: Middle Irish.
- Code: ms-cla. Saw name: Malay. Expected name: මැලේ.
- Code: ms-old. Saw name: Malay. Expected name: මැලේ.
- Code: nds. Saw name: Low ජර්මානු. Expected name: Low German.
- Code: nds-de. Saw name: German Low ජර්මානු. Expected name: German Low German.
- Code: nod. Saw name: Northern තායි. Expected name: Northern Thai.
- Code: obr. Saw name: Old බුරුම. Expected name: Old Burmese.
- Code: och. Saw name: Old චීන. Expected name: Old Chinese.
- Code: odt. Saw name: Old ඕලන්ද. Expected name: Old Dutch.
- Code: oge. Saw name: Old ජෝර්ජියානු. Expected name: Old Georgian.
- Code: ohu. Saw name: Old හංගේරියානු. Expected name: Old Hungarian.
- Code: ojp. Saw name: Old ජපන්. Expected name: Old Japanese.
- Code: okm. Saw name: Middle කොරියානු. Expected name: Middle Korean.
- Code: oko. Saw name: Old කොරියානු. Expected name: Old Korean.
- Code: osp. Saw name: Old ස්පාඤ්ඤ. Expected name: Old Spanish.
- Code: ota. Saw name: Ottoman තුර්කි. Expected name: Ottoman Turkish.
- Code: pal. Saw name: Middle පර්සියානු. Expected name: Middle Persian.
- Code: pdc. Saw name: Pennsylvania ජර්මානු. Expected name: Pennsylvania German.
- Code: peo. Saw name: Old පර්සියානු. Expected name: Old Persian.
- Code: rmg. Saw name: Traveller නෝර්වීජියානු. Expected name: Traveller Norwegian.
- Code: roa-opt. Saw name: Old Galician-පෘතුගීසි. Expected name: Old Galician-Portuguese.
- Code: ruo. Saw name: Istro-රුමේනියානු. Expected name: Istro-Romanian.
- Code: ruq. Saw name: Megleno-රුමේනියානු. Expected name: Megleno-Romanian.
- Code: sa-ved. Saw name: Sanskrit. Expected name: සංස්කෘත.
- Code: sga. Saw name: Old අයිරිෂ්. Expected name: Old Irish.
- Code: sit-pro. Saw name: Proto-Sino-ටිබෙට්. Expected name: Proto-Sino-Tibetan.
- Code: sou. Saw name: Southern තායි. Expected name: Southern Thai.
- Code: tbq-lob-pro. Saw name: Proto-Lolo-බුරුම. Expected name: Proto-Lolo-Burmese.
- Code: trk-oat. Saw name: Old Anatolian තුර්කි. Expected name: Old Anatolian Turkish.
- Code: vec. Saw name: Venetan. Expected name: Venetian.
- Code: xaa. Saw name: Andalusian අරාබි. Expected name: Andalusian Arabic.
- Code: xcl. Saw name: Old ආමේනියානු. Expected name: Old Armenian.
- Code: zlw-ocs. Saw name: Old චෙක්. Expected name: Old Czech.
- Code: zlw-opl. Saw name: Old පෝලන්ත. Expected name: Old Polish.
- Literary Chinese, the canonical name for the code lzh-lit, is wrong; it should be Literary Chinese.
- The code xln and the canonical name Alanic should be removed; they are not found in Module:etymology languages/data.
- Literary Chinese, the canonical name for the code lzh-lit, is wrong; it should be Literary Chinese.
- The code oos and the canonical name Old Ossetic should be removed; they are not found in Module:etymology languages/data.
- Literary Chinese language (lzh-lit) has a canonical name that is not unique; it is also used by the code lzh.
- The data key preprocess_links for ??? (th-new) is invalid.
- The code ira-mid and the canonical name Middle Iranian should be removed; they are not found in Module:families/data.
- The code ira-old and the canonical name Old Iranian should be removed; they are not found in Module:families/data.
- The code ira-mid and the canonical name Middle Iranian should be removed; they are not found in Module:families/data.
- The code ira-old and the canonical name Old Iranian should be removed; they are not found in Module:families/data.
- Durango Nahuatl family (azc-dur) has no child families or languages.
- Old Indo-Aryan family (inc-old) has no child families or languages.
- The canonical name Abu' Arapesh (aah) is missing.
- Abu', the canonical name for the code aah, is wrong; it should be Abu' Arapesh.
- The code hnm and the canonical name Hainanese should be removed; they are not found in a submodule of Module:languages.
- The canonical name Itzá (itz) is missing.
- Itza', the canonical name for the code itz, is wrong; it should be Itzá.
- The code luh and the canonical name Leizhou Min should be removed; they are not found in a submodule of Module:languages.
- The code sjc and the canonical name Shaojiang Min should be removed; they are not found in a submodule of Module:languages.
- The canonical name Venetian (vec) is missing.
- Venetan, the canonical name for the code vec, is wrong; it should be Venetian.
- The code xln and the canonical name Alanic should be removed; they are not found in a submodule of Module:languages.
- Abu', the canonical name for the code aah, is wrong; it should be Abu' Arapesh.
- The code hnm and the canonical name Hainanese should be removed; they are not found in a submodule of Module:languages.
- Itza', the canonical name for the code itz, is wrong; it should be Itzá.
- The code luh and the canonical name Leizhou Min should be removed; they are not found in a submodule of Module:languages.
- The code sjc and the canonical name Shaojiang Min should be removed; they are not found in a submodule of Module:languages.
- Venetan, the canonical name for the code vec, is wrong; it should be Venetian.
- The code xln and the canonical name Alanic should be removed; they are not found in a submodule of Module:languages.
- Norwegian Bokmål language (nb) has ඩෙන්මාර්ක language (da) set as an ancestor, but is not in the East Scandinavian family (gmq-eas).
- Norwegian Bokmål language (nb) has Middle Norwegian language (gmq-mno) set as an ancestor, but is not in the West Scandinavian family (gmq-wes).
- Ossetian language (os) is in the Scythian family (xsc) and has Old Ossetic language (oos) set as an ancestor, but it is not possible to form an ancestral chain between them.
- Caribbean Hindustani language (hns) has Bhojpuri language (bho) set as an ancestor, but is not in the Bihari family (inc-bih).
- Caribbean Hindustani language (hns) has Awadhi language (awa) set as an ancestor, but is not in the Eastern Hindi family (inc-hie).
- Leti language (lti) has data in Module:languages/data/3/l, but does not have corresponding data in Module:languages/data/3/l/extra.
- Old Ossetic language (oos) is in the Scythian family (xsc) and has Proto-Ossetic language (os-pro) set as an ancestor, but it is not possible to form an ancestral chain between them.
- Jassic language (ysc) is in the Scythian family (xsc) and has Old Ossetic language (oos) set as an ancestor, but it is not possible to form an ancestral chain between them.
- Proto-Central Togo language (alv-gtm-pro) does not have the expected name "Proto-Ghana-Togo Mountain", even though it is the proto-language of the Ghana-Togo Mountain languages (alv-gtm).
- Proto-Arawa language (auf-pro) does not have the expected name "Proto-Arauan", even though it is the proto-language of the Arauan languages (auf).
- Proto-Amuesha-Chamicuro language (awd-amc-pro) has a proto-language code associated with the invalid code awd-amc.
- Proto-Kampa language (awd-kmp-pro) has a proto-language code associated with the invalid code awd-kmp.
- Proto-Arawak language (awd-pro) does not have the expected name "Proto-Arawakan", even though it is the proto-language of the Arawakan languages (awd).
- Proto-Paresi-Waura language (awd-prw-pro) has a proto-language code associated with the invalid code awd-prw.
- Proto-Ta-Arawak language (awd-taa-pro) does not have the expected name "Proto-Ta-Arawakan", even though it is the proto-language of the Ta-Arawakan languages (awd-taa).
- Proto-Rukai language (dru-pro) has a proto-language code associated with Rukai (dru), which is not a family.
- Proto-Basque language (euq-pro) does not have the expected name "Proto-Vasconic", even though it is the proto-language of the Vasconic languages (euq).
- Proto-Germanic language (gem-pro) does not have the expected name "Proto-ජර්මානු", even though it is the proto-language of the ජර්මානු languages (gem).
- Proto-Norse language (gmq-pro) does not have the expected name "Proto-North Germanic", even though it is the proto-language of the North Germanic languages (gmq).
- ප්රොටෝ-බටහිර ජර්මානු language (gmw-pro) does not have the expected name "Proto-බටහිර ජර්මානු", even though it is the proto-language of the බටහිර ජර්මානු languages (gmw).
- Proto-Kamta language (inc-krn-pro) does not have the expected name "Proto-KRNB lects", even though it is the proto-language of the KRNB lects (inc-krn).
- ප්රොටෝ-ඉන්දු-යුරෝපීය language (ine-pro) does not have the expected name "Proto-ඉන්දු-යුරෝපීය", even though it is the proto-language of the ඉන්දු-යුරෝපීය languages (ine).
- Kelantan Peranakan Hokkien, the canonical name for mis-hkl, is repeated in the table of aliases.
- Proto-Chumash language (nai-chu-pro) does not have the expected name "Proto-Chumashan", even though it is the proto-language of the Chumashan languages (nai-chu).
- Proto-Maidun language (nai-mdu-pro) does not have the expected name "Proto-Maiduan", even though it is the proto-language of the Maiduan languages (nai-mdu).
- Proto-Mixe-Zoque language (nai-miz-pro) does not have the expected name "Proto-Mixe-Zoquean", even though it is the proto-language of the Mixe-Zoquean languages (nai-miz).
- Proto-Pomo language (nai-pom-pro) does not have the expected name "Proto-Pomoan", even though it is the proto-language of the Pomoan languages (nai-pom).
- Proto-Mazatec language (omq-maz-pro) does not have the expected name "Proto-Mazatecan", even though it is the proto-language of the Mazatecan languages (omq-maz).
- Proto-Ossetic language (os-pro) has a proto-language code associated with Ossetian (os), which is not a family.
- Proto-Ossetic language (os-pro) lists the invalid language code xln as its ancestor.
- Proto-North Sarawak language (poz-swa-pro) does not have the expected name "Proto-North Sarawakan", even though it is the proto-language of the North Sarawakan languages (poz-swa).
- Proto-Salish language (sal-pro) does not have the expected name "Proto-Salishan", even though it is the proto-language of the Salishan languages (sal).
- Proto-Samic language (smi-pro) does not have the expected name "Proto-Sami", even though it is the proto-language of the Sami languages (smi).
- Proto-Kuki-Chin language (tbq-kuk-pro) does not have the expected name "Proto-Kukish", even though it is the proto-language of the Kukish languages (tbq-kuk).
- Proto-Saka language (xsc-sak-pro) does not have the expected name "Proto-Sakan", even though it is the proto-language of the Sakan languages (xsc-sak).
- Proto-Sarmatian language (xsc-sar-pro) has a proto-language code associated with the invalid code xsc-sar.
- Language code nan-hnm is not found in Module:languages/data/exceptional, and should be removed from Module:languages/data/exceptional/extra.
- Language code nan-luh is not found in Module:languages/data/exceptional, and should be removed from Module:languages/data/exceptional/extra.
- Proto-Sarmatian language (xsc-sar-pro) has data in Module:languages/data/exceptional, but does not have corresponding data in Module:languages/data/exceptional/extra.
- apc is set as an ISO 639-3 code on multiple items: Q56593 and Q22809485.
- kjv is set as an ISO 639-3 code on multiple items: Q838165 and Q31199873.
- msn is set as an ISO 639-3 code on multiple items: Q3331111 and Q3563857.
- ttt is set as an ISO 639-3 code on multiple items: Q56489 and Q123964178.
- Blissymbols script (Blis) is not used by any language and has no characters listed for auto-detection.
- Cypro-Minoan script (Cpmn) is not used by any language.
- හිරගනා script (Hira) is not used by any language.
- Kana script (Hrkt) is not used by any language.
- Image-rendered script (Image) is not used by any language and has no characters listed for auto-detection.
- International Phonetic Alphabet script (Ipach) is not used by any language and has no characters listed for auto-detection.
- Moon script (Moon) is not used by any language and has no characters listed for auto-detection.
- Morse code (Morse) is not used by any language and has no characters listed for auto-detection.
- Musical notation script (Music) is not used by any language.
- Unspecified script (None) is not used by any language and has no characters listed for auto-detection.
- Rongorongo script (Roro) is not used by any language and has no characters listed for auto-detection.
- Rumi numerals script (Rumin) is not used by any language.
- flag semaphore (Semap) is not used by any language and has no characters listed for auto-detection.
- Visible Speech script (Visp) is not used by any language and has no characters listed for auto-detection.
- mathematical notation script (Zmth) is not used by any language.
- symbol script (Zsym) is not used by any language.
- undetermined script (Zyyy) is not used by any language and has no characters listed for auto-detection.
- uncoded script (Zzzz) is not used by any language and has no characters listed for auto-detection.
- The codes fa-Arab, ug-Arab, ks-Arab, ps-Arab, ur-Arab, tt-Arab, ota-Arab, ku-Arab, mzn-Arab and sd-Arab are currently alias codes. Only one code should be used in the data.
- The codes ms-Arab and kk-Arab are currently alias codes. Only one code should be used in the data.
- The data key sort_by_scraping for ජපන් script (Jpan) is invalid.
Checks performed
For multiple data modules:
- Codes for languages, families and etymology-only languages must be unique and cannot clash with one another.
- Canonical names for languages, families, and etymology-only languages must not be found in the list of other names.
- Each name in the list of other names must appear only once.
- otherNames, if present, must be an array.
- Each Wikidata item ID must be a positive integer or a string starting with Q and ending with decimal digits.
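The name and Wikidata checks above can be illustrated with a short sketch. It is written in Python rather than the module's Lua, and the function names are hypothetical. It also assumes that "starting with Q and ending with decimal digits" means a Q followed by nothing but digits.

```python
import re


def is_valid_wikidata_id(value):
    """Accept a positive integer or a string such as 'Q56593'."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return value > 0
    if isinstance(value, str):
        # Assumption: "Q" followed only by decimal digits.
        return re.fullmatch(r"Q[0-9]+", value) is not None
    return False


def check_other_names(canonical_name, other_names):
    """Return a list of problems found in a list of other names."""
    if not isinstance(other_names, list):
        return ["otherNames must be an array"]
    problems = []
    seen = set()
    for name in other_names:
        if name == canonical_name:
            problems.append(f"{name} is the canonical name and is repeated in otherNames")
        if name in seen:
            problems.append(f"{name} appears more than once in otherNames")
        seen.add(name)
    return problems
```

For instance, `check_other_names("Venetian", ["Venetan", "Venetan"])` would report one duplicate, while an empty list yields no problems.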
The following must be true of the data used by Module:languages:
- Each code must be defined in the correct submodule according to whether it is two-letter, three-letter or exceptional.
- The canonical name (field 1) must be present and must not be the same as the canonical name of another language.
- If field 2 is not nil, it must be a valid Wikidata item ID.
- If field 3 or family is given and not nil, it must be a valid family code.
- If field 4 or scripts is given and not nil, it must be an array, and each string in the array must be a valid script code.
- If ancestors is given, it must be an array, and each string in the array must be a valid language or etymology language code.
- If family is given, it must be a valid family code.
- If type is given, it must be one of the recognised values (regular, reconstructed, appendix-constructed).
- If entry_name is given, it must be a table that contains either two arrays (from and to) or a string (remove_diacritics) or both.
- If sort_key is given, it may either be a string, or a table that in turn contains either two arrays (from and to) or a string (remove_diacritics).
- If entry_name or sort_key is given, the from array must be longer than or equal in length to the to array.
- If standardChars is given, it must form a valid Lua string pattern when placed between square brackets with ^ before it ("[^...]"). (It should match all characters regularly used in the language, but that cannot be tested.)
- If override_translit is set, translit must also be set, because there must be a transliteration module that can override manual transliteration.
- If link_tr is present, it must be true.
- The entry must have no data keys besides these: 1, 2, 3, "entry_name", "sort_key", "display", "otherNames", "aliases", "varieties", "type", "scripts", "ancestors", "wikimedia_codes", "wikipedia_article", "standardChars", "translit", "override_translit", "link_tr".
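A few of the per-language rules above are simple structural tests. The sketch below, in Python rather than the module's Lua, shows how the key whitelist, the from/to length rule, and the override_translit and link_tr rules might be checked. The function name, the message wording, and the use of a Python dict to stand in for a Lua data table are all assumptions made for illustration.

```python
# Keys permitted in a language entry, copied from the list above.
ALLOWED_KEYS = {
    1, 2, 3, "entry_name", "sort_key", "display", "otherNames", "aliases",
    "varieties", "type", "scripts", "ancestors", "wikimedia_codes",
    "wikipedia_article", "standardChars", "translit", "override_translit",
    "link_tr",
}


def check_language_entry(code, data):
    """Return a list of problem messages for one language entry."""
    problems = []
    for key in data:
        if key not in ALLOWED_KEYS:
            problems.append(f"The data key {key} for {code} is invalid.")
    for field in ("entry_name", "sort_key"):
        value = data.get(field)
        if isinstance(value, dict):
            source = value.get("from") or []
            target = value.get("to") or []
            # The from array must be at least as long as the to array.
            if len(source) < len(target):
                problems.append(f"{code}: {field}.from is shorter than {field}.to.")
    if data.get("override_translit") and not data.get("translit"):
        problems.append(f"{code}: override_translit is set but translit is not.")
    if "link_tr" in data and data["link_tr"] is not True:
        problems.append(f"{code}: link_tr must be true if present.")
    return problems
```

An entry carrying a key outside the whitelist, such as preprocess_links, would be flagged by the first loop.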
Checks not performed:
- If translit is present, it should be the name of a module, and this module should contain a tr function that takes a pagename (and optionally a language code and script code) as arguments.
- If sort_key is a string, it should be the name of a module, and this module should contain a makeSortKey function that takes a pagename (and optionally a language code and script code) as arguments.
- If entry_name or sort_key is a table and contains a field remove_diacritics, the value of the field should be a string that forms a valid Lua pattern when it is placed inside negated set notation ([^...]).
These are not checked here, because module errors will quickly crop up in entries if these conditions are not met, assuming that Module:utilities attempts to generate a sortkey for a category pertaining to the language in question, or full_link attempts to use the transliteration module.
Module:languages/code to canonical name and Module:languages/canonical names must contain all the codes and canonical names found in the data submodules of Module:languages, and no more.
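The "all the codes and no more" requirement amounts to comparing two mappings in both directions. The following Python sketch covers the code-to-name direction only. The function name and the dict inputs are hypothetical, and the message wording is modelled on the Output section rather than taken from the module's source.

```python
def compare_lookup_table(data_names, lookup):
    """Compare a code -> name lookup table against the data submodules.

    data_names: code -> canonical name gathered from the data submodules.
    lookup:     code -> canonical name stored in the lookup module.
    """
    problems = []
    for code, name in sorted(data_names.items()):
        if code not in lookup:
            problems.append(f"The canonical name {name} ({code}) is missing.")
        elif lookup[code] != name:
            problems.append(
                f"{lookup[code]}, the canonical name for the code {code}, "
                f"is wrong; it should be {name}."
            )
    for code, name in sorted(lookup.items()):
        if code not in data_names:
            problems.append(
                f"The code {code} and the canonical name {name} should be removed; "
                "they are not found in a submodule of Module:languages."
            )
    return problems
```

The first loop catches missing or stale entries, and the second catches leftovers that no longer exist in the data.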
The following must be true of the data used by Module:etymology languages:
- canonicalName must be given.
- parent must be given and must be a valid language, family or etymology-only language code.
- If ancestors is given, it must be an array, and each string in the array must be a valid language or etymology language code. The etymology language should also be listed as the ancestor of a regular language.
- The entry must have no data keys besides these: "canonicalName", "otherNames", "parent", "ancestors", "wikipedia_article", "wikidata_item".
Codes in Module:families data must:
- Have canonicalName, which must not be the same as the canonical name of another family.
- Have a valid family code in family, if family is given.
- Have at least one language or subfamily belonging to it.
- Have no data keys besides these: "canonicalName", "otherNames", "family", "protoLanguage", "wikidata_item".
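The "at least one language or subfamily" rule can be sketched as a single pass that collects every family code referred to by a language or by another family. This is a Python illustration, not the module's Lua. The function name is hypothetical, and for simplicity both inputs are assumed to store the family code under a "family" key, whereas the real language data may use field 3.

```python
def families_without_children(families, languages):
    """Return the sorted family codes that no language or subfamily belongs to.

    families:  family code -> data table (optionally with a "family" parent).
    languages: language code -> data table (optionally with a "family" code).
    """
    used = set()
    for data in languages.values():
        family = data.get("family")
        if family:
            used.add(family)
    for code, data in families.items():
        parent = data.get("family")
        # A family naming itself as parent does not count as having a child.
        if parent and parent != code:
            used.add(parent)
    return sorted(code for code in families if code not in used)
```

Each code returned would correspond to a message of the form "... family (code) has no child families or languages."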
Codes in Module:scripts data must:
- Have canonicalName.
- Have at least one language that lists it as one of its scripts.
- Have a characters pattern for script autodetection, and this must form a valid Lua string pattern when placed between square brackets ("[...]"). (It should match all characters in the script, but that cannot be tested.)
- Have no data keys besides these: "canonicalName", "otherNames", "parent", "systems", "wikipedia_article", "characters", "direction".