Talk:ARPABET

From Wikipedia, the free encyclopedia

Contradiction

The article says the higher the digit, the higher the stress, which would of course be very awkward and counter-intuitive. The example shows the opposite: 1 = primary stress, 2 = secondary stress. Thus the lower the digit, the higher the stress, except for zero. Bostoner (talk) 20:21, 23 March 2009 (UTC)
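
The digit convention described above can be sketched in a few lines of Python. This is only an illustration, not code from any ARPABET tool: the names `STRESS_LABELS`, `split_stress` and `describe` are hypothetical, and reading 0 as "no stress" is the CMU dictionary convention, assumed here.

```python
# Stress digits as used in CMU-style transcriptions (assumed convention):
# 1 = primary, 2 = secondary, 0 = no stress.
STRESS_LABELS = {"0": "no stress", "1": "primary stress", "2": "secondary stress"}


def split_stress(phone):
    """Split a phone such as 'AH0' into its base symbol and a stress label.

    Phones without a trailing stress digit (consonants) get None as the label.
    """
    if phone and phone[-1] in STRESS_LABELS:
        return phone[:-1], STRESS_LABELS[phone[-1]]
    return phone, None


def describe(transcription):
    """Apply split_stress to every space-separated phone in a transcription."""
    return [split_stress(p) for p in transcription.split()]


for base, label in describe("G UW1 T AH0 N B ER0 G"):
    print(base, label)
```

Note that the digits are labels rather than a scale: 1 outranks 2, but 0 sits below both.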

Missing phone?

I am not hugely familiar with ARPABET, but I am using the BEEP dictionary (http://svr-www.eng.cam.ac.uk/~ajr/wsjcam0/node8.html), which uses ARPABET and contains the phone 'AX' for schwa (ə) sounds rather than 'AH'. I can't seem to find an official ARPABET standard, but when I search for ARPABET, I get several references that include 'AX', such as http://www.telecom.tuc.gr/~ntsourak/tutorial_arpabet.htm, http://www.stanford.edu/class/linguist238/fig04.01.pdf, and http://www-rohan.sdsu.edu/~gawron/compling/chap4/fig04.02.pdf. In fact, the only page I've seen in my (brief) search that does combine ʌ and ə into 'AH' is the CMU reference. ChristineInMaryland (talk) 19:01, 25 May 2010 (UTC)

Good points. I haven't been able to find anything that does a good job of describing the differences between the TIMIT and CMU versions of ARPABET either, but TIMIT is way more detailed (syllabic nasals, flaps, unreleased stops, etc.). The distinction between ʌ and ə is still in CMU, but it's preserved in the stress marks (AH1 vs AH0). The same is true for ER and AXR, which are mapped to ER1 and ER0 in CMU. I think that in the article, /her/ shouldn't be one of the examples for ɝ, or if it is, it should be transcribed HH ER1, not HH ER0. (CMU has both alternates.) 173.64.164.223 (talk) 14:19, 20 July 2010 (UTC)
Note that CMU does not differentiate between /ʌ/ and /ə/ in the /AH0/ case, as it will merge unstressed STRUT vowels with the COMMA vowel, but not stressed STRUT vowels. For example, consider <undone>, /ʌndˈʌn/, and <about>, /əbˈaʊt/. CMU cannot transcribe these cases in a way that keeps them distinct. The /ER/-/AXR/ case is clearer as English does not have an unstressed NURSE vowel. The CMU transcription is consistent with the modern interpretation of the NURSE vowel as /əː/ in British English, where the /ə/ is only long in stressed positions. Rhdunn (talk) 16:46, 12 January 2015 (UTC)
"The /ER/-/AXR/ case is clearer as English does not have an unstressed NURSE vowel." That's not necessarily true. For instance, the last syllable of Gutenberg is pronounced as a lengthened schwa in RP, i.e. /ˌɜː/. But British dictionaries tend to not bother with secondary stress preceded by primary stress, so they just have /əː/ sans stress mark. CMU seems no different, as it has G UW1 T AH0 N B ER0 G, i.e. /ˈɡutənbɚɡ/. But this would suggest the last vowel would be pronounced as just a normal /ə/ in a non-rhotic accent. Marking the secondary stress could have easily prevented inaccuracy like this. (Granted, CMU doesn't seem to give much crap about dialectal difference anyway as it has no symbol for the unmerged LOT, though.) Nardog (talk) 03:18, 9 September 2017 (UTC)
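
To make the AH0/AH1 and ER0/ER1 convention from this thread concrete, here is a minimal Python sketch. The table and the function names (`AH_ER_TO_IPA`, `vowel_to_ipa`, `ah_er_vowels`) are hypothetical. Treating secondary stress (digit 2) like primary stress is an assumption, and the transcription used for "undone" is an illustrative input rather than a quoted dictionary entry.

```python
# (base symbol, is_stressed) -> IPA vowel, following the thread:
# AH1 ~ ʌ, AH0 ~ ə (TIMIT-style AX), ER1 ~ ɝ, ER0 ~ ɚ (TIMIT-style AXR).
AH_ER_TO_IPA = {
    ("AH", True): "ʌ",
    ("AH", False): "ə",
    ("ER", True): "ɝ",
    ("ER", False): "ɚ",
}


def vowel_to_ipa(phone):
    """Return the IPA vowel for an AH/ER phone, or None for any other phone."""
    if len(phone) < 2 or not phone[-1].isdigit():
        return None  # consonants carry no stress digit
    # Assumption: digits 1 and 2 both count as stressed.
    base, stressed = phone[:-1], phone[-1] != "0"
    return AH_ER_TO_IPA.get((base, stressed))


def ah_er_vowels(transcription):
    """Collect the IPA readings of all AH/ER phones in a transcription."""
    readings = (vowel_to_ipa(p) for p in transcription.split())
    return [v for v in readings if v is not None]


print(ah_er_vowels("HH ER1"))                 # her, transcribed as suggested above
print(ah_er_vowels("G UW1 T AH0 N B ER0 G"))  # Gutenberg
print(ah_er_vowels("AH0 N D AH1 N"))          # illustrative input for "undone"
```

The last call shows the loss Rhdunn describes: an unstressed STRUT vowel written as AH0 comes back as ə, so it can no longer be told apart from the COMMA vowel.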

IPA Phonemes

I'm going to update the chart to include the IPA representation for each phoneme. Note that I'm somewhat familiar with the IPA but not with the sounds of the Arpabet, so my edits may need to be corrected. Theshibboleth (talk) 10:17, 1 July 2010 (UTC)

Never mind, somehow I didn't see the IPA phones already on the page 0_o Theshibboleth (talk) 10:18, 1 July 2010 (UTC)

Some Notes Based Mostly On Memory

I was involved in the ARPA Speech Understanding Project at CMU. I don't have full documentation, so this is not ready for inclusion, but I'm passing this on in case someone wants to find the documents to confirm my memory.

1) The ARPAbet actually came in two forms: the one-letter and two-letter forms. The one-letter form was only occasionally used. It used both upper and lower case letters. Back then, a lot of work was still being done using low-end punch-card systems and teletypes (in fact, a slang term for CRT terminals was "glass teletypes"), which did not have lower case letters available (they used the ASCII 6-bit encoding). My impression was that the one-letter form was meant to be the preferred form, with the two-letter, all-caps form there as a lowest-common-denominator version.

A copy of the chart we used (I was the immediate supervisor of the undergraduates who were hired to do manual phonetic transcriptions) can be found in [1]. It includes both the one and two character forms.

The two-letter version almost immediately dominated for three reasons (this is probably not documented in print anywhere): 1) You didn't have to worry about what display or input hardware was available when files or software were exchanged, 2) It was hard to remember many arbitrary choices of upper vs lower case (e.g., "s" was "S" but "S" was "SH"), 3) Words containing lexically aRBItrArY but semantically SiGNiFIcaNt mixes of case are hard to read and to type.

2) The two-letter form was supposed to be unambiguously parsable with a one-letter lookahead, both forward and backward. I have no idea why easy backward parsing was considered desirable, and I don't know of anyone who used that property. That was a good thing, since one of the CMU team (and probably others elsewhere) discovered that the backward parsing was actually ambiguous. I have no idea whether this was published in any form.

3) As the above referenced chart shows there were additional symbols for phonetic punctuation. Use of "Punctuation marks ...[used] like in the written language" was not part of the original system, and in fact, using some (e.g., ".") would be in violation of it.

4) Some auxiliary informal standards developed both within and between groups, but I don't remember the details very well. Among other things, these were used for indicating segment timing, sub-phonemic structure (e.g., attack phase of plosives), diacriticals like aspiration and nasalization, and allophonic variations. Some of this was just tacked on, and some represented what should go inside the official annotation brackets. Both square brackets (I think that became the convention for enclosing the entire string) and curly brackets (we may have used them for allophonic variations in dictionaries and alternative matchings in transcription) were used in addition to the official "( )", "< >" and "** *". The use of "_" for word boundaries was pretty much universally replaced with " " quite quickly.

A lot of this was motivated by the ARPAbet being designed with a much narrower scope than the project required. It was intended as a phonetic alphabet for English (essentially, pronunciation templates), but speech recognition required detailed phonic annotation (classification of what speech sounds actually occurred).

5) Traditionally at least, it was always either "ARPAbet" (an acronym in caps portmanteaued to a word fragment treated as a suffix) or "ARPABET" (because back then, computer terms tended to be all in caps), but not Arpabet as in the article.

96.233.98.186 (talk) 18:43, 25 September 2012 (UTC) Topher Cooper

References

Making an important fix

The example for AO is misleading:

AO ɔ bought

Bought is AA in America and AO in Britain (cite: tophonetics.com, http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b)

Switching to "bore", which has a more similar vowel across British and American English. Preceding unsigned comment added by 129.170.195.68 (talk) 22:25, 13 November 2018 (UTC)

The cot–caught merger wasn't as advanced in the US as it is today when ARPABET was created. I don't disagree that a word with [ɔr] may be better suited as the example, though. Nardog (talk) 17:15, 14 November 2018 (UTC)

The cot–caught merger has made AO absurdly ambiguous. All instances of AO should be replaced with AA or OW. Sandizer (talk) 04:41, 17 February 2023 (UTC)