Talk:Lexical analysis
This is the talk page for discussing improvements to the Lexical analysis article. This is not a forum for general discussion of the article's subject.
This article is rated C-class on Wikipedia's content assessment scale.
The contents of the Tokenization (lexical analysis) page were merged into Lexical analysis on 10 June 2017. For the contribution history and old versions of the redirected page, please see its history; for the discussion at that location, see its talk page.
Wiki Education Foundation-supported course assignment
This article was the subject of a Wiki Education Foundation-supported course assignment, between 24 August 2020 and 9 December 2020. Further details are available on the course page. Student editor(s): Oemo01.
Above undated message substituted from Template:Dashboard.wikiedu.org assignment by PrimeBOT (talk) 02:31, 17 January 2022 (UTC)
Examples
This article is not clear enough! It needs more examples!
Robert A.
The link to ocaml-ulex in 'Links' is broken.
Frank S.
Robert, this article is actually very poorly articulated. It seems that its authors do not have enough clarity about how to explain it for it to be comprehensible. And the article should probably be marked that way - that it requires further clarity. Stevenmitchell (talk) 16:08, 25 January 2014 (UTC)
Types of tokens
Can someone explain to me what types of tokens there are? (Keywords, identifiers, literals? Or are there more) And do each of these types go into a symbol table of that type? i.e. an identifier table, keyword table and literal table? Or are they just stored into one uniform symbol table? —Dudboi 02:17, 6 November 2006 (UTC)
It really depends. For instance, for the example language in the article (PL/0 - see that article for example source code), here are the available token types:
- operators (single- and multi-character): '+', '-', '*', '/', '=', '(', ')', ':=', '<', '<=', '<>', '>', '>='
- language-required punctuation: ',', ';', '.'
- literal numbers - specifically, integers
- identifiers: a-zA-Z {a-zA-Z0-9}
- keywords: "begin", "call", "const", "do", "end", "if", "odd", "procedure", "then", "var", "while"
Other languages might have more token types. For instance, in the C language, you would have string literals, character literals, floating point numbers, hexadecimal numbers, directives, and so on.
I've seen them all stored in a single table, and I've also seen them stored in multiple tables. I don't know if there is a standard, but from the books I've read, and the source code I've seen, multiple tables appears to be popular.
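To make the above concrete, here is a minimal sketch (illustrative only; the names and structure are invented for this talk page, not taken from the article or any PL/0 compiler) of those token categories as a Python enumeration with a trivial classifier:

from enum import Enum

# Illustrative only: the PL/0 token categories listed above as a Python enum.
class TokenType(Enum):
    OPERATOR = 1      # '+', '-', '*', '/', '=', '(', ')', ':=', '<', '<=', '<>', '>', '>='
    PUNCTUATION = 2   # ',', ';', '.'
    NUMBER = 3        # integer literals
    IDENTIFIER = 4    # a-zA-Z {a-zA-Z0-9}
    KEYWORD = 5       # "begin", "call", ..., "while"

KEYWORDS = {"begin", "call", "const", "do", "end", "if", "odd",
            "procedure", "then", "var", "while"}
OPERATORS = {"+", "-", "*", "/", "=", "(", ")", ":=", "<", "<=", "<>", ">", ">="}
PUNCTUATION_MARKS = {",", ";", "."}

def classify(lexeme):
    """Map one already-scanned lexeme to a PL/0 token category."""
    if lexeme in KEYWORDS:
        return TokenType.KEYWORD
    if lexeme in OPERATORS:
        return TokenType.OPERATOR
    if lexeme in PUNCTUATION_MARKS:
        return TokenType.PUNCTUATION
    if lexeme.isdigit():
        return TokenType.NUMBER
    return TokenType.IDENTIFIER

# e.g. classify("while") -> TokenType.KEYWORD, classify("42") -> TokenType.NUMBER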
A second task performed during lexical analysis is to enter tokens into a symbol table, if one is used. Some other tasks performed during lexical analysis are: 1. to remove comments, tabs, blank spaces and machine-specific characters; 2. to produce error messages for errors occurring in a source program.
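As an illustration of the "remove comments and blanks" task (a sketch only; the '#'-to-end-of-line comment syntax is assumed purely for the example and is not PL/0's), a scanner typically just skips such characters before emitting the next token:

def skip_ignorable(text, i):
    """Advance index i past whitespace and '#' line comments (illustrative syntax),
    so the scanner never emits tokens for them."""
    while i < len(text):
        if text[i] in " \t\r\n":
            i += 1
        elif text[i] == "#":                 # comment runs to end of line
            while i < len(text) and text[i] != "\n":
                i += 1
        else:
            break
    return i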
See the following links for simple approachable compiler sources:
http://wiki.riteme.site/wiki/PL/0 http://www.246.dk/pascals1.html http://www.246.dk/pascals5.html
See http://www.246.dk/pl0.html for more information on PL/0. 208.253.91.250 18:07, 13 November 2006 (UTC)
merger and clean up
I've merged token (parser) here. The page is a bit of a mess now though. The headings I've made should help sort that out. The examples should be improved so that they take up less space. The lex file should probably be moved to the flex page. --MarSch 15:37, 30 April 2007 (UTC)
- done the move of the example. --MarSch 15:39, 30 April 2007 (UTC)
Next?
It would be nice to see what is involved in the next step, or at least see a link to the page describing the whole process of turning high-level code into low-level code.
-Dusan B.
- you mean compiling? --MarSch 10:53, 5 May 2007 (UTC)
Lexical Errors
There's a mention that scanning fails on an "invalid token". It doesn't seem particularly clear what constitutes a lexical error other than a string of garbage characters. Any ideas? --138.16.23.227 (talk) 04:54, 27 November 2007 (UTC)
If the lexer finds an invalid token, it will report an error.
- The comment needs some context. Generally speaking, a lexical analyzer may report an error. But that is usually (for instance in the Lex programming tool) under the control of the person who designs the rules for the analysis. The analyzer itself may reject the rules because they're inconsistent. On the other hand, the parser is more likely to have built-in behavior -- yacc for instance fails on any mismatch and requires the developer to specify how to handle errors (updating the article to reflect these comments requires citing reliable sources of course - talk pages aren't a source). Tedickey (talk) 13:56, 27 November 2007 (UTC)
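As an illustration of that point (a sketch only, not how lex/flex works internally; the rule set here is invented): a scanner can treat any input that matches none of its token rules as a lexical error, and whether it raises, skips the character, or substitutes an error token is the designer's choice:

import re

# Illustrative token rules; a real rule set comes from the language definition.
RULES = [
    ("NUMBER", re.compile(r"\d+")),
    ("IDENT",  re.compile(r"[A-Za-z_]\w*")),
    ("OP",     re.compile(r"[+\-*/=<>()]")),
    ("SKIP",   re.compile(r"\s+")),
]

class LexicalError(Exception):
    pass

def scan(text):
    pos = 0
    while pos < len(text):
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m:
                if name != "SKIP":
                    yield (name, m.group())
                pos = m.end()
                break
        else:
            # No rule matched: this is the "invalid token" case.
            # A designer might instead skip the character or emit an ERROR token.
            raise LexicalError(f"unexpected character {text[pos]!r} at offset {pos}")

# list(scan("count = 3 $ 4")) raises LexicalError at the '$'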
Lexer, Tokenizer, Scanner, Parser
It should be explained more clearly what exactly distinguishes these modules, what their particular jobs are, and how they're composed together (maybe some illustration?) —Preceding unsigned comment added by Sasq777 (talk • contribs) 16:06, 11 June 2008 (UTC)
I agree, the article does not make it clear which is what, what comes first, etc. In fact, there seem to be a few contradictory statements. A block diagram would be ideal, but a clear explanation is urgently needed. Thanks. 122.29.91.176 (talk) 01:03, 21 June 2008 (UTC)
Something like page 4 of this document, which incidentally can be reproduced easily here (see Preface). 122.29.91.176 (talk) 08:50, 21 June 2008 (UTC)
In computer science and linguistics, parsing, or, more formally, syntactic analysis, is the process of analyzing a sequence of tokens to determine their grammatical structure with respect to a given (more or less) formal grammar.
Parsing is also an earlier term for the diagramming of sentences of natural languages, and is still used for the diagramming of inflected languages, such as the Romance languages or Latin. The term parsing comes from Latin pars (ōrātiōnis), meaning part (of speech).[1][2] —Preceding unsigned comment added by 122.168.58.48 (talk) 13:09, 25 September 2008 (UTC)
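To connect that definition to the lexer/parser question above, here is a minimal illustrative sketch (the token names and tree shape are invented for the example): the lexer emits a flat sequence of tokens, and the parser is what imposes grammatical structure on that sequence.

# Lexer output for the source text "x = 1 + 2": a flat list of (kind, text) pairs.
tokens = [("IDENT", "x"), ("EQUALS", "="), ("NUMBER", "1"),
          ("PLUS", "+"), ("NUMBER", "2")]

# A parser then imposes structure on that flat sequence, e.g. an assignment
# whose right-hand side is an addition (nested tuples stand in for a parse tree):
parse_tree = ("assign", ("ident", "x"),
              ("add", ("number", "1"), ("number", "2")))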
There are two different definitions of tokenization, one in the Tokens section, the other in the Tokenization subsection. —Preceding unsigned comment added by 134.2.173.25 (talk) 14:24, 28 March 2011 (UTC)
The lexing example should use a list, not XML nor s-expressions
In the article, there are two examples I want to comment on. One shows XML. Another uses an s-expression. These examples create the impression that lexing produces a tree data structure. However, lexical analysis generates a list of tokens (often tuples). To be clear, a list, not a tree. In the lexers I've used, tokens do not contain other tokens. There may be exceptions, but if so, they are uncommon.
To show lexing, it is better to use a (simple) list data structure. Below is an example in Python. (JSON would also be suitable. The idea is to choose a commonly used data representation to get the point across; namely, you only need a list of tuples.)
from enum import Enum
class Token(Enum):
    WORD = 1
    PUNC = 2
tokenization = [
(Token.WORD, "The"),
(Token.WORD, "quick"),
(Token.WORD, "brown"),
(Token.WORD, "fox"),
(Token.WORD, "jumps"),
(Token.WORD, "over"),
(Token.WORD, "the"),
(Token.WORD, "lazy"),
(Token.WORD, "dog"),
(Token.PUNC, "."),
]
DavidCJames (talk) 15:45, 10 March 2021 (UTC)
Regarding "Lexical analyzer generators" section
It seems to be a duplicate of List of parser generators. Probably it can be removed? 98.176.182.199 (talk) 02:57, 6 December 2009 (UTC)
- It's not a duplicate (as the casual reader will observe). Tedickey (talk) 12:00, 6 December 2009 (UTC)
Let's revisit this. At some point in the last nine years, the Comparison of parser generators page has grown sections for both regular- and context-free analyzers. The list here is best converted to a reference into that other page. I vote to perform that operation. 141.197.12.183 (talk) 16:40, 25 January 2019 (UTC)
table-driven vs directly coded
I don't think the table-driven approach is the problem - see the 'control table' article - flex appears to be inefficient and is not using a trivial hash function. —Preceding unsigned comment added by 81.132.137.100 (talk • contribs)
- Editor comments that lex/flex doesn't use hashing (that may be relevant). My understanding of the statement is that state-tables can grow very large when compared to hand-coded parsers. Agreeing that they're simpler to implement, there's more than one aspect of efficiency. Tedickey (talk) 20:37, 2 May 2010 (UTC)
- It really depends what is meant by a "table-driven" approach. If you are simply talking about tables of "raw" specifications that have to be parsed/interpreted before execution v. embedded hand-coded If statements, the original text may be correct.
- If however you are talking about a really well designed "execution ready" control table - that is then used to input an already compacted and well designed "table of specifications" (v. verbose text-based instructions), it can be much superior in algorithmic efficiency, maintainability and just about every other way. In other words, the phrase "table-driven" is maybe used rather ambiguously in this context, without reference to the much deeper "table-driven programming" approach. —Preceding unsigned comment added by 86.139.34.219 (talk) 10:53, 7 May 2010 (UTC)
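For readers new to the terminology being debated, here is a tiny illustrative contrast (an invented example, not how flex or any real generator is implemented): a table-driven scanner interprets a state-transition table at run time, while a directly coded scanner expresses the same automaton as ordinary control flow.

# Table-driven: recognize an unsigned integer by interpreting a transition table.
# States: 0 = start, 1 = in number; a missing entry means "reject".
TABLE = {
    (0, "digit"): 1,
    (1, "digit"): 1,
}

def char_class(c):
    return "digit" if c.isdigit() else "other"

def is_integer_table_driven(s):
    state = 0
    for c in s:
        state = TABLE.get((state, char_class(c)))
        if state is None:
            return False
    return state == 1

# Directly coded: the same automaton written out by hand.
def is_integer_directly_coded(s):
    if not s:
        return False
    for c in s:
        if not c.isdigit():
            return False
    return True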
SICP book example
I am not familiar with the code examples in SICP; can someone please provide the section/example name/number and possibly a link to the example? Thanks, Ckoch786 (talk) 01:43, 7 February 2014 (UTC)
recursion vs regular expressions
A recent edit, using the sense that recursion is good and that popular means the same thing, equated popular with support for recursive regular expressions. We need a reliable source of information discussing that aspect and which design features directly support recursion. TEDickey (talk) 11:17, 1 January 2017 (UTC)
Many popular regular expression libraries support that which the article expressly says they "are not powerful enough to handle". The author of that statement is expressing an inaccurate opinion and it should be removed. I am taking no other position on the merits of regular expression versus parser capability other than to state that the article is misleading on the facts I mentioned. I have provided the following references for the merit of revising the article. Hrmilo (talk) 21:30, 1 January 2017 (UTC) [1] [2]
The MSDN topic doesn't mention recursion; the other link mentions it without ever providing a definition (other than circular references), equates it to balancing groups, and doesn't relate that to any of the senses of recursion as used by others, much less explain why the page's author thought it was recursion. Implying that all regular expression parsers do this is misleading, particularly since it precedes a discussion of lex. If you want to expand on that, you might consider reading the POSIX definitions (e.g., regular expressions and lex). (Keep in mind also that a single source doesn't warrant adding its terminology to this topic.) TEDickey (talk) 14:59, 2 January 2017 (UTC)
@Hrmilo: The popular "regular-expression" libraries in question in fact recognize a context-free or even context-sensitive language. They are called "regular expression" because the syntax they support bears a strong surface resemblance to the true "regular expression" -- a formally defined concept. See https://wiki.riteme.site/wiki/Regular_expression#Formal_language_theory 141.197.12.183 (talk) 16:46, 25 January 2019 (UTC)
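One concrete illustration of that distinction (standard formal-language material, not a claim about any particular library): arbitrarily nested balanced parentheses form a non-regular language, so a true (formal) regular expression cannot recognize them, while a few lines of code with a counter can.

def balanced(s):
    """Recognize arbitrarily nested balanced parentheses.
    This needs unbounded counting, which no finite automaton
    (i.e., no formal regular expression) can do."""
    depth = 0
    for c in s:
        if c == "(":
            depth += 1
        elif c == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

# balanced("(()())") -> True, balanced("(()") -> False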
References
lexer generator
This section in the article states:
The most established is Lex (software), paired with the yacc parser generator, and the free equivalents Flex/bison.
However, [1] is a public-domain yacc, so implying that flex and bison are free in contrast to yacc is in my opinion not accurate. — Preceding unsigned comment added by Ngzero (talk • contribs) 12:00, 20 April 2018 (UTC)
- In my opinion [Berkeley_Yacc] could be mentioned, and yacc stated explicitly to mean the original 'yacc'? Then again, it could be clear that yacc means just the original yacc if you read it twice.
- Ngzero (talk) 12:09, 20 April 2018 (UTC)
Section on Software
The software section in this article was removed some time ago by consensus, which was a good decision because there is already a link to a separate Wikipedia page with a list of implementations and sufficient details. It seems it has been put back (?) with some rather specific examples instead of trying to be generic. Worse (in my opinion), there are claims in that section such as "Lex is too complex and inefficient" that clearly do not belong there. I will remove this section unless someone has a strong argument why it should be included. — Preceding unsigned comment added by Robert van Engelen (talk • contribs) 13:24, 29 April 2021 (UTC)
tokenizing editors
Is there a Wikipedia article that discusses the kind of tokenization used in many BASIC dialects? My understanding is:
- (a) When typing in lines of a program or loading a program from disk/cassette, common keywords such as "PRINT", "IF(" (including the opening parenthesis), "GOSUB", etc. are recognized and stored as a single byte or two in RAM, some or all spaces are discarded, line numbers and internal decimal numbers are converted to internal binary format(s), etc.
- (b) When listing the program with LIST or storing a program to disk/cassette, those tokens are expanded back to the full human-readable name of the keyword, spaces are inserted as necessary (pretty-printing), those binary values are printed out in decimal, etc.
This tokenization (lexical analysis) article covers the sort of thing done in compilers, similar to (a), but it says nothing about (b), which compilers never do, but practically every BASIC dialect mentioned in the "List of computers with on-board BASIC" article does.
The source-code editor article briefly mentions "tokenizing editors" that do both (a) and (b). I've heard that many Forth language implementations have a "see" command that does something similar to (b). The BASIC interpreter#Tokenizing and encoding lines article and most of our articles on specific BASIC implementations (GFA BASIC, GW-BASIC, Atari BASIC#Tokenizer, etc.) briefly discuss such tokenization.
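Purely as an illustration of the two directions (a) and (b) described above - the keyword set and byte values below are invented, not those of any real BASIC:

# Hypothetical one-byte tokens for a few keywords; each real BASIC had its own table.
KEYWORD_TO_BYTE = {"PRINT": 0x91, "GOSUB": 0x8D, "IF": 0x8B}
BYTE_TO_KEYWORD = {b: kw for kw, b in KEYWORD_TO_BYTE.items()}

def crunch(line):
    """(a) On entry/load: replace recognized keywords with one-byte tokens."""
    out = bytearray()
    for word in line.split():
        if word in KEYWORD_TO_BYTE:
            out.append(KEYWORD_TO_BYTE[word])
        else:
            out.extend(word.encode("ascii"))
        out.append(ord(" "))
    return bytes(out)

def list_line(crunched):
    """(b) On LIST/save: expand token bytes back to the keyword text."""
    out = []
    for b in crunched:
        if b in BYTE_TO_KEYWORD:
            out.append(BYTE_TO_KEYWORD[b])
        else:
            out.append(chr(b))
    return "".join(out)

# list_line(crunch('PRINT "HELLO"')) -> 'PRINT "HELLO" '

(A real BASIC would not tokenize inside string literals and would preserve spacing more carefully; the sketch ignores that to keep the round trip visible.)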
But I still wonder -- do any other languages typically have implementations that do such tokenization?
Should *this* "tokenization (lexical analysis)" article have a section at least mentioning (b), or is there already some other article that discusses it?
(I'm reposting this question that I previously posted at Talk:BASIC#tokenization, because it seems more widely applicable than just the BASIC language. ). --DavidCary (talk) 19:19, 27 September 2023 (UTC)
Wiki Education assignment: Linguistics in the Digital Age
This article was the subject of a Wiki Education Foundation-supported course assignment, between 15 January 2024 and 8 May 2024. Further details are available on the course page. Student editor(s): Minhngo6 (article contribs).
— Assignment last updated by Minhngo6 (talk) 18:32, 7 March 2024 (UTC)