An objective of CS 236 is to help you write complex programs by using mathematical concepts as the basis for solving real-world problems through computation. Mathematical thinking leads to programs that are clear, organized, and verifiably correct. This first project shows how discrete math structures form the basis not only for writing the code, but for the code itself. It thus also provides a clear guide for maintaining and extending the code by any other programmer who understands the discrete math structures.
You are expected to write code from scratch that clearly emulates the discrete math structures on which the solution is built. You are expected to justify your design and your code as maintainable and extendable by other programmers who are conversant with discrete mathematical structures. You should also learn that for complex programs, if you hack the code together instead of formulating it from the discrete math structures, it will take you longer to write and it is less likely to be correct.
Please be aware that the project standards apply to this project and all projects in this course; in particular, for this project you may not use a regular expression library.
Description
A lexical analyzer groups characters in an input stream into tokens. A token is a sequence of one or more characters that form a single element of a language (e.g., a symbol, a numerical value, a string literal, or a keyword).
The project is to write a lexical analyzer for a subset of the Datalog language [http://en.wikipedia.org/wiki/Datalog]. The Lexer should output the UNDEFINED token if an undefined symbol is found. The input to your Lexer is a text file and the output is a print-out of formatted tokens explained below. Please refer to the project standards for details on how to indicate the program input and where the output should be directed.
A token object is composed of three parts, as sketched in the example below:
a string - the characters extracted from the file text
a number - the line the token started on
a token type - one of the types listed below
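For example, a minimal Token class along the following lines captures all three parts. This is only a sketch; none of the names here (Token, getValue, and so on) are required by the assignment.

#include <string>

enum class TokenType;   // one enumerator per token type; a full sketch follows the token-type table below

class Token {
public:
    Token(TokenType type, const std::string& value, int line)
        : type(type), value(value), line(line) {}

    TokenType getType() const { return type; }
    const std::string& getValue() const { return value; }
    int getLine() const { return line; }

private:
    TokenType type;     // which kind of token this is (see the list below)
    std::string value;  // the characters extracted from the file text
    int line;           // the line on which the token started
};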
The lexical analyzer must use finite state machines to recognize tokens in the input stream. It is recommended that you consider using enumerated types to enumerate the potential states of the machine.
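As a sketch of the idea only, here is a tiny machine that decides whether the next token is a COLON or a COLON_DASH. The state names, the function name, and its interface are assumptions for illustration, not part of the assignment; a full solution needs many more states (strings, comments, identifiers, keywords, and so on).

#include <cstddef>
#include <string>

// Illustrative states for a machine that distinguishes ':' from ":-".
enum class State { START, SAW_COLON, EMIT_COLON, EMIT_COLON_DASH };

// Returns the number of characters consumed starting at input[pos] (which the
// caller guarantees is ':') and reports whether the token is a COLON_DASH.
int matchColonOrColonDash(const std::string& input, std::size_t pos, bool& isColonDash) {
    State state = State::START;
    std::size_t i = pos;
    while (true) {
        switch (state) {
        case State::START:
            ++i;                               // consume the ':'
            state = State::SAW_COLON;
            break;
        case State::SAW_COLON:                 // look ahead one character
            state = (i < input.size() && input[i] == '-')
                        ? State::EMIT_COLON_DASH
                        : State::EMIT_COLON;
            break;
        case State::EMIT_COLON:
            isColonDash = false;
            return static_cast<int>(i - pos);  // consumed 1 character (":")
        case State::EMIT_COLON_DASH:
            isColonDash = true;
            ++i;                               // consume the '-'
            return static_cast<int>(i - pos);  // consumed 2 characters (":-")
        }
    }
}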
Token Types
You are to generate tokens from a text input stream for a subset of Datalog. The list of all token types that you will be expected to identify is below. You need to define the token types in your code.
COMMA - The ',' character. Example: ,
PERIOD - The '.' character. Example: .
Q_MARK - The '?' character. Example: ?
LEFT_PAREN - The '(' character. Example: (
RIGHT_PAREN - The ')' character. Example: )
COLON - The ':' character. Example: :
COLON_DASH - The string ":-". Example: :-
MULTIPLY - The '*' character. Example: *
ADD - The '+' character. Example: +
SCHEMES - The string "Schemes". Example: Schemes
FACTS - The string "Facts". Example: Facts
RULES - The string "Rules". Example: Rules
QUERIES - The string "Queries". Example: Queries
ID - A letter followed by 0 or more letters or digits that is not a keyword (Schemes, Facts, Rules, Queries). Valid examples: Identifier1, Person. Invalid examples: 1stPerson, Person_Name, Schemes.
STRING - Any sequence of characters enclosed in single quotes. Two single quotes denote an apostrophe within the string. For line-number counts, count all \n's within a string. A string token's line number is the line where the string starts. If EOF is found before the end of the string, it is undefined (see the UNDEFINED token below). Examples: 'This is a string', '' (the empty string), 'this isn''t two strings'.
COMMENT - A line comment starts with # and ends at the next newline or EOF. A block comment starts with #| and ends with |#, and may span multiple lines. The comment's line number is the line where the comment started. Like STRING, if EOF is found before the end of a block comment, it is UNDEFINED (see the UNDEFINED token below). Examples: # This is a comment, #| This is a tricky multiline comment |#. Undefined example: #| This is an illegal block comment because it ends with EOF.
WHITESPACE - Any character recognized by the isspace() function defined in ctype.h. Though we give whitespace a token type here, these tokens will be ignored and not saved. Be sure to count the line numbers when skipping over white space.
UNDEFINED - Any character not tokenized as a string, keyword, identifier, symbol, or white space is undefined. Additionally, any non-terminating string or non-terminating block comment is undefined; in both of the latter cases we reach EOF before finding the end of the string or the end of the block comment. Examples: $&^ (three undefined tokens), 'any string that does not end (a non-terminating string).
EOF - End of input file.
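Since you must define the token types in your code, one possible C++ definition is sketched below. The enumerator name END_OF_FILE is an assumption made here only because EOF is already a macro in the standard C headers; any spelling works internally as long as the printed output says EOF.

// One enumerator per token type in the table above. WHITESPACE is recognized
// but never printed; END_OF_FILE stands in for the EOF token type.
enum class TokenType {
    COMMA, PERIOD, Q_MARK, LEFT_PAREN, RIGHT_PAREN,
    COLON, COLON_DASH, MULTIPLY, ADD,
    SCHEMES, FACTS, RULES, QUERIES,
    ID, STRING, COMMENT, WHITESPACE, UNDEFINED, END_OF_FILE
};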
Output Format
For this and all subsequent labs, you MUST match the formatting requirements exactly. Automated testing systems (similar to those used by professional software design companies) are used to check your code, and it is desired that you gain experience writing code that matches a given protocol.
The expected output is one token representation per line, followed by a single line that states how many tokens there were. You should NOT print any information for a WHITESPACE token. Format tokens as follows:
(Token_Type,"Value",Line_Number)
Notice there are no spaces on either side of the commas separating the token's three elements. The token type should be spelled exactly as listed above (all capital letters). The value is the text of the token surrounded by double quotes. You should have only one token per line, and the last line should look like:
Total Tokens = ###
where ### is replaced by the number of tokens found (excluding whitespace). ALL OUTPUT IS CASE SENSITIVE. See the examples below.
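As a quick illustration of the format only, the sketch below prints token lines in exactly this shape and then the total. The helper name printToken and its parameters are hypothetical, not a required interface.

#include <iostream>
#include <string>

// Prints one token line: no spaces around the commas, the value wrapped in
// double quotes, and the token type spelled in all capital letters.
void printToken(const std::string& typeName, const std::string& value, int line) {
    std::cout << "(" << typeName << ",\"" << value << "\"," << line << ")" << std::endl;
}

int main() {
    printToken("COLON_DASH", ":-", 7);    // prints: (COLON_DASH,":-",7)
    printToken("STRING", "'Snoopy'", 2);  // prints: (STRING,"'Snoopy'",2)
    std::cout << "Total Tokens = " << 2 << std::endl;
    return 0;
}

Note that a STRING or block COMMENT token's value keeps any embedded newlines, so such a token prints across more than one output line (as in the examples that follow).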
Example 1
For input (the line numbers are not part of the input; they are shown only to clarify where the newlines are)
01 .
02 ,
03 'a
04 string' # A comment
05 Schemes
06 FactsRules
07 : :-
08
The Lexer should output
(PERIOD,".",1)
(COMMA,",",2)
(STRING,"'a
string'",3)
(COMMENT,"# A comment",4)
(SCHEMES,"Schemes",5)
(ID,"FactsRules",6)
(COLON,":",7)
(COLON_DASH,":-",7)
(EOF,"",8)
Total Tokens = 9
Example 2
Input
01 Queries:
02 IsInRoomAtDH('Snoopy',R,'M',H)
03 #Schemes FactsRules<
04 .
05 #|comment
06 wow|#
07
Output
(QUERIES,"Queries",1)
(COLON,":",1)
(ID,"IsInRoomAtDH",2)
(LEFT_PAREN,"(",2)
(STRING,"'Snoopy'",2)
(COMMA,",",2)
(ID,"R",2)
(COMMA,",",2)
(STRING,"'M'",2)
(COMMA,",",2)
(ID,"H",2)
(RIGHT_PAREN,")",2)
(COMMENT,"#Schemes FactsRules<",3)
(PERIOD,".",4)
(COMMENT,"#|comment
wow|#",5)
(EOF,"",7)
Total Tokens = 16
Example 3
Input
01 Queries:
02 IsInRoomAtDH('Snoopy',R,'M'H)?
03 Rules Facts Schemes
04 &@Queries
05 :
06 'hello
07 I am' $ A 'this has a
08 Return near
09 The end''s
10
Output
(QUERIES,"Queries",1)
(COLON,":",1)
(ID,"IsInRoomAtDH",2)
(LEFT_PAREN,"(",2)
(STRING,"'Snoopy'",2)
(COMMA,",",2)
(ID,"R",2)
(COMMA,",",2)
(STRING,"'M'",2)
(ID,"H",2)
(RIGHT_PAREN,")",2)
(Q_MARK,"?",2)
(RULES,"Rules",3)
(FACTS,"Facts",3)
(SCHEMES,"Schemes",3)
(UNDEFINED,"&",4)
(UNDEFINED,"@",4)
(QUERIES,"Queries",4)
(COLON,":",5)
(STRING,"'hello
I am'",6)
(UNDEFINED,"$",7)
(ID,"A",7)
(UNDEFINED,"'this has a
Return near
The end''s
",7)
(EOF,"",10)
Total Tokens = 24
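Putting the pieces together, a minimal driver might look like the sketch below. The command-line handling is an assumption (check the project standards for how the input file is actually supplied), and the commented-out Lexer calls stand in for whatever scanner design you choose.

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main(int argc, char* argv[]) {
    if (argc < 2) {                        // assumed: input file named on the command line
        std::cerr << "usage: lexer <input file>" << std::endl;
        return 1;
    }
    std::ifstream in(argv[1]);
    if (!in) {
        std::cerr << "could not open " << argv[1] << std::endl;
        return 1;
    }
    std::stringstream buffer;              // read the whole file so the scanner
    buffer << in.rdbuf();                  // can look ahead and count newlines
    std::string text = buffer.str();

    // Hypothetical: run your FSM-based scanner over 'text', collect the tokens,
    // then print each non-whitespace token followed by the Total Tokens line.
    // Lexer lexer(text);
    // lexer.printTokens();
    return 0;
}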
Submission
Review the project standards for creating a zip archive. Your submission cannot be scored if you fail to create the archive correctly.
Navigate to Learning Suite [http://learningsuite.byu.edu], select 'CS 236', and click on 'Assignments' on the course home page. Click on the link to the relevant project and at the bottom of the description click on 'View/Submit'. Use the resulting upload dialog to upload your zip archive.
Pass-off
Pass off your project directly to a TA during normal TA hours after your archive is uploaded to Learning Suite [http://learningsuite.byu.edu].
TAs help students on a first-come/first-served basis and are under no obligation to stay later than designated hours, so plan accordingly.
Please review the syllabus for the pass‑off requirements.