CS 363 Comparative Programming Languages


Computer Systems Lab
TJHSST
Current Projects 2004-2005
First Period
Current Projects, 1st Period
• Caroline Bauer: Archival of Articles via RSS and
Datamining Performed on Stored Articles
• Susan Ditmore: Construction and Application of a Pentium
II Beowulf Cluster
• Michael Druker: Universal Problem Solving Contest Grader
Current Projects, 1st Period
• Matt Fifer: The Study of Microevolution Using Agent-based
Modeling
• Jason Ji: Natural Language Processing: Using Machine
Translation in Creation of a German-English Translator
• Anthony Kim: A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
• John Livingston: Kernel Debugging User-Space API
Library (KDUAL)
Current Projects, 1st Period
• Jack McKay: Saber-what? An Analysis of the Use of
Sabermetric Statistics in Baseball
• Peden Nichols: An Investigation into Implementations of
DNA Sequence Pattern Matching Algorithms
• Robert Staubs: Part-of-Speech Tagging with Limited
Training Corpora
• Alex Volkovitsky: Benchmarking of Cryptographic
Algorithms
Archival of Articles
via RSS and
Datamining
Performed on Stored
Articles
RSS (Really Simple Syndication,
encompassing Rich Site Summary and RDF
Site Summary) is a web syndication protocol
used by many blogs and news websites to
distribute information; it saves people having
to visit several sites repeatedly to check for
new content. At this point in time there are
many RSS newsfeed aggregators available to
the public, but none of them perform any sort
of archival of information beyond the RSS
metadata. The purpose of this project is to
create an RSS aggregator that will archive
the text of the actual articles linked to in the
RSS feeds in some kind of linkable,
searchable database, and, if all goes well,
implement some sort of datamining capability as well.
Archival of Articles via RSS, and Datamining
Performed on Stored Articles
Caroline Bauer
Abstract:
RSS (Really Simple Syndication, encompassing Rich Site Summary and RDF Site
Summary) is a web syndication protocol used by many blogs and news websites to
distribute information; it saves people having to visit several sites repeatedly to
check for new content. At this point in time there are many RSS newsfeed
aggregators available to the public, but none of them perform any sort of archival of
information beyond the RSS metadata. As the articles linked may move or be
eliminated at some time in the future, if one wants to be sure one can access them in
the future one has to archive them oneself; furthermore, should one want to link
such collected articles, it is far easier to do if one has them archived. The purpose of
this project is to create an RSS aggregator that will archive the text of the actual
articles linked to in the RSS feeds in some kind of linkable, searchable database,
and, if all goes well, implement some sort of datamining capability as well.
Archival of Articles via RSS, and Datamining
Performed on Stored Articles
Caroline Bauer
Introduction
This paper is intended to be a detailed summary of all of the author's findings regarding
the archival of articles in a linkable, searchable database via RSS.
Background
RSS
RSS stands for Really Simple Syndication, a syndication protocol often used by
weblogs and news sites. Technically, RSS is an xml-based communication standard that
encompasses Rich Site Summary (RSS 0.9x and RSS 2.0) and RDF Site Summary
(RSS 0.9 and 1.0). It enables people to gather new information by using an RSS
aggregator (or "feed reader") to poll RSS-enabled sites for new information, so the user
does not have to manually check each site. RSS aggregators are often extensions of
browsers or email programs, or standalone programs; alternatively, they can be web-based, so the user can view their "feeds" from any computer with Web access.
Archival Options Available in Existing RSS Aggregators
Data Mining
Data mining is the searching out of information based on patterns present in large
amounts of data. //more will be here.
Archival of Articles via RSS, and Datamining
Performed on Stored Articles
Caroline Bauer
Purpose
The purpose of this project is to create an RSS aggregator that, in addition to serving as
a feed reader, obtains the text of the documents linked in the RSS feeds and places it
into a database that is both searchable and linkable. In addition to this, the database is
intended to reach an implementation wherein it performs some manner of data mining
on the information contained therein; the specifics on this have yet to be determined.
Development
Results
Conclusions
Summary
References
1. "RSS (protocol)." Wikipedia. 8 Jan. 2005. 11 Jan. 2005 <http://en.
wikipedia.org/wiki/RSS_%28protocol%29>. 2. "Data mining." Wikipedia. 7 Jan. 2005.
12 Jan. 2005. <http://en. wikipedia.org/wiki/Data_mining>.
Construction and
Application of a
Pentium II Beowulf
Cluster
I plan to construct a supercomputing
cluster of about 15-20 or more
Pentium II computers with the
OpenMosix kernel patch. Once
constructed, the cluster could be
configured to transparently aid
workstations with computationally
expensive jobs run in the lab. This
project would not only increase the
computing power of the lab, but it
would also be an experiment in
building a low-level, low-cost cluster
with a stripped down version of
Linux, useful to any facility with old
computers they would otherwise deem
outdated.
Construction and Application of a Pentium II
Beowulf Cluster
Susan Ditmore
Text version needed
(your pdf file won't copy to text)
Universal
Problem Solving
Contest Grader
Michael Druker
(poster needed)
Universal Problem Solving Contest
Grader
Michael Druker
Steps so far:
- Creation of directory structure for the grader, the contests, the users, the users' submissions, the test cases.
- Starting of main grading script itself.
- Refinement of directory structure for the grader.
- Reading of material on the bash scripting language to be able to write the various scripts that will be necessary.
Universal Problem Solving Contest
Grader
Current program:
Michael Druker
#!/bin/bash
CONDIR="/afs/csl.tjhsst.edu/user/mdruker/techlab/code/new/"
#syntax is "grade contest user program"
contest=$1
user=$2
program=$3
echo "contest name is " $1
echo "user's name is " $2
echo "program name is " $3
Universal Problem Solving Contest
Grader
Michael Druker
Current program continued:
#get the location of the program and the test data
#make sure that the contest, user, program are valid
PROGDIR=${CONDIR}"contests/"${contest}"/users/"${user}
echo "user's directory is" $PROGDIR
if [ -d ${PROGDIR} ]
then echo "good input"
else echo "bad input, directory doesn't exist"
    exit 1
fi
exit 0
Study of
Microevolution Using
Agent-Based Modeling
in C++
The goal of the project is to create a
program that uses an agent-environment
structure to imitate a very simple natural
ecosystem: one that includes a single type
of species that can move, reproduce, kill,
etc. The "organisms" will contain
genomes (libraries of genetic data) that
can be passed from parents to offspring
in a way similar to that of animal
reproduction in nature. As the agents
interact with each other, the ones with
the characteristics most favorable to
survival in the artificial ecosystem will
produce more children, and over time,
the mean characteristics of the system
should start to gravitate towards the traits
that would be most beneficial. This
process, the optimization of physical
traits of a single species through passing
on heritable advantageous genes, is
known as microevolution.
THE STUDY OF MICROEVOLUTION
USING AGENT-BASED MODELING
Matt Fifer
Abstract
The goal of the project is to create a program that uses an agent-environment structure to imitate a very simple natural ecosystem:
one that includes a single type of species that can move, reproduce,
kill, etc. The "organisms" will contain genomes (libraries of genetic
data) that can be passed from parents to offspring in a way similar to
that of animal reproduction in nature. As the agents interact with
each other, the ones with the characteristics most favorable to
survival in the artificial ecosystem will produce more children, and
over time, the mean characteristics of the system should start to
gravitate towards the traits that would be most beneficial. This
process, the optimization of physical traits of a single species
through passing on heritable advantageous genes, is known as
microevolution.
THE STUDY OF MICROEVOLUTION
USING AGENT-BASED MODELING
Matt Fifer
Purpose
One of the most controversial topics in science today is the debate of
creationism vs. Darwinism. Advocates for creationism believe that
the world was created according to the description detailed in the 1st
chapter of the book of Genesis in the Bible. The Earth is
approximately 6,000 years old, and it was created by God, followed
by the creation of animals and finally the creation of humans, Adam
and Eve. Darwin and his followers believe that from the moment the
universe was created, all the objects in that universe have been in
competition. Everything - from the organisms that make up the
global population, to the cells that make up those organisms, to the
molecules that make up those cells - has beaten all of its competitors
in the struggle for resources commonly known as life.
THE STUDY OF MICROEVOLUTION
USING AGENT-BASED MODELING
Matt Fifer
This project will attempt to model the day-to-day war between
organisms of the same species. Organisms, or agents, that can move,
kill, and reproduce will be created and placed in an ecosystem. Each
agent will include a genome that codes for its various characteristics.
Organisms that are more successful at surviving or more successful
at reproducing will pass their genes to their children, making future
generations better suited to the environment. The competition will
continue, generation after generation, until the simulation terminates.
If evolution has occurred, the characteristics of the population at the
end of the simulation should be markedly different than at the
beginning.
THE STUDY OF MICROEVOLUTION
USING AGENT-BASED MODELING
Matt Fifer
Background
Two of the main goals of this project are the study of microevolution and
the effects of biological mechanisms on this process. Meiosis, the
formation of gametes, controls how genes are passed from parents to their
offspring. In the first stage of meiosis, prophase I, the strands of DNA
floating around the nucleus of the cell are wrapped around histone proteins
to form chromosomes. Chromosomes are easier to work with than the
strands of chromatin, as they are packaged tightly into an "X" structure
(two ">"s connected at the centromere). In the second phase, metaphase I,
chromosomes pair up along the equator of the cell, with homologous
chromosomes being directly across from each other. (Homologous
chromosomes code for the same traits, but come from different parents, and
thus code for different versions of the same trait.) The pairs of
chromosomes, called tetrads, are connected and exchange genetic material.
THE STUDY OF MICROEVOLUTION
USING AGENT-BASED MODELING
Matt Fifer
This process, called crossing over, results in both of the chromosomes being
a combination of genes from the mother and the father. Whole genes swap
places, not individual nucleotides. In the third phase, anaphase I, fibers
from within the cell pull the pair apart. When the pairs are pulled apart, the
two chromosomes are put on either side of the cell. Each pair is split
randomly, so for each pair, there are two possible outcomes. For instance,
the paternal chromosome can either move to the left or right side of the cell,
with the maternal chromosome moving to the opposite end. In telophase I,
the two sides of the cell split into two individual cells. Thus, for each cell
undergoing meiosis, there are 2^n possible gametes (n being the number of chromosome pairs). With crossing over,
there are almost an infinite number of combinations of genes in the
gametes. This large number of combinations is the reason for the genetic
biodiversity that exists in the world today, even among species. For example,
there are 6 billion humans on the planet, and none of them is exactly the
same as another one.
THE STUDY OF MICROEVOLUTION
USING AGENT-BASED MODELING
Matt Fifer
Procedure
This project will be implemented with a matrix of agents. The matrix,
initialized with only empty spaces, will be seeded with organisms by an
Ecosystem class. Each agent in the matrix will have a genome, which will
determine how it interacts with the Ecosystem. During every step of the
simulation, an organism will have a choice whether to 1. do nothing 2. move
to an empty adjacent space 3. kill an organism in a surrounding space, or 4.
reproduce with an organism in an adjacent space. The likelihood of the
organism performing any of these tasks is determined by the organism's
personal variables, which will be coded for by the organism's genome. While
the simulation is running, the average characteristics of the population will
be measured. In theory, the mean value of each of the traits (speed, agility,
strength, etc.) should either increase with time or gravitate towards a
particular, optimum value.
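The paper does not show the selection code itself, so the following is only a sketch of how such a genome-weighted choice could look in C++; the weighting scheme is an assumption, though the trait names mirror the Organism members shown later (laziness, rage, sexdrive, activity).

#include <cstdlib>

// Possible actions an agent can take in one simulation step.
enum Action { IDLE, MOVE, KILL, REPRODUCE };

// Hypothetical per-organism tendencies, assumed to be decoded from the genome.
struct Traits {
    int laziness;   // weight for doing nothing
    int activity;   // weight for moving
    int rage;       // weight for killing a neighbor
    int sexdrive;   // weight for reproducing
};

// Pick one action with probability proportional to its weight.
// Assumes at least one weight is positive.
Action chooseAction(const Traits& t) {
    int total = t.laziness + t.activity + t.rage + t.sexdrive;
    int r = std::rand() % total;
    if (r < t.laziness)                       return IDLE;
    if (r < t.laziness + t.activity)          return MOVE;
    if (r < t.laziness + t.activity + t.rage) return KILL;
    return REPRODUCE;
}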
THE STUDY OF MICROEVOLUTION
USING AGENT-BASED MODELING
Matt Fifer
At its most basic level, the program written to model microevolution is an
agent-environment program. The agents, or members of the Organism class,
contain a genome and have abilities that are dependent upon the genome.
Here is the declaration of the Organism class:
class Organism {
public:
    Organism();                                      //constructors
    Organism(int ident, int row2, int col2);
    Organism(Nucleotide* mDNA, Nucleotide* dDNA, int ident,
             bool malefemale, int row2, int col2);
    ~Organism();                                     //destructor
    void printGenome();
    void meiosis(Nucleotide* gamete);
    Organism* reproduce(Organism* mate, int ident, int r, int c);
    int Interact(Organism* neighbors, int nlen);
    ...
THE STUDY OF MICROEVOLUTION
USING AGENT-BASED MODELING
Matt Fifer
    //accessor functions; each assigns a gene a numeric value
    int Laziness(); int Rage(); int SexDrive(); int Activity();
    int DeathRate(); int ClausIndex(); int Age(); int Speed();
    int Row(); int Col(); int PIN();
    bool Interacted(); bool Gender();
    void setPos(int row2, int col2);
    void setInteracted(bool interacted);
private:
    void randSpawn(Nucleotide* DNA, int size);  //randomly generates a genome
    Nucleotide *mom, *dad;                      //genome
    int ID, row, col, laziness, rage, sexdrive, activity, deathrate,
        clausindex, speed;                      //personal characteristics
    double age;
    bool male, doneStuff;
    ...
THE STUDY OF MICROEVOLUTION
USING AGENT-BASED MODELING
Matt Fifer
The agents are managed by the environment class, known as Ecosystem. The
Ecosystem contains a matrix of Organisms.
Here is the declaration of the Ecosystem class:
class Ecosystem {
public:
    Ecosystem();  Ecosystem(double oseed);   //constructors
    ~Ecosystem();                            //destructor
    void Run(int steps);                     //the simulation
    void printMap();  void print(int r, int c);
    void surrSpaces(Organism* neighbors, int r, int c, int &friends);  //the neighbors of any cell
private:
    Organism** Population;                   //the matrix of Organisms
};
THE STUDY OF MICROEVOLUTION
USING AGENT-BASED MODELING
Matt Fifer
The simulation runs for a predetermined number of steps within the
Ecosystem class. During every step of the simulation, the environment class
cycles through the matrix of agents, telling each one to interact with its
neighbors. To aid in the interaction, the environment sends the agent an
array of the neighbors that it can affect. Once the agent has changed (or not
changed) the array of neighbors, it sends the array back to the environment
which then updates the matrix of agents. Here is the code for the Organism's
function which enables it to interact with its neighbors:
THE STUDY OF MICROEVOLUTION
USING AGENT-BASED MODELING
Matt Fifer
int Organism::Interact(Organism* neighbors, int nlen)
//returns 0 if the organism hasn't moved & 1 if it has
{
    fout << row << " " << col << " ";
    if(!ID)        //This Organism is not an organism
    {   fout << "Not an organism, cannot interact!" << endl;  return 0;  }
    if(doneStuff)  //This Organism has already interacted once this step
    {   fout << "This organism has already interacted!" << endl;  return 0;  }
    doneStuff = true;
    int loop;
    for(loop = 0; loop < GENES * CHROMOSOMES * GENE_LENGTH; loop++)
    {
        if(rand() % RATE_MAX < MUTATION_RATE)
            mom[loop] = (Nucleotide)(rand() % 4);
        if(rand() % RATE_MAX < MUTATION_RATE)
THE STUDY OF MICROEVOLUTION
USING AGENT-BASED MODELING
Matt Fifer
The Organisms, during any simulation step, can either move, kill a neighbor,
remain idle, reproduce, or die. The fourth option, reproduction, is the most
relevant to the project. As explained before, organisms that are better at
reproducing or surviving will pass their genes to future generations. The most
critical function in reproduction is the meiosis function, which determines
what traits are passed down to offspring. The process is completely random,
but an organism with a "good" gene has about a 50% chance of passing that
gene on to its child. Here is the meiosis function, which determines what
genes each organism sends to its offspring:
void Organism::meiosis(Nucleotide *gamete)
{
    int x, genect, chromct, crossover;
    Nucleotide *chromo  = new Nucleotide[GENES * GENE_LENGTH],
               *chromo2 = new Nucleotide[GENES * GENE_LENGTH];
    Nucleotide *gene  = new Nucleotide[GENE_LENGTH],
               *gene2 = new Nucleotide[GENE_LENGTH];
    ... (more code)
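The rest of the function is elided above. Purely to illustrate the idea (a random crossover point per gene, then a coin flip for which homolog leads in the gamete), a simplified stand-alone version might look like the sketch below; the constants, the enum definition, and the flat gene layout are assumptions, and the project's real meiosis function may differ.

#include <cstdlib>

enum Nucleotide { A, C, G, T };

const int GENES = 8;          // assumed genome layout
const int GENE_LENGTH = 16;

// Build one gamete from the maternal and paternal chromosome copies.
// For each gene, pick a crossover point, splice the two parental copies
// together, and flip a coin for which spliced product is passed on.
void makeGamete(const Nucleotide* mom, const Nucleotide* dad, Nucleotide* gamete)
{
    for (int g = 0; g < GENES; g++) {
        int base = g * GENE_LENGTH;
        int crossover = std::rand() % GENE_LENGTH;
        bool momFirst = std::rand() % 2 == 0;      // which homolog leads
        for (int i = 0; i < GENE_LENGTH; i++) {
            const Nucleotide* src =
                (i < crossover) == momFirst ? mom : dad;
            gamete[base + i] = src[base + i];
        }
    }
}

int main() {
    Nucleotide mom[GENES * GENE_LENGTH] = {}, dad[GENES * GENE_LENGTH] = {};
    Nucleotide gamete[GENES * GENE_LENGTH];
    makeGamete(mom, dad, gamete);
}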
THE STUDY OF MICROEVOLUTION
USING AGENT-BASED MODELING
Matt Fifer
The functions and structures above are the most essential to the running of
the program and the actual study of microevolution. At the end of each
simulation step, the environment class records the statistics for the agents in
the matrix and puts the numbers into a spreadsheet for analysis. The
spreadsheet can be used to observe trends in the mean characteristics of the
system over time. Using the spreadsheet created by the environment class, I
was able to create charts that would help me analyze the evolution of the
Organisms over the course of the simulation.
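The paper does not show the logging code; one simple way to produce such a spreadsheet (a guess at the mechanism, not the project's code) is to append one comma-separated line per simulation step:

#include <fstream>

// Append one row of per-step averages to a CSV file that a spreadsheet
// program can open directly. The column meanings are assumed for illustration.
void logStep(std::ofstream& out, int step, double avgLibido,
             double avgRage, double avgActivity, int population)
{
    out << step << ',' << avgLibido << ',' << avgRage << ','
        << avgActivity << ',' << population << '\n';
}

int main() {
    std::ofstream out("stats.csv");
    logStep(out, 0, 0.42, 0.10, 0.33, 500);   // called once per step in practice
}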
THE STUDY OF MICROEVOLUTION
USING AGENT-BASED MODELING
Matt Fifer
The first time I ran the simulation, I set the program so that there was no
mutation in the agents' genomes. Genes were strictly created at the outset of
the program, and those genes were passed down to future generations. If
microevolution were to take place, a gene that coded for a beneficial
characteristic would have a higher chance of being passed down to a later
generation. Without mutation, however, if one organism possessed a
characteristic that was far superior to the comparable characteristics of other
organisms, that gene should theoretically allow that organism to "dominate"
the other organisms and pass its genetic material to many children, in effect
exterminating the genes that code for less beneficial characteristics.
THE STUDY OF MICROEVOLUTION
USING AGENT-BASED MODELING
Matt Fifer
For example, if an organism was created that had a 95% chance of
reproducing in a given simulation step, it would quickly pass its genetic
material to a lot of offspring, until its gene was the only one left coding
for reproductive tendency, or libido.
THE STUDY OF MICROEVOLUTION
USING AGENT-BASED MODELING
Matt Fifer
As you can see from Figure 1, the average tendency to reproduce increases
during the simulation. The tendency to die decreases to almost nonexistence.
The tendency to remain still, since it has relatively little effect on anything, stays
almost constant. The tendency to move to adjacent spaces, thereby spreading
one's genes throughout the ecosystem, increases to be almost as likely as
reproduction. The tendency to kill one's neighbor decreases drastically,
probably because it does not positively benefit the murdering organism. In
Figure 2, we can see that the population seems to stabilize at about the same
time as the average characteristics. This would suggest that there was a large
amount of competition among the organisms early in the simulation, but the
competition quieted down as one dominant set of genes took over the
ecosystem.
THE STUDY OF MICROEVOLUTION
USING AGENT-BASED MODELING
Matt Fifer
These figures (Figures 3 and 4) show the results from the second run of the
program, when mutation was turned on. As you can see, many of the same
trends exist, with reproductive tendency skyrocketing and tendency to kill
plummeting. Upon reevaluation, it seems that perhaps the tendencies to
move and remain idle do not really affect an agent's ability to survive, and
thus their trends are more subject to fluctuations that occur in the
beginning of the simulation. One thing to note about the mutation
simulation is the larger degree of fluctuation in both characteristics and
population. The population stabilizes at about the same number, but swings
between simulation steps are more pronounced. In Figure 3, the
stabilization that had occurred in Figure 1 is largely not present.
THE STUDY OF MICROEVOLUTION
USING AGENT-BASED MODELING
Matt Fifer
Conclusion
The goal of this project at the outset was to create a system that modeled
trends and processes from the natural world, using the same mechanisms
that occur in that natural world. While this project by no means definitively
proves the correctness of Darwin's theory of evolution over the creationist
theory, it demonstrates some of the basic principles that Darwin addressed
in his book, The Origin of Species. Darwin addresses two distinct processes: natural selection and artificial selection. Artificial selection, or selective
breeding, was not present in this project at all. There was no point in the
program where the user was allowed to pick organisms that survived.
Natural selection, though it is a stretch because nature was the inside of a
computer, was simulated. Natural selection, described as the "survival of the
fittest," is when an organism's characteristics enable it to survive and pass
those traits to its offspring.
THE STUDY OF MICROEVOLUTION
USING AGENT-BASED MODELING
Matt Fifer
In this program, "nature" was allowed to run its course, and at the end of
the simulation, the organisms with the best combination of characteristics
had triumphed over their predecessors. "Natural" selection occurred as
predicted.
*All of the information in this report was either taught last year in A.P.
Biology or drawn, to a small degree, from Charles Darwin's The Origin of
Species. I created all of the code and all of the charts in this paper. For my
next draft, I will be sure to include more outside information that I have
found in the course of my research*
Using Machine
Translation in a
German – English
Translator
This project attempts to take the
beginning steps towards the goal of
creating a translator program that
operates within the scope of
translating between English and
German.
Natural Language Processing: Using
Machine Translation in Creation of a
German-English Translator
Jason Ji
Abstract:
The field of machine translation - using computers to provide
translations between human languages - has been around for
decades. And the dream of an ideal machine providing a perfect
translation between languages has been around still longer. This
project attempts to take the beginning steps towards that goal,
creating a translator program that operates within an extremely
limited scope to translate between English and German. There are
several different strategies to machine translation, and this project
will look into them - but the strategy taken in this project will be
the researcher's own, with the general guideline of "thinking as a
human."
Natural Language Processing: Using
Machine Translation in Creation of a
German-English Translator
Jason Ji
For if humans can translate between languages, there must be
something to how we do it, and hopefully that something - that
thought process, hopefully - can be transferred to the machine and
provide quality translations.
Background
There are several methods of varying difficulty and success to
machine translation. The best method to use depends on what sort
of system is being created. A bilingual system translates between
one pair of languages; a multilingual system translates among
more than two languages.
Natural Language Processing: Using
Machine Translation in Creation of a
German-English Translator
Jason Ji
The easiest translation method to code, yet probably least
successful, is known as the direct approach. The direct approach
does what it sounds like it does - takes the input language (known
as the "source language"), performs 2 morphological analysis whereby words are broken down and analyzed for things such as
prefixes and past tense endings, performs a bilingual dictionary
look-up to determine the words' meanings in the target language,
performs a local reordering to fit the grammar structure of the
target language, and produces the target language output. The
problem with this approach is that it is essentially a word-for-word
translation with some reordering, resulting often in mistranslations
and incorrect grammar structures.
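As a toy illustration of the word-for-word character of the direct approach (this is not part of the project; the tiny dictionary and the lack of any reordering or case handling are deliberate simplifications), consider:

#include <iostream>
#include <map>
#include <sstream>
#include <string>

// Direct-approach sketch: bilingual dictionary look-up, word for word,
// with no reordering step. Mistranslations like those described above
// are exactly what this produces.
int main() {
    std::map<std::string, std::string> dict = {
        {"i", "ich"}, {"ate", "ass"}, {"an", "ein"}, {"apple", "Apfel"}
    };
    std::string sentence = "i ate an apple", word;
    std::istringstream in(sentence);
    while (in >> word) {
        auto hit = dict.find(word);
        std::cout << (hit != dict.end() ? hit->second : "[?" + word + "]") << ' ';
    }
    std::cout << '\n';   // prints "ich ass ein Apfel" - the article's case is wrong
}

The output keeps the English article unchanged ("ein" rather than the accusative "einen"), which is the kind of grammatical error the direct approach cannot avoid.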
Natural Language Processing: Using
Machine Translation in Creation of a
German-English Translator
Jason Ji
Furthermore, when creating a multilingual system, the direct
approach would require several different translation algorithms - one or two for each language pair. The indirect approach involves
some sort of intermediate representation of the source language
before translating into the target language. In this way, linguistic
analysis of the source language can be performed on the
intermediate representation. Translating to the intermediary also
enables semantic analysis, as the source language input can be
analyzed more carefully to detect idioms, etc., which can be stored in the
intermediary and then appropriately used to translate into the
target language.
Natural Language Processing: Using
Machine Translation in Creation of a
German-English Translator
Jason Ji
The transfer method is similar, except that the transfer is language
dependent - that is to say, the French-English intermediary
transfer would be different from the English-German transfer. An
interlingua intermediary can be used for multilingual systems.
Theory
Humans fluent in two or more languages are at the moment better
translators than the best machine translators in the world. Indeed,
a person with three years of experience in learning a second
language will already be a better translator than the best machine
translators in the world as well.
Natural Language Processing: Using
Machine Translation in Creation of a
German-English Translator
Jason Ji
Yet for humans and machines alike, translation is a process, a
series of steps that must be followed in order to produce a
successful translation. It is interesting to note, however, that the
various methods of translation for machines - the various
processes - become less and less like the process for humans as
they become more complicated. Furthermore, it was interesting to
notice that as the method of machine translation becomes more
complicated, the results are sometimes less accurate than the
results of simpler methods that better model the human rationale
for translation.
Natural Language Processing: Using
Machine Translation in Creation of a
German-English Translator
Jason Ji
Therefore, the theory is, an algorithm that attempts to model the
human translation process would be more successful than other,
more complicated methods currently in development today. This
theory is not entirely feasible for full-scale translators because of
the sheer magnitude of data that would be required. Humans are
better translators than computers in part because they have the
ability to perform semantic analysis, because they have the
necessary semantic information to be able to, for example,
determine the difference in a word's definition based on its usage
in context. Creating a translator with a limited-scope of
vocabulary would require less data, leaving more room for
semantic information to be stored along with definitions.
Natural Language Processing: Using
Machine Translation in Creation of a
German-English Translator
Jason Ji
A limited-scope translator may seem of little use at first glance, but
even humans fluent in a language, including their native
language, don't know its entire vocabulary. A
language has hundreds of thousands of words, and no human
knows even half of them. A computer with a vocabulary of
commonly used words that most people know, along with
information to avoid semantic problems, would therefore still be
useful for non-professional work.
Development
On the most superficial level, a translator is more user-friendly for
an average person if it is GUI-based, rather than simply text-based.
This part of the development is finished. The program presents a
GUI for the user.
Natural Language Processing: Using
Machine Translation in Creation of a
German-English Translator
Jason Ji
A JFrame opens up with two text areas and a translate button. The
text areas are labeled "English" and "German". The input text is
typed into the English window, the "Translate" button is clicked,
and the translator, once finished, outputs the translated text into the
German text area. Although typing into the German text area is
possible, the text in the German text area does not affect the
translator process. The first problem to deal with in creating a
machine translator is to be able to recognize the words that are
inputted into the system. A sentence or multiple sentences are input
into the translator, and a string consisting of that entire sentence
(or sentences) is passed to the translate() function.
Natural Language Processing: Using
Machine Translation in Creation of a
German-English Translator
Jason Ji
The system loops through the string, finding all space (' ')
characters and punctuation characters (comma, period, etc) and
records their positions. (It is important to note the position of each
punctuation mark, as well as what kind of a punctuation mark it is,
because the existence and position of punctuation marks alter the
meaning of a sentence.)
The number of words in the sentence is determined to be the
number of spaces plus one. By recording the position of each space,
the string can then be broken up into the words. The start position
of each word is the position of each space, plus one, and the end
position is the position of the next space. This means that
punctuation at the end of any given word is placed into the String
with that word, but this is not
Natural Language Processing: Using
Machine Translation in Creation of a
German-English Translator
Jason Ji
a problem: the location of each punctuation mark is already
recorded, and the dictionary look-up of each word will first check to
ensure that the last character of each word is a letter; if not, it will
simply disregard the last character. The next problem is the biggest
problem of all, the problem of actual translation itself. Here there is
no code yet written, but development of pseudocode has begun
already. As previously mentioned, translation is a process. In order
to write a translator program that follows the human translation
process, the human process must first be recognized and broken
down into programmable steps. This is no easy task. Humans with
five years of experience
Natural Language Processing: Using
Machine Translation in Creation of a
German-English Translator
Jason Ji
in learning a language may already translate any given text quickly
enough - save the time needed to look up unfamiliar words - that the
process goes by too quickly to fully take note of. The basic process is not
entirely determined yet, but there is some progress on it. The
process used to determine the process has been as follows: given a
random sentence to translate, the sentence is first translated by a
human, then the process is noted. Each sentence given is increasingly difficult to translate.
Natural Language Processing: Using
Machine Translation in Creation of a
German-English Translator
Jason Ji
For example, the sentence "I ate an apple" is translated via the following process:
1) Find the subject and the verb. (I; ate)
2) Determine the tense and form of the verb. (ate = past, Imperfekt form)
   a) Translate subject and verb. (Ich; ass) (note - "ass" is a real German verb form.)
3) Determine what the verb requires. (ate -> eat; requires a direct object)
4) Find what the verb requires in the sentence. (direct object comes after verb and article; apple)
5) Translate the article and the direct object. (ein; Apfel)
6) Consider the gender of the direct object, change article if necessary. (der Apfel; ein -> einen)
Result: Ich ass einen Apfel.
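The word-splitting step described earlier (scan for spaces and punctuation, count words as spaces plus one, then slice the string at the recorded positions) is simple enough to sketch. The version below is an illustrative reimplementation in C++ rather than the project's actual Java code, and the function name is made up:

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

// Split a sentence into words on spaces, stripping a trailing
// punctuation mark from a word when one is attached to it.
std::vector<std::string> splitWords(const std::string& sentence) {
    std::vector<std::string> words;
    std::string current;
    for (char ch : sentence) {
        if (ch == ' ') {
            if (!current.empty()) { words.push_back(current); current.clear(); }
        } else {
            current += ch;
        }
    }
    if (!current.empty()) words.push_back(current);
    for (std::string& w : words)   // disregard a non-letter last character
        if (!w.empty() && !std::isalpha(static_cast<unsigned char>(w.back())))
            w.pop_back();
    return words;
}

int main() {
    for (const std::string& w : splitWords("I ate an apple."))
        std::cout << w << '\n';
}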
Natural Language Processing: Using
Machine Translation in Creation of a
German-English Translator
Jason Ji
References
(I'll put these in proper bibliomumbojumbographical order later!)
1. http://dict.leo.org (dictionary)
2. "An Introduction To Machine Translation" (available online at http://ourworld.compuserve.com/homepages/WJHutchins/IntroMTTOC.htm)
3. http://www.comp.leeds.ac.uk/ugadmit/cogsci/spchlan/machtran.htm (some info on machine translation)
4.
A Study of
Balanced
Search Trees
This project investigates four different
balanced search trees for their
advantages and
disadvantages, thus ultimately their
efficiency. Runtime and memory
space management are the two main
aspects under study. Statistical
analysis is provided to distinguish
subtle differences if there are any. A new
balanced search tree is suggested and
compared with the four balanced
search trees.
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
Abstract
This project investigates four different balanced search trees for their
advantages and disadvantages, thus ultimately their efficiency. Run
time and memory space management are the two main aspects under
study. Statistical analysis is provided to distinguish subtle differences,
if there are any. A new balanced search tree is suggested and
compared with the four balanced search trees under study. Balanced
search trees are implemented in C++ extensively using pointers and
structs.
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
Introduction
Balanced search trees are important data structures. A normal binary
search tree has some disadvantages, specifically its dependence
on the incoming data, which significantly affects its tree structure and
hence its performance. The height of a search tree is the maximum
distance from the root of the tree to a leaf. An optimal search tree is
one that tries to minimize its height for a given amount of data. To
improve their height, and thus their efficiency, balanced search trees have
been developed that balance themselves into near-optimal tree
structures that allow quicker access to the data stored in them. For
example, the red-black tree is a balanced binary tree that balances
according to the color pattern of its nodes (red or black) by means of rotation
functions.
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
Rotation functions are a hallmark of nearly all balanced search trees;
they rotate or adjust subtree heights around a pivot node. Many
balanced trees have been suggested and developed: the red-black tree,
the AVL tree, the weight-balanced tree, the B-tree, and more.
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
Background Information
Search Tree Basics
This project requires a good understanding of binary trees and
general search tree basics. A binary tree has nodes and edges. Nodes
are the elements in the tree and edges represent the relationship between
two nodes. Each node in a binary tree is connected by edges to zero,
one, or two other nodes. In a general search tree, a node can have more than
two children, as in the case of the B-tree. A node is called a parent, and the
nodes connected by edges from it are called its children. A
node with no children is called a leaf node. An easy way to visualize a binary
tree is as a real tree turned upside down on a paper, with the root on the top
and the branches on the bottom.
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
The topmost ancestor of a binary tree is called the root. From the root, the
tree branches out to its immediate children and subsequent
descendants. Each node's children are designated as the left child and the
right child. One property of a binary search tree is that the value stored
in the left child is less than or equal to the value stored in the parent. The
right child's value is, on the other hand, greater than the parent's
(Left <= Parent, Parent < Right).
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
3.2 Search Tree Functions
There are several main functions that go along with binary trees and
general search trees: insertion, deletion, search, and traversal. In
insertion, when a datum is entered into the search tree, it is compared with
the root. If the value is less than or equal to the root's, then the
insertion function proceeds to the left child of the root and compares
again. Otherwise the function proceeds to the right child and
compares the value with that node's. When the function reaches the
end of the tree, for example if the last node the value was compared
with was a leaf node, a new node is created at that position with the
newly inserted value. The deletion function works similarly to find a node
with the value of interest (by going left and right accordingly).
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
Then the function deletes the node and fixes the tree (changing
parent-child relationships, etc.) to keep the property of the binary tree
or that of the general search tree. The search function, or basically data
retrieval, is also similar. After traversing down the tree (starting
from the root), two cases are possible. If the value of interest is
encountered on the traversal, then the function reports that the
data is in the tree. If the traversal ends at a leaf node with no
encounter of the value in search, then the function simply reports
otherwise. There are three kinds of traversal functions to show
the structure of a tree: preorder, inorder and postorder.
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
They are recursive functions that print the data in a special order.
For example in preorder traversal, as the prefix pre suggests, first
the value of the node is printed, then the recursion repeats on the left
subtree and then on the right subtree. Similarly, in inorder traversal,
as the prefix in suggests, first the left subtree is output, then the
node's value, then the right subtree. (Thus the node's value is
output in the middle of the function.) The same pattern applies to
postorder traversal.
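To make the insertion and traversal descriptions above concrete, here is a minimal unbalanced binary search tree sketch (plain C++ with pointers and structs, in the spirit of the paper's stated implementation, but not the paper's actual code):

#include <iostream>

struct Node {
    int value;
    Node *left, *right;
    Node(int v) : value(v), left(0), right(0) {}
};

// Walk left for values <= the current node, right otherwise,
// and attach a new leaf where the walk falls off the tree.
Node* insert(Node* root, int v) {
    if (!root) return new Node(v);
    if (v <= root->value) root->left = insert(root->left, v);
    else                  root->right = insert(root->right, v);
    return root;
}

// Inorder traversal prints the stored values in sorted order.
void inorder(const Node* n) {
    if (!n) return;
    inorder(n->left);
    std::cout << n->value << ' ';
    inorder(n->right);
}

int main() {
    Node* root = 0;
    int data[] = {5, 2, 8, 1, 3};
    for (int i = 0; i < 5; i++) root = insert(root, data[i]);
    inorder(root);            // prints 1 2 3 5 8
    std::cout << '\n';
}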
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
3.3 The Problem
It is not hard to see from the structure of a binary search tree (or
general search tree) that the order of data input is important. In an
optimal binary tree, the data are input so that insertion works out just
right and the tree is balanced: the size of the left subtree is
approximately equal to the size of the right subtree at each node in the
tree. In an optimal binary tree, the insertion, deletion, and search
functions run in O(log N) with N as the number of data in the
tree. This follows from the fact that whenever a data comparison occurs and
a subsequent traversal is taken (to the left or to the right), the number of
possible positions divides in half at each turn. However, that is only
the case when the input is nicely ordered and the search tree is balanced.
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
It is also possible that the data are input so that only right nodes are
added (Root -> right -> right -> right ...). It is obvious that the search
tree now looks just like a linear array. And it is. This gives O(N)
insertion, deletion, and search operations, which is not efficient.
Thus balanced search trees have been developed to perform these functions
efficiently regardless of data input.
4 Balanced Search Trees
Four major balanced search trees are investigated. Three of them,
namely the red-black tree, the height-balanced tree, and the weight-balanced
tree, are binary search trees. The fourth, the B-tree, is a search tree whose
nodes can have more than two children.
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
4.1 Red-black tree
The red-black search tree is a special binary tree with a color scheme: each
node is either black or red. There are four properties that make a
binary tree a red-black tree. (1) The root of the tree is colored
black. (2) All paths from the root to the leaves agree on the number
of black nodes. (3) No path from the root to a leaf may contain two
consecutive nodes colored red. (4) Every path from a node to a leaf
(of its descendants) has the same number of black nodes. The
performance of a balanced search tree is directly related to the height of
the tree. For a binary tree, lg(number of nodes) is usually the
optimal height. A red-black tree with n nodes has
height at most 2 lg(n + 1). The proof is noteworthy, but difficult to
understand.
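In code, a red-black node is just an ordinary binary node plus the color bit that these four properties constrain; a minimal declaration (illustrative only, not the project's code) is:

enum Color { RED, BLACK };

// Minimal red-black tree node: an ordinary binary search tree node
// plus the color field constrained by the four properties above.
struct RBNode {
    int value;
    Color color;
    RBNode *left, *right, *parent;
};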
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
In order to prove the assertion that a red-black tree's height is at
most 2 lg(n + 1), we should first define bh(x): the
number of black nodes on any path from, but not including, a
node x down to a leaf. Notice that black height (bh) is well defined under
property 2 of the red-black tree. It is easy to see that the black height
of a tree is the black height of its root. First we shall prove that the
subtree rooted at any given node x contains at least 2^bh(x) - 1
nodes. We can prove this by induction on the height of a node x.
The base case is bh(x) = 0, which means that x must be a leaf
(NIL); the claim holds because the subtree rooted at x then contains
2^0 - 1 = 0 nodes. The following is the inductive step. Suppose node x has
positive height and has two children.
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
Note that each child has a black-height of either bh(x), if it is a red
node, or bh(x) - 1, if it is a black node. It follows that the subtree
rooted at x contains at least 2(2^(bh(x) - 1) - 1) + 1 = 2^bh(x) - 1 nodes.
The first term is the minimum bound contributed by the left and right subtrees,
and the second term (the 1) accounts for x itself; a little algebra leads to the
right-hand side of the equation.
Having proved this, bounding the maximum height of a red-black tree is
fairly straightforward. Let h be the height of the tree.
Then by property 3 of the red-black tree, at least half of the nodes on
any simple path from the root to a leaf must be black, so the
black-height of the root must be at least h/2. Since n >= 2^bh(root) - 1, it
follows that n >= 2^(h/2) - 1, so n + 1 >= 2^(h/2), lg(n + 1) >= h/2, and
therefore h <= 2 lg(n + 1).
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
Therefore we have just proved that a red-black tree with n nodes has
height at most 2 lg(n + 1).
4.2 Height Balanced Tree
The height balanced tree is a different approach to bounding the
maximum height of a binary search tree. For each node, the heights of the
left subtree and the right subtree are stored. The key idea is to balance
the tree by rotating around any node that has a greater-than-threshold
height difference between its left subtree and its right subtree. It all
boils down to the following property: (1) At each node, the
difference between the height of the left subtree and the height of the right
subtree is less than a threshold value. A height balanced tree should
yield a height around lg(n), depending on the threshold value.
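The paper describes the rebalancing rotation only in words. A minimal sketch of a right rotation and the threshold check (generic AVL-style code; the field names and the threshold value are chosen here for illustration and are not taken from the project) might look like this:

struct HNode {
    int value;
    int height;          // height of the subtree rooted here
    HNode *left, *right;
};

int heightOf(const HNode* n) { return n ? n->height : 0; }

void updateHeight(HNode* n) {
    int hl = heightOf(n->left), hr = heightOf(n->right);
    n->height = 1 + (hl > hr ? hl : hr);
}

// Rotate right around *root: the left child becomes the new subtree root.
void rotateRight(HNode*& root) {
    HNode* pivot = root->left;
    root->left = pivot->right;
    pivot->right = root;
    updateHeight(root);
    updateHeight(pivot);
    root = pivot;
}

// Property (1): rebalance when the height difference exceeds the threshold.
const int THRESHOLD = 1;
bool needsRebalance(const HNode* n) {
    int diff = heightOf(n->left) - heightOf(n->right);
    return diff > THRESHOLD || diff < -THRESHOLD;
}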
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
An intuitive, less rigorous and yet valid proof is provided. Imagine
a simple binary tree in the worst case scenario, a line of nodes. If
the simple binary tree were to be transformed into a height
balanced tree, the following process would do it. (1) Pick some
node near the middle of a given strand of nodes so that the
threshold property is satisfied (|leftH() - rightH()| below the threshold). (2)
Define this node as a parent and the resulting two strands (nearly
equal in length) as its left subtree and right subtree appropriately. (3)
Repeat steps (1) and (2) on the left subtree and the right subtree.
First note that this process will terminate, because at each step the
given strand will be split into two halves smaller than the original
strand. So the number of nodes in a given strand will
decrease.
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
This will eventually reach a terminal number of nodes determined by
the threshold height difference. If a given strand is impossible to
divide so that the threshold height difference holds, then that is the
end of that recursive subroutine. Splitting into two halves
recursively is analogous to dividing a mass into two halves each
time. Dividing by 2 in turn leads to lg(n). So it follows that the height
of a height-balanced tree should be lg(n), or something around that
magnitude. It is interesting to note that a height balanced tree is
roughly a complete binary tree. This is because height balancing
allows nodes to gather around the top. There is probably a decent
proof for this observation, but simple intuition is enough to see
this.
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
4.3 Weight Balanced Tree
The weight balanced tree is very similar to the height balanced tree. It uses
much the same idea, with a different nuance, and the overall data structure is
also similar. Instead of the heights of the left and right subtrees, the weights
of the left subtree and the right subtree are kept. The weight of a tree is defined
as the number of nodes in that tree. The key idea is to balance the tree
by rotating around any node that has a greater-than-threshold weight
difference between its left subtree and its right subtree. Rotating
around a node shifts the weight balance to a more favorable one, specifically
one with a smaller difference between the weights of the left subtree and the
right subtree. The weight balanced tree has the following main property: (1) At
each node, the difference between the weight of the left subtree and the weight
of the right subtree is less than a threshold value.
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
An approach similar to the one used for the height balanced tree is used to show
the lg(n) height of the weight balanced tree. The proof uses a mostly intuitive
argument built on recursion and induction. Transforming a line of nodes, the
worst case scenario in a simple binary tree, into a weight balanced tree
can be done by the following steps. (1) Pick some node near the middle
of a given strand of nodes so that the threshold property is satisfied
(|leftW() - rightW()| below the threshold). (2) Define this node as a parent
and the resulting two strands (nearly equal in length) as its left subtree and
right subtree appropriately. (3) Repeat steps (1) and (2) on the
left subtree and the right subtree. It is easy to confuse the first step for the
height balanced tree and the weight balanced tree, but picking the middle
node surely satisfies both the height balanced tree property and the weight
balanced tree property.
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
The weight balanced tree property is arguably well satisfied here, since the
middle node presumably has the same number of nodes before and after its
position. This process will terminate, because at each step the given
strand will be split into two halves smaller than the original strand. So
the number of nodes in a given strand will decrease, and this
will eventually reach a terminal number of nodes determined by the
threshold weight difference. Splitting into two halves recursively is
analogous to dividing a mass into two halves each time. Dividing by 2 in
turn leads to lg(n). So it follows that the height of a weight-balanced tree
should be lg(n), or something around that magnitude. Like the height
balanced tree, the weight balanced tree is roughly a complete binary tree.
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
A New Balanced Search Tree(?)
A new balanced search tree has been developed. The tree probably has no
theoretical value to computer science, but it may have practical value.
The new balanced search tree will be referred to as the median-weight-mix tree;
each node will have a key, zero to two children, and some sort of
weight.
5.1 Background
The median-weight-mix tree probably serves no theoretical purpose because
it is not perfect: it has no well defined behavior that obeys a set of
properties. Rather it serves a practical purpose, most likely in statistics.
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
The median-weight-mix tree is based on the following assumptions about the data
being processed: (1) Given the lower bound and upper bound of the total data input,
random behavior is assumed, meaning data points will be evenly
distributed around the interval. (2) Multiple bell curves are assumed to be
present in the interval. The first assumption is not hard to understand. It
is based on the idea that nature is random. The data points will be
scattered about, but evenly, since randomness means each data value has an equal
chance of being present in the input data set. An example of this
physical model would be rain. In a rainstorm, raindrops fall randomly
onto the ground. In fact, one can estimate the amount of rainfall by sampling a
small area.
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
The amount of rain is measured in the small sampling area and then the total
rainfall can be calculated by numerical projection, a ratio, or whatever
method: total rainfall = rainfall-in-small-area * area-of-total-area / area-of-small-area.
The second assumption is based upon a less apparent observation.
Nature is not completely random, which
means some numbers will occur more often than others. When the data
values and the frequencies of those data values are plotted on a 2D plane, a
wave is expected: there are more hits in some ranges of data values
(the crests) than in other ranges of data values (the troughs). A practical
example would be height. One might expect a well defined bell-shaped
curve based on the average height (people tend to be around 5 foot 10 inches),
but this is not true when you look at it on a global scale, because there are
isolated populations around the world. The average height of
Americans is not necessarily the average height of Chinese. So this
wave-shaped curve is assumed.
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
5.2 Algorithm
Each node will have a key (data number), an interval (with the lower and
upper bounds of its assigned interval), and the weights of its left subtree and
right subtree. The weight of each subtree is calculated based on
constants R and S. Constant R represents the importance of focusing on
frequency-heavy data points. Constant S represents the importance of
focusing on frequency-weak data points. So the ratio R/S consequently
represents the relative importance of frequency-heavy vs. frequency-weak
data points. The tree will then be balanced to adjust to a favorable R/S
ratio at each node by means of rotating, both left rotations and right rotations.
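Since the data structure is only described in prose, the following is a guess at what a node might carry; the field names, types, and constant values are the editor's assumptions rather than the author's code:

// Hypothetical node layout for the median-weight-mix tree described above.
struct MWMNode {
    int key;                 // the data number stored at this node
    double lower, upper;     // bounds of the interval assigned to this node
    double leftWeight;       // weight of the left subtree
    double rightWeight;      // weight of the right subtree
    MWMNode *left, *right;   // zero to two children
};

// Tunable constants: R weights frequency-heavy regions, S frequency-weak ones;
// rebalancing rotations aim at a favorable R/S-adjusted balance at each node.
const double R = 2.0;
const double S = 1.0;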
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
6 Methodology
6.1 Idea
Evaluating binary search trees can be done in various ways because
they can serve a number of purposes. For this project, a binary search
tree was developed to take some advantage of the random nature of
statistical data under some assumptions. Therefore it is reasonable to do the
evaluation on this basis. With this overall purpose, several behaviors of
the balanced search trees will be examined: (1) the time it takes to
process a data set, (2) the average retrieval time of data, and (3) the height
of the binary tree. The above properties are the major ones that outline the
analysis. Speed is important, and each binary tree is timed to check how
long it takes to process the input data. But the average retrieval time of data
is also important because it is the best indication of the efficiency of the data
structures. What is the use of inputting a number quickly if it can only be
retrieved slowly? Lastly, the height of the binary tree is checked to see how
the theoretical idea works out in a practical situation.
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
6.2 Detail
It is worthwhile to note how each behavior is measured in C++. For
measuring the time it takes to process a data set, the starting time and the
ending time will be recorded by the function clock() from the time.h library.
The duration is then (EndTime - StartTime) / CLOCKS_PER_SEC.
The average retrieval time of data will be calculated by first
summing the time it takes to check each data point in the tree and dividing
this sum by the number of data points in the binary tree. The height of the
binary tree, the third behavior under study, is calculated by tree
traversal (pre-, in- or post-order), simply taking the maximum
height/depth visited as each node is scanned. There will be several identical
test cases to check the red-black tree, the height-balanced tree, the
weight-balanced tree, and the median-weight-mix tree. The first category of
test runs will be test cases with a gradually increasing number of randomly
generated data points. The second category of test runs will be hand
manipulated.
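A minimal timing harness along the lines described above (the tree type and the insert call are placeholders for whichever structure is under test, not the project's code):

#include <cstdio>
#include <ctime>

// Stand-in for insertion into whichever balanced tree is being benchmarked.
void insertIntoTree(int value) { /* tree-specific work goes here */ }

// Time how long it takes to process a data set, as described in the text:
// duration = (EndTime - StartTime) / CLOCKS_PER_SEC.
double timeProcessing(const int* data, int n)
{
    std::clock_t start = std::clock();
    for (int i = 0; i < n; i++)
        insertIntoTree(data[i]);
    std::clock_t end = std::clock();
    return double(end - start) / CLOCKS_PER_SEC;   // seconds elapsed
}

int main()
{
    int data[] = {5, 3, 9, 1, 7};
    std::printf("processed in %f seconds\n", timeProcessing(data, 5));
}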
A Study of Balanced Search Trees:
Brainforming a New Balanced Search Tree
Anthony Kim
Data points will still be randomly generated, but under some
statistical behaviors, such as a "wave," a single bell curve, etc. The third
category of test runs will be real life data points such as heights, ages,
and others. Due to the immense amount of data, some proportional scaling
might be used to accommodate the memory capability of the balanced
binary trees.
7 Result Analysis
C++ code for the balanced search trees will be provided, along with testing of
the balanced search trees for their efficiency and such. Graphs and tables
will be provided. Under construction.
8 Conclusion
Under construction.
9 References
Under construction.
http://newds.zefga.net/snips/Docs/BalancedBSTs.html
Appendix A: Other Balanced Search Trees
Appendix B: Codes
Linux Kernel
Debugging API
The purpose of this project is to create
an implementation of much of the
kernel API that functions in user
space, the normal environment that
processes run in. The issue with
testing kernel code is that the live
kernel runs in kernel space, a separate
area that deals with hardware
interaction and management of all the
other processes. Kernel space
debuggers are unreliable and very
limited in scope; a kernel failure can
hardly dump useful error information
because there's no operating system
left to write that information to disk.
Kernel Debugging User-Space API Library
(KDUAL)
John Livingston
Abstract:
The purpose of this project is to create an implementation of much
of the kernel API that functions in user space, the normal
environment that processes run in. The issue with testing kernel
code is that the live kernel runs in kernel space, a separate area
that deals with hardware interaction and management of all the
other processes. Kernel space debuggers are unreliable and very
limited in scope; a kernel failure can hardly dump useful error
information because there's no operating system left to write that
information to disk. Kernel development is quite likely the most
important active project in the Linux community.
Kernel Debugging User-Space API Library
(KDUAL)
John Livingston
Any aids to the development process would be appreciated by the
entire kernel development team, allowing them to do their work
faster and pass changes along to the end user quicker. This
program will make a direct contribution to kernel developers, but
an indirect contribution to every future user of Linux.
Introduction and Background
The Linux kernel is arguably the most complex piece of software
ever crafted. It must be held to the most stringent standards of
performance, as any malfunction, or worse, security flaw, could be
potentially fatal for a critical application. However, because of the
nature of the kernel and its close interaction with hardware, it's
extremely difficult to debug kernel code.
Kernel Debugging User-Space API Library
(KDUAL)
John Livingston
The goal of this project is to create a C library that provides the
kernel API but operates in ordinary user space, without actual
interaction with the underlying system. Kernel code under development can
then be compiled against this library and tested without the risks and
confusion of running it on a live system.
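As a rough, hypothetical illustration of the idea (the actual KDUAL library is written in C and is far more extensive), a user-space stand-in for a few well-known kernel calls might look like the following sketch; the stub bodies and the decision to back them with the ordinary C runtime are assumptions made here for clarity.

#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical user-space stand-ins for a few well-known kernel calls.
typedef unsigned int gfp_t;   // allocation-flag type, mirroring the kernel's

// kmalloc/kfree backed by the ordinary C heap instead of kernel slab caches.
void *kmalloc(std::size_t size, gfp_t /*flags*/) { return std::malloc(size); }
void kfree(void *ptr) { std::free(ptr); }

// printk redirected to stdout so test runs can be observed directly.
int printk(const char *fmt, ...) {
    std::va_list args;
    va_start(args, fmt);
    int written = std::vprintf(fmt, args);
    va_end(args);
    return written;
}

Kernel code under test could then be compiled against stubs of this kind so that allocations and log messages go through the normal C library instead of a live kernel.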
Process
The design of this API follows an extremely simple development process:
research, code, debug. Sub-tasks are somewhat difficult to define, as the
library cannot do very much of use until it is complete. However, the
rapidly growing source code, along with small demonstrations of sections of
the library, is sufficient for progress-reporting purposes. Development thus
far has been simple; no special tools have been needed beyond the vim
editor, the GNU C compiler and linker, and a very large amount of work time.
Kernel Debugging User-Space API Library
(KDUAL)
John Livingston
Testing the library with simple functions will be trivial; the eventual
goal of this project is to construct a small patch to the kernel using this
library, both as a demonstration of the library's effectiveness and to
solve an existing problem. This patch would allow seamless use of the
Andrew File System (AFS) with the 2.6.x kernel, greatly benefiting the
lab's workstations by allowing an immediate migration to 2.6, which brings
large improvements. On a more detailed level, I have been implementing
sections of the Linux VFS, as well as math processing. The VFS is necessary
to handle "file interaction" in the virtual kernel, while most of the
mathematical work has been to optimize basic operations (add, subtract,
compare, etc.) using x86 assembly.
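The project's actual assembly routines are not reproduced in this report; the following is a minimal sketch of the general technique using GCC's extended inline assembly, with the function name being an assumption.

// Adding two integers with an explicit x86 "addl" instruction rather than
// the compiler-generated C addition (illustrative only).
static inline int asm_add(int a, int b) {
    int result;
    __asm__ ("addl %2, %0"
             : "=r"(result)        // output in any general-purpose register
             : "0"(a), "r"(b));    // a starts in the output register; b in any register
    return result;
}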
Kernel Debugging User-Space API Library
(KDUAL)
John Livingston
Because this library simulates code that normally uses hardware directly
for computation, its own internal simulation of that computation must be as
fast as possible. It will never come anywhere near the speed of the actual
kernel, but replacing the original C syntax for addition with its
equivalent in inline assembly yields roughly a tenfold speedup. These two
sections of the kernel library (the VFS and the math routines) will be my
primary contribution to this project; current code for the two spans
several thousand lines. The majority of the codebase is written; however,
minor changes, fixes, and improvements will still require significant
effort. This project's success depends on efficiency as much as on simple
functionality.
Kernel Debugging User-Space API Library
(KDUAL)
John Livingston
References
http://www.kernel.org/ - The Linux Kernel Archives.
http://www.debian.org/ - The Debian distribution of Linux, the best one, not to start a flame war or anything.
http://lkml.org - The archive of the Linux Kernel Mailing List, the primary method of communication for kernel developers.
http://www.openafs.org - An open source implementation of the Andrew File System.
An Analysis of
Sabermetric Statistics
in Baseball
For years, baseball theorists have pondered the
most basic question of baseball statistics:
which statistic most accurately predicts which
team will win a baseball game. With this
information, baseball teams can rely on
technological, statistics-based scouting
organizations. The book Moneyball addresses
the advent of sabermetric statistics in the
1980s and 1990s and shows how radical
baseball thinkers instituted a new era of
baseball scouting and player analysis. This
project analyzes which baseball statistic is
the single most important. It has been found
that new formulas, such as OBP, OPS, and
Runs Created correlate better with the number
of runs a team scores than traditional statistics
such as batting average.
Saber-what? An Analysis of the Use of
Sabermetric Statistics in Baseball
Jack McKay
Abstract:
For years, baseball theorists have pondered the most basic question
of baseball statistics: which statistic most accurately predicts which
team will win a baseball game. With this information, baseball teams
can rely on technological, statistics-based scouting organizations.
The book Moneyball addresses the advent of sabermetric statistics
in the 1980s and 1990s and shows how radical baseball thinkers
instituted a new era of baseball scouting and player analysis. This
project analyzes which baseball statistic is the single most important.
It has been found that new formulas, such as OBP, OPS, and Runs
Created correlate better with the number of runs a team scores than
traditional statistics such as batting average.
Saber-what? An Analysis of the Use of
Sabermetric Statistics in Baseball
Jack McKay
Introduction:
For some time, a baseball debate has been brewing. Newcomers and
sabermetricians (the “Statistics Community”) feel that baseball can
be analyzed as a scientific entity. The Sabermetric Manifesto by Bill
James serves as the Constitution for these numbers-oriented people.
Also, Moneyball by Michael Lewis serves as the successful model of
practical application of their theories. Traditional scouts (the
“Scouting Community”) contend that baseball statistics should not be
over-analyzed and stress the importance of intangibles and the need
for scouts. The debate can also be interpreted in terms of statistics.
Baseball lifers feel that stats such as batting average are the most
important.
Saber-what? An Analysis of the Use of
Sabermetric Statistics in Baseball
Jack McKay
Meanwhile, the Statistics Community feels that complex, formulaic
stats can better predict a player’s contributions to a team. The
discussion continues in the offices of baseball teams around the
country: are computer algorithms better than human senses?
From a statistical standpoint, baseball is an ideal sport. Plate appearances
are discrete events with distinct results; in fact, results can be
limited to a few outcomes: hit, walk, or out. Outcomes can
also be expressed more specifically: single, double, triple, home run,
walk, strike-out, fly-out, etc. Most importantly, the outcomes of
past plate appearances can accurately predict the outcomes of future
plate appearances. Baseball statisticians continue to desire more
information in their field in order to become better at analyzing the
past and predicting the future.
Saber-what? An Analysis of the Use of
Sabermetric Statistics in Baseball
Jack McKay
Definition of Terms
BA – Batting Average
OBP – On Base Percentage
OPS – On Base Percentage Plus Slugging
OPS Adjusted – On Base Percentage * 1.2 Plus Slugging Percentage
Runs Created – On Base Percentage * Slugging Percentage
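As a worked illustration of these definitions, the short C++ sketch below computes each statistic from a team's season totals; the struct and field names are assumptions, and OBP and SLG use simplified standard formulas that ignore hit-by-pitch and sacrifice flies.

#include <iostream>

// Season counting totals for one team (field names are hypothetical).
struct BattingLine {
    int hits, walks, atBats, totalBases;
};

// Standard simplified formulas for OBP and SLG (no HBP or sacrifice flies).
double obp(const BattingLine &b) { return double(b.hits + b.walks) / (b.atBats + b.walks); }
double slg(const BattingLine &b) { return double(b.totalBases) / b.atBats; }

// Composite statistics exactly as defined in the list above.
double ops(const BattingLine &b)         { return obp(b) + slg(b); }
double opsAdjusted(const BattingLine &b) { return 1.2 * obp(b) + slg(b); }
double runsCreated(const BattingLine &b) { return obp(b) * slg(b); }

int main() {
    BattingLine team = { 1500, 600, 5500, 2400 };   // made-up season totals
    std::cout << "OBP=" << obp(team) << " OPS=" << ops(team)
              << " OPSadj=" << opsAdjusted(team)
              << " RC=" << runsCreated(team) << "\n";
    return 0;
}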
Saber-what? An Analysis of the Use of
Sabermetric Statistics in Baseball
Jack McKay
Theory: Sabermetric Teachings
(1) Best Pitching Statistic
DIPS is the Defense Independent Pitching Statistic: "Looking
mainly at a pitcher's strikeouts, walks and home runs allowed per
inning does a better job of predicting ERA than even ERA does. It's
very counterintuitive to see that singles and doubles allowed don't
matter a whole lot moving forward." (Across the Great Divide)
The Defense Independent Pitching Statistic was invented by
sabermetricians as an alternative to ERA. Sabermetricians
think that ERA does a poor job of future prediction because it is
greatly altered by stadium characteristics, the opposing team, luck,
and defense. Hence, the invention of DIPS.
Saber-what? An Analysis of the Use of
Sabermetric Statistics in Baseball
Jack McKay
(2) Best Hitting Statistic
Sabermetricians promote OBP and OPS as more indicative hitting statistics
than the current default, batting average. OBP is on-base percentage, which
is essentially a measure of batting average and plate discipline. OPS is
slugging percentage plus OBP, which gives a measure of power and plate
discipline. In fact, my research has shown that OPS does the best job of
any conventional statistic in correlating with wins. Old-school scouts say
that batting average is a better predictor of a player's potential, because
plate discipline can be learned.
The ultimate example is that one team could hit three solo home runs
to score three runs, while another team could draw two walks followed
by a home run to score three runs. In this case, walks clearly matter!
Saber-what? An Analysis of the Use of
Sabermetric Statistics in Baseball
Jack McKay
(3) The lack of need for Scouts
“The scouts have only a limited idea of what the guy's gonna do. He
might do this, he might do that, he might be somewhere in the
middle. What you're trying to do is you're trying to take the guys
who you think have the best chance. I fully admit that you can't tell
the future via stats. My point is that scouting has that equal amount
of unpredictability. You can only know so much. You're scouts,
you're not fortune tellers.”
It is many sabermetricians' view that traditional scouts are no longer
essential: nowadays, a "scout" can operate from a laptop, looking at a
player's statistics.
Saber-what? An Analysis of the Use of
Sabermetric Statistics in Baseball
Jack McKay
(4) Draft College Players
“A player who is 21 is simply closer to his peak abilities than a
player who's 18.”
Therefore, it is extremely risky to draft younger, usually high-school
players. How they play in high school may be far removed from how
they play in the pros. Meanwhile, the best college players can
sometimes be plugged into a Major League Baseball team's rotation
a season or two after being drafted. Simply put, college players are more
proven.
Saber-what? An Analysis of the Use of
Sabermetric Statistics in Baseball
Jack McKay
(5) Use Minor League Statistics to predict Major League
numbers
Players who have higher on-base percentages in Triple-A tend to have
higher on-base percentages in the major leagues. However, this area
is very underdeveloped, and no studies have been conducted in the
field.
Saber-what? An Analysis of the Use of
Sabermetric Statistics in Baseball
Jack McKay
Method
My own analysis has two parts. First, I obtained statistical data
about teams in the past ten years and entered it into a Microsoft
Excel spreadsheet. I calculated the correlation between certain
statistics of teams and the number of runs they scored that season.
The second part of my research consists of a computer program in
C++. Right now, I have the “engine” of my program working. The
program plays a game between two teams, then outputs a full box
score displaying many hitter statistics. With this framework in place,
I can tell the program to play the game many times, and store the
statistics in variables that I can output to a different file.
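The correlations reported below were computed in the spreadsheet, but the same quantity can be calculated directly; here is a minimal C++ sketch of the Pearson correlation coefficient, with made-up sample values standing in for the real team data.

#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Pearson correlation coefficient between two equal-length series,
// e.g. per-team OBP versus runs scored across several seasons.
double correlation(const std::vector<double> &x, const std::vector<double> &y) {
    const std::size_t n = x.size();
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sx += x[i];            sy += y[i];
        sxx += x[i] * x[i];    syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    const double cov = sxy - sx * sy / n;
    const double vx  = sxx - sx * sx / n;
    const double vy  = syy - sy * sy / n;
    return cov / std::sqrt(vx * vy);
}

int main() {
    // Made-up numbers purely to show the call; the real input would be
    // team OBP and runs-scored columns from the spreadsheet.
    double obpArr[]  = {0.320, 0.335, 0.341, 0.360};
    double runsArr[] = {700, 745, 780, 820};
    std::vector<double> obp(obpArr, obpArr + 4), runs(runsArr, runsArr + 4);
    std::cout << "r = " << correlation(obp, runs) << "\n";
    return 0;
}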
Saber-what? An Analysis of the Use of
Sabermetric Statistics in Baseball
Jack McKay
Ways in which I will produce graphs based on my C++ program in the future:
Graph of percentage of games won versus number of games played; it should
even out at the correct percentage when the sample size is large enough.
Graph of the correlation between OBP and games won, SLG and games won, and
more. This should be a bar graph because each correlation is just a number
from -1 to 1.
Bar graph of the effect of artificially changing OBP and SLG on runs
scored: which has the bigger effect?
Saber-what? An Analysis of the Use of
Sabermetric Statistics in Baseball
Jack McKay
Findings: Correlation Data
Correlation = 0.824
Saber-what? An Analysis of the Use of
Sabermetric Statistics in Baseball
Jack McKay
Findings: Correlation Data
Correlation = 0.953 (better)
Saber-what? An Analysis of the Use of
Sabermetric Statistics in Baseball
Jack McKay
Findings: Correlation Data
Correlation = 0.956 (best)
Saber-what? An Analysis of the Use of
Sabermetric Statistics in Baseball
Jack McKay
References
“Across the Great Divide.” http://sports.espn.go.com/mlb/columns/story?columnist=schwarz_alan&id=1963830
Lewis, Michael. Moneyball. New York: W.W. Norton, 2004.
“Sabermetrics.” http://en.wikipedia.org/wiki/Sabermetrics
“Sabermetric Revolution Sweeping the Game.” http://proxy.espn.go.com/mlb/columns/story?columnist=neyer_rob&id=1966043
An Investigation into
Implementations of
DNA Sequence
Pattern Matching
Algorithms
There is an immense amount of genetic data generated by government
efforts such as the Human Genome Project and by organization efforts
such as The Institute for Genomic Research (TIGR). There exist large
amounts of unused processing power in schools and labs across the
country. Harnessing some of this power is a useful problem, and not
just for the specific application in bioinformatics of DNA sequence
pattern matching.
An Investigation into Implementations of
DNA Sequence Pattern Matching Algorithms
Peden Nichols
Abstract
The BLAST (Basic Local Alignment Search Tool) algorithm of
genetic comparison is the main tool used in the Bioinformatics
community for interpreting genetic data. Existing implementations
of this algorithm (in the form of programs or web interfaces) are
widely available and free. Therefore, the most significant limiting
factor in BLAST implementations is not accessibility but
computing power. My project deals with possible methods of
alleviating this limiting factor by harnessing computer resources
which go unused in long periods of idle time.
An Investigation into Implementations of
DNA Sequence Pattern Matching Algorithms
Peden Nichols
The main methods used are grid computing, dynamic load balancing, and
backgrounding.
Background
There is an immense amount of genetic data generated by government efforts
such as the Human Genome Project and by organization efforts such as The
Institute for Genomic Research (TIGR). The task of extracting useful
information from this data requires so much processing power that it
overwhelms current computational resources. However, there exist large
amounts of unused processing power in schools and labs across the country;
most computers are not in use all of the time, and even when they are in
use their processors are nowhere near 100% load.
An Investigation into Implementations of
DNA Sequence Pattern Matching Algorithms
Peden Nichols
Harnessing some of this unused power is a useful problem not just
for the specific application in bioinformatics of DNA sequence
pattern matching, but for many computationally intensive problems
which could be solved faster and more accurately with increased
resources.
Procedure
The first step in harnessing unused processor power is to clearly
establish and document the existence and magnitude of that unused
power. Accomplishing this task
An Investigation into Implementations of
DNA Sequence Pattern Matching Algorithms
Peden Nichols
requires that we establish some metrics for describing computer load and
develop a way to keep a record of those metrics over time. Perl is an
ideal language for writing a program to perform this task because of its
text-manipulation capabilities and speed. The program "cpuload" runs the
Linux "uptime" command every second, parses the output, and writes the
results to a file which is then plotted using gnuplot. The graph shows
the results over one execution of the BLAST algorithm comparing two
strains of E. coli bacteria.
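The cpuload monitor itself is described above as a Perl script; purely for illustration, and to keep the examples in this document in a single language, a roughly equivalent loop in C++ might look like the sketch below, where the output filename and the parsing details are assumptions.

#include <stdio.h>
#include <unistd.h>
#include <fstream>
#include <string>

// Poll the "uptime" command once per second, pull out the 1-minute load
// average, and append "<elapsed-seconds> <load>" lines for gnuplot to plot.
// Runs until interrupted.
int main() {
    std::ofstream out("cpuload.dat");           // output filename is an assumption
    for (int t = 0; ; ++t) {
        FILE *p = popen("uptime", "r");
        if (!p) return 1;
        char buf[256] = "";
        if (!fgets(buf, sizeof buf, p)) { pclose(p); return 1; }
        pclose(p);

        std::string line(buf);
        std::string::size_type pos = line.find("load average:");
        if (pos != std::string::npos) {
            double load1 = 0.0;
            sscanf(line.c_str() + pos, "load average: %lf", &load1);
            out << t << " " << load1 << std::endl;
        }
        sleep(1);
    }
}

The resulting two-column file can then be plotted in gnuplot with a command such as: plot "cpuload.dat" with lines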
An Investigation into Implementations of
DNA Sequence Pattern Matching Algorithms
Peden Nichols
Remote machine tests have the following procedure:
ssh to target processor
Record test number, processor name, and any users
Ask any users to notice performance changes
Run ~/web-docs/techlab/BLAST/formatdb -iEcK12.FA -pT -oT
-nK12-Prot
Run ~/techdocs/cpuload for 5 data points
Record start time
Run ~/web-docs/techlab/BLAST/blastall -pblastp -dK12-Prot
-iEcSak.FA -ok12vssak -e.001
Record end time
Allow cpuload to run for approximately 5 more data points
An Investigation into Implementations of
DNA Sequence Pattern Matching Algorithms
Peden Nichols
(cont.)
vim runstats :w tests/testX
Record any user-reported performance changes
The use of grid computing to optimize BLAST implementations is
not an original idea; a program called mpiblast has already been
written and made available to the public. However, implementing
mpiblast in any given environment is not a trivial task. For
example, our systems lab, although it has MPI installed on several
computers, has not maintained a list of which computers are
available to run parallel programs. My next task was to compile
this list, essentially by trial and error, running a test MPI
program, mpihello.c.
An Investigation into Implementations of
DNA Sequence Pattern Matching Algorithms
Peden Nichols
See poster for pictures of the old, obsolete lamhosts list and the updated
working version. Here are the results for single remote machine tests,
including selected graphs of cpuload output:
Test 1: tess. No users. Start: 9:09. End: 9:16.
Test 2: beowulf. Jack McKay. Start: 8:57. End: 9:04. User report: "I
experienced no slow down or loss of performance. But if I had a loss of
performance that persisted for over thirty six hours, rest assured, I
would have contacted my doctor."
oedipus: no route to host.
Test 3: antigone. No users. Start: 8:43. End: 8:51.
Test 4: agammemnon. Jason Ji. Start: 9:53. End: 10:01. User report: Did
you experience any slow down at all? "No".
An Investigation into Implementations of
DNA Sequence Pattern Matching Algorithms
Peden Nichols
Test 5: loman. Michael Druker. Start: 8:44. End: 8:51. User report: "I'm
not noticing anything, but I'm not doing anything computationally
intensive, so..."
Test 6: lordjim. Robert Staubs. Start: 8:57. End: 9:04. User report: "I
wasn't really using the computer during that time."
Test 7: faustus. Caroline Bauer. Start: 9:25. End: 9:34. User report: "I
haven't noticed anything, so..."
Test 8: okonokwo. Alex Volkovitsky. Start: 10:10. End: 10:19. User report:
(none).
Test 9: joad. No users. Start: 9:15. End: 9:23.
Analysis
The tests I run on single remote machines generate two dependent
variables: running time and CPU load over the test's duration. So far
nine tests have been run, six with users on the target machine and three
without.
An Investigation into Implementations of
DNA Sequence Pattern Matching Algorithms
Peden Nichols
As is visible from the graphs above, the tests have similar results
with similar durations, indicating that performance for grid
computing in the systems lab is indeed predictable and repeatable.
Furthermore, the user testimonials so far unanimously agree that
no change in performance was noticed.
Further Testing Plans
In future tests of multiple machines running simultaneously, I could
look at how effectively each test used its resources by creating an
"efficiency" metric. A formula for this metric could perhaps be
E = 1/(t * n), i.e. efficiency = 1 / ((running time) * (# of machines)).
Because of the transfer time involved in MPI programming, one machine
will probably be the most efficient.
An Investigation into Implementations of
DNA Sequence Pattern Matching Algorithms
Peden Nichols
The interesting question I will address, though, is how much more
efficient one machine is than two, or three. How many machines
can you utilize before seeing a huge drop in efficiency? In
general, there is also an optimum balance between transfer time
and processing power for any given algorithm to run in the
shortest time; beyond this point, adding more processors actually slows
down the program because the increase in transfer time outweighs
the added processing power. The ideal number of processors is
generally higher for more complex algorithms: adding two
numbers together is clearly fastest when run on only one computer,
while BLAST algorithms can benefit from more processors.
An Investigation into Implementations of
DNA Sequence Pattern Matching Algorithms
Peden Nichols
It will be interesting to see whether or not I can surpass this "optimal
number" for BLAST algorithms with the number of processors
available in the Systems Lab. A third dependent variable my tests
could possibly generate would be accuracy of output. If I could
develop a method of measuring this variable, it would probably be
the most interesting of all to investigate. For now, however, I will
leave it as a possibility while I focus on the other tests.
An Investigation into Implementations of
DNA Sequence Pattern Matching Algorithms
Peden Nichols
GENOME@HOME
GENOME@HOME is a potential application of grid computing to the
implementation of BLAST algorithms. The idea is to distribute
implementations of BLAST on personal or institutional computers and run
those implementations during down time, or even in the background while
the computers are being used. To justify such a program to users, it is
necessary to demonstrate that it will not interfere with use of the
computer or slow down the computer's performance in any noticeable way.
References
www.ncbi.nlm.nih.gov - The National Center for Biotechnology Information's website, where I obtained several implementations of BLAST.
www.tigr.org - The Institute for Genomic Research's website, which contains helpful background information on genetic algorithms.
www.stanford.edu/group/pandegroup/genome/ - The primary site for GENOME@HOME.
Part-of-Speech
Tagging with Limited
Training Corpora
The aim of this project is to create and
analyze various methods of part-of-speech
tagging. The corpora used are of extremely
limited size, thus offering less occasion to
rely entirely upon tagging patterns gleaned
from predigested data. Methods used to
analyze the data and resolve tagging
ambiguities include Hidden Markov
Models and Bayesian Networks.
Results are analyzed by comparing
the system-tagged corpus with a
professionally tagged one.
Part-of-Speech Tagging with Limited Training
Corpora
Robert Staubs
Abstract The aim of this project is to create and analyze various
methods of part-of-speech tagging. The corpus used, the Susanne
Corpus, is of extremely limited size, thus offering less occasion to rely
entirely upon tagging patterns gleaned from predigested data.
Methods used focus on the comparison of tagging correctness between
general and genre-specific training with limited training corpora.
Results are analyzed by comparing the system-tagged corpus with a
professionally tagged one.
Part-of-Speech Tagging with Limited Training
Corpora
Robert Staubs
1.1 Introduction
Problem
Part-of-speech (POS) tagging is a subfield within corpus linguistics
and computational linguistics. POS taggers are designed with the
aim of analyzing texts of sample language use (corpora) to determine
the syntactic categories of the words or phrases used in the text. POS
tagging serves as an underpinning to two fields above others: natural
language processing and corpus linguistics. It is useful to natural
language processing (the interpretation or generation of human
language by machines) in that it provides a way of preparing
processed texts to be interpreted syntactically. It is also useful in the
academic field of corpus
Part-of-Speech Tagging with Limited Training
Corpora
Robert Staubs
linguistics, in the statistical analysis of how humans use their
language. The intention of this project is to use and compare various
methods of POS tagging using a small amount of statistical training.
1.2 Scope
This project aims to achieve the best results possible with a limited
training corpus. Training data is limited to 53 of the 57 documents
making up the Susanne Corpus, the other 4 being reserved for testing
purposes. Each of the four testing segments will be used both with
general training (with all 53 of the other documents) and with
genre-specific training (with only one-fourth of those).
Part-of-Speech Tagging with Limited Training
Corpora
Robert Staubs
1.3 Background
Many different methods of POS tagging have been advanced in the
past, but no attempt gives hope of "perfect" tagging at the current
stage. Accuracy of over 90% on ambiguous words is typical for most
methods in current use (1), often well exceeding that. POS taggers
cannot at the current time mimic human methods for distinguishing
part of speech in language use. Work to get taggers to approach the
problem from all the expected human methods (semantic prediction,
syntactic prediction, lexical frequency, and syntactic category
frequency being the most prominent) has not yet reached full
fruition.
Part-of-Speech Tagging with Limited Training
Corpora
Robert Staubs
1.3.1 Hidden Markov Model
The Hidden Markov Model (HMM) method of POS tagging is
probably the most traditional. It generally requires extensive training
on one or more pre-tagged corpora. Decisions on the part of speech
of words or semantic units are made by analyzing the probability that
one tag would follow another and the probability that a certain word
or unit has a certain tag:
P_transition = f(i, j) / f(i)
P_lexical = f(i, w) / f(i)
where f(i, j) represents the number of transitions from tag i to tag j in
the corpora, f(i, w) represents the total number of words w with tag i,
f(i) represents the frequency of tag i, and f(w) represents the frequency
of word w. Transitions and tags not seen are given a small but non-zero
probability (2). The HMM method converges on its maximum accuracy,
as opposed to some methods (most not usable in this situation) which
Part-of-Speech Tagging with Limited Training
Corpora
Robert Staubs
converge to an accuracy level smaller than one attained earlier.
HMMs have a close affinity for neural network methods.
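A minimal sketch of how those two probabilities might be estimated from raw counts, including the small non-zero floor for unseen events, is given below; the container layout and names are assumptions rather than the project's actual data structures.

#include <map>
#include <string>

// Raw counts gathered from the training corpus (container names are assumptions).
struct Counts {
    std::map<std::string, std::map<std::string, int> > tagBigram;  // f(i, j)
    std::map<std::string, std::map<std::string, int> > tagWord;    // f(i, w)
    std::map<std::string, int> tag;                                // f(i)
};

const double FLOOR_PROB = 1e-6;   // small non-zero probability for unseen events

// P_transition(j | i) = f(i, j) / f(i)
double transitionProb(Counts &c, const std::string &i, const std::string &j) {
    int fi = c.tag[i], fij = c.tagBigram[i][j];
    return (fi == 0 || fij == 0) ? FLOOR_PROB : double(fij) / fi;
}

// P_lexical(w | i) = f(i, w) / f(i)
double lexicalProb(Counts &c, const std::string &i, const std::string &w) {
    int fi = c.tag[i], fiw = c.tagWord[i][w];
    return (fi == 0 || fiw == 0) ? FLOOR_PROB : double(fiw) / fi;
}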
2.1 Procedure
Training
Training data consists of the tags represented in the corpus, the words
represented in the corpus, the transitions represented in the corpus, and
the frequency of each. Words and tags are read in from the corpus
and stored alphabetically or in parallel in a series of arrays and
matrices. These data form the basis for the statistical information
extracted by taggers for making decisions on a unit's tag.
Part-of-Speech Tagging with Limited Training
Corpora
Robert Staubs
2.1.1 Training Implementation
Corpus data is stored in text files with each word on a
separate line. The word, base word form, tag, etc. are stored on each line,
tab-delimited. A data extractor was created using the C++
programming language. The extractor stores each encountered word
in an ordered array of structs. If a struct for that word already exists,
an internal variable representing word-frequency is incremented. The
tag associated with that word is added to an internal array or, if the
tag is already stored there, its frequency is incremented. A similar
process is followed for encountered tags. Each tag is added to an
associated struct in an ordered array. Its frequency is incremented if
it is encountered more than once. The tag that occurred before the
added one is added to an array of preceding tags or, if that tag is
already present, its frequency is incremented.
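For illustration, a condensed version of such an extractor is sketched below; it substitutes std::map for the project's ordered arrays of structs and assumes the word/base-form/tag field order described above.

#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

// Condensed corpus extractor: reads one tab-delimited word entry per line
// (word, base form, tag) and accumulates the frequencies described above.
// std::map stands in for the project's ordered arrays of structs.
int main(int argc, char **argv) {
    if (argc < 2) { std::cerr << "usage: extract <corpus-file>\n"; return 1; }
    std::ifstream corpus(argv[1]);

    std::map<std::string, int> wordFreq, tagFreq;
    std::map<std::string, std::map<std::string, int> > wordTagFreq, tagBigram;

    std::string line, prevTag = "<s>";           // sentence-start marker (assumption)
    while (std::getline(corpus, line)) {
        std::istringstream fields(line);
        std::string word, base, tag;
        std::getline(fields, word, '\t');
        std::getline(fields, base, '\t');
        std::getline(fields, tag, '\t');

        ++wordFreq[word];
        ++tagFreq[tag];
        ++wordTagFreq[word][tag];
        ++tagBigram[prevTag][tag];
        prevTag = tag;
    }
    std::cout << "distinct words: " << wordFreq.size()
              << ", distinct tags: " << tagFreq.size() << "\n";
    return 0;
}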
Part-of-Speech Tagging with Limited Training
Corpora
Robert Staubs
References
1. D. Elworthy (10/1994). Automatic Error Detection in Part of
Speech Tagging.
2. D. Elworthy (10/1994). Does Baum-Welch Re-estimation Help
Taggers?
3. M. Maragoudakis, et al. Towards a Bayesian Stochastic Part-of-Speech and Case Tagger of Natural Language Corpora
4. M. Marcus, et al. Building a large annotated corpus of English:
the Penn Treebank
5. G. Marton and B. Katz. Exploring the Role of Part of Speech in
the Lexicon
Part-of-Speech Tagging with Limited Training
Corpora
Robert Staubs
References (cont.)
6. T. Nakagawa, et al. (2001). Unknown Word Guessing and Part-of-Speech Tagging Using Support Vector Machines
7. V. Savova and L. Pashkin. Part-of-Speech Tagging with Minimal
Lexicalization
8. K. Toutanova, et al. Feature-Rich Part-of-Speech Tagging with a
Cyclic Dependency Network
Benchmarking of
Cryptographic
Algorithms
The author intends to validate
theoretical numbers by constructing
empirical sets of data on cryptographic
algorithms. This data will then be
used to give factual predictions on the
security and efficiency of cryptography
as it applies to modern day applications.
Benchmarking of Cryptographic Algorithms
Alex Volkovitsky
Abstract The author intends to validate theoretical numbers by
constructing empirical sets of data on cryptographic algorithms. This
data will then be used to give factual predictions on the security and
efficiency of cryptography as it applies to modern day applications.
1 Introduction
Following is a description of the project and background information
as researched by the author.
Benchmarking of Cryptographic Algorithms
Alex Volkovitsky
1.1.1 Background: Origins of Cryptography
Cryptography is the field of study of how two parties can exchange
valuable information over an insecure channel. Historically
speaking, the first use of encryption is attributed to Julius Caesar,
who used a ROT(3) algorithm to transfer military orders within
his empire. The algorithm was based on the simple premise of
'rotating' letters (hence the abbreviation ROT) by 3 characters,
such that 'a' became 'd', 'b' became 'e', etc. Decryption was the
reverse of this process, in that the receiving party needed merely to
"un"-rotate the letters.
Benchmarking of Cryptographic Algorithms
Alex Volkovitsky
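Before moving on, here is a minimal C++ sketch of the ROT(3) scheme just described; the identifier names are chosen purely for illustration.

#include <iostream>
#include <string>

// ROT(n) on lowercase letters: the Caesar scheme with n = 3 for encryption
// and n = -3 (or equivalently 23) for decryption.
std::string rotate(const std::string &text, int n) {
    std::string out = text;
    for (std::string::size_type i = 0; i < out.size(); ++i)
        if (out[i] >= 'a' && out[i] <= 'z')
            out[i] = char('a' + (out[i] - 'a' + n + 26) % 26);
    return out;
}

int main() {
    std::string order  = "attack at dawn";
    std::string cipher = rotate(order, 3);    // 'a' -> 'd', 'b' -> 'e', ...
    std::string plain  = rotate(cipher, -3);  // receiving party "un"-rotates
    std::cout << cipher << "\n" << plain << "\n";
    return 0;
}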
1.1.2 Basic Terminology and Concepts
At its core, cryptography assumes that two parties must communicate
over some insecure channel. The sender (generally referred to as
Alice) agrees on some encryption algorithm E(p, k) with the receiver
on the other end (Bob). E(p, k) is generally some function of two or
more variables, often mathematical in nature (as all computer
algorithms are), but not necessarily (as was the case with the notorious
Enigma machine used in World War II). The two variables in
question are 'p', the plaintext or data that must get across without
being read by any third party, and 'k', the key, some shared secret
which both Alice and Bob have agreed to over a previously
established secure connection.
Benchmarking of Cryptographic Algorithms
Alex Volkovitsky
On the receiving end, Bob must possess a decryption function such
that p = D(E(p, k), k), meaning that if Bob knows the secret 'k', he
can retrieve the original message 'p'. The most important aspect of
cryptography is the existence of the key, which is able to transform
seemingly random gibberish into valuable information.
1.1.3 Symmetric Algorithms
Symmetric-key algorithms are the algorithms most often used
to transfer large amounts of data. They use the
same key 'k' to encrypt and decrypt, and are generally based on
relatively quick mathematical operations such as XOR. The downside
of symmetric algorithms is that since both parties must know
the exact same key, that key needs to have been transferred securely
in the past.
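The relationship p = D(E(p, k), k) and the use of a quick operation like XOR can be shown with a toy sketch; this is not how DES or any real cipher works, merely an illustration of a shared-key round trip.

#include <iostream>
#include <string>

// Toy shared-key cipher: XOR the message with a repeating key. Applying the
// same operation twice with the same key recovers the plaintext, so here the
// encryption function E and decryption function D are identical and
// p == D(E(p, k), k) holds.
std::string xorCrypt(const std::string &data, const std::string &key) {
    std::string out = data;
    for (std::string::size_type i = 0; i < out.size(); ++i)
        out[i] = char(out[i] ^ key[i % key.size()]);
    return out;
}

int main() {
    std::string p = "meet me at noon";
    std::string k = "sharedsecret";
    std::string c = xorCrypt(p, k);           // Alice encrypts with the shared key
    std::cout << xorCrypt(c, k) << "\n";      // Bob decrypts with the same key
    return 0;
}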
Benchmarking of Cryptographic Algorithms
Alex Volkovitsky
This means that for Alice and Bob to communicate using a
symmetric key algorithm, they must first either meet in person to
exchange slips of paper with the key, or alternatively (as is done
over the Internet) exchange a symmetric key over an established
public/private-key connection. The most common modern
symmetric algorithm is DES (the Data Encryption Standard).
1.1.4 Private/Public-Key Algorithms
Public-key cryptography is based on the concept that Alice and Bob
do not share the same key. Generally, Alice would generate both
the private key and the public key on her computer, save the private
key and distribute the public key.
Benchmarking of Cryptographic Algorithms
Alex Volkovitsky
If Bob would like to send a message to Alice, he first encrypts it
with her public key, making her the only person able to decrypt the
message. He sends the encrypted message (which even he himself
can no longer decrypt) and Alice is able to read it using her private
key. If Alice wishes to respond, she uses Bob's public key and
follows a similar procedure. Additionally, if Bob wishes to verify
that it is Alice speaking and no one else, she can sign her
messages. Signing means using one's own private key to encrypt a
message, such that anyone may decrypt it with the corresponding
public key and know that you were the only person who could have
encrypted it.
Benchmarking of Cryptographic Algorithms
Alex Volkovitsky
She would encrypt her message with her private key, then encrypt it
with Bob's public key. Upon receiving the message, he would be the
only person able to decrypt it (being the only person who knows his
own private key), and then he would verify Alice's signature by
decrypting the actual message with her public key. The most
common modern-day public-key algorithm is RSA, developed in
1977 by Ron Rivest, Adi Shamir, and Len Adleman (hence the
abbreviation Rivest-Shamir-Adleman, or RSA), which is based on the
difficulty of factoring large numbers.
1.2 Purpose of the Project
Whereas much research has been done into theoretical
cryptography, very little has been done to verify the numbers
predicted by simple formulas or to look into the speeds at which
various algorithms operate.
Benchmarking of Cryptographic Algorithms
Alex Volkovitsky
My project seeks to observe several modern-day algorithms and to
compute empirical data on the time it takes to encrypt and/or
decrypt different amounts of data, and on how different algorithms
perform with varied key lengths, modes of operation, and data
sizes. Ideally my program could be run on different types of
machines to identify whether certain architectures give an advantage
to the repeated mathematical computations required by
cryptographic algorithms. My project also seeks to try to break
several algorithms using unrealistically small key lengths (real key
lengths such as 64 bits could take years to break using brute-force
methods); this way I could extrapolate my data and give
predictions on the security afforded by actual key lengths.
Benchmarking of Cryptographic Algorithms
Alex Volkovitsky
1.3 Scope
Development Results Conclusions Summary
References
1. "Handbook of Applied Cryptography" University of Waterloo.
27 Jan. 2005. <http://www.cacr.math.uwaterloo.ca/hac/>.
2. "MCrypt" Sourceforge. 27 Jan. 2005.
<http://mcrypt.sourceforge. net>.
3. "Cryptography FAQ." sci.crypt newsgroup. 27 June 1999.
<http://www. faqs.org/faqs/cryptography- faq/>.