The Science of Speech


NAVAL POSTGRADUATE SCHOOL
MONTEREY, CALIFORNIA
THESIS
THE SCIENCE OF SPEECH: DEVELOPING A COMPUTATIONAL MODEL FOR DIGITAL COMMUNICATION AND ITS RAMIFICATIONS FOR AUTHOR IDENTIFICATION IN CYBERSECURITY
by
Michael H. Walker III
March 2020
Thesis Advisor: Vinnie Monaco
Second Reader: Ralucca Gera
Approved for public release. Distribution is unlimited.
REPORT DOCUMENTATION PAGE
Form Approved OMB No. 0704-0188
Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instruction, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188) Washington, DC 20503.
1. AGENCY USE ONLY (Leave blank)
2. REPORT DATE March 2020
3. REPORT TYPE AND DATES COVERED Master’s thesis
4. TITLE AND SUBTITLE THE SCIENCE OF SPEECH: DEVELOPING A COMPUTATIONAL MODEL FOR DIGITAL COMMUNICATION AND ITS RAMIFICATIONS FOR AUTHOR IDENTIFICATION IN CYBERSECURITY
5. FUNDING NUMBERS
6. AUTHOR(S) Michael H. Walker III
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) Naval Postgraduate School Monterey, CA 93943-5000
8. PERFORMING ORGANIZATION REPORT NUMBER
9. SPONSORING / MONITORING AGENCY NAME(S) AND ADDRESS(ES) SFS
10. SPONSORING / MONITORING AGENCY REPORT NUMBER
11. SUPPLEMENTARY NOTES The views expressed in this thesis are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government.
12a. DISTRIBUTION / AVAILABILITY STATEMENT Approved for public release. Distribution is unlimited.
12b. DISTRIBUTION CODE A
13. ABSTRACT (maximum 200 words)
Great strides have been made in identifying an author on the web by analyzing keystroke input, even down to determining what operating system the person was writing on at the time. Likewise, studying an author’s semantics and syntax provides helpful clues about the author’s identity and whether the author is attempting to commit a forgery of some kind. However, most parse trees focus on either the human or the machine side of the Human-Machine Interface (HMI). Analyzing stylometry and keystroke dynamics together incorporates both sides of the HMI and better accounts for the unique digital signature every web author creates. This research could be instrumental not only in finding malicious actors on the web, but also in distinguishing humans from machines by the way they use words. Combining typing times with part-of-speech (POS) tags thus demonstrates crucial differences in where authors are likely to spend the most time in sentence composition.
14. SUBJECT TERMS artificial intelligence, cybersecurity, authentication, natural language processing
15. NUMBER OF PAGES 89
16. PRICE CODE
17. SECURITY CLASSIFICATION OF REPORT Unclassified
18. SECURITY CLASSIFICATION OF THIS PAGE Unclassified
19. SECURITY CLASSIFICATION OF ABSTRACT Unclassified
20. LIMITATION OF ABSTRACT UU
NSN 7540-01-280-5500
Standard Form 298 (Rev. 2-89) Prescribed by ANSI Std. 239-18
Approved for public release. Distribution is unlimited.
THE SCIENCE OF SPEECH: DEVELOPING A COMPUTATIONAL MODEL FOR DIGITAL COMMUNICATION AND ITS RAMIFICATIONS FOR AUTHOR IDENTIFICATION IN CYBERSECURITY
Michael H. Walker III
Civilian, Naval Postgraduate School
BA, University of Dallas, 2013
MA, Dominican School of Philosophy and Theology, 2018
Submitted in partial fulfillment of the requirements for the degree of
MASTER OF SCIENCE IN COMPUTER SCIENCE
from the
NAVAL POSTGRADUATE SCHOOL March 2020
Approved by: Vinnie Monaco Advisor
Ralucca Gera Second Reader
Peter J. Denning Chair, Department of Computer Science
ABSTRACT
Great strides have been made in identifying an author on the web by analyzing keystroke input, even down to determining what operating system the person was writing on at the time. Likewise, studying an author’s semantics and syntax provides helpful clues about the author’s identity and whether the author is attempting to commit a forgery of some kind. However, most parse trees focus on either the human or the machine side of the Human-Machine Interface (HMI). Analyzing stylometry and keystroke dynamics together incorporates both sides of the HMI and better accounts for the unique digital signature every web author creates. This research could be instrumental not only in finding malicious actors on the web, but also in distinguishing humans from machines by the way they use words. Combining typing times with part-of-speech (POS) tags thus demonstrates crucial differences in where authors are likely to spend the most time in sentence composition.
TABLE OF CONTENTS
I. INTRODUCTION
A. BACKGROUND
B. PURPOSE
C. STRUCTURE
II. PARSING
A. GRAMMAR
B. BINARY PARSE TREE
III. KEYSTROKE DYNAMICS
A. KEYLOGGING
B. SEMANTICS
C. SYNTAX
D. EXPERIMENTATION
IV. UNIVERSAL METHODOLOGY
A. DATA ANALYSIS AND RESULTS
B. CHINA AND RUSSIA
C. CONCLUSION
LIST OF REFERENCES
INITIAL DISTRIBUTION LIST
LIST OF FIGURES
Figure 1. Binary Parse Tree
Figure 2. User #1 Truthful Tree with Timestamps (Part I)
Figure 3. User #1 Truthful Tree with Timestamps (Part II)
Figure 4. User #1 False Code
Figure 5. User #1 False Tree
Figure 6. User #2 Truthful Code
Figure 7. User #2 Truthful Tree (Part I)
Figure 8. User #2 Truthful Tree (Part II)
Figure 9. User #2 False Code
Figure 10. User #2 False Tree
Figure 11. POHMM Relational Equivalency
Figure 12. Markov Model Diagram
Figure 13. Bar Chart Code
Figure 14. User #1 Truthful Review Part of Speech/Mean Duration
Figure 15. User #1 Deceptive Review Part of Speech/Mean Duration
Figure 16. User #2 Truthful Review Part of Speech/Mean Duration
Figure 17. User #2 Deceptive Review Part of Speech/Mean Duration
Figure 18. Line Chart Code
Figure 19. User #1 Truthful Review Hidden States
Figure 20. User #1 Deceptive Review Hidden States
Figure 21. User #2 Truthful Review Hidden States
Figure 22. User #2 Deceptive Review Hidden States
LIST OF TABLES
Table 1. POS and Timestamps (timestamples)
Table 2. Basic NLTK POS Tagging
Table 3. Advanced Penn Treebank POS Tagging
Table 4. User #1 False Statement
Table 5. User #2 Truthful Statement
Table 6. User #2 False Statement
LIST OF ACRONYMS AND ABBREVIATIONS
ANN Artificial Neural Network
BCI Brain-Computer Interface
CFG Context-Free Grammar
CNN Convolutional Neural Network
CNTK Microsoft Cognitive Toolkit
FCA Formal Concept Analysis
HMI Human Machine Interface
HMM Hidden Markov Model
IOT Internet of Things
KE Knowledge Engineering
LIS Locked-In Syndrome
ML Machine Learning
NLP Natural Language Processing
NLTK Natural Language Toolkit
NLU Natural Language Understanding
POS Part-of-Speech
POHMM Partially Observable HMM
RNN Recurrent Neural Network
SVM Support Vector Machine
TC Text Categorization
TENE Terrorism and Extremism Network Extractor
UNL Universal Networking Language
ACKNOWLEDGMENTS
I want to thank Professor John V. Monaco and Professor Ralucca Gera for advising this thesis, as well as Professor Cynthia Irvine and the MONARCH Scholarship for Service program for sponsoring me through the National Science Foundation. This material is based upon work supported by the National Science Foundation under Grant No. 1565443. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Last but not least, I am indebted to my supportive parents; my helpful sister; and my silent brother, Luke, without whom the idea for this project would never have even occurred to me.
I. INTRODUCTION
With the advent of the smartphone, texting has become a daily feature of post-modern existence. The reward of instant communication across the globe comes with a whole host of risks to private and public security. A rash of mass shootings perpetrated by American citizens in recent months was presaged by ominous manifestos published on various underground websites. This negative trend has prompted Capitol Hill to initiate a campaign across social media platforms to develop a “Red Flag” analytic tool that could help law enforcement officials ascertain the credibility of threats posted online. The need for a more representative way to identify anonymous accounts on the Internet by word usage and keystroke patterns is therefore pressing. It thus behooves us to dedicate this first chapter to a brief history of typography and how it has been leveraged in forensic investigation over the years. Then, we chart a course for developing a parse tree analysis that accounts for both static and dynamic elements in semantic construction through a recurrent neural network.
A. BACKGROUND
As Allan Haley et al. point out in the landmark text, Typography, Referenced: “Technological developments, principally the ubiquity of computers, the availability of sophisticated software, and Internet connectivity, have raised even the average person’s awareness about the power of typography” [1]. As the legendary man of letters, Marshall McLuhan, takes great pains to note in The Gutenberg Galaxy, “The technology and social effects of typography incline us to abstain from noting interplay and, as it were, ‘formal’ causality, both in our inner and external lives. Print exists by virtue of the static separation of functions and fosters a mentality that gradually resists any but a separative and compartmentalizing or specialist outlook” [2]. That being said, we will need to take an interdisciplinary approach, using static and dynamic analysis of typography to detect how inner lives give rise to external actions. After all, with regard to blogging, William Davis King maintains that to “collect is to write a life” [3]. In fact, according to Yoni Van Den Eede, “Surely instant contact through our emails, SMSes, and online profiles retains
much of the mythical flavor of magic wands, totems, and sorcery…It would be interesting to fully expand this theme of ‘electronic magic’ into the general study of myth, archetype, and sacredness” [3].
While a survey of print culture in the vein of Neil Postman lies beyond our present scope, suffice it to say that ours is an age more counterintuitively literate than any preceding it. Even though society is arguably more awash with visual imagery than ever before, it is important to recollect that any semiotic system that transcribes phonemes into an alphabet operates fundamentally on the level of imagery in order to convert the aural information of verbal expression into visual information—from speaking to reading by means of writing, as it were. After the age of engraved tablets with pictograms like hieroglyphics and of calligraphic illumination in medieval manuscripts, books began to be mass-produced for the first time with the movable type of Gutenberg’s press in the 1450s. Flash forward some five centuries to humanity’s next great shift in its ability to create and transmit the written word, and we arrive at the age of computing; of course, we have our Luddites now as then.
Initially, computers were conceived of as little more than calculators, until it was realized that they could be used to construct sentences, not only algorithms, to appear on screens. Computers began to be used to communicate with humans in natural languages, rather than simply being addressed by humans in artificial ones. This positive development expanded the scope of computers’ applicability to anyone with a desire to communicate, whereas previously such analytic machinery had been reserved for cryptographers performing complex operations on the most abstruse of codes. The universality of the need to communicate as well as calculate drove the personal computing revolution, as well as the eventual development of the World Wide Web. But we get ahead of ourselves. Without a way to convey words in an expeditious manner using digital technology, all other imagined possibilities amounted to a pipe dream.
However, a landmark achievement toward the development of hypertext occurred in 1967, when IBM decided to collaborate with Father Roberto Busa, SJ, to create an index of all of the works of the legendary medieval philosopher and theologian, Saint Thomas Aquinas. This lemmatization required serious computing power, if it was to be remotely
successful. As Thomas N. Winter addresses in his foundational paper, “Roberto Busa, S.J., and the Invention of the Machine-Generated Concordance”:
Deeming it necessary to learn what significance words have in an author’s mind before attempting to gain insight into an author’s conceptual system, he envisioned a concordance of all the words of St. Thomas Aquinas including the conjunctions, prepositions, and pronouns, a vision which required dealing with 10,000,000 words: all phrases broken out, each phrase copied over once for every word within the phrase, the lemma indicated on each of these cards, then sorted. [4]
Furthermore, as the 1951 prospectus for this project pointed out, the work of “500 Dominicans… employed by Hugh de St. Cher in 1200 in Paris for the first Biblical Latin Concordance [could now be accomplished by a machine] certain of an accuracy which could never have been guaranteed by the cooperation of man’s sensorial and psychical nerve centers” [4]. The great potentials of this project to revolutionize computing in its applications to the humanities—including psychological science—were not lost upon the collaborators. As Paul Tasman, the project lead at IBM, presciently pronounced, “the machine searching application may initiate a new era of language engineering” [4]. Despite this progress, Father Busa declared in 1990, “Our generation has not done everything: for the young people there are still immense open spaces [which included] the recognition of all elementary and direct grammatical, i.e., semantic connections [and] the formalization of a logical train of thought winding through many paragraphs” [4]. These conspicuous lacunae in state of the art natural language processing will be precisely part of what this thesis seeks to redress.
Now that we have traversed this brief history of typing up until the present-day culture, we can explore how to advance the applications of typing into the new millennium. Before we can dive into the art of semantic construction itself using Prolog parsing techniques and Python neural nets, however, we would first be well-served to examine the ways in which the identification of anonymous authorship has been achieved and utilized in past criminal cases. The most conspicuous example to come to mind is that of Ted Kaczynski, a.k.a. “The Unabomber.” For the forgetful in our midst, Kaczynski was a mathematical genius from the University of California at Berkeley who went rogue after the government built a road through his woodland grounds, only to send bombs to computer
science professors such as David Gelernter as well as various airlines (giving rise to the nickname, an abbreviation of “University and Airline Bomber”) in an impotent attempt to thwart the technological progress he saw as an enslavement of society. He is currently serving a life sentence in Colorado’s supermax penitentiary. As the FBI history page notes:
The big break in the case came in 1995. The Unabomber sent us a 35,000 word essay claiming to explain his motives and views of the ills of modern society. After much debate about the wisdom of “giving in to terrorists,” FBI Director Louis Freeh and Attorney General Janet Reno approved the task force’s recommendation to publish the essay in hopes that a reader could identify the author…. Our linguistic analysis determined that the author of those papers and the manifesto were almost certainly the same. When combined with facts gleaned from the bombings and Kaczynski’s life, that analysis provided the basis for a search warrant. [5]
As anyone can observe from the previous passage, lexical analysis turned out to be at the heart of investigative proceedings to catch the culprit. Lives were in imminent danger, and had the FBI been unsuccessful in determining authorship, untold havoc could have been wreaked by a literal mad scientist.
While Kaczynski’s actions are indefensible, his thoughts demand to be reckoned with, if society is not to fall into the very trap he warned against; namely, the abuse of technology not as a means of human flourishing but as a vehicle for totalitarian control. After all, the Constitution was written with an agrarian republic, not a technocracy, in mind. As Kaczynski points out in the manifesto—“Industrial Society and Its Future”—alluded to previously, “When the Bolsheviks in Russia were outsiders, they vigorously opposed censorship and the secret police … but as soon as they came into power themselves, they imposed a tighter censorship and created a more ruthless secret police than any that had existed under the tsars” [6]. With this in mind, it is beyond doubt that America needs to develop a digital means of tracking the communication of our adversaries, internal and external, but if that means violating the ends to which America was first constituted, it would be better had the means not been developed in the first place. Although no artifice can be said to be intelligent in the same fashion that a human being is, nor can any machine learn with the depth of a man or woman, our unprecedented powers of
computation still remain a tool at our disposal that we would be fools not to use in the persistent battle against cyber-crime and terrorism. Humans commit crimes, and only humans can solve them; the computer is only a tool either way.
In fact, no one knows this better than Gelernter himself, who said in a recent interview, “You worry that ‘advances in science and technology are always outpacing our ability or inclination to guard against them,’ but it seems to me that this is exactly what hasn’t happened” [7]. In part, Gelernter attributes America’s success in handling new technologies like AI to proper contextualization: “The people who know the mind best aren’t neurobiologists, they’re novelists and poets. Science must learn from the arts…. As important as scholarship and science are, arts and religion are more important” [7]. By reinvigorating the study of humanities alongside technological training, society should be able to adjust to new technology without a diminution to the human condition. Without a knowledge of what it means to be a human being, no amount of empirical data will assist us in the quest to resolve discord and relieve suffering in the human population, in this nation and across the globe. When the sciences and the arts converse freely—from the left and right hemispheres of the collective mind, so to speak—innovation increases overall quality of life.
Some institutional entities, such as the National Catholic Educational Association, have formalized this holistic integration into their pedagogical model with the acronym STREAM (Science, Technology, Religion, Engineering, Art, and Mathematics) [8]. Since language is just as much of an art as it is a science of sorts, there is nowhere better to implement this model than in the budding area of natural language processing. One of the hardest problems facing computer scientists throughout the next decade will be teaching machines to read. If a computer could parse the nuanced semantic content inherent to even the most basic of garden-path sentence patterns, it would be easier than ever before to determine who wrote what online. Now, with great power comes great responsibility, which is why it is crucial that such an invention does not become divorced from its inventor’s intent; otherwise, the menace of a robotic panopticon surveying the whole of society becomes not so much a distant possibility as an imminent probability, as we are already seeing in Communist countries like China with their facial recognition social
ranking system featured in Ashlee Vance’s searing documentary, “The People’s Republic of the Future” [9].
No one could have anticipated these potential risks and rewards quite as well as Warren Sturgis McCulloch—naval officer, philosopher, poet, and psychologist—whose pioneering work in neurobiology with John von Neumann at Princeton and Walter Pitts at MIT heralded the dawn of the cyber age. A seminal figure in the history of artificial intelligence, McCulloch is a man without whom no introduction to neural networks and their applications to national security would be complete. As he penned in his 1945 volume of verse, One Word After the Other, “We build our castles in the air / And from the air they tumble down / Unless we carry them up there / Until they crack the pate they crown” (lines 1–4) [10]. While some may consider neural nets to be “castles in the air,” McCulloch dedicated a great deal of his professional time and energy to exploring how the brain could be modeled computationally, as Lily E. Kay has described at length in her article, “From Logical Neurons to Poetic Embodiments of Mind: Warren S. McCulloch’s Project in Neuroscience” [10].
A man of letters obsessed by the idea of numbers, “McCulloch’s quest was propelled by twin passions: The logos as God’s immanence and His bequest to Man, and the military as an embodiment of order. McCulloch came to neural nets through a circuitous path: via theology and philosophy, mathematics, psychology, medicine, and neuropsychiatry” [10]. By predicating logic upon timed intervals as observed in the firing of nerve endings, McCulloch introduced the temporal dimension to symbolic manipulation, which is precisely what must be achieved in order to create a more realistic rendering of author identification online. Thus, “like poetry, which deliberately employs enigmatic structures to aim at higher wisdom, neural nets were contrived enigmatically to explicate, from first principles, the fundamental mechanisms for the emergence of perception and mentality” [10]. Such a formalization of autonomous processes at the circuit level eventually gave rise to the information architectures we take for granted today, on the fiftieth anniversary of his death.
Even as the extent of his contributions has fallen into discredit, the polymathic nature of McCulloch’s imagination seems more relevant now than when he worked: “Neural
nets became a model for the electronic memory, and, in turn, the computer would become a model of cognition … coalescing into a new area of communication and control with discernable intellectual coherence” [10]. Indeed, as Kay adduces, McCulloch “felt that the medieval schism between revelation and reason characterized by the age of Thomas Aquinas was now mirrored in the schism between psychology and physiology in the understanding of diseases called ‘mental’; Freud’s trichotomy of the soul was merely an extension of Plato’s political psychology” [10]. This frustration with the compartmentalization of academic disciplines and the consequent stifling of innovation led McCulloch to remark that Freud’s “epigoni, the latterday illuminati…have dethroned reason but to install social agencies…in the places of espionage, confession and conversion,” whereas neural nets could serve as “a reincarnation of Saint Thomas’ faith that God did not give us our senses to fool us” [10]. By regarding Logos as ground truth, McCulloch hoped that computers could provide a prism to the full spectrum of reality:
For McCulloch, the psychiatrist, experimental epistemologist, poet, militarist, and theological engineer, the real bridge between soma and psyche was the art of communication as the science of signals… This cybernetic ascendance from logical neurons to embodiments of mind was to be the key to normal and aberrant behavior, to the workings of the psyche, and the logical and poetic essence of humanness. Bridging Matter and Form, neural nets were to bring us closer to the monadic mind of God through the equation of mind with logic. [10]
By uniting science and art with philosophy and theology in his quixotic pursuit of truth, McCulloch meant to make machines modeled after the mind of man, that the mind of man might know the Maker by means of machines. Therefore, this holistic approach of interconnecting superficially disparate subjects can assist the next generation of computer scientists to identify individuals by both name and number in concert with one another, balancing out natural human motivations with artificial digital mechanizations.
B. PURPOSE
For better or worse, the federal government is invested in the sentiment analysis of semantical constructions, because investigators must sift through mountains of textual data shared on social media to look for clues about rogue activity. Obviously, the situation is
rife for abuse, so precautions must be taken in order that such budding technologies be not weaponized in a fashion reminiscent of the film, Minority Report. As Christine Fisher reports, the FBI posted over the summer a since deleted solicitation notice for analytic tools, such that, “Information constituting advanced notification is derived from constant monitoring of social media platforms based on keywords relevant to national security and location” [11]. Since such an ambitious goal could be construed as a bit foolhardy on a number of fronts, Fisher goes on to point out the obvious problems besetting such a superficially obvious solution to the persistent and pervasive threat of random acts of violence; namely, that “there is a danger that a tool like this could be abused by those in power and violate civil liberties” [11].
Likewise, Andrea O’Sullivan at Reason.com voiced a similar reticence about the proposed plan, referencing the results of a study conducted at the University of Virginia entitled “Benchmarking Twitter Sentiment Analysis”: “Computers just aren’t that great at parsing tone or intent. One algorithmic study of Twitter posts was only able to accurately gauge users’ political stances based on their posts about a third of the time. And this was in standard English. The problem gets worse when users use slang or a different language” [12]. Although these doubts issued by the press corps are not without merit, the fact remains that a parser portable across multiple languages would be a powerful tool in the arsenal of an investigator entrusted with the task of protecting the public interest and promoting the general welfare, and thus, we embark on this area of research with unremitting gusto, as policy concerns remain outside the scope of technical expertise. After all, at the end of the day, data is only one factor in the decision to pursue prosecution. Besides, as criminals become more savvy, they will likely not post their intent online at all.
C. STRUCTURE
Now that we have established the clear and present need for such an exploration (valid concerns notwithstanding), we can outline the technical trajectory of this undertaking. The parameters we shall be operating under cannot be described better than McCulloch and Pitts describe them in their seminal paper, “A Logical Calculus of the Ideas Immanent in Nervous Activity”:
Because of the “all-or-none” character of nervous activity, neural events and the relations among them can be treated by means of propositional logic. It is found that the behavior of every net can be described in these terms, with the addition of more complicated logical means for nets containing circles; and that for any logical expression satisfying certain conditions, one can find a net behaving in the fashion it describes [13].
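To make the quoted principle concrete, the following is a minimal sketch of an “all-or-none” threshold unit of the kind McCulloch and Pitts describe: it fires exactly when the weighted sum of its binary inputs reaches a threshold. The weights and thresholds below are the standard textbook choices for AND and OR gates, not values taken from this thesis or the original paper.

# A McCulloch-Pitts style unit: output 1 exactly when the weighted input sum
# reaches the threshold. Weights/thresholds below realize AND and OR gates.
def mcculloch_pitts(inputs, weights, threshold):
    return int(sum(i * w for i, w in zip(inputs, weights)) >= threshold)

AND = lambda a, b: mcculloch_pitts([a, b], [1, 1], threshold=2)
OR = lambda a, b: mcculloch_pitts([a, b], [1, 1], threshold=1)

for a in (0, 1):
    for b in (0, 1):
        print(f"a={a} b={b}  AND={AND(a, b)}  OR={OR(a, b)}")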
We will concern ourselves primarily with the logical expressions that serve as semantic cues embedded within an online text’s composition and the extent to which such cues can be correlated to authorship identification. Yoav Goldberg of Bar-Ilan University has compiled “A Primer on Neural Network Models for Natural Language Processing” that takes some of these time-honored principles and applies them to state-of-the-art speech parsing. As Goldberg writes, “In natural language we often work with structured data of arbitrary sizes, such as sequences and trees. We would like to be able to capture regularities in such structures, or to model similarities between such structures. In many cases, this means encoding the structure as a fixed width vector, which we can then pass on to another statistical learner for further processing” [14]. Thus, before we can avail ourselves of any of the neural network software suites to analyze the Universal Networking Language dataset, our first step will be to come up with a tree that not only reflects parts of speech but also incorporates the time lapse between keystrokes into the proportional weights of the input-node perceptrons, in order to achieve a metric more representative of real-world international correspondence on social media. After all, what good is a tool that can count how many times a word appears, if it has no sense of the word in context, let alone the vernacular it is spoken in online?
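As a first approximation of that idea, the sketch below encodes a single word as a fixed-width vector: a one-hot part-of-speech slot scaled by the time spent typing the word. The tag set and the timing value are illustrative assumptions, not the thesis’s actual feature pipeline, which is developed in the following chapters.

# A rough sketch of weighting a part-of-speech slot by typing time; the tag set
# and the example duration are made-up placeholders, not data from the corpus.
import numpy as np

POS_TAGS = ["NN", "VB", "VBN", "MD", "IN", "PRP", "DT"]  # small illustrative tag set

def encode(pos_tag, typing_time_s):
    """Return a one-hot POS vector scaled by the seconds spent typing the word."""
    vec = np.zeros(len(POS_TAGS))
    if pos_tag in POS_TAGS:
        vec[POS_TAGS.index(pos_tag)] = typing_time_s
    return vec

# Example: the word "judged" tagged VBN, typed over 0.82 seconds (hypothetical).
print(encode("VBN", 0.82))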
Similarly, as Gary Marcus writes in The Algebraic Mind, “My interest in cognitive science began in high school, with a naïve attempt to write a computer program that I hoped would translate Latin into English. The program didn’t wind up being able to do all that much, but it brought me to read some of the literature on artificial intelligence. At the center of this literature was the metaphor of mind as machine” [15]. While we will not be so naïve as to believe that our program in Python or Prolog can undo the topple of Babel, tweaking the grammatical model to include the element of time lies not beyond the realm of possibility. Indeed, Marcus has found that “The sentence-prediction model… was often able to predict plausible continuations for strings…. without any explicit grammatical
rules. For this reason, the simple recurrent network has been taken as strong evidence that connectionist models might obviate the need for grammatical rules” [15]. In this sense, we will be taking a bit of a middle-of-the-road approach as far as some of the biggest debates that rage in the field of computing at present. After all, “the ways in which multilayer perceptrons are brainlike (such as the fact that they consist of multiple units that operate in parallel) hold equally for many connectionist models that are consistent with symbol-manipulation, such as the temporal-synchrony framework… or arrays of McCulloch-Pitts neurons arranged into logic gates” [15]. Logic is the ordered relation between universal and particular concepts in space and time. As such, it is prior yet proper to every speech act by man or machine. While the categories that make us human are difficult to quantify, the fact that they are communicated as a function of time makes it impossible for them to elude enumeration entirely, so the subject of this study is completely warranted. By tracking the characters that a person types alongside the time it takes for them to type it, their individual character as a person will inevitably emerge as a pattern from the data and a face can be traced behind the scintillating screen.
II. PARSING
A few things must be borne in mind before we perform a parse tree analysis on a subset of the open-source corpus at our disposal. In order to develop a truly comprehensive computational model for human communication in the interest of identifying unique voices from the digital cacophony on the world wide web, some means of correlating sound and sense would prove to be absolutely imperative. Combined, the vocal pronunciation and the conceptual content constitute the form and the matter respectively that compose a person’s online identity such that the system would work irrespective of the natural language one is conversant in. As for sound, the natural course of action would be to utilize the International Phonetic Alphabet (IPA). In fact, most state-of-the-art voice recognition technologies do just that. As for sense, Aristotle’s ten ontological categories of being—quality, quantity, time, place, relation, etc.—cover most any idea one would seek to express around the world, agnostic toward the syntactical variations across various languages.
Phonemes could be connected to the thought-memes they represent in a hieroglyphic fashion. While his study did not touch on the conceptual content of speech, Wentian Li at the Feinstein Institute of Medical Research has mapped written Chinese characters to the spoken syllables they correspond to along a Poisson distribution [16]. Then, the veracity of these utterances could be determined along a sliding scale that qualifies areas between hard-yes and hard-no, much in the fashion that quantum gates operate according to quaternary code rather than binary. In that respect, it may not be a stretch to claim that Raymond Lull, long celebrated as the father of information science, anticipated this latest development in circuit design centuries in advance in the drafting of his Arbor Scientiae (Tree of Knowledge), which catalogued attributes of elemental principles through a four-fold hierarchy [17]. Likewise, recent research developments in mathematical logic resurrect the possibility of Gilles Deleuze’s rhizomes as a better template for the lately deceased Peruvian thinker Miro Quesada’s notion of “paraconsistent logic,” wherein respondents can answer neither yes nor no to a pointed question [18]. Thusly, thought would flow from word to word, sentence to sentence, paragraph to paragraph, page to page, image to image, and note to note along a decentralized network
of roots [19]. In so doing, the connections created resemble the nebulae constellation of mold growing on an agar plate to map the railways of Tokyo, as has been accomplished in fluorescent lab cultures under black-light, mirroring the material intelligence of neurons, axons, and dendrites [20]. These observational findings also run counter to the rather vacuous claim that intellect derives as a mere emergent property from the interaction of chemical compounds, since such a model fails to account for the unique qualia experienced by sensate beings, as opposed to the indeterminate movement of bacteria, or viruses, by means of rhizomatic protrusions.
However, these multiplicitous paths of discovery are tangential to our current pursuit, which for the sake of simplicity, will assume the usage of English language (near-universal as its currency has become globally at the time of this writing) as well as a binary framework for understanding truth statements. Intriguingly, it has been noted that part of William Shakespeare’s enduring appeal as an English-language author hearkens back to the fact that his plays were often penned in iambic pentameter blank verse, which happens to be the closest rhythmic pattern to the heartbeat-like cadence of regular, everyday speech. Furthermore, the alternating unstressed and stressed syllables that make up these iambs readily translate to Claude Shannon’s binary model of 0’s and 1’s, like Morse code on the wire, or electrical signals pulsing across nerve cells. As Leonardo DaVinci famously stated, “Learn how to see. Everything is connected to everything else.” That being said, math underpins language, and language underpins math in impenetrably interpenetrating ways; in fact, another Renaissance figure by the name of Galileo Galilei wrote, “Mathematics is the language with which God has created the universe.” By formulating sentence structure along mathematical lines, we will develop a truer representation of how people actually communicate their thoughts, feelings, dreams, wishes, hopes and fears to one another.
A. GRAMMAR
Now, how feasible is this plan? It is not without precedent. While it would be impossible to execute a heuristic implementation that covered all potential instances, we can at least take a step forward in the right direction. In order to do so, we should first take a look at the work that has already been done in the area of text composition analysis for
the purpose of trust verification. In their article “Keystroke Patterns as Prosody in Digital Writings,” Banerjee et al. have used keystroke patterns in typed user responses to questions concerning controversial topics such as policies surrounding “gun control” and “gay marriage” to determine if things like the number of backspaces or deletions indicate that a person is not as likely to be truthful in what they are saying [21]. As the researchers explain, “For the other two topics—’gun control’ and ‘gay marriage’—we asked their opinion: support, neutral, or against. Then, they were asked to write a truthful and a deceptive essay articulating, respectively, their actual opinion and its opposite” [21]. Saliently, the team not only logged the revisions typists made but also the time it took to make them, leading them to conclude that, “The longer time taken by deceptive writers in our data is a possible sign of increased cognitive burden when the writer is unable to maintain psychological distance” [21]. More indicative still was the word choice itself, as writers attempting to mask their true views tended to spend less time on content and more time on functional words [21]. Specifically, the researchers examined features such as varied editing patterns (deletions, insertions, substitutions, etc.) correlated to typing timestamp temporality (durations, pauses, latencies, etc.) in order to classify a typist’s attempts at masking thought incoherence, whether expressed verbally or textually [21]. While whether one is telling the truth is ultimately irreducible in any medium, this method of analysis does provide a clue one way or the other.
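For illustration, the sketch below computes the kinds of simple timing and revision features just described from a toy keystroke log. The (key, press, release) tuple format and the sample values are assumptions made for this example; they are not the schema of the published corpus.

# Toy keystroke log: (key, press_time_ms, release_time_ms). Values are invented.
from statistics import mean

events = [
    ("t", 0, 95), ("h", 180, 260), ("e", 340, 430),
    ("BackSpace", 900, 1010), ("e", 1200, 1290),
]

hold_times = [release - press for _, press, release in events]     # key durations
latencies = [events[i + 1][1] - events[i][2]                       # release-to-press gaps
             for i in range(len(events) - 1)]
backspaces = sum(1 for key, _, _ in events if key == "BackSpace")  # crude revision count

print(f"mean hold: {mean(hold_times):.1f} ms, "
      f"mean latency: {mean(latencies):.1f} ms, backspaces: {backspaces}")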
While research like this tackles the hard questions of syntactical parsing and their ramifications for author identification/trust verification, others are taking a nuanced approach toward semantic parsing. In “Random Walk on WordNet to Measure Lexical Semantic Relatedness,” Yanbo Xu has detected conceptual similarities between grammatically disparate words in order to map their frequency of usage and contextual richness with high degrees of accuracy. For instance, “two dissimilar concepts can also be semantically related, such as window and house, since window is a part of house (meronymy)” [22]. Likewise, as a team at Stanford’s Computer Science Department has determined, “Many tasks in NLP stand to benefit from robust measures of semantic similarity for units above the level of individual words… Our algorithm aggregates local relatedness information via a random walk over a graph constructed from an underlying lexical resource” [23].
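The core idea of such graph-based relatedness measures can be sketched very simply: start a random walk at one word and see how often it lands on another. The toy graph and walk parameters below are invented for illustration and stand in for WordNet; they are not the cited authors’ actual algorithm.

# Estimate relatedness as the fraction of steps a restarting random walk from
# `start` spends at `target` on a tiny hand-made lexical graph (not WordNet).
import random

graph = {
    "window": ["house", "glass"],
    "house": ["window", "door", "home"],
    "door": ["house"],
    "glass": ["window"],
    "home": ["house"],
}

def relatedness(start, target, steps=10000, restart=0.15, seed=0):
    rng = random.Random(seed)
    node, hits = start, 0
    for _ in range(steps):
        node = start if rng.random() < restart else rng.choice(graph[node])
        hits += node == target
    return hits / steps

print(relatedness("window", "house"))  # direct neighbors: relatively high
print(relatedness("window", "home"))   # two hops away: lower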
Semantics is not to be neglected in parse tree analysis for the sake of the syntactical element, as in more ways than one, semantical nuances can be the hardest parts of language to parse; after all, as Epaminondas Kapetanios prefaced his text, Natural Language Processing: Semantic Aspects, with an imagined colloquy between Haldane and Turing found in John Casti’s The Cambridge Quintett in which Haldane inquires, “If… I was able to activate the English version of the universal grammar in my mind, the semantic content of which equals to zero. Where can I find in your theory that this proposition makes no sense?” to which Turing responds, “It is very simple, I do not know” [24]. That is precisely the question which we must attempt to answer, if we are to develop a useful computational model for communication. After all, Turing—the so-called father of Artificial Intelligence himself—was right to write (shortly before his untimely demise after state-mandated castration, no less) that, “In attempting to construct such machines [computers] we should not be irreverently usurping His [God’s] power of creating souls, any more than we are in the procreation of children: rather we are, in either case, instruments of His will providing mansions for the souls that He creates” [25]. Thus, in order to identify souls by the devices they virtually inhabit, the tree structures we chart must wed semantics with syntax.
However, Banerjee’s team was far from the first group to study typing. For that, we must go back further to Rumelhart and Norman’s landmark study in 1982 at the University of California San Diego’s Department of Psychology, “Simulating a Skilled Typist: A Study of Skilled Cognitive-Motor Performance,” in which they catalogue how omissions and deletions can correlate typing speed to the individual typing [26]. While this was seen as a drawback in the initial investigation, the rise of the Internet has made the vast differences in typing signatures serve as a great boon to digital investigators. As early as 2009, academic researchers like Kevin Dela Rosa and Jeffrey Ellen at SPAWAR Pacific were beginning to see the application of unique imprints to identification of authors in chatrooms. In their experimental setup, they observed through a series of binary classifiers the relevance of the information being transmitted to naval tactics [27]. Likewise, in the 2003 article, “Instant Messaging: Between the Messages,” Jeffrey Campbell et al. of the University of Maryland have recorded data indicating that only 43% of the time it took to
compose an instant message was actually spent typing—the lion’s share was consumed by thinking about what to type and revising what had already been typed up [28].
Our ability to trace thought patterns via type on digital devices, while very new, cannot be lightly taken for granted. Much of the research currently taking place explores the diverse applications this natural language processing technology possesses, ranging from law enforcement to therapeutic learning assistance and everything in between. As Daniel O’Day and Ricardo Calix of Purdue have stated in their article “Text Message Corpus: Applying Natural Language Processing to Mobile Device Forensics,” “The average mobile device user sends a large quantity of text and other short messages. These text message data are of great value to law enforcement investigators who may be analyzing a suspect’s mobile device or social media profile for evidence of criminal activity” [29]. Privacy concerns notwithstanding, they are not mistaken that hearing what is said behind closed doors could provide critical clues for cracking a cold case. Pratap, Prasad, and Kumar of Siddhartha Engineering College have used Twitter as a template for using keyword search in micro-texts to make traffic monitoring decisions [30]. These decisions could potentially hold life-or-death consequences. With a high degree of granularity, Chinese researchers were able to detect a noticeable change in semantic content related to self-harm in conjunction with increased social media posting frequency for individuals who went on to take their own life in the days leading up to their deaths [31]. Not only could text flagging help protect people from themselves by erecting preventative obstacles, it could also help protect children from subversive material on the web. Scientists at the Faculty of Informatics in Mahasarakham University, Thailand, have developed a Bayesian text analysis neural net algorithm that filters URLs based on images and English/Thai word counts as input vectors, so that school students do not have access to such contraband [32]. While not as prospectively damaging as suicide or pornography, plagiarism is another problem semantic parsing can help solve. In their article, “Fingerprinting based Detection System for Identifying Plagiarism in Malayalam Text Documents,” a couple of Indian computer scientists have developed a “winnowing algorithm” by comparing the text composition of documents for similarity in the highly complex language of Malayalam [33].
However, it is not all doom and gloom. This technology can be utilized not only to deter negative events but to induce positive ones. The Beijing Aerospace Control Center was relatively recently working on ways that a flexible text interpreter could help crews fulfill their missions when interacting with machinery aboard spacecraft through the calculation of functional expressions [34]. But one need not look to the heavens to see the applications of text parsing for the fulfillment of one’s objectives. Here on earth, deaf children can better learn with assistive devices like SWift, a SignWriting transcriber developed by French and Italian researchers that translates elements called glyphs on an interactive panel menu of concepts that combine spatial and temporal information into a single digital format to bridge the gap between sign language and spoken speech [35]. Advancements like this (via a brain-computer interface, or BCI) could help people suffering from a wide range of issues that leave them locked in their own body without a means to communicate with the outside world (“locked-in syndrome,” or LIS, for short), including autism, stroke, cerebral palsy, and traumatic brain damage. Elon Musk is actively investigating such applications with his ambitious Neuralink program. Whether it be fighting crime, exploring outer space, or teaching the deaf and mute to speak, none of these creative opportunities to provide healing to private individuals and hope to public societies would be possible without basic text parsing of typewritten data.
That being said, none of the aforementioned applications of text-based analysis can be put into action without identifying the users of the devices. Thankfully, plenty of people around the world are also working on this hard problem in computer science. While researchers at the Max Planck Institute in Germany have made great strides toward masking authors’ identities online by neural machine anonymization and obfuscation of class attributes to encoded input texts through verbal substitutions, their schemes are not completely impregnable against NLP techniques: “Natural language processing (NLP) methods including stylometric tools enable identification of authors of anonymous texts by analyzing stylistic properties of the text” [36]. What the Max Planck team failed to account for is the large amount of spoken-word information presently online. Authors who record their voices can also have their identities verified by binary-partitioned neural networks, as the Computer Engineering Department of Old Dominion University has discovered [37].
Neural networks can also be used to traverse verbal datasets to create tree structures that relate documents and identify authors on the basis of text composition, according to recent research at the Beijing Institute of Technology; they conclude “that hierarchical tree structure network can more efficiently capture the logical semantic relations between discourse texts compared with the shallow feature classifications and sequential flat semantic models” [38]. Additionally, a coalition of scholars from New York and Louisiana collaborated on a combined approach of keystroke dynamics and linguistic analysis to analyze how type “bursts” better reflect the moment when inspiration strikes and truly distinguish one author from another, revealed by key pieces of data such as parts of speech favored as well as characters most frequently selected to authenticate users [39]. Of course, this work built heavily upon John Monaco et al.’s focus on statistical measurements of vector differentials in duration, transition, and identification of type keys pressed “with the goal of identifying perpetrators or other malicious behavior” and “to determine authorship of emails, tweets, and instant messaging, in an effort to authenticate users of the more commonly used digital media” [40]. To summarize this cursory review of the extant literature on the subject of text parsing and its relevancy to securing the online sphere for the betterment of the general welfare, words as images and sounds are all we have to go by when making a distinction between people online, and the ramifications of this principle are manifold.
B. BINARY PARSE TREE
For this project, we will be examining the same open-source corpus used by Banerjee et al. as described above. First, we will illustrate how subsets of the dataset can be visually conveyed in a binary tree that bifurcates along the lines of the constituents of a sentence. The Natural Language Toolkit (NLTK) for Python uses a Prolog-like tagging system to identify the encapsulated parts of speech in a text file through basic string manipulation. Then, once the parts of speech in the text file have been tagged, we can create a tree to represent them hierarchically. This covers the space taken up by the words printed on the page. Second, the key logs collected by Banerjee et al. have been collated into a CSV file organized by user, session, character, and timestamp. Through a nested series of dictionaries with key-value pairs, we can extract individual user data to map the
frequency of each key along a hidden Markov chain. This process constitutes the temporal component of biometric identification of authors. Third, the parts of speech in the text file responses themselves will be the input nodes for a recursive neural network (RNN), the proportional weights of which will be determined by the timing probabilities in the CSV file logs. By comparing our findings with theirs, we will finally be able to ascertain whether this is indeed a more integrated modality for author identification, insofar as it combines the content and the typing pattern together in the analysis. Finally, we can apply what we have learned to the semantic network of the Universal Networking Language (UNL) and see its potential for non-English language authors in international security. Now that we have laid the groundwork, so to speak, it is high time to present a small demonstration of what we have in mind with the tools at hand. As mathematical logic nowadays in vogue gravitates toward Noam Chomsky’s context-free grammars, we will adhere to a constituency-based grammar, which means that each word’s semantic intent will be contingent upon every other word in the sentence in order to confer meaning upon the reader/listener. Take for example this independent clause that appears as a recurrent theme in our vast corpus, cryptic and all-encompassing as it may be: “everyone will be judged by god.” Errors in capitalization and punctuation aside (useful as they are in tracing the words back to their author, based on age, education, race, gender, political orientation, etc.), this sentence can be parsed through a series of binary decisions based on part of speech using a script written with Python and the NLTK package library to render Figure 1.
Figure 1. Binary Parse Tree
As we can see in Figure 1, the program we have written labels each word according to whether it is a subject, verb, preposition, or object and then visually represents the word in a binary hierarchy wherein each word is dependent on its part of speech to decide where it is to be placed. “Everyone” is the subject (S) of the sentence, and the rest of the sentence belongs to the passive voice verb phrase (VP), which has been further subdivided into the verb (V), and the prepositional phrase (PP). Further, the verb is divided based on helping verb, and the prepositional phrase is parsed by the preposition (“by”) and the object of the preposition (“God”). As one can imagine, things can get a lot more complicated than that based on the complexity of the sentence structure we are dealing with at the moment, but this simple example suffices as an elegant illustration of the potential the program offers to make sense of raw text data. For every person in the database, a tree could be written, and based on their messaging connections, their roots may all form a single entity, like the aspen groves in Colorado, or the Great Barrier Reef in Australia. Then again, the words of the famed poet and American war hero, Joyce Kilmer, come to mind: “Poems are made by fools like me, / But only God can make a tree” [41].
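For reference, the following is a minimal sketch of how a tree like Figure 1 can be produced with NLTK. The bracketed structure is written out by hand here to mirror the figure, whereas the thesis’s own script derives the splits programmatically from the POS tags.

# Tag the example clause and print a hand-written constituency tree mirroring
# Figure 1 (requires nltk.download("punkt") and
# nltk.download("averaged_perceptron_tagger") on first use).
import nltk
from nltk import Tree

sentence = "everyone will be judged by god"
print(nltk.pos_tag(nltk.word_tokenize(sentence)))

tree = Tree.fromstring(
    "(Clause (S everyone) (VP (V (Aux will) (V (Aux be) (V judged)))"
    " (PP (P by) (O god))))"
)
tree.pretty_print()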
III. KEYSTROKE DYNAMICS
Where our approach differs from previous analyses in this area of NLP is the incorporation of temporality into an otherwise purely syntactical parsing structure. Thus, we must make an in-depth study of keystroke dynamics, exploring the current literature in the field and how it applies to our own research. However, it would be well worth our time to preface this section with a few brief reflections on the reasons for such a study. After all, “The character of a man is known from his conversations,” in the prescient words of the Greek playwright Menander; hence, detection of malicious activity online or in person hinges primarily upon the speech people use to convey their ideas. Documentaries and feature films can be just as effective as any other medium in presenting the opportunities and challenges of artificial intelligence being used to determine character and credibility. As Marcel Just, a brain researcher at Carnegie Mellon University, reports in Werner Herzog’s fascinating documentary, Lo and Behold! Reveries of the Connected World:
When you read a sentence that says there are two elephants walking across the savannah, a computer program can tell that the same thought is going on in your brain, whether you are watching the video or reading a sentence. At a conceptual level, it is the same; it is also the same for people across languages—there is a universality for the alphabet of human thoughts, and it applies to the videos… but it also applies to spoken and written speech, and it crosses languages. We have a vocabulary, the brain has a vocabulary, and we are beginning to discover it… You could essentially—in the not-too-distant future—tweet thoughts [42].
Likewise, while his largely speculative account, The Holographic Universe, has widely been discredited for its emphasis on paranormal investigation at the expense of scientific rigor, Michael Talbot is right to describe the neuro-physical process of thought in the following way: “Electrical communications that take place between the brain’s nerve cells, or neurons, do not occur alone. Neurons possess branches like little trees, and when an electrical message reaches the end of one of these branches it radiates outward as does the ripple in a pond” [43]. The thought-provoking 2014 techno-thriller, Ex Machina, posits how machines could be made to mimic the mind when Caleb the computer coder extols the anthropomorphic properties of the android Ava with whom he has become smitten: “Her
language abilities are incredible. The system is stochastic, right? It’s non-deterministic. At first, I thought she was mapping from internal semantic form to syntactic tree structure and then getting linearized words, but then, I started to realize the model is some kind of hybrid” [44]. While machinery mirrors mentality and vice versa, the only way to get at the heart of the mechanism of mind over matter is to meld the study of each side of the BCI together, and that is precisely where keystroke dynamics come into play, as the digital age is characterized by the interaction of minds by way of machines.
A. KEYLOGGING
There are a few key articles in the area of keylogging that will be necessary for us to take a look at before we proceed any further. Afroz, Brennan, and Greenstadt at Drexel University have assessed the extent to which feature classifiers based on word usage can be utilized to ascertain whether a given author was attempting to imitate another author or to obfuscate the original author in their seminal paper, “Detecting Hoaxes, Frauds, and Deception in Writing Style Online” [45]. The authors applied advanced machine learning techniques to the corpus with a view toward specific stylometric features, comprising the relative distribution of syllables, modal verbs, adjectives, adverbs, and pronoun usages to paint a picture of the linguistic identifiers that would distinguish one author from another for the purpose of detecting adversarial obfuscation [45]. The study also included comparisons between passages by famous American authors such as Cormac McCarthy, William Faulkner, and Ernest Hemingway and forged examples of their writings; these classifiers were then plotted according to a series of gratuitous equations measuring the information gain versus rate of deception [45]. While the forgeries would probably not be able to deceive a machine learning algorithm programmed to flag author obfuscation, a literature professor familiar with the original oeuvre would likely have an easier time discerning authorial identity on the basis of sentence construction, since the difference between authors would come down to a certain je ne sais quoi of bon mots that no machine is yet capable of reproducing. This distinction between success rates in authorial identification versus linguistic obfuscation could be said to constitute the true Turing test when it comes to textual automation. In the final assessment, it takes a well-trained human to identify whether a human speech is original or imitated. Machines can only approximate
human speech at this juncture, even though they are capable of generating some uncannily accurate-sounding dialogue. Depending to a large extent upon how we define what it means to learn, machine learning could be said to be a contradiction in terms. Does learning necessarily involve self-cognizance? If it does, then the state-of-the-art technology is still not quite up to snuff.
While we have already mentioned Banerjee’s text analysis, his research group has written another article that segues from the previous piece on stylometric deception detection [46]. It also happens to dovetail with our own style of analysis in that it makes use of context-free grammar (CFG) parse trees to detect deceptive verbal patterns with over ninety percent accuracy. The team went beyond shallow part-of-speech (POS) tagging to derive unigram and bigram classifiers over lexicalized production rules for three datasets of truthful and untruthful statements: Yelp reviews, TripAdvisor reviews, and essays about contentious topics like abortion, friendship, and the death penalty, revealing deep syntax features in each set [46]. Robust results generated by parse tree analysis surprisingly indicated that deceptive writing contained a higher frequency of subordinating conjunctive verbal phrases, which would reveal a tendency toward confabulation in the sampled sentences by supplying extraneous information in an attempt to mask its erroneousness. While this study produced more representative output than earlier models, it still fails to include the temporal component of typewriting, as ours will try to do.
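To make the CFG production-rule features described above more concrete, the following minimal sketch (our own illustration, not the code from [46]) counts both unlexicalized and lexicalized production rules from a parsed sentence with NLTK; the bracketed tree string is an invented example.

import nltk
from collections import Counter

# A hypothetical bracketed parse of a short review sentence.
tree = nltk.Tree.fromstring(
    "(S (NP (PRP We)) (VP (VBD loved) (NP (DT the) (NN pizza))))")

# tree.productions() yields rules such as VP -> VBD NP and VBD -> 'loved';
# their counts can serve as deep-syntax features for a classifier.
rule_counts = Counter(str(rule) for rule in tree.productions())
for rule, count in rule_counts.items():
    print(count, rule)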
As far as discrimination between authentic and hoax posts is concerned, the Language and Information Technology Research Lab at the Media Studies Department in the University of Western Ontario has produced helpful research for deception detection with regard to “Fake News” [47]. Just as one man’s trash is another man’s treasure, one person’s lie is another’s falsehood, so the determination of what constitutes true news is nebulous at best. However, this article delineates some of the tell-tale signs of deception. The important thing to keep in mind is the presence of contextual clues, as one can only trust but verify when there is a basis for comparison. Thus, the Thomistic principle of “rarely affirm, seldom deny, and always distinguish” is the best policy to evaluate the veracity of Internet publications. Whether it be impostor accounts on social media, sham journalism, or satirical works of art, the Overton window of social acceptability comes into
play with the likelihood that even the most extravagant theories are bound to be believed by someone, under the proper circumstances. Such confusions can cause mass hysteria, as in the case of Orson Welles’ infamous radio adaptation of H.G. Wells’ War of the Worlds. Similarly, a forged image of a Department of Defense missive declaring a national draft sparked panic online at the beginning of 2020, forcing the government to issue a disavowal of the false alarm [48]. As the researchers allude, “Tweets are especially suited for detecting such irregular communication behaviors in information sharing such as hoaxes” [47]. Thus, these Canadian professors have collated nine qualities to look for in a text corpus that would render it suitable for cross-analysis of truth-telling. These include the presence of true and false statements, digital accessibility, verifiability, homogeneity of length and content, regulated timeframe, and mode of transmission [47]. The Banerjee corpus that we will be using certainly abides by the parameters they have set forth. Knowing what to look for will make it easier for law enforcement agencies, educational institutions, and print publications to exercise discernment with regard to the material presented in the online domain.
B. SEMANTICS
We would be remiss if, in focusing solely on the keys pressed to represent characters, we neglected a brief discussion of the semantical constructions those keys produce. After all, our model will incorporate not only the duration of each character but how long it took to type each word in the sentence for truthful and misleading users alike in the Banerjee corpus. As the human brain conceives of thoughts in words and chunks of words rather than spelling letter by letter, this POS model will act as a more accurate approximation of human speech and, therefore, be more precise in distinguishing one author from another on the computer. First, however, it would be helpful to provide a cursory examination of the current literature on the topic of semantic analysis for the purpose of digital author identification. Since syntax concerns structure but semantics concerns meaning, this is not merely a matter of “mere semantics,” as colloquial parlance would have it; rather, semantics makes all the difference in the world and is crucial for our understanding of author identification in Internet communication: if syntax is the blueprint, semantics is the building itself.
Many think tanks have tackled the problem of semantical parsing for author identification, including a joint endeavor by researchers at the University of Arizona and New York University, entitled “A Framework for Author Identification of Online Messages” [49]. In it, the authors describe how Chinese cannot be lexically broken down in the same fashion as English, because word boundaries are not marked the same way in a logographic writing system [49]. Therefore, other means of dividing parts of speech have to be devised. That being said, neural backpropagation networks have had admirable success in classifying the identifying features of authors, whether they be poets, playwrights, or politicians. However, accuracy decreases as the size of the sample grows. As the researchers observed, it becomes more and more difficult to attribute authorship as the size of a particular author’s corpus expands. This perhaps counterintuitive phenomenon stands to reason, considering a truly great author’s works will vary greatly in word usage and sentence construction over the course of their creative career. Still, the tools heretofore implemented have been executed well enough to determine that the disputed Federalist papers were indeed penned by James Madison, as had long been suspected by legal scholars [49]. The Federalist papers are only one example of the possibilities of this technology to trace authorship via semantical parsing, and there are many others, including those of the forensic variety, especially when applied to instant messaging platforms and servers.
Abbasi and Chen at the University of Arizona have followed up on this framework as it pertains specifically to terrorist threats against homeland security that appear regularly on online message boards [50]. As in the previous study, they categorize authorship analysis by stylometry under the broad domains of identification and characterization of writers’ styles by virtue of semantic content, with a specific view toward Arabic scripts. As in the case of Chinese, some emendations need to be made, since Arabic uses a non-Latin script. To account for elongation, inflection, and diacritics, root words must be subdivided into stems with infixes and suffixes in order for the model to run properly [50]. Lexical analysis measures the frequency of each character per passage, sentence, and word, even if the word appears only once in the corpus as a hapax legomenon [50]. Syntax, according to their definition, comprises word usage patterns and punctuation,
while structure deals with organizational layout, whether that be according to paragraph or link embedding as the case may be [50]. The experiment was capped with content-specific control headings. However, it could be argued that such a distinction is a bit misleading and really just a question of scope more than anything else. Saliently, Abbasi and Chen adduce that the micro-fragmented dialect of the Internet can make it exceedingly difficult to establish a large enough sample size of verbiage to merit scrutiny [50]. In other words, the signal-to-noise ratio of online communication patterns’ quick back-and-forth soon diminishes the return-on-investment. Still, online writing provides a slew of other open-source information gathering clues that can assist in determining authorship, even for small slices of everyday speech. While most of the project recapitulated previous research, they applied the framework to a test bed of virulently anti-American Al-Qaeda correspondence using a support vector machine (SVM). What most distinguishes their modus operandi from previous attempts was the extraction of text data from the Dark Web using crawling analytics. Clustering algorithms filtered results to an accuracy that matched English variants on a KKK message board, but SVM proved more accurate than the decision tree prototype for feature set classifiers [50]. As the authors conclude, however, there is much more room for improvement when it comes to the development of an analytic tool that could encompass the entire Internet of Things (IOT).
Terrorism is a global problem; hence, it will need a global solution. To that end, Efstathios Stamatatos of the University of the Aegean in Greece has also worked on using NLP to track terrorist communications in the digital sphere, particularly with regard to handling class imbalance in the training data, whether in English or Arabic [51]. While the problem does not plague our text corpus as greatly as it does other samples, a disproportionate number of passages from one type of author over another could skew the dataset and produce a biased output, especially in testbeds that are non-English speaking. Stamatatos used character n-grams as inputs for an SVM calibrated to measure differences rather than overlaps in text data; signally, this approach is language-agnostic yet combines grammar with rhetoric in speech patterns [51]. To normalize the training set toward a more uniform model, he applied a Gaussian function, defining the imbalance as the ratio of the biggest class to the smallest [51]. Employing several
different methods to modulate the majority versus the minority texts yielded mixed success. Even though he incorporated both semantical and syntactical analyses into his study, Stamatatos paid no heed to the time it took to produce the texts in the first place, which is where our research should provide a critical contribution.
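As a rough, hypothetical illustration of the character n-gram approach Stamatatos describes (not his implementation, and with invented toy data), a standard scikit-learn pipeline can feed character trigram features into a linear SVM:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus: two placeholder "authors" with two snippets apiece.
texts = ["the service was prompt and the pasta divine",
         "the pasta arrived cold, which was disappointing",
         "best burger in town, hands down",
         "hands down the friendliest staff in town"]
authors = ["A", "A", "B", "B"]

# Character 3-grams capture sub-word habits (affixes, spacing, punctuation)
# and are language-agnostic, which is the appeal of the approach.
model = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(3, 3)),
                      LinearSVC())
model.fit(texts, authors)
print(model.predict(["the pasta was divine and the staff prompt"]))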
Not only Greece but also Italy has produced quality research in the area of semantical analysis. Fabrizio Sebastiani of the Consiglio Nazionale delle Ricerche has utilized machine learning (ML) for text categorization (TC) in diverse sources of Web content [52]. While the article is a bit dated by today’s state-of-the-art standards, it does provide some helpful approaches to text indexing and knowledge engineering (KE) [52]. The first step of classifying a text corpus is to label it in categories according to Boolean logic [52]. This has already been done for the text corpus with which we are working. Next, documents can be ranked in a hierarchy organized by vocabulary content. As Sebastiani states, “Other applications we do not explicitly discuss are speech categorization by means of a combination of speech recognition… [and] author identification for literary texts of unknown or disputed authorship,” which is relevant to our study [52]. Through automated metadata generation and text filtering, rules for categorization can be established; however, disambiguation of amphibological words and phrases poses significant challenges to automation that could only be solved by having a more robust word bank for a neural net to traverse [52]. While KE builds the classifier, a learner is an automated builder of classifiers [52]. Then, documents can be indexed according to type to create the training test bed against which standards are validated [52]. In reducing the dimensions of the set, “The choice of a representation for text depends on what one regards as the meaningful units of text (the problem of lexical semantics) and the meaningful natural language rules for the combination of these units (the problem of compositional semantics)” [52]. While compositional semantics bleeds over into syntax, Sebastiani’s way of breaking down the problems confronting ML in NLP is a useful path toward new discoveries through inductive and probabilistic methods. In this sense, then, author identification by text composition is just as much a quandary for information science to solve as it is for computer science.
Before we move into the third and final subsection in this chapter, it would be good to take a brief aside to remind the reader that the applications of this technology extend far beyond the scope of law enforcement and into the wider world of art, literature, and culture. Semantical and syntactical analyses of graphemes and lexemes lend themselves to cracking some of the most mysterious codes in history, thereby melding classical cryptography and iconography. For instance, Vincent Jara Vera and Carmen Sanchez Avila at the Department of Applied Mathematics to Information Technology and Communications in the Polytechnic of Madrid published a fascinating research article proposing their solution to the strange letters graven on a statue of the Virgin of Candelaria, the meaning of which had all been lost through centuries of obscurity [53]. By performing in-depth analysis of word frequency, subject-verb-object (SVO) construction, and contextual clues, they were able to deduce a plausible explanation for the hidden writing as an amalgam of Berber, Amazigh, Latin, and Spanish: “The meaning of the expression, with a syntactic SVO form would be… the meaning of light… following the same semantic line of… the Mariological Lucan passages…with the Virgin Mary as the Lady who sheds Light, illuminator, the Illuminatrix” [53]. Likewise, such cultural enrichment, which is the fruit of computational linguistics, can shed light on the darkest corners of the World Wide Web.
C. SYNTAX
Since the text corpus we are parsing involves not only semantical content but syntactical content, it is necessary to peruse the extant work being done on syntactical analysis in relation to author identification. Zhao, Song, Liu, and Du at the Capital Normal University of Beijing, China have recently issued “Research on Author Identification Based on Deep Syntactic Features,” which introduced the syntactical side to tree parsing for author attribution [54]. As the researchers point out, “many of the emerging tasks such as the article plagiarism detection requires us to identify the author of the paragraph or even of the sentence, but previous work still focus on the identification of article. They are the limitations of the current author identification method” [54]. Since the Banerjee corpus is indeed broken up into individual sentences (and even words) by users speaking both truthfully and falsely, our method will be a novel approach in its focus on the sentence-
level syntax, rather than the paragraph or the page. However, the model they employ perfectly suits our own purposes, as “This paper emphasizes that syntactic features should be selected for the author identification, such as dependence relation and the syntax Tree” [54]. Their feature set is circumscribed by a series of nested scopes, from character to word to sentence, of which “POS (part-of-speech)-based features include the relative frequency of verbs, nouns, adjectives and adverbs” for a collection of 23 novels [54]. However, rather than use NLTK, they implement their model with Stanford CoreNLP and pylyp. Moreover, they give preferential treatment to syntax over semantics, while an ideal model combines both elements in a single system to more realistically depict speech patterns in online typing.
The veracity of syntactical analysis proves imperative to courtroom adjudication of forensic stylistics, as Carole Chaski, the Executive Director of the Institute for Linguistic Evidence, has adduced [55]. Evidence collected from text samples cannot be admitted in court by Special Agents of the Department of Justice unless the methods scientists have developed can adequately trace authorship back to the computer user who typed the document in the first place. Chaski defends the use of such techniques against any barrage of doubt, reporting that “syntactically classified punctuation and syntactic analysis of phrase structure… correctly differentiate authors and cluster the questioned document with its actual author with a high level of accuracy” [55]. However, she importantly points out that “the type-token ratio is a venerable idea with predictable results in the context of Shakespeare studies, but within the confines of a forensic investigation with its typically short documents, the type-token ratio loses its utility” [55]. While such a contention may very well be contested by the experts Chaski claims to represent, it is worth noting, given the technique’s prevalence in the computational linguistics and intelligence communities. Meanwhile, she dismisses literary criticism as an empirically verifiable means of distinguishing authors, as literary scholars like Donald Foster attempted to do [55]. For a small sample size, distinguishing authors by syntactic patterns proved to be infallible; however, such a facile solution notoriously does not scale well, which is why we will have to incorporate other pieces of evidence, including timestamp durations, to create a more stable signature irrespective of other factors prey to change and manipulation.
Since we will be combining the nodes of parse trees derived from deterministic context-free grammars with the proportional weights derived from keylogging timestamps to architect a neural network, it would be helpful to also look at some of the research currently being done with regard to the applications of Artificial Neural Networks (ANN) to syntactical analysis and NLP in general. As Chun-Hsien Chen and Vasant Honavar have pointed out, “ANN systems on the other hand, have been inspired by (albeit overly simplified) models of biological neural networks… Today’s AI and ANN systems each demonstrate at least one way of performing a certain task (e.g., logical inference, pattern recognition, syntax analysis)” [56]. While this approach meshes with our own, the logical syntax of which they speak refers to artificial languages for compilers to process, not natural languages processed by human brains by means of computers. Thus, while any symbols tokenized by their model may very well make information storage, retrieval, and transmission more efficient, no finite-state machine yet in existence would be capable of parsing the vocabularies of human beings so as to distinguish one from another. This critical distinction brings out the need to define syntactical analysis by the parameters of a lexicon understandable by people accustomed to parts of speech rather than processors accustomed to binary bits.
Not only is it important to keep in mind that the languages we parse are natural rather than artificial but also that these natural languages may not necessarily be English. While each country has its own dialect when it comes to writing programs that can encode malicious viruses into information networks, fluency in these countries’ natural languages is just as crucial to national security and defense as fluency in computer programming languages. Russia has been known for its cyber-meddling ever since the end of the Cold War. Thus, the difficulty of determining truth value in a world of ambiguity is highlighted by a recent paper from Iurii Stroganov and Dmitriy Pogorelov at Moscow State Technical University entitled “Ternary as Fuzzy Logic in Creation and Comparison of Syntax Trees when Determining Functional Styles” [57]. As Stroganov and Pogorelov declare, “Information, which the computer operates, is anyway displayed as units and zeros. Either ‘truth,’ or ‘lie’—binary logic… However, the main problem in the analysis of sentences and texts in a natural language is creation of a correct, also called ‘gold,’ tree of syntactic
analysis” [57]. This “golden tree” is precisely what we seek to develop through a hybrid of static and dynamic analysis of speech in real time online. Thus, whether one would respond “yes,” “no,” or “maybe” to the truthfulness of any statement and the attribution of its e-author, a temporal syntax will be instrumental in discovering the lies of the liar and the truths of the truth-teller in any given human tongue.
China has also been a culprit in digital infiltration of late, and, as in Russia, scientists there have made contributions to the sub-field of semantic analysis. Researchers from Wuhan have experimented with the potential of VML to visualize automatically generated syntax trees for analysis, but these trees do not factor the dimension of time into the equation, as ours will [58]. Likewise, Hongbo Li and Jianping Yu have written about melding syntax with semantics in their article, “Knowledge Representation and Discovery for the Interaction between Syntax and Semantics: A Case Study of Must” [59]. In it, they state, “Language is not only a symbol system, but also a value system. Constituents in a language are not isolated; they are, on the contrary, woven in an invisible net. Syntax and semantics are interactive in this net: syntactic differences would be mapped to semantics, and semantic differences would be reflected in syntax” [59]. In this fashion, then, they apply formal concept analysis (FCA) to the verb must [59]. However, the researchers miss the fact that epistemic versus root senses of a word fall under the category of semantical rather than syntactical analysis [59]. From a forensic perspective, quantification alone cannot suffice to create a clear picture of what any one word in English or any other language does to identify one author over another; things must always be taken in context, lest we risk making false inferences.
But what does all this grammatical material have to do with protecting communities at risk of violence, whether domestically or internationally? Well, a joint endeavor between the UK and British Columbia sought to tackle just that problem in the paper “Positing the Problem: Enhancing Classification of Extremist Web Content Through Textual Analysis” [60]. What the authors found was that incorporating POS analysis into the Terrorism and Extremism Network Extractor (TENE) web-crawler proved effective at flagging sites that actually pose a threat to peace and safety [60]. This auspicious invention can assist law enforcement in diagnosing terrorist activity online before the perpetrators can enact their devious plans
in real life. The semantic analysis of this application involves “identification of ‘keywords’ (linguistic markers) that would represent the pro-extremist, neutral and anti-extremist categories… using Open NLP (a language processing tool) to develop a POS tagger… extracting nouns to create a frequency distribution” [60]. While this automated approach may be more exhaustive than our manual approach, it leaves out crucial keylogging timestamp data that could provide an even more robust determinant of whether a particular computer user’s posts constitute a credible cause for concern worth investigation and eventual prosecution. After all, rogue IPs reveal only so much about a website administrator’s intentions toward the subjects presented on the Internet.
No discussion of social media analytics monitoring would be complete without mention of one of the most prolific news-reporting entities on the Web, namely, Twitter.com. An Indonesian informatics lab has recently evaluated Twitter traffic from the vantage of NLP, with mixed results [61]. After input data from tweets is extracted, normalized, and processed, POS tagging is performed, even though the textual content often does not abide by conventional rules of grammar and syntax [61]. Relevant for us is the fact that the timestamps associated with when a tweet was released factor into the input data for syntactical analysis, thus introducing a temporal component for identification purposes. However, ours will time the rate at which every word is typed rather than the time that elapses between one tweet and another. In the last assessment, while Twitter is a logical place to search for dangerous data, the hardest aspect of its virtual topography is the relative inability to separate fact from fiction, especially across different languages.
D. EXPERIMENTATION
Before we delve into our experimentation over the text corpus collected by Banerjee et al., it would be good to revisit the modes and means by which such verbal data was collected in the first place [21]. As a recap, users submitted truthful and falsified responses to restaurant reviews, gay marriage debates, and gun control questionnaires using Amazon’s Mechanical Turk. The typing times for these essays were recorded as flows demarcated by document, key, space, and word, respectively. Data was classified by features of the users’ typing patterns, specifically the durations of and pauses between
deletions, insertions, and substitutions. We shall be focusing upon the timestamps associated with each word, as those make the most sense to pair with the individual parts of speech in each sentence diagram. Rather than using this data to determine the subjects’ psychological distance from their source material, as these researchers did, we will be using it to compare and contrast whether pairing timestamps with parts of speech (thus combining the temporal with the lexical) generates a more accurate digital signature of online speech acts. Combining the POS samples with their timestamps could result in the serendipitous portmanteau, “timestamples,” as in Table 1:
Table 1. POS and Timestamps (timestamples)
Now that we have performed a cursory overview of the current research being done in NLP as it pertains to author authentication for forensic investigation, it is time to revisit the Natural Language Toolkit (NLTK) package and the Banerjee corpus of text data, except this time we will attach time stamps to the trees of the sentences parsed. For the purposes of original experimentation, we will compare a truthful restaurant review and a false
restaurant review from two separate users in the database to determine if incorporating keylogging timestamps into the parse tree nodes as proportional weights for a neural network is indeed a more accurate metric of assessment than merely focusing on grammatical analysis or typing frequency alone. On a semantical level, this testbed is skewed by virtue of the fact that restaurant reviews are about as neutral a topic as one can think of (unless there is an ethical conflict concerning what is being eaten). However, that may make it a more unbiased text sample than a more controversial topic of discussion would be.
The data we are working with required a bit of manipulation before this processing could even be accomplished. First, we organized what had been typed by the users by word rather than by character. While that eliminates some information regarding insertions, deletions, and backspaces, it narrows the scope to focus only on the key ingredients composing sentence construction. Second, we transformed the data to measure duration according to each word typed rather than each character. Even though this diminishes a degree of specificity inherent to the data set, it more closely parallels the actual structure of verbal thought, which tends to proceed from word to word, or even phrase to phrase, in the human mind, rather than going from character to character. Third, we tagged these individual words according to their part of speech (a minimal sketch of these three steps follows Table 3). The application programming interface (API) documentation for the NLTK Python package identifies at least eight basic parts of speech with corresponding abbreviations: S (sentence), NP (noun phrase), VP (verb phrase), PP (prepositional phrase), Det (determiner, as in an article), N (noun), V (verb), and last but not least, P (preposition) (see Table 2) [62]. However, the taxonomy employed depends largely on the degree of granularity specified in the parse tree. Thus, a more finely tuned catalogue of abbreviations, accounting for subordinating conjunctive clauses and the like, is illustrated in Table 3 [63].
Table 2. Basic NLTK POS Tagging
Table 3. Advanced Penn Treebank POS Tagging
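The three preparation steps just described can be sketched in a few lines of Python; the keystroke log below uses a hypothetical layout and invented durations rather than the actual Banerjee records, and it assumes the NLTK tagger models have been downloaded.

import nltk  # assumes nltk.download('averaged_perceptron_tagger') has been run

# Hypothetical keystroke log: (character typed, duration in milliseconds).
keystrokes = [("w", 180), ("h", 95), ("e", 88), ("n", 110), (" ", 60),
              ("I", 128), (" ", 55),
              ("w", 150), ("a", 120), ("n", 140), ("t", 143)]

# Steps 1 and 2: rebuild words and accumulate one duration per word.
words, durations, current, elapsed = [], [], "", 0
for char, ms in keystrokes:
    if char == " ":
        if current:
            words.append(current)
            durations.append(elapsed)
        current, elapsed = "", 0
    else:
        current += char
        elapsed += ms
if current:
    words.append(current)
    durations.append(elapsed)

# Step 3: attach a part-of-speech tag to every word.
tagged = nltk.pos_tag(words)

# The resulting "timestamples": (word, POS tag, duration in ms).
timestamples = [(word, tag, ms) for (word, tag), ms in zip(tagged, durations)]
print(timestamples)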
Let us take for instance the following truthful (if rather pedestrian) statement by user #1: “when I want pizza in Louisville the first place that comes to mind is the local favorite.” Semantically it may be simpler than the statement we parsed in the last chapter, but syntactically it proves to be more complex, thus demanding a more sophisticated taxonomy for diagramming the parts of speech. Our NLTK POS tagger ramifies the constituent elements along the following lines: “When” is a WRB (wh-adverb), “I” is PRP (personal pronoun), “want” is VBP (non-3rd person singular present verb), “pizza” is NN, “in” is IN (preposition or subordinating conjunction), “Louisville” is NNP (singular proper noun), “the” is DT (determiner), “first” is JJ (adjective), “place” is NN, “that” is WDT (wh-determiner), “comes” is VBZ (3rd person singular present verb), “to” is TO (preposition), “mind” is VB (base form of the verb), “is” is VBZ, “the” is DT, “local” is JJ, and “favorite” is NN. In this breakdown, there are a couple of limitations of the technology to consider. First, the program only works for English. Second, automation runs the risk of not picking up on subtle nuances in semantical content. For instance, the word “mind” can act as either a verb or a noun, and the tagger misidentified it as a verb in this case. This would normally require that we manually override the program’s designation to NN rather than VB, except that the program only recognizes it as part of the infinitive form “to mind.” However, “favorite” could act as an adjective or as a noun, yet the program correctly identified it as the latter. Likewise, “I” was initially labelled as a common noun rather than a personal pronoun, depending upon capitalization. Thus, success is a bit haphazard as of yet. Furthermore, the subject must be correctly marked in order to architect a parse tree with NLTK, which, as any grade school student experienced with diagramming sentences on a chalkboard will inform you, is in this instance the last word of the sentence, “favorite,” counter-intuitive as that may seem given that it follows the linking verb “is”.
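The tagging step itself reduces to a single NLTK call; the snippet below is only a sketch of that step, and the tags it prints may differ slightly from those listed above depending on the NLTK version and tagger model installed.

import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' resources

sentence = ("when I want pizza in Louisville the first place "
            "that comes to mind is the local favorite")

for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
    # e.g., ('when', 'WRB'), ('I', 'PRP'), ('want', 'VBP'), ...
    print(word, tag)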
These relatively minor oversights notwithstanding, the next step is to attach the timestamp to each of the words in the sentence. While the data was originally organized to reflect the duration of each key pressed in accordance with the keylogging record, we collated it to register the time lapse for every new word in the sentence. Even though this misses some of the specificity of the testbed, it produces a greater approximation of
the actual thought process that went on in the author’s mind when typing on the keyboard to make the words appear on the screen. The original timestamps contained in the keylogging record and our calculated durations are all provided at millisecond resolution. Intriguingly, these nearly instantaneous typing times indicate how the human brain benefits from the effect of massive parallelism across neurons, so it is the computer that has to keep up with human cognition, rather than the other way around. “When” took 3362 milliseconds, “I” 128, “want” 553, “pizza” 896, “in” 88, “Louisville” 2662, “the” 664, “first” 455, “place” 472, “that” 313, “comes” 437, “to” 95, “mind” 424, “is” 88, “the” 161, “local” 1005, and “favorite” 1088 (as illustrated in Figures 2 and 3). It would be an interesting study in and of itself to see what words take the longest to type and the frequency with which they appear in the corpus, but that method may be reductive to the point of being counter-productive. Instead, we will adopt a more comprehensive model by correlating the POS with the timestamp as input to a recurrent neural net (RNN). However, the process was rather painstaking and had to be entered manually in order to operate efficiently, which leaves room for future optimization to introduce automation. Moreover, there is a degree of flexibility in the compression of the tree generated, whereas a more hierarchical method would be preferable so as to distinguish between dependent and independent clauses in complex sentences; this built-in feature of NLTK accommodates “indefinitely deep recursion of sentential complements,” a “constellation of properties” known as “unbounded dependency construction” [62]. For the sake of simplicity, then, we will save this component for the next chapter, when we compare and contrast trees produced from truthful and mendacious authors via neural networks.
Figure 2. User #1 Truthful Tree with Timestamps (Part I)
Figure 3. User #1 Truthful Tree with Timestamps (Part II)
IV. UNIVERSAL METHODOLOGY
At the end of the last chapter, we encoded a bifurcated parse tree from the POS tags of user #1’s truthful statement and their associated timestamps. These coupled “timestamples” will constitute the inputs and weights of the RNN. But first we must work through user #1’s falsified statement, as well as user #2’s true and false statements, in order to establish a basis for comparison between authorial signatures. Then, we can parameterize a paradigm for author identification online through temporal-syntactic digital signatures across the text corpus in general and the Internet internationally. Thus, we will explore the extent to which this computational model for digital communication can be applied to other, non-Anglophone languages, such as Chinese and Russian, particularly with a view toward detecting and preventing terrorist subversion by rogue individuals and enemy nation-states, at home and abroad. Finally, we will suggest new avenues for future research into this technology for the betterment of humanity, as well as summarize the conclusions reached regarding the limitations and potentialities implicit in such cyber analytic tools.
A. DATA ANALYSIS AND RESULTS
Before we compare user #1 with user #2, let us first compare user #1’s truthful response with their false response using NLTK. The falsified response was even simpler than the true response, which in and of itself may be significant. When the author was being truthful, they seem to have felt more liberty to express themselves at length, rather than constricting their speech for the sake of not betraying their true alliances. Thus, in the fake text sample, user #1 spoke with such superlative praise as to raise suspicions, typing, “their BBQ is nothing to dismiss.” In fact, it is almost a double-negative, as “nothing” and “dismiss” both hold connotations of negation rather than affirmation. Since this sentence expresses what the author did not mean rather than what they did intend, it could be a clue to see whether false authors employed more negative verbiage than in their truthful counterparts, as a sort of subconscious slip. Upon closer examination, “their” has a timestamp of 2008 milliseconds and a POS tag of PRP (possessive pronoun), “BBQ” 547
ms and NNP (singular proper noun), “is” 71 ms and VBZ (third-person singular present verb), “nothing” 1104 ms and NN (singular noun), “to” 105 ms and TO (to), and “dismiss” 1000 ms and VB (base form verb). This information can be readily visualized in the comma-separated value cells in Table 4.
Table 4. User #1 False Statement
As one might readily detect, the timestamps for this sample are rather sporadic: some words take much longer to type regardless of their length, correlated instead with their position in the sentence, indicating that the author is exerting extra effort to come up with things to say, whereas in a truthful statement the words flow more organically. Furthermore, there are many false starts, leaving dependent clauses with little meaning hanging. However, we focus on the full sentences for the purpose of parsing something that is inherently coherent. After chinking and chunking have been properly performed, NLTK renders the graphic to be found in Figure 5 via the code snippet to be found in Figure 4.
Figure 4. User #1 False Code
Figure 5. User #1 False Tree
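Because Figures 4 and 5 are reproduced as images, the following hypothetical sketch shows one way such a tree could be produced with NLTK’s RegexpParser; the chunk grammar is an invented approximation for this one clause, not the code actually shown in Figure 4.

import nltk

# User #1's false clause, pre-tagged with the tags reported in Table 4.
tagged = [("their", "PRP"), ("BBQ", "NNP"), ("is", "VBZ"),
          ("nothing", "NN"), ("to", "TO"), ("dismiss", "VB")]

# A small cascaded chunk grammar: noun phrases first, then a verb phrase.
grammar = r"""
  NP: {<PRP>?<NNP|NN>+}
  VP: {<VBZ><NP><TO><VB>}
"""
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)
tree.pretty_print()   # tree.draw() opens the graphical window NLTK provides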
Now that we have compared User #1’s false and true statements using NLTK and timestamps, let us proceed to look at User #2’s honest versus duplicitous commentary. As of yet, little distinguishable pattern has emerged whereby one could differentiate between veracity and mendacity on the basis of sentence construction and typing times alone, but perhaps a more detectable and reproducible order will appear once we feed these figures into the RNN to train our model over the dataset. While User #2 raved with gusto about the gourmet quality of Cleveland grilled cheese in this truthful writing sample, we took a small, generic independent clause from their restaurant review for parsing; namely, “they are the first and best to utilize this concept” (Table 5).
Table 5. User #2 Truthful Statement
As we can observe in Table 5, the sentence fragment “they are the first and best to utilize this concept” has the following breakdown: “they” (225 ms, PRP), “are” (188 ms, VBP), “the” (140 ms, DT), “first” (356 ms, JJ), “and” (165 ms, CC), “best” (773 ms, JJS), “to” (93 ms, TO), “utilize” (2172 ms, VB), “this” (232 ms, DT), “concept” (903 ms, NN). Before entering the textual data into our Python script, we can make a couple of preliminary observations. First, the timestamps for these POS tags are shorter than User #1’s, irrespective of the truth or falsity of their statements, which could be attributed to the fact that some people think faster on their feet than others, for instance, when typing for a timed test like this one. Second, the parts of speech that took User #2 the longest to formulate were naturally the verb and noun choices, as they admit of far more synonyms to select from than more functional speech constituents, such as articles and conjunctions. Thus, “utilize” and “concept,” while not very multi-syllabic, are not necessarily the most intuitive choices outside an academic context, let alone a restaurant review. When it comes to author identification by sentence construction, the single most crucial giveaway would be word choice, as the richness or paucity of one’s vocabulary reflects a great deal about an individual’s education, upbringing, profession, and level of global intelligence. However, it is important to keep in mind that small things, such as whether one tends to write in active or passive voice, can also speak volumes about one’s subconscious attitude toward that which he or she addresses.
After some finagling, an acceptable tree was rendered to represent User #2’s truthful review. Some prefatory notes would be beneficial here to explain one of the hardest problems facing AI research today. Computers have a notoriously difficult time parsing the nuances of human speech, especially given our seemingly infinite capacity for obfuscation of means and intents. Even the most generically innocuous statements, like the one currently under discussion, are prey to drastic misinterpretation once machinery gets involved. In the first place, the chunking process demands some manual chinking and tinkering in order to properly diagram the sentence. Furthermore, NLTK’s tagset departs from the traditional Latinate parts of speech, which we experiment with re-instantiating for the purpose of clarity. However, that demands correcting certain terms that the computer had failed to keep track of. For instance, “first” and “best” are technically adjectives and were initially flagged as such. However, in this deceptively simple sentence, they act not as adjectives but rather as nouns that serve as the complement to the pronoun subject “they” via the copular verb “are.” To throw another wrench into the equation, when “best” and “first” were changed to nouns instead of adjectives, the part of the sentence they fell under (complement, prepositional phrase, etc.) became all catawampus in the graphic generated. Therefore, the terms in question had to be called superlative adverbs (RBS) in order for the program to generate an accurate graph. However, even this is a problematic stop-gap, in that the words are not adverbs, though they are indeed superlatives. A future edition of NLTK would do well to bear these inconsistencies in mind to accommodate a wider breadth of computational linguistic
displays. With that caveat then, Figures 6, 7, and 8 demonstrate the code and the output using NLTK to parse User #2’s truthful sentence.
Figure 6. User #2 Truthful Code
Figure 7. User #2 Truthful Tree (Part I)
Figure 8. User #2 Truthful Tree (Part II)
Let us now examine how this construction differs (if in any way, shape, or form) for User #2’s false statement versus the true one. Then, we can compare the two users
against each other as a basis for trust verification across the corpus. There were so many false starts in this sample that it was hard to find an untarnished independent clause in the text data, and the only one that could be found had a spelling error. However, the fact that the sample is characterized by dangling participles and typos is only further indication of User #2 fumbling for words in the event that he or she is lying about what they truly believe. It would therefore seem, upon a cursory assessment, that it is more straightforward to differentiate between true and false statements than between users, because it is difficult to tell whether an anomalous expression owes more to an individual author’s verbal propensities or to the added pressure of confabulating a feasible response. The breakdown for User #2’s deceptive quote can be studied more in-depth in Table 6:
Table 6. User #2 False Statement
As we can see, the snippet we chose to parse is the following: “the (208 ms, DT) ambiance (1137 ms, NN) is (75 ms, VBZ) nice (359 ms, JJ) too (224 ms, RB).” Not surprisingly, “is” took the least time to type, whereas “ambiance” took the longest (presumably because the author was trying to remember whether to spell it “ambiance” or “ambience”). Again, it serves us well to recollect here that punctuation has routinely been omitted for ease of parsing, as it raises a whole other host of signifying characteristics that are outside the scope of our classifiers for sentence construction. We will proceed to run our NLTK script in Python over this statement to produce results similar to User #2’s true statement, except this example is even simpler, since there is no
prepositional phrase to incorporate into the tree generated (Figure 10) with the code pictured in Figure 9.
Figure 9. User #2 False Code
Figure 10. User #2 False Tree
Before we quantify things further, a couple of observations can already be made. First, for both users selected here, the truthful statements were significantly more syntactically rich and layered, which made parsing the complex sentence structure more difficult. It could easily be surmised that the authors felt freer to express themselves in the truthful statements, which resulted in them being more heavily nuanced, whereas the bare minimum sufficed in the case of the false statements so as to keep the topic simple without straying into the realm of conjecture. Secondly, all four sentences parsed employed the copular verb form of “is.” This might be due to a propensity for passive voice when composing one’s thoughts under surveillance. However, this phenomenon does not invalidate the results, considering that the vast majority of the Internet is currently under surveillance.
Now that we have treated of NLTK for our sample subset, it is time to turn our attention to how neural net resources could be utilized to encode a recurrent model with this lexical and numerical input. Since the textual data and timestamps do not intimate much when taken independently, the natural tool of choice to analyze any potential pattern therein would be the Hidden Markov Model (HMM). If the diagrams NLTK renders are trees, then the ones an RNN renders are their roots. If we are to correlate timestamps (observable continuous values) to their POS (hidden discrete states) with realistic frequency, the Viterbi (or max-sum) algorithm makes the most sense to implement, especially when one considers its near omnipresence in fields as wide-ranging as speech recognition and bioinformatics. As Daniel Jurafsky of Stanford University and James Martin of the University of Colorado at Boulder assert in their text, Speech and Language Processing, “When applied to the problem of part-of-speech tagging, the Viterbi algorithm works its way incrementally through its input a word at a time, taking into account information gleaned along the way” [63]. Furthermore, “feed-forward neural networks employ fixed-size input vectors with associated weights to capture all relevant aspects of an example at once. This makes it difficult to deal with sequences of varying length, and they fail to capture important temporal aspects of language” [63]. Thus, by combining words with the time it took to type them in an HMM, we can approximate a genetic code of human thought.
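For reference, the core Viterbi recursion alluded to above can be written directly; the two hidden states and the toy transition and emission probabilities below are arbitrary placeholders rather than values estimated from the Banerjee corpus.

import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for a sequence of observation symbols."""
    n_states, T = len(start_p), len(obs)
    delta = np.zeros((T, n_states))            # best log-probability ending in each state
    back = np.zeros((T, n_states), dtype=int)  # backpointers for the trace-back

    delta[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for s in range(n_states):
            scores = delta[t - 1] + np.log(trans_p[:, s])
            back[t, s] = np.argmax(scores)
            delta[t, s] = scores[back[t, s]] + np.log(emit_p[s, obs[t]])

    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: 2 hidden states, 3 observation symbols (e.g., binned word durations).
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1],
                 [0.1, 0.3, 0.6]])
print(viterbi([0, 2, 1, 2], start, trans, emit))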
Through their pioneering work at the U.S. Army Research Laboratory, John Monaco and Charles Tappert have extended this basic premise to develop a partially observable HMM (or, POHMM) that accommodates keystroke input (user behavior) as well as the task performed (restaurant review, etc.), corresponding to the equation and diagram they encoded in Figures 11 and 12, respectively [64]. These illustrations convey the transition between event and emission probabilities for hidden states in the chain of context-free variables. Here, we apply the library package in Python that they have created to account for POS tags; then, we are able to generate bar charts to illustrate the mean values for each part of speech in users 1 and 2’s true and false reviews. Finally, we plot the hidden states on line charts for each review. The resulting visual graphics form a fairly
stable digital signature that could then be extended across the entire corpus as an analytic tool for the patterns of online communication.
Figure 11. POHMM Relational Equivalency
Figure 12. Markov Model Diagram
Figure 13 contains the code that displays the bar charts of mean values of timestamp durations associated with each POS for the four reviews under study in Figures 14–17.
Figure 13. Bar Chart Code
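Since Figure 13 is likewise reproduced as an image, a hypothetical matplotlib equivalent might look like the following; the (word, tag, duration) triples are taken from the short User #1 false clause discussed above and stand in for a full review.

from collections import defaultdict
import matplotlib.pyplot as plt

# "Timestamples" for one short clause: (word, POS tag, duration in ms).
timestamples = [("their", "PRP", 2008), ("BBQ", "NNP", 547), ("is", "VBZ", 71),
                ("nothing", "NN", 1104), ("to", "TO", 105), ("dismiss", "VB", 1000)]

# Mean typing duration per part of speech.
per_pos = defaultdict(list)
for _, tag, ms in timestamples:
    per_pos[tag].append(ms)
means = {tag: sum(v) / len(v) for tag, v in per_pos.items()}

plt.bar(list(means.keys()), list(means.values()))
plt.xlabel("Part of speech")
plt.ylabel("Mean duration (ms)")
plt.title("Mean typing duration per POS")
plt.show()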
Figure 14. User #1 Truthful Review Part of Speech/ Mean Duration
Figure 15. User #1 Deceptive Review Part of Speech/ Mean Duration
Figure 16. User #2 Truthful Review Part of Speech/ Mean Duration
Figure 17. User #2 Deceptive Review Part of Speech/ Mean Duration
As we can see, the various reviews follow different distributions: sometimes nouns take the longest average duration, sometimes adverbs, and so on. We can now compare these measured values against the values predicted by the POHMM algorithm, displayed by the line charts in Figures 19–22 generated by the code in Figure 18:
Figure 18. Line Chart Code
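The line charts themselves were produced with Monaco and Tappert’s POHMM package; as a rough stand-in that does not reproduce their API, the sketch below fits an ordinary two-state Gaussian HMM from the hmmlearn library to the word durations reported for User #1’s truthful sentence and plots the inferred hidden state for each word.

import numpy as np
import matplotlib.pyplot as plt
from hmmlearn.hmm import GaussianHMM

# Word durations (ms) for User #1's truthful sentence, in typing order.
durations = np.array([3362, 128, 553, 896, 88, 2662, 664, 455, 472,
                      313, 437, 95, 424, 88, 161, 1005, 1088], dtype=float)
X = durations.reshape(-1, 1)

# Two hidden states, loosely interpretable as "fluent" versus "hesitant" typing.
model = GaussianHMM(n_components=2, covariance_type="diag", n_iter=100, random_state=0)
model.fit(X)
states = model.predict(X)

plt.plot(states, marker="o")
plt.xlabel("Word index")
plt.ylabel("Inferred hidden state")
plt.title("Hidden typing states across the review")
plt.show()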
Figure 19. User #1 Truthful Review Hidden States
Figure 20. User #1 Deceptive Review Hidden States
Figure 21. User #2 Truthful Review Hidden States
Figure 22. User #2 Deceptive Review Hidden States
Basically, this model generates figures that mirror the actual values detected by printing an upper and lower bound for every part of speech’s predicted duration. Thus, with a fairly accurate degree of fit, we approximate the reality of each user’s contribution and whether or not they were being truthful by printing a digital signature of their authorship reflected in their unique temporal-syntactic patterns. The difference between each sample’s data analysis and results is striking. This model could be used not only to detect deception but better yet to identify authors online, acting as a sort of polygraph for text-based user input. With every spike and fall in the charts, we can trace the undulations of cogitation flowing from the author’s fingertips through the keyboard to the computer screen.
B. CHINA AND RUSSIA
This experiment assumes that a user has visited a website with a JavaScript-enabled browser that can record keystroke events in the first place. Furthermore, our model heretofore assumes Anglophone sentential construction, an assumption that does not hold given the international nature of the Internet. Thus, to formulate a truly universal model for digital communication, it is necessary to incorporate the polyglot aspect of online
dialogue into the equation. While critical interest has ebbed in recent years, there is no better place to start than with the Universal Networking Language (UNL) to study how online speech could be analyzed for security purposes irrespective of nation of origin. Since China and Russia have been a couple of the most salient information infiltrators in the cyber-sphere within recent memory, our brief synopsis of UNL will focus on them as state actors. However, wherever a threat emanates from, this model could accommodate it, given the proper requirements.
First, let us start with China. Xiaodong Shi and Yidong Chen from the Institute of Artificial Intelligence at Xiamen University invented a deconverter that translates UNL into Chinese via a series of walking tree graphs [65]. First, the researchers connected nodes. Next, from these nodes, they constructed a graph. Finally, they concatenated parametric attributes from sub-trees to build a whole tree. This methodology attempted to remediate syntactical ambiguity inherent to the Chinese language. The interface for de-conversion operates through a simple Apache module with remote procedure calls. Since the main aspect of de-conversion consists in the translation from an XML file to a list format, that is where our parse tree neural net technique could be introduced in future iterations of UNL to concoct a more representative framework for inter-lingual transmissions.
Now that we have looked at some of the most recent research being done in the area of universalizing tree structures to reflect languages as diverse as Chinese, let us examine how UNL could be used to reflect Russian syntax, even in the context of keylogging technology. As Igor Boguslavsky from the Russian Academy of Sciences in Moscow has pointed out, the biggest obstacle to correct interpretation using a universal methodology of tree structures is the major ambiguities between the lexical context of one language versus another, including English and Russian [66]. His approach dovetails with the combinatorial dictionaries of French and Russian to discover collocation properties inherent to theoretical semantics. By setting forth universal relations between universal concepts as operating functions, this dilemma is obviated, such that words retain their intention regardless of language. While we have only discussed two potential solutions in the brief space allotted here, the possibilities are truly endless and deserve further exploration in the course of future research.
C. CONCLUSION
This thesis has covered a lot of intellectual territory to tackle a global problem of information security by inventing a new mathematical model of communication via computers. In the first chapter, we establish a philosophical framework for speech analysis online given the context of
the history of typography. In the second chapter, we propose a novel approach to parse tree generation with NLTK. In the third chapter, we combine syntactical and semantical parsing with keylogging analytics, and in the fourth chapter, we demonstrate the potency of the POHMM to correlate POS tagging with keyboard typing. When the affluent dilettante Philippe Greenleaf (Maurice Ronet) tells the debonair bon vivant Tom Ripley (Alain Delon) in Plein Soleil, the French adaptation of Patricia Highsmith’s novel The Talented Mr. Ripley, “Even if you could imitate my signature, you could not imitate an entire letter,” Tom replies with prophetic aplomb, “Everything is learned…I have your machine. It is very easy to identify the letters” [67]. While the forged identity theft via typing made infamous in this classic example of the septième art is easier than ever to achieve in real life, machine learning techniques like the one we develop here also make it easier than ever to identify an author by the way they imitate letters.
Using the logic of neural nets to parse sentences is not without precedent, but correlating such results to keyboard input data is quite innovative, indeed. As Padraic Smyth of UC Irvine, David Heckerman of Microsoft, and Michael I. Jordan of the Department of Brain and Cognitive Sciences at MIT concluded in their broadly encompassing paper, “Probabilistic Independence Networks for Hidden Markov Probability Models,” in the journal Neural Computation, “Graphical techniques for modeling the dependencies of random variables have been explored in a variety of different areas including statistics, statistical physics, artificial intelligence, speech recognition, image processing,
and genetics” [68]. Likewise, as Juan Antonio Perez-Ortiz and Mikel Forcada of the Department of Languages and Information Systems at the University of Alacant in Spain state in their essay, “Part-of-Speech Tagging with Recurrent Neural Networks,” “Of course, hybrid approaches are possible which combine the power of rule-based and statistical PoS taggers” [69]. However, they use statistical frequencies of a word’s appearance to calculate the likelihood that it falls under one POS designation versus
another in order to automate tagging rather than resorting to a more manual means, which has little to do with typing character frequency. Indeed, Helmut Schmid of the Institute of Computational Linguistics in Germany concluded that his multilayer perceptron network tagger outperformed HMM automation, yet his model totally neglected the typing of the words themselves [70].
However, there is still much room for further optimization of our model. Researchers at Ohio State University “have presented a single layer network architecture to account for people’s ability to understand recursive structures in language” [71]. Therapeutic applications also abound. A research team at the Center for Vision, Speech and Signal Processing at the University of Surrey have used neural machine translation to generate videos of hand sign language associated with recorded speech [72]. From a law enforcement perspective, social media is one of the chief culprits for fraudulent activity on the web, and a group at Fudan University in Shanghai has used adversarial neural networks to perform part-of-speech tagging for tweets [73]. Furthermore, the Computer Science Department at the University of Asia Pacific in Bangladesh has enlisted UNL as a tool in solving semantic ambiguities in the language of Bangla [74]. Ergo, a global problem will require a globally minded solution, especially when it concerns Internet communication and the perils thereof.
As we have seen, making an in-depth analysis of the way in which keystrokes individuate one person from another lends itself to a few philosophical considerations concerning the nature of humanity in light of technology. After all, in the words of John Paul II, “The Internet is certainly a new ‘forum’ understood in the ancient Roman sense of that place where politics and business were transacted… where much of the social life of the city took place, and where the best and the worst of human nature was on display… Furthermore, the Internet radically redefines a person’s psychological relationship to time and space” [75]. Likewise, the Oxford don John Henry Newman declares, “Science, then, has to do with things, literature with thoughts; science is universal, literature is personal; science uses words merely as symbols, but literature uses language in its full compass, as including phraseology, idiom, style, composition, rhythm, eloquence, and whatever other properties are included in it” [76]. Thus, we have done our best here to bridge the personal
language of literature with the universal language of science in order to separate the best and worst of human behavior on the Internet.
This computational model for digital communication, while a small improvement over its predecessors, is not without its limitations. As Douglas Hofstadter wrote in his Pulitzer Prize-winning opus, Gödel, Escher, Bach: An Eternal Golden Braid, “probably no one will ever understand the mysteries of intelligence and consciousness in an intuitive way. Each of us can understand people, and that is probably about as close as you can come” [77]. Likewise, the Rhodes Scholar and Guggenheim Fellow Robert Penn Warren articulated legitimate trepidation about the mistake of treating people as points of data: “[T]he ideal of understanding men and telling their story, noble or vicious, will be replaced by the study of statistics or nonideographic units of an infinite series, and computers will dictate how such units, which do breathe and move, can best be manipulated for their own good” [78]. Dovetailing with that thesis, the media ecology pioneer and onetime president of the Modern Language Association, Walter J. Ong, S.J., once penned, “Human language and thought are embedded in the nonverbal, the total human, historical, existential environment of utterance, with which they interact dialectically to produce meaning. This total environment cannot be entered into the computer. To digitize it would require infinite digitization” [79]. Still, the ability of information science to produce trees organizing the history of human thought and culture lends credence to the words of Alexander Solzhenitsyn’s Nobel Lecture: “So perhaps the ancient trinity of Truth, Goodness and Beauty is not simply an empty, faded formula…? If the tops of these three trees converge, as the scholars maintained, but the too blatant, too direct stems of Truth and Goodness are crushed… Beauty will push through… and in so doing will fulfill the work of all three?” [80].
Indeed, the first woman to receive a Ph.D. in computer science, Sister Mary Kenneth Keller, was a bit more optimistic about the potential of artificial intelligence to help rather than harm humanity: “For the first time, we can now mechanically simulate the cognitive process. We can make studies in artificial intelligence. Beyond that, this mechanism [the computer] can be used to assist humans learning” [81]. Nowhere is this more abundantly clear than in the use of computer technology to assist those rendered mute by disability of one kind or another. Through ever more advanced typing techniques, we can parse meaning and intent from mere signs and symbols to help even the most debilitated learn how to communicate verbally once again. This technology offers hope to those like the boy with adrenoleukodystrophy who, at the end of the Academy Award-nominated true story Lorenzo’s Oil, whispers internally beneath the trompe l’oeil vision of Heaven, “I’ll be able to tell my brain to tell my toes, my fingers, my anything to do what I want them to do. And then, one day, I’ll hear my voice, and all these words I’m thinking will get outside my head” [82]. Even if the computer can be used to communicate truth and falsehood alike, the answers to the most pressing questions sometimes lie hidden in plain sight.
THIS PAGE INTENTIONALLY LEFT BLANK
LIST OF REFERENCES
[1] A. Haley et al., Typography Referenced. Beverly, MA, USA: Rockport Publishers, 2012.
[2] M. McLuhan, The Gutenberg Galaxy: The Making of Typographic Man. Toronto, Canada: University of Toronto Press, 1962.
[3] Y. Van Den Eede, Amor Technologiae. Brussels, Belgium: VUBPress, 2013.
[4] T. Winter, “Roberto Busa, S.J., and the Invention of the Machine-Generated Concordance,” The Classical Bulletin, vol. 75.1, pp. 3–20, 1999.
[5] Federal Bureau of Investigation, “Unabomber,” Accessed Nov. 16, 2019. [Online]. Available: https://www.fbi.gov/history/famous-cases/unabomber.
[6] T. Kaczynski, Technological Slavery. Port Townsend, WA, USA: Feral House, 2010.
[7] C. Friedersdorf, “20 ideas from the mind of David Gelernter,” The Atlantic, 2017. [Online]. Available: https://www.theatlantic.com/politics/archive/2017/02/theres-enough-time-to-change-everything/517209/.
[8] NCEA, “STREAM (Science, Technology, Religion, Arts and Mathematics) Resources,” 2020. [Online]. Available: https://www.ncea.org/NCEA/Learn/Resource/Academic_Excellence/STREAM_Resources.aspx.
[9] A. Vance, “The People’s Republic of the Future.” Canada: Bloomberg, 2019. [Online]. Available: https://www.youtube.com/watch?v=taZJblMAuko.
[10] L. Kay, “From Logical Neurons to Poetic Embodiments of Mind: Warren S. McCulloch’s Project in Neuroscience,” Science in Context, vol. 14, issue 4, pp. 591–614. Cambridge, England: Cambridge University Press, 2001.
[11] C. Fisher, “The FBI Plans More Social Media Surveillance,” Engadget, 2019. [Online]. Available: https://www.engadget.com/2019/07/12/fbi-social-media-monitoring-tool-rfp/.
[12] A. O’Sullivan, “The Government Wants a ‘Red Flag’ Social Media Tool. That’s a Terrible Idea,” Reason.com, 2019. [Online]. Available: https://reason.com/2019/08/20/the-government-wants-a-red-flag-social-media-tool-thats-a-terrible-idea/.
[13] W. McCulloch and W. Pitts, “A Logical Calculus of the Ideas Immanent in Nervous Activity,” Bulletin of Mathematical Biology, vol. 52, no. 1–2, pp. 99–115, 1990.
[14] Y. Goldberg, “A Primer on Neural Network Models for Natural Language Processing,” Journal of Artificial Intelligence Research, vol. 57, pp. 345–420, 2016.
[15] G. Marcus, The Algebraic Mind, Cambridge, Massachusetts, USA: MIT Press, 2001.
[16] W. Li, “Characterizing Ranked Chinese Syllable-to-Character Mapping Spectrum: A Bridge Between the Spoken and Written Chinese Language,” Journal of Quantitative Linguistics, 2013.
[17] E. Priani, “Ramon Llull,” The Stanford Encyclopedia of Philosophy (Spring 2017 Edition). Ed. Edward N. Zalta. [Online]. Available: https://plato.stanford.edu/archives/spr2017/entries/llull/.
[18] G. Priest, K. Tanaka, and Z. Weber, “Paraconsistent Logic,” The Stanford Encyclopedia of Philosophy (Summer 2018 Edition). Ed. Edward N. Zalta. [Online]. Available: https://plato.stanford.edu/archives/sum2018/entries/logic-paraconsistent/.
[19] D. Smith and J. Protevi, “Gilles Deleuze,” The Stanford Encyclopedia of Philosophy (Spring 2018 Edition). Ed. Edward N. Zalta. [Online]. Available: https://plato.stanford.edu/archives/spr2018/entries/deleuze/.
[20] F. Jabr, “How Brainless Slime Molds Redefine Intelligence,” Scientific American (November 2012). [Online]. Available: https://www.scientificamerican.com/article/brainless-slime-molds/.
[21] R. Banerjee, S. Feng, J. S. Kang, and Y. Choi, “Keystroke patterns as prosody in digital writings: A case study with deceptive reviews and essays,” Proc. of the 2014 Conf. on Emp. Meth. in Nat. Lang. Process., pp. 1469–1473, Doha, Qatar: Association for Computational Linguistics, 2014.
[22] Y. Xu, “Random Walk on WordNet to Measure Lexical Semantic Relatedness,” Duluth, Minnesota, USA: University of Minnesota, 2011. [Online]. Available: https://www.d.umn.edu/~tpederse/Pubs/yanbo-report.pdf.
[23] D. Ramage, A. Rafferty, and C. Manning, “Random Walks for Text Semantic Similarity,” Stanford, California, USA: Stanford University. Accessed February, 26, 2020. [Online]. Available: https://nlp.stanford.edu/pubs/wordwalk-textgraphs09.pdf.
[24] E. Kapetanios, D. Tatar and C. Sacarea, Natural Language Processing: Semantic Aspects. Boca Raton, Florida, USA: CRC Press, 2014.
[25] G. Dyson, Turing’s Cathedral. New York City, New York, USA: Knopf Doubleday Publishing Group, 2012.
[26] D. Rumelhart and D. Norman, “Simulating a Skilled Typist: A Study of Skilled Cognitive-Motor Performance,” Cognitive Science, vol. 6, no. 1, pp. 1–36, 1982. Available: 10.1207/s15516709cog0601_1.
[27] K. Dela Rosa and J. Allen, “Text Classification Methodologies Applied to Micro-text in Military Chat,” International Conference on Machine Learning and Applications, 2009.
[28] J. Campbell, E. Stanziola, and J. Feng, “Instant Messaging: Between the Messages,” IEEE, pp. 2193–2198, 2003.
[29] D. R. O’Day and R. A. Calix, “Text message corpus: Applying natural language processing to mobile device forensics,” San Jose, California, USA: IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pp. 1–6, 2013.
[30] A. Pratap, J. Prasad, K. Kumar, and S. Babu, “An investigation on optimizing traffic flow based on Twitter Data Analysis,” 2nd International Conference on Inventive Communication and Computational Technologies, 2018.
[31] X. Huang, L. Xing, J. Brubaker, and M. Paul, “Exploring Timelines of Confirmed Suicide Incidents through Social Media,” IEEE International Conference on Healthcare Informatics, pp. 470–477, 2017.
[32] J. Polpinij, C. Sibunruang, S. Paungpronpitag, R. Chamchong, and A. Chotthanom, “A Web Pornography Patrol System by Content-based Analysis: In Particular Text and Image,” IEEE Conference on Systems, Man and Cybernetics, pp. 500–505, 2008.
[33] S. L. and S. M. Idicula, “Fingerprinting based Detection System for Identifying Plagiarism in Malayalam Text Documents,” IEEE Intl. Conference on Computing and Network Communications, pp. 553–558, 2015.
[34] J. Li and J. Xing, “Text Analysis Technology in Crew Collaboration Scheduling System for Space Missions,” IEEE, pp. 43–47, 2014.
[35] C. Savina Bianchini, F. Borgia, and M. De Marsico, “SWift—A SignWriting editor to bridge between deaf world and e-learning,” 12th IEEE International Conference on Advanced Learning Technologies, pp. 527–530, 2012.
[36] R. Shetty, B. Schiele, and M. Fritz, “A 4NT: Author Attribute Anonymity by Adversarial Training of Neural Machine Translation,” 27th USENIX Security Symposium, pp. 1633–1650, 2018.
[37] C. Norton, S. Zahorian, and Z. Nossair, “The Application of Binary-Pair Partitioned Neural Networks to the Speaker Verification Task,” Accessed February 26, 2020, Norfolk, Virginia, USA: Old Dominion University. [Online]. Available: http://www.ws.binghamton.edu/zahorian/pdf/THE%20APPLICATION%20OF%20BINARY-PAIR%20PARTITIONED%20NEURAL%20NETWORKS%20TO%20THE%20SPEAKER%20VERIFICATION%20TASK.pdf.
[38] R. Geng, P. Jian, “Implicit Discourse Relation Identification based on Tree Structure Neural Network,” IEEE International Conference on Asian Language Processing (IALP), pp. 334–337, 2017.
[39] H. Locklear, S. Govindarajan, Z. Sitova, A. Goodkind, D. G. Brizan, A. Rosenberg, … and K. S. Balagani, “Continuous authentication with cognition-centric text production and revision features,” IEEE/IAPR International Joint Conference on Biometrics (IJCB), 2014. [Online]. Available: https://doi.org/10.1109/BTAS.2014.6996227.
[40] J. Monaco, J. Stewart, S. Cha, and C. Tappert, “Behavioral Biometric Verification of Student Identity in Online Course Assessment and Authentication of Authors in Literary Works,” IEEE 6th International Conference on Biometrics, BTAS, 2013.
[41] J. Kilmer, “Trees,” Poetry Foundation. Accessed Dec. 19, 2019. [Online]. Available: https://www.poetryfoundation.org/poetrymagazine/poems/12744/trees
[42] W. Herzog, Lo and Behold, Reveries of the Connected World. USA: Magnolia Pictures, 2016. [Online]. Available: https://www.youtube.com/watch?v=uhdV7SKblhk
[43] M. Talbot and L. McTaggart, The Holographic Universe: The Revolutionary Theory of Reality. New York City, New York, USA: Harper Perennial, 2011.
[44] A. Garland, Ex Machina. Hollywood: Universal Pictures, 2014.
[45] S. Afroz, M. Brennan, and R. Greenstadt, “Detecting Hoaxes, Frauds, and Deception in Writing Style Online,” IEEE Symposium on Security and Privacy, 2012.
[46] S. Feng, R. Banerjee, and Y. Choi, “Syntactic Stylometry for Deception Detection,” Proceedings for the Meeting of the Association for Computational Linguistics, pp. 171–175, 2012.
[47] V. Rubin, Y. Chen, and N. Conroy, “Deception Detection for News: Three Types of Fakes,” ASIST, 2015.
[48] N. Golgowski, “Army Says Text Messages About Iran War Draft Are Fake,” Huffington Post, 2020. Available: https://www.huffpost.com/entry/army-draft-text-messages-are-fake_n_5e1721edc5b61f70194a53b0.
[49] R. Zheng, J. Li, H. Chen, and Z. Huang, “A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques,” Journal of the American Society for Information Science and Technology, 57(3), pp. 378–393, 2006.
[50] A. Abbasi and H. Chen, “Applying Authorship Analysis to Extremist-Group Web Forum Messages,” IEEE Computer Security, pp. 67–75, 2005.
[51] E. Stamatatos, “Author Identification: Using Text Sampling to Handle the Class Imbalance Problem,” Information Processing and Management, 44(2), pp. 790–799, 2008.
[52] F. Sebastiani, “Machine Learning in Automated Text Categorization,” ACM Computing Surveys, vol. 34, no. 1, pp. 1–47, 2002
[53] V. Jara Vera and C. Sanchez Avila, “Linguistic Decipherment of the Lettering on the (Original) Carving of the Virgin of Candelaria from Tenerife (Canary Islands),” Religions 8(135), pp. 1–26, 2017.
[54] C. Zhao, W. Song, L. Liu, C. Du, and X. Zhao, “Research on Author Identification Based on Deep Syntactic Features,” International Symposium on Computational Intelligence and Design, pp. 277–279, 2017.
[55] C. Chaski, “Empirical evaluations of language-based author identification techniques,” Forensic Linguistics 8(1), 2001.
[56] C. Chen and V. Honavar, “A Neural-Network Architecture for Syntax Analysis,” IEEE Transactions on Neural Networks, 10(1), pp. 94–114, 1999.
[57] I. Stroganov and D. Pogorelov, “Ternary as Fuzzy Logic in Creation and Comparison of Syntax Trees when Determining Functional Styles,” IEEE, pp. 1773–1776, 2019.
[58] C. Chuanxi and Q. Mian, “Visualization of Syntax Tree based on VML,” International Conference on Intelligence Science and Information Engineering, pp. 538–541, 2011.
[59] H. Li and J. Yu, “Knowledge Representation and Discovery for the Interaction between Syntax and Semantics: a Case Study of Must,” IEEE, pp. 153–157, 2014.
[60] G. Weir, E. Dos Santos, B. Cartwright, and R. Frank, “Positing The Problem: Enhancing Classification of Extremist Web Content Through Textual Analysis,” IEEE, 2016.
[61] M. Aziz, A. Prihatmanto, D. Henriyan, and R. Wijaya, “Design and Implementation of Natural Language Processing with Syntax and Semantic Analysis for Extract Traffic Conditions from Social Media Data,” International Conference on System Engineering and Technology, pp. 43–48, 2015.
[62] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, O’Reilly Media, 2009. [Online]. Available: http://www.nltk.org/book_1ed/.
[63] D. Jurafsky and J. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, New Jersey, USA: Prentice Hall, 2019.
[64] J. Monaco and C. Tappert, “The partially observable hidden Markov model and its application to keystroke dynamics,” Pattern Recognition, vol. 76, pp. 449–462, 2018.
[65] X. Shi and Y. Chen, “A UNL Deconverter for Chinese,” Universal Networking Language: Advances in Theory and Applications, vol. 12, pp. 167–174, 2005.
[66] I. Boguslavsky, “Some Lexical Issues of UNL,” Universal Networking Language: Advances in Theory and Applications, vol. 12, pp. 101–108, 2005.
[67] R. Clement, Plein Soleil. Cannes: StudioCanal, 1960.
[68] P. Smyth, D. Heckerman, and M. Jordan, “Probabilistic Independence Networks for Hidden Markov Probability Models,” Neural Computation, vol. 9, no. 2, pp. 227–269, 1997.
[69] J. Perez-Ortiz and M. Forcada, “Part-of-speech tagging with recurrent neural networks,” International Joint Conference on Neural Networks, vol. 3, pp. 1588–1592, Washington, DC, USA: 2001.
[70] H. Schmid, “Part-of-speech tagging with neural networks,” Conf. on Comp. Ling., vol. 1, pp. 172–176, 1994.
[71] S. Dennis, L. Ding, and D. Mehay, “A Single Layer Network Model of Sentential Recursive Patterns,” Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 31, 2009.
[72] S. Stoll et al., “Sign Language Production using Neural Machine Translation and Generative Adversarial Networks,” British Machine Vision Conference, 2018.
[73] T. Gui, H. Huang, M. Peng, and X. Huang, “Part-of-Speech Tagging for Twitter with Adversarial Neural Networks,” Conf. on Emp. Meth. In Nat. Lang. Proc., pp. 2411–2420, Copenhagen, Denmark: 2017.
[74] M. Mridha, A. Saha, and J. Das, “New Approach of Solving Semantic Ambiguity Problem of Bangla Root Words Using Universal Network Language (UNL),” International Conference on Informatics, Electronics, & Vision, 2014.
[75] John Paul II, “Internet: A New Forum for Proclaiming the Gospel,” 36th World Communications Day. Rome, Italy: 2002. [Online]. Available: http://www.vatican.va/content/john-paul-ii/en/messages/communications/documents/hf_jp-ii_mes_20020122_world-communications-day.html.
[76] John Henry Newman, The Idea of a University, London, England: Longmans, Green, and Co., 1907.
[77] D. Hofstadter, Gödel, Escher, Bach: An Eternal Golden Braid, New York, USA: Vintage Books, 1989.
[78] R. Warren, Democracy and Poetry, Cambridge, Mass, USA: Harvard University Press, 1975.
[79] W. Ong, T. Zlatic, and S. van den Berg, Language as Hermeneutic: A Primer on the Word and Digitization, Ithaca, New York, USA: Cornell University Press, 2017.
[80] A. Solzhenitsyn, Nobel Lecture in Literature. 1970. [Online]. Available: https://www.nobelprize.org/prizes/literature/1970/solzhenitsyn/lecture/.
[81] A. Crezo, “The First Woman PhD in Computer Science Was a Nun,” Mental Floss. October 14, 2013. [Online]. Available: https://www.mentalfloss.com/article/53178/first-woman-earn-phd-computer-science-was-nun.
[82] G. Miller, Lorenzo’s Oil. Pittsburgh: Universal Pictures, 1992.
THIS PAGE INTENTIONALLY LEFT BLANK
INITIAL DISTRIBUTION LIST
1. Defense Technical Information Center
Ft. Belvoir, Virginia
2. Dudley Knox Library
Naval Postgraduate School
Monterey, California