Author Prop: Assisting the Creative Process with an Automated Intelligent Cognitive Prop for Writers

INTRODUCTION: Writing is a complex, metacognitive task which requires planning, execution, and evaluation of an evolving text simultaneously. OBJECTIVES: This paper presents Author Prop, a web application for writing that aims to address the need for “cognitive props” during composition. The software particularly targets students, whose relative inexperience may lead to high levels of cognitive load and low engagement during writing. METHODS: To provide a mental prop to writers, Author Prop provides topic-based decomposition of texts using intelligent highlighting. Furthermore, the software allows users to interact with topics (described as keyword lists) during periods of reflection, with the aim to ease and maintain creative thinking. RESULTS: Of the 78 surveyed users, 42% said they would use the app for writing, and 70% said they enjoyed using the app. CONCLUSION: Initial survey results from beta testing of the web application indicate encouraging reception of the software.


Student Engagement During Writing
Low engagement can be a problem when students participate in extended solo activities like writing. Students can have trouble remaining focused and engaged during the writing process. As well as being a solitary activity, writing is a complex task requiring extensive planning, awareness, and metacognition on the part of the writer [1,2]. Students who write essays or articles can feel disconnected from their written material. Bereiter and Scardamalia's Compare, Diagnose, and Operate model of writing postulates that writers edit material by comparing their own mental model of what they planned to write with what they have actually written [1]. As the complexity of the text grows, writers may have trouble understanding whether or not what the text is communicating matches their intentions. They may also have trouble with editing or ordering sections of text to maintain coherence and relevance. This phenomenon can be summarized by saying that students experience high cognitive load when writing (cognitive load describes a situation where learners' cognitive processing capabilities are overloaded during complex problem solving to the degree that they have no resources left for actually learning from the problem) [3]. Writing, which involves high levels of self-regulation, benefits from effective metacognitive strategies [2]. The proposed web application ("web app"), Author Prop, aims to provide writers with a cognitive support that will lessen cognitive load through increased engagement and alternate ways of reflecting on or reconsidering the evolving state of document content during drafting.

EAI Endorsed Transactions on Creative Technologies
Research Article
04 2019 - 02 2020 | Volume 6 | Issue 20 | e1
C. Roberts

Natural Language Processing Background
Natural language processing (NLP) is an area of artificial intelligence that focuses on taking structured or unstructured text or other language data, parsing it in a way that computers can understand, and conducting useful analyses on the data [4]. One sub-field of NLP is topic modelling, which typically works with written text data to extract the topics associated with a corpus of documents.
Topic models are probabilistic graphical models that postulate the occurrence of words in documents as a stochastic process governed by statistical distributions over unknown, "latent" topic variables. The values of these unknown topics determine which words are found in which document [5,6]. Each topic is typically characterized by a list of keywords. Fig. 1 shows a graphic illustrating topic modelling using Latent Dirichlet Allocation (LDA), one of the most widely used topic modelling algorithms.
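For reference, the generative process usually assumed by LDA can be written as below. This is the standard textbook formulation rather than anything specific to Author Prop; the symbols (K topics, D documents, and Dirichlet hyper-parameters \alpha and \beta) are introduced here purely for illustration.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{align*}
```

Here \phi_k is the word distribution of topic k, \theta_d is the topic mixture of document d, z_{d,n} is the latent topic of the n-th word in document d, and w_{d,n} is the observed word. The keyword list for a topic corresponds to the highest-probability words under \phi_k.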

Related Work
While NLP for student writing improvement is a relatively immature field, examples of related work exist. The largest area where NLP has found application is in software aimed at helping students improve essay structure and writing style. These products may be focused on a specific type of writing such as college academic or legal writing [7]; they are often designed to automatically detect and teach a predefined set of characteristics associated with "good" essays. For instance, AcaWriter [8] automatically identifies patterns called "rhetorical moves" that indicate use of specific writing techniques such as use of emphasis and contrast in higher degree research writing; the tool offers feedback to students on essay rhetorical structure, such as the need for additional background information before presenting a research problem.
Fewer products use NLP with the aim to specifically improve student engagement and reflection (as opposed to improving essay structure); products that serve this niche typically aim to improve engagement and reflection in tandem with improving writing style. Few products take specific aim at the role of reflection in the writing process itself as a support for the cognitive load students may face while structuring their ideas during drafting.
Burstein et al [9] present an NLP-driven writing system that coaches students using feedback on essay qualities such as convincingness, clarity, and coherence. Their system flags student text with feedback and tips on techniques that can help improve essay quality. As part of the feedback, students are invited to review the coherence of topics automatically found in the essay. Topics, characterized by keywords identified either by the student directly or automatically by the system, are described using a cluster of semantically similar words from the student's text. One of the authors' explicit goals of their system is to "promote deep engagement and revision aimed at boosting the quality of students' writing across genres and academic disciplines." Villalon and Calvo's intent in developing Concept Map Miner [10] is much closer to Author Prop's aims than other tools. Concept Map Miner automatically generates concept maps, semantically connected graphical structures similar to knowledge graphs, from student papers with the intent to engage students on a meta-cognitive level, triggering student reflection and understanding of draft papers. However concept maps as presented in Concept Map Miner end up being more low-level representations of paper concepts than typical results from topic modelling, due to the reliance on parts of speech grammar to extract and connect concepts at the sentence level. In contrast, topic modelling frames evidence for concepts as groups of words or word phrases, allowing for more abstract interpretation on the part of student writers.
Whitelock et al's OpenEssayist system [11] aims to provide feedback to students on completed essay drafts to improve student reflection for understanding, motivation, and writing style. The feedback covers essay structure (introduction, discussion, and conclusion) and is organized through automatic identification of essay keywords, interactive organization of sentences and key words, and overall visualization of word clouds and word dispersion through the parts of the essay. As opposed to using topic modelling, Whitelock et al use unsupervised graph-based ranking algorithms based on the TextRank approach from Mihalcea and Tarau [12] to extract keyword-oriented essay concepts. The OpenEssayist tool has been used in the UK's Open University system, and the authors have linked use of the tool to improved overall course grades [11]. Interestingly, the authors find that structure-based feedback of the kind provided by OpenEssayist can be as effective as content-based feedback in aiding student achievement [13].
While Author Prop is similar in intent to OpenEssayist and Concept Map Miner, it differs from these and other tools by:
• Focusing specifically on partial or in-progress student drafts as opposed to completed drafts
• Principally aiming to reduce student cognitive load and improve engagement during drafting
• Being a general-purpose tool for any kind of writing, as opposed to a specific type of writing such as college academic writing

Solution
The Author Prop web app presents users with two tabs, Edit and Review. The purpose of the Edit tab is for users to compare their entered text with a copy of their text highlighted according to the found topics. Text is highlighted with a different colour for each topic found in the text.
On the Edit tab, users are presented with a text area on the left, and an initially blank pane on the right. Placeholder text prompts users to either paste in existing text, or start writing in the text area. Once they have entered enough text to analyse, users can click the 'Analyse' button to trigger topic modelling. When the model is finished running, highlighted text populates the pane on the right-hand side.
The Edit tab was designed to display topic modelling results to users in a way that would not distract from ongoing editing. As such, alternate visual representations of topics were deferred to the Review tab. On the Edit tab, users can choose to ignore the highlighted results as they focus on writing; alternatively, they can quickly refer to model results by scanning through the colours in the mirror-image text on the right-hand side. Further text entered after a topic model is run will be highlighted according to the rules of the latest model. Users can click 'Analyse' at any point to update model results and highlighting.
The purpose of the Review tab is for users to reflect on their in-progress draft, and compare the topic model's characterization of topics found in their paper with their own. Again, the tab is made up of two panes: on the left-hand side, users are presented with their highlighted text, matching the right-hand side of the Edit tab. On the right-hand pane, users are presented with a visualization of the individual topics found by the topic modelling algorithm. The visualization depicts topics as coloured bubbles associated with keyword lists. The topics are coloured to match the highlighted text shown at left and on the Edit pane. As a 'Hint' tooltip describes, users can 'turn off' topics, which greys out the topic bubble and adjusts the highlighted text for the loss of that topic. Users can also group/ungroup topics by dragging topics to select them and clicking the '(Un)Combine' button. This combines two topics into a single bubble/colour, and aggregates their keyword lists. Topics are moveable: users can drag them around the right-hand pane area to associate them spatially.
After arriving at a highlighting they are satisfied with, users can then return to the Edit tab to resume working with the updated highlighting.
Thus, users proceed in a cyclical fashion of editing and reflection, utilizing the Author Prop software as a cognitive prop.
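The Review tab interactions described above (highlighting by topic, turning a topic off, and combining two topics) can be illustrated with a minimal sketch. How Author Prop actually maps words to topics is not specified here, so the keyword-lookup rule, the function names, and the sample topics below are all assumptions made for illustration.

```python
def assign_topics(words, topics, disabled=frozenset()):
    """Map each word to the first enabled topic whose keyword set lists it, else None.

    Assumption: a word is highlighted when it appears in a topic's keyword set.
    """
    labels = []
    for word in words:
        label = None
        for name, keywords in topics.items():
            if name not in disabled and word.lower() in keywords:
                label = name
                break
        labels.append(label)
    return labels


def combine_topics(topics, first, second):
    """Return a new topic dict in which two topics share one merged keyword set."""
    merged = {k: v for k, v in topics.items() if k not in (first, second)}
    merged[first + "+" + second] = topics[first] | topics[second]
    return merged


# Hypothetical topics and text, not taken from the app.
sample_topics = {"writing": {"essay", "draft"}, "nlp": {"topic", "model"}}
sample_words = ["Essay", "about", "topic"]
```

Calling assign_topics(sample_words, sample_topics, disabled={"nlp"}) leaves the word "topic" unhighlighted, which mirrors greying out a topic bubble; combine_topics gives both keyword sets a single label, which mirrors the '(Un)Combine' button.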

Text Pre-Processing
Author Prop follows a basic initial NLP pipeline of tokenizing, removing stopwords, and lemmatizing text [14,15]. Tokenization refers to extracting words from documents by looking for whitespace or punctuation. Stopwords are very commonly occurring words like "a" and "the"; because such words occur very frequently, they can dominate found topics without adding value to results. Lemmatization refers to normalizing words across different inflectional forms such as tense or plurality. After this initial pre-processing, word combinations called n-grams are also created by looking at commonly occurring pairs and triples of words [14,15].
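The pipeline steps above can be sketched in plain Python. This is a simplified illustration, not the app's implementation: the stopword list is a tiny placeholder, the crude_lemma function is only a rough stand-in for real lemmatization, and all function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real pipeline would use a
# much fuller list.
STOPWORDS = {"a", "an", "the", "and", "of", "to", "in", "is", "are", "on"}


def tokenize(text):
    """Lowercase the text and extract runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def crude_lemma(token):
    """Rough stand-in for lemmatization: strips a trailing plural 's' only."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    """Tokenize, drop stopwords, then normalize each remaining token."""
    return [crude_lemma(t) for t in tokenize(text) if t not in STOPWORDS]


def frequent_bigrams(tokens, min_count=2):
    """Return adjacent token pairs that occur at least min_count times."""
    counts = Counter(zip(tokens, tokens[1:]))
    return [pair for pair, n in counts.items() if n >= min_count]
```

For example, preprocess("The topic models are tools") yields the tokens topic, model, and tool. Triples of words can be found the same way by zipping three offset copies of the token list.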
The words describing a document are further filtered using the 'term frequency-inverse document frequency' (TF-IDF) metric [14]. TF-IDF provides a ranking of word importance in a specific document relative to the corpus as a whole, and is calculated as the ratio of the frequency with which that word occurs in the specific document relative to the frequency with which it occurs in the corpus.
Because the proposed tool is intended to work with single documents as opposed to a corpus of documents, Author Prop treats paragraphs as "sub-documents" comprising a 'corpus' in the larger text. Sub-documents are also subjected to minimum and maximum word length requirements, and are grouped consecutively or broken apart if necessary, to increase the stability of topic modelling results.
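The idea of treating paragraphs as sub-documents and filtering words by TF-IDF can be sketched as follows. The weighting shown (term frequency times the log of inverse document frequency) is one common formulation and is an assumption here, as are the merge rule, the function names, and the parameter names; the exact settings used by Author Prop are not given in this section.

```python
import math
from collections import Counter


def split_subdocuments(text):
    """Treat each blank-line-separated paragraph as one sub-document (list of words)."""
    return [p.split() for p in text.split("\n\n") if p.strip()]


def merge_short(subdocs, min_words):
    """Group consecutive sub-documents until each group has at least min_words words."""
    merged, buffer = [], []
    for doc in subdocs:
        buffer = buffer + doc
        if len(buffer) >= min_words:
            merged.append(buffer)
            buffer = []
    if buffer:  # leftover words join the last group, if there is one
        if merged:
            merged[-1] = merged[-1] + buffer
        else:
            merged.append(buffer)
    return merged


def tfidf(subdocs):
    """Return one {word: score} dict per sub-document using tf * log(N / df)."""
    n_docs = len(subdocs)
    doc_freq = Counter(word for doc in subdocs for word in set(doc))
    scores = []
    for doc in subdocs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def keep_top_fraction(score_dict, fraction):
    """Keep the highest-scoring fraction of words in one sub-document."""
    ranked = sorted(score_dict, key=score_dict.get, reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]
```

Under this weighting, a word that appears in every sub-document scores zero, so it is the first to be dropped by the filter.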

Topic Modelling Algorithms
The topic modelling algorithm used by Author Prop is Latent Dirichlet Allocation (LDA) [16]. Multiple models were tested, including LDA, hierarchical LDA (hLDA) [17], and Independent Components Analysis (ICA). The ICA approach is similar to Grant, Skillicorn, and Cordy [18]. Hyper-parameter testing suggested that hLDA actually performs better than LDA for documents of longer length, for a certain range of minimum/maximum sub-document length; however, runtimes for the particular implementation of hLDA tested were significantly longer than LDA and deemed unacceptable for a web application.
Regardless of the algorithm used, the topic models tested take as input a document-word matrix of counts of words from the text pre-processing pipeline described in Section 4.1.1; one sub-document represents one row in the matrix. The output of the topic models, as mentioned in Section 1.2, is a set of keyword lists characterizing the topics found by the algorithm.
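A document-word matrix of the kind described above can be built with a few lines of Python. This is a minimal sketch with hypothetical names; it assumes each sub-document has already been reduced to a list of pre-processed words.

```python
def document_word_matrix(subdocs):
    """Build a count matrix with one row per sub-document and one column per word."""
    vocabulary = sorted({word for doc in subdocs for word in doc})
    column = {word: j for j, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in subdocs]
    for i, doc in enumerate(subdocs):
        for word in doc:
            matrix[i][column[word]] += 1
    return vocabulary, matrix


# Made-up sub-documents for illustration only.
vocab, counts = document_word_matrix([["topic", "model", "topic"], ["model", "essay"]])
```

Each row sums to the length of its sub-document, and the vocabulary list maps column indices back to words so that topic keyword lists can be read off from the fitted model.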

Hyper-Parameter Testing
Hyper-parameter testing was conducted to determine optimal combinations of text pre-processing values for minimum and maximum sub-document length, TF-IDF filter percentage, and parts of speech to lemmatize, as well as the topic modelling algorithm itself (as mentioned above). A grid search over the parameter space was conducted using a range of sample documents including at least five each from the Automated Student Assessment Prize dataset [19], a dataset of academic papers from the Neural Information Processing Systems conference [20], a dataset of New York Times articles [21], and Hans Christian Andersen fairy tales [22]. Hyper-parameters were selected based on average performance under the following metrics: distance from corpus, prominence within documents, and TC-PMI coherence [14]. The interested reader is referred to [14] for a discussion of these and other topic modelling metrics. Hyper-parameter testing results are shown in Appendix A: Web App NLP Hyper-parameter Testing Results.
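A grid search of this kind can be sketched generically. The parameter names, candidate values, and scoring function below are placeholders invented for illustration; they are not the settings or metrics actually used, which are reported in Appendix A.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return the best one and its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:  # ties keep the earlier combination
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid: names and values are placeholders, not the paper's settings.
example_grid = {
    "min_subdoc_words": [20, 50],
    "max_subdoc_words": [100, 200],
    "tfidf_keep_fraction": [0.5, 0.8],
}


def toy_score(params):
    """Placeholder score; a real run would average topic-quality metrics over sample documents."""
    return params["tfidf_keep_fraction"] - params["min_subdoc_words"] / 1000
```

In a real run, score_fn would fit a topic model with the given settings on each sample document and average the chosen metrics, so the cost grows with the product of the candidate list lengths.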

Web App Interface
The web app utilizes the Plotly Dash † framework to integrate the Python-based topic modelling pipeline with HTML and React JavaScript web frameworks. Dash utilizes the Flask web framework.
The Python-based topic modelling pipeline utilizes the Natural Language Toolkit (NLTK) ‡ library for text preprocessing, as well as the lda Python package § implementation of the LDA algorithm.

Results
The Author Prop web app is hosted at: https://authorprop.herokuapp.com/
Author Prop was beta tested via an eight-question survey, which involved testing the web app. Survey respondents were solicited through posting the survey on a class website (where participants get class participation points for completing the survey), through email to community writing organizations, and by posting the survey on the neighbourhood-based social media site Nextdoor ** (hypothetically visible by over 23,000 potential users residing in zip codes 94121, 94129, and 94118). 78 responses were received for the survey; questions and aggregated responses are detailed in Appendix B: Survey Results. SurveyMonkey †† was used to generate descriptive statistics and graphs of the results. In addition, predictive models were fit to the responses for Question 6, "Was the app easy to use," Question 7, "Did you enjoy using the app," and Question 8, "Would you use the app for writing," in order to generate inferential statistics about what types of writers are more receptive to the app.

† https://plot.ly/dash/
‡ https://www.nltk.org
§ https://lda.readthedocs.io/en/latest/

Descriptive Statistics
Survey results show that the majority of users (60%) describe themselves as having above average (score of 4 or above on a scale of 1-5) writing experience, but the same population as a whole self-reports only slightly above average writing skill (38% with an average score of 3, and 45% with scores of 4 or above). Those with higher levels of writing experience are more likely to say that they enjoy writing, and to describe themselves as highly skilled. Notable correlations between variables collected include the correlation between writing skill and writing experience at 0.58, the correlation between writing experience and writing enjoyment at 0.50, and the correlation between writing skill and writing enjoyment at 0.38.
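For readers who want to reproduce correlations like these from raw survey scores, a Pearson correlation can be computed as sketched below. The text does not state which correlation coefficient was used, so Pearson is an assumption, and the function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

The function would be called with two columns of 1-5 ratings, for example self-rated skill and self-rated experience, one entry per respondent.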
In terms of reception, the majority of users (79%) enjoyed using the app (score of 3 or above on a scale of 1-5). Similarly, 71% thought the app was easy to use, and 39% responded positively when asked if they would use the app for writing. In terms of high correlations, the correlation between enjoying the app and finding the app easy to use is 0.67, the correlation between responding that one would use the app and one's rating of app enjoyment is 0.54, and the correlation between responding one would use the app and finding the app easy to use is 0.29. Finding the app easy to use is evidently not the same thing as finding it useful for writing: one can give a lower ease-of-use score and still find the app enjoyable and useful for writing.
One interesting subset of respondents is those users that prefer poetry to prose. While these respondents comprise a small sample (8 in total), 75% of them respond that they would use the app for writing (well over the overall average), even though they are fairly evenly split across the spectrum as to whether or not they enjoyed using the app (for the general population, the latter is strongly correlated with intention to use the app for writing or not). The explanatory models described in Section 5.2 confirm that preferring poetry to prose is in general a strong predictor of positive app reception, in particular intending to use the app for writing. One hypothesis might be that Author Prop's abstract visualization of document topics mirrors the abstract visualization associated with the creation and appreciation of poetry.
Overall, app reception is more positive among respondents that describe themselves as having higher levels of writing experience and skill.

Explanatory Modelling and Inference
In order to enable objective and reproducible interpretations of the survey data results, and because of high correlations between variables noted above, models were used to facilitate explanatory inference.

Methodology
A series of models was fit to the survey data, to separately predict responses to Question 6, "Was the app easy to use," Question 7, "Did you enjoy using the app," and Question 8, "Would you use the app for writing." For each response variable, a set of model forms was tested, which included Logistic Regression, Random Forest, Gaussian Naïve Bayes, and Gradient Boosted (tree-based) Classifier for classification (Question 8), and Linear Regression, Random Forest, and Gradient Boosted (tree-based) Regressor for regression (Questions 6 and 7). The Scikit-learn package [27] was used for modelling in Python; the Statsmodels Python package [28] was also used secondarily for Linear and Logistic Regression.
While learning curves with cross-validation were used to examine the sensitivity of the tested models to data size, due to small sample size and the explanatory (as opposed to predictive) purpose of the modelling, models were fit to the entire dataset for the purpose of explanatory inference [29]. For classification for Question 8, the predicted probability that someone would respond "Yes" was thresholded at the population sample mean for the purpose of classifying predicted probabilities as "Yes" or "No." In all cases, the Gradient Boosted Classifier/Regressor (GBM) outperformed the other model forms in terms of model validation metrics; ROC AUC was used as a metric for classification and MSE/MAE/explained variance score for regression. These models were selected for the purpose of explanatory inference.
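The thresholding step for Question 8 can be illustrated with a short sketch. It assumes that "population sample mean" refers to the observed share of "Yes" answers and that a probability strictly above that share counts as "Yes"; both readings, along with the function names, are assumptions.

```python
def classify_at_sample_mean(probabilities, actual):
    """Label a prediction 1 ('Yes') when its probability exceeds the observed share of 'Yes' answers."""
    threshold = sum(actual) / len(actual)
    predicted = [1 if p > threshold else 0 for p in probabilities]
    return threshold, predicted


def true_rates(predicted, actual):
    """Return (true positive rate, true negative rate) for 0/1 labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    positives = sum(actual)
    negatives = len(actual) - positives
    return tp / positives, tn / negatives
```

Using the sample mean rather than 0.5 as the cut-off is a common way to avoid predicting "No" for nearly everyone when "Yes" answers are the minority class.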
To bolster confidence in variable-level inference, model validation metrics were examined across levels of individual variables to assess whether or not there were systematic patterns wherein certain variables were over- or under-predicted; no significant evidence that would weaken the findings below was found.

Findings
For Question 6, "Was the app easy to use (scale of 1-5 with 5 highest)" the GBM achieved an MSE of 1.01, MAE of 0.75, and explained variance score of 35%. The most important features in this model using the tree-based feature importance metric were writing skill (0.28), writing enjoyment (0.20), writing experience (0.19), prefer poetry or prose (0.11), and the control variable for whether or not the survey began after initial improvements (0.11). As shown in the partial dependence plots in Figure 28 in Appendix C, people with high writing skill and either medium to low writing enjoyment or medium to low writing experience are more likely to find the app easy to use. We can confirm that this group, which represents 21% of the total sample, has an average ease of use rating of 3.78 compared to the overall average of 3.19. The standard deviation of the ease of use rating of the cohort is comparable to the population as a whole, 1.31 compared to 1.26.
For Question 7, "Did you enjoy using the app (scale of 1-5 with 5 highest)" the GBM achieved an MSE of 0.82, MAE of 0.75, and explained variance score of 34%. The most important features in this model using the tree-based feature importance metric were writing skill (0.27), writing enjoyment (0.20), writing experience (0.19), and the control variable for whether or not the survey began after the app was released to the Nextdoor website (0.12). As shown in the partial dependence plots in Figure 29 in Appendix C, people with either high writing skill or high writing experience or both are more likely to rate their enjoyment of the app more highly. We can confirm that this group, which represents 66% of the total sample, has an average enjoyment rating of 3.22 compared to the overall average of 3.00. The standard deviation of the enjoyment rating of the cohort is comparable to the population as a whole, 1.14 compared to 1.12.
For Question 8, "Would you use the app for writing (Yes/No)," the GBM achieved a ROC AUC of 0.82 on the training set. The true positive rate is 0.56, and the true negative rate is 0.78. The most important features in this model using the tree-based feature importance metric were writing skill (0.27), writing enjoyment (0.22), control for whether or not the survey began after initial improvements (0.16), and writing experience (0.14). As shown in the partial dependence plots in Figure 30 in Appendix C, people with either very high writing skill or the combination of medium writing enjoyment and high writing experience are more likely to say they would use the app for writing. We can confirm that this group, which represents 38% of the total sample, has an average "Yes" probability of 0.52 compared to the overall average of 0.42. The standard deviation of the "Yes" classification of the cohort is comparable to the population as a whole, 0.51 compared to 0.50.

Limitations
Survey results are somewhat biased by time; as survey responses were received, changes were made to the app UI to improve user experience. In particular, nearly half of users requested some form of explanation as to the purpose of the app (which was not explicitly stated on the website), and as to how they were expected to use the app. An additional popup, prompt text, and expanded tooltip were implemented to address these concerns. A control variable to indicate whether or not a survey response was begun after these initial improvements were made was introduced into the explanatory models to attempt to normalize for these effects.
Survey results are also biased by respondents, of which there are two main types: computer science graduate students, and members of the general population. If we infer that the general population responded most after the survey was initially released to them, we see that the general population respondents respond more positively to the app than the computer science graduate students. They also exhibit a self-selection bias of higher levels of writing experience and enjoyment. A control variable to indicate whether or not a survey response was begun after the survey was made available to the general population was introduced into the explanatory models to attempt to normalize for these effects.

Conclusion
Author Prop provides students with a tool that allows them to compare their mental maps of in-progress text drafts with topic model results from natural language processing. The web app provides modes for both editing and review/reflection, and allows users to refer to model results at their own pace. The review mode encourages users to think creatively by interacting with and informing alternate representations of their texts. The web app acts both as a cognitive prop and a tool to increase student engagement during writing. Though survey results are limited, feedback has been encouraging on the app experience/concept, with refinement desired on instructions for using the app.
Reflecting on the survey results, it is interesting to note the effect that higher levels of self-identified writing skill and experience have on positive reception of the app. It is also interesting and somewhat surprising that the app seems to appeal more to those with medium to low levels of self-rated writing enjoyment, even though those that enjoy writing more tend to have higher levels of writing experience. These results may indicate a potential market for Author Prop; in the larger sense, it would be an area of future research to investigate how other tools aimed at improving the writing process have achieved similar or different success with related methods. It is interesting to reflect on what the results say about the nature of the writing process itself: how exactly do humans manage the cognition of writing, and what kinds of abstractions or visualizations are useful cognitive props? Through research with AI-based tools, we may be able to provide evidence for theories about the cognition of writing.

Future Work
As noted by users, the app would benefit from a short animated introduction walking first-time users through how to use the app. Requests for better explanation of the web app's purpose and expected usage continued to be a prevalent theme in user-submitted comments even after adjustments were made to the app to explain functionality. Future work thus includes additional testing and research into intuitive user interfaces for the app, as well as better ways to guide first-time users through app usage. Further research into the app's reception is also desired.