Final Project - word2vec: color poems
This project allows users to create short, generative poems. It is a collaborative effort between myself, the user, and a machine-learning algorithm. A simple visual representation alongside the poem illustrates how language is being manipulated. The result is a short, 3-line poem with 3 colorful rectangles next to it.
The underlying mechanism here is what's called "word2vec." This is a machine-learning technique that generates mathematical relationships between words. I have been using an ml5 5000-word corpus, which contains a 300-dimensional vector for each word. I haven't yet determined the original provenance of this dataset, but it is described as the 5000 most commonly used English words. This, of course, leaves us with many questions. Used where? Used by whom? Used for what?
Using word2vec, I am mapping each word to the color word with the closest vector value.
Color words used: red, pink, orange, yellow, green, blue, grey, brown, gold, silver, black, and white.
Using a nearest-from-set calculation, I am able to declare a single color as a companion to any word in the corpus. These color words are then offered to the user as a means of language manipulation. The resulting visual is a simple history of the user's experience.
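A minimal sketch of this nearest-from-set idea, assuming we already have each word's vector: compare the word's vector against each color word's vector by cosine similarity and keep the best match. The 3-dimensional vectors below are invented for illustration; the real corpus stores 300 values per word.

```javascript
// Toy color vectors, made up for illustration (the real ones come from
// the ml5 word2vec json file and have 300 dimensions each).
const colorVectors = {
  red:   [0.9, 0.1, 0.0],
  blue:  [0.0, 0.2, 0.9],
  green: [0.1, 0.9, 0.1],
};

function dot(a, b) {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

function cosineSimilarity(a, b) {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Return the color word whose vector is most similar to wordVector.
function nearestColor(wordVector) {
  let best = null;
  let bestScore = -Infinity;
  for (const [color, vec] of Object.entries(colorVectors)) {
    const score = cosineSimilarity(wordVector, vec);
    if (score > bestScore) {
      bestScore = score;
      best = color;
    }
  }
  return best;
}

console.log(nearestColor([0.8, 0.2, 0.1])); // → "red" in this toy space
```

The same search works unchanged at 300 dimensions; only the vectors and the list of candidate colors grow.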
The underlying motivation for this work was an exploration of word2vec and how to use it in an artistic practice. Earlier explorations in this area are also covered in this blog post. During the project's process, I had to directly engage with some of the dangers of working with machine learning and large-scale datasets. With that, this blog post stands as a process book for my overall experience. Also important to note: the sketch below includes a "Read Me" file, which acknowledges the many resources and creators whose work I've been learning from. An additional write-up on references, resources, and influences is at the bottom of this post.
I needed to brainstorm about how to translate my simple word2vec explorations into a cohesive project. I started by thinking about how we describe changes in color in an HSV (hue, saturation, value) color mode. Could we change words in a similar way? For example, when we adjust the saturation of a color, we add or subtract grey. Applying this concept to language, we could add or subtract the word "grey" from another word. We could describe doing so as changing the "saturation" of the word. Similarly, adding or subtracting the word "bright" could affect a word's "value." Using the color words themselves could provide a way to affect a word's "hue." A few code sketches in this direction showed that the backend of this was workable. However, the user's experience (adjustable sliders to affect each parameter) was lacking. I realized that what I wanted was a history of how the word had changed. I was craving language to describe and understand what was happening to the words.
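The "saturation" idea above can be sketched as plain vector arithmetic: add or subtract the vector for "grey" from a word's vector, then look up the word nearest the result. The short integer vectors below are toy values for illustration (the real model uses 300 floating-point dimensions); ml5's word2vec wrapper exposes add() and subtract() helpers for this same kind of operation.

```javascript
// Element-wise vector addition and subtraction, the core of the
// "adjust a word like you adjust a color" idea.
function addVectors(a, b) {
  return a.map((v, i) => v + b[i]);
}

function subtractVectors(a, b) {
  return a.map((v, i) => v - b[i]);
}

const word = [4, 7, 2]; // hypothetical vector for some word
const grey = [1, 1, 1]; // hypothetical vector for "grey"

// Moving the word away from "grey" in vector space:
console.log(subtractVectors(word, grey)); // → [3, 6, 1]
// Moving the word toward "grey":
console.log(addVectors(word, grey)); // → [5, 8, 3]
```

In the full pipeline, the resulting vector would then be mapped back to the nearest word in the corpus, which is what makes the "changed" word visible to the user.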
After sketching out some more elaborate ideas, I opted for a simplified user experience. I wanted the emphasis of this project to be an engagement with the word2vec model. One idea that I walked away from was an early-digital aesthetic for poetry creation, what I was describing as "MSPaint but for words." To be honest, this still sounds fun and might be kept in my back pocket for a while.
By creating an artifact of the experience, users could more easily reflect upon what they observed using word2vec. I decided to move toward written generative poetry, supported by a visual map of the journey. It was at this point that I also defined and started to ask questions about the role of the creator. I outlined that there are in fact 3 co-creators for this work:
- The user of the final tool
- Myself, the developer of the tool
- The dataset, the materiality of the tool
At this point, I had already been using the dataset and methodology provided by ml5, in a p5.js coding environment. I made the decision to prioritize moving quickly and continuing to build off of what I had made. The language I was using was mapped in an open-source ml5 json file (please read my entire post for a more informed view of using this data). The files there are referred to as "the 1000, 5000, 10000, and 25000 most common English words." Without an understanding of the dataset or its origins, I was introducing the dataset's problems into my own work.
I spent a short amount of time debating if I should build my own model (hopefully, my next pursuit!), before shrugging it off and committing to building something quickly and with more efficiency. I wanted to move fast. I was willing to break things.
With a fairly stable endpoint in mind, I was eager to jump into the development stage. Rather than designing the entire UX beforehand, I just started building. This proved to be a good strategy, since the UX for this type of work is deeply connected to the content. The content is only fully available when the backend is functioning. If I had designed this experience without also experiencing the model, I would have made a very different, and possibly problematic sketch.
Note to future self: when working with machine learning, the design stage should work in tandem with development. Content is critical, and it is inaccessible without at least some development.
The second "let's move fast" decision that I made was to start the user with a selection of randomly chosen words. This simplified the need for an input system and smoothed the user's point of entry into the sketch (since they didn't have to do the work of thinking of a word, which may or may not exist in this json file).
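The random starting-word decision above amounts to a few lines of code: pick entries at random from the corpus word list, so every starting word is guaranteed to exist in the model. The `corpus` array here is a tiny stand-in for the word list in the ml5 json file.

```javascript
// Stand-in for the 5000-word list loaded from the ml5 json file.
const corpus = ["ocean", "window", "music", "stone", "morning"];

// Pick `count` words at random from the list (repeats are possible
// in this simple sketch).
function randomWords(list, count) {
  const picks = [];
  for (let i = 0; i < count; i++) {
    picks.push(list[Math.floor(Math.random() * list.length)]);
  }
  return picks;
}

console.log(randomWords(corpus, 3)); // e.g. ["music", "ocean", "stone"]
```

Because the picks come straight from the corpus, there is no need to validate user input — which, as described below, is also exactly how an unvetted word can surface.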
In summary, I am taking a list of 5000 words, which I have not vetted, and then telling a computer to randomly display any of those 5000 words. In retrospect, the potential for a problem is obvious. However, I was unprepared to see that the sketch had randomly selected a racial slur.
I'd like to refer to this next stage of the process as "considerate development." A phase which began with me stopping the work. I am incredibly grateful that I was able to uncover this problem in a private venue. I easily could have (and would have) shared this work and presented another person with a hostile experience. I stepped away from the project, debating whether I should abandon it entirely. I had two problems to untangle. The selfish, silly problem of "needing a final project for this class" and the problem of responsibility. I needed to make sense of what I discovered, to understand where and when the problem was first introduced, and to consider what I should do with it.
The Problem of Responsibility
After some time and conversations, I determined a personal stance that my problem is not with the inclusion of this language in the dataset. Firstly, context is important for language. What would be hate speech when I, as a white, straight woman of privilege, say it, is not hate speech in every context. Furthermore, I think it's important to acknowledge that hate speech exists, and even more important to have data that allows us to engage with hateful language. If I want to build a tool that is actively anti-racist, I would need access to models that include these data points.
I concluded that the problem was introduced in two different moments:
- When I made the decision to use a dataset without personally vetting it
- When a dataset was made available, and suggested for beginners, without a disclosure that it included hate speech
(The third unlisted problem is the systemic racism and use of hate speech within our society, which is then reflected and reinforced within a dataset.) In terms of the first problem, I had a valuable learning experience. I have now had a visceral experience that taught me the importance of vetting large datasets. Even when working on smaller, personal projects, there is a responsibility inherent to the work. Knowing why to vet data is simpler than knowing how to vet data, especially when the vetting criteria will shift depending on the use case at hand. For the second problem, I've been in conversations with my professor, Ellen Nickles, who has been a vital source of support and advisement throughout this experience. We're working together on next steps, but are aligned in wanting to prevent future students from unintentionally creating a project with hate speech.
The Problem of My Project
I needed to make some decisions. I had about three days to get a working prototype ready for our class final. Solutions ranged from simple to complex, with what felt like an ethical cost attached. The most time-consuming, but correct, solution would be to train my own model on a dataset that I had control over. After some research, this seemed like the best solution, but it would be difficult for me to understand, enact, and then return to building the rest of my project. The simplest solution would be to use the 1000-word corpus, which did not include hate speech, but would impact the efficacy of my work. Another simple solution would be to sanitize the dataset, and simply remove the hate speech from it. I didn't love any of these ideas. The simple solutions felt dishonest to the experience that I had. When I thought about my project with a non-disclosed, sanitized dataset, I found it suddenly uninteresting. The complicated solution, however, required time and the acquisition of new knowledge, two things I did not have readily available.
After consideration, I realized that what I wanted to do was sanitize the dataset, but I didn't want to lie about it, and I didn't want to be forced to include hate speech in the code that I had written. I elected instead to write a "language intervention" script. I had been utilizing word2vec as a mechanism to get me back to words: word --> vector --> vector --> word. The language intervention script interrupts this process for words that I have qualified as hate speech. The "word" itself is replaced by the first of the 300 parameters in its vector. The user sees this value in lieu of the word. When randomly generated, you have no way of knowing what word is being represented. However, if a user stumbles upon these words through word relationships, the context reveals a great deal about what is inside the word2vec model.
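The intervention can be sketched as a small display filter: before a word is shown, check it against a flagged set; if it is flagged, show the first of its vector parameters instead. The flagged entries and vectors below are placeholders for illustration, not the actual flagged terms or model values.

```javascript
// Placeholder flagged set; the real list holds the words I qualified
// as hate speech.
const flagged = new Set(["badword"]);

// Placeholder vectors; the real ones come from the ml5 word2vec json
// file and have 300 parameters each.
const vectors = {
  badword: [0.183, -0.02, 0.41],
  ocean:   [0.55, 0.12, -0.3],
};

// Return what the user should see for a given word: the word itself,
// or, if flagged, its first vector parameter as a stand-in.
function displayWord(word) {
  if (flagged.has(word)) {
    return String(vectors[word][0]);
  }
  return word;
}

console.log(displayWord("ocean"));   // → "ocean"
console.log(displayWord("badword")); // → "0.183"
```

Because the stand-in is a real parameter from the word's own vector, the intervention hides the word without pretending it was never in the dataset.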
Another wrinkle was in how to define hate speech. For this project, I opted for a very narrow definition, and flagged only 4 of the 5000 words in the corpus. I left in a good number of words that are ugly. I left in words that are not inherently hateful, yet are frequently used in a hateful way. My hope is that by giving the user the option to see that the algorithm is suggesting these words, the user gains a better understanding of how these relationships can be mathematically stored and potentially misused. Humans are beautiful at handling nuance. Computers are not.