diff --git a/chapter-3.tex b/chapter-3.tex
index a2ff637..2de8c81 100644
--- a/chapter-3.tex
+++ b/chapter-3.tex
@@ -59,6 +59,11 @@ \chapter{Large Content and Behavior Models To Predict, Understand, and Generate
T5 and other language models like GPT-3, Pythia \cite{biderman2023pythia}, and Llama \cite{touvron2023llama} can solve a wide variety of tasks, including ones for which they were not explicitly trained. For instance, language models trained on the next-word prediction task showed generalization capabilities across a wide variety of tasks like question-answering, summarization, natural language inference, and translation \cite{brown2020language}. Recently, a series of papers have shown that this generalized ``world understanding'' captured in LLMs can be leveraged to enable them to ``see'' \cite{liu2023visual,li2023videochat,li2023blip2,zhu2023minigpt,ge2023planting,zhang2023video,bhattacharyya-etal-2023-video}. This is a significant capability enhancement since a model trained in language-only settings can be made to reason about images and videos. These papers follow the same transfer learning approach advocated by T5, where they convert visual information to the language space to leverage the ``text-to-text'' framework. They show that it is possible to teach a large language model the new modality of vision without needing to pre-train the model from scratch. Rather, using only a few million tokens, it is possible to extend LLMs' abilities to vision as well. Following this chain of thought, it could be possible to solve the effectiveness problem by posing it as a ``text-to-text'' problem. This is one of the paradigms we explore in this work. We show behavior generalization using several different types of behaviors.
+\begin{figure*}[!t]
+  \centering
+  \includegraphics[width=1.0\textwidth]{images/factors of communication.pdf}
+  \caption{The communication process can be defined by seven factors: Communicator, Message, Time of message, Channel, Receiver, Time of effect, and Effect. Any message is created to serve an end goal. For marketers, the end goal is to bring about the desired receiver effect (behavior), such as clicks, purchases, likes, and customer retention. The figure presents the key elements in the communication pipeline: the marketer, message, channel, receivers, and finally, the receiver effect. \label{fig:factors-of-communication-chapter-lcbm}}
+\end{figure*}
Another possible way to integrate behavior with text is an encoder approach, which we will detail next. While behavior is a downstream effect of content, behavior contains signals about the content sent to the receiver and can help improve content understanding and natural language processing. For instance, the integration of human gaze data into neural network architectures has been explored for a range of computer vision tasks such as image captioning, visual question answering, and tagging \cite{karessli2017gaze,yu2017supervising,he2019human,boyd2022human}. In language processing, tracking a reader's eye movements provides information about the cognitive processes of text comprehension \cite{RaynerReadingComp, Just1980}.
Hence, recent research has utilized features gleaned from readers' eye movements to improve the performance of complex NLP tasks such as sentiment analysis \cite{long-etal-2017-cognition, mishra-etal-2016-leveraging}, sarcasm detection \cite{mishra-etal-2016-harnessing}, part-of-speech tagging \cite{barrett-etal-2016-cross}, NER \cite{hollenstein-zhang-2019-entity}, and text difficulty prediction \cite{ScanPathApp1}. While these studies show promise that behavior can be used to extract information about content, they were conducted in relatively small-scale lab settings and require capturing real-time behavior to make inferences about content. Given these limitations, such approaches are difficult to scale. Scale is what helped LLMs learn language. We therefore also explore the paradigm of generating synthetic behavior over content at scale and fine-tuning a large language model on it, to better understand the possibilities of this paradigm. We cover both approaches next.
@@ -71,8 +76,8 @@ \chapter{Large Content and Behavior Models To Predict, Understand, and Generate
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Large Content and Behavior Models (LCBM)}
-
-\textbf{How can we pose the effectiveness problem as a text-to-text problem?} The problem of effect is to know what the receiver does after receiving the message \cite{shannon-weaver-1949}. In general, for a piece of content, other than the content itself, we often have information about \textit{who} consumes the content and what his \textit{action} is on consuming the content. The latter is the effect described in Shannon's three levels of communication. For instance, an email, as a message from the communicator to the receiver, elicits certain actions from the receiver like link-clicks, replies, and read-time. While LLMs are trained on trillions of tokens of content, the training does not include the receiver effect. For instance, Enron Email \cite{klimt2004enron} is a popular corpus that is included in the training of LLMs like Pythia \cite{biderman2023pythia}. It contains 600K email content sourced from the Enron corporation, which LLMs use to learn how to write emails. However, it does not contain data about the receivers' activities, such as whether they opened the email, how long they kept it open (read-time), and what their reply was. Similarly, while major text corpora include a large number of public blogs and user forums to train LLMs like CommonCrawl, they are stripped of receiver behavior on forum messages, such as the number of likes, shares, and replies, before including them in LLM training (for instance, see \cite{biderman2022datasheet,penedo2023refinedweb}).
+\label{sec:Large Content and Behavior Models (LCBM)}
+In this work, we explore posing the effectiveness problem as a text-to-text problem. The problem of effect is to know what the receiver does after receiving the message \cite{shannon-weaver-1949}. In general, for a piece of content, other than the content itself, we often have information about \textit{who} consumes the content and what their \textit{action} is upon consuming it. The latter is the effect described in Shannon's three levels of communication. For instance, an email, as a message from the communicator to the receiver, elicits certain actions from the receiver like link-clicks, replies, and read-time. While LLMs are trained on trillions of tokens of content, the training does not include the receiver effect.
For instance, Enron Email \cite{klimt2004enron} is a popular corpus that is included in the training of LLMs like Pythia \cite{biderman2023pythia}. It contains 600K emails sourced from the Enron corporation, which LLMs use to learn how to write emails. However, it does not contain data about the receivers' activities, such as whether they opened the email, how long they kept it open (read-time), and what their reply was. Similarly, while major text corpora like CommonCrawl include a large number of public blogs and user forums used to train LLMs, these are stripped of receiver behavior on forum messages, such as the number of likes, shares, and replies, before being included in LLM training (for instance, see \cite{biderman2022datasheet,penedo2023refinedweb}). To pose the effectiveness problem as a text-to-text problem, we can include these \textit{behavior tokens} in the text along with content tokens and train the LLM to model both of them in the same space. This might help the LLM simulate the receiver effect, optimize for it, and reason about it.
@@ -150,12 +155,12 @@ \section{Large Content and Behavior Models (LCBM)}
-In this paper, we show initial experiments to integrate behavior as a new modality to increase the scope of multimodal LLMs from only content to both content and behavior. We call this new type of model a Large Content Behavior Model (LCBM). This class of models shows promise in enabling the LLMs to not only reason about content but also reason about and predict human behavior over that content. Further, LCBMs have the potential for behavior domain adaptation where models trained on one type of behavior can generalize on another behavior type (Fig.~\ref{fig:capabilities-radarplot}). Behavior simulation can enable many real-world applications, such as content recommendation, customer journey optimization, and A/B testing. To build LCBM, we introduce behavior instruction tuning (\S\ref{sec:behavior-instruction-tuning}), an attempt to extend the instruction tuning paradigm to behavior space, bringing all five communication factors (communicator, message, channel, receiver, and effect) into the same space (Fig.~\ref{fig:five-factors-communication}). Similar to \cite{brown2020language,raffel2020exploring,liu2023visual,ge2023planting}, we do not design best-in-class predictors for any of the downstream tasks. Rather, we show a model which shows generalization capabilities across a wide variety of content- and behavior-related tasks. To summarize, our paper makes the following two contributions:
+In this work, we show initial experiments to integrate behavior as a new modality to increase the scope of multimodal LLMs from only content to both content and behavior. We call this new type of model a Large Content Behavior Model (LCBM). This class of models shows promise in enabling the LLMs to not only reason about content but also reason about and predict human behavior over that content. Further, LCBMs have the potential for behavior domain adaptation, where models trained on one type of behavior can generalize to another behavior type (Fig.~\ref{fig:capabilities-radarplot}). Behavior simulation can enable many real-world applications, such as content recommendation, customer journey optimization, and A/B testing.
To build LCBM, we introduce behavior instruction tuning (\S\ref{sec:behavior-instruction-tuning}), an attempt to extend the instruction tuning paradigm to the behavior space, bringing all seven communication factors (communicator, message, time of message, channel, receiver, time of effect, and effect) into the same space (Fig.~\ref{fig:factors-of-communication-chapter-lcbm}). Similar to \cite{brown2020language,raffel2020exploring,liu2023visual,ge2023planting}, we do not design best-in-class predictors for any of the downstream tasks. Rather, we present a model that shows generalization capabilities across a wide variety of content- and behavior-related tasks. To summarize, we make the following two contributions:
\begin{itemize}[leftmargin=*]
- \item\textbf{Large Content Behavior Model (LCBM).} We develop a large multimodal model that shows capabilities of behavior simulation (given content), content simulation (given behavior), content understanding, and behavior understanding (Fig.~\ref{fig:figure-1-lcbm}). Following the text-to-text framework, we connect the Vicuna LLM \cite{touvron2023llama,vicuna2023} with an open-set visual encoder of EVA-CLIP \cite{sun2023eva} and instruction fine-tune it end-to-end on behavior instruction data. EVA-CLIP and QFormer \cite{li2023blip2} help the model to understand visual content in the language space, making it a Vision Language Model (VLM). During behavior instruction tuning, we teach the model to predict behavior given content and content given behavior using various instruction tasks (\S\ref{sec:behavior-instruction-tuning}). This helps us teach behavior modality to the VLM while grounding it in the natural language space. We use three datasets to show the performance of LCBM: a dataset consisting of YouTube videos as the content and the corresponding retention graph, likes, the number of views, and comment sentiment as receiver behavior; a dataset consisting of Twitter posts (text, images, and videos) and corresponding human behavior (like counts) extracted from 168 million tweets across 10135 enterprise Twitter accounts from 2007 to 2023; and an internal dataset of \companyName Marketing Emails\footnote[6]{We obtain \companyName Marketing Emails dataset by collaborating with the \companyName team.} (content) and the click-through rate corresponding to each segment they were sent to (behavior). We observe that teaching the LCBM behavior and content simulation improves its capabilities on them (expected), but the model also shows signs of domain-adaptation in behavior modality (few-shot capability, \textit{unexpected}) (Tables~\ref{table:behavior-domain-adaptation},\ref{table:behavior-simulation-like-simulation-twitter},\ref{table:content-simulation-twitter}) and improvements in behavior understanding (Figs.~\ref{fig:comment-explains},\ref{fig:replay-explains},\S\ref{sec:results}) (zero-shot capability, \textit{unexpected}) \cite{brown2020language}. See Fig.~\ref{fig:capabilities-radarplot} for a radar plot of all the capabilities and comparisons of performances across LCBM and state-of-the-art LLMs: GPT-3.5 and GPT-4.
+ \item\textbf{Large Content Behavior Model (LCBM).} We develop a large multimodal model that shows capabilities of behavior simulation (given content), content simulation (given behavior), content understanding, and behavior understanding (Fig.~\ref{fig:figure-1-lcbm}).
Following the text-to-text framework, we connect the Vicuna LLM \cite{touvron2023llama,vicuna2023} with the open-set visual encoder EVA-CLIP \cite{sun2023eva} and instruction fine-tune it end-to-end on behavior instruction data. EVA-CLIP and QFormer \cite{li2023blip2} help the model understand visual content in the language space, making it a Vision Language Model (VLM). During behavior instruction tuning, we teach the model to predict behavior given content and content given behavior using various instruction tasks (\S\ref{sec:behavior-instruction-tuning}). This helps us teach the behavior modality to the VLM while grounding it in the natural language space. We use three datasets to show the performance of LCBM: a dataset consisting of YouTube videos as the content and the corresponding retention graph, likes, the number of views, and comment sentiment as receiver behavior; a dataset consisting of Twitter posts (text, images, and videos) and corresponding human behavior (like counts) extracted from 168 million tweets across 10,135 enterprise Twitter accounts from 2007 to 2023; and an internal dataset of \companyName Marketing Emails\footnote[6]{We obtain the \companyName Marketing Emails dataset by collaborating with the \companyName team.} (content) and the click-through rate corresponding to each segment they were sent to (behavior). We observe that teaching the LCBM behavior and content simulation improves its capabilities on them (expected), but the model also shows signs of domain adaptation in the behavior modality (few-shot capability, \textit{unexpected}) (Tables~\ref{table:behavior-domain-adaptation}, \ref{table:behavior-simulation-like-simulation-twitter}, \ref{table:content-simulation-twitter}) and improvements in behavior understanding (Figs.~\ref{fig:comment-explains}, \ref{fig:replay-explains}, \S\ref{sec:results}) (zero-shot capability, \textit{unexpected}) \cite{brown2020language}. See Fig.~\ref{fig:capabilities-radarplot} for a radar plot of all the capabilities and comparisons of performance across LCBM and state-of-the-art LLMs: GPT-3.5 and GPT-4.
- \item\textbf{Dataset and Test Benchmark.} To spur research on the topic of large content and behavior models, we release our generated behavior instruction fine-tuning data from over 40,000 public-domain YouTube videos and 168 million Twitter posts. The data contains: 1)~YouTube video links, automatically extracted key scenes, scene verbalizations, replay graph data, video views, likes, comments, channel name, and subscriber count at the time of collection, and 2)~Twitter extracted account names, tweet text, associated media (image and video) verbalizations (including image captions, keywords, colors, and tones), tweet timestamps, and like counts. We also release a benchmark to test performance on the joint content behavior space (\S\ref{sec:test benchmark}), introducing two types of tasks in this space: predictive and descriptive. In the predictive benchmark, we test the model's ability to predict behavior given the content and predict content given the behavior. In the descriptive benchmark, we validate its explanation of human behavior by comparing it with ground-truth annotations we obtain from human annotators that try to explain human behavior. See Figs.~\ref{fig:comment-explains},\ref{fig:replay-explains} for a few examples.
+ \item\textbf{Dataset and Test Benchmark.} To spur research on the topic of large content and behavior models, we release our generated behavior instruction fine-tuning data from over 40,000 public-domain YouTube videos and 168 million Twitter posts. The data contains: 1)~YouTube video links, automatically extracted key scenes, scene verbalizations, replay graph data, video views, likes, comments, channel name, and subscriber count at the time of collection, and 2)~Twitter data including account names, tweet text, associated media (image and video) verbalizations (including image captions, keywords, colors, and tones), tweet timestamps, and like counts. We also release a benchmark to test performance on the joint content behavior space (\S\ref{sec:test benchmark}), introducing two types of tasks in this space: predictive and descriptive. In the predictive benchmark, we test the model's ability to predict behavior given the content and to predict content given the behavior. In the descriptive benchmark, we validate its explanation of human behavior by comparing it with ground-truth annotations we obtain from human annotators who try to explain human behavior. See Figs.~\ref{fig:comment-explains}, \ref{fig:replay-explains} for a few examples.
\end{itemize}
@@ -189,7 +194,7 @@ \subsection{Setup}
\subsubsection{The Content Behavior Corpus (CBC)}
\label{sec:content behavior corpus}
-The availability of large-scale unlabeled text data for unsupervised learning has fueled much of the progress of LLMs. In this paper, we are interested in modeling content and the corresponding receiver behavior in the same space. While available datasets contain trillions of content tokens (text, images, audio, and videos), they unfortunately do not contain the receiver effect. To address this, we utilize YouTube and Twitter, two large publicly available sources of content-behavior data, consisting of (a)~account name, account description, and number of subscribers and followers (\textit{communicator data}) \cy{Does this belong to content or behavior?}, (b)~rich content in the form of videos, images, creator-provided captions, titles, and descriptions (\textit{message}), (c)~behavior in the form of likes, views, user comments, and replay graph (\textit{receiver effect}). This covers all the five factors of communication (Fig.~\ref{fig:five-factors-communication}), with the channel being fixed (as YouTube or Twitter) and receivers being average channel followers and viewers of the communicator. Since content data is multimodal in the form of a combination of images, videos, and text, and behavior data is in the form of numbers, to model it using a text-to-text paradigm, we \textit{verbalize} both of them following the methodology we detail next.
+The availability of large-scale unlabeled text data for unsupervised learning has fueled much of the progress of LLMs. In this work, we are interested in modeling content and the corresponding receiver behavior in the same space. While available datasets contain trillions of content tokens (text, images, audio, and videos), they unfortunately do not contain the receiver effect.
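+To make the behavior-token idea concrete, the following is a minimal sketch, with hypothetical field names and simplified wording, of how a single content-behavior pair can be verbalized into one training string; the exact verbalization patterns we use are given in Listing~\ref{listing:behavior-simulation-video-verbalization} and in the Verbalization Patterns listings later in this chapter.
+\begin{lstlisting}[language=Python,caption={An illustrative sketch of content-behavior verbalization. All field names and values are hypothetical.},frame=single,basicstyle=\scriptsize]
+def verbalize_sample(channel, subscribers, scenes, views, likes):
+    """Turn one video's content and receiver behavior into a training string."""
+    lines = [f"Input: The video is posted by {channel} ({subscribers} subscribers)."]
+    lines.append("The video has the following scenes:")
+    for i, scene in enumerate(scenes):
+        lines.append(f"Scene {i}: ASR: {scene['asr']} Captions: {scene['caption']} "
+                     f"Replay value: {scene['replay']:.2f}")
+    # Receiver behavior is appended as plain text ("behavior tokens"), so a
+    # text-to-text model learns content and behavior in the same token space.
+    lines.append(f"Output: The video got {views} views and {likes} likes.")
+    return "\n".join(lines)
+
+print(verbalize_sample("ExampleChannel", 120000,
+                       [{"asr": "Welcome back!", "caption": "A person waves.",
+                         "replay": 0.81}], views=45321, likes=1298))
+\end{lstlisting}
+Scaled over millions of such verbalized pairs, this yields the kind of corpus we describe next.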
To address this gap, we utilize YouTube and Twitter, two large publicly available sources of content-behavior data, consisting of (a)~account name, account description, and number of subscribers and followers (\textit{communicator data}), (b)~rich content in the form of videos, images, creator-provided captions, titles, and descriptions (\textit{message}), and (c)~behavior in the form of likes, views, user comments, and replay graph (\textit{receiver effect}). This covers all seven factors of communication (Fig.~\ref{fig:factors-of-communication-chapter-lcbm}), with the channel being fixed (as YouTube or Twitter) and the receivers being average channel followers and viewers of the communicator. Since content data is multimodal (a combination of images, videos, and text) and behavior data is in the form of numbers, to model both using a text-to-text paradigm, we \textit{verbalize} them following the methodology we detail next.

\textit{Verbalization:} For the video $V$, YouTube provides us with 100 average viewer retention values $r_i \text{ for } i\in[0..100)$, corresponding to the entire video. The sampling rate of 100 is constant and independent of video length ($T$). Replay value $r_i$ corresponds to video frames between the timestamps $(T/100\times i, T/100\times(i+1))$, which denotes how often these frames were replayed compared to the most replayed frames. The metric has a value between 0 and 1 that identifies the video's relative retention performance at a given point in the video. To accommodate longer video lengths, we merge consecutive replay values $r_i, \ldots, r_j$ into groups spanning at least one second, i.e., we grow each group until $T/100\times(j+1) - T/100\times i \geq 1$ second, with $j \in \{i+1, \ldots, 100\}$. We choose the replay value for this merged group of scenes as $\max(r_i, \ldots, r_j)$. Using this logic, we get replay values $R_i$ for $i \in[0..m)$, where $m=\lfloor 100/\lceil 100/T \rceil \rfloor$. For instance, a 50-second video has $\lceil 100/T \rceil = 2$, so pairs of consecutive replay values are merged into one-second groups, giving $m=50$ verbalized replay values.
@@ -321,7 +326,7 @@ \subsubsection{Content Behavior Test Benchmark}
% \item Image Content: Given the image content, ground truth caption, objects present in the image, object locations, image visual complexity, and date of publication, predict the number of downloads and number of licenses purchased. The template for this is presented in Listing~\ref{listing:behavior-simulation-image-verbalization}.
- \item\textbf{Content Simulation.} Here, the task is to predict content given receiver behavior (Listing~\ref{listing-video-content-simulation}). Given the video content in terms of scene-by-scene descriptions with the content of one group of five consecutive scenes content being masked, behavior values of all scenes, and channel information, the task is to choose the masked scene speech from a list of 25 options, chosen randomly from the entire test set. For YouTube, we chose to model this task as a discriminative task instead of a generative one since videos are generally long, and there could be multiple possible contents for a given behavior, whereas the ground truth is available only for one specific characterization of the content for a given behavior. For Twitter, we model this task as content generation. The Listing~\ref{listing-twitter-content-simulation} presents the format for this task.
+ \item\textbf{Content Simulation.} Here, the task is to predict content given receiver behavior (Listings~\ref{listing-video-content-simulation}, \ref{listing-twitter-content-simulation}).
For YouTube, given the video content in terms of scene-by-scene descriptions with the content of one group of five consecutive scenes masked, the behavior values of all scenes, and channel information, the task is to choose the masked scenes' speech from a list of 25 options chosen randomly from the entire test set. For YouTube, we chose to model this task as a discriminative task instead of a generative one since videos are generally long, and there could be multiple possible contents for a given behavior, whereas the ground truth is available only for one specific characterization of the content for a given behavior. For Twitter, we model this task as content generation. Listing~\ref{listing-twitter-content-simulation} presents the format for this task.
 \item\textbf{Behavior Understanding.} The goal of this task is to check if the model can reason about observed or unobserved receiver behavior. For this task, we could ask the model to explain any behavior given the content. However, only the YouTube receiver comments have ground truth available with the video. Without ground truth, we found that other behaviors, such as replay values, likes, and views, are difficult to explain by non-experts. Therefore, we ask the model to simulate the sentiment of the receivers' comments and describe its reasoning. To evaluate, we asked six annotators to rate the reasons provided by the model on a scale of 0-5, with 0 implying that the model provided no sentiment or reasoning and 5 implying perfect reasoning. The annotators were free to rate the LLMs as they saw fit. The annotators were asked to review the video content and the comments to help them evaluate the reasons. We average the ratings of three annotators to get an average rating for every video. Similarly, to review the sentiment correctness, we asked the annotators to judge the predicted sentiment rating with respect to user comments.
@@ -402,11 +407,12 @@ \subsubsection{Behavior Instruction Fine-Tuning (BFT)}
%We use the instruction fine-tuned LLM, VideoChat, to teach it the behavior modality. VideoChat is a vision language model (VLM) that combines video and image foundation models and large language models (LLMs) through a learnable neural interface. VideoChat follows a two-stage training: one where it aligns a video- and image-encoder embedding space with the text encoder embedding space and a second step where it instruction-tunes the LLM using a video- and image-language instruction dataset. This results in a VLM, which now understands spatiotemporal perception and reasoning on vision and language data. We base our content-behavior model on VideoChat. We further instruction-fine-tune VideoChat on content aligned behavior data to teach it the behavior modality. We expand on that next.
-We prepare the content-behavior instruction datasets as explained next.\\
-\textbf{Teaching behavior in the forward direction} (predict behavior given content): In this instruction tuning task, we teach the model to predict behavior given the message sent by the communicator. Essentially, this teaches the model to predict behavior in the forward direction (as in Fig.~\ref{fig:five-factors-communication}).
Concretely, we include the following information as part of verbalization - image and video embedding converted to the text space (using EvaCLiP \cite{sun2023eva}), scene-by-scene verbalization covering automatic speech recognition, scene captions, video/post caption and description, receiver behavior covering replay rates, views, and likes, and communicator information covering account name and follower count. The verbalisation pattern for this task is the same as given in the Listing~\ref{listing:behavior-simulation-video-verbalization}.
+We prepare the content-behavior instruction datasets as explained next.
+
+\textbf{Teaching behavior in the forward direction} (predict behavior given content): In this instruction tuning task, we teach the model to predict behavior given the message sent by the communicator. Essentially, this teaches the model to predict behavior in the forward direction (as in Fig.~\ref{fig:factors-of-communication-chapter-lcbm}). Concretely, we include the following information as part of the verbalization: image and video embeddings converted to the text space (using EVA-CLIP \cite{sun2023eva}), scene-by-scene verbalization covering automatic speech recognition, scene captions, video/post caption and description, receiver behavior covering replay rates, views, and likes, and communicator information covering account name and follower count. The verbalization pattern for this task is the same as given in Listing~\ref{listing:behavior-simulation-video-verbalization}.
%We observed that VideoChat's hallucinations increase substantially when the video length increases. Therefore, we choose videos with a length between 10 and 100 seconds.
-\textbf{Teaching behavior in the reverse direction} (predict content given behavior): This task teaches the model to learn about behavior in the reverse direction (Fig.~\ref{fig:five-factors-communication}). Here, the model learns to simulate content given behavior. The instruction for this task is given in Listing~\ref{listing-content-simulation-verbalization}.
+\textbf{Teaching behavior in the reverse direction} (predict content given behavior): This task teaches the model behavior in the reverse direction (Fig.~\ref{fig:factors-of-communication-chapter-lcbm}). Here, the model learns to simulate content given behavior. The instruction for this task is given in Listing~\ref{listing-content-simulation-verbalization}.
% 2. Adobe Stock Images: For an image $I$ from Adobe Stock, we have the following receiver behavior: downloads and all-time licenses. Further, we verbalize images by including objects detected using recognize anything model \cite{zhang2023recognize,huang2023tag2text}, object positions using segment anything model \cite{liu2023grounding,kirillov2023segany}, visual complexity using \cite{feng2023ic9600}, ground truth user-provided image captions, and date of posting the image online\footnote{Notably, we do not have information about the receiver in either YouTube or Adobe Stock. Correspondingly, we have average receiver behavior for both datasets.}.
@@ -500,15 +506,26 @@ \subsubsection{Behavior Instruction Fine-Tuning (BFT)}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Results and Discussion}
\label{sec:results}
-\cy{The description here needs improvement.
It is better to describe results for each task individually}
Here, we discuss the results for the five tasks introduced in Section~\ref{sec:test benchmark}, namely, behavior simulation, content simulation, behavior understanding, content understanding, and behavior domain adaptation. We compare the behavior fine-tuned model discussed in \S\ref{sec:behavior-instruction-tuning} with state-of-the-art content-only models like GPT-3.5, GPT-4, and Vicuna-13B. This allows us to compare how much including behavior tokens in the training of an LLM helps in improving the LLM's understanding of behavior and of the joint content and behavior space, while retaining its understanding of the content space.
-The results for the five tasks are presented in Tables~\ref{table:behavior-simulation-replay-values},\ref{table:behavior-simulation-like-simulation},\ref{table:content-simulation},\ref{table:behavior-understanding},\ref{tab:content-understanding}, \ref{table:behavior-domain-adaptation},\ref{table:behavior-simulation-like-simulation-twitter}, and \ref{table:content-simulation-twitter}. We note a few general trends. LCBM, while being 10x smaller than GPT-3.5 and 4, performs better than them on all behavior-related tasks. Further, we see that there is no significant difference between 10-shot and 2-shot GPT-4 or between GPT-3.5 and GPT-4, indicating that unlike other tasks, it is harder to achieve good performance through in-context-learning on the behavior modality. It can be observed that often GPT-3.5 and 4 achieve performance comparable to (or worse than) random baselines. Interestingly, the performance of GPTs on the content simulation task is also substantially behind LCBM. The way we formulate the content simulation task (Listing~\ref{listing-video-content-simulation}), it can be seen that a substantial performance could be achieved by strong content knowledge, and behavior brings in little variance. We still see a substantial performance gap between the two models. All of this indicates that large models like GPT-3.5 and 4 are not trained on behavior tokens.
+The results for the five tasks are presented in Tables~\ref{table:behavior-simulation-replay-values}, \ref{table:behavior-simulation-like-simulation}, \ref{table:content-simulation}, \ref{table:behavior-understanding}, \ref{tab:content-understanding}, \ref{table:behavior-domain-adaptation}, \ref{table:behavior-simulation-like-simulation-twitter}, and \ref{table:content-simulation-twitter}. We note a few general trends. LCBM, while being 10x smaller than GPT-3.5 and 4, performs better than them on all behavior-related tasks. Further, we see that there is no significant difference between 10-shot and 2-shot GPT-4 or between GPT-3.5 and GPT-4, indicating that, unlike other tasks, it is harder to achieve good performance on the behavior modality through in-context learning. It can be observed that GPT-3.5 and 4 often achieve performance comparable to (or worse than) random baselines. Interestingly, the performance of the GPTs on the content simulation task is also substantially behind LCBM. Given the way we formulate the content simulation task (Listing~\ref{listing-video-content-simulation}), substantial performance could be achieved through strong content knowledge alone, with behavior bringing in little variance; even so, we see a substantial performance gap between the two models. All of this indicates that large models like GPT-3.5 and 4 are not trained on behavior tokens.
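+To make the in-context-learning comparison concrete, the sketch below shows one way the $k$-shot baselines above can be assembled and scored for the binary (high/low) like-simulation task. This is an illustrative sketch: the exact prompts follow the verbalization patterns in the listings, and \texttt{query\_llm} is a stand-in for any chat-completion interface.
+\begin{lstlisting}[language=Python,caption={Illustrative sketch of the k-shot in-context-learning baseline. Prompt wording is hypothetical.},frame=single,basicstyle=\scriptsize]
+def build_k_shot_prompt(shots, test_tweet, k=10):
+    """Prepend k solved (tweet -> high/low likes) examples to the test query."""
+    parts = [f"Tweet: {s['text']}\nLikes: {s['label']}" for s in shots[:k]]
+    parts.append(f"Tweet: {test_tweet}\nLikes:")
+    return "\n\n".join(parts)
+
+def k_shot_accuracy(query_llm, shots, test_set, k=10):
+    """Fraction of test tweets whose high/low like label is predicted correctly."""
+    correct = 0
+    for example in test_set:
+        prompt = build_k_shot_prompt(shots, example["text"], k=k)
+        prediction = query_llm(prompt).strip().lower()  # expected: "high" or "low"
+        correct += prediction == example["label"]
+    return correct / len(test_set)
+\end{lstlisting}
+For this binary formulation, a random baseline sits at around 50\%, which is the reference point for the comparisons above.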
For the content understanding tasks (Table~\ref{tab:content-understanding}), predictably, GPT-3.5, being the largest model, achieves the best results. However, we see that BFT helps the LLM to learn most content understanding tasks better than the base LLM. LCBM gets better results than both Vicuna and VideoChat. This indicates that the behavior modality might carry additional information about the content, which might help an LLM understand content better \cite{khurana-etal-2023-synthesizing,klerke-etal-2016-improving,plank2016keystroke}. Next, we see that LCBM also shows signs of domain adaptation in the behavior modality. We see this on five tasks: comment sentiment prediction and comment sentiment reasoning (Table~\ref{table:behavior-understanding}), email behavior simulation (Table~\ref{table:behavior-domain-adaptation}), and Twitter behavior simulation (Table~\ref{table:behavior-simulation-like-simulation-twitter}) and content simulation (Table~\ref{table:content-simulation-twitter}). We note that if the LCBM is trained on only email behavior simulation samples, it underperforms a model trained on both YouTube data and a few email samples included to make the model learn the email format. Similarly, LCBM trained on both Twitter and YouTube performs better than the one trained just on Twitter, showing performance improvement through domain adaptation. Finally, Figs.~\ref{fig:comment-explains}, \ref{fig:replay-explains} show a few samples where we query LCBM to explain replay and comment behavior and compare it with human explanations. We see that LCBM, while verbose, can explain behavior well.
+\begin{figure*}[!t]
+  \centering
+  \includegraphics[width=\textwidth]{images/comment-explains-compressed.pdf}
+  \caption{A few examples showing LCBM's ability to understand and explain the human behavior of audience sentiment. We also compare it against other models like Vicuna and GPT-3.5. \label{fig:comment-explains}}
+\end{figure*}
+
+\begin{figure*}[!t]
+  \centering
+  \includegraphics[width=\textwidth]{images/replay-explains.jpeg}
+  \caption{A few examples showing LCBM's ability to understand and explain the human behavior of scene replayability. We compare it against human-provided explanations of the same. \label{fig:replay-explains}}
+\end{figure*}
@@ -780,20 +797,29 @@ \subsection{Results and Discussion}
\footnotetext[3]{Brand Separated means that the train and test set don't have any overlap in terms of brands; Time Separated means that the test set starts after the last tweet in the train set. BFT denotes behavior fine-tuning, and ICL stands for in-context learning. The best results over four runs are reported for all models. Best models are denoted in \valbest{green} and runners-up in \valgood{blue}.}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\subsection{Related Work}
+
+\textbf{Models of Human Communication:}
+Communication is the situation in which a source transmits a message to a receiver with conscious intent to affect the latter's behaviors \cite{osgood1957measurement,miller1966defining}. Thus, in the most general terms, communication implies a sender, a channel, a message, a receiver, a relationship between sender and receiver, an effect, a context in which communication occurs, and a range of things to which ``messages'' refer \cite{mcquail2015communication,lasswell1948structure}.
As per this view, all of the content produced by humanity is essentially communication from a sender to a receiver over some channel and with some effect. Despite much research on communication in the social sciences since the 1900s, there has been little adoption of it in machine learning modeling. A prime example of this is that the biggest models in machine learning (LLMs) are trained only on content (messages) and ignore the other factors in communication (the intended receiver, channel, and behavior) even when they are available.
-%%%%%%%%%%%%%%%%%%%%
-%%%%%%%%%%%%%%%%%%%%
-\subsection{Conclusion}
-In this paper, we make initial strides towards solving the effectiveness problem proposed by Shannon in his seminal paper on communication. The effectiveness problem deals with predicting and optimizing communication to get the desired receiver behavior. This can be seen as consisting of a string of capabilities: behavior simulation, content simulation, and behavior domain adaptation. We show that while large language models have great generalization capabilities, are unable to perform well on the effectiveness problem. We posit that the reason for this could be a lack of ``behavior tokens'' in their training corpora. Next, we train LLMs on behavior tokens to show that other than content understanding tasks, the trained models are now able to have good performance across all the behavior-related tasks as well. We also introduce a new Content Behavior Corpus (CBC) to spur research on these large content and behavior models (LCBMs).
-\footnotetext[4]{\tiny Note that we cannot compare this model with GPT-3 due to the private nature of data.}
+\textbf{Prior Efforts To Model Behavior:} While there has been much research in ML to model human behavior, it has been disconnected from language and, sometimes, real-world data. For instance, agent-based models (ABMs), a popular paradigm in Reinforcement Learning, have been employed to model behavior \cite{bankes2002agent,romero2023two,park2023generative}. Nevertheless, ABMs tend to view humans as rational economic agents who communicate primarily through their actions, neglecting the significance of content in communication. In ABMs, agents strive to maximize their rewards, whereas communication does not always aim to optimize specific, well-defined reward signals. Moreover, the scarcity of large repositories containing extensive records of human actions poses a challenge when training ABMs to learn human behavior. Consequently, existing large models trained on human behavior, such as ABMs and the Decision Transformer and its variants, often rely on simulated data, such as game environments, rather than real human behavior \cite{chen2021decision}. This reliance on artificially generated data introduces biases inherent to the creators of the training data, making it difficult to capture authentic human behavior. However, recent advancements have demonstrated the potential of large models trained on real-world tokens encompassing various modalities, like images, videos, audio, and text, as the basis for diverse tasks \cite{ge2023planting,li2023blip2}. Notably, LLMs, as exemplars of foundation models, have exhibited impressive performance across a range of tasks, including those they were not explicitly trained for, such as emotion recognition, named entity recognition, and complex tasks like table understanding \cite{ye2023large,bhattacharyya-etal-2023-video}.
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+Further, there has also been much work in modeling behavior using conventional modeling techniques, such as regression, bagging, and boosting \cite{mazloom2016multimodal,villarroel2019cutting}, neural networks \cite{ding2019social,wang2018retweet,khosla2014makes}, and transformers \cite{wu2021towards,xiao2022hierarchical}. While these models can certainly model behavior, LLMs show generalization powers that extend to capabilities well beyond behavior simulation. For instance, once trained on behavior tokens, other than behavior simulation, LLMs can generate behavior-optimized content (Table~\ref{table:content-simulation}), explain behavior (Table~\ref{table:behavior-understanding}), and domain-adapt to other behaviors (Table~\ref{table:behavior-domain-adaptation}), none of which are shown by other models. Other concurrent works that model behavior using LLMs \cite{kang2023llms} model just behavior (for example, through CTR prediction) by attaching classification or regression heads to LLMs, thereby losing out on the text-to-text paradigm where LLMs show their best performance and generalization capabilities. In addition, similar to the non-LLM paradigm, this method loses out on other capabilities like generating behavior-optimized content and explaining behavior.
+
+\subsection{Verbalization Patterns}
\begin{lstlisting}[caption={Verbalization pattern of videos for the behavior understanding task:},frame=single,label={listing-behavior-understanding},basicstyle=\scriptsize]
Input: The video has the following scenes:
@@ -856,30 +882,17 @@ \subsection{Conclusion}
\begin{lstlisting}[caption={Verbalization pattern of Twitter posts for the behavior simulation task:},frame=single,label={listing-twitter-behavior-simulation},basicstyle=\scriptsize]
-XXX
-\end{lstlisting}
-
+Input: Given a tweet of pfizer posted by the account PfizerMed on 2023-01-12. Tweet : Announcing a new ASGCT-Pfizer grant to support independent medical education initiatives on genetic medicines. For details, click Request for Proposals. . Apply by January 30, 2022 #raredisease #ASGCT #GeneTherapy . Verbalisation of media content: \"caption\": \"A close-up of a DNA double helix, showcasing its structure and blue color\",\"keywords\": \"DNA, double helix, structure, blue, close-up, molecular biology, genetics, biology, scientific illustration\"}. Predict whether it will receive high or low likes?",
-\begin{lstlisting}[caption={Verbalization pattern of Twitter posts for the content simulation task:},frame=single,label={listing-twitter-content-simulation},basicstyle=\scriptsize]
-XXX
+Output: This tweet has low likes.
\end{lstlisting}
+\begin{lstlisting}[caption={Verbalization pattern of Twitter posts for the content simulation task:},frame=single,label={listing-twitter-content-simulation},basicstyle=\scriptsize]
+Input: Generate a tweet given the media verbalization and the likes it got. Tweet is for pfizer to be posted by the account PfizerMed on 2023-01-12. Verbalisation of media content: \"caption\": \"A close-up of a DNA double helix, showcasing its structure and blue color\",\"keywords\": \"DNA, double helix, structure, blue, close-up, molecular biology, genetics, biology, scientific illustration\"}. This tweet has low likes."
-\subsection{Other Related Work}
-
-\textbf{Models of Human Communication:}
-Communication is the situation in which a source transmits a message to a receiver with conscious intent to affect the latter’s behaviors \cite{osgood1957measurement,miller1966defining}. Thus, in the most general terms, communication implies a sender, a channel, a message, a receiver, a relationship between sender and receiver, an effect, a context in which communication occurs and a range of things to which 'messages' refer \cite{mcquail2015communication,lasswell1948structure}. As per this, all of the content produced by humanity is essentially communication from a sender to a receiver over some channel and with some effect. Despite much research on communication in social sciences since the 1900s, there has been little adoption of it in machine learning modeling. A prime artefact of this is that the biggest models in machine learning (LLMs) are trained only on content (messages) and ignore other factors in communication (the intended receiver, channel, and behavior) even when they are available.
-
-\textbf{Prior Efforts To Model Behavior:} While there has been much research in ML to model human behavior, it has been disconnected from language and, sometimes, real-world data. For instance, Agent-based modeling (ABMs), a popular paradigm in Reinforcement Learning, has been employed to model behavior \cite{bankes2002agent,romero2023two,park2023generative}. Nevertheless, ABMs tend to view humans as rational economic agents who communicate primarily through their actions, neglecting the significance of content in communication. In ABMs, agents strive to maximize their rewards, whereas communication does not always aim to optimize specific, well-defined reward signals. Moreover, the scarcity of large repositories containing extensive records of human actions poses a challenge when training ABMs to learn human behavior. Consequently, existing large models trained on human behavior, such as the ABMs and decision transformer and its variants, often rely on simulated data, such as game environments, rather than real human behavior \cite{chen2021decision}. This reliance on artificially generated data introduces biases inherent to the creators of the training data, making it difficult to capture authentic human behavior. However, recent advancements have demonstrated the potential of large models trained on real-world tokens encompassing various modalities, like images, videos, audio, and text, as the basis for diverse tasks \cite{ge2023planting,li2023blip2}. Notably, LLMs, as exemplars of foundation models, have exhibited impressive performance across a range of tasks, including those they were not explicitly trained for, such as emotion recognition, named entity recognition, and complex tasks like table understanding \cite{ye2023large,bhattacharyya-etal-2023-video}.
-
-Further, there has also been much work in modeling behavior using conventional modeling techniques, such as regression, bagging and boosting \cite{mazloom2016multimodal,villarroel2019cutting}, neural networks \cite{ding2019social,wang2018retweet,khosla2014makes}, and transformers \cite{wu2021towards,xiao2022hierarchical}. While these models can certainly model behavior, LLMs show generalization powers which extend to capabilities much beyond just behavior simulation.
For instance, once trained on behavior tokens, other than behavior simulation, LLMs can now generate behavior optimized content (Table~\ref{table:content-simulation}), explain behavior (Table~\ref{table:behavior-understanding}), and domain-adapt to other behaviors (Table~\ref{table:behavior-domain-adaptation}), none of which are shown by other models. The other concurrent works which model behavior using LLMs \cite{kang2023llms} model just behavior (for example, by CTR prediction) by attaching classification or regression heads to LLMs and thereby lose out on the text-to-text paradigm where LLMs show their best performance and generalization capabilities. In addition, similar to non LLM paradigm, this method loses out on other capabilities like generating behavior optimized content and explaining behavior.
+Output: "Tweet : Announcing a new ASGCT-Pfizer grant to support independent medical education initiatives on genetic medicines. For details, click Request for Proposals. . Apply by January 30, 2022 #raredisease #ASGCT #GeneTherapy "}
+\end{lstlisting}
@@ -929,18 +942,20 @@ \subsection{Ablation}
\fi
+%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%
+\subsection{Conclusion}
+In this work, we make initial strides towards solving the effectiveness problem proposed by Shannon in his seminal paper on communication. The effectiveness problem deals with predicting and optimizing communication to get the desired receiver behavior. It can be seen as consisting of a string of capabilities: behavior simulation, content simulation, and behavior domain adaptation. We show that while large language models have great generalization capabilities, they are unable to perform well on the effectiveness problem. We posit that the reason for this could be a lack of ``behavior tokens'' in their training corpora. Next, we train LLMs on behavior tokens and show that, beyond content understanding tasks, the trained models now perform well across all the behavior-related tasks too. We also introduce a new Content Behavior Corpus (CBC) to spur research on these large content and behavior models (LCBMs).
+
+\footnotetext[4]{\tiny Note that we cannot compare this model with GPT-3 due to the private nature of the data.}
-
-\begin{figure*}[!t]
-  \centering
-  \includegraphics[width=\textwidth]{images/comment-explains-compressed.pdf}
-  \caption{A few examples showing LCBM's ability to understand and explain human behavior of audience sentiment. We also compare it against other models like Vicuna and GPT-3.5. \label{fig:comment-explains}}
-\end{figure*}
-\begin{figure*}[!t]
-  \centering
-  \includegraphics[width=\textwidth]{images/replay-explains.jpeg}
-  \caption{A few examples showing LCBM's ability to understand and explain human behavior of scene replayability. We compare it against human-provided explanations of the same. \label{fig:replay-explains}}
-\end{figure*}
diff --git a/cover.tex b/cover.tex
index 88063bc..6a21c97 100644
--- a/cover.tex
+++ b/cover.tex
@@ -33,7 +33,7 @@
 \includegraphics[width=0.2\textwidth]{images/ub-logo.png}
 \end{figure}
- \title{Predicting and Optimizing Human Behavior and Communication}
+ \title{Behavior as a Modality}
\begin{center}
\Large{By}