Artificial Intelligence (AI) dominates today’s headlines, hailed as a breakthrough one day and warned against as a threat the next. Yet much of this debate happens in a bubble, focused on abstract hopes and fears rather than concrete solutions. Meanwhile, one urgent challenge is often ignored: the rise of mental health issues in online communities, where biased or hostile exchanges erode trust and psychological safety.
This article introduces a practical application of AI aimed at that problem: a machine learning pipeline designed to detect and mitigate bias in user-generated content. The system combines deep learning models for classification with generative large language models (LLMs) for crafting context-sensitive responses. Trained on more than two million Reddit and Twitter comments, it achieved high accuracy (F1 = 0.99) and generated tailored moderation messages through a virtual moderator persona.
Unlike much of the hype surrounding AI, this work demonstrates a tangible, deployable tool that supports digital well-being. It shows how AI can serve not only business efficiency or profit but also the creation of fairer, more inclusive spaces where people connect online. In what follows, I outline the pipeline, its performance, and its broader implications for online communities and digital well-being. For readers interested in exploring the research in more depth, including a poster presentation video explaining the code and the full-length research report, resources are available on GitHub. [1]
Method
The system was designed as a three-phase pipeline: collect, detect, and mitigate. Each phase combined established natural language processing (NLP) techniques with modern transformer models to capture both the scale and the subtlety of biased language online.
Step 1. Data Collection and Preparation
I sourced 1 million Twitter posts from the Sentiment140 dataset [2] and 1 million Reddit comments from a curated Pushshift corpus (2007–2014) [3]. Comments were cleaned, anonymized, and deduplicated. Preprocessing included tokenization, lemmatization, stopword removal, and phrase matching using NLTK and spaCy.
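A minimal sketch of this preprocessing step is shown below, assuming spaCy’s en_core_web_sm model and NLTK’s English stopword list; the cleaning rules and function names are illustrative, not the project’s exact code.

```python
import re
import spacy
from nltk.corpus import stopwords

# Assumes: python -m spacy download en_core_web_sm and nltk.download("stopwords")
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
STOPWORDS = set(stopwords.words("english"))

def preprocess(text: str) -> list[str]:
    """Clean, tokenize, and lemmatize a single comment."""
    text = re.sub(r"http\S+|@\w+", "", text)          # strip URLs and @mentions
    text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()  # keep alphabetic characters only
    doc = nlp(text)
    # Lemmatize, then drop stopwords and very short tokens
    return [tok.lemma_ for tok in doc if tok.lemma_ not in STOPWORDS and len(tok.lemma_) > 2]

print(preprocess("Grandpa Biden fell up the steps! https://example.com"))
```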
To train the models effectively, I engineered metadata features, such as bias_terms, has_bias, and bias_type, that allowed stratification across biased and neutral subsets. Table 1 summarizes these features, while Figure 1 shows the frequency of bias terms across the datasets.
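To make these features concrete, here is a hedged pandas sketch; BIAS_LEXICON is a tiny placeholder lexicon, not the curated term list behind Table 1.

```python
import pandas as pd

# Placeholder lexicon; the project used a much larger curated term list per bias type
BIAS_LEXICON = {
    "ageist": ["boomer", "senile"],
    "sexist": ["hysterical"],
    "racist": ["thug"],
}

def extract_bias_features(comment: str) -> pd.Series:
    tokens = comment.lower().split()
    hits = {btype: [t for t in terms if t in tokens] for btype, terms in BIAS_LEXICON.items()}
    bias_terms = [t for terms in hits.values() for t in terms]
    bias_type = next((btype for btype, terms in hits.items() if terms), "none")
    return pd.Series({
        "bias_terms": bias_terms,       # matched lexicon terms
        "has_bias": bool(bias_terms),   # presence flag used for stratification
        "bias_type": bias_type,         # first matching category, else "none"
    })

df = pd.DataFrame({"comment": ["OK boomer, nobody cares", "Have a great day!"]})
df = df.join(df["comment"].apply(extract_bias_features))
print(df[["has_bias", "bias_type"]])
```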

Step 2. Bias Annotation and Labeling
Bias was annotated on two axes: presence (biased vs. non-biased) and form (implicit, explicit, or none). Implicit bias was defined as subtle or coded language (e.g., stereotypes), while explicit bias was overt slurs or threats. For instance, “Grandpa Biden fell up the steps” was coded as ageist, while “Biden is a grandpa who loves his family” was not. This contextual coding reduced false positives.
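The two annotation axes can be represented with a small data structure, sketched below under the assumption of a simple enum encoding; the project’s actual label schema may differ.

```python
from dataclasses import dataclass
from enum import Enum

class BiasForm(Enum):
    NONE = 0
    IMPLICIT = 1   # subtle or coded language, e.g. stereotypes
    EXPLICIT = 2   # overt slurs or threats

@dataclass
class Annotation:
    text: str
    has_bias: bool   # presence axis
    form: BiasForm   # form axis

examples = [
    Annotation("Grandpa Biden fell up the steps", True, BiasForm.IMPLICIT),        # ageist framing
    Annotation("Biden is a grandpa who loves his family", False, BiasForm.NONE),   # neutral mention
]
```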
Step 3. Sentiment and Classification Models
Two transformer models powered the detection stage:
– RoBERTa [4] was fine-tuned for sentiment classification. Its outputs (positive, neutral, negative) helped infer the tone of biased comments.
– DistilBERT [5] was trained on the enriched dataset with implicit/explicit labels, enabling precise classification of subtle cues (a fine-tuning sketch follows this list).
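The sketch below shows, in broad strokes, how a DistilBERT classifier of this kind can be fine-tuned with the Hugging Face Trainer API; the toy dataset, hyperparameters, and three-way label mapping (none/implicit/explicit) are illustrative assumptions rather than the project’s training configuration.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Toy rows standing in for the enriched dataset (label: 0 = none, 1 = implicit, 2 = explicit)
rows = {"text": ["Grandpa Biden fell up the steps", "Have a great day!"], "label": [1, 0]}
ds = Dataset.from_dict(rows)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

ds = ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="distilbert-bias", num_train_epochs=2,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=ds).train()
```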
Step 4. Mitigation Strategy
Bias detection was followed by real-time mitigation. Once a biased comment was identified, the system generated a response tailored to the bias type:
– Explicit bias: direct, assertive corrections.
– Implicit bias: softer rephrasings or educational suggestions.
Responses were generated by ChatGPT [6], chosen for its flexibility and context sensitivity. All responses were framed through a fictional moderator persona, JenAI-Moderator™, which maintained a consistent voice and tone (Figure 3).
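As a rough illustration of this tone-adaptive generation step, the sketch below calls the OpenAI chat API with a persona-style system prompt; the model name, prompt wording, and persona text are placeholders, not the production prompts behind JenAI-Moderator.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PERSONA = ("You are JenAI-Moderator, a consistent, respectful community moderator. "
           "Keep responses brief and non-judgmental.")

TONE = {
    "explicit": "Respond with a direct, assertive correction of the biased statement.",
    "implicit": "Respond with a softer rephrasing and a brief educational suggestion.",
}

def moderate(comment: str, bias_type: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": PERSONA},
            {"role": "user", "content": f"{TONE[bias_type]}\n\nComment: {comment}"},
        ],
    )
    return response.choices[0].message.content

print(moderate("Grandpa Biden fell up the steps", "implicit"))
```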

Step 5. System Architecture
The full pipeline is illustrated in Figure 4. It integrates preprocessing, bias detection, and generative mitigation. Data and model outputs were stored in a PostgreSQL relational schema, enabling logging, auditing, and future integration with human-in-the-loop systems.
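A minimal sketch of that logging layer is shown below, assuming a psycopg2 connection and a single simplified table; the actual relational schema is richer than this.

```python
import psycopg2

conn = psycopg2.connect("dbname=moderation user=app")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS moderation_log (
            id          SERIAL PRIMARY KEY,
            comment     TEXT NOT NULL,
            bias_type   TEXT NOT NULL,          -- 'implicit', 'explicit', or 'none'
            sentiment   TEXT NOT NULL,          -- sentiment model output
            response    TEXT,                   -- JenAI-Moderator reply, if any
            created_at  TIMESTAMPTZ DEFAULT now()
        );
    """)
    cur.execute(
        "INSERT INTO moderation_log (comment, bias_type, sentiment, response) VALUES (%s, %s, %s, %s)",
        ("Grandpa Biden fell up the steps", "implicit", "negative", "Let's keep age out of it."),
    )
```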

Results
The system was evaluated on a dataset of over two million Reddit and Twitter comments, focusing on accuracy, nuance, and real-world applicability.
Feature Extraction
As shown in Figure 1, terms related to race, gender, and age appeared disproportionately in user comments. In the first pass of data exploration, the full datasets were examined, and bias was identified in about 4 percent of comments. Stratification was used to handle the imbalance between non-biased and biased occurrences. Bias terms related to brand and bullying appeared infrequently, while political bias showed up as prominently as other equity-related biases.
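As a hedged illustration of that stratification step, a balanced subset can be drawn per class with pandas; the sampling strategy and ratios used in the project may differ.

```python
import pandas as pd

def stratified_balance(df: pd.DataFrame, label_col: str = "has_bias", seed: int = 42) -> pd.DataFrame:
    """Downsample every class to the size of the smallest class."""
    n = df[label_col].value_counts().min()
    return (df.groupby(label_col, group_keys=False)
              .apply(lambda g: g.sample(n=n, random_state=seed))
              .reset_index(drop=True))

# Example: 96 non-biased rows vs. 4 biased rows -> 4 of each after balancing
df = pd.DataFrame({"comment": [f"c{i}" for i in range(100)],
                   "has_bias": [i < 4 for i in range(100)]})
print(stratified_balance(df)["has_bias"].value_counts())
```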
Model Performance
– RoBERTa achieved 98.6% validation accuracy by the second epoch. Its loss curves (Figure 5) converged quickly, with a confusion matrix (Figure 6) showing strong class separation.
– DistilBERT, trained on implicit/explicit labels, reached a 99% F1 score (Figure 7). Unlike raw accuracy, F1 better reflects the balance of precision and recall in imbalanced datasets [7].
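To illustrate that point, a toy example (not the project’s evaluation code) shows how accuracy can look strong on imbalanced data while F1 exposes a classifier that never flags bias:

```python
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced toy labels: 9 non-biased (0), 1 biased (1); a classifier that always predicts 0
y_true = [0] * 9 + [1]
y_pred = [0] * 10

print(accuracy_score(y_true, y_pred))             # 0.9 despite missing the biased comment
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0 because recall on the biased class is zero
```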



Bias Type Distribution
Figure 8 shows boxplots of bias types distributed over predicted-sentiment record counts. The boxes for negative comments span about 20,000 records of the stratified database, combining the very negative and negative categories. For positive comments, that is, comments reflecting affectionate or non-biased sentiment, the boxes span about 10,000 records, and neutral comments account for roughly 10,000 records as well. This breakdown of bias by predicted sentiment validates the sentiment-informed classification logic.

Mitigation Effectiveness
Generated responses from JenAI-Moderator, depicted in Figure 3, were evaluated by human reviewers. Responses were judged linguistically accurate and contextually appropriate, especially for implicit bias. Table 2 provides examples of system predictions alongside the original comments, showing sensitivity to subtle cases.

Discussion
Moderation is often framed as a technical filtering problem: detect a banned word, delete the comment, and move on. But moderation is also an interaction between users and systems. In HCI research, fairness is not only technical but experiential [8]. This system embraces that perspective, framing mitigation as dialogue through a persona-driven moderator: JenAI-Moderator.
Moderation as Interaction
Explicit bias often requires firm correction, while implicit bias benefits from constructive feedback. By reframing rather than deleting, the system fosters reflection and learning [9].
Fairness, Tone, and Design
Tone matters. Overly harsh corrections risk alienating users; overly polite warnings risk being ignored. This system varies tone: assertive for explicit bias, educational for implicit bias (Figure 4, Table 2). This aligns with research showing that fairness depends on context [10].
Scalability and Integration
The modular design supports API-based integration with platforms. Built-in logging enables transparency and review, while human-in-the-loop options ensure safeguards against overreach.
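As a hedged sketch of what API-based integration could look like, the example below wraps the detect-and-mitigate steps in a thin HTTP endpoint; the route name, payload shape, and placeholder helpers are assumptions, not part of the published system.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Comment(BaseModel):
    text: str

def classify_bias(text: str) -> str:
    """Placeholder for the DistilBERT classifier; always returns 'implicit' here."""
    return "implicit"

def generate_response(text: str, bias_type: str) -> str:
    """Placeholder for the JenAI-Moderator generation step."""
    return "Let's rephrase that in a more respectful way."

@app.post("/moderate")
def moderate_endpoint(comment: Comment) -> dict:
    # In the full pipeline these calls would hit the trained models and log to PostgreSQL
    bias_type = classify_bias(comment.text)
    return {"bias_type": bias_type, "response": generate_response(comment.text, bias_type)}
```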
Ethical and Sociotechnical Considerations
Bias detection risks false positives or over-policing marginalized groups. Our approach mitigates this by stripping personally identifiable information, avoiding demographic labels, and storing reviewable logs. Still, oversight is crucial. As Mehrabi et al. [7] argue, bias is rarely fully eliminated but must be continually managed.
Conclusion
This project demonstrates that AI can be deployed constructively in online communities, not only to detect bias but to mitigate it in ways that preserve user dignity and promote digital well-being.
Key contributions:
– Dual-pipeline architecture (RoBERTa + DistilBERT).
– Tone-adaptive mitigation engine (ChatGPT).
– Persona-based moderation (JenAI-Moderator).
The models achieved near-perfect F1 scores (0.99). More importantly, mitigation responses were accurate and context-sensitive, making them practical for deployment.
Future work:
– User studies to evaluate reception.
– Pilot deployments to test trust and engagement.
– Strengthening robustness against evasion (e.g., coded language).
– Expanding to multilingual datasets for global fairness.
At a time when AI is often cast as hype or hazard, this project shows how it can be socially beneficial. By embedding fairness and transparency, it promotes healthier online spaces where people feel safer and respected.
Acknowledgements
This project fulfilled the Milestone II and Capstone requirements for the Master of Applied Data Science (MADS) program at the University of Michigan School of Information (UMSI). The project’s poster received a MADS Award at the UMSI Exposition 2025 Poster Session. Dr. Laura Stagnaro served as the Capstone project mentor, and Dr. Jinseok Kim served as the Milestone II project mentor.
About the Author
Celia B. Banks is a social and data scientist whose work bridges human systems and applied data science. Her doctoral research in Human and Organization Systems explored how organizations evolve into virtual environments, reflecting her broader interest in the intersection of people, technology, and structures. Dr. Banks is a lifelong learner, and her current focus builds on this foundation through applied research in data science and analytics.
References
[1] C. Banks, Celia Banks Portfolio Repository: University of Michigan School of Information Poster Session (2025) [Online]. Available: https://celiabbanks.github.io/ [Accessed 10 May 2025]
[2] A. Go, Twitter sentiment analysis (2009), p. 252
[3] Watchful1, 1 billion Reddit comments from 2005-2019 [Data set] (2019), Available: https://github.com/Watchful1/PushshiftDumps [Accessed 1 September 2024]
[4] Y. Liu, RoBERTa: A robustly optimized BERT pretraining approach (2019), arXiv:1907.11692
[5] V. Sanh, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (2019), arXiv:1910.01108
[6] B. Zhang, Mitigating unwanted biases with adversarial learning (2018), pp. 335-340
[7] N. Mehrabi, A survey on bias and fairness in machine learning (2021), vol. 54, no. 6, pp. 1-35
[8] R. Binns, Fairness in machine learning: Lessons from political philosophy (2018), pp. 149-159
[9] S. Jhaver, A. Bruckman, and E. Gilbert, Human-machine collaboration for content regulation: The case of Reddit Automoderator (2019), vol. 26, no. 5, pp. 1-35
[10] N. Lee, P. Resnick, and G. Barton, Algorithmic bias detection and mitigation: Best practices and policies to reduce consumer harms (2019), Washington, DC