The capabilities of large language models (LLMs) have been applied in expert systems across various domains, providing new opportunities for AI in Education (AI4Education). Educational interactions involve a cyclical exchange between teachers and students. Current research predominantly focuses on using LLMs to simulate teachers, leveraging their expertise to enhance student learning outcomes. However, the simulation of students, which could improve teachers' instructional skills, has received insufficient attention due to the challenges of modeling and evaluating virtual students. This research poses the question: “Can LLMs be utilized to develop virtual student agents that mimic human-like behavior and individual variability?” Unlike expert systems focusing on knowledge delivery, virtual students must replicate learning difficulties, emotional responses, and linguistic uncertainties. These traits present significant challenges in both modeling and evaluation. To address these issues, this study focuses on language learning as a context for modeling virtual student agents. We propose a novel AI4Education framework, termed SOE (Scene - Object - Evaluation), to systematically construct LVSA (LLM-based Virtual Student Agents). By curating a dataset of personalized teacher-student interactions with various personality traits, question types, and learning stages, and fine-tuning LLMs using LoRA, we conduct multi-dimensional evaluation experiments. Specifically, we: (1) develop a theoretical framework for generating LVSA; (2) integrate human subjective evaluation metrics into GPT-4 assessments, demonstrating a strong correlation between human evaluators and GPT-4 in judging LVSA authenticity; and (3) validate that LLMs can generate human-like, personalized virtual student agents in educational contexts, laying a foundation for future applications in pre-service teacher training and multi-agent simulation environments.
(1) Theoretical framework for LVSA: We proposed a comprehensive framework for constructing virtual student agents with scientific rigor and feasibility. The framework extends from conceptual theory, covering the implicit and explicit characteristics of early adolescent students, to operational theory, covering the classification criteria for constructing teacher-student dialogues (see Section 4.1).
(2) Subjective evaluation metrics integration: We invited ten human evaluators to conduct Turing tests, distinguishing between the LVSA and real students. After the tests, we incorporated human subjective metrics into GPT-4’s evaluation pipeline to align with human assessments of virtual student authenticity (see Sections 5.2 and 5.3).
(3) LVSA validation: We conducted an in-depth, large-scale analysis of LVSA performance using GPT-4 across different personality types, learning stages, and question types, both before and after fine-tuning four foundational models. This evaluation aimed to assess whether these virtual students could achieve personalization, human-like performance, and adaptability in various educational scenarios (see Section 5.4).
We investigated the potential of four foundational models—InternVL, LLaVa, MiniCPM, and Qwen—to simulate student performance in junior high school Chinese education, focusing on early adolescence (ages 10-15), a phase marked by the transition from concrete to abstract thinking and the associated cognitive and linguistic challenges. Junior high school Chinese education emphasizes expression, emotional experience, and value formation, aligning well with LLMs' strengths in natural language processing. This context provides an ideal setting to evaluate whether virtual student models can authentically replicate human-like performance, including language style, emotional responses, and value-driven interactions. Thus, our first research question asks whether these foundational models possess sufficient basic Chinese understanding ability to simulate junior high school students.
To answer this question, we first built the Basic Chinese Understanding Ability Dataset, sourced from the National Smart Education Platform, which measures text comprehension (613 items) and memorization (438 items) skills. Results show that (1) InternVL achieved the highest average accuracy (0.747), followed by MiniCPM (0.700), while Qwen and LLaVa averaged 0.599 and 0.444, respectively. (2) The lower performance of Qwen and LLaVa likely reflects their design focus on multimodal tasks, limiting their effectiveness in Chinese language processing.
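As a concrete illustration, the sketch below shows one way such an accuracy comparison could be computed. The dataset file name, item schema, and stub answer function are assumptions for illustration, not the evaluation code used in the study.

```python
import json

def accuracy(answer_fn, items):
    """Fraction of items answered correctly (exact match against the reference answer)."""
    hits = sum(1 for it in items if answer_fn(it["question"]).strip() == it["answer"].strip())
    return hits / len(items)

def evaluate_model(answer_fn, path="basic_chinese_understanding.json"):
    """Average a model's accuracy over the comprehension and memorization subsets."""
    with open(path, encoding="utf-8") as f:
        dataset = json.load(f)
    comp = [x for x in dataset if x["skill"] == "comprehension"]  # 613 items in the study
    memo = [x for x in dataset if x["skill"] == "memorization"]   # 438 items in the study
    return (accuracy(answer_fn, comp) + accuracy(answer_fn, memo)) / 2

# Stub model for illustration; a real run would wrap InternVL, LLaVa, MiniCPM, or Qwen.
print(evaluate_model(lambda question: "A"))
```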
We presented a theoretical framework for constructing LVSA, incorporating both conceptual and operational theories to simulate human-like, personalized performance. This framework is grounded in the physiological, cognitive, social-emotional, and moral-spiritual characteristics of early adolescence (ages 10-15), a phase of significant developmental change. To enhance the construction of LVSA, we introduce key practical dimensions—question-answer types, personality traits, learning stages, response styles, and generation sources—that guide the modeling process. Specifically, to enhance the personalized expression of student personalities in this study and to fully explore the potential of LLMs in constructing LVSA, we utilized five personality types that differed widely: High Neuroticism (HN), High Extraversion (HE), High Agreeableness (HA), Low Conscientiousness (LC), and Low Openness (LO).
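To make the five profiles concrete, the following sketch encodes them as simple persona records that could be injected into generation prompts. The field names and prompt hints are illustrative, drawn from the trait descriptions in this paper rather than from its actual prompt templates.

```python
from dataclasses import dataclass

@dataclass
class StudentPersona:
    code: str          # short label used when tagging dialogue samples
    trait: str         # Big Five dimension and pole
    prompt_hint: str   # language-style cue injected into generation prompts

# Illustrative prompt hints only; the study's exact prompt wording is not reproduced here.
PERSONAS = [
    StudentPersona("HN", "High Neuroticism",      "hesitant, emotionally fluctuating, nervous wording"),
    StudentPersona("HE", "High Extraversion",     "self-referential, positive, socially engaged wording"),
    StudentPersona("HA", "High Agreeableness",    "cooperative, empathetic, warm wording"),
    StudentPersona("LC", "Low Conscientiousness", "imprecise, loosely organized wording"),
    StudentPersona("LO", "Low Openness",          "conservative, structured wording tied to known facts"),
]
```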
To model LVSA, we constructed a fine-tuning dataset aligned with the Basic Chinese Understanding Ability Evaluation Dataset, incorporating data from real classroom video recordings, textbook content, and teacher-prepared lesson plans. The dataset construction involved several key stages to ensure realistic and personalized dialogues, including data preparation, prompt design, expert revision, large-scale dialogue generation based on the Big Five personality traits, and the creation of fine-tuning datasets.
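A minimal sketch of how such dialogue exchanges might be packaged into a fine-tuning dataset is shown below. The instruction/input/output schema, field wording, and example exchange are assumptions, not the study's actual data format.

```python
import json

def to_finetune_record(teacher_turn, student_turn, persona, stage, q_type):
    """Wrap one teacher-student exchange as an instruction-tuning sample (assumed schema)."""
    return {
        "instruction": (
            f"You are a junior high school student with {persona} personality, "
            f"at the {stage} learning stage. Answer the teacher's {q_type} question "
            "in your own voice."
        ),
        "input": teacher_turn,
        "output": student_turn,
    }

def write_jsonl(records, path):
    """Write one JSON object per line, keeping Chinese characters unescaped."""
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

# Hypothetical usage with a single curated exchange.
records = [to_finetune_record(
    teacher_turn="这篇课文表达了作者怎样的情感？",
    student_turn="嗯……我觉得作者有点想家，但我不太确定。",
    persona="High Neuroticism", stage="new lesson", q_type="open-ended")]
write_jsonl(records, "lvsa_finetune.jsonl")
```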
After generating fine-tuning student-teacher dialogues based on the Big Five personality traits, word cloud visualizations were used to analyze students' expression styles and lexical richness. The word cloud for HE highlights frequent use of self-referential and positive language, suggesting extraverted students who engage in social interaction. HN shows hesitancy and uncertainty, marked by emotional fluctuation and nervousness. LO predominantly uses conservative and structured language, indicating a reliance on established knowledge. HA emphasizes cooperation, empathy, and warmth, while LC reflects imprecise and disorganized expression, characteristic of lower conscientiousness. These findings align with existing research on how language style reflects cognitive abilities and personality traits, validating the uniqueness and effectiveness of the fine-tuning dataset used for LVSA construction.
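The word-cloud analysis could be reproduced along the lines of the sketch below, assuming the jieba segmenter and the wordcloud package; the font path and sample sentences are placeholders.

```python
import jieba                      # Chinese word segmentation
from wordcloud import WordCloud   # word-cloud rendering

def persona_wordcloud(responses, out_path, font_path="simhei.ttf"):
    """Render one word cloud from all generated responses of a persona.

    font_path must point to a font with CJK glyphs; "simhei.ttf" is an assumption.
    """
    tokens = []
    for text in responses:
        tokens.extend(w for w in jieba.lcut(text) if len(w.strip()) > 1)
    WordCloud(font_path=font_path, width=800, height=600,
              background_color="white").generate(" ".join(tokens)).to_file(out_path)

# Hypothetical usage for the High Extraversion (HE) persona.
persona_wordcloud(["我觉得这个故事特别有意思！", "我想和大家分享我的看法。"],
                  "wordcloud_HE.png")
```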
The fine-tuning process was conducted using a high-performance setup, enabling the LLM to personalize the generated responses for each student personality type. Results show that after fine-tuning, the models demonstrated improved linguistic capabilities, with LVSA responses better aligned with targeted personality traits and classroom dialogue styles.
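The paper fine-tunes the models with LoRA; a minimal sketch of such a setup using the Hugging Face peft library is given below. The base checkpoint name, rank, and target modules are illustrative defaults, not the study's reported configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Placeholder base checkpoint; the study fine-tunes InternVL, LLaVa, MiniCPM, and Qwen.
base = "Qwen/Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Illustrative LoRA settings; target module names vary by architecture.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable
# The wrapped model can then be trained on the personality-specific dialogue dataset
# with a standard supervised fine-tuning loop.
```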
We comprehensively evaluate the LVSA by constructing a subjective evaluation dataset. The dataset construction process comprises inference dataset creation, fine-tuned inference, direct inference, and evaluation data reconstruction. The evaluation process consists of human evaluation, human-GPT-4 comparison evaluation, and large-scale GPT-4 evaluation to address three key research questions in our study.
Subjective Evaluation Dataset: The Subjective Evaluation Dataset was constructed through a four-step process: (1) inference dataset creation, (2) fine-tuned inference, (3) direct inference, and (4) dataset reconstruction for evaluation. This dataset supports both human and GPT-4 assessments. A total of 12,312 responses were generated across four foundational models and five personality traits, covering various learning stages and question types. For human evaluation, 115 samples were randomly selected, including both fine-tuned and direct inference responses, as well as real student responses for comparison. These evaluations provided insights into the realism and effectiveness of LVSA.
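One plausible way to perform the final reconstruction step, mixing fine-tuned, direct-inference, and real student responses into a blind evaluation sample, is sketched below; the pool sizes, field names, and sampling seed are assumptions.

```python
import json
import random

def build_human_eval_set(fine_tuned, direct, real, n=115, seed=42):
    """Mix fine-tuned, direct-inference, and real student responses, then sample n items.

    Each entry keeps the dialogue text plus a hidden source label so that
    evaluators judge authenticity blind to the response origin.
    """
    pool = (
        [{"dialogue": d, "source": "fine_tuned"} for d in fine_tuned]
        + [{"dialogue": d, "source": "direct"} for d in direct]
        + [{"dialogue": d, "source": "real"} for d in real]
    )
    random.seed(seed)
    sample = random.sample(pool, n)
    random.shuffle(sample)
    return sample

# Hypothetical usage; the real pool holds 12,312 generated responses plus real ones.
eval_set = build_human_eval_set(["..."] * 60, ["..."] * 60, ["..."] * 30)
with open("human_eval_set.json", "w", encoding="utf-8") as f:
    json.dump(eval_set, f, ensure_ascii=False, indent=2)
```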
Human Turing Test: The Human Turing Test aimed to assess whether human evaluators could distinguish between LVSA-generated dialogues and real student responses. Participants, acting as judges, evaluated 120 teacher-student dialogues while verbalizing their thought processes. Fleiss’s Kappa score of 0.6917 indicated substantial agreement among participants, with fine-tuned LVSA achieving an average recognition rate above 90%, closely resembling real students. In some cases, LVSA with traits like high neuroticism, low conscientiousness, and low openness were more difficult to distinguish from real students, demonstrating the effectiveness of the models in emulating human-like language performance.
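Inter-rater agreement of this kind is typically computed with Fleiss's Kappa; a small sketch using statsmodels is shown below, with a toy rating matrix standing in for the study's ten evaluators and 120 dialogues.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# ratings[i, j] = judgment of rater j on dialogue i (1 = "real student", 0 = "virtual").
# The toy matrix below stands in for the 10 evaluators x 120 dialogues in the study.
ratings = np.array([
    [1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
])

table, _ = aggregate_raters(ratings)   # subjects x categories count table
print(f"Fleiss's kappa: {fleiss_kappa(table):.4f}")
```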
Human-GPT-4 Comparison Validation: In the Human-GPT-4 Comparison Validation, GPT-4's evaluation capabilities were compared to those of human evaluators by integrating interview data covering emotional integration, cognitive level, psychological state, and verbal expression into its prompts. Using a chain-of-thought (CoT) approach, GPT-4 achieved an average evaluation score of 0.978, closely aligning with human judgments across the five personality traits. The overall Fleiss’s Kappa of 0.6806 indicated substantial agreement between GPT-4 and human evaluators, confirming GPT-4’s reliability in assessing virtual student responses.
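A hedged sketch of such a CoT-style GPT-4 judging call is shown below, using the OpenAI chat completions API; the rubric wording, model name string, and example dialogue are illustrative, not the study's actual prompts.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# The rubric dimensions mirror the paper (emotional integration, cognitive level,
# psychological state, verbal expression); the exact wording here is illustrative.
RUBRIC = (
    "You judge whether a student's classroom reply sounds like a real junior high "
    "school student. Reason step by step over four dimensions: emotional integration, "
    "cognitive level, psychological state, and verbal expression. "
    "Finish with a single line: VERDICT: real or VERDICT: virtual."
)

def judge(teacher_turn: str, student_reply: str) -> str:
    """Return GPT-4's step-by-step judgment for one teacher-student exchange."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Teacher: {teacher_turn}\nStudent: {student_reply}"},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content

print(judge("这篇课文表达了作者怎样的情感？", "嗯……我觉得作者有点想家。"))
```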
Evaluation results for different LVSA types: The average evaluation score for the five personality types significantly increased from 36.76% to 72.51% post-fine-tuning, highlighting LLMs’ capability in generating realistic and personalized behaviors. Paired t-tests confirmed the statistical significance of these improvements, with p-values well below 0.05 for all models. An analysis of different personality types revealed that, except for students with LC, all types showed significant gains post-fine-tuning. Virtual students with HA exhibited the most notable improvements, with p-values below 0.001, indicating strong statistical significance.
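The paired t-tests reported here and in the following paragraphs can be run with scipy; the sketch below uses placeholder before/after scores rather than the paper's measured values.

```python
from scipy.stats import ttest_rel

# Per-persona evaluation scores for one model before and after fine-tuning.
# The numbers below are placeholders, not the paper's reported values.
before = [0.35, 0.40, 0.32, 0.38, 0.39]   # HN, HE, HA, LC, LO
after  = [0.70, 0.74, 0.78, 0.41, 0.69]

t_stat, p_value = ttest_rel(after, before)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p < 0.05 indicates a significant gain
```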
Evaluation results for different learning stages: On average, the performance of all four models improved by 36.03%, with paired t-tests showing p-values below 0.001, indicating highly statistically significant improvements. These results suggest that fine-tuning based on learning stages is more straightforward compared to virtual student personality traits due to the structured and hierarchical nature of learning stage data. Each stage has clearly defined teaching content and cognitive benchmarks, allowing the models to recognize and adapt to these relationships effectively during fine-tuning. In contrast, modeling student personalities is more complex, as personality traits are fluid and context-dependent, lacking explicit hierarchical structures, making the fine-tuning process more challenging.
Evaluation results for different question types: The paired t-test p-values for closed, open, and overall questions were 0.015, 0.006, and 0.009, respectively—all below the 0.05 threshold—indicating statistically significant improvements. The differences in model performance between closed and open questions are likely due to inherent variations in complexity. Closed questions require specific factual recall, benefiting from structured dataset pretraining, while open questions involve creative reasoning, posing greater challenges. These findings suggest that fine-tuning enhances adaptability and response quality, particularly for tasks requiring sophisticated reasoning and creativity.
Poor fine-tuning performance of LC LVSA: The limited performance of LC virtual students can be attributed to the sparse distribution of relevant data. Expressions and behaviors typical of low-conscientiousness students are underrepresented in the original training data, making it difficult for models to learn these traits accurately. This scarcity leads to a higher likelihood of “hallucination,” where generated responses lack semantic coherence. Furthermore, LLMs are generally designed to avoid promoting negative traits associated with antisocial performance, further complicating the modeling of this personality type.
Inconsistent fine-tuning effects across question types: Although all four models improved in handling both closed and open-ended questions, none achieved statistically significant improvements across both types. This inconsistency likely stems from inherent differences in cognitive demands. Closed-ended questions are more structured, making them easier for models to manage, whereas open-ended questions require deeper reasoning and creative thinking, resulting in greater variability in performance.
Suboptimal fine-tuning performance of LLaVa: Despite improvements in personalization, question types, and learning stages, LLaVa’s overall performance remained weaker than that of the other models. This disparity is primarily due to differences between the pre-training and fine-tuning data domains, particularly cross-language issues. Given that LLaVa predominantly relies on English pre-training data, its adaptability and generalization to Chinese contexts are constrained.