Evolution of AI agent architectures and the challenges of optimizing computational resources in online learning services
DOI:
https://doi.org/10.18372/2073-4751.86.21268
Keywords:
AI agent, online learning, LLM, inference, scalability, QoS, SLA, observability, autoscaling
Abstract
The article presents a systematic review of the evolution of AI agent architectures in distributed intelligent systems, with a focus on resource intensity, scalability, and QoS/SLA assurance in online learning services. It analyzes the transition from monolithic implementations to multi-loop compositional architectures that separate inference, orchestration, memory, and observability. It groups resource optimization methods into classes by the level at which they act: the model, inference serving, autoscaling, flow management, and agent logic. The review identifies a research gap: no integrated method combines model-level optimization, inference serving, and autoscaling while treating the trajectory of an agent session as the primary unit of resource accounting and control. Such a method would align the token-and-tool profile of a session with admission, prioritization, and scaling policies in a multi-tenant environment. The article formulates preconditions for developing a method and a tool for optimizing the computational resources of AI agents in online learning services.
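To make the notion of a session as the unit of resource accounting more concrete, the following minimal sketch shows one way a token-and-tool profile could feed an admission decision. It is an illustration only, not the method the article proposes: the names `SessionProfile`, `TenantBudget`, and `admit_next_step`, the per-session limits, and the 80% soft threshold are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class SessionProfile:
    """Running token-and-tool profile of one agent session (hypothetical)."""
    tenant: str
    prompt_tokens: int = 0
    completion_tokens: int = 0
    tool_calls: int = 0

    def record_step(self, prompt_tokens: int, completion_tokens: int,
                    tool_calls: int = 0) -> None:
        # Accumulate the cost of one agent step into the session trajectory.
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens
        self.tool_calls += tool_calls

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


@dataclass
class TenantBudget:
    """Per-session limits for one tenant (assumed values, not from the article)."""
    max_tokens_per_session: int
    max_tool_calls_per_session: int


def admit_next_step(profile: SessionProfile, budget: TenantBudget,
                    estimated_tokens: int, soft_ratio: float = 0.8) -> str:
    """Return 'admit', 'deprioritize', or 'reject' for the next agent step."""
    projected = profile.total_tokens + estimated_tokens
    if (projected > budget.max_tokens_per_session
            or profile.tool_calls >= budget.max_tool_calls_per_session):
        return "reject"
    # Above the soft threshold the step is still served, but at lower priority.
    if projected > soft_ratio * budget.max_tokens_per_session:
        return "deprioritize"
    return "admit"


profile = SessionProfile(tenant="school-a")
budget = TenantBudget(max_tokens_per_session=4000, max_tool_calls_per_session=5)

profile.record_step(prompt_tokens=1200, completion_tokens=300, tool_calls=1)
first = admit_next_step(profile, budget, estimated_tokens=500)

profile.record_step(prompt_tokens=1000, completion_tokens=400)
second = admit_next_step(profile, budget, estimated_tokens=500)
third = admit_next_step(profile, budget, estimated_tokens=1500)

print(first, second, third)
```

In this sketch the decision depends on the accumulated trajectory of the session rather than on a single request, which is the shift in accounting unit that the abstract describes. A scaling policy could consume the same profiles in aggregate, but that part is left out here.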
License

This work is licensed under a Creative Commons Attribution 4.0 International License.