Evolution of AI agent architectures and the challenges of optimizing computational resources in online learning services

Authors

A. Bordiian

DOI:

https://doi.org/10.18372/2073-4751.86.21268

Keywords:

AI agent, online learning, LLM, inference, scalability, QoS, SLA, observability, autoscaling

Abstract

This article presents a systematic review of how AI agent architectures have evolved in distributed intelligent systems, with a focus on resource intensity, scalability, and QoS/SLA assurance in online learning services. It traces the transition from monolithic implementations to multi-loop compositional architectures that separate inference, orchestration, memory, and observability. The study classifies resource optimization methods by the level at which they operate: the model, inference serving, autoscaling, flow management, and agent logic. It identifies a research gap: the lack of an integrated method that combines model-level optimization, inference serving, and autoscaling while treating the trajectory of an agent session as the primary unit of resource accounting and control, that is, aligning the token-and-tool profile of a session with admission, prioritization, and scaling policies in a multi-tenant environment. On this basis, the article formulates the preconditions for developing a method and a tool for optimizing the computational resources of AI agents in online learning services.

References

Yao, Shunyu, et al. ‘ReAct: Synergizing Reasoning and Acting in Language Models’. arXiv [Cs.CL], 2023, arxiv.org/abs/2210.03629. arXiv.

Schick, Timo, et al. ‘Toolformer: Language Models Can Teach Themselves to Use Tools’. arXiv [Cs.CL], 2023, arxiv.org/abs/2302.04761. arXiv.

Park, Joon Sung, et al. ‘Generative Agents: Interactive Simulacra of Human Behavior’. arXiv [Cs.HC], 2023, arxiv.org/abs/2304.03442. arXiv.

Wu, Qingyun, et al. ‘AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation’. arXiv [Cs.AI], 2023, arxiv.org/abs/2308.08155. arXiv.

Kwon, Woosuk, et al. ‘Efficient Memory Management for Large Language Model Serving with PagedAttention’. Proceedings of the 29th Symposium on Operating Systems Principles, Association for Computing Machinery, 2023, pp. 611–626, https://doi.org/10.1145/3600006.3613165. SOSP ’23.

Ray Project. (2026). Ray Serve: Scalable and programmable serving (documentation).

Schroeder, Christian, et al. ‘Comparison of Autoscaling Frameworks for Containerised Machine-Learning-Applications in a Local and Cloud Environment’. arXiv [Cs.DC], 2024, arxiv.org/abs/2311.18659. arXiv.

Tina Lekshmi Kanth. ‘Predictive autoscaling in Kubernetes microservices with KEDA and time series forecasting’. Journal of Information Systems Engineering and Management, 2026, pp. 151–163, e-ISSN:2468-4376.

KEDA. Kubernetes event-driven autoscaling / URL: https://keda.sh/

OpenTelemetry. Project and roadmap update from KubeCon / URL: https://opentelemetry.io/blog/2022/kubecon-na-project-update/

OpenTelemetry. Sampling milestones (Tracing specification update) / URL: https://opentelemetry.io/blog/2025/sampling-milestones/

OpenTelemetry. Observability primer (concepts) / URL: https://opentelemetry.io/docs/concepts/observability-primer/

Li, Baolin, et al. ‘LLM Inference Serving: Survey of Recent Advances and Opportunities’. arXiv [Cs.DC], 2024, arxiv.org/abs/2407.12391. arXiv.

Su, Qidong, et al. ‘Seesaw: High-Throughput LLM Inference via Model Re-Sharding’. arXiv [Cs.DC], 2025, arxiv.org/abs/2503.06433. arXiv.

Agrawal, Amey, et al. ‘Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve’. Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, USENIX Association, 2024. OSDI’24.

Feng, Jingqi, et al. ‘WindServe: Efficient Phase-Disaggregated LLM Serving with Stream-Based Dynamic Scheduling’. Proceedings of the 52nd Annual International Symposium on Computer Architecture, Association for Computing Machinery, 2025, pp. 1283–1295, https://doi.org/10.1145/3695053.3730999. ISCA ’25.

Yu, Gyeong-In, et al. ‘Orca: A Distributed Serving System for Transformer-Based Generative Models’. 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), USENIX Association, 2022, pp. 521–538, www.usenix.org/conference/osdi22/presentation/yu.

Yan, Minghao, et al. ‘Decoding Speculative Decoding’. arXiv [Cs.LG], 2025, arxiv.org/abs/2402.01528. arXiv.

Xiao, Guangxuan, et al. ‘SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models’. Proceedings of the 40th International Conference on Machine Learning, JMLR.org, 2023. ICML’23.

Frantar, Elias, et al. ‘GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers’. arXiv [Cs.LG], 2023, arxiv.org/abs/2210.17323. arXiv.

Lin, Ji, et al. ‘AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration’. arXiv [Cs.CL], 2024, arxiv.org/abs/2306.00978. arXiv.

Hong, Ke, et al. ‘Semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage’. arXiv [Cs.CL], 2025, arxiv.org/abs/2504.19867. arXiv.

Delavande, Julien, et al. ‘Understanding Efficiency: Quantization, Batching, and Serving Strategies in LLM Energy Use’. arXiv [Cs.LG], 2026, arxiv.org/abs/2601.22362. arXiv.

Sheng, Ying, et al. ‘Fairness in Serving Large Language Models’. Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, USENIX Association, 2024. OSDI’24.

Khan, Redwan Ibne Seraj, et al. ‘Ensuring Fair LLM Serving Amid Diverse Applications’. arXiv [Cs.LG], 2024, arxiv.org/abs/2411.15997. arXiv.

Xu, Minxian, et al. ‘Auto-Scaling Approaches for Cloud-Native Applications: A Survey and Taxonomy’. arXiv [Cs.DC], 2025, arxiv.org/abs/2507.17128. arXiv.

Jeong, Byeonghui, and Young-Sik Jeong. ‘Autoscaling Techniques in Cloud-Native Computing: A Comprehensive Survey’. Computer Science Review, vol. 58, 2025, p. 100791, https://doi.org/10.1016/j.cosrev.2025.100791

Guo, Yunda, et al. ‘PASS: Predictive Auto-Scaling System for Large-Scale Enterprise Web Applications’. Proceedings of the ACM Web Conference 2024, Association for Computing Machinery, 2024, pp. 2747–2758, https://doi.org/10.1145/3589334.3645330. WWW ’24.

Feng, Binbin, and Zhijun Ding. ‘Application-Oriented Cloud Workload Prediction: A Survey and New Perspectives’. Tsinghua Science and Technology, vol. 30, no. 1, 2025, pp. 34–54, https://doi.org/10.26599/TST.2024.9010024.

Muhammad Herwindra Berlian. A Systematic Review on Cloud Sizing Automation for PaaS Using Historical Workload Analysis. TechRxiv. November 20, 2025.

ZargarAzad, M., et al. (2023). An auto-scaling approach for microservices in cloud environments. The Journal of Supercomputing.

OpenTelemetry. (2025). Metrics semantic conventions (specification) / URL: https://opentelemetry.io/docs/specs/semconv/general/metrics/

OpenTelemetry. (2024–2025). Semantic conventions for generative AI systems. OpenTelemetry Specification / URL: https://opentelemetry.io/docs/specs/semconv/gen-ai/

OpenTelemetry. (2025). AI agent observability: Evolving standards and best practices. OpenTelemetry Blog / URL: https://opentelemetry.io/blog/2025/ai-agent-observability/

Srinivasa, Rakshith S., et al. ‘TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models’. arXiv [Cs.LG], 2025, arxiv.org/abs/2510.02663. arXiv.

Macina, Jakub, et al. ‘MathTutorBench: A Benchmark for Measuring Open-Ended Pedagogical Capabilities of LLM Tutors’. arXiv [Cs.CL], 2025, arxiv.org/abs/2502.18940. arXiv.

Maurya, Kaushal Kumar, et al. ‘Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors’. arXiv [Cs.CL], 2025, arxiv.org/abs/2412.09416. arXiv.

Liu, Xiao, et al. ‘AgentBench: Evaluating LLMs as Agents’. arXiv [Cs.AI], 2025, arxiv.org/abs/2308.03688. arXiv.

Liu, Yang, et al. ‘G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment’. arXiv [Cs.CL], 2023, arxiv.org/abs/2303.16634. arXiv.

Chiang, Wei-Lin, et al. ‘Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference’. Proceedings of the 41st International Conference on Machine Learning, JMLR.org, 2024. ICML’24.

Wang, Guanzhi, et al. ‘Voyager: An Open-Ended Embodied Agent with Large Language Models’. arXiv [Cs.AI], 2023, arxiv.org/abs/2305.16291. arXiv.

Published

2026-05-30

How to Cite

Bordiian, A. (2026). Evolution of AI agent architectures and the challenges of optimizing computational resources in online learning services. Problems of Informatization and Control, 2(86), 5–14. https://doi.org/10.18372/2073-4751.86.21268

Issue

Section

Articles