hf daily

February 28 | 21 papers

🏷️ Topics
#3d #agents #agi #alignment #architecture #audio #benchmark #cv #data #dataset #diffusion #ethics #games #graphs #hallucinations #healthcare #inference #interpretability #leakage #long_context #low_resource #machine_translation #math #multilingual #multimodal #open_source #optimization #plp #rag #reasoning #rl #rlhf #robotics #science #security #small_models #story_generation #survey #synthetic #training #transfer_learning #video
1

🧠 PlanGEN: Enhancing Planning with Adaptive Verification and Selection

🔺 4. PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving

published on February 22

The paper introduces PlanGEN, a multi-agent framework designed to tackle complex planning and reasoning problems. It combines three main components: constraint agents, verification agents, and selection agents, which work together to enhance the performance of inference-time algorithms. By using constraint-guided iterative verification, PlanGEN improves the effectiveness of algorithms like Best of N and Tree-of-Thought, and it adapts algorithm selection to the complexity of each instance, leading to significant performance gains across various benchmarks.
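
A minimal sketch of the constraint-generate-verify-select loop. The agent bodies below are hypothetical stand-ins, not the paper's prompts; PlanGEN drives real LLMs for each role:

```python
import random

# Hypothetical stand-ins for the LLM-backed agents; these bodies exist only to
# make the control flow runnable.
def constraint_agent(problem: str) -> list[str]:
    """Extract instance-specific constraints from the problem statement."""
    return [f"respect: {part.strip()}" for part in problem.split(";")]

def plan_generator(problem: str, seed: int) -> str:
    random.seed(seed)
    return f"candidate-plan-{random.randint(0, 999)}"

def verification_agent(plan: str, constraints: list[str]) -> float:
    """Score a candidate plan against the constraints (random here)."""
    return random.random()

def selection_agent(problem: str) -> str:
    """Pick an inference-time algorithm from estimated instance complexity."""
    return "tree_of_thought" if len(problem) > 200 else "best_of_n"

def best_of_n(problem: str, n: int = 5, threshold: float = 0.9) -> str:
    """Constraint-guided Best of N: generate, verify, keep the best candidate."""
    constraints = constraint_agent(problem)
    best_plan, best_score = "", float("-inf")
    for i in range(n):
        plan = plan_generator(problem, seed=i)
        score = verification_agent(plan, constraints)
        if score > best_score:
            best_plan, best_score = plan, score
        if best_score >= threshold:   # early exit once verification passes
            break
    return best_plan

problem = "schedule 5 meetings; respect time zones; no overlaps"
print(selection_agent(problem), "->", best_of_n(problem))
```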

...
Arizona State University, Google
#inference #benchmark #reasoning #agents #optimization
2

🖼️ xAR: Redefining Tokens for Enhanced Generative Modeling

🔺 9. Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation

published on February 27

This paper introduces xAR, a novel autoregressive modeling framework that redefines the concept of a 'token' in generative tasks. Instead of using traditional discrete symbols, xAR allows for flexible prediction units, such as patches, groups of patches, or even entire images, enhancing the model's ability to capture complex spatial structures. The authors address the issue of exposure bias in AR models by employing Noisy Context Learning, which trains the model on noisy entities rather than relying solely on ground truth tokens. As a result, xAR not only improves performance on benchmarks like ImageNet-256 but also achieves faster inference times compared to existing models.
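
A sketch of the Noisy Context Learning idea as described: perturb the ground-truth context during training so the model learns to predict the next unit from imperfect inputs. The uniform noise-level schedule is an assumption:

```python
import torch

def noisy_context(context: torch.Tensor, max_sigma: float = 0.5) -> torch.Tensor:
    """Perturb ground-truth context entities with Gaussian noise so the model
    learns to predict the next unit from imperfect context, shrinking the
    train/inference gap behind exposure bias."""
    sigma = torch.rand(context.shape[0], 1, 1, device=context.device) * max_sigma
    return context + sigma * torch.randn_like(context)

# Toy usage: 4 sequences of 16 patch-group embeddings of dim 64; the noisy
# version replaces the clean context during training.
ctx = torch.randn(4, 16, 64)
ctx_noisy = noisy_context(ctx)
print(ctx_noisy.shape)
```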

...
ByteDance, Johns Hopkins University
#training #benchmark #optimization #cv #architecture
3

🛡️ Enhancing Security for Autonomous AI Agents Against Advanced Threats

🔺 8. Guardians of the Agentic System: Preventing Many Shots Jailbreak with Agentic System

published on February 23

This paper discusses the security challenges faced by autonomous AI agents that use large language models (LLMs). It highlights the inadequacy of static guardrails against advanced attacks like many-shot jailbreaking and deceptive alignment, which threaten the trust and safety of these systems. The authors propose new evaluation frameworks to enhance the security of LLM-based agents, employing methods such as Reverse Turing Tests and multi-agent simulations to detect and counteract threats. Their findings indicate a need for flexible security systems that can adapt to ongoing attacks, ensuring reliable and safe operational deployment of AI agents.

...
North South University, Dhaka, University of Dhaka, Dhaka
#security #agents #inference #benchmark
4

🧠 Unlocking Relation-Specific Neurons in Language Models

🔺 5. On Relation-Specific Neurons in Large Language Models

published on February 24

This paper explores how certain neurons in large language models (LLMs) can specifically encode knowledge about relations, independent of the entities involved. The authors hypothesize that these relation-specific neurons help the model recognize and generate text involving particular relationships. Through experiments with the Llama-2 model, they demonstrate that deactivating these neurons degrades the model's performance on tasks involving the corresponding relations. The study reveals three key properties of these neurons: their effects are cumulative, some neurons are shared across multiple relations (versatility), and deactivating the neurons for one relation can even improve the model's performance on other relations (interference).
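
A sketch of the deactivation probe. The paper works with Llama-2; GPT-2 is substituted here purely because it is small and public, and the neuron indices are placeholders for units that would be identified by attribution:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

LAYER, NEURONS = 6, [13, 421, 1890]   # placeholder indices, not attributed ones

def zero_neurons(module, inputs, output):
    output[..., NEURONS] = 0.0        # deactivate the selected hidden units
    return output

# Hook the MLP up-projection of one transformer block; zeroing pre-activations
# silences those neurons for the whole forward pass.
handle = model.transformer.h[LAYER].mlp.c_fc.register_forward_hook(zero_neurons)

ids = tok("Paris is the capital of", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits
handle.remove()
# Compare the top prediction with and without the hook to measure the effect.
print(tok.decode([logits[0, -1].argmax().item()]))
```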

...
Bosch Center for Artificial Intelligence, Center for Information and Language Processing, LMU Munich, Google DeepMind, Zürich, Switzerland, Munich Center for Machine Learning (MCML), Sorbonne Université, CNRS, ISIR, France, Technical University of Munich
#transfer_learning #architecture #open_source #data #interpretability
5

🖼️ Revolutionizing Image Generation with Flow Matching in Consistency Training

🔺 3. Training Consistency Models with Variational Noise Coupling

published on February 25

This paper introduces a new approach to Consistency Training (CT) for image generation, leveraging the Flow Matching framework. The authors address the issues of high variance and instability in non-distillation CT by proposing a noise-coupling scheme inspired by Variational Autoencoders (VAE). Their method involves training a data-dependent noise emission model, which helps to learn the noise-to-data mapping more effectively. The results demonstrate that their approach significantly improves generative performance, achieving state-of-the-art results on CIFAR-10 and competitive results on ImageNet.
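
A minimal sketch of one plausible reading of the data-dependent noise emission model: a VAE-style encoder that couples each data point to its own noise distribution, regularized toward the standard normal prior. The MLP, its width, and the interpolation coefficients are assumptions:

```python
import torch
import torch.nn as nn

class NoiseCoupler(nn.Module):
    """Data-dependent noise emission q(z|x) = N(mu(x), diag(sigma(x)^2)),
    regularized toward N(0, I) so prior sampling still works at test time."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(),
                                 nn.Linear(hidden, 2 * dim))

    def forward(self, x: torch.Tensor):
        mu, logvar = self.net(x).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()
        return z, kl

# Toy usage: couple each (flattened) image to its own noise sample, then build
# a noised point on the path; 0.7/0.3 stand in for the schedule's coefficients.
x = torch.randn(8, 32)
coupler = NoiseCoupler(32)
z, kl = coupler(x)
x_t = 0.7 * x + 0.3 * z
print(x_t.shape, float(kl))
```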

...
#architecture #training #open_source #cv #optimization
6

🎥 Speeding Up Dynamic Scene Rendering with EDGS

🔺 2. Efficient Gaussian Splatting for Monocular Dynamic Scene Rendering via Sparse Time-Variant Attribute Modeling

published on February 27

This paper presents Efficient Dynamic Gaussian Splatting (EDGS), a method for rendering dynamic scenes from monocular videos more efficiently. The authors address the issue of redundant Gaussians in existing methods, which slow down rendering speeds and cause jittering in static areas. EDGS utilizes a sparse anchor-grid representation to model dynamic scenes, focusing on time-variant attributes only for deformable objects. Experiments show that EDGS enhances rendering speed and quality compared to previous techniques, making it a significant advancement in the field.

...
National University of Singapore
#cv #3d
7

🧠 Enhancing Trust in Medical AI with Transparent Reasoning

🔺 46. MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning

published on February 26

This paper presents MedVLM-R1, a medical vision-language model designed to make reasoning transparent in medical image analysis. Unlike traditional models that only provide final answers, MedVLM-R1 generates natural language explanations for its decisions, enhancing trust among clinicians. It uses a reinforcement learning approach that incentivizes the model to discover its own reasoning paths without relying on reasoning references or large supervised training sets, which helps avoid the overfitting associated with supervised fine-tuning. The model shows significant performance improvements on various imaging benchmarks, indicating its potential for reliable and interpretable AI in healthcare.
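
A sketch of the kind of rule-based reward such RL training typically combines: a format term for exposing the reasoning in explicit tags plus an accuracy term against ground truth. The tag names and the 0.5/1.0 weights are illustrative assumptions, not the paper's specification:

```python
import re

def reward(completion: str, gold: str) -> float:
    """Rule-based reward: format (did the model expose its reasoning?) plus
    accuracy (is the final answer right?)."""
    fmt = 1.0 if re.search(r"<think>.+</think>\s*<answer>.+</answer>",
                           completion, re.DOTALL) else 0.0
    m = re.search(r"<answer>(.+?)</answer>", completion, re.DOTALL)
    acc = 1.0 if m and m.group(1).strip().lower() == gold.strip().lower() else 0.0
    return 0.5 * fmt + 1.0 * acc

print(reward("<think>Opacity in the left lower lobe.</think><answer>B</answer>",
             "B"))   # 1.5
```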

...
Chair for AI in Healthcare and Medicine, Technical University of Munich (TUM) and TUM University Hospital, Germany, Data Science Institute, Imperial College London, UK, Department of Computing, Imperial College London, UK, Department of Engineering Science, University of Oxford, UK, Massachusetts General Hospital, Harvard Medical School, USA, School of Computer Science, University of Sheffield, UK
#reasoning #interpretability #rl #training #healthcare
8

🤖 ArtGS: Revolutionizing Articulated Object Reconstruction with 3D Gaussians

🔺 7. Building Interactable Replicas of Complex Articulated Objects via Gaussian Splatting

published on February 26

This paper presents ArtGS, a new method for building articulated objects in computer vision. It uses 3D Gaussians to effectively represent and align information across different states of multi-part objects, enhancing part-mesh reconstruction and dynamics modeling. The approach includes a skinning-inspired module that improves the learning of object articulation. Experiments show that ArtGS outperforms existing methods, providing better quality and efficiency in reconstructing complex articulated structures.

...
Peking University, State Key Laboratory of General Artificial Intelligence, BIGAI, Tsinghua University
#3d #cv #benchmark
9

🚀 NeoBERT: Redefining Bidirectional Models for Superior NLP Performance

🔺 21. NeoBERT: A Next-Generation BERT

published on February 26

This paper introduces NeoBERT, a new encoder model that enhances the capabilities of bidirectional models in natural language processing. Unlike previous models like BERT and RoBERTa, NeoBERT incorporates advanced architectural innovations and optimized pre-training techniques to improve performance. It is designed to be easily integrated into existing systems, featuring a compact size of 250 million parameters while supporting a longer context of 4,096 tokens. NeoBERT achieves state-of-the-art results on the MTEB benchmark, demonstrating its effectiveness compared to other leading models under the same fine-tuning conditions.

...
Canada CIFAR AI Chair, Chandar Research Lab, Mila Quebec AI Institute, Polytechnique Montréal, Royal Military College of Canada
#optimization #benchmark #architecture #long_context #open_source
10

🎨 Dream Engine: Unifying Text and Image Generation for Enhanced Control

🔺 14. Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think

published on February 27

This paper introduces Dream Engine, a new framework for generating images from text that allows for flexible control over the output. It combines advanced text encoders with Diffusion Transformer models to improve the alignment between text and images. The authors highlight the importance of using large multimodal models to create a shared representation space for better integration of multiple concepts. Their two-stage training method shows promising results, achieving competitive performance on established benchmarks.

...
Alibaba Group, Bainance Labs, Beijing Institute of Technology, Peking University, University of Washington
#benchmark #diffusion #multimodal #cv #training
11

🤖 Decoupling for Efficiency: DVPO Revolutionizes RLHF

🔺 9. Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance

published on February 24

This paper introduces Decoupled Value Policy Optimization (DVPO), a new framework for improving Reinforcement Learning from Human Feedback (RLHF) in large language models. DVPO uses a pretrained global value model (GVM) to provide fixed supervisory signals, which helps to decouple the value model from policy training. This decoupling reduces the computational complexity and instability associated with traditional actor-critic methods, leading to significant reductions in GPU memory usage and training time. Experimental results demonstrate that DVPO not only outperforms existing efficient RLHF methods but also achieves performance comparable to state-of-the-art Proximal Policy Optimization (PPO).
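
A toy sketch of the decoupling, assuming a frozen value model that scores every prefix and a simplified one-step advantage; the paper's exact objective may differ:

```python
import torch

@torch.no_grad()
def token_values(gvm, ids):
    """The pretrained GVM scores every prefix; frozen, so no critic updates."""
    return gvm(ids)                                # -> [batch, seq] values

def policy_loss(policy_logprobs_fn, gvm, ids):
    values = token_values(gvm, ids)                # fixed supervisory signal
    logprobs = policy_logprobs_fn(ids)             # [batch, seq] token log-probs
    adv = values[:, 1:] - values[:, :-1]           # per-token value gain
    return -(logprobs[:, 1:] * adv).mean()         # plain policy gradient

# Toy stand-ins so the sketch runs end to end.
gvm = lambda ids: torch.cumsum(ids.float(), dim=1) / 100.0
policy = lambda ids: torch.log_softmax(
    torch.randn(ids.shape[0], ids.shape[1], 50), dim=-1
).gather(-1, ids.unsqueeze(-1)).squeeze(-1)
ids = torch.randint(0, 50, (2, 8))
print(policy_loss(policy, gvm, ids))
```

Because the value model never changes during policy training, there is no interleaved critic optimization, which is where the memory and stability savings come from.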

...
Microsoft, School of Computer Science, Fudan University
#alignment #rl #training #rlhf #optimization
12

🧩 Enhancing Reasoning in Language Models with FINEREASON

🔺 21. FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving

published on February 27

This paper introduces FINEREASON, a new benchmark designed to evaluate the reasoning capabilities of large language models (LLMs) through logic puzzles. Unlike traditional benchmarks that focus solely on final-answer accuracy, FINEREASON emphasizes the importance of intermediate reasoning steps, allowing for a more detailed assessment of a model's reflective and corrective abilities. The benchmark includes two specific tasks: state checking and state transition, which help evaluate how models understand their current context and plan their next actions. The authors demonstrate that training on this benchmark can improve LLM performance in mathematical reasoning tasks by up to 5.1%.
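
A worked example of the state-checking task on a 4x4 Sudoku, the style of puzzle the benchmark draws on. The backtracking solver below is a toy oracle for "is this intermediate state still solvable":

```python
def state_check(board: list[list[int]]) -> bool:
    """State checking on a 4x4 Sudoku: can this intermediate board still reach
    a solution? FINEREASON asks models this kind of question at every step."""
    def ok(r, c, v):
        if v in board[r] or v in (row[c] for row in board):
            return False
        br, bc = 2 * (r // 2), 2 * (c // 2)            # 2x2 box of the 4x4 grid
        return all(board[i][j] != v
                   for i in (br, br + 1) for j in (bc, bc + 1))

    for r in range(4):
        for c in range(4):
            if board[r][c] == 0:
                for v in range(1, 5):
                    if ok(r, c, v):
                        board[r][c] = v                # state transition: place v
                        solvable = state_check(board)
                        board[r][c] = 0                # restore the input state
                        if solvable:
                            return True
                return False                           # dead state: unsolvable
    return True                                        # no empty cells: solved

print(state_check([[1, 0, 0, 0], [0, 0, 3, 0], [0, 0, 0, 2], [0, 4, 0, 0]]))
```

The companion state-transition task is the inner move itself: given a checkable state, propose a placement that keeps the puzzle solvable.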

...
DAMO Academy, Alibaba Group, Singapore, Nanyang Technological University, Singapore
#benchmark #reasoning #dataset #math
13

🔄 Seamless Video Creation from Text: Introducing Mobius!

🔺 11. Mobius: Text to Seamless Looping Video Generation via Latent Shift

published on February 27

Mobius is a new technique that creates seamless looping videos directly from text descriptions, without user annotations. It utilizes a pre-trained video latent diffusion model to generate these videos, ensuring that the start and end of the video connect smoothly. The method constructs a latent cycle and gradually shifts it across denoising steps, which allows for flexible video lengths while maintaining temporal consistency through multi-frame latent denoising. This approach enables the generation of dynamic, high-quality visuals and surpasses previous methods that relied on static images.
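
A toy sketch of the latent-shift idea: treat the frame latents as a ring and rotate it between denoising steps so the seam keeps moving and every adjacent pair, including last-to-first, gets denoised together. The denoiser is a no-op stand-in:

```python
import torch

def shifted_denoise(latents: torch.Tensor, steps: int, shift: int = 1):
    """Toy latent-shift loop. latents is [frames, C, H, W] treated as a ring:
    rotating it between denoising steps moves the seam, which is the core
    trick behind seamless looping."""
    denoise_step = lambda z, t: z * 0.98   # placeholder for the diffusion model
    for t in range(steps):
        latents = torch.roll(latents, shifts=shift, dims=0)   # move the seam
        latents = denoise_step(latents, t)
    return latents

loop = shifted_denoise(torch.randn(16, 4, 8, 8), steps=20)
print(loop.shape)   # frame 0 and frame 15 have been denoised as neighbors
```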

...
Chongqing University of Post and Telecommunications, China, GVC Lab, Great Bay University, China, Meituan, China, University of Macau, China
#diffusion #multimodal #video #open_source
14

🖼️ Bridging Visual Generation and Understanding with UniTok

🔺 18. UniTok: A Unified Tokenizer for Visual Generation and Understanding

published on February 27

This paper presents UniTok, a novel discrete visual tokenizer designed to improve the integration of visual generation and understanding. It addresses the challenges of representation disparity by using multi-codebook quantization, which enhances the capacity of discrete tokens without causing training instability. The method allows for better encoding of both fine-grained details and high-level semantics, leading to improved performance metrics. UniTok outperforms existing models, achieving higher accuracy and fidelity on tasks like image classification compared to traditional continuous tokenizers.
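
A sketch of multi-codebook quantization under assumed dimensions: each token latent is split into chunks and each chunk is matched against its own small codebook, so capacity grows multiplicatively without one huge, unstable codebook:

```python
import torch

def multi_codebook_quantize(z: torch.Tensor, codebooks: torch.Tensor):
    """z: [N, D] latents; codebooks: [K, V, D//K], i.e. K codebooks of V codes.
    Each D//K-dim chunk is matched to its nearest code, so the effective
    vocabulary is V**K while every individual codebook stays small."""
    K, V, d = codebooks.shape
    chunks = z.view(z.shape[0], K, d)                          # [N, K, d]
    dists = (chunks.unsqueeze(2) - codebooks.unsqueeze(0)).pow(2).sum(-1)
    idx = dists.argmin(-1)                                     # [N, K] indices
    quant = torch.stack([codebooks[k][idx[:, k]] for k in range(K)], dim=1)
    return quant.reshape(z.shape), idx

z = torch.randn(10, 64)
books = torch.randn(8, 256, 8)     # 8 codebooks x 256 codes -> 256**8 vocabulary
zq, codes = multi_codebook_quantize(z, books)
print(zq.shape, codes.shape)       # [10, 64], [10, 8]
```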

...
ByteDance Inc., Huazhong University of Science and Technology, The University of Hong Kong
#training #architecture #multimodal #cv #optimization
15

🚀 Flexibility in Diffusion: Reducing Compute with FlexiDiT

🔺 12. FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute

published on February 27

This paper addresses the high resource demands of Diffusion Transformers during image generation. The authors introduce a dynamic strategy that allows for flexible compute allocation during the denoising process, leading to the development of FlexiDiT models. These models can adapt to varying compute budgets while maintaining image quality, achieving over 40% reduction in FLOPs compared to traditional static models. Additionally, the approach is versatile and can be applied to video generation, achieving up to 75% less compute usage without sacrificing performance.
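
Back-of-the-envelope arithmetic for why patch size is such a strong lever, with assumed image size, VAE downsampling factor, and step split; the linear-in-tokens cost model is a simplification:

```python
def relative_cost(image: int = 256, p_small: int = 2, p_large: int = 4,
                  frac_large: float = 0.6) -> float:
    """Fraction of the static model's per-image compute when frac_large of the
    denoising steps run with the larger patch size. Token count falls with the
    square of patch size; per-step cost is modeled as linear in tokens, which
    ignores attention's quadratic term."""
    tokens = lambda p: (image // (8 * p)) ** 2    # assumes an 8x VAE downsample
    full, light = tokens(p_small), tokens(p_large)
    return (frac_large * light + (1 - frac_large) * full) / full

print(f"{relative_cost():.2f}x of the static model's FLOPs")   # 0.55x
```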

...
ETH Zurich, KAIST, Meta GenAI
#diffusion #cv #inference #video #optimization
16

🧠 Revolutionizing Machine Translation with Reasoning-Enhanced Frameworks

🔺 6. R1-T1: Fully Incentivizing Translation Capability in LLMs via Reasoning Learning

published on February 27

This paper presents R1-Translator (R1-T1), a new framework that enhances machine translation (MT) by incorporating reasoning during translation, similar to how human translators think. It addresses limitations of existing methods that either use fixed reasoning patterns or rely on supervised fine-tuning, which can lead to forgetting important information. The framework utilizes reinforcement learning (RL) to align translation with human reasoning through six expert-curated chain-of-thought (CoT) templates. Experimental results show that R1-T1 improves translation quality across multiple languages and tasks, particularly for languages not seen during training, while maintaining strong multilingual capabilities.

...
Huawei Canada, Canada, Huawei, China, Waseda University, Japan
#reasoning #training #machine_translation #rl #multilingual
17

🔀 Optimizing Multimodal Performance with Test-Time Re-Routing

🔺 36. R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

published on February 27

This paper addresses the performance gap in large multimodal models (LMMs) when processing non-language data compared to large language models (LLMs). The authors introduce a mixture-of-experts (MoE) approach to enhance the vision encoder, allowing for richer and more diverse representations. They identify that the router, which determines how to mix these expert representations, often fails to optimize routing weights effectively during testing. To solve this, they propose a method called Re-Routing in Test-Time (R2-T2), which fine-tunes routing weights based on nearby correctly predicted samples, significantly boosting the performance of LMMs on various challenging tasks without retraining the base model.
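
A simplified sketch of the neighborhood intuition: pull a test sample's routing weights toward those of nearby reference samples the model answered correctly. The paper's main strategy runs gradient descent on neighbor losses; the kernel-weighted average and the 0.5 mixing weight below are simplifications:

```python
import torch

def re_route(r_test, x_test, x_ref, r_ref, correct, tau: float = 0.1):
    """Kernel-weighted re-routing. x_*: embeddings, r_*: routing weight
    vectors, correct: bool mask over the reference set."""
    sim = torch.cosine_similarity(x_test.unsqueeze(0), x_ref, dim=-1)
    w = torch.softmax(sim / tau, dim=0) * correct.float()
    if w.sum() == 0:
        return r_test                      # no correct neighbors: keep router
    w = w / w.sum()
    r_new = (w.unsqueeze(-1) * r_ref).sum(dim=0)
    return 0.5 * r_test + 0.5 * r_new

r = re_route(torch.softmax(torch.randn(4), dim=0), torch.randn(32),
             torch.randn(100, 32), torch.softmax(torch.randn(100, 4), dim=-1),
             torch.rand(100) > 0.5)
print(r)
```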

...
Johns Hopkins University, University of Maryland, College Park
#multimodal #training #architecture #optimization #benchmark
18

🔬 Extending Context Length Without Compromise

🔺 23. LongRoPE2: Near-Lossless LLM Context Window Scaling

published on February 27

LongRoPE2 is a new method that extends the context window of large language models (LLMs) while preserving their performance on shorter contexts. It builds on the hypothesis that insufficient training in the higher dimensions of RoPE causes out-of-distribution issues when existing models extrapolate to long contexts. The method employs a RoPE rescaling algorithm that uses evolutionary search to find per-dimension scaling factors compensating for this under-training. Additionally, it uses a mixed context window training strategy to fine-tune model weights, enabling long-context sequences without sacrificing short-context performance.
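
A sketch of the quantity being searched: per-dimension rescaling factors applied to RoPE inverse frequencies. The linear ramp below is a placeholder that leaves fast (well-trained) dimensions nearly alone and stretches the slow ones; it is not a searched result:

```python
import torch

def rescaled_rope_freqs(dim: int = 128, base: float = 10000.0,
                        factors: torch.Tensor | None = None) -> torch.Tensor:
    """RoPE inverse frequencies with per-dimension rescaling, the quantity
    LongRoPE2's evolutionary search optimizes."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    if factors is None:
        factors = torch.linspace(1.0, 8.0, inv_freq.shape[0])  # placeholder ramp
    return inv_freq / factors              # larger factor -> longer wavelength

freqs = rescaled_rope_freqs()
angles = torch.outer(torch.arange(4096).float(), freqs)   # position-dim phases
print(angles.shape)                                       # [4096, 64]
```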

...
Microsoft, Shanghai Jiao Tong University, Zhejiang University
#training #architecture #benchmark #long_context
19

🔄 CODESYNC: Keeping Code Knowledge Fresh for LLMs

🔺 16. CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale

published on February 23

This paper addresses the limitations of Large Language Models (LLMs) in adapting to changes in third-party library APIs, which can lead to outdated or inefficient code. It introduces CODESYNC, a data engine designed to identify outdated code patterns and gather real-time updates from Python libraries. Additionally, the authors present CODESYNCBENCH, a benchmark for evaluating LLMs' performance in keeping up with code evolution, featuring 3,300 test cases across various tasks. The findings indicate that even advanced LLMs struggle with dynamic code changes, highlighting the need for improved methods for real-time code knowledge updating.
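
The kind of static scan such a data engine needs, sketched with Python's `ast` module on a real deprecation (pandas removed `DataFrame.append` in 2.0); the matching rule is deliberately crude:

```python
import ast

# Toy deprecation table; an engine like CODESYNC would derive such entries by
# diffing library versions at scale.
DEPRECATED = {"append": "DataFrame.append was removed; use pd.concat([...])"}

def scan(source: str) -> list[str]:
    """Flag attribute calls matching a known outdated API. The rule (any
    .append() on a bare name) would need type information to avoid false
    positives on plain lists."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)
                and node.func.attr in DEPRECATED
                and isinstance(node.func.value, ast.Name)):
            hits.append(f"line {node.lineno}: {DEPRECATED[node.func.attr]}")
    return hits

print(scan("import pandas as pd\ndf = pd.DataFrame()\ndf = df.append(row)\n"))
```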

...
Huazhong University of Science and Technology, Wuhan University, Zhejiang University
#dataset #open_source #data #optimization #benchmark
20

🔧 Enhancing Issue Resolution with SoRFT: A Cost-Effective Approach

🔺 7. SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning

published on February 27

This paper introduces Subtask-oriented Reinforced Fine-Tuning (SoRFT), a new method designed to improve the issue-resolving capabilities of large language models (LLMs). It breaks down the issue resolution process into specific subtasks, such as file and function localization, and code editing. The training process involves two stages: first, a supervised fine-tuning phase that uses filtered data, and second, a reinforcement learning phase that applies Proximal Policy Optimization (PPO) with rewards based on ground-truth data. The results show that models trained with SoRFT outperform existing open-source models, achieving state-of-the-art results while being more cost-effective and maintaining better privacy.
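
A sketch of one plausible shape for the ground-truth-based reward on the file-localization subtask: F1 overlap between the files the model proposes to edit and the files the reference patch actually touched. The paper's exact reward design may differ:

```python
def file_localization_reward(predicted: list[str], gold: list[str]) -> float:
    """F1 overlap between proposed files and files touched by the reference
    patch; consumed by PPO in the reinforcement learning stage."""
    p, g = set(predicted), set(gold)
    if not p or not g:
        return 0.0
    hit = len(p & g)
    if hit == 0:
        return 0.0
    precision, recall = hit / len(p), hit / len(g)
    return 2 * precision * recall / (precision + recall)

print(file_localization_reward(["src/db.py", "src/api.py"], ["src/db.py"]))  # ~0.67
```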

...
ByteDance, School of Computer Science, Peking University
#open_source #training #rlhf #rl #optimization
21

🤖 Empowering LLMs with Self-Rewarding Reasoning and Self-Correction

🔺 57. Self-rewarding correction for mathematical reasoning

published on February 26

This paper explores self-rewarding reasoning in large language models (LLMs), enabling them to generate and evaluate their own reasoning without needing outside feedback. The focus is on self-correction, where models can identify and fix their mistakes independently. The authors introduce a two-stage framework that first uses sequential rejection sampling to create data for training the models on self-rewarding and self-correction. The second stage enhances the models' accuracy assessment and output refinement through reinforcement learning, showing that their method outperforms traditional self-correction techniques.
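
A sketch of the stage-one data construction by sequential rejection sampling, with stub model calls; a trajectory is kept only when the model's self-verdict agrees with ground truth. The `[VERIFY]` output format is an assumption about the exact syntax:

```python
import random

# Stub samplers standing in for LLM calls.
def sample_attempt(q):   return random.choice(["42", "41"])
def self_evaluate(q, a): return random.choice(["[VERIFY] correct", "[VERIFY] wrong"])
def sample_fix(q, a):    return "42"

def build_trajectory(question: str, gold: str):
    """Keep a trajectory only when the self-verdict matches ground truth (and
    any correction actually fixes the answer), so the data teaches honest
    self-rewarding plus useful self-correction."""
    attempt = sample_attempt(question)
    verdict = self_evaluate(question, attempt)
    truly_correct = attempt == gold
    if verdict.endswith("correct") != truly_correct:
        return None                         # rejected: self-reward was wrong
    if truly_correct:
        return (question, attempt, verdict)
    fix = sample_fix(question, attempt)
    return (question, attempt, verdict, fix) if fix == gold else None

data = [t for t in (build_trajectory("6*7?", "42") for _ in range(100)) if t]
print(len(data), data[0])
```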

...
University of Illinois Urbana-Champaign, University of Maryland, College Park
#reasoning #training #inference #optimization #rl