🔺

hf daily

March 4 | 26 papers

🏷️ Topics
🧹 #3d #agents #agi #alignment #architecture #audio #benchmark #cv #data #dataset #diffusion #ethics #games #graphs #hallucinations #healthcare #inference #interpretability #leakage #long_context #low_resource #machine_translation #math #multilingual #multimodal #open_source #optimization #plp #rag #reasoning #rl #rlhf #robotics #science #security #small_models #story_generation #survey #synthetic #training #transfer_learning #video
1

🎯 Enhancing LLM Efficiency with Confidence-Based Sampling

🔺 4. Efficient Test-Time Scaling via Self-Calibration

published on February 25

This paper discusses how to improve the efficiency of Large Language Models (LLMs) at test time by using model confidence to guide response sampling. Traditional methods like Best-of-N sampling and Self-Consistency draw a fixed number of responses per query, which can waste compute on easy questions or under-explore complex ones. The authors propose a technique called Self-Calibration, which gives LLMs more reliable confidence estimates by distilling confidence signals from the model's own sampled responses. By implementing confidence-based strategies such as Early-Stopping, the paper shows that it is possible to enhance accuracy while reducing unnecessary computation, particularly in challenging tasks like MathQA.
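
A minimal sketch of the confidence-based Early-Stopping idea (not the paper's exact algorithm; `generate_with_confidence` is a hypothetical helper returning an answer together with the model's self-calibrated confidence):

```python
def early_stopping_sample(prompt, generate_with_confidence, threshold=0.9, max_samples=16):
    """Draw responses one at a time; stop as soon as the model is confident enough.

    generate_with_confidence(prompt) -> (answer, confidence) is assumed to expose the
    calibrated confidence the paper distills into the model. If no single response is
    confident enough, fall back to a confidence-weighted vote over all samples.
    """
    votes = {}
    for _ in range(max_samples):
        answer, confidence = generate_with_confidence(prompt)
        if confidence >= threshold:           # confident enough: stop early, save compute
            return answer
        votes[answer] = votes.get(answer, 0.0) + confidence
    return max(votes, key=votes.get)          # weighted self-consistency fallback
```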

...
Carnegie Mellon University, University of Washington, Washington University in St. Louis
#optimization #training #inference
2

🕸️ Strengthening Web AI Agents Against Vulnerabilities

🔺 1. Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis

published on February 27

This paper explores why Web AI agents are more vulnerable than standalone Large Language Models (LLMs), even though both are built on the same safety-aligned base models. The research finds that Web AI agents are more susceptible to adversarial inputs because of their flexibility and the complexity of their tasks, and it highlights three factors that amplify this vulnerability: the integration of user goals into prompts, the generation of multi-step actions, and the need for observational capabilities. The study proposes a detailed evaluation framework to better understand these vulnerabilities and suggests strategies for improving the security and robustness of AI agents.

...
University of Maryland
#benchmark #security #agents
3

🧠 Disentangling Knowledge and Reasoning for Robust AI

🔺 3. General Reasoning Requires Learning to Reason from the Get-go

published on February 26

This paper discusses the limitations of Large Language Models (LLMs) in achieving robust reasoning capabilities, which are essential for artificial general intelligence (AGI). The authors identify that LLMs often overfit to their training data, leading to poor generalization in novel algorithmic tasks. They propose a solution that involves separating knowledge from reasoning by employing reinforcement learning (RL) and a structured curriculum of synthetic tasks. By enhancing reasoning functions and integrating a retrieval system with an external memory, the authors aim to improve LLMs' adaptability and performance in unfamiliar contexts.

...
Department of Psychology and Center for Brain Science, Harvard University, Improbable AI Lab, MIT
#agi #transfer_learning #architecture #rl #synthetic
4

🎙️ Revolutionizing Podcast Audio Generation with PodAgent

🔺 5. PodAgent: A Comprehensive Framework for Podcast Generation

published on March 1

This paper introduces PodAgent, a novel framework designed to enhance the generation of podcast-like audio programs. It addresses key challenges in content creation and voice production by employing a multi-agent system that includes a Host, Guest, and Writer for collaborative topic discussions. Additionally, PodAgent features a voice pool for effective voice-role matching and utilizes a large language model (LLM) to improve the expressiveness of the generated speech. The framework's performance is validated through comprehensive evaluation guidelines, showing significant improvements over existing methods, including a high voice-matching accuracy and more engaging conversational audio.

...
Microsoft, The Chinese University of Hong Kong, Xiaohongshu Inc.
#games #audio #interpretability #benchmark #optimization
5

🤖 Enhancing Uncertainty Estimation in LLMs for Better Decision-Making

🔺 16. When an LLM is apprehensive about its answers -- and when its uncertainty is justified

published on March 3

This paper explores how to measure uncertainty in Large Language Models (LLMs) when answering multiple-choice questions, which is important in critical areas where wrong answers can have serious effects. It compares two methods of uncertainty estimation: token-wise entropy and model-as-judge (MASJ), across various LLMs and topics. The findings reveal that while MASJ does not effectively predict errors, token-wise entropy is a better indicator of question difficulty, especially in knowledge-based subjects like biology. The study also highlights the need to refine MASJ and address biases in existing datasets to ensure fair evaluation of LLM performance across different reasoning requirements.
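
A minimal sketch of the token-wise entropy signal, assuming access to the per-token logits of a generated answer (plain NumPy, not tied to any specific model API):

```python
import numpy as np

def mean_token_entropy(logits: np.ndarray) -> float:
    """Average per-token entropy (in nats) over a generated sequence.

    logits: array of shape (seq_len, vocab_size) holding the model's raw scores
    at each generation step; a higher mean entropy signals a harder question.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)      # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)    # H_t = -sum_v p_v log p_v
    return float(entropy.mean())
```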

...
#ethics #hallucinations #benchmark #reasoning #data
6

🔀 Revolutionizing Data Mixing for Better Language Model Training

🔺 7. SampleMix: A Sample-wise Pre-training Data Mixing Strategy by Coordinating Data Quality and Diversity

published on March 3

This paper introduces SampleMix, a new method for mixing pretraining data for large language models (LLMs). Unlike traditional domain-wise approaches that sample uniformly within predefined domains, SampleMix uses a bottom-up strategy that evaluates the quality and diversity of individual samples across domains. This yields a more dynamic and better-optimized distribution of training data, addressing the failure of domain-wise mixing to account for inter-domain overlaps and sample-specific features. Experimental results show that SampleMix not only outperforms existing methods but also requires fewer training steps to achieve comparable performance.
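
A toy sketch of the sample-wise weighting idea, assuming each sample already carries quality and diversity scores in [0, 1] (the scoring models and the exact weighting rule in SampleMix are more involved):

```python
import numpy as np

def sampling_distribution(quality: np.ndarray, diversity: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Turn per-sample quality/diversity scores into a global sampling distribution.

    quality, diversity: arrays of shape (n_samples,) in [0, 1], pooled across all domains;
    alpha trades off the two signals. Returns probabilities that sum to 1.
    """
    score = alpha * quality + (1.0 - alpha) * diversity
    return score / score.sum()

# Usage sketch: draw the pretraining mixture at the sample level rather than per domain.
# rng = np.random.default_rng(0)
# chosen = rng.choice(n_samples, size=sample_budget, p=sampling_distribution(q, d))
```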

...
Meituan Group, Beijing, China, National Engineering Research Center for Software Engineering, Peking University, Beijing, China
#transfer_learning #training #optimization #data
7

⚡ Accelerating Ultra-Long Sequence Generation with TOKENSWIFT

🔺 7. From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens

published on February 26

This paper presents TOKENSWIFT, a new framework aimed at speeding up the generation of ultra-long sequences using large language models (LLMs). The authors identify key challenges such as model reloading, dynamic key-value management, and repetitive generation that slow down the process. By addressing these issues, TOKENSWIFT achieves over three times the speed of traditional methods while preserving the quality of the generated text. Experimental results show that this framework is effective across various model sizes and architectures, making it a significant advancement in the field of sequence generation.

...
NLCo Lab, BIGAI LUMIA Lab, Shanghai Jiao Tong University
#training #architecture #long_context #inference #optimization
8

🔀 Unlocking LLMs: The Power of Word Form in Understanding Scrambled Text

🔺 4. Word Form Matters: LLMs' Semantic Reconstruction under Typoglycemia

published on March 3

This paper explores how large language models (LLMs) understand scrambled words, similar to how humans do through a phenomenon called Typoglycemia. The authors introduce a new metric, SemRecScore, to measure how well LLMs can reconstruct meaning from scrambled text by focusing on word form and context. Their experiments reveal that LLMs primarily depend on word form for semantic reconstruction, utilizing specific attention heads to process this information. The findings suggest that incorporating more human-like, context-aware strategies could improve LLM performance in understanding language.
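
A small sketch of the Typoglycemia-style stimuli the study probes (first and last letters fixed, interior letters shuffled; the scramble ratios and the SemRecScore metric itself are defined in the paper):

```python
import random

def scramble_word(word: str, rng: random.Random) -> str:
    """Shuffle a word's interior letters while keeping its outer form intact."""
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

rng = random.Random(0)
print(" ".join(scramble_word(w, rng) for w in "semantic reconstruction under typoglycemia".split()))
```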

...
Fudan University, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
#training #interpretability #data #multimodal #alignment
9

🚀 Enhancing Generative Models with Direct Discriminative Optimization

🔺 2. Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator

published on March 3

This paper introduces Direct Discriminative Optimization (DDO), a new framework that enhances the performance of generative models by combining likelihood-based training with concepts from Generative Adversarial Networks (GANs). DDO addresses the limitations of maximum likelihood estimation (MLE) by using a discriminator that is parameterized through the likelihood ratio of a target model and a fixed reference model. This approach allows for efficient finetuning of pre-trained models without the need for joint training of generator and discriminator networks. The results show that DDO significantly improves the state-of-the-art performance in visual generation tasks, achieving lower FID scores on popular datasets like CIFAR-10 and ImageNet.
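
A simplified reading of the DDO objective, with the discriminator parameterized implicitly by the log-likelihood ratio between the finetuned model and the frozen reference (the paper's exact formulation and sampling scheme are not reproduced here):

```python
import torch.nn.functional as F

def ddo_style_loss(logp_model_real, logp_ref_real, logp_model_fake, logp_ref_fake):
    """Logistic GAN-style loss on the implicit discriminator r(x) = log p_theta(x) - log p_ref(x).

    *_real: log-likelihoods of real images under the finetuned model / frozen reference;
    *_fake: the same quantities for samples drawn from the reference model.
    softplus(-r) = -log sigmoid(r), so this pushes r up on real data and down on fakes.
    """
    r_real = logp_model_real - logp_ref_real
    r_fake = logp_model_fake - logp_ref_fake
    return F.softplus(-r_real).mean() + F.softplus(r_fake).mean()
```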

...
NVIDIA, The University of Texas at, Tsinghua University
#training #diffusion #cv #optimization
10

🎵 DiffRhythm: Fast and Scalable Song Generation with Latent Diffusion

🔺 18. DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion

published on March 3

This paper introduces DiffRhythm, a novel music generation model that utilizes latent diffusion techniques to create full-length songs with both vocal and accompaniment tracks. Unlike existing models that are limited to short segments or require complex architectures, DiffRhythm simplifies the process by needing only lyrics and a style prompt for song generation. It achieves high musical quality and intelligibility while significantly improving inference speed, generating songs in just ten seconds. The authors also emphasize the model's scalability and reproducibility by providing the complete training code and pre-trained model for further research.

...
Northwestern Polytechnical University, Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China
#diffusion #inference #dataset #open_source #audio
11

🏠 Revolutionizing Room Layout Estimation with Plane-DUSt3R

🔺 2. Unposed Sparse Views Room Layout Reconstruction in the Age of Pretrain Model

published on February 24

This paper presents Plane-DUSt3R, a new method for estimating room layouts from multiple images taken from different perspectives. It builds on the DUSt3R 3D foundation model, moving away from traditional multi-step processes to a more efficient end-to-end approach. By fine-tuning the model on a specific dataset, Plane-DUSt3R can accurately identify structural planes with minimal post-processing. The results show that this method not only surpasses existing techniques on synthetic data but also performs well on real-world images with varying styles.

...
Astribot, Hong Kong Center for Construction Robotics, The Hong Kong University of Science and Technology, Intellifusion Inc., MMLab, The Chinese University of Hong Kong, The Hong Kong University of Science and Technology (Guangzhou)
#3d #optimization #cv #synthetic
12

🎥 OneRec: Revolutionizing Recommendations with Generative Models

🔺 18. OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment

published on February 26

This paper introduces OneRec, a novel generative retrieval-based recommendation system that improves upon traditional retrieve-and-rank methods. Unlike existing systems that use generative models merely for selection, OneRec employs a unified generative model that encodes user behavior and generates personalized video recommendations in a session-wise manner. The model utilizes a sparse Mixture-of-Experts architecture to enhance capacity while maintaining efficiency, and incorporates an Iterative Preference Alignment module to optimize user preferences effectively. Experimental results show that OneRec significantly outperforms existing systems, leading to a notable increase in user engagement metrics such as watch-time.

...
KuaiShou Inc., Beijing, China
#alignment #rlhf #rag #games #training
13

🎵 Unlocking Machine Communication with Tonal Languages

🔺 1. AI-Invented Tonal Languages: Preventing a Machine Lingua Franca Beyond Human Understanding

published on March 2

This paper explores how large language models (LLMs) can create private tonal languages for communication between machines. It draws inspiration from the phenomenon of cryptophasia in twins and uses a character-to-frequency mapping system to encode ASCII characters into musical tones. Each character is assigned a unique frequency, allowing for efficient data transmission that exceeds human speech rates. The study provides a prototype that visualizes and plays back this encoding, addressing concerns about AI developing private languages and offering a framework for understanding and managing such systems.
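
A rough sketch of the character-to-frequency encoding, using an arbitrary linear mapping into the audible range (the paper's actual frequency assignment and tone durations may differ):

```python
import numpy as np

def char_to_freq(ch: str) -> float:
    """Assign each printable ASCII character a unique tone (illustrative mapping)."""
    return 300.0 + 10.0 * (ord(ch) - 32)          # roughly 300 Hz .. 1250 Hz

def encode_text(text: str, tone_s: float = 0.05, sr: int = 16_000) -> np.ndarray:
    """Concatenate one short sine burst per character into a waveform."""
    t = np.arange(int(tone_s * sr)) / sr
    return np.concatenate([np.sin(2 * np.pi * char_to_freq(c) * t) for c in text])

waveform = encode_text("hello agent")   # decode by estimating the peak frequency of each burst
```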

...
PeopleTec, Inc., Huntsville, AL
#security #ethics #audio #multimodal
14

🔢 Liger: Efficiently Transforming LLMs into Gated Linear Recurrent Models

🔺 13. Liger: Linearizing Large Language Models to Gated Recurrent Structures

published on March 3

This paper introduces Liger, a method for transforming pretrained large language models (LLMs) into gated linear recurrent models. Liger efficiently repurposes existing key matrix weights to create diverse gating mechanisms without adding extra parameters, thus avoiding the costly process of training new components from scratch. The approach employs lightweight fine-tuning techniques, specifically Low-Rank Adaptation (LoRA), to maintain the performance of the linearized models comparable to the original LLMs. Additionally, Liger incorporates a novel intra-layer hybrid attention mechanism, Liger Attention, which enhances the model's efficiency while achieving competitive results across various benchmarks.

...
Nanjing University, Shanghai AI Laboratory, South China University of Technology, The Chinese University of Hong Kong, The Hong Kong University of Science and Technology (Guangzhou)
#architecture #training #optimization #benchmark
15

⚡ Instant Query Results with SpeQL!

🔺 8. Speculative Ad-hoc Querying

published on March 2

This paper introduces SpeQL, a novel system designed to enhance the speed of SQL query execution on large datasets. By utilizing Large Language Models (LLMs), SpeQL predicts user queries even before they are fully typed, allowing for near-instantaneous results. It employs two main strategies: predicting the structure of queries for pre-compilation and creating smaller temporary tables that contain essential data for answering the final query. A user study demonstrated that SpeQL significantly reduced query latency and helped users identify data patterns more efficiently during exploratory analysis.

...
Amazon Web Services, Microsoft Research, The University of Texas at Austin
#dataset #data #benchmark
16

🏟️ Revolutionizing Code Evaluation with CodeArena

🔺 5. CodeArena: A Collective Evaluation Platform for LLM Code Generation

published on March 3

This paper discusses the impact of Large Language Models (LLMs) on code generation, highlighting their ability to understand both natural language and programming syntax, which enhances developer productivity. It identifies ongoing issues in evaluating LLM coding capabilities, such as benchmark leakage and limited access to evaluation systems. To overcome these challenges, the authors present CodeArena, an online framework that offers a collective evaluation mechanism to provide unbiased assessments of LLMs. CodeArena also features a public repository for solutions and test cases, along with APIs for easy integration into existing workflows.

...
ByteDance, Monash University, Nanyang Technological University, National University of Singapore, The University of Hong Kong
#dataset #benchmark #leakage #open_source
17

🤖 Enhancing Task Execution in Dynamic Environments with CLEA

🔺 2. CLEA: Closed-Loop Embodied Agent for Enhancing Task Execution in Dynamic Environments

published on March 2

This paper introduces the Closed-Loop Embodied Agent (CLEA), a new architecture designed to improve the performance of Large Language Models (LLMs) in dynamic environments. CLEA features an interactive task planner that creates subtasks based on real-time environmental data, allowing for better adaptability. Additionally, it includes a multimodal execution critic that evaluates the feasibility of actions and adjusts plans when unexpected changes occur. Experimental results show that CLEA significantly enhances task success and completion rates compared to traditional models, demonstrating its effectiveness in complex, real-world scenarios.

...
Guangdong Provincial Key Laboratory of Future Networks of Intelligence, The Chinese University of Hong Kong, Shenzhen, Harbin Engineering University, Harbin, Infused Synapse AI, Shenzhen, Institute of Medical Robotics, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, School of Science and Engineering (SSE), FNii-Shenzhen, Shenzhen Future Network of Intelligence Institute (FNii-Shenzhen)
#architecture #reasoning #open_source #robotics #agents
18

🔍 Enhancing User Experiences with Qilin: A Multimodal Dataset for S&R Services

🔺 10. Qilin: A Multimodal Information Retrieval Dataset with APP-level User Sessions

published on March 1

This paper introduces Qilin, a new multimodal information retrieval dataset designed to enhance search and recommendation (S&R) services in user-generated content communities. Qilin is unique as it includes diverse user sessions with various content types, such as image-text notes and videos, which can help in developing advanced multimodal neural retrieval models. Additionally, the dataset captures user feedback and contextual signals, allowing researchers to analyze user satisfaction and behavior more effectively. The findings from this research aim to improve S&R systems and contribute to the evolution of multimodal content platforms.

...
Tsinghua University, Xiaohongshu Inc.
#multimodal #dataset #rag
19

🎨 Kiss3DGen: Simplifying 3D Generation with 2D Diffusion Models

🔺 7. Kiss3DGen: Repurposing Image Diffusion Models for 3D Asset Generation

published on March 3

This paper presents Kiss3DGen, a novel framework that simplifies the process of generating and enhancing 3D objects by leveraging existing 2D image diffusion models. The approach involves fine-tuning a diffusion model to create a '3D Bundle Image', which consists of multiple views and normal maps that are essential for 3D reconstruction. By transforming the 3D generation challenge into a 2D image task, the method maximizes the use of knowledge from pretrained models, making it more efficient. The results show that Kiss3DGen not only generates high-quality 3D models but also supports advanced features like editing and texture enhancement.

...
#cv #diffusion #3d
20

🖼️ Enhancing 3D Reconstruction with Difix3D+

🔺 29. Difix3D+: Improving 3D Reconstructions with Single-Step Diffusion Models

published on March 3

This paper presents Difix3D+, a new method for improving 3D reconstruction and novel-view synthesis using single-step diffusion models. The core component, Difix, is an image diffusion model that enhances rendered views by removing artifacts caused by underconstrained areas in 3D representations. It plays a dual role by cleaning up pseudo-training views during reconstruction and acting as a neural enhancer during inference to eliminate residual artifacts. Difix3D+ is versatile, working with both Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), and it significantly improves the quality of 3D representations, achieving a 2x better FID score compared to existing methods.

...
NVIDIA, National University of Singapore, University of Toronto, Vector Institute
#3d #diffusion
21

🎬 Empowering Text-to-Video Models with User-Focused Data

🔺 3. VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation

published on March 3

This paper introduces VideoUFO, a novel video dataset designed to enhance text-to-video generative models by focusing on user-relevant topics. The dataset contains over 1.09 million video clips, each accompanied by both brief and detailed captions, ensuring minimal overlap with existing datasets. By clustering user prompts, the authors identified 1,291 specific topics to guide video retrieval from YouTube, which were then segmented into clips. Experiments show that models trained on VideoUFO significantly outperform existing models, particularly on challenging topics, highlighting the importance of tailored training data in machine learning applications.

...
University of Technology Sydney, Zhejiang University
#video #dataset #data #games #open_source
22

🔬 Revolutionizing Visual Learning with Reinforcement Fine-Tuning

🔺 43. Visual-RFT: Visual Reinforcement Fine-Tuning

published on March 3

This paper introduces Visual Reinforcement Fine-Tuning (Visual-RFT), a method that enhances large vision-language models (LVLMs) by using reinforcement learning to improve their performance on visual tasks. Visual-RFT generates multiple responses for each input and employs verifiable reward functions to optimize the model's policy, making it particularly effective in scenarios with limited fine-tuning data. The approach demonstrates significant improvements in tasks like fine-grained image classification and object detection, outperforming traditional supervised fine-tuning methods. Overall, Visual-RFT represents a novel, efficient way to fine-tune LVLMs, focusing on reasoning and adaptability in specific domains.
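
A minimal sketch of a verifiable reward for the detection setting, assuming the model's answer is parsed into a single bounding box (Visual-RFT's actual reward design covers more cases, e.g. classification accuracy and format rewards):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def detection_reward(pred_box, gold_box, format_ok: bool) -> float:
    """Verifiable reward: 0 for a malformed answer, otherwise the IoU with the ground truth."""
    return iou(pred_box, gold_box) if format_ok else 0.0
```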

...
Shanghai Artificial Intelligence Laboratory, Shanghai Jiaotong University, The Chinese University of Hong Kong
#multimodal #open_source #cv #optimization #rlhf
23

🧠 Compact Models, Superior Performance!

🔺 38. Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

published on March 3

The paper presents Phi-4-Mini and Phi-4-Multimodal, two advanced models designed for language and multimodal tasks. Phi-4-Mini, with 3.8 billion parameters, excels in math and coding tasks by utilizing a high-quality synthetic data approach and an expanded vocabulary of 200K tokens. Phi-4-Multimodal integrates text, vision, and audio inputs, employing innovative techniques like LoRA adapters for efficient multi-modal processing. Both models demonstrate superior performance compared to larger counterparts, showcasing their effectiveness in complex reasoning and diverse input scenarios.

...
Microsoft
#multimodal #small_models #data #agi #synthetic
24

🧠 Unlocking Self-Improvement in Language Models through Reasoning

🔺 13. Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

published on March 3

This paper explores how language models can improve their problem-solving abilities through a process called test-time inference, similar to human experts. It highlights the differences in performance between two models, Qwen-2.5-3B and Llama-3.2-3B, when trained with reinforcement learning (RL) on the game Countdown. The authors identify four cognitive behaviors—verification, backtracking, subgoal setting, and backward chaining—that are crucial for effective self-improvement in these models. They demonstrate that enhancing Llama with examples of these reasoning behaviors can significantly boost its performance, suggesting that the ability to reason is more important than simply providing correct answers.

...
Stanford University, SynthLabs
#training #optimization #rl #reasoning
25

🔍 Quality Over Quantity: Smart Data Selection for Language Models

🔺 5. Large-Scale Data Selection for Instruction Tuning

published on March 3

This paper investigates the importance of selecting high-quality training data for instruction-tuning language models. It reveals that many automated data selection methods do not perform better than random selection when scaling to larger datasets, which can include millions of samples. The study introduces a representation-based data selection method (RDS+) that consistently outperforms more complex approaches while being more efficient in terms of computational resources. The authors emphasize the need for a deeper examination of how these selection methods behave as the size of the data pools increases.
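
A generic sketch of representation-based selection in the spirit of RDS+ (cosine similarity of embeddings; the exact scoring and aggregation in the paper differ):

```python
import numpy as np

def select_top_k(pool_emb: np.ndarray, target_emb: np.ndarray, k: int) -> np.ndarray:
    """Pick the k pool samples whose embeddings are most similar to a small target set.

    pool_emb: (n_pool, d) embeddings of candidate instruction-tuning samples;
    target_emb: (n_target, d) embeddings of examples representative of the desired tasks.
    """
    pool = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    target = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    scores = (pool @ target.T).mean(axis=1)   # mean cosine similarity to the target set
    return np.argsort(-scores)[:k]            # indices of the top-k candidates
```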

...
Allen Institute for AI, University of Southern California, University of Washington
#data #open_source #optimization #dataset #training
26

🚀 DuoDecoding: Speeding Up Text Generation with Smart Model Deployment

🔺 8. DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting

published on March 2

This paper introduces DuoDecoding, a new method that speeds up text generation with large language models (LLMs) while preserving output quality. It follows a draft-then-verify scheme in which a lightweight draft model proposes tokens and the target model verifies them, organized so that the time to the first generated token stays low. By running the draft model on the CPU while the target model runs on the GPU, DuoDecoding lets the two stages proceed in parallel for faster, more efficient decoding. The results show that this method significantly speeds up text generation without sacrificing quality, achieving notable improvements across various tasks.
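
A toy sketch of the draft-then-verify loop (greedy verification with hypothetical `draft_next` / `target_next` callables; the CPU/GPU parallelism and multi-sequence drafting that DuoDecoding adds are not modelled here):

```python
def speculative_decode(prompt, draft_next, target_next, max_new_tokens=128, k=4):
    """Greedy draft-then-verify decoding.

    draft_next(tokens) and target_next(tokens) return the next token id under the
    draft and target models; in DuoDecoding the draft model would run on the CPU
    and the target model on the GPU so the two stages overlap.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        draft = []
        for _ in range(k):                          # 1) cheap model proposes k tokens
            draft.append(draft_next(tokens + draft))
        for i, proposed in enumerate(draft):        # 2) target model checks each proposal
            verified = target_next(tokens + draft[:i])
            if verified != proposed:                # first mismatch: keep the target's token
                tokens.append(verified)
                break
            tokens.append(proposed)
    return tokens
```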

...
Fudan University, Shanghai AI Laboratory
#inference #training #optimization