🔺

hf daily

March 17 | 26 papers

🏷️ Topics
🧹 #3d #agents #agi #alignment #architecture #audio #benchmark #cv #data #dataset #diffusion #ethics #games #graphs #hallucinations #healthcare #inference #interpretability #leakage #long_context #low_resource #machine_translation #math #multilingual #multimodal #open_source #optimization #plp #rag #reasoning #rl #rlhf #robotics #science #security #small_models #story_generation #survey #synthetic #training #transfer_learning #video
1

📄 SmolDocling: Compact and Powerful Document Conversion

🔺 11. SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

published on March 14

SmolDocling is a compact vision-language model designed for end-to-end document conversion. It generates DocTags, a universal markup format that captures the content, structure, and spatial location of document elements. Unlike traditional methods that use large models or complex pipelines, SmolDocling achieves high accuracy with only 256 million parameters. It performs well across various document types, including business and academic papers, and introduces new datasets for recognizing charts, tables, equations, and code.

...
HuggingFace, IBM Research
#small_models #open_source #cv #dataset #science
2

🎥 VAMBA: Efficient Video Processing with Linear Complexity

🔺 9. Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers

published on March 14

This paper introduces VAMBA, a hybrid Mamba-Transformer model that addresses the limitations of existing transformer-based large multimodal models (LMMs) in processing long video inputs. Unlike traditional methods that reduce video tokens and often lose information, VAMBA uses Mamba-2 blocks to encode video tokens with linear complexity, allowing it to handle over 1024 frames efficiently. The model reduces GPU memory usage by at least 50% during training and inference and nearly doubles training speed compared to standard transformers. Experimental results show that VAMBA not only improves accuracy on the LVBench benchmark but also performs well across a wide range of video understanding tasks.
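
To make the linear-complexity point concrete, here is a toy sketch (not the actual VAMBA code) contrasting a Mamba-style gated recurrence over video tokens, which costs O(L) per layer, with vanilla self-attention, which materializes an L x L score matrix; the gate values and dimensions below are made up purely for illustration.

```python
# Toy illustration (not the actual VAMBA architecture): linear-time recurrence
# over video tokens vs. quadratic self-attention.
import numpy as np

def linear_scan(x, a, b):
    """Simplified Mamba-style gated recurrence: h_t = a_t * h_{t-1} + b_t * x_t.
    Cost grows linearly with sequence length L."""
    h = np.zeros(x.shape[1])
    outputs = []
    for a_t, b_t, x_t in zip(a, b, x):
        h = a_t * h + b_t * x_t                  # one O(d) update per token
        outputs.append(h)
    return np.stack(outputs)

def self_attention(x):
    """Vanilla single-head self-attention; the L x L score matrix is the
    quadratic bottleneck for hour-long videos."""
    scores = x @ x.T / np.sqrt(x.shape[1])       # O(L^2 * d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

L, d = 2048, 64                                  # thousands of frame tokens
x = np.random.randn(L, d).astype(np.float32)
a = np.random.uniform(0.9, 0.999, size=(L, 1))   # decay gates
b = np.random.uniform(0.0, 0.1, size=(L, 1))     # input gates

y_linear = linear_scan(x, a, b)                  # memory/compute ~ O(L * d)
y_attn = self_attention(x)                       # memory/compute ~ O(L^2)
print(y_linear.shape, y_attn.shape)
```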

...
1.AI, M-A-P, University of Toronto, University of Waterloo, Vector Institute
#optimization #long_context #benchmark #architecture #video
3

🧠 Evaluating Reasoning Under Uncertainty in Large Models

🔺 2. Can Large Reasoning Models do Analogical Reasoning under Perceptual Uncertainty?

published on March 14

This paper evaluates two advanced Large Reasoning Models (LRMs), OpenAI's o3-mini and DeepSeek R1, on their ability to perform analogical reasoning using Raven's progressive matrices. The study uses the I-RAVEN dataset and its more challenging version, I-RAVEN-X, to test the models' generalization capabilities under visual uncertainties. The results show a significant drop in accuracy for both LRMs when faced with the more complex tasks, indicating their struggle with longer reasoning rules and perceptual noise. In contrast, a neuro-symbolic model, ARLC, demonstrates robust performance, maintaining high accuracy even under challenging conditions.

...
ETH Zürich, IBM Research - Zurich
#benchmark #reasoning #cv #dataset
4

🌳 Revolutionizing Mesh Generation with TreeMeshGPT

🔺 1. TreeMeshGPT: Artistic Mesh Generation with Autoregressive Tree Sequencing

published on March 14

TreeMeshGPT is a novel autoregressive Transformer model that generates artistic meshes from point clouds. It introduces a unique Autoregressive Tree Sequencing method, which builds a dynamic tree structure based on the adjacency of triangular faces, allowing for local mesh extension during generation. This approach not only simplifies the training process but also enhances the quality of the generated meshes by achieving a 22% compression rate through efficient tokenization of triangular faces. Additionally, TreeMeshGPT ensures better normal orientation, reducing the occurrence of flipped normals and improving overall mesh fidelity and detail compared to previous techniques.
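
The tree-sequencing idea can be illustrated with a small sketch: build face adjacency from shared edges, then traverse it depth-first with a stack so that each newly emitted face extends the mesh locally. This is a simplified stand-in for the paper's tokenizer, not its actual algorithm.

```python
# Toy sketch (not the paper's tokenizer): order triangle faces by traversing a
# tree built from shared-edge adjacency, so each new face extends the mesh locally.
from collections import defaultdict

faces = [(0, 1, 2), (1, 2, 3), (2, 3, 4), (0, 2, 4), (3, 4, 5)]

# Build face adjacency: two faces are neighbors if they share an edge.
edge_to_faces = defaultdict(list)
for fi, (a, b, c) in enumerate(faces):
    for edge in [(a, b), (b, c), (a, c)]:
        edge_to_faces[tuple(sorted(edge))].append(fi)

adjacency = defaultdict(set)
for shared in edge_to_faces.values():
    for fi in shared:
        adjacency[fi].update(f for f in shared if f != fi)

# Depth-first traversal with an explicit stack: a simplified stand-in for the
# dynamic tree that autoregressive tree sequencing expands one face at a time.
order, visited, stack = [], set(), [0]
while stack:
    fi = stack.pop()
    if fi in visited:
        continue
    visited.add(fi)
    order.append(faces[fi])
    stack.extend(sorted(adjacency[fi] - visited, reverse=True))

print(order)   # faces emitted in a locality-preserving sequence
```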

...
Garena, National University of Singapore, Sea AI Lab
#games #3d #optimization #architecture
5

🔮 VGGT: Revolutionizing 3D Scene Understanding with Speed and Efficiency

🔺 8. VGGT: Visual Geometry Grounded Transformer

published on March 14

VGGT is a feed-forward neural network designed to extract key 3D attributes from various views of a scene, such as camera parameters and depth maps. Unlike traditional models that focus on single tasks, VGGT efficiently handles multiple 3D computer vision tasks simultaneously. It operates quickly, reconstructing images in under one second while achieving superior results compared to methods that rely on post-processing. Additionally, VGGT can be used as a feature backbone to improve performance in related tasks like point tracking and novel view synthesis.

...
Meta AI, Visual Geometry Group, University of Oxford
#cv #open_source #3d #optimization
6

🎮 Segmenting Skills for Smarter Agents

🔺 2. Open-World Skill Discovery from Unsegmented Demonstrations

published on March 11

This paper presents a novel approach for segmenting long, unstructured demonstration videos into meaningful skill segments using self-supervised learning. The method, called Skill Boundary Detection (SBD), identifies transitions between skills by analyzing prediction errors from a pretrained action-prediction model. By applying this technique, the authors demonstrate significant improvements in the performance of agents trained on these segmented skills in the Minecraft environment. This approach allows for the effective use of diverse online video content to enhance the training of instruction-following agents without the need for manual labeling.
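
A minimal sketch of the boundary-detection idea, assuming per-timestep prediction errors from the pretrained action-prediction model are available: cut the demonstration wherever the error spikes well above its typical level. The spike rule below is a simplification, not the paper's exact criterion.

```python
# Minimal sketch of skill boundary detection via prediction-error spikes
# (the threshold rule is a simplified stand-in for the paper's criterion).
import numpy as np

def detect_boundaries(pred_errors, k=2.0, min_len=10):
    """Cut a demonstration where the action-prediction error spikes,
    suggesting a transition to a skill the model did not anticipate."""
    errors = np.asarray(pred_errors, dtype=np.float32)
    threshold = errors.mean() + k * errors.std()
    boundaries, last_cut = [], 0
    for t, err in enumerate(errors):
        if err > threshold and t - last_cut >= min_len:
            boundaries.append(t)
            last_cut = t
    return boundaries

# Synthetic example: low error within skills, a spike at each transition.
rng = np.random.default_rng(0)
errors = np.concatenate([rng.normal(0.10, 0.02, 120), [1.5],
                         rng.normal(0.12, 0.02, 200), [1.8],
                         rng.normal(0.09, 0.02, 150)])
print(detect_boundaries(errors))   # approximate skill boundaries, e.g. [120, 321]
```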

...
Peking University, University of California, Los Angeles
#games #video #agents #open_source
7

🗣️ Integrating Speech into Multilingual LLMs for Enhanced Performance

🔺 2. From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM

published on March 13

This paper discusses the enhancement of large language models (LLMs) by integrating speech as a new modality. The authors focus on multilingual LLMs, specifically TOWER, and propose a method to convert speech into a format that the model can understand. They introduce a new model called SPIRE, which can transcribe and translate English speech while preserving the original capabilities of TOWER. The research demonstrates that incorporating discretized speech as an additional language is a viable approach for adapting LLMs, and the authors provide their code and models for public use.

...
ELLIS Unit Lisbon, INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Instituto de Telecomunicações, NAVER LABS Europe, Paris-Saclay University, Sapienza University of Rome, Unbabel, University of Edinburgh
#multilingual #machine_translation #open_source #multimodal #low_resource
8

🧠 Fair and Effective Machine Unlearning for Diverse Data Groups

🔺 0. Group-robust Machine Unlearning

published on March 12

This paper introduces the concept of group-robust machine unlearning, which addresses the challenge of removing specific training data from a model while maintaining its performance across different groups. It highlights the issue of fairness when the data to be unlearned is not uniformly distributed, leading to performance degradation in dominant groups. The authors propose a novel method called MIU (Mutual Information-aware Machine Unlearning) that minimizes the mutual information between model features and group information, thus enhancing unlearning effectiveness. Through experiments on various datasets, MIU demonstrates superior performance compared to traditional methods, ensuring robust model performance even after unlearning.
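
The core objective can be sketched as follows, with the mutual-information term approximated by an adversarial group classifier, a common MI surrogate; the paper's actual MIU estimator, forgetting objective, and loss weighting may differ.

```python
# Hedged sketch of group-information-aware unlearning. The mutual-information
# term is approximated with an adversarial group classifier; MIU's actual
# estimator and objective are not reproduced here.
import torch
import torch.nn as nn
import torch.nn.functional as F

feature_dim, num_classes, num_groups = 32, 10, 4
encoder = nn.Sequential(nn.Linear(64, feature_dim), nn.ReLU())
classifier = nn.Linear(feature_dim, num_classes)
group_probe = nn.Linear(feature_dim, num_groups)   # adversary predicting the group

opt_model = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3)
opt_probe = torch.optim.Adam(group_probe.parameters(), lr=1e-3)

def unlearning_step(x_forget, x_retain, y_retain, groups, lam=1.0):
    # 1) Train the probe to predict group labels from features (MI surrogate).
    with torch.no_grad():
        z = encoder(torch.cat([x_forget, x_retain]))
    probe_loss = F.cross_entropy(group_probe(z), groups)
    opt_probe.zero_grad(); probe_loss.backward(); opt_probe.step()

    # 2) Update the model: keep accuracy on retained data, push forget-set
    #    predictions toward uniform, and make features uninformative of groups.
    z_all = encoder(torch.cat([x_forget, x_retain]))
    z_forget, z_retain = z_all[:len(x_forget)], z_all[len(x_forget):]
    retain_loss = F.cross_entropy(classifier(z_retain), y_retain)
    forget_loss = -F.log_softmax(classifier(z_forget), dim=-1).mean()  # maximize entropy
    mi_penalty = -F.cross_entropy(group_probe(z_all), groups)          # fool the probe
    loss = retain_loss + forget_loss + lam * mi_penalty
    opt_model.zero_grad(); loss.backward(); opt_model.step()
    return loss.item()

# Dummy batch just to show the call signature.
xf, xr = torch.randn(8, 64), torch.randn(16, 64)
yr = torch.randint(0, num_classes, (16,))
g = torch.randint(0, num_groups, (24,))
print(unlearning_step(xf, xr, yr, g))
```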

...
Fondazione Bruno Kessler, Inria Grenoble, Univ. Grenoble Alpes, LTCI, Telecom Paris, Institut Polytechnique de Paris, University of Trento
#training #dataset #ethics #data
9

🧩 Revolutionizing Visual Generation with Neighboring Autoregressive Modeling

🔺 5. Neighboring Autoregressive Modeling for Efficient Visual Generation

published on March 12

This paper introduces Neighboring Autoregressive Modeling (NAR), a new approach to visual generation that improves upon traditional autoregressive models by focusing on the spatial and temporal relationships between visual tokens. Instead of predicting the next token in a raster order, NAR uses a 'next-neighbor prediction' method, decoding tokens based on their proximity to an initial token. This allows for parallel processing of adjacent tokens, significantly speeding up the generation process. Experiments show that NAR achieves much higher throughput and better quality scores in image and video generation compared to existing methods, while also being more efficient in terms of training data usage.
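
A toy version of the decoding schedule, assuming generation starts from the top-left token and proximity is measured with Manhattan distance (the paper's exact ordering and dimension-wise prediction heads are more involved): tokens at equal distance form one parallel decoding step.

```python
# Toy next-neighbor decoding schedule for an image token grid (assumed start
# token and distance metric; NAR's actual scheme is more elaborate).
from collections import defaultdict

def nar_schedule(height, width):
    """Group token positions by distance to the initial token; every group can
    be decoded in one parallel step, so an H x W grid needs H + W - 1 steps
    instead of H * W sequential next-token predictions."""
    steps = defaultdict(list)
    for r in range(height):
        for c in range(width):
            steps[r + c].append((r, c))
    return [steps[d] for d in sorted(steps)]

schedule = nar_schedule(4, 4)
for step, positions in enumerate(schedule):
    print(f"step {step}: decode {positions} in parallel")
# 7 parallel steps for a 4x4 grid vs 16 raster-order next-token steps.
```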

...
Shanghai AI Laboratory, China, The University of Adelaide, Australia, Zhejiang University, China
#cv #video #games #benchmark #optimization
10

👕 Revolutionizing Body Fitting with Equivariant Tightness!

🔺 3. ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness

published on March 13

The paper introduces Equivariant Tightness Fitting for Clothed Humans (ETCH), a new method for fitting a body model to a 3D point cloud of a clothed human. ETCH improves upon traditional optimization methods by using a pipeline that leverages SE(3) equivariance to accurately map cloth surfaces to body shapes. This approach simplifies the fitting process by focusing on pose-invariant body features and regressing sparse body markers. Experimental results show that ETCH significantly enhances fitting accuracy for various clothing types and poses, outperforming existing methods in both tightness and shape accuracy.

...
Berkeley AI Research (BAIR), Max Planck Institute for Intelligent Systems, Westlake University
#3d #optimization #cv #open_source
11

🔍 Bridging the Gap in Material Retrieval with MaRI

🔺 1. MaRI: Material Retrieval Integration across Domains

published on March 11

This paper presents MaRI, a novel framework aimed at improving material retrieval for realistic 3D asset creation. It addresses the limitations of existing methods that rely on traditional image search techniques, which often fail to capture the unique properties of materials. By utilizing a contrastive learning strategy, MaRI creates a shared embedding space that aligns visual and material attributes, enhancing the retrieval process. The framework is supported by a comprehensive dataset of synthetic and real-world materials, demonstrating superior performance in accuracy and generalization across various retrieval tasks.
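
The shared-embedding-space idea can be illustrated with a generic InfoNCE-style loss between paired image and material embeddings; MaRI's actual encoders, projection heads, and training recipe are not reproduced here.

```python
# Generic InfoNCE-style contrastive alignment of image and material embeddings,
# shown only to illustrate the shared-embedding-space idea.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, material_emb, temperature=0.07):
    """Paired rows are positives; all other rows in the batch act as negatives."""
    image_emb = F.normalize(image_emb, dim=-1)
    material_emb = F.normalize(material_emb, dim=-1)
    logits = image_emb @ material_emb.t() / temperature      # (B, B) similarity
    targets = torch.arange(len(image_emb))                    # i-th image matches i-th material
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2         # symmetric, both directions

image_emb = torch.randn(32, 256)      # embeddings of rendered/real object images
material_emb = torch.randn(32, 256)   # embeddings of the corresponding materials
print(contrastive_loss(image_emb, material_emb).item())
```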

...
Fudan University, Peking University, Tencent Hunyuan3D, University of Electronic Science and Technology of China, University of Minnesota
#synthetic #optimization #dataset #cv #3d
12

🎥 Grounding Video Captions with Precision

🔺 10. Large-scale Pre-training for Grounded Video Caption Generation

published on March 13

This paper presents a new method for video captioning and object grounding, where objects mentioned in captions are linked to specific locations in the video using detailed bounding boxes. The authors introduce a large-scale automatic annotation technique that creates consistent bounding box annotations across video frames, resulting in a dataset called HowToGround1M for pre-training. They also develop a model named GROVE, which is trained on this dataset and further fine-tuned on a smaller, high-quality dataset called iGround, containing manually annotated videos. The results show that their approach outperforms existing methods on several benchmark datasets, highlighting the effectiveness of their pre-training and fine-tuning strategy.

...
Czech Institute of Informatics, Robotics and Cybernetics at the Czech Technical University in Prague, Inria, École normale supérieure, CNRS, PSL Research University
#cv #dataset #training #video #data
13

🛡️ ARMOR: Efficient Multimodal Mastery with Minimal Resources

🔺 6. ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy

published on March 9

The paper introduces ARMOR, a new framework designed to enhance multimodal understanding and generation in machine learning models. Unlike existing unified models that require extensive computational resources, ARMOR is resource-efficient and fine-tunes large language models to achieve its goals. It employs an asymmetric encoder-decoder architecture to seamlessly integrate text and image data, allowing for natural interleaved generation. Additionally, ARMOR utilizes a carefully curated dataset and a novel training algorithm to improve the multimodal capabilities of existing models while maintaining their understanding abilities.

...
Nankai University, Shanghai AI Laboratory, Shanghai Innovation Institute, University of Science and Technology of China, Wuhan University
#games #optimization #multimodal #dataset #training
14

🔬 Enhancing MLLM Reliability with ProJudgeBench

🔺 5. ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges

published on March 9

This paper addresses the challenges faced by multi-modal large language models (MLLMs) in solving scientific problems accurately. It introduces ProJudgeBench, a benchmark designed to evaluate the reasoning abilities of MLLM-based judges through a comprehensive set of test cases and detailed annotations. The study highlights a performance gap between open-source and proprietary models, indicating the need for improvement in the evaluation process. To enhance this, the authors propose ProJudge-173k, a dataset for instruction-tuning and a new fine-tuning strategy that promotes better reasoning in problem-solving.

...
HZAU, NKU, Shanghai AI Laboratory, Shanghai Innovation Institute, USTC, WHU
#interpretability #reasoning #multimodal #science #open_source
15

💊 TxAgent: Personalized Precision Therapeutics through Advanced AI Reasoning

🔺 10. TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools

published on March 14

The paper presents TxAgent, an advanced AI agent designed for precision therapeutics that generates personalized treatment recommendations. It utilizes multi-step reasoning and real-time biomedical knowledge retrieval from a comprehensive toolbox of 211 tools to analyze drug interactions and contraindications. TxAgent evaluates drug interactions at various levels and tailors treatment strategies based on individual patient characteristics, ensuring recommendations are evidence-based. The system outperforms existing models in drug reasoning tasks, achieving high accuracy and improving therapeutic decision-making by integrating clinical guidelines and real-world evidence.

...
Broad Institute of MIT and Harvard, Cambridge, MA, Cardiovascular Division, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, Department of Biomedical Informatics, Harvard Medical School, Boston, MA, Harvard Data Science Initiative, Cambridge, MA, Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Cambridge, MA, MIT Lincoln Laboratory, Lexington, MA
#alignment #healthcare #science #agents #reasoning
16

🚀 Efficient Few-Step Diffusion with TDM

🔺 4. Learning Few-Step Diffusion Models by Trajectory Distribution Matching

published on March 9

This paper presents a new method called Trajectory Distribution Matching (TDM) to improve the efficiency of diffusion models in generating images and videos. TDM combines the benefits of distribution matching and trajectory matching, allowing for fewer sampling steps while maintaining high image quality. The method introduces a data-free score distillation objective that aligns the learning process of a smaller model with a larger, more complex model. By decoupling learning targets for different sampling steps, TDM achieves state-of-the-art performance with significantly reduced training costs, making it suitable for applications like text-to-image and text-to-video generation.

...
Huawei Noah's Ark Lab
#diffusion #training #cv #video
17

🤖 Maximizing Data Efficiency in Robotic Learning with Adversarial Collection

🔺 31. Adversarial Data Collection: Human-Collaborative Perturbations for Efficient and Robust Robotic Imitation Learning

published on March 14

This paper introduces a new method called Adversarial Data Collection (ADC) to improve robotic manipulation by focusing on data efficiency. Instead of relying on large datasets, ADC enhances the quality of data through real-time interactions between humans and robots, allowing for dynamic adjustments during demonstrations. By using an adversarial approach, the framework captures a wide range of behaviors and challenges in fewer demonstrations, leading to better performance in unseen tasks. The results show that models trained with ADC can generalize better and recover from errors more effectively, even with significantly less data than traditional methods.

...
Agibot, Beihang University, MMLab, CUHK, Shanghai Jiao Tong University
#robotics #benchmark #dataset #optimization #open_source
18

🧠 Unlocking Efficiency: The Power of State Space Models

🔺 21. Technologies on Effectiveness and Efficiency: A Survey of State Spaces Models

published on March 14

State Space Models (SSMs) are gaining popularity as an efficient alternative to transformer models, especially for sequential data and longer contexts. This paper provides a comprehensive overview of SSMs, detailing their theoretical foundations, mathematical formulations, and comparisons with other model types. It organizes SSMs into three categories: the original SSM, the structured S4 model, and the selective Mamba model, emphasizing the techniques each introduces to improve performance. The goal is to guide researchers in understanding and exploring the potential of SSMs across a variety of applications.
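
The common core behind these models is a discretized linear state space recurrence; the minimal sketch below uses a simple Euler discretization, while the surveyed models differ mainly in how they parameterize, structure, and (for Mamba) input-select the matrices.

```python
# Minimal discretized linear state space model, the shared core behind S4/Mamba:
#   h_t = A_bar @ h_{t-1} + B_bar * x_t,    y_t = C @ h_t
# (single-input single-output, Euler-style discretization for illustration).
import numpy as np

def ssm_scan(x, A, B, C, dt=0.1):
    A_bar = np.eye(len(A)) + dt * A          # simple Euler discretization
    B_bar = dt * B
    h = np.zeros(len(A))
    ys = []
    for x_t in x:                            # linear in sequence length
        h = A_bar @ h + B_bar * x_t
        ys.append(C @ h)
    return np.array(ys)

state_dim = 8
rng = np.random.default_rng(0)
A = -np.diag(rng.uniform(0.1, 1.0, state_dim))   # stable diagonal dynamics
B = rng.standard_normal(state_dim)
C = rng.standard_normal(state_dim)
x = np.sin(np.linspace(0, 8 * np.pi, 200))       # a toy 1-D input sequence
y = ssm_scan(x, A, B, C)
print(y.shape)   # (200,) — one output per input step, computed in O(L)
```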

...
Department of Electronic Engineering, Tsinghua University, Robotics Institute, Carnegie Mellon University, Shanghai Artificial Intelligence Laboratory
#architecture #long_context #survey #math #training
19

🌊 FlowTok: Simplifying Cross-Modality Generation with 1D Tokens

🔺 12. FlowTok: Flowing Seamlessly Across Text and Image Tokens

published on March 13

This paper presents FlowTok, a novel framework for cross-modality generation that simplifies the transition between text and image modalities. Unlike traditional methods that rely on complex conditioning signals and denoising processes, FlowTok evolves directly between text and images by applying flow matching in a shared latent space. The framework encodes images into a compact 1D token representation, significantly reducing the latent space size and improving efficiency. FlowTok not only reduces memory usage and training resource requirements but also maintains competitive performance in generating images from text and vice versa.
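
A generic flow-matching training step between two sets of latents illustrates the "flow directly from text to image tokens" idea; the 1D tokenizer and FlowTok's actual velocity network are not shown, and the tiny MLP below is only a placeholder.

```python
# Generic flow-matching training step between paired latent sets, illustrating
# the direct text-to-image flow idea (not FlowTok's actual model).
import torch
import torch.nn as nn

latent_dim = 128
velocity_net = nn.Sequential(nn.Linear(latent_dim + 1, 256), nn.SiLU(),
                             nn.Linear(256, latent_dim))
optimizer = torch.optim.Adam(velocity_net.parameters(), lr=1e-4)

def flow_matching_step(text_latents, image_latents):
    """Sample a point on the straight path from text to image latents and train
    the network to predict the constant velocity (image - text) at that point."""
    t = torch.rand(len(text_latents), 1)                       # random time in [0, 1]
    z_t = (1 - t) * text_latents + t * image_latents           # linear interpolation
    target_velocity = image_latents - text_latents
    pred = velocity_net(torch.cat([z_t, t], dim=-1))
    loss = ((pred - target_velocity) ** 2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Dummy paired latents standing in for encoded captions and encoded images.
text_latents = torch.randn(16, latent_dim)
image_latents = torch.randn(16, latent_dim)
print(flow_matching_step(text_latents, image_latents))
```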

...
ByteDance Seed, Johns Hopkins University
#architecture #training #multimodal #diffusion
20

🧠 Unlocking Complex Relationships with Learnable Activations in Vision Transformers

🔺 8. Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?

published on March 13

Kolmogorov-Arnold networks (KANs) introduce learnable activation functions that can model complex data relationships. This paper explores the application of KANs in vision tasks by integrating them into vision Transformers (ViTs) through a novel learnable attention mechanism called Kolmogorov-Arnold Attention (KArAt). The authors also present a more efficient variant, Fourier-KArAt, which shows competitive performance on popular image datasets like CIFAR-10 and ImageNet-1K. The study emphasizes the importance of understanding learnable activations in advanced architectures, rather than solely focusing on efficiency.
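
A rough sketch of the idea: replace the fixed softmax over attention scores with a learnable activation built from a small Fourier basis with trainable coefficients. The basis size, normalization, and how KArAt is wired into a full ViT block only loosely follow the paper.

```python
# Hedged sketch of a learnable, Fourier-parameterized activation on attention
# scores in place of softmax (a loose illustration of the KArAt idea).
import torch
import torch.nn as nn

class FourierAttentionActivation(nn.Module):
    def __init__(self, num_frequencies=4):
        super().__init__()
        # Trainable coefficients of phi(s) = sum_k a_k sin(k s) + b_k cos(k s)
        self.a = nn.Parameter(torch.randn(num_frequencies) * 0.1)
        self.b = nn.Parameter(torch.randn(num_frequencies) * 0.1)
        self.register_buffer("k", torch.arange(1, num_frequencies + 1).float())

    def forward(self, scores):
        s = scores.unsqueeze(-1) * self.k                     # (..., N, N, K)
        act = (self.a * torch.sin(s) + self.b * torch.cos(s)).sum(-1)
        act = torch.relu(act) + 1e-6                          # keep weights non-negative
        return act / act.sum(dim=-1, keepdim=True)            # row-normalize like softmax

q = torch.randn(2, 16, 32)                                    # (batch, tokens, dim)
k = torch.randn(2, 16, 32)
v = torch.randn(2, 16, 32)
scores = q @ k.transpose(-1, -2) / 32 ** 0.5
attn = FourierAttentionActivation()(scores)                    # learnable replacement for softmax
out = attn @ v
print(out.shape)   # torch.Size([2, 16, 32])
```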

...
Department of Computer Science, University of Central Florida, Orlando, FL, USA, Department of Mathematics, University of Central Florida, Orlando, FL, USA
#architecture #cv #optimization #open_source #training
21

🚗 GoalFlow: Driving the Future with High-Quality Multimodal Trajectories

🔺 2. GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving

published on March 7

GoalFlow is a new method for generating high-quality multimodal trajectories in autonomous driving. It addresses the challenges of trajectory selection and quality by constraining the generative process with a goal point, which helps reduce trajectory divergence. The method uses a novel scoring mechanism to choose the best goal point based on the scene context, ensuring that the generated trajectories are relevant and effective. Experimental results show that GoalFlow outperforms existing methods, achieving state-of-the-art performance with fewer computational steps.

...
Horizon Robotics, Huazhong University of Science & Technology, Nanjing University, School of Artificial Intelligence, University of Chinese Academy of Sciences, Shanghai AI Laboratory
#optimization #diffusion #agents #multimodal
22

🎥 ReCamMaster: Mastering Camera Control in Video Generation

🔺 80. ReCamMaster: Camera-Controlled Generative Rendering from A Single Video

published on March 14

This paper introduces ReCamMaster, a novel framework for generating videos with controlled camera trajectories. It addresses the challenge of maintaining visual consistency and dynamic synchronization across multiple frames when altering camera paths. The framework leverages pre-trained text-to-video models and a specially curated multi-camera synchronized video dataset to enhance its performance. Experimental results demonstrate that ReCamMaster significantly outperforms existing methods, showcasing its potential for applications like video stabilization and super-resolution.

...
CUHK, HUST, Kuaishou Technology, Zhejiang University
#dataset #optimization #training #video #games
23

🖼️ Unlocking the Power of Diffusion Models with Sparse Attention

🔺 65. PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity

published on March 10

This paper introduces PLADIS, a new method that enhances pre-trained text-to-image diffusion models built on U-Net or Transformer backbones by applying sparse attention techniques. Unlike previous methods that required extra training or additional model evaluations, PLADIS operates efficiently at inference time by leveraging query-key correlations in the cross-attention layers. This approach improves the models' ability to generate high-quality images from text prompts without any further training. The results demonstrate significant gains in text alignment and user preference, making PLADIS a versatile solution for a wide range of text-to-image generation applications.
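
As an illustrative stand-in for inference-time sparse cross-attention, the sketch below keeps only the top-k keys per query and renormalizes; PLADIS's actual sparse transform and the way it is combined with the dense attention path differ from this simplification.

```python
# Illustrative stand-in for inference-time sparse cross-attention: keep only the
# strongest keys per query and renormalize (not PLADIS's exact sparse transform).
import torch

def topk_sparse_attention(q, k, v, keep=8):
    """Dense attention scores, then zero out all but the top-k keys per query."""
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5     # (B, Nq, Nk)
    topk_vals, topk_idx = scores.topk(keep, dim=-1)
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, topk_idx, topk_vals)                    # keep only top-k logits
    weights = torch.softmax(mask, dim=-1)                     # renormalized sparse weights
    return weights @ v

# Toy cross-attention: 64 image-latent queries attending over 77 text-token keys.
q = torch.randn(1, 64, 64)
k = torch.randn(1, 77, 64)
v = torch.randn(1, 77, 64)
print(topk_sparse_attention(q, k, v).shape)   # torch.Size([1, 64, 64])
```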

...
Samsung Research
#training #cv #diffusion #inference
24

🤖 Bridging the Gap: API and GUI LLM Agents Unite

🔺 20. API Agents vs. GUI Agents: Divergence and Convergence

published on March 14

This paper explores the evolution of large language models (LLMs) from simple text generators to sophisticated software agents that can perform tasks based on natural language commands. It compares two main types of LLM agents: API-based agents, which automate tasks through programmatic interfaces, and GUI-based agents, which interact with graphical user interfaces like humans. The study highlights the differences in their architecture, development processes, and user interactions, while also discussing how hybrid approaches can leverage the strengths of both paradigms. The authors provide decision criteria and practical examples to help users choose the right approach for their needs, suggesting that future advancements will further integrate these two types of agents.

...
Microsoft
#multimodal #survey #agents
25

🛡️ Strengthening Privacy in Federated Learning Against Gradient Inversion Attacks

🔺 13. Exploring the Vulnerabilities of Federated Learning: A Deep Dive into Gradient Inversion Attacks

published on March 13

This paper focuses on the vulnerabilities of Federated Learning (FL) to Gradient Inversion Attacks (GIA), which can leak private information despite the model's privacy-preserving intentions. It categorizes existing GIA methods into three types: optimization-based, generation-based, and analytics-based, and provides a thorough analysis of their effectiveness and limitations. The study reveals that while optimization-based GIA is the most practical, it still has performance issues, whereas generation-based and analytics-based methods are less practical due to their dependencies and detectability. The authors propose a defense strategy to enhance privacy in FL frameworks and suggest future research directions to strengthen defenses against these attacks.
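
A minimal optimization-based inversion sketch, in the spirit of deep-leakage-from-gradients attacks: optimize a dummy input and label until their gradients match the gradient shared by a client. Real attacks add image priors, label inference, and stronger regularization; the tiny model and data here are placeholders.

```python
# Minimal optimization-based gradient inversion sketch (DLG-style): recover a
# dummy input whose gradient matches the gradient shared by a client.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 5))

# The "private" client batch and the gradient the server observes.
x_true = torch.randn(1, 20)
y_true = torch.tensor([3])
true_grads = torch.autograd.grad(F.cross_entropy(model(x_true), y_true),
                                 model.parameters())

# Attacker optimizes dummy data so its gradients match the observed ones.
x_dummy = torch.randn(1, 20, requires_grad=True)
y_dummy = torch.randn(1, 5, requires_grad=True)            # soft label, also optimized
optimizer = torch.optim.Adam([x_dummy, y_dummy], lr=0.1)

for step in range(300):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_dummy), y_dummy.softmax(dim=-1))
    dummy_grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
    grad_match = sum(((dg - tg) ** 2).sum() for dg, tg in zip(dummy_grads, true_grads))
    grad_match.backward()
    optimizer.step()

print("reconstruction error:", (x_dummy - x_true).norm().item())
```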

...
Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA, Department of Computer Science and Engineering, University of California, Santa Cruz, CA 95064, USA, Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong 999077, China, Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen 518055, China, Department of Mathematics, The University of Hong Kong, Hong Kong 999077, China, Materials Innovation Institute for Life Sciences and Energy (MILES), HKU-SIRI, Shenzhen 518055, China, School of Computing and Data Science, The University of Hong Kong, Hong Kong 999077, China, Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511458, China
#leakage #benchmark #security #survey #healthcare
26

🦜 Bridging Vision and Language with Cockatiel for Enhanced Video Captioning

🔺 5. Cockatiel: Ensembling Synthetic and Human Preferenced Training for Detailed Video Caption

published on March 12

This paper addresses the challenge of Video Detailed Captioning (VDC), which involves creating precise descriptions for complex video content. The authors identify two main issues with existing methods: a bias towards certain aspects of captioning and a lack of alignment with human preferences. To overcome these challenges, they introduce Cockatiel, a three-stage training pipeline that combines synthetic and human-aligned data to enhance VDC performance. Their experiments demonstrate that Cockatiel achieves state-of-the-art results on the VDCSCORE metric and significantly outperforms other methods in terms of human preference evaluations.

...
Fudan University, Shanghai Academy of Artificial Intelligence for Science
#benchmark #training #multimodal #synthetic #alignment