Submodular Maximization beyond Non-negativity: Guarantees, Fast Algorithms, and Applications It is generally believed that submodular functions--and the more general class of γ-weakly submodular functions--may only be optimized under the non-negativity assumption f(S)≥0. In this paper, we show that once the function is expressed as the difference f=g−c, where g is monotone, non-negative, and γ-weakly submodular and c is non-negative modular, then strong approximation guarantees may be obtained. Wepresent an algorithm for maximizing g−c under a k-cardinality constraint which produces a random feasible set S such that E[g(S)−c(S)]≥(1−e−γ−ϵ)g(\opt)−c(\opt), whose running time is O(nϵlog21ϵ),independent of k. We extend these results to the unconstrained setting by describing an algorithm with the same approximation guarantees and faster O(n1ϵlog1ϵ) runtime. The main techniques underlying our algorithms are two-fold: the use of a surrogate objective which varies the relative importance between g and c throughout the algorithm, and a geometric sweep over possible γ values. Our algorithmic guarantees are complemented by a hardness result showing that no polynomial-time algorithm which accesses g through a value oracle can do better. We empirically demonstrate the success of our algorithms by applying them to experimental design on the Boston Housing dataset and directed vertex cover on the Email EU dataset. Online Algorithms for Rent-Or-Buy with Expert Advice We study the use of predictions by multiple experts (such as machine learning algorithms) to improve the performance of online algorithms. In particular, we consider the classical rent-or-buy problem (also called ski rental), and obtain algorithms that provably improve their performance over the adversarial scenario by using these predictions. We also prove matching lower bounds to show that our algorithms are the best possible, and perform experiments to empirically validate their performance in practice Non-monotone Submodular Maximization with Nearly Optimal Adaptivity and Query Complexity As a general optimization problem, submodular maximization has a wide range of applications in machine learning (e.g., active learning, clustering, and feature selection). In large-scale optimization, the parallel running time of an algorithm is governed by its adaptivity, which measures the number of sequential rounds needed if the algorithm can execute polynomially-many independent oracle queries in parallel. While low adaptivity is ideal, it is not sufficient for an algorithm to be efficient in practice---there are many applications of distributed submodular optimization where the number of function evaluations becomes prohibitively expensive. Motivated by these applications, we study the adaptivity and query complexity of submodular maximization. In this paper, we give the first constant-approximation algorithm for maximizing a non-monotone submodular function subject to a cardinality constraint~k that runs in O(log(n)) adaptive rounds. Additionally, our algorithm makes only O(nlog(k)) oracle queries in expectation. In our empirical study, we use three real-world applications to compare our algorithm with several benchmarks for non-monotone submodular maximization, and the results show that our algorithm finds competitive solutions using \emph{significantly fewer rounds and queries} Categorical Feature Compression via Submodular Optimization In the era of big data, learning from categorical features with very large vocabularies (e.g., 28 million for the Criteo click prediction dataset) has become a practical challenge for machine learning researchers and practitioners. We design a highly-scalable vocabulary compression algorithm that seeks to maximize the mutual information between the compressed categorical feature and the target binary labels and we furthermore show that its solution is guaranteed to be within a 1−1/e≈63% factor of the global optimal solution. Although in some settings, entropy-based set functions are known to be submodular, this is not the case for the mutual information objective we consider (mutual information with respect to the target labels). To address this, we introduce a novel re-parametrization of the mutual information objective, which we prove is submodular, and also design a data structure to query the submodular function in amortized O(logn) time (where n is the input vocabulary size). Our complete algorithm is shown to operate in O(nlogn)time. Additionally, we design a distributed implementation in which the query data structure is decomposed across O(k) machines such that each machine only requires O(nk) space, while still preserving the approximation guarantee and using only logarithmic rounds of computation. We also provide analysis of simple alternative heuristic compression methods to demonstrate they cannot achieve any approximation guarantee. Using the large-scale Criteo learning task, we demonstrate better performance in retaining mutual information and also verify competitive learning performance compared to other baseline methods. Multi-Frequency Phase Synchronization We propose a novel formulation for phase synchronization---the statistical problem of jointly estimating alignment angles from noisy pairwise comparisons---as a nonconvex optimization problem that enforces consistency among the pairwise comparisons in multiple frequency channels. Inspired by harmonic retrieval in signal processing, we develop a simple yet efficient two-stage algorithm that leverages the multi-frequency information. We demonstrate in theory and practice that the proposed algorithm significantly outperforms state-of-the-art phase synchronization algorithms, at a mild computational costs incurred by using the extra frequency channels. We also extend our algorithmic framework to general synchronization problems over compact Lie groups. Tractable n-Metrics for Multiple Graphs Graphs are used in almost every scientific discipline to express relations among a set of objects. Algorithms that compare graphs, and output a closeness score, or a correspondence among their nodes, are thus extremely important. Despite the large amount of work done, many o the scalable algorithms to compare graphs do not produce closeness scores that satisfy the intuitive properties of metrics. This is problematic since non-metrics are known to degrade the performance of algorithms such as distance-based clustering of graphs (Bento & Ioannidis, 2018). On the other hand, the use of metrics increases the performance of several machine learning tasks (Indyk, 1999; Clarkson, 1999; Angiulli & Pizzuti, 2002; Ackermann et al., 2010). In this paper, we introduce a new family of multi-distances (a distance between more than two elements) that satisfies a generalization of the properties of metrics to multiple elements. In the context of comparing graphs, we are the first to show the existence of multi-distances that simultaneously incorporate the useful property of alignment consistency (Nguyenet al., 2011), and a generalized metric property. Furthermore, we show that these multi-distances can be relaxed to convex optimization problems, without losing the generalized metric property. Guided evolutionary strategies: augmenting random search with surrogate gradients Many applications in machine learning require optimizing a function whose true gradient is unknown or computationally expensive, but where surrogate gradient information, directions that may be correlated with the true gradient, is cheaply available. For example, this occurs when an approximate gradient is easier to compute than the full gradient (e.g. in meta-learning or unrolled optimization), or when a true gradient is intractable and is replaced with a surrogate (e.g. in reinforcement learning or training networks with discrete variables). We propose Guided Evolutionary Strategies (GES), a method for optimally using surrogate gradient directions to accelerate random search. GES defines a search distribution for evolutionary strategies that is elongated along a subspace spanned by the surrogate gradients and estimates a descent direction which can then be passed to a first-order optimizer. We analytically and numerically characterize the tradeoffs that result from tuning how strongly the search distribution is stretched along the guiding subspace and use this to derive a setting of the hyperparameters that works well across problems. We evaluate GES on several example problems, demonstrating an improvement over both standard evolutionary strategies and first-order methods that directly follow the surrogate gradient. Adaptive and Safe Bayesian Optimization in High Dimensions via One-Dimensional Subspaces Bayesian optimization is known to be difficult to scale to high dimensions, because the acquisition step requires solving a non-convex optimization problem in the same search space. In order to scale the method and keep its benefits, we propose an algorithm (LineBO) that restricts the problem to a sequence of iteratively chosen one-dimensional sub-problems. We show that our algorithm converges globally and obtains a fast local rate when the function is strongly convex. Further, if the objective has an invariant subspace, our method automatically adapts to the effective dimension without changing the algorithm. Our method scales well to high dimensions and makes use of a global Gaussian process model. When combined with the SafeOpt algorithm to solve the sub-problems, we obtain the first safe Bayesian optimization algorithm with theoretical guarantees applicable in high-dimensional settings. We evaluate our method on multiple synthetic benchmarks, where we obtain competitive performance. Further, we deploy our algorithm to optimize the beam intensity of a free electron laser with up to 40 parameters while satisfying the safe operation constraints. Semi-Cyclic Stochastic Gradient Descent We consider convex SGD updates with a blockcyclic structure, i.e. where each cycle consists of a small number of blocks, each with many samples from a possibly different, block-specific, distribution. This situation arises, e.g., in Federated Learning where the mobile devices available for updates at different times during the day have different characteristics. We show that such block-cyclic structure can significantly deteriorate the performance of SGD, but propose a simple correction approach that allows prediction with the same performance guarantees as for i.i.d., non-cyclic, sampling.