-----0
Amin, K., Kearns, M., and Syed, U. Graphical models for bandit problems. In Proceedings of the 27th Annual Conference on Uncertainty in Artificial Intelligence (UAI), 2011a.
Amin, K., Kearns, M., and Syed, U. Bandits, query learning, and the haystack dimension. In Proceedings of the 24th Annual Conference on Learning Theory (COLT), 2011b.
Auer, Peter, Ortner, Ronald, and Szepesvari, Csaba. Improved rates for the stochastic continuum-armed bandit problem. In 20th Conference on Learning Theory (COLT), pp. 454-468, 2007.
Beygelzimer, Alina and Langford, John. The offset tree for learning with partial labels. In KDD, pp. 129-138, 2009.
Beygelzimer, Alina, Langford, John, Li, Lihong, Reyzin, Lev, and Schapire, Robert E. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
Bubeck, Sebastien, Munos, Remi, Stoltz, Gilles, and Szepesvari, Csaba. Online optimization in X-armed bandits. In NIPS, pp. 201-208, 2008.
Kleinberg, Robert, Slivkins, Aleksandrs, and Upfal, Eli. Multi-armed bandits in metric spaces. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing (STOC), pp. 681-690, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-047-0. doi: http://doi.acm.org/10.1145/1374376.1374475.
Langford, John and Zhang, Tong. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems 20 (NIPS), 2007.
Li, L. and Littman, M.L. Reducing reinforcement learning to KWIK online regression. Annals of Mathematics and Artificial Intelligence, 58(3):217-237, 2010.
Li, L., Littman, M.L., and Walsh, T.J. Knows what it knows: a framework for self-aware learning. In Proceedings of the 25th International Conference on Machine Learning (ICML), pp. 568-575, 2008.
Li, L., Littman, M.L., Walsh, T.J., and Strehl, A.L. Knows what it knows: a framework for self-aware learning. Machine Learning, 82(3):399-443, 2011.
Li, Lihong, Chu, Wei, Langford, John, and Schapire, Robert E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International World Wide Web Conference, 2010.
Lu, Tyler, Pal, David, and Pal, Martin. Contextual multi-armed bandits. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
Sayedi, A., Zadimoghaddam, M., and Blum, A. Trading off mistakes and don't-know predictions. In NIPS, 2010.
Slivkins, Aleksandrs. Contextual bandits with similarity information. In Proceedings of the 24th Annual Conference on Learning Theory (COLT), 2011.
Strehl, Alexander and Littman, Michael L. Online linear regression and its application to model-based reinforcement learning. In Advances in Neural Information Processing Systems 20 (NIPS), 2007.
Walsh, T.J., Szita, I., Diuk, C., and Littman, M.L. Exploring compact reinforcement-learning representations with linear regression. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), pp. 591-598. AUAI Press, 2009.
Wang, Yizao, Audibert, Jean-Yves, and Munos, Remi. Algorithms for infinitely many-armed bandits. In Advances in Neural Information Processing Systems 21 (NIPS), pp. 1729-1736, 2008.
-----0
Afkanpour, A., Gyorgy, A., Szepesvari, C., and Bowling, M. H. (2013). A randomized mirror descent algorithm for large scale multiple kernel learning. CoRR, abs/1205.0288.
Argyriou, A., Hauser, R., Micchelli, C., and Pontil, M. (2006). A DC-programming algorithm for kernel selection. In Proceedings of the 23rd International Conference on Machine Learning, pages 41-48.
Argyriou, A., Micchelli, C., and Pontil, M. (2005). Learning convex combinations of continuously parameterized basic kernels. In Proceedings of the 18th Annual Conference on Learning Theory, pages 338-352.
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337-404.
Bach, F. (2008). Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in Neural Information Processing Systems, volume 21, pages 105-112.
Beck, A. and Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167-175.
Cortes, C., Mohri, M., and Rostamizadeh, A. (2009). Learning non-linear combinations of kernels. In Advances in Neural Information Processing Systems, volume 22, pages 396-404.
Frank, A. and Asuncion, A. (2010). UCI machine learning repository.
Gehler, P. and Nowozin, S. (2008). Infinite kernel learning. Technical Report 178, Max Planck Institute For Biological Cybernetics.
Gonen, M. and Alpaydın, E. (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211-2268.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, Prediction. Springer, 2nd edition.
Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning Journal, 69(2-3):169-192.
Hazan, E. and Kale, S. (2011). Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of JMLR Workshop and Conference Proceedings, pages 421-436.
Kloft, M., Brefeld, U., Sonnenburg, S., and Zien, A. (2011). lp-norm multiple kernel learning. Journal of Machine Learning Research, 12:953-997.
Martinet, B. (1978). Perturbation des methodes d'optimisation. Applications. RAIRO Analyse Numerique, 12:153-171.
Micchelli, C. and Pontil, M. (2005). Learning the kernel function via regularization. Journal of Machine Learning Research, 6:1099-1125.
Nath, J., Dinesh, G., Raman, S., Bhattacharyya, C., Ben-Tal, A., and Ramakrishnan, K. (2009). On the algorithmics and applications of a mixed-norm based kernel learning formulation. In Advances in Neural Information Processing Systems, volume 22, pages 844-852.
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM J. Optimization, 4:1574-1609.
Nemirovski, A. and Yudin, D. (1998). Problem Complexity and Method Efficiency in Optimization. Wiley.
Nesterov, Y. (2010). Efficiency of coordinate descent methods on huge-scale optimization problems. CORE Discussion paper, (2010/2).
Nesterov, Y. (2012). Subgradient methods for huge-scale optimization problems. CORE Discussion paper, (2012/2).
Orabona, F. and Luo, J. (2011). Ultra-fast optimization algorithm for sparse multi kernel learning. In Proceedings of the 28th International Conference on Machine Learning, pages 249-256.
Rakotomamonjy, A., Bach, F., Canu, S., and Grandvalet, Y. (2008). SimpleMKL. Journal of Machine Learning Research, 9:2491-2521.
Richtarik, P. and Takac, M. (2011). Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. (revised July 4, 2011) submitted to Mathematical Programming.
Rockafellar, R. (1976). Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(1):877-898.
Scholkopf, B. and Smola, A. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA.
Shalev-Shwartz, S. and Tewari, A. (2011). Stochastic methods for l1-regularized loss minimization. Journal of Machine Learning Research, 12:1865-1892.
Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge Univ Press.
Sonnenburg, S., Ratsch, G., Schafer, C., and Scholkopf, B. (2006). Large scale multiple kernel learning. The Journal of Machine Learning Research, 7:1531-1565.
Xu, Z., Jin, R., King, I., and Lyu, M. (2008). An extended level method for efficient multiple kernel learning. In Advances in Neural Information Processing Systems, volume 21, pages 1825-1832.
Xu, Z., Jin, R., Yang, H., King, I., and Lyu, M. R. (2010). Simple and efficient multiple kernel learning by group lasso. In Proceedings of the 27th International Conference on Machine Learning, pages 1175-1182.
-----0
Beygelzimer, A., Dasgupta, S., and Langford, J. Importance weighted active learning. In ICML, 2009.
Beygelzimer, A., Hsu, D., Langford, J., and Zhang, T. Agnostic active learning without constraints. In NIPS, 2010.
Castro, R.M. and Nowak, R.D. Minimax bounds for active learning. Information Theory, IEEE Transactions on, 54(5):2339-2353, 2008.
Cesa-Bianchi, N., Gentile, C., and Orabona, F. Robust bounds for classification via selective sampling. In ICML, pp. 121-128, 2009.
Clarkson, K. L. and Woodruff, D. P. Numerical linear algebra in the streaming model. In STOC, 2009.
Cohn, D., Atlas, L., and Ladner, R. Improving generalization with active learning. Machine Learning, 15:201-221, 1994.
Daniely, A., Sabato, S., Ben-David, S., and Shalev-Shwartz, S. Multiclass learnability and the ERM principle. Journal of Machine Learning Research Proceedings Track, 19:207-232, 2011.
Dasgupta, S., Hsu, D., and Monteleoni, C. A general agnostic active learning algorithm. In NIPS, 2007.
Dekel, O., Gentile, C., and Sridharan, K. Robust selective sampling from single and multiple teachers. In COLT, pp. 346-358, 2010.
Filippi, S., Cappe, O., Garivier, A., and Szepesvari, C. Parametric bandits: The generalized linear case. NIPS, 2010.
Freedman, D. A. On tail probabilities for martingales. The Annals of Probability, 3(1):100-118, February 1975.
Gentile, C. and Orabona, F. On multilabel classification and ranking with partial feedback. In NIPS, 2012.
Halko, N., Martinsson, P. G., and Tropp, J. A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev., 53(2):217-288, 2011.
Hanneke, S. Rates of convergence in active learning. Annals of Statistics, 39(1):333-361, 2011.
Harchaoui, Z., Douze, M., Paulin, M., Dudik, M., and Malick, J. Large-scale image classification with trace-norm regularization. In CVPR, pp. 3386-3393, 2012.
Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69:169-192, 2007.
Jain, P. and Kapoor, A. Active learning for large multi-class problems. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 762-769, 2009.
Joshi, A.J., Porikli, F., and Papanikolopoulos, N.P. Scalable active learning for multiclass image classification. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(11):2259-2273, 2012.
Kakade, S. M. and Tewari, A. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems 21, 2009.
Kalai, A. and Vempala, S. Efficient algorithms for online decision problems. J. Comput. Syst. Sci., 71(3):291-307, 2005.
Lauritzen, S. L. Graphical Models. Oxford University Press, Oxford, 1996.
Luo, T., Kramer, K., Goldgof, D. B., Hall, L. O., Samson, S., Remsen, A., and Hopkins, T. Active learning to recognize multiple types of plankton. Journal of Machine Learning Research, 6:589-613, 2005.
Orabona, F. and Cesa-Bianchi, N. Better algorithms for selective sampling. In ICML, pp. 433-440, 2011.
Roth, D. and Small, K. Active learning with perceptron for structured output. In ICML 2006 Workshop on Learning in Structured Output Spaces, 2006.
Shalev-Shwartz, Shai. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 2012.
Tsybakov, A. B. Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32:135-166, 2004.
Yan, R., Yang, Jie, and Hauptmann, A. Automatically labeling video data using multi-class active learning. In ICCV, 2003.
-----0
Abbasi-Yadkori, Yasin, Pal, David, and Szepesvari, Csaba. Improved Algorithms for Linear Stochastic Bandits. In NIPS, pp. 2312-2320, 2011.
Abramowitz, Milton and Stegun, Irene A. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, New York, 1964.
Agrawal, Shipra and Goyal, Navin. Analysis of Thompson Sampling for the Multi-armed Bandit Problem. In COLT, 2012.
Agrawal, Shipra and Goyal, Navin. Further Optimal Regret Bounds for Thompson Sampling. AISTATS, 2013.
Auer, Peter. Using Confidence Bounds for Exploitation-Exploration Trade-offs. Journal of Machine Learning Research, 3:397-422, 2002.
Auer, Peter, Cesa-Bianchi, Nicolò, Freund, Yoav, and Schapire, Robert E. The Nonstochastic Multiarmed Bandit Problem. SIAM J. Comput., 32(1):48-77, 2002.
Bubeck, Sebastien, Cesa-Bianchi, Nicolò, and Kakade, Sham M. Towards minimax policies for online linear optimization with bandit feedback. Proceedings of the 25th Conference on Learning Theory (COLT), pp. 1-14, 2012.
Thompson Sampling for Contextual Bandits with Linear Payoffs
Chapelle, Olivier and Li, Lihong. An Empirical Evaluation of Thompson Sampling. In NIPS, pp. 2249-2257, 2011.
Chapelle, Olivier and Li, Lihong. Open Problem: Regret Bounds for Thompson Sampling. In COLT, 2012.
Chu, Wei, Li, Lihong, Reyzin, Lev, and Schapire, Robert E. Contextual Bandits with Linear Payoff Functions. Journal of Machine Learning Research Proceedings Track, 15:208-214, 2011.
Dani, Varsha, Hayes, Thomas P., and Kakade, Sham M. Stochastic Linear Optimization under Bandit Feedback. In COLT, pp. 355-366, 2008.
Filippi, Sarah, Cappe, Olivier, Garivier, Aurelien, and Szepesvari, Csaba. Parametric Bandits: The Generalized Linear Case. In NIPS, pp. 586-594, 2010.
Graepel, Thore, Candela, Joaquin Quinonero, Borchert, Thomas, and Herbrich, Ralf. Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine. In ICML, pp. 13-20, 2010.
Granmo, O.-C. Solving Two-Armed Bernoulli Bandit Problems Using a Bayesian Learning Automaton. International Journal of Intelligent Computing and Cybernetics (IJICC), 3(2):207-234, 2010.
Kaelbling, Leslie Pack. Associative Reinforcement Learning: Functions in k-DNF. Machine Learning, 15(3):279-298, 1994.
Kaufmann, Emilie, Korda, Nathaniel, and Munos, Remi. Thompson Sampling: An Optimal Finite Time Analysis. ALT, 2012.
Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22, 1985.
Langford, John and Zhang, Tong. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information. In NIPS, 2007.
May, Benedict C. and Leslie, David S. Simulation studies in optimistic Bayesian sampling in contextual-bandit problems. Technical Report 11:02, Statistics Group, Department of Mathematics, University of Bristol, 2011.
May, Benedict C., Korda, Nathan, Lee, Anthony, and Leslie, David S. Optimistic Bayesian sampling in contextual-bandit problems. Technical Report 11:01, Statistics Group, Department of Mathematics, University of Bristol, 2011.
Ortega, Pedro A. and Braun, Daniel A. Linearly Parametrized Bandits. Journal of Artificial Intelligence Research, 38:475-511, 2010.
Russo, Daniel and Roy, Benjamin Van. Learning to optimize via posterior sampling. CoRR, abs/1301.2609, 2013.
Sarkar, Jyotirmoy. One-armed bandit problem with covariates. The Annals of Statistics, 19(4):1978-2002, 1991.
Scott, S. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26:639-658, 2010.
Strehl, Alexander L., Mesterharm, Chris, Littman, Michael L., and Hirsh, Haym. Experience-efficient learning in associative bandit problems. In ICML, pp. 889-896, 2006.
Strens, Malcolm J. A. A Bayesian Framework for Reinforcement Learning. In ICML, pp. 943-950, 2000.
Thompson, William R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4):285-294, 1933.
Woodroofe, Michael. A one-armed bandit problem with a concomitant variable. Journal of the American Statistical Association, 74(368):799-806, 1979.
Wyatt, Jeremy. Exploration and Inference in Learning from Reinforcement. PhD thesis, Department of Artificial Intelligence, University of Edinburgh, 1997.
-----0
Adams, R., Ghahramani, Z., and Jordan, M. Tree-structured stick breaking for hierarchical data. In NIPS, pp. 19-27, 2010.
Ahmed, A., Aly, M., Gonzalez, J., Narayanamurthy, S., and Smola, A. Scalable inference in latent variable models. In WSDM, 2012.
Ahmed, A., Ravi, S., Narayanamurthy, S., and Smola, A. Fastex: Hash clustering with exponential families. In NIPS, 2012.
Ahmed, A., Hong, L., and Smola, A. Hierarchical Geographical Modeling of User locations from Social Media Posts. In WWW, 2013.
Ahmed, A., Shervashidze, N., Narayanamurthy, S., Josifovski, V., and Smola, A. Distributed large-scale natural graph factorization. In WWW, 2013.
Beal, M. J., Ghahramani, Z., and Rasmussen, C. E. The infinite hidden Markov model. In NIPS, 2002.
Blei, D., Ng, A., and Jordan, M. Latent Dirichlet allocation. In NIPS, MIT Press, 2002.
Blei, D., Griffiths, T., and Jordan, M. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):1-30, 2010.
Cheng, Z., Caverlee, J., Lee, K., and Sui, D. Exploring millions of footprints in location sharing services. In ICWSM, 2011.
Cho, E., Myers, S. A., and Leskovec, J. Friendship and mobility: user movement in location-based social networks. In KDD, pp. 1082-1090, New York, NY, USA, 2011. ACM.
Chou, K.C., Willsky, A.S., and Benveniste, A. Multiscale recursive estimation, data fusion, and regularization. IEEE Transactions on Automatic Control, 39(3):464-478, Mar. 1994.
Cowans, P. J. Probabilistic Document Modelling. PhD thesis, University of Cambridge, 2006.
Eisenstein, J., O'Connor, B., Smith, N. A., and Xing, E.P. A latent variable model for geographic lexical variation. In Empirical Methods in Natural Language Processing, pp. 1277-1287, 2010.
Eisenstein, J., Ahmed, A., and Xing, E. Sparse additive generative models of text. In International Conference on Machine Learning, pp. 1041-1048, New York, NY, USA, 2011. ACM.
Hong, L., Ahmed, A., Gurumurthy, S., Smola, A., and Tsioutsiouliklis, K. Discovering geographical topics in the twitter stream. In World Wide Web, 2012.
James, L. F. Coag-frag duality for a class of stable Poisson-Kingman mixtures, 2010. URL http://arxiv.org/abs/1008.2420.
Kim, J., Kim, D., Kim, S., and Oh, A. Modeling Topic Hierarchies with the Recursive Chinese Restaurant Process. In CIKM, 2012.
Li, W. and McCallum, A. Pachinko allocation: DAG-structured mixture models of topic correlations. In ICML, 2006.
Li, W., Blei, D., and McCallum, A. Nonparametric Bayes pachinko allocation. In UAI, 2007.
Mei, Q., Liu, C., Su, H., and Zhai, C.X. A probabilistic approach to spatiotemporal theme pattern mining on weblogs. In Proceedings of WWW, pp. 533-542, New York, NY, USA, 2006. ACM.
Mimno, D.M., Li, W., and McCallum, A. Mixtures of hierarchical topics with pachinko allocation. In ICML, volume 227, pp. 633-640. ACM, 2007.
Paisley, J., Wang, C., Blei, D., and Jordan, M. I. Nested hierarchical Dirichlet processes. Technical report, 2012. http://arXiv.org/abs/1210.6738.
Pitman, J. and Yor, M. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25(2):855-900, 1997.
Teh, Y., Jordan, M., Beal, M., and Blei, D. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(576):1566-1581, 2006.
Wallach, H. Structured Topic Models for Language. PhD thesis, University of Cambridge, 2008.
Wallach, Hanna M., Murray, Iain, Salakhutdinov, Ruslan, and Mimno, David. Evaluation methods for topic models. In ICML, 2009.
Wang, C., Wang, J., Xing, X., and Ma, W.Y. Mining geographic knowledge using location aware topic model. In Proceedings of the 4th ACM workshop on Geographical Information Retrieval, pp. 65-70, New York, NY, USA, 2007. ACM.
Wing, B.P. and Baldridge, J. Simple supervised document geolocation with geodesic grids. In Proceedings of ACL, 2011.
Yin, Z., Cao, L., Han, J., Zhai, C., and Huang, T. Geographical topic discovery and comparison. In World Wide Web, pp. 247-256, New York, NY, USA, 2011. ACM.
-----0
Ailon, N., Charikar, M., and Newman, A. Aggregating inconsistent information: Ranking and clustering. J. ACM, 55(5):23:1-23:27, 2008.
Ailon, N., Begleiter, R., and Ezra, E. Active learning using smooth relative regret approximations with applications. In COLT, 2012.
Ames, B. and Vavasis, S. Nuclear norm minimization for the planted clique and biclique problems. Mathematical Programming, 129(1):69-89, 2011.
Bansal, N., Blum, A., and Chawla, S. Correlation clustering. Machine Learning, 56:89-113, 2004.
Bollobas, B. and Scott, A. D. Max cut for random graphs with a planted partition. Combinatorics, Prob. and Comp., 13(4-5):451-474, 2004.
Candès, E., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? J. ACM, 58:1-37, 2011.
Carson, T. and Impagliazzo, R. Hill-climbing finds random planted bisections. In SODA, 2001.
Chandrasekaran, V., Sanghavi, S., Parrilo, P., and Willsky, A. Rank-sparsity incoherence for matrix decomposition. SIAM J. on Optimization, 21(2):572-596, 2011.
Charikar, Moses, Guruswami, Venkatesan, and Wirth, Anthony. Clustering with qualitative information. J. Comput. Syst. Sci., 71(3):360-383, 2005.
Chaudhuri, K., Chung, F., and Tsiatas, A. Spectral clustering of graphs with general degrees in the extended planted partition model. COLT, 2012.
Chen, Y., Sanghavi, S., and Xu, H. Clustering sparse graphs. In NIPS. Available on arXiv:1210.3335, 2012.
Condon, A. and Karp, R.M. Algorithms for graph partitioning on the planted partition model. Random Structures and Algorithms, 2001.
Demaine, E., Emanuel, D., Fiat, A., and Immorlica, N. Correlation clustering in general weighted graphs. Theoretical Comp. Sci., 2006.
Eriksson, B., Dasarathy, G., Singh, A., and Nowak, R. Active clustering: Robust and efficient hierarchical clustering using adaptively selected similarities. arXiv:1102.3887, 2011.
Giesen, J. and Mitsche, D. Reconstructing many partitions using spectral techniques. In Fundamentals of Computation Theory, pp. 433-444, 2005.
Giotis, Ioannis and Guruswami, Venkatesan. Correlation clustering with a fixed number of clusters. Theory of Computing, 2(1):249-266, 2006.
Holland, P. W., Laskey, K. B., and Leinhardt, S. Stochastic blockmodels: Some first steps. Social networks, 5(2):109-137, 1983.
Jalali, A., Chen, Y., Sanghavi, S., and Xu, H. Clustering partially observed graphs via convex optimization. In ICML. Available on arXiv:1104.4803, 2011.
Jalali, Ali and Srebro, Nathan. Clustering using max-norm constrained optimization. In ICML. Available on arXiv:1202.5598, 2012.
Krishnamurthy, A., Balakrishnan, S., Xu, M., and Singh, A. Efficient active algorithms for hierarchical clustering. arXiv:1206.4672, 2012.
Mathieu, C. and Schudy, W. Correlation clustering with noisy input. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 712-728. SIAM, 2010.
McSherry, F. Spectral partitioning of random graphs. In FOCS, pp. 529-537, 2001.
Oymak, S. and Hassibi, B. Finding dense clusters via low rank + sparse decomposition. arXiv:1104.5186v1, 2011.
Rohe, K., Chatterjee, S., and Yu, B. Spectral clustering and the high-dimensional stochastic block model. Ann. of Stat., 39:1878-1915, 2011.
Shamir, O. and Tishby, N. Spectral Clustering on a Budget. In AISTATS, 2011.
Shamir, R. and Tsur, D. Improved algorithms for the random cluster graph model. Random Struct. & Alg., 31(4):418-449, 2007.
Voevodski, K., Balcan, M., Roglin, H., Teng, S., and Xia, Y. Active clustering of biological sequences. JMLR, 13:203-225, 2012.
Xu, H., Caramanis, C., and Sanghavi, S. Robust PCA via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047-3064, 2012.
-----0
Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., and Baskurt, A. Sequential Deep Learning for Human Action Recognition. In A.A. Salah, B. L. (ed.), 2nd International Workshop on Human Behavior Understanding (HBU), Lecture Notes in Computer Science, pp. 29-39. Springer, 2011.
Baldi, P. Autoencoders, Unsupervised Learning, and Deep Architectures. Proceedings of ICML Workshop on Unsupervised and Transfer Learning, 2012.
Baldi, P., Forouzan, S., and Lu, Z. Complex-Valued Autoencoders. Neural Networks, 33:136-147, 2012.
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. Theano: a CPU and GPU Math Expression Compiler. In Proceedings of SciPy, 2010.
Bethge, M., Gerwinn, S., and Macke, J. H. Unsupervised learning of a steerable basis for invariant image representations. In Proceedings of SPIE Human Vision and Electronic Imaging, 2007.
Denil, M., Bazzani, L., Larochelle, H., and de Freitas, N. Learning where to Attend with Deep Architectures for Image Tracking. Neural Computation, 2012.
Fitzpatrick, P., Metta, G., Natale, L., Rao, S., and Sandini, G. Learning about objects through action - initial steps towards artificial cognition. In IEEE International Conference on Robotics and Automation (ICRA), pp. 3140-3145, 2003.
Larochelle, H. and Hinton, G. Learning to combine foveal glimpses with a third-order Boltzmann machine. In Lafferty, J., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23, pp. 1243-1251. 2010.
Le, Q. V., Zou, W. Y., Yeung, S. Y., and Ng, A. Y. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR 2011, pp. 3361-3368. IEEE, June 2011.
Lee, H., Ekanadham, C., and Ng, A. Sparse deep belief net model for visual area V2. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. (eds.), Advances in Neural Information Processing Systems 20, pp. 873-880. MIT Press, Cambridge, MA, 2008.
Memisevic, R. On multi-view feature learning. In ICML, 2012a.
Memisevic, R. Learning to relate images: Mapping units, complex cells and simultaneous eigenspaces. ArXiv e-prints, 2012b.
Memisevic, R. and Hinton, G. Unsupervised Learning of Image Transformations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007.
Memisevic, R. and Hinton, G. E. Learning to Represent Spatial Transformations with Factored Higher-Order Boltzmann Machines. Neural Computation, 22(6):1473-1492, 2010.
Montesano, L., Lopes, M., Bernardino, A., and Santos-Victor, J. Learning object affordances: From sensory-motor coordination to imitation. IEEE Transactions on Robotics, 24(1):15-26, Feb. 2008.
Susskind, J., Memisevic, R., Hinton, G., and Pollefeys, M. Modeling the joint density of two images under a Variety of Transformations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011.
Sutskever, I. and Hinton, G. E. Learning Multilevel Distributed Representations for High-dimensional Sequences. Proceeding of the Eleventh International Conference on Artificial Intelligence and Statistics, 2007.
Taylor, G. W., Fergus, R., LeCun, Y., and Bregler, C. Convolutional learning of spatio-temporal features. In ECCV'10, pp. 140-153, September 2010.
Taylor, G. W., Hinton, G. E., and Roweis, S. T. Two Distributed-State Models For Generating High-Dimensional Time Series. J. Mach. Learn. Res., 12:1025-1068, 2011. ISSN 1532-4435.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. J. Mach. Learn. Res., 11:3371-3408, 2010. ISSN 1532-4435.
-----0
Alamgir, Morteza and von Luxburg, Ulrike. Multiagent random walks for local clustering on graphs. ICDM '10, pp. 18-27, 2010.
Alon, Noga. Eigenvalues and expanders. Combinatorica, 6(2):83-96, 1986.
Alvisi, L., Clement, A., Epasto, A., Lattanzi, S., and Panconesi, A. The evolution of sybil defense via social networks. In IEEE Symposium on Security and Privacy, 2013.
Andersen, Reid and Lang, Kevin J. Communities from seed sets. WWW '06, pp. 223-232, 2006.
Andersen, Reid and Peres, Yuval. Finding sparse cuts locally using evolving sets. STOC, 2009.
Andersen, Reid, Chung, Fan, and Lang, Kevin. Using pagerank to locally partition a graph. 2006. An extended abstract appeared in FOCS 2006.
Andersen, Reid, Gleich, David F., and Mirrokni, Vahab. Overlapping clusters for distributed computation. WSDM '12, pp. 273-282, 2012.
Arora, Sanjeev and Kale, Satyen. A combinatorial, primal-dual approach to semidefinite programs. STOC '07, pp. 227-236, 2007.
Arora, Sanjeev, Rao, Satish, and Vazirani, Umesh V. Expander flows, geometric embeddings and graph partitioning. Journal of the ACM, 56(2), 2009.
Arora, Sanjeev, Hazan, Elad, and Kale, Satyen. O(sqrt(log n)) approximation to sparsest cut in Õ(n^2) time. SIAM Journal on Computing, 39(5):1748-1771, 2010.
Chawla, Shuchi, Krauthgamer, Robert, Kumar, Ravi, Rabani, Yuval, and Sivakumar, D. On the hardness of approximating multicut and sparsest-cut. Computational Complexity, 15(2):94-114, June 2006.
Gargi, Ullas, Lu, Wenjun, Mirrokni, Vahab S., and Yoon, Sangho. Large-scale community detection on youtube for topic discovery and exploration. In AAAI Conference on Weblogs and Social Media, 2011.
Gharan, Shayan Oveis and Trevisan, Luca. Approximating the expansion profile and almost optimal local graph clustering. FOCS, pp. 187-196, 2012.
Gleich, David F. and Seshadhri, C. Vertex neighborhoods, low conductance cuts, and good seeds for local community methods. In KDD 2012, 2012.
Haveliwala, Taher H. Topic-sensitive pagerank. In WWW '02, pp. 517-526, 2002.
Kannan, Ravi, Vempala, Santosh, and Vetta, Adrian. On clusterings: Good, bad and spectral. Journal of the ACM, 51(3):497-515, 2004.
Leighton, Frank Thomson and Rao, Satish. Multicommodity max-flow min-cut theorems and their use in designing approximation algorithms. Journal of the ACM, 46(6):787-832, 1999.
Leskovec, Jure, Lang, Kevin J., Dasgupta, Anirban, and Mahoney, Michael W. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29-123, 2009.
Leskovec, Jure, Lang, Kevin J., and Mahoney, Michael. Empirical comparison of algorithms for network community detection. WWW, 2010.
Lin, Frank and Cohen, William W. Power iteration clustering. In ICML '10, pp. 655-662, 2010.
Lovasz, Laszlo and Simonovits, Miklos. The mixing rate of Markov chains, an isoperimetric inequality, and computing the volume. FOCS, 1990.
Lovasz, Laszlo and Simonovits, Miklos. Random walks in a convex body and an improved volume algorithm. Random Struct. Algorithms, 4(4):359-412, 1993.
Makarychev, Konstantin, Makarychev, Yury, and Vijayaraghavan, Aravindan. Approximation algorithms for semi-random partitioning problems. In STOC '12, pp. 367-384, 2012.
Merca, Mircea. A note on cosine power sums. Journal of Integer Sequences, 15:12.5.3, May 2012.
Morris, Ben and Peres, Yuval. Evolving sets and mixing. STOC '03, pp. 279-286. ACM, 2003.
Motwani, Rajeev and Raghavan, Prabhakar. Randomized algorithms. Cambridge University Press, 1995.
Schaeffer, S. E. Graph clustering. Computer Science Review, 1(1):27-64, 2007.
Shalev-Shwartz, Shai and Srebro, Nathan. SVM optimization: inverse dependence on training set size. In ICML, 2008.
Sherman, Jonah. Breaking the multicommodity flow barrier for O(sqrt(log n))-approximations to sparsest cut. FOCS '09, pp. 363-372, 2009.
Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.
Sinclair, Alistair and Jerrum, Mark. Approximate counting, uniform generation and rapidly mixing Markov chains. Information and Computation, 82(1):93-133, 1989.
Spielman, Daniel and Teng, Shang-Hua. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. STOC, 2004.
Spielman, Daniel and Teng, Shang-Hua. A local clustering algorithm for massive graphs and its application to nearly-linear time graph partitioning. CoRR, abs/0809.3232, 2008.
Wu, Xiao-Ming, Li, Zhenguo, So, Anthony Man-Cho, Wright, John, and Chang, Shih-Fu. Learning with partially absorbing random walks. In NIPS, 2012.
Zhu, Zeyuan Allen, Chen, Weizhu, Zhu, Chenguang, Wang, Gang, Wang, Haixun, and Chen, Zheng. Inverse time dependency in convex regularized learning. ICDM, 2009.
-----0
Babes, M., Marivate, V., Littman, M., and Subramanian, K. Apprenticeship learning about multiple intentions. ICML, 2011.
Brillinger, D.R. Learning a potential function from a trajectory. Signal Processing Letters, IEEE, 14(11):867-870, 2007.
Chaumette, F. Image moments: a general and useful set of features for visual servoing. Robotics, IEEE Transactions on, 20(4):713-723, 2004.
Chiappa, S. and Peters, J. Movement extraction by detecting dynamics switches and repetitions. Advances in neural information processing systems, 23:388-396, 2010.
Choi, Jaedeug and Kim, Kee-Eung. Nonparametric bayesian inverse reinforcement learning for multiple reward functions. In Neural Information Processing Systems (NIPS), 2012.
Dimitrakakis, C. and Rothkopf, C. Bayesian multitask inverse reinforcement learning. Arxiv preprint arXiv:1106.3655, 2011.
Doshi-Velez, Finale, Wingate, David, Roy, Nicholas, and Tenenbaum, Joshua. Nonparametric Bayesian Policy Priors for Reinforcement Learning. In Neural Information Processing Systems (NIPS), 2010.
Ferguson, T. A Bayesian analysis of some nonparametric problems. The annals of statistics, 1:209-230, 1973.
Fox, E.B., Sudderth, E.B., Jordan, M.I., and Willsky, A.S. Nonparametric bayesian learning of switching linear dynamical systems. Advances in Neural Information Processing Systems, 21:457-464, 2008.
Hannah, Lauren A., Blei, David M., and Powell, Warren B. Dirichlet process mixtures of generalized linear models. J. Mach. Learn. Res., pp. 1923-1953, July 2011.
Howard, M., Klanke, S., Gienger, M., Goerick, C., and Vijayakumar, S. A novel method for learning policies from variable constraint data. Autonomous Robots, 27:105-121, 2009.
Ijspeert, A. J., Nakanishi, J., and Schaal, S. Learning attractor landscapes for learning motor primitives. In Neural Information Processing Systems (NIPS), Cambridge, MA, 2003.
Jetchev, N. and Toussaint, M. Task space retrieval using inverse feedback control. In Intl. Conf. on Machine Learning (ICML), 2011.
Khansari-Zadeh, S.M. and Billard, A. BM: An iterative algorithm to learn stable non-linear dynamical systems with gaussian mixture models. In Robotics and Automation, IEEE Inter. Conf. on, 2010.
Macqueen, J. B. Some methods for classification and analysis of multivariate observations. In Berkeley Symposium on Math, Statistics, and Probability. University of California Press, 1967.
Majecka, B. Statistical models of pedestrian behaviour in the forum. Master's thesis, School of Informatics, University of Edinburgh, 2009.
Neal, R. Markov chain sampling methods for dirichlet process mixture models. Journal of computational and graphical statistics, 9:249-265, 2000.
Park, T. and Casella, G. The Bayesian Lasso. J. Am. Stat. Assoc., 103(482):681-686, June 2008. ISSN 0162-1459.
Qi, Y., Liu, D., Dunson, D., and Carin, L. Multi-task compressive sensing with dirichlet process priors. In International conference on Machine learning, pp. 768-775, 2008.
Ramachandran, Deepak and Amir, Eyal. Bayesian inverse reinforcement learning. In 20th Int. Joint Conf. Artificial Intelligence, India, 2007.
Rasmussen, C. E. and Ghahramani, Z. Infinite Mixtures of Gaussian Process Experts. In Advances in Neural Information Processing Systems 14, 2002.
Schaal, S. and Atkeson, C. G. Constructive incremental learning from only local information. Neural Comput., 10(8):2047-2084, 1998.
Shahbaba, Babak and Neal, Radford. Nonlinear models using dirichlet process mixtures. J. Mach. Learn. Res., pp. 1829-1850, August 2009.
Taylor, G.W., Hinton, G.E., and Roweis, S.T. Modeling human motion using binary latent variables. Advances in neural information processing systems, 19:1345, 2007.
Wood, F., Grollman, D. H., Heller, K. A., Jenkins, O. C., and Black, M. Incremental nonparametric bayesian regression, 2008.
Yau, C. and Holmes, C. Hierarchical bayesian nonparametric mixture models for clustering with variable relevance determination. Bayesian Anal, 6:329-352, 2011.
-----0
Ali, R. A., Richardson, T., Spirtes, P., and Zhang, J. Towards characterizing Markov equivalence classes for directed acyclic graphs with latent variables. In UAI, 2005.
Anandkumar, A. and Valluvan, R. Learning loopy graphical models with latent variables: Efficient methods and guarantees. arXiv:1203.3887, 2012.
Anandkumar, A., Chaudhuri, K., Hsu, D., Kakade, S. M., Song, L., and Zhang, T. Spectral methods for learning multivariate latent tree structure. In NIPS, 2011.
Anandkumar, A., Foster, D. P., Hsu, D., Kakade, S. M., and Liu, Y.-K. A spectral algorithm for latent Dirichlet allocation. In NIPS, 2012a.
Anandkumar, A., Hsu, D., Javanmard, A., and Kakade, S. M. Learning linear bayesian networks with latent variables. arXiv:1209.5350, 2012b.
Bach, F. R. and Jordan, M. I. Beyond independent components: trees and clusters. JMLR, 4:1205-1233, 2003.
Blei, D. M. and Lafferty, J. D. A correlated topic model of science. Annals of Applied Statistics, pp. 17-35, 2007.
Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. JMLR, 3:993-1022, 2003.
Bollen, K. A. Structural Equations with Latent Variables. Wiley, New York, 1989.
Bresler, G., Mossel, E., and Sly, A. Reconstruction of Markov random fields from samples: some observations and algorithms. In APPROX, 2008.
Chandrasekaran, V., Sanghavi, S., Parrilo, P. A., and Willsky, A. S. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572-596, 2011.
Chandrasekaran, V., Parrilo, P. A., and Willsky, A. S. Latent variable graphical model selection via convex optimization. Annals of Statistics (to appear), 2012.
Chickering, D. M. Optimal structure identification with greedy search. JMLR, 3:507-554, 2003.
Choi, M. J., Lim, J. J., Torralba, A., and Willsky, A. S. Exploiting hierarchical context on a large database of object categories. In CVPR, 2010.
Chow, C. and Liu, C. Approximating discrete probability distributions with dependence trees. IEEE Tran. on Information Theory, 14(3):462-467, 1968.
Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
Erdos, P. L., Szekely, L. A., Steel, M. A., and Warnow, T. J. A few logs suffice to build (almost) all trees: Part I. Random Structures and Algorithms, 14:153-184, 1999.
Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J., and Scholkopf, B. Nonlinear causal discovery with additive noise models. In NIPS, 2009.
Hsu, D., Kakade, S. M., and Zhang, T. Robust matrix decomposition with sparse corruptions. IEEE Trans. on Inf. Theory, 57(11):7221-7234, 2011.
Hyvarinen, A., Karhunen, J., and Oja, E. Independent Component Analysis. Wiley Interscience, 2001.
Lauritzen, S. Graphical Models. Oxford University Press, 1996.
Li, W. and McCallum, A. Pachinko allocation: DAG-structured mixture models of topic correlations. In ICML, pp. 577-584, 2006.
Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
Peters, J. and Buhlmann, P. Identifiability of Gaussian structural equation models with same error variances. arXiv:1205.2536v1, 2012.
Peters, J., Mooij, J., Janzing, D., and Scholkopf, B. Identifiability of causal graphs using functional models. In UAI, 2011.
Ravikumar, P., Wainwright, M. J., and Lafferty, J. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Annals of Statistics, 38(3):1287-1319, 2010.
Saunderson, J., Chandrasekaran, V., Parrilo, P. A., and Willsky, A. S. Diagonal and low-rank matrix decompositions, correlation matrices, and ellipsoid fitting. arXiv:1204.1220, 2012.
Shimizu, S., Hoyer, P. O., Hyvarinen, A., and Kerminen, A. A linear non-gaussian acyclic model for causal discovery. JMLR, 7:2003-2030, 2006.
Silva, R., Scheines, R., Glymour, C., and Spirtes, P. Learning the structure of linear latent variable models. JMLR, 7:191-246, 2006.
Spielman, D. A., Wang, H., and Wright, J. Exact recovery of sparsely-used dictionaries. arXiv:1206.5882v1, 2012.
Spirtes, P., Glymour, C., and Scheines, R. Causation, Prediction, and Search. MIT press, 2nd edition, 2000.
Theis, F. J. Towards a general independent subspace analysis. In NIPS, 2007.
Zellner, A. Introduction to Bayesian Inference in Econometrics. New York: John Wiley, 2nd edition, 1971.
-----0
Akaho, S. A kernel method for canonical correlation analysis. In Proc. Int'l Meeting on Psychometric Society, 2001.
Anderson, T. W. An Introduction to Multivariate Statistical Analysis (2nd edition). John Wiley and Sons, 1984.
Arora, R. and Livescu, K. Kernel CCA for multi-view learning of acoustic features using articulatory measurements. In Symp. on Machine Learning in Speech and Language Processing, 2012.
Arora, R. and Livescu, K. Multi-view CCA-based acoustic features for phonetic recognition across speakers and domains. In Int. Conf. on Acoustics, Speech, and Signal Processing, 2013.
Bach, F. R. Consistency of trace norm minimization. J. Mach. Learn. Res., 9:1019-1048, June 2008.
Bach, F. R. and Jordan, M. I. Kernel independent component analysis. J. Mach. Learn. Res., 3:1-48, 2002.
Bengio, Y. and Delalleau, O. Justifying and generalizing contrastive divergence. Neural Computation, 21(6):1601-1621, 2009.
Blaschko, M. B. and Lampert, C. H. Correlational spectral clustering. In CVPR, 2008.
Chaudhuri, K., Kakade, S. M., Livescu, K., and Sridharan, K. Multi-view clustering via canonical correlation analysis. In ICML, 2009.
Choukri, K. and Chollet, G. Adaptation of automatic speech recognizers to new speakers using canonical correlation analysis techniques. Speech Comm., 1:95-107, 1986.
Davis, S. B. and Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoustics, Speech, and Signal Proc., 28(4):357-366, 1980.
De Bie, T. and De Moor, B. On the regularization of canonical correlation analysis. In Proc. Int'l Conf. on Independent Component Analysis and Blind Source Separation, 2003.
Dhillon, P., Foster, D., and Ungar, L. Multi-view learning of word embeddings via CCA. In NIPS, 2011.
Ek, C. H., Torr, P. H., and Lawrence, N. D. Ambiguity modelling in latent spaces. In MLMI, 2008.
Haghighi, A., Liang, P., Berg-Kirkpatrick, T., and Klein, D. Learning bilingual lexicons from monolingual corpora. In ACL-HLT, 2008.
Hardoon, D. R., Szedmak, S., and Shawe-Taylor, J. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639-2664, 2004.
Hardoon, D. R., Mourao-Miranda, J., Brammer, M., and Shawe-Taylor, J. Unsupervised analysis of fMRI data using kernel canonical correlation. NeuroImage, 37(4):1250-1259, 2007.
Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.
Hinton, G. E., Osindero, S., and Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527-1554, 2006.
Hotelling, H. Relations between two sets of variates. Biometrika, 28(3/4):321-377, 1936.
Kakade, S. M. and Foster, D. P. Multi-view regression via canonical correlation analysis. In COLT, 2007.
Kim, T. K., Wong, S. F., and Cipolla, R. Tensor canonical correlation analysis for action classification. In CVPR, 2007.
Le, Q. V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. Y. On optimization methods for deep learning. In ICML, 2011.
LeCun, Y. and Cortes, C. The MNIST database of handwritten digits, 1998.
Lewis, A. S. Derivatives of spectral functions. Mathematics of Operations Research, 21(3):576-588, 1996.
Mardia, K. V., Kent, J. T., and Bibby, J. M. Multivariate Analysis. Academic Press, 1979.
Melzer, T., Reiter, M., and Bischof, H. Nonlinear feature extraction using generalized canonical correlation analysis. In ICANN, 2001.
Montanarella, L., Bassami, M., and Breas, O. Chemometric classification of some European wines using pyrolysis mass spectrometry. Rapid Communications in Mass Spectrometry, 9(15):1589-1593, 1995.
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., and Ng, A. Y. Multimodal deep learning. In ICML, 2011.
Nocedal, J. and Wright, S. J. Numerical Optimization. Springer, New York, 2nd edition, 2006.
Petersen, K. B. and Pedersen, M. S. The matrix cookbook, Nov 2012. URL http://www2.imm.dtu.dk/pubdb/p.php?3274.
Rudzicz, F. Adaptive kernel canonical correlation analysis for estimation of task dynamics from acoustics. In ICASSP, 2010.
Salakhutdinov, R. and Hinton, G. E. Deep Boltzmann machines. In AISTATS, 2009.
Sargin, M. E., Yemez, Y., and Tekalp, A. M. Audiovisual synchronization and fusion using canonical correlation analysis. IEEE Trans. Multimedia, 9(7):1396-1403, 2007.
Slaney, M. and Covell, M. FaceSync: A linear operator for measuring synchronization of video facial images and audio tracks. In NIPS, 2000.
Srivastava, N. and Salakhutdinov, R. Multimodal learning with deep Boltzmann machines. In NIPS, 2012.
Vert, J.-P. and Kanehisa, M. Graph-driven features extraction from microarray data using diffusion kernels and kernel CCA. In NIPS, 2002.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In ICML. ACM, 2008.
Vinokourov, A., Shawe-Taylor, J., and Cristianini, N. Inferring a semantic representation of text via cross-language correlation analysis. In NIPS, 2003.
Westbury, J. R. X-ray microbeam speech production database user's handbook. Waisman Center on Mental Retardation & Human Development, U. Wisconsin, Madison, WI, version 1.0 edition, June 1994.
-----0
Asadi, N. and Lin, J. Training efficient tree-based models for document ranking. In European Conference on Information Retrieval (ECIR), 2013.
Birkbeck, N., Sofka, M., and Zhou, S. K. Fast boosting trees for classification, pose detection, and boundary detection on a GPU. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011.
Bradley, J. K. and Schapire, R. E. Filterboost: Regression and classification on large datasets. In Neural Information Processing Systems (NIPS), 2007.
Breiman, L. Arcing classifiers. In The Annals of Statistics, 1998.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification and Regression Trees. Chapman & Hall, New York, NY, 1984.
Buhlmann, P. and Hothorn, T. Boosting algorithms: Regularization, prediction and model fitting. In Statistical Science, 2007.
Burgos-Artizzu, X.P., Dollar, P., Lin, D., Anderson, D.J., and Perona, P. Social behavior recognition in continuous videos. In Computer Vision and Pattern Recognition (CVPR), 2012.
Coates, A., Baumstarck, P., Le, Q., and Ng, A. Y. Scalable learning for object detection with GPU hardware. In Intelligent Robots and Systems (IROS), 2009.
Dalal, N. and Triggs, B. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR), 2005.
Dollar, P., Tu, Z., Tao, H., and Belongie, S. Feature mining for image classification. In Computer Vision and Pattern Recognition (CVPR), June 2007.
Dollar, P., Tu, Z., Perona, P., and Belongie, S. Integral channel features. In British Machine Vision Conference (BMVC), 2009.
Dollar, P., Appel, R., and Kienzle, W. Crosstalk cascades for frame-rate pedestrian detection. In European Conference on Computer Vision (ECCV), 2012.
Domingo, C. and Watanabe, O. Scaling up a boosting-based learner via adaptive sampling. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2000.
Domingos, P. and Hulten, G. Mining high-speed data streams. In International Conference on Knowledge Discovery and Data Mining, 2000.
Dubout, C. and Fleuret, F. Boosting with maximum adaptive sampling. In Neural Information Processing Systems (NIPS), 2011.
Freund, Y. Boosting a weak learning algorithm by majority. In Information and Computation, 1995.
Freund, Y. and Schapire, R. E. Experiments with a new boosting algorithm. In Machine Learning International Workshop, 1996.
Friedman, J., Hastie, T., and Tibshirani, R. Additive logistic regression: a statistical view of boosting. In The Annals of Statistics, 2000.
Friedman, J. H. Greedy function approximation: A gradient boosting machine. In The Annals of Statistics, 2000.
Friedman, J. H. Stochastic gradient boosting. In Computational Statistics & Data Analysis, 2002.
Kegl, B. and Busa-Fekete, R. Accelerating adaboost using UCB. KDD-Cup 2009 competition, 2009.
Kotsiantis, S. B. Supervised machine learning: A review of classification techniques. Informatica, 31:249-268, 2007.
LeCun, Y. and Cortes, C. The MNIST database of handwritten digits, 1998.
Mnih, V., Szepesvari, C., and Audibert, J. Empirical bernstein stopping. In International Conference on Machine Learning (ICML), 2008.
Paul, B., Athithan, G., and Murty, M.N. Speeding up adaboost classifier with random projection. In International Conference on Advances in Pattern Recognition (ICAPR), 2009.
Quinlan, R. J. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
Quinlan, R. J. Bagging, boosting, and C4.5. In National Conference on Artificial Intelligence, 1996.
Ridgeway, G. The state of boosting. In Computing Science and Statistics, 1999.
Rowley, H., Baluja, S., and Kanade, T. Neural network-based face detection. In Computer Vision and Pattern Recognition (CVPR), 1996.
Schapire, R. E. The strength of weak learnability. In Machine Learning, 1990.
Sharp, T. Implementing decision trees and forests on a gpu. In European Conference on Computer Vision (ECCV), 2008.
Svore, K. M. and Burges, C. J. Large-scale learning to rank using boosted decision trees. Scaling Up Machine Learning: Parallel and Distributed Approaches, 2011.
Viola, P. A. and Jones, M. J. Robust real-time face detection. International Journal of Computer Vision (IJCV), 2004.
Wu, J., Brubaker, S. C., Mullin, M. D., and Rehg, J. M. Fast asymmetric learning for cascade face detection. Pattern Analysis and Machine Intelligence (PAMI), 2008.
-----0
Ailon, N. and Chazelle, B. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. SICOMP, pp. 302-322, 2009.
Anandkumar, A., Foster, D., Hsu, D., Kakade, S., and Liu, Y. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent dirichlet allocation. In NIPS, 2012.
Arora, S., Ge, R., Kannan, R., and Moitra, A. Computing a nonnegative matrix factorization - provably. In STOC, pp. 145-162, 2012a.
Arora, S., Ge, R., and Moitra, A. Learning topic models - going beyond SVD. In FOCS, 2012b.
Bittorf, V., Recht, B., Re, C., and Tropp, J. Factoring nonnegative matrices with linear programs. In NIPS, 2012.
Blei, D. Introduction to probabilistic topic models. Communications of the ACM, pp. 77-84, 2012.
Blei, D. and Lafferty, J. A correlated topic model of science. Annals of Applied Statistics, pp. 17-35, 2007.
Blei, D., Ng, A., and Jordan, M. Latent dirichlet allocation. Journal of Machine Learning Research, pp. 993-1022, 2003. Preliminary version in NIPS 2001.
Buntine, Wray L. Estimating likelihoods for topic models. In Asian Conference on Machine Learning, 2009.
Deerwester, S., Dumais, S., Landauer, T., Furnas, G., and Harshman, R. Indexing by latent semantic analysis. JASIS, pp. 391-407, 1990.
Donoho, D. and Stodden, V. When does non-negative matrix factorization give the correct decomposition into parts? In NIPS, 2003.
Gillis, N. Robustness analysis of hotttopixx, a linear programming model for factoring nonnegative matrices, 2012. http://arxiv.org/abs/1211.6687.
Gillis, N. and Vavasis, S. Fast and robust recursive algorithms for separable nonnegative matrix factorization, 2012. http://arxiv.org/abs/1208.1237.
Gomez, C., Borgne, H. Le, Allemand, P., Delacourt, C., and Ledru, P. N-findr method versus independent component analysis for lithological identification in hyperspectral imagery. Int. J. Remote Sens., 28(23), January 2007.
Griffiths, T. L. and Steyvers, M. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228-5235, 2004.
Kumar, A., Sindhwani, V., and Kambadur, P. Fast conical hull algorithms for near-separable non-negative matrix factorization. 2012. http://arxiv.org/abs/1210.1190v1.
Li, W. and McCallum, A. Pachinko allocation: DAG-structured mixture models of topic correlations. In ICML, pp. 633-640, 2007.
McCallum, A.K. Mallet: A machine learning for language toolkit, 2002. http://mallet.cs.umass.edu.
Mimno, David, Wallach, Hanna, Talley, Edmund, Leenders, Miriam, and McCallum, Andrew. Optimizing semantic coherence in topic models. In EMNLP, 2011.
Nascimento, J. M. P. and Dias, J. M. B. Vertex component analysis: A fast algorithm to unmix hyperspectral data. IEEE TRANS. GEOSCI. REM. SENS, 43:898-910, 2004.
Stevens, Keith, Kegelmeyer, Philip, Andrzejewski, David, and Buttler, David. Exploring topic coherence over many models and many topics. In EMNLP, 2012.
Thurau, C., Kersting, K., and Bauckhage, C. Yes we can: simplex volume maximization for descriptive web-scale matrix factorization. In CIKM '10, 2010.
Wallach, Hanna, Murray, Iain, Salakhutdinov, Ruslan, and Mimno, David. Evaluation methods for topic models. In ICML, 2009.
Yao, Limin, Mimno, David, and McCallum, Andrew. Efficient methods for topic model inference on streaming document collections. In KDD, 2009.
-----0
Bartlett, Peter L., Jordan, Michael I., and McAuliffe, Jon D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.
Ben-David, Shai, Loker, David, Srebro, Nathan, and Sridharan, Karthik. Minimizing the misclassification error rate using a surrogate convex loss. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 1863-1870, 2012.
Chen, Di-Rong and Sun, Tao. Consistency of multiclass empirical risk minimization methods based on convex loss. The Journal of Machine Learning Research, 7:2435-2447, 2006.
Cortes, Corinna and Vapnik, Vladimir. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
Guruprasad, Harish and Agarwal, Shivani. Classification calibration dimension for general multiclass losses. In Advances in Neural Information Processing Systems 25, pp. 2087-2095. 2012.
Hoffgen, Klaus-U, Simon, Hans-U, and Horn, Kevin S Van. Robust trainability of single neurons. Journal of Computer and System Sciences, 50:114-125, Jan 1995.
Lee, Yoonkyung, Lin, Yi, and Wahba, Grace. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67-81, 2004.
Liu, Yufeng. Fisher consistency of multicategory support vector machines. Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, 2:289-296, 2007.
Mroueh, Youssef, Poggio, Tomaso, Rosasco, Lorenzo, and Slotine, Jean-Jacques. Multiclass learning with simplex coding. In Advances in Neural Information Processing Systems 25, pp. 2798-2806. 2012.
Reid, Mark, Williamson, Robert, and Sun, Peng. The convexity and design of composite multiclass losses. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 687-694, 2012.
Reid, Mark D. and Williamson, Robert C. Surrogate regret bounds for proper losses. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML-09), pp. 897-904, 2009.
Rosasco, Lorenzo, Vito, Ernesto De, Caponnetto, Andrea, Piana, Michele, and Verri, Alessandro. Are loss functions all the same? Neural Computation, 16(5):1063 107, 2004.
Steinwart, Ingo. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225-287, 2007.
Tewari, Ambuj and Bartlett, Peter L. On the consistency of multiclass classification methods. The Journal of Machine Learning Research, 8:1007-1025, 2007.
Vernet, Elodie, Williamson, Robert C., and Reid, Mark D. Composite multiclass losses. In Advances in Neural Information Processing Systems 24, pp. 1224-1232. 2011.
Zhang, Tong. Statistical analysis of some multi-category large margin classification methods. The Journal of Machine Learning Research, 5:1225-1251, 2004.
-----0
Ailon, N. and Liberty, E. Fast dimension reduction using Rademacher series on dual BCH codes. In SODA, 2008.
Avron, H., Maymounkov, P., and Toledo, S. Blendenpik: Supercharging LAPACK's least-squares solver. SIAM Journal on Scientific Computing, 32(3):1217-1236, 2010.
Avron, Haim, Boutsidis, Christos, Toledo, Sivan, and Zouzias, Anastasios. Efficient dimensionality reduction for canonical correlation analysis. CoRR, abs/1209.2185, 2012.
Bjorck, A. and Golub, G.H. Numerical methods for computing angles between linear subspaces. Mathematics of computation, 27(123):579-594, 1973.
Boutsidis, C. and Drineas, P. Random projections for the nonnegative least-squares problem. Linear Algebra and its Applications, 431(5-7):760-771, 2009.
Boutsidis, C., Zouzias, A., and Drineas, P. Random projections for k-means clustering. In NIPS, 2010.
Chaudhuri, K., Kakade, S. M., Livescu, K., and Sridharan, K. Multi-view clustering via canonical correlation analysis. In ICML, pp. 129-136, 2009.
Dhillon, P., Rodu, J., Foster, D., and Ungar, L. Using CCA to improve CCA: A new spectral method for estimating vector models of words. In ICML, 2012.
Dhillon, P. S., Foster, D., and Ungar, L. Multi-view learning of word embeddings via CCA. In NIPS, 2011.
Drineas, P., Mahoney, M.W., Muthukrishnan, S., and Sarlos, T. Faster least squares approximation. Numerische Mathematik, 117(2):217-249, 2011.
Drineas, P., Magdon-Ismail, M., Mahoney, M. W., and Woodruff, D. P. Fast approximation of matrix coherence and statistical leverage. In ICML, 2012.
Golub, G.H. and Zha, H. The canonical correlations of matrix pairs and their numerical computation. IMA Volumes in Mathematics and its Applications, 69:2727, 1995.
Halko, N., Martinsson, P.G., and Tropp, J.A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011.
Hotelling, H. Relations between two sets of variates. Biometrika, 28(3/4):321-377, 1936.
Ipsen, I. and Wentworth, T. The effect of coherence on sampling from matrices with orthonormal columns, and preconditioned least squares problems. Arxiv preprint arXiv:1203.4809, 2012.
Kim, T.-K., Kittler, J., and Cipolla, R. Discriminative learning and recognition of image set classes using canonical correlations. IEEE Trans. Pattern Anal. Mach. Intell., 29(6):1005-1018, 2007.
Rokhlin, V. and Tygert, M. A fast randomized algorithm for overdetermined linear least-squares regression. Proceedings of the National Academy of Sciences, 105(36):13212, 2008.
Sarlos, T. Improved approximation algorithms for large matrices via random projections. In FOCS, 2006.
Snoek, C. G. M., Worring, M., van Gemert, J. C., Geusebroek, J. M., and Smeulders, A. W. M. The challenge problem for automated detection of 101 semantic concepts in multimedia. In Proceedings of the ACM international conference on Multimedia, pp. 421-430, 2006.
Su, Y., Fu, Y., Gao, X., and Tian, Q. Discriminant learning through multiple principal angles for visual recognition. Image Processing, IEEE Transactions on, 21(3):1381-1390, March 2012.
Sun, L., Ji, S., and Ye, J. A least squares formulation for canonical correlation analysis. In ICML, pp. 1024-1031, 2008.
Sun, L., Ceran, B., and Ye, J. A scalable two-stage approach for a class of dimensionality reduction techniques. In KDD, pp. 313-322, 2010.
Talwalkar, Ameet and Rostamizadeh, Afshin. Matrix coherence and the Nystrom method. In UAI, pp. 572-579, 2010.
Tropp, J. Improved analysis of the subsampled randomized Hadamard transform. Adv. Adapt. Data Anal., special issue, Sparse Representation of Data and Images, 2011.
-----0
Ahonen, T., Hadid, A., and Pietikainen, M. Face description with local binary patterns: Application to face recognition. PAMI, 2006.
Ali, S. and Shah, M. Human action recognition in videos using kinematic features and multiple instance learning. PAMI, 2010.
Bunau, P., Meinecke, F., Kiraly, F., and Muller, K. Finding stationary subspaces in multivariate time series. Physical Review Letters, 2009.
Campos, T., Barnard, M., Mikolajczyk, K., Kittler, J., Yan, F., Christmas, W., and Windridge, D. An evaluation of bags-of-words and spatio-temporal shapes for action recognition. In WACV, 2011.
Castrodad, A. and Sapiro, G. Sparse modeling of human actions from motion imagery. IJCV, 2012.
Cevikalp, H. and Triggs, B. Face recognition based on image sets. In CVPR, 2010.
Chan, A. and Vasconcelos, N. Probabilistic kernels for the classification of auto-regressive visual processes. In CVPR, 2005.
Chan, A. and Vasconcelos, N. Classifying video with kernel dynamic textures. In CVPR, 2007.
Chan, A., Coviello, E., and Lanckriet, G. Clustering dynamic textures with the hierarchical em algorithm. In CVPR, 2010.
Chaudhry, R., Ravichandran, A., Hager, G., and Vidal, R. Histograms of oriented optical flow and binet-cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In CVPR, 2009.
Doretto, G., Chiuso, A., Wu, Y., and Soatto, S. Dynamic textures. IJCV, 2003.
Ghanem, B. and Ahuja, N. Maximum margin distance learning for dynamic texture recognition. In ECCV, 2010.
Gu, L., Li, S., and Zhang, H. Learning probabilistic distribution model for multi-view face detection. In CVPR, 2001.
Hamm, J. and Lee, D. Grassmann discriminant analysis: a unifying view on subspace-based learning. In ICML, 2008.
Hara, S., Kawahara, Y., Washio, T., Bunau, P., Tokunaga, T., and Yumoto, K. Separation of stationary and nonstationary sources with a generalized eigenvalue problem. Neural networks, 2012.
Hotta, K. Local co-occurrence features in subspace obtained by kpca of local blob visual words for scene classification. Pattern Recognition, 2012.
Knopp, J., Prasad, M., Willems, G., Timofte, R., and Van Gool, L. Hough transform and 3d surf for robust three dimensional classification. In ECCV, 2010.
Lanckriet, G., Ghaoui, L., Bhattacharyya, C., and Jordan, M. A robust minimax approach to classification. JMLR, 2003.
Le, Q., Zou, W., Yeung, S., and Ng, A. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.
Long, F., Wu, T., Movellan, J., Bartlett, M., and Littlewort, G. Learning spatiotemporal features by using independent component analysis with application to facial expression recognition. Neurocomputing, 2012.
Muller, J., Bunau, P., Meinecke, F., Kiraly, F., and Muller, K. The stationary subspace analysis toolbox. JMLR, 2011.
Niebles, J., Wang, H., and Li, F. Unsupervised learning of human action categories using spatial-temporal words. IJCV, 2008.
O'Hara, S. and Draper, B. Scalable action recognition with a subspace forest. In CVPR, 2012.
Ravichandran, A., Chaudhry, R., and Vidal, R. View-invariant dynamic texture recognition using a bag of dynamical systems. In CVPR, 2009.
Rodriguez, M., Ahmed, J., and Shah, M. Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In CVPR, 2008.
Saisan, P., Doretto, G., Wu, Y., and Soatto, S. Dynamic texture recognition. In CVPR, 2001.
Sankaranarayanan, A., Turaga, P., Baraniuk, R., and Chellappa, R. Compressive acquisition of dynamic scenes. In ECCV, 2010.
Schuldt, C., Laptev, I., and Caputo, B. Recognizing human actions: A local SVM approach. In ICPR, 2004.
Sivic, J. and Zisserman, A. Video google: A text retrieval approach to object matching in videos. In ICCV, 2003.
Thurau, C. and Hlavac, V. Pose primitive based human action recognition in videos or still images. In CVPR, 2008.
Tipping, M., Bishop, C., et al. Mixtures of probabilistic principal component analyzers. Neural computation, 1999.
Tseng, C., Chen, J., Fang, C., and Lien, J. Human action recognition based on graph-embedded spatio-temporal subspace. Pattern Recognition, 2012.
Vidal, R., Ma, Y., and Sastry, S. Generalized principal component analysis (GPCA). PAMI, 2005.
Wang, H., Klaser, A., Schmid, C., and Liu, C. Dense trajectories and motion boundary descriptors for action recognition. Research report, INRIA, 2012.
Wang, Y. and Mori, G. Human action recognition by semi-latent topic models. PAMI, 2009.
Xu, Y., Quan, Y., Ling, H., and Ji, H. Dynamic texture classification using dynamic fractal analysis. In ICCV, 2011.
-----0
Bartlett, P.L., Bousquet, O., and Mendelson, S. Local rademacher complexities. The Annals of Statistics, 2005.
Bengio, S., Pereira, F., Singer, Y., and Strelow, D. Group sparse coding. In NIPS, 2009.
Bertsekas, D. On the goldstein-levitin-polyak gradient projection method. IEEE Transactions on Automatic Control, 1976.
Bronstein, A., Sprechmann, P., and Sapiro, G. Learning efficient structured sparse models. ICML, 2012.
Devroye, L. and Lugosi, G. Combinatorial methods in density estimation. 2001.
Fan, J. and Lv, J. Sure independence screening for ultrahigh dimensional feature space. JRSS: B(Statistical Methodology), 2008.
Genovese, C. R., Jin, J., Wasserman, L., and Yao, Z. A comparison of the lasso and marginal regression. JMLR, 2012.
Hastie, T. and Loader, C. Local regression: Automatic kernel carpentry. Statistical Science, pp. 120-129, 1993.
Jenatton, R., Mairal, J., Obozinski, G., and Bach, F. Proximal methods for sparse hierarchical dictionary learning. ICML, 2010.
Jenatton, R., Gribonval, R., and Bach, F. Local stability and robustness of sparse dictionary learning in the presence of noise. arXiv preprint arXiv:1210.0685, 2012.
Klaser, A., Marszałek, M., and Schmid, C. A spatio-temporal descriptor based on 3d-gradients. In BMVC, 2008.
Lee, H., Battle, A., Raina, R., and Ng, A.Y. Efficient sparse coding algorithms. In NIPS, 2007.
Liu, J., Luo, J., and Shah, M. Recognizing realistic actions from videos in the wild. In CVPR, 2009.
Loader, C. Local regression and likelihood. Springer Verlag, 1999.
Lowe, D. G. Object recognition from local scale-invariant features. ICCV, 1999.
Magnus, J. R. and Neudecker, H. Matrix differential calculus with applications in statistics and econometrics. 1988.
Meier, L. and Buhlmann, P. Smoothing l1-penalized estimators for high-dimensional time-course data. Electronic Journal of Statistics, 2007.
Raina, R., Battle, A., Lee, H., Packer, B., and Ng, A.Y. Self-taught learning: transfer learning from unlabeled data. In ICML, 2007.
Ramírez, I., Lecumberry, F., and Sapiro, G. Sparse modeling with universal priors and learned incoherent dictionaries. Tech Report, IMA, University of Minnesota, 2009.
Sigg, C. D., Dikk, T., and Buhmann, J. M. Learning dictionaries with bounded self-coherence. IEEE Transactions on Signal Processing, 2012.
Solodov, M. V. Convergence analysis of perturbed feasible descent methods. Journal of Optimization Theory and Applications, 1997.
Tropp, J.A. Greed is good: Algorithmic results for sparse approximation. Information Theory, IEEE, 2004.
Vainsencher, D., Mannor, S., and Bruckstein, A.M. The sample complexity of dictionary learning. JMLR, 2011.
Wang, H., Ullah, M.M., Klaser, A., Laptev, I., and Schmid, C. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.
Yang, J., Yu, K., and Huang, T. Supervised translation-invariant sparse coding. In CVPR, 2010.
Yu, K., Zhang, T., and Gong, Y. Nonlinear learning using local coordinate coding. NIPS, 2009.
Yu, K., Lin, Y., and Lafferty, J. Learning image representations from the pixel level via hierarchical sparse coding. In CVPR, 2011.
Zangwill, W.I. Nonlinear programming: a unified approach. 1969.
-----0
Balcan, M.-F. and Blum, A. A PAC-style model for learning from labeled and unlabeled data. In Proceedings of the 18th Annual Conference on Computational Learning Theory (COLT), 2005.
Balcan, M. F., Beygelzimer, A., and Langford, J. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006.
Balcan, Maria-Florina and Blum, Avrim. A discriminative model for semi-supervised learning. Journal of the ACM, 57(3), 2010.
Beygelzimer, A., Hsu, D., Langford, J., and Zhang, T. Agnostic active learning without constraints. In Advances in Neural Information Processing Systems (NIPS), 2010.
Blum, Avrim and Balcan, Maria-Florina. Open problems in efficient semi-supervised PAC learning. In Proceedings of the 20th Annual Conference on Computational Learning Theory (COLT), 2007.
Blum, Avrim and Mitchell, Tom. Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT), 1998.
Chapelle, O., Schölkopf, B., and Zien, A. Semi-Supervised Learning. MIT Press, 2006.
Dasgupta, S. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems (NIPS), 2005.
Dasgupta, S. Active learning. Encyclopedia of Machine Learning, 2011.
Dasgupta, S., Littman, M. L., and McAllester, D. PAC generalization bounds for co-training. In Advances in Neural Information Processing Systems (NIPS), 2001.
Dasgupta, Sanjoy, Hsu, Daniel, and Monteleoni, Claire. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems (NIPS), 2007.
Hanneke, S. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning (ICML), 2007a.
Hanneke, S. Teaching dimension and the complexity of active learning. In Proceedings of the 20th Annual Conference on Computational Learning Theory (COLT), 2007b.
Joachims, T. Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning (ICML), 1999a.
Joachims, Thorsten. Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning (ICML), 1999b.
Kaariainen, Matti. Generalization error bounds using unlabeled data. In Proceedings of the 18th Annual Conference on Computational Learning Theory (COLT), 2005.
Koltchinskii, V. Rademacher complexities and bounding the excess risk in active learning. Journal of Machine Learning Research, 11:2457-2485, 2010.
Littlestone, Nick. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285-318, 1988.
Rigollet, Philippe. Generalization error bounds in semi-supervised classification under the cluster assumption. Journal of Machine Learning Research, 8, 2007.
Rosenberg, David S. and Bartlett, Peter L. The rademacher complexity of co-regularized kernel classes. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS), 2007.
Rosenblatt, Frank. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-407, 1958.
Zhu, X., Ghahramani, Z., and Lafferty, J. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML), 2003.
Zhu, Xiaojin. Semi-supervised learning. Encyclopedia of Machine Learning, 2011.
-----0
Abney, S. Bootstrapping. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL '02, pp. 360-367, 2002.
Balcan, M.-F. and Blum, A. A PAC-style model for learning from labeled and unlabeled data. In Proceedings of the 18th Annual Conference on Computational Learning Theory (COLT), 2005.
Blum, A. and Mitchell, T. Combining labeled and unlabeled data with co-training. In The 11th Annual Conference on Computational Learning Theory (COLT), 1998.
Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka Jr., E. R., and Mitchell, T.M. Toward an architecture for never-ending language learning. In AAAI, 2010a.
Carlson, A., Betteridge, J., Wang, R. C., Hruschka Jr., E.R., and Mitchell, T.M. Coupled semi-supervised learning for information extraction. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, 2010b.
Cavallanti, G., Cesa-Bianchi, N., and Gentile, C. Linear algorithms for online multitask classification. Journal of Machine Learning Research, 11:2901-2934, 2010.
Chapelle, O., Scholkopf, B., and Zien, A. (eds.). Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006. URL http://www.kyb.tuebingen.mpg.de/ssl-book.
He, Jingrui and Lawrence, Rick. A graph-based framework for multi-task multi-view learning. In ICML, pp. 25-32, 2011.
Mohamed, T., Hruschka Jr., E.R., and Mitchell, T. Discovering relations between noun categories. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 1447-1455, July 2011.
Sridharan, K. and Kakade, S.M. An information theoretic framework for multi-view learning. In COLT, pp. 403-414, 2008.
Verma, S. and Hruschka Jr., E.R. Coupled Bayesian sets algorithm for semi-supervised learning and information extraction. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD 2012), Bristol, UK, September 2012. Association for Computing Machinery.
Zhu, Xiaojin and Goldberg, Andrew B. Introduction to Semi-Supervised Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2009.
-----0
Benbouzid, D., Busa-Fekete, R., Casagrande, N., Collin, F.-D., and Kegl, B. MultiBoost: a multipurpose boosting package. Journal of Machine Learning Research, 13:549-553, 2012.
Bergstra, J. and Bengio, Y. Random search for hyperparameter optimization. Journal of Machine Learning Research, 2012.
Bergstra, J., Bardenet, R., Kegl, B., and Bengio, Y. Algorithms for hyperparameter optimization. In Advances in Neural Information Processing Systems (NIPS), volume 24. The MIT Press, 2011.
Brendel, M. and Schoenauer, M. Instance-based parameter tuning for evolutionary AI planning. In Proceedings of the 20th Genetic and Evolutionary Computation Conference, 2011.
Chu, W. and Ghahramani, Z. Preference learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, pp. 137-144, 2005.
Coates, A., Lee, H., and Ng, A. Y. An analysis of single-layer networks in unsupervised feature learning. In 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
Demšar, J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1-30, 2006.
Hutter, F. Automated Configuration of Algorithms for Solving Hard Computational Problems. PhD thesis, University of British Columbia, 2009.
Hutter, F., Hoos, H. H., Leyton-Brown, K., and Stutzle, T. ParamILS: an automatic algorithm configuration framework. Journal of Artificial Intelligence Research, 36:267-306, October 2009.
Joachims, T. Optimizing search engines using clickthrough data. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2002.
Jones, D. R. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21:345-383, 2001.
Kegl, B. and Busa-Fekete, R. Boosting products of base classifiers. In International Conference on Machine Learning, volume 26, pp. 497-504, Montreal, Canada, 2009.
Lacoste, A., Laviolette, F., and Marchand, M. Bayesian comparison of machine learning algorithms on single and multiple datasets. In 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
Lizotte, D. Practical Bayesian Optimization. PhD thesis, University of Alberta, 2008.
Nannen, V. and Eiben, A. E. Relevance estimation and value calibration of evolutionary algorithm parameters. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 975-980, 2007.
Pinto, N., Doukhan, D., DiCarlo, J. J., and Cox, D. D. A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Computational Biology, 5(11), 11 2009.
Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.
Schapire, R. E. and Singer, Y. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297-336, 1999.
Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, volume 25, 2012.
Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, 2010.
Thornton, C., Hutter, F., Hoos, H. H., and Leyton-Brown, K. Auto-WEKA: Automated selection and hyper-parameter optimization of classification algorithms. Technical report, http://arxiv.org/abs/1208.3719, 2012.
Villemonteix, J., Vazquez, E., and Walter, E. An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization, 2006.
-----0
Aji, Srinivas M. and McEliece, Robert J. The Generalized Distributive Law. IEEE Transactions on Information Theory, 46(2):325-343, 2000.
Asmuth, John and Littman, Michael L. Learning is Planning: Near Bayes-optimal Reinforcement Learning via Monte-Carlo Tree Search. In Proc. of the Conference on Uncertainty in Artificial Intelligence, 2011.
Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, and Bowling, Michael. Investigating Contingency Awareness Using Atari 2600 Games. In Proc. of the 26th AAAI Conference on Artificial Intelligence, 2012a.
Bellemare, Marc G., Veness, Joel, and Bowling, Michael. Sketch-Based Linear Value Function Approximation. In Advances in Neural Information Processing Systems 25, 2012b.
Bellemare, Marc G., Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research (JAIR), to appear, 2013.
Cesa-Bianchi, Nicolo and Lugosi, Gabor. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
Diuk, Carlos, Li, Lihong, and Leffler, Bethany R. The Adaptive k-Meteorologists Problem and Its Application to Structure Learning and Feature Selection in Reinforcement Learning. In Proc. of the 26th International Conference on Machine Learning, 2009.
Doshi-Velez, Finale. The Infinite Partially Observable Markov Decision Process. In Advances in Neural Information Processing Systems 22, 2009.
Farahmand, Amir massoud, Shademan, Azad, Jagersand, Martin, and Szepesvari, Csaba. Model-based and Model-free Reinforcement Learning for Visual Servoing. In Proc. of the IEEE International Conference on Robotics and Automation, 2009.
Grunwald, Peter D. The Minimum Description Length Principle (Adaptive Computation and Machine Learning). The MIT Press, 2007.
Guez, Arthur, Silver, David, and Dayan, Peter. Efficient Bayes-Adaptive Reinforcement Learning using Sample-based Search. In Advances in Neural Information Processing Systems 25, 2012.
Hausknecht, Matthew, Khandelwal, Piyush, Miikkulainen, Risto, and Stone, Peter. HyperNEAT-GGP: A HyperNEAT-based Atari general game player. In Proc. of the Genetic and Evolutionary Computation Conference, 2012.
Hutter, Marcus. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer, 2005.
Joseph, Joshua, Geramifard, Alborz, Roberts, John W., How, Jonathan P., and Roy, Nicholas. Reinforcement Learning with Misspecified Model Classes. In Proc. of the IEEE International Conference on Robotics and Automation, 2013.
Kaelbling, Leslie Pack, Littman, Michael L., and Cassandra, Anthony R. Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence, 101:99-134, 1998.
Kocsis, Levente and Szepesvari, Csaba. Bandit Based Monte-Carlo Planning. In Proc. of the European Conference on Machine Learning, 2006.
Naddaf, Yavar. Game-Independent AI Agents for Playing Atari 2600 Console Games. Master's thesis, University of Alberta, 2010.
Nguyen, Phuong, Sunehag, Peter, and Hutter, Marcus. Context Tree Maximizing Reinforcement Learning. In Proc. of the 26th AAAI Conference on Artificial Intelligence, pp. 1075-1082, Toronto, 2012. AAAI Press. ISBN 978-1-57735-568-7.
Poupart, Pascal. Model-based Bayesian Reinforcement Learning in Partially Observable Domains. In Proc. of the International Symposium on Artificial Intelligence and Mathematics, 2008.
Puterman, Martin L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
Ross, Stephane and Pineau, Joelle. Model-Based Bayesian Reinforcement Learning in Large Structured Domains. In Proc. of the Conference on Uncertainty in Artificial Intelligence, 2008.
Silver, David and Veness, Joel. Monte-Carlo Planning in Large POMDPs. In Advances in Neural Information Processing Systems 23, 2010.
Silver, David, Sutton, Richard S., and Muller, Martin. Sample-based Learning and Search with Permanent and Transient Memories. In Proc. of the 25th International Conference on Machine Learning, 2008.
Silver, David, Sutton, Richard S., and Muller, Martin. Temporal-Difference Search in Computer Go. Machine Learning, 87(2):183-219, 2012.
Sutskever, Ilya, Hinton, Geoffrey, and Taylor, Graham. The Recurrent Temporal Restricted Boltzmann Machine. In Advances in Neural Information Processing Systems 21, 2008.
Sutton, Richard S. Dyna, an Integrated Architecture for Learning, Planning, and Reacting. SIGART Bulletin, 2(4):160-163, 1991.
Veness, Joel and Hutter, Marcus. Sparse Sequential Dirichlet Coding. ArXiv e-prints, 2012.
Veness, Joel, Ng, Kee Siong, Hutter, Marcus, and Silver, David. Reinforcement Learning via AIXI Approximation. In Proc. of the AAAI Conference on Artificial Intelligence, 2010.
Veness, Joel, Ng, Kee Siong, Hutter, Marcus, Uther, William, and Silver, David. A Monte-Carlo AIXI Approximation. Journal of Artificial Intelligence Research (JAIR), 40(1), 2011.
Veness, Joel, Ng, Kee Siong, Hutter, Marcus, and Bowling, Michael. Context Tree Switching. In Proc. of the Data Compression Conference (DCC), 2012.
Walsh, Thomas J., Goschin, Sergiu, and Littman, Michael L. Integrating Sample-based Planning and Model-Based Reinforcement Learning. In Proc. of the AAAI Conference on Artificial Intelligence, 2010.
Willems, F. and Tjalkens, T.J. Complexity Reduction of the Context-Tree Weighting Algorithm: A Study for KPN Research. EIDMA Report RS.97.01, 1997.
Willems, Frans M.J., Shtarkov, Yuri M., and Tjalkens, Tjalling J. The Context Tree Weighting Method: Basic Properties. IEEE Transactions on Information Theory, 41:653-664, 1995.
-----0
Alain, G., Bengio, Y., and Rifai, S. (2012). Regularized auto-encoders estimate local statistics. Technical Report Arxiv report 1211.4246, Universite de Montreal.
Baxter, J. (1997). A Bayesian/information theoretic model of learning via multiple task sampling. Machine Learning, 28, 7-40.
Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1-127. Also published as a book. Now Publishers, 2009.
Bengio, Y. and Delalleau, O. (2011). On the expressive power of deep architectures. In ALT2011.
Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press.
Bengio, Y., Delalleau, O., and Le Roux, N. (2006). The curse of highly variable functions for local kernel machines. In NIPS05, pages 107-114. MIT Press, Cambridge, MA.
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy).
Breuleux, O., Bengio, Y., and Vincent, P. (2011). Quickly generating representative samples from an RBM-derived process. Neural Computation, 23(8), 2053-2073.
Caruana, R. (1995). Learning many related tasks at the same time with backpropagation. In NIPS94, pages 657-664, Cambridge, MA. MIT Press.
Cayton, L. (2005). Algorithms for manifold learning. Technical Report CS2008-0923, UCSD.
Cho, K., Raiko, T., and Ilin, A. (2010). Parallel tempering is efficient for learning restricted Boltzmann machines. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2010), Barcelona, Spain.
Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, pages 160-167. ACM.
Desjardins, G., Courville, A., Bengio, Y., Vincent, P., and Delalleau, O. (2010). Tempered Markov chain Monte Carlo for training of restricted Boltzmann machine. In JMLR W&CP: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), volume 9, pages 145-152.
Flor, H. (2003). Remapping somatosensory cortex after injury. Advances in Neurology, 83, 195-204.
Glorot, X., Bordes, A., and Bengio, Y. (2011). Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the Twenty-eight International Conference on Machine Learning (ICML11), volume 27, pages 97-110.
Goodfellow, I., Le, Q., Saxe, A., and Ng, A. (2009). Measuring invariances in deep networks. In NIPS2009, pages 646-654.
Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th annual ACM Symposium on Theory of Computing, pages 6-20, Berkeley, California. ACM Press.
Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Computational Complexity, 1, 113-129.
Hinton, G. E. and Zemel, R. S. (1994). Autoencoders, minimum description length, and Helmholtz free energy. In NIPS93, pages 3-10. Morgan Kaufmann Publishers, Inc.
Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527-1554.
Lang, K. J. and Hinton, G. E. (1988). The development of the time-delay neural network architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-Mellon University.
LeCun, Y. (1987). Modèles connexionistes de l'apprentissage. Ph.D. thesis, Universite de Paris VI.
LeCun, Y. (1989). Generalization and network design strategies. Technical Report CRG-TR-89-4, University of Toronto.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
Lee, H., Pham, P., Largman, Y., and Ng, A. (2009). Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS2009.
Narayanan, H. and Mitter, S. (2010). Sample complexity of testing the manifold hypothesis. In NIPS2010.
Neal, R. M. (1994). Sampling from multimodal distributions using tempered transitions. Technical Report 9421, Dept. of Statistics, University of Toronto.
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011a). Contracting auto-encoders: Explicit invariance during feature extraction. In Proceedings of the Twenty-eight International Conference on Machine Learning (ICML11).
Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., and Muller, X. (2011b). The manifold tangent classifier. In NIPS2011. Student paper award.
Rifai, S., Bengio, Y., Dauphin, Y., and Vincent, P. (2012). A generative process for sampling contractive autoencoders. In ICML2012.
Salakhutdinov, R. (2010a). Learning deep Boltzmann machines using adaptive MCMC. In L. Bottou and M. Littman, editors, Proceedings of the Twenty-seventh International Conference on Machine Learning (ICML10), volume 1, pages 943-950. ACM.
Salakhutdinov, R. (2010b). Learning in Markov random fields using tempered transitions. In NIPS09.
Salakhutdinov, R. and Hinton, G. E. (2009). Deep Boltzmann machines. In AISTATS2009, volume 5, pages 448-455.
Susskind, J., Anderson, A., and Hinton, G. E. (2010). The Toronto face dataset. Technical Report UTML TR 2010-001, U. Toronto.
-----0
Bardenet, R. and Kegl, B. Surrogating the surrogate: accelerating Gaussian Process optimization with mixtures. In ICML, 2010.
Bergstra, J. Hyperopt: Distributed asynchronous hyperparameter optimization in Python. http://jaberg.github.com/hyperopt, 2013.
Bergstra, J. and Bengio, Y. Random search for hyperparameter optimization. Journal of Machine Learning Research, 13:281-305, 2012.
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., and Bengio, Y. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010.
Bergstra, J., Bardenet, R., Bengio, Y., and Kegl, B. Algorithms for hyper-parameter optimization. In NIPS*24, pp. 2546-2554, 2011.
Bergstra, J., Pinto, N., and Cox, D. D. Machine learning for predictive auto-tuning with boosted regression trees. In INPAR, 2012.
Bergstra, J., Yamins, D., and Pinto, N. Hyperparameter optimization for convolutional vision architectures. https://github.com/jaberg/hyperoptconvnet, 2013.
Brochu, E. Interactive Bayesian Optimization: Learning Parameters for Graphics and Animation. PhD thesis, University of British Columbia, December 2010.
Coates, A. and Ng, A. Y. The importance of encoding versus training with sparse coding and vector quantization. In Proc. ICML-28, 2011.
DiCarlo, J. J., Zoccolan, D., and Rust, N. C. How does the brain solve visual object recognition? Neuron, 73:415-34, Feb 9 2012. ISSN 1097-4199.
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.R., and Lin, C.-J. Liblinear: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.
Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193-202, 1980.
Hinton, G. E., Osindero, S., and Teh, Y. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527-1554, 2006.
Huang, G. B., Ramesh, M., Berg, T., and Learned-Miller, E. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
Hutter, F. Automated Configuration of Algorithms for Solving Hard Computational Problems. PhD thesis, University of British Columbia, 2009.
Hutter, F., Hoos, H., and Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration. In LION-5, 2011. Extended version as UBC Tech report TR-2010-10.
Hyvarinen, A. and Oja, E. Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5):411-430, 2000.
Jones, D.R. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21:345-383, 2001.
Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, 1989.
Lowe, D. G. Object recognition from local scale-invariant features. In Proceedings of the International Conference on Computer Vision 2 (ICCV), pp. 1150-1157, 1999. doi: 10.1109/ICCV.1999.790410.
Mockus, J., Tiesis, V., and Zilinskas, A. The application of Bayesian methods for seeking the extremum. In Dixon, L.C.W. and Szego, G.P. (eds.), Towards Global Optimization, volume 2, pp. 117-129. North Holland, New York, 1978.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
Pinto, N. and Cox, D. D. Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In Proc. Face and Gesture Recognition, 2011.
Pinto, N., Doukhan, D., DiCarlo, J. J., and Cox, D. D. A high-throughput screening approach to discovering good forms of biologically inspired visual representation. PLoS Comput Biol, 5(11):e1000579, 11 2009.
Pinto, N., Stone, Z., Zickler, T., and Cox, D. D. Scaling-up Biologically-Inspired Computer Vision: A Case-Study on Facebook. In IEEE Computer Vision and Pattern Recognition, Workshop on Biologically Consistent Vision, 2011.
Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.
Riesenhuber, M. and Poggio, T. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2:1019-1025, 1999.
Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems, 2012.
-----0
Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. Convex optimization with sparsity-inducing norms. In Sra, S., Nowozin, S., and Wright, S. J. (eds.), Optimization for Machine Learning. MIT Press, 2011.
Balasubramanian, K. and Lebanon, G. The landmark selection method for multiple output prediction. In Proceedings of the 29th International Conference on Machine Learning, pp. 983-990, Edinburgh, Scotland, June 2012.
Bi, W. and Kwok, J.T. Multi-label classification on tree- and DAG-structured hierarchies. In Proceedings of the 28th International Conference on Machine Learning, pp. 17-24, Bellevue, WA, USA, June 2011.
Boutsidis, C., Mahoney, M.W., and Drineas, P. An improved approximation algorithm for the column subset selection problem. In Proceedings of the 20th ACM-SIAM Symposium on Discrete Algorithms, pp. 968-977, New York, NY, USA, January 2009.
Chen, Y-N. and Lin, H-T. Feature-aware label space dimension reduction for multi-label classification. In Advances in Neural Information Processing Systems 25, pp. 1538-1546, 2012.
Dembczynski, K., Cheng, W., and Hullermeier, E. Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the 27th International Conference on Machine Learning, pp. 279-286, Haifa, Israel, June 2010.
Drineas, P., Mahoney, M., and Muthukrishnan, S. Subspace sampling and relative-error matrix approximation: Column-based methods. In Proceedings of the 9th International Conference on Approximation Algorithms for Combinatorial Optimization Problems, pp. 316-326, Barcelona, Spain, August 2006.
Duygulu, P., Barnard, K., Freitas, N.D., and Forsyth, D. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proceedings of the 7th European Conference on Computer Vision, pp. 97-112, Copenhagen, Denmark, May 2002.
Gu, M. and Eisenstat, S.C. Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM Journal on Scientific Computing, 17(4):848-869, 1996.
Hariharan, B., Zelnik-Manor, L., Vishwanathan, S.V.N., and Varma, M. Large scale max-margin multi-label classification with priors. In Proceedings of the 27th International Conference on Machine Learning, pp. 423-430, Haifa, Israel, June 2010.
Hsu, D., Kakade, S.M., Langford, J., and Zhang, T. Multilabel prediction via compressed sensing. In Advances in Neural Information Processing Systems 22, pp. 772-780, 2009.
Mahoney, M.W. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123-224, 2011.
Mencía, E.L. and Furnkranz, J. Efficient pairwise multilabel classification for large-scale problems in the legal domain. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 50-65, Antwerp, Belgium, 2008.
Rudelson, M. and Vershynin, R. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM, 54(4):21, 2007.
Tai, F. and Lin, H.T. Multilabel classification with principal label space transformation. Neural Computation, 24(9):2508–2542, 2012.
Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(2):1453, 2005.
Tsoumakas, G., Katakis, I., and Vlahavas, I. Effective and efficient multilabel classification in domains with large number of labels. In Proceedings of the ECML/PKDD Workshop on Mining Multidimensional Data, Antwerp, Belgium, 2008.
Tsoumakas, G., Katakis, I., and Vlahavas, I. Mining multilabel data. In Maimon, O. and Rokach, L. (eds.), Data Mining and Knowledge Discovery Handbook, pp. 667–685. Springer, 2010.
Turnbull, D., Barrington, L., Torres, D., and Lanckriet, G. Semantic annotation and retrieval of music and sound effects. IEEE Transactions on Audio, Speech and Language Processing, 16(2):467–476, 2008.
Vens, C., Struyf, J., Schietgat, L., Džeroski, S., and Blockeel, H. Decision trees for hierarchical multi-label classification. Machine Learning, 73:185–214, 2008.
Zhang, Y. and Schneider, J. Multi-label output codes using canonical correlation analysis. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 873–882, Ft. Lauderdale, FL, USA, April 2011.
Zhang, Y. and Schneider, J. Maximum margin output coding. In Proceedings of the 29th International Conference on Machine Learning, pp. 1575–1582, Edinburgh, Scotland, June 2012.
Zhou, T., Tao, D., and Wu, X. Compressed labeling on distilled labelsets for multi-label learning. Machine Learning, 88:69–126, 2012.
-----0
Biggs, Michael, Ghodsi, Ali, Wilkinson, Dana, and Bowling, Michael. Action respecting embedding. In Proceedings of the Twenty-Second International Conference on Machine Learning, pp. 65–72, 2005.
Boots, Byron and Gordon, Geoff. Predictive state temporal difference learning. In Lafferty, J., Williams, C. K. I., Shawe-Taylor, J., Zemel, R.S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23, pp. 271–279. 2010.
Boots, Byron, Siddiqi, Sajid M., and Gordon, Geoffrey J. Closing the learning-planning loop with predictive state representations. In Proceedings of Robotics: Science and Systems VI, 2010.
Boots, Byron, Siddiqi, Sajid, and Gordon, Geoffrey. An online spectral learning algorithm for partially observable nonlinear dynamical systems. In Proceedings of the 25th National Conference on Artificial Intelligence (AAAI-2011), 2011.
Brand, Matthew. Fast low-rank modifications of the thin singular value decomposition. Linear Algebra and its Applications, 415(1):20–30, 2006.
Candès, Emmanuel J. and Plan, Yaniv. Matrix completion with noise. CoRR, abs/0903.3131, 2009.
Djugash, Joseph. Geolocation with range: Robustness, efficiency and scalability. PhD. Thesis, Carnegie Mellon University, 2010.
Djugash, Joseph and Singh, Sanjiv. A robust method of localization and mapping using only range. In International Symposium on Experimental Robotics, July 2008.
Djugash, Joseph, Singh, Sanjiv, and Corke, Peter Ian. Further results with localization and mapping using range from radio. In International Conference on Field and Service Robotics (FSR '05), July 2005.
Ferris, Brian, Fox, Dieter, and Lawrence, Neil. WiFi-SLAM using Gaussian process latent variable models. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI '07, pp. 2480–2485, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.
Hsu, Daniel, Kakade, Sham, and Zhang, Tong. A spectral algorithm for learning hidden Markov models. In COLT, 2009.
Kanade, T. and Morris, D.D. Factorization methods for structure from motion. Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 356(1740):1153–1173, 1998.
Kantor, George A and Singh, Sanjiv. Preliminary results in range-only localization and mapping. In Proceedings of the IEEE Conference on Robotics and Automation (ICRA '02), volume 2, pp. 1818–1823, May 2002.
Kehagias, Athanasios, Djugash, Joseph, and Singh, Sanjiv. Range-only SLAM with interpolated range data. Technical Report CMU-RI-TR-06-26, Robotics Institute, May 2006.
Kurth, Derek, Kantor, George A, and Singh, Sanjiv. Experimental results in range-only localization with radio. In 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS '03), volume 1, pp. 974–979, October 2003.
Montemerlo, Michael, Thrun, Sebastian, Koller, Daphne, and Wegbreit, Ben. FastSLAM: A factored solution to the simultaneous localization and mapping problem. In Proceedings of the AAAI National Conference on Artificial Intelligence, pp. 593–598. AAAI, 2002.
Rosencrantz, Matthew, Gordon, Geoffrey J., and Thrun, Sebastian. Learning low dimensional predictive representations. In Proc. ICML, 2004.
Siddiqi, Sajid, Boots, Byron, and Gordon, Geoffrey J. Reduced-rank hidden Markov models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS-2010), 2010.
Thrun, Sebastian, Burgard, Wolfram, and Fox, Dieter. Probabilistic Robotics (Intelligent Robotics and Autonomous Agents). The MIT Press, 2005. ISBN 0262201623.
Tipping, Michael E. and Bishop, Chris M. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61:611–622, 1999.
Tomasi, Carlo and Kanade, Takeo. Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision, 9:137–154, 1992.
Triggs, B. Factorization methods for projective structure and motion. In Computer Vision and Pattern Recognition, 1996. Proceedings CVPR '96, 1996 IEEE Computer Society Conference on, pp. 845–851. IEEE, 1996.
Van Overschee, P. and De Moor, B. Subspace Identification for Linear Systems: Theory, Implementation, Applications. Kluwer, 1996.
Yairi, Takehisa. Map building without localization by dimensionality reduction techniques. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pp. 1071–1078, New York, NY, USA, 2007. ACM.
-----0
Bai, H., Hsu, D., Lee, W.S., and Ngo, V.A. Monte Carlo Value Iteration for Continuous-State POMDPs. In Workshop on the Algorithmic Foundations of Robotics, pp. 175–191, 2010.
Bellman, R. A Markovian decision process. Journal of Applied Mathematics and Mechanics, 6:679–684, 1957.
Brechtel, S., Gindele, T., and Dillmann, R. Probabilistic MDP-behavior planning for cars. In Int. Conf. on Intelligent Transportation Systems, pp. 1537–1542, 2011.
Brooks, A., Makarenko, A., Williams, S., and Durrant-Whyte, H. Parametric POMDPs for planning in continuous state spaces. Robotics and Autonomous Systems, 54(11):887–897, 2006. ISSN 0921-8890.
Doucet, A., De Freitas, N., and Gordon, N. Sequential Monte Carlo methods in practice. Springer, 2001.
Feng, Z. and Hansen, E. An Approach to State Aggregation for POMDPs. In Workshop on Learning and Planning in Markov Processes, pp. 7–12, 2004.
Gindele, T., Brechtel, S., and Dillmann, R. A Probabilistic Model for Estimating Driver Behaviors and Vehicle Trajectories in Traffic Environments. In Int. Conf. on Intelligent Transportation Systems, pp. 1625–1631, 2010.
Hauskrecht, M. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13(1):33–94, 2000.
Hoey, J. and Poupart, P. Solving POMDPs with continuous or large discrete observation spaces. In Int. Joint Conf. on Artificial Intelligence, pp. 1332, 2005.
Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. Planning and acting in partially observable stochastic domains. Journal on Artificial Intelligence, 101(1–2):99–134, 1998. ISSN 0004-3702.
Kurniawati, H., Hsu, D., and Lee, W.S. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Robotics: Science and Systems, 2008.
MacKay, D.J.C. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
Munos, R. and Moore, A. Variable resolution discretization in optimal control. Machine Learning, 49(2):291–323, 2002. ISSN 0885-6125.
Pineau, J., Gordon, G., and Thrun, S. Point-based value iteration: An anytime algorithm for POMDPs. In Int. Joint Conf. on Artificial Intelligence, volume 18, pp. 1025–1032, 2003.
Porta, J.M., Spaan, M.T.J., and Vlassis, N. Robot planning in partially observable continuous domains. In Robotics: Science and Systems, volume 1, pp. 217. The MIT Press, 2005.
Porta, J.M., Vlassis, N., Spaan, M.T.J., and Poupart, P. Point-based value iteration for continuous POMDPs. The Journal of Machine Learning Research, 7:2329–2367, 2006. ISSN 1532-4435.
Poupart, P. and Boutilier, C. Value-directed compression of POMDPs. Advances in Neural Information Processing Systems, 15:1547–1554, 2002.
Poupart, P., Kim, K.E., and Kim, D. Closing the Gap: Improved Bounds on Optimal POMDP Solutions. In Int. Conf. on Automated Planning and Scheduling, 2011.
Quinlan, J.R. C4.5: Programs for machine learning. Morgan Kaufmann, 1993.
Roy, N., Gordon, G.J., and Thrun, S. Finding approximate POMDP solutions through belief compression. Journal of Artificial Intelligence Research, 23:1–40, 2005.
Smith, T. and Simmons, R. Heuristic search value iteration for POMDPs. In Uncertainty in Artificial Intelligence, pp. 520–527. AUAI Press, 2004.
Sondik, E.J. The Optimal Control of Partially Observable Markov Decision Processes. PhD thesis, Stanford University, 1971.
Spaan, M.T.J. and Vlassis, N. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24(1):195–220, 2005.
van den Berg, Jur, Patil, Sachin, and Alterovitz, Ron. Efficient Approximate Value Iteration for Continuous Gaussian POMDPs. In Conf. on Artificial Intelligence, 2012.
Zhou, E., Fu, M.C., and Marcus, S.I. Solving continuous-state POMDPs via density projection. Transactions on Automatic Control, 55(5):1101–1116, 2010.
-----0
Akaike, H. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
Aldous, D. Exchangeability and related topics. École d'Été de Probabilités de Saint-Flour XIII–1983, pp. 1–198, 1985.
Antoniak, C. E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, pp. 1152–1174, 1974.
Arthur, D. and Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035. Society for Industrial and Applied Mathematics, 2007.
Berkhin, P. A survey of clustering data mining techniques. Grouping Multidimensional Data, pp. 25–71, 2006.
Blackwell, D. and MacQueen, J. B. Ferguson distributions via Polya urn schemes. The Annals of Statistics, 1(2):353–355, 1973.
Broderick, T., Jordan, M. I., and Pitman, J. Beta processes, stick-breaking, and power laws. Bayesian Analysis, 7:439–476, 2012.
Broderick, T., Jordan, M. I., and Pitman, J. Clusters and features from combinatorial stochastic processes. Statistical Science, to appear, 2013a. arXiv preprint arXiv:1206.5862.
Broderick, T., Pitman, J., and Jordan, M. I. Feature allocations, probability functions, and paintboxes. Bayesian Analysis, to appear, 2013b. arXiv preprint arXiv:1301.6647.
Escobar, M. D. Estimating normal means with a Dirichlet process prior. Journal of the American Statistical Association, pp. 268–277, 1994.
Escobar, M. D. and West, M. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, pp. 577–588, 1995.
Gordon, A. D. and Henderson, J. T. An algorithm for Euclidean sum of squares classification. Biometrics, pp. 355–362, 1977.
Griffiths, T. and Ghahramani, Z. Infinite latent feature models and the Indian buffet process. In Weiss, Y., Scholkopf, B., and Platt, J. (eds.), Advances in Neural Information Processing Systems 18, pp. 475–482. MIT Press, Cambridge, MA, 2006.
Hjort, N. L. Nonparametric Bayes estimators based on beta processes in models for life history data. Annals of Statistics, 18(3):1259–1294, 1990.
Jain, A. K. Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8):651–666, 2010.
Kulis, B. and Jordan, M. I. Revisiting k-means: New algorithms via Bayesian nonparametrics. In Proceedings of the 29th International Conference on Machine Learning, 2012.
Liu, J. The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. Journal of the American Statistical Association, 89:958–966, 1994.
Pena, J. M., Lozano, J. A., and Larranaga, P. An empirical comparison of four initialization methods for the K-Means algorithm. Pattern Recognition Letters, 20(10):1027–1040, 1999.
Pitman, J. Exchangeable and partially exchangeable random partitions. Probability Theory and Related Fields, 102(2):145–158, 1995.
Steinley, D. K-means clustering: A half-century synthesis. British Journal of Mathematical and Statistical Psychology, 59(1):1–34, 2006.
Sung, K. and Poggio, T. Example-based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):39–51, 1998.
Thibaux, R. and Jordan, M. I. Hierarchical beta processes and the Indian buffet process. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 11, 2007.
Thomaz, C. E. and Giraldi, G. A. A new ranking method for principal components analysis and its application to face image analysis. Image and Vision Computing, 28(6):902–913, June 2010.
-----0
Benaroya, L., Donagh, L.M., Bimbot, F., and Gribonval, R. Non negative sparse representation for Wiener based source separation with a single sensor. In Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on, volume 6, April 2003.
Bishop, C. M. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
Cohn, D., Caruana, R., and McCallum, A. Semi-supervised clustering with user feedback. Technical report, 2003.
Durrieu, J.-L. and Thiran, J.-P. Musical audio source separation based on user-selected F0 track. In The 10th International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), pp. 438–445, 2012.
Fevotte, C., Bertin, N., and Durrieu, J.-L. Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Computation, 21(3):793–830, March 2009.
Graca, J., Ganchev, K., and Taskar, B. Expectation maximization and posterior constraints. In Advances in Neural Information Processing Systems (NIPS), 2007.
Graca, J., Ganchev, K., Taskar, B., and Pereira, F.C.N. Posterior vs parameter sparsity in latent variable models. In Advances in Neural Information Processing Systems (NIPS), pp. 664–672, 2009.
Hofmann, T. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '99, pp. 50–57, New York, NY, USA, 1999. ACM.
Lee, D. D. and Seung, H. S. Algorithms for nonnegative matrix factorization. In Advances in Neural Information Processing Systems (NIPS), pp. 556–562. MIT Press, 2001.
Lefevre, A., Bach, F., and Fevotte, C. Semi-supervised NMF with time-frequency annotations for single-channel source separation. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, 2012.
Raj, B. and Smaragdis, P. Latent variable decomposition of spectrograms for single channel speaker separation. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 17–20, Oct. 2005.
SiSEC, 2011. Professionally produced music recordings. In Signal Separation Evaluation Campaign (SiSEC), 2011. http://sisec.wiki.irisa.fr/tiki-index.php.
Smaragdis, P. and Brown, J.C. Non-negative matrix factorization for polyphonic music transcription. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 177–180, Oct. 2003.
Smaragdis, P. and Raj, B. Shift-Invariant Probabilistic Latent Component Analysis. MERL Tech Report, 2007.
Smaragdis, P., Raj, B., and Shashanka, M. A Probabilistic Latent Variable Model for Acoustic Modeling. In Advances in Neural Information Processing Systems (NIPS), Workshop on Advances in Modeling for Acoustic Processing, 2006.
Smaragdis, P., Raj, B., and Shashanka, M. Supervised and semi-supervised separation of sounds from single-channel mixtures. In International Conference on Independent Component Analysis and Signal Separation, pp. 414–421, Berlin, Heidelberg, 2007. Springer-Verlag.
Smith, J. O. Spectral Audio Signal Processing. http://ccrma.stanford.edu/~jos/sasp/, 2011. Online book.
Vincent, E., Gribonval, R., and Fevotte, C. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469, July 2006.
Virtanen, T. Monaural Sound Source Separation by Nonnegative Matrix Factorization With Temporal Continuity and Sparseness Criteria. IEEE Transactions on Audio, Speech and Language Processing (TASLP), 15(3):1066–1074, March 2007.
-----2
Audibert, J.-Y., Bubeck, S., and Munos, R. Best arm identification in multi-armed bandits. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), 2010.
Bechhofer, R. E. A single-sample multiple decision procedure for ranking means of normal populations with known variances. Annals of Mathematical Statistics, 25:16–39, 1954.
Bubeck, S., Munos, R., and Stoltz, G. Pure exploration in multi-armed bandits problems. In Proceedings of the 20th International Conference on Algorithmic Learning Theory (ALT), 2009.
Bubeck, S., Munos, R., and Stoltz, G. Pure exploration in finitely-armed and continuously-armed bandits. Theoretical Computer Science, 412:1832–1852, 2011.
Even-Dar, E., Mannor, S., and Mansour, Y. PAC bounds for multi-armed bandit and Markov decision processes. In Proceedings of the Fifteenth Annual Conference on Computational Learning Theory (COLT), 2002.
Gabillon, V., Ghavamzadeh, M., Lazaric, A., and Bubeck, S. Multi-bandit best arm identification. In Advances in Neural Information Processing Systems (NIPS), 2011.
Kalyanakrishnan, S., Tewari, A., Auer, P., and Stone, P. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
Mannor, S. and Tsitsiklis, J. N. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5:623–648, 2004.
-----0
Andrieu, C., Doucet, A., and Tadic, V. On-line parameter estimation in general state-space models. In Proceedings of the 44th Conference on Decision and Control, pp. 332–337, 2005.
Andrieu, Christophe, Doucet, Arnaud, and Holenstein, Roman. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
Arulampalam, Sanjeev, Maskell, Simon, Gordon, Neil, and Clapp, Tim. A tutorial on particle filters for on-line non-linear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002.
Carvalho, Carlos M., Johannes, Michael S., Lopes, Hedibert F., and Polson, Nicholas G. Particle Learning and Smoothing. Statistical Science, 25:88–106, 2010. doi: 10.1214/10-STS325.
Doucet, Arnaud and Johansen, Adam M. A tutorial on particle filtering and smoothing: fifteen years later. The Oxford Handbook of Nonlinear Filtering, pp. 4 6, December 2011.
Gilks, Walter R. and Berzuini, Carlo. Following a moving target – Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 63(1):127–146, 2001.
Kalman, Rudolf E. A new approach to linear filtering and prediction problems. Transactions of the ASME – Journal of Basic Engineering, 82(Series D):35–45, 1960.
Kantas, Nicholas, Doucet, Arnaud, Singh, Sumeetpal Sindhu, and Maciejowski, Jan. An overview of sequential Monte Carlo methods for parameter estimation in general state-space models. In 15th IFAC Symposium on System Identification, volume 15, pp. 774–785, 2009.
Liu, Jane and West, Mike. Combined parameter and state estimation in simulation-based filtering. In Sequential Monte Carlo Methods in Practice. 2001.
Neal, Radford M. Slice sampling. Annals of Statistics, 31(3):705–767, 2003.
Polson, Nicholas G., Stroud, Jonathan R., and Muller, Peter. Practical filtering with sequential parameter learning. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(2):413–428, 2008.
Robert, Christian P. and Casella, George. Monte Carlo Statistical Methods. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005.
Storvik, Geir. Particle filters for state-space models with the presence of unknown static parameters. IEEE Transactions on Signal Processing, 50(2):281–289, 2002.
van Dijk, Dick, Teräsvirta, Timo, and Franses, Philip Hans. Smooth transition autoregressive models – a survey of recent developments. Econometric Reviews, 21:1–47, 2002.
Welch, Greg and Bishop, Gary. An introduction to the Kalman filter, 1995.
-----0
Andersen, R. and Lang, K. Communities from seed sets. In WWW, pp. 223–232, 2006.
Andersen, R., Chung, F., and Lang, K. Local graph partitioning using pagerank vectors. In FOCS, pp. 475–486, 2006.
Bach, F. Learning with submodular functions: A convex optimization perspective. CoRR, abs/1111.6453, 2011.
Beck, A. and Teboulle, M. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Image Processing, 18(11):2419–2434, 2009.
Bresson, X., Laurent, T., Uminsky, D., and von Brecht, J. H. Convergence and energy landscape for Cheeger cut clustering. In NIPS, pp. 1394–1402, 2012.
Chung, F. A local graph partitioning algorithm using heat kernel pagerank. In WAW, pp. 62–75, 2009.
Di Pillo, G. Exact penalty methods. In Spedicato, E. (ed.), Algorithms for Continuous Optimization, pp. 209–253. Kluwer, 1994.
Dinkelbach, W. On nonlinear fractional programming. Management Science, 13(7):492–498, 1967.
Fortunato, S. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.
Gajewar, A. and Das Sarma, A. Multi-skill collaborative teams based on densest subgraphs. In SDM, pp. 165–176, 2012.
Goldberg, A. V. Finding a maximum density subgraph. Technical Report UCB/CSD-84-171, EECS Department, UC Berkeley, 1984.
Hagen, L. and Kahng, A. B. Fast spectral methods for ratio cut partitioning and clustering. In ICCAD, pp. 10–13, 1991.
Hansen, T. and Mahoney, M. Semi-supervised eigenvectors for locally-biased learning. In NIPS, pp. 2537–2545, 2012.
Hein, M. and Buhler, T. An inverse power method for nonlinear eigenproblems with applications in 1-spectral clustering and sparse PCA. In NIPS, pp. 847–855, 2010.
Hein, M. and Setzer, S. Beyond spectral clustering – tight relaxations of balanced graph cuts. In NIPS, pp. 2366–2374, 2011.
Khot, S. Ruling out PTAS for graph min-bisection, dense k-subgraph, and bipartite clique. SIAM J. Comput., 36(4), 2006.
Khuller, S. and Saha, B. On finding dense subgraphs. In ICALP, pp. 597–608, 2009.
Leskovec, J. Stanford large network dataset collection. URL http://snap.stanford.edu/data/index.html.
Mahoney, M. W., Orecchia, L., and Vishnoi, N. K. A local spectral method for graphs: With applications to improving graph partitions and exploring data graphs locally. JMLR, 13:2339–2365, 2012.
Maji, S., Vishnoi, N. K., and Malik, J. Biased normalized cuts. In CVPR, pp. 2057–2064, 2011.
Pothen, A., Simon, H. D., and Liou, K.-P. Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Matrix Anal. Appl., 11(3):430–452, 1990.
Rangapuram, S. S. and Hein, M. Constrained 1-spectral clustering. In AISTATS, pp. 1143–1151, 2012.
Saha, B., Hoch, A., Khuller, S., Raschid, L., and Zhang, X.-N. Dense subgraphs with restrictions and applications to gene annotation graphs. In RECOMB, pp. 456–472, 2010.
Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Trans. Patt. Anal. Mach. Intell., 22(8):888–905, 2000.
Spielman, D. A. and Teng, S.-H. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In STOC, pp. 81–90, 2004.
Szlam, A. and Bresson, X. Total variation and Cheeger cuts. In ICML, pp. 1039–1046, 2010.
von Luxburg, U. A tutorial on spectral clustering. Statistics and Computing, 17:395–416, 2007.
Wagstaff, K., Cardie, C., Rogers, S., and Schroedl, S. Constrained K-means clustering with background knowledge. In ICML, pp. 577–584, 2001.
-----0
Akrour, R., Schoenauer, M., and Sebag, M. Preference-based policy learning. In Proceedings ECML PKDD 2011, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pp. 12–27, 2011.
Altman, A. and Tennenholtz, M. Axiomatic foundations for ranking systems. Journal of Artificial Intelligence Research, 31(1):473–495, 2008.
Aslam, J.A. and Decatur, S.E. General bounds on statistical query learning and PAC learning with noise via hypothesis boosting. Inf. Comput., 141(2):85–118, 1998.
Audibert, J.Y., Munos, R., and Szepesvari, C. Tuning bandit algorithms in stochastic environments. In Proceedings of the Algorithmic Learning Theory, pp. 150–165, 2007.
Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256, 2002.
Brandt, F. and Fischer, F. Pagerank as a weak tournament solution. In Proceedings of the 3rd International Conference on Internet and Network Economics, pp. 300–305, 2007.
Braverman, Mark and Mossel, Elchanan. Noisy sorting without resampling. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 268–276, 2008.
Brin, S. and Page, L. The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1-7):107–117, 1998.
Cheng, W., Furnkranz, J., Hullermeier, E., and Park, S.H. Preference-based policy iteration: Leveraging preference learning for reinforcement learning. In Proceedings ECML PKDD 2011, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pp. 414–429, 2011.
Even-Dar, E., Mannor, S., and Mansour, Y. PAC bounds for multi-armed bandit and Markov decision processes. In Proceedings of the 15th Annual Conference on Computational Learning Theory, pp. 255–270, 2002.
Funderlic, R.E. and Meyer, C.D. Sensitivity of the stationary distribution vector for an ergodic Markov chain. Linear Algebra and its Applications, 76:1–17, 1986.
Furnkranz, J. and Hullermeier, E. (eds.). Preference Learning. Springer-Verlag, 2011.
Heidrich-Meisner, V. and Igel, C. Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search. In Proceedings of the 26th International Conference on Machine Learning, pp. 401–408, 2009.
Hoeffding, W. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
Kalyanakrishnan, S., Tewari, A., Auer, P., and Stone, P. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the Twenty-ninth International Conference on Machine Learning (ICML 2012), pp. 655–662, 2012.
Kocsor, A., Busa-Fekete, R., and Pongor, S. Protein classification based on propagation on unrooted binary trees. Protein and Peptide Letters, 15(5):428–34, 2008.
Langville, A. N. and Meyer, C. D. Deeper inside pagerank. Internet Mathematics, 1(3):335–380, 2004.
Maron, O. and Moore, A.W. Hoeffding races: accelerating model selection search for classification and function approximation. In Advances in Neural Information Processing Systems, pp. 59–66, 1994.
Maron, O. and Moore, A.W. The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, 5(1):193–225, 1997.
Mnih, V., Szepesvari, C., and Audibert, J.Y. Empirical Bernstein stopping. In Proceedings of the 25th International Conference on Machine Learning, pp. 672–679, 2008.
Moulin, H. Axioms of cooperative decision making. Cambridge University Press, 1988.
Negahban, S., Oh, S., and Shah, D. Iterative ranking from pairwise comparisons. In Advances in Neural Information Processing Systems, pp. 2483–2491, 2012.
Seneta, E. Sensitivity of finite Markov chains under perturbation. Statistics & Probability Letters, 17(2):163–168, 1992.
Serfling, R.J. Approximation theorems of mathematical statistics, volume 34. Wiley Online Library, 1980.
Yue, Y., Broder, J., Kleinberg, R., and Joachims, T. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.
-----0
Aliferis, Constantin F., Statnikov, Alexander, Tsamardinos, Ioannis, Mani, Subramani, and Koutsoukos, Xenofon D. Local causal and Markov blanket induction for causal discovery and feature selection for classification. J. Mach. Learn. Res., 11:171–234, 2010.
Baba, Kunihiro, Shibata, Ritei, and Sibuya, Masaaki. Partial correlation and conditional correlation as measures of conditional independence. Australian & New Zealand Journal of Statistics, 46(4):657–664, 2004.
Bromberg, Facundo and Margaritis, Dimitris. Improving the reliability of causal discovery from small data sets using argumentation. J. Mach. Learn. Res., 10:301–340, 2009.
Cai, Ruichu, Zhang, Zhenjie, and Hao, Zhifeng. BASSUM: A Bayesian semi-supervised method for classification feature selection. Pattern Recognition, 44(4):811–820, 2011.
Cai, Ruichu, Zhang, Zhenjie, and Hao, Zhifeng. Causal gene identification using combinatorial v-structure search. Neural Networks, 2013. doi: 10.1016/j.neunet.2013.01.025.
Friedman, Nir, Linial, Michal, Nachman, Iftach, and Pe'er, Dana. Using Bayesian networks to analyze expression data. In RECOMB, pp. 127–135, 2000.
He, Y. and Geng, Z. Active learning of causal networks with intervention experiments and optimal designs. J. Mach. Learn. Res., 9:2523–2547, 2008.
Hoyer, Patrik O., Janzing, Dominik, Mooij, Joris M., Peters, Jonas, and Scholkopf, Bernhard. Nonlinear causal discovery with additive noise models. In NIPS, pp. 689–696, 2008.
Janzing, Dominik, Mooij, Joris M., Zhang, Kun, Lemeire, Jan, Zscheischler, Jakob, Daniusis, Povilas, Steudel, Bastian, and Scholkopf, Bernhard. Information-geometric approach to inferring causal directions. Artif. Intell., 182–183:1–31, 2012.
Kalisch, M. and Buhlmann, P. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J. Mach. Learn. Res., 8:613–636, 2007.
Kim, Sunyong, Imoto, Seiya, and Miyano, Satoru. Dynamic Bayesian network and nonparametric regression for nonlinear modeling of gene networks from time series gene expression data. Biosystems, 75(1–3):57–65, 2004.
Koller, Daphne and Friedman, Nir. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2 edition, 2009.
Mooij, Joris M., Janzing, Dominik, Peters, Jonas, and Scholkopf, Bernhard. Regression by dependence minimization and its application to causal inference in additive noise models. In ICML, pp. 94, 2009.
Pearl, Judea. Causality: models, reasoning and inference. Cambridge Univ. Press, 2 edition, 2009.
Pearl, Judea and Verma, Thomas. A theory of inferred causation. In Proceedings of the 2nd International Conference on Principles of Knowledge Representation and Reasoning, pp. 441–452, 1991.
Peters, Jonas, Janzing, Dominik, and Scholkopf, Bernhard. Identifying cause and effect on discrete data using additive noise models. In AISTATS, pp. 597–604, 2010.
Peters, Jonas, Janzing, Dominik, and Scholkopf, Bernhard. Causal inference on discrete data using additive noise models. IEEE Trans. Pattern Anal. Mach. Intell., 33(12):2436–2450, 2011.
Shimizu, Shohei, Hoyer, Patrik O., Hyvarinen, Aapo, and Kerminen, Antti J. A linear non-Gaussian acyclic model for causal discovery. J. Mach. Learn. Res., 7:2003–2030, 2006.
Shimizu, Shohei, Inazumi, Takanori, Sogawa, Yasuhiro, Hyvarinen, Aapo, Kawahara, Yoshinobu, Washio, Takashi, Hoyer, Patrik O., and Bollen, Kenneth. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. J. Mach. Learn. Res., 12:1225–1248, 2011.
Spirtes, Peter, Glymour, Clark, and Scheines, Richard. Causation, Prediction, and Search. The MIT Press, 2 edition, 2001.
Zhang, Kun, Peters, Jonas, Janzing, Dominik, and Scholkopf, Bernhard. Kernel-based conditional independence test and application in causal discovery. CoRR, abs/1202.3775, 2012.
Zhu, Z., Ong, Y.S., and Dash, M. Markov blanket-embedded genetic algorithm for gene selection. Pattern Recognition, 40(11):3236–3248, 2007.
Zscheischler, Jakob, Janzing, Dominik, and Zhang, Kun. Testing whether linear equations are causal: A free probability theory approach. CoRR, abs/1202.3779, 2012.
-----0
Sebastien Bubeck, Remi Munos, Gilles Stoltz, and Csaba Szepesvari. Online optimization in x-armed bandits. 2008.
A. Carpentier and R. Munos. Finite-time analysis of stratified sampling for monte carlo. In Neural Information Processing Systems (NIPS), 2011a.
A. Carpentier and R. Munos. Finite-time analysis of stratified sampling for monte carlo. Technical report, INRIA-00636924, 2011b.
A. Carpentier and R. Munos. Minimax number of strata for online stratified sampling given noisy samples. Algorithmic Learning Theory, 2012a.
A. Carpentier and R. Munos. Adaptive stratified sampling for monte-carlo integration of differentiable functions. In Advances in Neural Information Processing Systems 25, 2012b.
A. Carpentier and R. Munos. Toward optimal stratification for stratified monte-carlo integration. arXiv preprint arXiv:0672:625, 2013.
Pierre Etore and Benjamin Jourdain. Adaptive optimal allocation in stratified sampling methods. Methodol. Comput. Appl. Probab., 12(3):335-360, September 2010.
Pierre Etore, Gersende Fort, Benjamin Jourdain, and Eric Moulines. On adaptive stratification. Ann. Oper. Res., 2011. to appear.
P. Glasserman, P. Heidelberger, and P. Shahabuddin. Asymptotically optimal importance sampling and stratification for pricing path-dependent options. Mathematical Finance, 9(2):117-152, 1999.
V. Grover. Active learning and its application to heteroscedastic problems. Department of Computing Science, Univ. of Alberta, MSc thesis, 2009.
R. Kawai. Asymptotically optimal allocation of stratified sampling with adaptive variance reduction by strata. ACM Transactions on Modeling and Computer Simulation (TOMACS), 20(2):1-17, 2010. ISSN 1049-3301.
H. Niederreiter. Quasi-Monte Carlo Methods. Wiley Online Library, 2010.
R.Y. Rubinstein and D.P. Kroese. Simulation and the Monte Carlo method. Wiley-interscience, 2008. ISBN 0470177942.
-----0
Bach, F. R. and Jordan, M. I. Kernel independent component analysis. JMLR, 3:1-48, 2002.
Bailer-Jones, C. A. L. The ilium forward modelling algorithm for multivariate parameter estimation and its application to derive stellar parameters from gaia spectrophotometry. Monthly Notices of the Royal Astronomical Society, 403:96-116, 2010.
Balakrishnan, S., Puniyani, K., and Lafferty, J. Sparse additive functional and kernel CCA. In ICML, 2012.
Cao, K. L., Gonzalez, I., and Dejean, S. integromics: an R package to unravel relationships between two omics datasets. Bioinformatics, 25(21):2855-2856, 2009.
Cortes, C., Mohri, M., and Rostamizadeh, A. Algorithms for learning kernels based on centered alignment. JMLR, 13:795-828, 2012.
Edelman, A., Arias, T. A., and Smith, S. The geometry of algorithms with orthogonality constraints. SIAM Journal of Matrix Analysis and Applications, 20(2):303-353, 1998.
Fukumizu, K., Bach, F. R., and Gretton, A. Statistical consistency of kernel canonical correlation analysis. JMLR, 8:361-383, 2007.
Fukumizu, K., Bach, F. R., and Jordan, M. I. Kernel dimension reduction in regression. The Annals of Statistics, 37(4):1871-1905, 2009.
Gretton, A., Bousquet, O., Smola, A., and Scholkopf, B. Measuring statistical dependence with Hilbert-Schmidt norm. In Algorithmic learning theory. 2005.
Hardoon, D. R. and Shawe-Taylor, J. Sparse canonical correlation analysis. Machine Learning, 83:331-353, 2011.
Hardoon, D. R., Szedmak, S., and Shawe-Taylor, J. Canonical correlation analysis: An overview with applications to learning methods. Neural Computation, 16(12):2639-2664, 2004.
Harrison, D. and Rubinfeld, D. L. Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management, 5:81-102, 1978.
Hotelling, H. Relations between two sets of variables. Biometrika, 28(3/4):321-377, 1936.
Hsieh, W. W. Nonlinear canonical correlation analysis by neural networks. Neural Networks, 13(10):1095-1105, 2000.
Jegelka, S. and Gretton, A. Large-Scale Kernel Machines, chapter 225, pp. 225-250. The MIT Press, 2007.
Kruskal, W. H. Ordinal measures of association. JASA, 53(284):814-861, 1958.
Leurgans, S. E., Moyeed, R. A., and Silverman, B. W. Canonical correlation analysis when the data are curves. JRSS Series B, 55:725-740, 1993.
Parkhomenko, E., Tritchler, D., and Beyene, J. Sparse canonical correlation analysis with application to genomic data integration. Statistical Applications in Genetics and Molecular Biology, 8(1):1-34, 2009.
Ramsay, J. O. and Silverman, B. W. Functional Data Analysis. New York: Springer-Verlag, 1997.
Ravikumar, P., Lafferty, J., Liu, H., and Wasserman, L. Sparse additive models. JRSS Series B, 71(5):1009-1030, 2009.
Sharma, S. K., Kruger, U., and Irwin, G. W. Deflation based nonlinear canonical correlation analysis. Chemometrics and Intelligent Laboratory Systems, 83:34-43, 2006.
Shen, H., Jegelka, S., and Gretton, A. Fast kernel-based independent component analysis. IEEE Transactions on Signal Processing, 57(9):3498-3511, 2009.
Song, L., Boots, B., Siddiqi, S., Gordon, G., and Smola, A. Hilbert space embeddings of hidden markov models. In ICML, 2012a.
Song, L., Smola, A., Gretton, A., Bedo, J., and Borgwardt, K. Feature selection via dependence maximization. JMLR, 13:1393-1434, 2012b.
Wang, M., Sha, F., and Jordan, M. I. Unsupervised kernel dimension reduction. In NIPS, 2010.
Witten, D. M. and Tibshirani, R. J. Extensions of sparse canonical correlation analysis with applications to genomic data. Statistical Applications in Genetics and Molecular Biology, 8(1):1-27, 2009.
-----0
Bickel, S., Bruckner, M., and Scheffer, T. Discriminative learning under covariate shift. JMLR, 10:2137-2155, 2009.
Borgwardt, K.M., Gretton, A., Rasch, M.J., Kriegel, H.P., Schölkopf, B., and Smola, A.J. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):49-57, 2006.
Brinker, K. Incorporating diversity in active learning with support vector machines. In ICML, 2003.
Campbell, C., Cristianini, N., and Smola, A.J. Query learning with large margin classifiers. In ICML, 2000.
Chen, M., Weinberger, K.Q., and Blitzer, J. Co-training for domain adaptation. In NIPS, 2011.
Cortes, C., Mansour, Y., and Mohri, M. Learning bounds for importance weighting. In NIPS, 2010.
Eaton, E. and desJardins, M. Set-based boosting for instance-level transfer. In ICDM, 2009.
Gretton, A., Borgwardt, K.M., Rasch, M., Schölkopf, B., and Smola, A.J. A kernel method for the two-sample-problem. In NIPS, 2007.
Guo, Y. Active instance sampling via matrix partition. In NIPS, 2010.
Guo, Y. and Schuurmans, D. Discriminative batch mode active learning. In NIPS, 2007.
Hoi, S.C.H., Jin, R., Zhu, J., and Lyu, M.R. Batch mode active learning and its application to medical image classification. In ICML, 2006.
Huang, J., Smola, A.J., Gretton, A., Borgwardt, K.M., and Scholkopf, B. Correcting sample selection bias by unlabeled data. In NIPS, 2007.
Jing, F., Li, M., Zhang, H., and Zhang, B. Entropy based active learning with support vector machines for content based image retrieval. In ICME, 2004.
Joshi, Ajay J., Porikli, F., and Papanikolopoulos, N. Multiclass active learning for image classification. In CVPR, 2009.
Lecuyer, E., Yoshida, H., Parthasarathy, N., Alm, C., and Babak, T. Global analysis of mRNA localization reveals a prominent role in organizing cellular architecture and function. Cell, 2011.
Liu, C. and Wechsler, H. Gabor feature based classification using the enhanced fisher linear discriminant model for face recognition. IEEE Trans. Image Processing, 11:467-476, 2002.
Pan, S.J. and Yang, Q. A survey on transfer learning. TKDE, 2009.
Pan, S.J., Tsang, I.W., Kwok, J.T., and Yang, Q. Domain adaptation via transfer component analysis. In IJCAI, 2009.
Rai, P., Saha, A., Daume III, H., and Venkatasubramanian, S. Domain adaptation meets active learning. In NAACL HLT Active Learning for NLP Workshop, 2010.
Schohn, G. and Cohn, D. Less is more: Active learning with support vector machines. In ICML, 2000.
Settles, B. Active learning literature survey. In Computer Sciences Technical Report 1648. University of Wisconsin-Madison, 2009.
Shi, X., Fan, W., and Ren, J. Actively transfer domain knowledge. In ECML/PKDD. Antwerp, Belgium, 2008.
Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. JSPI, 90:227-244, 2000.
Sriperumbudur, B., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. Hilbert space embeddings and metrics on probability measures. JMLR, 11:1517-1561, 2010.
Sugiyama, M., Nakajima, S., Kashima, H., Buenau, P.V., and Kawanabe, M. Direct importance estimation with model selection and its application to covariate shift adaptation. In NIPS, 2008.
Tomancak, P., Beaton, A., Weiszmann, R., Kwan, E., and Shu, S. Systematic determination of patterns of gene expression during Drosophila embryogenesis. Genome Biology, 3, 2002.
Tong, S. and Koller, D. Support vector machine active learning with applications to text classification. JMLR, 2:45-66, 2000.
Witten, I.H. and Frank, E. Data mining: Practical machine learning tools with Java implementations. Morgan Kaufmann, 2000.
Yu, K., Bi, J., and Tresp, V. Active learning via transductive experimental design. In ICML, 2006.
-----0
Altun, Yasemin, Tsochantaridis, Ioannis, and Hofmann, Thomas. Hidden Markov support vector machines. In Proc. ICML, 2004.
Antoniak, C. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152-1174, 1974.
Blackwell, D. and MacQueen, J. Ferguson distributions via Pólya urn schemes. The Annals of Statistics, 1(2):353-355, 1973.
Blei, David M. and Jordan, Michael I. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121-144, 2006.
Chandler, D. Introduction to Modern Statistical Mechanics. Oxford University Press, New York, 1987.
Collobert, R., Bengio, S., and Bengio, Y. A parallel mixture of SVMs for very large scale problems. In Proc. NIPS, 2002.
Ding, Y. and Fan, G. Sports video mining via multichannel segmental hidden markov models. IEEE Trans. on Multimedia, 11(7):1301-1309, 2009.
Ding, Y. and Fan, G. Segmental hidden Markov models for view-based sport video analysis. In Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 2007.
Ferguson, T. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209-230, 1973.
Fu, Z., Robles-Kelly, A., and Zhou, J. Mixing linear SVMs for nonlinear classification. IEEE Trans. on Neural Networks, 21(12):1963-1975, 2010.
Jaakkola, T., Meila, M., and Jebara, T. Maximum entropy discrimination. In Proc. NIPS, 1999.
Kosmopoulos, D. and Chatzis, S.P. Robust visual behavior recognition. Signal Processing Magazine, IEEE, 27(5):34-45, Sept. 2010.
Lewis, D., Jebara, T., and Noble, W. Nonstationary kernel combination. In Proc. ICML, 2006.
McLachlan, G. and Peel, D. Finite Mixture Models. Wiley Series in Probability and Statistics, New York, 2000.
Muller, P. and Quintana, F. Nonparametric Bayesian data analysis. Statist. Sci., 19(1):95-110, 2004.
Neal, R. Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Statist., 9:249-265, 2000.
Ni, Bingbing, Wang, Gang, and Moulin, Pierre. RGBD-HuDaAct: A color-depth video database for human daily activity recognition. In ICCV Workshops, pp. 1147-1153, 2011.
Qian, W. and Titterington, D.M. Estimation of parameters in hidden Markov models. Philos. Trans. R. Soc. London Ser. A, 337:407-428, 1991a.
Qian, W. and Titterington, D.M. Stochastic relaxations and EM algorithms for Markov random fields. J. Statist. Comput. Simul., 40:55-69, 1991b.
Rabiner, L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77:245-255, 1989.
Sethuraman, J. A constructive definition of the Dirichlet prior. Statistica Sinica, 2:639-650, 1994.
Sha, Fei and Saul, Lawrence K. Comparison of large margin training to other discriminative methods for phonetic recognition by hidden Markov models. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), pp. 313-316, 2007.
Srinivasan, M., Venkatesh, S., and Hosie, R. Qualitative estimation of camera motion parameters from video sequences. Pattern Recognit., 30:593-606, 1997.
Walker, S., Damien, P., Laud, P., and Smith, A. Bayesian nonparametric inference for random distributions and related functions. J. Roy. Statist. Soc. B, 61(3):485-527, 1999.
Winn, J. and Bishop, C.M. Variational message passing. J. Machine Learning Research, 6:661-694, 2005.
Xu, Minjie, Zhu, Jun, and Zhang, Bo. Nonparametric max-margin matrix factorization for collaborative prediction. In NIPS, 2012.
Ye, Nan, Lee, Wee Sun, Chieu, Hai Leong, and Wu, Dan. Conditional random fields with high-order features for sequence labeling. In Proc. NIPS, 2009.
Zhu, J. and Xing, E. Maximum entropy discrimination Markov networks. J. Machine Learning Research, 10:2531-2569, 2009.
Zhu, Jun, Chen, Ning, and Xing, Eric P. Infinite SVM: a Dirichlet process mixture of large-margin kernel machines. In Proc. ICML, 2011.
-----0
Anantharam, V., Varaiya, P., and Walrand, J. Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays, Part I: i.i.d. rewards; Part II: Markovian rewards. IEEE Transactions on Automatic Control, AC-32(11):968-982, 1987a.
Audibert, J.-Y. and Bubeck, S. Minimax policies for adversarial and stochastic bandits. In COLT, 2009.
Audibert, J.-Y., Bubeck, S., and Lugosi, G. Minimax policies for combinatorial prediction games. In COLT, 2011.
Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002a.
Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48-77, 2002b.
Berry, D. and Fristedt, B. Bandit problems. Chapman and Hall, 1985.
Bubeck, S., Cesa-Bianchi, N., and Kakade, S. M. Towards Minimax Policies for Online Linear Optimization with Bandit Feedback. In COLT, 2012.
Caro, F. and Gallien, J. Dynamic assortment with demand learning for seasonal consumer goods. Management Science, 53:276-292, 2007.
Cesa-Bianchi, N. and Lugosi, G. Combinatorial bandits.
Gai, Y., Krishnamachari, B., and Jain, R. Learning multiuser channel allocations in cognitive radio networks: A combinatorial multi-armed bandit formulation. In DySPAN, 2010.
Gai, Y., Krishnamachari, B., and Jain, R. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20, 2012.
Garivier, A. and Cappe, O. The KL-UCB algorithm for bounded stochastic bandits and beyond. In COLT, 2011.
Hazan, E. and Kale, S. Online submodular minimization. In NIPS, 2009.
Kakade, S. M., Kalai, A. T., and Ligett, K. Playing games with approximation algorithms. SIAM Journal on Computing, 39(3):1088-1106, 2009.
Kempe, D., Kleinberg, J. M., and Tardos, E. Maximizing the spread of influence through a social network. In KDD, 2003.
Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22, 1985.
Liu, H., Liu, K., and Zhao, Q. Logarithmic weak regret of non-bayesian restless multi-armed bandit. In ICASSP, 2011.
Liu, K. and Zhao, Q. Adaptive shortest-path routing under unknown and stochastically varying link states. Arxiv preprint arXiv:1201.4906, 2012.
Mannor, S. and Shamir, O. From Bandits to Experts: On the Value of Side-Observations. In NIPS, 2011.
Nemhauser, G., Wolsey, L., and Fisher, M. An analysis of the approximations for maximizing submodular set functions. Mathematical Programming, 14:265-294, 1978.
Radlinski, F., Kleinberg, R., and Joachims, T. Learning diverse rankings with multi-armed bandits. In ICML, 2008.
Streeter, M., Golovin, D., and Krause, A. Online learning of assignments. In NIPS, 2009.
Streeter, M. and Golovin, D. An online algorithm for maximizing submodular functions. In NIPS, 2008.
Sutton, R. and Barto, A. Reinforcement learning, an introduction. MIT Press, 1998.
Vazirani, V. V. Approximation Algorithms. Springer, 
-----0
Anshelevich, Elliot, Chakrabarty, Deeparnab, Hate, Ameya, and Swamy, Chaitanya. Approximation algorithms for the firefighter problem: Cuts over time and submodularity. In Algorithms and Computation. Springer Berlin / Heidelberg, 2009.
Asadpour, Arash, Nazerzadeh, Hamid, and Saberi, Amin. Stochastic submodular maximization. In WINE, pp. 477-489, Berlin, Heidelberg, 2008. Springer-Verlag.
Balcan, Maria Florina, Beygelzimer, Alina, and Langford, John. Agnostic active learning. In ICML, pp. 65-72, 2006.
Cesa-Bianchi, Nicolò, Gentile, Claudio, Vitale, Fabio, and Zappella, Giovanni. Active learning on trees and graphs. In COLT, pp. 320-332, 2010.
Dasgupta, Sanjoy. Analysis of a greedy active learning strategy. In NIPS, 2004.
Dasgupta, Sanjoy. Coarse sample complexity bounds for active learning. In Weiss, Y., Scholkopf, B., and Platt, J. (eds.), Advances in Neural Information Processing Systems 18, pp. 235-242. MIT Press, Cambridge, MA, 2006.
Desautels, Thomas, Krause, Andreas, and Burdick, Joel. Parallelizing exploration-exploitation tradeoffs with gaussian process bandit optimization. In ICML, 2012.
Domingos, Pedro and Richardson, Matt. Mining the network value of customers. In KDD, pp. 57-66, 2001.
Golovin, Daniel and Krause, Andreas. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research (JAIR), 42:427-486, 2011.
Golovin, Daniel, Krause, Andreas, and Ray, Debajyoti. Near-optimal bayesian active learning with noisy observations. In NIPS, December 2010.
Golovin, Daniel, Krause, Andreas, Gardner, Beth, Converse, Sarah, and Morey, Steve. Dynamic resource allocation in conservation planning. In AAAI, 2011.
Gonen, Alon, Sabato, Sivan, and Shalev-Shwartz, Shai. Active learning halfspaces under margin assumptions. CoRR, abs/1112.1556v3, 2011.
Guillory, Andrew and Bilmes, Jeff. Interactive submodular set cover. In ICML, 2010.
Guillory, Andrew and Bilmes, Jeff. Active semi-supervised learning using submodular functions. In UAI, 2011.
Hoi, Steven C. H., Jin, Rong, Zhu, Jianke, and Lyu, Michael R. Batch mode active learning and its application to medical image classification. In ICML, 2006.
Jain, Prateek, Vijayanarasimhan, Sudheendra, and Grauman, Kristen. Hashing hyperplane queries to near points with applications to large-scale active learning. In Advances in Neural Information Processing Systems 23, pp. 928-936. 2010.
Kempe, David, Kleinberg, Jon, and Tardos, Eva. Maximizing the spread of influence through a social network. In KDD, pp. 137-146, 2003.
Lovasz, Laszlo. Hit-and-run mixes fast. Math. Prog, 86:443-461, 1998.
MacKay, David J.C. Information-based objective functions for active data selection. Neural Computation, 4(4):590-604, 1992.
Settles, Burr. Active learning literature survey. Technical Report 1648, University of Wisconsin-Madison, 2010.
Smith, Robert L. Efficient monte carlo procedures for generating points uniformly distributed over bounded regions. Operations Research, 1984.
-----0
Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373-1396, 2002.
Biswas, P. and Ye, Y. Semidefinite programming for ad hoc wireless sensor network localization. In Proceedings of the 3rd international symposium on Information processing in sensor networks, IPSN '04, pp. 46-54, New York, NY, USA, 2004. ACM.
Borchers, B. CSDP, a C library for semidefinite programming. Optimization Methods and Software, 11(1-4):613-623, 1999.
Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.
Donoho, D.L. and Grimes, C. When does isomap recover the natural parameterization of families of articulated images? Department of Statistics, Stanford University, 2002.
Gupta, N. and Nau, D.S. On the complexity of blocks-world planning. Artificial Intelligence, 56(2):223-254, 1992.
Jones, M.T. Artificial Intelligence: A Systems Approach. Jones & Bartlett Learning, 2008.
Kruskal, J.B. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1-27, 1964.
Lee, J.A. and Verleysen, M. Nonlinear dimensionality reduction. Springer, 2007.
Ng, T.S.E. and Zhang, H. Predicting internet network distance with coordinates-based approaches. In INFOCOM 2002. Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies. Proceedings. IEEE, volume 1, pp. 170-179. IEEE, 2002.
Paprotny, A., Garcke, J., and Fraunhofer, S. On a connection between maximum variance unfolding, shortest path problems and isomap. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics. MIT Press, 2012.
Platt, J.C. Fast embedding of sparse music similarity graphs. Advances in Neural Information Processing Systems, 16:571-578, 2004.
Rayner, C., Bowling, M., and Sturtevant, N. Euclidean Heuristic Optimization. In Proceedings of the Twenty-Fifth National Conference on Artificial Intelligence (AAAI), pp. 81-86, 2011.
Russell, S. J. and Norvig, Peter. Artificial Intelligence: A Modern Approach. Pearson Education, 2 edition, 2003.
Saul, L.K., Weinberger, K.Q., Ham, J. H., Sha, F., and Lee, D. D. Spectral methods for dimensionality reduction, chapter 16, pp. 29330. MIT Press, 2006.
Shaw, B. and Jebara, T. Minimum volume embedding. In Proceedings of the 2007 Conference on Artificial Intelligence and Statistics. MIT press, 2007.
Shaw, B. and Jebara, T. Structure preserving embedding. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp. 937-944, New York, NY, USA, 2009. ACM.
Silva, V. and Tenenbaum, J.B. Global versus local methods in nonlinear dimensionality reduction. Advances in neural information processing systems, 15:705-712, 2002.
Talwalkar, A., Kumar, S., and Rowley, H. Large-scale manifold learning. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1-8. IEEE, 2008.
Tenenbaum, J. B., Silva, V., and Langford, J. C. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290(5500):2319-2323, 2000a.
Tenenbaum, J.B., De Silva, V., and Langford, J.C. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000b.
Vasiloglou, N., Gray, A.G., and Anderson, D.V. Scalable semidefinite manifold learning. In Machine Learning for Signal Processing, 2008. MLSP 2008. IEEE Workshop on, pp. 368-373. IEEE, 2008.
Weinberger, K.Q. and Saul, L.K. Unsupervised learning of image manifolds by semidefinite programming. International Journal of Computer Vision, 70:77-90, 2006. ISSN 0920-5691.
Weinberger, K.Q., Packer, B. D., and Saul, L. K. Nonlinear dimensionality reduction by semidefinite programming and kernel matrix factorization. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, Barbados, West Indies, 2005.
Weinberger, K.Q., Sha, F., Zhu, Q., and Saul, L. Graph laplacian regularization for large-scale semidefinite programming. In Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA, 2007.
Wu, X., So, A. Man-Cho, Li, Z., and Li, S.R. Fast graph laplacian regularized kernel learning via semidefinite quadratic linear programming. In Advances in Neural Information Processing Systems 22, pp. 1964-1972. 2009.
Zhang, T., Tao, D., Li, X., and Yang, J. Patch alignment for dimensionality reduction. Knowledge and Data Engineering, IEEE Transactions on, 21(9):1299-1313, 2009.
-----0
Bajwa, W.U., Calderbank, R., and Jafarpour, S. Model selection: Two fundamental measures of coherence and their algorithmic significance. In Proceedings of IEEE International Symposium on Information Theory (ISIT), pp. 1568-1572. IEEE, 2010.
Cai, T.T. and Wang, L. Orthogonal matching pursuit for sparse signal recovery with noise. Information Theory, IEEE Transactions on, 57(7):4680-4688, 2011.
Carroll, R.J. Measurement error in nonlinear models: a modern perspective, volume 105. CRC Press, 2006.
Davenport, M.A. and Wakin, M.B. Analysis of orthogonal matching pursuit using the restricted isometry property. Information Theory, IEEE Transactions on, 56(9):4395-4401, 2010.
Donoho, D.L., Elad, M., and Temlyakov, V.N. Stable recovery of sparse overcomplete representations in the presence of noise. Information Theory, IEEE Transactions on, 52(1):6-18, 2006.
Jalali, A., Johnson, C., and Ravikumar, P. On learning discrete graphical models using greedy methods. arXiv preprint arXiv:1107.3258, 2011.
Loh, P.L. and Wainwright, M.J. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. Arxiv preprint arXiv:1109.3714, 2011.
Loh, P.L. and Wainwright, M.J. Corrupted and missing predictors: Minimax bounds for high-dimensional linear regression. In Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on, pp. 2601-2605. IEEE, 2012.
Raskutti, G., Wainwright, M.J., and Yu, B. Minimax rates of estimation for high-dimensional linear regression over ℓ_q-balls. Arxiv preprint arXiv:0910.2042, 2009.
Rauhut, H. On the impossibility of uniform sparse reconstruction using greedy methods. Sampling Theory in Signal and Image Processing, 7(2):197, 2008.
Rosenbaum, M. and Tsybakov, A.B. Sparse recovery under matrix uncertainty. The Annals of Statistics, 38(5):2620-2651, 2010.
Rosenbaum, M. and Tsybakov, A.B. Improved matrix uncertainty selector. arXiv preprint arXiv:1112.4413, 2011.
Stadler, N. and Buhlmann, P. Missing values: sparse inverse covariance estimation and an extension to sparse regression. Statistics and Computing, pp. 1-17, 2010.
Tewari, A., Ravikumar, P., and Dhillon, I.S. Greedy algorithms for structurally constrained high dimensional problems. In Advances in Neural Information Processing Systems (NIPS) 24, 2011.
Tropp, J.A. Greed is good: Algorithmic results for sparse approximation. Information Theory, IEEE Transactions on, 50(10):2231-2242, 2004.
Tropp, J.A. and Gilbert, A.C. Signal recovery from random measurements via orthogonal matching pursuit. Information Theory, IEEE Transactions on, 53(12):4655-4666, 2007.
Yang, Y. and Barron, A. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, 27(5):1564-1599, 1999.
Yu, B. Assouad, Fano, and Le Cam. Festschrift for Lucien Le Cam, pp. 423-435, 1997.
-----0
Abernethy, J., Bach, F., Evgeniou, T., and Vert, J.P. A new approach to collaborative filtering: Operator estimation with spectral regularization. Journal of Machine Learning Research, 10:803-826, 2009.
Agarwal, D. and Chen, B.C. Regression-based latent factor models. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '09, pp. 19-28, New York, NY, USA, 2009. ACM.
Burges, C. From ranknet to lambdarank to lambdamart: An overview. Learning, 11:23581, 2010.
Friedman, J., Hastie, T., and Tibshirani, R. Additive logistic regression: a statistical view of boosting. The annals of statistics, 28(2):337-407, 2000.
Friedman, J., Hastie, T., and Tibshirani, R. The elements of statistical learning, volume 1. Springer Series in Statistics, 2001.
Friedman, J.H. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189-1232, 2001.
Jamali, M. and Ester, M. A matrix factorization technique with trust propagation for recommendation in social networks. In Proceedings of the fourth ACM conference on Recommender systems, RecSys '10, pp. 135-142, New York, NY, USA, 2010. ACM.
Koren, Y. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '08, pp. 426-434, New York, NY, USA, 2008. ACM.
Koren, Y. Collaborative filtering with temporal dynamics. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 447-456, 2009.
Koren, Y., Bell, R., and Volinsky, C. Matrix factorization techniques for recommender systems. Computer, 42, August 2009.
Lawrence, N.D. and Urtasun, R. Non-linear matrix factorization with gaussian processes. In Proceedings of the 26th International Conference on Machine Learning, pp. 601-608, Montreal, June 2009. Omnipress.
Rendle, S., Gantner, Z., Freudenthaler, C., and Schmidt-Thieme, L. Fast context-aware recommendations with factorization machines. In Proceedings of the 34th ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2011.
Salakhutdinov, R. and Mnih, A. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems 20, pp. 1257-1264. MIT Press, Cambridge, MA, 2008.
Stern, D.H., Herbrich, R., and Graepel, T. Matchbox: large scale online bayesian recommendations. In Proceedings of the 18th international conference on World wide web, WWW '09, pp. 111-120, New York, NY, USA, 2009. ACM.
Weston, J., Wang, C., Weiss, R., and Berenzweig, A. Latent collaborative retrieval. In International Conference on Machine Learning (ICML), Edinburgh, Scotland, June 2012.
Zhou, K., Yang, S.H., and Zha, H. Functional matrix factorizations for cold-start recommendation. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, SIGIR '11, pp. 315-324, New York, NY, USA, 2011. ACM.
-----0
Bachrach, Y., Minka, T., Guiver, J., and Graepel, T. How to grade a test without knowing the answers: a Bayesian graphical model for adaptive crowdsourcing and aptitude testing. In ICML, 2012.
Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2007.
Dawid, A. P. and Skene, A. M. Maximum likelihood estimation of observer error-rates using the EM algorithm. JRSS-C, 28:20-28, 1979.
Ertekin, S., Hirsh, H., and Rudin, C. Wisely using a budget for crowdsourcing. Technical report, MIT, 2012.
Frazier, P., Powell, W. B., and Dayanik, S. A knowledge-gradient policy for sequential information collection. SIAM J. Control Optim., 47(5):2410-2439, 2008.
Gittins, J. C. Multi-armed Bandit Allocation Indices. John Wiley & Sons, 1989.
Gupta, S. S. and Miescke, K.J. Bayesian look ahead one stage sampling allocations for selection the largest normal mean. J. of Stat. Planning and Inference, 54(2):229-244, 1996.
Karger, D. R., Oh, S., and Shah, D. Budget-optimal task allocation for reliable crowdsourcing systems. arXiv:1110.3564v3, 11 2012.
Liu, C. and Wang, Y. M. Truelabel + confusions: A spectrum of probabilistic models in analyzing multiple ratings. In ICML, 2012.
Liu, Q., Peng, J., and Ihler, A. Variational inference for crowdsourcing. In NIPS, 2012.
Nino-Mora, J. Computing a classic index for finite-horizon bandits. INFORMS Journal on Computing, 23(2):254-267, 2011.
Nowak, R. D. Noisy generalized binary search. In NIPS, 2009.
Pfeiffer, T., Gao, X. A., Mao, A., Chen, Y., and Rand, D. G. Adaptive polling for information aggregation. In AAAI, 2012.
Powell, W. B. Approximate Dynamic Programming: solving the curses of dimensionality. John Wiley & Sons, 2007.
Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 2005.
Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., and Moy, L. Learning from crowds. JMLR, 11:1297-1322, 2010.
Robert, Christian P. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer Verlag, 2007.
Rockafellar, R. T. and Uryasev, S. Conditional value-at-risk for general loss distributions. J. of Banking and Finance, 26:1443–1471, 2002.
Settles, B. Active learning literature survey. Technical report, University of Wisconsin–Madison, 2009.
Snow, R., O'Connor, B., Jurafsky, D., and Ng, A. Y. Cheap and fast but is it good? Evaluating non-expert annotations for natural language tasks. In EMNLP, 2008.
Welinder, P., Branson, S., Belongie, S., and Perona, P. The multidimensional wisdom of crowds. In NIPS, 2010.
Whitehill, J., Ruvolo, P., Wu, T., Bergsma, J., and Movellan, J. R. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In NIPS, 2009.
Xie, J. and Frazier, P. I. Sequential Bayes-optimal policies for multiple comparisons with a control. Technical report, Cornell University, 2012.
Yan, Y., Rosales, R., Fung, G., and Dy, J. Active learning from crowds. In ICML, 2011.
Zhou, D., Basu, S., Mao, Y., and Platt, J. Learning from the wisdom of crowds by minimax conditional entropy. In NIPS, 2012.
-----0
Balle, B. and Mohri, M. Spectral learning of general weighted automata via constrained matrix completion. Advances in Neural Information Processing Systems (NIPS), pp. 2168–2176, 2012.
Borcea, L., Papanicolaou, G., Tsogka, C., and Berryman, J. Imaging and time reversal in random media. Inverse Problems, 18(5):1247, 2002.
Cai, J. F., Candes, E. J., and Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
Candes, E. J. and Fernandez-Granda, C. Super-resolution from noisy data. arXiv:1211.0290, November 2012.
Candes, E. J. and Fernandez-Granda, C. Towards a mathematical theory of super-resolution. to appear in Communications on Pure and Applied Mathematics, 2013.
Candes, E. J. and Recht, B. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, April 2009.
Candes, E. J., Romberg, J., and Tao, T. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, Feb. 2006.
Chen, Y. and Chi, Y. Robust spectral compressed sensing via structured matrix completion. Arxiv: 1304.8126, May 2013.
Chi, Y., Scharf, L.L., Pezeshki, A., and Calderbank, A.R. Sensitivity to basis mismatch in compressed sensing. IEEE Transactions on Signal Processing, 59(5):2182–2195, May 2011.
Dragotti, P. L., Vetterli, M., and Blu, T. Sampling moments and reconstructing signals of finite rate of innovation: Shannon meets Strang-Fix. IEEE Trans on Signal Processing, 55(5):1741–1757, May 2007.
Duarte, M.F. and Baraniuk, R.G. Spectral compressive sensing. Applied and Computational Harmonic Analysis, 2012.
Fazel, M., Hindi, H., and Boyd, S. P. Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices. American Control Conference, 3:2156–2162, June 2003.
Fazel, M., Pong, T. K., Sun, D., and Tseng, P. Hankel matrix rank minimization with applications in system identification and realization, 2011.
Gedalyahu, K., Tur, R., and Eldar, Y. C. Multichannel sampling of pulse streams at the rate of innovation. IEEE Trans on Sig. Proc., 59:1491–1504, 2011.
Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566, March 2011.
Hua, Y. Estimating two-dimensional frequencies by matrix enhancement and matrix pencil. IEEE Trans on Sig. Proc., 40(9):2267–2280, Sep 1992.
Lustig, M., Donoho, D., and Pauly, J. M. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine, 58(6):1182–1195, 2007.
Markovsky, I. Structured low-rank approximation and its applications. Automatica, 44(4):891–909, 2008.
Negahban, S. and Wainwright, M.J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. The Journal of Machine Learning Research, 13:1665–1697, May 2012.
Nion, D. and Sidiropoulos, N. D. Tensor algebra and multidimensional harmonic retrieval in signal processing for MIMO Radar. IEEE Transactions on Signal Processing, 58(11):5693–5705, Nov. 2010.
Potter, L.C., Ertin, E., Parker, J.T., and Cetin, M. Sparsity and compressed sensing in radar imaging. Proceedings of the IEEE, 98(6):1006–1020, 2010.
Roy, R. and Kailath, T. ESPRIT-estimation of signal parameters via rotational invariance techniques. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(7):984–995, Jul 1989.
Sankaranarayanan, A., Turaga, P., Baraniuk, R., and Chellappa, R. Compressive acquisition of dynamic scenes. ECCV 2010, pp. 129–142, 2010.
Sayeed, A. M. and Aazhang, B. Joint multipath-Doppler diversity in mobile wireless communications. IEEE Transactions on Communications, 47(1):123–132, Jan 1999.
Schermelleh, L., Heintzmann, R., and Leonhardt, H. A guide to super-resolution fluorescence microscopy. The Journal of Cell Biology, 190(2):165–175, 2010.
Shin, P., Larson, P., Ohliger, M., Elad, M., Pauly, J., Vigneron, D., and Lustig, M. Calibrationless parallel imaging reconstruction based on structured low-rank matrix completion. submitted to Magnetic Resonance in Medicine, 2012.
Tang, G., Bhaskar, B. N., Shah, P., and Recht, B. Compressed sensing off the grid. arXiv:1207.6053, July 2012.
-----0
Bickel, P.J., Ritov, Y., and Tsybakov, A.B. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
Candes, E.J. and Tao, T. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
Candes, E.J., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? Arxiv preprint arXiv:0912.3599, 2009.
Chandrasekaran, V., Sanghavi, S., Parrilo, P. A., and Willsky, A. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572–596, 2011.
Chen, S.S., Donoho, D.L., and Saunders, M.A. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1999.
Chen, Y. and Caramanis, C. Noisy and missing data regression: Distribution-oblivious support recovery. In ICML, 2013.
Chen, Y., Xu, H., Caramanis, C., and Sanghavi, Sujay. Robust matrix completion with corrupted columns. In ICML, 2011.
Donoho, D. L. Breakdown properties of multivariate location estimators, qualifying paper, Harvard University, 1982.
Donoho, D.L., Elad, M., and Temlyakov, V.N. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, 2006.
Fuchs, J.J. An inverse problem approach to robust regression. In Proceedings of ICASSP, volume 4, pp. 1809–1812. IEEE, 1999.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., and Stahel, W.A. Robust statistics: the approach based on influence functions, volume 114. Wiley, 1986.
Herman, M.A. and Strohmer, T. General deviants: An analysis of perturbations in compressed sensing. IEEE Journal of Selected Topics in Signal Processing, 4(2):342–349, 2010.
Huber, P. Robust Statistics. Wiley, New York, 1981.
Kekatos, V. and Giannakis, G.B. From sparse signals to sparse residuals for robust sensing. IEEE Transactions on Signal Processing, 59(7), 2011.
Laska, J.N., Davenport, M.A., and Baraniuk, R.G. Exact signal recovery from sparsely corrupted measurements through the pursuit of justice. In Asilomar Conference on Signals, Systems & Computers, 2009.
Lerman, G., McCoy, M., Tropp, J.A., and Zhang, T. Robust computation of linear models, or how to find a needle in a haystack. arXiv:1202.4044, 2012.
Li, Xiaodong. Compressed sensing and matrix completion with constant proportion of corruptions. Arxiv preprint arXiv:1104.1041, 2011.
Loh, P.L. and Wainwright, M.J. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. Annals of Statistics, 40(3):1637–1664, 2012.
Maronna, R.A., Martin, R.D., and Yohai, V.J. Robust statistics. Wiley, 2006.
Nguyen, N.H., Tran, T., et al. Exact recoverability from dense corrupted observations via l1 minimization. Arxiv preprint arXiv:1102.1227, 2011.
Rosenbaum, M. and Tsybakov, A.B. Sparse recovery under matrix uncertainty. The Annals of Statistics, 38(5):2620–2651, 2010.
Rosenbaum, M. and Tsybakov, A.B. Improved matrix uncertainty selector. arXiv:1112.4413, 2011.
She, Y. and Owen, A. B. Outlier Detection Using Nonconvex Penalized Regression. arXiv:1006.2592, 2010.
Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
Tropp, J.A. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.
Xu, H., Caramanis, C., and Sanghavi, S. Robust PCA via outlier pursuit. IEEE Transactions on Information Theory, 58(5):3047–3064, 2012.
Xu, H., Caramanis, C., and Mannor, S. Outlier-Robust PCA: The High Dimensional Case. IEEE Transactions on Information Theory, 59(1), 2013.
Yu, Y., Aslan, O., and Schuurmans, D. A polynomial-time form of robust regression. In NIPS, 2012.
Zhu, H., Leus, G., and Giannakis, G.B. Sparsity-cognizant total least-squares for perturbed compressive sampling. IEEE Transactions on Signal Processing, 59(5):2002–2016, 2011.
-----0
Ahmed, A. and Xing, E. P. Dynamic non-parametric mixture models and the recurrent Chinese restaurant process. In SDM, 2008.
Andrieu, C. and Roberts, G. O. The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Statist., 37(2):697–725, 2009.
Broderick, T., Jordan, M. I., and Pitman, J. Beta processes, stick-breaking, and power laws. Bayesian Anal., 7:439–476, 2012.
Caron, F., Davy, M., and Doucet, A. Generalized Polya urn for time-varying Dirichlet process mixtures. In UAI, 2007.
Chen, C., Ding, N., and Buntine, W. Dependent hierarchical normalized random measures for dynamic topic modeling. In ICML, 2012.
Du, L., Buntine, W., and Jin, H. A segmented topic model based on the two-parameter Poisson-Dirichlet process. Mach. Learn., 81:5–19, 2010.
Favaro, S. and Teh, Y. W. MCMC for normalized random measure mixture models. Stat. Sci., 2012.
Foti, N. J., Futoma, J., Rockmore, D., and Williamson, S. A. A unifying representation for a class of dependent random measures. Technical Report arXiv:1211.4753, Dartmouth College and CMU, USA, 2012. URL http://arxiv.org/abs/1211.4753.
Gelfand, A. E., Kottas, A., and MacEachern, S. N. Bayesian nonparametric spatial modeling with Dirichlet process mixing. J. Amer. Statist. Assoc., 100(471):1021–1035, 2005.
Globerson, A., Chechik, G., Pereira, F., and Tishby, N. Euclidean embedding of co-occurrence data. JMLR, 8:2265–2295, 2007.
Griffin, J. E. and Steel, M. F. J. Order-based dependent Dirichlet processes. J. Amer. Statist. Assoc., 101:179–194, 2006.
Griffin, J. E., Kolossiatis, M., and Steel, M. F. J. Comparing distributions using dependent normalized random measure mixtures. J. R. Stat. Soc. Ser. B Stat. Methodol., 2012.
Griffin, J.E. and Walker, S.G. Posterior simulation of normalized random measure mixtures. J. Comput. Graph. Stat., 20(1):241–259, 2011.
Griffiths, T. L. and Ghahramani, Z. The Indian buffet process: An introduction and review. JMLR, 12:1185–1224, 2011.
James, L. F. Bayesian Poisson process partition calculus with an application to Bayesian Levy moving averages. Ann. Statist., 33(4):1771–1799, 2005.
James, L.F., Lijoi, A., and Prunster, I. Posterior analysis for normalized random measures with independent increments. Scand. J. Stat., 36:76–97, 2009.
Lin, D., Grimson, E., and Fisher, J. Construction of dependent Dirichlet processes based on Poisson processes. In NIPS, 2010.
Lin, D. H. and Fisher, J. Coupling nonparametric mixtures via latent Dirichlet processes. In NIPS, 2012.
MacEachern, S. Dependent nonparametric processes. In Proc. of the SBSS, 1999.
MacEachern, S.N., Kottas, A., and Gelfand, A.E. Spatial nonparametric Bayesian models. In Proc. of the 2001 Joint Statistical Meetings, 2001.
Neal, R. M. Slice sampling. Ann. Statist., 31(3):705–767, 2003.
Rao, V. and Teh, Y. W. Spatial normalized Gamma processes. In NIPS, 2009.
Rosen-Zvi, M., Griffiths, T., Steyvers, M., and Smyth, P. The author-topic model for authors and documents. In UAI, 2004.
Srebro, N. and Roweis, S. Time-varying topic models using dependent Dirichlet processes. Technical report, University of Toronto, 2005.
Teh, Y.W. and Gorur, D. Indian buffet processes with power-law behavior. In NIPS, 2009.
Teh, Y.W., Jordan, M.I., Beal, M.J., and Blei, D.M. Hierarchical Dirichlet processes. J. Amer. Statist. Assoc., 101(476):1566–1581, 2006.
Williamson, S. A., Wang, C., Heller, K. A., and Blei, D. The IBP compound Dirichlet process and its application to focused topic modeling. In ICML, 2010.
-----0
Barnard, Kobus, Duygulu, Pinar, Forsyth, David, De Freitas, Nando, Blei, David M, and Jordan, Michael I. Matching words and pictures. The Journal of Machine Learning Research, 3:1107–1135, 2003.
Carneiro, G., Chan, A.B., Moreno, P.J., and Vasconcelos, N. Supervised learning of semantic classes for image annotation and retrieval. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 29(3):394–410, 2007.
Chen, M., Xu, Z., Weinberger, K., and Sha, F. Marginalized denoising autoencoders for domain adaptation. In ICML '12, pp. 767–774. ACM, New York, NY, USA, July 2012.
Cusano, Claudio, Ciocca, Gianluigi, and Schettini, Raimondo. Image annotation using SVM. In Electronic Imaging 2004, pp. 330–338. International Society for Optics and Photonics, 2003.
Duygulu, P., Barnard, K., De Freitas, J., and Forsyth, D. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. Computer Vision – ECCV 2002, pp. 349–354, 2006.
Feng, SL, Manmatha, R., and Lavrenko, V. Multiple Bernoulli relevance models for image and video annotation. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pp. II-1002. IEEE, 2004.
Fergus, R., Weiss, Y., and Torralba, A. Semi-supervised learning in gigantic image collections. 2009.
Grangier, David and Bengio, Samy. A discriminative kernel-based approach to rank images from text queries. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(8):1371–1384, 2008.
Grubinger, M., Clough, P., Muller, H., and Deselaers, T. The IAPR TC-12 benchmark: A new evaluation resource for visual information systems. In International Workshop OntoImage, pp. 13–23, 2006.
Guillaumin, M., Mensink, T., Verbeek, J., and Schmid, C. Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation. In Computer Vision, 2009 IEEE 12th International Conference on, pp. 309–316. IEEE, 2009.
Hafner, J., Sawhney, H.S., Equitz, W., Flickner, M., and Niblack, W. Efficient color histogram indexing for quadratic form distance functions. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 17(7):729–736, 1995.
Hertz, Tomer, Bar-Hillel, Aharon, and Weinshall, Daphna. Learning distance functions for image retrieval. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pp. II-570. IEEE, 2004.
Jeon, J., Lavrenko, V., and Manmatha, R. Automatic image annotation and retrieval using cross-media relevance models. In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 119–126. ACM, 2003.
Lavrenko, V., Manmatha, R., and Jeon, J. A model for learning the semantics of pictures. NIPS, 2003.
Liu, J., Li, M., Liu, Q., Lu, H., and Ma, S. Image annotation via graph learning. Pattern Recognition, 42(2):218–228, 2009.
Lowe, D.G. Object recognition from local scale-invariant features. In Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, volume 2, pp. 1150–1157. IEEE, 1999.
Makadia, A., Pavlovic, V., and Kumar, S. A new baseline for image annotation. In ECCV, volume 8, pp. 316–329, 2008.
Metzler, D. and Manmatha, R. An inference network approach to image retrieval. Image and video retrieval, pp. 2130–2131, 2004.
Monay, Florent and Gatica-Perez, Daniel. PLSA-based image auto-annotation: constraining the latent space. In Proceedings of the 12th annual ACM international conference on Multimedia, pp. 348–351. ACM, 2004.
Oliva, A. and Torralba, A. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.
Schroff, F., Criminisi, A., and Zisserman, A. Harvesting image databases from the web. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pp. 1–8. IEEE, 2007.
Socher, R. and Fei-Fei, L. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 966–973. IEEE, 2010.
Vedaldi, A. and Zisserman, A. Efficient additive kernels via explicit feature maps. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(3):480–492, 2012.
Vempala, Santosh S. The random projection method, volume 65. Amer Mathematical Society, 2005.
Von Ahn, L. and Dabbish, L. Labeling images with a computer game. In Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 319–326. ACM, 2004.
Weinberger, Kilian Q., Slaney, Malcolm, and Van Zwol, Roelof. Resolving tag ambiguity. In Proceeding of the 16th ACM international conference on Multimedia, MM '08, pp. 111–120, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-303-7.
Yavlinsky, A., Schofield, E., and Ruger, S. Automated image annotation using global features and robust nonparametric density estimation. Image and video retrieval, pp. 593–593, 2005.
-----0
CMU motion capture database. http://mocap.cs.cmu.edu/.
Absil, P., Mahony, R., and Sepulchre, R. Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, 2008.
Akhter, I., Simon, T., Khan, S., Matthews, I., and Sheikh, Y. Bilinear spatiotemporal basis models. ACM Trans. Graph., 31(2), 2012.
Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.
Balcan, M. and Blum, A. On a theory of learning with similarity functions. In ICML, 2006.
Bellet, A. Supervised Metric Learning with Generalization Guarantees. PhD thesis, University of Saint-Etienne, 2012.
Bellet, A., Habrard, A., and Sebban, M. Similarity learning for provably accurate sparse linear classification. In ICML, 2012.
Bi, J., Wu, D., Lu, L., Liu, M., Tao, Y., and Wolf, M. AdaBoost on low-rank PSD matrices for metric learning. In CVPR, 2011.
Bishop, C. Pattern Recognition and Machine Learning. Springer-Verlag, 2006.
Chen, H., Liu, T., and Fuh, C. Learning effective image metrics from few pairwise examples. In ICCV, 2005.
Davis, J., Kulis, B., Jain, P., Sra, S., and Dhillon, I. Information-theoretic metric learning. In ICML, 2007.
do Carmo, M. Riemannian Geometry. Springer, 1992.
Duan, L., Xu, D., and Tsang, I. Learning with augmented features for heterogeneous domain adaptation. In ICML, 2012.
Elden, Lars. Matrix Methods in Data Mining and Pattern Recognition. SIAM, 2007.
Fei-Fei, L. and Perona, P. A Bayesian hierarchical model for learning natural scene categories. In CVPR, 2005.
Fei-Fei, L., Fergus, R., and Perona, P. One-shot learning of object categories. IEEE Trans. PAMI, 28:594–611, 2006.
Ferreira, O. and Oliveira, P. Subgradient algorithm on Riemannian manifolds. J. Optim. Theory Appl., 97:93–104, 1998.
Golub, G. and Van Loan, C. Matrix Computations. The Johns Hopkins University Press, 1996.
Huang, G., Ramesh, M., Berg, T., and Learned-Miller, E. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, 2007.
Hubert, L., Meulman, J., and Heiser, W. Two purposes for matrix factorization: A historical appraisal. SIAM Rev., 42(1):68–82, 2000.
Jain, P., Kulis, B., Davis, J., and Dhillon, I. Metric and kernel learning using a linear transformation. JMLR, 13:519–547, 2012.
Kliper-Gross, O., Hassner, T., and Wolf, L. One shot similarity metric learning for action recognition. In International conference on Similarity-based pattern recognition, pp. 31–45, 2011.
Kliper-Gross, O., Hassner, T., and Wolf, L. The action similarity labeling challenge. IEEE Trans. PAMI, 2012.
Kulis, B., Saenko, K., and Darrell, T. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In CVPR, pp. 1785–1792, 2011.
Kumar, N., Berg, A., Belhumeur, P., and Nayar, S. Attribute and simile classifiers for face verification. In ICCV, 2009.
Lee, J. Introduction to Smooth Manifolds. Springer, 2003.
Mensink, T., Verbeek, J., Perronnin, F., and Csurka, G. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV, 2012.
Miller, E., Miller, E., Matsakis, N., and Viola, P. Learning from one example through shared densities on transforms. In CVPR, 2000.
Mishra, B., Meyer, G., Bonnabel, S., and Sepulchre, R. Fixed-rank matrix factorizations and Riemannian low-rank optimization. Technical report, arXiv, 2012.
Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. The MIT Press, 2012.
Nocedal, J. and Wright, S. Numerical optimization. Springer, 2006.
Pan, S. and Yang, Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
Pirsiavash, H. and Ramanan, D. Steerable part models. In CVPR, 2012.
Pirsiavash, H., Ramanan, D., and Fowlkes, C. Bilinear classifiers for visual recognition. In NIPS, 2009.
Saenko, K., Kulis, B., Fritz, M., and Darrell, T. Adapting visual category models to new domains. In ECCV, pp. 213–226, 2010.
Shalit, U., Weinshall, D., and Chechik, G. Online learning in the embedded manifold of low-rank matrices. J. Mach. Learn. Res., 13:429–458, 2012.
Srebro, Nathan, Alon, Noga, and Jaakkola, Tommi. Generalization error bounds for collaborative prediction with low-rank matrices. In NIPS, 2004.
Tenenbaum, J. and Freeman, W. Separating style and content with bilinear models. Neural Comput., 12(6):1247–1283, 2000.
Vandereycken, B. Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 2013. Accepted.
Wang, H., Ullah, M., Klaser, A., Laptev, I., and Schmid, C. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009a.
Wang, L., Sugiyama, M., Yang, C., Hatano, K., and Feng, J. Theory and algorithm for learning with dissimilarity functions. Neural Comput., 21(5):1459–1484, 2009b.
Yang, Y. Globally convergent optimization algorithms on Riemannian manifolds: Uniform framework for unconstrained and constrained optimization. Journal of Optimization Theory and Applications, 132(2):245–265, 2007.
Yin, Q., Tang, X., and Sun, J. An associate-predict model for face recognition. In CVPR, 2011.
Ying, Y., Huang, K., and Campbell, C. Sparse metric learning via smooth optimization. In NIPS, 2009.
-----0
P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
P. Berman and M. Karpinski. On some tighter inapproximability results (extended abstract). In Automata, Languages and Programming, 26th International Colloquium (ICALP), pages 200–209. Springer, 1999.
Y. Chevaleyre, F. Koriche, and J. D. Zucker. Rounding methods for discrete linear classification (extended version). Technical Report hal-00771012, hal, 2013.
U. Feige, M. Karpinski, and M. Langberg. A note on approximating Max-Bisection on regular graphs. Information Processing Letters, 79(4):181–188, 2001.
M. Golea and M. Marchand. Average case analysis of the clipped Hebb rule for nonoverlapping perceptron networks. In Proceedings of the 6th annual conference on computational learning theory (COLT '93). ACM, 1993a.
M. Golea and M. Marchand. On learning perceptrons with binary weights. Neural Computation, 5(5):767–782, 1993b.
S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS), pages 793–800, 2008.
H. Kohler, S. Diederich, W. Kinzel, and M. Opper. Learning algorithm for a neural network with binary synapses. Zeitschrift für Physik B Condensed Matter, 78:333–342, 1990.
M. Opper, W. Kinzel, J. Kleinz, and R. Nehl. On the ability of the optimal perceptron to generalise. Journal of Physics A: Mathematical and General, 23(11):L581–L586, 1990.
L. Pitt and L. G. Valiant. Computational limitations on learning from examples. J. ACM, 35(4):965–984, 1988.
P. Raghavan and C. D. Thompson. Randomized rounding: A technique for provably good algorithms and algorithmic proofs. Combinatorica, 7(4):365–374, 1987.
J. Shawe-Taylor and N. Cristianini. An Introduction to Support Vector Machines. Cambridge, 2000.
G. G. Towell and J. W. Shavlik. Extracting refined rules from knowledge-based neural networks. Machine Learning, 13:71–101, 1993.
S. Venkatesh. On learning binary weights for majority functions. In Proceedings of the 4th Annual Workshop on Computational Learning Theory (COLT), pages 257–266. Morgan Kaufmann, 1991.
D. P. Williamson and D. B. Shmoys. The Design of Approximation Algorithms. Cambridge, 2011.
-----0
Bishop, C. M. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
Burger, H., Schuler, C., and Harmeling, S. Image denoising: Can plain neural networks compete with BM3D? In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 2392–2399, June 2012.
Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1–27:27, May 2011.
Chen, S. S., Donoho, D. L., and Saunders, M. A. Atomic decomposition by basis pursuit. SIAM Rev., 43(1):129–159, January 2001.
Cho, K. Boltzmann machines and denoising autoencoders for image denoising. ArXiv e-prints, January 2013.
Elad, M. and Aharon, M. Image denoising via sparse and redundant representations over learned dictionaries. Image Processing, IEEE Transactions on, 15(12):3736–3745, December 2006.
Hyvarinen, A. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634, 1999.
Hyvarinen, A., Hoyer, P., and Oja, E. Image denoising by sparse code shrinkage. In Intelligent Signal Processing. IEEE Press, 1999.
Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, Computer Science Department, University of Toronto, 2009.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86, pp. 2278–2324, 1998.
Lee, H., Ekanadham, C., and Ng, A. Sparse deep belief net model for visual area V2. In Platt, J., Koller, D., Singer, Y., and Roweis, S. (eds.), Advances in Neural Information Processing Systems 20, pp. 873–880. MIT Press, Cambridge, MA, 2008.
Olshausen, B. A. and Field, D. J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, June 1996.
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. Efficient learning of sparse representations with an energy-based model. In Scholkopf, B., Platt, J., and Hoffman, T. (eds.), Advances in Neural Information Processing Systems 19, pp. 1137–1144. MIT Press, Cambridge, MA, 2007.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning representations by back-propagating errors. Nature, 323(Oct):533–536, 1986.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371–3408, December 2010.
Xie, J., Xu, L., and Chen, E. Image denoising and inpainting with deep neural networks. In Bartlett, P., Pereira, F., Burges, C., Bottou, L., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems 25, pp. 350–358. 2012.
-----0
Alsumait, Loulwah, Barbara, Daniel, Gentle, James, and Domeniconi, Carlotta. Topic significance ranking of LDA generative models. In ECML, pp. 67–82, 2009.
Blei, David M. and Lafferty, John D. Dynamic topic models. In ICML, pp. 113–120, 2006.
Blei, David M., Ng, Andrew Y., and Jordan, Michael I. Latent Dirichlet allocation. J of Machine Learning Research, 3(1):993–1022, 2003.
Blei, David M., Griffiths, Thomas L., Jordan, Michael I., and Tenenbaum, Joshua B. Hierarchical topic models and the nested Chinese restaurant process. In NIPS, 2004.
Budiu, Raluca, Royer, Christiaan, and Pirolli, Peter. Modeling information scent: a comparison of LSA, PMI and GLSA similarity measures on common tests and corpora. In RIAO, pp. 314–332, 2007.
Chaney, Allison and Blei, David. Visualizing topic models. In AAAI, pp. 419–422, 2012.
Chang, Jonathan, Boyd-Graber, Jordan, Wang, Chong, Gerrish, Sean, and Blei, David M. Reading tea leaves: How humans interpret topic models. In NIPS, 2009.
Chuang, Jason, Manning, Christopher D., and Heer, Jeffrey. Termite: Visualization techniques for assessing textual topic models. In AVI, pp. 74–77, 2012a.
Chuang, Jason, Ramage, Daniel, Manning, Christopher D., and Heer, Jeffrey. Interpretation and trust: Designing model-driven visualizations for text analysis. In CHI, pp. 443–452, 2012b.
Gretarsson, Brynjar, O'Donovan, John, Bostandjiev, Svetlin, Hollerer, Tobias, Asuncion, Arthur, Newman, David, and Smyth, Padhraic. TopicNets: Visual analysis of large text corpora with topic modeling. ACM Trans on Intelligent Systems and Technology, 3(2):23:1–23:26, 2012.
Hall, David, Jurafsky, Daniel, and Manning, Christopher D. Studying the history of ideas using topic models. In EMNLP, pp. 363–371, 2008.
Hu, Yuening, Boyd-Graber, Jordan, and Satinoff, Brianna. Interactive topic modeling. In ACL-HLT, 2011.
Johnson, K. E. and Mervis, C. B. Effects of varying levels of expertise on the basic level of categorization. J of Experimental Psychology: General, 126(3):248–277, 1997.
McCallum, Andrew Kachites. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2013. Accessed 2013.
Mimno, David, Wallach, Hanna, Talley, Edmund, Leenders, Miriam, and McCallum, Andrew. Optimizing semantic coherence in topic models. In EMNLP, 2011.
Musat, Claudiu Cristian, Velcin, Julien, Trausan-Matu, Stefan, and Rizoiu, Marian-Andrei. Improving topic evaluation using conceptual knowledge. In IJCAI, 2011.
Newman, David, Chemudugunta, Chaitanya, Smyth, Padhraic, and Steyvers, Mark. Analyzing entities and topics in news articles using statistical topic models. In ISI, pp. 93104, 2006.
Newman, David, Lau, Jey Han, Grieser, Karl, and Baldwin, Timothy. Automatic evaluation of topic coherence.In HLT, pp. 100108, 2010a.
Newman, David, Noh, Youn, Talley, Edmund, Karimi, Sarvnaz, and Baldwin, Timothy. Evaluating topic models for digital libraries. In JCDL, pp. 215224, 2010b.
Ramage, Daniel. Stanford topic modeling toolbox 0.4.http://nlp.stanford.edu/software/tmt, 2013. Accessed 2013.
Ramage, Daniel, Hall, David, Nallapati, Ramesh, and Manning, Christopher D. Labeled LDA: A supervised topic model for credit attribution in multi-label corpora. In EMNLP, pp. 248-256, 2009.
Ramage, Daniel, Dumais, S., and Liebling, D. Characterizing microblogs with topic models. In ICWSM, pp. 130-137, 2010.
Ramage, Daniel, Manning, Christopher D., and Dumais, Susan. Partially labeled topic models for interpretable text mining. In KDD, pp. 457-465, 2011.
Rosen-Zvi, Michal, Griffiths, Thomas, Steyvers, Mark, and Smyth, Padhraic. The author-topic model for authors and documents. In UAI, pp. 487-494, 2004.
Stevens, Keith, Kegelmeyer, Philip, Andrzejewski, David, and Buttler, David. Exploring topic coherence over many models and many topics. In EMNLP-CoNLL, pp. 952-961, 2012.
Steyvers, M., Smyth, P., Rosen-Zvi, M., and Griffiths, T. Probabilistic author-topic models for information discovery. In KDD, 2004.
Steyvers, Mark and Griffiths, Tom. Matlab topic modeling toolbox 1.4. http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm, 2013. Accessed 2013.
Talley, Edmund M., Newman, David, Mimno, David, Herr, Bruce W., Wallach, Hanna M., Burns, Gully A. P. C., Leenders, A. G. Miriam, and McCallum, Andrew. Database of NIH grants using machine-learned categories and graphical clustering. Nature Methods, 8(6):443-444, 2011.
Titov, Ivan and McDonald, Ryan. A joint model of text and aspect ratings for sentiment summarization. In ACL-HLT, pp. 308-316, 2008.
Wallach, Hanna M., Mimno, David, and McCallum, Andrew. Rethinking LDA: Why priors matter. In NIPS, 2009a.
Wallach, Hanna M., Murray, Iain, Salakhutdinov, Ruslan, and Mimno, David. Evaluation methods for topic models. In ICML, pp. 1105-1112, 2009b.
Wang, Xuerui and McCallum, Andrew. Topics over time: A non-Markov continuous-time model of topical trends. In KDD, pp. 424-433, 2006.
Wei, Xing and Croft, W. Bruce. Lda-based document models for ad-hoc retrieval. In SIGIR, pp. 178-185, 2006.
-----0
Bengio, Y, Lamblin, P, Popovici, D, and Larochelle, H. Greedy layer-wise training of deep networks. In Neural Information Processing Systems, 2006.
Chu, C, Kim, S. K, Lin, Y.-A, Yu, Y, Bradski, G, Ng, A. Y, and Olukotun, K. Map-reduce for machine learning on multicore. Advances in Neural Information Processing Systems, 19:281, 2007.
Ciresan, D. C, Meier, U, Masci, J, Gambardella, L. M, and Schmidhuber, J. Flexible, high performance convolutional neural networks for image classification. In International Joint Conference on Artificial Intelligence, pp. 1237-1242, 2011.
Ciresan, D. C, Meier, U, and Schmidhuber, J. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition, pp. 3642-3649, 2012.
Coates, A, Lee, H, and Ng, A. Y. An analysis of single-layer networks in unsupervised feature learning. In 14th International Conference on AI and Statistics, pp. 215-223, 2011.
Coates, A, Karpathy, A, and Ng, A. Y. On the emergence of object-selective features in unsupervised feature learning. In Advances in Neural Information Processing Systems, 2012.
Dean, J and Ghemawat, S. Mapreduce: Simplified data processing on large clusters. In Sixth Symposium on Operating System Design and Implementation, 2004.
Deep learning with COTS HPC systems
Dean, J, Corrado, G. S, Monga, R, Chen, K, Devin, M, Le, Q. V, Mao, M. Z, Ranzato, M, Senior, A, Tucker, P, Yang, K, and Ng, A. Y. Large scale distributed deep networks. In Advances in Neural Information Processing Systems 25, 2012.
Garrigues, P and Olshausen, B. Group sparse coding with a laplacian scale mixture prior. In Advances in Neural Information Processing Systems 23, pp. 676-684, 2010.
Glorot, X, Bordes, A, and Bengio, Y. Deep sparse rectifier neural networks. In 14th International Conference on Artificial Intelligence and Statistics, pp. 315-323, 2011.
Hinton, G, Osindero, S, and Teh, Y. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.
Hinton, G. A practical guide to training restricted boltzmann machines. Technical report, University of Toronto, 2010.
Huang, G. B, Ramesh, M, Berg, T, and Learned-Miller, E. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
Hubel, D and Wiesel, T. Receptive fields of single neurones in the cat's striate cortex. The Journal of physiology, 148(3):574-591, 1959.
Hyvarinen, A and Hoyer, P. Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7):1705-1720, 2000.
Hyvarinen, A, Hoyer, P, and Inki, M. Topographic independent component analysis. Neural Computation, 13(7):1527-1558, 2001.
Jarrett, K, Kavukcuoglu, K, Ranzato, M, and LeCun, Y. What is the best multi-stage architecture for object recognition? In 12th International Conference on Computer Vision, pp. 2146-2153, 2009.
Krizhevsky, A. Convolutional Deep Belief Networks on CIFAR-10. Unpublished manuscript, 2010.
Krizhevsky, A, Sutskever, I, and Hinton, G. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1106-1114, 2012.
Le, Q. V, Ngiam, J, Coates, A, Lahiri, A, Prochnow, B, and Ng, A. Y. On optimization methods for deep learning. In International Conference on Machine Learning, 2011.
Le, Q, Ranzato, M, Monga, R, Devin, M, Chen, K, Corrado, G, Dean, J, and Ng, A. Building high-level features using large scale unsupervised learning. In International Conference on Machine Learning, 2012.
LeCun, Y, Boser, B, Denker, J. S, Henderson, D, Howard, R. E, Hubbard, W, and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541-551, 1989.
LeCun, Y, Huang, F. J, and Bottou, L. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition, volume 2, pp. 97-104, 2004.
Martens, J. Deep learning via hessian-free optimization. In 27th International Conference on Machine Learning, volume 951, pp. 2010, 2010.
nVidia CUDA Programming Guide. NVIDIA Corporation, 2701 San Tomas Expressway, Santa Clara, CA 95050. URL http://docs.nvidia.com/.
Raina, R, Madhavan, A, and Ng, A. Large-scale deep unsupervised learning using graphics processors. In 26th International Conference on Machine Learning, 2009.
Ranzato, M, Poultney, C, Chopra, S, and LeCun, Y. Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems, 2007.
Riesenhuber, M and Poggio, T. Hierarchical models of object recognition in cortex. Nature neuroscience, 2, 1999.
Rumelhart, D, Hinton, G, and Williams, R. Learning representations by back-propagating errors. Nature, 323(6088):533-536, 1986.
Tomov, S, Nath, R, Du, P, and Dongarra, J. MAGMA Users Guide. UT Knoxville ICL, 2011. URL http://icl.cs.utk.edu/magma/.
Uetz, R and Behnke, S. Large-scale object recognition with CUDA-accelerated hierarchical neural networks. In Intelligent Computing and Intelligent Systems, 2009.
Wang, H, Potluri, S, Luo, M, Singh, A. K, Sur, S, and Panda, D. K. MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters. Computer Science-Research and Development, 26(3):257-266, 2011.
-----0
Bach, Francis R., Lanckriet, Gert R. G., and Jordan, Michael I. Multiple kernel learning, conic duality, and the smo algorithm. In ICML, 2004.
Bartlett, Peter L. and Mendelson, Shahar. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3, 2002.
Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. Generalization bounds for learning kernels. In ICML, pp. 247-254, 2010.
Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. Tutorial: Learning kernels. In ICML, 2011.
Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 2012.
Crammer, Koby and Singer, Yoram. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265-292, 2001.
Damoulas, Theodoros and Girolami, Mark A. Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection. Bioinformatics, 24(10):1264-1270, 2008.
Gehler, P. and Nowozin, S. On feature combination for multiclass object classification. In International Conference on Computer Vision, pp. 221-228, 2009.
Kloft, M., Brefeld, U., Sonnenburg, S., and Zien, A. Lp-norm multiple kernel learning. Journal of Machine Learning Research, 12, 2011.
Koltchinskii, Vladimir and Panchenko, Dmitry. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.
Kumar, A., Niculescu-Mizil, A., Kavukcuoglu, K., and Daume III, H. A binary classification framework for two stage kernel learning. In ICML, 2012.
Kwapien, Stanislaw and Woyczynski, W Wojbor Andrzej. Random series and stochastic integrals. Birkhauser, 1992.
Lanckriet, Gert R. G., Cristianini, Nello, Bartlett, Peter L., Ghaoui, Laurent El, and Jordan, Michael I. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27-72, 2004.
Ledoux, Michel and Talagrand, Michel. Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York, 1991.
Orabona, F. and Jie, L. Ultra-fast optimization algorithm for sparse multi kernel learning. In Proceedings of the 28th International Conference on Machine Learning, 2011.
Orabona, F., Jie, L., and Caputo, B. Online-batch strongly convex multi kernel learning. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 787-794. IEEE, 2010.
Sonnenburg, S., Ratsch, G., Schafer, C., and Scholkopf, B. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531-1565, 2006.
Tsochantaridis, Ioannis, Hofmann, Thomas, Joachims, Thorsten, and Altun, Yasemin. Support vector machine learning for interdependent and structured output spaces. In ICML 2004, Banff, Canada, 2004.
Weston, Jason and Watkins, Chris. Support vector machines for multi-class pattern recognition. European Symposium on Artificial Neural Networks, 4 (6), 1999.
Zien, Alexander and Ong, Cheng Soon. Multiclass multiple kernel learning. In ICML, pp. 1191-1198, 2007.
-----0
Burges, C. and Scholkopf, B. Improving the accuracy and speed of support vector machines. In NIPS97, pp. 375-381. MIT Press, 1997.
Cavallanti, G., Cesa-Bianchi, N., and Gentile, C. Tracking the best hyperplane with a simple budget perceptron. Machine Learning, 69(2-3), December 2007.
Chang, C-C. and Lin, C-J. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Collobert, Ronan, Sinz, Fabian, Weston, Jason, and Bottou, Leon. Trading convexity for scalability. In ICML06, pp. 201-208, 2006.
Cotter, A., Srebro, N., and Keshet, J. A GPU-tailored approach for training kernelized SVMs. In KDD11, 2011.
Cotter, A., Shalev-Shwartz, S., and Srebro, N. The kernelized stochastic batch perceptron. In ICML12, 2012a.
Cotter, A., Shalev-Shwartz, S., and Srebro, N. The kernelized stochastic batch perceptron. http://arxiv.org/abs/1204.0566, 2012b.
Dekel, O., Shalev-Shwartz, S., and Singer, Y. The forgetron: A kernel-based perceptron on a fixed budget. In NIPS05, pp. 259-266, 2005.
Freund, Y. and Schapire, R. E. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277-296, 1999.
Joachims, T. and Yu, Chun-Nam. Sparse kernel svms via cutting-plane training. Machine Learning, 76(2-3):179-193, 2009. European Conference on Machine Learning (ECML) Special Issue.
Keerthi, S. Sathiya, Chapelle, Olivier, and DeCoste, Dennis. Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research, 7:1493-1515, 2006.
Lee, Y-J. and Mangasarian, O. RSVM: Reduced support vector machines. In Data Mining Institute, Computer Sciences Department, University of Wisconsin, pp. 0007, 2001.
Lin, K-M. and Lin, C-J. A study on reduced support vector machines. IEEE Transactions on Neural Networks, 2003.
Nesterov, Y. Primal-dual subgradient methods for convex problems. Math. Program., 120(1):221-259, Apr 2009.
Nguyen, D D., Matsumoto, K., Takishima, Y., and Hashimoto, K. Condensed vector machines: learning fast machine for large data. Trans. Neur. Netw., 21(12):1903-1914, Dec 2010.
Osuna, E. and Girosi, F. Reducing the run-time complexity of support vector machines, 1998.
Shalev-Shwartz, S. Introduction to machine learning, lecture notes. Technical report, The Hebrew University, 2010. http://www.cs.huji.ac.il/~shais/Handouts2010.pdf.
Srebro, N., Sridharan, K., and Tewari, A. Smoothness, low-noise and fast rates. In NIPS10, 2010.
Wu, M., Scholkopf, B., and Bakir, G. Building sparse large margin classifiers. In ICML05, pp. 996-1003, New York, NY, USA, 8 2005. Max-Planck-Gesellschaft, ACM.
Zhan, Y. and Shen, D. Design efficient support vector machine for fast classification. Pattern Recognition, 38(1):157-161, 2005.
-----0
Abdullah, A., Moeller, J., and Venkatasubramanian, S. Approximate Bregman near neighbors in sublinear time: Beyond the triangle inequality. SCG, 2012.
Amari, S. Information geometry and its applications: convex function and dually flat manifold. Emerging Trends in Visual Computing, 2009.
Banerjee, A., Merugu, S., Dhillon, I.S., and Ghosh, J. Clustering with Bregman divergences. JMLR, 2005.
Barrington, L., Chan, A.B., and Lanckriet, G. Modeling music as a dynamic texture. IEEE TASLP, 2010.
Blei, D.M. and Lafferty, J.D. A correlated topic model of science. The Annals of Applied Statistics, 2007.
Cayton, L. Fast nearest neighbor retrieval for bregman divergences. In ICML, 2008.
Cayton, L. Efficient Bregman range search. NIPS, 2009.
Chan, Antoni B. and Vasconcelos, Nuno. Probabilistic kernels for the classification of auto-regressive visual processes. In IEEE CVPR, 2005.
Coviello, E., Chan, A.B., and Lanckriet, G.R.G. Time series models for semantic music annotation. IEEE TASLP, 2011.
Coviello, E., Mumtaz, A., Chan, A.B., and Lanckriet, G.R.G. Growing a Bag of Systems Tree for fast and accurate classification. In IEEE CVPR, 2012.
Coviello, E., Mumtaz, A., Chan, A.B., and Lanckriet, G.R.G. Supplement to That was fast! Speeding up NN search of high dimensional distributions. 2013.
Doretto, G., Chiuso, A., Wu, Y. N., and Soatto, S. Dynamic textures. Intl. J. Computer Vision, 2003.
Friedman, J.H., Bentley, J.L., and Finkel, R.A. An algorithm for finding best matches in logarithmic expected time. ACM TOMS, 1977.
Hershey, J.R. and Olsen, P.A. Approximating the Kullback-Leibler divergence between Gaussian mixture models. In IEEE ICASSP, 2007.
Moore, A.W. The anchors hierarchy: Using the triangle inequality to survive high dimensional data. In UAI, 2000.
Mu, Y. and Yan, S. Non-metric locality-sensitive hashing. In AAAI CAI, 2010.
Mumtaz, A., Coviello, E., Lanckriet, G., and Chan, A. Clustering Dynamic Textures with the Hierarchical EM Algorithm for Modeling Video. IEEE TPAMI, 2012.
Nielsen, F. and Nock, R. Sided and symmetrized Bregman centroids. IEEE Transactions on IT, 2009.
Nielsen, F., Piro, P., and Barlaud, M. Bregman vantage point trees for efficient nearest neighbor queries. In IEEE ICME, 2009a.
Nielsen, F., Piro, P., Barlaud, M., et al. Tailored Bregman ball trees for effective nearest neighbors. In EuroCG, 2009b.
Omohundro, S.M. Five balltree construction algorithms. International Computer Science Institute, 1989.
Pereira, F., Tishby, N., and Lee, L. Distributional clustering of English words. In Association for Computational Linguistics, 1993.
Puzicha, J., Buhmann, J.M., Rubner, Y., and Tomasi, C. Empirical evaluation of dissimilarity measures for color and texture. In IEEE ICCV, 1999.
Rasiwasia, N., Moreno, P.J., and Vasconcelos, N. Bridging the gap: Query by semantic example. IEEE Transactions on Multimedia, 2007.
Rockafellar, R.T. Convex analysis. Princeton university press, 1996.
Slaney, M. and Casey, M. Locality-sensitive hashing for finding nearest neighbors. IEEE, SPM, 2008.
Uhlmann, J.K. Satisfying general proximity/similarity queries with metric trees. IP letters, 1991.
Wainwright, M.J. and Jordan, M.I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 2008.
Zhang, Z., Ooi, B.C., Parthasarathy, S., and Tung, A.K.H. Similarity search on Bregman divergence: towards non-metric indexing. VLDB E, 2009.
-----0
Agarwal, P.K. and Erickson, J. Geometric range searching and its relatives. Contemporary Mathematics, 223:1-56, 1999.
Bentley, J.L. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509-517, 1975.
Bentley, J.L. and Friedman, J.H. Data structures for range searching. ACM Computing Surveys (CSUR), 11(4):397-409, 1979.
Beygelzimer, A., Kakade, S.M., and Langford, J. Cover trees for nearest neighbor. In Proceedings of the 23rd International Conference on Machine Learning (ICML 06), volume 23, pp. 97, 2006.
Cover, T.M. and Hart, P.E. Nearest neighbor pattern classification. Information Theory, IEEE Transactions on, 13(1):21-27, 1967.
Curtin, R.R., Cline, J.R., Slagle, N.P., Amidon, M.L., and Gray, A.G. MLPACK: a scalable C++ machine learning library. In BigLearning: Algorithms, Systems, and Tools for Learning at Scale, 2011.
Friedman, J.H., Bentley, J.L., and Finkel, R.A. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software (TOMS), 3(3):209-226, 1977.
Fukunaga, K. and Narendra, P.M. A branch and bound algorithm for computing k-nearest neighbors. Computers, IEEE Transactions on, 100(7):750-753, 1975.
Gray, A.G. Bringing Tractability to Generalized N-Body Problems in Statistical and Scientific Computation. PhD thesis, Carnegie Mellon University, 2003.
Gray, A.G. and Moore, A.W. N-Body problems in statistical learning. In Advances in Neural Information Processing Systems 14 (NIPS 2001), volume 4, pp. 521-527, 2001.
Gray, A.G. and Moore, A.W. Rapid evaluation of multiple density models. In Advances in Neural Information Processing Systems 16 (NIPS 2003), volume 2003 of NIPS 03, 2003a.
Gray, A.G. and Moore, A.W. Nonparametric density estimation: Toward computational tractability. In SIAM International Conference on Data Mining (SDM). Citeseer, 2003b.
Holmes, M.P., Gray, A.G., and Isbell Jr., C.L. QUICSVD: Fast SVD using cosine trees. Advances in Neural Information Processing Systems (NIPS), 21, 2008.
Lee, D. and Gray, A.G. Faster Gaussian summation: Theory and experiment. In Proceedings of the Twenty-second Conference on Uncertainty in Artificial Intelligence (UAI), 2006.
Lee, D. and Gray, A.G. Fast high-dimensional kernel summations using the monte carlo multipole method. Advances in Neural Information Processing Systems (NIPS), 21, 2008.
Lee, D., Gray, A.G., and Moore, A.W. Dual-tree fast Gauss transforms. In Advances in Neural Information Processing Systems 18, pp. 747-754. MIT Press, Cambridge, MA, 2006.
Lee, D., Vuduc, R., and Gray, A.G. A distributed kernel summation framework for general-dimension machine learning. In SIAM International Conference on Data Mining, volume 2012, pp. 5, 2012.
March, W.B., Ram, P., and Gray, A.G. Fast Euclidean minimum spanning tree: algorithm, analysis, and applications. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 10), pp. 603-612, 2010.
March, W.B., Connolly, A.J., and Gray, A.G. Fast algorithms for comprehensive n-point correlation estimates. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1478-1486. ACM, 2012.
Ram, P., Lee, D., March, W.B., and Gray, A.G. Linear-time algorithms for pairwise statistical problems. Advances in Neural Information Processing Systems 22 (NIPS 2009), 23, 2009.
Wang, P., Lee, D., Gray, A.G., and Rehg, J.M. Fast mean shift with accurate and stable convergence. In Workshop on Artificial Intelligence and Statistics (AISTATS), 2007.
-----0
Barvinok, A. A course in convexity. American Mathematical Society, 2002.
Ben-Tal, A. and Nemirovski, A. Lectures on modern convex optimization. SIAM, 2001.
Ben-Tal, A., El Ghaoui, L., and Nemirovski, A.S. Robust optimization. Princeton University Press, 2009.
Bewley, R., Orden, D., Yang, M., and Fisher, L.A. Comparison of Box-Tiao and Johansen Canonical Estimators of Cointegrating Vectors in VEC (1) Models. Journal of Econometrics, 64:3-27, 1994.
Box, Gep and Tiao, GC. A canonical analysis of multiple time series. Biometrika, 64(2):355-365, 1977.
Boyd, Stephen and Vandenberghe, Lieven. Convex Optimization. Cambridge University Press, 2004.
Brickman, L. On the field of values of a matrix. Proceedings of the American Mathematical Society, pp. 61-66, 1961.
Cuturi, Marco, Vert, Jean-Philippe, and d'Aspremont, Alexandre. White functionals for anomaly detection in dynamical systems. In Advances in Neural Information Processing Systems 23, pp. 432-440, 2010.
d'Aspremont, Alexandre. Identifying small mean reverting portfolios. Quantitative Finance, 11(3):351-364.
Elie, R. and Espinosa, G.-E. Optimal stopping of a mean reverting diffusion: minimizing the relative distance to the maximum. hal-00573429, 2011.
Engle, Robert F. and Granger, C. W. J. Co-integration and error correction: Representation, estimation, and testing. Econometrica, 55(2):251-276, 1987.
Golub, G.H. and Van Loan, C.F. Matrix computations. Johns Hopkins Univ Pr, 1996.
Hamilton, J.D. Time series analysis, volume 2. Cambridge Univ Press, 1994.
Hara, Satoshi, Kawahara, Yoshinobu, Washio, Takashi, and von Bunau, Paul. Stationary subspace analysis as a generalized eigenvalue problem. In Neural Information Processing. Theory and Algorithms, volume 6443 of Lecture Notes in Computer Science, pp. 422-429. Springer, 2010.
Johansen, S. Cointegration: a survey. Palgrave Handbook of Econometrics, 1, 2005.
Johansen, Soren. Estimation and hypothesis testing of cointegration vectors in gaussian vector autoregressive models. Econometrica, 59(6):155180, November 1991.
Jurek, Jakub W. and Yang, Halla. Dynamic Portfolio Selection in Arbitrage. SSRN eLibrary, 2007. doi: 10.2139/ssrn.882536.
Kedem, B. and Yakowitz, S. Time series analysis by higher order crossings. IEEE press Piscataway, NJ, 1994.
Liu, J. and Timmermann, A. Optimal arbitrage strategies. Technical report, UC San Diego Working Paper, 2010.
Ljung, G.M. and Box, G.E.P. On a measure of lack of fit in time series models. Biometrika, 65(2):297-303, 1978.
Lovasz, L. and Schrijver, A. Cones of matrices and set-functions and 0-1 optimization. SIAM Journal on Optimization, 1(2):166-190, 1991.
Lutkepohl, H. New Introduction to Multiple Time Series Analysis. Springer, 2005.
Maddala, GS and Kim, I.M. Unit roots, cointegration, and structural change. Cambridge Univ Pr, 1998.
Nemirovski, A. Sums of random symmetric matrices and quadratic optimization under orthogonality constraints. Mathematical programming, 109(2):283-317, 2007.
Nemirovski, A., Roos, C., and Terlaky, T. On maximization of quadratic form over intersection of ellipsoids with common center. Mathematical Programming, 86(3):463-473, 1999.
Nesterov, Y. Smoothing technique and its applications in semidefinite optimization. Mathematical Programming, 110(2):245-259, 2007.
Phillips, P.C.B. Fully modified least squares and vector autoregression. Econometrica: Journal of the Econometric Society, pp. 1023-1078, 1995.
So, A.M.C. Moment inequalities for sums of random matrices and their applications in optimization. Mathematical Programming, pp. 2, 2009.
Tsay, Ruey S. Analysis of financial time series, volume 543. Wiley-Interscience, 2005.
von Bunau, Paul, Meinecke, Frank C., Kiraly, Franz C., and Muller, Klaus-Robert. Finding stationary subspaces in multivariate time series. Phys. Rev. Lett., 103:214101, 2009.
Ylvisaker, N Donald. The expected number of zeros of a stationary gaussian process. The Annals of Mathematical Statistics, pp. 1043-1046, 1965.
-----0
Anestis Antoniadis. Comments on: ℓ1-penalization for mixture regression models. TEST, 19(2):257-258, 2010.
Alfred Auslender and Marc Teboulle. Interior gradient and proximal methods for convex and conic optimization. SIAM J. Optim., 16(3):697-725 (electronic), 2006.
Francis Bach. Consistency of the Group Lasso and multiple kernel learning. J. Mach. Learn. Res., 9:1179-1225, 2008.
Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183-202, 2009.
Stephen Becker, Emmanuel J. Candès, and Michael C. Grant. Templates for convex cone problems with applications to sparse signal recovery. Math. Program. Comput., 3(3):165-218, 2011.
Alexandre Belloni, Victor Chernozhukov, and Lie Wang. Square-root Lasso: Pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791-806, 2011.
Peter J. Bickel, Yaacov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist., 37(4):1705-1732, 2009.
Emmanuel Candès and Terence Tao. The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist., 35(6):2313-2351, 2007.
Christophe Chesneau and Mohamed Hebiri. Some theoretical results on the grouped variables Lasso. Math. Methods Statist., 17(4):317-326, 2008.
Arnak Dalalyan and Yin Chen. Fused sparsity and robust estimation for linear models with unknown variance. In NIPS, pages 1268-1276, 2012.
John Daye, Jinbo Chen, and Hongzhe Li. High-dimensional heteroscedastic regression with an application to eQTL data analysis. Biometrics, 68(1):316-326, 2012.
Eric Gautier and Alexandre Tsybakov. High-dimensional instrumental variables regression and confidence sets. Technical Report arxiv:1105.2454, September 2011.
Jian Huang, Patrick Breheny, and Shuangge Ma. A selective review of group selection in high dimensional models. Statist. Sci., 27(4):481-499, 2012.
Junzhou Huang and Tong Zhang. The benefit of group sparsity. Ann. Statist., 38(4):1978-2004, 2010.
Mladen Kolar and James Sharpnack. Variance function estimation in high-dimensions. In Proceedings of the ICML-12, pages 1447-1454, 2012.
Vladimir Koltchinskii and Ming Yuan. Sparsity in multiple kernel learning. Ann. Statist., 38(6):3660-3695, 2010.
Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Ann. Statist., 28(5):1302-1338, 2000.
Yi Lin and Hao Helen Zhang. Component selection and smoothing in multivariate nonparametric regression. Ann. Statist., 34(5):2272-2297, 2006.
Han Liu, Jian Zhang, Xiaoye Jiang, and Jun Liu. The group Dantzig selector. J. Mach. Learn. Res. Proc. Track, 9:461-468, 2010.
Karim Lounici, Massimiliano Pontil, Sara van de Geer, and Alexandre B. Tsybakov. Oracle inequalities and optimal inference under group sparsity. Ann. Statist., 39(4):2164-2204, 2011.
Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, and Francis Bach. Convex and network flow optimization for structured sparsity. J. Mach. Learn. Res., 12:2681-2720, 2011.
Lukas Meier, Sara van de Geer, and Peter Buhlmann. High-dimensional additive modeling. Ann. Statist., 37(6B):3779-3821, 2009.
Yuval Nardi and Alessandro Rinaldo. On the asymptotic properties of the group Lasso estimator for linear models. Electron. J. Stat., 2:605-633, 2008.
Yurii E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). Dokl. Akad. Nauk SSSR, 269(3):543-547, 1983.
Garvesh Raskutti, Martin J. Wainwright, and Bin Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. J. Mach. Learn. Res., 13:389-427, 2012.
Pradeep Ravikumar, John Lafferty, Han Liu, and Larry Wasserman. Sparse additive models. J. R. Stat. Soc. Ser. B Stat. Methodol., 71(5):1009-1030, 2009.
Noah Simon and Robert Tibshirani. Standardization and the Group Lasso penalty. Stat. Sin., 22(3):983-1001, 2012.
Nicolas Stadler, Peter Buhlmann, and Sara van de Geer. ℓ1-penalization for mixture regression models. TEST, 19(2):209-256, 2010.
Jos F. Sturm. Using sedumi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11-12:625-653, 1999.
Tingni Sun and Cun-Hui Zhang. Comments on: ℓ1-penalization for mixture regression models. TEST, 19(2):270-275, 2010.
Tingni Sun and Cun-Hui Zhang. Scaled sparse linear regression. Biometrika, 99(4):879-898, 2012.
Robert Tibshirani. Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267-288, 1996.
Jens Wagener and Holger Dette. Bridge estimators and the adaptive Lasso under heteroscedasticity. Mathematical Methods of Statistics, 21:109-126, 2012.
Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol., 68(1):49-67, 2006.
-----0
Apel, Sven, Kastner, Christian, and Lengauer, Christian. FEATUREHOUSE: Language-Independent, Automated Software Composition. In Proceedings of the 31st International Conference on Software Engineering (ICSE), pp. 221-231, 2009.
Baldi, Pierre F., Lopes, Cristina V., Linstead, Erik J., and Bajracharya, Sushil K. A theory of aspects as latent topics. In Proceedings of the 23rd ACM SIGPLAN conference on Object-oriented programming systems languages and applications (OOPSLA), pp. 543-562, 2008.
Blei, David M., Ng, Andrew Y., and Jordan, Michael I. Latent Dirichlet Allocation. The Journal of Machine Learning Research (JMLR), 3:993-1022, 2003.
Doshi-Velez, Finale, Miller, Kurt T., Gael, Jurgen Van, and Teh, Yee Whye. Variational Inference for the Indian Buffet Process. In Proceedings of the Intl. Conf. on Artificial Intelligence and Statistics (AISTATS), pp. 137-144, 2009.
Eaddy, M., Aho, A. V., Antoniol, G., and Gueheneuc, Y. G. CERBERUS: Tracing Requirements to Source Code Using Information Retrieval, Dynamic Analysis, and Program Analysis. In International Conference on Program Comprehension (ICPC), pp. 53-62, 2008.
Globerson, A., Chechik, G., Pereira, F., and Tishby, N. Euclidean Embedding of Co-occurrence Data. The Journal of Machine Learning Research (JMLR), 8:2265-2295, 2007.
Griffiths, T. L., Ghahramani, Z., and Sollich, Peter. Bayesian Nonparametric Latent Feature Models. In Bayesian Statistics, pp. 201-225, 2007.
Ishwaran, H. and James, L. F. Gibbs Sampling Methods for Stick-Breaking Priors. Journal of the American Statistical Association, 96:161-173, 2001.
Kastner, Christian, Apel, Sven, and Batory, Don. A Case Study Implementing Features Using AspectJ. In Proceedings of the 11th International Software Product Line Conference (SPLC), pp. 223-232, 2007.
Marin, Marius, Deursen, Arie Van, and Moonen, Leon. Identifying crosscutting concerns using fan-in analysis. ACM Transactions on Software Engineering and Methodology (TOSEM), 17, 2007.
Mimno, David, Wallach, Hanna, Talley, Edmund, Leenders, Miriam, and McCallum, Andrew. Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 262-272, 2011.
Revelle, Meghan, Dit, Bogdan, and Poshyvanyk, Denys. Using data fusion and web mining to support feature location in software. In Proceedings of the 2010 IEEE 18th International Conference on Program Comprehension (ICPC), pp. 14-23, 2010.
Robillard, M. P. Topology Analysis of Software Dependencies. ACM Transactions on Software Engineering and Methodology (TOSEM), (4), 2008.
Savage, T., Revelle, M., and Poshyvanyk, D. FLAT3: Feature Location and Textual Tracing Tool. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering (ICSE) Volume 2, pp. 255-258, 2010a.
Savage, Trevor, Dit, Bogdan, Gethers, Malcom, and Poshyvanyk, Denys. TopicXP: Exploring topics in source code using Latent Dirichlet Allocation. In Proceedings of the IEEE International Conference on Software Maintenance (ICSM), pp. 1-6, 2010b.
Teh, Y., Jordan, M. I., and Beal, M. Hierarchical Dirichlet processes. Journal of American Statistical Association, pp. 1566-1581, 2007.
Titov, Ivan and McDonald, Ryan. Modeling Online Reviews with Multi-grain Topic Models. In Proceedings of the 17th International Conference on World Wide Web (WWW), pp. 111-120, 2008.
Wen, Bo and Boahen, Kwabena. Active Bidirectional Coupling in a Cochlear Chip. In Advances in Neural Information Processing Systems (NIPS), 2005.
Williamson, Sinead, Wang, Chong, Heller, Katherine A., and Blei, David M. The IBP Compound Dirichlet Process and its Application to Focused Topic Modeling. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.
Wong, T. T. Generalized Dirichlet distribution in Bayesian analysis. Applied Mathematics and Computation, 97:165-181, 1998.
-----0
Abernethy, J., Hazan, E., and Rakhlin, A. Interior-point methods for full-information and bandit online learning. IEEE Transactions on Information Theory, 58(7):4164-4175, 2012.
Alfonsin, J. Ramirez. The Diophantine Frobenius problem. Oxford University Press, 2005.
Arora, R., Dekel, O., and Tewari, A. Deterministic MDPs with adversarial rewards and bandit feedback. In Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, pp. 93-101, 2012.
Audibert, Jean-Yves, Bubeck, Sebastien, and Lugosi, Gabor. Minimax policies for combinatorial prediction games. Journal of Machine Learning Research Proceedings Track, 19:107-132, 2011.
Bertsekas, D. P. Dynamic Programming and Optimal Control. Athena Scientific, Third edition, 2005.
Bremaud, P. Markov chains: Gibbs fields, Monte Carlo simulation and queues. Springer, 1999.
Dani, Varsha, Hayes, Thomas P., and Kakade, Sham. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems 20, 2007.
Denardo, E. V. Periods of connected networks and powers of nonnegative matrices. Mathematics of Operations Research, 2(1):20-24, 1977.
Even-Dar, E., Kakade, S., and Mansour, Y. Online Markov decision processes. Mathematics of Operations Research, 34(3):726-736, 2009.
Farias, V. F., Moallemi, C. C., Roy, B. Van, and Weissman, T. Universal reinforcement learning. IEEE Transactions on Information Theory, 56(5):2441-2454, 2010.
Feinberg, E. A. and Yang, F. On polynomial cases of the unichain classification problem for Markov decision processes. Operations Research Letters, 36(5):527-530, 2008.
Neu, G., Gyorgy, A., Szepesvari, C., and Antos, A. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems 23, pp. 1804-1812, 2010.
Ortner, R. Online regret bounds for Markov decision processes with deterministic transitions. Theoretical Computer Science, 411(29-30):2684-2695, 2010.
Szepesvari, C. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1), 2010.
Takimoto, Eiji and Warmuth, Manfred K. Path kernels and multiplicative updates. Journal of Machine Learning Research, 4:773-818, 2003.
Yu, J. Y., Mannor, S., and Shimkin, N. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737-757, 2009.
Yu, Jia Yuan and Mannor, Shie. Arbitrarily modulated Markov decision processes. In Proceedings of the 48th IEEE Conference on Decision and Control, pp. 2946-2953, 2009.
-----0
Bartlett, P., Jordan, M., and McAuliffe, J. Convexity, classification and risk bounds. Journal of the American Statistical Association, 101:138-156, 2006.
Chai, K. Expectation of F-measures: Tractable exact computation and some empirical observations of its properties. In SIGIR, 2005.
Cover, T. M. and Thomas, J. A. Elements of Information Theory. John Wiley, 1991.
Dembczynski, K., Waegeman, W., Cheng, W., and Hullermeier, E. An exact algorithm for F-measure maximization. In Advances in Neural Information Processing Systems, volume 25, 2011.
Gao, W. and Zhou, Z. On the consistency of multi-label learning. In COLT, 2011.
Hariharan, B., Vishwanathan, S. V. N., and Varma, M. Efficient max-margin multi-label classification with applications to zero-shot learning. Machine Learning Journal, 88(1):127-155, 2012.
Jansche, M. A maximum expected utility framework for binary sequence labeling. In ACL 2007, pp. 736-743, 2007.
Kelley, J. E. The cutting-plane method for solving convex programs. Journal of the Society for Industrial and Applied Mathematics, 8:704-712, 1960.
Lewis, D. Evaluating and optimizing autonomous text classification systems. In SIGIR 1995, pp. 246-254, 1995.
McCallum, A. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
Petterson, J. and Caetano, T. S. Reverse multi-label learning. In Advances in Neural Information Processing Systems 24, pp. 1912-1920, 2010.
Petterson, J. and Caetano, T. S. Submodular multi-label learning. In Advances in Neural Information Processing Systems 24, pp. 1512-1520, 2011.
Quevedo, J., Luaces, O., and Bahamonde, A. Multilabel classifiers with a probabilistic thresholding strategy. Pattern Recognition, 45, 2012.
Tewari, A. and Bartlett, P. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007-1025, May 2007.
Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. J. Mach. Learn. Res., 6:1453-1484, 2005.
Ye, N., Chai, K., Lee, W., and Chieu, H. Optimizing F-measures: a tale of two approaches. In ICML, 2012.
-----0
H. Abdulsalam. Streaming Random Forests. PhD thesis, Queen's University, 2008.
G. Biau. Analysis of a Random Forests model. JMLR, 13(April):1063-1095, 2012.
G. Biau, L. Devroye, and G. Lugosi. Consistency of random forests and other averaging classifiers. JMLR, 9:2015-2033, 2008.
A. Bifet, G. Holmes, and B. Pfahringer. MOA: Massive Online Analysis, a framework for stream classification and clustering. In Workshop on Applications of Pattern Analysis, pp. 3-16, 2010.
A. Bifet, E. Frank, G. Holmes, and B. Pfahringer. Ensembles of Restricted Hoeffding Trees. ACM Transactions on Intelligent Systems and Technology, 3(2):1-20, February 2012.
A. Bifet, G. Holmes, and B. Pfahringer. New ensemble methods for evolving data streams. In ACM SIGKDD Intl. Conference on Knowledge Discovery and Data Mining, 2009.
A. Bosch, A. Zisserman, and X. Munoz. Image classification using random forests and ferns. In International Conference on Computer Vision, pp. 1-8, 2007.
L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
L. Breiman. Consistency for a Simple Model of Random Forests. Technical report, University of California at Berkeley, 2004.
L. Breiman, J. Friedman, C. Stone, and R. Olshen. Classification and Regression Trees. CRC Press LLC, Boca Raton, Florida, 1984.
C. Chang and C. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.
A. Criminisi, J. Shotton, and E. Konukoglu. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2-3):81-227, 2011.
D. Cutler, T. Edwards, and K. Beard. Random forests for classification in ecology. Ecology, 88(11):2783-92, November 2007.
L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, USA, 1996.
P. Domingos and G. Hulten. Mining high-speed data streams. In International Conference on Knowledge Discovery and Data Mining, pp. 71-80. ACM, 2000.
J. Gama, P. Medas, and P. Rodrigues. Learning decision trees from dynamic data streams. In ACM symposium on Applied computing, SAC '05, pp. 573-577, New York, NY, USA, 2005. ACM.
R. Genuer. Risk bounds for purely uniformly random forests. Technical report, Institut National de Recherche en Informatique et en Automatique, 2010.
R. Genuer. Variance reduction in purely random forests. Journal of Nonparametric Statistics, 24(3):543-562, 2012.
H. Ishwaran and U. Kogalur. Consistency of random survival forests. Statistics and Probability Letters, 80:1056-1064, 2010.
Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Technical Report 1055, University of Wisconsin, 2002.
N. Meinshausen. Quantile regression forests. JMLR, 7:983-999, 2006.
N. Oza and S. Russell. Online Bagging and Boosting. In Artificial Intelligence and Statistics, volume 3, 2001.
A. Prasad, L. Iverson, and A. Liaw. Newer Classification and Regression Tree Techniques: Bagging and Random Forests for Ecological Prediction. Ecosystems, 9(2):181-199, March 2006. ISSN 1432-9840.
A. Saffari, C. Leistner, J. Santner, M. Godec, and H. Bischof. On-line random forests. In International Conference on Computer Vision Workshops (ICCV Workshops), pp. 1393-1400. IEEE, 2009.
J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. CVPR, pp. 1297-1304, 2011.
V. Svetnik, A. Liaw, C. Tong, J. Culberson, R. Sheridan, and B. Feuston. Random forest: a classification and regression tool for compound classification and QSAR modeling. Journal of Chemical Information and Computer Sciences, 43(6):1947-58, 2003.
-----0
Agrawal, Shipra and Goyal, Navin. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT 2012, 2012.
Araya, M., Thomas, V., Buet, O., et al. Near-optimal BRL using optimistic local transitions. In ICML, 2012.
Bertsekas, Dimitri P. Nonlinear Programming. Athena Scientific, 1999.
Bertsekas, Dimitri P. Rollout algorithms for constrained dynamic programming. Technical Report LIDS 2646, Dept. of Electrical Engineering and Computer Science, M.I.T., Cambridge, Mass., 2006.
Bertsekas, Dimitri P. and Tsitsiklis, John N. Neuro-Dynamic Programming. Athena Scientific, 1996.
Castro, Pablo Samuel and Precup, Doina. Using linear programming for Bayesian exploration in Markov decision processes. In Veloso, Manuela M. (ed.), IJCAI, pp. 2437-2442, 2007.
Csillery, K., Blum, M.G.B., Gaggiotti, O.E., Francois, O., et al. Approximate Bayesian computation (ABC) in practice. Trends in Ecology & Evolution, 25(7):410-418, 2010.
Dean, Thomas A and Singh, Sumeetpal S. Asymptotic behaviour of approximate Bayesian estimators. arXiv preprint arXiv:1105.3655, 2011.
DeGroot, Morris H. Optimal Statistical Decisions. John Wiley & Sons, 1970.
Dimitrakakis, Christos. Robust Bayesian reinforcement learning through tight lower bounds. In European Workshop on Reinforcement Learning (EWRL 2011), number 7188 in LNCS, pp. 177-188, 2011.
Dimitrakakis, Christos and Lagoudakis, Michail G. Rollout sampling approximate policy iteration. Machine Learning, 72(3):157-171, September 2008. doi: 10.1007/s10994-008-5069-3. Presented at ECML'08.
Dimitrakakis, Christos, Tziortziotis, Nikolaos, and Tossou, Aristide. Beliefbox: A framework for statistical methods in sequential decision making. http://code.google.com/p/beliefbox/, 2007.
Du, Michael O'Gordon. Optimal Learning Computational Procedures for Bayes-adaptive Markov Decision Processes. PhD thesis, University of Massachusetts at Amherst, 2002.
Dwork, Cynthia and Lei, Jing. Differential privacy and robust statistics. In Proceedings of the 41st annual ACM symposium on Theory of computing, pp. 371-380. ACM, 2009.
Ernst, Damien, Geurts, Pierre, and Wehenkel, Louis. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503-556, 2005.
Gabillon, Victor, Lazaric, Alessandro, Ghavamzadeh, Mohammad, and Scherrer, Bruno. Classification-based policy iteration with a critic. In ICML 2011, 2011.
Geweke, J. Using simulation methods for Bayesian econometric models: inference, development, and communication. Econometric Reviews, 18(1):1-73, 1999.
Hoeffding, Wassily. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, March 1963.
Jasra, Ajay, Singh, Sumeetpal S, Martin, James S, and McCoy, Emma. Filtering via approximate Bayesian computation. Stat. Comput, 2010.
Kaufmann, Emilie, Korda, Nathaniel, and Munos, Remi. Thompson sampling: An optimal finite time analysis. In ALT-2012, 2012.
Kolter, J. Zico and Ng, Andrew Y. Near-Bayesian exploration in polynomial time. In ICML 2009, 2009.
Lagoudakis, M. and Parr, R. Reinforcement learning as classification: Leveraging modern classifiers. In ICML, pp. 424, 2003a.
Lagoudakis, M.G. and Parr, R. Least-squares policy iteration. The Journal of Machine Learning Research, 4:1107-1149, 2003b.
Marin, J.M., Pudlo, P., Robert, C.P., and Ryder, R.J. Approximate Bayesian computational methods. Statistics and Computing, pp. 1-14, 2011.
Poupart, P., Vlassis, N., Hoey, J., and Regan, K. An analytic solution to discrete Bayesian reinforcement learning. In ICML 2006, pp. 697-704. ACM Press New York, NY, USA, 2006.
Poupart, Pascal and Vlassis, Nikos. Model-based Bayesian reinforcement learning in partially observable domains. In International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2008.
Puterman, Martin L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New Jersey, US, 1994.
Ross, Stephane, Chaib-draa, Brahim, and Pineau, Joelle. Bayes-adaptive POMDPs. In Platt, J.C., Koller, D., Singer, Y., and Roweis, S. (eds.), Advances in Neural Information Processing Systems 20, Cambridge, MA, 2008. MIT Press.
Savage, Leonard J. The Foundations of Statistics. Dover Publications, 1972.
Strens, Malcolm. A Bayesian framework for reinforcement learning. In ICML 2000, pp. 943-950, 2000.
Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press, 1998.
Thompson, W.R. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of two Samples. Biometrika, 25(3-4):285-294, 1933.
Toni, T., Welch, D., Strelkowa, N., Ipsen, A., and Stumpf, M.P.H. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface, 6(31):187-202, 2009.
Vlassis, N., Ghavamzadeh, M., Mannor, S., and Poupart, P. Reinforcement Learning, chapter Bayesian Reinforcement Learning, pp. 359-386. Springer, 2012.
Wu, Feng, Zilberstein, Shlomo, and Chen, Xiaoping. Rollout sampling policy iteration for decentralized POMDPs. In The 26th conference on Uncertainty in Artificial Intelligence (UAI 2010), Catalina Island, CA, USA, July 2010.
-----0
Anandkumar, A., Foster, D., Hsu, D., Kakade, S., and Liu, Y. K. A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 25, pp. 926-934, Lake Tahoe, NV, Dec. 2012.
Arora, S., Ge, R., and Moitra, A. Learning topic models - going beyond SVD. In 53rd IEEE Annu. Symp. Foundations of Computer Science, pp. 1-10, New Brunswick, NJ, Oct. 2012.
Arora, S., Ge, R., Halpern, Y., Mimno, D., Moitra, A., Sontag, D., Wu, Y., and Zhu, Michael. A practical algorithm for topic modeling with provable guarantees. In the 30th Int. Conf. on Machine Learning, Atlanta, GA, Jun. 2013.
Blei, D. Probabilistic topic models. Commun. of the ACM, 55(4):77-84, 2012.
Blei, D. and Lafferty, J. A correlated topic model of science. The Ann. of Applied Statistics, 1(1):17-35, 2007.
Blei, D., Ng, A., and Jordan, M. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993-1022, Mar. 2003.
Cichocki, A., Zdunek, R., Phan, A. H., and Amari, S. Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. Wiley, 2009.
Donoho, D. and Stodden, V. When does non-negative matrix factorization give a correct decomposition into parts? In Advances in Neural Information Processing Systems 16, pp. 1141-1148, Cambridge, MA, 2004. MIT Press.
Frank, A. and Asuncion, A. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml.
Griffiths, T. and Steyvers, M. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228-5235, Apr. 2004.
Lee, D. and Seung, H. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788-791, Oct. 1999.
Li, W. and McCallum, A. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proc. the 23rd Int. Conf. on Machine learning, pp. 577-584, Pittsburgh, PA, Jun. 2006.
Recht, B., Re, C., Tropp, J., and Bittorf, V. Factoring nonnegative matrices with linear programs. In Advances in Neural Information Processing Systems 25, pp. 1223-1231, Lake Tahoe, NV, Dec. 2012.
Tan, V. Y. F. and Fevotte, C. Automatic relevance determination in nonnegative matrix factorization with the beta-divergence. IEEE Transactions on Pattern Analysis and Machine Intelligence, in press, 2013.
Vavasis, S. On the complexity of nonnegative matrix factorization. SIAM J. on Optimization, 20(3):1364-1377, Oct. 2009.
-----0
Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. Choosing multiple parameters for support vector machines. Machine Learning, 46, 2002.
Cristianini, N. and Shawe-Taylor, J. An introduction to Support Vector Machines. Cambridge Uni. Press, 2000.
Do, H., Kalousis, A., Woznica, A., and Hilario, M. Margin and radius based multiple kernel learning. In European Conference on Machine Learning (ECML), pp. 330-343, 2009a.
Do, H., Kalousis, A., Wang, J., and Woznica, A. A metric learning perspective of SVM: on the relation of SVM and LMNN. In Journal of Machine Learning Research W&C Proceedings AI and Statistics (AISTATS), 2012.
Do, Huyen, Kalousis, Alexandros, and Hilario, Melanie. Feature weighting using margin and radius based error bound optimization in svms. In European Conference on Machine Learning (ECML), 2009b.
Goldberger, J., Roweis, S., Hinton, G., and Salakhutdinov, R. Neighbourhood components analysis. In Neural Information Processing Systems (NIPS), volume 17. MIT Press, 2005.
Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. Gene selection for cancer classification using support vector machine. Machine Learning, 46:389-422, 2002.
Kalousis, A., Prados, J., and Hilario, M. Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl. Inf. Syst, 12:95-116, 2007.
Pekalska, E. and Duin, R. The Dissimilarity Representation for Pattern Recognition. Foundations and Applications. World Scientific Publishing Company, 2005.
Rakotomamonjy, A. Variable selection using SVM-based criteria. Journal of Machine Learning Research, 3, 2003.
Ratsch, G., Onoda, T., and Muller, K. R. Soft margins for adaboost. Machine Learning, 2001.
Shivaswamy, P. K. and Jebara, T. Maximum relative margin and data-dependent regularization. Journal of Machine Learning Research, 11, 2010.
Torresani, L. and Lee, K. Large margin component analysis. In Neural Information Processing Systems (NIPS), 2006.
Vapnik, V. Statistical learning theory. Wiley InterSc, 1998.
Wang, L., Zhu, J., and Zou, H. The doubly regularized svm. Statistica Sinica, 2006.
Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, J., and Vapnik, V. Feature selection for svms. Neural Information Processing Systems (NIPS), 2000.
Ye, G., Chen, Y., and Chen, Y. Efficient variable selection in SVMs via the alternating direction method of multipliers. In AI and Statistics, 2011.
Zhu, J., Rosset, S., Hastie, T., and Tibshirani, R. 1-norm support vector machine. In Neural Information Processing Systems (NIPS), 2003.
Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B, 2005.
-----0
Abramowitz, M. and Stegun, I.A. Handbook of Mathematical Functions. Dover, New York, 10th edition, 1972.
Achlioptas, D. Database-friendly Random Projections: Johnson-Lindenstrauss with Binary Coins. J. Computer and System Sciences, 66(4):671-687, 2003.
Ailon, N. and Chazelle, B. Approximate Nearest Neighbors and the Fast Johnson-Lindenstrauss Transform. In Proc. 38th Annual ACM Symposium on Theory of Computing (STOC 2006), pp. 557-563. ACM, 2006.
Anthony, M. and Bartlett, P.L. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
Arriaga, R.I. and Vempala, S. An Algorithmic Theory of Learning: Robust Concepts and Random Projection. In 40th Annual Symposium on Foundations of Computer Science (FOCS 1999), pp. 616-623. IEEE, 1999.
Ball, K. An Elementary Introduction to Modern Convex Geometry. Flavors of Geometry, 31:1-58, 1997.
Bartlett, P.L. and Mendelson, S. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. J. Machine Learning Research, 3:463-482, 2002.
Boyali, A. and Kavakli, M. A Robust Gesture Recognition Algorithm based on Sparse Representation, Random Projections and Compressed Sensing. In 7th IEEE Conference on Industrial Electronics and Applications (ICIEA 2012), pp. 243-249, July 2012.
Calderbank, R., Jafarpour, S., and Schapire, R. Compressed Learning: Universal Sparse Dimensionality Reduction and Learning in the Measurement Domain. Technical Report, Rice University, 2009.
Cristianini, N. and Shawe-Taylor, J. An Introduction to Support Vector Machines and other Kernel-based Learning Methods. Cambridge University Press, 2000.
Dasgupta, S. and Gupta, A. An Elementary Proof of the Johnson-Lindenstrauss Lemma. Random Structures & Algorithms, 22:60-65, 2002.
Davenport, M.A., Boufounos, P.T., Wakin, M.B., and Baraniuk, R.G. Signal Processing with Compressive Measurements. IEEE J. Selected Topics in Signal Processing, 4(2):445-460, April 2010.
Durrant, R.J. Learning in High Dimensions with Projected Linear Discriminants. PhD thesis, School of Computer Science, University of Birmingham, January 2013.
Durrant, R.J. and Kaban, A. Compressed Fisher Linear Discriminant Analysis: Classification of Randomly Projected Data. In Proc. 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2010), 2010.
Durrant, R.J. and Kaban, A. A Tight Bound on the Performance of Fisher's Linear Discriminant in Randomly Projected Data Spaces. Pattern Recognition Letters, 33(7):911-919, 2011.
Fard, M., Grinberg, Y., Pineau, J., and Precup, D. Compressed Least-squares Regression on Sparse Spaces. In Proc. 26th AAAI Conference on Artificial Intelligence (AAAI 2012), 2012.
Fodor, I.K. A Survey of Dimension Reduction Techniques. Technical Report UCRL-ID-148494, US Dept. of Energy, Lawrence Livermore National Laboratory, 2002.
Garg, A. and Roth, D. Margin Distribution and Learning Algorithms. In Proc. 20th International Conference on Machine Learning (ICML 2003), pp. 210-217, 2003.
Garg, A., Har-Peled, S., and Roth, D. On Generalization Bounds, Projection Profile, and Margin Distribution. In Proc. 19th International Conference on Machine Learning (ICML 2002), pp. 171-178, 2002.
Goemans, M.X. and Williamson, D.P. Improved Approximation Algorithms for Maximum Cut and Satisfiability Problems using Semidefinite Programming. Journal of the ACM, 42(6):1145, 1995.
Herbrich, R. Learning Kernel Classifiers: Theory and Algorithms. The MIT Press, 2002.
Kendall, M.G. A Course in the Geometry of n Dimensions. Dover, New York, 2004.
Mahoney, M.W. Randomized Algorithms for Matrices and Data. arXiv preprint arXiv:1104.5557, 2011.
Maillard, O. and Munos, R. Linear Regression with Random Projections. J. Machine Learning Research, 13:2735-2772, 2012.
Mardia, K.V., Kent, J.T., and Bibby, J.M. Multivariate Analysis. Academic Press, London, 1979.
Matoušek, J. On Variants of the Johnson-Lindenstrauss Lemma. Random Structures & Algorithms, 33(2):142-156, 2008.
Paul, S., Boutsidis, C., Magdon-Ismail, M., and Drineas, P. Random Projections for Support Vector Machines. arXiv preprint arXiv:1211.6085, 2012.
Pillai, J.K., Patel, V.M., Chellappa, R., and Ratha, N.K. Secure and Robust Iris Recognition using Random Projections and Sparse Representations. IEEE Trans. Pattern Analysis and Machine Intelligence, 33(9):1877-1893, 2011.
Shawe-Taylor, J. Classification Accuracy based on Observed Margin. Algorithmica, 22(1-2):157-172, 1998.
Shawe-Taylor, J. and Cristianini, N. Further Results on the Margin Distribution. In Proc. 12th Annual Conference on Computational Learning Theory (COLT 1999), pp. 278-285. ACM, 1999.
Siegel, A. Toward a Usable Theory of Chernoff bounds for Heterogeneous and Partially Dependent Random Variables. Technical Report, New York University, 1995.
Vapnik, V.N. An Overview of Statistical Learning Theory. IEEE Trans. Neural Networks, 10(5):988-999, 1999.
-----0
Bach, F. Exploring large feature spaces with hierarchical multiple kernel learning. In Advances in Neural Information Processing Systems, pp. 105-112, 2009.
Bing, W., Wen-qiong, Z., Ling, C., and Jia-hong, L. A GP-based kernel construction and optimization method for RVM. In International Conference on Computer and Automation Engineering (ICCAE), volume 4, pp. 419-423, 2010.
Box, G.E.P., Jenkins, G.M., and Reinsel, G.C. Time series analysis: forecasting and control. 1976.
Christoudias, M., Urtasun, R., and Darrell, T. Bayesian localized multiple kernel learning. Technical report, EECS Department, University of California, Berkeley, 2009.
Diosan, L., Rogozan, A., and Pecuchet, J.P. Evolving kernel functions for SVMs by genetic programming. In Machine Learning and Applications, 2007, pp. 19-24. IEEE, 2007.
Duvenaud, D., Nickisch, H., and Rasmussen, C.E. Additive Gaussian processes. In Advances in Neural Information Processing Systems, 2011.
Grosse, R.B., Salakhutdinov, R., Freeman, W.T., and Tenenbaum, J.B. Exploiting compositionality to explore a large space of model structures. In Uncertainty in Artificial Intelligence, 2012.
Gu, C. Smoothing spline ANOVA models. Springer Verlag, 2002. ISBN 0387953531.
Hastie, T.J. and Tibshirani, R.J. Generalized additive models. Chapman & Hall/CRC, 1990.
Jaynes, E. T. Highly informative priors. In Proceedings of the Second International Meeting on Bayesian Statistics, 1985.
Kemp, C. and Tenenbaum, J.B. The discovery of structural form. Proceedings of the National Academy of Sciences, 105(31):10687-10692, 2008.
Lawrence, N. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. The Journal of Machine Learning Research, 6:1783-1816, 2005.
Lean, J., Beer, J., and Bradley, R. Reconstruction of solar irradiance since 1610: Implications for climate change. Geophysical Research Letters, 22(23):3195-3198, 1995.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541-551, 1989.
Plate, T.A. Accuracy versus interpretability in flexible modeling: Implementing a tradeoff using Gaussian process models. Behaviormetrika, 26:29-50, 1999. ISSN 0385-7417.
Poon, H. and Domingos, P. Sum-product networks: a new deep architecture. In Conference on Uncertainty in AI, 2011.
Rasmussen, C.E. and Ghahramani, Z. Occam's razor. In Advances in Neural Information Processing Systems, 2001.
Rasmussen, C.E. and Williams, C.K.I. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, USA, 2006.
Ruppert, D., Wand, M.P., and Carroll, R.J. Semiparametric regression, volume 12. Cambridge University Press, 2003.
Salakhutdinov, R. and Hinton, G. Using deep belief nets to learn covariance kernels for Gaussian processes. Advances in Neural Information Processing Systems, 20:1249-1256, 2008.
Schmidt, M. and Lipson, H. Distilling free-form natural laws from experimental data. Science, 324(5923):81-85, 2009.
Schwarz, G. Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464, 1978.
Todorovski, L. and Dzeroski, S. Declarative bias in equation discovery. In International Conference on Machine Learning, pp. 376-384, 1997.
Wahba, G. Spline models for observational data. Society for Industrial Mathematics, 1990. ISBN 0898712440.
Washio, T., Motoda, H., Niwa, Y., et al. Discovering admissible model equations from observed data based on scale-types and identity constraints. In International Joint Conference On Artificial Intelligence, volume 16, pp. 772-779, 1999.
Wilson, Andrew Gordon and Adams, Ryan Prescott. Gaussian process covariance kernels for pattern discovery and extrapolation. Technical Report arXiv:1302.4245 [stat.ML], 2013.
-----0
Blei, D. and Lafferty, J. Correlated topic models. In Advances in Neural Information Processing Systems, 2006.
Braun, M. and McAuliffe, J. Variational inference for large-scale models of discrete choice. Journal of the American Statistical Association, 105(489):324-335, 2010.
Challis, E. and Barber, D. Concave Gaussian variational approximations for inference in large-scale Bayesian linear models. In International conference on Artificial Intelligence and Statistics, volume 6, pp. 7, 2011.
Girolami, M. and Rogers, S. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8):1790-1817, 2006.
Honkela, A., Raiko, T., Kuusela, M., Tornio, M., and Karhunen, J. Approximate Riemannian conjugate gradient learning for fixed-form variational Bayes. Journal of Machine Learning Research, 11:3235-3268, 2011.
Jojic, Vladimir, Gould, Stephen, and Koller, Daphne. Accelerated dual decomposition for MAP inference. In International Conference on Machine Learning, 2010.
Khan, Mohammad Emtiyaz. Variational Learning for Latent Gaussian Models of Discrete Data. PhD thesis, University of British Columbia, 2012.
Khan, Mohammad Emtiyaz, Mohamed, Shakir, Marlin, Benjamin, and Murphy, Kevin. A stick breaking likelihood for categorical data analysis with latent Gaussian models. In International conference on Artificial Intelligence and Statistics, 2012a.
Khan, Mohammad Emtiyaz, Mohamed, Shakir, and Murphy, Kevin. Fast Bayesian inference for non-conjugate Gaussian process regression. In Advances in Neural Information Processing Systems, 2012b.
Knowles, D. and Minka, T. Non-conjugate variational message passing for multinomial and binary regression. In Advances in Neural Information Processing Systems, 2011.
Lazaro-Gredilla, M. and Titsias, M. Variational heteroscedastic Gaussian process regression. In International Conference on Machine Learning 28, 2011.
Marlin, B., Khan, M., and Murphy, K. Piecewise bounds for estimating Bernoulli-logistic latent Gaussian models. In International Conference on Machine Learning, 2011.
Minka, T. Expectation propagation for approximate Bayesian inference. In Uncertainty in Artificial Intelligence 17, 2001.
Nickisch, H. and Rasmussen, C.E. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9(10), 2008.
Opper, M. and Archambeau, C. The variational Gaussian approximation revisited. Neural Computation, 21(3):786-792, 2009.
Rasmussen, Carl Edward and Williams, Christopher K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.
Rockafellar, R. Convex Analysis. Princeton University Press, 1970.
Rue, H. and Held, L. Gaussian Markov Random Fields: Theory and Applications, volume 104 of Monographs on Statistics and Applied Probability. Chapman & Hall, London, 2005.
Rue, H., Martino, S., and Chopin, N. Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations. Journal of Royal Statistical Society, Series B, 71:319-392, 2009.
Seeger, M. Bayesian inference and optimal design for the sparse linear model. Journal of Machine Learning Research, 9:759-813, 2008.
Seeger, M. Sparse linear models: Variational approximate inference and Bayesian experimental design. Journal of Physics: Conference Series, 197(012001), 2009.
Seeger, M. and Nickisch, H. Large scale Bayesian inference and experimental design for sparse linear models. SIAM J. Imag. Sciences, 4(1):166-199, 2011.
Sontag, David, Globerson, Amir, and Jaakkola, Tommi. Introduction to dual decomposition for inference. Optimization for Machine Learning, 1, 2011.
Tipping, M. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211-244, 2001.
-----0
Allouche, D., de Givry, S., and Schiex, T. Toulbar2, an open source exact cost function network solver. Technical report, INRIA, 2010.
Bellman, R.E. Adaptive control processes: A guided tour. Princeton University Press (Princeton, NJ), 1961.
Berlekamp, E., McEliece, R., and Van Tilborg, H. On the inherent intractability of certain coding problems. Information Theory, IEEE Transactions on, 24(3):384-386, 1978.
Cai, J.Y. and Chen, X. A decidable dichotomy theorem on directed graph homomorphisms with non-negative weights. In FOCS, 2010.
Carreira-Perpinan, M.A. and Hinton, G.E. On contrastive divergence learning. In Artificial Intelligence and Statistics, volume 2005, pp. 17, 2005.
Chakraborty, S., Meel, K., and Vardi, M. A scalable and nearly uniform generator of SAT witnesses, 2013. To appear.
Desjardins, G., Courville, A., and Bengio, Y. On tracking the partition function. In NIPS-2011, pp. 2501-2509, 2011.
Dyer, M., Frieze, A., and Kannan, R. A random polynomial-time algorithm for approximating the volume of convex bodies. JACM, 38(1):1-17, 1991.
Ermon, S., Gomes, C., Sabharwal, A., and Selman, B. Accelerated Adaptive Markov Chain for Partition Function Computation. In NIPS-2011, 2011.
Felgenhauer, B. and Jarvis, F. Enumerating possible Sudoku grids. Mathematical Spectrum, 2005.
Girolami, M. and Calderhead, B. Riemann Manifold Langevin and Hamiltonian Monte Carlo Methods. J. of the Royal Statistical Society, 73(2):123214, 2011.
Gogate, V. and Dechter, R. SampleSearch: Importance sampling in presence of determinism. Artificial Intelligence, 175(2):694729, 2011.
Goldreich, O. Randomized methods in computation. Lecture Notes, 2011.
Gomes, Carla P., van Hoeve, Willem Jan, Sabharwal, Ashish, and Selman, Bart. Counting CSP solutions using generalized XOR constraints. In AAAI, 2007.
Gomes, C.P., Sabharwal, A., and Selman, B. Near-uniform sampling of combinatorial spaces using XOR constraints. In NIPS-2006, pp. 481488, 2006a.
Gomes, C.P., Sabharwal, A., and Selman, B. Model counting: A new strategy for obtaining good bounds. In AAAI, pp. 5461, 2006b.
Hazan, T. and Jaakkola, T. On the partition function and Random Maximum A-Posteriori perturbations. In ICML, 2012.
Hinton, G.E., Osindero, S., and Teh, Y.W. A fast learning algorithm for Deep Belief Nets. Neural Computation, 18 (7):15271554, 2006.
Jerrum, M. and Sinclair, A. The Markov chain Monte Carlo method: an approach to approximate counting and integration. Approximation algorithms for NP-hard problems, pp. 482520, 1997.
Jordan, M.I., Ghahramani, Z., Jaakkola, T.S., and Saul, L.K. An introduction to variational methods for graphical models. Machine learning, 37(2):183233, 1999.
Lauritzen, Steffen L and Spiegelhalter, David J. Local computations with probabilities on graphical structures and their application to expert systems. J. of the Royal Statistical Society, B (Methodological), pp. 157224, 1988.
Madras, N.N. Lectures on Monte Carlo Methods. American Mathematical Society, 2002. ISBN 0821829785.
Mooij, J.M. libDAI: A free and open source C++ library for discrete approximate inference in graphical models. JMLR, 11:21692173, 2010.
Murphy, K.P., Weiss, Y., and Jordan, M.I. Loopy belief propagation for approximate inference: An empirical study. In UAI, 1999.
Park, J.D. Using weighted MAX-SAT engines to solve MPE. In AAAI-2002, pp. 682687, 2002.
Riedel, Sebastian. Improving the accuracy and efficiency of MAP inference for Markov Logic. In UAI-2008, pp. 468475, 2008.
Simonovits, M. How to compute the volume in high dimension? Math. programming, 97(1):337374, 2003.
Sontag, David, Meltzer, Talya, Globerson, Amir, Jaakkola, Tommi, and Weiss, Yair. Tightening LP relaxations for MAP using Message Passing. In UAI, pp. 503510, 2008.
Soos, M., Nohl, K., and Castelluccia, C. Extending SAT solvers to cryptographic problems. SAT, 2009.
Vadhan, S. Pseudorandomness. Foundations and Trends in Theoretical Computer Science, 2011.
Valiant, L.G. The complexity of enumeration and reliability problems. SIAM Journal on Computing, 8(3): 410421, 1979.
Valiant, L.G. and Vazirani, V.V. NP is as easy as detecting unique solutions. Theoretical Computer Science, 47:85 93, 1986.
Vardy, Alexander. Algorithmic complexity in coding theory and the minimum distance problem. In STOC, 1997.
Wainwright, M.J. Tree-reweighted belief propagation algorithms and approximate ML estimation via pseudomoment matching. In AISTATS, 2003.
Wainwright, M.J. and Jordan, M.I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1305, 2008.
Welling, M. and Hinton, G. A new learning algorithm for mean field Boltzmann Machines. Artificial Neural Networks: ICANN 2002, pp. 8282, 2002.
-----0
Adamic, L. A. and Adar, E. Friends and neighbors on the web. Social Networks, 25(3):211230, July 2003.
Bejder, L., Fletcher, D., and Brager, S. A method for testing association patterns of social animals. Animal Behaviour, 56(3):719725, 1998.
Chandola, V., Banerjee, A., and Kumar, V. Anomaly detection: A survey. ACM Computing Surveys, 41 (3):158, July 2009.
Chaudhuri, S., Ganti, V., and Motwani, R. Robust identification of fuzzy duplicates. In Proc. 21st Intl Conf. on Data Engineering (ICDE 2005), pp. 865 876. IEEE, April 2005.
Cohen, W. W., Ravikumar, P. D., and Fienberg, S. E. A comparison of string distance metrics for name-matching tasks. In Proc. IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), pp. 7378, 2003.
Committee on DNA Forensic Science: An Update, National Research Council. The Evaluation of Forensic DNA Evidence. The National Academies Press, 1996.
Crandall, D. J., Backstrom, L., Cosley, D., Suri, S., Huttenlocher, D., and Kleinberg, J. Inferring social ties from geographic coincidences. Proceedings of the National Academy of Sciences, 107(52):22436 22441, December 2010.
Eagle, N. and Pentland, A. Reality mining: Sensing complex social systems. Personal and Ubiquitous Computing, 10(4):255268, 2006.
Eagle, N., Pentland, A. S., and Lazer, D. Inferring friendship network structure by using mobile phone data. Proceedings of the National Academy of Sciences, 106(36):1527415278, September 2009.
Elmagarmid, A. K., Ipeirotis, P. G., and Verykios, V. S. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):116, January 2007.
Eskin, E. Anomaly detection over noisy data using learned probability distributions. In Proc. 17th Intl Conf. on Machine Learning (ICML 2000), pp. 255 262, 2000. Morgan Kaufmann.
Fellegi, I. P. and Sunter, A. B. A theory for record linkage. Journal of the American Statistical Association, 64(328):11831210, December 1969.
Friedland, L. and Jensen, D. Finding tribes: Identifying close-knit individuals from employment patterns. In Proc. 13th Intl Conf. on Knowledge Discovery and Data Mining (KDD 2007), pp. 290299, 2007. ACM.
Hand, D. J. Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning, 77(1):103123, October 2009.
Liben-Nowell, D. and Kleinberg, J. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):10191031, 2007.
McCallum, A., Nigam, K., and Ungar, L. H. Efficient clustering of high-dimensional data sets with application to reference matching. In Proc. 6th Intl Conf. on Knowledge Discovery and Data Mining (KDD 2000), pp. 169178, 2000. ACM.
Metwally, A., Agrawal, D., and Abbadi, A. E. Detectives: Detecting coalition hit inflation attacks in advertising networks streams. In Proc. 16th Intl Conf. on World Wide Web (WWW 2007), pp. 241 250, 2007. ACM.
Narayanan, A. and Shmatikov, V. Robust deanonymization of large sparse datasets. In IEEE Symposium on Security and Privacy, pp. 111125, 2008. IEEE Computer Society.
National Center for Health Statistics. Matched multiple birth data, 19952000. Public-use data file and documentation, 2000. URL http://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/mmb2/.
Sorokina, D., Gehrke, J., Warner, S., and Ginsparg, P. Plagiarism detection in arXiv. In Proc. 6th Intl Conf. on Data Mining (ICDM 2006), pp. 10701075, 2006. IEEE Computer Society.
Su, C. and Srihari, S. N. Evaluation of rarity of fingerprints in forensics. In Advances in Neural Information Processing Systems 23, pp. 12071215, 2010.
Whang, S. and Garcia-Molina, H. Managing information leakage. In Proc. 5th Biennial Conf. on Innovative Data Systems Research (CIDR 2011), pp. 7984, 2011.
Winkler, W. E. Overview of record linkage and current research directions. Technical report, U.S. Census Bureau, February 2006.
Yang, Z., Wilson, C., Wang, X., Gao, T., Zhao, B. Y., and Dai, Y. Uncovering social network sybils in the wild. In Proc. Internet Measurement Conf. (IMC 2011), pp. 259268, 2011. ACM.
-----0
Arnold, Andrew, Liu, Yan, and Abe, Naoki. Temporal causal modeling with graphical Granger methods. In Proceedings of the 13th ACM SIGKDD, KDD 07, pp. 6675. ACM, 2007.
Bae, K H. A New Approach to Measuring Financial Contagion. Review of Financial Studies, 16(3):717763, July 2003.
Bell, R.M. and Koren, Y. Scalable collaborative filtering with jointly derived neighborhood interpolation weights. ICDM 2007. Seventh IEEE International Conference on, pp. 4352, 2007.
Boldi, M O and Davison, A C. A mixture model for multivariate extremes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):217229, 2007.
Borodin, A., El-Yaniv, R., and Gogan, V. Can we learn to beat the best stock. J. Artif. Intell. Res. (JAIR), 21: 579594, 2004.
Box, G.E.P. and Cox, D.R. An analysis of transformations. Journal of the Royal Statistical Society. Series B (Methodological), pp. 211252, 1964.
Castillo, E. and Hadi, A.S. Fitting the generalized pareto distribution to data. Journal of the American Statistical Association, 92(440):16091620, 1997.
Chen, P. and Chihying, H. Learning causal relations in multivariate time series data. Economics: The Open Access, Open-Assessment E-Journal, 1:11, 2007.
Cherubini, U., Luciano, E., and Vecchiato, W. Copula methods in finance. Wiley, 2004.
Coles, S.G. and Tawn, J.A. Modelling extreme multivariate events. Journal of the Royal Statistical Society. Series B (Methodological), pp. 377392, 1991.
Forbes, K J and Rigobon, R. No contagion, only interdependence: measuring stock market comovements. The Journal of Finance, 57(5):22232261, 2002.
Kalai, A and Vempala, S. Efficient algorithms for universal portfolios. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pp. 486491, 2000.
Kendall, M.G. and Stuart, A. The Advanced Theory of Statistics. Vol. 2: Inference and Relationship. Griffin, 1973.
Kenett, Dror Y, Tumminello, Michele, Madi, Asaf, Gur-Gershgoren, Gitit, Mantegna, Rosario N, and Ben-Jacob, Eshel. Dominating Clasp of the Financial Sector Revealed by Partial Correlation Analysis of the Stock Market. PLoS ONE, 5(12):e15032, December 2010.
Khandani, A. and Lo, A. What happened to the quants in August 2007? Available at SSRN 1015987, 2007.
Longin, F. and Solnik, B. Correlation structure of international equity markets during extremely volatile periods. Available at SSRN 147848, 1999.
MacKenzie, D. End-of-the-world trade. London Review of Books, 30(9):2426, 2008.
Markowitz, H. Portfolio selection: efficient diversification of investments. New Haven, CT: Cowles Foundation, 94, 1959.
Meneguzzo, D. and Vecchiato, W. Copula sensitivity in collateralized debt obligations and basket default swaps. Journal of Futures Markets, 24(1):3770, 2004.
Nelsen, R.B. An introduction to copulas. Springer, 2006.
Pekasiewicz, D. Methods of multivariate statistical analysis and their applications. Wydawnictwo Uniwersytetu Lodzkiego, 2007.
Plyakha, Y., Uppal, R., and Vilkov, G. Why does an equal-weighted portfolio outperform value- and price-weighted portfolios? Available at SSRN 1787045, 2012.
Reilly, F.K. and Brown, K.C. Investment Analysis and Portfolio Management. South-Western Pub, 2011.
Richards, A.J. Comovements in national stock market returns: Evidence of predictability, but not cointegration. Journal of Monetary Economics, 36(3):631654, 1995.
Rockafellar, R Tyrrell and Uryasev, Stanislav. Optimization of conditional value-at-risk. Journal of risk, 2:2142, 2000.
Sharpe, W.F. The Sharpe Ratio, 1994.
Stărică, C. Multivariate extremes for models with constant conditional correlations. Journal of Empirical Finance, 6(5):515553, 1999.
Stulz, R M. Rethinking risk management. Journal of applied corporate finance, 9(3):825, 2005.
White, H. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica: Journal of the Econometric Society, pp. 817838, 1980.
Zhu, M. Recall, precision and average precision. Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, 2004.
-----0
Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge University Press, 2006.
Cortes, C. and Mohri, M. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems 16, pp. 313320. MIT Press, Cambridge, MA, 2004.
Flach, P. A., Hernandez-Orallo, J., and Ramirez, C. F. A coherent interpretation of AUC as a measure of aggregated classification performance. In Proceedings of the 28th International Conference on Machine Learning, pp. 657664, Bellevue, WA, 2011.
Gao, W. and Zhou, Z.-H. On the consistency of AUC optimization. CoRR/abstract, 1208.0645, 2012.
Gao, W., Jin, R., Zhu, S., and Zhou, Z.-H. One-pass AUC optimization. CoRR/abstract, 1305.1363, 2013.
Hanley, J. A. and McNeil, B. J. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology, 148 (3):839843, 1983.
Hansen, P. C. Rank-Deficient and Discrete Ill-Posed Problems: Numerical Aspects of Linear Inversion. SIAM, 1987.
Herschtal, A. and Raskutti, B. Optimising area under the ROC curve using gradient descent. In Proceedings of the 21st International Conference on Machine Learning, Alberta, Canada, 2004.
Hsu, D., Kakade, S., and Zhang, T. Random design analysis of ridge regression. In Proceedings of the 25th Annual Conference on Learning Theory, pp. 9.19.24, Edinburgh, Scotland, 2012.
Joachims, T. A support vector method for multivariate performance measures. In Proceedings of the 22nd International Conference on Machine Learning, pp. 377384, Bonn, Germany, 2005.
Joachims, T. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 217226, Philadelphia, PA, 2006.
Johnstone, I. High dimensional statistical inference and random matrices. In Proceedings of the International Congress of Mathematicians, pp. 307333, Madrid, Spain, 2006.
Kotlowski, W., Dembczynski, K., and Hullermeier, E. Bipartite ranking through minimization of univariate loss. In Proceedings of the 28th International Conference on Machine Learning, pp. 11131120, Bellevue, WA, 2011.
Liu, X.-Y., Wu, J., and Zhou, Z.-H. Exploratory undersampling for class-imbalance learning. IEEE Trans. Systems, Man, and Cybernetics B, 39(2): 539550, 2009.
Metz, C. E. Basic principles of ROC analysis. Seminars in Nuclear Medicine, 8(4):283298, 1978.
Nocedal, J. and Wright, S. J. Numerical optimization. Springer, 1999.
Provost, F. J., Fawcett, T., and Kohavi, R. The case against accuracy estimation for comparing induction algorithms. In Proceedings of the 15th International Conference on Machine Learning, pp. 445 453, Madison, WI, 1998.
Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, pp. 449456, Edinburgh, Scotland, 2012.
Rudin, C. and Schapire, R. E. Margin-based ranking and an equivalence between AdaBoost and RankBoost. Journal of Machine Learning Research, 10: 21932232, 2009.
Srebro, N., Sridharan, K., and Tewari, A. Smoothness, low noise and fast rates. In Advances in Neural Information Processing Systems 24, pp. 21992207. MIT Press, Cambridge, MA, 2010.
Zhao, P., Hoi, S., Jin, R., and Yang, T. Online AUC maximization. In Proceedings of the 28th International Conference on Machine Learning, pp. 233 240, Bellevue, WA, 2011.
-----2
Amer, M. R. and Todorovic, S. Sum-product networks for modeling activities with stochastic structure. In CVPR, 2012.
Andrew, G. and Gao, J. Scalable training of L1-regularized log-linear models. In ICML 24, 2007.
Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2:1127, 2009.
Chechetka, A. and Guestrin, C. Efficient principled learning of thin junction trees. In NIPS 20, 2008.
Chickering, D. M. The WinMine Toolkit. Microsoft, Redmond, WA MSR-TR-2002-103, 2002.
Chickering, D. M., Heckerman, D., and Meek, C. A Bayesian approach to learning Bayesian networks with local structure. In UAI 13, 1997.
Darwiche, A. A differential approach to inference in Bayesian networks. JACM, 50:280305, 2003.
Davis, J. and Domingos, P. Bottom-up learning of Markov network structure. In ICML 27, 2010.
Dechter, R. and Mateescu, R. AND/OR search spaces for graphical models. AIJ, 171:73106, 2007.
Della Pietra, S., Della Pietra, V., and Lafferty, J. Inducing features of random fields. PAMI, 19:380392, 1997.
Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:138, 1977.
Dennis, A. and Ventura, D. Learning the architecture of sum-product networks using clustering on variables. In NIPS 25, 2012.
Gens, R. and Domingos, P. Discriminative learning of sum-product networks. In NIPS 25, 2012.
Gogate, V., Webb, W., and Domingos, P. Learning efficient Markov networks. In NIPS 23, 2010.
Lowd, D. The Libra Toolkit. URL http://libra.cs.uoregon.edu/. Version 0.5.0, 2012.
Lowd, D. and Davis, J. Learning Markov network structure with decision trees. In ICDM 10, 2010.
Lowd, D. and Domingos, P. Learning arithmetic circuits. In UAI 24, 2008.
Neal, R.M. and Hinton, G.E. A view of the EM algorithm that justifies incremental, sparse, and other variants. NATO ASI SERIES D: Behavioural and Social Sciences, 89:355370, 1998.
Parzen, E. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):10651076, 1962.
Poon, H. and Domingos, P. Sum-product networks: A new deep architecture. In UAI 27, 2011.
Queyranne, M. Minimizing symmetric submodular functions. Math. Programming, 82(1):312, 1998.
Ravikumar, P., Wainwright, M. J., and Lafferty, J. D. High-dimensional Ising model selection using L1-regularized logistic regression. The Annals of Statistics, 38(3):12871319, 2010.
Raz, R. Multi-linear formulas for permanent and determinant are of super-polynomial size. In STOC 36, 2004.
Roth, D. On the hardness of approximate reasoning. AIJ, 82:273302, 1996.
Van Haaren, J. and Davis, J. Markov network structure learning: A randomized feature generation approach. In AAAI 26, 2012.
Woolf, B. The log likelihood ratio test (the G-test). Annals of Human Genetics, 21(4):397409, 1957.
-----0
Billsus, Daniel and Pazzani, Michael. Learning collaborative information filters. In Proceedings of ICML, pp. 4654, San Francisco, CA, USA, 1998.
Freund, Yoav and Haussler, David. Unsupervised learning of distributions on binary vectors using two layer networks. Technical report, University of California at Santa Cruz, Santa Cruz, CA, USA, 1994.
Goldberg, Ken, Roeder, Theresa, Gupta, Dhruv, and Perkins, Chris. Eigentaste: A constant time collaborative filtering algorithm. Inf. Retr., 4(2):133151, 2001.
Hinton, Geoffrey. Training products of experts by minimizing contrastive divergence. Neural Comput., 14 (8):17711800, 2002.
Hinton, Geoffrey and Salakhutdinov, Ruslan. Reducing the dimensionality of data with neural networks. Science, 313(5786):504507, 2006.
Hofmann, Thomas. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst., 22(1): 89115, 2004.
Kim, Dohyun and Yum, Bong-Jin. Collaborative filtering based on iterative principal component analysis. Expert Syst. Appl., 28(4):823830, 2005.
Koren, Yehuda, Bell, Robert, and Volinsky, Chris. Matrix factorization techniques for recommender systems. Computer, 42(8):3037, 2009.
Kozma, Laszlo, Raiko, Tapani, and Ilin, Alexander. Binary principal component analysis in the Netflix collaborative filtering task. In Proceedings of the IEEE Workshop on Machine Learning for Signal Processing, pp. 16, Grenoble, France, 2009.
Langseth, Helge and Nielsen, Thomas Dyhre. A latent model for collaborative filtering. Int. J. Approx. Reasoning, 53(4):447466, 2012.
Lawrence, Neil D. and Urtasun, Raquel. Non-linear matrix factorization with Gaussian processes. In Proceedings of ICML, pp. 601608, Montreal, Quebec, Canada, 2009.
Melville, Prem, Mooney, Raymond, and Nagarajan, Ramadass. Content-boosted collaborative filtering for improved recommendations. In Proceedings of the national conference on Artificial Intelligence, pp. 187192, Edmonton, Alberta, Canada, 2002.
Pazzani, Michael. A framework for collaborative, content-based and demographic filtering. Artif. Intell. Rev., 13(5-6):393408, 1999.
Salakhutdinov, Ruslan and Mnih, Andriy. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of ICML, pp. 880887, Helsinki, Finland, 2008.
Salakhutdinov, Ruslan, Mnih, Andriy, and Hinton, Geoffrey. Restricted Boltzmann machines for collaborative filtering. In Proceedings of ICML, pp. 791798, Corvallis, OR, USA, 2007.
Sarwar, Badrul, Karypis, George, Konstan, Joseph, and Riedl, John. Application of dimensionality reduction in recommender system - a case study. In Proceedings of the ACM WebKDD workshop, Boston, MA, USA, 2000a.
Sarwar, Badrul, Karypis, George, Konstan, Joseph, and Riedl, John. Analysis of recommendation algorithms for e-commerce. In Proceedings of the ACM conference on Electronic commerce, pp. 158167, Minneapolis, MN, USA, 2000b.
Sarwar, Badrul, Karypis, George, Konstan, Joseph, and Riedl, John. Item-based collaborative filtering recommendation algorithms. In Proceedings of WWW, pp. 285295, Hong Kong, 2001.
Sarwar, Badrul, Karypis, George, Konstan, Joseph, and Riedl, John. Incremental singular value decomposition algorithms for highly scalable recommender systems. In Proceedings of the International Conference on Computer and Information Science, pp. 2728, 2002.
Smolensky, Paul. Information processing in dynamical systems: foundations of harmony theory. In Rumelhart, David and McClelland, James (eds.), Parallel distributed processing: explorations in the microstructure of cognition, vol. 1: foundations, pp. 194281. MIT Press, Cambridge, MA, USA, 1986.
Taranto, Claudio, Mauro, Nicola Di, and Esposito, Floriana. Uncertain graphs meet collaborative filtering. In Amati, Giambattista, Carpineto, Claudio, and Semeraro, Giovanni (eds.), Proceedings of the Italian Information Retrieval Workshop, pp. 89 100, Bari, Italy, 2012.
Truyen, Tran The, Phung, Dinh, and Venkatesh, Svetha. Ordinal Boltzmann machines for collaborative filtering. In Proceedings of UAI, pp. 548556, Montreal, Quebec, Canada, 2009.
Vozalis, Manolis, Markos, Angelos, and Margaritis, Konstantinos. Collaborative filtering through SVD-based and hierarchical nonlinear PCA. In Proceedings of ICANN, pp. 395400, Thessaloniki, Greece, 2010.
-----0
Arlot, S. and Celisse, A. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:4079, 2010.
Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. Wortman. A theory of learning from different domains. Machine Learning, 79:151175, 2010.
Bengio, Y. and Grandvalet, Y. No unbiased estimator of the variance of k-fold cross-validation. Journal of Machine Learning Research, 5:10891105, 2003.
Blitzer, J., Dredze, M., and Pereira, F. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of Association of Computational Linguistics, 2007.
Caruana, R. Multi-task learning. Machine Learning, 28:4175, 1997.
Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., and Slattery, S. Learning to extract symbolic knowledge from the world wide web. In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98), 1998.
Grandvalet, Y. and Bengio, Y. Hypothesis testing for cross-validation. Technical Report 1285, Departement d'informatique et recherche operationnelle, Universite de Montreal, 2006.
Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., and Scholkopf, B. Covariate shift by kernel mean matching. In Quinonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N.D. (eds.), Dataset shift in machine learning, pp. 131160. The MIT Press, 2009.
Markatou, M., Tian, H., Biswas, S., and Hripcsak, G. Analysis of variance of cross-validation estimators of the generalization error. Journal of Machine Learning Research, 6:11271168, 2005.
Mitchell, T., Hutchinson, R., Niculescu, R., Pereira, F., Wang, X., Just, M., and Newman, S. Learning to decode cognitive states from brain images. Machine Learning, 57(1):145175, 2004.
Nadeau, C. and Bengio, Y. Inference for the generalization error. Machine Learning, 52:239281, 2003.
Rakotomalala, R., Chauchat, J.-H., and Pellegrino, F. Accuracy estimation with clustered dataset. In Proceedings of Australasian conference on Data mining and analytics, 2006.
Stone, M. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society. Series B (Methodological), 36(2):111 147, 1974.
Sugiyama, M., Krauledat, M., and Muller, K.-R. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:9851005, 2007.
Thrun, S. Is learning the n-th thing any easier than learning the first? In Advances in Neural Information Processing Systems, 1996.
-----0
Ambroladze, A., Parrado-Hernandez, E., and Shawe-Taylor, J. Tighter PAC-Bayes bounds. In NIPS, pp. 916, 2006.
Ben-David, S. and Urner, R. On the hardness of domain adaptation and the utility of unlabeled target samples. In ALT, pp. 139153, 2012.
Ben-David, S., Blitzer, J., Crammer, K., and Pereira, F. Analysis of representations for domain adaptation. In NIPS, pp. 137144, 2006.
Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J.W. A theory of learning from different domains. Mach. Learn., 79(1-2):151 175, 2010a.
Ben-David, S., Lu, T., Luu, T., and Pal, D. Impossibility theorems for domain adaptation. JMLR W&CP, AISTAT, 9:129136, 2010b.
Blitzer, J., McDonald, R., and Pereira, F. Domain adaptation with structural correspondence learning. In EMNLP, 2006.
Bruzzone, L. and Marconcini, M. Domain adaptation problems: A DASVM classification technique and a circular validation strategy. Trans. Pattern Anal. Mach. Intell., 32(5):770787, 2010.
Zhang, C., Zhang, L., and Ye, J. Generalization bounds for domain adaptation. In NIPS, 2012.
Catoni, O. PAC-Bayesian supervised classification: the thermodynamics of statistical learning, volume 56. Institute of Mathematical Statistics, 2007.
Chen, M., Weinberger, K. Q., and Blitzer, J. Co-training for domain adaptation. In NIPS, pp. 2456 2464, 2011.
Cortes, C. and Mohri, M. Domain adaptation in regression. In ALT, pp. 308323, 2011.
Germain, P., Lacasse, A., Laviolette, F., and Marchand, M. PAC-Bayesian learning of linear classifiers. In ICML, 2009a.
Germain, P., Lacasse, A., Laviolette, F., Marchand, M., and Shanian, S. From PAC-Bayes bounds to KL regularization. In NIPS, pp. 603610, 2009b.
Jiang, J. A literature survey on domain adaptation of statistical classifiers. Technical report, CS Department at Univ. of Illinois at Urbana-Champaign, 2008.
Lacasse, A., Laviolette, F., Marchand, M., Germain, P., and Usunier, N. PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. In NIPS, 2006.
Langford, J. and Shawe-Taylor, J. PAC-Bayes & margins. In NIPS, pp. 439446, 2002.
Laviolette, F., Marchand, M., and Roy, J.-F. From PAC-Bayes bounds to quadratic programs for majority votes. In ICML, 2011.
Li, X. and Bilmes, J. A Bayesian divergence prior for classifier adaptation. In AISTATS-2007, 2007.
Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation: Learning bounds and algorithms. In COLT, pp. 1930, 2009a.
Mansour, Y., Mohri, M., and Rostamizadeh, A. Multiple source adaptation and the Renyi divergence. In UAI, pp. 367374, 2009b.
McAllester, D. A. Some PAC-Bayesian theorems. Mach. Learn., 37:355363, 1999.
Quinonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N.D. Dataset Shift in Machine Learning. MIT Press, 2009. ISBN 0262170051, 9780262170055.
Zhong, E., Fan, W., Yang, Q., Verscheure, O., and Ren, J. Cross validation framework to choose amongst models and datasets for transfer learning. In ECML-PKDD, 2010.
-----0
Baldassarre, Luca, Rosasco, Lorenzo, Barla, Annalisa, and Verri, Alessandro. Multi-output learning via spectral filtering. Machine Learning, 87:259301, 2012.
Caponnetto, A. and De Vito, E. Optimal rates for regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331368, 2007.
Cortes, Corinna, Mohri, Mehryar, and Weston, Jason. A general regression framework for learning string-to-string mappings. In Bakir, Gokhan, Hofmann, Thomas, Scholkopf, Bernhard, Smola, Alexander J., Taskar, Ben, and Vishwanathan, S. V. N. (eds.), Predicting Structured Data, chapter 8, pp. 143168. MIT Press, Cambridge, MA, 2007.
Gartner, Thomas and Vembu, Shankar. On structured output training: hard cases and an efficient alternative. Machine Learning, 79:227242, 2009.
Germain, Pascal, Lacoste, Alexandre, Laviolette, Francois, Marchand, Mario, and Shanian, Sara. A PAC-Bayes sample-compression approach to kernel methods. In Getoor, Lise and Scheffer, Tobias (eds.), Proceedings of the 28th International Conference on Machine Learning, ICML 11, pp. 297304, New York, NY, USA, June 2011. ACM.
Jacob, Laurent, Hoffmann, Brice, Stoven, Veronique, and Vert, Jean-Philippe. Virtual screening of GPCRs: An in silico chemogenomics approach. BMC Bioinformatics, 9(1):363, 2008.
Langford, John. Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6:273306, 2005.
Laviolette, Francois and Marchand, Mario. PAC-Bayes risk bounds for stochastic averages and majority votes of sample-compressed classifiers. Journal of Machine Learning Research, 8:14611487, 2007.
Maurer, Andreas. A note on the PAC Bayesian theorem. CoRR, cs.LG/0411099, 2004.
McAllester, David. PAC-Bayesian stochastic model selection. Machine Learning, 51:521, 2003.
McAllester, David. Generalization bounds and consistency for structured labeling. In Bakir, Gokhan, Hofmann, Thomas, Scholkopf, Bernhard, Smola, Alexander J., Taskar, Ben, and Vishwanathan, S. V. N. (eds.), Predicting Structured Data, chapter 11, pp. 247261. MIT Press, Cambridge, MA, 2007.
Rousu, Juho, Saunders, Craig, Szedmak, Sandor, and Shawe-Taylor, John. Kernel-based learning of hierarchical multilabel classification models. J. Mach. Learn. Res., 7:16011626, December 2006.
Taskar, Ben, Guestrin, Carlos, and Koller, Daphne. Max-margin Markov networks. In Thrun, Sebastian, Saul, Lawrence, and Scholkopf, Bernhard (eds.), Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:14531484, 2005.
Zhang, Tong. Information theoretical upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52:13071321, 2006.
-----0
Alcala-Fdez, J., Fernandez, A., Luengo, J., Derrac, J., Garcia, S., Sanchez, L., and Herrera, F. Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, 17(2-3), 2011.
Barber, D., Cemgil, A. T., and Chiappa, S. Bayesian Time Series Models. Cambridge University Press, 2011.
Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2007.
Douc, R., Garivier, A., Moulines, E., and Olsson, J. On the Forward Filtering Backward Smoothing Particle Approximations of the Smoothing Distribution in General State Space Models. Annals of Applied Probability, 2011.
Duvenaud, D.K., Nickisch, H., and Rasmussen, C.E. Additive Gaussian processes. In NIPS, pp. 226234, 2011.
Hartikainen, J. and Sarkka, S. Kalman filtering and smoothing solutions to temporal Gaussian process regression models. In Machine Learning for Signal Processing (MLSP), pp. 379384, Kittila, Finland, August 2010. IEEE.
Hastie, T., Tibshirani, R., and Friedman, J. H. The elements of statistical learning: data mining, inference, and prediction. New York: Springer-Verlag, Second edition, 2009.
Quinonero-Candela, J. and Rasmussen, C.E. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939-1959, December 2005.
Rasmussen, C.E. and Nickisch, H. Gaussian processes for machine learning (gpml) toolbox. Journal of Machine Learning Research, 11:3011-3015, December 2010.
Rasmussen, C.E. and Williams, C.K.I. Gaussian Processes for Machine Learning. The MIT Press, 2006.
Saatci, Y. Scalable Inference for Structured Gaussian Process Models. PhD thesis, University of Cambridge, 2011.
Snelson, E. and Ghahramani, Z. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18, pp. 1257-1264, Cambridge, MA, USA, December 2006. The MIT Press.
-----0
Dalal, N. and Triggs, B. Histograms of oriented gradients for human detection. In CVPR. IEEE, 2005.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In CVPR. IEEE, 2009.
Deng, J., Berg, A., Li, K., and Fei-Fei, L. What does classifying more than 10,000 image categories tell us? In ECCV, 2010.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
Fei-Fei, L., Fergus, R., and Perona, P. One-shot learning of object categories. IEEE TPAMI, 28(4), 2006.
Felzenszwalb, P. F., Girshick, R. B., and McAllester, D. Discriminatively trained deformable part models, release 4. http://people.cs.uchicago.edu/~pff/latent-release4/.
Felzenszwalb, P. F., Girshick, R. B., McAllester, D., and Ramanan, D. Object detection with discriminatively trained part based models. IEEE TPAMI, 32(9), 2010a.
Felzenszwalb, P.F., Girshick, R.B., and McAllester, D. Cascade object detection with deformable part models. In CVPR, 2010b.
Fidler, S., Boben, M., and Leonardis, A. Learning hierarchical compositional representations of object structure. In Object Categorization: Computer and Human Vision Perspectives. 2009.
Freeman, W.T. and Adelson, E.H. The design and use of steerable filters. IEEE TPAMI, 13(9), 1991.
Girshick, R.B., Felzenszwalb, P.F., and McAllester, D. Object detection with grammar models. In NIPS, 2011.
Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. Technical report, California Institute of Technology, 2007.
Kreutz-Delgado, K., Murray, J.F., Rao, B.D., Engan, K., Lee, T.W., and Sejnowski, T.J. Dictionary learning algorithms for sparse representation. Neural computation, 15(2), 2003.
Langford, J., Li, L., and Zhang, T. Sparse online learning via truncated gradient. JMLR, 10, 2009.
Mairal, J., Bach, F., Ponce, J., and Sapiro, G. Online dictionary learning for sparse coding. In ICML. ACM, 2009.
Mairal, J., Bach, F., and Ponce, J. Task-driven dictionary learning. IEEE TPAMI, 32(4), 2012.
Mallat, Stéphane and Zhang, Zhifeng. Matching pursuit with time-frequency dictionaries. IEEE Trans. on Signal Processing, 41, 1993.
Manduchi, R., Perona, P., and Shy, D. Efficient deformable filter banks. IEEE Trans. on Signal Processing, 46(4), 1998.
Ott, P. and Everingham, M. Shared parts for deformable part-based models. In CVPR. IEEE, 2011.
Pirsiavash, H. and Ramanan, D. Steerable part models. In CVPR. IEEE, 2012.
Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. Application of dimensionality reduction in recommender systems - a case study. In Proc. ACM WebKDD Workshop, 2000.
Song, H., Zickler, S., Althoff, T., Girshick, R., Fritz, M., Geyer, C., Felzenszwalb, F., and Darrell, T. Sparselet models for efficient multiclass object detection. In ECCV. Springer-Verlag, 2012.
Taskar, B., Guestrin, C., and Koller, D. Max-margin Markov networks. In NIPS, 2003.
Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 1996.
Torralba, A., Murphy, K.P., and Freeman, W.T. Sharing visual features for multiclass and multiview object detection. IEEE TPAMI, 29(5), 2007.
Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. JMLR, 6(2), 2006.
Vedaldi, A. and Fulkerson, B. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.
Vedaldi, A. and Zisserman, A. Efficient additive kernels via explicit feature maps. IEEE TPAMI, 34(3), 2011.
Wolf, Lior, Jhuang, Hueihan, and Hazan, Tamir. Modeling appearances with low-rank SVM. In CVPR, 2007.
Zhu, L.L., Chen, Y., Torralba, A., Freeman, W., and Yuille, A. Part and appearance sharing: Recursive compositional models for multi-view multi-object detection. In CVPR. IEEE, 2010.
Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B, 2005.
-----0
Arcolano, N. and Wolfe, P. J. Nystrom approximation of Wishart matrices. In Proceedings of the 2010 IEEE International Conference on Acoustics Speech and Signal Processing, pp. 3606-3609, 2010.
Asuncion, A. and Newman, D. J. UCI Machine Learning Repository, November 2012. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.
Belabbas, M.-A. and Wolfe, P. J. Spectral methods in machine learning and new strategies for very large datasets. Proc. Natl. Acad. Sci. USA, 106:369-374, 2009.
Boutsidis, C., Mahoney, M.W., and Drineas, P. An improved approximation algorithm for the column subset selection problem. In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 968-977, 2009.
Corke, P. I. A Robotics Toolbox for MATLAB. IEEE Robotics and Automation Magazine, 3:24-32, 1996.
Drineas, P. and Mahoney, M.W. On the Nystrom method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153-2175, 2005.
Drineas, P., Mahoney, M.W., and Muthukrishnan, S. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30:844-881, 2008.
Drineas, P., Mahoney, M.W., Muthukrishnan, S., and Sarlos, T. Faster least squares approximation. Numerische Mathematik, 117(2):219-249, 2010.
Drineas, P., Magdon-Ismail, M., Mahoney, M. W., and Woodruff, D. P. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13:3475-3506, 2012.
Gittens, A. The spectral norm error of the naive Nystrom extension. Technical report, 2011. Preprint: arXiv:1110.5305 (2011).
Gittens, A. and Mahoney, M. W. Revisiting the Nystrom Method for Improved Large-Scale Machine Learning. Technical report, 2013. Preprint: arXiv:1303.1849 (2013).
Gustafson, A. M., Snitkin, E. S., Parker, S. C. J., DeLisi, C., and Kasif, S. Towards the identification of essential genes using targeted genome sequencing and comparative analysis. BMC Genomics, 7:265, 2006.
Guyon, I., Gunn, S. R., Ben-Hur, A., and Dror, G. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems 17. MIT Press, 2005.
Halko, N., Martinsson, P.-G., and Tropp, J. A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011.
Klimt, B. and Yang, Y. The Enron corpus: A new dataset for email classification research. In Proceedings of the 15th European Conference on Machine Learning, pp. 217-226, 2004.
Kumar, S., Mohri, M., and Talwalkar, A. Ensemble Nystrom method. In Annual Advances in Neural Information Processing Systems 22: Proceedings of the 2009 Conference, 2009.
Kumar, S., Mohri, M., and Talwalkar, A. Sampling methods for the Nystrom method. Journal of Machine Learning Research, 13:981-1006, 2012.
Leskovec, J., Kleinberg, J., and Faloutsos, C. Graph Evolution: Densification and Shrinking Diameters. ACM Transactions on Knowledge Discovery from Data, 1, 2007.
Li, M., Kwok, J.T., and Lu, B.-L. Making large-scale Nystrom approximation possible. In Proceedings of the 27th International Conference on Machine Learning, pp. 631-638, 2010.
Liu, S., Zhang, J., and Sun, K. Learning low-rank kernel matrices with column-based methods. Communications in Statistics - Simulation and Computation, 39(7):1485-1498, 2010.
Mahoney, M. W. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning. NOW Publishers, Boston, 2011. Also available at: arXiv:1104.5557.
Mahoney, M.W. and Drineas, P. CUR matrix decompositions for improved data analysis. Proc. Natl. Acad. Sci. USA, 106:697-702, 2009.
Nielsen, T. O., West, R. B., Linn, S. C., Alter, O., Knowling, M. A., O'Connell, J. X., Zhu, S., Fero, M., Sherlock, G., Pollack, J. R., Brown, P. O., Botstein, D., and van de Rijn, M. Molecular characterisation of soft tissue tumours: a gene expression study. The Lancet, 359:1301-1307, 2002.
Paschou, P., Ziv, E., Burchard, E.G., Choudhry, S., Rodriguez-Cintron, W., Mahoney, M.W., and Drineas, P. PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genetics, 3:1672-1686, 2007.
Talwalkar, A. and Rostamizadeh, A. Matrix coherence and the Nystrom method. In Proceedings of the 26th Conference in Uncertainty in Artificial Intelligence, 2010.
Williams, C.K.I. and Seeger, M. Using the Nystrom method to speed up kernel machines. In Annual Advances in Neural Information Processing Systems 13: Proceedings of the 2000 Conference, pp. 682-688, 2001.
Zhang, K. and Kwok, J. T. Density-weighted Nystrom method for computing large kernel eigensystems. Neural Computation, 21(1):121-146, 2009.
Zhang, K., Tsang, I.W., and Kwok, J.T. Improved Nystrom low-rank approximation and error analysis. In Proceedings of the 25th International Conference on Machine Learning, pp. 1232-1239, 2008.
-----0
Box, G. E. P. and G. C. Tiao (1977). A canonical analysis of multiple time series. Biometrika 64 (2), 355-365.
Brockwell, P. J. and R. A. Davis (1991). Time Series: Theory and Methods (2 ed.). New York, NY: Springer Series in Statistics.
Cardoso, J.-F. (2004). Dependence, correlation and gaussianity in independent component analysis. J. Mach. Learn. Res. 4 (7-8), 1177-1203.
Cover, T. M. and J. Thomas (1991). Elements of Information Theory. Wiley.
Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B Methodological 39 (1), 1-38.
Dhiral, K. K., K. Kalpakis, D. Gada, and V. Puttagunta (2001). Distance Measures for Effective Clustering of ARIMA Time-Series. In Proceedings of the 2001 IEEE International Conference on Data Mining, pp. 273-280.
Fessler, J. A. and B. P. Sutton (2003). Nonuniform fast Fourier transforms using min-max interpolation. IEEE Trans. Signal Process 51, 560-574.
Fryzlewicz, P., G. P. Nason, and R. von Sachs (2008). A wavelet-Fisz approach to spectrum estimation. Journal of Time Series Analysis 29 (5), 868-880.
Gibson, J. (1994). What is the interpretation of spectral entropy? In Proceedings of IEEE International Symposium on Information Theory, 1994, pp. 440.
Gibson, J., S. Stanners, and S. McClellan (1993). Spectral entropy and coefficient rate for speech coding. In Signals, Systems and Computers, 1993. 1993 Conference Record of The Twenty-Seventh Asilomar Conference on, pp. 925-929, vol. 2.
Gomez-Herrero, G., K. Rutanen, and K. Egiazarian (2010). Blind source separation by entropy rate minimization. Signal Processing Letters, IEEE 17 (2), 153-156.
Hughes Hallett, A. and C. Richter (2008). Have the Eurozone economies converged on a common European cycle? International Economics and Economic Policy 5, 71-101.
Hyvarinen, A. and E. Oja (2000). Independent Component Analysis: Algorithms and Applications. Neural Networks 13, 411-430.
Jacques, L. and P. Vandergheynst (2010). Compressed Sensing: When sparsity meets sampling, Chapter 23, pp. 507-528. Wiley-Blackwell.
Jolliffe, I. T. (2002). Principal Component Analysis (2 ed.). New York, NY: Springer.
Lees, J. M. and J. Park (1995). Multiple-Taper Spectral-Analysis - A Stand-Alone C-Subroutine. Computers & Geosciences 21 (2), 199-236.
Li, X.-L. and T. Adali (2010). Blind spatiotemporal separation of second and/or higher-order correlated sources by entropy rate minimization. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pp. 1934-1937.
Matteson, D. S. and R. S. Tsay (2011). Dynamic orthogonal components for multivariate time series. Journal of the American Statistical Association 106 (496), 1450-1463.
Nuttal, A. H. and G. C. Carter (1982). Spectral Estimation and Lag Using Combined Time Weighting. In Proceedings of IEEE, Volume 70, pp. 1111-1125.
Paninski, L. (2003). Estimation of entropy and mutual information. Neural Comput. 15 (6), 1191-1253.
R Development Core Team (2010). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.
Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal 27, 379-423, 623-656.
Sricharan, K., R. Raich, and A. Hero (2011). K-nearest neighbor estimation of entropies with confidence. In Information Theory Proceedings (ISIT), 2011 IEEE International Symposium on, pp. 1205-1209.
Stone, J. V. (2001). Blind source separation using temporal predictability. Neural Comput. 13 (7), 1559-1574.
Stowell, D. and M. D. Plumbley (2009). Fast Multidimensional Entropy Estimation by k-d Partitioning. IEEE Signal Processing Letters 16, 537-540.
Trobs, M. and G. Heinzel (2006). Improved spectrum estimation from digitized time series on a logarithmic frequency axis. Measurement 39 (2), 120-129.
Wiskott, L. and T. J. Sejnowski (2002). Slow Feature Analysis: Unsupervised Learning of Invariances. Neural Computation 14 (4), 715-770.
-----0
Bilenko, Mikhail and Richardson, Matthew. Predictive client-side profiles for personalized advertising. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, 2011.
Blandford, Daniel K. and Blelloch, Guy E. Compact dictionaries for variable-length keys and data with applications. ACM Trans. Algorithms, 4(2), May 2008.
Blum, Avrim, Kalai, Adam, and Langford, John. Beating the hold-out: bounds for k-fold and progressive cross-validation. In Proceedings of the twelfth annual conference on Computational learning theory, 1999.
Bottou, Leon and Bousquet, Olivier. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, volume 20. 2008.
Buciluǎ, Cristian, Caruana, Rich, and Niculescu-Mizil, Alexandru. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 2006.
Chang, Chih-Chung and Lin, Chih-Jen. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 2011. Datasets from http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Craswell, Nick, Zoeter, Onno, Taylor, Michael, and Ramsey, Bill. An experimental comparison of click position-bias models. In Proceedings of the international conference on Web search and web data mining, 2008.
Duchi, John, Shalev-Shwartz, Shai, Singer, Yoram, and Chandra, Tushar. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th international conference on Machine learning, 2008.
Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. In COLT, 2010.
Flajolet, Philippe. Approximate counting: A detailed analysis. BIT, 25(1):113-134, 1985.
Goodman, Joshua, Cormack, Gordon V., and Heckerman, David. Spam and the ongoing battle for the inbox. Commun. ACM, 50(2), February 2007.
Langford, John, Li, Lihong, and Zhang, Tong. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10, June 2009.
McMahan, H. Brendan. Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
McMahan, H. Brendan and Streeter, Matthew. Adaptive bound optimization for online convex optimization. In COLT, 2010.
Morris, Robert. Counting large numbers of events in small registers. Communications of the ACM, 21(10), October 1978. doi: 10.1145/359619.359627.
Patrascu, M. Succincter. In IEEE Symposium on Foundations of Computer Science, pp. 305-313. IEEE, 2008.
Raghavan, Prabhakar and Thompson, Clark D. Randomized rounding: a technique for provably good algorithms and algorithmic proofs. Combinatorica, 7(4), December 1987.
Richardson, Matthew, Dominowska, Ewa, and Ragno, Robert. Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th international conference on World Wide Web, 2007.
Shalev-Shwartz, Shai. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2), 2012.
Streeter, Matthew J. and McMahan, H. Brendan. Less regret via online conditioning. CoRR, abs/1002.4862, 2010.
Thorup, Mikkel. String hashing for linear probing. In Proceedings of the 20th ACM-SIAM Symposium on Discrete Algorithms, 2009.
Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1), 1996.
Van Durme, Benjamin and Lall, Ashwin. Probabilistic counting with randomized storage. In Proceedings of the 21st international joint conference on Artificial intelligence, 2009.
Weinberger, Kilian, Dasgupta, Anirban, Langford, John, Smola, Alex, and Attenberg, Josh. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, 2009.
Xiao, Lin. Dual averaging method for regularized stochastic learning and online optimization. In NIPS, 2009.
Zinkevich, Martin. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.
-----0
Abbeel, P. and Ng, A.Y. Apprenticeship learning via inverse reinforcement learning. In Proc. 21st International Conf. on Machine Learning, pp. 1-8, 2004.
Abbeel, P., Quigley, M., and Ng, A.Y. Using inaccurate models in reinforcement learning. In Proc. 23rd International Conf. on Machine Learning, pp. 1-8, 2006.
Anderson, B.D.O. and Moore, J.B. Optimal control: linear quadratic methods. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1990.
Boyd, S., El Ghaoui, L., Feron, E., and Balakrishnan, V. Linear Matrix Inequalities in System and Control Theory. Studies in Applied Mathematics, Philadelphia, PA, 1994.
Chase, S.M., Kass, R.E., and Schwartz, A.B. Behavioral and neural correlates of visuomotor adaptation observed through a brain-computer interface in primary motor cortex. J. Neurophysiol., 108(2):624-644, 2012.
Coates, A., Abbeel, P., and Ng, A.Y. Learning for control from multiple demonstrations. In Proc. 25th International Conf. on Machine Learning, pp. 144-151, 2008.
Crapse, T.B. and Sommer, M.A. Corollary discharge across the animal kingdom. Nat. Rev. Neurosci., 9(8):587-600, 2008.
Deisenroth, M.P. and Rasmussen, C.E. PILCO: A model-based and data-efficient approach to policy search. In Proc. 28th International Conf. on Machine Learning, pp. 465-472, 2011.
Deisenroth, M.P., Turner, R.D., Huber, M.F., Hanebeck, U.D., and Rasmussen, C.E. Robust filtering and smoothing with Gaussian processes. IEEE Trans. Automatic Control, 57(7):1865-1871, 2012.
Dempster, A.P., Laird, N.M., and Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc., Series B, 39(1):1-38, 1977.
Ganguly, K., Dimitrov, D.F., Wallis, J.D., and Carmena, J.M. Reversible large-scale modification of cortical networks during neuroprosthetic control. Nature Neurosci., 14(5):662-667, 2011.
Georgopoulos, A.P., Caminiti, R., Kalaska, J.F., and Massey, J.T. Spatial coding of movement: a hypothesis concerning the coding of movement direction by motor cortical populations. Exp. Brain Res. Suppl., 7:327-336, 1983.
Green, A.M. and Kalaska, J.F. Learning to move machines with the mind. Trends Neurosci., 34(2):61-75, 2011.
Kording, K.P. and Wolpert, D.M. The loss function of sensorimotor learning. Proc. Natl. Acad. Sci., 101(26):9839-9842, 2004.
Miall, R.C. and Wolpert, D.M. Forward models for physiological motor control. Neural Networks, 9(8):1265-1279, 1996.
Ng, A.Y. and Russell, S. Algorithms for inverse reinforcement learning. In Proc. 17th International Conf. on Machine Learning, pp. 663-670, 2000.
Ratliff, N.D., Bagnell, J.A., and Zinkevich, Martin. Maximum margin planning. In Proc. 23rd International Conf. on Machine Learning, pp. 729-736, 2006.
Schaal, S. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233-242, 1999.
Schwartz, A.B., Kettner, R.E., and Georgopoulos, A.P. Primate motor cortex and free arm movements to visual targets in 3-dimensional space. 1. relations between single cell discharge and direction of movement. J. Neurosci., 8(8):2913-2927, 1988.
Taylor, D.M., Helms Tillery, S.I., and Schwartz, A.B. Direct cortical control of 3D neuroprosthetic devices. Science, 296:1829-1832, 2002.
Ziebart, B.D., Maas, A.L., Bagnell, J.A., and Dey, A.K. Maximum entropy inverse reinforcement learning. In Proc. 23rd AAAI Conf. on Artificial Intelligence, pp. 1433-1438, 2008.
-----0
Aalen, O.O., Borgan, Ø., and Gjessing, H.K. Survival and event history analysis: a process point of view. Springer Verlag, 2008.
Austin, P.C. Generating survival times to simulate Cox proportional hazards models with time-varying covariates. Statistics in Medicine, 2012.
Bailey, N. T. J. The Mathematical Theory of Infectious Diseases and its Applications. Hafner Press, 2nd edition, 1975.
Boyd, S.P. and Vandenberghe, L. Convex optimization. Cambridge University Press, 2004.
Devroye, L. Non-uniform random variate generation, volume 4. Springer-Verlag New York, 1986.
Du, N., Song, L., Smola, A., and Yuan, M. Learning networks of heterogeneous influence. In NIPS '12: Neural Information Processing Systems, 2012.
Erdős, P. and Rényi, A. On the evolution of random graphs. Publication of the Mathematical Institute of the Hungarian Academy of Science, 5:17-67, 1960.
Gomez-Rodriguez, M. and Scholkopf, B. Submodular Inference of Diffusion Networks from Multiple Trees. In ICML '12: Proceedings of the 29th International Conference on Machine Learning, 2012.
Gomez-Rodriguez, M., Leskovec, J., and Krause, A. Inferring Networks of Diffusion and Influence. In KDD '10: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010.
Gomez-Rodriguez, M., Balduzzi, D., and Scholkopf, B. Uncovering the Temporal Dynamics of Diffusion Networks. In ICML '11: Proceedings of the 28th International Conference on Machine Learning, 2011.
Gomez-Rodriguez, M., Leskovec, J., and Scholkopf, B. Structure and Dynamics of Information Pathways in On-line Media. In WSDM '13: Proceedings of the 6th ACM International Conference on Web Search and Data Mining, 2013.
Kempe, D., Kleinberg, J. M., and Tardos, E. Maximizing the spread of influence through a social network. In KDD '03: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
Leskovec, J., Adamic, L. A., and Huberman, B. A. The dynamics of viral marketing. In EC '06: Proceedings of the eighth International Conference on Electronic Commerce, 2006.
Leskovec, J., Backstrom, L., and Kleinberg, J. Meme-tracking and the dynamics of the news cycle. In KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009.
Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C., and Ghahramani, Z. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research, 11:985-1042, 2010.
Liben-Nowell, David and Kleinberg, Jon. Tracing the flow of information on a global scale using Internet chain-letter data. Proceedings of the National Academy of Sciences, 105(12):4633-4638, 2008.
Myers, S. and Leskovec, J. On the Convexity of Latent Social Network Inference. In NIPS '10: Neural Information Processing Systems, 2010.
Myers, S., Leskovec, J., and Zhu, C. Information Diffusion and External Influence in Networks. In KDD '12: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012.
Netrapalli, P. and Sanghavi, S. Finding the graph of epidemic cascades. In ACM SIGMETRICS/Performance '12: Proceedings of the ACM SIGMETRICS and Performance Conference, 2012.
Rogers, E. M. Diffusion of Innovations. Free Press, New York, fourth edition, 1995.
Saito, K., Kimura, M., Ohara, K., and Motoda, H. Learning continuous-time information diffusion model for social behavioral data analysis. Advances in Machine Learning, pp. 322-337, 2009.
Snowsill, T.M., Fyson, N., De Bie, T., and Cristianini, N. Refining causality: who copied from whom? In KDD '11: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011.
Wang, L., Ermon, S., and Hopcroft, J. Feature-enhanced probabilistic models for diffusion network inference. In ECML PKDD '12: Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2012.
-----0
Arkin, E.M., Meijer, H., Mitchell, J.S.B., Rappaport, D., and Skiena, S.S. Decision trees for geometric models. In Proceedings of the ninth annual symposium on Computational geometry, pp. 369-378. ACM, 1993.
Balcan, M.F., Beygelzimer, A., and Langford, J. Agnostic active learning. In Proceedings of the 23rd international conference on Machine learning, pp. 65-72. ACM, 2006.
Balcan, M.F., Broder, A., and Zhang, T. Margin based active learning. Learning Theory, pp. 35-50, 2007.
Brightwell, G. and Winkler, P. Counting linear extensions is #P-complete. In Proceedings of the twenty-third annual ACM symposium on Theory of computing, STOC '91, pp. 175-181, 1991.
Cohn, D., Atlas, L., and Ladner, R. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.
Dasgupta, S. Analysis of a greedy active learning strategy. Advances in neural information processing systems, 17:337-344, 2005.
Dasgupta, S. Coarse sample complexity bounds for active learning. Advances in neural information processing systems, 18:235, 2006.
Dasgupta, S., Kalai, A., and Monteleoni, C. Analysis of perceptron-based active learning. Learning Theory, pp. 889-905, 2005.
Dasgupta, S., Hsu, D., and Monteleoni, C. A general agnostic active learning algorithm. Advances in neural information processing systems, 20:353-360, 2007.
El-Yaniv, R. and Wiener, Y. Active learning via perfect selective classification. The Journal of Machine Learning Research, 13:255-279, 2012.
Freund, Y., Seung, H.S., Shamir, E., and Tishby, N. Selective sampling using the query by committee algorithm. Machine Learning, 28(2):133-168, 1997.
Friedman, E. Active learning for smooth problems. In Proceedings of the 22nd Conference on Learning Theory, volume 1, pp. 32, 2009.
Gilad-Bachrach, R., Navot, A., and Tishby, N. Query by committee made real. Advances in Neural Information Processing Systems (NIPS), 19, 2005.
Golovin, D. and Krause, A. Adaptive submodularity: A new approach to active learning and stochastic optimization. In Proceedings of International Conference on Learning Theory (COLT), 2010.
Gonen, A., Sabato, S., and Shalev-Shwartz, S. Efficient pool-based active learning of halfspaces. CoRR, abs/1208.3561, 2012.
Hanneke, S. A bound on the label complexity of agnostic active learning. In ICML, 2007.
Hanneke, S. Rates of convergence in active learning. The Annals of Statistics, 39(1):333-361, 2011.
Håstad, J. On the size of weights for threshold gates. SIAM Journal on Discrete Mathematics, 7:484, 1994.
Kannan, R., Lovasz, L., and Simonovits, M. Random walks and an O*(n^5) volume algorithm for convex bodies. Random Structures and Algorithms, 11(1):1-50, 1997.
Lovasz, L. Hit-and-run mixes fast. Mathematical Programming, 86(3):443-461, 1999.
Matoušek, J. Lectures on discrete geometry, volume 212. Springer Verlag, 2002.
McCallum, A. and Nigam, K. Employing EM in pool-based active learning for text classification. In Proceedings of ICML-98, 15th International Conference on Machine Learning, pp. 350-358, 1998.
Muroga, S., Toda, I., and Takasu, S. Theory of majority decision elements. Journal of the Franklin Institute, 271(5):376-418, 1961.
Seung, H.S., Opper, M., and Sompolinsky, H. Query by committee. In Proceedings of the fifth annual workshop on Computational learning theory, pp. 287-294. ACM, 1992.
Tong, S. and Koller, D. Support vector machine active learning with applications to text classification. The Journal of Machine Learning Research, 2:45-66, 2002.
-----0
Agarwal, D. and Chen, B.-C. fLDA: Matrix factorization through latent Dirichlet allocation. In Proceedings of WSDM, pp. 91-100, 2010.
Albert, J. H. and Chib, S. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669-679, 1993.
Beal, M. J. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, The Gatsby Computational Neuroscience Unit, University College London, 2003.
Ben-Hur, A. and Noble, W. S. Kernel methods for predicting protein-protein interactions. Bioinformatics, 21(Suppl. 1):i38-i46, 2005.
Cruciani, G., Pastor, M., and Guba, W. VolSurf: A new tool for the pharmacokinetic optimization of lead compounds. European Journal of Pharmaceutical Sciences, 11(Suppl. 2):S29-S39, 2000.
Damoulas, T. and Girolami, M. A. Probabilistic multiclass multi-kernel learning: On protein fold recognition and remote homology detection. Bioinformatics, 24(10):1264-1270, 2008.
Duran, A., Martinez, G. C., and Pastor, M. Development and validation of AMANDA, a new algorithm for selecting highly relevant regions in Molecular Interaction Fields. Journal of Chemical Information and Modeling, 48(9):1813-1823, 2008.
Elisseeff, A. and Weston, J. A kernel method for multilabelled classification. In Proceedings of NIPS, pp. 681-687, 2002.
Girolami, M. and Rogers, S. Hierarchic Bayesian models for kernel learning. In Proceedings of ICML, pp. 241-248, 2005.
Gonen, M. Predicting drug-target interactions from chemical and genomic kernels using Bayesian matrix factorization. Bioinformatics, 28(18):2304-2310, 2012a.
Gonen, M. Bayesian efficient multiple kernel learning. In Proceedings of ICML, pp. 1-8, 2012b.
Gonen, M. and Alpaydin, E. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12(Jul):2211-2268, 2011.
Khan, S. A., Faisal, A., Mpindi, J. P., Parkkinen, J. A., Kalliokoski, T., Poso, A., Kallioniemi, O. P., Wennerberg, K., and Kaski, S. Comprehensive data-driven analysis of the impact of chemoinformatic structure on the genome-wide biological response profiles of cancer cells to 1159 drugs. BMC Bioinformatics, 13(112), 2012.
Lawrence, N. D. and Jordan, M. I. Semi-supervised learning via Gaussian processes. In Proceedings of NIPS, pp. 753-760, 2005.
Lawrence, N. D. and Urtasun, R. Non-linear matrix factorization with Gaussian processes. In Proceedings of ICML, pp. 601-608, 2009.
Ma, H., Yang, H., Lyu, M. R., and King, I. SoRec: Social recommendation using probabilistic matrix factorization. In Proceedings of CIKM, pp. 931-940, 2008.
Neal, R. M. Bayesian Learning for Neural Networks. Springer, New York, NY, 1996.
Petterson, J. and Caetano, T. Reverse multi-label learning. In Proceedings of NIPS, pp. 1912-1920, 2010.
Salakhutdinov, R. and Mnih, A. Probabilistic matrix factorization. In Proceedings of NIPS, pp. 1257-1264, 2008a.
Salakhutdinov, R. and Mnih, A. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proceedings of ICML, pp. 880-887, 2008b.
Scholkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2002.
Scholkopf, B., Tsuda, K., and Vert, J.-P. (eds.). Kernel Methods in Computational Biology. MIT Press, Cambridge, MA, 2004.
Shan, H. and Banerjee, A. Generalized probabilistic matrix factorizations for collaborative filtering. In Proceedings of ICDM, pp. 1025-1030, 2010.
Srebro, N. Learning with Matrix Factorizations. PhD thesis, Massachusetts Institute of Technology, 2004.
Tang, L., Chen, J., and Ye, J. On multiple kernel learning with multiple labels. In Proceedings of IJCAI, pp. 1255-1260, 2009.
Wang, C. and Blei, D. M. Collaborative topic modeling for recommending scientific articles. In Proceedings of KDD, pp. 448-456, 2011.
Yamanishi, Y. Supervised bipartite graph inference. In Proceedings of NIPS, pp. 1841-1848, 2009.
Yamanishi, Y., Araki, M., Gutteridge, A., Honda, W., and Kanehisa, M. Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics, 24:i232-i240, 2008.
Yamanishi, Y., Kotera, M., Kanehisa, M., and Goto, S. Drug-target interaction prediction from chemical, genomic and pharmacological data in an integrated framework. Bioinformatics, 26:i246-i254, 2010.
Zhang, M.-L. and Zhou, Z.-H. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038-2048, 2007.
Zhang, W., Xue, X., Fan, J., Huang, X., Wu, B., and Liu, M. Multi-kernel multi-label learning with max-margin concept network. In Proceedings of IJCAI, pp. 1615-1620, 2012.
Zhou, T., Shan, H., Banerjee, A., and Sapiro, G. Kernelized probabilistic matrix factorization: Exploiting graphs and side information. In Proceedings of SDM, pp. 403-414, 2012.
-----0
Ben-David, S., Blitzer, J., Crammer, K., and Pereira, F. Analysis of representations for domain adaptation. In NIPS, 2007.
Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Wortman, J. A theory of learning from different domains. Machine Learning, 79:151–175, 2010.
Bergamo, A. and Torresani, L. Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In NIPS, 2010.
Blitzer, J., McDonald, R., and Pereira, F. Domain adaptation with structural correspondence learning. In EMNLP, 2006.
Blitzer, J., Dredze, M., and Pereira, F. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL, 2007.
Blitzer, J., Foster, D., and Kakade, S. Domain adaptation with coupled subspaces. In AISTATS, 2011.
Chen, M., Weinberger, K.Q., and Blitzer, J.C. Co-training for domain adaptation. In NIPS, 2011.
Daume, H. Frustratingly easy domain adaptation. In ACL, 2007.
Daume, H. and Marcu, D. Domain adaptation for statistical classifiers. JAIR, 26(1):101–126, 2006.
Daume, H., Kumar, A., and Saha, A. Co-regularization based semi-supervised domain adaptation. In NIPS, 2010.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
Gong, B., Shi, Y., Sha, F., and Grauman, K. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, 2012.
Gopalan, R., Li, R., and Chellappa, R. Domain adaptation for object recognition: An unsupervised approach. In ICCV, 2011.
Gretton, A., Borgwardt, K., Rasch, M., Scholkopf, B., and Smola, A. A kernel method for the two-sample-problem. In NIPS, 2006.
Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., and Scholkopf, B. Covariate shift by kernel mean matching. In Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N.D. (eds.), Dataset Shift in Machine Learning. MIT Press, 2009.
Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. Technical report, California Institute of Technology, 2007.
Huang, J., Smola, A.J., Gretton, A., Borgwardt, K.M., and Scholkopf, B. Correcting sample selection bias by unlabeled data. In NIPS, 2007.
Kulis, B., Saenko, K., and Darrell, T. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In CVPR, 2011.
Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. El, and Jordan, M. Learning the kernel matrix with semidefinite programming. JMLR, 5:27–72, 2004.
Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation: Learning bounds and algorithms. In COLT, 2009a.
Mansour, Y., Mohri, M., and Rostamizadeh, A. Multiple source adaptation and the renyi divergence. In UAI, 2009b.
Pan, S.J. and Yang, Q. A survey on transfer learning. Knowledge and Data Engineering, 22(10):1345–1359, 2010.
Pan, S.J., Tsang, I.W., Kwok, J.T., and Yang, Q. Domain adaptation via transfer component analysis. IEEE Trans. NN, (99):1–12, 2009.
Perronnin, F., Sánchez, J., and Liu, Y. Large-scale image categorization with explicit data embedding. In CVPR, 2010.
Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N.D. Dataset shift in machine learning. The MIT Press, 2009.
Russell, B. C., Torralba, A., Murphy, K. P., and Freeman, W. T. LabelMe: a database and web-based tool for image annotation. IJCV, 77:157–173, 2008.
Saenko, K., Kulis, B., Fritz, M., and Darrell, T. Adapting visual category models to new domains. In ECCV, 2010.
Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
Torralba, A. and Efros, A.A. Unbiased look at dataset bias. In CVPR, 2011.
-----0
Barzilai, J. and Borwein, J.M. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141–148, 1988.
Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
Candes, E.J., Wakin, M.B., and Boyd, S.P. Enhancing sparsity by reweighted ℓ1 minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, 2008.
Combettes, P.L. and Wajs, V.R. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4):1168–1200, 2005.
Daubechies, I., Defrise, M., and De Mol, C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on pure and applied mathematics, 57(11):1413–1457, 2004.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. Least angle regression. The Annals of statistics, 32(2):407–499, 2004.
Fan, J. and Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
Foucart, S. and Lai, M.J. Sparsest solutions of underdetermined linear systems via ℓq-minimization for 0 < q ≤ 1. Applied and Computational Harmonic Analysis, 26(3):395–407, 2009.
Gasso, G., Rakotomamonjy, A., and Canu, S. Recovering sparse signals with a certain family of nonconvex penalties and dc programming. IEEE Transactions on Signal Processing, 57(12):4686–4698, 2009.
Geman, D. and Yang, C. Nonlinear image recovery with half-quadratic regularization. IEEE Transactions on Image Processing, 4(7):932–946, 1995.
Gong, P., Ye, J., and Zhang, C. Multi-stage multi-task feature learning. In NIPS, pp. 1997–2005, 2012a.
Gong, P., Ye, J., and Zhang, C. Robust multi-task feature learning. In SIGKDD, pp. 895–903, 2012b.
Gong, P., Zhang, C., Lu, Z., Huang, J., and Ye, J. GIST: General Iterative Shrinkage and Thresholding for Non-convex Sparse Learning. Tsinghua University, 2013. URL http://www.public.asu.edu/~jye02/Software/GIST.
Grippo, L. and Sciandrone, M. Nonmonotone globalization techniques for the Barzilai-Borwein gradient method. Computational Optimization and Applications, 23(2):143–169, 2002.
Grippo, L., Lampariello, F., and Lucidi, S. A nonmonotone line search technique for Newton's method. SIAM Journal on Numerical Analysis, 23(4):707–716, 1986.
Hale, E.T., Yin, W., and Zhang, Y. A fixed-point continuation method for l1-regularized minimization with applications to compressed sensing. CAAM TR07-07, Rice University, 2007.
Hunter, D.R. and Lange, K. Quantile regression via an mm algorithm. Journal of Computational and Graphical Statistics, 9(1):60–77, 2000.
Lin, C.J., Weng, R.C., and Keerthi, S.S. Trust region newton method for logistic regression. Journal of Machine Learning Research, 9:627–650, 2008.
Liu, J., Ji, S., and Ye, J. SLEP: Sparse Learning with Efficient Projections. Arizona State University, 2009. URL http://www.public.asu.edu/~jye02/Software/SLEP.
Lu, Z. Iterative reweighted minimization methods for ℓp regularized unconstrained nonlinear programming. arXiv preprint arXiv:1210.0066, 2012a.
Lu, Z. Sequential convex programming methods for a class of structured nonlinear programming. arXiv preprint arXiv:1210.3039, 2012b.
Shevade, S.K. and Keerthi, S.S. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19(17):2246–2253, 2003.
Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.
Toland, JF. A duality principle for non-convex optimisation and the calculus of variations. Archive for Rational Mechanics and Analysis, 71(1):41–61, 1979.
Trzasko, J. and Manduca, A. Relaxed conditions for sparse signal recovery with general concave priors. IEEE Transactions on Signal Processing, 57(11):4347–4354, 2009.
Wright, J., Yang, A.Y., Ganesh, A., Sastry, S.S., and Ma, Y. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2):210–227, 2008.
Wright, S.J., Nowak, R., and Figueiredo, M. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.
Ye, J. and Liu, J. Sparse methods for biomedical data. ACM SIGKDD Explorations Newsletter, 14(1):4–15, 2012.
Yuille, A.L. and Rangarajan, A. The concave-convex procedure. Neural Computation, 15(4):915–936, 2003.
Zhang, C.H. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010a.
Zhang, T. Analysis of multi-stage convex relaxation for sparse regularization. Journal of Machine Learning Research, 11:1081–1107, 2010b.
Zhang, T. Multi-stage convex relaxation for feature selection. Bernoulli, 2012.
-----0
Bastien, Frederic, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian, Bergeron, Arnaud, Bouchard, Nicolas, and Bengio, Yoshua. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
Bergstra, James, Breuleux, Olivier, Bastien, Frederic, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation.
Breiman, Leo. Bagging predictors. Machine Learning, 24(2):123–140, 1994.
Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. Deep big simple neural nets for handwritten digit recognition. Neural Computation, 22:1–14, 2010.
Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Deep sparse rectifier neural networks. In JMLR W&CP: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011), April 2011.
Goodfellow, Ian J., Courville, Aaron, and Bengio, Yoshua. Joint training of deep Boltzmann machines for classification. In International Conference on Learning Representations: Workshops Track, 2013.
Hahnloser, Richard H. R. On the piecewise analysis of networks of linear threshold neurons. Neural Networks, 11(4):691–697, 1998.
Hinton, Geoffrey E., Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580, 2012.
Jarrett, Kevin, Kavukcuoglu, Koray, Ranzato, Marc'Aurelio, and LeCun, Yann. What is the best multi-stage architecture for object recognition? In Proc. International Conference on Computer Vision (ICCV'09), pp. 2146–2153. IEEE, 2009.
Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS 2012), 2012.
LeCun, Yann, Bottou, Leon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
Malinowski, Mateusz and Fritz, Mario. Learnable pooling regions for image classification. In International Conference on Learning Representations: Workshop track, 2013.
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. Deep Learning and Unsupervised Feature Learning Workshop, NIPS, 2011.
Rifai, Salah, Dauphin, Yann, Vincent, Pascal, Bengio, Yoshua, and Muller, Xavier. The manifold tangent classifier. In NIPS2011, 2011. Student paper award.
Salakhutdinov, R. and Hinton, G.E. Deep Boltzmann machines. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS 2009), volume 8, 2009.
Salinas, E. and Abbott, L. F. A model of multiplicative neural responses in parietal cortex. Proc Natl Acad Sci U S A, 93(21):11956–11961, October 1996.
Sermanet, Pierre, Chintala, Soumith, and LeCun, Yann. Convolutional neural networks applied to house numbers digit classification. CoRR, abs/1204.3968, 2012a.
Sermanet, Pierre, Chintala, Soumith, and LeCun, Yann. Convolutional neural networks applied to house numbers digit classification. In International Conference on Pattern Recognition (ICPR 2012), 2012b.
Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan Prescott. Practical bayesian optimization of machine learning algorithms. In Neural Information Processing Systems, 2012.
Srebro, Nathan and Shraibman, Adi. Rank, trace-norm and max-norm. In Proceedings of the 18th Annual Conference on Learning Theory, pp. 545–560. Springer-Verlag, 2005.
Srivastava, Nitish. Improving neural networks with dropout. Master's thesis, U. Toronto, 2013.
Wang, Shuning. General constructive representations for continuous piecewise-linear functions. IEEE Trans. Circuits Systems, 51(9):1889–1896, 2004.
Yu, Dong and Deng, Li. Deep convex net: A scalable architecture for speech pattern classification. In INTERSPEECH, pp. 2285–2288, 2011.
Zeiler, Matthew D. and Fergus, Rob. Stochastic pooling for regularization of deep convolutional neural networks. In International Conference on Learning Representations, 2013.
-----0
Blei, D.M. and Lafferty, J.D. Dynamic topic models. In ICML, pp. 113–120. ACM, 2006.
Bottou, L. Online learning and stochastic approximations. On-line learning in neural networks, 1998.
Bottou, L. Large-scale machine learning with stochastic gradient descent. In Compstat, 2010.
Bouchard, G. Efficient bounds for the softmax function, applications to inference in hybrid models. 2007.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 2011.
Darroch, J.N. and Ratcliff, D. Generalized iterative scaling for log-linear models. The annals of mathematical statistics, 43(5):1470–1480, 1972.
Daume III, H. Notes on cg and lm-bfgs optimization of logistic regression. 2004.
Holland, P.W. and Welsch, R.E. Robust regression using iteratively reweighted least-squares. CSTM, 1977.
Hsiung, K.L., Kim, S.J., and Boyd, S. Tractable approximate robust geometric programming. Optimization and Engineering, 9(2):95–118, 2008.
Kirkwood, B.R. et al. Essentials of medical statistics. Blackwell Scientific Publications, 1988.
Lin, C.J., Weng, R.C., and Keerthi, S.S. Trust region newton method for logistic regression. The Journal of Machine Learning Research, 9:627–650, 2008.
Liu, D.C. and Nocedal, J. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1):503–528, 1989.
Marlin, B., Khan, M.E., and Murphy, K. Piecewise bounds for estimating bernoulli-logistic latent gaussian models. In ICML, 2011.
Minka, T.P. A comparison of numerical optimizers for logistic regression. Unpublished draft, 2003.
Rust, R.T. and Zahorik, A.J. Customer satisfaction, customer retention, and market share. Journal of retailing, 69(2):193–215, 1993.
Schraudolph, N., Yu, J., and Gunter, S. A stochastic quasi-newton method for online convex optimization. 2007.
Sha, F. and Pereira, F. Shallow parsing with conditional random fields. In NAACL, pp. 134–141. Association for Computational Linguistics, 2003.
Shanno, D.F. On broyden-fletcher-goldfarb-shanno method. Journal of Optimization Theory and Applications, 46(1):87–94, 1985.
Yu, H.F., Huang, F.L., and Lin, C.J. Dual coordinate descent methods for logistic regression and maximum entropy models. Machine Learning, 85(1):41–75, 2011.
Zhang, T. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In ICML, pp. 116. ACM, 2004.
-----0
Ai, A., Lapanowski, A., Plan, Y., and Vershynin, R. One-bit compressed sensing with non-gaussian measurements. arXiv preprint arXiv:1208.6279, 2012.
Baraniuk, Richard G., Cevher, Volkan, Duarte, Marco F., and Hegde, Chinmay. Model-based compressive sensing. IEEE Transactions on Information Theory, 56(4):1982–2001, 2010.
Berinde, Radu and Indyk, Piotr. Sparse recovery using sparse random matrices, 2008.
Boufounos, Petros and Baraniuk, Richard G. 1-bit compressive sensing. In CISS, pp. 16–21, 2008.
Boufounos, Petros T. Greedy sparse signal reconstruction from sign measurements. In Proceedings of the 43rd Asilomar conference on Signals, systems and computers, pp. 1305–1309, 2009.
Candès, Emmanuel J. and Recht, Benjamin. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, December 2009.
Candès, Emmanuel J. and Tao, Terence. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
Davenport, M.A., Plan, Y., Berg, E., and Wootters, M. 1-bit matrix completion. arXiv preprint arXiv:1209.3672, 2012.
de Wolf, Ronald. Efficient data structures from union-free families of sets. http://homepages.cwi.nl/~rdewolf/unionfree_datastruc.pdf, 2012.
Duarte, M.F., Davenport, M.A., Takhar, D., Laska, J.N., Sun, Ting, Kelly, K.F., and Baraniuk, R.G. Single-pixel imaging via compressive sampling. Signal Processing Magazine, IEEE, 25(2):83–91, March 2008. ISSN 1053-5888. doi: 10.1109/MSP.2007.914730.
Erdos, Peter L., Frankl, Peter, and Furedi, Zoltan. Families of finite sets in which no set is covered by the union of two others. J. Comb. Theory, Ser. A, 33(2):158–166, 1982.
Garg, Rahul and Khandekar, Rohit. Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property. In ICML, 2009.
Haupt, Jarvis and Baraniuk, Richard G. Robust support recovery using sparse compressive sensing matrices. In CISS, pp. 1–6, 2011.
Hsu, D., Kakade, S. M., Langford, J., and Zhang, T. Multi-label prediction via compressed sensing. In Advances in Neural Information Processing Systems, 2009.
Jacques, Laurent, Laska, Jason N., Boufounos, Petros, and Baraniuk, Richard G. Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors. CoRR, abs/1104.3160, 2011.
Jafarpour, Sina, Xu, Weiyu, Hassibi, Babak, and Calderbank, A. Robert. Efficient and robust compressed sensing using optimized expander graphs. IEEE Transactions on Information Theory, 55(9):4299–4308, 2009.
Laska, Jason N. and Baraniuk, Richard G. Regime change: Bit-depth versus measurement-rate in compressive sensing. IEEE Transactions on Signal Processing, 60(7):3496–3505, 2012.
Laska, Jason N., Wen, Zaiwen, Yin, Wotao, and Baraniuk, Richard G. Trust, but verify: Fast and accurate signal recovery from 1-bit compressive measurements. IEEE Transactions on Signal Processing, 59(11):5289–5301, 2011.
Negahban, Sahand, Ravikumar, Pradeep D., Wainwright, Martin J., and Yu, Bin. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In NIPS, pp. 1348–1356, 2009.
Plan, Y. and Vershynin, R. One-bit compressed sensing by linear programming. arXiv preprint arXiv:1109.4299, 2011.
Plan, Yaniv and Vershynin, Roman. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. CoRR, abs/1202.1212, 2012.
Tropp, Joel A. and Gilbert, Anna C. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655–4666, 2007.
Wright, J., Ma, Yi, Mairal, J., Sapiro, G., Huang, T.S., and Yan, Shuicheng. Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE, 98(6):1031–1044, June 2010. ISSN 0018-9219. doi: 10.1109/JPROC.2010.2044470.
-----0
Alon, G., Kroese, D. P., Raviv, T., and Rubinstein, R. Application of the cross-entropy method to the buffer allocation problem in a simulation-based environment. Annals of Operations Research, 134(1):137–151, 2005.
Bertsekas, Dimitri P. and Ioffe, Sergey. Temporal differences-based policy iteration and applications in neuro-dynamic programming. Technical report, MIT, 1996.
Boer, P., Kroese, D.P., Mannor, S., and Rubinstein, R. A tutorial on the cross-entropy method. Annals of Operations Research, 134(1):19–67, 2005.
Chang, H.S., Fu, M.C., Hu, J., and Marcus, S.I. Simulation-based Algorithms for Markov Decision Processes. Springer-Verlag New York, Inc., 2007.
Costa, A., Jones, O.W., and Kroese, D. Convergence properties of the cross-entropy method for discrete optimization. Operations Research Letters, 35(5):573–580, 2007.
Fitzpatrick, J.M. and Grefenstette, J.J. Genetic algorithms in noisy environments. Machine Learning, 3:101–120, 1988.
Goschin, S., Littman, M.L., and Ackley, D.H. The effects of selection on noisy fitness optimization. In Genetic and Evolutionary Computation Conference (GECCO), 2011.
Hansen, N. and Ostermeier, A. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.
Kalyanakrishnan, S. and Stone, P. An empirical analysis of value function-based and policy search reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, 2009.
Mannor, S., Rubinstein, R., and Gat, Y. The cross entropy method for fast policy search. In International Conference on Machine Learning, 2003.
Mannor, S., Peleg, D., and Rubinstein, R. The cross entropy method for classification. In International Conference on Machine learning, 2005.
Margolin, L. On the convergence of the cross-entropy method. Annals of Operations Research, 134:201–214, 2005.
Rubinstein, R. Optimization of computer simulation models with rare events. European Journal of Operations Research, 99:89–112, 1996.
Rubinstein, R. The cross-entropy method for combinatorial and continuous optimization. Methodology And Computing In Applied Probability, 1:127–190, 1999.
Rubinstein, R. and Kroese, D.P. The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning. Springer, 2004.
Stulp, F. and Sigaud, O. Path integral policy improvement with covariance matrix adaptation. In International Conference on Machine learning, 2012.
Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. MIT Press, 1998.
Szita, I. and Lorincz, A. Learning Tetris using the noisy cross-entropy method. Neural Computation, 18(12):2936–2941, 2006.
Szita, I. and Szepesvari, C. SZ-Tetris as a benchmark for studying key problems of reinforcement learning. In ICML 2010 Workshop on Machine Learning and Games, 2010a.
Szita, I. and Szepesvari, C. sztetris-rl Library. http://code.google.com/p/sztetris-rl/, 2010b.
Vose, M.D. The Simple Genetic Algorithm: Foundations and Theory. MIT Press, Cambridge, MA, 1998.
-----0
Aberdeen, D., Buffet, O., and Thomas, O. Policy-gradients for PSRs and POMDPs. In Proceedings of international conference on artificial intelligence and statistics, 2007.
Baxter, J. and Bartlett, P.L. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350, 2001.
Boots, B., Siddiqi, S.M., and Gordon, G.J. Closing the learning-planning loop with predictive state representations. In Proceedings of the 9th international conference on autonomous agents and multiagent systems, pp. 1369–1370, 2010.
Chaves, M., Eissing, T., and Allgower, F. Bistable biological systems: A characterization through local compact input-to-state stability. Automatic Control, IEEE Transactions on, 53(Special Issue):87–100, 2008.
Faigle, U. and Schonhuth, A. Asymptotic mean stationarity of sources with finite evolution dimension. Information Theory, IEEE Transactions on, 53(7):2342–2348, 2007.
Gray, R.M. Probability, random processes, and ergodic properties. Springer, 2009.
Gray, R.M. and Kieffer, JC. Asymptotically mean stationary measures. The Annals of Probability, 8(5):962–973, 1980.
Grinberg, Y. and Precup, D. On average reward policy evaluation in infinite-state partially observable systems. In Proceedings of international conference on artificial intelligence and statistics, 2012.
Izadi, M.T. and Precup, D. A planning algorithm for predictive state representations. In Proceedings of the 18th international joint conference on artificial intelligence, pp. 1520–1521. Morgan Kaufmann Publishers Inc., 2003.
Jaeger, H. Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371–1398, 2000.
James, M.R. Using predictions for planning and modeling in stochastic environments. PhD thesis, The University of Michigan, 2005.
James, M.R., Singh, S., and Littman, M.L. Planning with predictive state representations. In The 2004 international conference on machine learning and applications, 2004.
Kaelbling, L.P., Littman, M.L., and Cassandra, A.R. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1):99–134, 1998.
Littman, M.L., Sutton, R.S., and Singh, S. Predictive representations of state. Advances in neural information processing systems, 14:1555–1561, 2001.
Peters, J. and Schaal, S. Natural actor-critic. Neurocomputing, 71(7):1180–1190, 2008.
Rosencrantz, M., Gordon, G., and Thrun, S. Learning low dimensional predictive representations. In Proceedings of the twenty-first international conference on machine learning, pp. 88, 2004.
Schweitzer, P.J. Perturbation theory and finite Markov chains. Journal of Applied Probability, pp. 401–413, 1968.
Singh, S., James, M.R., and Rudary, M.R. Predictive state representations: A new theory for modeling dynamical systems. In Proceedings of the 20th conference on uncertainty in artificial intelligence, pp. 512–519. AUAI Press, 2004.
Singh, S.P., Jaakkola, T., and Jordan, M.I. Learning without state-estimation in partially observable Markovian decision processes. In Proceedings of the eleventh international conference on machine learning, volume 31, pp. 37. Citeseer, 1994.
Sutton, R. http://www.incompleteideas.net/sutton/book/errata.html, 1998.
Wiewiora, E.W. Modeling probability distributions with predictive state representations. PhD thesis, The University of California at San Diego, 2007.
Yu, H. and Bertsekas, D.P. On near optimality of the set of finite-state controllers for average cost POMDP. Mathematics of Operations Research, 33(1):1–11, 2008.
-----0
Bruckner, M., Kanzow, C., and Scheffer, T. Static prediction games for adversarial learning problems. Journal of Machine Learning Research, 13:2589–2626, 2012.
Carlson, D. A. The existence and uniqueness of equilibria in convex games with strategies in Hilbert spaces. Advances in Dynamic Games and Applications, 6:79–97, 2001.
El Ghaoui, L., Lanckriet, G., and Natsoulis, G. Robust classification with interval data. Technical Report UCB/CSD-03-1279, Computer Science Division (EECS), University of California, 2003.
Freund, Y. and Schapire, R. Game theory, on-line prediction and boosting. In Proceedings of the 9th Annual Conference on Computational Learning Theory, 1996.
Globerson, A. and Roweis, S. Nightmare at test time: Robust learning by feature deletion. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
Harsanyi, J. C. Games with incomplete information played by Bayesian players, i-iii. part i. the basic model. Management Science, 14(3):159–182, 1967.
Harsanyi, J. C. Games with incomplete information played by Bayesian players, i-iii. part ii. Bayesian equilibrium points. Management Science, 14(5):320–334, 1968.
Hirsch, M. D., Papadimitriou, C. H., and Vavasis, S. A. Exponential lower bounds for finding Brouwer fix points. Journal of Complexity, 5(4):379–416, 1989.
Lanckriet, G., El Ghaoui, L., Bhattacharyya, C., and Jordan, M. A robust minimax approach to classification. Journal of Machine Learning Research, 3:552–582, 2002.
Sayed, A. H. and Chen, H. A uniqueness result concerning a robust regularized least-squares solution. Systems and Control Letters, 46(5):361–369, 2002.
Teo, C. H., Globerson, A., Roweis, S., and Smola, A. Convex learning with invariances. In Proceedings of the 20th Annual Conference on Neural Information Processing Systems, 2007.
Van der Laan, G. and Talman, A. J. J. On the computation of fixed points in the product space of unit simplices and an application to noncooperative n person games. Mathematics of Operations Research, 7(1):1–13, 1982.
Zinkevich, M., Johanson, M., Bowling, M., and Piccione, C. Regret minimization in games with incomplete information. In Proceedings of the 21st Annual Conference on Neural Information Processing Systems, 2008.
-----0
Aronszajn, N. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.
Berlinet, A. and Thomas-Agnan, C. Reproducing kernel Hilbert spaces in Probability and Statistics. Kluwer, 2004.
Caponnetto, A. and De Vito, E. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
Carmeli, C., De Vito, E., and Toigo, A. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Analysis and Applications, 4(4):377–408, 2006.
Fremlin, D.H. Measure Theory Volume 1: The Irreducible Minimum. Torres Fremlin, 2000.
Fremlin, D.H. Measure Theory Volume 2: Broad Foundations. Torres Fremlin, 2001.
Fremlin, D.H. Measure Theory Volume 4: Topological Measure Spaces. Torres Fremlin, 2003.
Fukumizu, K., Song, L., and Gretton, A. Kernel bayes rule. In NIPS, 2011.
Fukumizu, K., Song, L., and Gretton, A. Kernel bayes rule: Bayesian inference with positive definite kernels. ArXiv, 1009.5736v4, 2012.
Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., and Scholkopf, B. Covariate shift and local learning by distribution matching. In Dataset Shift in Machine Learning. 2009.
Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., and Pontil, M. Conditional mean embeddings as regressors. In ICML, 2012a.
Grunewalder, S., Lever, G., Baldassarre, L., Pontil, M., and Gretton, A. Modelling transition dynamics in MDPs with RKHS embeddings. In ICML, 2012b.
Huang, J., Smola, A. J., Gretton, A., Borgwardt, K., and Scholkopf, B. Correcting sample selection bias by unlabeled data. In NIPS, 2007.
Micchelli, C.A. and Pontil, M.A. On learning vector-valued functions. Neural Computation, 17(1), 2005.
Nishiyama, Y., Boularias, A., Gretton, A., and Fukumizu, K. Hilbert space embeddings of POMDPs. In UAI, 2012.
Smola, A., Gretton, A., Song, L., and Scholkopf, B. A Hilbert space embedding for distributions. In ALT, 2007.
Song, L., Huang, J., Smola, A.J., and Fukumizu, K. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In ICML, 2009.
Song, L., Boots, B., Siddiqi, S. M., Gordon, G. J., and Smola, A. J. Hilbert space embeddings of hidden Markov models. In ICML, 2010.
Song, L., Fukumizu, K., and Gretton, A. Kernel embeddings of conditional distributions. IEEE Signal Processing Magazine, To Appear, 2013.
Sriperumbudur, B., Gretton, A., Fukumizu, K., Lanckriet, G., and Scholkopf, B. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.
Steinwart, I. and Christmann, A. Support Vector Machines. Springer, 2008.
Werner, D. Funktionalanalysis. Springer, 4th edition, 2002.
Yu, Y. and Szepesvari, C. Analysis of kernel mean matching under covariate shift. In ICML, 2012.
-----0
Agarwal, A., Daumé III, H., and Gerber, S. Learning multiple tasks using manifold regularization. Advances in neural information processing systems, 23:46-54, 2010.
Ando, R.K. and Zhang, T. A framework for learning predictive structures from multiple tasks and unlabeled data. The Journal of Machine Learning Research, 6:1817-1853, 2005.
Argyriou, A., Evgeniou, T., and Pontil, M. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.
Baxter, J. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149-198, 2000.
Bi, J., Xiong, T., Yu, S., Dundar, M., and Rao, R. An improved multi-task learning approach with applications in medical diagnosis. Machine Learning and Knowledge Discovery in Databases, pp. 117-132, 2008.
Bishop, Christopher M et al. Pattern recognition and machine learning, volume 4. Springer New York, 2006.
Caruana, R. Multitask learning. Machine Learning, 28(1):41-75, 1997.
Corduneanu, A. and Bishop, C.M. Variational Bayesian model selection for mixture distributions. In Artificial intelligence and Statistics, volume 2001, pp. 27-34. Morgan Kaufmann Waltham, MA, 2001.
Daumé III, H. Bayesian multitask learning with latent hierarchies. In Proc. of the 25th Conference on Uncertainty in Artificial Intelligence, pp. 135-142, 2009.
Ferguson, T.S. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209-230, 1973.
Gilks, Walter R and Wild, Pascal. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, pp. 337-348, 1992.
Gilks, W.R., Richardson, S., and Spiegelhalter, D. Markov Chain Monte Carlo in practice: interdisciplinary statistics, volume 2. Chapman & Hall/CRC, 1995.
Gupta, S.K., Phung, D., Adams, B., and Venkatesh, S. A Bayesian framework for learning shared and individual subspaces from multiple data sources. In Advances in Knowledge Discovery and Data Mining, 15th Pacific-Asia Conference, pp. 136-147, 2011.
Gupta, S.K., Phung, D., and Venkatesh, S. A Bayesian nonparametric joint factor model for learning shared and individual subspaces from multiple data sources. In Proceedings of 12th SIAM International Conference on Data Mining, pp. 200-211, 2012a.
Gupta, S.K., Phung, D., and Venkatesh, S. A slice sampler for restricted hierarchical beta process with applications to shared subspace learning. In Proc. of 28th Uncertainty in Artificial Intelligence (UAI), pp. 316-325, 2012b.
Gupta, S.K., Phung, D., Adams, B., and Venkatesh, S. Regularized nonnegative shared subspace learning. Data Mining and Knowledge Discovery, 26(1):57-97, 2013.
Kang, Z., Grauman, K., and Sha, F. Learning with whom to share in multi-task feature learning. In Proceedings of the 28th International Conference on Machine Learning, pp. 521-528, 2011.
Kingman, J.F.C. Completely random measures. Pacific Journal of Mathematics, 21(1):59-78, 1967.
Kumar, A. and Daumé III, H. Learning task grouping and overlap in multi-task learning. In International Conference on Machine Learning (ICML), 2012.
Passos, A., Rai, P., Wainer, J., and Daumé III, H. Flexible modeling of latent task structures in multitask learning. In International Conference on Machine Learning (ICML), 2012.
Rai, P. and Daumé III, H. Infinite predictor subspace models for multitask learning. Journal of Machine Learning Research Proceedings Track, 9:613-620, 2010.
Rosenstein, M.T., Marx, Z., Kaelbling, L.P., and Dietterich, T.G. To transfer or not to transfer. In NIPS Workshop on Inductive Transfer, volume 10, 2005.
Sethuraman, J. A constructive definition of Dirichlet priors. Statistica Sinica, 4(2):639-650, 1994.
Teh, Y.W., Jordan, M.I., Beal, M.J., and Blei, D.M. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566-1581, 2006.
Teh, Y.W., Görür, D., and Ghahramani, Z. Stick-breaking construction for the Indian buffet process. Journal of Machine Learning Research Proceedings Track, 2:556-563, 2007.
Thibaux, R. and Jordan, M.I. Hierarchical beta processes and the Indian buffet process. Journal of Machine Learning Research Proceedings Track, 2:564-571, 2007.
Wang, X., Zhang, C., and Zhang, Z. Boosted multi-task learning for face verification with applications to web image and video search. In CVPR 2009. IEEE Conference on, pp. 142-149. IEEE, 2009.
Xue, Y., Liao, X., Carin, L., and Krishnapuram, B. Multitask learning for classification with Dirichlet process priors. The Journal of Machine Learning Research, 8:35-63, 2007.
-----0
Bengio, Y. and LeCun, Y. Scaling learning algorithms towards AI. MIT Press, 2007.
Boureau, Y., Ponce, J., and LeCun, Y. A theoretical analysis of feature pooling in vision algorithms. In ICML, 2010.
Bourlard, H. and Kamp, Y. Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4):291-294, 1988.
Hinton, G. E., Osindero, S., and Teh, Y. A fast learning algorithm for deep belief nets. Neural Comput., 18(7):1527-1554, July 2006. ISSN 0899-7667.
Hubel, D. H. and Wiesel, T. N. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160:106-154, 1962.
Imabayashi, E., Matsuda, H., Asada, T., Ohnishi, T., Sakamoto, S., Nakano, S., and Inoue, T. Superiority of 3-dimensional stereotactic surface projection analysis over visual inspection in discrimination of patients with very early Alzheimer's Disease from controls using brain perfusion SPECT. Journal of Nuclear Medicine, 45(9):1450-1457, 2004.
Kloppel, S., Stonnington, C. M., Barnes, J., et al. Accuracy of dementia diagnosis: a direct comparison between radiologists and a computerized method. Brain: A Journal of Neurology, 131(11):2969-2974, 2008.
Krizhevsky, A., Sutskever, I., and Hinton, G. Imagenet classification with deep convolutional neural networks. In NIPS 25, pp. 1106-1114. 2012.
LeCun, Y. and Bengio, Y. Convolutional Networks for Images, Speech and Time Series, pp. 255-258. The MIT Press, 1995.
Matsuda, H. Role of neuroimaging in Alzheimer's Disease, with emphasis on brain perfusion SPECT. Journal of Nuclear Medicine, 48(8):1289-1300, 2007.
Minoshima, S., Frey, K. A., Koeppe, R. A., Foster, N. L., and Kuhl, D. E. A diagnostic approach in Alzheimer's Disease using three-dimensional stereotactic surface projections of Fluorine-18-FDG PET. Journal of Nuclear Medicine, 36(7):1238-1248, 1995.
Olshausen, B. A. Sparse codes and spikes. In Probabilistic Models of the Brain: Perception and Neural Function, pp. 257-272. MIT Press, 2001.
Raina, R., Battle, A., Lee, H., Packer, B., and Ng, A. Y. Self-taught learning: Transfer learning from unlabeled data. In ICML, pp. 759-766, 2007.
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. Efficient learning of sparse representations with an energy-based model. In NIPS, 2006.
Teh, Y. W., Welling, M., Osindero, S., Hinton, G. E., Lee, T., Cardoso, J. F., Oja, E., and Amari, S. Energy-based models for sparse overcomplete representations. JMLR, 4:2003, 2003.
Yang, W., Lui, R. L. M., Gao, J. H., Chan, T. F., Yau, S., Sperling, R. A., and Huang, X. Independent component analysis-based classification of Alzheimer's Disease MRI Data. Journal of Alzheimer's Disease, 24(4):775-783, 2011.
-----0
Angelosante, D., Giannakis, G. B., and Grossi, E. Compressed sensing of time-varying signals. In Int'l Conf. on Dig. Sig. Proc., 2009.
Bain, A. and Crisan, D. Fundamentals of Stochastic Filtering. Springer, 2009.
Banerjee, O., El Ghaoui, L., and d'Aspremont, A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res., 9:485-516, 2008.
Beck, A. and Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex programming. Operations Research Letters, 31:167-175, 2003.
Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput., 15(6):1373-1396, June 2003.
Candès, E., Romberg, J., and Tao, T. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207-1223, 2006.
Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning and Games. Cambridge University Press, New York, 2006.
Cesa-Bianchi, N., Gaillard, P., Lugosi, G., and Stoltz, G. A new look at shifting regret. arXiv:1202.3323, 2012.
Duarte, M. F., Davenport, M. A., Takhar, D., Laska, J. N., Sun, T., Kelly, K. F., and Baraniuk, R. G. Single pixel imaging via compressive sampling. IEEE Sig. Proc. Mag., 25(2):83-91, 2008.
Duchi, J., Shalev-Shwartz, S., Singer, Y., and Tewari, A. Composite objective mirror descent. In Conf. on Learning Theory (COLT), 2010.
Gyorgy, A., Linder, T., and Lugosi, G. Efficient tracking of large classes of experts. IEEE Transactions on Information Theory, 58:6709-6725, November 2012.
Hazan, E. and Seshadhri, C. Efficient learning algorithms for changing environments. In Proc. Int. Conf. on Machine Learning (ICML), pp. 393-400, 2009.
Herbster, M. and Warmuth, M. K. Tracking the best linear predictor. Journal of Machine Learning Research, 35(3):281-309, 2001.
Kolar, M., Song, L., Ahmed, A., and Xing, E. P. Estimating time-varying networks. Annals of Applied Statistics, 4(1):94-123, 2010.
Langford, J., Li, L., and Zhang, T. Sparse online learning via truncated gradient. J. Mach. Learn. Res., 10:777-801, 2009.
Littlestone, N. and Warmuth, M. K. The weighted majority algorithm. Inf. Comput., 108(2):212-261, 1994.
McMahan, B. A unified view of regularized dual averaging and mirror descent with implicit updates. arXiv:1009.3240v2, 2011.
Merhav, N. and Feder, M. Universal prediction. IEEE Trans. Info. Th., 44(6):2124-2147, October 1998.
Nemirovsky, A. S. and Yudin, D. B. Problem complexity and method efficiency in optimization. John Wiley & Sons, New York, 1983.
Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. arXiv:1208.3728, 2012.
Ravikumar, P., Wainwright, M. J., and Lafferty, J. D. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Annals of Statistics, 38:1287-1319, 2010.
Snijders, T. A. B. The statistical evaluation of social network dynamics. Sociological Methodology, 31(1):361-395, 2001.
Theodor, Y. and Shaked, U. Robust discrete-time minimum-variance filtering. IEEE Trans. Sig. Proc., 44(2):181-189, 1996.
Vaswani, N. and Lu, W. Modified-CS: Modifying compressive sensing for problems with partially known support. IEEE Trans. Sig. Proc., 58:4595-4607, 2010.
Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. J. Mach. Learn. Res., 11:2543-2596, 2010.
Xie, L., Soh, Y. C., and de Souza, C. E. Robust Kalman filtering for uncertain discrete-time systems. IEEE Trans. Autom. Control, 39:1310-1314, 1994.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient descent. In Proc. Int. Conf. on Machine Learning (ICML), pp. 928-936, 2003.
-----0
Baraniuk, R. and Wakin, M. Random projections of smooth manifolds. Foundations of Computational Mathematics, 9:51-77, 2009.
Boots, B. and Gordon, G. An online spectral learning algorithm for partially observable dynamical systems. In Association for the Advancement of Artificial Intelligence, 2011.
Boots, B., Siddiqi, S., and Gordon, G. Closing the learning-planning loop with predictive state representations. In Proceedings of Robotics: Science and Systems VI, 2009.
Fard, M.M., Grinberg, Y., Pineau, J., and Precup, D. Compressed least-squares regression on sparse spaces. In Association for the Advancement of Artificial Intelligence, 2012.
Fisher, R.A. and Tippett, L.H.C. Limiting forms of the frequency distribution of the largest or smallest member of a sample. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 24. Cambridge Univ Press, 1928.
Hsu, D., Kakade, S., and Zhang, T. A spectral algorithm for learning hidden Markov models. In Conference on Learning Theory, 2008.
Kaelbling, L., Littman, M., and Cassandra, A. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99-134, 1998.
Maillard, O.A. and Munos, R. Compressed least-squares regression. In Advances in Neural Information Processing Systems, 2009.
Maillard, O.A., Munos, R., et al. Linear regression with random projections. Journal of Machine Learning Research, 2012.
Rabiner, Lawrence R. A tutorial on hidden Markov models and selected applications in speech recognition. In Waibel, Alex and Lee, Kai-Fu (eds.), Readings in speech recognition, pp. 267-296. 1990.
Rosencrantz, M., Gordon, G., and Thrun, S. Learning low dimensional predictive representations. In Proceedings of the twenty-first International Conference on Machine learning, 2004.
Silver, D. and Veness, J. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, 2010.
Singh, S., James, M., and Rudary, M. Predictive state representations: a new theory for modeling dynamical systems. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, 2004.
Veness, J., Ng, K.S., Hutter, M., Uther, W., and Silver, D. A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40, 2011.
Wiewiora, E. Modeling probability distributions with predictive state representations. PhD thesis, University of California at San Diego, 2007.
-----0
Balasubramanian, M. and Schwartz, E.L. The isomap algorithm and topological stability. Science, 295 (5552):77, 2002.
d'Aspremont, A., El Ghaoui, L., Jordan, M.I., and Lanckriet, G.R.G. A direct formulation for sparse PCA using semidefinite programming. Computer Science Division, University of California, 2004.
d'Aspremont, A., Bach, F., and Ghaoui, L.E. Optimal solutions for sparse principal component analysis. The Journal of Machine Learning Research, 9:1269-1294, 2008.
Fan, J., Qi, L., and Tong, X. Penalized least squares estimation with weakly dependent data. 2012.
Han, F. and Liu, H. Semiparametric principal component analysis. In Advances in Neural Information Processing Systems 25, pp. 171-179, 2012.
Han, F. and Liu, H. Principal component analysis on high dimensional complex and noisy data. Technical Report, 2013.
Johnstone, I.M. and Lu, A.Y. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486):682-693, 2009.
Jolliffe, I. Principal component analysis. Wiley Online Library, 2005.
Jolliffe, I.T., Trendafilov, N.T., and Uddin, M. A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics, 12(3):531-547, 2003.
Journee, M., Nesterov, Y., Richtarik, P., and Sepulchre, R. Generalized power method for sparse principal component analysis. The Journal of Machine Learning Research, 11:517-553, 2010.
Kontorovich, L. Measure concentration of strongly mixing processes with applications. PhD thesis, Carnegie Mellon University, 2007.
Kontorovich, L.A. and Ramanan, K. Concentration inequalities for dependent random variables via the martingale method. The Annals of Probability, 36(6):2126-2158, 2008.
Kruskal, W.H. Ordinal measures of association. Journal of the American Statistical Association, 53(284):814-861, 1958.
Liu, H., Lafferty, J., and Wasserman, L. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. The Journal of Machine Learning Research, 10:2295-2328, 2009.
Liu, H., Han, F., Yuan, M., Lafferty, J., and Wasserman, L. High dimensional semiparametric Gaussian copula graphical models. Annals of Statistics, 2012.
Loh, P.L. and Wainwright, M.J. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. Arxiv preprint arXiv:1109.3714, 2011.
Ma, Z. Sparse principal component analysis and iterative thresholding. Arxiv preprint arXiv:1112.2432, 2011.
McDiarmid, C. On the method of bounded differences. Surveys in combinatorics, 141(1):148-188, 1989.
Raskutti, G., Wainwright, M.J., and Yu, B. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory, 57(10):6976-6994, 2011.
Samson, P.M. Concentration of measure inequalities for Markov chains and phi-mixing processes. The Annals of Probability, 28(1):416-461, 2000.
Shen, H. and Huang, J.Z. Sparse principal component analysis via regularized low rank matrix approximation. Journal of multivariate analysis, 99(6):1015-1034, 2008.
Skinner, CJ, Holmes, DJ, and Smith, TMF. The effect of sample design on principal component analysis. Journal of the American Statistical Association, 81(395):789-798, 1986.
Vu, V.Q. and Lei, J. Minimax rates of estimation for sparse PCA in high dimensions. Arxiv preprint arXiv:1202.0786, 2012.
Witten, D.M., Tibshirani, R., and Hastie, T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515-534, 2009.
Yuan, X.T. and Zhang, T. Truncated power method for sparse eigenvalue problems. Arxiv preprint arXiv:1112.2679, 2011.
Zou, H., Hastie, T., and Tibshirani, R. Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265-286, 2006.
-----0
Andersson, J.L.R., Hutton, C., Ashburner, J., Turner, R., and Friston, K. Modeling geometric deformations in EPI time series. Neuroimage, 13(5):903-919, 2001.
Bar-Joseph, Z. Analyzing time series gene expression data. Bioinformatics, 20(16):2493-2503, 2004.
Brüggemann, R. and Lütkepohl, H. Lag selection in subset VAR models with an application to a US monetary system. Econometric Studies: A Festschrift in Honour of Joachim Frohn, LIT-Verlag, Münster, pp. 107-28, 2001.
Cai, T., Liu, W., and Luo, X. A constrained l1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594-607, 2011.
Candes, E. and Tao, T. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313-2351, 2007.
de Waele, S. and Broersen, P.M.T. Order selection for vector autoregressive models. Signal Processing, IEEE Transactions on, 51(2):427-433, 2003.
Goebel, R., Roebroeck, A., Kim, D.S., and Formisano, E. Investigating directed cortical interactions in time-resolved fMRI data using vector autoregressive modeling and Granger causality mapping. Magnetic resonance imaging, 21(10):1251-1261, 2003.
Hamilton, J.D. Time series analysis, volume 2. Cambridge Univ Press, 1994.
Han, F. and Liu, H. Transition matrix estimation in high dimensional time series data. Technical Report, 2013.
Hatemi-J, A. Multivariate tests for autocorrelation in the stable and unstable VAR models. Economic Modelling, 21(4):661-683, 2004.
Hsu, N.J., Hung, H.L., and Chang, Y.M. Subset selection for vector autoregressive processes using lasso. Computational Statistics & Data Analysis, 52(7):3645-3657, 2008.
Huang, T.K. and Schneider, J. Learning autoregressive models from sequence and non-sequence data. Advances in Neural Information Processing Systems, 25, 2011.
Ledoit, O. and Wolf, M. Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. Journal of Empirical Finance, 10(5):603-621, 2003.
Liu, H., Han, F., Yuan, M., Lafferty, J., and Wasserman, L. High dimensional semiparametric Gaussian copula graphical models. Annals of Statistics, 2012.
Lozano, A.C., Abe, N., Liu, Y., and Rosset, S. Grouped graphical Granger modeling for gene expression regulatory networks discovery. Bioinformatics, 25(12):i110-i118, 2009.
Nardi, Y. and Rinaldo, A. Autoregressive process modeling via the lasso procedure. Journal of Multivariate Analysis, 102(3):528-549, 2011.
Roebroeck, A., Formisano, E., Goebel, R., et al. Mapping directed influence over the brain using Granger causality and fMRI. Neuroimage, 25(1):230-242, 2005.
Tsay, R.S. Analysis of financial time series, volume 543. Wiley-Interscience, 2005.
Wang, H., Li, G., and Tsai, C.L. Regression coefficient and autoregressive order shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(1):63-78, 2007.
Yuan, M. High dimensional inverse covariance matrix estimation via linear programming. The Journal of Machine Learning Research, 99:2261-2286, 2010.
-----0
Cheng, Y. and Church, G. M. Biclustering of expression data. Proc Int Conf Intell Syst Mol Biol, 8:93-103, 2000. ISSN 1553-0833.
Davis, Jesse and Goadrich, Mark. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd international conference on Machine learning, ICML '06, pp. 233-240, 2006.
Dhillon, Inderjit S. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '01, pp. 269-274, 2001.
Flach, Peter A. The geometry of ROC space: Understanding machine learning metrics through ROC isometrics. In ICML, pp. 194-201, 2003.
Govaert, G. and Nadif, M. Block clustering with Bernoulli mixture models: Comparison of different approaches. Computational Statistics and Data Analysis, 52:3233-3245, 2008.
Hanczar, B. and Nadif, M. Bagging for biclustering: Application to microarray data. In European Conference on Machine Learning, volume 1, pp. 490-505, 2010.
Hanczar, B. and Nadif, M. Ensemble methods for biclustering tasks. Pattern Recognition, 45(11):3938-3949, 2012.
Handl, Julia, Knowles, Joshua, and Kell, Douglas B. Computational cluster validation in post-genomic data analysis. Bioinformatics, 21:3201-3212, 2005.
Lazzeroni, Laura and Owen, Art. Plaid models for gene expression data. Technical report, Stanford University, 2000.
Lee, Youngrok, Lee, Jeonghwa, and Jun, Chi-Hyuck. Stability-based validation of bicluster solutions. Pattern Recognition, 44:252-264, 2011.
Madeira, S. C. and Oliveira, A. L. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(1):24-45, 2004.
Prelic, Amela, Bleuler, Stefan, Zimmermann, Philip, Wille, Anja, Buhlmann, Peter, Gruissem, Wilhelm, Hennig, Lars, Thiele, Lothar, and Zitzler, Eckart. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics, 22(9):1122-1129, 2006.
-----0
Belkin, M., Niyogi, P., and Sindhwani, V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399-2434, 2006.
Brefeld, U., Gartner, T., Scheffer, T., and Wrobel, S. Efficient co-regularised least squares regression. In Proceedings of the International Conference on Machine Learning (ICML), 2006.
Brouard, C., d'Alché-Buc, F., and Szafranski, M. Semi-supervised penalized output kernel regression for link prediction. In Proceedings of the International Conference on Machine Learning, 2011.
Caponnetto, A., Pontil, M., Micchelli, C., and Ying, Y. Universal multi-task kernels. Journal of Machine Learning Research, 9:1615-1646, 2008.
Carmeli, C., Vito, E. De, and Toigo, A. Vector-valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Analysis and Applications, 4:377-408, 2006.
Christoudias, M., Urtasun, R., and Darrell, T. Bayesian localized multiple kernel learning. Univ. California Berkeley, Berkeley, CA, 2009.
Dinuzzo, F., Ong, C.S., Gehler, P., and Pillonetto, G. Learning output kernels with block coordinate descent. In ICML, 2011.
Fei-Fei, L., Fergus, R., and Perona, P. One-shot learning of object categories. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(4):594-611, April 2006.
Gehler, P. and Nowozin, S. On feature combination for multiclass object classification. In Proceedings of the 12th IEEE International Conference on Computer Vision, 2009.
Kadri, H., Rabaoui, A., Preux, P., Duflos, E., and Rakotomamonjy, A. Functional regularized least squares classification with operator-valued kernels. In Proceedings of the International Conference on Machine Learning (ICML), 2011.
Micchelli, C. A. and Pontil, M. On learning vector-valued functions. Neural Computation, 17:177-204, 2005.
Minh, H.Q. and Sindhwani, V. Vector-valued manifold regularization. In ICML, 2011.
Reisert, M. and Burkhardt, H. Learning equivariant functions with matrix valued kernels. J. Mach. Learn. Res., 8:385-408, 2007.
Rosenberg, D., Sindhwani, V., Bartlett, P., and Niyogi, P. A kernel for semi-supervised learning with multi-view point cloud regularization. IEEE Sig. Proc. Mag., 26(5):145-150, 2009.
Saffari, A., Leistner, C., Godec, M., and Bischof, H. Robust multi-view boosting with priors. In Proceedings of the European Conference on Computer Vision, 2010.
Scholkopf, B. and Smola, A. Learning with kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, Cambridge, 2002.
Shawe-Taylor, J. and Cristianini, N. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
Sindhwani, V. and Rosenberg, D. An RKHS for multiview learning and manifold co-regularization. In Proceedings of the International Conference on Machine Learning (ICML), 2008.
Vedaldi, A., Gulshan, V., Varma, M., and Zisserman, A. Multiple kernels for object detection. In Proceedings of 12th IEEE International Conference on Computer Vision, 2009.
Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. The Caltech-UCSD Birds-200-2011 dataset. 2011. URL http://www.vision.caltech.edu/visipedia/CUB-200-2011.html.
Yang, J., Li, Y., Tian, Y., Duan, L., and Gao, W. Group-sensitive multiple kernel learning for object categorization. In Proceedings of the 12th IEEE International Conference on Computer Vision, 2009.
-----0
Airoldi, E.M., Blei, D.M., Fienberg, S.E., and Xing, E.P. Mixed-membership stochastic blockmodels. JMLR, 9:1981-2014, 2008.
Beal, M.J., Ghahramani, Z., and Rasmussen, C. The infinite hidden Markov model. In Proc. NIPS, 2002.
Crandall, D., Cosley, D., Huttenlocher, D., Kleinberg, J., and Suri, S. Feedback effects between similarity and social influence. In Proc. SIGKDD.
Foulds, J., DuBois, C., Asuncion, A.U., Butts, C.T., and Smyth, P. A dynamic relational infinite feature model for longitudinal social networks. In Proc. AISTATS, April 2011.
Fu, W., Song, L., and Xing, E.P. Dynamic mixed membership stochastic blockmodel for evolving networks. In Proc. ICML, 2009.
Ghahramani, Z. and Jordan, M.I. Factorial hidden Markov models. Machine Learning, 29:245-273, 1997.
Griffiths, T. and Ghahramani, Z. The Indian buffet process: An introduction and review. JMLR, 12:1185-1224, 2011.
Handcock, M. S., Robins, G., Snijders, T., Moody, J., and Besag, J. Assessing degeneracy in statistical models of social networks. J. Am. Statist. Assoc., 76:33-50, 2003.
Hanneke, S., Fu, W., and Xing, E.P. Discrete temporal models of social networks. Electronic Journal of Statistics, 4:585-605, 2010.
Hoff, P., Raftery, A., and Handcock, M. Latent space approaches to social network analysis. J. Am. Statist. Assoc., 97(460):1090-1098, 2002.
Kemp, Charles, Tenenbaum, Joshua B., Griffiths, Thomas L., Yamada, Takeshi, and Ueda, Naonori. Learning systems of concepts with an infinite relational model. In Proc. AAAI, 2006.
Lloyd, J.R., Orbanz, P., Ghahramani, Z., and Roy, D.M. Random function priors for exchangeable arrays with applications to graphs and relational data. In Proc. NIPS, 2012.
Miller, K.T., Griffiths, T.L., and Jordan, M.I. Nonparametric latent feature models for link prediction. In Proc. NIPS, 2009.
Mørup, M., Schmidt, M.N., and Hansen, L.K. Infinite multiple membership relational modeling for complex networks. In Proc. MLSP. IEEE, 2011.
Neal, R.M. Slice sampling. Annals of Statistics, 31(3):705-741, 2003.
Nowicki, K. and Snijders, T. Estimation and prediction for stochastic blockstructures. J. Am. Statist. Assoc., 96(455):1077-1087, 2001.
Palla, K., Knowles, D., and Ghahramani, Z. An infinite latent attribute model for network data. In Proc. ICML, 2012.
Roy, D.M. and Teh, Y.W. The Mondrian process. In Proc. NIPS, 2009.
Sarkar, P. and Moore, A. Dynamic social network analysis using latent space models. SIGKDD Explor. Newsl., 7(2):31-40, December 2005.
Scott, J., Gass, R., Crowcroft, J., Hui, P., Diot, C., and Chaintreau, A. CRAWDAD data set cambridge/haggle (v. 2009-05-29). Downloaded from http://crawdad.cs.dartmouth.edu/cambridge/haggle, May 2009.
Scott, S. Bayesian methods for hidden Markov models: Recursive computing in the 21st century. J. Am. Statist. Assoc., 97(457):337-351, 2002.
Snijders, T. The statistical evaluation of social network dynamics. Sociological Methodology, 31(1):361-395, 2001.
Snijders, T. Statistical methods for network dynamics. In Proc. Scientific Meeting, pp. 281-296. Italian Statistical Society, 2006.
Teh, Y.W., Jordan, M.I., Beal, M.J., and Blei, D.M. Hierarchical Dirichlet processes. J. Am. Statist. Assoc., 101(476):1566-1581, 2006.
Van Gael, J., Teh, Y.W., and Ghahramani, Z. The infinite factorial hidden Markov model. In Proc. NIPS, 2009.
Westveld, A. and Hoff, P. A mixed effects model for longitudinal relational and network data, with applications to international trade and conflict. Ann. Appl. Stat., 5(2A):843-872, 2011.
Xing, E., Fu, W., and Song, L. A state-space mixed membership blockmodel for dynamic network tomography. Ann. Appl. Stat., 4(2):535-566, 2010.
Xu, Z., Tresp, V., Yu, K., and Kriegel, H.P. Learning infinite hidden relational models. In Proc. UAI, 2006.
-----0
Barrett, R., Berry, M., Chan, T.F., Demmel, J., Donato, J., Dongarra, J., Eijkhout, V., Pozo, R., Romine, C., and Van der Vorst, H. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, Philadelphia, PA, 1994.
Bordes, A., Bottou, L., and Gallinari, P. SGD-QN: Careful quasi-Newton stochastic gradient descent. J. of Machine Learning Research, 10:1737-1754, 2009.
Broyden, C.G. A class of methods for solving nonlinear simultaneous equations. Math. Comp., 19(92):577-593, 1965.
Broyden, C.G. A new double-rank minimization algorithm. Notices American Math. Soc, 16:670, 1969.
Byrd, R.H., Chin, G.M., Neveitt, W., and Nocedal, J. On the use of stochastic Hessian information in optimization methods for machine learning. SIAM J. Optimization, 21(3):977-995, 2011.
Davidon, W.C. Variable metric method for minimization. Technical report, Argonne National Laboratories, Ill., 1959.
Dennis, J.E. Jr and Moré, J.J. Quasi-Newton methods, motivation and theory. SIAM Review, pp. 46-89, 1977.
Fletcher, R. A new approach to variable metric algorithms. The Computer Journal, 13(3):317, 1970.
Fletcher, R. and Powell, M.J.D. A rapidly convergent descent method for minimization. The Computer Journal, 6(2):163-168, 1963.
Genz, A. Numerical computation of rectangular bivariate and trivariate normal and t probabilities. Statistics and Computing, 14(3):251-260, 2004.
Goldfarb, D. A family of variable metric updates derived by variational means. Math. Comp., 24(109):23-26, 1970.
Hennig, P. and Kiefel, M. Quasi-Newton methods: a new direction. In Int. Conf. on Machine Learning (ICML), volume 29, 2012.
Hinton, G.E. and Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.
Martens, J. Deep learning via Hessian-free optimization. In International Conference on Machine Learning, 2010.
Martens, J. and Sutskever, I. Learning recurrent neural networks with Hessian-free optimization. In International Conference on Machine Learning, 2011.
Nocedal, J. Updating quasi-Newton matrices with limited storage. Math. Comp., 35(151):773-782, 1980.
Pearlmutter, B.A. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147-160, 1994.
Powell, M.J.D. A new algorithm for unconstrained optimization. In Mangasarian, O. L. and Ritter, K. (eds.), Nonlinear Programming. AP, 1970.
Rasmussen, C.E. and Williams, C.K.I. Gaussian Processes for Machine Learning. MIT Press, 2006.
Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400-407, 1951.
Rumelhart, D.E., Hinton, G.E., and Williams, R.J. Learning representations by back-propagating errors. Nature, 323(6088):533-536, 1986.
Schraudolph, N.N. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723-1738, 2002.
Shanno, D.F. Conditioning of quasi-Newton methods for function minimization. Math. Comp., 24(111):647-656, 1970.
-----0
Albert, Reka, Jeong, Hawoong, and Barabasi, Albert-Laszlo. Internet: Diameter of the World-Wide Web. Nature, 401(6749):130-131, September 1999. ISSN 0028-0836.
Barabasi, A. Emergence of Scaling in Random Networks. Science, 286(5439):509-512, October 1999. ISSN 00368075.
Clauset, A., Moore, C., and Newman, M.E.J. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98-101, 2008.
Cohen, William W. Enron Dataset, 2009. URL http://www.cs.cmu.edu/~enron/.
Collins, Allan M. and Quillian, M. Ross. Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behavior, 8(2):240-247, April 1969. ISSN 00225371.
Doreian, P. Evolution of social networks, volume 1. Routledge, 1997.
Dorogovtsev, S., Mendes, J., and Samukhin, A. Structure of Growing Networks with Preferential Linking. Physical Review Letters, 85(21):4633-4636, November 2000. ISSN 0031-9007.
Dosenbach, N.U.F., Nardos, B., Cohen, A.L., Fair, D.A., Power, J.D., Church, J.A., Nelson, S.M., Wig, G.S., Vogel, A.C., Lessov-Schlaggar, C.N., et al. Prediction of individual brain maturity using fMRI. Science, 329(5997):1358-1361, 2010.
Fortunato, S. Community detection in graphs. Physics Reports, 486(3-5):75-174, 2010.
Fox, M.D. and Raichle, M.E. Spontaneous fluctuations in brain activity observed with functional magnetic resonance imaging. Nature Reviews Neuroscience, 8(9):700-711, 2007.
Herlau, T., Mørup, M., Schmidt, M. N., and Hansen, L. K. Detecting hierarchical structure in networks. Proceedings of Cognitive Information Processing, 2012.
Holland, Paul W., Laskey, Kathryn Blackmond, and Leinhardt, Samuel. Stochastic blockmodels: First steps. Social Networks, 5(2):109-137, 1983.
Holme, Petter and Saramaki, Jari. Temporal networks. Physics Reports, 519(3):97-125, October 2012. ISSN 03701573.
Ishiguro, K., Iwata, T., Ueda, N., and Tenenbaum, J. Dynamic infinite relational model for time-varying relational data analysis. Advances in Neural Information Processing Systems, 23, 2010.
Kemp, Charles, Tenenbaum, Joshua B, Griffiths, Thomas L, Yamada, Takeshi, and Ueda, Naonori. Learning Systems of Concepts with an Infinite Relational Model. In AAAI, pp. 381-388, 2006.
Lewis, Jeff, Poole, Keith, and McCarty, Nolan. US Senate Dataset, 2010. URL http://www.voteview.com/downloads.asp.
Lin, Dahua, Grimson, Eric, and Fisher, John. Construction of Dependent Dirichlet Processes based on Poisson Processes. 2010.
McCullagh, Peter, Pitman, Jim, and Winkel, Matthias. Gibbs fragmentation trees. Bernoulli, 14(4):988-1002, 2008.
McPherson, Miller, Smith-Lovin, Lynn, and Cook, James M. Birds of a Feather: Homophily in Social Networks. Annual Review of Sociology, 27(1):415-444, August 2001. ISSN 0360-0572.
Meunier, D, Lambiotte, R, and Bullmore, E T. Modular and hierarchically modular organization of brain networks. Frontiers in neuroscience, 4, 2010.
Miller, K.T., Griffiths, T.L., and Jordan, M.I. Nonparametric latent feature models for link prediction. Advances in Neural Information Processing Systems (NIPS), pp. 1276-1284, 2009.
Mørup, Morten and Schmidt, Mikkel N. Bayesian community detection. Neural Computation, 24(9):2434-2456, 2012.
Pastor-Satorras, R., Smith, E., Sole, R.V., et al. Evolving protein interaction networks through gene duplication. Journal of Theoretical Biology, 222(2):199-210, 2003.
Peng, Wei and Li, Tao. Temporal relation co-clustering on directional social network and author-topic evolution. Knowledge and Information Systems, 26(3):467-486, March 2010. ISSN 0219-1377.
Perra, N, Goncalves, B, Pastor-Satorras, R, and Vespignani, A. Activity driven modeling of time varying networks. Scientific reports, 2:469, January 2012. ISSN 2045-2322.
Ravasz, E and Barabasi, A L. Hierarchical organization in complex networks. Physical Review E, 67(2):26112, 2003.
Ravasz, E, Somera, A L, Mongru, D A, Oltvai, Z N, and Barabasi, A L. Hierarchical organization of modularity in metabolic networks. Science, 297(5586):1551-1555, 2002.
Redner, S. How popular is your paper? An empirical study of the citation distribution. The European Physical Journal B - Condensed Matter and Complex Systems, 4(2):131-134, 1998.
Rosvall, Martin and Bergstrom, Carl T. Mapping change in large networks. PloS one, 5(1):e8694, January 2010. ISSN 1932-6203.
Roweis, Sam T. NIPS Dataset, 2002. URL http://www.cs.nyu.edu/~roweis/.
Roy, D M and Teh, Y W. The Mondrian Process. In Advances in Neural Information Processing Systems, volume 21, 2009.
Roy, D.M., Kemp, C., Mansinghka, V.K., and Tenenbaum, J.B. Learning annotated hierarchies from relational data. Advances in neural information processing systems, 19:1185, 2007.
Sales-Pardo, M, Guimera, R, Moreira, A A, and Amaral, L A N. Extracting the hierarchical organization of complex systems. Proceedings of the National Academy of Sciences, 104(39):15224, 2007.
Schmidt, M. N., Herlau, T., and Mørup, M. Nonparametric bayesian models of hierarchical structure in complex networks. ArXiv, 2012.
Simon, H A. The architecture of complexity. Proceedings of the American philosophical society, 106(6):467-482, 1962.
Teh, Yee Whye, Blundell, Charles, and Elliott, Lloyd. Modelling Genetic Variations using Fragmentation-Coagulation Processes, 2011.
Vu, Duy, Asuncion, Arthur, Hunter, David, and Smyth, Padhraic. Dynamic Egocentric Models for Citation Networks. In ICML, 2011.
Watts, Duncan J, Dodds, Peter Sheridan, and Newman, M E J. Identity and search in social networks. Science (New York, N.Y.), 296(5571):1302-5, May 2002. ISSN 1095-9203.
-----0
Alon, Noga, Awerbuch, Baruch, Azar, Yossi, Buchbinder, Niv, and Naor, Joseph. A general approach to online network optimization problems. In SODA, 2004.
Bansal, Nikhil, Buchbinder, Niv, and Naor, Joseph. A primal-dual randomized algorithm for weighted paging. In FOCS, 2007.
Buchbinder, Niv and Naor, Joseph. Online primal-dual algorithms for covering and packing problems. In ESA, 2005.
Buchbinder, Niv, Jain, Kamal, and Naor, Joseph. Online primal-dual algorithms for maximizing ad-auctions revenue. In ESA, 2007.
Crammer, Koby, Kearns, Michael, and Wortman, Jennifer. Learning from data of variable quality. In NIPS, 2005.
Crammer, Koby, Kearns, Michael, and Wortman, Jennifer. Learning from multiple sources. Journal of Machine Learning Research, 9:1757-1774, 2008.
Dawid, Philip and Skene, Allan. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):20-28, 1979.
Dekel, Ofer and Shamir, Ohad. Vox populi: Collecting high-quality labels from a crowd. In COLT, 2009.
Devanur, Nikhil and Hayes, Thomas. The adwords problem: Online keyword matching with budgeted bidders under random permutations. In ACM EC, 2009.
Devanur, Nikhil, Jain, Kamal, Sivan, Balasubramanian, and Wilkens, Christopher. Near optimal online algorithms and fast approximation algorithms for resource allocation problems. In ACM EC, 2011.
Ghosh, Arpita, Kale, Satyen, and McAfee, Preston. Who moderates the moderators? Crowdsourcing abuse detection in user-generated content. In ACM EC, 2011.
Ho, Chien-Ju and Vaughan, Jennifer Wortman. Online task assignment in crowdsourcing markets. In AAAI, 2012.
Ipeirotis, Panagiotis. Analyzing the Amazon Mechanical Turk marketplace. ACM XRDS, 17(2):16-21, 2010.
Ipeirotis, Panagiotis G., Provost, Foster, and Wang, Jing. Quality management on Amazon Mechanical Turk. In HCOMP, 2010.
Karger, David, Oh, Sewoong, and Shah, Devavrat. Iterative learning for reliable crowdsourcing systems. In NIPS, 2011a.
Karger, David, Oh, Sewoong, and Shah, Devavrat. Budget-optimal task allocation for reliable crowdsourcing systems. CoRR, abs/1110.3564, 2011b.
Kittur, Aniket, Chi, Ed, and Suh, Bongwon. Crowdsourcing user studies with Mechanical Turk. In CHI, 2008.
Liu, Qiang, Peng, Jian, and Ihler, Alexander. Variational inference for crowdsourcing. In NIPS, 2012.
Oleson, Dave, Hester, Vaughn, Sorokin, Alex, Laughlin, Greg, Le, John, and Biewald, Lukas. Programmatic gold: Targeted and scalable quality assurance in crowdsourcing. In HCOMP, 2011.
Sheng, Victor, Provost, Foster, and Ipeirotis, Panagiotis. Get another label? Improving data quality using multiple, noisy labelers. In KDD, 2008.
Tran-Thanh, Long, Stein, Sebastian, Rogers, Alex, and Jennings, Nicholas. Efficient crowdsourcing of unknown experts using multi-armed bandits. In ICAI, 2012.
Wah, Catherine, Branson, Steve, Welinder, Peter, Perona, Pietro, and Belongie, Serge. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
Wais, Paul, Lingamneni, Shivaram, Cook, Duncan, Fennell, Jason, Goldenberg, Benjamin, Lubarov, Daniel, and Martin, David. Towards building a high-quality workforce with mechanical turk. Presented at the NIPS Workshop on Computational Social Science and the Wisdom of Crowds, 2010.
Welinder, Peter, Branson, Steve, Belongie, Serge, and Perona, Pietro. The Multidimensional Wisdom of Crowds. In NIPS, 2010.
Zhou, Dengyong, Basu, Sumit, Mao, Yi, and Platt, John. Learning from the wisdom of crowds by minimax entropy. In NIPS, 2012.
-----0
Absil, P.A., Mahony, R., and Sepulchre, R. Optimization algorithms on matrix manifolds. Princeton Univ Pr, 2008.
Aharon, M., Elad, M., and Bruckstein, A. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. on Signal Processing, 54(11):4311-4322, 2006.
Amari, S. and Nagaoka, H. Methods of Information Geometry. American Mathematical Society, 2007.
Brodatz, P. Textures: a photographic album for artists and designers, volume 66. Dover New York, 1966.
Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.
Chen, T., Rangarajan, A., and Vemuri, B.C. Caviar: Classification via aggregated regression and its application in classifying oasis brain database. In ISBI, 2010.
do Carmo, M. Riemannian Geometry. Birkhauser, 1992.
Edelman, A., Arias, T.A., and Smith, S.T. The geometry of algorithms with orthogonality constraints. Arxiv preprint physics/9806030, 1998.
Elad, M. Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, 2010.
Fletcher, P.T. and Joshi, S. Riemannian geometry for the statistical analysis of diffusion tensor data. Signal Processing, 87(2):250-262, 2007.
Helgason, S. Differential Geometry, Lie Groups, and Symmetric Spaces. American Mathematical Society, 2001.
Huang, J., Zhang, T., and Metaxas, D. Learning with structured sparsity. The Journal of Machine Learning Research, 12:3371-3412, 2011.
Joshi, S., Davis, B., Jomier, M., and Gerig, G. Unbiased diffeomorphic atlas construction for computational anatomy. NeuroImage, 23:S151-S160, 2004.
Lewicki, M.S. and Sejnowski, T.J. Learning overcomplete representations. Neural computation, 12(2):337-365, 2000.
Mairal, J., Bach, F., Ponce, J., and Sapiro, G. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19-60, 2010.
Marcus, D.S., Wang, T.H., Parker, J., Csernansky, J.G., Morris, J.C., and Buckner, R.L. Open access series of imaging studies (oasis): cross-sectional mri data in young, middle aged, nondemented, and demented older adults. Journal of Cognitive Neuroscience, 19(9):1498-1507, 2007.
Mardia, K. and Jupp, P. Directional Statistics. Wiley, 1999.
Mortamet, B., Zeng, D., Gerig, G., Prastawa, M., and Bullitt, E. Effects of healthy aging measured by intracranial compartment volumes using a designed MR brain database. MICCAI, pp. 383-391, 2005.
Nash, J. The imbedding problem for riemannian manifolds. Annals of Mathematics, 63(1):20-63, 1956.
Olshausen, B.A. and Field, D.J. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311-3325, 1997.
Pennec, X., Fillard, P., and Ayache, N. A riemannian framework for tensor computing. International Journal of Computer Vision, 66(1):41-66, 2006.
Rao, C.R. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc., 37(3):81-91, 1945.
Sivalingam, R., Boley, D., Morellas, V., and Papanikolopoulos, N. Tensor sparse coding for region covariances. In ECCV, pp. 722-735, 2010.
Spivak, M. A comprehensive introduction to differential geometry. Publish or perish Berkeley, 1979.
Srivastava, A., Jermyn, I., and Joshi, S. Riemannian analysis of probability density functions with applications in vision. In CVPR, 2007.
Szabo, Z., Poczos, B., and Lorincz, A. Online group-structured dictionary learning. In CVPR, pp. 2865-2872, 2011.
Terras, A. Harmonic Analysis on Symmetric Spaces and Applications. Springer-Verlag, 1985.
Turaga, P., Veeraraghavan, A., and Chellappa, R. Statistical analysis on stiefel and grassmann manifolds with applications in computer vision. In CVPR, 2008.
Tuzel, O., Porikli, F., and Meer, P. Region covariance: A fast descriptor for detection and classification. In ECCV, pp. 589-600, 2006.
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., and Gong, Y. Locality-constrained linear coding for image classification. In CVPR, pp. 3360-3367, 2010.
Yang, J., Yu, K., Gong, Y., and Huang, T. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, pp. 1794-1801, 2009.
Yu, K. and Zhang, T. Improved local coordinate coding using local tangents. In ICML, pp. 1215-1222, 2010.
Yu, K., Zhang, T., and Gong, Y. Nonlinear learning using local coordinate coding. NIPS, 22:2223-2231, 2009.
-----0
Auger, I E and Lawrence, C E. Algorithms for the optimal identification of segment neighborhoods. Bulletin of mathematical biology, 51(1):39-54, 1989.
Bai, J. and Perron, P. Computation and analysis of multiple structural change models. J. Appl. Econ., 18:1-22, 2003.
Baraud, Y., Giraud, C., and Huet, S. Gaussian model selection with unknown variance. Ann. Statist., 37(2):630-672, 2009.
Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183-202, 2009.
Birge, L. and Massart, P. Minimal penalties for gaussian model selection. Probability Th. and Related Fields, 138:33-73, 2007.
Bottou, Leon and Bousquet, Olivier. The tradeoffs of large scale learning. In Platt, J.C., Koller, D., Singer, Y., and Roweis, S. (eds.), NIPS, pp. 161-168, 2008.
Braun, J. V., Braun, R. K., and Muller, H. G. Multiple changepoint fitting via quasilikelihood, with application to DNA sequence segmentation. Biometrika, 87(2):301-314, June 2000.
Flamary, R., Jrad, N., Phlypo, R., Congedo, M., and Rakotomamonjy, A. Mixed-norm regularization for brain decoding. HAL tech. report 00708243, 2012.
Gillet, O., Essid, S., and Richard, G. On the correlation of automatic audio and visual segmentations of music videos. IEEE Trans. Cir. and Sys. for Video Technol., 17(3):347-355, March 2007.
Hall, P., Kay, J. W., and Titterington, D. M. Asymptotically optimal difference-based estimation of variance in nonparametric regression. Biometrika, 77(3):521-528, January 1990.
Harchaoui, Zaid and Levy-Leduc, Celine. Catching change-points with lasso. In Platt, J.C., Koller, D., Singer, Y., and Roweis, S. (eds.), NIPS, pp. 617-624. MIT Press, Cambridge, MA, 2008.
Hocking, T.D., Schleiermacher, G, Janoueix-Lerosey, I., Delattre, O., Bach, F., and Vert, J.-P. Learning smoothing models using breakpoint annotations. HAL technical report 00663790, 2012.
Jackson, B., Scargle, J.D., Barnes, D., Arabhi, S., Alt, A., Gioumousis, P., Gwin, E., San, P., Tan, L., and Tsai, Tun Tao. An algorithm for optimal partitioning of data on an interval. IEEE Signal Processing Letters, 12(2):105-108, February 2005.
Killick, R., Fearnhead, P., and Eckley, I. A. Optimal detection of changepoints with a linear computational cost. arXiv:1101.1438, January 2011.
Lavielle, M. Using penalized contrasts for the changepoint problem. Sig. Proc., 85(8):1501-1510, 2005.
Lebarbier, E. Detecting multiple change-points in the mean of gaussian process by model selection. Signal Processing, 85:717-736, 2005.
Lee, C.-B. Estimating the number of change points in a sequence of independent normal random variables. Statist. Proba. Lett., 25(3):241-8, 1995.
Picard, Franck, Hoebeke, Mark, Lebarbier, Emilie, Miele, Vincent, Rigaill, Guillem, and Robin, Stephane. cghseg: Segmentation methods for array CGH analysis, 2012. R package version 1.0.1.
Rigaill, G. Pruned dynamic programming for optimal multiple change-point detection. arXiv:1004.0887, 2010.
Schwarz, G. Estimating the dimension of a model. Ann. Statist., 6(2):461-4, 1978.
Tibshirani, Robert and Wang, Pei. Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics, 9(1):18-29, January 2008.
Vapnik, V., Golowich, S., and Smola, A. J. Support vector method for function approximation, regression estimation, and signal processing. In Mozer, M. C., Jordan, M. I., and Petsche, T. (eds.), NIPS, pp. 281-287, 1997.
Venkatraman, E S and Olshen, Adam B. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics, 23(6):657-663, March 2007.
Vert, Jean-Philippe and Bleakley, Kevin. Fast detection of multiple change-points shared by many signals using group LARS. In Lafferty, J., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A. (eds.), NIPS, pp. 2343-2351, 2010.
Yao, Y.-C. Estimating the number of change-points via Schwarz criterion. Statistics & Probability Letters, 6(3):181-189, February 1988.
Zhang, N. R. and Siegmund, D. O. A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics, 63(1):22-32, 2007.
-----0
Bousquet, O. and Elisseeff, A. Stability and generalization. The Journal of Machine Learning Research, 2:499-526, 2002.
Bunea, F., Tsybakov, A., and Wegkamp, M. Sparsity oracle inequalities for the lasso. Electronic Journal of Statistics, 1:169-194, 2007.
Chen, S.S., Donoho, D.L., and Saunders, M.A. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33-61, 1998.
Donoho, D.L., Elad, M., and Temlyakov, V.N. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6-18, 2006.
Dumbgen, L., Van De Geer, S.A., Veraar, M.C., and Wellner, J.A. Nemirovski's inequalities revisited. American Mathematical Monthly, 117(2):138-160, 2010.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. Least angle regression. The Annals of Statistics, 32(2):407-499, 2004.
Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1-22, 2010.
Fu, W. and Knight, K. Asymptotics for lasso-type estimators. Annals of Statistics, 28(5):1356-1378, 2000.
Greenshtein, E. and Ritov, Y.A. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10(6):971-988, 2004.
Gyorfi, L., Kohler, M., Krzyżak, A., and Walk, H. A Distribution-Free Theory of Nonparametric Regression. Springer Verlag, 2002.
Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Verlag, 2009.
Leng, C., Lin, Y., and Wahba, G. A note on the lasso and related procedures in model selection. Statistica Sinica, 16(4):1273-1284, 2006.
Meinshausen, N. and Buhlmann, P. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436-1462, 2006.
Meinshausen, N. and Yu, B. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics, 37(1):246-270, 2009.
Nemirovski, A. Topics in non-parametric statistics. Lectures on Probability Theory and Statistics (Saint-Flour, 1998), 85-277. Lecture Notes in Math, 1738:86-277, 2000.
Osborne, M.R., Presnell, B., and Turlach, B.A. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319-337, 2000.
Plutowski, M., Sakata, S., and White, H. Cross-validation estimates IMSE. Advances in Neural Information Processing Systems, 6, 1994.
Rudelson, M. and Vershynin, R. Smallest singular value of a random rectangular matrix. Communications on Pure and Applied Mathematics, 62(12):1707-1739, 2009.
Shao, J. Linear model selection by cross-validation. Journal of the American Statistical Association, 88:486-494, 1993.
Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 58(1):267-288, 1996.
Tibshirani, R. Regression shrinkage and selection via the lasso: A retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(3):273-282, 2011.
Tibshirani, R.J. and Taylor, J. Degrees of freedom in lasso problems. Annals of Statistics, 40:1198-1232, 2012.
van de Geer, S. High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2):614-645, 2008.
van de Geer, S. and Lederer, J. The lasso, correlated design, and improved oracle inequalities, 2011. URL http://arxiv.org/abs/1107.0189.
Wainwright, M.J. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183-2202, 2009.
Wang, H. and Leng, C. Unified lasso estimation by least squares approximation. Journal of the American Statistical Association, 102(479):1039-1048, 2007.
Xu, H., Mannor, S., and Caramanis, C. Sparse algorithms are not stable: A no-free-lunch theorem. In 46th Annual Allerton Conference on Communication, Control, and Computing, pp. 1299-1303. IEEE, 2008.
Zhao, P. and Yu, B. On model selection consistency of Lasso. The Journal of Machine Learning Research, 7:2541-2563, 2006.
Zou, H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418-1429, 2006.
Zou, H., Hastie, T., and Tibshirani, R. On the degrees of freedom of the lasso. The Annals of Statistics, 35(5):2173-2192, 2007.
-----0
Ahmad, I. and Lin, P. A nonparametric estimation of the entropy for absolutely continuous distributions. IEEE Transactions on Information Theory, 1976.
Antos, A., Devroye, L., and Gyorfi, L. Lower bounds for Bayes error estimation. PAMI, 1999.
Beirlant, J., Dudewicz, E., Gyorfi, L., and Van der Meulen, E. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 1997.
Chow, C. and Liu, C. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 1968.
De Campos, L. A scoring function for learning Bayesian networks based on mutual information and conditional independence tests. JMLR, 2006.
Drakopoulos, J. Bounds on the classification error of the nearest neighbor rule. ICML, 1995.
Eggermont, P. and LaRiccia, V. Best asymptotic normality of the kernel density entropy estimator for smooth densities. IEEE Transactions on Information Theory, 1999.
Fralick, S. and Scott, R. Nonparametric Bayes-risk estimation. IEEE Transactions on Information Theory, 1971.
Friedman, N. and Yakhini, Z. On the sample complexity of learning Bayesian networks. UAI, 1997.
Gyorfi, L. The rate of convergence of kn-NN regression estimates and classification rules. IEEE Transactions on Information Theory, 1981.
Hoffgen, K. Learning and robust learning of product distributions. COLT, 1993.
Kohler, M. and Krzyżak, A. Rate of convergence of local averaging plug-in classification rules under margin condition. ISIT, 2006.
Kulkarni, S. and Posner, S. Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Transactions on Information Theory, 1995.
Liu, H., Lafferty, J., and Wasserman, L. Exponential concentration for mutual information estimation with application to forests. NIPS, 2012.
Nguyen, X., Wainwright, M., and Jordan, M. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 2010.
Nock, R. and Sebban, M. An improved bound on the finite-sample risk of the nearest neighbor rule. Pattern Recognition Letters, 2001.
Pal, D., Poczos, B., and Szepesvari, C. Estimation of Renyi entropy and mutual information based on generalized nearest-neighbor graphs. NIPS, 2010.
Paninski, L. Estimation of entropy and mutual information. Neural Computation, 2003.
Paninski, L. and Yajima, M. Undersmoothed kernel entropy estimators. IEEE Transactions on Information Theory, 2008.
Perez-Cruz, F. Estimation of information theoretic measures for continuous random variables. NIPS, 2008.
Poczos, B. and Schneider, J. Nonparametric estimation of conditional information and divergences. AISTATS, 2012.
Stone, C. Consistent nonparametric regression. The Annals of Statistics, 1977.
Tsybakov, A. and Van der Meulen, E. Root-n consistent estimators of entropy for densities with unbounded support. Scandinavian Journal of Statistics, 1996.
Van Es, B. Estimating functionals related to a density by a class of statistics based on spacings. Scandinavian Journal of Statistics, 1992.
Wang, Q., Kulkarni, S., and Verdu, S. Divergence estimation of continuous distributions based on data-dependent partitions. IEEE Transactions on Information Theory, 2005.
Weissman, T., Ordentlich, E., Seroussi, G., Verdu, S., and Weinberger, M. Inequalities for the ℓ1 deviation of the empirical distribution. Technical Report HPL-2003-97, 2003.
-----0
Balle, Borja and Mohri, Mehryar. Spectral learning of general weighted automata via constrained matrix completion. In Advances in Neural Information Processing Systems 25, pp. 2168-2176, 2012.
Balle, Borja, Quattoni, Ariadna, and Carreras, Xavier. Local loss optimization in operator models: A new insight into spectral learning. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 1879-1886, 2012.
Bertsekas, Dimitri P. Nonlinear Programming. Athena Scientific, Belmont, MA 02178-9998, second edition, 1999.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
Cai, J.F., Candès, E.J., and Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956-1982, 2010.
Chen, X., Lin, Q., Kim, S., Carbonell, J.G., and Xing, E.P. Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics, 6(2):719-752, 2012.
Duchi, John, Shalev-Shwartz, Shai, Singer, Yoram, and Chandra, Tushar. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pp. 272-279, 2008.
Hsu, Daniel, Kakade, Sham M., and Zhang, Tong. A spectral algorithm for learning hidden Markov models. In Proceedings of the Twenty-Second Annual Conference on Learning Theory, 2009.
Huang, Tzu-Kuo and Schneider, Jeff. Learning autoregressive models from sequence and non-sequence data. In Advances in Neural Information Processing Systems 24, pp. 1548-1556, 2011.
Nesterov, Y. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.
Rabiner, Lawrence R. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-285, 1989.
Reiss, A. and Stricker, D. Introducing a new benchmarked dataset for activity monitoring. In Wearable Computers (ISWC), 2012 16th International Symposium on, pp. 108-109. IEEE, 2012.
Scholkopf, B., Smola, A., and Muller, K.R. Nonlinear component analysis as a kernel eigenvalue problem. Neural computation, 10(5):1299-1319, 1998.
Siddiqi, Sajid M., Boots, Byron, and Gordon, Geoffrey J. Reduced-rank hidden Markov models. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.
Song, Le, Huang, Jonathan, Smola, Alex, and Fukumizu, Kenji. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proceedings of the 26th International Conference on Machine Learning, 2009.
Song, Le, Boots, Byron, Siddiqi, Sajid, Gordon, Geoffrey, and Smola, Alex. Hilbert space embeddings of hidden Markov models. In Proceedings of the 27th International Conference on Machine Learning, 2010.
-----0
Allman, E. S. and Rhodes, J. A. The identifiability of tree topology for phylogenetic models, including covarion and mixture models. Journal of Computational Biology, 13(5):1101-1113, 2006.
Anandkumar, A., Chaudhuri, K., Hsu, D., Kakade, S., Song, L., and Zhang, T. Spectral methods for learning multivariate latent tree structure. In Neural Information Processing Systems, 2011.
Buneman, P. The recovery of trees from measures of dissimilarity. In Hodson, F.R., Kendall, D.G., and Tautu, P. (eds.), Mathematics in the archaeological and historical sciences, pp. 387-395. Edinburgh University Press, 1971.
Carroll, J. and Chang, J. Analysis of individual differences in multidimensional scaling via an N-way generalization of Eckart-Young decomposition. Psychometrika, 35(3):283-319, 1970.
Choi, M., Tan, V., Anandkumar, A., and Willsky, A. Learning latent tree graphical models. Journal of Machine Learning Research, 12:1771-1812, 2011.
Chow, C. and Liu, C. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14:462-467, 1968.
Erdos, P. L., Szekely, L. A., Steel, M. A., and Warnow, T. J. A few logs suffice to build (almost) all trees: Part II. Theoretical Computer Science, 221:77-118, 1999.
Eriksson, N. Tree construction using singular value decomposition. In Pachter, L. and Sturmfels, B. (eds.), Algebraic Statistics for Computational Biology, pp. 347-358. Cambridge University Press, 2005. URL http://dx.doi.org/10.1017/CBO9780511610684.
Fazel, Maryam, Hindi, Haitham, and Boyd, Stephen P. A rank minimization heuristic with application to minimum order system approximation. In American Control Conference, pp. 4734-4739, 2001.
Grasedyck, L. Hierarchical singular value decomposition of tensors. SIAM J. Matrix Anal. Appl., 31(4):2029-2054, 2010.
Gretton, A., Bousquet, O., Smola, A. J., and Scholkopf, B. Measuring statistical dependence with Hilbert-Schmidt norms. In Jain, S., Simon, H. U., and Tomita, E. (eds.), Proceedings of the International Conference on Algorithmic Learning Theory, pp. 63-77. Springer-Verlag, 2005a.
Gretton, A., Herbrich, R., Smola, A. J., Bousquet, O., and Scholkopf, B. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075-2129, 2005b.
Harmeling, S. and Williams, C. Greedy learning of binary latent trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1087-1097, 2010.
Harshman, R. A. Foundations of the PARAFAC procedure: Model and conditions for an explanatory multi-mode factor analysis. UCLA Working Papers in Phonetics, 16(1):1-84, 1970.
Heller, K. A. and Ghahramani, Z. Bayesian hierarchical clustering. In Proceedings of the International Conference on Machine Learning, pp. 297-304, 2005.
Lake, J.A. Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proceedings of the National Academy of Sciences, 91(4):1455, 1994.
Mihaescu, R., Levy, D., and Pachter, L. Why neighbor-joining works. Algorithmica, 54(1):1-24, 2009.
Oseledets, I. V. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33:2295-2317, 2011.
Parikh, A., Song, L., and Xing, E. P. A spectral algorithm for latent tree graphical models. In Proceedings of the International Conference on Machine Learning, 2011.
Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
Pearl, J. and Tarsi, M. Structuring causal trees. Journal of Complexity, 2(1):60-77, 1986.
Robinson, D.F. and Foulds, L.R. Comparison of phylogenetic trees. Mathematical Biosciences, 53(1-2):131-147, 1981.
Rosasco, L., Belkin, M., and Vito, E.D. On learning with integral operators. Journal of Machine Learning Research, 11:905-934, 2010.
Saitou, N. and Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406-425, 1987.
Semple, C. and Steel, M.A. Phylogenetics, volume 24. Oxford University Press, USA, 2003.
Teh, Yee Whye, Daume, Hal, and Roy, Daniel. Bayesian agglomerative clustering with coalescents. In Advances in Neural Information Processing Systems 22, 2008.
Zhang, N. L. Hierarchical latent class models for cluster analysis. Journal of Machine Learning Research, 5:697-723, 2004.
-----0
Averbakh, I. and Berman, O. Categorized bottleneck-minisum path problems on networks. Operations Research Letters, 16:291-297, 1994.
Bach, F. Learning with Submodular Functions: A Convex Optimization Perspective. arXiv, 2011.
Boykov, Y. and Jolly, M.P. Interactive graph cuts for optimal boundary and region segmentation of objects in n-d images. In ICCV, 2001.
Buchbinder, N., Feldman, M., Naor, J., and Schwartz, R. A tight (1/2) linear-time approximation to unconstrained submodular maximization. In FOCS, 2012.
Conforti, M. and Cornuejols, G. Submodular set functions, matroids and the greedy algorithm: tight worst-case bounds and some generalizations of the Rado-Edmonds theorem. Discrete Applied Mathematics, 7(3):251-274, 1984.
Delong, A., Veksler, O., Osokin, A., and Boykov, Y. Minimizing sparse high-order energies by submodular vertex-cover. In NIPS, 2012.
Feige, U., Mirrokni, V., and Vondrak, J. Maximizing non-monotone submodular functions. SIAM J. Comput., 40(4):1133-1155, 2007.
Fujishige, S. Submodular functions and optimization, volume 58. Elsevier Science, 2005.
Fujishige, S. and Isotani, S. A submodular function minimization algorithm based on the minimum-norm base. Pacific Journal of Optimization, 7:3-17, 2011.
Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., Pallet, D., and Dahlgren, N. Timit, acoustic-phonetic continuous speech corpus. In DARPA, 1993.
Goel, G., Karande, C., Tripathi, P., and Wang, L. Approximability of combinatorial problems with multiagent submodular cost functions. In FOCS, 2009.
Goemans, M.X., Harvey, N.J.A., Iwata, S., and Mirrokni, V. Approximating submodular functions everywhere. In SODA, pp. 535-544, 2009.
Goldengorin, B., Tijssen, G.A., and Tso, M. The maximization of submodular functions: Old and new proofs for the correctness of the dichotomy algorithm. University of Groningen, 1999.
Hunter, D.R. and Lange, K. A tutorial on MM algorithms. The American Statistician, 2004.
Iwata, S. and Nagano, K. Submodular function minimization under covering constraints. In FOCS, pp. 671-680. IEEE, 2009.
Iyer, R. and Bilmes, J. The submodular Bregman and Lovasz-Bregman divergences with applications. In NIPS, 2012a.
Iyer, R. and Bilmes, J. Algorithms for approximate minimization of the difference between submodular functions, with applications. In UAI, 2012b.
Iyer, R., Jegelka, S., and Bilmes, J. Mirror descent like algorithms for submodular optimization. NIPS Workshop on Discrete Optimization in Machine Learning (DISCML), 2012.
Iyer, R., Jegelka, S., and Bilmes, J. Fast Semidifferential-based Submodular Function Optimization : Extended Version, 2013.
Jegelka, S. and Bilmes, J. A. Approximation bounds for inference using cooperative cuts. In ICML, 2011a.
Jegelka, S. and Bilmes, J. A. Submodularity beyond submodular energies: coupling edges in graph cuts. In CVPR, 2011b.
Jegelka, S., Lin, H., and Bilmes, J. On fast approximate submodular minimization. In NIPS, 2011.
Krause, A. SFO: A toolbox for submodular function optimization. JMLR, 11:1141-1144, 2010.
Krause, A., Singh, A., and Guestrin, C. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. JMLR, 9:235-284, 2008.
Kulesza, A. and Taskar, B. Determinantal point processes for machine learning. arXiv preprint arXiv:1207.6083, 2012.
Lin, H. and Bilmes, J. How to select a good training-data subset for transcription: Submodular active selection for sequences. In Interspeech, 2009.
Lin, H. and Bilmes, J. Optimal selection of limited vocabulary speech corpora. In Interspeech, 2011a.
Lin, H. and Bilmes, J. A class of submodular functions for document summarization. In ACL, 2011b.
McCormick, S. Thomas. Submodular function minimization. Discrete Optimization, 12:321-391, 2005.
McLachlan, G.J. and Krishnan, T. The EM algorithm and extensions. New York, 1997.
Nagano, K., Kawahara, Y., and Iwata, S. Minimum average cost clustering. In NIPS, 2010.
Narasimhan, M. and Bilmes, J. A submodular-supermodular procedure with applications to discriminative structure learning. In UAI, 2005.
Narasimhan, M., Jojic, N., and Bilmes, J. Q-clustering. NIPS, 18:979, 2006.
Nemhauser, G.L., Wolsey, L.A., and Fisher, M.L. An analysis of approximations for maximizing submodular set functions-I. Mathematical Programming, 14(1):265-294, 1978.
Orlin, J.B. A faster strongly polynomial time algorithm for submodular function minimization. Mathematical Programming, 118(2):237-251, 2009.
Rousu, J. and Shawe-Taylor, J. Efficient computation of gapped substring kernels on large alphabets. Journal of Machine Learning Research, 6(2):1323, 2006.
Stobbe, P. and Krause, A. Efficient minimization of decomposable submodular functions. In NIPS, 2010.
Svitkina, Z. and Fleischer, L. Submodular approximation: Sampling-based algorithms and lower bounds. In FOCS, pp. 697-706, 2008.
Wan, P.-J., Calinescu, G., Li, X.-Y., and Frieder, O. Minimum-energy broadcasting in static ad hoc wireless networks. Wireless Networks, 8:607-617, 2002.
Yuille, A.L. and Rangarajan, A. The concave-convex procedure (CCCP). In NIPS, 2002.
-----0
Arlot, Sylvain and Celisse, Alain. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40-79, 2010.
Ben-David, Shai, Kushilevitz, Eyal, and Mansour, Yishay. Online Learning versus Offline Learning. Machine Learning, 29:45-63, 1997.
Breiman, Leo. Bagging predictors. Machine Learning, 24:123-140, 1996.
Chen, Yen-kuang, Li, Wenlong, and Tong, Xiaofeng. Parallelization of AdaBoost algorithm on multi-core processors. In IEEE Workshop on Signal Processing Systems, pp. 275-280, 2008.
Chu, Cheng-tao, Kim, Sang Kyun, Lin, Yi-an, Yu, Yuanyuan, Bradski, Gary, Ng, Andrew, and Olukotun, Kunle. Map-Reduce for Machine Learning on Multicore. In Neural Information Processing Systems, pp. 281-288, 2006.
Dean, Jeffrey and Ghemawat, Sanjay. MapReduce: Simplified Data Processing on Large Clusters. In Operating Systems Design and Implementation, pp. 137-150, 2004.
Dekel, Ofer. From Online to Batch Learning with Cutoff-Averaging. In Neural Information Processing Systems, pp. 377-384, 2008.
Freund, Yoav and Schapire, Robert E. Experiments with a New Boosting Algorithm. In International Conference on Machine Learning, pp. 148-156, 1996.
Grabner, Helmut, Leistner, Christian, and Bischof, Horst. Semi-Supervised On-Line Boosting for Robust Tracking. 2008.
Iba, Wayne and Langley, Pat. Induction of One-Level Decision Trees. In International Conference on Machine Learning, pp. 233-240, 1992.
Kakade, Sham M. and Kalai, Adam. From Batch to Transductive Online Learning. In Neural Information Processing Systems, 2005.
Kondor, Risi Imre and Borgwardt, Karsten M. The skew spectrum of graphs. In International Conference on Machine Learning (ICML), pp. 496-503, 2008.
Kondor, Risi Imre, Howard, Andrew, and Jebara, Tony. Multi-object tracking with representations of the symmetric group. 2:211-218, 2007.
Littlestone, Nick. From On-Line to Batch Learning. In Computational Learning Theory, pp. 269-284, 1989.
Merler, Stefano, Caprile, Bruno, and Furlanello, Cesare. Parallelizing AdaBoost by weights dynamics. Computational Statistics & Data Analysis, 51:2487-2498, 2007.
Oza, Nikunj C. Online Bagging and Boosting. 2001.
Pachauri, Deepti, Collins, Maxwell, Kondor, Risi Imre, and Singh, Vikas. Incorporating domain knowledge in matching problems via harmonic analysis. In International Conference on Machine Learning (ICML), 2012.
Palit, I. and Reddy, C.K. Scalable and parallel boosting with mapreduce. Knowledge and Data Engineering, IEEE Transactions on, PP(99):1, 2011. ISSN 1041-4347.
Parhami, B. Introduction to Parallel Processing: Algorithms and Architectures. Plenum Series in Computer Science. Springer, 1999.
Tesson, Pascal and Thérien, Denis. Monoids and Computations. International Journal of Algebra and Computation, 14:801-816, 2004. doi: 10.1142/S0218196704001979.
Vaikuntanathan, Vinod. Computing blindfolded: New developments in fully homomorphic encryption. In Proceedings of the 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS '11, pp. 5-16, Washington, DC, USA, 2011. IEEE Computer Society.
Viola, Paul A. and Jones, Michael J. Robust Real-Time Face Detection. International Journal of Computer Vision, 57:137-154, 2004.
Watanabe, Sumio. Algebraic geometry and statistical learning theory. Cambridge University Press, 2009.
Yorgey, Brent A. Monoids: theme and variations (functional pearl). In Proceedings of the 2012 Haskell Symposium, Haskell '12, pp. 105-116, New York, NY, USA, 2012. ACM.
Yu, C. and Skillicorn, D. B. Parallelizing Boosting and Bagging. 2001.
-----0
Alon, N and Naor, A. Approximating the Cut-Norm via Grothendieck's Inequality. SIAM J. Computing, 2006.
Arora, S, Hazan, E, and Kale, S. Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. FOCS, 2005.
Bach, F. Learning with Submodular Functions: A Convex Optimization Perspective. 2011.
Bach, F, Mairal, J, and Ponce, J. Convex Sparse Matrix Factorizations. Technical report, 2008.
Bach, F, Lacoste-Julien, S, and Obozinski, G. On the Equivalence between Herding and Conditional Gradient Algorithms. In ICML, 2012.
Boyd, S and Vandenberghe, L. Convex optimization. 2004.
Cai, J-F, Candes, E J, and Shen, Z. A Singular Value Thresholding Algorithm for Matrix Completion. SIAM Journal on Optimization, 20(4):1956-1982, 2010.
Canon, M D and Cullum, C D. A Tight Upper Bound on the Rate of Convergence of Frank-Wolfe Algorithm. SIAM Journal on Control, 6(4):509-516, 1968.
Chandrasekaran, V, Recht, B, Parrilo, P A, and Willsky, A S. The Convex Geometry of Linear Inverse Problems. Found. Comp. Math., 12(6):805-849, 2012.
Clarkson, K L. Coresets, Sparse Greedy Approximation, and the Frank-Wolfe Algorithm. ACM Transactions on Algorithms, 6(4), 2010.
Demyanov, V F and Rubinov, A M. Approximate methods in optimization problems. Elsevier, 1970.
Dudik, M, Harchaoui, Z, and Malick, J. Lifted coordinate descent for learning with trace-norm regularization. In AISTATS, 2012.
Dunn, J C and Harshbarger, S. Conditional gradient algorithms with open loop step size rules. Journal of Mathematical Analysis and Applications, 62(2):432-444, 1978.
Edmonds, J. Submodular Functions, Matroids, and Certain Polyhedra. In Comb. Struct. and Appl., pp. 69-87, 1970.
Frank, M and Wolfe, P. An algorithm for quadratic programming. Naval Res. Logis. Quart., 3:95-110, 1956.
Gartner, B and Jaggi, M. Coresets for polytope distance. ACM SCG, 2009.
Giesen, J, Jaggi, M, and Laue, S. Regularization Paths with Guarantees for Convex Semidefinite Optimization. AISTATS, 2012.
Goemans, M and Williamson, D. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM, 42(6), 1995.
Guélat, J and Marcotte, P. Some comments on Wolfe's away step. Mathematical Programming, 35(1), 1986.
Harchaoui, Z, Juditsky, A, and Nemirovski, A. Conditional gradient algorithms for machine learning. In NIPS Workshop on Optimization for ML, December 2012.
Hazan, E. Sparse Approximate Solutions to Semidefinite Programs. In LATIN, pp. 306-316, 2008.
Hazan, E and Kale, S. Projection-free Online Learning. In ICML, 2012.
Hazan, E, Kale, S, and Warmuth, M.K. Learning rotations with little regret. In COLT, pp. 144-154, 2010.
Jaggi, M. Sparse Convex Optimization Methods for Machine Learning. PhD thesis, ETH Zurich, 2011.
Jaggi, M and Sulovsky, M. A Simple Algorithm for Nuclear Norm Regularized Problems. ICML, 2010.
Jenatton, R, Audibert, J-Y, and Bach, F. Structured Variable Selection with Sparsity-Inducing Norms. JMLR, 12:2777-2824, 2011.
Jones, L K. A Simple Lemma on Greedy Approximation in Hilbert Space and Convergence Rates for Projection Pursuit Regression and Neural Network Training. The Annals of Statistics, 20(1):608-613, 1992.
Kuczynski, J and Wozniakowski, H. Estimating the Largest Eigenvalue by the Power and Lanczos Algorithms with a Random Start. SIAM Journal on Matrix Analysis and Applications, 13(4):1094-1122, 1992.
Lacoste-Julien, S, Jaggi, M, Schmidt, M, and Pletscher, P. Block-Coordinate Frank-Wolfe Optimization for Structural SVMs. In ICML, 2013.
Lee, J, Recht, B, Salakhutdinov, R, Srebro, N, and Tropp, J A. Practical Large-Scale Optimization for Max-Norm Regularization. NIPS, 2010.
Levitin, E S and Polyak, B T. Constrained minimization methods. USSR Comp. Math. & M. Phys., 6(5), 1966.
Li, J and Barron, A. Mixture density estimation. NIPS, 2000.
Lovasz, L. Submodular functions and convexity. Mathematical programming: the state of the art, 1983.
Lovasz, L and Plummer, M D. Matching Theory. American Mathematical Society, 2009.
Mallat, S G and Zhang, Z. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993.
Murty, K G and Kabadi, S N. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117-129, 1987.
Nesterov, Y. Introductory Lectures on Convex Optimization. A Basic Course. Kluwer, 2004.
Obozinski, G, Jacob, L, and Vert, JP. Group Lasso with Overlaps: the Latent Group Lasso approach. arXiv, 2011.
Orabona, F, Argyriou, A, and Srebro, N. PRISMA: PRoximal Iterative SMoothing Algorithm. arXiv.org, 2012.
Ouyang, H. and Gray, A. Fast Stochastic Frank-Wolfe Algorithms for Nonlinear SVMs. SDM, 2010.
Patriksson, M. Partial linearization methods in nonlinear programming. Journal of Optimization Theory and Applications, 78(2):227-246, 1993.
Rockafellar, R T. Convex analysis. 1997.
Shalev-Shwartz, S, Srebro, N, and Zhang, T. Trading Accuracy for Sparsity in Optimization Problems with Sparsity Constraints. SIAM J. on Optimization, 20, 2010.
Srebro, N and Shraibman, A. Rank, Trace-Norm and Max-Norm. In COLT, pp. 545-560, 2005.
Temlyakov, V N. Greedy approximation in convex optimization. arXiv.org, stat.ML, 2012.
Tewari, A, Ravikumar, P, and Dhillon, I S. Greedy Algorithms for Structurally Constrained High Dimensional Problems. In NIPS, 2011.
Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. Royal Statistical Society. Series B, 1996.
Tropp, J A and Gilbert, A. Signal Recovery From Random Measurements Via Orthogonal Matching Pursuit. IEEE Trans. on Information Theory, 53(12):4655-4666, 2007.
Yuan, M and Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B, 68(1):49-67, 2006.
Yuan, X-T and Yan, S. Forward Basis Selection for Sparse Approximation over Dictionary. In AISTATS, 2012.
Zhang, T. Sequential greedy approximation for certain convex optimization problems. IEEE Transactions on Information Theory, 49(3):682-691, 2003.
Zhang, X, Yu, Y, and Schuurmans, D. Accelerated Training for Matrix-norm Regularization: A Boosting Approach. In NIPS, 2012.
-----0
Blum, Avrim, Ligett, Katrina, and Roth, Aaron. A learning theory approach to non-interactive database privacy. In STOC, pp. 609-618. ACM, 2008.
Chang, Chih-Chung and Lin, Chih-Jen. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chaudhuri, Kamalika and Hsu, Daniel. Sample complexity bounds for differentially private learning. Journal of Machine Learning Research Proceedings Track, pp. 155-186, 2011.
Chaudhuri, Kamalika, Monteleoni, Claire, and Sarwate, Anand D. Differentially private empirical risk minimization. JMLR, 12:1069-1109, 2011.
Dwork, Cynthia. Differential privacy. In ICALP (2), 2006.
Dwork, Cynthia. Differential privacy in new settings. In SODA, 2010.
Dwork, Cynthia, Kenthapadi, Krishnaram, McSherry, Frank, Mironov, Ilya, and Naor, Moni. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, pp. 486-503. Springer, 2006a.
Dwork, Cynthia, McSherry, Frank, Nissim, Kobbi, and Smith, Adam. Calibrating noise to sensitivity in private data analysis. In TCC, pp. 265-284. Springer, 2006b.
Dwork, Cynthia, Rothblum, Guy N, and Vadhan, Salil. Boosting and differential privacy. In FOCS, 2010.
Dwork, Cynthia, Naor, Moni, and Vadhan, Salil. The privacy of the analyst and the power of the state. In FOCS, 2012.
Grauman, Kristen and Darrell, Trevor. The pyramid match kernel: Efficient learning with sets of features. Journal of Machine Learning Research, 8:725-760, 2007.
Gupta, Anupam, Roth, Aaron, and Ullman, Jonathan. Iterative constructions and private data release. CoRR, abs/1107.3731, 2011.
Hall, Rob, Rinaldo, Alessandro, and Wasserman, Larry A. Differential privacy for functions and functional data. CoRR, abs/1203.2570, 2012.
Hardt, Moritz and Rothblum, Guy N. A multiplicative weights mechanism for privacy-preserving data analysis. In FOCS, 2010.
Hardt, Moritz, Ligett, Katrina, and McSherry, Frank. A simple and practical algorithm for differentially private data release. CoRR, abs/1012.4763, 2010.
Hsu, Justin, Roth, Aaron, and Ullman, Jonathan. Differential privacy for the analyst via private equilibrium computation. CoRR, abs/1211.0877, 2012.
Jain, Prateek, Kothari, Pravesh, and Thakurta, Abhradeep. Differentially private online learning. In COLT, 2012.
Kifer, Daniel, Smith, Adam, and Thakurta, Abhradeep. Private convex empirical risk minimization and high-dimensional regression. In COLT, 2012.
McSherry, Frank and Talwar, Kunal. Mechanism design via differential privacy. In FOCS, pp. 94-103. IEEE, 2007.
NIH. The international HapMap project. Nature, 426:789-796, 2003.
Pathak, Manas A., Rane, Shantanu, and Raj, Bhiksha. Multiparty differential privacy via aggregation of locally trained classifiers. In NIPS, pp. 1876-1884, 2010.
Rubinstein, Benjamin I. P., Bartlett, Peter L., Huang, Ling, and Taft, Nina. Learning in a large function space: Privacy-preserving mechanisms for SVM learning. CoRR, abs/0911.5708, 2009.
Shalev-Shwartz, Shai, Shamir, Ohad, Srebro, Nathan, and Sridharan, Karthik. Stochastic Convex Optimization. In Proceedings of the Conference on Learning Theory (COLT), 2009.
Williams, Oliver and McSherry, Frank. Probabilistic inference and differential privacy. In NIPS, pp. 2451-2459, 2010.
-----0
Besag, J. Statistical analysis of non-lattice data. The Statistician, 24(3), 1975.
Birgin, E., Martínez, J., and Raydan, M. Nonmonotone spectral projected gradient methods on convex sets. SIAM Journal on Optimization, 10, 2000.
Domke, J. Parameter learning with truncated message-passing. In CVPR, 2011.
Duchi, J., Shalev-Shwartz, S., Singer, Y., and Chandra, T. Efficient projections onto the l1-ball for learning in high dimensions. In ICML, 2008.
Finley, T. and Joachims, T. Training structural SVMs when exact inference is intractable. In ICML, 2008.
Franc, V. and Savchynskyy, B. Discriminative learning of max-sum classifiers. JMLR, 9, 2008.
Globerson, A. and Jaakkola, T. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In NIPS, 2007.
Gould, S., Fulton, R., and Koller, D. Decomposing a scene into geometric and semantically consistent regions. In ICCV, 2009.
Hazan, T. and Shashua, A. Norm-product belief propagation: Primal-dual message-passing for approximate inference. IEEE Trans. Inf. Theory, 56, 2010.
Hazan, T. and Urtasun, R. A primal-dual message-passing algorithm for approximated large scale structured prediction. In NIPS, 2010.
He, K., Sun, J., and Tang, X. Fast matting using large kernel matting Laplacian matrices. In CVPR, 2010.
Jancsary, J., Nowozin, S., Sharp, T., and Rother, C. Regression tree fields - an efficient, non-parametric approach to image labeling problems. In CVPR, 2012.
Kolmogorov, V. Convergent tree-reweighted message passing for energy minimization. IEEE Trans. Pattern Anal. Mach. Intell., 28, 2006.
Komodakis, N. Efficient training for pairwise higher order CRFs via dual decomposition. In CVPR, 2011.
Koval, V. and Schlesinger, M. Two-dimensional programming in image analysis problems. USSR Academy of Science, Automatics and Telemechanics, 8, 1976.
Kulesza, A. and Pereira, F. Structured learning with approximate inference. In NIPS, 2007.
Kumar, M., Kolmogorov, V., and Torr, P. An analysis of convex relaxations for MAP estimation of discrete MRFs. JMLR, 10, 2009.
Lafferty, J., McCallum, A., and Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
Levin, A., Rav-Acha, A., and Lischinski, D. Spectral matting. IEEE Trans. Pattern Anal. Mach. Intell., 30, 2008.
Martins, A., Smith, N., and Xing, E. Polyhedral outer approximations with application to natural language parsing. In ICML, 2009.
Meshi, O., Sontag, D., Jaakkola, T., and Globerson, A. Learning efficiently with approximate inference via dual losses. In ICML, 2010.
Nowozin, S., Rother, C., Bagon, S., Sharp, T., Yao, B., and Kohli, P. Decision tree fields. In ICCV, 2011.
Ratliff, N., Bagnell, A., and Zinkevich, M. (Online) subgradient methods for structured prediction. In AISTATS, 2007.
Ravikumar, P. and Lafferty, J. Quadratic programming relaxations for metric labeling and Markov random field MAP estimation. In ICML, 2006.
Ravikumar, P., Agarwal, A., and Wainwright, M. Message-passing for graph-structured linear programs: Proximal methods and rounding schemes. JMLR, 11, 2010.
Savchynskyy, B., Schmidt, S., Kappes, J., and Schnorr, C. Efficient MRF energy minimization via adaptive diminishing smoothing. In UAI, 2012.
Schmidt, M., van den Berg, E., Friedlander, M., and Murphy, K. Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. In AISTATS, 2009.
Stoyanov, V., Ropson, A., and Eisner, J. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In AISTATS, 2011.
Sutton, C. and McCallum, A. Piecewise training for structured prediction. Machine Learning, 77, 2009.
Sutton, C., Rohanimanesh, K., and McCallum, A. Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. In ICML, 2004.
Tappen, M., Samuel, K., Dean, C., and Lyle, D. The logistic random field - a convenient graphical model for learning parameters for MRF-based labeling. In CVPR, 2008.
Taskar, B., Guestrin, C., and Koller, D. Max-margin Markov networks. In NIPS, 2003.
Taskar, B., Lacoste-Julien, S., and Jordan, M. Structured prediction, dual extragradient and Bregman projections. JMLR, 7, 2006.
Teo, C., Vishwanathan, S.V.N., Smola, A., and Le, Q. Bundle methods for regularized risk minimization. JMLR, 11, 2010.
Tjong Kim Sang, E. and Buchholz, S. Introduction to the CoNLL-2000 shared task: Chunking. In CoNLL, 2000.
Wainwright, M. Estimating the wrong graphical model: Benefits in the computation-limited setting. JMLR, 7, 2006.
Wainwright, M. and Jordan, M. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1, 2008.
-----0
Boureau, Y, Bach, F, LeCun, Y, and Ponce, J. Learning mid-level features for recognition. In CVPR, 2010.
Carreira, J, Caseiro, R, Batista, J, and Sminchisescu, C. Semantic segmentation with second-order pooling. In ECCV, 2012.
Coates, A and Ng, A. The importance of encoding versus training with sparse coding and vector quantization. In ICML, 2011.
Coates, A, Lee, H, and Ng, A. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011.
Coates, Adam, Karpathy, Andrej, and Ng, Andrew. Emergence of object-selective features in unsupervised feature learning. In NIPS, 2012.
Cortes, C, Mohri, M, and Talwalkar, A. On the impact of kernel approximation on learning accuracy. In AISTATS, 2010.
Denil, M and de Freitas, N. Recklessly approximate sparse coding. arXiv preprint arXiv:1208.0959, 2012.
Frey, BJ and Dueck, D. Clustering by passing messages between data points. Science, 315(5814):972-976, 2007.
Jenatton, R, Obozinski, G, and Bach, F. Structured sparse principal component analysis. In AISTATS, 2010.
Krizhevsky, A, Sutskever, I, and Hinton, GE. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
Kumar, S, Mohri, M, and Talwalkar, A. Sampling methods for the Nystrom method. JMLR, 13(Apr):981-1006, 2012.
LeCun, Yann, Bottou, Leon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proc. of the IEEE, 86(11):2278-2324, 1998.
Mairal, J, Bach, F, Ponce, J, and Sapiro, G. Online learning for matrix factorization and sparse coding. JMLR, 11:19-60, 2010.
Olshausen, B and Field, DJ. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision research, 37(23):3311-3325, 1997.
Rigamonti, R, Brown, MA, and Lepetit, V. Are sparse representations really relevant for image classification? In CVPR, 2011.
Saxe, A, Koh, PW, Chen, Z, Bhand, M, Suresh, B, and Ng, A. On random weights and unsupervised feature learning. In ICML, 2011.
Talwalkar, Ameet and Rostamizadeh, Afshin. Matrix coherence and the Nystrom method. arXiv preprint arXiv:1004.2008, 2010.
Wang, J, Yang, J, Yu, K, Lv, F, Huang, T, and Gong, Y. Locality-constrained linear coding for image classification. In CVPR, 2010.
Yang, J, Yu, K, and Gong, Y. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.
Yang, J, Yu, K, and Huang, T. Efficient highly overcomplete sparse coding using a mixture model. In ECCV, 2010.
Zhang, K, Tsang, IW, and Kwok, JT. Improved Nystrom low-rank approximation and error analysis. In ICML, 2008.
-----0
Aflalo, J., Ben-Tal, A., Bhattacharyya, C., Nath, J. Saketha, and Raman, S. Variable sparsity kernel learning. JMLR, 12:565-592, 2011.
Bach, F. R. Exploring large feature spaces with hierarchical multiple kernel learning. In NIPS, pp. 105-112, 2008.
Bengio, Y., Delalleau, O., and Simard, C. Decision trees do not generalize to new variations. Compt. Intl., 26, 2010.
Bazavan, E. G., Li, F., and Sminchisescu, C. Fourier kernel learning. In Proc. ECCV, 2012.
Burges, C. J. C. and Scholkopf, B. Improving the accuracy and speed of support vector machines. In NIPS, 1997.
Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. Choosing multiple parameters for Support Vector Machines. Machine Learning, 46:131-159, 2002.
Chen, J., Ji, S., Ceran, B., Li, Q., Wu, M., and Ye, J. Learning subspace kernels for classification. In KDD, 2008.
Cho, Y. and Saul, L. Kernel methods for deep learning. In NIPS, 2009.
Cortes, C., Mohri, M., and Rostamizadeh, A. L2 regularization for learning kernels. In UAI, 2009a.
Cortes, C., Mohri, M., and Rostamizadeh, A. Learning non-linear combinations of kernels. In NIPS, 2009b.
Cossalter, M., Yan, R., and Zheng, L. Adaptive kernel approximation for large-scale non-linear SVM prediction. In ICML, 2011.
Gonen, M. and Alpaydin, E. Localized multiple kernel learning. In ICML, 2008.
Hsieh, C.-J., Chang, K.-W., Lin, C.-J., Keerthi, S. S., and Sundararajan, S. A dual coordinate descent method for large-scale linear SVM. In ICML, 2008.
Jain, A., Vishwanathan, S. V. N., and Varma, M. SPG-GMKL: Generalized multiple kernel learning with a million kernels. In KDD, 2012.
Joachims, T. and Yu, C.-N. J. Sparse kernel SVMs via cutting-plane training. Machine Learning, 76, 2009.
Jose, C., Goyal, P., Aggrwal, P., and Varma, M. The LDKL code. http://research.microsoft.com/en-us/um/people/manik/code/LDKL/download.html, 2013.
Kar, P. and Karnick, H. Random feature maps for dot product kernels. In AISTATS, 2012.
Keerthi, S. S., Chapelle, O., and DeCoste, D. Building support vector machines with reduced classifier complexity. JMLR, 7, 2006.
Kloft, M., Brefeld, U., Sonnenburg, S., Laskov, P., Muller, K.-R., and Zien, A. Efficient and accurate lp-norm Multiple Kernel Learning. In NIPS, 2009.
Ladicky, L. and Torr, P. H. S. Locally linear support vector machines. In ICML, 2011.
Lanckriet, G. R. G., Cristianini, N., Bartlett, P., El Ghaoui, L., and Jordan, M. I. Learning the kernel matrix with semidefinite programming. JMLR, 5:27-72, 2004.
Maji, S., Berg, A. C., and Malik, J. Efficient classification for additive kernel SVMs. IEEE PAMI, 35(1), 2013.
Ong, C. S., Smola, A. J., and Williamson, R. C. Learning the kernel with hyperkernels. JMLR, 6:1043-1071, 2005.
Orabona, F. and Jie, L. Ultra-fast optimization algorithm for sparse multi kernel learning. In ICML, June 2011.
Orabona, F., Jie, L., and Caputo, B. Online-batch strongly convex multi kernel learning. In CVPR, pp. 787-794, San Francisco, California, June 2010.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. JMLR, 12, 2011.
Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In NIPS, 2007.
Rahimi, A. and Recht, B. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS, 2008.
Rakotomamonjy, A., Bach, F., Grandvalet, Y., and Canu, S. SimpleMKL. JMLR, 9:2491-2521, 2008.
Sindhwani, V. and Lozano, A. C. Non-parametric group orthogonal matching pursuit for sparse learning with multiple kernels. In NIPS, 2011.
Sonnenburg, S., Raetsch, G., Schaefer, C., and Schoelkopf, B. Large scale multiple kernel learning. JMLR, 7:1531-1565, 2006.
Tsang, I., Kwok, J. T., and Cheung, P. M. Core vector machines: Fast SVM training on very large data sets. JMLR, 6, 2005.
Tsang, I., Kocsor, A., and Kwok, J. T. Simpler core vector machines with enclosing balls. In ICML, 2007.
Tsang, I. W. and Kwok, J. T. Efficient hyperkernel learning using second-order cone programming. IEEE Transactions on Neural Networks, 17(1):48-58, 2006.
Vedaldi, A. and Zisserman, A. Efficient additive kernels via explicit feature maps. IEEE PAMI, 34(3), 2011.
Vedaldi, A. and Zisserman, A. Sparse kernel approximations for efficient classification and detection. In CVPR, 2012.
Vinyals, O., Jia, Y., Deng, L., and Darrell, T. Learning with recursive perceptual representations. In NIPS, 2012.
Vishwanathan, S. V. N., Sun, Z., Theera-Ampornpunt, N., and Varma, M. Multiple kernel learning and the smo algorithm. In NIPS, 2010.
Williams, C. and Seeger, M. Using the Nystrom method to speed up kernel machines. In NIPS, 2001.
Yang, T., Li, Y.-F., Mahdavi, M., Jin, R., and Zhou, Z.-H. Nystrom method vs random Fourier features: A theoretical and empirical comparison. In NIPS, 2012.
Ye, J., Ji, S., and Chen, J. Multi-class discriminant kernel learning via convex programming. JMLR, 9:719-758, 2008.
-----0
Agarwal, Alekh and Duchi, John. Distributed delayed stochastic optimization. In Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 24 (NIPS), pp. 873-881, 2011.
Auer, Peter, Cesa-Bianchi, Nicolò, and Fischer, Paul. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, May 2002.
Cesa-Bianchi, Nicolò and Lugosi, Gabor. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006. ISBN 0521841089.
Desautels, Thomas, Krause, Andreas, and Burdick, Joel. Parallelizing exploration-exploitation tradeoffs with Gaussian process bandit optimization. In Proceedings of the 29th International Conference on Machine Learning (ICML), Edinburgh, Scotland, UK, 2012. Omnipress.
Dudik, Miroslav, Hsu, Daniel, Kale, Satyen, Karampatziakis, Nikos, Langford, John, Reyzin, Lev, and Zhang, Tong. Efficient optimal learning for contextual bandits. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), pp. 169178, Corvallis, Oregon, 2011. AUAI Press.
Garivier, Aurelien and Cappe, Olivier. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory (COLT), volume 19, pp. 359 376, Budapest, Hungary, July 2011.
Joulani, Pooria, Gyorgy, Andras, and Szepesvari, Csaba. Online learning under delayed feedback. Extended version of a paper submitted to ICML-2013, 2013. URL http://webdocs.cs.ualberta.ca/~pooria/publications/DelayedFeedback-ICML2013-Extended.pdf.
Langford, John, Smola, Alexander, and Zinkevich, Martin. Slow learners are fast. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C. K. I., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 22, pp. 2331-2339, 2009.
Li, Lihong, Chu, Wei, Langford, John, and Schapire, Robert E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web (WWW), pp. 661-670, New York, NY, USA, 2010. ACM.
Mesterharm, Chris J. On-line learning with delayed label feedback. In Jain, Sanjay, Simon, Hans-Ulrich, and Tomita, Etsuji (eds.), Algorithmic Learning Theory, volume 3734 of Lecture Notes in Computer Science, pp. 399-413. Springer Berlin Heidelberg, 2005.
Mesterharm, Chris J. Improving on-line learning. PhD thesis, Department of Computer Science, Rutgers University, New Brunswick, NJ, 2007.
Neu, Gergely, Gyorgy, Andras, Szepesvari, Csaba, and Antos, Andras. Online Markov decision processes under bandit feedback. In Lafferty, J., Williams, C. K. I., Shawe-Taylor, J., Zemel, R.S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23 (NIPS), pp. 1804-1812, 2010.
Weinberger, Marcelo J. and Ordentlich, Erik. On delayed prediction of individual sequences. IEEE Transactions on Information Theory, 48(7):1959-1976, September 2002.
-----0
Bach, Francis R. Kernel independent component analysis. JMLR, 3:1-48, 2002.
Baker, Charles R. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273-289, 1973.
Borgwardt, K., Gretton, A., Rasch, M., Kriegel, H., Scholkopf, B., and Smola, A. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49-e57, 2006.
Fortet, R. and Mourier, E. Convergence de la répartition empirique vers la répartition théorique. Ann. Scient. Ecole Norm, 70:266-285, 1953.
Fukumizu, K., Gretton, A., Sun, X., and Schoelkopf, B. Kernel measures of conditional dependence. In NIPS 20, pp. 489-496, Cambridge, MA, 2008. MIT Press.
Fukumizu, Kenji, Bach, Francis R., and Jordan, Michael I. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73-99, 2004.
Fukumizu, Kenji, Bach, Francis R., and Gretton, Arthur. Statistical convergence of kernel CCA. In Advances in Neural Information Processing Systems 18, 2005.
Gretton, A., Herbrich, R., and Smola, A. The kernel mutual information. In Proc. ICASSP, 2003.
Gretton, A., Bousquet, O., Smola, A., and Scholkopf, B. Measuring statistical dependence with Hilbert-Schmidt norms. In ALT, pp. 63-77, 2005.
Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., and Pontil, M. Conditional mean embeddings as regressors. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, volume 2, pp. 1823-1830, 2012.
Hewitt, E. and Stromberg, K. Real and Abstract Analysis: A Modern Treatment of the Theory of Functions of a Real Variable. Graduate Texts in Mathematics. Springer, 1975. ISBN 9780387901381.
Poczos, B. and Schneider, J. Nonparametric estimation of conditional information and divergences. In International Conference on AI and Statistics (AISTATS), JMLR Workshop and Conference Proceedings, 2012.
Poczos, Barnabas, Ghahramani, Zoubin, and Schneider, Jeff G. Copula-based kernel dependency measures. In ICML, 2012.
Reed, M. and Simon, B. Functional Analysis. Academic Press, 1980. ISBN 9780080570488.
Renyi, A. On measures of dependence. Acta Math. Acad. Sci. Hungar., 10:441-451, 1959.
Renyi, A. On measure of entropy and information. In 4th Berkeley Symposium on Math., Stat., and Prob., pp. 547-561, 1961.
Ritov, Y. and Bickel, P. J. Achieving information bounds in non and semiparametric models. The Annals of Statistics, 18(2):925-938, 1990.
Schweizer, B. and Wolff, E. On nonparametric measures of dependence for random variables. The Annals of Statistics, 9, 1981.
Szekely, G. J., Rizzo, M. L., and Bakirov, N. K. Measuring and testing dependence by correlation of distances. Annals of Statistics, 35:2769-2794, 2007.
Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Statist. Phys., 52(1-2):479-487, 1988.
-----0
Bengio, S., Weston, J., and Grangier, D. Label Embedding Trees for Large Multi-Class Tasks. In NIPS, 2010.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
Duda, R., Hart, P., and Stork, D. Pattern Classification, chapter 10. John Wiley and Sons, Inc., New York, 2nd edition, 2001.
Fergus, R., Bernal, H., Weiss, Y., and Torralba, A. Semantic label sharing for learning with many categories. In ECCV, 2010.
Gentner, D. Structure-mapping: A theoretical framework for analogy. Cognitive Science, 7(2):155-170, 1983.
Gentner, D. and Markman, A. B. Structure mapping in analogy and similarity. American Psychologist, 52:45-56, 1997.
Hertzmann, A., Jacobs, C., Oliver, N., Curless, B., and Salesin, D. Image analogies. In SIGGRAPH, 2001.
Hwang, S. J., Grauman, K., and Sha, F. Learning a tree of metrics with disjoint visual features. In NIPS, 2011a.
Hwang, S. J., Sha, F., and Grauman, K. Sharing features between objects and their attributes. In CVPR, 2011b.
Lampert, C., Nickisch, H., and Harmeling, S. Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer. In CVPR, 2009.
Miclet, L., Bayoudh, S., and Delhay, A. Analogical dissimilarity. JAIR, 32(1):793-824, 2008. ISSN 1076-9757.
Mihalkova, L., Huynh, T., and Mooney, R. Mapping and revising Markov logic networks for transfer learning. In AAAI, 2007.
Roweis, S. and Saul, L. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.
Russakovsky, O. and Fei-Fei, L. Attribute learning in large-scale datasets. In ECCV, 2010.
Shaw, B. and Jebara, T. Structure preserving embedding. In ICML, 2009.
Shieh, Albert, Hashimoto, Tatsunori, and Airoldi, Edo. Tree preserving embedding. In ICML, 2011.
Wang, H. and Yang, Q. Transfer learning by structural analogy. In AAAI, 2011.
Wang, Y. and Mori, G. A discriminative latent model of object classes and attributes. In ECCV, 2010.
Weinberger, K. Q. and Chapelle, O. Large margin taxonomy embedding for document categorization. In NIPS, 2009.
Weinberger, K. Q. and Saul, L. K. An introduction to nonlinear dimensionality reduction by maximum variance unfolding. In AAAI, 2006.
Zhao, B., Fei-Fei, L., and Xing, E. P. Large-scale category structure aware image categorization. In NIPS, 2011.
Zweig, A. and Weinshall, D. Exploiting Object Hierarchy: Combining Models from Different Category Levels. In ICCV, 2007.
-----0
Baldo, J. V. and Shimamura, A. P. Letter and category fluency in patients with frontal lobe lesions. Neuropsychology, 12(2):259-267, 1998.
Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, Cambridge UK, 2004.
Brants, T., Popat, A.C., Xu, P., Och, F.J., and Dean, J. Large language models in machine translation. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), 2007.
Chesson, J. A non-central multivariate hypergeometric distribution arising from biased sampling with application to selective predation. Journal of Applied Probability, 13(4):795-797, 1976.
Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., and Slattery, S. Learning to extract symbolic knowledge from the world wide web. In Proceedings of the Conference on Artificial Intelligence (AAAI), pp. 509-516. AAAI Press, 1998.
Druck, G., Mann, G., and McCallum, A. Learning from labeled features using generalized expectation criteria. In Proceedings of the International ACM SIGIR Conference on Research and Development In Information Retrieval, pp. 595-602, 2008.
Fog, A. Calculation methods for Wallenius noncentral hypergeometric distribution. Communications in Statistics - Simulation and Computation, 37(2):258-273, 2008.
Gladsjo, J. A., Schuman, C. C., Evans, J. D., Peavy, G. M., Miller, S. W., and Heaton, R. K. Norms for letter and category fluency: Demographic corrections for age, education, and ethnicity. Assessment, 6(2):147-178, 1999.
Hodges, J. R., Garrard, P., Perry, R., Patterson, K., Bak, T., and Gregory, C. The differentiation of semantic dementia and frontal lobe dementia from early Alzheimer's disease: a comparative neuropsychological study. Neuropsychology, 13:31-40, 1999.
Liu, B., Li, X., Lee, W.S., and Yu, P.S. Text classification by labeling words. In Proceedings of the Conference on Artificial Intelligence (AAAI), pp. 425-430. AAAI Press, 2004.
Monsch, A. U., Bondi, M. W., Butters, N., Salmon, D. P., Katzman, R., and Thal, L. J. Comparisons of verbal fluency tasks in the detection of dementia of the Alzheimer type. Archives of Neurology, 49(12):1253-1258, 1992.
Pang, B., Lee, L., and Vaithyanathan, S. Thumbs up: Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 79-86. ACL, 2002.
Rogers, T.T., Ivanoiu, A., Patterson, K., and Hodges, J.R. Semantic memory in Alzheimer's disease and the fronto-temporal dementias: A longitudinal study of 236 patients. Neuropsychology, 20(3):319-335, 2006.
Rosser, A. and Hodges, J.R. Initial letter and semantic category fluency in Alzheimer's disease, Huntington's disease, and progressive supranuclear palsy. Journal of Neurology, Neurosurgery, and Psychiatry, 57:1389-1394, 1994.
Settles, B. Closing the loop: Fast, interactive semi-supervised annotation with queries on features and instances. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1467-1478. ACL, 2011.
Troyer, A.K., Moscovitch, M., Winocur, G., Alexander, M., and Stuss, D. Clustering and switching on verbal fluency: The effects of focal frontal- and temporal-lobe lesions. Neuropsychologia, 36(6), 1998.
Wallenius, K.T. Biased Sampling: The Non-Central Hypergeometric Probability Distribution. PhD thesis, Department of Statistics, Stanford University, 1963.
-----0
Alvarez, M., Rosasco, L., and Lawrence, N. D. Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning, 4(3):195-266, 2012.
Bach, F. and Jordan, M. Kernel independent component analysis. Journal of Machine Learning Research, 3:1-48, 2002.
Bakir, G., Hofmann, T., Scholkopf, B., Smola, A., Taskar, B., and Vishwanathan, S. (eds.). Predicting Structured Data. MIT Press, 2007.
Brouard, C., d'Alché-Buc, F., and Szafranski, M. Semi-supervised penalized output kernel regression for link prediction. In Proc. ICML, pp. 593-600, 2011.
Caponnetto, A. and De Vito, E. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331-368, 2007.
Caponnetto, A., Micchelli, C., Pontil, M., and Ying, Y. Universal multi-task kernels. Journal of Machine Learning Research, 9:1615-1646, 2008.
Caruana, R. Multitask learning. Machine Learning, 28(1):41-75, 1997.
Cortes, C., Mohri, M., and Weston, J. A general regression technique for learning transductions. In Proc. ICML, pp. 153-160, 2005.
Cortes, C., Mohri, M., and Weston, J. A General Regression Framework for Learning String-to-String Mappings. MIT Press, 2007.
Evgeniou, T., Micchelli, C., and Pontil, M. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615-637, 2005.
Fukumizu, K., Bach, F., and Jordan, M. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73-99, 2004.
Gretton, A., Herbrich, R., Smola, A., Bousquet, O., and Scholkopf, B. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075-2129, 2005.
Grunewalder, S., Lever, G., Gretton, A., Baldassarre, L., Patterson, S., and Pontil, M. Conditional mean embeddings as regressors. In Proc. ICML, 2012.
Kadri, H., Duflos, E., Preux, P., Canu, S., and Davy, M. Nonlinear functional regression: a functional RKHS approach. In Proc. AISTATS, pp. 111-125, 2010.
Kadri, H., Rabaoui, A., Preux, P., Duflos, E., and Rakotomamonjy, A. Functional regularized least squares classification with operator-valued kernels. In Proc. ICML, pp. 993-1000, 2011.
Kadri, H., Ghavamzadeh, M., and Preux, P. A generalized kernel approach to structured output learning. Technical Report 00695631, INRIA, 2012.
Micchelli, C. and Pontil, M. On learning vector-valued functions. Neural Computation, 17:177-204, 2005.
Ramsay, J. and Silverman, B. Functional Data Analysis, 2nd edition. Springer Verlag, New York, 2005.
Taskar, B., Guestrin, C., and Koller, D. Max-margin Markov networks. In Advances in Neural Information Processing Systems 16, 2004.
Troje, N. and Bulthoff, H. Face recognition under varying poses: The role of texture and shape. Vision Research, 36:1761-1771, 1996.
Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453-1484, 2005.
Wang, Z. and Shawe-Taylor, J. A kernel regression framework for SMT. Machine Translation, 24(2):87-102, 2010.
Weston, J., Chapelle, O., Elisseeff, A., Scholkopf, B., and Vapnik, V. Kernel dependency estimation. In Advances in Neural Information Processing Systems 15, pp. 873-880, 2003.
Weston, J., Bakir, G., Bousquet, O., Scholkopf, B., Mann, T., and Noble, W. Joint Kernel Maps. MIT Press, 2007.
-----0
Banerjee, O., El Ghaoui, L., and d'Aspremont, A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9:485-516, June 2008.
Binkley, J. K. and Nelson, C. H. A note on the efficiency of seemingly unrelated regression. The American Statistician, 42(2):137-139, 1988.
Bonilla, E. V., Chai, K. M. A., and Williams, C. K. I. Multi-task Gaussian process prediction. In Platt, J.C., Koller, D., Singer, Y., and Roweis, S. (eds.), Advances in Neural Information Processing Systems 20, pp. 153-160, Cambridge, MA, 2008. MIT Press.
Chung, F. R. K. Spectral Graph Theory (CBMS Regional Conference Series in Mathematics, No. 92). American Mathematical Society, December 1996. ISBN 0821803158. URL http://www.worldcat.org/isbn/0821803158.
Dawid, A. P. Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68(1):265-274, 1981.
Dutilleul, P. The MLE algorithm for the matrix normal distribution. Journal of Statistical Computation and Simulation, 64(2):105-123, 1999.
Fan, J. and Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348-1360, 2001.
Friedman, J., Hastie, T., and Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, Jul 2008.
Gupta, A. K. and Nagar, D. K. Matrix variate distributions. Chapman & Hall, 1999.
Imrich, W., Klavzar, S., and Rall, D. F. Topics in Graph Theory: Graphs and Their Cartesian Product. AK Peters Ltd, 2008. ISBN 1568814291, 9781568814292.
Lauritzen, S. L. Graphical models, volume 17. Oxford University Press, USA, 1996.
Lawrence, N. D. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783-1816, November 2005.
Lawrence, N. D. A unifying probabilistic perspective for spectral dimensionality reduction: Insights and new models. Journal of Machine Learning Research, 13:1609-1638, 2012.
Leng, C. and Tang, C. Y. Sparse matrix graphical models. Journal of the American Statistical Association, 107(499):1187-1200, 2012.
O'Hagan, A. A Markov property for covariance structures. Statistics Research Report 98-13, Nottingham University, 1998.
Roweis, S. T. and Saul, L. K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000. doi: 10.1126/science.290.5500.2323.
Sabidussi, G. Graph multiplication. Mathematische Zeitschrift, 72:446-457, 1959. ISSN 0025-5874. URL http://dx.doi.org/10.1007/BF01162967.
Stegle, O., Lippert, C., Mooij, J., Lawrence, N. D., and Borgwardt, K. Efficient inference in matrix-variate Gaussian models with iid observation noise. Advances in Neural Information Processing Systems, 24:443, 2011.
Tipping, Michael E. and Bishop, Christopher M. Probabilistic principal component analysis. Journal of the Royal Statistical Society, B, 61(3):611-622, 1999. doi: 10.1111/1467-9868.00196.
Tsiligkaridis, T., Hero, A., and Zhou, S. On convergence of Kronecker graphical lasso algorithms. Signal Processing, IEEE Transactions on, PP(99): 1, 2013. ISSN 1053-587X. doi: 10.1109/TSP.2013.2240157.
Wackernagel, H. Multivariate geostatistics. Springer, 2003.
Zellner, A. An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. Journal of the American Statistical Association, 57(298):348-368, 1962.
Zhang, Y. and Schneider, J. Learning multiple tasks with a sparse matrix-normal penalty. Advances in Neural Information Processing Systems, 23:2550-2558, 2010.
-----0
Alain, G. and Bengio, Y. What regularized autoencoders learn from the data generating distribution. In International Conference on Learning Representations (ICLR), 2013.
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. Theano: a CPU and GPU math expression compiler. In Python for Scientific Computing Conference (SciPy), 2010.
Cover, T. M. and Thomas, J. Elements of Information Theory. New York: John Wiley & Sons, Inc, 1991.
Duda, R. and Hart, P. Pattern Classification. John Wiley & Sons, 2001.
Gutmann, M. U. and Hyvarinen, A. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13:307-361, March 2012.
Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.
Hyvarinen, A. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695-709, December 2005.
Krizhevsky, A., Sutskever, I., and Hinton, G. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. An empirical evaluation of deep architectures on problems with many factors of variation. In International Conference on Machine Learning (ICML), 2007.
Le, Q., Ranzato, M.A., Monga, R., Devin, M., Chen, K., Corrado, G., Dean, J., and Ng, A. Building high-level features using large scale unsupervised learning. In International Conference on Machine Learning (ICML), 2012.
Memisevic, R. Gradient-based learning of higher-order image features. In the International Conference on Computer Vision (ICCV), 2011.
Memisevic, R., Zach, C., Hinton, G., and Pollefeys, M. Gated softmax classification. Advances in Neural Information Processing Systems (NIPS), 23, 2011.
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. Contractive auto-encoders: Explicit invariance during feature extraction. In International Conference on Machine Learning (ICML), 2011.
Rolfe, J. T. and LeCun, Y. Discriminative recurrent sparse auto-encoders. In International Conference on Learning Representations (ICLR), 2013.
Seung, H.S. Learning continuous attractors in recurrent networks. Advances in Neural Information Processing Systems (NIPS), 10:654-660, 1998.
Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2011.
Susskind, J., Memisevic, R., Hinton, G., and Pollefeys, M. Modeling the joint density of two images under a variety of transformations. In International Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
Swersky, K., Buchman, D., Marlin, B.M., and de Freitas, N. On autoencoders and score matching for energy based models. In International Conference on Machine Learning (ICML), 2011.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P. A. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning (ICML), 2008.
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11:3371-3408, 2010.
Vincent, Pascal. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661-1674, July 2011.
Welling, M., Rosen-Zvi, M., and Hinton, G. Exponential family harmoniums with an application to information retrieval. Advances in neural information processing systems (NIPS), 17, 2005.
Zou, W.Y., Zhu, S., Ng, A., and Yu, K. Deep learning of invariant features via simulated fixations in video. In Advances in Neural Information Processing Systems (NIPS), 2012.
-----0
Agarwal, Shivani and Niyogi, Partha. Generalization Bounds for Ranking Algorithms via Algorithmic Stability. JMLR, 10:441-474, 2009.
Balcan, Maria-Florina and Blum, Avrim. On a Theory of Learning with Similarity Functions. In ICML, pp. 73-80, 2006.
Bellet, Aurelien, Habrard, Amaury, and Sebban, Marc. Similarity Learning for Provably Accurate Sparse Linear Classification. In ICML, 2012.
Brefeld, Ulf and Scheffer, Tobias. AUC Maximizing Support Vector Learning. In ICML workshop on ROC Analysis in Machine Learning, 2005.
Cao, Qiong, Guo, Zheng-Chu, and Ying, Yiming. Generalization Bounds for Metric and Similarity Learning, 2012. arXiv:1207.5437.
Cesa-Bianchi, Nicolo and Gentile, Claudio. Improved Risk Tail Bounds for On-Line Algorithms. IEEE Trans. on Inf. Theory, 54(1):386-390, 2008.
Cesa-Bianchi, Nicolo, Conconi, Alex, and Gentile, Claudio. On the Generalization Ability of On-Line Learning Algorithms. In NIPS, pp. 359-366, 2001.
Clemencon, Stephan, Lugosi, Gabor, and Vayatis, Nicolas. Ranking and empirical minimization of U-statistics. Annals of Statistics, 36:844-874, 2008.
Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. Generalization Bounds for Learning Kernels. In ICML, pp. 247-254, 2010a.
Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. Two-Stage Learning Kernel Algorithms. In ICML, pp. 239-246, 2010b.
Cristianini, Nello, Shawe-Taylor, John, Elisseeff, Andre, and Kandola, Jaz S. On Kernel-Target Alignment. In NIPS, pp. 367-373, 2001.
Freedman, David A. On Tail Probabilities for Martingales. Annals of Probability, 3(1):100-118, 1975.
Hazan, Elad, Kalai, Adam, Kale, Satyen, and Agarwal, Amit. Logarithmic Regret Algorithms for Online Convex Optimization. In COLT, pp. 499-513, 2006.
Jin, Rong, Wang, Shijun, and Zhou, Yang. Regularized Distance Metric Learning: Theory and Algorithm. In NIPS, pp. 862-870, 2009.
Kakade, Sham M. and Tewari, Ambuj. On the Generalization Ability of Online Strongly Convex Programming Algorithms. In NIPS, pp. 801-808, 2008.
Kakade, Sham M., Sridharan, Karthik, and Tewari, Ambuj. On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization. In NIPS, 2008.
Kakade, Sham M., Shalev-Shwartz, Shai, and Tewari, Ambuj. Regularization Techniques for Learning with Matrices. JMLR, 13:1865-1890, 2012.
Kumar, Abhishek, Niculescu-Mizil, Alexandru, Kavukcuoglu, Koray, and Daume III, Hal. A Binary Classification Framework for Two-Stage Multiple Kernel Learning. In ICML, 2012.
Ledoux, Michel and Talagrand, Michel. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 2002.
Sridharan, Karthik, Shalev-Shwartz, Shai, and Srebro, Nathan. Fast Rates for Regularized Objectives. In NIPS, pp. 1545-1552, 2008.
Steinwart, Ingo and Christmann, Andreas. Support Vector Machines. Information Science and Statistics. Springer, 2008.
Vitter, Jeffrey Scott. Random Sampling with a Reservoir. ACM Trans. on Math. Soft., 11(1):37-57, 1985.
Wang, Yuyang, Khardon, Roni, Pechyony, Dmitry, and Jones, Rosie. Generalization Bounds for Online Learning Algorithms with Pairwise Loss Functions. JMLR Proceedings Track, 23:13.1-13.22, 2012.
Wang, Yuyang, Khardon, Roni, Pechyony, Dmitry, and Jones, Rosie. Online Learning with Pairwise Loss Functions, 2013. arXiv:1301.5332.
Xing, Eric P., Ng, Andrew Y., Jordan, Michael I., and Russell, Stuart J. Distance Metric Learning with Application to Clustering with Side-Information. In NIPS, pp. 505-512, 2002.
Zhao, Peilin, Hoi, Steven C. H., Jin, Rong, and Yang, Tianbao. Online AUC Maximization. In ICML, pp. 233-240, 2011.
Zinkevich, Martin. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In ICML, pp. 928-936, 2003.
-----0
Bottou, Leon. Online algorithms and stochastic approximations. In Saad, David (ed.), Online Learning and Neural Networks. Cambridge University Press, 1998.
Candès, Emmanuel J. and Tao, Terence. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inform. Theor., 52(12):5406-5425, 2006.
Donoho, D. L. Compressed sensing. IEEE Trans. Inform. Theor., 52(4):1289-1306, 2006.
Gripon, Vincent and Berrou, Claude. Sparse neural networks with large learning diversity. IEEE Trans. Neur. Netw., 22(7):1087-1096, July 2011.
Hopfield, John J. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A., 79(8):2554-2558, 1982.
Jankowski, Stanislaw, Lozowski, Andrzej, and Zurada, Jacek M. Complex-valued multistate neural associative memory. IEEE Trans. Neural Netw. Learning Syst., 7(6):1491-1496, 1996.
Jarrett, Kevin, Kavukcuoglu, Koray, Ranzato, Marc'Aurelio, and LeCun, Yann. What is the best multi-stage architecture for object recognition? In IEEE Int. Conf. Computer Vision (ICCV), pp. 2146-2153, 2009.
Kumar, K. Raj, Salavati, Amir Hesam, and Shokrollahi, Amin. Exponential pattern retrieval capacity with non-binary associative memory. In IEEE Inf. Theor. workshop (ITW), pp. 80-84, Oct 2011.
Le, Quoc V., Ngiam, Jiquan, Chen, Zhenghao, Chia, Daniel Jin Hao, Koh, Pang Wei, and Ng, Andrew Y. Tiled convolutional neural networks. In Proc. Advances in Neur. Inf. Process. Sys. (NIPS), pp. 1279-1287, 2010.
Luby, Michael, Mitzenmacher, Michael, Shokrollahi, Amin, and Spielman, Daniel A. Efficient erasure correcting codes. IEEE Transactions on Information Theory, 47(2):569-584, 2001.
McEliece, Robert J., Posner, Edward C., Rodemich, Eugene R., and Venkatesh, Santosh S. The capacity of the Hopfield associative memory. IEEE Trans. Inf. Theor., 33(4):461-482, July 1987.
Muezzinoglu, Mehmet Kerem, Guzelis, Cuneyt, and Zurada, Jacek M. A new design method for the complex-valued multistate Hopfield associative memory. IEEE Trans. Neur. Netw., 14(4):891-899, July 2003.
Oja, Erkki and Karhunen, Juha. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Math. Analysis and Applications, 106:69-84, 1985.
Oja, Erkki and Kohonen, Teuvo. The subspace learning algorithm as a formalism for pattern recognition and neural networks. In IEEE Int. Conf. Neur. Netw., volume 1, pp. 277-284, Jul 1988.
Peretto, P. and Niez, J. J. Long term memory storage capacity of multiconnected neural networks. Biol. Cybern., 54(1):53-64, May 1986.
Salavati, Amir Hesam and Karbasi, Amin. Multi-level error-resilient neural networks. In Proc. IEEE Int. Symp. Inf. Theor. (ISIT), pp. 1064-1068, Jul 2012.
Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., and Manning, C. D. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proc. EMNLP, pp. 151-161, 2011.
Tanner, R. A recursive approach to low complexity codes. IEEE Trans. Inf. Theor., 27(5):533-547, 1981.
Venkatesh, Santosh S. and Psaltis, Demetri. Linear and logarithmic capacities in associative neural networks. IEEE Trans. Inf. Theor., 35(3):558-568, September 1989.
Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, and Manzagol, Pierre-Antoine. Extracting and composing robust features with denoising autoencoders. In Proc. Int. Conf. on Machine Learning (ICML), ICML '08, pp. 1096-1103, 2008.
Xu, Lei, Krzyzak, Adam, and Oja, Erkki. Neural nets for dual subspace pattern recognition method. Int. J. Neural Syst., 2(3):169-184, 1991.
-----0
Agarwal, D., Chen, B., and Elango, P. Explore/exploit schemes for web content optimization. In Proc. Ninth IEEE International Conference on Data Mining (ICDM2009), pp. 1-10, 2009.
Audibert, J.Y., Bubeck, S., and Munos, R. Best arm identification in multi-armed bandits. In COLT, pp. 41-53, 2010.
Auer, P. and Ortner, R. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Periodica Mathematica Hungarica, 61(1-2):55-65, 2010.
Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235-256, 2002.
Bubeck, S., Munos, R., and Stoltz, G. Pure exploration in multi-armed bandits problems. In Algorithmic Learning Theory, pp. 23-37. Springer, 2009.
Bubeck, S., Wang, T., and Viswanathan, N. Multiple identifications in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning, 2013.
Chakrabarti, D., Kumar, R., Radlinski, F., and Upfal, E. Mortal multi-armed bandits. In Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS2008), pp. 273-280, 2008.
Even-Dar, E., Mannor, S., and Mansour, Y. PAC bounds for multi-armed bandit and Markov decision processes. In COLT, pp. 193-209. Springer, 2002.
Even-Dar, E., Mannor, S., and Mansour, Y. Action elimination and stopping conditions for the multiarmed bandit and reinforcement learning problems. The Journal of Machine Learning Research, 7:1079-1105, 2006.
Gabillon, V., Ghavamzadeh, M., Lazaric, A., and Bubeck, S. Multi-bandit best arm identification. In Advances in Neural Information Processing Systems 24, pp. 2222-2230, 2011.
Gabillon, V., Ghavamzadeh, M., and Lazaric, A. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems 25, pp. 3221-3229, 2012.
Kalyanakrishnan, S. and Stone, P. Efficient selection of multiple bandit arms: Theory and practice. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pp. 511-518, 2010.
Kalyanakrishnan, S., Tewari, A., Auer, P., and Stone, P. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, 2012.
Lai, Tze Leung and Robbins, Herbert. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.
Mannor, S. and Tsitsiklis, J.N. The sample complexity of exploration in the multi-armed bandit problem. The Journal of Machine Learning Research, 5:623-648, 2004.
Pandey, S., Agarwal, D., Chakrabarti, D., and Josifovski, V. Bandits for taxonomies: A model based approach. In Proceedings of the SIAM International Conference on Data Mining (SDM), 2007.
Radlinski, F., Kleinberg, R., and Joachims, T. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th International Conference on Machine Learning, pp. 784-791. ACM, 2008.
-----0
Asuncion, A., Welling, M., Smyth, P., and Teh, Y. W. On smoothing and inference for topic models. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI), 2009.
Blei, D. M. and Lafferty, J. Topic models. In Text Mining: Theory and Applications. Taylor and Francis, London, UK, 2009.
Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, March 2003.
Brin, S. and Page, L. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International Conference on the World Wide Web, pp. 107–117, 1998.
Bryant, M. and Sudderth, E. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In Bartlett, P., Pereira, F. C. N., Burges, C. J. C., Bottou, L., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 25, pp. 2708–2716. 2012.
Du, L., Buntine, W., and Jin, H. A segmented topic model based on the two-parameter Poisson-Dirichlet process. Machine Learning, 81:5–19, 2010.
Frank, A. and Asuncion, A. UCI machine learning repository, 2010.
Hoffman, M., Blei, D., and Bach, F. Online learning for latent Dirichlet allocation. In Lafferty, J., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23, pp. 856–864. 2010.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, November 1999.
Kim, D.-k., Motoyama, M., Voelker, G. M., and Saul, L. K. Topic modeling of freelance job postings to monitor Web service abuse. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence (AISec-11), pp. 11–20, 2011.
Kurihara, K., Welling, M., and Teh, Y. W. Collapsed variational Dirichlet process mixture models. In Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, 2007.
Motoyama, M., McCoy, D., Levchenko, K., Savage, S., and Voelker, G. M. An analysis of underground forums. In Proceedings of the 2011 ACM SIGCOMM Internet Measurement Conference (IMC-11), pp. 71–80, 2011.
Sato, I., Kurihara, K., and Nakagawa, H. Practical collapsed variational Bayes inference for hierarchical Dirichlet process. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-12), August 2012.
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, December 2006.
Teh, Y. W., Kurihara, K., and Welling, M. Collapsed variational inference for HDP. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. (eds.), Advances in Neural Information Processing Systems 20, pp. 1481–1488. 2008.
Wallach, H., Mimno, D., and McCallum, A. Rethinking LDA: why priors matter. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C. K. I., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 22, pp. 1973–1981. 2009a.
Wallach, H., Murray, I., Salakhutdinov, R., and Mimno, D. Evaluation methods for topic models. In Bottou, L. and Littman, M. (eds.), Proceedings of the 26th International Conference on Machine Learning (ICML-09), pp. 1105–1112, Montreal, June 2009b. Omnipress.
Wang, C., Paisley, J., and Blei, D. Online variational inference for the hierarchical Dirichlet process. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.
Wang, C. and Blei, D. Truncation-free online variational inference for Bayesian nonparametric models. In Bartlett, P., Pereira, F. C. N., Burges, C. J. C., Bottou, L., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 25, pp. 422–430. 2012.
-----0
Blei, D., Ng, A., and Jordan, M. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
Cao, L. and Fei-Fei, L. Spatially coherent latent topic model for concurrent object segmentation and classification. In Proceedings of the International Conference on Computer Vision (ICCV), 2007.
Fevotte, Cedric, Bertin, Nancy, and Durrieu, Jean-Louis. Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Computation, 21(3):793–830, 2009.
Garofolo, John S., Lamel, Lori F., Fisher, William M., Fiscus, Jonathan G., Pallett, David S., Dahlgren, Nancy L., and Zue, Victor. TIMIT acoustic-phonetic continuous speech corpus. Linguistic Data Consortium, Philadelphia, 1993.
Hofmann, T. Probabilistic latent semantic analysis. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence (UAI), 1999a.
Hofmann, T. Probabilistic latent semantic indexing. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 1999b.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
Lee, D. D. and Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, 1999.
Popescul, A., Ungar, L. H., Pennock, D. M., and Lawrence, S. Probabilistic models for unified collaborative and content-based recommendation in sparse-data environment. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence (UAI), 2001.
Roweis, S. T. and Saul, L. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
Smaragdis, Paris, Raj, Bhiksha, and Shashanka, Madhusudana. A probabilistic latent variable model for acoustic modeling. In Neural Information Processing Systems Workshop on Advances in Models for Acoustic Processing, 2006.
Smaragdis, Paris, Shashanka, M., and Raj, B. A sparse non-parametric approach for single channel separation of known sounds. In Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, 2009.
Vincent, E., Fevotte, C., and Gribonval, R. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469, 2006.
-----0
Bickel, P. J. and Levina, E. Some theory for Fisher's linear discriminant function, naive Bayes, and some alternatives when there are many more variables than observations. Bernoulli, 10:989–1010, 2004.
Cai, T. and Liu, W. A direct estimation approach to sparse linear discriminant analysis. arXiv preprint arXiv:1107.3442, 2011.
Cai, Tony, Liu, Weidong, and Luo, Xi. A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106:594–607, 2011.
Candes, Emmanuel and Tao, Terence. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35:2313–2351, 2007.
Fan, J. and Fan, Y. High dimensional classification using features annealed independence rules. Annals of Statistics, 36(6):2605, 2008.
Fan, J., Feng, Y., and Tong, X. A road to classification in high dimensional space. arXiv preprint arXiv:1011.6095, 2010.
Mai, Q., Zou, H., and Yuan, M. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika, 2012.
Mardia, K.V., Kent, J.T., and Bibby, J.M. Multivariate analysis. 1980.
Meinshausen, N. and Buhlmann, P. High dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3), 2006.
Muirhead, R.J. Aspects of multivariate statistical theory, volume 42. Wiley Online Library, 1982.
Shao, J., Wang, Y., Deng, X., and Wang, S. Sparse linear discriminant analysis by thresholding for high dimensional data. arXiv preprint arXiv:1105.3561, 2011.
Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences, 99(10):6567, 2002.
Wainwright, Martin. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, May 2009.
Wang, S. and Zhu, J. Improved centroids estimation for the nearest shrunken centroid classifier. Bioinformatics, 23(8):972, 2007.
Yuan, Ming. High dimensional inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research, 11:2261–2286, 2010.
Zhao, P. and Yu, B. On model selection consistency of lasso. J. of Mach. Learn. Res., 7:2541–2567, 2007.
Zou, Hui. The adaptive lasso and its oracle properties. Journal of American Statistical Association, 101(476):1418–1429, 2006.
-----0
Banerjee, O., El Ghaoui, L., and d'Aspremont, A. Model selection through sparse maximum likelihood estimation. J. Mach. Learn. Res., 9:485–516, 2008.
Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci., 2:183–202, 2009.
Cai, T., Liu, W., and Luo, X. A constrained ℓ1 minimization approach to sparse precision matrix estimation. J. Am. Statist. Assoc., 106:594–607, 2011.
Chiquet, J., Grandvalet, Y., and Ambroise, C. Inferring multiple graphical structures. Stat. Comput., 21(4):537–553, 2011.
Danaher, P., Wang, P., and Witten, D. M. The joint graphical lasso for inverse covariance estimation across multiple classes. Technical report, University of Washington, 2011.
Dempster, A. P. Covariance selection. Biometrics, 28:157–175, 1972.
Friedman, J. H., Hastie, T. J., and Tibshirani, R. J. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
Guo, J., Levina, E., Michailidis, G., and Zhu, J. Joint estimation of multiple graphical models. Biometrika, 98(1):1–15, 2011.
Honorio, J. and Samaras, D. Multi-task learning of Gaussian graphical models. In Furnkranz, Johannes and Joachims, Thorsten (eds.), Proc. 27 Int. Conf. Mach. Learn., pp. 447–454. Omnipress, Haifa, Israel, June 2010.
Johnson, C., Jalali, A., and Ravikumar, P. High-dimensional sparse inverse covariance estimation using greedy methods. In Lawrence, Neil and Girolami, Mark (eds.), Proc. 15 Int. Conf. Artif. Intel. Statist., pp. 574–582. 2012.
Katenka, N. and Kolaczyk, E. D. Multi-attribute networks and the impact of partial information on inference and characterization. Ann. Appl. Stat., 6(3):1068–1094, 2011.
Kolar, M. and Xing, E. P. Consistent covariance selection from data with missing values. In Langford, John and Pineau, Joelle (eds.), Proc. 29 Int. Conf. Mach. Learn., pp. 551–558, New York, NY, USA, July 2012. Omnipress.
Kolar, M., Song, L., Ahmed, A., and Xing, E. P. Estimating time-varying networks. Ann. Appl. Statist., 4(1):94–123, 2010.
Kolar, M., Liu, H., and Xing, E. P. Graph estimation from multi-attribute data. Technical report, Carnegie Mellon University (arXiv:1210.7665), 2012.
Lam, C. and Fan, J. Sparsistency and rates of convergence in large covariance matrix estimation. Ann. Statist., 37:4254–4278, 2009.
Li, H. and Gui, J. Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics, 7(2):302–317, 2006.
Mazumder, R. and Agarwal, D. K. A flexible, scalable and efficient algorithmic framework for primal graphical lasso. Technical report, Stanford University, 2011.
Meinshausen, N. and Buhlmann, P. High dimensional graphs and variable selection with the lasso. Ann. Statist., 34(3):1436–1462, 2006.
Peng, Jie, Wang, Pei, Zhou, Nengfeng, and Zhu, Ji. Partial correlation estimation by joint sparse regression models. J. Am. Statist. Assoc., 104(486):735–746, 2009.
Rao, B. Partial canonical correlations. Trabajos de Estadística y de Investigación Operativa, 20(2):211–219, 1969.
Ravikumar, P., Wainwright, M. J., Raskutti, G., and Yu, B. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electron. J. Statist., 5:935–980, 2011.
Shen, X., Pan, W., and Zhu, Y. Likelihood-based selection and sharp parameter estimation. J. Am. Statist. Assoc., 107:223–232, 2012.
Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109(3):475–494, 2001.
Varoquaux, G., Gramfort, A., Poline, J.-B., and Thirion, B. Brain covariance selection: better individual functional connectivity models using population prior. In Lafferty, J., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A. (eds.), Adv. Neural Inf. Proc. Sys. 23, pp. 2334–2342. 2010.
Yuan, M. High dimensional inverse covariance matrix estimation via linear programming. J. Mach. Learn. Res., 11:2261–2286, 2010.
Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. J. R. Statist. Soc. B, 68:49–67, 2006.
Yuan, M. and Lin, Y. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
Zhao, T., Liu, H., Roeder, K. E., Lafferty, J. D., and Wasserman, L. A. The huge package for high-dimensional undirected graph estimation in R. J. Mach. Learn. Res., 13:1059–1062, 2012.
-----0
Berkman, Omer and Vishkin, Uzi. Recursive star-tree parallel data structure. SIAM Journal on Computing, 22(2):221–242, 1993.
Bystroff, C., Thorsson, V., and Baker, D. HMMSTR: a hidden Markov model for local sequence-structure correlation in proteins. J. Mol. Biol., 301:173–190, 2000.
Dietterich, T. G., Ashenfelter, A., and Bulatov, Y. Training conditional random fields via gradient tree boosting. In ICML, 2004.
Komodakis, N. and Paragios, N. Beyond pairwise energies: Efficient optimization for higher-order MRFs. In CVPR, 2009.
Lafferty, J., McCallum, A., and Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
Nguyen, Viet Cuong, Ye, Nan, Lee, Wee Sun, and Chieu, Hai Leong. Semi-Markov conditional random field with high-order features. In ICML 2011 Structured Sparsity: Learning and Inference Workshop, 2011.
Qian, Xian, Jiang, Xiaoqian, Zhang, Qi, Huang, Xuanjing, and Wu, Lide. Sparse higher order conditional random fields for improved sequence labeling. In ICML, 2009.
Rother, C., Kohli, P., Feng, W., and Jia, J. Minimizing sparse higher order energy functions of discrete variables. In CVPR, 2009.
Sarawagi, S. and Cohen, W. Semi-Markov conditional random fields for information extraction. In NIPS, 2004.
Takhanov, R. and Kolmogorov, V. Inference algorithms for pattern-based CRFs on sequence data. arXiv, abs/1210.0508, 2012.
Tsochantaridis, I., Hofmann, T., Joachims, T., and Altun, Y. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.
Vose, M. D. A linear algorithm for generating random numbers with a given distribution. IEEE Transactions on Software Engineering, 17(9):972–975, 1991.
Ye, Nan, Lee, Wee Sun, Chieu, Hai Leong, and Wu, Dan. Conditional random fields with high-order features for sequence labeling. In NIPS, 2009.
-----0
Abe, N. and Warmuth, M.K. On the computational complexity of approximating distributions by probabilistic automata. Machine Learning, 9:205–260, 1992.
Anandkumar, A., Hsu, D., and Kakade, S.M. A method of moments for mixture models and hidden Markov models. In COLT, 2012.
Bailly, Raphael. Quadratic weighted automata: Spectral algorithm and likelihood maximization. Journal of Machine Learning Research, 20:147–162, 2011.
Balle, Borja, Quattoni, Ariadna, and Carreras, Xavier. Local loss optimization in operator models: A new insight into spectral learning. arXiv preprint arXiv:1206.6393, 2012.
Baum, L.E., Petrie, T., Soules, G., and Weiss, N. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat., 41(1):164–171, 1970.
Belkin, M. and Sinha, K. Polynomial learning of distribution families. In Foundations of Computer Science (FOCS), pp. 103–112, 2010.
Bickel, P.J., Ritov, Y., and Ryden, T. Asymptotic normality of the maximum-likelihood estimator for general hidden Markov models. Ann. Statist., 26(4):1614–1635, 1998.
Cappe, O., Moulines, E., and Ryden, T. Inference in hidden Markov models. Springer Series in Statistics. Springer, New York, 2005.
Chang, J.T. Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Math. Biosci., 137(1):51–73, 1996.
Cybenko, G. and Crespi, V. Learning hidden Markov models using nonnegative matrix factorization. IEEE Trans. Information Theory, 57(6):3963–3970, 2011.
Daniel, J.W. Stability of the solution of definite quadratic programs. Mathematical Programming, 5:41–53, 1973.
Dantzig, G.B., Folkman, J., and Shapiro, N. On the continuity of the minimum sets of a continuous function. J. Math. Anal. Appl., 17:519–548, 1967.
den Hertog, D. Interior point approach to linear, quadratic and convex programming, volume 277 of Mathematics and its Applications. Kluwer, Dordrecht, 1994.
Douc, R. and Matias, C. Asymptotics of the maximum likelihood estimator for general hidden Markov models. Bernoulli, 7(3):381–420, 2001.
Farago, A. and Lugosi, G. An algorithm to find the global optimum of left-to-right hidden Markov model parameters. Problems Control Inform. Theory/Problemy Upravlen. Teor. Inform., 18(6):435–444, 1989.
Hsu, D., Kakade, S.M., and Zhang, T. A spectral algorithm for learning hidden Markov models. In COLT, 2009.
Kontorovich, A. and Weiss, R. Uniform Chernoff and Dvoretzky-Kiefer-Wolfowitz-type inequalities for Markov chains and related processes. arXiv:1207.4678, 2012.
Kontorovich, Aryeh, Nadler, Boaz, and Weiss, Roi. On learning parametric-output HMMs. arXiv:1302.6009, 2013.
Lakshminarayanan, B. and Raich, R. Non-negative matrix factorization for parameter estimation in hidden Markov models. In Machine Learning for Signal Processing (MLSP), pp. 89–94, 2010.
Lyngsø, R. B. and Pedersen, C. N. Complexity of comparing hidden Markov models. In Proceedings of the 12th International Symposium on Algorithms and Computation, pp. 416–428. Springer-Verlag, 2001.
Mohri, M. and Rostamizadeh, A. Stability bounds for stationary φ-mixing and β-mixing processes. The Journal of Machine Learning Research, 11:789–814, 2010.
Moitra, Ankur and Valiant, Gregory. Settling the polynomial learnability of mixtures of Gaussians. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pp. 93–102. IEEE, 2010.
Mossel, E. and Roch, S. Learning nonsingular phylogenies and hidden Markov models. Ann. Appl. Probab., 16(2):583–614, 2006.
Nesterov, Y. and Nemirovskii, A. Interior-point polynomial algorithms in convex programming. SIAM, Philadelphia, PA, 1994.
Rabiner, L. R. A tutorial on hidden Markov models and selected applications in speech recognition. In Readings in Speech Recognition, pp. 267–296. Morgan Kaufmann, 1990.
Roweis, S. and Ghahramani, Z. A unifying review of linear Gaussian models. Neural Comput., 11:305–345, February 1999. ISSN 0899-7667.
Siddiqi, S. M., Boots, B., and Gordon, G. J. Reduced-rank Hidden Markov Models. In AISTATS, 2010.
Song, Le, Boots, Byron, Siddiqi, Sajid, Gordon, Geoffrey, and Smola, Alex. Hilbert space embeddings of hidden Markov models. 2010.
Terwijn, S. On the learnability of Hidden Markov Models. In Proceedings of the 6th International Colloquium on Grammatical Inference: Algorithms and Applications, ICGI '02, pp. 261–268, London, UK, 2002. Springer-Verlag.
-----0
Anand, A., Koppula, H. S., Joachims, T., and Saxena, A. Contextually guided semantic labeling and search for 3d point clouds. IJRR, 2012.
Delaitre, V., Sivic, J., and Laptev, I. Learning person-object interactions for action recognition in still images. In NIPS, 2011.
Faraway, J. J., Reed, M. P., and Wang, J. Modelling three-dimensional trajectories by using Bezier curves with application to hand motion. J Royal Stats Soc Series C-Applied Statistics, 56:571–585, 2007.
Felzenszwalb, P. F. and Huttenlocher, D.P. Efficient graph-based image segmentation. IJCV, 59(2), 2004.
Fox, E. B., Sudderth, E. B., Jordan, M. I., and Willsky, A. S. Nonparametric Bayesian learning of switching linear dynamical systems. In NIPS 21, 2009.
Gong, S. and Xiang, T. Recognition of group activities using dynamic probabilistic networks. In ICCV, 2003.
Gupta, A., Kembhavi, A., and Davis, L.S. Observing human-object interactions: Using spatial and functional compatibility for recognition. IEEE TPAMI, 2009.
Harchaoui, Z., Bach, F., and Moulines, E. Kernel change-point analysis. In NIPS, 2008.
Hoai, M. and De la Torre, F. Max-margin early event detectors. In CVPR, 2012a.
Hoai, M. and De la Torre, F. Maximum margin temporal clustering. In AISTATS, 2012b.
Hoai, M., Lan, Z., and De la Torre, F. Joint segmentation and classification of human actions in video. In CVPR, 2011.
Hongeng, S. and Nevatia, R. Large-scale event detection using semi-hidden Markov models. In ICCV, 2003.
Ion, A., Carreira, J., and Sminchisescu, C. Probabilistic Joint Image Segmentation and Labeling. In NIPS, 2011.
Jia, Z., Gallagher, A., Saxena, A., and Chen, T. 3d-based reasoning with blocks, support, and stability. In CVPR, 2013.
Jiang, Y. and Saxena, A. Infinite latent conditional random fields for modeling environments through humans. In RSS, 2013.
Jiang, Y., Lim, M., and Saxena, A. Learning object arrangements in 3d scenes using human context. In ICML, 2012a.
Jiang, Y., Lim, M., Zheng, C., and Saxena, A. Learning to place new objects in a scene. IJRR, 31(9), 2012b.
Jiang, Y., Koppula, H. S., and Saxena, A. Hallucinated humans as the hidden context for labeling 3d scenes. In CVPR, 2013.
Joachims, T., Finley, T., and Yu, C. Cutting-plane training of structural SVMs. Mach. Learn., 77(1), 2009.
Ke, Y., Sukthankar, R., and Hebert, M. Event detection in crowded videos. In ICCV, October 2007.
Kitani, K., Ziebart, B. D., Bagnell, J. A., and Hebert, M. Activity forecasting. In ECCV, 2012.
Koppula, H. S., Anand, A., Joachims, T., and Saxena, A. Semantic labeling of 3d point clouds for indoor scenes. In NIPS, 2011.
Koppula, H. S., Gupta, R., and Saxena, A. Learning human activities and object affordances from RGB-D videos. IJRR, 2013.
Koppula, H.S. and Saxena, A. Anticipating human activities using object affordances for reactive robotic response. In RSS, 2013.
Ly, D. L., Saxena, A., and Lipson, H. Co-evolutionary predictors for kinematic pose inference from RGBD images. In GECCO, 2012.
Maji, S., Bourdev, L., and Malik, J. Action recognition from a distributed representation of pose and appearance. In CVPR, 2011.
Natarajan, P. and Nevatia, R. Coupled hidden semi Markov models for activity recognition. In WMVC, 2007.
Nguyen, M. H., Torresani, L., De la Torre, F., and Rother, C. Weakly supervised discriminative localization and classification: a joint learning process. In ICCV, 2009.
Ni, B., Wang, G., and Moulin, P. RGBD-HuDaAct: A color-depth video database for human daily activity recognition. In ICCV Workshop Consumer Depth Cameras Computer Vision, 2011.
Oh, S., Rehg, J.M., Balch, T., and Dellaert, F. Learning and inferring motion patterns using parametric segmental switching linear dynamic systems. IJCV, 2008.
Prest, A., Schmid, C., and Ferrari, V. Weakly supervised learning of interactions between humans and objects. IEEE TPAMI, 34(3):601–614, 2012.
Quattoni, A., Wang, S., Morency, L.-P., Collins, M., and Darrell, T. Hidden-state conditional random fields. IEEE TPAMI, 2007.
Ryoo, M.S. Human activity prediction: Early recognition of ongoing activities from streaming videos. In ICCV, 2011.
Sarawagi, S. and Cohen, W. W. Semi-Markov conditional random fields for information extraction. In NIPS, 2004.
Shi, Q., Wang, L., Cheng, L., and Smola, A. Human action segmentation and recognition using discriminative semi-Markov models. IJCV, 2011.
Shotton, J., Girshick, R., Fitzgibbon, A., Sharp, T., Cook, M., Finocchio, M., Moore, R., Kohli, P., Criminisi, A., Kipman, A., and Blake, A. Efficient human pose estimation from single depth images. IEEE TPAMI, 2012.
Simon, T., Nguyen, M. H., De la Torre, F., and Cohn, J. F. Action unit detection with segment-based SVMs. In CVPR, 2010.
Sminchisescu, C., Kanaujia, A., Li, Z., and Metaxas, D. Conditional models for contextual human motion recognition. In ICCV, 2005.
Sung, J., Ponce, C., Selman, B., and Saxena, A. Human activity detection from RGBD images. In AAAI PAIR workshop, 2011.
Sung, J., Ponce, C., Selman, B., and Saxena, A. Unstructured human activity detection from RGBD images. In ICRA, 2012.
Xing, Z., Pei, J., Dong, G., and Yu, P. S. Mining Sequence Classifiers for Early Prediction. In SIAM ICDM, 2008.
Xuan, X. and Murphy, K. Modeling changing dependency structure in multivariate time series. In ICML, 2007.
Yang, W., Wang, Y., and Mori, G. Recognizing human actions from still images with latent poses. In CVPR, 2010.
Yao, B. and Fei-Fei, L. Modeling mutual context of object and human pose in human-object interaction activities. In CVPR, 2010.
Zhang, H. and Parker, L. E. 4-dimensional local spatiotemporal features for human activity recognition. In IROS, 2011.
-----0
Blake, Andrew, Kohli, Pushmeet, and Rother, Carsten. Markov Random Fields for Vision and Image Processing. MIT Press, 2011.
Domke, Justin. Learning graphical model parameters with approximate marginal inference. PAMI, 2013. To appear.
Eaton, Frederik and Ghahramani, Zoubin. Choosing a variable to clamp. JMLR, 5:145–152, 2009.
He, Xuming, Zemel, Richard S., and Carreira-Perpinan, Miguel A. Multiscale conditional random fields for image labeling. In CVPR, 2004.
Kohli, Pushmeet, Ladicky, Lubor, and Torr, Philip H. S. Robust higher order potentials for enforcing label consistency. IJCV, 82(3), 2009.
Koller, Daphne and Friedman, Nir. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
Krahenbuhl, Philipp and Koltun, Vladlen. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
Kulesza, Alex and Pereira, Fernando. Structured learning with approximate inference. In NIPS, 2007.
Kumar, Sanjiv, August, Jonas, and Hebert, Martial. Exploiting inference for approximate parameter learning in discriminative fields. In Energy Minimization Methods in Computer Vision and Pattern Recognition. 2005.
LeCun, Yann, Chopra, Sumit, Hadsell, Raia, Ranzato, Marc'Aurelio, and Huang, Fu Jie. A tutorial on energy-based learning. In Predicting Structured Data. MIT Press, 2006.
Levin, Anat and Weiss, Yair. Learning to combine bottom-up and top-down segmentation. IJCV, 81:105–118, 2009.
Pal, Christopher J., Weinman, Jerod J., Tran, Lam C., and Scharstein, Daniel. On learning conditional random fields for stereo. IJCV, 99(3):319–337, 2012.
Roth, Stefan and Black, Michael J. Fields of experts. IJCV, 82(2):205–229, 2009.
Samuel, Kegan G. G. and Tappen, Marshall F. Learning optimized MAP estimates in continuously-valued MRF models. In CVPR, 2009.
Shotton, Jamie, Winn, John M., Rother, Carsten, and Criminisi, Antonio. Textonboost for image understanding. IJCV, 81(1), 2009.
Sriperumbudur, Bharath K. and Lanckriet, Gert R. G. On the convergence of the concave-convex procedure. In NIPS, 2009.
Sutton, Charles A. and McCallum, Andrew. Piecewise training for undirected models. In UAI, 2005.
Tappen, Marshall F. Utilizing variational optimization to learn Markov random fields. In CVPR, 2007.
Tappen, Marshall F. Learning parameters in continuous-valued Markov random fields. In Markov Random Fields For Vision And Image Processing. MIT Press, 2011.
Vineet, Vibhav, Warrell, Jonathan, Sturgess, Paul, and Torr, Philip H. S. Improved initialization and Gaussian mixture pairwise terms for dense random fields with mean-field inference. In BMVC, 2012a.
Vineet, Vibhav, Warrell, Jonathan, and Torr, Philip H. S. Filter-based mean-field inference for random fields with higher-order terms and product label-spaces. In ECCV, 2012b.
Vishwanathan, S. V. N., Schraudolph, Nicol N., Schmidt, Mark W., and Murphy, Kevin P. Accelerated training of conditional random fields with stochastic gradient methods. In ICML, 2006.
Wainwright, Martin J. Estimating the wrong graphical model: Benefits in the computation-limited setting. JMLR, 7:1829–1859, 2006.
Wainwright, Martin J. and Jordan, Michael I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2), 2008.
Wainwright, Martin J., Jaakkola, Tommi S., and Willsky, Alan S. Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudo-moment matching. In Proc. Workshop on Artificial Intelligence and Statistics, 2003.
Yuille, Alan L. and Rangarajan, Anand. The concave-convex procedure (CCCP). In NIPS, 2001.
-----0
Hinton, Geoffrey E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
Hinton, Geoffrey E. and Salakhutdinov, Ruslan. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
Hinton, Geoffrey E., Osindero, Simon, and Teh, Yee-Whye. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
Le Roux, Nicolas and Bengio, Yoshua. Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6):1631–1649, 2008.
Le Roux, Nicolas and Bengio, Yoshua. Deep belief networks are compact universal approximators. Neural Computation, 22(8):2192–2207, 2010.
Le Roux, Nicolas, Heess, Nicolas, Shotton, Jamie, and Winn, John M. Learning a generative model of images by factoring appearance and shape. Neural Computation, 23(3):593–650, 2011.
Lee, Honglak, Grosse, Roger, Ranganath, Rajesh, and Ng, Andrew Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pp. 609–616. ACM, 2009.
Li, Jonathan Q. and Barron, Andrew R. Mixture density estimation. In Solla, Sara A., Leen, Todd K., and Muller, Klaus-Robert (eds.), Advances in Neural Information Processing Systems 12 (NIPS 1999), pp. 279–285. MIT Press, 2000.
Montufar, Guido and Ay, Nihat. Refinements of universal approximation results for deep belief networks and restricted Boltzmann machines. Neural Computation, 23(5):1306–1319, 2011.
Salakhutdinov, Ruslan and Hinton, Geoffrey E. Learning a nonlinear embedding by preserving class neighbourhood structure. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS), volume 2, pp. 412–419, 2007.
Smolensky, Paul. Information processing in dynamical systems: foundations of harmony theory. In Rumelhart, David E. and McClelland, James L. (eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, pp. 194–281. MIT Press, 1986.
Taylor, Graham W., Fergus, Rob, LeCun, Yann, and Bregler, Christoph. Convolutional learning of spatio-temporal features. In Computer Vision – ECCV 2010, volume 6316 of LNCS, pp. 140–153. Springer, 2010.
Wang, Nan, Melchior, Jan, and Wiskott, Laurenz. An analysis of Gaussian-binary restricted Boltzmann machines for natural images. In Verleysen, Michel (ed.), Proceedings of the 20th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2012), pp. 287–292. Evere, Belgium: d-side publications, 2012.
Welling, Max, Rosen-Zvi, Michal, and Hinton, Geoffrey E. Exponential family harmoniums with an application to information retrieval. In Saul, Lawrence K., Weiss, Yair, and Bottou, Leon (eds.), Advances in Neural Information Processing Systems 17 (NIPS 2004), pp. 1481–1488. MIT Press, 2005.
Zeevi, Assaf J. and Meir, Ronny. Density estimation through convex combinations of densities: Approximation and estimation bounds. Neural Networks, 10(1):99–109, 1997.
-----0
Andrews, Stuart, Tsochantaridis, Ioannis, and Hofmann, Thomas. Support vector machines for multiple-instance learning. In NIPS, 2002.
Babenko, Boris, Verma, Nakul, Dollar, Piotr, and Belongie, Serge. Multiple instance learning with manifold bags. In ICML, pp. 81–88, 2011.
Ben-Tal, Aharon, El Ghaoui, Laurent, and Nemirovski, Arkadi. Robust Optimization. Princeton University Press, 2009.
Bertsimas, Dimitris and Popescu, Ioana. Optimal inequalities in probability theory: A convex optimization approach. SIAM Journal on Optimization, 15:780–804, 2001.
Byrd, R.H., Lu, P., Nocedal, J., and Zhu, C. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
Chapelle, Olivier. Training a support vector machine in the primal. Neural Computation, 19(5):1155–1178, 2007.
Cuturi, M., Vert, J.-P., Birkenes, O., and Matsui, T. A kernel for time series based on global alignments. In ICASSP, volume 2, 2007.
Cuturi, Marco. Fast global alignment kernels. In ICML 2011, 2011.
Dietterich, Thomas G., Lathrop, Richard H., and Lozano-Pérez, Tomás. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31–71, 1997.
Gehler, Peter V. and Chapelle, Olivier. Deterministic annealing for multiple-instance learning. In AISTATS, 2007.
Guo, Yuhong. Max-margin multiple-instance learning via semidefinite programming. In Advances in Machine Learning, volume 5828 of Lecture Notes in Computer Science, pp. 98–108. Springer, 2009.
Kim, M. and De la Torre, F. Gaussian processes multiple-instance learning. In ICML, 2010.
Lanckriet, Gert R. G., El Ghaoui, Laurent, Bhattacharyya, Chiranjib, and Jordan, Michael I. A robust minimax approach to classification. Journal of Machine Learning Research, 3:555–582, 2002.
Le Thi, Hoai An and Pham Dinh, Tao. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of Operations Research, 133:23–46, 2005. ISSN 0254-5330.
Mangasarian, O.L. and Wild, E.W. Multiple instance classification via successive linear programming. Journal of Optimization Theory and Applications, 137(3):555–568, 2008.
Maron, O. and Lozano-Perez, T. A framework for multiple-instance learning. In NIPS, 1998.
Schölkopf, Bernhard and Smola, Alexander J. Learning with Kernels. MIT Press, 2002.
Shivaswamy, Pannagadatta K., Bhattacharyya, Chiranjib, and Smola, Alexander J. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7:1283–1314, 2006.
Sra, S., Nowozin, S., and Wright, S.J. Optimization for Machine Learning. MIT Press, 2011.
Sriperumbudur, Bharath K. and Lanckriet, Gert R. G. On the convergence of the concave-convex procedure. In NIPS, pp. 1759–1767, 2009.
Viola, P., Platt, J., and Zhang, C. Multiple instance boosting for object detection. In NIPS, 2006.
Wang, Li, Zhu, Ji, and Zou, Hui. Hybrid huberized support vector machines for microarray classification and gene selection. Bioinformatics, 24(3):412–419, 2008.
Warrell, J. and Torr, P. Multiple-instance learning with structured bag models. Energy Minimization Methods in Computer Vision and Pattern Recognition, pp. 369–384, 2011.
Yuille, A.L. and Rangarajan, A. The concave-convex procedure. Neural Comput., 15(4):915–936, 2003.
Zhang, Qi, Goldman, Sally A., Yu, Wei, and Fritts, Jason E. Content-based image retrieval using multiple-instance learning. In ICML, 2002.
Zien, A., De Bona, F., and Ong, C.S. Training and approximation of a primal multiclass support vector machine. In ASMDA, 2007.
-----0
Bertsekas, D. P. Projected Newton Methods for Optimization Problems with Simple Constraints. SIAM Journal on Control and Optimization, 20(2):221–246, 1982.
d'Aspremont, Alexandre, El Ghaoui, Laurent, Jordan, Michael I., and Lanckriet, Gert R. G. A Direct Formulation for Sparse PCA Using Semidefinite Programming. SIAM Rev., 49(3):434–448, 2007.
d'Aspremont, Alexandre, Bach, Francis, and El Ghaoui, Laurent. Optimal Solutions for Sparse Principal Component Analysis. J. Mach. Learn. Res., 9:1269–1294, 2008.
Gafni, Eli M. and Bertsekas, Dimitri P. Two-Metric Projection Methods for Constrained Optimization. SIAM Journal on Control and Optimization, 22(6):936–964, 1984.
Journée, Michel, Nesterov, Yurii, Richtárik, Peter, and Sepulchre, Rodolphe. Generalized Power Method for Sparse Principal Component Analysis. J. Mach. Learn. Res., 11:517–553, 2010.
Lee, Donghwan, Lee, Woojoo, Lee, Youngjo, and Pawitan, Yudi. Super-sparse principal component analyses for high-throughput genomic data. BMC Bioinformatics, 11(1):296, 2010.
Mackey, Lester. Deflation Methods for Sparse PCA. In Advances in Neural Information Processing Systems, pp. 1017–1024. MIT Press, 2009.
Moghaddam, Baback, Weiss, Yair, and Avidan, Shai. Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms. In Advances in Neural Information Processing Systems, pp. 915–922. MIT Press, 2006.
O'Leary, Dianne P., Stewart, G. W., and Vandergraft, James S. Estimating the Largest Eigenvalue of a Positive Definite Matrix. Mathematics of Computation, 33(148):1289–1292, 1979.
Parlett, Beresford N. The symmetric eigenvalue problem. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1998.
Richtárik, Peter, Takáč, Martin, and Ahipaşaoğlu, Selin Damla. Alternating Maximization: Unifying Framework for 8 Sparse PCA Formulations and Efficient Parallel Codes. CoRR, abs/1212.4137, 2012.
Singh, Dinesh, Febbo, Phillip G, Ross, Kenneth, Jackson, Donald G, Manola, Judith, Ladd, Christine, Tamayo, Pablo, Renshaw, Andrew A, D'Amico, Anthony V, Richie, Jerome P, Lander, Eric S, Loda, Massimo, Kantoff, Philip W, Golub, Todd R, and Sellers, William R. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2):203–209, 2002.
Tapia, R. A. and Whitley, David L. The Projected Newton Method has Order 1 + √2 for the Symmetric Eigenvalue Problem. SIAM Journal on Numerical Analysis, 25(6):1376–1382, 1988.
Witten, D M, Tibshirani, R, and Hastie, T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, June 2009.
Yao, Fangzhou, Coquery, Jeff, and Lê Cao, Kim-Anh. Independent Principal Component Analysis for biologically meaningful dimension reduction of large biological data sets. BMC Bioinformatics, 13(1):24, 2012.
Zou, Hui, Hastie, Trevor, and Tibshirani, Robert. Sparse Principal Component Analysis. Journal of Computational and Graphical Statistics, 15(2):265–286, June 2006.
-----0
Anthony, M. and Holden, S. B. Cross-validation for binary classification by real-valued functions: theoretical analysis. In Proc. COLT, pp. 218–229, 1998.
Bengio, Yoshua and Grandvalet, Yves. No unbiased estimator of the variance of k-fold cross-validation. JMLR, 5:1089–1105, 2004.
Blockeel, Hendrik and Struyf, Jan. Efficient algorithms for decision tree cross-validation. JMLR, 3:621–650, 2002.
Blum, A., Kalai, A., and Langford, J. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In Proc. COLT, pp. 203–208, 1999.
Bousquet, O. and Elisseeff, A. Stability and generalization. JMLR, 2:499–526, 2002.
Devroye, L. P. and Wagner, T. J. Distribution-free performance bounds. IEEE TOIT, 25:601–604, 1979.
Kale, S., Kumar, R., and Vassilvitskii, S. Cross-validation and mean-square stability. In Proc. ICS, pp. 487–495, 2011.
Kearns, M. A bound on the error of cross validation with consequences for the training-test split. In Proc. NIPS, pp. 183–189, 1996.
Kearns, M. J. and Ron, D. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999.
Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proc. IJCAI, pp. 1137–1143, 1995.
Kutin, Samuel and Niyogi, Partha. Almost-everywhere algorithmic stability and generalization error. In Proc. UAI, pp. 275–282, 2002.
Moore, A. W. and Lee, M. S. Efficient algorithms for minimizing cross validation error. In Proc. ICML, pp. 190–198, 1994.
Mullin, M. D. and Sukthankar, R. Complete cross-validation for nearest neighbor classifiers. In Proc. ICML, pp. 639–646, 2000.
Ng, A. Y. Preventing overfitting of cross-validation data. In Proc. ICML, pp. 245–253, 1997.
Rogers, W. and Wagner, T. A finite sample distribution-free performance bound for local discrimination rules. Annals of Statistics, 6(3):506–514, 1978.
Rosset, Saharon. Bi-level path following for cross validated solution of kernel quantile regression. JMLR, 10:2473–2505, 2009.
-----0
Arora, Sanjeev, Ge, Rong, Kannan, Ravi, and Moitra, Ankur. Computing a nonnegative matrix factorization – provably. In STOC, 2012a.
Arora, Sanjeev, Ge, Rong, and Moitra, Ankur. Learning topic models – going beyond SVD. In FOCS, 2012b.
Bien, J., Xu, Y., and Mahoney, M. CUR from a sparse optimization viewpoint. In NIPS, 2010.
Bittorf, Victor, Recht, Benjamin, Ré, Christopher, and Tropp, Joel A. Factoring nonnegative matrices with linear programs. In NIPS, 2012.
Bühlmann, Peter and van de Geer, Sara. Statistics for High Dimensional Data. Springer, 2010.
Cichocki, A., Zdunek, R., Phan, A. H., and Amari, S. Non-negative Matrix and Tensor Factorizations. Wiley, 2009.
Clarkson, K. More output-sensitive geometric algorithms. In FOCS, 1994.
Donoho, D. and Stodden, V. When does non-negative matrix factorization give a correct decomposition into parts? In NIPS, 2003.
Dulá, J. H., Helgason, R. V., and Venugopal, N. An algorithm for identifying the frame of a pointed finite conical hull. INFORMS Jour. on Comp., 10(3):323–330, 1998.
Elhamifar, Ehsan, Sapiro, Guillermo, and Vidal, René. See all by looking at a few: Sparse modeling for finding representative objects. In CVPR, 2012.
Esser, Ernie, Möller, Michael, Osher, Stanley, Sapiro, Guillermo, and Xin, Jack. A convex model for nonnegative matrix factorization and dimensionality reduction on physical space. IEEE Transactions on Image Processing, 21(10):3239–3252, 2012.
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. Liblinear: A library for large linear classification. JMLR, 2008.
Gillis, Nicolas and Vavasis, Stephen A. Fast and robust recursive algorithms for separable nonnegative matrix factorization. arXiv:1208.1237v2, 2012.
Greene, Derek and Cunningham, Padraig. Practical solutions to the problem of diagonal dominance in kernel document clustering. In ICML, 2006.
Hsieh, C. J. and Dhillon, I. S. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In KDD, 2011.
Kambadur, P., Gupta, A., Ghoting, A., Avron, H., and Lumsdaine, A. PFunc: Modern Task Parallelism For Modern High Performance Computing. In ACM/IEEE conference on Supercomputing, 2009.
Kumar, Abhishek, Sindhwani, Vikas, and Kambadur, Prabhanjan. Fast conical hull algorithms for near-separable non-negative matrix factorization. arXiv:1210.1190, 2012.
Lee, D. and Seung, S. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
Lemur. http://lemurproject.org/clueweb09/.
Lewis, D, Yang, Y, Rose, T, and Li, F. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.
Lin, C.-J. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 2007.
Nemirovski, A. Lecture Notes: Introduction to Linear Optimization. 2010.
Tropp, J. A. Algorithms for simultaneous sparse approximation. Part II: Convex relaxation. Signal Processing, 86:589–602, 2006.
Tropp, J. A., Gilbert, A. C., and Strauss, M. J. Algorithms for simultaneous sparse approximation. Part I: Greedy pursuit. Signal Processing, 86:572–588, 2006.
Vavasis, S. On the complexity of non-negative matrix factorization. SIAM Journal on Optimization, 20(3):1364, 2009.
-----0
Bach, F. R. and Jordan, M. I. Thin junction trees. In Adv. NIPS, 2002.
Beinlich, I. A., Suermondt, H. J., Chavez, R. M., and Cooper, G. F. The ALARM Monitoring System: A Case Study with Two Probabilistic Inference Techniques for Belief Networks. In Proc. of Euro. Conf. on AI in Medicine, 1989.
Bertsekas, D. P. Nonlinear Programming. Athena Scientific, 1999.
Bishop, C. M. et al. Pattern recognition and machine learning. Springer New York, 2006.
Candès, E. J. and Tao, T. Decoding by linear programming. IEEE Trans. Inf. Theory, 51(12):4203–4215, 2005.
Chechetka, A. and Guestrin, C. Efficient principled learning of thin junction trees. In Adv. NIPS, 2007.
Chow, C. K. and Liu, C. N. Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory, 14:462–467, 1968.
Cover, T. M. and Thomas, J. A. Elements of information theory. John Wiley & Sons, 2006.
Deshpande, A., Garofalakis, M. N., and Jordan, M. I. Efficient stepwise selection in decomposable models. In Proc. UAI, 2001.
Frank, A., Király, T., and Kriesell, M. On decomposing a hypergraph into k connected subhypergraphs. Discrete Applied Mathematics, 131(2):373–383, 2003.
Friedman, N. and Koller, D. Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50(1):95–125, 2003.
Fukunaga, T. Computing minimum multiway cuts in hypergraphs from hypertree packings. Integer Prog. Comb. Optimization, pp. 15–28, 2010.
Gogate, V., Webb, W. A., and Domingos, P. Learning efficient Markov networks. In Adv. NIPS, 2010.
Karger, D. and Srebro, N. Learning Markov networks: maximum bounded tree-width graphs. In Proc. ACM-SIAM symposium on Discrete algorithms, 2001.
Koller, D. and Friedman, N. Probabilistic graphical models: principles and techniques. MIT press, 2009.
Kolmogorov, V. and Schoenemann, T. Generalized sequential tree-reweighted message passing. ArXiv e-prints, May 2012.
Krause, A. and Guestrin, C. Near-optimal nonmyopic value of information in graphical models. In Proc. UAI, 2005.
Lauritzen, S. L. Graphical Models. Oxford University Press, Oxford, 1996.
Lorea, M. Hypergraphes et matroides. Cahiers Centre Etud. Rech. Oper, 17:289–291, 1975.
Malvestuto, F. M. Approximating discrete probability distributions with decomposable models. IEEE Trans. Systems, Man, Cybernetics, 21(5), 1991.
Narasimhan, M. and Bilmes, J. PAC-learning bounded tree-width graphical models. In Proc. UAI, 2004.
Nedic, A. and Ozdaglar, A. Approximate primal solutions and rate analysis for dual subgradient methods. SIAM J. Opt., 19(4), February 2009.
Saul, L. and Jordan, M. I. Exploiting tractable substructures in intractable networks. In Adv. NIPS, 1995.
Schrijver, A. Combinatorial optimization: Polyhedra and efficiency. Springer, 2004.
Shahaf, D., Chechetka, A., and Guestrin, C. Learning thin junction trees via graph cuts. In Proc. AISTATS, 2009.
Spirtes, P., Glymour, C., and Scheines, R. Causation, prediction, and search, volume 81. MIT press, 2001.
Srebro, N. Maximum likelihood bounded tree-width Markov networks. In Proc. UAI, 2002.
Szantai, T. and Kovacs, E. Discovering a junction tree behind a Markov network by a greedy algorithm. ArXiv e-prints, April 2011.
Szantai, T. and Kovacs, E. Hypergraphs as a mean of discovering the dependence structure of a discrete multivariate probability distribution. Annals OR, 193(1), 2012.
Teyssier, M. and Koller, D. Ordering-based search: A simple and effective algorithm for learning bayesian networks. In Proc. UAI, 2005.
Wainwright, M. J. and Jordan, M. I. Graphical models, exponential families, and variational inference. Found. and Trends in Machine Learning, 1(1-2), 2008.
-----0
Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J.W. A theory of learning from different domains. Machine Learning, 79(1):151–175, 2010a.
Ben-David, S., Lu, T., Luu, T., and Pal, D. Impossibility Theorems for Domain Adaptation. JMLR W&CP, 9:129–136, 2010b.
Bishop, C.M. Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006.
Bousquet, O. and Elisseeff, A. Stability and Generalization. Journal of Machine Learning Research, 2:499–526, 2002.
Cawley, G. C. and Talbot, N. L. C. Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters. Journal of Machine Learning Research, 8:841–861, 2007.
Cortes, C., Mohri, M., Riley, M., and Rostamizadeh, A. Sample Selection Bias Correction Theory. In Algorithmic Learning Theory, pp. 38–53. Springer, 2008.
De Vito, E., Caponnetto, A., and Rosasco, L. Model Selection for Regularized Least-Squares Algorithm in Learning Theory. Found. Comput. Math., 5(1):59–85, February 2005.
Elisseeff, A. and Pontil, M. Leave-one-out Error and Stability of Learning Algorithms with Applications. In Advances in Learning Theory: Methods, Models and Applications, pp. 111–125. IOS Press, 2003.
Fei-Fei, L., Fergus, R., and Perona, P. One-shot learning of object categories. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(4):594–611, 2006.
Kutin, S. and Niyogi, P. Almost-everywhere algorithmic stability and generalization error. In Eighteenth conference on Uncertainty in artificial intelligence, pp. 275–282, 2002.
Kuzborskij, I., Orabona, F., and Caputo, B. From N to N+1: Multiclass Transfer Incremental Learning. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on. IEEE, 2013.
Li, X. and Bilmes, J. A Bayesian divergence prior for classifier adaptation. In Eleventh International Conference on Artificial Intelligence and Statistics, 2007.
Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain Adaptation with Multiple Sources. In Advances in neural information processing systems, volume 21, pp. 1041–1048, 2009a.
Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation: Learning bounds and algorithms. In Computational Learning Theory, 2009b.
Orabona, F., Castellini, C., Caputo, B., Fiorilla, A.E., and Sandini, G. Model Adaptation with Least-Squares SVM for Adaptive Hand Prosthetics. In Robotics and Automation, IEEE International Conference on, pp. 2897–2903. IEEE, 2009.
Orabona, F., Cesa-Bianchi, N., and Gentile, C. Beyond logarithmic bounds in online learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
Petersen, K.B. and Pedersen, M.S. The matrix cookbook. Technical University of Denmark, 2008.
Rifkin, R., Yeo, G., and Poggio, T. Regularized Least-Squares Classification. In Advances in Learning Theory: Methods, Models and Applications, pp. 131–154. IOS Press, 2003.
Tommasi, T., Orabona, F., and Caputo, B. Safety in Numbers: Learning Categories from Few Examples with Multi Model Knowledge Transfer. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pp. 3081–3088. IEEE, 2010.
Yang, J., Yan, R., and Hauptmann, A.G. Cross-Domain Video Concept Detection Using Adaptive SVMs. In Proceedings of the 15th international conference on Multimedia, pp. 188–197. ACM, 2007.
Zhang, T. Leave-one-out Bounds for Kernel Methods. Neural Computation, 15(6):1397–1437, 2003.
-----0
Bahmani, S., Boufounos, P., and Raj, B. Greedy sparsity-constrained optimization. In Signals, Systems and Computers (ASILOMAR), 2011 Conference Record of the Forty Fifth Asilomar Conference on, pp. 1148–1152. IEEE, 2011.
Becker, Stephen, Candès, Emmanuel, and Grant, Michael. Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation, pp. 1–54, 2011. ISSN 1867-2949. URL http://dx.doi.org/10.1007/s12532-011-0029-5.
Brodie, J., Daubechies, I., De Mol, C., Giannone, D., and Loris, I. Sparse and stable Markowitz portfolios. Proceedings of the National Academy of Sciences, 106(30):12267–12272, 2009.
Bunea, F., Tsybakov, A.B., Wegkamp, M.H., and Barbu, A. SPADES and mixture models. The Annals of Statistics, 38(4):2525–2558, 2010.
Candès, E., Romberg, J., and Tao, T. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. on Information Theory, 52(2):489–509, February 2006.
DeMiguel, V., Garlappi, L., Nogales, F.J., and Uppal, R. A generalized approach to portfolio optimization: Improving performance by constraining portfolio norms. Management Science, 55(5):798–812, 2009.
Flammia, S.T., Gross, D., Liu, Y.K., and Eisert, J. Quantum tomography via compressed sensing: error bounds, sample complexity and efficient estimators. New Journal of Physics, 14(9):095022, 2012. URL http://stacks.iop.org/1367-2630/14/i=9/a=095022.
Foucart, S. Sparse recovery algorithms: sufficient conditions in terms of restricted isometry constants. Proceedings of the 13th International Conference on Approximation Theory, 2010.
Garg, R. and Khandekar, R. Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property. In ICML. ACM, 2009.
Gross, D., Liu, Y.-K., Flammia, S. T., Becker, S., and Eisert, J. Quantum state tomography via compressed sensing. Phys. Rev. Lett., 105(15):150401, Oct 2010. doi: 10.1103/PhysRevLett.105.150401.
Kim, D. Least squares mixture decomposition estimation. PhD thesis, 1995.
Kyrillidis, A. and Cevher, V. Recipes on hard thresholding methods. Dec. 2011.
Kyrillidis, A. and Cevher, V. Combinatorial selection and least absolute shrinkage via the Clash algorithm. In IEEE International Symposium on Information Theory, July 2012.
Liu, Y.K. Universal low-rank matrix recovery from Pauli measurements. In NIPS, pp. 1638–1646, 2011.
Meka, Raghu, Jain, Prateek, and Dhillon, Inderjit S. Guaranteed rank minimization via singular value projection. In NIPS Workshop on Discrete Optimization in Machine Learning, 2010.
Mirsky, L. Symmetric gauge functions and unitarily invariant norms. The quarterly journal of mathematics, 11(1):50–59, 1960.
Nemhauser, G.L. and Wolsey, L.A. Integer and combinatorial optimization, volume 18. Wiley New York, 1988.
Parzen, E. On estimation of a probability density function and mode. The annals of mathematical statistics, 33(3):1065–1076, 1962.
Pilanci, M., El Ghaoui, L., and Chandrasekaran, V. Recovery of sparse probability measures via convex programming. In Advances in Neural Information Processing Systems 25, pp. 2429–2437, 2012.
van den Berg, E. and Friedlander, M. P. Probing the Pareto frontier for basis pursuit solutions. SIAM J. Sci. Comput., 31(2):890–912, 2008.
-----0
Bach, F., Lacoste-Julien, S., and Obozinski, G. On the equivalence between herding and conditional gradient algorithms. In ICML, 2012.
Balamurugan, P., Shevade, S., Sundararajan, S., and Keerthi, S. A sequential dual method for structural SVMs. In SDM, 2011.
Caetano, T.S., McAuley, J.J., Cheng, Li, Le, Q.V., and Smola, A.J. Learning graph matching. IEEE PAMI, 31(6):1048–1058, 2009.
Clarkson, K. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms, 6(4):1–30, 2010.
Collins, M., Globerson, A., Koo, T., Carreras, X., and Bartlett, P. L. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. JMLR, 9:1775–1822, 2008.
Dunn, J.C. and Harshbarger, S. Conditional gradient algorithms with open loop step size rules. Journal of Mathematical Analysis and Applications, 62(2):432–444, 1978.
Finley, T. and Joachims, T. Training structural SVMs when exact inference is intractable. In ICML, 2008.
Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:95–110, 1956.
Gärtner, B. and Jaggi, M. Coresets for polytope distance. ACM Symposium on Computational Geometry, 2009.
Hsieh, C., Chang, K., Lin, C., Keerthi, S., and Sundararajan, S. A dual coordinate descent method for large-scale linear SVM. In ICML, pp. 408–415, 2008.
Jaggi, M. Sparse convex optimization methods for machine learning. PhD thesis, ETH Zurich, 2011.
Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In ICML, 2013.
Joachims, T., Finley, T., and Yu, C. Cutting-plane training of structural SVMs. Machine Learn., 77(1):27–59, 2009.
Lacoste-Julien, S., Schmidt, M., and Bach, F. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. Technical Report 1212.2002v2 [cs.LG], arXiv, December 2012.
Mangasarian, O.L. Machine learning via polyhedral concave minimization. Technical Report 95-20, University of Wisconsin, 1995.
Nesterov, Yurii. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
Ouyang, H. and Gray, A. Fast stochastic Frank-Wolfe algorithms for nonlinear SVMs. SDM, 2010.
Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.
Ratliff, N., Bagnell, J. A., and Zinkevich, M. (Online) subgradient methods for structured prediction. In AISTATS, 2007.
Rousu, J., Saunders, C., Szedmak, S., and Shawe-Taylor, J. Kernel-based learning of hierarchical multilabel classification models. JMLR, 2006.
Sang, E.F.T.K. and Buchholz, S. Introduction to the CoNLL-2000 shared task: Chunking, 2000.
Shalev-Shwartz, S. and Zhang, T. Proximal stochastic dual coordinate ascent. Technical Report 1211.2717v1 [stat.ML], arXiv, November 2012.
Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1), 2010a.
Shalev-Shwartz, S., Srebro, N., and Zhang, T. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20:2807–2832, 2010b.
Shamir, O. and Zhang, T. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In ICML, 2013.
Taskar, B. Learning structured prediction models: A large margin approach. PhD thesis, Stanford, 2004.
Taskar, B., Guestrin, C., and Koller, D. Max-margin Markov networks. In NIPS, 2003.
Taskar, B., Lacoste-Julien, S., and Jordan, M. I. Structured prediction, dual extragradient and Bregman projections. JMLR, 7:1627–1653, 2006.
Teo, C.H., Vishwanathan, S.V.N., Smola, A.J., and Le, Q.V. Bundle methods for regularized risk minimization. JMLR, 11:311–365, 2010.
Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. JMLR, 6:1453–1484, 2005.
Zhang, X., Saha, A., and Vishwanathan, S. V. N. Accelerated training of max-margin Markov networks with kernels. In ALT, pp. 292–307. Springer, 2011.
-----0
Anagnostopoulos, C. and Gramacy, R.B. Dynamic trees for streaming and massive data contexts. arXiv preprint arXiv:1201.5568, 2012.
Asuncion, A. and Newman, D. J. UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html, 2007.
Bouchard-Côté, A., Sankararaman, S., and Jordan, M. I. Phylogenetic inference via sequential Monte Carlo. Systematic biology, 61(4):579–593, 2012.
Breiman, L. Random forests. Machine Learning, 45:5–32, 2001.
Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A. Classification and regression trees. Chapman & Hall/CRC, 1984.
Buntine, W. Learning classification trees. Stat. Comput., 2:63–73, 1992.
Cappé, O., Godsill, S. J., and Moulines, E. An overview of existing methods and recent advances in sequential Monte Carlo. Proc. IEEE, 95(5):899–924, 2007.
Chipman, H. and McCulloch, R. E. Hierarchical priors for Bayesian CART shrinkage. Stat. Comput., 10(1):17–24, 2000.
Chipman, H. A., George, E. I., and McCulloch, R. E. Bayesian CART model search. J. Am. Stat. Assoc., pp. 935–948, 1998.
Chipman, H. A., George, E. I., and McCulloch, R. E. BART: Bayesian additive regression trees. Ann. Appl. Stat., 4(1):266–298, 2010.
Del Moral, P., Doucet, A., and Jasra, A. Sequential Monte Carlo samplers. J. R. Stat. Soc. Ser. B Stat. Methodol., 68(3):411–436, 2006.
Denison, D. G. T., Mallick, B. K., and Smith, A. F. M. A Bayesian CART algorithm. Biometrika, 85(2):363–377, 1998.
Douc, R., Cappé, O., and Moulines, E. Comparison of resampling schemes for particle filtering. In Image Sig. Proc. Anal., pp. 64–69, 2005.
Friedman, J. H. Greedy function approximation: a gradient boosting machine. Ann. Statist., 29(5):1189–1232, 2001.
Gordon, N. J., Salmond, D. J., and Smith, A. F. M. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. Radar Sig. Proc., IEE Proc. F, 140(2):107–113, 1993.
Minka, T. P. Bayesian model averaging is not model combination. MIT Media Lab note. http://research.microsoft.com/en-us/um/people/minka/papers/bma.html, 2000.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine Learning in Python. J. Machine Learning Res., 12:2825–2830, 2011.
Quinlan, J. R. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.
Quinlan, J. R. C4.5: programs for machine learning. Morgan Kaufmann, 1993.
Roy, D. M. and Teh, Y. W. The Mondrian process. In Adv. Neural Information Proc. Systems, volume 21, pp. 1377–1384, 2009.
Taddy, M. A., Gramacy, R. B., and Polson, N. G. Dynamic trees for learning and design. J. Am. Stat. Assoc., 106(493):109–123, 2011.
Teh, Y. W., Daumé III, H., and Roy, D. M. Bayesian agglomerative clustering with coalescents. In Adv. Neural Information Proc. Systems, volume 20, 2008.
Wu, Y., Tjelmeland, H., and West, M. Bayesian CART: Prior specification and posterior simulation. J. Comput. Graph. Stat., 16(1):44–66, 2007.
-----0
Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res., 99:1563–1600, August 2010. ISSN 1532-4435.
Azar, M., Munos, R., and Kappen, B. On the sample complexity of reinforcement learning with a generative model. In Proceedings of the 29th international conference on machine learning, New York, NY, USA, 2012. ACM.
Chakraborty, D. and Stone, P. Structure learning in ergodic factored MDPs without knowledge of the transition function's in-degree. In Proceedings of the Twenty Eighth International Conference on Machine Learning (ICML'11), 2011.
Diuk, C., Li, L., and Leffler, B. The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In Danyluk, Andrea Pohoreckyj, Bottou, Léon, and Littman, Michael L. (eds.), Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), pp. 249–256. ACM, 2009.
Even-Dar, E., Kakade, S., and Mansour, Y. Reinforcement learning in POMDPs without resets. In IJCAI, pp. 690–695, 2005.
Hutter, M. Self-optimizing and Pareto-optimal policies in general environments based on Bayes-mixtures. In Proc. 15th Annual Conf. on Computational Learning Theory (COLT'02), volume 2375 of LNAI, pp. 364–379, Sydney, 2002. Springer, Berlin. URL http://arxiv.org/abs/cs.AI/0204040.
Lattimore, T. and Hutter, M. Time consistent discounting. In Kivinen, Jyrki, Szepesvari, Csaba, Ukkonen, Esko, and Zeugmann, Thomas (eds.), Algorithmic Learning Theory, volume 6925 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2011a.
Lattimore, T. and Hutter, M. Asymptotically optimal agents. In Kivinen, Jyrki, Szepesvari, Csaba, Ukkonen, Esko, and Zeugmann, Thomas (eds.), Algorithmic Learning Theory, volume 6925 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2011b.
Lattimore, T. and Hutter, M. PAC bounds for discounted MDPs. Technical report, 2012. http://torlattimore.com/pubs/pac-tech.pdf.
Maillard, Odalric-Ambrym, Nguyen, Phuong, Ortner, Ronald, and Ryabko, Daniil. Optimal regret bounds for selecting the state representation in reinforcement learning. In Proceedings of the Thirtieth International Conference on Machine Learning (ICML'13), 2013.
Mannor, S. and Tsitsiklis, J. The sample complexity of exploration in the multi-armed bandit problem. J. Mach. Learn. Res., 5:623–648, December 2004. ISSN 1532-4435.
Ryabko, D. and Hutter, M. On the possibility of learning in reactive environments with arbitrary dependence. Theoretical Computer Science, 405(3):274–284, 2008.
Strehl, A. and Littman, M. A theoretical analysis of model-based interval estimation. In Proceedings of the 22nd international conference on Machine learning, ICML '05, pp. 856–863, 2005.
Strehl, A., Li, L., and Littman, M. Reinforcement learning in finite MDPs: PAC analysis. J. Mach. Learn. Res., 10:2413–2444, December 2009.
Sunehag, P. and Hutter, M. Optimistic agents are asymptotically optimal. In Proceedings of the 25th Australasian AI conference, 2012.
Szita, I. and Szepesvari, C. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, pp. 1031-1038, New York, NY, USA, 2010. ACM.
-----0
Ailon, N. and Chazelle, B. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SICOMP, 2009.
Aizerman, M. A., Braverman, A. M., and Rozonoer, L. I. Theoretical foundations of the potential function method in pattern recognition learning. Autom. Remote Control, 25:821-837, 1964.
Aronszajn, N. La theorie generale des noyaux reproduisants et ses applications. Proc. Cambridge Philos. Soc., 39:133-153, 1944.
Boser, B., Guyon, I., and Vapnik, V. A training algorithm for optimal margin classifiers. COLT 1992.
Burges, C. J. C. Simplified support vector decision rules. ICML, 1996.
Cortes, C. and Vapnik, V. Support vector networks. Machine Learning, 20(3):273-297, 1995.
Dasgupta, A., Kumar, R., and Sarlos, T. Fast locality-sensitive hashing. SIGKDD, pp. 1073-1081, 2011.
Fine, S. and Scheinberg, K. Efficient SVM training using low-rank kernel representations. JMLR, 2001.
Frank, A. and Asuncion, A. UCI machine learning repository. http://archive.ics.uci.edu/ml.
Girosi, F. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6):1455-1480, 1998.
Girosi, F., Jones, M., and Poggio, T. Regularization theory and neural networks architectures. Neural Computation, 7(2):219-269, 1995.
Gray, A. G. and Moore, A. W. Rapid evaluation of multiple density models. AISTATS, 2003.
Jin, R., Yang, T., Mahdavi, M., Li, Y.F., and Zhou, Z.H. Improved bound for the Nystrom's method and its application to kernel classification, 2011. URL http://arxiv.org/abs/1111.2262.
Kimeldorf, G. S. and Wahba, G. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41:495-502, 1970.
Kreyszig, E. Introductory Functional Analysis with Applications. Wiley, 1989.
Krizhevsky, A. Learning multiple layers of features from tiny images. TR, U Toronto, 2009.
Ledoux, M. Isoperimetry and Gaussian analysis. Lectures on probability theory and statistics, 1996.
Lee, D. and Gray, A. G. Fast high-dimensional kernel summations using the Monte Carlo multipole method. NIPS, 2009.
MacKay, D. J. C. Information Theory, Inference, and Learning Algorithms. Cambridge 2003.
Mercer, J. Functions of positive and negative type and their connection with the theory of integral equations. Royal Society London, A 209:415-446, 1909.
Micchelli, C. A. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constr. Approximation, 2:11-22, 1986.
Neal, R. Priors for infinite networks. CRG-TR-94-1, U Toronto, 1994.
Rahimi, A. and Recht, B. Random features for large-scale kernel machines. NIPS 20, 2007.
Rahimi, A. and Recht, B. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. NIPS 21, 2008.
Scholkopf, B., Smola, A. J., and Muller, K.-R. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput., 10:1299-1319, 1998.
Scholkopf, Bernhard and Smola, A. J. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
Smola, A. J. and Scholkopf, B. Sparse greedy matrix approximation for machine learning. ICML, 2000.
Smola, A. J., Scholkopf, B., and Muller, K.-R. The connection between regularization operators and support vector kernels. Neur. Networks, 1998.
Steinwart, Ingo and Christmann, Andreas. Support Vector Machines. Springer, 2008.
Taskar, B., Guestrin, C., and Koller, D. Max-margin Markov networks. NIPS 16, 2004.
Tropp, J. A. Improved analysis of the subsampled randomized Hadamard transform. Adv. Adapt. Data Anal., 2011.
Vapnik, V., Golowich, S., and Smola, A. Support vector method for function approximation, regression estimation, and signal processing. NIPS 9, 1997.
Wahba, G. Spline Models for Observational Data, CBMS-NSF, vol. 59, SIAM, Philadelphia, 1990.
Williams, C. K. I. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In Jordan, M. I. (ed.), Learning and Inference in Graphical Models, Kluwer, 1998.
Williams, C. K. I. and Seeger, M. Using the Nystrom method to speed up kernel machines. NIPS 13, 2001.
-----0
Bell, R., Koren, Y., and Volinsky, C. Modeling relationships at multiple scales to improve accuracy of large recommender systems. In Proc. of the ACM SIGKDD Conference, 2007.
Billsus, D. and Pazzani, M. J. Learning collaborative information filters. In Proc. of the International Conference on Machine Learning, 1998.
Candès, E.J. and Plan, Y. Matrix completion with noise. Proc. of the IEEE, 98(6):925-936, 2010.
Candès, E.J. and Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053-2080, 2010.
DeCoste, D. Collaborative prediction using ensembles of maximum margin matrix factorizations. In Proc. of the International Conference on Machine Learning, 2006.
Foygel, R. and Srebro, N. Concentration-based guarantees for low-rank matrix reconstruction. ArXiv Report arXiv:1102.3923, 2011.
Foygel, R., Srebro, N., and Salakhutdinov, R. Matrix reconstruction with the local max norm. ArXiv Report arXiv:1210.5196, 2012.
Kambhatla, N. and Leen, T. K. Dimension reduction by local principal component analysis. Neural Computation, 9:1493-1516, 1997.
Keshavan, R. H., Montanari, A., and Oh, S. Matrix completion from noisy entries. Journal of Machine Learning Research, 99:2057-2078, 2010.
Koren, Y. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proc. of the ACM SIGKDD Conference, 2008.
Kumar, S., Mohri, M., and Talwalkar, A. Ensemble Nystrom method. In Advances in Neural Information Processing Systems, 2009.
Lawrence, N. D. and Urtasun, R. Non-linear matrix factorization with Gaussian processes. In Proc. of the International Conference on Machine Learning, 2009.
Lee, D. and Seung, H. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, 2001.
Lee, J., Sun, M., Kim, S., and Lebanon, G. Automatic feature induction for stagewise collaborative filtering. In Advances in Neural Information Processing Systems, 2012a.
Lee, J., Sun, M., and Lebanon, G. A comparative study of collaborative filtering algorithms. ArXiv Report 1205.3193, 2012b.
Lee, J., Sun, M., and Lebanon, G. PREA: Personalized recommendation algorithms toolkit. Journal of Machine Learning Research, 13:2699-2703, 2012c.
Mackey, L. W., Talwalkar, A. S., and Jordan, M. I. Divide-and-conquer matrix factorization. In Advances in Neural Information Processing Systems, 2011.
Mirbakhsh, N. and Ling, C. X. Clustering-based matrix factorization. ArXiv Report arXiv:1301.6659, 2013.
Rennie, J.D.M. and Srebro, N. Fast maximum margin matrix factorization for collaborative prediction. In Proc. of the International Conference on Machine Learning, 2005.
Roweis, S. and Saul, L. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, 2000.
Salakhutdinov, R. and Mnih, A. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, 2008a.
Salakhutdinov, R. and Mnih, A. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proc. of the International Conference on Machine Learning, 2008b.
Shalev-Shwartz, S., Gonen, A., and Shamir, O. Large-scale convex minimization with a low-rank constraint. In Proc. of the International Conference on Machine Learning, 2011.
Sill, J., Takacs, G., Mackey, L., and Lin, D. Feature-weighted linear stacking. ArXiv preprint arXiv:0911.0460, 2009.
Toh, K.C. and Yun, S. An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization, 6(15):615-640, 2010.
Wand, M. P. and Jones, M. C. Kernel Smoothing. Chapman and Hall/CRC, 1995.
Wang, Y., Szlam, A., and Lerman, G. Robust locally linear analysis with applications to image denoising and blind inpainting. SIAM Journal on Imaging Sciences, 6(1):526-562, 2013.
-----0
Abbeel, P., Coates, A., Quigley, M., and Ng, A. An application of reinforcement learning to aerobatic helicopter flight. In Advances in Neural Information Processing Systems (NIPS 19), 2006.
Atkeson, C. and Morimoto, J. Nonparametric representation of policies and value functions: A trajectory-based approach. In Advances in Neural Information Processing Systems (NIPS 15), 2002.
Atkeson, C. and Stephens, B. Random sampling of states in dynamic programming. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 38(4):924-929, 2008.
Bagnell, A., Kakade, S., Ng, A., and Schneider, J. Policy search by dynamic programming. In Advances in Neural Information Processing Systems (NIPS 16), 2003.
Baxter, J., Bartlett, P., and Weaver, L. Experiments with infinite-horizon, policy-gradient estimation. Journal of Artificial Intelligence Research, 15:351-381, 2001.
Deisenroth, M. and Rasmussen, C. PILCO: a model-based and data-efficient approach to policy search. In International Conference on Machine Learning (ICML), 2011.
Dvijotham, K. and Todorov, E. Inverse optimal control with linearly-solvable MDPs. In International Conference on Machine Learning (ICML), 2010.
Ijspeert, A., Nakanishi, J., and Schaal, S. Movement imitation with nonlinear dynamical systems in humanoid robots. In International Conference on Robotics and Automation, 2002.
Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning (ICML), 2002.
Kalakrishnan, M., Chitta, S., Theodorou, E., Pastor, P., and Schaal, S. STOMP: stochastic trajectory optimization for motion planning. In International Conference on Robotics and Automation, 2011.
Kober, J. and Peters, J. Learning motor primitives for robotics. In International Conference on Robotics and Automation, 2009.
Peshkin, L. and Shelton, C. Learning from scarce experience. In International Conference on Machine Learning (ICML), 2002.
Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682-697, 2008.
Ross, S. and Bagnell, A. Agnostic system identification for model-based reinforcement learning. In International Conference on Machine Learning (ICML), 2012.
Ross, S., Gordon, G., and Bagnell, A. A reduction of imitation learning and structured prediction to no-regret online learning. Journal of Machine Learning Research, 15:627-635, 2011.
Sutton, R., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS 11), 1999.
Tang, J. and Abbeel, P. On a connection between importance sampling and the likelihood ratio policy gradient. In Advances in Neural Information Processing Systems (NIPS 23), 2010.
Tassa, Y., Erez, T., and Todorov, E. Synthesis and stabilization of complex behaviors through online trajectory optimization. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
Tedrake, R., Zhang, T., and Seung, H. Stochastic policy gradient reinforcement learning on a simple 3d biped. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2004.
Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
Whitman, E. and Atkeson, C. Control of a walking biped using a combination of simple policies. In 9th IEEE-RAS International Conference on Humanoid Robots, 2009.
Yin, K., Loken, K., and van de Panne, M. SIMBICON: simple biped locomotion control. ACM Transactions on Graphics, 26(3), 2007.
Ziebart, B. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. PhD thesis, Carnegie Mellon University, 2010.
-----0
Andoni, A. and Indyk, P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proc. IEEE Symp. Foundations of Computer Science, pp. 459-468, 2006.
Baluja, S. and Covell, M. Learning to hash: forgiving hash functions and applications. Data Mining & Knowledge Discovery, 17(3):402-430, 2008.
Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., and Harshman, R. Indexing by latent semantic analysis. J. American Society for Information Science, 41(6):391-407, 1990.
Demiriz, A., Bennett, K.P., and Shawe-Taylor, J. Linear programming boosting via column generation. Machine Learning, 46(1):225-254, 2002.
Gong, Y., Lazebnik, S., Gordo, A., and Perronnin, F. Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Analysis & Machine Intelligence, 2012.
Korman, S. and Avidan, S. Coherency sensitive hashing. In Proc. Int. Conf. Computer Vision, pp. 1607-1614, 2011.
Kulis, B. and Darrell, T. Learning to hash with binary reconstructive embeddings. In Proc. Adv. Neural Information Process. Systems, 2009.
Kulis, B. and Grauman, K. Kernelized locality-sensitive hashing for scalable image search. In Proc. Int. Conf. Computer Vision, pp. 2130-2137, 2009.
Li, X., Lin, G., Shen, C., van den Hengel, A., and Dick, A. Supplementary document: Effectively learning hash functions using column generation. Available at http://cs.adelaide.edu.au/~chhshen/paper.html, 2013.
Liu, W., Wang, J., Kumar, S., and Chang, S. F. Hashing with graphs. In Proc. Int. Conf. Machine Learning, 2011.
Liu, W., Wang, J., Ji, R., Jiang, Y.G., and Chang, S.F. Supervised hashing with kernels. In Proc. IEEE Conf. Computer Vision & Pattern Recognition, 2012.
Norouzi, M. and Fleet, D.J. Minimal loss hashing for compact binary codes. In Proc. Int. Conf. Machine Learning, 2011.
Salakhutdinov, R. and Hinton, G. Semantic hashing. Int. J. Approximate Reasoning, 50(7):969-978, 2009.
Schultz, M. and Joachims, T. Learning a distance metric from relative comparisons. In Proc. Adv. Neural Information Processing Systems, 2004.
Shakhnarovich, G., Viola, P., and Darrell, T. Fast pose estimation with parameter-sensitive hashing. In Proc. Int. Conf. Computer Vision, pp. 750-757, 2003.
Shen, C., Kim, J., Wang, L., and van den Hengel, A. Positive semidefinite metric learning using boosting-like algorithms. J. Machine Learning Research, 13:1007-1036, 2012.
Strecha, C., Bronstein, A. M., Bronstein, M. M., and Fua, P. LDAHash: Improved matching with smaller descriptors. IEEE Trans. Pattern Analysis & Machine Intelligence, 2011.
Torralba, A., Fergus, R., and Weiss, Y. Small codes and large image databases for recognition. In Proc. IEEE Conf. Computer Vision & Pattern Recognition, pp. 1-8, 2008.
Wang, J., Kumar, S., and Chang, S.F. Semi-supervised hashing for large scale search. IEEE Trans. Pattern Analysis & Machine Intelligence, 2012.
Weinberger, K.Q., Blitzer, J., and Saul, L.K. Distance metric learning for large margin nearest neighbor classification. In Proc. Adv. Neural Information Processing Systems, 2006.
Weiss, Y., Torralba, A., and Fergus, R. Spectral hashing. In Proc. Adv. Neural Information Process. Systems, 2008.
Zhang, D., Wang, J., Cai, D., and Lu, J. Laplacian co-hashing of terms and documents. In Proc. Eur. Conf. Information Retrieval, pp. 577-580, 2010a.
Zhang, D., Wang, J., Cai, D., and Lu, J. Self-taught hashing for fast similarity search. In Proc. ACM SIGIR Conf., pp. 18-25, 2010b.
Zhu, C., Byrd, R. H., Lu, P., and Nocedal, J. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw., 23(4):550-560, 1997.
-----0
Banach, Stefan. Sur les operations dans les ensembles abstraits et leur application aux equations integrales. In Fund. Math, pp. 133-181, 1922.
Besag, Julian. Statistical Analysis of Non-Lattice Data. Journal of the Royal Statistical Society. Series D (The Statistician), 24(3):179-195, 1975.
Bo, Liefeng and Sminchisescu, Cristian. Structured output-associative regression. In CVPR, pp. 2403-2410, 2009.
Breiman, Leo. Random forests. Machine Learning, 45(1):5-32, 2001.
Collins, Michael. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP, pp. 1-8, 2002.
Crammer, Koby and Singer, Yoram. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265-292, 2001.
Craven, Mark, DiPasquo, Dan, Freitag, Dayne, McCallum, Andrew, Mitchell, Tom M., Nigam, Kamal, and Slattery, Sean. Learning to extract symbolic knowledge from the world wide web. In AAAI/IAAI, pp. 509-516, 1998.
Cruz, F. Perez, Ghahramani, Z., and Pontil, M. Kernel conditional graphical models. Predicting Structured Data, pp. 265-282, 2007.
Daume, Hal III, Langford, John, and Marcu, Daniel. Search-based structured prediction. Machine Learning, 75(3):297-325, 2009.
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. Liblinear: A library for large linear classification, 2008.
Finley, Thomas and Joachims, Thorsten. Training structural SVMs when exact inference is intractable. In ICML, pp. 304-311, 2008.
Geman, Stuart and Geman, Donald. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 6(6):721-741, 1984.
Heitz, Geremy, Gould, Stephen, Saxena, Ashutosh, and Koller, Daphne. Cascaded classification models: Combining models for holistic scene understanding. In NIPS, pp. 641-648, 2008.
Keerthi, S. Sathiya and Sundararajan, S. CRF versus SVM-struct for sequence labeling. In Yahoo Research Technical Report, 2007.
Lafferty, John D., McCallum, Andrew, and Pereira, Fernando C. N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pp. 282-289, 2001.
McCallum, Andrew Kachites. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
Nguyen, Nam and Guo, Yunsong. Comparisons of sequence labeling algorithms and extensions. In ICML, pp. 681-688, 2007.
Rabiner, Lawrence R. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, pp. 257-286, 1989.
Reed, R., Oh, S., and Marks, R. J. Regularization using jittered training data. In IJCNN, pp. 509-516, 1992.
Schmidt, Mark. UGM: Matlab code for undirected graphical models. 2011.
Sontag, David, Meshi, Ofer, Jaakkola, Tommi, and Globerson, Amir. More data means less inference: A pseudomax approach to structured learning. In NIPS, pp. 2181-2189, 2010.
Taskar, Benjamin, Guestrin, Carlos, and Koller, Daphne. Max-margin Markov networks. In NIPS, 2003.
Treebank, The Penn. Penn's linguistic data consortium. http://www.cis.upenn.edu/treebank. 2002.
Tsochantaridis, Ioannis, Joachims, Thorsten, Hofmann, Thomas, and Altun, Yasemin. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453-1484, 2005.
Tu, Zhuowen and Bai, Xiang. Auto-context and its application to high-level vision tasks and 3d brain image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 32(10):1744-1757, 2010.
Weston, J. and Watkins, C. Multi-class support vector machines. Technical Report, 1998.
Wolpert, D. Stacked generalization. In Neural Networks, pp. 241-259, 1992.
Zhu, Ji and Hastie, Trevor. Kernel logistic regression and the import vector machine. In NIPS, pp. 1081-1088, 2001.
-----0
Bertin-Mahieux, Thierry, Ellis, Daniel P.W., Whitman, Brian, and Lamere, Paul. The million song dataset. In International Conference on Music Information Retrieval, 2011.
Blei, David M., Ng, Andrew Y., and Jordan, Michael I. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993-1022, March 2003. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=944919.944937.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
Davis, Jason V., Kulis, Brian, Jain, Prateek, Sra, Suvrit, and Dhillon, Inderjit S. Information-theoretic metric learning. In ICML, 2007.
Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-jia, Li, Kai, and Li, Fei-fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
Huang, Kaizhu, Jin, Rong, Xu, Zenglin, and Liu, Cheng-Lin. Robust metric learning by smooth optimization. In UAI, pp. 244-251, 2010.
Huang, Kaizhu, Ying, Yiming, and Campbell, Colin. Generalized sparse metric learning with relative comparisons. Knowl. Inf. Syst., 28(1):25-45, 2011.
Joachims, T. A support vector method for multivariate performance measures. In ICML, 2005.
Joachims, Thorsten, Finley, Thomas, and Yu, Chun-Nam John. Cutting-plane training of structural SVMs. Mach. Learn., 77(1):27-59, 2009.
Kowalski, M. Sparse regression using mixed norms. Appl. Comput. Harmon. Anal., 27(3):303-324, 2009.
McFee, B., Barrington, L., and Lanckriet, G.R.G. Learning content similarity for music recommendation. IEEE Transactions on Audio, Speech, and Language Processing, 20(8):2207-2218, October 2012.
McFee, Brian and Lanckriet, G.R.G. Metric learning to rank. In 27th annual International Conference on Machine Learning (ICML), pp. 775-782, Haifa, Israel, June 2010.
Rosales, Romer and Fung, Glenn. Learning sparse metrics via linear programming. In KDD, pp. 367-373, 2006.
Shaw, Blake, Huang, Bert, and Jebara, Tony. Learning a distance metric from a network. In Advances in Neural Information Processing Systems 24, 2011.
Shen, Chunhua, Kim, Junae, Wang, Lei, and van den Hengel, Anton. Positive semidefinite metric learning with boosting. In Advances in Neural Information Processing Systems 22, 2009.
Tingle, D., Kim, Y., and Turnbull, D. Exploring automatic music annotation with acoustically-objective tags. In IEEE International Conference on Multimedia Information Retrieval, 2010.
Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. JMLR, 6:1453-1484, 2005.
Weinberger, Kilian Q., Blitzer, John, and Saul, Lawrence K. Distance metric learning for large margin nearest neighbor classification. In NIPS, 2006.
Xing, Eric P., Ng, Andrew Y., Jordan, Michael I., and Russell, Stuart. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems 15, pp. 505-512, Cambridge, MA, 2003. MIT Press.
Ying, Yiming, Huang, Kaizhu, and Campbell, Colin. Sparse metric learning via smooth optimization. In NIPS, pp. 2214-2222, 2009.
Yuan, Ming and Lin, Yi. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49-67, February 2006. ISSN 1369-7412. doi: 10.1111/j.1467-9868.2005.00532.x. URL http://dx.doi.org/10.1111/j.1467-9868.2005.00532.x.
Zha, Zheng-Jun, Mei, Tao, Wang, Meng, Wang, Zengfu, and Hua, Xian-Sheng. Robust distance metric learning with auxiliary knowledge. In IJCAI, pp. 1327-1332, 2009.
-----0
Bunea, F., Tsybakov, A., and Wegkamp, M. Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics, 1:169-194, 2007.
Candès, E. J. and Plan, Y. Near-ideal model selection by ℓ1 minimization. Annals of Statistics, 37(5A):2145-2177, 2009.
Candès, E. J. and Tao, T. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203-4215, 2005.
Candès, E. J. and Tao, T. The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6):2313-2351, 2007.
Candès, E. J., Romberg, J., and Tao, T. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489-509, 2006.
Candès, E. J., Eldar, Y. C., Needell, D., and Randall, P. Compressed sensing with coherent and redundant dictionaries. Applied and Computational Harmonic Analysis, 31:59-73, 2011.
Chan, T. F. Total variation blind deconvolution. IEEE Transactions on Image Processing, 7(3):370-375, 1998.
Donoho, D. L., Elad, M., and Temlyakov, V. N. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6-18, 2006.
Friedman, J., Hastie, T., Hofling, H., and Tibshirani, R. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302-332, 2007.
Kolar, M., Song, L., and Xing, E.P. Sparsistent learning of varying-coefficient models with structural changes. NIPS, pp. 1006-1014, 2009.
Koltchinskii, V. and Yuan, M. Sparse recovery in large ensembles of kernel machines on-line learning and bandits. COLT, pp. 229-238, 2008.
Liu, J., Wonka, P., and Ye, J. A multi-stage framework for Dantzig selector and Lasso. Journal of Machine Learning Research, 13:1189-1219, 2012.
Liu, J., Yuan, L., and Ye, J. Guaranteed sparse recovery under linear transformation. arXiv:1305.0047, 2013.
Lounici, K. Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electronic Journal of Statistics, 2:90-102, 2008.
Meinshausen, N. and Bühlmann, P. High dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34(3):1436-1462, 2006.
Nam, S., Davies, M. E., Elad, M., and Gribonval, R. The cosparse analysis model and algorithms. Applied and Computational Harmonic Analysis, 34(1):30-56, 2012.
Ravikumar, P., Raskutti, G., Wainwright, M. J., and Yu, B. Model selection in Gaussian graphical models: High-dimensional consistency of ℓ1-regularized MLE. NIPS, 2008.
Rinaldo, A. Properties and refinements of the fused Lasso. The Annals of Statistics, 37(5B):2922-2952, 2009.
Romberg, J. The Dantzig selector and generalized thresholding. CISS, pp. 22-25, 2008.
Sharpnack, J., Rinaldo, A., and Singh, A. Sparsistency of the edge Lasso over graphs. AISTATS, 2012.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. Sparsity and smoothness via the fused Lasso. Journal of the Royal Statistical Society Series B, pp. 91-108, 2005.
Vaiter, S., Peyre, G., Dossal, C., and Fadili, J. Robust sparse analysis regularization. IEEE Transactions on Information Theory, 59(4):2001-2016, 2013.
Wainwright, M. J. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183-2202, 2009.
Zhang, T. Some sharp performance bounds for least squares regression with ℓ1 regularization. Annals of Statistics, 37(5A):2109-2114, 2009a.
Zhang, T. On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research, 10:555-568, 2009b.
Zhao, P. and Yu, B. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541-2563, 2006.
-----0
S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
Bruno Buchberger. Bruno Buchberger's PhD thesis 1965: An algorithm for finding the basis elements of the residue class ring of a zero dimensional polynomial ideal. Journal of Symbolic Computation, 41(3-4):475-511, 2006.
Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
D.A. Cox, J. Little, and D. O'Shea. Ideals, varieties, and algorithms: an introduction to computational algebraic geometry and commutative algebra, volume 10. Springer, 2007.
M. Drton, B. Sturmfels, and S. Sullivant. Lectures on algebraic statistics. Birkhauser, 2008.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. J. Mach. Learn. Res., 9:1871-1874, June 2008. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1390681.1442794.
P. Gibilisco, E. Riccomagno, M.P. Rogantin, and H.P. Wynn. Algebraic and geometric methods in statistics. Cambridge University Press, 2009.
Daniel Heldt, Martin Kreuzer, Sebastian Pokutta, and Hennie Poulisse. Approximate computation of zero-dimensional polynomial ideals. J. Symb. Comput., 44(11):1566-1591, 2009.
F.J. Kiraly, P. Buenau, J.S. Muller, D.A.J. Blythe, F. Meinecke, and K.R. Muller. Regression for sets of polynomial equations. JMLR Workshop and Conference Proceedings, 22:628-637, 2012.
H. Moller and B. Buchberger. The construction of multivariate polynomials with preassigned zeros. Computer Algebra, pages 24-31, 1982.
B. Scholkopf, A. Smola, and K.R. Muller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319, 1998.
Bernhard Scholkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001. ISBN 0262194759.
Sumio Watanabe. Algebraic Geometry and Statistical Learning Theory. Cambridge Monographs on Applied and Computational Mathematics. Cambridge University Press, United Kingdom, 2009.
-----0
Alquier, P. and Wintenburger, O. Model selection for weakly dependent time series forecasting. Bernoulli, 18(3):883-913, 2012.
Bartlett, P. and Mendelson, S. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2003.
Bousquet, O. and Elisseeff, A. Stability and generalization. Journal of Machine Learning Research, 2:499-526, 2002.
Broecheler, M., Mihalkova, L., and Getoor, L. Probabilistic similarity logic. In Uncertainty in Artificial Intelligence, 2010.
Chazottes, J., Collet, P., Kulske, C., and Redig, F. Concentration inequalities for random fields via coupling. Probability Theory and Related Fields, 137:201-225, 2007.
Honorio, J. Lipschitz parametrization of probabilistic graphical models. In Uncertainty in Artificial Intelligence, 2011.
Jensen, D., Neville, J., and Gallagher, B. Why collective inference improves relational classification. In Knowledge Discovery and Data Mining, 2004.
Kontorovich, L. Measure Concentration of Strongly Mixing Processes with Applications. PhD thesis, Carnegie Mellon University, 2007.
Kontorovich, L. and Ramanan, K. Concentration inequalities for dependent random variables via the martingale method. Annals of Probability, 36(6):2126-2158, 2008.
McAllester, D. Generalization bounds and consistency for structured labeling. In Predicting Structured Data. MIT Press, 2007.
McDonald, D., Shalizi, C., and Schervish, M. Risk bounds for time series without strong mixing. arXiv:1106.0730, 2011.
Mohri, M. and Rostamizadeh, A. Rademacher complexity bounds for non-i.i.d. processes. In Advances in Neural Information Processing Systems, 2009.
Mohri, M. and Rostamizadeh, A. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research, 11:789-814, 2010.
Munoz, D., Bagnell, J., Vandapel, N., and Hebert, M. Contextual classification with functional max-margin Markov networks. In Computer Vision and Pattern Recognition, 2009.
Neville, J. and Jensen, D. Dependency networks for relational data. In International Conference on Data Mining, 2004.
Ralaivola, L., Szafranski, M., and Stempfel, G. Chromatic PAC-Bayes bounds for non-iid data: Applications to ranking and stationary β-mixing processes. Journal of Machine Learning Research, 11:1927-1956, 2010.
Richardson, M. and Domingos, P. Markov logic networks. Machine Learning, 62(1-2):107-136, 2006.
Sen, P., Namata, G., Bilgic, M., Getoor, L., Gallagher, B., and Eliassi-Rad, T. Collective classification in network data. AI Magazine, 29(3):93-106, 2008.
Shalev-Shwartz, S. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.
Singh, S., Wick, M., and McCallum, A. Distantly labeling data for large scale cross-document coreference. arXiv:1005.4298, 2010.
Taskar, B., Abbeel, P., and Koller, D. Discriminative probabilistic models for relational data. In Uncertainty in Artificial Intelligence, 2002.
Taskar, B., Guestrin, C., and Koller, D. Max-margin Markov networks. In Advances in Neural Information Processing Systems, 2004.
Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453-1484, 2005.
Usunier, N., Amini, M., and Gallinari, P. Generalization error bounds for classifiers trained with interdependent data. In Advances in Neural Information Processing Systems, 2006.
Wainwright, M. Estimating the wrong graphical model: Benefits in the computation-limited setting. Journal of Machine Learning Research, 7:1829-1859, 2006.
Xiang, R. and Neville, J. Relational learning with one network: An asymptotic analysis. In Artificial Intelligence and Statistics, 2011.
-----0
Ben-David, S., Loker, D., Srebro, N., and Sridharan, K. Minimizing the misclassification error rate using a surrogate convex loss. In ICML, 2012.
Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT 2010), pp. 177-187, 2010.
Breiman, L. Some infinity theory for predictor ensembles. Annals of Statistics, 32(1):1-11, 2004.
Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265-292, 2001.
Friedman, J., Hastie, T., and Tibshirani, R. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2):337-407, 2000.
Harchaoui, Zaïd, Douze, Matthijs, Paulin, Mattis, Dudík, Miroslav, and Malick, Jérôme. Large-scale image classification with trace-norm regularization. In CVPR, pp. 3386-3393, 2012.
Haussler, D. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78-150, 1992.
Kearns, M. J., Schapire, R. E., and Sellie, L. M. Toward efficient agnostic learning. Proceedings of the 1992 Workshop on Computational Learning Theory, pp. 341-352, 1992.
Lee, Y., Lin, Y., and Wahba, G. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67-81, 2004.
Lin, Y., Lv, F., Zhu, S., Yang, M., Cour, T., Yu, K., Cao, L., and Huang, T. Large-scale image classification: fast feature extraction and SVM training. In CVPR, pp. 1689-1696, 2011.
Liu, Y. Fisher consistency of multicategory support vector machines. In UAI, pp. 291-298, 2007.
Long, P. M. and Servedio, R. A. Random classification noise defeats all convex potential boosters. Machine Learning, 78(3):287-304, 2010.
Perronnin, F., Akata, Z., Harchaoui, Z., and Schmid, C. Towards good practice in large-scale learning for image classification. In CVPR, pp. 3482-3489, 2012.
Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838-855, 1992.
Tewari, A. and Bartlett, P. L. On the consistency of multiclass classification methods. JMLR, 8:1007-1025, 2007.
Vapnik, V. N. Inductive principles of the search for empirical dependences (methods based on weak convergence of probability measures). Proceedings of the 1989 Workshop on Computational Learning Theory, 1989.
Weston, J., Bengio, S., and Usunier, N. Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning, 81(1):21-35, 2010.
Xu, W. Towards optimal one pass large scale learning with averaged stochastic gradient descent. Arxiv, 2011.
Zhang, T. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225-1251, 2004.
-----0
Arias-Castro, E., Candès, E. J., and Davenport, M. On the fundamental limits of adaptive sensing. Arxiv preprint arXiv:1111.4646, 2011.
Cai, T. T., Wang, L., and Xu, G. New bounds for restricted isometry constants. Information Theory, IEEE Transactions on, 56(9), 2010.
Candès, E. J. Compressive sampling. In Proceedings of the International Congress of Mathematicians: Madrid, August 22-30, 2006: invited lectures, pp. 1433-1452, 2006.
Candès, E. J. and Plan, Y. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57(4):2342-2359, 2011.
Candès, E. J., Romberg, J.K., and Tao, T. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207-1223, 2006.
Chandrasekaran, V., Recht, B., Parrilo, P. A., and Willsky, A. S. The convex algebraic geometry of linear inverse problems. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pp. 699-703. IEEE, 2010.
d'Aspremont, A. and El Ghaoui, L. Testing the nullspace property using semidefinite programming. Mathematical Programming, 127(1):123-144, 2011.
Davenport, M. A., Duarte, M. F., Eldar, Y. C., and Kutyniok, G. Introduction to compressed sensing. Preprint, 93, 2011.
Donoho, D. L. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306, 2006.
Eldar, Y. C. Generalized SURE for exponential families: Applications to regularization. IEEE Transactions on Signal Processing, 57(2):471-481, 2009.
Fama, E. F. and Roll, Richard. Parameter estimates for symmetric stable distributions. Journal of the American Statistical Association, 66(334):331-338, 1971.
Hoyer, P.O. Non-negative matrix factorization with sparseness constraints. The Journal of Machine Learning Research, 5:1457-1469, 2004.
Hurley, N. and Rickard, S. Comparing measures of sparsity. IEEE Transactions on Information Theory, 55(10):4723-4741, 2009.
Indyk, P. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM (JACM), 53(3):307-323, 2006.
Juditsky, A. and Nemirovski, A. On verifiable sufficient conditions for sparse signal recovery via ℓ1 minimization. Mathematical Programming, 127(1):57-88, 2011.
Li, P., Hastie, T., and Church, K. Nonlinear estimators and tail bounds for dimension reduction in l1 using Cauchy random projections. Journal of Machine Learning Research, pp. 2497-2532, 2007.
Lopes, M. E., Jacob, L., and Wainwright, M.J. A more powerful two-sample test in high dimensions using random projection. In NIPS 24, pp. 1206-1214. 2011.
Malioutov, D. M., Sanghavi, S., and Willsky, A. S. Compressed sensing with sequential observations. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, pp. 3357-3360. IEEE, 2008.
Rigamonti, R., Brown, M. A., and Lepetit, V. Are sparse representations really relevant for image classification? In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1545-1552. IEEE, 2011.
Rudelson, M. and Vershynin, R. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM (JACM), 54(4):21, 2007.
Shi, Q., Eriksson, A., van den Hengel, A., and Shen, C. Is face recognition really a compressive sensing problem? In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 553-560. IEEE, 2011.
Tang, G. and Nehorai, A. The stability of low-rank matrix reconstruction: a constrained singular value view. arXiv:1006.4088, submitted to IEEE Transactions on Information Theory, 2010.
Tang, G. and Nehorai, A. Performance analysis of sparse recovery based on constrained minimal singular values. IEEE Transactions on Signal Processing, 59(12):5734-5745, 2011.
van den Berg, E. and Friedlander, M. P. SPGL1: A solver for large-scale sparse reconstruction, June 2007. http://www.cs.ubc.ca/labs/scl/spgl1.
van den Berg, E. and Friedlander, M. P. Probing the Pareto frontier for basis pursuit solutions. SIAM Journal on Scientific Computing, 31(2):890-912, 2008. doi: 10.1137/080714488. URL http://link.aip.org/link/?SCE/31/890.
Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. Arxiv preprint arxiv:1011.3027, 2010.
Ward, R. Compressed sensing with cross validation. IEEE Transactions on Information Theory, 55(12):5773-5782, 2009.
Zolotarev, V. M. One-dimensional stable distributions, volume 65. Amer Mathematical Society, 1986.
-----0
Acar, E. F., Craiu, R. V., and Yao, F. Dependence calibration in conditional copulas: A nonparametric approach. Biometrics, 67(2):445-453, 2011.
Acar, E. F., Genest, C., and Neslehova, J. Beyond simplified pair-copula constructions. Journal of Multivariate Analysis, 110:74-90, 2012.
Bedford, T. and Cooke, R. M. Vines - a new graphical model for dependent random variables. The Annals of Statistics, 30(4):1031-1068, 2002.
Brechmann, E.C. and Schepsmeier, U. Modeling dependence with C- and D-vine copulas: The R package CDVine. Journal of Statistical Software, 52(3):1-27, 2013.
Brechmann, E.C., Czado, C., and Aas, K. Truncated regular vines in high dimensions with applications to financial data. Canadian Journal of Statistics, 40(1):68-85, 2012.
Casella, George and Berger, Roger. Statistical Inference. Duxbury Resource Center, 2001.
Cook, R. D. and Johnson, M. E. Generalized Burr-Pareto-logistic distributions with applications to a uranium exploration data set. Technometrics, 28:123-131, 1986.
Dissmann, J., Brechmann, E. C., Czado, C., and Kurowicka, D. Selecting and estimating regular vine copulae and application to financial returns. arXiv preprint, 2012.
Elidan, G. Copula Bayesian networks. In Advances in Neural Information Processing Systems 23, pp. 559-567, 2010.
Elidan, G. Copulas and machine learning. Invited survey to appear in the proceedings of the Copulae in Mathematical and Quantitative Finance workshop, 2012.
Extreme Electronics Ltd. OpenWeatherMap, 2012. URL http://openweathermap.org/.
Frank, A. and Asuncion, A. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml.
Goovaerts, P. Geostatistics for natural resources evaluation. Oxford University Press, 1st edition, 1997.
Hobaek, I., Aas, K., and Frigessi, A. On the simplified pair-copula construction: simply useful or too simplistic? Journal of Multivariate Analysis, 101(5):1296-1310, 2010.
Joe, H. Families of m-variate distributions with given margins and m(m - 1)/2 bivariate dependence parameters. Distributions with Fixed Marginals and Related Topics, 1996.
Joe, H. Multivariate Models and Dependence Concepts. CRC Press, 1997.
Joe, H. Asymptotic efficiency of the two-stage estimation method for copula-based models. Journal of Multivariate Analysis, 94(2):401-419, 2005.
Kirshner, S. Learning with tree-averaged densities and distributions. In Advances in Neural Information Processing Systems 20, 2007.
Kurowicka, D. and Cooke, R. Uncertainty Analysis with High Dimensional Dependence Modelling. Wiley Series in Probability and Statistics, 1st edition, 2006.
Minka, T. P. Expectation Propagation for approximate Bayesian inference. Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, pp. 362-369, 2001.
Naish-Guzman, Andrew and Holden, Sean B. The generalized FITC approximation. In Advances in Neural Information Processing Systems 20, 2007.
Nelsen, R. An Introduction to Copulas. Springer Series in Statistics, 2006.
Patton, A. J. Modelling asymmetric exchange rate dependence. International Economic Review, 47(2):527-556, 2006.
Prim, R. C. Shortest connection networks and some generalizations. Bell System Technical Journal, 36:1389-1401, 1957.
Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. The MIT Press, 1st edition, 2006.
Sklar, A. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statis. Univ. Paris, 8(1):229-231, 1959.
Snelson, E. and Ghahramani, Z. Sparse Gaussian processes using pseudo-inputs. Proceedings of the 20th Conference in Advances in Neural Information Processing Systems, pp. 1257-1264, 2006.
Wilson, A. G. and Ghahramani, Z. Copula processes. In Advances in Neural Information Processing Systems 23, pp. 2460-2468, 2010.
-----0
Bartlett, P. L. and Tewari, A. REGAL: A regularization based algorithm for reinforcement learning in weakly-communicating MDPs. In UAI 2009, Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, pp. 35-42, 2009.
Brafman, R.I. and Tennenholtz, M. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213-231, 2003.
Hutter, M. Feature reinforcement learning: Part I. Unstructured MDPs. Journal of General Artificial Intelligence, 1:3-24, 2009.
Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563-1600, 2010.
Kearns, M. and Singh, S. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49:209-232, 2002.
Maillard, O., Munos, R., and Ryabko, D. Selecting the state-representation in reinforcement learning. In Advances in Neural Information Processing Systems 24, pp. 2627-2635, 2011.
McCallum, R. A. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, Department of Computer Science, U. Rochester, 1996.
Ortner, R. and Ryabko, D. Online regret bounds for undiscounted continuous reinforcement learning. In Advances in Neural Information Processing Systems 25, pp. 1772-1780, 2012.
Puterman, M. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.
Ryabko, D. and Hutter, M. On the possibility of learning in reactive environments with arbitrary dependence. Theoretical Computer Science, 405:274-284, 2008.
Singh, S. P., James, M. R., and Rudary, M. R. Predictive state representations: A new theory for modeling dynamical systems. In UAI '04, Proceedings of the 20th Conference in Uncertainty in Artificial Intelligence, pp. 512-518, 2004.
Strehl, A. L., Li, L., Wiewiora, Eric, Langford, J., and Littman, M. L. PAC model-free reinforcement learning. In Machine Learning, Proceedings of the 23rd International Conference (ICML 2006), pp. 881-888, 2006.
Veness, J., Ng, K. S., Hutter, M., Uther, W., and Silver, D. A Monte-Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40(1):95-142, 2011.
Vidal, E., Thollard, F., Higuera, C. D. L., Casacuberta, F., and Carrasco, R.C. Probabilistic finite-state machines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(7):1013-1025, 2005.
-----0
Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183-202, 2009.
Bertsekas, D.P. Nonlinear programming. Athena Scientific Belmont, 1999. 2nd edition.
Böhning, D. and Lindsay, B. G. Monotonicity of quadratic-approximation algorithms. Ann. I. Stat. Math., 40(4):641-663, 1988.
Borwein, J.M. and Lewis, A.S. Convex analysis and nonlinear optimization: theory and examples. Springer, 2006.
Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proc. COMPSTAT, 2010.
Bradley, J.K., Kyrola, A., Bickson, D., and Guestrin, C. Parallel coordinate descent for l1-regularized loss minimization. In Proc. ICML, 2011.
Candès, E.J., Wakin, M., and Boyd, S.P. Enhancing sparsity by reweighted ℓ1 minimization. J. Fourier Anal. Appl., 14(5):877-905, 2008.
Collins, M., Schapire, R.E., and Singer, Y. Logistic regression, AdaBoost and Bregman distances. Mach. Learn., 48(1):253-285, 2002.
Daubechies, I., Defrise, M., and De Mol, C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pur. Appl. Math., 57(11):1413-1457, 2004.
Della Pietra, S., Della Pietra, V., and Lafferty, J. Duality and auxiliary functions for Bregman distances. Technical report, CMU-CS-01-109, 2001.
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871-1874, 2008.
Gasso, G., Rakotomamonjy, A., and Canu, S. Recovering sparse signals with non-convex penalties and DC programming. IEEE T. Signal Process., 57(12):4686-4698, 2009.
Harchaoui, Z., Juditsky, A., and Nemirovski, A. Conditional gradient algorithms for norm-regularized smooth convex optimization. preprint arXiv:1302.2325v4, 2013.
Hazan, E. and Kale, S. Projection-free online learning. In Proc. ICML, 2012.
Horst, R. and Thoai, N.V. DC programming: overview. J. Optim. Theory App., 103(1):1-43, 1999.
Juditsky, A. and Nemirovski, A. First order methods for nonsmooth convex large-scale optimization, I: General purpose methods. In Optimization for Machine Learning. MIT Press, 2011.
Khan, E., Marlin, B., Bouchard, G., and Murphy, K. Variational bounds for mixed-data factor analysis. In Adv. NIPS, 2010.
Lacoste-Julien, S., Jaggi, M., Schmidt, M., and Pletscher, P. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Proc. ICML, 2013.
Lange, K., Hunter, D.R., and Yang, I. Optimization transfer using surrogate objective functions. J. Comput. Graph. Stat., 9(1):1-20, 2000.
Le Roux, N., Schmidt, M., and Bach, F. A stochastic gradient method with an exponential convergence rate for finite training sets. In Adv. NIPS, 2012.
Lee, D.D. and Seung, H.S. Algorithms for non-negative matrix factorization. In Adv. NIPS, 2001.
Mairal, J., Bach, F., Ponce, J., and Sapiro, G. Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res., 11:19-60, 2010.
Neal, R.M. and Hinton, G.E. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in graphical models, 89:355-368, 1998.
Nesterov, Y. Introductory lectures on convex optimization. Kluwer Academic Publishers, 2004.
Nesterov, Y. Gradient methods for minimizing composite objective functions. Technical report, CORE Discussion Paper, 2007.
Nesterov, Y. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optimiz., 22(2):341-362, 2012.
Nesterov, Y. and Polyak, B.T. Cubic regularization of Newton method and its global performance. Math. Program., 108(1):177-205, 2006.
Richtárik, P. and Takáč, M. Iteration complexity of randomized block coordinate descent methods for minimizing a composite function. Math. Program., 2012.
Seeger, M.W. and Wipf, D.P. Variational Bayesian inference techniques. IEEE Signal Proc. Mag., 27(6):81-91, 2010.
Shalev-Shwartz, S. and Zhang, T. Proximal stochastic dual coordinate ascent. preprint arXiv 1211.2717v1, 2012.
Shalev-Shwartz, S. and Tewari, A. Stochastic methods for ℓ1 regularized loss minimization. In Proc. ICML, 2009.
Tseng, P. and Yun, S. A coordinate gradient descent method for nonsmooth separable minimization. Math. Program., 117:387-423, 2009.
Wainwright, M.J. and Jordan, M.I. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1-2):1-305, 2008.
Wright, S., Nowak, R., and Figueiredo, M. Sparse reconstruction by separable approximation. IEEE T. Signal Process., 57(7):2479-2493, 2009.
Zhang, T. Sequential greedy approximation for certain convex optimization problems. IEEE T. Inform. Theory, 49(3):682-691, 2003.
Zhang, X., Yu, Y., and Schuurmans, D. Accelerated training for matrix-norm regularization: a boosting approach. In Adv. NIPS, 2012.
-----2
Bertsimas, D., Chang, A., and Rudin, C. An integer optimization approach to associative classification. In Adv. Neur. Inf. Process. Syst. 25, pp. 269-277. 2012.
Candès, E. J. and Wakin, M. B. An introduction to compressive sampling. IEEE Signal Process. Mag., 25(2):21-30, Mar. 2008.
Clark, P. and Niblett, T. The CN2 induction algorithm. Mach. Learn., 3(4):261-283, Mar. 1989.
Cohen, W. W. Fast effective rule induction. In Proc. Int. Conf. Mach. Learn., pp. 115-123, Tahoe City, CA, Jul. 1995.
Cohen, W. W. and Singer, Y. A simple, fast, and effective rule learner. In Proc. Nat. Conf. Artif. Intell., pp. 335-342, Orlando, FL, Jul. 1999.
Davenport, T. H. and Harris, J. G. Competing on Analytics: The New Science of Winning. Harvard Business School Press, Boston, MA, 2007.
Dembczynski, K., Kot lowski, W., and S lowinski, R.ENDER: A statistical framework for boosting de- cision rules. Data Min. Knowl. Disc., 21(1):52{90, Jul. 2010.
Demiriz, A., Bennett, K. P., and Shawe-Taylor, J. Lin- ear programming boosting via column generation.Mach. Learn., 46(1{3):225{254, Jan. 2002.
Du, D.-Z. and Hwang, F. K. Pooling Designs and Non- adaptive Group Testing: Important Tools for DNA Sequencing. World Scientic, Singapore, 2006.
Dyachkov, A. G. and Rykov, V. V. A survey of super- imposed code theory. Probl. Control Inform., 12(4): 229{242, 1983.
Frank, A. and Asuncion, A. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2010.
Friedman, J. H. and Popescu, B. E. Predictive learning via rule ensembles. Ann. Appl. Stat., 2(3):916-954, Sep. 2008.
Fry, C. Closing the gap between analytics and action. INFORMS Analytics Mag., 4(6):4-5, Nov./Dec. 2011.
Fürnkranz, J. Separate-and-conquer rule learning. Artif. Intell. Rev., 13(1):3-54, Feb. 1999.
Gilbert, A. C., Iwen, M. A., and Strauss, M. J. Group testing and sparse signal recovery. In Asilomar Conf. Signals Syst. Comp. Conf. Record, pp. 1059-1063, Pacific Grove, CA, Oct. 2008.
Issenberg, S. How president Obama's campaign used big data to rally individual voters. MIT Tech. Rev., 116(1):38-49, Jan./Feb. 2013.
Jawanpuria, P., Nath, J. S., and Ramakrishnan, G. Efficient rule ensemble learning using hierarchical kernels. In Proc. Int. Conf. Mach. Learn., pp. 161-168, Bellevue, WA, Jun.-Jul. 2011.
Kearns, M. J. and Vazirani, U. V. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA, 1994.
Letham, B., Rudin, C., McCormick, T. H., and Madigan, D. Building interpretable classifiers with rules using Bayesian analysis. Technical Report 609, Dept. Stat., Univ. Washington, Dec. 2012.
Liu, J. and Li, M. Finding cancer biomarkers from mass spectrometry data by decision lists. J. Comp. Bio., 12(7):971-979, Sep. 2005.
Malioutov, D. and Malyutov, M. Boolean compressed sensing: LP relaxation for group testing. In Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pp. 3305-3308, Kyoto, Japan, Mar. 2012.
Malyutov, M. The separating property of random matrices. Math. Notes, 23(1):84-91, 1978.
Marchand, M. and Shawe-Taylor, J. The set covering machine. J. Mach. Learn. Res., 3:723-746, Dec. 2002.
Mazumdar, A. On almost disjunct matrices for group testing. In Proc. Int. Symp. Alg. Comput., pp. 649-658, Taipei, Taiwan, Dec. 2012.
Natarajan, B. K. Sparse approximate solutions to linear systems. SIAM J. Comput., 24(2):227-234, 1995.
Quinlan, J. R. Simplifying decision trees. Int. J. Man-Mach. Studies, 27(3):221-234, Sep. 1987.
Rivest, R. L. Learning decision lists. Mach. Learn., 2(3):229-246, Nov. 1987.
Rückert, U. and Kramer, S. Margin-based first-order rule learning. Mach. Learn., 70(2-3):189-206, Mar. 2008.
Valiant, L. G. Learning disjunctions of conjunctions. In Proc. Int. Joint Conf. Artif. Intell., pp. 560-566, Los Angeles, CA, Aug. 1985.
Varshney, K. R., Rasmussen, J. C., Mojsilovic, A., Singh, M., and DiMicco, J. M. Interactive visual salesforce analytics. In Proc. Int. Conf. Inf. Syst., Orlando, FL, Dec. 2012.
-----0
Ando, R.K. and Zhang, T. A framework for learning predictive structures from multiple tasks and unlabeled data. J. of Machine Learning Research, 6:1817-1853, 2005.
Argyriou, A., Evgeniou, T., and Pontil, M. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.
Argyriou, A., Maurer, A., and Pontil, M. An algorithm for transfer learning in a heterogeneous environment. Proc. European Conf. Machine Learning, pp. 71-85, 2008.
Bartlett, P.L. and Mendelson, S. Rademacher and Gaussian complexities: risk bounds and structural results. J. of Machine Learning Research, 3:463-482, 2002.
Baxter, J. A model for inductive bias learning. J. of Artificial Intelligence Research, 12:149-198, 2000.
Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
Ben-David, S. and Schuller, R. Exploiting task relatedness for multiple task learning. Proceedings of Computational Learning Theory (COLT), 2003.
Bühlmann, P. and van de Geer, S. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.
Caruana, R. Multi-task learning. Machine Learning, 28:41-75, 1997.
Combettes, P.L. and Wajs, V.R. Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4):1168-1200, 2006.
Evgeniou, T., Micchelli, C.A., and Pontil, M. Learning multiple tasks with kernel methods. J. of Machine Learning Research, 6:615-637, 2005.
Jenatton, R., Mairal, J., Obozinski, G., and Bach, F. Proximal methods for hierarchical sparse coding. J. of Machine Learning Research, 12:2297-2334, 2011.
Koltchinskii, V. and Panchenko, D. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30(1):1-50, 2002.
Kumar, A. and Daumé III, H. Learning task grouping and overlap in multitask learning. International Conference on Machine Learning (ICML), 2012.
Ledoux, M. and Talagrand, M. Probability in Banach Spaces. Springer, 1991.
Lounici, K., Pontil, M., Tsybakov, A.B., and van de Geer, S. Oracle inequalities and optimal inference under group sparsity. Annals of Statistics, 39(4):2164-2204, 2011.
Maurer, A. Concentration inequalities for functions of independent variables. Random Structures and Algorithms, 29:121-138, 2006.
Maurer, A. Transfer bounds for linear feature learning. Machine Learning, 75(3):327-350, 2009.
Maurer, A. and Pontil, M. K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11):5839-5846, 2010.
McDiarmid, C. Probabilistic Methods of Algorithmic Discrete Mathematics. Springer, 1998.
Olshausen, B.A. and Field, D.J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607-609, 1996.
Slepian, D. The one-sided barrier problem for Gaussian noise. Bell System Tech. J., 41:463-501, 1962.
Thrun, S. and Pratt, L. Learning to Learn. Springer, 1998.
Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 58(1):267-288, 1996.
-----0
Bradley, David M. and Bagnell, J. Andrew. Differentiable sparse coding. In Advances in Neural Information Processing Systems 21, pp. 113-120. MIT Press, 2009.
Carl, B. and Stephani, I. Entropy, compactness, and the approximation of operators, volume 98. Cambridge University Press, 1990.
Fazel, Maryam. Matrix rank minimization with applications. Elec Eng Dept Stanford University, 54:1-130, 2002.
Kakade, Sham M., Sridharan, Karthik, and Tewari, Ambuj. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Koller, Daphne, Schuurmans, Dale, Bengio, Yoshua, and Bottou, Leon (eds.), Advances in Neural Information Processing Systems 21, pp. 793-800. MIT Press, 2009.
LeCun, Yann, Bottou, Leon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
Mairal, Julien, Bach, Francis, Ponce, Jean, Sapiro, Guillermo, and Zisserman, Andrew. Supervised dictionary learning. In Koller, Daphne, Schuurmans, Dale, Bengio, Yoshua, and Bottou, Leon (eds.), Advances in Neural Information Processing Systems 21, pp. 1033-1040. MIT Press, 2009.
Mairal, Julien, Bach, Francis, and Ponce, Jean. Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):791-804, 2012.
Maurer, A. and Pontil, M. K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11):5839-5846, 2010.
Mendelson, Shahar and Philips, Petra. On the importance of small coordinate projections. Journal of Machine Learning Research, 5:219-238, 2004.
Osborne, Michael R., Presnell, Brett, and Turlach, Berwin A. On the lasso and its dual. Journal of Computational and Graphical Statistics, pp. 319-337, 2000.
Shawe-Taylor, John, Bartlett, Peter L., Williamson, Robert C., and Anthony, Martin. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926-1940, 1998.
Steinwart, Ingo and Christmann, Andreas. Support vector machines. Springer, 2008.
Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pp. 267-288, 1996.
Vainsencher, Daniel, Mannor, Shie, and Bruckstein, Alfred M. The sample complexity of dictionary learning. Journal of Machine Learning Research, 12:3259-3281, 2011.
Vapnik, Vladimir N. and Chervonenkis, Alexey Ya. Uniform convergence of frequencies of occurrence of events to their probabilities. In Dokl. Akad. Nauk SSSR, volume 181, pp. 915-918, 1968.
Vidyasagar, Mathukumalli. Learning and Generalization with Applications to Neural Networks. Springer, 2002.
Yu, Kai, Zhang, Tong, and Gong, Yihong. Nonlinear learning using local coordinate coding. In Bengio, Yoshua, Schuurmans, Dale, Lafferty, John, Williams, Christopher K. I., and Culotta, Aron (eds.), Advances in Neural Information Processing Systems 22, pp. 2223-2231. MIT Press, 2009.
-----0
Adelson, E.H. and Bergen, J.R. Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A, 2(2):284-299, 1985.
Arndt, P.A., Mallot, H.A., and Bülthoff, H.H. Human stereovision without localized image features. Biological Cybernetics, 72(4):279-293, 1995.
Bethge, M, Gerwinn, S, and Macke, JH. Unsupervised learning of a steerable basis for invariant image representations. In Human Vision and Electronic Imaging XII. SPIE, February 2007.
Cadieu, Charles F. and Olshausen, Bruno A. Learning Intermediate-Level Representations of Form and Motion from Natural Movies. Neural Computation, 24(4):827-866, December 2011.
Fleet, D., Wagner, H., and Heeger, D. Neural encoding of binocular disparity: Energy models, position shifts and phase shifts. Vision Research, 36(12):1839-1857, June 1996.
Földiák, Peter. Learning invariance from transformation sequences. Neural Computation, 3(2):194-200, 1991.
Gray, Robert M. Toeplitz and circulant matrices: a review. Commun. Inf. Theory, 2:155-239, August 2005.
Grimes, David and Rao, Rajesh. Bilinear sparse coding for invariant vision. Neural Computation, 17(1):47-73, 2005.
Hoyer, Patrik and Hyvärinen, Aapo. A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42:1593-1605, 2002.
Hyvärinen, Aapo and Hoyer, Patrik. Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12:1705-1720, July 2000.
Hyvärinen, Aapo, Hurri, J., and Hoyer, Patrik O. Natural Image Statistics: A Probabilistic Approach to Early Computational Vision. Springer Verlag, 2009.
Larochelle, Hugo, Erhan, Dumitru, Courville, Aaron, Bergstra, James, and Bengio, Yoshua. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, 2007.
Le, Q.V., Zou, W.Y., Yeung, S.Y., and Ng, A.Y. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.
LeCun, Yann, Huang, Fu-Jie, and Bottou, Leon. Learning Methods for generic object recognition with invariance to pose and lighting. In CVPR, 2004.
Lee, T. and Soatto, S. Video-based descriptors for object recognition. Image and Vision Computing, 29(10):639-652, September 2011.
Memisevic, Roland. Gradient-based learning of higher-order image features. In ICCV, 2011.
Memisevic, Roland. On multi-view feature learning. In ICML, 2012.
Memisevic, Roland and Hinton, Geoffrey E. Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22(6):1473-1492, 2010.
Miao, Xu and Rao, Rajesh. Learning the Lie groups of visual invariance. Neural Computation, 19(10):2665-2693, 2007.
Olshausen, Bruno, Cadieu, Charles, Culpepper, Jack, and Warland, David. Bilinear models of natural images. In SPIE Proceedings: Human Vision Electronic Imaging XII, 2007.
Qian, Ning. Computing stereo disparity and motion with known binocular cell properties. Neural Computation, 6:390-404, May 1994.
Rao, Rajesh and Ruderman, Daniel. Learning lie groups for invariant visual perception. In Advances in Neural Information Processing Systems 11. MIT Press, 1999.
Rifai, Salah, Vincent, Pascal, Muller, Xavier, Glorot, Xavier, and Bengio, Yoshua. Contractive AutoEncoders: Explicit Invariance During Feature Extraction. In ICML, 2011.
Taylor, Graham W., Fergus, Rob, LeCun, Yann, and Bregler, Christoph. Convolutional learning of spatiotemporal features. In ECCV, 2010.
Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, and Manzagol, Pierre-Antoine. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
Wiskott, Laurenz. How does our visual system achieve shift and size invariance? In 23 Problems in Systems Neuroscience, chapter 16, pp. 322-340. Oxford University Press, New York, 2006.
Wiskott, Laurenz and Sejnowski, Terrence. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715-770, 2002.
Zou, Will, Ng, Andrew, Zhu, Shenghuo, and Yu, Kai. Deep learning of invariant features via simulated fixations in video. In Advances in Neural Information Processing Systems 25, 2012.
-----0
Ahrens, J. H. and Dieter, U. Sequential random sampling. ACM Transactions on Mathematical Software (TOMS), 11(2):157-169, 1985.
Bernstein, S. Theory of Probability. Moscow, 1927.
Blum, M., Floyd, R. W., Pratt, V., Rivest, R. L., and Tarjan, R. E. Time bounds for selection. Journal of Computer and System Sciences, 7(4):448-461, 1973.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
Chaudhuri, S., Motwani, R., and Narasayya, V. On random sampling over joins. ACM SIGMOD Record, 28(2):263-274, 1999.
Czajkowski, G. Sorting 1PB with MapReduce, 2008. http://googleblog.blogspot.com/2008/11/sorting1pb-with-mapreduce.html.
Dasgupta, A., Drineas, P., Harb, B., Kumar, R., and Mahoney, M. W. Sampling algorithms and coresets for ℓp regression. SIAM Journal on Computing, 38(5):2060-2078, 2009.
Dean, J. and Ghemawat, S. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107-113, 2008.
Fan, C. T., Muller, M. E., and Rezucha, I. Development of sampling plans by using sequential (item by item) selection techniques and digital computers. Journal of the American Statistical Association, 57(298):387-402, 1962.
Leskovec, J. and Faloutsos, C. Sampling from large graphs. In Proceedings of the 12th International Conference on Knowledge Discovery and Data Mining, pp. 631-636. ACM, 2006.
Manku, G. S., Rajagopalan, S., and Lindsay, B. G. Random sampling techniques for space efficient online computation of order statistics of large datasets. ACM SIGMOD Record, 28(2):251-262, 1999.
Maurer, A. A bound on the deviation probability for sums of non-negative random variables. J. Inequalities in Pure and Applied Mathematics, 4, 2003.
Owen, S., Anil, R., Dunning, T., and Friedman, E. Mahout in Action. Manning Publications Co., 2011.
Sunter, A. B. List sequential sampling with equal or unequal probabilities without replacement. Applied Statistics, pp. 261-268, 1977.
Thompson, S. K. Sampling. Wiley, 3 edition, 2012.
Tille, Y. Sampling Algorithms. Springer, 2006.
Toivonen, H. Sampling large databases for association rules. In Proceedings of the 22nd International Conference on Very Large Data Bases, pp. 134-145. Morgan Kaufmann Publishers Inc., 1996.
Vitter, J. S. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37-57, 1985.
Vitter, J. S. An efficient algorithm for sequential random sampling. ACM Transactions on Mathematical Software (TOMS), 13(1):58-67, 1987.
White, T. Hadoop: The Definitive Guide. O'Reilly Media, 2012.
-----0
Balcan, M.-F., Blum, A., Fine, S., and Mansour, Y. Distributed learning, communication complexity and privacy. Arxiv preprint arXiv:1204.3514, 2012.
Bekkerman, R., Bilenko, M., and Langford, J. (eds.). Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, 2011.
Bertsekas, D. P. and Tsitsiklis, J. N. Some aspects of parallel and distributed iterative algorithms - a survey. Automatica, 27(1):3-21, 1991.
Bertsimas, D. and Vempala, S. Solving convex programs by random walks. Journal of the ACM, 51(4):540-556, 2004.
Clarkson, K. L. Subgradient and sampling algorithms for ℓ1 regression. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 257-266. SIAM, 2005.
Clarkson, K. L. and Woodruff, D. P. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th Annual ACM symposium on Theory of Computing (STOC), 2013.
Clarkson, K. L., Drineas, P., Magdon-Ismail, M., Mahoney, M. W., Meng, X., and Woodruff, D. P. The Fast Cauchy Transform and faster robust linear regression. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2013.
Dasgupta, A., Drineas, P., Harb, B., Kumar, R., and Mahoney, M. W. Sampling algorithms and coresets for ℓp regression. SIAM J. Comput., 38(5):2060-2078, 2009.
Daume, III, H., Phillips, J. M., Saha, A., and Venkatasubramanian, S. Efficient protocols for distributed classification and optimization. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory, pp. 154-168, 2012.
Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proceedings of the Sixth Symposium on Operating System Design and Implementation (OSDI), pp. 137-149, 2004.
Feldman, J., Muthukrishnan, S., Sidiropoulos, A., Stein, C., and Svitkina, Z. On distributing symmetric streaming computations. ACM Transactions on Algorithms, 6(4):Article 66, 2010.
Goodrich, M. T. Simulating parallel algorithms in the MapReduce framework with applications to parallel computational geometry. Arxiv preprint arXiv:1004.4708, 2010.
Grotschel, M., Lovasz, L., and Schrijver, A. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1(2):169-197, 1981.
Karloff, H., Suri, S., and Vassilvitskii, S. A model of computation for MapReduce. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 938-948, 2010.
Levin, A. Y. On an algorithm for the minimization of convex functions. In Soviet Mathematics Doklady, volume 160, pp. 1244-1247, 1965.
Lovasz, L. Hit-and-run mixes fast. Math. Prog., 86(3):443-461, 1999.
Mackey, L., Talwalkar, A., and Jordan, M. I. Divide-and-conquer matrix factorization. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2011.
Mahoney, M. W. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning. NOW Publishers, Boston, 2011. Also available at: arXiv:1104.5557.
Meng, X. and Mahoney, M. W. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the 45th Annual ACM symposium on Theory of Computing (STOC), 2013.
Mitchell, J. E. Polynomial interior point cutting plane methods. Optimization Methods and Software, 18(5):507-534, 2003.
Nesterov, Y. Unconstrained convex minimization in relative scale. Mathematics of Operations Research, 34(1):180-193, 2009.
Nesterov, Y. and Nemirovsky, A. Interior Point Polynomial Methods in Convex Programming. SIAM, 1994.
Portnoy, S. and Koenker, R. The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators. Statistical Science, 12(4):279-300, 1997.
Rousseeuw, P. J. and Leroy, A. M. Robust Regression and Outlier Detection. Wiley, 1987.
Sohler, C. and Woodruff, D. P. Subspace embeddings for the ℓ1-norm with applications. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing (STOC), pp. 755-764. ACM, 2011.
Tarasov, S., Khachiyan, L. G., and Erlikh, I. The method of inscribed ellipsoids. In Soviet Mathematics Doklady, volume 37, pp. 226-230, 1988.
Vaidya, P. M. A new algorithm for minimizing convex functions over convex sets. Math. Prog., 73:291-341, 1996.
Zhang, Y., Duchi, J., and Wainwright, M. J. Communication-efficient algorithms for statistical optimization. In Annual Advances in Neural Information Processing Systems 26: Proceedings of the 2012 Conference, 2012.
-----0
Cypher, Allen, Halbert, Daniel C., Kurlander, David, Lieberman, Henry, Maulsby, David, Myers, Brad A., and Turransky, Alan (eds.). Watch what I do: programming by demonstration. MIT Press, Cambridge, MA, USA, 1993. ISBN 0262032139.
Gulwani, Sumit. Automating string processing in spreadsheets using input-output examples. In POPL, pp. 317-330, 2011.
Gulwani, Sumit. Synthesis from examples: Interaction models and algorithms. 14th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, 2012.
Gulwani, Sumit, Korthikanti, Vijay Anand, and Tiwari, Ashish. Synthesizing geometry constructions. In PLDI, pp. 50-61, 2011.
Gulwani, Sumit, Harris, William R., and Singh, Rishabh. Spreadsheet data manipulation using examples. Commun. ACM, 55(8):97-105, 2012.
Jha, Susmit, Gulwani, Sumit, Seshia, Sanjit A., and Tiwari, Ashish. Oracle-guided component-based program synthesis. In ICSE, pp. 215-224, 2010.
Johnson, Mark, Griffiths, Thomas L., and Goldwater, Sharon. Adaptor grammars: A framework for specifying compositional nonparametric Bayesian models. In NIPS, pp. 641-648, 2006.
Lau, Tessa, Wolfman, Steven A., Domingos, Pedro, and Weld, Daniel S. Programming by demonstration using version space algebra. Mach. Learn., 53:111-156, October 2003. ISSN 0885-6125.
Lau, Tessa A., Domingos, Pedro, and Weld, Daniel S. Version space algebra and its application to programming by demonstration. In ICML, pp. 527-534, 2000.
Liang, Percy, Jordan, Michael I., and Klein, Dan. Learning programs: A hierarchical Bayesian approach. In ICML, pp. 639-646, 2010.
Lieberman, H. Your Wish Is My Command: Programming by Example. Morgan Kaufmann, 2001.
Miller, Robert C. Lightweight Structure in Text. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, May 2002. URL http://www.cs.cmu.edu/~rcm/papers/thesis/.
Nix, Robert P. Editing by example. TOPLAS, 7(4):600-621, 1985.
Rush, Alexander M., Collins, Michael, and Kaelbling, Pack. A tutorial on dual decomposition and Lagrangian relaxation for inference in natural language processing. http://www.cs.columbia.edu/~mcollins/acltutorial.pdf, 2011.
Witten, Ian H. and Mo, Dan. TELS: learning text editing tasks from examples, pp. 183-203. MIT Press, Cambridge, MA, USA, 1993. ISBN 0-262-03213-9.
-----0
Agarwal, S. Surrogate regret bounds for the area under the ROC curve via strongly proper losses. In COLT, 2013.
Bartlett, P.L., Jordan, M.I., and McAuliffe, J.D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.
Bousquet, O. and Elisseeff, A. Stability and generalization. Journal of Machine Learning Research, 2:499-526, 2002.
Cardie, C. and Howe, N. Improving minority class prediction using case-specific feature weights. In ICML, 1997.
Chan, P.K. and Stolfo, S.J. Learning with non-uniform class and cost distributions: Effects and a distributed multi-classifier approach. In KDD-98 Workshop on Distributed Data Mining, 1998.
Chawla, N.V., Bowyer, K.W., Hall, L.O., and Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321-357, 2002.
Chawla, N.V., Japkowicz, N., and Kolcz, A. (eds.). Proceedings of the ICML 2003 Workshop on Learning from Imbalanced Data Sets. 2003.
Chawla, N.V., Japkowicz, N., and Kolcz, A. Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explorations Newsletter, 6(1):1-6, 2004.
Chawla, N.V., Cieslak, D.A., Hall, L.O., and Joshi, A. Automatically countering imbalance and its empirical relationship to cost. Data Mining & Knowledge Discovery, 17(2):225-252, 2008.
Cheng, J., Hatzis, C., Hayashi, H., Krogel, M-A., Morishita, S., Page, D., and Sese, J. KDD Cup 2001 report. ACM SIGKDD Explorations Newsletter, 3(2):47-64, 2002.
Daskalaki, S., Kopanas, I., and Avouris, N. Evaluation of classifiers for an uneven class distribution problem. Applied Artificial Intelligence, 20:381-417, 2006.
Davis, J. and Goadrich, M. The relationship between precision-recall and ROC curves. In Proc. ICML, 2006.
Drummond, C. and Holte, R.C. Severe class imbalance: Why better algorithms aren't the answer. In Proc. ECML, 2005.
Drummond, C. and Holte, R.C. Cost curves: An improved method for visualizing classifier performance. Machine Learning, 65(1):95-130, 2006.
Elkan, C. The foundations of cost-sensitive learning. In Proc. IJCAI, 2001.
Frank, A. and Asuncion, A. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml.
Gu, Q., Zhu, L., and Cai, Z. Evaluation measures of the classification performance of imbalanced data sets. In Computational Intelligence and Intelligent Systems, volume 51, pp. 461-471. 2009.
He, H. and Garcia, E.A. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263-1284, 2009.
Japkowicz, N. The class imbalance problem: Significance and strategies. In Proc. ICAI, 2000.
Japkowicz, N. and Stephen, S. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):429-449, 2002.
Kennedy, K., Namee, B.M., and Delany, S.J. Learning without default: a study of one-class classification and the low-default portfolio problem. In ICAICS, 2009.
Kotlowski, W., Dembczynski, K., and Hullermeier, E. Bipartite ranking through minimization of univariate loss. In Proc. ICML, 2011.
Kubat, M. and Matwin, S. Addressing the curse of imbalanced training sets: One-sided selection. In ICML, 1997.
Lawrence, S., Burns, I., Back, A., Tsoi, A-C., and Giles, C.L. Neural network classification and prior class probabilities. In Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science, volume 1524, pp. 299-313. 1998.
Lewis, D.D. and Gale, W.A. A sequential algorithm for training text classifiers. In Proc. SIGIR, 1994.
Ling, C., Ling, C.X., and Li, C. Data mining for direct marketing: Problems and solutions. In Proc. KDD, 1998.
Liu, W. and Chawla, S. A quadratic mean based supervised learning model for managing data skewness. In Proc. SDM, 2011.
Platt, J.C. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Adv. Large Margin Classifiers, pp. 61-74, 1999.
Powers, R., Goldszmidt, M., and Cohen, I. Short term performance forecasting in enterprise systems. In Proc. KDD, 2005.
Qiao, X. and Liu, Y. Adaptive weighted learning for unbalanced multicategory classification. Biometrics, 65(1):159-168, 2009.
Reid, M.D. and Williamson, R.C. Surrogate regret bounds for proper losses. In Proc. ICML, 2009.
Reid, M.D. and Williamson, R.C. Composite binary losses. Journal of Machine Learning Research, 11:2387-2422, 2010.
Scott, C. Calibrated asymmetric surrogate losses. Electronic Journal of Statistics, 6:958-992, 2012.
Van Hulse, J., Khoshgoftaar, T.M., and Napolitano, A. Experimental perspectives on learning from imbalanced data. In Proc. ICML, 2007.
Wallace, B.C., Small, K., Brodley, C.E., and Trikalinos, T.A. Class imbalance, redux. In Proc. ICDM, 2011.
Zadrozny, Bianca, Langford, John, and Abe, Naoki. Cost-sensitive learning by cost-proportionate example weighting. In ICDM, 2003.
Zhang, T. Statistical behaviour and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32:56-134, 2004.
-----0
Agarwal, Alekh, Chapelle, Olivier, Dudík, Miroslav, and Langford, John. A reliable effective terascale linear learning system. CoRR, abs/1110.4198, 2011.
Bengio, Y. and Senecal, J. S. Adaptive importance sampling to accelerate training of a neural probabilistic language model. Trans. Neur. Netw., 19(4):713-722, April 2008. ISSN 1045-9227.
Beygelzimer, Alina, Dasgupta, Sanjoy, and Langford, John. Importance weighted active learning. CoRR, abs/0812.4952, 2008.
Bradley, Joseph K. and Schapire, Robert. Filterboost: Regression and classification on large datasets. In Platt, J.C., Koller, D., Singer, Y., and Roweis, S. (eds.), Advances in Neural Information Processing Systems 20, pp. 185-192. MIT Press, Cambridge, MA, 2008.
Floyd, S. and Warmuth, M. Sample compression, learnability, and the Vapnik-Chervonenkis dimension. Machine Learning, 21(3):269-304, 1995.
Grunwald, P.D. The minimum description length principle. MIT press, 2007.
Hanneke, S. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th International Conference on Machine Learning, pp. 353-360. ACM, 2007.
Langford, John. Vowpal wabbit, 2011. URL https://github.com/JohnLangford/vowpal_wabbit/wiki.
Lewis, David D. and Catlett, Jason. Heterogeneous uncertainty sampling for supervised learning. In Proceedings of the Eleventh International Conference on Machine Learning, pp. 148-156. Morgan Kaufmann, 1994.
Maurer, Andreas and Pontil, Massimiliano. Empirical Bernstein bounds and sample-variance penalization. In The 22nd Conference on Learning Theory, 2009.
Ridgeway, Greg. Generalized boosted models: A guide to the gbm package, 2005.
Saberian, Mohammad and Vasconcelos, Nuno. Boosting classifier cascades. In Lafferty, J., Williams, C. K. I., Shawe-Taylor, J., Zemel, R.S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23, pp. 2047-2055. 2010.
Sonnenburg, Soren. Pascal large scale learning challenge, 2008. URL http://largescale.ml.tu-berlin.de/about/.
Sonnenburg, Soren and Franc, Vojtech. Coffin: A computational framework for linear SVMs. In Proc. ICML 2010, 2010.
Viola, Paul and Jones, Michael. Rapid object detection using a boosted cascade of simple features, 2001.
Xu, Puyang, Gunawardana, Asela, and Khudanpur, Sanjeev. Efficient subsampling for training complex language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pp. 1128-1136, 2011. ISBN 978-1-937284-11-4.
-----1
A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in Neural Information Processing Systems 19, pages 41-48. MIT Press, 2007.
S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems 19, pages 137-144. MIT Press, 2007.
S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79:151-175, 2010.
A. Berlinet and C. Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Kluwer Academic Publishers, 2004.
S. Bickel, M. Bruckner, and T. Scheffer. Discriminative learning under covariate shift. Journal of Machine Learning Research, pages 2137-2155, 2009.
G. Blanchard, G. Lee, and C. Scott. Generalizing from several related classification tasks to a new unlabeled sample. In Advances in Neural Information Processing Systems 24, pages 2178-2186, 2011.
R. R. Brinkman, M. Gasparetto, S.-J. J. Lee, A. J. Ribickas, J. Perkins, W. Janssen, R. Smiley, and C. Smith. High-content flow cytometry and temporal data analysis for defining a cellular signature of graft-versus-host disease. Biol Blood Marrow Transplant, 13(6):691-700, 2007. ISSN 1083-8791.
A. Christmann and I. Steinwart. Universal kernels on non-standard input spaces. In Advances in Neural Information Processing Systems 23, pages 406-414. MIT Press, 2010.
K. Fukumizu, F. R. Bach, and M. I. Jordan. Kernel dimensionality reduction for supervised learning. In Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Scholkopf. Dataset Shift in Machine Learning, chapter Covariate Shift by Kernel Mean Matching, pages 131-160. MIT Press, 2009.
Q. Gu and J. Zhou. Learning the shared subspace for multi-task clustering and transductive transfer classification. In Proceedings of the 9th IEEE International Conference on Data Mining, pages 159-168. IEEE Computer Society, 2009.
M. Kim and V. Pavlovic. Central subspace dimensionality reduction using covariance operators. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):657-670, 2011.
K.-C. Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316-327, 1991.
K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Scholkopf. Learning from distributions via support measure machines. In Advances in Neural Information Processing Systems 25, pages 10-18. MIT Press, 2012.
S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, October 2010.
S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199-210, 2011.
A. Passos, P. Rai, J. Wainer, and H. Daume III. Flexible modeling of latent task structures in multitask learning. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, UK, 2012.
J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence. Dataset Shift in Machine Learning. MIT Press, 2009.
B. Scholkopf, A. Smola, and K.-R. Muller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319, July 1998.
A. Smola, A. Gretton, L. Song, and B. Scholkopf. A Hilbert space embedding for distributions. In Proceedings of the 18th International Conference on Algorithmic Learning Theory, pages 13-31. Springer-Verlag, 2007.
B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Scholkopf, and G. R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 99:1517-1561, 2010.
G. Widmer and M. Kubat. Learning in the Presence of Concept Drift and Hidden Contexts. Machine Learning, 23:69-101, 1996.
H.-M. Wu. Kernel sliced inverse regression with applications to classification. Journal of Computational and Graphical Statistics, 17(3):590-610, 2008.
-----0
Agarwal, S. The Infinite Push: A new support vector ranking algorithm that directly optimizes accuracy at the absolute top of the list. In Proceedings of the SIAM International Conference on Data Mining, 2011.
Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., and Roth, D. Generalization bounds for the area under the ROC curve. Journal of Machine Learning Research, 6:393-425, 2005.
Agarwal, S., Dugar, D., and Sengupta, S. Ranking chemical structures for drug discovery: A new machine learning approach. Journal of Chemical Information and Modeling, 50(5):716-731, 2010.
Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, 2005.
Cortes, C. and Mohri, M. AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.
Dodd, L. E. and Pepe, M. S. Partial AUC estimation and regression. Biometrics, 59(3):614-623, 2003.
Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933-969, 2003.
Herbrich, R., Graepel, T., and Obermayer, K. Large margin rank boundaries for ordinal regression. In Smola, A., Bartlett, P., Schoelkopf, B., and Schuurmans, D. (eds.), Advances in Large Margin Classifiers, pp. 115-132. MIT Press, 2000.
Hsu, M.-J. and Hsueh, H.-M. The linear combinations of biomarkers which maximize the partial area under the ROC curve. Computational Statistics, pp. 1-20, 2012.
Joachims, T. Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
Joachims, T. A support vector method for multivariate performance measures. In Proceedings of the 22nd International Conference on Machine Learning, 2005.
Joachims, T. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.
Joachims, T. SVMstruct: support vector machine for complex outputs, 2008. URL http://svmlight.joachims.org/svm_struct.html.
Jorissen, R. N. and Gilson, M. K. Virtual screening of molecular databases using a support vector machine. Journal of Chemical Information and Modeling, 45:549-561, 2005.
Komori, O. and Eguchi, S. A boosting method for maximizing the partial area under the ROC curve. BMC Bioinformatics, 11:314, 2010.
Liu, T-Y., Xu, J., Qin, T., Xiong, W., and Li, H. Letor: benchmark dataset for research on learning to rank for information retrieval. In Proceedings of SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, 2007.
Miller, H. The FROC curve: A representation of the observer's performance for the method of free response. Journal of the Acoustical Society of America, 46(6B):1473-1476, 1969.
Pepe, M. S. and Thompson, M. L. Combining diagnostic test results to increase accuracy. Biostatistics, 1(2):123-140, 2000.
Qi, Y., Bar-Joseph, Z., and Klein-Seetharaman, J. Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins, 63:490-500, 2006.
Rakotomamonjy, A. Sparse support vector infinite push. In Proceedings of the 29th International Conference on Machine Learning, 2012.
Rao, R. B., Yakhnenko, O., and Krishnapuram, B. KDD Cup 2008 and the workshop on mining medical data. SIGKDD Explorations Newsletter, 10(2):34-38, 2008.
Ricamato, M. T. and Tortorella, F. Partial AUC maximization in a linear combination of dichotomizers. Pattern Recognition, 44(10-11):2669-2677, 2011.
Rudin, C. The p-norm push: A simple convex ranking algorithm that concentrates at the top of the list. Journal of Machine Learning Research, 10:2233-2271, 2009.
Takenouchi, T., Komori, O., and Eguchi, S. An extension of the receiver operating characteristic curve and AUC-optimal classification. Neural Computation, 24(10):2789-2824, 2012.
Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453-1484, 2005.
Vedaldi, A. A matlab wrapper of SVMstruct, 2008. URL http://www.vlfeat.org/~vedaldi/code/svm-struct-matlab.html.
Wang, Z. and Chang, Y.-C.I. Marker selection via maximizing the partial area under the ROC curve of linear risk scores. Biostatistics, 12(2):369-385, 2011.
Wu, S.-H., Lin, K.-P., Chen, C.-M., and Chen, M.-S. Asymmetric support vector machines: low false-positive learning under the user tolerance. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008.
Yu, C. J. and Joachims, T. Training structural SVMs with kernels using sampled cuts. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 794-802, 2008.
Yue, Y., Finley, T., Radlinski, F., and Joachims, T. A support vector method for optimizing average precision. In Proceedings of the 30th ACM SIGIR International Conference on Research and Development in Information Retrieval, 2007.
-----0
Brafman, Ronen I. and Tennenholtz, Moshe. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213-231, 2003.
Chakraborty, Doran and Stone, Peter. Structure learning in ergodic factored MDPs without knowledge of the transition function's in-degree. In Proceedings of the International Conference on Machine Learning, ICML '11, pp. 737-744, 2011.
Degris, Thomas, Sigaud, Olivier, and Wuillemin, Pierre-Henri. Learning the structure of factored Markov decision processes in reinforcement learning problems. In Proceedings of the International Conference on Machine Learning, ICML '06, pp. 257-264, 2006.
Diuk, Carlos, Li, Lihong, and Leffler, Bethany R. The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In Proceedings of the International Conference on Machine Learning, ICML '09, pp. 249-256, 2009.
Hester, Todd and Stone, Peter. Generalized model learning for reinforcement learning in factored domains. In Proceedings of International Conference on Autonomous Agents and Multiagent Systems, volume 3 of AAMAS '09, pp. 717-724, 2009.
Hester, Todd and Stone, Peter. Texplore: real-time sample-efficient reinforcement learning for robots. Machine Learning, pp. 1-45, 2012.
Kroon, Mark and Whiteson, Shimon. Automatic feature selection for model-based reinforcement learning in factored MDPs. In Proceedings of the International Conference on Machine Learning and Applications, ICMLA '09, 2009.
Lee, S. and Wright, S. Manifold identification of dual averaging methods for regularized stochastic online learning. Journal of Machine Learning Research, 13:1705-1744, 2012.
Leffler, Bethany R., Littman, Michael L., and Edmunds, Timothy. Efficient reinforcement learning with relocatable action models. In Proceedings of the National Conference on Artificial Intelligence, AAAI '07, pp. 572-577, 2007.
McCarthy, John. Situations, actions, and causal laws. Technical Report Memo 2, Stanford Artificial Intelligence Project, Stanford University, 1963.
Nguyen, Trung T., Silander, Tomi, and Leong, Tze Yun. Transferring expectations in model-based reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, NIPS '12, 2012.
Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993. ISBN 1-55860-238-0.
Ross, Stephane and Pineau, Joelle. Model-based Bayesian reinforcement learning in large structured domains. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, UAI '08, pp. 476-483, 2008.
Strehl, Alexander L. and Littman, Michael L. Online linear regression and its application to model-based reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems, NIPS '07, pp. 737-744, 2007.
Strehl, Alexander L., Diuk, Carlos, and Littman, Michael L. Efficient structure learning in factored-state MDPs. In Proceedings of the National Conference on Artificial Intelligence, AAAI '07, 2007.
Walsh, Thomas J., Szita, Istvan, Diuk, Carlos, and Littman, Michael L. Exploring compact reinforcement-learning representations with linear regression. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, UAI '09, pp. 591-598, 2009.
Xiao, Lin. Dual averaging methods for regularized stochastic learning and online optimization. In Proceedings of the Advances in Neural Information Processing Systems, NIPS '09, 2009.
Yang, Haiqin, Xu, Zenglin, King, Irwin, and Lyu, Michael R. Online learning for group lasso. In Proceedings of the International Conference on Machine Learning, ICML '10, pp. 1191-1198, 2010.
-----0
Bartlett, Peter L., Jordan, Michael I., and McAuliffe, Jon D. Convexity, classification, and risk bounds. Technical Report 638, University of California, Berkeley, November 2003.
Ben-David, S., Eiron, N., and Long, P.M. On the difficulty of approximately maximizing agreements. J. Comput. System Sci., 66(3):496-514, 2003.
Collobert, R., Sinz, F., Weston, J., and Bottou, L. Trading convexity for scalability. In Proceedings of the Twenty-third International Conference on Machine Learning (ICML 2006), pp. 201-208. ACM Press, 2006.
Cortes, Corinna and Vapnik, Vladimir. Support-vector networks. Machine Learning, 20, 1995.
Ding, Nan and Vishwanathan, S.V.N. t-logistic regression. In Advances in Neural Information Processing Systems. NIPS, 2010.
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.R., and Lin, C.-J. Liblinear: A library for large linear classification. Journal of Machine Learning Research, pp. 1871-1874, 2008.
Feldman, V., Guruswami, V., Raghavendra, P., and Wu, Yi. Agnostic learning of monomials by halfspaces is hard. Preliminary version appeared in FOCS, 2009.
Frank, A. and Asuncion, A. UCI machine learning repository. Irvine, CA: University of California, School of Information and Computer Science, 2010. URL http://archive.ics.uci.edu/ml.
Hooke, R. and Jeeves, T.A. Direct search solution of numerical and statistical problems. Journal of the Association for Computing Machinery, 8(2):212-229, 1961.
Land, A. H. and Doig, A. G. An Automatic Method for Solving Discrete Programming Problems. Econometrica, 28:497-520, 1960.
Li, Ling and Lin, Hsuan-Tien. Optimizing 0-1 loss for perceptrons by random coordinate descent. In Proceedings of the 2007 International Joint Conference on Neural Networks, 2007.
Long, Phil and Servedio, Rocco. Random classification noise defeats all convex potential boosters. Machine Learning Journal, 78(3):287-304, 2010.
Masnadi-Shirazi, Hamed and Vasconcelos, Nuno. On the design of loss functions for classification: theory, robustness to outliers, and savageboost. In NIPS, 2008.
McAllester, David. Statistical methods for artificial intelligence. Lecture Notes for TTIC103, Toyota Technological Institute at Chicago, 8 2007.
Minka, Thomas. Expectation propagation for approximate Bayesian inference. UAI, pp. 362-369, 2001.
Wu, Y. and Liu, Y. Robust truncated hinge loss support vector machines. JASA, 102:974-983, 2007.
-----0
Agakov, F. and Barber, D. Kernelized infomax clustering. In NIPS, 2006.
Ali, S. M. and Silvey, S. D. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 28(1):131-142, 1966.
Allwein, E., Schapire, R., and Singer, Y. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113-141, 2000.
Bartlett, P. and Mendelson, S. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.
Belkin, M., Matveeva, I., and Niyogi, P. Regularization and semi-supervised learning on large graphs. In ALT, 2004.
Belkin, M., Niyogi, P., and Sindhwani, V. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399-2434, 2006.
Bennett, K. and Demiriz, A. Semi-supervised support vector machines. In NIPS, 1998.
Chapelle, O., Scholkopf, B., and Zien, A. (eds.). Semi-Supervised Learning. MIT Press, 2006.
Cortes, C., Mohri, M., Pechyony, D., and Rastogi, A. Stability of transductive regression algorithms. In ICML, 2008.
Csiszar, I. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229-318, 1967.
Delalleau, O., Bengio, Y., and Le Roux, N. Efficient nonparametric function induction in semi-supervised learning. In AISTATS, 2005.
El-Yaniv, R. and Pechyony, D. Transductive Rademacher complexity and its applications. Journal of Artificial Intelligence Research, 35:193-234, 2009.
Gomes, R., Krause, A., and Perona, P. Discriminative clustering by regularized information maximization. In NIPS, 2010.
Grandvalet, Y. and Bengio, Y. Semi-supervised learning by entropy minimization. In NIPS, 2004.
Hall, M. Correlation-based feature selection for discrete and numeric class machine learning. In ICML, 2000.
Joachims, T. Transductive inference for text classification using support vector machines. In ICML, 1999.
Joachims, T. Transductive learning via spectral graph partitioning. In ICML, 2003.
Kanamori, T., Suzuki, T., and Sugiyama, M. Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86(3):335-367, 2012.
Koltchinskii, V. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902-1914, 2001.
Kullback, S. and Leibler, R. A. On information and sufficiency. Annals of Mathematical Statistics, 22:79-86, 1951.
Mann, G. and McCallum, A. Simple, robust, scalable semi-supervised learning via expectation regularization. In ICML, 2007.
McDiarmid, C. On the method of bounded differences. In Siemons, J. (ed.), Surveys in Combinatorics, pp. 148-188. Cambridge University Press, 1989.
Meir, R. and Zhang, T. Generalization error bounds for Bayesian mixture algorithms. Journal of Machine Learning Research, 4:839-860, 2003.
Pearson, K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50:157-175, 1900.
Scholkopf, B. and Smola, A. Learning with Kernels. MIT Press, 2001.
Shannon, C. E. A mathematical theory of communication. Bell System Technical Journal, 27:379-423 & 623-656, 1948.
Sinha, K. and Belkin, M. Semi-supervised learning using sparse eigenfunction bases. In NIPS, 2009.
Sugiyama, M. Superfast-trainable multi-class probabilistic classifier by least-squares posterior fitting. IEICE Transactions on Information and Systems, E93-D(10):2690-2701, 2010.
Sugiyama, M., Yamada, M., Kimura, M., and Hachiya, H. On information-maximization clustering: Tuning parameter selection and analytic solution. In ICML, 2011.
Suzuki, T., Sugiyama, M., Kanamori, T., and Sese, J. Mutual information estimation reveals global associations between stimuli and biological processes. BMC Bioinformatics, 10(1):S52, 2009.
Szummer, M. and Jaakkola, T. Partially labeled classification with Markov random walks. In NIPS, 2001.
Szummer, M. and Jaakkola, T. Information regularization with partially labeled data. In NIPS, 2002.
Yamada, M., Sugiyama, M., Wichern, G., and Simm, J. Improving the accuracy of least-squares probabilistic classifiers. IEICE Transactions on Information and Systems, E94-D(6):1337-1340, 2011.
Zhou, D., Bousquet, O., Navin Lal, T., Weston, J., and Scholkopf, B. Learning with local and global consistency. In NIPS, 2003.
Zhu, X., Ghahramani, Z., and Lafferty, J. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003.
-----0
Allgower, E. L. and Georg, K. Continuation and path following. Acta Numerica, 2:1-63, 1993.
Best, M. J. An algorithm for the solution of the parametric quadratic programming problem. Applied Mathematics and Parallel Computing, pp. 57-76, 1996.
Boser, B. E., Guyon, I. M., and Vapnik, V. N. A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pp. 144-152, 1992.
Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
Chapelle, O. Training a support vector machine in the primal. Neural Computation, 19(5):1155-1178, 2007.
Chapelle, O. and Zien, A. Semi-supervised classification by low density separation. Tenth International Workshop on Artificial Intelligence and Statistics, 2005.
Chapelle, O., Scholkopf, B., and Zien, A. (eds.). Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.
Chapelle, O., Sindhwani, V., and Keerthi, S. Branch and bound for semi-supervised support vector machines. Advances in Neural Information Processing Systems, 2007.
Chapelle, O., Sindhwani, V., and Keerthi, S. S. Optimization techniques for semi-supervised support vector machines. J. Mach. Learning Res., 9:202-233, Feb. 2008.
Collobert, R., Sinz, F., Weston, J., and Bottou, L. Large scale transductive SVMs. Journal of Machine Learning Research, 7:1687-1712, 2006.
Colorni, A., Dorigo, M., and Maniezzo, V. Distributed optimization by ant colonies. Proc. European Conference on Artificial Life, pp. 134-142, 1991.
Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20:273-297, 1995.
Efron, B. and Tibshirani, R. Least angle regression. Annals of Statistics, 32(2):407-499, 2004.
Gal, T. Postoptimal Analysis, Parametric Programming, and Related Topics. Walter de Gruyter, 1995.
Hastie, T., Rosset, S., Tibshirani, R., and Zhu, J. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5:1391-1415, 2004.
Hromkovic, J. Algorithmics for Hard Problems. Springer, 2001.
Joachims, T. Transductive inference for text classification using support vector machines. International Conference on Machine Learning, 1999.
Karasuyama, M., Harada, N., Sugiyama, M., and Takeuchi, I. Multi-parametric solution-path algorithm for instance-weighted support vector machines. Machine Learning, 88(3):297-330, 2012.
Kirkpatrick, S. and Gelatt, C. D. Optimization by simulated annealing. Science, 220:671-680, 1983.
Korte, B. and Vygen, J. Combinatorial Optimization: Theory and Algorithms. Springer-Verlag, Berlin, 2000.
Ritter, K. On parametric linear and quadratic programming problems. Mathematical Programming: Proceedings of the International Congress on Mathematical Programming, pp. 307-335, 1984.
Sindhwani, V., Keerthi, S., and Chapelle, O. Deterministic annealing for semi-supervised kernel machines. International Conference on Machine Learning, 2006.
Takeuchi, I., Nomura, K., and Kanamori, T. Nonparametric conditional density estimation using piecewise-linear solution path of kernel quantile regression. Neural Computation, 21(2):539-559, 2009.
Vapnik, V. N. The Nature of Statistical Learning Theory. Springer, 1996.
Vapnik, V. N. and Sterin, A. On structural risk minimization or overall risk in a problem of pattern recognition. Automation and Remote Control, 1977.
Yuille, A. L. and Rangarajan, A. The concave-convex procedure (cccp). In Advances in Neural Information Processing Systems, volume 14, 2002.
-----0
Allgower, E. L. and Georg, K. Continuation and path following. Acta Numerica, 2:1-63, 1993.
Best, M. J. An algorithm for the solution of the parametric quadratic programming problem. Applied Mathematics and Parallel Computing, pp. 57-76, 1996.
Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
Cauwenberghs, G. and Poggio, T. Incremental and decremental support vector machine learning. Advances in Neural Information Processing Systems, 13, 2001.
Chang, C. C. and Lin, C. J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011.
Dai, L. and Pelckmans, K. An ellipsoid based two-stage screening test for BPDN. In Proceedings of the 20th European Signal Processing Conference, 2012.
Fan, J. and Lv, J. Sure independence screening for ultrahigh dimensional feature space. Journal of The Royal Statistical Society B, 70:849-911, 2008.
Fan, R. R., Chang, K. W., Hsieh, C. J., Wang, X. R., and Lin, C. J. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.
Friedman, J., Hastie, T., Hofling, H., and Tibshirani, R. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302-332, 2007.
Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1-22, 2010.
Gal, T. Postoptimal Analysis, Parametric Programming, and Related Topics. Walter de Gruyter, 1995.
Ghaoui, L. E., Viallon, V., and Rabbani, T. Safe feature elimination in sparse supervised learning. eprint arXiv:1009.3515, 2010.
Hastie, T., Rosset, S., Tibshirani, R., and Zhu, J. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5:1391-1415, 2004.
Hsieh, C. J., Chang, K. W., and Lin, C. J. A dual coordinate descent method for large-scale linear SVM. Proceedings of the 25th International Conference on Machine Learning, 2008.
Joachims, T. Making large-scale SVM learning practical. In Scholkopf, B., Burges, C. J. C., and Smola, A. J. (eds.), Advances in Kernel Methods - Support Vector Learning, pp. 169-184. MIT Press, 1999.
Karasuyama, M., Harada, N., Sugiyama, M., and Takeuchi, I. Multi-parametric solution-path algorithm for instance-weighted support vector machines. Machine Learning, 88(3):297-330, 2012.
Platt, J. Fast training of support vector machines using sequential minimal optimization. In Scholkopf, B., Burges, C. J. C., and Smola, A. J. (eds.), Advances in Kernel Methods - Support Vector Learning, pp. 185-208. MIT Press, 1999.
Takeuchi, I., Nomura, K., and Kanamori, T. Nonparametric conditional density estimation using piecewise-linear solution path of kernel quantile regression. Neural Computation, 21(2):539-559, 2009.
Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., and Tibshirani, R. J. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society, Series B, 74:245-266, 2011.
Vats, D. Parameter-free high-dimensional screening using multiple grouping of variables. eprint arXiv:1208.2043, 2012.
Wang, J., Lin, B., Gong, P., Wonka, P., and Ye, J. Lasso screening rules via dual polytope projection. eprint arXiv:1211.3966, 2012.
Xiang, Z. J. and Ramadge, P. J. Fast lasso screening test based on correlations. In Proceedings of the 37th IEEE International Conference on Acoustics, Speech and Signal Processing, 2012.
Xiang, Z. J., Xu, H., and Ramadge, P. J. Learning sparse representations of high dimensional data on large scale dictionaries. In Advances in Neural Information Processing Systems 24, pp. 900-908, 2012.
Yuan, G. X., Chang, K. W., Hsieh, C. J., and Lin, C. J. A comparison of optimization methods and software for large-scale l1-regularized linear classification. Journal of Machine Learning Research, 11:3183-3234, 2010.
Yuan, G. X., Ho, C. H., and Lin, C. J. An improved glmnet for l1-regularized logistic regression. Journal of Machine Learning Research, 13:1999-2030, 2012.
-----0
Buck, T.E., Rao, A., Coelho, L.P., Fuhrman, M.H., Jarvik, J.W., Berget, P.B., and Murphy, R.F. Cell cycle dependence of protein subcellular location inferred from static, asynchronous images. In Engineering in Medicine and Biology Society, 2009. EMBC 2009. Annual International Conference of the IEEE, pp. 1016-1019. IEEE, 2009.
Ferraty, F. and Vieu, P. Nonparametric functional data analysis: theory and practice. Springer, 2006.
Ingster, Y. and Stepanova, N. Estimation and detection of functions from anisotropic Sobolev classes. Electronic Journal of Statistics, 5:484-506, 2011.
Jaakkola, T.S., Haussler, D., et al. Exploiting generative models in discriminative classifiers. Advances in neural information processing systems, pp. 487-493, 1999.
Jebara, T., Kondor, R., and Howard, A. Probability product kernels. The Journal of Machine Learning Research, 5:819-844, 2004.
Kpotufe, S. k-nn regression adapts to local intrinsic dimension. arXiv preprint arXiv:1110.4300, 2011.
Laurent, B. Efficient estimation of integral functionals of a density. The Annals of Statistics, 24(2):659-681, 1996.
Moreno, P.J., Ho, P., and Vasconcelos, N. A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. Advances in Neural Information Processing Systems, 16:1385-1393, 2003.
Muandet, K., Scholkopf, B., Fukumizu, K., and Dinuzzo, F. Learning from distributions via support measure machines. arXiv preprint arXiv:1202.6504, 2012.
Nussbaum, M. On optimal filtering of a function of many variables in white Gaussian noise. Problemy Peredachi Informatsii, 19(2):23-29, 1983.
Poczos, B., Rinaldo, A., Singh, A., and Wasserman, L. Distribution-Free Distribution Regression. AISTATS 2013, arXiv preprint arXiv:1302.0082, 2012.
Poczos, B., Xiong, L., and Schneider, J. Nonparametric divergence estimation with applications to machine learning on distributions. arXiv preprint arXiv:1202.3758, 2012a.
Poczos, B., Xiong, L., Sutherland, D.J., and Schneider, J. Nonparametric kernel estimators for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 2989-2996. IEEE, 2012b.
Ramsay, J.O. and Silverman, B.W. Applied functional data analysis: methods and case studies, volume 77. Springer, New York, 2002.
Rigollet, P. and Vert, R. Optimal rates for plug-in estimators of density level sets. Bernoulli, 15(4):1154-1178, 2009.
Smola, A., Gretton, A., Song, L., and Scholkopf, B. A Hilbert space embedding for distributions. In Algorithmic Learning Theory, pp. 13-31. Springer, 2007.
Tsybakov, Alexandre B. Introduction to nonparametric estimation. Springer, 2008.
-----0
Ailon, N. An active learning algorithm for ranking from pairwise preferences with an almost optimal query complexity. J. Mach. Learn. Res., 13:137-164, 2012.
Biyikoglu, T., Leydold, J., and Stadler, P. F. Laplacian Eigenvectors of Graphs. Springer, 2007.
Bjorner, A., Lovasz, L., and Shor, P. W. Chip-firing games on graphs. European J. Combin, 12(4), 1991.
Callaghan, T., Mucha, P. J., and Porter, M. A. Random walker ranking for NCAA Division I-A football. 2007.
Chung, F. R. K. Spectral Graph Theory. AMS, 1997.
Chung, M. and Haber, E. Experimental design for biological systems. SIAM J. Control Optim., 50(1):471-489, 2012.
David, H. A. The Method of Paired Comparisons. Charles Griffin & Co., 1963.
Elo, A. The Rating of Chessplayers, Past and Present. Arco Pub., 1978.
Fiedler, M. Algebraic connectivity of graphs. Czech. Math. J., 23:298-305, 1973.
Ghosh, A. and Boyd, S. Upper bounds on algebraic connectivity via convex optimization. Linear Algebra and its Applications, 418:693-707, 2006a.
Ghosh, A. and Boyd, S. Growing well-connected graphs. Proc. IEEE Conf. Decision & Control, 2006b.
Glickman, M. E. A comprehensive guide to chess ratings. American Chess Journal, 3, 1995.
Glickman, M. E. Adaptive paired comparison design. Journal of Statistical Planning and Inference, 127:279-293, 2005.
Haber, E., Horesh, L., and Tenorio, L. Numerical methods for experimental design of large-scale linear ill-posed inverse problems. Inverse Problems, 24:055012, 2008.
Hirani, A. N., Kalyanaraman, K., and Watts, S. Least squares ranking on graphs. arXiv:1011.1716v4, 2011.
Horn, R.A. and Johnson, C.R. Matrix Analysis. Cambridge University Press, 1990.
Jamieson, K. G. and Nowak, R. D. Active ranking using pairwise comparisons. In Neural Information Processing Systems (NIPS), pp. 2240-2248, 2011.
Jiang, X., Lim, L.-H., Yao, Y., and Ye, Y. Statistical ranking and combinatorial Hodge theory. Math. Program. Ser. B, 127(1):203-244, 2010.
Keener, J. P. The Perron-Frobenius theorem and the ranking of football teams. SIAM Review, 35(1):80-93, 1993.
Langville, A. N. and Meyer, C. D. Who's #1?: The Science of Rating and Ranking. Princeton University Press, 2012.
Mohar, B. The Laplacian spectrum of graphs. In Graph Theory, Combinatorics, and Applications, volume 2, pp. 871-898. Wiley, 1991.
Mosk-Aoyama, D. Maximum algebraic connectivity augmentation is NP-hard. Operations Research Letters, 36(6):677-679, 2008.
Olfati-Saber, R., Fax, A., and Murray, R. M. Consensus and cooperation in networked multi-agent systems. Proceedings of the IEEE, 95(1):215-233, 2007.
Osting, B., Brune, C., and Osher, S. Optimal data collection for improved rankings expose well-connected graphs. submitted, 2012a.
Osting, B., Darbon, J., and Osher, S. Statistical ranking using the l1-norm on graphs. submitted, 2012b.
Pukelsheim, F. Optimal Design of Experiments. SIAM, 2006.
Quinn, G. P. and Keough, M. J. Experimental design and analysis for biologists. Cambridge University Press, Cambridge, UK, 2002.
Seeger, M. and Nickisch, H. Large scale Bayesian inference and experimental design for sparse linear models. SIAM Journal of Imaging Sciences, 4(1):166-199, 2011.
Sun, J., Boyd, S., Xiao, L., and Diaconis, P. The fastest mixing Markov process on a graph and a connection to a maximum variance unfolding problem. SIAM Review, 48:2006, 2004.
Traud, A. L., Frost, C., Mucha, P. J., and Porter, M. A. Visualization of communities in networks. Chaos, 19:041104, 2009.
Wang, H. and Mieghem, P. Van. Algebraic connectivity optimization via link addition. In Bionetics 2008, Hyogo, Japan, 2008.
Wu, Mike. wgPlot. http://www.mathworks.com/matlabcentral/fileexchange/24035, 2009.
Xu, Q., Yao, Y., Jiang, T., Huang, Q., Yan, B., and Lin, W. Random partial paired comparison for subjective video quality assessment via HodgeRank. In ACM Multimedia, 2011.
-----0
Banerjee, O., Ghaoui, L. E., and d'Aspremont, A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485-516, 2008.
Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):193-202, 2009.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1), 2010.
Combettes, P. L. and Pesquet, J. Proximal splitting methods in signal processing. In Bauschke, H. H. et al. (ed.), Fixed-Point Algorithms for Inverse Problems in Science and Engineering, chapter 10, pp. 185-212. Springer, New York, 2011.
Combettes, P. L. and Wajs, V. R. Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul., 4(4):1168-1200, 2005.
Daubechies, I., Defrise, M., and Mol, C. De. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413-1457, 2004.
Deng, W. and Yin, W. On the global and linear convergence of the generalized alternating direction method of multipliers. Technical Report TR12-14, Rice University CAAM Technical Report, 2012.
Duchi, J. and Singer, Y. Efficient online and batch learning using forward backward splitting. JMLR, 10:2899-2934, 2009.
Eckstein, J. and Bertsekas, D. P. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293-318, 1992.
Friedman, J. and Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2007.
Gabay, D. Applications of the method of multipliers to variational inequalities. In Fortin, M. and Glowinski, R. (eds.), Augmented Lagrangian Methods: Applications to the Solution of Boundary-Value Problems. North-Holland, Amsterdam, 1983.
Gabay, D. and Mercier, B. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1), 1976.
Glowinski, R. and Marroco, A. Sur l'approximation, par elements finis d'ordre un, et la resolution, par penalisation-dualite, d'une classe de problemes de Dirichlet non lineaires. Revue Francaise d'Automatique, Informatique, et Recherche Operationnelle, 9(2), 1975.
Glowinski, R. and Tallec, P. L. Augmented Lagrangian and Operator-Splitting Methods in Nonlinear Mechanics. Studies in Applied and Numerical Mathematics. SIAM, 1989.
Goldfarb, D., Ma, S., and Scheinberg, K. Fast alternating linearization methods for minimizing the sum of two convex functions, 2010. URL http://arxiv.org/abs/0912.4571.
Goldstein, T. and Osher, S. The split Bregman method for l1-regularized problems. SIAM J. Imaging Sci., 2(2):323-343, 2009.
He, B. and Yuan, X. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM J. Numer. Anal., 50(2):700-709, 2012a.
He, B. and Yuan, X. On non-ergodic convergence rate of Douglas-Rachford alternating direction method of multipliers. 2012b.
Hong, Mingyi and Luo, Zhi-Quan. On the linear convergence of the alternating direction method of multipliers. http://arxiv.org/abs/1208.3922, 2012.
Hu, Chonghai, Kwok, James T., and Pan, Weike. Accelerated gradient methods for stochastic optimization and online learning. In NIPS 22, 2009.
Kim, S., Sohn, K., and Xing, E. P. A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics, 25(12):204-212, 2009.
Kogan, S., Levin, D., Routledge, B. R., Sagi, J. S., and Smith, N. A. Predicting risk from financial reports with regression. In NAACL-HLT 2009, Boulder, CO, 2009.
Lan, G. An optimal method for stochastic composite optimization. Mathematical Programming, 2010. doi: 10.1007/s10107-010-0434-y.
Lan, G. and Ghadimi, S. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization, i: a generic algorithmic framework. SIAM J. on Optimization, 2011.
Langford, J., Li, L., and Zhang, T. Sparse online learning via truncated gradient. Journal of Machine Learning Research, pp. 777-801, 2009.
Monteiro, Renato D. C. and Svaiter, B. F. Iteration-complexity of block-decomposition algorithms and the alternating minimization augmented Lagrangian method. Technical report, Georgia Institute of Technology, 2010.
Nemirovski, A. and Yudin, D. Problem Complexity and Method Efficiency in Optimization. John Wiley and Sons, 1983.
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM J. on Optimization, 19(4):1574-1609, 2009.
Nesterov, Y. Introductory Lectures on Convex Optimization, A Basic Course. Kluwer Academic Publishers, 2004.
Nesterov, Y. Gradient methods for minimizing composite objective function. Technical Report CORE DISCUSSION PAPER 2007/76, 2007.
Suzuki, Taiji. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of ICML, 2013.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B, 67(1):91-108, 2004.
Tibshirani, R. J. and Taylor, J. The solution path of the generalized lasso. Annals of Statistics, 39(3):1335-1371, 2011.
Tseng, P. On accelerated proximal gradient methods for convex-concave optimization. SIAM J. Optim., 2008.
Vapnik, V. N. The nature of statistical learning theory. Springer-Verlag New York Incorporated, 2000.
Wang, H. and Banerjee, A. Online alternating direction method. In Proceedings of ICML, 2012.
Wright, S. J., Nowak, R. D., and Figueiredo, M. A. T. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479-2493, 2009.
Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. JMLR, 11:2543-2596, 2010.
Yang, J. and Yuan, X. Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization. Mathematics of Computation, 2012. doi: http://dx.doi.org/10.1090/S0025-5718-2012-02598-1.
Yang, J. and Zhang, Y. Alternating direction algorithms for l1-problems in compressive sensing. SIAM J. on Scientific Computing, 33(1):250-278, 2011.
Zhang, X., Burger, M., and Osher, S. A unified primal-dual algorithm framework based on Bregman iteration. J. of Scientific Computing, 46(1):20-46, 2011.
-----0
Amini, A.A. and Wainwright, M.J. High-dimensional analysis of semidefinite relaxations for sparse principal components. In Information Theory, 2008. ISIT 2008. IEEE International Symposium on, pp. 2454-2458. IEEE, 2008.
Asteris, M., Papailiopoulos, D.S., and Karystinos, G.N. Sparse principal component of a rank-deficient matrix. In Information Theory Proceedings (ISIT), 2011 IEEE International Symposium on, pp. 673-677. IEEE, 2011.
Cadima, J. and Jolliffe, I.T. Loading and correlations in the interpretation of principle components. Journal of Applied Statistics, 22(2):203-214, 1995.
d'Aspremont, A., El Ghaoui, L., Jordan, M.I., and Lanckriet, G.R.G. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434-448, 2007a.
d'Aspremont, A., Bach, F., and Ghaoui, L.E. Optimal solutions for sparse principal component analysis. The Journal of Machine Learning Research, 9:1269-1294, 2008.
d'Aspremont, A., Bach, F., and Ghaoui, L.E. Approximation bounds for sparse principal component analysis. arXiv preprint arXiv:1205.0121, 2012.
d'Aspremont, Alexandre, Bach, Francis R., and Ghaoui, Laurent El. Full regularization path for sparse principal component analysis. In Proceedings of the 24th international conference on Machine learning, ICML 07, pp. 177-184, 2007b.
Gawalt, B., Zhang, Y., and El Ghaoui, L. Sparse PCA for text corpus summarization and exploration. NIPS 2010 Workshop on Low-Rank Matrix Approximation, 2010.
Jolliffe, I.T. Rotation of principal components: choice of normalization constraints. Journal of Applied Statistics, 22(1):29-35, 1995.
Jolliffe, I.T., Trendafilov, N.T., and Uddin, M. A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics, 12(3):531-547, 2003.
Journee, M., Nesterov, Y., Richtarik, P., and Sepulchre, R. Generalized power method for sparse principal component analysis. The Journal of Machine Learning Research, 11:517-553, 2010.
Kaiser, H.F. The varimax criterion for analytic rotation in factor analysis. Psychometrika, 23(3):187-200, 1958.
Ma, Zongming. Sparse principal component analysis and iterative thresholding. arXiv preprint arXiv:1112.2432, 2011.
Mackey, L. Deflation methods for sparse PCA. Advances in neural information processing systems, 21:1017-1024, 2009.
Moghaddam, B., Weiss, Y., and Avidan, S. Generalized spectral bounds for sparse LDA. In Proceedings of the 23rd international conference on Machine learning, pp. 641-648. ACM, 2006a.
Moghaddam, B., Weiss, Y., and Avidan, S. Spectral bounds for sparse PCA: Exact and greedy algorithms. Advances in neural information processing systems, 18:915, 2006b.
Moghaddam, B., Weiss, Y., and Avidan, S. Fast pixel/part selection with sparse eigenvectors. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pp. 1-8. IEEE, 2007.
Papailiopoulos, D. S., Dimakis, A. G., and Korokythakis, S. Sparse pca through low-rank approximations. arXiv preprint arXiv:1303.0551, 2013.
Shen, H. and Huang, J.Z. Sparse principal component analysis via regularized low rank matrix approximation. Journal of multivariate analysis, 99(6):1015-1034, 2008.
Sriperumbudur, B.K., Torres, D.A., and Lanckriet, G.R.G. Sparse eigen methods by DC programming. In Proceedings of the 24th international conference on Machine learning, pp. 831-838. ACM, 2007.
Yuan, X.T. and Zhang, T. Truncated power method for sparse eigenvalue problems. arXiv preprint arXiv:1112.2679, 2011.
Zhang, Y. and El Ghaoui, L. Large-scale sparse principal component analysis with application to text data. Advances in Neural Information Processing Systems, 2011.
Zhang, Y., d'Aspremont, A., and Ghaoui, L.E. Sparse PCA: Convex relaxations, algorithms and applications. Handbook on Semidefinite, Conic and Polynomial Optimization, pp. 915-940, 2012.
Zou, H., Hastie, T., and Tibshirani, R. Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265-286, 2006.
-----0
Amazon mechanical turk. http://www.mturk.com.
Ashby, F.G. and Maddox, W.T. Human category learning. Annu. Rev. Psychol., 56:149178, 2005.
Ashby, F.G. and Maddox, W.T. Human category learning 2.0. Annals of the New York Academy of Sciences, 1224(1):147161, 2011.
Ashby, F.G., Maddox, W.T., and Bohil, C.J. Observational versus feedback training in rule-based and information-integration category learning. Memory & cognition, 30(5):666677, 2002.
Ben-David, S., Long, P., and Mansour, Y. Agnostic boosting. In CoLT, pp. 507516. Springer, 2001.
Bradley, J. K. and Schapire, R. Filterboost: Regression and classification on large datasets. NIPS, 20: 185192, 2008.
Collins, M., Schapire, R.E., and Singer, Y. Logistic regression, adaboost and bregman distances. Machine Learning, 48(1):253285, 2002.
Cosmides, Leda and Tooby, John. Are humans good intuitive statisticians after all? rethinking some conclusions from the literature on judgment under uncertainty, 1996.
Dawid, A.P. and Skene, A.M. Maximum likelihood estimation of observer error-rates using the em algorithm. Applied Statistics, pp. 2028, 1979.
Dietterich, T. G. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization.Machine learning, 40(2):139157, 2000.
Domingo, C. and Watanabe, O. Madaboost: A modification of adaboost. Citeseer, 2000.
Ell, S., Ing, A., and Maddox, W. Criterial noise effects on rule-based category learning: The impact of delayed feedback. Attention, Perception, & Psychophysics, 71:1263-1275, 2009. ISSN 1943-3921.
Fried, L.S. and Holyoak, K.J. Induction of category distributions: A framework for classification learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 10(2):234, 1984.
Friedman, J. H. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367-378, 2002.
Friedman, J. H., Hastie, T., and Tibshirani, R. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). Ann. Stats., 28(2):337-407, 2000.
Grubb, A. and Bagnell, J.A. Generalized boosting algorithms for convex optimization. arXiv preprint arXiv:1105.2054, 2011.
Hsu, A.S., Griffiths, T.L., et al. Effects of generative and discriminative learning on use of category variability. 2010.
Kearns, M. and Valiant, L. Cryptographic limitations on learning boolean formulae and finite automata. JACM, 41(1):67-95, 1989.
Mason, L., Baxter, J., Bartlett, P.L., and Frean, M. Functional gradient techniques for combining hypotheses. NIPS, pp. 221-246, 1999.
Meir, R. and Ratsch, G. An introduction to boosting and leveraging. Advanced lectures on machine learning, pp. 118-183, 2003.
Ott, M., Choi, Y., Cardie, C., and Hancock, J.T. Finding deceptive opinion spam by any stretch of the imagination. Arxiv preprint arXiv:1107.4557, 2011.
Quinn, A.J. and Bederson, B.B. Human computation: a survey and taxonomy of a growing field. In CHI, pp. 1403-1412. ACM, 2011.
Schapire, R.E. The strength of weak learnability. Machine learning, 5(2):197-227, 1990.
Schapire, R.E. The boosting approach to machine learning: An overview. Lecture Notes in Statistics, pp. 149-172, 2003.
Settles, B. Active learning literature survey. University of Wisconsin, Madison, 2010.
Sheng, V.S., Provost, F., and Ipeirotis, P.G. Get another label? improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 614-622. ACM, 2008.
Zadrozny, B., Langford, J., and Abe, N. Cost-sensitive learning by cost-proportionate example weighting. In Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, pp. 435-442. IEEE, 2003.
Zhu, X., Gibson, B.R., Jun, K.S., Rogers, T.T., Harrison, J., and Kalish, C. Cognitive models of test-item effects in human category learning. In ICML, 2010.
Zhu, X., Gibson, B.R., and Rogers, T.T. Co-training as a human collaboration policy. In AAAI, 2011.
-----0
Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron, A., Bouchard, N., and Bengio, Y. (2012). Theano: new features and speed improvements. Submitted to Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.
Bengio, Y., Frasconi, P., and Simard, P. (1993). The problem of learning long-term dependencies in recurrent networks. pages 1183-1195, San Francisco. IEEE Press. (invited paper).
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157-166.
Bengio, Y., Boulanger-Lewandowski, N., and Pascanu, R. (2012). Advances in optimizing recurrent networks. Technical Report arXiv:1212.0901, U. Montreal.
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.
Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. (2012). Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the Twenty-ninth International Conference on Machine Learning (ICML12). ACM.
Doya, K. (1993). Bifurcations of recurrent neural networks in gradient descent learning. IEEE Transactions on Neural Networks, 1, 75-80.
Doya, K. and Yoshizawa, S. (1991). Adaptive synchronization of neural and physical oscillators. In J. E. Moody, S. J. Hanson, and R. Lippmann, editors, NIPS, pages 109-116. Morgan Kaufmann.
Elman, J. (1990). Finding structure in time. Cognitive Science, 14(2), 179-211.
Graves, A., Liwicki, M., Fernandez, S., Bertolami, R., Bunke, H., and Schmidhuber, J. (2009). A Novel Connectionist System for Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 855-868.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
Jaeger, H. (2012). Long short-term memory in echo state networks: Details of a simulation study. Technical report, Jacobs University Bremen.
Jaeger, H. and Haas, H. (2004). Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless telecommunication. Science, 304(5667), 78-80.
Jaeger, H., Lukosevicius, M., Popovici, D., and Siewert, U. (2007). Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks, 20(3), 335-352.
Martens, J. and Sutskever, I. (2011). Learning recurrent neural networks with Hessian-free optimization. In Proc. ICML2011. ACM.
Mikolov, T. (2012). Statistical Language Models based on Neural Networks. Ph.D. thesis, Brno University of Technology.
Mikolov, T., Deoras, A., Kombrink, S., Burget, L., and Cernocky, J. (2011). Empirical evaluation and combination of advanced language modeling techniques. In Proc. 12th annual conference of the international speech communication association (INTERSPEECH 2011).
Mikolov, T., Sutskever, I., Deoras, A., Le, H.-S., Kombrink, S., and Cernocky, J. (2012). Subword language modeling with neural networks. preprint (http://www.fit.vutbr.cz/~imikolov/rnnlm/char.pdf).
Pascanu, R. and Jaeger, H. (2011). A neurodynamical model for working memory. Neural Netw., 24, 199-207.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.
Strogatz, S. (1994). Nonlinear Dynamics And Chaos: With Applications To Physics, Biology, Chemistry, And Engineering (Studies in Nonlinearity). Studies in nonlinearity. Perseus Books Group, 1 edition.
Sutskever, I. (2012). Training Recurrent Neural Networks. Ph.D. thesis, University of Toronto.
Sutskever, I., Martens, J., and Hinton, G. (2011). Generating text with recurrent neural networks. In L. Getoor and T. Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning (ICML-11), ICML 11, pages 1017-1024, New York, NY, USA. ACM.
Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4), 339-356.
-----0
Bishop, C. M. and Lasserre, J. Generative or discriminative? Getting the best of both worlds. Bayesian Statistics, 8:3-24, 2007.
Bouchard, G. and Triggs, B. The trade-off between generative and discriminative classifiers. In COMPSTAT, pp. 721-728, 2004.
Buntine, W. Theory refinement on bayesian networks. In Uncertainty in Artificial Intelligence (UAI), pp. 52-60, 1991.
Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM TIST, 2:27:1-27:27, 2011. URL http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265-292, 2001.
Fayyad, M. U. and Irani, B. K. Multi-interval discretization of continuous-valued attributes for classification learning. IJCAI, pp. 1022-1029, 2003.
Frank, A. and Asuncion, A. UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences, 2010. URL http://archive.ics.uci.edu/ml.
Friedman, N., Geiger, D., and Goldszmidt, M. Bayesian network classifiers. Machine Learning, (29):131-163, 1997.
Greiner, R., Su, X., Shen, B., and Zhou, W. Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers. Machine Learning, 59(3):297-322, June 2005.
Guo, Y., Wilkinson, D., and Schuurmans, D. Maximum margin Bayesian networks. In Uncertainty in Artificial Intelligence (UAI), pp. 233-242, 2005.
Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, August 2003.
Heckerman, D., Geiger, D., and Chickering, D. M. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243, 1995.
Jebara, T. Discriminative, Generative and Imitative learning. PhD thesis, MIT, 2001.
Lafferty, J., McCallum, A., and Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), pp. 282-289, 2001.
Lin, A. A class of methods for projection on a convex set. Advanced Modeling and Optimization (AMO), 5(3), 2003.
Ng, A. Y. and Jordan, M. I. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems (NIPS), 2001.
Pernkopf, F., Wohlmayr, M., and Tschiatschek, S. Maximum margin Bayesian network classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):521-531, 2012.
Raina, R., Shen, Y., Ng, A., and McCallum, A. Classification with hybrid generative/discriminative models. In Advances in Neural Information Processing Systems (NIPS), 2003.
Sha, F. Large margin training of acoustic models for speech recognition. PhD thesis, University of Pennsylvania, 2007.
Silander, T., Kontkanen, P., and Myllymaki, P. On sensitivity of the MAP bayesian network structure to the equivalent sample size parameter. In Proceedings of UAI, pp. 360-367, 2007.
Wettig, H., Grunwald, P., Roos, T., Myllymaki, P., and Tirri, H. When discriminative learning of Bayesian network parameters is easy. In International Joint Conferences on Artificial Intelligence (IJCAI), pp. 491-496, 2003.
Zhu, J., Rosset, S., Hastie, T., and Tibshirani, R. 1-norm support vector machines. Advances in Neural Information Processing Systems, 16:49-56, 2004.
Zou, H. and Yuan, M. The F∞-norm support vector machine. Statistica Sinica, 18:379-398, 2008.
-----0
Asuncion, A. and Newman, D.J. UCI machine learning repository, 2007.
Bernal, A., Crammer, K., and Pereira, F. Automated gene-model curation using global discriminative learning. Bioinformatics, 2012.
Chang, Chih-Chung and Lin, Chih-Jen. LIBSVM: A library for support vector machines. IST, 2011.
Chang, Y.W., Hsieh, C.J., Chang, K.W., Ringgaard, M., and Lin, C.J. Training and testing low-degree polynomial data mappings via linear SVM. JMLR, 2010.
Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2002.
Joachims, T. Training linear SVMs in linear time. In KDD, 2006.
Maji, S., Berg, A.C., and Malik, J. Efficient classification for additive kernel SVMs. PAMI, 2012.
Pele, O. Distance Functions: Theory, Algorithms and Applications. PhD thesis, The Hebrew University of Jerusalem, 2011.
Pele, O. and Werman, M. Fast and robust earth mover's distances. In ICCV, 2009.
Pele, O. and Werman, M. The quadratic-chi histogram distance family. In ECCV, 2010.
Perronnin, F., Sanchez, J., et al. Large-scale image categorization with explicit data embedding. In CVPR, 2010.
Rahimi, A. and Recht, B. Random features for large-scale kernel machines. NIPS, 2007.
Ruzon, M.A. and Tomasi, C. Edge, Junction, and Corner Detection Using Color Distributions. PAMI, 2001.
Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. Pegasos: primal estimated sub-gradient solver for SVM. MP, 2011.
Tsang, I.W., Kwok, J.T., and Cheung, P.M. Very large svm training using core vector machines. In Proc. AISTATS, pp. 349-356, 2005.
Vedaldi, A. and Zisserman, A. Efficient additive kernels via explicit feature maps. PAMI, 2012.
Wang, J.Z., Li, J., and Wiederhold, G. SIMPLIcity: Semantics-Sensitive Integrated Matching for Picture LIbraries. PAMI, 2001.
-----0
Azar, M. Gheshlaghi, Gomez, V., and Kappen, H. J. Dynamic policy programming. Journal of Machine Learning Research, 13(Nov):3207-3245, 2012.
Bertsekas, D.P. Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9(3):310-335, 2011.
Dutech, A., Edmunds, T., Kok, J., Lagoudakis, M., Littman, M., Riedmiller, M., Russell, B., Scherrer, B., Sutton, R., Timmer, S., et al. Reinforcement learning benchmarks and bake-offs ii. In Workshop at advances in neural information processing systems conference. Citeseer, 2005.
Gabillon, V., Lazaric, A., Ghavamzadeh, M., and Scherrer, B. Classification-based policy iteration with a critic. In Proceedings of ICML, pp. 1049-1056, 2011.
Haviv, Moshe and Heyden, Ludo Van Der. Perturbation bounds for the stationary probabilities of a finite markov chain. Advances in Applied Probability, 16(4):804-818, 1984. ISSN 00018678. URL http://www.jstor.org/stable/1427341.
Howard, R.A. Dynamic programming and Markov processes. 1960.
Kakade, S.M. A natural policy gradient. NIPS, 14:1531-1538, 2001.
Kakade, S.M. On the sample complexity of reinforcement learning. PhD thesis, University College London, 2003.
Kakade, S.M. and Langford, J. Approximately optimal approximate reinforcement learning. In Proceedings of ICML, pp. 267-274, 2002.
Koller, Daphne and Parr, Ronald. Policy Iteration for Factored MDPs. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pp. 326-334, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. ISBN 1-55860-709-9.
Lagoudakis, M.G. and Parr, R. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107-1149, 2003.
Lazaric, A., Ghavamzadeh, M., and Munos, R. Analysis of a classification-based policy iteration algorithm. In Proceedings of ICML, pp. 607-614, 2010.
Munos, R. Error bounds for approximate value iteration. In Proceedings of AAAI, volume 20, pp. 1006, 2005.
Perkins, T.J. and Precup, D. A convergent form of approximate policy iteration. NIPS, 15:1595-1602, 2002.
Peters, J., Vijayakumar, S., and Schaal, S. Natural actor-critic. In Proceedings of ECML, volume 3720, pp. 280-291. Springer, 2005.
Sutton, R.S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In NIPS, volume 12, pp. 1057-1063. MIT Press, 2000.
Wagner, P. A reinterpretation of the policy oscillation phenomenon in approximate policy iteration. In NIPS, 2011.
Ye, Y. The simplex and policy-iteration methods are strongly polynomial for the markov decision problem with a fixed discount rate. Mathematics of Operations Research, 36(4):593-603, 2011.
-----0
Heskes, T., Opper, M., Wiegerinck, W., Winther, O., and Zoeter, O. Approximate inference techniques with expectation constraints. Journal of Statistical Mechanics: Theory and Experiment, 11:11015, 2005.
Kschischang, F. R., Frey, B. J., and Loeliger, H.-A. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498-519, 1998.
Kuss, M. and Rasmussen, C. E. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6(10):1679-1704, 2005.
Minka, T. P. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001.
Minka, T. P. Power EP. Technical Report MSR-TR-2004-149, Microsoft Research, Cambridge, January 2004.
Minka, T. P. Divergence measures and message passing. Technical Report MSR-TR-2005-173, Microsoft Research, Cambridge, 2005.
Pearl, J. Reverend Bayes on inference engines: A distributed hierarchical approach. In Proceedings of the American Association of Artificial Intelligence National Conference on AI, pp. 133-136, Pittsburgh, PA, 1982.
Ratsch, G., Onoda, T., and Muller, K.-R. Soft margins for adaboost. Mach. Learn., 42:287-320, March 2001.
Wiegerinck, W. and Heskes, T. Fractional belief propagation. Advances in Neural Information Processing Systems, 12:438-445, 2003.
Yedidia, J. S., Freeman, W. T., and Weiss, Y. Understanding belief propagation and its generalizations. In Lakemeyer, Gerhard and Nebel, Bernhard (eds.), Exploring artificial intelligence in the new millennium, pp. 239-269. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.
Yuille, A. L. CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14:1691-1722, 2002.
Zhu, H. and Rohwer, R. Bayesian invariant measurements of generalisation for continuous distributions. Technical Report NCRG/4352, Aston University, Aston Triangle, 1995.
-----0
Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002a.
Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. The non-stochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002b.
Bartok, Gabor, Pal, David, and Szepesvari, Csaba. Toward a classification of finite partial-monitoring games. In ALT, pp. 224-238, 2010.
Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge University Press, 2006.
Chapelle, O. and Chang, Y. Yahoo! learning to rank challenge overview. JMLR Proceedings Track, 14:1-24, 2011.
Chapelle, O., Joachims, T., Radlinski, F., and Yue, Yisong. Large-scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems (TOIS), 30(1):6:1-6:41, 2012.
Chen, D. and Xiang, D. The consistency of multicategory support vector machines. Adv. Comput. Math, 24(1-4):155-169, 2006.
Chu, W. and Ghahramani, Z. Preference learning with gaussian processes. In ICML, 2005.
Collins, M. Discriminative training methods for hidden markov models: theory and experiments with perceptron algorithms. In EMNLP, 2002.
Crammer, K. and Singer, Y. Pranking with ranking. In NIPS, 2001.
Flaxman, A., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA, 2005.
Freund, Y., Iyer, R. D., Schapire, R. E., and Singer, Y. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933-969, 2003.
Herbrich, R., Graepel, T., and Obermayer, K. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers. MIT Press, 2000.
Joachims, T. Optimizing search engines using clickthrough data. In KDD, 2002.
Lee, Yoonkyung, Lin, Yi, and Wahba, Grace. Multicategory support vector machines. Journal of the American Statistical Association, 99(465):67-81, 2004.
Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. RCV1: A new benchmark collection for text categorization research. JMLR, 5:361-397, 2004.
Liu, T-Y. Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3, March 2009.
Manning, C., Raghavan, P., and Schutze, H. Introduction to Information Retrieval. Cambridge University Press, 2008.
Radlinski, F. and Joachims, T. Minimally invasive randomization for collecting unbiased preferences from clickthrough logs. In AAAI, pp. 1406-1412, 2006.
Raman, K., Shivaswamy, P., and Joachims, T. Online learning to diversify from implicit feedback. In KDD, 2012.
Shivaswamy, P. and Joachims, T. Online structured prediction via coactive learning. In ICML, 2012.
Yue, Y. and Joachims, T. Interactively optimizing information retrieval systems as a dueling bandits problem. In ICML, 2009.
Yue, Y., Broder, J., Kleinberg, R., and Joachims, T. The k-armed dueling bandits problem. In COLT, 2009.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.
-----0
Agarwal, A., Bartlett, P.L., Ravikumar, P., and Wainwright, M.J. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. Information Theory, IEEE Transactions on, 58(5):3235-3249, 2012.
Castro, R.M. and Nowak, R.D. Minimax bounds for active learning. In Proceedings of the 20th annual conference on learning theory, pp. 5-19. Springer-Verlag, 2007.
Hazan, E. and Kale, S. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. COLT, 2011.
Iouditski, A. and Nesterov, Y. Primal-dual subgradient methods for minimizing uniformly convex functions. Universite Joseph Fourier, Grenoble, France [Report], 2010.
Jamieson, K.G., Nowak, R.D., and Recht, B. Query complexity of derivative-free optimization. arXiv preprint arXiv:1209.2434, 2012.
Korostelev, A. P. and Tsybakov, A. B. Minimax Theory of Image Reconstruction, volume 82 of Lecture Notes in Statistics. Springer, NY, 1993.
Nemirovski, A.S. and Yudin, D.B. Problem complexity and method efficiency in optimization. John Wiley & Sons, 1983.
Raginsky, M. and Rakhlin, A. Information complexity of black-box convex optimization: A new look via feedback information theory. In Communication, Control, and Computing, 2009. Allerton 2009. 47th Annual Allerton Conference on, pp. 803510. IEEE, 2009.
Singh, A., Scott, C., and Nowak, R. Adaptive hausdorff estimation of density level sets. Annals of Statistics, 37(5B):2760-2782, 2009.
Sridharan, K. and Tewari, A. Convex games in banach spaces. In Proceedings of the 23rd Annual Conference on Learning Theory, 2010.
Tsybakov, A. B. On nonparametric estimation of density level sets. Annals of Statistics, 25(3):948-969, 1997.
Tsybakov, A.B. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, 2009. ISBN 9780387790510.
-----0
Amari, S. Natural gradient works efficiently in learning. Neural computation, 10(2):251-276, 1998.
Asuncion, A., Welling, M., Smyth, P., and Teh, Y. On smoothing and inference for topic models. In Uncertainty in Artificial Intelligence, 2009.
Bishop, C. Pattern Recognition and Machine Learning. Springer New York, 2006.
Blei, D., Ng, A., and Jordan, M. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.
Chien, Y. and Fu, K. On Bayesian learning and stochastic approximation. Systems Science and Cybernetics, IEEE Transactions on, 3(1):28-38, June 1967.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493-2537, 2011.
Gelman, A. and Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge Univ. Press, 2007.
George, A. and Powell, W. Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Machine learning, 65(1):167-198, 2006.
Gopalan, P., Mimno, D., Gerrish, S., Freedman, M., and Blei, D. Scalable inference of overlapping communities. In Neural Information Processing Systems, 2012.
Hjort, N., Holmes, C., Mueller, P., and Walker, S. Bayesian Nonparametrics: Principles and Practice. Cambridge University Press, Cambridge, UK, 2010.
Hoffman, M., Blei, D., and Bach, F. Online inference for latent Dirichlet allocation. In Neural Information Processing Systems, 2010.
Hoffman, M., Blei, D., Wang, C., and Paisley, J. Stochastic variational inference. Journal of Machine Learning Research, to appear.
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.
Paisley, J., Wang, C., and Blei, D. The discrete infinite logistic normal distribution. Bayesian Analysis, 7(2):235-272, 2012.
Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.
Salakhutdinov, R. and Mnih, A. Probabilistic matrix factorization. In Neural Information Processing Systems, 2008.
Schaul, T., Zhang, S., and LeCun, Y. No more pesky learning rates. ArXiv e-prints, June 2012.
Snoek, J., Larochelle, H., and Adams, R. Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems, 2012.
Wang, C., Paisley, J., and Blei, D. Online variational inference for the hierarchical Dirichlet process. In International Conference on Artificial Intelligence and Statistics, 2011.
-----0
Berg, Alexander C., Berg, Tamara L., III, Hal Daume, Dodge, Jesse, Goyal, Amit, Han, Xufeng, Mensch, Alyssa, Mitchell, Margaret, Sood, Aneesh, Stratos, Karl, and Yamaguchi, Kota. Understanding and predicting importance in images. In CVPR, pp. 3562-3569, 2012.
Dodge, Jesse, Goyal, Amit, Han, Xufeng, Mensch, Alyssa, Mitchell, Margaret, Stratos, Karl, Yamaguchi, Kota, Choi, Yejin, III, Hal Daume, Berg, Alexander C., and Berg, Tamara L. Detecting visual text. In HLT-NAACL, pp. 762-772, 2012.
Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., and Lin, C. J. LIBLINEAR: A library for large linear classification. JMLR, 2008.
Farhadi, Ali, Hejrati, Mohsen, Sadeghi, Mohammad Amin, Young, Peter, Rashtchian, Cyrus, Hockenmaier, Julia, and Forsyth, David. Every picture tells a story: generating sentences from images. In ECCV, pp. 15-29, Berlin, Heidelberg, 2010.
Gionis, A., Indyk, P., and Motwani, R. Similarity search in high dimensions via hashing. In VLDB, 1999a.
Gionis, Aristides, Indyk, Piotr, and Motwani, Rajeev. Similarity search in high dimensions via hashing. In VLDB, pp. 518-529, 1999b.
Gong, Y., Ke, Q., Isard, M., and Lazebnik, S. A Multi-View Embedding Space for Modeling Internet Images, Tags, and their Semantics. CoRR, abs/1212.4522, 2012.
Gong, Yunchao and Lazebnik, Svetlana. Iterative quantization: A procrustean approach to learning binary codes. In CVPR, 2011.
Hardoon, D. R., Szedmak, S., Szedmak, O., and Shawe-Taylor, J. Canonical correlation analysis; An overview with application to learning methods. Technical report, University of London, 2003.
Hwang, S. J. and Grauman, K. Accounting for the Relative Importance of Objects in Image Retrieval. In BMVC, 2010.
Hwang, S. J. and Grauman, K. Learning the Relative Importance of Objects from Tagged Images for Retrieval and Cross-Modal Search. IJCV, 100(2):134-153, 2012.
Kulis, Brian and Darrell, Trevor. Learning to hash with binary reconstructive embeddings. In NIPS, 2009.
Kulis, Brian and Grauman, Kristen. Kernelized locality-sensitive hashing for scalable image search. In ICCV, 2009.
Kulkarni, G., Premraj, V., Dhar, S., Li, Siming, Choi, Yejin, Berg, A.C., and Berg, T.L. Baby talk: Understanding and generating simple image descriptions. In CVPR, pp. 1601-1608, June 2011.
Kumar, Shaishav and Udupa, Raghavendra. Learning Hash Functions for Cross-View Similarity Search. In IJCAI, 2011.
Kuznetsova, Polina, Ordonez, Vicente, Berg, Alexander C., Berg, Tamara L., and Choi, Yejin. Collective generation of natural image descriptions. In ACL (1), pp. 359-368, 2012.
Li, Siming, Kulkarni, Girish, Berg, Tamara L., Berg, Alexander C., and Choi, Yejin. Composing simple image descriptions using web-scale n-grams. In CoNLL, pp. 220-228, 2011.
Lin, D. An Information-Theoretic Definition of Similarity. In ICML, pp. 296-304, 1998.
Liu, W., Wang, J., Ji, R., Jiang, Yu-Gang, and Chang, Shih-Fu. Supervised hashing with kernels. In CVPR, pp. 2074-2081, 2012.
Lyman, Peter, Varian, Hal R., Charles, Peter, Good, Nathan, Jordan, Laheem L., and Pal, Joyojeet. How much information? 2003, 2003. URL http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/.
Masci, J., Bronstein, M. M., Bronstein, A. A., and Schmidhuber, Jurgen. Multimodal similarity-preserving hashing. CoRR, abs/1207.1522, 2012.
Norouzi, Mohammad and Fleet, David. Minimal loss hashing for compact binary codes. In ICML, 2011.
Oliva, A. and Torralba, A. Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV, 2001.
Ordonez, Vicente, Kulkarni, Girish, and Berg, Tamara L. Im2text: Describing images using 1 million captioned photographs. In NIPS, pp. 1143-1151, 2011.
Patterson, G. and Hays, J. SUN Attribute Database: Discovering, Annotating, and Recognizing Scene Attributes. In CVPR, 2012.
Rashtchian, Cyrus, Young, Peter, Hodosh, Micah, and Hockenmaier, Julia. Collecting image annotations using Amazon's Mechanical Turk. In CSLDAMT, pp. 139-147, 2010.
Rastegari, Mohammad, Fang, Chen, and Torresani, Lorenzo. Scalable object-class retrieval with approximate and top-k ranking. In ICCV, pp. 2659-2666, 2011.
Rastegari, Mohammad, Farhadi, Ali, and Forsyth, David A. Attribute discovery via predictable discriminative binary codes. In ECCV (6), 2012.
Salakhutdinov, Ruslan and Hinton, Geoffrey. Learning a nonlinear embedding by preserving class neighbourhood structure. In AISTATS, 2007.
Salakhutdinov, Ruslan and Hinton, Geoffrey. Semantic hashing. Int. J. Approx. Reasoning, 2009.
Shakhnarovich, Gregory, Viola, Paul A., and Darrell, Trevor. Fast pose estimation with parameter-sensitive hashing. In ICCV, 2003.
Torralba, A., Fergus, R., and Weiss, Y. Small codes and large image databases for recognition. In CVPR, 2008.
Torresani, Lorenzo, Szummer, Martin, and Fitzgibbon, Andrew. Efficient object category recognition using classemes. In ECCV, 2010.
Weiss, Yair, Torralba, Antonio, and Fergus, Robert. Spectral hashing. In NIPS, pp. 1753-1760, 2008.
Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., and Torralba, A. SUN Database: Large-scale Scene Recognition from Abbey to Zoo. In CVPR, 2010.
Zhen, Y. and Yeung, Dit-Yan. Co-Regularized Hashing for Multimodal Data. In NIPS, 2012.
-----0
Aloise, Daniel and Hansen, Pierre. A branch-and-cut sdp-based algorithm for minimum sum-of-squares clustering. Pesquisa Operacional, 29:503-516, 2009.
Aloise, Daniel, Deshpande, Amit, Hansen, Pierre, and Popat, Preyas. NP-hardness of euclidean sum-of-squares clustering. Machine Learning, 75(2):245-248, 2009.
Bansal, Nikhil, Blum, Avrim, and Chawla, Shuchi. Correlation clustering. Machine Learning, 56(1-3):89-113, 2004.
Bolla, Marianna. Spectra, euclidean representations and clusterings of hypergraphs. Discrete Mathematics, 117:19-39, 1993.
Bollobas, Bela and Nikiforov, Vladimir. Graphs and hermitian matrices: eigenvalue interlacing. Discrete Mathematics, 289(1-3):119-127, 2004.
Bollobas, Bela and Nikiforov, Vladimir. Graphs and hermitian matrices: Exact interlacing. Discrete Mathematics, 308(20):4816-4821, 2008.
Chung, F. R. K. Spectral Graph Theory. American Mathematical Society, 1997.
Ding, Chris H. Q. and He, Xiaofeng. K-means clustering via principal component analysis. In Brodley, Carla E. (ed.), ICML, volume 69 of ACM International Conference Proceeding Series. ACM, 2004.
Drineas, Petros, Frieze, Alan M., Kannan, Ravi, Vempala, Santosh, and Vinay, V. Clustering large graphs via the singular value decomposition. Machine Learning, 56(1-3):9-33, 2004.
Figueiredo, Mario A. T. and Jain, Anil K. Unsupervised learning of finite mixture models. IEEE Trans. Pattern Anal. Mach. Intell., 24(3):381-396, 2002.
Haemers, W.H. Interlacing eigenvalues and graphs. Linear Algebra Applications, 226/228:593-616, 1995.
Jain, Anil K. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651-666, 2010.
Meila, Marina. Comparing clusterings: an axiomatic view. In Raedt, Luc De and Wrobel, Stefan (eds.), ICML, volume 119 of ACM International Conference Proceeding Series, pp. 577-584. ACM, 2005. ISBN 1-59593-180-5.
Meila, Marina. The uniqueness of a good optimum for k-means. In Cohen, William W. and Moore, Andrew (eds.), ICML, volume 148 of ACM International Conference Proceeding Series, pp. 625-632. ACM, 2006. ISBN 1-59593-383-2.
Nagai, Ayumu. Inappropriateness of the criterion of k-way normalized cuts for deciding the number of clusters. Pattern Recognition Letters, 28(15):1981-1986, 2007.
Shi, Jianbo and Malik, Jitendra. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888-905, 2000.
Steinley, D. and Brusco, M. J. Testing for validity and choosing the number of clusters in k-means clustering. Psychological Methods, 16:285-297, 2011.
von Luxburg, Ulrike. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.
Zha, Hongyuan, He, Xiaofeng, Ding, Chris H. Q., Gu, Ming, and Simon, Horst D. Spectral relaxation for k-means clustering. In Dietterich, Thomas G., Becker, Suzanna, and Ghahramani, Zoubin (eds.), NIPS, pp. 1057-1064. MIT Press, 2001.
-----0
Attias, H. A variational Bayesian framework for graphical models. Advances in Neural Information Processing Systems, 12:209-215, 2000.
Broderick, T., Jordan, M. I., and Pitman, J. Clusters and features from combinatorial stochastic processes. arXiv:1206.5862, 2012.
Buchbinder, N., Feldman, M., Naor, J., and Schwartz, R. A tight linear time (1/2)-approximation for unconstrained submodular maximization. In 53rd Annual Symposium on Foundations of Computer Science, pp. 649-658. IEEE, 2012.
Ding, N., Qi, Y.A., Xiang, R., Molloy, I., and Li, N. Nonparametric Bayesian matrix factorization by Power-EP. In 14th Intl Conf. on AISTATS, volume 9, pp. 169-176, 2010.
Doshi-Velez, F. and Ghahramani, Z. Accelerated sampling for the Indian buffet process. In Proceedings of the 26th Annual Intl Conference on Machine Learning, pp. 273-280, 2009.
Doshi-Velez, F., Miller, K. T., Van Gael, J., and Teh, Y. W. Variational inference for the Indian buffet process. In 13th Intl Conf. on AISTATS, pp. 137-144, 2009.
Feige, U., Mirrokni, V.S., and Vondrak, J. Maximizing non-monotone submodular functions. SIAM Journal on Computing, 40(4):1133-1153, 2011.
Fujishige, S. Submodular functions and optimization, volume 58. Elsevier Science Limited, 2005.
Ghahramani, Z. and Beal, M.J. Propagation algorithms for variational bayesian learning. In Advances in Neural Information Processing Systems, volume 13, 2001.
Griffiths, T. L. and Ghahramani, Z. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, volume 18, 2006.
Knowles, D. and Ghahramani, Z. Infinite sparse factor analysis and infinite independent components analysis. Independent Component Analysis and Signal Separation, pp. 381-388, 2007.
Kollar, T. and Roy, N. Utilizing object-object and object-scene context when planning to find things. In IEEE International Conference on Robotics and Automation, pp. 2168-2173, 2009.
Kurihara, K. and Welling, M. Bayesian k-means as a maximization-expectation algorithm. Neural Computation, 21(4):1145-1172, 2008.
Lee, K.C., Ho, J., and Kriegman, D. Acquiring linear subspaces for face recognition under variable lighting. IEEE Trans. Pattern Anal. Mach. Intelligence, 27(5):684-698, 2005.
Lovasz, L. Submodular functions and convexity. Mathematical programming: the state of the art, pp. 235-257, 1983.
Poliner, G.E. and Ellis, D.P.W. A discriminative model for polyphonic piano transcription. EURASIP Journal on Advances in Signal Processing, 2006.
Rai, P. and Daume III, H. Beam search based map estimates for the Indian buffet process. In Proceedings of the 28th Annual Intl Conference on Machine Learning, 2011.
Schmidt, M. N., Winther, O., and Hansen, L. K. Bayesian non-negative matrix factorization. In Intl Conference on Independent Component Analysis and Signal Separation, volume 5441 of Lecture Notes in Computer Science (LNCS), pp. 540-547. Springer, 2009.
-----0
Alon, N., Krivelevich, M., and Sudakov, B. Finding a large hidden clique in a random graph. Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms, 1998.
Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. Convex optimization with sparsity-inducing norms. In Sra, S., Nowozin, S., and Wright, S. J. (eds.), Optimization for Machine Learning, 2011.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3, 2011.
Candès, E. J., Li, X., Ma, Y., and Wright, J. Robust principal component analysis? Journal of ACM, 8:1-37, 2009.
Chambolle, A. and Pock, T. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 2011.
Chandrasekaran, V. and Jordan, M. I. Computational and statistical tradeoffs via convex relaxation. Preprint, 2012.
Chandrasekaran, V., Sanghavi, S., Parrilo, P.A., and Willsky, A.S. Rank-sparsity incoherence for matrix decomposition. SIAM J. Opt., 21:572-596, 2011.
Chandrasekaran, V., Recht, B., Parrilo, P.A., and Willsky, A.S. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12, 2012.
Dalalyan, A. S. and Chen, Y. Fused sparsity and robust estimation for linear models with unknown variance. In NIPS 2012, 2012.
d'Aspremont, A., El Ghaoui, L., Jordan, M.I., and Lanckriet, G.R.G. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434-448, 2007.
Doan, X.V. and Vavasis, S.A. Finding approximately rank-one submatrices with the nuclear norm and l1 norm. arXiv preprint arXiv:1011.1839, 2010.
Golbabaee, M. and Vandergheynst, P. Compressed sensing of simultaneous low-rank and joint-sparse matrices. Submitted to IEEE Transactions on Information Theory, 2012.
Grave, E., Obozinski, G., and Bach, F. Trace lasso: a trace norm regularization for correlated designs. In Advances in Neural Information Processing Systems 24, pp. 2187-2195, 2011.
Hosseini Kamal, M. and Vandergheynst, P. Joint low-rank and sparse light field modeling for dense multiview data compression. ICASSP, 2013.
Jaggi, M. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th Annual International Conference on Machine Learning, 2013.
Koltchinskii, V., Lounici, K., and Tsybakov, A. Nuclear norm penalization and optimal rates for noisy matrix completion. Annals of Statistics, 2011.
Oymak, S., Jalali, A., Fazel, M., Eldar, Y., and Hassibi, B. Simultaneously structured models with application to sparse and low-rank matrices. submitted, 2012. URL http://arxiv.org/pdf/1212.3753v1.pdf.
Richard, E., Savalle, P.-A., and Vayatis, N. Estimation of simultaneously sparse and low-rank matrices. In Proceedings of the 29th Annual International Conference on Machine Learning, 2012.
Tropp, J. A. User-friendly tail bounds for sums of random matrices. ArXiv e-prints, April 2010.
Vaiter, S., Peyre, G., Dossal, C., and Fadili, J. Robust sparse analysis regularization. to appear in IEEE Transactions on Information Theory, 2012.
Vert, J.-P. and Bleakley, K. Fast detection of multiple change-points shared by many signals using group LARS. Advances in Neural Information Processing Systems 23 (NIPS), pp. 2343-2351, 2010.
-----0
Ando, R.K. and Zhang, T. A framework for learning predictive structures from multiple tasks and unlabeled data. J. Machine Learning Research, 6:1817-1853, 2005.
Argyriou, A., Evgeniou, T., and Pontil, M. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.
Argyriou, A., Maurer, A., and Pontil, M. An algorithm for transfer learning in a heterogeneous environment. Proc. European Conf. Machine Learning, pages 71-85, 2008.
Bader, B. W. and Kolda, T. G. Algorithm 862: MATLAB tensor classes for fast algorithm prototyping. ACM Transactions on Mathematical Software, 32(4):635-653, 2006.
Baxter, J. A model for inductive bias learning. J. of Artificial Intelligence Research, 12:149-198, 2000.
Bertsekas, D.P. and Tsitsiklis, J.N. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989.
Caruana, R. Multi-task learning. Machine Learning, 28:41-75, 1997.
Ekman, P. and Friesen, W. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, 1978.
Gandy, S., Recht, B., and Yamada, I. Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Problems, 27, 2011.
Kolda, T. G. and Bader, B. W. Tensor decompositions and applications. SIAM Review, 51(3):455-500, 2009.
Kumar, A. and Daumé III, H. Learning task grouping and overlap in multitask learning. International Conference on Machine Learning (ICML), 2012.
Liu, J., Musialski, P., Wonka, P., and Ye, J. Tensor completion for estimating missing values in visual data. Proc. 12th International Conference on Computer Vision (ICCV), pages 2114-2121, 2009.
Lucey, P., Cohn, J.F., Prkachin, K.M., Solomon, P.E., and Matthews, I. PAINFUL DATA: The UNBC-McMaster Shoulder Pain Expression Archive Database. IEEE Facial and Gesture (FG), pages 57-64, 2011.
van der Maaten, L. Audio-Visual Emotion Challenge 2012: A Simple Approach. Workshop ICMI '12, 2012.
Maurer, A. Transfer bounds for linear feature learning. Machine Learning, 75(3):327-350, 2009.
Maurer, A. and Pontil, M. Excess risk bounds for multitask learning with trace norm regularization. arXiv:1212.1496, 2012.
Maurer, A., Pontil, M., and Romera-Paredes, B. Sparse coding for multitask and transfer learning. International Conference on Machine Learning (ICML), 2013.
Mpiperis, I., Malassiotis, S., and Strintzis, M.G. Bilinear elastically deformable models with application to 3D face and facial expression recognition. Proc. 8th International Conference on Automatic Face and Gesture Recognition, pages 1-8, 2008.
Romera-Paredes, B., Argyriou, A., Bianchi-Berthouze, N., and Pontil, M. Exploiting unrelated tasks in multitask learning. JMLR Proceedings Track, 22:951-959, 2012.
Signoretto, M., De Lathauwer, L., and Suykens, J.A.K. Nuclear norms for tensors and their use for convex multilinear estimation. Technical Report, 2012.
Signoretto, M., Tran Dinh, Q., De Lathauwer, L., Suykens, J.A.K. Learning with tensors: a framework based on convex optimization and spectral regularization. Machine Learning, to appear.
Signoretto, M., Van de Plas, R., De Moor, B., and Suykens, J.A.K. Tensor versus matrix completion: a comparison with application to spectral data. IEEE Signal Processing Letters, 18(7):403-406, 2011.
Tenenbaum, J.B. and Freeman, W.T. Separating style and content with bilinear models. Neural Computation, 12(6):1247-1283, 2000.
Tomioka, R., Hayashi, K., Kashima, H., Presto, J.S.T. Estimation of Low-Rank Tensors via Convex Optimization. 2010.
Vargas-Govea, B., González-Serna, G., and Ponce-Medellín, R. Effects of relevant contextual features in the performance of a restaurant recommender system. RecSys '11: Workshop on Context Aware Recommender Systems (CARS-2011), 2011.
Vasilescu, M. A. O. and Terzopoulos, D. Multilinear image analysis for facial recognition. Proc. 16th International Conference on Pattern Recognition (ICPR), pages 511-514, 2002.
Vasilescu, M. A. O. and Terzopoulos, D. Multilinear independent components analysis. Proc. 2005 Conference on Computer Vision and Pattern Recognition (CVPR), pages 547-553, 2005.
-----0
Antoniak, Charles E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, pp. 1152-1174, 1974.
Barker, Bethan L and Brightling, Christopher E. Phenotyping the heterogeneity of chronic obstructive pulmonary disease. Clinical Science, 124(6):371-387, 2013.
Basu, S., Bilenko, M., Banerjee, A., and Mooney, R.J. Probabilistic semi-supervised clustering with constraints. Semi-supervised learning, pp. 71-98, 2006.
Bell, Benjamin et al. The normative aging study: an interdisciplinary and longitudinal study of health and aging. The International Journal of Aging and Human Development, 3(1):5-17, 1972.
Besag, Julian. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological), pp. 192-236, 1974.
Blei, David M and Jordan, Michael I. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121-143, 2006.
Cho, Michael H et al. Variants in FAM13A are associated with chronic obstructive pulmonary disease. Nature Genetics, 42(3):200-202, 2010.
Cho, Michael H et al. A genome-wide association study of COPD identifies a susceptibility locus on chromosome 19q13. Human Molecular Genetics, 21(4):947-957, 2012.
Ferguson, Thomas S. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, pp. 209-230, 1973.
Geman, Stuart and Geman, Donald. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, (6):721-741, 1984.
Grun, Bettina and Leisch, Friedrich. Fitting finite mixtures of generalized linear regressions in R. Computational Statistics & Data Analysis, 51(11):5247-5252, 2007.
Hankinson, John L et al. Spirometric reference values from a sample of the general US population. American Journal of Respiratory and Critical Care Medicine, 159(1):179-187, 1999.
Jaakkola, Tommi S. Tutorial on variational approximation methods. Advanced mean field methods: theory and practice, pp. 129, 2001.
Jain, Anil K, Murty, M Narasimha, and Flynn, Patrick J. Data clustering: a review. ACM Computing Surveys (CSUR), 31(3):264-323, 1999.
Jordan, M.I., Ghahramani, Z., Jaakkola, T.S., and Saul, L.K. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.
Lazaro-Gredilla, Miguel, Vaerenbergh, Steven Van, and Lawrence, Neil D. Overlapping mixtures of Gaussian processes for the data association problem. Pattern Recognition, 45(4):1386-1395, 2012.
Li, Lei and Prakash, B Aditya. Time series clustering: Complex is simpler! In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 185-192, 2011.
Meeds, Edward and Osindero, Simon. An alternative infinite mixture of gaussian process experts. Advances in Neural Information Processing Systems, 18:883, 2006.
Orbanz, P. and Buhmann, J.M. Nonparametric Bayesian image segmentation. International Journal of Computer Vision, 77(1):25-45, 2008.
Rasmussen, C.E. and Ghahramani, Z. Infinite mixtures of Gaussian process experts. Advances in Neural Information Processing Systems, 2:881-888, 2002.
Rasmussen, C.E. and Williams, C.K.I. Gaussian processes for machine learning, volume 1. MIT press Cambridge, MA, 2006.
Sethuraman, Jayaram. A constructive definition of dirichlet priors. Technical report, DTIC Document, 1991.
Silverman, Edwin K and Sandhaus, Robert A. Alpha1-antitrypsin deficiency. New England Journal of Medicine, 360(26):2749-2757, 2009.
Strehl, Alexander and Ghosh, Joydeep. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583-617, 2003.
Wagstaff, Kiri and Cardie, Claire. Clustering with instance-level constraints. In Proceedings of the Seventeenth International Conference on Machine Learning, pp. 1103-1110, 2000.
Yuan, Chao and Neubauer, Claus. Variational mixture of Gaussian process experts. Advances in Neural Information Processing Systems, 21:1897-1904, 2009.
Zhu, Xiaojin. Semi-supervised learning literature survey. 2008.
-----0
Auer, Peter, Cesa-Bianchi, Nicolo, Freund, Yoav, and Schapire, Robert. The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2003.
Beygelzimer, Alina, Dani, Varsha, Hayes, Thomas, Langford, John, and Zadrozny, Bianca. Error limiting reductions between classification tasks. In ICML. ACM, 2005.
Dang, Hoa Trang. Overview of duc 2005. In DUC, 2005.
Dey, Debadeepta, Liu, Tian Yu, Hebert, Martial, and Bagnell, J. Andrew (Drew). Contextual sequence optimization with application to control library optimization. In RSS, 2012.
Feige, U., Mirrokni, V. S., and Vondrak, J. Maximizing non-monotone submodular functions. SIAM Journal on Computing, 40(4):1133-1153, 2011.
Guestrin, Carlos and Krause, Andreas. Beyond convexity: Submodularity in machine learning. URL www.submodularity.org.
Guzman-Rivera, Abner, Batra, Dhruv, and Kohli, Pushmeet. Multiple choice learning: Learning to produce multiple structured outputs. In NIPS, 2012.
Joachims, Thorsten. A support vector method for multivariate performance measures. In ICML. ACM, 2005.
Kalai, Adam and Vempala, Santosh. Efficient algorithms for online decision problems. JCSS, 71(3):291-307, 2005.
Kulesza, Alex and Taskar, Ben. Learning determinantal point processes. In UAI, 2011.
Lin, Chin-Yew. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out: ACL-04 Workshop, 2004.
Lin, Hui and Bilmes, Jeff. Multi-document summarization via budgeted maximization of submodular functions. In Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2010.
Lin, Hui and Bilmes, Jeff. A class of submodular functions for document summarization. In ACL-HLT, 2011.
Lin, Hui and Bilmes, Jeff. Learning mixtures of submodular shells with application to document summarization. In UAI, 2012.
Littlestone, Nick and Warmuth, Manfred. The Weighted Majority Algorithm. Information and Computation, 1994.
Radlinski, Filip, Kleinberg, Robert, and Joachims, Thorsten. Learning diverse rankings with multiarmed bandits. In ICML, 2008.
Raman, Karthik, Shivaswamy, Pannaga, and Joachims, Thorsten. Online learning to diversify from implicit feedback. In KDD, 2012.
Ratliff, Nathan, Zucker, Matt, Bagnell, J. Andrew, and Srinivasa, Siddhartha. CHOMP: Gradient optimization techniques for efficient motion planning. In ICRA, May 2009.
Ross, Stephane and Bagnell, J. Andrew. Agnostic system identification for model-based reinforcement learning. In ICML, 2012.
Ross, Stephane, Gordon, Geoff, and Bagnell, J. Andrew. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011a.
Ross, Stephane, Munoz, Daniel, Bagnell, J. Andrew, and Hebert, Martial. Learning message-passing inference machines for structured prediction. In CVPR, 2011b.
Streeter, M. and Golovin, D. An online algorithm for maximizing submodular functions. In NIPS, 2008.
Streeter, Matthew, Golovin, Daniel, and Krause, Andreas. Online learning of assignments. In NIPS, 2009.
Yue, Yisong and Guestrin, Carlos. Linear submodular bandits and their application to diversified retrieval. In NIPS, 2011.
Yue, Yisong and Joachims, Thorsten. Predicting diverse subsets using structural svms. In ICML, 2008.
-----0
Bonnans, J.F. and Shapiro, A. Optimization problems with perturbations: A guided tour. SIAM Review, 40(2):228-264, 1998.
Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka Jr., E.R., and Mitchell, T.M. Toward an architecture for never-ending language learning. In Proceedings of the 24th Conference on Artificial Intelligence. AAAI Press, 2010.
Danskin, J.M. The theory of max-min and its application to weapons allocation problems, volume 5 of Econometrics and Operations Research. Springer-Verlag, 1967.
Fisk, D.L. Quasi-martingales. Transactions of the American Mathematical Society, 120(3):369-389, 1965.
Kumar, A. and Daume III, H. Learning task grouping and overlap in multi-task learning. In Proceedings of the 29th International Conference on Machine Learning, pp. 1383-1390. Omnipress, 2012.
Mairal, J., Bach, F., Ponce, J., and Sapiro, G. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 689-696. ACM, 2009.
Rai, P. and Daume III, H. Infinite predictor subspace models for multitask learning. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pp. 613-620, 2010.
Ring, M.B. CHILD: A first step towards continual learning. Machine Learning, 28(1):77-104, 1997.
Saha, A., Rai, P., Daume III, H., and Venkatasubramanian, S. Online learning of multiple tasks and their relationships. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 643-651, 2011.
Simm, J., Sugiyama, M., and Kato, T. Computationally efficient multi-task learning with least-squares probabilistic classifiers. IPSJ Transactions on Computer Vision and Applications, 3:1-8, 2011.
Sutton, R., Koop, A., and Silver, D. On the role of tracking in stationary environments. In Proceedings of the 24th International Conference on Machine Learning, pp. 871-878, Corvallis, OR, 2007. ACM.
Thrun, S. Explanation-Based Neural Network Learning: A Lifelong Learning Approach. Kluwer Academic Publishers, Boston, MA, 1996.
Thrun, S. and O'Sullivan, J. Discovering structure in multiple learning tasks: the TC algorithm. In Proceedings of the 13th International Conference on Machine Learning, pp. 489-497. Morgan Kaufmann, 1996.
Valstar, M.F., Jiang, B., Mehu, M., Pantic, M., and Scherer, K. The first facial expression recognition and analysis challenge. In Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition, pp. 921-926. IEEE, 2011.
Van der Vaart, A.W. Asymptotic statistics, volume 3. Cambridge University Press, 2000.
Xue, Y., Liao, X., Carin, L., and Krishnapuram, B. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research, 8:35-63, 2007.
Yu, K.B. Recursive updating the eigenvalue decomposition of a covariance matrix. IEEE Transactions on Signal Processing, 39(5):1136-1145, 1991.
Zhang, J., Ghahramani, Z., and Yang, Y. Flexible latent variable models for multi-task learning. Machine Learning, 73(3):221-242, 2008.
-----0
Ben-Israel, A. and Greville, T.N.E. Generalized inverses: theory and applications, volume 15. Springer, 2003.
Cesa-Bianchi, N., Shalev-Shwartz, S., and Shamir, O. Efficient learning with partially observed attributes. J. Mach. Learn. Res., 12:2857-2878, November 2011.
Cheng, C. and Van Ness, J.W. Statistical regression with measurement error, volume 6. Arnold London, 1999.
Cockerham, R. The photographic height and weight chart. http://www.cockeyed.com/photos/bodies/heightweight.html, 2013. With permission of Rob Cockerham.
Cronbach, L. Coefficient alpha and the internal structure of tests. Psychometrika, 16(3):297-334, 1951.
Dawid, A. P. and Skene, A. M. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):20-28, 1979.
Frank, A. and Asuncion, A. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml.
Higham, N. J. Computing a nearest symmetric positive semidefinite matrix. Linear Algebra and its Applications, 103:103-118, 1988.
Hoerl, A. E. and Kennard, R. W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55-67, 1970.
Isola, P., Parikh, D., Torralba, A., and Oliva, A. Understanding the intrinsic memorability of images. In Advances in Neural Information Processing Systems 24, pp. 2429-2437, 2011.
Krippendorff, K. Content analysis: An introduction to its methodology. Sage Publications, Incorporated, 2012.
Miller, A. Subset selection in regression. Chapman & Hall/CRC, 2002.
Natarajan, B.K. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227-234, 1995.
Patterson, G. and Hays, J. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In Proceedings of the 25th Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
Sabato, S. and Kalai, A. Feature multi-selection among subjective features. CoRR, abs/1302.4297, 2013. URL http://arxiv.org/abs/1302.4297.
Smyth, P., Fayyad, U. M., Burl, M. C., Perona, P., and Baldi, P. Inferring ground truth from subjective labelling of Venus images. In NIPS, pp. 1085-1092, 1994.
Srebro, N., Sridharan, K., and Tewari, A. Smoothness, low-noise and fast rates. In Advances in Neural Information Processing Systems (NIPS) 23, 2010.
Welinder, P., Branson, S., Belongie, S., and Perona, P. The multidimensional wisdom of crowds. In Neural Information Processing Systems Conference (NIPS), 2010.
-----0
Ahs, F., Davis, F.C., Gorka, A.X., and Hariri, A.R. Feature-based representations of emotional facial expressions in the human amygdala. Social, Cognitive, and Affective Neuroscience, 2013.
Albert, J. H. and Chib, S. Bayesian methods for cumulative, sequential and two-step ordinal data regression models. Technical report, Department of Mathematics and Statistics, Bowling Green State University, Ohio, 1997.
Albert, J. H. and Chib, S. Sequential ordinal modeling with applications to survival data. Biometrics, 57:829-836, 2001.
Bhattacharya, A. and Dunson, D. B. Sparse Bayesian infinite factor models. Biometrika, 98(2):291-306, 2011.
Candes, E. J. and Recht, B. Exact matrix completion via convex optimization. Found. of Comput. Math., 9:717-772, 2008.
Eades, P. A heuristic for graph drawing. Congressus Numerantium, 42:149-160, 1984.
Friedman, J., Hastie, T., and Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9:432-441, 2008.
Fruchterman, T. and Reingold, E. Graph drawing by force-directed placement. Software: Practice and Experience, 21(11):1129-1164, 1991.
Fyshe, A., Fox, E.B., Dunson, D.B., and Mitchell, T.M. Hierarchical latent dictionaries for models of brain activation. In AISTATS, 2012.
Gerrish, S. and Blei, D.M. Predicting legislative roll calls from text. In ICML, 2011.
Hahn, P. R., Carvalho, C. M., and Scott, J. G. A sparse factor-analytic probit model for congressional voting patterns. J. Royal Stat. Soc., C, 61(4):619-635, 2012.
Hariri, A.R. The neurobiology of individual differences in complex behavioral traits. Annual Review of Neuroscience, 32:225-247, 2009.
Hoff, P.D. Simulation of the matrix Bingham-von Mises-Fisher distribution, with applications to multivariate and relational data. J. Comp. Graphical Stat., 18(2):438-456, 2009.
Kamada, T. and Kawai, S. An algorithm for drawing general undirected graphs. Information Processing Letters, 31:7-15, 1989.
Kriegel, H.-P., Kroger, P., and Zimek, A. Clustering high dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans. Knowledge Discovery from Data, 1:1-58, 2009.
Manuck, S.B., Brown, S.M., Forbes, E.E., and Hariri, A.R. Temporal stability of individual differences in amygdala reactivity. Am. J. Psychiatry, 164:1613-1614, 2007.
Meeds, E., Ghahramani, Z., Neal, R., and Roweis, S. Modeling dyadic data with binary latent factors. In NIPS, 2007.
Mitchell, T. M., Shinkareva, S. V., Carlson, A., Chang, K.M., Malave, V. L., Mason, R. A., and Just, M. A. Predicting human brain activity associated with the meanings of nouns. Science, 2008.
Salazar, E., Cain, M. S., Darling, E. F., Mitroff, S. R., and Carin, L. Inferring latent structure from mixed real and categorical relational data. In ICML, 2012.
Williamson, S., Wang, C., Heller, K. A., and Blei, D. M. The IBP compound Dirichlet process and its application to focused topic modeling. In ICML, 2010.
Xu, Z., Yan, F., and Qi, Y. Infinite Tucker decomposition: nonparametric Bayesian models for multiway data analysis. In ICML, 2012.
Yoshida, R. and West, M. Bayesian learning in sparse graphical factor models via variational mean-field annealing. JMLR, 11:1771-1798, 2010.
Zhang, X. and Carin, L. Joint modeling of a matrix with associated text via latent binary features. In NIPS, 2012.
-----0
Almeida, LB and Langlois, T. Parameter adaptation in stochastic optimization. On-line learning in neural networks, 1999.
Amari, Shun-ichi, Park, Hyeyoung, and Fukumizu, Kenji. Adaptive Method of Realizing Natural Gradient Learning for Multilayer Perceptrons. Neural Computation, 12(6):1399-1409, June 2000. ISSN 0899-7667.
Bach, F. and Moulines, E. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems (NIPS), 2011.
Bordes, Antoine, Bottou, Leon, and Gallinari, Patrick. SGD-QN: Careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research, 10:1737-1754, July 2009.
Bottou, Leon. Online algorithms and stochastic approximations. In Saad, David (ed.), Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998.
Bottou, Leon and Bousquet, Olivier. The tradeoffs of large scale learning. In Sra, Suvrit, Nowozin, Sebastian, and Wright, Stephen J. (eds.), Optimization for Machine Learning, pp. 351-368. MIT Press, 2011.
Bottou, Leon and LeCun, Yann. Large scale online learning. In Thrun, Sebastian, Saul, Lawrence, and Scholkopf, Bernhard (eds.), Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.
Chapelle, Olivier and Erhan, Dumitru. Improved preconditioner for hessian free optimization. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
Duchi, John C., Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. 2010.
George, Abraham P. and Powell, Warren B. Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Machine Learning, 65(1):167-198, May 2006. ISSN 0885-6125.
Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10).
Krizhevsky, Alex. Learning multiple layers of features from tiny images. Technical report, Department of Computer Science, University of Toronto, 2009.
Le Roux, N., Manzagol, P.A., and Bengio, Y. Topmoumoute online natural gradient algorithm, 2008.
Le Roux, Nicolas and Fitzgibbon, Andrew. A fast natural newton method. In Proceedings of the 27th International Conference on Machine Learning. Citeseer, 2010.
LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. In Orr, G. and Muller, K. (eds.), Neural Networks: Tricks of the Trade. Springer, 1998.
LeCun, Yann and Cortes, Corinna. The MNIST dataset of handwritten digits. 1998. URL http://yann.lecun.com/exdb/mnist/.
Martens, J, Sutskever, I, and Swersky, K. Estimating the Hessian by Back-propagating Curvature. arXiv preprint arXiv:1206.6464, 2012.
Robbins, H. and Monro, S. A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407, 1951.
Schaul, Tom and LeCun, Yann. Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients. In Proceedings of the International Conference on Learning Representations (ICLR), Scottsdale, AZ, 2013.
Schraudolph, Nicol N. Local gain adaptation in stochastic gradient descent. In Artificial Neural Networks, 1999. ICANN 99. Ninth International Conference on (Conf. Publ. No. 470), volume 2, pp. 569-574. IET, 1999.
Schraudolph, Nicol N. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723-1738, 2002.
Xu, Wei. Towards optimal one pass large scale learning with averaged stochastic gradient descent. ArXiv CoRR, abs/1107.2490, 2011.
-----0
Ball, P. The Music Instinct: How music works and why we can't do without it. Random House, 2010.
Bertin-Mahieux, T., Ellis, D.P.W., Whitman, B., and Lamere, P. The million song dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR), 2011.
Blei, D.M. and Lafferty, J.D. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pp. 113-120. ACM, 2006.
Blei, D.M., Ng, A.Y., and Jordan, M.I. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993-1022, 2003.
Bloom, H. The anxiety of influence: A theory of poetry. Oxford University Press, USA, 1997.
Bryan, N. J. and Wang, G. Musical influence network analysis and rank of sampled-based music. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR), 2011.
Collins, N. Computational analysis of musical influence: A musicological case study using mir tools. In Proceedings of the International Symposium on Music Information Retrieval (ISMIR), 2010.
Collins, N. Influence in early electronic dance music: an audio content analysis investigation. In Proceedings of the International Symposium on Music Information Retrieval (ISMIR), 2012.
Fabbri, F. A theory of musical genres: Two applications. Popular Music Perspectives, 1:52-81, 1982.
Gerrish, S. and Blei, D.M. A language-based approach to measuring scholarly impact. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2010.
Hall, D., Jurafsky, D., and Manning, C. D. Studying the history of ideas using topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 363-371, 2008.
Holt, F. Genre in popular music. University of Chicago Press, 2007.
Jehan, T. The Echo Nest Analyze documentation. Technical report, The Echo Nest, 2010.
Lethem, J. The Ecstasy of Influence. Harper's Magazine, pp. 59-71, 2007.
Logan, B. Content-based playlist generation: Exploratory experiments. In Proceedings of the International Symposium on Music Information Retrieval (ISMIR), 2002.
Logan, B. Music recommendation from song sets. In Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR 2004), pp. 425-428, 2004.
Lu, L., You, H., Zhang, H.J., et al. A new approach to query by humming in music retrieval. In Proceedings of the IEEE International Conference on Multimedia and Expo, 2001.
McFee, B. and Lanckriet, G. R. G. The natural language of playlists. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR), October 2011.
McFee, B., Barrington, L., and Lanckriet, G.R.G. Learning content similarity for music recommendation. IEEE Transactions on Audio, Speech, and Language Processing, 20(8):2207-2218, October 2012.
Mermelstein, P. Distance measures for speech recognition, psychological and instrumental. Pattern Recognition and Artificial Intelligence, 116:91-103, 1976.
Noyes, E., Allen, I.E., and Parise, S. Artistic influences and innovation in the popular music industry. Frontiers of Entrepreneurship Research, 30(15):3, 2010.
Reynolds, Simon. Retromania: Pop Culture's Addiction to Its Own Past. Faber & Faber, 2011.
Scaringella, N., Zoia, G., and Mlynek, D. Automatic genre classification of music content: a survey. Signal Processing Magazine, IEEE, 23(2):133-141, 2006.
Serrà, J., Corral, A., Boguna, M., Haro, M., and Arcos, J.L. Measuring the evolution of contemporary western popular music. Scientific Reports, 2, 2012.
Su, J.H., Yeh, H.H., Yu, P.S., and Tseng, V.S. Music recommendation using content and context information mining. Intelligent Systems, IEEE, 25(1):16-26, 2010.
Turnbull, D.R., Barrington, L., Lanckriet, G., and Yazdani, M. Combining audio content and social context for semantic music discovery. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 387-394. ACM, 2009.
-----0
Agarwal, A., Bartlett, P., Ravikumar, P., and Wainwright, M. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235-3249, 2012.
Bach, F. and Moulines, E. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In NIPS, 2011.
Hazan, E. and Kale, S. Beyond the regret minimization barrier: An optimal algorithm for stochastic strongly-convex optimization. In COLT, 2011.
Hazan, E., Agarwal, A., and Kale, S. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.
Kushner, H. and Yin, G. Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2nd edition, 2003.
Lacoste-Julien, S., Schmidt, M., and Bach, F. A simpler approach to obtaining an O(1/t) convergence rate for projected stochastic subgradient descent. CoRR, abs/1212.2002, 2012.
Ouyang, H. and Gray, A. Stochastic smoothing for nonsmooth minimizations: Accelerating sgd by exploiting structure. In ICML, 2012.
Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. CoRR, abs/1109.5647, 2011.
Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Stochastic convex optimization. In COLT, 2009.
Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3-30, 2011.
Shamir, O. Is averaging needed for strongly convex stochastic gradient descent? Open problem presented at COLT, 2012.
Zhang, T. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In ICML, 2004.
Zinkevich, M. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.
-----0
Apsel, U. and Brafman, R. Extended lifted inference with joint formulas. In Proceedings of the 27th Annual Conference on Uncertainty in Artificial Intelligence (UAI), pp. 11-18, Corvallis, Oregon, 2011.
Dawid, A. P. and Lauritzen, S. L. Hyper Markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics, 21(3):1272-1317, 1993.
de Salvo Braz, R., Amir, E., and Roth, D. Lifted first-order probabilistic inference. In Getoor, Lise and Taskar, Ben (eds.), Introduction to Statistical Relational Learning, pp. 433-452. The MIT Press, 2007.
Delyon, B., Lavielle, M., and Moulines, E. Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics, 27(1):94-128, 1999.
Diaconis, P. and Sturmfels, B. Algebraic algorithms for sampling from conditional distributions. The Annals of Statistics, 26(1):363-397, 1998.
Dobra, A. Markov bases for decomposable graphical models. Bernoulli, 9(6):1093-1108, 2003.
Fierens, D. and Kersting, K. From lifted inference to lifted models. In Second International Workshop on Statistical Relational AI, 2012.
Getoor, L. and Taskar, B. Introduction to Statistical Relational Learning. The MIT Press, 2007.
Gilks, W.R. and Wild, P. Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society. Series C (Applied Statistics), 41(2):337-348, 1992.
Heskes, T. Convexity arguments for efficient minimization of the Bethe and Kikuchi free energies. Journal of Artificial Intelligence Research, 26(1):153-190, 2006.
Kalbfleisch, J. D., Lawless, J. F., and Vollmer, W. M. Estimation in Markov models from aggregate data. Biometrics, 1983.
Karp, RM. Reducibility among combinatorial problems. Complexity of Computer Computations, 1972.
Koller, D. and Friedman, N. Probabilistic graphical models: principles and techniques. MIT press, 2009.
Lee, T. C., Judge, G. G., and Zellner, A. Estimating the parameters of the Markov probability model from aggregate time series data. North-Holland Pub. Co., 1970.
MacRae, E. C. Estimation of time-varying Markov processes with aggregate data. Econometrica: Journal of the Econometric Society, 1977.
Milch, B., Zettlemoyer, L.S., Kersting, K., Haimes, M., and Kaelbling, L.P. Lifted probabilistic inference with counting formulas. In Proceedings of the 23rd Conference on Artificial Intelligence (AAAI), pp. 1062-1068, 2008.
Sheldon, D. and Dietterich, T. G. Collective graphical models. In Advances in Neural Information Processing Systems (NIPS), 2011.
Sheldon, D., Elmohamed, M. A. S., and Kozen, D. Collective inference on Markov models for modeling bird migration. In Advances in Neural Information Processing Systems (NIPS), 2008.
Sheldon, Daniel. Manipulation of PageRank and Collective Hidden Markov Models. PhD thesis, Cornell University, 2009.
Sheldon, Daniel. Discrete adaptive rejection sampling. Technical Report UM-CS-2013-012, School of Computer Science, University of Massachusetts, Amherst, Massachusetts, May 2013.
Sullivan, B. L., Wood, C. L., Iliff, M. J., Bonney, R. E., Fink, D., and Kelling, S. eBird: A citizen-based bird observation network in the biological sciences. Biological Conservation, 142(10):2282-2292, 2009.
Sundberg, R. Some results about decomposable (or Markov-type) models for multidimensional contingency tables: distribution of marginals and partitioning of tests. Scandinavian Journal of Statistics, 2(2):71-79, 1975.
Van Der Plas, A. P. On the estimation of the parameters of Markov probability models using macro data. The Annals of Statistics, 1983.
-----0
Agarwal, Alekh, Bartlett, Peter L., and Duchi, John C. Oracle inequalities for computationally adaptive model selection. arXiv:1208.0129, 2012.
Amini, Arash A. and Wainwright, Martin J. High-dimensional analysis of semidefinite relaxations for sparse principal components. The Annals of Statistics, 37(5):2877-2921, 2009.
Bach, Francis and Moulines, Eric. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems (NIPS), 2011.
Bickel, Peter J. and Levina, Elizaveta. Covariance regularization by thresholding. The Annals of Statistics, 2008.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2010.
Chandrasekaran, Venkat and Jordan, Michael. Computational and statistical tradeoffs via convex relaxation. Technical report, University of California, Berkeley, 2012. arXiv:1211.1073.
Clarkson, Kenneth L. and Woodruff, David P. Low rank approximation and regression in input sparsity time. Technical report, IBM Almaden Research Center, 2013. arXiv:1207.6365.
Clarkson, Kenneth L., Drineas, Petros, Magdon-Ismail, Malik, Mahoney, Michael W., Meng, Xiangrui, and Woodruff, David P. The fast Cauchy transform and faster robust linear regression. Technical report, IBM Almaden Research Center, 2013. arXiv:1207.6365.
Dasgupta, A., Drineas, P., Harb, B., Kumar, R., and Mahoney, M. W. Sampling algorithms and coresets for lp regression. SIAM J. Computing, 38:2060-2078, 2009.
d'Aspremont, A., Ghaoui, L. El, Jordan, M. I., and Lanckriet, G. A direct formulation for sparse PCA using semidefinite programming. In S. Thrun, L. Saul, and B. Schoelkopf (Eds.), Advances in Neural Information Processing Systems (NIPS) 16, 2004.
Drineas, P., Mahoney, M. W., and Muthukrishnan, S. Sampling algorithms for l2 regression and applications. In Proc. of the 17th Annual SODA, pp. 1127-1136, 2006.
Hsu, Daniel, Kakade, Sham M., and Zhang, Tong. An analysis of random design linear regression. arXiv:1106.2363, 2011.
Johnstone, Iain M. and Lu, Arthur Yu. Sparse principal components analysis. Technical report, Stanford University, 2004. arXiv:0901.4392.
Koutis, Ioannis, Miller, Gary, and Peng, Richard. A nearly-m log n time solver for SDD linear systems. Technical report, Carnegie Mellon University, 2012. arXiv:cs/1102.4842; FOCS 2011.
Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400-407, 1951.
Shalev-Shwartz, Shai, Srebro, Nathan, and Zhang, Tong. Trading accuracy for sparsity in optimization problems with sparsity constraints. Siam Journal on Optimization, 2010.
Spielman, Daniel A. and Teng, Shang-Hua. Nearly-linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. Technical report, Yale University, 2009. arXiv:cs/0607105.
Vershynin, Roman. How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability, 2012.
-----0
Berg, C., Christensen, J. P. R., and Ressel, R. Harmonic Analysis on Semigroups. Theory of positive definite and related functions. Springer, 1984.
Chang, C.-C. and Lin, C.-J. LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm/, 2001.
Collins, M. and Duffy, N. Convolution kernels for natural language. In Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001], pp. 625-632. MIT Press, 2001.
Cortes, C., Haffner, P., and Mohri, M. Rational kernels: Theory and algorithm. Journal of Machine Learning Research, 1:1-50, 2004.
Hashimoto, K., Goto, S., Kawano, S., Aoki-Kinoshita, K. F., and Ueda, N. KEGG as a glycome informatics resource. Glycobiology, 16:63R-70R, 2006.
Haussler, D. Convolution kernels on discrete structures. UCSC-CRL 99-10, Dept. of Computer Science, University of California at Santa Cruz, 1999.
Kashima, H. and Koyanagi, T. Kernels for semi-structured data. In the 9th International Conference on Machine Learning (ICML 2002), pp. 291-298, 2002.
Kuboyama, T., Shin, K., and Kashima, H. Flexible tree kernels based on counting the number of tree mappings. In Proc. of Machine Learning with Graphs, 2006.
Leslie, C. S., Eskin, E., and Stafford Noble, W. The spectrum kernel: A string kernel for SVM protein classification. In Pacific Symposium on Biocomputing, pp. 566-575, 2002.
Lodhi, H., Shawe-Taylor, J., Cristianini, N., and Watkins, C. J. C. H. Text classification using string kernels. Advances in Neural Information Processing Systems (NIPS 2000), 13, 2001.
Lu, S. Y. A tree-to-tree distance and its application to cluster analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 1:219-224, 1979.
Shin, K. and Kuboyama, T. A generalization of Haussler's convolution kernel - Mapping kernel. In ICML 2008, 2008.
Shin, K. and Kuboyama, T. A generalization of Haussler's convolution kernel - Mapping kernel and its application to tree kernels. J. Comput. Sci. Technol., 25(5):1040-1054, 2010.
Shin, K. Partitionable kernels for mapping kernels. In ICDM 2011, pp. 645-654, 2011.
Tai, K. C. The tree-to-tree correction problem. JACM, 26(3):422-433, July 1979.
Wagner, R.A. and Fischer, M.J. The string-to-string correction problem. JACM, 21(1):168-173, 1974.
Zhang, K. Algorithms for the constrained editing distance between ordered labeled trees and related problems. PR, 28(3):463-474, March 1995.
-----0
Abe, N., Pednault, E., Wang, H., Zadrozny, B., Fan, W., and Apte, C. (2002). Empirical comparison of various reinforcement learning strategies for sequential targeted marketing. In International Conference on Data Mining, pages 3-10.
Abe, N., Verma, N., Schroko, R., and Apte, C. (2004). Cross channel optimized marketing by reinforcement learning. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 767-772.
Archibald, T. (1992). Parallel dynamic programming. In Kronsjo, L. and Shumsheruddin, D., editors, Advances in parallel algorithms, pages 343-367. John Wiley & Sons, Inc.
Bertsekas, D. (1982). Distributed dynamic programming. Automatic Control, IEEE Transactions on, 27(3):610-616.
Gomez-Perez, G., Martin-Guerrero, J. D., Soria-Olivas, E., Balaguer-Ballester, E., Palomares, A., and Casariego, N. (2008). Assigning discounts in a marketing campaign by using reinforcement learning and neural networks. Expert Systems with Applications, (doi: 10.1016/j.eswa.2008.10.064).
Graepel, T., Candela, J. Q., Borchert, T., and Herbrich, R. (2010). Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In 27th International Conference on Machine Learning, pages 13-20.
Grounds, M. and Kudenko, D. (2007). Parallel reinforcement learning with linear function approximation. In Adaptive Agents and Multi-Agent Systems, pages 60-74.
Kaelbling, L., Littman, M., and Cassandra, A. (1995). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99-134.
Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. CoRR, abs/1003.0146.
Littman, M. L. (1994). Markov games as a framework for multi-agent reinforcement learning. In 11th International Conference on Machine Learning, pages 157-163.
Pednault, E., Abe, N., Zadrozny, B., Wang, H., Fan, W., and Apte, C. (2002). Sequential cost-sensitive decision making with reinforcement learning. In International Conference on Knowledge Discovery and Data Mining (KDD).
Schraudolph, N. N. (1999). Local gain adaptation in stochastic gradient descent. In International Conference on Artificial Neural Networks, pages 569-574.
Sutton, R. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3(9):9-44.
Sutton, R. (1992). Adapting bias by gradient descent: An incremental version of delta-bar-delta. In 10th National Conference on Artificial Intelligence, pages 171-176.
Sutton, R. and Barto, A. (1998). Reinforcement Learning: an Introduction. MIT Press.
Sutton, R., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181-211.
Tsiptsis, K. and Chorianopoulos, A. (2010). Data Mining Techniques in CRM: Inside Customer Segmentation. Wiley.
-----0
Bar-Lev, S. K. and Enis, P. Reproducibility and natural exponential families with power variance functions. Annals of Stat., 14, 1986.
Bertin-Mahieux, Thierry, Ellis, Daniel P.W., Whitman, Brian, and Lamere, Paul. The million song dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011.
Boulanger-Lewandowski, Nicolas, Bengio, Yoshua, and Vincent, Pascal. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In International Conference on Machine Learning (ICML), 2012.
Cichocki, A., Zdunek, R., Phan, A. H., and Amari, S. Nonnegative Matrix and Tensor Factorization. Wiley, 2009.
Simsekli, U., Yılmaz, Y. K., and Cemgil, A. T. Score guided audio restoration via generalised coupled tensor factorization. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2012.
Dikmen, Onur and Fevotte, Cedric. Maximum marginal likelihood estimation for nonnegative dictionary learning in the gamma-Poisson models. IEEE Transactions on Signal Processing, 60(10):5163-5175, 2012.
Dunn, P. K. and Smyth, G. S. Series evaluation of Tweedie exponential dispersion model densities. Stats. & Comp., 15:267-280, 2005.
Emiya, V., Badeau, R., and David, B. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE TASLP, 18(6):1643-1654, 2010.
Fevotte, C., Bertin, N., and Durrieu, J. L. Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis. Neural Computation, 21:793-830, 2009.
Jørgensen, B. The Theory of Dispersion Models. Chapman & Hall/CRC Monographs on Statistics & Applied Probability, 1997.
Lu, Zhiyun, Yang, Zhirong, and Oja, Erkki. Selecting β-divergence for nonnegative matrix factorization by score matching. In Proceedings of 22nd International Conference on Artificial Neural Networks (ICANN 2012), volume 7553 of Lecture Notes in Computer Science, pp. 419-426, Lausanne, Switzerland, 2012. Springer.
McCulloch, C. E. and Nelder, J. A. Generalized Linear Models. Chapman and Hall, 2nd edition, 1989.
Smaragdis, P. Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs. In ICA, pp. 494-499, 2004.
Smaragdis, P. and Brown, J. C. Non-negative matrix factorization for polyphonic music transcription. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 177-180, 2003.
Tweedie, M. C. An index which distinguishes between some important exponential families. Statistics: applications and new directions, Indian Statist. Inst., Calcutta, pp. 579-604, 1984.
Yılmaz, Y. K. and Cemgil, A. T. Alpha/beta divergences and Tweedie models. arXiv:1209.4280 v1, 2012.
Yılmaz, Y. K., Cemgil, A. T., and Simsekli, U. Generalised coupled tensor factorisation. In NIPS, 2011.
Zhang, Yanwei. Likelihood-based and bayesian methods for tweedie compound poisson linear mixed models. Statistics and Computing, accepted, 2012.
-----0
Bowling, Michael. Convergence problems of general-sum multiagent reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, pp. 89-94, 2000.
Chen, Xi and Deng, Xiaotie. Settling the complexity of two-player Nash equilibrium. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 261-272, 2006.
Gilboa, I. and Zemel, E. Nash and correlated equilibria: Some complexity considerations. Games and Economic Behavior, 1:80-93, 1989.
Gomes, Eduardo Rodrigues and Kowalczyk, Ryszard. Dynamic analysis of multiagent Q-learning with ε-greedy exploration. In Proceedings of the 2009 International Conference on Machine Learning, 2009.
Greenwald, Amy and Hall, Keith. Correlated-Q learning. In Proceedings of the Twentieth International Conference on Machine Learning, pp. 242-249, 2003.
Hu, Junling and Wellman, Michael P. Nash Q-learning for general-sum stochastic games. The Journal of Machine Learning Research, 4:1039-1069, 2003.
Kalai, Adam and Kalai, Ehud. Cooperation in strategic games revisited*. The Quarterly Journal of Economics, 2012.
Kalai, Adam Tauman and Kalai, Ehud. Cooperation and competition in strategic games with private information. In Proceedings of the 11th ACM Conference on Electronic Commerce, EC '10, pp. 345-346, New York, NY, USA, 2010. ACM.
Littman, Michael L. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pp. 157-163, 1994.
Littman, Michael L. Friend-or-foe Q-learning in general-sum games. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 322-328. Morgan Kaufmann, 2001.
Littman, Michael L. and Szepesvari, Csaba. A generalized reinforcement-learning model: Convergence and applications. In Saitta, Lorenza (ed.), Proceedings of the Thirteenth International Conference on Machine Learning, pp. 310-318, 1996.
Munoz de Cote, Enrique and Littman, Michael L. A polynomial-time Nash equilibrium algorithm for repeated stochastic games. In 24th Conference on Uncertainty in Artificial Intelligence (UAI08), 2008.
Puterman, Martin L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.
Sandholm, Tuomas W. and Crites, Robert H. Multiagent reinforcement learning in the iterated prisoner's dilemma. Biosystems, 37:144-166, 1995.
Shapley, L.S. Stochastic games. Proceedings of the National Academy of Sciences of the United States of America, 39:1095-1100, 1953.
Sutton, Richard S. and Barto, Andrew G. Reinforcement Learning: An Introduction. The MIT Press, 1998.
Watkins, Christopher J. C. H. and Dayan, Peter. Q-learning. Machine Learning, 8(3):279-292, 1992.
Zinkevich, Martin, Greenwald, Amy R., and Littman, Michael L. Cyclic equilibria in Markov games. In Advances in Neural Information Processing Systems 18, 2005.
-----0
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In NIPS, 2007.
Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009.
Boureau, Y. L., Bach, F., LeCun, Y., and Ponce, J. Learning mid-level features for recognition. In CVPR, 2010.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338, 2010.
Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., and Lin, C. J. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.
Fei-Fei, L., Fergus, R., and Perona, P. Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In CVPR Workshop on Generative Model Based Vision, 2004.
Goh, H., Thome, N., Cord, M., and Lim, J. H. Unsupervised and supervised visual codes with restricted Boltzmann machines. In ECCV, 2012.
Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. Technical report, California Institute of Technology, 2007.
Guyon, I. and Elisseeff, A. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157-1182, 2003.
Heess, N., Le Roux, N., and Winn, J. Weakly supervised learning of foreground-background segmentation using masked RBMs. In ICANN, 2011.
Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.
Hinton, G. E., Osindero, S., and Teh, Y. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.
Jain, A. and Zongker, D. Feature selection: Evaluation, application, and small sample performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(2):153-158, 1997.
Larochelle, H. and Bengio, Y. Classification using discriminative restricted Boltzmann machines. In ICML, 2008.
Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, 2007.
Lazebnik, S., Schmid, C., and Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
Le Roux, N., Heess, N., Shotton, J., and Winn, J. Learning a generative model of images by factoring appearance and shape. Neural Computation, 23(3):593-650, 2011.
Lee, H., Ekanadham, C., and Ng, A. Y. Sparse deep belief network model for visual area V2. In NIPS, 2008.
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. Unsupervised learning of hierarchical representations with convolutional deep belief networks. Communications of the ACM, 54(10):95-103, 2011.
Memisevic, R. and Hinton, G. E. Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22(6):1473-1492, 2010.
Nair, V. and Hinton, G. E. Implicit mixtures of restricted Boltzmann machines. In NIPS, 2008.
Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. Efficient learning of sparse representations with an energy-based model. In NIPS, 2007.
Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. Contractive auto-encoders: Explicit invariance during feature extraction. In ICML, 2011.
Rifai, S., Bengio, Y., Courville, A., Vincent, P., and Mirza, M. Disentangling factors of variation for facial expression recognition. In ECCV, 2012.
Sejnowski, T. J. Higher-order Boltzmann machines. In AIP Conference Proceedings on Neural Networks for Computing, 1987.
Sohn, K. and Lee, H. Learning invariant representations with local transformations. In ICML, 2012.
Sohn, K., Jung, D. Y., Lee, H., and Hero, A. O. Efficient learning of sparse, distributed, convolutional feature representations for object recognition. In ICCV, 2011.
Tang, Y., Salakhutdinov, R., and Hinton, G. E. Robust Boltzmann machines for recognition and denoising. In CVPR, 2012.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P. A. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., and Vapnik, V. Feature selection for svms. In NIPS, 2001.
Yang, J., Yu, K., Gong, Y., and Huang, T. S. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.
Yang, Y. and Pedersen, J. O. A comparative study on feature selection in text categorization. In ICML, 1997.
-----0
Balle, Borja, Quattoni, Ariadna, and Carreras, Xavier. Local loss optimization in operator models: A new insight into spectral learning. In Proceedings of the International Conference on Machine Learning, 2012.
Blei, D., Ng, A., and Jordan, M. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.
Clark, A. Inference of haplotypes from PCR-amplified samples of diploid populations. Molecular Biology and Evolution, 7(2):111-122, 1990.
Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39(1):1-22, 1977.
Foster, D.P., Rodu, J., and Ungar, L.H. Spectral dimensionality reduction for hmms. Arxiv preprint arXiv:1203.6130, 2012.
Grasedyck, Lars. Hierarchical singular value decomposition of tensors. SIAM Journal on Matrix Analysis and Applications, 31(4):2029-2054, 2010.
Hoff, Peter D., Raftery, Adrian E., and Handcock, Mark S. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090-1098, 2002.
Hsu, D., Kakade, S., and Zhang, T. A spectral algorithm for learning hidden markov models. In Proc. Annual Conf. Computational Learning Theory, 2009.
Kalkatawi, M., Rangkuti, F., Schramm, M., Jankovic, B.R., Kamau, A., Chowdhary, R., Archer, J.A.C., and Bajic, V.B. Dragon PolyA Spotter: predictor of poly(A) motifs within human genomic DNA sequences. Bioinformatics, 28(1):127-129, 2012.
Oseledets, I. V. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295-2317, 2011.
Parikh, A., Song, L., and Xing, E. P. A spectral algorithm for latent tree graphical models. In Proceedings of the International Conference on Machine Learning, 2011.
Rabiner, L. R. and Juang, B. H. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4-16, January 1986.
Song, L., Parikh, A., and Xing, E.P. Kernel embeddings of latent tree graphical models. In Advances in Neural Information Processing Systems, volume 25, 2011.
Yu, Y. and Schuurmans, D. Rank/norm regularization with closed-form solutions: Application to subspace clustering. In Conference on Uncertainty in Artificial Intelligence, 2011.
-----0
Behle, Markus. Binary decision diagrams and integer programming. PhD thesis, der Universitat des Saarlandes, 2010.
Bourdev, Lubomir and Brandt, Jonathan. Robust object detection via soft cascade. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pp. 236-243. IEEE, 2005.
Bryant, Randal E. Graph-based algorithms for boolean function manipulation. Computers, IEEE Transactions on, 100(8):677-691, 1986.
Busa-Fekete, R., Benbouzid, D., Kegl, Balazs, et al. Fast classification using sparse decision DAGs. In 29th International Conference on Machine Learning (ICML), 2012.
Chen, Minmin, Xu, Zhixiang, Weinberger, Kilian Q, Chapelle, Olivier, Kedem, Dor, and Saint Louis, MO. Classifier cascade for minimizing feature evaluation cost. In AISTATS, volume 15, pp. 218-226, 2012.
Ebendt, Rudiger, Fey, Gorschwin, and Drechsler, Rolf. Advanced BDD optimization. Springer, 2005.
Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337-407, 2000.
Goodman, Rodney M. and Smyth, Padhraic. Decision tree design from a communication theory standpoint. Information Theory, IEEE Transactions on, 34(5):979-994, 1988.
Grossmann, E. Adatree: Boosting a weak classifier into a decision tree. In Computer Vision and Pattern Recognition Workshop, 2004. CVPRW 04. Conference on, pp. 105-105, 2004. doi: 10.1109/CVPR.2004.22.
Grubb, Alexander and Bagnell, J Andrew. Speedboost: Anytime prediction with uniform near optimality. In AISTATS, volume 15, pp. 458-466, 2012.
Kim, Tae-Kyun, Budvytis, Ignas, and Cipolla, Roberto. Making a shallow network deep: Conversion of a boosting classifier into a decision tree by boolean optimisation. International Journal of Computer Vision, pp. 1-13, 2012.
Mayer-Eichberger, Valentin Christian Johannes Kaspar. Towards solving a system of pseudo boolean constraints with binary decision diagrams. PhD thesis, Universidade Nova de Lisboa, 2008.
Reyzin, Lev. Boosting on a budget: Sampling for feature-efficient prediction. In Proceedings of the 28th International Conference on Machine Learning (ICML), 2011.
Rudell, Richard. Dynamic variable ordering for ordered binary decision diagrams. In Proceedings of the 1993 IEEE/ACM International Conference on Computer-Aided Design, pp. 42-47. IEEE Computer Society Press, 1993.
Saberian, Mohammad J and Vasconcelos, Nuno. Boosting classifier cascades. In Proceedings of the 24th Annual Conference on Neural Information Processing Systems (NIPS), 2010.
Sanner, Scott. Decision diagrams in automated planning and scheduling. Tutorial at ICAPS 2011. http://users.cecs.anu.edu.au/~ssanner/Papers/Decision_Diagrams_Tutorial.pdf, 2011.
Sochman, Jan and Matas, Jiri. Waldboost-learning for time constrained sequential detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pp. 150-156. IEEE, 2005.
Tu, Zhuowen. Probabilistic boosting-tree: learning discriminative models for classification, recognition, and clustering. In Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, volume 2, pp. 1589-1596, 2005. doi: 10.1109/ICCV.2005.194.
Viola, Paul and Jones, Michael J. Robust real-time face detection. International Journal of Computer Vision, 57(2):137-154, 2004.
Xu, Zhixiang, Kusner, Matt, Weinberger, Kilian, and Chen, Minmin. Cost-sensitive tree of classifiers. In 30th International Conference on Machine Learning (ICML), 2013.
Zhang, Cha and Viola, Paul. Multiple-instance pruning for learning efficient cascade detectors. In Proceedings of the 21st Annual Conference on Neural Information Processing Systems (NIPS), 2007.
-----0
Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5:157-166, 1994.
Bengio, Y., Lamblin, P, Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In NIPS. MIT Press, 2007.
Bottou, L. and LeCun, Y. Large scale online learning. In Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference, volume 16, pp. 217. MIT Press, 2004.
Chapelle, O. and Erhan, D. Improved Preconditioner for Hessian Free Optimization. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
Cotter, A., Shamir, O., Srebro, N., and Sridharan, K. Better mini-batch algorithms via accelerated gradient methods. arXiv preprint arXiv:1106.4574, 2011.
Dahl, G.E., Yu, D., Deng, L., and Acero, A. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. Audio, Speech, and Language Processing, IEEE Transactions on, 20(1):30-42, 2012.
Darken, C. and Moody, J. Towards faster stochastic gradient search. Advances in neural information processing systems, pp. 1009-1009, 1993.
Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS 2010, volume 9, pp. 249-256, May 2010.
Graves, A. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.
Hinton, G and Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science, 313:504-507, 2006.
Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 2012.
Hinton, G.E., Osindero, S., and Teh, Y.W. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527-1554, 2006.
Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735-1780, 1997.
Jaeger, H. personal communication, 2012.
Jaeger, H. and Haas, H. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304:78-80, 2004.
Krizhevsky, A., Sutskever, I., and Hinton, G. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1106-1114, 2012.
Lan, G. An optimal method for stochastic composite optimization. Mathematical Programming, pp. 1-33, 2010.
LeCun, Y., Bottou, L., Orr, G., and Muller, K. Efficient backprop. Neural networks: Tricks of the trade, pp. 546-546, 1998.
Martens, J. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML), 2010.
Martens, J. and Sutskever, I. Learning recurrent neural networks with hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 1033-1040, 2011.
Martens, J. and Sutskever, I. Training deep and recurrent networks with hessian-free optimization. Neural Networks: Tricks of the Trade, pp. 479-535, 2012.
Mikolov, Tomas, Sutskever, Ilya, Deoras, Anoop, Le, Hai-Son, Kombrink, Stefan, and Cernocky, J. Subword language modeling with neural networks. preprint (http://www.fit.vutbr.cz/imikolov/rnnlm/char.pdf), 2012.
Mohamed, A., Dahl, G.E., and Hinton, G. Acoustic modeling using deep belief networks. Audio, Speech, and Language Processing, IEEE Transactions on, 20(1):14-22, Jan. 2012.
Nesterov, Y. A method of solving a convex programming problem with convergence rate O(1/sqr(k)). Soviet Mathematics Doklady, 27:372-376, 1983.
Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87. Springer, 2003.
Orr, G.B. Dynamics and algorithms for stochastic search. 1996.
Polyak, B.T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17, 1964.
Raiko, Tapani, Valpola, Harri, and LeCun, Yann. Deep learning made easier by linear transformations in perceptrons. In NIPS 2011 Workshop on Deep Learning and Unsupervised Feature Learning, Sierra Nevada, Spain, 2011.
Sutskever, I., Martens, J., and Hinton, G. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning, ICML '11, pp. 1017-1024, June 2011.
Wiegerinck, W., Komoda, A., and Heskes, T. Stochastic dynamics of learning with momentum in neural networks. Journal of Physics A: Mathematical and General, 27(13):4425, 1999.
-----0
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1):183-202, 2009.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3:1-122, 2010.
P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4):1168-1200, 2005.
W. Deng and W. Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Technical report, Rice University CAAM TR12-14, 2012.
J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2873-2908, 2009.
J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and A. Tewari. Composite objective mirror descent. In Proceedings of the Annual Conference on Computational Learning Theory, 2010.
M. A. T. Figueiredo and R. Nowak. An EM algorithm for wavelet-based image restoration. IEEE Trans. Image Process, 12:906-916, 2003.
D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Computers & Mathematics with Applications, 2:17-40, 1976.
B. He and X. Yuan. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM J. Numerical Analysis, 50(2):700-709, 2012.
M. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory & Applications, 4:303-320, 1969.
L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th International Conference on Machine Learning, 2009.
A. Nemirovskii and D. Yudin. Problem complexity and method efficiency in optimization. John Wiley, New York, 1983.
H. Ouyang, N. He, L. Q. Tran, and A. Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning, 2013.
M. Powell. A method for nonlinear constraints in minimization problems. In R. Fletcher, editor, Optimization, pages 283-298. Academic Press, London, New York, 1969.
Z. Qin and D. Goldfarb. Structured sparsity via alternating direction methods. Journal of Machine Learning Research, 13:1435-1468, 2012.
R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, 1970.
R. T. Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research, 1:97-116, 1976.
M. Signoretto, L. D. Lathauwer, and J. Suykens. Nuclear norms for tensors and their use for convex multilinear estimation. Technical Report 10-186, ESAT-SISTA, K.U.Leuven, 2010.
R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of Royal Statistical Society: B, 67(1):91-108, 2005.
R. Tomioka, T. Suzuki, K. Hayashi, and H. Kashima. Statistical performance of convex tensor decomposition. In Advances in Neural Information Processing Systems 25, 2011.
R. Tomioka, T. Suzuki, and M. Sugiyama. Superlinear convergence of dual augmented lagrangian algorithm for sparsity regularized estimation. Journal of Machine Learning Research, 12:1537-1586, 2012.
H. Wang and A. Banerjee. Online alternating direction method. In Proceedings of the 29th International Conference on Machine Learning, 2012.
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 23, 2009.
L. Yuan, J. Liu, and J. Ye. Efficient methods for overlapping group lasso. In Advances in Neural Information Processing Systems 24, 2011.
X. Q. Zhang, M. Burger, and S. Osher. A unified primal-dual algorithm framework based on Bregman iteration. Journal of Scientific Computing, 46(1):20-46, 2011.
-----0
Ben-Or, M. and Hassidim, A. The bayesian learner is optimal for noisy binary search. In IEEE 49th Annual IEEE Symposium on Foundations of Computer Science, pp. 221-230, 2008.
Berry, D.A. and Fristedt, B. Bandit Problems: Sequential Allocation of Experiments. Chapman & Hall, London, 1985.
Brochu, E., Cora, M., and de Freitas, N. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical Report TR-2009-023, Department of Computer Science, University of British Columbia, November 2009.
Castro, R. and Nowak, R. Active sensing and learning, in Foundations and Applications of Sensor Management. Springer-Verlag, 2007.
Castro, R., Willett, R., and Nowak, R. Faster rates in regression via active learning. In NIPS, 2005.
Chakraborty, M., Das, S., and Magdon-Ismail, M. Near-optimal target learning with stochastic binary signals. In UAI, pp. 69-76, 2011.
Chick, S.E. and Frazier, P.I. Sequential sampling for selection with economics of selection procedures. Management Science, 58:550-569, 2012.
Cover, T. M. and Thomas, J.A. Elements of Information Theory. Wiley Interscience Press, 1991.
Dasgupta, S., Hsu, D., and Monteleoni, C. A general agnostic active learning algorithm. In NIPS, 2007.
DeGroot, M. H. Optimal Statistical Decisions. McGraw Hill, New York, 1970.
Dhagat, A., Gacs, P., and Winkler, P. On playing twenty questions with a liar. Technical report, Boston University, 2004.
Dynkin, E. B. and Yushkevich, A. A. Controlled Markov Processes. Springer, 1979.
Gittins, J. C. and Jones, D. M. A dynamic allocation index for the sequential design of experiments. In Gani, J. (ed.), Progress in Statistics, pp. 241-266, Amsterdam, 1974. North-Holland.
Haupt, J., Castro, R. M., and Nowak, R. Distilled sensing: Adaptive sampling for sparse detection and estimation. IEEE Transactions on Information Theory, 57(9), 2011.
Horstein, M. Sequential decoding using noiseless feedback. IEEE Transactions on Information Theory, 9:136-143, 1963.
Jedynak, B., Frazier, P. I., and Sznitman, R. Twenty questions with noise: Bayes optimal policies for entropy loss. Journal of Applied Probability, 49:114-136, 2011.
Lucchi, A., Smith, K., Radhakrishna, A., Knott, G., and Fua, P. Supervoxel-Based Segmentation of Mitochondria in EM Image Stacks with Learned Shape Features. TMI, 31(2):474-486, 2011.
Nowak, R. Generalized binary search. In Conference on Communication, Control, and Computing, pp. 568-574, 2008.
Settles, B. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison, 2009.
Spencer, J. and Winkler, P. Three thresholds for a liar. Combinatorics, Probability and Computing, 1:81-93, 1992.
Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, 2010.
Sznitman, R. and Jedynak, B. Active Testing for Face Detection and Localization. PAMI, 32(10):1914-1920, June 2010.
Sznitman, R., Lucchi, A., Pjescic-Emedji, N., Knott, G., and Fua, P. Efficient Scanning for EM Based Target Localization. In MICCAI, 2012.
Veeraraghavan, Ashok, Genkin, Alex, Vitaladevuni, Shiv, Scheffer, Lou, Xu, Shan, Hess, Harald, Fetter, Richard, Cantoni, Marco, Knott, Graham, and Chklovskii, Dmitri. Increasing Depth Resolution of Electron Microscopy of Neural Circuits using Sparse Tomographic Reconstruction. In CVPR, pp. 1767-1774, 2010.
Waeber, R., Frazier, P. I., and Henderson, S. G. A bayesian approach to stochastic root finding. In Winter Simulation Conference, 2011.
Wetherill, GB and Glazebrook, KD. Sequential Methods in Statistics. Monographs on Statistics and Applied Probability. Chapman & Hall, London, third edition, 1986.
Zhang, Yi, Xu, Wei, and Callan, Jamie. Exploration and exploitation in adaptive filtering based on bayesian active learning. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), pp. 896-903, Washington, DC, USA, 2003.
-----0
Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235-256, 2002.
Awerbuch, Baruch and Kleinberg, Robert. Competitive collaborative learning. J. Comput. Syst. Sci., 74(8):1271-1288, December 2008. ISSN 0022-0000. doi: 10.1016/j.jcss.2007.08.004. URL http://dx.doi.org/10.1016/j.jcss.2007.08.004.
Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, NY, USA, 2006.
Gelly, S., Hoock, J.B., Rimmel, A., Teytaud, O., and Kalemkarian, Y. The parallelization of Monte-Carlo planning. In Proceedings of the Fifth International Conference on Informatics in Control, Automation and Robotics, pp. 244-249, 2008.
Hegedus, I., Busa-Fekete, R., Ormandi, R., Jelasity, M., and Kegl, B. Peer-to-peer multi-class boosting. In International European Conference on Parallel and Distributed Computing (EUROPAR), pp. 389-400, 2012.
Jelasity, M., Montresor, A., and Babaoglu, O. Gossip-based aggregation in large dynamic networks. ACM Trans. on Computer Systems, 23(3):219-252, August 2005.
Jelasity, M., Voulgaris, S., Guerraoui, R., Kermarrec, A.-M., and van Steen, M. Gossip-based peer sampling. ACM Transactions on Computer Systems, 25(3):8, 2007.
Joulani, Pooria. Multi-armed bandit problems under delayed feedback. MSc thesis, Department of Computing Science, University of Alberta, 2012.
Kempe, D., Dobra, A., and Gehrke, J. Gossip-based computation of aggregate information. In Proc. 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS'03), pp. 482-491. IEEE Computer Society, 2003.
Kocsis, L. and Szepesvari, Cs. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning, pp. 282-293, 2006.
Kowalczyk, W. and Vlassis, N. Newscast EM. In 17th Advances in Neural Information Processing Systems, pp. 713-720, Cambridge, MA, 2005. MIT Press.
Lai, T.L. and Robbins, H. Asymptotically efficient allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.
Langford, John and Zhang, Tong. The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, 2007.
Langford, John, Smola, Alex, and Zinkevich, Martin. Slow Learners are Fast. In Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C. K. I., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 22, pp. 2331-2339. 2009.
Ormandi, R., Hegedus, I., and Jelasity, M. Gossip learning with linear models on fully distributed data. Concurrency and Computation: Practice and Experience, 2012. doi: 10.1002/cpe.2858.
Xiao, L., Boyd, S., and Kim, S.-J. Distributed average consensus with least-mean-square deviation. Journal of Parallel and Distributed Computing, 67(1):33-46, January 2007.
-----0
Agarwal, A. and Duchi, J. Distributed delayed stochastic optimization. In NIPS, 2011.
Bradley, J.K., Kyrola, A., Bickson, D., and Guestrin, C. Parallel coordinate descent for l1-regularized loss minimization. In ICML, 2011.
Cotter, A., Shamir, O., Srebro, N., and Sridharan, K. Better mini-batch algorithms via accelerated gradient methods. In NIPS, 2011.
Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. Optimal distributed online prediction using minibatches. Journal of Machine Learning Research, 13:165-202, 2012.
Duchi, John, Bartlett, Peter L., and Wainwright, Martin J. Randomized smoothing for (parallel) stochastic optimization. In ICML, 2012a.
Duchi, John, Bartlett, Peter L., and Wainwright, Martin J. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674-701, June 2012b.
Hsieh, C-J., Chang, K-W., Lin, C-J., Keerthi, S.S., and Sundarajan, S. A dual coordinate descent method for large-scale linear SVM. In ICML, 2008.
Hsu, D., Karampatziakis, N., Langford, J., and Smola, A. Parallel online learning. arXiv:1103.4204, 2011.
Libsvm. Datasets. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html.
Nesterov, Yu. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optimization, 22:341-362, 2012.
Niu, F., Recht, B., Re, C., and Wright, S. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F.C.N., and Weinberger, K.Q. (eds.), NIPS 24, pp. 693-701. 2011.
Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. ArXiv:1109.5647, 2012.
Richtarik, P. and Takac, M. Distributed coordinate descent methods for big data optimization. Technical report.
Richtarik, P. and Takac, M. Parallel coordinate descent methods for big data optimization. ArXiv:1212.0873, 2012.
Richtarik, P. and Takac, M. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 2013. doi: 10.1007/s10107-012-0614-z.
Shalev-Shwartz, S. and Tewari, A. Stochastic Methods for l1-regularized Loss Minimization. JMLR, 12:1865-1892, 2011.
Shalev-Shwartz, S. and Zhang, T. Stochastic dual coordinate ascent methods for regularized loss minimization. ArXiv:1209.1873, 2012.
Shalev-Shwartz, S.S., Singer, Y., Srebro, N., and Cotter, A. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming: Series A and B, Special Issue on Optimization and Machine Learning, pp. 3-30, 2011.
Tappenden, R., Richtarik, P., and Gondzio, J. Inexact coordinate descent: complexity and preconditioning. ArXiv:1304.5530, 2013.
Zhang, T. Solving large scale linear prediction using stochastic gradient descent algorithms. In ICML, 2004.
-----0
Bertsekas, D. P. Temporal difference methods for general projected equations. IEEE Trans. Auto. Control, 56(9):2128-2139, 2011.
Bertsekas, D. P. Dynamic Programming and Optimal Control, Vol II. Athena Scientific, fourth edition, 2012.
Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, 1996.
Borkar, V. S. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge Univ Press, 2008.
Boyan, J. A. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2):233-246, 2002.
Engel, Y., Mannor, S., and Meir, R. Reinforcement learning with Gaussian processes. In ICML, 2005.
Filar, J. A., Krass, D., and Ross, K. W. Percentile performance criteria for limiting average Markov decision processes. IEEE Trans. Auto. Control, 40(1):2-10, 1995.
Geibel, P. and Wysotzki, F. Risk-sensitive reinforcement learning applied to control under constraints. JAIR, 24(1):81-108, 2005.
Horn, R. A. and Johnson, C. R. Matrix Analysis. Cambridge University Press, 1985.
Konda, V. Actor-Critic Algorithms. PhD thesis, Dept. Comput. Sci. Elect. Eng., MIT, Cambridge, MA, 2002.
Konidaris, G. D. and Barto, A. G. Skill discovery in continuous reinforcement learning domains using skill chaining. In NIPS, 2009.
Lazaric, A., Ghavamzadeh, M., and Munos, R. Finite-sample analysis of LSTD. In ICML, 2010.
Mannor, S. and Tsitsiklis, J. N. Mean-variance optimization in Markov decision processes. In ICML, 2011.
Mihatsch, O. and Neuneier, R. Risk-sensitive reinforcement learning. Machine Learning, 49(2):267-290, 2002.
Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., and Tanaka, T. Parametric return density estimation for reinforcement learning. arXiv preprint arXiv:1203.3497, 2012.
Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, Inc., 1994.
Sato, M., Kimura, H., and Kobayashi, S. TD algorithm for the variance of return and mean-variance reinforcement learning. Transactions of the Japanese Society for Artificial Intelligence, 16:353-362, 2001.
Sharpe, W. F. Mutual fund performance. The Journal of Business, 39(1):119-138, 1966.
Shortreed, S. M., Laber, E., Lizotte, D. J., Stroup, T. S., Pineau, J., and Murphy, S. A. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Machine learning, 84(1):109-136, 2011.
Sobel, M. J. The variance of discounted Markov decision processes. J. Applied Probability, pp. 794-802, 1982.
Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9-44, 1988.
Sutton, R. S. and Barto, A. G. Reinforcement Learning. MIT Press, 1998.
Tamar, A., Di Castro, D., and Mannor, S. Policy gradients with variance related risk criteria. In ICML, 2012.
Tesauro, G. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58-68, 1995.
-----0
Belhumeur, P. N. and Kriegman, D. J. What is the set of images of an object under all possible lighting conditions. In CVPR, pp. 270-277, 1996.
Bro, Rasmus. Multi-way Analysis in the Food Industry. Models, Algorithms and Applications. PhD thesis, University of Amsterdam, The Netherlands, 1998. URL www.models.kvl.dk/users/rasmus/thesis/thesis.html.
Carroll, J. and Chang, Jih J. Analysis of individual differences in multidimensional scaling via an n-way generalization of Eckart-Young decomposition. Psychometrika, 35(3):283-319, September 1970.
Chu, Wei and Ghahramani, Zoubin. Probabilistic models for incomplete multi-dimensional arrays. Journal of Machine Learning Research Proceedings Track, 5:89-96, 2009.
Culpepper, Benjamin J., Sohl-Dickstein, Jascha, and Olshausen, Bruno A. Building a better probabilistic model of images by factorization. In ICCV, pp. 2011-2017, 2011.
Fan, Rong-En, Chang, Kai-Wei, Hsieh, Cho-Jui, Wang, Xiang-Rui, and Lin, Chih-Jen. LIBLINEAR: A library for large linear classification, August 2008.
Graham, Daniel B and Allinson, Nigel M. Characterizing virtual eigensignatures for general purpose face recognition. In Face Recognition: From Theory to Applications, pp. 446-456. 1998.
Grimes, David B. and Rao, Rajesh P. Bilinear sparse coding for invariant vision. Neural Computation, 17(1), January 2005.
Hinton, G. E. and Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science, 313:504-507, 2006.
Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.
Kolda, Tamara G. and Bader, Brett W. Tensor decompositions and applications. SIAM Review, 51(3):455-500, 2009.
Lathauwer, Lieven De and Vandewalle, Joos. Dimensionality reduction in higher-order signal processing and rank(r1,r2,...,rn) reduction in multilinear algebra, 2004.
Lee, K. C., Ho, Jeffrey, and Kriegman, David. Acquiring linear subspaces for face recognition under variable lighting. IEEE PAMI, 27:684-698, 2005.
Levine, Richard A. and Casella, George. Implementations of the Monte Carlo EM Algorithm. Journal of Computational and Graphical Statistics, 10(3):422-439, 2001.
Martin, D., Fowlkes, C., Tal, D., and Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, volume 2, pp. 416-423, July 2001.
Neal, R. M. Annealed importance sampling. Statistics and Computing, 11:125-139, 2001.
Rubin, Donald B. and Thayer, Dorothy T. EM algorithms for the ML factor analysis. Psychometrika, 47(1):69-76, 1982.
Shashua, Amnon and Hazan, Tamir. Non-negative tensor factorization with applications to statistics and computer vision. In Proc. of ICML, pp. 792-799. ICML, 2005.
Sun, Jimeng, Tao, Dacheng, and Faloutsos, Christos. Beyond streams and graphs: dynamic tensor analysis. In KDD, 2006.
Tang, Yichuan, Salakhutdinov, Ruslan, and Hinton, Geoffrey. Deep mixtures of factor analysers. In Proc. of ICML 2012, Edinburgh, Scotland, 2012.
Tenenbaum, Joshua B. and Freeman, William T. Separating style and content with bilinear models. Neural Computation, pp. 1247-1283, 2000.
Tucker, L. R. Implications of factor analysis of three-way matrices for measurement of change. In Harris, C. W. (ed.), Problems in measuring change, pp. 122-137. University of Wisconsin Press, Madison WI, 1963.
Vasilescu, M. A. O. and Terzopoulos, D. Multilinear analysis of image ensembles: Tensorfaces. In ECCV, pp. 447-460, 2002.
Verbeek, Jakob. Learning nonlinear image manifolds by global alignment of local linear models. IEEE Trans. Pattern Analysis and Machine Intelligence, 28:14, 2006.
Wang, Hongcheng and Ahuja, Narendra. Facial expression decomposition. In ICCV, pp. 958-965, 2003.
Wang, Jack M., Fleet, David J., and Hertzmann, Aaron. Multifactor gaussian process models for style-content separation. In ICML, pp. 975-982, 2007.
Xu, Zenglin, Yan, Feng, and Qi, Yuan. Infinite tucker decomposition: Nonparametric bayesian models for multiway data analysis. In Proc. of ICML, 2012, Edinburgh, Scotland, 2012.
Yang, Ming-Hsuan, Ahuja, Narendra, and Kriegman, David. Face detection using a mixture of factor analyzers. In ICIP, Kobe, Japan, 1999.
Zoran, D. and Weiss, Y. From learning models of natural image patches to whole image restoration. ICCV, 2011.
-----0
Balasubramanian, M., Shwartz, E. L., Tenenbaum, J. B., de Silva, V., and Langford, J. C. The isomap algorithm and topological stability. Science, 2002.
Cook, JA, Sutskever, I., Mnih, A., and Hinton, GE. Visualizing similarity data with a mixture of maps. Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2007.
Davis, Jason V., Kulis, Brian, Jain, Prateek, Sra, Suvrit, and Dhillon, Inderjit S. Information-theoretic metric learning. In Proceedings of the International Conference on Machine Learning (ICML), 2007.
Globerson, Amir and Roweis, Sam T. Metric learning by collapsing classes. In Advances in Neural Information Processing Systems (NIPS), 2006.
Goldberger, Jacob, Roweis, Sam T., Hinton, Geoffrey E., and Salakhutdinov, Ruslan. Neighbourhood components analysis. In Advances in Neural Information Processing Systems (NIPS), 2004.
Hinton, Geoffrey E. and Roweis, Sam T. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems (NIPS), 2002.
Roweis, Sam T. and Saul, Lawrence K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2000.
Salakhutdinov, Ruslan and Hinton, Geoffrey. Learning a nonlinear embedding by preserving class neighbourhood structure. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 11, 2007.
Shen, Chunhua, Kim, Junae, Wang, Lei, and van den Hengel, Anton. Positive semidefinite metric learning with boosting. In Advances in Neural Information Processing Systems (NIPS). 2009.
Tarlow, Daniel, Swersky, Kevin, Zemel, Richard S., Adams, Ryan P., and Frey, Brendan J. Fast exact inference for recursive cardinality models. In Uncertainty in Artificial Intelligence (UAI), 2012.
van der Maaten, Laurens. Learning a parametric embedding by preserving local structure. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.
van der Maaten, L.J.P., Postma, E.O., and van den Herik, H.J. Dimensionality reduction: A comparative review. Technical Report TiCC-TR 2009-005, Tilburg University, 2009.
Weinberger, K.Q. and Saul, L.K. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research (JMLR), 2009.
Xiong, Caiming, Johnson, David, Xu, Ran, and Corso, Jason J. Random forests for metric learning with implicit pairwise position dependence. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (SIGKDD), New York, NY, USA, 2012. ACM.
Yang, Liu. Distance metric learning: A comprehensive survey. Technical report, Carnegie Mellon University, 2007.
-----0
Anandkumar, A., Foster, D. P., Hsu, D., Kakade, S. M., and Liu, Y. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. In Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, 2012a. MIT Press.
Anandkumar, A., Hsu, D., and Kakade, S. M. A method of moments for mixture models and hidden Markov models. In Conference on Learning Theory (COLT), 2012b.
Anandkumar, Anima, Ge, Rong, Hsu, Daniel, Kakade, Sham M., and Telgarsky, Matus. Tensor decompositions for learning latent variable models. CoRR, abs/1210.7559, 2012c.
Balle, B. and Mohri, M. Spectral learning of general weighted automata via constrained matrix completion. In Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, 2012. MIT Press.
Balle, B., Quattoni, A., and Carreras, X. A spectral learning algorithm for finite state transducers. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2011.
Candes, E. J., Strohmer, T., and Voroninski, V. Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Technical report, ArXiv, 2011.
Cohen, S. B., Stratos, K., Collins, M., Foster, D. P., and Ungar, L. Spectral learning of latent-variable PCFGs. In Association for Computational Linguistics (ACL). Association for Computational Linguistics, 2012.
Grant, Michael and Boyd, Stephen. CVX: Matlab software for disciplined convex programming, version 2.0 beta. http://cvxr.com/cvx, September 2012.
Hsu, D. and Kakade, S. M. Learning mixtures of spherical gaussians: Moment methods and spectral decompositions. In Innovations in Theoretical Computer Science (ITCS), 2013.
Hsu, D., Kakade, S. M., and Zhang, T. A spectral algorithm for learning hidden Markov models. In Conference on Learning Theory (COLT), 2009.
Hsu, D., Kakade, S. M., and Liang, P. Identifiability and unmixing of latent parse trees. In Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, 2012. MIT Press.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. Adaptive mixtures of local experts. Neural Computation, 3:79-87, 1991.
Liang, P., Bouchard-Cote, A., Klein, D., and Taskar, B. An end-to-end discriminative approach to machine translation. In International Conference on Computational Linguistics and Association for Computational Linguistics (COLING/ACL), Sydney, Australia, 2006. Association for Computational Linguistics.
Negahban, S. and Wainwright, M. J. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. ArXiv e-prints, December 2009.
Ohlsson, H., Yang, A., Dong, R., and Sastry, S. CPRL - an extension of compressive sensing to the phase retrieval problem. In Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, 2012. MIT Press.
Parikh, A., Song, L., Ishteva, M., Teodoru, G., and Xing, E. A spectral algorithm for latent junction trees. In Uncertainty in Artificial Intelligence (UAI), 2012.
Petrov, S. and Klein, D. Discriminative log-linear grammars with latent variables. In Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, 2008. MIT Press.
Quattoni, A., Collins, M., and Darrell, T. Conditional random fields for object recognition. In Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, 2004. MIT Press.
Song, L., Boots, B., Siddiqi, S., Gordon, G., and Smola, A. Hilbert space embeddings of hidden Markov models. In International Conference on Machine Learning (ICML), Haifa, Israel, 2010. Omnipress.
Tomioka, R., Suzuki, T., Hayashi, K., and Kashima, H. Statistical performance of convex tensor decomposition. Advances in Neural Information Processing Systems (NIPS), pp. 137, 2011.
Viele, Kert and Tong, Barbara. Modeling with mixtures of linear regressions. Statistics and Computing, 12:315-330, 2002. ISSN 0960-3174. doi: 10.1023/A:1020779827503. URL http://dx.doi.org/10.1023/A%3A1020779827503.
Wang, Y. and Mori, G. Max-margin hidden conditional random fields for human action recognition. In Computer Vision and Pattern Recognition (CVPR), 2009.
-----0
Bradski, G. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.
Collins, Michael, Schapire, Robert E., and Singer, Yoram. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1-3):253-285, 2002.
Copas, J. B. Regression, prediction and shrinkage. Journal of the Royal Statistical Society, Series B (Methodological), 45(3):311-354, 1983.
Freund, Yoav. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256-285, 1995.
Freund, Yoav and Schapire, Robert E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119-139, 1997.
Friedman, Jerome H. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189-1232, 2000.
Goldreich, Oded and Levin, Leonid. A hard-core predicate for all one-way functions. In STOC, pp. 25-32, 1989.
Impagliazzo, Russell. Hard-core distributions for somewhat hard problems. In FOCS, pp. 538-545, 1995.
Kearns, Michael and Valiant, Leslie. Cryptographic limitations on learning finite automata and boolean formulae. In STOC, pp. 433-444, 1989.
Mukherjee, Indraneel, Rudin, Cynthia, and Schapire, Robert. The convergence rate of AdaBoost. In COLT, 2011.
Nocedal, Jorge and Wright, Stephen J. Numerical optimization. Springer, 2nd edition, 2006.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
Ratsch, G., Onoda, T., and Muller, K.-R. Soft margins for AdaBoost. Machine Learning, 42:287-320, 2001.
Ratsch, Gunnar and Warmuth, Manfred. Efficient margin maximizing with boosting. Journal of Machine Learning Research, 6:2153-2175, 2005.
Reyzin, Lev and Schapire, Robert E. How boosting the margin can also boost classifier complexity. In Proceedings of the 23rd International Conference on Machine Learning, pp. 753-760, 2006.
Rudin, Cynthia, Daubechies, Ingrid, and Schapire, Robert E. The dynamics of AdaBoost: cyclic behavior and convergence of margins. Journal of Machine Learning Research, 5:1557-1595, 2004.
Rudin, Cynthia, Schapire, Robert E., and Daubechies, Ingrid. Analysis of boosting algorithms using the smooth margin function. Annals of Statistics, 35(6):2723-2768, 2007.
Schapire, Robert E. and Freund, Yoav. Boosting: Foundations and Algorithms. MIT Press, 2012.
Schapire, Robert E. and Singer, Yoram. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297-336, 1999.
Schapire, Robert E., Freund, Yoav, Bartlett, Peter, and Lee, Wee Sun. Boosting the margin: A new explanation for the effectiveness of voting methods. In ICML, pp. 322-330, 1997.
Shalev-Shwartz, Shai and Singer, Yoram. On the equivalence of weak learnability and linear separability: New relaxations and efficient boosting algorithms. In COLT, pp. 311-322, 2008.
Steele, J. Michael. The Cauchy-Schwarz Master Class. Cambridge University Press, 2004.
Telgarsky, Matus. A primal-dual convergence analysis of boosting. 2012. arXiv:1101.4752v3 [cs.LG].
Warmuth, Manfred K., Liao, Jun, and Ratsch, Gunnar. Totally corrective boosting algorithms that maximize the margin. In ICML, pp. 1001-1008, 2006.
Zhang, Tong and Yu, Bin. Boosting with early stopping: Convergence and consistency. The Annals of Statistics, 33:1538-1579, 2005.
-----0
Agrawal, Rajeev. The Continuum-Armed Bandit Problem. SIAM Journal on Control and Optimization, 33(6):1926-1951, November 1995. ISSN 03630129. doi: 10.1137/S0363012992237273.
Auer, Peter, Cesa-Bianchi, N, and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, pp. 235-256, 2002.
Auer, Peter, Ortner, Ronald, and Szepesvari, C. Improved rates for the stochastic continuum-armed bandit problem. Learning Theory, 2007.
Bubeck, Sebastien, Munos, Remi, and Stoltz, Gilles. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19):1832-1852, April 2011a. ISSN 03043975. doi: 10.1016/j.tcs.2010.12.059.
Bubeck, Sebastien, Munos, Remi, Stoltz, Gilles, and Szepesvari, Csaba. X-Armed Bandits. Journal of Machine Learning Research, 12:1655-1695, 2011b.
Garnett, Roman, Krishnamurthy, Yamuna, Wang, Donghan, Schneider, Jeff, and Mann, Richard. Bayesian optimal active search on graphs. In Workshop on Mining and Learning with Graphs, 2011. ISBN 9781450308342.
Jones, Donald R. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345-383, 2001.
Jones, Donald R., Schonlau, Matthias, and Welch, William J. Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization, 13(4), 1998. ISSN 0925-5001.
Kleinberg, Robert. Nearly tight bounds for the continuum-armed bandit problem. In Advances in Neural Information Processing Systems, pp. 697-704, 2004.
Kleinberg, Robert and Upfal, Eli. Multi-Armed Bandits in Metric Spaces. In STOC '08 Proceedings of the 40th annual ACM symposium on Theory of computing, pp. 681-690, 2008. ISBN 9781605580470.
Minka, Thomas P. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001.
Mockus, J, Tiesis, V, and Zilinskas, A. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2:117-129, 1978.
Nickisch, Hannes and Rasmussen, CE. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035-2078, 2008.
Rasmussen, Carl Edward and Williams, Christopher K. I. Gaussian Processes for Machine Learning. URL http://www.gaussianprocess.org/gpml/code/gpml-matlab.tar.gz.
Rasmussen, Carl Edward and Williams, Christopher K. I. Gaussian Processes for Machine Learning. The MIT Press, 2006. ISBN 026218253X.
Settles, Burr. Active Learning Literature Survey. Technical report, University of Wisconsin-Madison, 2009.
Tesch, Matthew, O'Neill, Alex, and Choset, Howie. Using Kinesthetic Input to Overcome Obstacles with Snake Robots. In International Symposium on Safety, Security, and Rescue Robotics, 2012.
Žilinskas, Antanas. A review of statistical models for global optimization. Journal of Global Optimization, 2(2):145-153, June 1992. ISSN 0925-5001. doi: 10.1007/BF00122051.
Wright, C., Buchan, A., Brown, B., Geist, J., Schwerin, M., Rollinson, D., Tesch, M., and Choset, H. Design and Architecture of the Unified Modular Snake Robot. In 2012 IEEE International Conference on Robotics and Automation, St. Paul, MN, 2012.
-----0
Abernethy, J., Chapelle, O., and Castillo, C. Graph regularization methods for web spam detection. Machine Learning, 81(2):207-225, 2010.
Adamic, L.A. and Glance, N. The political blogosphere and the 2004 US election: divided they blog. In Proceedings of the 3rd international workshop on Link discovery, pp. 36-43. ACM, 2005.
Bruckner, M. and Scheffer, T. Nash equilibria of static prediction games. In Advances in Neural Information Processing Systems 22, 2009.
Bruckner, M. and Scheffer, T. Stackelberg games for adversarial prediction problems. In Proceedings of the Seventeenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, 2011.
Chau, D., Pandit, S., and Faloutsos, C. Detecting fraudulent personalities in networks of online auctioneers. Knowledge Discovery in Databases: PKDD 2006, pp. 103-114, 2006.
Dalvi, N., Domingos, P., Mausam, Sanghai, S., and Verma, D. Adversarial classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 99-108, Seattle, WA, 2004. ACM Press.
Domingos, P. and Lowd, D. Markov Logic: An Interface Layer for AI. Morgan & Claypool, San Rafael, CA, 2009.
Drost, I. and Scheffer, T. Thwarting the nigritude ultramarine: Learning to identify link spam. In Proceedings of the Sixteenth European Conference on Machine Learning, pp. 96-107. Springer, 2005.
El Ghaoui, L., Lanckriet, G.R.G., and Natsoulis, G. Robust classification with interval data. Computer Science Division, University of California, 2003.
Globerson, A. and Roweis, S. Nightmare at test time: robust learning by feature deletion. In Proceedings of the Twenty-Third International Conference on Machine Learning, pp. 353-360, Pittsburgh, PA, 2006. ACM Press.
Huynh, T.N. and Mooney, R.J. Max-margin weight learning for Markov logic networks. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD-09), Bled, pp. 564-579. Springer, 2009.
Kim, S.J., Magnani, A., and Boyd, S. Robust Fisher discriminant analysis. Advances in Neural Information Processing Systems, 18:659, 2006.
Kolmogorov, V. and Zabih, R. What energy functions can be minimized via graph cuts? Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(2):147-159, 2004.
Lanckriet, G.R.G., Cristianini, N., Bartlett, P., Ghaoui, L.E., and Jordan, M.I. Learning the kernel matrix with semidefinite programming. The Journal of Machine Learning Research, 5:27-72, 2004.
Laskov, P. and Lippmann, R. Machine learning in adversarial environments. Machine Learning, 81(2):115-119, 2010.
Lowd, D. and Domingos, P. Efficient weight learning for Markov logic networks. In Proceedings of the Eleventh European Conference on Principles and Practice of Knowledge Discovery in Databases, pp. 200-211, Warsaw, Poland, 2007. Springer.
Lowd, D. and Meek, C. Good word attacks on statistical spam filters. In Proceedings of the Second Conference on Email and Anti-Spam (CEAS), pp. 125-132, 2005a.
Lowd, D. and Meek, C. Adversarial learning. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 641-647. ACM, 2005b. ISBN 159593135X.
Nelson, B., Rubinstein, B.I.P., Huang, L., Joseph, A.D., Lau, S., Lee, S.J., Rao, S., Tran, A., and Tygar, JD. Near-optimal evasion of convex-inducing classifiers. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, volume 9, Chia Laguna Resort, Sardinia, Italy, 2010.
Peng, H., Long, F., and Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(8):1226-1238, 2005.
Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., and Eliassi-Rad, T. Collective classification in network data. AI Magazine, 29(3):93, 2008. ISSN 0738-4602.
Taskar, B., Chatalbashev, V., and Koller, D. Learning associative Markov networks. In Proceedings of the twenty-first international conference on machine learning. ACM Press, 2004a.
Taskar, B., Wong, M. F., Abbeel, P., and Koller, D. Max-margin Markov networks. In Thrun, S., Saul, L., and Scholkopf, B. (eds.), Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004b.
Teo, C.H., Globerson, A., Roweis, S., and Smola, A. Convex learning with invariances. In Advances in Neural Information Processing Systems 21, 2008.
-----0
Acemoglu, Daron, Ozdaglar, Asuman, and ParandehGheibi, Ali. Spread of (mis)information in social networks. Games and Economic Behavior, 70(2):194-227, 2010.
Airoldi, Edo, Toulis, Panos, Kao, Edward, and Rubin, Donald B. Estimation of causal peer influence effects. Forthcoming, 2013.
Aronow, P. M. and Samii, C. Estimating average causal effects under general interference. working paper, 2013.
Bakshy, Eytan, Eckles, Dean, Yan, Rong, and Rosenn, Itamar. Social influence in social advertising: Evidence from field experiments. CoRR, abs/1206.4327, 2012.
Bond, Robert M, Fariss, Christopher J, Jones, Jason J, Kramer, Adam DI, Marlow, Cameron, Settle, Jaime E, and Fowler, James H. A 61-million-person experiment in social influence and political mobilization. Nature, 489(7415):295-298, 2012.
Butts, C.T. Inference, error, and informant (in)accuracy. Social Networks, 25:103-140, 2003.
Hudgens, M.G. and Halloran, M.E. Toward causal inference with interference. Journal of the American Statistical Association, 103(482):832-842, 2008.
Manski, C.F. Identification of endogenous social effects: The reflection problem. The Review of Economic Studies, 60(3):531-542, 1993.
Mednick, S.C., Christakis, N.A., and Fowler, J.H. The spread of sleep behavior influences drug use in adolescent social networks. PLoS One, 5(3):e9775, 2010.
Ostrovsky, M. and Schwarz, M. Reserve prices in internet advertising auctions: A field experiment. Available at SSRN 1573947, (2054), 2010.
Parker, B.M. Design of network experiments. Available at http://www.newton.ac.uk/programmes/DAE/seminars/090111301.pdf, 2011.
Pearl, Judea. Causality: models, reasoning and inference, volume 29. Cambridge Univ Press, 2000.
Perry, P.O. and Wolfe, P.J. Point process modeling for directed interaction networks. In arXiv:1011.1703v1 [stat.ME], 2010.
Rosenbaum, P.R. Interference between units in randomized experiments. Journal of the American Statistical Association, 102(477):191-200, 2007.
Rubin, D.B. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.
Rubin, D.B. [On the application of probability theory to agricultural experiments. Essay on principles. Section 9.] Comment: Neyman (1923) and causal inference in experiments and observational studies. Statistical Science, 5(4):472-480, 1990.
Shah, Devavrat and Zaman, Tauhid. Rumors in a network: Who's the culprit? IEEE Transactions on Information Theory, 57(8):5163-5181, 2011.
Shalizi, C. and Thomas, A. Homophily and contagion are generically confounded in observational social network studies. Sociological Methods and Research, 40:211-239, 2011.
Sobel, M.E. What do randomized studies of housing mobility demonstrate? Journal of the American Statistical Association, 101(476):1398-1407, 2006.
Spirtes, Peter, Glymour, Clark, and Scheines, Richard. Causation, prediction, and search, volume 81. MIT Press, 2001.
Tchetgen, E.J.T. and VanderWeele, T.J. On causal inference in the presence of interference. Statistical Methods in Medical Research, 21(1):55-75, 2012.
Ugander, J., Karrer, B., Backstrom, L., and Kleinberg, J. Network exposure to multiple universes. In NIPS Workshop: Social network and social media analysis: Methods, models and applications, Lake Tahoe, NV, 2012.
Aiello, William, Chung, Fan, and Lu, Linyuan. A random graph model for power law graphs. Experimental Mathematics, 10(1):53-66, 2001.
Zachary, W.W. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, pp. 452-473, 1977.
-----0
JR Ashford and RR Sowden. Multi-variate probit analysis. Biometrics, pages 535-546, 1970.
U. Bockenholt. Thurstonian-based analyses: past, present, and future utilities. Psychometrika, 71(4):615-629, 2006.
O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded relevance. In CIKM, pages 621-630. ACM, 2009.
S. Chib and E. Greenberg. Analysis of multivariate probit models. Biometrika, 85(2):347-361, 1998.
O.B. Downs, D.J.C. MacKay, and D.D. Lee. The nonnegative Boltzmann machine. Advances in Neural Information Processing Systems, 12:428-434, 2000.
D.B. Dunson and A.H. Herring. Bayesian latent variable models for mixed discrete outcomes. Biostatistics, 6(1):11, 2005.
Y. Freund and D. Haussler. Unsupervised learning of distributions on binary vectors using two layer networks. Advances in Neural Information Processing Systems, pages 912-919, 1993.
P.V. Gehler, A.D. Holub, and M. Welling. The rate adapting Poisson model for information retrieval and object recognition. In Proceedings of the ICML, pages 337-344, 2006.
J. Geweke. Efficient simulation from the multivariate normal and student-t distributions subject to linear constraints and the evaluation of constraint probabilities. In Computing science and statistics: Proceedings of the 23rd symposium on the interface, pages 571-578, 1991.
G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.
K. Jarvelin and J. Kekalainen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):446, 2002.
E. Khan, B. Marlin, and K. Murphy. Variational bounds for mixed-data factor analysis. In Proc. of Neural Information Processing Systems, 2010.
A. Kottas, P. Muller, and F. Quintana. Nonparametric Bayesian modeling for multivariate ordinal data. Journal of Computational and Graphical Statistics, 14(3):610-625, 2005.
N. Le Roux, N. Heess, J. Shotton, and J. Winn. Learning a generative model of images by factoring appearance and shape. Neural Computation, 23(3):593-650, 2011.
J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A.Y. Ng. Multimodal deep learning. In ICML, 2011.
M.A. Ranzato and G.E. Hinton. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In CVPR, pages 2551-2558. IEEE, 2010.
C.P. Robert. Simulation of truncated normal variables. Statistics and Computing, 5(2):121-125, 1995.
R. Salakhutdinov and G. Hinton. Deep Boltzmann Machines. In Proceedings of 20th AISTATS, volume 5, pages 448-455, 2009a.
R. Salakhutdinov and G. Hinton. Replicated softmax: an undirected topic model. Advances in Neural Information Processing Systems, 22, 2009b.
R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th ICML, pages 791-798, 2007.
Y. Shi, M. Larson, and A. Hanjalic. List-wise learning to rank with matrix factorization for collaborative filtering. In ACM RecSys, pages 269-272. ACM, 2010.
P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Parallel distributed processing: Explorations in the microstructure of cognition, 1:194-281, 1986.
N. Srivastava and R. Salakhutdinov. Multimodal learning with deep Boltzmann machines. In NIPS, pages 2231-2239, 2012.
L.L. Thurstone. A law of comparative judgment. Psychological review, 34(4):273, 1927.
T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th ICML, pages 1064-1071, 2008.
M.E. Tipping and C.M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B, 61(3):611-622, 1999.
T. Tran, D.Q. Phung, and S. Venkatesh. Mixed-variate restricted Boltzmann machines. In Proc. of 3rd Asian Conference on Machine Learning (ACML), Taoyuan, Taiwan, 2011.
T. Tran, D.Q. Phung, and S. Venkatesh. Cumulative restricted Boltzmann machines for ordinal matrix data analysis. In Proc. of 4th Asian Conference on Machine Learning (ACML), Singapore, 2012.
T. Truyen, D.Q. Phung, and S. Venkatesh. Probabilistic models over ordered partitions with applications in document ranking and collaborative filtering. In Proc. of SIAM Conference on Data Mining (SDM), Mesa, Arizona, USA, 2011. SIAM.
T.T. Truyen, D.Q. Phung, and S. Venkatesh. Ordinal Boltzmann machines for collaborative filtering. In Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), Montreal, Canada, June 2009.
L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(2579-2605):85, 2008.
M. Wedel and W.A. Kamakura. Factor analysis with (mixed) observed and latent variables in the exponential family. Psychometrika, 66(4):515-530, 2001.
E. Xing, R. Yan, and A.G. Hauptmann. Mining associated text and images with dual-wing harmoniums. In Proceedings of the 21st UAI, 2005.
M. Yasuda and K. Tanaka. Boltzmann machines with bounded continuous random variables. Interdisciplinary Information Sciences, 13(1):25-31, 2007.
L. Younes. Parametric inference for imperfectly observed Gibbsian fields. Probability Theory and Related Fields, 82(4):625-645, 1989.
X. Zhang, W.J. Boscardin, and T.R. Belin. Bayesian analysis of multivariate nominal measures using multivariate multinomial probit models. Computational Statistics & Data Analysis, 52(7):3697-3708, 2008.
-----0
Banerjee, O., El Ghaoui, L., and d'Aspremont, A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9:485-516, 2008.
Beck, A. and Teboulle, M. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM J. Imaging Sciences, 2(1):183-202, 2009.
Bertsekas, D.P. and Tsitsiklis, J. N. Parallel and distributed computation: Numerical methods. Prentice Hall, 1989.
Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, Cambridge, 2004.
Dempster, A. P. Covariance selection. Biometrics, 28:157-175, 1972.
Hsieh, C. J., Sustik, M.A., Dhillon, I.S., and Ravikumar, P. Sparse inverse covariance matrix estimation using quadratic approximation. Advances in Neural Information Processing Systems (NIPS), 24:1-18, 2011.
Lee, J.D., Sun, Y., and Saunders, M.A. Proximal Newton-type methods for convex optimization. Tech. Report, pp. 125, 2012.
Li, L. and Toh, K.C. An inexact interior point method for l1-regularized sparse covariance selection. Mathematical Programming Computation, 2(3):291-315, 2010.
Lu, Z. Adaptive first-order methods for general sparse inverse covariance selection. SIAM Journal on Matrix Analysis and Applications, 31(4):2000-2016, 2010.
Nesterov, Y. Introductory lectures on convex optimization: a basic course, volume 87 of Applied Optimization. Kluwer Academic Publishers, 2004.
Nesterov, Y. Gradient methods for minimizing composite objective function. CORE Discussion paper, 76, 2007.
Nesterov, Y. and Nemirovski, A. Interior-point Polynomial Algorithms in Convex Programming. Society for Industrial Mathematics, 1994.
Olsen, P.A., Oztoprak, F., Nocedal, J., and Rennie, S.J. Newton-like methods for sparse inverse covariance estimation. Optimization Online, 2012.
Ravikumar, P., Wainwright, M. J., Raskutti, G., and Yu, B. High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electron. J. Statist., 5:935-988, 2011.
Rolfs, B., Rajaratnam, B., Guillot, D., Wong, I., and Maleki, A. Iterative thresholding algorithm for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems 25, pp. 1583-1591, 2012.
Scheinberg, K. and Rish, I. Sinco-a greedy coordinate ascent method for sparse inverse covariance selection problem. preprint, 2009.
Scheinberg, K., Ma, S., and Goldfarb, D. Sparse inverse covariance selection via alternating linearization methods. arXiv preprint arXiv:1011.0097, 2010.
Yuan, X. Alternating direction method for covariance selection models. Journal of Scientific Computing, 51(2):261-273, 2012.
-----0
Audibert, J.Y., Munos, R., and Szepesvari, Cs. Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science, 2008.
Audibert, J.Y., Bubeck, S., and Munos, R. Best arm identification in multi-armed bandits. In COLT, Haifa (Israel), 2010. Omnipress.
Auer, P. and Ortner, R. UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem. Period. Math. Hungar., 61(1-2):55-65, 2011.
Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2-3):235-256, 2002.
Bubeck, S., Munos, R., and Stoltz, G. Pure exploration in finitely-armed and continuous-armed bandits. Theor. Comput. Sci., 412(19):1832-1852, 2011.
Chapelle, O., Joachims, T., Radlinski, F., and Yue, Y. Large-scale validation and analysis of interleaved search evaluation. ACM Trans. Inf. Syst., 30(1):6, 2012.
Charon, I. and Hudry, O. An updated survey on the linear ordering problem for weighted or unweighted tournaments. Annals OR, 175(1):107-158, 2010.
Chevaleyre, Y., Endriss, U., Lang, J., and Maudet, N. A short introduction to computational social choice. In SOFSEM, volume 4362 of LNCS, pp. 51-69. Springer-Verlag, 2007.
Di Castro, D., Gentile, C., and Mannor, S. Bandits with an edge. CoRR, abs/1109.2296, 2011.
Even-Dar, E., Mannor, S., and Mansour, Y. PAC bounds for multi-armed bandit and Markov decision processes. In COLT, pp. 255-270, London, UK, 2002. Springer-Verlag.
Even-Dar, E., Mannor, S., and Mansour, Y. Action elimination and stopping conditions for the multiarmed bandit and reinforcement learning problems. JMLR, 7:1079-1105, 2006.
Feige, U., Raghavan, P., Peleg, D., and Upfal, E. Computing with noisy information. SIAM J. Comput., 23(5):1001-1018, 1994.
Gardner, M. Mathematical games: The paradox of the nontransitive dice and the elusive principle of indifference. Scientific American, 223:110-114, Dec. 1970.
Hoeffding, W. Probability inequalities for sums of bounded random variables. J. of the American Statistical Association, 58(301):13-30, 1963.
Joachims, T. Evaluating retrieval performance using clickthrough data. In Text Mining, pp. 79-96. Physica/Springer Verlag, 2003.
Lai, T.L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.
Ravikumar, B., Ganesan, K., and Lakshmanan, K.B. On selecting the largest element in spite of erroneous information. In STACS, volume 247 of LNCS, pp. 88-99. Springer Berlin Heidelberg, 1987.
Saltelli, A., Chan, K., and Scott, E.M. (eds.). Sensitivity analysis. Wiley series in probability and statistics. J. Wiley & sons, New York, Chichester, Weinheim, 2000.
Yue, Y. and Joachims, T. Interactively optimizing information retrieval systems as a dueling bandits problem. In ICML, volume 382 of ACM Proceeding Series, pp. 1201-1208. ACM, 2009.
Yue, Y. and Joachims, T. Beat the mean bandit. In ICML, pp. 241-248. Omnipress, 2011.
Yue, Y., Broder, J., Kleinberg, R., and Joachims, T. The k-armed dueling bandits problem. J. Comput. Syst. Sci., 78(5):1538-1556, 2012.
-----0
Auer, Peter, Cesa-Bianchi, Nicolò, and Fischer, Paul. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47(2-3):235-256, 2002.
Bubeck, Sebastien, Munos, Remi, Stoltz, Gilles, and Szepesvari, Csaba. Online Optimization of X-armed Bandits. In Advances in Neural Information Processing Systems, pp. 201-208, 2008.
Bubeck, Sebastien, Munos, Remi, and Stoltz, Gilles. Pure Exploration in Multi-armed Bandits Problems. Algorithmic Learning Theory, pp. 23-37, 2009.
Bubeck, Sebastien, Munos, Remi, Stoltz, Gilles, and Szepesvari, Csaba. X-armed bandits. Journal of Machine Learning Research, 12:1587-1627, 2011a.
Bubeck, Sebastien, Stoltz, Gilles, and Yuan, YuJia. Lipschitz bandits without the Lipschitz constant. In Algorithmic Learning Theory, pp. 144-158. Springer, 2011b.
Bull, Adam. Convergence rates of efficient global optimization algorithms. The Journal of Machine Learning Research, 12:2879-2904, 2011.
Coquelin, Pierre-Arnaud and Munos, Remi. Bandit Algorithms for Tree Search. In Uncertainty in Artificial Intelligence, pp. 67-74, 2007.
Gelly, Sylvain, Kocsis, Levente, Schoenauer, Marc, Sebag, Michèle, Silver, David, Szepesvari, Csaba, and Teytaud, Olivier. The grand challenge of computer Go: Monte Carlo tree search and extensions. Commun. ACM, 55(3):106-113, March 2012.
Hansen, Eldon and Walster, William. Global Optimization Using Interval Analysis: Revised and Expanded. Pure and Applied Mathematics Series. Marcel Dekker, 2004.
Hren, Jean-Francois and Munos, Remi. Optimistic Planning of Deterministic Systems. In European Workshop on Reinforcement Learning, pp. 151-164, 2008.
Jones, David, Perttunen, Cary, and Stuckman, Bruce. Lipschitzian optimization without the Lipschitz constant. Journal of Optimization Theory and Applications, 79(1):157-181, 1993.
Kearfott, R Baker. Rigorous Global Search: Continuous Problems. Nonconvex Optimization and Its Applications. Springer, 1996.
Kleinberg, Robert, Slivkins, Alexander, and Upfal, Eli. Multi-armed bandit problems in metric spaces. In Proceedings of the 40th ACM symposium on Theory Of Computing, pp. 681-690, 2008.
Kocsis, Levente and Szepesvari, Csaba. Bandit based Monte-Carlo Planning. In Proceedings of the 15th European Conference on Machine Learning, pp. 282-293. Springer, 2006.
Munos, Remi. Optimistic Optimization of Deterministic Functions without the Knowledge of its Smoothness. In Advances in Neural Information Processing Systems, pp. 783-791, 2011.
Neumaier, Arnold. Interval Methods for Systems of Equations. Encyclopedia of Mathematics and its Applications. Cambridge University Press, 2008.
Osborne, Michael. Bayesian Gaussian processes for sequential prediction, optimisation and quadrature. PhD thesis, University of Oxford, 2010.
Pinter, Janos. Global Optimization in Action: Continuous and Lipschitz Optimization: Algorithms, Implementations and Applications. Nonconvex Optimization and Its Applications. Springer, 1995.
Slivkins, Aleksandrs. Multi-armed bandits on implicit metric spaces. In Advances in Neural Information Processing Systems 24, pp. 1602-1610, 2011.
Srinivas, Niranjan, Krause, Andreas, Kakade, Sham, and Seeger, Matthias. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. Proceedings of International Conference on Machine Learning, pp. 1015-1022, 2010.
Strongin, Roman and Sergeyev, Yaroslav. Global Optimization with Non-Convex Constraints: Sequential and Parallel Algorithms. Nonconvex Optimization and Its Applications. Springer, 2000.
-----0
Bhattacharyya, C., Pannagadatta, K.S., and Smola, A.J. A second order cone programming formulation for classifying missing data. In Advances in Neural Information Processing Systems, pp. 153-160, 2004.
Bishop, C.M. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108-116, 1995.
Blitzer, J., Dredze, M., and Pereira, F. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Association for Computational Linguistics, volume 45, pp. 440, 2007.
Bruckner, M., Kanzow, C., and Scheffer, T. Static prediction games for adversarial learning problems. Journal of Machine Learning Research, 12:2617-2654, 2012.
Burges, C.J.C. and Scholkopf, B. Improving the accuracy and speed of support vector machines. Advances in Neural Information Processing Systems, 9:375-381, 1997.
Chapelle, O., Weston, J., Bottou, L., and Vapnik, V. Vicinal risk minimization. In Advances in Neural Information Processing Systems, pp. 416-422, 2000.
Chechik, G., Heitz, G., Elidan, G., Abbeel, P., and Koller, D. Max-margin classification of data with absent features. Journal of Machine Learning Research, 9(Jan):1-21, 2008.
Chen, M., Xu, Z., Weinberger, K.Q., and Sha, F. Marginalized denoising autoencoders for domain adaptation. In Proceedings of the International Conference on Machine Learning, pp. 767-774, 2012.
Coates, A., Lee, H., and Ng, A.Y. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the International Conference on Artificial Intelligence & Statistics, JMLR W&CP 15, pp. 215-223, 2011.
Cover, T. and Hart, P. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21-27, 1967.
Dekel, O. and Shamir, O. Learning to classify with missing and corrupted features. In Proceedings of the International Conference on Machine Learning, pp. 216-223, 2008.
Duda, R.O., Hart, P.E., and Stork, D.G. Pattern Classification. Wiley Interscience Inc., 2001.
Globerson, A. and Roweis, S. Nightmare at test time: Robust learning by feature deletion. In Proceedings of the International Conference on Machine Learning, pp. 353-360, 2006.
Glorot, X., Bordes, A., and Bengio, Y. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the International Conference on Machine Learning, pp. 513-520, 2011.
Herbrich, R. and Graepel, T. Invariant pattern recognition by semidefinite programming machines. In Advances in Neural Information Processing Systems, volume 16, pp. 33, 2004.
Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.R. Improving neural networks by preventing co-adaptation of feature detectors, 2012.
Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
Lawrence, N.D. and Scholkopf, B. Estimating a kernel Fisher discriminant in the presence of label noise. In Proceedings of the International Conference in Machine Learning, pp. 306-313, 2001.
LeCun, Y., Denker, J.S., and Solla, S.A. Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598-605, 1990.
Ng, A.Y. Feature selection, l1 vs. l2 regularization, and rotational invariance. In Proceedings of International Conference on Machine Learning, pp. 78-85, 2004.
Ranzato, M. and Hinton, G.E. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2551-2558, 2010.
Shivaswamy, P.K., Bhattacharyya, C., and Smola, A.J. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7:1283-1314, 2006.
Sietsma, J. and Dow, R.J.F. Creating artificial neural networks that generalize. Neural Networks, 4:67-79, 1991.
Sutton, C., Sindelar, M., and McCallum, A. Feature bagging: Preventing weight undertraining in structured discriminative learning. Technical Report IR-402, University of Massachusetts, 2005.
Teo, C.H., Globerson, A., Roweis, S., and Smola, A. Convex learning with invariances. Advances in Neural Information Processing Systems, 20:1489-1496, 2008.
Torralba, A., Fergus, R., and Freeman, W.T. 80 million tiny images: A large dataset for non-parametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958-1970, 2008.
Trafalis, T. and Gilbert, R. Robust support vector machines for classification and computational issues. Optimization Methods and Software, 22(1):187-198, 2007.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the International Conference on Machine Learning, pp. 1096-1103, 2008.
Webb, A.R. Functional approximation by feed-forward networks: a least-squares approach to generalization. IEEE Transactions on Neural Networks, 5(3):363-371, 1994.
Xu, H., Caramanis, C., and Mannor, S. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10:1485-1510, 2009.
Zhu, J., Rosset, S., Zou, H., and Hastie, T. Multi-class AdaBoost. Technical Report 430, Department of Statistics, University of Michigan, 2006.
-----0
Andre, D., Friedman, N., and Parr, R. (1998). Generalized Prioritized Sweeping. Advances in Neural Information Processing Systems, 10:1001-1007.
Barto, A. G., Bradtke, S. J., and Singh, S.P. (1995). Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1):81-138.
Bellman, R. (1957). A Markovian decision process. Journal of Mathematical Mechanics, 6:679-684.
Bonet, B. and Geffner, H. (2003). Labeled RTDP: Improving the Convergence of Real-Time Dynamic Programming. ICAPS, 12-21.
Brafman, R. and Tennenholtz, M. (2002). R-max: a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213-231.
Grzes, M. and Hoey, J. (2011). Efficient planning in R-max. AAMAS.
Kaelbling, L.P., Littman, M.L., and Moore, A.P. (1996). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4:237-285.
McMahan, H.B. and Gordon, G.J. (2005). Fast Exact Planning in Markov Decision Processes. ICAPS.
Moore, A. and Atkeson, C. (1993). Prioritized Sweeping: Reinforcement Learning with Less Data and Less Real Time. Machine Learning, 13:103-130.
Peng, J. and Williams, R.J. (1993). Efficient Learning and Planning Within the Dyna Framework. Adaptive Behavior, 1(4):437-454.
Rao, K. and Whiteson, S. (2012). V-MAX: Tempered Optimism for Better PAC Reinforcement Learning. AAMAS, 375-382.
Sutton, R.S. and Barto, A.G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts.
Van Seijen, H, Whiteson, S, van Hasselt, H, and Wiering, M. (2011). Exploiting Best-Match Equations for Efficient Reinforcement Learning. Journal of Machine Learning Research, 12:2045-2094.
Wiering, M. and Schmidhuber, J. (1998). Efficient Model-Based Exploration. SAB, 223-228.
-----0
Bchir, Ouiem and Frigui, Hichem. Fuzzy relational kernel clustering with local scaling parameter learning. In Proc. of the 2010 IEEE Int. Workshop on Machine Learning for Signal Processing (MLSP10), pp. 289-294, Kittila, Finland, Aug. 29 - Sep. 1 2010.
Belkin, Mikhail and Niyogi, Partha. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, Jun. 2003.
Belkin, Mikhail, Niyogi, Partha, and Sindhwani, Vikas. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399-2434, Nov. 2006.
Brent, Richard P. Algorithms for Minimization without Derivatives. Prentice-Hall, Englewood Cliffs, N.J., 1973.
Carreira-Perpinan, Miguel A. Fast nonparametric clustering with Gaussian blurring mean-shift. In Proc. of the 23rd Int. Conf. Machine Learning (ICML 2006), pp. 153-160, Pittsburgh, PA, Jun. 25-29 2006.
Carreira-Perpinan, Miguel A. The elastic embedding algorithm for dimensionality reduction. In Proc. of the 27th Int. Conf. Machine Learning (ICML 2010), pp. 167-174, Haifa, Israel, Jun. 21-25 2010.
de Silva, V. and Tenenbaum, Joshua B. Global versus local approaches to nonlinear dimensionality reduction. In NIPS, volume 15, pp. 721-728, MIT Press, Cambridge, MA, 2003.
Duong, Tarn and Hazelton, Martin L. Convergence rates for unconstrained bandwidth matrix selectors in multivariate kernel density estimation. J. Multivariate Analysis, 93(2):417-433, Apr. 2005.
Er, Meng Joo, Wu, Shiqian, Lu, Juwei, and Toh, Hock Lye. Face recognition with radial basis function (RBF) neural networks. IEEE Trans. Neural Networks, 13(3):697-710, May 2002.
Gander, Walter. On Halley's iteration method. Amer. Math. Monthly, 92(2):131-134, Feb. 1985.
Hinton, Geoffrey and Roweis, Sam T. Stochastic neighbor embedding. In NIPS, volume 15, pp. 857-864. MIT Press, Cambridge, MA, 2003.
Manning, Christopher D. and Schutze, Hinrich. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.
Melman, A. Geometry and convergence of Euler's and Halley's methods. SIAM Review, 39(4):728-735, Dec. 1997.
Nene, S. A., Nayar, S. K., and Murase, H. Columbia object image library (COIL-20). Technical Report CUCS-005-96, Dept. of Computer Science, Columbia University, Feb. 1996.
Ng, A. Y., Jordan, M. I., and Weiss, Y. On spectral clustering: Analysis and an algorithm. In NIPS, volume 14, pp. 849-856. MIT Press, Cambridge, MA, 2002.
Raykar, Vikas C. and Duraiswami, Ramani. The improved fast Gauss transform with applications to machine learning. In Large Scale Kernel Machines, Neural Information Processing Series, pp. 175-202. MIT Press, 2007.
Raykar, Vikas C., Duraiswami, Ramani, and Zhao, Linda H. Fast computation of kernel estimators. Journal of Computational and Graphical Statistics, 19(1):205-220, 2010.
Ridders, C. J. F. A new algorithm for computing a single root of a real continuous function. IEEE Trans. Circuits and Systems, 26(11):979-980, Nov. 1979.
Roweis, Sam T. and Saul, Lawrence K. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, Dec. 22 2000.
Scavo, T. R. and Thoo, J. B. On the geometry of Halley's method. Amer. Math. Monthly, 102(5):417-426, May 1995.
Sheather, Simon J. Density estimation. Statistical Science, 19(4):588-597, Nov. 2004.
Traub, J. F. Iterative Methods for the Solution of Equations. Prentice-Hall, second edition, 1982.
Vladymyrov, Max and Carreira-Perpinan, Miguel A. Partial-Hessian strategies for fast learning of nonlinear embeddings. In Proc. of the 29th Int. Conf. Machine Learning (ICML 2012), pp. 345-352, Edinburgh, Scotland, Jun. 26 - July 1 2012.
Zelnik-Manor, Lihi and Perona, Pietro. Self-tuning spectral clustering. In NIPS, volume 17, pp. 1601-1608. MIT Press, Cambridge, MA, 2005.
Zhou, Dengyong, Bousquet, Olivier, Lal, Thomas N., Weston, Jason, and Scholkopf, Bernhard. Learning with local and global consistency. In NIPS, volume 16, pp. 321-328. MIT Press, Cambridge, MA, 2004.
-----0
D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), CVPR 12, pages 3642-3649, Washington, DC, USA, 2012. IEEE Computer Society. ISBN 978-1-4673-1226-4.
G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master's thesis, University of Toronto, 2009.
A. Krizhevsky. cuda-convnet. http://code.google.com/p/cuda-convnet/, 2012.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, Nov. 1998. ISSN 0018-9219. doi: 10.1109/5.726791.
Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the 2004 IEEE computer society conference on Computer vision and pattern recognition, CVPR04, pages 97-104, Washington, DC, USA, 2004. IEEE Computer Society.
M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, New York, 1991.
D. J. C. MacKay. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. In Bayesian methods for backpropagation networks. Springer, 1995.
V. Nair and G. E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In ICML, 2010.
Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.
J. Snoek, H. Larochelle, and R. A. Adams. Practical bayesian optimization of machine learning algorithms. In Neural Information Processing Systems, 2012.
A. S. Weigend, D. E. Rumelhart, and B. A. Huberman. Generalization by weight-elimination with application to forecasting. In NIPS, 1991.
M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In ICLR, 2013.
-----0
Alonso-Gutierrez, D. On the isotropy constant of random convex sets. Proceedings of the American Mathematical Society, 136(9):3293-3300, 2008.
Basri, R. and Jacobs, D.W. Lambertian reflectance and linear subspaces. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 25(2):218-233, 2003.
Ben-Tal, A. and Nemirovski, A. Robust convex optimization. Mathematics of Operations Research, 23(4):769-805, 1998.
Bertsimas, D. and Sim, M. The price of robustness. Operations research, 52(1):35-53, 2004.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
Bradley, P.S. and Mangasarian, O.L. k-plane clustering. Journal of Global Optimization, 16(1):23-32, 2000.
Candès, E.J. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9):589-592, 2008.
Chen, G. and Lerman, G. Spectral curvature clustering (scc). International Journal of Computer Vision, 81(3):317-330, 2009.
Costeira, J.P. and Kanade, T. A multibody factorization method for independently moving objects. International Journal of Computer Vision, 29(3):159-179, 1998.
Donoho, D.L., Elad, M., and Temlyakov, V.N. Stable recovery of sparse overcomplete representations in the presence of noise. Information Theory, IEEE Transactions on, 52(1):6-18, 2006.
Elhamifar, E. and Vidal, R. Sparse subspace clustering. In CVPR09, pp. 2790-2797. IEEE, 2009.
Elhamifar, E. and Vidal, R. Clustering disjoint subspaces via sparse representation. In ICASSP11, pp. 1926-1929. IEEE, 2010.
Elhamifar, E. and Vidal, R. Sparse subspace clustering: Algorithm, theory, and applications. to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
Eriksson, B., Balzano, L., and Nowak, R. High rank matrix completion. In AI Stats12, 2012.
Hastie, T. and Simard, P.Y. Metrics and models for handwritten character recognition. Statistical Science, pp. 54-65, 1998.
Jalali, A., Chen, Y., Sanghavi, S., and Xu, H. Clustering partially observed graphs via convex optimization. In ICML11, pp. 1001-1008. ACM, 2011.
Kanatani, K. Motion segmentation by subspace separation and model selection. In ICCV01, volume 2, pp. 586-591. IEEE, 2001.
Liu, G., Lin, Z., Yan, S., Sun, J., Yu, Y., and Ma, Y. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):171-184, 2013.
Loh, P.L. and Wainwright, M.J. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics, 40(3):1637-1664, 2012.
Nasihatkon, B. and Hartley, R. Graph connectivity in sparse subspace clustering. In CVPR11, pp. 2137-2144. IEEE, 2011.
Ng, A.Y., Jordan, M.I., Weiss, Y., et al. On spectral clustering: Analysis and an algorithm. In NIPS02, volume 2, pp. 849-856, 2002.
Soltanolkotabi, M. and Candes, E.J. A geometric analysis of subspace clustering with outliers. To appear in Annals of Statistics, 2012.
Tron, R. and Vidal, R. A benchmark for the comparison of 3-d motion segmentation algorithms. In CVPR07, pp. 1-8. IEEE, 2007.
Vidal, R. Subspace clustering. Signal Processing Magazine, IEEE, 28(2):52-68, 2011.
Vidal, R., Ma, Y., and Sastry, S. Generalized principal component analysis (gpca). IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1945-1959, 2005.
Zhang, A., Fawaz, N., Ioannidis, S., and Montanari, A. Guess who rated this movie: Identifying users through subspace clustering. arXiv preprint arXiv:1208.1544, 2012.
Zhou, S.K., Aggarwal, G., Chellappa, R., and Jacobs, D.W. Appearance characterization of linear lambertian objects, generalized photometric stereo, and illumination-invariant face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(2):230-245, 2007.
-----0
Bishop, Christopher M. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108-116, 1995.
Goodfellow, Ian J., Warde-Farley, David, Mirza, Mehdi, Courville, Aaron C., and Bengio, Yoshua. Maxout networks. CoRR, abs/1302.4389, 2013.
Hinton, Geoffrey E., Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Proceedings of NIPS, 2012.
Lehmann, Erich L. Elements of Large-Sample Theory. Springer, 1998. ISBN 03873985956.
Maas, Andrew L., Daly, Raymond E., Pham, Peter T., Huang, Dan, Ng, Andrew Y., and Potts, Christopher. Learning word vectors for sentiment analysis. In Proceedings of the ACL, 2011.
MacKay, David J.C. The evidence framework applied to classification networks. Neural Computation, 5(4):720-736, 1992.
Matsuoka, Kiyotoshi. Noise injection into inputs in back-propagation learning. IEEE Transactions on Systems, Man, and Cybernetics, 22(3):436-440, 1992.
Nakagawa, Tetsuji, Inui, Kentaro, and Kurohashi, Sadao. Dependency tree-based sentiment classification using CRFs with hidden variables. In Proceedings of ACL:HLT, 2010.
Ng, Andrew Y. and Jordan, Michael I. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In Proceedings of NIPS, volume 2, pp. 841-848, 2002.
Ross, Andrew. Computing bounds on the expected maximum of correlated normal variables. Methodology and Computing in Applied Probability, 12:111-138, 2010. ISSN 1387-5841.
Simard, Patrice, LeCun, Yann, Denker, John, and Victorri, Bernard. Transformation invariance in pattern recognition-tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade, pp.23927, 1996.
Smith, Andrew, Cohn, Trevor, and Osborne, Miles. Logarithmic opinion pools for conditional random fields. In Proceedings of the ACL, pp. 18-25, 2005.
Socher, Richard, Pennington, Jeffrey, Huang, Eric H., Ng, Andrew Y., and Manning, Christopher D. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In Proceedings of EMNLP, 2011.
Sutton, Charles, Sindelar, Michael, and McCallum, Andrew. Reducing weight undertraining in structured discriminative learning. In Proceedings of HLT-NAACL, 2006.
van der Maaten, Laurens, Chen, Minmin, Tyree, Stephen, and Weinberger, Kilian. Learning with marginalized corrupted features. In Proceedings of ICML, 2013.
Wang, Sida and Manning, Christopher. Baselines and bigrams: Simple, good sentiment and topic classification. In Proceedings of the ACL, pp. 90-94, 2012.
-----0
Felzenszwalb, P. F. and McAuley, J. J. Fast inference with min-sum matrix product. PAMI, 33(12):2549-2554, 2011.
Globerson, Amir and Jaakkola, Tommi. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In NIPS, 2007.
Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
Kolmogorov, V. Convergent tree-reweighted message passing for energy minimization. PAMI, 28(10):1568-1583, October 2006.
Komodakis, N. and Paragios, N. Beyond loose LP-relaxations: Optimizing MRFs by repairing cycles. In ECCV, pp. III:806-820, 2008.
Komodakis, Nikos, Paragios, Nikos, and Tziritas, Georgios. MRF energy minimization and beyond via dual decomposition. PAMI, 33(3):531-552, 2011.
Meltzer, Talya, Globerson, Amir, and Weiss, Yair. Convergent message passing algorithms - a unifying view. In UAI, pp. 393-401, 2009.
Shimony. Finding MAPs for belief networks is NP-hard. AIJ: Artificial Intelligence, 68, 1994.
Sontag, David and Jaakkola, Tommi. Tree block coordinate descent for MAP in graphical models. AISTATS, 5:544-551, 2009.
Sontag, David, Meltzer, Talya, Globerson, Amir, Jaakkola, Tommi, and Weiss, Yair. Tightening LP relaxations for MAP using message passing. In UAI, pp. 503-510, 2008.
Sontag, David, Globerson, Amir, and Jaakkola, Tommi. Introduction to dual decomposition for inference. In Sra, Suvrit, Nowozin, Sebastian, and Wright, Stephen J. (eds.), Optimization for Machine Learning. MIT Press, 2011.
Szeliski, R. S., Zabih, R., Scharstein, D., Veksler, O., Kolmogorov, V., Agarwala, A., Tappen, M., and Rother, C. A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. IEEE Trans. Pattern Analysis and Machine Intelligence, 30(6):1068-1080, June 2008.
Tarlow, Daniel, Batra, Dhruv, Kohli, Pushmeet, and Kolmogorov, Vladimir. Dynamic tree block coordinate ascent. In ICML, 2011.
Wainwright, Martin J. and Jordan, Michael I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.
Weiss, Y. Correctness of local probability propagation in graphical models with loops. Neural Computation, 12:1-41, 2000.
Werner, T. A linear programming approach to max-sum problem: A review. PAMI, 29(7):1165-1179, July 2007.
Yanover, Chen, Meltzer, Talya, and Weiss, Yair. Linear programming relaxations and belief propagation - an empirical study. JMLR, 7:1887-1907, 2006.
-----0
Argyriou, Andreas, Evgeniou, Theodoros, and Pontil, Massimiliano. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.
Bach, F., Lanckriet, G.R.G., and Jordan, M.I. Multiple Kernel Learning, Conic Duality, and the SMO Algorithm. ICML, 2004.
Belkin, M, Niyogi, P, and Sindhwani, V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 1:1-48, 2006.
Bickel, Steffen and Scheffer, Tobias. Multi-view clustering. IEEE International Conference on Data Mining, 2004.
Brefeld, Ulf and Scheffer, Tobias. Co-em support vector learning. International Conference on Machine Learning, 2004.
Cai, X., Nie, F., Huang, H., and Kamangar, F. Heterogeneous image feature integration via multi-modal spectral clustering. In CVPR, 2011.
Ghani, R. Combining labeled and unlabeled data for multiclass text categorization. International Conference on Machine Learning, 2002.
Gorodnitsky, I.F. and Rao, B.D. Sparse signal reconstruction from limited data using focuss: A re-weighted minimum norm algorithm. Signal Processing, IEEE Transactions on, 45(3):600-616, 1997.
He, X., Yan, S., Hu, Y., Niyogi, P., and Zhang, H.J. Face recognition using laplacianfaces. IEEE TPAMI, 27(3):328-340, 2005.
Kloft, M., Brefeld, U., Laskov, P., and Sonnenburg, S. Nonsparse multiple kernel learning. In NIPS.
Kumar, A., Rai, P., and Daume III, H. Co-regularized multi-view spectral clustering. In NIPS, 2011.
Lanckriet, G.R.G., Cristianini, N., Bartlett, P., Ghaoui, L.E., and Jordan, M.I. Learning the kernel matrix with semidefinite programming. JMLR, 5:27-72, 2004a. ISSN 1532-4435.
Lanckriet, G.R.G., De Bie, T., Cristianini, N., Jordan, M.I., and Noble, W.S. A statistical framework for genomic data fusion. Bioinformatics, 2004b. ISSN 1367-4803.
Ng, Andrew Y., Jordan, Michael I., and Weiss, Yair. On spectral clustering: Analysis and an algorithm. In NIPS, pp. 849-856, 2001.
Nie, Feiping, Xu, Dong, Tsang, Ivor W., and Zhang, Changshui. Spectral embedded clustering. In IJCAI, pp. 1181-1186, 2009.
Obozinski, Guillaume, Taskar, Ben, and Jordan, Michael I. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20:231-252, 2010.
Prettenhofer, P. and Stein, B. Cross-language text classification using structural correspondence learning. In ACL, 2010.
Quattoni, Ariadna, Carreras, Xavier, Collins, Michael, and Darrell, Trevor. An Efficient Projection for l1,∞ Regularization. ICML, 2009.
Sonnenburg, S., Ratsch, G., Schafer, C., and Scholkopf, B. Large scale multiple kernel learning. JMLR, 7:1531-1565, 2006. ISSN 1532-4435.
Suykens, J.A.K., Van Gestel, T., and De Brabanter, J. Least squares support vector machines. World Scientific Pub Co Inc, 2002. ISBN 9812381511.
Wang, Hua, Nie, Feiping, Huang, Heng, Risacher, Shannon, Saykin, Andrew J, and Shen, Li. Identifying AD-sensitive and cognition-relevant imaging biomarkers via joint classification and regression. In MICCAI 2011, pp. 115-123. Springer, 2011.
Wang, Hua, Nie, Feiping, Huang, Heng, Kim, Sungeun, Nho, Kwangsik, Risacher, Shannon L, Saykin, Andrew J, Shen, Li, et al. Identifying quantitative trait loci via group-sparse multitask regression and feature selection: an imaging genetics study of the adni cohort. Bioinformatics, 28(2):229-237, 2012a.
Wang, Hua, Nie, Feiping, Huang, Heng, Risacher, Shannon L, Saykin, Andrew J, Shen, Li, et al. Identifying disease sensitive and quantitative trait-relevant biomarkers from multidimensional heterogeneous imaging genetics data via sparse multimodal multitask learning. Bioinformatics, 28(12):i127-i136, 2012b.
Wang, Hua, Nie, Feiping, Huang, Heng, Yan, Jingwen, Kim, Sungeun, Nho, Kwangsik, Risacher, Shannon L., Saykin, Andrew J., Shen, Li, and for the Alzheimer's Disease Neuroimaging Initiative. From phenotype to genotype: an association study of longitudinal phenotypic markers to alzheimer's disease relevant snps. Bioinformatics, 28(18):i619-i625, 2012c.
Wang, Hua, Nie, Feiping, Huang, Heng, Yan, Jingwen, Kim, Sungeun, Risacher, Shannon, Saykin, Andrew, and Shen, Li. High-order multi-task feature learning to identify longitudinal phenotypic markers for alzheimer's disease progression prediction. In NIPS, 2012d.
Wang, Hua, Nie, Feiping, Huang, Heng, and Ding, Chris. Heterogeneous Visual Features Fusion via Sparse Multimodal Machine. In CVPR 2013, 2013.
Ye, J., Ji, S., and Chen, J. Multi-class discriminant kernel learning via convex programming. JMLR, 9:719-758, 2008a. ISSN 1532-4435.
Ye, Jieping, Zhao, Zheng, and Wu, Mingrui. Discriminative k-means for clustering. In NIPS, pp. 1649-1656, 2008b.
Yu, S., Falck, T., Daemen, A., Tranchevent, L.C., Suykens, J.A.K., De Moor, B., and Moreau, Y. L2-norm multiple kernel learning and its application to biomedical data fusion. BMC bioinformatics, 11(1):309, 2010. ISSN 1471-2105.
Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. Journal of The Royal Statistical Society Series B, 68(1):49-67, 2006.
-----0
Andrews, S., Tsochantaridis, I., and Hofmann, T. Support vector machines for multiple-instance learning. NIPS, 15:561-568, 2002.
Blei, David M., Ng, Andrew Y., and Jordan, Michael I. Latent dirichlet allocation. J. of Machine Learning Res., 3:993-1022, 2003.
Bo, L., Ren, X., and Fox, D. Kernel descriptors for visual recognition. NIPS, 7, 2010.
Bourdev, Lubomir and Malik, Jitendra. Poselets: Body part detectors trained using 3d human pose annotations. In ICCV, 2009.
Chatfield, K., Lempitsky, V., Vedaldi, A., and Zisserman, A. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC, 2011.
Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265-292, 2002.
Dietterich, T.G., Lathrop, R.H., and Lozano-Perez, T. Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence, 89:31-71, 1997.
Dollar, P., Babenko, B., Belongie, S., Perona, P., and Tu, Z. Multiple component learning for object detection. ECCV, pp. 211-224, 2008.
Duda, R., Hart, P., and Stork, D. Pattern Classification and Scene Analysis. John Wiley and Sons, 2000.
Fan, Rong-En, Chang, Kai-Wei, Hsieh, Cho-Jui, Wang, Xiang-Rui, and Lin, Chih-Jen. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.
Farhadi, A., Endres, I., Hoiem, D., and Forsyth, D. Describing objects by their attributes. In CVPR, pp. 1778-1785, 2009.
Fei-Fei, L. and Perona, P. A bayesian hierarchical model for learning natural scene categories. In CVPR, 2005.
Felzenszwalb, P.F., Girshick, R.B., McAllester, D., and Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627-1645, 2010.
Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.
Jiang, Z., Zhang, G., and Davis, L.S. Submodular dictionary learning for sparse coding. In CVPR, pp. 3418-3425. IEEE, 2012.
Jurie, F. and Triggs, B. Creating efficient codebooks for visual recognition. In ICCV, pp. 604-610, 2005.
Kwitt, R., Vasconcelos, N., and Rasiwasia, N. Scene recognition on the semantic manifold. In ECCV, 2012.
Lazebnik, S. and Raginsky, M. Supervised learning of quantizer codebooks by information loss minimization. IEEE Trans. PAMI, 31:1294-1309, 2009.
Lazebnik, S., Schmid, C., and Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, volume 2, pp. 2169-2178, 2006.
LeCun, Y., Huang, F., and Bottou, L. Learning methods for generic object recognition with invariance to pose and lighting. In Proc. of CVPR, June 2004.
Li, L.J. and Fei-Fei, L. What, where and who? classifying events by scene and object recognition. In ICCV, 2007.
Li, L.J., Su, H., Xing, E.P., and Fei-Fei, L. Object bank: A high-level image representation for scene classification and semantic feature sparsification. NIPS, 24, 2010.
Lowe, D. G. Distinctive image features from scale-invariant keypoints. Int'l J. of Comp. Vis., 60(2):91-110, 2004.
Mairal, J., Bach, F., and Ponce, J. Task-driven dictionary learning. arXiv preprint arXiv:1009.5358, 2010.
Moosmann, F., Triggs, B., and Jurie, F. Fast discriminative visual codebooks using randomized clustering forests. In NIPS, pp. 985-992, 2006.
Moosmann, Frank, Nowak, Eric, and Jurie, Frederic. Randomized clustering forests for image classification. IEEE Trans. Pattern Anal. Mach. Intell., 30(9):1632-1646, 2008.
Ojala, T., Pietikainen, M., and Harwood, D. A comparative study of texture measures with classification based on feature distributions. Pattern Recognition, 29:51-59, 1996.
Oliva, Aude and Torralba, Antonio. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145-175, 2001.
Pandey, M. and Lazebnik, S. Scene recognition and weakly supervised object localization with deformable part-based models. In ICCV, pp. 1307-1314. IEEE, 2011.
Parikh, D. and Grauman, K. Relative attributes. In ICCV, pp. 503-510, 2011.
Parizi, S.N., Oberlin, J.G., and Felzenszwalb, P.F. Reconfigurable models for scene recognition. In CVPR, pp. 2775-2782. IEEE, 2012.
Pechyony, D. and Vapnik, V. On the theory of learning with privileged information. In NIPS, 2010.
Quattoni, A. and Torralba, A. Recognizing indoor scenes. In CVPR, 2009.
Sadeghi, F. and Tappen, M.F. Latent pyramidal regions for recognizing scenes. In ECCV, 2012.
Serre, Thomas and Poggio, Tomaso. A neuromorphic approach to computer vision. Commun. ACM, 53(10):54-61, 2010.
Singh, Saurabh, Gupta, Abhinav, and Efros, Alexei A. Unsupervised discovery of mid-level discriminative patches. In ECCV, 2012.
Torresani, L., Szummer, M., and Fitzgibbon, A. Efficient object category recognition using classemes. ECCV, pp. 776-789, 2010.
Vedaldi, A. and Fulkerson, B. VLFeat: An open and portable library of computer vision algorithms, 2008.
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., and Gong, Y. Locality-constrained linear coding for image classification. In CVPR, pp. 3360-3367, 2010.
Winn, J., Criminisi, A., and Minka, T. Object categorization by learned universal visual dictionary. In ICCV, volume 2, pp. 1800-1807, 2005.
Wu, J. and Rehg, J.M. Beyond the euclidean distance: Creating effective visual codebooks using the histogram intersection kernel. In ICCV, pp. 630-637, 2009.
Wu, J. and Rehg, J.M. Centrist: A visual descriptor for scene categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1489-1501, 2011.
Xu, Y., Zhu, J.Y., Chang, E., and Tu, Z. Multiple clustered instance learning for histopathology cancer image classification, segmentation and clustering. In CVPR, pp. 964-971, 2012.
Yang, J., Yu, K., Gong, Y., and Huang, T. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, pp. 1794-1801, 2009.
Yang, J., Yu, K., and Huang, T. Supervised translation-invariant sparse coding. In CVPR, pp. 3517-3524. IEEE, 2010.
Yang, L., Jin, R., Sukthankar, R., and Jurie, F. Unifying discriminative visual codebook generation with classifier training for object category recognition. In Proc. of CVPR, 2008.
Zhang, Dan, Wang, Fei, Si, Luo, and Li, Tao. M3ic: maximum margin multiple instance clustering. In IJCAI, pp. 1339-1344, 2009.
Zhang, Min-Ling and Zhou, Zhi-Hua. M3miml: A maximum margin method for multi-instance multi-label learning. In ICDM, pp. 688-697, 2008.
Zhu, J., Li, L.J., Fei-Fei, L., and Xing, E.P. Large margin learning of upstream scene understanding models. NIPS, 24, 2010.
Zhu, J., Zou, W., Yang, X., Zhang, R., Zhou, Q., and Zhang, W. Image Classification by Hierarchical Spatial Pooling with Partial Least Squares Analysis. In British Machine Vision Conference, 2012.
-----0
Andrieu, Christophe and Robert, Christian. Controlled MCMC for optimal sampling. Technical Report 0125, Cahiers de Mathematiques du Ceremade, Universite Paris-Dauphine, 2001.
Andrieu, Christophe, de Freitas, Nando, Doucet, Arnaud, and Jordan, Michael I. An Introduction to MCMC for Machine Learning. Machine Learning, 50(1):5-43, 2003.
Atchade, Yves and Fort, Gersende. Limit theorems for some adaptive MCMC algorithms with subgeometric kernels. Bernoulli, 16(1):116-154, 2010.
Beskos, Alexandros, Pillai, Natesh S., Roberts, Gareth O., Sanz-Serna, Jesus M., and Stuart, Andrew M. Optimal tuning of the hybrid Monte-Carlo algorithm. Preprint arXiv:1001.4460, 2010.
Brochu, Eric, Cora, Vlad M, and de Freitas, Nando. A tutorial on Bayesian optimization of expensive cost functions. Preprint arXiv:1012.2599, 2009.
Chen, Lingyu, Qin, Zhaohui, and Liu, Jun S. Exploring Hybrid Monte Carlo in Bayesian Computation. Sigma, 2:25, 2001.
Christensen, Ole F., Roberts, Gareth O., and Rosenthal, Jeffrey S. Scaling limits for the transient phase of local Metropolis-Hastings algorithms. Journal of the Royal Statistical Society: Series B, 67(2):253-268, 2005.
Duane, S, Kennedy, A D, Pendleton, B J, and Roweth, D. Hybrid Monte Carlo. Physics Letters B, 195(2):216-222, 1987.
Engel, Yaakov. Algorithms and representations for reinforcement learning. PhD thesis, The Hebrew University of Jerusalem, 2005.
Girolami, Mark and Calderhead, Ben. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 73(2):123-214, 2011.
Guyon, Isabelle, Gunn, Steve, Ben-Hur, Asa, and Dror, Gideon. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems, volume 17, pp. 545-552, 2005.
Hamze, Firas, Wang, Ziyu, and de Freitas, Nando. Self-avoiding random dynamics on integer complex systems. ACM Transactions on Modeling and Computer Simulation, 23(1):9:1-9:25, 2013.
Hoffman, Matthew, Brochu, Eric, and de Freitas, Nando. Portfolio allocation for Bayesian optimization. In Uncertainty in Artificial Intelligence, pp. 327-336, 2011.
Hoffman, Matthew D and Gelman, Andrew. The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Preprint arXiv:1111.4246, 2011.
Ishwaran, Hemant. Applications of hybrid Monte Carlo to Bayesian generalized linear models: Quasicomplete separation and neural networks. Journal of Computational and Graphical Statistics, 8(4):779-799, 1999.
Kim, Sangjoon, Shephard, Neil, and Chib, Siddhartha. Stochastic volatility: likelihood inference and comparison with ARCH models. The Review of Economic Studies, 65(3):361-393, 1998.
Mahendran, Nimalan, Wang, Ziyu, Hamze, Firas, and de Freitas, Nando. Adaptive MCMC with Bayesian optimization. Artificial Intelligence and Statistics, 2012.
Meyn, Sean P. and Tweedie, Richard L. Markov chains and stochastic stability. Springer-Verlag, 1993.
Močkus, Jonas. The Bayesian approach to global optimization. In System Modeling and Optimization, volume 38, pp. 473-481. Springer, 1982.
Mohamed, Shakir, Heller, Katherine, and Ghahramani, Zoubin. Bayesian exponential family PCA. In Advances in Neural Information Processing Systems, pp. 1089-1096, 2008.
Neal, R. and Zhang, J. High dimensional classification with Bayesian neural networks and Dirichlet diffusion trees. In Feature Extraction, pp. 265-296. Springer, 2006.
Neal, Radford M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54:113-162, 2010.
Pasarica, Cristian and Gelman, Andrew. Adaptively scaling the Metropolis algorithm using expected squared jumped distance. Statistica Sinica, 20(1):343, 2010.
Ranzato, Marc'Aurelio and Hinton, Geoffrey. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In IEEE Computer Vision and Pattern Recognition, pp. 2551-2558, 2010.
Rasmussen, Carl Edward and Williams, Christopher K I. Gaussian Processes for Machine Learning. MIT Press, Cambridge, Massachusetts, 2006.
Roberts, Gareth O. and Rosenthal, Jeffrey S. Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. Journal of Applied Probability, 44(2):458-475, 2007.
Roberts, Gareth O. and Stramer, Osnat. Langevin diffusions and Metropolis-Hastings algorithms. Methodology and Computing in Applied Probability, 4(4):337-357, 2002.
Roberts, Gareth O and Tweedie, Richard L. Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika, 83(1):95-110, 1996.
Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan Prescott. Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems, 2012.
Srinivas, Niranjan, Krause, Andreas, Kakade, Sham M., and Seeger, Matthias. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning, 2010.
Vihola, Matti. Grapham: Graphical models with adaptive random walk Metropolis algorithms. Computational Statistics and Data Analysis, 54(1):49-54, 2010.
-----0
Batra, Dhruv, Nowozin, Sebastian, and Kohli, Pushmeet. Tighter relaxations for MAP-MRF inference: A local primal-dual gap based separation algorithm. Journal of Machine Learning Research Proceedings Track, 15:146-154, 2011.
Belanger, David, Passos, Alexandre, Riedel, Sebastian, and McCallum, Andrew. MAP inference in chains using column generation. In Bartlett, P., Pereira, F.C.N., Burges, C.J.C., Bottou, L., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 25, pp. 1853-1861, 2012.
Felzenszwalb, P. F. and McAuley, J. J. Fast inference with min-sum matrix product. IEEE Trans. Pattern Anal. Mach. Intell, 33(12):2549-2554, 2011.
Flerova, Natalia, Ihler, Alexander, Dechter, Rina, and Otten, Lars. Mini-bucket elimination with moment matching. In NIPS Workshop DISCML, 2011.
Globerson, Amir and Jaakkola, Tommi. Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations. In NIPS, 2007.
Heitz, G., Elidan, G., Packer, B., and Koller, D. Shape-based object localization for descriptive classification. International Journal of Computer Vision, 84(1):40-62, 2009.
Kolmogorov, V. Convergent tree-reweighted message passing for energy minimization. IEEE Trans. Pattern Analysis and Machine Intelligence, 28(10):1568-1583, October 2006.
Komodakis, N. and Paragios, N. Beyond loose LP-relaxations: Optimizing MRFs by repairing cycles. In ECCV, pp. III: 806-820, 2008.
Komodakis, Nikos, Paragios, Nikos, and Tziritas, Georgios. MRF energy minimization and beyond via dual decomposition. IEEE Trans. Pattern Anal. Mach. Intell, 33(3):531-552, 2011.
Lowe, D. G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
Marinescu, Radu and Dechter, Rina. Memory intensive branch-and-bound search for graphical models. In AAAI, pp. 1200-1205. AAAI Press, 2006.
Marinescu, Radu and Dechter, Rina. Best-first AND/OR search for graphical models. In AAAI, pp. 1171-1176. AAAI Press, 2007.
McAuley, J. J. and Caetano, T. S. Faster algorithms for max-product message-passing. Journal of Machine Learning Research, 12(4):1349-1388, 2011.
Sontag, David and Jaakkola, Tommi. New outer bounds on the marginal polytope. In NIPS, pp. 1393-1400, 2008.
Sontag, David, Meltzer, Talya, Globerson, Amir, Jaakkola, Tommi, and Weiss, Yair. Tightening LP relaxations for MAP using message passing. In UAI, pp. 503-510, 2008.
Sontag, David, Globerson, Amir, and Jaakkola, Tommi. Clusters and coarse partitions in LP relaxations. In NIPS, pp. 1537-1544, 2009.
Sontag, David, Globerson, Amir, and Jaakkola, Tommi. Introduction to dual decomposition for inference. In Sra, Suvrit, Nowozin, Sebastian, and Wright, Stephen J. (eds.), Optimization for Machine Learning. MIT Press, 2011.
Sontag, David, Choe, Do Kook, and Li, Yitao. Efficiently searching for frustrated cycles in MAP inference. In UAI, 2012.
Sun, Min, Telaprolu, Murali, Lee, Honglak, and Savarese, Silvio. Efficient and exact MAP-MRF inference using branch and bound. In AISTATS, 2012.
Wang, Huayan and Koller, Daphne. Subproblem-tree calibration: A unified approach to max-product message passing. In ICML, 2013.
Weiss, Y. Correctness of local probability propagation in graphical models with loops. Neural Computation, 12:1-41, 2000.
Werner, T. A linear programming approach to max-sum problem: A review. IEEE Trans. Pattern Analysis and Machine Intelligence, 29(7):1165-1179, July 2007.
Werner, T. High-arity interactions, polyhedral relaxations, and cutting plane algorithm for soft constraint optimisation (MAP-MRF). In CVPR, pp. 1-8, 2008.
Yanover, Chen, Meltzer, Talya, and Weiss, Yair. Linear programming relaxations and belief propagation - an empirical study. Journal of Machine Learning Research, 7:1887-1907, 2006.
Yarkony, Julian, Morshed, Ragib, Ihler, Alexander T., and Fowlkes, Charless. Tightening MRF relaxations with planar subproblems. In UAI, pp. 770-777, 2011.
-----0
Bengio, S., Pereira, F., Singer, Y., and Strelow, D. Group sparse coding. In NIPS, 2009.
Bradley, D.M. and Bagnell, J.A. Convex coding. In UAI, 2009.
Candès, E. and Wakin, M. An introduction to compressive sensing. IEEE Signal Processing Magazine, 25(2):21-30, 2008.
Dai, W., Yang, Q., Xue, G.R., and Yu, Y. Self-taught clustering. In ICML, 2008.
Ding, C., Zhou, D., He, X., and Zha, H. R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization. In ICML, 2006.
Ding, C., Simon, H.D., Jin, R., and Li, T. A learning framework using Green's function and kernel regularization with application to recommender system. In SIGKDD, 2007.
Gorodnitsky, I.F. and Rao, B.D. Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm. IEEE Transactions on Signal Processing, 45(3):600-616, 1997.
Ji, S., Tang, L., Yu, S., and Ye, J. A shared-subspace learning framework for multi-label classification. ACM TKDD, 4(2):1-29, 2010.
Jia, Y., Salzmann, M., and Darrell, T. Factorized latent spaces with structured sparsity. In NIPS, 2010.
Joachims, T. Transductive inference for text classification using support vector machines. In ICML, 1999.
Lee, H., Battle, A., Raina, R., and Ng, A.Y. Efficient sparse coding algorithms. In NIPS, 2007.
Lee, H., Raina, R., Teichman, A., and Ng, A.Y. Exponential family sparse coding with applications to self-taught learning. In IJCAI, 2009.
Li, T., Sindhwani, V., Ding, C., and Zhang, Y. Knowledge transformation for cross-domain sentiment classification. In SIGIR, 2009.
Mairal, J., Bach, F., Ponce, J., Sapiro, G., and Zisserman, A. Discriminative learned dictionaries for local image analysis. In CVPR, pp. 1-8, 2008.
Mairal, J., Bach, F., and Ponce, J. Task-Driven Dictionary Learning. Tech Report, INRIA, 2010.
Mairal, Julien, Bach, Francis, Ponce, Jean, Sapiro, Guillermo, and Zisserman, Andrew. Supervised dictionary learning. In NIPS, 2009.
Nie, F., Huang, H., Cai, X., and Ding, C. Efficient and Robust Feature Selection via Joint l2,1-Norms Minimization. In NIPS, 2010.
Pan, S.J. and Yang, Q. A survey on transfer learning. IEEE TKDE, 2009. ISSN 1041-4347.
Pham, D.S. and Venkatesh, S. Joint learning and dictionary construction for pattern recognition. In CVPR, pp. 1-8, 2008.
Raina, R. Self-taught learning. PhD thesis, Stanford University, 2009.
Raina, R., Battle, A., Lee, H., Packer, B., and Ng, A.Y. Self-taught learning: Transfer learning from unlabeled data. In ICML, 2007.
Tibshirani, R. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society. Series B (Methodological), pp. 267-288, 1996.
Wang, H., Huang, H., and Ding, C. Image annotation using multi-label correlated Green's function. In ICCV, 2009.
Wang, H., Ding, C., and Huang, H. Multi-Label Classification: Inconsistency and Class Balanced K-Nearest Neighbor. In AAAI, 2010a.
Wang, H., Huang, H., and Ding, C. Multi-label Feature Transform for Image Classifications. ECCV, 2010b.
Wang, Hua, Nie, Feiping, Huang, Heng, Risacher, Shannon, Ding, Chris, Saykin, Andrew J, and Shen, Li. Sparse multi-task regression and feature selection to identify brain imaging predictors for memory performance. In ICCV 2011, pp. 557-562. IEEE, 2011.
Wang, Hua, Huang, Heng, and Ding, Chris. Function-function correlated multi-label protein function prediction over interaction networks. In Research in Computational Molecular Biology (RECOMB 2012), pp. 302-313. Springer, 2012a.
Wang, Hua, Nie, Feiping, Huang, Heng, Kim, Sungeun, Nho, Kwangsik, Risacher, Shannon L, Saykin, Andrew J, Shen, Li, et al. Identifying quantitative trait loci via group-sparse multitask regression and feature selection: an imaging genetics study of the ADNI cohort. Bioinformatics, 28(2):229-237, 2012b.
Wang, Hua, Nie, Feiping, Huang, Heng, Risacher, Shannon L, Saykin, Andrew J, Shen, Li, et al. Identifying disease sensitive and quantitative trait-relevant biomarkers from multidimensional heterogeneous imaging genetics data via sparse multimodal multitask learning. Bioinformatics, 28(12):i127-i136, 2012c.
Wang, Hua, Nie, Feiping, Huang, Heng, Yan, Jingwen, Kim, Sungeun, Risacher, Shannon, Saykin, Andrew, and Shen, Li. High-order multi-task feature learning to identify longitudinal phenotypic markers for Alzheimer's disease progression prediction. In NIPS, 2012d.
Wang, Hua, Nie, Feiping, Huang, Heng, and Ding, Chris. Heterogeneous Visual Features Fusion via Sparse Multimodal Machine. In CVPR 2013, 2013.
Zhang, Q. and Li, B. Discriminative K-SVD for dictionary learning in face recognition. In CVPR, pp. 2691-2698, 2010.
Zhu, X. Semi-supervised learning literature survey. Technical report, University of Wisconsin-Madison, 2006.
-----0
Agarwal, S. A Study of the Bipartite Ranking Problem in Machine Learning. PhD thesis, University of Illinois at Urbana-Champaign, 2005.
Agarwal, S. The infinite push: A new support vector ranking algorithm that directly optimizes accuracy at the absolute top of the list. In Proceedings of the SIAM International Conference on Data Mining, 2011.
Ailon, N. An active learning algorithm for ranking from pairwise preferences with an almost optimal query complexity. Journal of Machine Learning Research, 13:137-164, 2012.
Ailon, N., Begleiter, R., and Ezra, E. A new active learning scheme with applications to learning to rank from pairwise preferences. arXiv CoRR, abs/1110.2136, 2011.
Ammar, A. and Shah, D. Ranking: Compare, don't score. In Proceedings of the 49th Annual Allerton Conference on Communication, Control and Computing (Allerton), pp. 776-783, 2011.
Braverman, M. and Mossel, E. Noisy sorting without resampling. In Symposium on Discrete Algorithms, pp. 268-276, 2008.
Braverman, M. and Mossel, E. Sorting from noisy information. arXiv CoRR, abs/0910.1191, 2009.
Cauwenberghs, G. and Poggio, T. Incremental and decremental support vector machine learning. In Leen, T.K., Dietterich, T.G., and Tresp, V. (eds.), Advances in Neural Information Processing Systems 13 (NIPS), pp. 409-415. MIT Press, 2000.
Coppersmith, D., Fleischer, L., and Rudra, A. Ordering by weighted number of wins gives a good ranking for weighted tournaments. ACM Transactions on Algorithms, 6(3):55:1-55:13, 2010.
Dekel, O., Manning, C., and Singer, Y. Log-linear models for label ranking. In Thrun, S., Saul, L., and Scholkopf, B. (eds.), Advances in Neural Information Processing Systems 16 (NIPS). MIT Press, 2004.
Diaconis, P. and Graham, R. L. Spearman's footrule as a measure of disarray. Journal of the Royal Statistical Society. Series B (Methodological), 39(2):262-268, 1977.
Feige, U., Raghavan, P., Peleg, D., and Upfal, E. Computing with noisy information. SIAM Journal on Computing, 23(5):1001-1018, 1994.
Freund, Y., Iyer, R., Schapire, R. E., and Singer, Y. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933-969, 2003.
Fulman, J. Stein's method, Jack measure, and the Metropolis algorithm. Journal of Combinatorial Theory. Series A, 108(2):275-296, 2004.
Giesen, J., Schuberth, E., and Stojakovic, M. Approximate sorting. Fundamenta Informaticae, 90(1-2):67-72, 2009.
Gleich, D. F. and Lim, L. Rank aggregation via nuclear norm minimization. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 60-68, 2011.
Graf, H. P., Cosatto, E., Bottou, L., Durdanovic, I., and Vapnik, V. Parallel support vector machines: The cascade SVM. In Saul, L.K., Weiss, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems 17 (NIPS). MIT Press, 2004.
Hazan, T., Man, A., and Shashua, A. A parallel decomposition solver for SVM: Distributed dual ascend using Fenchel duality. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1-8, 2008.
Herbrich, R., Graepel, T., and Obermayer, K. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pp. 115-132. MIT Press, 2000.
Jamieson, K. G. and Nowak, R. Active ranking using pairwise comparisons. In Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F.C.N., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 24 (NIPS), pp. 2240-2248. MIT Press, 2011.
Jarvelin, K. and Kekalainen, J. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422-446, 2002.
Joachims, T. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 217-226, 2006.
Mitliagkas, I., Gopalan, A., Caramanis, C., and Vishwanath, S. User rankings from comparisons: Learning permutations in high dimensions. In Proceedings of the 49th Annual Allerton Conference on Communication, Control and Computing (Allerton), 2011.
Negahban, S., Oh, S., and Shah, D. Iterative ranking from pair-wise comparisons. In Bartlett, P., Pereira, F., Burges, C., Bottou, L., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 25 (NIPS), pp. 2483-2491. MIT Press, 2012.
Radinsky, K. and Ailon, N. Ranking from pairs and triplets: Information quality, evaluation methods and query complexity. In King, I., Nejdl, W., and Li, H. (eds.), Fourth ACM International Conference on Web Search and Data Mining (WSDM), pp. 105-114. ACM, 2011.
Rudin, C. The p-norm push: A simple convex ranking algorithm that concentrates at the top of the list. Journal of Machine Learning Research, 10:2233-2271, 2009.
-----0
Adam, A., Rivlin, E., Shimshoni, I., and Reinitz, D. Robust real-time unusual event detection using multiple fixed-location monitors. IEEE Trans. PAMI, 30(3):555-560, 2008.
Blei, D. M., Ng, A.Y., and Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993-1022, March 2003.
Cao, L. and Fei-Fei, L. Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes. In Proc. ICCV, pp. 1-8, 2007.
Chan, A.B. and Vasconcelos, N. Modeling, clustering, and segmenting video with mixtures of dynamic textures. IEEE Trans. PAMI, 30(5):909-926, 2008.
Coates, A., Lee, H., and Ng, A.Y. An analysis of single-layer networks in unsupervised feature learning. Ann Arbor, 1001:48109, 2010.
Csurka, G., Dance, C., Fan, L., Willamowski, J., and Bray, C. Visual categorization with bags of keypoints. In Workshop on statistical learning in computer vision, ECCV, volume 1, pp. 22, 2004.
Doretto, G., Chiuso, A., Wu, Y.N., and Soatto, S. Dynamic textures. International Journal of Computer Vision, 51(2):91-109, 2003.
Farquhar, J., Szedmak, S., Meng, H., and Shawe-Taylor, J. Improving bag-of-keypoints image categorisation: Generative models and pdf-kernels. 2005.
Fei-Fei, L. and Perona, P. A Bayesian hierarchical model for learning natural scene categories. In Proc. CVPR, volume 2, pp. 524-531, 2005.
Griffiths, Thomas L and Steyvers, Mark. Finding scientific topics. PNAS, 101(Suppl 1):5228-5235, 2004.
Hoffman, Matthew, Blei, David M, and Bach, Francis. Online learning for latent Dirichlet allocation. Advances in Neural Information Processing Systems, 23:856-864, 2010.
Larlus, D. and Jurie, F. Latent mixture vocabularies for object categorization and segmentation. Image and Vision Computing, 27(5):523-534, 2009.
Le, Quoc V, Zou, Will Y, Yeung, Serena Y, and Ng, Andrew Y. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In Proc. CVPR, pp. 3361-3368, 2011.
Mahadevan, Vijay, Li, Weixin, Bhalodia, Viral, and Vasconcelos, Nuno. Anomaly detection in crowded scenes. In Proc. CVPR, 2010.
Mahadevan, Vijay, Li, Weixin, Bhalodia, Viral, and Vasconcelos, Nuno. http://www.svcl.ucsd.edu/projects/anomaly/dataset.html Accessed October 2011, 2012.
Olshausen, B.A., Field, D.J., et al. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311-3326, 1997.
Perronnin, F. Universal and adapted vocabularies for generic visual categorization. IEEE Trans. PAMI, 30(7):1243-1256, 2008.
Philbin, J., Chum, O., Isard, M., Sivic, J., and Zisserman, A. Lost in quantization: Improving particular object retrieval in large scale image databases. In Proc. CVPR, pp. 1-8, 2008.
Philbin, J., Sivic, J., and Zisserman, A. Geometric latent Dirichlet allocation on a matching graph for large-scale image datasets. International Journal of Computer Vision, 95(2):138-153, 2011.
Ramirez, I., Sprechmann, P., and Sapiro, G. Classification and clustering via dictionary learning with structured incoherence and shared features. In Proc. CVPR, pp. 3501-3508, 2010.
Rematas, K., Fritz, M., and Tuytelaars, T. Kernel density topic models: Visual topics without visual words. In NIPS workshops, Modern Nonparametric Methods in Machine Learning, 2012.
Sivic, J., Russell, B.C., Efros, A.A., Zisserman, A., and Freeman, W.T. Discovering objects and their location in images. In Proc. ICCV, volume 1, pp. 370-377, 2005.
Sivic, J., Russell, B.C., Zisserman, A., Freeman, W.T., and Efros, A.A. Unsupervised discovery of visual object class hierarchies. In Proc. CVPR, pp. 1-8, 2008.
Sudderth, E.B., Torralba, A., Freeman, W.T., and Willsky, A.S. Learning hierarchical models of scenes, objects, and parts. In Proc. ICCV, volume 2, pp. 1331-1338, 2005.
Tuytelaars, T., Lampert, C.H., Blaschko, M.B., and Buntine, W. Unsupervised object discovery: A comparison. International Journal of Computer Vision, 88(2):284-302, 2010.
van Gemert, J.C., Veenman, C.J., Smeulders, A.W.M., and Geusebroek, J.M. Visual word ambiguity. IEEE Trans. PAMI, 32(7):1271-1283, 2010.
Wang, C., Blei, D., and Li, F.F. Simultaneous image classification and annotation. In Proc. CVPR, pp. 1903-1910, 2009.
Wang, X. and Grimson, E. Spatial latent Dirichlet allocation. Advances in Neural Information Processing Systems, 20:1577-1584, 2007.
-----0
Bertsekas, Dimitri. Dynamic Programming and Optimal Control. Athena Scientific, Nashua, New Hampshire, 2012.
Cesa-Bianchi, Nicolo and Lugosi, Gabor. Prediction, Learning, and Games. Cambridge University Press, New York, NY, 2006.
Cover, Thomas and Thomas, Joy. Elements of Information Theory. John Wiley & Sons, Hoboken, New Jersey, 2006.
Dasgupta, Sanjoy. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems 17, pp. 337-344, 2005.
Dasgupta, Sanjoy, Lee, Wee Sun, and Long, Philip. A theoretical analysis of query selection for collaborative filtering. Machine Learning, 51(3):283-298, 2003.
Golovin, Daniel and Krause, Andreas. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427-486, 2011.
Lam, Shyong and Herlocker, Jon. MovieLens 1M Dataset. http://www.grouplens.org/node/12, 2012.
Nowak, Robert. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893-7906, 2011.
Yue, Yisong and Guestrin, Carlos. Linear submodular bandits and their application to diversified retrieval. In Advances in Neural Information Processing Systems 24, pp. 2483-2491, 2011.
-----0
Bai, B., Weston, J., Grangier, D., Collobert, R., Sadamasa, K., Qi, Y., Cortes, C., and Mohri, M. Polynomial semantic indexing. In NIPS, 2009.
Barla, A., Odone, F., and Verri, A. Histogram intersection kernel for image classification. Intl. Conf. Image Processing (ICIP), 3:III-513-16 vol. 2, 2003.
Bengio, S., Weston, J., and Grangier, D. Label embedding trees for large multi-class tasks. In NIPS, 2010.
Bentley, J.L. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18 (9):517, 1975.
Beygelzimer, A., Langford, J., and Ravikumar, P. Error-correcting tournaments. In International Conference on Algorithmic Learning Theory (ALT), pp. 247-262, 2009.
Burges, CJC. From ranknet to lambdarank to lambdamart: An overview. Technical report, Microsoft Research, 2010.
Cisse, M., Artières, T., and Gallinari, P. Learning compact class codes for fast inference in large multi class classification. Machine Learning and Knowledge Discovery in Databases, pp. 506-520, 2012.
Collins, M. and Koo, T. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1):25-70, 2005.
Davidson, J., Liebald, B., Liu, J., Nandy, P., Van Vleet, T., Gargi, U., Gupta, S., He, Y., Lambert, M., Livingston, B., et al. The YouTube video recommendation system. In Proceedings of the fourth ACM conference on Recommender systems, pp. 293-296. ACM, 2010.
Deerwester, Scott, Dumais, Susan T., Furnas, George W., Landauer, Thomas K., and Harshman, Richard. Indexing by latent semantic analysis. JASIS, 1990.
Deng, J., Satheesh, S., Berg, A.C., and Fei-Fei, L. Fast and balanced: Efficient label tree learning for large scale object recognition. In NIPS, volume 2, pp. 3, 2011.
Duda, R.O., Hart, P.E., and Stork, D.G. Pattern Classification and Scene Analysis 2nd ed. John Wiley & Sons, 1995.
Fellbaum, Christiane (ed.). WordNet: An Electronic Lexical Database. MIT Press, 1998.
Grangier, D. and Bengio, S. A discriminative kernel-based model to rank images from text queries. IEEE Trans. Pattern Analysis and Machine Intelligence, 30:1371-1384, 2008.
Indyk, P. and Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp. 604-613. ACM, 1998.
Perronnin, F., Akata, Z., Harchaoui, Z., and Schmid, C. Towards good practice in large-scale learning for image classification. In CVPR, 2012.
Platt, J., Cristianini, N., and Shawe-Taylor, J. Large margin DAGs for multiclass classification. In NIPS, pp. 547-553, 2000.
Rifkin, R. and Klautau, A. In defense of one-vs-all classification. JMLR, 5:101-141, 2004.
Robbins, Herbert and Monro, Sutton. A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407, 1951.
Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th international conference on World Wide Web, pp. 285-295. ACM, 2001.
Schoelkopf, B., Smola, A. J., and Muller, K. R. Kernel principal component analysis. Advances in kernel methods: support vector learning, pp. 327-352, 1999.
Usunier, N., Buffoni, D., and Gallinari, P. Ranking with ordered weighted pairwise classification. In ICML, 2009.
Weimer, M., Karatzoglou, A., Le, Q., Smola, A., et al. Cofirank-maximum margin matrix factorization for collaborative ranking. NIPS, 2007.
Weinberger, K.Q., Blitzer, J., and Saul, L.K. Distance metric learning for large margin nearest neighbor classification. In NIPS, 2006.
Weston, J., Bengio, S., and Usunier, N. Wsabie: Scaling up to large vocabulary image annotation. In Intl. Joint Conf. Artificial Intelligence (IJCAI), pp. 2764-2770, 2011.
Yianilos, P.N. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms, pp. 321. Society for Industrial and Applied Mathematics, 1993.
Yue, Yisong, Finley, Thomas, Radlinski, Filip, and Joachims, Thorsten. A support vector method for optimizing average precision. In Proceedings of the 30th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 271-278, 2007.
-----0
Beal, M.J. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University College London, 2003.
Beausang, J.F., Zurla, C., Manzo, C., Dunlap, D., Finzi, L., and Nelson, P.C. DNA looping kinetics analyzed using diffusive hidden Markov model. Biophys. J., 92(8):L64-6, May 2007.
Berger, J. Bayesian Robustness and the Stein Effect. JASA, 77(378):358-368, June 1982.
Bishop, C.M. Pattern recognition and machine learning. Springer, New York, 2006.
Borgia, A., Williams, P.M., and Clarke, J. Single-molecule studies of protein folding. Ann. Rev. Biochem., 77:101-25, January 2008.
Bronson, J.E., Fei, J., Hofman, J.M., Gonzalez, R.L., and Wiggins, C.H. Learning rates and states from biophysical time series: a Bayesian approach to model selection and single-molecule FRET data. Biophys. J., 97(12):3196-205, December 2009.
Bronson, J.E., Hofman, J.M., Fei, J., Gonzalez, R.L., and Wiggins, C.H. Graphical models for inferring single molecule dynamics. BMC Bioinf., 11 Suppl 8 (Suppl 8):S2, January 2010.
Carlin, B.P. and Louis, T.A. Bayes and Empirical Bayes Methods for Data Analysis. Number 1. Chapman & Hall, London, 1996.
Casella, G. An introduction to empirical Bayes data analysis. Am. Statistician, 39(2):83-87, 1985.
Corduneanu, Adrian and Bishop, CM. Variational Bayesian model selection for mixture distributions. AISTATS, 2001.
Cornish, P.V. and Ha, T. A survey of single-molecule techniques in chemical biology. ACS Chem. Biol., 2(1):53-61, January 2007.
Fei, J., Bronson, J.E., Hofman, J.M., Srinivas, R.L., Wiggins, C.H., and Gonzalez, R.L. Allosteric collaboration between elongation factor G and the ribosomal L1 stalk directs tRNA movements during translation. Proc. Nat. Acad. Sci. USA, 106(37):15702-7, September 2009.
Ghahramani, Z. and Jordan, M.I. Factorial Hidden Markov Models. Machine Learning, 29(2-3):245-273, 1997.
Greenfeld, M., Pavlichin, D.S., Mabuchi, H., and Herschlag, D. Single Molecule Analysis Research Tool (SMART): An Integrated Approach for Analyzing Single Molecule Data. PloS One, 7(2):e30024, January 2012.
Joo, C., Balci, H., Ishitsuka, Y., Buranachai, C., and Ha, T. Advances in single-molecule fluorescence methods for molecular biology. Ann. Rev. Biochem., 77:51-76, January 2008.
Jordan, M.I., Ghahramani, Z., and Jaakkola, T.S. An introduction to variational methods for graphical models. Machine Learning, 233:183-233, 1999.
Kass, R.E. and Steffey, D. Approximate Bayesian inference in conditionally independent hierarchical models (parametric empirical Bayes models). JASA, 84(407), 1989.
McKinney, S.A, Joo, C., and Ha, T. Analysis of single-molecule FRET trajectories using hidden Markov modeling. Biophys. J., 91(5):1941-51, September 2006.
Minka, Thomas P. Estimating a Dirichlet distribution. Technical Report 8, MIT, 2012.
Morris, C.N. Parametric Empirical Bayes Inference: Theory and Applications. JASA, 78(381):47-55, March 1983.
Neuman, K.C. and Nagy, A. Single-molecule force spectroscopy: optical tweezers, magnetic tweezers and atomic force microscopy. Nat. Methods, 5(6):491-505, June 2008.
Okamoto, K. and Sako, Y. Variational Bayes Analysis of a Photon-Based Hidden Markov Model for Single-Molecule FRET Trajectories. Biophys. J., 103(6):1315-24, September 2012.
Sorgenfrei, S., Chiu, C-Y., Gonzalez, R.L., Yu, Y-J., Kim, P., Nuckolls, C., and Shepard, K.L. Label-free single-molecule detection of DNA-hybridization kinetics with a carbon nanotube field-effect transistor. Nat. Nanotechnol., 6(2):126-32, February 2011.
Stein, C.M. Estimation of the Mean of a Multivariate Normal Distribution. Ann. Stat., 9(6):1135-1151, November 1981.
Tinoco, I. and Gonzalez, R.L. Biological mechanisms, one molecule at a time. Genes Dev., 25(12):1205-31, June 2011.
Wainwright, M.J. and Jordan, M.I. Graphical Models, Exponential Families, and Variational Inference. Found. Trends Machine Learning, 1(1-2):1-305, 2008.
-----0
Ahmed, A., Ho, Q., Teo, C. H., Eisenstein, J., Smola, A. J., and Xing, E. P. Online inference for the infinite topic-cluster model: Storylines from streaming text. In AISTATS, 2011.
Aldous, D. J. Exchangeability and related topics. In École d'Été de Probabilités de Saint-Flour XIII, 1985.
Antoniak, C. E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Statist., 2(6):1152-1174, 1974.
Asuncion, A., Smyth, P., and Welling, M. Asynchronous distributed learning of topic models. In NIPS, 2008.
Blei, D. M. and Jordan, M. I. Variational methods for the Dirichlet process. In ICML, 2004.
Fearnhead, P. Particle filters for mixture models with an unknown number of components. Statistics and Computing, 14:11-21, 2004.
Ferguson, T. S. A Bayesian analysis of some nonparametric problems. Ann. Statist., 1(2):209-230, 1973.
Fox, E. B., Sudderth, E. B., Jordan, M. I., and Willsky, A. S. An HDP-HMM for systems with state persistence. In ICML, 2008.
Ghosh, J. K. and Ramamoorthi, R. V. Bayesian Nonparametrics. Springer, 2003.
Ishwaran, H. and James, L. F. Gibbs sampling methods for stick-breaking priors. JASA, 96(453):161-173, 2001.
Kingman, J. F. C. Completely random measures. Pacific Journal of Mathematics, 21(1):59-78, 1967.
Kurihara, K., Welling, M., and Teh, Y.-W. Collapsed variational Dirichlet process mixture models. In IJCAI, 2007.
Lovell, D., Adams, R. P., and Mansinghka, V. K. Parallel Markov chain Monte Carlo for Dirichlet process mixtures. In Workshop on Big Learning, NIPS, 2012.
Neal, R. M. Markov chain sampling methods for Dirichlet process mixture models. Technical Report 9815, Dept. of Statistics, University of Toronto, 1998.
Rodriguez, A. On-line learning for the infinite hidden Markov model. Communications in Statistics - Simulation and Computation, 40(6):879-893, 2011.
Sohn, K.-A. and Xing, E. P. A hierarchical Dirichlet process mixture model for haplotype reconstruction from multi-population data. Ann. Appl. Stat., 3(2):791-821, 2009.
Sudderth, E. B., Torralba, A., Freeman, W. T., and Willsky, A. S. Describing visual scenes using transformed Dirichlet processes. In NIPS, 2005.
Teh, Y.-W., Jordan, M. I., Beal, M. J., and Blei, D. M. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566-1581, 2006.
Teh, Y.-W., Kurihara, K., and Welling, M. Collapsed variational inference for HDP. In NIPS, 2007.
Ulker, Y., Gunsel, B., and Cemgil, A. T. Sequential Monte Carlo samplers for Dirichlet process mixtures. In AISTATS, 2010.
Wang, C., Paisley, J., and Blei, D. M. Online variational inference for the hierarchical Dirichlet process. In AISTATS, 2011.
Xing, E. P., Ng, A. Y., Jordan, M. I., and Russell, S. Distance metric learning, with application to clustering with side-information. In NIPS, 2002.
-----0
Abrahamsen, P. A review of Gaussian random fields and correlation functions. Norwegian Computing Center Technical report, 1997.
Archambeau, C. and Bach, F. Multiple Gaussian process models. arXiv preprint arXiv:1110.5238, 2011.
Bochner, Salomon. Lectures on Fourier Integrals. (AM-42), volume 42. Princeton University Press, 1959.
Chatfield, C. Time Series Analysis: An Introduction. London: Chapman and Hall, 1989.
Cressie, N.A.C. Statistics for Spatial Data (Wiley Series in Probability and Statistics). Wiley-Interscience, 1993.
Damianou, A.C. and Lawrence, N.D. Deep Gaussian processes. arXiv preprint arXiv:1211.0358, 2012.
Durrande, N., Ginsbourger, D., and Roustant, O. Additive kernels for Gaussian process modeling. arXiv preprint arXiv:1103.4023, 2011.
Gönen, M. and Alpaydın, E. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211-2268, 2011.
Hyndman, R.J. Time series data library. 2005. http://www-personal.buseco.monash.edu.au/~hyndman/TSDL/.
Keeling, C. D. and Whorf, T. P. Atmospheric CO2 records from sites in the SIO air sampling network. Trends: A Compendium of Data on Global Change. Carbon Dioxide Information Analysis Center, 2004.
Kostantinos, N. Gaussian mixtures and their applications to signal processing. Advanced Signal Processing Handbook: Theory and Implementation for Radar, Sonar, and Medical Imaging Real Time Systems, 2000.
MacKay, David J.C. Introduction to Gaussian processes. In Bishop, Christopher M. (ed.), Neural Networks and Machine Learning, chapter 11, pp. 133-165. Springer-Verlag, 1998.
MacKay, D.J.C. et al. Bayesian nonlinear modeling for the prediction competition. ASHRAE Transactions, 100(2):1053-1062, 1994.
McCulloch, W.S. and Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biology, 5(4):115-133, 1943.
Murray, Iain and Adams, Ryan P. Slice sampling covariance hyperparameters in latent Gaussian models. In Advances in Neural Information Processing Systems 23, 2010.
Neal, R.M. Bayesian Learning for Neural Networks. Springer Verlag, 1996. ISBN 0387947248.
Rasmussen, Carl Edward. Evaluation of Gaussian Processes and Other Methods for Non-linear Regression. PhD thesis, University of Toronto, 1996.
Rasmussen, Carl Edward and Williams, Christopher K.I. Gaussian Processes for Machine Learning. The MIT Press, 2006.
Rosenblatt, F. Principles of Neurodynamics. Spartan Book, 1962.
Rumelhart, D.E., Hinton, G.E., and Williams, R.J. Learning representations by back-propagating errors. Nature, 323(6088):533-536, 1986.
Saatchi, Yunus. Scalable Inference for Structured Gaussian Process Models. PhD thesis, University of Cambridge, 2011.
Salakhutdinov, R. and Hinton, G. Using deep belief nets to learn covariance kernels for Gaussian processes. Advances in Neural Information Processing Systems, 20:1249-1256, 2008.
Stein, M.L. Interpolation of Spatial Data: Some Theory for Kriging. Springer Verlag, 1999.
Steyvers, M., Griffiths, T.L., and Dennis, S. Probabilistic inference in human semantic memory. Trends in Cognitive Sciences, 10(7):327-334, 2006.
Tenenbaum, J.B., Kemp, C., Griffiths, T.L., and Goodman, N.D. How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022):1279-1285, 2011.
Tipping, M. Bayesian inference: An introduction to principles and practice in machine learning. Advanced Lectures on Machine Learning, pp. 41-62, 2004.
Wilson, A.G. and Adams, R.P. Gaussian process kernels for pattern discovery and extrapolation supplementary material and code. 2013. http://mlg.eng.cam.ac.uk/andrew/smkernelsupp.pdf.
Wilson, Andrew G., Knowles, David A., and Ghahramani, Zoubin. Gaussian process regression networks. In Proceedings of the 29th International Conference on Machine Learning (ICML), 2012.
Wilson, Andrew Gordon, Knowles, David A, and Ghahramani, Zoubin. Gaussian process regression networks. arXiv preprint arXiv:1110.4411, 2011.
Yuille, A. and Kersten, D. Vision as Bayesian inference: analysis by synthesis? Trends in Cognitive Sciences, 10(7):301-308, 2006.
-----0
Andrews, D. F. and Mallows, C. L. Scale mixtures of normal distributions. Journal of the Royal Statistical Society, 36(1):99-102, 1974.
Banerjee, Onureena, Ghaoui, Laurent El, and d'Aspremont, Alexandre. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485-516, 2008.
Duchi, John, Gould, Stephen, and Koller, Daphne. Projected subgradient methods for learning sparse Gaussians. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2008.
Fan, Jianqing and Li, Runze. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348-1360, 2001.
Figueiredo, Mario A. T. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1150-1159, 2003.
Foygel, Rina and Drton, Mathias. Extended Bayesian information criteria for Gaussian graphical models. In Proceedings of Neural Information Processing Systems, 2010.
Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2008.
Hunter, David R. and Li, Runze. Variable selection using MM algorithms. The Annals of Statistics, 33(4), 2005.
Meinshausen, Nicolai and Bühlmann, Peter. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436-1462, 2006.
Park, Trevor and Casella, George. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681-686, 2008.
Rothman, Adam J., Bickel, Peter J., Levina, Elizaveta, and Zhu, Ji. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics, 2:494-515, 2008.
Rue, Håvard and Held, Leonhard. Gaussian Markov Random Fields: Theory and Applications. Chapman and Hall/CRC, 2005.
Sachs, Karen, Perez, Omar, Pe'er, Dana, Lauffenburger, Douglas A., and Nolan, Gary P. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308:523-529, 2005.
Sun, Tingni and Zhang, Cun-Hui. Sparse matrix inversion with scaled lasso. arXiv:1202.2723 [math.ST].
Yuan, Ming and Lin, Yi. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35, 2007.
-----0
Bollerslev, T. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31(3):307-327, 1986.
Casarin, R. and Marin, J.M. Online data processing: comparison of Bayesian regularized particle filters. Electronic Journal of Statistics, 3:239-258, 2009.
Chib, S., Omori, Y., and Asai, M. Multivariate stochastic volatility. Handbook of Financial Time Series, pp. 365-400, 2009.
Cont, R. Empirical properties of asset returns: Stylized facts and statistical issues. Quantitative Finance, 1(2):223-236, 2001.
Demšar, J. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1-30, 2006.
Doucet, A., De Freitas, N., and Gordon, N. Sequential Monte Carlo methods in practice. Springer Verlag, 2001.
Engle, R.F. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica: Journal of the Econometric Society, pp. 987-1007, 1982.
Engle, R.F. and Kroner, K.F. Multivariate simultaneous generalized ARCH. Econometric Theory, 11(01):122-150, 1995.
Fiorentini, G., Sentana, E., and Calzolari, G. Maximum likelihood estimation and inference in multivariate conditionally heteroscedastic dynamic regression models with Student t innovations. Journal of Business and Economic Statistics, 21(4):532-546, 2003.
Fox, E. and Dunson, D. Bayesian nonparametric covariance regression. Arxiv preprint arXiv:1101.2017, 2011.
Gilks, W.R. and Berzuini, C. Following a moving target - Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(1):127-146, 2001.
Gordon, N.J., Salmond, D.J., and Smith, A.F.M. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In Radar and Signal Processing, IEE Proceedings F, volume 140, pp. 107-113. IET, 1993.
Gourieroux, C., Jasiak, J., and Sufana, R. The Wishart autoregressive process of multivariate stochastic volatility. Journal of Econometrics, 150(2):167-181, 2009.
Harvey, A., Ruiz, E., and Sentana, E. Unobserved component time series models with ARCH disturbances. Journal of Econometrics, 52(1-2):129-157, 1992.
Harvey, A., Ruiz, E., and Shephard, N. Multivariate stochastic variance models. The Review of Economic Studies, 61(2):247-264, 1994.
Lázaro-Gredilla, Miguel and Titsias, Michalis K. Variational heteroscedastic Gaussian process regression. In ICML, pp. 841-848, 2011.
Liu, J. and West, M. Combined parameter and state estimation in simulation-based filtering. Institute of Statistics and Decision Sciences, Duke University, 1999.
Musso, C., Oudjane, N., and LeGland, F. Improving regularised particle filters, 2001.
Patton, A. J. Modelling asymmetric exchange rate dependence. International Economic Review, 47(2):527-556, 2006.
Philipov, A. and Glickman, M.E. Factor multivariate stochastic volatility via Wishart processes. Econometric Reviews, 25(2-3):311-334, 2006.
Pitt, M.K. and Shephard, N. Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, pp. 590-599, 1999.
Wilson, Andrew and Ghahramani, Zoubin. Copula processes. In Lafferty, J., Williams, C. K. I., Shawe-Taylor, J., Zemel, R.S., and Culotta, A. (eds.), Advances in Neural Information Processing Systems 23, pp. 2460-2468. 2010.
Wilson, Andrew and Ghahramani, Zoubin. Generalised Wishart processes. In UAI-11, pp. 736-744, 2011.
-----0
Bansal, N., Blum, A., and Chawla, S. Correlation clustering. Machine Learning, Special Issue on Clustering, 56:89-113, 2004.
Chakrabarti, D., Papadimitriou, S., Modha, D. S., and Faloutsos, C. Fully automatic cross-associations. In KDD, pp. 79-88, 2004.
Cheng, Y. and Church, G. M. Biclustering of expression data. In International Conference on Intelligent Systems for Molecular Biology (ISMB), pp. 93-103, 2000.
Cho, H., Dhillon, I. S., Guan, Y., and Sra, S. Minimum sum-squared residue co-clustering of gene expression data. In Proceedings of the Fourth SIAM International Conference on Data Mining, pp. 114-125. SIAM, 2004.
Dhillon, I. S., Mallela, S., and Modha, D. S. Information-theoretic co-clustering. In KDD, pp. 89-98, 2003.
Giotis, I. and Guruswami, V. Correlation clustering with a fixed number of clusters. Theory of Computing, 2(13):249-266, 2006.
Goldreich, O., Goldwasser, S., and Ron, D. Property testing and its connection to learning and approximation. Journal of the ACM, 45(4):653-750, July 1998.
Hartigan, J.A. Direct clustering of a data matrix. Journal of the American Statistical Association, 67(337):123-129, 1972.
Hirai, Hiroshi, Chou, Bin-Hui, and Suzuki, Einoshin. A parameter-free method for discovering generalized clusters in a network. In Discovery Science, pp. 135-149, 2011.
Kemp, C. and Tenenbaum, J. B. Learning systems of concepts with an infinite relational model. In Proceedings of the 21st National Conference on Artificial Intelligence, 2006.
Kemp, C. and Tenenbaum, J. B. The discovery of structural form. Proceedings of the National Academy of Sciences of the United States of America, 2008.
Osherson, D., Stern, J., Wilkie, O., Stob, M., and Smith, E. E. Default probability. Cognitive Science, 15:251-270, 1991.
Papadimitriou, S., Sun, J., Faloutsos, C., and Yu, P. S. Hierarchical, parameter-free community discovery. In ECML/PKDD (2), pp. 170-187, 2008.
Tanay, A., Sharan, R., and Shamir, R. Biclustering Algorithms: A Survey. In Handbook of Computational Molecular Biology, 2006.
-----0
D'Alessandro, M., Vachtsevanos, G., Esteller, R., Echauz, J., Cranstoun, S., Worrell, G., Parish, L., and Litt, B. A multi-feature and multi-channel univariate selection process for seizure prediction. Clinical Neurophysiology, 116:506-516, 2005.
Dawid, A. P. and Lauritzen, S. L. Hyper Markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics, 21(3):1272-1317, 1993.
Dong, W., Pentland, A., and Heller, K. A. Graph-Coupled HMMs for Modeling the Spread of Infection. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, 2012.
Doshi-Velez, F., Wingate, D., Tenenbaum, J., and Roy, N. Infinite dynamic Bayesian networks. In Proceedings of the 28th International Conference on Machine Learning, 2011.
Esteller, R., Echauz, J., Tcheng, T., Litt, B., and Pless, B. Line length: an efficient feature for seizure onset detection. In Proceedings of the 23rd EMBS Conference, 2001.
Fox, E. B., Sudderth, E. B., Jordan, M. I., and Willsky, A. S. Sharing features among dynamical systems with beta processes. Advances in Neural Information Processing Systems, 22, 2009.
Fox, E. B., Sudderth, E. B., Jordan, M. I., and Willsky, A. S. A sticky HDP-HMM with application to speaker diarization. The Annals of Applied Statistics, 5(2A):1020-1056, 2011a.
Fox, E. B., Sudderth, E. B., Jordan, M. I., and Willsky, A. S. Bayesian Nonparametric Inference of Switching Dynamic Linear Models. IEEE Transactions on Signal Processing, 59(4):1569-1585, April 2011b.
Ghahramani, Z. and Jordan, M. I. Factorial hidden Markov models. Machine learning, 31, 1997.
Griffiths, T. L. and Ghahramani, Z. Infinite Latent Feature Models and the Indian Buffet Process. Gatsby Computational Neuroscience Unit, Technical Report #2005-001, 2005.
Heller, K., Teh, Y. W., and Gorur, D. The Infinite Hierarchical Hidden Markov Model. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, 2009.
Hughes, M., Fox, E., and Sudderth, E. Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data. In Advances in Neural Information Processing Systems, 2012.
Ishwaran, H. and Zarepour, M. Exact and approximate sum representations for the Dirichlet process. Canadian Journal of Statistics, 30(2):269-283, 2002.
Krystal, A. D., Prado, R., and West, M. New methods of time series analysis of non-stationary EEG data: eigenstructure decompositions of time varying autoregressions. Clinical Neurophysiology, 110:2197-2206, 1999.
Litt, B., Esteller, R., Echauz, J., D'Alessandro, M., Shor, R., Henry, T., Pennell, P., Epstein, C., Bakay, R., Dichter, M., and Vachtsevanos, G. Epileptic seizures may begin hours in advance of clinical onset: a report of five patients. Neuron, 30(1):51-64, April 2001.
Mirowski, P., Madhavan, D., LeCun, Y., and Kuzniecky, R. Classification of patterns of EEG synchronization for seizure prediction. Clinical Neurophysiology, 120(11):1927-1940, 2009.
Prado, R., Molina, F., and Huerta, G. Multivariate time series modeling and classification via hierarchical VAR mixtures. Computational Statistics & Data Analysis, 51(3):1445-1462, December 2006.
Schiff, S. J., Sauer, T., Kumar, R., and Weinstein, S. L. Neuronal spatiotemporal pattern discrimination: the dynamical evolution of seizures. NeuroImage, 28(4):1043-1055, December 2005.
Schindler, K., Leung, H., Elger, C. E., and Lehnertz, K. Assessing seizure dynamics by analysing the correlation structure of multichannel intracranial EEG. Brain, 130:65-77, January 2007.
Teh, Y. W., Jordan, M. I., Beal, M. J., and Blei, D. M. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566-1581, 2006.
Thibaux, R. and Jordan, M. I. Hierarchical beta processes and the Indian buffet process. In Proceedings of the Tenth Conference on Artificial Intelligence and Statistics, 2007.
Van Gael, J., Saatci, Y., Teh, Y. W., and Ghahramani, Z. Beam sampling for the infinite hidden Markov model. In Proceedings of the 25th International Conference on Machine Learning, 2008.
-----0
Andrew, Galen and Gao, Jianfeng. Scalable training of L1-regularized log-linear models. In Proceedings of the 24th International Conference on Machine Learning, pp. 33-40. ACM, 2007.
Banerjee, O., Ghaoui, L. El, and d'Aspremont, A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data, 2008.
Bartlett, Peter L, Mendelson, Shahar, and Neeman, Joseph. ℓ1-regularized linear regression: Persistence and oracle inequalities. Probability Theory and Related Fields, pp. 1-32, 2012.
Beck, Amir and Teboulle, Marc. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
Duchi, J. C., Gould, S., and Koller, D. Projected subgradient methods for learning sparse Gaussians. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2008.
Friedman, J., Hastie, T., and Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2008.
Hong, T. Global energy forecasting competition, 2012. URL http://www.gefcom.org.
Hsieh, C.-J., Sustik, M. A., Dhillon, I. S., and Ravikumar, P. Sparse inverse covariance matrix estimation using quadratic approximation. In Neural Information Processing Systems, 2011.
Lu, Z. Smooth optimization approaches for sparse inverse covariance selection. SIAM Journal on Optimization, 19(4):1807-1827, 2009.
Ng, A. Y. and Jordan, M. I. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. 2002.
Olsen, Peder A, Oztoprak, Figen, Nocedal, Jorge, and Rennie, Stephen J. Newton-like methods for sparse inverse covariance estimation. Optimization Online, 2012.
Ravikumar, Pradeep, Wainwright, Martin J, Raskutti, Garvesh, and Yu, Bin. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics, 5:935-980, 2011.
Scheinberg, K., Ma, S., and Goldfarb, D. Sparse inverse covariance selection via alternating linearization methods. In Neural Information Processing Systems, 2010.
Sohn, Kyung-Ah and Kim, Seyoung. Joint estimation of structured sparsity and output structure in multiple-output regression via inverse-covariance regularization. In Proceedings of the Conference on Artificial Intelligence and Statistics, 2012.
Soliman, S. A. and Al-Kandari, A. M. Electrical Load Forecasting: Modeling and Model Construction. Elsevier, 2010.
Sutton, C. and McCallum, A. An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4):267-373, 2012.
Tseng, Paul and Yun, Sangwoon. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117(1):387-423, 2009.
Various. PJM Manual 19: Load Forecasting and Analysis. PJM, 2012. Available at: http://www.pjm.com/planning/resource-adequacy-planning/~/media/documents/manuals/m19.ashx.
Wainwright, M.J. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183-2202, 2009.
Wytock, Matt and Kolter, J. Zico. Sparse conditional Gaussian random fields. In NIPS Workshop on Log Linear Models, 2012.
Yuan, Xiao-Tong and Zhang, Tong. Partial Gaussian graphical model estimation. CoRR, abs/1209.6419, 2012.
-----0
Bach, Francis, Jenatton, Rodolphe, Mairal, Julien, and Obozinski, Guillaume. Convex Optimization with Sparsity-Inducing Norms. 2010.
Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
Bickel, P.J., Ritov, Y., and Tsybakov, A.B. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 37(4):1705-1732, 2009.
Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Now Publishers, 2011.
Breheny, P. and Huang, J. Penalized methods for bi-level variable selection. Statistics and its interface, 2(3):369, 2009.
Combettes, P.L. and Pesquet, J.C. Proximal splitting methods in signal processing. Fixed-Point Algorithms for Inverse Problems in Science and Engineering, 2010.
Donoho, D.L. De-noising by soft-thresholding. Information Theory, IEEE Transactions on, 41(3):613-627, 2002. ISSN 0018-9448.
Duchi, J., Shalev-Shwartz, S., Singer, Y., and Chandra, T. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pp. 272-279. ACM, 2008.
Fan, J. and Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348-1360, 2001.
Frank, A. and Asuncion, A. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml.
Friedman, J., Hastie, T., and Tibshirani, R. A note on the group lasso and a sparse group lasso. Arxiv preprint arXiv:1001.0736, 2010.
Grant, M. and Boyd, S. CVX: Matlab software for disciplined convex programming, version 1.21. http://cvxr.com/cvx, April 2011.
Huang, J. and Zhang, T. The benet of group sparsity.The Annals of Statistics, 38(4):1978{2004, 2010.
Huang, J., Ma, S., Xie, H., and Zhang, C.H. A group bridge approach for variable selection. Biometrika, 96 (2):339{355, 2009.
Huang, J., Breheny, P., and Ma, S. A selective review of group selection in high dimensional models. arXiv preprint arXiv:1204.6491, 2012.
Kolmogorov, A.N. and Tihomirov, V.M. ε-Entropy and ε-capacity of sets in functional spaces. American Mathematical Society, 1961.
Mazumder, R., Friedman, J.H., and Hastie, T. SparseNet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association, 106(495):1125-1138, 2011.
Meier, L., Van De Geer, S., and Bühlmann, P. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):53-71, 2008.
Natarajan, B. K. Sparse approximation solutions to linear systems. SIAM J. Comput., 24(2):227-234, 1995.
Nesterov, Y. Gradient methods for minimizing composite objective function. CORE Discussion Papers, 2007.
Schmidt, M., Le Roux, N., Bach, F., et al. Convergence rates of inexact proximal-gradient methods for convex optimization. In NIPS'11 - 25th Annual Conference on Neural Information Processing Systems, 2011.
Shen, X., Pan, W., and Zhu, Y. Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association, 107:223-232, 2012.
Tao, P.D. and An, L.T.H. Convex analysis approach to dc programming: Theory, algorithms and applications. Acta Math. Vietnam, 22(1):289-355, 1997.
Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Statistical Methodology), pp. 267-288, 1996.
Van De Geer, S.A. and Bühlmann, P. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360-1392, 2009.
Wang, L., Chen, G., and Li, H. Group SCAD regression analysis for microarray time course gene expression data. Bioinformatics, 23(12):1486-1494, 2007.
Yang, H., Xu, Z., King, I., and Lyu, M. Online learning for group lasso. In Proceedings of the 27th International Conference on Machine Learning, pp. 1191-1198. ACM, 2010.
Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49-67, 2006.
Zhang, D. and Shen, D. Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer's disease, 2011.
Zhang, T. Analysis of multi-stage convex relaxation for sparse regularization. The Journal of Machine Learning Research, 11:1081-1107, 2010.
Zhang, T. Multi-stage convex relaxation for feature selection. arXiv:1106.0565v2, 2011.
Zhao, P. and Yu, B. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541-2563, 2006.
-----0
Ando, R. and Zhang, T. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research (JMLR), 6:1817-1853, 2005.
Ben-David, S., Blitzer, J., Crammer, K., and Pereira, F. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems (NIPS), 2007.
Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. A theory of learning from different domains. Machine Learning, 79:151-175, 2010.
Bengio, Y. and Senecal, J. Quick training of probabilistic neural nets by importance sampling. In Proc. of the conference on Artificial Intelligence and Statistics (AISTATS), 2003.
Bengio, Y., Ducharme, R., and Vincent, P. A neural probabilistic language model. In Advances in Neural Information Processing Systems (NIPS), 2000.
Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. A neural probabilistic language model. J. of Machine Learn. Research (JMLR), 3:1137-1155, 2003.
Blitzer, J., Weinberger, K., Saul, L., and Pereira, F. Hierarchical distributed representations for statistical language modeling. In Advances in Neural Information Processing Systems (NIPS), 2004.
Blitzer, J., McDonald, R., and Pereira, F. Domain adaptation with structural correspondence learning. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2006.
Blitzer, J., Foster, D., and Kakade, S. Domain adaptation with coupled subspaces. In Proc. of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
Carreras, X. and Màrquez, L. Introduction to the CoNLL-2005 shared task: semantic role labeling. In Proc. of the Conference on Computational Natural Language Learning (CoNLL), 2005.
Collobert, R. and Weston, J. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proc. of the International Conf. on Machine Learning (ICML), 2008.
Daumé III, H. Frustratingly easy domain adaptation. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2007.
Daumé III, H. and Marcu, D. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101-126, 2006.
Daumé III, H., Kumar, A., and Saha, A. Co-regularization based semi-supervised domain adaptation. In Advances in Neural Information Processing Systems (NIPS), 2010.
Gillick, L. and Cox, S. Some statistical issues in the comparison of speech recognition algorithms. In Proc. of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1989.
Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proc. of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
Gutmann, M. U. and Hyvärinen, A. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. of Machine Learning Research (JMLR), 13:307-361, 2012.
Huang, F. and Yates, A. Distributional representations for handling sparsity in supervised sequence labeling. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), 2009.
Huang, F. and Yates, A. Exploring representation-learning approaches to domain adaptation. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), 2010.
Jiang, J. and Zhai, C. Instance weighting for domain adaptation in NLP. In Proc. of the Annual Meeting of the Association of Computational Linguistics (ACL), 2007.
Maas, A. and Ng, A. A probabilistic model for semantic word vectors. In Advances in Neural Information Processing Systems Workshop on Deep Learning and Unsupervised Feature Learning, 2010.
McClosky, D., Charniak, E., and Johnson, M. Automatic domain adaptation for parsing. In Proc. of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2010.
Mnih, A. and Hinton, G. Three new graphical models for statistical language modelling. In Proc. of the International Conference on Machine learning (ICML), 2007.
Mnih, A. and Hinton, G. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems (NIPS), 2009.
Mnih, A. and Teh, Y. A fast and simple algorithm for training neural probabilistic language models. In Proc. of the International Conference on Machine Learning (ICML), 2012.
Socher, R., Lin, C., Ng, A., and Manning, C. Parsing natural scenes and natural language with recursive neural networks. In Proc. of the International Conference on Machine Learning (ICML), 2011.
Turian, J., Ratinov, L., and Bengio, Y. Word representations: a simple and general method for semi-supervised learning. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL), 2010.
-----0
Bengio, S., Weston, J., and Grangier, D. Label embedding trees for large multi-class tasks. NIPS, 23:163-171, 2010.
Boyd, S.P. and Vandenberghe, L. Convex optimization. Cambridge Univ Pr, 2004.
Breiman, L. Classification and regression trees. Chapman & Hall/CRC, 1984.
Busa-Fekete, R., Benbouzid, D., Kégl, B., et al. Fast classification using sparse decision DAGs. In ICML, 2012.
Cambazoglu, B.B., Zaragoza, H., Chapelle, O., Chen, J., Liao, C., Zheng, Z., and Degenhardt, J. Early exit optimizations for additive machine learned ranking systems. In WSDM3, pp. 411-420, 2010.
Chapelle, O. and Chang, Y. Yahoo! learning to rank challenge overview. In JMLR: Workshop and Conference Proceedings, volume 14, pp. 1-24, 2011.
Chapelle, O., Shivaswamy, P., Vadrevu, S., Weinberger, K., Zhang, Y., and Tseng, B. Boosted multi-task learning. Machine Learning, 85(1):149-173, 2011.
Chen, M., Xu, Z., Weinberger, K. Q., and Chapelle, O. Classifier cascade for minimizing feature evaluation cost. In AISTATS, 2012.
Deng, J., Satheesh, S., Berg, A.C., and Fei-Fei, L. Fast and balanced: Efficient label tree learning for large scale object recognition. In NIPS, 2011.
Dredze, M., Gevaryahu, R., and Elias-Bachrach, A. Learning fast classifiers for image spam. In Proceedings of the Conference on Email and Anti-Spam (CEAS), 2007.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
Fleck, M., Forsyth, D., and Bregler, C. Finding naked people. ECCV, pp. 593–602, 1996.
Friedman, J.H. Greedy function approximation: a gradient boosting machine. The Annals of Statistics, pp. 1189–1232, 2001.
Gao, T. and Koller, D. Active classification based on value of classifier. In NIPS, pp. 1062–1070, 2011.
Grubb, A. and Bagnell, J. A. Speedboost: Anytime prediction with uniform near-optimality. In AISTATS, 2012.
Jarvelin, K. and Kekalainen, J. Cumulated gain-based evaluation of IR techniques. ACM TOIS, 20(4):422–446, 2002.
Jordan, M.I. and Jacobs, R.A. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.
Karayev, S., Baumgartner, T., Fritz, M., and Darrell, T. Timely object recognition. In Advances in Neural Information Processing Systems 25, pp. 899–907, 2012.
Kowalski, M. Sparse regression using mixed norms. Applied and Computational Harmonic Analysis, 27(3):303–324, 2009.
Lefakis, L. and Fleuret, F. Joint cascade optimization using a product of boosted classifiers. In NIPS, pp. 1315–1323, 2010.
Pujara, J., Daume III, H., and Getoor, L. Using classifier cascades for scalable e-mail classification. In CEAS, 2011.
Saberian, M. and Vasconcelos, N. Boosting classifier cascades. In Lafferty, J., Williams, C. K. I., Shawe-Taylor, J., Zemel, R.S., and Culotta, A. (eds.), NIPS, pp. 2047–2055, 2010.
Scholkopf, B. and Smola, A.J. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT press, 2001.
Viola, P. and Jones, M.J. Robust real-time face detection. IJCV, 57(2):137–154, 2004.
Wang, J. and Saligrama, V. Local supervised learning through space partitioning. In Advances in Neural Information Processing Systems 25, pp. 91–99, 2012.
Weinberger, K.Q., Dasgupta, A., Langford, J., Smola, A., and Attenberg, J. Feature hashing for large scale multitask learning. In ICML, pp. 1113–1120, 2009.
Xu, Z., Weinberger, K., and Chapelle, O. The greedy miser: Learning under test-time budgets. In ICML, pp. 1175–1182, 2012.
Zheng, Z., Zha, H., Zhang, T., Chapelle, O., Chen, K., and Sun, G. A general boosting method and its application to learning ranking functions for web search. In NIPS, pp. 1697–1704. Cambridge, MA, 2008.
-----0
Doshi-Velez, F., Miller, K. T., van Gael, J., and Teh, Y. W. Variational inference for the Indian buffet process. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 12, 2009.
Germain, P., Lacasse, A., Laviolette, F., and Marchand, M. PAC-Bayesian learning of linear classifiers. In International Conference on Machine Learning (ICML), pp. 353–360, 2009.
Griffiths, T. and Ghahramani, Z. Infinite latent feature models and the Indian buffet process. Technical report, Gatsby Computational Neuroscience Unit, 2005.
Jaakkola, T., Meila, M., and Jebara, T. Maximum entropy discrimination. In Advances in Neural Information Processing Systems (NIPS), 1999.
Marlin, B. and Zemel, R. S. The multiple multiplicative factor model for collaborative filtering. In International Conference on Machine Learning (ICML), 2004.
McAllester, D. PAC-Bayesian stochastic model selection. Machine Learning, 51:5–21, 2003.
Polson, N. G. and Scott, S. L. Data augmentation for support vector machines. Bayesian Analysis, 6(1):1–24, 2011.
Rennie, J. D. M. and Srebro, N. Fast maximum margin matrix factorization for collaborative prediction. In International Conference on Machine Learning (ICML), 2005.
Salakhutdinov, R. and Mnih, A. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In International Conference on Machine Learning (ICML), 2008.
Srebro, N., Rennie, J. D. M., and Jaakkola, T. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems (NIPS), 2005.
Tanner, M. A. and Wong, W. H. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association (JASA), 82(398):528–540, 1987.
van Dyk, D. and Meng, X. The art of data augmentation. Journal of Computational and Graphical Statistics (JCGS), 10(1):1–50, 2001.
Xu, M., Zhu, J., and Zhang, B. Nonparametric maximum margin matrix factorization for collaborative prediction. In Advances in Neural Information Processing Systems (NIPS), 2012.
Zhou, M., Wang, C., Chen, M., Paisley, J., Dunson, D., and Carin, L. Nonparametric Bayesian matrix completion. In Sensor Array and Multichannel Signal Processing Workshop (SAM), pp. 213–216, 2010.
Zhu, J., Ahmed, A., and Xing, E. P. MedLDA: Maximum margin supervised topic models for regression and classification. In International Conference on Machine Learning (ICML), 2009.
Zhu, J., Chen, N., Perkins, H., and Zhang, B. Gibbs max-margin topic models with fast sampling algorithms. In International Conference on Machine Learning (ICML), 2013a.
Zhu, J., Chen, N., and Xing, E. P. Bayesian inference with posterior regularization and applications to infinite latent SVMs. arXiv Report, arXiv:1210.1766v2, 2013b.
-----0
Bonnans, J Frederic and Shapiro, Alexander. Optimization problems with perturbations: A guided tour. SIAM Review, 40(2):228–264, 1998.
Breiman, L. Classification and regression trees. Chapman & Hall/CRC, 1984.
Busa-Fekete, R., Benbouzid, D., Kegl, B., et al. Fast classification using sparse decision dags. In ICML, 2012.
Cambazoglu, B.B., Zaragoza, H., Chapelle, O., Chen, J., Liao, C., Zheng, Z., and Degenhardt, J. Early exit optimizations for additive machine learned ranking systems. In WSDM3, pp. 411–420, 2010.
Chapelle, O. and Chang, Y. Yahoo! learning to rank challenge overview. In JMLR: Workshop and Conference Proceedings, volume 14, pp. 1–24, 2011.
Chapelle, O., Vapnik, V., Bousquet, O., and Mukherjee, S. Choosing multiple parameters for support vector machines. Machine Learning, 46(1):131–159, 2002.
Chapelle, O., Shivaswamy, P., Vadrevu, S., Weinberger, K., Zhang, Y., and Tseng, B. Boosted multi-task learning. Machine Learning, 85(1):149–173, 2011.
Chen, M., Xu, Z., Weinberger, K. Q., and Chapelle, O. Classifier cascade for minimizing feature evaluation cost. In AISTATS, 2012.
Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
Dekel, Ofer, Shalev-Shwartz, Shai, and Singer, Yoram. The forgetron: A kernel-based perceptron on a budget. SIAM Journal on Computing, 37(5):1342–1372, 2008.
Dredze, M., Gevaryahu, R., and Elias-Bachrach, A. Learning fast classifiers for image spam. In Proceedings of the Conference on Email and Anti-Spam (CEAS), 2007.
Freund, Y. and Schapire, R. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory, pp. 23–37. Springer, 1995.
Friedman, J.H. Greedy function approximation: a gradient boosting machine. The Annals of Statistics, pp. 1189–1232, 2001.
Gao, T. and Koller, D. Active classification based on value of classifier. In NIPS, pp. 1062–1070, 2011a.
Gao, Tianshi and Koller, Daphne. Multiclass boosting with hinge loss based on output coding. ICML '11, pp. 569–576, 2011b.
Gavrila, D. Pedestrian detection from a moving vehicle. ECCV 2000, pp. 37–49, 2000.
Grubb, A. and Bagnell, J. A. Speedboost: Anytime prediction with uniform near-optimality. In AISTATS, 2012.
Grubb, A. and Bagnell, J.A. Generalized boosting algorithms for convex optimization. arXiv preprint arXiv:1105.2054, 2011.
Grubb, Alexander and Bagnell, J Andrew. Boosted backpropagation learning for training deep modular networks. In Proceedings of the International Conference on Machine Learning (27th ICML), 2010.
Kedem, Dor, Tyree, Stephen, Weinberger, Kilian Q., Sha, Fei, and Lanckriet, Gert. Non-linear metric learning. In NIPS, pp. 2582–2590, 2012.
Lazebnik, S., Schmid, C., and Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, pp. 2169–2178, 2006.
Lefakis, L. and Fleuret, F. Joint cascade optimization using a product of boosted classifiers. In NIPS, pp. 1315–1323, 2010.
Li, L.J., Su, H., Xing, E.P., and Fei-Fei, L. Object bank: A high-level image representation for scene classification and semantic feature sparsification. NIPS, 2010.
Mohan, A., Chen, Z., and Weinberger, K. Q. Web-search ranking with initialized gradient boosted regression trees. JMLR: Workshop and Conference Proceedings, 14:77–89, 2011.
Petersen, K. B. and Pedersen, M. S. The matrix cookbook, Oct 2008.
Platt, J.C. Fast training of support vector machines using sequential minimal optimization. 1999.
Pujara, J., Daume III, H., and Getoor, L. Using classifier cascades for scalable e-mail classification. In CEAS, 2011.
Raykar, V.C., Krishnapuram, B., and Yu, S. Designing efficient cascaded classifiers: tradeoff between accuracy and cost. In ACM SIGKDD, pp. 853–860, 2010.
Saberian, M. and Vasconcelos, N. Boosting classifier cascades. In NIPS, pp. 2047–2055, 2010.
Trzcinski, Tomasz, Christoudias, Mario, Lepetit, Vincent, and Fua, Pascal. Learning image descriptors with the boosting-trick. In NIPS, pp. 278–286, 2012.
Tyree, S., Weinberger, K.Q., Agrawal, K., and Paykin, J. Parallel boosted regression trees for web search ranking. In WWW, pp. 387–396. ACM, 2011.
Viola, P. and Jones, M.J. Robust real-time face detection. IJCV, 57(2):137–154, 2004.
Wang, J. and Saligrama, V. Local supervised learning through space partitioning. In NIPS, pp. 91–99, 2012.
Weinberger, K.Q., Dasgupta, A., Langford, J., Smola, A., and Attenberg, J. Feature hashing for large scale multitask learning. In ICML, pp. 1113–1120, 2009.
Xiao, Jianxiong, Hays, James, Ehinger, Krista A, Oliva, Aude, and Torralba, Antonio. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, pp. 3485–3492. IEEE, 2010.
Xu, Z., Weinberger, K.Q., and Chapelle, O. The greedy miser: Learning under test-time budgets. In ICML, pp. 1175–1182, 2012.
Xu, Zhixiang, Kusner, Matt J., Weinberger, Kilian Q., and Chen, Minmin. Cost-sensitive tree of classifiers. In Dasgupta, Sanjoy and McAllester, David (eds.), ICML '13, to appear, 2013.
Zheng, Z., Zha, H., Zhang, T., Chapelle, O., Chen, K., and Sun, G. A general boosting method and its application to learning ranking functions for web search. In NIPS, pp. 1697–1704. Cambridge, MA, 2008.
-----0
Bacry, E., Delattre, S., Hoffmann, M., and Muzy, J.F. Modeling microstructure noise with mutually exciting point processes. Quantitative Finance, 2012.
Blei, D.M. and Lafferty, J.D. Dynamic topic models. In ICML 2006, pp. 113–120, 2006.
Crane, R. and Sornette, D. Robust dynamic classes revealed by measuring the response function of a social system. Proc. of the National Academy of Sciences, 105:15649–15653, 2008.
Gomez-Rodriguez, M., Balduzzi, D., and Scholkopf, B. Uncovering the temporal dynamics of diffusion networks. In ICML 2011, pp. 561–568, 2011.
Gomez-Rodriguez, M. and Scholkopf, B. Submodular inference of diffusion networks from multiple trees. In ICML 2012, 2012.
Guttorp, P. and Thorarinsdottir, T.L. Bayesian inference for non-Markovian point processes. In Advances and Challenges in Space-time Modelling of Natural Events, pp. 79–102, 2012.
Hawkes, A.G. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58:83–90, 1971.
Kvam, P. and Day, D. The multivariate Polya distribution in combat modeling. Naval Research Logistics, 48:1–17, 2001.
Lehmann, J., Goncalves, B., Ramasco, J., and Cattuto, C. Dynamical classes of collective attention in Twitter. In World Wide Web 2012, pp. 251–260, 2012.
Leskovec, J., Backstrom, L., and Kleinberg, J. Meme-tracking and the dynamics of the news cycle. In ACM SIGKDD 2009, pp. 497–506, 2009.
Liniger, T. Multivariate Hawkes processes. ETH Doctoral Dissertation No. 18403, 2009.
Meyers, S. and Leskovec, J. On the convexity of latent social network inference. In Advances in Neural Information Processing Systems 26, 2010.
Mohler, G.O., Short, M.B., Brantingham, P.J., Schoenberg, F.P., and Tita, G. E. Self-exciting point process modeling of crime. Journal of the American Statistical Association, 106(493):100–108, 2011.
Ogata, Y. On Lewis' simulation method for point processes. IEEE Transactions on Information Theory, 27(1):23–31, Jan 1981.
Pan, W., Dong, W., Cebrian, M., Kim, T., and Pentland, A. Modeling dynamical influence in human interaction. MIT Media Lab technical report, March 2011.
Ryu, H., Lease, M., and Woodward, N. Finding and exploring memes in social media. In Proceedings of the 23rd ACM Conference on Hypertext and Social Media, HT '12, pp. 295–304, 2012.
Snowsill, T. M., Fyson, N., De Bie, T., and Cristianini, N. Refining causality: who copied from whom? In ACM SIGKDD 2011, pp. 466–474, 2011.
Wortman, J. Viral marketing and the diffusion of trends on social networks. UPenn Technical Report MS-CIS-08-19, May 2008.
Ypma, RJ, Bataille, AM, Stegeman, A, Koch, G, Wallinga, J, and van Ballegooijen, WM. Unravelling transmission trees of infectious diseases by combining genetic and epidemiological data. Proc. of the Royal Society B, 279:444–450, Feb 2012.
-----0
Barnes, J. and Hut, P. A hierarchical O(N log N) force-calculation algorithm. Nature, 324(4):446–449, 1986.
Beygelzimer, A., Kakade, S., and Langford, J. Cover trees for nearest neighbor. In International Conference on Machine Learning (ICML), pp. 97–104, 2006.
Carreira-Perpinan, M. The elastic embedding algorithm for dimensionality reduction. In International Conference on Machine Learning (ICML), pp. 167–174, 2010.
Gray, A. and Moore, A. N-body problems in statistical learning. In Advances in Neural Information Processing Systems (NIPS), pp. 521–527, 2001.
Gray, A. and Moore, A. Rapid evaluation of multiple density models. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2003a.
Gray, A. and Moore, A. Nonparametric density estimation: toward computational tractability. In SIAM International Conference on Data Mining (SDM), 2003b.
Greengard, L. and Rokhlin, V. A fast algorithm for particle simulations. Journal of Computational Physics, 73(2):325–348, 1987.
Hinton, G.E. and Roweis, S.T. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems (NIPS), pp. 833–840, 2002.
Liu, T., Moore, A., Gray, A., and Yang, K. An investigation of practical approximate nearest neighbor algorithms. In Advances in Neural Information Processing Systems (NIPS), pp. 825–832, 2004.
Miller, F.P., Vandome, A.F., and McBrewster, J. Distributed Hash Table. Alphascript Publishing, 2010.
Noack, A. Energy models for graph clustering. Journal of Graph Algorithms and Applications, 11(2):453–480, 2007.
van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
Venna, J. and Kaski, S. Nonlinear dimensionality reduction as information retrieval. Journal of Machine Learning Research Proceedings Track, 2:572–579, 2007.
Venna, J., Peltonen, J., Nybo, K., Aidos, H., and Kaski, S. Information retrieval perspective to nonlinear dimensionality reduction for data visualization. Journal of Machine Learning Research, 11:451–490, 2010.
Vladymyrov, M. and Carreira-Perpinan, M. Partial-Hessian strategies for fast learning of nonlinear embeddings. In International Conference on Machine Learning (ICML), pp. 167–174, 2010.
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., and Gong, Y. Locality-constrained linear coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3360–3367, 2010.
Yang, Z., King, I., Xu, Z., and Oja, E. Heavy-tailed symmetric stochastic neighbor embedding. In Advances in Neural Information Processing Systems (NIPS), volume 22, pp. 2169–2177, 2009.
Yianilos, P. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 311–321, 1993.
-----0
Angluin, D. and Laird, P. Learning from noisy examples. Machine Learning, 2:343–370, 1988.
Baldridge, J. and Palmer, A. How well does active learning actually work? Time-based evaluation of cost-reduction strategies for language documentation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2009.
Hanneke, S. Theoretical Foundations of Active Learning. PhD thesis, Machine Learning Department, School of Computer Science, Carnegie Mellon University, 2009.
Hanneke, S. Activized learning: Transforming passive to active with improved label complexity. Journal of Machine Learning Research, 13(5):1469–1587, 2012.
Settles, B. Active learning literature survey. http://active-learning.net, 2010.
Tong, S. and Koller, D. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2, 2001.
Vapnik, V. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, New York, 1982.
Vapnik, V. and Chervonenkis, A. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280, 1971.
-----0
Agarwal, Arvind, Daume III, Hal, and Gerber, Samuel. Learning multiple tasks using manifold regularization. In NIPS, pp. 46–54, 2010.
Ando, Rie Kubota and Zhang, Tong. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.
Archambeau, Cedric, Guo, Shengbo, and Zoeter, Onno. Sparse Bayesian multi-task learning. In NIPS, pp. 1755–1763, 2011.
Argyriou, Andreas, Evgeniou, Theodoros, and Pontil, Massimiliano. Multi-task feature learning. In NIPS, pp. 41–48. MIT Press, 2006.
Argyriou, Andreas, Micchelli, Charles A., Pontil, Massimiliano, and Ying, Yiming. A spectral regularization framework for multi-task structure learning. In NIPS, 2007.
Barndorff-Nielsen, O., Blæsild, P., and Jensen, J. Ledet. Exponential transformation models. Proceedings of the Royal Society of London A, 379(1776):41–65, 1982.
Baxter, Jonathan. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.
Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, first edition, 2006.
Bonilla, Edwin V., Chai, Kian Ming Adam, and Williams, Christopher K. I. Multi-task Gaussian process prediction. In NIPS, 2007.
Breiman, Leo and Friedman, Jerome H. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(1):3–54, 1997.
Butler, Ronald W. Generalized inverse Gaussian distributions and their Wishart connections. Scandinavian Journal of Statistics, 25(1):69–75, 1998.
Butler, Ronald W. and Wood, Andrew T. A. Laplace approximation for Bessel functions of matrix argument. Journal of Computational and Applied Mathematics, 155:359–382, 2003.
Caruana, Rich. Multitask learning. In Machine Learning, pp. 41–75, 1997.
Chen, Jianhui, Tang, Lei, Liu, Jun, and Ye, Jieping. A convex formulation for learning shared structures from multiple tasks. In ICML, pp. 1–8, 2009.
Chen, Jianhui, Zhou, Jiayu, and Ye, Jieping. Integrating low-rank and group-sparse structures for robust multi-task learning. In KDD, pp. 42–50, 2011.
Evgeniou, Theodoros and Pontil, Massimiliano. Regularized multi-task learning. In KDD, pp. 109–117, 2004.
Gupta, A. K. and Nagar, D. K. (eds.). Matrix Variate Distribution. Chapman & Hall, 2000.
Herz, Carl S. Bessel functions of matrix argument. Annals of Mathematics, 61(3):474–523, 1955.
Jacob, Laurent, Bach, Francis, and Vert, Jean-Philippe. Clustered multi-task learning: A convex formulation. In NIPS, pp. 745–752, 2008.
Jenatton, Rodolphe, Audibert, Jean-Yves, and Bach, Francis. Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research, 12:2777–2824, 2011.
Kim, Seyoung and Xing, Eric P. Tree-guided group lasso for multi-task regression with structured sparsity. In ICML, pp. 543–550, 2010.
Kotz, Samuel, Kozubowski, Tomasz J., and Podgorski, Krzysztof. The Laplace distribution and generalizations: a revisit with applications to communications, economics, engineering, and finance. Birkhauser, Boston, 2001.
Le, Nhu D. and Zidek, James V. Statistical Analysis of Environmental Space-Time Processes. Springer, 2006.
Mackay, David J. C. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
Obozinski, Guillaume, Taskar, Ben, and Jordan, Michael I. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, 2009.
Rai, Piyush and Daume III, Hal. Infinite predictor subspace models for multitask learning. Journal of Machine Learning Research Proceedings Track, 9:613–620, 2010.
Rasmussen, Carl Edward and Williams, Christopher K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.
Rothman, A. J., Levina, E., and Zhu, J. Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, pp. 947–962, 2010.
Sohn, Kyung-Ah and Kim, Seyoung. Joint estimation of structured sparsity and output structure in multiple-output regression via inverse-covariance regularization. Journal of Machine Learning Research Proceedings Track, 22:1081–1089, 2012.
Stein, Michael L. Interpolation of Spatial Data: Some Theory for Kriging. Springer, 1999.
Thrun, Sebastian. Is learning the n-th thing any easier than learning the first? In NIPS, pp. 640–646. The MIT Press, 1996.
Xue, Ya, Liao, Xuejun, Carin, Lawrence, and Krishnapuram, Balaji. Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research, 8, 2007.
Yu, Shipeng, Tresp, Volker, and Yu, Kai. Robust multi-task learning with t-processes. In ICML, pp. 1103–1110, 2007.
Zhang, Jian, Ghahramani, Zoubin, and Yang, Yiming. Learning multiple related tasks using latent independent component analysis. In NIPS, pp. 1585–1592, 2005.
Zhang, Yi and Schneider, Jeff G. Learning multiple tasks with a sparse matrix-normal penalty. In NIPS, pp. 2550–2558. Curran Associates, Inc., 2010.
Zhang, Yu and Yeung, Dit-Yan. A convex formulation for learning task relationships in multi-task learning. In UAI, pp. 733–742. AUAI Press, 2010.
Zhang, Zhihua, Wang, Shusen, Liu, Dehua, and Jordan, Michael I. EP-GIG Priors and Applications in Bayesian Sparse Learning. Journal of Machine Learning Research, 13:2031–2061, 2012.
Zhou, J., Chen, J., and Ye, J. MALSAR: Multi-tAsk Learning via StructurAl Regularization. Arizona State University, 2011. URL http://www.public.asu.edu/~jye02/Software/MALSAR.
-----0
Bickel, P., Ritov, Y., and Tsybakov, A. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37:1705–1732, 2009.
Candes, E., Romberg, J., and Tao, T. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006.
Chen, S., Donoho, D., and Saunders, M. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
Donoho, D. L. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. Least angle regression. Annals of Statistics, 32:407–499, 2004.
Feuer, A. and Nemirovski, A. On sparse representation in pairs of bases. IEEE Transactions on Information Theory, 49(6):1579–1581, 2003.
Friedman, J., Hastie, T., and Tibshirani, R. A note on the group Lasso and a sparse group Lasso. Technical report, Jan 2010.
Huang, J., Huang, X., and Metaxas, D. N. Learning with dynamic group sparsity. In ICCV, pp. 64–71, 2009a.
Huang, J., Zhang, T., and Metaxas, D. Learning with structured sparsity. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp. 417–424, 2009b.
Jacob, L., Obozinski, G., and Vert, J. Group Lasso with overlap and graph Lasso. In ICML, 2009.
Kim, S., Koh, K., Boyd, S., and Gorinevsky, D. ℓ1 trend filtering. SIAM Review, 51(2):339–360, 2009.
Meier, L., van de Geer, S., and Bühlmann, P. The group Lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):53–71, 2008.
Percival, D. Theoretical properties of the overlapping groups Lasso. Electronic Journal of Statistics, 2011.
Rapaport, F., Zinovyev, A., Dutreix, M., Barillot, E., and Vert, J. Classification of microarray data using gene networks. BMC Bioinformatics, 8, 2007.
Roth, V. and Fischer, B. The group Lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In Proceedings of the 25th International Conference on Machine Learning, pp. 848–855, 2008.
Tibshirani, R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. Sparsity and smoothness via the fused Lasso. Journal of the Royal Statistical Society Series B, pp. 91–108, 2005.
Tibshirani, R. J. and Taylor, J. The solution path of the generalized Lasso. The Annals of Statistics, 39 (3), 2011.
Tikhonov, A. N. and Arsenin, V. Y. Solutions of Ill-Posed Problems. V. H. Winston & Sons, Washington, D.C.; John Wiley & Sons, New York, 1977.
Tropp, J. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.
Tropp, J. A. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 51(3):1030–1051, 2006.
Wainwright, M. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55:2183–2202, 2009.
Xu, H., Caramanis, C., and Mannor, S. Robust regression and Lasso. IEEE Transactions on Information Theory, 56(7):3561–3574, 2010.
Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67, 2006.
Zhang, T. Some sharp performance bounds for least squares regression with l1 regularization. Annals of Statistics, 37:2109–2144, 2009.
Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.
-----0
Bourgain, J., Lindenstrauss, J., and Milman, V. Approximation of zonoids by zonotopes. Acta Math, 162:73–141, 1989.
Buchinsky, M. Changes in US wage structure 1963-87: an application of quantile regression. Econometrica, 62:405–408, 1994.
Buhai, I. S. Quantile regression: Overview and selected applications. Ad Astra, 4, 2005.
Clarkson, K. L., Drineas, P., Magdon-Ismail, M., Mahoney, M. W., Meng, X., and Woodruff, D. P. The Fast Cauchy Transform and faster robust linear regression. In Proc. of the 24th Annual SODA, 2013.
Dasgupta, A., Drineas, P., Harb, B., Kumar, R., and Mahoney, M. W. Sampling algorithms and coresets for ℓp regression. SIAM J. Comput., 38(5):2060–2078, 2009.
Koenker, R. and Bassett, G. Regression quantiles. Econometrica, 46(1):33–50, 1978.
Koenker, R. and d'Orey, V. Computing regression quantiles. J. Roy. Statist. Soc. Ser. C (Appl. Statist.), 43:410–414, 1993.
Koenker, R. and Hallock, K. Quantile regression. J. of Economic Perspectives, 15(4):143–156, 2001.
Mahoney, M. W. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning. NOW Publishers, Boston, 2011. Also available at: arXiv:1104.5557.
Meng, X. and Mahoney, M. W. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proc. of the 45th Annual STOC, 2013.
Portnoy, S. On computation of regression quantiles: Making the Laplacian tortoise faster. Lecture Notes-Monograph Series, Vol. 31, L1-Statistical Procedures and Related Topics, pp. 187–200, 1997.
Portnoy, S. and Koenker, R. The Gaussian hare and the Laplacian tortoise: Computability of squared-error versus absolute-error estimators, with discussion. Statistical Science, 12(4):279–300, 1997.
Sohler, C. and Woodruff, D. P. Subspace embedding for the ℓ1-norm with applications. In Proc. of the 43rd Annual ACM STOC, pp. 755–764, 2011.
Yang, J., Meng, X., and Mahoney, M. W. Quantile regression for large-scale applications. Technical report, 2013. Preprint: arXiv:1305.0087.
-----0
Allab, Kais and Benabdeslem, Khalid. Constraint selection for semi-supervised topological clustering. In ECML/PKDD, pp. 28–43, 2011.
Avron, H, Kale, S, Kasiviswanathan, S, and Sindhwani, V. Efficient and practical stochastic subgradient descent for nuclear norm regularization. In ICML, 2012.
Bar-Hillel, A., Hertz, T., Shental, N., and Weinshall, D. Learning a Mahalanobis metric from equivalence constraints. JMLR, 6:937–965, 2005.
Bartlett, P. The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525–536, 1998.
Basu, S, Banerjee, A, and Mooney, R J. Semi-supervised clustering by seeding. In ICML, pp. 27–34, 2002.
Basu, S., Bilenko, M., and Mooney, R. J. A probabilistic framework for semi-supervised clustering. In KDD, pp. 59–68, 2004.
Basu, S., Bilenko, M., and Mooney, R. J. Probabilistic semi-supervised clustering with constraints. Semi-supervised Learning, pp. 71–98, 2006.
Bekkerman, R. and Sahami, M. Semi-supervised clustering using combinatorial MRFs. In ICML Workshop on Learning in Structured Output Spaces, 2006.
Bhatia, S. and Deogun, J. Conceptual clustering in information retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 28(3):427–436, 1998.
Bilenko, M, Basu, S, and Mooney, R J. Integrating constraints and metric learning in semi-supervised clustering. In ICML, 2004.
Candès, E. J. and Recht, B. Simple bounds for low-complexity model reconstruction. CoRR, abs/1106.1474, 2011.
Candès, E. J. and Tao, T. The power of convex relaxation: near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.
Chen, W., Song, Y., Bai, H., Lin, C., and Chang, E. Y. Parallel spectral clustering in distributed systems. PAMI, 33(3):568–586, 2011.
Cover, T. and Thomas, J. Elements of Information Theory. Wiley, 2006. ISBN 978-0-471-24195-9.
Davidson, Ian and Ravi, S. S. Agglomerative hierarchical clustering with constraints: Theoretical and empirical results. In PKDD, pp. 59–70, 2005.
Davis, Jason V., Kulis, Brian, Jain, Prateek, Sra, Suvrit, and Dhillon, Inderjit S. Information-theoretic metric learning. In ICML, pp. 209–216, 2007.
Fowlkes, Charless, Belongie, Serge, Chung, Fan R. K., and Malik, Jitendra. Spectral grouping using the Nyström method. PAMI, 26(2):214–225, 2004.
Frigui, Hichem and Krishnapuram, Raghu. A robust competitive clustering algorithm with applications in computer vision. PAMI, 21(5):450–465, 1999.
Goldberg, D., Nichols, D., Oki, B., and Terry, D. Using collaborative filtering to weave an information tapestry. Commun. ACM, 35(12):61–70, 1992.
Hoi, S.C.H., Liu, W., Lyu, M.R., and Ma, W.Y. Learning distance metrics with contextual constraints for image retrieval. In CVPR, pp. 2072–2078, 2006.
Hull, J.J. A database for handwritten text recognition research. PAMI, 16(5):550–554, 1994.
Jain, A. K. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651–666, 2010.
Jalali, Ali, Chen, Yudong, Sanghavi, Sujay, and Xu, Huan. Clustering partially observed graphs via convex optimization. In ICML, pp. 1001–1008, 2011.
Law, M, Topchy, A P., and Jain, A K. Model-based clustering with probabilistic constraints. In SDM, 2005.
Li, Q. and Kim, B. Clustering approach for hybrid recommender system. In Web Intelligence, pp. 33-38, 2003.
Li, Z. and Liu, J. Constrained clustering by spectral kernel learning. In ICCV, pp. 421427, 2009.
Liu, X. and Croft, W. B. Cluster-based retrieval using language models. In SIGIR, pp. 186193, 2004.
Lu, Z. and Leen, T. K. Semi-supervised learning with penalized probabilistic clustering. In NIPS, 2004.
Ng, A., Jordan, M., and Weiss, Y. On spectral clustering: Analysis and an algorithm. In NIPS, pp. 849-856, 2001.
Recht, Benjamin. A simpler approach to matrix completion. JMLR, 12:34133430, 2011.
Shental, N., Bar-Hillel, A., Hertz, T., and Weinshall, D. Computing Gaussian mixture models with EM using equivalence constraints. In NIPS, 2003.
von Luxburg, Ulrike. A tutorial on spectral clustering. Statistics and Computing, 17(4):395416, 2007.
Wagstaff, K., Cardie, C., Rogers, S., and Schrödl, S. Constrained k-means clustering with background knowledge. In ICML, pp. 577-584, 2001.
Weinberger, K.Q., Blitzer, J., and Saul, L.K. Distance metric learning for large margin nearest neighbor classification. In NIPS, 2006.
Xing, E., Ng, A., Jordan, M., and Russell, S. Distance metric learning, with application to clustering with side-information. NIPS, 15:505512, 2002.
Yi, J., Jin, R., Jain, A. K., and Jain, S. Crowdclustering with sparse pairwise labels: A matrix completion approach. In AAAI Workshop on Human Computation, 2012a.
Yi, J., Jin, R., Jain, A. K., Jain, S., and Yang, T. Semi-crowdsourced clustering: Generalizing crowd labeling by robust distance metric learning. In NIPS, pp. 1781-1789, 2012b.
Yi, J., Yang, T., Jin, R., Jain, A. K., and Mahdavi, M. Robust ensemble clustering by matrix completion. In ICDM, pp. 1176-1181, 2012c.
Zeng, H. and Cheung, Y. Semi-supervised maximum margin clustering with pairwise constraints. IEEE Trans. Knowl. Data Eng., 24(5):926939, 2012.
-----0
Barndorff-Nielsen, O., Blæsild, P., Jensen, J. L., and Jørgensen, B. Exponential transformation models. Royal Society of London, 379(1776):41-65, 1982.
Bishop, C. M. Variational principal components. In ICANN, pp. 509514, 1999.
Blankertz, B., Curio, G., and Müller, K.-R. Classifying single trial EEG: Towards brain computer interfacing. In NIPS, 2001. www.bbci.de/competition/ii/berlin_desc.html
Bregman, L. M. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR CMMP, 7(3):200-217, 1967.
Butler, R. W. Generalized inverse Gaussian distributions and their Wishart connections. Scandinavian Journal of Statistics, 25(1):6975, 1998.
Butler, R. W. and Wood, A. Laplace approximation for Bessel functions of matrix argument. J. of Computational and Applied Math., 155(2):359382, 2003.
Carroll, J. D. and Chang, J. J. Analysis of individual differences in multidimensional scaling via an N-way generalization of Eckart-Young decomposition. Psychometrika, 35(3):283319, 1970.
Cemgil, A. T. Bayesian inference for nonnegative matrix factorisation models. Computational Intelligence and Neuroscience, Article ID 785152, 2009.
Févotte, C., Bertin, N., and Durrieu, J.-L. Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis. Neural Computation, 21(3):793-830, 2009.
Goto, M., Hashiguchi, H., Nishimura, T., and Oka, R. RWC music database: Popular, classical, and jazz music database. In ISMIR, pp. 287-288, 2002.
Harshman, R. A. Foundations of the PARAFAC procedure: Models and conditions for an explanatory multi-modal factor analysis. UCLA Working Papers in Phonetics, 16(1), 1970.
Herz, C. S. Bessel functions of matrix argument. Annals of Mathematics, 61(3):474523, 1955.
Hoffman, M., Blei, D., and Cook, P. Bayesian nonparametric matrix factorization for recorded music. In ICML, pp. 439-446, 2010.
Itakura, F. and Saito, S. Analysis synthesis telephony based on the maximum likelihood method. In ICA, pp. C17C20, 1968.
Kulis, B., Sustik, M., and Dhillon, I. Low-rank kernel learning with Bregman matrix divergences. JMLR, 10:341376, 2009.
Kullback, S. and Leibler, R. On information and sufficiency. Annals of Math. Stat., 22(1):7986, 1951.
Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L., and Jordan, M. Learning the kernel matrix with semidefinite programming. JMLR, 5:2772, 2004.
Lawrence, N. D. Gaussian process latent variable models for visualisation of high dimensional data. In NIPS, 2003.
Lee, D. and Seung, H. Algorithms for non-negative matrix factorization. In NIPS, pp. 556562, 2000.
Lee, H., Cichocki, A., and Choi, S. Nonnegative matrix factorization for motor imagery EEG classification. In ICANN, pp. 250-259, 2006.
Liutkus, A., Badeau, R., and Richard, G. Gaussian processes for underdetermined source separation. IEEE Trans. on ASLP, 59(7):31553167, 2011.
Nakano, M., Kameoka, H., Roux, J. Le, Kitano, Y., Ono, N., and Sagayama, S. Convergence-guaranteed multiplicative algorithms for non-negative matrix factorization with beta divergence. In MLSP, pp. 283-288, 2010.
Salakhutdinov, R. and Mnih, A. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In ICML, pp. 880887, 2008.
Sawada, H., Kameoka, H., Araki, S., and Ueda, N. Efficient algorithms for multichannel extensions of Itakura-Saito nonnegative matrix factorization. In ICASSP, pp. 261-264, 2012.
Shashua, A. and Hazan, T. Non-negative tensor factorization with applications to statistics and computer vision. In ICML, pp. 792799, 2005.
Sivalingam, R., Boley, D., Morellas, V., and Papanikolopoulos, N. Tensor sparse coding for region covariances. In ECCV, pp. 722735, 2010.
Smaragdis, P. and Brown, J. C. Non-negative matrix factorization for polyphonic music transcription. In WASPAA, pp. 177180, 2003.
Tsuda, K., Rätsch, G., and Warmuth, M. K. Matrix exponentiated gradient updates for on-line learning and Bregman projection. JMLR, 6:995-1018, 2005.
Tucker, L. R. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279311, 1966.
Xu, Z., Yan, F., and Qi, Y. Infinite Tucker decomposition: Nonparametric Bayesian models for multiway data analysis. In ICML, pp. 10231030, 2012.
-----0
Abernethy, Jacob, Bach, Francis, Evgeniou, Theodoros, and Vert, Jean-Philippe. A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization. Journal of Machine Learning Research, 10:803-826, December 2009.
Argyriou, Andreas, Evgeniou, Theodoros, and Pontil, Massimiliano. Convex multi-task feature learning. Machine Learning, 73(3):243-272, January 2008.
Argyriou, Andreas, Micchelli, Charles A., and Pontil, Massimiliano. When is there a representer theorem? vector versus matrix regularizers. Journal of Machine Learning Research, 10:25072529, 2009.
Aronszajn, Nachman. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337-404, 1950.
Carmeli, Claudio, Vito, Ernesto De, and Toigo, Alessandro. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Analysis and Applications, 4:377408, 2006.
Dinuzzo, Francesco and Schölkopf, Bernhard. The representer theorem for Hilbert spaces: a necessary and sufficient condition, 2012. URL http://arxiv.org/abs/1205.1928.
Kimeldorf, George and Wahba, Grace. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33:8295, 1971.
Kirsch, Andreas. An Introduction to the Mathematical Theory of Inverse Problems, volume 120 of Applied Mathematical Sciences. Springer, 2nd edition, August 2011.
Micchelli, Charles A. and Pontil, Massimiliano. On learning vector-valued functions. Neural Computation, 17:177204, 2005.
Schölkopf, Bernhard and Smola, Alex J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.
Schölkopf, Bernhard, Herbrich, Ralf, and Smola, Alex J. A generalized representer theorem. In Conference on Computational Learning Theory, 2001.
Schwartz, Laurent. Sous-espaces Hilbertiens d'espaces vectoriels topologiques et noyaux associés (noyaux reproduisants). Journal d'Analyse Mathématique, 13:115-256, 1964.
Shawe-Taylor, John and Cristianini, Nello. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
Wahba, Grace. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, 1990.
Warmuth, Manfred K. and Vishwanathan, S.V.N. Leaving the span. In Conference on Computational Learning Theory, 2005.
Warmuth, Manfred K., Kotlowski, Wojciech, and Zhou, Shuisheng. Kernelization of matrix updates, when and how? In Conference on Algorithmic Learning Theory, 2012.
-----0
Bach, F.R., Lanckriet, G.R.G., and Jordan, M.I. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the 21st International Conference on Machine Learning, pp. 6, 2004.
Bellare, K., Druck, G., and McCallum, A. Alternating projections for learning with expectation constraints. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, pp. 4350, 2009.
Chapelle, O., Sindhwani, V., and Keerthi, S.S. Optimization techniques for semi-supervised support vector machines. The Journal of Machine Learning Research, 9:203233, 2008.
Chen, B.C., Chen, L., Ramakrishnan, R., and Musicant, D.R. Learning from aggregate views. In Proceedings of the 22nd International Conference on Data Engineering, pp. 3, 2006.
Fan, K. Minimax theorems. Proceedings of the National Academy of Sciences of the United States of America, 39(1):42, 1953.
Gillenwater, J., Ganchev, K., Graca, J., Pereira, F., and Taskar, B. Posterior sparsity in unsupervised dependency parsing. The Journal of Machine Learning Research, 12:455490, 2011.
Joachims, T. Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning, pp. 200209, 1999.
Joachims, T. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 217-226, 2006.
Keerthi, S.S., Sundararajan, S., and Shevade, S.K. Extension of TSVM to multi-class and hierarchical text classification problems with general losses. In Proceedings of the 24th International Conference on Computational Linguistics, pp. 1091-1100, 2012.
Kelley Jr, J.E. The cutting-plane method for solving convex programs. Journal of the Society for Industrial & Applied Mathematics, 8(4):703-712, 1960.
Kuck, H. and de Freitas, N. Learning about individuals from group statistics. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, pp. 332-339, 2005.
Li, Y.F., Kwok, J.T., and Zhou, Z.H. Semi-supervised learning using label mean. In Proceedings of the 26th International Conference on Machine Learning, pp. 633-640, 2009a.
Li, Y.F., Tsang, I.W., Kwok, J.T., and Zhou, Z.H. Tighter and convex maximum margin clustering. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, pp. 344-351, 2009b.
Mann, G.S. and McCallum, A. Simple, robust, scalable semi-supervised learning via expectation regularization. In Proceedings of the 24th International Conference on Machine Learning, pp. 593600, 2007.
Musicant, D.R., Christensen, J.M., and Olson, J.F. Supervised learning by training on aggregate outputs. In Proceedings of the 7th International Conference on Data Mining, pp. 252-261, 2007.
Quadrianto, N., Smola, A.J., Caetano, T.S., and Le, Q.V. Estimating labels from label proportions. The Journal of Machine Learning Research, 10:2349-2374, 2009.
Rahimi, A. and Recht, B. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 20:1177-1184, 2007.
Rakotomamonjy, A., Bach, F., Canu, S., and Grandvalet, Y. SimpleMKL. The Journal of Machine Learning Research, 9:24912521, 2008.
Rueping, S. SVM classifier estimation from group probabilities. In Proceedings of the 27th International Conference on Machine Learning, pp. 911918, 2010.
Shilton, A., Palaniswami, M., Ralph, D., and Tsoi, A.C. Incremental training of support vector machines. Neural Networks, IEEE Transactions on, 16(1):114-131, 2005.
Stolpe, M. and Morik, K. Learning from label proportions by optimizing cluster model selection. In Proceedings of the 2011 European Conference on Machine Learning and Knowledge Discovery in Databases-Volume Part III, pp. 349364, 2011.
Vedaldi, A. and Zisserman, A. Efficient additive kernels via explicit feature maps. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(3):480-492, 2012.
Xu, L., Neufeld, J., Larson, B., and Schuurmans, D. Maximum margin clustering. Advances in Neural Information Processing Systems, 17:1537-1544, 2004.
-----0
Bengio, Yoshua, Courville, Aaron C., and Vincent, Pascal. Unsupervised feature learning and deep learning: A review and new perspectives. CoRR, abs/1206.5538, 2012.
Boureau, Y., Le Roux, N., Bach, F., Ponce, J., and LeCun, Y. Ask the locals: multi-way local pooling for image recognition. In ICCV, 2011.
Ciresan, D.C., Meier, U., Gambardella, L.M., and Schmidhuber, J. Deep, big, simple neural nets for handwritten digit recognition. Neural computation, 22, 2010.
Coates, A. and Ng, A.Y. The importance of encoding versus training with sparse coding and vector quantization. In ICML, 2011a.
Coates, A. and Ng, A.Y. Selecting receptive fields in deep networks. Advances in Neural Information Processing Systems, 24, 2011b.
Coates, A., Lee, H., and Ng, A.Y. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2010.
Coates, A., Karpathy, A., and Ng, A. Emergence of object-selective features in unsupervised feature learning. In NIPS, 2012.
Csurka, G., Dance, C., Fan, L., Willamowski, J., and Bray, C. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision, ECCV, 2004.
Feng, J., Ni, B., Tian, Q., and Yan, S. Geometric ℓp-norm feature pooling for image classification. In CVPR, 2011.
Gregor, K. and LeCun, Y. Efficient learning of sparse invariant representations. arXiv preprint arXiv:1105.5307, 2011.
Hinton, G.E., Osindero, S., and Teh, Y.W. A fast learning algorithm for deep belief nets. Neural computation, 18(7), 2006.
Hubel, D. H. and Wiesel, T. N. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J Physiol, 160:106-154, 1962.
Jia, Y., Huang, C., and Darrell, T. Beyond spatial pyramids: Receptive field learning for pooled image features. In CVPR, 2012.
Krizhevsky, A. Convolutional deep belief networks on CIFAR-10. Unpublished manuscript, 2010.
Krizhevsky, A., Sutskever, I., and Hinton, G. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.
Lazebnik, S., Schmid, C., and Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
Le, Q., Ranzato, M.A., Monga, R., Devin, M., Chen, K., Corrado, G., Dean, J., and Ng, A.Y. Building high-level features using large scale unsupervised learning. In ICML, 2012.
Le, Q.V., Ngiam, J., Chen, Z., Chia, D., Koh, P.W., and Ng, A.Y. Tiled convolutional neural networks. In NIPS, 2010.
Le, Q.V., Karpenko, A., Ngiam, J., and Ng, A.Y. ICA with reconstruction cost for efficient overcomplete feature learning. In NIPS, 2011.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 1998.
LeCun, Y., Huang, F.J., and Bottou, L. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR, 2004.
Lee, H., Ekanadham, C., and Ng, A.Y. Sparse deep belief net model for visual area V2. In NIPS, 2008.
Lee, H., Grosse, R., Ranganath, R., and Ng, A.Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
Pinto, N., Cox, D., and DiCarlo, J. Why is real-world visual object recognition hard? PLoS Computational Biology, 4(1):151-156, 2008.
Serre, T., Wolf, L., and Poggio, T. Object recognition with features inspired by visual cortex. In CVPR, 2005.
Yang, J., Yu, K., Gong, Y., and Huang, T. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.
Yu, K. and Zhang, T. Improved local coordinate coding using local tangents. In ICML, 2010.
Zou, W., Ng, A., Zhu, S., and Yu, K. Deep learning of invariant features via simulated fixations in video. In NIPS, 2012.
-----0
Calders, T. and Verwer, S. Three naive Bayes approaches for discrimination-free classification. Data Mining and Knowledge Discovery, 21:277-292, 2010.
Dwork, C. and Mulligan, D. Privacy and classification concerns in online behavioral targeting: Mapping objections and sketching solutions. In Privacy Law Scholars Conference, 2012.
Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference (TCC), 2006.
Dwork, C., Hardt, M., Pitassi, T., Reingold, O., and Zemel, R. Fairness through awareness. In Proceedings of Innovations of Theoretical Computer Science, 2011.
Frank, A. and Asuncion, A. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml.
Kamiran, F. and Calders, T. Classifying without discriminating. In 2nd International Conference on Computer, Control and Communication, pp. 16, 2009.
Kamishima, T., Akaho, S., and Sakuma, J. Fairness-aware learning through regularization approach. In IEEE 11th International Conference on Data Mining, pp. 643-650, 2011.
Kohavi, R. Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996.
Luong, B., Ruggieri, S., and Turini, F. k-NN as an implementation of situation testing for discrimination discovery and prevention. In Proceedings of the 17th ACM KDD Conference, pp. 502510, 2011.
Pedreschi, D., Ruggieri, S., and Turini, F. Discrimination-aware data mining. In Proceedings of the 14th ACM KDD Conference, pp. 560-568, 2008.
Tishby, N., Pereira, F.C., and Bialek, W. The Information Bottleneck method. In The 37th Annual Allerton Conference on Communication, Control, and Computing, 1999.
Zarsky, T. Automated prediction: Perception, law, and policy. CACM 15 (9), 2012.
-----0
Algeo, John. Where do all the new words come from? American Speech, 55(4):264277, 1980.
Bird, Steven, Klein, Ewan, and Loper, Edward. Natural Language Processing with Python. O'Reilly Media, 2009.
Blei, David M. and Jordan, Michael I. Variational inference for Dirichlet process mixtures. Journal of Bayesian Analysis, 1(1):121144, 2005.
Blei, David M. and Lafferty, John D. Dynamic topic models. In ICML, 2006.
Blei, David M., Ng, Andrew, and Jordan, Michael. Latent Dirichlet allocation. JMLR, 3:9931022, 2003.
Blunsom, Phil and Cohn, Trevor. A hierarchical Pitman-Yor process HMM for unsupervised part of speech induction. In ACL, 2011.
Boyd-Graber, Jordan and Blei, David M. Multilingual topic models for unaligned text. In UAI, 2009.
Chang, Jonathan, Boyd-Graber, Jordan, and Blei, David M. Connections between the lines: Augmenting social networks with text. In KDD, 2009.
Clark, Alexander. Combining distributional and morphological information for part of speech induction. 2003.
Cohen, Shay B., Blei, David M., and Smith, Noah A. Variational inference for adaptor grammars. In NAACL, 2010.
Dietz, Laura, Bickel, Steffen, and Scheffer, Tobias. Unsupervised prediction of citation influences. In ICML, 2007.
Ferguson, Thomas S. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209230, 1973.
Goldwater, Sharon and Griffiths, Thomas L. A fully Bayesian approach to unsupervised part-of-speech tagging. In ACL, 2007.
Hall, Mark, Frank, Eibe, Holmes, Geoffrey, Pfahringer, Bernhard, Reutemann, Peter, and Witten, Ian H. The WEKA data mining software: An update. SIGKDD Explorations, 11, 2009.
Hoffman, Matthew, Blei, David M., and Bach, Francis. Online learning for latent Dirichlet allocation. In NIPS, 2010.
Jelinek, F. and Mercer, R. Probability distribution estimation from sparse data. IBM Technical Disclosure Bulletin, 28:25912594, 1985.
Knight, Kevin and Graehl, Jonathan. Machine transliteration. In ACL, 1997.
Kurihara, Kenichi, Welling, Max, and Vlassis, Nikos. Accelerated variational Dirichlet process mixtures. In NIPS, 2006.
Kurihara, Kenichi, Welling, Max, and Teh, Yee Whye. Collapsed variational Dirichlet process mixture models. In IJCAI, 2007.
Mimno, David, Hoffman, Matthew, and Blei, David. Sparse stochastic inference for latent Dirichlet allocation. In ICML, 2012.
Müller, Peter and Quintana, Fernando A. Nonparametric Bayesian data analysis. Statistical Science, 19(1):95-110, 2004.
Neal, Radford M. Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR93-1, University of Toronto, 1993.
Newman, David, Karimi, Sarvnaz, and Cavedon, Lawrence. External evaluation of topic models. In ADCS, 2009.
Paul, Michael and Girju, Roxana. A two-dimensional topic-aspect model for discovering multi-faceted topics. 2010.
Sato, Masa-Aki. Online model selection based on the variational Bayes. Neural Computation, 13(7):16491681, July 2001.
Sethuraman, Jayaram. A constructive definition of Dirichlet priors. Statistica Sinica, 4:639650, 1994.
Teh, Yee Whye. A hierarchical Bayesian language model based on Pitman-Yor processes. In ACL, 2006.
Wang, Chong and Blei, David. Variational inference for the nested Chinese restaurant process. In NIPS, 2009.
Wang, Chong and Blei, David M. Truncation-free online variational inference for Bayesian nonparametric models. In NIPS, 2012.
Wang, Chong, Blei, David M., and Heckerman, David. Continuous time dynamic topic models. In UAI, 2008.
Wang, Chong, Paisley, John, and Blei, David. Online variational inference for the hierarchical Dirichlet process. In AISTATS, 2011.
Wei, Xing and Croft, Bruce. LDA-based document models for ad-hoc retrieval. In SIGIR, 2006.
Weinberger, K.Q., Dasgupta, A., Langford, J., Smola, A., and Attenberg, J. Feature hashing for large scale multitask learning. In ICML, pp. 1113-1120. ACM, 2009.
Zhai, Ke, Boyd-Graber, Jordan, Asadi, Nima, and Alkhouja, Mohamad. Mr. LDA: A flexible large scale topic modeling package using variational inference in MapReduce. In WWW, 2012.
-----1
Cai, J., Osher, S., and Shen, Z. Convergence of the linearized Bregman iteration for ℓ1-norm minimization. Mathematics of Computation, 78(268):2127-2136, 2009.
Chu, D., Goh, S. T., and Hung, Y. S. Characterization of all solutions for undersampled uncorrelated linear discriminant analysis problems. SIAM Journal on Matrix Analysis and Applications, 32(3):820-844, 2011.
Clemmensen, L., Hastie, T., Witten, D., and Ersbøll, B. Sparse discriminant analysis. Technometrics, 53(4):406-413, 2011.
Dundar, M., Fung, G., Bi, J., Sathyakama, S., and Rao, B. Sparse Fisher discriminant analysis for computer aided detection. In Proceedings of the SIAM International Conference on Data Mining, 2005.
Friedman, J. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165-175, 1989.
Fukunaga, K. Introduction to Statistical Pattern Recognition. Academic Press, 2nd edition, 1990.
Fung, E. and Ng, M. K. On sparse Fisher discriminant method for microarray data analysis. Bioinformation, 2(5):230-234, 2007.
Golub, G. and Loan, C. F. Van. Matrix Computations. The Johns Hopkins University Press, 3rd edition, 1996.
Hastie, T. and Tibshirani, R. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):155-176, 1996.
Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2nd edition, 2009.
Howland, P. and Park, H. Generalizing discriminant analysis using the generalized singular value decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26:995-1006, 2004.
Huang, B., Ma, S., and Goldfarb, D. Accelerated linearized Bregman method. 2011. URL http://arxiv.org/abs/1106.5413.
Jin, Z., Yang, J. Y., Hu, Z. S., and Lou, Z. Face recognition based on the uncorrelated discriminant transformation. Pattern Recognition, 34:1405-1416, 2001.
Moghaddam, B., Weiss, Y., and Avidan, S. Generalized spectral bounds for sparse LDA. In Proceedings of the 23rd International Conference on Machine Learning, pp. 641-648, 2006.
Sugiyama, M. Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8:1027-1061, 2007.
Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:15149-15154, 1996.
Witten, D. M. and Tibshirani, R. Penalized classification using Fisher's linear discriminant. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5):753-772, 2011.
Wu, M., Zhang, L., Wang, Z., Christiani, D., and Lin, X. Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics, 25(9):1145-1151, 2009.
Ye, J. Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Journal of Machine Learning Research, 6:483-502, 2005.
Ye, J. Least squares linear discriminant analysis. In Proceedings of the 24th International Conference on Machine Learning, pp. 1087-1094, 2007.
Ye, J., Li, T., Xiong, T., and Janardan, R. Using uncorrelated discriminant analysis for tissue classification with gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(4):181-190, 2004.
Ye, J., Janardan, R., Li, Q., and Park, H. Feature extraction via generalized uncorrelated linear discriminant analysis. IEEE Transactions on Knowledge and Data Engineering, 18(10):1312-1321, 2006.
Yin, W. Analysis and generalizations of the linearized Bregman method. SIAM Journal on Imaging Sciences, 3(4):856-877, 2010.
Yin, W., Osher, S., Goldfarb, D., and Darbon, J. Bregman iterative algorithms for ℓ1-minimization with applications to compressed sensing. SIAM Journal on Imaging Sciences, 1(1):143-168, 2008.
-----0
Andrews, S., Tsochantaridis, I., and Hofmann, T. Support vector machines for multiple-instance learning. In NIPS, 2003.
Babenko, Boris, Yang, Ming-Hsuan, and Belongie, Serge. Robust object tracking with online multiple instance learning. IEEE Trans. Pattern Anal. Mach. Intell., 33(8):1619-1632, 2011.
Bergeron, Charles, Moore, Gregory M., Zaretzki, Jed, Breneman, Curt M., and Bennett, Kristin P. Fast bundle algorithm for multiple-instance learning. IEEE Trans. Pattern Anal. Mach. Intell., 34(6):10681079, 2012.
Berry, Michael W. and Castellanos, Malu. Survey of Text Mining: Clustering, Classification, and Retrieval, Second Edition. 2007.
Chen, Yixin, Bi, Jinbo, and Wang, James Ze. MILES: Multiple-instance learning via embedded instance selection. IEEE Trans. Pattern Anal. Mach. Intell., 28(12):1931-1947, 2006.
Dietterich, T. G., Lathrop, R. H., and Lozano-Perez, T. Solving the multiple instance problem with axis-parallel rectangles. In Artificial Intelligence, 1998.
Do, Trinh Minh Tri and Artières, Thierry. Large margin training for hidden Markov models with partially observed states. In ICML, pp. 34, 2009.
Fu, Zhouyu and Robles-Kelly, Antonio. An instance selection approach to multiple instance learning. In CVPR, pp. 911918, 2009.
Fuduli, Antonio, Gaudioso, Manlio, and Giallombardo, Giovanni. Minimizing nonconvex nonsmooth functions via cutting planes and proximity control. SIAM Journal on Optimization, 14(3):743756, 2004.
Hare, Warren and Sagastizábal, Claudia A. A redistributed proximal bundle method for nonconvex optimization. SIAM Journal on Optimization, 20(5):2442-2473, 2010.
Hu, Yang, Li, Mingjing, and Yu, Nenghai. Multiple-instance ranking: Learning to rank images for image retrieval. In CVPR, 2008.
Joachims, Thorsten. Training linear SVMs in linear time. In KDD, pp. 217-226, 2006.
Joachims, Thorsten, Cristianini, Nello, and Shawe-Taylor, John. Composite kernels for hypertext categorisation. In ICML, pp. 250-257, 2001.
Joachims, Thorsten, Finley, Thomas, and Yu, Chun-Nam John. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27-59, 2009.
Kim, Minyoung and la Torre, Fernando De. Gaussian processes multiple instance learning. In ICML, pp. 535-542, 2010.
Manning, Christopher D., Raghavan, Prabhakar, and Schütze, Hinrich. Introduction to Information Retrieval. Cambridge University Press, 2008.
Noll, Dominikus. Bundle method for non-convex minimization with inexact subgradients and function values. Computational and Analytical Mathematics, 2012.
Rahmani, R. and Goldman, S.A. MISSL: Multiple-instance semi-supervised learning. In ICML, 2006.
Ray, Soumya and Craven, Mark. Supervised versus multiple instance learning: an empirical comparison. In ICML, pp. 697704, 2005.
Schramm, Helga and Zowe, Jochem. A version of the bundle idea for minimizing a nonsmooth function: Conceptual idea, convergence analysis, numerical results. SIAM Journal on Optimization, 2(1):121, 1992.
Shawe-Taylor, John and Cristianini, Nello. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
Smola, Alex J., Vishwanathan, S. V. N., and Le, Quoc V. Bundle methods for machine learning. In NIPS, 2007.
Teo, Choon Hui, Vishwanathan, S. V. N., Smola, Alex J., and Le, Quoc V. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11:311365, 2010.
Wang, Jun and Zucker, Jean-Daniel. Solving multiple-instance problem: A lazy learning approach. In Langley, Pat (ed.), ICML, pp. 1119-1125, 2000.
Wu, Ou, Gao, Jun, Hu, Weiming, Li, Bing, and Zhu, Mingliang. Identifying multi-instance outliers. In SDM, pp. 430-441, 2010.
Yuille, A. and Rangarajan, A. The concave-convex procedure. Neural Computation, 2003.
-----0
Belkin, M., Niyogi, P., and Sindhwani, V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399-2434, 2006.
Bickel, S., Brückner, M., and Scheffer, T. Discriminative learning for differing training and test distributions. In International Conference on Machine Learning, 2007.
Bickel, S., Sawade, C., and Scheffer, T. Transfer learning by distribution matching for targeted advertising. In Advances in Neural Information Processing Systems 21, pp. 145-152, 2009.
Daumé III, H. and Marcu, D. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101-126, 2006.
Huang, J., Smola, A., Gretton, A., Borgwardt, K.M., and Schölkopf, B. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems 19, pp. 601-608, 2007.
Joachims, T. Transductive inference for text classification using support vector machines. In International Conference on Machine Learning, 1999.
Kulis, B., Saenko, K., and Darrell, T. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1785-1792, 2012.
Kumar, S., Mohri, M., and Talwalkar, A. Sampling methods for the Nyström method. Journal of Machine Learning Research, 13:981-1006, 2012.
Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation with multiple sources. In Advances in Neural Information Processing Systems, 2009.
Pan, J. and Yang, Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22:1345-1359, 2010.
Pan, S. J., Tsang, I.W., Kwok, J.T., and Yang, Q. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22:199210, 2011.
Scholkopf, B. and Smola, A. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT press, 2001.
Scholkopf, B., Smola, A., and Muller, Klaus-Robert. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.
Shawe-Taylor, J., Williams, C., Cristianini, N., and Kandola, J. On the eigenspectrum of the Gram matrix and the generalization error of kernel-PCA. IEEE Transactions on Information Theory, 51:2510-2522, 2005.
Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90:227-244, 2000.
Sugiyama, M., Nakajima, S., Kashima, H., Bunau, P., and Kawanabe, M. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems 20, pp. 1433-1440, 2008.
Williams, C. and Seeger, M. Using the nystrom method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, 2001.
Yu, Y. and Szepesvari, C. Analysis of kernel mean matching under covariate shift. In International Conference on Machine Learning, pp. 478-486, 2012.
Zadrozny, B. Learning and evaluating classifiers under sample selection bias. In International Conference on Machine Learning, pp. 903-910, 2004.
Zhang, K., Tsang, I., and Kwok, J. Improved nystrom low-rank approximation and error analysis. In International Conference on Machine Learning, pp. 1232-1239, 2008.
-----0
Bartlett, P.L., Bousquet, O., and Mendelson, S. Local rademacher complexities. Ann. Stat., 33(4):1497-1537, 2005.
Burges, C.J.C. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov., 2(2):121-167, 1998.
Burges, C.J.C. and Scholkopf, B. Improving the accuracy and speed of support vector learning machines. In NIPS 9, pp. 375-381, 1997.
Cavallanti, G., Cesa-Bianchi, N., and Gentile, C. Tracking the best hyperplane with a simple budget perceptron. Mach. Learn., 69(2-3):143-167, 2007.
Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.
Cesa-Bianchi, N., Conconi, A., and Gentile, C. On the generalization ability of on-line learning algorithms. IEEE Trans. Inf. Theory, 50(9):2050-2057, 2004.
Chang, C. and Lin, C. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1-27:27, 2011.
Cheng, L., Vishwanathan, S.V.N., Schuurmans, D., Wang, S., and Caelli, T. Implicit online learning with kernels. In NIPS 19, pp. 249-256, 2007.
Cotter, A., Shalev-Shwartz, S., and Srebro, N. Learning optimally sparse support vector machines. In ICML, 2013.
Crammer, K., Kandola, J., and Singer, Y. Online classification on a budget. In NIPS 16, pp. 225-232, 2004.
Dekel, O., Shalev-Shwartz, S., and Singer, Y. The forgetron: A kernel-based perceptron on a budget. SIAM J. Comput., 37(5):1342-1372, 2008.
Duchi, J. and Singer, Y. Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res., 10:2899-2934, 2009.
Frank, A. and Asuncion, A. UCI machine learning repository, 2010.
Freund, Y. and Schapire, R.E. Large margin classification using the perceptron algorithm. Mach. Learn., 37(3):277-296, 1999.
Keerthi, S.S., Chapelle, O., and DeCoste, D. Building support vector machines with reduced classifier complexity. J. Mach. Learn. Res., 7:1493-1515, 2006.
Kivinen, J., Smola, A.J., and Williamson, R.C. Online learning with kernels. In NIPS 14, pp. 785-792, 2002.
Langford, J., Li, L., and Zhang, T. Sparse online learning via truncated gradient. J. Mach. Learn. Res., 10:777-801, 2009.
Lee, Y. and Mangasarian, O.L. Rsvm: Reduced support vector machines. In SDM, 2001.
Mallapragada, P.K., Jin, R., Jain, A.K., and Liu, Y. Semiboost: Boosting for semi-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell., 31(11):2000-2014, 2009.
Orabona, F., Keshet, J., and Caputo, B. The projectron: a bounded kernel-based perceptron. In ICML, pp. 720-727, 2008.
Roth, V. Probabilistic discriminative kernel classifiers for multi-class problems. In Proceedings of the 23rd DAGM-Symposium on Pattern Recognition, pp. 246-253, 2001.
Scholkopf, B. and Smola, A.J. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002.
Shalev-Shwartz, S., Singer, Y., and Srebro, N. Pegasos: primal estimated sub-gradient solver for SVM. In ICML, pp. 807-814, 2007.
Srebro, N., Sridharan, K., and Tewari, A. Smoothness, low noise and fast rates. In NIPS 23, pp. 2199-2207, 2010.
Wang, Z., Crammer, K., and Vucetic, S. Breaking the curse of kernelization: Budgeted stochastic gradient descent for large-scale svm training. J. Mach. Learn. Res., 13:3103-3131, 2012.
Wu, M., Scholkopf, B., and Bakır, G. A direct method for building sparse kernel learning algorithms. J. Mach. Learn. Res., 7:603-624, 2006.
Zhang, L., Jin, R., Chen, C., Bu, J., and He, X. Efficient online learning for large-scale sparse kernel logistic regression. In AAAI12, pp. 1219-1225, 2012.
Zhao, P., Wang, J., Wu, P., Jin, R., and Hoi, S.C.H. Fast bounded online gradient descent algorithms for scalable kernel-based online learning. In ICML, pp. 169-176, 2012.
Zhu, J. and Hastie, T. Kernel logistic regression and the import vector machine. In NIPS 13, pp. 1081-1088, 2001.
-----0
Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. (eds.). Dataset Shift in Machine Learning. MIT Press, 2009.
Chan, Y. S. and Ng, H. T. Word sense disambiguation with distribution estimation. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, pp. 1010-1015, Scotland, 2005.
Fukumizu, K., Bach, F. R., Jordan, M. I., and Williams, C. Dimensionality reduction for supervised learning with reproducing kernel hilbert spaces. JMLR, 5:73-99, 2004.
Fukumizu, K., Gretton, A., Sun, X., and Scholkopf, B. Kernel measures of conditional dependence. NIPS 20, pp. 489-496, Cambridge, MA, 2008.
Gretton, A., Borgwardt, K., Rasch, M., Scholkopf, B., and Smola, A. A kernel method for the two-sample problem. In NIPS 19, pp. 513-520, Cambridge, MA, 2007.
Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., and Scholkopf, B. Covariate shift and local learning by distribution matching. In Quinonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. (eds.), Dataset shift in machine learning, pp. 131-160. MIT Press, Cambridge, MA, 2008.
Ham, J., Chen, Y., Crawford, M. M., and Ghosh, J. Investigation of the random forest framework for classification of hyperspectral data. IEEE Trans. Geosci. Remote Sens., 43(3):492-501, 2005.
Huang, J., Smola, A., Gretton, A., Borgwardt, K., and Scholkopf, B. Correcting sample selection bias by unlabeled data. In NIPS 19, pp. 601-608, 2007.
Japkowicz, N. and Stephen, S. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6:429-450, 2002.
Jiang, J. A literature survey on domain adaptation of statistical classifiers, 2008. http://sifaka.cs.uiuc.edu/jiang4/domain adaptation/survey.
Lin, Y., Lee, Y., and Wahba, G. Support vector machines for classification in nonstandard situations. Machine Learning, 46:191-202, 2002.
Manski, C. and Lerman, S. The estimation of choice probabilities from choice-based samples. Econometrica, 45:1977-1988, 1977.
Pan, S. J. and Yang, Q. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22:1345-1359, 2010.
Robert, C. P. and Casella, G. Monte Carlo Statistical Methods. Springer Press, New York, 2nd edition, 2004.
Saunders, C., Gammerman, A., and Vovk, V. Ridge regression learning algorithm in dual variables. In Proc. ICML, pp. 515-521, Madison, WI, 1998.
Scholkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. On causal and anticausal learning. In Proc. ICML 2012.
Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90:227-244, 2000.
Smola, A., Gretton, A., Song, L., and Scholkopf, B. A hilbert space embedding for distributions. In Proceedings of the 18th International Conference on Algorithmic Learning Theory, pp. 13-31. Springer-Verlag, 2007.
Song, L., Huang, J., Smola, A., and Fukumizu, K. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In Proc. ICML 2009.
Song, L., Boots, B., Siddiqi, S., Gordon, G., and Smola, A. Hilbert space embeddings of hidden markov models. In ICML 2010.
Sriperumbudur, B., Fukumizu, K., and Lanckriet, G. Universality, characteristic kernels and rkhs embedding of measures. JMLR, 12:2389-2410, 2011.
Storkey, A. When training and test sets are different: Characterizing learning transfer. In Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. (eds.), Dataset Shift in Machine Learning, pp. 3-28. MIT Press, 2009.
Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bunau, P., and Kawanabe, M. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60:699-746, 2008.
Tian, J. and Pearl, J. Causal discovery from changes: a bayesian approach. In UAI 2001, pp. 512-521, 2001.
Woodward, J. Making things happen: A theory of causal explanation. Oxford University Press, New York, 2003.
Yu, Y. and Zhou, Z. A framework for modeling positive class expansion with single snapshot. In PAKDD 2008.
Zadrozny, B. Learning and evaluating classifiers under sample selection bias. In Proc. ICML, pp. 114-121, Banff, Canada, 2004.
Zhang, K., Peters, J., Janzing, D., and Scholkopf, B. Kernel-based conditional independence test and application in causal discovery. In UAI 2011.
-----0
Agarwal, A., Bartlett, P.L., Ravikumar, P., and Wainwright, M.J. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Trans. Inf. Theory, 58(5):3235-3249, 2012.
Bartlett, P.L., Bousquet, O., and Mendelson, S. Local rademacher complexities. Ann. Stat., 33(4):1497-1537, 2005.
Cesa-Bianchi, N. and Lugosi, G. Prediction, Learning, and Games. Cambridge University Press, 2006.
Chang, C. and Lin, C. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol., 2(3):27:1-27:27, 2011.
Chen, X., Lin, Q., and Pena, J. Optimal regularized dual averaging methods for stochastic optimization. In NIPS 25, pp. 404-412, 2012.
Clarkson, K.L. Coresets, sparse greedy approximation, and the frank-wolfe algorithm. ACM Trans. Algorithms, 6(4):63:1-63:30, 2010.
Cotter, A., Shamir, O., Srebro, N., and Sridharan, K. Better mini-batch algorithms via accelerated gradient methods. In NIPS 24, pp. 1647-1655, 2011.
Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. Optimal distributed online prediction. In Getoor, Lise and Scheffer, Tobias (eds.), ICML, pp. 713-720, 2011.
Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. Optimal distributed online prediction using minibatches. J. Mach. Learn. Res., 13:165-202, 2012.
Frank, M. and Wolfe, P. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95-110, 1956.
Ghadimi, S. and Lan, G. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework. SIAM J. Optim., 22(4):1469-1492, 2012.
Hazan, E. Sparse approximate solutions to semidefinite programs. In LATIN, pp. 306-316, 2008.
Hazan, E. and Kale, S. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In COLT, pp. 421-436, 2011.
Hazan, E. and Kale, S. Projection-free online learning. In ICML, pp. 521-528, 2012.
Hu, C., Kwok, J., and Pan, W. Accelerated gradient methods for stochastic optimization and online learning. In NIPS 22, pp. 781-789, 2009.
Jaggi, M. Revisiting frank-wolfe: Projection-free sparse convex optimization. In ICML, 2013.
Jin, R., Wang, S., and Zhou, Y. Regularized distance metric learning: Theory and algorithm. In NIPS 22, pp. 862-870, 2009.
Juditsky, A. and Nesterov, Y. Primal-dual subgradient methods for minimizing uniformly convex functions. Technical report, 2010.
Juditsky, A., Nemirovski, A., and Tauvel, C. Solving variational inequalities with stochastic mirror-prox algorithm. Stochastic Systems, 1(1):17-58, 2011.
Lacoste-Julien, S., Jaggi, M., Schmidt, M., and Pletscher, P. Block-coordinate frank-wolfe optimization for structural svm. In ICML, 2013.
Lan, G. An optimal method for stochastic composite optimization. Math. Program., 133:365-397, 2012.
Levitin, E.S. and Polyak, B.T. Constrained minimization methods. USSR Computational Mathematics and Mathematical Physics, 6(5):1-50, 1966.
Mahdavi, M., Yang, T., Jin, R., Zhu, S., and Yi, J. Stochastic gradient descent with only one projection. In NIPS 25, pp. 503-511, 2012.
Nemirovski, A. Prox-method with rate of convergence O(1/t) for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim., 15(1):229-251, 2005.
Nemirovski, A. and Yudin, D.B. Problem complexity and method efficiency in optimization. John Wiley & Sons Ltd, 1983.
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574-1609, 2009.
Nesterov, Y. Introductory lectures on convex optimization: a basic course, volume 87 of Applied optimization. Kluwer Academic Publishers, 2004.
Nesterov, Y. Smooth minimization of non-smooth functions. Math. Program., 103(1):127-152, 2005.
Nesterov, Y. Gradient methods for minimizing composite objective function. Core discussion papers, 2007.
Rakhlin, A., Shamir, O., and Sridharan, K. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, pp. 449-456, 2012.
Roux, N.L., Manzagol, P., and Bengio, Y. Topmoumoute online natural gradient algorithm. In NIPS 20, pp. 849-856, 2008.
Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Stochastic convex optimization. In COLT, 2009.
Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. Pegasos: primal estimated sub-gradient solver for svm. Math. Program., 127(1):3-30, 2011.
Smale, S. and Zhou, D. Geometry on probability spaces. Constr. Approx., 30:311-323, 2009.
Zhang, T. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In ICML, pp. 919-926, 2004.
-----0
Argyriou, A., Hauser, R., Micchelli, C. A., and Pontil, M. (2006). A DC-programming algorithm for kernel selection. In Proceedings of the 23rd international conference on Machine learning, ICML '06, pages 41-48, New York, New York, USA. ACM Press.
Bach, F. (2008). Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning. Advances in Neural Information Processing Systems (NIPS), 21(2):105-112.
Bacry, E., Dayri, K., and Muzy, J. F. (2012). Nonparametric kernel estimation for symmetric Hawkes processes. Application to high frequency financial data. The European Physical Journal B, 85(5).
Blundell, C., Heller, K., and Beck, J. (2012). Modelling Reciprocating Relationships with Hawkes Processes. Advances in Neural Information Processing Systems (NIPS).
Cortes, C., Mohri, M., and Rostamizadeh, A. (2012). Ensembles of Kernel Predictors.
Dinuzzo, F., Ong, C., Gehler, P., and Pillonetto, G. (2011). Learning output kernels with block coordinate descent. In Proceedings of the 28th International Conference on Machine Learning.
Du, N., Song, L., Smola, A., and Yuan, M. (2012). Learning Networks of Heterogeneous Influence. Advances in Neural Information Processing Systems.
Hawkes, A. G. (1971). Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83-90.
Hoi, S. C. H., Jin, R., and Lyu, M. R. (2007). Learning nonparametric kernel matrices from pairwise constraints. In Proceedings of the 24th international conference on Machine learning, ICML '07, pages 361-368, New York, New York, USA. ACM Press.
Hunter, D. R. and Lange, K. (2003). A Tutorial on MM Algorithms. pages 1-28.
Lewis, E. and Mohler, G. (2011). A Nonparametric EM algorithm for Multiscale Hawkes Processes. Journal of Nonparametric Statistics, (1).
Lewis, E., Mohler, G., Brantingham, P. J., and Bertozzi, A. L. (2011). Self-exciting point process models of civilian deaths in Iraq. Security Journal, 25(3):244-264.
Liniger, T. J. (2009). Multivariate Hawkes Processes. PhD thesis, Swiss Federal Institute Of Technology Zurich.
Mammen, E. and Thomas-Agnan, C. (1998). Smoothing Splines And Shape Restrictions. Scandinavian Journal of Statistics, 26:239-251.
Marsan, D. and Lengline, O. (2008). Extending earthquakes' reach through cascading. Science, 319(5866):1076-1079.
Mitchell, L. and Cates, M. E. (2010). Hawkes process as a model of social interactions: a view on video dynamics. Journal of Physics A: Mathematical and Theoretical, 43(4):045101.
Papangelou, F. (1972). Integrability of Expected Increments of Point Processes and a Related Random Change of Scale. Transactions of the American Mathematical Society, 165:483.
Reinsch, C. H. (1967). Smoothing by Spline Functions. Numerische Mathematik, pages 177-183.
Sain, S. R. and Scott, D. W. (2002). Zero-bias locally adaptive density estimators. Scandinavian Journal of Statistics, 29(3):441-460.
Simma, A. and Jordan, M. (2012). Modeling events with cascades of Poisson processes. Uncertainty in Artificial Intelligence (UAI).
Sonnenburg, S., Raetsch, G., Schaefer, C., and Scholkopf, B. (2006). Large scale multiple kernel learning. Journal of Machine Learning Research, 7(1):1531-1565.
Stomakhin, A., Short, M. B., and Bertozzi, A. L. (2011). Reconstruction of missing data in social networks based on temporal patterns of interactions. Inverse Problems, 27(11):115013.
Toke, I. M. (2010). Market making behaviour in an order book model and its impact on the bid-ask spread. Arxiv, page 17.
Turlach, B. A. (1997). Constrained Smoothing Splines Revisited. PhD thesis, Australian National University.
Vere-Jones, D. (1970). Stochastic Models for Earthquake Occurrence. Journal of the Royal Statistical Society, 32(1):1-62.
Wahba, G. (1990). Spline models for observational data. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA.
Zhu, X., Kandola, J. S., Ghahramani, Z., and Lafferty, J. D. (2004). Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning. Advances in Neural Information Processing Systems (NIPS).
-----0
Blei, D.M. and McAuliffe, J.D. Supervised topic models. Advances in Neural Information Processing Systems (NIPS), pp. 121-128, 2007.
Catoni, O. PAC-Bayesian supervised classification: The thermodynamics of statistical learning. Monograph series of the Institute of Mathematical Statistics, 2007.
Dempster, A.P., Laird, N.M., and Rubin, D.B. Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Ser. B, 39:1-38, 1977.
Devroye, L. Non-uniform random variate generation. Springer-Verlag, 1986.
Germain, P., Lacasse, A., Laviolette, F., and Marchand, M. PAC-Bayesian learning of linear classifiers. In International Conference on Machine Learning (ICML), pp. 353-360, 2009.
Griffiths, T.L. and Steyvers, M. Finding scientific topics. Proceedings of National Academy of Science (PNAS), pp. 5228-5235, 2004.
Jiang, Q., Zhu, J., Sun, M., and Xing, E.P. Monte Carlo methods for maximum margin supervised topic models. In Advances in Neural Information Processing Systems (NIPS), 2012.
Joachims, T. Making large-scale SVM learning practical. MIT press, 1999.
Lacoste-Julien, S., Sha, F., and Jordan, M.I. DiscLDA: Discriminative learning for dimensionality reduction and classification. Advances in Neural Information Processing Systems (NIPS), pp. 897-904, 2009.
McAllester, D. PAC-Bayesian stochastic model selection. Machine Learning, 51:5-21, 2003.
Michael, J.R., Schucany, W.R., and Haas, R.W. Generating random variates using transformations with multiple roots. The American Statistician, 30(2):88-90, 1976.
Newman, D., Asuncion, A., Smyth, P., and Welling, M. Distributed algorithms for topic models. Journal of Machine Learning Research (JMLR), 10:1801-1828, 2009.
Polson, N.G. and Scott, S.L. Data augmentation for support vector machines. Bayesian Analysis, 6(1):1-24, 2011.
Rifkin, R. and Klautau, A. In defense of one-vs-all classification. Journal of Machine Learning Research (JMLR), 5:101-141, 2004.
Smola, A. and Narayanamurthy, S. An architecture for parallel topic models. Very Large Data Base (VLDB), 3(1-2):703-710, 2010.
Smola, A. and Scholkopf, B. A tutorial on support vector regression. Statistics and Computing, 14(3):199-222, 2003.
Tanner, M.A. and Wong, W.-H. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association (JASA), 82(398):528-540, 1987.
van Dyk, D. and Meng, X. The art of data augmentation. Journal of Computational and Graphical Statistics (JCGS), 10(1):1-50, 2001.
Yang, S., Bian, J., and Zha, H. Hybrid generative/discriminative learning for automatic image annotation. In Uncertainty in Artificial Intelligence (UAI), 2010.
Zhu, J. and Xing, E.P. Conditional topic random fields. In International Conference on Machine Learning (ICML), pp. 1239-1246, 2010.
Zhu, J., Ahmed, A., and Xing, E.P. MedLDA: maximum margin supervised topic models for regression and classification. In International Conference on Machine Learning (ICML), pp. 1257-1264, 2009.
Zhu, J., Chen, N., and Xing, E.P. Infinite latent SVM for classification and multi-task learning. In Advances in Neural Information Processing Systems (NIPS), pp. 1620-1628, 2011.
Zhu, J., Ahmed, A., and Xing, E.P. MedLDA: maximum margin supervised topic models. Journal of Machine Learning Research (JMLR), 13:2237-2278, 2012.
-----0
Almer, O., Topham, N., and Franke, B. A Learning-Based Approach to the Automated Design of MPSoC Networks. Architecture of Computing Systems (ARCS 2011), pp. 243-258, 2011.
Bonilla, E., Chai, K.M.A., and Williams, C.K.I. Multitask Gaussian Process Prediction. In Conference on Neural Information Processing Systems (NIPS), 2008.
Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
Brochu, E., Cora, V.M., and de Freitas, N. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement learning. Arxiv preprint arXiv:1012.2599, 2010.
Coello, C., Lamont, G. B., and Veldhuizen, D. Evolutionary Algorithms for Solving Multi-Objective Problems. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
D., Lanping, Sobti, K., and Chakrabarti, C. Accurate Models for Estimating Area and Power of FPGA Implementations. In Intl Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1417-1420, 2008.
Jones, D. R., Schonlau, M., and Welch, W. J. Efficient Global Optimization of Expensive Black-box Functions. J Glob. Opti., 13:455-492, 1998.
Knowles, J. ParEGO: a Hybrid Algorithm with Online Landscape Approximation for Expensive Multiobjective Optimization Problems. IEEE Trans. on Evolutionary Computation, 10(1):50-66, 2006.
Kunzli, S., Thiele, L., and Zitzler, E. Modular Design Space Exploration Framework for Embedded Systems. Computers & Digital Techniques, 152(2):183-192, 2005.
Palermo, G., Silvano, C., and Zaccaria, V. ReSPIR: A Response Surface-Based Pareto Iterative Refinement for Application-Specific Design Space Exploration. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 28:1816-1829, 2009.
Rasmussen, C.E. and Nickisch, H. Gaussian Process Regression and Classification Toolbox Version 3.1 for Matlab 7.x, 2010.
Rasmussen, C.E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.
Settles, Burr. Active Learning Literature Survey. Technical Report 1648, University of Wisconsin-Madison, 2010.
Siegmund, N., Kolesnikov, S.S., Kastner, C., Apel, S., Batory, D., Rosenmuller, M., and Saake, G. Predicting Performance via Automated Feature-Interaction Detection. In Intl Conference on Software Engineering (ICSE), pp. 167-177, 2012.
Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. In Intl Conference on Machine Learning (ICML), 2010.
Srinivas, N., Krause, A., Kakade, S.M., and Seeger, M. Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting. IEEE Trans. on Information Theory, 58(5):3250-3265, 2012.
Zhang, Q., Wudong, L., Tsang, E., and Virginas, B. Expensive Multiobjective Optimization by MOEA/D with Gaussian Process Model. IEEE Trans. on Evolutionary Computation, 14(3):456-474, 2010.
Zitzler, E., Laumanns, M., and Thiele, L. SPEA2: Improving the Strength Pareto Evolutionary Algorithm for Multiobjective Optimization. In Evolutionary Methods for Design, Optimisation, and Control, pp. 95-100, 2002.
Zitzler, E., Brockhoff, D., and Thiele, L. The Hypervolume Indicator Revisited: on the Design of Pareto-compliant Indicators via Weighted Integration. In Intl Conference on Evolutionary Multicriterion Optimization (EMO), pp. 862-876, 2007.
Zuluaga, M., Krause, A., Milder, P.A., and Puschel, M. Smart Design Space Sampling to Predict Pareto-optimal Solutions. In Languages, Compilers, Tools and Theory for Embedded Systems (LCTES), pp. 119-128, 2012a.
Zuluaga, M., Milder, P., and Puschel, M. Computer Generation of Streaming Sorting Networks. In Design Automation Conference (DAC), 2012b.
-----0
Amit, Y., Fink, M., Srebro, N., and Ullman, S. Uncovering shared structures in multiclass classification. ICML, 2007.
Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
Bengio, S., Weston, J., and Grangier, D. Label embedding trees for large multi-class tasks. NIPS, 2010.
Berg, A., Deng, J., and Fei-Fei, L. Large scale visual recognition challenge 2010, 2010. URL http://www.image-net.org/challenges/LSVRC/2010/index.
Caruana, R. Multitask learning. Machine Learning, 1997.
Crammer, K. and Singer, Y. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265-292, 2002.
Daubechies, I., Defrise, M., and De Mol, C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on pure and applied mathematics, 57(11):1413-1457, 2004.
Duchi, J. and Singer, Y. Boosting with structural sparsity. In ICML, pp. 297-304. ACM, 2009.
Friedman, J., Hastie, T., and Tibshirani, R. A note on the group lasso and a sparse group lasso. Arxiv preprint arXiv:1001.0736, 2010.
Gao, T. and Koller, D. Discriminative learning of relaxed hierarchy for large-scale visual recognition. ICCV, 2011.
Gehler, P. and Nowozin, S. On feature combination for multiclass object classification. In ICCV, 2009.
Griffin, G., Holub, A., and Perona, P. Caltech-256 object category dataset. 2007.
Hull, J.J. A database for handwritten text recognition research. T-PAMI, IEEE, 16(5):550-554, 1994.
Hwang, S.J., Grauman, K., and Sha, F. Learning a tree of metrics with disjoint visual features. NIPS, 2011.
Jawanpuria, P. and Nath, J.S. A convex feature learning formulation for latent task structure discovery. ICML, 2012.
Kang, Z., Grauman, K., and Sha, F. Learning with whom to share in multi-task feature learning. NIPS, 2011.
Kim, S. and Xing, E.P. Tree-guided group lasso for multitask regression with structured sparsity. ICML, 2010.
Krizhevsky, A. and Hinton, G.E. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
Li, L.J., Su, H., Xing, E.P., and Fei-Fei, L. Object bank: A high-level image representation for scene classification and semantic feature sparsification. NIPS, 2010.
Obozinski, G., Taskar, B., and Jordan, M. Joint covariate selection for grouped classification. Department of Statistics, U. of California, Berkeley, TR, 743, 2007.
Quattoni, A. and Torralba, A. Recognizing indoor scenes. CVPR, 2009.
Quattoni, A., Collins, M., and Darrell, T. Transfer learning for image classification with sparse prototype representations. In CVPR, 2008.
Rahimi, A. and Recht, B. Random features for large-scale kernel machines. NIPS, 2007.
Shalev-Shwartz, S., Wexler, Y., and Shashua, A. Shareboost: Efficient multiclass learning with feature sharing. Proc. NIPS, 2011.
Sprechmann, P., Ramírez, I., Sapiro, G., and Eldar, Y.C. C-hilasso: A collaborative hierarchical sparse modeling framework. Signal Processing, IEEE Transactions on, 59(9):4183-4198, 2011.
Tibshirani, R. Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, 58:267-288, 1996.
Torralba, A., Murphy, K.P., and Freeman, W.T. Sharing visual features for multiclass and multiview object detection. T-PAMI, IEEE, 29(5):854-869, 2007.
Verma, N., Mahajan, D., Sellamanickam, S., and Nair, V. Learning hierarchical similarity metrics. In CVPR. IEEE, 2012.
Weinshall, D., Hermansky, H., Zweig, A., Luo, J., Jimison, H., Ohl, F., and Pavel, M. Beyond novelty detection: Incongruent events, when general and specific classifiers disagree. In Proc. NIPS, volume 8, 2008.
Wright, S.J., Nowak, R.D., and Figueiredo, M.A.T. Sparse reconstruction by separable approximation. Signal Processing, IEEE Transactions on, 57(7):2479-2493, 2009.
Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. JMLR, 2010.
Yang, H., Xu, Z., King, I., and Lyu, M. Online learning for group lasso. ICML, 2010.
Yang, Jian-Bo and Tsang, Ivor. Hierarchical maximum margin learning for multi-class classification. UAI, 2011.
Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. J. Royal Statist. Soc. B, 68(1):49-67, 2006.
Zhao, B., Fei-Fei, L., and Xing, E.P. Large-scale category structure aware image categorization. Proc. NIPS, 2011.
Zhou, D., Xiao, L., and Wu, M. Hierarchical classification via orthogonal transfer. In ICML, 2011.
Zweig, A. and Weinshall, D. Exploiting Object Hierarchy: Combining Models from Different Category Levels. Proc. ICCV, 2007.