I am a pragmatic Bayesian, an award-winning researcher, and
an R&D team leader with 15 years of balanced academic and industry
experience. I am now with Ant Financial of Alibaba Group,
leading an R&D team of 60+ researchers and engineers that works on
cognitive computing and develops advanced search and recommender
systems based on reinforcement learning.
Previously, I was with Alibaba Cloud, leading an R&D team that
developed a distributed machine learning platform, including distributed
deep learning on GPU clusters and online serving of
predictive models. Before joining Alibaba, I was a team leader at
Microsoft Bing, developing personalized search services. At Yahoo! Labs
I worked with colleagues on content optimization via contextual
bandits over web-scale user-click streams.
My academic interest is to design and implement statistical learning
algorithms that discover useful patterns in enormous machine-readable
data that might otherwise not be found by human inspection. I have been
fortunate to work with and learn from world-renowned researchers
at prestigious institutions such as Columbia University and University
College London. I was mentored by Zoubin Ghahramani and David L. Wild
in statistical machine learning as a post-doctoral research fellow at
the Gatsby Computational Neuroscience Unit, UCL. I received my Ph.D.
from the National University of Singapore, under the joint guidance of
S. Sathiya Keerthi and Chong Jin Ong, with a
thesis titled "Bayesian approach to support vector machines".
I have published 40+ papers at top-tier conferences
and journals, received 4000+ citations
according to Google Scholar as of Aug. 2017, and earned a Best
Paper Award at ACM WSDM and a Best Demo Award at ACM CIKM. In 2016 I
was elected a National Innovation Talent of China, a prestigious title
for outstanding scientists.
-
Director of Engineering, AI Dept, Ant Financial, Alibaba
Group,
2017.08 till now
-
Director of Engineering, iDST, Alibaba Cloud, Alibaba
Group,
2014.11 to 2017.08
-
Principal Applied Scientist Lead, Bing, Microsoft,
2014.01 to 2014.11
-
Senior Applied Researcher, Bing, Microsoft,
2011.05 to 2014.01
-
Scientist, Yahoo! Labs, 2008.01 to 2011.05
-
Associate Research Scientist, CCLS, Columbia
University, 2006.01 to 2008.01
-
Research Fellow, Gatsby Unit, University
College London, 2003.02 to 2006.01
Journal Article & Book Chapter
-
In this paper, we focus on recency search and study a
number
of algorithms to improve ranking results by leveraging user click
feedback. Our contributions are three-fold. First, we use real search
sessions collected in a random exploration bucket for reliable
offline evaluation of these algorithms, which provides an unbiased
comparison across algorithms without online bucket tests. Second, we
propose a re-ranking approach to improve search results for recency
queries using user clicks. Third, our empirical comparison of a dozen
algorithms on real-life search data suggests the importance of a few
algorithmic choices in these applications, including generalization
across different query-document pairs, specialization to popular
queries, and real-time adaptation to user clicks. [pdf]
-
In this paper, we propose two new support vector
formulations
for ordinal regression, which optimize multiple thresholds to define
parallel discriminant hyperplanes for the ordinal scales. Both
approaches guarantee that the thresholds are properly ordered at the
optimal solution.
-
In this paper, we develop a segmental semi-Markov model
(SSMM) for
protein secondary structure prediction which incorporates multiple
sequence alignment profiles with the purpose of improving the
predictive performance. By
incorporating the information from long range interactions in
beta-sheets, this model is also capable of carrying out
inference on contact maps.
[ps][supplement]
-
In this paper, we describe a gene selection algorithm
based on Gaussian processes to discover consistent gene expression
patterns associated with ordinal clinical phenotypes. The
technique of automatic relevance determination is applied to
represent the significance level of the genes in a Bayesian framework.
[pdf]
[ps]
[code]
-
In this paper, we present a probabilistic approach to
ordinal
regression in Gaussian processes. In the Bayesian framework of
Gaussian processes, we propose a likelihood function for ordinal
variables that is a generalization of the probit function.
Two inference techniques, based on Laplace approximation and
expectation propagation respectively, are applied for model
selection.
[pdf]
[ps]
[zip]
[code]
-
In this paper, we propose an improved method for the
numerical
solution of LS-SVM. Compared with the existing algorithm (Suykens et
al., 1999) for LS-SVM, our approach is about twice as efficient.
-
In this paper, we use the soft insensitive loss
function in likelihood evaluation, and describe a Bayesian
framework in a stationary Gaussian process. Bayesian methods are used
to implement model adaptation, while keeping the merits of support
vector regression, such as quadratic programming and sparseness.
Moreover, confidence intervals are provided with predictions.
[pdf]
[ps]
[zip]
[code]
-
In this paper, we propose a Bayesian support vector
classifier
by introducing a novel likelihood function, known as the trigonometric
likelihood function. Model adaptation and ARD feature selection can
be implemented intrinsically in hyperparameter inference. Another
benefit is that class probabilities are available when making predictions. [pdf]
[code]
Refereed Conference
-
Over the past few years, major web search engines have
introduced knowledge bases to offer popular facts about people, places,
and things on the entity pane next to regular search results. In
addition to information about the entity searched by the user, the
entity pane often provides a ranked list of related entities. To keep
users engaged, it is important to develop a recommendation model that
tailors the related entities to individual user interests. We propose a
probabilistic Three-way Entity Model (TEM) that provides personalized
recommendation of related entities using three data sources: knowledge
base, search click log, and entity pane log. Specifically, TEM is
capable of extracting hidden structures and capturing underlying
correlations among users, main entities, and related entities.
Moreover, the TEM model can also exploit the click signals derived from
the entity pane log. We further provide an inference technique to learn
the parameters in TEM, and propose a principled preference learning
method specifically designed for ranking related entities. Extensive
experiments with two real-world datasets show that TEM with our
probabilistic framework significantly outperforms a state-of-the-art
baseline, confirming the effectiveness of TEM and our probabilistic
framework in related entity recommendation.
-
Web search engines utilize behavioral signals to develop
search experiences tailored to individual users. To be effective, such
personalization relies on access to sufficient information about each
user's interests and intentions. For new users or new queries, profile
information may be sparse or non-existent. To handle these cases, and
perhaps also improve personalization for those with profiles, search
engines can employ signals from users who are similar along one or more
dimensions, i.e., those in the same cohort. In this paper we describe a
characterization and evaluation of the use of such cohort modeling to
enhance search personalization. We experiment with three pre-defined
cohorts (topic, location, and top-level domain preference), independently
and in combination, and also evaluate methods to learn cohorts
dynamically. We show via extensive experimentation with large-scale
logs from a commercial search engine that leveraging cohort behavior
can yield significant relevance gains when combined with a production
search engine ranking algorithm that uses similar classes of
personalization signal but at the individual searcher level. Additional
experiments show that our gains can be extended when we dynamically
learn cohorts and target easily-identifiable classes of ambiguous or
unseen queries.
-
In this paper, we propose a general ranking model
adaptation
framework for personalized search. Using a given user-independent
ranking model trained offline and a limited number of adaptation queries
from individual users, the framework quickly learns to apply a series
of linear transformations, e.g., scaling and shifting, over the
parameters of the given global ranking model such that the adapted
model can better fit each individual user's search preferences.
Extensive experimentation based on a large set of search logs from a
major commercial Web search engine confirms the effectiveness of the
proposed method compared to several state-of-the-art ranking model
adaptation methods.
-
Personalized search systems tailor search results to the
current user intent using historic search interactions. This relies on
being able to find pertinent information in that user's search history,
which can be challenging for unseen queries and for new search
scenarios. Building richer models of users' current and historic search
tasks can help improve the likelihood of finding relevant content and
enhance the relevance and coverage of personalization methods. The
task-based approach can be applied to the current user's search
history, or as we focus on here, all users' search histories as
so-called "groupization" (a variant of personalization whereby other
users' profiles can be used to personalize the search experience). We
describe a method whereby we mine historic search-engine logs to find
other users performing similar tasks to the current user and leverage
their on-task behavior to identify Web pages to promote in the current
ranking. We investigate the effectiveness of this approach versus
query-based matching and finding related historic activity from the
current user (i.e., group versus individual). As part of our studies we
also explore the use of the on-task behavior of particular user
cohorts, such as people who are expert in the topic currently being
searched, rather than all other users. Our approach yields promising
gains in retrieval performance, and has direct implications for
improving personalization in search systems.
-
Search tasks, comprising a series of search queries
serving the same information need, have recently been recognized as an
accurate atomic unit for modeling user search intent. Most prior
research in this area has focused on short-term search tasks within a
single search session, and has depended heavily on human annotations for
supervised classification model learning. In this work, we target the
identification of long-term, or cross-session, search tasks
(transcending session boundaries) by investigating inter-query
dependencies learned from users' searching behaviors. A semi-supervised
clustering model is proposed based on the latent structural SVM
framework, and a set of effective automatic annotation rules is
proposed as weak supervision to relieve the burden of manual
annotation. Experimental results based on a large-scale search log
collected from Bing.com confirm the effectiveness of the proposed
model in identifying cross-session search tasks and the utility of the
introduced weak supervision signals. Our learned model enables a more
comprehensive understanding of users' search behaviors via search logs
and facilitates the development of dedicated search-engine support for
long-term tasks.
-
User behavior provides many cues to improve the relevance
of
search results through personalization. One aspect of user behavior
that provides especially strong signals for delivering better relevance
is an individual's history of queries and clicked documents. Previous
studies have explored how short-term behavior or long-term behavior can
be predictive of relevance. Ours is the first study to assess how
short-term (session) behavior and long-term (historic) behavior
interact, and how each may be used in isolation or in combination to
optimally contribute to gains in relevance through search
personalization. [pdf]
-
Unlabeled samples can be intelligently selected for
labeling
to minimize classification error. In many real-world applications,
a large number of unlabeled samples arrive in a
streaming manner, making it impossible to maintain all the
data in a candidate pool. In this work, we consider the unbiasedness
property in the
sampling process, and design optimal instrumental distributions
to minimize the variance in the stochastic process.
Meanwhile, Bayesian linear classifiers with weighted maximum
likelihood are optimized online to estimate parameters. [pdf]
-
Online auction and shopping are gaining popularity with
the
growth
of web-based eCommerce. Criminals are also taking advantage of
these opportunities to conduct fraudulent activities against honest
parties with the purpose of deception and illegal profit. In practice,
proactive moderation systems are deployed to detect suspicious
events for further inspection by human experts. Motivated by
real-world applications in commercial auction sites in Asia, we develop
various advanced machine learning techniques in the proactive
moderation system. [pdf]
-
In this paper, we introduce a replay
methodology for contextual bandit algorithm evaluation.
Different from simulator-based approaches, our method is
completely data-driven and very easy to adapt to different
applications. More importantly, our method can provide provably
unbiased evaluations. Our empirical results on a large-scale
news article recommendation dataset collected from Yahoo!
Front Page agree well with our theoretical results.
Furthermore, comparisons between our offline replay and online
bucket evaluation of several contextual bandit algorithms show
the accuracy and effectiveness of our offline evaluation method. [pdf]
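As a rough illustration of the replay idea in the abstract above, the sketch below scores a fixed policy against logged events. It is a minimal version under stated assumptions, not the paper's exact procedure: it assumes the logged arms were chosen uniformly at random, it handles only a static (non-learning) policy, and the names `replay_evaluate`, `LoggedEvent`, and `toy_log` are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

# One logged interaction: (context, arm shown by the logging policy, observed reward).
# Assumption: the logging policy picked arms uniformly at random.
LoggedEvent = Tuple[dict, int, float]


def replay_evaluate(
    policy: Callable[[dict], int],
    events: Iterable[LoggedEvent],
) -> Tuple[Optional[float], int]:
    """Estimate the average reward of `policy` by replaying logged events.

    An event counts only when the policy picks the same arm that was
    logged; all other events are skipped. Returns the pair
    (average reward over matched events, number of matched events),
    with None as the average when nothing matched.
    """
    matched = 0
    total_reward = 0.0
    for context, logged_arm, reward in events:
        if policy(context) == logged_arm:
            matched += 1
            total_reward += reward
    if matched == 0:
        return None, 0
    return total_reward / matched, matched


# Tiny hand-made log (illustrative data only): reward is 1.0 when the
# shown arm equals the context value "x", and 0.0 otherwise.
toy_log = [
    ({"x": 0}, 0, 1.0),
    ({"x": 0}, 1, 0.0),
    ({"x": 1}, 1, 1.0),
    ({"x": 1}, 0, 0.0),
]

estimate, n_matched = replay_evaluate(lambda ctx: ctx["x"], toy_log)
```

Only the matched events contribute to the estimate, so the usable sample shrinks roughly in proportion to the number of arms; a learning policy would additionally update its state on each matched event.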
-
In this paper we study the contextual bandit
problem (also known as the multi-armed
bandit problem with expert advice) for linear
payoff functions. We prove
a high-probability regret upper bound. We also prove
a lower bound for this setting, matching the upper bound up to
logarithmic
factors. [pdf]
-
In this paper, we propose an online learning algorithm
that can quickly learn the best re-ranking of the top portion of the
original ranked list based on real-time user click feedback. In order
to devise our algorithm and evaluate it accurately, we collected
exploration bucket data that removes positional biases on clicks on the
documents for recency-classified queries. Our initial experimental
results show that our scheme is more capable of
quickly adjusting the ranking to track the varying relevance
of documents reflected in the click feedback, compared to
batch-trained ranking functions. [pdf]
-
Personalized web services strive to adapt their services
(advertisements, news articles, etc.) to individual users by making use
of both content and user information. Despite a few recent advances,
this problem remains challenging for at least two reasons. First, web
services feature dynamically changing pools of content, rendering
traditional collaborative filtering methods inapplicable. Second, the
scale of most web services of practical interest calls for solutions
that are both fast in learning and computation. In this work, we model
personalized recommendation of news articles
as a contextual bandit problem, a principled approach in which a
learning algorithm sequentially selects articles to serve users based
on contextual information about the users and articles, while
simultaneously adapting its article-selection strategy based on
user-click feedback to maximize total user clicks. [pdf]
-
Recommender systems are widely used in online e-commerce
applications to improve user engagement and then to increase revenue. A
key challenge for recommender systems is providing high-quality
recommendations to users in "cold-start" situations. We consider three
types of cold-start problems: 1) recommendation on existing items for
new users; 2) recommendation on new items for existing users; 3)
recommendation on new items for new users. We propose predictive
feature-based regression models that leverage all available information
of users and items, such as user demographic information and item
content features, to tackle cold-start problems. The resulting
algorithms scale efficiently as a linear function of the number of
observations. We verify the usefulness of our approach in three
cold-start settings on the MovieLens and EachMovie datasets, by
comparing with five alternatives: random, most popular,
segmented most popular, and two variations of the Vibes affinity algorithm
widely used at Yahoo! for recommendation.
-
In multiway data, each sample is measured by multiple sets
of
correlated attributes. We develop a probabilistic framework for
modeling structural dependency from partially observed
multi-dimensional array data, known as pTucker. Latent components
associated with individual array dimensions are jointly retrieved
while the core tensor is integrated out. The resulting algorithm
is capable of handling large-scale data sets. We verify the
usefulness of this approach by comparing against classical models
on applications to modeling amino acid fluorescence, collaborative
filtering and a number of benchmark multiway array data. [pdf]
[third-party
pTucker code]
-
In Web-based services of dynamic content (such as news
articles),
recommender systems face the difficulty of timely identifying new
items of high-quality and providing recommendations for new users.
We propose a feature-based machine learning approach to
personalized recommendation that is capable of handling the
cold-start issue effectively. We maintain profiles of content of
interest, in which temporal characteristics of the content, e.g.,
popularity and freshness, are updated in real time. We also
maintain profiles of users including demographic information and a
summary of user activities within Yahoo! properties. Based on all
features in user and content profiles, we develop predictive
bilinear regression models to provide accurate personalized
recommendations of new items for both existing and new users. This
approach results in an offline model with light computational
overhead compared with other recommender systems that require
online re-training. The proposed framework is general and flexible
for other personalized tasks. The superior performance of our
approach is verified on a large-scale data set collected from the
Today-Module on Yahoo! Front Page, with comparison against six
competitive approaches. [pdf] [slides]
-
In this paper, we report a successful large-scale
case study of conjoint analysis on click-through streams in
a real-world application at Yahoo!. We consider identifying
users' heterogeneous preferences from millions of click/view
events and building predictive models to classify new users
into segments with distinct behavior patterns. A scalable conjoint
analysis technique, known as tensor segmentation, is
developed by utilizing logistic tensor regression in the standard
partworth framework for solutions. [pdf]
-
We consider the case when relationships are postulated to
exist due to hidden common
causes. We discuss how the resulting graphical model differs from
Markov
networks, and how it describes different types of real-world relational
processes.
A Bayesian nonparametric classification model is built upon this
graphical representation
and evaluated with several empirical studies.
See Ricardo Silva's homepage for [pdf], [data] and [code]
-
In this paper we model relational random variables on
the edges of a network using
Gaussian processes (GPs). We describe appropriate GP priors, i.e.,
covariance
functions, for directed and undirected networks connecting homogeneous
or heterogeneous
nodes. The framework suggests an intimate connection between link
prediction and transfer learning, which were traditionally two separate
topics. [pdf]
-
Censored targets, such as the time to events in survival
analysis,
can generally be represented by intervals on the real line. In
this paper, we propose a novel support vector technique (named SVCR)
for
regression on censored targets. Interestingly,
this approach provides a general formulation for both standard
regression and binary classification tasks. [pdf] [longer
version]
-
We consider the problem of utilizing unlabeled data for
Gaussian process inference. Using a geometrically motivated
data-dependent prior, we propose a graph-based construction of
semi-supervised Gaussian processes. We demonstrate this approach
empirically on several classification problems. [pdf]
-
Correlation between instances is often modelled via a
kernel function using input attributes of the instances. Relational
knowledge can further reveal additional pairwise correlations
between variables of interest. In this paper, we develop a class
of models which incorporates both reciprocal relational
information and input attributes using Gaussian process
techniques. This approach provides a novel non-parametric Bayesian
framework with a data-dependent prior for supervised learning
tasks. We also apply this framework to semi-supervised learning.
Experimental results on several real world data sets verify the
usefulness of this algorithm.
[pdf]
-
We introduce a Gaussian process (GP) framework,
stochastic relational models
(SRM), for learning social, physical, and other relational phenomena
where interactions
between entities are observed. The key idea is to model the stochastic
structure of entity relationships (i.e., links) via an interplay of
multiple GPs, each
defined on one type of entity.
[pdf]
-
We present two new support vector approaches for ordinal
regression.
These approaches find the concentric spheres with minimum volume that
contain most of the training samples.
[pdf]
-
We propose a Bayesian approach to identify protein
complexes and their constituents from high-throughput protein-protein
interaction screens. An infinite latent feature model that allows for
multi-complex membership by individual proteins is coupled with a graph
diffusion kernel that evaluates the likelihood of two proteins
belonging to the same complex. Gibbs sampling is then used to infer a
catalog of protein complexes from the interaction screen data. An
advantage of this model is that it places no prior constraints on the
number of complexes and automatically infers the number of significant
complexes from the data. Validation results using affinity
purification/mass spectrometry experimental data from yeast
RNA-processing complexes indicate that our method is capable of
partitioning the data in a biologically meaningful way.
-
In this paper, we propose a new basis
selection criterion for building sparse GP regression
models that provides promising gains in accuracy as well as
efficiency over previous methods.
Our algorithm is much faster than that of Smola and Bartlett,
while in generalization it greatly outperforms the
information gain approach proposed by Seeger et al., especially
in the quality of predictive distributions.
[ps]
[code]
-
In this paper, we propose a probabilistic kernel
approach to
preference learning based on Gaussian processes. A new likelihood
function is proposed to capture the preference relations in the
Bayesian framework. The generalized formulation is also applicable to
many multiclass problems. [pdf] [ps] [zip]
[code]
-
In this paper, we propose two new support vector
formulations
for ordinal regression, which optimize multiple thresholds to define
parallel discriminant hyperplanes for the ordinal scales. Both
approaches guarantee that the thresholds are properly ordered at the
optimal solution.
[pdf]
[ps]
[zip]
[code]
-
In this paper, we present a graphical model that extends
segmental semi-Markov
models (SSMM) to exploit multiple sequence alignment profiles for
protein structure
prediction. A novel parameterized model is proposed as the likelihood
function
for the SSMM. By incorporating the information from long range
interactions in
beta-sheets, this model is capable of carrying out inference on contact
maps.
[pdf]
[ps]
[zip]
[webserver]
-
W. Chu, Z. Ghahramani and D. L. Wild (2004) Protein secondary
structure prediction using sigmoid belief networks to parameterize
segmental semi-Markov models, European Symposium on
Artificial Neural Networks (ESANN-05):81-86
Refereed Workshop
-
J. Yang, Y. Chen, S. Wang, L. Li, C. Meng, M. Qiu, W. Chu
(2017) Practical
lessons of distributed deep learning, Workshop on
Principled Approaches to Deep Learning, at
ICML
With the advent of big data and big models, there is an
increasing need to train deep learning models in distributed mode.
Although open-source deep learning software such as TensorFlow
and MXNet supports training deep learning models in parallel, it is
still a challenging task for data scientists to implement scalable and
high-performance distributed deep learning algorithms. In this paper,
we share several practical lessons on optimizing the distributed deep
learning training process, including optimization strategies for
typical model architectures such as DNN and CNN. For DNN, we exploit its
computation-to-communication ratio to reduce the communication
overhead. For CNN, we find hybrid parallelism an effective way to
squeeze the potential of strong scaling. Experiments in off-the-shelf
deep learning software show that, with our optimization
strategies, we are able to achieve a 10x speed-up on AlexNet against the
standard distributed implementation.
-
User interactions with search engines provide many cues
that can be leveraged
to improve the relevance of search results through personalization. The
context
information (history of queries, clicked documents, etc.) provides
strong signals
about users’ search intent, which can be used to personalize the search
experience
and improve a web search engine. We demonstrate how to generate the
semantic
features from in-session contextual information with deep learning
models, and
incorporate these semantic features into the current ranking model to
re-rank the
results. We evaluate our approach using a large, real-world search log
from a
major commercial web search engine, and the experimental results show
our approach
can significantly improve the performance of the search engine.
Furthermore,
we also find that the domain-specific, click-based features can
effectively
decrease the unsatisfied clicks for the current ranking model to
improve the search
experience.
-
Contextual bandit algorithms have become popular tools in
online recommendation and advertising systems. Offline evaluation of
the effectiveness of new algorithms in these applications is critical
for protecting online user experiences but very challenging due to
their "partial-label" nature. The purpose of this paper is two-fold.
First, we review a recently proposed offline evaluation technique.
Different from simulator-based approaches, the method is completely
data-driven, is easy to adapt to different applications, and more
importantly, provides provably unbiased evaluations. We argue for the
wide use of this technique as standard practice when comparing bandit
algorithms in real-life problems. Second, as an application of this
technique, we compare and validate a number of new algorithms based on
generalized linear models. Experiments using real Yahoo! data suggest
substantial improvement over algorithms with linear models when the
rewards are binary. [pdf]
-
Unlabelled examples in supervised learning tasks can be
optimally
exploited using semi-supervised methods and active learning. We
focus on ranking learning from pairwise instance preference to
discuss these important extensions, semi-supervised learning and
active learning, in the probabilistic framework of Gaussian
processes.
[ps]
Thesis
-
In this thesis, we develop Bayesian support vector
machines
for regression and classification. This work can also be regarded as
support vector variants of Gaussian processes. [pdf]
[zip]
[code]
-
User trustworthiness, US Patent 9519682 B1
-
Determining user preference of items based on user
ratings and user features, US Patent 8301624 B2
-
Predicting item-item affinities based on item features by
regression, US Patent 8442929 B2
-
Enhanced matching through explore/exploit schemes, US
Patent 8244517 B2
-
Dynamic estimation of the popularity of web content, US
App. 20100241597 A1
-
Conjoint analysis with bilinear regression models for
segmented predictive content ranking, US App. 20100125585 A1
-
Methods and systems relating to ranking functions for
multiple domains, US App. 20110087673 A1
-
Contextual-bandit approach to personalized news article
recommendation, US App. 20120016642 A1
-
Feature-based method and system for cold-start
recommendation of online ads, US App. 20110112981 A1
-
Online active learning in user-generated content streams,
US App. 20130111005 A1
-
Personalized recommendations on dynamic content, US App.
20100211568 A1
-
National Innovation Talent, 国家千人, 2016
-
Best Paper Award, ACM WSDM, 2011
-
Super Star Team Award, Yahoo!, 2008
-
Honorable Mention Team, ACM KDD CUP, 2002