We recently wrapped up an annual review season at my previous role. Between self-review, manager review, calibration, and delivery, the process took close to three months! The annual review process can be a big challenge for everyone involved. Introspection can be difficult, and for many, it can be a stressful affair. Executed well, annual reviews are a fantastic way to celebrate successes, advocate for your team, connect the dots on important conversations, and get better alignment on how someone’s long-term growth goals fit into the overall business strategy.
This year, I collaborated closely with my managers in advance of the review period, as well as during our calibration period after self-reviews and manager reviews were written. I shared some advice to help their reviews generate the kinds of outcomes we want for our people, and to use these discussions to effect the right kind of change. After getting everything wrapped up, I wanted to memorialize some of that advice here to invite comment and offer some free guidance for newer engineering managers daunted by this task for the first time.
I once worked with an HR business partner who prefaced review season with this analogy, and it always stuck with me – probably because I’ve got Harney & Sons on speed dial. “Feedback is like tea – if you steep it for too long, it may become bitter, but if you don’t steep it for long enough, it has no flavor.” In other words, when giving feedback in any form – be it a peer review, a self-review, or a manager review – you should spend an appropriate amount of time forming your thoughts, and do so with appropriate tact and care. Avoid ruminating on the negatives, but make sure that constructive feedback is clear, unequivocal, and adequately balanced. Conversely, even a glowing review should have appropriate detail and consideration for areas of improvement.
What’s better than getting someone to seriously, honestly tell you how you’re doing? The answer is, getting that same feedback accompanied with the knowledge that the person giving it has your best interest at heart. So long as you go into the process with that in mind, I encourage you to be bold and candid in your discussions. Every single one of us has something we need specific and actionable feedback on to get better — otherwise, we’d be perfect! Don’t be afraid to say important things that you know can help make a person you think is great even greater.
An oft-repeated bit of advice is to use something called “sandwich feedback”, where you say something positive, then deliver the critical piece of advice, and then finish with another positive. The latest consensus on this approach is that it is in fact an anti-pattern when it comes to delivering the advice. The reasons for this range from the observation that it can be considered manipulative, to the fact that the number of positives outweighs the number of negatives, setting off a subliminal “two out of three ain’t bad” response from the recipient. In other words, don’t use sandwich feedback to soften the blow of what should be an important conversation. Hedging this way makes your feedback more ambiguous and less effective.
Ideally, the format of your annual review should be informed by the tool that you are using or the standards put in place by your HR business partner. When it comes to pacing, structure, and the like, stick to the customary script provided. Be direct, clear, and fair at the appropriate juncture – don’t hide the things you want to get better behind the things that are going fine.
Taking overly positive feedback to its logical extreme, there is no place for the “all fives” school of reviews (or whatever your max value may be). This is another example of the “tea” being too “weak”. If you’re feeling pressure to give these kinds of reviews to protect your reports, that’s a big culture red flag at your company. If you feel the need to give these kinds of reviews because delivering difficult feedback makes you uncomfortable, you should bear in mind that this is an important competency for a manager. You owe it to your reports to get comfortable delivering constructive feedback, and doing it in a way that keeps them feeling listened to and supported. Giving the highest possible ratings in every category and not providing thoughtful advice isn’t only a disservice to your reports, but it indicates a lack of engagement in the process to your HR business partner and your manager.
If you’re having a performance conversation about something for the first time during an annual review, that should give you pause for thought. This should not be the first time important feedback is provided. Annual reviews are a time to synthesize, summarize, and reflect on what trajectory you can extrapolate. In fact, for sticky performance issues, you should be able to reference whether the problem has improved, has persisted, or gotten worse. You should be able to say something along the lines of:
We initially discussed this problem back in July, and talked about how we could improve it. You agreed that you’d try to work on it by using strategies such as X, Y, and Z. I’ve commended you in past one-on-ones on implementing those strategies on occasion, such as in November when you did X instead of M. But we did have a discussion in December after I observed you did N, when I hoped you would have done Z instead. So we’re seeing growth in this area, but continued growth is still important here, because it still prevents you from being successful in ABC, which we agree is one of your goals.
The above is an example of how performance discussions are iterative, with the annual review cycle giving us the opportunity to neatly contextualize those trends, successes, and challenges over a broader period of time.
Along this theme, if you’re tying promotions to the annual review cycle, don’t keep reports with a vested interest in their development in the dark on where they are tracking. Recurring conversations should note the current deltas between how the individual is performing and what the expectations of their desired role look like. Make sure that you are giving clear, actionable feedback on how to more regularly exhibit the expectations of that role. For in-time interactions, don’t be cryptic – if there’s an opportunity to step up, make the reasoning behind your request clear. Remember that part of your responsibility as a manager is to help intentionally develop your reports along their desired career path. In short, your reports should have a reasonable level of certainty about whether they will be awarded that promotion, based on how they have performed against the concrete milestones you’ve set for them.
Would you rather have a nice, organic Japanese green tea brewed at a perfect 180º with 12 ounces of cold, filtered water, or a generic green tea bag brewed with water you boiled from the tap? Even if they tasted identical to you, science suggests that you’d like the first more, largely because of the detail that’s been provided to you.
Detail makes things more memorable, and more compelling. Data is the ultimate form of detail. Data doesn’t just include being able to reference previous discussions, as mentioned in the previous section.
While you should be using a variety of indicators to measure your team’s performance throughout the entire year, annual reviews are a fantastic time to sit down and zoom out on those metrics. Think about it like long-term investment. It can be difficult to watch your money grow on a day-to-day basis in an index fund, but taking a look at it over a year can tell a valuable story, especially compared to other index funds you may have tracked over the same period of time.
For individuals at the same level, you should be able to identify clear trends on throughput based on key metrics that can be gleaned from version control, documentation, and issue tracking. Ideally, these will correlate to the expectations of the job description which you hopefully have on file. We can then use that information to address outliers, giving clear metrics they need to improve on. It might not be a bad thing that that data point is an outlier, but outliers are always valuable conversation starters, and correlating outliers often tell an interesting story.
Here are some suggestions for how to use data as a storytelling device instead of a cudgel. First, establish the metric of interest that the person has exhibited, and then share the delta observationally. Explain what that indicates, and then impart why improving the metric is valuable. For example:
Other team leads are performing N code reviews per month for a team with similar pull request volume, whereas you’re averaging M. This is important, because the folks on the team you’re leading deserve your feedback and guidance on the work they’re doing. Engaging in code review is also a great way to stay on top of what everyone is doing, and a great way to push on quality and ensure your team is meeting our overall architectural goals. What are some ways we could get you more involved in your team’s code reviews?
When using data to drive performance discussions, I encourage you to thread the needle on adequate context, and to be as objective as possible when laying out the details. Hearing constructive criticism is hard in general, and it can be very uncomfortable for the individual on the receiving end of this concrete evidence. Because of this, when using data to underscore room for growth, you should make doubly sure that this isn’t the first time that you’ve had the discussion, and that you’ve used gentler strategies in the past. In this case, I would hope that you’ve got notes from previous one-on-ones where you’ve written down that you asked the lead in question to get more involved in the team’s code reviews.
I like to conclude my reviews with where I hope to see the person in a year’s time, and what support I hope to give to get them there. Since so much of an annual review is rehashing what happened in the last 12 months, I find that it’s a little uplifting – even in the face of some difficult conversations – to reconfirm your belief in the potential of the person you’re delivering the review to. So take the time at the end to build folks up, get them excited about what’s next, and make sure they know they are an important part of your vision for the team’s success. This hopefully won’t just help to inspire, but to confirm the important sense of belonging that every manager should foster in their team.
Annual reviews are a fact of life in most companies. I like to face such inevitabilities head-on, with lots of preparation, so that I can bring my best self to the occasion. There are plenty of other approaches to take, but I hope that there’s room for a few of my recommendations in those approaches. What is most important is to strike that delicate balance between being fact-based and candid, supporting growth, and leaving each discussion with the recipient having a clear understanding of how to move forward – ideally feeling good about their future. Executing on these successfully is a responsibility and a privilege that can make a huge impact on company morale and culture, so seize the opportunity to be a positive force that helps your team get ever better.
Whenever possible, you should be responsible for writing the job description for the role in question. When you aren’t, you should work closely with the original author of the JD, and make sure to provide any feedback that might be useful in finding the right candidate. As the hiring manager, you should be ready with a clear sense of what 30, 60, and 90-day goals would look like for the role you’re looking to hire.
Consider looking at the competencies of comparable engineers with the role you’re searching for. If there’s a big delta in skill or background based on what your description calls out, you may want to consider further refining your criteria, or reconsidering how the role is titled or leveled.
As hiring manager, you should be the first interview after the phone screen. Since you’re the person responsible for fostering team culture, you should have the strongest barometer for whether a candidate’s values align with your team’s and your company’s.
The candidate deserves to get to know their manager as early in the process as possible to confirm it would be the right relationship. Since your team’s time comes at a premium, if you or the candidate determine it’s not a good fit after the first round, you’re also preserving the limited bandwidth of your engineers.
If you’re a very soft yes or you’re on the fence without any strong red flags, this is a great time to defer to your team, and still a solid use of their cycles.
Work with your HR business partner to introduce a consistent process for hiring. Having these processes spelled out is important for training others to interview (a valuable skill for engineers interested in growing either into a managerial role or into a senior/staff/principal IC role), and also for setting clear standards for what a successful candidate looks like.
Have these criteria clearly spelled out across a handful of well-rounded interview phases, tailored to assess technical, organizational, and interpersonal competencies. Knowing what you’re evaluating – and how – allows you to evaluate individuals consistently across interviewers. A consistent approach to interviewing helps minimize both conscious and unconscious bias. This goes even more so if you develop these hiring rounds collaboratively.
Don’t just include your peers in engineering. These standards should be established as part of your process for defining any new role, so it’s important to get HR buy-in when building out the components of this process.
In testing, we evaluate the happy path as well as the most common degenerate cases. It should be no different when it comes to interviewing. You should work with your counterparts in HR to make sure you’ve identified the most effective way to give fast feedback to candidates you don’t intend to move forward with – even if it’s just a form email.
The impression any and all interviewees receive when going through your process can impact your company’s reputation, whether or not the candidate receives an offer. Because of this, you should work with your hiring team to make sure that all candidates feel listened to and respected. Make sure to express gratitude for every candidate’s time, and for the opportunity to get to know them, even if they clearly won’t be a fit for the role. Take as many candidates as you can through the full interview they are scheduled for. Don’t cut an interview short just because you’ve run into a skill mismatch or some other clear deal breaker – use it as an opportunity to get to know the person, and try to understand whether they may be a fit in another role. At the very least, use this as a teaching opportunity so that the candidate leaves the interview having learned something valuable for a future interview.
Early in my career, and even later on, I’ve had some very valuable interviews where – despite not receiving an offer – I learned something important that helped me be more effective in my current role, and ultimately more successful in future interviews.
Anyone in Customer Success can tell you that a deal doesn’t stop when the ink dries. This is just as applicable in hiring. I’ve seen great candidates rescind their offer acceptance because of poor follow-through, or even just room for second thoughts to creep in. Once you have a start date squared away, you should begin a countdown. Around a week out, send an email to the candidate with your team CC’ed, letting them know how excited you are for them to join the team. Encourage your team to reach out sharing their enthusiasm.
Make sure that you have an onboarding buddy assigned who can act as a peer point-of-contact to show this person the ropes, and give them the opportunity to build a positive relationship with this teammate early on.
Build out a consistent 30/60/90-day plan that includes links to onboarding docs, key contacts within engineering and outside of engineering when necessary, and projects of increasing complexity. Having these spelled out will help the candidate establish a foundation and start delivering compounding benefits in their areas of responsibility.
Meet with your new hire frequently in the first 90 days to make sure that they are on track for your 30/60/90 plan. Use that process not only to make sure that they are acclimatizing and acculturating properly, but to receive feedback on your processes, and the state of your team and organization as a whole. I was once told that there is nothing more valuable than “fresh pain” when it comes to working through new processes that everyone on your team may have just adapted to and forgotten about. Make sure to collect that fresh pain, empathize with it, and consider possible solutions.
You will find that if you follow these approaches, you’ll be able to hire the kind of people who align with your goals and your team’s values. With continued investment in their growth and success, you will have developed an individual who won’t just help you meet your business goals, but will support you further down the line. Onboarded properly, they too will develop techniques that use empathy, preparation, and communication to continue to build an organization of the kind of people who you would be proud to work with.
In other words, imagine every candidate you speak with interviewing a future candidate to join the team you hope to build in six to twelve months. That philosophy will help you succeed in hiring for growth in more ways than one.
Back in February, I made a very exciting move by joining Vultr as Senior Director of Engineering. Vultr is an independent cloud provider that has been in the industry for roughly 20 years. Over the last several years, this has become a very compelling space. Digital transformation has brought more and more businesses into the cloud, and many more businesses have started their lives as cloud native over the last decade. With the growing sentiment of cloud agnosticism, the fear of vendor lock-in, and the availability of highly configurable approaches to deploying infrastructure as code, there’s never been a better opportunity to seize a portion of a growing and exciting market.
Not hitching your wagon to the “big three” isn’t just a matter of cost, though price arbitrage will always be a compelling reason. Diversifying your providers and data centers is the most effective way to avoid service-impacting outages. It helps you reduce latency by keeping the edge closer to customers. It also helps maturing companies meet a growing, location-specific laundry list of regulatory requirements related to data residency. Vultr has dozens of points of presence, and growing – thanks to our excellent deployment operations team and the sysadmins who continue to build and grow our data centers.
It’s easier than ever to take advantage of Vultr’s platform thanks to our broad adoption of the industry standards that have evolved over the last several years. We offer many things that most cloud developers already use. I thought it’d be fun to create a reference implementation that took advantage of our available tooling, showing how truly easy it is to build modern web applications off of Vultr’s offerings.
To kick the tires on everything, I created a simple website called enough.recipes. Why call it that? I noticed countless recipe websites on Hacker News, and while they were all interesting proofs of concept, most of them just felt like another To-Do List App. The challenge to me was usually that they expected people to add recipes themselves – as if there weren’t already enough recipes on the web!
I used my background as a search engineer at Wikia (now known as Fandom) to scrape the Recipes Wiki and create a search engine to make its over 40,000 pages discoverable from a simple search. Back in my day, the company embraced open source a little bit more and made it far easier to extract content from their site. We even contributed to open source tools to make it easier to interact with data from a given wiki. That seems to have changed over the last decade, but no biggie – I could still use the core MediaWiki API to enumerate the URLs for all pages, and then simply extract the relevant portion of the DOM from a simple HTTP request.
On a daily basis (using a K8s cron resource), I iterate over the page list from the MediaWiki API, publishing each URL to my message bus. A consumer then performs a request against each URL and stores the appropriate data from the DOM in both a database and a search engine.
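As a rough sketch of the enumeration step, here’s how you might walk every page via the core MediaWiki API’s `allpages` list. The endpoint URL and helper names are my illustrative assumptions, not the project’s actual code; in the real pipeline each URL would be published to the message bus rather than printed.

```python
def page_url(title, base="https://recipes.fandom.com/wiki/"):
    """Build a canonical wiki page URL from an API page title."""
    return base + title.replace(" ", "_")

def iter_all_pages(session, api="https://recipes.fandom.com/api.php"):
    """Yield every page title, following the API's continuation tokens."""
    params = {"action": "query", "list": "allpages",
              "aplimit": "max", "format": "json"}
    while True:
        data = session.get(api, params=params).json()
        for page in data["query"]["allpages"]:
            yield page["title"]
        # The API signals more results with a "continue" block.
        if "continue" not in data:
            break
        params.update(data["continue"])

if __name__ == "__main__":
    import requests  # assumes `pip install requests`

    with requests.Session() as s:
        for title in iter_all_pages(s):
            print(page_url(title))  # the real pipeline publishes to the bus
```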
So what are all the bits and pieces that I used to create this site?
I built a containerized application using Django, with all the pieces working in a simple Docker Compose definition before proceeding to get it deployed to a production environment.
You can easily deploy containerized applications to a Kubernetes cluster, and Vultr’s Kubernetes Engine is a major achievement that we’ve GA’ed this year. Having used a variety of cloud-based Kubernetes offerings, I’m quite pleased with what the team has delivered. Being able to back our clusters with a variety of customizable instance types gave me a great deal of flexibility around both cost and performance.
I was able to create the Kubernetes cluster using Vultr’s Terraform provider, which was very seamless to use. It’s built on top of our API, which is very nicely documented since it conforms with the OpenAPI spec.
The Docker image for the Django app served as the basis for the Kubernetes deployment and service definitions that acted as a backend to a simple nginx image, which was used to provide SSL termination and allow scaling across one or more deployments of Gunicorn + WSGI.
Vultr also provides a Load Balancer that can be deployed as a resource within a Kubernetes deployment. This automatically exposed a static IP for public ingress, and annotations provided the ability to properly handle TLS and port-forwarding.
I was even able to use Vultr’s DNS by pointing the domain I purchased to their nameservers, and then setting the LB’s external IP as the A record for the domain.
Since this was a Django app, I would need to get a database set up. Vultr has recently rolled out MySQL as a beta Database as a Service offering, and it was super exciting to use this project as an opportunity to preview its functionality.

My favorite thing about how our DBaaS works is that the UI provides the ability to “click to copy” the database URL (i.e. mysql://user:pass@some-ip:3306/your_db). The database URL has become the lingua franca of many ORMs, and I’ve become quite accustomed to composing this string myself to work with dj-database-url. The convenience was fantastic. Remember that this URL contains secrets, and so should be handled as sensitive information.
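To make the convenience concrete, here’s a small sketch of what that URL decomposes into and how it would feed Django. The `summarize_db_url` helper is my own illustrative stdlib code, and the commented `settings.py` lines assume the dj-database-url package is installed; the credentials shown are placeholders.

```python
import os
from urllib.parse import urlsplit

def summarize_db_url(url):
    """Break a database URL into the parts an ORM cares about."""
    parts = urlsplit(url)
    return {"scheme": parts.scheme, "host": parts.hostname,
            "port": parts.port, "name": parts.path.lstrip("/")}

# In settings.py, the copied URL lands in an environment variable and
# dj-database-url does the composing for you:
#
#   import dj_database_url
#   DATABASES = {"default": dj_database_url.config(
#       default=os.environ["DATABASE_URL"], conn_max_age=600)}
```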
Since I was doing some basic styling with Tailwind, I needed a place to store my static assets. I was able to very easily use the django-storages S3 backend with Vultr’s S3-Compatible Object Storage by simply plugging in the right credentials and configurations.
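For reference, the django-storages side of that looks roughly like the settings sketch below. The bucket name and region are placeholders I made up; the `AWS_*` setting names come from django-storages’ S3 backend, which works against any S3-compatible endpoint.

```python
import os

# Credentials come from the Vultr Object Storage panel; never hard-code them.
AWS_ACCESS_KEY_ID = os.environ.get("VULTR_OBJECT_STORAGE_KEY", "")
AWS_SECRET_ACCESS_KEY = os.environ.get("VULTR_OBJECT_STORAGE_SECRET", "")

AWS_STORAGE_BUCKET_NAME = "enough-recipes-static"      # placeholder bucket
AWS_S3_ENDPOINT_URL = "https://ewr1.vultrobjects.com"  # assumed region endpoint
AWS_DEFAULT_ACL = "public-read"

# Route Django's collectstatic output through the S3 backend:
STATICFILES_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
STATIC_URL = f"{AWS_S3_ENDPOINT_URL}/{AWS_STORAGE_BUCKET_NAME}/"
```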
Some of my deployments needed block storage, as did the Helm charts I used for both Kafka (my message queue) and Elasticsearch. I was able to use our Scalable Block Storage product to support these use cases. It’s worth noting that this required specifying both the storage class name and the desired size; without both configured for each deployment the Helm chart maintains, those instances won’t run successfully in your cluster. Here’s an example of the helm command used to get Kafka running:
helm install -n enough-recipes \
  broker bitnami/kafka \
  --set=persistence.storageClass=vultr-block-storage \
  --set=persistence.size=10Gi \
  --set=zookeeper.persistence.storageClass=vultr-block-storage \
  --set=zookeeper.persistence.size=10Gi
I’ve built plenty of things with EKS that involved a lot of banging my head against the wall trying to figure out the nuances of IAM roles and various permissions settings. Vultr’s default behavior eschews many of the arcane things that make the bigger cloud providers so hard to work with. That’s one of the reasons we call it the Developer-First Platform. By enabling fast prototyping at a competitive price, we should be on the tip of your tongue when developing MVPs or building contract applications for cost-conscious clients.
You can view Enough Recipes on GitHub for all of the application code and definitions.
It was a lot of fun to build this site and understand how all the great pieces of the Vultr platform fit together. I talked about this with our developer advocate, Walt Ribeiro, for Vultr’s YouTube channel. You can check it out here:
So what’s next for Vultr? Well, without giving too much away, we just had a very exciting H2 planning session with lots of great takeaways. You’ll be able to do a lot more on our platform with many more of the conveniences you may have come to expect from the bigger guys.
Did I mention we’re hiring? Solving interesting problems on a daily basis is just par for the course. We occupy a space that’s not going away any time soon, and will only gain more attention as cost-conscious companies revisit their cloud costs during the upcoming business cycle. If you’re interested, drop me a line or check out our jobs page!
When the weather gets cold, I’ve tended to stay in that frame of mind. So I decided to make a fun, little website that pays tribute to these songs by exploring what characterizes them based on their title. Thanks to the folks who originated the term with the Yacht Rock web series, and then the Beyond Yacht Rock podcast, getting the data for this project was a breeze.
Yachtkov is a Yacht Rock song title generator that uses a Markov model trained on song titles rated greater than 50 on the Yachtski Scale.
Using Beautiful Soup, I simply pulled the data from the site and fed it into Markovify, which is a high-level tool for working with Markov chain text models in Python. Something I like a lot about this language generation tool is that it has configuration options for trying to create content that is sufficiently unlike the source data. With short texts like song titles, this helps avoid duplicating the source content – meaning you shouldn’t see generated titles that are identical to titles in the source data.
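A minimal sketch of that generation step might look like the following. It assumes `pip install markovify`; the input file name is a placeholder for the scraped Yachtski titles, and `clean_titles` is my own illustrative helper.

```python
def clean_titles(lines):
    """Strip whitespace and drop blank lines from scraped titles."""
    return [line.strip() for line in lines if line.strip()]

if __name__ == "__main__":
    import markovify  # assumes `pip install markovify`

    with open("yachtski_titles.txt", encoding="utf-8") as f:  # placeholder
        titles = clean_titles(f)

    # NewlineText treats each line as an independent "sentence" (a title).
    model = markovify.NewlineText("\n".join(titles), state_size=1)

    # max_overlap_ratio is the knob that keeps output sufficiently
    # unlike the source data, so generated titles aren't verbatim copies.
    print(model.make_sentence(tries=100, max_overlap_ratio=0.5))
```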
This was also an opportunity to work with React Hooks. I used the state and effect hooks to handle asynchronous requests to a simple Flask backend.
The highly throwback look and feel wasn’t just a function of being easy to code; I think it pays solid tribute to what these kinds of songs evoke from the moniker of Yacht Rock alone.
I would probably be remiss if I didn’t take a minute to acknowledge that I’ve “been the fool before” with these sorts of projects, albeit in a different genre of music. You may remember my PyCon 2013 presentation on building a Media Takeout Headline Generator, where I discussed probabilistic language modeling within the context of some of my favorite hip hop artists at the time.
While it should be noted that my car’s still blasting Hip Hop Nation most of the time (don’t get me started on Yacht Rock Radio), a bunch of things have changed since then. And I don’t just mean I’m a little older and wiser.
The MTO Headline Generator generated a bunch of text with Python, and then served up the pre-processed text with PHP. This project uses a Flask backend that builds the language model once, stores it in memory, and serves it just as fast as retrieving a random value from a large flat file.
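A hedged sketch of that Flask pattern follows: build the model once at startup, hold it in memory, and serve generated titles from a JSON endpoint. The route, file name, and fallback title are all illustrative; `make_sentence` can return None when generation fails, which is why the helper falls back.

```python
def pick_title(make_sentence, fallback="Sailing Tonight"):
    """Return a generated title, falling back when generation fails."""
    return make_sentence() or fallback

if __name__ == "__main__":
    from flask import Flask, jsonify  # assumes Flask is installed
    import markovify

    with open("yachtski_titles.txt", encoding="utf-8") as f:  # placeholder
        model = markovify.NewlineText(f.read())  # built exactly once

    app = Flask(__name__)

    @app.route("/api/title")
    def title():
        # The model lives in memory, so each request is effectively instant.
        return jsonify({"title": pick_title(model.make_sentence)})

    app.run()
```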
The PHP site was, of course, intentionally modeled after Media Takeout’s styling at the time, and so used entirely server-side rendering (if memory serves me correctly). This Flask app serves an extremely simple React App, and makes async requests to retrieve data from the same app on a separate endpoint.
And of course, from a computational linguistics standpoint, Markovify’s approach produces far better output than the somewhat more naive approach I used in the past.
I think there’s something special in creating art from your code, especially when you can do it in a way that lays bare the endemic characteristics of another form of art. It lets you engage as both a craftsman and observer, providing commentary and critique simultaneously.
I encourage you to imagine what songs like Show Me The Night, You Made a Fool Believes, Caught Up in the Business, and Tell Me What You Won’t Do For Love sound like. Rest assured, they’ll have that Doobie bounce, smooth production, a little something extra melodically, and probably a Porcaro or two in the personnel.
One of the components you get is an RGB LCD. Once you’ve plugged it into the Grove and have everything connected to the Pi, you can hop into a terminal on your Pi and use the libraries that the Grove comes with to control the screen.
The libraries are impressively easy to use. Sending text to the screen is as simple as:
import grove_rgb_lcd as screen
screen.setRGB(255, 0, 0) # all red, for instance
screen.setText("I'm going to show up on the screen!")
So what are some of the things you can do with a screen like this? Grove has some examples related to its other sensors, such as temperature and humidity, which is cool. I’m definitely interested in integrating my Sonos with the screen as a “now playing” display. But I wanted to find a more interesting way to play with the three dimensions of color.
This is where sentiment analysis comes in. There are a handful of different ways to analyze sentiment. For instance, Stanford CoreNLP’s deep learning model charts sentiment on a one to five score. Last time I checked, it was among the best in class. But there’s a different kind of sentiment analysis that evaluates text along three different axes of intensity: positive language, neutral language, and negative language. Each value will range from zero to one. This is great, because an RGB screen has three different values we can play with, a minimum of zero to a maximum of 255. I used CJ Hutto’s Vader Sentiment Analysis Project, a reasonably recent and easy-to-use project geared towards social media text. Another perk of this project is that it has been incorporated into NLTK, which tends to be my go-to NLP toolkit for hobbyist stuff like this.
Using NLTK, I can get the sentiment data for a bit of text like this:
# assuming you already have the VADER lexicon downloaded
# otherwise, you will need to call nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
analyzer.polarity_scores("This is the text I want to analyze")
# will return a dict with intensity of 0.0 to 1.0 for:
# 'neg' (negative)
# 'pos' (positive)
# 'neu' (neutral)
Now I need some kind of textual source to analyze, of course. The Python Twitter Client is perfect for this kind of a project. In particular, I found it very easy to work through the issues related to OAuth2 when attempting to initiate an authenticated session.
Assuming you have handled authentication correctly, you can use a TwitterStream instance to sample random tweets or filter against a specific topic. For my project, I’m filtering for the term “Atlanta”, the city I just moved to. Assuming you have an OAuth object correctly configured, that’s as easy as:
for tweet in TwitterStream(auth=oauth_object).statuses.filter(track="atlanta"):
    print(tweet['text'])  # or anything else you want to do with a tweet!
I’ve posted some code called the Grove Sentiment Scanner that shows exactly how we combine these parts all together. We search for the term “Atlanta” in the Twitter stream, and for each tweet, we can parse the sentiment. We take the zero to one intensity values for each axis, and transpose them to zero to 255 values corresponding to red (negative), green (positive), or blue (neutral).
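The transposition itself can be sketched as a small pure function; the helper name below is mine, not the project’s, and the `__main__` wiring assumes NLTK’s VADER lexicon and the GrovePi library are installed on the Pi.

```python
def scores_to_rgb(scores):
    """Transpose VADER's 0.0-1.0 intensities to 0-255 channel values.

    red  <- 'neg' (negative), green <- 'pos' (positive),
    blue <- 'neu' (neutral).
    """
    return (
        int(round(scores["neg"] * 255)),
        int(round(scores["pos"] * 255)),
        int(round(scores["neu"] * 255)),
    )

if __name__ == "__main__":
    # Hypothetical wiring on the Pi itself (untested here):
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    import grove_rgb_lcd as screen

    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores("I love this beautiful city!")
    screen.setRGB(*scores_to_rgb(scores))
```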
I’m hesitant to say there’s a ton of intrinsic utility in a project like this. But it’s a lot of fun, and I think there’s a bit of art involved, here.
There is an internally consistent sense of synesthesia here between what we see and what a bit of text is intended to make us feel. An approach like this has the capacity to encourage empathy in unexpected ways, as we adapt the visual components of our mind to reason over emotional and verbally symbolic components at the same time. We can use the colors that we see to understand the underlying emotion of the text before we finish processing a sentence. This could serve as a tool for disambiguating the emotional content of a text for individuals who have higher-than-average difficulty divining emotion from text.
We are grounded in some very basic sense of semiotics by mapping positivity to green and negativity to red. The stoplight most immediately comes to mind as an artifact we use every day that takes advantage of this opposition of color and emotional intent. Blue, included largely because of how RGB generates its range of colors, is in many regards incidental. But we can look at how blue is used and see that it does often communicate neutrality quite effectively. This is the reason, for instance, why blue is so often used as the color scheme for retail locations.
The SolrCloud deployment consists of dozens of collections, each configured with two shards and a replication factor of two. This proved quite lucky for us recently: when one of our nodes went down, every collection remained available, and we did not experience a service outage of any kind.
One of our nodes experienced a disk failure that resulted in total data loss. This was denormalized data, but re-analysis is time-consuming, and we already had the data replicated to other nodes that were working just fine. For the most part, we were interested in restoring the cluster to its former health, with a fresh instance of Solr taking the place of the downed node. The replacement would run an identical image, down to the same hostname and IP.
I was somewhat surprised to find that there wasn't really an off-the-shelf solution to this problem, and not much came up when Googling. I searched for things like "solrcloud restore lost replicas", "solrcloud recover nodes", and there were few actionable results.
I was able to use Solrcloudpy, a client library for Python, to identify downed nodes using the Collections API, send the cluster instructions to manually remove the orphaned replica listings, and then re-create those replicas in the same location. You can see how I did it in this Gist.
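Without reproducing the Gist, the shape of the repair looks roughly like this. DELETEREPLICA and ADDREPLICA are real Collections API actions, but the host, collection, shard, and node names below are placeholders, and the URL-building helpers are my own:

```python
from urllib.parse import urlencode

SOLR_ADMIN = "http://solr1:8983/solr/admin/collections"  # placeholder host

def delete_replica_url(collection, shard, replica):
    # Remove the orphaned replica entry that still points at the dead node.
    params = {"action": "DELETEREPLICA", "collection": collection,
              "shard": shard, "replica": replica, "wt": "json"}
    return SOLR_ADMIN + "?" + urlencode(params)

def add_replica_url(collection, shard, node):
    # Re-create the replica on the rebuilt node (same hostname as before).
    params = {"action": "ADDREPLICA", "collection": collection,
              "shard": shard, "node": node, "wt": "json"}
    return SOLR_ADMIN + "?" + urlencode(params)

# Issue each URL with any HTTP client (requests.get, curl, ...), once per
# orphaned replica across the affected collections.
```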
This kind of robustness and fault tolerance is what makes SolrCloud one of my favorite distributed data stores to work with. Even where a full-featured off-the-shelf client doesn't exist, its APIs are easy to interact with and to reason about. This is largely thanks to the many talented maintainers and community members who continue to use and support it.
In many cases, I have simply used Python's requests library for simple JSON interactions with Solr, but sometimes it's nice to have better data modeling in your code. If you're a Python developer looking to get into SolrCloud, feel free to check out Solrcloudpy, an intuitive client for working with multi-server, multi-tenant search deployments.
Keeping vital libraries like this up to date is important. As Solr evolves, we want to make sure you can continue to use Solrcloudpy with a minimum of disruption. There are other Solr libraries, such as SolrClient, but that one supports Python 3 only. If you have a high-scale infrastructure primarily in Python, you may be using Celery, and a popular backend for Celery is RabbitMQ. The Celery community has no plans to add Python 3 support to their RabbitMQ driver, which keeps many users committed to 2.7 unless they change backends, a move with significant ramifications for operationalization, monitoring, and other production acceptance concerns. In other words, we've got a real hairy yak on our hands.
For this reason, I reached out to Didier Deshommes, the former maintainer of Solrcloudpy. His work had been moving away from Solr, and he was looking for new maintainers. I worked with him to build out the Solrcloudpy organization on GitHub, and we collaborated on delivering version 1.8.
Version 1.8 includes a number of new features.
Now as one of the library's primary maintainers, I plan on doing some work in the future to add functionality in a backwards-compatible and well-tested manner. We welcome contributors and look forward to everyone's involvement in the future of this project. Thanks again to Didier Deshommes for the great work that he's done with Solrcloudpy, and to all its other former contributors.
I had just completed a stint focusing on improving Wikia's search functionality. What I really wanted to focus on was finding ways to deliver interesting product features directly from the content provided by our user communities. This would allow me to apply my talents in machine learning and natural language processing to put a greater degree of intelligence behind the MediaWiki platform.
I led a very small team, and we built a lot of fun technology. Little of it made it to production, but we had some very interesting results and learned a great deal from what did. During this time, Grant Ingersoll, one of the authors of Taming Text, reached out to us to talk about how we were using Solr and other natural language processing technologies as part of the data processing pipeline we were building. This prompted us to write a chapter for a possible second edition of Taming Text. We got great feedback from Grant on the chapter, but after about two years, the project had stalled a bit.
With Grant's approval, I'm making this chapter available here for free for the first time: A HighScale Deep Learning Pipeline for Identifying Similarities in Online Communities.
It's always fun to look back at projects like this and ask what I'd do differently. I probably should have used a Storm topology for the CoreNLP parsing component. That would have made things a bit more reliable (about as reliable as Storm, though, I guess). I definitely think a lot of the threaded functions should have probably gone into Celery or some other worker processing platform. The amount we got done with just boto, the current literature, and tens of millions of English articles was tons of fun, though.
I'd also like to give a quick thank you to Murad Salahi, whose article Teaching a Computer How To Read provided much inspiration for what we ultimately accomplished at Wikia. Not to mention, thanks to my co-authors John Kuner and Tristan Chong for providing valuable contributions to the paper and the work it outlined.
If you want to take a deeper dive, most of the code for this project is still available online as part of Wikia's Data Science Toolkit project.
If you're working with any sufficiently large set of data, where the number of distinct values for a particular field runs into the thousands or hundreds of thousands, retrieving all of those values can be hard.
The traditional approach is to serially iterate or paginate using a fixed limit and a growing offset. This becomes a problem at any reasonable scale: serialization stacks the latency of all of your queries end to end, giving an unnecessarily poor worst-case running time.
There are much smarter ways to do things these days, and each of them boils down to the use of asynchronous processing. It's easy to asynchronously process a paginated dataset and maintain its order, assuming you know the size of the data first.
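That idea can be sketched with a thread pool. Here `fetch_all` and `fetch_page` are names of my own invention, with `fetch_page` standing in for whatever function issues the real paginated query; `map` preserves the order of the submitted offsets, so the assembled result stays in order:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(fetch_page, total, limit=100):
    """Fetch every page concurrently; fetch_page(offset, limit) -> list."""
    offsets = range(0, total, limit)
    with ThreadPoolExecutor(max_workers=8) as pool:
        pages = pool.map(lambda off: fetch_page(off, limit), offsets)
    return [row for page in pages for row in page]

# Demonstrated against a fake in-memory source of 250 rows:
data = list(range(250))
rows = fetch_all(lambda off, lim: data[off:off + lim], total=len(data))
# rows == data, assembled from three concurrent page fetches
```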
The problem then becomes determining, client-side, the number of distinct values within a document set without first iterating over each value (or asking Solr for the max, which is complicated and often difficult to implement client-side). We will need to make a series of educated guesses that should work faster than iterating serially.
I've created the following quick-and-dirty heuristic to get the maximum number of distinct values for a resultset where you can easily determine the total number of documents:
Start with the total number of documents in the result set as your assumed number of distinct values (we'll call this the expected max). Attempt to retrieve facet values from Solr at an offset equal to the expected max, with a sufficiently large limit. If you don't find any values, cut the expected max in half and start over with that value as the expected max. If you do find values, and the number of values is as large as the limit requested, then multiply your expected max by 1.15 and start over.
Once you have retrieved a result set where the number of values returned is less than the limit provided, you can add the number of values returned to your expected max, and you now know the number of distinct values. You should manage to do this with far fewer requests than serial iteration would take, too.
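Here's the heuristic as a sketch, with names of my own invention; `fetch_facet_values` stands in for a faceting query against Solr that returns the facet values found at a given offset, up to `limit` of them:

```python
def count_distinct(fetch_facet_values, total_docs, limit=100):
    """Guess-and-check the number of distinct facet values."""
    expected = total_docs  # start by assuming every document is distinct
    while True:
        values = fetch_facet_values(offset=expected, limit=limit)
        if len(values) == limit:
            # Guess is too low: at least `expected + limit` values exist.
            # Grow by ~15%, plus one to guarantee progress at small sizes.
            expected = int(expected * 1.15) + 1
        elif values:
            # Landed within `limit` of the true count: we're done.
            return expected + len(values)
        elif expected == 0:
            return 0  # the field has no values at all
        else:
            # Guess is past the end: halve it and try again.
            expected //= 2

# Against a fake field with 237 distinct values in a 1000-document set:
facets = list(range(237))
n = count_distinct(lambda offset, limit: facets[offset:offset + limit],
                   total_docs=1000)
# n == 237
```

Note that this is a heuristic, not a guarantee: with a small limit relative to the growth factor, the guesses can overshoot and backtrack, but in practice it converges in a handful of requests.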
Once you have the max, you can retrieve all distinct values asynchronously client-side, because you can now derive all the arguments required to invoke or enqueue requests against each result set as you retrieve them from Solr.
Today I'm celebrating two events: my 30th birthday (woop woop), and the 1.0 release of the Python CoreNLP XML Library.
corenlp_xml is a Python library that provides a data model on top of Stanford CoreNLP's XML output. You can install it using pip. corenlp_xml uses lxml and lazy-loading techniques for high-performance querying and data access capabilities. It uses NLTK's tree parsing capabilities to provide additional interactions against the XML's S-expression sentence parse node.
I've used corenlp_xml to solve a number of problems at high scale.
More information is available on corenlp_xml's ReadTheDocs page. This library should be a great help to anyone who wants the power and accuracy of CoreNLP's parsing output, but is interested in using Python's fast numerical computing affordances for further analytics, data science, or machine learning.
If you think the library would be useful for you, please feel free to contribute back to the project, or vote for my upcoming Lucene/Solr Revolution talk.