Clever new SEO scam hijacks Harvard emails and student blogs

Now I’m curious: what is used for this kind of work, and what would be good search terms?

FYI, two of the exploited platforms are hosted by vendors, not on Harvard-controlled servers. Canvas is used by thousands of higher ed institutions (though it’s possible they don’t all have the exploited “eportfolio” component).

A scan for URLs or URL-like structures is the first pass. It’s easily parsed and SEO spam is useless without it. Corpuses of spam-associated words (including obfuscated versions) are readily available for searches like this.

Most of this Harvard blog spam sounds like it’s lazy and cheap, so I’m betting at least 80% of it would be flagged for human review quickly the first time the search routine is run.
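For what it's worth, the first pass described above can be sketched in a few lines. This is only a rough illustration, not any vendor's scanner: the word list, the substitution table, and the `flag_for_review` name are all made up for the example, and a real deployment would load terms from a maintained corpus.

```python
import re

# Hypothetical, tiny word list for illustration only; a real list would
# come from a maintained spam corpus.
SPAM_TERMS = {"casino", "viagra", "payday loan"}

# Matches URLs and URL-like strings (http://..., https://..., www....).
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

# Undo a few common character substitutions (assumed, not exhaustive).
LEET = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}
)


def normalize(text):
    """Lowercase the text and reverse simple character obfuscation."""
    return text.lower().translate(LEET)


def flag_for_review(text):
    """Return the URLs and spam terms found, plus a flag for human review."""
    urls = URL_RE.findall(text)
    cleaned = normalize(text)
    terms = sorted(t for t in SPAM_TERMS if t in cleaned)
    # Flag only when both signals are present: a link and a spam term.
    return {"urls": urls, "terms": terms, "flag": bool(urls) and bool(terms)}


if __name__ == "__main__":
    print(flag_for_review("Best c4s1no bonus at http://example.com/win"))
    print(flag_for_review("My thesis on soil chemistry"))
```

Requiring both a link and a spam term keeps the false positives down for a first run; anything flagged still goes to a person, as suggested above.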

What do you mean by “a scan”? Setting up a server to crawl and index every reachable page on a harvard.edu domain (or only the pages on the “at-risk” domains)?

I googled “Corpuses of spam-associated words” but the results didn’t look very promising: multiple marketing-oriented articles about a few hundred words to avoid, and one article that linked to more substantial data, but the data was pretty old.

It doesn’t sound like you’re describing off-the-shelf software at all but the parts that would need to be put together to create the software oneself; not a soufflé but a recipe for a soufflé.

The problem seems to be with the student blogs, which I would expect operate under their own subdomains. Focusing on the database that generates those pages would address the stated problems.

It’s pretty easy for an industry professional to find a stand-alone corpus like this:

https://plg.uwaterloo.ca/~gvcormac/treccorpus/

But they’re usually incorporated into the off-the-shelf scanners too.

I’d be surprised if there isn’t a Canvas plugin/extension, but one addressing this specific problem might have to be custom built. It’s a shame those Harvard people don’t have the smarts to set one up, right?

Only one platform I saw in the article, canvas.harvard.edu/eportfolios/, is for students only, and it’s not a blog; it’s meant to be a persistent place where students can show work they’ve done in courses or share research they’ve conducted. The other platforms, like blogs.harvard.edu and scholar.harvard.edu, are used more by faculty and staff but are also available to students. Those are the “at-risk” domains I meant, all cited as examples in the article; scans could be restricted to locations like that, while closely managed domains like www.harvard.edu would not need to be scanned.
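To make “restricted to locations like that” concrete, here is a minimal sketch of a scope check a crawler could apply before fetching a page. The three locations are the ones named in the article; the `AT_RISK` list and the `in_scope` function are hypothetical names for this example only.

```python
from urllib.parse import urlsplit

# (host, path prefix) pairs for the locations named above.
# The list itself is an assumption for illustration, not a real config.
AT_RISK = [
    ("canvas.harvard.edu", "/eportfolios/"),
    ("blogs.harvard.edu", "/"),
    ("scholar.harvard.edu", "/"),
]


def in_scope(url):
    """Return True if the URL falls under one of the at-risk locations."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"  # treat a bare host as its root path
    return any(host == h and path.startswith(p) for h, p in AT_RISK)


if __name__ == "__main__":
    for url in (
        "https://canvas.harvard.edu/eportfolios/123",
        "https://canvas.harvard.edu/courses/1",
        "https://www.harvard.edu/about",
    ):
        print(url, in_scope(url))
```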

What industry are you referring to? InfoSec?

The linked example is from 2005 (with links to ones from '06 and '07). I’m sure there have been lots of newer scams and boner pills since then, not to mention horse dewormers and other fresh hells. It’s also a corpus of email messages labeled “spam” and “ham”; it doesn’t sound like the bad bits are called out.

off-the-shelf scanners

Again, please name one, not for email spam, for published web content.

I don’t think Canvas even has a framework for adding plugins/extensions like that. Canvas instances can add “LTIs” for including different kinds of content in a course, but they can’t “wrap” the native content widgets in a “bad content” filter. And the vendor, Instructure, hosts every Canvas instance so you can’t get your hands on the server-side “guts.”

It’s a shame those Harvard people don’t have the smarts to set one up, right?

Plenty of smarty-pants MBAs would say don’t build what you can buy, focus on your core competencies. They might also do a risk analysis and determine that the cost of developing an in-house tool (including opportunity costs) for each platform used is greater than the downsides of doing nothing (or doing nothing more than getting better at responding when problem content is pointed out).

It’s also become a persistent place for posting spam under the harvard.edu domain. If I were the manager in charge of that system I’d consider that a problem, but then I know the rough value of a brand like “Harvard” (whether or not I have the same degree of respect for it that the general public has is a different matter).

So I’d focus on the canvas subdomain or the eportfolios tables in the database (or insist that the vendor do it). The point is, go where the problems are (apparently this too is beyond the ken of the geniuses at Harvard).

The digital content industry in general. It’s a long list of positions: infosec, sysadmins, database administrators, content platform managers, moderators, marketing execs, brand managers, in-house counsel, etc. Really anyone who deals with public-facing Internet services at a business and technical management level has to be familiar enough to address spam issues or find a subcontractor or employee who will.

I forgot, corpuses can never be updated or provide a template for new ones, and updated indices certainly aren’t a thing. I think that’s something else you learn in Harvard CS classes.

And again, the bulk of the dreck being sold by the spammers isn’t much different than it was in 2005.

Most modern content platforms have them. Here are 12 for the most popular one, WordPress:

I’m not familiar with Canvas, but if there isn’t something similar for that platform it indicates a larger problem about the choice of platform, especially if it’s sold as SaaS. Switching platforms can be expensive, but in this case a lot of the migration work could be placed on those portfolio owners who cared to re-post their stuff on the new system (the abandoned stuff could be archived, perhaps with the help of archive.org).

MBAs, who are trained not to look past the next couple of fiscal quarters, aren’t the best people to start with for long-term arms-race issues like this. Identify the problem, explain its impact, find a couple of solutions, show the pros and cons in terms of effectiveness, and then make the cost-benefit case to the finance guy.

You can continue to claim that addressing this issue isn’t possible, but this is a known problem (at least if an institution takes protecting its brand seriously) with known and cost-effective solutions (at least if an institution employs competent people). It’s been this way for the 25-plus years I’ve been in the business.
