Get entries played more than a year ago

Hey all,

I’m trying to get a list of entries that have not been played in more than a year. I’m poking around in the kaltura_dw database, specifically the dwh_fact_events and dwh_facts_events_archive table and it seems that it could contain the data I’m looking for, but I’m not sure how to account for data being moved in and out of these tables.

I poked around the API for a bit to see if perhaps I could write a script to accomplish this, but I haven’t got very far just yet.

Perhaps if I had a list of the entries that WERE played in the last year, I can deduce the ones that were not using a list of all entries.

Anyway, if anyone has any hints and can point me in the right direction I would appreciate it!

Hi Dave,

As far as I recall, CE 5 had a major bug there.
It didn’t include the script that synced the plays from the DWH DBs onto the operation ‘kaltura’ DB and as a result, the ‘plays’ and ‘views’ columns in the entry table were not correctly populated.
On your new install, checkout /opt/kaltura/app/alpha/scripts/dwh/dwh_plays_views_sync.sh which is called from /etc/cron.d/kaltura-dwh.
You should be able to take all the files this runs, i.e:
$BASE_DIR/app/alpha/scripts/dwh/trigger.sql
$BASE_DIR/app/alpha/scripts/dwh/updateEntryPlaysViews.php
$BASE_DIR/app/alpha/scripts/dwh/wrap.sql
copy them to the old ENV and run /opt/kaltura/app/alpha/scripts/dwh/dwh_plays_views_sync.sh

Basically, the following stored proc is called in $BASE_DIR/app/alpha/scripts/dwh/trigger.sql:

CALL kalturadw.get_data_for_operational('entry');

The results are then piped to /opt/kaltura/app/alpha/scripts/dwh/updateEntryPlaysViews.php which executes:

"UPDATE entry set views='$views',plays='$plays',last_played_at='$mysqlNow' WHERE id='$entryId'";

Lastly, kalturadw.mark_operational_sync_as_done(‘entry’) is called.

After you’ve done that, entry.views and entry.plays will be populated and you could get them per entry with media/baseentry->list() or by querying the ‘kaltura’ DB directly.

Thanks Jess. Thanks also for remembering I’m dealing with CE 5 stil.

I will try this in my production clone. Is there any risk to running this script in terms of data loss or modification considering the difference in versions?

Hi Dave,

You’re welcome:)
Shouldn’t pose a risk but… being well experienced, one knows that well, you never know:) Run it on the close first.

Hey Jess,

Yea, I definitely understand what you’re saying. I’m reviewing the scripts right now. :slight_smile:

A few potential discrepancies I’m seeing

  1. There is no stored proc in the kalturadw DB called get_data_for_operational. I would need to find this and move it over as well, no?
  2. the entry table in the Kaltura DB does not have a column called last_played_at. I take it I would need to create one for this to function.

I’m wondering if there are other dependencies to this process that I have not uncovered quite yet… What’s your take on these items?

Hey Dave,

The stored proc is defined here: /opt/kaltura/dwh/ddl/dw/functions/get_data_for_operational.sql
I don’t think the DWH schemas changed dramatically since 5 so I hope you’ll be able to use it as is.
As for last_played_at, must have been added to the entry table schema later on, you can just drop it from the update statement in /opt/kaltura/app/alpha/scripts/dwh/updateEntryPlaysViews.php

I’ve got a pretty good memory but CE 5 was over 5 years ago, plus, I’m hardly the only developer who worked on it so I can’t say for sure. You’ll have to find the differences as you go along, I’m afraid:)

Thanks @jess

So it seems there is a table missing in the kalturadw schema that is needed for this process to work.

dwh_entry_plays_views doesn’t exist in the CE 5 schema.

The get_data_for_operational proc uses this table to populate the operational database.

Since I don’t have this table populated, I need to populate it. Or find another way to accomplish this.

Can you point me to what scripts might populate this table? I’ll start poking around as well.

Hi Dave,

Use this to create the table

USE kalturadw;


CREATE TABLE dwh_entry_plays_views
(entry_id VARCHAR(255), 
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
plays INT(11), 
views INT(11), 
PRIMARY KEY (entry_id),
KEY (updated_at));

As for the stored proc, this is what I last used for CE 5 [again about 5 years ago but I believe it should work for you]:

DELIMITER $$

USE `kalturadw`$$

DROP PROCEDURE IF EXISTS `get_data_for_operational`$$

CREATE DEFINER=`etl`@`localhost` PROCEDURE `get_data_for_operational`(p_sync_type VARCHAR(55))
BEGIN
        DECLARE v_execution_start_time DATETIME;

        DECLARE v_group_column VARCHAR(1024);
        DECLARE v_entity_table VARCHAR(1024);
        DECLARE v_aggregation_phrase VARCHAR(1024);
        DECLARE v_aggregation_table VARCHAR(1024);
        DECLARE v_bridge_entity VARCHAR(1024);
        DECLARE v_bridge_table VARCHAR(1024);
        DECLARE v_last_execution_parameter_id INT;
        DECLARE v_execution_start_time_parameter_id INT;

        SET v_execution_start_time = NOW();

        SELECT group_column, entity_table, aggregation_phrase, aggregation_table, 
                bridge_entity, bridge_table, last_execution_parameter_id, execution_start_time_parameter_id
        INTO    v_group_column, v_entity_table, v_aggregation_phrase, v_aggregation_table, 
                v_bridge_entity, v_bridge_table, v_last_execution_parameter_id, v_execution_start_time_parameter_id
        FROM kalturadw_ds.operational_syncs WHERE operational_sync_name = p_sync_type;

        UPDATE kalturadw_ds.parameters  SET date_value = v_execution_start_time WHERE id = v_execution_start_time_parameter_id;

        SET @s = CONCAT('SELECT dim.', v_group_column,', ', v_aggregation_phrase, 
                        ' FROM ', v_aggregation_table ,' aggr, ', IF (v_bridge_table IS NULL, '', CONCAT(v_bridge_table, ' bridge, ')), v_entity_table, ' dim, kalturadw_ds.parameters p',
                        ' WHERE aggr.', IF(v_bridge_entity IS NULL, v_group_column, 
                                                CONCAT(v_bridge_entity, ' = bridge.',v_bridge_entity, ' AND bridge.', v_group_column)), 
                        ' = dim.', v_group_column, ' AND dim.operational_measures_updated_at > p.date_value AND p.id = ', v_last_execution_parameter_id,
                        ' GROUP BY dim.',v_group_column);

        PREPARE stmt FROM  @s;
        EXECUTE stmt;
        DEALLOCATE PREPARE stmt;
END$$

DELIMITER ;

This version of updateEntryPlaysViews.php should work with CE 5:

<?php
define ('ROOT_DIR','/monsoon/opt/kaltura/app');  
require_once(ROOT_DIR.'/api_v3/bootstrap.php');
require_once(ROOT_DIR . '/alpha/config/kConfLocal.php');
require_once(ROOT_DIR . '/infra/bootstrap_base.php');
require_once(ROOT_DIR . '/infra/KAutoloader.php');

KAutoloader::addClassPath(KAutoloader::buildPath(ROOT_DIR, "vendor", "propel", "*"));
KAutoloader::addClassPath(KAutoloader::buildPath(ROOT_DIR, "plugins", "*"));
KAutoloader::addClassPath(KAutoloader::buildPath(ROOT_DIR, "infra", "*"));
KAutoloader::setClassMapFilePath(kConf::get("cache_root_path") . '/deploy/classMap.cache');
KAutoloader::register();
 
require_once(ROOT_DIR.DIRECTORY_SEPARATOR.DIRECTORY_SEPARATOR."infra".DIRECTORY_SEPARATOR."bootstrap_base.php");

require_once (ROOT_DIR.DIRECTORY_SEPARATOR.'alpha'.DIRECTORY_SEPARATOR.'config'.DIRECTORY_SEPARATOR.'kConf.php');

// Autoloader
require_once(ROOT_DIR. DIRECTORY_SEPARATOR."infra".DIRECTORY_SEPARATOR."KAutoloader.php");

$f = fopen("php://stdin", "r");
$count = 0;
$sphinxMgr = new kSphinxSearchManager();
$dbConf = kConf::getDB();
DbManager::setConfig($dbConf);
DbManager::initialize();
$connection = Propel::getConnection();
$partnerId=100;
while($s = trim(fgets($f))){
        $sep = strpos($s, "\t") ? "\t" : " ";
        list($entryId, $plays, $views) = explode($sep, $s);
        myPartnerUtils::resetAllFilters();
        entryPeer::setDefaultCriteriaFilter();
        $entry = entryPeer::retrieveByPK ( $entryId);
        if (is_null ( $entry )) {
                KalturaLog::err ('Couldn\'t find entry [' . $entryId . ']' );
                continue;
        }
        if ($entry->getViews() != $views || $entry->getPlays() != $plays){
                $entry->setViews ( $views );
                $entry->setPlays ( $plays );
                KalturaLog::debug ( 'Successfully saved entry [' . $entryId . ']' );


                try {
                        // update entry without setting the updated at
                        $updateSql = "UPDATE entry set views='$views',plays='$plays' WHERE id='$entryId'";
                        $stmt = $connection->prepare($updateSql);
                        $stmt->execute();
                        $affectedRows = $stmt->rowCount();
                        KalturaLog::log("AffectedRows: ". $affectedRows);
                        // update sphinx log directly
                        $sql = $sphinxMgr->getSphinxSaveSql($entry, false);
                        $sphinxLog = new SphinxLog();
                        $sphinxLog->setEntryId($entryId);
                        $sphinxLog->setPartnerId($partnerId);
                        $sphinxLog->setSql($sql);
                        $sphinxLog->save(myDbHelper::getConnection(myDbHelper::DB_HELPER_CONN_SPHINX_LOG));

                } catch (Exception $e) {
                        KalturaLog::log($e->getMessage(), Propel::LOG_ERR);

                }
        }
        $count++;
        if ($count % 500 === 0){
            entryPeer::clearInstancePool ();
        }
}
?>

Hey @jess

Thanks for the scripts.

I did find the dwh_entry_plays_views creation script last night before turning in.

I noticed that the updateEntryPlaysViews.php script you provided does not include updating the last_played_at column, though. That’s what I really want to figure out here. The date that and entry was last played. :frowning: Was this supported in some iteration of CE 5?

Hi Dave,

last_played_at was not part of the schema in 5, it was added later on.
You should be able to modify your current schema and add it [I don’t think it should cause issues though you won’t know until you try] but if you’ll take a look at the current version of /opt/kaltura/app/alpha/scripts/dwh/updateEntryPlaysViews.php, you’ll see that:

$mysqlNow = date( 'Y-m-d H:i:s', $now );
$updateSql = "UPDATE entry set views='$views',plays='$plays',last_played_at='$mysqlNow' WHERE id='$entryId'";

So, the date is set during the execution of this script, not taken from the DWH DBs. Therefore, if you were to run it now, all entry IDs returned from the get_data_for_operational proc will have last_played_at=NOW().
It’s OK when running this script daily but wheb backlogging it’s pretty useless, I’m afraid.

However, look at kalturadw.dwh_fact_events.
This table stores all play events in the DWH schema.
That has event_time which is the actual timestamp taken from the access log.
To learn more about how DWH works, see https://github.com/kaltura/platform-install-packages/blob/Mercury-13.1.0/doc/kaltura-packages-faq.md#analytics-issues, the general flow did not change since CE 5.

Hey @jess

I was afraid you’d say that (the backlogging part). I’m not really concerned with the CE 5 config going forwards. I am more interested in historical info. Drats!

I think I can come up with a different way to accomplish this though. The KMC analytics interface can give me a content reports for the last year. If a entry does not appear there, I think it’s safe to assume the entry was not played. So, I could manually compare the content report to a total list of entries and find what I’m looking for.

In your opinion, does this jive?

The ultimate goal here is to purge some entries from CE5 before I export\import into Kaltura 12.

I see your edit now. Yea, I had started with the dwh_fact_events and dwh_fact_events_archive table when I started down this rabbit hole. I’ll see if I can go that route and get an easier answer than manually comparing lists :slight_smile:

Hey Dave,

I was never one for manual labour myself:)
The “Analytics” view in KMC utilises the ‘report’ API to retrieve the DWH info, you can do the same.
There is a walk through of that API which includes code samples here:
https://developer.kaltura.com/workflows/Review_Media_Analytics/Analytics_Reports

Should be a decent enough starting point.

Hey @jess

I’ve looked at the link you provided one of your previous replies about the DWH flow in general but it doesn’t seem to tell me the what relationship between the dwh_fact_events table and the dwh_fact_events_archive table. The name would suggest that the _archive table contains older records from the other table, but I’m not sure. I could poke around in the dwh scripts and stored procs but I thought I would ask.

I want to be sure I’m using the data from these tables in a way that makes sense. I could just query both of these tables to fit my criteria, then combine the results…

Thanks

Hi @david.hahn1,

As far as I can tell, dwh_fact_events_archive is only used in /opt/kaltura/dwh/ddl/dw/aggr/calc_aggr_day_procedure.sql and during migrations between versions but I’ll confirm that with our DWH developer and let you know. My involvement in that particular area [DWH] is mostly limited to configuration, API integration and basic troubleshooting:)

Hey @jess

Just curious if you ever heard back from the DWH developer concerning these tables. Thanks for the follow up!

Dave

Hi Dave,

In order to get the list of entries that were played in the past year, best use this query:

mysql> select distinct(entry_id) from kalturadw.dwh_hourly_events_entry where partner_id = <> and date_id between <from_date_YYYYMMDD> and <to_date_YYYYMMDD> and ifnull(count_plays,0) > 0

As for the archive tables, there is a procedure that runs once a day and copies data from the fact to fact_archive table and deletes data from the archive tables according to the rules defined in kalturadw_ds.retention_policy.
See /opt/kaltura/dwh/ddl/dw/maintenance/move_innodb_to_archive.sql

Generally speaking, the archive tables are there to retain old data for investigation purposes and are not needed for day to day operations.

Thanks @jess. I appreciate the follow up.